- Vision Encoder Service (OpenCLIP ViT-L/14, GPU-accelerated)
- FastAPI app with text/image embedding endpoints (768-dim)
- Docker support with NVIDIA GPU runtime
- Port 8001, health checks, model info API
- Qdrant Vector Database integration
- Port 6333/6334 (HTTP/gRPC)
- Image embeddings storage (768-dim, Cosine distance)
- Auto collection creation
- Vision RAG implementation
- VisionEncoderClient (Python client for API)
- Image Search module (text-to-image, image-to-image)
- Vision RAG routing in DAGI Router (mode: image_search)
- VisionEncoderProvider integration
- Documentation (5000+ lines)
- SYSTEM-INVENTORY.md - Complete system inventory
- VISION-ENCODER-STATUS.md - Service status
- VISION-RAG-IMPLEMENTATION.md - Implementation details
- vision_encoder_deployment_task.md - Deployment checklist
- services/vision-encoder/README.md - Deployment guide
- Updated WARP.md, INFRASTRUCTURE.md, Jupyter Notebook
- Testing
- test-vision-encoder.sh - Smoke tests (6 tests)
- Unit tests for client, image search, routing
- Services: 17 total (added Vision Encoder + Qdrant)
- AI Models: 3 (qwen3:8b, OpenCLIP ViT-L/14, BAAI/bge-m3)
- GPU Services: 2 (Vision Encoder, Ollama)
- VRAM Usage: ~10 GB (concurrent)
Status: Production Ready ✅
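As a hypothetical end-to-end sketch of the pieces above: embed an image through the Vision Encoder service, then store the vector in Qdrant. The endpoint path and response shape are assumptions; the ports, 768-dim vectors, and Cosine distance come from the inventory above.

```python
import requests
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

# Hypothetical endpoint path and payload shape; the real API may differ.
resp = requests.post(
    "http://localhost:8001/embed/image",  # Vision Encoder, port 8001
    files={"file": open("photo.jpg", "rb")},
    timeout=30,
)
vector = resp.json()["embedding"]  # 768-dim (OpenCLIP ViT-L/14)

client = QdrantClient(host="localhost", port=6333)  # Qdrant HTTP port
client.upsert(
    collection_name="images",  # collection auto-created by the integration
    points=[PointStruct(id=1, vector=vector, payload={"path": "photo.jpg"})],
)
```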
🌐 Crawl4AI Service — Status
Version: 1.0.0 (MVP)
Last Updated: 2025-01-17
Status: ✅ Implemented (MVP Ready)
🎯 Overview
Crawl4AI Service is a web crawler for automatically downloading and processing web content (HTML, PDF, images) through the PARSER Service. It is integrated with the OCR pipeline so documents referenced by URL can be processed automatically.
Documentation:
- docs/cursor/crawl4ai_web_crawler_task.md — Implementation task
- docs/cursor/CRAWL4AI_SERVICE_REPORT.md — Detailed report
✅ Implementation Complete
Completion date: 2025-01-17
Core Module
Location: services/parser-service/app/crawler/crawl4ai_service.py
Lines of Code: 204
Functions:
- ✅ `crawl_url()` — crawls web pages (markdown/text/HTML)
  - Async/sync support
  - Playwright integration (optional)
  - Timeout handling
  - Error handling with fallback
- ✅ `download_document()` — downloads PDFs and images
  - HTTP download with streaming
  - Content-Type validation
  - Size limits
- ✅ Async context manager — automatic cleanup (usage sketch below)
- ✅ Lazy initialization — the crawler is initialized only when first used
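A minimal usage sketch of the module, assuming the async context manager and the method names listed above (the class name `Crawl4AIService` matches the unit-test examples later in this document); exact signatures may differ:

```python
import asyncio
from app.crawler.crawl4ai_service import Crawl4AIService

async def main():
    # Lazy initialization: the underlying crawler is created on first use;
    # the async context manager handles cleanup automatically on exit.
    async with Crawl4AIService() as service:
        page = await service.crawl_url("https://example.com")
        pdf_bytes = await service.download_document("https://example.com/doc.pdf")
        print(page, len(pdf_bytes))

asyncio.run(main())
```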
Integration with PARSER Service
Location: services/parser-service/app/api/endpoints.py (lines 117-223)
Implemented:
- ✅ Replaced the TODO with a full `doc_url` implementation
- ✅ Automatic type detection (PDF/Image/HTML), sketched below
- ✅ Integration with the existing OCR pipeline
- ✅ Flow:
  - PDF/Images: Download → OCR
  - HTML: Crawl → Markdown → Text → Image → OCR
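A hypothetical sketch of the type-detection step; the real logic in endpoints.py may differ:

```python
def detect_doc_type(url: str, content_type: str = "") -> str:
    # PDF and images go straight to download + OCR; everything else is crawled.
    if url.lower().endswith(".pdf") or "application/pdf" in content_type:
        return "pdf"
    if content_type.startswith("image/"):
        return "image"
    return "html"
```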
Endpoints:
- POST /ocr/parse — with `doc_url` parameter
- POST /ocr/parse_markdown — with `doc_url` parameter
- POST /ocr/parse_qa — with `doc_url` parameter
- POST /ocr/parse_chunks — with `doc_url` parameter
Configuration
Location: services/parser-service/app/core/config.py
Parameters:
CRAWL4AI_ENABLED = True # Enable/disable crawler
CRAWL4AI_USE_PLAYWRIGHT = False # Use Playwright for JS rendering
CRAWL4AI_TIMEOUT = 30 # Request timeout (seconds)
CRAWL4AI_MAX_PAGES = 1 # Max pages to crawl
Environment Variables:
CRAWL4AI_ENABLED=true
CRAWL4AI_USE_PLAYWRIGHT=false
CRAWL4AI_TIMEOUT=30
CRAWL4AI_MAX_PAGES=1
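A hypothetical sketch of how these parameters could be declared with pydantic-settings, a common pattern in FastAPI services; the actual config.py may be structured differently:

```python
from pydantic_settings import BaseSettings

class CrawlerSettings(BaseSettings):
    # Defaults below; each field is overridden by the matching environment variable.
    CRAWL4AI_ENABLED: bool = True          # Enable/disable crawler
    CRAWL4AI_USE_PLAYWRIGHT: bool = False  # Use Playwright for JS rendering
    CRAWL4AI_TIMEOUT: int = 30             # Request timeout (seconds)
    CRAWL4AI_MAX_PAGES: int = 1            # Max pages to crawl

settings = CrawlerSettings()
```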
Dependencies
File: services/parser-service/requirements.txt
crawl4ai>=0.3.0 # Web crawler with async support
Optional (for Playwright):
# If CRAWL4AI_USE_PLAYWRIGHT=true
pip install playwright
playwright install chromium
Integration with Router
Location: providers/ocr_provider.py
Updated:
- ✅ Pass `doc_url` as form data to the PARSER Service
- ✅ Support for the `doc_url` parameter in RouterRequest
Usage Example:
# Via Router
response = await router_client.route_request(
    mode="doc_parse",
    dao_id="test-dao",
    payload={
        "doc_url": "https://example.com/document.pdf",
        "output_mode": "qa_pairs"
    }
)
🌐 Supported Formats
1. PDF Documents
- ✅ Download via HTTP/HTTPS
- ✅ Pass to OCR pipeline
- ✅ Convert to images → Parse
2. Images
- ✅ Formats: PNG, JPEG, GIF, TIFF, BMP
- ✅ Download and validate
- ✅ Pass to OCR pipeline
3. HTML Pages
- ✅ Crawl and extract content
- ✅ Convert to Markdown
- ✅ Basic text → image conversion
- ⚠️ Limitation: Simple text rendering (max 5000 chars, 60 lines)
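A minimal sketch of what this text-rendering fallback amounts to, assuming PIL (named in the limitations section below); the real fonts and geometry may differ:

```python
from PIL import Image, ImageDraw

def text_to_image(text: str, max_chars: int = 5000, max_lines: int = 60) -> Image.Image:
    # Truncate to the documented limits, then draw plain lines of text.
    lines = text[:max_chars].splitlines()[:max_lines]
    img = Image.new("RGB", (1200, 20 * max_lines + 20), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((10, 10 + i * 20), line, fill="black")  # default fixed-width bitmap font
    return img
```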
4. JavaScript-Rendered Pages (Optional)
- ✅ Playwright integration available
- ⚠️ Disabled by default (performance)
- 🔧 Enable:
CRAWL4AI_USE_PLAYWRIGHT=true
🔄 Data Flow
User Request
│
▼
┌────────────┐
│ Gateway │
└─────┬──────┘
│
▼
┌────────────┐
│ Router │
└─────┬──────┘
│ doc_url
▼
┌────────────┐
│ PARSER │
│ Service │
└─────┬──────┘
│
▼
┌──────────────┐
│ Crawl4AI Svc │
└─────┬────────┘
│
┌───┴────┐
│ │
▼ ▼
PDF/IMG HTML
│ │
│ ┌───┴───┐
│ │ Crawl │
│ │Extract│
│ └───┬───┘
│ │
└────┬───┘
▼
┌──────────┐
│ OCR │
│ Pipeline │
└─────┬────┘
│
▼
┌──────────┐
│ Parsed │
│ Document │
└──────────┘
📊 Statistics
Code Size:
- Crawler module: 204 lines
- Integration code: 107 lines
- Total: ~311 lines
Configuration:
- Parameters: 4
- Environment variables: 4
Dependencies:
- New: 1 (`crawl4ai`)
- Optional: Playwright (for JS rendering)
Supported Formats: 3 (PDF, Images, HTML)
⚠️ Known Limitations
1. HTML → Image Conversion (Basic)
Current Implementation:
- Simple text rendering with PIL
- Max 5000 characters
- Max 60 lines
- Fixed width font
Limitations:
- ❌ No CSS/styling support
- ❌ No complex layouts
- ❌ No images in HTML
Recommendation:
# Add WeasyPrint for proper HTML rendering
pip install weasyprint
# Renders HTML → PDF → Images with proper layout
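A minimal sketch of the recommended path, using WeasyPrint's HTML(string=...).write_pdf() API; feeding the resulting PDF back into the existing OCR pipeline is assumed, not shown:

```python
from weasyprint import HTML

def render_html_to_pdf(html: str, out_path: str = "page.pdf") -> str:
    # Unlike the PIL text fallback, WeasyPrint applies real CSS layout.
    HTML(string=html).write_pdf(out_path)
    return out_path
```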
2. No Caching
Current State:
- Every request downloads page again
- No deduplication
Recommendation:
# Add a Redis cache (sketch: assumes a configured `redis` client and
# runs inside the async function that wraps crawl_url)
import hashlib

url_hash = hashlib.sha256(url.encode()).hexdigest()
cache_key = f"crawl:{url_hash}"
if cached := redis.get(cache_key):
    return cached
result = await crawl_url(url)
redis.setex(cache_key, 3600, result)  # 1 hour TTL
3. No Rate Limiting
Current State:
- Unlimited requests to target sites
- Risk of IP blocking
Recommendation:
# Add a rate limiter (slowapi)
from fastapi import Request
from slowapi import Limiter
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)

@app.post("/ocr/parse")
@limiter.limit("10/minute")  # Max 10 requests per minute per client IP
async def parse_document(request: Request, ...):  # slowapi needs the Request argument
    ...
4. No Tests
Current State:
- ❌ No unit tests
- ❌ No integration tests
- ❌ No E2E tests
Recommendation:
- Add `tests/test_crawl4ai_service.py`
- Mock HTTP requests
- Test error handling
5. No robots.txt Support
Current State:
- Ignores robots.txt
- Risk of crawling restricted content
Recommendation:
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# robots.txt lives at the site root, not next to the page URL
root = urlparse(url)
rp = RobotFileParser()
rp.set_url(f"{root.scheme}://{root.netloc}/robots.txt")
rp.read()
if not rp.can_fetch("*", url):
    raise ValueError("Crawling not allowed by robots.txt")
🧪 Testing
Manual Testing
Test PDF Download:
curl -X POST http://localhost:9400/ocr/parse \
-H "Content-Type: multipart/form-data" \
-F "doc_url=https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf" \
-F "output_mode=markdown"
Test HTML Crawl:
curl -X POST http://localhost:9400/ocr/parse \
-H "Content-Type: multipart/form-data" \
-F "doc_url=https://example.com" \
-F "output_mode=text"
Test via Router:
curl -X POST http://localhost:9102/route \
-H "Content-Type: application/json" \
-d '{
"mode": "doc_parse",
"dao_id": "test-dao",
"payload": {
"doc_url": "https://example.com/doc.pdf",
"output_mode": "qa_pairs"
}
}'
Unit Tests (To be implemented)
File: tests/test_crawl4ai_service.py
import pytest
from app.crawler.crawl4ai_service import Crawl4AIService

@pytest.mark.asyncio
async def test_crawl_url():
    service = Crawl4AIService()
    result = await service.crawl_url("https://example.com")
    assert result is not None
    assert "text" in result or "markdown" in result

@pytest.mark.asyncio
async def test_download_document():
    service = Crawl4AIService()
    content = await service.download_document("https://example.com/doc.pdf")
    assert content is not None
    assert len(content) > 0
🚀 Deployment
Docker Compose
Already configured in: docker-compose.yml
services:
  parser-service:
    build: ./services/parser-service
    environment:
      - CRAWL4AI_ENABLED=true
      - CRAWL4AI_USE_PLAYWRIGHT=false
      - CRAWL4AI_TIMEOUT=30
      - CRAWL4AI_MAX_PAGES=1
    ports:
      - "9400:9400"
Start Service
# Start PARSER Service with Crawl4AI
docker-compose up -d parser-service
# Check logs
docker-compose logs -f parser-service | grep -i crawl
# Health check
curl http://localhost:9400/health
Enable Playwright (Optional)
# Update docker-compose.yml
environment:
  - CRAWL4AI_USE_PLAYWRIGHT=true

# Install Playwright in the container
docker-compose exec parser-service playwright install chromium

# Restart
docker-compose restart parser-service
📝 Next Steps
Phase 1: Bug Fixes & Testing (Priority 1)
- Add unit tests — Test crawl_url() and download_document()
- Add integration tests — Test full flow with mocked HTTP
- Fix HTML rendering — Implement WeasyPrint for proper HTML → PDF
- Error handling improvements — Better error messages and logging
Phase 2: Performance & Reliability (Priority 2)
- Add caching — Redis cache for crawled content (1 hour TTL)
- Add rate limiting — Per-IP limits (10 req/min)
- Add robots.txt support — Respect crawling rules
- Optimize large pages — Chunking for > 5000 chars
Phase 3: Advanced Features (Priority 3)
- Sitemap support — Crawl multiple pages from sitemap
- Link extraction — Extract and follow links
- Content filtering — Remove ads, navigation, etc.
- Screenshot capture — Full-page screenshots with Playwright
- PDF generation from HTML — Proper HTML → PDF conversion
🔗 Related Documentation
- TODO-PARSER-RAG.md — PARSER Agent roadmap
- INFRASTRUCTURE.md — Server infrastructure
- WARP.md — Developer guide
- docs/cursor/crawl4ai_web_crawler_task.md — Implementation task
- docs/cursor/CRAWL4AI_SERVICE_REPORT.md — Detailed report
- docs/agents/parser.md — PARSER Agent documentation
📊 Service Integration Map
┌─────────────────────────────────────────────┐
│ DAGI Stack Services │
└──────────┬──────────────────────────────────┘
│
┌──────┴──────────┐
│ │
▼ ▼
┌──────────┐ ┌──────────┐
│ Router │────▶│ PARSER │
│ (9102) │ │ Service │
└──────────┘ │ (9400) │
└─────┬────┘
│
┌─────┴─────┐
│ │
▼ ▼
┌──────────┐ ┌──────────┐
│ Crawl4AI │ │ OCR │
│ Service │ │ Pipeline │
└──────────┘ └──────────┘
│ │
└─────┬─────┘
▼
┌──────────────┐
│ RAG │
│ Service │
│ (9500) │
└──────────────┘
Status: ✅ MVP Complete
Next: Testing + HTML rendering improvements
Last Updated: 2025-01-17 by WARP AI
Maintained by: Ivan Tytar & DAARION Team