# 🌐 Crawl4AI Service — Status

**Version:** 1.0.0 (MVP)
**Last updated:** 2025-01-17
**Status:** ✅ Implemented (MVP Ready)

---

## 🎯 Overview

**Crawl4AI Service** is a web crawler for automatically fetching and processing web content (HTML, PDF, images) through the PARSER Service. It is integrated with the OCR pipeline for automatic processing of documents referenced by URL.

**Documentation:**

- [docs/cursor/crawl4ai_web_crawler_task.md](./docs/cursor/crawl4ai_web_crawler_task.md) — Implementation task
- [docs/cursor/CRAWL4AI_SERVICE_REPORT.md](./docs/cursor/CRAWL4AI_SERVICE_REPORT.md) — Detailed report

---

## ✅ Implementation Complete

**Completed:** 2025-01-17

### Core Module

**Location:** `services/parser-service/app/crawler/crawl4ai_service.py`
**Lines of Code:** 204

**Functions:**

- ✅ `crawl_url()` — Crawl web pages (markdown/text/HTML)
  - Async/sync support
  - Playwright integration (optional)
  - Timeout handling
  - Error handling with fallback
- ✅ `download_document()` — Download PDFs and images
  - HTTP download with streaming
  - Content-Type validation
  - Size limits
- ✅ Async context manager — automatic cleanup
- ✅ Lazy initialization — initialize only when used

---

### Integration with PARSER Service

**Location:** `services/parser-service/app/api/endpoints.py` (lines 117-223)

**Implemented:**

- ✅ Replaced TODO with a full `doc_url` implementation
- ✅ Automatic type detection (PDF/Image/HTML)
- ✅ Integration with the existing OCR pipeline
- ✅ Flow:
  - **PDF/Images:** Download → OCR
  - **HTML:** Crawl → Markdown → Text → Image → OCR

**Endpoints:**

- `POST /ocr/parse` — with `doc_url` parameter
- `POST /ocr/parse_markdown` — with `doc_url` parameter
- `POST /ocr/parse_qa` — with `doc_url` parameter
- `POST /ocr/parse_chunks` — with `doc_url` parameter

---
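The `doc_url` endpoints above can also be exercised from Python. The sketch below is a hypothetical client, not service code: the port comes from this document, the helper names are made up, and it assumes the endpoint accepts standard form encoding (the curl examples elsewhere in this document use multipart).

```python
import json
import urllib.parse
import urllib.request

PARSER_URL = "http://localhost:9400"  # port used elsewhere in this document


def build_parse_form(doc_url: str, output_mode: str = "markdown") -> dict:
    """Form fields accepted by the doc_url variant of POST /ocr/parse."""
    return {"doc_url": doc_url, "output_mode": output_mode}


def parse_from_url(doc_url: str, output_mode: str = "markdown") -> dict:
    """Send the request and decode the JSON response (hypothetical helper)."""
    data = urllib.parse.urlencode(build_parse_form(doc_url, output_mode)).encode()
    req = urllib.request.Request(f"{PARSER_URL}/ocr/parse", data=data)
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())
```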
### Configuration

**Location:** `services/parser-service/app/core/config.py`

**Parameters:**

```python
CRAWL4AI_ENABLED = True          # Enable/disable crawler
CRAWL4AI_USE_PLAYWRIGHT = False  # Use Playwright for JS rendering
CRAWL4AI_TIMEOUT = 30            # Request timeout (seconds)
CRAWL4AI_MAX_PAGES = 1           # Max pages to crawl
```

**Environment Variables:**

```bash
CRAWL4AI_ENABLED=true
CRAWL4AI_USE_PLAYWRIGHT=false
CRAWL4AI_TIMEOUT=30
CRAWL4AI_MAX_PAGES=1
```

---

### Dependencies

**File:** `services/parser-service/requirements.txt`

```
crawl4ai>=0.3.0  # Web crawler with async support
```

**Optional (for Playwright):**

```bash
# If CRAWL4AI_USE_PLAYWRIGHT=true
playwright install chromium
```

---

### Integration with Router

**Location:** `providers/ocr_provider.py`

**Updated:**

- ✅ Pass `doc_url` as form data to the PARSER Service
- ✅ Support for the `doc_url` parameter in `RouterRequest`

**Usage Example:**

```python
# Via Router
response = await router_client.route_request(
    mode="doc_parse",
    dao_id="test-dao",
    payload={
        "doc_url": "https://example.com/document.pdf",
        "output_mode": "qa_pairs"
    }
)
```

---

## 🌐 Supported Formats

### 1. PDF Documents

- ✅ Download via HTTP/HTTPS
- ✅ Pass to OCR pipeline
- ✅ Convert to images → parse

### 2. Images

- ✅ Formats: PNG, JPEG, GIF, TIFF, BMP
- ✅ Download and validate
- ✅ Pass to OCR pipeline

### 3. HTML Pages

- ✅ Crawl and extract content
- ✅ Convert to Markdown
- ✅ Basic text → image conversion
- ⚠️ Limitation: simple text rendering (max 5000 chars, 60 lines)
### 4. JavaScript-Rendered Pages (Optional)

- ✅ Playwright integration available
- ⚠️ Disabled by default (performance)
- 🔧 Enable: `CRAWL4AI_USE_PLAYWRIGHT=true`

---

## 🔄 Data Flow

```
User Request
     │
     ▼
┌─────────┐
│ Gateway │
└────┬────┘
     │
     ▼
┌─────────┐
│ Router  │
└────┬────┘
     │ doc_url
     ▼
┌─────────┐
│ PARSER  │
│ Service │
└────┬────┘
     │
     ▼
┌──────────────┐
│ Crawl4AI Svc │
└──────┬───────┘
       │
  ┌────┴────┐
  ▼         ▼
PDF/IMG    HTML
  │     ┌───┴───┐
  │     │ Crawl │
  │     │Extract│
  │     └───┬───┘
  └────┬────┘
       ▼
┌──────────┐
│   OCR    │
│ Pipeline │
└────┬─────┘
     │
     ▼
┌──────────┐
│  Parsed  │
│ Document │
└──────────┘
```

---

## 📊 Statistics

**Code Size:**

- Crawler module: 204 lines
- Integration code: 107 lines
- **Total:** ~311 lines

**Configuration:**

- Parameters: 4
- Environment variables: 4

**Dependencies:**

- New: 1 (`crawl4ai`)
- Optional: Playwright (for JS rendering)

**Supported Formats:** 3 (PDF, Images, HTML)

---

## ⚠️ Known Limitations

### 1. HTML → Image Conversion (Basic)

**Current Implementation:**

- Simple text rendering with PIL
- Max 5000 characters
- Max 60 lines
- Fixed-width font

**Limitations:**

- ❌ No CSS/styling support
- ❌ No complex layouts
- ❌ No images in HTML

**Recommendation:**

```bash
# Add WeasyPrint for proper HTML rendering
# (renders HTML → PDF → images with proper layout)
pip install weasyprint
```
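The stated limits (5000 characters, 60 lines) amount to a clamp step applied before drawing text onto an image. A minimal sketch, assuming those cutoffs; the helper name is made up:

```python
MAX_CHARS = 5000  # limits stated above
MAX_LINES = 60


def clamp_text_for_render(text: str) -> str:
    """Truncate extracted page text to what the basic renderer will draw."""
    lines = text[:MAX_CHARS].splitlines()[:MAX_LINES]
    return "\n".join(lines)
```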
### 2. No Caching

**Current State:**

- Every request downloads the page again
- No deduplication

**Recommendation:**

```python
# Add a Redis cache (sketch)
import hashlib

url_hash = hashlib.sha256(url.encode()).hexdigest()
cache_key = f"crawl:{url_hash}"
if cached := redis.get(cache_key):
    return cached
result = await crawl_url(url)
redis.setex(cache_key, 3600, result)  # 1 hour TTL
```

### 3. No Rate Limiting

**Current State:**

- Unlimited requests to target sites
- Risk of IP blocking

**Recommendation:**

```python
# Add a rate limiter
from slowapi import Limiter
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)

@app.post("/ocr/parse")
@limiter.limit("10/minute")  # Max 10 requests per minute
async def parse_document(...):
    ...
```

### 4. No Tests

**Current State:**

- ❌ No unit tests
- ❌ No integration tests
- ❌ No E2E tests

**Recommendation:**

- Add `tests/test_crawl4ai_service.py`
- Mock HTTP requests
- Test error handling

### 5. No robots.txt Support

**Current State:**

- Ignores robots.txt
- Risk of crawling restricted content

**Recommendation:**

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

parts = urlparse(url)
rp = RobotFileParser()
rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")  # robots.txt lives at the site root
rp.read()
if not rp.can_fetch("*", url):
    raise ValueError("Crawling not allowed by robots.txt")
```

---

## 🧪 Testing

### Manual Testing

**Test PDF Download:**

```bash
curl -X POST http://localhost:9400/ocr/parse \
  -H "Content-Type: multipart/form-data" \
  -F "doc_url=https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf" \
  -F "output_mode=markdown"
```

**Test HTML Crawl:**

```bash
curl -X POST http://localhost:9400/ocr/parse \
  -H "Content-Type: multipart/form-data" \
  -F "doc_url=https://example.com" \
  -F "output_mode=text"
```

**Test via Router:**

```bash
curl -X POST http://localhost:9102/route \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "doc_parse",
    "dao_id": "test-dao",
    "payload": {
      "doc_url": "https://example.com/doc.pdf",
      "output_mode": "qa_pairs"
    }
  }'
```

### Unit Tests (To be implemented)

**File:** `tests/test_crawl4ai_service.py`
```python
import pytest

from app.crawler.crawl4ai_service import Crawl4AIService


@pytest.mark.asyncio
async def test_crawl_url():
    service = Crawl4AIService()
    result = await service.crawl_url("https://example.com")
    assert result is not None
    assert "text" in result or "markdown" in result


@pytest.mark.asyncio
async def test_download_document():
    service = Crawl4AIService()
    content = await service.download_document("https://example.com/doc.pdf")
    assert content is not None
    assert len(content) > 0
```

---

## 🚀 Deployment

### Docker Compose

**Already configured in:** `docker-compose.yml`

```yaml
services:
  parser-service:
    build: ./services/parser-service
    environment:
      - CRAWL4AI_ENABLED=true
      - CRAWL4AI_USE_PLAYWRIGHT=false
      - CRAWL4AI_TIMEOUT=30
      - CRAWL4AI_MAX_PAGES=1
    ports:
      - "9400:9400"
```

### Start Service

```bash
# Start PARSER Service with Crawl4AI
docker-compose up -d parser-service

# Check logs
docker-compose logs -f parser-service | grep -i crawl

# Health check
curl http://localhost:9400/health
```

### Enable Playwright (Optional)

```bash
# Update docker-compose.yml:
#   environment:
#     - CRAWL4AI_USE_PLAYWRIGHT=true

# Install Playwright in the container
docker-compose exec parser-service playwright install chromium

# Restart
docker-compose restart parser-service
```

---

## 📝 Next Steps

### Phase 1: Bug Fixes & Testing (Priority 1)

- [ ] **Add unit tests** — Test `crawl_url()` and `download_document()`
- [ ] **Add integration tests** — Test the full flow with mocked HTTP
- [ ] **Fix HTML rendering** — Implement WeasyPrint for proper HTML → PDF
- [ ] **Error handling improvements** — Better error messages and logging

### Phase 2: Performance & Reliability (Priority 2)

- [ ] **Add caching** — Redis cache for crawled content (1 hour TTL)
- [ ] **Add rate limiting** — Per-IP limits (10 req/min)
- [ ] **Add robots.txt support** — Respect crawling rules
- [ ] **Optimize large pages** — Chunking for > 5000 chars
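The "optimize large pages" item above could start from a simple fixed-size chunker. A sketch assuming the 5000-character limit noted earlier; the function name is illustrative:

```python
CHUNK_SIZE = 5000  # matches the current rendering limit


def chunk_text(text: str, size: int = CHUNK_SIZE) -> list[str]:
    """Split long extracted text into fixed-size chunks for rendering/OCR."""
    return [text[i:i + size] for i in range(0, len(text), size)] or [""]
```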
### Phase 3: Advanced Features (Priority 3)

- [ ] **Sitemap support** — Crawl multiple pages from a sitemap
- [ ] **Link extraction** — Extract and follow links
- [ ] **Content filtering** — Remove ads, navigation, etc.
- [ ] **Screenshot capture** — Full-page screenshots with Playwright
- [ ] **PDF generation from HTML** — Proper HTML → PDF conversion

---

## 🔗 Related Documentation

- [TODO-PARSER-RAG.md](./TODO-PARSER-RAG.md) — PARSER Agent roadmap
- [INFRASTRUCTURE.md](./INFRASTRUCTURE.md) — Server infrastructure
- [WARP.md](./WARP.md) — Developer guide
- [docs/cursor/crawl4ai_web_crawler_task.md](./docs/cursor/crawl4ai_web_crawler_task.md) — Implementation task
- [docs/cursor/CRAWL4AI_SERVICE_REPORT.md](./docs/cursor/CRAWL4AI_SERVICE_REPORT.md) — Detailed report
- [docs/agents/parser.md](./docs/agents/parser.md) — PARSER Agent documentation

---

## 📊 Service Integration Map

```
┌───────────────────────────────────────────┐
│            DAGI Stack Services            │
└──────────┬────────────────────────────────┘
           │
    ┌──────┴──────────┐
    ▼                 ▼
┌──────────┐    ┌──────────┐
│  Router  │───▶│  PARSER  │
│  (9102)  │    │  Service │
└──────────┘    │  (9400)  │
                └────┬─────┘
               ┌─────┴─────┐
               ▼           ▼
         ┌──────────┐ ┌──────────┐
         │ Crawl4AI │ │   OCR    │
         │ Service  │ │ Pipeline │
         └────┬─────┘ └────┬─────┘
              └─────┬──────┘
                    ▼
            ┌──────────────┐
            │     RAG      │
            │   Service    │
            │    (9500)    │
            └──────────────┘
```

---

**Status:** ✅ MVP Complete
**Next:** Testing + HTML rendering improvements
**Last Updated:** 2025-01-17 by WARP AI
**Maintained by:** Ivan Tytar & DAARION Team