🌐 Crawl4AI Service — Status

Version: 1.0.0 (MVP)
Last Updated: 2025-01-17
Status: Implemented (MVP Ready)


🎯 Overview

Crawl4AI Service is a web crawler for automatically downloading and processing web content (HTML, PDF, images) through the PARSER Service. It is integrated with the OCR pipeline for automatic processing of documents fetched from URLs.

Documentation:


Implementation Complete

Completion date: 2025-01-17

Core Module

Location: services/parser-service/app/crawler/crawl4ai_service.py
Lines of Code: 204

Functions:

  • crawl_url() — Crawl web pages (markdown/text/HTML)
    • Async/sync support
    • Playwright integration (optional)
    • Timeout handling
    • Error handling with fallback
  • download_document() — Download PDFs and images
    • HTTP download with streaming
    • Content-Type validation
    • Size limits
  • Async context manager — Automatic cleanup
  • Lazy initialization — Initialize only when used
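
The lazy-initialization and async-context-manager behavior described above can be sketched as follows. The actual `Crawl4AIService` internals are not shown in this document, so the class below is a hypothetical illustration of the pattern, not the real implementation:

```python
import asyncio

class LazyCrawler:
    """Hypothetical sketch: create the underlying session only on first
    use, and clean it up automatically via the async context manager."""

    def __init__(self):
        self._session = None  # created lazily, not in __init__

    async def _ensure_session(self):
        if self._session is None:
            self._session = {"started": True}  # stand-in for a real HTTP/browser session

    async def crawl_url(self, url: str) -> dict:
        await self._ensure_session()
        return {"url": url, "markdown": "# stub"}  # stand-in result

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc, tb):
        self._session = None  # automatic cleanup on exit

async def main():
    async with LazyCrawler() as crawler:
        return await crawler.crawl_url("https://example.com")

print(asyncio.run(main()))
```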

Integration with PARSER Service

Location: services/parser-service/app/api/endpoints.py (lines 117-223)

Implemented:

  • Replaced TODO with full doc_url implementation
  • Automatic type detection (PDF/Image/HTML)
  • Integration with existing OCR pipeline
  • Flow:
    • PDF/Images: Download → OCR
    • HTML: Crawl → Markdown → Text → Image → OCR

Endpoints:

  • POST /ocr/parse — With doc_url parameter
  • POST /ocr/parse_markdown — With doc_url parameter
  • POST /ocr/parse_qa — With doc_url parameter
  • POST /ocr/parse_chunks — With doc_url parameter

Configuration

Location: services/parser-service/app/core/config.py

Parameters:

CRAWL4AI_ENABLED = True          # Enable/disable crawler
CRAWL4AI_USE_PLAYWRIGHT = False  # Use Playwright for JS rendering
CRAWL4AI_TIMEOUT = 30            # Request timeout (seconds)
CRAWL4AI_MAX_PAGES = 1           # Max pages to crawl

Environment Variables:

CRAWL4AI_ENABLED=true
CRAWL4AI_USE_PLAYWRIGHT=false
CRAWL4AI_TIMEOUT=30
CRAWL4AI_MAX_PAGES=1
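
A minimal sketch of reading these variables with stdlib `os.getenv` and the defaults above; the actual settings class in config.py may differ (e.g. if it uses pydantic):

```python
import os
from dataclasses import dataclass, field

def _env_bool(name: str, default: str) -> bool:
    return os.getenv(name, default).lower() == "true"

@dataclass
class Crawl4AISettings:
    """Sketch of the four crawler settings, read from the environment."""
    enabled: bool = field(default_factory=lambda: _env_bool("CRAWL4AI_ENABLED", "true"))
    use_playwright: bool = field(default_factory=lambda: _env_bool("CRAWL4AI_USE_PLAYWRIGHT", "false"))
    timeout: int = field(default_factory=lambda: int(os.getenv("CRAWL4AI_TIMEOUT", "30")))
    max_pages: int = field(default_factory=lambda: int(os.getenv("CRAWL4AI_MAX_PAGES", "1")))

settings = Crawl4AISettings()
print(settings)
```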

Dependencies

File: services/parser-service/requirements.txt

crawl4ai>=0.3.0  # Web crawler with async support

Optional (for Playwright):

# If CRAWL4AI_USE_PLAYWRIGHT=true
playwright install chromium

Integration with Router

Location: providers/ocr_provider.py

Updated:

  • Pass doc_url as form data to PARSER Service
  • Support for doc_url parameter in RouterRequest

Usage Example:

# Via Router
response = await router_client.route_request(
    mode="doc_parse",
    dao_id="test-dao",
    payload={
        "doc_url": "https://example.com/document.pdf",
        "output_mode": "qa_pairs"
    }
)

🌐 Supported Formats

1. PDF Documents

  • Download via HTTP/HTTPS
  • Pass to OCR pipeline
  • Convert to images → Parse

2. Images

  • Formats: PNG, JPEG, GIF, TIFF, BMP
  • Download and validate
  • Pass to OCR pipeline

3. HTML Pages

  • Crawl and extract content
  • Convert to Markdown
  • Basic text → image conversion
  • ⚠️ Limitation: Simple text rendering (max 5000 chars, 60 lines)
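
The 5000-character / 60-line cap mentioned above can be sketched as a plain truncation step applied before the text-to-image rendering; the limits come from this document, but the helper name is hypothetical:

```python
def truncate_for_rendering(text: str, max_chars: int = 5000, max_lines: int = 60) -> str:
    """Apply the documented limits before the text-to-image step."""
    text = text[:max_chars]              # hard cap on characters first
    lines = text.splitlines()[:max_lines]  # then cap the line count
    return "\n".join(lines)

sample = "\n".join(f"line {i}" for i in range(100))
out = truncate_for_rendering(sample)
print(len(out.splitlines()))  # 60
```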

4. JavaScript-Rendered Pages (Optional)

  • Playwright integration available
  • ⚠️ Disabled by default (performance)
  • 🔧 Enable: CRAWL4AI_USE_PLAYWRIGHT=true

🔄 Data Flow

User Request
    │
    ▼
┌────────────┐
│  Gateway   │
└─────┬──────┘
      │
      ▼
┌────────────┐
│   Router   │
└─────┬──────┘
      │ doc_url
      ▼
┌────────────┐
│   PARSER   │
│  Service   │
└─────┬──────┘
      │
      ▼
┌──────────────┐
│ Crawl4AI Svc │
└─────┬────────┘
      │
  ┌───┴────┐
  │        │
  ▼        ▼
PDF/IMG  HTML
  │        │
  │    ┌───┴───┐
  │    │ Crawl │
  │    │Extract│
  │    └───┬───┘
  │        │
  └────┬───┘
       ▼
  ┌──────────┐
  │   OCR    │
  │ Pipeline │
  └─────┬────┘
        │
        ▼
  ┌──────────┐
  │  Parsed  │
  │ Document │
  └──────────┘

📊 Statistics

Code Size:

  • Crawler module: 204 lines
  • Integration code: 107 lines
  • Total: ~311 lines

Configuration:

  • Parameters: 4
  • Environment variables: 4

Dependencies:

  • New: 1 (crawl4ai)
  • Optional: Playwright (for JS rendering)

Supported Formats: 3 (PDF, Images, HTML)


⚠️ Known Limitations

1. HTML → Image Conversion (Basic)

Current Implementation:

  • Simple text rendering with PIL
  • Max 5000 characters
  • Max 60 lines
  • Fixed width font

Limitations:

  • No CSS/styling support
  • No complex layouts
  • No images in HTML

Recommendation:

# Add WeasyPrint for proper HTML rendering
pip install weasyprint
# Renders HTML → PDF → Images with proper layout

2. No Caching

Current State:

  • Every request downloads page again
  • No deduplication

Recommendation:

# Add Redis cache (sketch; sync client shown for brevity)
import hashlib
import redis

r = redis.Redis()
url_hash = hashlib.sha256(url.encode()).hexdigest()
cache_key = f"crawl:{url_hash}"
if cached := r.get(cache_key):
    return cached
result = await crawl_url(url)
r.setex(cache_key, 3600, result)  # 1 hour TTL

3. No Rate Limiting

Current State:

  • Unlimited requests to target sites
  • Risk of IP blocking

Recommendation:

# Add rate limiter
from slowapi import Limiter
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)

@app.post("/ocr/parse")
@limiter.limit("10/minute")  # Max 10 requests per minute
async def parse_document(...):
    ...

4. No Tests

Current State:

  • No unit tests
  • No integration tests
  • No E2E tests

Recommendation:

  • Add tests/test_crawl4ai_service.py
  • Mock HTTP requests
  • Test error handling

5. No robots.txt Support

Current State:

  • Ignores robots.txt
  • Risk of crawling restricted content

Recommendation:

from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

parsed = urlparse(url)
rp = RobotFileParser()
rp.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
rp.read()
if not rp.can_fetch("*", url):
    raise ValueError("Crawling not allowed by robots.txt")

🧪 Testing

Manual Testing

Test PDF Download:

curl -X POST http://localhost:9400/ocr/parse \
  -H "Content-Type: multipart/form-data" \
  -F "doc_url=https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf" \
  -F "output_mode=markdown"

Test HTML Crawl:

curl -X POST http://localhost:9400/ocr/parse \
  -H "Content-Type: multipart/form-data" \
  -F "doc_url=https://example.com" \
  -F "output_mode=text"

Test via Router:

curl -X POST http://localhost:9102/route \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "doc_parse",
    "dao_id": "test-dao",
    "payload": {
      "doc_url": "https://example.com/doc.pdf",
      "output_mode": "qa_pairs"
    }
  }'

Unit Tests (To be implemented)

File: tests/test_crawl4ai_service.py

import pytest
from app.crawler.crawl4ai_service import Crawl4AIService

@pytest.mark.asyncio
async def test_crawl_url():
    service = Crawl4AIService()
    result = await service.crawl_url("https://example.com")
    assert result is not None
    assert "text" in result or "markdown" in result

@pytest.mark.asyncio
async def test_download_document():
    service = Crawl4AIService()
    content = await service.download_document("https://example.com/doc.pdf")
    assert content is not None
    assert len(content) > 0

🚀 Deployment

Docker Compose

Already configured in: docker-compose.yml

services:
  parser-service:
    build: ./services/parser-service
    environment:
      - CRAWL4AI_ENABLED=true
      - CRAWL4AI_USE_PLAYWRIGHT=false
      - CRAWL4AI_TIMEOUT=30
      - CRAWL4AI_MAX_PAGES=1
    ports:
      - "9400:9400"

Start Service

# Start PARSER Service with Crawl4AI
docker-compose up -d parser-service

# Check logs
docker-compose logs -f parser-service | grep -i crawl

# Health check
curl http://localhost:9400/health

Enable Playwright (Optional)

# Update docker-compose.yml
environment:
  - CRAWL4AI_USE_PLAYWRIGHT=true

# Install Playwright in container
docker-compose exec parser-service playwright install chromium

# Restart
docker-compose restart parser-service

📝 Next Steps

Phase 1: Bug Fixes & Testing (Priority 1)

  • Add unit tests — Test crawl_url() and download_document()
  • Add integration tests — Test full flow with mocked HTTP
  • Fix HTML rendering — Implement WeasyPrint for proper HTML → PDF
  • Error handling improvements — Better error messages and logging

Phase 2: Performance & Reliability (Priority 2)

  • Add caching — Redis cache for crawled content (1 hour TTL)
  • Add rate limiting — Per-IP limits (10 req/min)
  • Add robots.txt support — Respect crawling rules
  • Optimize large pages — Chunking for > 5000 chars
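
The "optimize large pages" item above could start from a simple character-window chunker like this sketch; the chunk size and overlap values are illustrative assumptions:

```python
def chunk_text(text: str, size: int = 5000, overlap: int = 200) -> list[str]:
    """Split long page text into overlapping chunks for downstream OCR/RAG."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap  # step forward, keeping `overlap` chars of context
    return chunks

print([len(c) for c in chunk_text("a" * 12000)])  # [5000, 5000, 2400]
```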

Phase 3: Advanced Features (Priority 3)

  • Sitemap support — Crawl multiple pages from sitemap
  • Link extraction — Extract and follow links
  • Content filtering — Remove ads, navigation, etc.
  • Screenshot capture — Full-page screenshots with Playwright
  • PDF generation from HTML — Proper HTML → PDF conversion


📊 Service Integration Map

┌─────────────────────────────────────────────┐
│         DAGI Stack Services                 │
└──────────┬──────────────────────────────────┘
           │
    ┌──────┴──────────┐
    │                 │
    ▼                 ▼
┌──────────┐     ┌──────────┐
│  Router  │────▶│ PARSER   │
│  (9102)  │     │ Service  │
└──────────┘     │ (9400)   │
                 └─────┬────┘
                       │
                 ┌─────┴─────┐
                 │           │
                 ▼           ▼
          ┌──────────┐ ┌──────────┐
          │ Crawl4AI │ │   OCR    │
          │  Service │ │ Pipeline │
          └──────────┘ └──────────┘
                 │           │
                 └─────┬─────┘
                       ▼
                ┌──────────────┐
                │    RAG       │
                │   Service    │
                │   (9500)     │
                └──────────────┘

Status: MVP Complete
Next: Testing + HTML rendering improvements
Last Updated: 2025-01-17 by WARP AI
Maintained by: Ivan Tytar & DAARION Team