🌐 Crawl4AI Service — Status

Version: 1.0.0 (MVP)
Last Updated: 2025-01-17
Status: Implemented (MVP Ready)


🎯 Overview

Crawl4AI Service is a web crawler for automatically downloading and processing web content (HTML, PDF, images) through the PARSER Service. It is integrated with the OCR pipeline for automatic processing of documents fetched from URLs.

Documentation:


Implementation Complete

Completion date: 2025-01-17

Core Module

Location: services/parser-service/app/crawler/crawl4ai_service.py
Lines of Code: 204

Functions:

  • crawl_url() — Crawl web pages (markdown/text/HTML)
    • Async/sync support
    • Playwright integration (optional)
    • Timeout handling
    • Error handling with fallback
  • download_document() — Download PDFs and images
    • HTTP download with streaming
    • Content-Type validation
    • Size limits
  • Async context manager — Automatic cleanup
  • Lazy initialization — Initialize only when used
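
The lazy-initialization and async-context-manager behavior described above can be sketched as follows. The actual `Crawl4AIService` internals are not shown in this document, so the class below is a hypothetical illustration of the pattern, not the real implementation:

```python
import asyncio

class LazyCrawler:
    """Hypothetical sketch: create the underlying session only on first
    use, and clean it up automatically via the async context manager."""

    def __init__(self):
        self._session = None  # created lazily, not in __init__

    async def _ensure_session(self):
        if self._session is None:
            self._session = {"started": True}  # stand-in for a real HTTP/browser session

    async def crawl_url(self, url: str) -> dict:
        await self._ensure_session()
        return {"url": url, "markdown": "# stub"}  # stand-in result

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc, tb):
        self._session = None  # automatic cleanup on exit

async def main():
    async with LazyCrawler() as crawler:
        return await crawler.crawl_url("https://example.com")

print(asyncio.run(main()))
```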

Integration with PARSER Service

Location: services/parser-service/app/api/endpoints.py (lines 117-223)

Implemented:

  • Replaced TODO with full doc_url implementation
  • Automatic type detection (PDF/Image/HTML)
  • Integration with existing OCR pipeline
  • Flow:
    • PDF/Images: Download → OCR
    • HTML: Crawl → Markdown → Text → Image → OCR

Endpoints:

  • POST /ocr/parse — With doc_url parameter
  • POST /ocr/parse_markdown — With doc_url parameter
  • POST /ocr/parse_qa — With doc_url parameter
  • POST /ocr/parse_chunks — With doc_url parameter

Configuration

Location: services/parser-service/app/core/config.py

Parameters:

CRAWL4AI_ENABLED = True          # Enable/disable crawler
CRAWL4AI_USE_PLAYWRIGHT = False  # Use Playwright for JS rendering
CRAWL4AI_TIMEOUT = 30            # Request timeout (seconds)
CRAWL4AI_MAX_PAGES = 1           # Max pages to crawl

Environment Variables:

CRAWL4AI_ENABLED=true
CRAWL4AI_USE_PLAYWRIGHT=false
CRAWL4AI_TIMEOUT=30
CRAWL4AI_MAX_PAGES=1
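
A minimal sketch of reading these variables with stdlib `os.getenv` and the defaults above; the actual settings class in config.py may differ (e.g. if it uses pydantic):

```python
import os
from dataclasses import dataclass, field

def _env_bool(name: str, default: str) -> bool:
    return os.getenv(name, default).lower() == "true"

@dataclass
class Crawl4AISettings:
    """Sketch of the four crawler settings, read from the environment."""
    enabled: bool = field(default_factory=lambda: _env_bool("CRAWL4AI_ENABLED", "true"))
    use_playwright: bool = field(default_factory=lambda: _env_bool("CRAWL4AI_USE_PLAYWRIGHT", "false"))
    timeout: int = field(default_factory=lambda: int(os.getenv("CRAWL4AI_TIMEOUT", "30")))
    max_pages: int = field(default_factory=lambda: int(os.getenv("CRAWL4AI_MAX_PAGES", "1")))

settings = Crawl4AISettings()
print(settings)
```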

Dependencies

File: services/parser-service/requirements.txt

crawl4ai>=0.3.0  # Web crawler with async support

Optional (for Playwright):

# If CRAWL4AI_USE_PLAYWRIGHT=true
playwright install chromium

Integration with Router

Location: providers/ocr_provider.py

Updated:

  • Pass doc_url as form data to PARSER Service
  • Support for doc_url parameter in RouterRequest

Usage Example:

# Via Router
response = await router_client.route_request(
    mode="doc_parse",
    dao_id="test-dao",
    payload={
        "doc_url": "https://example.com/document.pdf",
        "output_mode": "qa_pairs"
    }
)

🌐 Supported Formats

1. PDF Documents

  • Download via HTTP/HTTPS
  • Pass to OCR pipeline
  • Convert to images → Parse

2. Images

  • Formats: PNG, JPEG, GIF, TIFF, BMP
  • Download and validate
  • Pass to OCR pipeline

3. HTML Pages

  • Crawl and extract content
  • Convert to Markdown
  • Basic text → image conversion
  • ⚠️ Limitation: Simple text rendering (max 5000 chars, 60 lines)
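
The 5000-character / 60-line cap mentioned above can be sketched as a plain truncation step applied before the text-to-image rendering; the limits come from this document, but the helper name is hypothetical:

```python
def truncate_for_rendering(text: str, max_chars: int = 5000, max_lines: int = 60) -> str:
    """Apply the documented limits before the text-to-image step."""
    text = text[:max_chars]              # hard cap on characters first
    lines = text.splitlines()[:max_lines]  # then cap the line count
    return "\n".join(lines)

sample = "\n".join(f"line {i}" for i in range(100))
out = truncate_for_rendering(sample)
print(len(out.splitlines()))  # 60
```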

4. JavaScript-Rendered Pages (Optional)

  • Playwright integration available
  • ⚠️ Disabled by default (performance)
  • 🔧 Enable: CRAWL4AI_USE_PLAYWRIGHT=true

🔄 Data Flow

User Request
    │
    ▼
┌────────────┐
│  Gateway   │
└─────┬──────┘
      │
      ▼
┌────────────┐
│   Router   │
└─────┬──────┘
      │ doc_url
      ▼
┌────────────┐
│   PARSER   │
│  Service   │
└─────┬──────┘
      │
      ▼
┌──────────────┐
│ Crawl4AI Svc │
└─────┬────────┘
      │
  ┌───┴────┐
  │        │
  ▼        ▼
PDF/IMG  HTML
  │        │
  │    ┌───┴───┐
  │    │ Crawl │
  │    │Extract│
  │    └───┬───┘
  │        │
  └────┬───┘
       ▼
  ┌──────────┐
  │   OCR    │
  │ Pipeline │
  └─────┬────┘
        │
        ▼
  ┌──────────┐
  │  Parsed  │
  │ Document │
  └──────────┘

📊 Statistics

Code Size:

  • Crawler module: 204 lines
  • Integration code: 107 lines
  • Total: ~311 lines

Configuration:

  • Parameters: 4
  • Environment variables: 4

Dependencies:

  • New: 1 (crawl4ai)
  • Optional: Playwright (for JS rendering)

Supported Formats: 3 (PDF, Images, HTML)


⚠️ Known Limitations

1. HTML → Image Conversion (Basic)

Current Implementation:

  • Simple text rendering with PIL
  • Max 5000 characters
  • Max 60 lines
  • Fixed width font

Limitations:

  • No CSS/styling support
  • No complex layouts
  • No images in HTML

Recommendation:

# Add WeasyPrint for proper HTML rendering
pip install weasyprint
# Renders HTML → PDF → Images with proper layout

2. No Caching

Current State:

  • Every request downloads page again
  • No deduplication

Recommendation:

# Add Redis cache (sketch; sync client shown for brevity)
import hashlib
import redis

r = redis.Redis()
url_hash = hashlib.sha256(url.encode()).hexdigest()
cache_key = f"crawl:{url_hash}"
if cached := r.get(cache_key):
    return cached
result = await crawl_url(url)
r.setex(cache_key, 3600, result)  # 1 hour TTL

3. No Rate Limiting

Current State:

  • Unlimited requests to target sites
  • Risk of IP blocking

Recommendation:

# Add rate limiter
from slowapi import Limiter
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)

@app.post("/ocr/parse")
@limiter.limit("10/minute")  # Max 10 requests per minute
async def parse_document(...):
    ...

4. No Tests

Current State:

  • No unit tests
  • No integration tests
  • No E2E tests

Recommendation:

  • Add tests/test_crawl4ai_service.py
  • Mock HTTP requests
  • Test error handling

5. No robots.txt Support

Current State:

  • Ignores robots.txt
  • Risk of crawling restricted content

Recommendation:

from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

parsed = urlparse(url)
rp = RobotFileParser()
rp.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
rp.read()
if not rp.can_fetch("*", url):
    raise ValueError("Crawling not allowed by robots.txt")

🧪 Testing

Manual Testing

Test PDF Download:

curl -X POST http://localhost:9400/ocr/parse \
  -H "Content-Type: multipart/form-data" \
  -F "doc_url=https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf" \
  -F "output_mode=markdown"

Test HTML Crawl:

curl -X POST http://localhost:9400/ocr/parse \
  -H "Content-Type: multipart/form-data" \
  -F "doc_url=https://example.com" \
  -F "output_mode=text"

Test via Router:

curl -X POST http://localhost:9102/route \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "doc_parse",
    "dao_id": "test-dao",
    "payload": {
      "doc_url": "https://example.com/doc.pdf",
      "output_mode": "qa_pairs"
    }
  }'

Unit Tests (To be implemented)

File: tests/test_crawl4ai_service.py

import pytest
from app.crawler.crawl4ai_service import Crawl4AIService

@pytest.mark.asyncio
async def test_crawl_url():
    service = Crawl4AIService()
    result = await service.crawl_url("https://example.com")
    assert result is not None
    assert "text" in result or "markdown" in result

@pytest.mark.asyncio
async def test_download_document():
    service = Crawl4AIService()
    content = await service.download_document("https://example.com/doc.pdf")
    assert content is not None
    assert len(content) > 0

🚀 Deployment

Docker Compose

Already configured in: docker-compose.yml

services:
  parser-service:
    build: ./services/parser-service
    environment:
      - CRAWL4AI_ENABLED=true
      - CRAWL4AI_USE_PLAYWRIGHT=false
      - CRAWL4AI_TIMEOUT=30
      - CRAWL4AI_MAX_PAGES=1
    ports:
      - "9400:9400"

Start Service

# Start PARSER Service with Crawl4AI
docker-compose up -d parser-service

# Check logs
docker-compose logs -f parser-service | grep -i crawl

# Health check
curl http://localhost:9400/health

Enable Playwright (Optional)

# Update docker-compose.yml
environment:
  - CRAWL4AI_USE_PLAYWRIGHT=true

# Install Playwright in container
docker-compose exec parser-service playwright install chromium

# Restart
docker-compose restart parser-service

📝 Next Steps

Phase 1: Bug Fixes & Testing (Priority 1)

  • Add unit tests — Test crawl_url() and download_document()
  • Add integration tests — Test full flow with mocked HTTP
  • Fix HTML rendering — Implement WeasyPrint for proper HTML → PDF
  • Error handling improvements — Better error messages and logging

Phase 2: Performance & Reliability (Priority 2)

  • Add caching — Redis cache for crawled content (1 hour TTL)
  • Add rate limiting — Per-IP limits (10 req/min)
  • Add robots.txt support — Respect crawling rules
  • Optimize large pages — Chunking for > 5000 chars
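
The "optimize large pages" item above could start from a simple character-window chunker like this sketch; the chunk size and overlap values are illustrative assumptions:

```python
def chunk_text(text: str, size: int = 5000, overlap: int = 200) -> list[str]:
    """Split long page text into overlapping chunks for downstream OCR/RAG."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap  # step forward, keeping `overlap` chars of context
    return chunks

print([len(c) for c in chunk_text("a" * 12000)])  # [5000, 5000, 2400]
```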

Phase 3: Advanced Features (Priority 3)

  • Sitemap support — Crawl multiple pages from sitemap
  • Link extraction — Extract and follow links
  • Content filtering — Remove ads, navigation, etc.
  • Screenshot capture — Full-page screenshots with Playwright
  • PDF generation from HTML — Proper HTML → PDF conversion


📊 Service Integration Map

┌─────────────────────────────────────────────┐
│         DAGI Stack Services                 │
└──────────┬──────────────────────────────────┘
           │
    ┌──────┴──────────┐
    │                 │
    ▼                 ▼
┌──────────┐     ┌──────────┐
│  Router  │────▶│ PARSER   │
│  (9102)  │     │ Service  │
└──────────┘     │ (9400)   │
                 └─────┬────┘
                       │
                 ┌─────┴─────┐
                 │           │
                 ▼           ▼
          ┌──────────┐ ┌──────────┐
          │ Crawl4AI │ │   OCR    │
          │  Service │ │ Pipeline │
          └──────────┘ └──────────┘
                 │           │
                 └─────┬─────┘
                       ▼
                ┌──────────────┐
                │    RAG       │
                │   Service    │
                │   (9500)     │
                └──────────────┘

Status: MVP Complete
Next: Testing + HTML rendering improvements
Last Updated: 2025-01-17 by WARP AI
Maintained by: Ivan Tytar & DAARION Team