feat: create PARSER service skeleton with FastAPI

- Create parser-service/ with full structure
- Add FastAPI app with endpoints (/parse, /parse_qa, /parse_markdown, /parse_chunks)
- Add Pydantic schemas (ParsedDocument, ParsedBlock, ParsedChunk, etc.)
- Add runtime module with model_loader and inference (with dummy parser)
- Add configuration, Dockerfile, requirements.txt
- Update TODO-PARSER-RAG.md with completed tasks
- Ready for dots.ocr model integration
Author: Apple
Date: 2025-11-15 13:15:08 -08:00
Parent: 2fc1894b26
Commit: 5e7cfc019e
11 changed files with 824 additions and 30 deletions

File: TODO-PARSER-RAG.md

@@ -10,17 +10,30 @@
### G.1. PARSER model runtime
- [ ] **G.1.1** Choose a runtime for dots.ocr
  - [ ] Option A: HuggingFace Transformers + vLLM/SGLang
  - [ ] Option B: llama.cpp / GGUF (if a GGUF build becomes available)
  - [ ] Option C: Ollama (if supported)
  - **Note:** pick the simplest option to start with
- [x] **G.1.1** Choose a runtime for dots.ocr
  - [x] **Decision:** Python 3.11 + PyTorch + FastAPI
  - [x] **Rationale:**
    - dots.ocr is a torch model and requires PyTorch
    - FastAPI as the HTTP wrapper (integrates with G.2)
    - Python 3.11 for modern syntax
  - [x] **Module structure:**
    - `parser_runtime/model_loader.py`: loads dots.ocr
    - `parser_runtime/schemas.py`: ParsedDocument, Page, Chunk
    - `parser_runtime/inference.py`: the `run_ocr(...)` function
  - [x] **Interface format:**
```python
def parse_document(
input: bytes | str, # bytes або path
output_mode: Literal["raw_json", "markdown", "qa_pairs", "chunks"]
) -> ParsedDocument
```
  - [ ] **Implementation:** skeleton created; integration with the real model still needed
- [ ] **G.1.2** Create the `parser-runtime/` service
  - [ ] `parser_runtime/__init__.py`
  - [ ] `parser_runtime/model_loader.py` (lazy init, GPU/CPU fallback)
  - [ ] `parser_runtime/inference.py` (functions: `parse_image`, `parse_pdf`)
  - [ ] `parser_runtime/config.py` (model configuration)
- [x] **G.1.2** Create the `parser-runtime/` service
  - [x] `app/runtime/__init__.py`
  - [x] `app/runtime/model_loader.py` (lazy init, GPU/CPU fallback)
  - [x] `app/runtime/inference.py` (functions: `parse_document`, `dummy_parse_document`)
  - [x] Configuration in `app/core/config.py`
- [ ] **G.1.3** Add config
  - [ ] `PARSER_MODEL_NAME=rednote-hilab/dots.ocr`
@@ -33,27 +46,22 @@
### G.2. HTTP service `parser-service`
- [ ] **G.2.1** Create the `services/parser-service/` service (FastAPI)
  - [ ] `main.py`: FastAPI application
  - [ ] `schemas.py`: Pydantic models (ParsedDocument, ParsedBlock, ...)
  - [ ] `config.py`: configuration
  - [ ] `Dockerfile`: Docker image
  - [ ] `requirements.txt`: dependencies
- [x] **G.2.1** Create the `services/parser-service/` service (FastAPI)
  - [x] `app/main.py`: FastAPI application
  - [x] `app/schemas.py`: Pydantic models (ParsedDocument, ParsedBlock, ...)
  - [x] `app/core/config.py`: configuration
  - [x] `Dockerfile`: Docker image
  - [x] `requirements.txt`: dependencies
  - [x] `README.md`: documentation
- [ ] **G.2.2** Endpoints
  - [ ] `POST /ocr/parse`: returns raw JSON
    - Request: `{doc_url, file_bytes, output_mode: "raw_json"}`
    - Response: `ParsedDocument`
  - [ ] `POST /ocr/parse_qa`: Q&A representation
    - Request: `{doc_url, file_bytes}`
    - Response: `{qa_pairs: [...]}`
  - [ ] `POST /ocr/parse_markdown`: Markdown version
    - Request: `{doc_url, file_bytes}`
    - Response: `{markdown: "..."}`
  - [ ] `POST /ocr/parse_chunks`: semantic chunks for RAG
    - Request: `{doc_url, file_bytes, dao_id, doc_id}`
    - Response: `{chunks: [...]}`
  - [ ] `GET /health`: health check
- [x] **G.2.2** Endpoints
  - [x] `POST /ocr/parse`: returns raw JSON (with mock data)
    - Request: `multipart/form-data` (file) + `output_mode`
    - Response: `ParseResponse` with `document`, `markdown`, `qa_pairs`, or `chunks`
  - [x] `POST /ocr/parse_qa`: Q&A representation (mock for now)
  - [x] `POST /ocr/parse_markdown`: Markdown version (mock for now)
  - [x] `POST /ocr/parse_chunks`: semantic chunks for RAG (mock for now)
  - [x] `GET /health`: health check
- [ ] **G.2.3** Support file types
  - [ ] PDF (split pages into images)

File: services/parser-service/Dockerfile

@@ -0,0 +1,27 @@
FROM python:3.11-slim

WORKDIR /app

# Install system dependencies
# (libgl1 replaces libgl1-mesa-glx, which no longer exists on Debian bookworm;
#  libgl1/libglib2.0-0 are needed by opencv-python, poppler-utils by pdf2image)
RUN apt-get update && apt-get install -y \
    poppler-utils \
    libgl1 \
    libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements and install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Create temp directory
RUN mkdir -p /tmp/parser

# Expose port
EXPOSE 9400

# Run application
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "9400"]

File: services/parser-service/README.md

@@ -0,0 +1,119 @@
# PARSER Service
Document Ingestion & Structuring Agent using dots.ocr.
## Description
The PARSER Service is a FastAPI application for recognizing and structuring documents (PDF, images) with the `dots.ocr` model.
## Structure
```
parser-service/
├── app/
│   ├── main.py              # FastAPI application
│   ├── api/
│   │   └── endpoints.py     # API endpoints
│   ├── core/
│   │   └── config.py        # Configuration
│   ├── runtime/
│   │   ├── __init__.py
│   │   ├── model_loader.py  # Model loading
│   │   └── inference.py     # Inference functions
│   └── schemas.py           # Pydantic models
├── requirements.txt
├── Dockerfile
└── README.md
```
## API Endpoints
### POST /ocr/parse
Parse document (PDF or image).
**Request:**
- `file`: UploadFile (multipart/form-data)
- `doc_url`: Optional[str] (not yet implemented)
- `output_mode`: `raw_json` | `markdown` | `qa_pairs` | `chunks`
- `dao_id`: Optional[str]
- `doc_id`: Optional[str]
**Response:**
```json
{
"document": {...}, // for raw_json mode
"markdown": "...", // for markdown mode
"qa_pairs": [...], // for qa_pairs mode
"chunks": [...], // for chunks mode
"metadata": {}
}
```
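As a sketch of consuming this response, the field to read depends on the requested `output_mode` (field names per the JSON above; `extract_payload` is a hypothetical helper, and the actual upload is shown only as a comment since it needs a running service):

```python
import json

def extract_payload(response_json: dict, output_mode: str):
    """Pick the response field that corresponds to the requested output_mode."""
    field_by_mode = {
        "raw_json": "document",
        "markdown": "markdown",
        "qa_pairs": "qa_pairs",
        "chunks": "chunks",
    }
    return response_json.get(field_by_mode[output_mode])

# The upload itself, e.g. with requests (assumes the default host/port):
#   r = requests.post("http://localhost:9400/ocr/parse",
#                     files={"file": open("doc.pdf", "rb")},
#                     data={"output_mode": "markdown"})
#   markdown = extract_payload(r.json(), "markdown")

sample = json.loads('{"markdown": "# Title", "metadata": {}}')
print(extract_payload(sample, "markdown"))  # -> # Title
```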
### POST /ocr/parse_qa
Parse document and return Q&A pairs.
### POST /ocr/parse_markdown
Parse document and return Markdown.
### POST /ocr/parse_chunks
Parse document and return chunks for RAG.
### GET /health
Health check endpoint.
## Configuration
Environment variables:
- `PARSER_MODEL_NAME`: Model name (default: `rednote-hilab/dots.ocr`)
- `PARSER_DEVICE`: Device (`cuda`, `cpu`, `mps`)
- `PARSER_MAX_PAGES`: Max pages to process (default: 100)
- `PARSER_MAX_RESOLUTION`: Max resolution (default: `4096x4096`)
- `MAX_FILE_SIZE_MB`: Max file size in MB (default: 50)
- `TEMP_DIR`: Temporary directory (default: `/tmp/parser`)
## Running
### Development
```bash
cd services/parser-service
pip install -r requirements.txt
uvicorn app.main:app --reload --host 0.0.0.0 --port 9400
```
### Docker
```bash
docker-compose up parser-service
```
## Implementation status
- [x] Basic service structure
- [x] API endpoints (with mock data)
- [x] Pydantic schemas
- [x] Configuration
- [ ] dots.ocr model integration
- [ ] PDF processing
- [ ] Image processing
- [ ] Markdown conversion
- [ ] Q&A pair extraction
## Next steps
1. Integrate the dots.ocr model in `app/runtime/inference.py`
2. Add PDF to image conversion
3. Implement real parsing instead of the dummy
4. Add tests
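Step 2 could be sketched as follows. `classify_input` and `load_pages` are hypothetical helper names; the actual conversion relies on `pdf2image` (already in `requirements.txt`) plus the `poppler-utils` installed in the Dockerfile:

```python
from pathlib import Path

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".tiff", ".tif"}

def classify_input(path: str) -> str:
    """Route an input file to the pdf or image pipeline by extension."""
    ext = Path(path).suffix.lower()
    if ext == ".pdf":
        return "pdf"
    if ext in IMAGE_EXTS:
        return "image"
    raise ValueError(f"Unsupported file type: {ext}")

def load_pages(path: str, max_pages: int = 100):
    """Return one PIL image per page, capped at max_pages for PDFs."""
    if classify_input(path) == "pdf":
        from pdf2image import convert_from_path  # needs poppler-utils
        return convert_from_path(path, last_page=max_pages)
    from PIL import Image
    return [Image.open(path)]
```

The heavy imports are deferred into the branches so the routing logic stays importable without the image stack installed.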
## Links
- [PARSER Agent Documentation](../../docs/agents/parser.md)
- [TODO: PARSER + RAG Implementation](../../TODO-PARSER-RAG.md)

File: services/parser-service/app/api/endpoints.py

@@ -0,0 +1,192 @@
"""
API endpoints for PARSER Service
"""
import logging
import uuid
from pathlib import Path
from typing import Optional
from fastapi import APIRouter, UploadFile, File, HTTPException, Form
from fastapi.responses import JSONResponse
from app.schemas import (
ParseRequest, ParseResponse, ParsedDocument, ParsedChunk, QAPair, ChunksResponse
)
from app.core.config import settings
from app.runtime.inference import parse_document, dummy_parse_document
logger = logging.getLogger(__name__)
router = APIRouter()
@router.post("/parse", response_model=ParseResponse)
async def parse_document_endpoint(
file: Optional[UploadFile] = File(None),
doc_url: Optional[str] = Form(None),
output_mode: str = Form("raw_json"),
dao_id: Optional[str] = Form(None),
doc_id: Optional[str] = Form(None)
):
"""
Parse document (PDF or image) using dots.ocr
Supports:
- PDF files (multi-page)
- Image files (PNG, JPEG, TIFF)
Output modes:
- raw_json: Full structured JSON
- markdown: Markdown representation
- qa_pairs: Q&A pairs extracted from document
- chunks: Semantic chunks for RAG
"""
try:
# Validate input
if not file and not doc_url:
raise HTTPException(
status_code=400,
detail="Either 'file' or 'doc_url' must be provided"
)
# Determine document type
if file:
doc_type = "image" # Will be determined from file extension
file_ext = Path(file.filename or "").suffix.lower()
if file_ext == ".pdf":
doc_type = "pdf"
# Read file content
content = await file.read()
# Check file size
max_size = settings.MAX_FILE_SIZE_MB * 1024 * 1024
if len(content) > max_size:
raise HTTPException(
status_code=413,
detail=f"File size exceeds maximum {settings.MAX_FILE_SIZE_MB}MB"
)
# Save to temp file
temp_dir = Path(settings.TEMP_DIR)
temp_dir.mkdir(exist_ok=True, parents=True)
temp_file = temp_dir / f"{uuid.uuid4()}{file_ext}"
temp_file.write_bytes(content)
input_path = str(temp_file)
else:
# TODO: Download from doc_url
raise HTTPException(
status_code=501,
detail="doc_url download not yet implemented"
)
# Parse document
logger.info(f"Parsing document: {input_path}, mode: {output_mode}")
# TODO: Replace with real parse_document when model is integrated
parsed_doc = dummy_parse_document(
input_path=input_path,
output_mode=output_mode,
doc_id=doc_id or str(uuid.uuid4()),
doc_type=doc_type
)
# Build response based on output_mode
response_data = {"metadata": {}}
if output_mode == "raw_json":
response_data["document"] = parsed_doc
elif output_mode == "markdown":
# TODO: Convert to markdown
response_data["markdown"] = "# Document\n\n" + "\n\n".join(
block.text for page in parsed_doc.pages for block in page.blocks
)
elif output_mode == "qa_pairs":
# TODO: Extract QA pairs
response_data["qa_pairs"] = []
elif output_mode == "chunks":
# Convert blocks to chunks
chunks = []
for page in parsed_doc.pages:
for block in page.blocks:
chunks.append(ParsedChunk(
text=block.text,
page=page.page_num,
bbox=block.bbox,
section=block.type,
metadata={
"dao_id": dao_id,
"doc_id": parsed_doc.doc_id,
"block_type": block.type
}
))
response_data["chunks"] = chunks
# Cleanup temp file
if file and temp_file.exists():
temp_file.unlink()
return ParseResponse(**response_data)
except HTTPException:
raise
except Exception as e:
logger.error(f"Error parsing document: {e}", exc_info=True)
raise HTTPException(status_code=500, detail=f"Parsing failed: {str(e)}")
@router.post("/parse_qa", response_model=ParseResponse)
async def parse_qa_endpoint(
file: Optional[UploadFile] = File(None),
doc_url: Optional[str] = Form(None)
):
"""Parse document and return Q&A pairs"""
return await parse_document_endpoint(
file=file,
doc_url=doc_url,
output_mode="qa_pairs"
)
@router.post("/parse_markdown", response_model=ParseResponse)
async def parse_markdown_endpoint(
file: Optional[UploadFile] = File(None),
doc_url: Optional[str] = Form(None)
):
"""Parse document and return Markdown"""
return await parse_document_endpoint(
file=file,
doc_url=doc_url,
output_mode="markdown"
)
@router.post("/parse_chunks", response_model=ChunksResponse)
async def parse_chunks_endpoint(
file: Optional[UploadFile] = File(None),
doc_url: Optional[str] = Form(None),
dao_id: str = Form(...),
doc_id: Optional[str] = Form(None)
):
"""Parse document and return chunks for RAG"""
response = await parse_document_endpoint(
file=file,
doc_url=doc_url,
output_mode="chunks",
dao_id=dao_id,
doc_id=doc_id
)
if not response.chunks:
raise HTTPException(status_code=500, detail="Failed to generate chunks")
return ChunksResponse(
chunks=response.chunks,
total_chunks=len(response.chunks),
doc_id=response.chunks[0].metadata.get("doc_id", doc_id or "unknown"),
dao_id=dao_id
)
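One gap worth noting in the endpoint above: the temp file is unlinked only on the success path, so a failed parse leaves it behind. A small context manager (a sketch, not part of this commit; `temp_upload` is a hypothetical name) would guarantee cleanup:

```python
import uuid
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def temp_upload(content: bytes, suffix: str, temp_dir: str = "/tmp/parser"):
    """Write uploaded bytes to a unique temp file and always remove it."""
    directory = Path(temp_dir)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{uuid.uuid4()}{suffix}"
    path.write_bytes(content)
    try:
        yield path
    finally:
        path.unlink(missing_ok=True)

# Usage inside the endpoint would look like:
#   with temp_upload(content, file_ext, settings.TEMP_DIR) as input_path:
#       parsed_doc = dummy_parse_document(str(input_path), ...)
```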

File: services/parser-service/app/core/config.py

@@ -0,0 +1,38 @@
"""
Configuration for PARSER Service
"""
import os
from typing import Literal
from pydantic_settings import BaseSettings
class Settings(BaseSettings):
"""Application settings"""
# Service
API_HOST: str = "0.0.0.0"
API_PORT: int = 9400
# PARSER Model
PARSER_MODEL_NAME: str = os.getenv("PARSER_MODEL_NAME", "rednote-hilab/dots.ocr")
PARSER_DEVICE: Literal["cuda", "cpu", "mps"] = os.getenv("PARSER_DEVICE", "cpu")
PARSER_MAX_PAGES: int = int(os.getenv("PARSER_MAX_PAGES", "100"))
PARSER_MAX_RESOLUTION: str = os.getenv("PARSER_MAX_RESOLUTION", "4096x4096")
PARSER_BATCH_SIZE: int = int(os.getenv("PARSER_BATCH_SIZE", "1"))
# File handling
MAX_FILE_SIZE_MB: int = int(os.getenv("MAX_FILE_SIZE_MB", "50"))
TEMP_DIR: str = os.getenv("TEMP_DIR", "/tmp/parser")
# Runtime
RUNTIME_TYPE: Literal["local", "remote"] = os.getenv("RUNTIME_TYPE", "local")
RUNTIME_URL: str = os.getenv("RUNTIME_URL", "http://parser-runtime:11435")
class Config:
env_file = ".env"
case_sensitive = True
settings = Settings()

File: services/parser-service/app/main.py

@@ -0,0 +1,79 @@
"""
PARSER Service - Document Ingestion & Structuring Agent
FastAPI сервіс для розпізнавання та структурування документів через dots.ocr
"""
import logging
from contextlib import asynccontextmanager
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from app.core.config import settings
from app.api.endpoints import router
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)
@asynccontextmanager
async def lifespan(app: FastAPI):
"""Lifespan events: startup and shutdown"""
# Startup
logger.info("Starting PARSER Service...")
logger.info(f"Model: {settings.PARSER_MODEL_NAME}")
logger.info(f"Device: {settings.PARSER_DEVICE}")
logger.info(f"Max pages: {settings.PARSER_MAX_PAGES}")
# TODO: Initialize model loader here
# from app.runtime.model_loader import load_model
# app.state.model = await load_model()
yield
# Shutdown
logger.info("Shutting down PARSER Service...")
app = FastAPI(
title="PARSER Service",
description="Document Ingestion & Structuring Agent using dots.ocr",
version="1.0.0",
lifespan=lifespan
)
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
app.include_router(router, prefix="/ocr", tags=["OCR"])
@app.get("/health")
async def health():
"""Health check endpoint"""
return {
"status": "healthy",
"service": "parser-service",
"model": settings.PARSER_MODEL_NAME,
"device": settings.PARSER_DEVICE,
"version": "1.0.0"
}
if __name__ == "__main__":
import uvicorn
uvicorn.run(
"app.main:app",
host="0.0.0.0",
port=9400,
reload=True
)

File: services/parser-service/app/runtime/__init__.py

@@ -0,0 +1,15 @@
"""
PARSER Runtime module
Handles model loading and inference for dots.ocr
"""
from app.runtime.inference import parse_document, dummy_parse_document
from app.runtime.model_loader import load_model, get_model
__all__ = [
"parse_document",
"dummy_parse_document",
"load_model",
"get_model"
]

File: services/parser-service/app/runtime/inference.py

@@ -0,0 +1,112 @@
"""
Inference functions for document parsing
"""
import logging
from typing import Literal, Optional
from pathlib import Path
from app.schemas import ParsedDocument, ParsedPage, ParsedBlock, BBox
from app.runtime.model_loader import get_model
from app.core.config import settings
logger = logging.getLogger(__name__)
def parse_document(
input_path: str,
output_mode: Literal["raw_json", "markdown", "qa_pairs", "chunks"] = "raw_json",
doc_id: Optional[str] = None,
doc_type: Literal["pdf", "image"] = "image"
) -> ParsedDocument:
"""
Parse document using dots.ocr model
Args:
input_path: Path to document file (PDF or image)
output_mode: Output format mode
doc_id: Document ID
doc_type: Document type (pdf or image)
Returns:
ParsedDocument with structured content
"""
model = get_model()
if model is None:
logger.warning("Model not loaded, using dummy parser")
return dummy_parse_document(input_path, output_mode, doc_id, doc_type)
# TODO: Implement actual inference with dots.ocr
# Example:
# from PIL import Image
# import pdf2image # for PDF
# if doc_type == "pdf":
# images = pdf2image.convert_from_path(input_path)
# else:
# images = [Image.open(input_path)]
#
# pages = []
# for idx, image in enumerate(images):
# # Process with model
# inputs = model["processor"](images=image, return_tensors="pt")
# outputs = model["model"].generate(**inputs)
# text = model["processor"].decode(outputs[0], skip_special_tokens=True)
#
# # Parse output into blocks
# blocks = parse_model_output(text, idx + 1)
# pages.append(ParsedPage(...))
#
# return ParsedDocument(...)
# For now, use dummy
return dummy_parse_document(input_path, output_mode, doc_id, doc_type)
def dummy_parse_document(
input_path: str,
output_mode: Literal["raw_json", "markdown", "qa_pairs", "chunks"] = "raw_json",
doc_id: Optional[str] = None,
doc_type: Literal["pdf", "image"] = "image"
) -> ParsedDocument:
"""
Dummy parser for testing (returns mock data)
This will be replaced with actual dots.ocr inference
"""
logger.info(f"Dummy parsing: {input_path}")
# Mock data
mock_page = ParsedPage(
page_num=1,
blocks=[
ParsedBlock(
type="heading",
text="Document Title",
bbox=BBox(x=0, y=0, width=800, height=50),
reading_order=1,
page_num=1
),
ParsedBlock(
type="paragraph",
text="This is a dummy parsed document. Replace this with actual dots.ocr inference.",
bbox=BBox(x=0, y=60, width=800, height=100),
reading_order=2,
page_num=1
)
],
width=800,
height=1200
)
return ParsedDocument(
doc_id=doc_id or "dummy-doc-1",
doc_type=doc_type,
pages=[mock_page],
metadata={
"parser": "dummy",
"input_path": input_path
}
)
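The commented-out inference example references a `parse_model_output` helper that does not exist yet. A self-contained sketch of it is below, under a loud assumption: it presumes the model emits a JSON list of layout elements with `category`, `bbox`, and `text` fields, which must be verified against the real dots.ocr output format before use (here plain dicts stand in for the ParsedBlock schema):

```python
import json

def parse_model_output(raw: str, page_num: int) -> list:
    """Turn raw model output into block dicts in reading order.

    ASSUMPTION: the model emits a JSON list like
    [{"category": "Title", "bbox": [x1, y1, x2, y2], "text": "..."}, ...].
    Check the actual dots.ocr output format before relying on this.
    """
    elements = json.loads(raw)
    blocks = []
    for order, el in enumerate(elements, start=1):
        x1, y1, x2, y2 = el["bbox"]
        blocks.append({
            # Crude category mapping; extend for tables, formulas, etc.
            "type": "heading" if el["category"] == "Title" else "paragraph",
            "text": el.get("text", ""),
            "bbox": {"x": x1, "y": y1, "width": x2 - x1, "height": y2 - y1},
            "reading_order": order,
            "page_num": page_num,
        })
    return blocks
```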

File: services/parser-service/app/runtime/model_loader.py

@@ -0,0 +1,74 @@
"""
Model loader for dots.ocr
Handles lazy loading and GPU/CPU fallback
"""
import logging
from typing import Optional, Literal
from pathlib import Path
from app.core.config import settings
logger = logging.getLogger(__name__)
# Global model instance
_model: Optional[object] = None
def load_model() -> object:
"""
Load dots.ocr model
Returns:
Loaded model instance
"""
global _model
if _model is not None:
return _model
logger.info(f"Loading model: {settings.PARSER_MODEL_NAME}")
logger.info(f"Device: {settings.PARSER_DEVICE}")
try:
# TODO: Implement actual model loading
# Example:
# from transformers import AutoModelForVision2Seq, AutoProcessor
#
# processor = AutoProcessor.from_pretrained(settings.PARSER_MODEL_NAME)
# model = AutoModelForVision2Seq.from_pretrained(
# settings.PARSER_MODEL_NAME,
# device_map=settings.PARSER_DEVICE
# )
#
# _model = {
# "model": model,
# "processor": processor
# }
# For now, return None (will use dummy parser)
logger.warning("Model loading not yet implemented, using dummy parser")
_model = None
except Exception as e:
logger.error(f"Failed to load model: {e}", exc_info=True)
raise
return _model
def get_model() -> Optional[object]:
"""Get loaded model instance"""
if _model is None:
return load_model()
return _model
def unload_model():
"""Unload model from memory"""
global _model
if _model is not None:
# TODO: Proper cleanup
_model = None
logger.info("Model unloaded")
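Since FastAPI may serve requests concurrently, the lazy loading above can race and load the model twice. A lock-guarded variant (a sketch; `_do_load` is a stand-in for the real transformers loading code) avoids that with double-checked locking:

```python
import threading

_model = None
_lock = threading.Lock()

def _do_load():
    """Stand-in for the real model/processor loading code."""
    return {"model": "stub", "processor": "stub"}

def get_model_safe():
    """Lazily load the model at most once, even under concurrent calls."""
    global _model
    if _model is None:            # fast path, no lock
        with _lock:
            if _model is None:    # re-check inside the lock
                _model = _do_load()
    return _model
```

The second `is None` check inside the lock is what prevents two threads that both passed the fast path from each loading the model.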

File: services/parser-service/app/schemas.py

@@ -0,0 +1,108 @@
"""
Pydantic schemas for PARSER Service
"""
from typing import Optional, List, Dict, Any, Literal
from pydantic import BaseModel, Field
from datetime import datetime
class BBox(BaseModel):
"""Bounding box coordinates"""
x: float = Field(..., description="X coordinate")
y: float = Field(..., description="Y coordinate")
width: float = Field(..., description="Width")
height: float = Field(..., description="Height")
class TableCell(BaseModel):
"""Table cell data"""
row: int
col: int
text: str
rowspan: Optional[int] = 1
colspan: Optional[int] = 1
class TableData(BaseModel):
"""Structured table data"""
rows: List[List[str]] = Field(..., description="Table rows")
columns: List[str] = Field(..., description="Column headers")
merged_cells: Optional[List[Dict[str, Any]]] = Field(None, description="Merged cells info")
class ParsedBlock(BaseModel):
"""Parsed document block"""
type: Literal["paragraph", "heading", "table", "formula", "figure_caption", "list"] = Field(
..., description="Block type"
)
text: str = Field(..., description="Block text content")
bbox: BBox = Field(..., description="Bounding box")
reading_order: int = Field(..., description="Reading order index")
page_num: int = Field(..., description="Page number")
table_data: Optional[TableData] = Field(None, description="Table data (if type=table)")
metadata: Optional[Dict[str, Any]] = Field(None, description="Additional metadata")
class ParsedPage(BaseModel):
"""Parsed document page"""
page_num: int = Field(..., description="Page number (1-indexed)")
blocks: List[ParsedBlock] = Field(..., description="Page blocks")
width: float = Field(..., description="Page width in pixels")
height: float = Field(..., description="Page height in pixels")
class ParsedChunk(BaseModel):
"""Semantic chunk for RAG"""
text: str = Field(..., description="Chunk text")
page: int = Field(..., description="Page number")
bbox: Optional[BBox] = Field(None, description="Bounding box")
section: Optional[str] = Field(None, description="Section name")
metadata: Dict[str, Any] = Field(default_factory=dict, description="Additional metadata")
class QAPair(BaseModel):
"""Question-Answer pair"""
question: str = Field(..., description="Question")
answer: str = Field(..., description="Answer")
source_page: int = Field(..., description="Source page number")
source_bbox: Optional[BBox] = Field(None, description="Source bounding box")
confidence: Optional[float] = Field(None, description="Confidence score")
class ParsedDocument(BaseModel):
"""Complete parsed document"""
doc_id: Optional[str] = Field(None, description="Document ID")
doc_url: Optional[str] = Field(None, description="Document URL")
doc_type: Literal["pdf", "image"] = Field(..., description="Document type")
pages: List[ParsedPage] = Field(..., description="Parsed pages")
metadata: Dict[str, Any] = Field(default_factory=dict, description="Document metadata")
created_at: datetime = Field(default_factory=datetime.utcnow, description="Creation timestamp")
class ParseRequest(BaseModel):
"""Parse request"""
doc_url: Optional[str] = Field(None, description="Document URL")
output_mode: Literal["raw_json", "markdown", "qa_pairs", "chunks"] = Field(
"raw_json", description="Output mode"
)
dao_id: Optional[str] = Field(None, description="DAO ID")
doc_id: Optional[str] = Field(None, description="Document ID")
class ParseResponse(BaseModel):
"""Parse response"""
document: Optional[ParsedDocument] = Field(None, description="Parsed document (raw_json mode)")
markdown: Optional[str] = Field(None, description="Markdown content (markdown mode)")
qa_pairs: Optional[List[QAPair]] = Field(None, description="QA pairs (qa_pairs mode)")
chunks: Optional[List[ParsedChunk]] = Field(None, description="Chunks (chunks mode)")
metadata: Dict[str, Any] = Field(default_factory=dict, description="Additional metadata")
class ChunksResponse(BaseModel):
"""Chunks response for RAG"""
chunks: List[ParsedChunk] = Field(..., description="Document chunks")
total_chunks: int = Field(..., description="Total number of chunks")
doc_id: str = Field(..., description="Document ID")
dao_id: str = Field(..., description="DAO ID")

File: services/parser-service/requirements.txt

@@ -0,0 +1,22 @@
# FastAPI and server
fastapi==0.104.1
uvicorn[standard]==0.24.0
python-multipart==0.0.6
pydantic==2.5.0
pydantic-settings==2.1.0
# Model and ML
torch>=2.0.0
transformers>=4.35.0
Pillow>=10.0.0
# PDF processing
pdf2image>=1.16.3
PyMuPDF>=1.23.0 # Alternative PDF library
# Image processing
opencv-python>=4.8.0 # Optional, for advanced image processing
# Utilities
python-dotenv>=1.0.1