feat: create PARSER service skeleton with FastAPI
- Create parser-service/ with full structure
- Add FastAPI app with endpoints (/parse, /parse_qa, /parse_markdown, /parse_chunks)
- Add Pydantic schemas (ParsedDocument, ParsedBlock, ParsedChunk, etc.)
- Add runtime module with model_loader and inference (with dummy parser)
- Add configuration, Dockerfile, requirements.txt
- Update TODO-PARSER-RAG.md with completed tasks
- Ready for dots.ocr model integration
@@ -10,17 +10,30 @@

### G.1. PARSER model runtime

- [ ] **G.1.1** Choose a runtime for dots.ocr
  - [ ] Option A: HuggingFace Transformers + vLLM/SGLang
  - [ ] Option B: llama.cpp / GGUF (if a GGUF build becomes available)
  - [ ] Option C: Ollama (if supported)
  - **Note:** pick the simplest option to start with
- [x] **G.1.1** Choose a runtime for dots.ocr ✅
  - [x] **Decision:** Python 3.11 + PyTorch + FastAPI
  - [x] **Rationale:**
    - dots.ocr is a torch model and requires PyTorch
    - FastAPI as the HTTP wrapper (integration with G.2)
    - Python 3.11 for modern syntax
  - [x] **Module structure:**
    - `parser_runtime/model_loader.py` — loads dots.ocr
    - `parser_runtime/schemas.py` — ParsedDocument, Page, Chunk
    - `parser_runtime/inference.py` — the `run_ocr(...)` function
  - [x] **Interface format:**

    ```python
    def parse_document(
        input: bytes | str,  # bytes or a path
        output_mode: Literal["raw_json", "markdown", "qa_pairs", "chunks"]
    ) -> ParsedDocument
    ```

  - [ ] **Implementation:** skeleton created; integration with the real model is still needed

- [ ] **G.1.2** Create the `parser-runtime/` service
  - [ ] `parser_runtime/__init__.py`
  - [ ] `parser_runtime/model_loader.py` (lazy init, GPU/CPU fallback)
  - [ ] `parser_runtime/inference.py` (functions: `parse_image`, `parse_pdf`)
  - [ ] `parser_runtime/config.py` (model configuration)
- [x] **G.1.2** Create the `parser-runtime/` service ✅
  - [x] `app/runtime/__init__.py`
  - [x] `app/runtime/model_loader.py` (lazy init, GPU/CPU fallback)
  - [x] `app/runtime/inference.py` (functions: `parse_document`, `dummy_parse_document`)
  - [x] Configuration in `app/core/config.py`

- [ ] **G.1.3** Add configuration
  - [ ] `PARSER_MODEL_NAME=rednote-hilab/dots.ocr`
@@ -33,27 +46,22 @@

### G.2. The `parser-service` HTTP service

- [ ] **G.2.1** Create the `services/parser-service/` service (FastAPI)
  - [ ] `main.py` — FastAPI application
  - [ ] `schemas.py` — Pydantic models (ParsedDocument, ParsedBlock, ...)
  - [ ] `config.py` — configuration
  - [ ] `Dockerfile` — Docker image
  - [ ] `requirements.txt` — dependencies
- [x] **G.2.1** Create the `services/parser-service/` service (FastAPI) ✅
  - [x] `app/main.py` — FastAPI application
  - [x] `app/schemas.py` — Pydantic models (ParsedDocument, ParsedBlock, ...)
  - [x] `app/core/config.py` — configuration
  - [x] `Dockerfile` — Docker image
  - [x] `requirements.txt` — dependencies
  - [x] `README.md` — documentation

- [ ] **G.2.2** Endpoints
  - [ ] `POST /ocr/parse` — returns raw JSON
    - Request: `{doc_url, file_bytes, output_mode: "raw_json"}`
    - Response: `ParsedDocument`
  - [ ] `POST /ocr/parse_qa` — Q&A representation
    - Request: `{doc_url, file_bytes}`
    - Response: `{qa_pairs: [...]}`
  - [ ] `POST /ocr/parse_markdown` — Markdown version
    - Request: `{doc_url, file_bytes}`
    - Response: `{markdown: "..."}`
  - [ ] `POST /ocr/parse_chunks` — semantic chunks for RAG
    - Request: `{doc_url, file_bytes, dao_id, doc_id}`
    - Response: `{chunks: [...]}`
  - [ ] `GET /health` — health check
- [x] **G.2.2** Endpoints ✅
  - [x] `POST /ocr/parse` — returns raw JSON (with mock data)
    - Request: `multipart/form-data` (file) + `output_mode`
    - Response: `ParseResponse` with `document`, `markdown`, `qa_pairs`, or `chunks`
  - [x] `POST /ocr/parse_qa` — Q&A representation (mock for now)
  - [x] `POST /ocr/parse_markdown` — Markdown version (mock for now)
  - [x] `POST /ocr/parse_chunks` — semantic chunks for RAG (mock for now)
  - [x] `GET /health` — health check

- [ ] **G.2.3** Support file types
  - [ ] PDF (split into pages → images)
27 services/parser-service/Dockerfile Normal file
@@ -0,0 +1,27 @@
FROM python:3.11-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    poppler-utils \
    libgl1-mesa-glx \
    libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements and install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Create temp directory
RUN mkdir -p /tmp/parser

# Expose port
EXPOSE 9400

# Run application
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "9400"]
119 services/parser-service/README.md Normal file
@@ -0,0 +1,119 @@
# PARSER Service

Document Ingestion & Structuring Agent using dots.ocr.

## Description

PARSER Service is a FastAPI service for recognizing and structuring documents (PDF, images) with the `dots.ocr` model.

## Structure

```
parser-service/
├── app/
│   ├── main.py              # FastAPI application
│   ├── api/
│   │   └── endpoints.py     # API endpoints
│   ├── core/
│   │   └── config.py        # Configuration
│   ├── runtime/
│   │   ├── __init__.py
│   │   ├── model_loader.py  # Model loading
│   │   └── inference.py     # Inference functions
│   └── schemas.py           # Pydantic models
├── requirements.txt
├── Dockerfile
└── README.md
```

## API Endpoints

### POST /ocr/parse

Parse a document (PDF or image).

**Request:**
- `file`: UploadFile (multipart/form-data)
- `doc_url`: Optional[str] (not yet implemented)
- `output_mode`: `raw_json` | `markdown` | `qa_pairs` | `chunks`
- `dao_id`: Optional[str]
- `doc_id`: Optional[str]

**Response:**
```json
{
  "document": {...},   // for raw_json mode
  "markdown": "...",   // for markdown mode
  "qa_pairs": [...],   // for qa_pairs mode
  "chunks": [...],     // for chunks mode
  "metadata": {}
}
```
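Only the field matching the requested `output_mode` is populated in the response; a small client-side helper to select it might look like this (illustrative sketch, not part of the service):

```python
# Map each output_mode to the response field it populates (names as documented above).
MODE_FIELD = {
    "raw_json": "document",
    "markdown": "markdown",
    "qa_pairs": "qa_pairs",
    "chunks": "chunks",
}


def extract_payload(body: dict, output_mode: str):
    """Return the part of a /ocr/parse response body relevant to the mode."""
    field = MODE_FIELD[output_mode]
    payload = body.get(field)
    if payload is None:
        raise ValueError(f"response has no '{field}' for mode '{output_mode}'")
    return payload


body = {"markdown": "# Document", "metadata": {}}
print(extract_payload(body, "markdown"))  # # Document
```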

### POST /ocr/parse_qa

Parse a document and return Q&A pairs.

### POST /ocr/parse_markdown

Parse a document and return Markdown.

### POST /ocr/parse_chunks

Parse a document and return chunks for RAG.

### GET /health

Health check endpoint.

## Configuration

Environment variables:

- `PARSER_MODEL_NAME`: Model name (default: `rednote-hilab/dots.ocr`)
- `PARSER_DEVICE`: Device (`cuda`, `cpu`, `mps`)
- `PARSER_MAX_PAGES`: Max pages to process (default: 100)
- `PARSER_MAX_RESOLUTION`: Max resolution (default: `4096x4096`)
- `MAX_FILE_SIZE_MB`: Max file size in MB (default: 50)
- `TEMP_DIR`: Temporary directory (default: `/tmp/parser`)
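`PARSER_MAX_RESOLUTION` is a `WIDTHxHEIGHT` string; where the numbers are needed, it can be split with a helper like this (sketch; the service currently keeps the raw string):

```python
def parse_resolution(value: str) -> tuple[int, int]:
    """Split a 'WIDTHxHEIGHT' string such as '4096x4096' into integers."""
    width, height = value.lower().split("x")
    return int(width), int(height)


print(parse_resolution("4096x4096"))  # (4096, 4096)
```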

## Running

### Development

```bash
cd services/parser-service
pip install -r requirements.txt
uvicorn app.main:app --reload --host 0.0.0.0 --port 9400
```

### Docker

```bash
docker-compose up parser-service
```

## Implementation status

- [x] Basic service structure
- [x] API endpoints (with mock data)
- [x] Pydantic schemas
- [x] Configuration
- [ ] Integration with the dots.ocr model
- [ ] PDF processing
- [ ] Image processing
- [ ] Markdown conversion
- [ ] QA pairs extraction
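The pending Markdown conversion could start from a block-type mapping; a sketch with plain dicts standing in for `ParsedBlock` (the type names follow the schema; the mapping itself is an assumption):

```python
def blocks_to_markdown(blocks: list[dict]) -> str:
    """Render parsed blocks as Markdown in reading order (sketch)."""
    parts = []
    for block in sorted(blocks, key=lambda b: b["reading_order"]):
        if block["type"] == "heading":
            parts.append(f"# {block['text']}")
        elif block["type"] == "list":
            parts.append("\n".join(f"- {item}" for item in block["text"].splitlines()))
        else:  # paragraph, figure_caption, formula, table fall back to raw text
            parts.append(block["text"])
    return "\n\n".join(parts)


blocks = [
    {"type": "paragraph", "text": "Body text.", "reading_order": 2},
    {"type": "heading", "text": "Title", "reading_order": 1},
]
print(blocks_to_markdown(blocks))
```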

## Next steps

1. Integrate the dots.ocr model in `app/runtime/inference.py`
2. Add PDF → images conversion
3. Implement real parsing instead of the dummy
4. Add tests

## Links

- [PARSER Agent Documentation](../../docs/agents/parser.md)
- [TODO: PARSER + RAG Implementation](../../TODO-PARSER-RAG.md)
192 services/parser-service/app/api/endpoints.py Normal file
@@ -0,0 +1,192 @@
"""
API endpoints for PARSER Service
"""

import logging
import uuid
from pathlib import Path
from typing import Optional

from fastapi import APIRouter, UploadFile, File, HTTPException, Form
from fastapi.responses import JSONResponse

from app.schemas import (
    ParseRequest, ParseResponse, ParsedDocument, ParsedChunk, QAPair, ChunksResponse
)
from app.core.config import settings
from app.runtime.inference import parse_document, dummy_parse_document

logger = logging.getLogger(__name__)

router = APIRouter()


@router.post("/parse", response_model=ParseResponse)
async def parse_document_endpoint(
    file: Optional[UploadFile] = File(None),
    doc_url: Optional[str] = Form(None),
    output_mode: str = Form("raw_json"),
    dao_id: Optional[str] = Form(None),
    doc_id: Optional[str] = Form(None)
):
    """
    Parse document (PDF or image) using dots.ocr

    Supports:
    - PDF files (multi-page)
    - Image files (PNG, JPEG, TIFF)

    Output modes:
    - raw_json: Full structured JSON
    - markdown: Markdown representation
    - qa_pairs: Q&A pairs extracted from document
    - chunks: Semantic chunks for RAG
    """
    try:
        # Validate input
        if not file and not doc_url:
            raise HTTPException(
                status_code=400,
                detail="Either 'file' or 'doc_url' must be provided"
            )

        # Determine document type
        if file:
            doc_type = "image"  # Will be determined from file extension
            file_ext = Path(file.filename or "").suffix.lower()
            if file_ext == ".pdf":
                doc_type = "pdf"

            # Read file content
            content = await file.read()

            # Check file size
            max_size = settings.MAX_FILE_SIZE_MB * 1024 * 1024
            if len(content) > max_size:
                raise HTTPException(
                    status_code=413,
                    detail=f"File size exceeds maximum {settings.MAX_FILE_SIZE_MB}MB"
                )

            # Save to temp file
            temp_dir = Path(settings.TEMP_DIR)
            temp_dir.mkdir(exist_ok=True, parents=True)
            temp_file = temp_dir / f"{uuid.uuid4()}{file_ext}"
            temp_file.write_bytes(content)

            input_path = str(temp_file)

        else:
            # TODO: Download from doc_url
            raise HTTPException(
                status_code=501,
                detail="doc_url download not yet implemented"
            )

        # Parse document
        logger.info(f"Parsing document: {input_path}, mode: {output_mode}")

        # TODO: Replace with real parse_document when model is integrated
        parsed_doc = dummy_parse_document(
            input_path=input_path,
            output_mode=output_mode,
            doc_id=doc_id or str(uuid.uuid4()),
            doc_type=doc_type
        )

        # Build response based on output_mode
        response_data = {"metadata": {}}

        if output_mode == "raw_json":
            response_data["document"] = parsed_doc
        elif output_mode == "markdown":
            # TODO: Convert to markdown
            response_data["markdown"] = "# Document\n\n" + "\n\n".join(
                block.text for page in parsed_doc.pages for block in page.blocks
            )
        elif output_mode == "qa_pairs":
            # TODO: Extract QA pairs
            response_data["qa_pairs"] = []
        elif output_mode == "chunks":
            # Convert blocks to chunks
            chunks = []
            for page in parsed_doc.pages:
                for block in page.blocks:
                    chunks.append(ParsedChunk(
                        text=block.text,
                        page=page.page_num,
                        bbox=block.bbox,
                        section=block.type,
                        metadata={
                            "dao_id": dao_id,
                            "doc_id": parsed_doc.doc_id,
                            "block_type": block.type
                        }
                    ))
            response_data["chunks"] = chunks

        # Cleanup temp file
        if file and temp_file.exists():
            temp_file.unlink()

        return ParseResponse(**response_data)

    except HTTPException:
        raise
    except Exception as e:
        logger.error(f"Error parsing document: {e}", exc_info=True)
        raise HTTPException(status_code=500, detail=f"Parsing failed: {str(e)}")


@router.post("/parse_qa", response_model=ParseResponse)
async def parse_qa_endpoint(
    file: Optional[UploadFile] = File(None),
    doc_url: Optional[str] = Form(None)
):
    """Parse document and return Q&A pairs"""
    return await parse_document_endpoint(
        file=file,
        doc_url=doc_url,
        output_mode="qa_pairs"
    )


@router.post("/parse_markdown", response_model=ParseResponse)
async def parse_markdown_endpoint(
    file: Optional[UploadFile] = File(None),
    doc_url: Optional[str] = Form(None)
):
    """Parse document and return Markdown"""
    return await parse_document_endpoint(
        file=file,
        doc_url=doc_url,
        output_mode="markdown"
    )


@router.post("/parse_chunks", response_model=ChunksResponse)
async def parse_chunks_endpoint(
    file: Optional[UploadFile] = File(None),
    doc_url: Optional[str] = Form(None),
    dao_id: str = Form(...),
    doc_id: Optional[str] = Form(None)
):
    """Parse document and return chunks for RAG"""
    response = await parse_document_endpoint(
        file=file,
        doc_url=doc_url,
        output_mode="chunks",
        dao_id=dao_id,
        doc_id=doc_id
    )

    if not response.chunks:
        raise HTTPException(status_code=500, detail="Failed to generate chunks")

    return ChunksResponse(
        chunks=response.chunks,
        total_chunks=len(response.chunks),
        doc_id=response.chunks[0].metadata.get("doc_id", doc_id or "unknown"),
        dao_id=dao_id
    )
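The chunks produced by `/ocr/parse_chunks` above are one block each; for RAG indexing they often need to be merged into size-bounded chunks. A stdlib sketch of such a merge (hypothetical `merge_chunks` helper, not part of this commit):

```python
def merge_chunks(texts: list[str], max_chars: int = 1000) -> list[str]:
    """Greedily merge small chunk texts until adding one would exceed max_chars."""
    merged: list[str] = []
    current = ""
    for text in texts:
        candidate = f"{current}\n\n{text}" if current else text
        if current and len(candidate) > max_chars:
            merged.append(current)
            current = text
        else:
            current = candidate
    if current:
        merged.append(current)
    return merged


print(len(merge_chunks(["a" * 600, "b" * 600, "c" * 100], max_chars=1000)))  # 2
```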
38 services/parser-service/app/core/config.py Normal file
@@ -0,0 +1,38 @@
"""
Configuration for PARSER Service
"""

import os
from typing import Literal
from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    """Application settings"""

    # Service
    API_HOST: str = "0.0.0.0"
    API_PORT: int = 9400

    # PARSER Model
    PARSER_MODEL_NAME: str = os.getenv("PARSER_MODEL_NAME", "rednote-hilab/dots.ocr")
    PARSER_DEVICE: Literal["cuda", "cpu", "mps"] = os.getenv("PARSER_DEVICE", "cpu")
    PARSER_MAX_PAGES: int = int(os.getenv("PARSER_MAX_PAGES", "100"))
    PARSER_MAX_RESOLUTION: str = os.getenv("PARSER_MAX_RESOLUTION", "4096x4096")
    PARSER_BATCH_SIZE: int = int(os.getenv("PARSER_BATCH_SIZE", "1"))

    # File handling
    MAX_FILE_SIZE_MB: int = int(os.getenv("MAX_FILE_SIZE_MB", "50"))
    TEMP_DIR: str = os.getenv("TEMP_DIR", "/tmp/parser")

    # Runtime
    RUNTIME_TYPE: Literal["local", "remote"] = os.getenv("RUNTIME_TYPE", "local")
    RUNTIME_URL: str = os.getenv("RUNTIME_URL", "http://parser-runtime:11435")

    class Config:
        env_file = ".env"
        case_sensitive = True


settings = Settings()
79 services/parser-service/app/main.py Normal file
@@ -0,0 +1,79 @@
"""
PARSER Service - Document Ingestion & Structuring Agent
FastAPI service for recognizing and structuring documents with dots.ocr
"""

import logging
from contextlib import asynccontextmanager

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

from app.core.config import settings
from app.api.endpoints import router

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)


@asynccontextmanager
async def lifespan(app: FastAPI):
    """Lifespan events: startup and shutdown"""
    # Startup
    logger.info("Starting PARSER Service...")
    logger.info(f"Model: {settings.PARSER_MODEL_NAME}")
    logger.info(f"Device: {settings.PARSER_DEVICE}")
    logger.info(f"Max pages: {settings.PARSER_MAX_PAGES}")

    # TODO: Initialize model loader here
    # from app.runtime.model_loader import load_model
    # app.state.model = await load_model()

    yield

    # Shutdown
    logger.info("Shutting down PARSER Service...")


app = FastAPI(
    title="PARSER Service",
    description="Document Ingestion & Structuring Agent using dots.ocr",
    version="1.0.0",
    lifespan=lifespan
)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

app.include_router(router, prefix="/ocr", tags=["OCR"])


@app.get("/health")
async def health():
    """Health check endpoint"""
    return {
        "status": "healthy",
        "service": "parser-service",
        "model": settings.PARSER_MODEL_NAME,
        "device": settings.PARSER_DEVICE,
        "version": "1.0.0"
    }


if __name__ == "__main__":
    import uvicorn
    uvicorn.run(
        "app.main:app",
        host="0.0.0.0",
        port=9400,
        reload=True
    )
15 services/parser-service/app/runtime/__init__.py Normal file
@@ -0,0 +1,15 @@
"""
PARSER Runtime module
Handles model loading and inference for dots.ocr
"""

from app.runtime.inference import parse_document, dummy_parse_document
from app.runtime.model_loader import load_model, get_model

__all__ = [
    "parse_document",
    "dummy_parse_document",
    "load_model",
    "get_model"
]
112 services/parser-service/app/runtime/inference.py Normal file
@@ -0,0 +1,112 @@
"""
Inference functions for document parsing
"""

import logging
from typing import Literal, Optional
from pathlib import Path

from app.schemas import ParsedDocument, ParsedPage, ParsedBlock, BBox
from app.runtime.model_loader import get_model
from app.core.config import settings

logger = logging.getLogger(__name__)


def parse_document(
    input_path: str,
    output_mode: Literal["raw_json", "markdown", "qa_pairs", "chunks"] = "raw_json",
    doc_id: Optional[str] = None,
    doc_type: Literal["pdf", "image"] = "image"
) -> ParsedDocument:
    """
    Parse document using dots.ocr model

    Args:
        input_path: Path to document file (PDF or image)
        output_mode: Output format mode
        doc_id: Document ID
        doc_type: Document type (pdf or image)

    Returns:
        ParsedDocument with structured content
    """
    model = get_model()

    if model is None:
        logger.warning("Model not loaded, using dummy parser")
        return dummy_parse_document(input_path, output_mode, doc_id, doc_type)

    # TODO: Implement actual inference with dots.ocr
    # Example:
    # from PIL import Image
    # import pdf2image  # for PDF
    #
    # if doc_type == "pdf":
    #     images = pdf2image.convert_from_path(input_path)
    # else:
    #     images = [Image.open(input_path)]
    #
    # pages = []
    # for idx, image in enumerate(images):
    #     # Process with model
    #     inputs = model["processor"](images=image, return_tensors="pt")
    #     outputs = model["model"].generate(**inputs)
    #     text = model["processor"].decode(outputs[0], skip_special_tokens=True)
    #
    #     # Parse output into blocks
    #     blocks = parse_model_output(text, idx + 1)
    #     pages.append(ParsedPage(...))
    #
    # return ParsedDocument(...)

    # For now, use dummy
    return dummy_parse_document(input_path, output_mode, doc_id, doc_type)


def dummy_parse_document(
    input_path: str,
    output_mode: Literal["raw_json", "markdown", "qa_pairs", "chunks"] = "raw_json",
    doc_id: Optional[str] = None,
    doc_type: Literal["pdf", "image"] = "image"
) -> ParsedDocument:
    """
    Dummy parser for testing (returns mock data)

    This will be replaced with actual dots.ocr inference
    """
    logger.info(f"Dummy parsing: {input_path}")

    # Mock data
    mock_page = ParsedPage(
        page_num=1,
        blocks=[
            ParsedBlock(
                type="heading",
                text="Document Title",
                bbox=BBox(x=0, y=0, width=800, height=50),
                reading_order=1,
                page_num=1
            ),
            ParsedBlock(
                type="paragraph",
                text="This is a dummy parsed document. Replace this with actual dots.ocr inference.",
                bbox=BBox(x=0, y=60, width=800, height=100),
                reading_order=2,
                page_num=1
            )
        ],
        width=800,
        height=1200
    )

    return ParsedDocument(
        doc_id=doc_id or "dummy-doc-1",
        doc_type=doc_type,
        pages=[mock_page],
        metadata={
            "parser": "dummy",
            "input_path": input_path
        }
    )
74 services/parser-service/app/runtime/model_loader.py Normal file
@@ -0,0 +1,74 @@
"""
Model loader for dots.ocr
Handles lazy loading and GPU/CPU fallback
"""

import logging
from typing import Optional, Literal
from pathlib import Path

from app.core.config import settings

logger = logging.getLogger(__name__)

# Global model instance
_model: Optional[object] = None


def load_model() -> object:
    """
    Load dots.ocr model

    Returns:
        Loaded model instance
    """
    global _model

    if _model is not None:
        return _model

    logger.info(f"Loading model: {settings.PARSER_MODEL_NAME}")
    logger.info(f"Device: {settings.PARSER_DEVICE}")

    try:
        # TODO: Implement actual model loading
        # Example:
        # from transformers import AutoModelForVision2Seq, AutoProcessor
        #
        # processor = AutoProcessor.from_pretrained(settings.PARSER_MODEL_NAME)
        # model = AutoModelForVision2Seq.from_pretrained(
        #     settings.PARSER_MODEL_NAME,
        #     device_map=settings.PARSER_DEVICE
        # )
        #
        # _model = {
        #     "model": model,
        #     "processor": processor
        # }

        # For now, return None (will use dummy parser)
        logger.warning("Model loading not yet implemented, using dummy parser")
        _model = None

    except Exception as e:
        logger.error(f"Failed to load model: {e}", exc_info=True)
        raise

    return _model


def get_model() -> Optional[object]:
    """Get loaded model instance"""
    if _model is None:
        return load_model()
    return _model


def unload_model():
    """Unload model from memory"""
    global _model
    if _model is not None:
        # TODO: Proper cleanup
        _model = None
        logger.info("Model unloaded")
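The module-level cache above is fine for a single-threaded worker; if the service ever loads the model from multiple threads, a lock avoids loading it twice. A double-checked-locking sketch (stub load function, illustrative):

```python
import threading

_model = None
_lock = threading.Lock()


def load_expensive_model() -> dict:
    """Placeholder for the real dots.ocr load (assumption: returns a dict)."""
    return {"model": "dots.ocr-stub"}


def get_model() -> dict:
    """Lazily load the model once, even under concurrent callers."""
    global _model
    if _model is None:
        with _lock:
            if _model is None:  # re-check: another thread may have loaded it
                _model = load_expensive_model()
    return _model


print(get_model()["model"])  # dots.ocr-stub
```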
108 services/parser-service/app/schemas.py Normal file
@@ -0,0 +1,108 @@
"""
Pydantic schemas for PARSER Service
"""

from typing import Optional, List, Dict, Any, Literal
from pydantic import BaseModel, Field
from datetime import datetime


class BBox(BaseModel):
    """Bounding box coordinates"""
    x: float = Field(..., description="X coordinate")
    y: float = Field(..., description="Y coordinate")
    width: float = Field(..., description="Width")
    height: float = Field(..., description="Height")


class TableCell(BaseModel):
    """Table cell data"""
    row: int
    col: int
    text: str
    rowspan: Optional[int] = 1
    colspan: Optional[int] = 1


class TableData(BaseModel):
    """Structured table data"""
    rows: List[List[str]] = Field(..., description="Table rows")
    columns: List[str] = Field(..., description="Column headers")
    merged_cells: Optional[List[Dict[str, Any]]] = Field(None, description="Merged cells info")


class ParsedBlock(BaseModel):
    """Parsed document block"""
    type: Literal["paragraph", "heading", "table", "formula", "figure_caption", "list"] = Field(
        ..., description="Block type"
    )
    text: str = Field(..., description="Block text content")
    bbox: BBox = Field(..., description="Bounding box")
    reading_order: int = Field(..., description="Reading order index")
    page_num: int = Field(..., description="Page number")
    table_data: Optional[TableData] = Field(None, description="Table data (if type=table)")
    metadata: Optional[Dict[str, Any]] = Field(None, description="Additional metadata")


class ParsedPage(BaseModel):
    """Parsed document page"""
    page_num: int = Field(..., description="Page number (1-indexed)")
    blocks: List[ParsedBlock] = Field(..., description="Page blocks")
    width: float = Field(..., description="Page width in pixels")
    height: float = Field(..., description="Page height in pixels")


class ParsedChunk(BaseModel):
    """Semantic chunk for RAG"""
    text: str = Field(..., description="Chunk text")
    page: int = Field(..., description="Page number")
    bbox: Optional[BBox] = Field(None, description="Bounding box")
    section: Optional[str] = Field(None, description="Section name")
    metadata: Dict[str, Any] = Field(default_factory=dict, description="Additional metadata")


class QAPair(BaseModel):
    """Question-Answer pair"""
    question: str = Field(..., description="Question")
    answer: str = Field(..., description="Answer")
    source_page: int = Field(..., description="Source page number")
    source_bbox: Optional[BBox] = Field(None, description="Source bounding box")
    confidence: Optional[float] = Field(None, description="Confidence score")


class ParsedDocument(BaseModel):
    """Complete parsed document"""
    doc_id: Optional[str] = Field(None, description="Document ID")
    doc_url: Optional[str] = Field(None, description="Document URL")
    doc_type: Literal["pdf", "image"] = Field(..., description="Document type")
    pages: List[ParsedPage] = Field(..., description="Parsed pages")
    metadata: Dict[str, Any] = Field(default_factory=dict, description="Document metadata")
    created_at: datetime = Field(default_factory=datetime.utcnow, description="Creation timestamp")


class ParseRequest(BaseModel):
    """Parse request"""
    doc_url: Optional[str] = Field(None, description="Document URL")
    output_mode: Literal["raw_json", "markdown", "qa_pairs", "chunks"] = Field(
        "raw_json", description="Output mode"
    )
    dao_id: Optional[str] = Field(None, description="DAO ID")
    doc_id: Optional[str] = Field(None, description="Document ID")


class ParseResponse(BaseModel):
    """Parse response"""
    document: Optional[ParsedDocument] = Field(None, description="Parsed document (raw_json mode)")
    markdown: Optional[str] = Field(None, description="Markdown content (markdown mode)")
    qa_pairs: Optional[List[QAPair]] = Field(None, description="QA pairs (qa_pairs mode)")
    chunks: Optional[List[ParsedChunk]] = Field(None, description="Chunks (chunks mode)")
    metadata: Dict[str, Any] = Field(default_factory=dict, description="Additional metadata")


class ChunksResponse(BaseModel):
    """Chunks response for RAG"""
    chunks: List[ParsedChunk] = Field(..., description="Document chunks")
    total_chunks: int = Field(..., description="Total number of chunks")
    doc_id: str = Field(..., description="Document ID")
    dao_id: str = Field(..., description="DAO ID")
22 services/parser-service/requirements.txt Normal file
@@ -0,0 +1,22 @@
# FastAPI and server
fastapi==0.104.1
uvicorn[standard]==0.24.0
python-multipart==0.0.6
pydantic==2.5.0
pydantic-settings==2.1.0

# Model and ML
torch>=2.0.0
transformers>=4.35.0
Pillow>=10.0.0

# PDF processing
pdf2image>=1.16.3
PyMuPDF>=1.23.0  # Alternative PDF library

# Image processing
opencv-python>=4.8.0  # Optional, for advanced image processing

# Utilities
python-dotenv>=1.0.1