feat: enhance model output parser and add integration guide
Model Output Parser:
- Support multiple dots.ocr output formats (JSON, structured text, plain text)
- Normalize all formats to the standard ParsedBlock structure
- Handle JSON with blocks/pages arrays
- Parse markdown-like structured text
- Fall back to plain-text parsing
- Better error handling and logging

Schemas:
- Document must-have fields for RAG (doc_id, pages, metadata.dao_id)
- ParsedChunk must-have fields (text, metadata.dao_id, metadata.doc_id)
- Add detailed field descriptions for RAG integration

Integration Guide:
- Create INTEGRATION.md with a complete integration guide
- Document dots.ocr output formats
- Show ParsedDocument → Haystack Documents conversion
- Provide DAGI Router integration examples
- RAG pipeline integration with filters
- Complete workflow examples
- RBAC integration recommendations
services/parser-service/INTEGRATION.md (new file, 415 lines)

@@ -0,0 +1,415 @@
# PARSER Service - Integration Guide

Integration of the PARSER service with the DAGI Router and the RAG pipeline.

## dots.ocr output format → ParsedBlock

### Expected dots.ocr output formats

The PARSER service supports several output formats from the dots.ocr model:

#### 1. JSON with structured blocks (preferred)

```json
{
  "blocks": [
    {
      "type": "heading",
      "text": "Document Title",
      "bbox": [0, 0, 800, 50],
      "reading_order": 1
    },
    {
      "type": "paragraph",
      "text": "Document content...",
      "bbox": [0, 60, 800, 100],
      "reading_order": 2
    },
    {
      "type": "table",
      "text": "Table content",
      "bbox": [0, 200, 800, 300],
      "reading_order": 3,
      "table_data": {
        "rows": [["Header 1", "Header 2"], ["Value 1", "Value 2"]],
        "columns": ["Header 1", "Header 2"]
      }
    }
  ]
}
```

#### 2. JSON with pages

```json
{
  "pages": [
    {
      "page_num": 1,
      "blocks": [...]
    }
  ]
}
```

#### 3. Plain text / Markdown

```
# Document Title

Document content paragraph...

- List item 1
- List item 2
```

### Normalization to ParsedBlock

`model_output_parser.py` automatically normalizes all formats to the standard `ParsedBlock`:

```python
{
    "type": "paragraph" | "heading" | "table" | "formula" | "figure_caption" | "list",
    "text": "Block text content",
    "bbox": {
        "x": 0.0,
        "y": 0.0,
        "width": 800.0,
        "height": 50.0
    },
    "reading_order": 1,
    "page_num": 1,
    "table_data": {...},  # Optional, for table blocks
    "metadata": {...}  # Optional, additional metadata
}
```
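
The bbox normalization above can be made concrete with a small sketch (`bbox_list_to_dict` is a hypothetical helper shown only for illustration; the real logic lives in `model_output_parser.py`):

```python
def bbox_list_to_dict(bbox: list) -> dict:
    """Normalize a dots.ocr [x, y, width, height] list into the ParsedBlock bbox dict."""
    x, y, width, height = bbox
    return {"x": float(x), "y": float(y), "width": float(width), "height": float(height)}

# Example: the heading bbox from format 1 above
print(bbox_list_to_dict([0, 0, 800, 50]))
# {'x': 0.0, 'y': 0.0, 'width': 800.0, 'height': 50.0}
```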

## Must-have fields for RAG

### ParsedDocument

**Required fields:**
- `doc_id: str` - Unique document identifier (for indexing)
- `pages: List[ParsedPage]` - List of pages with blocks (content)
- `doc_type: Literal["pdf", "image"]` - Document type

**Recommended metadata fields:**
- `metadata.dao_id: str` - DAO ID (for filtering)
- `metadata.user_id: str` - User ID (for access control)
- `metadata.title: str` - Document title (for display)
- `metadata.created_at: datetime` - Upload timestamp (for sorting)

### ParsedChunk

**Required fields:**
- `text: str` - Chunk text (for indexing)
- `metadata.dao_id: str` - DAO ID (for filtering)
- `metadata.doc_id: str` - Document ID (for citation)

**Recommended fields:**
- `page: int` - Page number (for citation)
- `section: str` - Section name (for context)
- `metadata.block_type: str` - Block type (heading, paragraph, etc.)
- `metadata.reading_order: int` - Reading order (for sorting)
- `bbox: BBox` - Coordinates (for highlighting in the PDF)
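
One way to enforce this field contract before indexing is a small validation helper (hypothetical, shown only to make the required fields concrete):

```python
REQUIRED_CHUNK_META = ("dao_id", "doc_id")

def missing_rag_fields(chunk: dict) -> list:
    """Return the must-have fields a chunk is missing (empty list means indexable)."""
    missing = []
    if not chunk.get("text", "").strip():
        missing.append("text")
    meta = chunk.get("metadata", {})
    missing.extend(f"metadata.{key}" for key in REQUIRED_CHUNK_META if not meta.get(key))
    return missing

chunk = {"text": "Tokenomics overview", "metadata": {"dao_id": "daarion"}}
print(missing_rag_fields(chunk))  # ['metadata.doc_id']
```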

## DAGI Router Integration

### 1. Add a provider to router-config.yml

```yaml
providers:
  parser:
    type: ocr
    base_url: "http://parser-service:9400"
    timeout: 120
```

### 2. Add a routing rule

```yaml
routing:
  - id: doc_parse
    when:
      mode: doc_parse
    use_provider: parser
```

### 3. Extend RouterRequest

Add to `router_client.py` or `types/api.ts`:

```python
from typing import Any, Dict, Literal, Optional

from pydantic import BaseModel

class RouterRequest(BaseModel):
    mode: str
    dao_id: str
    user_id: str
    payload: Dict[str, Any]

    # New fields for PARSER
    doc_url: Optional[str] = None
    doc_type: Optional[Literal["pdf", "image"]] = None
    output_mode: Optional[Literal["raw_json", "markdown", "qa_pairs", "chunks"]] = "raw_json"
```

### 4. Handler in the Router

```python
@router.post("/route")
async def route(request: RouterRequest):
    if request.mode == "doc_parse":
        # Call parser-service
        async with httpx.AsyncClient() as client:
            files = {"file": await download_file(request.doc_url)}
            response = await client.post(
                "http://parser-service:9400/ocr/parse",
                files=files,
                data={"output_mode": request.output_mode}
            )
            parsed_doc = response.json()
            return {"data": parsed_doc}
```

## RAG Pipeline Integration

### 1. Converting ParsedDocument → Haystack Documents

```python
from haystack import Document  # Haystack 2.x

def parsed_doc_to_haystack_docs(parsed_doc: ParsedDocument) -> List[Document]:
    """Convert ParsedDocument to Haystack Documents for RAG"""
    docs = []

    for page in parsed_doc.pages:
        for block in page.blocks:
            # Skip empty blocks
            if not block.text or not block.text.strip():
                continue

            # Build metadata (must-have for RAG)
            meta = {
                "dao_id": parsed_doc.metadata.get("dao_id", ""),
                "doc_id": parsed_doc.doc_id,
                "page": page.page_num,
                "block_type": block.type,
                "reading_order": block.reading_order,
                "section": block.type if block.type == "heading" else None
            }

            # Add optional fields
            if block.bbox:
                meta["bbox_x"] = block.bbox.x
                meta["bbox_y"] = block.bbox.y
                meta["bbox_width"] = block.bbox.width
                meta["bbox_height"] = block.bbox.height

            # Create Haystack Document
            doc = Document(
                content=block.text,
                meta=meta
            )
            docs.append(doc)

    return docs
```

### 2. Ingest Pipeline

```python
from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
from haystack.document_stores import PGVectorDocumentStore

def create_ingest_pipeline():
    """Create RAG ingest pipeline"""
    doc_store = PGVectorDocumentStore(
        connection_string="postgresql+psycopg2://...",
        embedding_dim=1024,
        table_name="rag_documents"
    )

    # Whole Documents (not query strings) are embedded here,
    # so the Document embedder is used rather than the Text embedder
    embedder = SentenceTransformersDocumentEmbedder(
        model="BAAI/bge-m3",
        device="cuda"
    )

    writer = DocumentWriter(document_store=doc_store)

    pipeline = Pipeline()
    pipeline.add_component("embedder", embedder)
    pipeline.add_component("writer", writer)
    pipeline.connect("embedder.documents", "writer.documents")

    return pipeline

def ingest_parsed_document(parsed_doc: ParsedDocument):
    """Ingest parsed document into RAG"""
    # Convert to Haystack Documents
    docs = parsed_doc_to_haystack_docs(parsed_doc)

    if not docs:
        logger.warning(f"No documents to ingest for doc_id={parsed_doc.doc_id}")
        return

    # Create pipeline
    pipeline = create_ingest_pipeline()

    # Run ingest
    result = pipeline.run({
        "embedder": {"documents": docs}
    })

    logger.info(f"Ingested {len(docs)} chunks for doc_id={parsed_doc.doc_id}")
```

### 3. Query Pipeline with filters

```python
def answer_query(dao_id: str, question: str, user_id: str):
    """Query RAG with RBAC filters"""
    # Build filters (must-have for data isolation)
    filters = {
        "dao_id": [dao_id]  # Filter by DAO
    }

    # Optional: add role-based filters via RBAC
    # user_roles = get_user_roles(user_id, dao_id)
    # if "admin" not in user_roles:
    #     filters["visibility"] = ["public"]

    # Query pipeline
    pipeline = create_query_pipeline()

    result = pipeline.run({
        "embedder": {"texts": [question]},
        "retriever": {"filters": filters, "top_k": 5},
        "generator": {"prompt": question}
    })

    answer = result["generator"]["replies"][0]
    citations = [
        {
            "doc_id": doc.meta["doc_id"],
            "page": doc.meta["page"],
            "text": doc.content[:200],
            "bbox": {
                "x": doc.meta.get("bbox_x"),
                "y": doc.meta.get("bbox_y")
            }
        }
        for doc in result["retriever"]["documents"]
    ]

    return {
        "answer": answer,
        "citations": citations
    }
```
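
The isolation that the `dao_id` filter provides can be illustrated with an in-memory sketch (`apply_filters` is a simplified, hypothetical stand-in for the retriever's filter matching, where a list value means "any of"):

```python
def apply_filters(docs: list, filters: dict) -> list:
    """Keep only docs whose metadata matches every filter; list values mean 'any of'."""
    return [
        doc for doc in docs
        if all(doc["meta"].get(key) in values for key, values in filters.items())
    ]

docs = [
    {"content": "Tokenomics...", "meta": {"dao_id": "daarion", "doc_id": "tokenomics_v1"}},
    {"content": "Other DAO doc", "meta": {"dao_id": "other_dao", "doc_id": "doc2"}},
]
print(len(apply_filters(docs, {"dao_id": ["daarion"]})))  # 1
```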

## Full workflow example

### 1. Uploading a document

```python
# Gateway receives the file from the user
file_bytes = await get_file_from_telegram(file_id)

# Call PARSER
async with httpx.AsyncClient() as client:
    response = await client.post(
        "http://parser-service:9400/ocr/parse_chunks",
        files={"file": ("doc.pdf", file_bytes)},
        data={
            "dao_id": "daarion",
            "doc_id": "tokenomics_v1",
            "output_mode": "chunks"
        }
    )
    result = response.json()

# Convert to ParsedDocument
parsed_doc = ParsedDocument(**result["document"])

# Add metadata
parsed_doc.metadata.update({
    "dao_id": "daarion",
    "user_id": "user123",
    "title": "Tokenomics v1"
})

# Ingest into RAG
ingest_parsed_document(parsed_doc)
```

### 2. Querying RAG

```python
# The user asks via the bot
question = "Explain the microDAO tokenomics"

# Call RAG through the DAGI Router
router_request = {
    "mode": "rag_query",
    "dao_id": "daarion",
    "user_id": "user123",
    "payload": {
        "question": question
    }
}

response = await send_to_router(router_request)
answer = response["data"]["answer"]
citations = response["data"]["citations"]

# Send the answer to the user with citations
await send_message(f"{answer}\n\nSources: {len(citations)} documents")
```

## Recommendations

### For RAG indexing

1. **Required fields:**
   - `doc_id` - for uniqueness
   - `dao_id` - for filtering
   - `text` - for indexing

2. **Recommended fields:**
   - `page` - for citation
   - `block_type` - for context
   - `section` - for semantic grouping

3. **Optional fields:**
   - `bbox` - for highlighting in the PDF
   - `reading_order` - for sorting

### For the DAGI Router

1. **Required payload fields:**
   - `doc_url` or `file` - for upload
   - `output_mode` - to select the output format

2. **Recommended fields:**
   - `dao_id` - for context
   - `doc_id` - for tracking

### For RBAC integration

1. **Filters in RAG:**
   - `dao_id` - required
   - `visibility` - for private documents
   - `user_id` - for personal documents

2. **Checks in PARSER:**
   - upload permission check
   - file size limit check
   - file type check
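
The PARSER-side checks above can be sketched as a single gate function (the MIME types and size limit below are assumptions for illustration, not the service's actual configuration):

```python
ALLOWED_CONTENT_TYPES = {"application/pdf", "image/png", "image/jpeg"}
MAX_FILE_SIZE = 50 * 1024 * 1024  # assumed 50 MB limit

def check_upload(content_type: str, size_bytes: int, can_upload: bool) -> str:
    """Return 'ok' or the reason the upload must be rejected."""
    if not can_upload:
        return "permission denied"
    if content_type not in ALLOWED_CONTENT_TYPES:
        return f"unsupported file type: {content_type}"
    if size_bytes > MAX_FILE_SIZE:
        return "file too large"
    return "ok"

print(check_upload("application/pdf", 1024, True))  # ok
```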

## Links

- [PARSER Agent Documentation](../docs/agents/parser.md)
- [TODO: RAG Implementation](./TODO-RAG.md)
- [Deployment Guide](./DEPLOYMENT.md)
@@ -1,11 +1,19 @@
"""
Parser for dots.ocr model output
Converts model output to structured blocks

Expected dots.ocr output formats:
1. JSON with structured blocks (preferred)
2. Plain text with layout hints
3. Markdown-like structure

This parser handles all formats and normalizes to ParsedBlock structure.
"""

import logging
import json
import re
from typing import List, Dict, Any, Optional, Tuple
from PIL import Image

logger = logging.getLogger(__name__)
@@ -19,121 +27,311 @@ def parse_model_output_to_blocks(
    """
    Parse dots.ocr model output into structured blocks

    Handles multiple output formats:
    1. JSON with "blocks" array (preferred)
    2. JSON with "pages" array
    3. Plain text with layout hints
    4. Markdown-like structure

    Args:
        model_output: Raw text output from model
        image_size: (width, height) of the image
        page_num: Page number

    Returns:
        List of block dictionaries with normalized structure
    """
    blocks = []

    try:
        # Format 1: Try to parse as JSON (structured output)
        parsed_json = _try_parse_json(model_output)
        if parsed_json:
            blocks = _extract_blocks_from_json(parsed_json, image_size, page_num)
            if blocks:
                logger.debug(f"Parsed {len(blocks)} blocks from JSON output")
                return blocks

        # Format 2: Try to parse as structured text (markdown-like)
        blocks = _parse_structured_text(model_output, image_size, page_num)
        if blocks:
            logger.debug(f"Parsed {len(blocks)} blocks from structured text")
            return blocks

        # Format 3: Fallback - plain text as single paragraph
        blocks = _parse_plain_text(model_output, image_size, page_num)
        logger.debug(f"Parsed {len(blocks)} blocks from plain text")

    except Exception as e:
        logger.error(f"Error parsing model output: {e}", exc_info=True)
        # Fallback: create single block with raw output
        blocks = _create_fallback_block(model_output, image_size, page_num)

    return blocks


def _try_parse_json(text: str) -> Optional[Dict[str, Any]]:
    """Try to parse text as JSON"""
    try:
        # Try to find JSON in text (might be wrapped in markdown code blocks)
        json_match = re.search(r'```(?:json)?\s*(\{.*?\})\s*```', text, re.DOTALL)
        if json_match:
            return json.loads(json_match.group(1))

        # Try direct JSON parse
        return json.loads(text)
    except (json.JSONDecodeError, ValueError):
        return None


def _extract_blocks_from_json(
    data: Dict[str, Any],
    image_size: tuple[int, int],
    page_num: int
) -> List[Dict[str, Any]]:
    """Extract blocks from JSON structure"""
    blocks = []

    # Format: {"blocks": [...]}
    if "blocks" in data and isinstance(data["blocks"], list):
        for idx, block_data in enumerate(data["blocks"], start=1):
            block = _normalize_block(block_data, image_size, idx)
            if block:
                blocks.append(block)

    # Format: {"pages": [{"blocks": [...]}]}
    elif "pages" in data and isinstance(data["pages"], list):
        for page_data in data["pages"]:
            if isinstance(page_data, dict) and "blocks" in page_data:
                for idx, block_data in enumerate(page_data["blocks"], start=1):
                    block = _normalize_block(block_data, image_size, idx)
                    if block:
                        blocks.append(block)

    # Format: Direct array of blocks
    elif isinstance(data, list):
        for idx, block_data in enumerate(data, start=1):
            block = _normalize_block(block_data, image_size, idx)
            if block:
                blocks.append(block)

    return blocks


def _normalize_block(
    block_data: Dict[str, Any],
    image_size: tuple[int, int],
    reading_order: int
) -> Optional[Dict[str, Any]]:
    """Normalize block data to standard format"""
    if not isinstance(block_data, dict):
        return None

    # Extract text
    text = block_data.get("text") or block_data.get("content") or ""
    if not text or not text.strip():
        return None

    # Extract type
    block_type = block_data.get("type") or block_data.get("block_type") or "paragraph"

    # Normalize type
    type_mapping = {
        "heading": "heading",
        "title": "heading",
        "h1": "heading",
        "h2": "heading",
        "h3": "heading",
        "paragraph": "paragraph",
        "p": "paragraph",
        "text": "paragraph",
        "table": "table",
        "formula": "formula",
        "figure": "figure_caption",
        "caption": "figure_caption",
        "list": "list",
        "li": "list"
    }
    block_type = type_mapping.get(block_type.lower(), "paragraph")

    # Extract bbox
    bbox = block_data.get("bbox") or block_data.get("bounding_box") or {}
    if isinstance(bbox, list) and len(bbox) >= 4:
        # Format: [x, y, width, height]
        bbox = {
            "x": float(bbox[0]),
            "y": float(bbox[1]),
            "width": float(bbox[2]),
            "height": float(bbox[3])
        }
    elif isinstance(bbox, dict):
        # Ensure all fields are present
        bbox = {
            "x": float(bbox.get("x", 0)),
            "y": float(bbox.get("y", 0)),
            "width": float(bbox.get("width", image_size[0])),
            "height": float(bbox.get("height", 30))
        }
    else:
        # Default bbox
        bbox = {
            "x": 0,
            "y": reading_order * 30,
            "width": image_size[0],
            "height": 30
        }

    # Build normalized block
    normalized = {
        "type": block_type,
        "text": text.strip(),
        "bbox": bbox,
        "reading_order": block_data.get("reading_order") or reading_order
    }

    # Add table data if present
    if block_type == "table" and "table_data" in block_data:
        normalized["table_data"] = block_data["table_data"]

    # Add metadata if present
    if "metadata" in block_data:
        normalized["metadata"] = block_data["metadata"]

    return normalized


def _parse_structured_text(
    text: str,
    image_size: tuple[int, int],
    page_num: int
) -> List[Dict[str, Any]]:
    """Parse structured text (markdown-like) into blocks"""
    blocks = []
    lines = text.strip().split('\n')

    current_block = None
    reading_order = 1

    for line in lines:
        line = line.strip()
        if not line:
            if current_block:
                blocks.append(current_block)
                current_block = None
            continue

        # Detect heading (markdown style)
        heading_match = re.match(r'^(#{1,6})\s+(.+)$', line)
        if heading_match:
            if current_block:
                blocks.append(current_block)

            level = len(heading_match.group(1))
            heading_text = heading_match.group(2)

            current_block = {
                "type": "heading",
                "text": heading_text,
                "bbox": {
                    "x": 0,
                    "y": reading_order * 30,
                    "width": image_size[0],
                    "height": 30
                },
                "reading_order": reading_order
            }
            reading_order += 1
            continue

        # Detect list item
        if re.match(r'^[-*+]\s+', line) or re.match(r'^\d+\.\s+', line):
            if current_block and current_block["type"] != "list":
                blocks.append(current_block)

            list_text = re.sub(r'^[-*+]\s+', '', line)
            list_text = re.sub(r'^\d+\.\s+', '', list_text)

            current_block = {
                "type": "list",
                "text": list_text,
                "bbox": {
                    "x": 0,
                    "y": reading_order * 30,
                    "width": image_size[0],
                    "height": 30
                },
                "reading_order": reading_order
            }
            reading_order += 1
            continue

        # Regular paragraph
        if current_block and current_block["type"] == "paragraph":
            current_block["text"] += " " + line
        else:
            if current_block:
                blocks.append(current_block)

            current_block = {
                "type": "paragraph",
                "text": line,
                "bbox": {
                    "x": 0,
                    "y": reading_order * 30,
                    "width": image_size[0],
                    "height": 30
                },
                "reading_order": reading_order
            }
            reading_order += 1

    if current_block:
        blocks.append(current_block)

    return blocks


def _parse_plain_text(
    text: str,
    image_size: tuple[int, int],
    page_num: int
) -> List[Dict[str, Any]]:
    """Parse plain text as single paragraph"""
    if not text or not text.strip():
        return []

    return [{
        "type": "paragraph",
        "text": text.strip(),
        "bbox": {
            "x": 0,
            "y": 0,
            "width": image_size[0],
            "height": image_size[1]
        },
        "reading_order": 1
    }]


def _create_fallback_block(
    text: str,
    image_size: tuple[int, int],
    page_num: int
) -> List[Dict[str, Any]]:
    """Create fallback block when parsing fails"""
    return [{
        "type": "paragraph",
        "text": text.strip() if text else f"Page {page_num} (parsing failed)",
        "bbox": {
            "x": 0,
            "y": 0,
            "width": image_size[0],
            "height": image_size[1]
        },
        "reading_order": 1,
        "metadata": {"parsing_error": True}
    }]


def extract_layout_info(model_output: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """
    Extract layout information from model output (if available)
@@ -53,12 +53,28 @@ class ParsedPage(BaseModel):
class ParsedChunk(BaseModel):
    """
    Semantic chunk for RAG

    Must-have fields for RAG indexing:
    - text: Chunk text content (required)
    - metadata.dao_id: DAO identifier (required for filtering)
    - metadata.doc_id: Document identifier (required for citation)

    Recommended fields:
    - page: Page number (for citation)
    - section: Section name (for context)
    - metadata.block_type: Type of block (heading, paragraph, etc.)
    - metadata.reading_order: Reading order (for sorting)
    """
    text: str = Field(..., description="Chunk text (required for RAG)")
    page: int = Field(..., description="Page number (for citation)")
    bbox: Optional[BBox] = Field(None, description="Bounding box (for highlighting)")
    section: Optional[str] = Field(None, description="Section name (for context)")
    metadata: Dict[str, Any] = Field(
        default_factory=dict,
        description="Metadata (must include dao_id, doc_id for RAG)"
    )


class QAPair(BaseModel):

@@ -71,12 +87,28 @@ class QAPair(BaseModel):

class ParsedDocument(BaseModel):
    """
    Complete parsed document

    Must-have fields for RAG integration:
    - doc_id: Unique document identifier (required for RAG indexing)
    - pages: List of parsed pages with blocks (required for content)
    - doc_type: Document type (required for processing)

    Recommended fields for RAG:
    - metadata.dao_id: DAO identifier (for filtering)
    - metadata.user_id: User who uploaded (for access control)
    - metadata.title: Document title (for display)
    - metadata.created_at: Upload timestamp (for sorting)
    """
    doc_id: str = Field(..., description="Document ID (required for RAG)")
    doc_url: Optional[str] = Field(None, description="Document URL")
    doc_type: Literal["pdf", "image"] = Field(..., description="Document type")
    pages: List[ParsedPage] = Field(..., description="Parsed pages (required for RAG)")
    metadata: Dict[str, Any] = Field(
        default_factory=dict,
        description="Document metadata (should include dao_id, user_id for RAG)"
    )
    created_at: datetime = Field(default_factory=datetime.utcnow, description="Creation timestamp")