Task: RAG ingestion worker (events → Milvus + Neo4j)

Goal

Design and scaffold a RAG ingestion worker that:

  • Consumes domain events (messages, docs, files, RWA updates) from the existing event stream.
  • Transforms them into normalized chunks/documents.
  • Indexes them into Milvus (vector store) and Neo4j (graph store).
  • Works idempotently and supports reindex(team_id).

This worker complements the rag-gateway service (see docs/cursor/rag_gateway_task.md) by keeping its underlying stores up-to-date.

IMPORTANT: This task is about architecture, data flow and scaffolding. Concrete model choices and full schemas can be refined later.


Context

  • Project root: microdao-daarion/.
  • Planned/implemented RAG layer: see docs/cursor/rag_gateway_task.md.
  • Existing docs:
    • docs/cursor/42_nats_event_streams_and_event_catalog.md: event stream & catalog.
    • docs/cursor/34_internal_services_architecture.md: internal services & topology.

We assume there is (or will be):

  • An event bus (likely NATS) with domain events such as:
    • message.created
    • doc.upsert
    • file.uploaded
    • rwa.energy.update, rwa.food.update, etc.
  • A Milvus instance or cluster.
  • A Neo4j instance.

The ingestion worker must not be called directly by agents. It is a back-office service that feeds RAG stores for the rag-gateway.


High-level design

1. Service placement & structure

Create a new service (or extend RAG-gateway repo structure) under, for example:

  • services/rag-ingest-worker/

Suggested files:

  • main.py — entrypoint (CLI or long-running process).
  • config.py — environment/config loader (event bus URL, Milvus/Neo4j URLs, batch sizes, etc.).
  • events/consumer.py — NATS (or other) consumer logic.
  • pipeline/normalization.py — turn events into normalized documents/chunks.
  • pipeline/embedding.py — embedding model client/wrapper.
  • pipeline/index_milvus.py — Milvus upsert logic.
  • pipeline/index_neo4j.py — Neo4j graph updates.
  • api.py — optional HTTP API for:
    • POST /ingest/one: ingest a single payload for debugging.
    • POST /ingest/reindex/{team_id}: trigger a reindex job.
    • GET /health: health check.
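
As a rough illustration of what config.py could look like, the sketch below reads settings from environment variables. The variable names EVENT_BUS_URL, MILVUS_URL, NEO4J_URL and EMBEDDING_SERVICE_URL are the ones suggested later in this document; INGEST_BATCH_SIZE, the WorkerConfig class and all default values are assumptions made for the example.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class WorkerConfig:
    event_bus_url: str
    milvus_url: str
    neo4j_url: str
    embedding_service_url: str
    batch_size: int


def load_config(env: Optional[Mapping[str, str]] = None) -> WorkerConfig:
    """Build the worker config from environment variables.

    Every default below is a placeholder (assumption), not a project setting.
    """
    env = os.environ if env is None else env
    batch_size = int(env.get("INGEST_BATCH_SIZE", "32"))  # hypothetical variable
    if batch_size <= 0:
        raise ValueError("INGEST_BATCH_SIZE must be a positive integer")
    return WorkerConfig(
        event_bus_url=env.get("EVENT_BUS_URL", "nats://localhost:4222"),
        milvus_url=env.get("MILVUS_URL", "http://localhost:19530"),
        neo4j_url=env.get("NEO4J_URL", "bolt://localhost:7687"),
        embedding_service_url=env.get("EMBEDDING_SERVICE_URL", "http://localhost:8000"),
        batch_size=batch_size,
    )
```

Passing an explicit mapping instead of relying on os.environ keeps the loader easy to test; main.py would simply call load_config() with no arguments.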

2. Event sources

The worker should subscribe to a small set of core event types (names to be aligned with the actual Event Catalog):

  • message.created — messages in chats/channels (Telegram, internal UI, etc.).
  • doc.upsert — wiki/docs/specs updates.
  • file.uploaded — files (PDF, images) that have parsed text.
  • rwa.* — events related to energy/food/water assets (optional, for later).
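
If the bus turns out to be NATS, subscription wildcards deserve a second look: in NATS subject matching, `*` stands for exactly one dot-separated token and `>` for one or more trailing tokens. Under those rules a subscription to rwa.* would not receive a three-token subject such as rwa.energy.update, whereas rwa.> or rwa.*.update would. The helper below is a small, self-written model of those rules for tests and routing tables; subject_matches is a hypothetical name, not part of any NATS client.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if `subject` matches a NATS-style subscription `pattern`.

    `*` matches exactly one dot-separated token; `>` matches one or more
    trailing tokens and is only valid as the last token of the pattern.
    """
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # Must be the final pattern token and cover at least one subject token.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)


# The shorthand "rwa.*" from the list above versus the two safer alternatives.
for pattern in ("rwa.*", "rwa.>", "rwa.*.update"):
    print(pattern, subject_matches(pattern, "rwa.energy.update"))
```

The final subject names still have to be aligned with the Event Catalog, so treat the patterns here as examples only.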

Implementation details:

  • Use NATS (or another broker) subscription patterns from docs/cursor/42_nats_event_streams_and_event_catalog.md.
  • Each event should carry at least:
    • event_type
    • team_id / dao_id
    • user_id
    • channel_id / project_id (if applicable)
    • payload with text/content and metadata.
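
A minimal sketch of how the consumer might validate that envelope before handing it to the pipeline is shown below. It assumes events arrive as JSON objects; the EventEnvelope class and parse_event function are hypothetical names, and the real field set should follow the Event Catalog.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class EventEnvelope:
    event_type: str
    team_id: str
    user_id: Optional[str]
    channel_id: Optional[str]
    project_id: Optional[str]
    payload: Dict[str, Any] = field(default_factory=dict)


def parse_event(raw: bytes) -> EventEnvelope:
    """Decode a JSON event and check the fields the worker relies on."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("event must be a JSON object")
    event_type = data.get("event_type")
    # Accept either team_id or dao_id, since the catalog may use either name.
    team_id = data.get("team_id") or data.get("dao_id")
    if not event_type or not team_id:
        raise ValueError("event_type and team_id/dao_id are required")
    payload = data.get("payload") or {}
    if not isinstance(payload, dict):
        raise ValueError("payload must be a JSON object")
    return EventEnvelope(
        event_type=str(event_type),
        team_id=str(team_id),
        user_id=data.get("user_id"),
        channel_id=data.get("channel_id"),
        project_id=data.get("project_id"),
        payload=payload,
    )
```

Rejecting malformed events at this boundary means the normalization code can assume event_type and team_id are always present.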

Normalized document/chunk model

Define a common internal model for what is sent to Milvus/Neo4j, e.g. IngestChunk:

Fields (minimum):

  • chunk_id — deterministic ID (e.g. hash of (team_id, source_type, source_id, chunk_index)).
  • team_id / dao_id.
  • project_id (optional).
  • channel_id (optional).
  • agent_id (who generated it, if any).
  • source_type: "message" | "doc" | "file" | "wiki" | "rwa" | ....
  • source_id — e.g. message ID, doc ID, file ID.
  • text — the chunk content.
  • tags — list of tags (topic, domain, etc.).
  • visibility: "public" | "confidential".
  • created_at — timestamp.
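
The field list above could be expressed as a dataclass along the following lines. This is only a sketch: the use of SHA-256, the separator character and the default visibility are assumptions, and the final schema should match whatever rag-gateway expects.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


def make_chunk_id(team_id: str, source_type: str, source_id: str, chunk_index: int) -> str:
    """Derive a stable ID so replays of the same event yield the same key."""
    # A unit separator avoids collisions such as ("a", "bc") vs ("ab", "c").
    key = "\x1f".join([team_id, source_type, source_id, str(chunk_index)])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


@dataclass
class IngestChunk:
    chunk_id: str
    team_id: str
    source_type: str  # "message" | "doc" | "file" | "wiki" | "rwa" | ...
    source_id: str
    text: str
    created_at: datetime
    project_id: Optional[str] = None
    channel_id: Optional[str] = None
    agent_id: Optional[str] = None
    tags: List[str] = field(default_factory=list)
    visibility: str = "confidential"  # assumed default: restrict unless stated otherwise
```

Because chunk_id depends only on (team_id, source_type, source_id, chunk_index), reprocessing the same source yields the same IDs, which is what makes keyed upserts idempotent.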

Responsibilities:

  • pipeline/normalization.py:
    • For each event type, map event payload → one or more IngestChunk objects.
    • Handle splitting of long texts into smaller chunks if needed.
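
One simple way to split long texts is a fixed-size character window with a small overlap, sketched below. The function name split_text and both size defaults are assumptions; a token-based or sentence-aware splitter may be a better fit once the embedding model is chosen.

```python
from typing import List


def split_text(text: str, max_chars: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into windows of at most `max_chars` characters.

    Consecutive windows share `overlap` characters so that content cut at a
    boundary also appears at the start of the next window.
    """
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("require max_chars > 0 and 0 <= overlap < max_chars")
    text = text.strip()
    if not text:
        return []
    step = max_chars - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the current window already reaches the end of the text
    return chunks
```

The position of each piece in the returned list can serve as the chunk_index used when deriving chunk_id.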

Embedding & Milvus indexing

1. Embedding

  • Create an embedding component (pipeline/embedding.py) that:

    • Accepts IngestChunk objects.
    • Supports batch processing.
    • Uses either:
      • Existing LLM proxy/embedding service (preferred), or
      • Direct model (e.g. local bge-m3, gte-large, etc.).
  • After embedding, each chunk should carry its vector plus the metadata defined by the schema in docs/cursor/rag_gateway_task.md.
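
Batch processing can be kept independent of the backend by injecting the embedding call. In the sketch below, embed_fn is a placeholder for whichever backend is chosen (the LLM proxy/embedding service or a local model); the names embed_texts and batched, and the default batch size, are assumptions.

```python
from typing import Callable, Iterator, List, Sequence

Vector = List[float]
EmbedFn = Callable[[List[str]], List[Vector]]


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("batch size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def embed_texts(texts: Sequence[str], embed_fn: EmbedFn, batch_size: int = 32) -> List[Vector]:
    """Embed texts in batches, preserving input order."""
    vectors: List[Vector] = []
    for batch in batched(texts, batch_size):
        result = embed_fn(batch)
        # Guard against a backend that silently drops or adds items.
        if len(result) != len(batch):
            raise RuntimeError("embedding backend returned a mismatched batch")
        vectors.extend(result)
    return vectors
```

A real embed_chunks would call embed_texts on the text of each IngestChunk and pair the resulting vectors with the chunk metadata.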

2. Milvus indexing

  • pipeline/index_milvus.py should:

    • Upsert chunks into Milvus.
    • Ensure idempotency using chunk_id as primary key.
    • Store metadata:
      • team_id, project_id, channel_id, agent_id,
      • source_type, source_id,
      • visibility, tags, created_at,
      • embed_model version.
  • Consider using one Milvus collection with a partition key (team_id), or per-DAO collections — but keep code flexible.
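
The upsert step can be sketched against a small interface rather than a concrete client. Recent pymilvus releases expose MilvusClient.upsert(collection_name=..., data=...), which takes a list of row dicts; check this against the installed version. The field name "vector", the dict-shaped chunks and the helper names below are assumptions for illustration.

```python
from typing import Any, Dict, List, Protocol, Sequence


class SupportsUpsert(Protocol):
    """Anything with a pymilvus-style upsert(collection_name, data) method."""

    def upsert(self, collection_name: str, data: List[Dict[str, Any]]) -> Any: ...


METADATA_FIELDS = (
    "team_id", "project_id", "channel_id", "agent_id",
    "source_type", "source_id", "visibility", "tags", "created_at",
)


def to_row(chunk: Dict[str, Any], vector: Sequence[float], embed_model: str) -> Dict[str, Any]:
    """Flatten one chunk plus its vector into a row keyed by chunk_id."""
    row: Dict[str, Any] = {"chunk_id": chunk["chunk_id"], "vector": list(vector)}
    for name in METADATA_FIELDS:
        # Missing values become None; the real schema may need defaults instead.
        row[name] = chunk.get(name)
    row["embed_model"] = embed_model
    return row


def upsert_chunks(
    client: SupportsUpsert,
    collection: str,
    chunks: Sequence[Dict[str, Any]],
    vectors: Sequence[Sequence[float]],
    embed_model: str,
) -> int:
    """Upsert rows by primary key (chunk_id) and return how many were sent."""
    if len(chunks) != len(vectors):
        raise ValueError("chunks and vectors must have the same length")
    rows = [to_row(c, v, embed_model) for c, v in zip(chunks, vectors)]
    if rows:
        client.upsert(collection_name=collection, data=rows)
    return len(rows)
```

Typing the client as a Protocol keeps the choice between a single partitioned collection and per-DAO collections outside this function: the caller decides which collection name to pass.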


Neo4j graph updates

pipeline/index_neo4j.py should:

  • For events that carry structural information (e.g. project uses resource, doc mentions topic):

    • Create or update nodes: User, MicroDAO, Project, Channel, Topic, Resource, File, RWAObject, Doc.
    • Create relationships such as:
      • (:User)-[:MEMBER_OF]->(:MicroDAO)
      • (:Agent)-[:SERVES]->(:MicroDAO|:Project)
      • (:Doc)-[:MENTIONS]->(:Topic)
      • (:Project)-[:USES]->(:Resource)
  • All nodes/edges must include:

    • team_id / dao_id
    • visibility when it matters
  • Operations should be upserts (MERGE) to avoid duplicates.
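
The MERGE-based upserts could be generated by a small query builder such as the one below. It is a sketch under assumptions: nodes are identified by an `id` property together with team_id, and the function name and allow-lists are hypothetical. Labels and relationship types cannot be supplied as Cypher parameters, so they are validated before being interpolated, while all values go through parameters.

```python
from typing import Any, Dict, Optional, Tuple

ALLOWED_LABELS = {
    "User", "Agent", "MicroDAO", "Project", "Channel", "Topic",
    "Resource", "File", "RWAObject", "Doc",
}
ALLOWED_RELS = {"MEMBER_OF", "SERVES", "MENTIONS", "USES"}


def merge_relationship_query(
    src_label: str,
    src_id: str,
    rel_type: str,
    dst_label: str,
    dst_id: str,
    team_id: str,
    visibility: Optional[str] = None,
) -> Tuple[str, Dict[str, Any]]:
    """Return a Cypher query and parameters that upsert two nodes and an edge."""
    if src_label not in ALLOWED_LABELS or dst_label not in ALLOWED_LABELS:
        raise ValueError("unknown node label")
    if rel_type not in ALLOWED_RELS:
        raise ValueError("unknown relationship type")
    query = (
        f"MERGE (a:{src_label} {{id: $src_id, team_id: $team_id}})\n"
        f"MERGE (b:{dst_label} {{id: $dst_id, team_id: $team_id}})\n"
        f"MERGE (a)-[r:{rel_type}]->(b)\n"
        "SET r.team_id = $team_id, r.visibility = $visibility"
    )
    params = {
        "src_id": src_id,
        "dst_id": dst_id,
        "team_id": team_id,
        "visibility": visibility,
    }
    return query, params
```

With the official neo4j Python driver, the result would typically be executed as session.run(query, params). Running the same call twice matches the existing nodes and relationship instead of creating duplicates.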


Idempotency & reindex

1. Idempotent semantics

  • Use deterministic chunk_id for Milvus records.
  • Use Neo4j MERGE for nodes/edges based on natural keys (e.g. (team_id, source_type, source_id, chunk_index)).
  • Replaying the same events should not corrupt or duplicate data.

2. Reindex API

  • Provide a simple HTTP or CLI interface to:

    • POST /ingest/reindex/{team_id} — schedule or start reindex for a team/DAO.
  • Reindex strategy:

    • Read documents/messages from source-of-truth (DB or event replay).
    • Rebuild chunks and embeddings.
    • Upsert into Milvus & Neo4j (idempotently).

Implementation details (can be left as TODOs if missing backends):

  • If there is no easy historic source yet, stub the reindex endpoint with clear TODO and logging.
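
A stubbed reindex path might look like the sketch below. The load_events and handle_event hooks are hypothetical injection points for the source-of-truth reader and the normal ingestion pipeline; when either is missing, the function only logs and reports that the job is not implemented.

```python
import logging
from typing import Any, Callable, Dict, Iterable, Optional

logger = logging.getLogger("rag_ingest.reindex")


def reindex_team(
    team_id: str,
    load_events: Optional[Callable[[str], Iterable[Dict[str, Any]]]] = None,
    handle_event: Optional[Callable[[Dict[str, Any]], None]] = None,
) -> Dict[str, Any]:
    """Replay all known events for a team through the ingestion pipeline."""
    if not team_id:
        raise ValueError("team_id is required")
    if load_events is None or handle_event is None:
        # TODO: wire this to the source of truth (DB export or event replay).
        logger.warning("reindex requested for team_id=%s but no historic source is configured", team_id)
        return {"team_id": team_id, "status": "not_implemented", "events": 0}
    count = 0
    for event in load_events(team_id):
        # Reuses the regular pipeline, so the upserts stay idempotent.
        handle_event(event)
        count += 1
    logger.info("reindex finished for team_id=%s events=%d", team_id, count)
    return {"team_id": team_id, "status": "completed", "events": count}
```

Both the CLI command and the POST /ingest/reindex/{team_id} handler could delegate to this one function, so the stub and the eventual implementation share a single entry point.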

Monitoring & logging

Add basic observability:

  • Structured logs for:
    • Each event type ingested.
    • Number of chunks produced.
    • Latency for embedding and indexing.
  • (Optional) Metrics counters/gauges:
    • ingest_events_total
    • ingest_chunks_total
    • ingest_errors_total
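
The logging and counter items above can be prototyped with the standard library alone. In this sketch the in-process Counter stands in for a real metrics client, and timed_stage and record_ingest are hypothetical helper names.

```python
import json
import logging
import time
from collections import Counter
from contextlib import contextmanager
from typing import Any, Iterator

logger = logging.getLogger("rag_ingest")
METRICS: Counter = Counter()  # in-process stand-in for a real metrics client


@contextmanager
def timed_stage(stage: str, event_type: str, **fields: Any) -> Iterator[None]:
    """Log one JSON line with the latency of a pipeline stage."""
    start = time.perf_counter()
    record = {"stage": stage, "event_type": event_type, **fields}
    try:
        yield
    except Exception:
        METRICS["ingest_errors_total"] += 1
        logger.exception(json.dumps({**record, "status": "error"}))
        raise
    else:
        latency_ms = round((time.perf_counter() - start) * 1000, 2)
        logger.info(json.dumps({**record, "status": "ok", "latency_ms": latency_ms}))


def record_ingest(event_type: str, chunk_count: int) -> None:
    """Update counters and log how many chunks an event produced."""
    METRICS["ingest_events_total"] += 1
    METRICS["ingest_chunks_total"] += chunk_count
    logger.info(json.dumps({"event_type": event_type, "chunks": chunk_count}))
```

A handler would wrap each step, for example `with timed_stage("embedding", event.event_type, team_id=event.team_id): ...`, and call record_ingest once per processed event.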

Files to create/modify (suggested)

Adjust exact paths if needed.

  • services/rag-ingest-worker/main.py

    • Parse config, connect to event bus, start consumers.
  • services/rag-ingest-worker/config.py

    • Environment variables: EVENT_BUS_URL, MILVUS_URL, NEO4J_URL, EMBEDDING_SERVICE_URL, etc.
  • services/rag-ingest-worker/events/consumer.py

    • NATS (or chosen bus) subscription logic.
  • services/rag-ingest-worker/pipeline/normalization.py

    • Functions normalize_message_created(event), normalize_doc_upsert(event), normalize_file_uploaded(event).
  • services/rag-ingest-worker/pipeline/embedding.py

    • embed_chunks(chunks: List[IngestChunk]) -> List[VectorChunk].
  • services/rag-ingest-worker/pipeline/index_milvus.py

    • upsert_chunks_to_milvus(chunks: List[VectorChunk]).
  • services/rag-ingest-worker/pipeline/index_neo4j.py

    • update_graph_for_event(event, chunks: List[IngestChunk]).
  • Optional: services/rag-ingest-worker/api.py

    • FastAPI app with:
      • GET /health
      • POST /ingest/one
      • POST /ingest/reindex/{team_id}
  • Integration docs:

    • Reference docs/cursor/rag_gateway_task.md and docs/cursor/42_nats_event_streams_and_event_catalog.md where appropriate.

Acceptance criteria

  1. A new rag-ingest-worker (or similarly named) module/service exists under services/ with:

    • Clear directory structure (events/, pipeline/, config.py, main.py).
    • Stubs or initial implementations for consuming events and indexing to Milvus/Neo4j.
  2. A normalized internal model (IngestChunk or equivalent) is defined and used across pipelines.

  3. Milvus indexing code:

    • Uses idempotent upserts keyed by chunk_id.
    • Stores metadata compatible with the RAG-gateway schema.
  4. Neo4j update code:

    • Uses MERGE for nodes/relationships.
    • Encodes team_id/dao_id and visibility where relevant.
  5. Idempotency strategy and reindex(team_id) path are present in code (even if reindex is initially a stub with TODO).

  6. Basic logging is present for ingestion operations.

  7. This file (docs/cursor/rag_ingestion_worker_task.md) can be executed by Cursor as:

    cursor task < docs/cursor/rag_ingestion_worker_task.md
    

    and Cursor will use it as the single source of truth for implementing/refining the ingestion worker.