# ADR: Service Boundaries & Contract Ownership

**Status:** Accepted

**Date:** 2026-01-19

**Authors:** DAARION Platform Team

---

## 1. Service Boundaries

### Gateway (BFF) :9300

**Does:** Telegram webhooks, Auth, Rate limiting, Request normalization, Trace ID generation

**Does NOT:** LLM calls, Direct DB access, Business logic

### Router :9102

**Does:** Agent routing, Tool orchestration, Policy enforcement, LLM provider selection

**Does NOT:** Session storage, Direct DB access (tech debt: graph_query), File processing

### Control Plane :9200

**Does:** Versioned prompts, Policy/RBAC, Config/flags, Quotas

**Does NOT:** Request processing, Data storage

### Memory API :8000

**Does:** Vector search, Graph queries, Fact storage, ACL enforcement, Audit

**Does NOT:** LLM calls, File processing

### Swapper :8890

**Does:** Vision, STT, TTS, Image generation (lazy load)

**Does NOT:** Data storage, Policy decisions

### CrewAI Worker :9011

**Does:** Async workflows, Multi-agent, NATS consumption

**Does NOT:** Sync requests, Direct LLM calls

---

## 2. NATS Subject Taxonomy

### Naming: `{domain}.{action}.{entity}[.{subtype}]`

| Subject | Publisher | Consumer | Idempotency Key |
|---------|-----------|----------|-----------------|
| message.received.{agent_id} | Gateway | Router | request_id |
| message.processed.{agent_id} | Router | Gateway | request_id |
| attachment.created.{type} | Ingest | Parser | file_id |
| attachment.parsed.{type} | Parser | Memory | file_id |
| agent.run.requested | Router | Worker | job_id |
| agent.run.completed | Worker | Gateway | job_id |
| audit.action.{service} | All | Audit | event_id |

### DLQ Policy

| Stream | Max Retries | DLQ | Action |
|--------|-------------|-----|--------|
| ATTACHMENTS | 3 | attachment.failed.dlq | Manual review |
| AGENT_RUNS | 3 | agent.run.failed.dlq | Alert + retry |
| MEMORY | 5 | memory.failed.dlq | Auto-retry 1h |

---

## 3. Contract Ownership

| Contract | Owner | Change Process |
|----------|-------|----------------|
| Gateway->Router | Router team | PR + staging |
| Router->Memory | Memory team | PR + migration |
| Router->Control | Control team | PR + cache TTL |
| NATS events | Platform team | RFC + version |

---

## 4. Versioning

Format: `{service}:{type}:{hash}:{timestamp}`

Example: `helion:prompt:abc123:20260119T120000Z`

### Cache TTL

| Type | TTL | Invalidation |
|------|-----|--------------|
| Prompt | 5 min | NATS event |
| Policy | 1 min | NATS event |
| Config | 30 sec | NATS event |

---

## 5. Privacy Modes

| Mode | Restrictions |
|------|--------------|
| public | None |
| team | No cross-team |
| private | User-only |
| confidential | No logging, no indexing |

---

## 6. Trace Correlation

### HTTP Headers

- X-Trace-ID, X-Request-ID, X-Job-ID
- X-User-ID, X-Agent-ID, X-Mode

### NATS Headers

- Nats-Trace-ID, Nats-Job-ID
- Nats-User-ID, Nats-Agent-ID

---

## 7. Acceptance Checklist

- [ ] Router stateless: restart doesn't lose jobs
- [ ] Idempotency: duplicate job_id = 1 execution
- [ ] DLQ works: 3 fails -> DLQ + alert
- [ ] Policy enforcement: change applies < cache TTL
- [ ] Consumer lag alert: fires on 5min stop

---

## Decision

1. All services MUST use trace middleware
2. NATS messages MUST have idempotency key
3. Control Plane = source of truth
4. Memory API = ONLY data access
5. DLQ processing automated
6. Privacy mode checked in Router
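
The version format from section 4 (`{service}:{type}:{hash}:{timestamp}`) can be parsed and re-serialized with a small helper. The following is a minimal sketch, not the platform's actual implementation: the names `VersionId`, `parse_version_id`, `kind`, and `digest` are hypothetical, and the timestamp layout is inferred from the single example `helion:prompt:abc123:20260119T120000Z`, assuming it is always UTC.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Assumed layout, inferred from the example "20260119T120000Z" (UTC).
TIMESTAMP_FORMAT = "%Y%m%dT%H%M%SZ"


@dataclass(frozen=True)
class VersionId:
    """Hypothetical container for one {service}:{type}:{hash}:{timestamp} value."""

    service: str
    kind: str        # the {type} segment, e.g. "prompt"
    digest: str      # the {hash} segment
    timestamp: datetime

    def __str__(self) -> str:
        stamp = self.timestamp.strftime(TIMESTAMP_FORMAT)
        return ":".join([self.service, self.kind, self.digest, stamp])


def parse_version_id(text: str) -> VersionId:
    """Split a version string into its four segments, rejecting malformed input."""
    parts = text.split(":")
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"expected 4 non-empty segments, got {text!r}")
    service, kind, digest, stamp = parts
    # strptime raises ValueError itself if the timestamp does not match.
    parsed = datetime.strptime(stamp, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)
    return VersionId(service, kind, digest, parsed)


if __name__ == "__main__":
    version = parse_version_id("helion:prompt:abc123:20260119T120000Z")
    print(version.service, version.kind, version.timestamp.isoformat())
    print(str(version))
```

Because the segments are colon-separated, this sketch assumes that none of them contains a colon.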
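
The naming pattern from section 2 (`{domain}.{action}.{entity}[.{subtype}]`) lends itself to a single builder function, so that publishers do not assemble subjects by hand. This is a sketch under stated assumptions: `build_subject` is a hypothetical helper, and the allowed token characters (letters, digits, `_`, `-`) are a deliberately strict choice made here, not a rule from this ADR. NATS itself reserves `.`, `*`, and `>` and disallows whitespace in tokens.

```python
import re
from typing import Optional

# Assumed token alphabet; stricter than what NATS itself permits.
_TOKEN = re.compile(r"^[A-Za-z0-9_-]+$")


def build_subject(domain: str, action: str, entity: str,
                  subtype: Optional[str] = None) -> str:
    """Join tokens into {domain}.{action}.{entity}[.{subtype}]."""
    tokens = [domain, action, entity]
    if subtype is not None:
        tokens.append(subtype)
    for token in tokens:
        # Reject empty tokens, dots, wildcards, and whitespace.
        if not _TOKEN.match(token):
            raise ValueError(f"invalid subject token: {token!r}")
    return ".".join(tokens)


if __name__ == "__main__":
    # "helion" is used as an example agent_id here.
    print(build_subject("message", "received", "helion"))
    print(build_subject("attachment", "failed", "dlq"))
```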
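
Two checklist items from section 7, "duplicate job_id = 1 execution" and "3 fails -> DLQ + alert", can be illustrated together with an in-memory consumer. This is only a sketch of the intended behaviour: `IdempotentConsumer` and its fields are hypothetical, state is held in process memory rather than in NATS or a shared store, and "Max Retries" is interpreted as the number of failed deliveries before dead-lettering, following the checklist wording.

```python
from typing import Any, Callable, Dict, List, Set, Tuple


class IdempotentConsumer:
    """Hypothetical consumer: deduplicates by key and dead-letters after N failures."""

    def __init__(self, handler: Callable[[Any], None], max_failures: int = 3) -> None:
        self.handler = handler
        self.max_failures = max_failures
        self.finished: Set[str] = set()           # keys that must not run again
        self.failures: Dict[str, int] = {}        # failed delivery count per key
        self.dead_letters: List[Tuple[str, Any]] = []

    def deliver(self, key: str, payload: Any) -> str:
        """Process one delivery and report what happened to it."""
        if key in self.finished:
            return "duplicate"                    # already processed or dead-lettered
        try:
            self.handler(payload)
        except Exception:
            count = self.failures.get(key, 0) + 1
            self.failures[key] = count
            if count >= self.max_failures:
                # A real system would publish to the DLQ subject and raise an alert.
                self.dead_letters.append((key, payload))
                self.finished.add(key)
                return "dead-lettered"
            return "retry"
        self.finished.add(key)
        return "processed"


if __name__ == "__main__":
    executed: List[str] = []
    consumer = IdempotentConsumer(executed.append)
    print(consumer.deliver("job-1", "run agent"))   # processed
    print(consumer.deliver("job-1", "run agent"))   # duplicate
    print(len(executed))                            # 1
```

A production variant would need the `finished` set to survive restarts, for example in a shared store, since the checklist also requires that a Router restart does not lose jobs.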