# ADR: Service Boundaries & Contract Ownership
**Status:** Accepted
**Date:** 2026-01-19
**Authors:** DAARION Platform Team
---
## 1. Service Boundaries
### Gateway (BFF) :9300
**Does:** Telegram webhooks, Auth, Rate limiting, Request normalization, Trace ID generation
**Does NOT:** LLM calls, Direct DB access, Business logic
### Router :9102
**Does:** Agent routing, Tool orchestration, Policy enforcement, LLM provider selection
**Does NOT:** Session storage, Direct DB access (tech debt: graph_query), File processing
### Control Plane :9200
**Does:** Versioned prompts, Policy/RBAC, Config/flags, Quotas
**Does NOT:** Request processing, Data storage
### Memory API :8000
**Does:** Vector search, Graph queries, Fact storage, ACL enforcement, Audit
**Does NOT:** LLM calls, File processing
### Swapper :8890
**Does:** Vision, STT, TTS, Image generation (lazy load)
**Does NOT:** Data storage, Policy decisions
### CrewAI Worker :9011
**Does:** Async workflows, Multi-agent, NATS consumption
**Does NOT:** Sync requests, Direct LLM calls
---
## 2. NATS Subject Taxonomy
### Naming: `{domain}.{action}.{entity}[.{subtype}]`
| Subject | Publisher | Consumer | Idempotency Key |
|---------|-----------|----------|-----------------|
| message.received.{agent_id} | Gateway | Router | request_id |
| message.processed.{agent_id} | Router | Gateway | request_id |
| attachment.created.{type} | Ingest | Parser | file_id |
| attachment.parsed.{type} | Parser | Memory | file_id |
| agent.run.requested | Router | Worker | job_id |
| agent.run.completed | Worker | Gateway | job_id |
| audit.action.{service} | All | Audit | event_id |
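
The naming rule above can be sketched as a small helper that assembles and validates a subject. This is a minimal illustration, not the platform's code: `build_subject` and the allowed-character rule are assumptions. The only hard constraint relied on is that NATS treats `.` as the token separator and `*` and `>` as wildcards, so none of them may appear inside a token.

```python
import re
from typing import Optional

# ASSUMPTION: tokens are lowercase letters, digits, underscores and hyphens.
# The ADR does not define a character set; this one simply excludes the
# NATS separator (".") and wildcards ("*", ">").
_TOKEN = re.compile(r"[a-z0-9_-]+")


def build_subject(domain: str, action: str, entity: str,
                  subtype: Optional[str] = None) -> str:
    """Join tokens into `{domain}.{action}.{entity}[.{subtype}]`."""
    tokens = [domain, action, entity]
    if subtype is not None:
        tokens.append(subtype)
    for token in tokens:
        if not _TOKEN.fullmatch(token):
            raise ValueError(f"invalid subject token: {token!r}")
    return ".".join(tokens)
```

For example, `build_subject("message", "received", "helion")` yields `message.received.helion`, where `helion` stands in for an `agent_id`.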
### DLQ Policy
| Stream | Max Retries | DLQ | Action |
|--------|-------------|-----|--------|
| ATTACHMENTS | 3 | attachment.failed.dlq | Manual review |
| AGENT_RUNS | 3 | agent.run.failed.dlq | Alert + retry |
| MEMORY | 5 | memory.failed.dlq | Auto-retry 1h |
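
The DLQ table can be read as a lookup that decides when a failed message leaves its stream. The sketch below is illustrative only: `DlqPolicy` and `dlq_target` are hypothetical names, and it assumes "Max Retries" counts total failed attempts, which matches the "3 fails -> DLQ" wording in the acceptance checklist. Publishing to the DLQ subject and carrying out the action (alert, scheduled retry) are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DlqPolicy:
    max_retries: int
    dlq_subject: str
    action: str


# Values copied from the DLQ Policy table.
DLQ_POLICIES = {
    "ATTACHMENTS": DlqPolicy(3, "attachment.failed.dlq", "Manual review"),
    "AGENT_RUNS": DlqPolicy(3, "agent.run.failed.dlq", "Alert + retry"),
    "MEMORY": DlqPolicy(5, "memory.failed.dlq", "Auto-retry 1h"),
}


def dlq_target(stream: str, failed_attempts: int) -> Optional[str]:
    """Return the DLQ subject once failures reach the stream's limit, else None.

    ASSUMPTION: `failed_attempts` counts every failed delivery so far.
    """
    policy = DLQ_POLICIES[stream]
    if failed_attempts >= policy.max_retries:
        return policy.dlq_subject
    return None
```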
---
## 3. Contract Ownership
| Contract | Owner | Change Process |
|----------|-------|----------------|
| Gateway->Router | Router team | PR + staging |
| Router->Memory | Memory team | PR + migration |
| Router->Control | Control team | PR + cache TTL |
| NATS events | Platform team | RFC + version |
---
## 4. Versioning
Format: `{service}:{type}:{hash}:{timestamp}`
Example: `helion:prompt:abc123:20260119T120000Z`
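
A version string in this format can be built and parsed as sketched below. The helper names (`VersionTag`, `format_version`, `parse_version`) are hypothetical, and the timestamp layout `YYYYMMDDTHHMMSSZ` in UTC is inferred from the single example above rather than stated by the ADR.

```python
from datetime import datetime, timezone
from typing import NamedTuple

# ASSUMPTION: compact UTC timestamp, inferred from "20260119T120000Z".
_TS_FORMAT = "%Y%m%dT%H%M%SZ"


class VersionTag(NamedTuple):
    service: str
    kind: str        # the `{type}` field, e.g. "prompt"
    digest: str      # the `{hash}` field
    timestamp: datetime


def format_version(service: str, kind: str, digest: str, timestamp: datetime) -> str:
    """Render `{service}:{type}:{hash}:{timestamp}` from a timezone-aware datetime."""
    if timestamp.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    ts = timestamp.astimezone(timezone.utc).strftime(_TS_FORMAT)
    return f"{service}:{kind}:{digest}:{ts}"


def parse_version(tag: str) -> VersionTag:
    """Split a version string into its four fields."""
    parts = tag.split(":")
    if len(parts) != 4:
        raise ValueError(f"expected 4 colon-separated fields, got {len(parts)}")
    service, kind, digest, ts = parts
    timestamp = datetime.strptime(ts, _TS_FORMAT).replace(tzinfo=timezone.utc)
    return VersionTag(service, kind, digest, timestamp)
```

Because `:` is the field separator, this sketch assumes that none of the fields contains a colon.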
### Cache TTL
| Type | TTL | Invalidation |
|------|-----|--------------|
| Prompt | 5 min | NATS event |
| Policy | 1 min | NATS event |
| Config | 30 sec | NATS event |
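
One way to picture the TTL table is a small client-side cache whose entries expire by type and can be dropped early when an invalidation event arrives. The sketch is an assumption-laden illustration: `TtlCache` is a hypothetical class, and the NATS subscription that would call `invalidate` is not shown because the ADR does not name the invalidation subject. The injectable clock exists only so expiry can be tested without sleeping.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

# TTLs in seconds, copied from the Cache TTL table.
TTL_SECONDS = {"prompt": 300.0, "policy": 60.0, "config": 30.0}


class TtlCache:
    """In-process cache keyed by (type, key) with per-type expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: Dict[Tuple[str, str], Tuple[float, Any]] = {}

    def put(self, kind: str, key: str, value: Any) -> None:
        expires_at = self._clock() + TTL_SECONDS[kind]
        self._entries[(kind, key)] = (expires_at, value)

    def get(self, kind: str, key: str) -> Optional[Any]:
        entry = self._entries.get((kind, key))
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired: drop it so the caller refetches from the Control Plane.
            del self._entries[(kind, key)]
            return None
        return value

    def invalidate(self, kind: str, key: str) -> None:
        """Hypothetical hook for the NATS invalidation event handler."""
        self._entries.pop((kind, key), None)
```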
---
## 5. Privacy Modes
| Mode | Restrictions |
|------|--------------|
| public | None |
| team | No cross-team |
| private | User-only |
| confidential | No logging, no indexing |
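
The modes above can be sketched as two checks, one for access and one for logging and indexing. The function names and parameters are hypothetical. The table lists only "No logging, no indexing" for `confidential`, so treating it as user-only for access as well is an assumption made here to err on the strict side.

```python
from enum import Enum


class PrivacyMode(str, Enum):
    PUBLIC = "public"
    TEAM = "team"
    PRIVATE = "private"
    CONFIDENTIAL = "confidential"


def can_access(mode: PrivacyMode, *, owner_id: str, owner_team: str,
               requester_id: str, requester_team: str) -> bool:
    """Return True if the requester may read data stored under `mode`."""
    if mode is PrivacyMode.PUBLIC:
        return True
    if mode is PrivacyMode.TEAM:
        # "No cross-team": requester must belong to the owner's team.
        return requester_team == owner_team
    # PRIVATE is user-only.
    # ASSUMPTION: CONFIDENTIAL is at least as strict as PRIVATE for access.
    return requester_id == owner_id


def may_log_or_index(mode: PrivacyMode) -> bool:
    """Only CONFIDENTIAL forbids logging and indexing, per the table."""
    return mode is not PrivacyMode.CONFIDENTIAL
```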
---
## 6. Trace Correlation
### HTTP Headers
- X-Trace-ID, X-Request-ID, X-Job-ID
- X-User-ID, X-Agent-ID, X-Mode
### NATS Headers
- Nats-Trace-ID, Nats-Job-ID
- Nats-User-ID, Nats-Agent-ID
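
To show how the two header sets line up, the sketch below copies the HTTP correlation headers that have a listed NATS counterpart onto an outgoing message. `to_nats_headers` is a hypothetical helper. `X-Request-ID` and `X-Mode` are dropped only because no NATS equivalent is listed above, and generating a missing trace ID as a UUID4 hex string is an assumption about the ID format.

```python
import uuid
from typing import Dict, Mapping

# Pairs taken from the two header lists above.
HTTP_TO_NATS = {
    "X-Trace-ID": "Nats-Trace-ID",
    "X-Job-ID": "Nats-Job-ID",
    "X-User-ID": "Nats-User-ID",
    "X-Agent-ID": "Nats-Agent-ID",
}


def to_nats_headers(http_headers: Mapping[str, str]) -> Dict[str, str]:
    """Translate HTTP correlation headers into their NATS equivalents."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in http_headers.items()}
    nats_headers: Dict[str, str] = {}
    for http_name, nats_name in HTTP_TO_NATS.items():
        value = lowered.get(http_name.lower())
        if value:
            nats_headers[nats_name] = value
    # ASSUMPTION: a missing trace ID is generated here as a UUID4 hex string.
    nats_headers.setdefault("Nats-Trace-ID", uuid.uuid4().hex)
    return nats_headers
```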
---
## 7. Acceptance Checklist
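- [ ] Router is stateless: a restart does not lose jobs
- [ ] Idempotency: a duplicate job_id results in exactly one execution
- [ ] DLQ works: 3 failed attempts move the message to the DLQ and raise an alert
- [ ] Policy enforcement: a policy change takes effect in less than the cache TTL
- [ ] Consumer lag alert: fires when a consumer has been stopped for 5 min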
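
The idempotency item can be illustrated with a wrapper that runs a handler at most once per `job_id` and returns the stored result for duplicates. This is a minimal sketch under stated assumptions: `IdempotentExecutor` is a hypothetical name, and the in-memory dictionary stands in for whatever durable, shared store a real deployment would need, since a restart would otherwise forget completed jobs. It is not thread-safe.

```python
from typing import Any, Callable, Dict


class IdempotentExecutor:
    """Run each job_id at most once; duplicates get the stored result."""

    def __init__(self) -> None:
        # ASSUMPTION: in-memory store for illustration only. A production
        # service would need a durable store shared across instances.
        self._results: Dict[str, Any] = {}

    def run(self, job_id: str, handler: Callable[[], Any]) -> Any:
        if job_id in self._results:
            # Duplicate delivery: skip the handler entirely.
            return self._results[job_id]
        result = handler()
        self._results[job_id] = result
        return result
```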
---
## Decision
1. All services MUST use the trace middleware
2. Every NATS message MUST carry an idempotency key
3. The Control Plane is the source of truth for prompts, policy, and config
4. The Memory API is the ONLY data access path
5. DLQ processing is automated
6. Privacy mode is checked in the Router