Complete snapshot of /opt/microdao-daarion/ from NODE1 (144.76.224.179).
This represents the actual running production code that has diverged
significantly from the previous main branch.
Key changes from old main:
- Gateway (http_api.py): expanded from ~40KB to 164KB with full agent support
- Router: new /v1/agents/{id}/infer endpoint with vision + DeepSeek routing
- Behavior Policy: SOWA v2.2 (3-level: FULL/ACK/SILENT)
- Agent Registry: config/agent_registry.yml as single source of truth
- 13 agents configured (was 3)
- Memory service integration
- CrewAI teams and roles
Excluded from snapshot: venv/, .env, data/, backups, .tgz archives
Co-authored-by: Cursor <cursoragent@cursor.com>
127 lines
3.3 KiB
Markdown
# ADR: Service Boundaries & Contract Ownership

**Status:** Accepted
**Date:** 2026-01-19
**Authors:** DAARION Platform Team

---

## 1. Service Boundaries

### Gateway (BFF) :9300
**Does:** Telegram webhooks, Auth, Rate limiting, Request normalization, Trace ID generation
**Does NOT:** LLM calls, Direct DB access, Business logic

### Router :9102
**Does:** Agent routing, Tool orchestration, Policy enforcement, LLM provider selection
**Does NOT:** Session storage, Direct DB access (tech debt: graph_query), File processing

### Control Plane :9200
**Does:** Versioned prompts, Policy/RBAC, Config/flags, Quotas
**Does NOT:** Request processing, Data storage

### Memory API :8000
**Does:** Vector search, Graph queries, Fact storage, ACL enforcement, Audit
**Does NOT:** LLM calls, File processing

### Swapper :8890
**Does:** Vision, STT, TTS, Image generation (lazy load)
**Does NOT:** Data storage, Policy decisions

### CrewAI Worker :9011
**Does:** Async workflows, Multi-agent, NATS consumption
**Does NOT:** Sync requests, Direct LLM calls

---

## 2. NATS Subject Taxonomy

### Naming: `{domain}.{action}.{entity}[.{subtype}]`

| Subject | Publisher | Consumer | Idempotency Key |
|---------|-----------|----------|-----------------|
| message.received.{agent_id} | Gateway | Router | request_id |
| message.processed.{agent_id} | Router | Gateway | request_id |
| attachment.created.{type} | Ingest | Parser | file_id |
| attachment.parsed.{type} | Parser | Memory | file_id |
| agent.run.requested | Router | Worker | job_id |
| agent.run.completed | Worker | Gateway | job_id |
| audit.action.{service} | All | Audit | event_id |
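The naming pattern and idempotency keys above could be exercised with a small helper like the one below. This is a minimal sketch, not the project's actual code: `build_subject`, `Deduplicator`, and the allowed token characters are assumptions made for illustration, and the in-memory set stands in for whatever shared store a real consumer would use.

```python
import re
from typing import Optional, Set, Tuple

# Allowed characters per token are an assumption; the ADR only fixes the
# dot-separated layout {domain}.{action}.{entity}[.{subtype}].
_TOKEN = re.compile(r"[A-Za-z0-9_-]+")


def build_subject(domain: str, action: str, entity: str,
                  subtype: Optional[str] = None) -> str:
    """Join the tokens into a NATS subject, rejecting empty or dotted tokens."""
    tokens = [domain, action, entity]
    if subtype is not None:
        tokens.append(subtype)
    for token in tokens:
        if not _TOKEN.fullmatch(token):
            raise ValueError(f"invalid subject token: {token!r}")
    return ".".join(tokens)


class Deduplicator:
    """Hypothetical in-memory guard; a real consumer needs shared, durable storage."""

    def __init__(self) -> None:
        self._seen: Set[Tuple[str, str]] = set()

    def first_time(self, subject: str, idempotency_key: str) -> bool:
        """Return True only the first time a (subject, key) pair is seen."""
        marker = (subject, idempotency_key)
        if marker in self._seen:
            return False
        self._seen.add(marker)
        return True
```

For example, `build_subject("message", "received", "helion")` yields `message.received.helion`, and a second `first_time("agent.run.requested", "job-1")` call on the same `Deduplicator` returns `False`, so the handler can skip the duplicate.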

### DLQ Policy
| Stream | Max Retries | DLQ | Action |
|--------|-------------|-----|--------|
| ATTACHMENTS | 3 | attachment.failed.dlq | Manual review |
| AGENT_RUNS | 3 | agent.run.failed.dlq | Alert + retry |
| MEMORY | 5 | memory.failed.dlq | Auto-retry 1h |
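One way to read the DLQ table is as a lookup that decides, per stream, whether a failed message is retried or routed to its DLQ subject. The sketch below assumes that "Max Retries" counts failed attempts before the message is dead-lettered (matching the "3 fails -> DLQ" checklist item); `DlqPolicy`, `next_step`, and the action identifiers are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class DlqPolicy:
    max_retries: int   # failed attempts tolerated before dead-lettering (assumed meaning)
    dlq_subject: str   # subject that receives the dead-lettered message
    action: str        # follow-up action; identifiers here are illustrative


# Values copied from the DLQ Policy table.
DLQ_POLICIES: Dict[str, DlqPolicy] = {
    "ATTACHMENTS": DlqPolicy(3, "attachment.failed.dlq", "manual_review"),
    "AGENT_RUNS": DlqPolicy(3, "agent.run.failed.dlq", "alert_and_retry"),
    "MEMORY": DlqPolicy(5, "memory.failed.dlq", "auto_retry_1h"),
}


def next_step(stream: str, failed_attempts: int) -> str:
    """Return "retry" while under the limit, otherwise the stream's DLQ subject."""
    policy = DLQ_POLICIES[stream]
    if failed_attempts < policy.max_retries:
        return "retry"
    return policy.dlq_subject
```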

---

## 3. Contract Ownership

| Contract | Owner | Change Process |
|----------|-------|----------------|
| Gateway->Router | Router team | PR + staging |
| Router->Memory | Memory team | PR + migration |
| Router->Control | Control team | PR + cache TTL |
| NATS events | Platform team | RFC + version |

---

## 4. Versioning

Format: `{service}:{type}:{hash}:{timestamp}`

Example: `helion:prompt:abc123:20260119T120000Z`
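A version identifier in this format can be split and rebuilt with a few lines of standard-library code. The sketch below is illustrative only: `VersionId`, `parse_version`, and `format_version` are hypothetical names, and it assumes the timestamp is always UTC in the compact `YYYYMMDDTHHMMSSZ` form shown in the example.

```python
from datetime import datetime, timezone
from typing import NamedTuple

# Compact UTC timestamp layout inferred from the example above (an assumption).
_TS_FORMAT = "%Y%m%dT%H%M%SZ"


class VersionId(NamedTuple):
    service: str
    kind: str        # the {type} field, e.g. "prompt"
    digest: str      # the {hash} field
    timestamp: datetime


def parse_version(value: str) -> VersionId:
    """Split "{service}:{type}:{hash}:{timestamp}"; raises ValueError if malformed."""
    service, kind, digest, stamp = value.split(":")
    ts = datetime.strptime(stamp, _TS_FORMAT).replace(tzinfo=timezone.utc)
    return VersionId(service, kind, digest, ts)


def format_version(version: VersionId) -> str:
    """Render a VersionId back into its canonical string form."""
    stamp = version.timestamp.astimezone(timezone.utc).strftime(_TS_FORMAT)
    return f"{version.service}:{version.kind}:{version.digest}:{stamp}"
```

Parsing the example string gives `service == "helion"` and `kind == "prompt"`, and `format_version` returns the original string unchanged.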

### Cache TTL
| Type | TTL | Invalidation |
|------|-----|--------------|
| Prompt | 5 min | NATS event |
| Policy | 1 min | NATS event |
| Config | 30 sec | NATS event |
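The TTL table can be sketched as a small cache whose entries expire per type and can also be dropped early. This is an assumption-laden illustration: `TtlCache` and its methods are hypothetical, and `invalidate` is meant to be called from whatever handler receives the NATS invalidation event, which is not shown here.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

# TTLs in seconds, taken from the Cache TTL table.
TTL_SECONDS: Dict[str, int] = {"prompt": 300, "policy": 60, "config": 30}


class TtlCache:
    """Hypothetical per-type TTL cache with explicit invalidation."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[Tuple[str, str], Tuple[float, Any]] = {}

    def put(self, kind: str, key: str, value: Any) -> None:
        """Store a value that expires after the TTL configured for its type."""
        self._entries[(kind, key)] = (self._clock() + TTL_SECONDS[kind], value)

    def get(self, kind: str, key: str) -> Optional[Any]:
        """Return the cached value, or None if it is missing or expired."""
        entry = self._entries.get((kind, key))
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[(kind, key)]
            return None
        return value

    def invalidate(self, kind: str, key: str) -> None:
        """Drop an entry early, e.g. when an invalidation event arrives."""
        self._entries.pop((kind, key), None)
```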

---

## 5. Privacy Modes

| Mode | Restrictions |
|------|--------------|
| public | None |
| team | No cross-team |
| private | User-only |
| confidential | No logging, no indexing |
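The restrictions in this table could be encoded as simple predicates that a caller consults before logging, indexing, or serving data. The sketch below is one possible reading, not the Router's actual logic: all function names are hypothetical, and treating `confidential` as user-only for reads is an assumption, since the table only states "No logging, no indexing" for that mode.

```python
from enum import Enum


class PrivacyMode(str, Enum):
    PUBLIC = "public"
    TEAM = "team"
    PRIVATE = "private"
    CONFIDENTIAL = "confidential"


def may_log(mode: PrivacyMode) -> bool:
    """Only confidential content is excluded from logs."""
    return mode is not PrivacyMode.CONFIDENTIAL


def may_index(mode: PrivacyMode) -> bool:
    """Only confidential content is excluded from indexing."""
    return mode is not PrivacyMode.CONFIDENTIAL


def may_read(mode: PrivacyMode, *, owner_id: str, owner_team: str,
             reader_id: str, reader_team: str) -> bool:
    """Decide whether a reader may see content owned by another user."""
    if mode is PrivacyMode.PUBLIC:
        return True
    if mode is PrivacyMode.TEAM:
        return reader_team == owner_team  # "No cross-team"
    # PRIVATE is "User-only"; CONFIDENTIAL is treated the same way here (assumption).
    return reader_id == owner_id
```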

---

## 6. Trace Correlation

### HTTP Headers
- X-Trace-ID, X-Request-ID, X-Job-ID
- X-User-ID, X-Agent-ID, X-Mode

### NATS Headers
- Nats-Trace-ID, Nats-Job-ID
- Nats-User-ID, Nats-Agent-ID
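When a request crosses from HTTP into NATS, the correlation headers have to be carried over. The sketch below shows one way to do that mapping; `to_nats_headers` is a hypothetical helper, the one-to-one pairing by name is an assumption, and `X-Request-ID` and `X-Mode` are left out only because no NATS counterpart is listed above.

```python
from typing import Dict, Mapping

# HTTP header -> NATS header, paired by name (an assumed convention).
HTTP_TO_NATS: Dict[str, str] = {
    "X-Trace-ID": "Nats-Trace-ID",
    "X-Job-ID": "Nats-Job-ID",
    "X-User-ID": "Nats-User-ID",
    "X-Agent-ID": "Nats-Agent-ID",
}


def to_nats_headers(http_headers: Mapping[str, str]) -> Dict[str, str]:
    """Copy correlation headers from an HTTP request into NATS header names."""
    # HTTP header names are case-insensitive, so compare in lowercase.
    lowered = {name.lower(): value for name, value in http_headers.items()}
    result: Dict[str, str] = {}
    for http_name, nats_name in HTTP_TO_NATS.items():
        value = lowered.get(http_name.lower())
        if value:
            result[nats_name] = value
    return result
```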

---

## 7. Acceptance Checklist

- [ ] Router stateless: restart doesn't lose jobs
- [ ] Idempotency: duplicate job_id = 1 execution
- [ ] DLQ works: 3 fails -> DLQ + alert
- [ ] Policy enforcement: change applies < cache TTL
- [ ] Consumer lag alert: fires on 5min stop

---

## Decision

1. All services MUST use trace middleware
2. NATS messages MUST have idempotency key
3. Control Plane = source of truth
4. Memory API = ONLY data access
5. DLQ processing automated
6. Privacy mode checked in Router