microdao-daarion/docs/ADR_SERVICE_CONTRACTS.md


ADR: Service Boundaries & Contract Ownership

Status: Accepted
Date: 2026-01-19
Authors: DAARION Platform Team


1. Service Boundaries

Gateway (BFF) :9300

Does: Telegram webhooks, Auth, Rate limiting, Request normalization, Trace ID generation
Does NOT: LLM calls, Direct DB access, Business logic

Router :9102

Does: Agent routing, Tool orchestration, Policy enforcement, LLM provider selection
Does NOT: Session storage, Direct DB access (tech debt: graph_query), File processing

Control Plane :9200

Does: Versioned prompts, Policy/RBAC, Config/flags, Quotas
Does NOT: Request processing, Data storage

Memory API :8000

Does: Vector search, Graph queries, Fact storage, ACL enforcement, Audit
Does NOT: LLM calls, File processing

Swapper :8890

Does: Vision, STT, TTS, Image generation (lazy load)
Does NOT: Data storage, Policy decisions

CrewAI Worker :9011

Does: Async workflows, Multi-agent, NATS consumption
Does NOT: Sync requests, Direct LLM calls


2. NATS Subject Taxonomy

Naming: {domain}.{action}.{entity}[.{subtype}]

| Subject | Publisher | Consumer | Idempotency Key |
|---|---|---|---|
| message.received.{agent_id} | Gateway | Router | request_id |
| message.processed.{agent_id} | Router | Gateway | request_id |
| attachment.created.{type} | Ingest | Parser | file_id |
| attachment.parsed.{type} | Parser | Memory | file_id |
| agent.run.requested | Router | Worker | job_id |
| agent.run.completed | Worker | Gateway | job_id |
| audit.action.{service} | All | Audit | event_id |
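
The naming rule above can be sketched as a small builder and parser. This is a minimal illustration, not the project's code: the function names `build_subject` and `parse_subject` are hypothetical, and the allowed token alphabet is an assumption, since the ADR does not specify one. `helion` is used only as an example agent ID.

```python
import re
from typing import Optional

# Assumed token alphabet: lowercase letters, digits, underscore, hyphen.
# The ADR does not state which characters are allowed.
_TOKEN = re.compile(r"[a-z0-9_-]+")


def build_subject(domain: str, action: str, entity: str,
                  subtype: Optional[str] = None) -> str:
    """Join tokens as {domain}.{action}.{entity}[.{subtype}]."""
    tokens = [domain, action, entity]
    if subtype is not None:
        tokens.append(subtype)
    for token in tokens:
        # Reject dots, wildcards, and whitespace so a single token
        # cannot change the subject hierarchy.
        if not _TOKEN.fullmatch(token):
            raise ValueError(f"invalid subject token: {token!r}")
    return ".".join(tokens)


def parse_subject(subject: str) -> dict:
    """Split a subject back into its named parts (subtype is optional)."""
    parts = subject.split(".")
    if len(parts) not in (3, 4):
        raise ValueError(f"expected 3 or 4 tokens, got {len(parts)}: {subject!r}")
    keys = ["domain", "action", "entity", "subtype"]
    return dict(zip(keys, parts))


subject = build_subject("message", "received", "helion")
parts = parse_subject("agent.run.failed.dlq")
```

Validating each token before joining keeps a stray `.` or `*` in an agent ID from silently widening or narrowing what a subscriber matches.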

DLQ Policy

| Stream | Max Retries | DLQ | Action |
|---|---|---|---|
| ATTACHMENTS | 3 | attachment.failed.dlq | Manual review |
| AGENT_RUNS | 3 | agent.run.failed.dlq | Alert + retry |
| MEMORY | 5 | memory.failed.dlq | Auto-retry 1h |
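
A rough sketch of how a consumer might apply this policy is shown below. The `DlqPolicy` class and `route_failure` function are hypothetical names. The sketch assumes "Max Retries" counts total failed attempts, which matches the checklist item "3 fails -> DLQ"; if the intended meaning is "retries after the first attempt", the comparison would shift by one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DlqPolicy:
    max_retries: int   # failed attempts tolerated before dead-lettering (assumed meaning)
    dlq_subject: str   # subject the message is moved to
    action: str        # follow-up described in the ADR table


# Values copied from the DLQ Policy table.
DLQ_POLICIES = {
    "ATTACHMENTS": DlqPolicy(3, "attachment.failed.dlq", "Manual review"),
    "AGENT_RUNS": DlqPolicy(3, "agent.run.failed.dlq", "Alert + retry"),
    "MEMORY": DlqPolicy(5, "memory.failed.dlq", "Auto-retry 1h"),
}


def route_failure(stream: str, failed_attempts: int) -> str:
    """Return "retry" while under the limit, otherwise the DLQ subject."""
    policy = DLQ_POLICIES[stream]
    if failed_attempts < policy.max_retries:
        return "retry"
    return policy.dlq_subject


# Second failure on AGENT_RUNS is retried; the third is dead-lettered.
second = route_failure("AGENT_RUNS", 2)
third = route_failure("AGENT_RUNS", 3)
```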

3. Contract Ownership

| Contract | Owner | Change Process |
|---|---|---|
| Gateway->Router | Router team | PR + staging |
| Router->Memory | Memory team | PR + migration |
| Router->Control | Control team | PR + cache TTL |
| NATS events | Platform team | RFC + version |

4. Versioning

Format: {service}:{type}:{hash}:{timestamp}

Example: helion:prompt:abc123:20260119T120000Z
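
For illustration, the format can be parsed and re-serialized as below. This is a sketch under assumptions: `VersionId` and `parse_version` are hypothetical names, the timestamp layout is inferred from the single example above, and the timestamp is assumed to be UTC because of the trailing `Z`.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Layout inferred from the example 20260119T120000Z (assumption).
_TS_FORMAT = "%Y%m%dT%H%M%SZ"


@dataclass(frozen=True)
class VersionId:
    service: str
    kind: str          # the {type} field, e.g. "prompt"
    digest: str        # the {hash} field
    timestamp: datetime

    def __str__(self) -> str:
        stamp = self.timestamp.strftime(_TS_FORMAT)
        return ":".join([self.service, self.kind, self.digest, stamp])


def parse_version(value: str) -> VersionId:
    """Parse {service}:{type}:{hash}:{timestamp} into its four fields."""
    parts = value.split(":")
    if len(parts) != 4:
        raise ValueError(f"expected 4 colon-separated fields: {value!r}")
    service, kind, digest, stamp = parts
    # strptime raises ValueError if the timestamp does not match the layout.
    ts = datetime.strptime(stamp, _TS_FORMAT).replace(tzinfo=timezone.utc)
    return VersionId(service, kind, digest, ts)


version = parse_version("helion:prompt:abc123:20260119T120000Z")
```

Because the timestamp is fixed-width and zero-padded, two version strings for the same service, type, and hash also sort chronologically as plain strings.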

Cache TTL

| Type | TTL | Invalidation |
|---|---|---|
| Prompt | 5 min | NATS event |
| Policy | 1 min | NATS event |
| Config | 30 sec | NATS event |
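
One way a client of the Control Plane could honor these TTLs is sketched below. The `TtlCache` class is hypothetical; the ADR does not describe the cache implementation or the shape of the invalidation event, so `invalidate` simply stands in for whatever a NATS event handler would call.

```python
import time

# TTLs from the table above, in seconds.
TTL_SECONDS = {"prompt": 300, "policy": 60, "config": 30}


class TtlCache:
    def __init__(self, clock=time.monotonic):
        self._clock = clock      # injectable so tests can control time
        self._entries = {}       # (kind, key) -> (expires_at, value)

    def put(self, kind: str, key: str, value) -> None:
        expires_at = self._clock() + TTL_SECONDS[kind]
        self._entries[(kind, key)] = (expires_at, value)

    def get(self, kind: str, key: str):
        """Return the cached value, or None if missing or expired."""
        entry = self._entries.get((kind, key))
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[(kind, key)]
            return None
        return value

    def invalidate(self, kind: str, key: str) -> None:
        """Drop an entry early, e.g. from a NATS invalidation handler."""
        self._entries.pop((kind, key), None)


cache = TtlCache()
cache.put("policy", "helion", {"mode": "team"})
cached = cache.get("policy", "helion")
```

The TTL bounds staleness if an invalidation event is lost, while the event keeps the common case faster than the TTL.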

5. Privacy Modes

| Mode | Restrictions |
|---|---|
| public | None |
| team | No cross-team |
| private | User-only |
| confidential | No logging, no indexing |
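
The modes could be checked along the lines below. This is only a sketch: the function names are hypothetical, "No cross-team" is read as "reader and owner must share a team", and the ADR does not say who may access `confidential` data, so the sketch assumes the same user-only rule as `private`.

```python
from enum import Enum


class PrivacyMode(Enum):
    PUBLIC = "public"
    TEAM = "team"
    PRIVATE = "private"
    CONFIDENTIAL = "confidential"


def can_access(mode: PrivacyMode, owner_id: str, owner_team: str,
               reader_id: str, reader_team: str) -> bool:
    """Decide whether the reader may see data owned by owner_id."""
    if mode is PrivacyMode.PUBLIC:
        return True
    if mode is PrivacyMode.TEAM:
        return reader_team == owner_team
    # PRIVATE is user-only. CONFIDENTIAL access is not specified in the
    # ADR; treating it as user-only is an assumption.
    return reader_id == owner_id


def may_log(mode: PrivacyMode) -> bool:
    return mode is not PrivacyMode.CONFIDENTIAL


def may_index(mode: PrivacyMode) -> bool:
    return mode is not PrivacyMode.CONFIDENTIAL


# Parse the mode from its string form, e.g. a header value.
mode = PrivacyMode("team")
allowed = can_access(mode, "alice", "core", "bob", "core")
```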

6. Trace Correlation

HTTP Headers

  • X-Trace-ID, X-Request-ID, X-Job-ID
  • X-User-ID, X-Agent-ID, X-Mode

NATS Headers

  • Nats-Trace-ID, Nats-Job-ID
  • Nats-User-ID, Nats-Agent-ID
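
A possible way to carry correlation IDs from an HTTP request onto a NATS message is sketched below. The helper names are hypothetical, and the trace ID format (a random UUID hex string) is an assumption; the ADR only says the Gateway generates trace IDs. `X-Request-ID` and `X-Mode` have no NATS counterpart in the lists above, so the sketch does not forward them.

```python
import uuid

# Pairs taken from the two header lists above.
HTTP_TO_NATS = {
    "X-Trace-ID": "Nats-Trace-ID",
    "X-Job-ID": "Nats-Job-ID",
    "X-User-ID": "Nats-User-ID",
    "X-Agent-ID": "Nats-Agent-ID",
}


def ensure_trace_id(http_headers: dict) -> dict:
    """Return a copy of the headers that always carries X-Trace-ID."""
    headers = dict(http_headers)
    # HTTP header names are case-insensitive, so compare lowercased.
    if not any(name.lower() == "x-trace-id" for name in headers):
        headers["X-Trace-ID"] = uuid.uuid4().hex  # assumed ID format
    return headers


def to_nats_headers(http_headers: dict) -> dict:
    """Map the HTTP correlation headers that are present to NATS headers."""
    lowered = {name.lower(): value for name, value in http_headers.items()}
    return {
        nats_name: lowered[http_name.lower()]
        for http_name, nats_name in HTTP_TO_NATS.items()
        if http_name.lower() in lowered
    }


incoming = {"x-user-id": "u1", "X-Agent-ID": "helion", "X-Mode": "team"}
nats_headers = to_nats_headers(ensure_trace_id(incoming))
```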

7. Acceptance Checklist

  • Router stateless: restart doesn't lose jobs
  • Idempotency: duplicate job_id = 1 execution
  • DLQ works: 3 fails -> DLQ + alert
  • Policy enforcement: a policy change takes effect within the cache TTL
  • Consumer lag alert: fires when a consumer has been stopped for 5 min
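
The idempotency item above ("duplicate job_id = 1 execution") can be illustrated with a small deduplicating wrapper. This is a sketch, not the Worker's actual mechanism: `IdempotentExecutor` is a hypothetical name, and it keeps results in process memory purely for illustration. A real deployment would need a shared, durable store so that deduplication survives restarts.

```python
class IdempotentExecutor:
    def __init__(self):
        self._results = {}  # job_id -> result of the first execution

    def run(self, job_id: str, handler, payload):
        """Execute handler once per job_id; replay the stored result after."""
        if job_id in self._results:
            return self._results[job_id]
        result = handler(payload)
        # Store only after success, so a failed attempt can be retried.
        self._results[job_id] = result
        return result


calls = []


def handle(payload):
    calls.append(payload)
    return payload.upper()


executor = IdempotentExecutor()
first = executor.run("job-1", handle, "hello")
duplicate = executor.run("job-1", handle, "hello")  # redelivery of the same job
```

After both calls, `handle` has run once and both callers receive the same result, which is the behavior the checklist item asks to verify.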

Decision

  1. All services MUST use trace middleware
  2. NATS messages MUST have idempotency key
  3. Control Plane = source of truth for prompts, policies, config, and quotas
  4. Memory API = the ONLY data access path
  5. DLQ processing automated
  6. Privacy mode checked in Router