microdao-daarion/docs/runbook/sofiia-control-plane.md

Sofiia Control Plane — Operations Runbook

Version: 1.0
Date: 2026-02-25


Architecture: Two-Plane Model

┌─────────────────────────────────┐     ┌─────────────────────────────────┐
│          NODA2 (MacBook)        │     │        NODA1 (Production)        │
│      CONTROL PLANE              │     │       RUNTIME PLANE              │
│                                 │     │                                  │
│  sofiia-console BFF :8002  ────────→  │  router/gateway :9102/:9300     │
│  memory-service UI  :8000       │     │  postgres, qdrant stores         │
│  Ollama             :11434      │     │  cron jobs (governance)          │
│  WebSocket /ws/events           │     │  alert/incident/risk pipelines   │
│                                 │     │                                  │
│  Operator interacts here        │     │  Production traffic runs here    │
└─────────────────────────────────┘     └─────────────────────────────────┘

Rule: All operator actions go through NODA2 BFF

The BFF on NODA2 proxies requests to the NODA1 router/governance services. Never call NODA1 directly from the browser.
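
As a rough illustration of this rule, a client-side helper might address every request to the BFF and attach the bearer key (SOFIIA_CONSOLE_API_KEY) when one is configured. This is a minimal sketch, not the console's actual client code: the helper name `bff_request` is hypothetical, and which paths count as write endpoints is not specified in this runbook.

```python
import json
import urllib.request

# NODA2 BFF address, taken from the diagram above.
BFF_URL = "http://localhost:8002"


def bff_request(path, api_key="", payload=None):
    """Build (but do not send) a request addressed to the BFF.

    A JSON payload turns the request into a POST; without one it is a GET.
    """
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(BFF_URL + path, data=data)
    if data is not None:
        req.add_header("Content-Type", "application/json")
    if api_key:
        # Bearer scheme, as described for SOFIIA_CONSOLE_API_KEY.
        req.add_header("Authorization", f"Bearer {api_key}")
    return req
```

The returned object can be passed to `urllib.request.urlopen` once the BFF is running.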


Environment Variables

NODA2 (sofiia-console BFF)

| Variable | Default | Description |
|---|---|---|
| PORT | 8002 | BFF listen port |
| ENV | dev | One of dev, staging, prod; controls CORS strictness and auth enforcement |
| SOFIIA_CONSOLE_API_KEY | "" | Bearer auth for write endpoints. Mandatory in prod. |
| MEMORY_SERVICE_URL | http://localhost:8000 | Memory service URL (STT/TTS/memory) |
| OLLAMA_URL | http://localhost:11434 | Ollama URL for local LLM |
| CORS_ORIGINS | "" | Comma-separated allowed origins. Empty = * in dev. |
| SUPERVISOR_API_KEY | "" | Key for router/governance calls |
| NODES_POLL_INTERVAL_SEC | 30 | How often the BFF polls nodes for telemetry, in seconds |
| AISTALK_ENABLED | false | Enable AISTALK adapter |
| AISTALK_URL | "" | AISTALK bridge URL |
| BUILD_ID | local | Git SHA or build ID (set in CI/CD) |
| CONFIG_DIR | auto-detect | Path to config/ directory with nodes_registry.yml |
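
The defaults and the "mandatory in prod" rule above can be pictured as a small loader. This is a sketch under assumptions, not the BFF's real startup code: `BffConfig` and `load_bff_config` are hypothetical names, only a subset of the variables is covered, and leaving an empty CORS_ORIGINS as an empty list outside dev is a guess.

```python
from dataclasses import dataclass


@dataclass
class BffConfig:
    port: int
    env: str
    api_key: str
    cors_origins: list
    nodes_poll_interval_sec: int


def load_bff_config(environ):
    """Read a subset of the BFF variables from a mapping such as os.environ."""
    env = environ.get("ENV", "dev")
    if env not in ("dev", "staging", "prod"):
        raise ValueError(f"ENV must be dev, staging or prod, got {env!r}")

    api_key = environ.get("SOFIIA_CONSOLE_API_KEY", "")
    if env == "prod" and not api_key:
        raise ValueError("SOFIIA_CONSOLE_API_KEY is mandatory in prod")

    # Comma-separated list; an empty value means "*" in dev only.
    raw = environ.get("CORS_ORIGINS", "")
    origins = [o.strip() for o in raw.split(",") if o.strip()]
    if not origins and env == "dev":
        origins = ["*"]

    return BffConfig(
        port=int(environ.get("PORT", "8002")),
        env=env,
        api_key=api_key,
        cors_origins=origins,
        nodes_poll_interval_sec=int(environ.get("NODES_POLL_INTERVAL_SEC", "30")),
    )
```

Calling `load_bff_config(os.environ)` at startup would fail fast on a prod deployment that lacks an API key.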

NODA1 (router/governance)

| Variable | Description |
|---|---|
| ALERT_BACKEND | Must be postgres in production (not memory) |
| AUDIT_BACKEND | One of auto, jsonl, postgres |
| GOV_CRON_FILE | Path to cron file; default /etc/cron.d/daarion-governance |

Starting Services

NODA2 — Start BFF

cd services/sofiia-console
source .venv/bin/activate
uvicorn app.main:app --host 0.0.0.0 --port 8002 --reload

Or via Docker Compose:

docker-compose -f docker-compose.node2-sofiia.yml up -d

NODA2 — Check status

curl http://localhost:8002/api/health
curl http://localhost:8002/api/status/full

Expected: service: "sofiia-console", version: "0.3.x".
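
For scripted checks, the expected fields can be compared against the parsed response. The sketch below assumes the health response is a JSON object with top-level `service` and `version` keys, as the expected output suggests; the function name `check_health` is hypothetical.

```python
def check_health(body):
    """Return a list of problems found in a parsed /api/health response."""
    problems = []
    if body.get("service") != "sofiia-console":
        problems.append(f"unexpected service: {body.get('service')!r}")
    # "0.3.x" is read here as "any version starting with 0.3."
    version = str(body.get("version", ""))
    if not version.startswith("0.3."):
        problems.append(f"unexpected version: {version!r}")
    return problems
```

An empty list means the response matches what this runbook expects.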

Accessing the UI

http://localhost:8000/ui   ← memory-service serves sofiia-ui.html

The UI auto-connects to BFF at http://localhost:8002 (configurable in Settings tab).


Nodes Registry

Edit config/nodes_registry.yml to add/modify nodes:

nodes:
  NODA1:
    label: "Production (NODA1)"
    router_url: "http://<noda1-ip>:9102"
    gateway_url: "http://<noda1-ip>:9300"

  NODA2:
    label: "Control Plane (NODA2)"
    router_url: "http://localhost:8000"
    monitor_url: "http://localhost:8000"

Environment overrides (no need to edit YAML in prod):

export NODES_NODA1_ROUTER_URL=http://10.0.0.5:9102
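
One way to picture the override is as a merge of environment variables over the parsed registry. This is only a sketch: the general pattern `NODES_<NODE>_<FIELD>` is inferred from the single example above, and `apply_node_overrides` is a hypothetical helper, not the BFF's loader. It overrides only fields that already exist in the YAML.

```python
def apply_node_overrides(nodes, environ):
    """Return a copy of the registry with NODES_<NODE>_<FIELD> overrides applied."""
    # Copy each node's fields so the caller's registry is left untouched.
    merged = {node_id: dict(fields) for node_id, fields in nodes.items()}
    for node_id, fields in merged.items():
        for field in list(fields):
            env_name = f"NODES_{node_id}_{field}".upper()
            if env_name in environ:
                fields[field] = environ[env_name]
    return merged
```

With the registry above and `NODES_NODA1_ROUTER_URL` exported, the NODA1 `router_url` would take the environment value while every other field keeps its YAML value.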

Monitor Agent on Nodes

The BFF probes each node at GET /monitor/status (falls back to /healthz).

Implementing /monitor/status on a node

Add this endpoint to the node's router or a dedicated lightweight service:

GET /monitor/status  200 OK
{
  "online": true,
  "ts": "2026-02-25T10:00:00Z",
  "node_id": "NODA1",
  "heartbeat_age_s": 5,
  "router": {"ok": true, "latency_ms": 12},
  "gateway": {"ok": true, "latency_ms": 8},
  "alerts_loop_slo": {
    "p95_ms": 320,
    "failed_rate": 0.0
  },
  "open_incidents": 2,
  "backends": {
    "alerts": "postgres",
    "audit": "auto",
    "incidents": "auto",
    "risk_history": "auto",
    "backlog": "auto"
  },
  "last_artifacts": {
    "risk_digest": "2026-02-24",
    "platform_digest": "2026-W08",
    "backlog": "2026-02-24"
  }
}

If /monitor/status is not available, BFF synthesises partial data from /healthz.
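
The probe-then-fallback behaviour could look roughly like the sketch below. It is an illustration under assumptions: `probe_node` and the `source` field are hypothetical, the transport is injected as a `fetch_json(path)` callable that returns a dict or raises, and the exact partial data the BFF synthesises from /healthz is not documented here.

```python
def probe_node(fetch_json):
    """Probe a node via /monitor/status, falling back to /healthz.

    fetch_json(path) must return a parsed JSON dict, or raise OSError or
    ValueError when the endpoint is unreachable or returns invalid JSON.
    """
    try:
        status = fetch_json("/monitor/status")
        return {**status, "source": "monitor"}
    except (OSError, ValueError):
        pass  # Endpoint missing or broken: try the fallback.

    try:
        fetch_json("/healthz")
    except (OSError, ValueError):
        return {"online": False, "source": "none"}

    # Only liveness is known from /healthz, so the result is partial.
    return {"online": True, "source": "healthz"}
```

Injecting the transport keeps the fallback logic testable without a running node.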


Parity Verification

Run after every deploy, against both nodes:

# NODA2 alone
python3 ops/scripts/verify_sofiia_stack.py \
  --node NODA2 \
  --bff-url http://localhost:8002 \
  --router-url http://localhost:8000 \
  --env dev

# NODA1 from NODA2 (parity check)
python3 ops/scripts/verify_sofiia_stack.py \
  --node NODA1 \
  --bff-url http://<noda1>:8002 \
  --router-url http://<noda1>:9102 \
  --compare-with http://localhost:8002 \
  --compare-node NODA2 \
  --env prod

# JSON output for CI
python3 ops/scripts/verify_sofiia_stack.py --json | jq .pass

Exit 0 = PASS. Exit 1 = critical failure.
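
In a CI step that consumes the `--json` output rather than the exit code, the same pass/fail decision can be derived from the report. The sketch assumes only what the `jq .pass` example implies, namely a top-level `pass` key, and further assumes it is a boolean; `exit_code_from_report` is a hypothetical helper.

```python
import json


def exit_code_from_report(report_text):
    """Map the verifier's JSON report to 0 (PASS) or 1 (failure)."""
    try:
        report = json.loads(report_text)
    except ValueError:
        # Unparseable output is treated as a failure.
        return 1
    passed = isinstance(report, dict) and report.get("pass") is True
    return 0 if passed else 1
```

Anything other than an explicit `true` counts as a failure, so a truncated or empty report cannot pass by accident.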

Critical PASS requirements (prod)

  • router_health — router responds 200
  • bff_health — BFF identifies as sofiia-console
  • bff_status_full — router + memory reachable
  • alerts_backend != memory — must be postgres in prod/staging
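
The four requirements above can be expressed as a small evaluation function. This is a sketch of the rule as stated in this runbook, not the logic inside verify_sofiia_stack.py: the input shape (a dict of check name to boolean plus the reported alerts backend) and the name `critical_failures` are assumptions.

```python
# Check names taken from the list above.
CRITICAL_CHECKS = ("router_health", "bff_health", "bff_status_full")


def critical_failures(checks, alerts_backend, env):
    """Return the names of critical requirements that are not met."""
    # A check that is missing from the results counts as failed.
    failures = [name for name in CRITICAL_CHECKS if not checks.get(name, False)]
    # The alerts backend must be postgres in prod and staging.
    if env in ("prod", "staging") and alerts_backend != "postgres":
        failures.append("alerts_backend")
    return failures
```

An empty result corresponds to PASS; any entry corresponds to a critical failure.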

WebSocket Events

Connect to the WebSocket endpoint for real-time monitoring:

# Using wscat (npm install -g wscat)
wscat -c ws://localhost:8002/ws/events

# Or via Python
python3 -c "
import asyncio, json, websockets
async def f():
    async with websockets.connect('ws://localhost:8002/ws/events') as ws:
        async for msg in ws:
            print(json.loads(msg)['type'])
asyncio.run(f())
"

Event types: chat.message, chat.reply, voice.stt, voice.tts, ops.run, nodes.status, error.
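
When watching the stream for a while, it can help to tally events by type. The sketch below works on raw message strings, so it can be fed from any client. It assumes each message is a JSON object with a `type` field, as the Python one-liner above does; the `<invalid>` and `<unknown>` buckets are illustrative labels, not part of the protocol.

```python
import json
from collections import Counter

# Event types listed in this runbook.
KNOWN_EVENT_TYPES = {
    "chat.message", "chat.reply", "voice.stt", "voice.tts",
    "ops.run", "nodes.status", "error",
}


def count_event_types(raw_messages):
    """Tally WebSocket messages by event type."""
    counts = Counter()
    for raw in raw_messages:
        try:
            event = json.loads(raw)
        except ValueError:
            counts["<invalid>"] += 1
            continue
        event_type = event.get("type") if isinstance(event, dict) else None
        if event_type in KNOWN_EVENT_TYPES:
            counts[event_type] += 1
        else:
            counts["<unknown>"] += 1
    return counts
```

A non-zero `<unknown>` count would suggest the BFF emits event types beyond those documented here.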


Troubleshooting

BFF won't start: ModuleNotFoundError

pip install -r services/sofiia-console/requirements.txt

UI shows "BFF: ✗"

  1. Check BFF is running: curl http://localhost:8002/api/health
  2. Check Settings tab → BFF URL points to correct host
  3. Check CORS: in prod, the origin the UI is served from (e.g. http://localhost:8000) must be listed in CORS_ORIGINS

Router shows "offline" in Nodes

  1. Check that the NODA1 router is running: docker ps | grep router
  2. Check config/nodes_registry.yml router_url
  3. Override: export NODES_NODA1_ROUTER_URL=http://<correct-ip>:9102

STT/TTS not working

  1. Check memory-service is running: curl http://localhost:8000/health
  2. Check MEMORY_SERVICE_URL in BFF env
  3. Check browser has microphone permission

Alerts backend is "memory" (should be postgres)

In prod/staging, set:

export ALERT_BACKEND=postgres

Then restart the governance/router service.

Cron jobs not running

# Check cron file
cat /etc/cron.d/daarion-governance

# Manual trigger (example)
cd /path/to/daarion && python3 -m services.router.risk_engine snapshot

AISTALK Integration

See docs/aistalk/contract.md for full integration contract.

Quick enable:

export AISTALK_ENABLED=true
export AISTALK_URL=http://<aistalk-bridge>:PORT
# Restart BFF

Status check:

curl http://localhost:8002/api/status/full | jq .bff.aistalk_enabled

Definition of Done Checklist

  • verify_sofiia_stack.py PASS on NODA2 (dev)
  • verify_sofiia_stack.py PASS on NODA1 (prod) — router + BFF + alerts=postgres
  • --compare-with parity PASS between NODA1 and NODA2
  • Nodes dashboard shows real-time data (online/latency/incidents)
  • Ops tab: release_check runs and shows result
  • Voice: STT → chat → TTS roundtrip works without looping
  • WS Events tab shows chat.reply, voice.stt, nodes.status
  • SOFIIA_CONSOLE_API_KEY set on NODA1 (prod)
  • ALERT_BACKEND=postgres on NODA1 (prod)