# Sofiia Control Plane — Operations Runbook

Version: 1.0

Date: 2026-02-25

---

## Architecture: Two-Plane Model

```
┌─────────────────────────────────┐   ┌─────────────────────────────────┐
│  NODA2 (MacBook)                │   │  NODA1 (Production)             │
│  CONTROL PLANE                  │   │  RUNTIME PLANE                  │
│                                 │   │                                 │
│  sofiia-console BFF :8002 ────────→ │  router/gateway :8000/:9300     │
│  memory-service UI :8000        │   │  postgres, qdrant stores        │
│  Ollama :11434                  │   │  cron jobs (governance)         │
│  WebSocket /ws/events           │   │  alert/incident/risk pipelines  │
│                                 │   │                                 │
│  Operator interacts here        │   │  Production traffic runs here   │
└─────────────────────────────────┘   └─────────────────────────────────┘
```

### Rule: All operator actions go through NODA2 BFF

The BFF on NODA2 proxies requests to NODA1 router/governance. You never call NODA1 directly from the browser.

---

## Environment Variables

### NODA2 (sofiia-console BFF)

| Variable | Default | Description |
|---|---|---|
| `PORT` | `8002` | BFF listen port |
| `ENV` | `dev` | `dev\|staging\|prod` — controls CORS strictness, auth enforcement |
| `SOFIIA_CONSOLE_API_KEY` | `""` | Bearer auth for write endpoints. Mandatory in prod. |
| `MEMORY_SERVICE_URL` | `http://localhost:8000` | Memory service URL (STT/TTS/memory) |
| `OLLAMA_URL` | `http://localhost:11434` | Ollama URL for local LLM |
| `CORS_ORIGINS` | `""` | Comma-separated allowed origins. Empty = `*` in dev. |
| `SUPERVISOR_API_KEY` | `""` | Key for router/governance calls |
| `NODES_POLL_INTERVAL_SEC` | `30` | How often BFF polls nodes for telemetry |
| `AISTALK_ENABLED` | `false` | Enable AISTALK adapter |
| `AISTALK_URL` | `""` | AISTALK bridge URL |
| `BUILD_ID` | `local` | Git SHA or build ID (set in CI/CD) |
| `CONFIG_DIR` | auto-detect | Path to `config/` directory with `nodes_registry.yml` |

### NODA1 (router/governance)

| Variable | Description |
|---|---|
| `ALERT_BACKEND` | Must be `postgres` in production (not `memory`) |
| `AUDIT_BACKEND` | `auto\|jsonl\|postgres` |
| `GOV_CRON_FILE` | Path to cron file, default `/etc/cron.d/daarion-governance` |

---

## Starting Services

### NODA2 — Start BFF

```bash
cd services/sofiia-console
source .venv/bin/activate
uvicorn app.main:app --host 0.0.0.0 --port 8002 --reload
```

Or via Docker Compose:

```bash
docker-compose -f docker-compose.node2-sofiia.yml up -d
```

### NODA2 — Check status

```bash
curl http://localhost:8002/api/health
curl http://localhost:8002/api/status/full
```

Expected: `service: "sofiia-console"`, `version: "0.3.x"`.

### Accessing the UI

```
http://localhost:8000/ui    ← memory-service serves sofiia-ui.html
```

The UI auto-connects to BFF at `http://localhost:8002` (configurable in Settings tab).

---

## Nodes Registry

Edit `config/nodes_registry.yml` to add/modify nodes:

```yaml
nodes:
  NODA1:
    label: "Production (NODA1)"
    router_url: "http://:9102"
    gateway_url: "http://:9300"
  NODA2:
    label: "Control Plane (NODA2)"
    router_url: "http://localhost:8000"
    monitor_url: "http://localhost:8000"
```

**Environment overrides** (no need to edit YAML in prod):

```bash
export NODES_NODA1_ROUTER_URL=http://10.0.0.5:9102
```

---

## Monitor Agent on Nodes

The BFF probes each node at `GET /monitor/status` (falls back to `/healthz`).
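The probe-then-fallback order described above can be sketched roughly as follows. This is an illustration only, not the BFF's actual code: the names `http_get_json` and `probe_node`, the 3-second timeout, and the shape of the returned dict are all assumptions made for the example.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Optional


def http_get_json(url: str, timeout: float = 3.0) -> Optional[dict]:
    """Return the JSON object from an HTTP 200 response, {} if the body is
    not a JSON object, or None if the request fails (hypothetical helper)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            if resp.status != 200:
                return None
            body = resp.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError):
        return None
    try:
        parsed = json.loads(body)
    except ValueError:
        return {}
    return parsed if isinstance(parsed, dict) else {}


def probe_node(
    base_url: str,
    fetch: Callable[[str], Optional[dict]] = http_get_json,
) -> dict:
    """Try /monitor/status first; fall back to /healthz; else report offline."""
    base = base_url.rstrip("/")

    status = fetch(f"{base}/monitor/status")
    if status is not None:
        # Assumption: a missing "online" field is treated as not online.
        return {"source": "monitor", "online": bool(status.get("online", False)), "data": status}

    health = fetch(f"{base}/healthz")
    if health is not None:
        # Assumption: any 200 from /healthz counts as online, with partial data.
        return {"source": "healthz", "online": True, "data": health}

    return {"source": "none", "online": False, "data": {}}
```

Passing `fetch` as a parameter keeps the fallback logic testable without a live node; a call such as `probe_node("http://localhost:8000")` would use the real HTTP helper.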
### Implementing `/monitor/status` on a node

Add this endpoint to the node's router or a dedicated lightweight service:

```json
GET /monitor/status → 200 OK
{
  "online": true,
  "ts": "2026-02-25T10:00:00Z",
  "node_id": "NODA1",
  "heartbeat_age_s": 5,
  "router": {"ok": true, "latency_ms": 12},
  "gateway": {"ok": true, "latency_ms": 8},
  "alerts_loop_slo": {
    "p95_ms": 320,
    "failed_rate": 0.0
  },
  "open_incidents": 2,
  "backends": {
    "alerts": "postgres",
    "audit": "auto",
    "incidents": "auto",
    "risk_history": "auto",
    "backlog": "auto"
  },
  "last_artifacts": {
    "risk_digest": "2026-02-24",
    "platform_digest": "2026-W08",
    "backlog": "2026-02-24"
  }
}
```

If `/monitor/status` is not available, BFF synthesises partial data from `/healthz`.

---

## Parity Verification

Run after every deploy to both nodes:

```bash
# NODA2 alone
python3 ops/scripts/verify_sofiia_stack.py \
  --node NODA2 \
  --bff-url http://localhost:8002 \
  --router-url http://localhost:8000 \
  --env dev

# NODA1 from NODA2 (parity check)
python3 ops/scripts/verify_sofiia_stack.py \
  --node NODA1 \
  --bff-url http://:8002 \
  --router-url http://:9102 \
  --compare-with http://localhost:8002 \
  --compare-node NODA2 \
  --env prod

# JSON output for CI
python3 ops/scripts/verify_sofiia_stack.py --json | jq .pass
```

Exit 0 = PASS. Exit 1 = critical failure.
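For CI systems where `jq` is not available, the exit-code contract and the `--json` flag can be combined in a small wrapper. The sketch below assumes only what the commands above show, namely that the JSON report is an object with a top-level `pass` field; the function names `evaluate` and `run_verify` are hypothetical, and treating unparseable output as a failure is a design choice made here, not documented behaviour of the script.

```python
import json
import subprocess
import sys


def evaluate(returncode: int, stdout: str) -> bool:
    """PASS only if the exit code is 0 and the JSON report has a truthy `pass`."""
    if returncode != 0:
        return False
    try:
        report = json.loads(stdout)
    except ValueError:
        # Assumption: with --json, output that is not valid JSON is a failure.
        return False
    if not isinstance(report, dict):
        return False
    return bool(report.get("pass", False))


def run_verify(extra_args: list[str]) -> bool:
    """Run the verification script with --json and interpret the result."""
    cmd = [sys.executable, "ops/scripts/verify_sofiia_stack.py", "--json", *extra_args]
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return evaluate(proc.returncode, proc.stdout)


if __name__ == "__main__":
    # Forward any CLI arguments (for example: --node NODA2 --env dev).
    sys.exit(0 if run_verify(sys.argv[1:]) else 1)
```

Checking both signals is deliberately strict: a zero exit code with `"pass": false` in the report still fails the gate.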
### Critical PASS requirements (prod)

- `router_health` — router responds 200
- `bff_health` — BFF identifies as `sofiia-console`
- `bff_status_full` — router + memory reachable
- `alerts_backend != memory` — must be postgres in prod/staging

---

## WebSocket Events

Connect to WS for real-time monitoring:

```bash
# Using wscat (npm install -g wscat)
wscat -c ws://localhost:8002/ws/events

# Or via Python
python3 -c "
import asyncio, json, websockets
async def f():
    async with websockets.connect('ws://localhost:8002/ws/events') as ws:
        async for msg in ws:
            print(json.loads(msg)['type'])
asyncio.run(f())
"
```

Event types: `chat.message`, `chat.reply`, `voice.stt`, `voice.tts`, `ops.run`, `nodes.status`, `error`.

---

## Troubleshooting

### BFF won't start: `ModuleNotFoundError`

```bash
pip install -r services/sofiia-console/requirements.txt
```

### UI shows "BFF: ✗"

1. Check BFF is running: `curl http://localhost:8002/api/health`
2. Check Settings tab → BFF URL points to correct host
3. Check CORS: BFF URL must match `CORS_ORIGINS` in prod

### Router shows "offline" in Nodes

1. NODA1 router might not be running: `docker ps | grep router`
2. Check `config/nodes_registry.yml` router_url
3. Override: `export NODES_NODA1_ROUTER_URL=http://:9102`

### STT/TTS not working

1. Check memory-service is running: `curl http://localhost:8000/health`
2. Check `MEMORY_SERVICE_URL` in BFF env
3. Check browser has microphone permission

### Alerts backend is "memory" (should be postgres)

In prod/staging, set:

```bash
export ALERT_BACKEND=postgres
```

Then restart the governance/router service.

### Cron jobs not running

```bash
# Check cron file
cat /etc/cron.d/daarion-governance

# Manual trigger (example)
cd /path/to/daarion && python3 -m services.router.risk_engine snapshot
```

---

## AISTALK Integration

See `docs/aistalk/contract.md` for full integration contract.
Quick enable:

```bash
export AISTALK_ENABLED=true
export AISTALK_URL=http://:PORT
# Restart BFF
```

Status check:

```bash
curl http://localhost:8002/api/status/full | jq .bff.aistalk_enabled
```

---

## Definition of Done Checklist

- [ ] `verify_sofiia_stack.py` PASS on NODA2 (dev)
- [ ] `verify_sofiia_stack.py` PASS on NODA1 (prod) — router + BFF + alerts=postgres
- [ ] `--compare-with` parity PASS between NODA1 and NODA2
- [ ] Nodes dashboard shows real-time data (online/latency/incidents)
- [ ] Ops tab: release_check runs and shows result
- [ ] Voice: STT → chat → TTS roundtrip works without looping
- [ ] WS Events tab shows `chat.reply`, `voice.stt`, `nodes.status`
- [ ] `SOFIIA_CONSOLE_API_KEY` set on NODA1 (prod)
- [ ] `ALERT_BACKEND=postgres` on NODA1 (prod)
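The last two checklist items are plain environment checks, so they could be automated before a deploy. The sketch below is an assumption-laden illustration: `check_bff_env` and `check_router_env` are hypothetical helpers, each meant to be run against the environment of the service that actually reads the variable, and the rules encode only what this runbook states (API key mandatory in prod, alerts backend must be `postgres`).

```python
import os
from typing import Mapping


def check_bff_env(env: Mapping[str, str]) -> list[str]:
    """Return problems found in a sofiia-console BFF environment."""
    problems = []
    # ENV defaults to "dev" per the environment variable table.
    if env.get("ENV", "dev") == "prod" and not env.get("SOFIIA_CONSOLE_API_KEY", ""):
        problems.append("SOFIIA_CONSOLE_API_KEY must be set when ENV=prod")
    return problems


def check_router_env(env: Mapping[str, str]) -> list[str]:
    """Return problems found in a production router/governance environment."""
    problems = []
    backend = env.get("ALERT_BACKEND", "")
    if backend != "postgres":
        problems.append(f"ALERT_BACKEND must be 'postgres' in production, got {backend!r}")
    return problems


if __name__ == "__main__":
    # Checks the current process environment; run on the relevant host.
    for problem in check_bff_env(os.environ) + check_router_env(os.environ):
        print(f"FAIL: {problem}")
```

Both functions take a mapping rather than reading `os.environ` directly, so the same checks can be applied to a parsed `.env` file or a Compose service definition.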