Sofiia Control Plane — Operations Runbook
Version: 1.0
Date: 2026-02-25
Architecture: Two-Plane Model
┌─────────────────────────────────┐     ┌─────────────────────────────────┐
│  NODA2 (MacBook)                │     │  NODA1 (Production)             │
│  CONTROL PLANE                  │     │  RUNTIME PLANE                  │
│                                 │     │                                 │
│  sofiia-console BFF :8002       ├────→│  router/gateway :8000/:9300     │
│  memory-service UI :8000        │     │  postgres, qdrant stores        │
│  Ollama :11434                  │     │  cron jobs (governance)         │
│  WebSocket /ws/events           │     │  alert/incident/risk pipelines  │
│                                 │     │                                 │
│  Operator interacts here        │     │  Production traffic runs here   │
└─────────────────────────────────┘     └─────────────────────────────────┘
Rule: All operator actions go through NODA2 BFF
The BFF on NODA2 proxies requests to NODA1 router/governance. You never call NODA1 directly from the browser.
Environment Variables
NODA2 (sofiia-console BFF)
| Variable | Default | Description |
|---|---|---|
| PORT | 8002 | BFF listen port |
| ENV | dev | One of dev, staging, prod; controls CORS strictness and auth enforcement |
| SOFIIA_CONSOLE_API_KEY | "" | Bearer auth for write endpoints. Mandatory in prod. |
| MEMORY_SERVICE_URL | http://localhost:8000 | Memory service URL (STT/TTS/memory) |
| OLLAMA_URL | http://localhost:11434 | Ollama URL for local LLM |
| CORS_ORIGINS | "" | Comma-separated allowed origins. Empty = * in dev. |
| SUPERVISOR_API_KEY | "" | Key for router/governance calls |
| NODES_POLL_INTERVAL_SEC | 30 | How often the BFF polls nodes for telemetry, in seconds |
| AISTALK_ENABLED | false | Enable AISTALK adapter |
| AISTALK_URL | "" | AISTALK bridge URL |
| BUILD_ID | local | Git SHA or build ID (set in CI/CD) |
| CONFIG_DIR | auto-detect | Path to config/ directory with nodes_registry.yml |
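As a rough illustration of how the defaults and the prod-only rule in the table could be applied, the sketch below reads the variables from a mapping. Only the variable names and defaults come from the table; `BffConfig` and `load_bff_config` are hypothetical names, and the choice to leave CORS origins empty outside dev is an assumption, not the BFF's confirmed behaviour.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class BffConfig:  # hypothetical container, not the BFF's real settings class
    port: int
    env: str
    api_key: str
    memory_service_url: str
    ollama_url: str
    cors_origins: List[str]
    nodes_poll_interval_sec: int


def load_bff_config(environ: Mapping[str, str] = os.environ) -> BffConfig:
    """Apply the documented defaults, then enforce the prod-only rule."""
    env = environ.get("ENV", "dev")
    if env not in ("dev", "staging", "prod"):
        raise ValueError(f"ENV must be dev, staging or prod, got {env!r}")

    api_key = environ.get("SOFIIA_CONSOLE_API_KEY", "")
    if env == "prod" and not api_key:
        raise ValueError("SOFIIA_CONSOLE_API_KEY is mandatory in prod")

    raw_origins = environ.get("CORS_ORIGINS", "")
    origins = [o.strip() for o in raw_origins.split(",") if o.strip()]
    if not origins and env == "dev":
        origins = ["*"]  # documented: empty CORS_ORIGINS means "*" in dev
    # Assumption: outside dev, an empty list means no cross-origin access.

    return BffConfig(
        port=int(environ.get("PORT", "8002")),
        env=env,
        api_key=api_key,
        memory_service_url=environ.get("MEMORY_SERVICE_URL", "http://localhost:8000"),
        ollama_url=environ.get("OLLAMA_URL", "http://localhost:11434"),
        cors_origins=origins,
        nodes_poll_interval_sec=int(environ.get("NODES_POLL_INTERVAL_SEC", "30")),
    )
```

Passing a plain dict instead of `os.environ` makes it easy to check a planned prod environment before restarting the BFF.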
NODA1 (router/governance)
| Variable | Description |
|---|---|
| ALERT_BACKEND | Must be postgres in production (not memory) |
| AUDIT_BACKEND | One of auto, jsonl, postgres |
| GOV_CRON_FILE | Path to cron file, default /etc/cron.d/daarion-governance |
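The NODA1 constraints can be checked before a deploy with a few lines of Python. This is a minimal sketch, assuming the values are available as a mapping; `check_noda1_env` is a hypothetical helper, and treating an unset AUDIT_BACKEND as acceptable is an assumption, since the table does not state a default.

```python
from typing import List, Mapping

VALID_AUDIT_BACKENDS = {"auto", "jsonl", "postgres"}


def check_noda1_env(environ: Mapping[str, str], env: str) -> List[str]:
    """Return a list of problems; an empty list means the settings look fine."""
    problems = []

    # The runbook requires postgres for alerts in prod (and staging).
    alert_backend = environ.get("ALERT_BACKEND", "")
    if env in ("prod", "staging") and alert_backend != "postgres":
        problems.append(
            f"ALERT_BACKEND must be postgres in {env}, got {alert_backend!r}"
        )

    # Assumption: AUDIT_BACKEND may be unset; validate it only when present.
    audit_backend = environ.get("AUDIT_BACKEND")
    if audit_backend is not None and audit_backend not in VALID_AUDIT_BACKENDS:
        problems.append(
            f"AUDIT_BACKEND must be one of {sorted(VALID_AUDIT_BACKENDS)}, "
            f"got {audit_backend!r}"
        )
    return problems
```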
Starting Services
NODA2 — Start BFF
cd services/sofiia-console
source .venv/bin/activate
uvicorn app.main:app --host 0.0.0.0 --port 8002 --reload
Or via Docker Compose:
docker-compose -f docker-compose.node2-sofiia.yml up -d
NODA2 — Check status
curl http://localhost:8002/api/health
curl http://localhost:8002/api/status/full
Expected: service: "sofiia-console", version: "0.3.x".
Accessing the UI
http://localhost:8000/ui ← memory-service serves sofiia-ui.html
The UI auto-connects to BFF at http://localhost:8002 (configurable in Settings tab).
Nodes Registry
Edit config/nodes_registry.yml to add/modify nodes:
nodes:
  NODA1:
    label: "Production (NODA1)"
    router_url: "http://<noda1-ip>:9102"
    gateway_url: "http://<noda1-ip>:9300"
  NODA2:
    label: "Control Plane (NODA2)"
    router_url: "http://localhost:8000"
    monitor_url: "http://localhost:8000"
Environment overrides (no need to edit YAML in prod):
export NODES_NODA1_ROUTER_URL=http://10.0.0.5:9102
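The override mechanism can be pictured as a merge of environment values over the parsed YAML. The sketch below assumes the registry has already been loaded into a dict and that the naming pattern `NODES_<NODE>_<KEY>` generalises beyond the one confirmed example, `NODES_NODA1_ROUTER_URL`. The function name `apply_node_overrides` is hypothetical.

```python
import copy
from typing import Mapping


def apply_node_overrides(registry: dict, environ: Mapping[str, str]) -> dict:
    """Return a copy of the registry with NODES_<NODE>_<KEY> values applied."""
    result = copy.deepcopy(registry)  # never mutate the parsed YAML in place
    for node_id, node in result.get("nodes", {}).items():
        for key in list(node):
            # e.g. node "NODA1", key "router_url" -> NODES_NODA1_ROUTER_URL
            var = f"NODES_{node_id}_{key}".upper()
            if var in environ:
                node[key] = environ[var]
    return result


registry = {
    "nodes": {
        "NODA1": {
            "label": "Production (NODA1)",
            "router_url": "http://<noda1-ip>:9102",
        }
    }
}
merged = apply_node_overrides(
    registry, {"NODES_NODA1_ROUTER_URL": "http://10.0.0.5:9102"}
)
print(merged["nodes"]["NODA1"]["router_url"])
```

In this sketch only keys already present in the YAML can be overridden, which keeps a typo in a variable name from silently adding a new field.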
Monitor Agent on Nodes
The BFF probes each node at GET /monitor/status (falls back to /healthz).
Implementing /monitor/status on a node
Add this endpoint to the node's router or a dedicated lightweight service:
GET /monitor/status → 200 OK
{
  "online": true,
  "ts": "2026-02-25T10:00:00Z",
  "node_id": "NODA1",
  "heartbeat_age_s": 5,
  "router": {"ok": true, "latency_ms": 12},
  "gateway": {"ok": true, "latency_ms": 8},
  "alerts_loop_slo": {
    "p95_ms": 320,
    "failed_rate": 0.0
  },
  "open_incidents": 2,
  "backends": {
    "alerts": "postgres",
    "audit": "auto",
    "incidents": "auto",
    "risk_history": "auto",
    "backlog": "auto"
  },
  "last_artifacts": {
    "risk_digest": "2026-02-24",
    "platform_digest": "2026-W08",
    "backlog": "2026-02-24"
  }
}
If /monitor/status is not available, BFF synthesises partial data from /healthz.
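The probe-then-fallback behaviour described above might look roughly like the following. This is a sketch, not the BFF's actual code: `probe_node` and `http_get_json` are hypothetical names, the `source` and `partial` fields are invented for illustration, and it assumes both endpoints return a JSON object.

```python
import json
import urllib.request
from typing import Callable, Optional


def http_get_json(url: str, timeout: float = 3.0) -> Optional[dict]:
    """Return the decoded JSON object, or None on any network or parse failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError):  # URLError/HTTPError/timeouts are OSError
        return None
    return body if isinstance(body, dict) else None


def probe_node(
    base_url: str,
    fetch: Callable[[str], Optional[dict]] = http_get_json,
) -> dict:
    """Try /monitor/status first, then fall back to /healthz."""
    base = base_url.rstrip("/")

    status = fetch(f"{base}/monitor/status")
    if status is not None:
        return {**status, "source": "monitor"}

    health = fetch(f"{base}/healthz")
    if health is not None:
        # Only liveness is known here, so the result is marked as partial.
        return {"online": True, "source": "healthz", "partial": True}

    return {"online": False, "source": "none"}
```

Injecting `fetch` keeps the fallback logic testable without a running node.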
Parity Verification
Run after every deploy to both nodes:
# NODA2 alone
python3 ops/scripts/verify_sofiia_stack.py \
--node NODA2 \
--bff-url http://localhost:8002 \
--router-url http://localhost:8000 \
--env dev
# NODA1 from NODA2 (parity check)
python3 ops/scripts/verify_sofiia_stack.py \
--node NODA1 \
--bff-url http://<noda1>:8002 \
--router-url http://<noda1>:9102 \
--compare-with http://localhost:8002 \
--compare-node NODA2 \
--env prod
# JSON output for CI
python3 ops/scripts/verify_sofiia_stack.py --json | jq .pass
Exit 0 = PASS. Exit 1 = critical failure.
Critical PASS requirements (prod)
- router_health: router responds 200
- bff_health: BFF identifies as sofiia-console
- bff_status_full: router + memory reachable
- alerts_backend != memory: must be postgres in prod/staging
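The four requirements and the exit-code convention can be summarised in a few lines. This sketch does not reproduce verify_sofiia_stack.py; the function names, the keyword arguments, and the `alerts_backend` key are assumptions chosen to mirror the list above.

```python
from typing import Dict


def evaluate_critical_checks(
    *,
    router_http_status: int,
    bff_service_name: str,
    router_reachable: bool,
    memory_reachable: bool,
    alerts_backend: str,
    env: str,
) -> Dict[str, bool]:
    """Map each critical check name to True (pass) or False (fail)."""
    checks = {
        "router_health": router_http_status == 200,
        "bff_health": bff_service_name == "sofiia-console",
        "bff_status_full": router_reachable and memory_reachable,
    }
    # The backend rule applies only to prod and staging.
    if env in ("prod", "staging"):
        checks["alerts_backend"] = alerts_backend == "postgres"
    return checks


def exit_code(checks: Dict[str, bool]) -> int:
    """Exit 0 = PASS, exit 1 = at least one critical failure."""
    return 0 if all(checks.values()) else 1


result = evaluate_critical_checks(
    router_http_status=200,
    bff_service_name="sofiia-console",
    router_reachable=True,
    memory_reachable=True,
    alerts_backend="memory",
    env="prod",
)
print(result, exit_code(result))
```

With the values above, the only failing check is `alerts_backend`, so the computed exit code is 1.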
WebSocket Events
Connect to WS for real-time monitoring:
# Using wscat (npm install -g wscat)
wscat -c ws://localhost:8002/ws/events
# Or via Python
python3 -c "
import asyncio, json, websockets

async def f():
    # Print the type of every event until the connection closes
    async with websockets.connect('ws://localhost:8002/ws/events') as ws:
        async for msg in ws:
            print(json.loads(msg)['type'])

asyncio.run(f())
"
Event types: chat.message, chat.reply, voice.stt, voice.tts, ops.run, nodes.status, error.
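When watching the stream for a while, a per-type tally is often more useful than a raw log. The sketch below counts messages by the event types listed above. It assumes each message is a JSON object with a `type` field, as the Python client above implies; `tally_events` and the `<invalid>` and `<unknown>` buckets are hypothetical.

```python
import json
from collections import Counter
from typing import Iterable

KNOWN_EVENT_TYPES = {
    "chat.message", "chat.reply", "voice.stt", "voice.tts",
    "ops.run", "nodes.status", "error",
}


def tally_events(raw_messages: Iterable[str]) -> Counter:
    """Count messages per event type; bad or unlisted types get their own bucket."""
    counts: Counter = Counter()
    for raw in raw_messages:
        try:
            event = json.loads(raw)
        except ValueError:
            counts["<invalid>"] += 1
            continue
        event_type = event.get("type") if isinstance(event, dict) else None
        if event_type in KNOWN_EVENT_TYPES:
            counts[event_type] += 1
        else:
            counts["<unknown>"] += 1
    return counts


sample = [
    '{"type": "chat.reply"}',
    '{"type": "nodes.status"}',
    '{"type": "nodes.status"}',
    'not json',
    '{"type": "something.else"}',
]
print(tally_events(sample))
```

A growing `<unknown>` count would suggest the BFF emits event types that this list does not yet cover.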
Troubleshooting
BFF won't start: ModuleNotFoundError
pip install -r services/sofiia-console/requirements.txt
UI shows "BFF: ✗"
- Check BFF is running: curl http://localhost:8002/api/health
- Check Settings tab → BFF URL points to correct host
- Check CORS: BFF URL must match CORS_ORIGINS in prod
Router shows "offline" in Nodes
- NODA1 router might not be running: docker ps | grep router
- Check router_url in config/nodes_registry.yml
- Override: export NODES_NODA1_ROUTER_URL=http://<correct-ip>:9102
STT/TTS not working
- Check memory-service is running: curl http://localhost:8000/health
- Check MEMORY_SERVICE_URL in BFF env
- Check browser has microphone permission
Alerts backend is "memory" (should be postgres)
In prod/staging, set:
export ALERT_BACKEND=postgres
Then restart the governance/router service.
Cron jobs not running
# Check cron file
cat /etc/cron.d/daarion-governance
# Manual trigger (example)
cd /path/to/daarion && python3 -m services.router.risk_engine snapshot
AISTALK Integration
See docs/aistalk/contract.md for full integration contract.
Quick enable:
export AISTALK_ENABLED=true
export AISTALK_URL=http://<aistalk-bridge>:PORT
# Restart BFF
Status check:
curl http://localhost:8002/api/status/full | jq .bff.aistalk_enabled
Definition of Done Checklist
- verify_sofiia_stack.py PASS on NODA2 (dev)
- verify_sofiia_stack.py PASS on NODA1 (prod): router + BFF + alerts=postgres
- --compare-with parity PASS between NODA1 and NODA2
- Nodes dashboard shows real-time data (online/latency/incidents)
- Ops tab: release_check runs and shows result
- Voice: STT → chat → TTS roundtrip works without looping
- WS Events tab shows chat.reply, voice.stt, nodes.status
- SOFIIA_CONSOLE_API_KEY set on NODA1 (prod)
- ALERT_BACKEND=postgres on NODA1 (prod)