# Sofiia Control Plane — Operations Runbook
Version: 1.0

Date: 2026-02-25

---
## Architecture: Two-Plane Model
```
┌─────────────────────────────────┐      ┌─────────────────────────────────┐
│ NODA2 (MacBook)                 │      │ NODA1 (Production)              │
│ CONTROL PLANE                   │      │ RUNTIME PLANE                   │
│                                 │      │                                 │
│ sofiia-console BFF  :8002       │ ───→ │ router/gateway :8000/:9300      │
│ memory-service UI   :8000       │      │ postgres, qdrant stores         │
│ Ollama              :11434      │      │ cron jobs (governance)          │
│ WebSocket /ws/events            │      │ alert/incident/risk pipelines   │
│                                 │      │                                 │
│ Operator interacts here         │      │ Production traffic runs here    │
└─────────────────────────────────┘      └─────────────────────────────────┘
```
### Rule: All operator actions go through NODA2 BFF
The BFF on NODA2 proxies requests to the NODA1 router and governance services. Never call NODA1 directly from the browser.

---
## Environment Variables
### NODA2 (sofiia-console BFF)
| Variable | Default | Description |
|---|---|---|
| `PORT` | `8002` | BFF listen port |
| `ENV` | `dev` | `dev\|staging\|prod` — controls CORS strictness, auth enforcement |
| `SOFIIA_CONSOLE_API_KEY` | `""` | Bearer auth for write endpoints. Mandatory in prod. |
| `MEMORY_SERVICE_URL` | `http://localhost:8000` | Memory service URL (STT/TTS/memory) |
| `OLLAMA_URL` | `http://localhost:11434` | Ollama URL for local LLM |
| `CORS_ORIGINS` | `""` | Comma-separated allowed origins. Empty = `*` in dev. |
| `SUPERVISOR_API_KEY` | `""` | Key for router/governance calls |
| `NODES_POLL_INTERVAL_SEC` | `30` | How often BFF polls nodes for telemetry |
| `AISTALK_ENABLED` | `false` | Enable AISTALK adapter |
| `AISTALK_URL` | `""` | AISTALK bridge URL |
| `BUILD_ID` | `local` | Git SHA or build ID (set in CI/CD) |
| `CONFIG_DIR` | auto-detect | Path to `config/` directory with `nodes_registry.yml` |
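
The defaults and the "mandatory in prod" rule from the table above could be enforced at startup roughly as follows. This is a minimal sketch, not the BFF's actual loader: `BffConfig` and `load_bff_config` are hypothetical names, and leaving `cors_origins` empty outside dev when `CORS_ORIGINS` is unset is an assumption, since the table only defines the dev behaviour.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional


@dataclass(frozen=True)
class BffConfig:
    """Hypothetical container for a subset of the documented variables."""
    port: int
    env: str
    api_key: str
    memory_service_url: str
    ollama_url: str
    cors_origins: List[str]
    nodes_poll_interval_sec: int


def load_bff_config(environ: Optional[Mapping[str, str]] = None) -> BffConfig:
    """Read settings from the environment, applying the documented defaults."""
    e = os.environ if environ is None else environ

    env = e.get("ENV", "dev")
    if env not in {"dev", "staging", "prod"}:
        raise ValueError(f"ENV must be dev, staging or prod, got {env!r}")

    api_key = e.get("SOFIIA_CONSOLE_API_KEY", "")
    if env == "prod" and not api_key:
        raise ValueError("SOFIIA_CONSOLE_API_KEY is mandatory in prod")

    # Comma-separated list; an empty value means "*" in dev only.
    origins = [o.strip() for o in e.get("CORS_ORIGINS", "").split(",") if o.strip()]
    if not origins and env == "dev":
        origins = ["*"]

    return BffConfig(
        port=int(e.get("PORT", "8002")),
        env=env,
        api_key=api_key,
        memory_service_url=e.get("MEMORY_SERVICE_URL", "http://localhost:8000"),
        ollama_url=e.get("OLLAMA_URL", "http://localhost:11434"),
        cors_origins=origins,
        nodes_poll_interval_sec=int(e.get("NODES_POLL_INTERVAL_SEC", "30")),
    )
```

Passing a plain dict instead of `os.environ` keeps the function easy to exercise in tests, for example `load_bff_config({"ENV": "prod"})` raises because no API key is set.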
### NODA1 (router/governance)
| Variable | Description |
|---|---|
| `ALERT_BACKEND` | Must be `postgres` in production (not `memory`) |
| `AUDIT_BACKEND` | `auto\|jsonl\|postgres` |
| `GOV_CRON_FILE` | Path to cron file, default `/etc/cron.d/daarion-governance` |
---
## Starting Services
### NODA2 — Start BFF
```bash
cd services/sofiia-console
source .venv/bin/activate
uvicorn app.main:app --host 0.0.0.0 --port 8002 --reload
```
Or via Docker Compose:
```bash
docker-compose -f docker-compose.node2-sofiia.yml up -d
```
### NODA2 — Check status
```bash
curl http://localhost:8002/api/health
curl http://localhost:8002/api/status/full
```
Expected: `service: "sofiia-console"`, `version: "0.3.x"`.
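
For scripted checks, the same expectation can be asserted in Python. This is a sketch under one assumption: that `/api/health` returns a flat JSON object with top-level `service` and `version` keys. The function names are illustrative and not part of the BFF.

```python
import json
from urllib.request import urlopen


def is_expected_health(payload: dict) -> bool:
    """True if the payload identifies sofiia-console on a 0.3.x version."""
    return (
        payload.get("service") == "sofiia-console"
        and str(payload.get("version", "")).startswith("0.3.")
    )


def fetch_health(base_url: str = "http://localhost:8002", timeout: float = 5.0) -> dict:
    """GET <base_url>/api/health and decode the JSON body."""
    with urlopen(f"{base_url}/api/health", timeout=timeout) as resp:
        return json.load(resp)
```

With the BFF running, `is_expected_health(fetch_health())` should evaluate to `True`.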
### Accessing the UI
```
http://localhost:8000/ui ← memory-service serves sofiia-ui.html
```
The UI auto-connects to the BFF at `http://localhost:8002` (configurable in the Settings tab).

---
## Nodes Registry
Edit `config/nodes_registry.yml` to add/modify nodes:
```yaml
nodes:
  NODA1:
    label: "Production (NODA1)"
    router_url: "http://<noda1-ip>:9102"
    gateway_url: "http://<noda1-ip>:9300"
  NODA2:
    label: "Control Plane (NODA2)"
    router_url: "http://localhost:8000"
    monitor_url: "http://localhost:8000"
```
**Environment overrides** (no need to edit YAML in prod):
```bash
export NODES_NODA1_ROUTER_URL=http://10.0.0.5:9102
```
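The override precedence can be pictured with the sketch below. The variable name pattern `NODES_<NODE>_<FIELD>` is inferred from the single `NODES_NODA1_ROUTER_URL` example above; whether other fields such as `gateway_url` follow the same pattern is an assumption, and `resolve_node_url` is a hypothetical helper, not the BFF's code.

```python
import os
from typing import Mapping, Optional


def resolve_node_url(
    registry: dict,
    node_id: str,
    field: str,
    environ: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the env override if set, otherwise the value from the registry."""
    env = os.environ if environ is None else environ
    # Assumed naming scheme: NODES_<NODE>_<FIELD>, upper-cased.
    override = env.get(f"NODES_{node_id.upper()}_{field.upper()}")
    if override:
        return override
    return registry.get("nodes", {}).get(node_id, {}).get(field)


# Registry as it would look after parsing config/nodes_registry.yml.
registry = {"nodes": {"NODA1": {"router_url": "http://<noda1-ip>:9102"}}}
overrides = {"NODES_NODA1_ROUTER_URL": "http://10.0.0.5:9102"}

print(resolve_node_url(registry, "NODA1", "router_url", overrides))
# prints http://10.0.0.5:9102
```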
---
## Monitor Agent on Nodes
The BFF probes each node at `GET /monitor/status` (falls back to `/healthz`).
### Implementing `/monitor/status` on a node
Add this endpoint to the node's router or to a dedicated lightweight service. It should answer `GET /monitor/status` with `200 OK` and a JSON body of this shape:
```json
{
  "online": true,
  "ts": "2026-02-25T10:00:00Z",
  "node_id": "NODA1",
  "heartbeat_age_s": 5,
  "router": {"ok": true, "latency_ms": 12},
  "gateway": {"ok": true, "latency_ms": 8},
  "alerts_loop_slo": {
    "p95_ms": 320,
    "failed_rate": 0.0
  },
  "open_incidents": 2,
  "backends": {
    "alerts": "postgres",
    "audit": "auto",
    "incidents": "auto",
    "risk_history": "auto",
    "backlog": "auto"
  },
  "last_artifacts": {
    "risk_digest": "2026-02-24",
    "platform_digest": "2026-W08",
    "backlog": "2026-02-24"
  }
}
```
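The probe order stated at the top of this section (try `/monitor/status`, then fall back to `/healthz`) might look like the sketch below. It is illustrative only: `probe_node`, `http_fetch_json` and the `source`/`partial` keys are hypothetical, and the actual fields the BFF derives from `/healthz` are not documented here.

```python
import json
from typing import Callable, Optional
from urllib.request import urlopen


def http_fetch_json(url: str, timeout: float = 3.0) -> Optional[dict]:
    """Return the decoded JSON object, {} for a non-JSON 2xx body, None on failure."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            body = resp.read()
    except OSError:  # covers URLError, HTTPError and socket timeouts
        return None
    try:
        data = json.loads(body)
    except ValueError:
        return {}
    return data if isinstance(data, dict) else {}


def probe_node(
    base_url: str,
    fetch_json: Callable[[str], Optional[dict]] = http_fetch_json,
) -> dict:
    """Prefer the full monitor payload; fall back to a partial record."""
    status = fetch_json(f"{base_url}/monitor/status")
    if status:
        return {**status, "source": "monitor", "partial": False}

    health = fetch_json(f"{base_url}/healthz")
    if health is not None:
        # Only liveness is known here; everything else stays absent.
        return {"online": True, "source": "healthz", "partial": True}

    return {"online": False, "source": "none", "partial": True}
```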
If `/monitor/status` is not available, the BFF synthesises partial data from `/healthz`.

---
## Parity Verification
Run after every deploy to both nodes:
```bash
# NODA2 alone
python3 ops/scripts/verify_sofiia_stack.py \
  --node NODA2 \
  --bff-url http://localhost:8002 \
  --router-url http://localhost:8000 \
  --env dev

# NODA1 from NODA2 (parity check)
python3 ops/scripts/verify_sofiia_stack.py \
  --node NODA1 \
  --bff-url http://<noda1>:8002 \
  --router-url http://<noda1>:9102 \
  --compare-with http://localhost:8002 \
  --compare-node NODA2 \
  --env prod

# JSON output for CI
python3 ops/scripts/verify_sofiia_stack.py --json | jq .pass
```
Exit 0 = PASS. Exit 1 = critical failure.
### Critical PASS requirements (prod)
- `router_health` — router responds 200
- `bff_health` — BFF identifies as `sofiia-console`
- `bff_status_full` — router + memory reachable
- `alerts_backend != memory` — must be postgres in prod/staging
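
The way these requirements map to the exit code can be sketched as below. This is not the code of `verify_sofiia_stack.py`: `evaluate` is a hypothetical function, and treating a missing check result as a failure is an assumption.

```python
from typing import Dict, List, Tuple

# Check names taken from the list above.
CRITICAL_CHECKS = ("router_health", "bff_health", "bff_status_full")


def evaluate(results: Dict[str, bool], alerts_backend: str, env: str) -> Tuple[int, List[str]]:
    """Return (exit_code, failed_checks): 0 = PASS, 1 = critical failure."""
    # A check that is absent from the results counts as failed.
    failures = [name for name in CRITICAL_CHECKS if not results.get(name, False)]

    # The in-memory alerts backend is only acceptable in dev.
    if env in {"prod", "staging"} and alerts_backend == "memory":
        failures.append("alerts_backend")

    return (1 if failures else 0), failures


ok = {"router_health": True, "bff_health": True, "bff_status_full": True}
print(evaluate(ok, "postgres", "prod"))  # prints (0, [])
print(evaluate(ok, "memory", "prod"))    # prints (1, ['alerts_backend'])
```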
---
## WebSocket Events
Connect to WS for real-time monitoring:
```bash
# Using wscat (npm install -g wscat)
wscat -c ws://localhost:8002/ws/events
# Or via Python
python3 -c "
import asyncio, json, websockets

async def f():
    async with websockets.connect('ws://localhost:8002/ws/events') as ws:
        async for msg in ws:
            print(json.loads(msg)['type'])

asyncio.run(f())
"
```
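A consumer that does more than print usually branches on each event's `type` field. The sketch below assumes, as the one-liner above does, that every message is a JSON object with a `type` key; `parse_event` and the `unknown`/`malformed` labels are hypothetical and not part of the BFF protocol.

```python
import json
from typing import Tuple

# Types the BFF is documented to emit on /ws/events.
KNOWN_EVENT_TYPES = frozenset({
    "chat.message", "chat.reply", "voice.stt", "voice.tts",
    "ops.run", "nodes.status", "error",
})


def parse_event(raw: str) -> Tuple[str, dict]:
    """Return (kind, payload); kind is a known type, 'unknown' or 'malformed'."""
    try:
        event = json.loads(raw)
    except ValueError:
        return "malformed", {}
    if not isinstance(event, dict) or "type" not in event:
        return "malformed", {}
    kind = event["type"]
    if kind not in KNOWN_EVENT_TYPES:
        # Keep the payload so new server-side types can still be logged.
        return "unknown", event
    return kind, event


kind, payload = parse_event('{"type": "nodes.status", "node_id": "NODA1"}')
print(kind)  # prints nodes.status
```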
Event types: `chat.message`, `chat.reply`, `voice.stt`, `voice.tts`, `ops.run`, `nodes.status`, `error`.

---
## Troubleshooting
### BFF won't start: `ModuleNotFoundError`
```bash
pip install -r services/sofiia-console/requirements.txt
```
### UI shows "BFF: ✗"
1. Check BFF is running: `curl http://localhost:8002/api/health`
2. Check Settings tab → BFF URL points to correct host
3. Check CORS: in prod, the origin the UI is served from must be listed in the BFF's `CORS_ORIGINS`
### Router shows "offline" in Nodes
1. NODA1 router might not be running: `docker ps | grep router`
2. Check `config/nodes_registry.yml` router_url
3. Override: `export NODES_NODA1_ROUTER_URL=http://<correct-ip>:9102`
### STT/TTS not working
1. Check memory-service is running: `curl http://localhost:8000/health`
2. Check `MEMORY_SERVICE_URL` in BFF env
3. Check browser has microphone permission
### Alerts backend is "memory" (should be postgres)
In prod/staging, set:
```bash
export ALERT_BACKEND=postgres
```
Then restart the governance/router service.

---
```bash
# Check cron file
cat /etc/cron.d/daarion-governance
# Manual trigger (example)
cd /path/to/daarion && python3 -m services.router.risk_engine snapshot
```
---
## AISTALK Integration
See `docs/aistalk/contract.md` for full integration contract.
Quick enable:
```bash
export AISTALK_ENABLED=true
export AISTALK_URL=http://<aistalk-bridge>:PORT
# Restart BFF
```
Status check:
```bash
curl http://localhost:8002/api/status/full | jq .bff.aistalk_enabled
```
---
## Definition of Done Checklist
- [ ] `verify_sofiia_stack.py` PASS on NODA2 (dev)
- [ ] `verify_sofiia_stack.py` PASS on NODA1 (prod) — router + BFF + alerts=postgres
- [ ] `--compare-with` parity PASS between NODA1 and NODA2
- [ ] Nodes dashboard shows real-time data (online/latency/incidents)
- [ ] Ops tab: release_check runs and shows result
- [ ] Voice: STT → chat → TTS roundtrip works without looping
- [ ] WS Events tab shows `chat.reply`, `voice.stt`, `nodes.status`
- [ ] `SOFIIA_CONSOLE_API_KEY` set on NODA1 (prod)
- [ ] `ALERT_BACKEND=postgres` on NODA1 (prod)