# Sofiia Control Plane: Operations Runbook

Version: 1.0
Date: 2026-02-25

---

## Architecture: Two-Plane Model

```
┌─────────────────────────────────┐    ┌─────────────────────────────────┐
│  NODA2 (MacBook)                │    │  NODA1 (Production)             │
│  CONTROL PLANE                  │    │  RUNTIME PLANE                  │
│                                 │    │                                 │
│  sofiia-console BFF :8002  ────────→ │  router/gateway :8000/:9300     │
│  memory-service UI :8000        │    │  postgres, qdrant stores        │
│  Ollama :11434                  │    │  cron jobs (governance)         │
│  WebSocket /ws/events           │    │  alert/incident/risk pipelines  │
│                                 │    │                                 │
│  Operator interacts here        │    │  Production traffic runs here   │
└─────────────────────────────────┘    └─────────────────────────────────┘
```

### Rule: All operator actions go through NODA2 BFF

The BFF on NODA2 proxies requests to the NODA1 router/governance services. Never call NODA1 directly from the browser.

---

## Environment Variables

### NODA2 (sofiia-console BFF)

| Variable | Default | Description |
|---|---|---|
| `PORT` | `8002` | BFF listen port |
| `ENV` | `dev` | `dev\|staging\|prod`; controls CORS strictness and auth enforcement |
| `SOFIIA_CONSOLE_API_KEY` | `""` | Bearer auth for write endpoints. Mandatory in prod. |
| `MEMORY_SERVICE_URL` | `http://localhost:8000` | Memory service URL (STT/TTS/memory) |
| `OLLAMA_URL` | `http://localhost:11434` | Ollama URL for local LLM |
| `CORS_ORIGINS` | `""` | Comma-separated allowed origins. Empty = `*` in dev. |
| `SUPERVISOR_API_KEY` | `""` | Key for router/governance calls |
| `NODES_POLL_INTERVAL_SEC` | `30` | How often the BFF polls nodes for telemetry, in seconds |
| `AISTALK_ENABLED` | `false` | Enable AISTALK adapter |
| `AISTALK_URL` | `""` | AISTALK bridge URL |
| `BUILD_ID` | `local` | Git SHA or build ID (set in CI/CD) |
| `CONFIG_DIR` | auto-detect | Path to `config/` directory with `nodes_registry.yml` |

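As an illustration of how the defaults in the table could be applied, here is a minimal sketch of a settings loader. It is not the BFF's actual code: `BffSettings` and `load_settings` are hypothetical names, only a subset of the variables is covered, and treating an empty `CORS_ORIGINS` outside dev as "no cross-origin access" is an assumption, since the table only defines the dev behaviour.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class BffSettings:
    """Hypothetical container for a subset of the documented variables."""
    port: int
    env: str
    api_key: str
    memory_service_url: str
    ollama_url: str
    cors_origins: List[str]
    nodes_poll_interval_sec: int


def load_settings(environ: Mapping[str, str] = os.environ) -> BffSettings:
    env = environ.get("ENV", "dev")
    if env not in {"dev", "staging", "prod"}:
        raise ValueError(f"ENV must be dev, staging or prod, got {env!r}")

    # The table marks the API key as mandatory in prod.
    api_key = environ.get("SOFIIA_CONSOLE_API_KEY", "")
    if env == "prod" and not api_key:
        raise ValueError("SOFIIA_CONSOLE_API_KEY is mandatory when ENV=prod")

    # Comma-separated list; empty means "*" in dev.
    # Assumption: outside dev, an empty list means no cross-origin access.
    raw_origins = environ.get("CORS_ORIGINS", "")
    origins = [o.strip() for o in raw_origins.split(",") if o.strip()]
    if not origins and env == "dev":
        origins = ["*"]

    return BffSettings(
        port=int(environ.get("PORT", "8002")),
        env=env,
        api_key=api_key,
        memory_service_url=environ.get("MEMORY_SERVICE_URL", "http://localhost:8000"),
        ollama_url=environ.get("OLLAMA_URL", "http://localhost:11434"),
        cors_origins=origins,
        nodes_poll_interval_sec=int(environ.get("NODES_POLL_INTERVAL_SEC", "30")),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests, for example `load_settings({"ENV": "prod"})` raises because the API key is missing.
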
### NODA1 (router/governance)

| Variable | Description |
|---|---|
| `ALERT_BACKEND` | Must be `postgres` in production (not `memory`) |
| `AUDIT_BACKEND` | `auto\|jsonl\|postgres` |
| `GOV_CRON_FILE` | Path to cron file, default `/etc/cron.d/daarion-governance` |

---

## Starting Services

### NODA2: Start BFF

```bash
cd services/sofiia-console
source .venv/bin/activate
uvicorn app.main:app --host 0.0.0.0 --port 8002 --reload
```

Or via Docker Compose:

```bash
docker-compose -f docker-compose.node2-sofiia.yml up -d
```

### NODA2: Check status

```bash
curl http://localhost:8002/api/health
curl http://localhost:8002/api/status/full
```

Expected: `service: "sofiia-console"`, `version: "0.3.x"`.

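The expected values above can also be checked programmatically. The sketch below assumes that `/api/health` returns a JSON object with top-level `service` and `version` fields; the runbook does not spell out the exact response shape, so treat the field layout and the helper name `check_health` as assumptions.

```python
import json
from typing import List


def check_health(body: str,
                 expected_service: str = "sofiia-console",
                 version_prefix: str = "0.3.") -> List[str]:
    """Return a list of problems found in a health response body (empty = OK)."""
    data = json.loads(body)
    if not isinstance(data, dict):
        return ["response is not a JSON object"]

    problems = []
    if data.get("service") != expected_service:
        problems.append(f"unexpected service: {data.get('service')!r}")
    # "0.3.x" in the runbook is read here as "any version starting with 0.3."
    if not str(data.get("version", "")).startswith(version_prefix):
        problems.append(f"unexpected version: {data.get('version')!r}")
    return problems
```

For example, `check_health('{"service": "sofiia-console", "version": "0.3.2"}')` returns an empty list, while a response from a different service yields one entry per mismatch.
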
### Accessing the UI

```
http://localhost:8000/ui   ← memory-service serves sofiia-ui.html
```

The UI auto-connects to the BFF at `http://localhost:8002` (configurable in the Settings tab).

---

## Nodes Registry

Edit `config/nodes_registry.yml` to add or modify nodes:

```yaml
nodes:
  NODA1:
    label: "Production (NODA1)"
    router_url: "http://<noda1-ip>:9102"
    gateway_url: "http://<noda1-ip>:9300"

  NODA2:
    label: "Control Plane (NODA2)"
    router_url: "http://localhost:8000"
    monitor_url: "http://localhost:8000"
```

**Environment overrides** (no need to edit YAML in prod):

```bash
export NODES_NODA1_ROUTER_URL=http://10.0.0.5:9102
```

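One way such an override could be layered on top of the parsed registry is sketched below. Only `NODES_NODA1_ROUTER_URL` is documented here; the generalised `NODES_<NODE_ID>_<FIELD>` pattern, the list of overridable fields, and the function name `apply_env_overrides` are assumptions made for illustration.

```python
import os
from typing import Any, Dict, Mapping

# Assumption: these are the URL fields that may be overridden per node.
OVERRIDABLE_FIELDS = ("router_url", "gateway_url", "monitor_url")


def apply_env_overrides(registry: Dict[str, Any],
                        environ: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Return a copy of the registry with NODES_<NODE_ID>_<FIELD> overrides applied."""
    nodes = {}
    for node_id, cfg in registry.get("nodes", {}).items():
        merged = dict(cfg)  # copy so the parsed YAML is left untouched
        for field in OVERRIDABLE_FIELDS:
            value = environ.get(f"NODES_{node_id.upper()}_{field.upper()}")
            if value:
                merged[field] = value
        nodes[node_id] = merged
    return {"nodes": nodes}
```

With the registry above loaded into a dict and `NODES_NODA1_ROUTER_URL=http://10.0.0.5:9102` exported, the returned copy carries the new `router_url` for NODA1 while `gateway_url` keeps its YAML value.
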

---

## Monitor Agent on Nodes

The BFF probes each node at `GET /monitor/status` (falls back to `/healthz`).

### Implementing `/monitor/status` on a node

Add this endpoint to the node's router or a dedicated lightweight service:

```json
GET /monitor/status → 200 OK
{
  "online": true,
  "ts": "2026-02-25T10:00:00Z",
  "node_id": "NODA1",
  "heartbeat_age_s": 5,
  "router": {"ok": true, "latency_ms": 12},
  "gateway": {"ok": true, "latency_ms": 8},
  "alerts_loop_slo": {
    "p95_ms": 320,
    "failed_rate": 0.0
  },
  "open_incidents": 2,
  "backends": {
    "alerts": "postgres",
    "audit": "auto",
    "incidents": "auto",
    "risk_history": "auto",
    "backlog": "auto"
  },
  "last_artifacts": {
    "risk_digest": "2026-02-24",
    "platform_digest": "2026-W08",
    "backlog": "2026-02-24"
  }
}
```

If `/monitor/status` is not available, the BFF synthesises partial data from `/healthz`.

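The probe-with-fallback behaviour described above could look roughly like the following sketch. It is not the BFF's implementation: `probe_node`, the injectable `get` function, and the extra `source` field in the result are hypothetical, and the synthesised fallback payload is reduced to the fields that can be known from a bare `/healthz` success.

```python
import json
import urllib.request
from typing import Any, Callable, Dict


def http_get(url: str, timeout: float = 3.0) -> bytes:
    """Fetch a URL; urlopen raises an OSError subclass on HTTP or network errors."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def probe_node(node_id: str, base_url: str,
               get: Callable[[str], bytes] = http_get) -> Dict[str, Any]:
    base = base_url.rstrip("/")

    # Preferred path: full telemetry from /monitor/status.
    try:
        status = json.loads(get(f"{base}/monitor/status"))
        if isinstance(status, dict):
            return {**status, "source": "monitor"}
    except (OSError, ValueError):
        pass  # endpoint missing, unreachable, or not valid JSON

    # Fallback: /healthz only tells us whether the node answers at all.
    try:
        get(f"{base}/healthz")
    except OSError:
        return {"online": False, "node_id": node_id, "source": "none"}
    return {"online": True, "node_id": node_id, "source": "healthz"}
```

Injecting `get` keeps the function testable without a running node; in production use the default `http_get` would be left in place.
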

---

## Parity Verification

Run after every deploy to both nodes:

```bash
# NODA2 alone
python3 ops/scripts/verify_sofiia_stack.py \
  --node NODA2 \
  --bff-url http://localhost:8002 \
  --router-url http://localhost:8000 \
  --env dev

# NODA1 from NODA2 (parity check)
python3 ops/scripts/verify_sofiia_stack.py \
  --node NODA1 \
  --bff-url http://<noda1>:8002 \
  --router-url http://<noda1>:9102 \
  --compare-with http://localhost:8002 \
  --compare-node NODA2 \
  --env prod

# JSON output for CI
python3 ops/scripts/verify_sofiia_stack.py --json | jq .pass
```

Exit 0 = PASS. Exit 1 = critical failure.

### Critical PASS requirements (prod)

- `router_health`: router responds 200
- `bff_health`: BFF identifies as `sofiia-console`
- `bff_status_full`: router + memory reachable
- `alerts_backend != memory`: must be postgres in prod/staging

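The requirements listed above can be read as a simple gate over individual check results. The sketch below shows that reading; it is not taken from `verify_sofiia_stack.py`, and the function names, the shape of the `results` dict, and the report format are all assumptions.

```python
from typing import Any, Dict, Mapping

# Check names as listed in the runbook.
CRITICAL_CHECKS = ("router_health", "bff_health", "bff_status_full")


def evaluate(results: Mapping[str, bool], alerts_backend: str, env: str) -> Dict[str, Any]:
    """Combine per-check booleans into a pass/fail report."""
    failures = [name for name in CRITICAL_CHECKS if not results.get(name, False)]
    # The in-memory alerts backend is only rejected in prod/staging.
    if env in ("prod", "staging") and alerts_backend == "memory":
        failures.append("alerts_backend")
    return {"pass": not failures, "failures": failures}


def exit_code(report: Mapping[str, Any]) -> int:
    """Exit 0 = PASS, exit 1 = critical failure."""
    return 0 if report["pass"] else 1
```

Under this reading, a dev run with `alerts_backend="memory"` still passes as long as the three health checks succeed, while the same configuration fails in prod.
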

---

## WebSocket Events

Connect to the WebSocket endpoint for real-time monitoring:

```bash
# Using wscat (npm install -g wscat)
wscat -c ws://localhost:8002/ws/events

# Or via Python (requires the websockets package)
python3 -c "
import asyncio, json, websockets
async def f():
    async with websockets.connect('ws://localhost:8002/ws/events') as ws:
        async for msg in ws:
            print(json.loads(msg)['type'])
asyncio.run(f())
"
```

Event types: `chat.message`, `chat.reply`, `voice.stt`, `voice.tts`, `ops.run`, `nodes.status`, `error`.

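When watching the stream for longer periods, it can help to tally messages by type instead of printing each one. The sketch below works on raw message strings so it can be fed from any client; it assumes each message is a JSON object with a `type` field (as the one-liner above does), and `classify` and `tally` are hypothetical helper names.

```python
import json
from collections import Counter
from typing import Iterable

# Event types as listed in the runbook.
KNOWN_EVENT_TYPES = {
    "chat.message", "chat.reply", "voice.stt", "voice.tts",
    "ops.run", "nodes.status", "error",
}


def classify(raw: str) -> str:
    """Map a raw WebSocket message to its event type, 'unknown', or 'invalid'."""
    try:
        event = json.loads(raw)
    except ValueError:
        return "invalid"  # not JSON at all
    event_type = event.get("type") if isinstance(event, dict) else None
    return event_type if event_type in KNOWN_EVENT_TYPES else "unknown"


def tally(messages: Iterable[str]) -> Counter:
    """Count messages per classified event type."""
    return Counter(classify(m) for m in messages)
```

Unlike the one-liner, this does not raise on malformed or untyped messages, which makes it easier to spot unexpected traffic on `/ws/events`.
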

---

## Troubleshooting

### BFF won't start: `ModuleNotFoundError`

```bash
pip install -r services/sofiia-console/requirements.txt
```

### UI shows "BFF: ✗"

1. Check the BFF is running: `curl http://localhost:8002/api/health`
2. Check Settings tab → BFF URL points to the correct host
3. Check CORS: in prod, the origin the UI is served from (for example `http://localhost:8000`) must be listed in `CORS_ORIGINS`

### Router shows "offline" in Nodes

1. The NODA1 router might not be running: `docker ps | grep router`
2. Check `router_url` in `config/nodes_registry.yml`
3. Override: `export NODES_NODA1_ROUTER_URL=http://<correct-ip>:9102`

### STT/TTS not working

1. Check memory-service is running: `curl http://localhost:8000/health`
2. Check `MEMORY_SERVICE_URL` in the BFF env
3. Check the browser has microphone permission

### Alerts backend is "memory" (should be postgres)

In prod/staging, set:

```bash
export ALERT_BACKEND=postgres
```

Then restart the governance/router service.

### Cron jobs not running

```bash
# Check cron file
cat /etc/cron.d/daarion-governance

# Manual trigger (example)
cd /path/to/daarion && python3 -m services.router.risk_engine snapshot
```

---

## AISTALK Integration

See `docs/aistalk/contract.md` for the full integration contract.

Quick enable:

```bash
export AISTALK_ENABLED=true
export AISTALK_URL=http://<aistalk-bridge>:PORT
# Restart BFF
```

Status check:

```bash
curl http://localhost:8002/api/status/full | jq .bff.aistalk_enabled
```

---

## Definition of Done Checklist

- [ ] `verify_sofiia_stack.py` PASS on NODA2 (dev)
- [ ] `verify_sofiia_stack.py` PASS on NODA1 (prod): router + BFF + alerts=postgres
- [ ] `--compare-with` parity PASS between NODA1 and NODA2
- [ ] Nodes dashboard shows real-time data (online/latency/incidents)
- [ ] Ops tab: release_check runs and shows result
- [ ] Voice: STT → chat → TTS roundtrip works without looping
- [ ] WS Events tab shows `chat.reply`, `voice.stt`, `nodes.status`
- [ ] `SOFIIA_CONSOLE_API_KEY` set on NODA1 (prod)
- [ ] `ALERT_BACKEND=postgres` on NODA1 (prod)