docs(platform): add policy configs, runbooks, ops scripts and platform documentation

Config policies (16 files): alert_routing, architecture_pressure, backlog, cost_weights, data_governance, incident_escalation, incident_intelligence, network_allowlist, nodes_registry, observability_sources, rbac_tools_matrix, release_gate, risk_attribution, risk_policy, slo_policy, tool_limits, tools_rollout

Ops (22 files): Caddyfile, calendar compose, grafana voice dashboard, deployments/incidents logs, runbooks for alerts/audit/backlog/incidents/sofiia/voice, cron jobs, scripts (alert_triage, audit_cleanup, migrate_*, governance, schedule), task_registry, voice alerts/ha/latency/policy

Docs (30+ files): HUMANIZED_STEPAN v2.7-v3 changelogs and runbooks, NODA1/NODA2 status and setup, audit index and traces, backlog, incident, supervisor, tools, voice, opencode, release, risk, aistalk, spacebot

Made-with: Cursor

docs/runbook/sofiia-control-plane.md
@@ -0,0 +1,285 @@
# Sofiia Control Plane: Operations Runbook

Version: 1.0
Date: 2026-02-25

---

## Architecture: Two-Plane Model

```
┌─────────────────────────────────┐   ┌─────────────────────────────────┐
│  NODA2 (MacBook)                │   │  NODA1 (Production)             │
│  CONTROL PLANE                  │   │  RUNTIME PLANE                  │
│                                 │   │                                 │
│  sofiia-console BFF :8002 ────────→ │  router/gateway :8000/:9300     │
│  memory-service UI :8000        │   │  postgres, qdrant stores        │
│  Ollama :11434                  │   │  cron jobs (governance)         │
│  WebSocket /ws/events           │   │  alert/incident/risk pipelines  │
│                                 │   │                                 │
│  Operator interacts here        │   │  Production traffic runs here   │
└─────────────────────────────────┘   └─────────────────────────────────┘
```

### Rule: All operator actions go through NODA2 BFF

The BFF on NODA2 proxies requests to the NODA1 router/governance services. Never call NODA1 directly from the browser.

---

## Environment Variables

### NODA2 (sofiia-console BFF)

| Variable | Default | Description |
|---|---|---|
| `PORT` | `8002` | BFF listen port |
| `ENV` | `dev` | `dev\|staging\|prod`; controls CORS strictness and auth enforcement |
| `SOFIIA_CONSOLE_API_KEY` | `""` | Bearer auth for write endpoints. Mandatory in prod. |
| `MEMORY_SERVICE_URL` | `http://localhost:8000` | Memory service URL (STT/TTS/memory) |
| `OLLAMA_URL` | `http://localhost:11434` | Ollama URL for local LLM |
| `CORS_ORIGINS` | `""` | Comma-separated allowed origins. Empty = `*` in dev. |
| `SUPERVISOR_API_KEY` | `""` | Key for router/governance calls |
| `NODES_POLL_INTERVAL_SEC` | `30` | How often the BFF polls nodes for telemetry |
| `AISTALK_ENABLED` | `false` | Enable AISTALK adapter |
| `AISTALK_URL` | `""` | AISTALK bridge URL |
| `BUILD_ID` | `local` | Git SHA or build ID (set in CI/CD) |
| `CONFIG_DIR` | auto-detect | Path to `config/` directory with `nodes_registry.yml` |

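
The defaults and the prod-only requirement in the table above can be sketched as a small loader. This is a minimal illustration, not the BFF's actual startup code: `BffSettings` and `load_settings` are hypothetical names, only a subset of the variables is covered, and the behaviour for an empty `CORS_ORIGINS` outside dev is an assumption because the runbook does not specify it.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class BffSettings:  # hypothetical container, not taken from the BFF source
    port: int
    env: str
    api_key: str
    memory_service_url: str
    ollama_url: str
    cors_origins: tuple
    nodes_poll_interval_sec: int


def load_settings(environ=None):
    """Read the documented variables, applying the defaults from the table."""
    e = os.environ if environ is None else environ
    env = e.get("ENV", "dev")
    if env not in ("dev", "staging", "prod"):
        raise ValueError(f"ENV must be dev, staging or prod, got {env!r}")

    api_key = e.get("SOFIIA_CONSOLE_API_KEY", "")
    if env == "prod" and not api_key:
        raise ValueError("SOFIIA_CONSOLE_API_KEY is mandatory in prod")

    raw_origins = e.get("CORS_ORIGINS", "")
    origins = tuple(o.strip() for o in raw_origins.split(",") if o.strip())
    if not origins and env == "dev":
        origins = ("*",)  # documented: an empty value means "*" in dev
    # Assumption: outside dev an empty value stays empty (no origin allowed).

    return BffSettings(
        port=int(e.get("PORT", "8002")),
        env=env,
        api_key=api_key,
        memory_service_url=e.get("MEMORY_SERVICE_URL", "http://localhost:8000"),
        ollama_url=e.get("OLLAMA_URL", "http://localhost:11434"),
        cors_origins=origins,
        nodes_poll_interval_sec=int(e.get("NODES_POLL_INTERVAL_SEC", "30")),
    )
```

Calling `load_settings({"ENV": "prod"})` raises `ValueError`, which surfaces a missing API key at startup rather than at the first write request.
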
### NODA1 (router/governance)

| Variable | Description |
|---|---|
| `ALERT_BACKEND` | Must be `postgres` in production (not `memory`) |
| `AUDIT_BACKEND` | `auto\|jsonl\|postgres` |
| `GOV_CRON_FILE` | Path to cron file, default `/etc/cron.d/daarion-governance` |

---

## Starting Services

### NODA2: Start BFF

```bash
cd services/sofiia-console
source .venv/bin/activate
uvicorn app.main:app --host 0.0.0.0 --port 8002 --reload
```

Or via Docker Compose:
```bash
docker-compose -f docker-compose.node2-sofiia.yml up -d
```

### NODA2: Check status

```bash
curl http://localhost:8002/api/health
curl http://localhost:8002/api/status/full
```

Expected: `service: "sofiia-console"`, `version: "0.3.x"`.

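
For scripted checks, the expected values above can be asserted on the `/api/health` response body. The sketch below is illustrative only: `check_health` is a hypothetical helper, and it assumes that `service` and `version` are top-level keys of a JSON object and that `0.3.x` means `0.3.` followed by digits.

```python
import json
import re


def check_health(body):
    """Return a list of problems found in an /api/health response body."""
    try:
        data = json.loads(body)
    except ValueError:
        return ["response is not valid JSON"]
    if not isinstance(data, dict):
        return ["response is not a JSON object"]

    problems = []
    if data.get("service") != "sofiia-console":
        problems.append(f"unexpected service: {data.get('service')!r}")
    # Assumption: "0.3.x" is read as 0.3.<digits>.
    version = str(data.get("version", ""))
    if not re.fullmatch(r"0\.3\.\d+", version):
        problems.append(f"unexpected version: {version!r}")
    return problems
```

An empty list means the body matches the expectation; for example, `check_health('{"service": "sofiia-console", "version": "0.3.4"}')` returns `[]`.
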
### Accessing the UI

```
http://localhost:8000/ui   ← memory-service serves sofiia-ui.html
```

The UI auto-connects to the BFF at `http://localhost:8002` (configurable in the Settings tab).

---

## Nodes Registry

Edit `config/nodes_registry.yml` to add or modify nodes:

```yaml
nodes:
  NODA1:
    label: "Production (NODA1)"
    router_url: "http://<noda1-ip>:9102"
    gateway_url: "http://<noda1-ip>:9300"

  NODA2:
    label: "Control Plane (NODA2)"
    router_url: "http://localhost:8000"
    monitor_url: "http://localhost:8000"
```

**Environment overrides** (no need to edit YAML in prod):
```bash
export NODES_NODA1_ROUTER_URL=http://10.0.0.5:9102
```

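
One way to picture the override is as a merge step applied after the YAML is loaded. The sketch below is an assumption-laden illustration: `apply_env_overrides` is a hypothetical helper, the generalised `NODES_<NODE_ID>_ROUTER_URL` pattern is inferred from the single `NODES_NODA1_ROUTER_URL` example, and only `router_url` is handled because the runbook shows no other override.

```python
import os


def apply_env_overrides(nodes, environ=None):
    """Return a copy of the registry with router_url overrides applied.

    `nodes` is the mapping found under the `nodes:` key of nodes_registry.yml.
    """
    e = os.environ if environ is None else environ
    merged = {}
    for node_id, cfg in nodes.items():
        cfg = dict(cfg)  # copy so the loaded registry is left untouched
        # Assumption: the variable name follows NODES_<NODE_ID>_ROUTER_URL.
        override = e.get(f"NODES_{node_id.upper()}_ROUTER_URL")
        if override:
            cfg["router_url"] = override
        merged[node_id] = cfg
    return merged


if __name__ == "__main__":
    registry = {"NODA1": {"router_url": "http://<noda1-ip>:9102"}}
    env = {"NODES_NODA1_ROUTER_URL": "http://10.0.0.5:9102"}
    print(apply_env_overrides(registry, env)["NODA1"]["router_url"])
```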

---

## Monitor Agent on Nodes

The BFF probes each node at `GET /monitor/status` and falls back to `/healthz`.

### Implementing `/monitor/status` on a node

Add this endpoint to the node's router or to a dedicated lightweight service. `GET /monitor/status` should return `200 OK` with a JSON body like:

```json
{
  "online": true,
  "ts": "2026-02-25T10:00:00Z",
  "node_id": "NODA1",
  "heartbeat_age_s": 5,
  "router": {"ok": true, "latency_ms": 12},
  "gateway": {"ok": true, "latency_ms": 8},
  "alerts_loop_slo": {
    "p95_ms": 320,
    "failed_rate": 0.0
  },
  "open_incidents": 2,
  "backends": {
    "alerts": "postgres",
    "audit": "auto",
    "incidents": "auto",
    "risk_history": "auto",
    "backlog": "auto"
  },
  "last_artifacts": {
    "risk_digest": "2026-02-24",
    "platform_digest": "2026-W08",
    "backlog": "2026-02-24"
  }
}
```

If `/monitor/status` is not available, the BFF synthesises partial data from `/healthz`.

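
The probe-then-fallback behaviour described in this section could look roughly like the sketch below. It is not the BFF's implementation: `http_get` and `probe_node` are hypothetical helpers, the `source` field is invented for illustration, and the exact fields the BFF synthesises from `/healthz` are not documented here, so the fallback result carries only `online`.

```python
import json
import urllib.request


def http_get(url, timeout=3.0):
    """Return the response body as text; raises OSError on HTTP/network errors."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def probe_node(base_url, get=http_get):
    """Try /monitor/status first, then fall back to /healthz."""
    base = base_url.rstrip("/")
    try:
        status = json.loads(get(base + "/monitor/status"))
        if isinstance(status, dict):
            return {**status, "source": "monitor"}
    except (OSError, ValueError):
        pass  # endpoint missing, unreachable, or not JSON: use the fallback

    try:
        get(base + "/healthz")
        # Assumption: any successful response means the node is online.
        return {"online": True, "source": "healthz"}
    except OSError:
        return {"online": False, "source": "unreachable"}


if __name__ == "__main__":
    # Stub transport so the example runs without a live node.
    def healthz_only(url):
        if url.endswith("/healthz"):
            return "ok"
        raise OSError("404")

    print(probe_node("http://localhost:8000", healthz_only))
```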

---

## Parity Verification

Run after every deploy to both nodes:

```bash
# NODA2 alone
python3 ops/scripts/verify_sofiia_stack.py \
  --node NODA2 \
  --bff-url http://localhost:8002 \
  --router-url http://localhost:8000 \
  --env dev

# NODA1 from NODA2 (parity check)
python3 ops/scripts/verify_sofiia_stack.py \
  --node NODA1 \
  --bff-url http://<noda1>:8002 \
  --router-url http://<noda1>:9102 \
  --compare-with http://localhost:8002 \
  --compare-node NODA2 \
  --env prod

# JSON output for CI
python3 ops/scripts/verify_sofiia_stack.py --json | jq .pass
```

Exit 0 = PASS. Exit 1 = critical failure.

### Critical PASS requirements (prod)

- `router_health`: router responds 200
- `bff_health`: BFF identifies as `sofiia-console`
- `bff_status_full`: router + memory reachable
- `alerts_backend != memory`: must be postgres in prod/staging

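
The critical requirements above map onto the exit code roughly as sketched below. This does not reproduce `verify_sofiia_stack.py`: the shape of the `results` mapping and both function names are assumptions, and the sketch takes the stricter reading of the last rule by requiring `postgres` rather than merely rejecting `memory`.

```python
CRITICAL_CHECKS = ("router_health", "bff_health", "bff_status_full")


def critical_failures(results, env):
    """List the critical checks that failed.

    `results` is a hypothetical mapping: check name -> bool, plus
    'alerts_backend' -> backend name reported by the node.
    """
    failures = [name for name in CRITICAL_CHECKS if not results.get(name, False)]
    # The backend rule only applies in prod/staging, per the list above.
    if env in ("prod", "staging") and results.get("alerts_backend") != "postgres":
        failures.append("alerts_backend")
    return failures


def exit_code(results, env):
    """0 = PASS, 1 = at least one critical failure."""
    return 1 if critical_failures(results, env) else 0


if __name__ == "__main__":
    sample = {
        "router_health": True,
        "bff_health": True,
        "bff_status_full": True,
        "alerts_backend": "memory",
    }
    print(critical_failures(sample, "prod"), exit_code(sample, "prod"))
```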

---

## WebSocket Events

Connect to the WebSocket endpoint for real-time monitoring:

```bash
# Using wscat (npm install -g wscat)
wscat -c ws://localhost:8002/ws/events

# Or via Python
python3 -c "
import asyncio, json, websockets
async def f():
    async with websockets.connect('ws://localhost:8002/ws/events') as ws:
        async for msg in ws:
            print(json.loads(msg)['type'])
asyncio.run(f())
"
```

Event types: `chat.message`, `chat.reply`, `voice.stt`, `voice.tts`, `ops.run`, `nodes.status`, `error`.

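
When watching the stream for longer than a quick check, it can help to tally messages by type and flag anything outside the documented list. The sketch below works on raw message strings so it can be fed from the `async for` loop above or from a saved log. `classify_event` and `count_by_type` are hypothetical helpers; the only payload detail relied on is the `type` key used in the snippet above.

```python
import json
from collections import Counter

# Event types listed in this runbook.
KNOWN_EVENT_TYPES = frozenset({
    "chat.message", "chat.reply", "voice.stt", "voice.tts",
    "ops.run", "nodes.status", "error",
})


def classify_event(raw):
    """Return the event type, or 'unknown' / 'invalid' for unexpected input."""
    try:
        event = json.loads(raw)
    except ValueError:
        return "invalid"
    if not isinstance(event, dict):
        return "invalid"
    event_type = event.get("type")
    if isinstance(event_type, str) and event_type in KNOWN_EVENT_TYPES:
        return event_type
    return "unknown"


def count_by_type(messages):
    """Tally raw messages by classified type."""
    return dict(Counter(classify_event(m) for m in messages))


if __name__ == "__main__":
    sample = ['{"type": "chat.reply"}', '{"type": "nodes.status"}', "oops"]
    print(count_by_type(sample))
```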

---

## Troubleshooting

### BFF won't start: `ModuleNotFoundError`
```bash
pip install -r services/sofiia-console/requirements.txt
```

### UI shows "BFF: ✗"
1. Check that the BFF is running: `curl http://localhost:8002/api/health`
2. Check that Settings tab → BFF URL points to the correct host
3. Check CORS: in prod, the origin serving the UI must be listed in `CORS_ORIGINS`

### Router shows "offline" in Nodes
1. The NODA1 router might not be running: `docker ps | grep router`
2. Check `router_url` in `config/nodes_registry.yml`
3. Override: `export NODES_NODA1_ROUTER_URL=http://<correct-ip>:9102`

### STT/TTS not working
1. Check that memory-service is running: `curl http://localhost:8000/health`
2. Check `MEMORY_SERVICE_URL` in the BFF env
3. Check that the browser has microphone permission

### Alerts backend is "memory" (should be postgres)
In prod/staging, set:
```bash
export ALERT_BACKEND=postgres
```
Then restart the governance/router service.

### Cron jobs not running
```bash
# Check cron file
cat /etc/cron.d/daarion-governance

# Manual trigger (example)
cd /path/to/daarion && python3 -m services.router.risk_engine snapshot
```

---

## AISTALK Integration

See `docs/aistalk/contract.md` for the full integration contract.

Quick enable:
```bash
export AISTALK_ENABLED=true
export AISTALK_URL=http://<aistalk-bridge>:PORT
# Restart BFF
```

Status check:
```bash
curl http://localhost:8002/api/status/full | jq .bff.aistalk_enabled
```

---

## Definition of Done Checklist

- [ ] `verify_sofiia_stack.py` PASS on NODA2 (dev)
- [ ] `verify_sofiia_stack.py` PASS on NODA1 (prod): router + BFF + alerts=postgres
- [ ] `--compare-with` parity PASS between NODA1 and NODA2
- [ ] Nodes dashboard shows real-time data (online/latency/incidents)
- [ ] Ops tab: release_check runs and shows result
- [ ] Voice: STT → chat → TTS roundtrip works without looping
- [ ] WS Events tab shows `chat.reply`, `voice.stt`, `nodes.status`
- [ ] `SOFIIA_CONSOLE_API_KEY` set on NODA1 (prod)
- [ ] `ALERT_BACKEND=postgres` on NODA1 (prod)