docs(platform): add policy configs, runbooks, ops scripts and platform documentation

Config policies (16 files): alert_routing, architecture_pressure, backlog,
cost_weights, data_governance, incident_escalation, incident_intelligence,
network_allowlist, nodes_registry, observability_sources, rbac_tools_matrix,
release_gate, risk_attribution, risk_policy, slo_policy, tool_limits, tools_rollout

Ops (22 files): Caddyfile, calendar compose, grafana voice dashboard,
deployments/incidents logs, runbooks for alerts/audit/backlog/incidents/sofiia/voice,
cron jobs, scripts (alert_triage, audit_cleanup, migrate_*, governance, schedule),
task_registry, voice alerts/ha/latency/policy

Docs (30+ files): HUMANIZED_STEPAN v2.7-v3 changelogs and runbooks,
NODA1/NODA2 status and setup, audit index and traces, backlog, incident,
supervisor, tools, voice, opencode, release, risk, aistalk, spacebot

Made-with: Cursor
Apple
2026-03-03 07:14:53 -08:00
parent 129e4ea1fc
commit 67225a39fa
102 changed files with 20060 additions and 0 deletions

@@ -0,0 +1,75 @@
# Release Evidence Template (Sofiia Console)
Fill in after every release. Goal: keep a short, reproducible record of the actions and checks performed.
## 1) Release metadata
- Release ID:
- Date/Time UTC:
- Date/Time Europe/Kyiv:
- Operator:
- Target nodes: `NODA1` / `NODA2`
- Deployed SHAs:
  - `sofiia-console`:
  - `router`:
  - `gateway`:
  - `memory-service`:
- Change summary (1-3 bullets):
  -
## 2) Preflight results
- Command:
  - `bash ops/preflight_sofiia_console.sh`
  - `STRICT=1 bash ops/preflight_sofiia_console.sh` (prod window)
- Status: `PASS` / `FAIL`
- WARN summary (if any):
  -
## 3) Deploy steps performed
- NODA2 precheck: `OK` / `FAIL`
  - Notes:
- NODA1 rollout: `OK` / `FAIL`
  - Method (docker/systemd/manual):
  - Notes:
- NODA2 finalize: `OK` / `FAIL`
  - Notes:
## 4) Smoke evidence
- `GET /api/health`: status code / result
- `GET /metrics`: reachable `yes/no`
- Idempotency A/B smoke:
  - Command: `bash ops/redis_idempotency_smoke.sh`
  - Result: `PASS` / `FAIL`
  - `message_id`:
- `/api/audit` auth checks:
  - without key -> `401` confirmed: `yes/no`
  - with key -> `200` confirmed: `yes/no`
## 5) Post-release checks
- Key metrics deltas (optional):
  - `sofiia_rate_limited_total`:
  - `sofiia_idempotency_replays_total`:
- Audit write/read quick check: `OK` / `FAIL`
- Retention dry-run:
  - Command: `python3 ops/prune_audit_db.py --dry-run`
  - `candidates=`:
- Notes:
## 6) Rollback plan & outcome
- Rollback needed: `no` / `yes`
- If yes:
  - reason:
  - rollback commands used:
  - result:
- Final service state: `healthy` / `degraded`
## 7) Sign-off
- Reviewer / approver:
- Timestamp UTC:
- Notes:

@@ -0,0 +1,175 @@
# Sofiia Console — Operations Runbook
## 1. Rebuild & Deploy (NODA2)
```bash
cd /opt/microdao-daarion # or ~/github-projects/microdao-daarion on dev
# Rebuild sofiia-console (UI + backend)
docker compose -f docker-compose.node2-sofiia.yml build sofiia-console --no-cache
docker compose -f docker-compose.node2-sofiia.yml up -d sofiia-console
# Rebuild gateway (for agent registry changes)
docker compose -f docker-compose.node2-sofiia.yml build gateway --no-cache
docker compose -f docker-compose.node2-sofiia.yml up -d gateway
```
## 2. Confirm Build Version
```bash
# Via API
APIKEY=$(grep SOFIIA_CONSOLE_API_KEY .env | cut -d= -f2)
curl -s http://localhost:8002/api/meta/version -H "X-API-Key: $APIKEY"
# Expected: {"version":"0.4.0","build_sha":"dev","build_time":"local",...}
# In UI: header shows "v0.4.0 dev" badge (top right)
```
## 3. Verify Agents List
```bash
APIKEY=$(grep SOFIIA_CONSOLE_API_KEY .env | cut -d= -f2)
# NODA2 agents
curl -s "http://localhost:8002/api/agents?nodes=NODA2" -H "X-API-Key: $APIKEY" | \
python3 -c "import sys,json; d=json.load(sys.stdin); print(f'items={len(d[\"items\"])} stats={d[\"stats\"]} errors={d[\"node_errors\"]}')"
# NODA1 agents
curl -s "http://localhost:8002/api/agents?nodes=NODA1" -H "X-API-Key: $APIKEY" | \
python3 -c "import sys,json; d=json.load(sys.stdin); print(f'items={len(d[\"items\"])} stats={d[\"stats\"]} errors={d[\"node_errors\"]}')"
# All nodes
curl -s "http://localhost:8002/api/agents?nodes=NODA1,NODA2" -H "X-API-Key: $APIKEY" | \
python3 -c "import sys,json; d=json.load(sys.stdin); print(f'items={len(d[\"items\"])} stats={d[\"stats\"]} errors={d[\"node_errors\"]}')"
# Direct gateway check (NODA2)
curl -s http://localhost:9300/health | python3 -c "
import sys,json; d=json.load(sys.stdin)
print(f'agents={d[\"agents_count\"]}')
for k,v in sorted(d[\"agents\"].items()): print(f' {k}: badges={v.get(\"badges\",[])}')
"
```
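The three `curl | python3 -c` one-liners above print the same summary for different node sets. A minimal sketch of that summary as a reusable function follows; the field names `items`, `stats` and `node_errors` come from the commands above, while the function name and the sample payload are made up for illustration.

```python
def summarize_agents(payload: dict) -> str:
    """Build the same one-line summary the shell one-liners print."""
    items = payload.get("items", [])          # list of agent records
    stats = payload.get("stats", {})          # aggregate counters from the API
    errors = payload.get("node_errors", [])   # per-node errors, if any
    return f"items={len(items)} stats={stats} errors={errors}"


# Hypothetical payload, only to show the output shape.
sample = {"items": [{"agent_id": "monitor"}], "stats": {"total": 1}, "node_errors": []}
print(summarize_agents(sample))
```

Using `.get()` with defaults means a response that lacks one of the keys still produces a summary instead of a `KeyError`, which can be easier to read during an incident.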
## 4. UI Debug Panel
In the **📁 Проєкти → Agents** (Projects → Agents) tab:
1. Click the **🔍 Debug** button in the actions panel
2. The debug panel shows:
   - `fetch`: time of the last request
   - `nodes`: selected nodes
   - `items`: number of agents
   - `ok/total`: number of nodes that responded successfully
   - `errors`: node errors (if any)
## 5. Troubleshooting
### Agents are not displayed in the UI
1. Check the API key in the UI settings
2. Click **↻ Sync**
3. Open **🔍 Debug** and check `errors`
4. Check gateway health: `curl http://localhost:9300/health`
### Gateway crashes on startup
```bash
docker logs dagi-gateway-node2 --tail 50
```
Typical cause: an ImportError in `http_api_doc.py` → `doc_service.py`.
Fix: check that `doc_service.py` defines the stub functions (`doc_service`, `update_document`, `list_document_versions`, `publish_document_artifact`).
### SQLite "no such column: last_applied_hash"
The DB in the volume has an old schema. Migrations run automatically at startup via `_MIGRATION_SQL_STMTS` in `db.py`, so restarting the container resolves it:
```bash
docker restart sofiia-console
```
### NODA2 gateway_url is unreachable from the container
In `config/nodes_registry.yml`, NODA2 uses `host.docker.internal:9300`.
If the UI is not running in Docker, replace it with `localhost:9300`.
### Monitor / AISTALK are not displayed
Check that:
- `MONITOR_CONFIG` and `AISTALK_CONFIG` are defined via `load_agent_config` in `gateway-bot/http_api.py`
- Both are added to `AGENT_REGISTRY`
- The file `gateway-bot/monitor_prompt.txt` exists
```bash
docker exec dagi-gateway-node2 python3 -c "
from http_api import AGENT_REGISTRY
print(list(AGENT_REGISTRY.keys()))
"
```
## 6. Monitor Policy
Monitor (`agent_id=monitor`) is a **required** agent on every node.
### Check
```bash
APIKEY=$(grep SOFIIA_CONSOLE_API_KEY .env | cut -d= -f2)
curl -s "http://localhost:8002/api/agents?nodes=NODA1,NODA2" -H "X-API-Key: $APIKEY" | \
python3 -c "import sys,json; d=json.load(sys.stdin); print('missing:', d.get('required_missing_nodes'))"
```
- `required_missing=[]`: all OK
- `required_missing=[{"node_id":"NODA1","agent_id":"monitor"}]`: Monitor is missing on NODA1 → check the gateway registry → rebuild the gateway
### Governance event
If Monitor is missing on an online node, a `governance_event` of type `node_required_agent_missing` (severity=high) is recorded automatically.
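The check above can be turned into a pass/fail gate. The sketch below is hedged: it reads the `required_missing_nodes` key that the command above prints (the bullets label the list `required_missing`, so the exact key name is an assumption), and it assumes each entry has the `node_id` / `agent_id` shape shown in the example. The function name and sample payload are illustrative only.

```python
def missing_required(payload: dict) -> list:
    """Return (node_id, agent_id) pairs for required agents reported missing."""
    entries = payload.get("required_missing_nodes") or []  # assumed key name
    return [(e.get("node_id"), e.get("agent_id")) for e in entries]


# Hypothetical response fragment matching the example entry above.
sample = {"required_missing_nodes": [{"node_id": "NODA1", "agent_id": "monitor"}]}

for node_id, agent_id in missing_required(sample):
    print(f"{agent_id} missing on {node_id}: check the gateway registry, then rebuild the gateway")
```

An empty result corresponds to the "all OK" case; any pair returned points at the node whose gateway registry needs attention.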
## 7. Voice & Telegram Capabilities
In the Agents tab:
- **🎙 Voice** badge: the agent supports voice (AISTALK)
- **💬 Telegram** badge: the agent is active in Telegram
- The **🎙 Voice** and **💬 Telegram** filters apply client-side filtering
### API
```bash
APIKEY=$(grep SOFIIA_CONSOLE_API_KEY .env | cut -d= -f2)
curl -s "http://localhost:8002/api/agents?nodes=NODA1" -H "X-API-Key: $APIKEY" | \
python3 -c "import sys,json; d=json.load(sys.stdin);
voice=[a['agent_id'] for a in d['items'] if a.get('capabilities',{}).get('voice')]
print('voice:', voice)"
```
## 8. Document Versioning
API for document versions (within Sofiia Console):
```bash
# List versions
GET /api/projects/{project_id}/documents/{doc_id}/versions
# Update a document (saves a new version)
POST /api/projects/{project_id}/documents/{doc_id}/update
{"content_md": "# New content", "author_id": "user", "reason": "update", "dry_run": false}
# Restore a version
POST /api/projects/{project_id}/documents/{doc_id}/restore
{"version_id": "...", "author_id": "user"}
```
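The update call above can be sketched with the Python standard library. This is an illustration under stated assumptions, not the console's client: the base URL is the console address used elsewhere in this runbook, the `X-API-Key` header is assumed to apply to these endpoints as it does to `/api/agents`, and the project ID, document ID and key are placeholders. The request is only built here, not sent.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8002"  # console address used elsewhere in this runbook


def build_update_request(project_id, doc_id, content_md, author_id, reason,
                         api_key, dry_run=True):
    """Build (but do not send) a POST to the document update endpoint."""
    url = f"{BASE_URL}/api/projects/{project_id}/documents/{doc_id}/update"
    body = json.dumps({
        "content_md": content_md,
        "author_id": author_id,
        "reason": reason,
        "dry_run": dry_run,  # defaults to True so a first attempt changes nothing
    }).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-API-Key": api_key,  # assumption: same auth header as /api/agents
    }
    return urllib.request.Request(url, data=body, method="POST", headers=headers)


# Placeholder identifiers, for illustration only.
req = build_update_request("demo-project", "demo-doc", "# New content",
                           "user", "update", api_key="changeme")
print(req.get_method(), req.full_url)
```

Passing the result to `urllib.request.urlopen(req)` would send it. Defaulting `dry_run` to `True` is a design choice of this sketch so that an accidental call does not create a new version.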
## 9. Agent Registry SSoT
Canonical registry: `config/agent_registry.yml`
The gateway loads agents from `gateway-bot/http_api.py::AGENT_REGISTRY` (a Python dict).
To add a new agent:
1. Add an entry to `config/agent_registry.yml`
2. Add `*_CONFIG = load_agent_config(...)` and an entry in `AGENT_REGISTRY` in `gateway-bot/http_api.py`
3. Create `gateway-bot/<agent_id>_prompt.txt`
4. Rebuild the gateway
## 10. Ports Reference
| Service | Port | URL |
|---|---|---|
| Sofiia Console UI | 8002 | http://localhost:8002 |
| Gateway | 9300 | http://localhost:9300/health |
| Router | 9102 | http://localhost:9102/health |
| Memory | 8000 | http://localhost:8000/health |
| Qdrant | 6333 | http://localhost:6333/healthz |

@@ -0,0 +1,285 @@
# Sofiia Control Plane — Operations Runbook
Version: 1.0
Date: 2026-02-25
---
## Architecture: Two-Plane Model
```
┌─────────────────────────────────┐ ┌─────────────────────────────────┐
│ NODA2 (MacBook) │ │ NODA1 (Production) │
│ CONTROL PLANE │ │ RUNTIME PLANE │
│ │ │ │
│ sofiia-console BFF :8002 ────────→ │ router/gateway :8000/:9300 │
│ memory-service UI :8000 │ │ postgres, qdrant stores │
│ Ollama :11434 │ │ cron jobs (governance) │
│ WebSocket /ws/events │ │ alert/incident/risk pipelines │
│ │ │ │
│ Operator interacts here │ │ Production traffic runs here │
└─────────────────────────────────┘ └─────────────────────────────────┘
```
### Rule: All operator actions go through NODA2 BFF
The BFF on NODA2 proxies requests to NODA1 router/governance. You never call NODA1 directly from the browser.
---
## Environment Variables
### NODA2 (sofiia-console BFF)
| Variable | Default | Description |
|---|---|---|
| `PORT` | `8002` | BFF listen port |
| `ENV` | `dev` | `dev\|staging\|prod` — controls CORS strictness, auth enforcement |
| `SOFIIA_CONSOLE_API_KEY` | `""` | Bearer auth for write endpoints. Mandatory in prod. |
| `MEMORY_SERVICE_URL` | `http://localhost:8000` | Memory service URL (STT/TTS/memory) |
| `OLLAMA_URL` | `http://localhost:11434` | Ollama URL for local LLM |
| `CORS_ORIGINS` | `""` | Comma-separated allowed origins. Empty = `*` in dev. |
| `SUPERVISOR_API_KEY` | `""` | Key for router/governance calls |
| `NODES_POLL_INTERVAL_SEC` | `30` | How often BFF polls nodes for telemetry |
| `AISTALK_ENABLED` | `false` | Enable AISTALK adapter |
| `AISTALK_URL` | `""` | AISTALK bridge URL |
| `BUILD_ID` | `local` | Git SHA or build ID (set in CI/CD) |
| `CONFIG_DIR` | auto-detect | Path to `config/` directory with `nodes_registry.yml` |
### NODA1 (router/governance)
| Variable | Description |
|---|---|
| `ALERT_BACKEND` | Must be `postgres` in production (not `memory`) |
| `AUDIT_BACKEND` | `auto\|jsonl\|postgres` |
| `GOV_CRON_FILE` | Path to cron file, default `/etc/cron.d/daarion-governance` |
---
## Starting Services
### NODA2 — Start BFF
```bash
cd services/sofiia-console
source .venv/bin/activate
uvicorn app.main:app --host 0.0.0.0 --port 8002 --reload
```
Or via Docker Compose:
```bash
docker-compose -f docker-compose.node2-sofiia.yml up -d
```
### NODA2 — Check status
```bash
curl http://localhost:8002/api/health
curl http://localhost:8002/api/status/full
```
Expected: `service: "sofiia-console"`, `version: "0.3.x"`.
### Accessing the UI
```
http://localhost:8000/ui ← memory-service serves sofiia-ui.html
```
The UI auto-connects to BFF at `http://localhost:8002` (configurable in Settings tab).
---
## Nodes Registry
Edit `config/nodes_registry.yml` to add/modify nodes:
```yaml
nodes:
  NODA1:
    label: "Production (NODA1)"
    router_url: "http://<noda1-ip>:9102"
    gateway_url: "http://<noda1-ip>:9300"
  NODA2:
    label: "Control Plane (NODA2)"
    router_url: "http://localhost:8000"
    monitor_url: "http://localhost:8000"
```
**Environment overrides** (no need to edit YAML in prod):
```bash
export NODES_NODA1_ROUTER_URL=http://10.0.0.5:9102
```
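One way such overrides could be resolved is sketched below. It assumes the variable name follows the pattern `NODES_<NODE_ID>_<FIELD>` in upper case, generalised from the single `NODES_NODA1_ROUTER_URL` example above; whether the BFF supports other fields this way is not confirmed here, and the function name is hypothetical.

```python
import os


def apply_env_overrides(nodes: dict, environ=None) -> dict:
    """Return a copy of the registry with NODES_<NODE>_<FIELD> overrides applied."""
    env = os.environ if environ is None else environ
    resolved = {}
    for node_id, cfg in nodes.items():
        merged = dict(cfg)
        for field in cfg:
            # e.g. node "NODA1", field "router_url" -> NODES_NODA1_ROUTER_URL
            override = env.get(f"NODES_{node_id}_{field}".upper())
            if override:
                merged[field] = override
        resolved[node_id] = merged
    return resolved


registry = {"NODA1": {"label": "Production (NODA1)", "router_url": "http://<noda1-ip>:9102"}}
env = {"NODES_NODA1_ROUTER_URL": "http://10.0.0.5:9102"}
print(apply_env_overrides(registry, env)["NODA1"]["router_url"])
```

The sketch only overrides fields already present in the YAML, so a mistyped variable name is ignored rather than introducing an unexpected key.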
---
## Monitor Agent on Nodes
The BFF probes each node at `GET /monitor/status` (falls back to `/healthz`).
### Implementing `/monitor/status` on a node
Add this endpoint to the node's router or a dedicated lightweight service:
```json
GET /monitor/status 200 OK
{
  "online": true,
  "ts": "2026-02-25T10:00:00Z",
  "node_id": "NODA1",
  "heartbeat_age_s": 5,
  "router": {"ok": true, "latency_ms": 12},
  "gateway": {"ok": true, "latency_ms": 8},
  "alerts_loop_slo": {
    "p95_ms": 320,
    "failed_rate": 0.0
  },
  "open_incidents": 2,
  "backends": {
    "alerts": "postgres",
    "audit": "auto",
    "incidents": "auto",
    "risk_history": "auto",
    "backlog": "auto"
  },
  "last_artifacts": {
    "risk_digest": "2026-02-24",
    "platform_digest": "2026-W08",
    "backlog": "2026-02-24"
  }
}
```
If `/monitor/status` is not available, BFF synthesises partial data from `/healthz`.
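The probe-then-fallback behaviour can be sketched as follows. This is not the BFF's code: the `fetch` callable (returns parsed JSON, or `None` when the endpoint is unavailable) and the `source` / `partial` keys are assumptions introduced for the example.

```python
from typing import Callable, Optional


def probe_node(base_url: str, fetch: Callable[[str], Optional[dict]]) -> dict:
    """Prefer /monitor/status; fall back to a partial record built from /healthz."""
    status = fetch(f"{base_url}/monitor/status")
    if status is not None:
        return {**status, "source": "monitor", "partial": False}
    health = fetch(f"{base_url}/healthz")
    if health is not None:
        # Only liveness is known here; SLO, incident and backend fields are absent.
        return {"online": True, "source": "healthz", "partial": True}
    return {"online": False, "source": "none", "partial": True}


def fake_fetch(url: str) -> Optional[dict]:
    """Stand-in for an HTTP call: this node only answers on /healthz."""
    return {"status": "ok"} if url.endswith("/healthz") else None


print(probe_node("http://node.example", fake_fetch))
```

Injecting `fetch` keeps the fallback logic testable without a network; a real version would wrap an HTTP client with a timeout.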
---
## Parity Verification
Run after every deploy to both nodes:
```bash
# NODA2 alone
python3 ops/scripts/verify_sofiia_stack.py \
--node NODA2 \
--bff-url http://localhost:8002 \
--router-url http://localhost:8000 \
--env dev
# NODA1 from NODA2 (parity check)
python3 ops/scripts/verify_sofiia_stack.py \
--node NODA1 \
--bff-url http://<noda1>:8002 \
--router-url http://<noda1>:9102 \
--compare-with http://localhost:8002 \
--compare-node NODA2 \
--env prod
# JSON output for CI
python3 ops/scripts/verify_sofiia_stack.py --json | jq .pass
```
Exit 0 = PASS. Exit 1 = critical failure.
### Critical PASS requirements (prod)
- `router_health` — router responds 200
- `bff_health` — BFF identifies as `sofiia-console`
- `bff_status_full` — router + memory reachable
- `alerts_backend != memory` — must be postgres in prod/staging
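How these requirements could map onto the exit code is sketched below. The check names come from the list above; the function, its arguments and the result format are hypothetical and do not describe the internals of `verify_sofiia_stack.py`.

```python
CRITICAL_CHECKS = ("router_health", "bff_health", "bff_status_full")


def exit_code(results: dict, alerts_backend: str, env: str) -> int:
    """Return 0 when every critical check passes, 1 otherwise."""
    failed = [name for name in CRITICAL_CHECKS if not results.get(name, False)]
    # The in-memory alerts backend is only acceptable outside prod/staging.
    if env in ("prod", "staging") and alerts_backend == "memory":
        failed.append("alerts_backend")
    return 1 if failed else 0


all_ok = {"router_health": True, "bff_health": True, "bff_status_full": True}
print(exit_code(all_ok, alerts_backend="postgres", env="prod"))
print(exit_code(all_ok, alerts_backend="memory", env="prod"))
```

A check missing from `results` counts as failed, so a probe that never ran cannot produce a false PASS.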
---
## WebSocket Events
Connect to WS for real-time monitoring:
```bash
# Using wscat (npm install -g wscat)
wscat -c ws://localhost:8002/ws/events
# Or via Python
python3 -c "
import asyncio, json, websockets
async def f():
    async with websockets.connect('ws://localhost:8002/ws/events') as ws:
        async for msg in ws:
            print(json.loads(msg)['type'])
asyncio.run(f())
"
```
Event types: `chat.message`, `chat.reply`, `voice.stt`, `voice.tts`, `ops.run`, `nodes.status`, `error`.
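A small sketch of tallying captured messages by event type follows. It assumes each message is a JSON object with a `type` field, as the Python client above does; the function name and the `unknown` bucket are illustrative additions.

```python
import json
from collections import Counter

KNOWN_TYPES = {"chat.message", "chat.reply", "voice.stt", "voice.tts",
               "ops.run", "nodes.status", "error"}


def tally_events(raw_messages) -> Counter:
    """Count messages per event type; unrecognised types go to 'unknown'."""
    counts = Counter()
    for raw in raw_messages:
        event_type = json.loads(raw).get("type")
        counts[event_type if event_type in KNOWN_TYPES else "unknown"] += 1
    return counts


# Hypothetical captured messages.
captured = ['{"type": "chat.reply"}', '{"type": "nodes.status"}', '{"type": "chat.reply"}']
print(tally_events(captured))
```

A tally like this can help with the checklist item about seeing `chat.reply`, `voice.stt` and `nodes.status` on the WS Events tab.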
---
## Troubleshooting
### BFF won't start: `ModuleNotFoundError`
```bash
pip install -r services/sofiia-console/requirements.txt
```
### UI shows "BFF: ✗"
1. Check BFF is running: `curl http://localhost:8002/api/health`
2. Check Settings tab → BFF URL points to correct host
3. Check CORS: in prod, the origin that serves the UI must be listed in `CORS_ORIGINS`
### Router shows "offline" in Nodes
1. NODA1 router might not be running: `docker ps | grep router`
2. Check `config/nodes_registry.yml` router_url
3. Override: `export NODES_NODA1_ROUTER_URL=http://<correct-ip>:9102`
### STT/TTS not working
1. Check memory-service is running: `curl http://localhost:8000/health`
2. Check `MEMORY_SERVICE_URL` in BFF env
3. Check browser has microphone permission
### Alerts backend is "memory" (should be postgres)
In prod/staging, set:
```bash
export ALERT_BACKEND=postgres
```
Then restart the governance/router service.
### Cron jobs not running
```bash
# Check cron file
cat /etc/cron.d/daarion-governance
# Manual trigger (example)
cd /path/to/daarion && python3 -m services.router.risk_engine snapshot
```
---
## AISTALK Integration
See `docs/aistalk/contract.md` for full integration contract.
Quick enable:
```bash
export AISTALK_ENABLED=true
export AISTALK_URL=http://<aistalk-bridge>:PORT
# Restart BFF
```
Status check:
```bash
curl http://localhost:8002/api/status/full | jq .bff.aistalk_enabled
```
---
## Definition of Done Checklist
- [ ] `verify_sofiia_stack.py` PASS on NODA2 (dev)
- [ ] `verify_sofiia_stack.py` PASS on NODA1 (prod) — router + BFF + alerts=postgres
- [ ] `--compare-with` parity PASS between NODA1 and NODA2
- [ ] Nodes dashboard shows real-time data (online/latency/incidents)
- [ ] Ops tab: release_check runs and shows result
- [ ] Voice: STT → chat → TTS roundtrip works without looping
- [ ] WS Events tab shows `chat.reply`, `voice.stt`, `nodes.status`
- [ ] `SOFIIA_CONSOLE_API_KEY` set on NODA1 (prod)
- [ ] `ALERT_BACKEND=postgres` on NODA1 (prod)