docs(platform): add policy configs, runbooks, ops scripts and platform documentation
Config policies (16 files): alert_routing, architecture_pressure, backlog, cost_weights, data_governance, incident_escalation, incident_intelligence, network_allowlist, nodes_registry, observability_sources, rbac_tools_matrix, release_gate, risk_attribution, risk_policy, slo_policy, tool_limits, tools_rollout

Ops (22 files): Caddyfile, calendar compose, grafana voice dashboard, deployments/incidents logs, runbooks for alerts/audit/backlog/incidents/sofiia/voice, cron jobs, scripts (alert_triage, audit_cleanup, migrate_*, governance, schedule), task_registry, voice alerts/ha/latency/policy

Docs (30+ files): HUMANIZED_STEPAN v2.7-v3 changelogs and runbooks, NODA1/NODA2 status and setup, audit index and traces, backlog, incident, supervisor, tools, voice, opencode, release, risk, aistalk, spacebot

Made-with: Cursor
ops/runbook-alerts.md (new file, 247 lines)
# Runbook: Alert → Incident Bridge (State Machine + Cooldown)
## Topology

```
Monitor@node1/2 ──► alert_ingest_tool.ingest ──► AlertStore (Postgres or Memory)
                                                        │
Sofiia / oncall ──► oncall_tool.alert_to_incident ◄─────┘
                              │
IncidentStore (Postgres) ◄────┘
            │
Sofiia NODA2: incident_triage_graph
            │
     postmortem_draft_graph
```
## Alert State Machine

```
new → processing → acked
          ↓
       failed → (retry after TTL) → new
```

| Status       | Meaning                                           |
|--------------|---------------------------------------------------|
| `new`        | Freshly ingested, not yet claimed                 |
| `processing` | Claimed by a loop worker; locked for 10 min       |
| `acked`      | Successfully processed and closed                 |
| `failed`     | Processing error; retry after `retry_after_sec`   |

**Concurrency safety:** `claim` uses `SELECT FOR UPDATE SKIP LOCKED` (Postgres) or an in-process lock (Memory). Two concurrent loops cannot claim the same alert.

**Stale processing requeue:** `claim` automatically requeues alerts whose `processing_lock_until` has expired.
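The claim semantics above can be sketched in a few lines. This is an illustrative model of the Memory backend only: the record fields (`status`, `processing_lock_until`) mirror the runbook, but the function itself, and the assumption that `failed` alerts whose retry window has passed are claimable, are not the real store code.

```python
import time

# Illustrative sketch of Memory-backend claim semantics; not the real store.
# A real Memory backend would guard this function with an in-process lock
# (e.g. threading.Lock) for the concurrency safety described above.
def claim(alerts, owner, lock_ttl_seconds=600, limit=25, now=None):
    """Claim up to `limit` alerts: `new` ones, `failed` ones whose retry
    window passed, and `processing` ones whose lock expired (stale requeue)."""
    now = time.time() if now is None else now
    claimed = []
    for alert in alerts:
        if len(claimed) >= limit:
            break
        stale = alert["status"] == "processing" and alert["processing_lock_until"] < now
        retry_due = alert["status"] == "failed" and alert.get("retry_after", 0) <= now
        if alert["status"] == "new" or stale or retry_due:
            alert["status"] = "processing"
            alert["owner"] = owner
            alert["processing_lock_until"] = now + lock_ttl_seconds
            claimed.append(alert)
    return claimed
```

A second concurrent loop calling `claim` on the same records gets an empty batch, because everything it sees is already `processing` with an unexpired lock.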
---

## Triage Cooldown (per Signature)

After a triage runs for a given `incident_signature`, subsequent alerts with the same signature **within 15 min** (configurable via `triage_cooldown_minutes` in `alert_routing_policy.yml`) only get an `incident_append_event` note; no new triage run starts. This prevents triage storms.

```yaml
# config/alert_routing_policy.yml
defaults:
  triage_cooldown_minutes: 15
```

The state is persisted in the `incident_signature_state` table (Postgres) or in memory (fallback).

---
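The cooldown decision reduces to a timestamp comparison against the persisted `last_triage_at`. A minimal sketch, assuming a hypothetical `should_run_triage` helper (the column name follows the runbook, the function does not exist in the codebase):

```python
from datetime import datetime, timedelta

# Hypothetical helper modelling the per-signature cooldown decision.
def should_run_triage(last_triage_at, now, cooldown_minutes=15):
    """True → run a full triage; False → inside the cooldown window,
    so only append an incident_append_event note."""
    if last_triage_at is None:
        return True  # first alert ever seen for this signature
    return now - last_triage_at >= timedelta(minutes=cooldown_minutes)
```
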
## Startup Checklist

1. **Postgres DDL** (if `ALERT_BACKEND=postgres`):

   ```bash
   DATABASE_URL=postgresql://... python3 ops/scripts/migrate_alerts_postgres.py
   ```

   This is idempotent and safe to re-run; it adds the state machine columns and the `incident_signature_state` table.

2. **Env vars on NODE1 (router)**:

   ```env
   ALERT_BACKEND=auto   # Postgres → Memory fallback
   DATABASE_URL=postgresql://...
   ```

3. **Monitor agent**: configure `source: monitor@node1` and use `alert_ingest_tool.ingest`.
## Operational Scenarios

### Alert storm protection

Alert deduplication prevents storms. If alerts fire repeatedly:

1. Check the `occurrences` field — the same alert ref means dedupe is working.
2. Adjust `dedupe_ttl_minutes` per alert (default 30).
3. If many different fingerprints create new records, review the Monitor fingerprint logic.

### False positive alert

1. `alert_ingest_tool.ack` with `note="false positive"`.
2. No incident is created (or close an already-created incident via `oncall_tool.incident_close`).
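The dedupe behaviour the storm-protection steps rely on can be sketched as a fingerprint index with a TTL. The field names (`occurrences`, `dedupe_ttl_minutes`) follow the runbook; the class itself is illustrative, not the real AlertStore:

```python
import time

# Illustrative fingerprint-dedupe index; not the real AlertStore.
class DedupeIndex:
    def __init__(self, dedupe_ttl_minutes=30):
        self.ttl = dedupe_ttl_minutes * 60
        self.seen = {}  # fingerprint -> (alert_ref, last_seen, occurrences)

    def ingest(self, fingerprint, alert_ref, now=None):
        """Return (alert_ref, deduped). Within the TTL the existing ref is
        reused and its occurrences counter bumped; otherwise a new record starts."""
        now = time.time() if now is None else now
        hit = self.seen.get(fingerprint)
        if hit and now - hit[1] < self.ttl:
            ref, _, n = hit
            self.seen[fingerprint] = (ref, now, n + 1)
            return ref, True
        self.seen[fingerprint] = (alert_ref, now, 1)
        return alert_ref, False
```

Repeated ingests with the same fingerprint inside the window return the same alert ref with `deduped=True`, which is exactly the check in step 1 above.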
### Alert → Incident conversion

```
# Sofiia or oncall agent calls:
oncall_tool.alert_to_incident(
    alert_ref="alrt_...",
    incident_severity_cap="P1",
    dedupe_window_minutes=60
)
```
### View recent alerts (by status)

```
# Default: all statuses
alert_ingest_tool.list(window_minutes=240, env="prod")

# Only new/failed (unprocessed):
alert_ingest_tool.list(window_minutes=240, status_in=["new","failed"])
```
### Claim alerts for processing (Supervisor loop)

```
# Atomic claim — locks alerts for 10 min
alert_ingest_tool.claim(window_minutes=240, limit=25, owner="sofiia-supervisor", lock_ttl_seconds=600)
```

### Mark alert as failed (retry)

```
alert_ingest_tool.fail(alert_ref="alrt_...", error="gateway timeout", retry_after_seconds=300)
```
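Put together, a Supervisor-style loop over `claim`, `ack`, and `fail` might look like the sketch below. Both `tool` (standing in for the real `alert_ingest_tool` client) and `handle` (the per-alert triage step) are hypothetical interfaces, not code from this repo:

```python
# Hypothetical Supervisor loop over claim/ack/fail; `tool` and `handle`
# are stand-ins, not real interfaces from this repo.
def process_batch(tool, handle, owner="sofiia-supervisor"):
    """Claim a batch, process each alert, ack on success, fail with retry on error."""
    claimed = tool.claim(window_minutes=240, limit=25, owner=owner, lock_ttl_seconds=600)
    results = []
    for alert in claimed:
        try:
            handle(alert)
            tool.ack(alert_ref=alert["alert_ref"])
            results.append((alert["alert_ref"], "acked"))
        except Exception as exc:
            # Failed alerts re-enter the state machine after retry_after_seconds
            tool.fail(alert_ref=alert["alert_ref"], error=str(exc), retry_after_seconds=300)
            results.append((alert["alert_ref"], "failed"))
    return results
```
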
### Operational dashboard

```
GET /v1/alerts/dashboard?window_minutes=240
# → counts by status, top signatures, latest alerts
```

```
GET /v1/incidents/open?service=gateway
# → open/mitigating incidents
```
### Monitor health check

Verify Monitor is pushing alerts:

```
alert_ingest_tool.list(source="monitor@node1", window_minutes=60)
```

If the list is empty but alerts are expected, check the Monitor service and its entitlements.
## SLO Watch Gate

### Staging blocks on SLO breach

Config in `config/release_gate_policy.yml`:

```yaml
staging:
  gates:
    slo_watch:
      mode: "strict"
```

To temporarily bypass (emergency deploy):

```yaml
# In release_check input:
run_slo_watch: false
```

Document the reason in the incident timeline.
### Tuning SLO thresholds

Edit `config/slo_policy.yml`:

```yaml
services:
  gateway:
    latency_p95_ms: 300   # adjust per service
    error_rate_pct: 1.0
```
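A breach check against these thresholds is essentially a p95 and error-rate comparison. The sketch below assumes a hypothetical `slo_breaches` helper; the policy keys mirror `slo_policy.yml`, but the real SLO watch implementation is not shown in this runbook:

```python
import math

# Hypothetical breach check; policy keys mirror config/slo_policy.yml.
def slo_breaches(policy, latencies_ms, errors, total):
    """Compare observed p95 latency and error rate against a service policy;
    return a list of human-readable breach descriptions (empty = healthy)."""
    breaches = []
    ranked = sorted(latencies_ms)
    # nearest-rank p95: the sample at ceil(0.95 * n), 1-indexed
    p95 = ranked[min(len(ranked) - 1, math.ceil(0.95 * len(ranked)) - 1)]
    if p95 > policy["latency_p95_ms"]:
        breaches.append(f"latency_p95_ms: {p95} > {policy['latency_p95_ms']}")
    error_rate = 100.0 * errors / total if total else 0.0
    if error_rate > policy["error_rate_pct"]:
        breaches.append(f"error_rate_pct: {error_rate:.2f} > {policy['error_rate_pct']}")
    return breaches
```
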
## Troubleshooting

| Symptom | Cause | Fix |
|---------|-------|-----|
| Alert `accepted=false` | Validation failure (missing service/title, invalid kind) | Fix the Monitor alert payload |
| `deduped=true` unexpectedly | Same fingerprint within TTL | Check the Monitor fingerprint logic |
| `alert_to_incident` fails with "not found" | Alert ref expired from the MemoryStore | Switch to the Postgres backend |
| Alerts stuck in `processing` | Loop died without acking | Run `claim` — it auto-requeues expired locks. Or: `UPDATE alerts SET status='new', processing_lock_until=NULL WHERE status='processing' AND processing_lock_until < NOW()` |
| Alerts stuck in `failed` | Persistent processing errors | Check the `last_error` field: `SELECT alert_ref, last_error FROM alerts WHERE status='failed'` |
| Triage not running | Cooldown active | Check `incident_signature_state.last_triage_at`, or reduce `triage_cooldown_minutes` in the policy |
| `claim` returns empty | All new alerts already locked | Check for stale processing: `SELECT COUNT(*) FROM alerts WHERE status='processing' AND processing_lock_until < NOW()` |
| SLO gate blocks in staging | SLO breach active | Fix the service or override with `run_slo_watch: false` |
| `tools.alerts.ingest` denied | Monitor agent missing entitlement | Check the `agent_monitor` role in `config/rbac_tools_matrix.yml` |
| `tools.alerts.claim` denied | Agent missing `tools.alerts.claim` | Only `agent_cto` / `agent_oncall` / Supervisor can claim |
## Retention

Alerts in Postgres have no TTL enforced by default; add a cron job if needed:

```sql
DELETE FROM alerts WHERE created_at < NOW() - INTERVAL '30 days';
```

The Memory backend is cleared on process restart.
---

## Production Mode: ALERT_BACKEND=postgres

**⚠ The default is `memory` — do NOT use it in production.** Alerts are lost on router restart.

### Setup (one-time, per environment)
**1. Run migration:**

```bash
python3 ops/scripts/migrate_alerts_postgres.py \
  --dsn "postgresql://user:pass@host:5432/daarion"
# or dry-run:
python3 ops/scripts/migrate_alerts_postgres.py --dry-run
```
**2. Set env vars** (in `.env`, docker-compose, or a systemd unit):

```bash
ALERT_BACKEND=postgres
ALERT_DATABASE_URL=postgresql://user:pass@host:5432/daarion
# Fallback: if ALERT_DATABASE_URL is unset, DATABASE_URL is used automatically
```
**3. Restart router:**

```bash
docker compose -f docker-compose.node1.yml restart router
# or node2:
docker compose -f docker-compose.node2-sofiia.yml restart router
```
**4. Verify persistence** (survives a restart):

```bash
# Ingest a test alert
curl -X POST http://router:8000/v1/tools/execute \
  -H "Content-Type: application/json" \
  -d '{"tool":"alert_ingest_tool","action":"ingest","service":"test","kind":"test","message":"persistence check"}'

# Restart router
docker compose restart router

# Confirm the alert is still visible after the restart
curl -X POST http://router:8000/v1/tools/execute \
  -H "Content-Type: application/json" \
  -d '{"tool":"alert_ingest_tool","action":"list","service":"test"}'
# Expect: alert still present → PASS
```
### DSN resolution order

The `alert_store.py` factory resolves the DSN in this priority:

1. `ALERT_DATABASE_URL` (service-specific, recommended)
2. `DATABASE_URL` (shared Postgres, fallback)
3. Memory backend, with a WARNING log, if neither is set.
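The resolution order above can be sketched as a small function. `resolve_dsn` is illustrative only; the real factory lives in `alert_store.py` and is not reproduced here:

```python
import logging

# Illustrative sketch of the DSN resolution order; not the real factory.
def resolve_dsn(env):
    """Return ('postgres', dsn) or ('memory', None) per the priority list above."""
    dsn = env.get("ALERT_DATABASE_URL") or env.get("DATABASE_URL")
    if dsn:
        return ("postgres", dsn)
    logging.warning("No ALERT_DATABASE_URL/DATABASE_URL set; falling back to memory backend")
    return ("memory", None)
```
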
### Compose files updated

| File | `ALERT_BACKEND` set? |
|------|----------------------|
| `docker-compose.node1.yml` | ✅ `postgres` |
| `docker-compose.node2-sofiia.yml` | ✅ `postgres` |
| `docker-compose.staging.yml` | ✅ `postgres` |