docs(platform): add policy configs, runbooks, ops scripts and platform documentation

Config policies (16 files): alert_routing, architecture_pressure, backlog, cost_weights, data_governance, incident_escalation, incident_intelligence, network_allowlist, nodes_registry, observability_sources, rbac_tools_matrix, release_gate, risk_attribution, risk_policy, slo_policy, tool_limits, tools_rollout

Ops (22 files): Caddyfile, calendar compose, grafana voice dashboard, deployments/incidents logs, runbooks for alerts/audit/backlog/incidents/sofiia/voice, cron jobs, scripts (alert_triage, audit_cleanup, migrate_*, governance, schedule), task_registry, voice alerts/ha/latency/policy

Docs (30+ files): HUMANIZED_STEPAN v2.7-v3 changelogs and runbooks, NODA1/NODA2 status and setup, audit index and traces, backlog, incident, supervisor, tools, voice, opencode, release, risk, aistalk, spacebot

Made-with: Cursor
2026-03-03 07:14:53 -08:00
parent 129e4ea1fc
commit 67225a39fa
102 changed files with 20060 additions and 0 deletions
--- a/docs/risk/risk_index.md
+++ b/docs/risk/risk_index.md
@@ -0,0 +1,206 @@
+# Service Risk Index
+
+> Deterministic. No LLM. Production-grade.
+
+## Overview
+
+The Risk Index Engine computes a **numerical risk score (0–100+)** for every tracked service. It is the single authoritative metric for service health in the DAARION.city control plane.
+
+Score → Band mapping:
+
+| Score  | Band     | Meaning                                   |
+|--------|----------|-------------------------------------------|
+| 0–20   | low      | No significant signals                    |
+| 21–50  | medium   | Minor signals; monitor                    |
+| 51–80  | high     | Active problems; coordinate before deploy |
+| 81+    | critical | Block or escalate immediately             |
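+
+The mapping above can be sketched as a small pure function. The name `score_to_band` appears in the architecture diagram of this document, but the signature and keyword defaults below are assumptions for illustration, not the engine's actual code.
+
+```python
+def score_to_band(score, low_max=20, medium_max=50, high_max=80):
+    """Map a numeric risk score to a band name (hypothetical signature)."""
+    if score <= low_max:
+        return "low"
+    if score <= medium_max:
+        return "medium"
+    if score <= high_max:
+        return "high"
+    # Anything above high_max is critical; the score has no upper bound.
+    return "critical"
+
+
+print(score_to_band(72))  # high
+```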
+
+---
+
+## Scoring Formula
+
+```
+Risk(service) = Σ weight(signal) × count_or_flag(signal)
+```
+
+All weights are policy-driven via `config/risk_policy.yml`.
+
+### Signal weights (defaults)
+
+| Signal                        | Points                        |
+|-------------------------------|-------------------------------|
+| Open P0 incident              | 50 each                       |
+| Open P1 incident              | 25 each                       |
+| Open P2 incident              | 10 each                       |
+| Open P3 incident              | 5 each                        |
+| High recurrence signature 7d  | 20 each                       |
+| Warn recurrence signature 7d  | 10 each                       |
+| High recurrence kind 7d       | 15 each                       |
+| Warn recurrence kind 7d       | 8 each                        |
+| High recurrence signature 30d | 10 each                       |
+| High recurrence kind 30d      | 8 each                        |
+| Overdue follow-up P0          | 20 each                       |
+| Overdue follow-up P1          | 12 each                       |
+| Overdue follow-up other       | 6 each                        |
+| Active SLO violation (60m)    | 10 each                       |
+| Alert-loop SLO violation      | 10 each                       |
+| Escalations 24h (1–2)         | 5 (warn level)                |
+| Escalations 24h (3+)          | 12 (high level)               |
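+
+As a worked example of the formula, the sketch below applies the default weights from the table. The dictionary keys and function names are hypothetical; the real keys live in `config/risk_policy.yml` and may differ. Escalations are treated as a tiered flag rather than a per-event count, as the last two table rows suggest.
+
+```python
+# Default weights from the table above; the key names are hypothetical.
+DEFAULT_WEIGHTS = {
+    "open_p0": 50, "open_p1": 25, "open_p2": 10, "open_p3": 5,
+    "recurrence_signature_high_7d": 20, "recurrence_signature_warn_7d": 10,
+    "recurrence_kind_high_7d": 15, "recurrence_kind_warn_7d": 8,
+    "recurrence_signature_high_30d": 10, "recurrence_kind_high_30d": 8,
+    "overdue_followup_p0": 20, "overdue_followup_p1": 12,
+    "overdue_followup_other": 6,
+    "slo_violation": 10, "alert_loop_violation": 10,
+}
+
+
+def escalation_points(count_24h):
+    """Escalations are tiered, not multiplied: 1-2 is warn, 3+ is high."""
+    if count_24h >= 3:
+        return 12
+    if count_24h >= 1:
+        return 5
+    return 0
+
+
+def risk_score(counts, escalations_24h=0, weights=DEFAULT_WEIGHTS):
+    """Sum weight x count over all signals, then add the escalation tier."""
+    total = sum(weights[name] * count for name, count in counts.items())
+    return total + escalation_points(escalations_24h)
+
+
+signals = {
+    "open_p1": 1,
+    "recurrence_signature_high_7d": 1,
+    "overdue_followup_p1": 1,
+    "slo_violation": 1,
+}
+print(risk_score(signals, escalations_24h=1))  # 25 + 20 + 12 + 10 + 5 = 72
+```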
+
+---
+
+## Configuration
+
+**`config/risk_policy.yml`** — controls all weights, thresholds, and per-service overrides.
+
+```yaml
+thresholds:
+  bands:
+    low_max: 20
+    medium_max: 50
+    high_max: 80
+  risk_watch:
+    warn_at: 50
+    fail_at: 80
+
+service_overrides:
+  gateway:
+    risk_watch:
+      fail_at: 75   # gateway fails earlier: critical path
+
+p0_services:
+  - gateway
+  - router
+```
+
+Changes to the file take effect on the next request, because the policy cache is short-lived.
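+
+One plausible way to resolve per-service overrides is a shallow merge of the `risk_watch` block over the global thresholds. The sketch below uses a plain dictionary in place of the parsed YAML; the function name `risk_watch_thresholds` is hypothetical and the merge rule is an assumption about how the engine behaves.
+
+```python
+# Parsed form of the policy excerpt above (only the keys used here).
+POLICY = {
+    "thresholds": {"risk_watch": {"warn_at": 50, "fail_at": 80}},
+    "service_overrides": {"gateway": {"risk_watch": {"fail_at": 75}}},
+}
+
+
+def risk_watch_thresholds(policy, service):
+    """Return global risk_watch thresholds with service overrides applied."""
+    merged = dict(policy["thresholds"]["risk_watch"])  # copy, do not mutate
+    override = policy.get("service_overrides", {}).get(service, {})
+    merged.update(override.get("risk_watch", {}))
+    return merged
+
+
+print(risk_watch_thresholds(POLICY, "gateway"))  # {'warn_at': 50, 'fail_at': 75}
+print(risk_watch_thresholds(POLICY, "router"))   # {'warn_at': 50, 'fail_at': 80}
+```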
+
+---
+
+## API
+
+### `GET /v1/risk/service/{service}?env=prod&window_hours=24`
+
+Returns a `RiskReport`:
+
+```json
+{
+  "service": "gateway",
+  "env": "prod",
+  "score": 72,
+  "band": "high",
+  "thresholds": { "warn_at": 50, "fail_at": 75 },
+  "components": {
+    "open_incidents": { "P0": 0, "P1": 1, "P2": 0, "points": 25 },
+    "recurrence": { "high_signatures_7d": 1, "points": 20 },
+    "followups": { "overdue_P1": 1, "points": 12 },
+    "slo": { "violations": 1, "points": 10 },
+    "alerts_loop": { "violations": 0, "points": 0 },
+    "escalations": { "count_24h": 1, "points": 5 }
+  },
+  "reasons": [
+    "Open P1 incident(s): 1",
+    "High recurrence signatures (7d): 1",
+    "Overdue follow-ups (P1): 1",
+    "Active SLO violation(s) in window: 1",
+    "Escalations in last 24h: 1"
+  ],
+  "recommendations": [
+    "Prioritize open P0/P1 incidents before deploying.",
+    "Investigate recurring failure patterns.",
+    "Avoid risky deploys until SLO violation clears.",
+    "Service is high-risk — coordinate with oncall before release."
+  ],
+  "updated_at": "2026-02-23T12:00:00"
+}
+```
+
+RBAC required: `tools.risk.read` (granted to `agent_cto`, `agent_oncall`, `agent_monitor`).
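+
+A client needs only the path and the two query parameters shown in the endpoint signature. The helper below is a minimal sketch that builds the request URL; `risk_service_url` and the base URL are assumptions, and authentication (the RBAC grant above) is left out.
+
+```python
+from urllib.parse import quote, urlencode
+
+
+def risk_service_url(base_url, service, env="prod", window_hours=24):
+    """Build the URL for GET /v1/risk/service/{service} (hypothetical helper)."""
+    path_service = quote(service, safe="")  # escape any reserved characters
+    query = urlencode({"env": env, "window_hours": window_hours})
+    root = base_url.rstrip("/")
+    return f"{root}/v1/risk/service/{path_service}?{query}"
+
+
+# The host below is a placeholder, not a real deployment address.
+print(risk_service_url("http://localhost:8000", "gateway"))
+```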
+
+### `GET /v1/risk/dashboard?env=prod&top_n=10`
+
+Returns top-N services by score with band summary:
+
+```json
+{
+  "env": "prod",
+  "generated_at": "...",
+  "total_services": 4,
+  "band_counts": { "critical": 1, "high": 1, "medium": 2, "low": 0 },
+  "critical_p0_services": ["gateway"],
+  "services": [ ...RiskReports sorted by score desc... ]
+}
+```
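+
+The dashboard aggregation can be sketched as a sort plus a band count. This is an illustration only: `build_dashboard` is a hypothetical name, `generated_at` is omitted, and reading `critical_p0_services` as "services listed in `p0_services` that are in the critical band" is an assumption based on the field name.
+
+```python
+from collections import Counter
+
+BANDS = ("critical", "high", "medium", "low")
+
+
+def build_dashboard(reports, p0_services, env="prod", top_n=10):
+    """Summarize RiskReport dicts: sort by score desc and count bands."""
+    ranked = sorted(reports, key=lambda r: r["score"], reverse=True)
+    counts = Counter(r["band"] for r in ranked)
+    return {
+        "env": env,
+        "total_services": len(ranked),
+        "band_counts": {band: counts.get(band, 0) for band in BANDS},
+        "critical_p0_services": [
+            r["service"] for r in ranked
+            if r["band"] == "critical" and r["service"] in p0_services
+        ],
+        "services": ranked[:top_n],
+    }
+
+
+# Sample data; "svc-a" and "svc-b" are made-up service names.
+reports = [
+    {"service": "svc-a", "score": 30, "band": "medium"},
+    {"service": "gateway", "score": 92, "band": "critical"},
+    {"service": "svc-b", "score": 25, "band": "medium"},
+    {"service": "router", "score": 72, "band": "high"},
+]
+dashboard = build_dashboard(reports, p0_services={"gateway", "router"})
+print(dashboard["band_counts"])
+print(dashboard["critical_p0_services"])  # ['gateway']
+```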
+
+### Tool: `risk_engine_tool`
+
+```json
+{ "action": "service",   "service": "gateway", "env": "prod" }
+{ "action": "dashboard", "env": "prod", "top_n": 10 }
+{ "action": "policy" }
+```
+
+---
+
+## Release Gate: `risk_watch`
+
+The `risk_watch` gate integrates Risk Index into the release pipeline.
+
+### Behaviour
+
+| Mode   | When score ≥ warn_at (default 50) | When score ≥ fail_at (default 80) |
+|--------|------------------------------------|-------------------------------------|
+| warn   | pass=true + recommendations added  | pass=true + recommendations added   |
+| strict | pass=true + recommendations added  | **pass=false** — deploy blocked     |
+
+### Policy
+
+```yaml
+# config/release_gate_policy.yml
+dev:
+  risk_watch: { mode: "warn" }
+staging:
+  risk_watch: { mode: "strict" }   # blocks p0_services when score >= fail_at
+prod:
+  risk_watch: { mode: "warn" }
+```
+
+### Non-fatal guarantee
+
+If the Risk Engine is unavailable (store down, timeout, error), `risk_watch` is **skipped** — never blocks. A warning is added to the gate output.
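+
+The behaviour table and the non-fatal guarantee can be combined in one small function. The sketch below is not the code in `release_check_runner`: the function name, the `fetch_report` callable, and the result keys are assumptions, and it applies `strict` to every service rather than only to `p0_services`.
+
+```python
+def risk_watch_gate(fetch_report, mode="warn", warn_at=50, fail_at=80):
+    """Evaluate the risk_watch gate; never raises, never blocks on errors."""
+    result = {"pass": True, "skipped": False, "warnings": [], "recommendations": []}
+    try:
+        report = fetch_report()
+    except Exception as exc:  # store down, timeout, or any other failure
+        result["skipped"] = True
+        result["warnings"].append(f"risk_watch skipped: {exc}")
+        return result
+
+    score = report["score"]
+    if score >= warn_at:
+        # Both modes surface the engine's recommendations above warn_at.
+        result["recommendations"] = list(report.get("recommendations", []))
+    if mode == "strict" and score >= fail_at:
+        result["pass"] = False  # only strict mode blocks the deploy
+    return result
+
+
+def broken_engine():
+    raise TimeoutError("risk engine timed out")
+
+
+report = {"score": 85, "recommendations": ["Coordinate with oncall."]}
+print(risk_watch_gate(lambda: report, mode="warn")["pass"])    # True
+print(risk_watch_gate(lambda: report, mode="strict")["pass"])  # False
+print(risk_watch_gate(broken_engine, mode="strict")["pass"])   # True
+```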
+
+### Release inputs
+
+| Input                | Type    | Default | Description              |
+|----------------------|---------|---------|--------------------------|
+| `run_risk_watch`     | boolean | true    | Enable/disable the gate  |
+| `risk_watch_env`     | string  | prod    | Env to score against     |
+| `risk_watch_warn_at` | int     | policy  | Override warn threshold  |
+| `risk_watch_fail_at` | int     | policy  | Override fail threshold  |
+
+---
+
+## Architecture
+
+```
+[Incident Store]──open incidents──┐
+[Intelligence]──recurrence 7d/30d─┤
+[Followups Summary]──overdue──────┤──► risk_engine.py ──► RiskReport
+[SLO Snapshot]──violations────────┤           │
+[Alert Store]──loop SLO───────────┤      score_to_band
+[Decision Events]──escalations────┘           │
+                                        release_check_runner
+                                           risk_watch gate
+```
+
+The engine has **zero LLM calls**. It is deterministic: given the same signals, the same score is always produced.
+
+---
+
+## Testing
+
+```bash
+pytest tests/test_risk_engine.py         # scoring + bands + overrides
+pytest tests/test_risk_dashboard.py      # sorting + band counts + p0 detection
+pytest tests/test_release_check_risk_watch.py  # warn/strict/non-fatal gate
+```