docs(platform): add policy configs, runbooks, ops scripts and platform documentation

Config policies (16 files): alert_routing, architecture_pressure, backlog, cost_weights, data_governance, incident_escalation, incident_intelligence, network_allowlist, nodes_registry, observability_sources, rbac_tools_matrix, release_gate, risk_attribution, risk_policy, slo_policy, tool_limits, tools_rollout

Ops (22 files): Caddyfile, calendar compose, grafana voice dashboard, deployments/incidents logs, runbooks for alerts/audit/backlog/incidents/sofiia/voice, cron jobs, scripts (alert_triage, audit_cleanup, migrate_*, governance, schedule), task_registry, voice alerts/ha/latency/policy

Docs (30+ files): HUMANIZED_STEPAN v2.7-v3 changelogs and runbooks, NODA1/NODA2 status and setup, audit index and traces, backlog, incident, supervisor, tools, voice, opencode, release, risk, aistalk, spacebot

Made-with: Cursor
2026-03-03 07:14:53 -08:00
parent 129e4ea1fc
commit 67225a39fa
102 changed files with 20060 additions and 0 deletions
--- a/docs/incident/escalation.md
+++ b/docs/incident/escalation.md
@@ -0,0 +1,99 @@
+# Incident Escalation Engine
+
+Deterministic, LLM-free engine that escalates incidents and identifies auto-resolve candidates
+based on alert storm behavior.
+
+## Overview
+
+```
+alert_triage_graph (every 5 min)
+  └─ process_alerts
+  └─ post_process_escalation  ← incident_escalation_tool.evaluate
+  └─ post_process_autoresolve ← incident_escalation_tool.auto_resolve_candidates
+  └─ build_digest             ← includes escalation + candidate summary
+```
+
+## Escalation Logic
+
+Config: `config/incident_escalation_policy.yml`
+
+| Trigger | From → To |
+|---------|-----------|
+| `occurrences_60m ≥ 10` OR `triage_count_24h ≥ 3` | P2 → P1 |
+| `occurrences_60m ≥ 25` OR `triage_count_24h ≥ 6` | P1 → P0 |
+| Cap: `severity_cap: "P0"` | never exceeds P0 |
+
+When escalation triggers:
+1. `incident_append_event(type=decision)` — audit trail
+2. `incident_append_event(type=followup)` — auto follow-up (if `create_followup_on_escalate: true`)
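The threshold table above can be sketched as a small pure function. This is only an illustration, not the engine's code: the `EscalationPolicy` class, the `next_severity` name, the severity ordering, and the choice to escalate at most one level per evaluation are assumptions; only the threshold values and the `severity_cap` key come from the document.

```python
from dataclasses import dataclass, field

# Least to most severe; the ordering itself is an assumption of this sketch.
SEVERITY_ORDER = ["P3", "P2", "P1", "P0"]


@dataclass(frozen=True)
class EscalationPolicy:
    """Hypothetical container mirroring the `escalation` block of the policy file."""
    occurrences_thresholds: dict = field(
        default_factory=lambda: {"P2_to_P1": 10, "P1_to_P0": 25})
    triage_thresholds_24h: dict = field(
        default_factory=lambda: {"P2_to_P1": 3, "P1_to_P0": 6})
    severity_cap: str = "P0"


def next_severity(severity, occurrences_60m, triage_count_24h, policy=None):
    """Return the severity after at most one escalation step (assumed behavior)."""
    policy = policy or EscalationPolicy()
    if severity not in SEVERITY_ORDER:
        raise ValueError(f"unknown severity: {severity}")

    idx = SEVERITY_ORDER.index(severity)
    cap_idx = SEVERITY_ORDER.index(policy.severity_cap)
    # Already at (or above) the cap, or already at the top: nothing to do.
    if idx >= cap_idx or idx == len(SEVERITY_ORDER) - 1:
        return severity

    target = SEVERITY_ORDER[idx + 1]
    key = f"{severity}_to_{target}"  # e.g. "P2_to_P1"
    occ_limit = policy.occurrences_thresholds.get(key)
    triage_limit = policy.triage_thresholds_24h.get(key)

    # Either condition is sufficient (the table uses OR).
    triggered = (
        (occ_limit is not None and occurrences_60m >= occ_limit)
        or (triage_limit is not None and triage_count_24h >= triage_limit)
    )
    return target if triggered else severity


if __name__ == "__main__":
    print(next_severity("P2", occurrences_60m=12, triage_count_24h=1))
    print(next_severity("P1", occurrences_60m=5, triage_count_24h=6))
```

Keeping the decision free of I/O fits the "deterministic, LLM-free" description: the same counters always produce the same severity, and the caller remains responsible for appending the `decision` and `followup` events.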
+
+## Auto-resolve Candidates
+
+An incident becomes an auto-resolve candidate when `last_alert_at < now - no_alerts_minutes_for_candidate`:
+
+- `close_allowed_severities: ["P2", "P3"]` limits auto-close to low-severity incidents
+- `auto_close: false` (default) produces *candidates* only; nothing is closed automatically
+- Each candidate gets a `note` event appended to the incident timeline
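The candidate rule above could look roughly like the following sketch. The `find_candidates` function, the incident dictionary shape (`id`, `severity`), and the `would_close` flag are hypothetical. The sketch also assumes that every quiet incident is reported as a candidate and that `close_allowed_severities` only gates the actual close; the document does not settle that point.

```python
from datetime import datetime, timedelta, timezone


def find_candidates(
    incidents,
    now,
    no_alerts_minutes_for_candidate=60,
    close_allowed_severities=("P2", "P3"),
    auto_close=False,
):
    """Return quiet incidents, flagging which ones policy would allow closing."""
    cutoff = now - timedelta(minutes=no_alerts_minutes_for_candidate)
    candidates = []
    for incident in incidents:
        # Strict comparison, matching `last_alert_at < now - N minutes`.
        if incident["last_alert_at"] < cutoff:
            candidates.append({
                "id": incident["id"],
                "severity": incident["severity"],
                # Closing requires both the global switch and an allowed severity.
                "would_close": auto_close
                and incident["severity"] in close_allowed_severities,
            })
    return candidates


if __name__ == "__main__":
    now = datetime(2026, 3, 3, 12, 0, tzinfo=timezone.utc)
    sample = [
        {"id": "inc-1", "severity": "P3", "last_alert_at": now - timedelta(minutes=90)},
        {"id": "inc-2", "severity": "P2", "last_alert_at": now - timedelta(minutes=30)},
    ]
    for candidate in find_candidates(sample, now):
        print(candidate)
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the check reproducible in tests and in replayed triage runs.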
+
+## Alert-loop SLO
+
+Tracked in `/v1/alerts/dashboard?window_minutes=240`:
+
+```json
+"slo": {
+  "claim_to_ack_p95_seconds": 12.3,
+  "failed_rate_pct": 0.5,
+  "processing_stuck_count": 0,
+  "violations": []
+}
+```
+
+Thresholds (from `alert_loop_slo` in policy):
+- `claim_to_ack_p95_seconds: 60` caps the p95 latency from claim to ack, in seconds
+- `failed_rate_pct: 5` caps the failure rate, `failed / (acked + failed)`, in percent
+- `processing_stuck_minutes: 15` marks an alert as stuck once it has been in processing longer than this
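As a rough illustration of how the `slo` block relates to these thresholds, the sketch below recomputes the failure rate and derives a violations list. The function names, the violation labels, the strict `>` comparisons, and treating any non-zero `processing_stuck_count` as a violation are assumptions; the dashboard's actual output format is not shown in the document.

```python
def failed_rate_pct(acked, failed):
    """Failure rate as a percentage of failed / (acked + failed)."""
    total = acked + failed
    return 0.0 if total == 0 else 100.0 * failed / total


def slo_violations(slo, thresholds):
    """Return hypothetical violation labels for an `slo` block."""
    violations = []
    if slo["claim_to_ack_p95_seconds"] > thresholds["claim_to_ack_p95_seconds"]:
        violations.append("claim_to_ack_p95_seconds")
    if slo["failed_rate_pct"] > thresholds["failed_rate_pct"]:
        violations.append("failed_rate_pct")
    # Assumption: the count already reflects `processing_stuck_minutes`,
    # so any stuck alert counts as a violation.
    if slo["processing_stuck_count"] > 0:
        violations.append("processing_stuck_count")
    return violations


if __name__ == "__main__":
    thresholds = {"claim_to_ack_p95_seconds": 60, "failed_rate_pct": 5}
    healthy = {
        "claim_to_ack_p95_seconds": 12.3,
        "failed_rate_pct": 0.5,
        "processing_stuck_count": 0,
    }
    print(slo_violations(healthy, thresholds))
```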
+
+## RBAC
+
+| Action | Required entitlement |
+|--------|---------------------|
+| `evaluate` | `tools.oncall.incident_write` (CTO/oncall) |
+| `auto_resolve_candidates` | `tools.oncall.incident_write` (CTO/oncall) |
+
+The Monitor agent does NOT have access (it is ingest-only).
+
+## Configuration
+
+```yaml
+# config/incident_escalation_policy.yml
+escalation:
+  occurrences_thresholds:
+    P2_to_P1: 10
+    P1_to_P0: 25
+  triage_thresholds_24h:
+    P2_to_P1: 3
+    P1_to_P0: 6
+  severity_cap: "P0"
+  create_followup_on_escalate: true
+
+auto_resolve:
+  no_alerts_minutes_for_candidate: 60
+  close_allowed_severities: ["P2", "P3"]
+  auto_close: false
+
+alert_loop_slo:
+  claim_to_ack_p95_seconds: 60
+  failed_rate_pct: 5
+  processing_stuck_minutes: 15
+```
+
+## Tuning
+
+**Too many escalations (noisy)?**  
+→ Increase `occurrences_thresholds.P2_to_P1` or `triage_thresholds_24h.P2_to_P1`.
+
+**Auto-resolve too aggressive?**  
+→ Increase `no_alerts_minutes_for_candidate` (e.g., 120 min).
+
+**Ready to enable auto-close for P3?**  
+→ Set `auto_close: true` and `close_allowed_severities: ["P3"]`.