# Incident Escalation Engine

Deterministic, LLM-free engine that escalates incidents and identifies auto-resolve candidates based on alert storm behavior.

## Overview

```
alert_triage_graph (every 5 min)
  └─ process_alerts
  └─ post_process_escalation   ← incident_escalation_tool.evaluate
  └─ post_process_autoresolve  ← incident_escalation_tool.auto_resolve_candidates
  └─ build_digest              ← includes escalation + candidate summary
```

## Escalation Logic

Config: `config/incident_escalation_policy.yml`

| Trigger | From → To |
|---------|-----------|
| `occurrences_60m ≥ 10` OR `triage_count_24h ≥ 3` | P2 → P1 |
| `occurrences_60m ≥ 25` OR `triage_count_24h ≥ 6` | P1 → P0 |
| Cap: `severity_cap: "P0"` | never exceeds P0 |

When escalation triggers:

1. `incident_append_event(type=decision)`: audit trail
2. `incident_append_event(type=followup)`: auto follow-up (if `create_followup_on_escalate: true`)

## Auto-resolve Candidates

Incidents where `last_alert_at < now - no_alerts_minutes_for_candidate`:

- `close_allowed_severities: ["P2", "P3"]`: only low-severity auto-closeable
- `auto_close: false` (default): produces *candidates* only, no auto-close
- Each candidate gets a `note` event appended to the incident timeline

## Alert-loop SLO

Tracked in `/v1/alerts/dashboard?window_minutes=240`:

```json
"slo": {
  "claim_to_ack_p95_seconds": 12.3,
  "failed_rate_pct": 0.5,
  "processing_stuck_count": 0,
  "violations": []
}
```

Thresholds (from `alert_loop_slo` in policy):

- `claim_to_ack_p95_seconds: 60`: p95 latency from claim to ack
- `failed_rate_pct: 5`: max % failed/(acked+failed)
- `processing_stuck_minutes: 15`: alerts stuck in processing beyond this

## RBAC

| Action | Required entitlement |
|--------|---------------------|
| `evaluate` | `tools.oncall.incident_write` (CTO/oncall) |
| `auto_resolve_candidates` | `tools.oncall.incident_write` (CTO/oncall) |

Monitor agent does NOT have access (ingest-only).
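
## Example: Escalation Decision

The trigger table under Escalation Logic can be read as a small pure function. The following is a minimal sketch, not the engine's actual code: `IncidentStats`, `next_severity`, and the `POLICY` dict layout are hypothetical names introduced for illustration, and the sketch assumes at most one severity step per evaluation, which the policy description does not confirm.

```python
from dataclasses import dataclass
from typing import Optional

# Lowest to highest severity; used only to compare against the cap.
SEVERITY_ORDER = ["P3", "P2", "P1", "P0"]

# Threshold values copied from the trigger table. The dict layout is an assumption.
POLICY = {
    "occurrences_thresholds": {"P2_to_P1": 10, "P1_to_P0": 25},
    "triage_thresholds_24h": {"P2_to_P1": 3, "P1_to_P0": 6},
    "severity_cap": "P0",
}

# Which rule applies to which current severity, and where it leads.
TRANSITIONS = {"P2": ("P2_to_P1", "P1"), "P1": ("P1_to_P0", "P0")}


@dataclass
class IncidentStats:
    """Hypothetical container for the counters the rules look at."""
    severity: str
    occurrences_60m: int
    triage_count_24h: int


def next_severity(stats: IncidentStats, policy: dict = POLICY) -> Optional[str]:
    """Return the escalated severity, or None when no rule fires."""
    if stats.severity not in TRANSITIONS:
        return None  # the table defines no rule for P3 or P0
    key, target = TRANSITIONS[stats.severity]
    fired = (
        stats.occurrences_60m >= policy["occurrences_thresholds"][key]
        or stats.triage_count_24h >= policy["triage_thresholds_24h"][key]
    )
    if not fired:
        return None
    # Never escalate past the configured cap.
    cap = policy["severity_cap"]
    if SEVERITY_ORDER.index(target) > SEVERITY_ORDER.index(cap):
        return None
    return target
```

Because the function depends only on its inputs, the same counters always yield the same decision, which is the property the "deterministic, LLM-free" description relies on. A caller would append the `decision` event (and the follow-up, if enabled) only when the result is not `None`.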
## Configuration

```yaml
# config/incident_escalation_policy.yml
escalation:
  occurrences_thresholds:
    P2_to_P1: 10
    P1_to_P0: 25
  triage_thresholds_24h:
    P2_to_P1: 3
    P1_to_P0: 6
  severity_cap: "P0"
  create_followup_on_escalate: true

auto_resolve:
  no_alerts_minutes_for_candidate: 60
  close_allowed_severities: ["P2", "P3"]
  auto_close: false

alert_loop_slo:
  claim_to_ack_p95_seconds: 60
  failed_rate_pct: 5
  processing_stuck_minutes: 15
```

## Tuning

**Too many escalations (noisy)?**
→ Increase `occurrences_thresholds.P2_to_P1` or `triage_thresholds_24h.P2_to_P1`.

**Auto-resolve too aggressive?**
→ Increase `no_alerts_minutes_for_candidate` (e.g., 120 min).

**Ready to enable auto-close for P3?**
→ Set `auto_close: true` and `close_allowed_severities: ["P3"]`.
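
To see how the auto-resolve knobs interact before changing them in production, the candidate rule can be approximated in a few lines. This is a sketch under stated assumptions: `find_candidates`, the incident dict shape, and the `would_close` field are hypothetical, and it assumes that `close_allowed_severities` gates only closing, not candidacy, which the policy text leaves ambiguous.

```python
from datetime import datetime, timedelta, timezone

# Defaults copied from the auto_resolve section of the policy file.
AUTO_RESOLVE = {
    "no_alerts_minutes_for_candidate": 60,
    "close_allowed_severities": ["P2", "P3"],
    "auto_close": False,
}


def find_candidates(incidents, now, policy=AUTO_RESOLVE):
    """Return incidents that have been quiet longer than the configured window.

    Each incident is a dict with hypothetical keys: id, severity, last_alert_at.
    """
    cutoff = now - timedelta(minutes=policy["no_alerts_minutes_for_candidate"])
    candidates = []
    for inc in incidents:
        if inc["last_alert_at"] < cutoff:
            # Closing needs both the global switch and an allowed severity.
            would_close = (
                policy["auto_close"]
                and inc["severity"] in policy["close_allowed_severities"]
            )
            candidates.append(
                {"id": inc["id"], "severity": inc["severity"], "would_close": would_close}
            )
    return candidates


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
sample = [
    {"id": "inc-1", "severity": "P3", "last_alert_at": now - timedelta(minutes=90)},
    {"id": "inc-2", "severity": "P2", "last_alert_at": now - timedelta(minutes=30)},
    {"id": "inc-3", "severity": "P1", "last_alert_at": now - timedelta(minutes=180)},
]

# Default policy: inc-1 and inc-3 are candidates, nothing would be closed.
default_result = find_candidates(sample, now)

# Less aggressive window: only inc-3 has been quiet for more than 120 minutes.
relaxed = dict(AUTO_RESOLVE, no_alerts_minutes_for_candidate=120)
relaxed_result = find_candidates(sample, now, relaxed)

# Auto-close enabled for P3 only: inc-1 would be closed, inc-3 stays a candidate.
p3_close = dict(AUTO_RESOLVE, auto_close=True, close_allowed_severities=["P3"])
p3_result = find_candidates(sample, now, p3_close)
```

Running the same sample through each policy variant gives a quick preview of how many incidents a change would affect, without touching live incident data.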