# Incident Escalation Engine

Deterministic, LLM-free engine that escalates incidents and identifies auto-resolve candidates
based on alert storm behavior.

## Overview

```
alert_triage_graph (every 5 min)
  ├─ process_alerts
  ├─ post_process_escalation  ← incident_escalation_tool.evaluate
  ├─ post_process_autoresolve ← incident_escalation_tool.auto_resolve_candidates
  └─ build_digest             ← includes escalation + candidate summary
```

## Escalation Logic

Config: `config/incident_escalation_policy.yml`

| Trigger | From → To |
|---------|-----------|
| `occurrences_60m ≥ 10` OR `triage_count_24h ≥ 3` | P2 → P1 |
| `occurrences_60m ≥ 25` OR `triage_count_24h ≥ 6` | P1 → P0 |
| Cap: `severity_cap: "P0"` | never exceeds P0 |

When an escalation triggers, the engine appends the following events to the incident:

1. `incident_append_event(type=decision)` — audit trail
2. `incident_append_event(type=followup)` — auto follow-up (if `create_followup_on_escalate: true`)
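
The table and steps above can be sketched as a pure function. This is a minimal illustration, not the engine's actual code: `next_severity` and `escalation_events` are hypothetical names, and the threshold values are copied from the policy shown in this document. It also assumes one escalation step per evaluation (a P2 incident with 30 occurrences moves to P1, not straight to P0), which the source does not specify.

```python
# Illustrative sketch only; function and constant names are hypothetical.
# Threshold values are copied from config/incident_escalation_policy.yml.
OCCURRENCES_THRESHOLDS = {"P2_to_P1": 10, "P1_to_P0": 25}
TRIAGE_THRESHOLDS_24H = {"P2_to_P1": 3, "P1_to_P0": 6}
SEVERITY_CAP = "P0"
CREATE_FOLLOWUP_ON_ESCALATE = True

# Index 0 is the most severe level.
SEVERITY_ORDER = ["P0", "P1", "P2", "P3"]
# Current severity -> (threshold key, target severity).
STEPS = {"P2": ("P2_to_P1", "P1"), "P1": ("P1_to_P0", "P0")}


def next_severity(severity, occurrences_60m, triage_count_24h):
    """Return the severity after at most one escalation step (assumption)."""
    if severity not in STEPS:
        return severity  # P0 and P3 have no escalation rule in the table
    key, target = STEPS[severity]
    triggered = (
        occurrences_60m >= OCCURRENCES_THRESHOLDS[key]
        or triage_count_24h >= TRIAGE_THRESHOLDS_24H[key]
    )
    if not triggered:
        return severity
    # Apply the cap: never move to a level more severe than SEVERITY_CAP.
    if SEVERITY_ORDER.index(target) < SEVERITY_ORDER.index(SEVERITY_CAP):
        target = SEVERITY_CAP
    # Never downgrade: keep the current level if the cap sits below it.
    if SEVERITY_ORDER.index(target) >= SEVERITY_ORDER.index(severity):
        return severity
    return target


def escalation_events(old_severity, new_severity):
    """List the event types appended when the severity changes."""
    if old_severity == new_severity:
        return []
    events = ["decision"]  # audit trail
    if CREATE_FOLLOWUP_ON_ESCALATE:
        events.append("followup")
    return events
```

Because the function depends only on its inputs and the policy values, the same counters always yield the same decision, which is the property the "deterministic, LLM-free" description relies on.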

## Auto-resolve Candidates

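An incident becomes an auto-resolve candidate once it has received no alerts for `no_alerts_minutes_for_candidate` minutes, that is, when `last_alert_at < now - no_alerts_minutes_for_candidate`: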

- `close_allowed_severities: ["P2", "P3"]` — only low-severity auto-closeable
- `auto_close: false` (default) — produces *candidates* only, no auto-close
- Each candidate gets a `note` event appended to the incident timeline
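
The selection rule above can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: the incident dictionaries, their `id`, `severity`, and `last_alert_at` keys, and the `"note"` / `"close"` action labels are hypothetical. It also assumes that `close_allowed_severities` filters the candidate list itself, which the source leaves open.

```python
from datetime import datetime, timedelta, timezone


def auto_resolve_candidates(incidents, now, no_alerts_minutes=60,
                            close_allowed_severities=("P2", "P3"),
                            auto_close=False):
    """Return (incident_id, action) pairs for quiet, low-severity incidents.

    Hypothetical sketch: defaults mirror the policy values in this document.
    """
    cutoff = now - timedelta(minutes=no_alerts_minutes)
    results = []
    for incident in incidents:
        quiet = incident["last_alert_at"] < cutoff
        eligible = incident["severity"] in close_allowed_severities
        if quiet and eligible:
            # With auto_close disabled, the incident only receives a note.
            action = "close" if auto_close else "note"
            results.append((incident["id"], action))
    return results


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
incidents = [
    {"id": "inc-1", "severity": "P2", "last_alert_at": now - timedelta(minutes=90)},
    {"id": "inc-2", "severity": "P2", "last_alert_at": now - timedelta(minutes=30)},
    {"id": "inc-3", "severity": "P1", "last_alert_at": now - timedelta(minutes=90)},
]
print(auto_resolve_candidates(incidents, now))  # [('inc-1', 'note')]
```

Only `inc-1` qualifies: `inc-2` received an alert within the last 60 minutes, and `inc-3` is quiet but its severity is outside the allowed list.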

## Alert-loop SLO

The alert-loop SLO is reported in the `slo` field of `/v1/alerts/dashboard?window_minutes=240`:

```json
"slo": {
  "claim_to_ack_p95_seconds": 12.3,
  "failed_rate_pct": 0.5,
  "processing_stuck_count": 0,
  "violations": []
}
```

Thresholds (from `alert_loop_slo` in policy):
- `claim_to_ack_p95_seconds: 60` — p95 latency from claim to ack
- `failed_rate_pct: 5` — max % failed/(acked+failed)
- `processing_stuck_minutes: 15` — alerts stuck in processing beyond this
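
A rough sketch of how the `slo` block could be checked against these thresholds. The function names, the strict `>` comparisons, and the use of metric names as `violations` entries are assumptions made for illustration; the source does not show the real format of `violations`. The sketch also assumes `processing_stuck_count` already counts only alerts stuck longer than `processing_stuck_minutes`.

```python
# Threshold values copied from the alert_loop_slo section of the policy.
THRESHOLDS = {
    "claim_to_ack_p95_seconds": 60,
    "failed_rate_pct": 5,
}


def failed_rate_pct(acked, failed):
    """Failure rate as a percentage: failed / (acked + failed) * 100."""
    total = acked + failed
    return 0.0 if total == 0 else 100.0 * failed / total


def slo_violations(slo, thresholds=THRESHOLDS):
    """Return the names of breached metrics (hypothetical output format)."""
    violations = []
    if slo["claim_to_ack_p95_seconds"] > thresholds["claim_to_ack_p95_seconds"]:
        violations.append("claim_to_ack_p95_seconds")
    if slo["failed_rate_pct"] > thresholds["failed_rate_pct"]:
        violations.append("failed_rate_pct")
    # Any alert stuck in processing past the limit counts as a violation.
    if slo["processing_stuck_count"] > 0:
        violations.append("processing_stuck_count")
    return violations


healthy = {
    "claim_to_ack_p95_seconds": 12.3,
    "failed_rate_pct": 0.5,
    "processing_stuck_count": 0,
}
print(slo_violations(healthy))  # []
```

The `healthy` values are the ones from the dashboard example above, so the empty result matches the empty `violations` list shown there.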

## RBAC

| Action | Required entitlement |
|--------|---------------------|
| `evaluate` | `tools.oncall.incident_write` (CTO/oncall) |
| `auto_resolve_candidates` | `tools.oncall.incident_write` (CTO/oncall) |

The Monitor agent does NOT have access to either action; it is ingest-only.
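
As a sketch, the table above amounts to a lookup from action to required entitlement. The helper `is_allowed` and the `tools.alerts.ingest` entitlement used for the Monitor agent are hypothetical; only the two action names and `tools.oncall.incident_write` come from this document.

```python
# Action -> required entitlement, as listed in the RBAC table.
REQUIRED_ENTITLEMENT = {
    "evaluate": "tools.oncall.incident_write",
    "auto_resolve_candidates": "tools.oncall.incident_write",
}


def is_allowed(action, entitlements):
    """Return True if the caller holds the entitlement the action requires.

    Unknown actions are denied (an assumption of this sketch).
    """
    required = REQUIRED_ENTITLEMENT.get(action)
    return required is not None and required in entitlements


oncall = {"tools.oncall.incident_write"}
monitor = {"tools.alerts.ingest"}  # hypothetical ingest-only entitlement

print(is_allowed("evaluate", oncall))   # True
print(is_allowed("evaluate", monitor))  # False
```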

## Configuration

```yaml
# config/incident_escalation_policy.yml
escalation:
  occurrences_thresholds:
    P2_to_P1: 10
    P1_to_P0: 25
  triage_thresholds_24h:
    P2_to_P1: 3
    P1_to_P0: 6
  severity_cap: "P0"
  create_followup_on_escalate: true

auto_resolve:
  no_alerts_minutes_for_candidate: 60
  close_allowed_severities: ["P2", "P3"]
  auto_close: false

alert_loop_slo:
  claim_to_ack_p95_seconds: 60
  failed_rate_pct: 5
  processing_stuck_minutes: 15
```

## Tuning

**Too many escalations (noisy)?**
→ Increase `occurrences_thresholds.P2_to_P1` or `triage_thresholds_24h.P2_to_P1`.

**Auto-resolve too aggressive?**
→ Increase `no_alerts_minutes_for_candidate` (e.g., 120 min).

**Ready to enable auto-close for P3?**
→ Set `auto_close: true` and `close_allowed_severities: ["P3"]`.