# Incident Escalation Engine

Deterministic, LLM-free engine that escalates incidents and identifies auto-resolve candidates
based on alert storm behavior.

## Overview

```
alert_triage_graph (every 5 min)
  └─ process_alerts
      └─ post_process_escalation   ← incident_escalation_tool.evaluate
      └─ post_process_autoresolve  ← incident_escalation_tool.auto_resolve_candidates
      └─ build_digest              ← includes escalation + candidate summary
```

## Escalation Logic

Config: `config/incident_escalation_policy.yml`

| Trigger | From → To |
|---------|-----------|
| `occurrences_60m ≥ 10` OR `triage_count_24h ≥ 3` | P2 → P1 |
| `occurrences_60m ≥ 25` OR `triage_count_24h ≥ 6` | P1 → P0 |
| Cap: `severity_cap: "P0"` | never exceeds P0 |

When an escalation triggers, two events are appended to the incident:

1. `incident_append_event(type=decision)`: audit trail
2. `incident_append_event(type=followup)`: auto follow-up (only if `create_followup_on_escalate: true`)

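
The trigger table above can be sketched in a few lines of Python. This is a minimal illustration, not the engine's actual code: `EscalationPolicy`, `STEPS`, and `next_severity` are hypothetical names, and the sketch assumes that one evaluation moves an incident by at most one severity step and that severities without a rule in the table (P3, P0) are left unchanged.

```python
from dataclasses import dataclass, field

# Severities ordered from least to most severe.
SEVERITY_ORDER = ["P3", "P2", "P1", "P0"]

# Current severity -> (threshold key in the policy, target severity).
STEPS = {"P2": ("P2_to_P1", "P1"), "P1": ("P1_to_P0", "P0")}


@dataclass(frozen=True)
class EscalationPolicy:
    # Defaults mirror the values documented for incident_escalation_policy.yml.
    occurrences_thresholds: dict = field(
        default_factory=lambda: {"P2_to_P1": 10, "P1_to_P0": 25}
    )
    triage_thresholds_24h: dict = field(
        default_factory=lambda: {"P2_to_P1": 3, "P1_to_P0": 6}
    )
    severity_cap: str = "P0"


def next_severity(current, occurrences_60m, triage_count_24h, policy):
    """Return the severity after at most one escalation step (assumption)."""
    step = STEPS.get(current)
    if step is None:
        return current  # no rule for this severity in the table
    key, target = step
    triggered = (
        occurrences_60m >= policy.occurrences_thresholds[key]
        or triage_count_24h >= policy.triage_thresholds_24h[key]
    )
    if not triggered:
        return current
    # Respect severity_cap: never escalate past the configured ceiling.
    if SEVERITY_ORDER.index(target) > SEVERITY_ORDER.index(policy.severity_cap):
        return current
    return target


policy = EscalationPolicy()
print(next_severity("P2", occurrences_60m=12, triage_count_24h=1, policy=policy))  # P1
```

Because the function is pure (no clock, no I/O, no model call), the same inputs always give the same result, which is the property the "deterministic, LLM-free" description relies on.
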
## Auto-resolve Candidates

An incident becomes a candidate when `last_alert_at < now - no_alerts_minutes_for_candidate`, that is, when it has received no alerts for that many minutes:

- `close_allowed_severities: ["P2", "P3"]`: only low-severity incidents are eligible for auto-close
- `auto_close: false` (default): produces *candidates* only, no auto-close
- Each candidate gets a `note` event appended to the incident timeline

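
A minimal sketch of the candidate rule above, in Python. The function name `find_quiet_incidents` and the incident dictionary fields `id` and `severity` are assumptions made for illustration; the sketch also assumes that `close_allowed_severities` only gates the close decision, not whether an incident is listed as a candidate.

```python
from datetime import datetime, timedelta, timezone


def find_quiet_incidents(incidents, now, no_alerts_minutes=60,
                         close_allowed=("P2", "P3"), auto_close=False):
    """List incidents with no alerts since the cutoff (hypothetical helper)."""
    cutoff = now - timedelta(minutes=no_alerts_minutes)
    candidates = []
    for inc in incidents:
        if inc["last_alert_at"] < cutoff:
            candidates.append({
                "id": inc["id"],
                "severity": inc["severity"],
                # Close only when auto_close is on AND the severity is allowed.
                "close": auto_close and inc["severity"] in close_allowed,
            })
    return candidates


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
incidents = [
    {"id": "a", "severity": "P3", "last_alert_at": now - timedelta(minutes=90)},
    {"id": "b", "severity": "P2", "last_alert_at": now - timedelta(minutes=30)},
    {"id": "c", "severity": "P1", "last_alert_at": now - timedelta(minutes=120)},
]
quiet = find_quiet_incidents(incidents, now)
print([c["id"] for c in quiet])  # ['a', 'c']
```

With the default `auto_close=False`, every candidate has `close` set to `False`, matching the candidates-only behavior described in the list.
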
## Alert-loop SLO

Tracked in `/v1/alerts/dashboard?window_minutes=240`:

```json
"slo": {
  "claim_to_ack_p95_seconds": 12.3,
  "failed_rate_pct": 0.5,
  "processing_stuck_count": 0,
  "violations": []
}
```

Thresholds (from `alert_loop_slo` in policy):
- `claim_to_ack_p95_seconds: 60`: maximum p95 latency from claim to ack
- `failed_rate_pct: 5`: maximum failure rate in percent, computed as failed / (acked + failed)
- `processing_stuck_minutes: 15`: alerts in processing longer than this count as stuck

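
The following Python sketch shows one way the `violations` list could be derived from the measured values and the thresholds. It is an assumption-laden illustration: the function names and the violation labels are hypothetical, and it assumes that any non-zero `processing_stuck_count` counts as a violation, since the source does not state how that field is judged.

```python
def failed_rate_pct(acked, failed):
    """Failure rate in percent: failed / (acked + failed)."""
    total = acked + failed
    return 0.0 if total == 0 else 100.0 * failed / total


def slo_violations(slo, thresholds):
    """Return the names of the SLO fields that exceed their thresholds."""
    violations = []
    if slo["claim_to_ack_p95_seconds"] > thresholds["claim_to_ack_p95_seconds"]:
        violations.append("claim_to_ack_p95_seconds")
    if slo["failed_rate_pct"] > thresholds["failed_rate_pct"]:
        violations.append("failed_rate_pct")
    # Assumption: the count already reflects processing_stuck_minutes,
    # so any stuck alert is reported as a violation.
    if slo["processing_stuck_count"] > 0:
        violations.append("processing_stuck_count")
    return violations


thresholds = {"claim_to_ack_p95_seconds": 60, "failed_rate_pct": 5}
healthy = {
    "claim_to_ack_p95_seconds": 12.3,
    "failed_rate_pct": failed_rate_pct(acked=199, failed=1),
    "processing_stuck_count": 0,
}
print(slo_violations(healthy, thresholds))  # []
```
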
## RBAC

| Action | Required entitlement |
|--------|---------------------|
| `evaluate` | `tools.oncall.incident_write` (CTO/oncall) |
| `auto_resolve_candidates` | `tools.oncall.incident_write` (CTO/oncall) |

The Monitor agent does NOT have access (it is ingest-only).

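
As an illustration of the table above, an entitlement check might look like the following Python sketch. The `REQUIRED_ENTITLEMENT` mapping mirrors the table; the function `is_allowed` and the monitor entitlement string `tools.alerts.ingest` are hypothetical and are not taken from the actual RBAC matrix.

```python
# Action -> entitlement required to call it (from the RBAC table).
REQUIRED_ENTITLEMENT = {
    "evaluate": "tools.oncall.incident_write",
    "auto_resolve_candidates": "tools.oncall.incident_write",
}


def is_allowed(action, entitlements):
    """Deny by default: unknown actions and missing entitlements both fail."""
    required = REQUIRED_ENTITLEMENT.get(action)
    return required is not None and required in entitlements


oncall = {"tools.oncall.incident_write"}
monitor = {"tools.alerts.ingest"}  # hypothetical ingest-only entitlement

print(is_allowed("evaluate", oncall))   # True
print(is_allowed("evaluate", monitor))  # False
```
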
## Configuration

```yaml
# config/incident_escalation_policy.yml
escalation:
  occurrences_thresholds:
    P2_to_P1: 10
    P1_to_P0: 25
  triage_thresholds_24h:
    P2_to_P1: 3
    P1_to_P0: 6
  severity_cap: "P0"
  create_followup_on_escalate: true

auto_resolve:
  no_alerts_minutes_for_candidate: 60
  close_allowed_severities: ["P2", "P3"]
  auto_close: false

alert_loop_slo:
  claim_to_ack_p95_seconds: 60
  failed_rate_pct: 5
  processing_stuck_minutes: 15
```

## Tuning

**Too many escalations (noisy)?**
→ Increase `occurrences_thresholds.P2_to_P1` or `triage_thresholds_24h.P2_to_P1`.

**Auto-resolve too aggressive?**
→ Increase `no_alerts_minutes_for_candidate` (e.g., 120 min).

**Ready to enable auto-close for P3?**
→ Set `auto_close: true` and `close_allowed_severities: ["P3"]`.