# Incident Escalation Engine

A deterministic, LLM-free engine that escalates incidents and flags auto-resolve candidates based on alert-storm behavior.
## Overview

```
alert_triage_graph (every 5 min)
└─ process_alerts
   └─ post_process_escalation   ← incident_escalation_tool.evaluate
   └─ post_process_autoresolve  ← incident_escalation_tool.auto_resolve_candidates
   └─ build_digest              ← includes escalation + candidate summary
```
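The pipeline above can be sketched as a plain function chain. The stub bodies below are hypothetical placeholders for illustration only; the real `alert_triage_graph` nodes are not shown in this document and may be wired differently:

```python
# Hypothetical sketch of the 5-minute triage loop. Node names mirror the
# diagram above; the stub logic is illustrative, not the real implementation.

def process_alerts(alerts):
    # Placeholder: deduplicate raw alerts by fingerprint.
    return {a["fingerprint"]: a for a in alerts}

def post_process_escalation(incidents):
    # Placeholder for incident_escalation_tool.evaluate over open incidents.
    return [i for i in incidents if i.get("occurrences_60m", 0) >= 10]

def post_process_autoresolve(incidents):
    # Placeholder for incident_escalation_tool.auto_resolve_candidates.
    return [i for i in incidents if i.get("quiet", False)]

def build_digest(triaged, escalations, candidates):
    # The digest summarizes escalations and auto-resolve candidates.
    return {"alerts": len(triaged),
            "escalated": len(escalations),
            "auto_resolve_candidates": len(candidates)}

def alert_triage_graph(alerts, incidents):
    triaged = process_alerts(alerts)
    escalations = post_process_escalation(incidents)
    candidates = post_process_autoresolve(incidents)
    return build_digest(triaged, escalations, candidates)
```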
## Escalation Logic

Config: `config/incident_escalation_policy.yml`

| Trigger | From → To |
|---|---|
| `occurrences_60m ≥ 10` OR `triage_count_24h ≥ 3` | P2 → P1 |
| `occurrences_60m ≥ 25` OR `triage_count_24h ≥ 6` | P1 → P0 |
| Cap: `severity_cap: "P0"` | never exceeds P0 |
When an escalation triggers:

- `incident_append_event(type=decision)` — audit trail
- `incident_append_event(type=followup)` — auto follow-up (if `create_followup_on_escalate: true`)
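The threshold rules can be expressed as a pure function over the counters. This is a minimal sketch under the policy values shown above, with hypothetical names; it is not the actual `incident_escalation_tool.evaluate` code:

```python
# Hypothetical sketch of the deterministic escalation check.
# Threshold constants mirror config/incident_escalation_policy.yml.
from typing import Optional

OCCURRENCES_60M = {"P2_to_P1": 10, "P1_to_P0": 25}
TRIAGE_24H = {"P2_to_P1": 3, "P1_to_P0": 6}

def evaluate(severity: str, occurrences_60m: int, triage_count_24h: int) -> Optional[str]:
    """Return the escalated severity if a trigger fires, else None.

    P0 incidents never match a rule, which enforces severity_cap: "P0".
    """
    if severity == "P2" and (occurrences_60m >= OCCURRENCES_60M["P2_to_P1"]
                             or triage_count_24h >= TRIAGE_24H["P2_to_P1"]):
        return "P1"
    if severity == "P1" and (occurrences_60m >= OCCURRENCES_60M["P1_to_P0"]
                             or triage_count_24h >= TRIAGE_24H["P1_to_P0"]):
        return "P0"
    return None

print(evaluate("P2", occurrences_60m=12, triage_count_24h=0))  # -> P1
```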
## Auto-resolve Candidates

Incidents where `last_alert_at < now - no_alerts_minutes_for_candidate`:

- `close_allowed_severities: ["P2", "P3"]` — only low-severity incidents are auto-closeable
- `auto_close: false` (default) — produces candidates only; no auto-close
- Each candidate gets a `note` event appended to the incident timeline
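The candidate filter is a quiet-window check gated by severity. A minimal sketch, assuming incidents are dicts with `severity` and a timezone-aware `last_alert_at`; the function name and shapes are illustrative:

```python
# Hypothetical sketch of auto-resolve candidate selection.
# Defaults mirror the documented policy values.
from datetime import datetime, timedelta, timezone

NO_ALERTS_MINUTES = 60
CLOSE_ALLOWED = {"P2", "P3"}

def auto_resolve_candidates(incidents, now=None):
    """Return incidents quiet for the window and low-severity enough to close."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(minutes=NO_ALERTS_MINUTES)
    return [inc for inc in incidents
            if inc["severity"] in CLOSE_ALLOWED and inc["last_alert_at"] < cutoff]
```

With `auto_close: false`, a caller would only append a `note` event for each returned incident rather than closing it.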
## Alert-loop SLO

Tracked in `/v1/alerts/dashboard?window_minutes=240`:

```json
"slo": {
  "claim_to_ack_p95_seconds": 12.3,
  "failed_rate_pct": 0.5,
  "processing_stuck_count": 0,
  "violations": []
}
```
Thresholds (from `alert_loop_slo` in the policy):

- `claim_to_ack_p95_seconds: 60` — p95 latency from claim to ack
- `failed_rate_pct: 5` — max failed as a % of (acked + failed)
- `processing_stuck_minutes: 15` — alerts stuck in processing beyond this
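One plausible way the dashboard's `violations` list could be derived from these thresholds — a hypothetical sketch whose field names mirror the JSON above, not the actual dashboard code:

```python
# Hypothetical SLO check against the documented alert_loop_slo thresholds.
THRESHOLDS = {
    "claim_to_ack_p95_seconds": 60,
    "failed_rate_pct": 5,
}

def slo_violations(slo: dict) -> list:
    """Return the names of SLO dimensions currently in violation."""
    violations = []
    if slo["claim_to_ack_p95_seconds"] > THRESHOLDS["claim_to_ack_p95_seconds"]:
        violations.append("claim_to_ack_p95_seconds")
    if slo["failed_rate_pct"] > THRESHOLDS["failed_rate_pct"]:
        violations.append("failed_rate_pct")
    if slo["processing_stuck_count"] > 0:  # any alert stuck past the cutoff
        violations.append("processing_stuck")
    return violations

print(slo_violations({"claim_to_ack_p95_seconds": 12.3,
                      "failed_rate_pct": 0.5,
                      "processing_stuck_count": 0}))  # -> []
```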
## RBAC

| Action | Required entitlement |
|---|---|
| `evaluate` | `tools.oncall.incident_write` (CTO/oncall) |
| `auto_resolve_candidates` | `tools.oncall.incident_write` (CTO/oncall) |

The monitor agent does NOT have access (ingest-only).
## Configuration

```yaml
# config/incident_escalation_policy.yml
escalation:
  occurrences_thresholds:
    P2_to_P1: 10
    P1_to_P0: 25
  triage_thresholds_24h:
    P2_to_P1: 3
    P1_to_P0: 6
  severity_cap: "P0"
  create_followup_on_escalate: true
auto_resolve:
  no_alerts_minutes_for_candidate: 60
  close_allowed_severities: ["P2", "P3"]
  auto_close: false
alert_loop_slo:
  claim_to_ack_p95_seconds: 60
  failed_rate_pct: 5
  processing_stuck_minutes: 15
```
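A consumer of this policy can read it with any YAML parser. A minimal sketch, assuming PyYAML is available (the engine's actual config loader is not shown here); the snippet parses an inline excerpt of the policy:

```python
# Sketch: reading escalation thresholds from the policy YAML with PyYAML.
import yaml

POLICY_YAML = """
escalation:
  occurrences_thresholds:
    P2_to_P1: 10
    P1_to_P0: 25
  severity_cap: "P0"
auto_resolve:
  auto_close: false
"""

policy = yaml.safe_load(POLICY_YAML)
print(policy["escalation"]["occurrences_thresholds"]["P2_to_P1"])  # -> 10
```

In production the same call would read `config/incident_escalation_policy.yml` from disk instead of an inline string.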
## Tuning

**Too many escalations (noisy)?**
→ Increase `occurrences_thresholds.P2_to_P1` or `triage_thresholds_24h.P2_to_P1`.

**Auto-resolve too aggressive?**
→ Increase `no_alerts_minutes_for_candidate` (e.g., to 120 minutes).

**Ready to enable auto-close for P3?**
→ Set `auto_close: true` and `close_allowed_severities: ["P3"]`.