# Incident Escalation Engine

A deterministic, LLM-free engine that escalates incidents and flags auto-resolve candidates based on alert-storm behavior.
## Overview

```
alert_triage_graph (every 5 min)
└─ process_alerts
   └─ post_process_escalation   ← incident_escalation_tool.evaluate
   └─ post_process_autoresolve  ← incident_escalation_tool.auto_resolve_candidates
   └─ build_digest              ← includes escalation + candidate summary
```
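The pipeline above can be sketched as a plain function chain. The stub bodies below are hypothetical placeholders for illustration only; the real `alert_triage_graph` nodes are not shown in this document and may be wired differently:

```python
# Hypothetical sketch of the 5-minute triage loop. Node names mirror the
# diagram above; the stub logic is illustrative, not the real implementation.

def process_alerts(alerts):
    # Placeholder: deduplicate raw alerts by fingerprint.
    return {a["fingerprint"]: a for a in alerts}

def post_process_escalation(incidents):
    # Placeholder for incident_escalation_tool.evaluate over open incidents.
    return [i for i in incidents if i.get("occurrences_60m", 0) >= 10]

def post_process_autoresolve(incidents):
    # Placeholder for incident_escalation_tool.auto_resolve_candidates.
    return [i for i in incidents if i.get("quiet", False)]

def build_digest(triaged, escalations, candidates):
    # The digest summarizes escalations and auto-resolve candidates.
    return {"alerts": len(triaged),
            "escalated": len(escalations),
            "auto_resolve_candidates": len(candidates)}

def alert_triage_graph(alerts, incidents):
    triaged = process_alerts(alerts)
    escalations = post_process_escalation(incidents)
    candidates = post_process_autoresolve(incidents)
    return build_digest(triaged, escalations, candidates)
```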
## Escalation Logic

Config: `config/incident_escalation_policy.yml`

| Trigger | From → To |
|---|---|
| `occurrences_60m ≥ 10` OR `triage_count_24h ≥ 3` | P2 → P1 |
| `occurrences_60m ≥ 25` OR `triage_count_24h ≥ 6` | P1 → P0 |
| Cap: `severity_cap: "P0"` | never exceeds P0 |
When an escalation triggers:

- `incident_append_event(type=decision)` — audit trail
- `incident_append_event(type=followup)` — auto follow-up (if `create_followup_on_escalate: true`)
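The threshold rules can be expressed as a pure function over the counters. This is a minimal sketch under the policy values shown above, with hypothetical names; it is not the actual `incident_escalation_tool.evaluate` code:

```python
# Hypothetical sketch of the deterministic escalation check.
# Threshold constants mirror config/incident_escalation_policy.yml.
from typing import Optional

OCCURRENCES_60M = {"P2_to_P1": 10, "P1_to_P0": 25}
TRIAGE_24H = {"P2_to_P1": 3, "P1_to_P0": 6}

def evaluate(severity: str, occurrences_60m: int, triage_count_24h: int) -> Optional[str]:
    """Return the escalated severity if a trigger fires, else None.

    P0 incidents never match a rule, which enforces severity_cap: "P0".
    """
    if severity == "P2" and (occurrences_60m >= OCCURRENCES_60M["P2_to_P1"]
                             or triage_count_24h >= TRIAGE_24H["P2_to_P1"]):
        return "P1"
    if severity == "P1" and (occurrences_60m >= OCCURRENCES_60M["P1_to_P0"]
                             or triage_count_24h >= TRIAGE_24H["P1_to_P0"]):
        return "P0"
    return None

print(evaluate("P2", occurrences_60m=12, triage_count_24h=0))  # -> P1
```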
## Auto-resolve Candidates

Incidents where `last_alert_at < now - no_alerts_minutes_for_candidate`:

- `close_allowed_severities: ["P2", "P3"]` — only low-severity incidents are auto-closeable
- `auto_close: false` (default) — produces candidates only; no auto-close
- Each candidate gets a `note` event appended to the incident timeline
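The candidate filter is a quiet-window check gated by severity. A minimal sketch, assuming incidents are dicts with `severity` and a timezone-aware `last_alert_at`; the function name and shapes are illustrative:

```python
# Hypothetical sketch of auto-resolve candidate selection.
# Defaults mirror the documented policy values.
from datetime import datetime, timedelta, timezone

NO_ALERTS_MINUTES = 60
CLOSE_ALLOWED = {"P2", "P3"}

def auto_resolve_candidates(incidents, now=None):
    """Return incidents quiet for the window and low-severity enough to close."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(minutes=NO_ALERTS_MINUTES)
    return [inc for inc in incidents
            if inc["severity"] in CLOSE_ALLOWED and inc["last_alert_at"] < cutoff]
```

With `auto_close: false`, a caller would only append a `note` event for each returned incident rather than closing it.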
## Alert-loop SLO

Tracked in `/v1/alerts/dashboard?window_minutes=240`:

```json
"slo": {
  "claim_to_ack_p95_seconds": 12.3,
  "failed_rate_pct": 0.5,
  "processing_stuck_count": 0,
  "violations": []
}
```
Thresholds (from `alert_loop_slo` in the policy):

- `claim_to_ack_p95_seconds: 60` — p95 latency from claim to ack
- `failed_rate_pct: 5` — max failed as a % of (acked + failed)
- `processing_stuck_minutes: 15` — alerts stuck in processing beyond this
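One plausible way the dashboard's `violations` list could be derived from these thresholds — a hypothetical sketch whose field names mirror the JSON above, not the actual dashboard code:

```python
# Hypothetical SLO check against the documented alert_loop_slo thresholds.
THRESHOLDS = {
    "claim_to_ack_p95_seconds": 60,
    "failed_rate_pct": 5,
}

def slo_violations(slo: dict) -> list:
    """Return the names of SLO dimensions currently in violation."""
    violations = []
    if slo["claim_to_ack_p95_seconds"] > THRESHOLDS["claim_to_ack_p95_seconds"]:
        violations.append("claim_to_ack_p95_seconds")
    if slo["failed_rate_pct"] > THRESHOLDS["failed_rate_pct"]:
        violations.append("failed_rate_pct")
    if slo["processing_stuck_count"] > 0:  # any alert stuck past the cutoff
        violations.append("processing_stuck")
    return violations

print(slo_violations({"claim_to_ack_p95_seconds": 12.3,
                      "failed_rate_pct": 0.5,
                      "processing_stuck_count": 0}))  # -> []
```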
## RBAC

| Action | Required entitlement |
|---|---|
| `evaluate` | `tools.oncall.incident_write` (CTO/oncall) |
| `auto_resolve_candidates` | `tools.oncall.incident_write` (CTO/oncall) |

The monitor agent does NOT have access (ingest-only).
## Configuration

```yaml
# config/incident_escalation_policy.yml
escalation:
  occurrences_thresholds:
    P2_to_P1: 10
    P1_to_P0: 25
  triage_thresholds_24h:
    P2_to_P1: 3
    P1_to_P0: 6
  severity_cap: "P0"
  create_followup_on_escalate: true
auto_resolve:
  no_alerts_minutes_for_candidate: 60
  close_allowed_severities: ["P2", "P3"]
  auto_close: false
alert_loop_slo:
  claim_to_ack_p95_seconds: 60
  failed_rate_pct: 5
  processing_stuck_minutes: 15
```
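A consumer of this policy can read it with any YAML parser. A minimal sketch, assuming PyYAML is available (the engine's actual config loader is not shown here); the snippet parses an inline excerpt of the policy:

```python
# Sketch: reading escalation thresholds from the policy YAML with PyYAML.
import yaml

POLICY_YAML = """
escalation:
  occurrences_thresholds:
    P2_to_P1: 10
    P1_to_P0: 25
  severity_cap: "P0"
auto_resolve:
  auto_close: false
"""

policy = yaml.safe_load(POLICY_YAML)
print(policy["escalation"]["occurrences_thresholds"]["P2_to_P1"])  # -> 10
```

In production the same call would read `config/incident_escalation_policy.yml` from disk instead of an inline string.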
## Tuning

**Too many escalations (noisy)?**
→ Increase `occurrences_thresholds.P2_to_P1` or `triage_thresholds_24h.P2_to_P1`.

**Auto-resolve too aggressive?**
→ Increase `no_alerts_minutes_for_candidate` (e.g., to 120 minutes).

**Ready to enable auto-close for P3?**
→ Set `auto_close: true` and `close_allowed_severities: ["P3"]`.