Alert → Incident Bridge
Overview
The Alert Bridge provides a governed, deduplicated pipeline from Monitor/Prometheus detection to Incident creation.
Security model: Monitor sends alerts (tools.alerts.ingest only). Sofiia/oncall create incidents (tools.oncall.incident_write + tools.alerts.ack). No agent gets both roles automatically.
Monitor@nodeX ──ingest──► AlertStore ──alert_to_incident──► IncidentStore
 (tools.alerts.ingest)                (tools.oncall.incident_write)
                                                                 │
                                                   IncidentTriage (Sofiia NODA2)
                                                                 │
                                                          PostmortemDraft
AlertEvent Schema
{
"source": "monitor@node1",
"service": "gateway",
"env": "prod",
"severity": "P1",
"kind": "slo_breach",
"title": "gateway SLO: latency p95 > 300ms",
"summary": "p95 latency at 450ms, error_rate 2.5%",
"started_at": "2025-01-23T09:00:00Z",
"labels": {
"node": "node1",
"fingerprint": "gateway:slo_breach:latency"
},
"metrics": {
"latency_p95_ms": 450,
"error_rate_pct": 2.5
},
"evidence": {
"log_samples": ["ERROR timeout after 30s", "WARN retry 3/3"],
"query": "rate(http_errors_total[5m])"
}
}
Severity values
P0, P1, P2, P3, INFO
Kind values
slo_breach, crashloop, latency, error_rate, disk, oom, deploy, security, custom
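A minimal sketch of how a producer might check an AlertEvent against the severity and kind values above before sending it. The `validate_alert` helper and the `REQUIRED_FIELDS` tuple are hypothetical; the source does not state which fields are mandatory.

```python
SEVERITIES = {"P0", "P1", "P2", "P3", "INFO"}
KINDS = {"slo_breach", "crashloop", "latency", "error_rate", "disk",
         "oom", "deploy", "security", "custom"}
# Assumption: these fields are treated as mandatory for illustration only.
REQUIRED_FIELDS = ("source", "service", "env", "severity", "kind", "title")


def validate_alert(alert):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for field in REQUIRED_FIELDS:
        if not alert.get(field):
            errors.append(f"missing field: {field}")
    severity = alert.get("severity")
    if severity and severity not in SEVERITIES:
        errors.append(f"unknown severity: {severity}")
    kind = alert.get("kind")
    if kind and kind not in KINDS:
        errors.append(f"unknown kind: {kind}")
    return errors


good = {"source": "monitor@node1", "service": "gateway", "env": "prod",
        "severity": "P1", "kind": "slo_breach",
        "title": "gateway SLO: latency p95 > 300ms"}
bad = dict(good, severity="P9", kind="meltdown")
good_errors = validate_alert(good)
bad_errors = validate_alert(bad)
```

Returning a list of problems rather than raising on the first one lets a monitor log every issue with a rejected alert in one pass.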
Dedupe Behavior
Dedupe key = sha256(service|env|kind|fingerprint).
- Same key within TTL (default 30 min) → deduped=true, occurrences++, no new record
- Same key after TTL → new alert record
- Different fingerprint → separate record
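The dedupe rules above can be sketched as follows. This is an illustration, not the project's AlertStore: `SketchAlertStore` is a hypothetical in-memory class, and it assumes UTF-8 input, hex-encoded output, and a TTL window measured from the time the record was created.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def dedupe_key(service, env, kind, fingerprint):
    # sha256(service|env|kind|fingerprint); UTF-8 and hex output are assumptions.
    raw = "|".join([service, env, kind, fingerprint])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class SketchAlertStore:
    """Hypothetical in-memory store used only to illustrate the TTL rule."""

    def __init__(self):
        self.records = []   # every alert record ever created
        self._latest = {}   # dedupe key -> most recent record for that key

    def ingest(self, alert, now, dedupe_ttl_minutes=30):
        key = dedupe_key(alert["service"], alert["env"], alert["kind"],
                         alert["labels"]["fingerprint"])
        record = self._latest.get(key)
        ttl = timedelta(minutes=dedupe_ttl_minutes)
        # Assumption: the TTL window starts when the record was created.
        if record is not None and now - record["created_at"] < ttl:
            record["occurrences"] += 1
            deduped = True
        else:
            record = {"dedupe_key": key, "created_at": now, "occurrences": 1}
            self.records.append(record)
            self._latest[key] = record
            deduped = False
        return {"accepted": True, "deduped": deduped, "dedupe_key": key,
                "occurrences": record["occurrences"]}


store = SketchAlertStore()
t0 = datetime(2025, 1, 23, 9, 0, tzinfo=timezone.utc)
alert = {"service": "gateway", "env": "prod", "kind": "slo_breach",
         "labels": {"fingerprint": "gateway:slo_breach:latency"}}
first = store.ingest(alert, t0)                           # new record
repeat = store.ingest(alert, t0 + timedelta(minutes=10))  # inside TTL
later = store.ingest(alert, t0 + timedelta(minutes=31))   # after TTL
```

Passing `now` explicitly keeps the sketch deterministic; a real store would likely read the clock itself.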
alert_ingest_tool API
ingest (Monitor role)
{
"action": "ingest",
"alert": { ...AlertEvent... },
"dedupe_ttl_minutes": 30
}
Response:
{
"accepted": true,
"deduped": false,
"dedupe_key": "abc123...",
"alert_ref": "alrt_20250123_090000_a1b2c3",
"occurrences": 1
}
list (read)
{ "action": "list", "service": "gateway", "env": "prod", "window_minutes": 240, "limit": 50 }
get (read)
{ "action": "get", "alert_ref": "alrt_..." }
ack (oncall/cto)
{ "action": "ack", "alert_ref": "alrt_...", "actor": "sofiia", "note": "false positive" }
oncall_tool.alert_to_incident
Converts a stored alert into an incident (or attaches to an existing open one).
{
"action": "alert_to_incident",
"alert_ref": "alrt_...",
"incident_severity_cap": "P1",
"dedupe_window_minutes": 60,
"attach_artifact": true
}
Response:
{
"incident_id": "inc_20250123_090000_xyz",
"created": true,
"severity": "P1",
"artifact_path": "ops/incidents/inc_.../alert_alrt_....json",
"note": "Incident created and alert acked"
}
Logic
- Load alert from AlertStore
- Check for existing open P0/P1 incident for same service/env within dedupe_window_minutes
  - If found → attach event to existing incident, ack alert
  - If not found → create incident, append note + metric timeline events, optionally attach masked alert JSON as artifact, ack alert
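The steps above can be sketched roughly as below. The incident dictionaries, the `inc_sketch_N` ids, and the `acked` flag are hypothetical stand-ins for the real stores; artifact attachment and masking are omitted. The sketch also assumes that incident_severity_cap is the most severe level a newly created incident may take, which the source does not spell out.

```python
from datetime import datetime, timedelta, timezone

SEVERITY_RANK = {"P0": 0, "P1": 1, "P2": 2, "P3": 3, "INFO": 4}  # lower = more severe


def cap_severity(severity, cap):
    # Assumption: anything more severe than the cap is lowered to the cap.
    return severity if SEVERITY_RANK[severity] >= SEVERITY_RANK[cap] else cap


def alert_to_incident(alert, incidents, now, incident_severity_cap="P1",
                      dedupe_window_minutes=60):
    window = timedelta(minutes=dedupe_window_minutes)
    for inc in incidents:
        same_target = (inc["service"], inc["env"]) == (alert["service"], alert["env"])
        if (inc["status"] == "open" and inc["severity"] in ("P0", "P1")
                and same_target and now - inc["started_at"] <= window):
            # Found: attach the alert as an event on the open incident and ack it.
            inc["events"].append({"type": "note", "text": alert["title"]})
            alert["acked"] = True
            return {"incident_id": inc["id"], "created": False,
                    "severity": inc["severity"]}
    # Not found: create an incident with note + metric timeline events, then ack.
    incident = {
        "id": f"inc_sketch_{len(incidents) + 1}",  # hypothetical id scheme
        "service": alert["service"], "env": alert["env"],
        "severity": cap_severity(alert["severity"], incident_severity_cap),
        "status": "open", "started_at": now,
        "events": [{"type": "note", "text": alert["title"]},
                   {"type": "metric", "data": alert.get("metrics", {})}],
    }
    incidents.append(incident)
    alert["acked"] = True
    return {"incident_id": incident["id"], "created": True,
            "severity": incident["severity"]}


incidents = []
now = datetime(2025, 1, 23, 9, 0, tzinfo=timezone.utc)
a1 = {"service": "gateway", "env": "prod", "severity": "P0",
      "title": "gateway SLO: latency p95 > 300ms", "metrics": {"latency_p95_ms": 450}}
a2 = dict(a1, title="gateway error_rate 2.5%")
created = alert_to_incident(a1, incidents, now)
attached = alert_to_incident(a2, incidents, now + timedelta(minutes=5))
```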
RBAC
| Role | ingest | list/get | ack | alert_to_incident |
|---|---|---|---|---|
| agent_monitor | ✅ | ❌ | ❌ | ❌ |
| agent_cto | ✅ | ✅ | ✅ | ✅ |
| agent_oncall | ❌ | ✅ | ✅ | ✅ |
| agent_interface | ❌ | ✅ | ❌ | ❌ |
| agent_default | ❌ | ❌ | ❌ | ❌ |
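For illustration, the matrix above can be expressed as a lookup table. The `is_allowed` helper is hypothetical, and denying unknown roles is an assumption rather than documented behavior.

```python
# Role -> allowed actions, transcribed from the RBAC table.
RBAC = {
    "agent_monitor": {"ingest"},
    "agent_cto": {"ingest", "list", "get", "ack", "alert_to_incident"},
    "agent_oncall": {"list", "get", "ack", "alert_to_incident"},
    "agent_interface": {"list", "get"},
    "agent_default": set(),
}


def is_allowed(role, action):
    # Assumption: roles missing from the matrix are denied everything.
    return action in RBAC.get(role, set())


monitor_can_ingest = is_allowed("agent_monitor", "ingest")
monitor_can_ack = is_allowed("agent_monitor", "ack")
oncall_can_convert = is_allowed("agent_oncall", "alert_to_incident")
```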
SLO Watch Gate
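The slo_watch gate in release_check checks for active SLO breaches before a deploy. Whether a breach blocks the deploy or only produces recommendations depends on the profile's mode: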
| Profile | Mode | Behavior |
|---|---|---|
| dev | warn | Recommendations only |
| staging | strict | Blocks on any violation |
| prod | warn | Recommendations only |
Configure in config/release_gate_policy.yml per profile. Override per run with run_slo_watch: false.
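A rough sketch of the gate decision under the profile modes listed above. The `slo_watch_gate` function and its result shape are hypothetical; only the profile names, modes, and the run_slo_watch override come from this page.

```python
# Profile -> mode, transcribed from the table above.
PROFILE_MODES = {"dev": "warn", "staging": "strict", "prod": "warn"}


def slo_watch_gate(profile, violations, run_slo_watch=True):
    """Return a hypothetical gate result for a list of violation descriptions."""
    if not run_slo_watch:
        # Per-run override: the gate is not evaluated at all.
        return {"status": "skipped", "blocking": False, "recommendations": []}
    if not violations:
        return {"status": "pass", "blocking": False, "recommendations": []}
    recommendations = [f"Resolve before deploying: {v}" for v in violations]
    blocking = PROFILE_MODES[profile] == "strict"
    return {"status": "fail" if blocking else "warn",
            "blocking": blocking,
            "recommendations": recommendations}


breach = ["gateway latency p95 > 300ms"]
staging_result = slo_watch_gate("staging", breach)
prod_result = slo_watch_gate("prod", breach)
skipped_result = slo_watch_gate("staging", breach, run_slo_watch=False)
```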
Backends
| Env var | Value | Effect |
|---|---|---|
| ALERT_BACKEND | memory (default) | In-process, not persistent |
| ALERT_BACKEND | postgres | Persistent, needs DATABASE_URL |
| ALERT_BACKEND | auto | Postgres if DATABASE_URL set, else memory |
Run DDL: python3 ops/scripts/migrate_alerts_postgres.py
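As an illustration of the backend table, selection could be resolved along these lines. The `resolve_alert_backend` function is hypothetical, and raising an error when postgres is requested without DATABASE_URL is an assumption; the source only says that postgres needs it.

```python
import os


def resolve_alert_backend(env=None):
    """Return "memory" or "postgres" based on ALERT_BACKEND and DATABASE_URL."""
    env = os.environ if env is None else env
    backend = env.get("ALERT_BACKEND", "memory")  # memory is the documented default
    has_db = bool(env.get("DATABASE_URL"))
    if backend == "auto":
        return "postgres" if has_db else "memory"
    if backend == "postgres":
        if not has_db:
            # Assumption: fail fast rather than silently fall back to memory.
            raise RuntimeError("ALERT_BACKEND=postgres requires DATABASE_URL")
        return "postgres"
    if backend == "memory":
        return "memory"
    raise ValueError(f"unknown ALERT_BACKEND: {backend}")


default_backend = resolve_alert_backend({})
auto_without_db = resolve_alert_backend({"ALERT_BACKEND": "auto"})
auto_with_db = resolve_alert_backend(
    {"ALERT_BACKEND": "auto", "DATABASE_URL": "postgresql://localhost/alerts"})
```

Accepting an explicit mapping makes the function testable without touching the process environment.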