# Alert → Incident Bridge

## Overview

The Alert Bridge provides a governed, deduplicated pipeline from Monitor/Prometheus detection to Incident creation.

**Security model:** Monitor sends alerts (`tools.alerts.ingest` only). Sofiia/oncall create incidents (`tools.oncall.incident_write` + `tools.alerts.ack`). No agent gets both roles automatically.

```
Monitor@nodeX ──ingest──► AlertStore ──alert_to_incident──► IncidentStore
(tools.alerts.ingest)                (tools.oncall.incident_write)
                                                                  │
                                                    IncidentTriage (Sofiia NODA2)
                                                                  │
                                                           PostmortemDraft
```

## AlertEvent Schema

```json
{
  "source": "monitor@node1",
  "service": "gateway",
  "env": "prod",
  "severity": "P1",
  "kind": "slo_breach",
  "title": "gateway SLO: latency p95 > 300ms",
  "summary": "p95 latency at 450ms, error_rate 2.5%",
  "started_at": "2025-01-23T09:00:00Z",
  "labels": {
    "node": "node1",
    "fingerprint": "gateway:slo_breach:latency"
  },
  "metrics": {
    "latency_p95_ms": 450,
    "error_rate_pct": 2.5
  },
  "evidence": {
    "log_samples": ["ERROR timeout after 30s", "WARN retry 3/3"],
    "query": "rate(http_errors_total[5m])"
  }
}
```

### Severity values

`P0`, `P1`, `P2`, `P3`, `INFO`

### Kind values

`slo_breach`, `crashloop`, `latency`, `error_rate`, `disk`, `oom`, `deploy`, `security`, `custom`

## Dedupe Behavior

Dedupe key = `sha256(service|env|kind|fingerprint)`.

- Same key within TTL (default 30 min) → `deduped=true`, `occurrences++`, no new record
- Same key after TTL → new alert record
- Different fingerprint → separate record

## `alert_ingest_tool` API

### ingest (Monitor role)

```json
{
  "action": "ingest",
  "alert": { ...AlertEvent... },
  "dedupe_ttl_minutes": 30
}
```

Response:

```json
{
  "accepted": true,
  "deduped": false,
  "dedupe_key": "abc123...",
  "alert_ref": "alrt_20250123_090000_a1b2c3",
  "occurrences": 1
}
```

### list (read)

```json
{
  "action": "list",
  "service": "gateway",
  "env": "prod",
  "window_minutes": 240,
  "limit": 50
}
```

### get (read)

```json
{
  "action": "get",
  "alert_ref": "alrt_..."
}
```

### ack (oncall/cto)

```json
{
  "action": "ack",
  "alert_ref": "alrt_...",
  "actor": "sofiia",
  "note": "false positive"
}
```

## `oncall_tool.alert_to_incident`

Converts a stored alert into an incident (or attaches to an existing open one).

```json
{
  "action": "alert_to_incident",
  "alert_ref": "alrt_...",
  "incident_severity_cap": "P1",
  "dedupe_window_minutes": 60,
  "attach_artifact": true
}
```

Response:

```json
{
  "incident_id": "inc_20250123_090000_xyz",
  "created": true,
  "severity": "P1",
  "artifact_path": "ops/incidents/inc_.../alert_alrt_....json",
  "note": "Incident created and alert acked"
}
```

### Logic

1. Load alert from `AlertStore`
2. Check for existing open P0/P1 incident for same service/env within `dedupe_window_minutes`
   - If found → attach event to existing incident, ack alert
3. If not found → create incident, append `note` + `metric` timeline events, optionally attach masked alert JSON as artifact, ack alert

## RBAC

| Role | ingest | list/get | ack | alert_to_incident |
|------|--------|----------|-----|-------------------|
| `agent_monitor` | ✅ | ❌ | ❌ | ❌ |
| `agent_cto` | ✅ | ✅ | ✅ | ✅ |
| `agent_oncall` | ❌ | ✅ | ✅ | ✅ |
| `agent_interface` | ❌ | ✅ | ❌ | ❌ |
| `agent_default` | ❌ | ❌ | ❌ | ❌ |

## SLO Watch Gate

The `slo_watch` gate in `release_check` prevents deploys during active SLO breaches.

| Profile | Mode | Behavior |
|---------|------|----------|
| dev | warn | Recommendations only |
| staging | strict | Blocks on any violation |
| prod | warn | Recommendations only |

Configure in `config/release_gate_policy.yml` per profile. Override per run with `run_slo_watch: false`.

## Backends

| Env var | Value | Effect |
|---------|-------|--------|
| `ALERT_BACKEND` | `memory` (default) | In-process, not persistent |
| `ALERT_BACKEND` | `postgres` | Persistent, needs `DATABASE_URL` |
| `ALERT_BACKEND` | `auto` | Postgres if `DATABASE_URL` set, else memory |

Run DDL: `python3 ops/scripts/migrate_alerts_postgres.py`
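
The backend table can be read as a small decision function. The sketch below is illustrative only and is not the bridge's actual code: `resolve_alert_backend` is a hypothetical helper, and raising an error for `postgres` without `DATABASE_URL`, or for an unrecognized value, is an assumption about behavior the table does not specify.

```python
import os
from typing import Mapping, Optional


def resolve_alert_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Return 'memory' or 'postgres' according to the Backends table.

    Hypothetical helper: the name and the error handling are assumptions.
    """
    env = os.environ if env is None else env
    # The table lists `memory` as the default when ALERT_BACKEND is unset.
    choice = env.get("ALERT_BACKEND", "memory").strip().lower()
    has_db_url = bool(env.get("DATABASE_URL", "").strip())

    if choice == "memory":
        return "memory"
    if choice == "postgres":
        if not has_db_url:
            # Assumption: fail fast rather than silently fall back to memory.
            raise RuntimeError("ALERT_BACKEND=postgres needs DATABASE_URL")
        return "postgres"
    if choice == "auto":
        # Postgres if DATABASE_URL is set, else memory.
        return "postgres" if has_db_url else "memory"
    # Assumption: unknown values are rejected instead of being ignored.
    raise ValueError(f"unknown ALERT_BACKEND value: {choice!r}")
```

Passing a plain dict instead of relying on `os.environ` keeps the function easy to test, for example `resolve_alert_backend({"ALERT_BACKEND": "auto"})` resolves to the in-memory backend because no `DATABASE_URL` is present.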
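
The rules under Dedupe Behavior can also be sketched in a few lines, roughly in the spirit of the `memory` backend. This is a minimal illustration, not the real `AlertStore`: the class `InMemoryAlertStore` is hypothetical, joining the four fields with a literal `|` before hashing is an assumption about how the key formula is implemented, and so is measuring the TTL from the moment the current record was created (the document does not say whether repeats extend the window).

```python
import hashlib
from datetime import datetime, timedelta


def dedupe_key(service: str, env: str, kind: str, fingerprint: str) -> str:
    # Assumption: fields are joined with a literal "|" and hashed as UTF-8.
    raw = "|".join([service, env, kind, fingerprint])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class InMemoryAlertStore:
    """Hypothetical stand-in for AlertStore, tracking one record per key."""

    def __init__(self) -> None:
        self._records = {}  # dedupe_key -> {"created_at", "occurrences"}

    def ingest(self, alert: dict, now: datetime, dedupe_ttl_minutes: int = 30) -> dict:
        fingerprint = alert.get("labels", {}).get("fingerprint", "")
        key = dedupe_key(alert["service"], alert["env"], alert["kind"], fingerprint)
        record = self._records.get(key)
        ttl = timedelta(minutes=dedupe_ttl_minutes)

        if record is not None and now - record["created_at"] < ttl:
            # Same key within TTL: bump the counter, create no new record.
            record["occurrences"] += 1
            deduped = True
        else:
            # First sighting, or same key after TTL: start a new record.
            record = {"created_at": now, "occurrences": 1}
            self._records[key] = record
            deduped = False

        return {
            "accepted": True,
            "deduped": deduped,
            "dedupe_key": key,
            "occurrences": record["occurrences"],
        }
```

Passing `now` explicitly, rather than reading the clock inside `ingest`, makes the TTL boundary straightforward to exercise in tests. The sketch omits `alert_ref` generation because the document does not describe how references are minted.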