# Alert → Incident Bridge

## Overview

The Alert Bridge provides a governed, deduplicated pipeline from Monitor/Prometheus detection to Incident creation.

**Security model:** Monitor sends alerts (`tools.alerts.ingest` only). Sofiia/oncall create incidents (`tools.oncall.incident_write` + `tools.alerts.ack`). No agent gets both roles automatically.

```
Monitor@nodeX ──ingest──► AlertStore ──alert_to_incident──► IncidentStore
(tools.alerts.ingest)                  (tools.oncall.incident_write)
                                              │
                                IncidentTriage (Sofiia NODA2)
                                              │
                                       PostmortemDraft
```

## AlertEvent Schema
```json
{
  "source": "monitor@node1",
  "service": "gateway",
  "env": "prod",
  "severity": "P1",
  "kind": "slo_breach",
  "title": "gateway SLO: latency p95 > 300ms",
  "summary": "p95 latency at 450ms, error_rate 2.5%",
  "started_at": "2025-01-23T09:00:00Z",
  "labels": {
    "node": "node1",
    "fingerprint": "gateway:slo_breach:latency"
  },
  "metrics": {
    "latency_p95_ms": 450,
    "error_rate_pct": 2.5
  },
  "evidence": {
    "log_samples": ["ERROR timeout after 30s", "WARN retry 3/3"],
    "query": "rate(http_errors_total[5m])"
  }
}
```


### Severity values

`P0`, `P1`, `P2`, `P3`, `INFO`

### Kind values

`slo_breach`, `crashloop`, `latency`, `error_rate`, `disk`, `oom`, `deploy`, `security`, `custom`

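A minimal sketch of checking these enumerations on an incoming event. The required-field set and the validator itself are illustrative assumptions, not part of the `alert_ingest_tool` spec:

```python
# Enumerations from the schema above; REQUIRED is an assumed minimal field set.
SEVERITIES = {"P0", "P1", "P2", "P3", "INFO"}
KINDS = {"slo_breach", "crashloop", "latency", "error_rate",
         "disk", "oom", "deploy", "security", "custom"}
REQUIRED = {"source", "service", "env", "severity", "kind", "title", "started_at"}

def validate_alert(event: dict) -> list:
    """Return a list of validation errors (empty list means valid)."""
    errors = ["missing field: %s" % f for f in sorted(REQUIRED - event.keys())]
    if event.get("severity") not in SEVERITIES:
        errors.append("bad severity: %r" % event.get("severity"))
    if event.get("kind") not in KINDS:
        errors.append("bad kind: %r" % event.get("kind"))
    return errors
```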
## Dedupe Behavior

Dedupe key = `sha256(service|env|kind|fingerprint)`.

- Same key within TTL (default 30 min) → `deduped=true`, `occurrences++`, no new record
- Same key after TTL → new alert record
- Different fingerprint → separate record

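The dedupe rules above can be sketched in Python. The in-memory dict is a stand-in for `AlertStore`, and the function names are illustrative:

```python
import hashlib
from datetime import datetime, timedelta, timezone

def dedupe_key(alert: dict) -> str:
    """sha256 over service|env|kind|fingerprint, per the dedupe rule above."""
    raw = "|".join([alert["service"], alert["env"], alert["kind"],
                    alert["labels"]["fingerprint"]])
    return hashlib.sha256(raw.encode()).hexdigest()

# dedupe_key -> (first_seen, occurrences); in-memory stand-in for AlertStore
_seen = {}

def ingest(alert: dict, dedupe_ttl_minutes: int = 30) -> dict:
    """Same key within TTL bumps occurrences only; after TTL a fresh record starts."""
    key = dedupe_key(alert)
    now = datetime.now(timezone.utc)
    if key in _seen:
        first, count = _seen[key]
        if now - first < timedelta(minutes=dedupe_ttl_minutes):
            _seen[key] = (first, count + 1)
            return {"accepted": True, "deduped": True, "occurrences": count + 1}
    _seen[key] = (now, 1)  # new key, or TTL expired -> new record
    return {"accepted": True, "deduped": False, "occurrences": 1}
```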
## `alert_ingest_tool` API

### ingest (Monitor role)

```json
{
  "action": "ingest",
  "alert": { ...AlertEvent... },
  "dedupe_ttl_minutes": 30
}
```


Response:

```json
{
  "accepted": true,
  "deduped": false,
  "dedupe_key": "abc123...",
  "alert_ref": "alrt_20250123_090000_a1b2c3",
  "occurrences": 1
}
```


### list (read)

```json
{ "action": "list", "service": "gateway", "env": "prod", "window_minutes": 240, "limit": 50 }
```

### get (read)

```json
{ "action": "get", "alert_ref": "alrt_..." }
```


### ack (oncall/cto)

```json
{ "action": "ack", "alert_ref": "alrt_...", "actor": "sofiia", "note": "false positive" }
```


## `oncall_tool.alert_to_incident`

Converts a stored alert into an incident (or attaches it to an existing open one).

```json
{
  "action": "alert_to_incident",
  "alert_ref": "alrt_...",
  "incident_severity_cap": "P1",
  "dedupe_window_minutes": 60,
  "attach_artifact": true
}
```


Response:

```json
{
  "incident_id": "inc_20250123_090000_xyz",
  "created": true,
  "severity": "P1",
  "artifact_path": "ops/incidents/inc_.../alert_alrt_....json",
  "note": "Incident created and alert acked"
}
```


### Logic

1. Load the alert from `AlertStore`
2. Check for an existing open P0/P1 incident for the same service/env within `dedupe_window_minutes`
   - If found → attach the event to the existing incident and ack the alert
3. If not found → create an incident, append `note` + `metric` timeline events, optionally attach the masked alert JSON as an artifact, and ack the alert

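The steps above can be sketched with a simplified in-memory model. The severity cap, artifact attachment, and ack call are omitted, and the data shapes are assumptions:

```python
from datetime import datetime, timedelta, timezone

def alert_to_incident(alert, open_incidents, dedupe_window_minutes=60):
    """Attach to an open P0/P1 incident for the same service/env within the
    window, else create a new incident with note + metric timeline events."""
    now = datetime.now(timezone.utc)
    window = timedelta(minutes=dedupe_window_minutes)
    for inc in open_incidents:
        if (inc["severity"] in ("P0", "P1")
                and inc["service"] == alert["service"]
                and inc["env"] == alert["env"]
                and now - inc["opened_at"] <= window):
            inc["events"].append({"type": "note", "text": alert["title"]})
            return {"incident_id": inc["id"], "created": False}
    inc = {"id": "inc_%s" % now.strftime("%Y%m%d_%H%M%S"),
           "service": alert["service"], "env": alert["env"],
           "severity": alert["severity"], "opened_at": now,
           "events": [{"type": "note", "text": alert["title"]},
                      {"type": "metric", "data": alert.get("metrics", {})}]}
    open_incidents.append(inc)
    return {"incident_id": inc["id"], "created": True}
```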
## RBAC

| Role | ingest | list/get | ack | alert_to_incident |
|------|--------|----------|-----|-------------------|
| `agent_monitor` | ✅ | ❌ | ❌ | ❌ |
| `agent_cto` | ✅ | ✅ | ✅ | ✅ |
| `agent_oncall` | ❌ | ✅ | ✅ | ✅ |
| `agent_interface` | ❌ | ✅ | ❌ | ❌ |
| `agent_default` | ❌ | ❌ | ❌ | ❌ |

## SLO Watch Gate

The `slo_watch` gate in `release_check` prevents deploys during active SLO breaches.

| Profile | Mode | Behavior |
|---------|------|----------|
| dev | warn | Recommendations only |
| staging | strict | Blocks on any violation |
| prod | warn | Recommendations only |

Configure in `config/release_gate_policy.yml` per profile. Override per run with `run_slo_watch: false`.
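A sketch of the warn-vs-strict decision. The return shape (a blocked flag plus recommendations) is an assumption, not the documented `release_check` output:

```python
def slo_watch_gate(mode, violations):
    """warn mode surfaces violations as recommendations; strict mode blocks."""
    if not violations:
        return {"blocked": False, "recommendations": []}
    return {"blocked": mode == "strict", "recommendations": list(violations)}
```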
## Backends

| Env var | Value | Effect |
|---------|-------|--------|
| `ALERT_BACKEND` | `memory` (default) | In-process, not persistent |
| `ALERT_BACKEND` | `postgres` | Persistent, requires `DATABASE_URL` |
| `ALERT_BACKEND` | `auto` | Postgres if `DATABASE_URL` is set, else memory |


Run the DDL migration: `python3 ops/scripts/migrate_alerts_postgres.py`
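The resolution order in the table can be sketched as follows (an illustrative helper, not the actual tool code):

```python
import os

def resolve_backend(env=None):
    """Resolve the alert store backend per the ALERT_BACKEND table above."""
    env = os.environ if env is None else env
    backend = env.get("ALERT_BACKEND", "memory")  # memory is the documented default
    if backend == "auto":
        # auto: prefer Postgres only when DATABASE_URL is configured
        return "postgres" if env.get("DATABASE_URL") else "memory"
    return backend
```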