# Runbook: Alert → Incident Bridge (State Machine + Cooldown)
## Topology
```
Monitor@node1/2 ──► alert_ingest_tool.ingest ─────────► AlertStore (Postgres or Memory)
Sofiia / oncall ──► oncall_tool.alert_to_incident ───► IncidentStore (Postgres)
Sofiia NODA2:       incident_triage_graph
                    postmortem_draft_graph
```
## Alert State Machine
```
new → processing → acked
processing → failed → (retry after retry_after_sec) → new
```
| Status | Meaning |
|-------------|--------------------------------------------------|
| `new` | Freshly ingested, not yet claimed |
| `processing` | Claimed by a loop worker; locked for 10 min |
| `acked` | Successfully processed and closed |
| `failed` | Processing error; retry after `retry_after_sec` |
**Concurrency safety:** `claim` uses `SELECT FOR UPDATE SKIP LOCKED` (Postgres) or an in-process lock (Memory). Two concurrent loops cannot claim the same alert.

**Stale processing requeue:** `claim` automatically requeues alerts whose `processing_lock_until` has expired.
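The claim semantics for the memory backend can be sketched as follows. This is an illustrative simplification, not the real tool: the field and parameter names mirror the runbook (`processing_lock_until`, `lock_ttl_seconds`), but the dict-based store is an assumption.

```python
import time

# Illustrative sketch of memory-backend claim semantics (hypothetical;
# the Postgres path uses SELECT ... FOR UPDATE SKIP LOCKED instead).
def claim_alerts(alerts, owner, limit=25, lock_ttl_seconds=600):
    now = time.time()
    claimed = []
    for alert in alerts:
        if len(claimed) >= limit:
            break
        # Stale processing: a previous worker died and its lock expired.
        stale = (alert["status"] == "processing"
                 and alert["processing_lock_until"] is not None
                 and alert["processing_lock_until"] < now)
        # Failed alerts become claimable again after retry_after_sec.
        retryable = (alert["status"] == "failed"
                     and alert.get("retry_after", 0) <= now)
        if alert["status"] == "new" or stale or retryable:
            alert["status"] = "processing"
            alert["owner"] = owner
            alert["processing_lock_until"] = now + lock_ttl_seconds
            claimed.append(alert)
    return claimed
```

Note how stale-lock requeue falls out of the same scan: an expired `processing` alert is indistinguishable from a `new` one for claiming purposes.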
---
## Triage Cooldown (per Signature)
After a triage runs for a given `incident_signature`, subsequent alerts with the same signature **within 15 min** (configurable via `triage_cooldown_minutes` in `alert_routing_policy.yml`) only get an `incident_append_event` note — no new triage run. This prevents triage storms.
```yaml
# config/alert_routing_policy.yml
defaults:
  triage_cooldown_minutes: 15
```
Cooldown state is persisted in the `incident_signature_state` table (Postgres) or in memory (fallback).
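The cooldown decision reduces to a timestamp comparison per signature. A minimal sketch, assuming an in-memory mapping of `incident_signature` to the last triage time (the real table stores this per row; the function name is illustrative):

```python
import time

# Hypothetical sketch of the per-signature triage cooldown check.
# state maps incident_signature -> last_triage_at (epoch seconds).
def should_run_triage(state, signature, cooldown_minutes=15, now=None):
    now = time.time() if now is None else now
    last = state.get(signature)
    if last is not None and now - last < cooldown_minutes * 60:
        return False  # within cooldown: only append an incident event
    state[signature] = now  # record this triage run
    return True
```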
---
## Startup Checklist
1. **Postgres DDL** (if `ALERT_BACKEND=postgres`):
   ```bash
   DATABASE_URL=postgresql://... python3 ops/scripts/migrate_alerts_postgres.py
   ```
   This is idempotent — safe to re-run. Adds the state machine columns and the `incident_signature_state` table.
2. **Env vars on NODE1 (router)**:
   ```env
   ALERT_BACKEND=auto   # Postgres → Memory fallback
   DATABASE_URL=postgresql://...
   ```
3. **Monitor agent**: configure `source: monitor@node1` and use `alert_ingest_tool.ingest`.
3. **Monitor agent**: configure `source: monitor@node1`, use `alert_ingest_tool.ingest`.
## Operational Scenarios
### Alert storm protection
Alert deduplication prevents storms. If alerts are firing repeatedly:
1. Check `occurrences` field — same alert ref means dedupe is working
2. Adjust `dedupe_ttl_minutes` per alert (default 30)
3. If many different fingerprints create new records — review Monitor fingerprint logic
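Dedupe hinges on a stable fingerprint over the alert's identity fields plus a TTL window. A sketch under assumptions: the fingerprint inputs (`service`, `kind`, `title`) and the store shape are illustrative, not the Monitor's actual logic.

```python
import hashlib
import time

# Hypothetical fingerprint: hash of the alert's identity fields.
def fingerprint(service, kind, title):
    raw = f"{service}|{kind}|{title}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]

# Hypothetical ingest-with-dedupe: same fingerprint within the TTL
# bumps occurrences instead of creating a new alert record.
def ingest(store, alert, dedupe_ttl_minutes=30, now=None):
    now = time.time() if now is None else now
    fp = fingerprint(alert["service"], alert["kind"], alert["title"])
    existing = store.get(fp)
    if existing and now - existing["first_seen"] < dedupe_ttl_minutes * 60:
        existing["occurrences"] += 1
        return {"deduped": True, "alert_ref": existing["alert_ref"]}
    store[fp] = {"alert_ref": f"alrt_{fp}", "first_seen": now, "occurrences": 1}
    return {"deduped": False, "alert_ref": store[fp]["alert_ref"]}
```

If a Monitor includes a volatile field (e.g. a timestamp) in the fingerprint input, every firing gets a fresh fingerprint and dedupe silently stops working — that is the failure mode behind step 3 above.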
### False positive alert
1. `alert_ingest_tool.ack` with `note="false positive"`
2. No incident created (or close the incident if already created via `oncall_tool.incident_close`)
### Alert → Incident conversion
```bash
# Sofiia or oncall agent calls:
oncall_tool.alert_to_incident(
  alert_ref="alrt_...",
  incident_severity_cap="P1",
  dedupe_window_minutes=60
)
```
### View recent alerts (by status)
```bash
# Default: all statuses
alert_ingest_tool.list(window_minutes=240, env="prod")
# Only new/failed (unprocessed):
alert_ingest_tool.list(window_minutes=240, status_in=["new","failed"])
```
### Claim alerts for processing (Supervisor loop)
```bash
# Atomic claim — locks alerts for 10 min
alert_ingest_tool.claim(window_minutes=240, limit=25, owner="sofiia-supervisor", lock_ttl_seconds=600)
```
### Mark alert as failed (retry)
```bash
alert_ingest_tool.fail(alert_ref="alrt_...", error="gateway timeout", retry_after_seconds=300)
```
### Operational dashboard
```
GET /v1/alerts/dashboard?window_minutes=240
# → counts by status, top signatures, latest alerts
```
```
GET /v1/incidents/open?service=gateway
# → open/mitigating incidents
```
### Monitor health check
Verify Monitor is pushing alerts:
```bash
alert_ingest_tool.list(source="monitor@node1", window_minutes=60)
```
If empty and there should be alerts → check Monitor service + entitlements.
## SLO Watch Gate
### Staging blocks on SLO breach
Config in `config/release_gate_policy.yml`:
```yaml
staging:
  gates:
    slo_watch:
      mode: "strict"
```
To temporarily bypass (emergency deploy):
```bash
# In release_check input:
run_slo_watch: false
```
Document reason in incident timeline.
### Tuning SLO thresholds
Edit `config/slo_policy.yml`:
```yaml
services:
  gateway:
    latency_p95_ms: 300   # adjust
    error_rate_pct: 1.0
```
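The gate's evaluation can be pictured as a threshold comparison per metric. A minimal sketch, assuming the two metrics shown above; the function and return shape are illustrative, not the real `slo_watch` implementation:

```python
# Hypothetical sketch: compare measured metrics against slo_policy.yml
# thresholds and report which SLOs are breached.
def slo_breached(thresholds, measured):
    breaches = []
    if measured["latency_p95_ms"] > thresholds["latency_p95_ms"]:
        breaches.append("latency_p95_ms")
    if measured["error_rate_pct"] > thresholds["error_rate_pct"]:
        breaches.append("error_rate_pct")
    return breaches  # non-empty list blocks the staging gate in strict mode
```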
## Troubleshooting
| Symptom | Cause | Fix |
|---------|-------|-----|
| Alert `accepted=false` | Validation failure (missing service/title, invalid kind) | Fix Monitor alert payload |
| `deduped=true` unexpectedly | Same fingerprint within TTL | Check Monitor fingerprint logic |
| `alert_to_incident` fails "not found" | Alert ref expired from MemoryStore | Switch to Postgres backend |
| Alerts stuck in `processing` | Loop died without acking | Run `claim` — it auto-requeues expired locks. Or: `UPDATE alerts SET status='new', processing_lock_until=NULL WHERE status='processing' AND processing_lock_until < NOW()` |
| Alerts stuck in `failed` | Persistent processing errors | Check `last_error` field: `SELECT alert_ref, last_error FROM alerts WHERE status='failed'` |
| Triage not running | Cooldown active | Check `incident_signature_state.last_triage_at`; or reduce `triage_cooldown_minutes` in policy |
| `claim` returns empty | All new alerts already locked | Check for stale processing: `SELECT COUNT(*) FROM alerts WHERE status='processing' AND processing_lock_until < NOW()` |
| SLO gate blocks in staging | SLO breach active | Fix service or override with `run_slo_watch: false` |
| `tools.alerts.ingest` denied | Monitor agent missing entitlement | Check `config/rbac_tools_matrix.yml` `agent_monitor` role |
| `tools.alerts.claim` denied | Agent missing `tools.alerts.claim` | Only `agent_cto` / `agent_oncall` / Supervisor can claim |
## Retention
Alerts in Postgres: no TTL enforced by default — add a cron job if needed:
```sql
DELETE FROM alerts WHERE created_at < NOW() - INTERVAL '30 days';
```
Memory backend: cleared on process restart.
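A possible crontab entry for the cleanup above (illustrative; the schedule is an assumption, and `psql` must be able to read the DSN from the environment):

```
# Run daily at 03:00; psql reads the DSN from DATABASE_URL
0 3 * * * psql "$DATABASE_URL" -c "DELETE FROM alerts WHERE created_at < NOW() - INTERVAL '30 days';"
```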
---
## Production Mode: ALERT_BACKEND=postgres
**⚠ Default is `memory` — do NOT use in production.** Alerts are lost on router restart.
### Setup (one-time, per environment)
**1. Run migration:**
```bash
python3 ops/scripts/migrate_alerts_postgres.py \
  --dsn "postgresql://user:pass@host:5432/daarion"

# or dry-run:
python3 ops/scripts/migrate_alerts_postgres.py --dry-run
```
**2. Set env vars** (in `.env`, docker-compose, or systemd unit):
```bash
ALERT_BACKEND=postgres
ALERT_DATABASE_URL=postgresql://user:pass@host:5432/daarion
# Fallback: if ALERT_DATABASE_URL is unset, DATABASE_URL is used automatically
```
**3. Restart router:**
```bash
docker compose -f docker-compose.node1.yml restart router
# or node2:
docker compose -f docker-compose.node2-sofiia.yml restart router
```
**4. Verify persistence** (survive a restart):
```bash
# Ingest a test alert
curl -X POST http://router:8000/v1/tools/execute \
  -H "Content-Type: application/json" \
  -d '{"tool":"alert_ingest_tool","action":"ingest","service":"test","kind":"test","message":"persistence check"}'

# Restart router
docker compose restart router

# Confirm the alert is still visible after the restart
curl "http://router:8000/v1/tools/execute" \
  -d '{"tool":"alert_ingest_tool","action":"list","service":"test"}'

# Expect: alert still present → PASS
```
### DSN resolution order
`alert_store.py` factory resolves DSN in this priority:
1. `ALERT_DATABASE_URL` (service-specific, recommended)
2. `DATABASE_URL` (shared Postgres, fallback)
3. Falls back to memory with a WARNING log if neither is set.
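The resolution order above amounts to a short chain of environment lookups. A sketch (illustrative; the actual factory lives in `alert_store.py`):

```python
import logging
import os

# Sketch of the DSN resolution order: service-specific var first,
# then the shared DATABASE_URL, else None (memory fallback + warning).
def resolve_alert_dsn():
    dsn = os.getenv("ALERT_DATABASE_URL") or os.getenv("DATABASE_URL")
    if not dsn:
        logging.warning("No Postgres DSN set; falling back to in-memory alert store")
    return dsn
```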
### Compose files updated
| File | ALERT_BACKEND set? |
|------|--------------------|
| `docker-compose.node1.yml` | ✅ `postgres` |
| `docker-compose.node2-sofiia.yml` | ✅ `postgres` |
| `docker-compose.staging.yml` | ✅ `postgres` |