# Runbook: Alert → Incident Bridge (State Machine + Cooldown)

## Topology

```
Monitor@node1/2 ──► alert_ingest_tool.ingest ──► AlertStore (Postgres or Memory)
                                                                │
Sofiia / oncall ──► oncall_tool.alert_to_incident ──────────────┘
                              │
   IncidentStore (Postgres) ◄─┘
                              │
   Sofiia NODE2: incident_triage_graph │ postmortem_draft_graph
```

## Alert State Machine

```
new → processing → acked
         ↓
      failed → (retry after TTL) → new
```

| Status       | Meaning                                          |
|--------------|--------------------------------------------------|
| `new`        | Freshly ingested, not yet claimed                |
| `processing` | Claimed by a loop worker; locked for 10 min      |
| `acked`      | Successfully processed and closed                |
| `failed`     | Processing error; retry after `retry_after_sec`  |

**Concurrency safety:** `claim` uses `SELECT FOR UPDATE SKIP LOCKED` (Postgres) or an in-process lock (Memory). Two concurrent loops cannot claim the same alert.

**Stale processing requeue:** `claim` automatically requeues alerts whose `processing_lock_until` has expired.

---

## Triage Cooldown (per Signature)

After a triage runs for a given `incident_signature`, subsequent alerts with the same signature **within 15 min** (configurable via `triage_cooldown_minutes` in `alert_routing_policy.yml`) only get an `incident_append_event` note — no new triage run. This prevents triage storms.

```yaml
# config/alert_routing_policy.yml
defaults:
  triage_cooldown_minutes: 15
```

The state is persisted in the `incident_signature_state` table (Postgres) or in memory (fallback).

---

## Startup Checklist

1. **Postgres DDL** (if `ALERT_BACKEND=postgres`):

   ```bash
   DATABASE_URL=postgresql://... python3 ops/scripts/migrate_alerts_postgres.py
   ```

   This is idempotent — safe to re-run. Adds the state machine columns and the `incident_signature_state` table.

2. **Env vars on NODE1 (router)**:

   ```env
   ALERT_BACKEND=auto   # Postgres → Memory fallback
   DATABASE_URL=postgresql://...
   ```

3. **Monitor agent**: configure `source: monitor@node1`, use `alert_ingest_tool.ingest`.
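The per-signature cooldown described above can be sketched roughly as follows. This is a minimal illustration only: the table and column names (`incident_signature_state`, `last_triage_at`) come from this runbook, but the function name and the plain-dict in-memory fallback shown here are assumptions, not the actual store implementation.

```python
from datetime import datetime, timedelta, timezone

# Illustrative in-memory fallback for cooldown state; the real store
# persists last_triage_at in the incident_signature_state table.
_signature_state: dict[str, datetime] = {}

def should_run_triage(signature: str, cooldown_minutes: int = 15) -> bool:
    """Return True if a full triage should run for this signature,
    False if we are inside the cooldown window (append a note instead)."""
    now = datetime.now(timezone.utc)
    last = _signature_state.get(signature)
    if last is not None and now - last < timedelta(minutes=cooldown_minutes):
        return False  # within cooldown: incident_append_event only
    _signature_state[signature] = now  # record this triage run
    return True
```

The key design point is that the cooldown is keyed on `incident_signature`, not on the individual alert, so a storm of distinct alerts collapsing to one signature triggers at most one triage per window.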
## Operational Scenarios

### Alert storm protection

Alert deduplication prevents storms. If alerts are firing repeatedly:

1. Check the `occurrences` field — the same alert ref means dedupe is working
2. Adjust `dedupe_ttl_minutes` per alert (default 30)
3. If many different fingerprints create new records — review the Monitor fingerprint logic

### False positive alert

1. `alert_ingest_tool.ack` with `note="false positive"`
2. No incident is created (or close the incident, if already created, via `oncall_tool.incident_close`)

### Alert → Incident conversion

```bash
# Sofiia or oncall agent calls:
oncall_tool.alert_to_incident(
    alert_ref="alrt_...",
    incident_severity_cap="P1",
    dedupe_window_minutes=60
)
```

### View recent alerts (by status)

```bash
# Default: all statuses
alert_ingest_tool.list(window_minutes=240, env="prod")

# Only new/failed (unprocessed):
alert_ingest_tool.list(window_minutes=240, status_in=["new","failed"])
```

### Claim alerts for processing (Supervisor loop)

```bash
# Atomic claim — locks alerts for 10 min
alert_ingest_tool.claim(window_minutes=240, limit=25, owner="sofiia-supervisor", lock_ttl_seconds=600)
```

### Mark alert as failed (retry)

```bash
alert_ingest_tool.fail(alert_ref="alrt_...", error="gateway timeout", retry_after_seconds=300)
```

### Operational dashboard

```
GET /v1/alerts/dashboard?window_minutes=240
# → counts by status, top signatures, latest alerts
```

```
GET /v1/incidents/open?service=gateway
# → open/mitigating incidents
```

### Monitor health check

Verify Monitor is pushing alerts:

```bash
alert_ingest_tool.list(source="monitor@node1", window_minutes=60)
```

If the list is empty and there should be alerts → check the Monitor service + entitlements.

## SLO Watch Gate

### Staging blocks on SLO breach

Config in `config/release_gate_policy.yml`:

```yaml
staging:
  gates:
    slo_watch:
      mode: "strict"
```

To temporarily bypass (emergency deploy):

```bash
# In release_check input:
run_slo_watch: false
```

Document the reason in the incident timeline.
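The claim → process → ack/fail cycle described in the scenarios above can be sketched as one pass of a supervisor loop. This is a sketch under assumptions: `alert_ingest_tool` is modeled as a plain object with `claim`/`ack`/`fail` methods, and the `process` handler is a hypothetical callback — the real tool-call transport is not shown.

```python
def run_supervisor_pass(tool, process, owner: str = "sofiia-supervisor") -> None:
    """One pass of the alert-processing loop: claim a batch atomically,
    then ack each alert on success or mark it failed for retry."""
    claimed = tool.claim(
        window_minutes=240,
        limit=25,
        owner=owner,
        lock_ttl_seconds=600,  # processing lock: 10 minutes
    )
    for alert in claimed:
        try:
            process(alert)  # hypothetical handler: triage, routing, etc.
            tool.ack(alert["alert_ref"])
        except Exception as exc:
            # Mark failed; after retry_after_seconds the alert returns
            # to `new` and a later claim() pass picks it up again.
            tool.fail(
                alert_ref=alert["alert_ref"],
                error=str(exc),
                retry_after_seconds=300,
            )
```

Because `claim` is atomic (SKIP LOCKED in Postgres), several such loops can run concurrently without double-processing, and a crashed loop's alerts are requeued once `processing_lock_until` expires.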
### Tuning SLO thresholds

Edit `config/slo_policy.yml`:

```yaml
services:
  gateway:
    latency_p95_ms: 300   # adjust
    error_rate_pct: 1.0
```

## Troubleshooting

| Symptom | Cause | Fix |
|---------|-------|-----|
| Alert `accepted=false` | Validation failure (missing service/title, invalid kind) | Fix the Monitor alert payload |
| `deduped=true` unexpectedly | Same fingerprint within TTL | Check the Monitor fingerprint logic |
| `alert_to_incident` fails with "not found" | Alert ref expired from MemoryStore | Switch to the Postgres backend |
| Alerts stuck in `processing` | Loop died without acking | Run `claim` — it auto-requeues expired locks. Or: `UPDATE alerts SET status='new', processing_lock_until=NULL WHERE status='processing' AND processing_lock_until < NOW()` |
| Alerts stuck in `failed` | Persistent processing errors | Check the `last_error` field: `SELECT alert_ref, last_error FROM alerts WHERE status='failed'` |
| Triage not running | Cooldown active | Check `incident_signature_state.last_triage_at`; or reduce `triage_cooldown_minutes` in the policy |
| `claim` returns empty | All new alerts already locked | Check for stale processing: `SELECT COUNT(*) FROM alerts WHERE status='processing' AND processing_lock_until < NOW()` |
| SLO gate blocks in staging | SLO breach active | Fix the service or override with `run_slo_watch: false` |
| `tools.alerts.ingest` denied | Monitor agent missing entitlement | Check the `agent_monitor` role in `config/rbac_tools_matrix.yml` |
| `tools.alerts.claim` denied | Agent missing `tools.alerts.claim` | Only `agent_cto` / `agent_oncall` / Supervisor can claim |

## Retention

Alerts in Postgres: no TTL is enforced by default — add a cron job if needed:

```sql
DELETE FROM alerts WHERE created_at < NOW() - INTERVAL '30 days';
```

Memory backend: cleared on process restart.

---

## Production Mode: ALERT_BACKEND=postgres

**⚠ The default is `memory` — do NOT use it in production.** Alerts are lost on router restart.

### Setup (one-time, per environment)

**1. Run the migration:**

```bash
python3 ops/scripts/migrate_alerts_postgres.py \
  --dsn "postgresql://user:pass@host:5432/daarion"

# or dry-run:
python3 ops/scripts/migrate_alerts_postgres.py --dry-run
```

**2. Set env vars** (in `.env`, docker-compose, or a systemd unit):

```bash
ALERT_BACKEND=postgres
ALERT_DATABASE_URL=postgresql://user:pass@host:5432/daarion
# Fallback: if ALERT_DATABASE_URL is unset, DATABASE_URL is used automatically
```

**3. Restart the router:**

```bash
docker compose -f docker-compose.node1.yml restart router
# or node2:
docker compose -f docker-compose.node2-sofiia.yml restart router
```

**4. Verify persistence** (survives a restart):

```bash
# Ingest a test alert
curl -X POST http://router:8000/v1/tools/execute \
  -H "Content-Type: application/json" \
  -d '{"tool":"alert_ingest_tool","action":"ingest","service":"test","kind":"test","message":"persistence check"}'

# Restart the router
docker compose restart router

# Confirm the alert is still visible after the restart
curl "http://router:8000/v1/tools/execute" \
  -d '{"tool":"alert_ingest_tool","action":"list","service":"test"}'
# Expect: alert still present → PASS
```

### DSN resolution order

The `alert_store.py` factory resolves the DSN in this priority:

1. `ALERT_DATABASE_URL` (service-specific, recommended)
2. `DATABASE_URL` (shared Postgres, fallback)
3. Falls back to memory, with a WARNING log, if neither is set.

### Compose files updated

| File | ALERT_BACKEND set? |
|------|--------------------|
| `docker-compose.node1.yml` | ✅ `postgres` |
| `docker-compose.node2-sofiia.yml` | ✅ `postgres` |
| `docker-compose.staging.yml` | ✅ `postgres` |
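The DSN resolution order above could be implemented roughly like this. A sketch only: the function name `resolve_backend` and its return shape are illustrative assumptions, not the actual `alert_store.py` code — only the priority order and the memory-fallback warning come from this runbook.

```python
import logging
import os

log = logging.getLogger("alert_store")

def resolve_backend(env=None):
    """Pick the alert backend and DSN following the documented priority:
    ALERT_DATABASE_URL > DATABASE_URL > memory fallback (with a warning).
    Returns a (backend_name, dsn_or_None) tuple."""
    env = os.environ if env is None else env
    dsn = env.get("ALERT_DATABASE_URL") or env.get("DATABASE_URL")
    if dsn:
        return ("postgres", dsn)
    log.warning("No Postgres DSN set; falling back to in-memory alert store")
    return ("memory", None)
```

Passing the environment as a parameter (defaulting to `os.environ`) keeps the resolution logic testable without mutating process-wide state.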