# Runbook: Alert → Incident Bridge (State Machine + Cooldown)
## Topology
```
Monitor@node1/2 ──► alert_ingest_tool.ingest ─────────► AlertStore (Postgres or Memory)
Sofiia / oncall ──► oncall_tool.alert_to_incident ───► IncidentStore (Postgres)
Sofiia NODA2:       incident_triage_graph
                    postmortem_draft_graph
```
## Alert State Machine
```
new → processing → acked
processing → failed → (retry after retry_after_sec) → new
```
| Status | Meaning |
|-------------|--------------------------------------------------|
| `new` | Freshly ingested, not yet claimed |
| `processing` | Claimed by a loop worker; locked for 10 min |
| `acked` | Successfully processed and closed |
| `failed` | Processing error; retry after `retry_after_sec` |
**Concurrency safety:** `claim` uses `SELECT FOR UPDATE SKIP LOCKED` (Postgres) or an in-process lock (Memory). Two concurrent loops cannot claim the same alert.

**Stale processing requeue:** `claim` automatically requeues alerts whose `processing_lock_until` has expired.
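The claim semantics for the memory backend can be sketched as follows. This is an illustrative simplification, not the real tool: the field and parameter names mirror the runbook (`processing_lock_until`, `lock_ttl_seconds`), but the dict-based store is an assumption.

```python
import time

# Illustrative sketch of memory-backend claim semantics (hypothetical;
# the Postgres path uses SELECT ... FOR UPDATE SKIP LOCKED instead).
def claim_alerts(alerts, owner, limit=25, lock_ttl_seconds=600):
    now = time.time()
    claimed = []
    for alert in alerts:
        if len(claimed) >= limit:
            break
        # Stale processing: a previous worker died and its lock expired.
        stale = (alert["status"] == "processing"
                 and alert["processing_lock_until"] is not None
                 and alert["processing_lock_until"] < now)
        # Failed alerts become claimable again after retry_after_sec.
        retryable = (alert["status"] == "failed"
                     and alert.get("retry_after", 0) <= now)
        if alert["status"] == "new" or stale or retryable:
            alert["status"] = "processing"
            alert["owner"] = owner
            alert["processing_lock_until"] = now + lock_ttl_seconds
            claimed.append(alert)
    return claimed
```

Note how stale-lock requeue falls out of the same scan: an expired `processing` alert is indistinguishable from a `new` one for claiming purposes.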
---
## Triage Cooldown (per Signature)
After a triage runs for a given `incident_signature`, subsequent alerts with the same signature **within 15 min** (configurable via `triage_cooldown_minutes` in `alert_routing_policy.yml`) only get an `incident_append_event` note — no new triage run. This prevents triage storms.
```yaml
# config/alert_routing_policy.yml
defaults:
  triage_cooldown_minutes: 15
```
Cooldown state is persisted in the `incident_signature_state` table (Postgres) or in memory (fallback).
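The cooldown decision reduces to a timestamp comparison per signature. A minimal sketch, assuming an in-memory mapping of `incident_signature` to the last triage time (the real table stores this per row; the function name is illustrative):

```python
import time

# Hypothetical sketch of the per-signature triage cooldown check.
# state maps incident_signature -> last_triage_at (epoch seconds).
def should_run_triage(state, signature, cooldown_minutes=15, now=None):
    now = time.time() if now is None else now
    last = state.get(signature)
    if last is not None and now - last < cooldown_minutes * 60:
        return False  # within cooldown: only append an incident event
    state[signature] = now  # record this triage run
    return True
```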
---
## Startup Checklist
1. **Postgres DDL** (if `ALERT_BACKEND=postgres`):
   ```bash
   DATABASE_URL=postgresql://... python3 ops/scripts/migrate_alerts_postgres.py
   ```
   This is idempotent — safe to re-run. Adds the state machine columns and the `incident_signature_state` table.
2. **Env vars on NODE1 (router)**:
   ```env
   ALERT_BACKEND=auto   # Postgres → Memory fallback
   DATABASE_URL=postgresql://...
   ```
3. **Monitor agent**: configure `source: monitor@node1` and use `alert_ingest_tool.ingest`.
3. **Monitor agent**: configure `source: monitor@node1`, use `alert_ingest_tool.ingest`.
## Operational Scenarios
### Alert storm protection
Alert deduplication prevents storms. If alerts are firing repeatedly:
1. Check `occurrences` field — same alert ref means dedupe is working
2. Adjust `dedupe_ttl_minutes` per alert (default 30)
3. If many different fingerprints create new records — review Monitor fingerprint logic
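Dedupe hinges on a stable fingerprint over the alert's identity fields plus a TTL window. A sketch under assumptions: the fingerprint inputs (`service`, `kind`, `title`) and the store shape are illustrative, not the Monitor's actual logic.

```python
import hashlib
import time

# Hypothetical fingerprint: hash of the alert's identity fields.
def fingerprint(service, kind, title):
    raw = f"{service}|{kind}|{title}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]

# Hypothetical ingest-with-dedupe: same fingerprint within the TTL
# bumps occurrences instead of creating a new alert record.
def ingest(store, alert, dedupe_ttl_minutes=30, now=None):
    now = time.time() if now is None else now
    fp = fingerprint(alert["service"], alert["kind"], alert["title"])
    existing = store.get(fp)
    if existing and now - existing["first_seen"] < dedupe_ttl_minutes * 60:
        existing["occurrences"] += 1
        return {"deduped": True, "alert_ref": existing["alert_ref"]}
    store[fp] = {"alert_ref": f"alrt_{fp}", "first_seen": now, "occurrences": 1}
    return {"deduped": False, "alert_ref": store[fp]["alert_ref"]}
```

If a Monitor includes a volatile field (e.g. a timestamp) in the fingerprint input, every firing gets a fresh fingerprint and dedupe silently stops working — that is the failure mode behind step 3 above.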
### False positive alert
1. `alert_ingest_tool.ack` with `note="false positive"`
2. No incident created (or close the incident if already created via `oncall_tool.incident_close`)
### Alert → Incident conversion
```bash
# Sofiia or oncall agent calls:
oncall_tool.alert_to_incident(
  alert_ref="alrt_...",
  incident_severity_cap="P1",
  dedupe_window_minutes=60
)
```
### View recent alerts (by status)
```bash
# Default: all statuses
alert_ingest_tool.list(window_minutes=240, env="prod")
# Only new/failed (unprocessed):
alert_ingest_tool.list(window_minutes=240, status_in=["new","failed"])
```
### Claim alerts for processing (Supervisor loop)
```bash
# Atomic claim — locks alerts for 10 min
alert_ingest_tool.claim(window_minutes=240, limit=25, owner="sofiia-supervisor", lock_ttl_seconds=600)
```
### Mark alert as failed (retry)
```bash
alert_ingest_tool.fail(alert_ref="alrt_...", error="gateway timeout", retry_after_seconds=300)
```
### Operational dashboard
```
GET /v1/alerts/dashboard?window_minutes=240
# → counts by status, top signatures, latest alerts
```
```
GET /v1/incidents/open?service=gateway
# → open/mitigating incidents
```
### Monitor health check
Verify Monitor is pushing alerts:
```bash
alert_ingest_tool.list(source="monitor@node1", window_minutes=60)
```
If empty and there should be alerts → check Monitor service + entitlements.
## SLO Watch Gate
### Staging blocks on SLO breach
Config in `config/release_gate_policy.yml`:
```yaml
staging:
  gates:
    slo_watch:
      mode: "strict"
```
To temporarily bypass (emergency deploy):
```bash
# In release_check input:
run_slo_watch: false
```
Document reason in incident timeline.
### Tuning SLO thresholds
Edit `config/slo_policy.yml`:
```yaml
services:
  gateway:
    latency_p95_ms: 300   # adjust
    error_rate_pct: 1.0
```
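The gate's evaluation can be pictured as a threshold comparison per metric. A minimal sketch, assuming the two metrics shown above; the function and return shape are illustrative, not the real `slo_watch` implementation:

```python
# Hypothetical sketch: compare measured metrics against slo_policy.yml
# thresholds and report which SLOs are breached.
def slo_breached(thresholds, measured):
    breaches = []
    if measured["latency_p95_ms"] > thresholds["latency_p95_ms"]:
        breaches.append("latency_p95_ms")
    if measured["error_rate_pct"] > thresholds["error_rate_pct"]:
        breaches.append("error_rate_pct")
    return breaches  # non-empty list blocks the staging gate in strict mode
```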
## Troubleshooting
| Symptom | Cause | Fix |
|---------|-------|-----|
| Alert `accepted=false` | Validation failure (missing service/title, invalid kind) | Fix Monitor alert payload |
| `deduped=true` unexpectedly | Same fingerprint within TTL | Check Monitor fingerprint logic |
| `alert_to_incident` fails "not found" | Alert ref expired from MemoryStore | Switch to Postgres backend |
| Alerts stuck in `processing` | Loop died without acking | Run `claim` — it auto-requeues expired locks. Or: `UPDATE alerts SET status='new', processing_lock_until=NULL WHERE status='processing' AND processing_lock_until < NOW()` |
| Alerts stuck in `failed` | Persistent processing errors | Check `last_error` field: `SELECT alert_ref, last_error FROM alerts WHERE status='failed'` |
| Triage not running | Cooldown active | Check `incident_signature_state.last_triage_at`; or reduce `triage_cooldown_minutes` in policy |
| `claim` returns empty | All new alerts already locked | Check for stale processing: `SELECT COUNT(*) FROM alerts WHERE status='processing' AND processing_lock_until < NOW()` |
| SLO gate blocks in staging | SLO breach active | Fix service or override with `run_slo_watch: false` |
| `tools.alerts.ingest` denied | Monitor agent missing entitlement | Check `config/rbac_tools_matrix.yml` `agent_monitor` role |
| `tools.alerts.claim` denied | Agent missing `tools.alerts.claim` | Only `agent_cto` / `agent_oncall` / Supervisor can claim |
## Retention
Alerts in Postgres: no TTL enforced by default — add a cron job if needed:
```sql
DELETE FROM alerts WHERE created_at < NOW() - INTERVAL '30 days';
```
Memory backend: cleared on process restart.
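A possible crontab entry for the cleanup above (illustrative; the schedule is an assumption, and `psql` must be able to read the DSN from the environment):

```
# Run daily at 03:00; psql reads the DSN from DATABASE_URL
0 3 * * * psql "$DATABASE_URL" -c "DELETE FROM alerts WHERE created_at < NOW() - INTERVAL '30 days';"
```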
---
## Production Mode: ALERT_BACKEND=postgres
**⚠ Default is `memory` — do NOT use in production.** Alerts are lost on router restart.
### Setup (one-time, per environment)
**1. Run migration:**
```bash
python3 ops/scripts/migrate_alerts_postgres.py \
  --dsn "postgresql://user:pass@host:5432/daarion"

# or dry-run:
python3 ops/scripts/migrate_alerts_postgres.py --dry-run
```
**2. Set env vars** (in `.env`, docker-compose, or systemd unit):
```bash
ALERT_BACKEND=postgres
ALERT_DATABASE_URL=postgresql://user:pass@host:5432/daarion
# Fallback: if ALERT_DATABASE_URL is unset, DATABASE_URL is used automatically
```
**3. Restart router:**
```bash
docker compose -f docker-compose.node1.yml restart router
# or node2:
docker compose -f docker-compose.node2-sofiia.yml restart router
```
**4. Verify persistence** (survive a restart):
```bash
# Ingest a test alert
curl -X POST http://router:8000/v1/tools/execute \
  -H "Content-Type: application/json" \
  -d '{"tool":"alert_ingest_tool","action":"ingest","service":"test","kind":"test","message":"persistence check"}'

# Restart router
docker compose restart router

# Confirm the alert is still visible after the restart
curl "http://router:8000/v1/tools/execute" \
  -d '{"tool":"alert_ingest_tool","action":"list","service":"test"}'

# Expect: alert still present → PASS
```
### DSN resolution order
`alert_store.py` factory resolves DSN in this priority:
1. `ALERT_DATABASE_URL` (service-specific, recommended)
2. `DATABASE_URL` (shared Postgres, fallback)
3. Falls back to memory with a WARNING log if neither is set.
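The resolution order above amounts to a short chain of environment lookups. A sketch (illustrative; the actual factory lives in `alert_store.py`):

```python
import logging
import os

# Sketch of the DSN resolution order: service-specific var first,
# then the shared DATABASE_URL, else None (memory fallback + warning).
def resolve_alert_dsn():
    dsn = os.getenv("ALERT_DATABASE_URL") or os.getenv("DATABASE_URL")
    if not dsn:
        logging.warning("No Postgres DSN set; falling back to in-memory alert store")
    return dsn
```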
### Compose files updated
| File | ALERT_BACKEND set? |
|------|--------------------|
| `docker-compose.node1.yml` | ✅ `postgres` |
| `docker-compose.node2-sofiia.yml` | ✅ `postgres` |
| `docker-compose.staging.yml` | ✅ `postgres` |