microdao-daarion/ops/runbook-alerts.md

Runbook: Alert → Incident Bridge (State Machine + Cooldown)

Topology

```
Monitor@node1/2  ──► alert_ingest_tool.ingest ──► AlertStore (Postgres or Memory)
                                                        │
Sofiia / oncall  ──► oncall_tool.alert_to_incident ─────┘
                                                        │
                          IncidentStore (Postgres) ◄────┘
                                  │
                   Sofiia NODA2: incident_triage_graph
                                  │
                        postmortem_draft_graph
```

Alert State Machine

```
new → processing → acked
          ↓
        failed → (retry after TTL) → new
```

| Status | Meaning |
|---|---|
| `new` | Freshly ingested, not yet claimed |
| `processing` | Claimed by a loop worker; locked for 10 min |
| `acked` | Successfully processed and closed |
| `failed` | Processing error; retried after `retry_after_sec` |
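The table above can be encoded as a small transition guard; a minimal sketch (names like `ALLOWED_TRANSITIONS` are illustrative, not the actual store code):

```python
# Allowed alert state transitions, per the diagram above (illustrative names).
ALLOWED_TRANSITIONS = {
    "new": {"processing"},
    "processing": {"acked", "failed"},
    "failed": {"new"},   # requeued once retry_after_sec elapses
    "acked": set(),      # terminal
}

def transition(status: str, target: str) -> str:
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS.get(status, set()):
        raise ValueError(f"illegal transition {status} -> {target}")
    return target
```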

Concurrency safety: `claim` uses `SELECT FOR UPDATE SKIP LOCKED` (Postgres) or an in-process lock (Memory). Two concurrent loops cannot claim the same alert.

Stale processing requeue: `claim` automatically requeues alerts whose `processing_lock_until` has expired.


Triage Cooldown (per Signature)

After a triage runs for a given `incident_signature`, subsequent alerts with the same signature within 15 min (configurable via `triage_cooldown_minutes` in `alert_routing_policy.yml`) only get an `incident_append_event` note, with no new triage run. This prevents triage storms.

```yaml
# config/alert_routing_policy.yml
defaults:
  triage_cooldown_minutes: 15
```

The state is persisted in the `incident_signature_state` table (Postgres) or in memory (fallback).


Startup Checklist

  1. Postgres DDL (if ALERT_BACKEND=postgres):

    DATABASE_URL=postgresql://... python3 ops/scripts/migrate_alerts_postgres.py
    

    This is idempotent — safe to re-run. Adds state machine columns and incident_signature_state table.

  2. Env vars on NODE1 (router):

    ALERT_BACKEND=auto           # Postgres → Memory fallback
    DATABASE_URL=postgresql://...
    
  3. Monitor agent: configure `source: monitor@node1` and use `alert_ingest_tool.ingest`.

Operational Scenarios

Alert storm protection

Alert deduplication prevents storms. If alerts are firing repeatedly:

  1. Check the `occurrences` field: the same alert ref means dedupe is working
  2. Adjust `dedupe_ttl_minutes` per alert (default 30)
  3. If many different fingerprints create new records, review the Monitor fingerprint logic

False positive alert

  1. `alert_ingest_tool.ack` with `note="false positive"`
  2. No incident is created (if one was already created, close it via `oncall_tool.incident_close`)

Alert → Incident conversion

```python
# Sofiia or oncall agent calls:
oncall_tool.alert_to_incident(
    alert_ref="alrt_...",
    incident_severity_cap="P1",
    dedupe_window_minutes=60
)
```

View recent alerts (by status)

```python
# Default: all statuses
alert_ingest_tool.list(window_minutes=240, env="prod")

# Only new/failed (unprocessed):
alert_ingest_tool.list(window_minutes=240, status_in=["new", "failed"])
```

Claim alerts for processing (Supervisor loop)

```python
# Atomic claim — locks alerts for 10 min
alert_ingest_tool.claim(window_minutes=240, limit=25, owner="sofiia-supervisor", lock_ttl_seconds=600)
```

Mark alert as failed (retry)

```python
alert_ingest_tool.fail(alert_ref="alrt_...", error="gateway timeout", retry_after_seconds=300)
```

Operational dashboard

```
GET /v1/alerts/dashboard?window_minutes=240
# → counts by status, top signatures, latest alerts
GET /v1/incidents/open?service=gateway
# → open/mitigating incidents
```
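The counts-by-status aggregation behind the dashboard endpoint can be sketched as follows (illustrative, assuming each alert record carries `status` and `signature` fields):

```python
from collections import Counter

def dashboard_counts(alerts: list[dict]) -> dict:
    """Aggregate alerts into counts by status and the top signatures."""
    return {
        "by_status": dict(Counter(a["status"] for a in alerts)),
        "top_signatures": Counter(a["signature"] for a in alerts).most_common(5),
    }
```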

Monitor health check

Verify Monitor is pushing alerts:

```python
alert_ingest_tool.list(source="monitor@node1", window_minutes=60)
```

If this is empty when alerts are expected, check the Monitor service and its entitlements.

SLO Watch Gate

Staging blocks on SLO breach

Config in config/release_gate_policy.yml:

```yaml
staging:
  gates:
    slo_watch:
      mode: "strict"
```

To temporarily bypass (emergency deploy):

```yaml
# In release_check input:
run_slo_watch: false
```

Document reason in incident timeline.

Tuning SLO thresholds

Edit config/slo_policy.yml:

```yaml
services:
  gateway:
    latency_p95_ms: 300    # adjust
    error_rate_pct: 1.0
```

Troubleshooting

| Symptom | Cause | Fix |
|---|---|---|
| Alert `accepted=false` | Validation failure (missing `service`/`title`, invalid `kind`) | Fix the Monitor alert payload |
| `deduped=true` unexpectedly | Same fingerprint within TTL | Check Monitor fingerprint logic |
| `alert_to_incident` fails with "not found" | Alert ref expired from MemoryStore | Switch to the Postgres backend |
| Alerts stuck in `processing` | Loop died without acking | Run `claim` (it auto-requeues expired locks), or: `UPDATE alerts SET status='new', processing_lock_until=NULL WHERE status='processing' AND processing_lock_until < NOW()` |
| Alerts stuck in `failed` | Persistent processing errors | Check the `last_error` field: `SELECT alert_ref, last_error FROM alerts WHERE status='failed'` |
| Triage not running | Cooldown active | Check `incident_signature_state.last_triage_at`, or reduce `triage_cooldown_minutes` in the policy |
| `claim` returns empty | All new alerts already locked | Check for stale processing: `SELECT COUNT(*) FROM alerts WHERE status='processing' AND processing_lock_until < NOW()` |
| SLO gate blocks in staging | SLO breach active | Fix the service or override with `run_slo_watch: false` |
| `tools.alerts.ingest` denied | Monitor agent missing entitlement | Check the `agent_monitor` role in `config/rbac_tools_matrix.yml` |
| `tools.alerts.claim` denied | Agent missing `tools.alerts.claim` | Only `agent_cto` / `agent_oncall` / Supervisor can claim |

Retention

Alerts in Postgres: no TTL enforced by default — add a cron job if needed:

```sql
DELETE FROM alerts WHERE created_at < NOW() - INTERVAL '30 days';
```

Memory backend: cleared on process restart.


Production Mode: ALERT_BACKEND=postgres

⚠ Default is memory — do NOT use in production. Alerts are lost on router restart.

Setup (one-time, per environment)

1. Run migration:

```bash
python3 ops/scripts/migrate_alerts_postgres.py \
  --dsn "postgresql://user:pass@host:5432/daarion"
# or dry-run:
python3 ops/scripts/migrate_alerts_postgres.py --dry-run
```

2. Set env vars (in .env, docker-compose, or systemd unit):

```bash
ALERT_BACKEND=postgres
ALERT_DATABASE_URL=postgresql://user:pass@host:5432/daarion
# Fallback: if ALERT_DATABASE_URL is unset, DATABASE_URL is used automatically
```

3. Restart router:

```bash
docker compose -f docker-compose.node1.yml restart router
# or node2:
docker compose -f docker-compose.node2-sofiia.yml restart router
```

4. Verify persistence (the alert survives a restart):

```bash
# Ingest a test alert
curl -X POST http://router:8000/v1/tools/execute \
  -H "Content-Type: application/json" \
  -d '{"tool":"alert_ingest_tool","action":"ingest","service":"test","kind":"test","message":"persistence check"}'

# Restart router
docker compose restart router

# Confirm the alert is still visible after the restart
curl -X POST "http://router:8000/v1/tools/execute" \
  -H "Content-Type: application/json" \
  -d '{"tool":"alert_ingest_tool","action":"list","service":"test"}'
# Expect: alert still present → PASS
```

DSN resolution order

The `alert_store.py` factory resolves the DSN in this priority order:

  1. `ALERT_DATABASE_URL` (service-specific, recommended)
  2. `DATABASE_URL` (shared Postgres, fallback)
  3. Neither set: falls back to the memory backend with a WARNING log.

Compose files updated

| File | `ALERT_BACKEND` set? |
|---|---|
| `docker-compose.node1.yml` | `postgres` |
| `docker-compose.node2-sofiia.yml` | `postgres` |
| `docker-compose.staging.yml` | `postgres` |