Runbook: Alert → Incident Bridge (State Machine + Cooldown)
Topology
Monitor@node1/2 ──► alert_ingest_tool.ingest ──► AlertStore (Postgres or Memory)
                                                        │
Sofiia / oncall ──► oncall_tool.alert_to_incident ◄─────┘
                              │
                              ▼
                    IncidentStore (Postgres)
                              │
                              ▼
                    Sofiia NODA2: incident_triage_graph
                              │
                              ▼
                    postmortem_draft_graph
Alert State Machine
new → processing → acked
          ↓
        failed → (retry after TTL) → new
| Status | Meaning |
|---|---|
| `new` | Freshly ingested, not yet claimed |
| `processing` | Claimed by a loop worker; locked for 10 min |
| `acked` | Successfully processed and closed |
| `failed` | Processing error; retry after `retry_after_sec` |
Concurrency safety: claim uses SELECT FOR UPDATE SKIP LOCKED (Postgres) or an in-process lock (Memory). Two concurrent loops cannot claim the same alert.
Stale processing requeue: claim automatically requeues alerts whose processing_lock_until has expired.
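The claim semantics above (atomic claim, lock TTL, stale-lock requeue) can be sketched for the Memory backend roughly as follows. This is an illustrative model, not the real store: the class name `MemoryAlertStore` and its fields mirror the runbook's vocabulary (`status`, `processing_lock_until`, `owner`), not the actual implementation.

```python
import time

class MemoryAlertStore:
    """Hypothetical in-memory sketch of claim + stale-lock requeue."""

    def __init__(self):
        self.alerts = {}  # alert_ref -> dict

    def ingest(self, alert_ref):
        self.alerts[alert_ref] = {
            "status": "new", "processing_lock_until": None, "owner": None,
        }

    def claim(self, owner, limit=25, lock_ttl_seconds=600, now=None):
        """Claim alerts that are 'new', or 'processing' with an expired lock."""
        now = now if now is not None else time.time()
        claimed = []
        for ref, a in self.alerts.items():
            if len(claimed) >= limit:
                break
            # Expired lock on a 'processing' alert means the worker died:
            # the alert is automatically requeued to this claimer.
            stale = (a["status"] == "processing"
                     and a["processing_lock_until"] is not None
                     and a["processing_lock_until"] < now)
            if a["status"] == "new" or stale:
                a["status"] = "processing"
                a["owner"] = owner
                a["processing_lock_until"] = now + lock_ttl_seconds
                claimed.append(ref)
        return claimed
```

With the Postgres backend, the same effect is achieved in one statement via `SELECT ... FOR UPDATE SKIP LOCKED`, so two concurrent loops never see the same row.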
Triage Cooldown (per Signature)
After a triage runs for a given incident_signature, subsequent alerts with the same signature within 15 min (configurable via triage_cooldown_minutes in alert_routing_policy.yml) only get an incident_append_event note — no new triage run. This prevents triage storms.
# config/alert_routing_policy.yml
defaults:
  triage_cooldown_minutes: 15
The state is persisted in the `incident_signature_state` table (Postgres) or in memory (fallback).
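The cooldown decision can be sketched as a small pure function. This is a hypothetical model: the function name `on_alert` and the dict standing in for `incident_signature_state` are illustrative.

```python
from datetime import datetime, timedelta

# triage_cooldown_minutes from alert_routing_policy.yml (default 15)
TRIAGE_COOLDOWN = timedelta(minutes=15)

last_triage_at = {}  # incident_signature -> datetime of last triage run

def on_alert(signature, now):
    """Decide the action for an alert: run triage, or only append a note."""
    last = last_triage_at.get(signature)
    if last is not None and now - last < TRIAGE_COOLDOWN:
        return "incident_append_event"  # cooldown active: note only
    last_triage_at[signature] = now     # record this triage run
    return "run_triage"
```

Note that the no-triage branch does not reset the cooldown clock: repeated alerts during a storm cannot postpone the next triage indefinitely.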
Startup Checklist
- Postgres DDL (if `ALERT_BACKEND=postgres`):

  DATABASE_URL=postgresql://... python3 ops/scripts/migrate_alerts_postgres.py

  This is idempotent; safe to re-run. Adds the state machine columns and the `incident_signature_state` table.
- Env vars on NODE1 (router):

  ALERT_BACKEND=auto   # Postgres → Memory fallback
  DATABASE_URL=postgresql://...

- Monitor agent: configure `source: monitor@node1`, use `alert_ingest_tool.ingest`.
Operational Scenarios
Alert storm protection
Alert deduplication prevents storms. If alerts are firing repeatedly:
- Check the `occurrences` field; the same alert ref means dedupe is working
- Adjust `dedupe_ttl_minutes` per alert (default 30)
- If many different fingerprints create new records, review Monitor fingerprint logic
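The dedupe behavior described above can be sketched as follows. This is an assumption-laden model: the `ingest` function, the `seen` dict, and the record fields are illustrative stand-ins for the real AlertStore.

```python
DEDUPE_TTL_MIN = 30  # dedupe_ttl_minutes default

seen = {}  # fingerprint -> {"alert_ref", "last_seen", "occurrences"}

def ingest(fingerprint, now):
    """Same fingerprint within the TTL bumps occurrences on the existing
    record and returns the same alert_ref, instead of creating a new one."""
    rec = seen.get(fingerprint)
    if rec and now - rec["last_seen"] < DEDUPE_TTL_MIN * 60:
        rec["occurrences"] += 1
        rec["last_seen"] = now
        return {"alert_ref": rec["alert_ref"], "deduped": True}
    seen[fingerprint] = {
        "alert_ref": f"alrt_{len(seen) + 1}", "last_seen": now, "occurrences": 1,
    }
    return {"alert_ref": seen[fingerprint]["alert_ref"], "deduped": False}
```

This is why a storm of identical alerts yields one record with a growing `occurrences` count, while a Monitor that fingerprints too finely floods the store with distinct records.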
False positive alert
- `alert_ingest_tool.ack` with `note="false positive"`
- No incident is created (or close an already-created incident via `oncall_tool.incident_close`)
Alert → Incident conversion
# Sofiia or oncall agent calls:
oncall_tool.alert_to_incident(
    alert_ref="alrt_...",
    incident_severity_cap="P1",
    dedupe_window_minutes=60
)
View recent alerts (by status)
# Default: all statuses
alert_ingest_tool.list(window_minutes=240, env="prod")
# Only new/failed (unprocessed):
alert_ingest_tool.list(window_minutes=240, status_in=["new","failed"])
Claim alerts for processing (Supervisor loop)
# Atomic claim — locks alerts for 10 min
alert_ingest_tool.claim(window_minutes=240, limit=25, owner="sofiia-supervisor", lock_ttl_seconds=600)
Mark alert as failed (retry)
alert_ingest_tool.fail(alert_ref="alrt_...", error="gateway timeout", retry_after_seconds=300)
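The `failed → (retry after TTL) → new` transition from the state machine can be sketched as below. The helper names `fail` and `requeue_failed` and the field `retry_eligible_at` are hypothetical; they model the behavior of `retry_after_seconds`, not the real API.

```python
def fail(alert, error, retry_after_seconds, now):
    """Mark an alert failed and schedule when it becomes claimable again."""
    alert.update(status="failed", last_error=error,
                 retry_eligible_at=now + retry_after_seconds)

def requeue_failed(alerts, now):
    """failed -> new once retry_after_seconds has elapsed.
    last_error is kept on the record for later inspection."""
    for a in alerts:
        if a["status"] == "failed" and now >= a["retry_eligible_at"]:
            a["status"] = "new"
```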
Operational dashboard
GET /v1/alerts/dashboard?window_minutes=240
# → counts by status, top signatures, latest alerts
GET /v1/incidents/open?service=gateway
# → open/mitigating incidents
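The dashboard's "counts by status, top signatures" aggregation amounts to something like the sketch below; the `dashboard` function and field names are illustrative assumptions, not the endpoint's actual code.

```python
from collections import Counter

def dashboard(alerts):
    """Aggregate alerts in the window: counts by status, top signatures."""
    return {
        "by_status": dict(Counter(a["status"] for a in alerts)),
        "top_signatures": Counter(
            a["incident_signature"] for a in alerts
        ).most_common(5),
    }
```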
Monitor health check
Verify Monitor is pushing alerts:
alert_ingest_tool.list(source="monitor@node1", window_minutes=60)
If the list is empty when alerts are expected, check the Monitor service and its entitlements.
SLO Watch Gate
Staging blocks on SLO breach
Config in config/release_gate_policy.yml:
staging:
  gates:
    slo_watch:
      mode: "strict"
To temporarily bypass (emergency deploy):
# In release_check input:
run_slo_watch: false
Document reason in incident timeline.
Tuning SLO thresholds
Edit config/slo_policy.yml:
services:
  gateway:
    latency_p95_ms: 300   # adjust
    error_rate_pct: 1.0
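A minimal sketch of how the `slo_watch` gate could compare measured metrics against these thresholds, assuming a simple "measured value above limit means breach" rule; the function `slo_breaches` and the hardcoded policy dict are assumptions for illustration.

```python
# Mirrors the structure of slo_policy.yml above (hardcoded for the sketch)
SLO_POLICY = {"gateway": {"latency_p95_ms": 300, "error_rate_pct": 1.0}}

def slo_breaches(service, measured):
    """Return the list of thresholds the measured metrics exceed.
    An empty list means the gate passes for this service."""
    limits = SLO_POLICY[service]
    return [name for name, limit in limits.items()
            if measured.get(name, 0) > limit]
```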
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Alert `accepted=false` | Validation failure (missing service/title, invalid kind) | Fix Monitor alert payload |
| `deduped=true` unexpectedly | Same fingerprint within TTL | Check Monitor fingerprint logic |
| `alert_to_incident` fails "not found" | Alert ref expired from MemoryStore | Switch to Postgres backend |
| Alerts stuck in `processing` | Loop died without acking | Run `claim` (it auto-requeues expired locks), or: `UPDATE alerts SET status='new', processing_lock_until=NULL WHERE status='processing' AND processing_lock_until < NOW()` |
| Alerts stuck in `failed` | Persistent processing errors | Check the `last_error` field: `SELECT alert_ref, last_error FROM alerts WHERE status='failed'` |
| Triage not running | Cooldown active | Check `incident_signature_state.last_triage_at`, or reduce `triage_cooldown_minutes` in policy |
| `claim` returns empty | All new alerts already locked | Check for stale processing: `SELECT COUNT(*) FROM alerts WHERE status='processing' AND processing_lock_until < NOW()` |
| SLO gate blocks in staging | SLO breach active | Fix the service or override with `run_slo_watch: false` |
| `tools.alerts.ingest` denied | Monitor agent missing entitlement | Check `config/rbac_tools_matrix.yml`, `agent_monitor` role |
| `tools.alerts.claim` denied | Agent missing `tools.alerts.claim` | Only agent_cto / agent_oncall / Supervisor can claim |
Retention
Alerts in Postgres: no TTL enforced by default — add a cron job if needed:
DELETE FROM alerts WHERE created_at < NOW() - INTERVAL '30 days';
Memory backend: cleared on process restart.
Production Mode: ALERT_BACKEND=postgres
⚠ Default is memory — do NOT use in production. Alerts are lost on router restart.
Setup (one-time, per environment)
1. Run migration:
python3 ops/scripts/migrate_alerts_postgres.py \
--dsn "postgresql://user:pass@host:5432/daarion"
# or dry-run:
python3 ops/scripts/migrate_alerts_postgres.py --dry-run
2. Set env vars (in .env, docker-compose, or systemd unit):
ALERT_BACKEND=postgres
ALERT_DATABASE_URL=postgresql://user:pass@host:5432/daarion
# Fallback: if ALERT_DATABASE_URL is unset, DATABASE_URL is used automatically
3. Restart router:
docker compose -f docker-compose.node1.yml restart router
# or node2:
docker compose -f docker-compose.node2-sofiia.yml restart router
4. Verify persistence (survive a restart):
# Ingest a test alert
curl -X POST http://router:8000/v1/tools/execute \
-H "Content-Type: application/json" \
-d '{"tool":"alert_ingest_tool","action":"ingest","service":"test","kind":"test","message":"persistence check"}'
# Restart router
docker compose restart router
# Confirm alert still visible after restart
curl "http://router:8000/v1/tools/execute" \
-d '{"tool":"alert_ingest_tool","action":"list","service":"test"}'
# Expect: alert still present → PASS
DSN resolution order
The `alert_store.py` factory resolves the DSN in this priority:
1. `ALERT_DATABASE_URL` (service-specific, recommended)
2. `DATABASE_URL` (shared Postgres, fallback)
3. Falls back to Memory with a WARNING log if neither is set.
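The resolution order can be sketched as below; `resolve_dsn` is an illustrative name, not the factory's real API, and the returned tuple shape is an assumption.

```python
import os

def resolve_dsn(env=None):
    """Resolve the alert store DSN: ALERT_DATABASE_URL wins over
    DATABASE_URL; if neither is set, fall back to the Memory backend
    (the real factory logs a WARNING in that case)."""
    env = env if env is not None else os.environ
    dsn = env.get("ALERT_DATABASE_URL") or env.get("DATABASE_URL")
    if dsn:
        return ("postgres", dsn)
    return ("memory", None)
```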
Compose Files Updated
| File | ALERT_BACKEND set? |
|---|---|
| `docker-compose.node1.yml` | ✅ postgres |
| `docker-compose.node2-sofiia.yml` | ✅ postgres |
| `docker-compose.staging.yml` | ✅ postgres |