# Runbook: Alert → Incident Bridge (State Machine + Cooldown)
## Topology
```
Monitor@node1/2 ──► alert_ingest_tool.ingest ──► AlertStore (Postgres or Memory)
                                                              │
Sofiia / oncall ──► oncall_tool.alert_to_incident ◄───────────┘
                              │
                    IncidentStore (Postgres)
                              │
          Sofiia NODA2: incident_triage_graph
                              │
                    postmortem_draft_graph
```
## Alert State Machine
```
new → processing → acked
           ↓
         failed → (retry after TTL) → new
```

| Status       | Meaning                                         |
|--------------|-------------------------------------------------|
| `new`        | Freshly ingested, not yet claimed               |
| `processing` | Claimed by a loop worker; locked for 10 min     |
| `acked`      | Successfully processed and closed               |
| `failed`     | Processing error; retry after `retry_after_sec` |
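The legal transitions above can be captured as a small lookup table. The following is an illustrative sketch only (the names `ALLOWED_TRANSITIONS` and `can_transition` are hypothetical, not taken from the actual tool code):

```python
# Hypothetical sketch of the alert state machine described above.
ALLOWED_TRANSITIONS = {
    "new": {"processing"},
    "processing": {"acked", "failed"},
    "failed": {"new"},   # requeued after retry_after_sec elapses
    "acked": set(),      # terminal state
}

def can_transition(current: str, target: str) -> bool:
    """Return True if the state machine permits current -> target."""
    return target in ALLOWED_TRANSITIONS.get(current, set())
```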
**Concurrency safety:** `claim` uses `SELECT FOR UPDATE SKIP LOCKED` (Postgres) or an in-process lock (Memory). Two concurrent loops cannot claim the same alert.
**Stale processing requeue:** `claim` automatically requeues alerts whose `processing_lock_until` has expired.
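The two guarantees above can be sketched as follows. The SQL shape and the in-memory store are assumptions for illustration; the actual `claim` implementation is not reproduced here:

```python
import threading
import time

# Assumed shape of the Postgres claim query (FOR UPDATE SKIP LOCKED prevents
# two loops claiming the same rows; the OR clause requeues stale locks).
CLAIM_SQL = """
UPDATE alerts
   SET status = 'processing',
       processing_lock_until = NOW() + make_interval(secs => %(lock_ttl)s)
 WHERE alert_ref IN (
       SELECT alert_ref FROM alerts
        WHERE status = 'new'
           OR (status = 'processing' AND processing_lock_until < NOW())
        LIMIT %(limit)s
        FOR UPDATE SKIP LOCKED)
RETURNING alert_ref;
"""

class MemoryClaims:
    """Memory-backend equivalent: an in-process lock makes claim atomic."""
    def __init__(self):
        self._mu = threading.Lock()
        self._alerts = {}  # alert_ref -> {"status": ..., "lock_until": ...}

    def ingest(self, ref):
        with self._mu:
            self._alerts[ref] = {"status": "new", "lock_until": 0.0}

    def claim(self, limit, lock_ttl_seconds=600):
        now = time.monotonic()
        claimed = []
        with self._mu:
            for ref, a in self._alerts.items():
                if len(claimed) >= limit:
                    break
                stale = a["status"] == "processing" and a["lock_until"] < now
                if a["status"] == "new" or stale:
                    a["status"] = "processing"
                    a["lock_until"] = now + lock_ttl_seconds
                    claimed.append(ref)
        return claimed
```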
---
## Triage Cooldown (per Signature)
After a triage runs for a given `incident_signature`, subsequent alerts with the same signature **within 15 min** (configurable via `triage_cooldown_minutes` in `alert_routing_policy.yml`) only get an `incident_append_event` note — no new triage run. This prevents triage storms.
```yaml
# config/alert_routing_policy.yml
defaults:
  triage_cooldown_minutes: 15
```
The state is persisted in the `incident_signature_state` table (Postgres) or in memory (fallback).
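The cooldown check reduces to a per-signature timestamp lookup. This is a minimal sketch of that logic, with hypothetical names (`should_run_triage`, `_last_triage_at`); the real state lives in `incident_signature_state`:

```python
import time
from typing import Dict, Optional

# Mirrors triage_cooldown_minutes in alert_routing_policy.yml
TRIAGE_COOLDOWN_MINUTES = 15

_last_triage_at: Dict[str, float] = {}  # incident_signature -> seconds

def should_run_triage(signature: str, now: Optional[float] = None) -> bool:
    """True if no triage ran for this signature within the cooldown window;
    records the run time when returning True."""
    now = time.monotonic() if now is None else now
    last = _last_triage_at.get(signature)
    if last is not None and now - last < TRIAGE_COOLDOWN_MINUTES * 60:
        return False  # within cooldown: caller only appends an incident event
    _last_triage_at[signature] = now
    return True
```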
---
## Startup Checklist
1. **Postgres DDL** (if `ALERT_BACKEND=postgres`):
   ```bash
   DATABASE_URL=postgresql://... python3 ops/scripts/migrate_alerts_postgres.py
   ```

   This is idempotent and safe to re-run; it adds the state-machine columns and the `incident_signature_state` table.
2. **Env vars on NODE1 (router)**:
   ```env
   ALERT_BACKEND=auto            # Postgres → Memory fallback
   DATABASE_URL=postgresql://...
   ```
3. **Monitor agent**: configure `source: monitor@node1`, use `alert_ingest_tool.ingest`.
## Operational Scenarios
### Alert storm protection
Alert deduplication prevents storms. If alerts are firing repeatedly:

1. Check the `occurrences` field: the same alert ref on repeated firings means dedupe is working.
2. Adjust `dedupe_ttl_minutes` per alert (default 30).
3. If many distinct fingerprints are creating new records, review the Monitor's fingerprint logic.
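The dedupe behavior can be sketched as a fingerprint map with a TTL. Class and function names here are illustrative assumptions; only the field names (`occurrences`, `dedupe_ttl_minutes`) come from the runbook:

```python
import hashlib

def fingerprint(service: str, kind: str, title: str) -> str:
    """Illustrative fingerprint: a stable hash over the identifying fields."""
    return hashlib.sha256(f"{service}|{kind}|{title}".encode()).hexdigest()[:16]

class DedupeWindow:
    """Hypothetical sketch of TTL-based alert dedupe."""
    def __init__(self, dedupe_ttl_minutes: int = 30):
        self.ttl = dedupe_ttl_minutes * 60
        self._seen = {}  # fingerprint -> {"ref": ..., "occurrences": ..., "ts": ...}

    def ingest(self, fp: str, now: float) -> dict:
        entry = self._seen.get(fp)
        if entry and now - entry["ts"] < self.ttl:
            entry["occurrences"] += 1  # same alert ref: dedupe is working
            return {"deduped": True, "ref": entry["ref"]}
        self._seen[fp] = {"ref": f"alrt_{fp[:8]}", "occurrences": 1, "ts": now}
        return {"deduped": False, "ref": self._seen[fp]["ref"]}
```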
### False positive alert
1. Call `alert_ingest_tool.ack` with `note="false positive"`.
2. No incident is created (or, if one was already created, close it via `oncall_tool.incident_close`).
### Alert → Incident conversion
```bash
# Sofiia or the oncall agent calls:
oncall_tool.alert_to_incident(
  alert_ref="alrt_...",
  incident_severity_cap="P1",
  dedupe_window_minutes=60
)
```
### View recent alerts (by status)
```bash
# Default: all statuses
alert_ingest_tool.list(window_minutes=240, env="prod")

# Only new/failed (unprocessed):
alert_ingest_tool.list(window_minutes=240, status_in=["new","failed"])
```
### Claim alerts for processing (Supervisor loop)
```bash
# Atomic claim: locks alerts for 10 min
alert_ingest_tool.claim(window_minutes=240, limit=25, owner="sofiia-supervisor", lock_ttl_seconds=600)
```
### Mark alert as failed (retry)
```bash
alert_ingest_tool.fail(alert_ref="alrt_...", error="gateway timeout", retry_after_seconds=300)
```
### Operational dashboard
```
GET /v1/alerts/dashboard?window_minutes=240
# → counts by status, top signatures, latest alerts
```

```
GET /v1/incidents/open?service=gateway
# → open/mitigating incidents
```
### Monitor health check
Verify Monitor is pushing alerts:

```bash
alert_ingest_tool.list(source="monitor@node1", window_minutes=60)
```
If the list is empty when alerts are expected, check the Monitor service and its entitlements.
## SLO Watch Gate
### Staging blocks on SLO breach
Config in `config/release_gate_policy.yml`:
```yaml
staging:
  gates:
    slo_watch:
      mode: "strict"
```
To temporarily bypass (emergency deploy):
```bash
# In release_check input:
run_slo_watch: false
```
Document the reason in the incident timeline.
### Tuning SLO thresholds
Edit `config/slo_policy.yml`:
```yaml
services:
  gateway:
    latency_p95_ms: 300   # adjust as needed
    error_rate_pct: 1.0
```
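A gate evaluating these thresholds is essentially a comparison against observed metrics. The following is a hypothetical sketch (function and constant names are illustrative, not the actual SLO watch code):

```python
# Illustrative copy of the thresholds from config/slo_policy.yml above.
SLO_POLICY = {
    "gateway": {"latency_p95_ms": 300, "error_rate_pct": 1.0},
}

def slo_breaches(service, latency_p95_ms, error_rate_pct):
    """Return the names of breached thresholds; an empty list means the gate passes."""
    thresholds = SLO_POLICY[service]
    breaches = []
    if latency_p95_ms > thresholds["latency_p95_ms"]:
        breaches.append("latency_p95_ms")
    if error_rate_pct > thresholds["error_rate_pct"]:
        breaches.append("error_rate_pct")
    return breaches
```

In strict mode, any non-empty result would block the staging release.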
## Troubleshooting
| Symptom | Cause | Fix |
|---------|-------|-----|
| Alert `accepted=false` | Validation failure (missing service/title, invalid kind) | Fix the Monitor alert payload |
| `deduped=true` unexpectedly | Same fingerprint within TTL | Check Monitor fingerprint logic |
| `alert_to_incident` fails with "not found" | Alert ref expired from MemoryStore | Switch to the Postgres backend |
| Alerts stuck in `processing` | Loop died without acking | Run `claim` (it auto-requeues expired locks), or manually: `UPDATE alerts SET status='new', processing_lock_until=NULL WHERE status='processing' AND processing_lock_until < NOW()` |
| Alerts stuck in `failed` | Persistent processing errors | Check the `last_error` field: `SELECT alert_ref, last_error FROM alerts WHERE status='failed'` |
| Triage not running | Cooldown active | Check `incident_signature_state.last_triage_at`, or reduce `triage_cooldown_minutes` in the policy |
| `claim` returns empty | All new alerts already locked | Check for stale processing: `SELECT COUNT(*) FROM alerts WHERE status='processing' AND processing_lock_until < NOW()` |
| SLO gate blocks in staging | SLO breach active | Fix the service or override with `run_slo_watch: false` |
| `tools.alerts.ingest` denied | Monitor agent missing entitlement | Check the `agent_monitor` role in `config/rbac_tools_matrix.yml` |
| `tools.alerts.claim` denied | Agent missing `tools.alerts.claim` | Only `agent_cto` / `agent_oncall` / Supervisor can claim |
## Retention
Alerts in Postgres have no TTL enforced by default; add a cron job if needed:

```sql
DELETE FROM alerts WHERE created_at < NOW() - INTERVAL '30 days';
```

Memory backend: cleared on process restart.
---
## Production Mode: ALERT_BACKEND=postgres
**⚠ The default backend is `memory`; do NOT use it in production.** Alerts are lost on router restart.
### Setup (one-time, per environment)
**1. Run migration:**
```bash
python3 ops/scripts/migrate_alerts_postgres.py \
  --dsn "postgresql://user:pass@host:5432/daarion"

# or dry-run:
python3 ops/scripts/migrate_alerts_postgres.py --dry-run
```
**2. Set env vars** (in `.env`, docker-compose, or systemd unit):
```bash
ALERT_BACKEND=postgres
ALERT_DATABASE_URL=postgresql://user:pass@host:5432/daarion
# Fallback: if ALERT_DATABASE_URL is unset, DATABASE_URL is used automatically
```
**3. Restart router:**
```bash
docker compose -f docker-compose.node1.yml restart router
# or node2:
docker compose -f docker-compose.node2-sofiia.yml restart router
```
**4. Verify persistence** (survive a restart):
```bash
# Ingest a test alert
curl -X POST http://router:8000/v1/tools/execute \
  -H "Content-Type: application/json" \
  -d '{"tool":"alert_ingest_tool","action":"ingest","service":"test","kind":"test","message":"persistence check"}'

# Restart router
docker compose restart router

# Confirm the alert is still visible after the restart
curl -X POST http://router:8000/v1/tools/execute \
  -H "Content-Type: application/json" \
  -d '{"tool":"alert_ingest_tool","action":"list","service":"test"}'
# Expect: alert still present → PASS
```
### DSN resolution order
The `alert_store.py` factory resolves the DSN in this priority order:

1. `ALERT_DATABASE_URL` (service-specific, recommended)
2. `DATABASE_URL` (shared Postgres, fallback)
3. If neither is set, it falls back to memory with a WARNING log.
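That resolution order can be sketched in a few lines. This is a hypothetical illustration (the function name and signature are assumptions, not the actual `alert_store.py` factory):

```python
import logging
import os

def resolve_alert_dsn(env=None):
    """Resolve the alerts DSN: ALERT_DATABASE_URL first, then DATABASE_URL.
    Returns None when neither is set, in which case the caller falls back
    to the memory backend and logs a WARNING."""
    env = os.environ if env is None else env
    dsn = env.get("ALERT_DATABASE_URL") or env.get("DATABASE_URL")
    if not dsn:
        logging.warning("No Postgres DSN set; falling back to memory alert store")
    return dsn
```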
### Compose files updated

| File | `ALERT_BACKEND` set? |
|------|----------------------|
| `docker-compose.node1.yml` | ✅ `postgres` |
| `docker-compose.node2-sofiia.yml` | ✅ `postgres` |
| `docker-compose.staging.yml` | ✅ `postgres` |