docs(platform): add policy configs, runbooks, ops scripts and platform documentation
Config policies (16 files): alert_routing, architecture_pressure, backlog, cost_weights, data_governance, incident_escalation, incident_intelligence, network_allowlist, nodes_registry, observability_sources, rbac_tools_matrix, release_gate, risk_attribution, risk_policy, slo_policy, tool_limits, tools_rollout

Ops (22 files): Caddyfile, calendar compose, grafana voice dashboard, deployments/incidents logs, runbooks for alerts/audit/backlog/incidents/sofiia/voice, cron jobs, scripts (alert_triage, audit_cleanup, migrate_*, governance, schedule), task_registry, voice alerts/ha/latency/policy

Docs (30+ files): HUMANIZED_STEPAN v2.7-v3 changelogs and runbooks, NODA1/NODA2 status and setup, audit index and traces, backlog, incident, supervisor, tools, voice, opencode, release, risk, aistalk, spacebot

Made-with: Cursor
ops/runbook-alerts.md (new file, 247 lines)
# Runbook: Alert → Incident Bridge (State Machine + Cooldown)
## Topology

```
Monitor@node1/2 ──► alert_ingest_tool.ingest ──► AlertStore (Postgres or Memory)
                                                        │
Sofiia / oncall ──► oncall_tool.alert_to_incident ◄─────┘
                              │
IncidentStore (Postgres) ◄────┘
            │
Sofiia NODA2: incident_triage_graph
            │
     postmortem_draft_graph
```
## Alert State Machine

```
new → processing → acked
          ↓
       failed → (retry after TTL) → new
```

| Status       | Meaning                                           |
|--------------|---------------------------------------------------|
| `new`        | Freshly ingested, not yet claimed                 |
| `processing` | Claimed by a loop worker; locked for 10 min       |
| `acked`      | Successfully processed and closed                 |
| `failed`     | Processing error; retry after `retry_after_sec`   |

**Concurrency safety:** `claim` uses `SELECT FOR UPDATE SKIP LOCKED` (Postgres) or an in-process lock (Memory). Two concurrent loops cannot claim the same alert.

**Stale processing requeue:** `claim` automatically requeues alerts whose `processing_lock_until` has expired.
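The claim semantics above can be sketched in a few lines. This is an illustrative model of the Memory backend only: the record fields (`status`, `processing_lock_until`) mirror the runbook, but the function itself, and the assumption that `failed` alerts whose retry window has passed are claimable, are not the real store code.

```python
import time

# Illustrative sketch of Memory-backend claim semantics; not the real store.
# A real Memory backend would guard this function with an in-process lock
# (e.g. threading.Lock) for the concurrency safety described above.
def claim(alerts, owner, lock_ttl_seconds=600, limit=25, now=None):
    """Claim up to `limit` alerts: `new` ones, `failed` ones whose retry
    window passed, and `processing` ones whose lock expired (stale requeue)."""
    now = time.time() if now is None else now
    claimed = []
    for alert in alerts:
        if len(claimed) >= limit:
            break
        stale = alert["status"] == "processing" and alert["processing_lock_until"] < now
        retry_due = alert["status"] == "failed" and alert.get("retry_after", 0) <= now
        if alert["status"] == "new" or stale or retry_due:
            alert["status"] = "processing"
            alert["owner"] = owner
            alert["processing_lock_until"] = now + lock_ttl_seconds
            claimed.append(alert)
    return claimed
```

A second concurrent loop calling `claim` on the same records gets an empty batch, because everything it sees is already `processing` with an unexpired lock.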
---

## Triage Cooldown (per Signature)

After a triage runs for a given `incident_signature`, subsequent alerts with the same signature **within 15 min** (configurable via `triage_cooldown_minutes` in `alert_routing_policy.yml`) only get an `incident_append_event` note; no new triage run starts. This prevents triage storms.

```yaml
# config/alert_routing_policy.yml
defaults:
  triage_cooldown_minutes: 15
```

The state is persisted in the `incident_signature_state` table (Postgres) or in memory (fallback).

---
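The cooldown decision reduces to a timestamp comparison against the persisted `last_triage_at`. A minimal sketch, assuming a hypothetical `should_run_triage` helper (the column name follows the runbook, the function does not exist in the codebase):

```python
from datetime import datetime, timedelta

# Hypothetical helper modelling the per-signature cooldown decision.
def should_run_triage(last_triage_at, now, cooldown_minutes=15):
    """True → run a full triage; False → inside the cooldown window,
    so only append an incident_append_event note."""
    if last_triage_at is None:
        return True  # first alert ever seen for this signature
    return now - last_triage_at >= timedelta(minutes=cooldown_minutes)
```
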
## Startup Checklist

1. **Postgres DDL** (if `ALERT_BACKEND=postgres`):

   ```bash
   DATABASE_URL=postgresql://... python3 ops/scripts/migrate_alerts_postgres.py
   ```

   This is idempotent and safe to re-run; it adds the state machine columns and the `incident_signature_state` table.

2. **Env vars on NODE1 (router)**:

   ```env
   ALERT_BACKEND=auto   # Postgres → Memory fallback
   DATABASE_URL=postgresql://...
   ```

3. **Monitor agent**: configure `source: monitor@node1` and use `alert_ingest_tool.ingest`.
## Operational Scenarios

### Alert storm protection

Alert deduplication prevents storms. If alerts fire repeatedly:

1. Check the `occurrences` field — the same alert ref means dedupe is working.
2. Adjust `dedupe_ttl_minutes` per alert (default 30).
3. If many different fingerprints create new records, review the Monitor fingerprint logic.

### False positive alert

1. `alert_ingest_tool.ack` with `note="false positive"`.
2. No incident is created (or close an already-created incident via `oncall_tool.incident_close`).
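The dedupe behaviour the storm-protection steps rely on can be sketched as a fingerprint index with a TTL. The field names (`occurrences`, `dedupe_ttl_minutes`) follow the runbook; the class itself is illustrative, not the real AlertStore:

```python
import time

# Illustrative fingerprint-dedupe index; not the real AlertStore.
class DedupeIndex:
    def __init__(self, dedupe_ttl_minutes=30):
        self.ttl = dedupe_ttl_minutes * 60
        self.seen = {}  # fingerprint -> (alert_ref, last_seen, occurrences)

    def ingest(self, fingerprint, alert_ref, now=None):
        """Return (alert_ref, deduped). Within the TTL the existing ref is
        reused and its occurrences counter bumped; otherwise a new record starts."""
        now = time.time() if now is None else now
        hit = self.seen.get(fingerprint)
        if hit and now - hit[1] < self.ttl:
            ref, _, n = hit
            self.seen[fingerprint] = (ref, now, n + 1)
            return ref, True
        self.seen[fingerprint] = (alert_ref, now, 1)
        return alert_ref, False
```

Repeated ingests with the same fingerprint inside the window return the same alert ref with `deduped=True`, which is exactly the check in step 1 above.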
### Alert → Incident conversion

```
# Sofiia or oncall agent calls:
oncall_tool.alert_to_incident(
    alert_ref="alrt_...",
    incident_severity_cap="P1",
    dedupe_window_minutes=60
)
```
### View recent alerts (by status)

```
# Default: all statuses
alert_ingest_tool.list(window_minutes=240, env="prod")

# Only new/failed (unprocessed):
alert_ingest_tool.list(window_minutes=240, status_in=["new","failed"])
```
### Claim alerts for processing (Supervisor loop)

```
# Atomic claim — locks alerts for 10 min
alert_ingest_tool.claim(window_minutes=240, limit=25, owner="sofiia-supervisor", lock_ttl_seconds=600)
```

### Mark alert as failed (retry)

```
alert_ingest_tool.fail(alert_ref="alrt_...", error="gateway timeout", retry_after_seconds=300)
```
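Put together, a Supervisor-style loop over `claim`, `ack`, and `fail` might look like the sketch below. Both `tool` (standing in for the real `alert_ingest_tool` client) and `handle` (the per-alert triage step) are hypothetical interfaces, not code from this repo:

```python
# Hypothetical Supervisor loop over claim/ack/fail; `tool` and `handle`
# are stand-ins, not real interfaces from this repo.
def process_batch(tool, handle, owner="sofiia-supervisor"):
    """Claim a batch, process each alert, ack on success, fail with retry on error."""
    claimed = tool.claim(window_minutes=240, limit=25, owner=owner, lock_ttl_seconds=600)
    results = []
    for alert in claimed:
        try:
            handle(alert)
            tool.ack(alert_ref=alert["alert_ref"])
            results.append((alert["alert_ref"], "acked"))
        except Exception as exc:
            # Failed alerts re-enter the state machine after retry_after_seconds
            tool.fail(alert_ref=alert["alert_ref"], error=str(exc), retry_after_seconds=300)
            results.append((alert["alert_ref"], "failed"))
    return results
```
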
### Operational dashboard

```
GET /v1/alerts/dashboard?window_minutes=240
# → counts by status, top signatures, latest alerts
```

```
GET /v1/incidents/open?service=gateway
# → open/mitigating incidents
```
### Monitor health check

Verify Monitor is pushing alerts:

```
alert_ingest_tool.list(source="monitor@node1", window_minutes=60)
```

If the list is empty but alerts are expected, check the Monitor service and its entitlements.
## SLO Watch Gate

### Staging blocks on SLO breach

Config in `config/release_gate_policy.yml`:

```yaml
staging:
  gates:
    slo_watch:
      mode: "strict"
```

To temporarily bypass (emergency deploy):

```yaml
# In release_check input:
run_slo_watch: false
```

Document the reason in the incident timeline.
### Tuning SLO thresholds

Edit `config/slo_policy.yml`:

```yaml
services:
  gateway:
    latency_p95_ms: 300   # adjust per service
    error_rate_pct: 1.0
```
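A breach check against these thresholds is essentially a p95 and error-rate comparison. The sketch below assumes a hypothetical `slo_breaches` helper; the policy keys mirror `slo_policy.yml`, but the real SLO watch implementation is not shown in this runbook:

```python
import math

# Hypothetical breach check; policy keys mirror config/slo_policy.yml.
def slo_breaches(policy, latencies_ms, errors, total):
    """Compare observed p95 latency and error rate against a service policy;
    return a list of human-readable breach descriptions (empty = healthy)."""
    breaches = []
    ranked = sorted(latencies_ms)
    # nearest-rank p95: the sample at ceil(0.95 * n), 1-indexed
    p95 = ranked[min(len(ranked) - 1, math.ceil(0.95 * len(ranked)) - 1)]
    if p95 > policy["latency_p95_ms"]:
        breaches.append(f"latency_p95_ms: {p95} > {policy['latency_p95_ms']}")
    error_rate = 100.0 * errors / total if total else 0.0
    if error_rate > policy["error_rate_pct"]:
        breaches.append(f"error_rate_pct: {error_rate:.2f} > {policy['error_rate_pct']}")
    return breaches
```
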
## Troubleshooting

| Symptom | Cause | Fix |
|---------|-------|-----|
| Alert `accepted=false` | Validation failure (missing service/title, invalid kind) | Fix the Monitor alert payload |
| `deduped=true` unexpectedly | Same fingerprint within TTL | Check the Monitor fingerprint logic |
| `alert_to_incident` fails with "not found" | Alert ref expired from the MemoryStore | Switch to the Postgres backend |
| Alerts stuck in `processing` | Loop died without acking | Run `claim` — it auto-requeues expired locks. Or: `UPDATE alerts SET status='new', processing_lock_until=NULL WHERE status='processing' AND processing_lock_until < NOW()` |
| Alerts stuck in `failed` | Persistent processing errors | Check the `last_error` field: `SELECT alert_ref, last_error FROM alerts WHERE status='failed'` |
| Triage not running | Cooldown active | Check `incident_signature_state.last_triage_at`, or reduce `triage_cooldown_minutes` in the policy |
| `claim` returns empty | All new alerts already locked | Check for stale processing: `SELECT COUNT(*) FROM alerts WHERE status='processing' AND processing_lock_until < NOW()` |
| SLO gate blocks in staging | SLO breach active | Fix the service or override with `run_slo_watch: false` |
| `tools.alerts.ingest` denied | Monitor agent missing entitlement | Check the `agent_monitor` role in `config/rbac_tools_matrix.yml` |
| `tools.alerts.claim` denied | Agent missing `tools.alerts.claim` | Only `agent_cto` / `agent_oncall` / Supervisor can claim |
## Retention

Alerts in Postgres have no TTL enforced by default; add a cron job if needed:

```sql
DELETE FROM alerts WHERE created_at < NOW() - INTERVAL '30 days';
```

The Memory backend is cleared on process restart.
---

## Production Mode: ALERT_BACKEND=postgres

**⚠ The default is `memory` — do NOT use it in production.** Alerts are lost on router restart.

### Setup (one-time, per environment)
**1. Run migration:**

```bash
python3 ops/scripts/migrate_alerts_postgres.py \
  --dsn "postgresql://user:pass@host:5432/daarion"
# or dry-run:
python3 ops/scripts/migrate_alerts_postgres.py --dry-run
```
**2. Set env vars** (in `.env`, docker-compose, or a systemd unit):

```bash
ALERT_BACKEND=postgres
ALERT_DATABASE_URL=postgresql://user:pass@host:5432/daarion
# Fallback: if ALERT_DATABASE_URL is unset, DATABASE_URL is used automatically
```
**3. Restart router:**

```bash
docker compose -f docker-compose.node1.yml restart router
# or node2:
docker compose -f docker-compose.node2-sofiia.yml restart router
```
**4. Verify persistence** (survives a restart):

```bash
# Ingest a test alert
curl -X POST http://router:8000/v1/tools/execute \
  -H "Content-Type: application/json" \
  -d '{"tool":"alert_ingest_tool","action":"ingest","service":"test","kind":"test","message":"persistence check"}'

# Restart router
docker compose restart router

# Confirm the alert is still visible after the restart
curl -X POST http://router:8000/v1/tools/execute \
  -H "Content-Type: application/json" \
  -d '{"tool":"alert_ingest_tool","action":"list","service":"test"}'
# Expect: alert still present → PASS
```
### DSN resolution order

The `alert_store.py` factory resolves the DSN in this priority:

1. `ALERT_DATABASE_URL` (service-specific, recommended)
2. `DATABASE_URL` (shared Postgres, fallback)
3. Memory backend, with a WARNING log, if neither is set.
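The resolution order above can be sketched as a small function. `resolve_dsn` is illustrative only; the real factory lives in `alert_store.py` and is not reproduced here:

```python
import logging

# Illustrative sketch of the DSN resolution order; not the real factory.
def resolve_dsn(env):
    """Return ('postgres', dsn) or ('memory', None) per the priority list above."""
    dsn = env.get("ALERT_DATABASE_URL") or env.get("DATABASE_URL")
    if dsn:
        return ("postgres", dsn)
    logging.warning("No ALERT_DATABASE_URL/DATABASE_URL set; falling back to memory backend")
    return ("memory", None)
```
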
### Compose files updated

| File | `ALERT_BACKEND` set? |
|------|----------------------|
| `docker-compose.node1.yml` | ✅ `postgres` |
| `docker-compose.node2-sofiia.yml` | ✅ `postgres` |
| `docker-compose.staging.yml` | ✅ `postgres` |