Runbook: Alert → Incident Bridge (State Machine + Cooldown)
Topology
Monitor@node1/2 ──► alert_ingest_tool.ingest ──► AlertStore (Postgres or Memory)
                                                        │
Sofiia / oncall ──► oncall_tool.alert_to_incident ◄─────┘
                              │
                              ▼
                    IncidentStore (Postgres)
                              │
                              ▼
                    Sofiia NODA2: incident_triage_graph
                              │
                              ▼
                    postmortem_draft_graph
Alert State Machine
new → processing → acked
          ↓
        failed → (retry after TTL) → new
| Status | Meaning |
|---|---|
| `new` | Freshly ingested, not yet claimed |
| `processing` | Claimed by a loop worker; locked for 10 min |
| `acked` | Successfully processed and closed |
| `failed` | Processing error; retry after `retry_after_sec` |
Concurrency safety: claim uses SELECT FOR UPDATE SKIP LOCKED (Postgres) or an in-process lock (Memory). Two concurrent loops cannot claim the same alert.
Stale processing requeue: claim automatically requeues alerts whose processing_lock_until has expired.
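The claim semantics above (atomic claim, lock TTL, stale-lock requeue) can be sketched for the Memory backend roughly as follows. This is an illustrative model, not the real store: the class name `MemoryAlertStore` and its fields mirror the runbook's vocabulary (`status`, `processing_lock_until`, `owner`), not the actual implementation.

```python
import time

class MemoryAlertStore:
    """Hypothetical in-memory sketch of claim + stale-lock requeue."""

    def __init__(self):
        self.alerts = {}  # alert_ref -> dict

    def ingest(self, alert_ref):
        self.alerts[alert_ref] = {
            "status": "new", "processing_lock_until": None, "owner": None,
        }

    def claim(self, owner, limit=25, lock_ttl_seconds=600, now=None):
        """Claim alerts that are 'new', or 'processing' with an expired lock."""
        now = now if now is not None else time.time()
        claimed = []
        for ref, a in self.alerts.items():
            if len(claimed) >= limit:
                break
            # Expired lock on a 'processing' alert means the worker died:
            # the alert is automatically requeued to this claimer.
            stale = (a["status"] == "processing"
                     and a["processing_lock_until"] is not None
                     and a["processing_lock_until"] < now)
            if a["status"] == "new" or stale:
                a["status"] = "processing"
                a["owner"] = owner
                a["processing_lock_until"] = now + lock_ttl_seconds
                claimed.append(ref)
        return claimed
```

With the Postgres backend, the same effect is achieved in one statement via `SELECT ... FOR UPDATE SKIP LOCKED`, so two concurrent loops never see the same row.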
Triage Cooldown (per Signature)
After a triage runs for a given incident_signature, subsequent alerts with the same signature within 15 min (configurable via triage_cooldown_minutes in alert_routing_policy.yml) only get an incident_append_event note — no new triage run. This prevents triage storms.
# config/alert_routing_policy.yml
defaults:
  triage_cooldown_minutes: 15
The state is persisted in the `incident_signature_state` table (Postgres) or in memory (fallback).
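The cooldown decision can be sketched as a small pure function. This is a hypothetical model: the function name `on_alert` and the dict standing in for `incident_signature_state` are illustrative.

```python
from datetime import datetime, timedelta

# triage_cooldown_minutes from alert_routing_policy.yml (default 15)
TRIAGE_COOLDOWN = timedelta(minutes=15)

last_triage_at = {}  # incident_signature -> datetime of last triage run

def on_alert(signature, now):
    """Decide the action for an alert: run triage, or only append a note."""
    last = last_triage_at.get(signature)
    if last is not None and now - last < TRIAGE_COOLDOWN:
        return "incident_append_event"  # cooldown active: note only
    last_triage_at[signature] = now     # record this triage run
    return "run_triage"
```

Note that the no-triage branch does not reset the cooldown clock: repeated alerts during a storm cannot postpone the next triage indefinitely.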
Startup Checklist
- Postgres DDL (if `ALERT_BACKEND=postgres`):

  DATABASE_URL=postgresql://... python3 ops/scripts/migrate_alerts_postgres.py

  This is idempotent; safe to re-run. Adds the state machine columns and the `incident_signature_state` table.
- Env vars on NODE1 (router):

  ALERT_BACKEND=auto   # Postgres → Memory fallback
  DATABASE_URL=postgresql://...

- Monitor agent: configure `source: monitor@node1`, use `alert_ingest_tool.ingest`.
Operational Scenarios
Alert storm protection
Alert deduplication prevents storms. If alerts are firing repeatedly:
- Check the `occurrences` field; the same alert ref means dedupe is working
- Adjust `dedupe_ttl_minutes` per alert (default 30)
- If many different fingerprints create new records, review Monitor fingerprint logic
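The dedupe behavior described above can be sketched as follows. This is an assumption-laden model: the `ingest` function, the `seen` dict, and the record fields are illustrative stand-ins for the real AlertStore.

```python
DEDUPE_TTL_MIN = 30  # dedupe_ttl_minutes default

seen = {}  # fingerprint -> {"alert_ref", "last_seen", "occurrences"}

def ingest(fingerprint, now):
    """Same fingerprint within the TTL bumps occurrences on the existing
    record and returns the same alert_ref, instead of creating a new one."""
    rec = seen.get(fingerprint)
    if rec and now - rec["last_seen"] < DEDUPE_TTL_MIN * 60:
        rec["occurrences"] += 1
        rec["last_seen"] = now
        return {"alert_ref": rec["alert_ref"], "deduped": True}
    seen[fingerprint] = {
        "alert_ref": f"alrt_{len(seen) + 1}", "last_seen": now, "occurrences": 1,
    }
    return {"alert_ref": seen[fingerprint]["alert_ref"], "deduped": False}
```

This is why a storm of identical alerts yields one record with a growing `occurrences` count, while a Monitor that fingerprints too finely floods the store with distinct records.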
False positive alert
- `alert_ingest_tool.ack` with `note="false positive"`
- No incident is created (or close an already-created incident via `oncall_tool.incident_close`)
Alert → Incident conversion
# Sofiia or oncall agent calls:
oncall_tool.alert_to_incident(
    alert_ref="alrt_...",
    incident_severity_cap="P1",
    dedupe_window_minutes=60
)
View recent alerts (by status)
# Default: all statuses
alert_ingest_tool.list(window_minutes=240, env="prod")
# Only new/failed (unprocessed):
alert_ingest_tool.list(window_minutes=240, status_in=["new","failed"])
Claim alerts for processing (Supervisor loop)
# Atomic claim — locks alerts for 10 min
alert_ingest_tool.claim(window_minutes=240, limit=25, owner="sofiia-supervisor", lock_ttl_seconds=600)
Mark alert as failed (retry)
alert_ingest_tool.fail(alert_ref="alrt_...", error="gateway timeout", retry_after_seconds=300)
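The `failed → (retry after TTL) → new` transition from the state machine can be sketched as below. The helper names `fail` and `requeue_failed` and the field `retry_eligible_at` are hypothetical; they model the behavior of `retry_after_seconds`, not the real API.

```python
def fail(alert, error, retry_after_seconds, now):
    """Mark an alert failed and schedule when it becomes claimable again."""
    alert.update(status="failed", last_error=error,
                 retry_eligible_at=now + retry_after_seconds)

def requeue_failed(alerts, now):
    """failed -> new once retry_after_seconds has elapsed.
    last_error is kept on the record for later inspection."""
    for a in alerts:
        if a["status"] == "failed" and now >= a["retry_eligible_at"]:
            a["status"] = "new"
```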
Operational dashboard
GET /v1/alerts/dashboard?window_minutes=240
# → counts by status, top signatures, latest alerts
GET /v1/incidents/open?service=gateway
# → open/mitigating incidents
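The dashboard's "counts by status, top signatures" aggregation amounts to something like the sketch below; the `dashboard` function and field names are illustrative assumptions, not the endpoint's actual code.

```python
from collections import Counter

def dashboard(alerts):
    """Aggregate alerts in the window: counts by status, top signatures."""
    return {
        "by_status": dict(Counter(a["status"] for a in alerts)),
        "top_signatures": Counter(
            a["incident_signature"] for a in alerts
        ).most_common(5),
    }
```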
Monitor health check
Verify Monitor is pushing alerts:
alert_ingest_tool.list(source="monitor@node1", window_minutes=60)
If the list is empty when alerts are expected, check the Monitor service and its entitlements.
SLO Watch Gate
Staging blocks on SLO breach
Config in config/release_gate_policy.yml:
staging:
  gates:
    slo_watch:
      mode: "strict"
To temporarily bypass (emergency deploy):
# In release_check input:
run_slo_watch: false
Document reason in incident timeline.
Tuning SLO thresholds
Edit config/slo_policy.yml:
services:
  gateway:
    latency_p95_ms: 300   # adjust
    error_rate_pct: 1.0
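A minimal sketch of how the `slo_watch` gate could compare measured metrics against these thresholds, assuming a simple "measured value above limit means breach" rule; the function `slo_breaches` and the hardcoded policy dict are assumptions for illustration.

```python
# Mirrors the structure of slo_policy.yml above (hardcoded for the sketch)
SLO_POLICY = {"gateway": {"latency_p95_ms": 300, "error_rate_pct": 1.0}}

def slo_breaches(service, measured):
    """Return the list of thresholds the measured metrics exceed.
    An empty list means the gate passes for this service."""
    limits = SLO_POLICY[service]
    return [name for name, limit in limits.items()
            if measured.get(name, 0) > limit]
```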
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Alert `accepted=false` | Validation failure (missing service/title, invalid kind) | Fix Monitor alert payload |
| `deduped=true` unexpectedly | Same fingerprint within TTL | Check Monitor fingerprint logic |
| `alert_to_incident` fails "not found" | Alert ref expired from MemoryStore | Switch to Postgres backend |
| Alerts stuck in `processing` | Loop died without acking | Run `claim` (it auto-requeues expired locks), or: `UPDATE alerts SET status='new', processing_lock_until=NULL WHERE status='processing' AND processing_lock_until < NOW()` |
| Alerts stuck in `failed` | Persistent processing errors | Check the `last_error` field: `SELECT alert_ref, last_error FROM alerts WHERE status='failed'` |
| Triage not running | Cooldown active | Check `incident_signature_state.last_triage_at`, or reduce `triage_cooldown_minutes` in policy |
| `claim` returns empty | All new alerts already locked | Check for stale processing: `SELECT COUNT(*) FROM alerts WHERE status='processing' AND processing_lock_until < NOW()` |
| SLO gate blocks in staging | SLO breach active | Fix the service or override with `run_slo_watch: false` |
| `tools.alerts.ingest` denied | Monitor agent missing entitlement | Check `config/rbac_tools_matrix.yml`, `agent_monitor` role |
| `tools.alerts.claim` denied | Agent missing `tools.alerts.claim` | Only agent_cto / agent_oncall / Supervisor can claim |
Retention
Alerts in Postgres: no TTL enforced by default — add a cron job if needed:
DELETE FROM alerts WHERE created_at < NOW() - INTERVAL '30 days';
Memory backend: cleared on process restart.
Production Mode: ALERT_BACKEND=postgres
⚠ Default is memory — do NOT use in production. Alerts are lost on router restart.
Setup (one-time, per environment)
1. Run migration:
python3 ops/scripts/migrate_alerts_postgres.py \
--dsn "postgresql://user:pass@host:5432/daarion"
# or dry-run:
python3 ops/scripts/migrate_alerts_postgres.py --dry-run
2. Set env vars (in .env, docker-compose, or systemd unit):
ALERT_BACKEND=postgres
ALERT_DATABASE_URL=postgresql://user:pass@host:5432/daarion
# Fallback: if ALERT_DATABASE_URL is unset, DATABASE_URL is used automatically
3. Restart router:
docker compose -f docker-compose.node1.yml restart router
# or node2:
docker compose -f docker-compose.node2-sofiia.yml restart router
4. Verify persistence (survive a restart):
# Ingest a test alert
curl -X POST http://router:8000/v1/tools/execute \
-H "Content-Type: application/json" \
-d '{"tool":"alert_ingest_tool","action":"ingest","service":"test","kind":"test","message":"persistence check"}'
# Restart router
docker compose restart router
# Confirm alert still visible after restart
curl "http://router:8000/v1/tools/execute" \
-d '{"tool":"alert_ingest_tool","action":"list","service":"test"}'
# Expect: alert still present → PASS
DSN resolution order
The `alert_store.py` factory resolves the DSN in this priority:
1. `ALERT_DATABASE_URL` (service-specific, recommended)
2. `DATABASE_URL` (shared Postgres, fallback)
3. Falls back to Memory with a WARNING log if neither is set.
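The resolution order can be sketched as below; `resolve_dsn` is an illustrative name, not the factory's real API, and the returned tuple shape is an assumption.

```python
import os

def resolve_dsn(env=None):
    """Resolve the alert store DSN: ALERT_DATABASE_URL wins over
    DATABASE_URL; if neither is set, fall back to the Memory backend
    (the real factory logs a WARNING in that case)."""
    env = env if env is not None else os.environ
    dsn = env.get("ALERT_DATABASE_URL") or env.get("DATABASE_URL")
    if dsn:
        return ("postgres", dsn)
    return ("memory", None)
```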
Compose Files Updated
| File | ALERT_BACKEND set? |
|---|---|
| `docker-compose.node1.yml` | ✅ postgres |
| `docker-compose.node2-sofiia.yml` | ✅ postgres |
| `docker-compose.staging.yml` | ✅ postgres |