
Apple 67225a39fa docs(platform): add policy configs, runbooks, ops scripts and platform documentation

Config policies (16 files): alert_routing, architecture_pressure, backlog,
cost_weights, data_governance, incident_escalation, incident_intelligence,
network_allowlist, nodes_registry, observability_sources, rbac_tools_matrix,
release_gate, risk_attribution, risk_policy, slo_policy, tool_limits, tools_rollout

Ops (22 files): Caddyfile, calendar compose, grafana voice dashboard,
deployments/incidents logs, runbooks for alerts/audit/backlog/incidents/sofiia/voice,
cron jobs, scripts (alert_triage, audit_cleanup, migrate_*, governance, schedule),
task_registry, voice alerts/ha/latency/policy

Docs (30+ files): HUMANIZED_STEPAN v2.7-v3 changelogs and runbooks,
NODA1/NODA2 status and setup, audit index and traces, backlog, incident,
supervisor, tools, voice, opencode, release, risk, aistalk, spacebot

Made-with: Cursor

2026-03-03 07:14:53 -08:00

4.2 KiB


Alert → Incident Bridge

Overview

The Alert Bridge provides a governed, deduplicated pipeline from Monitor/Prometheus detection to Incident creation.

Security model: Monitor only sends alerts (tools.alerts.ingest). Sofiia/oncall create incidents (tools.oncall.incident_write + tools.alerts.ack). No agent is granted both roles automatically; in the RBAC table, only agent_cto holds both.

Monitor@nodeX ──ingest──► AlertStore ──alert_to_incident──► IncidentStore
      (tools.alerts.ingest)             (tools.oncall.incident_write)
                                                 │
                                         IncidentTriage (Sofiia NODA2)
                                                 │
                                         PostmortemDraft

AlertEvent Schema

{
  "source": "monitor@node1",
  "service": "gateway",
  "env": "prod",
  "severity": "P1",
  "kind": "slo_breach",
  "title": "gateway SLO: latency p95 > 300ms",
  "summary": "p95 latency at 450ms, error_rate 2.5%",
  "started_at": "2025-01-23T09:00:00Z",
  "labels": {
    "node": "node1",
    "fingerprint": "gateway:slo_breach:latency"
  },
  "metrics": {
    "latency_p95_ms": 450,
    "error_rate_pct": 2.5
  },
  "evidence": {
    "log_samples": ["ERROR timeout after 30s", "WARN retry 3/3"],
    "query": "rate(http_errors_total[5m])"
  }
}

Severity values

P0, P1, P2, P3, INFO

Kind values

slo_breach, crashloop, latency, error_rate, disk, oom, deploy, security, custom
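The enumerations above lend themselves to a small pre-ingest check. The following is a minimal sketch, not the tool's actual validator: `validate_alert` and `REQUIRED_FIELDS` are hypothetical names, and which fields are mandatory is an assumption, since the schema example does not say.

```python
from typing import Dict, List

# Allowed values copied from the lists above.
SEVERITIES = {"P0", "P1", "P2", "P3", "INFO"}
KINDS = {"slo_breach", "crashloop", "latency", "error_rate", "disk",
         "oom", "deploy", "security", "custom"}

# Assumption: the document does not state which fields are mandatory.
REQUIRED_FIELDS = ("source", "service", "env", "severity", "kind", "title")


def validate_alert(alert: Dict) -> List[str]:
    """Return a list of problems; an empty list means the alert looks valid."""
    problems = [f"missing field: {name}"
                for name in REQUIRED_FIELDS if name not in alert]
    if "severity" in alert and alert["severity"] not in SEVERITIES:
        problems.append(f"unknown severity: {alert['severity']}")
    if "kind" in alert and alert["kind"] not in KINDS:
        problems.append(f"unknown kind: {alert['kind']}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every issue with a rejected alert at once.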

Dedupe Behavior

Dedupe key = sha256(service|env|kind|fingerprint).

Same key within TTL (default 30 min) → deduped=true, occurrences++, no new record
Same key after TTL → new alert record
Different fingerprint → separate record
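The dedupe rules above can be sketched with an in-memory store. This is an illustration under stated assumptions, not the AlertStore implementation: `InMemoryAlertStore` is a hypothetical class, the four key fields are assumed to be joined with a literal `|` before hashing, and the TTL is assumed to be measured from the time the current record was created.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def dedupe_key(service, env, kind, fingerprint):
    # Assumption: fields are joined with a literal "|" and UTF-8 encoded.
    raw = "|".join([service, env, kind, fingerprint])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class InMemoryAlertStore:
    """Hypothetical store that mimics the documented dedupe behavior."""

    def __init__(self, ttl_minutes=30):
        self.ttl = timedelta(minutes=ttl_minutes)
        self.latest = {}   # dedupe_key -> most recent record for that key
        self.records = []  # every record ever created

    def ingest(self, alert, now=None):
        now = now or datetime.now(timezone.utc)
        key = dedupe_key(alert["service"], alert["env"], alert["kind"],
                         alert["labels"]["fingerprint"])
        record = self.latest.get(key)
        # Same key within TTL: bump the counter, create no new record.
        if record is not None and now - record["first_seen"] < self.ttl:
            record["occurrences"] += 1
            return {"accepted": True, "deduped": True, "dedupe_key": key,
                    "occurrences": record["occurrences"]}
        # New key, or same key after TTL: start a fresh record.
        record = {"dedupe_key": key, "alert": alert,
                  "first_seen": now, "occurrences": 1}
        self.latest[key] = record
        self.records.append(record)
        return {"accepted": True, "deduped": False, "dedupe_key": key,
                "occurrences": 1}
```

Passing `now` explicitly keeps the TTL logic deterministic in tests; a real backend would also return an `alert_ref`, which is omitted here.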

`alert_ingest_tool` API

ingest (Monitor role)

{
  "action": "ingest",
  "alert": { ...AlertEvent... },
  "dedupe_ttl_minutes": 30
}

Response:

{
  "accepted": true,
  "deduped": false,
  "dedupe_key": "abc123...",
  "alert_ref": "alrt_20250123_090000_a1b2c3",
  "occurrences": 1
}

list (read)

{ "action": "list", "service": "gateway", "env": "prod", "window_minutes": 240, "limit": 50 }

get (read)

{ "action": "get", "alert_ref": "alrt_..." }

ack (oncall/cto)

{ "action": "ack", "alert_ref": "alrt_...", "actor": "sofiia", "note": "false positive" }

`oncall_tool.alert_to_incident`

Converts a stored alert into an incident, or attaches it to an existing open incident.

{
  "action": "alert_to_incident",
  "alert_ref": "alrt_...",
  "incident_severity_cap": "P1",
  "dedupe_window_minutes": 60,
  "attach_artifact": true
}

Response:

{
  "incident_id": "inc_20250123_090000_xyz",
  "created": true,
  "severity": "P1",
  "artifact_path": "ops/incidents/inc_.../alert_alrt_....json",
  "note": "Incident created and alert acked"
}

Logic

Load alert from AlertStore.
Check for an existing open P0/P1 incident for the same service/env within dedupe_window_minutes.
- If found → attach event to existing incident, ack alert.
- If not found → create incident, append note + metric timeline events, optionally attach masked alert JSON as artifact, ack alert.
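The attach-or-create decision above can be sketched as follows. This is a simplified illustration, not the oncall_tool code: the incident dict layout, the `status` and `opened_at` fields, and the ID suffix are hypothetical, the window is assumed to be measured from when the incident was opened, and the severity cap, metric timeline, and artifact attachment are left out.

```python
from datetime import datetime, timedelta, timezone


def alert_to_incident(alert, incidents, now, dedupe_window_minutes=60):
    """Attach the alert to a matching open P0/P1 incident, or create one.

    alert: dict with alert_ref, service, env, severity (mutated: acked=True).
    incidents: mutable list of incident dicts (hypothetical layout).
    """
    window = timedelta(minutes=dedupe_window_minutes)
    for incident in incidents:
        same_target = (incident["service"], incident["env"]) == \
                      (alert["service"], alert["env"])
        open_major = (incident["status"] == "open"
                      and incident["severity"] in ("P0", "P1"))
        in_window = now - incident["opened_at"] <= window
        if same_target and open_major and in_window:
            # Found: attach an event to the existing incident, ack alert.
            incident["events"].append(
                {"type": "alert_attached", "alert_ref": alert["alert_ref"]})
            alert["acked"] = True
            return {"incident_id": incident["incident_id"], "created": False}

    # Not found: create a new incident with an initial note, ack alert.
    incident = {
        # Illustrative ID only; the real suffix scheme is not documented.
        "incident_id": f"inc_{now:%Y%m%d_%H%M%S}_{len(incidents) + 1}",
        "service": alert["service"],
        "env": alert["env"],
        "severity": alert["severity"],
        "status": "open",
        "opened_at": now,
        "events": [{"type": "note", "alert_ref": alert["alert_ref"]}],
    }
    incidents.append(incident)
    alert["acked"] = True
    return {"incident_id": incident["incident_id"], "created": True}
```

In this sketch the alert is acked on both paths, matching the two branches listed above.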

RBAC

Role	ingest	list/get	ack	alert_to_incident
`agent_monitor`	✅	❌	❌	❌
`agent_cto`	✅	✅	✅	✅
`agent_oncall`	❌	✅	✅	✅
`agent_interface`	❌	✅	❌	❌
`agent_default`	❌	❌	❌	❌
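The matrix above maps naturally onto a lookup table. A minimal sketch, assuming a deny-by-default policy for unknown roles (the document does not say how unknown roles are handled); `RBAC` and `is_allowed` are hypothetical names.

```python
# Permissions per role, transcribed from the RBAC table above.
RBAC = {
    "agent_monitor": {"ingest"},
    "agent_cto": {"ingest", "list", "get", "ack", "alert_to_incident"},
    "agent_oncall": {"list", "get", "ack", "alert_to_incident"},
    "agent_interface": {"list", "get"},
    "agent_default": set(),
}


def is_allowed(role, action):
    # Assumption: roles missing from the table get no permissions.
    return action in RBAC.get(role, set())
```

For example, `is_allowed("agent_monitor", "ack")` is false, which reflects the separation between sending alerts and acknowledging them.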

SLO Watch Gate

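The slo_watch gate in release_check checks for active SLO breaches before a deploy. Whether a breach blocks the deploy or only produces recommendations depends on the profile's mode.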

Profile	Mode	Behavior
dev	warn	Recommendations only
staging	strict	Blocks on any violation
prod	warn	Recommendations only

Configure the mode per profile in config/release_gate_policy.yml. Override the gate for a single run with run_slo_watch: false.
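The per-profile behavior can be sketched as a small decision function. This is an illustration only: `slo_watch_gate`, the result fields, and the hard-coded `PROFILE_MODES` mapping are hypothetical (the real modes are read from the policy file), and `run_slo_watch=False` is assumed to skip the gate entirely.

```python
# Modes transcribed from the profile table above.
PROFILE_MODES = {"dev": "warn", "staging": "strict", "prod": "warn"}


def slo_watch_gate(profile, violations, run_slo_watch=True):
    """Decide whether a release may proceed given active SLO violations."""
    if not run_slo_watch:
        # Assumption: the per-run override skips the gate completely.
        return {"passed": True, "skipped": True, "recommendations": []}
    mode = PROFILE_MODES[profile]
    # Only strict mode blocks; warn mode reports and lets the release pass.
    blocked = mode == "strict" and len(violations) > 0
    return {
        "passed": not blocked,
        "skipped": False,
        "mode": mode,
        "recommendations": [f"Active SLO violation: {v}" for v in violations],
    }
```

With the same violation list, `staging` returns `passed: False` while `dev` and `prod` return `passed: True` plus recommendations.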

Backends

Env var	Value	Effect
`ALERT_BACKEND`	`memory` (default)	In-process, not persistent
`ALERT_BACKEND`	`postgres`	Persistent, needs DATABASE_URL
`ALERT_BACKEND`	`auto`	Postgres if DATABASE_URL set, else memory

For the postgres backend, run the DDL migration: python3 ops/scripts/migrate_alerts_postgres.py
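The backend selection rules in the table can be sketched as below. `resolve_alert_backend` is a hypothetical helper, and raising an error when `postgres` is requested without `DATABASE_URL` is an assumption; the document only says that `postgres` needs it.

```python
import os


def resolve_alert_backend(environ=None):
    """Return "memory" or "postgres" based on ALERT_BACKEND / DATABASE_URL."""
    env = os.environ if environ is None else environ
    choice = env.get("ALERT_BACKEND", "memory").strip().lower()
    has_db = bool(env.get("DATABASE_URL"))
    if choice == "auto":
        # Postgres if DATABASE_URL is set, else memory.
        return "postgres" if has_db else "memory"
    if choice == "postgres":
        if not has_db:
            # Assumption: fail fast rather than silently fall back.
            raise RuntimeError("ALERT_BACKEND=postgres requires DATABASE_URL")
        return "postgres"
    if choice == "memory":
        return "memory"
    raise ValueError(f"unknown ALERT_BACKEND: {choice!r}")
```

Accepting an optional mapping instead of always reading `os.environ` makes the selection easy to exercise without touching the process environment.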
