docs(platform): add policy configs, runbooks, ops scripts and platform documentation
Config policies (16 files): alert_routing, architecture_pressure, backlog, cost_weights, data_governance, incident_escalation, incident_intelligence, network_allowlist, nodes_registry, observability_sources, rbac_tools_matrix, release_gate, risk_attribution, risk_policy, slo_policy, tool_limits, tools_rollout

Ops (22 files): Caddyfile, calendar compose, grafana voice dashboard, deployments/incidents logs, runbooks for alerts/audit/backlog/incidents/sofiia/voice, cron jobs, scripts (alert_triage, audit_cleanup, migrate_*, governance, schedule), task_registry, voice alerts/ha/latency/policy

Docs (30+ files): HUMANIZED_STEPAN v2.7-v3 changelogs and runbooks, NODA1/NODA2 status and setup, audit index and traces, backlog, incident, supervisor, tools, voice, opencode, release, risk, aistalk, spacebot

Made-with: Cursor
156
docs/incident/alerts.md
Normal file
@@ -0,0 +1,156 @@
# Alert → Incident Bridge

## Overview

The Alert Bridge provides a governed, deduplicated pipeline from Monitor/Prometheus detection to Incident creation.

**Security model:** Monitor only sends alerts (`tools.alerts.ingest`). Sofiia/oncall create incidents (`tools.oncall.incident_write` + `tools.alerts.ack`). No agent gets both roles automatically; in the RBAC table below, only `agent_cto` holds both.

```
Monitor@nodeX ──ingest──► AlertStore ──alert_to_incident──► IncidentStore
         (tools.alerts.ingest)   (tools.oncall.incident_write)
                                                                  │
                                                    IncidentTriage (Sofiia NODA2)
                                                                  │
                                                           PostmortemDraft
```

## AlertEvent Schema

```json
{
  "source": "monitor@node1",
  "service": "gateway",
  "env": "prod",
  "severity": "P1",
  "kind": "slo_breach",
  "title": "gateway SLO: latency p95 > 300ms",
  "summary": "p95 latency at 450ms, error_rate 2.5%",
  "started_at": "2025-01-23T09:00:00Z",
  "labels": {
    "node": "node1",
    "fingerprint": "gateway:slo_breach:latency"
  },
  "metrics": {
    "latency_p95_ms": 450,
    "error_rate_pct": 2.5
  },
  "evidence": {
    "log_samples": ["ERROR timeout after 30s", "WARN retry 3/3"],
    "query": "rate(http_errors_total[5m])"
  }
}
```

### Severity values
`P0`, `P1`, `P2`, `P3`, `INFO`

### Kind values
`slo_breach`, `crashloop`, `latency`, `error_rate`, `disk`, `oom`, `deploy`, `security`, `custom`

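The allowed severity and kind values above lend themselves to a simple pre-ingest check. The following is a minimal sketch, not the project's actual validator: the function name `validate_alert_event` and the choice of required fields are assumptions, since this document does not state which fields are mandatory.

```python
ALLOWED_SEVERITIES = {"P0", "P1", "P2", "P3", "INFO"}
ALLOWED_KINDS = {
    "slo_breach", "crashloop", "latency", "error_rate",
    "disk", "oom", "deploy", "security", "custom",
}
# Assumption: the set of mandatory fields is not specified in this document.
REQUIRED_FIELDS = ("source", "service", "env", "severity", "kind", "title")


def validate_alert_event(alert: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for field in REQUIRED_FIELDS:
        if field not in alert:
            errors.append(f"missing field: {field}")
    if "severity" in alert and alert["severity"] not in ALLOWED_SEVERITIES:
        errors.append(f"invalid severity: {alert['severity']}")
    if "kind" in alert and alert["kind"] not in ALLOWED_KINDS:
        errors.append(f"invalid kind: {alert['kind']}")
    return errors
```

Returning a list of problems rather than raising on the first one lets a caller report every issue with a rejected alert in a single response.
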
## Dedupe Behavior

Dedupe key = `sha256(service|env|kind|fingerprint)`.

- Same key within TTL (default 30 min) → `deduped=true`, `occurrences++`, no new record
- Same key after TTL → new alert record
- Different fingerprint → separate record

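The dedupe rules above can be sketched with an in-memory store. This is an illustration under stated assumptions, not the real `AlertStore`: the class name `InMemoryAlertStore` is hypothetical, the fingerprint is assumed to come from `labels.fingerprint`, the fields are assumed to be joined with a literal `|` and UTF-8 encoded before hashing, and the TTL is assumed to run from the first sighting of a key.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def dedupe_key(service: str, env: str, kind: str, fingerprint: str) -> str:
    """sha256 over 'service|env|kind|fingerprint' (joining format is assumed)."""
    raw = "|".join([service, env, kind, fingerprint])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class InMemoryAlertStore:
    """Hypothetical store that keeps one active record per dedupe key."""

    def __init__(self):
        self._records = {}  # dedupe key -> {"first_seen", "occurrences"}

    def ingest(self, alert: dict, now: datetime, ttl_minutes: int = 30) -> dict:
        key = dedupe_key(
            alert["service"], alert["env"], alert["kind"],
            alert["labels"]["fingerprint"],
        )
        record = self._records.get(key)
        # Assumption: the TTL window is measured from the first sighting.
        if record and now - record["first_seen"] < timedelta(minutes=ttl_minutes):
            record["occurrences"] += 1
            return {"deduped": True, "dedupe_key": key,
                    "occurrences": record["occurrences"]}
        # No record yet, or the TTL has expired: start a new record.
        self._records[key] = {"first_seen": now, "occurrences": 1}
        return {"deduped": False, "dedupe_key": key, "occurrences": 1}
```

Passing `now` explicitly keeps the TTL logic easy to exercise in tests without patching the clock.
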
## `alert_ingest_tool` API

### ingest (Monitor role)
```json
{
  "action": "ingest",
  "alert": { ...AlertEvent... },
  "dedupe_ttl_minutes": 30
}
```
Response:
```json
{
  "accepted": true,
  "deduped": false,
  "dedupe_key": "abc123...",
  "alert_ref": "alrt_20250123_090000_a1b2c3",
  "occurrences": 1
}
```

### list (read)
```json
{ "action": "list", "service": "gateway", "env": "prod", "window_minutes": 240, "limit": 50 }
```

### get (read)
```json
{ "action": "get", "alert_ref": "alrt_..." }
```

### ack (oncall/cto)
```json
{ "action": "ack", "alert_ref": "alrt_...", "actor": "sofiia", "note": "false positive" }
```

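To make the `list` parameters concrete, here is a small sketch of how such a filter might behave. The semantics are assumptions, since this document only shows the request shape: `window_minutes` is assumed to apply to `started_at`, results are assumed to be ordered newest first, and `list_alerts` and `parse_ts` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as '2025-01-23T09:00:00Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def list_alerts(alerts, service, env, window_minutes, limit, now):
    """Return up to `limit` alerts for service/env that started within the window."""
    cutoff = now - timedelta(minutes=window_minutes)
    matches = [
        a for a in alerts
        if a["service"] == service
        and a["env"] == env
        and parse_ts(a["started_at"]) >= cutoff
    ]
    # Assumption: newest alerts first.
    matches.sort(key=lambda a: parse_ts(a["started_at"]), reverse=True)
    return matches[:limit]
```
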
## `oncall_tool.alert_to_incident`

Converts a stored alert into an incident (or attaches to an existing open one).

```json
{
  "action": "alert_to_incident",
  "alert_ref": "alrt_...",
  "incident_severity_cap": "P1",
  "dedupe_window_minutes": 60,
  "attach_artifact": true
}
```

Response:
```json
{
  "incident_id": "inc_20250123_090000_xyz",
  "created": true,
  "severity": "P1",
  "artifact_path": "ops/incidents/inc_.../alert_alrt_....json",
  "note": "Incident created and alert acked"
}
```

### Logic
1. Load alert from `AlertStore`
2. Check for existing open P0/P1 incident for same service/env within `dedupe_window_minutes`
   - If found → attach event to existing incident, ack alert
3. If not found → create incident, append `note` + `metric` timeline events, optionally attach masked alert JSON as artifact, ack alert

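The steps above can be sketched as follows. This is a simplified illustration, not the `oncall_tool` implementation: alerts and incidents are plain dicts with hypothetical field names (`status`, `timeline`, `acked`), the severity cap is assumed to mean "never more severe than the cap", the dedupe window is assumed to be measured from the incident's start, the generated `incident_id` omits the suffix seen in the response example, and artifact attachment and masking are left out.

```python
from datetime import datetime, timedelta, timezone

SEVERITY_ORDER = ["P0", "P1", "P2", "P3", "INFO"]  # most severe first


def cap_severity(severity: str, cap: str) -> str:
    """Return the less severe of the two values (assumed meaning of the cap)."""
    return max(severity, cap, key=SEVERITY_ORDER.index)


def alert_to_incident(alert, incidents, now, severity_cap="P1",
                      dedupe_window_minutes=60):
    window = timedelta(minutes=dedupe_window_minutes)
    # Step 2: look for an open P0/P1 incident for the same service/env.
    for incident in incidents:
        if (
            incident["status"] == "open"
            and incident["severity"] in ("P0", "P1")
            and incident["service"] == alert["service"]
            and incident["env"] == alert["env"]
            and now - incident["started_at"] <= window
        ):
            incident["timeline"].append(
                {"type": "note", "text": f"attached {alert['alert_ref']}"}
            )
            alert["acked"] = True
            return {"incident_id": incident["incident_id"], "created": False,
                    "severity": incident["severity"]}
    # Step 3: no match, so create a new incident with note + metric events.
    incident = {
        "incident_id": f"inc_{now:%Y%m%d_%H%M%S}",  # hypothetical ID format
        "status": "open",
        "service": alert["service"],
        "env": alert["env"],
        "severity": cap_severity(alert["severity"], severity_cap),
        "started_at": now,
        "timeline": [
            {"type": "note", "text": alert["title"]},
            {"type": "metric", "data": alert.get("metrics", {})},
        ],
    }
    incidents.append(incident)
    alert["acked"] = True
    return {"incident_id": incident["incident_id"], "created": True,
            "severity": incident["severity"]}
```

In this sketch a `P0` alert with a cap of `P1` yields a `P1` incident, and a second alert for the same service/env within the window attaches to that incident instead of opening a new one.
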
## RBAC

| Role | ingest | list/get | ack | alert_to_incident |
|------|--------|----------|-----|-------------------|
| `agent_monitor` | ✅ | ❌ | ❌ | ❌ |
| `agent_cto` | ✅ | ✅ | ✅ | ✅ |
| `agent_oncall` | ❌ | ✅ | ✅ | ✅ |
| `agent_interface` | ❌ | ✅ | ❌ | ❌ |
| `agent_default` | ❌ | ❌ | ❌ | ❌ |

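The matrix above maps directly onto a lookup table. The sketch below encodes it as a dict of allowed actions per role; the helper name `is_allowed` is hypothetical, and treating unknown roles like `agent_default` (deny everything) is an assumption.

```python
# Allowed actions per role, transcribed from the RBAC table.
RBAC = {
    "agent_monitor": {"ingest"},
    "agent_cto": {"ingest", "list", "get", "ack", "alert_to_incident"},
    "agent_oncall": {"list", "get", "ack", "alert_to_incident"},
    "agent_interface": {"list", "get"},
    "agent_default": set(),
}


def is_allowed(role: str, action: str) -> bool:
    """Deny by default: unknown roles get no permissions (assumption)."""
    return action in RBAC.get(role, set())
```
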
## SLO Watch Gate

The `slo_watch` gate in `release_check` checks for active SLO breaches before a deploy. Whether a breach blocks the deploy depends on the profile's mode:

| Profile | Mode | Behavior |
|---------|------|----------|
| dev | warn | Recommendations only |
| staging | strict | Blocks on any violation |
| prod | warn | Recommendations only |

Configure the mode per profile in `config/release_gate_policy.yml`. To skip the gate for a single run, set `run_slo_watch: false`.

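The profile/mode table can be read as a small decision function. The following sketch assumes a hypothetical result shape (`pass`, `skipped`, `mode`, `recommendations`) and hard-codes the modes from the table instead of loading `config/release_gate_policy.yml`, whose layout is not shown in this document.

```python
# Modes per profile, copied from the table above.
GATE_MODES = {"dev": "warn", "staging": "strict", "prod": "warn"}


def evaluate_slo_watch(profile: str, violations: list, run_slo_watch: bool = True) -> dict:
    """Decide whether active SLO violations block a release for a profile."""
    if not run_slo_watch:
        # Per-run override: the gate is skipped entirely.
        return {"pass": True, "skipped": True, "mode": None, "recommendations": []}
    mode = GATE_MODES[profile]
    # Only strict mode blocks; warn mode surfaces recommendations only.
    blocked = mode == "strict" and len(violations) > 0
    return {
        "pass": not blocked,
        "skipped": False,
        "mode": mode,
        "recommendations": list(violations),
    }
```
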
## Backends

| Env var | Value | Effect |
|---------|-------|--------|
| `ALERT_BACKEND` | `memory` (default) | In-process, not persistent |
| `ALERT_BACKEND` | `postgres` | Persistent, requires `DATABASE_URL` |
| `ALERT_BACKEND` | `auto` | Postgres if `DATABASE_URL` is set, else memory |

For the Postgres backend, run the DDL migration: `python3 ops/scripts/migrate_alerts_postgres.py`
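
The backend selection rules in the table can be sketched as a small resolver. The function name `resolve_alert_backend` is hypothetical, and raising an error for `postgres` without `DATABASE_URL` or for an unrecognized value is an assumption; the document does not say how those cases are handled.

```python
import os


def resolve_alert_backend(environ=None) -> str:
    """Return 'memory' or 'postgres' based on ALERT_BACKEND and DATABASE_URL."""
    env = os.environ if environ is None else environ
    backend = env.get("ALERT_BACKEND", "memory")  # memory is the default
    has_db = bool(env.get("DATABASE_URL"))
    if backend == "auto":
        return "postgres" if has_db else "memory"
    if backend == "postgres":
        if not has_db:
            # Assumption: fail fast instead of silently falling back.
            raise RuntimeError("ALERT_BACKEND=postgres requires DATABASE_URL")
        return "postgres"
    if backend == "memory":
        return "memory"
    raise ValueError(f"unknown ALERT_BACKEND: {backend}")
```

Accepting an optional mapping instead of always reading `os.environ` makes the resolver straightforward to test.
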