docs(platform): add policy configs, runbooks, ops scripts and platform documentation

Config policies (16 files): alert_routing, architecture_pressure, backlog,
cost_weights, data_governance, incident_escalation, incident_intelligence,
network_allowlist, nodes_registry, observability_sources, rbac_tools_matrix,
release_gate, risk_attribution, risk_policy, slo_policy, tool_limits, tools_rollout

Ops (22 files): Caddyfile, calendar compose, grafana voice dashboard,
deployments/incidents logs, runbooks for alerts/audit/backlog/incidents/sofiia/voice,
cron jobs, scripts (alert_triage, audit_cleanup, migrate_*, governance, schedule),
task_registry, voice alerts/ha/latency/policy

Docs (30+ files): HUMANIZED_STEPAN v2.7-v3 changelogs and runbooks,
NODA1/NODA2 status and setup, audit index and traces, backlog, incident,
supervisor, tools, voice, opencode, release, risk, aistalk, spacebot

Made-with: Cursor
Apple
2026-03-03 07:14:53 -08:00
parent 129e4ea1fc
commit 67225a39fa
102 changed files with 20060 additions and 0 deletions

docs/risk/risk_index.md Normal file

@@ -0,0 +1,206 @@
# Service Risk Index
> Deterministic. No LLM. Production-grade.
## Overview
The Risk Index Engine computes a **numerical risk score (0–100+)** for every tracked service. It is the single authoritative metric for service health in the DAARION.city control plane.
Score → Band mapping:
| Score | Band | Meaning |
|--------|----------|------------------------------------------|
| 0–20   | low      | No significant signals                   |
| 21–50  | medium   | Minor signals; monitor                   |
| 51–80  | high     | Active problems; coordinate before deploy |
| 81+ | critical | Block or escalate immediately |
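The band mapping above can be sketched as a small pure function. This is an illustration only: the name `score_to_band` appears in the architecture diagram of this document, but the signature and defaults below are assumptions that mirror the `low_max`, `medium_max`, and `high_max` keys of `config/risk_policy.yml`.

```python
def score_to_band(score: int, low_max: int = 20, medium_max: int = 50, high_max: int = 80) -> str:
    """Map a numeric risk score to a band; each upper bound is inclusive."""
    if score <= low_max:
        return "low"
    if score <= medium_max:
        return "medium"
    if score <= high_max:
        return "high"
    # Anything above high_max is critical; the score has no upper cap.
    return "critical"


for s in (0, 20, 21, 50, 51, 80, 81, 120):
    print(s, score_to_band(s))
```

Treating each upper bound as inclusive matches the table, where 20 is still `low` and 81 is the first `critical` value.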
---
## Scoring Formula
```
Risk(service) = Σ weight(signal) × count_or_flag(signal)
```
All weights are policy-driven via `config/risk_policy.yml`.
### Signal weights (defaults)
| Signal | Points |
|-------------------------------|-------------------------------|
| Open P0 incident | 50 each |
| Open P1 incident | 25 each |
| Open P2 incident | 10 each |
| Open P3 incident | 5 each |
| High recurrence signature 7d | 20 each |
| Warn recurrence signature 7d | 10 each |
| High recurrence kind 7d | 15 each |
| Warn recurrence kind 7d | 8 each |
| High recurrence signature 30d | 10 each |
| High recurrence kind 30d | 8 each |
| Overdue follow-up P0 | 20 each |
| Overdue follow-up P1 | 12 each |
| Overdue follow-up other | 6 each |
| Active SLO violation (60m) | 10 each |
| Alert-loop SLO violation | 10 each |
| Escalations 24h (1–2)         | 5 (warn level)                |
| Escalations 24h (3+) | 12 (high level) |
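As a worked example of the formula, the sketch below sums the default weights for a handful of signals. The signal key names (`open_p1`, `sig_high_7d`, and so on) are hypothetical labels chosen for this illustration; only the point values come from the table. Escalations are treated as tiered (one or two in 24h scores 5, three or more scores 12) rather than per-event, which is how the last two rows read.

```python
from __future__ import annotations

# Hypothetical signal keys; point values are the documented defaults.
DEFAULT_WEIGHTS = {
    "open_p0": 50, "open_p1": 25, "open_p2": 10, "open_p3": 5,
    "sig_high_7d": 20, "sig_warn_7d": 10, "kind_high_7d": 15, "kind_warn_7d": 8,
    "sig_high_30d": 10, "kind_high_30d": 8,
    "overdue_p0": 20, "overdue_p1": 12, "overdue_other": 6,
    "slo_violation": 10, "alert_loop_violation": 10,
}


def escalation_points(count_24h: int) -> int:
    """Tiered, not per-event: 1-2 escalations -> 5 (warn), 3+ -> 12 (high)."""
    if count_24h >= 3:
        return 12
    if count_24h >= 1:
        return 5
    return 0


def risk_score(signals: dict[str, int], escalations_24h: int = 0) -> int:
    """Sum weight x count over all signals, then add the escalation tier."""
    per_signal = sum(DEFAULT_WEIGHTS[name] * count for name, count in signals.items())
    return per_signal + escalation_points(escalations_24h)


score = risk_score(
    {"open_p1": 1, "sig_high_7d": 1, "overdue_p1": 1, "slo_violation": 1},
    escalations_24h=1,
)
print(score)  # 25 + 20 + 12 + 10 + 5 = 72
```

With the default bands, a score of 72 falls in `high`.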
---
## Configuration
**`config/risk_policy.yml`** — controls all weights, thresholds, and per-service overrides.
```yaml
thresholds:
  bands:
    low_max: 20
    medium_max: 50
    high_max: 80
  risk_watch:
    warn_at: 50
    fail_at: 80
service_overrides:
  gateway:
    risk_watch:
      fail_at: 75   # gateway fails earlier: critical path
p0_services:
  - gateway
  - router
```
Changes to the file take effect on the next request; the policy cache is short-lived.
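One way to picture how a per-service override combines with the global thresholds is a shallow merge. The sketch below is an assumption about the resolution order, not the engine's actual code: it takes `risk_watch` to be nested under `thresholds` and under each entry of `service_overrides`, and it uses a plain dict in place of the parsed YAML so it runs without a YAML library. `effective_thresholds` is a hypothetical helper name.

```python
# Plain-dict stand-in for the parsed policy file (nesting is assumed).
POLICY = {
    "thresholds": {"risk_watch": {"warn_at": 50, "fail_at": 80}},
    "service_overrides": {"gateway": {"risk_watch": {"fail_at": 75}}},
    "p0_services": ["gateway", "router"],
}


def effective_thresholds(policy: dict, service: str) -> dict:
    """Start from the global risk_watch thresholds, then layer the service override on top."""
    base = dict(policy["thresholds"]["risk_watch"])  # copy so the policy is not mutated
    override = policy.get("service_overrides", {}).get(service, {}).get("risk_watch", {})
    base.update(override)
    return base


print(effective_thresholds(POLICY, "gateway"))  # {'warn_at': 50, 'fail_at': 75}
print(effective_thresholds(POLICY, "router"))   # {'warn_at': 50, 'fail_at': 80}
```

Only the keys present in the override change; `gateway` keeps the global `warn_at` while picking up its stricter `fail_at`.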
---
## API
### `GET /v1/risk/service/{service}?env=prod&window_hours=24`
Returns a `RiskReport`:
```json
{
"service": "gateway",
"env": "prod",
"score": 72,
"band": "high",
"thresholds": { "warn_at": 50, "fail_at": 75 },
"components": {
"open_incidents": { "P0": 0, "P1": 1, "P2": 0, "points": 25 },
"recurrence": { "high_signatures_7d": 1, "points": 20 },
"followups": { "overdue_P1": 1, "points": 12 },
"slo": { "violations": 1, "points": 10 },
"alerts_loop": { "violations": 0, "points": 0 },
"escalations": { "count_24h": 1, "points": 5 }
},
"reasons": [
"Open P1 incident(s): 1",
"High recurrence signatures (7d): 1",
"Overdue follow-ups (P1): 1",
"Active SLO violation(s) in window: 1",
"Escalations in last 24h: 1"
],
"recommendations": [
"Prioritize open P0/P1 incidents before deploying.",
"Investigate recurring failure patterns.",
"Avoid risky deploys until SLO violation clears.",
"Service is high-risk — coordinate with oncall before release."
],
"updated_at": "2026-02-23T12:00:00"
}
```
RBAC required: `tools.risk.read` (granted to `agent_cto`, `agent_oncall`, `agent_monitor`).
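A client-side sketch for this endpoint, using only the standard library. The base URL `http://localhost:8000` and both helper names are hypothetical. How the caller authenticates and proves the `tools.risk.read` permission is not described here, so the sketch stops at building the URL and summarizing a `RiskReport` that has already been fetched.

```python
from urllib.parse import quote, urlencode


def risk_service_url(base_url: str, service: str, env: str = "prod", window_hours: int = 24) -> str:
    """Build the GET URL for a single-service risk report."""
    path = f"/v1/risk/service/{quote(service, safe='')}"  # escape the path segment
    query = urlencode({"env": env, "window_hours": window_hours})
    return f"{base_url.rstrip('/')}{path}?{query}"


def summarize(report: dict) -> str:
    """One-line summary of the fields a release pipeline usually cares about."""
    thresholds = report["thresholds"]
    return (
        f"{report['service']} [{report['env']}] score={report['score']} "
        f"band={report['band']} warn_at={thresholds['warn_at']} fail_at={thresholds['fail_at']}"
    )


url = risk_service_url("http://localhost:8000", "gateway")  # hypothetical host
print(url)

sample = {
    "service": "gateway", "env": "prod", "score": 72, "band": "high",
    "thresholds": {"warn_at": 50, "fail_at": 75},
}
print(summarize(sample))
```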
### `GET /v1/risk/dashboard?env=prod&top_n=10`
Returns top-N services by score with band summary:
```json
{
"env": "prod",
"generated_at": "...",
"total_services": 4,
"band_counts": { "critical": 1, "high": 1, "medium": 2, "low": 0 },
"critical_p0_services": ["gateway"],
"services": [ ...RiskReports sorted by score desc... ]
}
```
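The dashboard payload can be approximated by sorting reports and counting bands. This sketch makes two assumptions that the response shape suggests but does not confirm: `band_counts` covers all services (not only the top N), and `critical_p0_services` lists services that are both in `p0_services` and in the `critical` band. `build_dashboard` and the service names `memory` and `voice` are made up for the example, and `generated_at` is left out to keep the output reproducible.

```python
from collections import Counter

BANDS = ("critical", "high", "medium", "low")


def build_dashboard(reports: list, p0_services: list, env: str = "prod", top_n: int = 10) -> dict:
    """Aggregate per-service reports into a dashboard-like summary."""
    ranked = sorted(reports, key=lambda r: r["score"], reverse=True)
    counts = Counter(r["band"] for r in reports)
    return {
        "env": env,
        "total_services": len(reports),
        "band_counts": {band: counts.get(band, 0) for band in BANDS},
        "critical_p0_services": [
            r["service"] for r in ranked
            if r["band"] == "critical" and r["service"] in p0_services
        ],
        "services": ranked[:top_n],  # highest score first
    }


reports = [
    {"service": "memory", "score": 30, "band": "medium"},
    {"service": "gateway", "score": 85, "band": "critical"},
    {"service": "voice", "score": 25, "band": "medium"},
    {"service": "router", "score": 72, "band": "high"},
]
dashboard = build_dashboard(reports, p0_services=["gateway", "router"], top_n=2)
print(dashboard["band_counts"])
print(dashboard["critical_p0_services"])
print([r["service"] for r in dashboard["services"]])
```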
### Tool: `risk_engine_tool`
```json
{ "action": "service", "service": "gateway", "env": "prod" }
{ "action": "dashboard", "env": "prod", "top_n": 10 }
{ "action": "policy" }
```
---
## Release Gate: `risk_watch`
The `risk_watch` gate integrates Risk Index into the release pipeline.
### Behaviour
| Mode | When score ≥ warn_at (default 50) | When score ≥ fail_at (default 80) |
|--------|------------------------------------|-------------------------------------|
| warn | pass=true + recommendations added | pass=true + recommendations added |
| strict | pass=true + recommendations added | **pass=false** — deploy blocked |
### Policy
```yaml
# config/release_gate_policy.yml
dev:
  risk_watch: { mode: "warn" }
staging:
  risk_watch: { mode: "strict" }   # blocks p0_services when score >= fail_at
prod:
  risk_watch: { mode: "warn" }
### Non-fatal guarantee
If the Risk Engine is unavailable (store down, timeout, error), `risk_watch` is **skipped** — never blocks. A warning is added to the gate output.
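The gate behaviour, including the non-fatal path, can be sketched as follows. The function name, its arguments, and the shape of the returned dict are hypothetical; only the warn/strict semantics and the skip-on-failure rule come from this section. The sketch follows the behaviour table and does not model the policy comment that restricts strict blocking to `p0_services`.

```python
from typing import Callable


def risk_watch_gate(fetch_report: Callable[[], dict], mode: str, warn_at: int, fail_at: int) -> dict:
    """Evaluate the gate; fetch_report is a zero-argument callable that may raise."""
    try:
        report = fetch_report()
    except Exception as exc:
        # Non-fatal guarantee: an unavailable engine never blocks a release.
        return {
            "pass": True, "skipped": True,
            "warnings": [f"risk_watch skipped: {exc}"], "recommendations": [],
        }

    score = report["score"]
    # Recommendations are surfaced from the warn threshold upwards.
    recommendations = list(report.get("recommendations", [])) if score >= warn_at else []
    # Only strict mode can fail the gate, and only at or above fail_at.
    blocked = mode == "strict" and score >= fail_at
    return {
        "pass": not blocked, "skipped": False,
        "warnings": [], "recommendations": recommendations,
    }


def engine_down() -> dict:
    raise TimeoutError("risk store timeout")


report = {"score": 82, "recommendations": ["Coordinate with oncall before release."]}
print(risk_watch_gate(lambda: report, "warn", 50, 80)["pass"])
print(risk_watch_gate(lambda: report, "strict", 50, 80)["pass"])
print(risk_watch_gate(engine_down, "strict", 50, 80))
```

Passing the fetch step in as a callable keeps the gate logic testable without a running engine, which suits the warn, strict, and non-fatal cases listed under Testing.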
### Release inputs
| Input | Type | Default | Description |
|--------------------|---------|---------|----------------------------------------------|
| `run_risk_watch` | boolean | true | Enable/disable the gate |
| `risk_watch_env` | string | prod | Env to score against |
| `risk_watch_warn_at` | int | policy | Override warn threshold |
| `risk_watch_fail_at` | int | policy | Override fail threshold |
---
## Architecture
```
[Incident Store]──open incidents──┐
[Intelligence]──recurrence 7d/30d─┤
[Followups Summary]──overdue──────┤──► risk_engine.py ──► RiskReport
[SLO Snapshot]──violations────────┤          │
[Alert Store]──loop SLO───────────┤    score_to_band
[Decision Events]──escalations────┘          │
                                    release_check_runner
                                      risk_watch gate
The engine has **zero LLM calls**. It is deterministic: given the same signals, the same score is always produced.
---
## Testing
```bash
pytest tests/test_risk_engine.py # scoring + bands + overrides
pytest tests/test_risk_dashboard.py # sorting + band counts + p0 detection
pytest tests/test_release_check_risk_watch.py # warn/strict/non-fatal gate
```