docs(platform): add policy configs, runbooks, ops scripts and platform documentation
Config policies (16 files): alert_routing, architecture_pressure, backlog, cost_weights, data_governance, incident_escalation, incident_intelligence, network_allowlist, nodes_registry, observability_sources, rbac_tools_matrix, release_gate, risk_attribution, risk_policy, slo_policy, tool_limits, tools_rollout

Ops (22 files): Caddyfile, calendar compose, grafana voice dashboard, deployments/incidents logs, runbooks for alerts/audit/backlog/incidents/sofiia/voice, cron jobs, scripts (alert_triage, audit_cleanup, migrate_*, governance, schedule), task_registry, voice alerts/ha/latency/policy

Docs (30+ files): HUMANIZED_STEPAN v2.7-v3 changelogs and runbooks, NODA1/NODA2 status and setup, audit index and traces, backlog, incident, supervisor, tools, voice, opencode, release, risk, aistalk, spacebot

Made-with: Cursor
206
docs/risk/risk_index.md
Normal file
@@ -0,0 +1,206 @@
# Service Risk Index

> Deterministic. No LLM. Production-grade.

## Overview

The Risk Index Engine computes a **numerical risk score (0–100+)** for every tracked service. It is the single authoritative metric for service health in the DAARION.city control plane.

Score → Band mapping:

| Score | Band     | Meaning                                   |
|-------|----------|-------------------------------------------|
| 0–20  | low      | No significant signals                    |
| 21–50 | medium   | Minor signals; monitor                    |
| 51–80 | high     | Active problems; coordinate before deploy |
| 81+   | critical | Block or escalate immediately             |
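
The mapping above can be sketched as a small pure function. The name `score_to_band` appears in the architecture diagram later in this document, but the signature and the default arguments here are assumptions taken from the table, not the engine's actual code.

```python
def score_to_band(score: int, low_max: int = 20, medium_max: int = 50,
                  high_max: int = 80) -> str:
    """Map a risk score to its band; each *_max bound is inclusive."""
    if score <= low_max:
        return "low"
    if score <= medium_max:
        return "medium"
    if score <= high_max:
        return "high"
    # Anything above high_max is critical; the score has no stated upper bound.
    return "critical"


for value in (0, 20, 21, 50, 51, 80, 81, 140):
    print(value, score_to_band(value))
```

Keeping the bounds as parameters mirrors the fact that the band limits are policy-driven rather than hard-coded.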

---

## Scoring Formula

```
Risk(service) = Σ weight(signal) × count_or_flag(signal)
```

All weights are policy-driven via `config/risk_policy.yml`.

### Signal weights (defaults)

| Signal                        | Points          |
|-------------------------------|-----------------|
| Open P0 incident              | 50 each         |
| Open P1 incident              | 25 each         |
| Open P2 incident              | 10 each         |
| Open P3 incident              | 5 each          |
| High recurrence signature 7d  | 20 each         |
| Warn recurrence signature 7d  | 10 each         |
| High recurrence kind 7d       | 15 each         |
| Warn recurrence kind 7d       | 8 each          |
| High recurrence signature 30d | 10 each         |
| High recurrence kind 30d      | 8 each          |
| Overdue follow-up P0          | 20 each         |
| Overdue follow-up P1          | 12 each         |
| Overdue follow-up other       | 6 each          |
| Active SLO violation (60m)    | 10 each         |
| Alert-loop SLO violation      | 10 each         |
| Escalations 24h (1–2)         | 5 (warn level)  |
| Escalations 24h (3+)          | 12 (high level) |
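
As a worked example of the formula, the sketch below sums the default weights for a hypothetical service. The dictionary keys and function names are illustrative assumptions; only the point values come from the table above.

```python
# Default points per signal, copied from the table above.
# The key names are illustrative, not the real keys in config/risk_policy.yml.
DEFAULT_WEIGHTS = {
    "open_p0": 50, "open_p1": 25, "open_p2": 10, "open_p3": 5,
    "recurrence_signature_high_7d": 20, "recurrence_signature_warn_7d": 10,
    "recurrence_kind_high_7d": 15, "recurrence_kind_warn_7d": 8,
    "recurrence_signature_high_30d": 10, "recurrence_kind_high_30d": 8,
    "overdue_followup_p0": 20, "overdue_followup_p1": 12,
    "overdue_followup_other": 6,
    "slo_violation": 10, "alert_loop_slo_violation": 10,
}


def escalation_points(count_24h: int) -> int:
    """Escalations score as a level flag, not per occurrence."""
    if count_24h >= 3:
        return 12  # high level
    if count_24h >= 1:
        return 5   # warn level
    return 0


def risk_score(counts: dict, escalations_24h: int = 0) -> int:
    """Sum weight x count over all signals, then add the escalation flag."""
    per_signal = sum(DEFAULT_WEIGHTS[name] * n for name, n in counts.items())
    return per_signal + escalation_points(escalations_24h)


example = {
    "open_p1": 1,
    "recurrence_signature_high_7d": 1,
    "overdue_followup_p1": 1,
    "slo_violation": 1,
}
print(risk_score(example, escalations_24h=1))  # 25 + 20 + 12 + 10 + 5 = 72
```

A score of 72 falls in the high band (51–80).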

---

## Configuration

**`config/risk_policy.yml`** controls all weights, thresholds, and per-service overrides.

```yaml
thresholds:
  bands:
    low_max: 20
    medium_max: 50
    high_max: 80
  risk_watch:
    warn_at: 50
    fail_at: 80

service_overrides:
  gateway:
    risk_watch:
      fail_at: 75   # gateway fails earlier: critical path

p0_services:
  - gateway
  - router
```

Changes to the file take effect on the next request (the cache is not long-lived).
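
One plausible way to resolve per-service overrides is to start from the global `risk_watch` thresholds and overlay whatever the service overrides. This is a minimal sketch, assuming the policy has already been parsed into a dict with the nesting shown above; `effective_risk_watch` is a hypothetical helper name, not part of the engine.

```python
# Parsed form of the policy, written as a literal so the sketch needs no YAML library.
POLICY = {
    "thresholds": {
        "bands": {"low_max": 20, "medium_max": 50, "high_max": 80},
        "risk_watch": {"warn_at": 50, "fail_at": 80},
    },
    "service_overrides": {"gateway": {"risk_watch": {"fail_at": 75}}},
    "p0_services": ["gateway", "router"],
}


def effective_risk_watch(policy: dict, service: str) -> dict:
    """Return global risk_watch thresholds with the service's overrides applied."""
    merged = dict(policy["thresholds"]["risk_watch"])  # copy; never mutate the policy
    override = (
        policy.get("service_overrides", {}).get(service, {}).get("risk_watch", {})
    )
    merged.update(override)
    return merged


print(effective_risk_watch(POLICY, "gateway"))  # {'warn_at': 50, 'fail_at': 75}
print(effective_risk_watch(POLICY, "router"))   # {'warn_at': 50, 'fail_at': 80}
```

Only the overridden key changes, so `gateway` keeps the global `warn_at` while failing earlier.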

---

## API

### `GET /v1/risk/service/{service}?env=prod&window_hours=24`

Returns a `RiskReport`:

```json
{
  "service": "gateway",
  "env": "prod",
  "score": 72,
  "band": "high",
  "thresholds": { "warn_at": 50, "fail_at": 75 },
  "components": {
    "open_incidents": { "P0": 0, "P1": 1, "P2": 0, "points": 25 },
    "recurrence": { "high_signatures_7d": 1, "points": 20 },
    "followups": { "overdue_P1": 1, "points": 12 },
    "slo": { "violations": 1, "points": 10 },
    "alerts_loop": { "violations": 0, "points": 0 },
    "escalations": { "count_24h": 1, "points": 5 }
  },
  "reasons": [
    "Open P1 incident(s): 1",
    "High recurrence signatures (7d): 1",
    "Overdue follow-ups (P1): 1",
    "Active SLO violation(s) in window: 1",
    "Escalations in last 24h: 1"
  ],
  "recommendations": [
    "Prioritize open P0/P1 incidents before deploying.",
    "Investigate recurring failure patterns.",
    "Avoid risky deploys until SLO violation clears.",
    "Service is high-risk — coordinate with oncall before release."
  ],
  "updated_at": "2026-02-23T12:00:00"
}
```

RBAC required: `tools.risk.read` (granted to `agent_cto`, `agent_oncall`, `agent_monitor`).
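
A client call might look like the sketch below, which uses only the Python standard library. The base URL is a placeholder, both function names are hypothetical, and the way credentials for `tools.risk.read` are passed is not described in this document, so no auth header is shown.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def service_risk_url(base_url: str, service: str, env: str = "prod",
                     window_hours: int = 24) -> str:
    """Build the request URL for the per-service risk endpoint."""
    query = urlencode({"env": env, "window_hours": window_hours})
    path = f"/v1/risk/service/{quote(service, safe='')}"
    return f"{base_url.rstrip('/')}{path}?{query}"


def fetch_service_risk(base_url: str, service: str, timeout: float = 5.0) -> dict:
    """Fetch and decode a RiskReport. Authentication is deliberately omitted."""
    with urlopen(service_risk_url(base_url, service), timeout=timeout) as resp:
        return json.load(resp)


# Placeholder host; replace with the real control-plane address.
print(service_risk_url("http://localhost:8000", "gateway"))
```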

### `GET /v1/risk/dashboard?env=prod&top_n=10`

Returns top-N services by score with band summary:

```json
{
  "env": "prod",
  "generated_at": "...",
  "total_services": 4,
  "band_counts": { "critical": 1, "high": 1, "medium": 2, "low": 0 },
  "critical_p0_services": ["gateway"],
  "services": [ ...RiskReports sorted by score desc... ]
}
```
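
The aggregation behind this response could be sketched as follows. `build_dashboard` is a hypothetical name, and two details are assumptions: that `band_counts` covers all services rather than only the top N, and that `critical_p0_services` lists the `p0_services` currently in the critical band. `generated_at` is left out to keep the sketch deterministic.

```python
from collections import Counter

BANDS = ("critical", "high", "medium", "low")


def build_dashboard(reports: list, p0_services: list, env: str = "prod",
                    top_n: int = 10) -> dict:
    """Summarize RiskReport-like dicts (each needs service, score, band)."""
    ranked = sorted(reports, key=lambda r: r["score"], reverse=True)
    counts = Counter(r["band"] for r in ranked)
    return {
        "env": env,
        "total_services": len(ranked),
        # Always emit every band so consumers can rely on the keys.
        "band_counts": {band: counts.get(band, 0) for band in BANDS},
        "critical_p0_services": [
            r["service"] for r in ranked
            if r["band"] == "critical" and r["service"] in p0_services
        ],
        "services": ranked[:top_n],
    }


reports = [
    {"service": "router", "score": 60, "band": "high"},
    {"service": "gateway", "score": 90, "band": "critical"},
    {"service": "memory", "score": 30, "band": "medium"},
    {"service": "voice", "score": 25, "band": "medium"},
]
summary = build_dashboard(reports, p0_services=["gateway", "router"])
print(summary["band_counts"], summary["critical_p0_services"])
```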

### Tool: `risk_engine_tool`

```json
{ "action": "service", "service": "gateway", "env": "prod" }
{ "action": "dashboard", "env": "prod", "top_n": 10 }
{ "action": "policy" }
```

---

## Release Gate: `risk_watch`

The `risk_watch` gate integrates the Risk Index into the release pipeline.

### Behaviour

| Mode   | When score ≥ warn_at (default 50) | When score ≥ fail_at (default 80) |
|--------|-----------------------------------|-----------------------------------|
| warn   | pass=true + recommendations added | pass=true + recommendations added |
| strict | pass=true + recommendations added | **pass=false** (deploy blocked)   |
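
The table can be read as the decision function sketched below. The function name, the result shape, and the recommendation wording are illustrative assumptions; the sketch also ignores any restriction of strict mode to particular services, since the table does not state one.

```python
def evaluate_risk_watch(score: int, mode: str, warn_at: int = 50,
                        fail_at: int = 80) -> dict:
    """Apply the warn/strict rules from the behaviour table to one score."""
    result = {"pass": True, "recommendations": []}
    if score >= warn_at:
        # Both modes attach recommendations once the warn threshold is reached.
        result["recommendations"].append(
            f"Risk score {score} is at or above warn_at={warn_at}; review before deploy."
        )
    if score >= fail_at and mode == "strict":
        # Only strict mode turns a fail-level score into a blocked deploy.
        result["pass"] = False
    return result


print(evaluate_risk_watch(90, "warn")["pass"])    # True
print(evaluate_risk_watch(90, "strict")["pass"])  # False
print(evaluate_risk_watch(60, "strict")["pass"])  # True
```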

### Policy

```yaml
# config/release_gate_policy.yml
dev:
  risk_watch: { mode: "warn" }
staging:
  risk_watch: { mode: "strict" }   # blocks p0_services when score >= fail_at
prod:
  risk_watch: { mode: "warn" }
```

### Non-fatal guarantee

If the Risk Engine is unavailable (store down, timeout, error), `risk_watch` is **skipped** and never blocks the release. A warning is added to the gate output.
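
A minimal sketch of that guarantee wraps the score lookup so that any failure degrades to a passing, skipped result. The function name and the `skipped` and `warnings` fields are assumptions for illustration, not the gate's documented output format.

```python
from typing import Callable


def run_risk_watch(fetch_score: Callable[[], int],
                   evaluate: Callable[[int], dict]) -> dict:
    """Run the gate, but never let a Risk Engine failure block the release."""
    try:
        score = fetch_score()
    except Exception as exc:  # store down, timeout, or any other engine error
        return {
            "pass": True,
            "skipped": True,
            "warnings": [f"risk_watch skipped: {exc}"],
        }
    return evaluate(score)


def unavailable_engine() -> int:
    raise TimeoutError("risk store timed out")


print(run_risk_watch(unavailable_engine, lambda s: {"pass": s < 80}))
print(run_risk_watch(lambda: 90, lambda s: {"pass": s < 80}))
```

Catching the broad `Exception` is deliberate here: the stated guarantee covers every failure mode, so the gate should not depend on which error the engine raises.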

### Release inputs

| Input                | Type    | Default | Description             |
|----------------------|---------|---------|-------------------------|
| `run_risk_watch`     | boolean | true    | Enable/disable the gate |
| `risk_watch_env`     | string  | prod    | Env to score against    |
| `risk_watch_warn_at` | int     | policy  | Override warn threshold |
| `risk_watch_fail_at` | int     | policy  | Override fail threshold |

---

## Architecture

```
[Incident Store]──open incidents──┐
[Intelligence]──recurrence 7d/30d─┤
[Followups Summary]──overdue──────┤──► risk_engine.py ──► RiskReport
[SLO Snapshot]──violations────────┤                           │
[Alert Store]──loop SLO───────────┤                     score_to_band
[Decision Events]──escalations────┘                           │
                                                    release_check_runner
                                                       risk_watch gate
```

The engine makes **zero LLM calls**. It is deterministic: the same signals always produce the same score.

---

## Testing

```bash
pytest tests/test_risk_engine.py                 # scoring + bands + overrides
pytest tests/test_risk_dashboard.py              # sorting + band counts + p0 detection
pytest tests/test_release_check_risk_watch.py    # warn/strict/non-fatal gate
```