microdao-daarion/docs/risk/risk_index.md
Apple 67225a39fa docs(platform): add policy configs, runbooks, ops scripts and platform documentation
Config policies (16 files): alert_routing, architecture_pressure, backlog,
cost_weights, data_governance, incident_escalation, incident_intelligence,
network_allowlist, nodes_registry, observability_sources, rbac_tools_matrix,
release_gate, risk_attribution, risk_policy, slo_policy, tool_limits, tools_rollout

Ops (22 files): Caddyfile, calendar compose, grafana voice dashboard,
deployments/incidents logs, runbooks for alerts/audit/backlog/incidents/sofiia/voice,
cron jobs, scripts (alert_triage, audit_cleanup, migrate_*, governance, schedule),
task_registry, voice alerts/ha/latency/policy

Docs (30+ files): HUMANIZED_STEPAN v2.7-v3 changelogs and runbooks,
NODA1/NODA2 status and setup, audit index and traces, backlog, incident,
supervisor, tools, voice, opencode, release, risk, aistalk, spacebot

Made-with: Cursor
2026-03-03 07:14:53 -08:00
# Service Risk Index
> Deterministic. No LLM. Production-grade.
## Overview
The Risk Index Engine computes a **numerical risk score (0–100+)** for every tracked service. It is the single authoritative metric for service health in the DAARION.city control plane.

Score → Band mapping:

| Score  | Band     | Meaning                                   |
|--------|----------|-------------------------------------------|
| 0–20   | low      | No significant signals                    |
| 21–50  | medium   | Minor signals; monitor                    |
| 51–80  | high     | Active problems; coordinate before deploy |
| 81+    | critical | Block or escalate immediately             |
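
The band table can be read as a chain of inclusive upper bounds. The following is a minimal sketch of that mapping, not the engine's actual code: the name `score_to_band` appears in this document's architecture diagram, but the signature and the keyword defaults here are assumptions that mirror the table.

```python
def score_to_band(score: int, low_max: int = 20, medium_max: int = 50,
                  high_max: int = 80) -> str:
    """Map a risk score to a band using inclusive upper bounds (assumed signature)."""
    if score <= low_max:
        return "low"
    if score <= medium_max:
        return "medium"
    if score <= high_max:
        return "high"
    # Anything above high_max is critical; the score has no upper cap.
    return "critical"


if __name__ == "__main__":
    for s in (0, 20, 21, 50, 51, 80, 81, 140):
        print(s, score_to_band(s))
```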
---
## Scoring Formula
```
Risk(service) = Σ weight(signal) × count_or_flag(signal)
```
All weights are policy-driven via `config/risk_policy.yml`.
### Signal weights (defaults)
| Signal | Points |
|-------------------------------|-------------------------------|
| Open P0 incident | 50 each |
| Open P1 incident | 25 each |
| Open P2 incident | 10 each |
| Open P3 incident | 5 each |
| High recurrence signature 7d | 20 each |
| Warn recurrence signature 7d | 10 each |
| High recurrence kind 7d | 15 each |
| Warn recurrence kind 7d | 8 each |
| High recurrence signature 30d | 10 each |
| High recurrence kind 30d | 8 each |
| Overdue follow-up P0 | 20 each |
| Overdue follow-up P1 | 12 each |
| Overdue follow-up other | 6 each |
| Active SLO violation (60m) | 10 each |
| Alert-loop SLO violation | 10 each |
| Escalations 24h (1–2)         | 5 (warn level)                |
| Escalations 24h (3+) | 12 (high level) |
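
As a worked illustration of the formula with the default weights, here is a small sketch. The signal key names (`open_p1`, `sig_high_7d`, and so on) and the helper names are hypothetical; only the point values come from the table. Escalations are treated as a tiered flag rather than a per-item count, reading the warn tier as one or two escalations in 24h.

```python
# Default weights from the table; the key names are illustrative, not the policy's.
DEFAULT_WEIGHTS = {
    "open_p0": 50, "open_p1": 25, "open_p2": 10, "open_p3": 5,
    "sig_high_7d": 20, "sig_warn_7d": 10,
    "kind_high_7d": 15, "kind_warn_7d": 8,
    "sig_high_30d": 10, "kind_high_30d": 8,
    "overdue_p0": 20, "overdue_p1": 12, "overdue_other": 6,
    "slo_violation": 10, "alert_loop_violation": 10,
}


def escalation_points(count_24h: int) -> int:
    """Tiered score: 3+ escalations is the high level, 1-2 the warn level."""
    if count_24h >= 3:
        return 12
    if count_24h >= 1:
        return 5
    return 0


def risk_score(signals: dict, escalations_24h: int = 0,
               weights: dict = DEFAULT_WEIGHTS) -> int:
    """Sum weight x count over per-item signals, then add the escalation tier."""
    total = sum(weights[name] * count for name, count in signals.items())
    return total + escalation_points(escalations_24h)


if __name__ == "__main__":
    signals = {"open_p1": 1, "sig_high_7d": 1, "overdue_p1": 1, "slo_violation": 1}
    print(risk_score(signals, escalations_24h=1))  # 25 + 20 + 12 + 10 + 5 = 72
```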
---
## Configuration
**`config/risk_policy.yml`** — controls all weights, thresholds, and per-service overrides.
```yaml
thresholds:
  bands:
    low_max: 20
    medium_max: 50
    high_max: 80
  risk_watch:
    warn_at: 50
    fail_at: 80
service_overrides:
  gateway:
    risk_watch:
      fail_at: 75   # gateway fails earlier: critical path
p0_services:
  - gateway
  - router
```
Changes to the file take effect on the next request (the policy cache is not long-lived).

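
To make the override behaviour concrete, the sketch below resolves the effective `risk_watch` thresholds for a service. It assumes the policy has already been parsed into a dict and that `risk_watch` is nested under `thresholds` and under each entry of `service_overrides`; the function name `risk_watch_thresholds` is hypothetical.

```python
# Parsed form of the policy above (nesting is an assumption about the YAML layout).
POLICY = {
    "thresholds": {
        "bands": {"low_max": 20, "medium_max": 50, "high_max": 80},
        "risk_watch": {"warn_at": 50, "fail_at": 80},
    },
    "service_overrides": {"gateway": {"risk_watch": {"fail_at": 75}}},
    "p0_services": ["gateway", "router"],
}


def risk_watch_thresholds(policy: dict, service: str) -> dict:
    """Start from the global thresholds, then apply any per-service override."""
    effective = dict(policy["thresholds"]["risk_watch"])  # copy, do not mutate policy
    override = (
        policy.get("service_overrides", {}).get(service, {}).get("risk_watch", {})
    )
    effective.update(override)
    return effective


if __name__ == "__main__":
    print(risk_watch_thresholds(POLICY, "gateway"))  # fail_at lowered to 75
    print(risk_watch_thresholds(POLICY, "router"))   # global defaults
```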
---
## API
### `GET /v1/risk/service/{service}?env=prod&window_hours=24`
Returns a `RiskReport`:
```json
{
"service": "gateway",
"env": "prod",
"score": 72,
"band": "high",
"thresholds": { "warn_at": 50, "fail_at": 75 },
"components": {
"open_incidents": { "P0": 0, "P1": 1, "P2": 0, "points": 25 },
"recurrence": { "high_signatures_7d": 1, "points": 20 },
"followups": { "overdue_P1": 1, "points": 12 },
"slo": { "violations": 1, "points": 10 },
"alerts_loop": { "violations": 0, "points": 0 },
"escalations": { "count_24h": 1, "points": 5 }
},
"reasons": [
"Open P1 incident(s): 1",
"High recurrence signatures (7d): 1",
"Overdue follow-ups (P1): 1",
"Active SLO violation(s) in window: 1",
"Escalations in last 24h: 1"
],
"recommendations": [
"Prioritize open P0/P1 incidents before deploying.",
"Investigate recurring failure patterns.",
"Avoid risky deploys until SLO violation clears.",
"Service is high-risk — coordinate with oncall before release."
],
"updated_at": "2026-02-23T12:00:00"
}
```
RBAC required: `tools.risk.read` (granted to `agent_cto`, `agent_oncall`, `agent_monitor`).
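
A client call might look like the sketch below, which uses only the Python standard library. The base URL is a placeholder, the helper names are hypothetical, and authentication is omitted because this document does not describe how the `tools.risk.read` permission is presented on the wire.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def service_risk_url(base_url: str, service: str, env: str = "prod",
                     window_hours: int = 24) -> str:
    """Build the URL for GET /v1/risk/service/{service}."""
    query = urlencode({"env": env, "window_hours": window_hours})
    return f"{base_url.rstrip('/')}/v1/risk/service/{quote(service, safe='')}?{query}"


def fetch_risk_report(base_url: str, service: str, env: str = "prod",
                      window_hours: int = 24, timeout: float = 10.0) -> dict:
    """Fetch and decode a RiskReport (no auth headers; add whatever the deployment needs)."""
    url = service_risk_url(base_url, service, env, window_hours)
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # control-plane.example is a placeholder host, not a real endpoint.
    print(service_risk_url("https://control-plane.example", "gateway"))
```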
### `GET /v1/risk/dashboard?env=prod&top_n=10`
Returns top-N services by score with band summary:
```json
{
"env": "prod",
"generated_at": "...",
"total_services": 4,
"band_counts": { "critical": 1, "high": 1, "medium": 2, "low": 0 },
"critical_p0_services": ["gateway"],
"services": [ ...RiskReports sorted by score desc... ]
}
```
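
The dashboard payload can be thought of as an aggregation over individual reports. The sketch below is one plausible way to build it, under two assumptions that this document does not confirm: `band_counts` covers all services rather than just the top N, and `critical_p0_services` lists services from `p0_services` whose band is `critical`. The function name `build_dashboard` is hypothetical, and `generated_at` is left out to keep the example deterministic.

```python
from collections import Counter

BANDS = ("critical", "high", "medium", "low")


def build_dashboard(reports: list, env: str, p0_services: set, top_n: int = 10) -> dict:
    """Aggregate RiskReport-like dicts into a dashboard summary (illustrative only)."""
    ranked = sorted(reports, key=lambda r: r["score"], reverse=True)
    counts = Counter(r["band"] for r in ranked)
    return {
        "env": env,
        "total_services": len(ranked),
        "band_counts": {band: counts.get(band, 0) for band in BANDS},
        "critical_p0_services": [
            r["service"] for r in ranked
            if r["band"] == "critical" and r["service"] in p0_services
        ],
        "services": ranked[:top_n],
    }


if __name__ == "__main__":
    reports = [
        {"service": "router", "score": 60, "band": "high"},
        {"service": "gateway", "score": 92, "band": "critical"},
        {"service": "svc-a", "score": 30, "band": "medium"},
        {"service": "svc-b", "score": 25, "band": "medium"},
    ]
    print(build_dashboard(reports, "prod", {"gateway", "router"}, top_n=2))
```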
### Tool: `risk_engine_tool`
```json
{ "action": "service", "service": "gateway", "env": "prod" }
{ "action": "dashboard", "env": "prod", "top_n": 10 }
{ "action": "policy" }
```
---
## Release Gate: `risk_watch`
The `risk_watch` gate integrates Risk Index into the release pipeline.
### Behaviour
| Mode | When score ≥ warn_at (default 50) | When score ≥ fail_at (default 80) |
|--------|------------------------------------|-------------------------------------|
| warn | pass=true + recommendations added | pass=true + recommendations added |
| strict | pass=true + recommendations added | **pass=false** — deploy blocked |
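
The behaviour table reduces to two comparisons. The following sketch encodes it directly; the function name and result shape are assumptions, and any restriction of strict mode to particular services (such as `p0_services`) is not modeled here.

```python
def evaluate_risk_watch(score: int, warn_at: int = 50, fail_at: int = 80,
                        mode: str = "warn", recommendations: tuple = ()) -> dict:
    """Return the gate outcome for a score, following the behaviour table."""
    result = {"pass": True, "recommendations": []}
    if score >= warn_at:
        # Both modes surface recommendations once the warn threshold is reached.
        result["recommendations"] = list(recommendations)
    if mode == "strict" and score >= fail_at:
        # Only strict mode can block a deploy.
        result["pass"] = False
    return result


if __name__ == "__main__":
    recs = ("Coordinate with oncall before release.",)
    print(evaluate_risk_watch(72, fail_at=75, mode="strict", recommendations=recs))
    print(evaluate_risk_watch(76, fail_at=75, mode="strict", recommendations=recs))
    print(evaluate_risk_watch(90, mode="warn", recommendations=recs))
```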
### Policy
```yaml
# config/release_gate_policy.yml
dev:
  risk_watch: { mode: "warn" }
staging:
  risk_watch: { mode: "strict" }   # blocks p0_services when score >= fail_at
prod:
  risk_watch: { mode: "warn" }
```
### Non-fatal guarantee
If the Risk Engine is unavailable (store down, timeout, error), `risk_watch` is **skipped** — never blocks. A warning is added to the gate output.
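
One way to picture the non-fatal guarantee is a wrapper that converts any engine failure into a skipped, passing result with a warning. This is a sketch only: `run_risk_watch`, its callable parameters, and the result keys are hypothetical and do not describe the actual `release_check_runner` code.

```python
from typing import Callable


def run_risk_watch(fetch_report: Callable[[], dict],
                   evaluate: Callable[[dict], dict]) -> dict:
    """Run the gate, but never let a Risk Engine failure block the release."""
    try:
        report = fetch_report()
    except Exception as exc:  # store down, timeout, or any other engine error
        return {
            "pass": True,
            "skipped": True,
            "recommendations": [],
            "warnings": [f"risk_watch skipped: {type(exc).__name__}: {exc}"],
        }
    outcome = evaluate(report)
    return {
        "pass": outcome["pass"],
        "skipped": False,
        "recommendations": outcome.get("recommendations", []),
        "warnings": [],
    }


if __name__ == "__main__":
    def engine_down() -> dict:
        raise TimeoutError("risk store timed out")

    print(run_risk_watch(engine_down, lambda report: {"pass": False}))
```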
### Release inputs
| Input | Type | Default | Description |
|--------------------|---------|---------|----------------------------------------------|
| `run_risk_watch` | boolean | true | Enable/disable the gate |
| `risk_watch_env` | string | prod | Env to score against |
| `risk_watch_warn_at` | int | policy | Override warn threshold |
| `risk_watch_fail_at` | int | policy | Override fail threshold |
---
## Architecture
```
[Incident Store]──open incidents──┐
[Intelligence]──recurrence 7d/30d─┤
[Followups Summary]──overdue──────┤──► risk_engine.py ──► RiskReport
[SLO Snapshot]──violations────────┤                            │
[Alert Store]──loop SLO───────────┤                      score_to_band
[Decision Events]──escalations────┘                            │
                                                     release_check_runner
                                                        risk_watch gate
```
The engine has **zero LLM calls**. It is deterministic: given the same signals, the same score is always produced.

---
## Testing
```bash
pytest tests/test_risk_engine.py # scoring + bands + overrides
pytest tests/test_risk_dashboard.py # sorting + band counts + p0 detection
pytest tests/test_release_check_risk_watch.py # warn/strict/non-fatal gate
```