# Service Risk Index

> Deterministic. No LLM. Production-grade.

## Overview

The Risk Index Engine computes a **numerical risk score (0–100+)** for every tracked service. It is the single authoritative metric for service health in the DAARION.city control plane.

Score → Band mapping:

| Score | Band     | Meaning                                   |
|-------|----------|-------------------------------------------|
| 0–20  | low      | No significant signals                    |
| 21–50 | medium   | Minor signals; monitor                    |
| 51–80 | high     | Active problems; coordinate before deploy |
| 81+   | critical | Block or escalate immediately             |

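The band boundaries in the table above can be read as inclusive upper bounds. The following is a minimal sketch of that mapping, not the engine's actual code: the name `score_to_band` appears in the architecture diagram, but the signature and the default arguments used here are assumptions.

```python
def score_to_band(score: int, low_max: int = 20, medium_max: int = 50,
                  high_max: int = 80) -> str:
    """Map a numeric risk score to a band using inclusive upper bounds.

    The parameter names mirror the `bands` keys in the policy file;
    the signature itself is an assumption made for this sketch.
    """
    if score <= low_max:
        return "low"
    if score <= medium_max:
        return "medium"
    if score <= high_max:
        return "high"
    return "critical"  # anything above high_max, with no upper cap


for s in (0, 20, 21, 50, 51, 80, 81, 130):
    print(s, score_to_band(s))
```

Passing the bounds as arguments keeps the function pure, so the same score and the same policy always give the same band.
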
---

## Scoring Formula

```
Risk(service) = Σ weight(signal) × count_or_flag(signal)
```

All weights are policy-driven via `config/risk_policy.yml`.

### Signal weights (defaults)

| Signal                        | Points          |
|-------------------------------|-----------------|
| Open P0 incident              | 50 each         |
| Open P1 incident              | 25 each         |
| Open P2 incident              | 10 each         |
| Open P3 incident              | 5 each          |
| High recurrence signature 7d  | 20 each         |
| Warn recurrence signature 7d  | 10 each         |
| High recurrence kind 7d       | 15 each         |
| Warn recurrence kind 7d       | 8 each          |
| High recurrence signature 30d | 10 each         |
| High recurrence kind 30d      | 8 each          |
| Overdue follow-up P0          | 20 each         |
| Overdue follow-up P1          | 12 each         |
| Overdue follow-up other       | 6 each          |
| Active SLO violation (60m)    | 10 each         |
| Alert-loop SLO violation      | 10 each         |
| Escalations 24h (1–2)         | 5 (warn level)  |
| Escalations 24h (3+)          | 12 (high level) |

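As a worked example of the formula, the sketch below sums the default weights from the table for one hypothetical service. The signal key names and the function names are assumptions made for illustration; only the point values come from the table. Escalations are handled separately because they are tiered rather than counted per item.

```python
from typing import Mapping

# Point values copied from the defaults table; the key names are hypothetical.
DEFAULT_WEIGHTS = {
    "open_p0": 50, "open_p1": 25, "open_p2": 10, "open_p3": 5,
    "recurrence_sig_high_7d": 20, "recurrence_sig_warn_7d": 10,
    "recurrence_kind_high_7d": 15, "recurrence_kind_warn_7d": 8,
    "recurrence_sig_high_30d": 10, "recurrence_kind_high_30d": 8,
    "followup_overdue_p0": 20, "followup_overdue_p1": 12,
    "followup_overdue_other": 6,
    "slo_violation": 10, "alert_loop_slo_violation": 10,
}


def escalation_points(count_24h: int) -> int:
    """Tiered signal: 1-2 escalations score 5, 3 or more score 12."""
    if count_24h >= 3:
        return 12
    if count_24h >= 1:
        return 5
    return 0


def risk_score(counts: Mapping[str, int], escalations_24h: int = 0,
               weights: Mapping[str, int] = DEFAULT_WEIGHTS) -> int:
    """Sum weight x count over per-item signals, then add the escalation tier."""
    per_item = sum(weights[name] * n for name, n in counts.items())
    return per_item + escalation_points(escalations_24h)


signals = {"open_p1": 1, "recurrence_sig_high_7d": 1,
           "followup_overdue_p1": 1, "slo_violation": 1}
print(risk_score(signals, escalations_24h=1))  # 25 + 20 + 12 + 10 + 5 = 72
```

An unknown signal name raises `KeyError` in this sketch, which is one way to keep a typo in a signal name from silently scoring zero.
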
---

## Configuration

**`config/risk_policy.yml`** controls all weights, thresholds, and per-service overrides.

```yaml
thresholds:
  bands:
    low_max: 20
    medium_max: 50
    high_max: 80
  risk_watch:
    warn_at: 50
    fail_at: 80

service_overrides:
  gateway:
    risk_watch:
      fail_at: 75   # gateway fails earlier: critical path

p0_services:
  - gateway
  - router
```

Changes to the file take effect on the next request (the policy cache is short-lived).

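Per-service overrides can be read as a shallow merge over the global `risk_watch` thresholds. The sketch below illustrates that reading with the policy inlined as a Python dict; the nesting (`thresholds.risk_watch`, `service_overrides.<service>.risk_watch`) and the helper name `effective_risk_watch` are assumptions, not the engine's confirmed loader.

```python
# Policy inlined as a dict so the example needs no YAML parser.
# The nesting is an assumption about how the policy file is structured.
POLICY = {
    "thresholds": {
        "bands": {"low_max": 20, "medium_max": 50, "high_max": 80},
        "risk_watch": {"warn_at": 50, "fail_at": 80},
    },
    "service_overrides": {"gateway": {"risk_watch": {"fail_at": 75}}},
    "p0_services": ["gateway", "router"],
}


def effective_risk_watch(policy: dict, service: str) -> dict:
    """Return global risk_watch thresholds with any service override applied."""
    merged = dict(policy["thresholds"]["risk_watch"])  # copy, never mutate policy
    override = (
        policy.get("service_overrides", {})
        .get(service, {})
        .get("risk_watch", {})
    )
    merged.update(override)
    return merged


print(effective_risk_watch(POLICY, "gateway"))  # {'warn_at': 50, 'fail_at': 75}
print(effective_risk_watch(POLICY, "router"))   # {'warn_at': 50, 'fail_at': 80}
```

Only the keys named in an override change; everything else falls back to the global value, which matches the `gateway` example where `warn_at` stays at 50.
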
---

## API

### `GET /v1/risk/service/{service}?env=prod&window_hours=24`

Returns a `RiskReport`:

```json
{
  "service": "gateway",
  "env": "prod",
  "score": 72,
  "band": "high",
  "thresholds": { "warn_at": 50, "fail_at": 75 },
  "components": {
    "open_incidents": { "P0": 0, "P1": 1, "P2": 0, "points": 25 },
    "recurrence": { "high_signatures_7d": 1, "points": 20 },
    "followups": { "overdue_P1": 1, "points": 12 },
    "slo": { "violations": 1, "points": 10 },
    "alerts_loop": { "violations": 0, "points": 0 },
    "escalations": { "count_24h": 1, "points": 5 }
  },
  "reasons": [
    "Open P1 incident(s): 1",
    "High recurrence signatures (7d): 1",
    "Overdue follow-ups (P1): 1",
    "Active SLO violation(s) in window: 1",
    "Escalations in last 24h: 1"
  ],
  "recommendations": [
    "Prioritize open P0/P1 incidents before deploying.",
    "Investigate recurring failure patterns.",
    "Avoid risky deploys until SLO violation clears.",
    "Service is high-risk — coordinate with oncall before release."
  ],
  "updated_at": "2026-02-23T12:00:00"
}
```

RBAC required: `tools.risk.read` (granted to `agent_cto`, `agent_oncall`, `agent_monitor`).

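For callers, the request URL can be assembled from the path and query parameters shown in the endpoint heading. This is a small sketch using only the standard library; the base URL `http://localhost:8000` and the helper name `service_risk_url` are hypothetical, and authentication for the `tools.risk.read` permission is not modelled.

```python
from urllib.parse import quote, urlencode


def service_risk_url(base_url: str, service: str, env: str = "prod",
                     window_hours: int = 24) -> str:
    """Build the GET URL for a single service's RiskReport."""
    query = urlencode({"env": env, "window_hours": window_hours})
    # Percent-encode the service name so it cannot alter the path.
    path = f"/v1/risk/service/{quote(service, safe='')}"
    return f"{base_url.rstrip('/')}{path}?{query}"


# Hypothetical base URL, used only for illustration.
print(service_risk_url("http://localhost:8000", "gateway"))
# http://localhost:8000/v1/risk/service/gateway?env=prod&window_hours=24
```
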
### `GET /v1/risk/dashboard?env=prod&top_n=10`

Returns the top-N services by score, with a per-band summary:

```json
{
  "env": "prod",
  "generated_at": "...",
  "total_services": 4,
  "band_counts": { "critical": 1, "high": 1, "medium": 2, "low": 0 },
  "critical_p0_services": ["gateway"],
  "services": [ ...RiskReports sorted by score desc... ]
}
```

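The dashboard payload can be derived from a list of per-service reports by sorting, counting bands, and filtering. The sketch below is one way to do that under stated assumptions: `critical_p0_services` is taken to mean services listed in `p0_services` whose band is `critical`, `total_services` is counted before the `top_n` cut, and `generated_at` is left out so the output stays deterministic. The function name and the services `svc-a` and `svc-b` are hypothetical.

```python
from collections import Counter

BANDS = ("critical", "high", "medium", "low")


def build_dashboard(reports, p0_services, env="prod", top_n=10):
    """Aggregate per-service reports into a dashboard-shaped dict."""
    ranked = sorted(reports, key=lambda r: r["score"], reverse=True)
    counts = Counter(r["band"] for r in reports)
    return {
        "env": env,
        "total_services": len(reports),  # counted before truncation
        "band_counts": {band: counts.get(band, 0) for band in BANDS},
        "critical_p0_services": [
            r["service"] for r in ranked
            if r["band"] == "critical" and r["service"] in p0_services
        ],
        "services": ranked[:top_n],
    }


reports = [
    {"service": "router", "score": 60, "band": "high"},
    {"service": "gateway", "score": 92, "band": "critical"},
    {"service": "svc-a", "score": 30, "band": "medium"},
    {"service": "svc-b", "score": 25, "band": "medium"},
]
dash = build_dashboard(reports, p0_services={"gateway", "router"}, top_n=2)
print(dash["band_counts"])
print(dash["critical_p0_services"])
print([r["service"] for r in dash["services"]])
```
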
### Tool: `risk_engine_tool`

```json
{ "action": "service", "service": "gateway", "env": "prod" }
{ "action": "dashboard", "env": "prod", "top_n": 10 }
{ "action": "policy" }
```

---

## Release Gate: `risk_watch`

The `risk_watch` gate integrates the Risk Index into the release pipeline.

### Behaviour

| Mode   | When score ≥ warn_at (default 50) | When score ≥ fail_at (default 80) |
|--------|-----------------------------------|-----------------------------------|
| warn   | pass=true + recommendations added | pass=true + recommendations added |
| strict | pass=true + recommendations added | **pass=false**: deploy blocked    |

### Policy

```yaml
# config/release_gate_policy.yml
dev:
  risk_watch: { mode: "warn" }
staging:
  risk_watch: { mode: "strict" }   # blocks p0_services when score >= fail_at
prod:
  risk_watch: { mode: "warn" }
```

### Non-fatal guarantee

If the Risk Engine is unavailable (store down, timeout, error), `risk_watch` is **skipped** and never blocks the release. A warning is added to the gate output.

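The mode table and the non-fatal guarantee can be combined into one decision function. The following is a hedged sketch, not the actual `release_check_runner` code: the function name `risk_watch_gate`, the result keys, and the recommendation strings are assumptions. It also blocks in strict mode for any service, while the policy comment suggests the real gate may restrict blocking to `p0_services`.

```python
from typing import Callable


def risk_watch_gate(fetch_score: Callable[[], int], mode: str,
                    warn_at: int = 50, fail_at: int = 80) -> dict:
    """Evaluate the gate; any engine failure skips the gate instead of blocking."""
    try:
        score = fetch_score()
    except Exception as exc:  # store down, timeout, or any other engine error
        return {"pass": True, "skipped": True, "recommendations": [],
                "warnings": [f"risk_watch skipped: {exc}"]}

    result = {"pass": True, "skipped": False, "score": score,
              "recommendations": [], "warnings": []}
    if score >= fail_at:
        result["recommendations"].append(
            f"Risk score {score} >= fail_at {fail_at}.")
        if mode == "strict":
            result["pass"] = False  # only strict mode blocks the deploy
    elif score >= warn_at:
        result["recommendations"].append(
            f"Risk score {score} >= warn_at {warn_at}.")
    return result


def engine_down() -> int:
    """Stand-in for an unavailable Risk Engine."""
    raise TimeoutError("risk store timeout")


print(risk_watch_gate(lambda: 85, mode="warn")["pass"])
print(risk_watch_gate(lambda: 85, mode="strict")["pass"])
print(risk_watch_gate(engine_down, mode="strict"))
```

Taking the score through a callable keeps the failure path inside the gate, so callers cannot forget to handle an unavailable engine.
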
### Release inputs

| Input                | Type    | Default | Description                  |
|----------------------|---------|---------|------------------------------|
| `run_risk_watch`     | boolean | true    | Enable/disable the gate      |
| `risk_watch_env`     | string  | prod    | Environment to score against |
| `risk_watch_warn_at` | int     | policy  | Override warn threshold      |
| `risk_watch_fail_at` | int     | policy  | Override fail threshold      |

---

## Architecture

```
[Incident Store]──open incidents──┐
[Intelligence]──recurrence 7d/30d─┤
[Followups Summary]──overdue──────┤──► risk_engine.py ──► RiskReport
[SLO Snapshot]──violations────────┤          │
[Alert Store]──loop SLO───────────┤    score_to_band
[Decision Events]──escalations────┘          │
                                   release_check_runner
                                      risk_watch gate
```

The engine makes **zero LLM calls**. It is deterministic: given the same signals, it always produces the same score.

---

## Testing

```bash
pytest tests/test_risk_engine.py               # scoring + bands + overrides
pytest tests/test_risk_dashboard.py            # sorting + band counts + p0 detection
pytest tests/test_release_check_risk_watch.py  # warn/strict/non-fatal gate
```