Service Risk Index
Deterministic. No LLM. Production-grade.
Overview
The Risk Index Engine computes a numerical risk score (0–100+) for every tracked service. It is the single authoritative metric for service health in the DAARION.city control plane.
Score → Band mapping:
| Score | Band | Meaning |
|---|---|---|
| 0–20 | low | No significant signals |
| 21–50 | medium | Minor signals; monitor |
| 51–80 | high | Active problems; coordinate before deploy |
| 81+ | critical | Block or escalate immediately |
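The band table above can be sketched as a small lookup. This is an illustrative sketch, not the engine's actual code: the name `score_to_band` appears in the architecture diagram, but the signature and the default arguments here are assumptions taken from the band limits documented on this page.

```python
def score_to_band(score: int, low_max: int = 20, medium_max: int = 50,
                  high_max: int = 80) -> str:
    """Map a numeric risk score to a band using inclusive upper bounds."""
    if score <= low_max:
        return "low"
    if score <= medium_max:
        return "medium"
    if score <= high_max:
        return "high"
    # Scores are unbounded above, so anything past high_max is critical.
    return "critical"


print(score_to_band(72))  # high
```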
Scoring Formula
Risk(service) = Σ weight(signal) × count_or_flag(signal)
All weights are policy-driven via config/risk_policy.yml.
Signal weights (defaults)
| Signal | Points |
|---|---|
| Open P0 incident | 50 each |
| Open P1 incident | 25 each |
| Open P2 incident | 10 each |
| Open P3 incident | 5 each |
| High recurrence signature 7d | 20 each |
| Warn recurrence signature 7d | 10 each |
| High recurrence kind 7d | 15 each |
| Warn recurrence kind 7d | 8 each |
| High recurrence signature 30d | 10 each |
| High recurrence kind 30d | 8 each |
| Overdue follow-up P0 | 20 each |
| Overdue follow-up P1 | 12 each |
| Overdue follow-up other | 6 each |
| Active SLO violation (60m) | 10 each |
| Alert-loop SLO violation | 10 each |
| Escalations 24h (1–2) | 5 (warn level) |
| Escalations 24h (3+) | 12 (high level) |
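As a worked example of the formula, the sketch below sums the default weights for a handful of signals. The weight values come from the table above; the dictionary key names and the function names are hypothetical and chosen only for illustration. Escalations are handled as a tiered flag rather than a per-item count, matching the last two table rows.

```python
from typing import Mapping

# Default weights from the table above. Key names are hypothetical.
DEFAULT_WEIGHTS = {
    "open_p0": 50, "open_p1": 25, "open_p2": 10, "open_p3": 5,
    "recurrence_sig_high_7d": 20, "recurrence_sig_warn_7d": 10,
    "recurrence_kind_high_7d": 15, "recurrence_kind_warn_7d": 8,
    "recurrence_sig_high_30d": 10, "recurrence_kind_high_30d": 8,
    "overdue_followup_p0": 20, "overdue_followup_p1": 12,
    "overdue_followup_other": 6,
    "slo_violation": 10, "alert_loop_violation": 10,
}


def escalation_points(count_24h: int) -> int:
    """Tiered escalation score: 3+ is the high level, 1-2 the warn level."""
    if count_24h >= 3:
        return 12
    if count_24h >= 1:
        return 5
    return 0


def risk_score(signals: Mapping[str, int], escalations_24h: int = 0) -> int:
    """Sum weight x count over all signals, then add the escalation tier."""
    total = sum(DEFAULT_WEIGHTS[name] * count for name, count in signals.items())
    return total + escalation_points(escalations_24h)


example = risk_score(
    {"open_p1": 1, "recurrence_sig_high_7d": 1,
     "overdue_followup_p1": 1, "slo_violation": 1},
    escalations_24h=1,
)
print(example)  # 25 + 20 + 12 + 10 + 5 = 72
```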
Configuration
config/risk_policy.yml — controls all weights, thresholds, and per-service overrides.
thresholds:
  bands:
    low_max: 20
    medium_max: 50
    high_max: 80
  risk_watch:
    warn_at: 50
    fail_at: 80

service_overrides:
  gateway:
    risk_watch:
      fail_at: 75   # gateway fails earlier: critical path

p0_services:
  - gateway
  - router
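One way to read the per-service overrides is as a shallow merge over the default `risk_watch` thresholds. The sketch below assumes that `risk_watch` nests under `thresholds` and that overrides replace individual keys. The policy is written as a Python dict mirroring the YAML so the example needs no YAML parser, and the function name `effective_risk_watch` is hypothetical.

```python
# Mirror of the policy file as a plain dict (the nesting is an assumption).
POLICY = {
    "thresholds": {
        "bands": {"low_max": 20, "medium_max": 50, "high_max": 80},
        "risk_watch": {"warn_at": 50, "fail_at": 80},
    },
    "service_overrides": {"gateway": {"risk_watch": {"fail_at": 75}}},
    "p0_services": ["gateway", "router"],
}


def effective_risk_watch(policy: dict, service: str) -> dict:
    """Return warn_at/fail_at for a service, applying any override keys."""
    merged = dict(policy["thresholds"]["risk_watch"])  # copy the defaults
    override = (
        policy.get("service_overrides", {})
        .get(service, {})
        .get("risk_watch", {})
    )
    merged.update(override)  # only the keys present in the override change
    return merged


print(effective_risk_watch(POLICY, "gateway"))  # {'warn_at': 50, 'fail_at': 75}
print(effective_risk_watch(POLICY, "router"))   # {'warn_at': 50, 'fail_at': 80}
```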
Changes to the file take effect on the next request (the policy cache is short-lived).
API
GET /v1/risk/service/{service}?env=prod&window_hours=24
Returns a RiskReport:
{
  "service": "gateway",
  "env": "prod",
  "score": 72,
  "band": "high",
  "thresholds": { "warn_at": 50, "fail_at": 75 },
  "components": {
    "open_incidents": { "P0": 0, "P1": 1, "P2": 0, "points": 25 },
    "recurrence": { "high_signatures_7d": 1, "points": 20 },
    "followups": { "overdue_P1": 1, "points": 12 },
    "slo": { "violations": 1, "points": 10 },
    "alerts_loop": { "violations": 0, "points": 0 },
    "escalations": { "count_24h": 1, "points": 5 }
  },
  "reasons": [
    "Open P1 incident(s): 1",
    "High recurrence signatures (7d): 1",
    "Overdue follow-ups (P1): 1",
    "Active SLO violation(s) in window: 1",
    "Escalations in last 24h: 1"
  ],
  "recommendations": [
    "Prioritize open P0/P1 incidents before deploying.",
    "Investigate recurring failure patterns.",
    "Avoid risky deploys until SLO violation clears.",
    "Service is high-risk — coordinate with oncall before release."
  ],
  "updated_at": "2026-02-23T12:00:00"
}
RBAC required: tools.risk.read (granted to agent_cto, agent_oncall, agent_monitor).
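A minimal client sketch for this endpoint, using only the Python standard library. The path and query parameters come from the route above; the base URL, the function names, and the timeout are assumptions. Authentication is left out because this page does not describe how the `tools.risk.read` permission is presented on the wire.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def risk_service_url(base_url: str, service: str, env: str = "prod",
                     window_hours: int = 24) -> str:
    """Build the GET URL for a single service's RiskReport."""
    query = urlencode({"env": env, "window_hours": window_hours})
    path = f"/v1/risk/service/{quote(service, safe='')}"
    return f"{base_url.rstrip('/')}{path}?{query}"


def fetch_risk_report(base_url: str, service: str, env: str = "prod",
                      window_hours: int = 24, timeout: float = 5.0) -> dict:
    """Fetch and decode a RiskReport (no auth headers: see the note above)."""
    url = risk_service_url(base_url, service, env, window_hours)
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


# Hypothetical base URL, shown only to illustrate the resulting request line.
print(risk_service_url("http://localhost:8000", "gateway"))
```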
GET /v1/risk/dashboard?env=prod&top_n=10
Returns top-N services by score with band summary:
{
  "env": "prod",
  "generated_at": "...",
  "total_services": 4,
  "band_counts": { "critical": 1, "high": 1, "medium": 2, "low": 0 },
  "critical_p0_services": ["gateway"],
  "services": [ ...RiskReports sorted by score desc... ]
}
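The dashboard payload can be thought of as an aggregation over per-service reports. The sketch below shows one plausible way to derive it. The function name `build_dashboard` is hypothetical, `generated_at` is omitted to keep the output deterministic, and treating `critical_p0_services` as "P0 services currently in the critical band" is an assumption inferred from the sample response.

```python
from collections import Counter

BANDS = ("critical", "high", "medium", "low")


def build_dashboard(reports, p0_services, env="prod", top_n=10):
    """Aggregate RiskReport-like dicts into a dashboard summary."""
    p0 = set(p0_services)
    ranked = sorted(reports, key=lambda r: r["score"], reverse=True)
    counts = Counter(r["band"] for r in reports)
    return {
        "env": env,
        "total_services": len(reports),
        # Always emit every band so consumers never hit a missing key.
        "band_counts": {band: counts.get(band, 0) for band in BANDS},
        "critical_p0_services": [
            r["service"] for r in ranked
            if r["band"] == "critical" and r["service"] in p0
        ],
        "services": ranked[:top_n],
    }


reports = [
    {"service": "router", "score": 60, "band": "high"},
    {"service": "gateway", "score": 85, "band": "critical"},
    {"service": "memory", "score": 30, "band": "medium"},
    {"service": "voice", "score": 25, "band": "medium"},
]
dashboard = build_dashboard(reports, ["gateway", "router"], top_n=2)
print(dashboard["band_counts"])
```

The service names `memory` and `voice` are placeholders, not entries from the real registry.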
Tool: risk_engine_tool
{ "action": "service", "service": "gateway", "env": "prod" }
{ "action": "dashboard", "env": "prod", "top_n": 10 }
{ "action": "policy" }
Release Gate: risk_watch
The risk_watch gate integrates the Risk Index into the release pipeline.
Behaviour
| Mode | When score ≥ warn_at (default 50) | When score ≥ fail_at (default 80) |
|---|---|---|
| warn | pass=true + recommendations added | pass=true + recommendations added |
| strict | pass=true + recommendations added | pass=false — deploy blocked |
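The behaviour table can be expressed as a small decision function. This is a sketch under stated assumptions: the function name, the result shape, and the recommendation wording are hypothetical, and any restriction of strict mode to `p0_services` is not modelled here.

```python
def evaluate_risk_watch(score: int, mode: str,
                        warn_at: int = 50, fail_at: int = 80) -> dict:
    """Apply warn/strict semantics to a risk score."""
    result = {"pass": True, "recommendations": []}
    if score >= warn_at:
        result["recommendations"].append(
            f"Risk score {score} is at or above warn_at ({warn_at})."
        )
    if score >= fail_at:
        result["recommendations"].append(
            f"Risk score {score} is at or above fail_at ({fail_at})."
        )
        # Only strict mode turns a fail_at breach into a blocked deploy.
        if mode == "strict":
            result["pass"] = False
    return result


print(evaluate_risk_watch(85, "warn")["pass"])    # True
print(evaluate_risk_watch(85, "strict")["pass"])  # False
```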
Policy
# config/release_gate_policy.yml
dev:
  risk_watch: { mode: "warn" }
staging:
  risk_watch: { mode: "strict" }   # blocks p0_services when score >= fail_at
prod:
  risk_watch: { mode: "warn" }
Non-fatal guarantee
If the Risk Engine is unavailable (store down, timeout, error), risk_watch is skipped — never blocks. A warning is added to the gate output.
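The non-fatal guarantee amounts to wrapping the score lookup so that any failure degrades to a skipped gate with a warning. The sketch below illustrates the pattern. The function name, the `skipped` and `warnings` keys, and the callable-based interface are all hypothetical.

```python
from typing import Callable


def run_risk_watch_safely(fetch_score: Callable[[], int],
                          evaluate: Callable[[int], dict]) -> dict:
    """Run the gate, but never let an engine failure block the release."""
    try:
        score = fetch_score()
    except Exception as exc:  # store down, timeout, or any other error
        return {
            "pass": True,
            "skipped": True,
            "warnings": [f"risk_watch skipped: {exc}"],
        }
    return evaluate(score)


def broken_store() -> int:
    raise TimeoutError("risk store unavailable")


result = run_risk_watch_safely(broken_store, lambda s: {"pass": s < 80})
print(result["pass"], result["warnings"])
```

Catching `Exception` broadly is deliberate here, since the stated contract is that the gate never blocks on engine errors.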
Release inputs
| Input | Type | Default | Description |
|---|---|---|---|
| run_risk_watch | boolean | true | Enable/disable the gate |
| risk_watch_env | string | prod | Env to score against |
| risk_watch_warn_at | int | policy | Override warn threshold |
| risk_watch_fail_at | int | policy | Override fail threshold |
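A default of "policy" for the two threshold inputs suggests that an unset input falls back to the value resolved from the policy file. A minimal sketch of that resolution follows, assuming inputs arrive as a dict with missing or `None` values meaning "not set". The function name and the output keys are hypothetical.

```python
def resolve_gate_inputs(inputs: dict, policy_thresholds: dict) -> dict:
    """Combine release inputs with policy thresholds, inputs taking priority."""
    def pick(key: str, fallback):
        value = inputs.get(key)
        return fallback if value is None else value

    return {
        "enabled": pick("run_risk_watch", True),
        "env": pick("risk_watch_env", "prod"),
        "warn_at": pick("risk_watch_warn_at", policy_thresholds["warn_at"]),
        "fail_at": pick("risk_watch_fail_at", policy_thresholds["fail_at"]),
    }


gateway_policy = {"warn_at": 50, "fail_at": 75}
print(resolve_gate_inputs({"risk_watch_fail_at": 60}, gateway_policy))
```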
Architecture
[Incident Store]──open incidents──┐
[Intelligence]──recurrence 7d/30d─┤
[Followups Summary]──overdue──────┤──► risk_engine.py ──► RiskReport
[SLO Snapshot]──violations────────┤           │
[Alert Store]──loop SLO───────────┤     score_to_band
[Decision Events]──escalations────┘           │
                                    release_check_runner
                                       risk_watch gate
The engine has zero LLM calls. It is deterministic: given the same signals, the same score is always produced.
Testing
pytest tests/test_risk_engine.py # scoring + bands + overrides
pytest tests/test_risk_dashboard.py # sorting + band counts + p0 detection
pytest tests/test_release_check_risk_watch.py # warn/strict/non-fatal gate