# Service Risk Index

> Deterministic. No LLM. Production-grade.

## Overview

The Risk Index Engine computes a **numerical risk score (0–100+)** for every tracked service.
It is the single authoritative metric for service health in the DAARION.city control plane.

Score → Band mapping:

| Score | Band     | Meaning                                   |
|-------|----------|-------------------------------------------|
| 0–20  | low      | No significant signals                    |
| 21–50 | medium   | Minor signals; monitor                    |
| 51–80 | high     | Active problems; coordinate before deploy |
| 81+   | critical | Block or escalate immediately             |

---

## Scoring Formula

```
Risk(service) = Σ weight(signal) × count_or_flag(signal)
```

All weights are policy-driven via `config/risk_policy.yml`.

### Signal weights (defaults)

| Signal                        | Points          |
|-------------------------------|-----------------|
| Open P0 incident              | 50 each         |
| Open P1 incident              | 25 each         |
| Open P2 incident              | 10 each         |
| Open P3 incident              | 5 each          |
| High recurrence signature 7d  | 20 each         |
| Warn recurrence signature 7d  | 10 each         |
| High recurrence kind 7d       | 15 each         |
| Warn recurrence kind 7d       | 8 each          |
| High recurrence signature 30d | 10 each         |
| High recurrence kind 30d      | 8 each          |
| Overdue follow-up P0          | 20 each         |
| Overdue follow-up P1          | 12 each         |
| Overdue follow-up other       | 6 each          |
| Active SLO violation (60m)    | 10 each         |
| Alert-loop SLO violation      | 10 each         |
| Escalations 24h (1–2)         | 5 (warn level)  |
| Escalations 24h (3+)          | 12 (high level) |

---

## Configuration

**`config/risk_policy.yml`** controls all weights, thresholds, and per-service overrides.

```yaml
thresholds:
  bands:
    low_max: 20
    medium_max: 50
    high_max: 80
  risk_watch:
    warn_at: 50
    fail_at: 80

service_overrides:
  gateway:
    risk_watch:
      fail_at: 75   # gateway fails earlier: critical path

p0_services:
  - gateway
  - router
```

Changes to the file take effect on the next request (cache is not long-lived).
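
The formula, default weights, band limits, and per-service override above can be combined in a few lines. The following is a minimal sketch, not the contents of `risk_engine.py`: the signal key names, the in-memory dictionaries standing in for `config/risk_policy.yml`, and the helpers `risk_score`, `escalation_points`, and `thresholds_for` are all assumptions made for illustration. Only the numbers come from the tables and YAML above.

```python
from typing import Dict, Mapping

# Default weights copied from the "Signal weights" table. The key names are
# hypothetical and need not match the keys used in config/risk_policy.yml.
WEIGHTS: Dict[str, int] = {
    "open_p0": 50, "open_p1": 25, "open_p2": 10, "open_p3": 5,
    "sig_high_7d": 20, "sig_warn_7d": 10,
    "kind_high_7d": 15, "kind_warn_7d": 8,
    "sig_high_30d": 10, "kind_high_30d": 8,
    "overdue_p0": 20, "overdue_p1": 12, "overdue_other": 6,
    "slo_violation": 10, "alert_loop_violation": 10,
}

# Values taken from the YAML example above.
BANDS = {"low_max": 20, "medium_max": 50, "high_max": 80}
RISK_WATCH = {"warn_at": 50, "fail_at": 80}
SERVICE_OVERRIDES = {"gateway": {"risk_watch": {"fail_at": 75}}}


def escalation_points(count_24h: int) -> int:
    """Escalations are tiered (flat points per level), not multiplied per event."""
    if count_24h >= 3:
        return 12  # high level
    if count_24h >= 1:
        return 5   # warn level
    return 0


def risk_score(signals: Mapping[str, int], escalations_24h: int = 0) -> int:
    """Sum weight x count over all signals, then add the escalation tier."""
    total = sum(WEIGHTS[name] * count for name, count in signals.items())
    return total + escalation_points(escalations_24h)


def score_to_band(score: int) -> str:
    """Map a score to its band using the inclusive upper limits."""
    if score <= BANDS["low_max"]:
        return "low"
    if score <= BANDS["medium_max"]:
        return "medium"
    if score <= BANDS["high_max"]:
        return "high"
    return "critical"


def thresholds_for(service: str) -> Dict[str, int]:
    """Global risk_watch thresholds with any per-service override applied."""
    merged = dict(RISK_WATCH)
    merged.update(SERVICE_OVERRIDES.get(service, {}).get("risk_watch", {}))
    return merged


# One open P1, one overdue P1 follow-up, one SLO violation, one escalation:
score = risk_score({"open_p1": 1, "overdue_p1": 1, "slo_violation": 1}, escalations_24h=1)
print(score, score_to_band(score))  # 52 high
print(thresholds_for("gateway"))    # {'warn_at': 50, 'fail_at': 75}
```

Because the function only reads its inputs and the policy dictionaries, the same signals always yield the same score.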

---

## API

### `GET /v1/risk/service/{service}?env=prod&window_hours=24`

Returns a `RiskReport`:

```json
{
  "service": "gateway",
  "env": "prod",
  "score": 72,
  "band": "high",
  "thresholds": { "warn_at": 50, "fail_at": 75 },
  "components": {
    "open_incidents": { "P0": 0, "P1": 1, "P2": 2, "points": 45 },
    "recurrence": { "high_signatures_7d": 1, "points": 20 },
    "followups": { "overdue_P1": 1, "points": 12 },
    "slo": { "violations": 1, "points": 10 },
    "alerts_loop": { "violations": 0, "points": 0 },
    "escalations": { "count_24h": 1, "points": 5 }
  },
  "reasons": [
    "Open P1 incident(s): 1",
    "High recurrence signatures (7d): 1",
    "Overdue follow-ups (P1): 1",
    "Active SLO violation(s) in window: 1",
    "Escalations in last 24h: 1"
  ],
  "recommendations": [
    "Prioritize open P0/P1 incidents before deploying.",
    "Investigate recurring failure patterns.",
    "Avoid risky deploys until SLO violation clears.",
    "Service is high-risk — coordinate with oncall before release."
  ],
  "updated_at": "2026-02-23T12:00:00"
}
```

RBAC required: `tools.risk.read` (granted to `agent_cto`, `agent_oncall`, `agent_monitor`).

### `GET /v1/risk/dashboard?env=prod&top_n=10`

Returns top-N services by score with band summary:

```json
{
  "env": "prod",
  "generated_at": "...",
  "total_services": 4,
  "band_counts": { "critical": 1, "high": 1, "medium": 2, "low": 0 },
  "critical_p0_services": ["gateway"],
  "services": [ ...RiskReports sorted by score desc... ]
}
```

### Tool: `risk_engine_tool`

```json
{ "action": "service", "service": "gateway", "env": "prod" }
{ "action": "dashboard", "env": "prod", "top_n": 10 }
{ "action": "policy" }
```

---

## Release Gate: `risk_watch`

The `risk_watch` gate integrates Risk Index into the release pipeline.
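
As a rough illustration of how a pipeline step might address the service endpoint from the API section and pull out the fields a gate needs, the sketch below builds the request URL and parses a `RiskReport` body. The base URL, the `RiskSummary` class, and the helper names are hypothetical. Only the path, query parameters, and JSON field names come from the API section, and no network call is made here.

```python
import json
from dataclasses import dataclass
from urllib.parse import quote, urlencode


def service_risk_url(base_url: str, service: str, env: str = "prod",
                     window_hours: int = 24) -> str:
    """Build the GET /v1/risk/service/{service} URL with its query string."""
    query = urlencode({"env": env, "window_hours": window_hours})
    return f"{base_url.rstrip('/')}/v1/risk/service/{quote(service, safe='')}?{query}"


@dataclass(frozen=True)
class RiskSummary:
    """Hypothetical reduced view of a RiskReport: just what a gate compares."""
    service: str
    score: int
    band: str
    warn_at: int
    fail_at: int

    @property
    def over_warn(self) -> bool:
        return self.score >= self.warn_at

    @property
    def over_fail(self) -> bool:
        return self.score >= self.fail_at


def parse_risk_report(body: str) -> RiskSummary:
    """Extract score, band, and effective thresholds from a RiskReport JSON body."""
    data = json.loads(body)
    thresholds = data["thresholds"]
    return RiskSummary(
        service=data["service"],
        score=data["score"],
        band=data["band"],
        warn_at=thresholds["warn_at"],
        fail_at=thresholds["fail_at"],
    )


# "https://control.example" is a placeholder host, not a real deployment.
url = service_risk_url("https://control.example", "gateway")
sample = ('{"service": "gateway", "score": 72, "band": "high", '
          '"thresholds": {"warn_at": 50, "fail_at": 75}}')
summary = parse_risk_report(sample)
print(url)
print(summary.over_warn, summary.over_fail)  # True False
```

The thresholds are read from the report rather than hard-coded, so a per-service override such as the gateway's `fail_at: 75` is honoured by the caller.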

### Behaviour

| Mode   | When score ≥ warn_at (default 50) | When score ≥ fail_at (default 80) |
|--------|-----------------------------------|-----------------------------------|
| warn   | pass=true + recommendations added | pass=true + recommendations added |
| strict | pass=true + recommendations added | **pass=false**, deploy blocked    |

### Policy

```yaml
# config/release_gate_policy.yml
dev:
  risk_watch: { mode: "warn" }
staging:
  risk_watch: { mode: "strict" }   # blocks p0_services when score >= fail_at
prod:
  risk_watch: { mode: "warn" }
```

### Non-fatal guarantee

If the Risk Engine is unavailable (store down, timeout, error), `risk_watch` is **skipped** and never blocks. A warning is added to the gate output.

### Release inputs

| Input                | Type    | Default | Description             |
|----------------------|---------|---------|-------------------------|
| `run_risk_watch`     | boolean | true    | Enable/disable the gate |
| `risk_watch_env`     | string  | prod    | Env to score against    |
| `risk_watch_warn_at` | int     | policy  | Override warn threshold |
| `risk_watch_fail_at` | int     | policy  | Override fail threshold |

---

## Architecture

```
[Incident Store]──open incidents──┐
[Intelligence]──recurrence 7d/30d─┤
[Followups Summary]──overdue──────┤──► risk_engine.py ──► RiskReport
[SLO Snapshot]──violations────────┤         │
[Alert Store]──loop SLO───────────┤    score_to_band
[Decision Events]──escalations────┘         │
                                   release_check_runner
                                     risk_watch gate
```

The engine has **zero LLM calls**. It is deterministic: given the same signals, the same score is always produced.

---

## Testing

```bash
pytest tests/test_risk_engine.py                 # scoring + bands + overrides
pytest tests/test_risk_dashboard.py              # sorting + band counts + p0 detection
pytest tests/test_release_check_risk_watch.py    # warn/strict/non-fatal gate
```
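
The warn/strict/non-fatal behaviour that the gate tests cover can be pictured as a small pure function. This is a sketch under stated assumptions, not the code in `release_check_runner`: `GateResult`, `risk_watch_gate`, and the `fetch_score` callable are hypothetical names, and the sketch ignores any distinction between `p0_services` and other services.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class GateResult:
    """Hypothetical gate outcome: pass flag, skip flag, and human-readable notes."""
    passed: bool
    skipped: bool = False
    notes: List[str] = field(default_factory=list)


def risk_watch_gate(fetch_score: Callable[[], int], mode: str,
                    warn_at: int = 50, fail_at: int = 80) -> GateResult:
    """Evaluate the gate; only strict mode at or above fail_at can block."""
    try:
        score = fetch_score()
    except Exception as exc:
        # Non-fatal guarantee: an unavailable engine skips the gate with a warning.
        return GateResult(passed=True, skipped=True,
                          notes=[f"risk_watch skipped: {exc}"])

    notes: List[str] = []
    if score >= warn_at:
        notes.append(f"risk score {score} >= warn_at {warn_at}")
    if score >= fail_at:
        notes.append(f"risk score {score} >= fail_at {fail_at}")
        if mode == "strict":
            return GateResult(passed=False, notes=notes)
    return GateResult(passed=True, notes=notes)


def engine_down() -> int:
    """Stand-in for a store outage or timeout."""
    raise TimeoutError("risk store timeout")


print(risk_watch_gate(lambda: 85, mode="strict").passed)    # False
print(risk_watch_gate(lambda: 85, mode="warn").passed)      # True
print(risk_watch_gate(engine_down, mode="strict").skipped)  # True
```

Passing the score source in as a callable keeps the decision logic testable without a running Risk Engine, which is the kind of isolation the pytest files listed above would need.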