
# Service Risk Index

> Deterministic. No LLM. Production-grade.

## Overview

The Risk Index Engine computes a **numerical risk score (0–100+)** for every tracked service. It is the single authoritative metric for service health in the DAARION.city control plane.

Score → Band mapping:

| Score  | Band     | Meaning                                   |
|--------|----------|-------------------------------------------|
| 0–20   | low      | No significant signals                    |
| 21–50  | medium   | Minor signals; monitor                    |
| 51–80  | high     | Active problems; coordinate before deploy |
| 81+    | critical | Block or escalate immediately             |
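
The band table can be read as three inclusive upper bounds. The sketch below is a minimal illustration of that mapping; `score_to_band` is named in the architecture diagram, but the signature and default arguments shown here are assumptions, not the engine's actual code.

```python
def score_to_band(score: int, low_max: int = 20, medium_max: int = 50,
                  high_max: int = 80) -> str:
    """Map a risk score to a band using inclusive upper bounds (assumed signature)."""
    if score <= low_max:
        return "low"
    if score <= medium_max:
        return "medium"
    if score <= high_max:
        return "high"
    return "critical"  # anything above high_max, including scores over 100


print(score_to_band(72))  # high
```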

---

## Scoring Formula

```
Risk(service) = Σ weight(signal) × count_or_flag(signal)
```

All weights are policy-driven via `config/risk_policy.yml`.

### Signal weights (defaults)

| Signal                        | Points                        |
|-------------------------------|-------------------------------|
| Open P0 incident              | 50 each                       |
| Open P1 incident              | 25 each                       |
| Open P2 incident              | 10 each                       |
| Open P3 incident              | 5 each                        |
| High recurrence signature 7d  | 20 each                       |
| Warn recurrence signature 7d  | 10 each                       |
| High recurrence kind 7d       | 15 each                       |
| Warn recurrence kind 7d       | 8 each                        |
| High recurrence signature 30d | 10 each                       |
| High recurrence kind 30d      | 8 each                        |
| Overdue follow-up P0          | 20 each                       |
| Overdue follow-up P1          | 12 each                       |
| Overdue follow-up other       | 6 each                        |
| Active SLO violation (60m)    | 10 each                       |
| Alert-loop SLO violation      | 10 each                       |
| Escalations 24h (1–2)         | 5 (warn level)                |
| Escalations 24h (3+)          | 12 (high level)               |
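
As a worked example of the formula, the sketch below sums the default weights for a set of signal counts. The dictionary keys and function names are hypothetical and chosen only for illustration. Note that escalations are tiered (a flat 5 or 12 points) rather than multiplied per occurrence.

```python
# Default weights copied from the table above; the key names are hypothetical.
DEFAULT_WEIGHTS = {
    "open_p0": 50, "open_p1": 25, "open_p2": 10, "open_p3": 5,
    "sig_high_7d": 20, "sig_warn_7d": 10,
    "kind_high_7d": 15, "kind_warn_7d": 8,
    "sig_high_30d": 10, "kind_high_30d": 8,
    "overdue_p0": 20, "overdue_p1": 12, "overdue_other": 6,
    "slo_violation": 10, "alert_loop_violation": 10,
}


def escalation_points(count_24h: int) -> int:
    """Tiered, not per-event: 1-2 escalations -> 5 points, 3 or more -> 12."""
    if count_24h >= 3:
        return 12
    if count_24h >= 1:
        return 5
    return 0


def risk_score(counts: dict, escalations_24h: int = 0,
               weights: dict = DEFAULT_WEIGHTS) -> int:
    """Sum weight x count over all per-event signals, then add the escalation tier."""
    per_signal = sum(weights[name] * n for name, n in counts.items())
    return per_signal + escalation_points(escalations_24h)


example = {"open_p1": 1, "sig_high_7d": 1, "overdue_p1": 1, "slo_violation": 1}
print(risk_score(example, escalations_24h=1))  # 25 + 20 + 12 + 10 + 5 = 72
```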

---

## Configuration

**`config/risk_policy.yml`** — controls all weights, thresholds, and per-service overrides.

```yaml
thresholds:
  bands:
    low_max: 20
    medium_max: 50
    high_max: 80
  risk_watch:
    warn_at: 50
    fail_at: 80

service_overrides:
  gateway:
    risk_watch:
      fail_at: 75   # gateway fails earlier: critical path

p0_services:
  - gateway
  - router
```

Changes to the file take effect on the next request (the policy cache is short-lived).
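
One plausible way to resolve the effective `risk_watch` thresholds for a service is to start from the global values and overlay any entry under `service_overrides`. The sketch below assumes that merge order and uses a plain dict in place of the parsed YAML; the function name `risk_watch_thresholds` is hypothetical.

```python
# Stand-in for the parsed contents of config/risk_policy.yml.
POLICY = {
    "thresholds": {
        "bands": {"low_max": 20, "medium_max": 50, "high_max": 80},
        "risk_watch": {"warn_at": 50, "fail_at": 80},
    },
    "service_overrides": {"gateway": {"risk_watch": {"fail_at": 75}}},
    "p0_services": ["gateway", "router"],
}


def risk_watch_thresholds(policy: dict, service: str) -> dict:
    """Global thresholds, with per-service overrides applied on top (assumed order)."""
    effective = dict(policy["thresholds"]["risk_watch"])  # copy, do not mutate policy
    override = policy.get("service_overrides", {}).get(service, {})
    effective.update(override.get("risk_watch", {}))
    return effective


print(risk_watch_thresholds(POLICY, "gateway"))  # {'warn_at': 50, 'fail_at': 75}
print(risk_watch_thresholds(POLICY, "router"))   # {'warn_at': 50, 'fail_at': 80}
```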

---

## API

### `GET /v1/risk/service/{service}?env=prod&window_hours=24`

Returns a `RiskReport`:

```json
{
  "service": "gateway",
  "env": "prod",
  "score": 72,
  "band": "high",
  "thresholds": { "warn_at": 50, "fail_at": 75 },
  "components": {
    "open_incidents": { "P0": 0, "P1": 1, "P2": 0, "points": 25 },
    "recurrence": { "high_signatures_7d": 1, "points": 20 },
    "followups": { "overdue_P1": 1, "points": 12 },
    "slo": { "violations": 1, "points": 10 },
    "alerts_loop": { "violations": 0, "points": 0 },
    "escalations": { "count_24h": 1, "points": 5 }
  },
  "reasons": [
    "Open P1 incident(s): 1",
    "High recurrence signatures (7d): 1",
    "Overdue follow-ups (P1): 1",
    "Active SLO violation(s) in window: 1",
    "Escalations in last 24h: 1"
  ],
  "recommendations": [
    "Prioritize open P0/P1 incidents before deploying.",
    "Investigate recurring failure patterns.",
    "Avoid risky deploys until SLO violation clears.",
    "Service is high-risk — coordinate with oncall before release."
  ],
  "updated_at": "2026-02-23T12:00:00"
}
```

RBAC required: `tools.risk.read` (granted to `agent_cto`, `agent_oncall`, `agent_monitor`).
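
Because the score is defined as a sum of weighted signals, a consumer can sanity-check a report by adding up the `points` of each component. The sketch below assumes the score equals that sum with no cap or rounding, which the scoring formula implies but this document does not state explicitly; `component_points` is a hypothetical helper.

```python
def component_points(report: dict) -> int:
    """Add up the `points` field of every entry under `components`."""
    return sum(c.get("points", 0) for c in report["components"].values())


# Trimmed RiskReport with illustrative values.
report = {
    "score": 72,
    "components": {
        "open_incidents": {"P1": 1, "points": 25},
        "recurrence": {"high_signatures_7d": 1, "points": 20},
        "followups": {"overdue_P1": 1, "points": 12},
        "slo": {"violations": 1, "points": 10},
        "alerts_loop": {"violations": 0, "points": 0},
        "escalations": {"count_24h": 1, "points": 5},
    },
}

assert component_points(report) == report["score"]
```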

### `GET /v1/risk/dashboard?env=prod&top_n=10`

Returns top-N services by score with band summary:

```json
{
  "env": "prod",
  "generated_at": "...",
  "total_services": 4,
  "band_counts": { "critical": 1, "high": 1, "medium": 2, "low": 0 },
  "critical_p0_services": ["gateway"],
  "services": [ ...RiskReports sorted by score desc... ]
}
```
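
The dashboard payload can be derived from a list of reports, roughly as sketched below. This is an illustration under assumptions: `build_dashboard` is a hypothetical name, `critical_p0_services` is taken to mean services listed in `p0_services` whose band is `critical`, and `generated_at` is omitted to keep the output reproducible.

```python
from collections import Counter

BANDS = ("critical", "high", "medium", "low")


def build_dashboard(reports: list, p0_services: list, env: str = "prod",
                    top_n: int = 10) -> dict:
    """Rank reports by score (descending) and summarise bands (assumed logic)."""
    ranked = sorted(reports, key=lambda r: r["score"], reverse=True)
    counts = Counter(r["band"] for r in reports)
    return {
        "env": env,
        "total_services": len(reports),
        "band_counts": {band: counts.get(band, 0) for band in BANDS},
        "critical_p0_services": [
            r["service"] for r in ranked
            if r["band"] == "critical" and r["service"] in p0_services
        ],
        "services": ranked[:top_n],
    }


reports = [
    {"service": "router", "score": 60, "band": "high"},
    {"service": "gateway", "score": 95, "band": "critical"},
    {"service": "svc-a", "score": 30, "band": "medium"},  # hypothetical service
    {"service": "svc-b", "score": 25, "band": "medium"},  # hypothetical service
]
dashboard = build_dashboard(reports, p0_services=["gateway", "router"])
print(dashboard["band_counts"])
```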

### Tool: `risk_engine_tool`

```json
{ "action": "service",   "service": "gateway", "env": "prod" }
{ "action": "dashboard", "env": "prod", "top_n": 10 }
{ "action": "policy" }
```

---

## Release Gate: `risk_watch`

The `risk_watch` gate integrates the Risk Index into the release pipeline.

### Behaviour

| Mode   | When score ≥ warn_at (default 50) | When score ≥ fail_at (default 80) |
|--------|------------------------------------|-------------------------------------|
| warn   | pass=true + recommendations added  | pass=true + recommendations added   |
| strict | pass=true + recommendations added  | **pass=false** — deploy blocked     |
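
The table reduces to two checks: recommendations are attached once the score reaches `warn_at`, and only `strict` mode turns a score at or above `fail_at` into a failure. A minimal sketch follows, assuming a hypothetical `risk_watch_gate` function and result shape; it does not model any restriction of blocking to `p0_services`.

```python
def risk_watch_gate(score: int, mode: str, warn_at: int = 50, fail_at: int = 80,
                    recommendations: tuple = ()) -> dict:
    """Evaluate the gate per the behaviour table (assumed function and result shape)."""
    result = {"pass": True, "recommendations": []}
    if score >= warn_at:
        # Both modes surface recommendations from the warn threshold upward.
        result["recommendations"] = list(recommendations)
    if mode == "strict" and score >= fail_at:
        # Only strict mode blocks the deploy.
        result["pass"] = False
    return result


recs = ("Coordinate with oncall before release.",)
print(risk_watch_gate(85, "warn", recommendations=recs)["pass"])    # True
print(risk_watch_gate(85, "strict", recommendations=recs)["pass"])  # False
```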

### Policy

```yaml
# config/release_gate_policy.yml
dev:
  risk_watch: { mode: "warn" }
staging:
  risk_watch: { mode: "strict" }   # blocks p0_services when score >= fail_at
prod:
  risk_watch: { mode: "warn" }
```

### Non-fatal guarantee

If the Risk Engine is unavailable (store down, timeout, error), `risk_watch` is **skipped** — never blocks. A warning is added to the gate output.
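
The non-fatal behaviour can be pictured as a wrapper that converts any failure to fetch a report into a passing, skipped result with a warning. The names `run_risk_watch`, `fetch_report`, and `evaluate`, and the result fields, are assumptions made for this sketch.

```python
from typing import Callable


def run_risk_watch(fetch_report: Callable[[], dict],
                   evaluate: Callable[[dict], dict]) -> dict:
    """Run the gate, but never let an engine failure block the release."""
    try:
        report = fetch_report()
    except Exception as exc:  # store down, timeout, or any other engine error
        return {
            "pass": True,
            "skipped": True,
            "warnings": [f"risk_watch skipped: {exc}"],
        }
    return evaluate(report)


def failing_fetch() -> dict:
    raise TimeoutError("risk engine timed out")


result = run_risk_watch(failing_fetch, evaluate=lambda report: {"pass": False})
print(result["pass"], result["skipped"])  # True True
```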

### Release inputs

| Input                | Type    | Default | Description             |
|----------------------|---------|---------|-------------------------|
| `run_risk_watch`     | boolean | true    | Enable/disable the gate |
| `risk_watch_env`     | string  | prod    | Env to score against    |
| `risk_watch_warn_at` | int     | policy  | Override warn threshold |
| `risk_watch_fail_at` | int     | policy  | Override fail threshold |

---

## Architecture

```
[Incident Store]──open incidents──┐
[Intelligence]──recurrence 7d/30d─┤
[Followups Summary]──overdue──────┤──► risk_engine.py ──► RiskReport
[SLO Snapshot]──violations────────┤           │
[Alert Store]──loop SLO───────────┤      score_to_band
[Decision Events]──escalations────┘           │
                                        release_check_runner
                                           risk_watch gate
```

The engine has **zero LLM calls**. It is deterministic: given the same signals, the same score is always produced.

---

## Testing

```bash
pytest tests/test_risk_engine.py         # scoring + bands + overrides
pytest tests/test_risk_dashboard.py      # sorting + band counts + p0 detection
pytest tests/test_release_check_risk_watch.py  # warn/strict/non-fatal gate
```