microdao-daarion/docs/risk/risk_index.md
Apple 67225a39fa docs(platform): add policy configs, runbooks, ops scripts and platform documentation
Config policies (16 files): alert_routing, architecture_pressure, backlog,
cost_weights, data_governance, incident_escalation, incident_intelligence,
network_allowlist, nodes_registry, observability_sources, rbac_tools_matrix,
release_gate, risk_attribution, risk_policy, slo_policy, tool_limits, tools_rollout

Ops (22 files): Caddyfile, calendar compose, grafana voice dashboard,
deployments/incidents logs, runbooks for alerts/audit/backlog/incidents/sofiia/voice,
cron jobs, scripts (alert_triage, audit_cleanup, migrate_*, governance, schedule),
task_registry, voice alerts/ha/latency/policy

Docs (30+ files): HUMANIZED_STEPAN v2.7-v3 changelogs and runbooks,
NODA1/NODA2 status and setup, audit index and traces, backlog, incident,
supervisor, tools, voice, opencode, release, risk, aistalk, spacebot

Made-with: Cursor
2026-03-03 07:14:53 -08:00


Service Risk Index

Deterministic. No LLM. Production-grade.

Overview

The Risk Index Engine computes a numerical risk score (0–100+) for every tracked service. It is the single authoritative metric for service health in the DAARION.city control plane.

Score → Band mapping:

| Score | Band | Meaning |
|---|---|---|
| 0–20 | low | No significant signals |
| 21–50 | medium | Minor signals; monitor |
| 51–80 | high | Active problems; coordinate before deploy |
| 81+ | critical | Block or escalate immediately |
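
The band mapping above can be sketched as a small pure function. This is a minimal illustration, not the engine's actual code: the name `score_to_band` appears in the architecture diagram, but its signature and the `BandThresholds` type are assumptions, with defaults taken from the band table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BandThresholds:
    """Inclusive upper bound of each band (hypothetical type for illustration)."""
    low_max: int = 20
    medium_max: int = 50
    high_max: int = 80


def score_to_band(score: int, t: BandThresholds = BandThresholds()) -> str:
    """Map a risk score to a band name; anything above high_max is critical."""
    if score <= t.low_max:
        return "low"
    if score <= t.medium_max:
        return "medium"
    if score <= t.high_max:
        return "high"
    return "critical"


print(score_to_band(72))  # high
```

Treating the upper bounds as inclusive means a score of exactly 80 is still `high`, and 81 is the first `critical` score, which matches the table.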

Scoring Formula

Risk(service) = Σ weight(signal) × count_or_flag(signal)

All weights are policy-driven via config/risk_policy.yml.

Signal weights (defaults)

| Signal | Points |
|---|---|
| Open P0 incident | 50 each |
| Open P1 incident | 25 each |
| Open P2 incident | 10 each |
| Open P3 incident | 5 each |
| High recurrence signature 7d | 20 each |
| Warn recurrence signature 7d | 10 each |
| High recurrence kind 7d | 15 each |
| Warn recurrence kind 7d | 8 each |
| High recurrence signature 30d | 10 each |
| High recurrence kind 30d | 8 each |
| Overdue follow-up P0 | 20 each |
| Overdue follow-up P1 | 12 each |
| Overdue follow-up other | 6 each |
| Active SLO violation (60m) | 10 each |
| Alert-loop SLO violation | 10 each |
| Escalations 24h (1–2) | 5 (warn level) |
| Escalations 24h (3+) | 12 (high level) |
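
As a worked example of the formula, the sketch below sums the default weights over a set of signal counts. The dictionary keys and function names are hypothetical, chosen only for this illustration. Escalations are treated as a flat tier rather than per-item points, which is an assumption based on the "warn level" and "high level" wording in the weights table.

```python
# Default weights copied from the signal table; the key names are hypothetical.
DEFAULT_WEIGHTS = {
    "open_p0": 50,
    "open_p1": 25,
    "open_p2": 10,
    "open_p3": 5,
    "recurrence_sig_high_7d": 20,
    "recurrence_sig_warn_7d": 10,
    "recurrence_kind_high_7d": 15,
    "recurrence_kind_warn_7d": 8,
    "recurrence_sig_high_30d": 10,
    "recurrence_kind_high_30d": 8,
    "overdue_followup_p0": 20,
    "overdue_followup_p1": 12,
    "overdue_followup_other": 6,
    "slo_violation": 10,
    "alert_loop_slo_violation": 10,
}


def escalation_points(count_24h: int) -> int:
    """Tiered, not per-item (assumed): 1-2 escalations -> 5, 3 or more -> 12."""
    if count_24h >= 3:
        return 12
    if count_24h >= 1:
        return 5
    return 0


def risk_score(signals: dict[str, int], escalations_24h: int = 0) -> int:
    """Sum weight x count over the given signals, then add the escalation tier."""
    total = sum(DEFAULT_WEIGHTS[name] * count for name, count in signals.items())
    return total + escalation_points(escalations_24h)


example = {
    "open_p1": 1,
    "recurrence_sig_high_7d": 1,
    "overdue_followup_p1": 1,
    "slo_violation": 1,
}
print(risk_score(example, escalations_24h=1))  # 25 + 20 + 12 + 10 + 5 = 72
```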

Configuration

config/risk_policy.yml — controls all weights, thresholds, and per-service overrides.

thresholds:
  bands:
    low_max: 20
    medium_max: 50
    high_max: 80
  risk_watch:
    warn_at: 50
    fail_at: 80

service_overrides:
  gateway:
    risk_watch:
      fail_at: 75   # gateway fails earlier: critical path

p0_services:
  - gateway
  - router

Changes to the file take effect on the next request, because the policy cache is short-lived.
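
The per-service override can be read as a shallow merge over the global `risk_watch` thresholds. The sketch below shows one plausible resolution order. The function name `effective_risk_watch` is hypothetical, and the policy is written as a plain dict mirroring the YAML fragment so the example needs no YAML parser.

```python
# Mirrors the YAML fragment in this section as a plain dict (no parser needed).
POLICY = {
    "thresholds": {
        "bands": {"low_max": 20, "medium_max": 50, "high_max": 80},
        "risk_watch": {"warn_at": 50, "fail_at": 80},
    },
    "service_overrides": {"gateway": {"risk_watch": {"fail_at": 75}}},
    "p0_services": ["gateway", "router"],
}


def effective_risk_watch(policy: dict, service: str) -> dict:
    """Return the global risk_watch thresholds with any service override applied."""
    base = dict(policy["thresholds"]["risk_watch"])  # copy; never mutate the policy
    override = policy.get("service_overrides", {}).get(service, {})
    base.update(override.get("risk_watch", {}))
    return base


print(effective_risk_watch(POLICY, "gateway"))  # {'warn_at': 50, 'fail_at': 75}
print(effective_risk_watch(POLICY, "router"))   # {'warn_at': 50, 'fail_at': 80}
```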


API

GET /v1/risk/service/{service}?env=prod&window_hours=24

Returns a RiskReport:

{
  "service": "gateway",
  "env": "prod",
  "score": 72,
  "band": "high",
  "thresholds": { "warn_at": 50, "fail_at": 75 },
  "components": {
    "open_incidents": { "P0": 0, "P1": 1, "P2": 0, "points": 25 },
    "recurrence": { "high_signatures_7d": 1, "points": 20 },
    "followups": { "overdue_P1": 1, "points": 12 },
    "slo": { "violations": 1, "points": 10 },
    "alerts_loop": { "violations": 0, "points": 0 },
    "escalations": { "count_24h": 1, "points": 5 }
  },
  "reasons": [
    "Open P1 incident(s): 1",
    "High recurrence signatures (7d): 1",
    "Overdue follow-ups (P1): 1",
    "Active SLO violation(s) in window: 1",
    "Escalations in last 24h: 1"
  ],
  "recommendations": [
    "Prioritize open P0/P1 incidents before deploying.",
    "Investigate recurring failure patterns.",
    "Avoid risky deploys until SLO violation clears.",
    "Service is high-risk — coordinate with oncall before release."
  ],
  "updated_at": "2026-02-23T12:00:00"
}

RBAC required: tools.risk.read (granted to agent_cto, agent_oncall, agent_monitor).
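
A minimal client sketch for this endpoint, using only the standard library. `BASE_URL` is a placeholder, not a documented host. This document does not describe how the caller's RBAC grant is presented, so authentication is deliberately left out rather than guessed.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # hypothetical; replace with the real host


def service_risk_url(service: str, env: str = "prod", window_hours: int = 24) -> str:
    """Build the URL for GET /v1/risk/service/{service}."""
    query = urlencode({"env": env, "window_hours": window_hours})
    return f"{BASE_URL}/v1/risk/service/{quote(service, safe='')}?{query}"


def fetch_service_risk(service: str, env: str = "prod", window_hours: int = 24) -> dict:
    """Fetch and decode a RiskReport (auth omitted; see the note above)."""
    with urlopen(service_risk_url(service, env, window_hours), timeout=10) as resp:
        return json.load(resp)


print(service_risk_url("gateway"))
# http://localhost:8000/v1/risk/service/gateway?env=prod&window_hours=24
```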

GET /v1/risk/dashboard?env=prod&top_n=10

Returns top-N services by score with band summary:

{
  "env": "prod",
  "generated_at": "...",
  "total_services": 4,
  "band_counts": { "critical": 1, "high": 1, "medium": 2, "low": 0 },
  "critical_p0_services": ["gateway"],
  "services": [ ...RiskReports sorted by score desc... ]
}
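
The dashboard shape can be derived from a list of per-service reports, roughly as follows. This is a sketch under assumptions: `build_dashboard` is a hypothetical name, `critical_p0_services` is taken to mean services listed in `p0_services` that are currently in the `critical` band, and the service names other than `gateway` and `router` are made up. `generated_at` is omitted to keep the output reproducible.

```python
from collections import Counter

BANDS = ("critical", "high", "medium", "low")


def build_dashboard(reports: list[dict], p0_services: list[str],
                    env: str = "prod", top_n: int = 10) -> dict:
    """Aggregate per-service reports: band counts plus the top-N by score."""
    ranked = sorted(reports, key=lambda r: r["score"], reverse=True)
    counts = Counter(r["band"] for r in reports)
    return {
        "env": env,
        "total_services": len(reports),
        "band_counts": {band: counts.get(band, 0) for band in BANDS},
        "critical_p0_services": [
            r["service"] for r in ranked
            if r["band"] == "critical" and r["service"] in p0_services
        ],
        "services": ranked[:top_n],
    }


reports = [
    {"service": "router", "score": 30, "band": "medium"},
    {"service": "gateway", "score": 95, "band": "critical"},
    {"service": "memory", "score": 72, "band": "high"},
    {"service": "voice", "score": 25, "band": "medium"},
]
dash = build_dashboard(reports, p0_services=["gateway", "router"], top_n=2)
print(dash["band_counts"])           # {'critical': 1, 'high': 1, 'medium': 2, 'low': 0}
print(dash["critical_p0_services"])  # ['gateway']
```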

Tool: risk_engine_tool

{ "action": "service",   "service": "gateway", "env": "prod" }
{ "action": "dashboard", "env": "prod", "top_n": 10 }
{ "action": "policy" }

Release Gate: risk_watch

The risk_watch gate integrates Risk Index into the release pipeline.

Behaviour

| Mode | When score ≥ warn_at (default 50) | When score ≥ fail_at (default 80) |
|---|---|---|
| warn | pass=true + recommendations added | pass=true + recommendations added |
| strict | pass=true + recommendations added | pass=false (deploy blocked) |
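
The behaviour table reduces to two comparisons. The following sketch is illustrative only: `evaluate_risk_watch` and the recommendation text are hypothetical, and it does not model the restriction to `p0_services` mentioned in the policy comment for this gate.

```python
def evaluate_risk_watch(score: int, mode: str,
                        warn_at: int = 50, fail_at: int = 80) -> dict:
    """Apply the warn/strict table: only strict mode at or above fail_at blocks."""
    result = {"pass": True, "recommendations": []}
    if score >= warn_at:
        # Placeholder wording; the real gate attaches the engine's recommendations.
        result["recommendations"].append(
            f"Risk score {score} >= warn_at {warn_at}: review before release."
        )
    if mode == "strict" and score >= fail_at:
        result["pass"] = False
    return result


print(evaluate_risk_watch(72, "strict", warn_at=50, fail_at=75)["pass"])  # True
print(evaluate_risk_watch(85, "strict")["pass"])                          # False
print(evaluate_risk_watch(85, "warn")["pass"])                            # True
```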

Policy

# config/release_gate_policy.yml
dev:
  risk_watch: { mode: "warn" }
staging:
  risk_watch: { mode: "strict" }   # blocks p0_services when score >= fail_at
prod:
  risk_watch: { mode: "warn" }

Non-fatal guarantee

If the Risk Engine is unavailable (store down, timeout, error), risk_watch is skipped — never blocks. A warning is added to the gate output.
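
One way to honour this guarantee is to wrap the engine call so that any failure becomes a skip with a warning. The sketch below assumes a caller-supplied `fetch_report` callable. That name, `run_risk_watch`, and the result keys `skipped` and `warnings` are all hypothetical.

```python
from typing import Callable


def run_risk_watch(fetch_report: Callable[[], dict], mode: str,
                   fail_at: int = 80) -> dict:
    """Run the gate, treating any engine failure as a skip that never blocks."""
    try:
        report = fetch_report()
    except Exception as exc:  # store down, timeout, or any other engine error
        return {"pass": True, "skipped": True,
                "warnings": [f"risk_watch skipped: {exc}"]}
    blocked = mode == "strict" and report["score"] >= fail_at
    return {"pass": not blocked, "skipped": False, "warnings": []}


def broken_engine() -> dict:
    """Stand-in for an unavailable Risk Engine."""
    raise TimeoutError("risk store timed out")


print(run_risk_watch(broken_engine, mode="strict"))
print(run_risk_watch(lambda: {"score": 90}, mode="strict"))
```

Catching the broad `Exception` is intentional here: the gate is advisory when the engine is down, so no failure mode should be able to block a release.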

Release inputs

| Input | Type | Default | Description |
|---|---|---|---|
| run_risk_watch | boolean | true | Enable/disable the gate |
| risk_watch_env | string | prod | Env to score against |
| risk_watch_warn_at | int | policy | Override warn threshold |
| risk_watch_fail_at | int | policy | Override fail threshold |

Architecture

[Incident Store]──open incidents──┐
[Intelligence]──recurrence 7d/30d─┤
[Followups Summary]──overdue──────┤──► risk_engine.py ──► RiskReport
[SLO Snapshot]──violations────────┤           │
[Alert Store]──loop SLO───────────┤      score_to_band
[Decision Events]──escalations────┘           │
                                        release_check_runner
                                           risk_watch gate

The engine has zero LLM calls. It is deterministic: given the same signals, the same score is always produced.


Testing

pytest tests/test_risk_engine.py         # scoring + bands + overrides
pytest tests/test_risk_dashboard.py      # sorting + band counts + p0 detection
pytest tests/test_release_check_risk_watch.py  # warn/strict/non-fatal gate