Apple 67225a39fa docs(platform): add policy configs, runbooks, ops scripts and platform documentation

Config policies (16 files): alert_routing, architecture_pressure, backlog,
cost_weights, data_governance, incident_escalation, incident_intelligence,
network_allowlist, nodes_registry, observability_sources, rbac_tools_matrix,
release_gate, risk_attribution, risk_policy, slo_policy, tool_limits, tools_rollout

Ops (22 files): Caddyfile, calendar compose, grafana voice dashboard,
deployments/incidents logs, runbooks for alerts/audit/backlog/incidents/sofiia/voice,
cron jobs, scripts (alert_triage, audit_cleanup, migrate_*, governance, schedule),
task_registry, voice alerts/ha/latency/policy

Docs (30+ files): HUMANIZED_STEPAN v2.7-v3 changelogs and runbooks,
NODA1/NODA2 status and setup, audit index and traces, backlog, incident,
supervisor, tools, voice, opencode, release, risk, aistalk, spacebot

Made-with: Cursor

2026-03-03 07:14:53 -08:00

Service Risk Index

Deterministic. No LLM. Production-grade.

Overview

The Risk Index Engine computes a numerical risk score (0–100+) for every tracked service. It is the single authoritative metric for service health in the DAARION.city control plane.

Score → Band mapping:

Score	Band	Meaning
0–20	low	No significant signals
21–50	medium	Minor signals; monitor
51–80	high	Active problems; coordinate before deploy
81+	critical	Block or escalate immediately
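The band table can be expressed as a small lookup. The following is a minimal sketch, not the engine's actual code: the function name `score_to_band` is borrowed from the architecture diagram later in this document, while its signature and the parameter names (taken from the `thresholds.bands` keys in config/risk_policy.yml) are assumptions.

```python
def score_to_band(score: int, low_max: int = 20, medium_max: int = 50,
                  high_max: int = 80) -> str:
    """Map a risk score to its band using inclusive upper bounds.

    Defaults mirror the Score -> Band table; the signature is an assumption.
    """
    if score <= low_max:
        return "low"
    if score <= medium_max:
        return "medium"
    if score <= high_max:
        return "high"
    return "critical"  # 81+ with the default thresholds


print(score_to_band(72))  # high
```

Because the upper bounds are inclusive, a score of exactly 50 is still medium and a score of exactly 80 is still high.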

Scoring Formula

Risk(service) = Σ weight(signal) × count_or_flag(signal)

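All weights are policy-driven via config/risk_policy.yml. Signals marked "each" in the table below contribute their weight once per occurrence; the escalation signal is tiered and contributes a single flat value for the tier that applies.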

Signal weights (defaults)

Signal	Points
Open P0 incident	50 each
Open P1 incident	25 each
Open P2 incident	10 each
Open P3 incident	5 each
High recurrence signature 7d	20 each
Warn recurrence signature 7d	10 each
High recurrence kind 7d	15 each
Warn recurrence kind 7d	8 each
High recurrence signature 30d	10 each
High recurrence kind 30d	8 each
Overdue follow-up P0	20 each
Overdue follow-up P1	12 each
Overdue follow-up other	6 each
Active SLO violation (60m)	10 each
Alert-loop SLO violation	10 each
Escalations 24h (1–2)	5 (warn level)
Escalations 24h (3+)	12 (high level)
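To make the formula concrete, here is a hedged sketch of the weighted sum using the default points from the table above. The dictionary keys, `escalation_points`, and `risk_score` are hypothetical names chosen for illustration; the real engine and the real key names in config/risk_policy.yml may differ. Escalations are treated as a tiered flag rather than a per-occurrence count, as the table suggests.

```python
# Default points copied from the table above. The key names are hypothetical.
DEFAULT_WEIGHTS = {
    "open_p0": 50, "open_p1": 25, "open_p2": 10, "open_p3": 5,
    "recurrence_sig_high_7d": 20, "recurrence_sig_warn_7d": 10,
    "recurrence_kind_high_7d": 15, "recurrence_kind_warn_7d": 8,
    "recurrence_sig_high_30d": 10, "recurrence_kind_high_30d": 8,
    "followup_overdue_p0": 20, "followup_overdue_p1": 12,
    "followup_overdue_other": 6,
    "slo_violation": 10, "alert_loop_violation": 10,
}


def escalation_points(count_24h: int) -> int:
    """Tiered contribution: 1-2 escalations -> 5, 3 or more -> 12."""
    if count_24h >= 3:
        return 12
    if count_24h >= 1:
        return 5
    return 0


def risk_score(signal_counts: dict, escalations_24h: int = 0,
               weights: dict = DEFAULT_WEIGHTS) -> int:
    """Sum weight x count over all signals, then add the escalation tier.

    An unknown signal name raises KeyError instead of being ignored.
    """
    total = sum(weights[name] * count for name, count in signal_counts.items())
    return total + escalation_points(escalations_24h)


signals = {
    "open_p1": 1,                 # 25
    "recurrence_sig_high_7d": 1,  # 20
    "followup_overdue_p1": 1,     # 12
    "slo_violation": 1,           # 10
}
print(risk_score(signals, escalations_24h=1))  # 25 + 20 + 12 + 10 + 5 = 72
```

The score is an unbounded sum, which is why the overview describes the range as 0–100+: a single open P0 together with a second open P0 already reaches 100.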

Configuration

config/risk_policy.yml — controls all weights, thresholds, and per-service overrides.

thresholds:
  bands:
    low_max: 20
    medium_max: 50
    high_max: 80
  risk_watch:
    warn_at: 50
    fail_at: 80

service_overrides:
  gateway:
    risk_watch:
      fail_at: 75   # gateway fails earlier: critical path

p0_services:
  - gateway
  - router

Changes to the file take effect on the next request, because the loaded policy is cached only briefly.
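One plausible way to read the override block is that per-service values are layered on top of the global `risk_watch` thresholds. The sketch below assumes that merge behaviour; the policy is inlined as a Python dict mirroring the YAML above so the example has no dependencies, and `risk_watch_thresholds` is a hypothetical helper name.

```python
# Mirrors the YAML excerpt above, already parsed into a dict.
POLICY = {
    "thresholds": {
        "bands": {"low_max": 20, "medium_max": 50, "high_max": 80},
        "risk_watch": {"warn_at": 50, "fail_at": 80},
    },
    "service_overrides": {
        "gateway": {"risk_watch": {"fail_at": 75}},
    },
    "p0_services": ["gateway", "router"],
}


def risk_watch_thresholds(policy: dict, service: str) -> dict:
    """Return global risk_watch thresholds with per-service overrides applied.

    Assumption: an override replaces only the keys it names.
    """
    merged = dict(policy["thresholds"]["risk_watch"])
    override = policy.get("service_overrides", {}).get(service, {})
    merged.update(override.get("risk_watch", {}))
    return merged


print(risk_watch_thresholds(POLICY, "gateway"))  # {'warn_at': 50, 'fail_at': 75}
print(risk_watch_thresholds(POLICY, "router"))   # {'warn_at': 50, 'fail_at': 80}
```

Under this reading, gateway keeps the global `warn_at` of 50 and only its `fail_at` drops to 75.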

API

`GET /v1/risk/service/{service}?env=prod&window_hours=24`

Returns a RiskReport:

{
  "service": "gateway",
  "env": "prod",
  "score": 72,
  "band": "high",
  "thresholds": { "warn_at": 50, "fail_at": 75 },
  "components": {
    "open_incidents": { "P0": 0, "P1": 1, "P2": 0, "points": 25 },
    "recurrence": { "high_signatures_7d": 1, "points": 20 },
    "followups": { "overdue_P1": 1, "points": 12 },
    "slo": { "violations": 1, "points": 10 },
    "alerts_loop": { "violations": 0, "points": 0 },
    "escalations": { "count_24h": 1, "points": 5 }
  },
  "reasons": [
    "Open P1 incident(s): 1",
    "High recurrence signatures (7d): 1",
    "Overdue follow-ups (P1): 1",
    "Active SLO violation(s) in window: 1",
    "Escalations in last 24h: 1"
  ],
  "recommendations": [
    "Prioritize open P0/P1 incidents before deploying.",
    "Investigate recurring failure patterns.",
    "Avoid risky deploys until SLO violation clears.",
    "Service is high-risk — coordinate with oncall before release."
  ],
  "updated_at": "2026-02-23T12:00:00"
}

RBAC required: tools.risk.read (granted to agent_cto, agent_oncall, agent_monitor).
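A client call to this endpoint might look like the following sketch, using only the Python standard library. The base URL, the bearer-token header, and both function names are assumptions for illustration: this document does not specify how the API is hosted or how the `tools.risk.read` permission is presented on the wire.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen


def service_risk_url(base_url: str, service: str, env: str = "prod",
                     window_hours: int = 24) -> str:
    """Build the per-service risk URL with its two query parameters."""
    query = urlencode({"env": env, "window_hours": window_hours})
    path = f"/v1/risk/service/{quote(service, safe='')}"
    return f"{base_url.rstrip('/')}{path}?{query}"


def fetch_service_risk(base_url: str, service: str, token: str,
                       env: str = "prod", window_hours: int = 24,
                       timeout: float = 5.0) -> dict:
    """Fetch and decode a RiskReport. The Authorization scheme is assumed."""
    request = Request(
        service_risk_url(base_url, service, env, window_hours),
        headers={"Authorization": f"Bearer {token}",
                 "Accept": "application/json"},
    )
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Hypothetical host; no request is sent when only the URL is built.
print(service_risk_url("http://localhost:8000", "gateway"))
```

A caller would typically read `report["band"]` and `report["reasons"]` from the returned dict rather than re-deriving the band from the score.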

`GET /v1/risk/dashboard?env=prod&top_n=10`

Returns top-N services by score with band summary:

{
  "env": "prod",
  "generated_at": "...",
  "total_services": 4,
  "band_counts": { "critical": 1, "high": 1, "medium": 2, "low": 0 },
  "critical_p0_services": ["gateway"],
  "services": [ ...RiskReports sorted by score desc... ]
}
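The dashboard payload can be derived from a list of RiskReports. This is a rough sketch under stated assumptions: `build_dashboard` is a hypothetical name, `band_counts` is assumed to cover all services rather than only the top N, and `generated_at` is omitted so the output stays deterministic.

```python
from collections import Counter

BANDS = ("critical", "high", "medium", "low")


def build_dashboard(reports: list, p0_services: list, env: str = "prod",
                    top_n: int = 10) -> dict:
    """Summarise RiskReports: sort by score desc, count bands, flag P0s."""
    ranked = sorted(reports, key=lambda r: r["score"], reverse=True)
    counts = Counter(r["band"] for r in reports)
    return {
        "env": env,
        "total_services": len(reports),
        "band_counts": {band: counts.get(band, 0) for band in BANDS},
        # P0 services (from p0_services in the policy) currently in the critical band
        "critical_p0_services": [
            r["service"] for r in ranked
            if r["band"] == "critical" and r["service"] in p0_services
        ],
        "services": ranked[:top_n],
    }


# Illustrative data only; these service names and scores are made up.
reports = [
    {"service": "router", "score": 60, "band": "high"},
    {"service": "gateway", "score": 95, "band": "critical"},
    {"service": "svc_a", "score": 30, "band": "medium"},
    {"service": "svc_b", "score": 25, "band": "medium"},
]
dashboard = build_dashboard(reports, p0_services=["gateway", "router"], top_n=2)
print(dashboard["band_counts"])
print([r["service"] for r in dashboard["services"]])  # ['gateway', 'router']
```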

Tool: `risk_engine_tool`

{ "action": "service",   "service": "gateway", "env": "prod" }
{ "action": "dashboard", "env": "prod", "top_n": 10 }
{ "action": "policy" }

Release Gate: `risk_watch`

The risk_watch gate integrates the Risk Index into the release pipeline.

Behaviour

Mode	When score ≥ warn_at (default 50)	When score ≥ fail_at (default 80)
warn	pass=true + recommendations added	pass=true + recommendations added
strict	pass=true + recommendations added	pass=false — deploy blocked
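The behaviour table reduces to a short decision function. The sketch below is an illustration, not the gate's real implementation: `risk_watch_gate`, the result keys, and the recommendation strings are all hypothetical.

```python
def risk_watch_gate(score: int, mode: str, warn_at: int = 50,
                    fail_at: int = 80) -> dict:
    """Evaluate the gate for one service score.

    warn:   never blocks; recommendations are added at or above warn_at.
    strict: blocks (pass=False) only at or above fail_at.
    """
    if mode not in ("warn", "strict"):
        raise ValueError(f"unknown risk_watch mode: {mode!r}")

    result = {"pass": True, "recommendations": []}
    if score >= fail_at:
        result["recommendations"].append(
            f"Risk score {score} is at or above fail_at ({fail_at}).")
        if mode == "strict":
            result["pass"] = False  # deploy blocked
    elif score >= warn_at:
        result["recommendations"].append(
            f"Risk score {score} is at or above warn_at ({warn_at}).")
    return result


print(risk_watch_gate(72, "strict", fail_at=75)["pass"])  # True
print(risk_watch_gate(80, "strict")["pass"])              # False
print(risk_watch_gate(80, "warn")["pass"])                # True
```

Note that both comparisons are inclusive (`>=`), matching the "score ≥ warn_at" and "score ≥ fail_at" column headers.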

Policy

# config/release_gate_policy.yml
dev:
  risk_watch: { mode: "warn" }
staging:
  risk_watch: { mode: "strict" }   # blocks p0_services when score >= fail_at
prod:
  risk_watch: { mode: "warn" }

Non-fatal guarantee

If the Risk Engine is unavailable (store down, timeout, or any other error), risk_watch is skipped and never blocks the release. A warning is added to the gate output instead.
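The non-fatal guarantee amounts to wrapping the engine call so that any failure degrades to a passing result with a warning. A minimal sketch, assuming hypothetical names (`run_risk_watch`, `fetch_report`, `evaluate`) and hypothetical output keys (`skipped`, `warnings`):

```python
from typing import Callable


def run_risk_watch(fetch_report: Callable[[], dict],
                   evaluate: Callable[[dict], dict]) -> dict:
    """Run the gate, but never let an engine failure block the release."""
    try:
        report = fetch_report()
    except Exception as exc:  # store down, timeout, or any other error
        return {
            "pass": True,
            "skipped": True,
            "warnings": [f"risk_watch skipped: {type(exc).__name__}: {exc}"],
        }
    return evaluate(report)


def failing_fetch() -> dict:
    # Stand-in for an unavailable Risk Engine.
    raise TimeoutError("risk store did not respond")


outcome = run_risk_watch(failing_fetch, evaluate=lambda report: {"pass": False})
print(outcome["pass"], outcome["skipped"])  # True True
```

Catching `Exception` broadly is deliberate here: the guarantee covers every failure of the engine, so narrowing the handler would let an unexpected error block a deploy. Failures inside `evaluate` are left uncaught in this sketch.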

Release inputs

Input	Type	Default	Description
`run_risk_watch`	boolean	true	Enable/disable the gate
`risk_watch_env`	string	prod	Env to score against
`risk_watch_warn_at`	int	policy	Override warn threshold
`risk_watch_fail_at`	int	policy	Override fail threshold

Architecture

[Incident Store]──open incidents──┐
[Intelligence]──recurrence 7d/30d─┤
[Followups Summary]──overdue──────┤──► risk_engine.py ──► RiskReport
[SLO Snapshot]──violations────────┤           │
[Alert Store]──loop SLO───────────┤      score_to_band
[Decision Events]──escalations────┘           │
                                        release_check_runner
                                           risk_watch gate

The engine has zero LLM calls. It is deterministic: given the same signals, the same score is always produced.

Testing

pytest tests/test_risk_engine.py         # scoring + bands + overrides
pytest tests/test_risk_dashboard.py      # sorting + band counts + p0 detection
pytest tests/test_release_check_risk_watch.py  # warn/strict/non-fatal gate
