microdao-daarion/docs/incident/intelligence.md
Apple 67225a39fa docs(platform): add policy configs, runbooks, ops scripts and platform documentation
2026-03-03 07:14:53 -08:00

Incident Intelligence Layer

Deterministic, 0 LLM tokens. Pattern detection and weekly reporting built on top of the existing Incident Store and Alert State Machine.


Overview

The Incident Intelligence Layer adds three analytical capabilities to the incident management platform:

| Capability | Action | Description |
|---|---|---|
| Correlation | correlate | Find related incidents for a given incident ID using scored rule matching |
| Recurrence Detection | recurrence | Frequency tables for 7d/30d windows with threshold classification |
| Weekly Digest | weekly_digest | Full markdown + JSON report saved to ops/reports/incidents/weekly/ |

All three functions are deterministic and idempotent: running them twice on the same data produces the same output.


Architecture

incident_intelligence_tool (tool_manager.py)
    │
    ├── correlate     → incident_intelligence.correlate_incident()
    ├── recurrence    → incident_intelligence.detect_recurrence()
    └── weekly_digest → incident_intelligence.weekly_digest()
                              │
                        IncidentStore (INCIDENT_BACKEND=auto)
                        incident_intel_utils.py (helpers)
                        config/incident_intelligence_policy.yml

Policy: config/incident_intelligence_policy.yml

Correlation rules

Each rule defines a name, weight (score contribution), and match conditions:

| Rule name | Weight | Match conditions |
|---|---|---|
| same_signature | 100 | Exact SHA-256 signature match |
| same_service_and_kind | 60 | Same service and same kind |
| same_service_time_cluster | 40 | Same service, started within within_minutes |
| same_kind_cross_service | 30 | Same kind (cross-service), within within_minutes |

The final score is the sum of all matching rule weights. Only incidents scoring ≥ min_score (default: 20) appear in results.

Example: two incidents with the same signature, the same service and kind, and start times within 180 min of each other match all four rules → score = 100 + 60 + 40 + 30 = 230.
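
The additive scoring described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code in incident_intelligence.py: the Incident dataclass, score_pair and correlate are hypothetical names, both time-based rules are assumed to share a single within_minutes of 180, and same_kind_cross_service is assumed to fire for same-service pairs as well (which is what the 230 example implies).

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Incident:  # hypothetical minimal shape, not the IncidentStore record
    signature: str
    service: str
    kind: str
    started_at: datetime


# Weights copied from the rules table above.
WEIGHTS = {
    "same_signature": 100,
    "same_service_and_kind": 60,
    "same_service_time_cluster": 40,
    "same_kind_cross_service": 30,
}
WITHIN_MINUTES = 180  # assumption: one shared window for both time-based rules
MIN_SCORE = 20        # documented default


def score_pair(target, candidate):
    """Return (score, reasons): the sum of weights of every matching rule."""
    gap_min = abs((target.started_at - candidate.started_at).total_seconds()) / 60
    close = gap_min <= WITHIN_MINUTES
    same_service = target.service == candidate.service
    same_kind = target.kind == candidate.kind
    matched = {
        "same_signature": target.signature == candidate.signature,
        "same_service_and_kind": same_service and same_kind,
        "same_service_time_cluster": same_service and close,
        # Assumption: the rule does not require the services to differ.
        "same_kind_cross_service": same_kind and close,
    }
    reasons = [name for name, hit in matched.items() if hit]
    return sum(WEIGHTS[name] for name in reasons), reasons


def correlate(target, candidates):
    """Keep candidates scoring at least MIN_SCORE, highest score first."""
    scored = [(score_pair(target, c), c) for c in candidates]
    kept = [(score, reasons, c) for (score, reasons), c in scored if score >= MIN_SCORE]
    return sorted(kept, key=lambda item: item[0], reverse=True)
```

Under these assumptions, two gateway error_rate incidents with the same signature that start 90 minutes apart score 230, while the same pair three days apart scores 160 because neither time-based rule fires.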

Recurrence thresholds

recurrence:
  thresholds:
    signature:
      warn: 3   # ≥ 3 occurrences in window → warn
      high: 6   # ≥ 6 occurrences → high
    kind:
      warn: 5
      high: 10

High-recurrence items receive deterministic recommendations from recurrence.recommendations templates (using Python .format() substitution with {sig}, {kind}, etc.).
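
As a rough sketch of how the thresholds and templates could combine: the classify helper, the "ok" label for counts below warn, and the template text are all hypothetical. Only the threshold values and the {sig} placeholder come from the policy described above; {count} and {window} are invented for illustration.

```python
# Threshold values copied from the policy snippet above.
THRESHOLDS = {
    "signature": {"warn": 3, "high": 6},
    "kind": {"warn": 5, "high": 10},
}

# Hypothetical template; the real ones live under recurrence.recommendations.
TEMPLATE = "Signature {sig} recurred {count} times in {window}d: add a regression test"


def classify(dimension, count):
    """Map an occurrence count to 'high', 'warn' or 'ok' (the 'ok' label is an assumption)."""
    limits = THRESHOLDS[dimension]
    if count >= limits["high"]:
        return "high"
    if count >= limits["warn"]:
        return "warn"
    return "ok"


def recommend(sig, count, window_days):
    """Render a recommendation only for high-recurrence signatures."""
    if classify("signature", count) != "high":
        return None
    return TEMPLATE.format(sig=sig, count=count, window=window_days)
```

Because the output depends only on the counts and the template strings, the same input always yields the same recommendation text.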


Tool Usage

correlate

{
  "tool": "incident_intelligence_tool",
  "action": "correlate",
  "incident_id": "inc_20260218_1430_abc123",
  "append_note": true
}

Response:

{
  "incident_id": "inc_20260218_1430_abc123",
  "related_count": 3,
  "related": [
    {
      "incident_id": "inc_20260215_0900_def456",
      "score": 230,
      "reasons": ["same_signature", "same_service_and_kind", "same_service_time_cluster", "same_kind_cross_service"],
      "service": "gateway",
      "kind": "error_rate",
      "severity": "P1",
      "status": "closed",
      "started_at": "2026-02-15T09:00:00"
    }
  ]
}

When append_note=true, a timeline event of type note is appended to the target incident listing the top-5 related incidents.

recurrence

{
  "tool": "incident_intelligence_tool",
  "action": "recurrence",
  "window_days": 7
}

Response includes top_signatures, top_kinds, top_services, high_recurrence, and warn_recurrence tables.

weekly_digest

{
  "tool": "incident_intelligence_tool",
  "action": "weekly_digest",
  "save_artifacts": true
}

Response:

{
  "week": "2026-W08",
  "artifact_paths": [
    "ops/reports/incidents/weekly/2026-W08.json",
    "ops/reports/incidents/weekly/2026-W08.md"
  ],
  "markdown_preview": "# Weekly Incident Digest — 2026-W08\n...",
  "json_summary": {
    "week": "2026-W08",
    "open_incidents_count": 2,
    "recent_7d_count": 12,
    "recommendations": [...]
  }
}

RBAC

| Action | Required entitlement | Roles |
|---|---|---|
| correlate | tools.oncall.read | agent_cto, agent_oncall |
| recurrence | tools.oncall.read | agent_cto, agent_oncall |
| weekly_digest | tools.oncall.incident_write | agent_cto, agent_oncall |

Monitor (agent_monitor) has no access to incident_intelligence_tool.


Rate limits

| Action | Timeout | RPM |
|---|---|---|
| correlate | 10s | 10 |
| recurrence | 15s | 5 |
| weekly_digest | 20s | 3 |

Scheduled Job

Task ID: weekly_incident_digest
Schedule: Every Monday 08:00 UTC
Cron: 0 8 * * 1

# NODE1 — add to ops user crontab
0 8 * * 1   /usr/local/bin/job_runner.sh weekly_incident_digest '{}'

Artifacts are written to ops/reports/incidents/weekly/YYYY-Www.json and YYYY-Www.md (for example, 2026-W08.json and 2026-W08.md).
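
A week label such as 2026-W08 can be derived with the standard library. The sketch below assumes ISO 8601 week numbering and labels the week that contains the given date; whether the digest uses ISO weeks, and whether it labels the run week or the preceding one, is not stated here. week_label and artifact_paths are hypothetical helper names.

```python
from datetime import date
from pathlib import PurePosixPath

REPORT_DIR = PurePosixPath("ops/reports/incidents/weekly")


def week_label(day):
    """Format the ISO year and week of `day` as e.g. '2026-W08'."""
    iso_year, iso_week, _weekday = day.isocalendar()
    return f"{iso_year}-W{iso_week:02d}"


def artifact_paths(day):
    """Return the JSON and markdown report paths for the week containing `day`."""
    label = week_label(day)
    return [str(REPORT_DIR / f"{label}.json"), str(REPORT_DIR / f"{label}.md")]
```

Using the ISO year rather than the calendar year matters around New Year: 2025-12-29 falls in week 2026-W01.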


How scoring works

Score(target, candidate) = Σ weight(rule) for each rule that matches

Rules are evaluated in order. The "same_signature" rule checks the signature only:
  - If signatures match → score += 100; any other conditions on this rule are ignored.
  - If signatures do not match → the rule contributes nothing (score += 0).

All other rules use combined conditions (AND logic):
  - All conditions in match{} must be satisfied for the rule to fire.

Two incidents with identical signatures will always score ≥ 100. Two incidents sharing service + kind score ≥ 60. Time proximity (within 180 min, same service) scores ≥ 40.


Tuning guide

| Goal | Change |
|---|---|
| Reduce false positives in correlation | Increase min_score (e.g., 40) |
| More aggressive recurrence warnings | Lower thresholds.signature.warn |
| Shorter lookback for correlation | Decrease correlation.lookback_days |
| Disable kind-based cross-service matching | Remove same_kind_cross_service rule |
| Longer digest | Increase digest.markdown_max_chars |

Files

| File | Purpose |
|---|---|
| services/router/incident_intelligence.py | Core engine: correlate / recurrence / weekly_digest |
| services/router/incident_intel_utils.py | Helpers: kind extraction, time math, truncation |
| config/incident_intelligence_policy.yml | All tuneable policy parameters |
| tests/test_incident_correlation.py | Correlation unit tests |
| tests/test_incident_recurrence.py | Recurrence detection tests |
| tests/test_weekly_digest.py | Weekly digest tests (incl. artifact write) |

Root-Cause Buckets

Overview

build_root_cause_buckets clusters incidents into actionable groups. The bucket key is either service|kind (default) or a signature prefix.

Filtering: only buckets meeting min_count thresholds appear:

  • count_7d ≥ buckets.min_count[7] (default: 3) OR
  • count_30d ≥ buckets.min_count[30] (default: 6)

Sorting: count_7d desc → count_30d desc → last_seen desc.
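
The filter and sort order above can be expressed compactly. This is a sketch only: select_buckets is a hypothetical name, buckets are assumed to be dicts with flat count_7d, count_30d and last_seen fields, and last_seen is assumed to be an ISO 8601 string in a uniform format so that string comparison matches chronological order.

```python
# Documented defaults for buckets.min_count, keyed by window length in days.
MIN_COUNT = {7: 3, 30: 6}


def select_buckets(buckets):
    """Keep buckets that meet either threshold, then sort by the documented keys."""
    kept = [
        b for b in buckets
        if b["count_7d"] >= MIN_COUNT[7] or b["count_30d"] >= MIN_COUNT[30]
    ]
    # Descending on all three keys: count_7d, then count_30d, then last_seen.
    return sorted(
        kept,
        key=lambda b: (b["count_7d"], b["count_30d"], b["last_seen"]),
        reverse=True,
    )
```

A bucket with only one incident in the last 7 days still appears if it reaches the 30-day threshold, since the two conditions are joined with OR.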

Tool usage

{
  "tool": "incident_intelligence_tool",
  "action": "buckets",
  "service": "gateway",
  "window_days": 30
}

Response:

{
  "service_filter": "gateway",
  "window_days": 30,
  "bucket_count": 2,
  "buckets": [
    {
      "bucket_key": "gateway|error_rate",
      "counts": {"7d": 5, "30d": 12, "open": 2},
      "last_seen": "2026-02-22T14:30:00",
      "services": ["gateway"],
      "kinds": ["error_rate"],
      "top_signatures": [{"signature": "aabbccdd", "count": 4}],
      "severity_mix": {"P0": 0, "P1": 2, "P2": 3},
      "sample_incidents": [...],
      "recommendations": [
        "Add regression test for API contract & error mapping",
        "Add/adjust SLO thresholds & alert routing"
      ]
    }
  ]
}

Deterministic recommendations by kind

| Kind | Recommendations |
|---|---|
| error_rate, slo_breach | Add regression test; review deploys; adjust SLO thresholds |
| latency | Check p95 vs saturation; investigate DB/queue contention |
| oom, crashloop | Memory profiling; container limits; fix leaks |
| disk | Retention/cleanup automation; verify volumes |
| security | Dependency scanner + rotate secrets; verify allowlists |
| queue | Consumer lag + dead-letter queue |
| network | DNS audit; network policies |
| (any open incidents) | ⚠ Do not deploy risky changes until mitigated |

Auto Follow-ups (policy-driven)

When weekly_digest runs with autofollowups.enabled=true, it automatically appends a followup event to the most recent open incident in each high-recurrence bucket.

Deduplication

Follow-up key: {dedupe_key_prefix}:{YYYY-Www}:{bucket_key}, where YYYY-Www is the week label (for example, 2026-W08).

One follow-up per bucket per week. A second call in the same week with the same bucket → skipped with reason: already_exists.

A new week (the week label changes) → a new follow-up is created.
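
The deduplication rule can be sketched as a key lookup. The helper names and the return shape are hypothetical, and an in-memory set stands in for whatever the incident store uses to look up existing follow-ups.

```python
def dedupe_key(prefix, week, bucket_key):
    """Build the follow-up key from the policy prefix, week label and bucket key."""
    return f"{prefix}:{week}:{bucket_key}"


def maybe_create_followup(existing_keys, prefix, week, bucket_key):
    """Create at most one follow-up per bucket per week.

    `existing_keys` is a mutable set of keys already recorded (an assumption;
    the real lookup goes through the incident store).
    """
    key = dedupe_key(prefix, week, bucket_key)
    if key in existing_keys:
        return {"created": False, "reason": "already_exists", "dedupe_key": key}
    existing_keys.add(key)
    return {"created": True, "dedupe_key": key}
```

Calling it twice with week 2026-W08 and bucket gateway|error_rate creates one follow-up and skips the second; the same bucket in 2026-W09 produces a new key and therefore a new follow-up.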

Policy knobs

autofollowups:
  enabled: true
  only_when_high: true          # only high-recurrence buckets trigger follow-ups
  owner: "oncall"
  priority: "P1"
  due_days: 7
  dedupe_key_prefix: "intel_recur"

Follow-up event structure

{
  "type": "followup",
  "message": "[intel] Recurrence high: gateway|error_rate (7d=5, 30d=12, kinds=error_rate)",
  "meta": {
    "title": "[intel] Recurrence high: gateway|error_rate",
    "owner": "oncall",
    "priority": "P1",
    "due_date": "2026-03-02",
    "dedupe_key": "intel_recur:2026-W08:gateway|error_rate",
    "auto_created": true,
    "bucket_key": "gateway|error_rate",
    "count_7d": 5
  }
}

recurrence_watch Release Gate

Purpose

Warns (or blocks in staging) when the service being deployed has a high incident recurrence pattern — catching "we're deploying into a known-bad state."

GatePolicy profiles

| Profile | Mode | Blocks on |
|---|---|---|
| dev | warn | Never blocks |
| staging | strict | High recurrence + P0/P1 severity |
| prod | warn | Never blocks (accumulate data first) |

Strict mode logic

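if mode == "strict":
    # Block only when recurrence is high and the worst severity seen is listed in fail_on.severity_in
    if gate.has_high_recurrence and gate.max_severity_seen in fail_on.severity_in:
        gate_passed = False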

fail_on.severity_in defaults to ["P0", "P1"]. A high-recurrence bucket that contains only P2/P3 incidents does not block.
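
A self-contained version of the strict-mode check might look like the following. GateResult and gate_passes are hypothetical names that model only the two fields the decision depends on; the real gate carries more output fields.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class GateResult:  # hypothetical subset of the gate output fields
    has_high_recurrence: bool
    max_severity_seen: Optional[str]


def gate_passes(mode, gate, severity_in: Sequence[str] = ("P0", "P1")):
    """Return False only in strict mode with high recurrence and a listed severity."""
    if mode != "strict":
        return True  # warn (and any other non-strict mode) never blocks
    blocked = gate.has_high_recurrence and gate.max_severity_seen in severity_in
    return not blocked
```

With the defaults, high recurrence with a worst severity of P1 blocks in strict mode, while the same recurrence with only P2 incidents passes.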

Gate output fields

| Field | Description |
|---|---|
| has_high_recurrence | True if any signature or kind is in "high" zone |
| has_warn_recurrence | True if any signature or kind is in "warn" zone |
| max_severity_seen | Most severe incident in the service window |
| high_signatures | List of first 5 high-recurrence signature prefixes |
| high_kinds | List of first 5 high-recurrence kinds |
| total_incidents | Total incidents in window |
| skipped | True if gate was bypassed (error or tool unavailable) |

Input overrides

{
  "run_recurrence_watch": true,
  "recurrence_watch_mode": "off",       // override policy
  "recurrence_watch_windows_days": [7, 30],
  "recurrence_watch_service": "gateway" // default: service_name from release inputs
}

Backward compatibility

If run_recurrence_watch is not in inputs, it defaults to true. If recurrence_watch_mode is not set, the gate falls back to the GatePolicy profile setting.
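
The fallback rules can be sketched as a small resolver. resolve_recurrence_watch is a hypothetical helper, and treating a missing or null recurrence_watch_mode as "not set" is an assumption.

```python
def resolve_recurrence_watch(inputs, profile_mode):
    """Apply the documented defaults to the release inputs.

    `profile_mode` is the mode from the active GatePolicy profile
    (for example "warn" or "strict").
    """
    return {
        # A missing flag means the gate runs.
        "run": inputs.get("run_recurrence_watch", True),
        # An explicit override wins; otherwise use the profile's mode.
        "mode": inputs.get("recurrence_watch_mode") or profile_mode,
        # The service defaults to service_name from the release inputs.
        "service": inputs.get("recurrence_watch_service") or inputs.get("service_name"),
    }
```

Older callers that pass none of the recurrence_watch_* keys therefore get the gate enabled in the profile's mode for their own service.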