Apple 67225a39fa docs(platform): add policy configs, runbooks, ops scripts and platform documentation

Config policies (16 files): alert_routing, architecture_pressure, backlog,
cost_weights, data_governance, incident_escalation, incident_intelligence,
network_allowlist, nodes_registry, observability_sources, rbac_tools_matrix,
release_gate, risk_attribution, risk_policy, slo_policy, tool_limits, tools_rollout

Ops (22 files): Caddyfile, calendar compose, grafana voice dashboard,
deployments/incidents logs, runbooks for alerts/audit/backlog/incidents/sofiia/voice,
cron jobs, scripts (alert_triage, audit_cleanup, migrate_*, governance, schedule),
task_registry, voice alerts/ha/latency/policy

Docs (30+ files): HUMANIZED_STEPAN v2.7-v3 changelogs and runbooks,
NODA1/NODA2 status and setup, audit index and traces, backlog, incident,
supervisor, tools, voice, opencode, release, risk, aistalk, spacebot

Made-with: Cursor

2026-03-03 07:14:53 -08:00

Incident Intelligence Layer

Deterministic, 0 LLM tokens. Pattern detection and weekly reporting built on top of the existing Incident Store and Alert State Machine.

Overview

The Incident Intelligence Layer adds three analytical capabilities to the incident management platform:

Capability	Action	Description
Correlation	`correlate`	Find related incidents for a given incident ID using scored rule matching
Recurrence Detection	`recurrence`	Frequency tables for 7d/30d windows with threshold classification
Weekly Digest	`weekly_digest`	Full markdown + JSON report saved to `ops/reports/incidents/weekly/`

All three functions are deterministic: running them twice on the same data produces the same output.

Architecture

incident_intelligence_tool (tool_manager.py)
    │
    ├── correlate     → incident_intelligence.correlate_incident()
    ├── recurrence    → incident_intelligence.detect_recurrence()
    └── weekly_digest → incident_intelligence.weekly_digest()
                              │
                        IncidentStore (INCIDENT_BACKEND=auto)
                        incident_intel_utils.py (helpers)
                        config/incident_intelligence_policy.yml

Policy: `config/incident_intelligence_policy.yml`

Correlation rules

Each rule defines a name, weight (score contribution), and match conditions:

Rule name	Weight	Match conditions
`same_signature`	100	Exact SHA-256 signature match
`same_service_and_kind`	60	Same service and same kind
`same_service_time_cluster`	40	Same service, started within `within_minutes`
`same_kind_cross_service`	30	Same kind (cross-service), within `within_minutes`

The final score is the sum of all matching rule weights. Only incidents scoring ≥ min_score (default: 20) appear in results.

Example: two incidents with the same signature that also share service+kind within 180 min → score = 100 + 60 + 40 + 30 = 230.

Recurrence thresholds

recurrence:
  thresholds:
    signature:
      warn: 3   # ≥ 3 occurrences in window → warn
      high: 6   # ≥ 6 occurrences → high
    kind:
      warn: 5
      high: 10

High-recurrence items receive deterministic recommendations from recurrence.recommendations templates (using Python .format() substitution with {sig}, {kind}, etc.).
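The threshold classification and template substitution described above can be sketched as follows. This is a minimal illustration, not the engine's actual code: the `classify` and `kind_recurrence` helpers and the `KIND_TEMPLATE` wording are hypothetical, and only the threshold numbers come from the policy excerpt.

```python
from collections import Counter

# Thresholds copied from the policy excerpt above.
THRESHOLDS = {
    "signature": {"warn": 3, "high": 6},
    "kind": {"warn": 5, "high": 10},
}

# Hypothetical template; the real wording lives under recurrence.recommendations.
KIND_TEMPLATE = "Kind '{kind}' recurred {count} times in {window}d: schedule a review"


def classify(count: int, limits: dict) -> str:
    """Return 'high', 'warn', or 'ok' for an occurrence count."""
    if count >= limits["high"]:
        return "high"
    if count >= limits["warn"]:
        return "warn"
    return "ok"


def kind_recurrence(kinds: list[str], window_days: int) -> list[dict]:
    """Build a frequency table of kinds with a threshold level per row."""
    rows = []
    # Sort by count (descending), then by name, so the order is deterministic.
    for kind, count in sorted(Counter(kinds).items(), key=lambda kv: (-kv[1], kv[0])):
        level = classify(count, THRESHOLDS["kind"])
        row = {"kind": kind, "count": count, "level": level}
        if level == "high":
            # Plain str.format() substitution, as the policy templates use.
            row["recommendation"] = KIND_TEMPLATE.format(
                kind=kind, count=count, window=window_days
            )
        rows.append(row)
    return rows
```

Because the table is built from counts and a fixed sort key only, the same input list always yields the same rows, which matches the deterministic behaviour the layer promises.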

Tool Usage

`correlate`

{
  "tool": "incident_intelligence_tool",
  "action": "correlate",
  "incident_id": "inc_20260218_1430_abc123",
  "append_note": true
}

Response:

{
  "incident_id": "inc_20260218_1430_abc123",
  "related_count": 3,
  "related": [
    {
      "incident_id": "inc_20260215_0900_def456",
      "score": 230,
      "reasons": ["same_signature", "same_service_and_kind", "same_service_time_cluster"],
      "service": "gateway",
      "kind": "error_rate",
      "severity": "P1",
      "status": "closed",
      "started_at": "2026-02-15T09:00:00"
    }
  ]
}

When append_note=true, a timeline event of type note, listing the top 5 related incidents, is appended to the target incident.

`recurrence`

{
  "tool": "incident_intelligence_tool",
  "action": "recurrence",
  "window_days": 7
}

Response includes top_signatures, top_kinds, top_services, high_recurrence, and warn_recurrence tables.

`weekly_digest`

{
  "tool": "incident_intelligence_tool",
  "action": "weekly_digest",
  "save_artifacts": true
}

Response:

{
  "week": "2026-W08",
  "artifact_paths": [
    "ops/reports/incidents/weekly/2026-W08.json",
    "ops/reports/incidents/weekly/2026-W08.md"
  ],
  "markdown_preview": "# Weekly Incident Digest — 2026-W08\n...",
  "json_summary": {
    "week": "2026-W08",
    "open_incidents_count": 2,
    "recent_7d_count": 12,
    "recommendations": [...]
  }
}

RBAC

Action	Required entitlement	Roles
`correlate`	`tools.oncall.read`	`agent_cto`, `agent_oncall`
`recurrence`	`tools.oncall.read`	`agent_cto`, `agent_oncall`
`weekly_digest`	`tools.oncall.incident_write`	`agent_cto`, `agent_oncall`

Monitor (agent_monitor) has no access to incident_intelligence_tool.
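As a rough sketch of how the table above translates into an access check: the `REQUIRED_ENTITLEMENT` mapping mirrors the table, while `ROLE_ENTITLEMENTS` and `is_allowed` are hypothetical and reduced to what this section states. The real role definitions presumably live in the RBAC tools matrix and may grant these roles further entitlements for other tools.

```python
# Required entitlement per action, copied from the table above.
REQUIRED_ENTITLEMENT = {
    "correlate": "tools.oncall.read",
    "recurrence": "tools.oncall.read",
    "weekly_digest": "tools.oncall.incident_write",
}

# Assumption: a simplified role map that only covers this tool.
ROLE_ENTITLEMENTS = {
    "agent_cto": {"tools.oncall.read", "tools.oncall.incident_write"},
    "agent_oncall": {"tools.oncall.read", "tools.oncall.incident_write"},
    "agent_monitor": set(),  # no access to incident_intelligence_tool
}


def is_allowed(role: str, action: str) -> bool:
    """Return True if the role holds the entitlement the action requires."""
    required = REQUIRED_ENTITLEMENT.get(action)
    if required is None:
        return False  # unknown actions are denied
    return required in ROLE_ENTITLEMENTS.get(role, set())
```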

Rate limits

Action	Timeout	RPM
`correlate`	10s	10
`recurrence`	15s	5
`weekly_digest`	20s	3

Scheduled Job

Task ID: weekly_incident_digest
Schedule: Every Monday 08:00 UTC
Cron: 0 8 * * 1

# NODE1 — add to ops user crontab
0 8 * * 1   /usr/local/bin/job_runner.sh weekly_incident_digest '{}'

Artifacts are written to ops/reports/incidents/weekly/ as YYYY-Www.json and YYYY-Www.md (for example, 2026-W08.json and 2026-W08.md).
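The week label in the artifact names looks like an ISO 8601 week number. Assuming that is the case (the source does not say so explicitly), the label and paths could be derived as in this sketch; `week_label` and `artifact_paths` are illustrative names, not functions from the codebase.

```python
from datetime import date
from pathlib import PurePosixPath

REPORT_DIR = PurePosixPath("ops/reports/incidents/weekly")


def week_label(day: date) -> str:
    """Format a date as an ISO week label such as '2026-W08'."""
    iso_year, iso_week, _ = day.isocalendar()
    return f"{iso_year}-W{iso_week:02d}"


def artifact_paths(day: date) -> list[str]:
    """Return the JSON and Markdown artifact paths for the week containing `day`."""
    label = week_label(day)
    return [str(REPORT_DIR / f"{label}.json"), str(REPORT_DIR / f"{label}.md")]
```

Note that the ISO year can differ from the calendar year near January 1: under this scheme, 2025-12-29 belongs to 2026-W01.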

How scoring works

Score(target, candidate) = Σ weight(rule) for each rule that matches

Rules are evaluated in order. The "same_signature" rule checks only the signature:
  - If signatures match → score += 100; no other condition is checked for this rule.
  - If signatures do not match → the rule contributes nothing (score += 0).

All other rules use combined conditions (AND logic):
  - All conditions in match{} must be satisfied for the rule to fire.

Two incidents with identical signatures will always score ≥ 100. Two incidents sharing service + kind score ≥ 60. Time proximity (within 180 min, same service) scores ≥ 40.
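The additive scoring can be illustrated with a small sketch. The weights and rule names come from the policy table; the `Incident` dataclass, the `RULES` layout, and the `score` and `related` helpers are assumptions made for illustration. In particular, `same_kind_cross_service` is treated here as "same kind, regardless of service", which is what the 230-point example implies.

```python
from dataclasses import dataclass
from datetime import datetime

WITHIN_MINUTES = 180  # value used in the examples above
MIN_SCORE = 20        # documented default


@dataclass(frozen=True)
class Incident:
    incident_id: str
    service: str
    kind: str
    signature: str
    started_at: datetime


def _close_in_time(a: Incident, b: Incident) -> bool:
    delta = abs((a.started_at - b.started_at).total_seconds())
    return delta <= WITHIN_MINUTES * 60


# (name, weight, predicate), in evaluation order.
RULES = [
    ("same_signature", 100, lambda a, b: a.signature == b.signature),
    ("same_service_and_kind", 60, lambda a, b: a.service == b.service and a.kind == b.kind),
    ("same_service_time_cluster", 40, lambda a, b: a.service == b.service and _close_in_time(a, b)),
    ("same_kind_cross_service", 30, lambda a, b: a.kind == b.kind and _close_in_time(a, b)),
]


def score(target: Incident, candidate: Incident) -> tuple[int, list[str]]:
    """Sum the weights of every matching rule and record which rules fired."""
    total, reasons = 0, []
    for name, weight, matches in RULES:
        if matches(target, candidate):
            total += weight
            reasons.append(name)
    return total, reasons


def related(target: Incident, candidates: list[Incident]) -> list[tuple[str, int]]:
    """Return (incident_id, score) pairs at or above MIN_SCORE, best first."""
    scored = [(c.incident_id, score(target, c)[0]) for c in candidates]
    kept = [pair for pair in scored if pair[1] >= MIN_SCORE]
    # Tie-break on incident_id so the ordering stays deterministic.
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))
```

Every rule is evaluated for every candidate, so a signature match does not stop the remaining rules from adding their weights.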

Tuning guide

Goal	Change
Reduce false positives in correlation	Increase `min_score` (e.g., 40)
More aggressive recurrence warnings	Lower `thresholds.signature.warn`
Shorter lookback for correlation	Decrease `correlation.lookback_days`
Disable kind-based cross-service matching	Remove `same_kind_cross_service` rule
Longer digest	Increase `digest.markdown_max_chars`

Files

File	Purpose
`services/router/incident_intelligence.py`	Core engine: correlate / recurrence / weekly_digest
`services/router/incident_intel_utils.py`	Helpers: kind extraction, time math, truncation
`config/incident_intelligence_policy.yml`	All tuneable policy parameters
`tests/test_incident_correlation.py`	Correlation unit tests
`tests/test_incident_recurrence.py`	Recurrence detection tests
`tests/test_weekly_digest.py`	Weekly digest tests (incl. artifact write)

Root-Cause Buckets

Overview

build_root_cause_buckets clusters incidents into actionable groups. The bucket key is either service|kind (default) or a signature prefix.

Filtering: only buckets meeting min_count thresholds appear:

count_7d ≥ buckets.min_count[7] (default: 3) OR
count_30d ≥ buckets.min_count[30] (default: 6)

Sorting: count_7d desc → count_30d desc → last_seen desc.
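The filter and sort order above can be expressed compactly. This is a sketch under the assumption that each bucket is a flat dict with `count_7d`, `count_30d`, and an ISO 8601 `last_seen` string; `select_buckets` is a hypothetical helper, not the body of build_root_cause_buckets.

```python
MIN_COUNT = {7: 3, 30: 6}  # documented defaults for buckets.min_count


def select_buckets(buckets: list[dict]) -> list[dict]:
    """Keep buckets that meet either threshold, then sort most active first."""
    kept = [
        b for b in buckets
        if b["count_7d"] >= MIN_COUNT[7] or b["count_30d"] >= MIN_COUNT[30]
    ]
    # All three keys sort descending, so one tuple key with reverse=True suffices.
    # ISO 8601 timestamps in the same format compare correctly as strings.
    return sorted(
        kept,
        key=lambda b: (b["count_7d"], b["count_30d"], b["last_seen"]),
        reverse=True,
    )
```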

Tool usage

{
  "tool": "incident_intelligence_tool",
  "action": "buckets",
  "service": "gateway",
  "window_days": 30
}

Response:

{
  "service_filter": "gateway",
  "window_days": 30,
  "bucket_count": 2,
  "buckets": [
    {
      "bucket_key": "gateway|error_rate",
      "counts": {"7d": 5, "30d": 12, "open": 2},
      "last_seen": "2026-02-22T14:30:00",
      "services": ["gateway"],
      "kinds": ["error_rate"],
      "top_signatures": [{"signature": "aabbccdd", "count": 4}],
      "severity_mix": {"P0": 0, "P1": 2, "P2": 3},
      "sample_incidents": [...],
      "recommendations": [
        "Add regression test for API contract & error mapping",
        "Add/adjust SLO thresholds & alert routing"
      ]
    }
  ]
}

Deterministic recommendations by kind

Kind	Recommendations
`error_rate`, `slo_breach`	Add regression test; review deploys; adjust SLO thresholds
`latency`	Check p95 vs saturation; investigate DB/queue contention
`oom`, `crashloop`	Memory profiling; container limits; fix leaks
`disk`	Retention/cleanup automation; verify volumes
`security`	Dependency scanner + rotate secrets; verify allowlists
`queue`	Consumer lag + dead-letter queue
`network`	DNS audit; network policies
(any open incidents)	⚠ Do not deploy risky changes until mitigated

Auto Follow-ups (policy-driven)

When weekly_digest runs with autofollowups.enabled=true, it automatically appends a followup event to the most recent open incident in each high-recurrence bucket.

Deduplication

Follow-up key: {dedupe_key_prefix}:{YYYY-WW}:{bucket_key}

One follow-up per bucket per week. A second call in the same week with the same bucket → skipped with reason: already_exists.

A new week (YYYY-WW changes) → new follow-up is created.
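A minimal sketch of the deduplication rule, assuming the set of already-used keys can be looked up in memory. In the real system the keys would be read from the incident timeline; `dedupe_key` and `maybe_create_followup` are hypothetical names, and the `created` field in the result is illustrative.

```python
def dedupe_key(prefix: str, week: str, bucket_key: str) -> str:
    """Build the follow-up key: {dedupe_key_prefix}:{YYYY-WW}:{bucket_key}."""
    return f"{prefix}:{week}:{bucket_key}"


def maybe_create_followup(
    existing_keys: set[str], prefix: str, week: str, bucket_key: str
) -> dict:
    """Create at most one follow-up per bucket per week."""
    key = dedupe_key(prefix, week, bucket_key)
    if key in existing_keys:
        return {"created": False, "reason": "already_exists", "dedupe_key": key}
    existing_keys.add(key)  # remember the key so later calls are skipped
    return {"created": True, "dedupe_key": key}
```

Since the week label is part of the key, the same bucket produces a fresh key, and therefore a fresh follow-up, as soon as the week changes.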

Policy knobs

autofollowups:
  enabled: true
  only_when_high: true          # only high-recurrence buckets trigger follow-ups
  owner: "oncall"
  priority: "P1"
  due_days: 7
  dedupe_key_prefix: "intel_recur"

Follow-up event structure

{
  "type": "followup",
  "message": "[intel] Recurrence high: gateway|error_rate (7d=5, 30d=12, kinds=error_rate)",
  "meta": {
    "title": "[intel] Recurrence high: gateway|error_rate",
    "owner": "oncall",
    "priority": "P1",
    "due_date": "2026-03-02",
    "dedupe_key": "intel_recur:2026-W08:gateway|error_rate",
    "auto_created": true,
    "bucket_key": "gateway|error_rate",
    "count_7d": 5
  }
}

`recurrence_watch` Release Gate

Purpose

Warns (or blocks in staging) when the service being deployed has a high incident recurrence pattern — catching "we're deploying into a known-bad state."

GatePolicy profiles

Profile	Mode	Blocks on
`dev`	`warn`	Never blocks
`staging`	`strict`	High recurrence + P0/P1 severity
`prod`	`warn`	Never blocks (accumulate data first)

Strict mode logic

if mode == "strict":
    if gate.has_high_recurrence and gate.max_severity_seen in fail_on.severity_in:
        gate_passed = False  # block the release

fail_on.severity_in defaults to ["P0", "P1"]. A service with high recurrence whose most severe incident in the window is P2 or P3 does not block.
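Combining the profile table with the strict-mode rule gives the following sketch. The profile-to-mode mapping and the default severity list come from this section; the `evaluate_gate` function and the shape of its return value are assumptions made for illustration.

```python
from typing import Optional

PROFILE_MODES = {"dev": "warn", "staging": "strict", "prod": "warn"}
FAIL_ON_SEVERITY = ("P0", "P1")  # documented default for fail_on.severity_in


def evaluate_gate(
    profile: str, has_high_recurrence: bool, max_severity_seen: Optional[str]
) -> dict:
    """Return the gate mode and whether the release may proceed."""
    mode = PROFILE_MODES[profile]
    blocked = (
        mode == "strict"
        and has_high_recurrence
        and max_severity_seen in FAIL_ON_SEVERITY
    )
    # In warn mode the gate still reports its findings but never blocks.
    return {"mode": mode, "passed": not blocked}
```

For example, `evaluate_gate("staging", True, "P1")` blocks, while the same findings under the `prod` profile only warn.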

Gate output fields

Field	Description
`has_high_recurrence`	True if any signature or kind is in "high" zone
`has_warn_recurrence`	True if any signature or kind is in "warn" zone
`max_severity_seen`	Most severe incident in the service window
`high_signatures`	List of first 5 high-recurrence signature prefixes
`high_kinds`	List of first 5 high-recurrence kinds
`total_incidents`	Total incidents in window
`skipped`	True if gate was bypassed (error or tool unavailable)

Input overrides

{
  "run_recurrence_watch": true,
  "recurrence_watch_mode": "off",       // override policy
  "recurrence_watch_windows_days": [7, 30],
  "recurrence_watch_service": "gateway" // default: service_name from release inputs
}

Backward compatibility

If run_recurrence_watch is absent from the inputs, it defaults to true. If recurrence_watch_mode is not set, the gate falls back to the mode of the active GatePolicy profile.
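The fallback behaviour can be sketched as a small resolver. The input keys are the ones listed under Input overrides; the `resolve_recurrence_watch` function and its output shape are hypothetical, and the default for `recurrence_watch_windows_days` is left out because the source does not state it.

```python
def resolve_recurrence_watch(inputs: dict, profile_mode: str) -> dict:
    """Apply the documented defaults to the release inputs."""
    return {
        # Missing flag means the gate runs (backward compatible default).
        "run": inputs.get("run_recurrence_watch", True),
        # An explicit mode overrides the GatePolicy profile setting.
        "mode": inputs.get("recurrence_watch_mode") or profile_mode,
        # Fall back to the service being released.
        "service": inputs.get("recurrence_watch_service") or inputs.get("service_name"),
    }
```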
