Incident Intelligence Layer
Deterministic, 0 LLM tokens. Pattern detection and weekly reporting built on top of the existing Incident Store and Alert State Machine.
Overview
The Incident Intelligence Layer adds three analytical capabilities to the incident management platform:
| Capability | Action | Description |
|---|---|---|
| Correlation | correlate | Find related incidents for a given incident ID using scored rule matching |
| Recurrence Detection | recurrence | Frequency tables for 7d/30d windows with threshold classification |
| Weekly Digest | weekly_digest | Full markdown + JSON report saved to ops/reports/incidents/weekly/ |
All three functions are deterministic and reentrant — running twice on the same data produces the same output.
Architecture
incident_intelligence_tool (tool_manager.py)
    │
    ├── correlate      → incident_intelligence.correlate_incident()
    ├── recurrence     → incident_intelligence.detect_recurrence()
    └── weekly_digest  → incident_intelligence.weekly_digest()
            │
        IncidentStore (INCIDENT_BACKEND=auto)
        incident_intel_utils.py (helpers)
        config/incident_intelligence_policy.yml
Policy: config/incident_intelligence_policy.yml
Correlation rules
Each rule defines a name, weight (score contribution), and match conditions:
| Rule name | Weight | Match conditions |
|---|---|---|
| same_signature | 100 | Exact SHA-256 signature match |
| same_service_and_kind | 60 | Same service and same kind |
| same_service_time_cluster | 40 | Same service, started within within_minutes |
| same_kind_cross_service | 30 | Same kind (cross-service), within within_minutes |
The final score is the sum of all matching rule weights. Only incidents scoring ≥ min_score (default: 20) appear in results.
Example: two incidents with the same signature that also share service+kind within 180 min → score = 100 + 60 + 40 + 30 = 230.
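The additive scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not the code in incident_intelligence.py: the dictionary layout, the helper names matched_rules and score, and the reading of same_kind_cross_service as not requiring different services (which is what the 230 example implies) are all assumptions.

```python
from datetime import datetime, timedelta

# Weights copied from the rule table; 180 minutes follows the example above.
WEIGHTS = {
    "same_signature": 100,
    "same_service_and_kind": 60,
    "same_service_time_cluster": 40,
    "same_kind_cross_service": 30,
}
WITHIN = timedelta(minutes=180)
MIN_SCORE = 20


def matched_rules(target: dict, candidate: dict) -> list:
    """Names of the rules that fire for this pair (hypothetical helper)."""
    close = abs(target["started_at"] - candidate["started_at"]) <= WITHIN
    same_service = target["service"] == candidate["service"]
    same_kind = target["kind"] == candidate["kind"]
    fired = {
        "same_signature": target["signature"] == candidate["signature"],
        "same_service_and_kind": same_service and same_kind,
        "same_service_time_cluster": same_service and close,
        # Assumption: the services are not required to differ for this rule.
        "same_kind_cross_service": same_kind and close,
    }
    return [name for name in WEIGHTS if fired[name]]


def score(target: dict, candidate: dict) -> int:
    """Sum the weights of every matching rule."""
    return sum(WEIGHTS[name] for name in matched_rules(target, candidate))


target = {"signature": "aabbccdd", "service": "gateway", "kind": "error_rate",
          "started_at": datetime(2026, 2, 18, 14, 30)}
nearby = dict(target, started_at=datetime(2026, 2, 18, 12, 0))
unrelated = dict(target, signature="11223344", service="router", kind="latency")

# Keep only candidates that reach the minimum score.
related = [c for c in (nearby, unrelated) if score(target, c) >= MIN_SCORE]
print(score(target, nearby))     # 230: all four rules match
print(score(target, unrelated))  # 0: nothing matches, so it is filtered out
```

Because the score is a plain sum over independent rule checks, the same inputs always yield the same score, which is what makes the result reproducible.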
Recurrence thresholds
recurrence:
  thresholds:
    signature:
      warn: 3    # ≥ 3 occurrences in window → warn
      high: 6    # ≥ 6 occurrences → high
    kind:
      warn: 5
      high: 10
High-recurrence items receive deterministic recommendations from recurrence.recommendations templates (using Python .format() substitution with {sig}, {kind}, etc.).
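A rough sketch of how the thresholds and templates could be applied is shown here. The function names, the "ok" label for counts below warn, and the template text are hypothetical; only the threshold values and the use of .format() with {sig} come from the policy described above.

```python
# Threshold values copied from the policy snippet above.
THRESHOLDS = {
    "signature": {"warn": 3, "high": 6},
    "kind": {"warn": 5, "high": 10},
}

# Hypothetical template; the real ones live under recurrence.recommendations.
SIGNATURE_TEMPLATE = "Signature {sig} recurred {count} times: prioritize a permanent fix"


def classify(dimension: str, count: int) -> str:
    """Map an occurrence count to 'high', 'warn', or 'ok' (the last label is assumed)."""
    limits = THRESHOLDS[dimension]
    if count >= limits["high"]:
        return "high"
    if count >= limits["warn"]:
        return "warn"
    return "ok"


def recommend(sig: str, count: int) -> str:
    """Fill the template with str.format(), as the policy describes."""
    return SIGNATURE_TEMPLATE.format(sig=sig, count=count)


print(classify("signature", 6))  # high
print(classify("kind", 6))       # warn
print(recommend("aabbccdd", 6))
```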
Tool Usage
correlate
{
  "tool": "incident_intelligence_tool",
  "action": "correlate",
  "incident_id": "inc_20260218_1430_abc123",
  "append_note": true
}
Response:
{
  "incident_id": "inc_20260218_1430_abc123",
  "related_count": 3,
  "related": [
    {
      "incident_id": "inc_20260215_0900_def456",
      "score": 230,
      "reasons": ["same_signature", "same_service_and_kind", "same_service_time_cluster", "same_kind_cross_service"],
      "service": "gateway",
      "kind": "error_rate",
      "severity": "P1",
      "status": "closed",
      "started_at": "2026-02-15T09:00:00"
    }
  ]
}
When append_note=true, a timeline event of type note is appended to the target incident listing the top-5 related incidents.
recurrence
{
  "tool": "incident_intelligence_tool",
  "action": "recurrence",
  "window_days": 7
}
Response includes top_signatures, top_kinds, top_services, high_recurrence, and warn_recurrence tables.
weekly_digest
{
  "tool": "incident_intelligence_tool",
  "action": "weekly_digest",
  "save_artifacts": true
}
Response:
{
  "week": "2026-W08",
  "artifact_paths": [
    "ops/reports/incidents/weekly/2026-W08.json",
    "ops/reports/incidents/weekly/2026-W08.md"
  ],
  "markdown_preview": "# Weekly Incident Digest — 2026-W08\n...",
  "json_summary": {
    "week": "2026-W08",
    "open_incidents_count": 2,
    "recent_7d_count": 12,
    "recommendations": [...]
  }
}
RBAC
| Action | Required entitlement | Roles |
|---|---|---|
| correlate | tools.oncall.read | agent_cto, agent_oncall |
| recurrence | tools.oncall.read | agent_cto, agent_oncall |
| weekly_digest | tools.oncall.incident_write | agent_cto, agent_oncall |
Monitor (agent_monitor) has no access to incident_intelligence_tool.
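The entitlement check implied by the RBAC table can be illustrated as follows. This is only a sketch: the real matrix lives in the RBAC configuration, the function name is_allowed is hypothetical, and the role grants listed here cover just the entitlements named in the table (agent_monitor is modelled with none of them, which is an assumption about how "no access" is expressed).

```python
# Action → required entitlement, copied from the table above.
ACTION_ENTITLEMENT = {
    "correlate": "tools.oncall.read",
    "recurrence": "tools.oncall.read",
    "weekly_digest": "tools.oncall.incident_write",
}

# Assumed grants, limited to the entitlements this tool needs.
ROLE_ENTITLEMENTS = {
    "agent_cto": {"tools.oncall.read", "tools.oncall.incident_write"},
    "agent_oncall": {"tools.oncall.read", "tools.oncall.incident_write"},
    "agent_monitor": set(),
}


def is_allowed(role: str, action: str) -> bool:
    """True if the role holds the entitlement the action requires."""
    required = ACTION_ENTITLEMENT.get(action)
    if required is None:
        return False  # actions missing from the table are denied in this sketch
    return required in ROLE_ENTITLEMENTS.get(role, set())


print(is_allowed("agent_oncall", "weekly_digest"))  # True
print(is_allowed("agent_monitor", "correlate"))     # False
```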
Rate limits
| Action | Timeout | RPM |
|---|---|---|
| correlate | 10s | 10 |
| recurrence | 15s | 5 |
| weekly_digest | 20s | 3 |
Scheduled Job
Task ID: weekly_incident_digest
Schedule: Every Monday 08:00 UTC
Cron: 0 8 * * 1
# NODE1 — add to ops user crontab
0 8 * * 1 /usr/local/bin/job_runner.sh weekly_incident_digest '{}'
Artifacts are written to ops/reports/incidents/weekly/YYYY-Www.json and YYYY-Www.md, where Www is the week number prefixed with W (for example, 2026-W08.json).
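The week label in the artifact names looks like an ISO 8601 week, so a sketch of the naming could use date.isocalendar(). Both that interpretation and the helper names week_label and artifact_paths are assumptions; the source does not say whether the Monday run labels the current week or the one that just ended.

```python
from datetime import date


def week_label(day: date) -> str:
    """Format a date as an ISO week label such as 2026-W08."""
    iso_year, iso_week, _weekday = day.isocalendar()
    return f"{iso_year}-W{iso_week:02d}"


def artifact_paths(day: date, base: str = "ops/reports/incidents/weekly") -> list:
    """Return the JSON and markdown paths for the week containing `day`."""
    label = week_label(day)
    return [f"{base}/{label}.json", f"{base}/{label}.md"]


print(week_label(date(2026, 2, 18)))      # 2026-W08
print(artifact_paths(date(2026, 2, 18)))
```

Using the ISO year returned by isocalendar() rather than day.year keeps labels correct around New Year, when the first days of January can belong to the last ISO week of the previous year.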
How scoring works
Score(target, candidate) = Σ weight(rule) for each rule that matches
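Rules are evaluated in order. The "same_signature" rule checks the signature and nothing else:
- If signatures match → score += 100; no other conditions are checked for this rule.
- If signatures do not match → the rule is skipped (score += 0).
The remaining rules are still evaluated in either case, so their weights add to the total.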
All other rules use combined conditions (AND logic):
- All conditions in match{} must be satisfied for the rule to fire.
Two incidents with identical signatures will always score ≥ 100. Two incidents sharing service + kind score ≥ 60. Time proximity (within 180 min, same service) scores ≥ 40.
Tuning guide
| Goal | Change |
|---|---|
| Reduce false positives in correlation | Increase min_score (e.g., 40) |
| More aggressive recurrence warnings | Lower thresholds.signature.warn |
| Shorter lookback for correlation | Decrease correlation.lookback_days |
| Disable kind-based cross-service matching | Remove same_kind_cross_service rule |
| Longer digest | Increase digest.markdown_max_chars |
Files
| File | Purpose |
|---|---|
| services/router/incident_intelligence.py | Core engine: correlate / recurrence / weekly_digest |
| services/router/incident_intel_utils.py | Helpers: kind extraction, time math, truncation |
| config/incident_intelligence_policy.yml | All tuneable policy parameters |
| tests/test_incident_correlation.py | Correlation unit tests |
| tests/test_incident_recurrence.py | Recurrence detection tests |
| tests/test_weekly_digest.py | Weekly digest tests (incl. artifact write) |
Root-Cause Buckets
Overview
build_root_cause_buckets clusters incidents into actionable groups. The bucket key is either service|kind (default) or a signature prefix.
Filtering: only buckets meeting min_count thresholds appear:
- count_7d ≥ buckets.min_count[7] (default: 3), OR
- count_30d ≥ buckets.min_count[30] (default: 6)
Sorting: count_7d desc → count_30d desc → last_seen desc.
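The filter and sort order above can be expressed compactly. This sketch assumes each bucket is a flat dictionary with count_7d, count_30d, and an ISO 8601 last_seen string; the actual structure returned by build_root_cause_buckets nests the counts, and select_buckets is a hypothetical name.

```python
# Defaults for buckets.min_count, as stated above.
MIN_COUNT = {7: 3, 30: 6}


def select_buckets(buckets: list) -> list:
    """Drop buckets below both thresholds, then sort by 7d, 30d, last_seen (all descending)."""
    kept = [
        b for b in buckets
        if b["count_7d"] >= MIN_COUNT[7] or b["count_30d"] >= MIN_COUNT[30]
    ]
    # ISO 8601 timestamps in the same format sort correctly as plain strings.
    return sorted(
        kept,
        key=lambda b: (b["count_7d"], b["count_30d"], b["last_seen"]),
        reverse=True,
    )


buckets = [
    {"bucket_key": "router|latency", "count_7d": 1, "count_30d": 7,
     "last_seen": "2026-02-20T08:00:00"},
    {"bucket_key": "voice|oom", "count_7d": 2, "count_30d": 4,
     "last_seen": "2026-02-21T10:00:00"},
    {"bucket_key": "gateway|error_rate", "count_7d": 5, "count_30d": 12,
     "last_seen": "2026-02-22T14:30:00"},
]

print([b["bucket_key"] for b in select_buckets(buckets)])
# ['gateway|error_rate', 'router|latency']
```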
Tool usage
{
  "tool": "incident_intelligence_tool",
  "action": "buckets",
  "service": "gateway",
  "window_days": 30
}
Response:
{
  "service_filter": "gateway",
  "window_days": 30,
  "bucket_count": 2,
  "buckets": [
    {
      "bucket_key": "gateway|error_rate",
      "counts": {"7d": 5, "30d": 12, "open": 2},
      "last_seen": "2026-02-22T14:30:00",
      "services": ["gateway"],
      "kinds": ["error_rate"],
      "top_signatures": [{"signature": "aabbccdd", "count": 4}],
      "severity_mix": {"P0": 0, "P1": 2, "P2": 3},
      "sample_incidents": [...],
      "recommendations": [
        "Add regression test for API contract & error mapping",
        "Add/adjust SLO thresholds & alert routing"
      ]
    }
  ]
}
Deterministic recommendations by kind
| Kind | Recommendations |
|---|---|
| error_rate, slo_breach | Add regression test; review deploys; adjust SLO thresholds |
| latency | Check p95 vs saturation; investigate DB/queue contention |
| oom, crashloop | Memory profiling; container limits; fix leaks |
| disk | Retention/cleanup automation; verify volumes |
| security | Dependency scanner + rotate secrets; verify allowlists |
| queue | Consumer lag + dead-letter queue |
| network | DNS audit; network policies |
| (any open incidents) | ⚠ Do not deploy risky changes until mitigated |
Auto Follow-ups (policy-driven)
When weekly_digest runs with autofollowups.enabled=true, it automatically appends a followup event to the most recent open incident in each high-recurrence bucket.
Deduplication
Follow-up key: {dedupe_key_prefix}:{YYYY-Www}:{bucket_key}, for example intel_recur:2026-W08:gateway|error_rate.
One follow-up is created per bucket per week. A second call in the same week for the same bucket is skipped with reason: already_exists.
When the week label changes, a new follow-up is created.
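The deduplication rule amounts to a set-membership check on the follow-up key. The sketch below keeps seen keys in an in-memory set purely for illustration; how the real implementation looks up existing follow-ups is not described here, and the function names and the "created" field are hypothetical.

```python
def followup_key(prefix: str, week: str, bucket_key: str) -> str:
    """Build the dedupe key in the prefix:week:bucket form described above."""
    return f"{prefix}:{week}:{bucket_key}"


def maybe_create_followup(seen_keys: set, prefix: str, week: str, bucket_key: str) -> dict:
    """Create a follow-up unless one with the same key already exists."""
    key = followup_key(prefix, week, bucket_key)
    if key in seen_keys:
        return {"created": False, "reason": "already_exists", "dedupe_key": key}
    seen_keys.add(key)
    return {"created": True, "dedupe_key": key}


seen = set()
first = maybe_create_followup(seen, "intel_recur", "2026-W08", "gateway|error_rate")
second = maybe_create_followup(seen, "intel_recur", "2026-W08", "gateway|error_rate")
next_week = maybe_create_followup(seen, "intel_recur", "2026-W09", "gateway|error_rate")

print(first["created"], second["reason"], next_week["created"])
# True already_exists True
```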
Policy knobs
autofollowups:
  enabled: true
  only_when_high: true    # only high-recurrence buckets trigger follow-ups
  owner: "oncall"
  priority: "P1"
  due_days: 7
  dedupe_key_prefix: "intel_recur"
Follow-up event structure
{
  "type": "followup",
  "message": "[intel] Recurrence high: gateway|error_rate (7d=5, 30d=12, kinds=error_rate)",
  "meta": {
    "title": "[intel] Recurrence high: gateway|error_rate",
    "owner": "oncall",
    "priority": "P1",
    "due_date": "2026-03-02",
    "dedupe_key": "intel_recur:2026-W08:gateway|error_rate",
    "auto_created": true,
    "bucket_key": "gateway|error_rate",
    "count_7d": 5
  }
}
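To show how the policy knobs map onto the event fields, here is a sketch that assembles an event of the same shape. The function build_followup, the flat bucket dictionary, and the choice of 2026-02-23 as the run date (picked so that due_days=7 reproduces the due_date in the example) are assumptions, not details taken from the implementation.

```python
from datetime import date, timedelta

# Values copied from the autofollowups policy block.
POLICY = {
    "owner": "oncall",
    "priority": "P1",
    "due_days": 7,
    "dedupe_key_prefix": "intel_recur",
}


def build_followup(bucket: dict, week: str, today: date, policy: dict = POLICY) -> dict:
    """Assemble a followup event for one high-recurrence bucket."""
    title = f"[intel] Recurrence high: {bucket['bucket_key']}"
    kinds = ",".join(bucket["kinds"])
    due = today + timedelta(days=policy["due_days"])
    return {
        "type": "followup",
        "message": f"{title} (7d={bucket['count_7d']}, 30d={bucket['count_30d']}, kinds={kinds})",
        "meta": {
            "title": title,
            "owner": policy["owner"],
            "priority": policy["priority"],
            "due_date": due.isoformat(),
            "dedupe_key": f"{policy['dedupe_key_prefix']}:{week}:{bucket['bucket_key']}",
            "auto_created": True,
            "bucket_key": bucket["bucket_key"],
            "count_7d": bucket["count_7d"],
        },
    }


bucket = {"bucket_key": "gateway|error_rate", "count_7d": 5, "count_30d": 12,
          "kinds": ["error_rate"]}
event = build_followup(bucket, "2026-W08", today=date(2026, 2, 23))
print(event["meta"]["due_date"])  # 2026-03-02
```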
recurrence_watch Release Gate
Purpose
Warns (or blocks in staging) when the service being deployed has a high incident recurrence pattern — catching "we're deploying into a known-bad state."
GatePolicy profiles
| Profile | Mode | Blocks on |
|---|---|---|
| dev | warn | Never blocks |
| staging | strict | High recurrence + P0/P1 severity |
| prod | warn | Never blocks (accumulate data first) |
Strict mode logic
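if mode == "strict":
    # Block only when recurrence is high and the worst severity seen is listed in fail_on.severity_in
    if gate.has_high_recurrence and gate.max_severity_seen in fail_on.severity_in:
        passed = False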
fail_on.severity_in defaults to ["P0", "P1"]. When the most severe incident in the window is P2 or P3, high recurrence alone does not block.
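Combining the profile table with the strict-mode rule gives a small decision function. This is a self-contained sketch: GateResult carries only the two fields the rule needs, and gate_passes is a hypothetical name rather than the gate's actual entry point.

```python
from dataclasses import dataclass

# Profile → mode, copied from the GatePolicy table.
PROFILE_MODES = {"dev": "warn", "staging": "strict", "prod": "warn"}
# Default value of fail_on.severity_in.
FAIL_ON_SEVERITIES = ("P0", "P1")


@dataclass
class GateResult:
    has_high_recurrence: bool
    max_severity_seen: str


def gate_passes(profile: str, gate: GateResult) -> bool:
    """Return False only in strict mode with high recurrence and a blocking severity."""
    mode = PROFILE_MODES[profile]
    if mode == "strict":
        blocking = gate.max_severity_seen in FAIL_ON_SEVERITIES
        return not (gate.has_high_recurrence and blocking)
    return True  # warn mode never blocks


print(gate_passes("staging", GateResult(True, "P1")))  # False
print(gate_passes("staging", GateResult(True, "P2")))  # True
print(gate_passes("prod", GateResult(True, "P0")))     # True
```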
Gate output fields
| Field | Description |
|---|---|
| has_high_recurrence | True if any signature or kind is in "high" zone |
| has_warn_recurrence | True if any signature or kind is in "warn" zone |
| max_severity_seen | Most severe incident in the service window |
| high_signatures | List of first 5 high-recurrence signature prefixes |
| high_kinds | List of first 5 high-recurrence kinds |
| total_incidents | Total incidents in window |
| skipped | True if gate was bypassed (error or tool unavailable) |
Input overrides
{
  "run_recurrence_watch": true,
  "recurrence_watch_mode": "off",            // override policy
  "recurrence_watch_windows_days": [7, 30],
  "recurrence_watch_service": "gateway"      // default: service_name from release inputs
}
Backward compatibility
If run_recurrence_watch is absent from the inputs, it defaults to true. If recurrence_watch_mode is not set, the gate falls back to the mode of the active GatePolicy profile.
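The fallback behaviour can be sketched with dict.get defaults. The function resolve_recurrence_watch and the shape of its return value are hypothetical; the input key names and the defaults come from the text above.

```python
def resolve_recurrence_watch(inputs: dict, profile_mode: str) -> dict:
    """Apply input overrides, falling back to the defaults described above."""
    return {
        # Missing flag means the gate runs, which keeps older callers working.
        "run": inputs.get("run_recurrence_watch", True),
        # Missing mode means the GatePolicy profile decides.
        "mode": inputs.get("recurrence_watch_mode", profile_mode),
        # Missing service means the release's own service_name is used.
        "service": inputs.get("recurrence_watch_service", inputs.get("service_name")),
    }


legacy = resolve_recurrence_watch({"service_name": "gateway"}, profile_mode="strict")
overridden = resolve_recurrence_watch(
    {"service_name": "gateway", "recurrence_watch_mode": "off"},
    profile_mode="strict",
)
print(legacy)      # {'run': True, 'mode': 'strict', 'service': 'gateway'}
print(overridden)  # {'run': True, 'mode': 'off', 'service': 'gateway'}
```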