# Incident Intelligence Layer

> **Deterministic, 0 LLM tokens.** Pattern detection and weekly reporting built on top of the existing Incident Store and Alert State Machine.

---

## Overview

The Incident Intelligence Layer adds three analytical capabilities to the incident management platform:

| Capability | Action | Description |
|---|---|---|
| **Correlation** | `correlate` | Find related incidents for a given incident ID using scored rule matching |
| **Recurrence Detection** | `recurrence` | Frequency tables for 7d/30d windows with threshold classification |
| **Weekly Digest** | `weekly_digest` | Full markdown + JSON report saved to `ops/reports/incidents/weekly/` |

All three functions are deterministic and reentrant: running twice on the same data produces the same output.

---

## Architecture

```
incident_intelligence_tool (tool_manager.py)
    │
    ├── correlate     → incident_intelligence.correlate_incident()
    ├── recurrence    → incident_intelligence.detect_recurrence()
    └── weekly_digest → incident_intelligence.weekly_digest()
                            │
                        IncidentStore (INCIDENT_BACKEND=auto)
                        incident_intel_utils.py (helpers)
                        config/incident_intelligence_policy.yml
```

---

## Policy: `config/incident_intelligence_policy.yml`

### Correlation rules

Each rule defines a `name`, `weight` (score contribution), and `match` conditions:

| Rule name | Weight | Match conditions |
|---|---|---|
| `same_signature` | 100 | Exact SHA-256 signature match |
| `same_service_and_kind` | 60 | Same service **and** same kind |
| `same_service_time_cluster` | 40 | Same service, started within `within_minutes` |
| `same_kind_cross_service` | 30 | Same kind (cross-service), within `within_minutes` |

The final score is the sum of all matching rule weights. Only incidents scoring ≥ `min_score` (default: 20) appear in results.

**Example:** two incidents with the same signature, service, and kind that started within 180 min of each other match all four rules, so score = 100 + 60 + 40 + 30 = 230.

### Recurrence thresholds

```yaml
recurrence:
  thresholds:
    signature:
      warn: 3   # ≥ 3 occurrences in window → warn
      high: 6   # ≥ 6 occurrences → high
    kind:
      warn: 5
      high: 10
```

High-recurrence items receive deterministic recommendations from `recurrence.recommendations` templates (using Python `.format()` substitution with `{sig}`, `{kind}`, etc.).

---

## Tool Usage

### `correlate`

```json
{
  "tool": "incident_intelligence_tool",
  "action": "correlate",
  "incident_id": "inc_20260218_1430_abc123",
  "append_note": true
}
```

Response:

```json
{
  "incident_id": "inc_20260218_1430_abc123",
  "related_count": 3,
  "related": [
    {
      "incident_id": "inc_20260215_0900_def456",
      "score": 230,
      "reasons": ["same_signature", "same_service_and_kind", "same_service_time_cluster"],
      "service": "gateway",
      "kind": "error_rate",
      "severity": "P1",
      "status": "closed",
      "started_at": "2026-02-15T09:00:00"
    }
  ]
}
```

When `append_note=true`, a timeline event of type `note` is appended to the target incident, listing the top 5 related incidents.

### `recurrence`

```json
{
  "tool": "incident_intelligence_tool",
  "action": "recurrence",
  "window_days": 7
}
```

The response includes `top_signatures`, `top_kinds`, `top_services`, `high_recurrence`, and `warn_recurrence` tables.

### `weekly_digest`

```json
{
  "tool": "incident_intelligence_tool",
  "action": "weekly_digest",
  "save_artifacts": true
}
```

Response:

```json
{
  "week": "2026-W08",
  "artifact_paths": [
    "ops/reports/incidents/weekly/2026-W08.json",
    "ops/reports/incidents/weekly/2026-W08.md"
  ],
  "markdown_preview": "# Weekly Incident Digest — 2026-W08\n...",
  "json_summary": {
    "week": "2026-W08",
    "open_incidents_count": 2,
    "recent_7d_count": 12,
    "recommendations": [...]
  }
}
```

---

## RBAC

| Action | Required entitlement | Roles |
|---|---|---|
| `correlate` | `tools.oncall.read` | `agent_cto`, `agent_oncall` |
| `recurrence` | `tools.oncall.read` | `agent_cto`, `agent_oncall` |
| `weekly_digest` | `tools.oncall.incident_write` | `agent_cto`, `agent_oncall` |

Monitor (`agent_monitor`) has no access to `incident_intelligence_tool`.

---

## Rate limits

| Action | Timeout | RPM |
|---|---|---|
| `correlate` | 10s | 10 |
| `recurrence` | 15s | 5 |
| `weekly_digest` | 20s | 3 |

---

## Scheduled Job

- Task ID: `weekly_incident_digest`
- Schedule: **Every Monday 08:00 UTC**
- Cron: `0 8 * * 1`

```bash
# NODE1: add to ops user crontab
0 8 * * 1 /usr/local/bin/job_runner.sh weekly_incident_digest '{}'
```

Artifacts are written to `ops/reports/incidents/weekly/YYYY-WW.json` and `YYYY-WW.md` (for example, `2026-W08.json` and `2026-W08.md`).

---

## How scoring works

```
Score(target, candidate) = Σ weight(rule) for each rule that matches

Rules are evaluated in order. The "same_signature" rule is exclusive:
  - If signatures match → score += 100, skip other conditions for this rule.
  - If signatures do not match → skip rule entirely (score += 0).

All other rules use combined conditions (AND logic):
  - All conditions in match{} must be satisfied for the rule to fire.
```

Two incidents with **identical signatures** will always score ≥ 100. Two incidents sharing service + kind score ≥ 60. Time proximity (within 180 min, same service) scores ≥ 40.

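The rule evaluation described above can be sketched in a few lines of Python. This is a minimal illustration, not the code in `incident_intelligence.py`: the `Incident` dataclass, the `RULES` list, and `score_pair` are hypothetical names, `within_minutes` is assumed to be 180, and `same_kind_cross_service` is assumed to check only kind and time proximity (which is what the 230-point example implies).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumption: within_minutes is 180, as in the worked example.
WITHIN = timedelta(minutes=180)


@dataclass(frozen=True)
class Incident:
    """Hypothetical minimal incident record used only for this sketch."""
    service: str
    kind: str
    signature: str
    started_at: datetime


def close_in_time(a: Incident, b: Incident) -> bool:
    """True when both incidents started within WITHIN of each other."""
    return abs(a.started_at - b.started_at) <= WITHIN


# (rule name, weight, predicate); every condition in a predicate must hold (AND logic).
RULES = [
    ("same_signature", 100, lambda a, b: a.signature == b.signature),
    ("same_service_and_kind", 60,
     lambda a, b: a.service == b.service and a.kind == b.kind),
    ("same_service_time_cluster", 40,
     lambda a, b: a.service == b.service and close_in_time(a, b)),
    # Assumption: this rule does not require the services to differ.
    ("same_kind_cross_service", 30,
     lambda a, b: a.kind == b.kind and close_in_time(a, b)),
]


def score_pair(target: Incident, candidate: Incident):
    """Sum the weights of all matching rules and record which rules fired."""
    total, reasons = 0, []
    for name, weight, matches in RULES:
        if matches(target, candidate):
            total += weight
            reasons.append(name)
    return total, reasons


def is_related(total: int, min_score: int = 20) -> bool:
    """Only candidates at or above min_score appear in results."""
    return total >= min_score
```

Because the rules are plain data evaluated in a fixed order, the same pair of incidents always yields the same score and the same `reasons` list, which is what makes the output deterministic.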
---

## Tuning guide

| Goal | Change |
|---|---|
| Reduce false positives in correlation | Increase `min_score` (e.g., 40) |
| More aggressive recurrence warnings | Lower `thresholds.signature.warn` |
| Shorter lookback for correlation | Decrease `correlation.lookback_days` |
| Disable kind-based cross-service matching | Remove `same_kind_cross_service` rule |
| Longer digest | Increase `digest.markdown_max_chars` |

---

## Files

| File | Purpose |
|---|---|
| `services/router/incident_intelligence.py` | Core engine: correlate / recurrence / weekly_digest |
| `services/router/incident_intel_utils.py` | Helpers: kind extraction, time math, truncation |
| `config/incident_intelligence_policy.yml` | All tuneable policy parameters |
| `tests/test_incident_correlation.py` | Correlation unit tests |
| `tests/test_incident_recurrence.py` | Recurrence detection tests |
| `tests/test_weekly_digest.py` | Weekly digest tests (incl. artifact write) |

---

## Root-Cause Buckets

### Overview

`build_root_cause_buckets` clusters incidents into actionable groups. The bucket key is either `service|kind` (default) or a signature prefix.

**Filtering**: only buckets meeting `min_count` thresholds appear:

- `count_7d ≥ buckets.min_count[7]` (default: 3) **OR**
- `count_30d ≥ buckets.min_count[30]` (default: 6)

**Sorting**: `count_7d desc → count_30d desc → last_seen desc`.

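The filtering and sorting rules above could look roughly like this in Python. It is a sketch under assumptions: `select_buckets` is a hypothetical helper (not `build_root_cause_buckets` itself), buckets are plain dicts with flat `count_7d` / `count_30d` fields, and `last_seen` is an ISO-8601 string in a single consistent format so that string comparison orders it chronologically.

```python
# Assumed defaults, mirroring buckets.min_count in the policy file.
MIN_COUNT = {7: 3, 30: 6}


def select_buckets(buckets, min_count=MIN_COUNT):
    """Keep buckets that meet either threshold, then sort them for display."""
    kept = [
        b for b in buckets
        if b["count_7d"] >= min_count[7] or b["count_30d"] >= min_count[30]
    ]
    # All three sort keys are descending, so one reversed tuple sort is enough.
    return sorted(
        kept,
        key=lambda b: (b["count_7d"], b["count_30d"], b["last_seen"]),
        reverse=True,
    )
```

A bucket with only one incident in the last 7 days still appears if it has at least six in the last 30, because the two thresholds are combined with OR.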
### Tool usage

```json
{
  "tool": "incident_intelligence_tool",
  "action": "buckets",
  "service": "gateway",
  "window_days": 30
}
```

Response:

```json
{
  "service_filter": "gateway",
  "window_days": 30,
  "bucket_count": 2,
  "buckets": [
    {
      "bucket_key": "gateway|error_rate",
      "counts": {"7d": 5, "30d": 12, "open": 2},
      "last_seen": "2026-02-22T14:30:00",
      "services": ["gateway"],
      "kinds": ["error_rate"],
      "top_signatures": [{"signature": "aabbccdd", "count": 4}],
      "severity_mix": {"P0": 0, "P1": 2, "P2": 3},
      "sample_incidents": [...],
      "recommendations": [
        "Add regression test for API contract & error mapping",
        "Add/adjust SLO thresholds & alert routing"
      ]
    }
  ]
}
```

### Deterministic recommendations by kind

| Kind | Recommendations |
|---|---|
| `error_rate`, `slo_breach` | Add regression test; review deploys; adjust SLO thresholds |
| `latency` | Check p95 vs saturation; investigate DB/queue contention |
| `oom`, `crashloop` | Memory profiling; container limits; fix leaks |
| `disk` | Retention/cleanup automation; verify volumes |
| `security` | Dependency scanner + rotate secrets; verify allowlists |
| `queue` | Consumer lag + dead-letter queue |
| `network` | DNS audit; network policies |
| *(any open incidents)* | ⚠ Do not deploy risky changes until mitigated |

---

## Auto Follow-ups (policy-driven)

When `weekly_digest` runs with `autofollowups.enabled=true`, it automatically appends a `followup` event to the **most recent open incident** in each high-recurrence bucket.

### Deduplication

Follow-up key: `{dedupe_key_prefix}:{YYYY-WW}:{bucket_key}`

One follow-up per bucket per week. A second call in the same week with the same bucket → skipped with `reason: already_exists`. A new week (`YYYY-WW` changes) → new follow-up is created.

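The weekly deduplication above can be illustrated with a small sketch. The helper names (`followup_key`, `maybe_create_followup`), the in-memory `existing_keys` set, and the `created` field are hypothetical; the sketch also assumes the week label follows ISO week numbering, which matches the `2026-W08` form used in this document but is not confirmed by it.

```python
from datetime import date


def followup_key(prefix: str, bucket_key: str, today: date) -> str:
    """Build '{prefix}:{YYYY-Www}:{bucket_key}' from the ISO year and week."""
    year, week, _ = today.isocalendar()
    return f"{prefix}:{year}-W{week:02d}:{bucket_key}"


def maybe_create_followup(existing_keys: set, prefix: str,
                          bucket_key: str, today: date) -> dict:
    """Create at most one follow-up per bucket per week."""
    key = followup_key(prefix, bucket_key, today)
    if key in existing_keys:
        # Same bucket, same week: skip instead of creating a duplicate.
        return {"created": False, "reason": "already_exists", "dedupe_key": key}
    existing_keys.add(key)
    return {"created": True, "dedupe_key": key}
```

Encoding the week in the key means no separate expiry logic is needed: once the week label changes, the key no longer matches and a new follow-up is allowed.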
### Policy knobs

```yaml
autofollowups:
  enabled: true
  only_when_high: true   # only high-recurrence buckets trigger follow-ups
  owner: "oncall"
  priority: "P1"
  due_days: 7
  dedupe_key_prefix: "intel_recur"
```

### Follow-up event structure

```json
{
  "type": "followup",
  "message": "[intel] Recurrence high: gateway|error_rate (7d=5, 30d=12, kinds=error_rate)",
  "meta": {
    "title": "[intel] Recurrence high: gateway|error_rate",
    "owner": "oncall",
    "priority": "P1",
    "due_date": "2026-03-02",
    "dedupe_key": "intel_recur:2026-W08:gateway|error_rate",
    "auto_created": true,
    "bucket_key": "gateway|error_rate",
    "count_7d": 5
  }
}
```

---

## `recurrence_watch` Release Gate

### Purpose

Warns (or blocks in staging) when the service being deployed has a high incident recurrence pattern, catching the case of "we're deploying into a known-bad state."

### GatePolicy profiles

| Profile | Mode | Blocks on |
|---|---|---|
| `dev` | `warn` | Never blocks |
| `staging` | `strict` | High recurrence + P0/P1 severity |
| `prod` | `warn` | Never blocks (accumulate data first) |

### Strict mode logic

```
if mode == "strict":
    if gate.has_high_recurrence AND gate.max_severity_seen in fail_on.severity_in:
        pass = False
```

`fail_on.severity_in` defaults to `["P0", "P1"]`. A high-recurrence bucket that contains only P2/P3 incidents does **not** block.

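As a runnable counterpart to the strict-mode pseudocode above, here is a minimal sketch. `gate_passes` is a hypothetical function name and the flat argument list is an assumption; the real gate presumably reads these values from the recurrence result and the GatePolicy profile.

```python
def gate_passes(mode: str,
                has_high_recurrence: bool,
                max_severity_seen: str,
                fail_on_severity_in=("P0", "P1")) -> bool:
    """Return False only when strict mode sees high recurrence at a blocking severity."""
    if mode != "strict":
        # Non-strict modes (e.g. "warn") report findings but never block.
        return True
    blocked = has_high_recurrence and max_severity_seen in fail_on_severity_in
    return not blocked
```

With the default severities, `gate_passes("strict", True, "P1")` is `False`, while `gate_passes("strict", True, "P2")` and `gate_passes("warn", True, "P0")` are both `True`.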
### Gate output fields

| Field | Description |
|---|---|
| `has_high_recurrence` | True if any signature or kind is in "high" zone |
| `has_warn_recurrence` | True if any signature or kind is in "warn" zone |
| `max_severity_seen` | Most severe incident in the service window |
| `high_signatures` | List of first 5 high-recurrence signature prefixes |
| `high_kinds` | List of first 5 high-recurrence kinds |
| `total_incidents` | Total incidents in window |
| `skipped` | True if gate was bypassed (error or tool unavailable) |

### Input overrides

```json
{
  "run_recurrence_watch": true,
  "recurrence_watch_mode": "off",          // override policy
  "recurrence_watch_windows_days": [7, 30],
  "recurrence_watch_service": "gateway"    // default: service_name from release inputs
}
```

### Backward compatibility

If `run_recurrence_watch` is absent from the inputs, it defaults to `true`. If `recurrence_watch_mode` is not set, the gate falls back to the GatePolicy profile setting.
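
The override and fallback behaviour can be summarised with a short sketch. `resolve_recurrence_watch` and `PROFILE_MODES` are hypothetical names, and the profile table simply restates the GatePolicy profiles listed earlier; how the real release gate loads its profile is not shown in this document.

```python
# Assumed copy of the GatePolicy profile modes from the table in this document.
PROFILE_MODES = {"dev": "warn", "staging": "strict", "prod": "warn"}


def resolve_recurrence_watch(inputs: dict, profile: str) -> dict:
    """Apply input overrides on top of the profile defaults."""
    return {
        # Absent flag means the gate runs (backward-compatible default).
        "run": inputs.get("run_recurrence_watch", True),
        # Explicit mode wins; otherwise use the profile's mode.
        "mode": inputs.get("recurrence_watch_mode", PROFILE_MODES[profile]),
        # Service defaults to service_name from the release inputs.
        "service": inputs.get("recurrence_watch_service",
                              inputs.get("service_name")),
    }
```

For example, `resolve_recurrence_watch({"service_name": "gateway"}, "staging")` yields a gate that runs in `strict` mode against `gateway`, while passing `"recurrence_watch_mode": "off"` overrides the profile.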