# Incident Intelligence Layer

> **Deterministic, 0 LLM tokens.** Pattern detection and weekly reporting built on top of the existing Incident Store and Alert State Machine.

---

## Overview

The Incident Intelligence Layer adds three analytical capabilities to the incident management platform:

| Capability | Action | Description |
|---|---|---|
| **Correlation** | `correlate` | Find related incidents for a given incident ID using scored rule matching |
| **Recurrence Detection** | `recurrence` | Frequency tables for 7d/30d windows with threshold classification |
| **Weekly Digest** | `weekly_digest` | Full markdown + JSON report saved to `ops/reports/incidents/weekly/` |

All three functions are deterministic and safe to re-run: running twice on the same data produces the same output.

---

## Architecture

```
incident_intelligence_tool (tool_manager.py)
    │
    ├── correlate      → incident_intelligence.correlate_incident()
    ├── recurrence     → incident_intelligence.detect_recurrence()
    └── weekly_digest  → incident_intelligence.weekly_digest()
                               │
                        IncidentStore (INCIDENT_BACKEND=auto)
                        incident_intel_utils.py (helpers)
                        config/incident_intelligence_policy.yml
```

---

## Policy: `config/incident_intelligence_policy.yml`

### Correlation rules

Each rule defines a `name`, a `weight` (its score contribution), and a set of `match` conditions:

| Rule name | Weight | Match conditions |
|---|---|---|
| `same_signature` | 100 | Exact SHA-256 signature match |
| `same_service_and_kind` | 60 | Same service **and** same kind |
| `same_service_time_cluster` | 40 | Same service, started within `within_minutes` |
| `same_kind_cross_service` | 30 | Same kind (cross-service), within `within_minutes` |

The final score is the sum of all matching rule weights. Only incidents scoring ≥ `min_score` (default: 20) appear in results.

**Example:** two incidents with the same signature that also share service and kind and started within 180 min of each other match all four rules, so the score is 100 + 60 + 40 + 30 = 230.

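The additive scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not the code in `incident_intelligence.py`: the incident dict fields, the `RULES` list, the helper names, and the value 180 for `within_minutes` are assumptions, and `same_kind_cross_service` is treated as matching same-service pairs too, as the worked example implies.

```python
from datetime import datetime

WITHIN_MINUTES = 180  # assumed value of `within_minutes`
MIN_SCORE = 20        # documented default for `min_score`


def minutes_apart(a, b):
    """Absolute gap between two ISO-8601 start times, in minutes."""
    delta = datetime.fromisoformat(a["started_at"]) - datetime.fromisoformat(b["started_at"])
    return abs(delta.total_seconds()) / 60


# (rule name, weight, predicate). The predicates are illustrative stand-ins
# for the `match` conditions in the policy file.
RULES = [
    ("same_signature", 100, lambda t, c: t["signature"] == c["signature"]),
    ("same_service_and_kind", 60,
     lambda t, c: t["service"] == c["service"] and t["kind"] == c["kind"]),
    ("same_service_time_cluster", 40,
     lambda t, c: t["service"] == c["service"] and minutes_apart(t, c) <= WITHIN_MINUTES),
    ("same_kind_cross_service", 30,
     lambda t, c: t["kind"] == c["kind"] and minutes_apart(t, c) <= WITHIN_MINUTES),
]


def score_pair(target, candidate):
    """Sum the weights of every matching rule and record which rules fired."""
    score, reasons = 0, []
    for name, weight, matches in RULES:
        if matches(target, candidate):
            score += weight
            reasons.append(name)
    return score, reasons


target = {"signature": "aabbccdd", "service": "gateway", "kind": "error_rate",
          "started_at": "2026-02-18T14:30:00"}
candidate = {"signature": "aabbccdd", "service": "gateway", "kind": "error_rate",
             "started_at": "2026-02-18T13:00:00"}

score, reasons = score_pair(target, candidate)
if score >= MIN_SCORE:  # only candidates at or above min_score are reported
    print(score, reasons)  # 230 and all four rule names
```

Keeping each rule as a (name, weight, predicate) entry means the `reasons` list in the response falls out of the same loop that computes the score.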
### Recurrence thresholds

```yaml
recurrence:
  thresholds:
    signature:
      warn: 3       # ≥ 3 occurrences in window → warn
      high: 6       # ≥ 6 occurrences → high
    kind:
      warn: 5
      high: 10
```

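As a rough illustration of how these thresholds translate into a classification, the sketch below maps an occurrence count to a level. The function name `classify` and the `None` result for counts below `warn` are assumptions; only the threshold values come from the YAML above.

```python
# Threshold values copied from the policy excerpt above.
THRESHOLDS = {
    "signature": {"warn": 3, "high": 6},
    "kind": {"warn": 5, "high": 10},
}


def classify(dimension, count):
    """Return "high", "warn", or None for an occurrence count in one window."""
    limits = THRESHOLDS[dimension]
    if count >= limits["high"]:  # check the stricter bound first
        return "high"
    if count >= limits["warn"]:
        return "warn"
    return None  # assumed result for counts below the warn threshold


print(classify("signature", 6))  # high
print(classify("signature", 4))  # warn
print(classify("kind", 4))       # None
```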
High-recurrence items receive deterministic recommendations from `recurrence.recommendations` templates (using Python `.format()` substitution with `{sig}`, `{kind}`, etc.).

---

## Tool Usage

### `correlate`

```json
{
  "tool": "incident_intelligence_tool",
  "action": "correlate",
  "incident_id": "inc_20260218_1430_abc123",
  "append_note": true
}
```

Response:

```json
{
  "incident_id": "inc_20260218_1430_abc123",
  "related_count": 3,
  "related": [
    {
      "incident_id": "inc_20260215_0900_def456",
      "score": 230,
      "reasons": ["same_signature", "same_service_and_kind", "same_service_time_cluster", "same_kind_cross_service"],
      "service": "gateway",
      "kind": "error_rate",
      "severity": "P1",
      "status": "closed",
      "started_at": "2026-02-15T09:00:00"
    }
  ]
}
```

When `append_note=true`, a timeline event of type `note` is appended to the target incident, listing the top 5 related incidents.

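A note of this kind could be assembled as in the sketch below. The message wording and the helper name `build_related_note` are hypothetical; only the event type `note` and the top-5 cutoff come from the text above, and ranking by `score` is an assumption about what "top" means.

```python
def build_related_note(related, limit=5):
    """Build a `note` timeline event listing the highest-scoring related incidents."""
    top = sorted(related, key=lambda r: r["score"], reverse=True)[:limit]
    lines = [
        f"- {r['incident_id']} score={r['score']} ({', '.join(r['reasons'])})"
        for r in top
    ]
    # Hypothetical message layout: one header line, then one line per incident.
    return {"type": "note", "message": "Related incidents:\n" + "\n".join(lines)}


related = [
    {"incident_id": "inc_20260215_0900_def456", "score": 230,
     "reasons": ["same_signature", "same_service_and_kind"]},
]
print(build_related_note(related)["message"])
```

Sorting before truncating keeps the note stable across runs as long as the scores themselves are deterministic.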
### `recurrence`

```json
{
  "tool": "incident_intelligence_tool",
  "action": "recurrence",
  "window_days": 7
}
```

The response includes `top_signatures`, `top_kinds`, `top_services`, `high_recurrence`, and `warn_recurrence` tables.

### `weekly_digest`

```json
{
  "tool": "incident_intelligence_tool",
  "action": "weekly_digest",
  "save_artifacts": true
}
```

Response:

```json
{
  "week": "2026-W08",
  "artifact_paths": [
    "ops/reports/incidents/weekly/2026-W08.json",
    "ops/reports/incidents/weekly/2026-W08.md"
  ],
  "markdown_preview": "# Weekly Incident Digest — 2026-W08\n...",
  "json_summary": {
    "week": "2026-W08",
    "open_incidents_count": 2,
    "recent_7d_count": 12,
    "recommendations": [...]
  }
}
```

---

## RBAC

| Action | Required entitlement | Roles |
|---|---|---|
| `correlate` | `tools.oncall.read` | `agent_cto`, `agent_oncall` |
| `recurrence` | `tools.oncall.read` | `agent_cto`, `agent_oncall` |
| `weekly_digest` | `tools.oncall.incident_write` | `agent_cto`, `agent_oncall` |

Monitor (`agent_monitor`) has no access to `incident_intelligence_tool`.

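The table can be read as a two-step lookup: action to required entitlement, then role to granted entitlements. The sketch below illustrates that reading only. The `ROLE_ENTITLEMENTS` mapping and the function `is_allowed` are assumptions built from the table; the real role grants live in the RBAC configuration, which is not shown here.

```python
# Entitlement required per action, taken from the table above.
REQUIRED = {
    "correlate": "tools.oncall.read",
    "recurrence": "tools.oncall.read",
    "weekly_digest": "tools.oncall.incident_write",
}

# Assumed grants, limited to what the table states for this tool.
ROLE_ENTITLEMENTS = {
    "agent_cto": {"tools.oncall.read", "tools.oncall.incident_write"},
    "agent_oncall": {"tools.oncall.read", "tools.oncall.incident_write"},
    "agent_monitor": set(),  # no access to incident_intelligence_tool
}


def is_allowed(role, action):
    """True if the role holds the entitlement the action requires."""
    required = REQUIRED.get(action)
    if required is None:  # unknown actions are denied
        return False
    return required in ROLE_ENTITLEMENTS.get(role, set())


print(is_allowed("agent_oncall", "weekly_digest"))  # True
print(is_allowed("agent_monitor", "correlate"))     # False
```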

---

## Rate limits

| Action | Timeout | Rate limit (requests/min) |
|---|---|---|
| `correlate` | 10s | 10 |
| `recurrence` | 15s | 5 |
| `weekly_digest` | 20s | 3 |

---

## Scheduled Job

- Task ID: `weekly_incident_digest`
- Schedule: **Every Monday 08:00 UTC**
- Cron: `0 8 * * 1`

```bash
# NODE1: add to the ops user's crontab
0 8 * * 1  /usr/local/bin/job_runner.sh weekly_incident_digest '{}'
```

Artifacts are written to `ops/reports/incidents/weekly/YYYY-WW.json` and `YYYY-WW.md`, for example `2026-W08.json` and `2026-W08.md`.

---

## How scoring works

```
Score(target, candidate) = Σ weight(rule) for each rule that matches

Rules are evaluated in order. The "same_signature" rule is exclusive:
  - If signatures match → score += 100, skip other conditions for this rule.
  - If signatures do not match → skip rule entirely (score += 0).

All other rules use combined conditions (AND logic):
  - All conditions in match{} must be satisfied for the rule to fire.
```

Two incidents with **identical signatures** will always score ≥ 100. Two incidents sharing service + kind score ≥ 60. Time proximity (within 180 min, same service) scores ≥ 40.

---

## Tuning guide

| Goal | Change |
|---|---|
| Reduce false positives in correlation | Increase `min_score` (e.g., 40) |
| More aggressive recurrence warnings | Lower `thresholds.signature.warn` |
| Shorter lookback for correlation | Decrease `correlation.lookback_days` |
| Disable kind-based cross-service matching | Remove `same_kind_cross_service` rule |
| Longer digest | Increase `digest.markdown_max_chars` |

---

## Files

| File | Purpose |
|---|---|
| `services/router/incident_intelligence.py` | Core engine: correlate / recurrence / weekly_digest |
| `services/router/incident_intel_utils.py` | Helpers: kind extraction, time math, truncation |
| `config/incident_intelligence_policy.yml` | All tuneable policy parameters |
| `tests/test_incident_correlation.py` | Correlation unit tests |
| `tests/test_incident_recurrence.py` | Recurrence detection tests |
| `tests/test_weekly_digest.py` | Weekly digest tests (incl. artifact write) |

---

## Root-Cause Buckets

### Overview

`build_root_cause_buckets` clusters incidents into actionable groups. The bucket key is either `service|kind` (default) or a signature prefix.

**Filtering**: only buckets meeting `min_count` thresholds appear:

- `count_7d ≥ buckets.min_count[7]` (default: 3) **OR**
- `count_30d ≥ buckets.min_count[30]` (default: 6)

**Sorting**: `count_7d desc → count_30d desc → last_seen desc`.

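The filter and sort order above can be expressed compactly. The following is a sketch under assumptions: buckets are plain dicts with flat `count_7d`, `count_30d`, and `last_seen` fields, and `select_buckets` is a hypothetical helper, not `build_root_cause_buckets` itself.

```python
MIN_COUNT = {7: 3, 30: 6}  # documented defaults for buckets.min_count


def select_buckets(buckets):
    """Keep buckets that meet either threshold, then order them for display."""
    kept = [
        b for b in buckets
        if b["count_7d"] >= MIN_COUNT[7] or b["count_30d"] >= MIN_COUNT[30]
    ]
    # All three keys sort descending, so one tuple key with reverse=True is enough.
    # ISO-8601 timestamps in the same format compare correctly as strings.
    kept.sort(key=lambda b: (b["count_7d"], b["count_30d"], b["last_seen"]), reverse=True)
    return kept


buckets = [
    {"bucket_key": "router|latency", "count_7d": 1, "count_30d": 7,
     "last_seen": "2026-02-20T10:00:00"},
    {"bucket_key": "worker|oom", "count_7d": 2, "count_30d": 4,
     "last_seen": "2026-02-21T08:00:00"},
    {"bucket_key": "gateway|error_rate", "count_7d": 5, "count_30d": 12,
     "last_seen": "2026-02-22T14:30:00"},
]
print([b["bucket_key"] for b in select_buckets(buckets)])
# ['gateway|error_rate', 'router|latency']
```

Here `worker|oom` is dropped because it meets neither threshold, while `router|latency` survives on its 30-day count alone.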
### Tool usage

```json
{
  "tool": "incident_intelligence_tool",
  "action": "buckets",
  "service": "gateway",
  "window_days": 30
}
```

Response:

```json
{
  "service_filter": "gateway",
  "window_days": 30,
  "bucket_count": 2,
  "buckets": [
    {
      "bucket_key": "gateway|error_rate",
      "counts": {"7d": 5, "30d": 12, "open": 2},
      "last_seen": "2026-02-22T14:30:00",
      "services": ["gateway"],
      "kinds": ["error_rate"],
      "top_signatures": [{"signature": "aabbccdd", "count": 4}],
      "severity_mix": {"P0": 0, "P1": 2, "P2": 3},
      "sample_incidents": [...],
      "recommendations": [
        "Add regression test for API contract & error mapping",
        "Add/adjust SLO thresholds & alert routing"
      ]
    }
  ]
}
```

### Deterministic recommendations by kind

| Kind | Recommendations |
|---|---|
| `error_rate`, `slo_breach` | Add regression test; review deploys; adjust SLO thresholds |
| `latency` | Check p95 vs saturation; investigate DB/queue contention |
| `oom`, `crashloop` | Memory profiling; container limits; fix leaks |
| `disk` | Retention/cleanup automation; verify volumes |
| `security` | Dependency scanner + rotate secrets; verify allowlists |
| `queue` | Consumer lag + dead-letter queue |
| `network` | DNS audit; network policies |
| *(any open incidents)* | ⚠ Do not deploy risky changes until mitigated |

---

## Auto Follow-ups (policy-driven)

When `weekly_digest` runs with `autofollowups.enabled=true`, it automatically appends a `followup` event to the **most recent open incident** in each high-recurrence bucket.

### Deduplication

Follow-up key: `{dedupe_key_prefix}:{YYYY-WW}:{bucket_key}`

Each bucket gets at most one follow-up per week. A second call in the same week for the same bucket is skipped with `reason: already_exists`.

When the week changes (a new `YYYY-WW`), a new follow-up is created.

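The deduplication rule can be sketched as a key lookup against the set of keys already recorded. This is an illustration under assumptions: the week label is taken to be the ISO week of the run date (which matches the `2026-W08` format used elsewhere in this document), and `week_label`, `dedupe_key`, `maybe_create_followup`, and the `created` field are hypothetical names.

```python
from datetime import date


def week_label(day):
    """ISO week label such as 2026-W08 (assumed meaning of YYYY-WW)."""
    iso = day.isocalendar()
    return f"{iso[0]}-W{iso[1]:02d}"


def dedupe_key(prefix, day, bucket_key):
    return f"{prefix}:{week_label(day)}:{bucket_key}"


def maybe_create_followup(existing_keys, prefix, day, bucket_key):
    """Create a follow-up unless one with the same key already exists."""
    key = dedupe_key(prefix, day, bucket_key)
    if key in existing_keys:
        return {"created": False, "reason": "already_exists", "dedupe_key": key}
    existing_keys.add(key)
    return {"created": True, "dedupe_key": key}


seen = set()
print(maybe_create_followup(seen, "intel_recur", date(2026, 2, 18), "gateway|error_rate"))
print(maybe_create_followup(seen, "intel_recur", date(2026, 2, 20), "gateway|error_rate"))
print(maybe_create_followup(seen, "intel_recur", date(2026, 2, 25), "gateway|error_rate"))
```

The first and third calls create a follow-up (weeks W08 and W09), while the second is skipped because it falls in the same week as the first.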
### Policy knobs

```yaml
autofollowups:
  enabled: true
  only_when_high: true    # only high-recurrence buckets trigger follow-ups
  owner: "oncall"
  priority: "P1"
  due_days: 7
  dedupe_key_prefix: "intel_recur"
```

### Follow-up event structure

```json
{
  "type": "followup",
  "message": "[intel] Recurrence high: gateway|error_rate (7d=5, 30d=12, kinds=error_rate)",
  "meta": {
    "title": "[intel] Recurrence high: gateway|error_rate",
    "owner": "oncall",
    "priority": "P1",
    "due_date": "2026-03-02",
    "dedupe_key": "intel_recur:2026-W08:gateway|error_rate",
    "auto_created": true,
    "bucket_key": "gateway|error_rate",
    "count_7d": 5
  }
}
```

---

## `recurrence_watch` Release Gate

### Purpose

Warns (or blocks in staging) when the service being deployed has a high incident recurrence pattern. The goal is to catch the "we're deploying into a known-bad state" case.

### GatePolicy profiles

| Profile | Mode | Blocks on |
|---|---|---|
| `dev` | `warn` | Never blocks |
| `staging` | `strict` | High recurrence + P0/P1 severity |
| `prod` | `warn` | Never blocks (accumulate data first) |

### Strict mode logic

```
if mode == "strict":
    if gate.has_high_recurrence AND gate.max_severity_seen in fail_on.severity_in:
        pass = False
```

`fail_on.severity_in` defaults to `["P0", "P1"]`. High recurrence where the most severe incident seen is only P2 or P3 does **not** block.

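For reference, the strict-mode rule can be written as a small runnable function. This is a sketch, not the gate's actual code: `gate_passes` is a hypothetical name, and the gate result is assumed to be a dict carrying the `has_high_recurrence` and `max_severity_seen` fields.

```python
DEFAULT_FAIL_ON = ("P0", "P1")  # documented default for fail_on.severity_in


def gate_passes(mode, gate, fail_on_severities=DEFAULT_FAIL_ON):
    """Return False only when strict mode sees high recurrence at a blocking severity."""
    if mode == "strict":
        if gate["has_high_recurrence"] and gate["max_severity_seen"] in fail_on_severities:
            return False
    return True  # warn mode and all other cases never block


print(gate_passes("strict", {"has_high_recurrence": True, "max_severity_seen": "P1"}))  # False
print(gate_passes("strict", {"has_high_recurrence": True, "max_severity_seen": "P2"}))  # True
print(gate_passes("warn", {"has_high_recurrence": True, "max_severity_seen": "P0"}))    # True
```

Returning a boolean rather than assigning to a variable named `pass` also sidesteps the Python keyword used in the pseudocode.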
### Gate output fields

| Field | Description |
|---|---|
| `has_high_recurrence` | True if any signature or kind is in "high" zone |
| `has_warn_recurrence` | True if any signature or kind is in "warn" zone |
| `max_severity_seen` | Most severe incident in the service window |
| `high_signatures` | List of first 5 high-recurrence signature prefixes |
| `high_kinds` | List of first 5 high-recurrence kinds |
| `total_incidents` | Total incidents in window |
| `skipped` | True if gate was bypassed (error or tool unavailable) |

### Input overrides

```json
{
  "run_recurrence_watch": true,
  "recurrence_watch_mode": "off",
  "recurrence_watch_windows_days": [7, 30],
  "recurrence_watch_service": "gateway"
}
```

`recurrence_watch_mode` overrides the policy mode. `recurrence_watch_service` defaults to `service_name` from the release inputs.

### Backward compatibility

If `run_recurrence_watch` is absent from the inputs, it defaults to `true`. If `recurrence_watch_mode` is not set, the gate falls back to the GatePolicy profile setting.
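The fallback behaviour amounts to a few dictionary lookups with defaults. The sketch below assumes the release inputs arrive as a dict and that the profile mode has already been read from GatePolicy; `resolve_recurrence_watch` is a hypothetical helper name.

```python
def resolve_recurrence_watch(inputs, profile_mode):
    """Apply the documented defaults to the recurrence_watch inputs."""
    return {
        # Absent flag means the gate runs.
        "run": inputs.get("run_recurrence_watch", True),
        # An explicit mode wins; otherwise use the GatePolicy profile setting.
        "mode": inputs.get("recurrence_watch_mode", profile_mode),
        # The service defaults to service_name from the release inputs.
        "service": inputs.get("recurrence_watch_service", inputs.get("service_name")),
    }


print(resolve_recurrence_watch({"service_name": "gateway"}, "strict"))
# {'run': True, 'mode': 'strict', 'service': 'gateway'}
print(resolve_recurrence_watch({"service_name": "gateway", "recurrence_watch_mode": "off"}, "strict"))
# {'run': True, 'mode': 'off', 'service': 'gateway'}
```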