docs(platform): add policy configs, runbooks, ops scripts and platform documentation
**Config policies (16 files):** alert_routing, architecture_pressure, backlog, cost_weights, data_governance, incident_escalation, incident_intelligence, network_allowlist, nodes_registry, observability_sources, rbac_tools_matrix, release_gate, risk_attribution, risk_policy, slo_policy, tool_limits, tools_rollout

**Ops (22 files):** Caddyfile, calendar compose, grafana voice dashboard, deployments/incidents logs, runbooks for alerts/audit/backlog/incidents/sofiia/voice, cron jobs, scripts (alert_triage, audit_cleanup, migrate_*, governance, schedule), task_registry, voice alerts/ha/latency/policy

**Docs (30+ files):** HUMANIZED_STEPAN v2.7–v3 changelogs and runbooks, NODA1/NODA2 status and setup, audit index and traces, backlog, incident, supervisor, tools, voice, opencode, release, risk, aistalk, spacebot

Made-with: Cursor

---
**File:** `docs/incident/alerts.md` (new file, 156 lines)

# Alert → Incident Bridge

## Overview

The Alert Bridge provides a governed, deduplicated pipeline from Monitor/Prometheus detection to Incident creation.

**Security model:** Monitor sends alerts (`tools.alerts.ingest` only). Sofiia/oncall create incidents (`tools.oncall.incident_write` + `tools.alerts.ack`). No agent gets both roles automatically.

```
Monitor@nodeX ──ingest──► AlertStore ──alert_to_incident──► IncidentStore
(tools.alerts.ingest)                 (tools.oncall.incident_write)
                                              │
                                  IncidentTriage (Sofiia NODA2)
                                              │
                                      PostmortemDraft
```

## AlertEvent Schema

```json
{
  "source": "monitor@node1",
  "service": "gateway",
  "env": "prod",
  "severity": "P1",
  "kind": "slo_breach",
  "title": "gateway SLO: latency p95 > 300ms",
  "summary": "p95 latency at 450ms, error_rate 2.5%",
  "started_at": "2025-01-23T09:00:00Z",
  "labels": {
    "node": "node1",
    "fingerprint": "gateway:slo_breach:latency"
  },
  "metrics": {
    "latency_p95_ms": 450,
    "error_rate_pct": 2.5
  },
  "evidence": {
    "log_samples": ["ERROR timeout after 30s", "WARN retry 3/3"],
    "query": "rate(http_errors_total[5m])"
  }
}
```

### Severity values

`P0`, `P1`, `P2`, `P3`, `INFO`

### Kind values

`slo_breach`, `crashloop`, `latency`, `error_rate`, `disk`, `oom`, `deploy`, `security`, `custom`

## Dedupe Behavior

Dedupe key = `sha256(service|env|kind|fingerprint)`.

- Same key within TTL (default 30 min) → `deduped=true`, `occurrences++`, no new record
- Same key after TTL → new alert record
- Different fingerprint → separate record
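The dedupe key above can be computed in a few lines. A minimal sketch — the actual `AlertStore` code is not shown in this doc, so the helper name is hypothetical:

```python
import hashlib

def dedupe_key(service: str, env: str, kind: str, fingerprint: str) -> str:
    """sha256 over the pipe-joined identity fields, as stated above."""
    raw = "|".join((service, env, kind, fingerprint))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

# Same identity fields → same key; a different fingerprint → a separate record.
k1 = dedupe_key("gateway", "prod", "slo_breach", "gateway:slo_breach:latency")
k2 = dedupe_key("gateway", "prod", "slo_breach", "gateway:slo_breach:latency")
k3 = dedupe_key("gateway", "prod", "slo_breach", "gateway:other")
```

Because the key is a pure hash of the four fields, the TTL logic only has to compare keys and timestamps.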

## `alert_ingest_tool` API

### ingest (Monitor role)

```json
{
  "action": "ingest",
  "alert": { ...AlertEvent... },
  "dedupe_ttl_minutes": 30
}
```

Response:

```json
{
  "accepted": true,
  "deduped": false,
  "dedupe_key": "abc123...",
  "alert_ref": "alrt_20250123_090000_a1b2c3",
  "occurrences": 1
}
```

### list (read)

```json
{ "action": "list", "service": "gateway", "env": "prod", "window_minutes": 240, "limit": 50 }
```

### get (read)

```json
{ "action": "get", "alert_ref": "alrt_..." }
```

### ack (oncall/cto)

```json
{ "action": "ack", "alert_ref": "alrt_...", "actor": "sofiia", "note": "false positive" }
```

## `oncall_tool.alert_to_incident`

Converts a stored alert into an incident (or attaches it to an existing open one).

```json
{
  "action": "alert_to_incident",
  "alert_ref": "alrt_...",
  "incident_severity_cap": "P1",
  "dedupe_window_minutes": 60,
  "attach_artifact": true
}
```

Response:

```json
{
  "incident_id": "inc_20250123_090000_xyz",
  "created": true,
  "severity": "P1",
  "artifact_path": "ops/incidents/inc_.../alert_alrt_....json",
  "note": "Incident created and alert acked"
}
```

### Logic

1. Load the alert from `AlertStore`
2. Check for an existing open P0/P1 incident for the same service/env within `dedupe_window_minutes`
   - If found → attach the event to the existing incident, ack the alert
3. If not found → create an incident, append `note` + `metric` timeline events, optionally attach the masked alert JSON as an artifact, ack the alert
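The attach-or-create decision (steps 2–3) can be sketched as follows. The function name and dict shapes are hypothetical stand-ins for the real store records, which this doc does not show:

```python
from datetime import datetime, timedelta, timezone

def alert_to_incident(alert: dict, open_incidents: list,
                      dedupe_window_minutes: int = 60) -> dict:
    """Attach-or-create decision: reuse a recent open P0/P1 incident
    for the same service/env, otherwise mint a new incident id."""
    now = datetime.now(timezone.utc)
    cutoff = now - timedelta(minutes=dedupe_window_minutes)
    for inc in open_incidents:
        if ((inc["service"], inc["env"]) == (alert["service"], alert["env"])
                and inc["severity"] in ("P0", "P1")
                and inc["started_at"] >= cutoff):
            return {"incident_id": inc["id"], "created": False}
    return {"incident_id": f"inc_{now:%Y%m%d_%H%M%S}", "created": True}

# A recent open P1 for gateway/prod is reused; with no match, a new id is minted.
now = datetime.now(timezone.utc)
existing = [{"id": "inc_1", "service": "gateway", "env": "prod",
             "severity": "P1", "started_at": now}]
attached = alert_to_incident({"service": "gateway", "env": "prod"}, existing)
created = alert_to_incident({"service": "gateway", "env": "prod"}, [])
```

The real tool additionally appends timeline events, attaches the artifact, and acks the alert, as listed in the steps above.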
## RBAC

| Role | ingest | list/get | ack | alert_to_incident |
|------|--------|----------|-----|-------------------|
| `agent_monitor` | ✅ | ❌ | ❌ | ❌ |
| `agent_cto` | ✅ | ✅ | ✅ | ✅ |
| `agent_oncall` | ❌ | ✅ | ✅ | ✅ |
| `agent_interface` | ❌ | ✅ | ❌ | ❌ |
| `agent_default` | ❌ | ❌ | ❌ | ❌ |

## SLO Watch Gate

The `slo_watch` gate in `release_check` prevents deploys during active SLO breaches.

| Profile | Mode | Behavior |
|---------|------|----------|
| dev | warn | Recommendations only |
| staging | strict | Blocks on any violation |
| prod | warn | Recommendations only |

Configure per profile in `config/release_gate_policy.yml`. Override per run with `run_slo_watch: false`.

## Backends

| Env var | Value | Effect |
|---------|-------|--------|
| `ALERT_BACKEND` | `memory` (default) | In-process, not persistent |
| `ALERT_BACKEND` | `postgres` | Persistent, needs `DATABASE_URL` |
| `ALERT_BACKEND` | `auto` | Postgres if `DATABASE_URL` is set, else memory |

Run DDL: `python3 ops/scripts/migrate_alerts_postgres.py`
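The `auto` backend resolution amounts to a one-line decision. A sketch — the helper name is hypothetical; env-var names are from the table above:

```python
def pick_alert_backend(env: dict) -> str:
    """Resolve ALERT_BACKEND per the table: an explicit value wins;
    'auto' means postgres when DATABASE_URL is set, else memory."""
    backend = env.get("ALERT_BACKEND", "memory")
    if backend == "auto":
        return "postgres" if env.get("DATABASE_URL") else "memory"
    return backend

default_choice = pick_alert_backend({})
auto_pg = pick_alert_backend({"ALERT_BACKEND": "auto", "DATABASE_URL": "postgres://..."})
auto_mem = pick_alert_backend({"ALERT_BACKEND": "auto"})
```

In the real service this presumably reads `os.environ`; passing the mapping in keeps the sketch testable.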

---
**File:** `docs/incident/escalation.md` (new file, 99 lines)

# Incident Escalation Engine

Deterministic, LLM-free engine that escalates incidents and identifies auto-resolve candidates based on alert storm behavior.

## Overview

```
alert_triage_graph (every 5 min)
  └─ process_alerts
      └─ post_process_escalation      ← incident_escalation_tool.evaluate
          └─ post_process_autoresolve ← incident_escalation_tool.auto_resolve_candidates
              └─ build_digest         ← includes escalation + candidate summary
```

## Escalation Logic

Config: `config/incident_escalation_policy.yml`

| Trigger | From → To |
|---------|-----------|
| `occurrences_60m ≥ 10` OR `triage_count_24h ≥ 3` | P2 → P1 |
| `occurrences_60m ≥ 25` OR `triage_count_24h ≥ 6` | P1 → P0 |
| Cap: `severity_cap: "P0"` | never exceeds P0 |

When escalation triggers:

1. `incident_append_event(type=decision)` — audit trail
2. `incident_append_event(type=followup)` — auto follow-up (if `create_followup_on_escalate: true`)
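The trigger table is a pair of deterministic threshold checks. A sketch hard-coding the default thresholds — the function name is hypothetical, and it escalates one step per evaluation:

```python
def escalate(severity: str, occurrences_60m: int, triage_count_24h: int) -> str:
    """One escalation step per evaluation, per the trigger table
    (default thresholds hard-coded; the cap is P0, the top severity)."""
    if severity == "P2" and (occurrences_60m >= 10 or triage_count_24h >= 3):
        return "P1"
    if severity == "P1" and (occurrences_60m >= 25 or triage_count_24h >= 6):
        return "P0"
    return severity
```

Whether a P2 storm may jump straight to P0 in a single evaluation is a design choice; this sketch assumes one step at a time, with the next triage cycle able to escalate further.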

## Auto-resolve Candidates

Incidents where `last_alert_at < now - no_alerts_minutes_for_candidate`:

- `close_allowed_severities: ["P2", "P3"]` — only low-severity incidents are auto-closeable
- `auto_close: false` (default) — produces *candidates* only, no auto-close
- Each candidate gets a `note` event appended to the incident timeline

## Alert-loop SLO

Tracked in `/v1/alerts/dashboard?window_minutes=240`:

```json
"slo": {
  "claim_to_ack_p95_seconds": 12.3,
  "failed_rate_pct": 0.5,
  "processing_stuck_count": 0,
  "violations": []
}
```

Thresholds (from `alert_loop_slo` in the policy):

- `claim_to_ack_p95_seconds: 60` — p95 latency from claim to ack
- `failed_rate_pct: 5` — max % failed/(acked+failed)
- `processing_stuck_minutes: 15` — alerts stuck in processing beyond this

## RBAC

| Action | Required entitlement |
|--------|---------------------|
| `evaluate` | `tools.oncall.incident_write` (CTO/oncall) |
| `auto_resolve_candidates` | `tools.oncall.incident_write` (CTO/oncall) |

The Monitor agent does NOT have access (ingest-only).

## Configuration

```yaml
# config/incident_escalation_policy.yml
escalation:
  occurrences_thresholds:
    P2_to_P1: 10
    P1_to_P0: 25
  triage_thresholds_24h:
    P2_to_P1: 3
    P1_to_P0: 6
  severity_cap: "P0"
  create_followup_on_escalate: true

auto_resolve:
  no_alerts_minutes_for_candidate: 60
  close_allowed_severities: ["P2", "P3"]
  auto_close: false

alert_loop_slo:
  claim_to_ack_p95_seconds: 60
  failed_rate_pct: 5
  processing_stuck_minutes: 15
```

## Tuning

**Too many escalations (noisy)?**
→ Increase `occurrences_thresholds.P2_to_P1` or `triage_thresholds_24h.P2_to_P1`.

**Auto-resolve too aggressive?**
→ Increase `no_alerts_minutes_for_candidate` (e.g., 120 min).

**Ready to enable auto-close for P3?**
→ Set `auto_close: true` and `close_allowed_severities: ["P3"]`.

---
**File:** `docs/incident/followups.md` (new file, 102 lines)

# Follow-up Tracker & Release Gate

## Overview

Follow-ups are structured action items attached to incidents via `incident_append_event` with `type=followup`. The `followup_watch` gate in `release_check` uses them to block or warn about releases for services with unresolved issues.

## Follow-up Event Schema

When appending a follow-up event to an incident:

```json
{
  "action": "incident_append_event",
  "incident_id": "inc_20250123_0900_abc1",
  "type": "followup",
  "message": "Upgrade postgres driver",
  "meta": {
    "title": "Upgrade postgres driver to fix connection leak",
    "owner": "sofiia",
    "priority": "P1",
    "due_date": "2025-02-01T00:00:00Z",
    "status": "open",
    "links": ["https://github.com/org/repo/issues/42"]
  }
}
```

### Meta Fields

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `title` | string | yes | Short description |
| `owner` | string | yes | Agent ID or handle |
| `priority` | enum | yes | P0, P1, P2, P3 |
| `due_date` | ISO8601 | yes | Deadline |
| `status` | enum | yes | open, done, cancelled |
| `links` | array | no | Related PRs/issues/ADRs |

## oncall_tool: incident_followups_summary

Summarises open incidents and overdue follow-ups for a service.

### Request

```json
{
  "action": "incident_followups_summary",
  "service": "gateway",
  "env": "prod",
  "window_days": 30
}
```

### Response

```json
{
  "open_incidents": [
    {"id": "inc_...", "severity": "P1", "status": "open", "started_at": "...", "title": "..."}
  ],
  "overdue_followups": [
    {"incident_id": "inc_...", "title": "...", "due_date": "...", "priority": "P1", "owner": "sofiia"}
  ],
  "stats": {
    "open_incidents": 1,
    "overdue": 1,
    "total_open_followups": 3
  }
}
```

## Release Gate: followup_watch

### Behaviour per GatePolicy mode

| Mode | Behaviour |
|------|-----------|
| `off` | Gate skipped entirely |
| `warn` | Always `pass=true`; adds recommendations for open P0/P1 and overdue follow-ups |
| `strict` | Blocks release (`pass=false`) if open incidents match `fail_on` severities or overdue follow-ups exist |

### Configuration

In `config/release_gate_policy.yml`:

```yaml
followup_watch:
  mode: "warn"            # off | warn | strict
  fail_on: ["P0", "P1"]   # Severities that block in strict mode
```
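The mode table reduces to two checks in strict mode. A sketch assuming the `incident_followups_summary` response shape shown earlier — the function name is hypothetical:

```python
def followup_watch(summary: dict, mode: str = "warn", fail_on=("P0", "P1")) -> dict:
    """Gate decision per the mode table: warn always passes; strict blocks
    on open incidents matching fail_on or any overdue follow-up."""
    if mode == "off":
        return {"pass": True, "skipped": True}
    blocking = [i for i in summary["open_incidents"] if i["severity"] in fail_on]
    overdue = summary["overdue_followups"]
    recommendations = []
    if blocking:
        recommendations.append(f"{len(blocking)} open {'/'.join(fail_on)} incident(s)")
    if overdue:
        recommendations.append(f"{len(overdue)} overdue follow-up(s)")
    passed = not (mode == "strict" and (blocking or overdue))
    return {"pass": passed, "recommendations": recommendations}

summary = {"open_incidents": [{"severity": "P1"}], "overdue_followups": []}
warn_result = followup_watch(summary, mode="warn")      # passes, with a recommendation
strict_result = followup_watch(summary, mode="strict")  # blocks
```

Note that in `warn` mode the same findings surface as recommendations rather than a failed gate.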

### release_check inputs

| Input | Type | Default | Description |
|-------|------|---------|-------------|
| `run_followup_watch` | bool | true | Enable/disable gate |
| `followup_watch_window_days` | int | 30 | Incident scan window |
| `followup_watch_env` | string | "any" | Filter by environment |

## RBAC

`incident_followups_summary` requires the `tools.oncall.read` entitlement.

---
**File:** `docs/incident/incident_log.md` (new file, 112 lines)

# NODA1 Incident Log

---

## INC-2026-002 | 2026-02-27 | Gateway Workers + SenpAI + facts/upsert

**Severity:** SEV-1 (all agents stopped responding to users)
**Status:** RESOLVED
**Duration:** ~3 days (from 2026-02-21 09:55 to 2026-02-27 23:15)

### Summary

After upgrading Redis to 8.6.1 and a series of code changes in the gateway, two workers hung, SenpAI returned 500, and `facts/upsert` failed with `InvalidColumnReferenceError`. Combined, the agents stopped responding in Telegram.

### Root Causes (3 independent)

| # | Component | Cause |
|---|-----------|---------|
| 1 | `dagi-gateway-worker-node1` | After the Redis 8.6.1 upgrade, the async client's stale TCP sockets → `ReadOnlyError` in `brpop()` |
| 2 | `dagi-gateway-reminder-worker-node1` | Same stale-connection problem after the Redis upgrade |
| 3 | `SenpAI webhook` → Router | `.env`: `ROUTER_URL=http://dagi-staging-router:8000` (staging!) instead of `http://router:8000` |
| 4 | `memory-service /facts/upsert` | `ensure_facts_table()` DDL was stale: `UNIQUE(user_id, team_id, fact_key)` → asyncpg cached an old prepared statement without `agent_id`; ON CONFLICT found no matching constraint |
| 5 | `get_doc_context()` | Function signature lacked the `agent_id=None` parameter even though `http_api.py` passed it |

### Timeline

| Time (UTC+1) | Event |
|-------------|-------|
| 2026-02-21 09:55 | Last successful processing (agromatrix) |
| 2026-02-26 13:09 | `ReadOnlyError` starts in gateway-worker (Redis upgrade) |
| 2026-02-27 17:02 | Worker errors resumed after restarts |
| 2026-02-27 19:49 | gateway-worker fully blocked (last restart) |
| 2026-02-27 22:46 | Restarted dagi-gateway-worker-node1 → stable |
| 2026-02-27 22:47 | Restarted dagi-gateway-reminder-worker-node1 → stable |
| 2026-02-28 00:01 | Fixed ensure_facts_table() → memory-service rebuilt |
| 2026-02-28 00:05 | Fixed ROUTER_URL, get_doc_context() → gateway rebuilt |
| 2026-02-28 00:15 | All 14 agents HTTP 200 ✓ |

### Fixes Applied (on the server, /opt/microdao-daarion)

```
1. docker restart dagi-gateway-worker-node1 dagi-gateway-reminder-worker-node1
2. services/memory-service/app/database.py:
   - ensure_facts_table() replaced with a noop (the table is managed by migrations)
   - Copied missing files: integration_endpoints.py, integrations.py, voice_endpoints.py
3. gateway-bot/services/doc_service.py:
   - get_doc_context(session_id: str) → get_doc_context(session_id: str, agent_id: str = None)
4. .env:
   - ROUTER_URL=http://dagi-staging-router:8000 → ROUTER_URL=http://router:8000
5. Rebuild + restart: memory-service, gateway, gateway-worker, gateway-reminder-worker
```

### Verification

```
All 14 agents HTTP 200:
✓ senpai ✓ helion ✓ nutra ✓ daarwizz ✓ greenfood ✓ agromatrix
✓ alateya ✓ druid ✓ clan ✓ eonarch ✓ oneok ✓ soul
✓ yaromir ✓ sofiia
facts/upsert: {"status":"ok"}
Gateway: healthy, 14 agents
```

### Action Items (TODO)

- [ ] After a Redis upgrade — always restart the workers (add to the runbook)
- [ ] Fix `ensure_facts_table()` in the repository code (locally)
- [ ] Fix the `get_doc_context()` signature in the local repo
- [ ] Fix `.env` in the repository (or `.env.example`) — remove the staging router URL
- [ ] Add a liveness probe for workers: exit(1) on repeated ReadOnlyError
- [ ] Alert: "No messages processed for X minutes"

---

## INC-2026-003 | 2026-02-28 | Ollama resource crash → all agents 503

**Severity:** SEV-1 (all agents stopped responding in Telegram)
**Status:** RESOLVED
**Duration:** ~8 hours (from 07:53 to ~16:00 UTC+1)

### Root Cause

Ollama crashed with `model runner has unexpectedly stopped, this may be due to resource limitations`. The `qwen3:8b` model (27.8B params, ~17GB) exceeded the server's resources under load → the Router received `500` from Ollama → returned `503` to the client. All agents were configured with `provider: ollama`.

### Fix Applied

Switched all agents in `router-config.yml` from the `qwen3_*_8b` profiles → `cloud_deepseek`:

- All 14 agents now use `deepseek-chat` via the DeepSeek API
- The Router was restarted to pick up the new config

### Verification

```
helion: 🌐 Trying DEEPSEEK API → HTTP 200, 15222 tokens
All 14 agents: ✓ HTTP 200
```

### Action Items

- [ ] Backup `router-config.yml.bak_20260228` → save it to the repo
- [ ] Consider moving Ollama to a smaller model (smollm2:135m or qwen3-vl:8b) for vision tasks
- [ ] Add a fallback to the Router: on Ollama 500 → automatically switch to cloud_deepseek

---

## INC-2026-001 | (earlier incidents)

_(to be added as needed)_

---
**File:** `docs/incident/intelligence.md` (new file, 387 lines)

# Incident Intelligence Layer

> **Deterministic, 0 LLM tokens.** Pattern detection and weekly reporting built on top of the existing Incident Store and Alert State Machine.

---

## Overview

The Incident Intelligence Layer adds three analytical capabilities to the incident management platform:

| Capability | Action | Description |
|---|---|---|
| **Correlation** | `correlate` | Find related incidents for a given incident ID using scored rule matching |
| **Recurrence Detection** | `recurrence` | Frequency tables for 7d/30d windows with threshold classification |
| **Weekly Digest** | `weekly_digest` | Full markdown + JSON report saved to `ops/reports/incidents/weekly/` |

All three functions are deterministic and reentrant — running twice on the same data produces the same output.

---

## Architecture

```
incident_intelligence_tool (tool_manager.py)
    │
    ├── correlate     → incident_intelligence.correlate_incident()
    ├── recurrence    → incident_intelligence.detect_recurrence()
    └── weekly_digest → incident_intelligence.weekly_digest()
            │
    IncidentStore (INCIDENT_BACKEND=auto)
    incident_intel_utils.py (helpers)
    config/incident_intelligence_policy.yml
```

---

## Policy: `config/incident_intelligence_policy.yml`

### Correlation rules

Each rule defines a `name`, a `weight` (score contribution), and `match` conditions:

| Rule name | Weight | Match conditions |
|---|---|---|
| `same_signature` | 100 | Exact SHA-256 signature match |
| `same_service_and_kind` | 60 | Same service **and** same kind |
| `same_service_time_cluster` | 40 | Same service, started within `within_minutes` |
| `same_kind_cross_service` | 30 | Same kind (cross-service), within `within_minutes` |

The final score is the sum of all matching rule weights. Only incidents scoring ≥ `min_score` (default: 20) appear in results.

**Example:** two incidents with the same signature that also share service+kind within 180 min → score = 100 + 60 + 40 + 30 = 230.

### Recurrence thresholds

```yaml
recurrence:
  thresholds:
    signature:
      warn: 3   # ≥ 3 occurrences in window → warn
      high: 6   # ≥ 6 occurrences → high
    kind:
      warn: 5
      high: 10
```

High-recurrence items receive deterministic recommendations from the `recurrence.recommendations` templates (using Python `.format()` substitution with `{sig}`, `{kind}`, etc.).
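Classification against these thresholds is a two-way comparison. A minimal sketch with the default signature thresholds — the function name is hypothetical:

```python
def classify_recurrence(count: int, warn: int = 3, high: int = 6) -> str:
    """Map an occurrence count in the window to none / warn / high
    (defaults are the signature thresholds above)."""
    if count >= high:
        return "high"
    if count >= warn:
        return "warn"
    return "none"
```

The same helper covers kind-level classification by passing `warn=5, high=10`.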

---

## Tool Usage

### `correlate`

```json
{
  "tool": "incident_intelligence_tool",
  "action": "correlate",
  "incident_id": "inc_20260218_1430_abc123",
  "append_note": true
}
```

Response:

```json
{
  "incident_id": "inc_20260218_1430_abc123",
  "related_count": 3,
  "related": [
    {
      "incident_id": "inc_20260215_0900_def456",
      "score": 230,
      "reasons": ["same_signature", "same_service_and_kind", "same_service_time_cluster", "same_kind_cross_service"],
      "service": "gateway",
      "kind": "error_rate",
      "severity": "P1",
      "status": "closed",
      "started_at": "2026-02-15T09:00:00"
    }
  ]
}
```

When `append_note=true`, a timeline event of type `note` is appended to the target incident, listing the top-5 related incidents.

### `recurrence`

```json
{
  "tool": "incident_intelligence_tool",
  "action": "recurrence",
  "window_days": 7
}
```

The response includes `top_signatures`, `top_kinds`, `top_services`, `high_recurrence`, and `warn_recurrence` tables.

### `weekly_digest`

```json
{
  "tool": "incident_intelligence_tool",
  "action": "weekly_digest",
  "save_artifacts": true
}
```

Response:

```json
{
  "week": "2026-W08",
  "artifact_paths": [
    "ops/reports/incidents/weekly/2026-W08.json",
    "ops/reports/incidents/weekly/2026-W08.md"
  ],
  "markdown_preview": "# Weekly Incident Digest — 2026-W08\n...",
  "json_summary": {
    "week": "2026-W08",
    "open_incidents_count": 2,
    "recent_7d_count": 12,
    "recommendations": [...]
  }
}
```

---

## RBAC

| Action | Required entitlement | Roles |
|---|---|---|
| `correlate` | `tools.oncall.read` | `agent_cto`, `agent_oncall` |
| `recurrence` | `tools.oncall.read` | `agent_cto`, `agent_oncall` |
| `weekly_digest` | `tools.oncall.incident_write` | `agent_cto`, `agent_oncall` |

Monitor (`agent_monitor`) has no access to `incident_intelligence_tool`.

---

## Rate limits

| Action | Timeout | RPM |
|---|---|---|
| `correlate` | 10s | 10 |
| `recurrence` | 15s | 5 |
| `weekly_digest` | 20s | 3 |

---

## Scheduled Job

Task ID: `weekly_incident_digest`
Schedule: **Every Monday 08:00 UTC**
Cron: `0 8 * * 1`

```bash
# NODE1 — add to the ops user's crontab
0 8 * * 1 /usr/local/bin/job_runner.sh weekly_incident_digest '{}'
```

Artifacts are written to `ops/reports/incidents/weekly/YYYY-WW.json` and `YYYY-WW.md`.

---

## How scoring works

```
Score(target, candidate) = Σ weight(rule) for each rule that matches

Rules are evaluated in order. The "same_signature" rule is exclusive:
- If signatures match → score += 100, skip other conditions for this rule.
- If signatures do not match → skip this rule entirely (score += 0).

All other rules use combined conditions (AND logic):
- All conditions in match{} must be satisfied for the rule to fire.
```

Two incidents with **identical signatures** will always score ≥ 100. Two incidents sharing service + kind score ≥ 60. Time proximity (within 180 min, same service) scores ≥ 40.
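The additive scoring can be sketched directly from the rule table. Dict shapes and the helper name are hypothetical; the weights and the 180-minute window are the policy defaults given above:

```python
from datetime import datetime, timedelta

def correlation_score(a: dict, b: dict, within_minutes: int = 180):
    """Sum of matching rule weights, plus the matched rule names as reasons."""
    reasons = []
    close_in_time = abs(a["started_at"] - b["started_at"]) <= timedelta(minutes=within_minutes)
    if a["signature"] == b["signature"]:
        reasons.append(("same_signature", 100))
    if a["service"] == b["service"] and a["kind"] == b["kind"]:
        reasons.append(("same_service_and_kind", 60))
    if a["service"] == b["service"] and close_in_time:
        reasons.append(("same_service_time_cluster", 40))
    if a["kind"] == b["kind"] and close_in_time:
        reasons.append(("same_kind_cross_service", 30))
    return sum(w for _, w in reasons), [n for n, _ in reasons]

# The worked example: same signature, same service+kind, 30 minutes apart.
t0 = datetime(2026, 2, 15, 9, 0)
inc_a = {"signature": "s1", "service": "gateway", "kind": "error_rate", "started_at": t0}
inc_b = dict(inc_a, started_at=t0 + timedelta(minutes=30))
score, reasons = correlation_score(inc_a, inc_b)  # score == 230
```

Only candidates whose score clears `min_score` would then be kept and sorted descending.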

---

## Tuning guide

| Goal | Change |
|---|---|
| Reduce false positives in correlation | Increase `min_score` (e.g., 40) |
| More aggressive recurrence warnings | Lower `thresholds.signature.warn` |
| Shorter lookback for correlation | Decrease `correlation.lookback_days` |
| Disable kind-based cross-service matching | Remove the `same_kind_cross_service` rule |
| Longer digest | Increase `digest.markdown_max_chars` |

---

## Files

| File | Purpose |
|---|---|
| `services/router/incident_intelligence.py` | Core engine: correlate / recurrence / weekly_digest |
| `services/router/incident_intel_utils.py` | Helpers: kind extraction, time math, truncation |
| `config/incident_intelligence_policy.yml` | All tuneable policy parameters |
| `tests/test_incident_correlation.py` | Correlation unit tests |
| `tests/test_incident_recurrence.py` | Recurrence detection tests |
| `tests/test_weekly_digest.py` | Weekly digest tests (incl. artifact write) |

---

## Root-Cause Buckets

### Overview

`build_root_cause_buckets` clusters incidents into actionable groups. The bucket key is either `service|kind` (default) or a signature prefix.

**Filtering**: only buckets meeting the `min_count` thresholds appear:

- `count_7d ≥ buckets.min_count[7]` (default: 3) **OR**
- `count_30d ≥ buckets.min_count[30]` (default: 6)

**Sorting**: `count_7d desc → count_30d desc → last_seen desc`.
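The filter-and-sort rules translate to a short helper. A sketch — the function name is hypothetical; bucket dicts use the `counts`/`last_seen` shape from this section:

```python
def select_buckets(buckets: list, min_7d: int = 3, min_30d: int = 6) -> list:
    """Keep buckets meeting either min_count threshold, then sort by
    count_7d desc, count_30d desc, last_seen desc (ISO-8601 timestamps
    sort correctly as plain strings)."""
    kept = [b for b in buckets
            if b["counts"]["7d"] >= min_7d or b["counts"]["30d"] >= min_30d]
    return sorted(kept,
                  key=lambda b: (b["counts"]["7d"], b["counts"]["30d"], b["last_seen"]),
                  reverse=True)

demo = [
    {"bucket_key": "gateway|latency",    "counts": {"7d": 1, "30d": 2},  "last_seen": "2026-02-20"},
    {"bucket_key": "gateway|error_rate", "counts": {"7d": 5, "30d": 12}, "last_seen": "2026-02-22"},
    {"bucket_key": "router|error_rate",  "counts": {"7d": 5, "30d": 7},  "last_seen": "2026-02-21"},
]
ordered = select_buckets(demo)  # latency bucket filtered out; 7d ties broken by 30d
```

Since all three sort keys are descending, a single `reverse=True` over the key tuple suffices.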

### Tool usage

```json
{
  "tool": "incident_intelligence_tool",
  "action": "buckets",
  "service": "gateway",
  "window_days": 30
}
```

Response:

```json
{
  "service_filter": "gateway",
  "window_days": 30,
  "bucket_count": 2,
  "buckets": [
    {
      "bucket_key": "gateway|error_rate",
      "counts": {"7d": 5, "30d": 12, "open": 2},
      "last_seen": "2026-02-22T14:30:00",
      "services": ["gateway"],
      "kinds": ["error_rate"],
      "top_signatures": [{"signature": "aabbccdd", "count": 4}],
      "severity_mix": {"P0": 0, "P1": 2, "P2": 3},
      "sample_incidents": [...],
      "recommendations": [
        "Add regression test for API contract & error mapping",
        "Add/adjust SLO thresholds & alert routing"
      ]
    }
  ]
}
```

### Deterministic recommendations by kind

| Kind | Recommendations |
|---|---|
| `error_rate`, `slo_breach` | Add regression test; review deploys; adjust SLO thresholds |
| `latency` | Check p95 vs saturation; investigate DB/queue contention |
| `oom`, `crashloop` | Memory profiling; container limits; fix leaks |
| `disk` | Retention/cleanup automation; verify volumes |
| `security` | Dependency scanner + rotate secrets; verify allowlists |
| `queue` | Consumer lag + dead-letter queue |
| `network` | DNS audit; network policies |
| *(any open incidents)* | ⚠ Do not deploy risky changes until mitigated |

---

## Auto Follow-ups (policy-driven)

When `weekly_digest` runs with `autofollowups.enabled=true`, it automatically appends a `followup` event to the **most recent open incident** in each high-recurrence bucket.

### Deduplication

Follow-up key: `{dedupe_key_prefix}:{YYYY-WW}:{bucket_key}`

One follow-up per bucket per week. A second call in the same week with the same bucket → skipped with `reason: already_exists`.

A new week (`YYYY-WW` changes) → a new follow-up is created.
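The weekly dedupe key derives from the ISO week number. A sketch — the helper name is hypothetical:

```python
from datetime import date

def followup_dedupe_key(day: date, bucket_key: str, prefix: str = "intel_recur") -> str:
    """{prefix}:{YYYY-WW}:{bucket_key} — at most one follow-up per bucket
    per ISO week, per the deduplication rule above."""
    iso_year, iso_week, _ = day.isocalendar()
    return f"{prefix}:{iso_year}-W{iso_week:02d}:{bucket_key}"

key = followup_dedupe_key(date(2026, 2, 18), "gateway|error_rate")
next_week = followup_dedupe_key(date(2026, 2, 25), "gateway|error_rate")
```

Using `isocalendar()` rather than slicing the calendar year avoids off-by-one weeks around New Year, when the ISO year can differ from the calendar year.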

### Policy knobs

```yaml
autofollowups:
  enabled: true
  only_when_high: true   # only high-recurrence buckets trigger follow-ups
  owner: "oncall"
  priority: "P1"
  due_days: 7
  dedupe_key_prefix: "intel_recur"
```

### Follow-up event structure

```json
{
  "type": "followup",
  "message": "[intel] Recurrence high: gateway|error_rate (7d=5, 30d=12, kinds=error_rate)",
  "meta": {
    "title": "[intel] Recurrence high: gateway|error_rate",
    "owner": "oncall",
    "priority": "P1",
    "due_date": "2026-03-02",
    "dedupe_key": "intel_recur:2026-W08:gateway|error_rate",
    "auto_created": true,
    "bucket_key": "gateway|error_rate",
    "count_7d": 5
  }
}
```

---

## `recurrence_watch` Release Gate

### Purpose

Warns (or blocks, in staging) when the service being deployed has a high incident recurrence pattern — catching "we're deploying into a known-bad state."

### GatePolicy profiles

| Profile | Mode | Blocks on |
|---|---|---|
| `dev` | `warn` | Never blocks |
| `staging` | `strict` | High recurrence + P0/P1 severity |
| `prod` | `warn` | Never blocks (accumulate data first) |

### Strict mode logic

```
if mode == "strict":
    if gate.has_high_recurrence AND gate.max_severity_seen in fail_on.severity_in:
        pass = False
```

`fail_on.severity_in` defaults to `["P0", "P1"]`. P2/P3 incidents in a high-recurrence bucket do **not** block.

### Gate output fields

| Field | Description |
|---|---|
| `has_high_recurrence` | True if any signature or kind is in the "high" zone |
| `has_warn_recurrence` | True if any signature or kind is in the "warn" zone |
| `max_severity_seen` | Most severe incident in the service window |
| `high_signatures` | First 5 high-recurrence signature prefixes |
| `high_kinds` | First 5 high-recurrence kinds |
| `total_incidents` | Total incidents in the window |
| `skipped` | True if the gate was bypassed (error or tool unavailable) |

### Input overrides

```json
{
  "run_recurrence_watch": true,
  "recurrence_watch_mode": "off",          // override policy
  "recurrence_watch_windows_days": [7, 30],
  "recurrence_watch_service": "gateway"    // default: service_name from release inputs
}
```

### Backward compatibility

If `run_recurrence_watch` is not in the inputs, it defaults to `true`. If `recurrence_watch_mode` is not set, it falls back to the GatePolicy profile setting.