# Runbook: Incident Log Operations
## 1. Initial Setup
### JSONL backend (default)
No setup needed. Incidents are stored in `ops/incidents/`:
- `incidents.jsonl` — incident records
- `events.jsonl` — timeline events
- `artifacts.jsonl` — artifact metadata
Artifact files: `ops/incidents/<incident_id>/` (md/json/txt files).
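Because each file holds one JSON object per line, records can be read with a few lines of Python. A minimal sketch (the `incident_id` field name is taken from examples elsewhere in this runbook; tolerance for a truncated trailing line is an assumption about desired behaviour):

```python
import json
from pathlib import Path

def load_incidents(path="ops/incidents/incidents.jsonl"):
    """Parse one JSON object per line; skip blank or corrupt lines."""
    records = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # tolerate a truncated trailing line
    return records
```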
### Postgres backend
```bash
# Run idempotent migration
DATABASE_URL="postgresql://user:pass@host:5432/db" \
python3 ops/scripts/migrate_incidents_postgres.py
# Dry run (prints DDL only)
python3 ops/scripts/migrate_incidents_postgres.py --dry-run
```
Tables created: `incidents`, `incident_events`, `incident_artifacts`.
## 2. Agent Roles & Permissions
| Agent | Role | Incident access |
|-------|------|----------------|
| sofiia | agent_cto | Full CRUD |
| helion | agent_oncall | Full CRUD |
| monitor | agent_monitor | Read only |
| aistalk | agent_interface | Read only |
| others | agent_default | Read only |
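The table above reduces to a simple role lookup with a default. A sketch, assuming the real check lives behind `rbac_tools_matrix` (the helper name and dict shape here are hypothetical, not the actual API):

```python
# Role assignments mirror the table above; agent_default is the fallback.
ROLE_BY_AGENT = {
    "sofiia": "agent_cto",
    "helion": "agent_oncall",
    "monitor": "agent_monitor",
    "aistalk": "agent_interface",
}
WRITE_ROLES = {"agent_cto", "agent_oncall"}  # full CRUD; everyone else is read-only

def can_write_incidents(agent_id: str) -> bool:
    role = ROLE_BY_AGENT.get(agent_id, "agent_default")
    return role in WRITE_ROLES
```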
## 3. Common Operations
### Create incident manually (via tool)
```json
{
  "tool": "oncall_tool",
  "action": "incident_create",
  "params": {
    "service": "gateway",
    "severity": "P1",
    "title": "Gateway 5xx rate >5%",
    "env": "prod",
    "started_at": "2026-02-23T10:00:00Z"
  },
  "agent_id": "sofiia"
}
```
### Generate postmortem
```bash
curl -X POST http://supervisor:8000/v1/graphs/postmortem_draft/runs \
  -H "Content-Type: application/json" \
  -d '{"agent_id":"sofiia","input":{"incident_id":"inc_..."}}'
```
### List open incidents
```json
{
  "tool": "oncall_tool",
  "action": "incident_list",
  "params": { "status": "open", "limit": 20 }
}
```
## 4. Troubleshooting
### Artifacts not writing
- Check `INCIDENT_ARTIFACTS_DIR` env var (or default `ops/incidents/`).
- Check filesystem permissions (directory must be writable).
- Max artifact size: 2MB. Only json/md/txt allowed.
### Incident not found
- Verify `incident_id` format: `inc_YYYYMMDD_HHMM_<rand>`.
- Check the correct backend is configured (`INCIDENT_BACKEND` env var).
- For JSONL: verify `ops/incidents/incidents.jsonl` exists and is not corrupt.
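A quick format check for the ID can rule out typos before digging into the backend. A sketch, assuming the `<rand>` suffix is alphanumeric (the exact suffix alphabet is not specified above):

```python
import re

# inc_YYYYMMDD_HHMM_<rand>; the alphanumeric suffix is an assumption.
INCIDENT_ID_RE = re.compile(r"^inc_\d{8}_\d{4}_[A-Za-z0-9]+$")

def is_valid_incident_id(incident_id: str) -> bool:
    return bool(INCIDENT_ID_RE.match(incident_id))
```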
### Postmortem graph fails
1. Check supervisor logs: `docker logs sofiia-supervisor`.
2. Verify the incident exists: `oncall_tool.incident_get`.
3. Check gateway is reachable from supervisor.
4. Run `GET /v1/runs/<run_id>` to see graph status and error.
## 5. Backup & Retention
### JSONL
```bash
# Backup
cp -r ops/incidents/ /backup/incidents-$(date +%F)/
# Retention: manual cleanup of closed incidents older than N days
# (Not automated yet; add to future audit_cleanup scope)
```
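Until the cleanup is folded into `audit_cleanup`, a manual retention pass could look like the sketch below. The record field names (`status`, `closed_at`) are assumptions about the JSONL schema; verify against real records before running it on production data:

```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

def prune_closed(path, days=90):
    """Drop closed incidents older than `days`; keep everything else.
    Field names (`status`, `closed_at`) are assumed, not confirmed."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    kept = []
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue
        rec = json.loads(line)
        closed_at = rec.get("closed_at")
        if rec.get("status") != "closed" or not closed_at:
            kept.append(line)  # open or incomplete records are never pruned
        elif datetime.fromisoformat(closed_at.replace("Z", "+00:00")) >= cutoff:
            kept.append(line)  # closed recently enough to keep
    Path(path).write_text("\n".join(kept) + ("\n" if kept else ""))
    return len(kept)
```

Run it against a copy taken with the backup step above, not the live file.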
### Postgres
Standard pg_dump for `incidents`, `incident_events`, `incident_artifacts` tables.
## 6. INCIDENT_BACKEND=auto
The incident store supports `INCIDENT_BACKEND=auto`, which tries Postgres first and falls back to JSONL:
```bash
# Set in environment:
INCIDENT_BACKEND=auto
DATABASE_URL=postgresql://user:pass@localhost:5432/daarion
# Behaviour:
# - Primary: PostgresIncidentStore
# - Fallback: JsonlIncidentStore (on connection failure)
# - Recovery: re-attempts Postgres after 5 minutes
```
Use `INCIDENT_BACKEND=postgres` for Postgres-only (fails if DB is down) or `jsonl` for file-only.
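The fallback-and-recovery behaviour can be sketched as a thin wrapper over the two stores. The class and method names below are illustrative only (the real `PostgresIncidentStore`/`JsonlIncidentStore` interface is not shown in this runbook); the 5-minute retry window comes from the comment above:

```python
import time

class AutoIncidentStore:
    """Sketch of INCIDENT_BACKEND=auto: prefer Postgres, fall back to
    JSONL on connection failure, retry Postgres after 5 minutes."""
    RETRY_AFTER = 300  # seconds before re-attempting Postgres

    def __init__(self, postgres_store, jsonl_store, clock=time.monotonic):
        self.pg, self.jsonl, self.clock = postgres_store, jsonl_store, clock
        self.pg_down_since = None

    def _backend(self):
        if self.pg_down_since is not None:
            if self.clock() - self.pg_down_since < self.RETRY_AFTER:
                return self.jsonl
            self.pg_down_since = None  # window elapsed: retry Postgres
        return self.pg

    def create(self, record):
        backend = self._backend()
        try:
            return backend.create(record)
        except ConnectionError:
            if backend is self.pg:
                self.pg_down_since = self.clock()
                return self.jsonl.create(record)  # fall back for this call
            raise
```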
## 7. Follow-up Tracking
Follow-ups are `incident_append_event` entries with `type=followup` and structured meta:
```bash
# Check overdue follow-ups for a service:
curl -X POST http://gateway/v1/tools/oncall_tool -d '{
  "action": "incident_followups_summary",
  "service": "gateway",
  "env": "prod",
  "window_days": 30
}'
```
The `followup_watch` release gate uses this to warn (or block in staging/prod strict mode) about open P0/P1 incidents and overdue follow-ups. See `docs/incident/followups.md`.
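The "overdue" computation amounts to filtering `type=followup` events whose due date has passed and which are not yet done. A sketch, assuming the structured meta carries `due_at` and `status` fields (the exact meta shape is documented in `docs/incident/followups.md`, not here):

```python
from datetime import datetime, timezone

def overdue_followups(events, now=None):
    """Return open follow-up events past their due date.
    The meta fields (`due_at`, `status`) are assumptions."""
    now = now or datetime.now(timezone.utc)
    overdue = []
    for ev in events:
        if ev.get("type") != "followup":
            continue
        meta = ev.get("meta", {})
        if meta.get("status") == "done":
            continue
        due = meta.get("due_at")
        if due and datetime.fromisoformat(due.replace("Z", "+00:00")) < now:
            overdue.append(ev)
    return overdue
```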
## 8. Monitoring
- Check `/healthz` on supervisor.
- Monitor `ops/incidents/` directory size (JSONL backend).
- Daily: review `incident_list status=open` for stale incidents.
- Weekly: review `incident_followups_summary` for overdue items.
## 9. Weekly Incident Intelligence Digest
The `weekly_incident_digest` scheduled job runs every Monday at 08:00 UTC and produces:
- `ops/reports/incidents/weekly/YYYY-WW.json` — full structured data
- `ops/reports/incidents/weekly/YYYY-WW.md` — markdown report for review
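The `YYYY-WW` stem follows ISO year-week numbering (the job runs on Mondays, the start of an ISO week). A small sketch for locating a given week's reports; zero-padding the week number is an assumption about the naming:

```python
from datetime import date

def weekly_digest_paths(run_date, base="ops/reports/incidents/weekly"):
    """Map a run date to the digest's json/md paths via ISO year-week."""
    year, week, _ = run_date.isocalendar()
    stem = f"{base}/{year}-{week:02d}"  # zero-padded week is an assumption
    return f"{stem}.json", f"{stem}.md"
```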
### Manual run
```bash
# Via job orchestrator
curl -X POST http://gateway/v1/tools/jobs \
  -H "X-API-Key: $GATEWAY_API_KEY" \
  -d '{"action":"start_task","params":{"task_id":"weekly_incident_digest","inputs":{}}}'
# Direct tool call (CTO/oncall only)
curl -X POST http://gateway/v1/tools/incident_intelligence_tool \
  -H "X-API-Key: $GATEWAY_API_KEY" \
  -d '{"action":"weekly_digest","save_artifacts":true}'
```
### Correlating a specific incident
```bash
curl -X POST http://gateway/v1/tools/incident_intelligence_tool \
  -H "X-API-Key: $GATEWAY_API_KEY" \
  -d '{"action":"correlate","incident_id":"inc_20260218_1430_abc123","append_note":true}'
```
### Recurrence analysis
```bash
curl -X POST http://gateway/v1/tools/incident_intelligence_tool \
  -H "X-API-Key: $GATEWAY_API_KEY" \
  -d '{"action":"recurrence","window_days":7}'
```
### Digest location
Reports accumulate in `ops/reports/incidents/weekly/`. Retention follows the standard `audit_jsonl_days` policy or manual cleanup.
See also: `docs/incident/intelligence.md` for policy tuning and scoring details.
---
## Scheduler Wiring: cron vs task_registry
### Alert triage loop (already active)
```
# ops/cron/alert_triage.cron — runs every 5 minutes
*/5 * * * * python3 /opt/daarion/ops/scripts/alert_triage_loop.py
```
This processes `new` alerts → creates/updates incidents → triggers escalation when needed.
### Governance jobs (activated in ops/cron/jobs.cron)
The following jobs complement the triage loop by computing intelligence and
generating artifacts that Sofiia can consume:
| Job | Schedule | Output |
|-----|----------|--------|
| `hourly_risk_snapshot` | every hour | `risk_history_store` (Postgres or memory) |
| `daily_risk_digest` | 09:00 UTC | `ops/reports/risk/YYYY-MM-DD.{md,json}` |
| `weekly_platform_priority_digest` | Mon 06:00 UTC | `ops/reports/platform/YYYY-WW.{md,json}` |
| `weekly_backlog_generate` | Mon 06:20 UTC | `ops/backlog/items.jsonl` or Postgres |
### Registering cron entries
```bash
# Deploy all governance cron jobs:
sudo cp ops/cron/jobs.cron /etc/cron.d/daarion-governance
sudo chmod 644 /etc/cron.d/daarion-governance
# Verify active entries:
grep -v "^#\|^$" /etc/cron.d/daarion-governance
```
### Relationship between task_registry.yml and ops/cron/
`ops/task_registry.yml` is the **canonical declaration** of all scheduled jobs
(schedule, permissions, inputs, dry-run). `ops/cron/jobs.cron` is the **physical
activation** — what actually runs. They must be kept in sync.
Use `run_governance_job.py --dry-run` to test any job before enabling in cron.
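A drift check between the two files can be a one-liner in CI. The sketch below matches registry task ids against uncommented cron lines by substring; this is a simplification (real cron entries may pass the task id via a flag, and the registry parsing is left out since its exact schema isn't shown here):

```python
def missing_from_cron(registry_task_ids, cron_text):
    """Return task ids declared in the registry but absent from
    active (non-comment) lines of the cron file."""
    active = [
        line for line in cron_text.splitlines()
        if line.strip() and not line.strip().startswith("#")
    ]
    return [tid for tid in registry_task_ids
            if not any(tid in line for line in active)]
```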