# Runbook: Incident Log Operations

## 1. Initial Setup

### JSONL backend (default)

No setup needed. Incidents are stored in `ops/incidents/`:

- `incidents.jsonl` — incident records
- `events.jsonl` — timeline events
- `artifacts.jsonl` — artifact metadata

Artifact files live in `ops/incidents/<incident_id>/` (md/json/txt files).

### Postgres backend

```bash
# Run the idempotent migration
DATABASE_URL="postgresql://user:pass@host:5432/db" \
python3 ops/scripts/migrate_incidents_postgres.py

# Dry run (prints DDL only)
python3 ops/scripts/migrate_incidents_postgres.py --dry-run
```

Tables created: `incidents`, `incident_events`, `incident_artifacts`.
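
As a rough illustration of the JSONL layout, appending a record is a single `json.dumps` line. The field names below mirror the `incident_create` params used elsewhere in this runbook; the store's exact record schema is an assumption.

```python
# Sketch: appending one incident to the JSONL backend (append-only log).
# Field names are illustrative, not the store's confirmed schema.
import json
import tempfile
from pathlib import Path

def append_incident(path: Path, record: dict) -> None:
    """Append one incident as a single JSON line."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Demo against a temp dir; the real backend writes under ops/incidents/.
root = Path(tempfile.mkdtemp())
incidents = root / "incidents.jsonl"
append_incident(incidents, {
    "incident_id": "inc_20260223_1000_ab12cd",
    "service": "gateway",
    "severity": "P1",
    "status": "open",
    "started_at": "2026-02-23T10:00:00Z",
})
lines = incidents.read_text(encoding="utf-8").splitlines()
```

Because the file is append-only, concurrent writers only need an atomic append; updates are expressed as new lines rather than in-place edits.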
## 2. Agent Roles & Permissions

| Agent | Role | Incident access |
|---------|-----------------|-----------------|
| sofiia | agent_cto | Full CRUD |
| helion | agent_oncall | Full CRUD |
| monitor | agent_monitor | Read only |
| aistalk | agent_interface | Read only |
| others | agent_default | Read only |
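
The matrix above reduces to a simple lookup. The role names come from the table; `can_write_incidents` itself is a hypothetical helper, not the gateway's real permission check:

```python
# Sketch of the role matrix as a lookup table.
# ROLE_BY_AGENT / WRITE_ROLES reflect the table in this runbook;
# the helper function is illustrative only.
ROLE_BY_AGENT = {
    "sofiia": "agent_cto",
    "helion": "agent_oncall",
    "monitor": "agent_monitor",
    "aistalk": "agent_interface",
}
WRITE_ROLES = {"agent_cto", "agent_oncall"}

def can_write_incidents(agent_id: str) -> bool:
    """Full CRUD only for CTO and on-call; everyone else is read-only."""
    role = ROLE_BY_AGENT.get(agent_id, "agent_default")
    return role in WRITE_ROLES
```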
## 3. Common Operations

### Create incident manually (via tool)

```json
{
  "tool": "oncall_tool",
  "action": "incident_create",
  "params": {
    "service": "gateway",
    "severity": "P1",
    "title": "Gateway 5xx rate >5%",
    "env": "prod",
    "started_at": "2026-02-23T10:00:00Z"
  },
  "agent_id": "sofiia"
}
```
### Generate postmortem

```bash
curl -X POST http://supervisor:8000/v1/graphs/postmortem_draft/runs \
  -H "Content-Type: application/json" \
  -d '{"agent_id":"sofiia","input":{"incident_id":"inc_..."}}'
```
### List open incidents

```json
{
  "tool": "oncall_tool",
  "action": "incident_list",
  "params": { "status": "open", "limit": 20 }
}
```
## 4. Troubleshooting

### Artifacts not writing

- Check the `INCIDENT_ARTIFACTS_DIR` env var (or the default `ops/incidents/`).
- Check filesystem permissions (the directory must be writable).
- Max artifact size: 2 MB. Only json/md/txt files are allowed.
### Incident not found

- Verify the `incident_id` format: `inc_YYYYMMDD_HHMM_<rand>`.
- Check that the correct backend is configured (`INCIDENT_BACKEND` env var).
- For JSONL: verify `ops/incidents/incidents.jsonl` exists and is not corrupt.
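
The ID format can be checked mechanically. The exact alphabet and length of `<rand>` are assumptions here (lowercase alphanumerics of any length):

```python
# Sketch: validating the inc_YYYYMMDD_HHMM_<rand> format with a regex.
# The <rand> charset is an assumption, not the store's confirmed rule.
import re

INCIDENT_ID_RE = re.compile(r"^inc_\d{8}_\d{4}_[a-z0-9]+$")

def is_valid_incident_id(incident_id: str) -> bool:
    return bool(INCIDENT_ID_RE.match(incident_id))
```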
### Postmortem graph fails

1. Check supervisor logs: `docker logs sofiia-supervisor`.
2. Verify the incident exists: `oncall_tool.incident_get`.
3. Check that the gateway is reachable from the supervisor.
4. Run `GET /v1/runs/<run_id>` to see graph status and error.
## 5. Backup & Retention

### JSONL

```bash
# Backup
cp -r ops/incidents/ /backup/incidents-$(date +%F)/

# Retention: manual cleanup of closed incidents older than N days
# (not automated yet; add to the future audit_cleanup scope)
```
### Postgres

Use standard `pg_dump` for the `incidents`, `incident_events`, and `incident_artifacts` tables.
## 6. INCIDENT_BACKEND=auto

The incident store supports `INCIDENT_BACKEND=auto`, which tries Postgres first and falls back to JSONL:

```bash
# Set in the environment:
INCIDENT_BACKEND=auto
DATABASE_URL=postgresql://user:pass@localhost:5432/daarion

# Behaviour:
# - Primary: PostgresIncidentStore
# - Fallback: JsonlIncidentStore (on connection failure)
# - Recovery: re-attempts Postgres after 5 minutes
```

Use `INCIDENT_BACKEND=postgres` for Postgres-only (fails if the DB is down) or `INCIDENT_BACKEND=jsonl` for file-only.
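
The fallback behaviour can be sketched as follows. The real `PostgresIncidentStore` and `JsonlIncidentStore` classes are not shown in this runbook, so stand-in factories and an injected clock are assumptions:

```python
# Sketch of auto-backend selection: Postgres first, JSONL on connection
# failure, Postgres retried after 5 minutes. Illustrative only.
import time

RETRY_AFTER_S = 5 * 60  # re-attempt Postgres after 5 minutes

class AutoStore:
    def __init__(self, primary_factory, fallback, clock=time.monotonic):
        self._primary_factory = primary_factory  # may raise ConnectionError
        self._fallback = fallback
        self._clock = clock
        self._primary = None
        self._failed_at = None

    def store(self):
        """Return the Postgres store when reachable, else the JSONL fallback."""
        if self._primary is not None:
            return self._primary
        if self._failed_at is not None and \
                self._clock() - self._failed_at < RETRY_AFTER_S:
            return self._fallback  # still inside the retry window
        try:
            self._primary = self._primary_factory()
            return self._primary
        except ConnectionError:
            self._failed_at = self._clock()
            return self._fallback

# Demo with a fake clock and a factory that recovers on its 2nd attempt:
attempts = {"n": 0}
def pg_factory():
    attempts["n"] += 1
    if attempts["n"] < 2:
        raise ConnectionError("db down")
    return "postgres"

t = {"now": 0.0}
auto = AutoStore(pg_factory, "jsonl", clock=lambda: t["now"])
first = auto.store()    # connect fails -> jsonl
t["now"] = 10.0
second = auto.store()   # within retry window -> jsonl, no reconnect attempt
t["now"] = 400.0
third = auto.store()    # window passed -> postgres
```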
## 7. Follow-up Tracking

Follow-ups are `incident_append_event` entries with `type=followup` and structured meta:

```bash
# Check overdue follow-ups for a service:
curl -X POST http://gateway/v1/tools/oncall_tool -d '{
  "action": "incident_followups_summary",
  "service": "gateway",
  "env": "prod",
  "window_days": 30
}'
```

The `followup_watch` release gate uses this to warn (or block, in staging/prod strict mode) about open P0/P1 incidents and overdue follow-ups. See `docs/incident/followups.md`.
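
A minimal sketch of the overdue check, assuming the structured meta carries `due` and `done` fields (illustrative names, not confirmed by the tool's schema):

```python
# Sketch: find overdue follow-ups among timeline events.
# The "due"/"done" meta fields are assumptions for illustration.
from datetime import date

def overdue_followups(events: list[dict], today: date) -> list[dict]:
    out = []
    for ev in events:
        if ev.get("type") != "followup":
            continue
        meta = ev.get("meta", {})
        due = meta.get("due")
        if due and not meta.get("done") and date.fromisoformat(due) < today:
            out.append(ev)
    return out

events = [
    {"type": "followup", "meta": {"due": "2026-02-10", "done": False}},
    {"type": "followup", "meta": {"due": "2026-03-01", "done": False}},
    {"type": "note", "meta": {}},
]
late = overdue_followups(events, today=date(2026, 2, 23))
```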
## 8. Monitoring

- Check `/healthz` on the supervisor.
- Monitor the `ops/incidents/` directory size (JSONL backend).
- Daily: review `incident_list status=open` for stale incidents.
- Weekly: review `incident_followups_summary` for overdue items.
## 9. Weekly Incident Intelligence Digest

The `weekly_incident_digest` scheduled job runs every Monday at 08:00 UTC and produces:

- `ops/reports/incidents/weekly/YYYY-WW.json` — full structured data
- `ops/reports/incidents/weekly/YYYY-WW.md` — markdown report for review
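
Assuming the `YYYY-WW` names use ISO week numbering (an assumption about how the digest names its files), the report stem for a given date can be derived as:

```python
# Sketch: deriving a YYYY-WW report stem via ISO week numbering.
from datetime import date

def weekly_report_stem(d: date) -> str:
    year, week, _ = d.isocalendar()
    return f"{year}-{week:02d}"

stem = weekly_report_stem(date(2026, 2, 23))  # a Monday
```

Note that `isocalendar()` returns the ISO year, which can differ from the calendar year near year boundaries, so the stem should always be derived from it rather than from `d.year`.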
### Manual run

```bash
# Via the job orchestrator
curl -X POST http://gateway/v1/tools/jobs \
  -H "X-API-Key: $GATEWAY_API_KEY" \
  -d '{"action":"start_task","params":{"task_id":"weekly_incident_digest","inputs":{}}}'

# Direct tool call (CTO/oncall only)
curl -X POST http://gateway/v1/tools/incident_intelligence_tool \
  -H "X-API-Key: $GATEWAY_API_KEY" \
  -d '{"action":"weekly_digest","save_artifacts":true}'
```
### Correlating a specific incident

```bash
curl -X POST http://gateway/v1/tools/incident_intelligence_tool \
  -H "X-API-Key: $GATEWAY_API_KEY" \
  -d '{"action":"correlate","incident_id":"inc_20260218_1430_abc123","append_note":true}'
```
### Recurrence analysis

```bash
curl -X POST http://gateway/v1/tools/incident_intelligence_tool \
  -H "X-API-Key: $GATEWAY_API_KEY" \
  -d '{"action":"recurrence","window_days":7}'
```
### Digest location

Reports accumulate in `ops/reports/incidents/weekly/`. Retention follows the standard `audit_jsonl_days` setting or manual cleanup.

See also `docs/incident/intelligence.md` for policy tuning and scoring details.

---
## Scheduler Wiring: cron vs task_registry

### Alert triage loop (already active)

```
# ops/cron/alert_triage.cron — runs every 5 minutes
*/5 * * * * python3 /opt/daarion/ops/scripts/alert_triage_loop.py
```

This processes `new` alerts → creates/updates incidents → triggers escalation when needed.
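
One pass of that pipeline can be sketched roughly as below. The alert/incident shapes and the P0/P1 escalation rule are illustrative assumptions, not the loop's real implementation:

```python
# Sketch of one triage pass: route `new` alerts into per-service
# incidents and flag escalations. Shapes and rules are assumed.
def triage(alerts: list[dict], open_incidents: dict[str, dict]):
    """Return (created, escalations) after one pass over `new` alerts."""
    created, escalations = [], []
    for alert in alerts:
        if alert.get("status") != "new":
            continue
        service = alert["service"]
        incident = open_incidents.get(service)
        if incident is None:
            incident = {"service": service, "severity": alert["severity"],
                        "alerts": []}
            open_incidents[service] = incident
            created.append(incident)
        incident["alerts"].append(alert)
        if alert["severity"] in ("P0", "P1"):
            escalations.append(service)
        alert["status"] = "triaged"
    return created, escalations

alerts = [
    {"service": "gateway", "severity": "P1", "status": "new"},
    {"service": "gateway", "severity": "P2", "status": "new"},
    {"service": "voice", "severity": "P3", "status": "seen"},
]
made, esc = triage(alerts, open_incidents={})
```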
### Governance jobs (activated in ops/cron/jobs.cron)

The following jobs complement the triage loop by computing intelligence and generating artifacts that Sofiia can consume:

| Job | Schedule | Output |
|-----|----------|--------|
| `hourly_risk_snapshot` | every hour | `risk_history_store` (Postgres or memory) |
| `daily_risk_digest` | 09:00 UTC | `ops/reports/risk/YYYY-MM-DD.{md,json}` |
| `weekly_platform_priority_digest` | Mon 06:00 UTC | `ops/reports/platform/YYYY-WW.{md,json}` |
| `weekly_backlog_generate` | Mon 06:20 UTC | `ops/backlog/items.jsonl` or Postgres |
### Registering cron entries

```bash
# Deploy all governance cron jobs:
sudo cp ops/cron/jobs.cron /etc/cron.d/daarion-governance
sudo chmod 644 /etc/cron.d/daarion-governance

# Verify active entries:
grep -v "^#\|^$" /etc/cron.d/daarion-governance
```
### Relationship between task_registry.yml and ops/cron/

`ops/task_registry.yml` is the **canonical declaration** of all scheduled jobs (schedule, permissions, inputs, dry-run). `ops/cron/jobs.cron` is the **physical activation** — what actually runs. They must be kept in sync.

Use `run_governance_job.py --dry-run` to test any job before enabling it in cron.
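
A drift check between the two files could look like the sketch below. Both parsing rules are simplified assumptions about formats this runbook does not specify; a real check would parse the actual YAML and cron syntax:

```python
# Sketch: detect task_registry.yml entries missing from jobs.cron.
# Both file formats below are simplified, assumed shapes.
import re

def registry_task_ids(registry_text: str) -> set[str]:
    # assumes one "- id: <task_id>" line per declared job
    return set(re.findall(r"^\s*-\s*id:\s*(\S+)", registry_text, re.M))

def cron_task_ids(cron_text: str) -> set[str]:
    # assumes each active cron line ends with "--task <task_id>"
    return set(re.findall(r"--task\s+(\S+)\s*$", cron_text, re.M))

registry = """
- id: hourly_risk_snapshot
- id: daily_risk_digest
"""
cron = "0 * * * * python3 run_governance_job.py --task hourly_risk_snapshot\n"
missing_in_cron = registry_task_ids(registry) - cron_task_ids(cron)
```

Running this as a pre-deploy check would surface declared-but-inactive jobs before they silently fail to run.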