# Runbook: Incident Log Operations
## 1. Initial Setup
### JSONL backend (default)
No setup needed. Incidents are stored in `ops/incidents/`:
- `incidents.jsonl` — incident records
- `events.jsonl` — timeline events
- `artifacts.jsonl` — artifact metadata
Artifact files: `ops/incidents/<incident_id>/` (md/json/txt files).
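Because each file holds one JSON object per line, records can be read with a few lines of Python. A minimal sketch (the `incident_id` field name is taken from examples elsewhere in this runbook; tolerance for a truncated trailing line is an assumption about desired behaviour):

```python
import json
from pathlib import Path

def load_incidents(path="ops/incidents/incidents.jsonl"):
    """Parse one JSON object per line; skip blank or corrupt lines."""
    records = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # tolerate a truncated trailing line
    return records
```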
### Postgres backend
```bash
# Run idempotent migration
DATABASE_URL="postgresql://user:pass@host:5432/db" \
python3 ops/scripts/migrate_incidents_postgres.py
# Dry run (prints DDL only)
python3 ops/scripts/migrate_incidents_postgres.py --dry-run
```
Tables created: `incidents`, `incident_events`, `incident_artifacts`.
## 2. Agent Roles & Permissions
| Agent | Role | Incident access |
|-------|------|----------------|
| sofiia | agent_cto | Full CRUD |
| helion | agent_oncall | Full CRUD |
| monitor | agent_monitor | Read only |
| aistalk | agent_interface | Read only |
| others | agent_default | Read only |
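The table above reduces to a simple role lookup with a default. A sketch, assuming the real check lives behind `rbac_tools_matrix` (the helper name and dict shape here are hypothetical, not the actual API):

```python
# Role assignments mirror the table above; agent_default is the fallback.
ROLE_BY_AGENT = {
    "sofiia": "agent_cto",
    "helion": "agent_oncall",
    "monitor": "agent_monitor",
    "aistalk": "agent_interface",
}
WRITE_ROLES = {"agent_cto", "agent_oncall"}  # full CRUD; everyone else is read-only

def can_write_incidents(agent_id: str) -> bool:
    role = ROLE_BY_AGENT.get(agent_id, "agent_default")
    return role in WRITE_ROLES
```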
## 3. Common Operations
### Create incident manually (via tool)
```json
{
  "tool": "oncall_tool",
  "action": "incident_create",
  "params": {
    "service": "gateway",
    "severity": "P1",
    "title": "Gateway 5xx rate >5%",
    "env": "prod",
    "started_at": "2026-02-23T10:00:00Z"
  },
  "agent_id": "sofiia"
}
```
### Generate postmortem
```bash
curl -X POST http://supervisor:8000/v1/graphs/postmortem_draft/runs \
  -H "Content-Type: application/json" \
  -d '{"agent_id":"sofiia","input":{"incident_id":"inc_..."}}'
```
### List open incidents
```json
{
  "tool": "oncall_tool",
  "action": "incident_list",
  "params": { "status": "open", "limit": 20 }
}
```
## 4. Troubleshooting
### Artifacts not writing
- Check `INCIDENT_ARTIFACTS_DIR` env var (or default `ops/incidents/`).
- Check filesystem permissions (directory must be writable).
- Max artifact size: 2MB. Only json/md/txt allowed.
### Incident not found
- Verify `incident_id` format: `inc_YYYYMMDD_HHMM_<rand>`.
- Check the correct backend is configured (`INCIDENT_BACKEND` env var).
- For JSONL: verify `ops/incidents/incidents.jsonl` exists and is not corrupt.
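A quick format check for the ID can rule out typos before digging into the backend. A sketch, assuming the `<rand>` suffix is alphanumeric (the exact suffix alphabet is not specified above):

```python
import re

# inc_YYYYMMDD_HHMM_<rand>; the alphanumeric suffix is an assumption.
INCIDENT_ID_RE = re.compile(r"^inc_\d{8}_\d{4}_[A-Za-z0-9]+$")

def is_valid_incident_id(incident_id: str) -> bool:
    return bool(INCIDENT_ID_RE.match(incident_id))
```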
### Postmortem graph fails
1. Check supervisor logs: `docker logs sofiia-supervisor`.
2. Verify the incident exists: `oncall_tool.incident_get`.
3. Check gateway is reachable from supervisor.
4. Run `GET /v1/runs/<run_id>` to see graph status and error.
## 5. Backup & Retention
### JSONL
```bash
# Backup
cp -r ops/incidents/ /backup/incidents-$(date +%F)/
# Retention: manual cleanup of closed incidents older than N days
# (Not automated yet; add to future audit_cleanup scope)
```
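Until the cleanup is folded into `audit_cleanup`, a manual retention pass could look like the sketch below. The record field names (`status`, `closed_at`) are assumptions about the JSONL schema; verify against real records before running it on production data:

```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

def prune_closed(path, days=90):
    """Drop closed incidents older than `days`; keep everything else.
    Field names (`status`, `closed_at`) are assumed, not confirmed."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    kept = []
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue
        rec = json.loads(line)
        closed_at = rec.get("closed_at")
        if rec.get("status") != "closed" or not closed_at:
            kept.append(line)  # open or incomplete records are never pruned
        elif datetime.fromisoformat(closed_at.replace("Z", "+00:00")) >= cutoff:
            kept.append(line)  # closed recently enough to keep
    Path(path).write_text("\n".join(kept) + ("\n" if kept else ""))
    return len(kept)
```

Run it against a copy taken with the backup step above, not the live file.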
### Postgres
Standard pg_dump for `incidents`, `incident_events`, `incident_artifacts` tables.
## 6. INCIDENT_BACKEND=auto
The incident store supports `INCIDENT_BACKEND=auto`, which tries Postgres first and falls back to JSONL:
```bash
# Set in environment:
INCIDENT_BACKEND=auto
DATABASE_URL=postgresql://user:pass@localhost:5432/daarion
# Behaviour:
# - Primary: PostgresIncidentStore
# - Fallback: JsonlIncidentStore (on connection failure)
# - Recovery: re-attempts Postgres after 5 minutes
```
Use `INCIDENT_BACKEND=postgres` for Postgres-only (fails if DB is down) or `jsonl` for file-only.
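The fallback-and-recovery behaviour can be sketched as a thin wrapper over the two stores. The class and method names below are illustrative only (the real `PostgresIncidentStore`/`JsonlIncidentStore` interface is not shown in this runbook); the 5-minute retry window comes from the comment above:

```python
import time

class AutoIncidentStore:
    """Sketch of INCIDENT_BACKEND=auto: prefer Postgres, fall back to
    JSONL on connection failure, retry Postgres after 5 minutes."""
    RETRY_AFTER = 300  # seconds before re-attempting Postgres

    def __init__(self, postgres_store, jsonl_store, clock=time.monotonic):
        self.pg, self.jsonl, self.clock = postgres_store, jsonl_store, clock
        self.pg_down_since = None

    def _backend(self):
        if self.pg_down_since is not None:
            if self.clock() - self.pg_down_since < self.RETRY_AFTER:
                return self.jsonl
            self.pg_down_since = None  # window elapsed: retry Postgres
        return self.pg

    def create(self, record):
        backend = self._backend()
        try:
            return backend.create(record)
        except ConnectionError:
            if backend is self.pg:
                self.pg_down_since = self.clock()
                return self.jsonl.create(record)  # fall back for this call
            raise
```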
## 7. Follow-up Tracking
Follow-ups are `incident_append_event` entries with `type=followup` and structured meta:
```bash
# Check overdue follow-ups for a service:
curl -X POST http://gateway/v1/tools/oncall_tool -d '{
  "action": "incident_followups_summary",
  "service": "gateway",
  "env": "prod",
  "window_days": 30
}'
```
The `followup_watch` release gate uses this to warn (or block in staging/prod strict mode) about open P0/P1 incidents and overdue follow-ups. See `docs/incident/followups.md`.
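The "overdue" computation amounts to filtering `type=followup` events whose due date has passed and which are not yet done. A sketch, assuming the structured meta carries `due_at` and `status` fields (the exact meta shape is documented in `docs/incident/followups.md`, not here):

```python
from datetime import datetime, timezone

def overdue_followups(events, now=None):
    """Return open follow-up events past their due date.
    The meta fields (`due_at`, `status`) are assumptions."""
    now = now or datetime.now(timezone.utc)
    overdue = []
    for ev in events:
        if ev.get("type") != "followup":
            continue
        meta = ev.get("meta", {})
        if meta.get("status") == "done":
            continue
        due = meta.get("due_at")
        if due and datetime.fromisoformat(due.replace("Z", "+00:00")) < now:
            overdue.append(ev)
    return overdue
```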
## 8. Monitoring
- Check `/healthz` on supervisor.
- Monitor `ops/incidents/` directory size (JSONL backend).
- Daily: review `incident_list status=open` for stale incidents.
- Weekly: review `incident_followups_summary` for overdue items.
## 9. Weekly Incident Intelligence Digest
The `weekly_incident_digest` scheduled job runs every Monday at 08:00 UTC and produces:
- `ops/reports/incidents/weekly/YYYY-WW.json` — full structured data
- `ops/reports/incidents/weekly/YYYY-WW.md` — markdown report for review
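The `YYYY-WW` stem follows ISO year-week numbering (the job runs on Mondays, the start of an ISO week). A small sketch for locating a given week's reports; zero-padding the week number is an assumption about the naming:

```python
from datetime import date

def weekly_digest_paths(run_date, base="ops/reports/incidents/weekly"):
    """Map a run date to the digest's json/md paths via ISO year-week."""
    year, week, _ = run_date.isocalendar()
    stem = f"{base}/{year}-{week:02d}"  # zero-padded week is an assumption
    return f"{stem}.json", f"{stem}.md"
```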
### Manual run
```bash
# Via job orchestrator
curl -X POST http://gateway/v1/tools/jobs \
  -H "X-API-Key: $GATEWAY_API_KEY" \
  -d '{"action":"start_task","params":{"task_id":"weekly_incident_digest","inputs":{}}}'
# Direct tool call (CTO/oncall only)
curl -X POST http://gateway/v1/tools/incident_intelligence_tool \
  -H "X-API-Key: $GATEWAY_API_KEY" \
  -d '{"action":"weekly_digest","save_artifacts":true}'
```
### Correlating a specific incident
```bash
curl -X POST http://gateway/v1/tools/incident_intelligence_tool \
  -H "X-API-Key: $GATEWAY_API_KEY" \
  -d '{"action":"correlate","incident_id":"inc_20260218_1430_abc123","append_note":true}'
```
### Recurrence analysis
```bash
curl -X POST http://gateway/v1/tools/incident_intelligence_tool \
  -H "X-API-Key: $GATEWAY_API_KEY" \
  -d '{"action":"recurrence","window_days":7}'
```
### Digest location
Reports accumulate in `ops/reports/incidents/weekly/`. Retention follows the standard `audit_jsonl_days` policy or manual cleanup.
See also: `docs/incident/intelligence.md` for policy tuning and scoring details.
---
## Scheduler Wiring: cron vs task_registry
### Alert triage loop (already active)
```
# ops/cron/alert_triage.cron — runs every 5 minutes
*/5 * * * * python3 /opt/daarion/ops/scripts/alert_triage_loop.py
```
This processes `new` alerts → creates/updates incidents → triggers escalation when needed.
### Governance jobs (activated in ops/cron/jobs.cron)
The following jobs complement the triage loop by computing intelligence and
generating artifacts that Sofiia can consume:
| Job | Schedule | Output |
|-----|----------|--------|
| `hourly_risk_snapshot` | every hour | `risk_history_store` (Postgres or memory) |
| `daily_risk_digest` | 09:00 UTC | `ops/reports/risk/YYYY-MM-DD.{md,json}` |
| `weekly_platform_priority_digest` | Mon 06:00 UTC | `ops/reports/platform/YYYY-WW.{md,json}` |
| `weekly_backlog_generate` | Mon 06:20 UTC | `ops/backlog/items.jsonl` or Postgres |
### Registering cron entries
```bash
# Deploy all governance cron jobs:
sudo cp ops/cron/jobs.cron /etc/cron.d/daarion-governance
sudo chmod 644 /etc/cron.d/daarion-governance
# Verify active entries:
grep -v "^#\|^$" /etc/cron.d/daarion-governance
```
### Relationship between task_registry.yml and ops/cron/
`ops/task_registry.yml` is the **canonical declaration** of all scheduled jobs
(schedule, permissions, inputs, dry-run). `ops/cron/jobs.cron` is the **physical
activation** — what actually runs. They must be kept in sync.
Use `run_governance_job.py --dry-run` to test any job before enabling in cron.
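A drift check between the two files can be a one-liner in CI. The sketch below matches registry task ids against uncommented cron lines by substring; this is a simplification (real cron entries may pass the task id via a flag, and the registry parsing is left out since its exact schema isn't shown here):

```python
def missing_from_cron(registry_task_ids, cron_text):
    """Return task ids declared in the registry but absent from
    active (non-comment) lines of the cron file."""
    active = [
        line for line in cron_text.splitlines()
        if line.strip() and not line.strip().startswith("#")
    ]
    return [tid for tid in registry_task_ids
            if not any(tid in line for line in active)]
```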