# Runbook: Incident Log Operations

## 1. Initial Setup

### JSONL backend (default)

No setup needed. Incidents are stored in `ops/incidents/`:

- `incidents.jsonl` — incident records
- `events.jsonl` — timeline events
- `artifacts.jsonl` — artifact metadata

Artifact files: `ops/incidents//` (md/json/txt files).

### Postgres backend

```bash
# Run the idempotent migration
DATABASE_URL="postgresql://user:pass@host:5432/db" \
  python3 ops/scripts/migrate_incidents_postgres.py

# Dry run (prints DDL only)
python3 ops/scripts/migrate_incidents_postgres.py --dry-run
```

Tables created: `incidents`, `incident_events`, `incident_artifacts`.

## 2. Agent Roles & Permissions

| Agent | Role | Incident access |
|-------|------|-----------------|
| sofiia | agent_cto | Full CRUD |
| helion | agent_oncall | Full CRUD |
| monitor | agent_monitor | Read only |
| aistalk | agent_interface | Read only |
| others | agent_default | Read only |

## 3. Common Operations

### Create incident manually (via tool)

```json
{
  "tool": "oncall_tool",
  "action": "incident_create",
  "params": {
    "service": "gateway",
    "severity": "P1",
    "title": "Gateway 5xx rate >5%",
    "env": "prod",
    "started_at": "2026-02-23T10:00:00Z"
  },
  "agent_id": "sofiia"
}
```

### Generate postmortem

```bash
curl -X POST http://supervisor:8000/v1/graphs/postmortem_draft/runs \
  -H "Content-Type: application/json" \
  -d '{"agent_id":"sofiia","input":{"incident_id":"inc_..."}}'
```

### List open incidents

```json
{
  "tool": "oncall_tool",
  "action": "incident_list",
  "params": {
    "status": "open",
    "limit": 20
  }
}
```

## 4. Troubleshooting

### Artifacts not writing

- Check the `INCIDENT_ARTIFACTS_DIR` env var (default: `ops/incidents/`).
- Check filesystem permissions (the directory must be writable).
- Max artifact size: 2 MB. Only json/md/txt files are allowed.

### Incident not found

- Verify the `incident_id` format: `inc_YYYYMMDD_HHMM_`.
- Check that the correct backend is configured (`INCIDENT_BACKEND` env var).
- For JSONL: verify that `ops/incidents/incidents.jsonl` exists and is not corrupt.

### Postmortem graph fails

1. Check the supervisor logs: `docker logs sofiia-supervisor`.
2. Verify that the incident exists: `oncall_tool.incident_get`.
3. Check that the gateway is reachable from the supervisor.
4. Run `GET /v1/runs/` to see graph status and error.

## 5. Backup & Retention

### JSONL

```bash
# Backup
cp -r ops/incidents/ /backup/incidents-$(date +%F)/

# Retention: manual cleanup of closed incidents older than N days
# (Not automated yet; add to future audit_cleanup scope)
```

### Postgres

Standard `pg_dump` of the `incidents`, `incident_events`, and `incident_artifacts` tables.

## 6. INCIDENT_BACKEND=auto

The incident store supports `INCIDENT_BACKEND=auto`, which tries Postgres first and falls back to JSONL:

```bash
# Set in environment:
INCIDENT_BACKEND=auto
DATABASE_URL=postgresql://user:pass@localhost:5432/daarion

# Behaviour:
# - Primary: PostgresIncidentStore
# - Fallback: JsonlIncidentStore (on connection failure)
# - Recovery: re-attempts Postgres after 5 minutes
```

Use `INCIDENT_BACKEND=postgres` for Postgres-only (fails if the DB is down) or `jsonl` for file-only.

## 7. Follow-up Tracking

Follow-ups are `incident_append_event` entries with `type=followup` and structured meta:

```bash
# Check overdue follow-ups for a service:
curl -X POST http://gateway/v1/tools/oncall_tool -d '{
  "action": "incident_followups_summary",
  "service": "gateway",
  "env": "prod",
  "window_days": 30
}'
```

The `followup_watch` release gate uses this to warn (or block, in staging/prod strict mode) about open P0/P1 incidents and overdue follow-ups. See `docs/incident/followups.md`.

## 8. Monitoring

- Check `/healthz` on the supervisor.
- Monitor the size of the `ops/incidents/` directory (JSONL backend).
- Daily: review `incident_list status=open` for stale incidents.
- Weekly: review `incident_followups_summary` for overdue items.

## 9. Weekly Incident Intelligence Digest

The `weekly_incident_digest` scheduled job runs every Monday at 08:00 UTC and produces:

- `ops/reports/incidents/weekly/YYYY-WW.json` — full structured data
- `ops/reports/incidents/weekly/YYYY-WW.md` — markdown report for review

### Manual run

```bash
# Via the job orchestrator
curl -X POST http://gateway/v1/tools/jobs \
  -H "X-API-Key: $GATEWAY_API_KEY" \
  -d '{"action":"start_task","params":{"task_id":"weekly_incident_digest","inputs":{}}}'

# Direct tool call (CTO/oncall only)
curl -X POST http://gateway/v1/tools/incident_intelligence_tool \
  -H "X-API-Key: $GATEWAY_API_KEY" \
  -d '{"action":"weekly_digest","save_artifacts":true}'
```

### Correlating a specific incident

```bash
curl -X POST http://gateway/v1/tools/incident_intelligence_tool \
  -H "X-API-Key: $GATEWAY_API_KEY" \
  -d '{"action":"correlate","incident_id":"inc_20260218_1430_abc123","append_note":true}'
```

### Recurrence analysis

```bash
curl -X POST http://gateway/v1/tools/incident_intelligence_tool \
  -H "X-API-Key: $GATEWAY_API_KEY" \
  -d '{"action":"recurrence","window_days":7}'
```

### Digest location

Reports accumulate in `ops/reports/incidents/weekly/`. Retention follows the standard `audit_jsonl_days` setting, or manual cleanup.

See also: `docs/incident/intelligence.md` for policy tuning and scoring details.

---

## Scheduler Wiring: cron vs task_registry

### Alert triage loop (already active)

```
# ops/cron/alert_triage.cron — runs every 5 minutes
*/5 * * * * python3 /opt/daarion/ops/scripts/alert_triage_loop.py
```

This processes `new` alerts → creates/updates incidents → triggers escalation when needed.
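The triage pass described above can be sketched as follows. This is an illustrative assumption of the flow, not the real data model of `alert_triage_loop.py`: the alert fields, status values, and the P0/P1 escalation rule here are hypothetical.

```python
# Hypothetical sketch of one triage pass. Assumes alerts are dicts with
# "status", "severity", "service", and "title" fields; the actual loop in
# ops/scripts/alert_triage_loop.py may use a different schema.

def triage_pass(alerts):
    """Process `new` alerts: open incidents for P0/P1, acknowledge the rest."""
    incidents = []
    for alert in alerts:
        if alert.get("status") != "new":
            continue  # only untriaged alerts are considered
        if alert.get("severity") in ("P0", "P1"):
            # High-severity alerts open an incident; P0 also flags escalation.
            incidents.append({
                "service": alert["service"],
                "severity": alert["severity"],
                "title": alert["title"],
                "escalate": alert["severity"] == "P0",
            })
            alert["status"] = "incident_opened"
        else:
            alert["status"] = "acknowledged"
    return incidents

alerts = [
    {"status": "new", "severity": "P1", "service": "gateway", "title": "5xx spike"},
    {"status": "new", "severity": "P3", "service": "gateway", "title": "slow query"},
]
opened = triage_pass(alerts)
print(len(opened))          # → 1 (only the P1 alert opens an incident)
print(alerts[1]["status"])  # → acknowledged
```

Running the pass every 5 minutes (as the cron entry does) keeps the alert queue drained without requiring a long-lived daemon.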
### Governance jobs (activated in ops/cron/jobs.cron)

The following jobs complement the triage loop by computing intelligence and generating artifacts that Sofiia can consume:

| Job | Schedule | Output |
|-----|----------|--------|
| `hourly_risk_snapshot` | every hour | `risk_history_store` (Postgres or memory) |
| `daily_risk_digest` | 09:00 UTC | `ops/reports/risk/YYYY-MM-DD.{md,json}` |
| `weekly_platform_priority_digest` | Mon 06:00 UTC | `ops/reports/platform/YYYY-WW.{md,json}` |
| `weekly_backlog_generate` | Mon 06:20 UTC | `ops/backlog/items.jsonl` or Postgres |

### Registering cron entries

```bash
# Deploy all governance cron jobs:
sudo cp ops/cron/jobs.cron /etc/cron.d/daarion-governance
sudo chmod 644 /etc/cron.d/daarion-governance

# Verify active entries:
grep -v "^#\|^$" /etc/cron.d/daarion-governance
```

### Relationship between task_registry.yml and ops/cron/

`ops/task_registry.yml` is the **canonical declaration** of all scheduled jobs (schedule, permissions, inputs, dry-run). `ops/cron/jobs.cron` is the **physical activation** — what actually runs. They must be kept in sync.

Use `run_governance_job.py --dry-run` to test any job before enabling it in cron.
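The "kept in sync" requirement can be spot-checked mechanically. The sketch below is a minimal, hypothetical drift check: it assumes task ids appear in the registry as `task_id: <name>` lines and appear literally in the cron file, which may not match the real file formats.

```python
# Hypothetical sync check between the canonical registry and the active cron
# file. The `task_id:` line format and literal-substring matching are
# simplifying assumptions for illustration.
import re

def declared_tasks(registry_text):
    """Extract task ids from registry lines like 'task_id: daily_risk_digest'."""
    return set(re.findall(r"task_id:\s*(\S+)", registry_text))

def active_tasks(cron_text, task_ids):
    """Return the subset of declared task ids that appear in the cron file."""
    return {t for t in task_ids if t in cron_text}

# Toy inputs standing in for ops/task_registry.yml and ops/cron/jobs.cron:
registry = "task_id: daily_risk_digest\ntask_id: weekly_backlog_generate\n"
cron = "0 9 * * * python3 run_governance_job.py daily_risk_digest\n"

ids = declared_tasks(registry)
missing = ids - active_tasks(cron, ids)
print(sorted(missing))  # → ['weekly_backlog_generate']
```

A check like this could run in CI so that a job declared in `task_registry.yml` but never wired into `ops/cron/jobs.cron` (or vice versa) is caught before deploy.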