Runbook: Incident Log Operations

1. Initial Setup

JSONL backend (default)

No setup is needed. Incidents are stored in ops/incidents/:

  • incidents.jsonl — incident records
  • events.jsonl — timeline events
  • artifacts.jsonl — artifact metadata

Artifact files: ops/incidents/<incident_id>/ (md/json/txt files).
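As a quick sanity check of the JSONL layout above, a minimal reader can be sketched like this. It assumes one JSON object per line with `incident_id` and `status` fields (field names are assumptions based on the records described in this runbook, not a confirmed schema):

```python
import json
from pathlib import Path

def load_open_incidents(path="ops/incidents/incidents.jsonl"):
    """Read the JSONL incident log and return open incidents.

    Assumes one JSON object per line with at least
    'incident_id' and 'status' fields (hypothetical names).
    """
    incidents = []
    p = Path(path)
    if not p.exists():
        return incidents
    for line in p.read_text().splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        rec = json.loads(line)
        if rec.get("status") == "open":
            incidents.append(rec)
    return incidents
```

Because JSONL is append-only, a partial trailing line after a crash is the main failure mode; a production reader would want to skip unparsable lines rather than raise.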

Postgres backend

# Run idempotent migration
DATABASE_URL="postgresql://user:pass@host:5432/db" \
  python3 ops/scripts/migrate_incidents_postgres.py

# Dry run (prints DDL only)
python3 ops/scripts/migrate_incidents_postgres.py --dry-run

Tables created: incidents, incident_events, incident_artifacts.

2. Agent Roles & Permissions

| Agent   | Role            | Incident access |
|---------|-----------------|-----------------|
| sofiia  | agent_cto       | Full CRUD       |
| helion  | agent_oncall    | Full CRUD       |
| monitor | agent_monitor   | Read only       |
| aistalk | agent_interface | Read only       |
| others  | agent_default   | Read only       |
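The access matrix above can be expressed as a small lookup for pre-flight checks. This is an illustrative sketch only; actual enforcement lives in the gateway's rbac_tools_matrix policy, and the string values here are assumptions:

```python
# Role -> access level, mirroring the table above (values are illustrative).
ROLE_ACCESS = {
    "agent_cto": "crud",
    "agent_oncall": "crud",
    "agent_monitor": "read",
    "agent_interface": "read",
    "agent_default": "read",
}

def can_write_incidents(role: str) -> bool:
    """Return True only for roles with full CRUD access.

    Unknown roles fall back to read-only, matching the
    'others -> agent_default -> Read only' row.
    """
    return ROLE_ACCESS.get(role, "read") == "crud"
```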

3. Common Operations

Create incident manually (via tool)

{
  "tool": "oncall_tool",
  "action": "incident_create",
  "params": {
    "service": "gateway",
    "severity": "P1",
    "title": "Gateway 5xx rate >5%",
    "env": "prod",
    "started_at": "2026-02-23T10:00:00Z"
  },
  "agent_id": "sofiia"
}

Generate postmortem

curl -X POST http://supervisor:8000/v1/graphs/postmortem_draft/runs \
  -H "Content-Type: application/json" \
  -d '{"agent_id":"sofiia","input":{"incident_id":"inc_..."}}'

List open incidents

{
  "tool": "oncall_tool",
  "action": "incident_list",
  "params": { "status": "open", "limit": 20 }
}

4. Troubleshooting

Artifacts not writing

  • Check INCIDENT_ARTIFACTS_DIR env var (or default ops/incidents/).
  • Check filesystem permissions (directory must be writable).
  • Max artifact size: 2MB. Only json/md/txt allowed.
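The size and extension limits above can be checked client-side before uploading. A minimal sketch (the limits come from this runbook; the function name is hypothetical):

```python
import os

ALLOWED_EXTENSIONS = {".json", ".md", ".txt"}
MAX_ARTIFACT_BYTES = 2 * 1024 * 1024  # 2MB limit from the runbook

def artifact_is_acceptable(filename: str, size_bytes: int) -> bool:
    """Pre-flight check mirroring the documented artifact limits:
    only json/md/txt extensions, at most 2MB."""
    _, ext = os.path.splitext(filename)
    return ext.lower() in ALLOWED_EXTENSIONS and size_bytes <= MAX_ARTIFACT_BYTES
```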

Incident not found

  • Verify incident_id format: inc_YYYYMMDD_HHMM_<rand>.
  • Check the correct backend is configured (INCIDENT_BACKEND env var).
  • For JSONL: verify ops/incidents/incidents.jsonl exists and is not corrupt.
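When debugging "incident not found", it helps to validate the id shape first. A sketch of the documented inc_YYYYMMDD_HHMM_&lt;rand&gt; format; the exact alphabet and length of the random suffix are a guess:

```python
import re

# Assumed from the documented format inc_YYYYMMDD_HHMM_<rand>;
# the charset/length of <rand> is a guess (alphanumeric here).
INCIDENT_ID_RE = re.compile(r"^inc_\d{8}_\d{4}_[A-Za-z0-9]+$")

def looks_like_incident_id(value: str) -> bool:
    """Cheap shape check before querying either backend."""
    return INCIDENT_ID_RE.match(value) is not None
```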

Postmortem graph fails

  1. Check supervisor logs: docker logs sofiia-supervisor.
  2. Verify the incident exists: oncall_tool.incident_get.
  3. Check gateway is reachable from supervisor.
  4. Run GET /v1/runs/<run_id> to see graph status and error.

5. Backup & Retention

JSONL

# Backup
cp -r ops/incidents/ /backup/incidents-$(date +%F)/

# Retention: manual cleanup of closed incidents older than N days
# (Not automated yet; add to future audit_cleanup scope)
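Until cleanup is automated, the retention rule described above ("closed incidents older than N days") can be sketched as a pure filter over the JSONL lines. Field names (`status`, `started_at`) are assumptions based on the create example earlier in this runbook:

```python
import json
from datetime import datetime, timedelta, timezone

def filter_recent_or_open(lines, retention_days=90, now=None):
    """Keep open incidents plus anything newer than the cutoff.

    Assumes each JSONL record carries 'status' and an ISO-8601
    'started_at' timestamp (field names are assumptions).
    Returns the lines to keep; the caller rewrites the file.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    kept = []
    for line in lines:
        rec = json.loads(line)
        started = datetime.fromisoformat(rec["started_at"].replace("Z", "+00:00"))
        if rec.get("status") == "open" or started >= cutoff:
            kept.append(line)
    return kept
```

Rewriting the file should be done atomically (write to a temp file, then rename) since the triage loop may append concurrently.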

Postgres

Standard pg_dump for incidents, incident_events, incident_artifacts tables.

6. INCIDENT_BACKEND=auto

The incident store supports INCIDENT_BACKEND=auto which tries Postgres first and falls back to JSONL:

# Set in environment:
INCIDENT_BACKEND=auto
DATABASE_URL=postgresql://user:pass@localhost:5432/daarion

# Behaviour:
# - Primary: PostgresIncidentStore
# - Fallback: JsonlIncidentStore (on connection failure)
# - Recovery: re-attempts Postgres after 5 minutes

Use INCIDENT_BACKEND=postgres for Postgres-only operation (fails if the DB is down) or INCIDENT_BACKEND=jsonl for file-only operation.
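The fallback-with-cooldown behaviour described above can be sketched as a wrapper store. Class and method names here are illustrative, not the real PostgresIncidentStore/JsonlIncidentStore implementation:

```python
import time

class AutoIncidentStore:
    """Sketch of INCIDENT_BACKEND=auto: try the primary (Postgres)
    store, fall back to JSONL on connection failure, and retry the
    primary after a cooldown. Names are illustrative."""

    RETRY_SECONDS = 5 * 60  # re-attempt Postgres after 5 minutes

    def __init__(self, primary, fallback, clock=time.monotonic):
        self.primary = primary
        self.fallback = fallback
        self.clock = clock
        self.failed_at = None  # when the primary last failed

    def _active(self):
        if self.failed_at is not None:
            if self.clock() - self.failed_at < self.RETRY_SECONDS:
                return self.fallback
            self.failed_at = None  # cooldown elapsed: retry primary
        return self.primary

    def create(self, record):
        store = self._active()
        try:
            return store.create(record)
        except ConnectionError:
            if store is self.primary:
                self.failed_at = self.clock()
                return self.fallback.create(record)
            raise  # fallback itself failed: surface the error
```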

7. Follow-up Tracking

Follow-ups are incident_append_event entries with type=followup and structured meta:

# Check overdue follow-ups for a service:
curl -X POST http://gateway/v1/tools/oncall_tool -d '{
  "action": "incident_followups_summary",
  "service": "gateway",
  "env": "prod",
  "window_days": 30
}'

The followup_watch release gate uses this summary to warn about open P0/P1 incidents and overdue follow-ups (or to block, in staging/prod strict mode). See docs/incident/followups.md.
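The overdue check behind that summary can be sketched as a filter over timeline events. The `meta` field names (`due`, `done`) are assumptions based on the "structured meta" description above, not a confirmed schema:

```python
from datetime import datetime, timezone

def overdue_followups(events, now=None):
    """Filter incident events down to overdue, unfinished follow-ups.

    Assumes follow-up events look like
    {"type": "followup", "meta": {"due": "<ISO-8601>", "done": false}};
    the meta field names are assumptions.
    """
    now = now or datetime.now(timezone.utc)
    overdue = []
    for ev in events:
        if ev.get("type") != "followup":
            continue
        meta = ev.get("meta", {})
        if meta.get("done"):
            continue  # completed follow-ups never count as overdue
        due = datetime.fromisoformat(meta["due"])
        if due < now:
            overdue.append(ev)
    return overdue
```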

8. Monitoring

  • Check /healthz on supervisor.
  • Monitor ops/incidents/ directory size (JSONL backend).
  • Daily: review incident_list status=open for stale incidents.
  • Weekly: review incident_followups_summary for overdue items.
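For the directory-size check above, a minimal sketch (the function name and any alert threshold you pair it with are your own choices):

```python
from pathlib import Path

def dir_size_bytes(path="ops/incidents"):
    """Total size in bytes of the JSONL incident directory,
    suitable for a simple size alarm on the JSONL backend."""
    return sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file())
```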

9. Weekly Incident Intelligence Digest

The weekly_incident_digest scheduled job runs every Monday at 08:00 UTC and produces:

  • ops/reports/incidents/weekly/YYYY-WW.json — full structured data
  • ops/reports/incidents/weekly/YYYY-WW.md — markdown report for review
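The YYYY-WW naming above is most naturally read as the ISO week number; assuming that interpretation, the current report paths can be computed like this:

```python
from datetime import date

def weekly_report_paths(day=None):
    """Compute the YYYY-WW digest paths, assuming ISO week numbering
    (an interpretation of the documented naming, not confirmed)."""
    day = day or date.today()
    year, week, _ = day.isocalendar()
    stem = f"ops/reports/incidents/weekly/{year}-{week:02d}"
    return f"{stem}.json", f"{stem}.md"
```

Note that `isocalendar()` can return a year different from the calendar year near January 1, which matters if you build retention tooling around these paths.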

Manual run

# Via job orchestrator
curl -X POST http://gateway/v1/tools/jobs \
  -H "X-API-Key: $GATEWAY_API_KEY" \
  -d '{"action":"start_task","params":{"task_id":"weekly_incident_digest","inputs":{}}}'

# Direct tool call (CTO/oncall only)
curl -X POST http://gateway/v1/tools/incident_intelligence_tool \
  -H "X-API-Key: $GATEWAY_API_KEY" \
  -d '{"action":"weekly_digest","save_artifacts":true}'

Correlating a specific incident

curl -X POST http://gateway/v1/tools/incident_intelligence_tool \
  -H "X-API-Key: $GATEWAY_API_KEY" \
  -d '{"action":"correlate","incident_id":"inc_20260218_1430_abc123","append_note":true}'

Recurrence analysis

curl -X POST http://gateway/v1/tools/incident_intelligence_tool \
  -H "X-API-Key: $GATEWAY_API_KEY" \
  -d '{"action":"recurrence","window_days":7}'

Digest location

Reports accumulate in ops/reports/incidents/weekly/. Retention follows the standard audit_jsonl_days policy, or manual cleanup.

See also: docs/incident/intelligence.md for policy tuning and scoring details.


Scheduler Wiring: cron vs task_registry

Alert triage loop (already active)

# ops/cron/alert_triage.cron — runs every 5 minutes
*/5 * * * *  python3 /opt/daarion/ops/scripts/alert_triage_loop.py

This processes new alerts → creates/updates incidents → triggers escalation when needed.

Governance jobs (activated in ops/cron/jobs.cron)

The following jobs complement the triage loop by computing intelligence and generating artifacts that Sofiia can consume:

| Job | Schedule | Output |
|-----|----------|--------|
| hourly_risk_snapshot | every hour | risk_history_store (Postgres or memory) |
| daily_risk_digest | 09:00 UTC | ops/reports/risk/YYYY-MM-DD.{md,json} |
| weekly_platform_priority_digest | Mon 06:00 UTC | ops/reports/platform/YYYY-WW.{md,json} |
| weekly_backlog_generate | Mon 06:20 UTC | ops/backlog/items.jsonl or Postgres |

Registering cron entries

# Deploy all governance cron jobs:
sudo cp ops/cron/jobs.cron /etc/cron.d/daarion-governance
sudo chmod 644 /etc/cron.d/daarion-governance

# Verify active entries:
grep -v "^#\|^$" /etc/cron.d/daarion-governance

Relationship between task_registry.yml and ops/cron/

ops/task_registry.yml is the canonical declaration of all scheduled jobs (schedule, permissions, inputs, dry-run). ops/cron/jobs.cron is the physical activation — what actually runs. They must be kept in sync.

Use run_governance_job.py --dry-run to test any job before enabling in cron.
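A drift check between the two files can be sketched as a text comparison. This assumes each active cron line invokes run_governance_job.py with the task id as its last argument, which is a guess at the convention; adjust the parsing to the real command shape:

```python
def cron_task_ids(cron_text):
    """Extract task ids referenced in a cron file.

    Assumes active lines end with '... run_governance_job.py <task_id>'
    (a guess at the convention, not a confirmed format).
    """
    ids = set()
    for line in cron_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip comments and blanks
        if "run_governance_job.py" in line:
            ids.add(line.split()[-1])
    return ids

def missing_from_cron(registry_ids, cron_text):
    """Task ids declared in task_registry.yml but absent from cron."""
    return sorted(set(registry_ids) - cron_task_ids(cron_text))
```

Run this in CI (feeding it the task ids parsed from ops/task_registry.yml) so the canonical declaration and the physical activation cannot silently diverge.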