Runbook: Incident Log Operations
1. Initial Setup
JSONL backend (default)
No setup needed. Incidents are stored in `ops/incidents/`:
- `incidents.jsonl` — incident records
- `events.jsonl` — timeline events
- `artifacts.jsonl` — artifact metadata
Artifact files: ops/incidents/<incident_id>/ (md/json/txt files).
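Because the JSONL store is plain line-delimited JSON, it can be inspected directly. A minimal sketch with `jq` (the `id`, `status`, and `title` field names are assumptions about the record schema; verify against a real record first):

```shell
# Print id and title of every open incident in the JSONL store.
# Assumes one JSON object per line with id/status/title fields.
jq -r 'select(.status == "open") | "\(.id)\t\(.title)"' ops/incidents/incidents.jsonl
```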
Postgres backend
# Run idempotent migration
DATABASE_URL="postgresql://user:pass@host:5432/db" \
python3 ops/scripts/migrate_incidents_postgres.py
# Dry run (prints DDL only)
python3 ops/scripts/migrate_incidents_postgres.py --dry-run
Tables created: incidents, incident_events, incident_artifacts.
2. Agent Roles & Permissions
| Agent | Role | Incident access |
|---|---|---|
| sofiia | agent_cto | Full CRUD |
| helion | agent_oncall | Full CRUD |
| monitor | agent_monitor | Read only |
| aistalk | agent_interface | Read only |
| others | agent_default | Read only |
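A quick way to verify the matrix is enforced is to attempt a write as a read-only agent and confirm it is rejected. A hedged sketch; the gateway endpoint shape follows the curl examples elsewhere in this runbook, and the exact error response is an assumption:

```shell
# A read-only agent (monitor) attempting incident_create should be denied,
# not create an incident. Inspect the response for a permission error.
curl -s -X POST http://gateway/v1/tools/oncall_tool \
  -H "X-API-Key: $GATEWAY_API_KEY" \
  -d '{"action":"incident_create","agent_id":"monitor","params":{"service":"gateway","severity":"P2","title":"rbac smoke test"}}'
```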
3. Common Operations
Create incident manually (via tool)
{
"tool": "oncall_tool",
"action": "incident_create",
"params": {
"service": "gateway",
"severity": "P1",
"title": "Gateway 5xx rate >5%",
"env": "prod",
"started_at": "2026-02-23T10:00:00Z"
},
"agent_id": "sofiia"
}
Generate postmortem
curl -X POST http://supervisor:8000/v1/graphs/postmortem_draft/runs \
-H "Content-Type: application/json" \
-d '{"agent_id":"sofiia","input":{"incident_id":"inc_..."}}'
List open incidents
{
"tool": "oncall_tool",
"action": "incident_list",
"params": { "status": "open", "limit": 20 }
}
4. Troubleshooting
Artifacts not writing
- Check the `INCIDENT_ARTIFACTS_DIR` env var (or the default `ops/incidents/`).
- Check filesystem permissions (directory must be writable).
- Max artifact size: 2 MB. Only json/md/txt files are allowed.
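The first two checks above can be scripted; a minimal sketch, assuming only the `INCIDENT_ARTIFACTS_DIR` convention documented here:

```shell
# Confirm the artifacts directory exists and is writable.
dir="${INCIDENT_ARTIFACTS_DIR:-ops/incidents}"
if [ -d "$dir" ] && [ -w "$dir" ]; then
  echo "artifacts dir OK: $dir"
else
  echo "artifacts dir missing or not writable: $dir" >&2
fi
```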
Incident not found
- Verify the `incident_id` format: `inc_YYYYMMDD_HHMM_<rand>`.
- Check the correct backend is configured (`INCIDENT_BACKEND` env var).
- For JSONL: verify `ops/incidents/incidents.jsonl` exists and is not corrupt.
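The corruption check can be done with `jq`, which exits non-zero on the first unparseable line. A sketch (the `id` field name is an assumption about the record schema):

```shell
# Fail fast if any line in the store is not valid JSON.
jq -e 'has("id")' ops/incidents/incidents.jsonl > /dev/null \
  && echo "incidents.jsonl parses cleanly"
# Look up a single incident by id:
jq -c 'select(.id == "inc_20260218_1430_abc123")' ops/incidents/incidents.jsonl
```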
Postmortem graph fails
- Check supervisor logs: `docker logs sofiia-supervisor`.
- Verify the incident exists: `oncall_tool.incident_get`.
- Check the gateway is reachable from the supervisor.
- Run `GET /v1/runs/<run_id>` to see graph status and error.
5. Backup & Retention
JSONL
# Backup
cp -r ops/incidents/ /backup/incidents-$(date +%F)/
# Retention: manual cleanup of closed incidents older than N days
# (Not automated yet; add to future audit_cleanup scope)
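The manual cleanup mentioned above can be sketched as a `jq` filter that keeps everything except closed incidents older than a cutoff. The `status` and `started_at` field names are assumptions about the record schema, and `date -d` is GNU-specific; verify both before running against real data:

```shell
# Keep open incidents, plus closed incidents newer than N days.
# ISO-8601 timestamps compare correctly as strings.
N=90
cutoff=$(date -u -d "-${N} days" +%Y-%m-%dT%H:%M:%SZ)
jq -c --arg cutoff "$cutoff" \
  'select(.status != "closed" or .started_at >= $cutoff)' \
  ops/incidents/incidents.jsonl > ops/incidents/incidents.jsonl.tmp \
  && mv ops/incidents/incidents.jsonl.tmp ops/incidents/incidents.jsonl
```

Take a backup (as above) before rewriting the store in place.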
Postgres
Standard pg_dump for incidents, incident_events, incident_artifacts tables.
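A concrete invocation, assuming `DATABASE_URL` is set as in the migration step above:

```shell
# Dump only the three incident tables in pg_dump's custom format.
pg_dump "$DATABASE_URL" \
  -t incidents -t incident_events -t incident_artifacts \
  -F c -f /backup/incidents-$(date +%F).dump
```

Restore with `pg_restore` against the same database.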
6. INCIDENT_BACKEND=auto
The incident store supports `INCIDENT_BACKEND=auto`, which tries Postgres first and falls back to JSONL:
# Set in environment:
INCIDENT_BACKEND=auto
DATABASE_URL=postgresql://user:pass@localhost:5432/daarion
# Behaviour:
# - Primary: PostgresIncidentStore
# - Fallback: JsonlIncidentStore (on connection failure)
# - Recovery: re-attempts Postgres after 5 minutes
Use `INCIDENT_BACKEND=postgres` for Postgres-only operation (fails if the DB is down) or `INCIDENT_BACKEND=jsonl` for file-only.
7. Follow-up Tracking
Follow-ups are incident_append_event entries with type=followup and structured meta:
# Check overdue follow-ups for a service:
curl -X POST http://gateway/v1/tools/oncall_tool -d '{
"action": "incident_followups_summary",
"service": "gateway",
"env": "prod",
"window_days": 30
}'
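Recording a follow-up is an `incident_append_event` call with `type=followup`. A hedged sketch; the request shape mirrors the summary call above, but the `meta` field names (`owner`, `due`) are assumptions, see docs/incident/followups.md for the authoritative schema:

```shell
# Append a follow-up event to an existing incident (meta fields assumed).
curl -X POST http://gateway/v1/tools/oncall_tool -d '{
  "action": "incident_append_event",
  "incident_id": "inc_20260218_1430_abc123",
  "type": "followup",
  "meta": { "owner": "sofiia", "due": "2026-03-01" }
}'
```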
The `followup_watch` release gate uses this to warn (or block, in staging/prod strict mode) about open P0/P1 incidents and overdue follow-ups. See docs/incident/followups.md.
8. Monitoring
- Check `/healthz` on the supervisor.
- Monitor `ops/incidents/` directory size (JSONL backend).
- Daily: review `incident_list status=open` for stale incidents.
- Weekly: review `incident_followups_summary` for overdue items.
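The directory-size check above is easy to wire into cron; a minimal sketch (the 50 MB threshold is an arbitrary assumption, tune it to your retention policy):

```shell
# Warn when the JSONL incident store grows past a threshold.
max_kb=51200  # 50 MB
size_kb=$(du -sk ops/incidents | cut -f1)
if [ "$size_kb" -gt "$max_kb" ]; then
  echo "ops/incidents is ${size_kb} KB; consider archiving closed incidents" >&2
fi
```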
9. Weekly Incident Intelligence Digest
The weekly_incident_digest scheduled job runs every Monday at 08:00 UTC and produces:
- `ops/reports/incidents/weekly/YYYY-WW.json` — full structured data
- `ops/reports/incidents/weekly/YYYY-WW.md` — markdown report for review
Manual run
# Via job orchestrator
curl -X POST http://gateway/v1/tools/jobs \
-H "X-API-Key: $GATEWAY_API_KEY" \
-d '{"action":"start_task","params":{"task_id":"weekly_incident_digest","inputs":{}}}'
# Direct tool call (CTO/oncall only)
curl -X POST http://gateway/v1/tools/incident_intelligence_tool \
-H "X-API-Key: $GATEWAY_API_KEY" \
-d '{"action":"weekly_digest","save_artifacts":true}'
Correlating a specific incident
curl -X POST http://gateway/v1/tools/incident_intelligence_tool \
-H "X-API-Key: $GATEWAY_API_KEY" \
-d '{"action":"correlate","incident_id":"inc_20260218_1430_abc123","append_note":true}'
Recurrence analysis
curl -X POST http://gateway/v1/tools/incident_intelligence_tool \
-H "X-API-Key: $GATEWAY_API_KEY" \
-d '{"action":"recurrence","window_days":7}'
Digest location
Reports accumulate in `ops/reports/incidents/weekly/`. Retention follows the standard `audit_jsonl_days` setting or manual cleanup.
See also: docs/incident/intelligence.md for policy tuning and scoring details.
Scheduler Wiring: cron vs task_registry
Alert triage loop (already active)
# ops/cron/alert_triage.cron — runs every 5 minutes
*/5 * * * * python3 /opt/daarion/ops/scripts/alert_triage_loop.py
This processes new alerts → creates/updates incidents → triggers escalation when needed.
Governance jobs (activated in ops/cron/jobs.cron)
The following jobs complement the triage loop by computing intelligence and generating artifacts that Sofiia can consume:
| Job | Schedule | Output |
|---|---|---|
| `hourly_risk_snapshot` | every hour | `risk_history_store` (Postgres or memory) |
| `daily_risk_digest` | 09:00 UTC | `ops/reports/risk/YYYY-MM-DD.{md,json}` |
| `weekly_platform_priority_digest` | Mon 06:00 UTC | `ops/reports/platform/YYYY-WW.{md,json}` |
| `weekly_backlog_generate` | Mon 06:20 UTC | `ops/backlog/items.jsonl` or Postgres |
Registering cron entries
# Deploy all governance cron jobs:
sudo cp ops/cron/jobs.cron /etc/cron.d/daarion-governance
sudo chmod 644 /etc/cron.d/daarion-governance
# Verify active entries:
grep -v "^#\|^$" /etc/cron.d/daarion-governance
Relationship between task_registry.yml and ops/cron/
ops/task_registry.yml is the canonical declaration of all scheduled jobs
(schedule, permissions, inputs, dry-run). ops/cron/jobs.cron is the physical
activation — what actually runs. They must be kept in sync.
Use run_governance_job.py --dry-run to test any job before enabling in cron.
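The sync requirement can also be checked mechanically by comparing declared jobs against the installed cron file. A sketch; it assumes job ids appear as `task_id: <name>` lines in `ops/task_registry.yml`, which is a guess about the registry format:

```shell
# Report any job declared in the registry that has no cron entry.
registry=ops/task_registry.yml
cronfile=/etc/cron.d/daarion-governance
for job in $(grep -oE 'task_id: *[A-Za-z0-9_]+' "$registry" | awk '{print $2}'); do
  grep -q "$job" "$cronfile" || echo "missing cron entry: $job"
done
```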