Runbook — Engineering Backlog Bridge
Service: Engineering Backlog Bridge
Owner: CTO / Platform Engineering
On-call: oncall
1. Storage Backends
1.1 Default: Auto (Postgres → JSONL)
The AutoBacklogStore attempts a Postgres connection on startup. If Postgres is
unavailable, it falls back to JSONL and retries the connection every 5 minutes.
Check the active backend in logs:
backlog_store: using PostgresBacklogStore
backlog_store: using JsonlBacklogStore
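The fallback-with-retry behavior can be sketched as follows. This is an illustrative sketch, not the actual AutoBacklogStore implementation; the class and method names here are made up:

```python
import time

RETRY_INTERVAL_S = 300  # retry the primary backend every 5 minutes

class AutoStoreSketch:
    """Illustrative: prefer a primary backend, fall back when it is unavailable."""

    def __init__(self, connect_primary, fallback):
        self._connect_primary = connect_primary  # callable that may raise
        self._fallback = fallback
        self._primary = None
        self._next_retry = 0.0

    def active(self):
        now = time.monotonic()
        if self._primary is None and now >= self._next_retry:
            try:
                self._primary = self._connect_primary()
            except Exception:
                # Primary unavailable: schedule the next retry, keep the fallback
                self._next_retry = now + RETRY_INTERVAL_S
        return self._primary if self._primary is not None else self._fallback
```

The key property is that reads and writes never block on the primary: callers always get a working backend, and the primary is re-probed at most once per interval.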
1.2 Switching backend
# Use JSONL only (no DB required)
export BACKLOG_BACKEND=jsonl
# Use Postgres
export BACKLOG_BACKEND=postgres
export BACKLOG_POSTGRES_DSN="postgresql://user:pass@host:5432/daarion"
# Tests only
export BACKLOG_BACKEND=memory
2. Postgres Migration
Run once per environment. Idempotent (safe to re-run).
# Dry-run first
python3 ops/scripts/migrate_backlog_postgres.py \
--dsn "postgresql://user:pass@host/daarion" \
--dry-run
# Apply
python3 ops/scripts/migrate_backlog_postgres.py \
--dsn "postgresql://user:pass@host/daarion"
Alternatively, use $BACKLOG_POSTGRES_DSN or $POSTGRES_DSN environment variables.
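The DSN resolution order (CLI flag first, then the two environment variables) can be sketched as below; the function name is illustrative:

```python
import os

def resolve_dsn(cli_dsn=None):
    """Pick the DSN: --dsn flag wins, then BACKLOG_POSTGRES_DSN, then POSTGRES_DSN."""
    return (
        cli_dsn
        or os.environ.get("BACKLOG_POSTGRES_DSN")
        or os.environ.get("POSTGRES_DSN")
    )
```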
Tables created:
- backlog_items — dedupe_key UNIQUE constraint
- backlog_events — FK to backlog_items with CASCADE DELETE
Indexes: env+status, service, due_date, owner, category, item_id, ts.
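The migration targets Postgres; as a rough illustration of its idempotent shape (CREATE ... IF NOT EXISTS, so re-runs are no-ops), here is a sketch using SQLite with a guessed minimal column set. The real schema lives in migrate_backlog_postgres.py:

```python
import sqlite3

# Illustrative schema only: the real column list is defined by the migration script.
DDL = [
    """CREATE TABLE IF NOT EXISTS backlog_items (
           id TEXT PRIMARY KEY,
           dedupe_key TEXT UNIQUE,
           status TEXT,
           updated_at TEXT
       )""",
    """CREATE TABLE IF NOT EXISTS backlog_events (
           event_id TEXT PRIMARY KEY,
           item_id TEXT REFERENCES backlog_items(id) ON DELETE CASCADE,
           ts TEXT
       )""",
    "CREATE INDEX IF NOT EXISTS idx_events_item ON backlog_events(item_id)",
]

def migrate(conn):
    for stmt in DDL:
        conn.execute(stmt)  # IF NOT EXISTS makes re-runs safe

conn = sqlite3.connect(":memory:")
migrate(conn)
migrate(conn)  # idempotent: the second run changes nothing
```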
3. Weekly Auto-generation
3.1 Automatic (scheduled)
weekly_backlog_generate runs every Monday at 06:20 UTC (20 min after
the weekly platform digest at 06:00 UTC). Registered in ops/task_registry.yml.
3.2 Manual trigger
# HTTP (admin only)
curl -X POST "https://router/v1/backlog/generate/weekly?env=prod"
# Tool call
{
"tool": "backlog_tool",
"action": "auto_generate_weekly",
"env": "prod"
}
3.3 Prerequisite
The latest ops/reports/platform/YYYY-WW.json must exist (produced by
weekly_platform_priority_digest). If it's missing, generation returns:
{ "error": "No platform digest found. Run architecture_pressure_tool.digest first." }
Fix:
# Trigger platform digest
{ "tool": "architecture_pressure_tool", "action": "digest", "env": "prod" }
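A quick prerequisite check can be scripted as below. The path pattern comes from the section above; the helper names are illustrative, and the assumption here is that the WW component is the ISO week number:

```python
from datetime import date
from pathlib import Path

def latest_digest_path(reports_dir="ops/reports/platform"):
    """Return the newest YYYY-WW.json digest, or None if none exist."""
    digests = sorted(Path(reports_dir).glob("*.json"))
    return digests[-1] if digests else None

def current_week_name(today=None):
    """ISO year-week label matching the YYYY-WW naming, e.g. '2026-07'."""
    today = today or date.today()
    year, week, _ = today.isocalendar()
    return f"{year}-{week:02d}"
```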
4. Cleanup (Retention)
Schedule: Daily at 03:40 UTC.
Removes done / canceled items older than retention_days (default 180d).
# Manual cleanup
{
"tool": "backlog_tool",
"action": "cleanup",
"retention_days": 180
}
For JSONL backend, cleanup rewrites the file atomically.
For Postgres, it runs a DELETE WHERE status IN ('done','canceled') AND updated_at < cutoff.
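The retention predicate both backends apply can be sketched as below (exact timestamp handling in the real store may differ):

```python
from datetime import datetime, timedelta, timezone

TERMINAL_STATUSES = ("done", "canceled")

def retention_cutoff(retention_days=180, now=None):
    """Items in a terminal status with updated_at older than this are removed."""
    now = now or datetime.now(timezone.utc)
    return now - timedelta(days=retention_days)

def should_delete(item, cutoff):
    return item["status"] in TERMINAL_STATUSES and item["updated_at"] < cutoff
```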
5. JSONL File Management
Files: ops/backlog/items.jsonl, ops/backlog/events.jsonl
The JSONL backend is append-only (updates append a new line; reads use
last-write-wins per id). The file grows over time until cleanup() rewrites it.
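The last-write-wins read described above can be sketched as:

```python
import json

def read_items(lines):
    """Replay an append-only JSONL stream; the last record per id wins."""
    items = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        items[rec["id"]] = rec  # later lines overwrite earlier ones
    return items
```

Compaction is then just writing `items.values()` back out as one line per id, which is why cleanup() shrinks the file.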
Check file size
wc -l ops/backlog/items.jsonl
ls -lh ops/backlog/items.jsonl
Manual compaction (outside cleanup schedule)
python3 -c "
from services.router.backlog_store import JsonlBacklogStore
s = JsonlBacklogStore()
deleted = s.cleanup(retention_days=30)
print(f'Removed {deleted} old items')
"
6. Dashboard & Monitoring
# HTTP
GET /v1/backlog/dashboard?env=prod
# Example response
{
"total": 42,
"status_counts": {"open": 18, "in_progress": 5, "blocked": 3, "done": 14, "canceled": 2},
"priority_counts": {"P0": 1, "P1": 9, "P2": 22, "P3": 10},
"overdue_count": 4,
"overdue": [
{"id": "bl_...", "service": "gateway", "priority": "P1", "due_date": "2026-02-10", ...}
],
"top_services": [{"service": "gateway", "count": 5}, ...]
}
Alert thresholds (recommended):
- overdue_count > 5 → notify oncall
- priority_counts.P0 > 0 AND overdue → page CTO
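Those thresholds can be wired into a small alert check; the function is illustrative, and the field names come from the dashboard response above:

```python
def evaluate_alerts(dashboard):
    """Apply the recommended thresholds to a /v1/backlog/dashboard response."""
    alerts = []
    if dashboard.get("overdue_count", 0) > 5:
        alerts.append(("notify", "oncall"))
    has_p0 = dashboard.get("priority_counts", {}).get("P0", 0) > 0
    if has_p0 and dashboard.get("overdue_count", 0) > 0:
        alerts.append(("page", "CTO"))
    return alerts
```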
7. Troubleshooting
Items not generated
- Check if platform digest exists: ls ops/reports/platform/*.json
- Verify generation.weekly_from_pressure_digest: true in config/backlog_policy.yml
- Check max_items_per_run — may cap generation if many services match.
Duplicate items across weeks
Normal — each week gets a new dedupe_key ...:YYYY-WW:.... Items from
previous weeks remain unless closed. This is intentional: unresolved issues
accumulate visibility week-over-week.
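How a per-week dedupe_key might be built: the exact format is elided above, so this sketch only illustrates the ISO-week component that changes between runs (prefix, field order, and separators are assumptions):

```python
from datetime import date

def weekly_dedupe_key(prefix, service, today=None):
    """Illustrative dedupe key with an ISO YYYY-WW component per week."""
    today = today or date.today()
    year, week, _ = today.isocalendar()
    return f"{prefix}:{service}:{year}-{week:02d}"
```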
Postgres connection failures
Check: BACKLOG_POSTGRES_DSN, network access, and that migration has been run.
The AutoBacklogStore will fall back to JSONL and log a warning.
Wrong owner assigned
Check config/backlog_policy.yml → ownership.overrides and add or update
service-level overrides as needed. Note that re-running auto_generate_weekly
does not fix existing items: the upsert only takes the title/meta update path
and preserves the owner field on items that already exist. For immediate
correction, use set_status + add_comment, or upsert with an explicit owner.
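Owner resolution from the policy can be sketched as below; the keys mirror ownership.default_owner and ownership.overrides:

```python
def resolve_owner(service, ownership):
    """Per-service override wins; otherwise fall back to default_owner."""
    overrides = ownership.get("overrides", {})
    return overrides.get(service, ownership.get("default_owner", "oncall"))
```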
8. Configuration Reference
config/backlog_policy.yml — key sections:
| Section | Key | Default | Description |
|---|---|---|---|
| defaults | retention_days | 180 | Days to keep done/canceled items |
| defaults | max_items_per_run | 50 | Cap per generation run |
| dedupe | key_prefix | platform_backlog | Dedupe key prefix |
| categories.* | priority | varies | Default priority per category |
| categories.* | due_days | varies | Days until due from creation |
| generation | weekly_from_pressure_digest | true | Enable weekly generation |
| generation | daily_from_risk_digest | false | Enable daily generation from risk |
| ownership | default_owner | oncall | Fallback owner |
| ownership.overrides | {service} | — | Per-service owner override |
9. Scheduler Wiring: cron vs task_registry
Architecture
There are two sources of truth for scheduled jobs:
| File | Role |
|---|---|
| ops/task_registry.yml | Declarative registry — defines what jobs exist, their schedule, inputs, permissions, and dry-run behavior. Used for documentation, audits, and future scheduler integrations. |
| ops/cron/jobs.cron | Active scheduler — physical cron entries that actually run jobs. Must be kept in sync with task_registry.yml. |
How governance jobs are executed
All governance jobs use the universal runner:
python3 ops/scripts/run_governance_job.py \
--tool <tool_name> \
--action <action> \
--params-json '<json>'
This sends a POST to /v1/tools/execute on the router. The router applies RBAC
(agent_id=scheduler, which has tools.backlog.admin + tools.pressure.write +
tools.risk.write via the scheduler service account) and executes the tool.
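The request body the runner assembles can be sketched as below. The field names are an assumption based on the CLI flags; check run_governance_job.py for the real shape:

```python
import json

def build_execute_payload(tool, action, params_json, dry_run=False):
    """Assemble an illustrative body for POST /v1/tools/execute."""
    payload = {
        "tool": tool,
        "action": action,
        "params": json.loads(params_json),  # --params-json arrives as a string
        "agent_id": "scheduler",
    }
    if dry_run:
        payload["dry_run"] = True
    return payload
```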
Governance cron schedule
0 * * * * hourly_risk_snapshot (risk_history_tool.snapshot)
0 9 * * * daily_risk_digest (risk_history_tool.digest)
20 3 * * * risk_history_cleanup (risk_history_tool.cleanup)
0 6 * * 1 weekly_platform_priority_digest (architecture_pressure_tool.digest)
20 6 * * 1 weekly_backlog_generate (backlog_tool.auto_generate_weekly)
40 3 * * * daily_backlog_cleanup (backlog_tool.cleanup)
Deployment
# 1. Copy cron file to /etc/cron.d/
sudo cp ops/cron/jobs.cron /etc/cron.d/daarion-governance
sudo chmod 644 /etc/cron.d/daarion-governance
# 2. Edit REPO_ROOT and ROUTER_URL if needed
sudo nano /etc/cron.d/daarion-governance
# 3. Verify syntax
crontab -T /etc/cron.d/daarion-governance
# 4. Check logs
tail -f /var/log/daarion/risk_snapshot.log
tail -f /var/log/daarion/backlog_generate.log
Dry-run testing
python3 ops/scripts/run_governance_job.py \
--tool backlog_tool --action auto_generate_weekly \
--params-json '{"env":"prod"}' \
--dry-run
Expected artifacts
After first run:
- ops/reports/risk/YYYY-MM-DD.md and .json (daily digest)
- ops/reports/platform/YYYY-WW.md and .json (weekly platform digest)
- ops/backlog/items.jsonl (if BACKLOG_BACKEND=jsonl) or the Postgres backlog_items table
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Cannot reach http://localhost:8000 | Router not running or wrong ROUTER_URL | Check compose, set ROUTER_URL in cron header |
| HTTP 401 from /v1/tools/execute | Missing SCHEDULER_API_KEY | Set env var or check auth config |
| error: No platform digest found | weekly_backlog_generate ran before weekly_platform_priority_digest | Fix cron timing (06:00 vs 06:20) or run digest manually |
| Job output empty | Scheduler running but tool silently skipped | Check tool policy (e.g. weekly_from_pressure_digest: false) |