# Runbook — Engineering Backlog Bridge

**Service:** Engineering Backlog Bridge
**Owner:** CTO / Platform Engineering
**On-call:** oncall

---

## 1. Storage Backends

### 1.1 Default: Auto (Postgres → JSONL)

The `AutoBacklogStore` attempts Postgres on startup. If Postgres is unavailable, it falls back to JSONL and retries every 5 minutes.

Check the active backend in the logs:

```
backlog_store: using PostgresBacklogStore
backlog_store: using JsonlBacklogStore
```

### 1.2 Switching backends

```bash
# Use JSONL only (no DB required)
export BACKLOG_BACKEND=jsonl

# Use Postgres
export BACKLOG_BACKEND=postgres
export BACKLOG_POSTGRES_DSN="postgresql://user:pass@host:5432/daarion"

# Tests only
export BACKLOG_BACKEND=memory
```

---

## 2. Postgres Migration

Run once per environment. Idempotent (safe to re-run).

```bash
# Dry-run first
python3 ops/scripts/migrate_backlog_postgres.py \
  --dsn "postgresql://user:pass@host/daarion" \
  --dry-run

# Apply
python3 ops/scripts/migrate_backlog_postgres.py \
  --dsn "postgresql://user:pass@host/daarion"
```

Alternatively, use the `$BACKLOG_POSTGRES_DSN` or `$POSTGRES_DSN` environment variables.

**Tables created:**

- `backlog_items` — `dedupe_key` UNIQUE constraint
- `backlog_events` — FK to `backlog_items` with CASCADE DELETE

**Indexes:** env+status, service, due_date, owner, category, item_id, ts.

---

## 3. Weekly Auto-generation

### 3.1 Automatic (scheduled)

`weekly_backlog_generate` runs every **Monday at 06:20 UTC** (20 minutes after the weekly platform digest at 06:00 UTC). Registered in `ops/task_registry.yml`.

### 3.2 Manual trigger

```bash
# HTTP (admin only)
curl -X POST "https://router/v1/backlog/generate/weekly?env=prod"

# Tool call
{ "tool": "backlog_tool", "action": "auto_generate_weekly", "env": "prod" }
```

### 3.3 Prerequisite

The latest `ops/reports/platform/YYYY-WW.json` must exist (produced by `weekly_platform_priority_digest`). If it is missing, generation returns:

```json
{ "error": "No platform digest found.
Run architecture_pressure_tool.digest first." }
```

Fix:

```bash
# Trigger platform digest
{ "tool": "architecture_pressure_tool", "action": "digest", "env": "prod" }
```

---

## 4. Cleanup (Retention)

**Schedule:** daily at 03:40 UTC. Removes `done` / `canceled` items older than `retention_days` (default 180 days).

```bash
# Manual cleanup
{ "tool": "backlog_tool", "action": "cleanup", "retention_days": 180 }
```

For the JSONL backend, cleanup rewrites the file atomically. For Postgres, it runs a `DELETE WHERE status IN ('done','canceled') AND updated_at < cutoff`.

---

## 5. JSONL File Management

Files: `ops/backlog/items.jsonl`, `ops/backlog/events.jsonl`

The JSONL backend is **append-only**: updates append a new line, and reads use last-write-wins per `id`. The file grows over time until `cleanup()` rewrites it.

### Check file size

```bash
wc -l ops/backlog/items.jsonl
ls -lh ops/backlog/items.jsonl
```

### Manual compaction (outside the cleanup schedule)

```bash
python3 -c "
from services.router.backlog_store import JsonlBacklogStore
s = JsonlBacklogStore()
deleted = s.cleanup(retention_days=30)
print(f'Removed {deleted} old items')
"
```

---

## 6. Dashboard & Monitoring

```bash
# HTTP
GET /v1/backlog/dashboard?env=prod

# Example response
{
  "total": 42,
  "status_counts": {"open": 18, "in_progress": 5, "blocked": 3, "done": 14, "canceled": 2},
  "priority_counts": {"P0": 1, "P1": 9, "P2": 22, "P3": 10},
  "overdue_count": 4,
  "overdue": [
    {"id": "bl_...", "service": "gateway", "priority": "P1", "due_date": "2026-02-10", ...}
  ],
  "top_services": [{"service": "gateway", "count": 5}, ...]
}
```

Alert thresholds (recommended):

- `overdue_count > 5` → notify oncall
- `priority_counts.P0 > 0 AND overdue` → page CTO

---

## 7. Troubleshooting

### Items not generated

1. Check whether the platform digest exists: `ls ops/reports/platform/*.json`
2. Verify `generation.weekly_from_pressure_digest: true` in `config/backlog_policy.yml`
3. Check `max_items_per_run` — it may cap generation when many services match.
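The three checks above can be scripted as a quick preflight. A minimal sketch in Python — the paths come from this runbook, but `check_generation_prereqs` is a hypothetical helper (not part of the codebase), and the policy check is a naive substring match rather than a real YAML parse:

```python
import glob
import os

def check_generation_prereqs(repo_root: str = ".") -> list:
    """Return human-readable problems that would block weekly generation."""
    problems = []

    # 1. The latest platform digest must exist
    #    (produced by weekly_platform_priority_digest).
    digests = sorted(glob.glob(os.path.join(repo_root, "ops/reports/platform/*.json")))
    if not digests:
        problems.append("No platform digest found - run architecture_pressure_tool.digest")

    # 2. Weekly generation must be enabled in the backlog policy.
    #    Naive substring match; a real check would parse the YAML.
    policy_path = os.path.join(repo_root, "config/backlog_policy.yml")
    try:
        with open(policy_path) as f:
            if "weekly_from_pressure_digest: true" not in f.read():
                problems.append("weekly_from_pressure_digest is not enabled in backlog_policy.yml")
    except FileNotFoundError:
        problems.append("Missing policy file: " + policy_path)

    return problems
```

Running it from the repo root and printing each returned problem gives a one-glance answer to "why did generation produce nothing?" before digging into `max_items_per_run`.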
### Duplicate items across weeks

Normal — each week gets a new dedupe key `...:YYYY-WW:...`. Items from previous weeks remain unless closed. This is intentional: unresolved issues accumulate visibility week over week.

### Postgres connection failures

Check `BACKLOG_POSTGRES_DSN`, network access, and that the migration has been run. The `AutoBacklogStore` will fall back to JSONL and log a warning.

### Wrong owner assigned

Check `config/backlog_policy.yml` → `ownership.overrides` and add or update service-level overrides as needed. Note that re-running `auto_generate_weekly` will not reassign existing items: the upsert takes the title/meta update path only, and the `owner` field is preserved on existing items. For immediate correction, use `set_status` + `add_comment`, or `upsert` with an explicit `owner`.

---

## 8. Configuration Reference

`config/backlog_policy.yml` — key sections:

| Section | Key | Default | Description |
|-------------------|-------------------------|---------|-------------|
| `defaults` | `retention_days` | 180 | Days to keep done/canceled items |
| `defaults` | `max_items_per_run` | 50 | Cap per generation run |
| `dedupe` | `key_prefix` | `platform_backlog` | Dedupe key prefix |
| `categories.*` | `priority` | varies | Default priority per category |
| `categories.*` | `due_days` | varies | Days until due from creation |
| `generation` | `weekly_from_pressure_digest` | true | Enable weekly generation |
| `generation` | `daily_from_risk_digest` | false | Enable daily generation from risk |
| `ownership` | `default_owner` | oncall | Fallback owner |
| `ownership.overrides` | `{service}` | — | Per-service owner override |

---

## 9. Scheduler Wiring: cron vs task_registry

### Architecture

There are two sources of truth for scheduled jobs:

| File | Role |
|------|------|
| `ops/task_registry.yml` | **Declarative registry** — defines which jobs exist, their schedule, inputs, permissions, and dry-run behavior. Used for documentation, audits, and future scheduler integrations. |
| `ops/cron/jobs.cron` | **Active scheduler** — the physical cron entries that actually run jobs. Must be kept in sync with `task_registry.yml`. |

### How governance jobs are executed

All governance jobs use the universal runner:

```bash
python3 ops/scripts/run_governance_job.py \
  --tool <tool> \
  --action <action> \
  --params-json '<json>'
```

This POSTs to `/v1/tools/execute` on the router. The router applies RBAC (agent_id=`scheduler`, which has `tools.backlog.admin` + `tools.pressure.write` + `tools.risk.write` via the `scheduler` service account) and executes the tool.

### Governance cron schedule

```
0  * * * *   hourly_risk_snapshot             (risk_history_tool.snapshot)
0  9 * * *   daily_risk_digest                (risk_history_tool.digest)
20 3 * * *   risk_history_cleanup             (risk_history_tool.cleanup)
0  6 * * 1   weekly_platform_priority_digest  (architecture_pressure_tool.digest)
20 6 * * 1   weekly_backlog_generate          (backlog_tool.auto_generate_weekly)
40 3 * * *   daily_backlog_cleanup            (backlog_tool.cleanup)
```

### Deployment

```bash
# 1. Copy the cron file to /etc/cron.d/
sudo cp ops/cron/jobs.cron /etc/cron.d/daarion-governance
sudo chmod 644 /etc/cron.d/daarion-governance

# 2. Edit REPO_ROOT and ROUTER_URL if needed
sudo nano /etc/cron.d/daarion-governance

# 3. Verify syntax
crontab -T /etc/cron.d/daarion-governance

# 4. Check logs
tail -f /var/log/daarion/risk_snapshot.log
tail -f /var/log/daarion/backlog_generate.log
```

### Dry-run testing

```bash
python3 ops/scripts/run_governance_job.py \
  --tool backlog_tool --action auto_generate_weekly \
  --params-json '{"env":"prod"}' \
  --dry-run
```

### Expected artifacts

After the first run:

- `ops/reports/risk/YYYY-MM-DD.md` and `.json` (daily digest)
- `ops/reports/platform/YYYY-WW.md` and `.json` (weekly platform digest)
- `ops/backlog/items.jsonl` (if `BACKLOG_BACKEND=jsonl`) or the Postgres `backlog_items` table

### Troubleshooting

| Symptom | Cause | Fix |
|---------|-------|-----|
| `Cannot reach http://localhost:8000` | Router not running or wrong `ROUTER_URL` | Check compose, set `ROUTER_URL` in cron header |
| `HTTP 401 from /v1/tools/execute` | Missing `SCHEDULER_API_KEY` | Set env var or check auth config |
| `error: No platform digest found` | `weekly_backlog_generate` ran before `weekly_platform_priority_digest` | Fix cron timing (06:00 vs 06:20) or run digest manually |
| Job output empty | Scheduler running but tool silently skipped | Check tool policy (e.g. `weekly_from_pressure_digest: false`) |
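For ad-hoc debugging without the runner script, the POST to `/v1/tools/execute` described above can be reproduced directly. A minimal Python sketch — the request body shape is inferred from this runbook's tool-call examples, and the bearer-token header is an assumption, so check `ops/scripts/run_governance_job.py` for the real contract:

```python
import json
import os
import urllib.request

ROUTER_URL = os.environ.get("ROUTER_URL", "http://localhost:8000")

def build_payload(tool: str, action: str, params: dict, dry_run: bool = False) -> dict:
    """Body for POST /v1/tools/execute (shape assumed from this runbook's examples)."""
    body = {"tool": tool, "action": action, "params": params}
    if dry_run:
        body["dry_run"] = True
    return body

def execute(tool: str, action: str, params: dict, dry_run: bool = False) -> dict:
    """Send one tool call to the router; raises urllib.error.HTTPError on 4xx/5xx."""
    req = urllib.request.Request(
        ROUTER_URL + "/v1/tools/execute",
        data=json.dumps(build_payload(tool, action, params, dry_run)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumed auth scheme; verify against the router's auth config.
            "Authorization": "Bearer " + os.environ.get("SCHEDULER_API_KEY", ""),
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

A dry-run call would then be `execute("backlog_tool", "auto_generate_weekly", {"env": "prod"}, dry_run=True)` — useful for reproducing the `HTTP 401` and `Cannot reach` symptoms from the table above in isolation.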