# Runbook — Engineering Backlog Bridge

**Service:** Engineering Backlog Bridge

**Owner:** CTO / Platform Engineering

**On-call:** oncall

---
## 1. Storage Backends

### 1.1 Default: Auto (Postgres → JSONL)

The `AutoBacklogStore` attempts Postgres on startup. If Postgres is
unavailable, it falls back to JSONL and retries every 5 minutes.

Check the active backend in the logs:

```
backlog_store: using PostgresBacklogStore
backlog_store: using JsonlBacklogStore
```
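The try-Postgres-then-fall-back behavior can be sketched as follows. This is illustrative only: the real implementation lives in `services.router.backlog_store`, so everything beyond the class names, log lines, and the 5-minute retry interval mentioned in this runbook is an assumption.

```python
# Sketch of the Auto fallback pattern (assumed structure, not the real code).
# Connectors are injected so the fallback can be exercised without a database.
import logging
import time

RETRY_INTERVAL_S = 5 * 60  # retry Postgres every 5 minutes while on fallback

class AutoStoreSketch:
    def __init__(self, connect_postgres, make_jsonl):
        self._connect_postgres = connect_postgres  # callable; raises on failure
        self._make_jsonl = make_jsonl
        self._fallback = False
        self._last_attempt = 0.0
        self._pick_backend()

    def _pick_backend(self):
        self._last_attempt = time.monotonic()
        try:
            self._backend = self._connect_postgres()
            self._fallback = False
            logging.info("backlog_store: using PostgresBacklogStore")
        except Exception:
            self._backend = self._make_jsonl()
            self._fallback = True
            logging.warning("backlog_store: using JsonlBacklogStore (fallback)")

    def backend(self):
        # While on the JSONL fallback, periodically retry Postgres.
        if self._fallback and time.monotonic() - self._last_attempt >= RETRY_INTERVAL_S:
            self._pick_backend()
        return self._backend
```

Injecting the connectors keeps the retry logic testable without a running Postgres.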
### 1.2 Switching backend

```bash
# Use JSONL only (no DB required)
export BACKLOG_BACKEND=jsonl

# Use Postgres
export BACKLOG_BACKEND=postgres
export BACKLOG_POSTGRES_DSN="postgresql://user:pass@host:5432/daarion"

# Tests only
export BACKLOG_BACKEND=memory
```
---

## 2. Postgres Migration

Run the migration once per environment. It is idempotent, so re-running is safe.

```bash
# Dry-run first
python3 ops/scripts/migrate_backlog_postgres.py \
  --dsn "postgresql://user:pass@host/daarion" \
  --dry-run

# Apply
python3 ops/scripts/migrate_backlog_postgres.py \
  --dsn "postgresql://user:pass@host/daarion"
```

Alternatively, set the `$BACKLOG_POSTGRES_DSN` or `$POSTGRES_DSN` environment variable instead of passing `--dsn`.

**Tables created:**

- `backlog_items` — `dedupe_key` UNIQUE constraint
- `backlog_events` — FK to `backlog_items` with CASCADE DELETE

**Indexes:** env+status, service, due_date, owner, category, item_id, ts.
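The idempotency typically comes from `IF NOT EXISTS` guards in the DDL. A minimal illustration, using `sqlite3` as a stand-in for Postgres (the real script targets Postgres, and the exact column list is assumed):

```python
# Idempotent DDL sketch: IF NOT EXISTS makes a second run a no-op.
# sqlite3 stands in for Postgres; columns beyond dedupe_key/status are assumed.
import sqlite3

DDL = [
    """CREATE TABLE IF NOT EXISTS backlog_items (
           id TEXT PRIMARY KEY,
           dedupe_key TEXT UNIQUE,
           status TEXT,
           updated_at TEXT
       )""",
    """CREATE TABLE IF NOT EXISTS backlog_events (
           item_id TEXT REFERENCES backlog_items(id) ON DELETE CASCADE,
           ts TEXT
       )""",
    "CREATE INDEX IF NOT EXISTS idx_items_status ON backlog_items(status)",
]

def migrate(conn):
    for stmt in DDL:
        conn.execute(stmt)
    conn.commit()

conn = sqlite3.connect(":memory:")
migrate(conn)
migrate(conn)  # safe to re-run: every statement is guarded
```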
---

## 3. Weekly Auto-generation

### 3.1 Automatic (scheduled)

`weekly_backlog_generate` runs every **Monday at 06:20 UTC** (20 minutes after
the weekly platform digest at 06:00 UTC). It is registered in `ops/task_registry.yml`.
### 3.2 Manual trigger

HTTP (admin only):

```bash
curl -X POST "https://router/v1/backlog/generate/weekly?env=prod"
```

Tool call:

```json
{
  "tool": "backlog_tool",
  "action": "auto_generate_weekly",
  "env": "prod"
}
```

### 3.3 Prerequisite

The latest `ops/reports/platform/YYYY-WW.json` must exist (it is produced by
`weekly_platform_priority_digest`). If it is missing, generation returns:

```json
{ "error": "No platform digest found. Run architecture_pressure_tool.digest first." }
```

Fix: trigger the platform digest manually with a tool call:

```json
{ "tool": "architecture_pressure_tool", "action": "digest", "env": "prod" }
```
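To check the prerequisite before triggering generation, you can compute the expected digest path for the current ISO week. The `YYYY-WW` naming follows this runbook; zero-padding of the week number is an assumption.

```python
# Compute the expected weekly digest path for a given date.
# ISO week numbering and the zero-padded week are assumptions.
import datetime
import pathlib

def digest_path(day: datetime.date) -> pathlib.Path:
    year, week, _ = day.isocalendar()
    return pathlib.Path("ops/reports/platform") / f"{year}-{week:02d}.json"

p = digest_path(datetime.date(2026, 2, 16))  # a Monday, ISO week 8
print(p)           # ops/reports/platform/2026-08.json
print(p.exists())  # check this before running weekly_backlog_generate
```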
---

## 4. Cleanup (Retention)

**Schedule:** Daily at 03:40 UTC.

Removes `done` / `canceled` items older than `retention_days` (default 180 days).

Manual cleanup via tool call:

```json
{
  "tool": "backlog_tool",
  "action": "cleanup",
  "retention_days": 180
}
```

For the JSONL backend, cleanup rewrites the file atomically.
For Postgres, it runs a `DELETE FROM backlog_items WHERE status IN ('done','canceled') AND updated_at < cutoff`.
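"Rewrites the file atomically" means the survivors are written to a temp file which is then renamed over the original, so readers never see a half-written file. A minimal sketch of that pattern (the real `JsonlBacklogStore` logic may differ; field names follow this runbook's schema):

```python
# Atomic JSONL compaction sketch: write survivors to a temp file in the same
# directory, then os.replace() it over the original (atomic on POSIX).
import json
import os
import tempfile

def compact(path: str, cutoff: str) -> int:
    """Drop done/canceled items updated before `cutoff`; return count removed."""
    removed = 0
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    with os.fdopen(fd, "w") as out, open(path) as src:
        for line in src:
            item = json.loads(line)
            if item.get("status") in ("done", "canceled") and item.get("updated_at", "") < cutoff:
                removed += 1
                continue
            out.write(line)
    os.replace(tmp, path)  # atomic swap: readers see old or new file, never a mix
    return removed
```

The temp file must live in the same directory as the target, because `os.replace` is only atomic within one filesystem.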
---

## 5. JSONL File Management

Files: `ops/backlog/items.jsonl`, `ops/backlog/events.jsonl`

The JSONL backend is **append-only**: updates append a new line, and reads use
last-write-wins per `id`. The file grows over time until `cleanup()` rewrites it.

### Check file size

```bash
wc -l ops/backlog/items.jsonl
ls -lh ops/backlog/items.jsonl
```

### Manual compaction (outside the cleanup schedule)

```bash
python3 -c "
from services.router.backlog_store import JsonlBacklogStore
s = JsonlBacklogStore()
deleted = s.cleanup(retention_days=30)
print(f'Removed {deleted} old items')
"
```
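The last-write-wins read described above can be sketched as follows (illustrative; the real store's field handling may differ):

```python
# Last-write-wins load: a later line for the same id overrides earlier ones,
# so a single scan of the append-only log yields the current state.
import json

def load_items(lines):
    items = {}
    for line in lines:
        if line.strip():
            rec = json.loads(line)
            items[rec["id"]] = rec  # later record wins
    return items

log = [
    '{"id": "bl_1", "status": "open"}',
    '{"id": "bl_1", "status": "done"}',
]
print(load_items(log)["bl_1"]["status"])  # → done
```

This is why the file can be compacted at any time: keeping only the final record per `id` preserves the visible state.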
---

## 6. Dashboard & Monitoring

```
GET /v1/backlog/dashboard?env=prod
```

Example response:

```json
{
  "total": 42,
  "status_counts": {"open": 18, "in_progress": 5, "blocked": 3, "done": 14, "canceled": 2},
  "priority_counts": {"P0": 1, "P1": 9, "P2": 22, "P3": 10},
  "overdue_count": 4,
  "overdue": [
    {"id": "bl_...", "service": "gateway", "priority": "P1", "due_date": "2026-02-10", ...}
  ],
  "top_services": [{"service": "gateway", "count": 5}, ...]
}
```

Recommended alert thresholds:

- `overdue_count > 5` → notify oncall
- `priority_counts.P0 > 0 AND overdue` → page CTO
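The thresholds above can be evaluated programmatically against the dashboard response. A sketch, using only the field names from the example response; the notify/page actions are placeholders for whatever alert transport you use:

```python
# Evaluate the recommended alert thresholds against a dashboard response.
# The returned strings are placeholders; wire them to your alert routing.
def evaluate(dash: dict) -> list[str]:
    alerts = []
    if dash.get("overdue_count", 0) > 5:
        alerts.append("notify oncall: overdue_count > 5")
    overdue_p0 = any(i.get("priority") == "P0" for i in dash.get("overdue", []))
    if dash.get("priority_counts", {}).get("P0", 0) > 0 and overdue_p0:
        alerts.append("page CTO: P0 item overdue")
    return alerts

sample = {"overdue_count": 6, "priority_counts": {"P0": 1},
          "overdue": [{"id": "bl_1", "priority": "P0"}]}
print(evaluate(sample))
```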
---

## 7. Troubleshooting

### Items not generated

1. Check that a platform digest exists: `ls ops/reports/platform/*.json`
2. Verify `generation.weekly_from_pressure_digest: true` in `config/backlog_policy.yml`
3. Check `max_items_per_run` — it may cap generation when many services match.

### Duplicate items across weeks

Normal — each week gets a new dedupe key (`...:YYYY-WW:...`). Items from
previous weeks remain unless closed. This is intentional: unresolved issues
accumulate visibility week over week.

### Postgres connection failures

Check `BACKLOG_POSTGRES_DSN`, network access, and that the migration has been run.
`AutoBacklogStore` falls back to JSONL and logs a warning.

### Wrong owner assigned

Check `config/backlog_policy.yml` → `ownership.overrides` and add or update
service-level overrides as needed. Note that re-running `auto_generate_weekly`
takes the title/meta update path only: the `owner` field is preserved on
existing items. For immediate correction, use `set_status` + `add_comment`,
or `upsert` with an explicit `owner`.
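The week-scoped dedupe key can be illustrated as follows. The `platform_backlog` prefix comes from `dedupe.key_prefix` in the policy file, and the `...:YYYY-WW:...` segment comes from this runbook; the exact segment order is an assumption.

```python
# Illustrative weekly dedupe key. A new ISO week yields a new key,
# which is why unresolved items reappear as fresh entries each week.
import datetime

def weekly_dedupe_key(service: str, category: str, day: datetime.date) -> str:
    year, week, _ = day.isocalendar()
    return f"platform_backlog:{service}:{category}:{year}-{week:02d}"

k1 = weekly_dedupe_key("gateway", "latency", datetime.date(2026, 2, 16))
k2 = weekly_dedupe_key("gateway", "latency", datetime.date(2026, 2, 23))
print(k1)         # platform_backlog:gateway:latency:2026-08
print(k1 != k2)   # → True: next week produces a distinct item
```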
---

## 8. Configuration Reference

`config/backlog_policy.yml` — key sections:

| Section | Key | Default | Description |
|---------|-----|---------|-------------|
| `defaults` | `retention_days` | 180 | Days to keep done/canceled items |
| `defaults` | `max_items_per_run` | 50 | Cap per generation run |
| `dedupe` | `key_prefix` | `platform_backlog` | Dedupe key prefix |
| `categories.*` | `priority` | varies | Default priority per category |
| `categories.*` | `due_days` | varies | Days until due, from creation |
| `generation` | `weekly_from_pressure_digest` | true | Enable weekly generation |
| `generation` | `daily_from_risk_digest` | false | Enable daily generation from risk digest |
| `ownership` | `default_owner` | oncall | Fallback owner |
| `ownership.overrides` | `{service}` | — | Per-service owner override |
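Putting the table together, a hypothetical skeleton of the file might look like this. Key names follow the table above; the category name and override value are illustrative only.

```yaml
# Illustrative skeleton of config/backlog_policy.yml; values may differ.
defaults:
  retention_days: 180
  max_items_per_run: 50

dedupe:
  key_prefix: platform_backlog

categories:
  reliability:          # category names are examples
    priority: P1
    due_days: 14

generation:
  weekly_from_pressure_digest: true
  daily_from_risk_digest: false

ownership:
  default_owner: oncall
  overrides:
    gateway: platform-team   # per-service override (example)
```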
---

## 9. Scheduler Wiring: cron vs task_registry

### Architecture

There are two sources of truth for scheduled jobs:

| File | Role |
|------|------|
| `ops/task_registry.yml` | **Declarative registry** — defines which jobs exist, their schedule, inputs, permissions, and dry-run behavior. Used for documentation, audits, and future scheduler integrations. |
| `ops/cron/jobs.cron` | **Active scheduler** — the physical cron entries that actually run jobs. Must be kept in sync with `task_registry.yml`. |
### How governance jobs are executed

All governance jobs use the universal runner:

```bash
python3 ops/scripts/run_governance_job.py \
  --tool <tool_name> \
  --action <action> \
  --params-json '<json>'
```

The runner POSTs to `POST /v1/tools/execute` on the router. The router applies RBAC
(agent_id=`scheduler`, which holds `tools.backlog.admin`, `tools.pressure.write`, and
`tools.risk.write` via the `scheduler` service account) and executes the tool.
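What the runner does can be sketched as below. The endpoint, agent identity, and default `ROUTER_URL` come from this runbook; the request body's field names are assumptions, not the runner's actual wire format.

```python
# Sketch of the governance runner: build a tool-execution payload and POST it
# to the router. Body field names are assumed; endpoint is from this runbook.
import json
import urllib.request

ROUTER_URL = "http://localhost:8000"

def build_payload(tool: str, action: str, params: dict, dry_run: bool = False) -> dict:
    return {"agent_id": "scheduler", "tool": tool, "action": action,
            "params": params, "dry_run": dry_run}

def execute(payload: dict) -> bytes:
    req = urllib.request.Request(
        f"{ROUTER_URL}/v1/tools/execute",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:  # raises on HTTP errors (e.g. 401)
        return resp.read()

payload = build_payload("backlog_tool", "auto_generate_weekly", {"env": "prod"})
```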
### Governance cron schedule

```
0  * * * *   hourly_risk_snapshot             (risk_history_tool.snapshot)
0  9 * * *   daily_risk_digest                (risk_history_tool.digest)
20 3 * * *   risk_history_cleanup             (risk_history_tool.cleanup)
0  6 * * 1   weekly_platform_priority_digest  (architecture_pressure_tool.digest)
20 6 * * 1   weekly_backlog_generate          (backlog_tool.auto_generate_weekly)
40 3 * * *   daily_backlog_cleanup            (backlog_tool.cleanup)
```
### Deployment

```bash
# 1. Copy the cron file to /etc/cron.d/
sudo cp ops/cron/jobs.cron /etc/cron.d/daarion-governance
sudo chmod 644 /etc/cron.d/daarion-governance

# 2. Edit REPO_ROOT and ROUTER_URL if needed
sudo nano /etc/cron.d/daarion-governance

# 3. Verify syntax
crontab -T /etc/cron.d/daarion-governance

# 4. Check logs
tail -f /var/log/daarion/risk_snapshot.log
tail -f /var/log/daarion/backlog_generate.log
```
### Dry-run testing

```bash
python3 ops/scripts/run_governance_job.py \
  --tool backlog_tool --action auto_generate_weekly \
  --params-json '{"env":"prod"}' \
  --dry-run
```
### Expected artifacts

After the first run:

- `ops/reports/risk/YYYY-MM-DD.md` and `.json` (daily risk digest)
- `ops/reports/platform/YYYY-WW.md` and `.json` (weekly platform digest)
- `ops/backlog/items.jsonl` (if `BACKLOG_BACKEND=jsonl`) or the Postgres `backlog_items` table

### Troubleshooting

| Symptom | Cause | Fix |
|---------|-------|-----|
| `Cannot reach http://localhost:8000` | Router not running or wrong `ROUTER_URL` | Check compose; set `ROUTER_URL` in the cron file header |
| `HTTP 401 from /v1/tools/execute` | Missing `SCHEDULER_API_KEY` | Set the env var or check auth config |
| `error: No platform digest found` | `weekly_backlog_generate` ran before `weekly_platform_priority_digest` | Fix cron timing (06:00 vs 06:20) or run the digest manually |
| Job output empty | Scheduler running but the tool silently skipped | Check tool policy (e.g. `weekly_from_pressure_digest: false`) |