# Runbook — Engineering Backlog Bridge

**Service:** Engineering Backlog Bridge

**Owner:** CTO / Platform Engineering

**On-call:** oncall

---
## 1. Storage Backends

### 1.1 Default: Auto (Postgres → JSONL)

The `AutoBacklogStore` attempts Postgres on startup. If Postgres is
unavailable, it falls back to JSONL and retries every 5 minutes.

Check the active backend in the logs:

```
backlog_store: using PostgresBacklogStore
backlog_store: using JsonlBacklogStore
```
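The try-Postgres-then-fall-back behavior can be sketched as follows. This is illustrative only: the real implementation lives in `services.router.backlog_store`, so everything beyond the class names, log lines, and the 5-minute retry interval mentioned in this runbook is an assumption.

```python
# Sketch of the Auto fallback pattern (assumed structure, not the real code).
# Connectors are injected so the fallback can be exercised without a database.
import logging
import time

RETRY_INTERVAL_S = 5 * 60  # retry Postgres every 5 minutes while on fallback

class AutoStoreSketch:
    def __init__(self, connect_postgres, make_jsonl):
        self._connect_postgres = connect_postgres  # callable; raises on failure
        self._make_jsonl = make_jsonl
        self._fallback = False
        self._last_attempt = 0.0
        self._pick_backend()

    def _pick_backend(self):
        self._last_attempt = time.monotonic()
        try:
            self._backend = self._connect_postgres()
            self._fallback = False
            logging.info("backlog_store: using PostgresBacklogStore")
        except Exception:
            self._backend = self._make_jsonl()
            self._fallback = True
            logging.warning("backlog_store: using JsonlBacklogStore (fallback)")

    def backend(self):
        # While on the JSONL fallback, periodically retry Postgres.
        if self._fallback and time.monotonic() - self._last_attempt >= RETRY_INTERVAL_S:
            self._pick_backend()
        return self._backend
```

Injecting the connectors keeps the retry logic testable without a running Postgres.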
### 1.2 Switching backend

```bash
# Use JSONL only (no DB required)
export BACKLOG_BACKEND=jsonl

# Use Postgres
export BACKLOG_BACKEND=postgres
export BACKLOG_POSTGRES_DSN="postgresql://user:pass@host:5432/daarion"

# Tests only
export BACKLOG_BACKEND=memory
```
---

## 2. Postgres Migration

Run the migration once per environment. It is idempotent, so re-running is safe.

```bash
# Dry-run first
python3 ops/scripts/migrate_backlog_postgres.py \
  --dsn "postgresql://user:pass@host/daarion" \
  --dry-run

# Apply
python3 ops/scripts/migrate_backlog_postgres.py \
  --dsn "postgresql://user:pass@host/daarion"
```

Alternatively, set the `$BACKLOG_POSTGRES_DSN` or `$POSTGRES_DSN` environment variable instead of passing `--dsn`.

**Tables created:**

- `backlog_items` — `dedupe_key` UNIQUE constraint
- `backlog_events` — FK to `backlog_items` with CASCADE DELETE

**Indexes:** env+status, service, due_date, owner, category, item_id, ts.
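The idempotency typically comes from `IF NOT EXISTS` guards in the DDL. A minimal illustration, using `sqlite3` as a stand-in for Postgres (the real script targets Postgres, and the exact column list is assumed):

```python
# Idempotent DDL sketch: IF NOT EXISTS makes a second run a no-op.
# sqlite3 stands in for Postgres; columns beyond dedupe_key/status are assumed.
import sqlite3

DDL = [
    """CREATE TABLE IF NOT EXISTS backlog_items (
           id TEXT PRIMARY KEY,
           dedupe_key TEXT UNIQUE,
           status TEXT,
           updated_at TEXT
       )""",
    """CREATE TABLE IF NOT EXISTS backlog_events (
           item_id TEXT REFERENCES backlog_items(id) ON DELETE CASCADE,
           ts TEXT
       )""",
    "CREATE INDEX IF NOT EXISTS idx_items_status ON backlog_items(status)",
]

def migrate(conn):
    for stmt in DDL:
        conn.execute(stmt)
    conn.commit()

conn = sqlite3.connect(":memory:")
migrate(conn)
migrate(conn)  # safe to re-run: every statement is guarded
```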
---

## 3. Weekly Auto-generation

### 3.1 Automatic (scheduled)

`weekly_backlog_generate` runs every **Monday at 06:20 UTC** (20 minutes after
the weekly platform digest at 06:00 UTC). It is registered in `ops/task_registry.yml`.
### 3.2 Manual trigger

HTTP (admin only):

```bash
curl -X POST "https://router/v1/backlog/generate/weekly?env=prod"
```

Tool call:

```json
{
  "tool": "backlog_tool",
  "action": "auto_generate_weekly",
  "env": "prod"
}
```

### 3.3 Prerequisite

The latest `ops/reports/platform/YYYY-WW.json` must exist (it is produced by
`weekly_platform_priority_digest`). If it is missing, generation returns:

```json
{ "error": "No platform digest found. Run architecture_pressure_tool.digest first." }
```

Fix: trigger the platform digest manually with a tool call:

```json
{ "tool": "architecture_pressure_tool", "action": "digest", "env": "prod" }
```
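To check the prerequisite before triggering generation, you can compute the expected digest path for the current ISO week. The `YYYY-WW` naming follows this runbook; zero-padding of the week number is an assumption.

```python
# Compute the expected weekly digest path for a given date.
# ISO week numbering and the zero-padded week are assumptions.
import datetime
import pathlib

def digest_path(day: datetime.date) -> pathlib.Path:
    year, week, _ = day.isocalendar()
    return pathlib.Path("ops/reports/platform") / f"{year}-{week:02d}.json"

p = digest_path(datetime.date(2026, 2, 16))  # a Monday, ISO week 8
print(p)           # ops/reports/platform/2026-08.json
print(p.exists())  # check this before running weekly_backlog_generate
```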
---

## 4. Cleanup (Retention)

**Schedule:** Daily at 03:40 UTC.

Removes `done` / `canceled` items older than `retention_days` (default 180 days).

Manual cleanup via tool call:

```json
{
  "tool": "backlog_tool",
  "action": "cleanup",
  "retention_days": 180
}
```

For the JSONL backend, cleanup rewrites the file atomically.
For Postgres, it runs a `DELETE FROM backlog_items WHERE status IN ('done','canceled') AND updated_at < cutoff`.
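"Rewrites the file atomically" means the survivors are written to a temp file which is then renamed over the original, so readers never see a half-written file. A minimal sketch of that pattern (the real `JsonlBacklogStore` logic may differ; field names follow this runbook's schema):

```python
# Atomic JSONL compaction sketch: write survivors to a temp file in the same
# directory, then os.replace() it over the original (atomic on POSIX).
import json
import os
import tempfile

def compact(path: str, cutoff: str) -> int:
    """Drop done/canceled items updated before `cutoff`; return count removed."""
    removed = 0
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    with os.fdopen(fd, "w") as out, open(path) as src:
        for line in src:
            item = json.loads(line)
            if item.get("status") in ("done", "canceled") and item.get("updated_at", "") < cutoff:
                removed += 1
                continue
            out.write(line)
    os.replace(tmp, path)  # atomic swap: readers see old or new file, never a mix
    return removed
```

The temp file must live in the same directory as the target, because `os.replace` is only atomic within one filesystem.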
---

## 5. JSONL File Management

Files: `ops/backlog/items.jsonl`, `ops/backlog/events.jsonl`

The JSONL backend is **append-only**: updates append a new line, and reads use
last-write-wins per `id`. The file grows over time until `cleanup()` rewrites it.

### Check file size

```bash
wc -l ops/backlog/items.jsonl
ls -lh ops/backlog/items.jsonl
```

### Manual compaction (outside the cleanup schedule)

```bash
python3 -c "
from services.router.backlog_store import JsonlBacklogStore
s = JsonlBacklogStore()
deleted = s.cleanup(retention_days=30)
print(f'Removed {deleted} old items')
"
```
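The last-write-wins read described above can be sketched as follows (illustrative; the real store's field handling may differ):

```python
# Last-write-wins load: a later line for the same id overrides earlier ones,
# so a single scan of the append-only log yields the current state.
import json

def load_items(lines):
    items = {}
    for line in lines:
        if line.strip():
            rec = json.loads(line)
            items[rec["id"]] = rec  # later record wins
    return items

log = [
    '{"id": "bl_1", "status": "open"}',
    '{"id": "bl_1", "status": "done"}',
]
print(load_items(log)["bl_1"]["status"])  # → done
```

This is why the file can be compacted at any time: keeping only the final record per `id` preserves the visible state.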
---

## 6. Dashboard & Monitoring

```
GET /v1/backlog/dashboard?env=prod
```

Example response:

```json
{
  "total": 42,
  "status_counts": {"open": 18, "in_progress": 5, "blocked": 3, "done": 14, "canceled": 2},
  "priority_counts": {"P0": 1, "P1": 9, "P2": 22, "P3": 10},
  "overdue_count": 4,
  "overdue": [
    {"id": "bl_...", "service": "gateway", "priority": "P1", "due_date": "2026-02-10", ...}
  ],
  "top_services": [{"service": "gateway", "count": 5}, ...]
}
```

Recommended alert thresholds:

- `overdue_count > 5` → notify oncall
- `priority_counts.P0 > 0 AND overdue` → page CTO
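The thresholds above can be evaluated programmatically against the dashboard response. A sketch, using only the field names from the example response; the notify/page actions are placeholders for whatever alert transport you use:

```python
# Evaluate the recommended alert thresholds against a dashboard response.
# The returned strings are placeholders; wire them to your alert routing.
def evaluate(dash: dict) -> list[str]:
    alerts = []
    if dash.get("overdue_count", 0) > 5:
        alerts.append("notify oncall: overdue_count > 5")
    overdue_p0 = any(i.get("priority") == "P0" for i in dash.get("overdue", []))
    if dash.get("priority_counts", {}).get("P0", 0) > 0 and overdue_p0:
        alerts.append("page CTO: P0 item overdue")
    return alerts

sample = {"overdue_count": 6, "priority_counts": {"P0": 1},
          "overdue": [{"id": "bl_1", "priority": "P0"}]}
print(evaluate(sample))
```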
---

## 7. Troubleshooting

### Items not generated

1. Check that a platform digest exists: `ls ops/reports/platform/*.json`
2. Verify `generation.weekly_from_pressure_digest: true` in `config/backlog_policy.yml`
3. Check `max_items_per_run` — it may cap generation when many services match.

### Duplicate items across weeks

Normal — each week gets a new dedupe key (`...:YYYY-WW:...`). Items from
previous weeks remain unless closed. This is intentional: unresolved issues
accumulate visibility week over week.

### Postgres connection failures

Check `BACKLOG_POSTGRES_DSN`, network access, and that the migration has been run.
`AutoBacklogStore` falls back to JSONL and logs a warning.

### Wrong owner assigned

Check `config/backlog_policy.yml` → `ownership.overrides` and add or update
service-level overrides as needed. Note that re-running `auto_generate_weekly`
takes the title/meta update path only: the `owner` field is preserved on
existing items. For immediate correction, use `set_status` + `add_comment`,
or `upsert` with an explicit `owner`.
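The week-scoped dedupe key can be illustrated as follows. The `platform_backlog` prefix comes from `dedupe.key_prefix` in the policy file, and the `...:YYYY-WW:...` segment comes from this runbook; the exact segment order is an assumption.

```python
# Illustrative weekly dedupe key. A new ISO week yields a new key,
# which is why unresolved items reappear as fresh entries each week.
import datetime

def weekly_dedupe_key(service: str, category: str, day: datetime.date) -> str:
    year, week, _ = day.isocalendar()
    return f"platform_backlog:{service}:{category}:{year}-{week:02d}"

k1 = weekly_dedupe_key("gateway", "latency", datetime.date(2026, 2, 16))
k2 = weekly_dedupe_key("gateway", "latency", datetime.date(2026, 2, 23))
print(k1)         # platform_backlog:gateway:latency:2026-08
print(k1 != k2)   # → True: next week produces a distinct item
```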
---

## 8. Configuration Reference

`config/backlog_policy.yml` — key sections:

| Section | Key | Default | Description |
|---------|-----|---------|-------------|
| `defaults` | `retention_days` | 180 | Days to keep done/canceled items |
| `defaults` | `max_items_per_run` | 50 | Cap per generation run |
| `dedupe` | `key_prefix` | `platform_backlog` | Dedupe key prefix |
| `categories.*` | `priority` | varies | Default priority per category |
| `categories.*` | `due_days` | varies | Days until due, from creation |
| `generation` | `weekly_from_pressure_digest` | true | Enable weekly generation |
| `generation` | `daily_from_risk_digest` | false | Enable daily generation from risk digest |
| `ownership` | `default_owner` | oncall | Fallback owner |
| `ownership.overrides` | `{service}` | — | Per-service owner override |
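Putting the table together, a hypothetical skeleton of the file might look like this. Key names follow the table above; the category name and override value are illustrative only.

```yaml
# Illustrative skeleton of config/backlog_policy.yml; values may differ.
defaults:
  retention_days: 180
  max_items_per_run: 50

dedupe:
  key_prefix: platform_backlog

categories:
  reliability:          # category names are examples
    priority: P1
    due_days: 14

generation:
  weekly_from_pressure_digest: true
  daily_from_risk_digest: false

ownership:
  default_owner: oncall
  overrides:
    gateway: platform-team   # per-service override (example)
```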
---

## 9. Scheduler Wiring: cron vs task_registry

### Architecture

There are two sources of truth for scheduled jobs:

| File | Role |
|------|------|
| `ops/task_registry.yml` | **Declarative registry** — defines which jobs exist, their schedule, inputs, permissions, and dry-run behavior. Used for documentation, audits, and future scheduler integrations. |
| `ops/cron/jobs.cron` | **Active scheduler** — the physical cron entries that actually run jobs. Must be kept in sync with `task_registry.yml`. |
### How governance jobs are executed

All governance jobs use the universal runner:

```bash
python3 ops/scripts/run_governance_job.py \
  --tool <tool_name> \
  --action <action> \
  --params-json '<json>'
```

The runner POSTs to `POST /v1/tools/execute` on the router. The router applies RBAC
(agent_id=`scheduler`, which holds `tools.backlog.admin`, `tools.pressure.write`, and
`tools.risk.write` via the `scheduler` service account) and executes the tool.
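What the runner does can be sketched as below. The endpoint, agent identity, and default `ROUTER_URL` come from this runbook; the request body's field names are assumptions, not the runner's actual wire format.

```python
# Sketch of the governance runner: build a tool-execution payload and POST it
# to the router. Body field names are assumed; endpoint is from this runbook.
import json
import urllib.request

ROUTER_URL = "http://localhost:8000"

def build_payload(tool: str, action: str, params: dict, dry_run: bool = False) -> dict:
    return {"agent_id": "scheduler", "tool": tool, "action": action,
            "params": params, "dry_run": dry_run}

def execute(payload: dict) -> bytes:
    req = urllib.request.Request(
        f"{ROUTER_URL}/v1/tools/execute",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:  # raises on HTTP errors (e.g. 401)
        return resp.read()

payload = build_payload("backlog_tool", "auto_generate_weekly", {"env": "prod"})
```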
### Governance cron schedule

```
0  * * * *   hourly_risk_snapshot             (risk_history_tool.snapshot)
0  9 * * *   daily_risk_digest                (risk_history_tool.digest)
20 3 * * *   risk_history_cleanup             (risk_history_tool.cleanup)
0  6 * * 1   weekly_platform_priority_digest  (architecture_pressure_tool.digest)
20 6 * * 1   weekly_backlog_generate          (backlog_tool.auto_generate_weekly)
40 3 * * *   daily_backlog_cleanup            (backlog_tool.cleanup)
```
### Deployment

```bash
# 1. Copy the cron file to /etc/cron.d/
sudo cp ops/cron/jobs.cron /etc/cron.d/daarion-governance
sudo chmod 644 /etc/cron.d/daarion-governance

# 2. Edit REPO_ROOT and ROUTER_URL if needed
sudo nano /etc/cron.d/daarion-governance

# 3. Verify syntax
crontab -T /etc/cron.d/daarion-governance

# 4. Check logs
tail -f /var/log/daarion/risk_snapshot.log
tail -f /var/log/daarion/backlog_generate.log
```
### Dry-run testing

```bash
python3 ops/scripts/run_governance_job.py \
  --tool backlog_tool --action auto_generate_weekly \
  --params-json '{"env":"prod"}' \
  --dry-run
```
### Expected artifacts

After the first run:

- `ops/reports/risk/YYYY-MM-DD.md` and `.json` (daily risk digest)
- `ops/reports/platform/YYYY-WW.md` and `.json` (weekly platform digest)
- `ops/backlog/items.jsonl` (if `BACKLOG_BACKEND=jsonl`) or the Postgres `backlog_items` table

### Troubleshooting

| Symptom | Cause | Fix |
|---------|-------|-----|
| `Cannot reach http://localhost:8000` | Router not running or wrong `ROUTER_URL` | Check compose; set `ROUTER_URL` in the cron file header |
| `HTTP 401 from /v1/tools/execute` | Missing `SCHEDULER_API_KEY` | Set the env var or check auth config |
| `error: No platform digest found` | `weekly_backlog_generate` ran before `weekly_platform_priority_digest` | Fix cron timing (06:00 vs 06:20) or run the digest manually |
| Job output empty | Scheduler running but the tool silently skipped | Check tool policy (e.g. `weekly_from_pressure_digest: false`) |