Runbook — Engineering Backlog Bridge
Service: Engineering Backlog Bridge
Owner: CTO / Platform Engineering
On-call: oncall
1. Storage Backends
1.1 Default: Auto (Postgres → JSONL)
The AutoBacklogStore attempts a Postgres connection on startup. If Postgres is
unavailable, it falls back to JSONL and retries the connection every 5 minutes.
Check the active backend in logs:
backlog_store: using PostgresBacklogStore
backlog_store: using JsonlBacklogStore
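The fallback-with-retry behavior can be sketched as follows. This is an illustrative sketch, not the actual AutoBacklogStore implementation; the class and method names here are made up:

```python
import time

RETRY_INTERVAL_S = 300  # retry the primary backend every 5 minutes

class AutoStoreSketch:
    """Illustrative: prefer a primary backend, fall back when it is unavailable."""

    def __init__(self, connect_primary, fallback):
        self._connect_primary = connect_primary  # callable that may raise
        self._fallback = fallback
        self._primary = None
        self._next_retry = 0.0

    def active(self):
        now = time.monotonic()
        if self._primary is None and now >= self._next_retry:
            try:
                self._primary = self._connect_primary()
            except Exception:
                # Primary unavailable: schedule the next retry, keep the fallback
                self._next_retry = now + RETRY_INTERVAL_S
        return self._primary if self._primary is not None else self._fallback
```

The key property is that reads and writes never block on the primary: callers always get a working backend, and the primary is re-probed at most once per interval.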
1.2 Switching backend
# Use JSONL only (no DB required)
export BACKLOG_BACKEND=jsonl
# Use Postgres
export BACKLOG_BACKEND=postgres
export BACKLOG_POSTGRES_DSN="postgresql://user:pass@host:5432/daarion"
# Tests only
export BACKLOG_BACKEND=memory
2. Postgres Migration
Run once per environment. Idempotent (safe to re-run).
# Dry-run first
python3 ops/scripts/migrate_backlog_postgres.py \
--dsn "postgresql://user:pass@host/daarion" \
--dry-run
# Apply
python3 ops/scripts/migrate_backlog_postgres.py \
--dsn "postgresql://user:pass@host/daarion"
Alternatively, use $BACKLOG_POSTGRES_DSN or $POSTGRES_DSN environment variables.
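The DSN resolution order (CLI flag first, then the two environment variables) can be sketched as below; the function name is illustrative:

```python
import os

def resolve_dsn(cli_dsn=None):
    """Pick the DSN: --dsn flag wins, then BACKLOG_POSTGRES_DSN, then POSTGRES_DSN."""
    return (
        cli_dsn
        or os.environ.get("BACKLOG_POSTGRES_DSN")
        or os.environ.get("POSTGRES_DSN")
    )
```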
Tables created:
- backlog_items — dedupe_key UNIQUE constraint
- backlog_events — FK to backlog_items with CASCADE DELETE
Indexes: env+status, service, due_date, owner, category, item_id, ts.
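The migration targets Postgres; as a rough illustration of its idempotent shape (CREATE ... IF NOT EXISTS, so re-runs are no-ops), here is a sketch using SQLite with a guessed minimal column set. The real schema lives in migrate_backlog_postgres.py:

```python
import sqlite3

# Illustrative schema only: the real column list is defined by the migration script.
DDL = [
    """CREATE TABLE IF NOT EXISTS backlog_items (
           id TEXT PRIMARY KEY,
           dedupe_key TEXT UNIQUE,
           status TEXT,
           updated_at TEXT
       )""",
    """CREATE TABLE IF NOT EXISTS backlog_events (
           event_id TEXT PRIMARY KEY,
           item_id TEXT REFERENCES backlog_items(id) ON DELETE CASCADE,
           ts TEXT
       )""",
    "CREATE INDEX IF NOT EXISTS idx_events_item ON backlog_events(item_id)",
]

def migrate(conn):
    for stmt in DDL:
        conn.execute(stmt)  # IF NOT EXISTS makes re-runs safe

conn = sqlite3.connect(":memory:")
migrate(conn)
migrate(conn)  # idempotent: the second run changes nothing
```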
3. Weekly Auto-generation
3.1 Automatic (scheduled)
weekly_backlog_generate runs every Monday at 06:20 UTC (20 min after
the weekly platform digest at 06:00 UTC). Registered in ops/task_registry.yml.
3.2 Manual trigger
# HTTP (admin only)
curl -X POST "https://router/v1/backlog/generate/weekly?env=prod"
# Tool call
{
"tool": "backlog_tool",
"action": "auto_generate_weekly",
"env": "prod"
}
3.3 Prerequisite
The latest ops/reports/platform/YYYY-WW.json must exist (produced by
weekly_platform_priority_digest). If it's missing, generation returns:
{ "error": "No platform digest found. Run architecture_pressure_tool.digest first." }
Fix:
# Trigger platform digest
{ "tool": "architecture_pressure_tool", "action": "digest", "env": "prod" }
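A quick prerequisite check can be scripted as below. The path pattern comes from the section above; the helper names are illustrative, and the assumption here is that the WW component is the ISO week number:

```python
from datetime import date
from pathlib import Path

def latest_digest_path(reports_dir="ops/reports/platform"):
    """Return the newest YYYY-WW.json digest, or None if none exist."""
    digests = sorted(Path(reports_dir).glob("*.json"))
    return digests[-1] if digests else None

def current_week_name(today=None):
    """ISO year-week label matching the YYYY-WW naming, e.g. '2026-07'."""
    today = today or date.today()
    year, week, _ = today.isocalendar()
    return f"{year}-{week:02d}"
```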
4. Cleanup (Retention)
Schedule: Daily at 03:40 UTC.
Removes done / canceled items older than retention_days (default 180d).
# Manual cleanup
{
"tool": "backlog_tool",
"action": "cleanup",
"retention_days": 180
}
For JSONL backend, cleanup rewrites the file atomically.
For Postgres, it runs a DELETE WHERE status IN ('done','canceled') AND updated_at < cutoff.
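The retention predicate both backends apply can be sketched as below (exact timestamp handling in the real store may differ):

```python
from datetime import datetime, timedelta, timezone

TERMINAL_STATUSES = ("done", "canceled")

def retention_cutoff(retention_days=180, now=None):
    """Items in a terminal status with updated_at older than this are removed."""
    now = now or datetime.now(timezone.utc)
    return now - timedelta(days=retention_days)

def should_delete(item, cutoff):
    return item["status"] in TERMINAL_STATUSES and item["updated_at"] < cutoff
```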
5. JSONL File Management
Files: ops/backlog/items.jsonl, ops/backlog/events.jsonl
The JSONL backend is append-only (updates append a new line; reads use
last-write-wins per id). The file grows over time until cleanup() rewrites it.
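The last-write-wins read described above can be sketched as:

```python
import json

def read_items(lines):
    """Replay an append-only JSONL stream; the last record per id wins."""
    items = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        items[rec["id"]] = rec  # later lines overwrite earlier ones
    return items
```

Compaction is then just writing `items.values()` back out as one line per id, which is why cleanup() shrinks the file.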
Check file size
wc -l ops/backlog/items.jsonl
ls -lh ops/backlog/items.jsonl
Manual compaction (outside cleanup schedule)
python3 -c "
from services.router.backlog_store import JsonlBacklogStore
s = JsonlBacklogStore()
deleted = s.cleanup(retention_days=30)
print(f'Removed {deleted} old items')
"
6. Dashboard & Monitoring
# HTTP
GET /v1/backlog/dashboard?env=prod
# Example response
{
"total": 42,
"status_counts": {"open": 18, "in_progress": 5, "blocked": 3, "done": 14, "canceled": 2},
"priority_counts": {"P0": 1, "P1": 9, "P2": 22, "P3": 10},
"overdue_count": 4,
"overdue": [
{"id": "bl_...", "service": "gateway", "priority": "P1", "due_date": "2026-02-10", ...}
],
"top_services": [{"service": "gateway", "count": 5}, ...]
}
Alert thresholds (recommended):
- overdue_count > 5 → notify oncall
- priority_counts.P0 > 0 AND overdue → page CTO
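Those thresholds can be wired into a small alert check; the function is illustrative, and the field names come from the dashboard response above:

```python
def evaluate_alerts(dashboard):
    """Apply the recommended thresholds to a /v1/backlog/dashboard response."""
    alerts = []
    if dashboard.get("overdue_count", 0) > 5:
        alerts.append(("notify", "oncall"))
    has_p0 = dashboard.get("priority_counts", {}).get("P0", 0) > 0
    if has_p0 and dashboard.get("overdue_count", 0) > 0:
        alerts.append(("page", "CTO"))
    return alerts
```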
7. Troubleshooting
Items not generated
- Check if platform digest exists: ls ops/reports/platform/*.json
- Verify generation.weekly_from_pressure_digest: true in config/backlog_policy.yml
- Check max_items_per_run — may cap generation if many services match.
Duplicate items across weeks
Normal — each week gets a new dedupe_key ...:YYYY-WW:.... Items from
previous weeks remain unless closed. This is intentional: unresolved issues
accumulate visibility week-over-week.
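How a per-week dedupe_key might be built: the exact format is elided above, so this sketch only illustrates the ISO-week component that changes between runs (prefix, field order, and separators are assumptions):

```python
from datetime import date

def weekly_dedupe_key(prefix, service, today=None):
    """Illustrative dedupe key with an ISO YYYY-WW component per week."""
    today = today or date.today()
    year, week, _ = today.isocalendar()
    return f"{prefix}:{service}:{year}-{week:02d}"
```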
Postgres connection failures
Check: BACKLOG_POSTGRES_DSN, network access, and that migration has been run.
The AutoBacklogStore will fall back to JSONL and log a warning.
Wrong owner assigned
Check config/backlog_policy.yml → ownership.overrides and add or update
service-level overrides as needed. Note that re-running auto_generate_weekly
does not fix existing items: the upsert only takes the title/meta update path
and preserves the owner field on items that already exist. For immediate
correction, use set_status + add_comment, or upsert with an explicit owner.
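Owner resolution from the policy can be sketched as below; the keys mirror ownership.default_owner and ownership.overrides:

```python
def resolve_owner(service, ownership):
    """Per-service override wins; otherwise fall back to default_owner."""
    overrides = ownership.get("overrides", {})
    return overrides.get(service, ownership.get("default_owner", "oncall"))
```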
8. Configuration Reference
config/backlog_policy.yml — key sections:
| Section | Key | Default | Description |
|---|---|---|---|
| defaults | retention_days | 180 | Days to keep done/canceled items |
| defaults | max_items_per_run | 50 | Cap per generation run |
| dedupe | key_prefix | platform_backlog | Dedupe key prefix |
| categories.* | priority | varies | Default priority per category |
| categories.* | due_days | varies | Days until due from creation |
| generation | weekly_from_pressure_digest | true | Enable weekly generation |
| generation | daily_from_risk_digest | false | Enable daily generation from risk |
| ownership | default_owner | oncall | Fallback owner |
| ownership.overrides | {service} | — | Per-service owner override |
9. Scheduler Wiring: cron vs task_registry
Architecture
There are two sources of truth for scheduled jobs:
| File | Role |
|---|---|
| ops/task_registry.yml | Declarative registry — defines what jobs exist, their schedule, inputs, permissions, and dry-run behavior. Used for documentation, audits, and future scheduler integrations. |
| ops/cron/jobs.cron | Active scheduler — physical cron entries that actually run jobs. Must be kept in sync with task_registry.yml. |
How governance jobs are executed
All governance jobs use the universal runner:
python3 ops/scripts/run_governance_job.py \
--tool <tool_name> \
--action <action> \
--params-json '<json>'
This sends a POST to /v1/tools/execute on the router. The router applies RBAC
(agent_id=scheduler, which has tools.backlog.admin + tools.pressure.write +
tools.risk.write via the scheduler service account) and executes the tool.
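The request body the runner assembles can be sketched as below. The field names are an assumption based on the CLI flags; check run_governance_job.py for the real shape:

```python
import json

def build_execute_payload(tool, action, params_json, dry_run=False):
    """Assemble an illustrative body for POST /v1/tools/execute."""
    payload = {
        "tool": tool,
        "action": action,
        "params": json.loads(params_json),  # --params-json arrives as a string
        "agent_id": "scheduler",
    }
    if dry_run:
        payload["dry_run"] = True
    return payload
```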
Governance cron schedule
0 * * * * hourly_risk_snapshot (risk_history_tool.snapshot)
0 9 * * * daily_risk_digest (risk_history_tool.digest)
20 3 * * * risk_history_cleanup (risk_history_tool.cleanup)
0 6 * * 1 weekly_platform_priority_digest (architecture_pressure_tool.digest)
20 6 * * 1 weekly_backlog_generate (backlog_tool.auto_generate_weekly)
40 3 * * * daily_backlog_cleanup (backlog_tool.cleanup)
Deployment
# 1. Copy cron file to /etc/cron.d/
sudo cp ops/cron/jobs.cron /etc/cron.d/daarion-governance
sudo chmod 644 /etc/cron.d/daarion-governance
# 2. Edit REPO_ROOT and ROUTER_URL if needed
sudo nano /etc/cron.d/daarion-governance
# 3. Verify syntax
crontab -T /etc/cron.d/daarion-governance
# 4. Check logs
tail -f /var/log/daarion/risk_snapshot.log
tail -f /var/log/daarion/backlog_generate.log
Dry-run testing
python3 ops/scripts/run_governance_job.py \
--tool backlog_tool --action auto_generate_weekly \
--params-json '{"env":"prod"}' \
--dry-run
Expected artifacts
After first run:
- ops/reports/risk/YYYY-MM-DD.md and .json (daily digest)
- ops/reports/platform/YYYY-WW.md and .json (weekly platform digest)
- ops/backlog/items.jsonl (if BACKLOG_BACKEND=jsonl) or the Postgres backlog_items table
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Cannot reach http://localhost:8000 | Router not running or wrong ROUTER_URL | Check compose, set ROUTER_URL in cron header |
| HTTP 401 from /v1/tools/execute | Missing SCHEDULER_API_KEY | Set env var or check auth config |
| error: No platform digest found | weekly_backlog_generate ran before weekly_platform_priority_digest | Fix cron timing (06:00 vs 06:20) or run digest manually |
| Job output empty | Scheduler running but tool silently skipped | Check tool policy (e.g. weekly_from_pressure_digest: false) |