microdao-daarion/ops/runbook-audit-postgres.md
Apple 67225a39fa docs(platform): add policy configs, runbooks, ops scripts and platform documentation
2026-03-03 07:14:53 -08:00

Runbook: Postgres Audit Backend

Overview

The audit backend stores structured, non-payload ToolGovernance events for FinOps, privacy analysis, and incident triage.

| Backend | Config | Use case |
|---|---|---|
| auto | AUDIT_BACKEND=auto + DATABASE_URL=... | Recommended for prod/staging: tries Postgres, falls back to JSONL on failure |
| postgres | AUDIT_BACKEND=postgres | Hard-require Postgres; fails when the DB is down |
| jsonl | AUDIT_BACKEND=jsonl | JSONL files only (default / dev) |
| null | AUDIT_BACKEND=null | Discard all events (useful for testing) |
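
The selection rules in the table can be sketched as a small resolver. This is a minimal illustration, not the router's actual code: the function name `resolve_audit_backend` is hypothetical, and the choice to degrade `auto` to `jsonl` when `DATABASE_URL` is missing is an assumption rather than documented behaviour.

```python
import os

VALID_BACKENDS = {"auto", "postgres", "jsonl", "null"}


def resolve_audit_backend(env=None):
    """Return (backend, dsn) from an environment mapping (hypothetical helper)."""
    env = os.environ if env is None else env
    # jsonl is the documented default when AUDIT_BACKEND is unset.
    backend = env.get("AUDIT_BACKEND", "jsonl").strip().lower()
    if backend not in VALID_BACKENDS:
        raise ValueError(f"unknown AUDIT_BACKEND: {backend!r}")
    dsn = env.get("DATABASE_URL") or None
    if backend == "postgres" and dsn is None:
        # postgres is a hard requirement, so fail fast without a DSN.
        raise ValueError("AUDIT_BACKEND=postgres requires DATABASE_URL")
    if backend == "auto" and dsn is None:
        # Assumption: with no DSN there is nothing to try, so use JSONL only.
        backend = "jsonl"
    return backend, dsn
```

Calling `resolve_audit_backend()` with no argument reads the process environment; passing a dict makes the rules easy to check in isolation.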

1. Initial Setup (NODE1 / Gateway)

1.1 Create tool_audit_events table (idempotent)

DATABASE_URL="postgresql://user:password@host:5432/daarion" \
  python3 ops/scripts/migrate_audit_postgres.py

Dry-run (print DDL only):

python3 ops/scripts/migrate_audit_postgres.py --dry-run

1.2 Configure environment

In services/router/.env (or your Docker env):

AUDIT_BACKEND=auto
DATABASE_URL=postgresql://audit_user:secret@pg-host:5432/daarion
AUDIT_JSONL_DIR=/var/log/daarion/audit   # fallback dir

Restart the router after changes.

1.3 Verify

# Check router logs for:
# AuditStore: auto (postgres→jsonl fallback) dsn=postgresql://...
docker logs router 2>&1 | grep AuditStore

# Or call the dashboard:
curl "http://localhost:8080/v1/finops/dashboard?window_hours=24" \
  -H "X-Agent-Id: sofiia"

2. AUDIT_BACKEND=auto Fallback Behaviour

When AUDIT_BACKEND=auto:

  1. Normal operation: all writes/reads go to Postgres.
  2. Postgres failure: AutoAuditStore catches the error, logs a WARNING, and switches to JSONL for the next ~5 minutes.
  3. Recovery: once the ~5-minute fallback window has passed, the next write attempt retries Postgres. If it succeeds, the store switches back silently.

This means tool calls are never blocked by a DB outage; events continue to land in JSONL.
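
The three steps above can be sketched as a wrapper with a cooldown timer. This is an illustrative sketch only, not the real AutoAuditStore: the class name `AutoAuditStoreSketch`, the `write()` interface of the two stores, and the injectable `clock` are assumptions made for the example.

```python
import logging
import time

log = logging.getLogger("audit")


class AutoAuditStoreSketch:
    """Try the primary (Postgres) store; fall back to JSONL during a cooldown."""

    def __init__(self, primary, fallback, cooldown_s=300.0, clock=time.monotonic):
        self.primary = primary        # assumed to expose write(event)
        self.fallback = fallback      # assumed to expose write(event)
        self.cooldown_s = cooldown_s  # ~5 minutes, per the runbook
        self.clock = clock
        self._retry_at = None         # None means Postgres is considered healthy

    def write(self, event):
        now = self.clock()
        if self._retry_at is None or now >= self._retry_at:
            try:
                self.primary.write(event)
                self._retry_at = None  # recovered: switch back silently
                return "postgres"
            except Exception as exc:
                log.warning(
                    "Postgres write failed (%s), switching to JSONL fallback", exc
                )
                self._retry_at = now + self.cooldown_s
        # Either still in cooldown or the primary write just failed.
        self.fallback.write(event)
        return "jsonl"
```

The event that triggers the failure is written to the fallback in the same call, so the caller never sees the database error.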


3. Schema

CREATE TABLE IF NOT EXISTS tool_audit_events (
    id            BIGSERIAL    PRIMARY KEY,
    ts            TIMESTAMPTZ  NOT NULL,
    req_id        TEXT         NOT NULL,
    workspace_id  TEXT         NOT NULL,
    user_id       TEXT         NOT NULL,
    agent_id      TEXT         NOT NULL,
    tool          TEXT         NOT NULL,
    action        TEXT         NOT NULL,
    status        TEXT         NOT NULL,
    duration_ms   INT          NOT NULL DEFAULT 0,
    in_size       INT          NOT NULL DEFAULT 0,
    out_size      INT          NOT NULL DEFAULT 0,
    input_hash    TEXT         NOT NULL DEFAULT '',
    graph_run_id  TEXT,
    graph_node    TEXT,
    job_id        TEXT
);

Indexes: ts, (workspace_id, ts), (tool, ts), (agent_id, ts).
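
To make the column semantics concrete, here is a hedged sketch of how one row might be derived from a tool call without keeping the payload. The helper name `build_audit_row`, the use of SHA-256 for `input_hash`, and measuring sizes as bytes of a canonical JSON encoding are all assumptions; the runbook does not specify them.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_audit_row(req_id, workspace_id, user_id, agent_id, tool, action,
                    status, tool_input, tool_output, duration_ms=0,
                    graph_run_id=None, graph_node=None, job_id=None):
    """Build a dict matching tool_audit_events columns (hypothetical helper)."""
    # Canonical JSON so the same input always yields the same hash (assumption).
    raw_in = json.dumps(tool_input, sort_keys=True, default=str).encode("utf-8")
    raw_out = json.dumps(tool_output, sort_keys=True, default=str).encode("utf-8")
    return {
        "ts": datetime.now(timezone.utc),
        "req_id": req_id,
        "workspace_id": workspace_id,
        "user_id": user_id,
        "agent_id": agent_id,
        "tool": tool,
        "action": action,
        "status": status,
        "duration_ms": int(duration_ms),
        "in_size": len(raw_in),      # size only, never the payload itself
        "out_size": len(raw_out),
        "input_hash": hashlib.sha256(raw_in).hexdigest(),
        "graph_run_id": graph_run_id,  # nullable columns
        "graph_node": graph_node,
        "job_id": job_id,
    }
```

The `id` column is omitted on purpose because BIGSERIAL assigns it on insert.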


4. Scheduled Operational Jobs

Jobs are run via ops/scripts/schedule_jobs.py (called by cron — see ops/cron/jobs.cron):

| Job | Schedule | What it does |
|---|---|---|
| audit_cleanup | Daily 03:30 | Deletes/gzips JSONL files older than 30 days |
| daily_cost_digest | Daily 09:00 | Cost digest → ops/reports/cost/YYYY-MM-DD.{json,md} |
| daily_privacy_digest | Daily 09:10 | Privacy digest → ops/reports/privacy/YYYY-MM-DD.{json,md} |
| weekly_drift_full | Mon 02:00 | Full drift → ops/reports/drift/week-YYYY-WW.json |
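
When checking whether a job ran, it helps to know which report file to look for. The sketch below derives the paths from the patterns in the table; the function names are hypothetical, and reading `YYYY-WW` as the ISO year and ISO week number is an assumption that should be checked against schedule_jobs.py.

```python
from datetime import date
from pathlib import PurePosixPath

REPORTS_ROOT = PurePosixPath("ops/reports")


def daily_report_paths(kind, day):
    """Paths for a daily digest; kind is 'cost' or 'privacy'."""
    stem = day.isoformat()  # YYYY-MM-DD
    return [REPORTS_ROOT / kind / f"{stem}.json",
            REPORTS_ROOT / kind / f"{stem}.md"]


def weekly_drift_path(day):
    """Path for the weekly drift report covering the given day."""
    # Assumption: WW is the ISO-8601 week number, zero-padded.
    iso_year, iso_week, _ = day.isocalendar()
    return REPORTS_ROOT / "drift" / f"week-{iso_year}-{iso_week:02d}.json"
```

For example, `daily_report_paths("cost", date.today())` gives the two files the cost digest is expected to produce for today.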

Run manually

# Cost digest
AUDIT_BACKEND=auto DATABASE_URL=... \
  python3 ops/scripts/schedule_jobs.py daily_cost_digest

# Privacy digest
python3 ops/scripts/schedule_jobs.py daily_privacy_digest

# Weekly drift
python3 ops/scripts/schedule_jobs.py weekly_drift_full

5. Dashboard Endpoints

| Endpoint | RBAC | Description |
|---|---|---|
| GET /v1/finops/dashboard?window_hours=24 | tools.cost.read | FinOps cost digest |
| GET /v1/privacy/dashboard?window_hours=24 | tools.data_gov.read | Privacy/audit digest |

Headers:

  • X-Agent-Id: sofiia (or any agent with appropriate entitlements)
  • X-Workspace-Id: your-ws
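
For scripted checks, the same requests can be built with the Python standard library. This is a minimal sketch: the helper names are hypothetical, the base URL `http://localhost:8080` is taken from the curl examples in this runbook, and the response is only assumed to be JSON.

```python
import json
import urllib.parse
import urllib.request


def build_dashboard_request(base_url, kind, agent_id, workspace_id,
                            window_hours=24):
    """Build a GET request for the finops or privacy dashboard."""
    if kind not in ("finops", "privacy"):
        raise ValueError(f"unknown dashboard kind: {kind!r}")
    query = urllib.parse.urlencode({"window_hours": window_hours})
    url = f"{base_url.rstrip('/')}/v1/{kind}/dashboard?{query}"
    headers = {"X-Agent-Id": agent_id, "X-Workspace-Id": workspace_id}
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_dashboard(request, timeout=10):
    """Send the request and decode the body (assumed to be JSON)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

A typical call would be `fetch_dashboard(build_dashboard_request("http://localhost:8080", "finops", "sofiia", "your-ws"))`; an agent without the required entitlement should be rejected by the RBAC check.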

6. Maintenance & Troubleshooting

Check active backend at runtime

curl -s http://localhost:8080/v1/finops/dashboard \
  -H "X-Agent-Id: sofiia" | python3 -m json.tool | grep source_backend

Force Postgres migration (re-apply schema)

python3 ops/scripts/migrate_audit_postgres.py

Postgres is down — expected behaviour

  • Router logs: WARNING: AutoAuditStore: Postgres write failed (...), switching to JSONL fallback
  • Events land in AUDIT_JSONL_DIR/tool_audit_YYYY-MM-DD.jsonl
  • Recovery is automatic: Postgres is retried after ~5 minutes
  • No tool calls fail

JSONL fallback getting large

Run compaction:

python3 ops/scripts/audit_compact.py \
  --audit-dir ops/audit --window-days 7 --output ops/audit/compact

Then clean up old originals:

python3 ops/scripts/audit_cleanup.py \
  --audit-dir ops/audit --retention-days 30
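
Before running the cleanup, it can be useful to preview which files fall outside the retention window. The sketch below lists them without deleting anything. It assumes the age is taken from the date in the tool_audit_YYYY-MM-DD.jsonl filename; audit_cleanup.py itself may use a different rule (for example file modification time), and the function name is hypothetical.

```python
import re
from datetime import date, timedelta
from pathlib import Path

_NAME = re.compile(r"^tool_audit_(\d{4})-(\d{2})-(\d{2})\.jsonl$")


def expired_audit_files(audit_dir, retention_days=30, today=None):
    """Return JSONL files whose filename date is older than the retention window."""
    today = today or date.today()
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for path in sorted(Path(audit_dir).glob("tool_audit_*.jsonl")):
        match = _NAME.match(path.name)
        if not match:
            continue  # unexpected name: leave it alone
        try:
            file_day = date(*(int(part) for part in match.groups()))
        except ValueError:
            continue  # not a real calendar date
        if file_day < cutoff:
            expired.append(path)
    return expired
```

Printing the result of `expired_audit_files("ops/audit")` gives a dry-run view of what a 30-day policy would remove.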

Retention enforcement

Enforced by the daily audit_cleanup job (cron, 03:30). The policy is defined in config/data_governance_policy.yml:

retention:
  audit_jsonl_days: 30
  audit_postgres_days: 90

Postgres retention (if needed) must be managed separately, for example with a scheduled job that runs DELETE FROM tool_audit_events WHERE ts < NOW() - INTERVAL '90 days', or with pg_partman.
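
One way to run such a DELETE without holding a long transaction is to purge in batches. The following is a sketch under stated assumptions, not an existing script in this repo: `conn` is any DB-API 2.0 connection whose driver uses `%s` placeholders (psycopg does), the function name is hypothetical, and the role running it needs DELETE privilege, which the restricted audit user should not have.

```python
from datetime import datetime, timedelta, timezone

# Delete at most one batch of expired rows per statement.
DELETE_BATCH_SQL = """
DELETE FROM tool_audit_events
WHERE id IN (
    SELECT id FROM tool_audit_events
    WHERE ts < %s
    LIMIT %s
)
"""


def purge_old_audit_events(conn, retention_days=90, batch_size=10_000, now=None):
    """Delete rows older than retention_days in batches; return rows deleted."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    total = 0
    while True:
        cur = conn.cursor()
        try:
            cur.execute(DELETE_BATCH_SQL, (cutoff, batch_size))
            deleted = cur.rowcount
        finally:
            cur.close()
        conn.commit()  # commit per batch to keep transactions short
        total += deleted
        if deleted < batch_size:
            return total
```

The cutoff is computed once in Python, so every batch uses the same boundary even if the purge runs for a while.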


7. Security Notes

  • No PII or payload is stored in tool_audit_events — only sizes, hashes, and metadata.
  • DATABASE_URL must use a restricted user with only INSERT and SELECT privileges on tool_audit_events (plus USAGE on the id sequence that BIGSERIAL creates, which INSERT needs).
  • JSONL fallback files inherit filesystem permissions; ensure the directory is chmod 700.
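
The directory permission in the last point can be set and verified from Python as well as from the shell. This is a small sketch for POSIX systems; the function names are hypothetical and the path you pass would be whatever AUDIT_JSONL_DIR points to.

```python
import os
import stat


def ensure_private_dir(path):
    """Create the directory if needed, force mode 700, and return the mode."""
    os.makedirs(path, mode=0o700, exist_ok=True)
    # makedirs applies the umask and ignores mode for existing directories,
    # so set the permission bits explicitly.
    os.chmod(path, 0o700)
    return stat.S_IMODE(os.stat(path).st_mode)


def is_private_dir(path):
    """True if neither group nor others have any access to the directory."""
    return stat.S_IMODE(os.stat(path).st_mode) & 0o077 == 0
```

Running `is_private_dir` as a startup or cron check flags a directory whose permissions were loosened after deployment.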