microdao-daarion/ops/runbook-audit-postgres.md
Apple 67225a39fa docs(platform): add policy configs, runbooks, ops scripts and platform documentation
2026-03-03 07:14:53 -08:00

# Runbook: Postgres Audit Backend
## Overview
The audit backend stores structured, non-payload `ToolGovernance` events for FinOps, privacy analysis, and incident triage.
| Backend | Config | Use case |
|---------|--------|----------|
| `auto` | `AUDIT_BACKEND=auto` + `DATABASE_URL=...` | **Recommended for prod/staging**: tries Postgres, falls back to JSONL on failure |
| `postgres` | `AUDIT_BACKEND=postgres` | Hard-require Postgres; fails on DB down |
| `jsonl` | `AUDIT_BACKEND=jsonl` | JSONL files only (default / dev) |
| `null` | `AUDIT_BACKEND=null` | Discard all events (useful for testing) |
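
The selection rules in the table can be sketched in Python as follows. This is only an illustration: `resolve_audit_backend` is a hypothetical helper, not the router's actual code, and the handling of `auto` without a `DATABASE_URL` (dropping to JSONL) is an assumption the table does not spell out.

```python
import os

VALID_BACKENDS = {"auto", "postgres", "jsonl", "null"}


def resolve_audit_backend(env=None):
    """Return (backend, dsn) derived from AUDIT_BACKEND and DATABASE_URL."""
    env = os.environ if env is None else env
    # The table lists jsonl as the default backend.
    backend = (env.get("AUDIT_BACKEND") or "jsonl").strip().lower()
    if backend not in VALID_BACKENDS:
        raise ValueError(f"unknown AUDIT_BACKEND: {backend!r}")
    dsn = env.get("DATABASE_URL") or None
    if backend == "postgres" and dsn is None:
        # postgres is a hard requirement, so fail early.
        raise ValueError("AUDIT_BACKEND=postgres requires DATABASE_URL")
    if backend == "auto" and dsn is None:
        # Assumption: with no DSN there is nothing to try, so use JSONL.
        backend = "jsonl"
    return backend, dsn
```
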
---
## 1. Initial Setup (NODE1 / Gateway)
### 1.1 Create `tool_audit_events` table (idempotent)
```bash
DATABASE_URL="postgresql://user:password@host:5432/daarion" \
python3 ops/scripts/migrate_audit_postgres.py
```
Dry-run (print DDL only):
```bash
python3 ops/scripts/migrate_audit_postgres.py --dry-run
```
### 1.2 Configure environment
In `services/router/.env` (or your Docker env):
```env
AUDIT_BACKEND=auto
DATABASE_URL=postgresql://audit_user:secret@pg-host:5432/daarion
# Fallback directory for JSONL files
AUDIT_JSONL_DIR=/var/log/daarion/audit
```
Restart the router after changes.
### 1.3 Verify
```bash
# Check router logs for:
# AuditStore: auto (postgres→jsonl fallback) dsn=postgresql://...
docker logs router 2>&1 | grep AuditStore
# Or call the dashboard:
curl "http://localhost:8080/v1/finops/dashboard?window_hours=24" \
  -H "X-Agent-Id: sofiia"
```
---
## 2. `AUDIT_BACKEND=auto` Fallback Behaviour
When `AUDIT_BACKEND=auto`:
1. **Normal operation**: all writes/reads go to Postgres.
2. **Postgres failure**: `AutoAuditStore` catches the error, logs a WARNING, and switches to JSONL for the next ~5 minutes.
3. **Recovery**: after about 5 minutes, the next write attempt retries Postgres. If it succeeds, the store silently switches back.
This means **tool calls are never blocked** by a DB outage; events continue to land in JSONL.
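
The three steps above amount to a cooldown-based fallback. A minimal Python sketch of that pattern, assuming a 300-second cooldown and two write callables; `AutoStoreSketch` and its parameters are hypothetical names for illustration and do not reproduce the real `AutoAuditStore`:

```python
import logging
import time

log = logging.getLogger("audit")


class AutoStoreSketch:
    """Write to Postgres; on failure use JSONL until a cooldown expires."""

    def __init__(self, primary_write, fallback_write,
                 cooldown_s=300.0, clock=time.monotonic):
        self._primary = primary_write      # e.g. an INSERT into Postgres
        self._fallback = fallback_write    # e.g. an append to a JSONL file
        self._cooldown_s = cooldown_s
        self._clock = clock
        self._retry_at = None              # None: Postgres considered healthy

    def write(self, event):
        now = self._clock()
        if self._retry_at is None or now >= self._retry_at:
            try:
                self._primary(event)
                self._retry_at = None      # recovered: switch back silently
                return "postgres"
            except Exception as exc:
                log.warning("Postgres write failed (%s), using JSONL", exc)
                self._retry_at = now + self._cooldown_s
        # The caller is never blocked: the event still lands in JSONL.
        self._fallback(event)
        return "jsonl"
```
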

---
## 3. Schema
```sql
CREATE TABLE IF NOT EXISTS tool_audit_events (
id BIGSERIAL PRIMARY KEY,
ts TIMESTAMPTZ NOT NULL,
req_id TEXT NOT NULL,
workspace_id TEXT NOT NULL,
user_id TEXT NOT NULL,
agent_id TEXT NOT NULL,
tool TEXT NOT NULL,
action TEXT NOT NULL,
status TEXT NOT NULL,
duration_ms INT NOT NULL DEFAULT 0,
in_size INT NOT NULL DEFAULT 0,
out_size INT NOT NULL DEFAULT 0,
input_hash TEXT NOT NULL DEFAULT '',
graph_run_id TEXT,
graph_node TEXT,
job_id TEXT
);
```
Indexes: `ts`, `(workspace_id, ts)`, `(tool, ts)`, `(agent_id, ts)`.
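
If the indexes ever need to be recreated by hand, the statements could be generated along these lines. The index names are hypothetical (the runbook lists only the indexed columns), and the migration script may already create the indexes under different names.

```python
# Column sets taken from the index list above.
INDEXED_COLUMNS = [
    ("ts",),
    ("workspace_id", "ts"),
    ("tool", "ts"),
    ("agent_id", "ts"),
]


def index_ddl(table="tool_audit_events"):
    """Return idempotent CREATE INDEX statements, one per column set."""
    statements = []
    for cols in INDEXED_COLUMNS:
        # Hypothetical naming scheme: idx_<table>_<col>_<col>
        name = "idx_" + table + "_" + "_".join(cols)
        statements.append(
            f"CREATE INDEX IF NOT EXISTS {name} ON {table} ({', '.join(cols)});"
        )
    return statements
```
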

---
## 4. Scheduled Operational Jobs
Jobs are run via `ops/scripts/schedule_jobs.py` (called by cron — see `ops/cron/jobs.cron`):
| Job | Schedule | What it does |
|-----|----------|--------------|
| `audit_cleanup` | Daily 03:30 | Deletes/gzips JSONL files older than 30 days |
| `daily_cost_digest` | Daily 09:00 | Cost digest → `ops/reports/cost/YYYY-MM-DD.{json,md}` |
| `daily_privacy_digest` | Daily 09:10 | Privacy digest → `ops/reports/privacy/YYYY-MM-DD.{json,md}` |
| `weekly_drift_full` | Mon 02:00 | Full drift → `ops/reports/drift/week-YYYY-WW.json` |
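
One way to confirm that the jobs ran is to check for the report files named in the table. A minimal sketch, with hypothetical helper names; it assumes `WW` is the zero-padded ISO week number, which the runbook does not state.

```python
from datetime import date
from pathlib import Path


def expected_reports(day, root="ops/reports"):
    """Return the report paths the scheduled jobs should produce for `day`."""
    root = Path(root)
    stamp = day.isoformat()                      # YYYY-MM-DD
    iso_year, iso_week, _ = day.isocalendar()
    paths = [root / "cost" / f"{stamp}.{ext}" for ext in ("json", "md")]
    paths += [root / "privacy" / f"{stamp}.{ext}" for ext in ("json", "md")]
    # Assumption: week-YYYY-WW uses the ISO year and ISO week number.
    paths.append(root / "drift" / f"week-{iso_year}-{iso_week:02d}.json")
    return paths


def missing_reports(day, root="ops/reports"):
    """Return the expected report paths that do not exist on disk."""
    return [p for p in expected_reports(day, root) if not p.exists()]
```

Calling `missing_reports(date.today())` after 09:10 should return an empty list on a healthy node, apart from the drift report, which only appears after the Monday run.
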
### Run manually
```bash
# Cost digest
AUDIT_BACKEND=auto DATABASE_URL=... \
python3 ops/scripts/schedule_jobs.py daily_cost_digest
# Privacy digest
python3 ops/scripts/schedule_jobs.py daily_privacy_digest
# Weekly drift
python3 ops/scripts/schedule_jobs.py weekly_drift_full
```
---
## 5. Dashboard Endpoints
| Endpoint | RBAC | Description |
|----------|------|-------------|
| `GET /v1/finops/dashboard?window_hours=24` | `tools.cost.read` | FinOps cost digest |
| `GET /v1/privacy/dashboard?window_hours=24` | `tools.data_gov.read` | Privacy/audit digest |
Headers:
- `X-Agent-Id: sofiia` (or any agent with appropriate entitlements)
- `X-Workspace-Id: your-ws`
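
For scripted checks, the same requests can be built with the Python standard library. This is a sketch under the assumptions already used in this runbook (router on `localhost:8080`, JSON response body); the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_dashboard_request(base_url, kind, agent_id, workspace_id,
                            window_hours=24):
    """Build a GET request for /v1/<kind>/dashboard with the RBAC headers."""
    if kind not in ("finops", "privacy"):
        raise ValueError(f"unknown dashboard: {kind!r}")
    query = urllib.parse.urlencode({"window_hours": window_hours})
    url = f"{base_url.rstrip('/')}/v1/{kind}/dashboard?{query}"
    headers = {"X-Agent-Id": agent_id, "X-Workspace-Id": workspace_id}
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_dashboard(request, timeout=10):
    """Send the request and decode the body, assumed to be JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    req = build_dashboard_request(
        "http://localhost:8080", "finops", "sofiia", "your-ws")
    print(fetch_dashboard(req))
```
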
---
## 6. Maintenance & Troubleshooting
### Check active backend at runtime
```bash
curl -s http://localhost:8080/v1/finops/dashboard \
-H "X-Agent-Id: sofiia" | python3 -m json.tool | grep source_backend
```
### Force Postgres migration (re-apply schema)
```bash
python3 ops/scripts/migrate_audit_postgres.py
```
### Postgres is down — expected behaviour
- Router logs: `WARNING: AutoAuditStore: Postgres write failed (...), switching to JSONL fallback`
- Events land in `AUDIT_JSONL_DIR/tool_audit_YYYY-MM-DD.jsonl`
- Recovery is automatic after about 5 minutes
- No tool call failures
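
During an outage it can help to confirm that events really are landing in the fallback file. The sketch below counts events per `status` in one day's file. It assumes each JSONL line is an object whose keys mirror the table columns, which the runbook does not confirm; the function name is hypothetical.

```python
import json
from collections import Counter
from datetime import date
from pathlib import Path


def summarize_fallback_file(audit_dir, day=None):
    """Count events per status in tool_audit_YYYY-MM-DD.jsonl for `day`."""
    day = day or date.today()
    path = Path(audit_dir) / f"tool_audit_{day.isoformat()}.jsonl"
    counts = Counter()
    if not path.exists():
        return counts                      # no fallback writes on that day
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                event = None
            if not isinstance(event, dict):
                counts["<unparseable>"] += 1
                continue
            # Assumption: the JSONL key is named "status", like the column.
            counts[event.get("status", "<missing>")] += 1
    return counts
```

A non-empty result for today while Postgres is down indicates the fallback path is working as described.
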
### JSONL fallback getting large
Run compaction:
```bash
python3 ops/scripts/audit_compact.py \
--audit-dir ops/audit --window-days 7 --output ops/audit/compact
```
Then cleanup old originals:
```bash
python3 ops/scripts/audit_cleanup.py \
--audit-dir ops/audit --retention-days 30
```
### Retention enforcement
JSONL retention is enforced by the daily `audit_cleanup` job (cron, 03:30). The policy is defined in `config/data_governance_policy.yml`:
```yaml
retention:
audit_jsonl_days: 30
audit_postgres_days: 90
```
The `audit_cleanup` job handles JSONL files only. Postgres retention, if needed, must be managed separately, for example with a scheduled `DELETE FROM tool_audit_events WHERE ts < NOW() - INTERVAL '90 days'` job or with pg_partman.
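
A scheduled purge could build its statement as sketched here. The helper is hypothetical and only prepares the SQL and its parameter; it assumes a driver that uses `%s` placeholders (as psycopg does), so a caller would run `cursor.execute(sql, params)` and commit. The role that runs the purge needs `DELETE` on the table, so it should be a separate role from the router's restricted user.

```python
from datetime import datetime, timedelta, timezone


def build_retention_delete(retention_days=90, now=None):
    """Return (sql, params) deleting events older than `retention_days`."""
    if retention_days <= 0:
        raise ValueError("retention_days must be positive")
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    # Parameterised cutoff instead of NOW() - INTERVAL, so the same
    # timestamp can be logged or reused in a dry run.
    sql = "DELETE FROM tool_audit_events WHERE ts < %s"
    return sql, (cutoff,)
```
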

---
## 7. Security Notes
- No PII or payload is stored in `tool_audit_events` — only sizes, hashes, and metadata.
- `DATABASE_URL` must connect as a restricted database user that has only `INSERT` and `SELECT` privileges on `tool_audit_events`.
- JSONL fallback files inherit filesystem permissions; ensure the fallback directory has mode `700`.
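
The first and last notes can be illustrated with a short sketch. It derives the size and hash fields without keeping the payload, and creates the fallback directory with mode `700`. The use of SHA-256 over canonical JSON is an assumption (the runbook does not say how `input_hash` is computed), and both function names are hypothetical.

```python
import hashlib
import json
import os


def _canonical_bytes(value):
    """Serialise deterministically so equal inputs hash identically."""
    return json.dumps(value, sort_keys=True, separators=(",", ":")).encode("utf-8")


def make_audit_fields(tool_input, tool_output):
    """Return only sizes and a hash; the payload itself is never stored."""
    in_bytes = _canonical_bytes(tool_input)
    out_bytes = _canonical_bytes(tool_output)
    return {
        "in_size": len(in_bytes),
        "out_size": len(out_bytes),
        # Assumption: SHA-256 hex digest of the canonical input.
        "input_hash": hashlib.sha256(in_bytes).hexdigest(),
    }


def ensure_private_dir(path):
    """Create the JSONL directory and force owner-only permissions."""
    os.makedirs(path, mode=0o700, exist_ok=True)
    # makedirs honours the umask and skips existing dirs, so set it explicitly.
    os.chmod(path, 0o700)
```
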