# Runbook: Postgres Audit Backend

## Overview

The audit backend stores structured, non-payload `ToolGovernance` events for FinOps, privacy analysis, and incident triage.

| Backend | Config | Use case |
|---------|--------|----------|
| `auto` | `AUDIT_BACKEND=auto` + `DATABASE_URL=...` | **Recommended for prod/staging**: tries Postgres, falls back to JSONL on failure |
| `postgres` | `AUDIT_BACKEND=postgres` | Hard-require Postgres; fails on DB down |
| `jsonl` | `AUDIT_BACKEND=jsonl` | JSONL files only (default / dev) |
| `null` | `AUDIT_BACKEND=null` | Discard all events (useful for testing) |

---

## 1. Initial Setup (NODE1 / Gateway)

### 1.1 Create `tool_audit_events` table (idempotent)

```bash
DATABASE_URL="postgresql://user:password@host:5432/daarion" \
  python3 ops/scripts/migrate_audit_postgres.py
```

Dry-run (print DDL only):

```bash
python3 ops/scripts/migrate_audit_postgres.py --dry-run
```

### 1.2 Configure environment

In `services/router/.env` (or your Docker env):

```env
AUDIT_BACKEND=auto
DATABASE_URL=postgresql://audit_user:secret@pg-host:5432/daarion
# Fallback directory for JSONL files (kept on its own line: some env-file
# loaders treat an inline "# ..." as part of the value)
AUDIT_JSONL_DIR=/var/log/daarion/audit
```

Restart the router after changes.

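The variables above can be sanity-checked before a restart. The following is a minimal sketch, not the router's actual loader: the function name `resolve_audit_config` is hypothetical, and the rule that `auto` and `postgres` require a non-empty `DATABASE_URL` is an assumption drawn from the backend table in the overview.

```python
import os

# Backend names taken from the overview table.
VALID_BACKENDS = {"auto", "postgres", "jsonl", "null"}


def resolve_audit_config(env=None):
    """Return (backend, dsn, jsonl_dir) from an environment-style mapping."""
    env = os.environ if env is None else env
    # The overview lists jsonl as the default backend.
    backend = env.get("AUDIT_BACKEND", "jsonl").strip().lower()
    if backend not in VALID_BACKENDS:
        raise ValueError(f"unknown AUDIT_BACKEND: {backend!r}")
    dsn = env.get("DATABASE_URL", "")
    # Assumption: Postgres-backed modes are unusable without a DSN.
    if backend in {"auto", "postgres"} and not dsn:
        raise ValueError(f"AUDIT_BACKEND={backend} requires DATABASE_URL")
    # No default directory is assumed; None means "not configured".
    return backend, dsn, env.get("AUDIT_JSONL_DIR")
```

Calling `resolve_audit_config()` with no argument reads the current process environment, so it can be run inside the router container to check what the service will see.
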
### 1.3 Verify

```bash
# Check router logs for:
#   AuditStore: auto (postgres→jsonl fallback) dsn=postgresql://...
docker logs router 2>&1 | grep AuditStore

# Or call the dashboard (URL quoted so the shell does not expand "?"):
curl "http://localhost:8080/v1/finops/dashboard?window_hours=24" \
  -H "X-Agent-Id: sofiia"
```

---

## 2. `AUDIT_BACKEND=auto` Fallback Behaviour

When `AUDIT_BACKEND=auto`:

1. **Normal operation**: all writes and reads go to Postgres.
2. **Postgres failure**: `AutoAuditStore` catches the error, logs a WARNING, and switches to JSONL for roughly the next 5 minutes.
3. **Recovery**: after 5 minutes, the next write attempt retries Postgres. If it succeeds, the store switches back silently.

This means **tool calls are never blocked** by a DB outage; events continue to land in JSONL.

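The three steps above amount to a write path with a cooldown timer. The sketch below illustrates that pattern only; it is not the project's `AutoAuditStore`. The class name `FallbackAuditStore`, the `write` method on both stores, and the injectable `clock` are all hypothetical. Real code would also log a WARNING where the comment indicates. The `clock` parameter exists so the 5-minute window can be tested without waiting.

```python
import time


class FallbackAuditStore:
    """Illustrative primary/fallback writer with a retry cooldown."""

    def __init__(self, primary, fallback, cooldown_s=300.0, clock=time.monotonic):
        self.primary = primary      # e.g. a Postgres-backed store
        self.fallback = fallback    # e.g. a JSONL-backed store
        self.cooldown_s = cooldown_s
        self.clock = clock
        self._retry_at = 0.0        # primary is tried once clock() >= this

    def write(self, event):
        """Write one event; return which store accepted it."""
        if self.clock() >= self._retry_at:
            try:
                self.primary.write(event)
                return "primary"
            except Exception:
                # A real implementation would log a WARNING here.
                self._retry_at = self.clock() + self.cooldown_s
        # Either the primary just failed or the cooldown is still running.
        self.fallback.write(event)
        return "fallback"
```
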
---

## 3. Schema

```sql
CREATE TABLE IF NOT EXISTS tool_audit_events (
    id           BIGSERIAL PRIMARY KEY,
    ts           TIMESTAMPTZ NOT NULL,
    req_id       TEXT NOT NULL,
    workspace_id TEXT NOT NULL,
    user_id      TEXT NOT NULL,
    agent_id     TEXT NOT NULL,
    tool         TEXT NOT NULL,
    action       TEXT NOT NULL,
    status       TEXT NOT NULL,
    duration_ms  INT NOT NULL DEFAULT 0,
    in_size      INT NOT NULL DEFAULT 0,
    out_size     INT NOT NULL DEFAULT 0,
    input_hash   TEXT NOT NULL DEFAULT '',
    graph_run_id TEXT,
    graph_node   TEXT,
    job_id       TEXT
);
```

Indexes: `ts`, `(workspace_id, ts)`, `(tool, ts)`, `(agent_id, ts)`.

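To show how a row can carry sizes and a hash without the payload itself, here is a hedged sketch of building an event that matches the columns above. The function name `build_audit_event`, the use of SHA-256 for `input_hash`, and the JSON serialization used to measure sizes are assumptions; the source does not say how the real code derives these fields.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_audit_event(req_id, workspace_id, user_id, agent_id, tool, action,
                      status, payload, result, duration_ms=0):
    """Return a dict shaped like a tool_audit_events row (minus id)."""
    # Assumption: sizes and hash are computed over a canonical JSON encoding.
    raw_in = json.dumps(payload, sort_keys=True, default=str).encode("utf-8")
    raw_out = json.dumps(result, sort_keys=True, default=str).encode("utf-8")
    return {
        "ts": datetime.now(timezone.utc),
        "req_id": req_id,
        "workspace_id": workspace_id,
        "user_id": user_id,
        "agent_id": agent_id,
        "tool": tool,
        "action": action,
        "status": status,
        "duration_ms": int(duration_ms),
        "in_size": len(raw_in),
        "out_size": len(raw_out),
        # Only the digest is kept; the payload never enters the row.
        "input_hash": hashlib.sha256(raw_in).hexdigest(),
        "graph_run_id": None,
        "graph_node": None,
        "job_id": None,
    }
```
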
---

## 4. Scheduled Operational Jobs

Jobs are run by `ops/scripts/schedule_jobs.py`, which cron invokes (see `ops/cron/jobs.cron`):

| Job | Schedule | What it does |
|-----|----------|--------------|
| `audit_cleanup` | Daily 03:30 | Deletes/gzips JSONL files older than 30 days |
| `daily_cost_digest` | Daily 09:00 | Cost digest → `ops/reports/cost/YYYY-MM-DD.{json,md}` |
| `daily_privacy_digest` | Daily 09:10 | Privacy digest → `ops/reports/privacy/YYYY-MM-DD.{json,md}` |
| `weekly_drift_full` | Mon 02:00 | Full drift → `ops/reports/drift/week-YYYY-WW.json` |

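When checking whether a job produced its output, it can help to compute the expected report paths for a given date. The sketch below follows the path patterns in the table; the helper names are hypothetical, and reading `YYYY-WW` as the ISO 8601 year and week number is an assumption that should be checked against `schedule_jobs.py`.

```python
from datetime import date
from pathlib import PurePosixPath


def daily_report_paths(kind, day):
    """Expected .json and .md paths for a daily digest ('cost' or 'privacy')."""
    base = PurePosixPath("ops/reports") / kind
    return [base / f"{day.isoformat()}.{ext}" for ext in ("json", "md")]


def weekly_drift_path(day):
    """Expected weekly drift report path, assuming ISO week numbering."""
    iso_year, iso_week, _ = day.isocalendar()
    return PurePosixPath("ops/reports/drift") / f"week-{iso_year}-{iso_week:02d}.json"
```

For example, `daily_report_paths("cost", date(2025, 3, 4))` yields `ops/reports/cost/2025-03-04.json` and `ops/reports/cost/2025-03-04.md`.
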
### Run manually

```bash
# Cost digest
AUDIT_BACKEND=auto DATABASE_URL=... \
  python3 ops/scripts/schedule_jobs.py daily_cost_digest

# Privacy digest
python3 ops/scripts/schedule_jobs.py daily_privacy_digest

# Weekly drift
python3 ops/scripts/schedule_jobs.py weekly_drift_full
```

---

## 5. Dashboard Endpoints

| Endpoint | RBAC | Description |
|----------|------|-------------|
| `GET /v1/finops/dashboard?window_hours=24` | `tools.cost.read` | FinOps cost digest |
| `GET /v1/privacy/dashboard?window_hours=24` | `tools.data_gov.read` | Privacy/audit digest |

Headers:

- `X-Agent-Id: sofiia` (or any agent with appropriate entitlements)
- `X-Workspace-Id: your-ws`

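For scripted checks, the same requests can be built with the Python standard library. This is a sketch under stated assumptions: the helper names `dashboard_request` and `fetch_dashboard` are hypothetical, the base URL is whatever the router listens on (the examples in this runbook use `http://localhost:8080`), and the response is assumed to be JSON because the troubleshooting section pipes it through `json.tool`.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def dashboard_request(base_url, kind, agent_id, workspace_id, window_hours=24):
    """Build a GET request for the 'finops' or 'privacy' dashboard."""
    if kind not in ("finops", "privacy"):
        raise ValueError(f"unknown dashboard kind: {kind!r}")
    query = urlencode({"window_hours": window_hours})
    url = f"{base_url.rstrip('/')}/v1/{kind}/dashboard?{query}"
    headers = {"X-Agent-Id": agent_id, "X-Workspace-Id": workspace_id}
    return Request(url, headers=headers)


def fetch_dashboard(req, timeout=10):
    """Send the request and decode the JSON body."""
    with urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


# Usage (needs a running router):
#   req = dashboard_request("http://localhost:8080", "finops", "sofiia", "your-ws")
#   data = fetch_dashboard(req)
```
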
---

## 6. Maintenance & Troubleshooting

### Check active backend at runtime

```bash
curl -s http://localhost:8080/v1/finops/dashboard \
  -H "X-Agent-Id: sofiia" | python3 -m json.tool | grep source_backend
```

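The `grep` above matches `source_backend` wherever it appears in the pretty-printed output. A script that needs the value itself can search the decoded JSON instead. The sketch below makes no assumption about where the key sits in the response, which this runbook does not specify; the helper name `find_key` is hypothetical.

```python
def find_key(obj, key):
    """Depth-first search for the first non-None value stored under `key`."""
    if isinstance(obj, dict):
        if obj.get(key) is not None:
            return obj[key]
        children = obj.values()
    elif isinstance(obj, list):
        children = obj
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None
```

Given the decoded dashboard response as `data`, `find_key(data, "source_backend")` returns the reported backend, or `None` if the key is absent.
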
### Force Postgres migration (re-apply schema)

```bash
python3 ops/scripts/migrate_audit_postgres.py
```

### Postgres is down: expected behaviour

- Router logs: `WARNING: AutoAuditStore: Postgres write failed (...), switching to JSONL fallback`
- Events land in `AUDIT_JSONL_DIR/tool_audit_YYYY-MM-DD.jsonl`
- Recovery is automatic after 5 minutes
- No tool calls fail

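During an outage, one quick way to confirm that events are still landing is to count lines in the current day's fallback file. The sketch below uses the file naming shown in the list above; the helper names are hypothetical, and whether the date in the file name is UTC or local time is not stated here, so the caller passes the date explicitly.

```python
from datetime import date
from pathlib import Path


def fallback_file(audit_dir, day):
    """Path of the JSONL fallback file for a given date."""
    return Path(audit_dir) / f"tool_audit_{day.isoformat()}.jsonl"


def count_events(path):
    """Count non-empty lines; a missing file counts as zero events."""
    if not path.exists():
        return 0
    with path.open(encoding="utf-8") as fh:
        return sum(1 for line in fh if line.strip())
```

Running `count_events(fallback_file("/var/log/daarion/audit", date.today()))` twice a minute apart should show the count growing while tools are in use.
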
### JSONL fallback getting large

Run compaction:

```bash
python3 ops/scripts/audit_compact.py \
  --audit-dir ops/audit --window-days 7 --output ops/audit/compact
```

Then clean up old originals:

```bash
python3 ops/scripts/audit_cleanup.py \
  --audit-dir ops/audit --retention-days 30
```

### Retention enforcement

JSONL retention is enforced by the daily `audit_cleanup` job (cron 03:30). The policy is defined in `config/data_governance_policy.yml`:

```yaml
retention:
  audit_jsonl_days: 30
  audit_postgres_days: 90
```

Postgres retention, if needed, must be managed separately, either with a scheduled `DELETE FROM tool_audit_events WHERE ts < NOW() - INTERVAL '90 days'` job or with pg_partman.

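A scheduled Postgres purge could look like the sketch below. It is not part of the existing job set: the function name `purge_old_events` is hypothetical, `conn` is assumed to be a DB-API connection from a driver that uses `%s` placeholders (psycopg does; other drivers may not), and the cutoff is computed in Python rather than with `NOW() - INTERVAL`. Because the router's database user is limited to `INSERT`/`SELECT`, a job like this would need to run under a separate role that is allowed to `DELETE`.

```python
from datetime import datetime, timedelta, timezone


def purge_old_events(conn, retention_days=90, now=None):
    """Delete audit rows older than the retention window; return the row count."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    cur = conn.cursor()
    try:
        # Parameterized so the cutoff is never interpolated into the SQL text.
        cur.execute("DELETE FROM tool_audit_events WHERE ts < %s", (cutoff,))
        deleted = cur.rowcount
    finally:
        cur.close()
    conn.commit()
    return deleted
```
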
---

## 7. Security Notes

- No PII or payload is stored in `tool_audit_events`: only sizes, hashes, and metadata.
- `DATABASE_URL` must use a restricted user with `INSERT`/`SELECT` on `tool_audit_events` only (plus `USAGE` on the sequence behind the `BIGSERIAL` `id` column, which Postgres requires for inserts).
- JSONL fallback files inherit filesystem permissions; ensure the directory is `chmod 700`.

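The directory permission in the last note can be verified from a script. This is a small sketch for POSIX systems; the helper name `is_private_dir` is hypothetical, and it checks only that no group or other permission bits are set, which `chmod 700` satisfies.

```python
import os
import stat


def is_private_dir(path):
    """True if path is a directory with no group/other permission bits."""
    st = os.stat(path)
    mode = stat.S_IMODE(st.st_mode)
    return stat.S_ISDIR(st.st_mode) and (mode & 0o077) == 0
```

For example, `is_private_dir("/var/log/daarion/audit")` should return `True` once the directory has been locked down.
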