microdao-daarion/ops/runbook-sofiia-supervisor.md
Apple 67225a39fa docs(platform): add policy configs, runbooks, ops scripts and platform documentation
2026-03-03 07:14:53 -08:00


# Runbook: sofiia-supervisor (NODA2)

**Service**: `sofiia-supervisor` + `sofiia-redis`

**Host**: NODA2 | **External port**: 8084

**Escalation**: #platform-ops → @platform-oncall

---
## Health Check
```bash
# Basic health
curl -sf http://localhost:8084/healthz && echo OK
# Expected response:
# {"status":"ok","service":"sofiia-supervisor","graphs":["release_check","incident_triage"],
# "state_backend":"redis","gateway_url":"http://router:8000"}
# Redis health
docker exec sofiia-redis redis-cli ping
# Expected: PONG
```
---
## Logs
```bash
# Supervisor logs (last 100 lines, then keep following; Ctrl-C to stop)
docker logs sofiia-supervisor --tail 100 -f
# Filter tool call events (no payload)
docker logs sofiia-supervisor 2>&1 | grep "gateway_call\|gateway_ok\|gateway_tool_fail"
# Redis logs
docker logs sofiia-redis --tail 50
# All supervisor logs to file
docker logs sofiia-supervisor > /tmp/supervisor-$(date +%Y%m%d-%H%M%S).log 2>&1
```
Log format:
```
2026-02-23T10:00:01Z [INFO] gateway_call tool=job_orchestrator_tool action=start_task node=start_job run=gr_abc123 hash=d4e5f6 size=312 attempt=1
2026-02-23T10:00:02Z [INFO] gateway_ok tool=job_orchestrator_tool node=start_job run=gr_abc123 elapsed_ms=145
```
**Payload is NEVER logged.** Only: tool name, action, node, run_id, input hash, size, elapsed time.
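Because each event is a single line of `key=value` fields, latency can be summarised straight from the log stream. The following is a minimal sketch, assuming the event name is always the third whitespace-separated field, as in the sample lines above. `gateway_latency` is a hypothetical helper name, not part of the service.

```bash
# Hypothetical helper: per-tool latency summary from supervisor log lines on stdin.
# Usage: docker logs sofiia-supervisor 2>&1 | gateway_latency
gateway_latency() {
  awk '
    $3 == "gateway_ok" {
      tool = ""; ms = ""
      # Walk the key=value fields that follow the event name.
      for (i = 4; i <= NF; i++) {
        split($i, kv, "=")
        if (kv[1] == "tool")       tool = kv[2]
        if (kv[1] == "elapsed_ms") ms   = kv[2]
      }
      if (tool != "" && ms != "") {
        n[tool]++
        sum[tool] += ms
        if (ms + 0 > max[tool]) max[tool] = ms + 0
      }
    }
    END {
      for (t in n)
        printf "%s count=%d avg_ms=%.0f max_ms=%d\n", t, n[t], sum[t] / n[t], max[t]
    }
  '
}
```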
---
## Restart
```bash
# Graceful restart (in-flight runs will fail → status=failed in Redis)
docker compose -f docker-compose.node2-sofiia-supervisor.yml restart sofiia-supervisor
# Full restart with rebuild (after code changes)
docker compose -f docker-compose.node2-sofiia-supervisor.yml \
up -d --build sofiia-supervisor
# Check container status after restart
docker ps --filter name=sofiia-supervisor --format "table {{.Names}}\t{{.Status}}"
```
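After a restart the container can report `Up` before the service answers requests, so it may help to poll `/healthz` until it responds. This is a sketch only: `wait_healthy` and the `HEALTH_URL` variable are hypothetical names, and the retry count and delay are arbitrary defaults.

```bash
# Hypothetical helper: poll /healthz until it answers or the attempts run out.
# Usage: wait_healthy [attempts] [delay_seconds]
HEALTH_URL="${HEALTH_URL:-http://localhost:8084/healthz}"

wait_healthy() {
  attempts="${1:-10}"
  delay="${2:-3}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -s silences progress output, -f makes HTTP errors return non-zero.
    if curl -sf "$HEALTH_URL" >/dev/null; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not healthy after $attempts attempts" >&2
  return 1
}
```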
---
## Start / Stop
```bash
# Start (attached to dagi-network-node2)
docker compose \
-f docker-compose.node2.yml \
-f docker-compose.node2-sofiia-supervisor.yml \
up -d sofiia-supervisor sofiia-redis
# Stop (preserves Redis data)
docker compose -f docker-compose.node2-sofiia-supervisor.yml stop sofiia-supervisor
# Stop + remove containers (keeps volumes)
docker compose -f docker-compose.node2-sofiia-supervisor.yml down
# Full teardown (removes volumes — DESTROYS run history)
docker compose -f docker-compose.node2-sofiia-supervisor.yml down -v
```
---
## State Cleanup
```bash
# Connect to Redis
docker exec -it sofiia-redis redis-cli
# List all run keys (KEYS blocks Redis while it scans; fine for a small keyspace, prefer SCAN otherwise)
127.0.0.1:6379> KEYS run:*
# Check a specific run
127.0.0.1:6379> GET run:gr_abc123
# Check run TTL (seconds until expiry)
127.0.0.1:6379> TTL run:gr_abc123
# Manually delete a stuck/stale run
127.0.0.1:6379> DEL run:gr_abc123 run:gr_abc123:events
# Count all keys in the DB (includes run:*:events keys, so this is an upper bound on the number of runs)
127.0.0.1:6379> DBSIZE
# Flush all run data (CAUTION: destroys all history)
# 127.0.0.1:6379> FLUSHDB
# Exit
127.0.0.1:6379> EXIT
```
Default TTL: `RUN_TTL_SEC=86400` (24h). Runs auto-expire.
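For an exact run count that does not block Redis, the keyspace can be walked with `redis-cli --scan`. A small sketch, assuming every run is stored as one `run:<id>` key plus an optional `run:<id>:events` key, as the `DEL` example above suggests. `count_runs` is a hypothetical helper name.

```bash
# Hypothetical helper: count runs without KEYS or DBSIZE.
count_runs() {
  docker exec sofiia-redis redis-cli --scan --pattern 'run:*' |
    grep -v ':events$' |   # drop the per-run event keys
    wc -l |
    tr -d ' '              # some wc implementations pad the number with spaces
}
```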
---
## Common Issues
### `sofiia-supervisor` can't reach router
```bash
# Check network
docker exec sofiia-supervisor curl -sf http://router:8000/healthz
# If fails: verify router is on dagi-network-node2
docker network inspect dagi-network-node2 | grep -A3 router
```
**Fix**: Ensure both services are on `dagi-network-node2` (see compose `networks` section).
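Network membership can also be checked explicitly rather than by grepping the JSON. The sketch below assumes the router container is named `dagi-router-node2` (the name used elsewhere in this runbook). `network_members` and `check_network` are hypothetical helper names.

```bash
# Hypothetical helpers: verify that both containers are attached to the network.
network_members() {
  # Print one attached container name per line.
  docker network inspect "$1" --format '{{range .Containers}}{{.Name}}{{"\n"}}{{end}}'
}

check_network() {
  net="${1:-dagi-network-node2}"
  members=$(network_members "$net")
  missing=0
  for c in sofiia-supervisor dagi-router-node2; do
    if printf '%s\n' "$members" | grep -qx "$c"; then
      echo "ok $c"
    else
      echo "MISSING $c"
      missing=1
    fi
  done
  return "$missing"
}
```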
---
### Run stuck in `running` status
Cause: Graph crashed mid-execution or supervisor was restarted.
```bash
# Manually cancel via API
curl -X POST http://localhost:8084/v1/runs/gr_STUCK_ID/cancel
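# Or force-set status in Redis
# Note: a plain SET clears the key's TTL; re-apply it with EXPIRE or use SET ... KEEPTTL (Redis 6.0+)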
docker exec -it sofiia-redis redis-cli
> GET run:gr_STUCK_ID
> SET run:gr_STUCK_ID '{"run_id":"gr_STUCK_ID","graph":"release_check","status":"failed",...}'
> EXIT
```
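If the cancel API does not clear the run, the manual edit can be scripted so the stored JSON is changed only in its `status` field and the TTL survives. This is a sketch under assumptions: the run document contains a `"status":"running"` pair, Redis is 6.0 or newer (for `KEEPTTL`), and both helper names are hypothetical. A `sed` substitution is used for brevity; a JSON-aware tool would be more robust.

```bash
# Hypothetical helper: rewrite "status":"running" to "status":"failed" on stdin.
mark_failed_json() {
  sed -E 's/("status"[[:space:]]*:[[:space:]]*)"running"/\1"failed"/'
}

# Hypothetical helper: force a stuck run to failed while keeping its TTL.
# Usage: force_fail_run gr_STUCK_ID
force_fail_run() {
  key="run:$1"
  current=$(docker exec sofiia-redis redis-cli GET "$key" </dev/null)
  if [ -z "$current" ]; then
    echo "no such run: $key" >&2
    return 1
  fi
  patched=$(printf '%s' "$current" | mark_failed_json)
  # KEEPTTL preserves the remaining expiry (assumes Redis >= 6.0).
  docker exec sofiia-redis redis-cli SET "$key" "$patched" KEEPTTL </dev/null
}
```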
---
### Redis connection error
```bash
docker logs sofiia-supervisor 2>&1 | grep "Redis connection error"
# Check Redis is running
docker ps --filter name=sofiia-redis
# Restart Redis (data preserved in volume)
docker compose -f docker-compose.node2-sofiia-supervisor.yml restart sofiia-redis
# Test connection
docker exec sofiia-redis redis-cli -h sofiia-redis ping
```
---
### High memory on Redis
```bash
# Check memory usage
docker exec sofiia-redis redis-cli info memory | grep used_memory_human
# Redis is configured with maxmemory=256mb + allkeys-lru policy
# Once the limit is reached, the least recently used keys are evicted automatically
# Manual cleanup of old runs (older than 12h):
# Write a cleanup script or reduce RUN_TTL_SEC in .env
```
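One way to write the cleanup script mentioned above is to derive each run's age from its remaining TTL. A minimal sketch, assuming the TTL is set once at creation to `RUN_TTL_SEC` and never refreshed. `cleanup_old_runs`, `MAX_AGE_SEC`, and `DRY_RUN` are hypothetical names, and the script defaults to a dry run.

```bash
# Hypothetical cleanup: delete runs older than MAX_AGE_SEC (default 12h).
RUN_TTL_SEC="${RUN_TTL_SEC:-86400}"
MAX_AGE_SEC="${MAX_AGE_SEC:-43200}"
DRY_RUN="${DRY_RUN:-1}"   # set DRY_RUN=0 to actually delete

cleanup_old_runs() {
  docker exec sofiia-redis redis-cli --scan --pattern 'run:*' </dev/null |
    grep -v ':events$' |
    while read -r key; do
      ttl=$(docker exec sofiia-redis redis-cli TTL "$key" </dev/null)
      # Skip keys without expiry (-1), missing keys (-2), or unexpected output.
      [ "$ttl" -gt 0 ] 2>/dev/null || continue
      age=$((RUN_TTL_SEC - ttl))
      [ "$age" -ge "$MAX_AGE_SEC" ] || continue
      if [ "$DRY_RUN" = "1" ]; then
        echo "would delete $key (age ${age}s)"
      else
        echo "deleting $key (age ${age}s)"
        docker exec sofiia-redis redis-cli DEL "$key" "$key:events" </dev/null >/dev/null
      fi
    done
}
```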
---
### Gateway returns 401 Unauthorized
Cause: `SUPERVISOR_API_KEY` mismatch between supervisor and router.
```bash
# Check env
docker exec sofiia-supervisor env | grep SUPERVISOR_API_KEY
# Compare with router
docker exec dagi-router-node2 env | grep SUPERVISOR_API_KEY
```
Both must match. Set via `SUPERVISOR_API_KEY=...` in docker-compose or `.env`.
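To compare the two values without printing the secret to the terminal, short fingerprints can be compared instead. This is a sketch, assuming `printenv` exists in both containers and `sha256sum` on the host. `key_fingerprint` and `compare_api_keys` are hypothetical helper names.

```bash
# Hypothetical helper: short SHA-256 fingerprint of a container's SUPERVISOR_API_KEY.
key_fingerprint() {
  value=$(docker exec "$1" printenv SUPERVISOR_API_KEY </dev/null) || value=""
  if [ -z "$value" ]; then
    echo "unset"
    return 0
  fi
  printf '%s' "$value" | sha256sum | cut -c1-12
}

# Hypothetical helper: report whether supervisor and router hold the same key.
compare_api_keys() {
  a=$(key_fingerprint sofiia-supervisor)
  b=$(key_fingerprint dagi-router-node2)
  if [ "$a" = "$b" ] && [ "$a" != "unset" ]; then
    echo "match ($a)"
  else
    echo "MISMATCH supervisor=$a router=$b"
    return 1
  fi
}
```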
---
## Metrics / Monitoring
Currently no dedicated metrics endpoint. Monitor via:
1. **`/healthz`**: service up/down
2. **Docker stats**: `docker stats sofiia-supervisor sofiia-redis`
3. **Log patterns**: `gateway_ok`, `gateway_tool_fail`, `run_graph error`
Planned: Prometheus `/metrics` endpoint with run counts per graph/status.
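Until that endpoint exists, the log patterns listed above can serve as rough counters. The sketch below simply counts matching lines on stdin. `log_event_counts` is a hypothetical helper name, and the exact wording of the failure lines is assumed to contain the patterns as written.

```bash
# Hypothetical helper: count monitored log patterns from stdin.
# Usage: docker logs --since 15m sofiia-supervisor 2>&1 | log_event_counts
log_event_counts() {
  awk '
    /gateway_ok/        { ok++ }
    /gateway_tool_fail/ { fail++ }
    /run_graph error/   { err++ }
    END {
      printf "gateway_ok=%d gateway_tool_fail=%d run_graph_error=%d\n", ok, fail, err
    }
  '
}
```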
---
## Upgrade
```bash
# Pull new image (if using registry)
docker pull daarion/sofiia-supervisor:latest
# Or rebuild from source
cd /path/to/microdao-daarion
docker compose -f docker-compose.node2-sofiia-supervisor.yml \
build --no-cache sofiia-supervisor
# Recreate the container with the new image (single instance, so zero downtime is NOT guaranteed)
docker compose -f docker-compose.node2-sofiia-supervisor.yml \
up -d sofiia-supervisor
```
---
## Available Graphs
| Graph | Description | Key nodes |
|-------|-------------|-----------|
| `release_check` | Release validation pipeline | jobs → poll → result |
| `incident_triage` | Collect observability + KB + SLO/privacy/cost context | overview → logs → health → traces → slo_context → privacy → cost → report |
| `postmortem_draft` | Generate postmortem from incident | load_incident → ensure_triage → draft → attach_artifacts → followups |
### postmortem_draft (new)
```bash
curl -X POST http://localhost:8084/v1/graphs/postmortem_draft/runs \
-H "Content-Type: application/json" \
-d '{"agent_id":"sofiia","input":{"incident_id":"inc_..."}}'
```
Generates markdown + JSON postmortem, attaches as incident artifacts, and appends follow-up timeline events. See `docs/supervisor/postmortem_draft_graph.md`.
---
## Known Limitations (MVP)
1. **Single worker** (`--workers 1`): graph runs are sequential per process. For concurrent load, increase the worker count; run state lives in Redis, so it stays consistent across workers.
2. **No LangGraph checkpointing**: runs interrupted by a restart show as `failed`; they do not resume.
3. **Polling-based job status**: `release_check` polls `job_orchestrator_tool` every 3s. Tune `JOB_POLL_INTERVAL_SEC` if needed.
4. **In-flight cancellation**: `cancel` sets the status in Redis but cannot interrupt a tool call that is already executing. Cancellation takes effect between nodes.