Runbook: sofiia-supervisor (NODA2)
Service: sofiia-supervisor + sofiia-redis
Host: NODA2 | External port: 8084
Escalation: #platform-ops → @platform-oncall
Health Check
# Basic health
curl -sf http://localhost:8084/healthz && echo OK
# Expected response:
# {"status":"ok","service":"sofiia-supervisor","graphs":["release_check","incident_triage"],
# "state_backend":"redis","gateway_url":"http://router:8000"}
# Redis health
docker exec sofiia-redis redis-cli ping
# Expected: PONG
Logs
# Supervisor logs (last 100 lines)
docker logs sofiia-supervisor --tail 100 -f
# Filter tool call events (no payload)
docker logs sofiia-supervisor 2>&1 | grep "gateway_call\|gateway_ok\|gateway_tool_fail"
# Redis logs
docker logs sofiia-redis --tail 50
# All supervisor logs to file
docker logs sofiia-supervisor > /tmp/supervisor-$(date +%Y%m%d-%H%M%S).log 2>&1
Log format:
2026-02-23T10:00:01Z [INFO] gateway_call tool=job_orchestrator_tool action=start_task node=start_job run=gr_abc123 hash=d4e5f6 size=312 attempt=1
2026-02-23T10:00:02Z [INFO] gateway_ok tool=job_orchestrator_tool node=start_job run=gr_abc123 elapsed_ms=145
Payload is NEVER logged. Only: tool name, action, node, run_id, input hash, size, elapsed time.
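Because `elapsed_ms` is part of the structured log line, slow tool calls can be surfaced with a plain awk filter. The 1000 ms threshold below is an illustrative choice, not a documented SLO:

```shell
# Print gateway calls slower than 1000 ms (threshold is illustrative).
# Splitting on "elapsed_ms=" makes $2 start with the numeric value.
docker logs sofiia-supervisor 2>&1 \
  | awk -F'elapsed_ms=' 'NF > 1 && ($2 + 0) > 1000'
```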
Restart
# Graceful restart (in-flight runs will fail → status=failed in Redis)
docker compose -f docker-compose.node2-sofiia-supervisor.yml restart sofiia-supervisor
# Full restart with rebuild (after code changes)
docker compose -f docker-compose.node2-sofiia-supervisor.yml \
up -d --build sofiia-supervisor
# Check container status after restart
docker ps --filter name=sofiia-supervisor --format "table {{.Names}}\t{{.Status}}"
Start / Stop
# Start (attached to dagi-network-node2)
docker compose \
-f docker-compose.node2.yml \
-f docker-compose.node2-sofiia-supervisor.yml \
up -d sofiia-supervisor sofiia-redis
# Stop (preserves Redis data)
docker compose -f docker-compose.node2-sofiia-supervisor.yml stop sofiia-supervisor
# Stop + remove containers (keeps volumes)
docker compose -f docker-compose.node2-sofiia-supervisor.yml down
# Full teardown (removes volumes — DESTROYS run history)
docker compose -f docker-compose.node2-sofiia-supervisor.yml down -v
State Cleanup
# Connect to Redis
docker exec -it sofiia-redis redis-cli
# List all run keys
127.0.0.1:6379> KEYS run:*
# Check a specific run
127.0.0.1:6379> GET run:gr_abc123
# Check run TTL (seconds until expiry)
127.0.0.1:6379> TTL run:gr_abc123
# Manually delete a stuck/stale run
127.0.0.1:6379> DEL run:gr_abc123 run:gr_abc123:events
# Count all active runs
127.0.0.1:6379> DBSIZE
# Flush all run data (CAUTION: destroys all history)
# 127.0.0.1:6379> FLUSHDB
# Exit
127.0.0.1:6379> EXIT
Default TTL: RUN_TTL_SEC=86400 (24h). Runs auto-expire.
Common Issues
sofiia-supervisor can't reach router
# Check network
docker exec sofiia-supervisor curl -sf http://router:8000/healthz
# If fails: verify router is on dagi-network-node2
docker network inspect dagi-network-node2 | grep -A3 router
Fix: Ensure both services are on dagi-network-node2 (see compose networks section).
Run stuck in running status
Cause: Graph crashed mid-execution or supervisor was restarted.
# Manually cancel via API
curl -X POST http://localhost:8084/v1/runs/gr_STUCK_ID/cancel
# Or force-set status in Redis
docker exec -it sofiia-redis redis-cli
> GET run:gr_STUCK_ID
> SET run:gr_STUCK_ID '{"run_id":"gr_STUCK_ID","graph":"release_check","status":"failed",...}'
> EXIT
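To find every run still marked running (for example after a restart), the run keys can be scanned and their JSON grepped for the status field. A sketch, assuming the status is stored exactly as `"status":"running"` per the example above:

```shell
# List run keys whose stored state still says "running" (cancel candidates).
# Skips the companion ":events" keys; uses SCAN (non-blocking) rather than KEYS.
docker exec sofiia-redis redis-cli --scan --pattern 'run:*' \
  | grep -v ':events$' \
  | while read -r key; do
      if docker exec sofiia-redis redis-cli GET "$key" \
           | grep -q '"status":"running"'; then
        echo "$key"
      fi
    done
```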
Redis connection error
docker logs sofiia-supervisor 2>&1 | grep "Redis connection error"
# Check Redis is running
docker ps --filter name=sofiia-redis
# Restart Redis (data preserved in volume)
docker compose -f docker-compose.node2-sofiia-supervisor.yml restart sofiia-redis
# Test connection by service name (verifies DNS on the compose network, not just localhost)
docker exec sofiia-redis redis-cli -h sofiia-redis ping
High memory on Redis
# Check memory usage
docker exec sofiia-redis redis-cli info memory | grep used_memory_human
# Redis is configured with maxmemory=256mb + allkeys-lru policy
# Old runs will be evicted automatically
# Manual cleanup of old runs (older than 12h):
# Write a cleanup script or reduce RUN_TTL_SEC in .env
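One way to sketch that cleanup: runs are written with RUN_TTL_SEC=86400, so a key's age is roughly RUN_TTL_SEC minus its remaining TTL, and "older than 12h" means TTL below 43200 s. This is a hypothetical script, not one shipped in the repo:

```shell
#!/usr/bin/env sh
# Hypothetical cleanup: delete run keys older than 12h, with age inferred
# from remaining TTL (age ~ RUN_TTL_SEC - TTL).
RUN_TTL_SEC=86400
CUTOFF=$((RUN_TTL_SEC - 12 * 3600))   # = 43200

docker exec sofiia-redis redis-cli --scan --pattern 'run:*' \
  | while read -r key; do
      ttl=$(docker exec sofiia-redis redis-cli TTL "$key")
      # TTL returns -1 (no expiry) or -2 (missing key); only touch expiring keys.
      if [ "$ttl" -ge 0 ] && [ "$ttl" -lt "$CUTOFF" ]; then
        echo "deleting $key (ttl=${ttl}s)"
        docker exec sofiia-redis redis-cli DEL "$key" > /dev/null
      fi
    done
```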
Gateway returns 401 Unauthorized
Cause: SUPERVISOR_API_KEY mismatch between supervisor and router.
# Check env
docker exec sofiia-supervisor env | grep SUPERVISOR_API_KEY
# Compare with router
docker exec dagi-router-node2 env | grep SUPERVISOR_API_KEY
Both must match. Set via SUPERVISOR_API_KEY=... in docker-compose or .env.
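Rather than eyeballing the two values, they can be compared directly without printing the secret to the terminal. A small sketch:

```shell
# Compare SUPERVISOR_API_KEY across the two containers without echoing it.
a=$(docker exec sofiia-supervisor printenv SUPERVISOR_API_KEY)
b=$(docker exec dagi-router-node2 printenv SUPERVISOR_API_KEY)
if [ "$a" = "$b" ]; then echo "keys MATCH"; else echo "keys MISMATCH"; fi
```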
Metrics / Monitoring
Currently no dedicated metrics endpoint. Monitor via:
- /healthz — service up/down
- Docker stats — docker stats sofiia-supervisor sofiia-redis
- Log patterns — gateway_ok, gateway_tool_fail, run_graph error
Planned: Prometheus /metrics endpoint with run counts per graph/status.
Upgrade
# Pull new image (if using registry)
docker pull daarion/sofiia-supervisor:latest
# Or rebuild from source
cd /path/to/microdao-daarion
docker compose -f docker-compose.node2-sofiia-supervisor.yml \
build --no-cache sofiia-supervisor
# Rolling restart (zero-downtime is NOT guaranteed — single instance)
docker compose -f docker-compose.node2-sofiia-supervisor.yml \
up -d sofiia-supervisor
Available Graphs
| Graph | Description | Key nodes |
|---|---|---|
| release_check | Release validation pipeline | jobs → poll → result |
| incident_triage | Collect observability + KB + SLO/privacy/cost context | overview → logs → health → traces → slo_context → privacy → cost → report |
| postmortem_draft | Generate postmortem from incident | load_incident → ensure_triage → draft → attach_artifacts → followups |
postmortem_draft (new)
curl -X POST http://localhost:8084/v1/graphs/postmortem_draft/runs \
-H "Content-Type: application/json" \
-d '{"agent_id":"sofiia","input":{"incident_id":"inc_..."}}'
Generates markdown + JSON postmortem, attaches as incident artifacts, and appends follow-up timeline events. See docs/supervisor/postmortem_draft_graph.md.
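For scripting, the run id can be pulled out of the POST response and polled until completion. In this sketch, "inc_example" is a placeholder incident id, and the GET /v1/runs/<run_id> status route is an assumption inferred from the cancel route above; verify it against the actual API:

```shell
# Start a postmortem_draft run and extract run_id from the JSON response.
# Assumes the response contains "run_id":"gr_..." as in the log examples.
resp=$(curl -sf -X POST http://localhost:8084/v1/graphs/postmortem_draft/runs \
  -H "Content-Type: application/json" \
  -d '{"agent_id":"sofiia","input":{"incident_id":"inc_example"}}')
run_id=$(printf '%s' "$resp" | sed -n 's/.*"run_id":"\([^"]*\)".*/\1/p')
echo "run_id=$run_id"

# Poll the (assumed) status route until the run leaves "running";
# 3s matches the job poll interval used by release_check.
while curl -sf "http://localhost:8084/v1/runs/$run_id" | grep -q '"status":"running"'; do
  sleep 3
done
```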
Known Limitations (MVP)
- Single worker (--workers 1) — graph runs are sequential per process. For concurrent load, increase the worker count; the Redis state backend keeps runs consistent across workers.
- No LangGraph checkpointing — runs interrupted by a restart show as failed; they do not resume.
- Polling-based job status — release_check polls job_orchestrator_tool every 3s. Tune JOB_POLL_INTERVAL_SEC if needed.
- In-flight cancellation — cancel sets the status in Redis but cannot interrupt an already-executing tool call; cancellation takes effect between nodes.