# Runbook: sofiia-supervisor (NODA2)

**Service**: `sofiia-supervisor` + `sofiia-redis`

**Host**: NODA2 | **External port**: 8084

**Escalation**: #platform-ops → @platform-oncall

---

## Health Check

```bash
# Basic health
curl -sf http://localhost:8084/healthz && echo OK

# Expected response:
# {"status":"ok","service":"sofiia-supervisor","graphs":["release_check","incident_triage"],
#  "state_backend":"redis","gateway_url":"http://router:8000"}

# Redis health
docker exec sofiia-redis redis-cli ping
# Expected: PONG
```

---

## Logs

```bash
# Supervisor logs (last 100 lines)
docker logs sofiia-supervisor --tail 100 -f

# Filter tool call events (no payload)
docker logs sofiia-supervisor 2>&1 | grep "gateway_call\|gateway_ok\|gateway_tool_fail"

# Redis logs
docker logs sofiia-redis --tail 50

# All supervisor logs to file
docker logs sofiia-supervisor > /tmp/supervisor-$(date +%Y%m%d-%H%M%S).log 2>&1
```

Log format:

```
2026-02-23T10:00:01Z [INFO] gateway_call tool=job_orchestrator_tool action=start_task node=start_job run=gr_abc123 hash=d4e5f6 size=312 attempt=1
2026-02-23T10:00:02Z [INFO] gateway_ok tool=job_orchestrator_tool node=start_job run=gr_abc123 elapsed_ms=145
```

**Payload is NEVER logged.** Only: tool name, action, node, run_id, input hash, size, elapsed time.
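
Because every field in the log format above is a `key=value` token, per-tool latency can be summarised straight from the log stream. The following is a minimal sketch, not part of the service: the function name `summarize_gateway_latency` is hypothetical, and it assumes `gateway_ok` lines always carry `tool=` and `elapsed_ms=` as shown in the sample.

```bash
# Hypothetical helper: reads supervisor log lines on stdin and prints,
# per tool, the number of gateway_ok events and their mean elapsed_ms.
summarize_gateway_latency() {
  awk '
    / gateway_ok / {
      tool = ""; ms = ""
      # Pick the tool= and elapsed_ms= tokens out of the line.
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^tool=/)       tool = substr($i, 6)
        if ($i ~ /^elapsed_ms=/) ms   = substr($i, 12)
      }
      if (tool != "" && ms != "") { n[tool]++; sum[tool] += ms }
    }
    END {
      for (t in n) printf "%s calls=%d avg_ms=%.0f\n", t, n[t], sum[t] / n[t]
    }
  '
}

# Usage (container name as used in this runbook):
#   docker logs sofiia-supervisor 2>&1 | summarize_gateway_latency
```

Lines that are not `gateway_ok` events are ignored, so the full log can be piped in without pre-filtering.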

---

## Restart

```bash
# Graceful restart (in-flight runs will fail → status=failed in Redis)
docker compose -f docker-compose.node2-sofiia-supervisor.yml restart sofiia-supervisor

# Full restart with rebuild (after code changes)
docker compose -f docker-compose.node2-sofiia-supervisor.yml \
  up -d --build sofiia-supervisor

# Check container status after restart
docker ps --filter name=sofiia-supervisor --format "table {{.Names}}\t{{.Status}}"
```

---

## Start / Stop

```bash
# Start (attached to dagi-network-node2)
docker compose \
  -f docker-compose.node2.yml \
  -f docker-compose.node2-sofiia-supervisor.yml \
  up -d sofiia-supervisor sofiia-redis

# Stop (preserves Redis data)
docker compose -f docker-compose.node2-sofiia-supervisor.yml stop sofiia-supervisor

# Stop + remove containers (keeps volumes)
docker compose -f docker-compose.node2-sofiia-supervisor.yml down

# Full teardown (removes volumes: DESTROYS run history)
docker compose -f docker-compose.node2-sofiia-supervisor.yml down -v
```

---

## State Cleanup

```bash
# Connect to Redis
docker exec -it sofiia-redis redis-cli

# List all run keys (KEYS blocks Redis while it scans; acceptable for small keyspaces)
127.0.0.1:6379> KEYS run:*

# Check a specific run
127.0.0.1:6379> GET run:gr_abc123

# Check run TTL (seconds until expiry)
127.0.0.1:6379> TTL run:gr_abc123

# Manually delete a stuck/stale run
127.0.0.1:6379> DEL run:gr_abc123 run:gr_abc123:events

# Count all keys in the DB (run records plus their :events keys,
# including finished runs that have not expired yet)
127.0.0.1:6379> DBSIZE

# Flush all run data (CAUTION: destroys all history)
# 127.0.0.1:6379> FLUSHDB

# Exit
127.0.0.1:6379> EXIT
```

Default TTL: `RUN_TTL_SEC=86400` (24h). Runs auto-expire.

---

## Common Issues

### `sofiia-supervisor` can't reach router

```bash
# Check network
docker exec sofiia-supervisor curl -sf http://router:8000/healthz

# If this fails: verify router is on dagi-network-node2
docker network inspect dagi-network-node2 | grep -A3 router
```

**Fix**: Ensure both services are on `dagi-network-node2` (see compose `networks` section).
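
The network check above can be made less eyeball-dependent by comparing the attached container names against the ones that should be there. This is a rough sketch only: `missing_from_network` is a hypothetical helper, and the container names in the usage comment are the ones that appear elsewhere in this runbook, so adjust them if your deployment differs.

```bash
# Hypothetical helper: stdin is a whitespace-separated list of container
# names attached to a network; arguments are the names that must be present.
# Prints each required name that is missing (prints nothing if all are there).
missing_from_network() {
  # Normalise whitespace and pad with spaces so whole names can be matched.
  attached=" $(tr -s '[:space:]' ' ') "
  for name in "$@"; do
    case "$attached" in
      *" $name "*) ;;                    # attached, nothing to report
      *) printf '%s\n' "$name" ;;        # required but not attached
    esac
  done
}

# Usage:
#   docker network inspect dagi-network-node2 \
#     --format '{{range .Containers}}{{.Name}} {{end}}' \
#     | missing_from_network sofiia-supervisor sofiia-redis dagi-router-node2
```

Empty output means every listed container is on the network; any printed name is a candidate for the fix described above.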

---

### Run stuck in `running` status

Cause: Graph crashed mid-execution or supervisor was restarted.

```bash
# Manually cancel via API
curl -X POST http://localhost:8084/v1/runs/gr_STUCK_ID/cancel

# Or force-set status in Redis
# Note: a plain SET clears the key's TTL; append KEEPTTL (Redis 6.0+) to keep the expiry
docker exec -it sofiia-redis redis-cli
> GET run:gr_STUCK_ID
> SET run:gr_STUCK_ID '{"run_id":"gr_STUCK_ID","graph":"release_check","status":"failed",...}'
> EXIT
```

---

### Redis connection error

```bash
docker logs sofiia-supervisor 2>&1 | grep "Redis connection error"

# Check Redis is running
docker ps --filter name=sofiia-redis

# Restart Redis (data preserved in volume)
docker compose -f docker-compose.node2-sofiia-supervisor.yml restart sofiia-redis

# Test connection
docker exec sofiia-redis redis-cli -h sofiia-redis ping
```

---

### High memory on Redis

```bash
# Check memory usage
docker exec sofiia-redis redis-cli info memory | grep used_memory_human

# Redis is configured with maxmemory=256mb + allkeys-lru policy
# Once the limit is reached, least-recently-used keys are evicted automatically

# Manual cleanup of old runs (older than 12h):
# Write a cleanup script or reduce RUN_TTL_SEC in .env
```

---

### Gateway returns 401 Unauthorized

Cause: `SUPERVISOR_API_KEY` mismatch between supervisor and router.

```bash
# Check env
docker exec sofiia-supervisor env | grep SUPERVISOR_API_KEY

# Compare with router
docker exec dagi-router-node2 env | grep SUPERVISOR_API_KEY
```

Both must match. Set via `SUPERVISOR_API_KEY=...` in docker-compose or `.env`.

---

## Metrics / Monitoring

There is currently no dedicated metrics endpoint. Monitor via:

1. **`/healthz`**: service up/down
2. **Docker stats**: `docker stats sofiia-supervisor sofiia-redis`
3. **Log patterns**: `gateway_ok`, `gateway_tool_fail`, `run_graph error`

Planned: Prometheus `/metrics` endpoint with run counts per graph/status.
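
Until a metrics endpoint exists, the log patterns listed above can be turned into rough counters. The sketch below is one possible approach, with a hypothetical function name `count_log_patterns`. It matches on substrings only, because this runbook does not show the exact layout of `gateway_tool_fail` or `run_graph error` lines.

```bash
# Hypothetical helper: reads supervisor log lines on stdin and prints how
# many lines contain each of the three monitored patterns.
count_log_patterns() {
  awk '
    /gateway_ok/        { ok++ }
    /gateway_tool_fail/ { fail++ }
    /run_graph error/   { err++ }
    END {
      # Counters that never matched are unset and print as 0 with %d.
      printf "gateway_ok=%d gateway_tool_fail=%d run_graph_error=%d\n", ok, fail, err
    }
  '
}

# Usage: counts over the last hour of logs
#   docker logs sofiia-supervisor --since 1h 2>&1 | count_log_patterns
```

A rising `gateway_tool_fail` count relative to `gateway_ok` is a reasonable trigger for checking the router and API key sections above.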

---

## Upgrade

```bash
# Pull new image (if using registry)
docker pull daarion/sofiia-supervisor:latest

# Or rebuild from source
cd /path/to/microdao-daarion
docker compose -f docker-compose.node2-sofiia-supervisor.yml \
  build --no-cache sofiia-supervisor

# Recreate the container on the new image
# (zero downtime is NOT guaranteed: single instance)
docker compose -f docker-compose.node2-sofiia-supervisor.yml \
  up -d sofiia-supervisor
```

---

## Available Graphs

| Graph | Description | Key nodes |
|-------|-------------|-----------|
| `release_check` | Release validation pipeline | jobs → poll → result |
| `incident_triage` | Collect observability + KB + SLO/privacy/cost context | overview → logs → health → traces → slo_context → privacy → cost → report |
| `postmortem_draft` | Generate postmortem from incident | load_incident → ensure_triage → draft → attach_artifacts → followups |

### postmortem_draft (new)

```bash
curl -X POST http://localhost:8084/v1/graphs/postmortem_draft/runs \
  -H "Content-Type: application/json" \
  -d '{"agent_id":"sofiia","input":{"incident_id":"inc_..."}}'
```

Generates a markdown + JSON postmortem, attaches both as incident artifacts, and appends follow-up timeline events. See `docs/supervisor/postmortem_draft_graph.md`.

---

## Known Limitations (MVP)

1. **Single worker** (`--workers 1`): graph runs are sequential per process. For concurrent load, increase the worker count; run state is kept in Redis, which handles consistency across workers.
2. **No LangGraph checkpointing**: runs interrupted by a restart will show as `failed`; they do not resume.
3. **Polling-based job status**: `release_check` polls `job_orchestrator_tool` every 3s. Tune `JOB_POLL_INTERVAL_SEC` if needed.
4. **In-flight cancellation**: `cancel` sets the status in Redis but cannot interrupt a tool call that is already executing. Cancellation takes effect between nodes.
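
Since interrupted runs do not resume (limitation 2), it can be worth checking for runs still in `running` status before a restart or upgrade. The following is a sketch under assumptions: `count_running_runs` is a hypothetical helper, and it assumes each `run:<id>` value is a single-line JSON string with a `"status"` field, as the `GET`/`SET` examples in this runbook suggest.

```bash
# Hypothetical helper: reads run JSON documents on stdin (one per line)
# and prints how many of them have "status":"running".
count_running_runs() {
  # grep -c prints 0 but exits non-zero when nothing matches; "|| true"
  # keeps the function usable under "set -e".
  grep -c '"status"[[:space:]]*:[[:space:]]*"running"' || true
}

# Usage: skip the :events keys, which may not be plain strings
#   docker exec sofiia-redis redis-cli --scan --pattern 'run:*' \
#     | grep -v ':events$' \
#     | while read -r key; do docker exec sofiia-redis redis-cli GET "$key"; done \
#     | count_running_runs
```

A non-zero count means a restart would mark those runs as `failed`; waiting for them to finish or cancelling them first avoids surprises.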