# Runbook: sofiia-supervisor (NODA2)

**Service**: `sofiia-supervisor` + `sofiia-redis`
**Host**: NODA2 | **External port**: 8084
**Escalation**: #platform-ops → @platform-oncall

---

## Health Check

```bash
# Basic health
curl -sf http://localhost:8084/healthz && echo OK

# Expected response (the graphs list reflects what is registered; newer builds may also list postmortem_draft):
# {"status":"ok","service":"sofiia-supervisor","graphs":["release_check","incident_triage"],
#  "state_backend":"redis","gateway_url":"http://router:8000"}

# Redis health
docker exec sofiia-redis redis-cli ping
# Expected: PONG
```
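
For a cron job or post-deploy step, the two checks above can be folded into one pass/fail helper. This is a minimal sketch, not part of the service: `health_json_ok` and `check_supervisor` are hypothetical names, the 5-second timeout is an arbitrary choice, and the body is matched with `grep` against the compact JSON shown above rather than parsed.

```bash
#!/usr/bin/env bash
# Return 0 if a /healthz body reports status=ok and a redis state backend.
# Matches the compact JSON form shown above; this is not a JSON parser.
health_json_ok() {
  local body="$1"
  printf '%s' "$body" | grep -q '"status":"ok"' &&
    printf '%s' "$body" | grep -q '"state_backend":"redis"'
}

# Hypothetical wrapper: probe the supervisor, then Redis; one line per check.
check_supervisor() {
  local body
  body="$(curl -sf --max-time 5 http://localhost:8084/healthz)" || {
    echo "supervisor: DOWN"; return 1; }
  health_json_ok "$body" || { echo "supervisor: UNEXPECTED RESPONSE"; return 1; }
  echo "supervisor: OK"
  [ "$(docker exec sofiia-redis redis-cli ping 2>/dev/null)" = "PONG" ] || {
    echo "redis: DOWN"; return 1; }
  echo "redis: OK"
}

# Usage on NODA2:
#   check_supervisor || echo "escalate to #platform-ops"
```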

---

## Logs

```bash
# Supervisor logs (last 100 lines, then follow; Ctrl-C to stop)
docker logs sofiia-supervisor --tail 100 -f

# Filter tool call events (no payload); -E avoids relying on GNU-only \| alternation
docker logs sofiia-supervisor 2>&1 | grep -E "gateway_call|gateway_ok|gateway_tool_fail"

# Redis logs
docker logs sofiia-redis --tail 50

# All supervisor logs to file
docker logs sofiia-supervisor > /tmp/supervisor-$(date +%Y%m%d-%H%M%S).log 2>&1
```

Log format:
```
2026-02-23T10:00:01Z [INFO] gateway_call tool=job_orchestrator_tool action=start_task node=start_job run=gr_abc123 hash=d4e5f6 size=312 attempt=1
2026-02-23T10:00:02Z [INFO] gateway_ok tool=job_orchestrator_tool node=start_job run=gr_abc123 elapsed_ms=145
```

**Payload is NEVER logged.** Only: tool name, action, node, run_id, input hash, size, elapsed time.
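
Because every event is a line of `key=value` tokens, latency can be summarized straight from the logs. The sketch below assumes only the layout shown above (timestamp, level, event name, then pairs); `summarize_gateway_ok` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Read supervisor log lines on stdin. For each tool, print how many
# gateway_ok events were seen and their mean elapsed_ms.
summarize_gateway_ok() {
  awk '
    $3 == "gateway_ok" {
      tool = ""; ms = ""
      for (i = 4; i <= NF; i++) {
        split($i, kv, "=")
        if (kv[1] == "tool") tool = kv[2]
        if (kv[1] == "elapsed_ms") ms = kv[2]
      }
      if (tool != "" && ms != "") { n[tool]++; sum[tool] += ms }
    }
    END {
      for (t in n) printf "%s count=%d mean_ms=%.1f\n", t, n[t], sum[t] / n[t]
    }
  '
}

# Usage on NODA2:
#   docker logs sofiia-supervisor 2>&1 | summarize_gateway_ok
```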

---

## Restart

```bash
# Graceful restart (in-flight runs will fail → status=failed in Redis)
docker compose -f docker-compose.node2-sofiia-supervisor.yml restart sofiia-supervisor

# Full restart with rebuild (after code changes)
docker compose -f docker-compose.node2-sofiia-supervisor.yml \
  up -d --build sofiia-supervisor

# Check container status after restart
docker ps --filter name=sofiia-supervisor --format "table {{.Names}}\t{{.Status}}"
```
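
`docker ps` only shows that the container is up, not that the service is answering. A small retry loop around the `/healthz` probe covers that gap. This is a sketch: `wait_until` is a hypothetical helper, and the 30 s budget and 2 s interval in the usage line are assumptions to tune.

```bash
#!/usr/bin/env bash
# Retry a command until it succeeds or the timeout (seconds) elapses.
# usage: wait_until <timeout_sec> <interval_sec> <command> [args...]
wait_until() {
  local timeout="$1" interval="$2"
  shift 2
  local deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    # Give up once the deadline has passed.
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval"
  done
}

# Usage after a restart:
#   wait_until 30 2 curl -sf -o /dev/null http://localhost:8084/healthz \
#     && echo "supervisor healthy" || echo "supervisor did not come back"
```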

---

## Start / Stop

```bash
# Start (attached to dagi-network-node2)
docker compose \
  -f docker-compose.node2.yml \
  -f docker-compose.node2-sofiia-supervisor.yml \
  up -d sofiia-supervisor sofiia-redis

# Stop (preserves Redis data)
docker compose -f docker-compose.node2-sofiia-supervisor.yml stop sofiia-supervisor

# Stop + remove containers (keeps volumes)
docker compose -f docker-compose.node2-sofiia-supervisor.yml down

# Full teardown (removes volumes — DESTROYS run history)
docker compose -f docker-compose.node2-sofiia-supervisor.yml down -v
```

---

## State Cleanup

```bash
# Connect to Redis
docker exec -it sofiia-redis redis-cli

# List all run keys (KEYS blocks the server while it scans; on a large keyspace prefer SCAN 0 MATCH run:* COUNT 100)
127.0.0.1:6379> KEYS run:*

# Check a specific run
127.0.0.1:6379> GET run:gr_abc123

# Check run TTL (seconds until expiry)
127.0.0.1:6379> TTL run:gr_abc123

# Manually delete a stuck/stale run
127.0.0.1:6379> DEL run:gr_abc123 run:gr_abc123:events

# Count all keys in the DB (includes run:<id>:events keys, so this is an upper bound on the number of runs)
127.0.0.1:6379> DBSIZE

# Flush all run data (CAUTION: destroys all history)
# 127.0.0.1:6379> FLUSHDB

# Exit
127.0.0.1:6379> EXIT
```

Default TTL: `RUN_TTL_SEC=86400` (24h). Runs auto-expire.

---

## Common Issues

### `sofiia-supervisor` can't reach router

```bash
# Check network
docker exec sofiia-supervisor curl -sf http://router:8000/healthz

# If this fails: verify that router is on dagi-network-node2
docker network inspect dagi-network-node2 | grep -A3 router
```

**Fix**: Ensure both services are on `dagi-network-node2` (see compose `networks` section).

---

### Run stuck in `running` status

Cause: Graph crashed mid-execution or supervisor was restarted.

```bash
# Manually cancel via API
curl -X POST http://localhost:8084/v1/runs/gr_STUCK_ID/cancel

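# Or force-set status in Redis: GET the record, then write it back with status changed.
# KEEPTTL (Redis 6.0+) preserves the expiry; a plain SET clears it, so the run would never expire.
docker exec -it sofiia-redis redis-cli
> GET run:gr_STUCK_ID
> SET run:gr_STUCK_ID '{"run_id":"gr_STUCK_ID","graph":"release_check","status":"failed",...}' KEEPTTL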
> EXIT
```
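
To find candidates before cancelling, the run keys can be scanned and filtered by status. The sketch below makes several assumptions: run records are compact JSON strings with a single top-level `"status":"<value>"` field, as in the SET example above; `run:<id>:events` keys are companions to skip; and `run_status` / `list_runs_by_status` are hypothetical helper names. It uses `redis-cli --scan` rather than `KEYS` so Redis is not blocked.

```bash
#!/usr/bin/env bash
# Extract the "status" value from a compact JSON run record.
# No JSON parser: if several "status" fields exist, the last one wins.
run_status() {
  printf '%s' "$1" | sed -n 's/.*"status":"\([^"]*\)".*/\1/p'
}

# Print run keys whose record has the given status (default: running).
list_runs_by_status() {
  local want="${1:-running}" key record
  docker exec sofiia-redis redis-cli --scan --pattern 'run:*' |
    grep -v ':events$' |
    while read -r key; do
      # </dev/null keeps docker exec from touching the loop's stdin.
      record="$(docker exec sofiia-redis redis-cli GET "$key" </dev/null)"
      [ "$(run_status "$record")" = "$want" ] && echo "$key"
    done
  return 0
}

# Usage on NODA2:
#   list_runs_by_status running
```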

---

### Redis connection error

```bash
docker logs sofiia-supervisor 2>&1 | grep "Redis connection error"

# Check Redis is running
docker ps --filter name=sofiia-redis

# Restart Redis (data preserved in volume)
docker compose -f docker-compose.node2-sofiia-supervisor.yml restart sofiia-redis

# Test connection
docker exec sofiia-redis redis-cli -h sofiia-redis ping
```

---

### High memory on Redis

```bash
# Check memory usage
docker exec sofiia-redis redis-cli info memory | grep used_memory_human

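# Redis is configured with maxmemory=256mb + allkeys-lru policy
# Once the limit is reached, the least recently used keys are evicted automatically
# (usually old runs, but LRU does not guarantee oldest-first)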

# Manual cleanup of old runs (older than 12h):
# Write a cleanup script or reduce RUN_TTL_SEC in .env
```
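
One possible shape for the cleanup script mentioned above. It is a sketch under stated assumptions: a run's age is estimated as `RUN_TTL_SEC` minus the remaining TTL, which only holds if the key's expiry is set once at creation and never refreshed; `run_age_sec` and `cleanup_old_runs` are hypothetical names; and it defaults to a dry run.

```bash
#!/usr/bin/env bash
# Estimate a run's age (seconds) from its remaining TTL.
# Prints nothing for TTL -1 (no expiry) or -2 (key missing).
run_age_sec() {
  local ttl_total="$1" ttl_left="$2"
  [ "$ttl_left" -ge 0 ] && echo $(( ttl_total - ttl_left ))
  return 0
}

# Delete run keys older than max_age_sec (default 12h = 43200 s).
# usage: cleanup_old_runs [ttl_total] [max_age_sec] [--apply]
cleanup_old_runs() {
  local ttl_total="${1:-86400}" max_age="${2:-43200}" mode="${3:-}"
  local key ttl age
  docker exec sofiia-redis redis-cli --scan --pattern 'run:*' |
    grep -v ':events$' |
    while read -r key; do
      ttl="$(docker exec sofiia-redis redis-cli TTL "$key" </dev/null)"
      age="$(run_age_sec "$ttl_total" "$ttl")"
      [ -n "$age" ] && [ "$age" -gt "$max_age" ] || continue
      if [ "$mode" = "--apply" ]; then
        docker exec sofiia-redis redis-cli DEL "$key" "$key:events" \
          </dev/null >/dev/null
        echo "deleted $key (age ${age}s)"
      else
        echo "would delete $key (age ${age}s)"
      fi
    done
}

# Usage on NODA2 (dry run first, then apply):
#   cleanup_old_runs 86400 43200
#   cleanup_old_runs 86400 43200 --apply
```

If `RUN_TTL_SEC` has been changed since the runs were written, pass the value that was in effect at write time, otherwise the estimated ages will be off.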

---

### Gateway returns 401 Unauthorized

Cause: `SUPERVISOR_API_KEY` mismatch between supervisor and router.

```bash
# Check env
docker exec sofiia-supervisor env | grep SUPERVISOR_API_KEY

# Compare with router
docker exec dagi-router-node2 env | grep SUPERVISOR_API_KEY
```

Both must match. Set via `SUPERVISOR_API_KEY=...` in docker-compose or `.env`.
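
The `env | grep` commands above print the key in clear text to the terminal and shell history. Where that is a concern, the two values can be compared by fingerprint instead. A minimal sketch, assuming `sha256sum` is available on the host and `sh` inside both containers; the function names are hypothetical.

```bash
#!/usr/bin/env bash
# Print a short SHA-256 fingerprint of whatever arrives on stdin.
fingerprint() {
  sha256sum | cut -c1-12
}

# Fingerprint SUPERVISOR_API_KEY inside a container without echoing it.
container_key_fingerprint() {
  docker exec "$1" sh -c 'printf %s "$SUPERVISOR_API_KEY"' | fingerprint
}

# Report whether two fingerprints match; non-zero exit on mismatch.
compare_fingerprints() {
  if [ "$1" = "$2" ]; then
    echo "MATCH"
  else
    echo "MISMATCH"
    return 1
  fi
}

# Usage on NODA2:
#   compare_fingerprints \
#     "$(container_key_fingerprint sofiia-supervisor)" \
#     "$(container_key_fingerprint dagi-router-node2)"
```

Note that a key that is unset or empty in both containers also reports `MATCH`, so confirm the variable is actually set before trusting the result.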

---

## Metrics / Monitoring

There is currently no dedicated metrics endpoint. Monitor via:

1. **`/healthz`** — service up/down
2. **Docker stats** — `docker stats sofiia-supervisor sofiia-redis`
3. **Log patterns** — `gateway_ok`, `gateway_tool_fail`, `run_graph error`

Planned: Prometheus `/metrics` endpoint with run counts per graph/status.
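
Until the `/metrics` endpoint exists, the log patterns listed above can be counted over a recent window as a rough stand-in. This is a sketch: `count_log_patterns` is a hypothetical helper, and the 10-minute window in the usage line is an arbitrary choice.

```bash
#!/usr/bin/env bash
# Read log lines on stdin and print one summary line with the number of
# lines matching each monitored pattern.
count_log_patterns() {
  awk '
    /gateway_ok/        { ok++ }
    /gateway_tool_fail/ { fail++ }
    /run_graph error/   { err++ }
    END {
      printf "gateway_ok=%d gateway_tool_fail=%d run_graph_error=%d\n", ok, fail, err
    }
  '
}

# Usage on NODA2:
#   docker logs sofiia-supervisor --since 10m 2>&1 | count_log_patterns
```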

---

## Upgrade

```bash
# Pull new image (if using registry)
docker pull daarion/sofiia-supervisor:latest

# Or rebuild from source
cd /path/to/microdao-daarion
docker compose -f docker-compose.node2-sofiia-supervisor.yml \
  build --no-cache sofiia-supervisor

# Recreate the container on the new image (single instance, so expect a brief outage; this is not a rolling restart)
docker compose -f docker-compose.node2-sofiia-supervisor.yml \
  up -d sofiia-supervisor
```

---

## Available Graphs

| Graph | Description | Key nodes |
|-------|-------------|-----------|
| `release_check` | Release validation pipeline | jobs → poll → result |
| `incident_triage` | Collect observability + KB + SLO/privacy/cost context | overview → logs → health → traces → slo_context → privacy → cost → report |
| `postmortem_draft` | Generate postmortem from incident | load_incident → ensure_triage → draft → attach_artifacts → followups |

### postmortem_draft (new)

```bash
curl -X POST http://localhost:8084/v1/graphs/postmortem_draft/runs \
  -H "Content-Type: application/json" \
  -d '{"agent_id":"sofiia","input":{"incident_id":"inc_..."}}'
```

Generates markdown + JSON postmortem, attaches as incident artifacts, and appends follow-up timeline events. See `docs/supervisor/postmortem_draft_graph.md`.

---

## Known Limitations (MVP)

1. **Single worker** (`--workers 1`) — graph runs are sequential per process. For concurrent load, increase the worker count; run state lives in Redis, so it stays consistent across workers.
2. **No LangGraph checkpointing** — runs interrupted by restart will show as `failed`; they do not resume.
3. **Polling-based job status** — `release_check` polls `job_orchestrator_tool` every 3s. Tune `JOB_POLL_INTERVAL_SEC` if needed.
4. **In-flight cancellation** — `cancel` sets status in Redis but cannot interrupt an already-executing tool call. Cancellation is effective between nodes.