# Runbook: sofiia-supervisor (NODA2)

**Service**: `sofiia-supervisor` + `sofiia-redis`
**Host**: NODA2 | **External port**: 8084
**Escalation**: #platform-ops → @platform-oncall

---
## Health Check

```bash
# Basic health
curl -sf http://localhost:8084/healthz && echo OK

# Expected response:
# {"status":"ok","service":"sofiia-supervisor","graphs":["release_check","incident_triage"],
# "state_backend":"redis","gateway_url":"http://router:8000"}

# Redis health
docker exec sofiia-redis redis-cli ping
# Expected: PONG
```

---
## Logs

```bash
# Supervisor logs (last 100 lines)
docker logs sofiia-supervisor --tail 100 -f

# Filter tool call events (no payload)
docker logs sofiia-supervisor 2>&1 | grep "gateway_call\|gateway_ok\|gateway_tool_fail"

# Redis logs
docker logs sofiia-redis --tail 50

# All supervisor logs to file
docker logs sofiia-supervisor > /tmp/supervisor-$(date +%Y%m%d-%H%M%S).log 2>&1
```

Log format:

```
2026-02-23T10:00:01Z [INFO] gateway_call tool=job_orchestrator_tool action=start_task node=start_job run=gr_abc123 hash=d4e5f6 size=312 attempt=1
2026-02-23T10:00:02Z [INFO] gateway_ok tool=job_orchestrator_tool node=start_job run=gr_abc123 elapsed_ms=145
```

**Payload is NEVER logged.** Only: tool name, action, node, run_id, input hash, size, elapsed time.
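Since latency only appears per call, a quick aggregation helps when triaging slow tools. A minimal sketch, assuming the `gateway_ok` field layout shown above (`tool=...`, `elapsed_ms=...`); adjust the parsing if your build logs differently:

```shell
# Sketch: per-tool average gateway latency, parsed from gateway_ok lines.
# Assumes the field layout shown in the log format above.
gateway_latency() {
  awk '
    /gateway_ok/ {
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^tool=/)       { t = substr($i, 6) }          # strip "tool="
        if ($i ~ /^elapsed_ms=/) { sum[t] += substr($i, 12); n[t]++ }  # strip "elapsed_ms="
      }
    }
    END { for (t in n) printf "%s avg_ms=%.1f calls=%d\n", t, sum[t] / n[t], n[t] }
  '
}

# Usage:
#   docker logs sofiia-supervisor 2>&1 | gateway_latency
```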
---
## Restart

```bash
# Graceful restart (in-flight runs will fail → status=failed in Redis)
docker compose -f docker-compose.node2-sofiia-supervisor.yml restart sofiia-supervisor

# Full restart with rebuild (after code changes)
docker compose -f docker-compose.node2-sofiia-supervisor.yml \
  up -d --build sofiia-supervisor

# Check container status after restart
docker ps --filter name=sofiia-supervisor --format "table {{.Names}}\t{{.Status}}"
```

---
## Start / Stop

```bash
# Start (attached to dagi-network-node2)
docker compose \
  -f docker-compose.node2.yml \
  -f docker-compose.node2-sofiia-supervisor.yml \
  up -d sofiia-supervisor sofiia-redis

# Stop (preserves Redis data)
docker compose -f docker-compose.node2-sofiia-supervisor.yml stop sofiia-supervisor

# Stop + remove containers (keeps volumes)
docker compose -f docker-compose.node2-sofiia-supervisor.yml down

# Full teardown (removes volumes — DESTROYS run history)
docker compose -f docker-compose.node2-sofiia-supervisor.yml down -v
```

---
## State Cleanup

```bash
# Connect to Redis
docker exec -it sofiia-redis redis-cli

# List all run keys (KEYS blocks Redis; fine at this scale, prefer SCAN on large datasets)
127.0.0.1:6379> KEYS run:*

# Check a specific run
127.0.0.1:6379> GET run:gr_abc123

# Check run TTL (seconds until expiry)
127.0.0.1:6379> TTL run:gr_abc123

# Manually delete a stuck/stale run
127.0.0.1:6379> DEL run:gr_abc123 run:gr_abc123:events

# Count all keys (≈ active runs; includes run:*:events keys)
127.0.0.1:6379> DBSIZE

# Flush all run data (CAUTION: destroys all history)
# 127.0.0.1:6379> FLUSHDB

# Exit
127.0.0.1:6379> EXIT
```

Default TTL: `RUN_TTL_SEC=86400` (24h). Runs auto-expire.

---
## Common Issues

### `sofiia-supervisor` can't reach router

```bash
# Check network
docker exec sofiia-supervisor curl -sf http://router:8000/healthz

# If it fails: verify router is on dagi-network-node2
docker network inspect dagi-network-node2 | grep -A3 router
```

**Fix**: Ensure both services are on `dagi-network-node2` (see the compose `networks` section).

---
### Run stuck in `running` status

Cause: the graph crashed mid-execution or the supervisor was restarted.

```bash
# Manually cancel via API
curl -X POST http://localhost:8084/v1/runs/gr_STUCK_ID/cancel

# Or force-set status in Redis
docker exec -it sofiia-redis redis-cli
> GET run:gr_STUCK_ID
> SET run:gr_STUCK_ID '{"run_id":"gr_STUCK_ID","graph":"release_check","status":"failed",...}'
> EXIT
```

---
### Redis connection error

```bash
# Look for connection errors in supervisor logs
docker logs sofiia-supervisor 2>&1 | grep "Redis connection error"

# Check Redis is running
docker ps --filter name=sofiia-redis

# Restart Redis (data preserved in volume)
docker compose -f docker-compose.node2-sofiia-supervisor.yml restart sofiia-redis

# Test connection (via the container's network alias)
docker exec sofiia-redis redis-cli -h sofiia-redis ping
```

---
### High memory on Redis

```bash
# Check memory usage
docker exec sofiia-redis redis-cli info memory | grep used_memory_human

# Redis is configured with maxmemory=256mb + allkeys-lru policy
# Old runs will be evicted automatically

# Manual cleanup of old runs (older than 12h):
# Write a cleanup script or reduce RUN_TTL_SEC in .env
```
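The cleanup script mentioned above can be sketched as follows. It infers a run's age from its remaining TTL (age = `RUN_TTL_SEC` − TTL), which assumes the default `RUN_TTL_SEC=86400` and no per-run TTL overrides; verify both before relying on it.

```shell
# Sketch: delete run:* keys older than MAX_AGE_SEC, inferring age from the
# remaining TTL. Assumes the default RUN_TTL_SEC=86400 (see above).
RUN_TTL_SEC=86400
MAX_AGE_SEC=43200   # 12h

# True (exit 0) when a key with this remaining TTL is older than MAX_AGE_SEC;
# negative TTLs (-1 = no expiry, -2 = missing key) are never treated as stale.
is_stale() {
  [ "$1" -ge 0 ] && [ $(( RUN_TTL_SEC - $1 )) -gt "$MAX_AGE_SEC" ]
}

# Only touch Redis when the container is actually up
if [ -n "$(docker ps -q --filter name=sofiia-redis 2>/dev/null)" ]; then
  docker exec sofiia-redis redis-cli --scan --pattern 'run:*' | while read -r key; do
    if is_stale "$(docker exec sofiia-redis redis-cli TTL "$key")"; then
      docker exec sofiia-redis redis-cli DEL "$key"
    fi
  done
fi
```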
---
### Gateway returns 401 Unauthorized

Cause: `SUPERVISOR_API_KEY` mismatch between supervisor and router.

```bash
# Check env
docker exec sofiia-supervisor env | grep SUPERVISOR_API_KEY

# Compare with router
docker exec dagi-router-node2 env | grep SUPERVISOR_API_KEY
```

Both must match. Set via `SUPERVISOR_API_KEY=...` in docker-compose or `.env`.
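Since the `grep` commands above echo the secret to the terminal, a small helper can compare the two values without printing them. A minimal sketch, using the container names from this runbook:

```shell
# Sketch: verify the two keys match without printing them.
# Returns false if either value is empty (unset variable).
same_key() { [ -n "$1" ] && [ "$1" = "$2" ]; }

# Usage:
#   same_key "$(docker exec sofiia-supervisor printenv SUPERVISOR_API_KEY)" \
#            "$(docker exec dagi-router-node2 printenv SUPERVISOR_API_KEY)" \
#     && echo "keys match" || echo "MISMATCH (or unset)"
```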
---
## Metrics / Monitoring

Currently no dedicated metrics endpoint. Monitor via:

1. **`/healthz`** — service up/down
2. **Docker stats** — `docker stats sofiia-supervisor sofiia-redis`
3. **Log patterns** — `gateway_ok`, `gateway_tool_fail`, `run_graph error`

Planned: Prometheus `/metrics` endpoint with run counts per graph/status.

---
## Upgrade

```bash
# Pull new image (if using registry)
docker pull daarion/sofiia-supervisor:latest

# Or rebuild from source
cd /path/to/microdao-daarion
docker compose -f docker-compose.node2-sofiia-supervisor.yml \
  build --no-cache sofiia-supervisor

# Restart onto the new image (zero-downtime is NOT guaranteed — single instance)
docker compose -f docker-compose.node2-sofiia-supervisor.yml \
  up -d sofiia-supervisor
```

---
## Available Graphs

| Graph | Description | Key nodes |
|-------|-------------|-----------|
| `release_check` | Release validation pipeline | jobs → poll → result |
| `incident_triage` | Collect observability + KB + SLO/privacy/cost context | overview → logs → health → traces → slo_context → privacy → cost → report |
| `postmortem_draft` | Generate postmortem from incident | load_incident → ensure_triage → draft → attach_artifacts → followups |

### postmortem_draft (new)

```bash
curl -X POST http://localhost:8084/v1/graphs/postmortem_draft/runs \
  -H "Content-Type: application/json" \
  -d '{"agent_id":"sofiia","input":{"incident_id":"inc_..."}}'
```

Generates markdown + JSON postmortem, attaches as incident artifacts, and appends follow-up timeline events. See `docs/supervisor/postmortem_draft_graph.md`.
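To wait for the run to complete, the status can be polled. A `GET /v1/runs/{id}` endpoint returning the run JSON is assumed here (only the `/cancel` path appears in this runbook); verify the path in your deployment:

```shell
# Sketch: extract "status" from a run JSON document on stdin.
# Assumption: GET /v1/runs/{id} exists alongside the documented /cancel path.
run_status() { sed -n 's/.*"status":"\([^"]*\)".*/\1/p'; }

# Usage:
#   while [ "$(curl -s http://localhost:8084/v1/runs/gr_abc123 | run_status)" = "running" ]; do
#     sleep 3
#   done
```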
---
## Known Limitations (MVP)

1. **Single worker** (`--workers 1`) — graph runs are sequential per process. For concurrent load, increase workers (Redis-backed state keeps run records consistent).
2. **No LangGraph checkpointing** — runs interrupted by a restart are marked `failed`; they do not resume.
3. **Polling-based job status** — `release_check` polls `job_orchestrator_tool` every 3s. Tune `JOB_POLL_INTERVAL_SEC` if needed.
4. **In-flight cancellation** — `cancel` sets status in Redis but cannot interrupt an already-executing tool call. Cancellation takes effect between nodes.