Runbook: sofiia-supervisor (NODA2)
Service: sofiia-supervisor + sofiia-redis
Host: NODA2 | External port: 8084
Escalation: #platform-ops → @platform-oncall
Health Check
# Basic health
curl -sf http://localhost:8084/healthz && echo OK
# Expected response:
# {"status":"ok","service":"sofiia-supervisor","graphs":["release_check","incident_triage"],
# "state_backend":"redis","gateway_url":"http://router:8000"}
# Redis health
docker exec sofiia-redis redis-cli ping
# Expected: PONG
Logs
# Supervisor logs (last 100 lines)
docker logs sofiia-supervisor --tail 100 -f
# Filter tool call events (no payload)
docker logs sofiia-supervisor 2>&1 | grep "gateway_call\|gateway_ok\|gateway_tool_fail"
# Redis logs
docker logs sofiia-redis --tail 50
# All supervisor logs to file
docker logs sofiia-supervisor > /tmp/supervisor-$(date +%Y%m%d-%H%M%S).log 2>&1
Log format:
2026-02-23T10:00:01Z [INFO] gateway_call tool=job_orchestrator_tool action=start_task node=start_job run=gr_abc123 hash=d4e5f6 size=312 attempt=1
2026-02-23T10:00:02Z [INFO] gateway_ok tool=job_orchestrator_tool node=start_job run=gr_abc123 elapsed_ms=145
Payload is NEVER logged. Only: tool name, action, node, run_id, input hash, size, elapsed time.
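Because `elapsed_ms` is part of the structured log line, slow tool calls can be surfaced with a plain awk filter. The 1000 ms threshold below is an illustrative choice, not a documented SLO:

```shell
# Print gateway calls slower than 1000 ms (threshold is illustrative).
# Splitting on "elapsed_ms=" makes $2 start with the numeric value.
docker logs sofiia-supervisor 2>&1 \
  | awk -F'elapsed_ms=' 'NF > 1 && ($2 + 0) > 1000'
```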
Restart
# Graceful restart (in-flight runs will fail → status=failed in Redis)
docker compose -f docker-compose.node2-sofiia-supervisor.yml restart sofiia-supervisor
# Full restart with rebuild (after code changes)
docker compose -f docker-compose.node2-sofiia-supervisor.yml \
up -d --build sofiia-supervisor
# Check container status after restart
docker ps --filter name=sofiia-supervisor --format "table {{.Names}}\t{{.Status}}"
Start / Stop
# Start (attached to dagi-network-node2)
docker compose \
-f docker-compose.node2.yml \
-f docker-compose.node2-sofiia-supervisor.yml \
up -d sofiia-supervisor sofiia-redis
# Stop (preserves Redis data)
docker compose -f docker-compose.node2-sofiia-supervisor.yml stop sofiia-supervisor
# Stop + remove containers (keeps volumes)
docker compose -f docker-compose.node2-sofiia-supervisor.yml down
# Full teardown (removes volumes — DESTROYS run history)
docker compose -f docker-compose.node2-sofiia-supervisor.yml down -v
State Cleanup
# Connect to Redis
docker exec -it sofiia-redis redis-cli
# List all run keys
127.0.0.1:6379> KEYS run:*
# Check a specific run
127.0.0.1:6379> GET run:gr_abc123
# Check run TTL (seconds until expiry)
127.0.0.1:6379> TTL run:gr_abc123
# Manually delete a stuck/stale run
127.0.0.1:6379> DEL run:gr_abc123 run:gr_abc123:events
# Count all active runs
127.0.0.1:6379> DBSIZE
# Flush all run data (CAUTION: destroys all history)
# 127.0.0.1:6379> FLUSHDB
# Exit
127.0.0.1:6379> EXIT
Default TTL: RUN_TTL_SEC=86400 (24h). Runs auto-expire.
Common Issues
sofiia-supervisor can't reach router
# Check network
docker exec sofiia-supervisor curl -sf http://router:8000/healthz
# If fails: verify router is on dagi-network-node2
docker network inspect dagi-network-node2 | grep -A3 router
Fix: Ensure both services are on dagi-network-node2 (see compose networks section).
Run stuck in running status
Cause: Graph crashed mid-execution or supervisor was restarted.
# Manually cancel via API
curl -X POST http://localhost:8084/v1/runs/gr_STUCK_ID/cancel
# Or force-set status in Redis
docker exec -it sofiia-redis redis-cli
> GET run:gr_STUCK_ID
> SET run:gr_STUCK_ID '{"run_id":"gr_STUCK_ID","graph":"release_check","status":"failed",...}'
> EXIT
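To find every run still marked running (for example after a restart), the run keys can be scanned and their JSON grepped for the status field. A sketch, assuming the status is stored exactly as `"status":"running"` per the example above:

```shell
# List run keys whose stored state still says "running" (cancel candidates).
# Skips the companion ":events" keys; uses SCAN (non-blocking) rather than KEYS.
docker exec sofiia-redis redis-cli --scan --pattern 'run:*' \
  | grep -v ':events$' \
  | while read -r key; do
      if docker exec sofiia-redis redis-cli GET "$key" \
           | grep -q '"status":"running"'; then
        echo "$key"
      fi
    done
```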
Redis connection error
docker logs sofiia-supervisor 2>&1 | grep "Redis connection error"
# Check Redis is running
docker ps --filter name=sofiia-redis
# Restart Redis (data preserved in volume)
docker compose -f docker-compose.node2-sofiia-supervisor.yml restart sofiia-redis
# Test connection by service name (verifies DNS on the compose network, not just localhost)
docker exec sofiia-redis redis-cli -h sofiia-redis ping
High memory on Redis
# Check memory usage
docker exec sofiia-redis redis-cli info memory | grep used_memory_human
# Redis is configured with maxmemory=256mb + allkeys-lru policy
# Old runs will be evicted automatically
# Manual cleanup of old runs (older than 12h):
# Write a cleanup script or reduce RUN_TTL_SEC in .env
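One way to sketch that cleanup: runs are written with RUN_TTL_SEC=86400, so a key's age is roughly RUN_TTL_SEC minus its remaining TTL, and "older than 12h" means TTL below 43200 s. This is a hypothetical script, not one shipped in the repo:

```shell
#!/usr/bin/env sh
# Hypothetical cleanup: delete run keys older than 12h, with age inferred
# from remaining TTL (age ~ RUN_TTL_SEC - TTL).
RUN_TTL_SEC=86400
CUTOFF=$((RUN_TTL_SEC - 12 * 3600))   # = 43200

docker exec sofiia-redis redis-cli --scan --pattern 'run:*' \
  | while read -r key; do
      ttl=$(docker exec sofiia-redis redis-cli TTL "$key")
      # TTL returns -1 (no expiry) or -2 (missing key); only touch expiring keys.
      if [ "$ttl" -ge 0 ] && [ "$ttl" -lt "$CUTOFF" ]; then
        echo "deleting $key (ttl=${ttl}s)"
        docker exec sofiia-redis redis-cli DEL "$key" > /dev/null
      fi
    done
```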
Gateway returns 401 Unauthorized
Cause: SUPERVISOR_API_KEY mismatch between supervisor and router.
# Check env
docker exec sofiia-supervisor env | grep SUPERVISOR_API_KEY
# Compare with router
docker exec dagi-router-node2 env | grep SUPERVISOR_API_KEY
Both must match. Set via SUPERVISOR_API_KEY=... in docker-compose or .env.
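Rather than eyeballing the two values, they can be compared directly without printing the secret to the terminal. A small sketch:

```shell
# Compare SUPERVISOR_API_KEY across the two containers without echoing it.
a=$(docker exec sofiia-supervisor printenv SUPERVISOR_API_KEY)
b=$(docker exec dagi-router-node2 printenv SUPERVISOR_API_KEY)
if [ "$a" = "$b" ]; then echo "keys MATCH"; else echo "keys MISMATCH"; fi
```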
Metrics / Monitoring
Currently no dedicated metrics endpoint. Monitor via:
- /healthz — service up/down
- Docker stats — docker stats sofiia-supervisor sofiia-redis
- Log patterns — gateway_ok, gateway_tool_fail, run_graph error
Planned: Prometheus /metrics endpoint with run counts per graph/status.
Upgrade
# Pull new image (if using registry)
docker pull daarion/sofiia-supervisor:latest
# Or rebuild from source
cd /path/to/microdao-daarion
docker compose -f docker-compose.node2-sofiia-supervisor.yml \
build --no-cache sofiia-supervisor
# Rolling restart (zero-downtime is NOT guaranteed — single instance)
docker compose -f docker-compose.node2-sofiia-supervisor.yml \
up -d sofiia-supervisor
Available Graphs
| Graph | Description | Key nodes |
|---|---|---|
| release_check | Release validation pipeline | jobs → poll → result |
| incident_triage | Collect observability + KB + SLO/privacy/cost context | overview → logs → health → traces → slo_context → privacy → cost → report |
| postmortem_draft | Generate postmortem from incident | load_incident → ensure_triage → draft → attach_artifacts → followups |
postmortem_draft (new)
curl -X POST http://localhost:8084/v1/graphs/postmortem_draft/runs \
-H "Content-Type: application/json" \
-d '{"agent_id":"sofiia","input":{"incident_id":"inc_..."}}'
Generates markdown + JSON postmortem, attaches as incident artifacts, and appends follow-up timeline events. See docs/supervisor/postmortem_draft_graph.md.
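For scripting, the run id can be pulled out of the POST response and polled until completion. In this sketch, "inc_example" is a placeholder incident id, and the GET /v1/runs/<run_id> status route is an assumption inferred from the cancel route above; verify it against the actual API:

```shell
# Start a postmortem_draft run and extract run_id from the JSON response.
# Assumes the response contains "run_id":"gr_..." as in the log examples.
resp=$(curl -sf -X POST http://localhost:8084/v1/graphs/postmortem_draft/runs \
  -H "Content-Type: application/json" \
  -d '{"agent_id":"sofiia","input":{"incident_id":"inc_example"}}')
run_id=$(printf '%s' "$resp" | sed -n 's/.*"run_id":"\([^"]*\)".*/\1/p')
echo "run_id=$run_id"

# Poll the (assumed) status route until the run leaves "running";
# 3s matches the job poll interval used by release_check.
while curl -sf "http://localhost:8084/v1/runs/$run_id" | grep -q '"status":"running"'; do
  sleep 3
done
```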
Known Limitations (MVP)
- Single worker (--workers 1) — graph runs are sequential per process. For concurrent load, increase the worker count; the Redis state backend keeps runs consistent across workers.
- No LangGraph checkpointing — runs interrupted by a restart show as failed; they do not resume.
- Polling-based job status — release_check polls job_orchestrator_tool every 3s. Tune JOB_POLL_INTERVAL_SEC if needed.
- In-flight cancellation — cancel sets the status in Redis but cannot interrupt an already-executing tool call; cancellation takes effect between nodes.