microdao-daarion/runbooks/RUNBOOK-E2E-FAILURE.md
Apple ef3473db21 snapshot: NODE1 production state 2026-02-09
Complete snapshot of /opt/microdao-daarion/ from NODE1 (144.76.224.179).
This represents the actual running production code that has diverged
significantly from the previous main branch.

Key changes from old main:
- Gateway (http_api.py): expanded from ~40KB to 164KB with full agent support
- Router: new /v1/agents/{id}/infer endpoint with vision + DeepSeek routing
- Behavior Policy: SOWA v2.2 (3-level: FULL/ACK/SILENT)
- Agent Registry: config/agent_registry.yml as single source of truth
- 13 agents configured (was 3)
- Memory service integration
- CrewAI teams and roles

Excluded from snapshot: venv/, .env, data/, backups, .tgz archives

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-02-09 08:46:46 -08:00

Runbook: Agent E2E Failure (E2E=0)

Triggers

  • AgentE2EFailure: agent_e2e_success{target="gateway_health"} == 0
  • AgentPingFailure: agent_e2e_success{target="agent_ping"} == 0
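
Both alerts fire when the prober reports 0 for a target. The following is a minimal sketch for listing the targets that are currently failing. It assumes the prober serves the standard Prometheus text format with the sample value as the last field (no trailing timestamp); `failing_targets` is a hypothetical helper name, not part of the repository.

```shell
# Hypothetical helper: print the target label of every agent_e2e_success
# sample whose value is 0. Reads Prometheus text-format metrics on stdin.
# Assumes the value is the last field on the line (no timestamp column).
failing_targets() {
  awk '
    /^agent_e2e_success\{/ && ($NF + 0) == 0 {
      if (match($0, /target="[^"]*"/)) {
        # skip the 8 characters of target=" and drop the closing quote
        print substr($0, RSTART + 8, RLENGTH - 9)
      }
    }'
}

# Usage (prober endpoint as used elsewhere in this runbook):
#   curl -sS http://localhost:9108/metrics | failing_targets
```

An empty result means no target is reporting 0 at the moment of the scrape.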

Quick diagnostics (5 commands)

# 1. Prober status
curl -sS http://localhost:9108/metrics | grep agent_e2e_success

# 2. Gateway logs (recent errors)
docker logs dagi-gateway-node1 --tail 20 2>&1 | grep -iE "error|fail|timeout"

# 3. Router health
curl -sS http://localhost:9102/health

# 4. NATS connectivity
docker run --rm --network dagi-network natsio/nats-box nats -s nats://dagi-nats-node1:4222 server ping

# 5. Memory-service health
curl -sS http://localhost:8000/health
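
The pass/fail checks above can be wrapped so each prints a one-line verdict. This is a sketch only: `run_check` and `quick_diag` are hypothetical names, the endpoints are copied from the commands above, and `-f` and `--max-time 5` are added so that HTTP errors and hangs count as failures. The gateway log grep is left out because it has no pass/fail result.

```shell
# Hypothetical wrapper: run one check command, hide its output, and print a
# one-line verdict. Returns 0 on success, 1 on failure.
run_check() {
  name=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "OK   $name"
  else
    echo "FAIL $name"
    return 1
  fi
}

# Run the pass/fail checks from this runbook and count the failures.
quick_diag() {
  failed=0
  run_check prober curl -fsS --max-time 5 http://localhost:9108/metrics || failed=$((failed + 1))
  run_check router curl -fsS --max-time 5 http://localhost:9102/health || failed=$((failed + 1))
  run_check memory curl -fsS --max-time 5 http://localhost:8000/health || failed=$((failed + 1))
  run_check nats docker run --rm --network dagi-network natsio/nats-box \
    nats -s nats://dagi-nats-node1:4222 server ping || failed=$((failed + 1))
  echo "failed checks: $failed"
  return "$failed"
}

# Usage: quick_diag
```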

Detailed diagnostics

If the Gateway is DOWN

docker ps | grep gateway
docker logs dagi-gateway-node1 --tail 50
docker restart dagi-gateway-node1
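
After a restart it can help to poll until the prober reports the gateway as healthy again, rather than checking once. This is a sketch under assumptions: `wait_until` and `gateway_ok` are hypothetical helpers, the 12 attempts and 5 second delay are illustrative values, and the grep assumes the metric value is printed as `1` or `1.0`.

```shell
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_until <attempts> <delay_seconds> <command...>
wait_until() {
  attempts=$1; delay=$2; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "recovered after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "still failing after $attempts attempt(s)"
  return 1
}

# Hypothetical check: succeeds when the prober reports gateway_health as 1.
gateway_ok() {
  curl -fsS --max-time 5 http://localhost:9108/metrics |
    grep -Eq '^agent_e2e_success\{[^}]*target="gateway_health"[^}]*\} 1(\.0+)?$'
}

# Usage:
#   docker restart dagi-gateway-node1 && wait_until 12 5 gateway_ok
```

The prober only updates the metric on its own probe interval, so the delay should be at least that long; the interval is not stated in this runbook.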

If the Router is not responding

docker logs dagi-router-node1 --tail 50
# Check Ollama
curl -sS http://172.17.0.1:11434/api/tags | head

If the Memory-service is DOWN

docker logs dagi-memory-service-node1 --tail 50
# Check Qdrant
curl -sS http://localhost:6333/collections | head

If NATS has problems

# JetStream status
docker run --rm --network dagi-network natsio/nats-box nats -s nats://dagi-nats-node1:4222 stream ls
docker run --rm --network dagi-network natsio/nats-box nats -s nats://dagi-nats-node1:4222 consumer info ARTIFACT_JOBS render_pdf_worker

Escalation

  1. Service restart did not help → check resources (docker stats)
  2. OOM kills → dmesg | grep -i oom
  3. Disk full → df -h
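
For the disk check, a small filter can reduce `df` output to the filesystems that are actually close to full. This is a sketch: `full_filesystems` is a hypothetical helper, the 90% default is an illustrative value rather than a documented limit for NODE1, and it assumes POSIX `df -P` output with no spaces in mount points.

```shell
# Hypothetical helper: read `df -P` output on stdin and print the mount point
# and usage of every filesystem at or above the given percentage (default 90).
full_filesystems() {
  awk -v limit="${1:-90}" '
    NR > 1 {                      # skip the header line
      use = $5
      sub(/%$/, "", use)          # "95%" -> "95"
      if (use + 0 >= limit + 0) print $6, $5
    }'
}

# Usage:
#   df -P | full_filesystems 90
```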

Contacts

  • Slack: #daarion-alerts
  • On-call: check PagerDuty