microdao-daarion/runbooks/RUNBOOK-E2E-FAILURE.md
Apple ef3473db21 snapshot: NODE1 production state 2026-02-09
Complete snapshot of /opt/microdao-daarion/ from NODE1 (144.76.224.179).
This represents the actual running production code that has diverged
significantly from the previous main branch.

Key changes from old main:
- Gateway (http_api.py): expanded from ~40KB to 164KB with full agent support
- Router: new /v1/agents/{id}/infer endpoint with vision + DeepSeek routing
- Behavior Policy: SOWA v2.2 (3-level: FULL/ACK/SILENT)
- Agent Registry: config/agent_registry.yml as single source of truth
- 13 agents configured (was 3)
- Memory service integration
- CrewAI teams and roles

Excluded from snapshot: venv/, .env, data/, backups, .tgz archives

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-02-09 08:46:46 -08:00


# Runbook: Agent E2E Failure (E2E=0)
## Triggers
- `AgentE2EFailure`: agent_e2e_success{target="gateway_health"} == 0
- `AgentPingFailure`: agent_e2e_success{target="agent_ping"} == 0
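
To see which target tripped an alert, the prober output can be filtered for samples equal to 0. The following is a minimal sketch, not part of the deployed tooling: `failing_targets` is a hypothetical helper name, and it assumes the prober emits standard Prometheus text format with the value as the last field on each line (no trailing timestamp).

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the target label of every
# agent_e2e_success sample whose value is 0.
# Reads Prometheus text exposition format on stdin.
failing_targets() {
  awk '
    index($0, "agent_e2e_success{") == 1 && ($NF + 0) == 0 {
      if (match($0, /target="[^"]*"/)) {
        # Skip the 8 characters of target=" and drop the closing quote.
        print substr($0, RSTART + 8, RLENGTH - 9)
      }
    }
  '
}
```

Possible usage on NODE1: `curl -sS http://localhost:9108/metrics | failing_targets`. An empty result would mean no target currently reports 0.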
## Quick diagnostics (5 commands)
```bash
# 1. Prober status
curl -sS http://localhost:9108/metrics | grep agent_e2e_success
# 2. Gateway logs (recent errors)
docker logs dagi-gateway-node1 --tail 20 2>&1 | grep -iE "error|fail|timeout"
# 3. Router health
curl -sS http://localhost:9102/health
# 4. NATS connectivity
docker run --rm --network dagi-network natsio/nats-box nats -s nats://dagi-nats-node1:4222 server ping
# 5. Memory-service health
curl -sS http://localhost:8000/health
```
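
The health checks above can be wrapped into a single pass/fail summary. This is a sketch under stated assumptions: `check` and `run_quick_checks` are hypothetical names, and the `-f` and `--max-time 5` curl options are additions (not in the original commands) so that HTTP error responses and hangs count as failures. The gateway log grep is left out because it needs a human to read it.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run one named check, print OK/FAIL,
# and return non-zero when the command fails.
check() {
  local name="$1"; shift
  if "$@" >/dev/null 2>&1; then
    printf 'OK    %s\n' "$name"
  else
    printf 'FAIL  %s\n' "$name"
    return 1
  fi
}

# The HTTP and NATS checks from the runbook, in the same order.
# Endpoints and container names are copied from the commands above.
run_quick_checks() {
  local failed=0
  check "prober metrics" \
    curl -fsS --max-time 5 http://localhost:9108/metrics || failed=1
  check "router health" \
    curl -fsS --max-time 5 http://localhost:9102/health || failed=1
  check "nats ping" \
    docker run --rm --network dagi-network natsio/nats-box \
    nats -s nats://dagi-nats-node1:4222 server ping || failed=1
  check "memory-service health" \
    curl -fsS --max-time 5 http://localhost:8000/health || failed=1
  return "$failed"
}
```

Calling `run_quick_checks` prints one line per check and returns non-zero if any of them failed, which makes it easy to see at a glance which section of the detailed diagnostics to open.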
## Detailed diagnostics
### If the Gateway is DOWN
```bash
docker ps | grep gateway
docker logs dagi-gateway-node1 --tail 50
docker restart dagi-gateway-node1
```
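
After a restart it can help to wait for the prober to report recovery instead of polling by hand. The sketch below is an illustration only: `retry` and `gateway_healthy` are hypothetical helpers, and it assumes the prober on port 9108 refreshes `agent_e2e_success{target="gateway_health"}` often enough for a short retry loop to be useful.

```bash
#!/usr/bin/env bash
# Hypothetical helper: retry a command up to <attempts> times,
# sleeping <delay> seconds between tries.
retry() {
  local attempts="$1" delay="$2"; shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Hypothetical helper: succeeds when the prober reports gateway_health == 1.
gateway_healthy() {
  curl -fsS --max-time 5 http://localhost:9108/metrics \
    | grep -Eq '^agent_e2e_success\{[^}]*target="gateway_health"[^}]*\} +1(\.0+)?$'
}
```

For example, `docker restart dagi-gateway-node1 && retry 12 5 gateway_healthy` would wait up to roughly a minute; a non-zero exit suggests moving on to the escalation steps.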
### If the Router is not responding
```bash
docker logs dagi-router-node1 --tail 50
# Check Ollama
curl -sS http://172.17.0.1:11434/api/tags | head
```
### If Memory-service is DOWN
```bash
docker logs dagi-memory-service-node1 --tail 50
# Check Qdrant
curl -sS http://localhost:6333/collections | head
```
### If NATS has problems
```bash
# JetStream status
docker run --rm --network dagi-network natsio/nats-box nats -s nats://dagi-nats-node1:4222 stream ls
docker run --rm --network dagi-network natsio/nats-box nats -s nats://dagi-nats-node1:4222 consumer info ARTIFACT_JOBS render_pdf_worker
```
## Escalation
1. Restarting the service did not help → check resource usage (`docker stats`)
2. OOM kills → `dmesg | grep -i oom`
3. Disk full → `df -h`
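
The disk check can be narrowed to only the filesystems that are close to full. A minimal sketch, assuming POSIX `df -P` output (capacity in column 5, mount point in column 6) and mount points without spaces; `disks_over` and the 90% threshold in the usage example are hypothetical choices, not values from this runbook.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `df -P` output on stdin and print
# "<mount point> <capacity>" for filesystems at or above the threshold (percent).
disks_over() {
  local threshold="$1"
  awk -v t="$threshold" '
    NR > 1 {                      # skip the header line
      use = $5
      sub(/%/, "", use)           # "95%" -> "95"
      if (use + 0 >= t) print $6, $5
    }
  '
}
```

Possible usage: `df -P | disks_over 90`. No output would mean no filesystem is at or above the chosen threshold.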
## Contacts
- Slack: #daarion-alerts
- On-call: check PagerDuty