snapshot: NODE1 production state 2026-02-09
Complete snapshot of /opt/microdao-daarion/ from NODE1 (144.76.224.179).
This represents the actual running production code that has diverged
significantly from the previous main branch.
Key changes from old main:
- Gateway (http_api.py): expanded from ~40KB to 164KB with full agent support
- Router: new /v1/agents/{id}/infer endpoint with vision + DeepSeek routing
- Behavior Policy: SOWA v2.2 (3-level: FULL/ACK/SILENT)
- Agent Registry: config/agent_registry.yml as single source of truth
- 13 agents configured (was 3)
- Memory service integration
- CrewAI teams and roles
Excluded from snapshot: venv/, .env, data/, backups, .tgz archives
Co-authored-by: Cursor <cursoragent@cursor.com>
63
runbooks/RUNBOOK-E2E-FAILURE.md
Normal file
@@ -0,0 +1,63 @@
# Runbook: Agent E2E Failure (E2E=0)

## Triggers

- `AgentE2EFailure`: `agent_e2e_success{target="gateway_health"} == 0`
- `AgentPingFailure`: `agent_e2e_success{target="agent_ping"} == 0`

## Quick diagnostics (5 commands)

```bash
# 1. Prober status
curl -sS http://localhost:9108/metrics | grep agent_e2e_success

# 2. Gateway logs (recent errors)
docker logs dagi-gateway-node1 --tail 20 2>&1 | grep -iE "error|fail|timeout"

# 3. Router health
curl -sS http://localhost:9102/health

# 4. NATS connectivity
docker run --rm --network dagi-network natsio/nats-box nats -s nats://dagi-nats-node1:4222 server ping

# 5. Memory-service health
curl -sS http://localhost:8000/health
```
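
To see at a glance which probe targets are failing, the output of step 1 can be filtered down to the target names. The helper below is a minimal sketch, not part of the snapshot: `failing_targets` is a hypothetical name, and it assumes the prober exposes samples in the plain Prometheus text format without a trailing timestamp, so the last field on each line is the value.

```bash
# Print the target label of every agent_e2e_success sample whose value is 0.
# Reads Prometheus text-format metrics on stdin.
# Assumption: no timestamp follows the sample value.
failing_targets() {
  awk '
    /^agent_e2e_success[{]/ && ($NF + 0) == 0 {
      if (match($0, /target="[^"]*"/)) {
        # Skip the 8 characters of target=" and drop the closing quote.
        print substr($0, RSTART + 8, RLENGTH - 9)
      }
    }
  '
}

# Usage against the prober from step 1:
#   curl -sS http://localhost:9108/metrics | failing_targets
```

An empty result means no `agent_e2e_success` sample is currently at 0; it does not prove the prober itself is reachable, so check the exit status of `curl` as well.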

## Detailed diagnostics

### If the Gateway is DOWN
```bash
docker ps | grep gateway
docker logs dagi-gateway-node1 --tail 50
docker restart dagi-gateway-node1
```

### If the Router is not responding
```bash
docker logs dagi-router-node1 --tail 50
# Check Ollama
curl -sS http://172.17.0.1:11434/api/tags | head
```

### If Memory-service is DOWN
```bash
docker logs dagi-memory-service-node1 --tail 50
# Check Qdrant
curl -sS http://localhost:6333/collections | head
```

### If NATS has problems
```bash
# JetStream status
docker run --rm --network dagi-network natsio/nats-box nats -s nats://dagi-nats-node1:4222 stream ls
docker run --rm --network dagi-network natsio/nats-box nats -s nats://dagi-nats-node1:4222 consumer info ARTIFACT_JOBS render_pdf_worker
```

## Escalation

1. Restarting the service did not help → check resources (`docker stats`)
2. OOM kills → `dmesg | grep -i oom`
3. Disk full → `df -h`
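
The disk check in step 3 can be narrowed to the filesystems that are actually close to full. This is a rough sketch under stated assumptions: `full_filesystems` is a hypothetical helper, the 90% default threshold is illustrative rather than a value from this runbook, and mount points are assumed to contain no spaces. It reads the POSIX `df -P` layout, where column 5 is the capacity and column 6 the mount point.

```bash
# List filesystems whose usage is at or above a threshold (default 90%).
# Reads df -P output on stdin; the threshold is an illustrative assumption.
full_filesystems() {
  threshold="${1:-90}"
  awk -v limit="$threshold" '
    NR > 1 {
      use = $5
      sub(/%$/, "", use)   # strip the percent sign before comparing
      if (use + 0 >= limit + 0) print $6, $5
    }
  '
}

# Usage:
#   df -P | full_filesystems 90
```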

## Contacts

- Slack: #daarion-alerts
- On-call: check PagerDuty