Complete snapshot of /opt/microdao-daarion/ from NODE1 (144.76.224.179).
This represents the actual running production code that has diverged
significantly from the previous main branch.
Key changes from old main:
- Gateway (http_api.py): expanded from ~40KB to 164KB with full agent support
- Router: new /v1/agents/{id}/infer endpoint with vision + DeepSeek routing
- Behavior Policy: SOWA v2.2 (3-level: FULL/ACK/SILENT)
- Agent Registry: config/agent_registry.yml as single source of truth
- 13 agents configured (was 3)
- Memory service integration
- CrewAI teams and roles
Excluded from snapshot: venv/, .env, data/, backups, .tgz archives
Co-authored-by: Cursor <cursoragent@cursor.com>
Runbook: NODE1 Recovery & Safety
Purpose
Quickly restore NODE1 after incidents (Telegram webhook 500, router DNS failures, NATS/worker issues, Grafana crash-loop) and avoid accidentally stopping the wrong stack.
Quick links / aliases
- `./stack-node1 ps|up|down|logs` (node1 stack)
- `./stack-staging ps|up|down|logs` (staging stack)
- NODE1 Docker network: `dagi-network` (for `nats-box`)
Scope (NODE1 stack)
- dagi-gateway-node1 (9300)
- dagi-router-node1 (router API)
- dagi-nats-node1 (4222, JetStream enabled)
- crewai-nats-worker
- dagi-memory-service-node1 (8000)
- dagi-qdrant-node1 (6333)
- dagi-postgres (5432)
- dagi-redis-node1 (6379)
- dagi-neo4j-node1 (7474/7687)
- prometheus (9090)
- grafana
- dagi-crawl4ai-node1 (11235)
- control-plane (9200)
- other node1 services as defined in docker-compose.node1.yml
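To see exactly which services the compose file defines before touching anything, `docker compose config --services` works with the same `-f` flag:
```
docker compose -f docker-compose.node1.yml config --services
```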
Safety rules (DO THIS FIRST)
- Always set the project name for NODE1: `export COMPOSE_PROJECT_NAME=dagi_node1`
- Always use the correct compose file: `-f docker-compose.node1.yml`
- Never run `docker compose down` without verifying the target first: `docker compose -f docker-compose.node1.yml ps`
- If staging exists, it MUST have a different `COMPOSE_PROJECT_NAME` and different networks.
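A minimal safe-shutdown sketch following these rules; the interactive confirmation is an illustrative addition, not an existing NODE1 script:
```
# Pin the project and compose file, inspect, and stop only after review.
export COMPOSE_PROJECT_NAME=dagi_node1
COMPOSE="docker compose -f docker-compose.node1.yml"

$COMPOSE ps                          # verify these are the node1 containers
read -r -p "Stop THIS stack? [y/N] " ok
[ "$ok" = "y" ] && $COMPOSE down
```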
Quick status
```
docker compose -f docker-compose.node1.yml ps
docker compose -f docker-compose.node1.yml logs --tail=80 dagi-gateway-node1 dagi-router-node1 dagi-nats-node1 crewai-nats-worker grafana
```
Standard restart order (most incidents)
- NATS (foundation)
- Router (dependency for Gateway routing)
- Gateway (webhooks)
- Worker (async jobs)
- Grafana (observability only)
Commands:
```
docker compose -f docker-compose.node1.yml up -d dagi-nats-node1
docker compose -f docker-compose.node1.yml up -d dagi-router-node1
docker compose -f docker-compose.node1.yml up -d dagi-gateway-node1
docker compose -f docker-compose.node1.yml up -d crewai-nats-worker
docker compose -f docker-compose.node1.yml up -d grafana
```
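The same order as a single loop, using the service names from the commands above:
```
# Restart in dependency order: NATS → router → gateway → worker → grafana.
for svc in dagi-nats-node1 dagi-router-node1 dagi-gateway-node1 crewai-nats-worker grafana; do
  docker compose -f docker-compose.node1.yml up -d "$svc"
done
```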
Incident playbooks
A) Telegram webhook returns 500 (e.g. /greenfood/telegram/webhook)
Symptoms:
- 500 responses from gateway
- gateway logs show router request failures
Check:
```
docker logs --tail=200 dagi-gateway-node1 | grep -E "webhook|Router request failed|GREENFOOD"
docker compose -f docker-compose.node1.yml ps | grep -E "dagi-gateway-node1|dagi-router-node1"
```
Fix:
- Ensure the router is healthy:
  ```
  docker logs --tail=120 dagi-router-node1
  docker inspect --format '{{json .State.Health}}' dagi-router-node1
  ```
- Ensure the gateway can resolve the router (Docker DNS):
  ```
  docker exec -it dagi-gateway-node1 getent hosts router || true
  ```
- Restart the router, then the gateway:
  ```
  docker restart dagi-router-node1
  docker restart dagi-gateway-node1
  ```
Root cause examples:
- router container crash-loop → DNS name `router` not resolvable
- ROUTER_URL points to a non-existent host/service in the node1 network
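A direct probe of the webhook path helps separate gateway-side from router-side failures. This sketch assumes the gateway listens on port 9300 (per the scope list) and that the endpoint accepts a minimal Telegram-style POST; the payload is a hypothetical stub:
```
# A 500 here with a healthy router points at the gateway itself;
# router errors in the gateway logs point downstream.
curl -s -o /dev/null -w "%{http_code}\n" \
  -X POST http://localhost:9300/greenfood/telegram/webhook \
  -H "Content-Type: application/json" \
  -d '{"update_id": 1}'
```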
B) Router crash-loop on startup (Pydantic / config errors)
Symptoms:
- router restarting
- traceback in `docker logs dagi-router-node1`
Fix:
- Read the first error in the logs:
  ```
  docker logs --tail=200 dagi-router-node1
  ```
- Apply the code fix (example from a previous incident: `temperature: float = 0.2`), then rebuild/recreate if needed:
  ```
  docker compose -f docker-compose.node1.yml up -d --build --force-recreate dagi-router-node1
  ```
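To surface only the first traceback instead of scrolling the whole log, a grep sketch (GNU grep assumed; `ValidationError` is the usual Pydantic failure class):
```
# Print the first startup traceback with surrounding context.
docker logs dagi-router-node1 2>&1 | grep -m1 -B2 -A12 -E "Traceback|ValidationError"
```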
C) NATS worker shows Subscription failed / NotFoundError
Symptoms:
- worker logs mention `NotFoundError`
- worker cannot subscribe to / consume tasks
Check:
```
docker logs --tail=200 crewai-nats-worker
docker logs --tail=200 dagi-nats-node1 | grep -i jetstream
```
Fix (JetStream):
- Ensure JetStream is enabled (NATS started with `-js`).
- Ensure the required stream exists (example used on NODE1):
  - Stream: `STREAM_AGENT_RUN`
  - Subjects: `agent.run.>`
- Using nats-box (inside the node1 network):
  ```
  docker run --rm -it --network <NODE1_NETWORK> natsio/nats-box:latest sh
  ```
- Create the stream/consumer required by the worker subjects (see the sketch after this playbook).
- Restart the worker:
  ```
  docker restart crewai-nats-worker
  ```
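A minimal stream-creation sketch from inside nats-box, assuming the worker only needs the `STREAM_AGENT_RUN` stream with subjects `agent.run.>`; the storage/retention flags are illustrative defaults, not confirmed NODE1 settings:
```
# Inside the nats-box shell; -s targets the NATS_URL documented below.
nats -s nats://nats:4222 stream add STREAM_AGENT_RUN \
  --subjects "agent.run.>" \
  --storage file \
  --retention limits \
  --defaults

# Verify the stream the worker consumes from (consumer name varies).
nats -s nats://nats:4222 stream info STREAM_AGENT_RUN
```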
D) Grafana crash-loop due to provisioning alert rule
Symptoms:
- grafana restarting
- logs mention an invalid alert rule / relative time range (`From: 0s, To: 0s`)
Fix:
- Identify the failing rule file:
  ```
  docker logs --tail=200 grafana
  ```
- Fix the provisioning YAML (example path used on NODE1): `/opt/microdao-daarion/monitoring/grafana/provisioning/alerting/alerts.yml`
- Ensure the rule has a valid `relativeTimeRange`
- Restart grafana:
  ```
  docker restart grafana
  ```
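To spot the offending rule before restarting, a grep sketch (assumes the provisioning file uses the standard `relativeTimeRange` keys):
```
# from: 0 / to: 0 under relativeTimeRange is the crash-loop trigger.
grep -n -A2 "relativeTimeRange" \
  /opt/microdao-daarion/monitoring/grafana/provisioning/alerting/alerts.yml
```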
Post-recovery verification checklist
- Core health:
  ```
  docker compose -f docker-compose.node1.yml ps | grep -E "Up|healthy"
  ```
- Router reachable from the gateway:
  ```
  docker exec -it dagi-gateway-node1 getent hosts router
  ```
- NATS OK:
  ```
  docker logs --tail=80 dagi-nats-node1 | grep -i "JetStream\|Server is ready"
  ```
- Worker subscribed:
  ```
  docker logs --tail=120 crewai-nats-worker | grep -E "Subscribed|Subscription OK|NotFoundError" || true
  ```
- GREENFOOD policy sanity:
  - advertising post → ignored
  - direct question → reply of ≤ 3 sentences
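The checks above can be run as one pass; a minimal sketch (container names taken from the scope list; the GREENFOOD policy items still need a manual check in the group):
```
#!/usr/bin/env bash
# Post-recovery smoke check for the NODE1 stack.
set -u
COMPOSE="docker compose -f docker-compose.node1.yml"

$COMPOSE ps | grep -E "Up|healthy"
docker exec dagi-gateway-node1 getent hosts router
docker logs --tail=80 dagi-nats-node1 2>&1 | grep -i "JetStream\|Server is ready"
docker logs --tail=120 crewai-nats-worker 2>&1 | grep -E "Subscribed|Subscription OK|NotFoundError" || true
```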
Known configuration anchors (update when changed)
- GREENFOOD trade group: t.me/+SPm1OV-pDJZhZGFi
- ROUTER_URL used by the gateway: `http://router:8000` (must resolve inside the node1 network)
- NATS_URL: `nats://nats:4222`
- JetStream stream: `STREAM_AGENT_RUN` (`agent.run.>`)
- Grafana alerts provisioning file: `monitoring/grafana/provisioning/alerting/alerts.yml`
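To confirm the running containers actually carry these values, a quick sketch (the variable names are taken from the anchors above and assumed to be set as container env):
```
# Compare live container env against the documented anchors.
docker exec dagi-gateway-node1 env | grep -E "^ROUTER_URL="
docker exec crewai-nats-worker env | grep -E "^NATS_URL="
```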
Appendix: common commands
```
docker compose -f docker-compose.node1.yml ps
docker compose -f docker-compose.node1.yml logs -f <service>
docker restart <container>
docker compose -f docker-compose.node1.yml up -d --build --force-recreate <service>
docker system df
```