# Runbook: NODE1 Recovery & Safety

## Purpose

Quickly restore NODE1 after failures (Telegram webhook 500, router DNS, NATS/worker, Grafana crash-loop) and avoid accidentally stopping the wrong stack.

## Quick links / aliases

- `./stack-node1 ps|up|down|logs` (node1 stack)
- `./stack-staging ps|up|down|logs` (staging stack)
- NODE1 Docker network: `dagi-network` (for `nats-box`)

## Scope (NODE1 stack)

- dagi-gateway-node1 (9300)
- dagi-router-node1 (router API)
- dagi-nats-node1 (4222, JetStream enabled)
- crewai-nats-worker
- dagi-memory-service-node1 (8000)
- dagi-qdrant-node1 (6333)
- dagi-postgres (5432)
- dagi-redis-node1 (6379)
- dagi-neo4j-node1 (7474/7687)
- prometheus (9090)
- grafana
- dagi-crawl4ai-node1 (11235)
- control-plane (9200)
- other node1 services as defined in docker-compose.node1.yml

## Safety rules (DO THIS FIRST)

1) Always set the project name for NODE1:
   - `export COMPOSE_PROJECT_NAME=dagi_node1`
2) Always use the correct compose file:
   - `-f docker-compose.node1.yml`
3) Never run `docker compose down` without first verifying the target:
   - `docker compose -f docker-compose.node1.yml ps`
4) If staging exists, it MUST have a different `COMPOSE_PROJECT_NAME` and different networks.

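
The safety rules above can be made concrete with a small guard function. This is only a sketch under stated assumptions: `node1_compose` and the `DRY_RUN` switch are hypothetical names that do not come from the repository, and the existing `./stack-node1` wrapper may already do something similar. The function refuses to run unless `COMPOSE_PROJECT_NAME` is `dagi_node1`, and it always passes `-f docker-compose.node1.yml`.

```shell
#!/usr/bin/env bash
# Sketch only: node1_compose and DRY_RUN are hypothetical helper names.

NODE1_PROJECT="dagi_node1"
NODE1_COMPOSE_FILE="docker-compose.node1.yml"

node1_compose() {
  # Rule 1: the project name must be set explicitly to the NODE1 value.
  if [ "${COMPOSE_PROJECT_NAME:-}" != "$NODE1_PROJECT" ]; then
    echo "refusing: COMPOSE_PROJECT_NAME is '${COMPOSE_PROJECT_NAME:-}', expected '$NODE1_PROJECT'" >&2
    return 1
  fi
  # Rule 2: always pass the NODE1 compose file.
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command instead of running it, to verify the target first.
    echo "docker compose -f $NODE1_COMPOSE_FILE $*"
  else
    docker compose -f "$NODE1_COMPOSE_FILE" "$@"
  fi
}
```

With `DRY_RUN=1` the function only prints the command it would run, which is a cheap way to confirm the target before a `down`, for example `DRY_RUN=1 node1_compose down`.
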
## Quick status

- `docker compose -f docker-compose.node1.yml ps`
- `docker compose -f docker-compose.node1.yml logs --tail=80 dagi-gateway-node1 dagi-router-node1 dagi-nats-node1 crewai-nats-worker grafana`

## Standard restart order (most incidents)

1) NATS (foundation)
2) Router (dependency for Gateway routing)
3) Gateway (webhooks)
4) Worker (async jobs)
5) Grafana (observability only)

Commands:

- `docker compose -f docker-compose.node1.yml up -d dagi-nats-node1`
- `docker compose -f docker-compose.node1.yml up -d dagi-router-node1`
- `docker compose -f docker-compose.node1.yml up -d dagi-gateway-node1`
- `docker compose -f docker-compose.node1.yml up -d crewai-nats-worker`
- `docker compose -f docker-compose.node1.yml up -d grafana`

## Incident playbooks

### A) Telegram webhook returns 500 (e.g. /greenfood/telegram/webhook)

Symptoms:

- 500 responses from gateway
- gateway logs show router request failures

Check:

- `docker logs --tail=200 dagi-gateway-node1 | grep -E "webhook|Router request failed|GREENFOOD"`
- `docker compose -f docker-compose.node1.yml ps | grep -E "dagi-gateway-node1|dagi-router-node1"`

Fix:

1) Ensure router is healthy:
   - `docker logs --tail=120 dagi-router-node1`
   - `docker inspect --format '{{json .State.Health}}' dagi-router-node1`
2) Ensure gateway can resolve router (Docker DNS):
   - `docker exec -it dagi-gateway-node1 getent hosts router || true`
3) Restart router + gateway:
   - `docker restart dagi-router-node1`
   - `docker restart dagi-gateway-node1`

Root cause examples:

- router container crash-loop → DNS name `router` not resolvable
- ROUTER_URL points to a nonexistent host/service in the node1 network

### B) Router crash-loop on startup (Pydantic / config errors)

Symptoms:

- router restarting
- traceback in `docker logs dagi-router-node1`

Fix:

1) Read the first error in the logs:
   - `docker logs --tail=200 dagi-router-node1`
2) Hotfix, then rebuild/recreate if needed:
   - code fix (example previously: `temperature: float = 0.2`)
   - `docker compose -f docker-compose.node1.yml up -d --build --force-recreate dagi-router-node1`

### C) NATS worker shows Subscription failed / NotFoundError

Symptoms:

- worker logs mention `NotFoundError`
- worker cannot subscribe / consume tasks

Check:

- `docker logs --tail=200 crewai-nats-worker`
- `docker logs --tail=200 dagi-nats-node1 | grep -i jetstream`

Fix (JetStream):

1) Ensure JetStream is enabled (NATS started with `-js`).
2) Ensure the required stream exists (example used on NODE1):
   - Stream: `STREAM_AGENT_RUN`
   - Subjects: `agent.run.>`
3) Use nats-box (inside the node1 network):
   - `docker run --rm -it --network dagi-network natsio/nats-box:latest sh`
   - create stream/consumer as required by worker subjects
4) Restart worker:
   - `docker restart crewai-nats-worker`

### D) Grafana crash-loop due to provisioning alert rule

Symptoms:

- grafana restarting
- logs mention invalid alert rule / relative time range `From: 0s, To: 0s`

Fix:

1) Identify the failing rule file:
   - `docker logs --tail=200 grafana`
2) Fix the provisioning yaml (example path used on NODE1):
   - `/opt/microdao-daarion/monitoring/grafana/provisioning/alerting/alerts.yml`
   - Ensure the rule has a valid `relativeTimeRange`
3) Restart grafana:
   - `docker restart grafana`

## Post-recovery verification checklist

1) Core health:
   - `docker compose -f docker-compose.node1.yml ps | grep -E "Up|healthy"`
2) Router reachable from gateway:
   - `docker exec -it dagi-gateway-node1 getent hosts router`
3) NATS OK:
   - `docker logs --tail=80 dagi-nats-node1 | grep -i "JetStream\|Server is ready"`
4) Worker subscribed:
   - `docker logs --tail=120 crewai-nats-worker | grep -E "Subscribed|Subscription OK|NotFoundError" || true`
5) GREENFOOD policy sanity:
   - advertisement → ignored
   - direct question → reply of at most 3 sentences

## Known configuration anchors (update when changed)

- GREENFOOD trading group: `t.me/+SPm1OV-pDJZhZGFi`
- ROUTER_URL used by gateway: `http://router:8000` (must resolve inside node1 network)
- NATS_URL: `nats://nats:4222`
- JetStream Stream: `STREAM_AGENT_RUN` (`agent.run.>`)
- Grafana alerts provisioning file: `monitoring/grafana/provisioning/alerting/alerts.yml`

## Appendix: common commands

- `docker compose -f docker-compose.node1.yml ps`
- `docker compose -f docker-compose.node1.yml logs -f <service>`
- `docker restart <container>`
- `docker compose -f docker-compose.node1.yml up -d --build --force-recreate <service>`
- `docker system df`
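
As a convenience, the standard restart order described earlier (NATS, router, gateway, worker, Grafana) could be wrapped in one function. The following is a minimal sketch, not an existing script: `restart_node1` and `DRY_RUN` are hypothetical names, and it assumes it is run from the directory that contains `docker-compose.node1.yml`.

```shell
#!/usr/bin/env bash
# Sketch only: restart_node1 and DRY_RUN are hypothetical helper names.

COMPOSE_FILE="docker-compose.node1.yml"
# Order taken from "Standard restart order": NATS, router, gateway, worker, Grafana.
SERVICES="dagi-nats-node1 dagi-router-node1 dagi-gateway-node1 crewai-nats-worker grafana"

restart_node1() {
  for svc in $SERVICES; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      # Print each command instead of running it.
      echo "docker compose -f $COMPOSE_FILE up -d $svc"
    else
      # Stop at the first failure so later services are not started on a broken base.
      docker compose -f "$COMPOSE_FILE" up -d "$svc" || return 1
    fi
  done
}
```

Running `DRY_RUN=1 restart_node1` first prints the five commands in order without touching any container. Stopping at the first failure is a design choice here: if NATS or the router does not come up, starting the gateway and worker is unlikely to help.
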