microdao-daarion/docs/RUNBOOK_NODE1_RECOVERY_SAFETY.md
Apple ef3473db21 snapshot: NODE1 production state 2026-02-09
Complete snapshot of /opt/microdao-daarion/ from NODE1 (144.76.224.179).
This represents the actual running production code that has diverged
significantly from the previous main branch.

Key changes from old main:
- Gateway (http_api.py): expanded from ~40KB to 164KB with full agent support
- Router: new /v1/agents/{id}/infer endpoint with vision + DeepSeek routing
- Behavior Policy: SOWA v2.2 (3-level: FULL/ACK/SILENT)
- Agent Registry: config/agent_registry.yml as single source of truth
- 13 agents configured (was 3)
- Memory service integration
- CrewAI teams and roles

Excluded from snapshot: venv/, .env, data/, backups, .tgz archives

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-02-09 08:46:46 -08:00

Runbook: NODE1 Recovery & Safety

Purpose

Quickly restore NODE1 after failures (Telegram webhook 500, router DNS, NATS/worker, Grafana crash-loop) and avoid accidentally stopping the wrong stack.

  • ./stack-node1 ps|up|down|logs (node1 stack)
  • ./stack-staging ps|up|down|logs (staging stack)
  • NODE1 Docker network: dagi-network (for nats-box)

Scope (NODE1 stack)

  • dagi-gateway-node1 (9300)
  • dagi-router-node1 (router API)
  • dagi-nats-node1 (4222, JetStream enabled)
  • crewai-nats-worker
  • dagi-memory-service-node1 (8000)
  • dagi-qdrant-node1 (6333)
  • dagi-postgres (5432)
  • dagi-redis-node1 (6379)
  • dagi-neo4j-node1 (7474/7687)
  • prometheus (9090)
  • grafana
  • dagi-crawl4ai-node1 (11235)
  • control-plane (9200)
  • other node1 services as defined in docker-compose.node1.yml

Safety rules (DO THIS FIRST)

  1. Always set project name for NODE1:
    • export COMPOSE_PROJECT_NAME=dagi_node1
  2. Always use the correct compose file:
    • -f docker-compose.node1.yml
  3. Never run docker compose down without first verifying the target:
    • docker compose -f docker-compose.node1.yml ps
  4. If staging exists, it MUST have a different COMPOSE_PROJECT_NAME and networks.
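
The rules above can be folded into a small shell guard. This is only a sketch: the function name node1_compose is hypothetical, and it is not the content of ./stack-node1, which is not shown in this runbook. It refuses to run unless the project name is dagi_node1, always pins the node1 compose file, and lists the target before any down.

```shell
# Hypothetical guard around docker compose for NODE1 (the function name is an assumption).
# Rule 1: require COMPOSE_PROJECT_NAME=dagi_node1.
# Rule 2: always pass -f docker-compose.node1.yml.
# Rule 3: list the target and ask for confirmation before "down".
node1_compose() {
    if [ "${COMPOSE_PROJECT_NAME:-}" != "dagi_node1" ]; then
        echo "refusing: COMPOSE_PROJECT_NAME is '${COMPOSE_PROJECT_NAME:-unset}', expected 'dagi_node1'" >&2
        return 1
    fi
    if [ "${1:-}" = "down" ]; then
        # Show what would be stopped before stopping anything.
        docker compose -f docker-compose.node1.yml ps || return 1
        printf 'Stop the stack listed above? Type "yes" to continue: ' >&2
        read -r answer
        [ "$answer" = "yes" ] || { echo "aborted" >&2; return 1; }
    fi
    docker compose -f docker-compose.node1.yml "$@"
}
```

Usage, once the function is loaded in the shell: export COMPOSE_PROJECT_NAME=dagi_node1, then node1_compose ps or node1_compose down.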

Quick status

  • docker compose -f docker-compose.node1.yml ps
  • docker compose -f docker-compose.node1.yml logs --tail=80 dagi-gateway-node1 dagi-router-node1 dagi-nats-node1 crewai-nats-worker grafana

Standard restart order (most incidents)

  1. NATS (foundation)
  2. Router (dependency for Gateway routing)
  3. Gateway (webhooks)
  4. Worker (async jobs)
  5. Grafana (observability only)

Commands:

  • docker compose -f docker-compose.node1.yml up -d dagi-nats-node1
  • docker compose -f docker-compose.node1.yml up -d dagi-router-node1
  • docker compose -f docker-compose.node1.yml up -d dagi-gateway-node1
  • docker compose -f docker-compose.node1.yml up -d crewai-nats-worker
  • docker compose -f docker-compose.node1.yml up -d grafana
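
The same order can be scripted so that a failure in an early service stops the sequence instead of starting dependents on a broken foundation. A minimal sketch: the function and variable names are hypothetical, and the 3-second settle delay is an arbitrary assumption rather than a tuned value.

```shell
# Start NODE1 services in dependency order, stopping at the first failure.
# Service names come from the restart order above; everything else is illustrative.
NODE1_RESTART_ORDER="dagi-nats-node1 dagi-router-node1 dagi-gateway-node1 crewai-nats-worker grafana"

node1_restart_in_order() {
    for svc in $NODE1_RESTART_ORDER; do
        echo "starting $svc"
        docker compose -f docker-compose.node1.yml up -d "$svc" || {
            echo "failed to start $svc; fix it before continuing" >&2
            return 1
        }
        # Give the service a moment before starting the next one (assumed delay).
        sleep "${NODE1_SETTLE_SECONDS:-3}"
    done
}
```

Note that up -d returning success only means the container was started, not that the service is healthy, so the status and log checks in this runbook still apply afterwards.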

Incident playbooks

A) Telegram webhook returns 500 (e.g. /greenfood/telegram/webhook)

Symptoms:

  • 500 responses from gateway
  • gateway logs show router request failures

Check:

  • docker logs --tail=200 dagi-gateway-node1 | grep -E "webhook|Router request failed|GREENFOOD"
  • docker compose -f docker-compose.node1.yml ps | grep -E "dagi-gateway-node1|dagi-router-node1"

Fix:

  1. Ensure router is healthy:
    • docker logs --tail=120 dagi-router-node1
    • docker inspect --format '{{json .State.Health}}' dagi-router-node1
  2. Ensure gateway can resolve router (Docker DNS):
    • docker exec -it dagi-gateway-node1 getent hosts router || true
  3. Restart router + gateway:
    • docker restart dagi-router-node1
    • docker restart dagi-gateway-node1

Root cause examples:

  • router container crash-loop → DNS name router not resolvable
  • ROUTER_URL points to non-existing host/service in node1 network
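
The two root causes above can be told apart with a short triage helper. This is a sketch under assumptions: the function name is hypothetical, and it uses the container state reported by docker inspect rather than the health check, since a crash-looping container shows up as restarting.

```shell
# Classify the likely cause of gateway 500s using two checks from this playbook:
# the router container state, and Docker DNS resolution of "router" from the gateway.
webhook_500_triage() {
    router_status=$(docker inspect --format '{{.State.Status}}' dagi-router-node1 2>/dev/null)
    if [ "$router_status" != "running" ]; then
        echo "router container is '${router_status:-missing}': check its logs first"
        return 1
    fi
    if ! docker exec dagi-gateway-node1 getent hosts router >/dev/null 2>&1; then
        echo "gateway cannot resolve 'router': check ROUTER_URL and the node1 network"
        return 1
    fi
    echo "router is running and resolvable: restart router, then gateway"
}
```

If the first message appears, playbook B is the next stop; if the second appears, compare ROUTER_URL with the service names in docker-compose.node1.yml.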

B) Router crash-loop on startup (Pydantic / config errors)

Symptoms:

  • router restarting
  • traceback in docker logs dagi-router-node1

Fix:

  1. Read the first error in logs:
    • docker logs --tail=200 dagi-router-node1
  2. Hotfix then rebuild/recreate if needed:
    • fix the code (a previous fix, for example: temperature: float = 0.2)
    • docker compose -f docker-compose.node1.yml up -d --build --force-recreate dagi-router-node1
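
In a crash-loop the same traceback repeats many times, so it helps to jump straight to the first one. A small sketch: the helper name is hypothetical and the pattern is an assumption about typical Python output, so adjust it to what the router actually logs.

```shell
# Print the first traceback or error line from a log stream, prefixed with its line number.
# The pattern is an assumption; it is not taken from the router's real log format.
first_error() {
    grep -n -m 1 -E 'Traceback|Error'
}

# Usage (docker logs relays the container's stderr on stderr, hence 2>&1):
#   docker logs --tail=200 dagi-router-node1 2>&1 | first_error
```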

C) NATS worker shows Subscription failed / NotFoundError

Symptoms:

  • worker logs mention NotFoundError
  • worker cannot subscribe / consume tasks

Check:

  • docker logs --tail=200 crewai-nats-worker
  • docker logs --tail=200 dagi-nats-node1 | grep -i jetstream

Fix (JetStream):

  1. Ensure JetStream enabled (NATS started with -js).
  2. Ensure required stream exists (example used on NODE1):
    • Stream: STREAM_AGENT_RUN
    • Subjects: agent.run.>
  3. Using nats-box (inside node1 network):
    • docker run --rm -it --network <NODE1_NETWORK> natsio/nats-box:latest sh
    • create stream/consumer as required by worker subjects
  4. Restart worker:
    • docker restart crewai-nats-worker
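
Steps 2 and 3 can be sketched with the nats CLI that ships in nats-box. Assumptions to check before use: the network name (dagi-network) and NATS URL (nats://nats:4222) are the values listed elsewhere in this runbook, the helper names are hypothetical, and --defaults accepts the CLI's default storage, retention and limits, which may not match what the worker expects.

```shell
# Create STREAM_AGENT_RUN only if it is missing.
# Stream name and subject come from this playbook; all other settings are CLI defaults.
NODE1_NETWORK="${NODE1_NETWORK:-dagi-network}"
NATS_URL="${NATS_URL:-nats://nats:4222}"

# Run one nats CLI command from a throwaway nats-box container on the node1 network.
nats_box() {
    docker run --rm --network "$NODE1_NETWORK" natsio/nats-box:latest \
        nats --server "$NATS_URL" "$@"
}

ensure_agent_run_stream() {
    if nats_box stream info STREAM_AGENT_RUN >/dev/null 2>&1; then
        echo "stream STREAM_AGENT_RUN already exists"
    else
        nats_box stream add STREAM_AGENT_RUN --subjects 'agent.run.>' --defaults
    fi
}
```

Checking with stream info first keeps the helper safe to re-run: an existing stream is left untouched rather than recreated.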

D) Grafana crash-loop due to provisioning alert rule

Symptoms:

  • grafana restarting
  • logs mention invalid alert rule / relative time range From: 0s, To: 0s

Fix:

  1. Identify failing rule file:
    • docker logs --tail=200 grafana
  2. Fix provisioning yaml (example path used on NODE1):
    • /opt/microdao-daarion/monitoring/grafana/provisioning/alerting/alerts.yml
    • Ensure rule has valid relativeTimeRange
  3. Restart grafana:
    • docker restart grafana
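
To locate candidate rules without reading the whole file, the provisioning YAML can be scanned for time ranges where from is not greater than to. This is a sketch with a hypothetical function name; it assumes block-style YAML with from: and to: on their own lines under relativeTimeRange:, and it does not parse flow-style mappings. Treat the output as candidates only, since some entries (for example expression steps) may legitimately carry a zero range.

```shell
# List relativeTimeRange blocks whose "from" is not greater than "to".
# Output format: <file>:<line of relativeTimeRange>: from=<n> to=<n>
find_zero_time_ranges() {
    awk '
        /relativeTimeRange:/ { in_range = 1; start = NR; from = ""; to = ""; next }
        in_range && /^[[:space:]]*from:/ { from = $2 }
        in_range && /^[[:space:]]*to:/   { to = $2 }
        in_range && from != "" && to != "" {
            if (from + 0 <= to + 0) print FILENAME ":" start ": from=" from " to=" to
            in_range = 0
        }
    ' "$1"
}
```

Example: find_zero_time_ranges /opt/microdao-daarion/monitoring/grafana/provisioning/alerting/alerts.yml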

Post-recovery verification checklist

  1. Core health:
    • docker compose -f docker-compose.node1.yml ps | grep -E "Up|healthy"
  2. Router reachable from gateway:
    • docker exec -it dagi-gateway-node1 getent hosts router
  3. NATS OK:
    • docker logs --tail=80 dagi-nats-node1 | grep -i "JetStream\|Server is ready"
  4. Worker subscribed:
    • docker logs --tail=120 crewai-nats-worker | grep -E "Subscribed|Subscription OK|NotFoundError" || true
  5. GREENFOOD policy sanity:
    • advertisement → ignored
    • direct question → reply of at most 3 sentences
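
The automatable checks in this list can be bundled into one pass/fail report. A minimal sketch with hypothetical function names: the log strings are the ones quoted in this checklist, and the GREENFOOD policy check stays manual. With a short --tail, the NATS ready line may already have scrolled out of range on a long-running server.

```shell
# Run a command quietly and print PASS or FAIL with a label.
# $1 = label, remaining arguments = command to run.
check() {
    label=$1; shift
    if "$@" >/dev/null 2>&1; then
        echo "PASS  $label"
    else
        echo "FAIL  $label"
        return 1
    fi
}

nats_ready() { docker logs --tail=80 dagi-nats-node1 2>&1 | grep -qi 'Server is ready'; }
worker_subscribed() { docker logs --tail=120 crewai-nats-worker 2>&1 | grep -qE 'Subscribed|Subscription OK'; }

# Run all checks, count failures, and return non-zero if any check failed.
node1_verify() {
    failures=0
    check "router resolvable from gateway" docker exec dagi-gateway-node1 getent hosts router || failures=$((failures + 1))
    check "NATS ready" nats_ready || failures=$((failures + 1))
    check "worker subscribed" worker_subscribed || failures=$((failures + 1))
    echo "$failures check(s) failed"
    [ "$failures" -eq 0 ]
}
```

Because node1_verify returns non-zero on any failure, it can gate a follow-up step, for example: node1_verify && echo "recovery confirmed".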

Known configuration anchors (update when changed)

  • GREENFOOD trading group: t.me/+SPm1OV-pDJZhZGFi
  • ROUTER_URL used by gateway: http://router:8000 (must resolve inside node1 network)
  • NATS_URL: nats://nats:4222
  • JetStream Stream: STREAM_AGENT_RUN (agent.run.>)
  • Grafana alerts provisioning file: monitoring/grafana/provisioning/alerting/alerts.yml

Appendix: common commands

  • docker compose -f docker-compose.node1.yml ps
  • docker compose -f docker-compose.node1.yml logs -f <service>
  • docker restart <container>
  • docker compose -f docker-compose.node1.yml up -d --build --force-recreate <service>
  • docker system df