microdao-daarion/docs/RUNBOOK_NODE1_RECOVERY_SAFETY.md
# Runbook: NODE1 Recovery & Safety
## Purpose
Quickly restore NODE1 after failures (Telegram webhook 500, router DNS, NATS/worker, Grafana crash-loop) and avoid accidentally stopping the wrong stack.
## Quick links / aliases
- `./stack-node1 ps|up|down|logs` (node1 stack)
- `./stack-staging ps|up|down|logs` (staging stack)
- NODE1 Docker network: `dagi-network` (for `nats-box`)
## Scope (NODE1 stack)
- dagi-gateway-node1 (9300)
- dagi-router-node1 (router API)
- dagi-nats-node1 (4222, JetStream enabled)
- crewai-nats-worker
- dagi-memory-service-node1 (8000)
- dagi-qdrant-node1 (6333)
- dagi-postgres (5432)
- dagi-redis-node1 (6379)
- dagi-neo4j-node1 (7474/7687)
- prometheus (9090)
- grafana
- dagi-crawl4ai-node1 (11235)
- control-plane (9200)
- other node1 services as defined in docker-compose.node1.yml
## Safety rules (DO THIS FIRST)
1) Always set project name for NODE1:
- `export COMPOSE_PROJECT_NAME=dagi_node1`
2) Always use the correct compose file:
- `-f docker-compose.node1.yml`
3) Never run `docker compose down` without first verifying the target:
- `docker compose -f docker-compose.node1.yml ps`
4) If staging exists, it MUST have a different `COMPOSE_PROJECT_NAME` and networks.
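The safety rules above can be enforced by a small guard instead of relying on memory. The following is a minimal sketch, not part of the repository: `node1_compose` is a hypothetical helper name, and it assumes the project name `dagi_node1` and the compose file `docker-compose.node1.yml` given in the rules.

```shell
# Hypothetical guard (not in the repo): run docker compose against NODE1 only
# when COMPOSE_PROJECT_NAME has the expected value.
node1_compose() {
  if [ "${COMPOSE_PROJECT_NAME:-}" != "dagi_node1" ]; then
    echo "refusing: COMPOSE_PROJECT_NAME is '${COMPOSE_PROJECT_NAME:-}', expected 'dagi_node1'" >&2
    return 1
  fi
  # Always pass the NODE1 compose file explicitly (rule 2).
  docker compose -f docker-compose.node1.yml "$@"
}
```

Usage: `export COMPOSE_PROJECT_NAME=dagi_node1`, then `node1_compose ps` before any `node1_compose down`. The guard does not replace rule 3; it only makes the wrong project name fail loudly.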
## Quick status
- `docker compose -f docker-compose.node1.yml ps`
- `docker compose -f docker-compose.node1.yml logs --tail=80 dagi-gateway-node1 dagi-router-node1 dagi-nats-node1 crewai-nats-worker grafana`
## Standard restart order (most incidents)
1) NATS (foundation)
2) Router (dependency for Gateway routing)
3) Gateway (webhooks)
4) Worker (async jobs)
5) Grafana (observability only)
Commands:
- `docker compose -f docker-compose.node1.yml up -d dagi-nats-node1`
- `docker compose -f docker-compose.node1.yml up -d dagi-router-node1`
- `docker compose -f docker-compose.node1.yml up -d dagi-gateway-node1`
- `docker compose -f docker-compose.node1.yml up -d crewai-nats-worker`
- `docker compose -f docker-compose.node1.yml up -d grafana`
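The restart order above can be wrapped in a loop so the sequence is never typed out of order. This is a sketch under the assumption that the five service names listed above are current; `restart_node1_in_order` is a hypothetical name. Note that `up -d` returns once a container is started, so the loop does not wait for health checks.

```shell
# Hypothetical helper: bring services up in the documented order and stop at
# the first failure, so later services are not started on a broken base.
restart_node1_in_order() {
  for svc in dagi-nats-node1 dagi-router-node1 dagi-gateway-node1 crewai-nats-worker grafana; do
    echo "starting ${svc}"
    docker compose -f docker-compose.node1.yml up -d "${svc}" || return 1
  done
}
```

Usage: `restart_node1_in_order`, followed by the post-recovery checks.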
## Incident playbooks
### A) Telegram webhook returns 500 (e.g. /greenfood/telegram/webhook)
Symptoms:
- 500 responses from gateway
- gateway logs show router request failures
Check:
- `docker logs --tail=200 dagi-gateway-node1 2>&1 | grep -E "webhook|Router request failed|GREENFOOD"` (`2>&1` so that lines the container wrote to stderr also reach `grep`)
- `docker compose -f docker-compose.node1.yml ps | grep -E "dagi-gateway-node1|dagi-router-node1"`
Fix:
1) Ensure router is healthy:
- `docker logs --tail=120 dagi-router-node1`
- `docker inspect --format '{{json .State.Health}}' dagi-router-node1`
2) Ensure gateway can resolve router (Docker DNS):
- `docker exec -it dagi-gateway-node1 getent hosts router || true`
3) Restart router + gateway:
- `docker restart dagi-router-node1`
- `docker restart dagi-gateway-node1`
Root cause examples:
- router container crash-loop → DNS name `router` not resolvable
- ROUTER_URL points to non-existing host/service in node1 network
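Fix steps 1 and 2 of playbook A can be combined into one check that reports both router health and DNS resolution from the gateway. This is a sketch: `check_router_from_gateway` is a hypothetical name, and the `{{if .State.Health}}` template is used so the command also works if the router container defines no healthcheck (an assumption about the image, not something this runbook states).

```shell
# Hypothetical helper: report router health, then verify that the gateway
# container can resolve the Docker DNS name "router".
check_router_from_gateway() {
  status=$(docker inspect \
    --format '{{if .State.Health}}{{.State.Health.Status}}{{else}}no-healthcheck{{end}}' \
    dagi-router-node1) || { echo "router container not found"; return 1; }
  echo "router health: ${status}"
  if docker exec dagi-gateway-node1 getent hosts router >/dev/null; then
    echo "gateway resolves 'router'"
  else
    echo "gateway cannot resolve 'router'"
    return 1
  fi
}
```

A non-zero return means a restart of the router and gateway (step 3) is the next thing to try.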
### B) Router crash-loop on startup (Pydantic / config errors)
Symptoms:
- router restarting
- traceback in `docker logs dagi-router-node1`
Fix:
1) Read the first error in logs:
- `docker logs --tail=200 dagi-router-node1`
2) Hotfix then rebuild/recreate if needed:
- apply the code fix (an earlier example: `temperature: float = 0.2`)
- `docker compose -f docker-compose.node1.yml up -d --build --force-recreate dagi-router-node1`
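In a crash-loop the same traceback repeats, so the first one in the log tail is usually enough. The filter below is a sketch that assumes the router logs standard Python tracebacks; `first_traceback` is a hypothetical name and the default of 40 lines is an arbitrary choice.

```shell
# Hypothetical filter: print at most N lines (default 40) starting at the
# first Python traceback found on stdin.
first_traceback() {
  awk -v max="${1:-40}" '
    /Traceback \(most recent call last\)/ { found = 1 }
    found && shown < max { print; shown++ }
  '
}
```

Usage: `docker logs --tail=200 dagi-router-node1 2>&1 | first_traceback`. The `2>&1` matters because `docker logs` writes the container's stderr to its own stderr.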
### C) NATS worker shows Subscription failed / NotFoundError
Symptoms:
- worker logs mention `NotFoundError`
- worker cannot subscribe / consume tasks
Check:
- `docker logs --tail=200 crewai-nats-worker`
- `docker logs --tail=200 dagi-nats-node1 2>&1 | grep -i jetstream`
Fix (JetStream):
1) Ensure JetStream is enabled (NATS started with `-js`).
2) Ensure required stream exists (example used on NODE1):
- Stream: `STREAM_AGENT_RUN`
- Subjects: `agent.run.>`
3) Using nats-box (inside node1 network):
- `docker run --rm -it --network <NODE1_NETWORK> natsio/nats-box:latest sh` (on NODE1, `<NODE1_NETWORK>` is `dagi-network`)
- create stream/consumer as required by worker subjects
4) Restart worker:
- `docker restart crewai-nats-worker`
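Steps 2 and 3 of playbook C can be sketched as an idempotent check: create the stream only if it is missing. Assumptions are labeled in the comments: the network name and NATS URL come from this runbook, the helper names are hypothetical, and `--storage file` with `--defaults` are illustrative choices, not the settings the worker is known to require. It also assumes the `nats-box` image runs the `nats` CLI when given it as a command.

```shell
# Hypothetical helper: run the nats CLI from nats-box inside the NODE1 network.
# Defaults are taken from this runbook (dagi-network, nats://nats:4222).
nats_box() {
  docker run --rm --network "${NODE1_NETWORK:-dagi-network}" \
    natsio/nats-box:latest nats --server "${NATS_URL:-nats://nats:4222}" "$@"
}

# Create STREAM_AGENT_RUN only if `stream info` cannot find it.
ensure_agent_run_stream() {
  if nats_box stream info STREAM_AGENT_RUN >/dev/null 2>&1; then
    echo "stream STREAM_AGENT_RUN already exists"
  else
    # Assumption: file storage and CLI defaults; adjust to what the worker expects.
    nats_box stream add STREAM_AGENT_RUN --subjects 'agent.run.>' --storage file --defaults
  fi
}
```

Usage: `ensure_agent_run_stream`, then restart the worker as in step 4. Consumers are not created here; the worker may create its own or need one added separately.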
### D) Grafana crash-loop due to provisioning alert rule
Symptoms:
- grafana restarting
- logs mention invalid alert rule / relative time range `From: 0s, To: 0s`
Fix:
1) Identify failing rule file:
- `docker logs --tail=200 grafana`
2) Fix provisioning yaml (example path used on NODE1):
- `/opt/microdao-daarion/monitoring/grafana/provisioning/alerting/alerts.yml`
- Ensure rule has valid `relativeTimeRange`
3) Restart grafana:
- `docker restart grafana`
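Before restarting Grafana it helps to confirm that no rule still has a zero time range. The scan below is a rough sketch: it assumes block-style YAML with `from:` and `to:` on their own lines under `relativeTimeRange:` (it does not parse YAML and will miss flow-style `{from: 0, to: 0}`), and `find_zero_time_ranges` is a hypothetical name.

```shell
# Hypothetical check: print every relativeTimeRange block in a provisioning
# file whose from and to are both 0.
find_zero_time_ranges() {
  awk '
    /relativeTimeRange:/ { in_range = 1; from = ""; to = ""; start = NR; next }
    in_range && $1 == "from:" { from = $2 }
    in_range && $1 == "to:"   { to = $2 }
    in_range && from != "" && to != "" {
      if (from + 0 == 0 && to + 0 == 0)
        printf "line %d: relativeTimeRange from=%s to=%s\n", start, from, to
      in_range = 0
    }
  ' "$1"
}
```

Usage: `find_zero_time_ranges /opt/microdao-daarion/monitoring/grafana/provisioning/alerting/alerts.yml`. No output means no zero ranges were found in the form the scan understands.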
## Post-recovery verification checklist
1) Core health:
- `docker compose -f docker-compose.node1.yml ps | grep -E "Up|healthy"`
2) Router reachable from gateway:
- `docker exec -it dagi-gateway-node1 getent hosts router`
3) NATS OK:
- `docker logs --tail=80 dagi-nats-node1 2>&1 | grep -i "JetStream\|Server is ready"`
4) Worker subscribed:
- `docker logs --tail=120 crewai-nats-worker 2>&1 | grep -E "Subscribed|Subscription OK|NotFoundError" || true`
5) GREENFOOD policy sanity:
- advertisement post → ignored
- direct question → reply of at most 3 sentences
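For checklist item 4, the grep output can contain both an old `NotFoundError` and a later successful subscription, so only the most recent matching line is decisive. The classifier below is a sketch that assumes the worker logs contain the strings used in the grep above; `worker_subscription_state` is a hypothetical name.

```shell
# Hypothetical classifier: read worker logs on stdin and report the state
# implied by the LAST subscription-related line.
worker_subscription_state() {
  last=$(grep -E 'Subscribed|Subscription OK|NotFoundError' | tail -n 1)
  case "$last" in
    *NotFoundError*) echo "failed" ;;
    "")              echo "unknown" ;;
    *)               echo "ok" ;;
  esac
}
```

Usage: `docker logs --tail=120 crewai-nats-worker 2>&1 | worker_subscription_state`. A result of `unknown` means none of the expected strings appeared in the inspected tail.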
## Known configuration anchors (update when changed)
- GREENFOOD trading group: `t.me/+SPm1OV-pDJZhZGFi`
- ROUTER_URL used by gateway: `http://router:8000` (must resolve inside node1 network)
- NATS_URL: `nats://nats:4222`
- JetStream Stream: `STREAM_AGENT_RUN` (`agent.run.>`)
- Grafana alerts provisioning file: `monitoring/grafana/provisioning/alerting/alerts.yml`
## Appendix: common commands
- `docker compose -f docker-compose.node1.yml ps`
- `docker compose -f docker-compose.node1.yml logs -f <service>`
- `docker restart <container>`
- `docker compose -f docker-compose.node1.yml up -d --build --force-recreate <service>`
- `docker system df`