## Agents Added

- Alateya: R&D, biotech, innovations
- Clan (Spirit): Community spirit agent
- Eonarch: Consciousness evolution agent

## Changes

- docker-compose.node1.yml: Added tokens for all 3 new agents
- gateway-bot/http_api.py: Added configs and webhook endpoints
- gateway-bot/clan_prompt.txt: New prompt file
- gateway-bot/eonarch_prompt.txt: New prompt file

## Fixes

- Fixed ROUTER_URL from :9102 to :8000 (internal container port)
- All 9 Telegram agents now working

## Documentation

- Created PROJECT-MASTER-INDEX.md - single entry point
- Added various status documents and scripts

Tokens configured:

- Helion, NUTRA, Agromatrix (existing)
- Alateya, Clan, Eonarch (new)
- Druid, GreenFood, DAARWIZZ (configured)

151 lines
5.7 KiB
Markdown
# Runbook: NODE1 Recovery & Safety

## Purpose

Quickly restore NODE1 after failures (Telegram webhook 500, router DNS, NATS/worker, Grafana crash-loop) and avoid accidentally stopping the wrong stack.

## Quick links / aliases

- `./stack-node1 ps|up|down|logs` (node1 stack)
- `./stack-staging ps|up|down|logs` (staging stack)
- NODE1 Docker network: `dagi-network` (for `nats-box`)

## Scope (NODE1 stack)

- dagi-gateway-node1 (9300)
- dagi-router-node1 (router API)
- dagi-nats-node1 (4222, JetStream enabled)
- crewai-nats-worker
- dagi-memory-service-node1 (8000)
- dagi-qdrant-node1 (6333)
- dagi-postgres (5432)
- dagi-redis-node1 (6379)
- dagi-neo4j-node1 (7474/7687)
- prometheus (9090)
- grafana
- dagi-crawl4ai-node1 (11235)
- control-plane (9200)
- other node1 services as defined in docker-compose.node1.yml

## Safety rules (DO THIS FIRST)

1) Always set project name for NODE1:
   - `export COMPOSE_PROJECT_NAME=dagi_node1`
2) Always use the correct compose file:
   - `-f docker-compose.node1.yml`
3) Never run `docker compose down` without verifying target:
   - `docker compose -f docker-compose.node1.yml ps`
4) If staging exists, it MUST have a different `COMPOSE_PROJECT_NAME` and networks.

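The rules above can be folded into a small wrapper so that the project name and compose file are never typed by hand. This is a minimal sketch, not a script from the repository: `node1_compose` and the `NODE1_CONFIRM` variable are hypothetical names, and the wrapper is separate from the `./stack-node1` alias listed under Quick links.

```shell
# Hypothetical helper: node1_compose and NODE1_CONFIRM are illustrative names,
# not part of the repository.
# The function body is a subshell "( ... )", so the export and the exit below
# never leak into the calling shell.
node1_compose() (
  # Rules 1 and 2: pin the project name and the compose file.
  export COMPOSE_PROJECT_NAME=dagi_node1
  if [ "${1:-}" = "down" ] && [ "${NODE1_CONFIRM:-}" != "yes" ]; then
    # Rule 3: list the containers that would be stopped, then refuse.
    docker compose -f docker-compose.node1.yml ps
    echo "Refusing 'down': check the list above, then re-run with NODE1_CONFIRM=yes" >&2
    exit 1
  fi
  docker compose -f docker-compose.node1.yml "$@"
)
```

With this in place, `node1_compose ps` behaves like the command in rule 3, and `node1_compose down` only proceeds when invoked as `NODE1_CONFIRM=yes node1_compose down`.
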
## Quick status

- `docker compose -f docker-compose.node1.yml ps`
- `docker compose -f docker-compose.node1.yml logs --tail=80 dagi-gateway-node1 dagi-router-node1 dagi-nats-node1 crewai-nats-worker grafana`

## Standard restart order (most incidents)

1) NATS (foundation)
2) Router (dependency for Gateway routing)
3) Gateway (webhooks)
4) Worker (async jobs)
5) Grafana (observability only)

Commands:

- `docker compose -f docker-compose.node1.yml up -d dagi-nats-node1`
- `docker compose -f docker-compose.node1.yml up -d dagi-router-node1`
- `docker compose -f docker-compose.node1.yml up -d dagi-gateway-node1`
- `docker compose -f docker-compose.node1.yml up -d crewai-nats-worker`
- `docker compose -f docker-compose.node1.yml up -d grafana`

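The five commands above can be run as one loop that stops at the first failure, so a broken NATS or router is noticed before the services that depend on it are started. The function name `start_node1_in_order` is hypothetical; the service names and order are the ones listed above.

```shell
# Hypothetical helper: bring the NODE1 services up in dependency order.
start_node1_in_order() {
  for svc in dagi-nats-node1 dagi-router-node1 dagi-gateway-node1 crewai-nats-worker grafana; do
    echo "starting $svc"
    # Stop at the first failure instead of starting dependents on a broken base.
    docker compose -f docker-compose.node1.yml up -d "$svc" || {
      echo "failed to start $svc; stopping here" >&2
      return 1
    }
  done
}
```

Note that `up -d` returning successfully only means the container was started; a crash-loop still has to be caught with the status and log commands from this runbook.
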
## Incident playbooks

### A) Telegram webhook returns 500 (e.g. /greenfood/telegram/webhook)

Symptoms:

- 500 responses from gateway
- gateway logs show router request failures

Check:

- `docker logs --tail=200 dagi-gateway-node1 2>&1 | grep -E "webhook|Router request failed|GREENFOOD"` (`2>&1` is needed because `docker logs` replays the container's stderr on stderr, which a plain pipe does not pass to `grep`)
- `docker compose -f docker-compose.node1.yml ps | grep -E "dagi-gateway-node1|dagi-router-node1"`

Fix:

1) Ensure router is healthy:
   - `docker logs --tail=120 dagi-router-node1`
   - `docker inspect --format '{{json .State.Health}}' dagi-router-node1`
2) Ensure gateway can resolve router (Docker DNS):
   - `docker exec -it dagi-gateway-node1 getent hosts router || true`
3) Restart router + gateway:
   - `docker restart dagi-router-node1`
   - `docker restart dagi-gateway-node1`

Root cause examples:

- router container crash-loop → DNS name `router` not resolvable
- ROUTER_URL points to a non-existent host/service in the node1 network

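Fix steps 1 and 2 can be combined into a single non-interactive check, which is handy when the diagnosis runs from a script or cron job. This is a sketch under assumptions: `diagnose_webhook_500` is a hypothetical name, the router container is assumed to define a healthcheck, and `-it` is dropped from `docker exec` because allocating a TTY fails when no terminal is attached.

```shell
# Hypothetical helper: summarize router health and gateway-side DNS in one line.
diagnose_webhook_500() {
  # .State.Health is absent when the container defines no healthcheck,
  # so the template falls back to a fixed marker in that case.
  status=$(docker inspect \
    --format '{{if .State.Health}}{{.State.Health.Status}}{{else}}no-healthcheck{{end}}' \
    dagi-router-node1 2>/dev/null) || status="missing"

  # Same lookup as fix step 2, without -it so it also works without a terminal.
  if docker exec dagi-gateway-node1 getent hosts router >/dev/null 2>&1; then
    dns="ok"
  else
    dns="unresolved"
  fi

  echo "router_health=$status dns=$dns"
  # Succeed only when both checks pass.
  [ "$status" = "healthy" ] && [ "$dns" = "ok" ]
}
```

A result of `dns=unresolved` together with `router_health=missing` points at the first root cause listed above, while `dns=unresolved` with a healthy router suggests checking `ROUTER_URL` and the network the router is attached to.
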
### B) Router crash-loop on startup (Pydantic / config errors)

Symptoms:

- router restarting
- traceback in `docker logs dagi-router-node1`

Fix:

1) Read the first error in logs:
   - `docker logs --tail=200 dagi-router-node1`
2) Hotfix, then rebuild/recreate if needed:
   - apply the code fix (a previous example: `temperature: float = 0.2`)
   - `docker compose -f docker-compose.node1.yml up -d --build --force-recreate dagi-router-node1`

### C) NATS worker shows Subscription failed / NotFoundError

Symptoms:

- worker logs mention `NotFoundError`
- worker cannot subscribe / consume tasks

Check:

- `docker logs --tail=200 crewai-nats-worker`
- `docker logs --tail=200 dagi-nats-node1 2>&1 | grep -i jetstream`

Fix (JetStream):

1) Ensure JetStream is enabled (NATS started with `-js`).
2) Ensure the required stream exists (example used on NODE1):
   - Stream: `STREAM_AGENT_RUN`
   - Subjects: `agent.run.>`
3) Use nats-box inside the node1 network (`dagi-network`, see Quick links):
   - `docker run --rm -it --network <NODE1_NETWORK> natsio/nats-box:latest sh`
   - create the stream/consumer as required by the worker subjects
4) Restart worker:
   - `docker restart crewai-nats-worker`

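Step 3 can also be done without an interactive shell by running the `nats` CLI that ships in the nats-box image. The sketch below only covers the stream, not a consumer, and rests on assumptions: the helper names are hypothetical, the server address is taken from `NATS_URL` in the configuration anchors, file storage is a guess, and `--defaults` accepts the CLI defaults for every other setting. Check these against what the worker actually expects before relying on it.

```shell
# Assumptions: network name from "Quick links", server address from NATS_URL.
NODE1_NETWORK="${NODE1_NETWORK:-dagi-network}"
NATS_SERVER="${NATS_SERVER:-nats://nats:4222}"

# Run one nats CLI command from a throwaway nats-box container.
nats_box() {
  docker run --rm --network "$NODE1_NETWORK" natsio/nats-box:latest \
    nats -s "$NATS_SERVER" "$@"
}

# Create STREAM_AGENT_RUN only when it does not exist yet.
ensure_agent_run_stream() {
  if nats_box stream info STREAM_AGENT_RUN >/dev/null 2>&1; then
    echo "stream STREAM_AGENT_RUN already exists"
  else
    # File storage is an assumption; every other setting uses the CLI defaults.
    nats_box stream add STREAM_AGENT_RUN --subjects 'agent.run.>' --storage file --defaults
  fi
}
```

Running `ensure_agent_run_stream` twice is safe: the second call only reports that the stream exists. Restart the worker afterwards as in step 4.
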
### D) Grafana crash-loop due to provisioning alert rule

Symptoms:

- grafana restarting
- logs mention invalid alert rule / relative time range `From: 0s, To: 0s`

Fix:

1) Identify failing rule file:
   - `docker logs --tail=200 grafana`
2) Fix provisioning yaml (example path used on NODE1):
   - `/opt/microdao-daarion/monitoring/grafana/provisioning/alerting/alerts.yml`
   - Ensure rule has valid `relativeTimeRange`
3) Restart grafana:
   - `docker restart grafana`

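To find the offending rule in step 2 quickly, it can help to list every `relativeTimeRange` in the provisioning file together with the values under it. The helper below is a hypothetical sketch and assumes block-style YAML, where the `from:` and `to:` keys sit on the two lines directly after `relativeTimeRange:`; a file written in inline style would need a different pattern.

```shell
# Hypothetical helper: print each relativeTimeRange key plus the two lines
# after it, prefixed with line numbers, for the given provisioning file.
show_time_ranges() {
  grep -n -A2 'relativeTimeRange' "$1"
}
```

For example, `show_time_ranges /opt/microdao-daarion/monitoring/grafana/provisioning/alerting/alerts.yml` lists the ranges with their line numbers; an entry with both values at 0 matches the `From: 0s, To: 0s` symptom above.
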
## Post-recovery verification checklist

1) Core health:
   - `docker compose -f docker-compose.node1.yml ps | grep -E "Up|healthy"`
2) Router reachable from gateway:
   - `docker exec -it dagi-gateway-node1 getent hosts router`
3) NATS OK:
   - `docker logs --tail=80 dagi-nats-node1 2>&1 | grep -iE "JetStream|Server is ready"`
4) Worker subscribed:
   - `docker logs --tail=120 crewai-nats-worker 2>&1 | grep -E "Subscribed|Subscription OK|NotFoundError" || true`
5) GREENFOOD policy sanity:
   - advertising post → ignored
   - direct question → reply of at most 3 sentences

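Checks 2 to 4 of the checklist are mechanical and can be run as one pass/fail report. The sketch below uses hypothetical helper names and the log patterns quoted in the checklist; whether the worker really logs `Subscribed` or `Subscription OK` on success is taken from the checklist, not verified here. Checks 1 and 5 still need a human look.

```shell
# Hypothetical helpers; the log patterns are the ones quoted in the checklist.
# check NAME CMD...: run CMD quietly and report PASS or FAIL for NAME.
check() {
  name=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS $name"
  else
    echo "FAIL $name"
    failures=$((failures + 1))
  fi
}

router_resolvable() { docker exec dagi-gateway-node1 getent hosts router; }
# docker logs replays the container's stderr on stderr, hence 2>&1 before the pipe.
nats_ready() { docker logs --tail=80 dagi-nats-node1 2>&1 | grep -qiE 'JetStream|Server is ready'; }
worker_subscribed() { docker logs --tail=120 crewai-nats-worker 2>&1 | grep -qE 'Subscribed|Subscription OK'; }

verify_node1() {
  failures=0
  check "router resolvable from gateway" router_resolvable
  check "nats ready" nats_ready
  check "worker subscribed" worker_subscribed
  # Exit status is zero only when every check passed.
  [ "$failures" -eq 0 ]
}
```

`verify_node1` prints one line per check and returns a non-zero status if any of them failed, so it can gate a follow-up step such as closing the incident.
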
## Known configuration anchors (update when changed)

- GREENFOOD trading group: `t.me/+SPm1OV-pDJZhZGFi`
- ROUTER_URL used by gateway: `http://router:8000` (must resolve inside node1 network)
- NATS_URL: `nats://nats:4222`
- JetStream Stream: `STREAM_AGENT_RUN` (`agent.run.>`)
- Grafana alerts provisioning file: `monitoring/grafana/provisioning/alerting/alerts.yml`

## Appendix: common commands

- `docker compose -f docker-compose.node1.yml ps`
- `docker compose -f docker-compose.node1.yml logs -f <service>`
- `docker restart <container>`
- `docker compose -f docker-compose.node1.yml up -d --build --force-recreate <service>`
- `docker system df`