# 🏗️ NODA1 Production Stack

**Version:** 2.2
**Last Updated:** 2026-02-11
**Status:** Production (drift-controlled) ✅

## 🔎 Current Reality (2026-02-11)

- Deploy root: `/opt/microdao-daarion` (single runtime root)
- Drift control: `/opt/microdao-daarion/ops/drift-check.sh` → expected `DRIFT_CHECK: OK`
- Gateway: `agents_count=13` (user-facing)
- Router: 15 active agents (13 user-facing + 2 internal)
- Internal routing defaults:
  - `monitor` → local (`swapper+ollama`, `qwen3-8b`)
  - `devtools` → local (`swapper+ollama`, `qwen3-8b`) + conditional cloud fallback for heavy task types
- Memory service: `/health` and `/stats` return `200`

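The drift-control expectation above can be wrapped in a small guard for deploy scripts. This is a minimal sketch that assumes only what this section states: on success the script prints a line containing `DRIFT_CHECK: OK`. The helper name `drift_ok` is hypothetical, and nothing is assumed about the script's other output or its exit codes.

```bash
# drift_ok: read drift-check output on stdin and succeed only if it
# contains the expected marker. (Hypothetical helper, not part of the repo.)
drift_ok() {
  grep -Fq 'DRIFT_CHECK: OK'
}

# Example:
#   /opt/microdao-daarion/ops/drift-check.sh | drift_ok || echo "drift detected" >&2
```
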
## 📍 Node Information

- **Hostname:** node1-daarion
- **IP Address:** 144.76.224.179
- **IPv6:** 2a01:4f8:201:2a6::2
- **Location:** Hetzner Cloud (Germany)
- **Role:** Production Router + Gateway + All Services
- **Uptime Target:** 24/7
- **SSH:** `ssh root@144.76.224.179`

## 🖥️ Hardware

- **CPU:** core count not recorded here (check with `nproc`)
- **RAM:** 62GB
- **Disk:** 1.7TB (~1.3TB available)
- **GPU:** NVIDIA RTX 4000 SFF Ada Generation (20GB VRAM)

## 🐳 Docker Services (27+ active)

### Core Services (✅ All Healthy)
| Service | Port | Container | Health |
|---------|------|-----------|--------|
| Router | 9102 | dagi-router-node1 | ✅ |
| Gateway | 9300 | dagi-gateway | ✅ |
| Memory Service | 8000 | dagi-memory-service-node1 | ✅ |
| RAG Service | 9500 | rag-service-node1 | ✅ |
| Swapper Service | 8890-8891 | swapper-service-node1 | ✅ |
| Vision Encoder | 8001 | dagi-vision-encoder-node1 | ✅ |

### Databases (✅ All Healthy)
| Service | Port | Container | Health |
|---------|------|-----------|--------|
| PostgreSQL | 5432 | dagi-postgres | ✅ |
| Qdrant | 6333-6334 | dagi-qdrant-node1 | ✅ |
| Redis | 6379 | dagi-redis-node1 | ✅ |
| Neo4j | 7474, 7687 | dagi-neo4j-node1 | ✅ |

### Supporting Services
| Service | Port | Container | Health |
|---------|------|-----------|--------|
| NATS | 4222 | dagi-nats-node1 | ✅ |
| MinIO | 9000-9001 | dagi-minio-node1 | ✅ |
| Crawl4AI | 11235 | dagi-crawl4ai-node1 | ✅ |
| Parser Pipeline | 8101 | parser-pipeline | ✅ |
| Ingest Service | 8100 | ingest-service | ✅ |

### AI/ML Services
| Service | Port | Container | Status |
|---------|------|-----------|--------|
| CrewAI | - | dagi-crewai-node1 | ✅ |
| CrewAI NATS Worker | 9011 | crewai-nats-worker | ✅ |

### Artifact Services
| Service | Port | Container | Status |
|---------|------|-----------|--------|
| Artifact Registry | 9220 | artifact-registry-node1 | ✅ |
| Brand Registry | 9210 | brand-registry-node1 | ✅ |
| Brand Intake | 9211 | brand-intake-node1 | ✅ |
| Presentation Renderer | 9212 | presentation-renderer-node1 | ✅ |

### Monitoring (✅ All Healthy)
| Service | Port | Container | Health |
|---------|------|-----------|--------|
| Prometheus | 9090 | prometheus | ✅ |
| Grafana | 3030 | grafana | ✅ |

## 🤖 Telegram Bots (13 user-facing)

The production gateway currently serves these user-facing agents:
`daarwizz`, `helion`, `alateya`, `druid`, `nutra`, `agromatrix`, `greenfood`, `clan`, `eonarch`, `yaromir`, `soul`, `senpai`, `sofiia`.

Quick check:

```bash
curl -sS http://localhost:9300/health
```

## 📊 Health Check Endpoints

```bash
# All services quick check
curl http://localhost:9102/health  # Router
curl http://localhost:9300/health  # Gateway
curl http://localhost:8000/health  # Memory Service
curl http://localhost:9500/health  # RAG
curl http://localhost:8890/health  # Swapper
curl http://localhost:6333/healthz  # Qdrant
curl http://localhost:8001/health  # Vision Encoder
curl http://localhost:8101/health  # Parser Pipeline
curl http://localhost:9090/-/healthy  # Prometheus
curl http://localhost:3030/api/health  # Grafana
```

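The endpoint list above can also be probed in a loop that prints one status line per service. The following is a rough sketch rather than an existing script in the repository: the function name `check_endpoints` and the `name url` input format are assumptions made for illustration.

```bash
# check_endpoints: read "name url" pairs from stdin, probe each URL, and
# print one OK/FAIL line per service. Returns non-zero if any probe fails.
# (Hypothetical helper; the input format is an illustrative choice.)
check_endpoints() {
  local failed=0 name url
  while read -r name url; do
    [ -z "$name" ] && continue
    # -f: treat HTTP >= 400 as failure; --max-time: do not hang on a dead port.
    # stdin is redirected so curl cannot consume the remaining list entries.
    if curl -fsS --max-time 5 -o /dev/null "$url" </dev/null; then
      echo "OK   $name"
    else
      echo "FAIL $name"
      failed=1
    fi
  done
  return "$failed"
}

# Example:
#   printf '%s\n' \
#     'router http://localhost:9102/health' \
#     'gateway http://localhost:9300/health' \
#     'qdrant http://localhost:6333/healthz' | check_endpoints
```

The non-zero return value makes the function usable as a gate in deploy or cron scripts.
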
## 🔧 Common Operations

### View all services
```bash
docker ps --format "table {{.Names}}\t{{.Status}}"
```

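To narrow the listing above to problem containers, `docker ps` can filter on health state. This is a small sketch with a hypothetical function name (`unhealthy_containers`); note that it only sees containers that define a Docker healthcheck, so a container without one is never reported.

```bash
# unhealthy_containers: print the names of running containers whose Docker
# healthcheck reports "unhealthy". Succeeds only when there are none.
unhealthy_containers() {
  local names
  names=$(docker ps --filter health=unhealthy --format '{{.Names}}')
  if [ -n "$names" ]; then
    printf '%s\n' "$names"
    return 1
  fi
}

# Example:
#   unhealthy_containers || echo "some containers are unhealthy" >&2
```
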
### Restart a service
```bash
docker restart <container-name>
```

### View logs
```bash
docker logs <container-name> --tail 50 -f
```

### System status
```bash
nvidia-smi  # GPU status
df -h       # Disk usage
free -h     # Memory usage
uptime      # System uptime
```

## 💾 Backups

### PostgreSQL (Auto)
- **Location:** `/opt/backups/postgres/`
- **Schedule:** Every 6 hours (03:00, 09:00, 15:00, 21:00)
- **Retention:** daily backups kept for 7 days, weekly for 4 weeks, monthly for 6 months
- **Container:** postgres-backup-node1

### Qdrant (Manual)
```bash
# Create snapshot
curl -X POST "http://localhost:6333/snapshots"

# List snapshots
curl "http://localhost:6333/snapshots"
```

### Manual backup all
```bash
cd /opt/microdao-daarion
./scripts/backup/backup_all.sh
```

## 🔒 Security Status

- ✅ No suspicious processes
- ✅ No executables in `/tmp`
- ✅ Firewall configured
- ✅ Daily backups active
- ✅ System load normal (< 1.0)

## ⚙️ Configuration Files

- **Docker Compose:** `docker-compose.node1.yml`
- **Router Config:** `services/router/router_config.yaml`
- **Backup Compose:** `docker-compose.backups.yml`

## 📝 Recent Changes (2026-01-26)

### ✅ Fixed Issues
1. **Memory Service** - Fixed `MEMORY_QDRANT_HOST` (was `qdrant`, now `dagi-qdrant-node1`)
2. **Qdrant snapshot** created before the fix: `full-snapshot-2026-01-26-10-11-31.snapshot`

### ⚠️ Known Issues
- **Control-plane** container port 9200 is not published to the host (internal only)
- **Image-gen** service is not running (use swapper-service instead)

## 🆚 Version History

### v2.1 (2026-01-26)
- Memory Service DNS fix (qdrant → dagi-qdrant-node1)
- Full health check verified
- Documentation updated

### v2.0 (2026-01-22)
- Git repository initialized
- Qdrant healthcheck fixed
- render-pdf-worker disabled

### v1.x (2026-01-10 - 2026-01-19)
- Previous deployment
- Security incidents resolved

## 📞 Support

- **SSH:** root@144.76.224.179
- **Monitoring:** http://144.76.224.179:3030 (Grafana)
- **Metrics:** http://144.76.224.179:9090 (Prometheus)

---

**Maintained by:** NODA1 System
**Last Health Check:** 2026-01-26 11:13 UTC
**Status:** ✅ All systems operational

---

## 🔧 By Design (expected behavior)

### Services without public ports

| Service | Port | Status | Notes |
|---------|------|--------|-------|
| **RBAC** | 9200 | Internal only | Port is not published. Reachable only from inside the Docker network. |
| **Image-gen** | 8892 | Not used | Image generation goes through `swapper-service (8890)`. |
| **Parser** | 9400 | Absent | Replaced by `parser-pipeline (8101)` as the single parsing entry point. |

### Diagnosing internal services

```bash
# RBAC (from inside the Docker network)
docker exec dagi-gateway curl -sS http://rbac:9200/health

# Check DNS resolution
docker exec dagi-memory-service-node1 python3 -c "import socket; print(socket.gethostbyname('dagi-qdrant-node1'))"
```

### Normal values

- **Qdrant**: 18 collections, 900+ vectors
- **Memory Service**: `200 OK` on `/health` (healthcheck via Python urllib)
- **Load average**: < 2.0 is normal, < 5.0 is acceptable

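The load-average bands above can be expressed as a small classifier. This is an illustrative sketch only: the function name `classify_load` and the band labels are hypothetical, and since the list does not say which load window (1, 5, or 15 minutes) the thresholds refer to, the example in the comments uses the 1-minute value as an assumption.

```bash
# classify_load: map a load-average value to the bands listed above.
# The labels "normal", "acceptable", and "high" are illustrative names.
classify_load() {
  awk -v load="$1" 'BEGIN {
    if (load < 2.0)      print "normal"
    else if (load < 5.0) print "acceptable"
    else                 print "high"
  }'
}

# Example (assumes the 1-minute value, the first field of /proc/loadavg):
#   classify_load "$(cut -d' ' -f1 /proc/loadavg)"
```
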
---

## 📊 Prometheus Alerting

### Configured alerts

| Alert | Condition | Severity |
|-------|-----------|----------|
| ServiceDown | `up == 0` for > 2m | critical |
| QdrantCollectionsLow | collections < 10 | warning |
| QdrantVectorsDropped | vectors < 500 | warning |
| HostDiskSpaceLow | free < 15% | warning |
| HostMemoryHigh | usage > 90% | warning |
| HostHighLoad | load15 > 10 | warning |

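As an illustration of how a row of the table above maps onto Prometheus rule syntax, the sketch below writes the `ServiceDown` row out as a rule file. It is not a copy of the deployed rules: the function name `write_service_down_rule` and the group name `node1-core` are placeholders. The other rows are left out because their metric names are not given here.

```bash
# write_service_down_rule: write a Prometheus alerting rule for the
# ServiceDown row (up == 0 for 2m, severity critical) to the given path.
# The group name "node1-core" is a placeholder, not the deployed value.
write_service_down_rule() {
  cat > "$1" <<'EOF'
groups:
  - name: node1-core
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
EOF
}

# Example:
#   write_service_down_rule /tmp/service_down.yml
#   promtool check rules /tmp/service_down.yml   # if promtool is installed
```
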
### Checking rules

```bash
curl -sS http://127.0.0.1:9090/api/v1/rules | python3 -m json.tool | head -50
```

---

## 🔄 Qdrant Backup/Restore

### Snapshots

- **Location**: `/opt/backups/qdrant/` and via the API
- **Retention**: automatic daily snapshots
- **Latest**: `full-snapshot-2026-01-26-10-11-31.snapshot` (1.2GB)

### Restore Drill (verified 2026-01-26)

```bash
# Restore was tested successfully on a separate port, 16333
# helion_messages: 365 points restored and verified with a search
```

---

**Last Updated:** 2026-01-26 11:40

---

## Behavior Policy v2.1 CHANGELOG

**Date:** 2026-02-07
**Version:** Behavior Policy v2.1 / Global System Prompt v2.1

### Architecture (Source of Truth)

| Layer | Component | Location |
|-------|-----------|----------|
| Policy document | Global System Prompt v2.1 | `prompts/global_system_prompt_v2.md` |
| Gateway (source of truth) | `detect_url` + `detect_explicit_request` | `gateway-bot/behavior_policy.py` |
| Decision layer | `behavior_policy.py` v2.1 | `gateway-bot/behavior_policy.py` |
| HTTP API (gateway) | `http_api.py` | `gateway-bot/http_api.py` |
| PromptBuilder | N/A (`runtime_context` injected at gateway) | `services/router/prompt_builder.py` |
| Tests | 39 tests | `tests/test_behavior_policy.py` |
| Runbook | Behavior Policy v2.1 | `runbooks/behavior-policy-v2.1.md` |

As of v2.1, `runtime_context` injection happens in the gateway (`http_api.py`), not in PromptBuilder.

### Breaking Changes (from v1.1)
- Bare @mention in public/topic WITHOUT `has_explicit_request` -> `NO_OUTPUT`
- The gateway computes `has_link` and `has_explicit_request` (`behavior_policy` does NOT override them)
- `thread_has_agent_participation` is now REQUIRED (fallback: `false`)
- `has_explicit_request` contract: `imperative OR (? AND (dm OR reply OR mention OR thread))`
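
The `has_explicit_request` contract above is a plain boolean expression, so it can be checked as a truth table. The sketch below restates it as a shell predicate purely for illustration; the real logic lives in `gateway-bot/behavior_policy.py` and is not reproduced here. Two readings are assumptions: `?` is taken to mean "the message contains a question", and `thread` is taken to mean `thread_has_agent_participation`. How the gateway detects an imperative or a question is outside the sketch, so each input is passed in as `1` or `0`.

```bash
# has_explicit_request: evaluate
#   imperative OR (question AND (dm OR reply OR mention OR thread))
# Arguments are 1 (true) or 0 (false), in this order:
#   imperative question dm reply mention thread
# Illustrative restatement of the contract, not the gateway implementation.
has_explicit_request() {
  local imperative=$1 question=$2 dm=$3 reply=$4 mention=$5 thread=$6
  [ "$imperative" -eq 1 ] && return 0
  [ "$question" -eq 1 ] || return 1
  [ "$dm" -eq 1 ] || [ "$reply" -eq 1 ] || [ "$mention" -eq 1 ] || [ "$thread" -eq 1 ]
}

# Example: a bare @mention (no imperative, no question) is not an explicit request:
#   has_explicit_request 0 0 0 0 1 0 || echo "NO_OUTPUT"
```

Under this reading, a question alone is not enough: it must arrive in a DM, as a reply, with a mention, or in a thread the agent participates in.
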