# 🏗️ NODA1 Production Stack
**Version:** 2.2
**Last Updated:** 2026-02-11
**Status:** Production (drift-controlled) ✅
## 🔎 Current Reality (2026-02-11)
- Deploy root: `/opt/microdao-daarion` (single runtime root)
- Drift control: `/opt/microdao-daarion/ops/drift-check.sh` → expected `DRIFT_CHECK: OK`
- Gateway: `agents_count=13` (user-facing)
- Router: 15 active agents (13 user-facing + 2 internal)
- Internal routing defaults:
  - `monitor` → local (`swapper+ollama`, `qwen3-8b`)
  - `devtools` → local (`swapper+ollama`, `qwen3-8b`) + conditional cloud fallback for heavy task types
- Memory service: `/health` and `/stats` return `200`
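
The drift check listed above can be wrapped for cron or CI use. The following is a minimal Python sketch, assuming the script prints its verdict as plain text; the function names `drift_ok` and `run_drift_check` are illustrative and not part of the repository.

```python
import subprocess

DRIFT_SCRIPT = "/opt/microdao-daarion/ops/drift-check.sh"
EXPECTED_MARKER = "DRIFT_CHECK: OK"


def drift_ok(output: str) -> bool:
    """Return True when the expected marker appears on any output line."""
    return any(EXPECTED_MARKER in line for line in output.splitlines())


def run_drift_check(script: str = DRIFT_SCRIPT) -> bool:
    """Run the drift-check script and report whether it printed the OK marker.

    Assumption: the verdict goes to stdout; stderr is scanned too in case
    the script logs there instead.
    """
    result = subprocess.run([script], capture_output=True, text=True, check=False)
    return drift_ok(result.stdout + "\n" + result.stderr)
```

On the node, `run_drift_check()` returns `True` only when the marker is present, so a scheduler can treat any other result as drift or as a failed check.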
## 📍 Node Information
- **Hostname:** node1-daarion
- **IP Address:** 144.76.224.179
- **IPv6:** 2a01:4f8:201:2a6::2
- **Location:** Hetzner Cloud (Germany)
- **Role:** Production Router + Gateway + All Services
- **Uptime Target:** 24/7
- **SSH:** `ssh root@144.76.224.179`
## 🖥️ Hardware
- **CPU:** Available cores (view with `nproc`)
- **RAM:** 62GB
- **Disk:** 1.7TB (~1.3TB available)
- **GPU:** NVIDIA RTX 4000 SFF Ada Generation (20GB VRAM)
## 🐳 Docker Services (27+ active)
### Core Services (✅ All Healthy)
| Service | Port | Container | Health |
|---------|------|-----------|--------|
| Router | 9102 | dagi-router-node1 | ✅ |
| Gateway | 9300 | dagi-gateway | ✅ |
| Memory Service | 8000 | dagi-memory-service-node1 | ✅ |
| RAG Service | 9500 | rag-service-node1 | ✅ |
| Swapper Service | 8890-8891 | swapper-service-node1 | ✅ |
| Vision Encoder | 8001 | dagi-vision-encoder-node1 | ✅ |
### Databases (✅ All Healthy)
| Service | Port | Container | Health |
|---------|------|-----------|--------|
| PostgreSQL | 5432 | dagi-postgres | ✅ |
| Qdrant | 6333-6334 | dagi-qdrant-node1 | ✅ |
| Redis | 6379 | dagi-redis-node1 | ✅ |
| Neo4j | 7474, 7687 | dagi-neo4j-node1 | ✅ |
### Supporting Services
| Service | Port | Container | Health |
|---------|------|-----------|--------|
| NATS | 4222 | dagi-nats-node1 | ✅ |
| MinIO | 9000-9001 | dagi-minio-node1 | ✅ |
| Crawl4AI | 11235 | dagi-crawl4ai-node1 | ✅ |
| Parser Pipeline | 8101 | parser-pipeline | ✅ |
| Ingest Service | 8100 | ingest-service | ✅ |
### AI/ML Services
| Service | Port | Container | Status |
|---------|------|-----------|--------|
| CrewAI | - | dagi-crewai-node1 | ✅ |
| CrewAI NATS Worker | 9011 | crewai-nats-worker | ✅ |
### Artifact Services
| Service | Port | Container | Status |
|---------|------|-----------|--------|
| Artifact Registry | 9220 | artifact-registry-node1 | ✅ |
| Brand Registry | 9210 | brand-registry-node1 | ✅ |
| Brand Intake | 9211 | brand-intake-node1 | ✅ |
| Presentation Renderer | 9212 | presentation-renderer-node1 | ✅ |
### Monitoring (✅ All Healthy)
| Service | Port | Container | Health |
|---------|------|-----------|--------|
| Prometheus | 9090 | prometheus | ✅ |
| Grafana | 3030 | grafana | ✅ |
## 🤖 Telegram Bots (13 user-facing)
The production gateway currently serves these user-facing agents:
`daarwizz`, `helion`, `alateya`, `druid`, `nutra`, `agromatrix`, `greenfood`, `clan`, `eonarch`, `yaromir`, `soul`, `senpai`, `sofiia`.
Quick check:
```bash
curl -sS http://localhost:9300/health
```
## 📊 Health Check Endpoints
```bash
# All services quick check
curl http://localhost:9102/health # Router
curl http://localhost:9300/health # Gateway
curl http://localhost:8000/health # Memory Service
curl http://localhost:9500/health # RAG
curl http://localhost:8890/health # Swapper
curl http://localhost:6333/healthz # Qdrant
curl http://localhost:8001/health # Vision Encoder
curl http://localhost:8101/health # Parser Pipeline
curl http://localhost:9090/-/healthy # Prometheus
curl http://localhost:3030/api/health # Grafana
```
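
The curl commands above can also be run as one pass that reports which services answer. This is a minimal Python sketch using only the standard library; the endpoint table is copied from the commands above, while the helper names (`http_status`, `check_all`, `format_report`) and the rule "healthy means HTTP 200" are assumptions made for illustration.

```python
import http.client
import urllib.error
import urllib.request

# Endpoints copied from the curl commands above.
HEALTH_ENDPOINTS = {
    "Router": "http://localhost:9102/health",
    "Gateway": "http://localhost:9300/health",
    "Memory Service": "http://localhost:8000/health",
    "RAG": "http://localhost:9500/health",
    "Swapper": "http://localhost:8890/health",
    "Qdrant": "http://localhost:6333/healthz",
    "Vision Encoder": "http://localhost:8001/health",
    "Parser Pipeline": "http://localhost:8101/health",
    "Prometheus": "http://localhost:9090/-/healthy",
    "Grafana": "http://localhost:3030/api/health",
}


def http_status(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status code for url, or 0 if no response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        return 0  # connection refused, timeout, malformed reply, ...


def check_all(endpoints=HEALTH_ENDPOINTS, fetch=http_status) -> dict:
    """Map each service name to True when its endpoint answers with 200."""
    return {name: fetch(url) == 200 for name, url in endpoints.items()}


def format_report(results: dict) -> str:
    """Render one 'name: OK' or 'name: FAIL' line per service."""
    return "\n".join(f"{name}: {'OK' if ok else 'FAIL'}" for name, ok in results.items())
```

Running `print(format_report(check_all()))` on the node prints one line per service. The `fetch` parameter exists so the check can be exercised without network access.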
## 🔧 Common Operations
### View all services
```bash
docker ps --format "table {{.Names}}\t{{.Status}}"
```
### Restart a service
```bash
docker restart <container-name>
```
### View logs
```bash
docker logs <container-name> --tail 50 -f
```
### System status
```bash
nvidia-smi # GPU status
df -h # Disk usage
free -h # Memory usage
uptime # System uptime
```
## 💾 Backups
### PostgreSQL (Auto)
- **Location:** `/opt/backups/postgres/`
- **Schedule:** Every 6 hours (3:00, 9:00, 15:00, 21:00)
- **Retention:** daily backups for 7 days, weekly for 4 weeks, monthly for 6 months
- **Container:** postgres-backup-node1
### Qdrant (Manual)
```bash
# Create snapshot
curl -X POST "http://localhost:6333/snapshots"
# List snapshots
curl "http://localhost:6333/snapshots"
```
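
The same two snapshot calls can be scripted. The sketch below is a hedged Python equivalent, assuming the listing response carries a `result` array of objects with a `name` field and that names follow the `full-snapshot-YYYY-MM-DD-HH-MM-SS.snapshot` pattern seen elsewhere in this README; the function names are illustrative.

```python
import json
import urllib.request

QDRANT_URL = "http://localhost:6333"


def create_snapshot(base_url: str = QDRANT_URL) -> dict:
    """POST /snapshots, the same call as the first curl command above."""
    request = urllib.request.Request(f"{base_url}/snapshots", method="POST")
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.load(response)


def list_snapshots(base_url: str = QDRANT_URL) -> dict:
    """GET /snapshots, the same call as the second curl command above."""
    with urllib.request.urlopen(f"{base_url}/snapshots", timeout=30) as response:
        return json.load(response)


def newest_snapshot_name(listing: dict):
    """Return the newest snapshot name from a listing, or None if it is empty.

    Assumption: the timestamp embedded in each name is zero-padded, so
    lexicographic order matches chronological order.
    """
    names = [item["name"] for item in listing.get("result", [])]
    return max(names) if names else None
```

A typical sequence is `create_snapshot()` followed by `newest_snapshot_name(list_snapshots())` to confirm that the new snapshot appears in the listing.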
### Manual backup all
```bash
cd /opt/microdao-daarion
./scripts/backup/backup_all.sh
```
## 🔒 Security Status
- ✅ No suspicious processes
- ✅ No executables in /tmp
- ✅ Firewall configured
- ✅ Daily backups active
- ✅ System load normal (< 1.0)
## ⚙️ Configuration Files
- **Docker Compose:** `docker-compose.node1.yml`
- **Router Config:** `services/router/router_config.yaml`
- **Backup Compose:** `docker-compose.backups.yml`
## 📝 Recent Changes (2026-01-26)
### ✅ Fixed Issues
1. **Memory Service** - Fixed MEMORY_QDRANT_HOST (was `qdrant`, now `dagi-qdrant-node1`)
2. **Qdrant snapshot** created before fix: `full-snapshot-2026-01-26-10-11-31.snapshot`
### ⚠️ Known Issues
- **Control-plane** container port 9200 not published to host (internal only)
- **Image-gen** service not running (use swapper-service instead)
## 🆚 Version History
### v2.1 (2026-01-26)
- Memory Service DNS fix (qdrant → dagi-qdrant-node1)
- Full health check verified
- Documentation updated
### v2.0 (2026-01-22)
- Git repository initialized
- Qdrant healthcheck fixed
- render-pdf-worker disabled
### v1.x (2026-01-10 - 2026-01-19)
- Previous deployment
- Security incidents resolved
## 📞 Support
- **SSH:** root@144.76.224.179
- **Monitoring:** http://144.76.224.179:3030 (Grafana)
- **Metrics:** http://144.76.224.179:9090 (Prometheus)
---
**Maintained by:** NODA1 System
**Last Health Check:** 2026-01-26 11:13 UTC
**Status:** ✅ All systems operational
---
## 🔧 By Design (expected behavior)
### Services without public ports
| Service | Port | Status | Explanation |
|---------|------|--------|-------------|
| **RBAC** | 9200 | Internal only | Port is not published. Reachable only from the docker network. |
| **Image-gen** | 8892 | Not in use | Image generation goes through `swapper-service (8890)`. |
| **Parser** | 9400 | Absent | Replaced by `parser-pipeline (8101)` as the single parsing entry point. |
### Diagnosing internal services
```bash
# RBAC (from inside the docker network)
docker exec dagi-gateway curl -sS http://rbac:9200/health
# Check DNS resolution
docker exec dagi-memory-service-node1 python3 -c "import socket; print(socket.gethostbyname('dagi-qdrant-node1'))"
```
### Normal values
- **Qdrant**: 18 collections, 900+ vectors
- **Memory Service**: 200 OK on `/health` (healthcheck via Python urllib)
- **Load average**: < 2.0 is normal, < 5.0 is acceptable
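
The load-average thresholds listed above (below 2.0 normal, below 5.0 acceptable) can be turned into a quick check. This is a minimal Python sketch. The label `high` for anything at or above 5.0 and the choice of the 1-minute average are assumptions, since the README does not say which window the thresholds apply to.

```python
import os


def classify_load(load: float) -> str:
    """Classify a load average using the thresholds listed above."""
    if load < 2.0:
        return "normal"
    if load < 5.0:
        return "acceptable"
    return "high"  # label chosen for this sketch; not named in the README


def current_load_status() -> str:
    """Classify the 1-minute load average of the local host (Unix only)."""
    one_minute, _five_minute, _fifteen_minute = os.getloadavg()
    return classify_load(one_minute)
```

Calling `current_load_status()` on the node gives a one-word verdict that can be logged next to the output of `uptime`.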
---
## 📊 Prometheus Alerting
### Configured alerts
| Alert | Condition | Severity |
|-------|-----------|----------|
| ServiceDown | `up == 0` for > 2m | critical |
| QdrantCollectionsLow | collections < 10 | warning |
| QdrantVectorsDropped | vectors < 500 | warning |
| HostDiskSpaceLow | free < 15% | warning |
| HostMemoryHigh | usage > 90% | warning |
| HostHighLoad | load15 > 10 | warning |
### Checking rules
```bash
curl -sS http://127.0.0.1:9090/api/v1/rules | python3 -m json.tool | head -50
```
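
Instead of reading the first 50 lines of raw JSON, the rules response can be reduced to a list of alert names and states. The sketch below is a hedged Python example that assumes the standard Prometheus `/api/v1/rules` response layout (`data.groups[].rules[]` with `type`, `name` and `state` fields); the function names are illustrative.

```python
import json
import urllib.request

RULES_URL = "http://127.0.0.1:9090/api/v1/rules"


def fetch_rules(url: str = RULES_URL) -> dict:
    """Download the rules payload from the local Prometheus instance."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


def alerting_rules(payload: dict) -> list:
    """Return (group, alert name, state) for every alerting rule in the payload."""
    rows = []
    for group in payload.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            # Recording rules share the same endpoint; skip them.
            if rule.get("type") == "alerting":
                rows.append((group.get("name"), rule.get("name"), rule.get("state")))
    return rows
```

`alerting_rules(fetch_rules())` should then list the configured alerts such as `ServiceDown` and `HostDiskSpaceLow`, which makes a missing rule easy to spot.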
---
## 🔄 Qdrant Backup/Restore
### Snapshots
- **Location**: `/opt/backups/qdrant/` and via the API
- **Retention**: automatic daily snapshots
- **Latest**: `full-snapshot-2026-01-26-10-11-31.snapshot` (1.2GB)
### Restore Drill (verified 2026-01-26)
```bash
# Restore successfully tested on a separate port, 16333
# helion_messages: 365 points restored and verified by search
```
---
**Last Updated:** 2026-01-26 11:40
---
## Behavior Policy v2.1 CHANGELOG
**Date:** 2026-02-07
**Version:** Behavior Policy v2.1 / Global System Prompt v2.1
### Architecture (Source of Truth)
| Layer | Component | Location |
|-------|-----------|----------|
| Policy document | Global System Prompt v2.1 | prompts/global_system_prompt_v2.md |
| Gateway (source of truth) | detect_url + detect_explicit_request | gateway-bot/behavior_policy.py |
| Decision layer | behavior_policy.py v2.1 | gateway-bot/behavior_policy.py |
| HTTP API (gateway) | http_api.py | gateway-bot/http_api.py |
| PromptBuilder | N/A (runtime_context injected at gateway) | services/router/prompt_builder.py |
| Tests | 39 tests | tests/test_behavior_policy.py |
| Runbook | Behavior Policy v2.1 | runbooks/behavior-policy-v2.1.md |
As of v2.1, runtime_context injection happens in gateway (http_api.py), not PromptBuilder.
### Breaking Changes (from v1.1)
- Bare @mention in public/topic WITHOUT `has_explicit_request` → `NO_OUTPUT`
- Gateway computes `has_link` and `has_explicit_request` (`behavior_policy` does NOT override)
- `thread_has_agent_participation` is now REQUIRED (fallback: `false`)
- `has_explicit_request` contract: imperative OR (`?` AND (dm OR reply OR mention OR thread))
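
The contract in the last bullet can be read as a small boolean function. The following Python sketch is an illustration only and is not the code in `gateway-bot/behavior_policy.py`. The `MessageSignals` container and its field names are hypothetical, "thread" is interpreted as `thread_has_agent_participation`, and "public/topic" is approximated as "not a DM".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MessageSignals:
    """Hypothetical container for the signals named in the contract."""
    is_imperative: bool = False
    has_question_mark: bool = False
    is_dm: bool = False
    is_reply_to_agent: bool = False
    mentions_agent: bool = False
    thread_has_agent_participation: bool = False  # fallback: False


def has_explicit_request(s: MessageSignals) -> bool:
    """imperative OR (? AND (dm OR reply OR mention OR thread))."""
    addressed = (
        s.is_dm
        or s.is_reply_to_agent
        or s.mentions_agent
        or s.thread_has_agent_participation
    )
    return s.is_imperative or (s.has_question_mark and addressed)


def bare_mention_is_silent(s: MessageSignals) -> bool:
    """True when a non-DM @mention carries no explicit request (NO_OUTPUT)."""
    return s.mentions_agent and not s.is_dm and not has_explicit_request(s)
```

Under these assumptions, a message that only tags the agent in a public chat yields `bare_mention_is_silent(...) == True`, while the same mention with a question mark counts as an explicit request.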