snapshot: NODE1 production state 2026-02-09

Complete snapshot of /opt/microdao-daarion/ from NODE1 (144.76.224.179).
This represents the actual running production code that has diverged
significantly from the previous main branch.

Key changes from old main:
- Gateway (http_api.py): expanded from ~40KB to 164KB with full agent support
- Router: new /v1/agents/{id}/infer endpoint with vision + DeepSeek routing
- Behavior Policy: SOWA v2.2 (3-level: FULL/ACK/SILENT)
- Agent Registry: config/agent_registry.yml as single source of truth
- 13 agents configured (was 3)
- Memory service integration
- CrewAI teams and roles

Excluded from snapshot: venv/, .env, data/, backups, .tgz archives

Co-authored-by: Cursor <cursoragent@cursor.com>
188
docs/MULTINODE_ARCHITECTURE.md
Normal file
@@ -0,0 +1,188 @@
# DAARION Multi-Node Architecture

## Current State: Single Node (NODE1)

```
NODE1 (144.76.224.179) - Hetzner GEX44
├── RTX 4000 SFF Ada (20GB VRAM)
├── Services:
│   ├── Gateway         :9300
│   ├── Router          :9102
│   ├── Swapper         :8890 (GPU)
│   ├── Memory Service  :8000
│   ├── CrewAI          :9010
│   ├── CrewAI Worker   :9011
│   ├── Ingest          :8100
│   ├── Parser          :8101
│   ├── Prometheus      :9090
│   └── Grafana         :3030
├── Data:
│   ├── PostgreSQL      :5432
│   ├── Qdrant          :6333
│   ├── Neo4j           :7687
│   └── Redis           :6379
└── Messaging:
    └── NATS JetStream  :4222
```
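
The port inventory above can double as a target list for a quick smoke test. The following is a minimal sketch, assuming each service accepts plain TCP connections on its listed port; `NODE1_PORTS`, `probe`, and `report` are illustrative names and not part of the repository.

```python
import socket

# Ports copied from the NODE1 tree above. The dict name is hypothetical.
NODE1_PORTS = {
    "gateway": 9300, "router": 9102, "swapper": 8890, "memory-service": 8000,
    "crewai": 9010, "crewai-worker": 9011, "ingest": 8100, "parser": 8101,
    "prometheus": 9090, "grafana": 3030, "postgresql": 5432, "qdrant": 6333,
    "neo4j": 7687, "redis": 6379, "nats": 4222,
}


def probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def report(host: str, ports: dict = NODE1_PORTS) -> dict:
    """Probe every listed service once and map service name to reachability."""
    return {name: probe(host, port) for name, port in ports.items()}
```

A TCP connect only shows that something is listening; it says nothing about whether the service behind the port is healthy.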

## Target: Multi-Node Topology

### Edge Router Pattern

```
               ┌─────────────────────┐
               │    Global Entry     │
               │ gateway.daarion.city│
               │    (Cloudflare)     │
               └──────────┬──────────┘
                          │
         ┌────────────────┼───────────────┐
         │                │               │
┌────────▼────────┐ ┌─────▼───────┐ ┌─────▼───────┐
│      NODE1      │ │    NODE2    │ │    NODE3    │
│    (Primary)    │ │  (Replica)  │ │   (Edge)    │
│  Hetzner GEX44  │ │   Hetzner   │ │   Hetzner   │
└────────┬────────┘ └─────┬───────┘ └─────┬───────┘
         │                │               │
┌────────▼────────────────▼───────────────▼────────┐
│                NATS Supercluster                 │
│          (Leafnodes / Mirrored Streams)          │
└──────────────────────────────────────────────────┘
```

### Node Roles

#### NODE1 (Primary)
- GPU workloads (Swapper, Vision, FLUX)
- Primary data stores
- CrewAI orchestration

#### NODE2 (Replica)
- Read replicas (Qdrant, Neo4j)
- Backup Gateway/Router
- Async workers

#### NODE3+ (Edge)
- Regional edge routing
- Local cache (Redis)
- NATS leafnode

### Data Replication Strategy

```yaml
PostgreSQL:
  mode: primary-replica
  primary: NODE1
  replicas: [NODE2]
  sync: async (streaming)

Qdrant:
  mode: sharded
  shards: 2
  replication_factor: 2
  nodes: [NODE1, NODE2]

Neo4j:
  mode: causal-cluster
  core_servers: [NODE1, NODE2]
  read_replicas: [NODE3]

Redis:
  mode: sentinel
  master: NODE1
  replicas: [NODE2]
  sentinels: 3

NATS:
  mode: supercluster
  clusters:
    - name: core
      nodes: [NODE1, NODE2]
    - name: edge
      nodes: [NODE3]
      leafnode_to: core
```
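
A plan like this is easy to break with a typo in a node name or a replication factor that exceeds the node count. The sketch below mirrors part of the YAML as a Python dict and runs two sanity checks. It is an illustration only: `PLAN`, `KNOWN_NODES`, `referenced_nodes`, and `validate` are hypothetical names, and the checks are assumptions about what is worth validating, not rules taken from the project.

```python
# Subset of the replication plan above, transcribed by hand.
PLAN = {
    "PostgreSQL": {"primary": "NODE1", "replicas": ["NODE2"]},
    "Qdrant": {"shards": 2, "replication_factor": 2, "nodes": ["NODE1", "NODE2"]},
    "Neo4j": {"core_servers": ["NODE1", "NODE2"], "read_replicas": ["NODE3"]},
    "Redis": {"master": "NODE1", "replicas": ["NODE2"], "sentinels": 3},
}
KNOWN_NODES = {"NODE1", "NODE2", "NODE3"}


def referenced_nodes(spec: dict) -> set:
    """Collect every node name mentioned in one store's spec."""
    found = set()
    for value in spec.values():
        if isinstance(value, str) and value.startswith("NODE"):
            found.add(value)
        elif isinstance(value, list):
            found.update(v for v in value if isinstance(v, str))
    return found


def validate(plan: dict, known: set) -> list:
    """Return a list of human-readable problems; an empty list means no findings."""
    errors = []
    for store, spec in plan.items():
        unknown = referenced_nodes(spec) - known
        if unknown:
            errors.append(f"{store}: unknown nodes {sorted(unknown)}")
        rf = spec.get("replication_factor")
        if rf is not None and rf > len(spec.get("nodes", [])):
            errors.append(f"{store}: replication_factor {rf} exceeds node count")
    return errors
```

Calling `validate(PLAN, KNOWN_NODES)` on the plan as written returns no findings.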

### Service Distribution

| Service | NODE1 | NODE2 | NODE3 |
|---------|-------|-------|-------|
| Gateway | ✓ | ✓ | ✓ |
| Router | ✓ | ✓ (standby) | - |
| Swapper (GPU) | ✓ | - | - |
| Memory Service | ✓ | ✓ (read) | - |
| PostgreSQL | ✓ (primary) | ✓ (replica) | - |
| Qdrant | ✓ (shard1) | ✓ (shard2) | - |
| Neo4j | ✓ (core) | ✓ (core) | ✓ (read) |
| Redis | ✓ (master) | ✓ (replica) | ✓ (cache) |
| NATS | ✓ (cluster) | ✓ (cluster) | ✓ (leaf) |
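
The table can also be kept as data so that questions such as "what runs on NODE3?" or "which services exist on only one node?" are answered mechanically. This is a sketch under assumptions: `PLACEMENT`, `services_on`, and `single_node_services` are hypothetical names, and the label `"active"` stands in for a bare ✓ in the table.

```python
# Service -> {node: role}. Nodes marked "-" in the table are omitted.
PLACEMENT = {
    "Gateway": {"NODE1": "active", "NODE2": "active", "NODE3": "active"},
    "Router": {"NODE1": "active", "NODE2": "standby"},
    "Swapper (GPU)": {"NODE1": "active"},
    "Memory Service": {"NODE1": "active", "NODE2": "read"},
    "PostgreSQL": {"NODE1": "primary", "NODE2": "replica"},
    "Qdrant": {"NODE1": "shard1", "NODE2": "shard2"},
    "Neo4j": {"NODE1": "core", "NODE2": "core", "NODE3": "read"},
    "Redis": {"NODE1": "master", "NODE2": "replica", "NODE3": "cache"},
    "NATS": {"NODE1": "cluster", "NODE2": "cluster", "NODE3": "leaf"},
}


def services_on(node: str) -> list:
    """Return the sorted names of all services placed on the given node."""
    return sorted(name for name, nodes in PLACEMENT.items() if node in nodes)


def single_node_services() -> list:
    """Return services that run on exactly one node (no redundancy)."""
    return sorted(name for name, nodes in PLACEMENT.items() if len(nodes) == 1)
```

With the table as written, the only service without a second placement is the GPU-bound Swapper.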

### NATS Subject Routing

```
# Core subjects (replicated across all nodes)
message.*
attachment.*
agent.run.*

# Node-specific subjects
node.{node_id}.local.*

# Edge subjects (local only)
cache.invalidate.*
```
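
In NATS, `*` matches exactly one dot-separated token and `>` matches one or more trailing tokens, so `message.*` covers `message.created` but not `message.created.v2`. The sketch below reimplements that matching rule locally to make the scope of the patterns above easy to check. `subject_matches` and `node_subject` are illustrative helpers, not NATS client APIs, and the example subject names are made up.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Check a subject against a pattern using NATS wildcard semantics."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' is only valid as the last token and needs at least one
            # remaining subject token to match.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    # Without '>', pattern and subject must have the same token count.
    return len(p_tokens) == len(s_tokens)


def node_subject(node_id: str, suffix: str) -> str:
    """Build a node-specific subject following node.{node_id}.local.*"""
    return f"node.{node_id}.local.{suffix}"
```

If deeper hierarchies are ever needed under a core prefix, the pattern would have to end in `>` instead of `*`.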

### Implementation Phases

#### Phase 3.1: Prepare NODE1 for replication
- [x] Enable PostgreSQL streaming replication
- [x] Configure Qdrant for clustering
- [x] Set up NATS cluster mode

#### Phase 3.2: Deploy NODE2
- [ ] Provision Hetzner server
- [ ] Deploy base stack
- [ ] Configure replicas
- [ ] Test failover

#### Phase 3.3: Add Edge Nodes
- [ ] Deploy lightweight edge stack
- [ ] Configure NATS leafnodes
- [ ] Set up geo-routing

### Environment Variables for Multi-Node

```bash
# NODE1 specific
NODE_ID=node1
NODE_ROLE=primary
CLUSTER_PEERS=node2:4222,node3:4222

# Replication
PG_REPLICATION_USER=replicator
PG_REPLICATION_PASSWORD=<secure>
QDRANT_CLUSTER_ENABLED=true
NATS_CLUSTER_NAME=daarion-core
```
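
A service would typically read these variables once at startup and fail fast on malformed values. The following is a minimal sketch, assuming `CLUSTER_PEERS` is a comma-separated list of `host:port` entries as shown above; `NodeConfig`, `parse_peers`, and `load_node_config` are hypothetical names, not existing project code.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeConfig:
    node_id: str
    role: str
    peers: tuple  # ((host, port), ...)


def parse_peers(raw: str) -> tuple:
    """Parse 'host:port,host:port' into a tuple of (host, port) pairs."""
    peers = []
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue
        host, sep, port = item.rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError(f"invalid peer entry: {item!r}")
        peers.append((host, int(port)))
    return tuple(peers)


def load_node_config(env=None) -> NodeConfig:
    """Build the node config from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return NodeConfig(
        node_id=env["NODE_ID"],      # KeyError if missing, by design
        role=env["NODE_ROLE"],
        peers=parse_peers(env.get("CLUSTER_PEERS", "")),
    )
```

Passing a plain dict instead of `os.environ` keeps the loader easy to test.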

### Health Check Endpoints

Each node exposes:
- `/health` - basic health
- `/ready` - ready for traffic
- `/cluster/status` - cluster membership
- `/cluster/peers` - peer connectivity
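
The document does not define response shapes for these endpoints, so the sketch below is only one possible reading: `/health` answers whenever the process is up, while `/ready` returns 503 until local dependencies are reachable. `NodeState`, `respond`, and every payload field are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class NodeState:
    node_id: str
    dependencies_ok: bool = False               # e.g. DB and NATS reachable
    peers: dict = field(default_factory=dict)   # peer name -> reachable (bool)


def respond(path: str, state: NodeState) -> tuple:
    """Map a request path to (HTTP status code, JSON-serialisable payload)."""
    if path == "/health":
        # Liveness: the process is running and able to answer.
        return 200, {"status": "ok", "node": state.node_id}
    if path == "/ready":
        # Readiness: only accept traffic once dependencies are reachable.
        ready = state.dependencies_ok
        return (200 if ready else 503), {"ready": ready}
    if path == "/cluster/status":
        members = [state.node_id] + sorted(n for n, up in state.peers.items() if up)
        return 200, {"members": members}
    if path == "/cluster/peers":
        return 200, {"peers": dict(state.peers)}
    return 404, {"error": "not found"}
```

Keeping the routing as a pure function means it can be wired into any HTTP framework and unit-tested without starting a server.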

### Failover Scenarios

1. **NODE1 down**: NODE2 promotes to primary
2. **Network partition**: Split-brain prevention via NATS
3. **GPU failure**: Fallback to API models
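
How promotion and split-brain prevention are decided is not specified here. A common approach is a strict-majority rule: a node may promote itself only if it cannot reach the current primary and can still see more than half of the voters, itself included. The sketch below illustrates that rule under the assumption that all three nodes vote; `has_quorum` and `may_promote` are hypothetical helpers, not the project's mechanism.

```python
def has_quorum(visible_voters: int, total_voters: int) -> bool:
    """Strict majority: more than half of all voters must be visible."""
    return visible_voters > total_voters // 2


def may_promote(self_id: str, primary_id: str, reachable: set, voters: set) -> bool:
    """Decide whether this node may promote itself to primary.

    reachable: voter ids this node can currently reach (itself excluded).
    """
    if self_id == primary_id:
        return False  # already primary
    if primary_id in reachable:
        return False  # primary is still up from this node's point of view
    visible = 1 + len(reachable & (voters - {self_id}))  # count ourselves
    return has_quorum(visible, len(voters))
```

With three voters, NODE2 may promote only while it can still reach NODE3; a fully isolated NODE2 sees one voter out of three and stays a replica.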

---

## Next Steps

1. Prepare NODE1 for replication configs
2. Document NODE2 provisioning
3. Create deployment scripts