microdao-daarion/docs/MULTINODE_ARCHITECTURE.md
Apple ef3473db21 snapshot: NODE1 production state 2026-02-09
Complete snapshot of /opt/microdao-daarion/ from NODE1 (144.76.224.179).
This represents the actual running production code that has diverged
significantly from the previous main branch.

Key changes from old main:
- Gateway (http_api.py): expanded from ~40KB to 164KB with full agent support
- Router: new /v1/agents/{id}/infer endpoint with vision + DeepSeek routing
- Behavior Policy: SOWA v2.2 (3-level: FULL/ACK/SILENT)
- Agent Registry: config/agent_registry.yml as single source of truth
- 13 agents configured (was 3)
- Memory service integration
- CrewAI teams and roles

Excluded from snapshot: venv/, .env, data/, backups, .tgz archives

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-02-09 08:46:46 -08:00

# DAARION Multi-Node Architecture
## Current State: Single Node (NODE1)
```
NODE1 (144.76.224.179) - Hetzner GEX44
├── RTX 4000 SFF Ada (20GB VRAM)
├── Services:
│   ├── Gateway :9300
│   ├── Router :9102
│   ├── Swapper :8890 (GPU)
│   ├── Memory Service :8000
│   ├── CrewAI :9010
│   ├── CrewAI Worker :9011
│   ├── Ingest :8100
│   ├── Parser :8101
│   ├── Prometheus :9090
│   └── Grafana :3030
├── Data:
│   ├── PostgreSQL :5432
│   ├── Qdrant :6333
│   ├── Neo4j :7687
│   └── Redis :6379
└── Messaging:
    └── NATS JetStream :4222
```
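The port map above lends itself to a quick reachability check. The following is a minimal sketch, assuming every service accepts plain TCP connections on the port listed; `SERVICE_PORTS`, `check_port`, and `check_all` are illustrative names, not part of the codebase.

```python
import socket

# Ports copied from the NODE1 tree above; the keys are illustrative labels.
SERVICE_PORTS = {
    "gateway": 9300,
    "router": 9102,
    "swapper": 8890,
    "memory-service": 8000,
    "crewai": 9010,
    "crewai-worker": 9011,
    "ingest": 8100,
    "parser": 8101,
    "prometheus": 9090,
    "grafana": 3030,
    "postgresql": 5432,
    "qdrant": 6333,
    "neo4j": 7687,
    "redis": 6379,
    "nats": 4222,
}


def check_port(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_all(host: str) -> dict[str, bool]:
    """Probe every known service port on one host."""
    return {name: check_port(host, port) for name, port in SERVICE_PORTS.items()}
```

Running `check_all("127.0.0.1")` on NODE1 would return one boolean per service. A successful TCP connect only shows that something is listening, not that the service is healthy.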
## Target: Multi-Node Topology
### Edge Router Pattern
```
               ┌─────────────────────┐
               │    Global Entry     │
               │ gateway.daarion.city│
               │    (CloudFlare)     │
               └──────────┬──────────┘
         ┌────────────────┼────────────────┐
         │                │                │
┌────────▼────────┐  ┌────▼────────┐  ┌────▼────────┐
│      NODE1      │  │    NODE2    │  │    NODE3    │
│    (Primary)    │  │  (Replica)  │  │   (Edge)    │
│  Hetzner GEX44  │  │   Hetzner   │  │   Hetzner   │
└────────┬────────┘  └────┬────────┘  └────┬────────┘
         │                │                │
┌────────▼────────────────▼────────────────▼────────┐
│                 NATS Supercluster                 │
│          (Leafnodes / Mirrored Streams)           │
└───────────────────────────────────────────────────┘
```
### Node Roles
#### NODE1 (Primary)
- GPU workloads (Swapper, Vision, FLUX)
- Primary data stores
- CrewAI orchestration
#### NODE2 (Replica)
- Read replicas (Qdrant, Neo4j)
- Backup Gateway/Router
- Async workers
#### NODE3+ (Edge)
- Regional edge routing
- Local cache (Redis)
- NATS leafnode
### Data Replication Strategy
```yaml
PostgreSQL:
  mode: primary-replica
  primary: NODE1
  replicas: [NODE2]
  sync: async (streaming)
Qdrant:
  mode: sharded
  shards: 2
  replication_factor: 2
  nodes: [NODE1, NODE2]
Neo4j:
  mode: causal-cluster
  core_servers: [NODE1, NODE2]
  read_replicas: [NODE3]
Redis:
  mode: sentinel
  master: NODE1
  replicas: [NODE2]
  sentinels: 3
NATS:
  mode: supercluster
  clusters:
    - name: core
      nodes: [NODE1, NODE2]
    - name: edge
      nodes: [NODE3]
      leafnode_to: core
```
```
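To make the Qdrant numbers concrete: the total number of shard replicas in a collection is the shard count multiplied by the replication factor. The sketch below works through that arithmetic, assuming replicas are spread evenly and that each replica of a shard sits on a different node; the helper name is hypothetical.

```python
def shard_replicas_per_node(shards: int, replication_factor: int, nodes: int) -> float:
    """Average number of shard replicas stored per node, assuming an even spread."""
    # Assumption of this sketch: replicas of one shard live on distinct nodes.
    if replication_factor > nodes:
        raise ValueError("replication_factor cannot exceed the number of nodes")
    total_replicas = shards * replication_factor
    return total_replicas / nodes


# Values from the Qdrant entry above: 2 shards, factor 2, nodes NODE1 and NODE2.
per_node = shard_replicas_per_node(shards=2, replication_factor=2, nodes=2)
print(per_node)
```

With these values each node would hold two shard replicas, that is, a copy of both shards, so either node alone could serve reads for the whole collection.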
### Service Distribution
| Service | NODE1 | NODE2 | NODE3 |
|---------|-------|-------|-------|
| Gateway | ✓ | ✓ | ✓ |
| Router | ✓ | ✓ (standby) | - |
| Swapper (GPU) | ✓ | - | - |
| Memory Service | ✓ | ✓ (read) | - |
| PostgreSQL | ✓ (primary) | ✓ (replica) | - |
| Qdrant | ✓ (shard1) | ✓ (shard2) | - |
| Neo4j | ✓ (core) | ✓ (core) | ✓ (read) |
| Redis | ✓ (master) | ✓ (replica) | ✓ (cache) |
| NATS | ✓ (cluster) | ✓ (cluster) | ✓ (leaf) |
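
The table can also be read as a placement map, which makes it easy to ask what disappears when a node goes offline. A minimal sketch, assuming that any listed instance counts as available regardless of its role (standby, read-only, cache); `PLACEMENT` and `services_lost` are illustrative names.

```python
# Placement copied from the table above; role annotations are dropped.
PLACEMENT = {
    "Gateway": {"NODE1", "NODE2", "NODE3"},
    "Router": {"NODE1", "NODE2"},
    "Swapper (GPU)": {"NODE1"},
    "Memory Service": {"NODE1", "NODE2"},
    "PostgreSQL": {"NODE1", "NODE2"},
    "Qdrant": {"NODE1", "NODE2"},
    "Neo4j": {"NODE1", "NODE2", "NODE3"},
    "Redis": {"NODE1", "NODE2", "NODE3"},
    "NATS": {"NODE1", "NODE2", "NODE3"},
}


def services_lost(down: set[str]) -> list[str]:
    """Services with no remaining instance once the nodes in `down` are offline."""
    return sorted(svc for svc, nodes in PLACEMENT.items() if nodes <= down)


print(services_lost({"NODE1"}))  # ['Swapper (GPU)']
```

Losing NODE1 alone removes only the GPU-bound Swapper, which is the one service in the table without a second instance.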
### NATS Subject Routing
```
# Core subjects (replicated across all nodes)
message.*
attachment.*
agent.run.*
# Node-specific subjects
node.{node_id}.local.*
# Edge subjects (local only)
cache.invalidate.*
```
### Implementation Phases
#### Phase 3.1: Prepare NODE1 for replication
- [x] Enable PostgreSQL streaming replication
- [x] Configure Qdrant for clustering
- [x] Set up NATS cluster mode
#### Phase 3.2: Deploy NODE2
- [ ] Provision Hetzner server
- [ ] Deploy base stack
- [ ] Configure replicas
- [ ] Test failover
#### Phase 3.3: Add Edge Nodes
- [ ] Deploy lightweight edge stack
- [ ] Configure NATS leafnodes
- [ ] Set up geo-routing
### Environment Variables for Multi-Node
```bash
# NODE1 specific
NODE_ID=node1
NODE_ROLE=primary
CLUSTER_PEERS=node2:4222,node3:4222
# Replication
PG_REPLICATION_USER=replicator
PG_REPLICATION_PASSWORD=<secure>
QDRANT_CLUSTER_ENABLED=true
NATS_CLUSTER_NAME=daarion-core
```
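A service could read these variables into a typed structure at startup. The following is a minimal sketch under stated assumptions: `NodeConfig`, `parse_peers`, and `load_node_config` are hypothetical names, `CLUSTER_PEERS` is assumed to be a comma-separated `host:port` list, and the replication credentials are deliberately left out of the dataclass so they do not end up in a `repr` or log line.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeConfig:
    node_id: str
    node_role: str
    cluster_peers: list[tuple[str, int]]
    qdrant_cluster_enabled: bool
    nats_cluster_name: str


def parse_peers(raw: str) -> list[tuple[str, int]]:
    """Parse 'node2:4222,node3:4222' into [('node2', 4222), ('node3', 4222)]."""
    peers = []
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue
        host, _, port = item.rpartition(":")
        if not host or not port.isdigit():
            raise ValueError(f"invalid peer {item!r}, expected host:port")
        peers.append((host, int(port)))
    return peers


def load_node_config(env=None) -> NodeConfig:
    """Build a NodeConfig from a mapping (defaults to the process environment)."""
    env = os.environ if env is None else env
    return NodeConfig(
        node_id=env["NODE_ID"],
        node_role=env["NODE_ROLE"],
        cluster_peers=parse_peers(env.get("CLUSTER_PEERS", "")),
        qdrant_cluster_enabled=env.get("QDRANT_CLUSTER_ENABLED", "false").lower() == "true",
        nats_cluster_name=env.get("NATS_CLUSTER_NAME", ""),
    )
```

Passing a plain dict instead of `os.environ` keeps the loader easy to test, for example `load_node_config({"NODE_ID": "node1", "NODE_ROLE": "primary"})`.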
### Health Check Endpoints
Each node exposes:
- `/health` - basic health
- `/ready` - ready for traffic
- `/cluster/status` - cluster membership
- `/cluster/peers` - peer connectivity
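
One way these four endpoints might be wired up, using only the Python standard library. This is a sketch, not the Gateway's actual handler: the response shapes, the `STATE` dict, and the names `handle` and `HealthHandler` are all assumptions made for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler

# Hypothetical in-memory node state; a real service would update this at runtime.
STATE = {"ready": True, "node_id": "node1", "peers": {"node2": True, "node3": False}}


def handle(path: str, state: dict) -> tuple[int, dict]:
    """Map a request path to (HTTP status, JSON body)."""
    if path == "/health":
        return 200, {"status": "ok"}
    if path == "/ready":
        # 503 tells a load balancer to stop sending traffic to this node.
        return (200 if state["ready"] else 503), {"ready": state["ready"]}
    if path == "/cluster/status":
        members = sorted([state["node_id"], *state["peers"]])
        return 200, {"node_id": state["node_id"], "members": members}
    if path == "/cluster/peers":
        return 200, {"peers": state["peers"]}
    return 404, {"error": "not found"}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = handle(self.path, STATE)
        payload = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)
```

To try it locally, serve the handler with `http.server.HTTPServer(("127.0.0.1", 8080), HealthHandler).serve_forever()`; the port is arbitrary. Keeping `handle` as a pure function separates the routing logic from the HTTP plumbing.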
### Failover Scenarios
1. **NODE1 down**: NODE2 is promoted to primary
2. **Network partition**: Split-brain prevention via NATS
3. **GPU failure**: Fallback to API models
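
The first two scenarios interact: a replica should only promote itself when it can still see a majority of voters, otherwise both sides of a partition could end up with a primary. The sketch below shows a generic majority-quorum rule for that decision. It is not the NATS mechanism the list refers to, and `Action`, `has_quorum`, and `decide` are hypothetical names.

```python
from enum import Enum


class Action(Enum):
    STAY = "stay"
    PROMOTE = "promote"
    DEMOTE = "demote"


def has_quorum(reachable_voters: int, total_voters: int) -> bool:
    """Strict majority of voters, counting this node itself as reachable."""
    return reachable_voters > total_voters // 2


def decide(role: str, primary_reachable: bool,
           reachable_voters: int, total_voters: int) -> Action:
    """Decide what a node should do given its current view of the cluster."""
    if not has_quorum(reachable_voters, total_voters):
        # Minority side of a partition: a primary steps down,
        # a replica must not promote itself.
        return Action.DEMOTE if role == "primary" else Action.STAY
    if role == "replica" and not primary_reachable:
        return Action.PROMOTE
    return Action.STAY


# NODE2 sees itself and NODE3 (2 of 3 voters) but not NODE1.
print(decide("replica", primary_reachable=False, reachable_voters=2, total_voters=3))
```

With only two voters, a single surviving node never has a strict majority (`has_quorum(1, 2)` is false), so automatic promotion under this rule would need a third voter or witness.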
---
## Next Steps
1. Prepare replication configs on NODE1
2. Document NODE2 provisioning
3. Create deployment scripts