Complete snapshot of /opt/microdao-daarion/ from NODE1 (144.76.224.179).
This represents the actual running production code that has diverged
significantly from the previous main branch.
Key changes from old main:
- Gateway (http_api.py): expanded from ~40KB to 164KB with full agent support
- Router: new /v1/agents/{id}/infer endpoint with vision + DeepSeek routing
- Behavior Policy: SOWA v2.2 (3-level: FULL/ACK/SILENT)
- Agent Registry: config/agent_registry.yml as single source of truth
- 13 agents configured (was 3)
- Memory service integration
- CrewAI teams and roles
Excluded from snapshot: venv/, .env, data/, backups, .tgz archives
Co-authored-by: Cursor <cursoragent@cursor.com>
# DAARION Multi-Node Architecture
## Current State: Single Node (NODE1)
```
NODE1 (144.76.224.179) - Hetzner GEX44
├── RTX 4000 SFF Ada (20GB VRAM)
├── Services:
│   ├── Gateway        :9300
│   ├── Router         :9102
│   ├── Swapper        :8890 (GPU)
│   ├── Memory Service :8000
│   ├── CrewAI         :9010
│   ├── CrewAI Worker  :9011
│   ├── Ingest         :8100
│   ├── Parser         :8101
│   ├── Prometheus     :9090
│   └── Grafana        :3030
├── Data:
│   ├── PostgreSQL     :5432
│   ├── Qdrant         :6333
│   ├── Neo4j          :7687
│   └── Redis          :6379
└── Messaging:
    └── NATS JetStream :4222
```
## Target: Multi-Node Topology
### Edge Router Pattern
```
           ┌─────────────────────┐
           │    Global Entry     │
           │ gateway.daarion.city│
           │    (CloudFlare)     │
           └──────────┬──────────┘
                      │
     ┌────────────────┼────────────────┐
     │                │                │
┌────▼────────┐  ┌────▼────────┐  ┌────▼────────┐
│    NODE1    │  │    NODE2    │  │    NODE3    │
│  (Primary)  │  │  (Replica)  │  │   (Edge)    │
│Hetzner GEX44│  │   Hetzner   │  │   Hetzner   │
└────┬────────┘  └────┬────────┘  └────┬────────┘
     │                │                │
┌────▼────────────────▼────────────────▼─────────┐
│                NATS Supercluster               │
│         (Leafnodes / Mirrored Streams)         │
└────────────────────────────────────────────────┘
```
### Node Roles
#### NODE1 (Primary)
- GPU workloads (Swapper, Vision, FLUX)
- Primary data stores
- CrewAI orchestration

#### NODE2 (Replica)
- Read replicas (Qdrant, Neo4j)
- Backup Gateway/Router
- Async workers

#### NODE3+ (Edge)
- Regional edge routing
- Local cache (Redis)
- NATS leafnode
### Data Replication Strategy
```yaml
PostgreSQL:
  mode: primary-replica
  primary: NODE1
  replicas: [NODE2]
  sync: async (streaming)

Qdrant:
  mode: sharded
  shards: 2
  replication_factor: 2
  nodes: [NODE1, NODE2]

Neo4j:
  mode: causal-cluster
  core_servers: [NODE1, NODE2]
  read_replicas: [NODE3]

Redis:
  mode: sentinel
  master: NODE1
  replicas: [NODE2]
  sentinels: 3

NATS:
  mode: supercluster
  clusters:
    - name: core
      nodes: [NODE1, NODE2]
    - name: edge
      nodes: [NODE3]
      leafnode_to: core
```
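The Redis entry above implies a standard three-sentinel deployment. A minimal `sentinel.conf` sketch, one copy per sentinel — the master name `daarion-master`, the `node1` hostname, and the timeout values are illustrative assumptions, not taken from the running setup:

```
# sentinel.conf (one per sentinel; 3 total, per the plan above)
# "daarion-master" and "node1" are illustrative names
sentinel monitor daarion-master node1 6379 2   # quorum: 2 of 3 sentinels must agree
sentinel down-after-milliseconds daarion-master 5000
sentinel failover-timeout daarion-master 60000
sentinel parallel-syncs daarion-master 1       # resync replicas one at a time
```

With `sentinels: 3` and a quorum of 2, a single failed sentinel cannot block failover, and a single isolated sentinel cannot trigger one.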
### Service Distribution
| Service        | NODE1       | NODE2       | NODE3     |
|----------------|-------------|-------------|-----------|
| Gateway        | ✓           | ✓           | ✓         |
| Router         | ✓           | ✓ (standby) | -         |
| Swapper (GPU)  | ✓           | -           | -         |
| Memory Service | ✓           | ✓ (read)    | -         |
| PostgreSQL     | ✓ (primary) | ✓ (replica) | -         |
| Qdrant         | ✓ (shard1)  | ✓ (shard2)  | -         |
| Neo4j          | ✓ (core)    | ✓ (core)    | ✓ (read)  |
| Redis          | ✓ (master)  | ✓ (replica) | ✓ (cache) |
| NATS           | ✓ (cluster) | ✓ (cluster) | ✓ (leaf)  |
### NATS Subject Routing
```
# Core subjects (replicated across all nodes)
message.*
attachment.*
agent.run.*

# Node-specific subjects
node.{node_id}.local.*

# Edge subjects (local only)
cache.invalidate.*
```
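The replicated core subjects map naturally onto JetStream mirrored streams: NODE1 owns the source stream, NODE2 carries a mirror. A sketch of the mirror's definition as a JSON config for `nats stream add --config` — the stream names `MESSAGES` and `MESSAGES_MIRROR` are assumptions for illustration:

```json
{
  "name": "MESSAGES_MIRROR",
  "mirror": {
    "name": "MESSAGES"
  },
  "retention": "limits",
  "storage": "file",
  "num_replicas": 1
}
```

Note that a mirror stream deliberately declares no `subjects` of its own: it ingests messages only from its source stream, which is what makes it safe to run on a replica node.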
### Implementation Phases
#### Phase 3.1: Prepare NODE1 for replication
- [x] Enable PostgreSQL streaming replication
- [x] Configure Qdrant for clustering
- [x] Set up NATS cluster mode
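The first checklist item boils down to a few WAL settings on NODE1 plus an access rule for the `replicator` role from the environment block below. A sketch — the private subnet is an assumption; substitute the real internal network range:

```
# postgresql.conf on NODE1 (primary)
wal_level = replica        # emit enough WAL for a streaming standby
max_wal_senders = 5        # concurrent walsender slots for replicas + backups
wal_keep_size = 1GB        # retain WAL so a lagging NODE2 can catch up
hot_standby = on           # allow read-only queries on the standby

# pg_hba.conf on NODE1 — allow NODE2 to stream WAL
# (10.0.0.0/24 is illustrative; use the actual private network)
host  replication  replicator  10.0.0.0/24  scram-sha-256
```

NODE2 is then initialized from a base backup (e.g. `pg_basebackup -R`), which writes the standby's connection settings automatically.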
#### Phase 3.2: Deploy NODE2
- [ ] Provision Hetzner server
- [ ] Deploy base stack
- [ ] Configure replicas
- [ ] Test failover
#### Phase 3.3: Add Edge Nodes
- [ ] Deploy lightweight edge stack
- [ ] Configure NATS leafnodes
- [ ] Set up geo-routing
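The leafnode step can be sketched as a `nats-server` config on NODE3 that dials both core nodes (7422 is the conventional NATS leafnode port; the hostnames and store path are assumptions):

```
# nats-node3.conf — NODE3 joins the core cluster as a leafnode
leafnodes {
  remotes [
    { url: "nats-leaf://node1:7422" }   # primary core node
    { url: "nats-leaf://node2:7422" }   # failover if node1 is unreachable
  ]
}
jetstream {
  store_dir: "/var/lib/nats"            # local storage for edge-only streams
}
```

The core side would expose a matching `leafnodes { port: 7422 }` listener on NODE1 and NODE2; the leafnode transparently bridges subjects, so `cache.invalidate.*` traffic stays local while core subjects flow through.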
### Environment Variables for Multi-Node
```bash
# NODE1 specific
NODE_ID=node1
NODE_ROLE=primary
CLUSTER_PEERS=node2:4222,node3:4222

# Replication
PG_REPLICATION_USER=replicator
PG_REPLICATION_PASSWORD=<secure>
QDRANT_CLUSTER_ENABLED=true
NATS_CLUSTER_NAME=daarion-core
```
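For symmetry, NODE2 would carry the same variables with the role flipped. A sketch following the NODE1 block above — `PG_PRIMARY_HOST` is a hypothetical variable name, not one confirmed in the running stack:

```bash
# NODE2 specific
NODE_ID=node2
NODE_ROLE=replica
CLUSTER_PEERS=node1:4222,node3:4222

# Replication (NODE2 streams WAL from NODE1; variable name is illustrative)
PG_PRIMARY_HOST=node1
QDRANT_CLUSTER_ENABLED=true
NATS_CLUSTER_NAME=daarion-core
```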
### Health Check Endpoints
Each node exposes:

- `/health` - basic health
- `/ready` - ready for traffic
- `/cluster/status` - cluster membership
- `/cluster/peers` - peer connectivity
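One shape the `/cluster/peers` payload could take — the field names here are illustrative, not the implemented schema:

```json
{
  "node_id": "node1",
  "role": "primary",
  "peers": [
    { "id": "node2", "nats": "connected", "latency_ms": 2 },
    { "id": "node3", "nats": "connected", "latency_ms": 18 }
  ]
}
```

Keeping per-peer latency in the payload lets the edge geo-routing phase reuse the same endpoint for routing decisions.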
### Failover Scenarios
1. **NODE1 down**: NODE2 promotes to primary
2. **Network partition**: Split-brain prevention via NATS
3. **GPU failure**: Fallback to API models
---
## Next Steps
1. Prepare replication configs on NODE1
2. Document NODE2 provisioning
3. Create deployment scripts