# DAARION Multi-Node Architecture

## Current State: Single Node (NODE1)

```
NODE1 (144.76.224.179) - Hetzner GEX44
├── RTX 4000 SFF Ada (20GB VRAM)
├── Services:
│   ├── Gateway :9300
│   ├── Router :9102
│   ├── Swapper :8890 (GPU)
│   ├── Memory Service :8000
│   ├── CrewAI :9010
│   ├── CrewAI Worker :9011
│   ├── Ingest :8100
│   ├── Parser :8101
│   ├── Prometheus :9090
│   └── Grafana :3030
├── Data:
│   ├── PostgreSQL :5432
│   ├── Qdrant :6333
│   ├── Neo4j :7687
│   └── Redis :6379
└── Messaging:
    └── NATS JetStream :4222
```

## Target: Multi-Node Topology

### Edge Router Pattern

```
               ┌─────────────────────┐
               │    Global Entry     │
               │ gateway.daarion.city│
               │    (CloudFlare)     │
               └──────────┬──────────┘
                          │
         ┌────────────────┼───────────────┐
         │                │               │
┌────────▼────────┐ ┌─────▼───────┐ ┌─────▼───────┐
│      NODE1      │ │    NODE2    │ │    NODE3    │
│    (Primary)    │ │  (Replica)  │ │   (Edge)    │
│  Hetzner GEX44  │ │   Hetzner   │ │   Hetzner   │
└────────┬────────┘ └─────┬───────┘ └─────┬───────┘
         │                │               │
┌────────▼────────────────▼───────────────▼────────┐
│                NATS Supercluster                 │
│          (Leafnodes / Mirrored Streams)          │
└──────────────────────────────────────────────────┘
```

### Node Roles

#### NODE1 (Primary)
- GPU workloads (Swapper, Vision, FLUX)
- Primary data stores
- CrewAI orchestration

#### NODE2 (Replica)
- Read replicas (Qdrant, Neo4j)
- Backup Gateway/Router
- Async workers

#### NODE3+ (Edge)
- Regional edge routing
- Local cache (Redis)
- NATS leafnode

### Data Replication Strategy

```yaml
PostgreSQL:
  mode: primary-replica
  primary: NODE1
  replicas: [NODE2]
  sync: async (streaming)

Qdrant:
  mode: sharded
  shards: 2
  replication_factor: 2
  nodes: [NODE1, NODE2]

Neo4j:
  mode: causal-cluster
  core_servers: [NODE1, NODE2]
  read_replicas: [NODE3]

Redis:
  mode: sentinel
  master: NODE1
  replicas: [NODE2]
  sentinels: 3

NATS:
  mode: supercluster
  clusters:
    - name: core
      nodes: [NODE1, NODE2]
    - name: edge
      nodes: [NODE3]
      leafnode_to: core
```

### Service Distribution

| Service | NODE1 | NODE2 | NODE3 |
|---------|-------|-------|-------|
| Gateway | ✓ | ✓ | ✓ |
| Router | ✓ | ✓ (standby) | - |
| Swapper (GPU) | ✓ | - | - |
| Memory Service | ✓ | ✓ (read) | - |
| PostgreSQL | ✓ (primary) | ✓ (replica) | - |
| Qdrant | ✓ (shard1) | ✓ (shard2) | - |
| Neo4j | ✓ (core) | ✓ (core) | ✓ (read) |
| Redis | ✓ (master) | ✓ (replica) | ✓ (cache) |
| NATS | ✓ (cluster) | ✓ (cluster) | ✓ (leaf) |

### NATS Subject Routing

```
# Core subjects (replicated across all nodes)
message.*
attachment.*
agent.run.*

# Node-specific subjects
node.{node_id}.local.*

# Edge subjects (local only)
cache.invalidate.*
```

### Implementation Phases

#### Phase 3.1: Prepare NODE1 for replication
- [x] Enable PostgreSQL streaming replication
- [x] Configure Qdrant for clustering
- [x] Set up NATS cluster mode

#### Phase 3.2: Deploy NODE2
- [ ] Provision Hetzner server
- [ ] Deploy base stack
- [ ] Configure replicas
- [ ] Test failover

#### Phase 3.3: Add Edge Nodes
- [ ] Deploy lightweight edge stack
- [ ] Configure NATS leafnodes
- [ ] Set up geo-routing

### Environment Variables for Multi-Node

```bash
# NODE1 specific
NODE_ID=node1
NODE_ROLE=primary
CLUSTER_PEERS=node2:4222,node3:4222

# Replication
PG_REPLICATION_USER=replicator
PG_REPLICATION_PASSWORD=
QDRANT_CLUSTER_ENABLED=true
NATS_CLUSTER_NAME=daarion-core
```

### Health Check Endpoints

Each node exposes:
- `/health` - basic health
- `/ready` - ready for traffic
- `/cluster/status` - cluster membership
- `/cluster/peers` - peer connectivity

### Failover Scenarios

1. **NODE1 down**: NODE2 promotes to primary
2. **Network partition**: Split-brain prevention via NATS
3. **GPU failure**: Fallback to API models

---

## Next Steps

1. Prepare NODE1 for replication configs
2. Document NODE2 provisioning
3. Create deployment scripts
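
### Example: Classifying Subjects by Routing Scope

The subject groups listed under NATS Subject Routing can be sketched as a small classifier that reports whether a subject is replicated (core), specific to one node, or edge-local. This is an illustrative sketch only, not code from the DAARION stack: the function names `subject_matches` and `routing_scope` and the scope labels are assumptions. The matcher follows standard NATS wildcard semantics, where `*` matches exactly one token and `>` matches one or more trailing tokens.

```python
from typing import List

# Subject patterns taken from the "NATS Subject Routing" section.
CORE_PATTERNS: List[str] = ["message.*", "attachment.*", "agent.run.*"]
EDGE_LOCAL_PATTERNS: List[str] = ["cache.invalidate.*"]


def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if `subject` matches a NATS-style `pattern`.

    '*' matches exactly one token; '>' (last token only) matches one or
    more remaining tokens.
    """
    pattern_tokens = pattern.split(".")
    subject_tokens = subject.split(".")
    for i, tok in enumerate(pattern_tokens):
        if tok == ">":
            # '>' is only valid as the final token and needs >= 1 token left.
            return i == len(pattern_tokens) - 1 and len(subject_tokens) > i
        if i >= len(subject_tokens):
            return False
        if tok != "*" and tok != subject_tokens[i]:
            return False
    return len(pattern_tokens) == len(subject_tokens)


def routing_scope(subject: str, node_id: str) -> str:
    """Classify a subject as seen from the node `node_id`.

    The labels ("core", "node-local", "edge-local", "unknown") are
    hypothetical names chosen for this sketch.
    """
    if any(subject_matches(p, subject) for p in CORE_PATTERNS):
        return "core"
    if subject_matches(f"node.{node_id}.local.*", subject):
        return "node-local"
    if any(subject_matches(p, subject) for p in EDGE_LOCAL_PATTERNS):
        return "edge-local"
    return "unknown"


if __name__ == "__main__":
    for s in ("message.created", "node.node1.local.health", "cache.invalidate.user"):
        print(s, "->", routing_scope(s, node_id="node1"))
```

Because `*` matches a single token, a deeper subject such as `message.a.b` does not fall under `message.*`; if multi-level subjects are expected, the patterns would need `>` instead.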