# DAARION Multi-Node Architecture

## Current State: Single Node (NODE1)

```
NODE1 (144.76.224.179) - Hetzner GEX44
├── RTX 4000 SFF Ada (20GB VRAM)
├── Services:
│   ├── Gateway :9300
│   ├── Router :9102
│   ├── Swapper :8890 (GPU)
│   ├── Memory Service :8000
│   ├── CrewAI :9010
│   ├── CrewAI Worker :9011
│   ├── Ingest :8100
│   ├── Parser :8101
│   ├── Prometheus :9090
│   └── Grafana :3030
├── Data:
│   ├── PostgreSQL :5432
│   ├── Qdrant :6333
│   ├── Neo4j :7687
│   └── Redis :6379
└── Messaging:
    └── NATS JetStream :4222
```

## Target: Multi-Node Topology

### Edge Router Pattern

```
                    ┌─────────────────────┐
                    │   Global Entry      │
                    │ gateway.daarion.city│
                    │    (Cloudflare)     │
                    └──────────┬──────────┘
                               │
              ┌────────────────┼────────────────┐
              │                │                │
     ┌────────▼────────┐ ┌────▼────────┐ ┌────▼────────┐
     │     NODE1       │ │    NODE2    │ │   NODE3     │
     │   (Primary)     │ │  (Replica)  │ │   (Edge)    │
     │  Hetzner GEX44  │ │  Hetzner    │ │  Hetzner    │
     └────────┬────────┘ └─────┬───────┘ └─────┬───────┘
              │                │               │
     ┌────────▼────────────────▼───────────────▼────────┐
     │              NATS Supercluster                   │
     │         (Leafnodes / Mirrored Streams)           │
     └──────────────────────────────────────────────────┘
```

### Node Roles

#### NODE1 (Primary)
- GPU workloads (Swapper, Vision, FLUX)
- Primary data stores
- CrewAI orchestration

#### NODE2 (Replica)
- Data replicas (PostgreSQL, Qdrant, Redis) and second Neo4j core server
- Backup Gateway/Router
- Async workers

#### NODE3+ (Edge)
- Regional edge routing
- Local cache (Redis)
- NATS leafnode

### Data Replication Strategy

```yaml
PostgreSQL:
  mode: primary-replica
  primary: NODE1
  replicas: [NODE2]
  sync: async  # asynchronous streaming replication

Qdrant:
  mode: sharded
  shards: 2
  replication_factor: 2
  nodes: [NODE1, NODE2]

Neo4j:
  mode: causal-cluster
  core_servers: [NODE1, NODE2]
  read_replicas: [NODE3]

Redis:
  mode: sentinel
  master: NODE1
  replicas: [NODE2]
  sentinels: 3

NATS:
  mode: supercluster
  clusters:
    - name: core
      nodes: [NODE1, NODE2]
    - name: edge
      nodes: [NODE3]
      leafnode_to: core
```
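
A plan like the one above can be sanity-checked by computing how many member failures each voting group survives. The sketch below is illustrative only: `quorum_size`, `tolerated_failures`, and the `PLAN` dictionary are hypothetical names, and it assumes majority-based quorum (as in Raft-style consensus and Sentinel failover authorization) for the groups listed.

```python
def quorum_size(members: int) -> int:
    """Smallest majority of a group with `members` voting members."""
    if members < 1:
        raise ValueError("a group needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum_size(members)


# Hypothetical summary of the voting groups in the plan above.
PLAN = {
    "neo4j_core": 2,       # core_servers: [NODE1, NODE2]
    "redis_sentinels": 3,  # sentinels: 3
    "nats_core": 2,        # core cluster: [NODE1, NODE2]
}

for group, members in PLAN.items():
    print(f"{group}: quorum={quorum_size(members)}, "
          f"tolerates {tolerated_failures(members)} failure(s)")
```

Under the majority assumption, a two-member group tolerates no failures, so a third voting member may be worth considering for any group that must survive the loss of NODE1.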

### Service Distribution

| Service | NODE1 | NODE2 | NODE3 |
|---------|-------|-------|-------|
| Gateway | ✓ | ✓ | ✓ |
| Router | ✓ | ✓ (standby) | - |
| Swapper (GPU) | ✓ | - | - |
| Memory Service | ✓ | ✓ (read) | - |
| PostgreSQL | ✓ (primary) | ✓ (replica) | - |
| Qdrant | ✓ (shards 1+2) | ✓ (shards 1+2) | - |
| Neo4j | ✓ (core) | ✓ (core) | ✓ (read) |
| Redis | ✓ (master) | ✓ (replica) | ✓ (cache) |
| NATS | ✓ (cluster) | ✓ (cluster) | ✓ (leaf) |
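
The table can be read as a placement map: for each service, an ordered list of nodes that may serve it. The following sketch shows one possible lookup. It is not part of the DAARION codebase; `PLACEMENT`, `pick_node`, and the lowercase service and node names are hypothetical, and only a few rows of the table are encoded.

```python
from typing import Dict, List, Optional, Set

# Hypothetical encoding of part of the service distribution table.
# List order expresses preference: the first healthy node wins.
PLACEMENT: Dict[str, List[str]] = {
    "gateway": ["node1", "node2", "node3"],
    "router": ["node1", "node2"],          # node2 is the standby
    "swapper": ["node1"],                  # GPU is only on node1
    "memory-service": ["node1", "node2"],  # node2 serves reads
}


def pick_node(service: str, healthy: Set[str]) -> Optional[str]:
    """Return the first healthy node hosting `service`, or None."""
    for node in PLACEMENT.get(service, []):
        if node in healthy:
            return node
    return None
```

For example, with `healthy = {"node2", "node3"}`, `pick_node("router", healthy)` selects the standby on `node2`, while `pick_node("swapper", healthy)` returns `None` because no other node has a GPU.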

### NATS Subject Routing

```
# Core subjects (replicated across all nodes)
message.*
attachment.*
agent.run.*

# Node-specific subjects
node.{node_id}.local.*

# Edge subjects (local only)
cache.invalidate.*
```
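
In NATS, `*` matches exactly one dot-separated token and `>` matches one or more trailing tokens. The sketch below implements that matching rule in plain Python and uses it to classify a subject into the three scopes listed above. `subject_matches`, `scope_of`, and `SCOPES` are hypothetical helper names, and `node.{node_id}.local.*` is written as `node.*.local.*` on the assumption that the node ID is a single token.

```python
from typing import List, Optional, Tuple


def subject_matches(pattern: str, subject: str) -> bool:
    """Match a subject against a pattern using NATS wildcard rules."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' is only valid as the last token and needs >= 1 token left.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)


# Scope name -> patterns, taken from the subject list above.
SCOPES: List[Tuple[str, List[str]]] = [
    ("core", ["message.*", "attachment.*", "agent.run.*"]),
    ("node", ["node.*.local.*"]),
    ("edge", ["cache.invalidate.*"]),
]


def scope_of(subject: str) -> Optional[str]:
    """Return the first scope whose patterns match `subject`, else None."""
    for scope, patterns in SCOPES:
        if any(subject_matches(p, subject) for p in patterns):
            return scope
    return None
```

Because `*` covers a single token, a subject such as `agent.run.123.step` does not match `agent.run.*`; a pattern ending in `>` would be needed if deeper subject hierarchies are expected.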

### Implementation Phases

#### Phase 3.1: Prepare NODE1 for replication
- [x] Enable PostgreSQL streaming replication
- [x] Configure Qdrant for clustering
- [x] Set up NATS cluster mode

#### Phase 3.2: Deploy NODE2
- [ ] Provision Hetzner server
- [ ] Deploy base stack
- [ ] Configure replicas
- [ ] Test failover

#### Phase 3.3: Add Edge Nodes
- [ ] Deploy lightweight edge stack
- [ ] Configure NATS leafnodes
- [ ] Set up geo-routing

### Environment Variables for Multi-Node

```bash
# NODE1 specific
NODE_ID=node1
NODE_ROLE=primary
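# NATS route peers use the cluster port (6222 by default); 4222 is the client port.
# NODE3 joins as a leafnode, so it is not listed as a route peer.
CLUSTER_PEERS=node2:6222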

# Replication
PG_REPLICATION_USER=replicator
PG_REPLICATION_PASSWORD=<secure>
QDRANT_CLUSTER_ENABLED=true
NATS_CLUSTER_NAME=daarion-core
```
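
A service could read these variables at startup along the following lines. This is a minimal sketch, not the actual DAARION loader: `NodeConfig`, `parse_peers`, and `load_config` are hypothetical names, and it assumes `CLUSTER_PEERS` is a comma-separated list of `host:port` entries.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Tuple


@dataclass(frozen=True)
class NodeConfig:
    node_id: str
    role: str
    peers: List[Tuple[str, int]]


def parse_peers(raw: str) -> List[Tuple[str, int]]:
    """Parse 'host:port,host:port' into a list of (host, port) tuples."""
    peers = []
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty values and trailing commas
        host, _, port = item.rpartition(":")
        if not host or not port.isdigit():
            raise ValueError(f"invalid peer entry: {item!r}")
        peers.append((host, int(port)))
    return peers


def load_config(env: Mapping[str, str] = os.environ) -> NodeConfig:
    """Build a NodeConfig; NODE_ID and NODE_ROLE are required."""
    return NodeConfig(
        node_id=env["NODE_ID"],
        role=env["NODE_ROLE"],
        peers=parse_peers(env.get("CLUSTER_PEERS", "")),
    )
```

Passing the environment as a parameter keeps the loader easy to test: a plain dictionary can stand in for `os.environ`. Failing fast on a malformed peer entry avoids starting a node with a partial peer list.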

### Health Check Endpoints

Each node exposes the following HTTP endpoints:
- `/health` - liveness (the process is up)
- `/ready` - readiness (the node can accept traffic)
- `/cluster/status` - cluster membership
- `/cluster/peers` - peer connectivity
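
A monitoring script could probe all four endpoints on a node and summarize the result. The sketch below uses only the Python standard library. It assumes that a healthy endpoint answers with HTTP 200, which the list above does not specify; `probe_node`, `http_status`, and the base URL in the usage note are hypothetical.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict

ENDPOINTS = ("/health", "/ready", "/cluster/status", "/cluster/peers")


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code for `url`, or 0 if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # server answered with a 4xx/5xx status
    except OSError:
        return 0  # connection refused, DNS failure, timeout, ...


def probe_node(
    base_url: str, fetch: Callable[[str], int] = http_status
) -> Dict[str, bool]:
    """Map each endpoint path to True if it answered with HTTP 200."""
    base = base_url.rstrip("/")
    return {path: fetch(base + path) == 200 for path in ENDPOINTS}
```

A call such as `probe_node("http://node1.example")` returns a dictionary keyed by path. The `fetch` parameter allows the HTTP call to be replaced with a stub in tests.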

### Failover Scenarios

1. **NODE1 down**: NODE2 is promoted to primary (PostgreSQL and Redis replicas promoted, standby Router activated)
2. **Network partition**: NATS is used to prevent split-brain
3. **GPU failure**: GPU workloads fall back to API models

---

## Next Steps

1. Prepare replication configs for NODE1
2. Document NODE2 provisioning
3. Create deployment scripts