microdao-daarion/docs/MULTINODE_ARCHITECTURE.md
Commit ef3473db21: NODE1 production state 2026-02-09
Complete snapshot of /opt/microdao-daarion/ from NODE1 (144.76.224.179).
This is the actual running production code, which has diverged
significantly from the previous main branch.

Key changes from old main:
- Gateway (http_api.py): expanded from ~40KB to 164KB with full agent support
- Router: new /v1/agents/{id}/infer endpoint with vision + DeepSeek routing
- Behavior Policy: SOWA v2.2 (3-level: FULL/ACK/SILENT)
- Agent Registry: config/agent_registry.yml as single source of truth
- 13 agents configured (was 3)
- Memory service integration
- CrewAI teams and roles

Excluded from snapshot: venv/, .env, data/, backups, .tgz archives

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-02-09 08:46:46 -08:00


DAARION Multi-Node Architecture

Current State: Single Node (NODE1)

NODE1 (144.76.224.179) - Hetzner GEX44
├── RTX 4000 SFF Ada (20GB VRAM)
├── Services:
│   ├── Gateway :9300
│   ├── Router :9102
│   ├── Swapper :8890 (GPU)
│   ├── Memory Service :8000
│   ├── CrewAI :9010
│   ├── CrewAI Worker :9011
│   ├── Ingest :8100
│   ├── Parser :8101
│   ├── Prometheus :9090
│   └── Grafana :3030
├── Data:
│   ├── PostgreSQL :5432
│   ├── Qdrant :6333
│   ├── Neo4j :7687
│   └── Redis :6379
└── Messaging:
    └── NATS JetStream :4222

Target: Multi-Node Topology

Edge Router Pattern

                    ┌─────────────────────┐
                    │   Global Entry      │
                    │ gateway.daarion.city│
                    │    (CloudFlare)     │
                    └──────────┬──────────┘
                               │
              ┌────────────────┼────────────────┐
              │                │                │
     ┌────────▼────────┐ ┌────▼────────┐ ┌────▼────────┐
     │     NODE1       │ │    NODE2    │ │   NODE3     │
     │   (Primary)     │ │  (Replica)  │ │   (Edge)    │
     │  Hetzner GEX44  │ │  Hetzner    │ │  Hetzner    │
     └────────┬────────┘ └─────┬───────┘ └─────┬───────┘
              │                │               │
     ┌────────▼────────────────▼───────────────▼────────┐
     │              NATS Supercluster                   │
     │         (Leafnodes / Mirrored Streams)           │
     └──────────────────────────────────────────────────┘

Node Roles

NODE1 (Primary)

  • GPU workloads (Swapper, Vision, FLUX)
  • Primary data stores
  • CrewAI orchestration

NODE2 (Replica)

  • Read replicas (Qdrant, Neo4j)
  • Backup Gateway/Router
  • Async workers

NODE3+ (Edge)

  • Regional edge routing
  • Local cache (Redis)
  • NATS leafnode

Data Replication Strategy

PostgreSQL:
  mode: primary-replica
  primary: NODE1
  replicas: [NODE2]
  sync: async (streaming)

Qdrant:
  mode: sharded
  shards: 2
  replication_factor: 2
  nodes: [NODE1, NODE2]

Neo4j:
  mode: causal-cluster
  core_servers: [NODE1, NODE2]
  read_replicas: [NODE3]

Redis:
  mode: sentinel
  master: NODE1
  replicas: [NODE2]
  sentinels: 3

NATS:
  mode: supercluster
  clusters:
    - name: core
      nodes: [NODE1, NODE2]
    - name: edge
      nodes: [NODE3]
      leafnode_to: core
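The supercluster layout above can be sketched as nats-server configuration. This is a minimal, illustrative fragment, not the deployed config: server names, ports, and the store directory are assumptions.

```conf
# core.conf (NODE1; NODE2 mirrors this with its own server_name and route)
server_name: node1
cluster {
  name: daarion-core
  listen: 0.0.0.0:6222
  routes: [ nats-route://node2:6222 ]
}
leafnodes {
  port: 7422          # accept leafnode connections from edge nodes
}
jetstream { store_dir: /data/nats }

# edge.conf (NODE3): connect upward into the core cluster as a leafnode
# server_name: node3
# leafnodes {
#   remotes: [ { url: "nats-leaf://node1:7422" } ]
# }
```

With this shape, edge clients publish locally on NODE3 and the leafnode link forwards traffic into the core cluster.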

Service Distribution

Service         NODE1         NODE2         NODE3
Gateway         ✓             ✓ (standby)   ✓ (edge)
Router          ✓             ✓ (standby)   -
Swapper (GPU)   ✓             -             -
Memory Service  ✓             ✓ (read)      -
PostgreSQL      ✓ (primary)   ✓ (replica)   -
Qdrant          ✓ (shard1)    ✓ (shard2)    -
Neo4j           ✓ (core)      ✓ (core)      ✓ (read)
Redis           ✓ (master)    ✓ (replica)   ✓ (cache)
NATS            ✓ (cluster)   ✓ (cluster)   ✓ (leaf)

NATS Subject Routing

# Core subjects (replicated across all nodes)
message.*
attachment.*
agent.run.*

# Node-specific subjects
node.{node_id}.local.*

# Edge subjects (local only)
cache.invalidate.*
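For the replicated core subjects, a JetStream stream on the core cluster can be mirrored read-only on an edge node. The fragment below is a sketch of a mirror stream configuration; the stream names are hypothetical, and note that a mirror stream must not define its own subjects.

```json
{
  "name": "MESSAGES_EDGE",
  "storage": "file",
  "retention": "limits",
  "mirror": {
    "name": "MESSAGES"
  }
}
```

The source stream `MESSAGES` would be created on the core cluster with `subjects: ["message.*"]` and `replicas: 2`; the edge mirror then pulls messages over the leafnode link without accepting publishes of its own.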

Implementation Phases

Phase 3.1: Prepare NODE1 for replication

  • Enable PostgreSQL streaming replication
  • Configure Qdrant for clustering
  • Set up NATS cluster mode
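The PostgreSQL part of Phase 3.1 amounts to a few settings on NODE1. The values below are illustrative defaults, not the production config, and the NODE2 address is a placeholder.

```conf
# postgresql.conf on NODE1 (primary); values are illustrative
wal_level = replica
max_wal_senders = 5
wal_keep_size = '1GB'

# pg_hba.conf on NODE1: allow the replicator user from NODE2
# (<node2_ip> is a placeholder for NODE2's address)
host  replication  replicator  <node2_ip>/32  scram-sha-256
```

NODE2 is then initialized from a base backup (e.g. `pg_basebackup`) and started in standby mode, at which point it streams WAL from NODE1 asynchronously, matching the `sync: async (streaming)` mode above.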

Phase 3.2: Deploy NODE2

  • Provision Hetzner server
  • Deploy base stack
  • Configure replicas
  • Test failover

Phase 3.3: Add Edge Nodes

  • Deploy lightweight edge stack
  • Configure NATS leafnodes
  • Set up geo-routing

Environment Variables for Multi-Node

# NODE1 specific
NODE_ID=node1
NODE_ROLE=primary
CLUSTER_PEERS=node2:4222,node3:4222

# Replication
PG_REPLICATION_USER=replicator
PG_REPLICATION_PASSWORD=<secure>
QDRANT_CLUSTER_ENABLED=true
NATS_CLUSTER_NAME=daarion-core

Health Check Endpoints

Each node exposes:

  • /health - basic health
  • /ready - ready for traffic
  • /cluster/status - cluster membership
  • /cluster/peers - peer connectivity
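A plausible response shape for `/cluster/status` is sketched below. The field names are assumptions for illustration; the actual schema is defined by the Gateway implementation.

```json
{
  "node_id": "node1",
  "role": "primary",
  "cluster": "daarion-core",
  "peers": [
    { "node_id": "node2", "reachable": true },
    { "node_id": "node3", "reachable": true }
  ]
}
```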

Failover Scenarios

  1. NODE1 down: NODE2 promotes to primary
  2. Network partition: Split-brain prevention via NATS
  3. GPU failure: Fallback to API models
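For the Redis piece of scenario 1, Sentinel handles promotion automatically. A minimal sketch of the Sentinel configuration is below (master name, timeouts, and host aliases are assumptions); running three sentinels with a quorum of 2 means a single partitioned sentinel cannot trigger a failover on its own, which addresses the split-brain concern for Redis specifically.

```conf
# sentinel.conf, deployed on three hosts per the replication strategy above
sentinel monitor daarion-master node1 6379 2
sentinel down-after-milliseconds daarion-master 5000
sentinel failover-timeout daarion-master 60000
```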

Next Steps

  1. Prepare replication configs on NODE1
  2. Document NODE2 provisioning
  3. Create deployment scripts