microdao-daarion/ops/CHANGELOG_FABRIC.md
Apple ed7ad49d3a P3.2+P3.3+P3.4: NODA1 node-worker + NATS auth config + Prometheus counters
P3.2 — Multi-node deployment:
- Added node-worker service to docker-compose.node1.yml (NODE_ID=noda1)
- NCS NODA1 now has NODE_WORKER_URL for metrics collection
- Fixed NODE_ID consistency: router NODA1 uses 'noda1'
- NODA2 node-worker/NCS gets NCS_REPORT_URL for latency reporting

P3.3 — NATS accounts/auth (opt-in config):
- config/nats-server.conf with 3 accounts: SYS, FABRIC, APP
- Per-user topic permissions (router, ncs, node_worker)
- Leafnode listener :7422 with auth
- Not yet activated (requires credential provisioning)

P3.4 — Prometheus counters:
- Router /fabric_metrics: caps_refresh, caps_stale, model_select,
  offload_total, breaker_state, score_ms histogram
- Node Worker /prom_metrics: jobs_total, inflight gauge, latency_ms histogram
- NCS /prom_metrics: runtime_health, runtime_p50/p95, node_wait_ms
- All bound to 127.0.0.1 (not externally exposed)

Made-with: Cursor
2026-02-27 03:03:18 -08:00


Agent Fabric Layer — Changelog

v0.4 — P3.2/P3.3/P3.4 Multi-node Deploy + Auth + Prometheus (2026-02-27)

P3.2 — NCS + Node Worker on NODA1

  • Added node-worker service to docker-compose.node1.yml (NODE_ID=noda1)
  • NCS on NODA1 now has NODE_WORKER_URL for metrics collection
  • Fixed NODE_ID consistency: router on NODA1 now uses noda1 (was node-1-hetzner-gex44)
  • Global pool will show 2 nodes after NODA1 deployment

P3.3 — NATS Accounts/Auth Config

  • Created config/nats-server.conf with 3 accounts: SYS, FABRIC, APP
  • FABRIC account: per-user permissions for router, ncs, node_worker
  • Leafnode listener on :7422 with auth
  • Opt-in: not yet active (requires credential setup + client changes)

P3.4 — Prometheus Counters

  • Router (/fabric_metrics):
    • fabric_caps_refresh_total{status}, fabric_caps_stale_total
    • fabric_model_select_total{chosen_node,chosen_runtime,type}
    • fabric_offload_total{status,node,type}
    • fabric_breaker_state{node,type} (gauge)
    • fabric_score_ms (histogram, buckets from 100 ms to 10000 ms)
  • Node Worker (/prom_metrics):
    • node_worker_jobs_total{type,status}
    • node_worker_inflight (gauge)
    • node_worker_latency_ms{type,model} (histogram)
  • NCS (/prom_metrics):
    • ncs_runtime_health{runtime} (gauge)
    • ncs_runtime_p50_ms{runtime}, ncs_runtime_p95_ms{runtime}
    • ncs_node_wait_ms
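
As an illustration of how a labeled counter such as fabric_offload_total{status,node,type} could appear on a scrape endpoint, here is a minimal stdlib-only sketch that renders samples in the Prometheus text exposition format. The LabeledCounter class and the label values are hypothetical; the actual services may well use a Prometheus client library instead, and this sketch does not escape special characters in label values.

```python
from collections import defaultdict


class LabeledCounter:
    """Tiny stand-in for a labeled Prometheus counter (hypothetical helper)."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        self._values = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        # Require exactly the declared label set.
        if set(labels) != set(self.label_names):
            raise ValueError(f"expected labels {self.label_names}")
        key = tuple(labels[n] for n in self.label_names)
        self._values[key] += amount

    def render(self):
        # One sample line per label combination, preceded by a TYPE comment.
        lines = [f"# TYPE {self.name} counter"]
        for key, value in sorted(self._values.items()):
            pairs = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{pairs}}} {value:g}")
        return "\n".join(lines)


# Label values below are illustrative only.
offload = LabeledCounter("fabric_offload_total", ["status", "node", "type"])
offload.inc(status="ok", node="noda2", type="llm")
offload.inc(status="ok", node="noda2", type="llm")
offload.inc(status="timeout", node="noda2", type="vision")
print(offload.render())
```

Each distinct label combination becomes its own time series, so label sets such as {chosen_node,chosen_runtime,type} are best kept to low-cardinality values.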

v0.3 — P3.1 GPU/Queue-aware Routing (2026-02-27)

NCS (Node Capabilities Service)

  • NEW metrics.py module: NodeLoad + RuntimeLoad collection
  • Capabilities payload now includes node_load and runtime_load
  • node_load: inflight_jobs, queue_depth, concurrency_limit, estimated_wait_ms, cpu_load_1m, mem_pressure
  • runtime_load: per-runtime healthy status, p50_ms, p95_ms from rolling window
  • NEW POST /capabilities/report_latency — accepts latency reports from node-worker
  • NCS fetches worker metrics via NODE_WORKER_URL env
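
The node_load and runtime_load fields listed above can be pictured as two small records. The sketch below is an assumption about their shape, not the contents of metrics.py: the field names come from the changelog, while the defaults, the list layout of runtime_load, the runtime name "runtime_a", and the nearest-rank percentile method are all hypothetical choices.

```python
import math
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class NodeLoad:
    inflight_jobs: int = 0
    queue_depth: int = 0
    concurrency_limit: int = 1
    estimated_wait_ms: float = 0.0
    cpu_load_1m: float = 0.0
    mem_pressure: float = 0.0


@dataclass
class RuntimeLoad:
    runtime: str
    healthy: bool
    p50_ms: Optional[float]
    p95_ms: Optional[float]


def percentile(samples, pct):
    """Nearest-rank percentile; returns None for an empty window."""
    if not samples:
        return None
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def runtime_load(runtime, healthy, window):
    # Summarize a rolling window of latencies (in ms) for one runtime.
    return RuntimeLoad(runtime, healthy, percentile(window, 50), percentile(window, 95))


window = [120, 80, 100, 400, 90]  # illustrative latencies in ms
payload = {
    "node_load": asdict(NodeLoad(inflight_jobs=1, concurrency_limit=2)),
    "runtime_load": [asdict(runtime_load("runtime_a", True, window))],
}
print(payload)
```

Because both blocks are additive, a consumer that only reads the older capabilities fields can ignore them.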

Node Worker

  • NEW GET /metrics endpoint: inflight_jobs, concurrency_limit, last_latencies_llm/vision
  • Latency tracking: rolling buffer of last 50 latencies per type
  • Fire-and-forget latency reporting to NCS after each successful job
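
A rolling buffer of the last 50 latencies per type is commonly built on a bounded deque. The sketch below shows that idea only; the LatencyTracker class is hypothetical, and the snapshot keys last_latencies_llm and last_latencies_vision are an assumed expansion of the "last_latencies_llm/vision" shorthand above.

```python
from collections import deque

WINDOW_SIZE = 50  # from the changelog: last 50 latencies per type


class LatencyTracker:
    """Keeps the most recent latencies per job type (hypothetical helper)."""

    def __init__(self, job_types=("llm", "vision"), size=WINDOW_SIZE):
        self._buffers = {t: deque(maxlen=size) for t in job_types}

    def record(self, job_type, latency_ms):
        # A deque with maxlen drops the oldest sample once it is full.
        self._buffers[job_type].append(latency_ms)

    def snapshot(self):
        # Copy to plain lists so the result can be serialized as JSON.
        return {f"last_latencies_{t}": list(buf) for t, buf in self._buffers.items()}


tracker = LatencyTracker()
for i in range(60):
    tracker.record("llm", 100 + i)
snap = tracker.snapshot()
print(len(snap["last_latencies_llm"]), snap["last_latencies_vision"])
```

The bounded buffer keeps memory constant regardless of how many jobs the worker has processed.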

Router (model_select v3)

  • NEW score_candidate() function: wait + model_latency + cross_penalty + prefer_bonus
  • Selection uses scoring instead of simple local-first ordering
  • LOCAL_THRESHOLD_MS = 250: prefer the local candidate when its score is within 250 ms of the best remote candidate
  • ModelSelection.score field added
  • Structured log format: [score] agent=X type=Y chosen=LOCAL:node/model score=N
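
The scoring rule above can be sketched roughly as follows. Only the formula terms, LOCAL_THRESHOLD_MS = 250, and the log format come from the changelog; the Candidate type, the select and log_line helpers, the penalty and bonus values, the "lower is better" convention, and the REMOTE label are assumptions made for illustration.

```python
from dataclasses import dataclass

LOCAL_THRESHOLD_MS = 250  # from the changelog
CROSS_PENALTY_MS = 50     # assumed value, not from the source
PREFER_BONUS_MS = -25     # assumed value; negative so preferred models score lower


@dataclass
class Candidate:
    node: str
    model: str
    local: bool
    wait_ms: float
    model_latency_ms: float


def score_candidate(c, prefer=()):
    """Lower is better: wait + model_latency + cross_penalty + prefer_bonus."""
    cross_penalty = 0 if c.local else CROSS_PENALTY_MS
    prefer_bonus = PREFER_BONUS_MS if c.model in prefer else 0
    return c.wait_ms + c.model_latency_ms + cross_penalty + prefer_bonus


def select(candidates, prefer=()):
    scored = sorted(((score_candidate(c, prefer), c) for c in candidates),
                    key=lambda pair: pair[0])
    best_score, best = scored[0]
    # Keep work local when a local score is within the threshold of the winner.
    for s, c in scored:
        if c.local and s - best_score <= LOCAL_THRESHOLD_MS:
            return s, c
    return best_score, best


def log_line(agent, job_type, score, c):
    where = "LOCAL" if c.local else "REMOTE"
    return (f"[score] agent={agent} type={job_type} "
            f"chosen={where}:{c.node}/{c.model} score={score:.0f}")


remote = Candidate("noda2", "model-a", local=False, wait_ms=0, model_latency_ms=400)
busy_local = Candidate("noda1", "model-a", local=True, wait_ms=300, model_latency_ms=400)
score, chosen = select([busy_local, remote])
print(log_line("agent-x", "llm", score, chosen))
```

In this example the remote candidate scores 450 and the local one 700; the 250 ms gap is still within the threshold, so the local node is kept. A local wait of 900 ms would push the choice to the remote node.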

Tests

  • 12 scoring tests (local wins, remote wins, exclude, breaker, type filter, prefer list, cross penalty, wait, threshold)
  • 7 NCS metrics tests (latency stats, cpu load, mem pressure, node load, runtime load)

No Breaking Changes

  • JobRequest/JobResponse envelope unchanged
  • Existing capabilities fields preserved
  • All new fields are optional/additive

v0.2 — P2.2+P2.3 NATS Offload (2026-02-26)

  • Node Worker service (NATS offload executor)
  • offload_client.py (circuit breaker, retries, deadline)
  • model_select with exclude_nodes + force_local
  • Router /infer remote offload path
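
The combination of circuit breaker, retries, and deadline named above follows a common pattern, sketched below in generic form. This is not the code in offload_client.py: the class, the function, and every threshold (three failures, 30 s cooldown, two retries, 5 s deadline) are assumptions.

```python
import time


class CircuitBreaker:
    """Opens after consecutive failures, then allows a trial call after a cooldown."""

    def __init__(self, max_failures=3, cooldown_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def allow(self):
        if self._opened_at is None:
            return True
        # Half-open: permit a trial call once the cooldown has elapsed.
        return self._clock() - self._opened_at >= self.cooldown_s

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.max_failures:
            self._opened_at = self._clock()


def call_with_retries(fn, breaker, retries=2, deadline_s=5.0, clock=time.monotonic):
    """Call fn up to retries + 1 times, stopping at the deadline or an open breaker."""
    stop_at = clock() + deadline_s
    last_error = None
    for _ in range(retries + 1):
        if not breaker.allow():
            raise RuntimeError("circuit open")
        if clock() >= stop_at:
            break
        try:
            result = fn()
        except Exception as exc:
            breaker.record_failure()
            last_error = exc
        else:
            breaker.record_success()
            return result
    raise RuntimeError("offload failed or deadline exceeded") from last_error
```

A caller would typically keep one breaker per remote node, so a failing node can be skipped (for example through exclude_nodes) without affecting healthy ones.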

v0.1 — P2 Global Capabilities (2026-02-26)

  • Node Capabilities Service (NCS) on each node
  • global_capabilities_client.py (NATS scatter-gather discovery)
  • model_select v2 (multi-node aware)
  • NATS wildcard discovery: node.*.capabilities.get
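
For readers unfamiliar with NATS wildcards: in a subject pattern, "*" matches exactly one dot-separated token and ">" matches one or more trailing tokens. The helper below is a stdlib-only illustration of those matching rules applied to node.*.capabilities.get; the function name is hypothetical, and it says nothing about which component subscribes or publishes in the actual discovery flow.

```python
def subject_matches(pattern, subject):
    """Match a NATS subject against a pattern with '*' and '>' wildcards."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' must be the last token and matches one or more remaining tokens.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)


pattern = "node.*.capabilities.get"
for subject in ("node.noda1.capabilities.get",
                "node.noda2.capabilities.get",
                "node.noda1.extra.capabilities.get"):
    print(subject, subject_matches(pattern, subject))
```

Since "*" covers a single token, the node ID can be recovered from a matching subject as its second token.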

v0.0 — P1 NCS-first Selection (2026-02-26)

  • capabilities_client.py (single-node HTTP)
  • model_select v1 (profile → NCS → static fallback)
  • Grok API integration fix