P3.2+P3.3+P3.4: NODA1 node-worker + NATS auth config + Prometheus counters

P3.2 — Multi-node deployment:
- Added node-worker service to docker-compose.node1.yml (NODE_ID=noda1)
- NCS NODA1 now has NODE_WORKER_URL for metrics collection
- Fixed NODE_ID consistency: router NODA1 uses 'noda1'
- NODA2 node-worker/NCS gets NCS_REPORT_URL for latency reporting

P3.3 — NATS accounts/auth (opt-in config):
- config/nats-server.conf with 3 accounts: SYS, FABRIC, APP
- Per-user topic permissions (router, ncs, node_worker)
- Leafnode listener :7422 with auth
- Not yet activated (requires credential provisioning)
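
A minimal sketch of how per-user topic permissions behave, assuming standard NATS subject wildcards ('*' matches exactly one token, '>' matches one or more trailing tokens). The subject names and the user-to-subject mapping below are hypothetical, since config/nats-server.conf is not shown in this commit page. Only the user names router, ncs, node_worker come from the message above.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if a NATS subject matches a permission pattern.

    '*' matches exactly one token; '>' matches one or more trailing tokens.
    """
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' must be the last pattern token and needs at least one token left.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)


# Hypothetical publish allow-lists per user; these subjects are invented
# for illustration and are not taken from the real config file.
PUBLISH_ALLOW = {
    "router": ["node.*.jobs.>"],
    "ncs": ["node.*.caps.report"],
    "node_worker": ["node.noda1.results.>"],
}


def may_publish(user: str, subject: str) -> bool:
    """Check a subject against the user's allow-list; unknown users get nothing."""
    return any(subject_matches(p, subject) for p in PUBLISH_ALLOW.get(user, []))
```

With a mapping like this, a node_worker bound to noda1 could publish results only under its own node prefix, which is the kind of isolation per-user permissions are meant to give.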

P3.4 — Prometheus counters:
- Router /fabric_metrics: caps_refresh, caps_stale, model_select,
  offload_total, breaker_state, score_ms histogram
- Node Worker /prom_metrics: jobs_total, inflight gauge, latency_ms histogram
- NCS /prom_metrics: runtime_health, runtime_p50/p95, node_wait_ms
- All bound to 127.0.0.1 (not externally exposed)
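
A rough, stdlib-only sketch of what a labelled counter behind /fabric_metrics could look like. Only the function names inc_offload and get_metrics_text (and the argument order status, node, required_type) appear in the diff; the metric name fabric_offload_total, the label names, and the internal storage are assumptions, and the real fabric_metrics module may well use a Prometheus client library instead.

```python
import threading
from collections import defaultdict

# Hypothetical stand-in for a fabric_metrics-style module.
_lock = threading.Lock()
_offload_total = defaultdict(int)  # (status, node, required_type) -> count


def inc_offload(status: str, node: str, required_type: str) -> None:
    """Increment the offload counter for one label combination."""
    with _lock:
        _offload_total[(status, node, required_type)] += 1


def _escape(value: str) -> str:
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def get_metrics_text() -> str:
    """Render the counter in Prometheus text exposition format."""
    lines = [
        "# HELP fabric_offload_total Offload attempts by outcome.",
        "# TYPE fabric_offload_total counter",
    ]
    with _lock:
        for (status, node, rtype), count in sorted(_offload_total.items()):
            lines.append(
                f'fabric_offload_total{{status="{_escape(status)}",'
                f'node="{_escape(node)}",type="{_escape(rtype)}"}} {count}'
            )
    return "\n".join(lines) + "\n"
```

Keeping the status as a label rather than a separate counter per outcome means the "ok", "no_reply", and error paths in agent_infer can all call the same function.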

Made-with: Cursor
Apple
2026-02-27 03:03:18 -08:00
parent a605b8c43e
commit ed7ad49d3a
13 changed files with 408 additions and 1 deletions


@@ -52,6 +52,7 @@ try:
    import global_capabilities_client
    from model_select import select_model_for_agent, ModelSelection, CLOUD_PROVIDERS as NCS_CLOUD_PROVIDERS
    import offload_client
    import fabric_metrics as fm
    NCS_AVAILABLE = True
except ImportError:
    NCS_AVAILABLE = False
@@ -940,6 +941,17 @@ async def healthz():
    return await health()


@app.get("/fabric_metrics")
async def fabric_metrics_endpoint():
    """Prometheus metrics for Fabric routing layer."""
    if NCS_AVAILABLE:
        data = fm.get_metrics_text()
        if data:
            from starlette.responses import Response
            return Response(content=data, media_type="text/plain; charset=utf-8")
    return {"error": "fabric metrics not available"}


@app.get("/monitor/status")
async def monitor_status(request: Request = None):
    """
@@ -1747,6 +1759,8 @@ async def agent_infer(agent_id: str, request: InferRequest):
        timeout_ms=infer_timeout,
    )
    if offload_resp and offload_resp.get("status") == "ok":
        if NCS_AVAILABLE:
            fm.inc_offload("ok", ncs_selection.node, job_payload["required_type"])
        result_text = offload_resp.get("result", {}).get("text", "")
        return InferResponse(
            response=result_text,
@@ -1756,6 +1770,8 @@ async def agent_infer(agent_id: str, request: InferRequest):
        )
    else:
        offload_status = offload_resp.get("status", "none") if offload_resp else "no_reply"
        if NCS_AVAILABLE:
            fm.inc_offload(offload_status, ncs_selection.node, job_payload["required_type"])
        logger.warning(
            f"[fallback] offload to {ncs_selection.node} failed ({offload_status}) "
            f"→ re-selecting with exclude={ncs_selection.node}, force_local"