P3.2+P3.3+P3.4: NODA1 node-worker + NATS auth config + Prometheus counters
P3.2 — Multi-node deployment:
- Added node-worker service to docker-compose.node1.yml (NODE_ID=noda1)
- NCS NODA1 now has NODE_WORKER_URL for metrics collection
- Fixed NODE_ID consistency: router NODA1 uses 'noda1'
- NODA2 node-worker/NCS gets NCS_REPORT_URL for latency reporting

P3.3 — NATS accounts/auth (opt-in config):
- config/nats-server.conf with 3 accounts: SYS, FABRIC, APP
- Per-user topic permissions (router, ncs, node_worker)
- Leafnode listener :7422 with auth
- Not yet activated (requires credential provisioning)

P3.4 — Prometheus counters:
- Router /fabric_metrics: caps_refresh, caps_stale, model_select, offload_total, breaker_state, score_ms histogram
- Node Worker /prom_metrics: jobs_total, inflight gauge, latency_ms histogram
- NCS /prom_metrics: runtime_health, runtime_p50/p95, node_wait_ms
- All bound to 127.0.0.1 (not externally exposed)

Made-with: Cursor
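The message says the metrics endpoints are bound to 127.0.0.1, but the serving code is not part of this diff. As a rough illustration only, the sketch below shows one way a `/prom_metrics` route could return the output of a `get_metrics_text()`-style callable on a loopback-only listener. It uses the standard library's `http.server`; the web framework the services actually use is not shown in the commit, and `build_response`, `make_handler`, `serve_metrics`, the port `9109`, and the 503 fallback are all hypothetical choices.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Optional, Tuple

# Content type of the Prometheus text exposition format.
PROM_CONTENT_TYPE = "text/plain; version=0.0.4; charset=utf-8"


def build_response(get_text: Callable[[], Optional[bytes]]) -> Tuple[int, str, bytes]:
    """Return (status, content_type, body) for one metrics scrape."""
    payload = get_text()
    if payload is None:
        # Assumption: a disabled metrics module is reported as 503.
        return 503, "text/plain; charset=utf-8", b"metrics unavailable\n"
    return 200, PROM_CONTENT_TYPE, payload


def make_handler(get_text: Callable[[], Optional[bytes]]):
    """Build a request handler class that serves only /prom_metrics."""

    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/prom_metrics":
                self.send_error(404)
                return
            status, ctype, body = build_response(get_text)
            self.send_response(status)
            self.send_header("Content-Type", ctype)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, fmt, *args):
            pass  # keep scrape requests out of stderr

    return MetricsHandler


def serve_metrics(get_text: Callable[[], Optional[bytes]], port: int = 9109) -> HTTPServer:
    """Create (but do not start) a server bound to the loopback interface."""
    # Binding to 127.0.0.1 keeps the endpoint off external interfaces.
    return HTTPServer(("127.0.0.1", port), make_handler(get_text))
```

Calling `serve_metrics(get_metrics_text).serve_forever()` in a background thread would start the listener. Because the socket is bound to 127.0.0.1, a Prometheus scraper would have to run on the same host or reach the port through a local proxy.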
50
services/node-worker/fabric_metrics.py
Normal file
@@ -0,0 +1,50 @@
"""Prometheus metrics for Node Worker."""
import logging

logger = logging.getLogger("worker_metrics")

try:
    from prometheus_client import Counter, Gauge, Histogram, CollectorRegistry, generate_latest
    PROM_AVAILABLE = True
    REGISTRY = CollectorRegistry()

    jobs_total = Counter(
        "node_worker_jobs_total", "Jobs processed",
        ["type", "status"], registry=REGISTRY,
    )
    inflight_gauge = Gauge(
        "node_worker_inflight", "Currently inflight jobs",
        registry=REGISTRY,
    )
    latency_hist = Histogram(
        "node_worker_latency_ms", "Job latency in ms",
        ["type", "model"],
        buckets=[100, 250, 500, 1000, 2500, 5000, 15000, 30000],
        registry=REGISTRY,
    )

except ImportError:
    PROM_AVAILABLE = False
    REGISTRY = None
    logger.info("prometheus_client not installed, worker metrics disabled")


def inc_job(req_type: str, status: str):
    if PROM_AVAILABLE:
        jobs_total.labels(type=req_type, status=status).inc()


def set_inflight(count: int):
    if PROM_AVAILABLE:
        inflight_gauge.set(count)


def observe_latency(req_type: str, model: str, latency_ms: int):
    if PROM_AVAILABLE:
        latency_hist.labels(type=req_type, model=model).observe(latency_ms)


def get_metrics_text():
    if PROM_AVAILABLE and REGISTRY:
        return generate_latest(REGISTRY)
    return None
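The diff adds the helpers but does not show their call sites. The sketch below is one plausible way a worker could wrap a job so that all three helpers are updated consistently. `JobTracker` is a hypothetical name, the `"ok"` / `"error"` status values are assumptions (the commit does not list the label values actually used), and the helpers are passed in as callables so the example runs without `prometheus_client` installed.

```python
import time
from typing import Any, Callable


class JobTracker:
    """Hypothetical wrapper around inc_job, set_inflight and observe_latency."""

    def __init__(
        self,
        inc_job: Callable[[str, str], None],
        set_inflight: Callable[[int], None],
        observe_latency: Callable[[str, str, int], None],
    ):
        self._inc_job = inc_job
        self._set_inflight = set_inflight
        self._observe_latency = observe_latency
        self._inflight = 0  # not thread-safe; a real worker may need a lock

    def run(self, req_type: str, model: str, fn: Callable[[], Any]) -> Any:
        """Run fn() and record inflight count, outcome and latency."""
        self._inflight += 1
        self._set_inflight(self._inflight)
        start = time.monotonic()
        status = "ok"  # assumed label value
        try:
            return fn()
        except Exception:
            status = "error"  # assumed label value
            raise
        finally:
            # Runs on success and on failure, so the gauge cannot leak.
            elapsed_ms = int((time.monotonic() - start) * 1000)
            self._inflight -= 1
            self._set_inflight(self._inflight)
            self._inc_job(req_type, status)
            self._observe_latency(req_type, model, elapsed_ms)
```

With the module from this commit importable, usage would look like `JobTracker(inc_job, set_inflight, observe_latency).run("llm", "some-model", do_work)`, where `do_work` and the label values are placeholders. Doing the bookkeeping in `finally` means a failed job still decrements the inflight gauge and still lands in the latency histogram.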