P3.1: GPU/Queue-aware routing — NCS metrics + scoring-based model selection
NCS (services/node-capabilities/metrics.py):
- NodeLoad: inflight_jobs, queue_depth, concurrency_limit, estimated_wait_ms, cpu_load_1m, mem_pressure (macOS + Linux), rtt_ms_to_hub
- RuntimeLoad: per-runtime healthy, p50_ms, p95_ms from rolling 50-sample window
- POST /capabilities/report_latency for node-worker → NCS reporting
- NCS fetches worker metrics via NODE_WORKER_URL

Node Worker:
- GET /metrics endpoint (inflight, concurrency, latency buffers)
- Latency tracking per job type (llm/vision) with rolling buffer
- Fire-and-forget latency reporting to NCS after each successful job

Router (model_select v3):
- score_candidate(): wait + model_latency + cross_node_penalty + prefer_bonus
- LOCAL_THRESHOLD_MS=250: prefer local if within threshold of remote
- ModelSelection.score field for observability
- Structured [score] logs with chosen node, model, and score breakdown

Tests: 19 new (12 scoring + 7 NCS metrics), 36 total pass
Docs: ops/runbook_p3_1.md, ops/CHANGELOG_FABRIC.md
No breaking changes to JobRequest/JobResponse or capabilities schema.

Made-with: Cursor
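The commit message names the scoring terms (wait + model_latency + cross_node_penalty + prefer_bonus) and the LOCAL_THRESHOLD_MS=250 local-preference rule, but not the actual code. A minimal sketch of how such a scorer could work — the `Candidate` shape, the penalty/bonus weights, and the `select` helper are illustrative assumptions, not the router's real implementation:

```python
from dataclasses import dataclass

LOCAL_THRESHOLD_MS = 250  # from the commit message: prefer local within this margin


@dataclass
class Candidate:
    # Hypothetical candidate shape; field names here are assumptions.
    node_id: str
    estimated_wait_ms: float   # queue wait, e.g. from NodeLoad
    model_latency_ms: float    # expected model latency, e.g. from RuntimeLoad p50/p95
    is_local: bool
    preferred: bool = False


def score_candidate(c: Candidate,
                    cross_node_penalty_ms: float = 100.0,  # assumed weight
                    prefer_bonus_ms: float = 50.0) -> float:  # assumed weight
    """Lower score wins: wait + model latency + penalties - bonuses."""
    score = c.estimated_wait_ms + c.model_latency_ms
    if not c.is_local:
        score += cross_node_penalty_ms
    if c.preferred:
        score -= prefer_bonus_ms
    return score


def select(candidates: list[Candidate]) -> Candidate:
    best = min(candidates, key=score_candidate)
    # Local preference: if the best overall is remote but some local
    # candidate scores within LOCAL_THRESHOLD_MS of it, pick the local one.
    if not best.is_local:
        local = [c for c in candidates if c.is_local]
        if local:
            best_local = min(local, key=score_candidate)
            if score_candidate(best_local) - score_candidate(best) <= LOCAL_THRESHOLD_MS:
                return best_local
    return best
```

With these assumed weights, a local node that is up to 250 ms worse than the best remote still wins, which matches the "prefer local if within threshold of remote" bullet above.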
@@ -4,10 +4,14 @@ import time
 import logging
 from typing import Any, Dict, List, Optional
 
-from fastapi import FastAPI
+from fastapi import FastAPI, Request
 from fastapi.responses import JSONResponse
 import httpx
+
+from metrics import (
+    build_node_load, build_runtime_load, record_latency,
+)
 
 logging.basicConfig(level=logging.INFO)
 logger = logging.getLogger("node-capabilities")
@@ -195,20 +199,24 @@ async def _build_capabilities() -> Dict[str, Any]:
     disk = _collect_disk_inventory()
     served = _build_served_models(ollama, swapper, llama)
 
+    runtimes = {"ollama": ollama, "swapper": swapper}
+    if llama:
+        runtimes["llama_server"] = llama
+
+    node_load = await build_node_load()
+    runtime_load = await build_runtime_load(runtimes)
+
     result = {
         "node_id": NODE_ID,
         "updated_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
-        "runtimes": {
-            "ollama": ollama,
-            "swapper": swapper,
-        },
+        "runtimes": runtimes,
         "served_models": served,
         "served_count": len(served),
+        "node_load": node_load,
+        "runtime_load": runtime_load,
         "inventory_only": disk,
         "inventory_count": len(disk),
     }
-    if llama:
-        result["runtimes"]["llama_server"] = llama
 
     _cache = result
     _cache_ts = time.time()
@@ -240,6 +248,17 @@ async def capabilities_refresh():
     return JSONResponse(content={"refreshed": True, "served_count": data["served_count"]})
 
 
+@app.post("/capabilities/report_latency")
+async def report_latency_endpoint(request: Request):
+    data = await request.json()
+    runtime = data.get("runtime", "ollama")
+    req_type = data.get("type", "llm")
+    latency_ms = data.get("latency_ms", 0)
+    if latency_ms > 0:
+        record_latency(runtime, req_type, latency_ms)
+    return {"ok": True}
+
+
 # ── NATS request/reply (optional) ─────────────────────────────────────────────
 
 ENABLE_NATS = os.getenv("ENABLE_NATS_CAPS", "false").lower() in ("true", "1", "yes")