# P3.1: GPU/Queue-aware Routing Runbook

## What Changed

NCS now exposes **runtime health and load metrics** alongside model inventory.
The router uses a **scoring function** to pick the fastest node+model combination.
The node-worker reports latencies back to NCS for p50/p95 calculation.

## Verification Commands

### 1. NCS capabilities with load metrics

```bash
curl -s http://127.0.0.1:8099/capabilities | jq '.node_load'
```

Expected: `inflight_jobs`, `estimated_wait_ms`, `cpu_load_1m`, `mem_pressure`

### 2. Runtime load (p50/p95)

```bash
curl -s http://127.0.0.1:8099/capabilities | jq '.runtime_load'
```

Expected: per-runtime `p50_ms`, `p95_ms` after some traffic

### 3. Node-worker metrics

```bash
curl -s http://127.0.0.1:8109/metrics | jq .
```

Expected: `inflight_jobs`, `concurrency_limit`, `last_latencies_llm`

### 4. NATS capabilities (includes metrics)

```bash
nats req node.noda2.capabilities.get '{}'
```

### 5. Router scoring logs

```bash
docker logs dagi-router-node2 2>&1 | grep '\[score\]'
```

Expected: `chosen=LOCAL:nodeX/modelY score=NNN`

### 6. Report latency manually

```bash
curl -s -X POST http://127.0.0.1:8099/capabilities/report_latency \
  -H "Content-Type: application/json" \
  -d '{"runtime":"ollama","type":"llm","latency_ms":450}'
```

## Scoring Formula

```
score = wait + model_latency + cross_node_penalty + prefer_bonus

wait               = node_load.estimated_wait_ms (0 if idle)
model_latency      = model_p50_ms or runtime p50_ms or 1500 (default)
cross_node_penalty = 0 if local, else rtt_ms * 2
prefer_bonus       = -1000 for first prefer match, -900 for second, etc.
```

All terms are in milliseconds. If `best_local_score <= best_remote_score + 250`, the router prefers the local node.
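The scoring formula can be worked through by hand with a small sketch. This is not the router's actual code: the `Candidate` type, the function names, and the field layout are hypothetical. It also assumes that the lowest score wins, that the prefer bonus keeps stepping by 100 per position ("etc." in the formula), and that the bonus never becomes a penalty.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

DEFAULT_LATENCY_MS = 1500  # fallback when no p50 is known (from the formula)
LOCAL_MARGIN_MS = 250      # local node wins if within this margin of the best remote


@dataclass
class Candidate:
    """Hypothetical container for one node+model option."""
    node: str
    model: str
    is_local: bool
    estimated_wait_ms: float = 0.0          # node_load.estimated_wait_ms
    model_p50_ms: Optional[float] = None
    runtime_p50_ms: Optional[float] = None
    rtt_ms: float = 0.0                     # only relevant for remote nodes


def prefer_bonus(model: str, prefer: Sequence[str]) -> float:
    # -1000 for the first prefer match, -900 for the second, and so on.
    # Assumption: steps of 100, clamped so the bonus never turns positive.
    if model in prefer:
        return min(0, -1000 + 100 * list(prefer).index(model))
    return 0


def score(c: Candidate, prefer: Sequence[str] = ()) -> float:
    # model_latency: model p50, else runtime p50, else the 1500 ms default.
    if c.model_p50_ms is not None:
        model_latency = c.model_p50_ms
    elif c.runtime_p50_ms is not None:
        model_latency = c.runtime_p50_ms
    else:
        model_latency = DEFAULT_LATENCY_MS
    cross_node_penalty = 0 if c.is_local else c.rtt_ms * 2
    return (c.estimated_wait_ms + model_latency
            + cross_node_penalty + prefer_bonus(c.model, prefer))


def choose(candidates: Sequence[Candidate],
           prefer: Sequence[str] = ()) -> Optional[Candidate]:
    # Assumption: lower score is better.
    def key(c: Candidate) -> float:
        return score(c, prefer)

    best_local = min((c for c in candidates if c.is_local), key=key, default=None)
    best_remote = min((c for c in candidates if not c.is_local), key=key, default=None)
    if best_local is None:
        return best_remote
    if best_remote is None:
        return best_local
    # Prefer local unless the remote option is more than 250 ms better.
    if key(best_local) <= key(best_remote) + LOCAL_MARGIN_MS:
        return best_local
    return best_remote


local = Candidate("node1", "m", is_local=True, estimated_wait_ms=400, model_p50_ms=500)
remote = Candidate("node2", "m", is_local=False, model_p50_ms=500, rtt_ms=40)
print(score(local), score(remote), choose([local, remote]).node)
```

With these made-up numbers the local score is 400 + 500 = 900 and the remote score is 500 + 2 * 40 = 580. Since 900 > 580 + 250, the remote node is chosen; dropping the local wait to 300 ms would bring the local score to 800 and flip the decision back to local.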
## Estimated Wait Formula

```
if inflight_jobs < concurrency_limit:
    estimated_wait = 0
else:
    estimated_wait = (inflight_jobs - concurrency_limit + 1) * p50_ms
```

## Troubleshooting

| Symptom | Check | Fix |
|---------|-------|-----|
| NCS shows `p50=null` | No traffic yet | Send test requests |
| `estimated_wait_ms` always 0 | Inflight < limit | Expected if not saturated |
| `mem_pressure=null` | Container lacks `memory_pressure` | Expected in Docker |
| Scoring always picks local | Remote score higher | Check remote rtt/wait |
| Node-worker latencies empty | NCS can't reach worker | Check `NODE_WORKER_URL` env |
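When checking the `estimated_wait_ms always 0` symptom, it can help to evaluate the Estimated Wait Formula by hand. The sketch below is a direct transcription of that formula; the function name is hypothetical, and it assumes a p50 value is already available (the runbook does not say what NCS does while `p50` is still null).

```python
def estimated_wait_ms(inflight_jobs: int, concurrency_limit: int, p50_ms: float) -> float:
    """Transcription of the runbook formula; not the NCS implementation."""
    if inflight_jobs < concurrency_limit:
        # A free slot exists, so a new job starts immediately.
        return 0.0
    # Jobs queued ahead of the new one (plus the new one itself) times the median latency.
    return (inflight_jobs - concurrency_limit + 1) * p50_ms


# Example with concurrency_limit=2 and p50_ms=450 (illustrative values).
for inflight in (1, 2, 4):
    print(inflight, estimated_wait_ms(inflight, 2, 450))
```

With a limit of 2 and a p50 of 450 ms, one in-flight job gives 0 ms, two give 450 ms, and four give 1350 ms, so a non-zero value only appears once the node is saturated.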