node2: fix Sofiia routing determinism + Node Capabilities Service

Bug fixes:
- Bug A: GROK_API_KEY env mismatch — router expected GROK_API_KEY but only
  XAI_API_KEY was present. Added GROK_API_KEY=${XAI_API_KEY} alias in compose.
- Bug B: 'grok' profile missing in router-config.node2.yml — added cloud_grok
  profile (provider: grok, model: grok-2-1212). Sofiia now has
  default_llm=cloud_grok with fallback_llm=local_default_coder.
- Bug C: Router silently defaulted to cloud DeepSeek when the profile was unknown.
  Now falls back to agent.fallback_llm or local_default_coder with a WARNING log.
  The hardcoded Ollama URL (172.18.0.1) was replaced with a config-driven base_url.

New service: Node Capabilities Service (NCS)
- services/node-capabilities/ — FastAPI microservice exposing live model
  inventory from Ollama, Swapper, and llama-server.
- GET /capabilities — canonical JSON with served_models[] and inventory_only[]
- GET /capabilities/models — flat list of served models
- POST /capabilities/refresh — force cache refresh
- Cache TTL 15s, bound to 127.0.0.1:8099
- services/router/capabilities_client.py — async client with TTL cache
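The canonical payload can be consumed without the bundled client. A minimal sketch, assuming the `served_models[]` / `inventory_only[]` shape described above (the sample entries are illustrative, not live NCS output):

```python
# Hypothetical consumer of GET /capabilities output; SAMPLE mirrors the
# documented payload shape, entries are illustrative.
from typing import Any, Dict, List, Optional

SAMPLE: Dict[str, Any] = {
    "node_id": "NODA2",
    "served_models": [
        {"name": "qwen3:14b", "type": "llm", "runtime": "ollama"},
        {"name": "deepseek-coder:33b", "type": "code", "runtime": "ollama"},
        {"name": "llava-13b", "type": "vision", "runtime": "swapper"},
    ],
    "inventory_only": [
        {"name": "Kokoro-82M-bf16", "type": "tts", "served": False},
    ],
}

def pick_served(caps: Dict[str, Any], model_type: str,
                runtime: Optional[str] = None) -> Optional[Dict[str, Any]]:
    """First routable model of the given type; inventory_only is never eligible."""
    candidates: List[Dict[str, Any]] = [
        m for m in caps.get("served_models", []) if m.get("type") == model_type
    ]
    if runtime:
        candidates = [m for m in candidates if m.get("runtime") == runtime]
    return candidates[0] if candidates else None

print(pick_served(SAMPLE, "vision")["name"])  # llava-13b
print(pick_served(SAMPLE, "tts"))             # None — on disk, not served
```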

Artifacts:
- ops/node2_models_audit.md — 3-layer model view (served/disk/cloud)
- ops/node2_models_audit.yml — machine-readable audit
- ops/node2_capabilities_example.json — sample NCS output (14 served models)

Made-with: Cursor
Apple
2026-02-27 02:07:40 -08:00
parent 3965f68fac
commit e2a3ae342a
10 changed files with 867 additions and 33 deletions

View File

@@ -23,6 +23,10 @@ services:
      - PIECES_OS_URL=http://host.docker.internal:39300
      - NOTION_API_KEY=${NOTION_API_KEY:-}
      - XAI_API_KEY=${XAI_API_KEY}
      - GROK_API_KEY=${XAI_API_KEY}
      - DEEPSEEK_API_KEY=${DEEPSEEK_API_KEY:-}
      # ── Node Capabilities ─────────────────────────────────────────────────
      - NODE_CAPABILITIES_URL=http://node-capabilities:8099/capabilities
      # ── Persistence backends ──────────────────────────────────────────────
      - ALERT_BACKEND=postgres
      - ALERT_DATABASE_URL=${ALERT_DATABASE_URL:-${DATABASE_URL}}
@@ -39,6 +43,7 @@ services:
      - "daarion-city-service:host-gateway"
    depends_on:
      - dagi-nats
      - node-capabilities
    networks:
      - dagi-network
      - dagi-memory-network
@@ -103,6 +108,27 @@ services:
      - dagi-network
    restart: unless-stopped

  node-capabilities:
    build:
      context: ./services/node-capabilities
      dockerfile: Dockerfile
    container_name: node-capabilities-node2
    ports:
      - "127.0.0.1:8099:8099"
    extra_hosts:
      - "host.docker.internal:host-gateway"
    environment:
      - NODE_ID=NODA2
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
      - SWAPPER_URL=http://swapper-service:8890
      - LLAMA_SERVER_URL=http://host.docker.internal:11435
      - CACHE_TTL_SEC=15
    depends_on:
      - swapper-service
    networks:
      - dagi-network
    restart: unless-stopped

  sofiia-console:
    build:
      context: ./services/sofiia-console

File diff suppressed because one or more lines are too long

ops/node2_models_audit.md Normal file
View File

@@ -0,0 +1,125 @@
# NODA2 Model Audit — Three-Layer View
**Date:** 2026-02-27
**Node:** MacBook Pro M4 Max, 64GB unified memory
---
## Layer 1: Served by Runtime (routing-eligible)
These are models the router can actively select and invoke.
### Ollama (12 models, port 11434)
| Model | Type | Size | Status | Note |
|-------|------|------|--------|------|
| qwen3.5:35b-a3b | LLM (MoE) | 9.3 GB | idle | PRIMARY reasoning |
| qwen3:14b | LLM | 9.3 GB | idle | Default local |
| gemma3:latest | LLM | 3.3 GB | idle | Fast small |
| glm-4.7-flash:32k | LLM | 19 GB | idle | Long-context |
| glm-4.7-flash:q4_K_M | LLM | 19 GB | idle | **DUPLICATE** |
| llava:13b | Vision | 8.0 GB | idle | P0 fallback |
| mistral-nemo:12b | LLM | 7.1 GB | idle | old |
| deepseek-coder:33b | Code | 18.8 GB | idle | Heavy code |
| deepseek-r1:70b | LLM | 42.5 GB | idle | Very heavy reasoning |
| starcoder2:3b | Code | 1.7 GB | idle | Fast code |
| phi3:latest | LLM | 2.2 GB | idle | Small general |
| gpt-oss:latest | LLM | 13.8 GB | idle | old |
### Swapper (port 8890)
| Model | Type | Status |
|-------|------|--------|
| llava-13b | Vision | unloaded |
### llama-server (port 11435)
| Model | Type | Note |
|-------|------|------|
| Qwen3.5-35B-A3B-Q4_K_M.gguf | LLM | **DUPLICATE** of Ollama |
### Cloud APIs
| Provider | Model | API Key | Active |
|----------|-------|---------|--------|
| Grok (xAI) | grok-2-1212 | `GROK_API_KEY` ✅ | **Sofiia primary** |
| DeepSeek | deepseek-chat | `DEEPSEEK_API_KEY` ✅ | Other agents |
| Mistral | mistral-large | `MISTRAL_API_KEY` | Not configured |
---
## Layer 2: Installed on Disk (not served)
These are on disk but NOT reachable by router/swapper.
| Model | Type | Size | Location | Status |
|-------|------|------|----------|--------|
| whisper-large-v3-turbo (MLX) | STT | 1.5 GB | HF cache | Ready, not integrated |
| Kokoro-82M-bf16 (MLX) | TTS | 0.35 GB | HF cache | Ready, not integrated |
| MiniCPM-V-4_5 | Vision | 16 GB | HF cache | Not serving |
| Qwen3-VL-32B-Instruct | Vision | 123 GB | Cursor worktree | R&D artifact |
| Jan-v2-VL-med-Q8_0 | Vision | 9.2 GB | Jan AI | Not running |
| Qwen2.5-7B-Instruct | LLM | 14 GB | HF cache | Idle |
| Qwen2.5-1.5B-Instruct | LLM | 2.9 GB | HF cache | Idle |
| flux2-dev-Q8_0 | Image gen | 33 GB | ComfyUI | Offline |
| ltx-2-19b-distilled | Video gen | 25 GB | ComfyUI | Offline |
| SDXL-base-1.0 | Image gen | 72 GB | hf_models | Legacy |
| FLUX.2-dev (Aquiles) | Image gen | 105 GB | HF cache | ComfyUI |
---
## Layer 3: Sofiia Routing (after fix)
### Before fix (broken)
```
agent_registry: llm_profile=grok
→ router looks up "grok" in node2 config → NOT FOUND
→ llm_profile = {} → provider defaults to "deepseek" (hardcoded)
→ tries DEEPSEEK_API_KEY → may work (nondeterministic)
→ XAI_API_KEY exists but mapped as "XAI_API_KEY", not "GROK_API_KEY"
```
### After fix (deterministic)
```
agent_registry: llm_profile=grok
router-config.node2.yml:
agents.sofiia.default_llm = cloud_grok
agents.sofiia.fallback_llm = local_default_coder
llm_profiles.cloud_grok = {provider: grok, model: grok-2-1212, base_url: https://api.x.ai}
docker-compose: GROK_API_KEY=${XAI_API_KEY} (aliased)
Chain:
1. Sofiia request → router resolves cloud_grok
2. provider=grok → GROK_API_KEY present → xAI API → grok-2-1212
3. If Grok fails → fallback_llm=local_default_coder → qwen3:14b (Ollama)
4. If unknown profile → WARNING logged, uses agent.default_llm (local), NOT cloud silently
```
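The chain above reduces to a small resolution rule. A sketch of the router-side logic under the node2 profile names (the config dict here is an illustrative subset of router-config.node2.yml, not the full file):

```python
# Sketch of deterministic profile resolution: unknown profile -> local
# fallback with a WARNING, never a silent cloud default.
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("router")

# Illustrative subset of llm_profiles from router-config.node2.yml.
LLM_PROFILES = {
    "cloud_grok": {"provider": "grok", "model": "grok-2-1212"},
    "local_default_coder": {"provider": "ollama", "model": "qwen3:14b"},
}

def resolve_profile(name: str, fallback: str = "local_default_coder"):
    """Return (profile_name, profile_dict), falling back locally if unknown."""
    profile = LLM_PROFILES.get(name)
    if profile is None:
        logger.warning("Profile '%s' not found -> falling back to '%s' (local)",
                       name, fallback)
        name, profile = fallback, LLM_PROFILES[fallback]
    return name, profile

print(resolve_profile("cloud_grok")[1]["provider"])  # grok
print(resolve_profile("grok")[0])                    # local_default_coder
```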
---
## Fixes Applied in This Commit
| Bug | Fix | File |
|-----|-----|------|
| A: GROK_API_KEY not in env | Added `GROK_API_KEY=${XAI_API_KEY}` | docker-compose.node2-sofiia.yml |
| B: No `grok` profile | Added `cloud_grok` profile | router-config.node2.yml |
| B: Sofiia → wrong profile | `agents.sofiia.default_llm = cloud_grok` | router-config.node2.yml |
| C: Silent cloud fallback | Unknown profile → local default + WARNING | services/router/main.py |
| C: Hardcoded Ollama URL | `172.18.0.1:11434` → dynamic from config | services/router/main.py |
| — | Node Capabilities Service | services/node-capabilities/ |
---
## Node Capabilities Service
New microservice providing live model inventory at `GET /capabilities`:
- Collects from Ollama, Swapper, llama-server
- Returns canonical JSON with `served_models[]` and `inventory_only[]`
- Cache TTL: 15s
- Port: 127.0.0.1:8099
Verification:
```bash
curl -s http://localhost:8099/capabilities | jq '.served_models | length'
# Expected: 14
```

View File

@@ -0,0 +1,76 @@
# NODA2 Model Audit — Three-layer view
# Date: 2026-02-27
# Source: Node Capabilities Service + manual disk scan
# ─── LAYER 1: SERVED BY RUNTIME (routing-eligible) ───────────────────────────
served_by_runtime:
  ollama:
    base_url: http://host.docker.internal:11434
    version: "0.17.1"
    models:
      - {name: "qwen3.5:35b-a3b", type: llm, size_gb: 9.3, params: "14.8B MoE"}
      - {name: "qwen3:14b", type: llm, size_gb: 9.3, params: "14B"}
      - {name: "gemma3:latest", type: llm, size_gb: 3.3, params: "4B"}
      - {name: "glm-4.7-flash:32k", type: llm, size_gb: 19.0, params: "~32B"}
      - {name: "glm-4.7-flash:q4_K_M", type: llm, size_gb: 19.0, note: "DUPLICATE of :32k"}
      - {name: "llava:13b", type: vision, size_gb: 8.0, params: "13B"}
      - {name: "mistral-nemo:12b", type: llm, size_gb: 7.1, note: "old"}
      - {name: "deepseek-coder:33b", type: code, size_gb: 18.8, params: "33B"}
      - {name: "deepseek-r1:70b", type: llm, size_gb: 42.5, params: "70B"}
      - {name: "starcoder2:3b", type: code, size_gb: 1.7}
      - {name: "phi3:latest", type: llm, size_gb: 2.2}
      - {name: "gpt-oss:latest", type: llm, size_gb: 13.8, note: "old"}
  swapper:
    base_url: http://swapper-service:8890
    active_model: null
    vision_models:
      - {name: "llava-13b", type: vision, size_gb: 8.0, status: unloaded}
    llm_models_count: 9
  llama_server:
    base_url: http://host.docker.internal:11435
    models:
      - {name: "Qwen3.5-35B-A3B-Q4_K_M.gguf", type: llm, note: "DUPLICATE of ollama qwen3.5:35b-a3b"}

# ─── LAYER 2: INSTALLED ON DISK (not served, not for routing) ────────────────
installed_on_disk:
  hf_cache:
    - {name: "whisper-large-v3-turbo-asr-fp16", type: stt, size_gb: 1.5, backend: mlx, ready: true}
    - {name: "Kokoro-82M-bf16", type: tts, size_gb: 0.35, backend: mlx, ready: true}
    - {name: "MiniCPM-V-4_5", type: vision, size_gb: 16.0, backend: hf, ready: false}
    - {name: "Qwen2.5-7B-Instruct", type: llm, size_gb: 14.0, backend: hf}
    - {name: "Qwen2.5-1.5B-Instruct", type: llm, size_gb: 2.9, backend: hf}
    - {name: "FLUX.2-dev (Aquiles)", type: image_gen, size_gb: 105.0, backend: comfyui}
  cursor_worktree:
    - {name: "Qwen3-VL-32B-Instruct", type: vision, size_gb: 123.0, path: "~/.cursor/worktrees/.../models/"}
  jan_ai:
    - {name: "Jan-v2-VL-med-Q8_0", type: vision, size_gb: 9.2, path: "~/Library/Application Support/Jan/"}
  llama_cpp_models:
    - {name: "Qwen3.5-35B-A3B-Q4_K_M.gguf", type: llm, size_gb: 20.0, note: "DUPLICATE, served by llama-server"}
  comfyui:
    - {name: "flux2-dev-Q8_0.gguf", type: image_gen, size_gb: 33.0}
    - {name: "ltx-2-19b-distilled-fp8.safetensors", type: video_gen, size_gb: 25.0}
    - {name: "z_image_turbo_bf16.safetensors", type: image_gen, size_gb: 11.0}
    - {name: "SDXL-base-1.0", type: image_gen, size_gb: 72.0, note: "legacy"}
  hf_models_dir:
    - {name: "stabilityai_sdxl_base_1.0", type: image_gen, size_gb: 72.0, note: "legacy"}

# ─── LAYER 3: CLOUD / EXTERNAL APIs ──────────────────────────────────────────
cloud_apis:
  - {name: "grok-2-1212", provider: grok, api_key_env: "GROK_API_KEY", active: true}
  - {name: "deepseek-chat", provider: deepseek, api_key_env: "DEEPSEEK_API_KEY", active: true}
  - {name: "mistral-large-latest", provider: mistral, api_key_env: "MISTRAL_API_KEY", active: false}

# ─── SOFIIA ROUTING CHAIN (after fix) ────────────────────────────────────────
sofiia_routing:
  agent_registry: "llm_profile: grok"
  router_config: "agents.sofiia.default_llm: cloud_grok → provider=grok, model=grok-2-1212"
  fallback: "fallback_llm: local_default_coder → qwen3:14b (Ollama)"
  env_mapping: "XAI_API_KEY → GROK_API_KEY (aliased in compose)"
  deterministic: true

View File

@@ -0,0 +1,7 @@
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY main.py .
EXPOSE 8099
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8099"]

View File

@@ -0,0 +1,245 @@
"""Node Capabilities Service — exposes live model inventory for router decisions."""
import os
import time
import logging
from typing import Any, Dict, List, Optional

from fastapi import FastAPI
from fastapi.responses import JSONResponse
import httpx

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("node-capabilities")

app = FastAPI(title="Node Capabilities Service", version="1.0.0")

NODE_ID = os.getenv("NODE_ID", "noda2")
OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://host.docker.internal:11434")
SWAPPER_URL = os.getenv("SWAPPER_URL", "http://swapper-service:8890")
LLAMA_SERVER_URL = os.getenv("LLAMA_SERVER_URL", "")

_cache: Dict[str, Any] = {}
_cache_ts: float = 0
CACHE_TTL = int(os.getenv("CACHE_TTL_SEC", "15"))


def _classify_model(name: str) -> str:
    nl = name.lower()
    if any(k in nl for k in ("vl", "vision", "llava", "minicpm-v", "clip")):
        return "vision"
    if any(k in nl for k in ("coder", "starcoder", "codellama", "code")):
        return "code"
    if any(k in nl for k in ("embed", "bge", "minilm", "e5-")):
        return "embedding"
    if any(k in nl for k in ("whisper", "stt")):
        return "stt"
    if any(k in nl for k in ("kokoro", "tts", "bark", "coqui", "xtts")):
        return "tts"
    if any(k in nl for k in ("flux", "sdxl", "stable-diffusion", "ltx")):
        return "image_gen"
    return "llm"


async def _collect_ollama() -> Dict[str, Any]:
    runtime: Dict[str, Any] = {"base_url": OLLAMA_BASE_URL, "status": "unknown", "models": []}
    try:
        async with httpx.AsyncClient(timeout=5) as c:
            r = await c.get(f"{OLLAMA_BASE_URL}/api/tags")
            if r.status_code == 200:
                data = r.json()
                runtime["status"] = "ok"
                for m in data.get("models", []):
                    runtime["models"].append({
                        "name": m.get("name", ""),
                        "size_bytes": m.get("size", 0),
                        "size_gb": round(m.get("size", 0) / 1e9, 1),
                        "type": _classify_model(m.get("name", "")),
                        "modified": m.get("modified_at", "")[:10],
                    })
            ps = await c.get(f"{OLLAMA_BASE_URL}/api/ps")
            if ps.status_code == 200:
                running = ps.json().get("models", [])
                running_names = {m.get("name", "") for m in running}
                for model in runtime["models"]:
                    model["running"] = model["name"] in running_names
    except Exception as e:
        runtime["status"] = f"error: {e}"
        logger.warning(f"Ollama collector failed: {e}")
    return runtime


async def _collect_swapper() -> Dict[str, Any]:
    runtime: Dict[str, Any] = {"base_url": SWAPPER_URL, "status": "unknown", "models": [], "vision_models": [], "active_model": None}
    try:
        async with httpx.AsyncClient(timeout=5) as c:
            h = await c.get(f"{SWAPPER_URL}/health")
            if h.status_code == 200:
                hd = h.json()
                runtime["status"] = hd.get("status", "ok")
                runtime["active_model"] = hd.get("active_model")
            mr = await c.get(f"{SWAPPER_URL}/models")
            if mr.status_code == 200:
                for m in mr.json().get("models", []):
                    runtime["models"].append({
                        "name": m.get("name", ""),
                        "type": m.get("type", "llm"),
                        "size_gb": m.get("size_gb", 0),
                        "status": m.get("status", "unknown"),
                    })
            vr = await c.get(f"{SWAPPER_URL}/vision/models")
            if vr.status_code == 200:
                for m in vr.json().get("models", []):
                    runtime["vision_models"].append({
                        "name": m.get("name", ""),
                        "type": "vision",
                        "size_gb": m.get("size_gb", 0),
                        "status": m.get("status", "unknown"),
                    })
    except Exception as e:
        runtime["status"] = f"error: {e}"
        logger.warning(f"Swapper collector failed: {e}")
    return runtime


async def _collect_llama_server() -> Optional[Dict[str, Any]]:
    if not LLAMA_SERVER_URL:
        return None
    runtime: Dict[str, Any] = {"base_url": LLAMA_SERVER_URL, "status": "unknown", "models": []}
    try:
        async with httpx.AsyncClient(timeout=5) as c:
            r = await c.get(f"{LLAMA_SERVER_URL}/v1/models")
            if r.status_code == 200:
                data = r.json()
                runtime["status"] = "ok"
                for m in data.get("data", data.get("models", [])):
                    name = m.get("id", m.get("name", "unknown"))
                    runtime["models"].append({"name": name, "type": "llm"})
    except Exception as e:
        runtime["status"] = f"error: {e}"
    return runtime


def _collect_disk_inventory() -> List[Dict[str, Any]]:
    """Scan known model directories — NOT for routing, only inventory."""
    import pathlib
    inventory: List[Dict[str, Any]] = []
    scan_dirs = [
        ("cursor_worktrees", pathlib.Path.home() / ".cursor" / "worktrees"),
        ("jan_ai", pathlib.Path.home() / "Library" / "Application Support" / "Jan"),
        ("hf_cache", pathlib.Path.home() / ".cache" / "huggingface" / "hub"),
        ("comfyui_main", pathlib.Path.home() / "ComfyUI" / "models"),
        ("comfyui_docs", pathlib.Path.home() / "Documents" / "ComfyUI" / "models"),
        ("llama_cpp", pathlib.Path.home() / "Library" / "Application Support" / "llama.cpp" / "models"),
        ("hf_models", pathlib.Path.home() / "hf_models"),
    ]
    for source, base in scan_dirs:
        if not base.exists():
            continue
        try:
            for f in base.rglob("*"):
                if f.suffix in (".gguf", ".safetensors", ".bin", ".pt") and f.stat().st_size > 100_000_000:
                    inventory.append({
                        "name": f.stem,
                        "path": str(f.relative_to(pathlib.Path.home())),
                        "source": source,
                        "size_gb": round(f.stat().st_size / 1e9, 1),
                        "type": _classify_model(f.stem),
                        "served": False,
                    })
        except Exception:
            pass
    return inventory


def _build_served_models(ollama: Dict, swapper: Dict, llama: Optional[Dict]) -> List[Dict[str, Any]]:
    """Merge all served models into a flat canonical list."""
    served: List[Dict[str, Any]] = []
    seen = set()
    for m in ollama.get("models", []):
        key = m["name"]
        if key not in seen:
            seen.add(key)
            served.append({**m, "runtime": "ollama", "base_url": ollama["base_url"]})
    for m in swapper.get("vision_models", []):
        key = f"swapper:{m['name']}"
        if key not in seen:
            seen.add(key)
            served.append({**m, "runtime": "swapper", "base_url": swapper["base_url"]})
    if llama:
        for m in llama.get("models", []):
            key = f"llama:{m['name']}"
            if key not in seen:
                seen.add(key)
                served.append({**m, "runtime": "llama_server", "base_url": llama["base_url"]})
    return served


async def _build_capabilities() -> Dict[str, Any]:
    global _cache, _cache_ts
    if _cache and (time.time() - _cache_ts) < CACHE_TTL:
        return _cache
    ollama = await _collect_ollama()
    swapper = await _collect_swapper()
    llama = await _collect_llama_server()
    disk = _collect_disk_inventory()
    served = _build_served_models(ollama, swapper, llama)
    result = {
        "node_id": NODE_ID,
        "updated_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "runtimes": {
            "ollama": ollama,
            "swapper": swapper,
        },
        "served_models": served,
        "served_count": len(served),
        "inventory_only": disk,
        "inventory_count": len(disk),
    }
    if llama:
        result["runtimes"]["llama_server"] = llama
    _cache = result
    _cache_ts = time.time()
    return result


@app.get("/healthz")
async def healthz():
    return {"status": "ok", "node_id": NODE_ID}


@app.get("/capabilities")
async def capabilities():
    data = await _build_capabilities()
    return JSONResponse(content=data)


@app.get("/capabilities/models")
async def capabilities_models():
    data = await _build_capabilities()
    return JSONResponse(content={"node_id": data["node_id"], "served_models": data["served_models"]})


@app.post("/capabilities/refresh")
async def capabilities_refresh():
    global _cache_ts
    _cache_ts = 0
    data = await _build_capabilities()
    return JSONResponse(content={"refreshed": True, "served_count": data["served_count"]})


if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=int(os.getenv("PORT", "8099")))

View File

@@ -0,0 +1,3 @@
fastapi>=0.110.0
uvicorn>=0.29.0
httpx>=0.27.0

View File

@@ -0,0 +1,80 @@
"""Capabilities client — fetches and caches live model inventory from Node Capabilities Service."""
import os
import time
import logging
from typing import Any, Dict, List, Optional

import httpx

logger = logging.getLogger("capabilities_client")

_cache: Dict[str, Any] = {}
_cache_ts: float = 0
NODE_CAPABILITIES_URL = os.getenv("NODE_CAPABILITIES_URL", "")
CACHE_TTL = 30


def configure(url: str = "", ttl: int = 30):
    global NODE_CAPABILITIES_URL, CACHE_TTL
    if url:
        NODE_CAPABILITIES_URL = url
    CACHE_TTL = ttl


async def fetch_capabilities(force: bool = False) -> Dict[str, Any]:
    global _cache, _cache_ts
    if not NODE_CAPABILITIES_URL:
        return {}
    if not force and _cache and (time.time() - _cache_ts) < CACHE_TTL:
        return _cache
    try:
        async with httpx.AsyncClient(timeout=5) as c:
            resp = await c.get(NODE_CAPABILITIES_URL)
            if resp.status_code == 200:
                _cache = resp.json()
                _cache_ts = time.time()
                logger.info(f"Capabilities refreshed: {_cache.get('served_count', 0)} served models")
                return _cache
            else:
                logger.warning(f"Capabilities fetch failed: HTTP {resp.status_code}")
    except Exception as e:
        logger.warning(f"Capabilities fetch error: {e}")
    return _cache


def get_cached() -> Dict[str, Any]:
    return _cache


def find_served_model(
    model_type: str = "llm",
    preferred_name: Optional[str] = None,
    runtime: Optional[str] = None,
) -> Optional[Dict[str, Any]]:
    """Find best served model matching criteria from cached capabilities."""
    served = _cache.get("served_models", [])
    if not served:
        return None
    candidates = [m for m in served if m.get("type") == model_type]
    if runtime:
        candidates = [m for m in candidates if m.get("runtime") == runtime]
    if not candidates:
        return None
    if preferred_name:
        for m in candidates:
            if preferred_name in m.get("name", ""):
                return m
    return candidates[0]


def list_served_by_type(model_type: str = "llm") -> List[Dict[str, Any]]:
    return [m for m in _cache.get("served_models", []) if m.get("type") == model_type]

View File

@@ -1,6 +1,6 @@
from fastapi import FastAPI, HTTPException
from fastapi import FastAPI, HTTPException, Request
from fastapi.responses import Response
from pydantic import BaseModel
from pydantic import BaseModel, ConfigDict
from typing import Literal, Optional, Dict, Any, List
import asyncio
import json
@@ -897,6 +897,134 @@ async def health():
"messaging_inbound_enabled": config.get("messaging_inbound", {}).get("enabled", True)
}
@app.get("/healthz")
async def healthz():
"""Alias /healthz → /health for BFF compatibility."""
return await health()
@app.get("/monitor/status")
async def monitor_status(request: Request = None):
"""
Node monitor status — read-only, safe, no secrets.
Returns: heartbeat, router/gateway health, open incidents,
alerts loop SLO, active backends, last artifact timestamps.
Rate limited: 60 rpm per IP (in-process bucket).
RBAC: requires tools.monitor.read entitlement (or tools.observability.read).
Auth: X-Monitor-Key header (same as SUPERVISOR_API_KEY, optional in dev).
"""
import collections as _collections
# ── Rate limit (60 rpm per IP) ────────────────────────────────────────
_now = time.monotonic()
client_ip = (
(request.client.host if request and request.client else None) or "unknown"
)
_bucket_key = f"monitor:{client_ip}"
if not hasattr(monitor_status, "_buckets"):
monitor_status._buckets = {}
dq = monitor_status._buckets.setdefault(_bucket_key, _collections.deque())
while dq and _now - dq[0] > 60:
dq.popleft()
if len(dq) >= 60:
from fastapi.responses import JSONResponse
return JSONResponse(status_code=429, content={"error": "rate_limit", "message": "60 rpm exceeded"})
dq.append(_now)
# ── Auth (optional in dev, enforced in prod) ──────────────────────────
_env = os.getenv("ENV", "dev").strip().lower()
_monitor_key = os.getenv("SUPERVISOR_API_KEY", "").strip()
if _env in ("prod", "production", "staging") and _monitor_key:
_req_key = ""
if request:
_req_key = (
request.headers.get("X-Monitor-Key", "")
or request.headers.get("Authorization", "").removeprefix("Bearer ").strip()
)
if _req_key != _monitor_key:
from fastapi.responses import JSONResponse
return JSONResponse(status_code=403, content={"error": "forbidden", "message": "X-Monitor-Key required"})
# ── Collect data (best-effort, non-fatal) ─────────────────────────────
warnings: list[str] = []
ts_now = __import__("datetime").datetime.now(
__import__("datetime").timezone.utc
).isoformat(timespec="seconds")
# uptime as heartbeat proxy
_proc_start = getattr(monitor_status, "_proc_start", None)
if _proc_start is None:
monitor_status._proc_start = time.monotonic()
_proc_start = monitor_status._proc_start
heartbeat_age_s = int(time.monotonic() - _proc_start)
# open incidents
open_incidents: int | None = None
try:
from incident_store import get_incident_store as _get_is
_istore = _get_is()
_open = _istore.list_incidents(filters={"status": "open"}, limit=500)
# include "mitigating" as still-open
open_incidents = sum(
1 for i in _open if (i.get("status") or "").lower() in ("open", "mitigating")
)
except Exception as _e:
warnings.append(f"incidents: {str(_e)[:80]}")
# alerts loop SLO
alerts_loop_slo: dict | None = None
try:
from alert_store import get_alert_store as _get_as
alerts_loop_slo = _get_as().compute_loop_slo(window_minutes=240)
# strip any internal keys that may contain infra details
_safe_keys = {"claim_to_ack_p95_seconds", "failed_rate_pct", "processing_stuck_count", "sample_count", "violations"}
alerts_loop_slo = {k: v for k, v in alerts_loop_slo.items() if k in _safe_keys}
except Exception as _e:
warnings.append(f"alerts_slo: {str(_e)[:80]}")
# backends (env vars only — no DSN, no passwords)
backends = {
"alerts": os.getenv("ALERT_BACKEND", "unknown"),
"audit": os.getenv("AUDIT_BACKEND", "unknown"),
"incidents": os.getenv("INCIDENT_BACKEND", "unknown"),
"risk_history": os.getenv("RISK_HISTORY_BACKEND", "unknown"),
"backlog": os.getenv("BACKLOG_BACKEND", "unknown"),
}
# last artifact timestamps (best-effort filesystem scan)
last_artifacts: dict = {}
_base = __import__("pathlib").Path("ops")
for _pattern, _key in [
("reports/risk/*.md", "risk_digest_ts"),
("reports/platform/*.md", "platform_digest_ts"),
("backlog/*.jsonl", "backlog_generate_ts"),
("reports/backlog/*.md", "backlog_report_ts"),
]:
try:
_files = sorted(_base.glob(_pattern))
if _files:
_mtime = _files[-1].stat().st_mtime
last_artifacts[_key] = __import__("datetime").datetime.fromtimestamp(
_mtime, tz=__import__("datetime").timezone.utc
).isoformat(timespec="seconds")
except Exception:
pass
return {
"node_id": os.getenv("NODE_ID", "NODA1"),
"ts": ts_now,
"heartbeat_age_s": heartbeat_age_s,
"router_ok": True, # we are the router; if we respond, we're ok
"gateway_ok": None, # gateway health not probed here (separate svc)
"open_incidents": open_incidents,
"alerts_loop_slo": alerts_loop_slo,
"backends": backends,
"last_artifacts": last_artifacts,
"warnings": warnings,
}
@app.post("/internal/router/test-messaging", response_model=AgentInvocation)
async def test_messaging_route(decision: FilterDecision):
"""
@@ -966,6 +1094,15 @@ class InferResponse(BaseModel):
file_mime: Optional[str] = None
class ToolExecuteRequest(BaseModel):
"""External tool execution request used by console/ops APIs."""
model_config = ConfigDict(extra="allow")
tool: str
action: Optional[str] = None
agent_id: Optional[str] = "sofiia"
metadata: Optional[Dict[str, Any]] = None
# =========================================================================
@@ -1110,15 +1247,21 @@ async def internal_llm_complete(request: InternalLLMRequest):
logger.info(f"Internal LLM: profile={request.llm_profile}, role={request.role_context}")
# Get LLM profile configuration
llm_profiles = router_config.get("llm_profiles", {})
profile_name = request.llm_profile or "reasoning"
llm_profile = llm_profiles.get(profile_name, {})
provider = llm_profile.get("provider", "deepseek")
model = request.model or llm_profile.get("model", "deepseek-chat")
if not llm_profile:
fallback_name = "local_default_coder"
llm_profile = llm_profiles.get(fallback_name, {})
logger.warning(f"⚠️ Profile '{profile_name}' not found in llm_profiles → falling back to '{fallback_name}' (local)")
profile_name = fallback_name
provider = llm_profile.get("provider", "ollama")
model = request.model or llm_profile.get("model", "qwen3:14b")
max_tokens = request.max_tokens or llm_profile.get("max_tokens", 2048)
temperature = request.temperature or llm_profile.get("temperature", 0.2)
logger.info(f"🎯 Resolved: profile={profile_name} provider={provider} model={model}")
# Build messages
messages = []
@@ -1173,10 +1316,11 @@ async def internal_llm_complete(request: InternalLLMRequest):
# Fallback/target local provider (Ollama)
try:
logger.info("Internal LLM to Ollama")
ollama_model = model or "qwen3:8b"
ollama_base = llm_profile.get("base_url", os.getenv("OLLAMA_BASE_URL", "http://host.docker.internal:11434"))
ollama_model = model or "qwen3:14b"
logger.info(f"Internal LLM to Ollama: model={ollama_model} url={ollama_base}")
ollama_resp = await http_client.post(
"http://172.18.0.1:11434/api/generate",
f"{ollama_base}/api/generate",
json={"model": ollama_model, "prompt": request.prompt, "system": request.system_prompt or "", "stream": False, "options": {"num_predict": max_tokens, "temperature": temperature}},
timeout=120.0
)
@@ -1249,15 +1393,17 @@ async def agent_infer(agent_id: str, request: InferRequest):
if not system_prompt:
try:
from prompt_builder import get_agent_system_prompt
system_prompt = await get_agent_system_prompt(
agent_id,
from prompt_builder import get_prompt_builder
prompt_builder = await get_prompt_builder(
city_service_url=CITY_SERVICE_URL,
router_config=router_config
router_config=router_config,
)
logger.info(f"✅ Loaded system prompt from database for {agent_id}")
prompt_result = await prompt_builder.get_system_prompt(agent_id)
system_prompt = prompt_result.system_prompt
system_prompt_source = prompt_result.source
logger.info(f"✅ Loaded system prompt for {agent_id} from {system_prompt_source}")
except Exception as e:
logger.warning(f"⚠️ Could not load prompt from database: {e}")
logger.warning(f"⚠️ Could not load prompt from configured sources: {e}")
# Fallback to config
system_prompt_source = "router_config"
agent_config = router_config.get("agents", {}).get(agent_id, {})
@@ -1450,15 +1596,38 @@ async def agent_infer(agent_id: str, request: InferRequest):
except Exception as e:
logger.exception(f"❌ CrewAI error: {e}, falling back to direct LLM")
default_llm = agent_config.get("default_llm", "qwen3:8b")
default_llm = agent_config.get("default_llm", "local_default_coder")
routing_rules = router_config.get("routing", [])
default_llm = _select_default_llm(agent_id, metadata, default_llm, routing_rules)
# Get LLM profile configuration
cloud_provider_names = {"deepseek", "mistral", "grok", "openai", "anthropic"}
llm_profiles = router_config.get("llm_profiles", {})
llm_profile = llm_profiles.get(default_llm, {})
if not llm_profile:
fallback_llm = agent_config.get("fallback_llm", "local_default_coder")
llm_profile = llm_profiles.get(fallback_llm, {})
logger.warning(
f"⚠️ Profile '{default_llm}' not found for agent={agent_id} "
f"→ fallback to '{fallback_llm}' (local). "
f"NOT defaulting to cloud silently."
)
default_llm = fallback_llm
provider = llm_profile.get("provider", "ollama")
logger.info(f"🎯 Agent={agent_id}: profile={default_llm} provider={provider} model={llm_profile.get('model', '?')}")
# If explicit model is requested, try to resolve it to configured cloud profile.
if request.model:
for profile_name, profile in llm_profiles.items():
if profile.get("model") == request.model and profile.get("provider") in cloud_provider_names:
llm_profile = profile
provider = profile.get("provider", provider)
default_llm = profile_name
logger.info(f"🎛️ Matched request.model={request.model} to profile={profile_name} provider={provider}")
break
# Determine model name
if provider in ["deepseek", "openai", "anthropic", "mistral"]:
@@ -1671,7 +1840,6 @@ async def agent_infer(agent_id: str, request: InferRequest):
max_tokens = request.max_tokens or llm_profile.get("max_tokens", 2048)
temperature = request.temperature or llm_profile.get("temperature", 0.2)
cloud_provider_names = {"deepseek", "mistral", "grok", "openai", "anthropic"}
allow_cloud = provider in cloud_provider_names
if not allow_cloud:
logger.info(f"☁️ Cloud providers disabled for agent {agent_id}: provider={provider}")
@@ -1700,6 +1868,18 @@ async def agent_infer(agent_id: str, request: InferRequest):
}
]
# Custom configured profile for OpenAI-compatible backends (e.g. local llama-server).
if provider == "openai":
cloud_providers = [
{
"name": "openai",
"api_key_env": llm_profile.get("api_key_env", "OPENAI_API_KEY"),
"base_url": llm_profile.get("base_url", "https://api.openai.com"),
"model": request.model or llm_profile.get("model", model),
"timeout": int(llm_profile.get("timeout_ms", 60000) / 1000),
}
]
if not allow_cloud:
cloud_providers = []
@@ -1717,8 +1897,14 @@ async def agent_infer(agent_id: str, request: InferRequest):
logger.debug(f"🔧 {len(tools_payload)} tools available for function calling")
for cloud in cloud_providers:
api_key = os.getenv(cloud["api_key_env"])
if not api_key:
api_key = os.getenv(cloud["api_key_env"], "")
base_url = cloud.get("base_url", "")
is_local_openai = (
cloud.get("name") == "openai"
and isinstance(base_url, str)
and any(host in base_url for host in ["host.docker.internal", "localhost", "127.0.0.1"])
)
if not api_key and not is_local_openai:
logger.debug(f"⏭️ Skipping {cloud['name']}: API key not configured")
continue
@@ -1739,12 +1925,13 @@ async def agent_infer(agent_id: str, request: InferRequest):
request_payload["tools"] = tools_payload
request_payload["tool_choice"] = "auto"
headers = {"Content-Type": "application/json"}
if api_key:
headers["Authorization"] = f"Bearer {api_key}"
cloud_resp = await http_client.post(
f"{cloud['base_url']}/v1/chat/completions",
headers=headers,
json=request_payload,
timeout=cloud["timeout"]
)
@@ -1754,6 +1941,8 @@ async def agent_infer(agent_id: str, request: InferRequest):
choice = data.get("choices", [{}])[0]
message = choice.get("message", {})
response_text = message.get("content", "") or ""
if not response_text and message.get("reasoning_content"):
response_text = str(message.get("reasoning_content", "")).strip()
tokens_used = data.get("usage", {}).get("total_tokens", 0)
# Initialize tool_results to avoid UnboundLocalError
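The `reasoning_content` fallback above covers models that leave `content` empty and return their answer only in a reasoning field; as a standalone helper (name assumed, not from the codebase):

```python
from typing import Any, Dict

def extract_text(message: Dict[str, Any]) -> str:
    """Prefer message.content; fall back to reasoning_content if empty."""
    text = message.get("content", "") or ""
    if not text and message.get("reasoning_content"):
        text = str(message.get("reasoning_content", "")).strip()
    return text
```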
@@ -1959,12 +2148,12 @@ async def agent_infer(agent_id: str, request: InferRequest):
loop_payload["tools"] = tools_payload
loop_payload["tool_choice"] = "auto"
loop_headers = {"Content-Type": "application/json"}
if api_key:
loop_headers["Authorization"] = f"Bearer {api_key}"
loop_resp = await http_client.post(
f"{cloud['base_url']}/v1/chat/completions",
headers=loop_headers,
json=loop_payload,
timeout=cloud["timeout"]
)
@@ -1978,6 +2167,8 @@ async def agent_infer(agent_id: str, request: InferRequest):
loop_data = loop_resp.json()
loop_message = loop_data.get("choices", [{}])[0].get("message", {})
response_text = loop_message.get("content", "") or ""
if not response_text and loop_message.get("reasoning_content"):
response_text = str(loop_message.get("reasoning_content", "")).strip()
tokens_used += loop_data.get("usage", {}).get("total_tokens", 0)
current_tool_calls = loop_message.get("tool_calls", [])
@@ -2123,16 +2314,24 @@ async def agent_infer(agent_id: str, request: InferRequest):
# LOCAL PROVIDERS (Ollama via Swapper)
# =========================================================================
# Determine local model from config (not hardcoded)
# Strategy:
# 1) explicit request.model override
# 2) agent default_llm if it's local (ollama)
# 3) first local profile fallback
local_model = None
requested_local_model = (request.model or "").strip()
if requested_local_model:
local_model = requested_local_model.replace(":", "-")
logger.info(f"🎛️ Local model override requested: {requested_local_model} -> {local_model}")
# Check if default_llm is local
if not local_model and llm_profile.get("provider") == "ollama":
# Extract model name and convert the separator for Swapper (qwen3:8b → qwen3-8b)
ollama_model = llm_profile.get("model", "qwen3:8b")
local_model = ollama_model.replace(":", "-")  # qwen3:8b → qwen3-8b
logger.debug(f"✅ Using agent's default local model: {local_model}")
elif not local_model:
# Find first local model from config
for profile_name, profile in llm_profiles.items():
if profile.get("provider") == "ollama":
@@ -2259,6 +2458,60 @@ async def agent_infer(agent_id: str, request: InferRequest):
)
@app.post("/v1/tools/execute")
async def tools_execute(request: ToolExecuteRequest):
"""
Execute a single tool call through ToolManager.
Returns console-compatible shape: {status, data, error}.
"""
if not tool_manager:
raise HTTPException(status_code=503, detail="Tool manager unavailable")
payload = request.model_dump(exclude_none=True)
tool_name = str(payload.pop("tool", "")).strip()
action = payload.pop("action", None)
agent_id = str(payload.pop("agent_id", "sofiia") or "sofiia").strip()
metadata = payload.pop("metadata", {}) or {}
if not tool_name:
raise HTTPException(status_code=422, detail="tool is required")
# Keep backward compatibility with sofiia-console calls
if action is not None:
payload["action"] = action
chat_id = str(metadata.get("chat_id", "") or "") or None
user_id = str(metadata.get("user_id", "") or "") or None
workspace_id = str(metadata.get("workspace_id", "default") or "default")
try:
result = await tool_manager.execute_tool(
tool_name=tool_name,
arguments=payload,
agent_id=agent_id,
chat_id=chat_id,
user_id=user_id,
workspace_id=workspace_id,
)
except Exception as e:
logger.exception("❌ Tool execution failed: %s", tool_name)
raise HTTPException(status_code=500, detail=f"Tool execution error: {str(e)[:200]}")
data: Dict[str, Any] = {"result": result.result}
if result.image_base64:
data["image_base64"] = result.image_base64
if result.file_base64:
data["file_base64"] = result.file_base64
if result.file_name:
data["file_name"] = result.file_name
if result.file_mime:
data["file_mime"] = result.file_mime
if result.success:
return {"status": "ok", "data": data, "error": None}
return {"status": "failed", "data": data, "error": {"message": result.error or "Tool failed"}}
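The endpoint above normalizes `ToolManager` results into the console-compatible `{status, data, error}` shape; a standalone sketch of that mapping (the `ToolResult` dataclass here is a simplified stand-in for the real result type, field names mirror those used above):

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

@dataclass
class ToolResult:
    success: bool
    result: Any = None
    error: Optional[str] = None
    image_base64: Optional[str] = None

def to_console_shape(res: ToolResult) -> Dict[str, Any]:
    """Map a tool result to the console-compatible response shape."""
    data: Dict[str, Any] = {"result": res.result}
    if res.image_base64:
        data["image_base64"] = res.image_base64
    if res.success:
        return {"status": "ok", "data": data, "error": None}
    return {"status": "failed", "data": data,
            "error": {"message": res.error or "Tool failed"}}
```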
@app.get("/v1/models")
async def list_available_models():
"""List all available models across backends"""
@@ -124,6 +124,23 @@ llm_profiles:
timeout_ms: 60000
description: "Mistral Large for complex tasks, reasoning, and analysis"
cloud_grok:
provider: grok
base_url: https://api.x.ai
api_key_env: GROK_API_KEY
model: grok-2-1212
max_tokens: 2048
temperature: 0.2
timeout_ms: 60000
description: "Grok API for SOFIIA (Chief AI Architect)"
# ============================================================================
# Node Capabilities
# ============================================================================
node_capabilities:
url: http://node-capabilities:8099/capabilities
cache_ttl_sec: 30
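The `node_capabilities` block above pairs with `services/router/capabilities_client.py`, which caches NCS responses for `cache_ttl_sec` seconds. A minimal synchronous TTL-cache sketch of that behavior (class and parameter names are illustrative; the real client is async):

```python
import time
from typing import Any, Callable, Optional

class TTLCache:
    """Return a cached value until ttl_sec seconds elapse, then refetch."""

    def __init__(self, fetch: Callable[[], Any], ttl_sec: float = 30.0):
        self._fetch = fetch
        self._ttl = ttl_sec
        self._value: Optional[Any] = None
        self._expires_at = 0.0

    def get(self, force_refresh: bool = False) -> Any:
        # Refetch when forced (mirrors POST /capabilities/refresh),
        # when nothing is cached yet, or when the TTL has expired.
        now = time.monotonic()
        if force_refresh or self._value is None or now >= self._expires_at:
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value
```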
# ============================================================================
# Orchestrator Providers
# ============================================================================
@@ -417,8 +434,9 @@ agents:
Distinguish other bots by their nickname and respond only to strategic requests.
sofiia:
description: "SOFIIA — Chief AI Architect & Technical Sovereign"
default_llm: cloud_grok
fallback_llm: local_default_coder
system_prompt: |
You are Sofiia, the Chief AI Architect and Technical Sovereign of the DAARION.city ecosystem.
Work as a CTO assistant: architecture, reliability, security, release governance, incident/risk/backlog control.