node2: fix Sofiia routing determinism + Node Capabilities Service

Bug fixes:
- Bug A: GROK_API_KEY env mismatch — router expected GROK_API_KEY but only
  XAI_API_KEY was present. Added GROK_API_KEY=${XAI_API_KEY} alias in compose.
- Bug B: 'grok' profile missing in router-config.node2.yml — added cloud_grok
  profile (provider: grok, model: grok-2-1212). Sofiia now has
  default_llm=cloud_grok with fallback_llm=local_default_coder.
- Bug C: Router silently defaulted to cloud DeepSeek when the profile was unknown.
  Now falls back to agent.fallback_llm or local_default_coder with a WARNING log.
  The hardcoded Ollama URL (172.18.0.1) was replaced with a config-driven base_url.

New service: Node Capabilities Service (NCS)
- services/node-capabilities/ — FastAPI microservice exposing live model
  inventory from Ollama, Swapper, and llama-server.
- GET /capabilities — canonical JSON with served_models[] and inventory_only[]
- GET /capabilities/models — flat list of served models
- POST /capabilities/refresh — force cache refresh
- Cache TTL 15s, bound to 127.0.0.1:8099
- services/router/capabilities_client.py — async client with TTL cache
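The canonical payload can be consumed without the bundled client. A minimal sketch, assuming the `served_models[]` / `inventory_only[]` shape described above (the sample entries are illustrative, not live NCS output):

```python
# Hypothetical consumer of GET /capabilities output; SAMPLE mirrors the
# documented payload shape, entries are illustrative.
from typing import Any, Dict, List, Optional

SAMPLE: Dict[str, Any] = {
    "node_id": "NODA2",
    "served_models": [
        {"name": "qwen3:14b", "type": "llm", "runtime": "ollama"},
        {"name": "deepseek-coder:33b", "type": "code", "runtime": "ollama"},
        {"name": "llava-13b", "type": "vision", "runtime": "swapper"},
    ],
    "inventory_only": [
        {"name": "Kokoro-82M-bf16", "type": "tts", "served": False},
    ],
}

def pick_served(caps: Dict[str, Any], model_type: str,
                runtime: Optional[str] = None) -> Optional[Dict[str, Any]]:
    """First routable model of the given type; inventory_only is never eligible."""
    candidates: List[Dict[str, Any]] = [
        m for m in caps.get("served_models", []) if m.get("type") == model_type
    ]
    if runtime:
        candidates = [m for m in candidates if m.get("runtime") == runtime]
    return candidates[0] if candidates else None

print(pick_served(SAMPLE, "vision")["name"])  # llava-13b
print(pick_served(SAMPLE, "tts"))             # None — on disk, not served
```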

Artifacts:
- ops/node2_models_audit.md — 3-layer model view (served/disk/cloud)
- ops/node2_models_audit.yml — machine-readable audit
- ops/node2_capabilities_example.json — sample NCS output (14 served models)

Made-with: Cursor
Apple
2026-02-27 02:07:40 -08:00
parent 3965f68fac
commit e2a3ae342a
10 changed files with 867 additions and 33 deletions

View File

@@ -23,6 +23,10 @@ services:
      - PIECES_OS_URL=http://host.docker.internal:39300
      - NOTION_API_KEY=${NOTION_API_KEY:-}
      - XAI_API_KEY=${XAI_API_KEY}
      - GROK_API_KEY=${XAI_API_KEY}
      - DEEPSEEK_API_KEY=${DEEPSEEK_API_KEY:-}
      # ── Node Capabilities ─────────────────────────────────────────────────
      - NODE_CAPABILITIES_URL=http://node-capabilities:8099/capabilities
      # ── Persistence backends ──────────────────────────────────────────────
      - ALERT_BACKEND=postgres
      - ALERT_DATABASE_URL=${ALERT_DATABASE_URL:-${DATABASE_URL}}
@@ -39,6 +43,7 @@ services:
      - "daarion-city-service:host-gateway"
    depends_on:
      - dagi-nats
      - node-capabilities
    networks:
      - dagi-network
      - dagi-memory-network
@@ -103,6 +108,27 @@ services:
      - dagi-network
    restart: unless-stopped

  node-capabilities:
    build:
      context: ./services/node-capabilities
      dockerfile: Dockerfile
    container_name: node-capabilities-node2
    ports:
      - "127.0.0.1:8099:8099"
    extra_hosts:
      - "host.docker.internal:host-gateway"
    environment:
      - NODE_ID=NODA2
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
      - SWAPPER_URL=http://swapper-service:8890
      - LLAMA_SERVER_URL=http://host.docker.internal:11435
      - CACHE_TTL_SEC=15
    depends_on:
      - swapper-service
    networks:
      - dagi-network
    restart: unless-stopped

  sofiia-console:
    build:
      context: ./services/sofiia-console

File diff suppressed because one or more lines are too long

ops/node2_models_audit.md Normal file
View File

@@ -0,0 +1,125 @@
# NODA2 Model Audit — Three-Layer View
**Date:** 2026-02-27
**Node:** MacBook Pro M4 Max, 64GB unified memory
---
## Layer 1: Served by Runtime (routing-eligible)
These are models the router can actively select and invoke.
### Ollama (12 models, port 11434)
| Model | Type | Size | Status | Note |
|-------|------|------|--------|------|
| qwen3.5:35b-a3b | LLM (MoE) | 9.3 GB | idle | PRIMARY reasoning |
| qwen3:14b | LLM | 9.3 GB | idle | Default local |
| gemma3:latest | LLM | 3.3 GB | idle | Fast small |
| glm-4.7-flash:32k | LLM | 19 GB | idle | Long-context |
| glm-4.7-flash:q4_K_M | LLM | 19 GB | idle | **DUPLICATE** |
| llava:13b | Vision | 8.0 GB | idle | P0 fallback |
| mistral-nemo:12b | LLM | 7.1 GB | idle | old |
| deepseek-coder:33b | Code | 18.8 GB | idle | Heavy code |
| deepseek-r1:70b | LLM | 42.5 GB | idle | Very heavy reasoning |
| starcoder2:3b | Code | 1.7 GB | idle | Fast code |
| phi3:latest | LLM | 2.2 GB | idle | Small general |
| gpt-oss:latest | LLM | 13.8 GB | idle | old |
### Swapper (port 8890)
| Model | Type | Status |
|-------|------|--------|
| llava-13b | Vision | unloaded |
### llama-server (port 11435)
| Model | Type | Note |
|-------|------|------|
| Qwen3.5-35B-A3B-Q4_K_M.gguf | LLM | **DUPLICATE** of Ollama |
### Cloud APIs
| Provider | Model | API Key | Active |
|----------|-------|---------|--------|
| Grok (xAI) | grok-2-1212 | `GROK_API_KEY` ✅ | **Sofiia primary** |
| DeepSeek | deepseek-chat | `DEEPSEEK_API_KEY` ✅ | Other agents |
| Mistral | mistral-large | `MISTRAL_API_KEY` | Not configured |
---
## Layer 2: Installed on Disk (not served)
These are on disk but NOT reachable by router/swapper.
| Model | Type | Size | Location | Status |
|-------|------|------|----------|--------|
| whisper-large-v3-turbo (MLX) | STT | 1.5 GB | HF cache | Ready, not integrated |
| Kokoro-82M-bf16 (MLX) | TTS | 0.35 GB | HF cache | Ready, not integrated |
| MiniCPM-V-4_5 | Vision | 16 GB | HF cache | Not serving |
| Qwen3-VL-32B-Instruct | Vision | 123 GB | Cursor worktree | R&D artifact |
| Jan-v2-VL-med-Q8_0 | Vision | 9.2 GB | Jan AI | Not running |
| Qwen2.5-7B-Instruct | LLM | 14 GB | HF cache | Idle |
| Qwen2.5-1.5B-Instruct | LLM | 2.9 GB | HF cache | Idle |
| flux2-dev-Q8_0 | Image gen | 33 GB | ComfyUI | Offline |
| ltx-2-19b-distilled | Video gen | 25 GB | ComfyUI | Offline |
| SDXL-base-1.0 | Image gen | 72 GB | hf_models | Legacy |
| FLUX.2-dev (Aquiles) | Image gen | 105 GB | HF cache | ComfyUI |
---
## Layer 3: Sofiia Routing (after fix)
### Before fix (broken)
```
agent_registry: llm_profile=grok
→ router looks up "grok" in node2 config → NOT FOUND
→ llm_profile = {} → provider defaults to "deepseek" (hardcoded)
→ tries DEEPSEEK_API_KEY → may work (nondeterministic)
→ XAI_API_KEY exists but mapped as "XAI_API_KEY", not "GROK_API_KEY"
```
### After fix (deterministic)
```
agent_registry: llm_profile=grok
router-config.node2.yml:
agents.sofiia.default_llm = cloud_grok
agents.sofiia.fallback_llm = local_default_coder
llm_profiles.cloud_grok = {provider: grok, model: grok-2-1212, base_url: https://api.x.ai}
docker-compose: GROK_API_KEY=${XAI_API_KEY} (aliased)
Chain:
1. Sofiia request → router resolves cloud_grok
2. provider=grok → GROK_API_KEY present → xAI API → grok-2-1212
3. If Grok fails → fallback_llm=local_default_coder → qwen3:14b (Ollama)
4. If unknown profile → WARNING logged, uses agent.default_llm (local), NOT cloud silently
```
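The chain above reduces to a small resolution rule. A sketch of the router-side logic under the node2 profile names (the config dict here is an illustrative subset of router-config.node2.yml, not the full file):

```python
# Sketch of deterministic profile resolution: unknown profile -> local
# fallback with a WARNING, never a silent cloud default.
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("router")

# Illustrative subset of llm_profiles from router-config.node2.yml.
LLM_PROFILES = {
    "cloud_grok": {"provider": "grok", "model": "grok-2-1212"},
    "local_default_coder": {"provider": "ollama", "model": "qwen3:14b"},
}

def resolve_profile(name: str, fallback: str = "local_default_coder"):
    """Return (profile_name, profile_dict), falling back locally if unknown."""
    profile = LLM_PROFILES.get(name)
    if profile is None:
        logger.warning("Profile '%s' not found -> falling back to '%s' (local)",
                       name, fallback)
        name, profile = fallback, LLM_PROFILES[fallback]
    return name, profile

print(resolve_profile("cloud_grok")[1]["provider"])  # grok
print(resolve_profile("grok")[0])                    # local_default_coder
```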
---
## Fixes Applied in This Commit
| Bug | Fix | File |
|-----|-----|------|
| A: GROK_API_KEY not in env | Added `GROK_API_KEY=${XAI_API_KEY}` | docker-compose.node2-sofiia.yml |
| B: No `grok` profile | Added `cloud_grok` profile | router-config.node2.yml |
| B: Sofiia → wrong profile | `agents.sofiia.default_llm = cloud_grok` | router-config.node2.yml |
| C: Silent cloud fallback | Unknown profile → local default + WARNING | services/router/main.py |
| C: Hardcoded Ollama URL | `172.18.0.1:11434` → dynamic from config | services/router/main.py |
| — | Node Capabilities Service | services/node-capabilities/ |
---
## Node Capabilities Service
New microservice providing live model inventory at `GET /capabilities`:
- Collects from Ollama, Swapper, llama-server
- Returns canonical JSON with `served_models[]` and `inventory_only[]`
- Cache TTL: 15s
- Port: 127.0.0.1:8099
Verification:
```bash
curl -s http://localhost:8099/capabilities | jq '.served_models | length'
# Expected: 14
```

View File

@@ -0,0 +1,76 @@
# NODA2 Model Audit — Three-layer view
# Date: 2026-02-27
# Source: Node Capabilities Service + manual disk scan
# ─── LAYER 1: SERVED BY RUNTIME (routing-eligible) ───────────────────────────
served_by_runtime:
  ollama:
    base_url: http://host.docker.internal:11434
    version: "0.17.1"
    models:
      - {name: "qwen3.5:35b-a3b", type: llm, size_gb: 9.3, params: "14.8B MoE"}
      - {name: "qwen3:14b", type: llm, size_gb: 9.3, params: "14B"}
      - {name: "gemma3:latest", type: llm, size_gb: 3.3, params: "4B"}
      - {name: "glm-4.7-flash:32k", type: llm, size_gb: 19.0, params: "~32B"}
      - {name: "glm-4.7-flash:q4_K_M", type: llm, size_gb: 19.0, note: "DUPLICATE of :32k"}
      - {name: "llava:13b", type: vision, size_gb: 8.0, params: "13B"}
      - {name: "mistral-nemo:12b", type: llm, size_gb: 7.1, note: "old"}
      - {name: "deepseek-coder:33b", type: code, size_gb: 18.8, params: "33B"}
      - {name: "deepseek-r1:70b", type: llm, size_gb: 42.5, params: "70B"}
      - {name: "starcoder2:3b", type: code, size_gb: 1.7}
      - {name: "phi3:latest", type: llm, size_gb: 2.2}
      - {name: "gpt-oss:latest", type: llm, size_gb: 13.8, note: "old"}
  swapper:
    base_url: http://swapper-service:8890
    active_model: null
    vision_models:
      - {name: "llava-13b", type: vision, size_gb: 8.0, status: unloaded}
    llm_models_count: 9
  llama_server:
    base_url: http://host.docker.internal:11435
    models:
      - {name: "Qwen3.5-35B-A3B-Q4_K_M.gguf", type: llm, note: "DUPLICATE of ollama qwen3.5:35b-a3b"}

# ─── LAYER 2: INSTALLED ON DISK (not served, not for routing) ────────────────
installed_on_disk:
  hf_cache:
    - {name: "whisper-large-v3-turbo-asr-fp16", type: stt, size_gb: 1.5, backend: mlx, ready: true}
    - {name: "Kokoro-82M-bf16", type: tts, size_gb: 0.35, backend: mlx, ready: true}
    - {name: "MiniCPM-V-4_5", type: vision, size_gb: 16.0, backend: hf, ready: false}
    - {name: "Qwen2.5-7B-Instruct", type: llm, size_gb: 14.0, backend: hf}
    - {name: "Qwen2.5-1.5B-Instruct", type: llm, size_gb: 2.9, backend: hf}
    - {name: "FLUX.2-dev (Aquiles)", type: image_gen, size_gb: 105.0, backend: comfyui}
  cursor_worktree:
    - {name: "Qwen3-VL-32B-Instruct", type: vision, size_gb: 123.0, path: "~/.cursor/worktrees/.../models/"}
  jan_ai:
    - {name: "Jan-v2-VL-med-Q8_0", type: vision, size_gb: 9.2, path: "~/Library/Application Support/Jan/"}
  llama_cpp_models:
    - {name: "Qwen3.5-35B-A3B-Q4_K_M.gguf", type: llm, size_gb: 20.0, note: "DUPLICATE, served by llama-server"}
  comfyui:
    - {name: "flux2-dev-Q8_0.gguf", type: image_gen, size_gb: 33.0}
    - {name: "ltx-2-19b-distilled-fp8.safetensors", type: video_gen, size_gb: 25.0}
    - {name: "z_image_turbo_bf16.safetensors", type: image_gen, size_gb: 11.0}
    - {name: "SDXL-base-1.0", type: image_gen, size_gb: 72.0, note: "legacy"}
  hf_models_dir:
    - {name: "stabilityai_sdxl_base_1.0", type: image_gen, size_gb: 72.0, note: "legacy"}

# ─── LAYER 3: CLOUD / EXTERNAL APIs ──────────────────────────────────────────
cloud_apis:
  - {name: "grok-2-1212", provider: grok, api_key_env: "GROK_API_KEY", active: true}
  - {name: "deepseek-chat", provider: deepseek, api_key_env: "DEEPSEEK_API_KEY", active: true}
  - {name: "mistral-large-latest", provider: mistral, api_key_env: "MISTRAL_API_KEY", active: false}

# ─── SOFIIA ROUTING CHAIN (after fix) ────────────────────────────────────────
sofiia_routing:
  agent_registry: "llm_profile: grok"
  router_config: "agents.sofiia.default_llm: cloud_grok → provider=grok, model=grok-2-1212"
  fallback: "fallback_llm: local_default_coder → qwen3:14b (Ollama)"
  env_mapping: "XAI_API_KEY → GROK_API_KEY (aliased in compose)"
  deterministic: true

View File

@@ -0,0 +1,7 @@
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY main.py .
EXPOSE 8099
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8099"]

View File

@@ -0,0 +1,245 @@
"""Node Capabilities Service — exposes live model inventory for router decisions."""
import os
import time
import logging
from typing import Any, Dict, List, Optional

from fastapi import FastAPI
from fastapi.responses import JSONResponse
import httpx

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("node-capabilities")

app = FastAPI(title="Node Capabilities Service", version="1.0.0")

NODE_ID = os.getenv("NODE_ID", "noda2")
OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://host.docker.internal:11434")
SWAPPER_URL = os.getenv("SWAPPER_URL", "http://swapper-service:8890")
LLAMA_SERVER_URL = os.getenv("LLAMA_SERVER_URL", "")

_cache: Dict[str, Any] = {}
_cache_ts: float = 0
CACHE_TTL = int(os.getenv("CACHE_TTL_SEC", "15"))


def _classify_model(name: str) -> str:
    nl = name.lower()
    if any(k in nl for k in ("vl", "vision", "llava", "minicpm-v", "clip")):
        return "vision"
    if any(k in nl for k in ("coder", "starcoder", "codellama", "code")):
        return "code"
    if any(k in nl for k in ("embed", "bge", "minilm", "e5-")):
        return "embedding"
    if any(k in nl for k in ("whisper", "stt")):
        return "stt"
    if any(k in nl for k in ("kokoro", "tts", "bark", "coqui", "xtts")):
        return "tts"
    if any(k in nl for k in ("flux", "sdxl", "stable-diffusion", "ltx")):
        return "image_gen"
    return "llm"


async def _collect_ollama() -> Dict[str, Any]:
    runtime: Dict[str, Any] = {"base_url": OLLAMA_BASE_URL, "status": "unknown", "models": []}
    try:
        async with httpx.AsyncClient(timeout=5) as c:
            r = await c.get(f"{OLLAMA_BASE_URL}/api/tags")
            if r.status_code == 200:
                data = r.json()
                runtime["status"] = "ok"
                for m in data.get("models", []):
                    runtime["models"].append({
                        "name": m.get("name", ""),
                        "size_bytes": m.get("size", 0),
                        "size_gb": round(m.get("size", 0) / 1e9, 1),
                        "type": _classify_model(m.get("name", "")),
                        "modified": m.get("modified_at", "")[:10],
                    })
            ps = await c.get(f"{OLLAMA_BASE_URL}/api/ps")
            if ps.status_code == 200:
                running = ps.json().get("models", [])
                running_names = {m.get("name", "") for m in running}
                for model in runtime["models"]:
                    model["running"] = model["name"] in running_names
    except Exception as e:
        runtime["status"] = f"error: {e}"
        logger.warning(f"Ollama collector failed: {e}")
    return runtime


async def _collect_swapper() -> Dict[str, Any]:
    runtime: Dict[str, Any] = {"base_url": SWAPPER_URL, "status": "unknown", "models": [], "vision_models": [], "active_model": None}
    try:
        async with httpx.AsyncClient(timeout=5) as c:
            h = await c.get(f"{SWAPPER_URL}/health")
            if h.status_code == 200:
                hd = h.json()
                runtime["status"] = hd.get("status", "ok")
                runtime["active_model"] = hd.get("active_model")
            mr = await c.get(f"{SWAPPER_URL}/models")
            if mr.status_code == 200:
                for m in mr.json().get("models", []):
                    runtime["models"].append({
                        "name": m.get("name", ""),
                        "type": m.get("type", "llm"),
                        "size_gb": m.get("size_gb", 0),
                        "status": m.get("status", "unknown"),
                    })
            vr = await c.get(f"{SWAPPER_URL}/vision/models")
            if vr.status_code == 200:
                for m in vr.json().get("models", []):
                    runtime["vision_models"].append({
                        "name": m.get("name", ""),
                        "type": "vision",
                        "size_gb": m.get("size_gb", 0),
                        "status": m.get("status", "unknown"),
                    })
    except Exception as e:
        runtime["status"] = f"error: {e}"
        logger.warning(f"Swapper collector failed: {e}")
    return runtime


async def _collect_llama_server() -> Optional[Dict[str, Any]]:
    if not LLAMA_SERVER_URL:
        return None
    runtime: Dict[str, Any] = {"base_url": LLAMA_SERVER_URL, "status": "unknown", "models": []}
    try:
        async with httpx.AsyncClient(timeout=5) as c:
            r = await c.get(f"{LLAMA_SERVER_URL}/v1/models")
            if r.status_code == 200:
                data = r.json()
                runtime["status"] = "ok"
                for m in data.get("data", data.get("models", [])):
                    name = m.get("id", m.get("name", "unknown"))
                    runtime["models"].append({"name": name, "type": "llm"})
    except Exception as e:
        runtime["status"] = f"error: {e}"
    return runtime


def _collect_disk_inventory() -> List[Dict[str, Any]]:
    """Scan known model directories — NOT for routing, only inventory."""
    import pathlib
    inventory: List[Dict[str, Any]] = []
    scan_dirs = [
        ("cursor_worktrees", pathlib.Path.home() / ".cursor" / "worktrees"),
        ("jan_ai", pathlib.Path.home() / "Library" / "Application Support" / "Jan"),
        ("hf_cache", pathlib.Path.home() / ".cache" / "huggingface" / "hub"),
        ("comfyui_main", pathlib.Path.home() / "ComfyUI" / "models"),
        ("comfyui_docs", pathlib.Path.home() / "Documents" / "ComfyUI" / "models"),
        ("llama_cpp", pathlib.Path.home() / "Library" / "Application Support" / "llama.cpp" / "models"),
        ("hf_models", pathlib.Path.home() / "hf_models"),
    ]
    for source, base in scan_dirs:
        if not base.exists():
            continue
        try:
            for f in base.rglob("*"):
                if f.suffix in (".gguf", ".safetensors", ".bin", ".pt") and f.stat().st_size > 100_000_000:
                    inventory.append({
                        "name": f.stem,
                        "path": str(f.relative_to(pathlib.Path.home())),
                        "source": source,
                        "size_gb": round(f.stat().st_size / 1e9, 1),
                        "type": _classify_model(f.stem),
                        "served": False,
                    })
        except Exception:
            pass
    return inventory


def _build_served_models(ollama: Dict, swapper: Dict, llama: Optional[Dict]) -> List[Dict[str, Any]]:
    """Merge all served models into a flat canonical list."""
    served: List[Dict[str, Any]] = []
    seen = set()
    for m in ollama.get("models", []):
        key = m["name"]
        if key not in seen:
            seen.add(key)
            served.append({**m, "runtime": "ollama", "base_url": ollama["base_url"]})
    for m in swapper.get("vision_models", []):
        key = f"swapper:{m['name']}"
        if key not in seen:
            seen.add(key)
            served.append({**m, "runtime": "swapper", "base_url": swapper["base_url"]})
    if llama:
        for m in llama.get("models", []):
            key = f"llama:{m['name']}"
            if key not in seen:
                seen.add(key)
                served.append({**m, "runtime": "llama_server", "base_url": llama["base_url"]})
    return served


async def _build_capabilities() -> Dict[str, Any]:
    global _cache, _cache_ts
    if _cache and (time.time() - _cache_ts) < CACHE_TTL:
        return _cache
    ollama = await _collect_ollama()
    swapper = await _collect_swapper()
    llama = await _collect_llama_server()
    disk = _collect_disk_inventory()
    served = _build_served_models(ollama, swapper, llama)
    result = {
        "node_id": NODE_ID,
        "updated_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "runtimes": {
            "ollama": ollama,
            "swapper": swapper,
        },
        "served_models": served,
        "served_count": len(served),
        "inventory_only": disk,
        "inventory_count": len(disk),
    }
    if llama:
        result["runtimes"]["llama_server"] = llama
    _cache = result
    _cache_ts = time.time()
    return result


@app.get("/healthz")
async def healthz():
    return {"status": "ok", "node_id": NODE_ID}


@app.get("/capabilities")
async def capabilities():
    data = await _build_capabilities()
    return JSONResponse(content=data)


@app.get("/capabilities/models")
async def capabilities_models():
    data = await _build_capabilities()
    return JSONResponse(content={"node_id": data["node_id"], "served_models": data["served_models"]})


@app.post("/capabilities/refresh")
async def capabilities_refresh():
    global _cache_ts
    _cache_ts = 0
    data = await _build_capabilities()
    return JSONResponse(content={"refreshed": True, "served_count": data["served_count"]})


if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=int(os.getenv("PORT", "8099")))

View File

@@ -0,0 +1,3 @@
fastapi>=0.110.0
uvicorn>=0.29.0
httpx>=0.27.0

View File

@@ -0,0 +1,80 @@
"""Capabilities client — fetches and caches live model inventory from Node Capabilities Service."""
import os
import time
import logging
from typing import Any, Dict, List, Optional

import httpx

logger = logging.getLogger("capabilities_client")

_cache: Dict[str, Any] = {}
_cache_ts: float = 0
NODE_CAPABILITIES_URL = os.getenv("NODE_CAPABILITIES_URL", "")
CACHE_TTL = 30


def configure(url: str = "", ttl: int = 30):
    global NODE_CAPABILITIES_URL, CACHE_TTL
    if url:
        NODE_CAPABILITIES_URL = url
    CACHE_TTL = ttl


async def fetch_capabilities(force: bool = False) -> Dict[str, Any]:
    global _cache, _cache_ts
    if not NODE_CAPABILITIES_URL:
        return {}
    if not force and _cache and (time.time() - _cache_ts) < CACHE_TTL:
        return _cache
    try:
        async with httpx.AsyncClient(timeout=5) as c:
            resp = await c.get(NODE_CAPABILITIES_URL)
            if resp.status_code == 200:
                _cache = resp.json()
                _cache_ts = time.time()
                logger.info(f"Capabilities refreshed: {_cache.get('served_count', 0)} served models")
                return _cache
            else:
                logger.warning(f"Capabilities fetch failed: HTTP {resp.status_code}")
    except Exception as e:
        logger.warning(f"Capabilities fetch error: {e}")
    return _cache


def get_cached() -> Dict[str, Any]:
    return _cache


def find_served_model(
    model_type: str = "llm",
    preferred_name: Optional[str] = None,
    runtime: Optional[str] = None,
) -> Optional[Dict[str, Any]]:
    """Find best served model matching criteria from cached capabilities."""
    served = _cache.get("served_models", [])
    if not served:
        return None
    candidates = [m for m in served if m.get("type") == model_type]
    if runtime:
        candidates = [m for m in candidates if m.get("runtime") == runtime]
    if not candidates:
        return None
    if preferred_name:
        for m in candidates:
            if preferred_name in m.get("name", ""):
                return m
    return candidates[0]


def list_served_by_type(model_type: str = "llm") -> List[Dict[str, Any]]:
    return [m for m in _cache.get("served_models", []) if m.get("type") == model_type]

View File

@@ -1,6 +1,6 @@
from fastapi import FastAPI, HTTPException
from fastapi import FastAPI, HTTPException, Request
from fastapi.responses import Response
from pydantic import BaseModel
from pydantic import BaseModel, ConfigDict
from typing import Literal, Optional, Dict, Any, List
import asyncio
import json
@@ -897,6 +897,134 @@ async def health():
"messaging_inbound_enabled": config.get("messaging_inbound", {}).get("enabled", True)
}
@app.get("/healthz")
async def healthz():
"""Alias /healthz → /health for BFF compatibility."""
return await health()
@app.get("/monitor/status")
async def monitor_status(request: Request = None):
"""
Node monitor status — read-only, safe, no secrets.
Returns: heartbeat, router/gateway health, open incidents,
alerts loop SLO, active backends, last artifact timestamps.
Rate limited: 60 rpm per IP (in-process bucket).
RBAC: requires tools.monitor.read entitlement (or tools.observability.read).
Auth: X-Monitor-Key header (same as SUPERVISOR_API_KEY, optional in dev).
"""
import collections as _collections
# ── Rate limit (60 rpm per IP) ────────────────────────────────────────
_now = time.monotonic()
client_ip = (
(request.client.host if request and request.client else None) or "unknown"
)
_bucket_key = f"monitor:{client_ip}"
if not hasattr(monitor_status, "_buckets"):
monitor_status._buckets = {}
dq = monitor_status._buckets.setdefault(_bucket_key, _collections.deque())
while dq and _now - dq[0] > 60:
dq.popleft()
if len(dq) >= 60:
from fastapi.responses import JSONResponse
return JSONResponse(status_code=429, content={"error": "rate_limit", "message": "60 rpm exceeded"})
dq.append(_now)
# ── Auth (optional in dev, enforced in prod) ──────────────────────────
_env = os.getenv("ENV", "dev").strip().lower()
_monitor_key = os.getenv("SUPERVISOR_API_KEY", "").strip()
if _env in ("prod", "production", "staging") and _monitor_key:
_req_key = ""
if request:
_req_key = (
request.headers.get("X-Monitor-Key", "")
or request.headers.get("Authorization", "").removeprefix("Bearer ").strip()
)
if _req_key != _monitor_key:
from fastapi.responses import JSONResponse
return JSONResponse(status_code=403, content={"error": "forbidden", "message": "X-Monitor-Key required"})
# ── Collect data (best-effort, non-fatal) ─────────────────────────────
warnings: list[str] = []
ts_now = __import__("datetime").datetime.now(
__import__("datetime").timezone.utc
).isoformat(timespec="seconds")
# uptime as heartbeat proxy
_proc_start = getattr(monitor_status, "_proc_start", None)
if _proc_start is None:
monitor_status._proc_start = time.monotonic()
_proc_start = monitor_status._proc_start
heartbeat_age_s = int(time.monotonic() - _proc_start)
# open incidents
open_incidents: int | None = None
try:
from incident_store import get_incident_store as _get_is
_istore = _get_is()
_open = _istore.list_incidents(filters={"status": "open"}, limit=500)
# include "mitigating" as still-open
open_incidents = sum(
1 for i in _open if (i.get("status") or "").lower() in ("open", "mitigating")
)
except Exception as _e:
warnings.append(f"incidents: {str(_e)[:80]}")
# alerts loop SLO
alerts_loop_slo: dict | None = None
try:
from alert_store import get_alert_store as _get_as
alerts_loop_slo = _get_as().compute_loop_slo(window_minutes=240)
# strip any internal keys that may contain infra details
_safe_keys = {"claim_to_ack_p95_seconds", "failed_rate_pct", "processing_stuck_count", "sample_count", "violations"}
alerts_loop_slo = {k: v for k, v in alerts_loop_slo.items() if k in _safe_keys}
except Exception as _e:
warnings.append(f"alerts_slo: {str(_e)[:80]}")
# backends (env vars only — no DSN, no passwords)
backends = {
"alerts": os.getenv("ALERT_BACKEND", "unknown"),
"audit": os.getenv("AUDIT_BACKEND", "unknown"),
"incidents": os.getenv("INCIDENT_BACKEND", "unknown"),
"risk_history": os.getenv("RISK_HISTORY_BACKEND", "unknown"),
"backlog": os.getenv("BACKLOG_BACKEND", "unknown"),
}
# last artifact timestamps (best-effort filesystem scan)
last_artifacts: dict = {}
_base = __import__("pathlib").Path("ops")
for _pattern, _key in [
("reports/risk/*.md", "risk_digest_ts"),
("reports/platform/*.md", "platform_digest_ts"),
("backlog/*.jsonl", "backlog_generate_ts"),
("reports/backlog/*.md", "backlog_report_ts"),
]:
try:
_files = sorted(_base.glob(_pattern))
if _files:
_mtime = _files[-1].stat().st_mtime
last_artifacts[_key] = __import__("datetime").datetime.fromtimestamp(
_mtime, tz=__import__("datetime").timezone.utc
).isoformat(timespec="seconds")
except Exception:
pass
return {
"node_id": os.getenv("NODE_ID", "NODA1"),
"ts": ts_now,
"heartbeat_age_s": heartbeat_age_s,
"router_ok": True, # we are the router; if we respond, we're ok
"gateway_ok": None, # gateway health not probed here (separate svc)
"open_incidents": open_incidents,
"alerts_loop_slo": alerts_loop_slo,
"backends": backends,
"last_artifacts": last_artifacts,
"warnings": warnings,
}
@app.post("/internal/router/test-messaging", response_model=AgentInvocation)
async def test_messaging_route(decision: FilterDecision):
"""
@@ -966,6 +1094,15 @@ class InferResponse(BaseModel):
file_mime: Optional[str] = None
class ToolExecuteRequest(BaseModel):
"""External tool execution request used by console/ops APIs."""
model_config = ConfigDict(extra="allow")
tool: str
action: Optional[str] = None
agent_id: Optional[str] = "sofiia"
metadata: Optional[Dict[str, Any]] = None
# =========================================================================
@@ -1110,15 +1247,21 @@ async def internal_llm_complete(request: InternalLLMRequest):
logger.info(f"Internal LLM: profile={request.llm_profile}, role={request.role_context}")
# Get LLM profile configuration
llm_profiles = router_config.get("llm_profiles", {})
profile_name = request.llm_profile or "reasoning"
llm_profile = llm_profiles.get(profile_name, {})
provider = llm_profile.get("provider", "deepseek")
model = request.model or llm_profile.get("model", "deepseek-chat")
if not llm_profile:
fallback_name = "local_default_coder"
llm_profile = llm_profiles.get(fallback_name, {})
logger.warning(f"⚠️ Profile '{profile_name}' not found in llm_profiles → falling back to '{fallback_name}' (local)")
profile_name = fallback_name
provider = llm_profile.get("provider", "ollama")
model = request.model or llm_profile.get("model", "qwen3:14b")
max_tokens = request.max_tokens or llm_profile.get("max_tokens", 2048)
temperature = request.temperature or llm_profile.get("temperature", 0.2)
logger.info(f"🎯 Resolved: profile={profile_name} provider={provider} model={model}")
# Build messages
messages = []
@@ -1173,10 +1316,11 @@ async def internal_llm_complete(request: InternalLLMRequest):
# Fallback/target local provider (Ollama)
try:
logger.info("Internal LLM to Ollama")
ollama_model = model or "qwen3:8b"
ollama_base = llm_profile.get("base_url", os.getenv("OLLAMA_BASE_URL", "http://host.docker.internal:11434"))
ollama_model = model or "qwen3:14b"
logger.info(f"Internal LLM to Ollama: model={ollama_model} url={ollama_base}")
ollama_resp = await http_client.post(
"http://172.18.0.1:11434/api/generate",
f"{ollama_base}/api/generate",
json={"model": ollama_model, "prompt": request.prompt, "system": request.system_prompt or "", "stream": False, "options": {"num_predict": max_tokens, "temperature": temperature}},
timeout=120.0
)
@@ -1249,15 +1393,17 @@ async def agent_infer(agent_id: str, request: InferRequest):
if not system_prompt:
try:
from prompt_builder import get_agent_system_prompt
system_prompt = await get_agent_system_prompt(
agent_id,
from prompt_builder import get_prompt_builder
prompt_builder = await get_prompt_builder(
city_service_url=CITY_SERVICE_URL,
router_config=router_config
router_config=router_config,
)
logger.info(f"✅ Loaded system prompt from database for {agent_id}")
prompt_result = await prompt_builder.get_system_prompt(agent_id)
system_prompt = prompt_result.system_prompt
system_prompt_source = prompt_result.source
logger.info(f"✅ Loaded system prompt for {agent_id} from {system_prompt_source}")
except Exception as e:
logger.warning(f"⚠️ Could not load prompt from database: {e}")
logger.warning(f"⚠️ Could not load prompt from configured sources: {e}")
# Fallback to config
system_prompt_source = "router_config"
agent_config = router_config.get("agents", {}).get(agent_id, {})
@@ -1450,15 +1596,38 @@ async def agent_infer(agent_id: str, request: InferRequest):
except Exception as e:
logger.exception(f"❌ CrewAI error: {e}, falling back to direct LLM")
default_llm = agent_config.get("default_llm", "qwen3:8b")
default_llm = agent_config.get("default_llm", "local_default_coder")
routing_rules = router_config.get("routing", [])
default_llm = _select_default_llm(agent_id, metadata, default_llm, routing_rules)
# Get LLM profile configuration
cloud_provider_names = {"deepseek", "mistral", "grok", "openai", "anthropic"}
llm_profiles = router_config.get("llm_profiles", {})
llm_profile = llm_profiles.get(default_llm, {})
if not llm_profile:
fallback_llm = agent_config.get("fallback_llm", "local_default_coder")
llm_profile = llm_profiles.get(fallback_llm, {})
logger.warning(
f"⚠️ Profile '{default_llm}' not found for agent={agent_id} "
f"→ fallback to '{fallback_llm}' (local). "
f"NOT defaulting to cloud silently."
)
default_llm = fallback_llm
provider = llm_profile.get("provider", "ollama")
logger.info(f"🎯 Agent={agent_id}: profile={default_llm} provider={provider} model={llm_profile.get('model', '?')}")
# If explicit model is requested, try to resolve it to configured cloud profile.
if request.model:
for profile_name, profile in llm_profiles.items():
if profile.get("model") == request.model and profile.get("provider") in cloud_provider_names:
llm_profile = profile
provider = profile.get("provider", provider)
default_llm = profile_name
logger.info(f"🎛️ Matched request.model={request.model} to profile={profile_name} provider={provider}")
break
# Determine model name
if provider in ["deepseek", "openai", "anthropic", "mistral"]:
@@ -1671,7 +1840,6 @@ async def agent_infer(agent_id: str, request: InferRequest):
max_tokens = request.max_tokens or llm_profile.get("max_tokens", 2048)
temperature = request.temperature or llm_profile.get("temperature", 0.2)
cloud_provider_names = {"deepseek", "mistral", "grok", "openai", "anthropic"}
allow_cloud = provider in cloud_provider_names
if not allow_cloud:
logger.info(f"☁️ Cloud providers disabled for agent {agent_id}: provider={provider}")
@@ -1700,6 +1868,18 @@ async def agent_infer(agent_id: str, request: InferRequest):
}
]
# Custom configured profile for OpenAI-compatible backends (e.g. local llama-server).
if provider == "openai":
cloud_providers = [
{
"name": "openai",
"api_key_env": llm_profile.get("api_key_env", "OPENAI_API_KEY"),
"base_url": llm_profile.get("base_url", "https://api.openai.com"),
"model": request.model or llm_profile.get("model", model),
"timeout": int(llm_profile.get("timeout_ms", 60000) / 1000),
}
]
if not allow_cloud:
cloud_providers = []
@@ -1717,8 +1897,14 @@ async def agent_infer(agent_id: str, request: InferRequest):
logger.debug(f"🔧 {len(tools_payload)} tools available for function calling")
for cloud in cloud_providers:
api_key = os.getenv(cloud["api_key_env"])
if not api_key:
api_key = os.getenv(cloud["api_key_env"], "")
base_url = cloud.get("base_url", "")
is_local_openai = (
cloud.get("name") == "openai"
and isinstance(base_url, str)
and any(host in base_url for host in ["host.docker.internal", "localhost", "127.0.0.1"])
)
if not api_key and not is_local_openai:
logger.debug(f"⏭️ Skipping {cloud['name']}: API key not configured")
continue
@@ -1739,12 +1925,13 @@ async def agent_infer(agent_id: str, request: InferRequest):
request_payload["tools"] = tools_payload
request_payload["tool_choice"] = "auto"
headers = {"Content-Type": "application/json"}
if api_key:
headers["Authorization"] = f"Bearer {api_key}"
cloud_resp = await http_client.post(
f"{cloud['base_url']}/v1/chat/completions",
headers=headers,
json=request_payload,
timeout=cloud["timeout"]
)
@@ -1754,6 +1941,8 @@ async def agent_infer(agent_id: str, request: InferRequest):
choice = data.get("choices", [{}])[0]
message = choice.get("message", {})
response_text = message.get("content", "") or ""
if not response_text and message.get("reasoning_content"):
response_text = str(message.get("reasoning_content", "")).strip()
tokens_used = data.get("usage", {}).get("total_tokens", 0)
# Initialize tool_results to avoid UnboundLocalError
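The `reasoning_content` fallback above covers models that leave `content` empty and return their answer only in a reasoning field; as a standalone helper (name assumed, not from the codebase):

```python
from typing import Any, Dict

def extract_text(message: Dict[str, Any]) -> str:
    """Prefer message.content; fall back to reasoning_content if empty."""
    text = message.get("content", "") or ""
    if not text and message.get("reasoning_content"):
        text = str(message.get("reasoning_content", "")).strip()
    return text
```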
@@ -1959,12 +2148,12 @@ async def agent_infer(agent_id: str, request: InferRequest):
loop_payload["tools"] = tools_payload
loop_payload["tool_choice"] = "auto"
loop_headers = {"Content-Type": "application/json"}
if api_key:
loop_headers["Authorization"] = f"Bearer {api_key}"
loop_resp = await http_client.post(
f"{cloud['base_url']}/v1/chat/completions",
headers=loop_headers,
json=loop_payload,
timeout=cloud["timeout"]
)
@@ -1978,6 +2167,8 @@ async def agent_infer(agent_id: str, request: InferRequest):
loop_data = loop_resp.json()
loop_message = loop_data.get("choices", [{}])[0].get("message", {})
response_text = loop_message.get("content", "") or ""
if not response_text and loop_message.get("reasoning_content"):
response_text = str(loop_message.get("reasoning_content", "")).strip()
tokens_used += loop_data.get("usage", {}).get("total_tokens", 0)
current_tool_calls = loop_message.get("tool_calls", [])
@@ -2123,16 +2314,24 @@ async def agent_infer(agent_id: str, request: InferRequest):
# LOCAL PROVIDERS (Ollama via Swapper)
# =========================================================================
# Determine local model from config (not hardcoded)
# Strategy:
# 1) explicit request.model override
# 2) agent default_llm if it's local (ollama)
# 3) first local profile fallback
local_model = None
requested_local_model = (request.model or "").strip()
if requested_local_model:
local_model = requested_local_model.replace(":", "-")
logger.info(f"🎛️ Local model override requested: {requested_local_model} -> {local_model}")
# Check if default_llm is local
if not local_model and llm_profile.get("provider") == "ollama":
# Extract model name and convert the separator for Swapper (qwen3:8b → qwen3-8b)
ollama_model = llm_profile.get("model", "qwen3:8b")
local_model = ollama_model.replace(":", "-")  # qwen3:8b → qwen3-8b
logger.debug(f"✅ Using agent's default local model: {local_model}")
elif not local_model:
# Find first local model from config
for profile_name, profile in llm_profiles.items():
if profile.get("provider") == "ollama":
@@ -2259,6 +2458,60 @@ async def agent_infer(agent_id: str, request: InferRequest):
)
@app.post("/v1/tools/execute")
async def tools_execute(request: ToolExecuteRequest):
"""
Execute a single tool call through ToolManager.
Returns console-compatible shape: {status, data, error}.
"""
if not tool_manager:
raise HTTPException(status_code=503, detail="Tool manager unavailable")
payload = request.model_dump(exclude_none=True)
tool_name = str(payload.pop("tool", "")).strip()
action = payload.pop("action", None)
agent_id = str(payload.pop("agent_id", "sofiia") or "sofiia").strip()
metadata = payload.pop("metadata", {}) or {}
if not tool_name:
raise HTTPException(status_code=422, detail="tool is required")
# Keep backward compatibility with sofiia-console calls
if action is not None:
payload["action"] = action
chat_id = str(metadata.get("chat_id", "") or "") or None
user_id = str(metadata.get("user_id", "") or "") or None
workspace_id = str(metadata.get("workspace_id", "default") or "default")
try:
result = await tool_manager.execute_tool(
tool_name=tool_name,
arguments=payload,
agent_id=agent_id,
chat_id=chat_id,
user_id=user_id,
workspace_id=workspace_id,
)
except Exception as e:
logger.exception("❌ Tool execution failed: %s", tool_name)
raise HTTPException(status_code=500, detail=f"Tool execution error: {str(e)[:200]}")
data: Dict[str, Any] = {"result": result.result}
if result.image_base64:
data["image_base64"] = result.image_base64
if result.file_base64:
data["file_base64"] = result.file_base64
if result.file_name:
data["file_name"] = result.file_name
if result.file_mime:
data["file_mime"] = result.file_mime
if result.success:
return {"status": "ok", "data": data, "error": None}
return {"status": "failed", "data": data, "error": {"message": result.error or "Tool failed"}}
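The endpoint above normalizes `ToolManager` results into the console-compatible `{status, data, error}` shape; a standalone sketch of that mapping (the `ToolResult` dataclass here is a simplified stand-in for the real result type, field names mirror those used above):

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

@dataclass
class ToolResult:
    success: bool
    result: Any = None
    error: Optional[str] = None
    image_base64: Optional[str] = None

def to_console_shape(res: ToolResult) -> Dict[str, Any]:
    """Map a tool result to the console-compatible response shape."""
    data: Dict[str, Any] = {"result": res.result}
    if res.image_base64:
        data["image_base64"] = res.image_base64
    if res.success:
        return {"status": "ok", "data": data, "error": None}
    return {"status": "failed", "data": data,
            "error": {"message": res.error or "Tool failed"}}
```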
@app.get("/v1/models")
async def list_available_models():
"""List all available models across backends"""
@@ -124,6 +124,23 @@ llm_profiles:
timeout_ms: 60000
description: "Mistral Large for complex tasks, reasoning, and analysis"
cloud_grok:
provider: grok
base_url: https://api.x.ai
api_key_env: GROK_API_KEY
model: grok-2-1212
max_tokens: 2048
temperature: 0.2
timeout_ms: 60000
description: "Grok API for SOFIIA (Chief AI Architect)"
# ============================================================================
# Node Capabilities
# ============================================================================
node_capabilities:
url: http://node-capabilities:8099/capabilities
cache_ttl_sec: 30
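The `node_capabilities` block above pairs with `services/router/capabilities_client.py`, which caches NCS responses for `cache_ttl_sec` seconds. A minimal synchronous TTL-cache sketch of that behavior (class and parameter names are illustrative; the real client is async):

```python
import time
from typing import Any, Callable, Optional

class TTLCache:
    """Return a cached value until ttl_sec seconds elapse, then refetch."""

    def __init__(self, fetch: Callable[[], Any], ttl_sec: float = 30.0):
        self._fetch = fetch
        self._ttl = ttl_sec
        self._value: Optional[Any] = None
        self._expires_at = 0.0

    def get(self, force_refresh: bool = False) -> Any:
        # Refetch when forced (mirrors POST /capabilities/refresh),
        # when nothing is cached yet, or when the TTL has expired.
        now = time.monotonic()
        if force_refresh or self._value is None or now >= self._expires_at:
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value
```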
# ============================================================================
# Orchestrator Providers
# ============================================================================
@@ -417,8 +434,9 @@ agents:
Distinguish other bots by their nickname and respond only to strategic requests.
sofiia:
description: "SOFIIA — Chief AI Architect & Technical Sovereign"
default_llm: cloud_grok
fallback_llm: local_default_coder
system_prompt: |
You are Sofiia, the Chief AI Architect and Technical Sovereign of the DAARION.city ecosystem.
Work as a CTO assistant: architecture, reliability, security, release governance, incident/risk/backlog control.