- Node Worker: replace swapper_vision with ollama_vision (direct Ollama API)
- Node Worker: add NATS subjects for stt/tts/image (stubs ready)
- Node Worker: remove SWAPPER_URL dependency from config
- Router: vision calls go directly to Ollama /api/generate with images
- Router: local LLM calls go directly to Ollama /api/generate
- Router: add OLLAMA_URL and PREFER_NODE_WORKER=true feature flag
- Router: /v1/models now uses NCS global capabilities pool
- NCS: SWAPPER_URL="" -> skip Swapper probing (status=disabled)
- Swapper configs: remove all hardcoded model lists, keep only runtime
URLs, timeouts, limits
- docker-compose.node1.yml: add OLLAMA_URL, PREFER_NODE_WORKER for router;
set SWAPPER_URL= (empty) for NCS; remove swapper-service from node-worker depends_on
- docker-compose.node2-sofiia.yml: same changes for NODA2
Swapper service still runs but is NOT in the critical inference path.
Source of truth for models is now NCS -> Ollama /api/tags.
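For reference, a minimal stdlib-only sketch of the two Ollama calls that now sit on the critical path: reading served models from GET /api/tags and sending a vision request to POST /api/generate with base64-encoded images. The request and response fields follow Ollama's public REST API; the function names and the localhost default for OLLAMA_URL are assumptions, not the router's actual code.

```python
import base64
import json
import urllib.request

# Assumption: default only; the router reads OLLAMA_URL from its environment.
DEFAULT_OLLAMA_URL = "http://localhost:11434"


def served_model_names(tags_response: dict) -> list[str]:
    """Extract model names from an Ollama GET /api/tags response body."""
    return sorted(m["name"] for m in tags_response.get("models", []))


def build_vision_request(model: str, prompt: str, images: list[bytes]) -> dict:
    """Build a non-streaming POST /api/generate body with base64-encoded images."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(img).decode("ascii") for img in images],
        "stream": False,
    }


def ollama_generate(payload: dict, base_url: str = DEFAULT_OLLAMA_URL,
                    timeout: float = 120.0) -> str:
    """POST the payload to /api/generate and return the generated text."""
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        # With "stream": false Ollama returns one JSON object with a "response" field.
        return json.loads(resp.read())["response"]
```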
Made-with: Cursor
NATS wildcards (node.*.capabilities.get) only work for subscriptions,
not for publish. Switch to a dedicated broadcast subject
(fabric.capabilities.discover) that all NCS instances subscribe to,
enabling proper scatter-gather discovery across nodes.
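A simplified model of the NATS subject rules behind this change: `*` matches exactly one token and `>` matches one or more trailing tokens, but only in subscriptions; a publish subject must be fully literal. The helper names are hypothetical, and other NATS character restrictions are ignored here.

```python
def is_valid_publish_subject(subject: str) -> bool:
    """Publish subjects must be literal: no empty tokens and no '*' or '>' wildcards."""
    return all(tok and tok not in ("*", ">") for tok in subject.split("."))


def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if a subscription pattern matches a literal subject.

    '*' matches exactly one token; '>' (valid only as the last token)
    matches one or more remaining tokens.
    """
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)
```

With a literal subject such as fabric.capabilities.discover, a scatter-gather client can publish one request with a reply inbox and collect replies from every subscribed NCS until a timeout.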
Made-with: Cursor
Architecture for 150+ nodes:
- global_capabilities_client.py: NATS scatter-gather discovery using
wildcard subject node.*.capabilities.get — zero static node lists.
New nodes auto-register by deploying NCS and subscribing to NATS.
Dead nodes expire from cache after 3x TTL automatically.
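A sketch of the 3x TTL expiry rule, assuming each discovery reply refreshes a node's last-seen timestamp. `NodeCapsCache` and its 15 s default TTL are hypothetical; the clock is injectable so the behavior can be checked without sleeping.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class NodeCapsCache:
    """Per-node capabilities cache; a node is dropped after 3x TTL without a reply."""
    ttl_s: float = 15.0  # assumption: real TTL comes from the client's config
    clock: Callable[[], float] = time.monotonic
    # node_id -> (last_seen, caps)
    _entries: dict = field(default_factory=dict, init=False, repr=False)

    def update(self, node_id: str, caps: dict) -> None:
        """Record a fresh discovery reply from a node."""
        self._entries[node_id] = (self.clock(), caps)

    def live_nodes(self) -> dict:
        """Return caps for nodes seen within 3x TTL and evict the rest."""
        cutoff = self.clock() - 3 * self.ttl_s
        self._entries = {n: e for n, e in self._entries.items() if e[0] >= cutoff}
        return {n: caps for n, (_, caps) in self._entries.items()}
```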
Multi-node model_select.py:
- ModelSelection now includes node, local, via_nats fields
- select_best_model prefers local candidates, then remote
- Prefer list resolution: local first, remote second
- All logged per request: node, runtime, model, local/remote
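A sketch of local-first selection, assuming candidates arrive as dicts with model, runtime and node keys, and reading "local first, remote second" as: a local match for any prefer entry beats a remote match for an earlier one. The field names mirror the commit text; everything else, including via_nats simply meaning "not local", is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelSelection:
    model: str
    runtime: str
    node: str
    local: bool
    via_nats: bool


def select_best_model(candidates: list[dict], prefer: list[str],
                      local_node: str) -> Optional[ModelSelection]:
    """Two passes over the prefer list: local candidates first, then remote ones."""
    for want_local in (True, False):
        for wanted in prefer:
            for c in candidates:
                is_local = c["node"] == local_node
                if c["model"] == wanted and is_local == want_local:
                    return ModelSelection(
                        model=c["model"],
                        runtime=c["runtime"],
                        node=c["node"],
                        local=is_local,
                        via_nats=not is_local,  # assumption: remote models are reached over NATS
                    )
    return None
```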
NODA1 compose:
- Added node-capabilities service (NCS) to docker-compose.node1.yml
- NATS subscription: node.noda1.capabilities.get
- Router env: NODE_CAPABILITIES_URL + ENABLE_GLOBAL_CAPS_NATS=true
NODA2 compose:
- Router env: ENABLE_GLOBAL_CAPS_NATS=true
Router main.py:
- Startup: initializes global_capabilities_client (NATS connect + first
discovery). Falls back to local-only capabilities_client if unavailable.
- /infer: uses get_global_capabilities() for cross-node model pool
- Offload support: send_offload_request(node_id, type, payload) via NATS
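A sketch of the startup fallback with hypothetical stand-in classes: the router tries to start the NATS-backed global client within a timeout and, on any failure, logs a warning and keeps the local-only client. The class names and the 5 s timeout are assumptions.

```python
import asyncio
import logging

log = logging.getLogger("router.startup")


class LocalCapsClient:
    """Hypothetical stand-in for capabilities_client (HTTP to the local NCS)."""

    async def get_capabilities(self) -> dict:
        return {"scope": "local"}


class GlobalCapsClient:
    """Hypothetical stand-in for global_capabilities_client (NATS discovery)."""

    def __init__(self, connect):
        self._connect = connect

    async def start(self) -> None:
        await self._connect()  # NATS connect + first discovery round

    async def get_capabilities(self) -> dict:
        return {"scope": "global"}


async def init_caps_client(global_client, local_client, timeout_s: float = 5.0):
    """Prefer the global client; fall back to local-only if NATS is unavailable."""
    try:
        await asyncio.wait_for(global_client.start(), timeout=timeout_s)
        return global_client
    except Exception as exc:  # connection refused, timeout, ...
        log.warning("global caps unavailable (%r); using local-only client", exc)
        return local_client
```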
Verified on NODA2:
- Global caps: 1 node, 14 models (NODA1 not yet deployed)
- Sofiia: cloud_grok → grok-4-1-fast-reasoning (OK)
- Helion: NCS → qwen3:14b local (OK)
- When NODA1 deploys NCS, its models appear automatically via NATS discovery
Made-with: Cursor
Router model selection:
- New model_select.py: resolve_effective_profile → profile_requirements →
select_best_model pipeline. NCS-first with graceful static fallback.
- selection_policies in router-config.node2.yml define prefer order per
profile without hardcoding models (e.g. local_default_coder prefers
qwen3:14b then qwen3.5:35b-a3b).
- Cloud profiles (cloud_grok, cloud_deepseek) skip NCS; on cloud failure
use fallback_profile via NCS for local selection.
- Structured logs: selected_profile, required_type, runtime, model,
caps_age_s, fallback_reason on every infer request.
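A condensed sketch of the cloud-skip and fallback_profile path, assuming a config shaped roughly like router-config.node2.yml (the keys here are hypothetical). `served` stands in for the model names NCS reports; cloud profiles never consult it unless the cloud call fails.

```python
import logging
from typing import Callable, Optional

log = logging.getLogger("router.model_select")

# Hypothetical config; the real router-config.node2.yml keys may differ.
CONFIG = {
    "profiles": {
        "cloud_grok": {
            "provider": "grok",
            "model": "grok-4-1-fast-reasoning",
            "fallback_profile": "local_default_coder",
        },
        "local_default_coder": {"provider": "ollama"},
    },
    "selection_policies": {
        "local_default_coder": {"prefer": ["qwen3:14b", "qwen3.5:35b-a3b"]},
    },
}


def pick_local(profile: str, served: set[str]) -> Optional[str]:
    """First model in the profile's prefer order that NCS reports as served."""
    prefer = CONFIG["selection_policies"].get(profile, {}).get("prefer", [])
    return next((m for m in prefer if m in served), None)


def infer(profile: str, served: set[str],
          call_cloud: Callable[[str], str],
          call_local: Callable[[str], str]) -> str:
    cfg = CONFIG["profiles"][profile]
    if cfg["provider"] != "ollama":  # cloud profile: skip NCS entirely
        try:
            return call_cloud(cfg["model"])
        except Exception as exc:
            fallback = cfg.get("fallback_profile")
            if fallback is None:
                raise
            log.warning("cloud %s failed (%r); fallback_profile=%s", profile, exc, fallback)
            profile = fallback
    model = pick_local(profile, served)
    if model is None:
        raise LookupError(f"no served model for profile {profile!r}")
    return call_local(model)
```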
Grok model fix:
- grok-2-1212 no longer exists on xAI API → updated to
grok-4-1-fast-reasoning across all 3 hardcoded locations in main.py
and router-config.node2.yml.
NCS NATS request/reply:
- node-capabilities subscribes to node.noda2.capabilities.get (NATS
request/reply). Enabled via ENABLE_NATS_CAPS=true in compose.
- NODA1 router can query NODA2 capabilities over NATS leafnode without
HTTP connectivity.
Verified:
- NCS: 14 served models from Ollama+Swapper+llama-server
- NATS: request/reply returns full capabilities JSON
- Sofiia: cloud_grok → grok-4-1-fast-reasoning (tested, 200 OK)
- Helion: NCS → qwen3:14b via Ollama (caps_age=23.7s cache hit)
- Router health: ok
Made-with: Cursor
Bug fixes:
- Bug A: GROK_API_KEY env mismatch — router expected GROK_API_KEY but only
XAI_API_KEY was present. Added GROK_API_KEY=${XAI_API_KEY} alias in compose.
- Bug B: 'grok' profile missing in router-config.node2.yml — added cloud_grok
profile (provider: grok, model: grok-2-1212). Sofiia now has
default_llm=cloud_grok with fallback_llm=local_default_coder.
- Bug C: Router silently defaulted to cloud DeepSeek when profile was unknown.
Now falls back to agent.fallback_llm or local_default_coder with WARNING log.
- Hardcoded Ollama URL (172.18.0.1) replaced with config-driven base_url.
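A sketch of the Bug C fallback order, assuming profiles are keyed by name and the agent config carries fallback_llm: unknown profile → agent.fallback_llm if that profile exists → local_default_coder, with a WARNING each time. The function name is hypothetical.

```python
import logging
from typing import Optional

log = logging.getLogger("router.profiles")

DEFAULT_LOCAL_PROFILE = "local_default_coder"


def resolve_profile(requested: Optional[str], agent: dict, profiles: dict) -> str:
    """Never default silently to a cloud provider: fall back locally and warn."""
    if requested in profiles:
        return requested
    fallback = agent.get("fallback_llm")
    if fallback not in profiles:
        fallback = DEFAULT_LOCAL_PROFILE
    log.warning("unknown LLM profile %r for agent %r; using %r",
                requested, agent.get("id"), fallback)
    return fallback
```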
New service: Node Capabilities Service (NCS)
- services/node-capabilities/ — FastAPI microservice exposing live model
inventory from Ollama, Swapper, and llama-server.
- GET /capabilities — canonical JSON with served_models[] and inventory_only[]
- GET /capabilities/models — flat list of served models
- POST /capabilities/refresh — force cache refresh
- Cache TTL 15s, bound to 127.0.0.1:8099
- services/router/capabilities_client.py — async client with TTL cache
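A sketch of an async TTL cache of the kind capabilities_client.py describes, assuming a 15 s TTL to match NCS. `fetch` is a hypothetical coroutine (for example an HTTP GET of http://127.0.0.1:8099/capabilities); the lock keeps concurrent requests from triggering duplicate refreshes, and age_s() is the kind of value a caps_age_s log field could report.

```python
import asyncio
import time
from typing import Awaitable, Callable, Optional


class TTLCapsClient:
    """Async TTL cache in front of an NCS fetch (hypothetical sketch)."""

    def __init__(self, fetch: Callable[[], Awaitable[dict]], ttl_s: float = 15.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch
        self._ttl_s = ttl_s
        self._clock = clock
        self._value: Optional[dict] = None
        self._fetched_at = float("-inf")
        self._lock = asyncio.Lock()

    def age_s(self) -> float:
        """Seconds since the last successful fetch."""
        return self._clock() - self._fetched_at

    async def get(self, force: bool = False) -> dict:
        async with self._lock:  # only one refresh in flight at a time
            if force or self._value is None or self.age_s() >= self._ttl_s:
                self._value = await self._fetch()
                self._fetched_at = self._clock()
            return self._value
```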
Artifacts:
- ops/node2_models_audit.md — 3-layer model view (served/disk/cloud)
- ops/node2_models_audit.yml — machine-readable audit
- ops/node2_capabilities_example.json — sample NCS output (14 served models)
Made-with: Cursor
Clarify Helion group behavior: stay silent unless the message is about energy or mentions Helion directly; when addressed directly, also answer operational questions.
Co-authored-by: Cursor <cursoragent@cursor.com>
Prevent DeepSeek DSML from leaking to users and avoid returning raw memory_search/web results when DSML is detected.
Co-authored-by: Cursor <cursoragent@cursor.com>
Router (main.py):
- When DSML detected in 2nd LLM response after tool execution,
make a 3rd LLM call with explicit synthesis prompt instead of
returning raw tool results to the user
- Falls back to format_tool_calls_for_response only if 3rd call fails
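A sketch of the DSML handling order in plain Python, with the LLM and the formatter passed in as callables. The DSML detection regex, the synthesis prompt wording, and treating a DSML-laden third reply as a failure are all assumptions, not the router's actual logic.

```python
import logging
import re
from typing import Callable

log = logging.getLogger("router.dsml")

# Assumption: DSML is recognizable by these tag names; main.py may detect it differently.
_DSML_RE = re.compile(r"</?(?:function_calls|invoke|parameter)\b", re.IGNORECASE)

SYNTHESIS_PROMPT = (  # hypothetical wording
    "Using the tool results below, answer the user in plain language. "
    "Do not output any tool-call markup.\n\n"
)


def contains_dsml(text: str) -> bool:
    return bool(_DSML_RE.search(text))


def finalize_reply(second_reply: str, tool_results: list[str],
                   llm: Callable[[str], str],
                   format_tool_calls_for_response: Callable[[list[str]], str]) -> str:
    """Return the 2nd reply if clean; else try a 3rd synthesis call; else format raw results."""
    if not contains_dsml(second_reply):
        return second_reply
    try:
        third = llm(SYNTHESIS_PROMPT + "\n".join(tool_results))
        if third.strip() and not contains_dsml(third):
            return third
        log.warning("3rd LLM call still returned empty or DSML output")
    except Exception as exc:
        log.warning("3rd LLM call failed: %r", exc)
    return format_tool_calls_for_response(tool_results)
```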
Router (tool_manager.py):
- Added _strip_think_tags() helper for <think>...</think> removal
from DeepSeek reasoning artifacts
Gateway (http_api.py):
- Strip <think>...</think> tags before sending to Telegram
- Strip DSML/XML-like markup (function_calls, invoke, parameter tags)
- Fall back to "..." when the text is empty after stripping
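A sketch of the gateway-side sanitizer, assuming the DSML markup uses the literal tag names listed above (function_calls, invoke, parameter); real DeepSeek output may carry extra prefixes that these patterns would miss. `sanitize_for_telegram` is a hypothetical name.

```python
import re

_THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)
# Drop whole function_calls blocks first, then any stray opening/closing tags.
_BLOCK_RE = re.compile(r"<function_calls\b.*?</function_calls>", re.DOTALL | re.IGNORECASE)
_TAG_RE = re.compile(r"</?(?:function_calls|invoke|parameter)\b[^>]*>", re.IGNORECASE)


def sanitize_for_telegram(text: str) -> str:
    """Strip reasoning and DSML-like markup; never return an empty message."""
    text = _THINK_RE.sub("", text)
    text = _BLOCK_RE.sub("", text)
    text = _TAG_RE.sub("", text)
    return text.strip() or "..."
```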
Deployed to NODE1 and verified that services are running.
Co-authored-by: Cursor <cursoragent@cursor.com>
- Fixed the unquoted `helion` in the tool_manager.py search_memories
fallback, which was read as a variable; it is now the string literal `"helion"`
- Replaced the `[Контекст пам'яті]` ("Memory context") marker with
`[INTERNAL MEMORY - do NOT repeat to user]` in all 3 injection points in main.py
- Verified: Senpai now responds without Helion contamination and without
leaking the memory brief
Tested and deployed on NODE1.
Co-authored-by: Cursor <cursoragent@cursor.com>