Commit Graph

218 Commits

Author SHA1 Message Date
Apple
5b4c4f92ba feat(aurora): add detection overlays with face/plate boxes in compare UI 2026-03-01 05:00:29 -08:00
Apple
79f26ab683 feat(aurora-ui): add interactive pre-analysis controls and quality report 2026-03-01 04:10:10 -08:00
Apple
fe0f2e23c2 feat(aurora): expose quality report API and proxy via sofiia console 2026-03-01 03:59:54 -08:00
Apple
c230abe9cf fix(aurora): harden Kling integration and surface config diagnostics 2026-03-01 03:55:16 -08:00
Apple
ff97d3cf4a fix(console): route Aurora Kling enhance via standard proxy base URL 2026-03-01 03:48:19 -08:00
Apple
57632699c0 chore(cleanup): remove obsolete compose version and trim router Dockerfile 2026-03-01 01:37:30 -08:00
Apple
de234112f3 feat(node2): wire calendar-service and core automation tools in router 2026-03-01 01:37:13 -08:00
Apple
9a36020316 P3.5-P3.7: 2-layer inventory, capability routing, STT/TTS adapters, Dev Contract
NCS:
- _collect_worker_caps() fetches capability flags from node-worker /caps
- _derive_capabilities() merges served model types + worker provider flags
- installed_artifacts replaces inventory_only (disk scan with DISK_SCAN_PATHS env)
- New endpoints: /capabilities/caps, /capabilities/installed

Node Worker:
- STT_PROVIDER, TTS_PROVIDER, OCR_PROVIDER, IMAGE_PROVIDER env flags
- /caps endpoint returns capabilities + providers for NCS aggregation
- STT adapter (providers/stt_mlx_whisper.py) — remote + local mode
- TTS adapter (providers/tts_mlx_kokoro.py) — remote + local mode
- OCR handler via vision_prompted (ollama_vision with OCR prompt)
- NATS subjects: node.{id}.stt/tts/ocr/image.request

Router:
- POST /v1/capability/{stt,tts,ocr,image} — capability-based offload routing
- GET /v1/capabilities — global view with capabilities_by_node
- require_fresh_caps(ttl) preflight guard
- find_nodes_with_capability(cap) + load-based node selection

Ops:
- ops/fabric_snapshot.py — full runtime snapshot collector
- ops/fabric_preflight.sh — quick check + snapshot save + diff
- docs/fabric_contract.md — Dev Contract v0.1 (preflight-first)
- tests/test_fabric_contract.py — CI enforcement (6 tests)

Made-with: Cursor
2026-02-27 05:24:09 -08:00
Apple
194c87f53c feat(fabric): decommission Swapper from critical path, NCS = source of truth
- Node Worker: replace swapper_vision with ollama_vision (direct Ollama API)
- Node Worker: add NATS subjects for stt/tts/image (stubs ready)
- Node Worker: remove SWAPPER_URL dependency from config
- Router: vision calls go directly to Ollama /api/generate with images
- Router: local LLM calls go directly to Ollama /api/generate
- Router: add OLLAMA_URL and PREFER_NODE_WORKER=true feature flag
- Router: /v1/models now uses NCS global capabilities pool
- NCS: SWAPPER_URL="" -> skip Swapper probing (status=disabled)
- Swapper configs: remove all hardcoded model lists, keep only runtime
  URLs, timeouts, limits
- docker-compose.node1.yml: add OLLAMA_URL, PREFER_NODE_WORKER for router;
  SWAPPER_URL= for NCS; remove swapper-service from node-worker depends_on
- docker-compose.node2-sofiia.yml: same changes for NODA2

Swapper service still runs but is NOT in the critical inference path.
Source of truth for models is now NCS -> Ollama /api/tags.

Made-with: Cursor
2026-02-27 04:16:16 -08:00
Apple
90080c632a fix(fabric): use broadcast subject for NATS capabilities discovery
NATS wildcards (node.*.capabilities.get) only work for subscriptions,
not for publishing. Switch to a dedicated broadcast subject
(fabric.capabilities.discover) that all NCS instances subscribe to,
enabling proper scatter-gather discovery across nodes.

Made-with: Cursor
2026-02-27 03:20:13 -08:00
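
The wildcard behaviour this fix works around can be illustrated with a small in-memory sketch. It is not the repository's code: subject_matches and receivers are hypothetical helpers, and the subscription table is an assumption based on the subjects named in the commit messages. It follows the standard NATS token rules, where '*' matches exactly one token and '>' matches one or more trailing tokens in a subscription pattern.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if a subscription pattern matches a published subject.

    '*' matches exactly one token; '>' matches one or more trailing tokens.
    """
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, tok in enumerate(p_tokens):
        if tok == ">":
            # '>' must be the last token and needs at least one token left.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if tok != "*" and tok != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)


# Assumed layout: each NCS instance subscribes to its own per-node subject
# and to the shared broadcast subject introduced by this commit.
subscriptions = {
    "noda1": ["node.noda1.capabilities.get", "fabric.capabilities.discover"],
    "noda2": ["node.noda2.capabilities.get", "fabric.capabilities.discover"],
}


def receivers(published_subject: str) -> list[str]:
    """List the nodes whose subscriptions match a published subject."""
    return sorted(
        node
        for node, patterns in subscriptions.items()
        if any(subject_matches(p, published_subject) for p in patterns)
    )
```

In this model, publishing to node.*.capabilities.get reaches no node, because the per-node subscriptions are literal subjects and the wildcard is only interpreted on the subscribing side. Publishing to fabric.capabilities.discover reaches every node.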
Apple
a6531507df merge: integrate remote codex/sync-node1-runtime with fabric layer changes
Resolve conflicts in docker-compose.node1.yml, services/router/main.py,
and gateway-bot/services/doc_service.py — keeping both fabric layer
(NCS, node-worker, Prometheus) and document ingest/query endpoints.

Made-with: Cursor
2026-02-27 03:09:12 -08:00
Apple
ed7ad49d3a P3.2+P3.3+P3.4: NODA1 node-worker + NATS auth config + Prometheus counters
P3.2 — Multi-node deployment:
- Added node-worker service to docker-compose.node1.yml (NODE_ID=noda1)
- NCS NODA1 now has NODE_WORKER_URL for metrics collection
- Fixed NODE_ID consistency: router NODA1 uses 'noda1'
- NODA2 node-worker/NCS gets NCS_REPORT_URL for latency reporting

P3.3 — NATS accounts/auth (opt-in config):
- config/nats-server.conf with 3 accounts: SYS, FABRIC, APP
- Per-user topic permissions (router, ncs, node_worker)
- Leafnode listener :7422 with auth
- Not yet activated (requires credential provisioning)

P3.4 — Prometheus counters:
- Router /fabric_metrics: caps_refresh, caps_stale, model_select,
  offload_total, breaker_state, score_ms histogram
- Node Worker /prom_metrics: jobs_total, inflight gauge, latency_ms histogram
- NCS /prom_metrics: runtime_health, runtime_p50/p95, node_wait_ms
- All bound to 127.0.0.1 (not externally exposed)

Made-with: Cursor
2026-02-27 03:03:18 -08:00
Apple
a605b8c43e P3.1: GPU/Queue-aware routing — NCS metrics + scoring-based model selection
NCS (services/node-capabilities/metrics.py):
- NodeLoad: inflight_jobs, queue_depth, concurrency_limit, estimated_wait_ms,
  cpu_load_1m, mem_pressure (macOS + Linux), rtt_ms_to_hub
- RuntimeLoad: per-runtime healthy, p50_ms, p95_ms from rolling 50-sample window
- POST /capabilities/report_latency for node-worker → NCS reporting
- NCS fetches worker metrics via NODE_WORKER_URL

Node Worker:
- GET /metrics endpoint (inflight, concurrency, latency buffers)
- Latency tracking per job type (llm/vision) with rolling buffer
- Fire-and-forget latency reporting to NCS after each successful job

Router (model_select v3):
- score_candidate(): wait + model_latency + cross_node_penalty + prefer_bonus
- LOCAL_THRESHOLD_MS=250: prefer local if within threshold of remote
- ModelSelection.score field for observability
- Structured [score] logs with chosen node, model, and score breakdown

Tests: 19 new (12 scoring + 7 NCS metrics), 36 total pass
Docs: ops/runbook_p3_1.md, ops/CHANGELOG_FABRIC.md

No breaking changes to JobRequest/JobResponse or capabilities schema.

Made-with: Cursor
2026-02-27 02:55:44 -08:00
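
As a rough illustration of the scoring described in this commit, the sketch below sums the four terms named for score_candidate() and applies LOCAL_THRESHOLD_MS. Only the term names and the 250 ms threshold come from the commit message. The Candidate fields, the penalty and bonus values, and the exact way the threshold is applied in select() are assumptions made for the example.

```python
from dataclasses import dataclass

LOCAL_THRESHOLD_MS = 250  # value stated in the commit message

# Hypothetical weights; the commit does not state the real values.
CROSS_NODE_PENALTY_MS = 50.0
PREFER_BONUS_MS = -25.0


@dataclass
class Candidate:
    node: str
    model: str
    local: bool
    wait_ms: float           # estimated queue wait for the node
    model_latency_ms: float  # e.g. a p50 latency for the runtime
    preferred: bool = False  # True if the model is on the profile's prefer list


def score_candidate(c: Candidate) -> float:
    """Lower is better: wait + model latency + cross-node penalty + prefer bonus."""
    score = c.wait_ms + c.model_latency_ms
    if not c.local:
        score += CROSS_NODE_PENALTY_MS
    if c.preferred:
        score += PREFER_BONUS_MS
    return score


def select(candidates: list[Candidate]) -> Candidate:
    """Pick the lowest score, but keep the best local candidate when it is
    within LOCAL_THRESHOLD_MS of the overall best (assumed interpretation)."""
    best = min(candidates, key=score_candidate)
    local_candidates = [c for c in candidates if c.local]
    if local_candidates:
        best_local = min(local_candidates, key=score_candidate)
        if score_candidate(best_local) - score_candidate(best) <= LOCAL_THRESHOLD_MS:
            return best_local
    return best
```

Under these assumed weights, a local candidate scoring 700 is still chosen over a remote one scoring 450, while a local candidate scoring 1400 is not.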
Apple
c4b94a327d P2.2+P2.3: NATS offload node-worker + router offload integration
Node Worker (services/node-worker/):
- NATS subscriber for node.{NODE_ID}.llm.request / vision.request
- Canonical JobRequest/JobResponse envelope (Pydantic)
- Idempotency cache (TTL 10min) with inflight dedup
- Deadline enforcement (DEADLINE_EXCEEDED on expired jobs)
- Concurrency limiter (semaphore, returns busy)
- Ollama + Swapper vision providers

Router offload (services/router/offload_client.py):
- NATS req/reply with configurable retries
- Circuit breaker per node+type (3 fails/60s → open 120s)
- Concurrency semaphore for remote requests

Model selection (services/router/model_select.py):
- exclude_nodes parameter for circuit-broken nodes
- force_local flag for fallback re-selection
- Integrated circuit breaker state awareness

Router /infer pipeline:
- Remote offload path when NCS selects remote node
- Automatic fallback: exclude failed node → force_local re-select
- Deadline propagation from router to node-worker

Tests: 17 unit tests (idempotency, deadline, circuit breaker)
Docs: ops/offload_routing.md (subjects, envelope, verification)
Made-with: Cursor
2026-02-27 02:44:05 -08:00
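
The circuit breaker rule quoted above (3 fails/60s → open 120s, per node+type) could look roughly like the following. This is a minimal sketch, not the contents of offload_client.py: the class name, method names, the breaker_for helper, and the injectable clock are all hypothetical. Only the three numbers and the per node+type keying come from the commit message.

```python
import time
from collections import deque


class CircuitBreaker:
    """Opens after max_fails failures inside window_s; stays open for open_s."""

    def __init__(self, max_fails=3, window_s=60.0, open_s=120.0, clock=time.monotonic):
        self.max_fails = max_fails
        self.window_s = window_s
        self.open_s = open_s
        self.clock = clock
        self.failures = deque()  # timestamps of recent failures
        self.opened_at = None    # None means the breaker is closed

    def allow(self) -> bool:
        """Return True if a request may be sent."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.open_s:
            # Cool-down elapsed: close the breaker and start fresh.
            self.opened_at = None
            self.failures.clear()
            return True
        return False

    def record_failure(self) -> None:
        now = self.clock()
        self.failures.append(now)
        # Drop failures that fell out of the sliding window.
        while self.failures and now - self.failures[0] > self.window_s:
            self.failures.popleft()
        if len(self.failures) >= self.max_fails:
            self.opened_at = now

    def record_success(self) -> None:
        self.failures.clear()


# One breaker per (node, job type), as the commit describes.
breakers: dict[tuple[str, str], CircuitBreaker] = {}


def breaker_for(node: str, job_type: str) -> CircuitBreaker:
    if (node, job_type) not in breakers:
        breakers[(node, job_type)] = CircuitBreaker()
    return breakers[(node, job_type)]
```

A caller would check breaker_for(node, "llm").allow() before sending the NATS request, and a node whose breaker is open is the kind of node the exclude_nodes parameter is meant to skip.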
Apple
a92c424845 P2: Global multi-node model selection + NCS on NODA1
Architecture for 150+ nodes:
- global_capabilities_client.py: NATS scatter-gather discovery using
  wildcard subject node.*.capabilities.get — zero static node lists.
  New nodes auto-register by deploying NCS and subscribing to NATS.
  Dead nodes expire from cache after 3x TTL automatically.

Multi-node model_select.py:
- ModelSelection now includes node, local, via_nats fields
- select_best_model prefers local candidates, then remote
- Prefer list resolution: local first, remote second
- All logged per request: node, runtime, model, local/remote

NODA1 compose:
- Added node-capabilities service (NCS) to docker-compose.node1.yml
- NATS subscription: node.noda1.capabilities.get
- Router env: NODE_CAPABILITIES_URL + ENABLE_GLOBAL_CAPS_NATS=true

NODA2 compose:
- Router env: ENABLE_GLOBAL_CAPS_NATS=true

Router main.py:
- Startup: initializes global_capabilities_client (NATS connect + first
  discovery). Falls back to local-only capabilities_client if unavailable.
- /infer: uses get_global_capabilities() for cross-node model pool
- Offload support: send_offload_request(node_id, type, payload) via NATS

Verified on NODA2:
- Global caps: 1 node, 14 models (NODA1 not yet deployed)
- Sofiia: cloud_grok → grok-4-1-fast-reasoning (OK)
- Helion: NCS → qwen3:14b local (OK)
- When NODA1 deploys NCS, its models appear automatically via NATS discovery

Made-with: Cursor
2026-02-27 02:26:12 -08:00
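
The "dead nodes expire from cache after 3x TTL" behaviour can be sketched as below. This is an illustration under assumptions, not global_capabilities_client.py itself: NodeRegistry, its methods, and the 30 s default TTL are hypothetical, since the commit does not state the TTL used for the global cache.

```python
import time


class NodeRegistry:
    """Tracks capabilities per node and forgets nodes silent for > 3x TTL."""

    def __init__(self, ttl_s=30.0, clock=time.monotonic):
        self.ttl_s = ttl_s  # hypothetical default; not given in the commit
        self.clock = clock
        self._nodes = {}    # node_id -> (capabilities, last_seen)

    def update(self, node_id: str, capabilities: dict) -> None:
        """Record a discovery reply from a node."""
        self._nodes[node_id] = (capabilities, self.clock())

    def live_nodes(self) -> dict:
        """Return capabilities of nodes heard from within 3x TTL."""
        now = self.clock()
        cutoff = 3 * self.ttl_s
        # Drop nodes that have not replied for longer than the cutoff.
        self._nodes = {
            node_id: (caps, seen)
            for node_id, (caps, seen) in self._nodes.items()
            if now - seen <= cutoff
        }
        return {node_id: caps for node_id, (caps, _) in self._nodes.items()}
```

Because entries are only added when a node replies to discovery, no static node list is needed: a new node appears on its first reply and disappears once it stops replying.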
Apple
89c3f2ac66 P1: NCS-first model selection + NATS capabilities + Grok 4.1
Router model selection:
- New model_select.py: resolve_effective_profile → profile_requirements →
  select_best_model pipeline. NCS-first with graceful static fallback.
- selection_policies in router-config.node2.yml define prefer order per
  profile without hardcoding models (e.g. local_default_coder prefers
  qwen3:14b then qwen3.5:35b-a3b).
- Cloud profiles (cloud_grok, cloud_deepseek) skip NCS; on cloud failure
  use fallback_profile via NCS for local selection.
- Structured logs: selected_profile, required_type, runtime, model,
  caps_age_s, fallback_reason on every infer request.

Grok model fix:
- grok-2-1212 no longer exists on xAI API → updated to
  grok-4-1-fast-reasoning across all 3 hardcoded locations in main.py
  and router-config.node2.yml.

NCS NATS request/reply:
- node-capabilities subscribes to node.noda2.capabilities.get (NATS
  request/reply). Enabled via ENABLE_NATS_CAPS=true in compose.
- NODA1 router can query NODA2 capabilities over NATS leafnode without
  HTTP connectivity.

Verified:
- NCS: 14 served models from Ollama+Swapper+llama-server
- NATS: request/reply returns full capabilities JSON
- Sofiia: cloud_grok → grok-4-1-fast-reasoning (tested, 200 OK)
- Helion: NCS → qwen3:14b via Ollama (caps_age=23.7s cache hit)
- Router health: ok

Made-with: Cursor
2026-02-27 02:17:34 -08:00
Apple
e2a3ae342a node2: fix Sofiia routing determinism + Node Capabilities Service
Bug fixes:
- Bug A: GROK_API_KEY env mismatch — router expected GROK_API_KEY but only
  XAI_API_KEY was present. Added GROK_API_KEY=${XAI_API_KEY} alias in compose.
- Bug B: 'grok' profile missing in router-config.node2.yml — added cloud_grok
  profile (provider: grok, model: grok-2-1212). Sofiia now has
  default_llm=cloud_grok with fallback_llm=local_default_coder.
- Bug C: Router silently defaulted to cloud DeepSeek when profile was unknown.
  Now falls back to agent.fallback_llm or local_default_coder with WARNING log.
  Hardcoded Ollama URL (172.18.0.1) replaced with config-driven base_url.

New service: Node Capabilities Service (NCS)
- services/node-capabilities/ — FastAPI microservice exposing live model
  inventory from Ollama, Swapper, and llama-server.
- GET /capabilities — canonical JSON with served_models[] and inventory_only[]
- GET /capabilities/models — flat list of served models
- POST /capabilities/refresh — force cache refresh
- Cache TTL 15s, bound to 127.0.0.1:8099
- services/router/capabilities_client.py — async client with TTL cache

Artifacts:
- ops/node2_models_audit.md — 3-layer model view (served/disk/cloud)
- ops/node2_models_audit.yml — machine-readable audit
- ops/node2_capabilities_example.json — sample NCS output (14 served models)

Made-with: Cursor
2026-02-27 02:07:40 -08:00
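
An async client with a TTL cache, as mentioned for capabilities_client.py, might be structured along these lines. This is a minimal sketch under assumptions: TTLCache, its fetch callback, and the force flag are hypothetical names, and the real client presumably performs an HTTP GET against /capabilities where this example takes an injected coroutine. The 15 s default mirrors the cache TTL stated in the commit.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, Optional


class TTLCache:
    """Caches the result of an async fetch for ttl_s seconds."""

    def __init__(
        self,
        fetch: Callable[[], Awaitable[Any]],
        ttl_s: float = 15.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._fetch = fetch
        self._ttl_s = ttl_s
        self._clock = clock
        self._value: Any = None
        self._fetched_at: Optional[float] = None

    def age_s(self) -> Optional[float]:
        """Seconds since the last successful fetch, or None if never fetched."""
        if self._fetched_at is None:
            return None
        return self._clock() - self._fetched_at

    async def get(self, force: bool = False) -> Any:
        """Return the cached value, refetching when it is missing, stale, or forced."""
        age = self.age_s()
        if force or age is None or age >= self._ttl_s:
            self._value = await self._fetch()
            self._fetched_at = self._clock()
        return self._value
```

The force flag plays the role of POST /capabilities/refresh, and age_s() corresponds to the kind of value a later commit logs as caps_age_s. Concurrent callers are not deduplicated in this sketch, so two simultaneous misses would trigger two fetches.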
Apple
7b8499dd8a node2: P0 vision restore + P1 security hardening + node-specific router config
P0 — Vision:
- swapper_config_node2.yaml: add llava-13b as vision model (vision:true);
  /vision/models now returns a non-empty list; inference verified at ~3.5s
- ollama.url fixed to host.docker.internal:11434 (was localhost, broken in Docker)

P1 — Security:
- Remove NODES_NODA1_SSH_PASSWORD from .env and docker-compose.node2-sofiia.yml
- SSH ED25519 key generated, authorized on NODA1, mounted as /run/secrets/noda1_ssh_key
- sofiia-console reads key via NODES_NODA1_SSH_PRIVATE_KEY env var
- secrets/noda1_id_ed25519 added to .gitignore

P1 — Router:
- services/router/router-config.node2.yml: new node2-specific config
  replaces all 172.17.0.1:11434 → host.docker.internal:11434
- docker-compose.node2-sofiia.yml: mount router-config.node2.yml (not root config)

P1 — Ports:
- router (9102), swapper (8890), sofiia-console (8002): bind to 127.0.0.1
- gateway (9300): keep 0.0.0.0 (Telegram webhook requires public access)

Artifacts:
- ops/patch_node2_P0P1_20260227.md — change log
- ops/validation_node2_P0P1_20260227.md — all checks PASS
- ops/node2.env.example — safe env template (no secrets)
- ops/security_hardening_node2.md — SSH key migration guide + firewall
- ops/node2_models_pull.sh — model pull script for P0/P1

Made-with: Cursor
2026-02-27 01:27:38 -08:00
NODA1 System
cca16254e5 feat(docs): add document write-back publish pipeline 2026-02-21 17:02:55 +01:00
NODA1 System
f53e71a0f4 feat(docs): add versioned document update and versions APIs 2026-02-21 16:49:24 +01:00
NODA1 System
5d52cf81c4 feat(docs): add standard file processing and router document ingest/query 2026-02-21 14:02:59 +01:00
NODA1 System
f44e920486 agromatrix: enforce mentor auth and expose shared-memory review via gateway 2026-02-21 13:18:36 +01:00
NODA1 System
68ac8fa355 agromatrix: add shared-memory review api and crawl4ai robustness 2026-02-21 13:18:36 +01:00
NODA1 System
01bfa97783 agromatrix: tighten numeric source contract guard 2026-02-21 13:18:36 +01:00
NODA1 System
d963c52fe5 agromatrix: add pending-question memory, anti-repeat guard, and numeric contract 2026-02-21 13:18:36 +01:00
NODA1 System
a87a1fe52c agromatrix: deterministic plant-id flow + confidence guard + plantnet env 2026-02-21 13:18:36 +01:00
NODA1 System
50dfcd7390 router: enforce direct image inputs for plant tools and inject runtime image_data 2026-02-21 13:18:36 +01:00
NODA1 System
a91309de11 agromatrix: deploy context/photo learning + deterministic excel policy 2026-02-21 13:18:36 +01:00
Apple
195eb9b7ac agents: add planned AISTALK orchestrator and crew profile 2026-02-20 10:24:59 -08:00
NODA1 System
987ece5bac ops: add plant-vision node1 service and update monitor/prober scripts 2026-02-20 17:57:40 +01:00
NODA1 System
90eff85662 crewai: add agromatrix and plant-intel role packs with updated team config 2026-02-20 17:56:55 +01:00
NODA1 System
a8a153a87a router: add tool manager runtime and memory retrieval updates 2026-02-20 17:56:33 +01:00
Apple
e01ed7be75 router: remove qwen2.5 profile and pin monitor to local qwen3 2026-02-19 00:25:55 -08:00
Apple
c57e6ed96b services: update comfy agent, senpai md consumer, and swapper deps 2026-02-19 00:14:18 -08:00
Apple
c201d105f6 services: add clan consent/visibility and oneok adapter stack 2026-02-19 00:14:12 -08:00
Apple
dfc0ef1ceb runtime: sync router/gateway/config policy and clan role registry 2026-02-19 00:14:06 -08:00
Apple
de8bb36462 docs+router: formalize runtime policy and remove temporary cloud-first code override 2026-02-18 10:40:40 -08:00
Apple
05435e7fad router: bypass local routing rules for cloud-first agents 2026-02-18 10:28:53 -08:00
Apple
ef59cb0950 router: enforce cloud-first direct path for top-level and monitor agents 2026-02-18 10:26:29 -08:00
Apple
5bca7fb79d router: unify top-level DeepSeek-first + on-demand CrewAI policy 2026-02-18 10:20:10 -08:00
Apple
a23cde217f clan: route simple requests to fast crew profile; keep zhos_mvp for complex 2026-02-18 09:59:53 -08:00
Apple
7c3bc68ac2 clan: restore zhos_mvp profile in crewai-service and re-enable clan zhos routing 2026-02-18 09:56:06 -08:00
Apple
b65ed7cdf2 clan: stop forcing missing zhos_mvp crew profile; use available default 2026-02-18 09:43:33 -08:00
Apple
13aa0c79f0 router: bundle CLAN runtime registry in router image path 2026-02-18 09:42:00 -08:00
Apple
63fec84734 clan: map runtime-guard manager alias so agent_id=clan is recognized 2026-02-18 09:40:54 -08:00
Apple
760022d7f5 helion: ignore keyword complexity hints; trigger CrewAI only by explicit detailed/complex flags 2026-02-18 09:25:52 -08:00
Apple
635f2d7e37 helion: deepseek-first, on-demand CrewAI, local subagent profiles, concise post-synthesis 2026-02-18 09:21:47 -08:00
Apple
77ab034744 Sync NODE1 crewai-service runtime files and monitor summary script 2026-02-18 06:00:19 -08:00
Apple
b9f83a5006 Sync NODE1 runtime config for Sofiia monitor + Clan canary fixes 2026-02-18 05:56:21 -08:00
Apple
b2be937fbb feat(file-tool): add djvu conversion and extraction actions 2026-02-15 03:11:55 -08:00