Apple
9a36020316
P3.5-P3.7: 2-layer inventory, capability routing, STT/TTS adapters, Dev Contract
NCS:
- _collect_worker_caps() fetches capability flags from node-worker /caps
- _derive_capabilities() merges served model types + worker provider flags
- installed_artifacts replaces inventory_only (disk scan with DISK_SCAN_PATHS env)
- New endpoints: /capabilities/caps, /capabilities/installed
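The capability merge described above could look roughly like the sketch below. It is illustrative only: the function name `derive_capabilities`, the `type` field on served models, and the convention that a provider value of "none" means "disabled" are assumptions, not taken from the NCS code.

```python
from typing import Dict, Iterable, Mapping


def derive_capabilities(
    served_models: Iterable[Mapping[str, str]],
    worker_providers: Mapping[str, str],
) -> Dict[str, bool]:
    """Merge served model types with worker provider flags (hypothetical shapes)."""
    caps: Dict[str, bool] = {}
    # Layer 1: a capability is on if any served model reports that type.
    for model in served_models:
        model_type = model.get("type")
        if model_type:
            caps[model_type] = True
    # Layer 2: worker provider flags; empty or "none" is treated as disabled (assumed).
    for cap, provider in worker_providers.items():
        enabled = bool(provider) and provider.lower() != "none"
        caps[cap] = caps.get(cap, False) or enabled
    return caps


# "example-llm" is a placeholder model name used only for illustration.
caps = derive_capabilities(
    served_models=[
        {"name": "llava:13b", "type": "vision"},
        {"name": "example-llm", "type": "llm"},
    ],
    worker_providers={"stt": "mlx_whisper", "tts": "none"},
)
```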
Node Worker:
- STT_PROVIDER, TTS_PROVIDER, OCR_PROVIDER, IMAGE_PROVIDER env flags
- /caps endpoint returns capabilities + providers for NCS aggregation
- STT adapter (providers/stt_mlx_whisper.py) — remote + local mode
- TTS adapter (providers/tts_mlx_kokoro.py) — remote + local mode
- OCR handler via vision_prompted (ollama_vision with OCR prompt)
- NATS subjects: node.{id}.stt/tts/ocr/image.request
Router:
- POST /v1/capability/{stt,tts,ocr,image} — capability-based offload routing
- GET /v1/capabilities — global view with capabilities_by_node
- require_fresh_caps(ttl) preflight guard
- find_nodes_with_capability(cap) + load-based node selection
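A minimal sketch of the preflight guard and node lookup listed above, assuming a per-node capability map and a per-node inflight count. Only `require_fresh_caps` and `find_nodes_with_capability` are names from the commit; their signatures here, `StaleCapabilitiesError`, and `pick_least_loaded` are hypothetical.

```python
import time
from typing import Dict, List, Optional


class StaleCapabilitiesError(RuntimeError):
    """Raised when the cached capability view is older than the allowed TTL."""


def require_fresh_caps(fetched_at: float, ttl: float, now: Optional[float] = None) -> None:
    # Refuse to route on a capability snapshot older than ttl seconds.
    now = time.monotonic() if now is None else now
    age = now - fetched_at
    if age > ttl:
        raise StaleCapabilitiesError(f"capabilities are {age:.1f}s old (ttl={ttl}s)")


def find_nodes_with_capability(caps_by_node: Dict[str, Dict[str, bool]], cap: str) -> List[str]:
    # Nodes whose capability map has the flag set, in a stable order.
    return sorted(node for node, caps in caps_by_node.items() if caps.get(cap))


def pick_least_loaded(nodes: List[str], inflight_by_node: Dict[str, int]) -> Optional[str]:
    # Load-based selection: fewest inflight jobs wins, node name breaks ties.
    if not nodes:
        return None
    return min(nodes, key=lambda node: (inflight_by_node.get(node, 0), node))
```

A caller would run the guard first, then look up candidate nodes, then pick one; if no node has the capability, `pick_least_loaded` returns `None`.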
Ops:
- ops/fabric_snapshot.py — full runtime snapshot collector
- ops/fabric_preflight.sh — quick check + snapshot save + diff
- docs/fabric_contract.md — Dev Contract v0.1 (preflight-first)
- tests/test_fabric_contract.py — CI enforcement (6 tests)
Made-with: Cursor
2026-02-27 05:24:09 -08:00
Apple
a6531507df
merge: integrate remote codex/sync-node1-runtime with fabric layer changes
Resolve conflicts in docker-compose.node1.yml, services/router/main.py,
and gateway-bot/services/doc_service.py — keeping both fabric layer
(NCS, node-worker, Prometheus) and document ingest/query endpoints.
Made-with: Cursor
2026-02-27 03:09:12 -08:00
Apple
ed7ad49d3a
P3.2+P3.3+P3.4: NODA1 node-worker + NATS auth config + Prometheus counters
P3.2 — Multi-node deployment:
- Added node-worker service to docker-compose.node1.yml (NODE_ID=noda1)
- NCS NODA1 now has NODE_WORKER_URL for metrics collection
- Fixed NODE_ID consistency: router NODA1 uses 'noda1'
- NODA2 node-worker/NCS gets NCS_REPORT_URL for latency reporting
P3.3 — NATS accounts/auth (opt-in config):
- config/nats-server.conf with 3 accounts: SYS, FABRIC, APP
- Per-user topic permissions (router, ncs, node_worker)
- Leafnode listener :7422 with auth
- Not yet activated (requires credential provisioning)
P3.4 — Prometheus counters:
- Router /fabric_metrics: caps_refresh, caps_stale, model_select,
  offload_total, breaker_state, score_ms histogram
- Node Worker /prom_metrics: jobs_total, inflight gauge, latency_ms histogram
- NCS /prom_metrics: runtime_health, runtime_p50/p95, node_wait_ms
- All bound to 127.0.0.1 (not externally exposed)
Made-with: Cursor
2026-02-27 03:03:18 -08:00
Apple
a605b8c43e
P3.1: GPU/Queue-aware routing — NCS metrics + scoring-based model selection
NCS (services/node-capabilities/metrics.py):
- NodeLoad: inflight_jobs, queue_depth, concurrency_limit, estimated_wait_ms,
  cpu_load_1m, mem_pressure (macOS + Linux), rtt_ms_to_hub
- RuntimeLoad: per-runtime healthy, p50_ms, p95_ms from rolling 50-sample window
- POST /capabilities/report_latency for node-worker → NCS reporting
- NCS fetches worker metrics via NODE_WORKER_URL
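The rolling 50-sample window behind p50_ms and p95_ms could be kept as in the sketch below. The class name `RollingLatency` and the nearest-rank percentile rule are assumptions; the actual metrics.py may compute percentiles differently.

```python
import math
from collections import deque
from typing import Deque, Optional


class RollingLatency:
    """Keeps the most recent `window` latency samples (50 per the commit message)."""

    def __init__(self, window: int = 50) -> None:
        # deque(maxlen=...) silently drops the oldest sample once full.
        self._samples: Deque[float] = deque(maxlen=window)

    def record(self, latency_ms: float) -> None:
        self._samples.append(latency_ms)

    def percentile(self, pct: float) -> Optional[float]:
        if not self._samples:
            return None
        ordered = sorted(self._samples)
        # Nearest-rank: smallest sample with at least pct% of samples at or below it.
        rank = max(1, math.ceil(pct / 100 * len(ordered)))
        return ordered[rank - 1]
```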
Node Worker:
- GET /metrics endpoint (inflight, concurrency, latency buffers)
- Latency tracking per job type (llm/vision) with rolling buffer
- Fire-and-forget latency reporting to NCS after each successful job
Router (model_select v3):
- score_candidate(): wait + model_latency + cross_node_penalty + prefer_bonus
- LOCAL_THRESHOLD_MS=250: prefer local if within threshold of remote
- ModelSelection.score field for observability
- Structured [score] logs with chosen node, model, and score breakdown
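One way to read the scoring rule above is sketched here, with lower scores being better. `score_candidate` and `LOCAL_THRESHOLD_MS=250` come from the commit; the `Candidate` fields, the 50 ms cross-node penalty, the treatment of prefer_bonus as a subtracted term, and the `select` helper are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

LOCAL_THRESHOLD_MS = 250  # value stated in the commit message


@dataclass
class Candidate:
    node: str
    model: str
    is_local: bool
    wait_ms: float           # estimated queue wait on the node
    model_latency_ms: float  # typical latency of this model on the node


def score_candidate(
    c: Candidate,
    cross_node_penalty_ms: float = 50.0,  # assumed default
    prefer_bonus_ms: float = 0.0,         # assumed: subtracted for preferred models
) -> float:
    penalty = 0.0 if c.is_local else cross_node_penalty_ms
    return c.wait_ms + c.model_latency_ms + penalty - prefer_bonus_ms


def select(candidates: List[Candidate]) -> Optional[Candidate]:
    if not candidates:
        return None
    scored = [(score_candidate(c), c) for c in candidates]
    best_score, best = min(scored, key=lambda pair: pair[0])
    if not best.is_local:
        local_scored = [(s, c) for s, c in scored if c.is_local]
        if local_scored:
            local_score, local = min(local_scored, key=lambda pair: pair[0])
            # Prefer local when it is within the threshold of the best remote score.
            if local_score - best_score <= LOCAL_THRESHOLD_MS:
                return local
    return best
```

With these assumed defaults, a local candidate that is 150 ms slower than the best remote one still wins, while one that is 450 ms slower does not.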
Tests: 19 new (12 scoring + 7 NCS metrics), 36 total pass
Docs: ops/runbook_p3_1.md, ops/CHANGELOG_FABRIC.md
No breaking changes to JobRequest/JobResponse or capabilities schema.
Made-with: Cursor
2026-02-27 02:55:44 -08:00
Apple
c4b94a327d
P2.2+P2.3: NATS offload node-worker + router offload integration
Node Worker (services/node-worker/):
- NATS subscriber for node.{NODE_ID}.llm.request / vision.request
- Canonical JobRequest/JobResponse envelope (Pydantic)
- Idempotency cache (TTL 10min) with inflight dedup
- Deadline enforcement (DEADLINE_EXCEEDED on expired jobs)
- Concurrency limiter (semaphore, returns busy)
- Ollama + Swapper vision providers
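The idempotency cache with inflight dedup might be structured like the sketch below. The class name, the `begin`/`finish` methods, and the three status strings are hypothetical; only the 10-minute TTL is taken from the commit message.

```python
import time
from typing import Any, Dict, Optional, Set, Tuple


class IdempotencyCache:
    """Remembers finished jobs for ttl_s seconds and tracks jobs still running."""

    def __init__(self, ttl_s: float = 600.0) -> None:  # 10 min, per the commit
        self._ttl_s = ttl_s
        self._done: Dict[str, Tuple[float, Any]] = {}
        self._inflight: Set[str] = set()

    def begin(self, key: str, now: Optional[float] = None) -> Tuple[str, Any]:
        """Return ("cached", result), ("inflight", None) or ("new", None)."""
        now = time.monotonic() if now is None else now
        entry = self._done.get(key)
        if entry is not None:
            stored_at, result = entry
            if now - stored_at <= self._ttl_s:
                return "cached", result
            del self._done[key]  # entry expired, treat the job as unseen
        if key in self._inflight:
            return "inflight", None  # duplicate of a job that is still running
        self._inflight.add(key)
        return "new", None

    def finish(self, key: str, result: Any, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self._inflight.discard(key)
        self._done[key] = (now, result)
```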
Router offload (services/router/offload_client.py):
- NATS req/reply with configurable retries
- Circuit breaker per node+type (3 fails/60s → open 120s)
- Concurrency semaphore for remote requests
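A circuit breaker with the thresholds quoted above (3 failures within 60 s open the breaker for 120 s) could be sketched as follows. The class and method names are assumptions; a per node+type breaker would simply keep one instance per `(node, job_type)` key in a dict.

```python
import time
from collections import deque
from typing import Deque, Optional


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, window_s: float = 60.0, open_s: float = 120.0) -> None:
        self.max_failures = max_failures
        self.window_s = window_s
        self.open_s = open_s
        self._failures: Deque[float] = deque()
        self._open_until: Optional[float] = None

    def record_failure(self, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self._failures.append(now)
        # Forget failures that fell out of the sliding window.
        while self._failures and now - self._failures[0] > self.window_s:
            self._failures.popleft()
        if len(self._failures) >= self.max_failures:
            self._open_until = now + self.open_s
            self._failures.clear()

    def record_success(self) -> None:
        self._failures.clear()

    def allow(self, now: Optional[float] = None) -> bool:
        # Requests pass while the breaker is closed or the open period has elapsed.
        now = time.monotonic() if now is None else now
        return self._open_until is None or now >= self._open_until
```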
Model selection (services/router/model_select.py):
- exclude_nodes parameter for circuit-broken nodes
- force_local flag for fallback re-selection
- Integrated circuit breaker state awareness
Router /infer pipeline:
- Remote offload path when NCS selects remote node
- Automatic fallback: exclude failed node → force_local re-select
- Deadline propagation from router to node-worker
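The fallback path (exclude the failed node, then re-select with force_local) can be sketched with two injected callables. `infer_with_fallback` and the shapes of `select` and `run` are hypothetical stand-ins for the router's model selection and execution steps, and `RuntimeError` stands in for whatever error the offload client actually raises.

```python
from typing import Callable, Set


def infer_with_fallback(
    select: Callable[[Set[str], bool], str],  # (exclude_nodes, force_local) -> node id
    run: Callable[[str], str],                # executes the job on the given node
    local_node: str,
) -> str:
    excluded: Set[str] = set()
    node = select(excluded, False)
    if node == local_node:
        return run(node)
    try:
        return run(node)  # remote offload path
    except RuntimeError:
        # Exclude the failed node and re-select with force_local set.
        excluded.add(node)
        return run(select(excluded, True))
```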
Tests: 17 unit tests (idempotency, deadline, circuit breaker)
Docs: ops/offload_routing.md (subjects, envelope, verification)
Made-with: Cursor
2026-02-27 02:44:05 -08:00
Apple
e2a3ae342a
node2: fix Sofiia routing determinism + Node Capabilities Service
Bug fixes:
- Bug A: GROK_API_KEY env mismatch — router expected GROK_API_KEY but only
  XAI_API_KEY was present. Added GROK_API_KEY=${XAI_API_KEY} alias in compose.
- Bug B: 'grok' profile missing in router-config.node2.yml — added cloud_grok
  profile (provider: grok, model: grok-2-1212). Sofiia now has
  default_llm=cloud_grok with fallback_llm=local_default_coder.
- Bug C: Router silently defaulted to cloud DeepSeek when profile was unknown.
  Now falls back to agent.fallback_llm or local_default_coder with WARNING log.
  Hardcoded Ollama URL (172.18.0.1) replaced with config-driven base_url.
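The Bug C fix amounts to a deterministic fallback chain for unknown profiles. A minimal sketch, assuming profiles are held in a mapping keyed by name; `resolve_profile` and its parameters are hypothetical, while the profile names come from the commit message.

```python
import logging
from typing import Any, Mapping, Optional

logger = logging.getLogger("router")


def resolve_profile(
    requested: str,
    profiles: Mapping[str, Any],
    agent_fallback: Optional[str],
    default: str = "local_default_coder",
) -> str:
    if requested in profiles:
        return requested
    # Unknown profile: use the agent's fallback if it exists, else the local default.
    fallback = agent_fallback if agent_fallback in profiles else default
    logger.warning("unknown LLM profile %r, falling back to %r", requested, fallback)
    return fallback
```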
New service: Node Capabilities Service (NCS)
- services/node-capabilities/ — FastAPI microservice exposing live model
  inventory from Ollama, Swapper, and llama-server.
- GET /capabilities — canonical JSON with served_models[] and inventory_only[]
- GET /capabilities/models — flat list of served models
- POST /capabilities/refresh — force cache refresh
- Cache TTL 15s, bound to 127.0.0.1:8099
- services/router/capabilities_client.py — async client with TTL cache
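An async client with a TTL cache, in the spirit of capabilities_client.py, might look like this sketch. The class name, the injected `fetch` coroutine, and the injectable clock are assumptions; the 15 s TTL mirrors the NCS cache TTL above, and `force_refresh` plays the role of the refresh endpoint.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, Dict, Optional


class CachedCapabilities:
    def __init__(
        self,
        fetch: Callable[[], Awaitable[Dict[str, Any]]],  # e.g. an HTTP GET of /capabilities
        ttl_s: float = 15.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl_s = ttl_s
        self._clock = clock
        self._value: Optional[Dict[str, Any]] = None
        self._fetched_at: Optional[float] = None

    async def get(self, force_refresh: bool = False) -> Dict[str, Any]:
        now = self._clock()
        stale = self._fetched_at is None or now - self._fetched_at > self._ttl_s
        if force_refresh or stale:
            # Refetch only when the cached copy is missing, expired, or bypassed.
            self._value = await self._fetch()
            self._fetched_at = now
        assert self._value is not None
        return self._value
```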
Artifacts:
- ops/node2_models_audit.md — 3-layer model view (served/disk/cloud)
- ops/node2_models_audit.yml — machine-readable audit
- ops/node2_capabilities_example.json — sample NCS output (14 served models)
Made-with: Cursor
2026-02-27 02:07:40 -08:00
Apple
3965f68fac
node2: full model inventory audit 2026-02-27
Read-only audit of all installed models on NODA2 (MacBook M4 Max):
- 12 Ollama models, 1 llama-server duplicate, 16 HF cache models
- ComfyUI stack (200+ GB): FLUX.2-dev, LTX-2 video, SDXL
- Whisper-large-v3-turbo (MLX, 1.5GB) + Kokoro TTS (MLX, 0.35GB) installed but unused
- MiniCPM-V-4_5 (16GB) installed but not configured in Swapper (a better vision model than llava:13b)
- Key finding: 149GB cleanup potential; llama-server duplicates Ollama (P1, 20GB)
Artifacts:
- ops/node2_models_inventory_20260227.json
- ops/node2_models_inventory_20260227.md
- ops/node2_model_capabilities.yml
- ops/node2_model_gaps.yml
Made-with: Cursor
2026-02-27 01:44:26 -08:00
Apple
7b8499dd8a
node2: P0 vision restore + P1 security hardening + node-specific router config
P0 — Vision:
- swapper_config_node2.yaml: add llava-13b as vision model (vision:true)
  /vision/models now returns non-empty list; inference verified ~3.5s
- ollama.url fixed to host.docker.internal:11434 (was localhost, broken in Docker)
P1 — Security:
- Remove NODES_NODA1_SSH_PASSWORD from .env and docker-compose.node2-sofiia.yml
- SSH ED25519 key generated, authorized on NODA1, mounted as /run/secrets/noda1_ssh_key
- sofiia-console reads key via NODES_NODA1_SSH_PRIVATE_KEY env var
- secrets/noda1_id_ed25519 added to .gitignore
P1 — Router:
- services/router/router-config.node2.yml: new node2-specific config
  replaces all 172.17.0.1:11434 → host.docker.internal:11434
- docker-compose.node2-sofiia.yml: mount router-config.node2.yml (not root config)
P1 — Ports:
- router (9102), swapper (8890), sofiia-console (8002): bind to 127.0.0.1
- gateway (9300): keep 0.0.0.0 (Telegram webhook requires public access)
Artifacts:
- ops/patch_node2_P0P1_20260227.md — change log
- ops/validation_node2_P0P1_20260227.md — all checks PASS
- ops/node2.env.example — safe env template (no secrets)
- ops/security_hardening_node2.md — SSH key migration guide + firewall
- ops/node2_models_pull.sh — model pull script for P0/P1
Made-with: Cursor
2026-02-27 01:27:38 -08:00
Apple
46d7dea88a
docs(audit): NODA2 full audit 2026-02-27
- ops/audit_node2_20260227.md: readable report (hardware, containers, models, Sofiia, findings)
- ops/audit_node2_20260227.json: structured machine-readable inventory
- ops/audit_node2_findings.yml: 10 PASS + 5 PARTIAL + 3 FAIL + 3 SECURITY gaps
- ops/node2_capabilities.yml: router-ready capabilities (vision/text/code/stt/tts models)
Key findings:
P0: vision pipeline broken (/vision/models=empty, qwen3-vl:8b not installed)
P1: node-ops-worker missing, SSH root password in sofiia-console env
P1: router-config.yml uses 172.17.0.1 (Linux bridge) not host.docker.internal
Made-with: Cursor
2026-02-27 01:14:38 -08:00
NODA1 System
987ece5bac
ops: add plant-vision node1 service and update monitor/prober scripts
2026-02-20 17:57:40 +01:00
Apple
d42bb09912
helion: stabilize doc context, remove legacy webhook path, add stack smoke canary
2026-02-18 09:36:16 -08:00
Apple
e5a6e310b7
ops: make DAARWIZZ awareness canary static by default with optional runtime mode
2026-02-18 08:29:02 -08:00
Apple
00b77066b0
ops: add DAARWIZZ awareness canary for all top-level agents
2026-02-18 08:22:50 -08:00
Apple
249b2e1e94
ops: restore canary_all and harden monitor summary script invocation
2026-02-18 06:13:15 -08:00
Apple
77ab034744
Sync NODE1 crewai-service runtime files and monitor summary script
2026-02-18 06:00:19 -08:00
Apple
b9f83a5006
Sync NODE1 runtime config for Sofiia monitor + Clan canary fixes
2026-02-18 05:56:21 -08:00
Apple
ef3473db21
snapshot: NODE1 production state 2026-02-09
Complete snapshot of /opt/microdao-daarion/ from NODE1 (144.76.224.179).
This represents the actual running production code that has diverged
significantly from the previous main branch.
Key changes from old main:
- Gateway (http_api.py): expanded from ~40KB to 164KB with full agent support
- Router: new /v1/agents/{id}/infer endpoint with vision + DeepSeek routing
- Behavior Policy: SOWA v2.2 (3-level: FULL/ACK/SILENT)
- Agent Registry: config/agent_registry.yml as single source of truth
- 13 agents configured (was 3)
- Memory service integration
- CrewAI teams and roles
Excluded from snapshot: venv/, .env, data/, backups, .tgz archives
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-02-09 08:46:46 -08:00
Apple
0c8bef82f4
feat: Add Alateya, Clan, Eonarch agents + fix gateway-router connection
## Agents Added
- Alateya: R&D, biotech, innovations
- Clan (Spirit): Community spirit agent
- Eonarch: Consciousness evolution agent
## Changes
- docker-compose.node1.yml: Added tokens for all 3 new agents
- gateway-bot/http_api.py: Added configs and webhook endpoints
- gateway-bot/clan_prompt.txt: New prompt file
- gateway-bot/eonarch_prompt.txt: New prompt file
## Fixes
- Fixed ROUTER_URL from :9102 to :8000 (internal container port)
- All 9 Telegram agents now working
## Documentation
- Created PROJECT-MASTER-INDEX.md - single entry point
- Added various status documents and scripts
Tokens configured:
- Helion, NUTRA, Agromatrix (existing)
- Alateya, Clan, Eonarch (new)
- Druid, GreenFood, DAARWIZZ (configured)
2026-01-28 06:40:34 -08:00