microdao-daarion

Author	SHA1	Message	Date
Apple	f70a824f6a	fix(matrix-bridge): fix _SoakMatrixClient.mark_seen signature for inject endpoint Made-with: Cursor	2026-03-05 07:55:04 -08:00
Apple	84cb7e51bc	fix(matrix-bridge): remove shadowed 'import os' inside lifespan causing UnboundLocalError Made-with: Cursor	2026-03-05 07:53:26 -08:00
Apple	82d5ff2a4f	feat(matrix-bridge-dagi): M4–M11 + soak infrastructure (debug inject endpoint) Includes all milestones M4 through M11: - M4: agent discovery (!agents / !status) - M5: node-aware routing + per-node observability - M6: dynamic policy store (node/agent overrides, import/export) - M7: Prometheus alerts + Grafana dashboard + metrics contract - M8: node health tracker + soft failover + sticky cache + HA persistence - M9: two-step confirm + diff preview for dangerous commands - M10: auto-backup, restore, retention, policy history + change detail - M11: soak scenarios (CI tests) + live soak script Soak infrastructure (this commit): - POST /v1/debug/inject_event (guarded by DEBUG_INJECT_ENABLED=false) - _preflight_inject() and _check_wal() in soak script - --db-path arg for WAL delta reporting - Runbook sections 2a/2b/2c: Step 0 and Step 1 exact commands Made-with: Cursor	2026-03-05 07:51:37 -08:00
Apple	fe6e3d30ae	feat(matrix-bridge-dagi): add operator allowlist for control commands (M3.0) New: app/control.py - ControlConfig: operator_allowlist + control_rooms (frozensets) - parse_control_config(): validates @user:server + !room:server formats, fail-fast - parse_command(): parses !verb subcommand [args] [key=value] up to 512 chars - check_authorization(): AND(is_control_room, is_operator) → (bool, reason) - Reply helpers: not_implemented, unknown_command, unauthorized, help - KNOWN_VERBS: runbook, status, help (M3.1+ stubs) - MAX_CMD_LEN=512, MAX_CMD_TOKENS=20 ingress.py: - _try_control(): dispatch for control rooms (authorized → audit + reply, unauthorized → audit + optional ⛔) - join control rooms on startup - _enqueue_from_sync: control rooms processed first, never forwarded to agents - on_control_command(sender, verb, subcommand) metric callback - CONTROL_UNAUTHORIZED_BEHAVIOR: "ignore" | "reply_error" Audit events: matrix.control.command — authorised command (verb, subcommand, args, kwargs) matrix.control.unauthorized — rejected by allowlist (reason: not_operator | not_control_room) matrix.control.unknown_cmd — authorised but unrecognised verb Config + main: - bridge_operator_allowlist, bridge_control_rooms, control_unauthorized_behavior - matrix_bridge_control_commands_total{sender,verb,subcommand} counter - /health: control_channel section (enabled, rooms_count, operators_count, behavior) - /bridge/mappings: control_rooms + control_operators_count - docker-compose: BRIDGE_OPERATOR_ALLOWLIST, BRIDGE_CONTROL_ROOMS, CONTROL_UNAUTHORIZED_BEHAVIOR Tests: 40 new → 148 total pass Made-with: Cursor	2026-03-05 01:50:04 -08:00
Apple	d40b1e87c6	feat(matrix-bridge-dagi): harden mixed rooms with safe defaults and ops visibility (M2.2) Guard rails (mixed_routing.py): - MAX_AGENTS_PER_MIXED_ROOM (default 5): fail-fast at parse time - MAX_SLASH_LEN (default 32): reject garbage/injection slash tokens - Unified rejection reasons: unknown_agent, slash_too_long, no_mapping - REASON_REJECTED_* constants (separate from success REASON_*) Ingress (ingress.py): - per-room-agent concurrency semaphore (MIXED_CONCURRENCY_CAP, default 1) - active_lock_count property for /health + prometheus - UNKNOWN_AGENT_BEHAVIOR: "ignore" (silent) | "reply_error" (inform user) - on_routed(agent_id, reason) callback for routing metrics - on_route_rejected(room_id, reason) callback for rejection metrics - matrix.route.rejected audit event on every rejection Config + main: - max_agents_per_mixed_room, max_slash_len, unknown_agent_behavior, mixed_concurrency_cap - matrix_bridge_routed_total{agent_id, reason} counter - matrix_bridge_route_rejected_total{room_id, reason} counter - matrix_bridge_active_room_agent_locks gauge - /health: mixed_guard_rails section + total_agents_in_mixed_rooms - docker-compose: all 4 new guard rail env vars Runbook: section 9 — mixed room debug guide (6 acceptance tests, routing metrics, session isolation, lock hang, config guard) Tests: 108 pass (94 → 108, +14 new tests for guard rails + callbacks + concurrency) Made-with: Cursor	2026-03-05 01:41:20 -08:00
Apple	a85a11984b	feat(matrix-bridge-dagi): add mixed-room routing by slash/mention (M2.1) - mixed_routing.py: parse BRIDGE_MIXED_ROOM_MAP, route by /slash > @mention > name: > default - ingress.py: _try_enqueue_mixed for mixed rooms, session isolation {room}:{agent}, reply tagging - config.py: bridge_mixed_room_map + bridge_mixed_defaults fields - main.py: parse mixed config, pass to MatrixIngressLoop, expose in /health + /bridge/mappings - docker-compose: BRIDGE_MIXED_ROOM_MAP / BRIDGE_MIXED_DEFAULTS env vars, BRIDGE_ALLOWED_AGENTS multi-value - tests: 25 routing unit tests + 10 ingress integration tests (94 total pass) Made-with: Cursor	2026-03-05 01:29:18 -08:00
Apple	79db053b38	feat(matrix-bridge-dagi): support N rooms in BRIDGE_ROOM_MAP, reject duplicate room_id (M2.0) Made-with: Cursor	2026-03-05 01:21:07 -08:00
Apple	a24dae8e18	feat(matrix-bridge-dagi): add backpressure queue with N workers (H2) Reader + N workers architecture: Reader: sync_poll → rate_check → dedupe → queue.put_nowait() Workers (WORKER_CONCURRENCY, default 2): queue.get() → invoke → send → audit Drop policy (queue full): - put_nowait() raises QueueFull → dropped immediately (reader never blocks) - audit matrix.queue_full + on_queue_dropped callback - metric: matrix_bridge_queue_dropped_total{room_id,agent_id} Graceful shutdown: 1. stop_event → reader exits loop 2. queue.join() with QUEUE_DRAIN_TIMEOUT_S (default 5s) → workers finish in-flight 3. worker tasks cancelled New config env vars: QUEUE_MAX_EVENTS (default 100) WORKER_CONCURRENCY (default 2) QUEUE_DRAIN_TIMEOUT_S (default 5) New metrics (H3 additions): matrix_bridge_queue_size (gauge) matrix_bridge_queue_dropped_total (counter) matrix_bridge_queue_wait_seconds histogram (buckets: 0.01…30s) /health: queue.size, queue.max, queue.workers MatrixIngressLoop: queue_size + worker_count properties 6 queue tests: enqueue/process, full-drop-audit, concurrency barrier, graceful drain, wait metric, rate-limit-before-enqueue Total: 71 passed Made-with: Cursor	2026-03-05 01:07:04 -08:00
Apple	a4e95482bc	feat(matrix-bridge-dagi): add rate limiting (H1) and metrics (H3) H1 — InMemoryRateLimiter (sliding window, no Redis): - Per-room: RATE_LIMIT_ROOM_RPM (default 20/min) - Per-sender: RATE_LIMIT_SENDER_RPM (default 10/min) - Room checked before sender — sender quota not charged on room block - Blocked messages: audit matrix.rate_limited + on_rate_limited callback - reset() for ops/test, stats() exposed in /health H3 — Extended Prometheus metrics: - matrix_bridge_rate_limited_total{room_id,agent_id,limit_type} - matrix_bridge_send_duration_seconds histogram (invoke was already there) - matrix_bridge_invoke_duration_seconds buckets tuned for LLM latency - matrix_bridge_rate_limiter_active_rooms/senders gauges - on_invoke_latency + on_send_latency callbacks wired in ingress loop 16 new tests: rate limiter unit (13) + ingress integration (3) Total: 65 passed Made-with: Cursor	2026-03-05 00:54:14 -08:00
Apple	cad3663508	feat(matrix-bridge-dagi): add egress, audit integration, fix router endpoint (PR-M1.4) Closes the full Matrix ↔ DAGI loop: Egress: - invoke Router POST /v1/agents/{agent_id}/infer (field: prompt, response: response) - send_text() reply to Matrix room with idempotent txn_id = make_txn_id(room_id, event_id) - empty reply → skip send (no spam) - reply truncated to 4000 chars if needed Audit (via sofiia-console POST /api/audit/internal): - matrix.message.received (on ingress) - matrix.agent.replied (on successful reply) - matrix.error (on router/send failure, with error_code) - fire-and-forget: audit failures never crash the loop Router URL fix: - DAGI_GATEWAY_URL now points to dagi-router-node1:8000 (not gateway:9300) - Session ID: stable per room — matrix:{room_localpart} (memory context) 9 tests: invoke endpoint, fallback fields, audit write, full cycle, dedupe, empty reply skip, metric callbacks Made-with: Cursor	2026-03-03 08:06:49 -08:00
Apple	8d564fbbe5	feat(sofiia-console): add internal audit ingest endpoint for trusted services Adds POST /api/audit/internal authenticated via X-Internal-Service-Token header (SOFIIA_INTERNAL_TOKEN env). Allows matrix-bridge-dagi and other internal services to write audit events without team keys. Reuses existing audit_log() + db layer. Made-with: Cursor	2026-03-03 08:03:49 -08:00
Apple	dbfab78f02	feat(matrix-bridge-dagi): add room mapping, ingress loop, synapse setup (PR-M1.2 + PR-M1.3) PR-M1.2 — room-to-agent mapping: - adds room_mapping.py: parse BRIDGE_ROOM_MAP (format: agent:!room_id:server) - RoomMappingConfig with O(1) room→agent lookup, agent allowlist check - /bridge/mappings endpoint (read-only ops summary, no secrets) - health endpoint now includes mappings_count - 21 tests for parsing, validation, allowlist, summary PR-M1.3 — Matrix ingress loop: - adds ingress.py: MatrixIngressLoop asyncio task - sync_poll → extract → dedupe → _invoke_gateway (POST /v1/invoke) - gateway payload: agent_id, node_id, message, metadata (transport, room_id, event_id, sender) - exponential backoff on errors (2s..60s) - joins all mapped rooms at startup - metric callbacks: on_message_received, on_gateway_error - graceful shutdown via asyncio.Event - 5 ingress tests (invoke, dedupe, callbacks, empty-map idle) Synapse setup (docker-compose.synapse-node1.yml): - fixed volume: bind mount ./synapse-data instead of named volume - added port mapping 127.0.0.1:8008:8008 Synapse running on NODA1 (localhost:8008), bot @dagi_bridge:daarion.space created, room !QwHczWXgefDHBEVkTH:daarion.space created, all 4 values in .env on NODA1. Made-with: Cursor	2026-03-03 07:51:13 -08:00
Apple	d8506da179	feat(matrix-bridge-dagi): add matrix client wrapper and synapse setup (PR-M1.1) - adds MatrixClient with send_text/sync_poll/join_room/whoami (idempotent via txn_id) - LRU dedupe for incoming event_ids (2048 capacity) - exponential backoff retry (max 3 attempts) for 429/5xx/network errors - extract_room_messages: filters own messages, non-text, duplicates - health endpoint now probes matrix_reachable + gateway_reachable at startup - adds docker-compose.synapse-node1.yml (Synapse + Postgres for NODA1) - adds ops/runbook-matrix-setup.md (10-step setup: DNS, config, bot, room, .env) - 19 tests passing, no real Synapse required Made-with: Cursor	2026-03-03 07:38:54 -08:00
Apple	1d8482f4c1	feat(matrix-bridge-dagi): scaffold service with health, metrics and config (PR-M1.0) New service: services/matrix-bridge-dagi/ - app/config.py: BridgeConfig dataclass, load_config() with full env validation (MATRIX_HOMESERVER_URL, MATRIX_ACCESS_TOKEN, MATRIX_USER_ID, SOFIIA_ROOM_ID, DAGI_GATEWAY_URL, SOFIIA_CONSOLE_URL, SOFIIA_INTERNAL_TOKEN, rate limits) - app/main.py: FastAPI app with lifespan, GET /health, GET /metrics (prometheus) health returns: ok, node_id, homeserver, bridge_user, sofiia_room_id, allowed_agents, gateway, uptime_s; graceful error state when config missing - requirements.txt: fastapi, uvicorn, httpx, prometheus-client, pyyaml - Dockerfile: python:3.11-slim, port 7030, BUILD_SHA/BUILD_TIME args docker-compose.matrix-bridge-node1.yml: - standalone override file (node1 network, port 127.0.0.1:7030) - all env vars wired: MATRIX_*, SOFIIA_ROOM_ID, DAGI_GATEWAY_URL, SOFIIA_CONSOLE_URL, SOFIIA_INTERNAL_TOKEN, rate limit policy - healthcheck, restart: unless-stopped DoD: config validates, health/metrics respond, imports clean Made-with: Cursor	2026-03-03 07:28:24 -08:00
Apple	5994a3a56f	feat(node-capabilities): add voice HA capability pass-through from node-worker Made-with: Cursor	2026-03-03 07:15:39 -08:00
Apple	129e4ea1fc	feat(platform): add new services, tools, tests and crews modules New router intelligence modules (26 files): alert_ingest/store, audit_store, architecture_pressure, backlog_generator/store, cost_analyzer, data_governance, dependency_scanner, drift_analyzer, incident_* (5 files), llm_enrichment, platform_priority_digest, provider_budget, release_check_runner, risk_* (6 files), signature_state_store, sofiia_auto_router, tool_governance New services: - sofiia-console: Dockerfile, adapters/, monitor/nodes/ops/voice modules, launchd, react static - memory-service: integration_endpoints, integrations, voice_endpoints, static UI - aurora-service: full app suite (analysis, job_store, orchestrator, reporting, schemas, subagents) - sofiia-supervisor: new supervisor service - aistalk-bridge-lite: Telegram bridge lite - calendar-service: CalDAV calendar service with reminders - mlx-stt-service / mlx-tts-service: Apple Silicon speech services - binance-bot-monitor: market monitor service - node-worker: STT/TTS memory providers New tools (9): agent_email, browser_tool, contract_tool, observability_tool, oncall_tool, pr_reviewer_tool, repo_tool, safe_code_executor, secure_vault New crews: agromatrix_crew (10 modules: depth_classifier, doc_facts, doc_focus, farm_state, light_reply, llm_factory, memory_manager, proactivity, reflection_engine, session_context, style_adapter, telemetry) Tests: 85+ test files for all new modules Made-with: Cursor	2026-03-03 07:14:14 -08:00
Apple	e9dedffa48	feat(production): sync all modified production files to git Includes updates across gateway, router, node-worker, memory-service, aurora-service, swapper, sofiia-console UI and node2 infrastructure: - gateway-bot: Dockerfile, http_api.py, druid/aistalk prompts, doc_service - services/router: main.py, router-config.yml, fabric_metrics, memory_retrieval, offload_client, prompt_builder - services/node-worker: worker.py, main.py, config.py, fabric_metrics - services/memory-service: Dockerfile, database.py, main.py, requirements - services/aurora-service: main.py (+399), kling.py, quality_report.py - services/swapper-service: main.py, swapper_config_node2.yaml - services/sofiia-console: static/index.html (console UI update) - config: agent_registry, crewai_agents/teams, router_agents - ops/fabric_preflight.sh: updated preflight checks - router-config.yml, docker-compose.node2.yml: infra updates - docs: NODA1-AGENT-ARCHITECTURE, fabric_contract updated Made-with: Cursor	2026-03-03 07:13:29 -08:00
Apple	2962d33a3b	feat(sofiia-console): add artifacts list endpoint + team onboarding doc - runbook_artifacts.py: adds list_run_artifacts() returning files with names, paths, sizes, mtime_utc from release_artifacts/<run_id>/ - runbook_runs_router.py: adds GET /api/runbooks/runs/{run_id}/artifacts - docs/runbook/team-onboarding-console.md: one-page team onboarding doc covering access, rehearsal run steps, audit auth model (strict, no localhost bypass), artifacts location, abort procedure Made-with: Cursor	2026-03-03 06:55:49 -08:00
Apple	e0bea910b9	feat(sofiia-console): add multi-user team key auth + fix aurora DNS env - auth.py: adds SOFIIA_CONSOLE_TEAM_KEYS="name:key,..." support; require_auth now returns identity ("operator"/"user:<name>") for audit; validate_any_key checks primary + team keys; login sets per-user cookie - main.py: auth/login+check endpoints return identity field; imports validate_any_key, _expected_team_cookie_tokens from auth - docker-compose.node1.yml: adds SOFIIA_CONSOLE_TEAM_KEYS env var; adds AURORA_SERVICE_URL=http://127.0.0.1:9401 to prevent DNS lookup failure for aurora-service (not deployed on NODA1) Made-with: Cursor	2026-03-03 06:38:26 -08:00
Apple	8879da1e7f	feat(sofiia-console): add auto-evidence and post-review generation from runbook runs - adds runbook_artifacts.py: server-side render of release_evidence.md and post_review.md from DB step results (no shell); saves to SOFIIA_DATA_DIR/release_artifacts/<run_id>/ - evidence: auto-fills preflight/smoke/script outcomes, step table, timestamps - post_review: auto-fills metadata, smoke results, incidents from step statuses; leaves [TODO] markers for manual observation sections - adds POST /api/runbooks/runs/{run_id}/evidence and /post_review endpoints - updates runbook_runs.evidence_path in DB after render - adds 11 tests covering file creation, key sections, TODO markers, 404s, API Made-with: Cursor	2026-03-03 05:07:52 -08:00
Apple	0603184524	feat(sofiia-console): add safe script executor for allowlisted runbook steps - adds safe_executor.py: REPO_ROOT confinement, strict script allowlist, env key allowlist (STRICT/SOFIIA_URL/BFF_A/BFF_B/NODE_ID/AGENT_ID), stdin=DEVNULL, 8KB output cap, timeout clamp (max 300s), non-root warn - integrates script action_type into runbook_runner: next_step handles http_check and script branches; running_as_root -> step_status=warn - extends runbook_parser: rehearsal-v1 now includes 3 built-in script steps (preflight, idempotency smoke, generate evidence) after http_checks - adds tests/test_sofiia_safe_executor.py: 12 tests covering path traversal, absolute path, non-allowlist, env drop, timeout, exit_code, mocked subprocess Made-with: Cursor	2026-03-03 04:57:22 -08:00
Apple	ad8bddf595	feat(sofiia-console): add guided runbook runner with http checks and audit integration adds runbook_runs/runbook_steps state machine parses markdown runbooks into guided steps supports allowlisted http_check (health/metrics/audit) integrates runbook execution with audit trail exposes authenticated runbook runs API Made-with: Cursor	2026-03-03 04:49:19 -08:00
Apple	4db1774a34	feat(sofiia-console): rank runbook search results with bm25 FTS path: score = bm25(docs_chunks_fts), ORDER BY score ASC; LIKE fallback: score null; test asserts score key present Made-with: Cursor	2026-03-03 04:36:52 -08:00
Apple	63fec4371a	feat(sofiia-console): add runbooks index status endpoint GET /api/runbooks/status returns docs_root, indexed_files, indexed_chunks, last_indexed_at, fts_available; docs_index_meta table and set on rebuild Made-with: Cursor	2026-03-03 04:35:18 -08:00
Apple	ef3ff80645	feat(sofiia-console): add docs index and runbook search API (FTS5) adds SQLite docs index (files/chunks + FTS5) and CLI rebuild exposes authenticated runbook search/preview/raw endpoints Made-with: Cursor	2026-03-03 04:26:34 -08:00
Apple	e2c2333b6f	feat(sofiia-console): protect audit endpoint with admin token Made-with: Cursor	2026-03-02 09:42:10 -08:00
Apple	11e0ba7264	feat(sofiia-console): add audit query endpoint with cursor pagination Made-with: Cursor	2026-03-02 09:36:11 -08:00
Apple	3246440ac8	feat(sofiia-console): add audit trail for operator actions Made-with: Cursor	2026-03-02 09:29:14 -08:00
Apple	9b89ace2fc	feat(sofiia-console): add rate limiting for chat send (per-chat and per-operator) Made-with: Cursor	2026-03-02 09:24:21 -08:00
Apple	3b16739671	feat(sofiia-console): add RedisIdempotencyStore backend Made-with: Cursor	2026-03-02 09:08:52 -08:00
Apple	0b30775ac1	feat(sofiia-console): add structured json logging for chat ops Made-with: Cursor	2026-03-02 08:24:54 -08:00
Apple	e504df7dfa	feat(sofiia-console): harden cursor pagination with tie-breaker Version cursor payloads and keep backward compatibility while adding dedicated tie-breaker regression coverage for equal timestamps to prevent pagination duplicates and gaps. Made-with: Cursor	2026-03-02 08:12:19 -08:00
Apple	0c626943d6	refactor(sofiia-console): extract idempotency store abstraction Move idempotency TTL/LRU logic into a dedicated store module with a swap-ready interface and wire chat send flow to use store get/set semantics without changing API behavior. Made-with: Cursor	2026-03-02 08:11:13 -08:00
Apple	93f94030f4	feat(sofiia-console): expose /metrics and add basic ops counters Expose Prometheus-style metrics endpoint and add counters for send requests, idempotency replays, and cursor pagination calls, including a safe in-process fallback exposition when prometheus_client is unavailable. Made-with: Cursor	2026-03-02 04:52:04 -08:00
Apple	d9ce366538	feat(sofiia-console): idempotency_key, cursor pagination, and noda2 router fallback Add BFF runtime support for chat idempotency (header priority over body) with bounded in-memory TTL/LRU replay cache, implement cursor-based pagination for chats and messages, and add a safe NODA2 local router fallback for legacy runs without NODE_ID. Made-with: Cursor	2026-03-02 04:14:58 -08:00
Apple	f16bab2cb9	chore(aurora): support keychain/env loading for kling credentials on launchd	2026-03-01 06:26:17 -08:00
Apple	1ea4464838	feat(aurora-smart): add dual-stack orchestration with policy, audit, and UI toggle	2026-03-01 06:21:17 -08:00
Apple	5b4c4f92ba	feat(aurora): add detection overlays with face/plate boxes in compare UI	2026-03-01 05:00:29 -08:00
Apple	79f26ab683	feat(aurora-ui): add interactive pre-analysis controls and quality report	2026-03-01 04:10:10 -08:00
Apple	fe0f2e23c2	feat(aurora): expose quality report API and proxy via sofiia console	2026-03-01 03:59:54 -08:00
Apple	c230abe9cf	fix(aurora): harden Kling integration and surface config diagnostics	2026-03-01 03:55:16 -08:00
Apple	ff97d3cf4a	fix(console): route Aurora Kling enhance via standard proxy base URL	2026-03-01 03:48:19 -08:00
Apple	57632699c0	chore(cleanup): remove obsolete compose version and trim router Dockerfile	2026-03-01 01:37:30 -08:00
Apple	de234112f3	feat(node2): wire calendar-service and core automation tools in router	2026-03-01 01:37:13 -08:00
Apple	9a36020316	P3.5-P3.7: 2-layer inventory, capability routing, STT/TTS adapters, Dev Contract NCS: - _collect_worker_caps() fetches capability flags from node-worker /caps - _derive_capabilities() merges served model types + worker provider flags - installed_artifacts replaces inventory_only (disk scan with DISK_SCAN_PATHS env) - New endpoints: /capabilities/caps, /capabilities/installed Node Worker: - STT_PROVIDER, TTS_PROVIDER, OCR_PROVIDER, IMAGE_PROVIDER env flags - /caps endpoint returns capabilities + providers for NCS aggregation - STT adapter (providers/stt_mlx_whisper.py) — remote + local mode - TTS adapter (providers/tts_mlx_kokoro.py) — remote + local mode - OCR handler via vision_prompted (ollama_vision with OCR prompt) - NATS subjects: node.{id}.stt/tts/ocr/image.request Router: - POST /v1/capability/{stt,tts,ocr,image} — capability-based offload routing - GET /v1/capabilities — global view with capabilities_by_node - require_fresh_caps(ttl) preflight guard - find_nodes_with_capability(cap) + load-based node selection Ops: - ops/fabric_snapshot.py — full runtime snapshot collector - ops/fabric_preflight.sh — quick check + snapshot save + diff - docs/fabric_contract.md — Dev Contract v0.1 (preflight-first) - tests/test_fabric_contract.py — CI enforcement (6 tests) Made-with: Cursor	2026-02-27 05:24:09 -08:00
Apple	194c87f53c	feat(fabric): decommission Swapper from critical path, NCS = source of truth - Node Worker: replace swapper_vision with ollama_vision (direct Ollama API) - Node Worker: add NATS subjects for stt/tts/image (stubs ready) - Node Worker: remove SWAPPER_URL dependency from config - Router: vision calls go directly to Ollama /api/generate with images - Router: local LLM calls go directly to Ollama /api/generate - Router: add OLLAMA_URL and PREFER_NODE_WORKER=true feature flag - Router: /v1/models now uses NCS global capabilities pool - NCS: SWAPPER_URL="" -> skip Swapper probing (status=disabled) - Swapper configs: remove all hardcoded model lists, keep only runtime URLs, timeouts, limits - docker-compose.node1.yml: add OLLAMA_URL, PREFER_NODE_WORKER for router; SWAPPER_URL= for NCS; remove swapper-service from node-worker depends_on - docker-compose.node2-sofiia.yml: same changes for NODA2 Swapper service still runs but is NOT in the critical inference path. Source of truth for models is now NCS -> Ollama /api/tags. Made-with: Cursor	2026-02-27 04:16:16 -08:00
Apple	90080c632a	fix(fabric): use broadcast subject for NATS capabilities discovery NATS wildcards (node.*.capabilities.get) only work for subscriptions, not for publish. Switch to a dedicated broadcast subject (fabric.capabilities.discover) that all NCS instances subscribe to, enabling proper scatter-gather discovery across nodes. Made-with: Cursor	2026-02-27 03:20:13 -08:00
Apple	a6531507df	merge: integrate remote codex/sync-node1-runtime with fabric layer changes Resolve conflicts in docker-compose.node1.yml, services/router/main.py, and gateway-bot/services/doc_service.py — keeping both fabric layer (NCS, node-worker, Prometheus) and document ingest/query endpoints. Made-with: Cursor	2026-02-27 03:09:12 -08:00
Apple	ed7ad49d3a	P3.2+P3.3+P3.4: NODA1 node-worker + NATS auth config + Prometheus counters P3.2 — Multi-node deployment: - Added node-worker service to docker-compose.node1.yml (NODE_ID=noda1) - NCS NODA1 now has NODE_WORKER_URL for metrics collection - Fixed NODE_ID consistency: router NODA1 uses 'noda1' - NODA2 node-worker/NCS gets NCS_REPORT_URL for latency reporting P3.3 — NATS accounts/auth (opt-in config): - config/nats-server.conf with 3 accounts: SYS, FABRIC, APP - Per-user topic permissions (router, ncs, node_worker) - Leafnode listener :7422 with auth - Not yet activated (requires credential provisioning) P3.4 — Prometheus counters: - Router /fabric_metrics: caps_refresh, caps_stale, model_select, offload_total, breaker_state, score_ms histogram - Node Worker /prom_metrics: jobs_total, inflight gauge, latency_ms histogram - NCS /prom_metrics: runtime_health, runtime_p50/p95, node_wait_ms - All bound to 127.0.0.1 (not externally exposed) Made-with: Cursor	2026-02-27 03:03:18 -08:00
Apple	a605b8c43e	P3.1: GPU/Queue-aware routing — NCS metrics + scoring-based model selection NCS (services/node-capabilities/metrics.py): - NodeLoad: inflight_jobs, queue_depth, concurrency_limit, estimated_wait_ms, cpu_load_1m, mem_pressure (macOS + Linux), rtt_ms_to_hub - RuntimeLoad: per-runtime healthy, p50_ms, p95_ms from rolling 50-sample window - POST /capabilities/report_latency for node-worker → NCS reporting - NCS fetches worker metrics via NODE_WORKER_URL Node Worker: - GET /metrics endpoint (inflight, concurrency, latency buffers) - Latency tracking per job type (llm/vision) with rolling buffer - Fire-and-forget latency reporting to NCS after each successful job Router (model_select v3): - score_candidate(): wait + model_latency + cross_node_penalty + prefer_bonus - LOCAL_THRESHOLD_MS=250: prefer local if within threshold of remote - ModelSelection.score field for observability - Structured [score] logs with chosen node, model, and score breakdown Tests: 19 new (12 scoring + 7 NCS metrics), 36 total pass Docs: ops/runbook_p3_1.md, ops/CHANGELOG_FABRIC.md No breaking changes to JobRequest/JobResponse or capabilities schema. Made-with: Cursor	2026-02-27 02:55:44 -08:00
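
Note on commit a4e95482bc (H1): the message describes an in-memory sliding-window limiter in which the room limit is checked before the sender limit, so a room-level block does not charge the sender's quota. The sketch below is one way that ordering could look. It is not the repository's code: the class name `SlidingWindowLimiter`, the `check()` signature, and the choice to leave the room uncharged on a sender-level block are all assumptions made for illustration. Only the default limits (20/min per room, 10/min per sender) come from the commit message.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional, Tuple


class SlidingWindowLimiter:
    """Hypothetical stand-in for the InMemoryRateLimiter named in the commit."""

    def __init__(self, room_rpm: int = 20, sender_rpm: int = 10,
                 window_s: float = 60.0) -> None:
        self.room_rpm = room_rpm
        self.sender_rpm = sender_rpm
        self.window_s = window_s
        # Timestamps of accepted messages, oldest first, per room and per sender.
        self._rooms: Dict[str, Deque[float]] = defaultdict(deque)
        self._senders: Dict[str, Deque[float]] = defaultdict(deque)

    @staticmethod
    def _prune(hits: Deque[float], cutoff: float) -> None:
        # Drop timestamps that have fallen out of the sliding window.
        while hits and hits[0] <= cutoff:
            hits.popleft()

    def check(self, room_id: str, sender: str,
              now: Optional[float] = None) -> Tuple[bool, Optional[str]]:
        """Return (allowed, limit_type); limit_type is "room", "sender" or None."""
        now = time.monotonic() if now is None else now
        cutoff = now - self.window_s

        # Room first: a blocked room returns before the sender is touched,
        # so the sender's quota is not charged.
        room_hits = self._rooms[room_id]
        self._prune(room_hits, cutoff)
        if len(room_hits) >= self.room_rpm:
            return False, "room"

        sender_hits = self._senders[sender]
        self._prune(sender_hits, cutoff)
        if len(sender_hits) >= self.sender_rpm:
            # Assumption: a sender-level block does not charge the room either.
            return False, "sender"

        # Only accepted messages are recorded against both quotas.
        room_hits.append(now)
        sender_hits.append(now)
        return True, None
```

Passing `now` explicitly keeps the sketch deterministic under test; in a running loop it would default to the monotonic clock.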

255 Commits