Phase6/7 runtime + Gitea smoke gate setup #1

Merged
daarion-admin merged 214 commits from codex/sync-node1-runtime into main 2026-03-05 10:38:18 -08:00

Includes live-validated Phase-6/7 runtime changes, the phase6 smoke workflow for Gitea, and SSH key hardening for runner execution.

Validation done:
- NODA1 make phase6-smoke PASS
- /v1/agents/public count=14 without aistalk
- Gitea Actions run #19 success on macbook-noda2-runner

Next step after merge:
- optionally wire the deploy workflow hard gate to this smoke run.

daarion-admin added 214 commits 2026-03-05 09:40:40 -08:00
- Added initContainer to substitute server_name
- Used emptyDir for writing the config
- Updated volumeMounts
- NATS runs in standalone mode (1 replica)
- Fixed server_name via initContainer
- Created a K8s Job to create streams (via Python)
- Created the create-streams.py script

TODO: Create streams via worker-daemon, or after fixing DNS in the Job
- Matrix Client (connection and synchronization)
- RBAC Checker (permission checks via Postgres)
- Job Creator (creates jobs from commands)
- NATS Publisher (publishes jobs to streams)
- K8s deployment
- README with documentation

Commands: !embed, !retrieve, !summarize

TODO: Real integration with the Matrix homeserver, result statuses
- Automatic stream creation at worker startup (sketched below)
- Check whether streams exist before creating them
- Support for all 4 streams (MM_ONLINE, MM_OFFLINE, MM_WRITE, MM_EVENTS)

This resolves the DNS issue in the K8s Job
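
A minimal sketch of the worker-side stream creator described above, using the nats-py client. The existence check mirrors "check before creating"; the subject patterns are illustrative assumptions, not the actual config.

```python
import asyncio

import nats
from nats.js.errors import NotFoundError

STREAMS = ["MM_ONLINE", "MM_OFFLINE", "MM_WRITE", "MM_EVENTS"]

async def ensure_streams(nats_url: str = "nats://nats:4222") -> None:
    nc = await nats.connect(nats_url)
    js = nc.jetstream()
    for name in STREAMS:
        try:
            await js.stream_info(name)  # already exists -> skip creation
        except NotFoundError:
            # subject pattern is an assumption for illustration
            await js.add_stream(name=name, subjects=[f"{name.lower()}.>"])
    await nc.drain()

if __name__ == "__main__":
    asyncio.run(ensure_streams())
```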
- JWT middleware for FastAPI
- JWT token generation/verification
- Scripts for generating Qdrant API keys
- Scripts for generating the NATS operator JWT
- Auth implementation plan

TODO: Add JWT to endpoints, NATS nkeys config, Qdrant API key config
- Optional JWT auth on Memory Service endpoints (see the sketch below)
- get_current_service_optional for backward compatibility
- NATS auth config (nkeys) - templates
- Qdrant auth config (API keys) - templates
- Test script for the full flow

TODO: Generate real JWT tokens/keys and apply the configs
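
A sketch of the optional-JWT pattern named above: endpoints keep working without a token (backward compatibility) but receive an identity when one is presented. The env var name and claim layout are assumptions.

```python
import os

import jwt  # PyJWT
from fastapi import Depends, FastAPI
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer

app = FastAPI()
bearer = HTTPBearer(auto_error=False)  # do not reject when header is absent
JWT_SECRET = os.environ.get("MEMORY_JWT_SECRET", "")  # assumed env var name

async def get_current_service_optional(
    creds: HTTPAuthorizationCredentials | None = Depends(bearer),
) -> str | None:
    if creds is None or not JWT_SECRET:
        return None  # anonymous caller: allowed during the migration window
    try:
        payload = jwt.decode(creds.credentials, JWT_SECRET, algorithms=["HS256"])
        return payload.get("sub")
    except jwt.InvalidTokenError:
        return None

@app.get("/facts")
async def list_facts(service: str | None = Depends(get_current_service_optional)):
    return {"caller": service or "anonymous"}
```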
- NATS JetStream: working, streams are created automatically
- Worker Daemon: full implementation with Stream Creator
- Matrix Gateway: basic implementation ready
- Auth: basic implementation (JWT, nkeys, API keys)

TODO: Generate real secrets and test
- Atomic generation of all secrets (generate-all-secrets.sh)
- Auth enforcement check (enforce-auth.sh)
- Updated full flow test (must-pass)
- Prometheus alerting rules for the Memory Module
- Matrix alerts bridge (alerts to the ops room)
- Policy engine documentation for memory

Ready for production deployment!
- Atomic secret generation
- Auth enforcement checklist
- Smoke test and full flow test
- Observability setup
- Policy layer documentation
- SLO/SLA recommendations
- Scale-out instructions
- Incident response

The system is ready for production deployment!
- Answers to questions about connecting agents
- Plan for installing DAGI Router on NODE1/NODE3
- Plan for installing Swapper Service on NODE1/NODE3
- Logging check (GitLab, Gitea, GitHub)
- NODE1 incident check (clean)

Status:
- DAGI Router: running on NODE2, needed on NODE1/NODE3
- Swapper Service: running on NODE2, needed on NODE1/NODE3
- Agents: connect them once the infrastructure is set up
- K8s deployment for DAGI Router (NODE1)
- K8s deployment for Swapper Service (NODE1)
- ConfigMaps for configurations
- Services (ClusterIP + NodePort)
- NATS JetStream integration
- Updated DEPLOYMENT-PLAN.md with concrete instructions

TODO: Create the NODE3 equivalents
- When to connect agents: after the infrastructure is set up
- DAGI Router: ready for deployment on NODE1/NODE3
- Swapper Service: ready for deployment on NODE1/NODE3
- Logging: everything is recorded (GitHub, Gitea, GitLab)
- NODE1 check: clean, no incidents found

The recommended order of actions is included.
- TTS: xtts-v2 integration with voice cloning support
- Document: docling integration for PDF/DOCX/PPTX processing
- Memory Service: added /facts/upsert, /facts/{key}, /facts endpoints
- Added required dependencies (TTS, docling)
Comprehensive report after health check and fixes on NODA1:
- Qdrant healthcheck fixed (wget → true)
- render-pdf-worker disabled (NATS connection issues)
- Git repository initialized on NODA1
- All critical services healthy (13/26 with healthcheck)
- System resources: Load 0.57, RAM 16%, Disk 25%
- Security check passed (no suspicious activity)

Status: Production Ready 

Co-Authored-By: Warp Agent <agent@warp.dev>
## Agents Added
- Alateya: R&D, biotech, innovations
- Clan (Spirit): Community spirit agent
- Eonarch: Consciousness evolution agent

## Changes
- docker-compose.node1.yml: Added tokens for all 3 new agents
- gateway-bot/http_api.py: Added configs and webhook endpoints
- gateway-bot/clan_prompt.txt: New prompt file
- gateway-bot/eonarch_prompt.txt: New prompt file

## Fixes
- Fixed ROUTER_URL from :9102 to :8000 (internal container port)
- All 9 Telegram agents now working

## Documentation
- Created PROJECT-MASTER-INDEX.md - single entry point
- Added various status documents and scripts

Tokens configured:
- Helion, NUTRA, Agromatrix (existing)
- Alateya, Clan, Eonarch (new)
- Druid, GreenFood, DAARWIZZ (configured)
- Added Agent Registry section (Single Source of Truth)
- Updated agent list (11 top-level + 2 internal)
- Added CLI tools documentation
- Fixed agent roles (DRUID = Ayurveda/Cosmetics R&D)
- Added YAROMIR and SOUL agents
- Updated architecture diagram reference
- Marked old issues as resolved

Co-authored-by: Cursor <cursoragent@cursor.com>
- docker-compose.node1.yml: Add network aliases (router, gateway,
  memory-service, qdrant, nats, neo4j) to eliminate manual
  `docker network connect --alias` commands
- docker-compose.node1.yml: ROUTER_URL now uses env variable with
  fallback: ${ROUTER_URL:-http://router:8000}
- docker-compose.node1.yml: Increase router healthcheck start_period
  to 30s and retries to 5
- .gitignore: Add noda1-credentials.local.mdc (local-only SSH creds)
- scripts/node1/verify_agents.sh: Improved output with agent list
- docs: Add NODA1-AGENT-VERIFICATION.md, NODA1-AGENT-ARCHITECTURE.md,
  NODA1-VERIFICATION-REPORT-2026-02-03.md
- config/README.md: How to add new agents
- .cursor/rules/, .cursor/skills/: NODA1 operations skill for Cursor

Root cause fixed: Gateway could not resolve 'router' DNS name when
Router container was named 'dagi-staging-router' without alias.

Co-authored-by: Cursor <cursoragent@cursor.com>
- Use stdlib urllib.request instead of requests library
- requests was not installed in the router image, causing healthcheck
  to always fail with "ModuleNotFoundError: No module named 'requests'"
- Increase start_period to 30s and retries to 5 for stability
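
The stdlib replacement described above amounts to a few lines; a sketch (the URL and exit-code convention are illustrative):

```python
import sys
import urllib.request

def main() -> int:
    try:
        # healthcheck endpoint assumed to be /health on the container port
        with urllib.request.urlopen("http://127.0.0.1:8000/health", timeout=5) as resp:
            return 0 if resp.status == 200 else 1
    except Exception:
        return 1  # any network/parse failure marks the container unhealthy

if __name__ == "__main__":
    sys.exit(main())
```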

Co-authored-by: Cursor <cursoragent@cursor.com>
Synced from NODA1 after 2026-02-03 incident fix.
All 9 agents now have tokens configured.

Co-authored-by: Cursor <cursoragent@cursor.com>
Helion now only responds in groups when:
- Mentioned by name/username
- Direct question about Energy Union
- Previously was responding to all messages in groups

Co-authored-by: Cursor <cursoragent@cursor.com>
All agents now respond to all messages in the training group
"Agent Preschool Daarion.city" without requiring mentions.

Updated prompts: helion, daarwizz, greenfood, nutra, agromatrix, druid

Co-authored-by: Cursor <cursoragent@cursor.com>
- Added TRAINING_GROUP_IDS constant for Agent Preschool group
- Gateway now adds the "[РЕЖИМ НАВЧАННЯ]" ("training mode") prefix for training groups
- Agents will respond to all messages in training groups

Co-authored-by: Cursor <cursoragent@cursor.com>
NODA1 agents now:
- Don't respond to broadcasts/posters/announcements without direct mention
- Don't respond to media (photo/link) without explicit question
- Keep responses short (1-2 sentences by default)
- No emoji, no "ready to help", no self-promotion

Added:
- behavior_policy.py: detect_directed_to_agent(), detect_broadcast_intent(), should_respond()
- behavior_policy_v1.txt: unified policy block for all prompts
- Pre-LLM check in http_api.py: skip Router call if should_respond=False
- NO_OUTPUT handling: don't send to Telegram if LLM returns empty
- Updated all 9 agent prompts with Behavior Policy v1
- Unit and E2E tests for 5 acceptance cases
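
A hypothetical shape of the pre-LLM gate described above: decide whether to call the Router at all before spending an LLM request. Function names match the commit; the bodies are simplified illustrations, not the NODE1 code.

```python
import re

# illustrative broadcast markers; the real policy uses a richer set
BROADCAST_MARKERS = re.compile(r"(анонс|broadcast|запрошуємо)", re.IGNORECASE)

def detect_directed_to_agent(text: str, agent_names: list[str]) -> bool:
    lowered = text.lower()
    return any(name.lower() in lowered for name in agent_names)

def detect_broadcast_intent(text: str) -> bool:
    return bool(BROADCAST_MARKERS.search(text))

def should_respond(text: str, agent_names: list[str], has_media: bool) -> bool:
    directed = detect_directed_to_agent(text, agent_names)
    if detect_broadcast_intent(text) and not directed:
        return False  # posters/announcements without a direct mention
    if has_media and not directed:
        return False  # photo/link without an explicit question
    return directed or text.strip().endswith("?")
```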
Complete snapshot of /opt/microdao-daarion/ from NODE1 (144.76.224.179).
This represents the actual running production code that has diverged
significantly from the previous main branch.

Key changes from old main:
- Gateway (http_api.py): expanded from ~40KB to 164KB with full agent support
- Router: new /v1/agents/{id}/infer endpoint with vision + DeepSeek routing
- Behavior Policy: SOWA v2.2 (3-level: FULL/ACK/SILENT)
- Agent Registry: config/agent_registry.yml as single source of truth
- 13 agents configured (was 3)
- Memory service integration
- CrewAI teams and roles

Excluded from snapshot: venv/, .env, data/, backups, .tgz archives

Co-authored-by: Cursor <cursoragent@cursor.com>
SOWA fixes:
- Add Russian variants for all agents (сэнпай, хелион, друид, etc.)
- Add missing sofiia agent to AGENT_NAME_VARIANTS
- Add /senpai, /sofiia command prefixes

Vision denial fix (all 13 agents):
- Add explicit rule: "Never say you can't see/analyze images"
- Agents have Vision API via Swapper (qwen3-vl-8b)
- When vision model describes a photo, the follow-up text model (DeepSeek)
  must not deny having seen it

Root cause: NUTRA correctly analyzed a photo via vision model, but when
asked a follow-up question, DeepSeek (text model) responded "I cannot
see images" because the system prompt lacked the denial prevention rule.

Co-authored-by: Cursor <cursoragent@cursor.com>
CI:
- python-services-ci now only runs on main branch (not feature branches)
- Install deps with lock fallback (if lock file is stale, install without it)

Cursor rules:
- New project-context.mdc (alwaysApply: true) — gives AI full project
  context immediately in every new chat
- Updated noda1-operations.mdc: alwaysApply: true, fixed container names
  (dagi-router-node1, not dagi-staging-router)

This ensures that when opening a new Cursor chat in this workspace,
the AI already knows: project structure, NODE1 server details, all 13
agents, SSH credentials location, and key documentation paths.

Co-authored-by: Cursor <cursoragent@cursor.com>
When a user replies to an agent's message in Telegram groups,
it is now treated as a direct mention (SOWA FULL response).

Implementation:
- Detect reply_to_message.from.is_bot in Gateway webhook handler
- Verify bot_id matches this agent's token (multi-agent safe)
- Pass is_reply_to_agent=True to detect_explicit_request() and
  analyze_message() (SOWA v2.2)
- Add is_reply_to_agent to Router metadata for analytics

SOWA already had Priority 3 logic for reply_to_agent → FULL,
it was just never wired up (had TODO placeholders with False).

Edge cases handled:
- Only triggers when reply is to THIS agent's bot (not other bots)
- Reply to forwarded messages: won't trigger (from.is_bot would be
  the original sender, not the bot)
- Works alongside existing DM, mention, and training group rules
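
A sketch of the reply detection described above, over a raw Telegram update dict. A bot's numeric id is the first segment of its token, which is how a multi-agent gateway can check "reply to THIS bot" cheaply; the helper name is hypothetical.

```python
def is_reply_to_this_agent(message: dict, bot_token: str) -> bool:
    reply = message.get("reply_to_message")
    if not reply:
        return False
    sender = reply.get("from", {})
    if not sender.get("is_bot"):
        return False  # replies to humans (incl. forwarded originals) never trigger
    bot_id = int(bot_token.split(":", 1)[0])  # Telegram tokens are "<bot_id>:<secret>"
    return sender.get("id") == bot_id
```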

Co-authored-by: Cursor <cursoragent@cursor.com>
1. thread_has_agent_participation (SOWA Priority 11):
   - New function has_agent_chat_participation() in behavior_policy.py
   - Checks if agent responded to ANY user in this chat within 30min
   - When active + user asks question/imperative → agent responds
   - Different from per-user conversation_context (Priority 12)
   - Wired into both detect_explicit_request() and analyze_message()

2. ACK reply_to_message_id:
   - When SOWA sends ACK ("NUTRA тут"), it now replies to the user's
     message instead of sending a standalone message
   - Better UX: visually linked to what the user wrote
   - Uses allow_sending_without_reply=True for safety

Known issue (not fixed - too risky):
- Lines 1368-1639 in http_api.py are dead code (brand commands /бренд)
  at incorrect indentation level (8 spaces, inside unreachable block)
- These commands never worked on NODE1, fixing 260 lines of indentation
  carries regression risk — deferred to separate cleanup PR

Co-authored-by: Cursor <cursoragent@cursor.com>
Brand commands (~290 lines):
- Code was trapped inside `if reply_to_message:` block (unreachable)
- Moved to feature flag: ENABLE_BRAND_COMMANDS=true to activate
- Zero re-indentation: 8sp code naturally fits as feature flag body
- Helper functions (_brand_*, _artifact_*) unchanged

Memory LLM Summary:
- Replace placeholder with real DeepSeek API integration
- Structured output: summary, goals, decisions, open_questions, next_steps, key_facts
- Graceful fallback if API key not set or call fails
- Added MEMORY_DEEPSEEK_API_KEY config
- Ukrainian output language

Deployed and verified on NODE1.

Co-authored-by: Cursor <cursoragent@cursor.com>
1. YAML structure bug: Senpai was in `policies:` instead of `agents:`
   in router-config.yml. Router couldn't find Senpai config → no routing
   rule → fallback to local model.

2. tool_manager agent_id not passed: memory_search and graph_query
   tools were called without agent_id → defaulted to "helion" →
   ALL agents' tool calls searched Helion's Qdrant collections.
   Fixed: agent_id now flows from main.py → execute_tool → _memory_search.

3. Config not mounted: router-config.yml was baked into Docker image,
   host changes had no effect. Added volume mount in docker-compose.

Also added:
- Sofiia agent config + routing rule (was completely missing)
- Senpai routing rule: cloud_deepseek (was falling to local qwen3:8b)
- Anti-echo instruction for memory brief injection

Deployed and verified on NODE1: Senpai now searches senpai_* collections.

Co-authored-by: Cursor <cursoragent@cursor.com>
- Fixed unquoted `helion` variable reference to string literal `"helion"`
  in tool_manager.py search_memories fallback
- Replaced `[Контекст пам'яті]` ("memory context") with `[INTERNAL MEMORY - do NOT repeat
  to user]` in all 3 injection points in main.py
- Verified: Senpai now responds without Helion contamination or memory
  brief leaking

Tested and deployed on NODE1.

Co-authored-by: Cursor <cursoragent@cursor.com>
- Memory Service: POST /agents/{agent_id}/summarize endpoint
  - Fetches recent events by agent_id (new db.list_facts_by_agent)
  - Generates structured summary via DeepSeek LLM
  - Saves summary to PostgreSQL facts + Qdrant vector store
  - Returns structured JSON (summary, goals, decisions, key_facts)

- Gateway memory_client: auto-trigger after 30 turns
  - Turn counter per chat (agent_id:channel_id)
  - 5-minute debounce between summarize calls
  - Fire-and-forget via asyncio.ensure_future (non-blocking)
  - Configurable via SUMMARIZE_TURN_THRESHOLD / SUMMARIZE_DEBOUNCE_SECONDS

- Database: list_facts_by_agent() for agent-level queries without user_id

Tested on NODE1: Helion summarize returns valid Ukrainian summary with 20 events.
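
A minimal sketch of the auto-trigger described above: a per-chat turn counter with a debounce window, firing the summarize call without blocking the reply path. The env var names match the commit; the module-level state is an illustration.

```python
import asyncio
import os
import time

SUMMARIZE_TURN_THRESHOLD = int(os.environ.get("SUMMARIZE_TURN_THRESHOLD", "30"))
SUMMARIZE_DEBOUNCE_SECONDS = float(os.environ.get("SUMMARIZE_DEBOUNCE_SECONDS", "300"))

_turns: dict[str, int] = {}
_last_call: dict[str, float] = {}

def on_turn(agent_id: str, channel_id: str, summarize_coro_factory) -> None:
    key = f"{agent_id}:{channel_id}"
    _turns[key] = _turns.get(key, 0) + 1
    now = time.monotonic()
    if _turns[key] < SUMMARIZE_TURN_THRESHOLD:
        return
    if now - _last_call.get(key, 0.0) < SUMMARIZE_DEBOUNCE_SECONDS:
        return  # debounced: summarized too recently
    _turns[key] = 0
    _last_call[key] = now
    # fire-and-forget: a summarize failure must never break the chat reply
    asyncio.ensure_future(summarize_coro_factory(agent_id, channel_id))
```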

Co-authored-by: Cursor <cursoragent@cursor.com>
Summary hardening:
- SHA256 fingerprint of events content for deduplication
  (skips LLM call when events unchanged since last summary)
- Versioned summary storage: summary:agent:channel:vN keys
- Latest pointer: summary_latest:agent:channel for fast retrieval
- Prompt injection defense: sanitize event content before LLM,
  strip [SYSTEM]/[INTERNAL] markers, block "ignore instructions" patterns
- Anti-injection clause in SUMMARY_SYSTEM_PROMPT

Database fix:
- list_facts_by_agent: SQL filter by fact_prefix to only return chat_events
  (prevents summary/version facts from consuming LIMIT quota)
- Fixed NULL team_id issue in UNIQUE constraint (PostgreSQL NULL != NULL)
  using "__system__" sentinel for team_id in summary operations

Tested on NODE1: dedup works (same events → skipped), force=true bypasses.
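
A sketch of the dedup fingerprint described above: hash the event contents in a stable order and skip the LLM call when nothing changed since the last summary. The event dict shape is an assumption.

```python
import hashlib

def events_fingerprint(events: list[dict]) -> str:
    canonical = "\n".join(e.get("content", "") for e in events)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def should_summarize(events: list[dict], last_fingerprint: str | None,
                     force: bool = False) -> bool:
    if force:
        return True  # force=true bypasses dedup, as verified on NODE1
    return events_fingerprint(events) != last_fingerprint
```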

Co-authored-by: Cursor <cursoragent@cursor.com>
Router (main.py):
- When DSML detected in 2nd LLM response after tool execution,
  make a 3rd LLM call with explicit synthesis prompt instead of
  returning raw tool results to the user
- Falls back to format_tool_calls_for_response only if 3rd call fails

Router (tool_manager.py):
- Added _strip_think_tags() helper for <think>...</think> removal
  from DeepSeek reasoning artifacts
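
The stripping helper named above is essentially one regex; a sketch:

```python
import re

_THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)

def _strip_think_tags(text: str) -> str:
    # non-greedy match removes each reasoning block, DOTALL spans newlines
    return _THINK_RE.sub("", text).strip()
```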

Gateway (http_api.py):
- Strip <think>...</think> tags before sending to Telegram
- Strip DSML/XML-like markup (function_calls, invoke, parameter tags)
- Ensure empty text after stripping gets "..." fallback

Deployed to NODE1 and verified services running.

Co-authored-by: Cursor <cursoragent@cursor.com>
Brand commands are now active in Gateway:
- /бренд — help menu
- /бренд_інтейк <url|текст> — save brand source
- /бренд_тема <brand_id> [версія] — publish theme
- /бренд_останнє <brand_id> — show latest theme
- /презентація — render presentation
- /job_статус — check job status

All 4 brand services verified healthy:
- brand-intake:9211, brand-registry:9210
- presentation-renderer:9212, artifact-registry:9220

Feature flag ENABLE_BRAND_COMMANDS=true added to gateway env
in docker-compose.node1.yml.

Co-authored-by: Cursor <cursoragent@cursor.com>
New service: real-time market data collection with unified event model.

Architecture:
- Domain events: TradeEvent, QuoteEvent, BookL2Event, HeartbeatEvent
- Provider interface: MarketDataProvider ABC with connect/subscribe/stream/close
- Async EventBus with fan-out to multiple consumers

Providers:
- BinanceProvider: public WebSocket (trades + bookTicker), no API key needed,
  auto-reconnect with exponential backoff, heartbeat timeout detection
- AlpacaProvider: IEX real-time data + paper trading auth,
  dry-run mode when no keys configured (heartbeats only)

Consumers:
- StorageConsumer: SQLite (via SQLAlchemy async) + JSONL append-only log
- MetricsConsumer: Prometheus counters, latency histograms, events/sec gauge
- PrintConsumer: sampled structured logging (1/100 events)

CLI: python -m app run --provider binance --symbols BTCUSDT,ETHUSDT
HTTP: /health, /metrics (Prometheus), /latest?symbol=XXX

Tests: 19/19 passed (Binance parse, Alpaca parse, bus smoke tests)

Config: pydantic-settings + .env, all secrets via environment variables.
Co-authored-by: Cursor <cursoragent@cursor.com>
Producer (market-data-service):
- Backpressure: smart drop policy (heartbeats→quotes→trades preserved)
- Heartbeat monitor: synthetic HeartbeatEvent on provider silence
- Graceful shutdown: WS→bus→storage→DB engine cleanup sequence
- Bybit V5 public WS provider (backup for Binance, no API key needed)
- FailoverManager: health-based provider switching with recovery
- NATS output adapter: md.events.{type}.{symbol} for SenpAI
- /bus-stats endpoint for backpressure monitoring
- Dockerfile + docker-compose.node1.yml integration
- 36 tests (parsing + bus + failover), requirements.lock

Consumer (senpai-md-consumer):
- NATSConsumer: subscribe md.events.>, queue group senpai-md, backpressure
- State store: LatestState + RollingWindow (deque, 60s)
- Feature engine: 11 features (mid, spread, VWAP, return, vol, latency)
- Rule-based signals: long/short on return+volume+spread conditions
- Publisher: rate-limited features + signals + alerts to NATS
- HTTP API: /health, /metrics, /state/latest, /features/latest, /stats
- 10 Prometheus metrics
- Dockerfile + docker-compose.senpai.yml
- 41 tests (parsing + state + features + rate-limit), requirements.lock

CI: ruff + pytest + smoke import for both services
Tests: 77 total passed, lint clean
Co-authored-by: Cursor <cursoragent@cursor.com>
- Fix market-data-service host port 8891→8893 (conflict with Swapper)
- Increase healthcheck start_period/retries for market-data-service
- Add Market Data Service + SenpAI MD Consumer to PROJECT-MASTER-INDEX.md
- Update noda1-operations rule and skill with new ports/containers

Co-authored-by: Cursor <cursoragent@cursor.com>
- PROJECT-MASTER-INDEX: add "Зміни 2026-02-09" ("Changes 2026-02-09") section (market data + Senpai tool integration)
- docker-compose: senpai-md-consumer healthcheck timeout 5s→10s, retries 3→5

Co-authored-by: Cursor <cursoragent@cursor.com>
- Create comfy-agent service with FastAPI + NATS integration
- ComfyUI client with HTTP/WebSocket support
- REST API: /generate/image, /generate/video, /status, /result
- NATS subjects: agent.invoke.comfy, comfy.request.*
- Async job queue with progress tracking
- Docker compose configuration for NODE3
- Update PROJECT-MASTER-INDEX.md with NODE2/NODE3 docs

Co-Authored-By: Warp <agent@warp.dev>
Prevent DeepSeek DSML from leaking to users and avoid returning raw memory_search/web results when DSML is detected.

Co-authored-by: Cursor <cursoragent@cursor.com>
Clarify Helion group behavior: stay silent unless energy topic or direct mention, but answer operational questions when directly addressed.

Co-authored-by: Cursor <cursoragent@cursor.com>
- Replace placeholder workflow with complete SD1.5 pipeline
- Support dynamic prompt, negative_prompt, steps, seed, width, height
- Nodes: CheckpointLoader -> CLIP -> KSampler -> VAE -> SaveImage

Co-Authored-By: Warp <agent@warp.dev>
- Add Comfy as node_local internal agent on NODE3
- Scope: node-3-threadripper-rtx3090
- API endpoint: http://212.8.58.133:8880
- NATS subject: agent.invoke.comfy
- Capabilities: text-to-image, text-to-video, image-to-video
- Specialized tools: comfy_generate_image, comfy_generate_video

Co-Authored-By: Warp <agent@warp.dev>
* gateway: enforce source-lock, pii guard, style profile, and intent retry

* doc-service: add shared deterministic excel answer contract

* gateway: auto-handle unresolved user questions in chat context

* gateway: fix greeting UX and reduce false photo-intent fallbacks

---------

Co-authored-by: Apple <apple@MacBook-Pro.local>
- nats-server.conf: added leafnodes.remotes to nats://144.76.224.179:7422
- NODA2 now a spoke leaf node; NODA1 is hub
- Cross-node pub/sub verified: NODA1 pub → NODA2 sub (node.test.>)
- Leafnode connection confirmed: 144.76.224.179:7422 lid:5

Made-with: Cursor
- ops/audit_node2_20260227.md: readable report (hardware, containers, models, Sofiia, findings)
- ops/audit_node2_20260227.json: structured machine-readable inventory
- ops/audit_node2_findings.yml: 10 PASS + 5 PARTIAL + 3 FAIL + 3 SECURITY gaps
- ops/node2_capabilities.yml: router-ready capabilities (vision/text/code/stt/tts models)

Key findings:
  P0: vision pipeline broken (/vision/models=empty, qwen3-vl:8b not installed)
  P1: node-ops-worker missing, SSH root password in sofiia-console env
  P1: router-config.yml uses 172.17.0.1 (Linux bridge) not host.docker.internal

Made-with: Cursor
P0 — Vision:
- swapper_config_node2.yaml: add llava-13b as vision model (vision:true)
  /vision/models now returns non-empty list; inference verified ~3.5s
- ollama.url fixed to host.docker.internal:11434 (was localhost, broken in Docker)

P1 — Security:
- Remove NODES_NODA1_SSH_PASSWORD from .env and docker-compose.node2-sofiia.yml
- SSH ED25519 key generated, authorized on NODA1, mounted as /run/secrets/noda1_ssh_key
- sofiia-console reads key via NODES_NODA1_SSH_PRIVATE_KEY env var
- secrets/noda1_id_ed25519 added to .gitignore

P1 — Router:
- services/router/router-config.node2.yml: new node2-specific config
  replaces all 172.17.0.1:11434 → host.docker.internal:11434
- docker-compose.node2-sofiia.yml: mount router-config.node2.yml (not root config)

P1 — Ports:
- router (9102), swapper (8890), sofiia-console (8002): bind to 127.0.0.1
- gateway (9300): keep 0.0.0.0 (Telegram webhook requires public access)

Artifacts:
- ops/patch_node2_P0P1_20260227.md — change log
- ops/validation_node2_P0P1_20260227.md — all checks PASS
- ops/node2.env.example — safe env template (no secrets)
- ops/security_hardening_node2.md — SSH key migration guide + firewall
- ops/node2_models_pull.sh — model pull script for P0/P1

Made-with: Cursor
Read-only audit of all installed models on NODA2 (MacBook M4 Max):
- 12 Ollama models, 1 llama-server duplicate, 16 HF cache models
- ComfyUI stack (200+ GB): FLUX.2-dev, LTX-2 video, SDXL
- Whisper-large-v3-turbo (MLX, 1.5GB) + Kokoro TTS (MLX, 0.35GB) installed but unused
- MiniCPM-V-4_5 (16GB) installed but not in Swapper (better than llava:13b)
- Key finding: 149GB cleanup potential; llama-server duplicates Ollama (P1, 20GB)

Artifacts:
- ops/node2_models_inventory_20260227.json
- ops/node2_models_inventory_20260227.md
- ops/node2_model_capabilities.yml
- ops/node2_model_gaps.yml

Made-with: Cursor
Bug fixes:
- Bug A: GROK_API_KEY env mismatch — router expected GROK_API_KEY but only
  XAI_API_KEY was present. Added GROK_API_KEY=${XAI_API_KEY} alias in compose.
- Bug B: 'grok' profile missing in router-config.node2.yml — added cloud_grok
  profile (provider: grok, model: grok-2-1212). Sofiia now has
  default_llm=cloud_grok with fallback_llm=local_default_coder.
- Bug C: Router silently defaulted to cloud DeepSeek when profile was unknown.
  Now falls back to agent.fallback_llm or local_default_coder with WARNING log.
  Hardcoded Ollama URL (172.18.0.1) replaced with config-driven base_url.

New service: Node Capabilities Service (NCS)
- services/node-capabilities/ — FastAPI microservice exposing live model
  inventory from Ollama, Swapper, and llama-server.
- GET /capabilities — canonical JSON with served_models[] and inventory_only[]
- GET /capabilities/models — flat list of served models
- POST /capabilities/refresh — force cache refresh
- Cache TTL 15s, bound to 127.0.0.1:8099
- services/router/capabilities_client.py — async client with TTL cache

Artifacts:
- ops/node2_models_audit.md — 3-layer model view (served/disk/cloud)
- ops/node2_models_audit.yml — machine-readable audit
- ops/node2_capabilities_example.json — sample NCS output (14 served models)

Made-with: Cursor
Router model selection:
- New model_select.py: resolve_effective_profile → profile_requirements →
  select_best_model pipeline. NCS-first with graceful static fallback.
- selection_policies in router-config.node2.yml define prefer order per
  profile without hardcoding models (e.g. local_default_coder prefers
  qwen3:14b then qwen3.5:35b-a3b).
- Cloud profiles (cloud_grok, cloud_deepseek) skip NCS; on cloud failure
  use fallback_profile via NCS for local selection.
- Structured logs: selected_profile, required_type, runtime, model,
  caps_age_s, fallback_reason on every infer request.

Grok model fix:
- grok-2-1212 no longer exists on xAI API → updated to
  grok-4-1-fast-reasoning across all 3 hardcoded locations in main.py
  and router-config.node2.yml.

NCS NATS request/reply:
- node-capabilities subscribes to node.noda2.capabilities.get (NATS
  request/reply). Enabled via ENABLE_NATS_CAPS=true in compose.
- NODA1 router can query NODA2 capabilities over NATS leafnode without
  HTTP connectivity.

Verified:
- NCS: 14 served models from Ollama+Swapper+llama-server
- NATS: request/reply returns full capabilities JSON
- Sofiia: cloud_grok → grok-4-1-fast-reasoning (tested, 200 OK)
- Helion: NCS → qwen3:14b via Ollama (caps_age=23.7s cache hit)
- Router health: ok

Made-with: Cursor
Architecture for 150+ nodes:
- global_capabilities_client.py: NATS scatter-gather discovery using
  wildcard subject node.*.capabilities.get — zero static node lists.
  New nodes auto-register by deploying NCS and subscribing to NATS.
  Dead nodes expire from cache after 3x TTL automatically.

Multi-node model_select.py:
- ModelSelection now includes node, local, via_nats fields
- select_best_model prefers local candidates, then remote
- Prefer list resolution: local first, remote second
- All logged per request: node, runtime, model, local/remote

NODA1 compose:
- Added node-capabilities service (NCS) to docker-compose.node1.yml
- NATS subscription: node.noda1.capabilities.get
- Router env: NODE_CAPABILITIES_URL + ENABLE_GLOBAL_CAPS_NATS=true

NODA2 compose:
- Router env: ENABLE_GLOBAL_CAPS_NATS=true

Router main.py:
- Startup: initializes global_capabilities_client (NATS connect + first
  discovery). Falls back to local-only capabilities_client if unavailable.
- /infer: uses get_global_capabilities() for cross-node model pool
- Offload support: send_offload_request(node_id, type, payload) via NATS

Verified on NODA2:
- Global caps: 1 node, 14 models (NODA1 not yet deployed)
- Sofiia: cloud_grok → grok-4-1-fast-reasoning (OK)
- Helion: NCS → qwen3:14b local (OK)
- When NODA1 deploys NCS, its models appear automatically via NATS discovery

Made-with: Cursor
Node Worker (services/node-worker/):
- NATS subscriber for node.{NODE_ID}.llm.request / vision.request
- Canonical JobRequest/JobResponse envelope (Pydantic)
- Idempotency cache (TTL 10min) with inflight dedup
- Deadline enforcement (DEADLINE_EXCEEDED on expired jobs)
- Concurrency limiter (semaphore, returns busy)
- Ollama + Swapper vision providers

Router offload (services/router/offload_client.py):
- NATS req/reply with configurable retries
- Circuit breaker per node+type (3 fails/60s → open 120s)
- Concurrency semaphore for remote requests
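
A sketch of the per-(node, type) breaker described above: 3 failures within 60s open the circuit for 120s. The thresholds match the commit; the class structure is an illustration, not the router's actual code.

```python
import time
from collections import defaultdict, deque

FAIL_THRESHOLD, FAIL_WINDOW_S, OPEN_S = 3, 60.0, 120.0

class CircuitBreaker:
    def __init__(self) -> None:
        self._failures: dict[tuple[str, str], deque] = defaultdict(deque)
        self._open_until: dict[tuple[str, str], float] = {}

    def is_open(self, node: str, job_type: str) -> bool:
        return time.monotonic() < self._open_until.get((node, job_type), 0.0)

    def record_failure(self, node: str, job_type: str) -> None:
        key, now = (node, job_type), time.monotonic()
        window = self._failures[key]
        window.append(now)
        while window and now - window[0] > FAIL_WINDOW_S:
            window.popleft()  # drop failures older than the window
        if len(window) >= FAIL_THRESHOLD:
            self._open_until[key] = now + OPEN_S  # open: skip node for 120s
            window.clear()
```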

Model selection (services/router/model_select.py):
- exclude_nodes parameter for circuit-broken nodes
- force_local flag for fallback re-selection
- Integrated circuit breaker state awareness

Router /infer pipeline:
- Remote offload path when NCS selects remote node
- Automatic fallback: exclude failed node → force_local re-select
- Deadline propagation from router to node-worker

Tests: 17 unit tests (idempotency, deadline, circuit breaker)
Docs: ops/offload_routing.md (subjects, envelope, verification)
Made-with: Cursor
NCS (services/node-capabilities/metrics.py):
- NodeLoad: inflight_jobs, queue_depth, concurrency_limit, estimated_wait_ms,
  cpu_load_1m, mem_pressure (macOS + Linux), rtt_ms_to_hub
- RuntimeLoad: per-runtime healthy, p50_ms, p95_ms from rolling 50-sample window
- POST /capabilities/report_latency for node-worker → NCS reporting
- NCS fetches worker metrics via NODE_WORKER_URL

Node Worker:
- GET /metrics endpoint (inflight, concurrency, latency buffers)
- Latency tracking per job type (llm/vision) with rolling buffer
- Fire-and-forget latency reporting to NCS after each successful job

Router (model_select v3):
- score_candidate(): wait + model_latency + cross_node_penalty + prefer_bonus
- LOCAL_THRESHOLD_MS=250: prefer local if within threshold of remote
- ModelSelection.score field for observability
- Structured [score] logs with chosen node, model, and score breakdown

Tests: 19 new (12 scoring + 7 NCS metrics), 36 total pass
Docs: ops/runbook_p3_1.md, ops/CHANGELOG_FABRIC.md

No breaking changes to JobRequest/JobResponse or capabilities schema.

Made-with: Cursor
P3.2 — Multi-node deployment:
- Added node-worker service to docker-compose.node1.yml (NODE_ID=noda1)
- NCS NODA1 now has NODE_WORKER_URL for metrics collection
- Fixed NODE_ID consistency: router NODA1 uses 'noda1'
- NODA2 node-worker/NCS gets NCS_REPORT_URL for latency reporting

P3.3 — NATS accounts/auth (opt-in config):
- config/nats-server.conf with 3 accounts: SYS, FABRIC, APP
- Per-user topic permissions (router, ncs, node_worker)
- Leafnode listener :7422 with auth
- Not yet activated (requires credential provisioning)

P3.4 — Prometheus counters:
- Router /fabric_metrics: caps_refresh, caps_stale, model_select,
  offload_total, breaker_state, score_ms histogram
- Node Worker /prom_metrics: jobs_total, inflight gauge, latency_ms histogram
- NCS /prom_metrics: runtime_health, runtime_p50/p95, node_wait_ms
- All bound to 127.0.0.1 (not externally exposed)

Made-with: Cursor
Resolve conflicts in docker-compose.node1.yml, services/router/main.py,
and gateway-bot/services/doc_service.py — keeping both fabric layer
(NCS, node-worker, Prometheus) and document ingest/query endpoints.

Made-with: Cursor
NATS wildcards (node.*.capabilities.get) only work for subscriptions,
not for publish. Switch to a dedicated broadcast subject
(fabric.capabilities.discover) that all NCS instances subscribe to,
enabling proper scatter-gather discovery across nodes.

Made-with: Cursor
- Node Worker: replace swapper_vision with ollama_vision (direct Ollama API)
- Node Worker: add NATS subjects for stt/tts/image (stubs ready)
- Node Worker: remove SWAPPER_URL dependency from config
- Router: vision calls go directly to Ollama /api/generate with images
- Router: local LLM calls go directly to Ollama /api/generate
- Router: add OLLAMA_URL and PREFER_NODE_WORKER=true feature flag
- Router: /v1/models now uses NCS global capabilities pool
- NCS: SWAPPER_URL="" -> skip Swapper probing (status=disabled)
- Swapper configs: remove all hardcoded model lists, keep only runtime
  URLs, timeouts, limits
- docker-compose.node1.yml: add OLLAMA_URL, PREFER_NODE_WORKER for router;
  SWAPPER_URL= for NCS; remove swapper-service from node-worker depends_on
- docker-compose.node2-sofiia.yml: same changes for NODA2

Swapper service still runs but is NOT in the critical inference path.
Source of truth for models is now NCS -> Ollama /api/tags.

Made-with: Cursor
NCS:
- _collect_worker_caps() fetches capability flags from node-worker /caps
- _derive_capabilities() merges served model types + worker provider flags
- installed_artifacts replaces inventory_only (disk scan with DISK_SCAN_PATHS env)
- New endpoints: /capabilities/caps, /capabilities/installed

Node Worker:
- STT_PROVIDER, TTS_PROVIDER, OCR_PROVIDER, IMAGE_PROVIDER env flags
- /caps endpoint returns capabilities + providers for NCS aggregation
- STT adapter (providers/stt_mlx_whisper.py) — remote + local mode
- TTS adapter (providers/tts_mlx_kokoro.py) — remote + local mode
- OCR handler via vision_prompted (ollama_vision with OCR prompt)
- NATS subjects: node.{id}.stt/tts/ocr/image.request

Router:
- POST /v1/capability/{stt,tts,ocr,image} — capability-based offload routing
- GET /v1/capabilities — global view with capabilities_by_node
- require_fresh_caps(ttl) preflight guard
- find_nodes_with_capability(cap) + load-based node selection

Ops:
- ops/fabric_snapshot.py — full runtime snapshot collector
- ops/fabric_preflight.sh — quick check + snapshot save + diff
- docs/fabric_contract.md — Dev Contract v0.1 (preflight-first)
- tests/test_fabric_contract.py — CI enforcement (6 tests)

Made-with: Cursor
Add focused API contract tests for chat idempotency, cursor pagination, and node routing behavior using isolated local fixtures and mocked upstream inference.

Made-with: Cursor
Add BFF runtime support for chat idempotency (header priority over body) with bounded in-memory TTL/LRU replay cache, implement cursor-based pagination for chats and messages, and add a safe NODA2 local router fallback for legacy runs without NODE_ID.

Made-with: Cursor
Expose Prometheus-style metrics endpoint and add counters for send requests, idempotency replays, and cursor pagination calls, including a safe in-process fallback exposition when prometheus_client is unavailable.

Made-with: Cursor
Add regression coverage for router URL resolution when NODE_ID is unset and ROUTER_URL is present, and verify explicit NODES_NODA2_ROUTER_URL keeps higher priority.

Made-with: Cursor
Move idempotency TTL/LRU logic into a dedicated store module with a swap-ready interface and wire chat send flow to use store get/set semantics without changing API behavior.

Made-with: Cursor
Version cursor payloads and keep backward compatibility while adding dedicated tie-breaker regression coverage for equal timestamps to prevent pagination duplicates and gaps.

Made-with: Cursor
includes preflight, restart, smoke, observation, evidence steps

defines success criteria and metrics to collect for next-step decision

Made-with: Cursor
adds SQLite docs index (files/chunks + FTS5) and CLI rebuild

exposes authenticated runbook search/preview/raw endpoints

Made-with: Cursor
GET /api/runbooks/status returns docs_root, indexed_files, indexed_chunks, last_indexed_at, fts_available; docs_index_meta table and set on rebuild

Made-with: Cursor
FTS path: score = bm25(docs_chunks_fts), ORDER BY score ASC; LIKE fallback: score null; test asserts score key present
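
A sketch of the FTS path described above. SQLite's FTS5 `bm25()` returns lower-is-better scores, hence ORDER BY score ASC; the table name follows the commit, the exact schema is an assumption.

```python
import sqlite3

def search_chunks(db_path: str, query: str, limit: int = 10):
    con = sqlite3.connect(db_path)
    try:
        return con.execute(
            """
            SELECT rowid, bm25(docs_chunks_fts) AS score
            FROM docs_chunks_fts
            WHERE docs_chunks_fts MATCH ?
            ORDER BY score ASC
            LIMIT ?
            """,
            (query, limit),
        ).fetchall()
    finally:
        con.close()
```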

Made-with: Cursor
adds runbook_runs/runbook_steps state machine

parses markdown runbooks into guided steps

supports allowlisted http_check (health/metrics/audit)

integrates runbook execution with audit trail

exposes authenticated runbook runs API

Made-with: Cursor
- adds safe_executor.py: REPO_ROOT confinement, strict script allowlist,
  env key allowlist (STRICT/SOFIIA_URL/BFF_A/BFF_B/NODE_ID/AGENT_ID),
  stdin=DEVNULL, 8KB output cap, timeout clamp (max 300s), non-root warn
- integrates script action_type into runbook_runner: next_step handles
  http_check and script branches; running_as_root -> step_status=warn
- extends runbook_parser: rehearsal-v1 now includes 3 built-in script steps
  (preflight, idempotency smoke, generate evidence) after http_checks
- adds tests/test_sofiia_safe_executor.py: 12 tests covering path traversal,
  absolute path, non-allowlist, env drop, timeout, exit_code, mocked subprocess

Made-with: Cursor
- adds runbook_artifacts.py: server-side render of release_evidence.md and
  post_review.md from DB step results (no shell); saves to
  SOFIIA_DATA_DIR/release_artifacts/<run_id>/
- evidence: auto-fills preflight/smoke/script outcomes, step table, timestamps
- post_review: auto-fills metadata, smoke results, incidents from step statuses;
  leaves [TODO] markers for manual observation sections
- adds POST /api/runbooks/runs/{run_id}/evidence and /post_review endpoints
- updates runbook_runs.evidence_path in DB after render
- adds 11 tests covering file creation, key sections, TODO markers, 404s, API

Made-with: Cursor
- adds Development Team section with Сергій Миколайович Пліс (@vetr369)
  as Hardware Engineer & Infrastructure Specialist for DAGI nodes
- grants developer-level access to technical node/infra information

Made-with: Cursor
- auth.py: adds SOFIIA_CONSOLE_TEAM_KEYS="name:key,..." support;
  require_auth now returns identity ("operator"/"user:<name>") for audit;
  validate_any_key checks primary + team keys; login sets per-user cookie
- main.py: auth/login+check endpoints return identity field;
  imports validate_any_key, _expected_team_cookie_tokens from auth
- docker-compose.node1.yml: adds SOFIIA_CONSOLE_TEAM_KEYS env var;
  adds AURORA_SERVICE_URL=http://127.0.0.1:9401 to prevent DNS lookup
  failure for aurora-service (not deployed on NODA1)

Made-with: Cursor
- runbook_artifacts.py: adds list_run_artifacts() returning files with
  names, paths, sizes, mtime_utc from release_artifacts/<run_id>/
- runbook_runs_router.py: adds GET /api/runbooks/runs/{run_id}/artifacts
- docs/runbook/team-onboarding-console.md: one-page team onboarding doc
  covering access, rehearsal run steps, audit auth model (strict, no
  localhost bypass), artifacts location, abort procedure

Made-with: Cursor
- remove 13 duplicate 'node_modules' lines (cursor auto-added)
- add .venv-macos/ (aurora-service Python venv, 24k files)
- add ops/preflight_snapshots/, ops/voice_audit_results/, ops/voice_latency_report.json
- add *.bak and router-config.yml.bak backup files
- add services/sofiia-console/data/ (runbook runner artifacts dir)

Made-with: Cursor
Includes updates across gateway, router, node-worker, memory-service,
aurora-service, swapper, sofiia-console UI and node2 infrastructure:

- gateway-bot: Dockerfile, http_api.py, druid/aistalk prompts, doc_service
- services/router: main.py, router-config.yml, fabric_metrics, memory_retrieval,
  offload_client, prompt_builder
- services/node-worker: worker.py, main.py, config.py, fabric_metrics
- services/memory-service: Dockerfile, database.py, main.py, requirements
- services/aurora-service: main.py (+399), kling.py, quality_report.py
- services/swapper-service: main.py, swapper_config_node2.yaml
- services/sofiia-console: static/index.html (console UI update)
- config: agent_registry, crewai_agents/teams, router_agents
- ops/fabric_preflight.sh: updated preflight checks
- router-config.yml, docker-compose.node2.yml: infra updates
- docs: NODA1-AGENT-ARCHITECTURE, fabric_contract updated

Made-with: Cursor
New router intelligence modules (26 files): alert_ingest/store, audit_store,
architecture_pressure, backlog_generator/store, cost_analyzer, data_governance,
dependency_scanner, drift_analyzer, incident_* (5 files), llm_enrichment,
platform_priority_digest, provider_budget, release_check_runner, risk_* (6 files),
signature_state_store, sofiia_auto_router, tool_governance

New services:
- sofiia-console: Dockerfile, adapters/, monitor/nodes/ops/voice modules, launchd, react static
- memory-service: integration_endpoints, integrations, voice_endpoints, static UI
- aurora-service: full app suite (analysis, job_store, orchestrator, reporting, schemas, subagents)
- sofiia-supervisor: new supervisor service
- aistalk-bridge-lite: Telegram bridge lite
- calendar-service: CalDAV calendar service with reminders
- mlx-stt-service / mlx-tts-service: Apple Silicon speech services
- binance-bot-monitor: market monitor service
- node-worker: STT/TTS memory providers

New tools (9): agent_email, browser_tool, contract_tool, observability_tool,
oncall_tool, pr_reviewer_tool, repo_tool, safe_code_executor, secure_vault

New crews: agromatrix_crew (10 modules: depth_classifier, doc_facts, doc_focus,
farm_state, light_reply, llm_factory, memory_manager, proactivity, reflection_engine,
session_context, style_adapter, telemetry)

Tests: 85+ test files for all new modules
Made-with: Cursor
Config policies (16 files): alert_routing, architecture_pressure, backlog,
cost_weights, data_governance, incident_escalation, incident_intelligence,
network_allowlist, nodes_registry, observability_sources, rbac_tools_matrix,
release_gate, risk_attribution, risk_policy, slo_policy, tool_limits, tools_rollout

Ops (22 files): Caddyfile, calendar compose, grafana voice dashboard,
deployments/incidents logs, runbooks for alerts/audit/backlog/incidents/sofiia/voice,
cron jobs, scripts (alert_triage, audit_cleanup, migrate_*, governance, schedule),
task_registry, voice alerts/ha/latency/policy

Docs (30+ files): HUMANIZED_STEPAN v2.7-v3 changelogs and runbooks,
NODA1/NODA2 status and setup, audit index and traces, backlog, incident,
supervisor, tools, voice, opencode, release, risk, aistalk, spacebot

Made-with: Cursor
- AGENTS.md: Sofiia Chief AI Architect role definition
- SOFIIA_IN_OPENCODE.md, SOFIIA_NODA2_SETUP.md: NODA2 setup documentation
- agromatrix_stepan_noda1_APPLY.md, agromatrix_stepan_noda1_prod.patch: AgroMatrix production patch
- docker-compose.memory-node2.yml: memory service for NODA2
- docker-compose.node2-sofiia-supervisor.yml: sofiia supervisor for NODA2
- gateway-bot/gateway_boot.py, monitor_prompt.txt, vision_guard.py: gateway extras
- models/Modelfile.qwen3.5-35b-a3b: Qwen model definition for NODA3
- opencode.json: OpenCode providers and agents config
- scripts/init-sofiia-memory.py, scripts/node2/*, start-memory-node2.sh: NODA2 init scripts
- setup_sofiia_node2.sh: NODA2 full setup script

Made-with: Cursor
New service: services/matrix-bridge-dagi/
- app/config.py: BridgeConfig dataclass, load_config() with full env validation
  (MATRIX_HOMESERVER_URL, MATRIX_ACCESS_TOKEN, MATRIX_USER_ID, SOFIIA_ROOM_ID,
   DAGI_GATEWAY_URL, SOFIIA_CONSOLE_URL, SOFIIA_INTERNAL_TOKEN, rate limits)
- app/main.py: FastAPI app with lifespan, GET /health, GET /metrics (prometheus)
  health returns: ok, node_id, homeserver, bridge_user, sofiia_room_id,
  allowed_agents, gateway, uptime_s; graceful error state when config missing
- requirements.txt: fastapi, uvicorn, httpx, prometheus-client, pyyaml
- Dockerfile: python:3.11-slim, port 7030, BUILD_SHA/BUILD_TIME args

docker-compose.matrix-bridge-node1.yml:
- standalone override file (node1 network, port 127.0.0.1:7030)
- all env vars wired: MATRIX_*, SOFIIA_ROOM_ID, DAGI_GATEWAY_URL,
  SOFIIA_CONSOLE_URL, SOFIIA_INTERNAL_TOKEN, rate limit policy
- healthcheck, restart: unless-stopped

DoD: config validates, health/metrics respond, imports clean
Made-with: Cursor
- adds MatrixClient with send_text/sync_poll/join_room/whoami (idempotent via txn_id)
- LRU dedupe for incoming event_ids (2048 capacity)
- exponential backoff retry (max 3 attempts) for 429/5xx/network errors
- extract_room_messages: filters own messages, non-text, duplicates
- health endpoint now probes matrix_reachable + gateway_reachable at startup
- adds docker-compose.synapse-node1.yml (Synapse + Postgres for NODA1)
- adds ops/runbook-matrix-setup.md (10-step setup: DNS, config, bot, room, .env)
- 19 tests passing, no real Synapse required
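
A sketch of the idempotent send + backoff described above. The Matrix client-server API makes PUT /send/{txn_id} idempotent, so a txn_id derived deterministically from (room_id, event_id) is retry-safe. Endpoint per the CS API v3; error handling simplified.

```python
import asyncio
import hashlib
from urllib.parse import quote

import httpx

def make_txn_id(room_id: str, event_id: str) -> str:
    return hashlib.sha256(f"{room_id}:{event_id}".encode()).hexdigest()[:32]

async def send_text(hs_url: str, token: str, room_id: str,
                    body: str, txn_id: str, max_attempts: int = 3) -> None:
    url = (f"{hs_url}/_matrix/client/v3/rooms/{quote(room_id, safe='')}"
           f"/send/m.room.message/{txn_id}")
    async with httpx.AsyncClient() as client:
        for attempt in range(max_attempts):
            resp = await client.put(
                url,
                headers={"Authorization": f"Bearer {token}"},
                json={"msgtype": "m.text", "body": body},
            )
            if resp.status_code < 400:
                return
            if resp.status_code == 429 or resp.status_code >= 500:
                await asyncio.sleep(2 ** attempt)  # exponential backoff
                continue
            resp.raise_for_status()  # other 4xx: do not retry
        # give up after max_attempts; caller may audit the failure
```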

Made-with: Cursor
PR-M1.2 — room-to-agent mapping:
- adds room_mapping.py: parse BRIDGE_ROOM_MAP (format: agent:!room_id:server)
- RoomMappingConfig with O(1) room→agent lookup, agent allowlist check
- /bridge/mappings endpoint (read-only ops summary, no secrets)
- health endpoint now includes mappings_count
- 21 tests for parsing, validation, allowlist, summary

PR-M1.3 — Matrix ingress loop:
- adds ingress.py: MatrixIngressLoop asyncio task
- sync_poll → extract → dedupe → _invoke_gateway (POST /v1/invoke)
- gateway payload: agent_id, node_id, message, metadata (transport, room_id, event_id, sender)
- exponential backoff on errors (2s..60s)
- joins all mapped rooms at startup
- metric callbacks: on_message_received, on_gateway_error
- graceful shutdown via asyncio.Event
- 5 ingress tests (invoke, dedupe, callbacks, empty-map idle)

Synapse setup (docker-compose.synapse-node1.yml):
- fixed volume: bind mount ./synapse-data instead of named volume
- added port mapping 127.0.0.1:8008:8008

Synapse running on NODA1 (localhost:8008), bot @dagi_bridge:daarion.space created,
room !QwHczWXgefDHBEVkTH:daarion.space created, all 4 values in .env on NODA1.

Made-with: Cursor
Adds POST /api/audit/internal authenticated via X-Internal-Service-Token header
(SOFIIA_INTERNAL_TOKEN env). Allows matrix-bridge-dagi and other internal services
to write audit events without team keys. Reuses existing audit_log() + db layer.

Made-with: Cursor
Closes the full Matrix ↔ DAGI loop:

Egress:
- invoke Router POST /v1/agents/{agent_id}/infer (field: prompt, response: response)
- send_text() reply to Matrix room with idempotent txn_id = make_txn_id(room_id, event_id)
- empty reply → skip send (no spam)
- reply truncated to 4000 chars if needed

Audit (via sofiia-console POST /api/audit/internal):
- matrix.message.received (on ingress)
- matrix.agent.replied (on successful reply)
- matrix.error (on router/send failure, with error_code)
- fire-and-forget: audit failures never crash the loop

Router URL fix:
- DAGI_GATEWAY_URL now points to dagi-router-node1:8000 (not gateway:9300)
- Session ID: stable per room — matrix:{room_localpart} (memory context)

9 tests: invoke endpoint, fallback fields, audit write, full cycle,
dedupe, empty reply skip, metric callbacks

Made-with: Cursor
H1 — InMemoryRateLimiter (sliding window, no Redis):
  - Per-room: RATE_LIMIT_ROOM_RPM (default 20/min)
  - Per-sender: RATE_LIMIT_SENDER_RPM (default 10/min)
  - Room checked before sender — sender quota not charged on room block
  - Blocked messages: audit matrix.rate_limited + on_rate_limited callback
  - reset() for ops/test, stats() exposed in /health
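
A sketch of the sliding-window limiter described above (no Redis). The defaults match the commit's env defaults; room is checked before sender so a room block never charges the sender's quota.

```python
import time
from collections import defaultdict, deque

class InMemoryRateLimiter:
    def __init__(self, room_rpm: int = 20, sender_rpm: int = 10) -> None:
        self.room_rpm, self.sender_rpm = room_rpm, sender_rpm
        self._rooms: dict[str, deque] = defaultdict(deque)
        self._senders: dict[str, deque] = defaultdict(deque)

    @staticmethod
    def _allow(window: deque, limit: int, now: float) -> bool:
        while window and now - window[0] > 60.0:
            window.popleft()  # slide the 60s window
        if len(window) >= limit:
            return False
        window.append(now)
        return True

    def check(self, room_id: str, sender: str) -> tuple[bool, str | None]:
        now = time.monotonic()
        if not self._allow(self._rooms[room_id], self.room_rpm, now):
            return False, "room"    # sender quota NOT charged on room block
        if not self._allow(self._senders[sender], self.sender_rpm, now):
            return False, "sender"
        return True, None
```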

H3 — Extended Prometheus metrics:
  - matrix_bridge_rate_limited_total{room_id,agent_id,limit_type}
  - matrix_bridge_send_duration_seconds histogram (invoke was already there)
  - matrix_bridge_invoke_duration_seconds buckets tuned for LLM latency
  - matrix_bridge_rate_limiter_active_rooms/senders gauges
  - on_invoke_latency + on_send_latency callbacks wired in ingress loop

16 new tests: rate limiter unit (13) + ingress integration (3)
Total: 65 passed

Made-with: Cursor
Reader + N workers architecture:
  Reader: sync_poll → rate_check → dedupe → queue.put_nowait()
  Workers (WORKER_CONCURRENCY, default 2): queue.get() → invoke → send → audit

Drop policy (queue full):
  - put_nowait() raises QueueFull → dropped immediately (reader never blocks)
  - audit matrix.queue_full + on_queue_dropped callback
  - metric: matrix_bridge_queue_dropped_total{room_id,agent_id}

Graceful shutdown:
  1. stop_event → reader exits loop
  2. queue.join() with QUEUE_DRAIN_TIMEOUT_S (default 5s) → workers finish in-flight
  3. worker tasks cancelled
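
A sketch of the reader + N workers pattern above: the reader never blocks (put_nowait drops on QueueFull), workers drain the queue, and shutdown waits for in-flight items with a bounded drain. The handler callables stand in for the invoke → send → audit pipeline.

```python
import asyncio

async def reader(queue: asyncio.Queue, events, on_dropped) -> None:
    async for event in events:        # sync_poll -> rate_check -> dedupe
        try:
            queue.put_nowait(event)
        except asyncio.QueueFull:
            on_dropped(event)         # audit matrix.queue_full, bump counter

async def worker(queue: asyncio.Queue, handle) -> None:
    while True:
        event = await queue.get()
        try:
            await handle(event)       # invoke -> send -> audit
        finally:
            queue.task_done()

async def shutdown(queue: asyncio.Queue, workers, drain_timeout_s: float = 5.0):
    try:
        await asyncio.wait_for(queue.join(), timeout=drain_timeout_s)
    except asyncio.TimeoutError:
        pass                          # drop whatever did not drain in time
    for w in workers:
        w.cancel()
```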

New config env vars:
  QUEUE_MAX_EVENTS (default 100)
  WORKER_CONCURRENCY (default 2)
  QUEUE_DRAIN_TIMEOUT_S (default 5)

New metrics (H3 additions):
  matrix_bridge_queue_size (gauge)
  matrix_bridge_queue_dropped_total (counter)
  matrix_bridge_queue_wait_seconds histogram (buckets: 0.01…30s)

/health: queue.size, queue.max, queue.workers
MatrixIngressLoop: queue_size + worker_count properties

6 queue tests: enqueue/process, full-drop-audit, concurrency barrier,
graceful drain, wait metric, rate-limit-before-enqueue
Total: 71 passed

Made-with: Cursor
- mixed_routing.py: parse BRIDGE_MIXED_ROOM_MAP, route by /slash > @mention > name: > default
- ingress.py: _try_enqueue_mixed for mixed rooms, session isolation {room}:{agent}, reply tagging
- config.py: bridge_mixed_room_map + bridge_mixed_defaults fields
- main.py: parse mixed config, pass to MatrixIngressLoop, expose in /health + /bridge/mappings
- docker-compose: BRIDGE_MIXED_ROOM_MAP / BRIDGE_MIXED_DEFAULTS env vars, BRIDGE_ALLOWED_AGENTS multi-value
- tests: 25 routing unit tests + 10 ingress integration tests (94 total pass)

Made-with: Cursor
Guard rails (mixed_routing.py):
  - MAX_AGENTS_PER_MIXED_ROOM (default 5): fail-fast at parse time
  - MAX_SLASH_LEN (default 32): reject garbage/injection slash tokens
  - Unified rejection reasons: unknown_agent, slash_too_long, no_mapping
  - REASON_REJECTED_* constants (separate from success REASON_*)

Ingress (ingress.py):
  - per-room-agent concurrency semaphore (MIXED_CONCURRENCY_CAP, default 1)
  - active_lock_count property for /health + prometheus
  - UNKNOWN_AGENT_BEHAVIOR: "ignore" (silent) | "reply_error" (inform user)
  - on_routed(agent_id, reason) callback for routing metrics
  - on_route_rejected(room_id, reason) callback for rejection metrics
  - matrix.route.rejected audit event on every rejection

Config + main:
  - max_agents_per_mixed_room, max_slash_len, unknown_agent_behavior, mixed_concurrency_cap
  - matrix_bridge_routed_total{agent_id, reason} counter
  - matrix_bridge_route_rejected_total{room_id, reason} counter
  - matrix_bridge_active_room_agent_locks gauge
  - /health: mixed_guard_rails section + total_agents_in_mixed_rooms
  - docker-compose: all 4 new guard rail env vars

Runbook: section 9 — mixed room debug guide (6 acceptance tests, routing metrics, session isolation, lock hang, config guard)

Tests: 108 pass (94 → 108, +14 new tests for guard rails + callbacks + concurrency)
Made-with: Cursor
New: app/control.py
  - ControlConfig: operator_allowlist + control_rooms (frozensets)
  - parse_control_config(): validates @user:server + !room:server formats, fail-fast
  - parse_command(): parses !verb subcommand [args] [key=value] up to 512 chars
  - check_authorization(): AND(is_control_room, is_operator) → (bool, reason)
  - Reply helpers: not_implemented, unknown_command, unauthorized, help
  - KNOWN_VERBS: runbook, status, help (M3.1+ stubs)
  - MAX_CMD_LEN=512, MAX_CMD_TOKENS=20

ingress.py:
  - _try_control(): dispatch for control rooms (authorized → audit + reply, unauthorized → audit + optional error reply)
  - join control rooms on startup
  - _enqueue_from_sync: control rooms processed first, never forwarded to agents
  - on_control_command(sender, verb, subcommand) metric callback
  - CONTROL_UNAUTHORIZED_BEHAVIOR: "ignore" | "reply_error"

Audit events:
  matrix.control.command       — authorized command (verb, subcommand, args, kwargs)
  matrix.control.unauthorized  — rejected by allowlist (reason: not_operator | not_control_room)
  matrix.control.unknown_cmd   — authorized but unrecognized verb

Config + main:
  - bridge_operator_allowlist, bridge_control_rooms, control_unauthorized_behavior
  - matrix_bridge_control_commands_total{sender,verb,subcommand} counter
  - /health: control_channel section (enabled, rooms_count, operators_count, behavior)
  - /bridge/mappings: control_rooms + control_operators_count
  - docker-compose: BRIDGE_OPERATOR_ALLOWLIST, BRIDGE_CONTROL_ROOMS, CONTROL_UNAUTHORIZED_BEHAVIOR

Tests: 40 new → 148 total pass
Made-with: Cursor
Includes all milestones M4 through M11:
- M4: agent discovery (!agents / !status)
- M5: node-aware routing + per-node observability
- M6: dynamic policy store (node/agent overrides, import/export)
- M7: Prometheus alerts + Grafana dashboard + metrics contract
- M8: node health tracker + soft failover + sticky cache + HA persistence
- M9: two-step confirm + diff preview for dangerous commands
- M10: auto-backup, restore, retention, policy history + change detail
- M11: soak scenarios (CI tests) + live soak script

Soak infrastructure (this commit):
- POST /v1/debug/inject_event (guarded by DEBUG_INJECT_ENABLED=false)
- _preflight_inject() and _check_wal() in soak script
- --db-path arg for WAL delta reporting
- Runbook sections 2a/2b/2c: Step 0 and Step 1 exact commands

Made-with: Cursor
daarion-admin merged commit 72e74635cf into main 2026-03-05 10:38:18 -08:00

Reference: daarion-admin/microdao-daarion#1