Phase6/7 runtime + Gitea smoke gate setup #1
Includes live-validated Phase-6/7 runtime changes, the phase6 smoke workflow for Gitea, and SSH key hardening for runner execution.

Validation done:
- NODA1 `make phase6-smoke` PASS
- /v1/agents/public count=14 without aistalk
- Gitea Actions run #19 success on macbook-noda2-runner

Next step after merge:
- optionally wire the deploy workflow hard gate to this smoke run.
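The agent-count check above can be scripted. A minimal sketch, assuming /v1/agents/public returns a JSON list of objects with an `id` field (the actual response schema may differ):

```python
import json
import urllib.request

def check_agents(payload, expected_count=14, banned=("aistalk",)):
    """Validate an already-parsed /v1/agents/public payload:
    expected agent count and no banned agent ids present."""
    ids = [a["id"] for a in payload]
    assert len(ids) == expected_count, f"expected {expected_count}, got {len(ids)}"
    for b in banned:
        assert b not in ids, f"banned agent present: {b}"
    return ids

# Usage against a live node (URL is illustrative):
# with urllib.request.urlopen("http://localhost:9300/v1/agents/public") as r:
#     check_agents(json.load(r))
```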
- TTS: xtts-v2 integration with voice cloning support
- Document: docling integration for PDF/DOCX/PPTX processing
- Memory Service: added /facts/upsert, /facts/{key}, /facts endpoints
- Added required dependencies (TTS, docling)
- docker-compose.node1.yml: Add network aliases (router, gateway, memory-service, qdrant, nats, neo4j) to eliminate manual `docker network connect --alias` commands
- docker-compose.node1.yml: ROUTER_URL now uses an env variable with fallback: ${ROUTER_URL:-http://router:8000}
- docker-compose.node1.yml: Increase router healthcheck start_period to 30s and retries to 5
- .gitignore: Add noda1-credentials.local.mdc (local-only SSH creds)
- scripts/node1/verify_agents.sh: Improved output with agent list
- docs: Add NODA1-AGENT-VERIFICATION.md, NODA1-AGENT-ARCHITECTURE.md, NODA1-VERIFICATION-REPORT-2026-02-03.md
- config/README.md: How to add new agents
- .cursor/rules/, .cursor/skills/: NODA1 operations skill for Cursor

Root cause fixed: the Gateway could not resolve the 'router' DNS name when the Router container was named 'dagi-staging-router' without an alias.

Co-authored-by: Cursor <cursoragent@cursor.com>

Complete snapshot of /opt/microdao-daarion/ from NODE1 (144.76.224.179). This represents the actual running production code, which has diverged significantly from the previous main branch.

Key changes from old main:
- Gateway (http_api.py): expanded from ~40KB to 164KB with full agent support
- Router: new /v1/agents/{id}/infer endpoint with vision + DeepSeek routing
- Behavior Policy: SOWA v2.2 (3-level: FULL/ACK/SILENT)
- Agent Registry: config/agent_registry.yml as single source of truth
- 13 agents configured (was 3)
- Memory service integration
- CrewAI teams and roles

Excluded from snapshot: venv/, .env, data/, backups, .tgz archives

Co-authored-by: Cursor <cursoragent@cursor.com>

1. thread_has_agent_participation (SOWA Priority 11):
- New function has_agent_chat_participation() in behavior_policy.py
- Checks whether the agent responded to ANY user in this chat within 30 min
- When active + user asks a question/imperative → agent responds
- Different from per-user conversation_context (Priority 12)
- Wired into both detect_explicit_request() and analyze_message()

2. ACK reply_to_message_id:
- When SOWA sends an ACK ("NUTRA тут", "NUTRA is here"), it now replies to the user's message instead of sending a standalone message
- Better UX: visually linked to what the user wrote
- Uses allow_sending_without_reply=True for safety

Known issue (not fixed; too risky):
- Lines 1368-1639 in http_api.py are dead code (brand commands /бренд) at an incorrect indentation level (8 spaces, inside an unreachable block)
- These commands never worked on NODE1; fixing 260 lines of indentation carries regression risk, so it is deferred to a separate cleanup PR

Co-authored-by: Cursor <cursoragent@cursor.com>

- Memory Service: POST /agents/{agent_id}/summarize endpoint
  - Fetches recent events by agent_id (new db.list_facts_by_agent)
  - Generates a structured summary via the DeepSeek LLM
  - Saves the summary to PostgreSQL facts + Qdrant vector store
  - Returns structured JSON (summary, goals, decisions, key_facts)
- Gateway memory_client: auto-trigger after 30 turns
  - Turn counter per chat (agent_id:channel_id)
  - 5-minute debounce between summarize calls
  - Fire-and-forget via asyncio.ensure_future (non-blocking)
  - Configurable via SUMMARIZE_TURN_THRESHOLD / SUMMARIZE_DEBOUNCE_SECONDS
- Database: list_facts_by_agent() for agent-level queries without user_id

Tested on NODE1: Helion summarize returns a valid Ukrainian summary with 20 events.
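The auto-trigger described above (30-turn counter per chat, 5-minute debounce, fire-and-forget dispatch) can be sketched roughly as follows; the class and method names are illustrative, not the actual memory_client code:

```python
import time

class SummarizeTrigger:
    """Per-chat turn counter with a debounce window (sketch). Fires when the
    turn threshold is reached and at least `debounce_s` seconds have passed
    since the last summarize call for that (agent_id, channel_id) key."""

    def __init__(self, turn_threshold=30, debounce_s=300):
        self.turn_threshold = turn_threshold
        self.debounce_s = debounce_s
        self._turns = {}       # (agent_id, channel_id) -> turn count
        self._last_fire = {}   # (agent_id, channel_id) -> last fire timestamp

    def on_turn(self, agent_id, channel_id, now=None):
        now = time.monotonic() if now is None else now
        key = (agent_id, channel_id)
        self._turns[key] = self._turns.get(key, 0) + 1
        if self._turns[key] < self.turn_threshold:
            return False
        if now - self._last_fire.get(key, float("-inf")) < self.debounce_s:
            return False  # debounce: threshold met, but fired too recently
        self._turns[key] = 0
        self._last_fire[key] = now
        return True  # caller would asyncio.ensure_future(summarize(...)) here
```

When the trigger returns True, the gateway dispatches the summarize call without awaiting it, so the chat turn itself is never blocked on the LLM.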
Co-authored-by: Cursor <cursoragent@cursor.com>

Producer (market-data-service):
- Backpressure: smart drop policy (heartbeats dropped first, then quotes; trades preserved)
- Heartbeat monitor: synthetic HeartbeatEvent on provider silence
- Graceful shutdown: WS → bus → storage → DB engine cleanup sequence
- Bybit V5 public WS provider (backup for Binance, no API key needed)
- FailoverManager: health-based provider switching with recovery
- NATS output adapter: md.events.{type}.{symbol} for SenpAI
- /bus-stats endpoint for backpressure monitoring
- Dockerfile + docker-compose.node1.yml integration
- 36 tests (parsing + bus + failover), requirements.lock

Consumer (senpai-md-consumer):
- NATSConsumer: subscribes to md.events.>, queue group senpai-md, backpressure
- State store: LatestState + RollingWindow (deque, 60s)
- Feature engine: 11 features (mid, spread, VWAP, return, vol, latency)
- Rule-based signals: long/short on return + volume + spread conditions
- Publisher: rate-limited features + signals + alerts to NATS
- HTTP API: /health, /metrics, /state/latest, /features/latest, /stats
- 10 Prometheus metrics
- Dockerfile + docker-compose.senpai.yml
- 41 tests (parsing + state + features + rate-limit), requirements.lock

CI: ruff + pytest + smoke import for both services
Tests: 77 total passed, lint clean

Co-authored-by: Cursor <cursoragent@cursor.com>

Bug fixes:
- Bug A: GROK_API_KEY env mismatch: the router expected GROK_API_KEY but only XAI_API_KEY was present. Added a GROK_API_KEY=${XAI_API_KEY} alias in compose.
- Bug B: 'grok' profile missing in router-config.node2.yml: added a cloud_grok profile (provider: grok, model: grok-2-1212). Sofiia now has default_llm=cloud_grok with fallback_llm=local_default_coder.
- Bug C: the Router silently defaulted to cloud DeepSeek when a profile was unknown. It now falls back to agent.fallback_llm or local_default_coder with a WARNING log.
- The hardcoded Ollama URL (172.18.0.1) was replaced with a config-driven base_url.
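The smart drop policy above can be illustrated with a small sketch: when the bus buffer is full, evict the lowest-priority event class first (heartbeats before quotes) and never drop trades. This is an illustrative model, not the actual market-data-service code; the class name, event shape, and full-of-trades behavior are assumptions:

```python
from collections import deque

# Drop priority when the buffer is full: heartbeats first, then quotes.
# Trades are deliberately absent from this list and are never evicted.
DROP_ORDER = ["heartbeat", "quote"]

class SmartDropBuffer:
    def __init__(self, maxlen=100):
        self.maxlen = maxlen
        self.events = deque()
        self.dropped = {"heartbeat": 0, "quote": 0}

    def put(self, event):
        """event is a dict with a 'type' key. Returns True if enqueued."""
        if len(self.events) >= self.maxlen:
            for kind in DROP_ORDER:
                victim = next((e for e in self.events if e["type"] == kind), None)
                if victim is not None:
                    self.events.remove(victim)
                    self.dropped[kind] += 1
                    break
            else:
                # Buffer holds only trades: reject the newcomer instead
                # of evicting a trade (one possible policy choice).
                return False
        self.events.append(event)
        return True
```

The design choice mirrored here is that the producer degrades gracefully under load: monitoring noise goes first, price context second, and fills/trades survive as long as possible.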
New service: Node Capabilities Service (NCS)
- services/node-capabilities/: FastAPI microservice exposing live model inventory from Ollama, Swapper, and llama-server
- GET /capabilities: canonical JSON with served_models[] and inventory_only[]
- GET /capabilities/models: flat list of served models
- POST /capabilities/refresh: force cache refresh
- Cache TTL 15s, bound to 127.0.0.1:8099
- services/router/capabilities_client.py: async client with TTL cache

Artifacts:
- ops/node2_models_audit.md: 3-layer model view (served/disk/cloud)
- ops/node2_models_audit.yml: machine-readable audit
- ops/node2_capabilities_example.json: sample NCS output (14 served models)

Made-with: Cursor

Node Worker (services/node-worker/):
- NATS subscriber for node.{NODE_ID}.llm.request / vision.request
- Canonical JobRequest/JobResponse envelope (Pydantic)
- Idempotency cache (TTL 10 min) with inflight dedup
- Deadline enforcement (DEADLINE_EXCEEDED on expired jobs)
- Concurrency limiter (semaphore, returns busy)
- Ollama + Swapper vision providers

Router offload (services/router/offload_client.py):
- NATS req/reply with configurable retries
- Circuit breaker per node+type (3 fails/60s → open 120s)
- Concurrency semaphore for remote requests

Model selection (services/router/model_select.py):
- exclude_nodes parameter for circuit-broken nodes
- force_local flag for fallback re-selection
- Integrated circuit breaker state awareness

Router /infer pipeline:
- Remote offload path when NCS selects a remote node
- Automatic fallback: exclude the failed node → force_local re-select
- Deadline propagation from router to node-worker

Tests: 17 unit tests (idempotency, deadline, circuit breaker)
Docs: ops/offload_routing.md (subjects, envelope, verification)

Made-with: Cursor

NCS:
- _collect_worker_caps() fetches capability flags from node-worker /caps
- _derive_capabilities() merges served model types + worker provider flags
- installed_artifacts replaces inventory_only (disk scan with DISK_SCAN_PATHS env)
- New endpoints: /capabilities/caps, /capabilities/installed

Node Worker:
- STT_PROVIDER, TTS_PROVIDER, OCR_PROVIDER, IMAGE_PROVIDER env flags
- /caps endpoint returns capabilities + providers for NCS aggregation
- STT adapter (providers/stt_mlx_whisper.py): remote + local mode
- TTS adapter (providers/tts_mlx_kokoro.py): remote + local mode
- OCR handler via vision_prompted (ollama_vision with an OCR prompt)
- NATS subjects: node.{id}.stt/tts/ocr/image.request

Router:
- POST /v1/capability/{stt,tts,ocr,image}: capability-based offload routing
- GET /v1/capabilities: global view with capabilities_by_node
- require_fresh_caps(ttl) preflight guard
- find_nodes_with_capability(cap) + load-based node selection

Ops:
- ops/fabric_snapshot.py: full runtime snapshot collector
- ops/fabric_preflight.sh: quick check + snapshot save + diff
- docs/fabric_contract.md: Dev Contract v0.1 (preflight-first)
- tests/test_fabric_contract.py: CI enforcement (6 tests)

Made-with: Cursor

- Adds runbook_artifacts.py: server-side render of release_evidence.md and post_review.md from DB step results (no shell); saves to SOFIIA_DATA_DIR/release_artifacts/<run_id>/
- Evidence: auto-fills preflight/smoke/script outcomes, step table, timestamps
- post_review: auto-fills metadata, smoke results, and incidents from step statuses; leaves [TODO] markers for manual observation sections
- Adds POST /api/runbooks/runs/{run_id}/evidence and /post_review endpoints
- Updates runbook_runs.evidence_path in DB after render
- Adds 11 tests covering file creation, key sections, TODO markers, 404s, API

Made-with: Cursor

- auth.py: adds SOFIIA_CONSOLE_TEAM_KEYS="name:key,..." support; require_auth now returns an identity ("operator"/"user:<name>") for audit; validate_any_key checks primary + team keys; login sets a per-user cookie
- main.py: auth/login + check endpoints return the identity field; imports validate_any_key and _expected_team_cookie_tokens from auth
- docker-compose.node1.yml: adds the SOFIIA_CONSOLE_TEAM_KEYS env var; adds AURORA_SERVICE_URL=http://127.0.0.1:9401 to prevent DNS lookup failure for aurora-service (not deployed on NODA1)

Made-with: Cursor

- runbook_artifacts.py: adds list_run_artifacts() returning files with names, paths, sizes, mtime_utc from release_artifacts/<run_id>/
- runbook_runs_router.py: adds GET /api/runbooks/runs/{run_id}/artifacts
- docs/runbook/team-onboarding-console.md: one-page team onboarding doc covering access, rehearsal run steps, the audit auth model (strict, no localhost bypass), artifacts location, abort procedure

Made-with: Cursor

Closes the full Matrix ↔ DAGI loop.

Egress:
- Invokes Router POST /v1/agents/{agent_id}/infer (field: prompt, response: response)
- send_text() reply to the Matrix room with idempotent txn_id = make_txn_id(room_id, event_id)
- Empty reply → skip send (no spam)
- Reply truncated to 4000 chars if needed

Audit (via sofiia-console POST /api/audit/internal):
- matrix.message.received (on ingress)
- matrix.agent.replied (on successful reply)
- matrix.error (on router/send failure, with error_code)
- Fire-and-forget: audit failures never crash the loop

Router URL fix:
- DAGI_GATEWAY_URL now points to dagi-router-node1:8000 (not gateway:9300)
- Session ID: stable per room, matrix:{room_localpart} (memory context)

9 tests: invoke endpoint, fallback fields, audit write, full cycle, dedupe, empty reply skip, metric callbacks

Made-with: Cursor

H1: InMemoryRateLimiter (sliding window, no Redis):
- Per-room: RATE_LIMIT_ROOM_RPM (default 20/min)
- Per-sender: RATE_LIMIT_SENDER_RPM (default 10/min)
- Room checked before sender, so the sender quota is not charged on a room block
- Blocked messages: matrix.rate_limited audit event + on_rate_limited callback
- reset() for ops/test, stats() exposed in /health

H3: Extended Prometheus metrics:
- matrix_bridge_rate_limited_total{room_id,agent_id,limit_type}
- matrix_bridge_send_duration_seconds histogram (invoke was already there)
- matrix_bridge_invoke_duration_seconds buckets tuned for LLM latency
- matrix_bridge_rate_limiter_active_rooms/senders gauges
- on_invoke_latency + on_send_latency callbacks wired in the ingress loop

16 new tests: rate limiter unit (13) + ingress integration (3)
Total: 65 passed

Made-with: Cursor

Reader + N workers architecture:
- Reader: sync_poll → rate_check → dedupe → queue.put_nowait()
- Workers (WORKER_CONCURRENCY, default 2): queue.get() → invoke → send → audit

Drop policy (queue full):
- put_nowait() raises QueueFull → dropped immediately (the reader never blocks)
- Audit matrix.queue_full + on_queue_dropped callback
- Metric: matrix_bridge_queue_dropped_total{room_id,agent_id}

Graceful shutdown:
1. stop_event → reader exits the loop
2. queue.join() with QUEUE_DRAIN_TIMEOUT_S (default 5s) → workers finish in-flight
3. Worker tasks cancelled

New config env vars: QUEUE_MAX_EVENTS (default 100), WORKER_CONCURRENCY (default 2), QUEUE_DRAIN_TIMEOUT_S (default 5)

New metrics (H3 additions):
- matrix_bridge_queue_size (gauge)
- matrix_bridge_queue_dropped_total (counter)
- matrix_bridge_queue_wait_seconds histogram (buckets: 0.01…30s)

/health: queue.size, queue.max, queue.workers
MatrixIngressLoop: queue_size + worker_count properties

6 queue tests: enqueue/process, full-drop-audit, concurrency barrier, graceful drain, wait metric, rate-limit-before-enqueue
Total: 71 passed

Made-with: Cursor

- mixed_routing.py: parses BRIDGE_MIXED_ROOM_MAP; routes by /slash > @mention > name: > default
- ingress.py: _try_enqueue_mixed for mixed rooms, session isolation {room}:{agent}, reply tagging
- config.py: bridge_mixed_room_map + bridge_mixed_defaults fields
- main.py: parses the mixed config, passes it to MatrixIngressLoop, exposes it in /health + /bridge/mappings
- docker-compose: BRIDGE_MIXED_ROOM_MAP / BRIDGE_MIXED_DEFAULTS env vars, BRIDGE_ALLOWED_AGENTS multi-value
- tests: 25 routing unit tests + 10 ingress integration tests (94 total pass)

Made-with: Cursor

Guard rails (mixed_routing.py):
- MAX_AGENTS_PER_MIXED_ROOM (default 5): fail-fast at parse time
- MAX_SLASH_LEN (default 32): reject garbage/injection slash tokens
- Unified rejection reasons: unknown_agent, slash_too_long, no_mapping
- REASON_REJECTED_* constants (separate from the success REASON_*)

Ingress (ingress.py):
- Per-room-agent concurrency semaphore (MIXED_CONCURRENCY_CAP, default 1)
- active_lock_count property for /health + Prometheus
- UNKNOWN_AGENT_BEHAVIOR: "ignore" (silent) | "reply_error" (inform the user)
- on_routed(agent_id, reason) callback for routing metrics
- on_route_rejected(room_id, reason) callback for rejection metrics
- matrix.route.rejected audit event on every rejection

Config + main:
- max_agents_per_mixed_room, max_slash_len, unknown_agent_behavior, mixed_concurrency_cap
- matrix_bridge_routed_total{agent_id, reason} counter
- matrix_bridge_route_rejected_total{room_id, reason} counter
- matrix_bridge_active_room_agent_locks gauge
- /health: mixed_guard_rails section + total_agents_in_mixed_rooms
- docker-compose: all 4 new guard rail env vars

Runbook: section 9, mixed-room debug guide (6 acceptance tests, routing metrics, session isolation, lock hang, config guard)
Tests: 108 pass (94 → 108; +14 new tests for guard rails + callbacks + concurrency)

Made-with: Cursor

New: app/control.py
- ControlConfig: operator_allowlist + control_rooms (frozensets)
- parse_control_config(): validates @user:server + !room:server formats, fail-fast
- parse_command(): parses !verb subcommand [args] [key=value] up to 512 chars
- check_authorization(): AND(is_control_room, is_operator) → (bool, reason)
- Reply helpers: not_implemented, unknown_command, unauthorized, help
- KNOWN_VERBS: runbook, status, help (M3.1+ stubs)
- MAX_CMD_LEN=512, MAX_CMD_TOKENS=20

ingress.py:
- _try_control(): dispatch for control rooms (authorized → audit + reply; unauthorized → audit + optional ⛔)
- Joins control rooms on startup
- _enqueue_from_sync: control rooms are processed first and never forwarded to agents
- on_control_command(sender, verb, subcommand) metric callback
- CONTROL_UNAUTHORIZED_BEHAVIOR: "ignore" | "reply_error"

Audit events:
- matrix.control.command: authorized command (verb, subcommand, args, kwargs)
- matrix.control.unauthorized: rejected by allowlist (reason: not_operator | not_control_room)
- matrix.control.unknown_cmd: authorized but unrecognized verb

Config + main:
- bridge_operator_allowlist, bridge_control_rooms, control_unauthorized_behavior
- matrix_bridge_control_commands_total{sender,verb,subcommand} counter
- /health: control_channel section (enabled, rooms_count, operators_count, behavior)
- /bridge/mappings: control_rooms + control_operators_count
- docker-compose: BRIDGE_OPERATOR_ALLOWLIST, BRIDGE_CONTROL_ROOMS, CONTROL_UNAUTHORIZED_BEHAVIOR

Tests: 40 new → 148 total pass

Made-with: Cursor
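The control-command grammar described above ("!verb subcommand [args] [key=value]" with length and token caps) can be sketched as a small parser. This is an illustrative sketch, not the actual app/control.py: the return shape and the subcommand heuristic are assumptions; only the constants mirror the description.

```python
MAX_CMD_LEN = 512
MAX_CMD_TOKENS = 20

def parse_command(text):
    """Parse '!verb subcommand [args] [key=value]' into
    (verb, subcommand, args, kwargs), or return None if the text is not a
    command, exceeds MAX_CMD_LEN, or has more than MAX_CMD_TOKENS tokens."""
    if not text.startswith("!") or len(text) > MAX_CMD_LEN:
        return None
    tokens = text[1:].split()
    if not tokens or len(tokens) > MAX_CMD_TOKENS:
        return None
    verb = tokens[0].lower()
    # Second token counts as the subcommand unless it is a key=value pair.
    has_sub = len(tokens) > 1 and "=" not in tokens[1]
    subcommand = tokens[1].lower() if has_sub else ""
    rest = tokens[2:] if has_sub else tokens[1:]
    args, kwargs = [], {}
    for tok in rest:
        if "=" in tok:
            key, _, value = tok.partition("=")
            kwargs[key] = value
        else:
            args.append(tok)
    return verb, subcommand, args, kwargs
```

Authorization would then be a separate AND check (control room AND operator allowlist) applied before the parsed command is dispatched.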