# Sofiia Supervisor — LangGraph Orchestration Service

**Location**: NODA2 | **Port**: 8084 (external) → 8080 (container)
**State backend**: Redis (`sofiia-redis:6379`)
**Gateway**: `http://router:8000/v1/tools/execute`

---

## Architecture

```
Caller (Telegram/UI/API)
        │
        ▼
sofiia-supervisor:8084 ──── POST /v1/graphs/{name}/runs
        │                   GET  /v1/runs/{run_id}
        │                   POST /v1/runs/{run_id}/cancel
        ▼
   (LangGraph nodes)
   GatewayClient ──────────────→ router:8000/v1/tools/execute
        │                              │
        │                              ▼ (ToolGovernance)
        │                        RBAC check → limits → redact → audit
        │                              │
        │                        ToolManager.execute_tool(...)
        ▼
sofiia-redis ←── RunRecord + RunEvents (no payload)
```

**Key invariants:**

- LangGraph nodes have **no direct access** to internal services
- All tool calls go through `router → ToolGovernance → ToolManager`
- `graph_run_id` is propagated in the metadata of every gateway request
- Logs contain **hash + sizes only** (no payload content)

---

## Graphs

### `release_check`

Runs the DAARION release_check pipeline via `job_orchestrator_tool`.
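Internally, this graph follows a start → poll → finalize pattern. The sketch below shows that loop in plain Python; the `gateway_execute` stub and its return shape are illustrative assumptions standing in for the real `GatewayClient` call, not the service's actual API:

```python
import time

def gateway_execute(tool: str, action: str, **kwargs) -> dict:
    """Stub standing in for the GatewayClient call to
    router:8000/v1/tools/execute. Returns a fake job status."""
    if action == "start_task":
        return {"job_id": "job_1", "status": "running"}
    # Pretend the job finishes on the first poll.
    return {"job_id": kwargs["job_id"], "status": "succeeded",
            "result": {"pass": True, "summary": "All gates passed."}}

def run_release_check(inp: dict, poll_interval_s: float = 0.0,
                      timeout_s: float = 180.0) -> dict:
    # start_job node: kick off the pipeline via job_orchestrator_tool
    job = gateway_execute("job_orchestrator_tool", "start_task", input=inp)
    deadline = time.monotonic() + timeout_s
    # poll_job node: loop until the job leaves the "running" state
    while job["status"] == "running":
        if time.monotonic() > deadline:
            return {"status": "timeout", "result": None}
        time.sleep(poll_interval_s)
        job = gateway_execute("job_orchestrator_tool", "get_status",
                              job_id=job["job_id"])
    # finalize node: surface the terminal status and result
    return {"status": job["status"], "result": job.get("result")}

report = run_release_check({"service_name": "router"})
print(report["status"])  # succeeded (with the stub above)
```

The `timeouts.overall_sec` default (180) maps onto `timeout_s` here; in the real graph the polling happens across LangGraph node invocations rather than a blocking loop.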
**Nodes**: `start_job` → `poll_job` (loop) → `finalize` → END

**Input** (`input` field of `StartRunRequest`):

| Field | Type | Default | Description |
|---|---|---|---|
| `service_name` | string | `"unknown"` | Service being released |
| `diff_text` | string | `""` | Git diff text |
| `fail_fast` | bool | `true` | Stop on first gate failure |
| `run_deps` | bool | `true` | Run dependency scan gate |
| `run_drift` | bool | `true` | Run drift analysis gate |
| `run_smoke` | bool | `false` | Run smoke tests |
| `deps_targets` | array | `["python","node"]` | Ecosystems for dep scan |
| `deps_vuln_mode` | string | `"offline_cache"` | OSV mode |
| `deps_fail_on` | array | `["CRITICAL","HIGH"]` | Blocking severity |
| `drift_categories` | array | all | Drift analysis categories |
| `risk_profile` | string | `"default"` | Risk profile |
| `timeouts.overall_sec` | number | `180` | Total timeout |

**Output** (in `result`): same shape as the `release_check_runner.py` report:

```json
{
  "pass": true,
  "gates": [{"name": "pr_review", "status": "pass"}, ...],
  "recommendations": [],
  "summary": "All 5 gates passed.",
  "elapsed_ms": 4200
}
```

---

### `incident_triage`

Collects observability data, logs, health status, and runbooks to build a triage report.

**Nodes**: `validate_input` → `service_overview` → `top_errors_logs` → `health_and_runbooks` → `trace_lookup` → `build_triage_report` → END

**Input**:

| Field | Type | Default | Description |
|---|---|---|---|
| `service` | string | — | Service name (required) |
| `symptom` | string | — | Brief incident description (required) |
| `time_range.from` | ISO 8601 | now − 1h | Start of analysis window |
| `time_range.to` | ISO 8601 | now | End of analysis window |
| `env` | string | `"prod"` | Environment |
| `include_traces` | bool | `false` | Look up traces from log IDs |
| `max_log_lines` | int | `120` | Log lines to analyse (max 200) |
| `log_query_hint` | string | auto | Custom log query filter |

**Time window**: clamped to 24h max (`INCIDENT_MAX_TIME_WINDOW_H`).
**Output** (in `result`):

```json
{
  "summary": "...",
  "suspected_root_causes": [{"rank": 1, "cause": "...", "evidence": [...]}],
  "impact_assessment": "SLO impact: error_rate=2.1%",
  "mitigations_now": ["Increase DB pool size", "..."],
  "next_checks": ["Verify healthz", "..."],
  "references": {
    "metrics": {"slo": {...}, "alerts_count": 1},
    "log_samples": ["..."],
    "runbook_snippets": [{"path": "...", "text": "..."}],
    "traces": {"traces": [...]}
  }
}
```

---

## Deployment on NODA2

### Quick start

```bash
# On NODA2 host
cd /path/to/microdao-daarion

# Start supervisor + redis (attaches to existing dagi-network-node2)
docker compose \
  -f docker-compose.node2.yml \
  -f docker-compose.node2-sofiia-supervisor.yml \
  up -d sofiia-supervisor sofiia-redis

# Verify
curl http://localhost:8084/healthz
```

### Environment variables

Copy `.env.example` and set:

```bash
cp services/sofiia-supervisor/.env.example .env
# Edit:
# GATEWAY_BASE_URL=http://router:8000   (must be accessible from the container)
# SUPERVISOR_API_KEY=                   (must match SUPERVISOR_API_KEY in router)
# SUPERVISOR_INTERNAL_KEY=
```

---

## HTTP API

All endpoints require `Authorization: Bearer <SUPERVISOR_INTERNAL_KEY>` if `SUPERVISOR_INTERNAL_KEY` is set.

### Start a run

```bash
curl -X POST http://localhost:8084/v1/graphs/release_check/runs \
  -H "Content-Type: application/json" \
  -d '{
    "workspace_id": "daarion",
    "user_id": "sofiia",
    "agent_id": "sofiia",
    "input": {
      "service_name": "router",
      "run_deps": true,
      "run_drift": true
    }
  }'
```

Response:

```json
{"run_id": "gr_3a1b2c...", "status": "queued", "result": null}
```

### Poll for result

```bash
curl http://localhost:8084/v1/runs/gr_3a1b2c...
```

Response (when complete):

```json
{
  "run_id": "gr_3a1b2c...",
  "graph": "release_check",
  "status": "succeeded",
  "started_at": "2026-02-23T10:00:00+00:00",
  "finished_at": "2026-02-23T10:00:45+00:00",
  "result": {"pass": true, "gates": [...], "summary": "..."},
  "events": [
    {"ts": "...", "type": "node_start", "node": "graph_start", "details": {...}},
    ...
] } ``` ### Start incident triage ```bash curl -X POST http://localhost:8084/v1/graphs/incident_triage/runs \ -H "Content-Type: application/json" \ -d '{ "workspace_id": "daarion", "user_id": "helion", "agent_id": "sofiia", "input": { "service": "router", "symptom": "High error rate after deploy", "env": "prod", "include_traces": true, "time_range": {"from": "2026-02-23T09:00:00Z", "to": "2026-02-23T10:00:00Z"} } }' ``` ### Cancel a run ```bash curl -X POST http://localhost:8084/v1/runs/gr_3a1b2c.../cancel ``` --- ## Connecting to Sofiia (Telegram / internal UI) The supervisor exposes a REST API. To invoke from Sofiia's tool loop: 1. The gateway `job_orchestrator_tool` can be extended with a `start_supervisor_run` action that calls `POST http://sofiia-supervisor:8080/v1/graphs/{name}/runs`. 2. Alternatively, call the supervisor directly from the Telegram bot's backend (if on the same network). Example flow for Telegram → Sofiia → Supervisor → Release Check: ``` User: "Run release check for router" → Sofiia LLM → job_orchestrator_tool(start_task, release_check) → Router: job_orchestrator_tool dispatches to release_check_runner → Returns report (existing flow, unchanged) ``` For **async long-running** workflows (>30s), use the supervisor directly: ``` User: "Triage production incident for router" → Sofiia LLM → [http call] POST /v1/graphs/incident_triage/runs → Returns run_id → Sofiia polls GET /v1/runs/{run_id} (or user asks again) → Returns structured triage report ``` --- ## Security - `SUPERVISOR_INTERNAL_KEY`: Protects supervisor HTTP API (recommend: network-level isolation instead) - `SUPERVISOR_API_KEY` → sent to router's `/v1/tools/execute` as `Authorization: Bearer` - Router's `SUPERVISOR_API_KEY` guards direct tool execution endpoint - All RBAC/limits/audit enforced by router's `ToolGovernance` — supervisor cannot bypass them - LangGraph nodes have **no credentials or secrets** — only `workspace_id/user_id/agent_id` --- ## State TTL and cleanup Runs are 
stored in Redis with TTL = `RUN_TTL_SEC` (default 24h). After TTL expires, the run metadata is automatically removed. To extend TTL for important runs, call `backend.save_run(run)` with a new timestamp (planned: admin endpoint).
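Extending a run's TTL amounts to resetting the expiry on its Redis key. A minimal sketch, using an in-memory stand-in so it runs without a server; the `run:{run_id}` key naming and the client interface are assumptions, not the service's actual storage layout:

```python
RUN_TTL_SEC = 24 * 3600  # default TTL, per RUN_TTL_SEC

class FakeRedis:
    """Tiny in-memory stand-in for a Redis client.
    Only the two calls used below are implemented."""
    def __init__(self):
        self.store, self.ttls = {}, {}
    def set(self, key, value, ex=None):
        self.store[key] = value
        self.ttls[key] = ex
    def expire(self, key, seconds):
        # Like Redis EXPIRE: returns True only if the key exists.
        if key in self.store:
            self.ttls[key] = seconds
            return True
        return False

def extend_run_ttl(redis_client, run_id: str, ttl_sec: int = RUN_TTL_SEC) -> bool:
    """Reset the TTL on a stored run; False means it already expired."""
    return bool(redis_client.expire(f"run:{run_id}", ttl_sec))

r = FakeRedis()
r.set("run:gr_3a1b2c", "{...run record...}", ex=RUN_TTL_SEC)
print(extend_run_ttl(r, "gr_3a1b2c", 48 * 3600))  # True
print(extend_run_ttl(r, "gr_missing"))            # False
```

With a real `redis` client the same `expire(name, time)` call applies; the `False` branch is the case where the run record was already cleaned up and there is nothing left to extend.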