Sofiia Supervisor — LangGraph Orchestration Service
Location: NODA2 | Port: 8084 (external) → 8080 (container)
State backend: Redis (sofiia-redis:6379)
Gateway: http://router:8000/v1/tools/execute
Architecture
Caller (Telegram/UI/API)
│
▼
sofiia-supervisor:8084 ──── POST /v1/graphs/{name}/runs
│ GET /v1/runs/{run_id}
│ POST /v1/runs/{run_id}/cancel
│
▼ (LangGraph nodes)
GatewayClient ──────────────→ router:8000/v1/tools/execute
│ │
│ ▼ (ToolGovernance)
│ RBAC check → limits → redact → audit
│ │
│ ToolManager.execute_tool(...)
│
▼
sofiia-redis ←── RunRecord + RunEvents (no payload)
Key invariants:
- LangGraph nodes have no direct access to internal services
- All tool calls go through router → ToolGovernance → ToolManager
- `graph_run_id` is propagated in the metadata of every gateway request
- Logs contain hashes and sizes only (no payload content)
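As a sketch of the `graph_run_id` propagation invariant (the exact field names of the gateway request schema are assumptions, not the real contract), the per-call metadata might be assembled like this:

```python
def build_gateway_metadata(graph_run_id: str, workspace_id: str,
                           user_id: str, agent_id: str) -> dict:
    """Assemble the metadata attached to every gateway request.

    Hypothetical helper: field names are assumed. The point is the
    invariant itself: graph_run_id travels with every call so the
    router's audit trail can correlate tool executions with a run.
    """
    return {
        "graph_run_id": graph_run_id,   # propagated on every call
        "workspace_id": workspace_id,
        "user_id": user_id,
        "agent_id": agent_id,
    }

meta = build_gateway_metadata("gr_3a1b2c", "daarion", "sofiia", "sofiia")
assert meta["graph_run_id"] == "gr_3a1b2c"
```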
Graphs
release_check
Runs the DAARION release_check pipeline via job_orchestrator_tool.
Nodes: start_job → poll_job (loop) → finalize → END
Input (input field of StartRunRequest):
| Field | Type | Default | Description |
|---|---|---|---|
| service_name | string | "unknown" | Service being released |
| diff_text | string | "" | Git diff text |
| fail_fast | bool | true | Stop on first gate failure |
| run_deps | bool | true | Run dependency scan gate |
| run_drift | bool | true | Run drift analysis gate |
| run_smoke | bool | false | Run smoke tests |
| deps_targets | array | ["python","node"] | Ecosystems for dep scan |
| deps_vuln_mode | string | "offline_cache" | OSV mode |
| deps_fail_on | array | ["CRITICAL","HIGH"] | Blocking severity |
| drift_categories | array | all | Drift analysis categories |
| risk_profile | string | "default" | Risk profile |
| timeouts.overall_sec | number | 180 | Total timeout (seconds) |
Output (in result): Same as release_check_runner.py:
{
"pass": true,
"gates": [{"name": "pr_review", "status": "pass"}, ...],
"recommendations": [],
"summary": "All 5 gates passed.",
"elapsed_ms": 4200
}
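To illustrate consuming this result shape, here is a minimal sketch (the helper name is hypothetical; the `gates` shape is taken from the sample above) that lists gates that did not pass:

```python
def failed_gates(result: dict) -> list[str]:
    """Return names of gates whose status is not 'pass' in a
    release_check result. Sketch over the documented result shape."""
    return [g["name"] for g in result.get("gates", [])
            if g.get("status") != "pass"]

sample = {
    "pass": False,
    "gates": [
        {"name": "pr_review", "status": "pass"},
        {"name": "dependency_scan", "status": "fail"},
    ],
    "summary": "1 of 2 gates failed.",
}
assert failed_gates(sample) == ["dependency_scan"]
```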
incident_triage
Collects observability data, logs, health, and runbooks to build a triage report.
Nodes: validate_input → service_overview → top_errors_logs → health_and_runbooks → trace_lookup → build_triage_report → END
Input:
| Field | Type | Default | Description |
|---|---|---|---|
| service | string | — | Service name (required) |
| symptom | string | — | Brief incident description (required) |
| time_range.from | ISO | -1h | Start of analysis window |
| time_range.to | ISO | now | End of analysis window |
| env | string | "prod" | Environment |
| include_traces | bool | false | Look up traces from log IDs |
| max_log_lines | int | 120 | Log lines to analyse (max 200) |
| log_query_hint | string | auto | Custom log query filter |
Time window: Clamped to 24h max (INCIDENT_MAX_TIME_WINDOW_H).
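The clamping behaviour can be sketched as follows (the constant name mirrors the env var above; the exact clamping rule, moving `from` forward, is an assumption):

```python
from datetime import datetime, timedelta, timezone

INCIDENT_MAX_TIME_WINDOW_H = 24  # mirrors the env var of the same name

def clamp_time_range(start: datetime, end: datetime) -> tuple[datetime, datetime]:
    """If the window exceeds the maximum, move `start` forward so the
    window fits. Hypothetical sketch of the documented 24h clamp."""
    max_window = timedelta(hours=INCIDENT_MAX_TIME_WINDOW_H)
    if end - start > max_window:
        start = end - max_window
    return start, end

now = datetime(2026, 2, 23, 10, 0, tzinfo=timezone.utc)
start, end = clamp_time_range(now - timedelta(hours=48), now)
assert end - start == timedelta(hours=24)
```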
Output (in result):
{
"summary": "...",
"suspected_root_causes": [{"rank": 1, "cause": "...", "evidence": [...]}],
"impact_assessment": "SLO impact: error_rate=2.1%",
"mitigations_now": ["Increase DB pool size", "..."],
"next_checks": ["Verify healthz", "..."],
"references": {
"metrics": {"slo": {...}, "alerts_count": 1},
"log_samples": ["..."],
"runbook_snippets": [{"path": "...", "text": "..."}],
"traces": {"traces": [...]}
}
}
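A small consumer sketch for this report shape (helper name and headline format are hypothetical; fields are taken from the sample above), building a one-line headline from the top-ranked suspected cause:

```python
def triage_headline(report: dict) -> str:
    """One-line headline: report summary plus the top-ranked suspected
    root cause. Sketch over the documented triage result shape."""
    causes = sorted(report.get("suspected_root_causes", []),
                    key=lambda c: c.get("rank", 99))
    top = causes[0]["cause"] if causes else "unknown"
    return f"{report.get('summary', '')} | top cause: {top}"

report = {
    "summary": "Error rate spike on router",
    "suspected_root_causes": [
        {"rank": 2, "cause": "slow upstream queries", "evidence": []},
        {"rank": 1, "cause": "DB pool exhaustion", "evidence": []},
    ],
}
assert triage_headline(report).endswith("top cause: DB pool exhaustion")
```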
Deployment on NODA2
Quick start
# On NODA2 host
cd /path/to/microdao-daarion
# Start supervisor + redis (attaches to existing dagi-network-node2)
docker compose \
-f docker-compose.node2.yml \
-f docker-compose.node2-sofiia-supervisor.yml \
up -d sofiia-supervisor sofiia-redis
# Verify
curl http://localhost:8084/healthz
Environment variables
Copy .env.example and set:
cp services/sofiia-supervisor/.env.example .env
# Edit:
# GATEWAY_BASE_URL=http://router:8000 (must be accessible from container)
# SUPERVISOR_API_KEY=<key-for-router> (matches SUPERVISOR_API_KEY in router)
# SUPERVISOR_INTERNAL_KEY=<key-to-protect-supervisor-api>
HTTP API
All endpoints require Authorization: Bearer <SUPERVISOR_INTERNAL_KEY> if SUPERVISOR_INTERNAL_KEY is set.
Start a run
curl -X POST http://localhost:8084/v1/graphs/release_check/runs \
-H "Content-Type: application/json" \
-d '{
"workspace_id": "daarion",
"user_id": "sofiia",
"agent_id": "sofiia",
"input": {
"service_name": "router",
"run_deps": true,
"run_drift": true
}
}'
Response:
{"run_id": "gr_3a1b2c...", "status": "queued", "result": null}
Poll for result
curl http://localhost:8084/v1/runs/gr_3a1b2c...
Response (when complete):
{
"run_id": "gr_3a1b2c...",
"graph": "release_check",
"status": "succeeded",
"started_at": "2026-02-23T10:00:00+00:00",
"finished_at": "2026-02-23T10:00:45+00:00",
"result": {"pass": true, "gates": [...], "summary": "..."},
"events": [
{"ts": "...", "type": "node_start", "node": "graph_start", "details": {...}},
...
]
}
Start incident triage
curl -X POST http://localhost:8084/v1/graphs/incident_triage/runs \
-H "Content-Type: application/json" \
-d '{
"workspace_id": "daarion",
"user_id": "helion",
"agent_id": "sofiia",
"input": {
"service": "router",
"symptom": "High error rate after deploy",
"env": "prod",
"include_traces": true,
"time_range": {"from": "2026-02-23T09:00:00Z", "to": "2026-02-23T10:00:00Z"}
}
}'
Cancel a run
curl -X POST http://localhost:8084/v1/runs/gr_3a1b2c.../cancel
Connecting to Sofiia (Telegram / internal UI)
The supervisor exposes a REST API. To invoke from Sofiia's tool loop:
- The gateway `job_orchestrator_tool` can be extended with a `start_supervisor_run` action that calls `POST http://sofiia-supervisor:8080/v1/graphs/{name}/runs`.
- Alternatively, call the supervisor directly from the Telegram bot's backend (if on the same network).
Example flow for Telegram → Sofiia → Supervisor → Release Check:
User: "Run release check for router"
→ Sofiia LLM → job_orchestrator_tool(start_task, release_check)
→ Router: job_orchestrator_tool dispatches to release_check_runner
→ Returns report (existing flow, unchanged)
For async long-running workflows (>30s), use the supervisor directly:
User: "Triage production incident for router"
→ Sofiia LLM → [http call] POST /v1/graphs/incident_triage/runs
→ Returns run_id
→ Sofiia polls GET /v1/runs/{run_id} (or user asks again)
→ Returns structured triage report
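The polling step in that flow can be sketched as a simple loop; the HTTP call is injected as a callable so the logic stands alone (the terminal status names beyond `succeeded` are assumptions based on the cancel endpoint above):

```python
import time

def wait_for_run(fetch, run_id: str, timeout_s: float = 300.0,
                 interval_s: float = 2.0) -> dict:
    """Poll GET /v1/runs/{run_id} via the injected `fetch` until the run
    reaches a terminal state or the timeout expires. Sketch only: in
    practice `fetch` would wrap an HTTP GET with the Bearer header."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        run = fetch(run_id)
        # Terminal states assumed; the doc shows "succeeded" explicitly.
        if run["status"] in ("succeeded", "failed", "cancelled"):
            return run
        time.sleep(interval_s)
    raise TimeoutError(f"run {run_id} still pending after {timeout_s}s")

# Fake fetch for illustration: "running" once, then "succeeded".
states = iter([{"status": "running"}, {"status": "succeeded", "result": {}}])
run = wait_for_run(lambda rid: next(states), "gr_3a1b2c", interval_s=0.0)
assert run["status"] == "succeeded"
```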
Security
- `SUPERVISOR_INTERNAL_KEY`: protects the supervisor HTTP API (recommended: network-level isolation instead)
- `SUPERVISOR_API_KEY`: sent to the router's `/v1/tools/execute` as `Authorization: Bearer`
- The router's `SUPERVISOR_API_KEY` guards the direct tool execution endpoint
- All RBAC/limits/audit are enforced by the router's `ToolGovernance`; the supervisor cannot bypass them
- LangGraph nodes carry no credentials or secrets, only `workspace_id`/`user_id`/`agent_id`
State TTL and cleanup
Runs are stored in Redis with TTL = RUN_TTL_SEC (default 24h). After TTL expires, the run metadata is automatically removed.
To extend the TTL of an important run, re-save it via backend.save_run(run), which writes the record with a fresh expiry (an admin endpoint for this is planned).
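The TTL semantics can be illustrated with a dict-backed stand-in for the Redis backend (the class and its API are hypothetical, not the real backend; the point is that each save resets the expiry, mirroring Redis `SET ... EX ttl`):

```python
import time

class InMemoryRunStore:
    """Toy stand-in for the Redis run backend: each save_run resets the
    record's TTL, so re-saving a run extends its lifetime."""

    def __init__(self, ttl_sec: int = 24 * 3600):  # RUN_TTL_SEC default
        self.ttl_sec = ttl_sec
        self._runs: dict[str, tuple[dict, float]] = {}

    def save_run(self, run: dict) -> None:
        # Store the record with a fresh expiry timestamp.
        self._runs[run["run_id"]] = (run, time.time() + self.ttl_sec)

    def get_run(self, run_id: str):
        entry = self._runs.get(run_id)
        if entry is None or entry[1] < time.time():
            return None  # unknown or expired
        return entry[0]

store = InMemoryRunStore(ttl_sec=60)
store.save_run({"run_id": "gr_1", "status": "succeeded"})
assert store.get_run("gr_1")["status"] == "succeeded"
```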