
Sofiia Supervisor — LangGraph Orchestration Service

Location: NODA2 | Port: 8084 (external) → 8080 (container)
State backend: Redis (sofiia-redis:6379)
Gateway: http://router:8000/v1/tools/execute


Architecture

Caller (Telegram/UI/API)
        │
        ▼
sofiia-supervisor:8084  ──── POST /v1/graphs/{name}/runs
        │                     GET  /v1/runs/{run_id}
        │                     POST /v1/runs/{run_id}/cancel
        │
        ▼ (LangGraph nodes)
GatewayClient ──────────────→ router:8000/v1/tools/execute
        │                         │
        │                         ▼ (ToolGovernance)
        │                     RBAC check → limits → redact → audit
        │                         │
        │                     ToolManager.execute_tool(...)
        │
        ▼
sofiia-redis  ←── RunRecord + RunEvents (no payload)

Key invariants:

  • LangGraph nodes have no direct access to internal services
  • All tool calls go through router → ToolGovernance → ToolManager
  • graph_run_id is propagated in every gateway request metadata
  • Logs contain hash + sizes only (no payload content)
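The last two invariants can be sketched in a few lines. This is an illustrative shape only (the function and field names here are not the actual GatewayClient code): every request body carries run metadata, and log lines carry a digest and size instead of content.

```python
import hashlib
import json

def build_gateway_request(tool: str, arguments: dict, *, graph_run_id: str,
                          workspace_id: str, user_id: str, agent_id: str) -> dict:
    """Every tool call carries run metadata so the router can attribute and audit it."""
    return {
        "tool": tool,
        "arguments": arguments,
        "metadata": {
            "graph_run_id": graph_run_id,   # propagated on every gateway request
            "workspace_id": workspace_id,
            "user_id": user_id,
            "agent_id": agent_id,
        },
    }

def log_line(payload: dict) -> str:
    """Log only a hash and byte size of the payload, never its content."""
    raw = json.dumps(payload, sort_keys=True).encode()
    return f"gateway_call sha256={hashlib.sha256(raw).hexdigest()[:12]} size={len(raw)}"
```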

Graphs

release_check

Runs the DAARION release_check pipeline via job_orchestrator_tool.

Nodes: start_job → poll_job (loop) → finalize → END

Input (input field of StartRunRequest):

| Field | Type | Default | Description |
|---|---|---|---|
| service_name | string | "unknown" | Service being released |
| diff_text | string | "" | Git diff text |
| fail_fast | bool | true | Stop on first gate failure |
| run_deps | bool | true | Run dependency scan gate |
| run_drift | bool | true | Run drift analysis gate |
| run_smoke | bool | false | Run smoke tests |
| deps_targets | array | ["python","node"] | Ecosystems for dep scan |
| deps_vuln_mode | string | "offline_cache" | OSV mode |
| deps_fail_on | array | ["CRITICAL","HIGH"] | Blocking severity |
| drift_categories | array | all | Drift analysis categories |
| risk_profile | string | "default" | Risk profile |
| timeouts.overall_sec | number | 180 | Total timeout |
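As an illustration of the defaults above, a caller could pre-fill its request like this (the service applies the same defaults server-side, so this is purely a client-side convenience sketch):

```python
# Documented defaults for release_check input (drift_categories omitted:
# its default "all" is expanded by the service, not a literal value).
RELEASE_CHECK_DEFAULTS = {
    "service_name": "unknown",
    "diff_text": "",
    "fail_fast": True,
    "run_deps": True,
    "run_drift": True,
    "run_smoke": False,
    "deps_targets": ["python", "node"],
    "deps_vuln_mode": "offline_cache",
    "deps_fail_on": ["CRITICAL", "HIGH"],
    "risk_profile": "default",
    "timeouts": {"overall_sec": 180},
}

def with_defaults(user_input: dict) -> dict:
    """Fill missing release_check fields with the documented defaults."""
    return {**RELEASE_CHECK_DEFAULTS, **user_input}
```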

Output (in result): same shape as the report produced by release_check_runner.py:

{
  "pass": true,
  "gates": [{"name": "pr_review", "status": "pass"}, ...],
  "recommendations": [],
  "summary": "All 5 gates passed.",
  "elapsed_ms": 4200
}
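Given a result in this shape, a caller can summarise failures with a one-liner (a hypothetical helper, shown only to illustrate the gates structure):

```python
def failed_gates(result: dict) -> list[str]:
    """Return the names of gates that did not pass in a release_check result."""
    return [g["name"] for g in result.get("gates", []) if g.get("status") != "pass"]
```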

incident_triage

Collects observability data, logs, health, and runbooks to build a triage report.

Nodes: validate_input → service_overview → top_errors_logs → health_and_runbooks → trace_lookup → build_triage_report → END

Input:

| Field | Type | Default | Description |
|---|---|---|---|
| service | string | — | Service name (required) |
| symptom | string | — | Brief incident description (required) |
| time_range.from | ISO | -1h | Start of analysis window |
| time_range.to | ISO | now | End of analysis window |
| env | string | "prod" | Environment |
| include_traces | bool | false | Look up traces from log IDs |
| max_log_lines | int | 120 | Log lines to analyse (max 200) |
| log_query_hint | string | auto | Custom log query filter |

Time window: Clamped to 24h max (INCIDENT_MAX_TIME_WINDOW_H).
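The clamping could look roughly like this; the assumption here (not confirmed by the source) is that the service keeps the `to` edge and moves `from` forward when the window is too wide:

```python
from datetime import datetime, timedelta

INCIDENT_MAX_TIME_WINDOW_H = 24  # default from the service config

def clamp_time_range(frm: str, to: str) -> tuple[str, str]:
    """Shrink an ISO-8601 window to at most INCIDENT_MAX_TIME_WINDOW_H hours."""
    start = datetime.fromisoformat(frm.replace("Z", "+00:00"))
    end = datetime.fromisoformat(to.replace("Z", "+00:00"))
    max_span = timedelta(hours=INCIDENT_MAX_TIME_WINDOW_H)
    if end - start > max_span:
        start = end - max_span  # keep the `to` edge, move `from` forward
    return start.isoformat(), end.isoformat()
```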

Output (in result):

{
  "summary": "...",
  "suspected_root_causes": [{"rank": 1, "cause": "...", "evidence": [...]}],
  "impact_assessment": "SLO impact: error_rate=2.1%",
  "mitigations_now": ["Increase DB pool size", "..."],
  "next_checks": ["Verify healthz", "..."],
  "references": {
    "metrics": {"slo": {...}, "alerts_count": 1},
    "log_samples": ["..."],
    "runbook_snippets": [{"path": "...", "text": "..."}],
    "traces": {"traces": [...]}
  }
}

Deployment on NODA2

Quick start

# On NODA2 host
cd /path/to/microdao-daarion

# Start supervisor + redis (attaches to existing dagi-network-node2)
docker compose \
  -f docker-compose.node2.yml \
  -f docker-compose.node2-sofiia-supervisor.yml \
  up -d sofiia-supervisor sofiia-redis

# Verify
curl http://localhost:8084/healthz

Environment variables

Copy .env.example and set:

cp services/sofiia-supervisor/.env.example .env
# Edit:
#   GATEWAY_BASE_URL=http://router:8000   (must be accessible from container)
#   SUPERVISOR_API_KEY=<key-for-router>   (matches SUPERVISOR_API_KEY in router)
#   SUPERVISOR_INTERNAL_KEY=<key-to-protect-supervisor-api>

HTTP API

When SUPERVISOR_INTERNAL_KEY is set, all endpoints require an Authorization: Bearer <SUPERVISOR_INTERNAL_KEY> header.
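A client can build its headers accordingly; a minimal sketch, assuming the key is exposed to the caller via the environment variable of the same name:

```python
import os

def supervisor_headers() -> dict:
    """Build request headers; attach the bearer token only when the key is set."""
    headers = {"Content-Type": "application/json"}
    key = os.environ.get("SUPERVISOR_INTERNAL_KEY")
    if key:
        headers["Authorization"] = f"Bearer {key}"
    return headers
```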

Start a run

curl -X POST http://localhost:8084/v1/graphs/release_check/runs \
  -H "Content-Type: application/json" \
  -d '{
    "workspace_id": "daarion",
    "user_id": "sofiia",
    "agent_id": "sofiia",
    "input": {
      "service_name": "router",
      "run_deps": true,
      "run_drift": true
    }
  }'

Response:

{"run_id": "gr_3a1b2c...", "status": "queued", "result": null}

Poll for result

curl http://localhost:8084/v1/runs/gr_3a1b2c...

Response (when complete):

{
  "run_id": "gr_3a1b2c...",
  "graph": "release_check",
  "status": "succeeded",
  "started_at": "2026-02-23T10:00:00+00:00",
  "finished_at": "2026-02-23T10:00:45+00:00",
  "result": {"pass": true, "gates": [...], "summary": "..."},
  "events": [
    {"ts": "...", "type": "node_start", "node": "graph_start", "details": {...}},
    ...
  ]
}
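The poll loop can be wrapped in a small helper. This sketch takes an injected `fetch` callable so it stays transport-agnostic; the terminal status set is an assumption based on the statuses shown in this document (`queued`, `succeeded`) plus the cancel endpoint:

```python
import time

TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}  # assumed terminal set

def wait_for_run(fetch, run_id: str, timeout_s: float = 300,
                 interval_s: float = 2.0) -> dict:
    """Poll GET /v1/runs/{run_id} (via `fetch`) until the run reaches a
    terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        run = fetch(run_id)  # e.g. requests.get(f"{base}/v1/runs/{run_id}").json()
        if run.get("status") in TERMINAL_STATUSES:
            return run
        time.sleep(interval_s)
    raise TimeoutError(f"run {run_id} did not finish within {timeout_s}s")
```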

Start incident triage

curl -X POST http://localhost:8084/v1/graphs/incident_triage/runs \
  -H "Content-Type: application/json" \
  -d '{
    "workspace_id": "daarion",
    "user_id": "helion",
    "agent_id": "sofiia",
    "input": {
      "service": "router",
      "symptom": "High error rate after deploy",
      "env": "prod",
      "include_traces": true,
      "time_range": {"from": "2026-02-23T09:00:00Z", "to": "2026-02-23T10:00:00Z"}
    }
  }'

Cancel a run

curl -X POST http://localhost:8084/v1/runs/gr_3a1b2c.../cancel

Connecting to Sofiia (Telegram / internal UI)

The supervisor exposes a REST API. To invoke from Sofiia's tool loop:

  1. The gateway job_orchestrator_tool can be extended with a start_supervisor_run action that calls POST http://sofiia-supervisor:8080/v1/graphs/{name}/runs.
  2. Alternatively, call the supervisor directly from the Telegram bot's backend (if on the same network).
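For option 1, the extension action could be shaped like this. Everything here is hypothetical (the action does not exist yet per the text above); `post` is an injected HTTP helper, e.g. a thin wrapper around requests.post that returns the decoded JSON body:

```python
def start_supervisor_run(post, graph: str, graph_input: dict, *,
                         workspace_id: str, user_id: str, agent_id: str) -> str:
    """Hypothetical gateway action: start a supervisor run, return its run_id."""
    url = f"http://sofiia-supervisor:8080/v1/graphs/{graph}/runs"
    body = {
        "workspace_id": workspace_id,
        "user_id": user_id,
        "agent_id": agent_id,
        "input": graph_input,
    }
    resp = post(url, json=body)  # expected response: {"run_id": ..., "status": "queued", ...}
    return resp["run_id"]
```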

Example flow for Telegram → Sofiia → Supervisor → Release Check:

User: "Run release check for router"
  → Sofiia LLM → job_orchestrator_tool(start_task, release_check)
  → Router: job_orchestrator_tool dispatches to release_check_runner
  → Returns report (existing flow, unchanged)

For async long-running workflows (>30s), use the supervisor directly:

User: "Triage production incident for router"
  → Sofiia LLM → [http call] POST /v1/graphs/incident_triage/runs
  → Returns run_id
  → Sofiia polls GET /v1/runs/{run_id} (or user asks again)
  → Returns structured triage report

Security

  • SUPERVISOR_INTERNAL_KEY: Protects supervisor HTTP API (recommend: network-level isolation instead)
  • SUPERVISOR_API_KEY → sent to router's /v1/tools/execute as Authorization: Bearer
  • Router's SUPERVISOR_API_KEY guards direct tool execution endpoint
  • All RBAC/limits/audit enforced by router's ToolGovernance — supervisor cannot bypass them
  • LangGraph nodes have no credentials or secrets — only workspace_id/user_id/agent_id

State TTL and cleanup

Runs are stored in Redis with TTL = RUN_TTL_SEC (default 24h). After TTL expires, the run metadata is automatically removed.

To extend the TTL of an important run, call backend.save_run(run) again with a refreshed timestamp, which resets the expiry (an admin endpoint for this is planned).
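Alternatively, the expiry could be reset directly in Redis. A minimal sketch, assuming a `run:{run_id}` key layout (check the backend for the real key names before relying on this):

```python
def extend_run_ttl(redis_client, run_id: str, ttl_sec: int = 24 * 3600) -> bool:
    """Reset the TTL on a stored run record via Redis EXPIRE.
    The key name `run:{run_id}` is an assumption for illustration."""
    return bool(redis_client.expire(f"run:{run_id}", ttl_sec))
```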