# Sofiia Supervisor — LangGraph Orchestration Service
**Location**: NODA2 | **Port**: 8084 (external) → 8080 (container)
**State backend**: Redis (`sofiia-redis:6379`)
**Gateway**: `http://router:8000/v1/tools/execute`
---
## Architecture
```
Caller (Telegram/UI/API)
        │
        ▼
sofiia-supervisor:8084 ──── POST /v1/graphs/{name}/runs
        │                   GET  /v1/runs/{run_id}
        │                   POST /v1/runs/{run_id}/cancel
        ▼ (LangGraph nodes)
GatewayClient ──────────────→ router:8000/v1/tools/execute
        │                              │
        │                              ▼ (ToolGovernance)
        │                     RBAC check → limits → redact → audit
        │                              │
        │                              ▼
        │                     ToolManager.execute_tool(...)
        ▼
sofiia-redis ←── RunRecord + RunEvents (no payload)
```
**Key invariants:**
- LangGraph nodes have **no direct access** to internal services
- All tool calls go through `router → ToolGovernance → ToolManager`
- `graph_run_id` is propagated in every gateway request metadata
- Logs contain **hash + sizes only** (no payload content)
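The hash-plus-sizes logging invariant can be sketched as follows (a hypothetical helper for illustration, not the actual supervisor code):

```python
import hashlib
import json


def log_safe_summary(payload: dict) -> dict:
    """Return a log-safe summary of a payload: SHA-256 hash and byte size only.

    The payload content itself never reaches the log line.
    """
    raw = json.dumps(payload, sort_keys=True).encode("utf-8")
    return {
        "payload_sha256": hashlib.sha256(raw).hexdigest(),
        "payload_bytes": len(raw),
    }


summary = log_safe_summary({"service_name": "router", "diff_text": "<large diff>"})
# summary carries only a hash and a size -- the diff text is not logged
```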
---
## Graphs
### `release_check`
Runs the DAARION release_check pipeline via `job_orchestrator_tool`.
**Nodes**: `start_job` → `poll_job` (loop) → `finalize` → END
**Input** (`input` field of StartRunRequest):
| Field | Type | Default | Description |
|---|---|---|---|
| `service_name` | string | `"unknown"` | Service being released |
| `diff_text` | string | `""` | Git diff text |
| `fail_fast` | bool | `true` | Stop on first gate failure |
| `run_deps` | bool | `true` | Run dependency scan gate |
| `run_drift` | bool | `true` | Run drift analysis gate |
| `run_smoke` | bool | `false` | Run smoke tests |
| `deps_targets` | array | `["python","node"]` | Ecosystems for dep scan |
| `deps_vuln_mode` | string | `"offline_cache"` | OSV mode |
| `deps_fail_on` | array | `["CRITICAL","HIGH"]` | Blocking severity |
| `drift_categories` | array | all | Drift analysis categories |
| `risk_profile` | string | `"default"` | Risk profile |
| `timeouts.overall_sec` | number | `180` | Total timeout |
**Output** (in `result`): same shape as the `release_check_runner.py` report:
```json
{
"pass": true,
"gates": [{"name": "pr_review", "status": "pass"}, ...],
"recommendations": [],
"summary": "All 5 gates passed.",
"elapsed_ms": 4200
}
```
---
### `incident_triage`
Collects observability data, logs, health, and runbooks to build a triage report.
**Nodes**: `validate_input` → `service_overview` → `top_errors_logs` → `health_and_runbooks` → `trace_lookup` → `build_triage_report` → END
**Input**:
| Field | Type | Default | Description |
|---|---|---|---|
| `service` | string | — | Service name (required) |
| `symptom` | string | — | Brief incident description (required) |
| `time_range.from` | ISO | -1h | Start of analysis window |
| `time_range.to` | ISO | now | End of analysis window |
| `env` | string | `"prod"` | Environment |
| `include_traces` | bool | `false` | Look up traces from log IDs |
| `max_log_lines` | int | `120` | Log lines to analyse (max 200) |
| `log_query_hint` | string | auto | Custom log query filter |
**Time window**: Clamped to 24h max (`INCIDENT_MAX_TIME_WINDOW_H`).
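The clamping behaviour can be sketched like this (an illustrative helper, assuming the window is shortened by moving `from` forward):

```python
from datetime import datetime, timedelta, timezone

MAX_WINDOW = timedelta(hours=24)  # mirrors INCIDENT_MAX_TIME_WINDOW_H


def clamp_window(start: datetime, end: datetime) -> tuple[datetime, datetime]:
    """Clamp [start, end] so it never spans more than 24 hours.

    If the requested window is wider, the start is moved forward so the
    most recent 24h of the window are kept.
    """
    if end - start > MAX_WINDOW:
        start = end - MAX_WINDOW
    return start, end


now = datetime(2026, 2, 23, 10, 0, tzinfo=timezone.utc)
start, end = clamp_window(now - timedelta(hours=48), now)
# a 48h request is clamped to the most recent 24h
```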
**Output** (in `result`):
```json
{
"summary": "...",
"suspected_root_causes": [{"rank": 1, "cause": "...", "evidence": [...]}],
"impact_assessment": "SLO impact: error_rate=2.1%",
"mitigations_now": ["Increase DB pool size", "..."],
"next_checks": ["Verify healthz", "..."],
"references": {
"metrics": {"slo": {...}, "alerts_count": 1},
"log_samples": ["..."],
"runbook_snippets": [{"path": "...", "text": "..."}],
"traces": {"traces": [...]}
}
}
```
---
## Deployment on NODA2
### Quick start
```bash
# On NODA2 host
cd /path/to/microdao-daarion
# Start supervisor + redis (attaches to existing dagi-network-node2)
docker compose \
-f docker-compose.node2.yml \
-f docker-compose.node2-sofiia-supervisor.yml \
up -d sofiia-supervisor sofiia-redis
# Verify
curl http://localhost:8084/healthz
```
### Environment variables
Copy `.env.example` and set:
```bash
cp services/sofiia-supervisor/.env.example .env
# Edit:
# GATEWAY_BASE_URL=http://router:8000 (must be accessible from container)
# SUPERVISOR_API_KEY=<key-for-router> (matches SUPERVISOR_API_KEY in router)
# SUPERVISOR_INTERNAL_KEY=<key-to-protect-supervisor-api>
```
---
## HTTP API
All endpoints require `Authorization: Bearer <SUPERVISOR_INTERNAL_KEY>` if `SUPERVISOR_INTERNAL_KEY` is set.
### Start a run
```bash
curl -X POST http://localhost:8084/v1/graphs/release_check/runs \
-H "Content-Type: application/json" \
-d '{
"workspace_id": "daarion",
"user_id": "sofiia",
"agent_id": "sofiia",
"input": {
"service_name": "router",
"run_deps": true,
"run_drift": true
}
}'
```
Response:
```json
{"run_id": "gr_3a1b2c...", "status": "queued", "result": null}
```
### Poll for result
```bash
curl http://localhost:8084/v1/runs/gr_3a1b2c...
```
Response (when complete):
```json
{
"run_id": "gr_3a1b2c...",
"graph": "release_check",
"status": "succeeded",
"started_at": "2026-02-23T10:00:00+00:00",
"finished_at": "2026-02-23T10:00:45+00:00",
"result": {"pass": true, "gates": [...], "summary": "..."},
"events": [
{"ts": "...", "type": "node_start", "node": "graph_start", "details": {...}},
...
]
}
```
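A minimal polling client for this endpoint might look like the sketch below. Only `queued` and `succeeded` appear in this doc; `failed` and `cancelled` are assumed terminal statuses (the cancel endpoint suggests the latter), and the retry policy is an arbitrary choice:

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8084"  # supervisor external port

# Assumed terminal statuses; only "queued"/"succeeded" are confirmed by this doc.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def is_terminal(status: str) -> bool:
    """True once a run has reached a final state and polling can stop."""
    return status in TERMINAL_STATUSES


def wait_for_run(run_id: str, api_key: str = "",
                 poll_sec: float = 2.0, timeout_sec: float = 300.0) -> dict:
    """Poll GET /v1/runs/{run_id} until the run reaches a terminal status."""
    headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}
    deadline = time.monotonic() + timeout_sec
    while time.monotonic() < deadline:
        req = urllib.request.Request(f"{BASE_URL}/v1/runs/{run_id}", headers=headers)
        with urllib.request.urlopen(req) as resp:
            run = json.load(resp)
        if is_terminal(run["status"]):
            return run
        time.sleep(poll_sec)
    raise TimeoutError(f"run {run_id} did not finish within {timeout_sec}s")
```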
### Start incident triage
```bash
curl -X POST http://localhost:8084/v1/graphs/incident_triage/runs \
-H "Content-Type: application/json" \
-d '{
"workspace_id": "daarion",
"user_id": "helion",
"agent_id": "sofiia",
"input": {
"service": "router",
"symptom": "High error rate after deploy",
"env": "prod",
"include_traces": true,
"time_range": {"from": "2026-02-23T09:00:00Z", "to": "2026-02-23T10:00:00Z"}
}
}'
```
### Cancel a run
```bash
curl -X POST http://localhost:8084/v1/runs/gr_3a1b2c.../cancel
```
---
## Connecting to Sofiia (Telegram / internal UI)
The supervisor exposes a REST API. To invoke from Sofiia's tool loop:
1. The gateway `job_orchestrator_tool` can be extended with a `start_supervisor_run` action that calls `POST http://sofiia-supervisor:8080/v1/graphs/{name}/runs`.
2. Alternatively, call the supervisor directly from the Telegram bot's backend (if on the same network).
Example flow for Telegram → Sofiia → Supervisor → Release Check:
```
User: "Run release check for router"
→ Sofiia LLM → job_orchestrator_tool(start_task, release_check)
→ Router: job_orchestrator_tool dispatches to release_check_runner
→ Returns report (existing flow, unchanged)
```
For **async long-running** workflows (>30s), use the supervisor directly:
```
User: "Triage production incident for router"
→ Sofiia LLM → [http call] POST /v1/graphs/incident_triage/runs
→ Returns run_id
→ Sofiia polls GET /v1/runs/{run_id} (or user asks again)
→ Returns structured triage report
```
---
## Security
- `SUPERVISOR_INTERNAL_KEY`: Protects supervisor HTTP API (recommend: network-level isolation instead)
- `SUPERVISOR_API_KEY` → sent to router's `/v1/tools/execute` as `Authorization: Bearer`
- Router's `SUPERVISOR_API_KEY` guards direct tool execution endpoint
- All RBAC/limits/audit enforced by router's `ToolGovernance` — supervisor cannot bypass them
- LangGraph nodes have **no credentials or secrets** — only `workspace_id/user_id/agent_id`
---
## State TTL and cleanup
Runs are stored in Redis with TTL = `RUN_TTL_SEC` (default 24h). After TTL expires, the run metadata is automatically removed.
To extend TTL for important runs, call `backend.save_run(run)` with a new timestamp (planned: admin endpoint).