# Sofiia Supervisor — LangGraph Orchestration Service
**Location**: NODA2 | **Port**: 8084 (external) → 8080 (container)
**State backend**: Redis (`sofiia-redis:6379`)
**Gateway**: `http://router:8000/v1/tools/execute`
---
## Architecture
```
Caller (Telegram/UI/API)
        │
        ▼
sofiia-supervisor:8084 ──── POST /v1/graphs/{name}/runs
        │                   GET  /v1/runs/{run_id}
        │                   POST /v1/runs/{run_id}/cancel
        ▼ (LangGraph nodes)
GatewayClient ──────────────→ router:8000/v1/tools/execute
        │                              │
        │                              ▼ (ToolGovernance)
        │                     RBAC check → limits → redact → audit
        │                              │
        │                              ▼
        │                     ToolManager.execute_tool(...)
        ▼
sofiia-redis ←── RunRecord + RunEvents (no payload)
```
**Key invariants:**
- LangGraph nodes have **no direct access** to internal services
- All tool calls go through `router → ToolGovernance → ToolManager`
- `graph_run_id` is propagated in every gateway request metadata
- Logs contain **hash + sizes only** (no payload content)
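The hash-plus-sizes logging invariant can be sketched as follows (a hypothetical helper for illustration, not the actual supervisor code):

```python
import hashlib
import json


def log_safe_summary(payload: dict) -> dict:
    """Return a log-safe summary of a payload: SHA-256 hash and byte size only.

    The payload content itself never reaches the log line.
    """
    raw = json.dumps(payload, sort_keys=True).encode("utf-8")
    return {
        "payload_sha256": hashlib.sha256(raw).hexdigest(),
        "payload_bytes": len(raw),
    }


summary = log_safe_summary({"service_name": "router", "diff_text": "<large diff>"})
# summary carries only a hash and a size -- the diff text is not logged
```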
---
## Graphs
### `release_check`
Runs the DAARION release_check pipeline via `job_orchestrator_tool`.
**Nodes**: `start_job` → `poll_job` (loop) → `finalize` → END
**Input** (`input` field of StartRunRequest):
| Field | Type | Default | Description |
|---|---|---|---|
| `service_name` | string | `"unknown"` | Service being released |
| `diff_text` | string | `""` | Git diff text |
| `fail_fast` | bool | `true` | Stop on first gate failure |
| `run_deps` | bool | `true` | Run dependency scan gate |
| `run_drift` | bool | `true` | Run drift analysis gate |
| `run_smoke` | bool | `false` | Run smoke tests |
| `deps_targets` | array | `["python","node"]` | Ecosystems for dep scan |
| `deps_vuln_mode` | string | `"offline_cache"` | OSV mode |
| `deps_fail_on` | array | `["CRITICAL","HIGH"]` | Blocking severity |
| `drift_categories` | array | all | Drift analysis categories |
| `risk_profile` | string | `"default"` | Risk profile |
| `timeouts.overall_sec` | number | `180` | Total timeout |
**Output** (in `result`): same shape as the `release_check_runner.py` report:
```json
{
"pass": true,
"gates": [{"name": "pr_review", "status": "pass"}, ...],
"recommendations": [],
"summary": "All 5 gates passed.",
"elapsed_ms": 4200
}
```
---
### `incident_triage`
Collects observability data, logs, health, and runbooks to build a triage report.
**Nodes**: `validate_input` → `service_overview` → `top_errors_logs` → `health_and_runbooks` → `trace_lookup` → `build_triage_report` → END
**Input**:
| Field | Type | Default | Description |
|---|---|---|---|
| `service` | string | — | Service name (required) |
| `symptom` | string | — | Brief incident description (required) |
| `time_range.from` | ISO | -1h | Start of analysis window |
| `time_range.to` | ISO | now | End of analysis window |
| `env` | string | `"prod"` | Environment |
| `include_traces` | bool | `false` | Look up traces from log IDs |
| `max_log_lines` | int | `120` | Log lines to analyse (max 200) |
| `log_query_hint` | string | auto | Custom log query filter |
**Time window**: Clamped to 24h max (`INCIDENT_MAX_TIME_WINDOW_H`).
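The clamping behaviour can be sketched like this (an illustrative helper, assuming the window is shortened by moving `from` forward):

```python
from datetime import datetime, timedelta, timezone

MAX_WINDOW = timedelta(hours=24)  # mirrors INCIDENT_MAX_TIME_WINDOW_H


def clamp_window(start: datetime, end: datetime) -> tuple[datetime, datetime]:
    """Clamp [start, end] so it never spans more than 24 hours.

    If the requested window is wider, the start is moved forward so the
    most recent 24h of the window are kept.
    """
    if end - start > MAX_WINDOW:
        start = end - MAX_WINDOW
    return start, end


now = datetime(2026, 2, 23, 10, 0, tzinfo=timezone.utc)
start, end = clamp_window(now - timedelta(hours=48), now)
# a 48h request is clamped to the most recent 24h
```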
**Output** (in `result`):
```json
{
"summary": "...",
"suspected_root_causes": [{"rank": 1, "cause": "...", "evidence": [...]}],
"impact_assessment": "SLO impact: error_rate=2.1%",
"mitigations_now": ["Increase DB pool size", "..."],
"next_checks": ["Verify healthz", "..."],
"references": {
"metrics": {"slo": {...}, "alerts_count": 1},
"log_samples": ["..."],
"runbook_snippets": [{"path": "...", "text": "..."}],
"traces": {"traces": [...]}
}
}
```
---
## Deployment on NODA2
### Quick start
```bash
# On NODA2 host
cd /path/to/microdao-daarion
# Start supervisor + redis (attaches to existing dagi-network-node2)
docker compose \
-f docker-compose.node2.yml \
-f docker-compose.node2-sofiia-supervisor.yml \
up -d sofiia-supervisor sofiia-redis
# Verify
curl http://localhost:8084/healthz
```
### Environment variables
Copy `.env.example` and set:
```bash
cp services/sofiia-supervisor/.env.example .env
# Edit:
# GATEWAY_BASE_URL=http://router:8000 (must be accessible from container)
# SUPERVISOR_API_KEY=<key-for-router> (matches SUPERVISOR_API_KEY in router)
# SUPERVISOR_INTERNAL_KEY=<key-to-protect-supervisor-api>
```
---
## HTTP API
All endpoints require `Authorization: Bearer <SUPERVISOR_INTERNAL_KEY>` if `SUPERVISOR_INTERNAL_KEY` is set.
### Start a run
```bash
curl -X POST http://localhost:8084/v1/graphs/release_check/runs \
-H "Content-Type: application/json" \
-d '{
"workspace_id": "daarion",
"user_id": "sofiia",
"agent_id": "sofiia",
"input": {
"service_name": "router",
"run_deps": true,
"run_drift": true
}
}'
```
Response:
```json
{"run_id": "gr_3a1b2c...", "status": "queued", "result": null}
```
### Poll for result
```bash
curl http://localhost:8084/v1/runs/gr_3a1b2c...
```
Response (when complete):
```json
{
"run_id": "gr_3a1b2c...",
"graph": "release_check",
"status": "succeeded",
"started_at": "2026-02-23T10:00:00+00:00",
"finished_at": "2026-02-23T10:00:45+00:00",
"result": {"pass": true, "gates": [...], "summary": "..."},
"events": [
{"ts": "...", "type": "node_start", "node": "graph_start", "details": {...}},
...
]
}
```
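A minimal polling client for this endpoint might look like the sketch below. Only `queued` and `succeeded` appear in this doc; `failed` and `cancelled` are assumed terminal statuses (the cancel endpoint suggests the latter), and the retry policy is an arbitrary choice:

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8084"  # supervisor external port

# Assumed terminal statuses; only "queued"/"succeeded" are confirmed by this doc.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def is_terminal(status: str) -> bool:
    """True once a run has reached a final state and polling can stop."""
    return status in TERMINAL_STATUSES


def wait_for_run(run_id: str, api_key: str = "",
                 poll_sec: float = 2.0, timeout_sec: float = 300.0) -> dict:
    """Poll GET /v1/runs/{run_id} until the run reaches a terminal status."""
    headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}
    deadline = time.monotonic() + timeout_sec
    while time.monotonic() < deadline:
        req = urllib.request.Request(f"{BASE_URL}/v1/runs/{run_id}", headers=headers)
        with urllib.request.urlopen(req) as resp:
            run = json.load(resp)
        if is_terminal(run["status"]):
            return run
        time.sleep(poll_sec)
    raise TimeoutError(f"run {run_id} did not finish within {timeout_sec}s")
```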
### Start incident triage
```bash
curl -X POST http://localhost:8084/v1/graphs/incident_triage/runs \
-H "Content-Type: application/json" \
-d '{
"workspace_id": "daarion",
"user_id": "helion",
"agent_id": "sofiia",
"input": {
"service": "router",
"symptom": "High error rate after deploy",
"env": "prod",
"include_traces": true,
"time_range": {"from": "2026-02-23T09:00:00Z", "to": "2026-02-23T10:00:00Z"}
}
}'
```
### Cancel a run
```bash
curl -X POST http://localhost:8084/v1/runs/gr_3a1b2c.../cancel
```
---
## Connecting to Sofiia (Telegram / internal UI)
The supervisor exposes a REST API. To invoke from Sofiia's tool loop:
1. The gateway `job_orchestrator_tool` can be extended with a `start_supervisor_run` action that calls `POST http://sofiia-supervisor:8080/v1/graphs/{name}/runs`.
2. Alternatively, call the supervisor directly from the Telegram bot's backend (if on the same network).
Example flow for Telegram → Sofiia → Supervisor → Release Check:
```
User: "Run release check for router"
→ Sofiia LLM → job_orchestrator_tool(start_task, release_check)
→ Router: job_orchestrator_tool dispatches to release_check_runner
→ Returns report (existing flow, unchanged)
```
For **async long-running** workflows (>30s), use the supervisor directly:
```
User: "Triage production incident for router"
→ Sofiia LLM → [http call] POST /v1/graphs/incident_triage/runs
→ Returns run_id
→ Sofiia polls GET /v1/runs/{run_id} (or user asks again)
→ Returns structured triage report
```
---
## Security
- `SUPERVISOR_INTERNAL_KEY`: Protects supervisor HTTP API (recommend: network-level isolation instead)
- `SUPERVISOR_API_KEY` → sent to router's `/v1/tools/execute` as `Authorization: Bearer`
- Router's `SUPERVISOR_API_KEY` guards direct tool execution endpoint
- All RBAC/limits/audit enforced by router's `ToolGovernance` — supervisor cannot bypass them
- LangGraph nodes have **no credentials or secrets** — only `workspace_id/user_id/agent_id`
---
## State TTL and cleanup
Runs are stored in Redis with TTL = `RUN_TTL_SEC` (default 24h). After TTL expires, the run metadata is automatically removed.
To extend TTL for important runs, call `backend.save_run(run)` with a new timestamp (planned: admin endpoint).