Config policies (16 files): alert_routing, architecture_pressure, backlog, cost_weights, data_governance, incident_escalation, incident_intelligence, network_allowlist, nodes_registry, observability_sources, rbac_tools_matrix, release_gate, risk_attribution, risk_policy, slo_policy, tool_limits, tools_rollout Ops (22 files): Caddyfile, calendar compose, grafana voice dashboard, deployments/incidents logs, runbooks for alerts/audit/backlog/incidents/sofiia/voice, cron jobs, scripts (alert_triage, audit_cleanup, migrate_*, governance, schedule), task_registry, voice alerts/ha/latency/policy Docs (30+ files): HUMANIZED_STEPAN v2.7-v3 changelogs and runbooks, NODA1/NODA2 status and setup, audit index and traces, backlog, incident, supervisor, tools, voice, opencode, release, risk, aistalk, spacebot Made-with: Cursor
265 lines
7.7 KiB
Markdown
# Sofiia Supervisor — LangGraph Orchestration Service

**Location**: NODA2 | **Port**: 8084 (external) → 8080 (container)
**State backend**: Redis (`sofiia-redis:6379`)
**Gateway**: `http://router:8000/v1/tools/execute`

---

## Architecture

```
Caller (Telegram/UI/API)
    │
    ▼
sofiia-supervisor:8084 ──── POST /v1/graphs/{name}/runs
    │                       GET  /v1/runs/{run_id}
    │                       POST /v1/runs/{run_id}/cancel
    │
    ▼ (LangGraph nodes)
GatewayClient ──────────────→ router:8000/v1/tools/execute
    │                              │
    │                              ▼ (ToolGovernance)
    │                         RBAC check → limits → redact → audit
    │                              │
    │                         ToolManager.execute_tool(...)
    │
    ▼
sofiia-redis ←── RunRecord + RunEvents (no payload)
```

**Key invariants:**

- LangGraph nodes have **no direct access** to internal services
- All tool calls go through `router → ToolGovernance → ToolManager`
- `graph_run_id` is propagated in every gateway request metadata
- Logs contain **hash + sizes only** (no payload content)

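The last two invariants can be illustrated with a short sketch. It assumes a JSON request body with a `metadata` object and a SHA-256 digest as the log fingerprint. The field names (`tool`, `action`, `params`, `metadata`) and both helper functions are hypothetical and are not taken from the service's actual `GatewayClient` code.

```python
import hashlib
import json
from typing import Any


def payload_fingerprint(payload: dict[str, Any]) -> dict[str, Any]:
    """Return a digest and byte size for a payload, never the payload itself."""
    # Canonical serialisation so that key order does not change the hash.
    raw = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return {"sha256": hashlib.sha256(raw).hexdigest(), "size_bytes": len(raw)}


def build_gateway_request(
    tool: str,
    action: str,
    params: dict[str, Any],
    graph_run_id: str,
    workspace_id: str,
    user_id: str,
    agent_id: str,
) -> dict[str, Any]:
    """Assemble a gateway request body (hypothetical field layout)."""
    return {
        "tool": tool,
        "action": action,
        "params": params,
        # graph_run_id travels with every request so the router can correlate
        # tool calls with the originating run.
        "metadata": {
            "graph_run_id": graph_run_id,
            "workspace_id": workspace_id,
            "user_id": user_id,
            "agent_id": agent_id,
        },
    }


request_body = build_gateway_request(
    "job_orchestrator_tool", "start_task", {"task": "release_check"},
    graph_run_id="gr_example", workspace_id="daarion",
    user_id="sofiia", agent_id="sofiia",
)
# Only this dictionary would be written to logs, not request_body["params"].
log_fields = payload_fingerprint(request_body["params"])
```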

---

## Graphs

### `release_check`

Runs the DAARION release_check pipeline via `job_orchestrator_tool`.

**Nodes**: `start_job` → `poll_job` (loop) → `finalize` → END

**Input** (`input` field of StartRunRequest):

| Field | Type | Default | Description |
|---|---|---|---|
| `service_name` | string | `"unknown"` | Service being released |
| `diff_text` | string | `""` | Git diff text |
| `fail_fast` | bool | `true` | Stop on first gate failure |
| `run_deps` | bool | `true` | Run dependency scan gate |
| `run_drift` | bool | `true` | Run drift analysis gate |
| `run_smoke` | bool | `false` | Run smoke tests |
| `deps_targets` | array | `["python","node"]` | Ecosystems for dep scan |
| `deps_vuln_mode` | string | `"offline_cache"` | OSV mode |
| `deps_fail_on` | array | `["CRITICAL","HIGH"]` | Blocking severity |
| `drift_categories` | array | all | Drift analysis categories |
| `risk_profile` | string | `"default"` | Risk profile |
| `timeouts.overall_sec` | number | `180` | Total timeout |

**Output** (in `result`): Same as `release_check_runner.py`:
```json
{
  "pass": true,
  "gates": [{"name": "pr_review", "status": "pass"}, ...],
  "recommendations": [],
  "summary": "All 5 gates passed.",
  "elapsed_ms": 4200
}
```

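The input defaults from the `release_check` table can also be applied on the caller's side before a run is submitted. The sketch below is one way to do that. It assumes `timeouts.overall_sec` is sent as a nested `timeouts` object, and it leaves out `drift_categories` because the table gives its default only as "all". `RELEASE_CHECK_DEFAULTS` and `apply_release_check_defaults` are hypothetical names.

```python
from copy import deepcopy
from typing import Any

# Defaults copied from the input table; drift_categories is intentionally
# absent because its default ("all") is not enumerated in the table.
RELEASE_CHECK_DEFAULTS: dict[str, Any] = {
    "service_name": "unknown",
    "diff_text": "",
    "fail_fast": True,
    "run_deps": True,
    "run_drift": True,
    "run_smoke": False,
    "deps_targets": ["python", "node"],
    "deps_vuln_mode": "offline_cache",
    "deps_fail_on": ["CRITICAL", "HIGH"],
    "risk_profile": "default",
    "timeouts": {"overall_sec": 180},  # assumption: nested object
}


def apply_release_check_defaults(user_input: dict[str, Any]) -> dict[str, Any]:
    """Overlay caller-supplied fields on a copy of the documented defaults."""
    merged = deepcopy(RELEASE_CHECK_DEFAULTS)
    for key, value in user_input.items():
        if key == "timeouts" and isinstance(value, dict):
            # Merge nested timeouts instead of replacing the whole object.
            merged["timeouts"].update(value)
        else:
            merged[key] = value
    return merged


run_input = apply_release_check_defaults({"service_name": "router", "run_smoke": True})
```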

---

### `incident_triage`

Collects observability data, logs, health, and runbooks to build a triage report.

**Nodes**: `validate_input` → `service_overview` → `top_errors_logs` → `health_and_runbooks` → `trace_lookup` → `build_triage_report` → END

**Input**:

| Field | Type | Default | Description |
|---|---|---|---|
| `service` | string | — | Service name (required) |
| `symptom` | string | — | Brief incident description (required) |
| `time_range.from` | ISO | -1h | Start of analysis window |
| `time_range.to` | ISO | now | End of analysis window |
| `env` | string | `"prod"` | Environment |
| `include_traces` | bool | `false` | Look up traces from log IDs |
| `max_log_lines` | int | `120` | Log lines to analyse (max 200) |
| `log_query_hint` | string | auto | Custom log query filter |

**Time window**: Clamped to 24h max (`INCIDENT_MAX_TIME_WINDOW_H`).

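How the clamp might behave can be sketched as follows. The document states only the 24h cap and the defaults (`-1h` and `now`), so two choices here are assumptions: the default start is taken relative to the end of the window, and an oversized window is shortened by moving the start forward so that the most recent 24 hours are kept. `resolve_time_range` is a hypothetical helper, not the service's function.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

INCIDENT_MAX_TIME_WINDOW_H = 24  # cap named in the text above


def _parse_iso(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    # Assumption: timestamps without an offset are treated as UTC.
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def resolve_time_range(
    from_iso: Optional[str] = None,
    to_iso: Optional[str] = None,
    now: Optional[datetime] = None,
) -> Tuple[datetime, datetime]:
    """Apply the documented defaults, then clamp the window to the maximum."""
    now = now or datetime.now(timezone.utc)
    end = _parse_iso(to_iso) if to_iso else now
    start = _parse_iso(from_iso) if from_iso else end - timedelta(hours=1)
    if start >= end:
        raise ValueError("time_range.from must be earlier than time_range.to")
    max_window = timedelta(hours=INCIDENT_MAX_TIME_WINDOW_H)
    if end - start > max_window:
        # Assumption: keep the most recent part of the requested window.
        start = end - max_window
    return start, end


# A request spanning more than three days is reduced to its final 24 hours.
start, end = resolve_time_range("2026-02-20T00:00:00Z", "2026-02-23T10:00:00Z")
```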

**Output** (in `result`):
```json
{
  "summary": "...",
  "suspected_root_causes": [{"rank": 1, "cause": "...", "evidence": [...]}],
  "impact_assessment": "SLO impact: error_rate=2.1%",
  "mitigations_now": ["Increase DB pool size", "..."],
  "next_checks": ["Verify healthz", "..."],
  "references": {
    "metrics": {"slo": {...}, "alerts_count": 1},
    "log_samples": ["..."],
    "runbook_snippets": [{"path": "...", "text": "..."}],
    "traces": {"traces": [...]}
  }
}
```

---

## Deployment on NODA2

### Quick start

```bash
# On NODA2 host
cd /path/to/microdao-daarion

# Start supervisor + redis (attaches to existing dagi-network-node2)
docker compose \
  -f docker-compose.node2.yml \
  -f docker-compose.node2-sofiia-supervisor.yml \
  up -d sofiia-supervisor sofiia-redis

# Verify
curl http://localhost:8084/healthz
```

### Environment variables

Copy `.env.example` and set:

```bash
cp services/sofiia-supervisor/.env.example .env
# Edit:
#   GATEWAY_BASE_URL=http://router:8000    (must be accessible from container)
#   SUPERVISOR_API_KEY=<key-for-router>    (matches SUPERVISOR_API_KEY in router)
#   SUPERVISOR_INTERNAL_KEY=<key-to-protect-supervisor-api>
```

---

## HTTP API

All endpoints require `Authorization: Bearer <SUPERVISOR_INTERNAL_KEY>` if `SUPERVISOR_INTERNAL_KEY` is set.

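A client can attach the bearer header only when a key is configured. The following is a minimal sketch that uses Python's standard `urllib.request`. `build_request` is a hypothetical helper, and it reads the key from an argument rather than from any particular configuration source.

```python
import json
import urllib.request
from typing import Any, Optional


def build_request(
    method: str,
    url: str,
    body: Optional[dict[str, Any]] = None,
    internal_key: Optional[str] = None,
) -> urllib.request.Request:
    """Build a supervisor API request, adding auth only if a key is given."""
    headers: dict[str, str] = {}
    data = None
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    if internal_key:
        # Mirrors the rule above: the header is needed only when
        # SUPERVISOR_INTERNAL_KEY is set on the supervisor.
        headers["Authorization"] = f"Bearer {internal_key}"
    return urllib.request.Request(url, data=data, headers=headers, method=method)


request = build_request(
    "POST",
    "http://localhost:8084/v1/graphs/release_check/runs",
    body={"workspace_id": "daarion", "user_id": "sofiia", "agent_id": "sofiia",
          "input": {"service_name": "router"}},
    internal_key="example-key",  # placeholder value for illustration
)
```

The request object can then be sent with `urllib.request.urlopen(request)`; the call is left out here so that the sketch has no network side effects.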

### Start a run

```bash
curl -X POST http://localhost:8084/v1/graphs/release_check/runs \
  -H "Content-Type: application/json" \
  -d '{
    "workspace_id": "daarion",
    "user_id": "sofiia",
    "agent_id": "sofiia",
    "input": {
      "service_name": "router",
      "run_deps": true,
      "run_drift": true
    }
  }'
```

Response:
```json
{"run_id": "gr_3a1b2c...", "status": "queued", "result": null}
```

### Poll for result

```bash
curl http://localhost:8084/v1/runs/gr_3a1b2c...
```

Response (when complete):
```json
{
  "run_id": "gr_3a1b2c...",
  "graph": "release_check",
  "status": "succeeded",
  "started_at": "2026-02-23T10:00:00+00:00",
  "finished_at": "2026-02-23T10:00:45+00:00",
  "result": {"pass": true, "gates": [...], "summary": "..."},
  "events": [
    {"ts": "...", "type": "node_start", "node": "graph_start", "details": {...}},
    ...
  ]
}
```

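Polling can be wrapped in a small loop that stops once the run reaches a terminal status. This is a sketch under assumptions: only `queued` and `succeeded` appear in the responses above, so `failed` and `cancelled` in `TERMINAL_STATUSES` are guesses, and `wait_for_run` is a hypothetical helper. The fetch function is injected so that the loop does not depend on any particular HTTP client.

```python
import time
from typing import Any, Callable, Dict

# Assumption: "succeeded" is documented; "failed" and "cancelled" are guesses.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_run(
    fetch_run: Callable[[str], Dict[str, Any]],
    run_id: str,
    interval_sec: float = 2.0,
    timeout_sec: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Any]:
    """Call fetch_run(run_id) until the run is terminal or the timeout passes."""
    deadline = clock() + timeout_sec
    while True:
        run = fetch_run(run_id)
        if run.get("status") in TERMINAL_STATUSES:
            return run
        if clock() >= deadline:
            raise TimeoutError(f"run {run_id} did not finish within {timeout_sec}s")
        sleep(interval_sec)


# Stand-in for GET /v1/runs/{run_id}: returns canned responses in order.
_canned = iter([
    {"run_id": "gr_example", "status": "queued", "result": None},
    {"run_id": "gr_example", "status": "succeeded", "result": {"pass": True}},
])
final_run = wait_for_run(lambda _run_id: next(_canned), "gr_example",
                         interval_sec=0, sleep=lambda _seconds: None)
```

In real use, `fetch_run` would issue the `GET /v1/runs/{run_id}` request shown above and return the decoded JSON body.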

### Start incident triage

```bash
curl -X POST http://localhost:8084/v1/graphs/incident_triage/runs \
  -H "Content-Type: application/json" \
  -d '{
    "workspace_id": "daarion",
    "user_id": "helion",
    "agent_id": "sofiia",
    "input": {
      "service": "router",
      "symptom": "High error rate after deploy",
      "env": "prod",
      "include_traces": true,
      "time_range": {"from": "2026-02-23T09:00:00Z", "to": "2026-02-23T10:00:00Z"}
    }
  }'
```

### Cancel a run

```bash
curl -X POST http://localhost:8084/v1/runs/gr_3a1b2c.../cancel
```

---

## Connecting to Sofiia (Telegram / internal UI)

The supervisor exposes a REST API. To invoke from Sofiia's tool loop:

1. The gateway `job_orchestrator_tool` can be extended with a `start_supervisor_run` action that calls `POST http://sofiia-supervisor:8080/v1/graphs/{name}/runs`.
2. Alternatively, call the supervisor directly from the Telegram bot's backend (if on the same network).

Example flow for Telegram → Sofiia → Supervisor → Release Check:
```
User: "Run release check for router"
  → Sofiia LLM → job_orchestrator_tool(start_task, release_check)
  → Router: job_orchestrator_tool dispatches to release_check_runner
  → Returns report (existing flow, unchanged)
```

For **async long-running** workflows (>30s), use the supervisor directly:
```
User: "Triage production incident for router"
  → Sofiia LLM → [http call] POST /v1/graphs/incident_triage/runs
  → Returns run_id
  → Sofiia polls GET /v1/runs/{run_id} (or user asks again)
  → Returns structured triage report
```

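The `start_supervisor_run` action from option 1 in the list above does not exist yet, so the following is only a sketch of what its request-building step could look like. The function name comes from that option; its signature, the `KNOWN_GRAPHS` check, and the returned `(url, body)` pair are assumptions. The body fields follow the curl examples in the HTTP API section.

```python
from typing import Any, Dict, Tuple

# In-network address from option 1 (container port 8080, not the external 8084).
SUPERVISOR_BASE_URL = "http://sofiia-supervisor:8080"
# Graphs documented in this README.
KNOWN_GRAPHS = {"release_check", "incident_triage"}


def start_supervisor_run(
    graph: str,
    workspace_id: str,
    user_id: str,
    agent_id: str,
    graph_input: Dict[str, Any],
) -> Tuple[str, Dict[str, Any]]:
    """Return the URL and JSON body for a POST that starts a graph run."""
    if graph not in KNOWN_GRAPHS:
        # Reject unknown names early instead of forwarding them.
        raise ValueError(f"unknown graph: {graph!r}")
    url = f"{SUPERVISOR_BASE_URL}/v1/graphs/{graph}/runs"
    body = {
        "workspace_id": workspace_id,
        "user_id": user_id,
        "agent_id": agent_id,
        "input": graph_input,
    }
    return url, body


url, body = start_supervisor_run(
    "incident_triage", "daarion", "helion", "sofiia",
    {"service": "router", "symptom": "High error rate after deploy"},
)
```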

---

## Security

- `SUPERVISOR_INTERNAL_KEY`: protects the supervisor HTTP API (recommended: use network-level isolation instead).
- `SUPERVISOR_API_KEY`: sent to the router's `/v1/tools/execute` as `Authorization: Bearer`.
- The router's `SUPERVISOR_API_KEY` guards the direct tool execution endpoint.
- All RBAC, limits, and audit are enforced by the router's `ToolGovernance`; the supervisor cannot bypass them.
- LangGraph nodes have **no credentials or secrets**, only `workspace_id/user_id/agent_id`.

---

## State TTL and cleanup

Runs are stored in Redis with a TTL of `RUN_TTL_SEC` (default 24h). After the TTL expires, the run metadata is removed automatically.

To extend the TTL for important runs, call `backend.save_run(run)` with a new timestamp (planned: admin endpoint).

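The TTL behaviour can be modelled without a Redis server. The sketch below uses a dictionary-backed stand-in and assumes that saving a run again restarts its TTL, which is how a Redis key written with a fresh expiry behaves. `InMemoryRunStore` is hypothetical and is not the service's backend class; only the `save_run` name and the `RUN_TTL_SEC` default come from the text above.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

RUN_TTL_SEC = 24 * 60 * 60  # default of 24h stated above


class InMemoryRunStore:
    """Dictionary-backed stand-in for Redis keys written with an expiry."""

    def __init__(self, ttl_sec: float = RUN_TTL_SEC,
                 clock: Callable[[], float] = time.time) -> None:
        self._ttl_sec = ttl_sec
        self._clock = clock
        self._items: Dict[str, Tuple[float, Dict[str, Any]]] = {}

    def save_run(self, run: Dict[str, Any]) -> None:
        """Store the run; saving again restarts the TTL (assumption)."""
        expires_at = self._clock() + self._ttl_sec
        self._items[run["run_id"]] = (expires_at, dict(run))

    def get_run(self, run_id: str) -> Optional[Dict[str, Any]]:
        """Return the run, or None if it is missing or its TTL has passed."""
        entry = self._items.get(run_id)
        if entry is None:
            return None
        expires_at, run = entry
        if self._clock() >= expires_at:
            del self._items[run_id]
            return None
        return run


store = InMemoryRunStore()
store.save_run({"run_id": "gr_example", "status": "succeeded"})
kept = store.get_run("gr_example")
```

With a real Redis backend the expiry would be enforced by the server itself; this model only shows why re-saving a run keeps it alive for another `RUN_TTL_SEC`.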