docs(platform): add policy configs, runbooks, ops scripts and platform documentation
Config policies (16 files): alert_routing, architecture_pressure, backlog, cost_weights, data_governance, incident_escalation, incident_intelligence, network_allowlist, nodes_registry, observability_sources, rbac_tools_matrix, release_gate, risk_attribution, risk_policy, slo_policy, tool_limits, tools_rollout

Ops (22 files): Caddyfile, calendar compose, grafana voice dashboard, deployments/incidents logs, runbooks for alerts/audit/backlog/incidents/sofiia/voice, cron jobs, scripts (alert_triage, audit_cleanup, migrate_*, governance, schedule), task_registry, voice alerts/ha/latency/policy

Docs (30+ files): HUMANIZED_STEPAN v2.7-v3 changelogs and runbooks, NODA1/NODA2 status and setup, audit index and traces, backlog, incident, supervisor, tools, voice, opencode, release, risk, aistalk, spacebot

Made-with: Cursor
264
docs/supervisor/langgraph_supervisor.md
Normal file
@@ -0,0 +1,264 @@
# Sofiia Supervisor — LangGraph Orchestration Service

**Location**: NODA2 | **Port**: 8084 (external) → 8080 (container)
**State backend**: Redis (`sofiia-redis:6379`)
**Gateway**: `http://router:8000/v1/tools/execute`

---

## Architecture

```
Caller (Telegram/UI/API)
    │
    ▼
sofiia-supervisor:8084  ──── POST /v1/graphs/{name}/runs
    │                        GET  /v1/runs/{run_id}
    │                        POST /v1/runs/{run_id}/cancel
    │
    ▼  (LangGraph nodes)
GatewayClient  ──────────────→  router:8000/v1/tools/execute
    │                               │
    │                               ▼  (ToolGovernance)
    │                           RBAC check → limits → redact → audit
    │                               │
    │                           ToolManager.execute_tool(...)
    │
    ▼
sofiia-redis  ←── RunRecord + RunEvents (no payload)
```

**Key invariants:**
- LangGraph nodes have **no direct access** to internal services
- All tool calls go through `router → ToolGovernance → ToolManager`
- `graph_run_id` is propagated in every gateway request metadata
- Logs contain **hash + sizes only** (no payload content)
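
The last two invariants can be illustrated with a minimal sketch. The request field names (`tool`, `args`, `metadata`) and both function names are assumptions made for illustration; the actual `/v1/tools/execute` schema is not shown in this document.

```python
import hashlib
import json


def build_gateway_request(tool: str, args: dict, graph_run_id: str) -> dict:
    """Build a tool-execution request body that carries the run id in metadata.

    The field names "tool", "args" and "metadata" are hypothetical.
    """
    return {
        "tool": tool,
        "args": args,
        "metadata": {"graph_run_id": graph_run_id},
    }


def summarize_for_log(payload: dict) -> dict:
    """Return a log-safe summary: SHA-256 digest and byte size, never the content."""
    raw = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return {"sha256": hashlib.sha256(raw).hexdigest(), "size_bytes": len(raw)}
```

Logging only `summarize_for_log(request)` keeps payload content out of logs while still allowing two log lines to be correlated by digest.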

---

## Graphs

### `release_check`

Runs the DAARION release_check pipeline via `job_orchestrator_tool`.

**Nodes**: `start_job` → `poll_job` (loop) → `finalize` → END

**Input** (`input` field of StartRunRequest):

| Field | Type | Default | Description |
|---|---|---|---|
| `service_name` | string | `"unknown"` | Service being released |
| `diff_text` | string | `""` | Git diff text |
| `fail_fast` | bool | `true` | Stop on first gate failure |
| `run_deps` | bool | `true` | Run dependency scan gate |
| `run_drift` | bool | `true` | Run drift analysis gate |
| `run_smoke` | bool | `false` | Run smoke tests |
| `deps_targets` | array | `["python","node"]` | Ecosystems for dep scan |
| `deps_vuln_mode` | string | `"offline_cache"` | OSV mode |
| `deps_fail_on` | array | `["CRITICAL","HIGH"]` | Blocking severity |
| `drift_categories` | array | all | Drift analysis categories |
| `risk_profile` | string | `"default"` | Risk profile |
| `timeouts.overall_sec` | number | `180` | Total timeout |

**Output** (in `result`): Same as `release_check_runner.py`:
```json
{
  "pass": true,
  "gates": [{"name": "pr_review", "status": "pass"}, ...],
  "recommendations": [],
  "summary": "All 5 gates passed.",
  "elapsed_ms": 4200
}
```
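
As a rough illustration of how the defaults in the input table combine with a caller's `input`, the sketch below merges a partial input over the documented defaults. The helper name `with_release_check_defaults` is hypothetical, and `drift_categories` is left out because the full category list is not given here.

```python
from copy import deepcopy

# Defaults copied from the input table above.
# "drift_categories" is omitted: its default is "all", which is not enumerated here.
RELEASE_CHECK_DEFAULTS = {
    "service_name": "unknown",
    "diff_text": "",
    "fail_fast": True,
    "run_deps": True,
    "run_drift": True,
    "run_smoke": False,
    "deps_targets": ["python", "node"],
    "deps_vuln_mode": "offline_cache",
    "deps_fail_on": ["CRITICAL", "HIGH"],
    "risk_profile": "default",
    "timeouts": {"overall_sec": 180},
}


def with_release_check_defaults(user_input: dict) -> dict:
    """Return a new dict: documented defaults overridden by the caller's input."""
    merged = deepcopy(RELEASE_CHECK_DEFAULTS)
    for key, value in user_input.items():
        if key == "timeouts" and isinstance(value, dict):
            # Merge nested timeouts so a partial override keeps other keys.
            merged["timeouts"].update(value)
        else:
            merged[key] = value
    return merged
```

For example, `with_release_check_defaults({"service_name": "router", "run_smoke": True})` keeps every other field at its table default.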

---

### `incident_triage`

Collects observability data, logs, health, and runbooks to build a triage report.

**Nodes**: `validate_input` → `service_overview` → `top_errors_logs` → `health_and_runbooks` → `trace_lookup` → `build_triage_report` → END

**Input**:

| Field | Type | Default | Description |
|---|---|---|---|
| `service` | string | — | Service name (required) |
| `symptom` | string | — | Brief incident description (required) |
| `time_range.from` | ISO | -1h | Start of analysis window |
| `time_range.to` | ISO | now | End of analysis window |
| `env` | string | `"prod"` | Environment |
| `include_traces` | bool | `false` | Look up traces from log IDs |
| `max_log_lines` | int | `120` | Log lines to analyse (max 200) |
| `log_query_hint` | string | auto | Custom log query filter |

**Time window**: Clamped to 24h max (`INCIDENT_MAX_TIME_WINDOW_H`).
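
The input rules above (required fields, the `max_log_lines` cap and the 24h window clamp) could be sketched roughly as follows. This is not the actual `validate_input` node: the function name is hypothetical, and two behaviours are assumptions, namely that the `-1h` default is relative to the window end and that clamping keeps the most recent 24h.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_WINDOW_H = 24        # mirrors INCIDENT_MAX_TIME_WINDOW_H
MAX_LOG_LINES = 200      # cap from the input table
DEFAULT_LOG_LINES = 120  # default from the input table


def _parse_iso(value: str) -> datetime:
    # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11.
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def normalize_triage_input(raw: dict, now: Optional[datetime] = None) -> dict:
    """Validate required fields and clamp the window and log-line count."""
    now = now or datetime.now(timezone.utc)
    for field in ("service", "symptom"):
        if not raw.get(field):
            raise ValueError(f"missing required field: {field}")

    time_range = raw.get("time_range") or {}
    end = _parse_iso(time_range["to"]) if time_range.get("to") else now
    # Assumption: the "-1h" default is measured back from the window end.
    start = (_parse_iso(time_range["from"]) if time_range.get("from")
             else end - timedelta(hours=1))
    if start >= end:
        raise ValueError("time_range.from must be earlier than time_range.to")
    # Assumption: clamping keeps the most recent 24h by moving the start forward.
    start = max(start, end - timedelta(hours=MAX_WINDOW_H))

    return {
        "service": raw["service"],
        "symptom": raw["symptom"],
        "env": raw.get("env", "prod"),
        "include_traces": bool(raw.get("include_traces", False)),
        "max_log_lines": min(int(raw.get("max_log_lines", DEFAULT_LOG_LINES)),
                             MAX_LOG_LINES),
        "time_range": {"from": start.isoformat(), "to": end.isoformat()},
    }
```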

**Output** (in `result`):
```json
{
  "summary": "...",
  "suspected_root_causes": [{"rank": 1, "cause": "...", "evidence": [...]}],
  "impact_assessment": "SLO impact: error_rate=2.1%",
  "mitigations_now": ["Increase DB pool size", "..."],
  "next_checks": ["Verify healthz", "..."],
  "references": {
    "metrics": {"slo": {...}, "alerts_count": 1},
    "log_samples": ["..."],
    "runbook_snippets": [{"path": "...", "text": "..."}],
    "traces": {"traces": [...]}
  }
}
```

---

## Deployment on NODA2

### Quick start

```bash
# On NODA2 host
cd /path/to/microdao-daarion

# Start supervisor + redis (attaches to existing dagi-network-node2)
docker compose \
  -f docker-compose.node2.yml \
  -f docker-compose.node2-sofiia-supervisor.yml \
  up -d sofiia-supervisor sofiia-redis

# Verify
curl http://localhost:8084/healthz
```

### Environment variables

Copy `.env.example` and set:

```bash
cp services/sofiia-supervisor/.env.example .env
# Edit:
# GATEWAY_BASE_URL=http://router:8000          (must be accessible from container)
# SUPERVISOR_API_KEY=<key-for-router>          (matches SUPERVISOR_API_KEY in router)
# SUPERVISOR_INTERNAL_KEY=<key-to-protect-supervisor-api>
```
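
A minimal sketch of how a Python service might read these three variables is shown below. The `SupervisorSettings` class and `load_settings` function are hypothetical names, and treating an empty key as "not set" is an assumption; the actual service may load its configuration differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class SupervisorSettings:
    gateway_base_url: str
    supervisor_api_key: Optional[str]       # sent to the router
    supervisor_internal_key: Optional[str]  # protects the supervisor API when set

    @property
    def tools_execute_url(self) -> str:
        # The gateway endpoint path comes from the header of this document.
        return self.gateway_base_url.rstrip("/") + "/v1/tools/execute"


def load_settings(env: Mapping[str, str] = os.environ) -> SupervisorSettings:
    """Read settings from an environment mapping; empty values count as unset."""
    return SupervisorSettings(
        gateway_base_url=env.get("GATEWAY_BASE_URL") or "http://router:8000",
        supervisor_api_key=env.get("SUPERVISOR_API_KEY") or None,
        supervisor_internal_key=env.get("SUPERVISOR_INTERNAL_KEY") or None,
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests.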

---

## HTTP API

All endpoints require `Authorization: Bearer <SUPERVISOR_INTERNAL_KEY>` if `SUPERVISOR_INTERNAL_KEY` is set.

### Start a run

```bash
curl -X POST http://localhost:8084/v1/graphs/release_check/runs \
  -H "Content-Type: application/json" \
  -d '{
    "workspace_id": "daarion",
    "user_id": "sofiia",
    "agent_id": "sofiia",
    "input": {
      "service_name": "router",
      "run_deps": true,
      "run_drift": true
    }
  }'
```

Response:
```json
{"run_id": "gr_3a1b2c...", "status": "queued", "result": null}
```
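
The same call can be made from Python using only the standard library. The sketch below is illustrative: the function names are hypothetical, while the URL shape, body fields and optional `Authorization: Bearer` header follow the description above.

```python
import json
import urllib.request
from typing import Optional


def build_start_run_request(
    base_url: str,
    graph: str,
    workspace_id: str,
    user_id: str,
    agent_id: str,
    graph_input: dict,
    internal_key: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/graphs/{name}/runs request."""
    body = json.dumps({
        "workspace_id": workspace_id,
        "user_id": user_id,
        "agent_id": agent_id,
        "input": graph_input,
    }).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if internal_key:
        # Only needed when SUPERVISOR_INTERNAL_KEY is set on the supervisor.
        headers["Authorization"] = f"Bearer {internal_key}"
    url = f"{base_url.rstrip('/')}/v1/graphs/{graph}/runs"
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def start_run(request: urllib.request.Request, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response (run_id, status, result)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

Separating request construction from sending keeps the first function testable without a running supervisor.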

### Poll for result

```bash
curl http://localhost:8084/v1/runs/gr_3a1b2c...
```

Response (when complete):
```json
{
  "run_id": "gr_3a1b2c...",
  "graph": "release_check",
  "status": "succeeded",
  "started_at": "2026-02-23T10:00:00+00:00",
  "finished_at": "2026-02-23T10:00:45+00:00",
  "result": {"pass": true, "gates": [...], "summary": "..."},
  "events": [
    {"ts": "...", "type": "node_start", "node": "graph_start", "details": {...}},
    ...
  ]
}
```
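
A polling loop around `GET /v1/runs/{run_id}` might look like the sketch below. Only the `queued` and `succeeded` statuses appear in this document; `failed` and `cancelled` are assumed terminal states, and `wait_for_run` is a hypothetical helper that takes the fetch function as a parameter so it stays independent of any HTTP client.

```python
import time
from typing import Callable

# Only "succeeded" is confirmed by this document; the others are assumptions.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_run(
    fetch_run: Callable[[str], dict],
    run_id: str,
    interval_sec: float = 2.0,
    timeout_sec: float = 180.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll fetch_run(run_id) until the run reaches a terminal status."""
    deadline = clock() + timeout_sec
    while True:
        run = fetch_run(run_id)
        if run.get("status") in TERMINAL_STATUSES:
            return run
        if clock() >= deadline:
            raise TimeoutError(f"run {run_id} did not finish in {timeout_sec}s")
        sleep(interval_sec)
```

In practice `fetch_run` would issue the `GET /v1/runs/{run_id}` request and return the decoded JSON body.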

### Start incident triage

```bash
curl -X POST http://localhost:8084/v1/graphs/incident_triage/runs \
  -H "Content-Type: application/json" \
  -d '{
    "workspace_id": "daarion",
    "user_id": "helion",
    "agent_id": "sofiia",
    "input": {
      "service": "router",
      "symptom": "High error rate after deploy",
      "env": "prod",
      "include_traces": true,
      "time_range": {"from": "2026-02-23T09:00:00Z", "to": "2026-02-23T10:00:00Z"}
    }
  }'
```

### Cancel a run

```bash
curl -X POST http://localhost:8084/v1/runs/gr_3a1b2c.../cancel
```

---

## Connecting to Sofiia (Telegram / internal UI)

The supervisor exposes a REST API. To invoke from Sofiia's tool loop:

1. The gateway `job_orchestrator_tool` can be extended with a `start_supervisor_run` action that calls `POST http://sofiia-supervisor:8080/v1/graphs/{name}/runs`.
2. Alternatively, call the supervisor directly from the Telegram bot's backend (if on the same network).

Example flow for Telegram → Sofiia → Supervisor → Release Check:
```
User: "Run release check for router"
  → Sofiia LLM → job_orchestrator_tool(start_task, release_check)
  → Router: job_orchestrator_tool dispatches to release_check_runner
  → Returns report (existing flow, unchanged)
```

For **async long-running** workflows (>30s), use the supervisor directly:
```
User: "Triage production incident for router"
  → Sofiia LLM → [http call] POST /v1/graphs/incident_triage/runs
  → Returns run_id
  → Sofiia polls GET /v1/runs/{run_id} (or user asks again)
  → Returns structured triage report
```

---

## Security

- `SUPERVISOR_INTERNAL_KEY`: protects the supervisor HTTP API (network-level isolation is recommended instead)
- `SUPERVISOR_API_KEY`: sent to the router's `/v1/tools/execute` as `Authorization: Bearer`
- The router's `SUPERVISOR_API_KEY` guards the direct tool execution endpoint
- All RBAC, limits, and audit are enforced by the router's `ToolGovernance`; the supervisor cannot bypass them
- LangGraph nodes hold **no credentials or secrets**, only `workspace_id/user_id/agent_id`

---

## State TTL and cleanup

Runs are stored in Redis with TTL = `RUN_TTL_SEC` (default 24h). After the TTL expires, the run metadata is removed automatically.

To extend the TTL for important runs, call `backend.save_run(run)` with a new timestamp (planned: admin endpoint).
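
The effect of re-saving a run on its expiry can be illustrated with an in-memory stand-in. This is not the service's Redis backend: `InMemoryRunBackend` and `get_run` are hypothetical, and the sketch only assumes that each `save_run` call restarts the TTL, which is what the note above implies.

```python
import time
from typing import Callable, Dict, Optional, Tuple

RUN_TTL_SEC = 24 * 60 * 60  # documented default: 24h


class InMemoryRunBackend:
    """Hypothetical stand-in that mimics store-with-expiry semantics."""

    def __init__(self, ttl_sec: float = RUN_TTL_SEC,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_sec
        self._clock = clock
        self._runs: Dict[str, Tuple[float, dict]] = {}

    def save_run(self, run: dict) -> None:
        # Every save restarts the TTL, which is how a run's lifetime is extended.
        self._runs[run["run_id"]] = (self._clock() + self._ttl, dict(run))

    def get_run(self, run_id: str) -> Optional[dict]:
        entry = self._runs.get(run_id)
        if entry is None:
            return None
        expires_at, run = entry
        if self._clock() >= expires_at:
            # Expired entries are dropped on access.
            del self._runs[run_id]
            return None
        return run
```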