docs(platform): add policy configs, runbooks, ops scripts and platform documentation

Config policies (16 files): alert_routing, architecture_pressure, backlog, cost_weights, data_governance, incident_escalation, incident_intelligence, network_allowlist, nodes_registry, observability_sources, rbac_tools_matrix, release_gate, risk_attribution, risk_policy, slo_policy, tool_limits, tools_rollout

Ops (22 files): Caddyfile, calendar compose, grafana voice dashboard, deployments/incidents logs, runbooks for alerts/audit/backlog/incidents/sofiia/voice, cron jobs, scripts (alert_triage, audit_cleanup, migrate_*, governance, schedule), task_registry, voice alerts/ha/latency/policy

Docs (30+ files): HUMANIZED_STEPAN v2.7-v3 changelogs and runbooks, NODA1/NODA2 status and setup, audit index and traces, backlog, incident, supervisor, tools, voice, opencode, release, risk, aistalk, spacebot

Made-with: Cursor
2026-03-03 07:14:53 -08:00
parent 129e4ea1fc
commit 67225a39fa
102 changed files with 20060 additions and 0 deletions
--- a/docs/supervisor/langgraph_supervisor.md
+++ b/docs/supervisor/langgraph_supervisor.md
@@ -0,0 +1,264 @@
+# Sofiia Supervisor — LangGraph Orchestration Service
+
+**Location**: NODA2 | **Port**: 8084 (external) → 8080 (container)  
+**State backend**: Redis (`sofiia-redis:6379`)  
+**Gateway**: `http://router:8000/v1/tools/execute`
+
+---
+
+## Architecture
+
+```
+Caller (Telegram/UI/API)
+        │
+        ▼
+sofiia-supervisor:8084  ──── POST /v1/graphs/{name}/runs
+        │                     GET  /v1/runs/{run_id}
+        │                     POST /v1/runs/{run_id}/cancel
+        │
+        ▼ (LangGraph nodes)
+GatewayClient ──────────────→ router:8000/v1/tools/execute
+        │                         │
+        │                         ▼ (ToolGovernance)
+        │                     RBAC check → limits → redact → audit
+        │                         │
+        │                     ToolManager.execute_tool(...)
+        │
+        ▼
+sofiia-redis  ←── RunRecord + RunEvents (no payload)
+```
+
+**Key invariants:**
+- LangGraph nodes have **no direct access** to internal services
+- All tool calls go through `router → ToolGovernance → ToolManager`
+- `graph_run_id` is propagated in every gateway request metadata
+- Logs contain **hash + sizes only** (no payload content)
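+
+A minimal sketch of what "hash + sizes only" logging could look like. The helper name `summarize_payload` and the field names `sha256` and `size_bytes` are illustrative assumptions, not the supervisor's actual event schema.
+
+```python
+import hashlib
+import json
+
+
+def summarize_payload(payload: dict) -> dict:
+    """Return a log-safe summary: a content hash and a byte size, never the payload itself."""
+    # Canonical JSON (sorted keys, compact separators) so equal payloads hash identically.
+    raw = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
+    return {"sha256": hashlib.sha256(raw).hexdigest(), "size_bytes": len(raw)}
+
+
+event_details = summarize_payload({"service": "router", "symptom": "High error rate after deploy"})
+print(event_details)
+```
+
+The digest lets two events be correlated as carrying the same payload without the payload text ever reaching the log.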
+
+---
+
+## Graphs
+
+### `release_check`
+
+Runs the DAARION release_check pipeline via `job_orchestrator_tool`.
+
+**Nodes**: `start_job` → `poll_job` (loop) → `finalize` → END
+
+**Input** (`input` field of StartRunRequest):
+
+| Field | Type | Default | Description |
+|---|---|---|---|
+| `service_name` | string | `"unknown"` | Service being released |
+| `diff_text` | string | `""` | Git diff text |
+| `fail_fast` | bool | `true` | Stop on first gate failure |
+| `run_deps` | bool | `true` | Run dependency scan gate |
+| `run_drift` | bool | `true` | Run drift analysis gate |
+| `run_smoke` | bool | `false` | Run smoke tests |
+| `deps_targets` | array | `["python","node"]` | Ecosystems for dep scan |
+| `deps_vuln_mode` | string | `"offline_cache"` | OSV mode |
+| `deps_fail_on` | array | `["CRITICAL","HIGH"]` | Blocking severity |
+| `drift_categories` | array | all | Drift analysis categories |
+| `risk_profile` | string | `"default"` | Risk profile |
+| `timeouts.overall_sec` | number | `180` | Total timeout |
+
+**Output** (in `result`): same shape as the output of `release_check_runner.py`:
+```json
+{
+  "pass": true,
+  "gates": [{"name": "pr_review", "status": "pass"}, ...],
+  "recommendations": [],
+  "summary": "All 5 gates passed.",
+  "elapsed_ms": 4200
+}
+```
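+
+A caller will usually want the names of the gates that did not pass. The sketch below assumes any `status` other than `"pass"` counts as a failure; the helper `failed_gates`, the gate name `dependency_scan`, and the status value `"fail"` are hypothetical.
+
+```python
+def failed_gates(result: dict) -> list[str]:
+    """Names of gates whose status is anything other than "pass"."""
+    return [g["name"] for g in result.get("gates", []) if g.get("status") != "pass"]
+
+
+result = {
+    "pass": False,
+    "gates": [
+        {"name": "pr_review", "status": "pass"},
+        {"name": "dependency_scan", "status": "fail"},  # hypothetical gate name and status
+    ],
+}
+print(failed_gates(result))
+```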
+
+---
+
+### `incident_triage`
+
+Collects observability data, logs, health, and runbooks to build a triage report.
+
+**Nodes**: `validate_input` → `service_overview` → `top_errors_logs` → `health_and_runbooks` → `trace_lookup` → `build_triage_report` → END
+
+**Input**:
+
+| Field | Type | Default | Description |
+|---|---|---|---|
+| `service` | string | — | Service name (required) |
+| `symptom` | string | — | Brief incident description (required) |
+| `time_range.from` | ISO 8601 string | now - 1h | Start of analysis window |
+| `time_range.to` | ISO 8601 string | now | End of analysis window |
+| `env` | string | `"prod"` | Environment |
+| `include_traces` | bool | `false` | Look up traces from log IDs |
+| `max_log_lines` | int | `120` | Log lines to analyse (max 200) |
+| `log_query_hint` | string | auto | Custom log query filter |
+
+**Time window**: Clamped to 24h max (`INCIDENT_MAX_TIME_WINDOW_H`).
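+
+A sketch of the clamping rule, assuming the window is shortened by moving its start forward while the end stays fixed. Which end the service actually trims is not stated here, so treat that choice and the function name `clamp_time_range` as assumptions.
+
+```python
+from datetime import datetime, timedelta, timezone
+
+INCIDENT_MAX_TIME_WINDOW_H = 24  # limit named in this document
+
+
+def clamp_time_range(start: datetime, end: datetime) -> tuple[datetime, datetime]:
+    """Shrink [start, end] to at most INCIDENT_MAX_TIME_WINDOW_H hours, keeping `end` fixed."""
+    if start > end:
+        raise ValueError("time_range.from must not be after time_range.to")
+    max_window = timedelta(hours=INCIDENT_MAX_TIME_WINDOW_H)
+    if end - start > max_window:
+        start = end - max_window  # assumption: drop the oldest part of the window
+    return start, end
+
+
+end = datetime(2026, 2, 23, 10, 0, tzinfo=timezone.utc)
+start = end - timedelta(hours=72)
+print(clamp_time_range(start, end))
+```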
+
+**Output** (in `result`):
+```json
+{
+  "summary": "...",
+  "suspected_root_causes": [{"rank": 1, "cause": "...", "evidence": [...]}],
+  "impact_assessment": "SLO impact: error_rate=2.1%",
+  "mitigations_now": ["Increase DB pool size", "..."],
+  "next_checks": ["Verify healthz", "..."],
+  "references": {
+    "metrics": {"slo": {...}, "alerts_count": 1},
+    "log_samples": ["..."],
+    "runbook_snippets": [{"path": "...", "text": "..."}],
+    "traces": {"traces": [...]}
+  }
+}
+```
+
+---
+
+## Deployment on NODA2
+
+### Quick start
+
+```bash
+# On NODA2 host
+cd /path/to/microdao-daarion
+
+# Start supervisor + redis (attaches to existing dagi-network-node2)
+docker compose \
+  -f docker-compose.node2.yml \
+  -f docker-compose.node2-sofiia-supervisor.yml \
+  up -d sofiia-supervisor sofiia-redis
+
+# Verify
+curl http://localhost:8084/healthz
+```
+
+### Environment variables
+
+Copy `.env.example` and set:
+
+```bash
+cp services/sofiia-supervisor/.env.example .env
+# Edit:
+#   GATEWAY_BASE_URL=http://router:8000   (must be accessible from container)
+#   SUPERVISOR_API_KEY=<key-for-router>   (matches SUPERVISOR_API_KEY in router)
+#   SUPERVISOR_INTERNAL_KEY=<key-to-protect-supervisor-api>
+```
+
+---
+
+## HTTP API
+
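+All endpoints require `Authorization: Bearer <SUPERVISOR_INTERNAL_KEY>` if `SUPERVISOR_INTERNAL_KEY` is set. The `curl` examples below omit this header; add `-H "Authorization: Bearer $SUPERVISOR_INTERNAL_KEY"` when the key is set.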
+
+### Start a run
+
+```bash
+curl -X POST http://localhost:8084/v1/graphs/release_check/runs \
+  -H "Content-Type: application/json" \
+  -d '{
+    "workspace_id": "daarion",
+    "user_id": "sofiia",
+    "agent_id": "sofiia",
+    "input": {
+      "service_name": "router",
+      "run_deps": true,
+      "run_drift": true
+    }
+  }'
+```
+
+Response:
+```json
+{"run_id": "gr_3a1b2c...", "status": "queued", "result": null}
+```
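+
+The same request can be built from Python with only the standard library. This is a sketch: `build_start_run_request` is a hypothetical helper, `example-key` is a placeholder, and the `Authorization` header is attached only when a key is passed, mirroring the rule stated at the top of this section.
+
+```python
+import json
+import urllib.request
+from typing import Optional
+
+
+def build_start_run_request(base_url: str, graph: str, body: dict,
+                            internal_key: Optional[str] = None) -> urllib.request.Request:
+    """Build (but do not send) POST /v1/graphs/{name}/runs."""
+    headers = {"Content-Type": "application/json"}
+    if internal_key:  # only needed when SUPERVISOR_INTERNAL_KEY is set on the server
+        headers["Authorization"] = f"Bearer {internal_key}"
+    return urllib.request.Request(
+        f"{base_url}/v1/graphs/{graph}/runs",
+        data=json.dumps(body).encode("utf-8"),
+        headers=headers,
+        method="POST",
+    )
+
+
+req = build_start_run_request(
+    "http://localhost:8084",
+    "release_check",
+    {"workspace_id": "daarion", "user_id": "sofiia", "agent_id": "sofiia",
+     "input": {"service_name": "router", "run_deps": True, "run_drift": True}},
+    internal_key="example-key",  # placeholder value
+)
+print(req.get_method(), req.full_url)
+# Send with urllib.request.urlopen(req) once the supervisor is reachable.
+```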
+
+### Poll for result
+
+```bash
+curl http://localhost:8084/v1/runs/gr_3a1b2c...
+```
+
+Response (when complete):
+```json
+{
+  "run_id": "gr_3a1b2c...",
+  "graph": "release_check",
+  "status": "succeeded",
+  "started_at": "2026-02-23T10:00:00+00:00",
+  "finished_at": "2026-02-23T10:00:45+00:00",
+  "result": {"pass": true, "gates": [...], "summary": "..."},
+  "events": [
+    {"ts": "...", "type": "node_start", "node": "graph_start", "details": {...}},
+    ...
+  ]
+}
+```
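+
+A polling loop sketch for the endpoint above. Only the `queued` and `succeeded` status values appear in this document; the other values in `TERMINAL_STATUSES` are assumptions and should be checked against the service. `fetch_run` is injectable so the loop can be exercised without a live supervisor.
+
+```python
+import json
+import time
+import urllib.request
+from typing import Callable
+
+# "succeeded" is documented above; "failed" and "cancelled" are assumed names.
+TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}
+
+
+def http_fetch_run(run_id: str, base_url: str = "http://localhost:8084") -> dict:
+    """GET /v1/runs/{run_id} and decode the JSON body."""
+    with urllib.request.urlopen(f"{base_url}/v1/runs/{run_id}", timeout=10) as resp:
+        return json.load(resp)
+
+
+def wait_for_run(run_id: str, fetch_run: Callable[[str], dict] = http_fetch_run,
+                 interval_sec: float = 2.0, timeout_sec: float = 180.0) -> dict:
+    """Poll until the run reaches a terminal status or the timeout elapses."""
+    deadline = time.monotonic() + timeout_sec
+    while True:
+        run = fetch_run(run_id)
+        if run.get("status") in TERMINAL_STATUSES:
+            return run
+        if time.monotonic() >= deadline:
+            raise TimeoutError(f"run {run_id} still {run.get('status')!r} after {timeout_sec}s")
+        time.sleep(interval_sec)
+```
+
+The default `timeout_sec` of 180 simply mirrors the `timeouts.overall_sec` default of `release_check`; longer graphs would need a larger value.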
+
+### Start incident triage
+
+```bash
+curl -X POST http://localhost:8084/v1/graphs/incident_triage/runs \
+  -H "Content-Type: application/json" \
+  -d '{
+    "workspace_id": "daarion",
+    "user_id": "helion",
+    "agent_id": "sofiia",
+    "input": {
+      "service": "router",
+      "symptom": "High error rate after deploy",
+      "env": "prod",
+      "include_traces": true,
+      "time_range": {"from": "2026-02-23T09:00:00Z", "to": "2026-02-23T10:00:00Z"}
+    }
+  }'
+```
+
+### Cancel a run
+
+```bash
+curl -X POST http://localhost:8084/v1/runs/gr_3a1b2c.../cancel
+```
+
+---
+
+## Connecting to Sofiia (Telegram / internal UI)
+
+The supervisor exposes a REST API. To invoke from Sofiia's tool loop:
+
+1. The gateway `job_orchestrator_tool` can be extended with a `start_supervisor_run` action that calls `POST http://sofiia-supervisor:8080/v1/graphs/{name}/runs`.
+2. Alternatively, call the supervisor directly from the Telegram bot's backend (if on the same network).
+
+Example flow for Telegram → Sofiia → Supervisor → Release Check:
+```
+User: "Run release check for router"
+  → Sofiia LLM → job_orchestrator_tool(start_task, release_check)
+  → Router: job_orchestrator_tool dispatches to release_check_runner
+  → Returns report (existing flow, unchanged)
+```
+
+For **async long-running** workflows (>30s), use the supervisor directly:
+```
+User: "Triage production incident for router"
+  → Sofiia LLM → [http call] POST /v1/graphs/incident_triage/runs
+  → Returns run_id
+  → Sofiia polls GET /v1/runs/{run_id} (or user asks again)
+  → Returns structured triage report
+```
+
+---
+
+## Security
+
+- `SUPERVISOR_INTERNAL_KEY`: protects the supervisor HTTP API (network-level isolation is recommended instead)
+- `SUPERVISOR_API_KEY` (supervisor side): sent to the router's `/v1/tools/execute` as `Authorization: Bearer <key>`
+- `SUPERVISOR_API_KEY` (router side): guards the direct tool execution endpoint; it must match the supervisor's value
+- All RBAC, limits, and audit are enforced by the router's `ToolGovernance`; the supervisor cannot bypass them
+- LangGraph nodes hold **no credentials or secrets**, only `workspace_id`/`user_id`/`agent_id`
+
+---
+
+## State TTL and cleanup
+
+Runs are stored in Redis with a TTL of `RUN_TTL_SEC` (default 24h). After the TTL expires, the run metadata is removed automatically.
+
+To extend the TTL for important runs, call `backend.save_run(run)` with a new timestamp. An admin endpoint for this is planned.