
Sofiia Supervisor — LangGraph Orchestration Service

Location: NODA2 | Port: 8084 (external) → 8080 (container)
State backend: Redis (sofiia-redis:6379)
Gateway: http://router:8000/v1/tools/execute


Architecture

Caller (Telegram/UI/API)
        │
        ▼
sofiia-supervisor:8084  ──── POST /v1/graphs/{name}/runs
        │                     GET  /v1/runs/{run_id}
        │                     POST /v1/runs/{run_id}/cancel
        │
        ▼ (LangGraph nodes)
GatewayClient ──────────────→ router:8000/v1/tools/execute
        │                         │
        │                         ▼ (ToolGovernance)
        │                     RBAC check → limits → redact → audit
        │                         │
        │                     ToolManager.execute_tool(...)
        │
        ▼
sofiia-redis  ←── RunRecord + RunEvents (no payload)

Key invariants:

  • LangGraph nodes have no direct access to internal services
  • All tool calls go through router → ToolGovernance → ToolManager
  • graph_run_id is propagated in every gateway request metadata
  • Logs contain hash + sizes only (no payload content)
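The last two invariants can be sketched in a few lines. This is an illustrative shape only (the function and field names here are not the actual GatewayClient code): every request body carries run metadata, and log lines carry a digest and size instead of content.

```python
import hashlib
import json

def build_gateway_request(tool: str, arguments: dict, *, graph_run_id: str,
                          workspace_id: str, user_id: str, agent_id: str) -> dict:
    """Every tool call carries run metadata so the router can attribute and audit it."""
    return {
        "tool": tool,
        "arguments": arguments,
        "metadata": {
            "graph_run_id": graph_run_id,   # propagated on every gateway request
            "workspace_id": workspace_id,
            "user_id": user_id,
            "agent_id": agent_id,
        },
    }

def log_line(payload: dict) -> str:
    """Log only a hash and byte size of the payload, never its content."""
    raw = json.dumps(payload, sort_keys=True).encode()
    return f"gateway_call sha256={hashlib.sha256(raw).hexdigest()[:12]} size={len(raw)}"
```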

Graphs

release_check

Runs the DAARION release_check pipeline via job_orchestrator_tool.

Nodes: start_job → poll_job (loop) → finalize → END

Input (input field of StartRunRequest):

| Field | Type | Default | Description |
|---|---|---|---|
| service_name | string | "unknown" | Service being released |
| diff_text | string | "" | Git diff text |
| fail_fast | bool | true | Stop on first gate failure |
| run_deps | bool | true | Run dependency scan gate |
| run_drift | bool | true | Run drift analysis gate |
| run_smoke | bool | false | Run smoke tests |
| deps_targets | array | ["python","node"] | Ecosystems for dep scan |
| deps_vuln_mode | string | "offline_cache" | OSV mode |
| deps_fail_on | array | ["CRITICAL","HIGH"] | Blocking severity |
| drift_categories | array | all | Drift analysis categories |
| risk_profile | string | "default" | Risk profile |
| timeouts.overall_sec | number | 180 | Total timeout |
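As an illustration of the defaults above, a caller could pre-fill its request like this (the service applies the same defaults server-side, so this is purely a client-side convenience sketch):

```python
# Documented defaults for release_check input (drift_categories omitted:
# its default "all" is expanded by the service, not a literal value).
RELEASE_CHECK_DEFAULTS = {
    "service_name": "unknown",
    "diff_text": "",
    "fail_fast": True,
    "run_deps": True,
    "run_drift": True,
    "run_smoke": False,
    "deps_targets": ["python", "node"],
    "deps_vuln_mode": "offline_cache",
    "deps_fail_on": ["CRITICAL", "HIGH"],
    "risk_profile": "default",
    "timeouts": {"overall_sec": 180},
}

def with_defaults(user_input: dict) -> dict:
    """Fill missing release_check fields with the documented defaults."""
    return {**RELEASE_CHECK_DEFAULTS, **user_input}
```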

Output (in result): same shape as the report produced by release_check_runner.py:

{
  "pass": true,
  "gates": [{"name": "pr_review", "status": "pass"}, ...],
  "recommendations": [],
  "summary": "All 5 gates passed.",
  "elapsed_ms": 4200
}
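Given a result in this shape, a caller can summarise failures with a one-liner (a hypothetical helper, shown only to illustrate the gates structure):

```python
def failed_gates(result: dict) -> list[str]:
    """Return the names of gates that did not pass in a release_check result."""
    return [g["name"] for g in result.get("gates", []) if g.get("status") != "pass"]
```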

incident_triage

Collects observability data, logs, health, and runbooks to build a triage report.

Nodes: validate_input → service_overview → top_errors_logs → health_and_runbooks → trace_lookup → build_triage_report → END

Input:

| Field | Type | Default | Description |
|---|---|---|---|
| service | string | — | Service name (required) |
| symptom | string | — | Brief incident description (required) |
| time_range.from | ISO | -1h | Start of analysis window |
| time_range.to | ISO | now | End of analysis window |
| env | string | "prod" | Environment |
| include_traces | bool | false | Look up traces from log IDs |
| max_log_lines | int | 120 | Log lines to analyse (max 200) |
| log_query_hint | string | auto | Custom log query filter |

Time window: Clamped to 24h max (INCIDENT_MAX_TIME_WINDOW_H).
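The clamping could look roughly like this; the assumption here (not confirmed by the source) is that the service keeps the `to` edge and moves `from` forward when the window is too wide:

```python
from datetime import datetime, timedelta

INCIDENT_MAX_TIME_WINDOW_H = 24  # default from the service config

def clamp_time_range(frm: str, to: str) -> tuple[str, str]:
    """Shrink an ISO-8601 window to at most INCIDENT_MAX_TIME_WINDOW_H hours."""
    start = datetime.fromisoformat(frm.replace("Z", "+00:00"))
    end = datetime.fromisoformat(to.replace("Z", "+00:00"))
    max_span = timedelta(hours=INCIDENT_MAX_TIME_WINDOW_H)
    if end - start > max_span:
        start = end - max_span  # keep the `to` edge, move `from` forward
    return start.isoformat(), end.isoformat()
```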

Output (in result):

{
  "summary": "...",
  "suspected_root_causes": [{"rank": 1, "cause": "...", "evidence": [...]}],
  "impact_assessment": "SLO impact: error_rate=2.1%",
  "mitigations_now": ["Increase DB pool size", "..."],
  "next_checks": ["Verify healthz", "..."],
  "references": {
    "metrics": {"slo": {...}, "alerts_count": 1},
    "log_samples": ["..."],
    "runbook_snippets": [{"path": "...", "text": "..."}],
    "traces": {"traces": [...]}
  }
}

Deployment on NODA2

Quick start

# On NODA2 host
cd /path/to/microdao-daarion

# Start supervisor + redis (attaches to existing dagi-network-node2)
docker compose \
  -f docker-compose.node2.yml \
  -f docker-compose.node2-sofiia-supervisor.yml \
  up -d sofiia-supervisor sofiia-redis

# Verify
curl http://localhost:8084/healthz

Environment variables

Copy .env.example and set:

cp services/sofiia-supervisor/.env.example .env
# Edit:
#   GATEWAY_BASE_URL=http://router:8000   (must be accessible from container)
#   SUPERVISOR_API_KEY=<key-for-router>   (matches SUPERVISOR_API_KEY in router)
#   SUPERVISOR_INTERNAL_KEY=<key-to-protect-supervisor-api>

HTTP API

When SUPERVISOR_INTERNAL_KEY is set, all endpoints require an Authorization: Bearer <SUPERVISOR_INTERNAL_KEY> header.
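A client can build its headers accordingly; a minimal sketch, assuming the key is exposed to the caller via the environment variable of the same name:

```python
import os

def supervisor_headers() -> dict:
    """Build request headers; attach the bearer token only when the key is set."""
    headers = {"Content-Type": "application/json"}
    key = os.environ.get("SUPERVISOR_INTERNAL_KEY")
    if key:
        headers["Authorization"] = f"Bearer {key}"
    return headers
```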

Start a run

curl -X POST http://localhost:8084/v1/graphs/release_check/runs \
  -H "Content-Type: application/json" \
  -d '{
    "workspace_id": "daarion",
    "user_id": "sofiia",
    "agent_id": "sofiia",
    "input": {
      "service_name": "router",
      "run_deps": true,
      "run_drift": true
    }
  }'

Response:

{"run_id": "gr_3a1b2c...", "status": "queued", "result": null}

Poll for result

curl http://localhost:8084/v1/runs/gr_3a1b2c...

Response (when complete):

{
  "run_id": "gr_3a1b2c...",
  "graph": "release_check",
  "status": "succeeded",
  "started_at": "2026-02-23T10:00:00+00:00",
  "finished_at": "2026-02-23T10:00:45+00:00",
  "result": {"pass": true, "gates": [...], "summary": "..."},
  "events": [
    {"ts": "...", "type": "node_start", "node": "graph_start", "details": {...}},
    ...
  ]
}
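The poll loop can be wrapped in a small helper. This sketch takes an injected `fetch` callable so it stays transport-agnostic; the terminal status set is an assumption based on the statuses shown in this document (`queued`, `succeeded`) plus the cancel endpoint:

```python
import time

TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}  # assumed terminal set

def wait_for_run(fetch, run_id: str, timeout_s: float = 300,
                 interval_s: float = 2.0) -> dict:
    """Poll GET /v1/runs/{run_id} (via `fetch`) until the run reaches a
    terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        run = fetch(run_id)  # e.g. requests.get(f"{base}/v1/runs/{run_id}").json()
        if run.get("status") in TERMINAL_STATUSES:
            return run
        time.sleep(interval_s)
    raise TimeoutError(f"run {run_id} did not finish within {timeout_s}s")
```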

Start incident triage

curl -X POST http://localhost:8084/v1/graphs/incident_triage/runs \
  -H "Content-Type: application/json" \
  -d '{
    "workspace_id": "daarion",
    "user_id": "helion",
    "agent_id": "sofiia",
    "input": {
      "service": "router",
      "symptom": "High error rate after deploy",
      "env": "prod",
      "include_traces": true,
      "time_range": {"from": "2026-02-23T09:00:00Z", "to": "2026-02-23T10:00:00Z"}
    }
  }'

Cancel a run

curl -X POST http://localhost:8084/v1/runs/gr_3a1b2c.../cancel

Connecting to Sofiia (Telegram / internal UI)

The supervisor exposes a REST API. To invoke from Sofiia's tool loop:

  1. The gateway job_orchestrator_tool can be extended with a start_supervisor_run action that calls POST http://sofiia-supervisor:8080/v1/graphs/{name}/runs.
  2. Alternatively, call the supervisor directly from the Telegram bot's backend (if on the same network).
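For option 1, the extension action could be shaped like this. Everything here is hypothetical (the action does not exist yet per the text above); `post` is an injected HTTP helper, e.g. a thin wrapper around requests.post that returns the decoded JSON body:

```python
def start_supervisor_run(post, graph: str, graph_input: dict, *,
                         workspace_id: str, user_id: str, agent_id: str) -> str:
    """Hypothetical gateway action: start a supervisor run, return its run_id."""
    url = f"http://sofiia-supervisor:8080/v1/graphs/{graph}/runs"
    body = {
        "workspace_id": workspace_id,
        "user_id": user_id,
        "agent_id": agent_id,
        "input": graph_input,
    }
    resp = post(url, json=body)  # expected response: {"run_id": ..., "status": "queued", ...}
    return resp["run_id"]
```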

Example flow for Telegram → Sofiia → Supervisor → Release Check:

User: "Run release check for router"
  → Sofiia LLM → job_orchestrator_tool(start_task, release_check)
  → Router: job_orchestrator_tool dispatches to release_check_runner
  → Returns report (existing flow, unchanged)

For async long-running workflows (>30s), use the supervisor directly:

User: "Triage production incident for router"
  → Sofiia LLM → [http call] POST /v1/graphs/incident_triage/runs
  → Returns run_id
  → Sofiia polls GET /v1/runs/{run_id} (or user asks again)
  → Returns structured triage report

Security

  • SUPERVISOR_INTERNAL_KEY: Protects supervisor HTTP API (recommend: network-level isolation instead)
  • SUPERVISOR_API_KEY → sent to router's /v1/tools/execute as Authorization: Bearer
  • Router's SUPERVISOR_API_KEY guards direct tool execution endpoint
  • All RBAC/limits/audit enforced by router's ToolGovernance — supervisor cannot bypass them
  • LangGraph nodes have no credentials or secrets — only workspace_id/user_id/agent_id

State TTL and cleanup

Runs are stored in Redis with TTL = RUN_TTL_SEC (default 24h). After TTL expires, the run metadata is automatically removed.

To extend the TTL of an important run, call backend.save_run(run) again with a refreshed timestamp, which resets the expiry (an admin endpoint for this is planned).
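Alternatively, the expiry could be reset directly in Redis. A minimal sketch, assuming a `run:{run_id}` key layout (check the backend for the real key names before relying on this):

```python
def extend_run_ttl(redis_client, run_id: str, ttl_sec: int = 24 * 3600) -> bool:
    """Reset the TTL on a stored run record via Redis EXPIRE.
    The key name `run:{run_id}` is an assumption for illustration."""
    return bool(redis_client.expire(f"run:{run_id}", ttl_sec))
```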