Apple 67225a39fa docs(platform): add policy configs, runbooks, ops scripts and platform documentation

Config policies (16 files): alert_routing, architecture_pressure, backlog,
cost_weights, data_governance, incident_escalation, incident_intelligence,
network_allowlist, nodes_registry, observability_sources, rbac_tools_matrix,
release_gate, risk_attribution, risk_policy, slo_policy, tool_limits, tools_rollout

Ops (22 files): Caddyfile, calendar compose, grafana voice dashboard,
deployments/incidents logs, runbooks for alerts/audit/backlog/incidents/sofiia/voice,
cron jobs, scripts (alert_triage, audit_cleanup, migrate_*, governance, schedule),
task_registry, voice alerts/ha/latency/policy

Docs (30+ files): HUMANIZED_STEPAN v2.7-v3 changelogs and runbooks,
NODA1/NODA2 status and setup, audit index and traces, backlog, incident,
supervisor, tools, voice, opencode, release, risk, aistalk, spacebot

Made-with: Cursor

2026-03-03 07:14:53 -08:00

2.8 KiB

Postmortem Draft Graph

Overview

The postmortem_draft_graph is a LangGraph workflow that runs on the Sofiia Supervisor (NODA2) and generates structured postmortem drafts from incident data.

Flow

validate → load_incident → ensure_triage → draft_postmortem
  → attach_artifacts → append_followups → build_result → END

validate — checks incident_id is provided.
load_incident — calls oncall_tool.incident_get via gateway.
ensure_triage — if no triage_report artifact exists, generates one by calling observability/health/KB tools.
draft_postmortem — builds a deterministic markdown + JSON postmortem using a structured template.
attach_artifacts — uploads postmortem_draft.md, postmortem_draft.json (and optionally triage_report.json) via oncall_tool.incident_attach_artifact.
append_followups — creates followup timeline events from the postmortem.
build_result — returns the final output.
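
The node order above can be sketched as a linear pipeline. This is a plain-Python stand-in rather than the actual LangGraph definition: every node except validate is a hypothetical stub, and the "succeeded" status value is an assumption (the source only names "failed").

```python
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]
Node = Callable[[State], State]


def validate(state: State) -> State:
    # Fail fast when incident_id is missing or empty.
    if not state.get("incident_id"):
        return {**state, "graph_status": "failed", "error": "incident_id is required"}
    return state


def make_stub(name: str) -> Node:
    # Hypothetical placeholder: records that the node ran instead of calling real tools.
    def node(state: State) -> State:
        return {**state, "visited": [*state.get("visited", []), name]}
    return node


# Same order as the flow diagram; only validate has real logic here.
PIPELINE: List[Tuple[str, Node]] = [
    ("validate", validate),
    *[(n, make_stub(n)) for n in (
        "load_incident", "ensure_triage", "draft_postmortem",
        "attach_artifacts", "append_followups", "build_result",
    )],
]


def run_pipeline(state: State) -> State:
    for _name, node in PIPELINE:
        state = node(state)
        if state.get("graph_status") == "failed":
            return state  # stop at the first fatal failure
    # "succeeded" is an assumed status value, not taken from the source.
    return {**state, "graph_status": "succeeded"}
```

Calling run_pipeline({}) stops at validate, so none of the later stubs run.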

API

Start run

curl -X POST http://supervisor:8000/v1/graphs/postmortem_draft/runs \
  -H "Content-Type: application/json" \
  -d '{
    "workspace_id": "default",
    "user_id": "admin",
    "agent_id": "sofiia",
    "input": {
      "incident_id": "inc_20260223_1000_abc123",
      "service": "router",
      "env": "prod",
      "include_traces": false
    }
  }'
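
The same request can be built from Python with only the standard library. This is a minimal sketch: the helper name build_run_request is hypothetical, the URL and field values are copied from the curl example, and the shape of the response is not documented here, so the sketch stops at constructing the request.

```python
import json
import urllib.request
from typing import Optional

SUPERVISOR_URL = "http://supervisor:8000"  # host and port taken from the curl example


def build_run_request(
    incident_id: str,
    service: Optional[str] = None,
    env: str = "prod",
    include_traces: bool = False,
    workspace_id: str = "default",
    user_id: str = "admin",
    agent_id: str = "sofiia",
) -> urllib.request.Request:
    # Only send optional fields the caller actually set.
    graph_input = {
        "incident_id": incident_id,
        "env": env,
        "include_traces": include_traces,
    }
    if service is not None:
        graph_input["service"] = service
    body = {
        "workspace_id": workspace_id,
        "user_id": user_id,
        "agent_id": agent_id,
        "input": graph_input,
    }
    return urllib.request.Request(
        f"{SUPERVISOR_URL}/v1/graphs/postmortem_draft/runs",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen sends the request.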

Input

Field	Type	Required	Description
incident_id	string	Yes	Existing incident ID
service	string	No	Override service (defaults to incident's service)
env	string	No	Environment (default: prod)
time_range	object	No	`{"from": "ISO", "to": "ISO"}` with ISO 8601 timestamps (defaults to incident timestamps)
include_traces	bool	No	Include trace lookup in triage (default: false)
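
The defaults in the table can be applied in one place before the graph runs. The sketch below assumes a hypothetical helper normalize_input; service and time_range stay None when omitted because their defaults come from the incident, which is not loaded yet at this point.

```python
from typing import Any, Dict


def normalize_input(raw: Dict[str, Any]) -> Dict[str, Any]:
    incident_id = raw.get("incident_id")
    if not isinstance(incident_id, str) or not incident_id.strip():
        raise ValueError("incident_id is required")

    time_range = raw.get("time_range")
    if time_range is not None and not {"from", "to"} <= set(time_range):
        raise ValueError("time_range needs both 'from' and 'to'")

    return {
        "incident_id": incident_id,
        "service": raw.get("service"),    # None: fall back to the incident's service later
        "env": raw.get("env", "prod"),    # documented default
        "time_range": time_range,         # None: fall back to incident timestamps later
        "include_traces": bool(raw.get("include_traces", False)),
    }
```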

Output

{
  "incident_id": "inc_...",
  "artifacts_count": 3,
  "artifacts": [...],
  "followups_count": 4,
  "triage_was_generated": true,
  "markdown_preview": "# Postmortem: Router OOM\n..."
}
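
A caller might condense this output into a single log line. The helper below is a hypothetical sketch that reads only the fields shown above; it treats triage_was_generated being false as meaning an existing triage report was reused.

```python
from typing import Any, Dict


def summarize_result(result: Dict[str, Any]) -> str:
    preview = result.get("markdown_preview", "")
    # First line of the preview is the markdown title, e.g. "# Postmortem: Router OOM".
    title = preview.splitlines()[0].lstrip("# ").strip() if preview else "(no preview)"
    triage = "generated" if result.get("triage_was_generated") else "reused"
    return (
        f"{result['incident_id']}: {title} | "
        f"{result.get('artifacts_count', 0)} artifacts, "
        f"{result.get('followups_count', 0)} follow-ups, triage {triage}"
    )
```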

Postmortem Template

The generated markdown includes:

Summary — from triage report
Impact — SLO/health assessment
Detection — when/how the incident was reported
Timeline — from incident events
Root Cause Analysis — from triage suspected causes
Mitigations Applied — from triage/runbooks
Follow-ups — action items extracted from triage
Prevention — standard recommendations
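
A deterministic renderer for these sections could look like the sketch below. The section names and the "# Postmortem: ..." title follow this document; the "##" heading level, the placeholder text for empty sections, and the function name are assumptions.

```python
from typing import Dict, List

# Fixed order keeps the output deterministic for identical input.
SECTION_ORDER: List[str] = [
    "Summary",
    "Impact",
    "Detection",
    "Timeline",
    "Root Cause Analysis",
    "Mitigations Applied",
    "Follow-ups",
    "Prevention",
]


def render_postmortem(title: str, sections: Dict[str, str]) -> str:
    lines = [f"# Postmortem: {title}", ""]
    for name in SECTION_ORDER:
        # Assumed placeholder when a section has no content.
        body = sections.get(name, "").strip() or "_No data available._"
        lines += [f"## {name}", "", body, ""]
    return "\n".join(lines)
```

Because the section order is fixed and nothing depends on time or randomness, the same input always yields the same markdown.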

Error Handling

Incident not found → graph_status: "failed"
Gateway errors during triage generation → non-fatal (uses partial data)
Follow-up append errors → non-fatal (graph still succeeds)
All tool calls go through the gateway, so RBAC and audit are enforced.
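
The fatal versus non-fatal split above can be sketched as follows. The exception classes, function names, and the "succeeded" status value are hypothetical; only the policy itself (a missing incident fails the run, while triage and follow-up errors are tolerated) comes from this document.

```python
from typing import Any, Callable, Dict, List


class IncidentNotFound(Exception):
    """Hypothetical error raised when the incident lookup finds nothing."""


class GatewayError(Exception):
    """Hypothetical error for a failed gateway tool call."""


def non_fatal(step: str, fn: Callable[[], Any], warnings: List[str], default: Any) -> Any:
    # Run one step; on a gateway error, record a warning and continue with a default.
    try:
        return fn()
    except GatewayError as exc:
        warnings.append(f"{step}: {exc}")
        return default


def run_with_error_policy(
    load_incident: Callable[[], Any],
    collect_triage: Callable[[], Any],
    append_followups: Callable[[], Any],
) -> Dict[str, Any]:
    warnings: List[str] = []
    try:
        incident = load_incident()
    except IncidentNotFound as exc:
        # Fatal: without an incident there is nothing to draft.
        return {"graph_status": "failed", "error": f"incident not found: {exc}"}
    triage = non_fatal("ensure_triage", collect_triage, warnings, default={})
    followups = non_fatal("append_followups", append_followups, warnings, default=0)
    return {
        "graph_status": "succeeded",  # assumed status value
        "incident": incident,
        "triage": triage,
        "followups_count": followups,
        "warnings": warnings,
    }
```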

Correlation

Every tool call includes graph_run_id in metadata for full traceability.
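
One way to guarantee this is to wrap the gateway client once per run, so individual nodes cannot forget the field. The sketch below assumes a hypothetical call_tool(tool, action, params, metadata) callable; the real gateway client's signature may differ.

```python
from typing import Any, Callable, Dict, Optional

# Hypothetical signature of the underlying gateway client.
ToolCall = Callable[[str, str, Dict[str, Any], Dict[str, Any]], Any]


def with_run_id(call_tool: ToolCall, graph_run_id: str) -> Callable[..., Any]:
    def wrapped(
        tool: str,
        action: str,
        params: Dict[str, Any],
        metadata: Optional[Dict[str, Any]] = None,
    ) -> Any:
        # graph_run_id is written last so caller metadata cannot override it.
        merged = {**(metadata or {}), "graph_run_id": graph_run_id}
        return call_tool(tool, action, params, merged)

    return wrapped
```

Nodes would then call the wrapped function, for example wrapped("oncall_tool", "incident_get", {"incident_id": "..."}), and the run ID travels with every call.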
