# Postmortem Draft Graph

## Overview

The `postmortem_draft_graph` is a LangGraph workflow on the Sofiia Supervisor (NODA2) that generates structured postmortem drafts from incident data.

## Flow

```
validate → load_incident → ensure_triage → draft_postmortem
         → attach_artifacts → append_followups → build_result → END
```

1. **validate**: checks that `incident_id` is provided.
2. **load_incident**: calls `oncall_tool.incident_get` via the gateway.
3. **ensure_triage**: if no `triage_report` artifact exists, generates one by calling the observability, health, and KB tools.
4. **draft_postmortem**: builds a deterministic Markdown and JSON postmortem from a structured template.
5. **attach_artifacts**: uploads `postmortem_draft.md` and `postmortem_draft.json` (and optionally `triage_report.json`) via `oncall_tool.incident_attach_artifact`.
6. **append_followups**: creates `followup` timeline events from the postmortem.
7. **build_result**: returns the final output.

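The step order above can be sketched in plain Python. This is only an illustration of the sequencing, not the actual LangGraph wiring: the node bodies are stubs, and the `"succeeded"` status value, the `visited` key, and the short-circuit on validation failure are assumptions made for the example.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Node = Callable[[State], State]


def validate(state: State) -> State:
    # Step 1: fail fast when no incident_id was supplied.
    if not state.get("incident_id"):
        return {**state, "graph_status": "failed", "error": "incident_id is required"}
    return state


def make_stub(name: str) -> Node:
    # Hypothetical stand-in for a real node; it only records that the step ran.
    def node(state: State) -> State:
        return {**state, "visited": [*state.get("visited", []), name]}
    return node


STEP_NAMES = [
    "load_incident", "ensure_triage", "draft_postmortem",
    "attach_artifacts", "append_followups", "build_result",
]
STEPS: List[Node] = [validate] + [make_stub(name) for name in STEP_NAMES]


def run(state: State) -> State:
    # Execute the nodes in the documented order; stop as soon as one fails.
    for step in STEPS:
        state = step(state)
        if state.get("graph_status") == "failed":
            return state
    return {**state, "graph_status": "succeeded"}  # assumed success value
```

Calling `run({"incident_id": "inc_1"})` walks all seven steps, while `run({})` stops at `validate`.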
## API

### Start run

```bash
curl -X POST http://supervisor:8000/v1/graphs/postmortem_draft/runs \
  -H "Content-Type: application/json" \
  -d '{
    "workspace_id": "default",
    "user_id": "admin",
    "agent_id": "sofiia",
    "input": {
      "incident_id": "inc_20260223_1000_abc123",
      "service": "router",
      "env": "prod",
      "include_traces": false
    }
  }'
```
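The same request can be built from Python with only the standard library. This is a minimal sketch: the URL and body fields are copied from the curl example, while the helper name `build_run_request` is hypothetical.

```python
import json
import urllib.request

SUPERVISOR_URL = "http://supervisor:8000"  # host taken from the curl example


def build_run_request(incident_id, *, service=None, env="prod", include_traces=False):
    # Assemble the graph input; optional fields are sent only when set.
    graph_input = {
        "incident_id": incident_id,
        "env": env,
        "include_traces": include_traces,
    }
    if service is not None:
        graph_input["service"] = service
    body = {
        "workspace_id": "default",
        "user_id": "admin",
        "agent_id": "sofiia",
        "input": graph_input,
    }
    return urllib.request.Request(
        f"{SUPERVISOR_URL}/v1/graphs/postmortem_draft/runs",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends the request. The shape of the HTTP response is not described in this document, so the sketch stops at building the request.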
### Input

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `incident_id` | string | Yes | Existing incident ID |
| `service` | string | No | Override service (defaults to the incident's service) |
| `env` | string | No | Environment (default: `prod`) |
| `time_range` | object | No | `{"from": "ISO", "to": "ISO"}` (defaults to the incident timestamps) |
| `include_traces` | bool | No | Include trace lookup in triage (default: `false`) |

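The defaulting rules in the table could be applied roughly as follows. This is a sketch under stated assumptions: the function name `resolve_input` and the incident keys `started_at` and `ended_at` are hypothetical, since the incident record schema is not shown here.

```python
from typing import Any, Dict


def resolve_input(raw: Dict[str, Any], incident: Dict[str, Any]) -> Dict[str, Any]:
    # incident_id is the only required field.
    if not raw.get("incident_id"):
        raise ValueError("incident_id is required")
    # started_at / ended_at are assumed names for the incident timestamps.
    default_range = {"from": incident.get("started_at"), "to": incident.get("ended_at")}
    return {
        "incident_id": raw["incident_id"],
        "service": raw.get("service") or incident.get("service"),
        "env": raw.get("env") or "prod",
        "time_range": raw.get("time_range") or default_range,
        "include_traces": bool(raw.get("include_traces", False)),
    }
```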
### Output

```json
{
  "incident_id": "inc_...",
  "artifacts_count": 3,
  "artifacts": [...],
  "followups_count": 4,
  "triage_was_generated": true,
  "markdown_preview": "# Postmortem: Router OOM\n..."
}
```
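One way the final step might assemble this payload is sketched below. The function signature and the `preview_chars` cutoff are assumptions; only the output field names come from the example above.

```python
from typing import Any, Dict, List


def build_result(
    incident_id: str,
    artifacts: List[Dict[str, Any]],
    followups: List[Dict[str, Any]],
    markdown: str,
    triage_was_generated: bool,
    preview_chars: int = 200,  # assumed preview length
) -> Dict[str, Any]:
    # Counts are derived from the lists so they cannot drift apart.
    return {
        "incident_id": incident_id,
        "artifacts_count": len(artifacts),
        "artifacts": list(artifacts),
        "followups_count": len(followups),
        "triage_was_generated": triage_was_generated,
        "markdown_preview": markdown[:preview_chars],
    }
```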
## Postmortem Template

The generated Markdown includes the following sections:

- **Summary**: from the triage report
- **Impact**: SLO/health assessment
- **Detection**: when and how the incident was reported
- **Timeline**: from incident events
- **Root Cause Analysis**: from the suspected causes in triage
- **Mitigations Applied**: from triage and runbooks
- **Follow-ups**: action items extracted from triage
- **Prevention**: standard recommendations

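A deterministic renderer for these sections might look like the sketch below. The section order follows the list above and the title line follows the `markdown_preview` example; the function name, the heading level, and the placeholder text for missing sections are assumptions.

```python
from typing import Dict

SECTIONS = [
    "Summary", "Impact", "Detection", "Timeline",
    "Root Cause Analysis", "Mitigations Applied", "Follow-ups", "Prevention",
]


def render_postmortem(title: str, content: Dict[str, str]) -> str:
    # Fixed section order and no timestamps or randomness: the same input
    # always yields the same Markdown.
    lines = [f"# Postmortem: {title}", ""]
    for name in SECTIONS:
        body = content.get(name, "_No data available._")  # assumed placeholder
        lines += [f"## {name}", "", body, ""]
    return "\n".join(lines)
```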
## Error Handling

- Incident not found → `graph_status: "failed"`
- Gateway errors during triage generation → non-fatal (the draft uses partial data)
- Follow-up append errors → non-fatal (the graph still succeeds)
- All tool calls go through the gateway (RBAC/audit enforced)

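The non-fatal handling of follow-up errors can be illustrated with a small sketch. `GatewayError`, the `append_event` callback, and the `followup_errors` key are hypothetical names; the point is only that a failed append is recorded and the loop continues.

```python
from typing import Any, Callable, Dict, List


class GatewayError(Exception):
    """Hypothetical error raised when a gateway tool call fails."""


def append_followups(
    followups: List[Dict[str, Any]],
    append_event: Callable[[Dict[str, Any]], None],
) -> Dict[str, Any]:
    appended = 0
    errors: List[str] = []
    for item in followups:
        try:
            append_event(item)
            appended += 1
        except GatewayError as exc:
            # Non-fatal: remember the error and keep going.
            errors.append(str(exc))
    return {"followups_count": appended, "followup_errors": errors}
```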
## Correlation

Every tool call includes `graph_run_id` in its metadata, so each call can be traced back to the graph run that issued it.
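
A wrapper along these lines would stamp the run ID onto each call. It is a sketch only: the request shape (`tool`, `action`, `params`, `metadata`), the `send` callback, and the `gr_` ID prefix are assumptions, not the gateway's documented format.

```python
import uuid
from typing import Any, Callable, Dict, Optional


def make_tool_caller(
    send: Callable[[Dict[str, Any]], Any],
    graph_run_id: Optional[str] = None,
) -> Callable[[str, str, Dict[str, Any]], Any]:
    # One ID per graph run, shared by every tool call made through this caller.
    run_id = graph_run_id or f"gr_{uuid.uuid4().hex}"

    def call(tool: str, action: str, params: Dict[str, Any]) -> Any:
        request = {
            "tool": tool,
            "action": action,
            "params": params,
            "metadata": {"graph_run_id": run_id},
        }
        return send(request)

    return call
```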