docs(platform): add policy configs, runbooks, ops scripts and platform documentation

Config policies (16 files): alert_routing, architecture_pressure, backlog, cost_weights, data_governance, incident_escalation, incident_intelligence, network_allowlist, nodes_registry, observability_sources, rbac_tools_matrix, release_gate, risk_attribution, risk_policy, slo_policy, tool_limits, tools_rollout

Ops (22 files): Caddyfile, calendar compose, grafana voice dashboard, deployments/incidents logs, runbooks for alerts/audit/backlog/incidents/sofiia/voice, cron jobs, scripts (alert_triage, audit_cleanup, migrate_*, governance, schedule), task_registry, voice alerts/ha/latency/policy

Docs (30+ files): HUMANIZED_STEPAN v2.7-v3 changelogs and runbooks, NODA1/NODA2 status and setup, audit index and traces, backlog, incident, supervisor, tools, voice, opencode, release, risk, aistalk, spacebot

Made-with: Cursor
2026-03-03 07:14:53 -08:00
parent 129e4ea1fc
commit 67225a39fa
102 changed files with 20060 additions and 0 deletions
--- a/docs/tools/cost_analyzer_tool.md
+++ b/docs/tools/cost_analyzer_tool.md
@@ -0,0 +1,266 @@
+# cost_analyzer_tool — FinOps & Resource Analyzer
+
+**Category:** FinOps / Observability  
+**RBAC:** `tools.cost.read` (report, top, anomalies, weights), `tools.cost.gate` (gate)  
+**Roles:** `agent_cto` (read + gate), `agent_oncall` (read)  
+**Timeout:** 20 s  
+**Rate limit:** 10 rpm  
+
+---
+
+## Purpose
+
+`cost_analyzer_tool` answers the following questions for the CTO and on-call team:
+
+- **Who is burning resources?** (by agent, tool, workspace)
+- **Are there anomalous spikes?** (the recent window compared against a baseline)
+- **What are the current weight settings?** (for FinOps calibration)
+
+All calculations use **relative cost_units**, not real monetary values.  
+Payloads are never stored or logged.
+
+---
+
+## Actions
+
+### `report` - aggregated report for a period
+
+```json
+{
+  "action": "report",
+  "time_range": { "from": "2026-02-16T00:00:00Z", "to": "2026-02-23T00:00:00Z" },
+  "group_by": ["tool", "agent_id"],
+  "top_n": 10,
+  "include_failed": true,
+  "include_hourly": false
+}
+```
+
+**Response:**
+```json
+{
+  "time_range": { "from": "...", "to": "..." },
+  "totals": {
+    "calls": 1240,
+    "cost_units": 4821.5,
+    "failed": 12,
+    "denied": 3,
+    "error_rate": 0.0097
+  },
+  "breakdowns": {
+    "tool": [
+      { "tool": "comfy_generate_video", "count": 42, "cost_units": 5200.0, "avg_duration_ms": 8200 },
+      { "tool": "pr_reviewer_tool", "count": 87, "cost_units": 960.0, ... }
+    ],
+    "agent_id": [...]
+  }
+}
+```
+
+---
+
+### `top` - quick top-N for a window (24h/7d)
+
+```json
+{
+  "action": "top",
+  "window_hours": 24,
+  "top_n": 10
+}
+```
+
+**Response:** `top_tools`, `top_agents`, `top_users`, `top_workspaces`.
+
+---
+
+### `anomalies` - spike detection
+
+```json
+{
+  "action": "anomalies",
+  "window_minutes": 60,
+  "baseline_hours": 24,
+  "ratio_threshold": 3.0,
+  "min_calls": 50
+}
+```
+
+**Algorithm:**
+1. Window = `[now - window_minutes, now]`
+2. Baseline = `[now - baseline_hours, now - window_minutes]`
+3. Spike = `window_rate / baseline_rate >= ratio_threshold` AND `calls >= min_calls`
+4. Error spike = `error_rate > 10%` AND `calls >= min_calls`
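The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: `BucketStats` and `detect_anomalies` are hypothetical names, and the rate is assumed to be calls per minute (the doc does not say whether the rate is measured in calls or in cost units). A zero baseline is treated as an unbounded ratio, which is also an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BucketStats:
    """Aggregated counters for one time bucket (hypothetical helper type)."""
    calls: int
    failed: int
    minutes: float  # length of the bucket in minutes, must be > 0


def detect_anomalies(
    window: BucketStats,
    baseline: BucketStats,
    ratio_threshold: float = 3.0,
    min_calls: int = 50,
    error_rate_threshold: float = 0.10,
) -> List[str]:
    """Return the anomaly types triggered by the window, per steps 3 and 4."""
    if window.minutes <= 0 or baseline.minutes <= 0:
        raise ValueError("bucket lengths must be positive")
    # Both checks require enough traffic in the window to be meaningful.
    if window.calls == 0 or window.calls < min_calls:
        return []

    findings: List[str] = []
    window_rate = window.calls / window.minutes
    baseline_rate = baseline.calls / baseline.minutes
    # Assumption: an empty baseline counts as an infinitely large ratio.
    ratio = window_rate / baseline_rate if baseline_rate > 0 else float("inf")
    if ratio >= ratio_threshold:
        findings.append("cost_spike")
    if window.failed / window.calls > error_rate_threshold:
        findings.append("error_spike")
    return findings
```

With the defaults, a 60-minute window of 80 calls against a 23-hour baseline of 2 calls is flagged as a `cost_spike`, while a window whose per-minute rate matches the baseline is not.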
+
+**Response:**
+```json
+{
+  "anomalies": [
+    {
+      "type": "cost_spike",
+      "key": "tool:comfy_generate_image",
+      "tool": "comfy_generate_image",
+      "window": "last_60m",
+      "baseline": "prev_24h",
+      "window_calls": 120,
+      "baseline_calls": 8,
+      "ratio": 6.3,
+      "recommendation": "'comfy_generate_image' cost spike..."
+    }
+  ],
+  "anomaly_count": 1,
+  "stats": { "window_calls": 120, "baseline_calls": 8 }
+}
+```
+
+---
+
+### `weights` - current cost model weights
+
+```json
+{ "action": "weights" }
+```
+
+Returns the configuration from `config/cost_weights.yml`: defaults, per-tool weights, anomaly thresholds.
+
+---
+
+## Cost Model
+
+```
+cost_units = cost_per_call(tool) + duration_ms × cost_per_ms(tool)
+```
+
+These are **relative units**, not real dollars. Calibrate them in `config/cost_weights.yml`.
+
+| Tool | cost_per_call | cost_per_ms |
+|------|--------------|-------------|
+| `comfy_generate_video` | 120.0 | 0.005 |
+| `comfy_generate_image` | 50.0 | 0.003 |
+| `pr_reviewer_tool` | 10.0 | 0.002 |
+| `observability_tool` | 2.0 | 0.001 |
+| _(default)_ | 1.0 | 0.001 |
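As a worked example, the formula and the weights table above can be expressed as a small lookup. This is only a sketch: the `WEIGHTS` dictionary hard-codes the table rows for illustration, whereas the document says the real values come from `config/cost_weights.yml`, and the function name `cost_units` is an assumption.

```python
from typing import Dict, Tuple

# (cost_per_call, cost_per_ms) copied from the table above for illustration.
WEIGHTS: Dict[str, Tuple[float, float]] = {
    "comfy_generate_video": (120.0, 0.005),
    "comfy_generate_image": (50.0, 0.003),
    "pr_reviewer_tool": (10.0, 0.002),
    "observability_tool": (2.0, 0.001),
}
DEFAULT_WEIGHTS: Tuple[float, float] = (1.0, 0.001)


def cost_units(tool: str, duration_ms: float) -> float:
    """cost_units = cost_per_call(tool) + duration_ms * cost_per_ms(tool)."""
    cost_per_call, cost_per_ms = WEIGHTS.get(tool, DEFAULT_WEIGHTS)
    return cost_per_call + duration_ms * cost_per_ms
```

For instance, `cost_units("pr_reviewer_tool", 500)` works out to 10.0 + 500 × 0.002 = 11.0 units (up to floating-point rounding), and a tool missing from the table falls back to the default row.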
+
+---
+
+## Audit persistence (AuditStore)
+
+Every tool call that goes through `ToolGovernance.post_call()` is persisted automatically.
+
+**Backend (env var `AUDIT_BACKEND`):**
+
+| Backend | Config | Description |
+|---------|--------|-------------|
+| `jsonl` (default) | `AUDIT_JSONL_DIR` | Append-only files, one per date: `ops/audit/tool_audit_YYYY-MM-DD.jsonl` |
+| `postgres` | `DATABASE_URL` | Async asyncpg writes to the `tool_audit_events` table |
+| `memory` | — | In-process (tests, dev) |
+| `null` | — | Disabled |
+
+**Fields in the store** (no payload):
+```
+ts, req_id, workspace_id, user_id, agent_id, tool, action,
+status, duration_ms, in_size, out_size, input_hash,
+graph_run_id?, graph_node?, job_id?
+```
+
+**Non-fatal:** if the store is unavailable, a warning is logged and the tool call does not fail.
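To make the "no payload" rule concrete, here is a hedged sketch of how an event for the `jsonl` backend might be built and appended. The function names, the use of SHA-256 for `input_hash`, and the choice of UTF-8 JSON with sorted keys are all assumptions; only the field names and the `tool_audit_YYYY-MM-DD.jsonl` file pattern come from the text above.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def build_audit_event(tool: str, action: str, status: str, duration_ms: int,
                      payload: bytes, output: bytes,
                      ts: Optional[datetime] = None, **meta: str) -> dict:
    """Build an audit record that keeps only a hash and sizes of the payload."""
    ts = ts or datetime.now(timezone.utc)
    event = dict(meta)  # e.g. req_id, workspace_id, user_id, agent_id
    event.update({
        "ts": ts.isoformat(),
        "tool": tool,
        "action": action,
        "status": status,
        "duration_ms": duration_ms,
        "in_size": len(payload),
        "out_size": len(output),
        # Assumption: SHA-256; the document does not name the hash function.
        "input_hash": hashlib.sha256(payload).hexdigest(),
    })
    return event


def append_event(audit_dir: Path, event: dict) -> Path:
    """Append one JSON line to the per-date file and return its path."""
    audit_dir.mkdir(parents=True, exist_ok=True)
    path = audit_dir / f"tool_audit_{event['ts'][:10]}.jsonl"
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event, sort_keys=True) + "\n")
    return path
```

A caller that wants the non-fatal behaviour described above would wrap `append_event` in a `try`/`except`, log a warning, and let the tool call continue.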
+
+---
+
+## Integration with release_check (cost_watch gate)
+
+`cost_watch` is a **warning-only gate**: it always returns `pass=true` and only adds recommendations.
+
+```yaml
+# ops/task_registry.yml (release_check inputs)
+run_cost_watch: true           # enables the gate
+cost_watch_window_hours: 24    # analysis window
+cost_spike_ratio_threshold: 3.0
+cost_min_calls_threshold: 50
+```
+
+**Gate output:**
+```json
+{
+  "name": "cost_watch",
+  "status": "pass",
+  "anomalies_count": 2,
+  "anomalies_preview": [...],
+  "note": "2 anomaly(ies) detected",
+  "recommendations": ["Cost spike: comfy_generate_image — apply rate limit."]
+}
+```
+
+If `cost_analyzer_tool` is unavailable, the gate returns `skipped: true` and the release is not blocked.
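The warning-only behaviour can be illustrated with a short sketch. `cost_watch_gate` and the `fetch_anomalies` callable are hypothetical names, and the exact shape of the skipped result (including whether it carries `"status": "pass"`) and the preview size of 3 are assumptions; the output keys otherwise mirror the gate output shown above.

```python
from typing import Callable, Dict, List


def cost_watch_gate(fetch_anomalies: Callable[[], List[Dict]]) -> Dict:
    """Warning-only gate: never blocks a release, only reports anomalies."""
    try:
        anomalies = fetch_anomalies()
    except Exception as exc:  # tool unavailable: skip instead of failing
        return {
            "name": "cost_watch",
            "status": "pass",
            "skipped": True,
            "note": f"cost_analyzer_tool unavailable: {exc}",
        }
    return {
        "name": "cost_watch",
        "status": "pass",  # stays "pass" even when anomalies are present
        "anomalies_count": len(anomalies),
        "anomalies_preview": anomalies[:3],  # assumption: preview size of 3
        "note": f"{len(anomalies)} anomaly(ies) detected",
        "recommendations": [
            a["recommendation"] for a in anomalies if "recommendation" in a
        ],
    }
```

Passing the analyzer call in as a callable keeps the gate logic testable without the real tool: a stub that raises exercises the skipped path, and a stub that returns a list exercises the reporting path.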
+
+---
+
+## RBAC
+
+```yaml
+cost_analyzer_tool:
+  actions:
+    report:     { entitlements: ["tools.cost.read"] }
+    top:        { entitlements: ["tools.cost.read"] }
+    anomalies:  { entitlements: ["tools.cost.read"] }
+    weights:    { entitlements: ["tools.cost.read"] }
+    gate:       { entitlements: ["tools.cost.gate"] }
+
+role_entitlements:
+  agent_cto:    [..., tools.cost.read, tools.cost.gate]
+  agent_oncall: [..., tools.cost.read]
+```
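The matrix above amounts to a set-inclusion check, sketched below. The dictionaries copy only the cost-related entitlements from the YAML (the `...` entries for other entitlements are left out), and `is_allowed` is a hypothetical helper, not the platform's actual RBAC API. Unknown roles and unknown actions are denied, which is an assumption about the default.

```python
from typing import Dict, Set

ACTION_ENTITLEMENTS: Dict[str, Set[str]] = {
    "report": {"tools.cost.read"},
    "top": {"tools.cost.read"},
    "anomalies": {"tools.cost.read"},
    "weights": {"tools.cost.read"},
    "gate": {"tools.cost.gate"},
}

# Only the cost-related entitlements are listed here.
ROLE_ENTITLEMENTS: Dict[str, Set[str]] = {
    "agent_cto": {"tools.cost.read", "tools.cost.gate"},
    "agent_oncall": {"tools.cost.read"},
}


def is_allowed(role: str, action: str) -> bool:
    """Allow an action only if the role holds every required entitlement."""
    required = ACTION_ENTITLEMENTS.get(action)
    if required is None:
        return False  # assumption: unknown actions are denied
    return required <= ROLE_ENTITLEMENTS.get(role, set())
```

Under this sketch `agent_oncall` can call `report` but not `gate`, and a role without cost entitlements, such as `agent_media`, is denied for every action.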
+
+---
+
+## Limits
+
+```yaml
+cost_analyzer_tool:
+  timeout_ms: 20000       # 20s
+  max_chars_in: 2000
+  max_bytes_out: 1048576  # 1MB
+  rate_limit_rpm: 10
+  concurrency: 2
+```
+
+---
+
+## Security
+
+- Payloads are NEVER stored or logged.
+- AuditStore writes only hash + sizes + metadata.
+- All aggregation queries filter on metadata only (ts, tool, agent_id, workspace_id).
+- The `anomalies` endpoint does not expose the contents of tool calls.
+
+---
+
+## Tests
+
+`tests/test_cost_analyzer.py` (18 tests; the table below lists 14 of them):
+
+| Test | What it checks |
+|------|----------------|
+| `test_audit_persist_nonfatal` | A broken store does not break the tool call |
+| `test_cost_report_aggregation` | 20 events → correct totals and top |
+| `test_cost_event_cost_units` | `pr_reviewer` 500ms = 11.0 units |
+| `test_anomalies_spike_detection` | 80 calls in the window vs 2 in the baseline → spike |
+| `test_anomalies_no_spike` | Stable traffic → 0 anomalies |
+| `test_top_report` | comfy_generate_video as the #1 spender |
+| `test_release_check_cost_watch_always_passes` | gate pass=True even with anomalies |
+| `test_cost_watch_gate_in_full_release_check` | full run_release_check keeps pass |
+| `test_rbac_cost_tool_deny` | alateya (agent_media) → denied |
+| `test_rbac_cost_tool_allow` | sofiia (agent_cto) → allowed |
+| `test_weights_loaded` | cost_weights.yml is read correctly |
+| `test_jsonl_store_roundtrip` | write + read JSONL |
+| `test_cost_watch_skipped_on_tool_error` | tool error → gate skipped, not an error |
+| `test_anomalies_error_rate_spike` | 80% failure rate → error_spike |
+
+---
+
+## Next steps (after MVP)
+
+1. **Postgres backend**: for long-term retention (>7d) and SQL queries.
+2. **Token-level cost**: if an LLM token metric is available, compute an exact $ cost.
+3. **Budget alerts**: notify on-call when the daily budget is exceeded.
+4. **Cost dashboard**: a Grafana panel built on the `tool_audit_events` table.
+5. **Per-graph cost**: tracking via `graph_run_id` (already in the schema).