docs(platform): add policy configs, runbooks, ops scripts and platform documentation

Config policies (16 files): alert_routing, architecture_pressure, backlog, cost_weights, data_governance, incident_escalation, incident_intelligence, network_allowlist, nodes_registry, observability_sources, rbac_tools_matrix, release_gate, risk_attribution, risk_policy, slo_policy, tool_limits, tools_rollout

Ops (22 files): Caddyfile, calendar compose, grafana voice dashboard, deployments/incidents logs, runbooks for alerts/audit/backlog/incidents/sofiia/voice, cron jobs, scripts (alert_triage, audit_cleanup, migrate_*, governance, schedule), task_registry, voice alerts/ha/latency/policy

Docs (30+ files): HUMANIZED_STEPAN v2.7-v3 changelogs and runbooks, NODA1/NODA2 status and setup, audit index and traces, backlog, incident, supervisor, tools, voice, opencode, release, risk, aistalk, spacebot

Made-with: Cursor
2026-03-03 07:14:53 -08:00
parent 129e4ea1fc
commit 67225a39fa
102 changed files with 20060 additions and 0 deletions
--- a/docs/release/release_check.md
+++ b/docs/release/release_check.md
@@ -0,0 +1,248 @@
+# release_check — Release Gate
+
+**A single orchestrated job for checking release readiness**  
+Node: NODA2 (dev) + NODA1 (production)
+
+---
+
+## What is it?
+
+`release_check` is an internal task in the Job Orchestrator that runs all release gates in sequence and returns a single structured `pass/fail` verdict.
+
+It replaces running each gate manually, one at a time.
+
+---
+
+## Gates (run sequentially)
+
+| # | Gate | Tool | Blocking condition |
+|---|------|------|-----------------|
+| 1 | **PR Review** | `pr_reviewer_tool` (mode=`blocking_only`) | blocking_count > 0 |
+| 2 | **Config Lint** | `config_linter_tool` (strict=true) | blocking_count > 0 |
+| 3 | **Contract Diff** | `contract_tool` (fail_on_breaking=true) | breaking_count > 0 |
+| 4 | **Threat Model** | `threatmodel_tool` (risk_profile) | unmitigated_high > 0 |
+| 5 | **Smoke** *(optional)* | `job_orchestrator_tool` → `smoke_gateway` | job fail |
+| 6 | **Drift** *(optional)* | `job_orchestrator_tool` → `drift_check_node1` | job fail |
+
+Gates 1–4 always run (provided their input data is present).  
+Gates 5–6 run only when `run_smoke=true` / `run_drift=true`.
+
+---
+
+## How to run
+
+### Via job_orchestrator_tool (recommended)
+
+```json
+{
+  "action": "start_task",
+  "agent_id": "sofiia",
+  "params": {
+    "task_id": "release_check",
+    "inputs": {
+      "service_name": "router",
+      "diff_text": "<unified diff>",
+      "openapi_base": "<base OpenAPI spec>",
+      "openapi_head": "<head OpenAPI spec>",
+      "risk_profile": "agentic_tools",
+      "fail_fast": false,
+      "run_smoke": true,
+      "run_drift": false
+    }
+  }
+}
+```
+
+### Via Sofiia (OpenCode/Telegram)
+
+```
+"Run release_check for the router service with this diff: ..."
+"Run a release gate check"
+```
+
+### Dry run (validation only)
+
+```json
+{
+  "action": "start_task",
+  "params": {
+    "task_id": "release_check",
+    "dry_run": true,
+    "inputs": {"service_name": "router"}
+  }
+}
+```
+
+---
+
+## Input parameters (inputs_schema)
+
+| Parameter | Type | Required | Description |
+|----------|-----|:---:|------|
+| `service_name` | string | ✅ | Service name |
+| `diff_text` | string | no | Unified diff (git diff) |
+| `openapi_base` | string | no | OpenAPI base spec (text) |
+| `openapi_head` | string | no | OpenAPI head spec (text) |
+| `risk_profile` | enum | no | `default` / `agentic_tools` / `public_api` (default: `default`) |
+| `fail_fast` | boolean | no | Stop at the first fail (default: `false`) |
+| `run_smoke` | boolean | no | Run smoke tests (default: `false`) |
+| `run_drift` | boolean | no | Run the drift check (default: `false`) |
+
+---
+
+## Output format
+
+```json
+{
+  "pass": true,
+  "gates": [
+    {
+      "name": "pr_review",
+      "status": "pass",
+      "blocking_count": 0,
+      "summary": "No blocking issues found",
+      "score": 95
+    },
+    {
+      "name": "config_lint",
+      "status": "pass",
+      "blocking_count": 0,
+      "total_findings": 2
+    },
+    {
+      "name": "contract_diff",
+      "status": "skipped",
+      "reason": "openapi_base or openapi_head not provided"
+    },
+    {
+      "name": "threat_model",
+      "status": "pass",
+      "unmitigated_high": 0,
+      "risk_profile": "default"
+    }
+  ],
+  "recommendations": [],
+  "summary": "✅ RELEASE CHECK PASSED in 1234ms. Gates: ['pr_review', 'config_lint', 'threat_model'].",
+  "elapsed_ms": 1234.5
+}
+```
+
+### Gate statuses
+
+| Status | Meaning |
+|--------|----------|
+| `pass` | Gate passed |
+| `fail` | Gate did not pass (blocks the release) |
+| `skipped` | No input data was provided (does not block) |
+| `error` | Internal gate error |
+
+---
+
+## Interpreting the result
+
+### `pass: true`
+All mandatory gates passed → **the release can go out**.
+
+### `pass: false`
+At least one gate has `status: fail` → **the release is blocked**.  
+See `gates[].status == "fail"` and `recommendations` for details.
+
+### `status: error`
+The gate could not run (internal error). This is not a `fail`, but it needs attention.
+
+---
+
+## Risk Profiles for the Threat Model
+
+| Profile | When to use |
+|---------|---------------------|
+| `default` | Ordinary internal service |
+| `agentic_tools` | Service with tool calls, prompt injection risks |
+| `public_api` | Public API (rate limiting, WAF, auth hardening) |
+
+---
+
+## Required Entitlements
+
+To run `release_check`, the agent must have:
+- `tools.pr_review.gate`
+- `tools.contract.gate`
+- `tools.config_lint.gate`
+- `tools.threatmodel.gate`
+
+Only agents with the `agent_cto` role (sofiia, yaromir) have these entitlements.
+
+---
+
+## Example scenarios
+
+### Quick PR check (no openapi, no smoke)
+
+```json
+{
+  "service_name": "gateway-bot",
+  "diff_text": "...",
+  "fail_fast": true
+}
+```
+
+### Full release pipeline for a public API
+
+```json
+{
+  "service_name": "router",
+  "diff_text": "...",
+  "openapi_base": "...",
+  "openapi_head": "...",
+  "risk_profile": "public_api",
+  "run_smoke": true,
+  "run_drift": true
+}
+```
+
+### Threat model only (no diff)
+
+```json
+{
+  "service_name": "auth-service",
+  "risk_profile": "agentic_tools"
+}
+```
+
+---
+
+## Internal architecture
+
+```
+job_orchestrator_tool.start_task("release_check")
+  → _job_orchestrator_tool() detects runner="internal"
+  → release_check_runner.run_release_check(tool_manager, inputs, agent_id)
+    → Gate 1: _run_pr_review()
+    → Gate 2: _run_config_lint()
+    → Gate 3: _run_dependency_scan()
+    → Gate 4: _run_contract_diff()
+    → Gate 5: _run_threat_model()
+    → [Gate 6: _run_smoke()]
+    → [Gate 7: _run_drift()]
+    → Gate 8: _run_followup_watch()  (policy: off/warn/strict)
+    → Gate 9: _run_privacy_watch()   (policy: off/warn/strict)
+    → Gate 10: _run_cost_watch()     (always warn)
+    → _build_report()
+  → ToolResult(success=True, result=report)
+```
+
+Each gate calls the corresponding tool via `tool_manager.execute_tool()`.  
+Governance middleware (RBAC, limits, audit) is applied to every gate call.
+
+---
+
+## Files
+
+| File | Purpose |
+|------|-------------|
+| `ops/task_registry.yml` | Registration of the `release_check` task |
+| `services/router/release_check_runner.py` | Internal runner (gates logic) |
+| `config/release_gate_policy.yml` | Gate strictness profiles (dev/staging/prod) |
+| `config/slo_policy.yml` | SLO thresholds per service |
+| `tests/test_tool_governance.py` | Tests (including release_check fixtures) |
+| `tests/test_release_check_followup_watch.py` | Follow-up watch gate tests |
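
The verdict rules in the "Interpreting the result" section of `release_check.md` can be illustrated with a minimal sketch. This is not the code in `release_check_runner.py`; the function name `overall_verdict` and the `failed` / `errors` keys are hypothetical. Only the `name` and `status` fields and the rule "any `fail` blocks, `skipped` and `error` do not" come from the document.

```python
from typing import Any


def overall_verdict(gates: list[dict[str, Any]]) -> dict[str, Any]:
    """Collapse per-gate results into a single verdict (illustrative only)."""
    # A gate with status "fail" blocks the release.
    failed = [g["name"] for g in gates if g.get("status") == "fail"]
    # "error" is not a fail, but is surfaced so an operator can look at it.
    errored = [g["name"] for g in gates if g.get("status") == "error"]
    # "skipped" gates (missing inputs) neither block nor count as errors.
    return {"pass": not failed, "failed": failed, "errors": errored}


if __name__ == "__main__":
    sample = [
        {"name": "pr_review", "status": "pass"},
        {"name": "contract_diff", "status": "skipped"},
        {"name": "threat_model", "status": "error"},
    ]
    print(overall_verdict(sample))
```

With the sample above, the release is not blocked, but `threat_model` is listed under `errors` for follow-up.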
--- a/docs/release/release_gate_policy.md
+++ b/docs/release/release_gate_policy.md
@@ -0,0 +1,68 @@
+# Release Gate Policy
+
+`config/release_gate_policy.yml` is the centralized config for gate strictness across deployment profiles.
+
+## Profiles
+
+| Profile | Purpose | privacy_watch | cost_watch |
+|---------|-------------|---------------|------------|
+| `dev` | Development | warn | warn |
+| `staging` | Staging | **strict** (fail_on error) | warn |
+| `prod` | Production | **strict** (fail_on error) | warn |
+
+## Gate modes
+
+| Mode | Behavior |
+|-------|-----------|
+| `off` | The gate is skipped entirely (not invoked, not reported) |
+| `warn` | The gate always reports `pass=True`; findings go to `recommendations` |
+| `strict` | The gate can block the release under the `fail_on` conditions |
+
+## Usage
+
+Pass `gate_profile` in the release_check inputs:
+
+```json
+{
+  "gate_profile": "staging",
+  "run_privacy_watch": true,
+  "diff_text": "..."
+}
+```
+
+## strict mode: privacy_watch
+
+Blocks the release if there are findings with a severity listed in `fail_on`:
+
+```yaml
+privacy_watch:
+  mode: "strict"
+  fail_on: ["error"]   # only error severity blocks; warning = recommendation
+```
+
+For example, `DG-SEC-001` (private key) = error → `release_check.pass = false`.  
+`DG-LOG-001` (sensitive logger) = warning → does not block in staging/prod.
+
+## cost_watch
+
+**Always `warn`** in all profiles: cost spikes never block a release (recommendations only).
+
+## Backward compatibility
+
+If `gate_profile` is not passed → `dev` is used (warn for privacy and cost).  
+If `release_gate_policy.yml` is missing → all gates use `warn` (graceful fallback).
+
+## Example output for staging with an error finding
+
+```json
+{
+  "pass": false,
+  "gates": [
+    { "name": "privacy_watch", "status": "pass", "errors": 1,
+      "top_findings": [{"id": "DG-SEC-001", "severity": "error", ...}],
+      "recommendations": ["Remove private key from code..."] }
+  ],
+  "summary": "❌ RELEASE CHECK FAILED. Failed: []. Errors: [].",
+  "recommendations": ["Remove private key from code..."]
+}
+```
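
The mode and fallback rules in `release_gate_policy.md` can be sketched as follows. This is an illustration under stated assumptions, not the runner's code: the policy layout (`profile → gate → {mode, fail_on}`), the function names `resolve_gate` and `evaluate_gate`, and the result keys `blocks_release`, `blocking_ids` and `advisory_ids` are all hypothetical. The `off` / `warn` / `strict` semantics, the `dev` default, and the `warn` fallback are taken from the document.

```python
from typing import Any, Optional

DEFAULT_PROFILE = "dev"   # used when gate_profile is not passed
FALLBACK_MODE = "warn"    # used when the policy file is missing


def resolve_gate(policy: Optional[dict], profile: Optional[str], gate: str) -> dict[str, Any]:
    """Look up mode and fail_on for a gate, applying the documented fallbacks."""
    if not policy:
        # Policy file absent: every gate degrades to warn.
        return {"mode": FALLBACK_MODE, "fail_on": []}
    gate_cfg = policy.get(profile or DEFAULT_PROFILE, {}).get(gate, {})
    return {
        "mode": gate_cfg.get("mode", FALLBACK_MODE),
        "fail_on": list(gate_cfg.get("fail_on", [])),
    }


def evaluate_gate(gate_cfg: dict[str, Any], findings: list[dict[str, str]]) -> Optional[dict[str, Any]]:
    """Apply off/warn/strict to a list of findings (each has 'id' and 'severity')."""
    mode = gate_cfg["mode"]
    if mode == "off":
        return None  # gate is neither invoked nor reported
    blocking: list[str] = []
    if mode == "strict":
        # Only severities listed in fail_on can block the release.
        blocking = [f["id"] for f in findings if f["severity"] in gate_cfg["fail_on"]]
    advisory = [f["id"] for f in findings if f["id"] not in blocking]
    return {"blocks_release": bool(blocking), "blocking_ids": blocking, "advisory_ids": advisory}
```

Under these assumptions, a `staging` policy with `fail_on: ["error"]` would block on `DG-SEC-001` and treat `DG-LOG-001` as advisory, while the same findings under `dev` would not block.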
--- a/docs/release/sofiia-console-v1-readiness.md
+++ b/docs/release/sofiia-console-v1-readiness.md
@@ -0,0 +1,109 @@
+# Sofiia Console v1.0 Release Readiness Summary
+
+One-page go/no-go artifact for the release decision on `sofiia-console`.
+
+## 1) Scope & Version
+
+- Service: `sofiia-console`
+- Target version / tag: `v1.0` (to be assigned at release cut)
+- Git SHAs:
+  - sofiia-console: `e75fd33`
+  - router: `<set at release window>`
+  - gateway: `<set at release window>`
+- Deployment target:
+  - NODA1: production runtime/data plane
+  - NODA2: control plane / sofiia-console
+- Date prepared: `<set at release window>`
+- Prepared by: `<operator>`
+
+## 2) Production Guarantees
+
+### Reliability
+
+- Idempotent `POST /api/chats/{chat_id}/send` with selectable backend (`inmemory|redis`).
+- Multi-node routing covered by E2E tests (NODA1/NODA2 via `infer` monkeypatch path).
+- Cursor pagination hardened with tie-breakers (`(ts,id)` / stable ordering semantics).
+- Release process formalized via preflight + release runbook + smoke scripts.
+
+### Security
+
+- Rate limiting on send path:
+  - per-chat scope
+  - per-operator scope
+- Strict `/api/audit` protection:
+  - key required
+  - no localhost bypass
+- Structured audit trail:
+  - write events for operator actions
+  - cursor-based read endpoint
+- Secrets rotation runbook documented and operational.
+
+### Operational Controls
+
+- `/metrics` exposed (including rate-limit and idempotency counters).
+- Structured JSON logs for send/replay/pagination/error flows.
+- Audit retention policy in place (default 90 days).
+- Pruning script available (`ops/prune_audit_db.py`: dry-run + batch delete + optional vacuum).
+- Release evidence auto-generator available (`ops/generate_release_evidence.sh`).
+
+## 3) Known Limitations / Residual Risks
+
+- Chat index is still local DB-backed; full multi-instance HA for global chat index needs Phase 6 (Redis ChatIndexStore).
+- Rate-limit defaults to `inmemory`; multi-instance consistency needs `SOFIIA_RATE_LIMIT_BACKEND=redis`.
+- Audit storage is SQLite (single-node storage, non-clustered by default).
+- Automatic alerting/paging is not yet enabled; metric observation is primarily manual/runbook-driven.
+
+## 4) Required Release-Day Checks
+
+### Preflight
+
+- `STRICT=1 bash ops/preflight_sofiia_console.sh`
+
+### Deploy order
+
+- NODA2 precheck
+- NODA1 rollout
+- NODA2 finalize
+
+### Smoke
+
+- `GET /api/health` -> `200`
+- `/metrics` reachable
+- `bash ops/redis_idempotency_smoke.sh` -> `PASS` (when redis backend is enabled)
+- `/api/audit` auth:
+  - without key -> `401`
+  - with key -> `200`
+
+### Post-release
+
+- Verify rate-limit metrics increment under controlled load.
+- Verify audit write/read quick check.
+- Run retention dry-run:
+  - `python3 ops/prune_audit_db.py --dry-run`
+
+## 5) Explicit Go / No-Go Criteria
+
+**GO if all conditions hold:**
+
+- Preflight is `PASS` (or only non-critical `WARN` accepted by operator).
+- Smoke checks pass.
+- No unexpected 5xx spike during first 5–10 minutes.
+- Rate-limit counters and idempotency behavior are within expected range.
+
+**NO-GO if any condition holds:**
+
+- Strict audit auth fails (401/200 behavior broken).
+- Redis idempotency A/B smoke fails.
+- Audit write/read fails.
+- Unexpected 500s on send path.
+
+## 6) Rollback Readiness Statement
+
+- Rollback method:
+  - revert to previous known-good SHA/tag
+  - restart affected services via docker compose/systemd as per runbook
+- Estimated rollback time: `<set by operator, typically 5-15 min>`
+- Mandatory post-rollback smoke:
+  - `/api/health`
+  - idempotency smoke
+  - audit auth/read checks
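
The go/no-go criteria in section 5 of the readiness summary can be written down as a small decision helper. This is a sketch only: the `ReleaseDayChecks` dataclass, its field names, and the `decide` function are hypothetical, and treating "not all GO conditions hold" as NO-GO is an assumption, since the document does not define an intermediate state.

```python
from dataclasses import dataclass


@dataclass
class ReleaseDayChecks:
    preflight: str                   # "PASS", "WARN" or "FAIL"
    warn_accepted: bool = False      # operator accepted non-critical WARNs
    smoke_passed: bool = False
    no_5xx_spike: bool = False       # first 5-10 minutes after rollout
    counters_in_range: bool = False  # rate-limit and idempotency behavior
    audit_auth_ok: bool = False      # 401 without key, 200 with key
    redis_smoke_ok: bool = True      # leave True when the redis backend is disabled
    audit_rw_ok: bool = False
    send_path_500s: bool = False


def decide(c: ReleaseDayChecks) -> tuple[str, list[str]]:
    """Return ("GO" | "NO-GO", reasons) following the section 5 criteria."""
    reasons = []
    # Any single NO-GO condition is enough to stop the release.
    if not c.audit_auth_ok:
        reasons.append("strict audit auth broken")
    if not c.redis_smoke_ok:
        reasons.append("redis idempotency smoke failed")
    if not c.audit_rw_ok:
        reasons.append("audit write/read failed")
    if c.send_path_500s:
        reasons.append("unexpected 500s on send path")
    # GO requires every GO condition to hold.
    preflight_ok = c.preflight == "PASS" or (c.preflight == "WARN" and c.warn_accepted)
    if not preflight_ok:
        reasons.append("preflight not PASS")
    if not c.smoke_passed:
        reasons.append("smoke checks failed")
    if not c.no_5xx_spike:
        reasons.append("5xx spike observed")
    if not c.counters_in_range:
        reasons.append("counters out of expected range")
    return ("NO-GO", reasons) if reasons else ("GO", [])
```

Returning the list of reasons alongside the verdict keeps the decision auditable, which fits the one-page go/no-go purpose of the summary.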