docs(platform): add policy configs, runbooks, ops scripts and platform documentation

Config policies (16 files): alert_routing, architecture_pressure, backlog, cost_weights, data_governance, incident_escalation, incident_intelligence, network_allowlist, nodes_registry, observability_sources, rbac_tools_matrix, release_gate, risk_attribution, risk_policy, slo_policy, tool_limits, tools_rollout

Ops (22 files): Caddyfile, calendar compose, grafana voice dashboard, deployments/incidents logs, runbooks for alerts/audit/backlog/incidents/sofiia/voice, cron jobs, scripts (alert_triage, audit_cleanup, migrate_*, governance, schedule), task_registry, voice alerts/ha/latency/policy

Docs (30+ files): HUMANIZED_STEPAN v2.7-v3 changelogs and runbooks, NODA1/NODA2 status and setup, audit index and traces, backlog, incident, supervisor, tools, voice, opencode, release, risk, aistalk, spacebot

Made-with: Cursor
2026-03-03 07:14:53 -08:00
parent 129e4ea1fc
commit 67225a39fa
102 changed files with 20060 additions and 0 deletions
--- a/docs/release/sofiia-console-v1-readiness.md
+++ b/docs/release/sofiia-console-v1-readiness.md
@@ -0,0 +1,109 @@
+# Sofiia Console v1.0 Release Readiness Summary
+
+One-page go/no-go artifact for the release decision on `sofiia-console`.
+
+## 1) Scope & Version
+
+- Service: `sofiia-console`
+- Target version / tag: `v1.0` (to be assigned at release cut)
+- Git SHAs:
+  - sofiia-console: `e75fd33`
+  - router: `<set at release window>`
+  - gateway: `<set at release window>`
+- Deployment target:
+  - NODA1: production runtime/data plane
+  - NODA2: control plane / sofiia-console
+- Date prepared: `<set at release window>`
+- Prepared by: `<operator>`
+
+## 2) Production Guarantees
+
+### Reliability
+
+- Idempotent `POST /api/chats/{chat_id}/send` with selectable backend (`inmemory|redis`).
+- Multi-node routing covered by E2E tests (NODA1/NODA2 via `infer` monkeypatch path).
+- Cursor pagination hardened with tie-breakers (`(ts,id)` / stable ordering semantics).
+- Release process formalized via preflight + release runbook + smoke scripts.
+
+### Security
+
+- Rate limiting on send path:
+  - per-chat scope
+  - per-operator scope
+- Strict `/api/audit` protection:
+  - key required
+  - no localhost bypass
+- Structured audit trail:
+  - write events for operator actions
+  - cursor-based read endpoint
+- Secrets rotation runbook documented and operational.
+
+### Operational Controls
+
+- `/metrics` exposed (including rate-limit and idempotency counters).
+- Structured JSON logs for send/replay/pagination/error flows.
+- Audit retention policy in place (default 90 days).
+- Pruning script available (`ops/prune_audit_db.py`: dry-run + batch delete + optional vacuum).
+- Release evidence auto-generator available (`ops/generate_release_evidence.sh`).
+
+## 3) Known Limitations / Residual Risks
+
+- Chat index is still local DB-backed; full multi-instance HA for global chat index needs Phase 6 (Redis ChatIndexStore).
+- Rate-limit defaults to `inmemory`; multi-instance consistency needs `SOFIIA_RATE_LIMIT_BACKEND=redis`.
+- Audit storage is SQLite (single-node storage, non-clustered by default).
+- Automatic alerting/paging is not yet enabled; metric observation is primarily manual/runbook-driven.
+
+## 4) Required Release-Day Checks
+
+### Preflight
+
+- `STRICT=1 bash ops/preflight_sofiia_console.sh`
+
+### Deploy order
+
+- NODA2 precheck
+- NODA1 rollout
+- NODA2 finalize
+
+### Smoke
+
+- `GET /api/health` -> `200`
+- `/metrics` reachable
+- `bash ops/redis_idempotency_smoke.sh` -> `PASS` (when redis backend is enabled)
+- `/api/audit` auth:
+  - without key -> `401`
+  - with key -> `200`
+
+### Post-release
+
+- Verify rate-limit metrics increment under controlled load.
+- Run a quick audit write/read check.
+- Run retention dry-run:
+  - `python3 ops/prune_audit_db.py --dry-run`
+
+## 5) Explicit Go / No-Go Criteria
+
+**GO if all conditions hold:**
+
+- Preflight is `PASS` (or reports only non-critical `WARN` results that the operator has accepted).
+- Smoke checks pass.
+- No unexpected 5xx spike during the first 5–10 minutes.
+- Rate-limit counters and idempotency behavior are within expected range.
+
+**NO-GO if any condition holds:**
+
+- Strict audit auth fails (`/api/audit` does not return `401` without a key or `200` with a key).
+- Redis idempotency A/B smoke fails (when the redis backend is enabled).
+- Audit write/read fails.
+- Unexpected 500s on send path.
+
+## 6) Rollback Readiness Statement
+
+- Rollback method:
+  - revert to previous known-good SHA/tag
+  - restart affected services via docker compose/systemd as per runbook
+- Estimated rollback time: `<set by operator, typically 5-15 min>`
+- Mandatory post-rollback smoke:
+  - `/api/health`
+  - idempotency smoke
+  - audit auth/read checks
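
The HTTP part of the smoke checks listed in the document (`/api/health` returning `200`, `/api/audit` returning `401` without a key and `200` with one) could be scripted roughly as follows. This is a minimal sketch and not part of the commit: the function names, the base URL, and in particular the `X-API-Key` header name are assumptions, since the document does not say how the audit key is passed.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, List, Optional, Tuple

# (check name, expected status, actual status, passed)
SmokeResult = Tuple[str, int, int, bool]


def http_status(url: str, headers: Optional[Dict[str, str]] = None) -> int:
    """Return the HTTP status code of a GET request, including 4xx/5xx."""
    req = urllib.request.Request(url, headers=headers or {})
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # urllib raises on 4xx/5xx; the status code is still what we want.
        return exc.code


def run_smoke(
    base_url: str,
    audit_key: str,
    fetch: Callable[[str, Optional[Dict[str, str]]], int] = http_status,
    key_header: str = "X-API-Key",  # assumption: header name is hypothetical
) -> List[SmokeResult]:
    """Run the three HTTP smoke checks and report expected vs. actual status."""
    checks = [
        ("health", "/api/health", {}, 200),
        ("audit without key", "/api/audit", {}, 401),
        ("audit with key", "/api/audit", {key_header: audit_key}, 200),
    ]
    results: List[SmokeResult] = []
    for name, path, headers, expected in checks:
        actual = fetch(base_url + path, headers)
        results.append((name, expected, actual, actual == expected))
    return results


def all_passed(results: List[SmokeResult]) -> bool:
    """True only if every smoke check returned its expected status."""
    return all(passed for _, _, _, passed in results)
```

An operator might call `run_smoke("http://<host>:<port>", "<audit key>")` after deploy or rollback and treat `all_passed(...)` being false as a NO-GO signal. The `fetch` parameter exists only so the checks can be exercised against a stub without a running service.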