docs(platform): add policy configs, runbooks, ops scripts and platform documentation

Config policies (16 files): alert_routing, architecture_pressure, backlog,
cost_weights, data_governance, incident_escalation, incident_intelligence,
network_allowlist, nodes_registry, observability_sources, rbac_tools_matrix,
release_gate, risk_attribution, risk_policy, slo_policy, tool_limits, tools_rollout

Ops (22 files): Caddyfile, calendar compose, grafana voice dashboard,
deployments/incidents logs, runbooks for alerts/audit/backlog/incidents/sofiia/voice,
cron jobs, scripts (alert_triage, audit_cleanup, migrate_*, governance, schedule),
task_registry, voice alerts/ha/latency/policy

Docs (30+ files): HUMANIZED_STEPAN v2.7-v3 changelogs and runbooks,
NODA1/NODA2 status and setup, audit index and traces, backlog, incident,
supervisor, tools, voice, opencode, release, risk, aistalk, spacebot

Made-with: Cursor
This commit is contained in:
Apple
2026-03-03 07:14:53 -08:00
parent 129e4ea1fc
commit 67225a39fa
102 changed files with 20060 additions and 0 deletions


# Voice Streaming — Phase 2 Architecture
## Problem
The current pipeline (Phase 1):
```
User stops → STT → [full LLM text] → TTS request → audio plays
Bottleneck: 8-12s
```
TTS starts only after the **full** text from the LLM is available.
Result: E2E latency = `llm_total + tts_compute` (~10-14s).
## Phase 2 Goal
```
User stops → STT → [LLM first chunk] → TTS(chunk1) → audio starts
[LLM continues] → TTS(chunk2) → audio continues
```
**E2E TTFA** (time-to-first-audio): ~`llm_first_sentence + tts_compute` = ~3-5s.
---
## Architecture
### Option A (recommended): "sentence chunking" without streaming to the browser
Requires no streaming channel between the BFF and the browser. Steps:
1. The BFF issues `POST /api/generate` with `stream=true` to Ollama.
2. The BFF accumulates tokens until the first `[.!?]` or 100 characters.
3. It immediately calls `POST /voice/tts` for the first sentence.
4. In parallel, it keeps reading the LLM stream for the following sentences.
5. The browser receives the first audio chunk and starts playback.
6. Subsequent chunks are appended via the MediaSource API or a sequential `<audio>` queue.
**Advantages**: no WebSocket/SSE needed between the BFF and the browser; plain audio responses suffice.
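The first-sentence cut in steps 2-3 can be sketched as a pure helper (the function name and signature are illustrative, not from the codebase):

```python
import re

SENT_END = re.compile(r"[.!?]")

def take_first_chunk(tokens, max_chars=100):
    """Consume streamed LLM tokens until the first sentence
    terminator [.!?] or max_chars, whichever comes first.
    Returns (first_chunk, leftover_already_read)."""
    buf = ""
    for tok in tokens:
        buf += tok
        m = SENT_END.search(buf)
        if m:                          # sentence boundary hit
            return buf[: m.end()], buf[m.end():].lstrip()
        if len(buf) >= max_chars:      # hard cap: flush anyway
            return buf, ""
    return buf, ""                     # stream ended before a boundary
```

The leftover text seeds the accumulator for the next chunk, while the token iterator still holds the rest of the stream.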
### Option B: full streaming pipeline
```
BFF → SSE → Browser
chunk1_text → TTS → audio_b64_1
chunk2_text → TTS → audio_b64_2
...
```
More complex, but the best UX.
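The per-chunk SSE frames in Option B could come from a plain generator (the framework wrapper is omitted and the field names are assumptions):

```python
import base64
import json

def sse_frames(chunks):
    """Yield one SSE 'data:' frame per (text, audio_bytes) pair.
    A real BFF would feed these into a text/event-stream response."""
    for i, (text, audio) in enumerate(chunks):
        payload = {"chunk": i, "text": text,
                   "audio_b64": base64.b64encode(audio).decode()}
        yield f"data: {json.dumps(payload)}\n\n"   # blank line ends the frame
```

On the browser side, an `EventSource` (or fetch-based reader) would decode `audio_b64` per frame and push it into the audio queue.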
---
## Minimal patch (Option A)
### 1. BFF: new endpoint `POST /api/voice/chat/stream`
```python
@app.post("/api/voice/chat/stream")
async def api_voice_chat_stream(body: VoiceChatBody):
    # 1. Get the full LLM text (non-streaming is enough for the minimal patch);
    #    llm_generate / tts_b64 are assumed helpers wrapping Ollama and /voice/tts
    text = await llm_generate(body.text)
    # 2. Split into sentences
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    first, remaining = parts[0], " ".join(parts[1:])
    # 3. TTS for the first sentence immediately
    first_audio_b64 = await tts_b64(first)
    # 4. Client plays first_audio, requests TTS for the remainder in background
    return {"first_audio_b64": first_audio_b64, "first_text": first,
            "remaining_text": remaining}
```
### 2. Browser: play first sentence, background-fetch rest
```javascript
async function voiceChatStreamTurn(text) {
const r = await fetch('/api/voice/chat/stream', {...});
const d = await r.json();
// Play first sentence immediately
playAudioB64(d.first_audio_b64);
// Fetch remaining in background while first plays
if (d.remaining_text) {
fetchAndQueueAudio(d.remaining_text);
}
}
```
### 3. Audio queue on browser
```javascript
const audioQueue = [];
function playAudioB64(b64) {
  // audio/mpeg assumed (edge-tts emits MP3)
  audioQueue.push(new Audio(`data:audio/mpeg;base64,${b64}`));
  if (audioQueue.length === 1) playNext();   // nothing playing yet
}
function playNext() {
  const a = audioQueue[0];
  a.onended = () => { audioQueue.shift(); if (audioQueue.length) playNext(); };
  a.play();
}
function fetchAndQueueAudio(text) {
  // split to sentences, fetch TTS per sentence, push via playAudioB64;
  // playback chains automatically on each chunk's onended
}
```
---
## SLO Impact (estimated)
| Metric | Phase 1 | Phase 2 (est.) |
|---|---|---|
| TTFA (first audio) | ~10-14s | ~3-5s |
| Full response end | ~12-15s | ~10-13s (same) |
| UX perceived latency | high | natural conversation |
---
## Prerequisites
- `stream=true` support in Ollama (already available)
- BFF needs async generator / streaming response
- Browser needs MediaSource or sequential audio queue
- TTS chunk size: 1 sentence or 80-120 chars (edge-tts handles this well)
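The last prerequisite (1 sentence or up to ~120 chars per TTS chunk) can be enforced by merging short sentences; a minimal sketch, with the threshold taken from the list above:

```python
import re

def tts_chunks(text, max_chars=120):
    """Group sentences into TTS-sized chunks of at most max_chars.
    Short sentences are merged; an over-long sentence ships alone."""
    chunks, cur = [], ""
    for s in re.split(r'(?<=[.!?])\s+', text.strip()):
        if cur and len(cur) + 1 + len(s) > max_chars:
            chunks.append(cur)   # flush before exceeding the budget
            cur = s
        else:
            cur = f"{cur} {s}".strip()
    if cur:
        chunks.append(cur)
    return chunks
```

Each returned chunk maps to one `POST /voice/tts` call in the background fetch loop.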
---
## Status
- Phase 1: ✅ deployed (delegates to memory-service)
- Phase 2: 📋 planned — implement after voice quality stabilizes
### When to implement Phase 2
1. When `gemma3` p95 latency is consistently < 4s (currently ~2.6s — ready).
2. When voice usage > 20 turns/day (worth the complexity).
3. When edge-tts 403 rate < 0.1% (confirmed stable with 7.2.7).