Voice Streaming — Phase 2 Architecture
Problem
Current pipeline (Phase 1):
```
User stops → STT → [full LLM text] → TTS request → audio plays
                                          ↑
                                  Bottleneck: 8–12s
```
TTS starts only after the LLM has produced the full text.
Result: E2E latency = llm_total + tts_compute (~10–14s).
Phase 2 Goal
```
User stops → STT → [LLM first chunk] → TTS(chunk1) → audio starts
                          ↓
                [LLM continues] → TTS(chunk2) → audio continues
```
E2E TTFA (time-to-first-audio): ~llm_first_sentence + tts_compute = ~3–5s.
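A rough sanity check of that estimate (the ~2.6s first-sentence latency is the gemma3 p95 quoted in the readiness criteria at the end of this doc; the 1–2s TTS compute range is an assumption, not a measured value):

```python
# Hypothetical arithmetic, not measured numbers.
llm_first_sentence = 2.6                      # gemma3 p95, per the criteria below
tts_compute_low, tts_compute_high = 1.0, 2.0  # assumed edge-tts compute range

ttfa_low = llm_first_sentence + tts_compute_low
ttfa_high = llm_first_sentence + tts_compute_high
# Both endpoints fall inside the ~3–5s TTFA target.
```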
Architecture
Variant A (recommended): "sentence chunking" without streaming
Does not require streaming from the LLM. Steps:
- BFF makes `POST /api/generate` with `stream=true` to Ollama.
- BFF accumulates tokens until the first `[.!?]` or 100 characters.
- Immediately sends `POST /voice/tts` for the first sentence.
- In parallel, continues reading the LLM stream for subsequent sentences.
- The browser receives the first audio chunk → playback starts.
- Subsequent chunks are appended via the MediaSource API or sequential `<audio>` elements.
Advantages: no WebSocket/SSE needed between the BFF and the browser; audio only.
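The accumulate-and-flush step above can be sketched as a small generator (a sketch only; `chunk_sentences` is a hypothetical helper mirroring the first-`[.!?]`-or-100-chars heuristic, not existing BFF code):

```python
import re

def chunk_sentences(token_stream, max_chars=100):
    """Accumulate streamed LLM tokens, flushing a chunk at each
    sentence boundary ([.!?] followed by whitespace) or at max_chars."""
    buf = ""
    for token in token_stream:
        buf += token
        # Flush every complete sentence currently in the buffer
        while (m := re.search(r"[.!?]\s", buf)):
            yield buf[:m.end()].strip()
            buf = buf[m.end():]
        if len(buf) >= max_chars:   # long clause with no punctuation yet
            yield buf.strip()
            buf = ""
    if buf.strip():                 # trailing partial sentence
        yield buf.strip()
```

The first yielded chunk is what Variant A would send to `POST /voice/tts` while the LLM stream is still open.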
Варіант B: Full streaming pipeline
```
BFF → SSE → Browser
        ↓
chunk1_text → TTS → audio_b64_1
chunk2_text → TTS → audio_b64_2
...
```
More complex, but the best UX.
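The per-chunk event flow of Variant B can be sketched with stand-ins for the LLM stream and the TTS call (`llm_sentence_stream`, `tts_b64`, and the `|`-delimited input are illustrative stubs, not real endpoints):

```python
import asyncio
import base64
import json

async def llm_sentence_stream(text):
    # Stand-in for the streamed, sentence-chunked LLM output.
    for sentence in text.split("|"):
        yield sentence

async def tts_b64(sentence):
    # Stand-in for POST /voice/tts: base64 "audio" for one chunk.
    return base64.b64encode(sentence.encode()).decode()

async def sse_events(text):
    """Yield one SSE message per (text chunk, audio chunk) pair."""
    async for sentence in llm_sentence_stream(text):
        audio = await tts_b64(sentence)
        payload = json.dumps({"text": sentence, "audio_b64": audio})
        yield f"data: {payload}\n\n"

async def demo():
    return [event async for event in sse_events("Hi.|How are you?")]

events = asyncio.run(demo())  # two SSE frames, one per sentence
```

In the real BFF this generator would back a streaming response with `Content-Type: text/event-stream`.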
Minimal patch (Variant A)
1. BFF: new endpoint `POST /api/voice/chat/stream`
```python
@app.post("/api/voice/chat/stream")
async def api_voice_chat_stream(body: VoiceChatBody):
    # 1. Get the full LLM text (streaming or not)
    # 2. Split into sentences: re.split(r'(?<=[.!?])\s+', text)
    # 3. For the first sentence: POST /voice/tts immediately
    # 4. Return: {first_audio_b64, first_text, remaining_text}
    # 5. Client plays first_audio, requests TTS for remaining in background
```
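The sentence split in step 2 can be checked in isolation:

```python
import re

text = "First sentence. Second one? Third!"
# The lookbehind keeps the punctuation attached to each sentence
sentences = re.split(r'(?<=[.!?])\s+', text)
print(sentences)
```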
2. Browser: play first sentence, background-fetch rest
```js
async function voiceChatStreamTurn(text) {
  const r = await fetch('/api/voice/chat/stream', {...});
  const d = await r.json();
  // Play first sentence immediately
  playAudioB64(d.first_audio_b64);
  // Fetch remaining in background while first plays
  if (d.remaining_text) {
    fetchAndQueueAudio(d.remaining_text);
  }
}
```
3. Audio queue in the browser
```js
const audioQueue = [];
function playAudioB64(b64) { /* ... */ }
function fetchAndQueueAudio(text) {
  // split into sentences, fetch TTS per sentence, add to queue
  // play each when the previous finishes (currentAudio.onended)
}
```
SLO Impact (estimated)
| Metric | Phase 1 | Phase 2 (est.) |
|---|---|---|
| TTFA (first audio) | ~10–14s | ~3–5s |
| Full response end | ~12–15s | ~10–13s (roughly unchanged) |
| UX perceived latency | high | natural conversation |
Prerequisites
- `stream=true` support in Ollama (already available)
- BFF needs an async generator / streaming response
- Browser needs MediaSource or sequential audio queue
- TTS chunk size: 1 sentence or 80–120 chars (edge-tts handles this well)
Status
- Phase 1: ✅ deployed (delegates to memory-service)
- Phase 2: 📋 planned — implement after voice quality stabilizes
When to implement Phase 2
- When `gemma3` p95 latency is consistently < 4s (currently ~2.6s — ready).
- When voice usage > 20 turns/day (worth the complexity).
- When the edge-tts 403 rate is < 0.1% (confirmed stable with 7.2.7).