microdao-daarion/docs/voice_streaming_phase2.md
Apple 67225a39fa docs(platform): add policy configs, runbooks, ops scripts and platform documentation
2026-03-03 07:14:53 -08:00
# Voice Streaming — Phase 2 Architecture
## Problem
Current pipeline (Phase 1):
```
User stops → STT → [full LLM text] → TTS request → audio plays
Bottleneck: 8-12s
```
TTS starts only after the **full** text has arrived from the LLM.
Result: E2E latency = `llm_total + tts_compute` (~10-14s).
## Phase 2 Goal
```
User stops → STT → [LLM first chunk] → TTS(chunk1) → audio starts
[LLM continues] → TTS(chunk2) → audio continues
```
**E2E TTFA** (time-to-first-audio): ~`llm_first_sentence + tts_compute` = ~3-5s.
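The latency arithmetic behind these estimates can be reproduced with a quick sketch; all per-stage durations below are illustrative assumptions, not measurements from the deployment:

```python
# Illustrative latency budget for both phases; the stage timings are
# assumed values in seconds, not measured figures.
llm_first_sentence = 1.5   # time until the LLM emits its first sentence
llm_total = 9.0            # time until the LLM finishes the full reply
tts_compute = 2.5          # TTS synthesis time for one chunk

phase1_ttfa = llm_total + tts_compute           # TTS waits for the full text
phase2_ttfa = llm_first_sentence + tts_compute  # TTS starts on the first sentence

print(f"Phase 1 TTFA: {phase1_ttfa:.1f}s")  # 11.5s — inside the ~10-14s range
print(f"Phase 2 TTFA: {phase2_ttfa:.1f}s")  # 4.0s — inside the ~3-5s range
```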
---
## Architecture
### Variant A (recommended): "sentence chunking" without streaming
No streaming channel to the browser is required. Steps:
1. The BFF makes a `POST /api/generate` request with `stream=true` to Ollama.
2. The BFF accumulates tokens until the first `[.!?]` or 100 characters.
3. It immediately issues `POST /voice/tts` for the first sentence.
4. In parallel, it keeps reading the LLM stream for the following sentences.
5. The browser receives the first audio chunk → playback starts.
6. Subsequent chunks are appended via the MediaSource API or sequential `<audio>` elements.
**Advantages**: no WebSocket/SSE needed between the BFF and the browser; plain audio responses suffice.
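Steps 2-4 above hinge on a token accumulator that cuts chunks at sentence boundaries. A minimal sketch (the function name and the simulated token source are hypothetical; in the BFF the tokens would come from Ollama's `stream=true` response):

```python
import re

def sentence_chunks(tokens, min_chars=100):
    """Accumulate streamed LLM tokens and yield a chunk at each
    sentence boundary ([.!?]) or once min_chars is reached."""
    buf = ""
    for tok in tokens:
        buf += tok
        if re.search(r"[.!?]\s*$", buf) or len(buf) >= min_chars:
            yield buf.strip()
            buf = ""
    if buf.strip():  # flush whatever remains when the stream ends
        yield buf.strip()

# Simulated token stream
tokens = ["Hel", "lo the", "re. ", "How ", "are ", "you", "?"]
print(list(sentence_chunks(tokens)))  # ['Hello there.', 'How are you?']
```

Each yielded chunk would be sent to `POST /voice/tts` as soon as it is produced, while the loop keeps consuming the LLM stream.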
### Variant B: Full streaming pipeline
```
BFF → SSE → Browser
chunk1_text → TTS → audio_b64_1
chunk2_text → TTS → audio_b64_2
...
```
More complex, but the best UX.
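The SSE leg of Variant B could be expressed as an async generator on the BFF side. This is a hypothetical sketch (the `tts` callable stands in for the actual `/voice/tts` client, which is not defined in this doc); the `@app.post` snippet later suggests FastAPI, where this generator would be wrapped in a `StreamingResponse` with `media_type="text/event-stream"`:

```python
import base64
import json

async def sse_voice_stream(text_chunks, tts):
    """Yield one SSE event per text chunk, each carrying the chunk's
    TTS audio as base64. `tts` is an async callable returning raw bytes."""
    for i, chunk in enumerate(text_chunks, start=1):
        audio_b64 = base64.b64encode(await tts(chunk)).decode()
        payload = {"seq": i, "text": chunk, "audio_b64": audio_b64}
        yield f"data: {json.dumps(payload)}\n\n"
```

The browser would consume this via `EventSource` or a `fetch` reader, decoding and queueing each `audio_b64` as it arrives.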
---
## Minimal patch (Variant A)
### 1. BFF: new endpoint `POST /api/voice/chat/stream`
```python
@app.post("/api/voice/chat/stream")
async def api_voice_chat_stream(body: VoiceChatBody):
    # 1. Get the full LLM text (streaming or not)
    # 2. Split into sentences: re.split(r'(?<=[.!?])\s+', text)
    # 3. For the first sentence: POST /voice/tts immediately
    # 4. Return: {first_audio_b64, first_text, remaining_text}
    # 5. Client plays first_audio, requests TTS for the rest in the background
    ...
```
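The skeleton above can be fleshed out as follows; `ollama_generate` and `tts_b64` are hypothetical async helpers standing in for the BFF's actual Ollama and `/voice/tts` clients, which this doc does not define:

```python
import re

SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")

async def voice_chat_stream(user_text: str, ollama_generate, tts_b64):
    """Return the first sentence's audio plus the remaining text.
    `ollama_generate` and `tts_b64` are assumed injected async helpers."""
    full_text = await ollama_generate(user_text)
    sentences = SENTENCE_SPLIT.split(full_text.strip())
    first, rest = sentences[0], " ".join(sentences[1:])
    return {
        "first_audio_b64": await tts_b64(first),  # synthesized immediately
        "first_text": first,
        "remaining_text": rest,  # client fetches TTS for this in the background
    }
```

The endpoint handler would call this and return the dict as JSON; the client contract matches steps 4-5 in the skeleton.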
### 2. Browser: play first sentence, background-fetch rest
```javascript
async function voiceChatStreamTurn(text) {
  const r = await fetch('/api/voice/chat/stream', {...});
  const d = await r.json();
  // Play the first sentence immediately
  playAudioB64(d.first_audio_b64);
  // Fetch the remaining audio in the background while the first plays
  if (d.remaining_text) {
    fetchAndQueueAudio(d.remaining_text);
  }
}
```
### 3. Audio queue in the browser
```javascript
const audioQueue = [];
function playAudioB64(b64) { /* ... */ }
function fetchAndQueueAudio(text) {
  // split into sentences, fetch TTS per sentence, add to the queue
  // play each when the previous finishes (currentAudio.onended)
}
```
---
## SLO Impact (estimated)
| Metric | Phase 1 | Phase 2 (est.) |
|---|---|---|
| TTFA (first audio) | ~10-14s | ~3-5s |
| Full response end | ~12-15s | ~10-13s (roughly the same) |
| UX perceived latency | high | natural conversation |
---
## Prerequisites
- `stream=true` support in Ollama (already available)
- BFF needs async generator / streaming response
- Browser needs MediaSource or sequential audio queue
- TTS chunk size: 1 sentence or 80-120 chars (edge-tts handles this well)
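The last prerequisite (chunk size) can be enforced with a splitter that merges short sentences up to a target length. A sketch, assuming the 120-char cap from the guideline above (the function name is hypothetical):

```python
import re

def tts_chunks(text, max_chars=120):
    """Split text into TTS-friendly chunks: cut at sentence boundaries,
    then greedily merge sentences while staying under max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks

print(tts_chunks("Short. Also short. " + "A" * 130 + "."))
```

Note that a single sentence longer than `max_chars` is kept whole here; a production version might hard-split such runaway sentences at a word boundary.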
---
## Status
- Phase 1: ✅ deployed (delegates to memory-service)
- Phase 2: 📋 planned — implement after voice quality stabilizes
### When to implement Phase 2
1. When `gemma3` p95 latency is consistently < 4s (currently ~2.6s — ready).
2. When voice usage > 20 turns/day (worth the complexity).
3. When edge-tts 403 rate < 0.1% (confirmed stable with 7.2.7).