microdao-daarion/docs/voice_streaming_phase2.md
Apple 67225a39fa docs(platform): add policy configs, runbooks, ops scripts and platform documentation

2026-03-03 07:14:53 -08:00


Voice Streaming — Phase 2 Architecture

Problem

Current pipeline (Phase 1):

User stops → STT → [full LLM text] → TTS request → audio plays
                        ↑
                  Bottleneck: 8–12 s

TTS starts only after the LLM has produced the full text. Result: E2E latency = llm_total + tts_compute (~10–14 s).

Phase 2 Goal

User stops → STT → [LLM first chunk] → TTS(chunk1) → audio starts
                          ↓
                   [LLM continues] → TTS(chunk2) → audio continues

E2E TTFA (time-to-first-audio): ~llm_first_sentence + tts_compute = ~3–5 s.
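As a worked example of the two formulas (Phase 1 waits for the full LLM text, Phase 2 only for the first sentence), here is a tiny calculation; the component timings are illustrative assumptions, not measurements from this deployment:

```python
# Illustrative component timings (assumptions, not measured values)
llm_first_sentence = 2.0  # seconds until the LLM finishes the first sentence
llm_total = 10.0          # seconds for the complete LLM response
tts_compute = 2.0         # seconds to synthesize one audio chunk

phase1_ttfa = llm_total + tts_compute            # Phase 1: wait for the full text
phase2_ttfa = llm_first_sentence + tts_compute   # Phase 2: first sentence only

print(phase1_ttfa, phase2_ttfa)  # 12.0 4.0 — inside the ~10–14 s vs ~3–5 s ranges
```

The improvement comes entirely from not waiting for llm_total before the first TTS request.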


Architecture

Option A (recommended): "Sentence chunking" without streaming

Requires no end-to-end streaming from the LLM to the browser (the BFF still streams from Ollama internally). Steps:

  1. The BFF makes a POST /api/generate with stream=true to Ollama.
  2. The BFF accumulates tokens until the first [.!?] or 100 characters.
  3. It immediately POSTs /voice/tts for the first sentence.
  4. In parallel, it keeps reading the LLM stream for the following sentences.
  5. The browser receives the first audio chunk → playback starts.
  6. Subsequent chunks are appended via the MediaSource API or sequential <audio> elements.

Advantages: no WebSocket/SSE is needed between the BFF and the browser; only audio responses are exchanged.
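The accumulate-and-flush logic of steps 2–4 can be sketched as a small generator. This is a hypothetical helper, not code from the BFF; the 100-character cap and the [.!?] boundary come from step 2, and the naive sentence boundary will split early on abbreviations like "e.g.":

```python
import re

def iter_sentence_chunks(token_stream, max_len=100):
    """Accumulate streamed LLM tokens; flush a chunk at the first
    sentence boundary ([.!?] followed by whitespace) or at max_len chars."""
    buf = ""
    for token in token_stream:
        buf += token
        while True:
            m = re.search(r"[.!?]\s+", buf)
            if m:
                # Sentence boundary found: emit it and keep the remainder.
                yield buf[:m.end()].strip()
                buf = buf[m.end():]
            elif len(buf) >= max_len:
                # No boundary yet but the buffer is long enough to synthesize.
                yield buf[:max_len].strip()
                buf = buf[max_len:]
            else:
                break
    if buf.strip():
        yield buf.strip()  # flush whatever remains at end of stream
```

Each yielded chunk would be handed to /voice/tts immediately (step 3) while the loop keeps consuming the Ollama stream (step 4).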

Option B: Full streaming pipeline

BFF → SSE → Browser
     ↓
  chunk1_text → TTS → audio_b64_1
  chunk2_text → TTS → audio_b64_2
  ...

More complex, but delivers the best UX.
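The per-chunk framing in the diagram above could look like this on the wire; the event field names (text, audio_b64) are assumptions for illustration, not an existing format in this codebase:

```python
import base64
import json

def sse_event(chunk_text: str, audio_bytes: bytes) -> str:
    """Frame one text+audio chunk as a Server-Sent Events message
    (a `data:` line terminated by a blank line)."""
    payload = {
        "text": chunk_text,
        "audio_b64": base64.b64encode(audio_bytes).decode("ascii"),
    }
    return "data: " + json.dumps(payload) + "\n\n"
```

The BFF would yield one such event per synthesized sentence, and the browser would decode audio_b64 and enqueue it for playback.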


Minimal patch (Option A)

1. BFF: new endpoint POST /api/voice/chat/stream

@app.post("/api/voice/chat/stream")
async def api_voice_chat_stream(body: VoiceChatBody):
    # 1. Get the full LLM text from Ollama (streaming or not)
    # 2. Split into sentences: re.split(r'(?<=[.!?])\s+', text)
    # 3. For the first sentence: POST /voice/tts immediately
    # 4. Return: {first_audio_b64, first_text, remaining_text}
    # 5. Client plays first_audio, requests TTS for the remainder in the background
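The split-and-package part of this endpoint (steps 2 and 4) can be sketched as a pure helper. The function name is hypothetical; first_audio_b64 is omitted because it would come from the /voice/tts call in step 3:

```python
import re

def split_for_streaming(text: str) -> dict:
    """Split LLM output into sentences and package the first one for
    immediate TTS, returning the rest for background synthesis."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return {
        "first_text": sentences[0],
        "remaining_text": " ".join(sentences[1:]),
    }
```

The endpoint would TTS first_text synchronously and return remaining_text untouched, so the client controls when the rest is synthesized.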

2. Browser: play first sentence, background-fetch rest

async function voiceChatStreamTurn(text) {
  const r = await fetch('/api/voice/chat/stream', {...});
  const d = await r.json();

  // Play first sentence immediately
  playAudioB64(d.first_audio_b64);

  // Fetch remaining in background while first plays
  if (d.remaining_text) {
    fetchAndQueueAudio(d.remaining_text);
  }
}

3. Audio queue in the browser

const audioQueue = [];
let playing = false;
function playAudioB64(b64) {  // enqueue a base64 mp3 chunk; start playback if idle
  audioQueue.push('data:audio/mpeg;base64,' + b64);
  if (!playing) playNext();
}
function playNext() {  // chain playback: each chunk starts when the previous ends
  const src = audioQueue.shift();
  playing = !!src;
  if (!src) return;
  const audio = new Audio(src);
  audio.onended = playNext;
  audio.play();
}
function fetchAndQueueAudio(text) { /* split to sentences, fetch TTS per sentence, enqueue via playAudioB64 */ }

SLO Impact (estimated)

Metric                  Phase 1     Phase 2 (est.)
TTFA (first audio)      ~10–14 s    ~3–5 s
Full response end       ~12–15 s    ~10–13 s (same)
UX perceived latency    high        natural conversation

Prerequisites

  • stream=true support in Ollama (already available)
  • BFF needs async generator / streaming response
  • Browser needs MediaSource or sequential audio queue
  • TTS chunk size: 1 sentence or 80–120 chars (edge-tts handles this well)

Status

  • Phase 1: deployed (delegates to memory-service)
  • Phase 2: 📋 planned — implement after voice quality stabilizes

When to implement Phase 2

  1. When gemma3 p95 latency is consistently < 4s (currently ~2.6s — ready).
  2. When voice usage > 20 turns/day (worth the complexity).
  3. When edge-tts 403 rate < 0.1% (confirmed stable with 7.2.7).