# matrix-bridge-dagi — Soak & Failure Rehearsal Runbook (M11)

**Phase:** M11
**Applies to:** `matrix-bridge-dagi` service on NODA1
**When to run:** Before any production traffic increase, after major code changes, or on a recurring monthly basis.

---

## 1. Goals

| Goal | Measurable pass criterion |
|------|--------------------------|
| Latency under load | p95 invoke < 5000 ms |
| Queue stability | drop rate < 1% |
| Failover correctness | failover fires on NODA1 outage; NODA2 serves all remaining messages |
| Sticky anti-flap | sticky set after first failover; no retries to the degraded node |
| Restart recovery | sticky + health snapshot reloads within 10 s of restart |
| Policy operations safe under load | `!policy history` / `!policy change` work while messages are in-flight |

---

## 2. Prerequisites

```bash
# On NODA1 or a local machine with network access to the bridge
pip install httpx

# Verify the bridge is up
curl -s http://localhost:9400/health | jq '.ok'
# Expected: true

# Verify the /metrics endpoint
curl -s http://localhost:9400/metrics | grep matrix_bridge_up
# Expected: matrix_bridge_up{...} 1
```

---

## 2a. Enabling the Soak Inject Endpoint

The soak script uses `POST /v1/debug/inject_event`, which is **disabled by default**. Enable it only for staging/NODA1 soak runs:

```bash
# On NODA1 — edit the docker-compose override or pass the env inline:

# Option 1: temporary inline restart
DEBUG_INJECT_ENABLED=true docker-compose \
  -f docker-compose.matrix-bridge-node1.yml \
  up -d --no-deps matrix-bridge-dagi

# Option 2: .env file override
echo "DEBUG_INJECT_ENABLED=true" >> .env.soak
docker-compose --env-file .env.soak \
  -f docker-compose.matrix-bridge-node1.yml \
  up -d --no-deps matrix-bridge-dagi

# Verify it's enabled (should return 200, not 403)
curl -s -X POST http://localhost:9400/v1/debug/inject_event \
  -H 'Content-Type: application/json' \
  -d '{"room_id":"!test:test","event":{}}' | jq .
# Expected: {"ok":false,"error":"no mapping for room_id=..."} ← 200, not 403

# IMPORTANT: disable after the soak
docker-compose -f docker-compose.matrix-bridge-node1.yml up -d --no-deps matrix-bridge-dagi
# (DEBUG_INJECT_ENABLED defaults to false)
```

---

## 2b. Step 0 (WORKERS=2 / QUEUE=100) — Record True Baseline

**Goal:** snapshot the "before any tuning" numbers to have a comparison point.

```bash
# 0. Confirm the current config (should be defaults)
curl -s http://localhost:9400/health | jq '{workers: .workers, queue_max: .queue.max}'
# Expected: {"workers": 2, "queue_max": 100}

# 1. DB path for the WAL check (adjust to your BRIDGE_DATA_DIR)
DB=/opt/microdao-daarion/data/matrix_bridge.db

# 2. WAL size before (manual check)
ls -lh ${DB}-wal 2>/dev/null || echo "(no WAL file yet — first run)"
sqlite3 $DB "PRAGMA wal_checkpoint(PASSIVE);" 2>/dev/null || echo "(no sqlite3)"

# 3. Run the Step 0 soak
python3 ops/scripts/matrix_bridge_soak.py \
  --url http://localhost:9400 \
  --messages 100 \
  --concurrency 4 \
  --agent sofiia \
  --room-id "!your-room-id:your-server" \
  --max-p95-ms 5000 \
  --max-drop-rate 0.001 \
  --db-path $DB \
  --report-file /tmp/soak_step0_baseline.json

# 4. Record the result in the "Baseline numbers" table (section 10) below.
jq '.summary, .latency, .metrics_delta, .wal' /tmp/soak_step0_baseline.json
```

**v1 Go/No-Go thresholds for Step 0:**

| Metric | Green ✅ | Yellow ⚠️ | Red ❌ |
|--------|---------|-----------|-------|
| `p95_invoke_ms` | < 3000 | 3000–5000 | > 5000 |
| `drop_rate` | 0.00% (mandatory) | — | > 0.1% |
| `error_rate` | < 1% | 1–3% | > 3% |
| `failovers` | 0 | — | ≥ 1 without cause |
| WAL delta | < 2 MB | 2–10 MB | > 10 MB |

**If Step 0 is Green → proceed to Step 1 tuning.**
**If Step 0 is Yellow/Red → investigate before touching WORKER_CONCURRENCY.**

---

## 2c. Step 1 (WORKERS=4 / QUEUE=200) — Tune-1

**Goal:** verify that doubling the workers gives headroom without saturating the Router.

```bash
# 1. Apply the tuning
WORKER_CONCURRENCY=4 QUEUE_MAX_EVENTS=200 docker-compose \
  -f docker-compose.matrix-bridge-node1.yml \
  --env-file .env.soak \
  up -d --no-deps matrix-bridge-dagi

sleep 3
curl -s http://localhost:9400/health | jq '{workers: .workers, queue_max: .queue.max}'
# Expected: {"workers": 4, "queue_max": 200}

# 2. Run the Step 1 soak (higher concurrency to stress the new headroom)
python3 ops/scripts/matrix_bridge_soak.py \
  --url http://localhost:9400 \
  --messages 100 \
  --concurrency 8 \
  --agent sofiia \
  --room-id "!your-room-id:your-server" \
  --max-p95-ms 3000 \
  --max-drop-rate 0.001 \
  --db-path $DB \
  --report-file /tmp/soak_step1_tune1.json

# 3. Compare Step 0 vs Step 1
python3 - <<'EOF'
import json
s0 = json.load(open('/tmp/soak_step0_baseline.json'))
s1 = json.load(open('/tmp/soak_step1_tune1.json'))
for k in ('p50', 'p95', 'p99'):
    print(f"{k}: {s0['latency'][k]}ms → {s1['latency'][k]}ms")
print(f"drops: {s0['metrics_delta']['queue_drops']} → {s1['metrics_delta']['queue_drops']}")
print(f"WAL: {s0['wal'].get('delta_mb')} → {s1['wal'].get('delta_mb')} MB delta")
EOF
```

**Decision:**

- Step 1 Green → **freeze, tag v1.0, ship to production.**
- p95 within 5% of Step 0 → the Router is the bottleneck (not the workers); don't go to Step 2.
- Queue drops > 0 at WORKERS=4 → try Step 2 (WORKERS=8, QUEUE=300).

---

## 3. Scenario A — Baseline load (100 messages, concurrency 4)

**Goal:** establish a latency baseline; verify no drops under normal load.
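Scenario A below and the tuning steps above all emit a `--report-file`; its Green/Yellow/Red grade can be computed instead of read by eye. A minimal sketch against the section 2b thresholds, assuming the report schema shown in the Step 0/Step 1 snippets (`latency.p95`, `metrics_delta.queue_drops`, `wal.delta_mb` — adjust field names to your soak script version):

```python
# Sketch: grade a soak report against the section 2b Go/No-Go thresholds.
# Field names (latency.p95, metrics_delta.queue_drops, wal.delta_mb) are
# assumed from the jq/compare snippets above.
import json

THRESHOLDS = {  # metric -> (green_below, red_above), from the section 2b table
    "p95_ms": (3000, 5000),
    "wal_delta_mb": (2, 10),
}

def classify(value, green_below, red_above):
    if value < green_below:
        return "green"
    if value > red_above:
        return "red"
    return "yellow"

def grade_report(report: dict) -> dict:
    grades = {
        "p95_ms": classify(report["latency"]["p95"], *THRESHOLDS["p95_ms"]),
        "wal_delta_mb": classify(report["wal"].get("delta_mb", 0),
                                 *THRESHOLDS["wal_delta_mb"]),
        # drop_rate: 0.00% is mandatory for green
        "drops": "green" if report["metrics_delta"]["queue_drops"] == 0 else "red",
    }
    grades["overall"] = ("red" if "red" in grades.values()
                         else "yellow" if "yellow" in grades.values()
                         else "green")
    return grades

if __name__ == "__main__":
    import sys
    print(json.dumps(grade_report(json.load(open(sys.argv[1]))), indent=2))
```

Usage (hypothetical filename): `python3 grade_soak.py /tmp/soak_step0_baseline.json`.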
```bash
python3 ops/scripts/matrix_bridge_soak.py \
  --url http://localhost:9400 \
  --messages 100 \
  --concurrency 4 \
  --max-p95-ms 3000 \
  --report-file /tmp/soak_baseline.json
```

**Expected output:**

```
matrix-bridge-dagi Soak Report ✅ PASSED
Messages: 100 concurrency=4
Latency: p50=<500ms p95=<3000ms
Queue drops: 0 (rate 0.000%)
Failovers: 0
```

**If FAILED:**

- `p95 too high` → check router `/health`, DeepSeek API latency, `docker stats`
- `drop_rate > 0` → check the `QUEUE_MAX_EVENTS` env var (increase if needed), inspect bridge logs

---

## 4. Scenario B — Queue saturation test

**Goal:** confirm the drop metric fires cleanly and the bridge doesn't crash.

```bash
# Reduce the queue via env override, then flood:
QUEUE_MAX_EVENTS=5 docker-compose -f docker-compose.matrix-bridge-node1.yml \
  up -d matrix-bridge-dagi

# Wait for the restart
sleep 5

python3 ops/scripts/matrix_bridge_soak.py \
  --url http://localhost:9400 \
  --messages 30 \
  --concurrency 10 \
  --max-drop-rate 0.99 \
  --report-file /tmp/soak_queue_sat.json

# Restore the normal queue size
docker-compose -f docker-compose.matrix-bridge-node1.yml up -d matrix-bridge-dagi
```

**Expected:** `queue_drops > 0`, bridge still running after the test.

**Verify in Prometheus/Grafana:**

```promql
rate(matrix_bridge_queue_dropped_total[1m])
```

Should spike and then return to 0.

---

## 5. Scenario C — Node failover rehearsal

**Goal:** simulate the NODA1 router becoming unavailable; verify NODA2 takes over.
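While the router is paused it helps to watch the bridge's own view of its HA state change in real time. A minimal polling sketch; the exact fields under `/health` (`ok`, `ha_state.sticky_loaded`) are assumed from the snippets elsewhere in this runbook — adjust to your deployment's schema:

```python
# Sketch: poll the bridge /health endpoint during the failover rehearsal and
# print only state transitions. Field names under .ha_state are assumptions.
import json
import time
import urllib.request

URL = "http://localhost:9400/health"

def summarize(health: dict) -> str:
    """Render one /health payload as a single comparable line (assumed fields)."""
    ha = health.get("ha_state", {})
    return f"ok={health.get('ok')} sticky_loaded={ha.get('sticky_loaded')}"

if __name__ == "__main__":
    last = None
    for _ in range(30):                      # ~60 s of observation
        with urllib.request.urlopen(URL, timeout=5) as resp:
            line = summarize(json.load(resp))
        if line != last:                     # only print when state changes
            print(time.strftime("%H:%M:%S"), line)
            last = line
        time.sleep(2)
```

Run it in a second terminal before `docker pause dagi-router-node1`; the transition should appear within one health-poll interval.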
```bash
# Step 1: stop the router on NODA1 temporarily
docker pause dagi-router-node1

# Step 2: run the soak against the bridge (the bridge will fail over to NODA2)
python3 ops/scripts/matrix_bridge_soak.py \
  --url http://localhost:9400 \
  --messages 20 \
  --concurrency 2 \
  --max-p95-ms 10000 \
  --report-file /tmp/soak_failover.json

# Step 3: restore the router
docker unpause dagi-router-node1
```

**Expected:**

```
Failovers: 1..20 (at least 1)
Sticky sets: 1+
Errors: 0 (fallback to NODA2 serves all messages)
```

**Check sticky in the control room:**

```
!nodes
```

Should show `NODA2` sticky with the remaining TTL.

**Check the health tracker:**

```
!status
```

Should show `NODA1 state=degraded|down`.

---

## 6. Scenario D — Restart recovery

**Goal:** after a restart, sticky and health state reload within one polling cycle.

```bash
# After Scenario C: sticky is set to NODA2

# Restart the bridge
docker restart dagi-matrix-bridge-node1

# Wait for startup (up to 30 s)
sleep 15

# Verify sticky reloaded
curl -s http://localhost:9400/health | jq '.ha_state'
# Expected: {"sticky_loaded": N, ...}

# Verify routing still uses the NODA2 sticky
python3 ops/scripts/matrix_bridge_soak.py \
  --url http://localhost:9400 \
  --messages 10 \
  --concurrency 2 \
  --report-file /tmp/soak_restart.json
```

**Expected:** p95 similar to the post-failover run; `Failovers: 0` (sticky already applied).

---

## 7. Scenario E — Rate limit burst

**Goal:** verify rate limiting fires and the bridge doesn't silently drop below-limit messages.

```bash
# Set RPM very low for the test, then flood from the same sender.
# This is best done in the control room by observing the !status rate_limited count
# rather than with the soak script (which uses a different sender per message).

# In the Matrix control room:
# Send 30+ messages from the same user account in quick succession in a mixed room.
# Then:
!status
# Check: rate_limited_total increased, no queue drops.
```

---

## 8. Scenario F — Policy operations under load

**Goal:** `!policy history`, `!policy change`, and `!policy export` work while messages are in-flight.

```bash
# Run a background soak
python3 ops/scripts/matrix_bridge_soak.py \
  --url http://localhost:9400 \
  --messages 200 \
  --concurrency 2 \
  --report-file /tmp/soak_concurrent_policy.json &

# While the soak is running, in the Matrix control room:
!policy history limit=5
!policy export
!status
```

**Expected:** all three commands respond immediately (< 2 s); the soak completes without extra drops.

---

## 9. Prometheus / Grafana during soak

Key queries for the Grafana dashboard:

```promql
# Throughput (messages/s)
rate(matrix_bridge_routed_total[30s])

# Error rate
rate(matrix_bridge_errors_total[30s])

# p95 invoke latency per node
histogram_quantile(0.95, rate(matrix_bridge_invoke_duration_seconds_bucket[1m]))

# Queue drop rate
rate(matrix_bridge_queue_dropped_total[1m])

# Failovers
rate(matrix_bridge_failover_total[5m])
```

Use the `matrix-bridge-dagi` Grafana dashboard at:
`ops/grafana/dashboards/matrix-bridge-dagi.json`

---

## 10. Baseline numbers (reference)

| Metric | Cold start | Warm (sticky set) |
|--------|-----------|-------------------|
| p50 latency | ~200 ms | ~150 ms |
| p95 latency | ~2000 ms | ~1500 ms |
| Queue drops | 0 (queue=100) | 0 |
| Failover fires | 1 per degradation | 0 after sticky |
| Policy ops response | < 500 ms | < 500 ms |

*Update this table after each soak run with actual measured values.*

---

## 11. CI soak (mocked, no network)

For CI pipelines, use the mocked soak scenarios:

```bash
python3 -m pytest tests/test_matrix_bridge_m11_soak_scenarios.py -v
```

Covers (all deterministic, no network):

- **S1** Queue saturation → drop counter
- **S2** Failover under load → on_failover callback, health tracker
- **S3** Sticky routing under burst → sticky set, burst routed to NODA2
- **S4** Multi-room isolation → separate rooms don't interfere
- **S5** Rate-limit burst → RL callback wired, no panic
- **S6** HA restart recovery → sticky + health snapshot persisted and reloaded
- **Perf baseline** 100-msg + 50-msg failover burst < 5 s wall clock

---

## 12. Known failure modes & mitigations

| Symptom | Likely cause | Mitigation |
|---------|-------------|------------|
| `p95 > 5000ms` | Router/LLM slow | Increase `ROUTER_TIMEOUT_S`, check DeepSeek API |
| `drop_rate > 1%` | Queue too small | Increase `QUEUE_MAX_EVENTS` |
| `failovers > 0` but errors > 0 | Both nodes degraded | Check NODA1 + NODA2 health; scale the router |
| Bridge crash during soak | Memory leak / bug | `docker logs` → file a GitHub issue |
| Sticky not set after failover | `FAILOVER_STICKY_TTL_S=0` | Set to 300+ |
| Restart doesn't load sticky | `HA_HEALTH_MAX_AGE_S` too small | Increase or set to 3600 |
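The symptom table above can also be applied mechanically to a soak report during triage. A minimal sketch using the section 12 thresholds; the report field names (`latency.p95`, `drop_rate`, `failovers`, `errors`) are assumed from the earlier snippets and may differ in your soak script version:

```python
# Sketch: map a soak report to the likely causes in the table above.
# Thresholds come from section 12; report field names are assumptions.
def triage(report: dict) -> list[str]:
    hints = []
    if report["latency"]["p95"] > 5000:
        hints.append("p95 > 5000ms: Router/LLM slow — increase ROUTER_TIMEOUT_S, check DeepSeek API")
    if report.get("drop_rate", 0) > 0.01:
        hints.append("drop_rate > 1%: queue too small — increase QUEUE_MAX_EVENTS")
    if report.get("failovers", 0) > 0 and report.get("errors", 0) > 0:
        hints.append("failovers with errors: both nodes may be degraded — check NODA1 + NODA2 health")
    return hints or ["no known failure mode matched — check docker logs"]
```

This is a convenience for reading reports faster, not a replacement for the table: symptoms like "bridge crash during soak" or a missing sticky only show up in logs and `!nodes`/`!status` output, not in the report JSON.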