microdao-daarion/ops/runbook-matrix-bridge-soak.md
Commit 82d5ff2a4f by Apple, feat(matrix-bridge-dagi): M4–M11 + soak infrastructure (debug inject endpoint)
Includes all milestones M4 through M11:
- M4: agent discovery (!agents / !status)
- M5: node-aware routing + per-node observability
- M6: dynamic policy store (node/agent overrides, import/export)
- M7: Prometheus alerts + Grafana dashboard + metrics contract
- M8: node health tracker + soft failover + sticky cache + HA persistence
- M9: two-step confirm + diff preview for dangerous commands
- M10: auto-backup, restore, retention, policy history + change detail
- M11: soak scenarios (CI tests) + live soak script

Soak infrastructure (this commit):
- POST /v1/debug/inject_event (guarded by DEBUG_INJECT_ENABLED=false)
- _preflight_inject() and _check_wal() in soak script
- --db-path arg for WAL delta reporting
- Runbook sections 2a/2b/2c: Step 0 and Step 1 exact commands

Made-with: Cursor
2026-03-05 07:51:37 -08:00

# matrix-bridge-dagi — Soak & Failure Rehearsal Runbook (M11)
**Phase:** M11
**Applies to:** `matrix-bridge-dagi` service on NODA1
**When to run:** Before any production traffic increase, after major code changes, or monthly.
---
## 1. Goals
| Goal | Measurable pass criterion |
|------|--------------------------|
| Latency under load | p95 invoke < 5000 ms |
| Queue stability | drop rate < 1% |
| Failover correctness | failover fires on NODA1 outage; NODA2 serves all remaining messages |
| Sticky anti-flap | sticky set after first failover; no retries to degraded node |
| Restart recovery | sticky + health snapshot reloads within 10 s of restart |
| Policy operations safe under load | `!policy history` / `!policy change` work while messages in-flight |
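The two quantitative criteria above can be checked mechanically against the JSON report that `matrix_bridge_soak.py` writes. A sketch; `latency.p95` and `metrics_delta.queue_drops` mirror the report fields read elsewhere in this runbook, while `messages_total` is a hypothetical field name:

```python
def meets_core_goals(report: dict) -> bool:
    """Check a soak report dict against the latency and queue-stability goals.

    latency.p95 and metrics_delta.queue_drops follow the report keys used
    later in this runbook; messages_total is an assumed field name.
    """
    p95_ok = report["latency"]["p95"] < 5000  # p95 invoke < 5000 ms
    drops = report["metrics_delta"]["queue_drops"]
    drop_rate_ok = drops / report["messages_total"] < 0.01  # drop rate < 1%
    return p95_ok and drop_rate_ok
```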
---
## 2. Prerequisites
```bash
# On NODA1 or local machine with network access to bridge
pip install httpx
# Verify bridge is up
curl -s http://localhost:9400/health | jq '.ok'
# Expected: true
# Verify /metrics endpoint
curl -s http://localhost:9400/metrics | grep matrix_bridge_up
# Expected: matrix_bridge_up{...} 1
```
---
## 2a. Enabling the Soak Inject Endpoint
The soak script uses `POST /v1/debug/inject_event` which is **disabled by default**.
Enable it only on staging/NODA1 soak runs:
```bash
# On NODA1 — edit docker-compose override or pass env inline:
# Option 1: temporary inline restart
DEBUG_INJECT_ENABLED=true docker-compose \
  -f docker-compose.matrix-bridge-node1.yml \
  up -d --no-deps matrix-bridge-dagi
# Option 2: .env file override
echo "DEBUG_INJECT_ENABLED=true" >> .env.soak
docker-compose --env-file .env.soak \
  -f docker-compose.matrix-bridge-node1.yml \
  up -d --no-deps matrix-bridge-dagi
# Verify it's enabled (should return 200, not 403)
curl -s -X POST http://localhost:9400/v1/debug/inject_event \
  -H 'Content-Type: application/json' \
  -d '{"room_id":"!test:test","event":{}}' | jq .
# Expected: {"ok":false,"error":"no mapping for room_id=..."} ← 200, not 403
# IMPORTANT: disable after soak
docker-compose -f docker-compose.matrix-bridge-node1.yml up -d --no-deps matrix-bridge-dagi
# (DEBUG_INJECT_ENABLED defaults to false)
```
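For scripted checks, the request body from the curl example above can be built in Python and posted with `httpx` (installed in the prerequisites). This is a sketch: the outer `room_id`/`event` shape matches the curl example, but the inner event fields follow generic Matrix message conventions and are an assumption about what the bridge expects.

```python
def inject_payload(room_id: str, sender: str, body: str) -> dict:
    """Build a request body for POST /v1/debug/inject_event.

    Outer shape matches the curl example above; the event fields
    (sender, m.room.message, m.text content) are assumptions based
    on standard Matrix events, not the bridge's confirmed schema.
    """
    return {
        "room_id": room_id,
        "event": {
            "sender": sender,
            "type": "m.room.message",
            "content": {"msgtype": "m.text", "body": body},
        },
    }

# Usage against a bridge with DEBUG_INJECT_ENABLED=true (requires httpx):
#   httpx.post("http://localhost:9400/v1/debug/inject_event",
#              json=inject_payload("!test:test", "@soak:test", "ping"))
```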
---
## 2b. Step 0 (WORKERS=2 / QUEUE=100) — Record True Baseline
**Goal:** snapshot the "before any tuning" numbers to have a comparison point.
```bash
# 0. Confirm current config (should be defaults)
curl -s http://localhost:9400/health | jq '{workers: .workers, queue_max: .queue.max}'
# Expected: {"workers": 2, "queue_max": 100}
# 1. DB path for WAL check (adjust to your BRIDGE_DATA_DIR)
DB=/opt/microdao-daarion/data/matrix_bridge.db
# 2. WAL size before (manual check)
ls -lh ${DB}-wal 2>/dev/null || echo "(no WAL file yet — first run)"
sqlite3 $DB "PRAGMA wal_checkpoint(PASSIVE);" 2>/dev/null || echo "(no sqlite3)"
# 3. Run Step 0 soak
python3 ops/scripts/matrix_bridge_soak.py \
  --url http://localhost:9400 \
  --messages 100 \
  --concurrency 4 \
  --agent sofiia \
  --room-id "!your-room-id:your-server" \
  --max-p95-ms 5000 \
  --max-drop-rate 0.001 \
  --db-path $DB \
  --report-file /tmp/soak_step0_baseline.json
# 4. Record result in "Baseline numbers" table (section 10) below.
jq '.summary, .latency, .metrics_delta, .wal' /tmp/soak_step0_baseline.json
```
**v1 Go/No-Go thresholds for Step 0:**
| Metric | Green ✅ | Yellow ⚠️ | Red ❌ |
|--------|---------|-----------|-------|
| `p95_invoke_ms` | < 3000 | 3000–5000 | > 5000 |
| `drop_rate` | 0.00% (mandatory) | — | > 0.1% |
| `error_rate` | < 1% | 1–3% | > 3% |
| `failovers` | 0 | — | ≥ 1 without cause |
| WAL delta | < 2 MB | 2–10 MB | > 10 MB |
**If Step 0 is Green → proceed to Step 1 tuning.**
**If Step 0 is Yellow/Red → investigate before touching WORKER_CONCURRENCY.**
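The thresholds above can be folded into a single verdict function. A sketch: the table doesn't specify the boundary between Yellow and Red for a drop_rate between 0 and 0.1%, so that band is treated as Yellow here, and any failover counts as Red regardless of cause.

```python
def step0_verdict(p95_ms, drop_rate, error_rate, failovers, wal_delta_mb):
    """Classify a Step 0 run as GREEN / YELLOW / RED per the table above."""
    # Red: any hard-fail threshold breached
    if (p95_ms > 5000 or drop_rate > 0.001 or error_rate > 0.03
            or failovers >= 1 or wal_delta_mb > 10):
        return "RED"
    # Yellow: inside the warning bands (nonzero drops are not Green)
    if p95_ms >= 3000 or drop_rate > 0 or error_rate >= 0.01 or wal_delta_mb >= 2:
        return "YELLOW"
    return "GREEN"
```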
---
## 2c. Step 1 (WORKERS=4 / QUEUE=200) — Tune-1
**Goal:** verify that doubling workers gives headroom without Router saturation.
```bash
# 1. Apply tuning
WORKER_CONCURRENCY=4 QUEUE_MAX_EVENTS=200 docker-compose \
  -f docker-compose.matrix-bridge-node1.yml \
  --env-file .env.soak \
  up -d --no-deps matrix-bridge-dagi
sleep 3
curl -s http://localhost:9400/health | jq '{workers: .workers, queue_max: .queue.max}'
# Expected: {"workers": 4, "queue_max": 200}
# 2. Run Step 1 soak (higher concurrency to stress the new headroom)
python3 ops/scripts/matrix_bridge_soak.py \
  --url http://localhost:9400 \
  --messages 100 \
  --concurrency 8 \
  --agent sofiia \
  --room-id "!your-room-id:your-server" \
  --max-p95-ms 3000 \
  --max-drop-rate 0.001 \
  --db-path $DB \
  --report-file /tmp/soak_step1_tune1.json
# 3. Compare Step 0 vs Step 1
python3 - <<'EOF'
import json
s0 = json.load(open('/tmp/soak_step0_baseline.json'))
s1 = json.load(open('/tmp/soak_step1_tune1.json'))
for k in ('p50', 'p95', 'p99'):
    print(f"{k}: {s0['latency'][k]}ms → {s1['latency'][k]}ms")
print(f"drops: {s0['metrics_delta']['queue_drops']} → {s1['metrics_delta']['queue_drops']}")
print(f"WAL: {s0['wal'].get('delta_mb')} → {s1['wal'].get('delta_mb')} MB delta")
EOF
```
**Decision:**
- Step 1 Green → **freeze, tag v1.0, ship to production.**
- p95 within 5% of Step 0 → Router is bottleneck (not workers); don't go to Step 2.
- Queue drops > 0 at WORKERS=4 → try Step 2 (WORKERS=8, QUEUE=300).
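The decision rule can be applied automatically to the two report files. A sketch; field names match the comparison heredoc above, and the Green check for the "ship" branch is simplified to "no drops and p95 improved":

```python
import json

def step1_decision(step0_path: str, step1_path: str) -> str:
    """Apply the Step 1 decision rule: Step 2, router bottleneck, or ship."""
    with open(step0_path) as f:
        s0 = json.load(f)
    with open(step1_path) as f:
        s1 = json.load(f)
    if s1["metrics_delta"]["queue_drops"] > 0:
        return "try Step 2 (WORKERS=8, QUEUE=300)"
    p0, p1 = s0["latency"]["p95"], s1["latency"]["p95"]
    if abs(p1 - p0) / p0 <= 0.05:  # p95 within 5% of Step 0
        return "router is the bottleneck; don't go to Step 2"
    return "freeze, tag v1.0, ship"
```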
---
## 3. Scenario A — Baseline load (100 messages, concurrency 4)
**Goal:** establish latency baseline, verify no drops under normal load.
```bash
python3 ops/scripts/matrix_bridge_soak.py \
  --url http://localhost:9400 \
  --messages 100 \
  --concurrency 4 \
  --max-p95-ms 3000 \
  --report-file /tmp/soak_baseline.json
```
**Expected output:**
```
matrix-bridge-dagi Soak Report ✅ PASSED
Messages: 100 concurrency=4
Latency: p50=<500ms p95=<3000ms
Queue drops: 0 (rate 0.000%)
Failovers: 0
```
**If FAILED:**
- `p95 too high` → check router `/health`, DeepSeek API latency, `docker stats`
- `drop_rate > 0` → check `QUEUE_MAX_EVENTS` env var (increase if needed), inspect bridge logs
---
## 4. Scenario B — Queue saturation test
**Goal:** confirm drop metric fires cleanly and bridge doesn't crash.
```bash
# Reduce queue via env override, then flood:
QUEUE_MAX_EVENTS=5 docker-compose -f docker-compose.matrix-bridge-node1.yml \
  up -d matrix-bridge-dagi
# Wait for restart
sleep 5
python3 ops/scripts/matrix_bridge_soak.py \
  --url http://localhost:9400 \
  --messages 30 \
  --concurrency 10 \
  --max-drop-rate 0.99 \
  --report-file /tmp/soak_queue_sat.json
# Restore normal queue size
docker-compose -f docker-compose.matrix-bridge-node1.yml up -d matrix-bridge-dagi
```
**Expected:** `queue_drops > 0`, bridge still running after the test.
**Verify in Prometheus/Grafana:**
```promql
rate(matrix_bridge_queue_dropped_total[1m])
```
Should spike and then return to 0.
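The drop expectation can also be asserted directly from the report file (a sketch; `metrics_delta.queue_drops` is the field read elsewhere in this runbook):

```python
import json

def saturated(report_path: str) -> bool:
    """True if the queue-saturation run actually produced drops."""
    with open(report_path) as f:
        report = json.load(f)
    return report["metrics_delta"]["queue_drops"] > 0
```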
---
## 5. Scenario C — Node failover rehearsal
**Goal:** simulate NODA1 router becoming unavailable, verify NODA2 takes over.
```bash
# Step 1: stop the router on NODA1 temporarily
docker pause dagi-router-node1
# Step 2: run soak against bridge (bridge will failover to NODA2)
python3 ops/scripts/matrix_bridge_soak.py \
  --url http://localhost:9400 \
  --messages 20 \
  --concurrency 2 \
  --max-p95-ms 10000 \
  --report-file /tmp/soak_failover.json
# Step 3: restore router
docker unpause dagi-router-node1
```
**Expected:**
```
Failovers: 1..20 (at least 1)
Sticky sets: 1+
Errors: 0 (fallback to NODA2 serves all messages)
```
**Check sticky in control room:**
```
!nodes
```
Should show `NODA2` sticky with remaining TTL.
**Check health tracker:**
```
!status
```
Should show `NODA1 state=degraded|down`.
---
## 6. Scenario D — Restart recovery
**Goal:** after restart, sticky and health state reload within one polling cycle.
```bash
# After Scenario C: sticky is set to NODA2
# Restart the bridge
docker restart dagi-matrix-bridge-node1
# Wait for startup (up to 30s)
sleep 15
# Verify sticky reloaded
curl -s http://localhost:9400/health | jq '.ha_state'
# Expected: {"sticky_loaded": N, ...}
# Verify routing still uses NODA2 sticky
python3 ops/scripts/matrix_bridge_soak.py \
  --url http://localhost:9400 \
  --messages 10 \
  --concurrency 2 \
  --report-file /tmp/soak_restart.json
```
**Expected:** p95 similar to post-failover run, `Failovers: 0` (sticky already applied).
---
## 7. Scenario E — Rate limit burst
**Goal:** verify rate limiting fires and bridge doesn't silently drop below-limit messages.
```bash
# Set RPM very low for test, then flood from same sender
# This is best done in control room by observing !status rate_limited count
# rather than the soak script (which uses different senders per message).
# In Matrix control room:
# Send 30+ messages from the same user account in quick succession in a mixed room.
# Then:
!status
# Check: rate_limited_total increased, no queue drops.
```
---
## 8. Scenario F — Policy operations under load
**Goal:** `!policy history`, `!policy change`, and `!policy export` work while messages are in-flight.
```bash
# Run a background soak
python3 ops/scripts/matrix_bridge_soak.py \
  --url http://localhost:9400 \
  --messages 200 \
  --concurrency 2 \
  --report-file /tmp/soak_concurrent_policy.json &
# While soak is running, in Matrix control room:
!policy history limit=5
!policy export
!status
```
**Expected:** all three commands respond immediately (< 2s), soak completes without extra drops.
---
## 9. Prometheus / Grafana during soak
Key queries for the Grafana dashboard:
```promql
# Throughput (messages/s)
rate(matrix_bridge_routed_total[30s])
# Error rate
rate(matrix_bridge_errors_total[30s])
# p95 invoke latency per node
histogram_quantile(0.95, rate(matrix_bridge_invoke_duration_seconds_bucket[1m]))
# Queue drops rate
rate(matrix_bridge_queue_dropped_total[1m])
# Failovers
rate(matrix_bridge_failover_total[5m])
```
Use the `matrix-bridge-dagi` Grafana dashboard at:
`ops/grafana/dashboards/matrix-bridge-dagi.json`
---
## 10. Baseline numbers (reference)
| Metric | Cold start | Warm (sticky set) |
|--------|-----------|-------------------|
| p50 latency | ~200ms | ~150ms |
| p95 latency | ~2000ms | ~1500ms |
| Queue drops | 0 (queue=100) | 0 |
| Failover fires | 1 per degradation | 0 after sticky |
| Policy ops response | < 500ms | < 500ms |
*Update this table after each soak run with actual measured values.*
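A small helper to pull the values for updating this table out of a report file (a sketch; `latency` and `metrics_delta` field names follow the rest of this runbook, and `summary.failovers` is an assumed key):

```python
import json

def measured_numbers(report_path: str) -> dict:
    """Extract the values needed to update the baseline table from a soak report."""
    with open(report_path) as f:
        r = json.load(f)
    return {
        "p50_ms": r["latency"]["p50"],
        "p95_ms": r["latency"]["p95"],
        "queue_drops": r["metrics_delta"]["queue_drops"],
        "failovers": r["summary"].get("failovers", 0),  # summary key is an assumption
    }
```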
---
## 11. CI soak (mocked, no network)
For CI pipelines, use the mocked soak scenarios:
```bash
python3 -m pytest tests/test_matrix_bridge_m11_soak_scenarios.py -v
```
Covers (all deterministic, no network):
- **S1** Queue saturation → drop counter
- **S2** Failover under load → on_failover callback, health tracker
- **S3** Sticky routing under burst → sticky set, burst routed to NODA2
- **S4** Multi-room isolation → separate rooms don't interfere
- **S5** Rate-limit burst → RL callback wired, no panic
- **S6** HA restart recovery → sticky + health snapshot persisted and reloaded
- **Perf baseline** 100-msg + 50-msg failover burst < 5s wall clock
---
## 12. Known failure modes & mitigations
| Symptom | Likely cause | Mitigation |
|---------|-------------|------------|
| `p95 > 5000ms` | Router/LLM slow | Increase `ROUTER_TIMEOUT_S`, check DeepSeek API |
| `drop_rate > 1%` | Queue too small | Increase `QUEUE_MAX_EVENTS` |
| `failovers > 0` and `errors > 0` | Both nodes degraded | Check NODA1 + NODA2 health; scale router |
| Bridge crash during soak | Memory leak / bug | `docker logs` → file GitHub issue |
| Sticky not set after failover | `FAILOVER_STICKY_TTL_S=0` | Set to 300+ |
| Restart doesn't load sticky | `HA_HEALTH_MAX_AGE_S` too small | Increase or set to 3600 |