feat(matrix-bridge-dagi): M4–M11 + soak infrastructure (debug inject endpoint)
Includes all milestones M4 through M11:
- M4: agent discovery (!agents / !status)
- M5: node-aware routing + per-node observability
- M6: dynamic policy store (node/agent overrides, import/export)
- M7: Prometheus alerts + Grafana dashboard + metrics contract
- M8: node health tracker + soft failover + sticky cache + HA persistence
- M9: two-step confirm + diff preview for dangerous commands
- M10: auto-backup, restore, retention, policy history + change detail
- M11: soak scenarios (CI tests) + live soak script

Soak infrastructure (this commit):
- POST /v1/debug/inject_event (disabled by default: DEBUG_INJECT_ENABLED=false)
- _preflight_inject() and _check_wal() in soak script
- --db-path arg for WAL delta reporting
- Runbook sections 2a/2b/2c: Step 0 and Step 1 exact commands

Made-with: Cursor
ops/runbook-matrix-bridge-soak.md (new file, 401 lines)
# matrix-bridge-dagi — Soak & Failure Rehearsal Runbook (M11)

**Phase:** M11
**Applies to:** `matrix-bridge-dagi` service on NODA1
**When to run:** Before any production traffic increase, after major code changes, or on a recurring monthly basis.

---
## 1. Goals

| Goal | Measurable pass criterion |
|------|--------------------------|
| Latency under load | p95 invoke < 5000 ms |
| Queue stability | drop rate < 1% |
| Failover correctness | failover fires on NODA1 outage; NODA2 serves all remaining messages |
| Sticky anti-flap | sticky set after first failover; no retries to degraded node |
| Restart recovery | sticky + health snapshot reloads within 10 s of restart |
| Policy operations safe under load | `!policy history` / `!policy change` work while messages are in-flight |

---
## 2. Prerequisites

```bash
# On NODA1 or a local machine with network access to the bridge
pip install httpx

# Verify bridge is up
curl -s http://localhost:9400/health | jq '.ok'
# Expected: true

# Verify /metrics endpoint
curl -s http://localhost:9400/metrics | grep matrix_bridge_up
# Expected: matrix_bridge_up{...} 1
```

---
## 2a. Enabling the Soak Inject Endpoint

The soak script uses `POST /v1/debug/inject_event`, which is **disabled by default**.
Enable it only for staging/NODA1 soak runs:

```bash
# On NODA1 — edit the docker-compose override or pass the env inline:

# Option 1: temporary inline restart
DEBUG_INJECT_ENABLED=true docker-compose \
  -f docker-compose.matrix-bridge-node1.yml \
  up -d --no-deps matrix-bridge-dagi

# Option 2: .env file override
echo "DEBUG_INJECT_ENABLED=true" >> .env.soak
docker-compose --env-file .env.soak \
  -f docker-compose.matrix-bridge-node1.yml \
  up -d --no-deps matrix-bridge-dagi

# Verify it's enabled (should return 200, not 403)
curl -s -X POST http://localhost:9400/v1/debug/inject_event \
  -H 'Content-Type: application/json' \
  -d '{"room_id":"!test:test","event":{}}' | jq .
# Expected: {"ok":false,"error":"no mapping for room_id=..."} ← 200, not 403

# IMPORTANT: disable after soak
docker-compose -f docker-compose.matrix-bridge-node1.yml up -d --no-deps matrix-bridge-dagi
# (DEBUG_INJECT_ENABLED defaults to false)
```
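The soak script's `_preflight_inject()` runs an equivalent probe before starting a run. A standalone sketch using only the stdlib — the real helper's behavior may differ, and the 403-when-disabled / 200-when-enabled contract is taken from the `curl` check above:

```python
import json
from urllib import request, error

def preflight_inject(base_url: str = "http://localhost:9400") -> bool:
    """Probe /v1/debug/inject_event: True if the endpoint answers,
    False if it is disabled (403) or the bridge is unreachable."""
    req = request.Request(
        f"{base_url}/v1/debug/inject_event",
        data=json.dumps({"room_id": "!test:test", "event": {}}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with request.urlopen(req, timeout=5):
            return True                 # 200 — endpoint enabled
    except error.HTTPError as e:
        return e.code != 403            # 403 → DEBUG_INJECT_ENABLED still false
    except OSError:
        return False                    # bridge unreachable
```

Run it before flipping any tuning knobs; a `False` here means the rest of the soak commands in 2b/2c would fail anyway.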
---
## 2b. Step 0 (WORKERS=2 / QUEUE=100) — Record True Baseline

**Goal:** snapshot the "before any tuning" numbers to have a comparison point.

```bash
# 0. Confirm current config (should be defaults)
curl -s http://localhost:9400/health | jq '{workers: .workers, queue_max: .queue.max}'
# Expected: {"workers": 2, "queue_max": 100}

# 1. DB path for WAL check (adjust to your BRIDGE_DATA_DIR)
DB=/opt/microdao-daarion/data/matrix_bridge.db

# 2. WAL size before (manual check)
ls -lh ${DB}-wal 2>/dev/null || echo "(no WAL file yet — first run)"
sqlite3 $DB "PRAGMA wal_checkpoint(PASSIVE);" 2>/dev/null || echo "(no sqlite3)"

# 3. Run Step 0 soak
python3 ops/scripts/matrix_bridge_soak.py \
  --url http://localhost:9400 \
  --messages 100 \
  --concurrency 4 \
  --agent sofiia \
  --room-id "!your-room-id:your-server" \
  --max-p95-ms 5000 \
  --max-drop-rate 0.001 \
  --db-path $DB \
  --report-file /tmp/soak_step0_baseline.json

# 4. Record result in the "Baseline numbers" table (section 10) below.
jq '.summary, .latency, .metrics_delta, .wal' /tmp/soak_step0_baseline.json
```

**v1 Go/No-Go thresholds for Step 0:**

| Metric | Green ✅ | Yellow ⚠️ | Red ❌ |
|--------|---------|-----------|-------|
| `p95_invoke_ms` | < 3000 | 3000–5000 | > 5000 |
| `drop_rate` | 0.00% (mandatory) | — | > 0.1% |
| `error_rate` | < 1% | 1–3% | > 3% |
| `failovers` | 0 | — | ≥ 1 without cause |
| WAL delta | < 2 MB | 2–10 MB | > 10 MB |

**If Step 0 is Green → proceed to Step 1 tuning.**
**If Step 0 is Yellow/Red → investigate before touching WORKER_CONCURRENCY.**
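The thresholds can be applied mechanically to the report JSON written by `--report-file`. A minimal sketch, assuming the report exposes `summary.messages`, `latency.p95`, `metrics_delta`, and `wal.delta_mb` — field names inferred from the `jq` and comparison snippets in this runbook, not confirmed against the script:

```python
import json
import sys

def classify(report: dict) -> str:
    """Map a soak report onto the Step 0 Go/No-Go table (field names assumed)."""
    msgs = max(report.get("summary", {}).get("messages", 1), 1)
    p95 = report["latency"]["p95"]                              # ms
    drop_rate = report["metrics_delta"]["queue_drops"] / msgs
    err_rate = report["metrics_delta"].get("errors", 0) / msgs
    failovers = report["metrics_delta"].get("failovers", 0)     # table: >= 1 "without cause" is Red
    wal_mb = (report.get("wal") or {}).get("delta_mb") or 0

    if p95 > 5000 or drop_rate > 0.001 or err_rate > 0.03 or failovers >= 1 or wal_mb > 10:
        return "RED"
    if p95 >= 3000 or drop_rate > 0 or err_rate >= 0.01 or wal_mb >= 2:
        return "YELLOW"
    return "GREEN"

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "/tmp/soak_step0_baseline.json"
    print(classify(json.load(open(path))))
```

Any drop at all is treated as non-Green here, matching the "0.00% (mandatory)" cell; only a rate above 0.1% escalates to Red.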
---
## 2c. Step 1 (WORKERS=4 / QUEUE=200) — Tune-1

**Goal:** verify that doubling workers gives headroom without Router saturation.

```bash
# 1. Apply tuning
WORKER_CONCURRENCY=4 QUEUE_MAX_EVENTS=200 docker-compose \
  -f docker-compose.matrix-bridge-node1.yml \
  --env-file .env.soak \
  up -d --no-deps matrix-bridge-dagi

sleep 3
curl -s http://localhost:9400/health | jq '{workers: .workers, queue_max: .queue.max}'
# Expected: {"workers": 4, "queue_max": 200}

# 2. Run Step 1 soak (higher concurrency to stress the new headroom)
python3 ops/scripts/matrix_bridge_soak.py \
  --url http://localhost:9400 \
  --messages 100 \
  --concurrency 8 \
  --agent sofiia \
  --room-id "!your-room-id:your-server" \
  --max-p95-ms 3000 \
  --max-drop-rate 0.001 \
  --db-path $DB \
  --report-file /tmp/soak_step1_tune1.json

# 3. Compare Step 0 vs Step 1
python3 - <<'EOF'
import json
s0 = json.load(open('/tmp/soak_step0_baseline.json'))
s1 = json.load(open('/tmp/soak_step1_tune1.json'))
for k in ('p50', 'p95', 'p99'):
    print(f"{k}: {s0['latency'][k]}ms → {s1['latency'][k]}ms")
print(f"drops: {s0['metrics_delta']['queue_drops']} → {s1['metrics_delta']['queue_drops']}")
print(f"WAL: {s0['wal'].get('delta_mb')} → {s1['wal'].get('delta_mb')} MB delta")
EOF
```

**Decision:**

- Step 1 Green → **freeze, tag v1.0, ship to production.**
- p95 within 5% of Step 0 → Router is the bottleneck (not workers); don't go to Step 2.
- Queue drops > 0 at WORKERS=4 → try Step 2 (WORKERS=8, QUEUE=300).
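The decision rules can be sketched as a small check over the two report files, using the same assumed field names as the comparison heredoc:

```python
def decide(step0: dict, step1: dict) -> str:
    """Apply the Step 1 decision rules to two soak reports (field names assumed)."""
    p95_0 = step0["latency"]["p95"]
    p95_1 = step1["latency"]["p95"]
    drops_1 = step1["metrics_delta"]["queue_drops"]

    # Drops under the doubled worker count dominate the decision.
    if drops_1 > 0:
        return "try Step 2 (WORKERS=8, QUEUE=300)"
    # p95 within 5% of Step 0 means workers were never the constraint.
    if abs(p95_1 - p95_0) <= 0.05 * p95_0:
        return "Router is the bottleneck — don't go to Step 2"
    return "freeze, tag v1.0, ship to production"
```

This is a sketch of the rules above, not part of the soak script; it assumes a Green Step 1 run when neither of the first two conditions fires.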
---
## 3. Scenario A — Baseline load (100 messages, concurrency 4)

**Goal:** establish a latency baseline and verify no drops under normal load.

```bash
python3 ops/scripts/matrix_bridge_soak.py \
  --url http://localhost:9400 \
  --messages 100 \
  --concurrency 4 \
  --max-p95-ms 3000 \
  --report-file /tmp/soak_baseline.json
```

**Expected output:**
```
matrix-bridge-dagi Soak Report ✅ PASSED
Messages: 100 concurrency=4
Latency: p50=<500ms p95=<3000ms
Queue drops: 0 (rate 0.000%)
Failovers: 0
```

**If FAILED:**
- `p95 too high` → check router `/health`, DeepSeek API latency, `docker stats`
- `drop_rate > 0` → check the `QUEUE_MAX_EVENTS` env var (increase if needed), inspect bridge logs

---
## 4. Scenario B — Queue saturation test

**Goal:** confirm the drop metric fires cleanly and the bridge doesn't crash.

```bash
# Reduce the queue via env override, then flood:
QUEUE_MAX_EVENTS=5 docker-compose -f docker-compose.matrix-bridge-node1.yml \
  up -d matrix-bridge-dagi

# Wait for restart
sleep 5

python3 ops/scripts/matrix_bridge_soak.py \
  --url http://localhost:9400 \
  --messages 30 \
  --concurrency 10 \
  --max-drop-rate 0.99 \
  --report-file /tmp/soak_queue_sat.json

# Restore normal queue size
docker-compose -f docker-compose.matrix-bridge-node1.yml up -d matrix-bridge-dagi
```

**Expected:** `queue_drops > 0`, bridge still running after the test.

**Verify in Prometheus/Grafana:**
```promql
rate(matrix_bridge_queue_dropped_total[1m])
```
Should spike and then return to 0.

---
## 5. Scenario C — Node failover rehearsal

**Goal:** simulate the NODA1 router becoming unavailable and verify NODA2 takes over.

```bash
# Step 1: pause the router on NODA1 temporarily
docker pause dagi-router-node1

# Step 2: run soak against the bridge (it will fail over to NODA2)
python3 ops/scripts/matrix_bridge_soak.py \
  --url http://localhost:9400 \
  --messages 20 \
  --concurrency 2 \
  --max-p95-ms 10000 \
  --report-file /tmp/soak_failover.json

# Step 3: restore the router
docker unpause dagi-router-node1
```

**Expected:**
```
Failovers: 1..20 (at least 1)
Sticky sets: 1+
Errors: 0 (fallback to NODA2 serves all messages)
```

**Check sticky in the control room:**
```
!nodes
```
Should show `NODA2` sticky with remaining TTL.

**Check the health tracker:**
```
!status
```
Should show `NODA1 state=degraded|down`.

---
## 6. Scenario D — Restart recovery

**Goal:** after a restart, sticky and health state reload within one polling cycle.

```bash
# After Scenario C: sticky is set to NODA2
# Restart the bridge
docker restart dagi-matrix-bridge-node1

# Wait for startup (up to 30s)
sleep 15

# Verify sticky reloaded
curl -s http://localhost:9400/health | jq '.ha_state'
# Expected: {"sticky_loaded": N, ...}

# Verify routing still uses the NODA2 sticky
python3 ops/scripts/matrix_bridge_soak.py \
  --url http://localhost:9400 \
  --messages 10 \
  --concurrency 2 \
  --report-file /tmp/soak_restart.json
```

**Expected:** p95 similar to the post-failover run, `Failovers: 0` (sticky already applied).

---
## 7. Scenario E — Rate limit burst

**Goal:** verify rate limiting fires and the bridge doesn't silently drop below-limit messages.

```bash
# Set RPM very low for the test, then flood from the same sender.
# This is best done in the control room by observing the !status rate_limited count
# rather than with the soak script (which uses different senders per message).

# In the Matrix control room:
# Send 30+ messages from the same user account in quick succession in a mixed room.
# Then:
!status
# Check: rate_limited_total increased, no queue drops.
```
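If the inject endpoint from section 2a is enabled, the same-sender burst can also be driven programmatically instead of by hand. The event payload shape below (Matrix-style `sender`/`type`/`content` fields) and the `@soak-user` account are assumptions — match them to whatever the soak script actually sends:

```python
BRIDGE = "http://localhost:9400"
ROOM = "!your-room-id:your-server"   # same placeholder as the soak commands
SENDER = "@soak-user:your-server"    # hypothetical single test account

def make_event(i: int) -> dict:
    """Assumed Matrix-style event shape; adjust to the bridge's real schema."""
    return {
        "sender": SENDER,
        "type": "m.room.message",
        "content": {"msgtype": "m.text", "body": f"rate-limit probe {i}"},
    }

def burst_same_sender(n: int = 30) -> list:
    """POST n events from one sender in quick succession; returns HTTP status codes."""
    import httpx  # prerequisite from section 2 (pip install httpx)
    with httpx.Client(timeout=10) as client:
        return [
            client.post(f"{BRIDGE}/v1/debug/inject_event",
                        json={"room_id": ROOM, "event": make_event(i)}).status_code
            for i in range(n)
        ]
```

Afterwards check `!status` as above: `rate_limited_total` should have increased with no queue drops.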
---
## 8. Scenario F — Policy operations under load

**Goal:** `!policy history`, `!policy change`, and `!policy export` work while messages are in-flight.

```bash
# Run a background soak
python3 ops/scripts/matrix_bridge_soak.py \
  --url http://localhost:9400 \
  --messages 200 \
  --concurrency 2 \
  --report-file /tmp/soak_concurrent_policy.json &

# While the soak is running, in the Matrix control room:
!policy history limit=5
!policy export
!status
```

**Expected:** all three commands respond immediately (< 2s), and the soak completes without extra drops.

---
## 9. Prometheus / Grafana during soak

Key queries for the Grafana dashboard:

```promql
# Throughput (messages/s)
rate(matrix_bridge_routed_total[30s])

# Error rate
rate(matrix_bridge_errors_total[30s])

# p95 invoke latency per node
histogram_quantile(0.95, rate(matrix_bridge_invoke_duration_seconds_bucket[1m]))

# Queue drop rate
rate(matrix_bridge_queue_dropped_total[1m])

# Failovers
rate(matrix_bridge_failover_total[5m])
```

Use the `matrix-bridge-dagi` Grafana dashboard at:
`ops/grafana/dashboards/matrix-bridge-dagi.json`

---
## 10. Baseline numbers (reference)

| Metric | Cold start | Warm (sticky set) |
|--------|-----------|-------------------|
| p50 latency | ~200 ms | ~150 ms |
| p95 latency | ~2000 ms | ~1500 ms |
| Queue drops | 0 (queue=100) | 0 |
| Failover fires | 1 per degradation | 0 after sticky |
| Policy ops response | < 500 ms | < 500 ms |

*Update this table after each soak run with actual measured values.*
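To keep the table honest, a row can be generated straight from a report file rather than transcribed by hand. Field names are assumed, matching the snippets in sections 2b/2c:

```python
import json

def baseline_row(label: str, report: dict) -> str:
    """Format one measured row for the baseline table (report field names assumed)."""
    lat = report["latency"]
    drops = report["metrics_delta"]["queue_drops"]
    failovers = report["metrics_delta"].get("failovers", 0)
    return (f"| {label} | p50 {lat['p50']}ms | p95 {lat['p95']}ms "
            f"| drops {drops} | failovers {failovers} |")

# Example usage after a run:
# print(baseline_row("Step 0", json.load(open("/tmp/soak_step0_baseline.json"))))
```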
---
## 11. CI soak (mocked, no network)

For CI pipelines, use the mocked soak scenarios:

```bash
python3 -m pytest tests/test_matrix_bridge_m11_soak_scenarios.py -v
```

Covers (all deterministic, no network):

- **S1** Queue saturation → drop counter
- **S2** Failover under load → on_failover callback, health tracker
- **S3** Sticky routing under burst → sticky set, burst routed to NODA2
- **S4** Multi-room isolation → separate rooms don't interfere
- **S5** Rate-limit burst → RL callback wired, no panic
- **S6** HA restart recovery → sticky + health snapshot persisted and reloaded
- **Perf baseline** 100-msg + 50-msg failover burst < 5s wall clock

---
## 12. Known failure modes & mitigations

| Symptom | Likely cause | Mitigation |
|---------|-------------|------------|
| `p95 > 5000ms` | Router/LLM slow | Increase `ROUTER_TIMEOUT_S`, check DeepSeek API |
| `drop_rate > 1%` | Queue too small | Increase `QUEUE_MAX_EVENTS` |
| `failovers > 0` and errors > 0 | Both nodes degraded | Check NODA1 + NODA2 health; scale router |
| Bridge crash during soak | Memory leak / bug | `docker logs` → file GitHub issue |
| Sticky not set after failover | `FAILOVER_STICKY_TTL_S=0` | Set to 300+ |
| Restart doesn't load sticky | `HA_HEALTH_MAX_AGE_S` too small | Increase or set to 3600 |