microdao-daarion/ops/runbook-matrix-bridge-soak.md
Apple 82d5ff2a4f feat(matrix-bridge-dagi): M4–M11 + soak infrastructure (debug inject endpoint)
Includes all milestones M4 through M11:
- M4: agent discovery (!agents / !status)
- M5: node-aware routing + per-node observability
- M6: dynamic policy store (node/agent overrides, import/export)
- M7: Prometheus alerts + Grafana dashboard + metrics contract
- M8: node health tracker + soft failover + sticky cache + HA persistence
- M9: two-step confirm + diff preview for dangerous commands
- M10: auto-backup, restore, retention, policy history + change detail
- M11: soak scenarios (CI tests) + live soak script

Soak infrastructure (this commit):
- POST /v1/debug/inject_event (guarded by DEBUG_INJECT_ENABLED, default false)
- _preflight_inject() and _check_wal() in soak script
- --db-path arg for WAL delta reporting
- Runbook sections 2a/2b/2c: Step 0 and Step 1 exact commands

Made-with: Cursor
2026-03-05 07:51:37 -08:00


matrix-bridge-dagi — Soak & Failure Rehearsal Runbook (M11)

Phase: M11
Applies to: matrix-bridge-dagi service on NODA1
When to run: Before any production traffic increase, after major code changes, or on a recurring monthly basis.


1. Goals

| Goal | Measurable pass criterion |
|---|---|
| Latency under load | p95 invoke < 5000 ms |
| Queue stability | drop rate < 1% |
| Failover correctness | failover fires on NODA1 outage; NODA2 serves all remaining messages |
| Sticky anti-flap | sticky set after first failover; no retries to degraded node |
| Restart recovery | sticky + health snapshot reloads within 10 s of restart |
| Policy operations safe under load | !policy history / !policy change work while messages are in-flight |

2. Prerequisites

# On NODA1 or local machine with network access to bridge
pip install httpx

# Verify bridge is up
curl -s http://localhost:9400/health | jq '.ok'
# Expected: true

# Verify /metrics endpoint
curl -s http://localhost:9400/metrics | grep matrix_bridge_up
# Expected: matrix_bridge_up{...} 1

2a. Enabling the Soak Inject Endpoint

The soak script uses POST /v1/debug/inject_event, which is disabled by default. Enable it only for staging/NODA1 soak runs:

# On NODA1 — edit docker-compose override or pass env inline:
# Option 1: temporary inline restart
DEBUG_INJECT_ENABLED=true docker-compose \
  -f docker-compose.matrix-bridge-node1.yml \
  up -d --no-deps matrix-bridge-dagi

# Option 2: .env file override
echo "DEBUG_INJECT_ENABLED=true" >> .env.soak
docker-compose --env-file .env.soak \
  -f docker-compose.matrix-bridge-node1.yml \
  up -d --no-deps matrix-bridge-dagi

# Verify it's enabled (should return 200, not 403)
curl -s -X POST http://localhost:9400/v1/debug/inject_event \
  -H 'Content-Type: application/json' \
  -d '{"room_id":"!test:test","event":{}}' | jq .
# Expected: {"ok":false,"error":"no mapping for room_id=..."}  ← 200, not 403

# IMPORTANT: disable after soak
docker-compose -f docker-compose.matrix-bridge-node1.yml up -d --no-deps matrix-bridge-dagi
# (DEBUG_INJECT_ENABLED defaults to false)

2b. Step 0 (WORKERS=2 / QUEUE=100) — Record True Baseline

Goal: snapshot the "before any tuning" numbers to have a comparison point.

# 0. Confirm current config (should be defaults)
curl -s http://localhost:9400/health | jq '{workers: .workers, queue_max: .queue.max}'
# Expected: {"workers": 2, "queue_max": 100}

# 1. DB path for WAL check (adjust to your BRIDGE_DATA_DIR)
DB=/opt/microdao-daarion/data/matrix_bridge.db

# 2. WAL size before (manual check)
ls -lh ${DB}-wal 2>/dev/null || echo "(no WAL file yet — first run)"
sqlite3 $DB "PRAGMA wal_checkpoint(PASSIVE);" 2>/dev/null || echo "(no sqlite3)"

# 3. Run Step 0 soak
python3 ops/scripts/matrix_bridge_soak.py \
  --url   http://localhost:9400 \
  --messages   100 \
  --concurrency  4 \
  --agent  sofiia \
  --room-id "!your-room-id:your-server" \
  --max-p95-ms  5000 \
  --max-drop-rate 0.001 \
  --db-path $DB \
  --report-file /tmp/soak_step0_baseline.json

# 4. Record result in "Baseline numbers" table (section 10) below.
jq '.summary, .latency, .metrics_delta, .wal' /tmp/soak_step0_baseline.json

v1 Go/No-Go thresholds for Step 0:

| Metric | Green | Yellow ⚠️ | Red |
|---|---|---|---|
| p95_invoke_ms | < 3000 | 3000–5000 | > 5000 |
| drop_rate | 0.00% (mandatory) | — | > 0.1% |
| error_rate | < 1% | 1–3% | > 3% |
| failovers | 0 | — | ≥ 1 without cause |
| WAL delta | < 2 MB | 2–10 MB | > 10 MB |

If Step 0 is Green → proceed to Step 1 tuning. If Step 0 is Yellow/Red → investigate before touching WORKER_CONCURRENCY.
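The Go/No-Go table above can be mechanized as a small checker over the soak report. This is an illustrative sketch, not part of the soak script: the field names used in the `__main__` section are assumptions based on the jq keys shown in step 4, so map them to your actual report schema.

```python
# Hypothetical grader for the Step 0 Go/No-Go table.
# The report field names in __main__ are assumptions.

def grade_step0(p95_ms: float, drop_rate: float, error_rate: float,
                failovers: int, wal_delta_mb: float) -> str:
    """Return 'green', 'yellow', or 'red' per the v1 thresholds above."""
    # Red: any hard-fail threshold crossed
    if (p95_ms > 5000 or drop_rate > 0.001 or error_rate > 0.03
            or wal_delta_mb > 10):
        return "red"
    # Yellow: degraded but not failing (drop_rate must be exactly 0 for green)
    if (p95_ms >= 3000 or error_rate >= 0.01 or failovers >= 1
            or wal_delta_mb >= 2 or drop_rate > 0):
        return "yellow"
    return "green"

if __name__ == "__main__":
    import json, sys
    path = sys.argv[1] if len(sys.argv) > 1 else "/tmp/soak_step0_baseline.json"
    r = json.load(open(path))
    # Adjust these lookups to the real report layout.
    print(grade_step0(r["latency"]["p95"], r["summary"]["drop_rate"],
                      r["summary"]["error_rate"], r["summary"]["failovers"],
                      r["wal"].get("delta_mb") or 0))
```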


2c. Step 1 (WORKERS=4 / QUEUE=200) — Tune-1

Goal: verify that doubling workers gives headroom without Router saturation.

# 1. Apply tuning
WORKER_CONCURRENCY=4 QUEUE_MAX_EVENTS=200 docker-compose \
  -f docker-compose.matrix-bridge-node1.yml \
  --env-file .env.soak \
  up -d --no-deps matrix-bridge-dagi

sleep 3
curl -s http://localhost:9400/health | jq '{workers: .workers, queue_max: .queue.max}'
# Expected: {"workers": 4, "queue_max": 200}

# 2. Run Step 1 soak (higher concurrency to stress the new headroom)
python3 ops/scripts/matrix_bridge_soak.py \
  --url   http://localhost:9400 \
  --messages   100 \
  --concurrency  8 \
  --agent  sofiia \
  --room-id "!your-room-id:your-server" \
  --max-p95-ms  3000 \
  --max-drop-rate 0.001 \
  --db-path $DB \
  --report-file /tmp/soak_step1_tune1.json

# 3. Compare Step 0 vs Step 1
python3 - <<'EOF'
import json
s0 = json.load(open('/tmp/soak_step0_baseline.json'))
s1 = json.load(open('/tmp/soak_step1_tune1.json'))
for k in ('p50', 'p95', 'p99'):
    print(f"{k}: {s0['latency'][k]}ms → {s1['latency'][k]}ms")
print(f"drops: {s0['metrics_delta']['queue_drops']} → {s1['metrics_delta']['queue_drops']}")
print(f"WAL: {s0['wal'].get('delta_mb')} → {s1['wal'].get('delta_mb')} MB delta")
EOF

Decision:

  • Step 1 Green → freeze, tag v1.0, ship to production.
  • p95 within 5% of Step 0 → Router is bottleneck (not workers); don't go to Step 2.
  • Queue drops > 0 at WORKERS=4 → try Step 2 (WORKERS=8, QUEUE=300).
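The decision bullets above can be expressed as a helper for the comparison step. A sketch only: the "within 5%" rule and the ordering of checks are taken from the bullets, and the inputs are values you pull from the two report files yourself.

```python
# Hypothetical Step 0 vs Step 1 decision helper; the rules mirror the
# bullet list above. Inputs come from the two soak report files.

def step1_decision(p95_step0_ms: float, p95_step1_ms: float,
                   queue_drops_step1: int, step1_green: bool) -> str:
    if step1_green:
        return "freeze, tag v1.0, ship to production"
    if queue_drops_step1 > 0:
        return "try Step 2 (WORKERS=8, QUEUE=300)"
    # p95 barely moved: doubling workers did not help, so the Router
    # (not worker count) is the bottleneck.
    if abs(p95_step0_ms - p95_step1_ms) <= 0.05 * p95_step0_ms:
        return "Router is bottleneck; don't go to Step 2"
    return "investigate before further tuning"
```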

3. Scenario A — Baseline load (100 messages, concurrency 4)

Goal: establish latency baseline, verify no drops under normal load.

python3 ops/scripts/matrix_bridge_soak.py \
  --url http://localhost:9400 \
  --messages 100 \
  --concurrency 4 \
  --max-p95-ms 3000 \
  --report-file /tmp/soak_baseline.json

Expected output:

matrix-bridge-dagi Soak Report  ✅ PASSED
  Messages:    100  concurrency=4
  Latency: p50=<500ms  p95=<3000ms
  Queue drops:  0  (rate 0.000%)
  Failovers:    0

If FAILED:

  • p95 too high → check router /health, DeepSeek API latency, docker stats
  • drop_rate > 0 → check QUEUE_MAX_EVENTS env var (increase if needed), inspect bridge logs
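The triage bullets above can be applied to the report programmatically. A sketch under assumptions: the input values are ones you extract from /tmp/soak_baseline.json yourself, and the default limits mirror the flags passed to the soak script in this scenario.

```python
# Hypothetical triage helper for a failed Scenario A run: returns a hint
# for each violated criterion. Defaults mirror the soak flags above.

def diagnose(p95_ms: float, drop_rate: float,
             max_p95_ms: float = 3000, max_drop_rate: float = 0.0) -> list:
    hints = []
    if p95_ms > max_p95_ms:
        hints.append("p95 too high: check router /health, DeepSeek API "
                     "latency, docker stats")
    if drop_rate > max_drop_rate:
        hints.append("drops: check QUEUE_MAX_EVENTS, inspect bridge logs")
    return hints or ["report within thresholds"]
```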

4. Scenario B — Queue saturation test

Goal: confirm drop metric fires cleanly and bridge doesn't crash.

# Reduce queue via env override, then flood:
QUEUE_MAX_EVENTS=5 docker-compose -f docker-compose.matrix-bridge-node1.yml \
  up -d matrix-bridge-dagi

# Wait for restart
sleep 5

python3 ops/scripts/matrix_bridge_soak.py \
  --url http://localhost:9400 \
  --messages 30 \
  --concurrency 10 \
  --max-drop-rate 0.99 \
  --report-file /tmp/soak_queue_sat.json

# Restore normal queue size
docker-compose -f docker-compose.matrix-bridge-node1.yml up -d matrix-bridge-dagi

Expected: queue_drops > 0, bridge still running after the test.

Verify in Prometheus/Grafana:

rate(matrix_bridge_queue_dropped_total[1m])

Should spike and then return to 0.
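The spike-and-recovery can also be confirmed from the Prometheus HTTP API instead of the Grafana UI. A sketch, assuming Prometheus is reachable at http://localhost:9090 (adjust to your deployment); the parsing follows the standard /api/v1/query instant-query response shape.

```python
# Query the queue-drop rate via the Prometheus HTTP API (instant query).
# The Prometheus URL is an assumption -- point it at your instance.
import json
import urllib.parse
import urllib.request

PROM = "http://localhost:9090"  # assumed; adjust to your deployment
QUERY = "rate(matrix_bridge_queue_dropped_total[1m])"

def first_value(api_response: dict) -> float:
    """Extract the first sample value from a /api/v1/query response."""
    results = api_response["data"]["result"]
    return float(results[0]["value"][1]) if results else 0.0

if __name__ == "__main__":
    url = f"{PROM}/api/v1/query?query={urllib.parse.quote(QUERY)}"
    with urllib.request.urlopen(url) as resp:
        # After recovery this should print 0.0
        print(first_value(json.load(resp)))
```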


5. Scenario C — Node failover rehearsal

Goal: simulate NODA1 router becoming unavailable, verify NODA2 takes over.

# Step 1: stop the router on NODA1 temporarily
docker pause dagi-router-node1

# Step 2: run soak against bridge (bridge will failover to NODA2)
python3 ops/scripts/matrix_bridge_soak.py \
  --url http://localhost:9400 \
  --messages 20 \
  --concurrency 2 \
  --max-p95-ms 10000 \
  --report-file /tmp/soak_failover.json

# Step 3: restore router
docker unpause dagi-router-node1

Expected:

  Failovers:   1..20  (at least 1)
  Sticky sets: 1+
  Errors:      0  (fallback to NODA2 serves all messages)

Check sticky in control room:

!nodes

Should show NODA2 sticky with remaining TTL.

Check health tracker:

!status

Should show NODA1 state=degraded|down.


6. Scenario D — Restart recovery

Goal: after restart, sticky and health state reload within one polling cycle.

# After Scenario C: sticky is set to NODA2
# Restart the bridge
docker restart dagi-matrix-bridge-node1

# Wait for startup (up to 30s)
sleep 15

# Verify sticky reloaded
curl -s http://localhost:9400/health | jq '.ha_state'
# Expected: {"sticky_loaded": N, ...}

# Verify routing still uses NODA2 sticky
python3 ops/scripts/matrix_bridge_soak.py \
  --url http://localhost:9400 \
  --messages 10 \
  --concurrency 2 \
  --report-file /tmp/soak_restart.json

Expected: p95 similar to post-failover run, Failovers: 0 (sticky already applied).
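Instead of a fixed sleep, you can poll /health until the HA snapshot is reloaded. A sketch: the ha_state shape (a sticky_loaded count) is an assumption based on the expected jq output above.

```python
# Poll /health until ha_state reports reloaded sticky entries, rather than
# sleeping blindly. The ha_state shape is an assumption based on the
# expected jq output above.
import json
import time
import urllib.request

def sticky_reloaded(health: dict) -> bool:
    """True once the HA snapshot (sticky entries) has been loaded."""
    return health.get("ha_state", {}).get("sticky_loaded", 0) > 0

def wait_for_recovery(url: str = "http://localhost:9400/health",
                      timeout_s: float = 30.0) -> bool:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if sticky_reloaded(json.load(resp)):
                    return True
        except OSError:
            pass  # bridge still starting; keep polling
        time.sleep(1)
    return False
```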


7. Scenario E — Rate limit burst

Goal: verify rate limiting fires and bridge doesn't silently drop below-limit messages.

# Set RPM very low for test, then flood from same sender
# This is best done in control room by observing !status rate_limited count
# rather than the soak script (which uses different senders per message).

# In Matrix control room:
# Send 30+ messages from the same user account in quick succession in a mixed room.
# Then:
!status
# Check: rate_limited_total increased, no queue drops.

8. Scenario F — Policy operations under load

Goal: !policy history, !policy change, and !policy export work while messages are in-flight.

# Run a background soak
python3 ops/scripts/matrix_bridge_soak.py \
  --url http://localhost:9400 \
  --messages 200 \
  --concurrency 2 \
  --report-file /tmp/soak_concurrent_policy.json &

# While soak is running, in Matrix control room:
!policy history limit=5
!policy export
!status

Expected: all three commands respond immediately (< 2s), soak completes without extra drops.


9. Prometheus / Grafana during soak

Key queries for the Grafana dashboard:

# Throughput (messages/s)
rate(matrix_bridge_routed_total[30s])

# Error rate
rate(matrix_bridge_errors_total[30s])

# p95 invoke latency per node
histogram_quantile(0.95, rate(matrix_bridge_invoke_duration_seconds_bucket[1m]))

# Queue drops rate
rate(matrix_bridge_queue_dropped_total[1m])

# Failovers
rate(matrix_bridge_failover_total[5m])

Use the matrix-bridge-dagi Grafana dashboard at:
ops/grafana/dashboards/matrix-bridge-dagi.json


10. Baseline numbers (reference)

| Metric | Cold start | Warm (sticky set) |
|---|---|---|
| p50 latency | ~200 ms | ~150 ms |
| p95 latency | ~2000 ms | ~1500 ms |
| Queue drops | 0 (queue=100) | 0 |
| Failover fires | 1 per degradation | 0 after sticky |
| Policy ops response | < 500 ms | < 500 ms |

Update this table after each soak run with actual measured values.
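To keep the table current without retyping numbers, rows can be generated straight from the report JSON of a cold run and a warm (post-sticky) run. A sketch only: the latency field names are assumptions based on the jq keys used in sections 2b/2c.

```python
# Hypothetical formatter: build latency rows for the baseline table from
# two soak reports. Field names are assumptions about the report schema.

def latency_rows(cold_report: dict, warm_report: dict) -> list:
    """One row per percentile, matching the table's column order."""
    rows = []
    for k in ("p50", "p95"):
        rows.append(f"{k} latency ~{cold_report['latency'][k]}ms "
                    f"~{warm_report['latency'][k]}ms")
    return rows
```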


11. CI soak (mocked, no network)

For CI pipelines, use the mocked soak scenarios:

python3 -m pytest tests/test_matrix_bridge_m11_soak_scenarios.py -v

Covers (all deterministic, no network):

  • S1 Queue saturation → drop counter
  • S2 Failover under load → on_failover callback, health tracker
  • S3 Sticky routing under burst → sticky set, burst routed to NODA2
  • S4 Multi-room isolation → separate rooms don't interfere
  • S5 Rate-limit burst → RL callback wired, no panic
  • S6 HA restart recovery → sticky + health snapshot persisted and reloaded
  • Perf baseline 100-msg + 50-msg failover burst < 5s wall clock

12. Known failure modes & mitigations

| Symptom | Likely cause | Mitigation |
|---|---|---|
| p95 > 5000 ms | Router/LLM slow | Increase ROUTER_TIMEOUT_S, check DeepSeek API |
| drop_rate > 1% | Queue too small | Increase QUEUE_MAX_EVENTS |
| failovers > 0 but errors > 0 | Both nodes degraded | Check NODA1 + NODA2 health; scale router |
| Bridge crash during soak | Memory leak / bug | docker logs → file GitHub issue |
| Sticky not set after failover | FAILOVER_STICKY_TTL_S=0 | Set to 300+ |
| Restart doesn't load sticky | HA_HEALTH_MAX_AGE_S too small | Increase or set to 3600 |