feat(matrix-bridge-dagi): M4–M11 + soak infrastructure (debug inject endpoint)

Includes all milestones M4 through M11:
- M4: agent discovery (!agents / !status)
- M5: node-aware routing + per-node observability
- M6: dynamic policy store (node/agent overrides, import/export)
- M7: Prometheus alerts + Grafana dashboard + metrics contract
- M8: node health tracker + soft failover + sticky cache + HA persistence
- M9: two-step confirm + diff preview for dangerous commands
- M10: auto-backup, restore, retention, policy history + change detail
- M11: soak scenarios (CI tests) + live soak script
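
As a rough illustration of the M8 soft-failover idea, the sketch below tracks consecutive failures and an exponentially weighted moving average (EWMA) of latency for one node. The class name, method names, and the strict ">" comparison are assumptions made for illustration; only the three default values mirror NODE_FAIL_CONSEC, NODE_LAT_EWMA_S, and NODE_EWMA_ALPHA from the compose file, and the bridge's actual tracker may work differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeHealth:
    """Hypothetical per-node health state (not the bridge's actual class)."""

    fail_consec: int = 3       # mirrors the NODE_FAIL_CONSEC default
    lat_ewma_s: float = 12.0   # mirrors the NODE_LAT_EWMA_S default
    alpha: float = 0.3         # mirrors the NODE_EWMA_ALPHA default
    consecutive_failures: int = 0
    ewma_latency_s: Optional[float] = None

    def record_success(self, latency_s: float) -> None:
        # A success resets the failure streak and folds the new sample
        # into the EWMA; the first sample seeds the average directly.
        self.consecutive_failures = 0
        if self.ewma_latency_s is None:
            self.ewma_latency_s = latency_s
        else:
            self.ewma_latency_s = (
                self.alpha * latency_s + (1 - self.alpha) * self.ewma_latency_s
            )

    def record_failure(self) -> None:
        self.consecutive_failures += 1

    def is_degraded(self) -> bool:
        # Assumed rule: degraded after N consecutive failures, or when
        # the smoothed latency exceeds the threshold.
        too_many_failures = self.consecutive_failures >= self.fail_consec
        too_slow = (
            self.ewma_latency_s is not None
            and self.ewma_latency_s > self.lat_ewma_s
        )
        return too_many_failures or too_slow
```

With alpha at 0.3, a single slow response moves the average only partway toward the new sample, so a brief latency spike on a fast node is less likely to trigger failover than a sustained slowdown.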

Soak infrastructure (this commit):
- POST /v1/debug/inject_event (disabled by default; requires DEBUG_INJECT_ENABLED=true)
- _preflight_inject() and _check_wal() in soak script
- --db-path arg for WAL delta reporting
- Runbook sections 2a/2b/2c: Step 0 and Step 1 exact commands
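
One way the DEBUG_INJECT_ENABLED guard could be sketched is shown below. The function names, the accepted truthy strings, the 403 status code, and the response bodies are all assumptions for illustration; the commit only establishes that the endpoint is gated by the flag and that the flag defaults to false.

```python
import os
from typing import Mapping, Tuple

# Assumed set of strings treated as "enabled"; anything else means disabled.
_TRUTHY = {"1", "true", "yes", "on"}


def debug_inject_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Return True only when the flag is explicitly set to a truthy value."""
    raw = env.get("DEBUG_INJECT_ENABLED", "false")
    return raw.strip().lower() in _TRUTHY


def handle_inject(payload: dict, env: Mapping[str, str] = os.environ) -> Tuple[int, dict]:
    """Hypothetical handler: refuse the request unless the flag is enabled."""
    if not debug_inject_enabled(env):
        return 403, {"error": "debug inject disabled"}
    return 200, {"accepted": True, "event": payload}
```

Treating an unset or unrecognised value as disabled keeps the endpoint closed when the variable is missing or misspelled, which matches the "never in production" intent of the flag.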

Made-with: Cursor
Apple
2026-03-05 07:51:37 -08:00
parent fe6e3d30ae
commit 82d5ff2a4f
21 changed files with 9123 additions and 93 deletions

@@ -67,6 +67,41 @@ services:
- BRIDGE_CONTROL_ROOMS=${BRIDGE_CONTROL_ROOMS:-}
# "ignore" (silent) | "reply_error" (⛔ reply to unauthorised attempts)
- CONTROL_UNAUTHORIZED_BEHAVIOR=${CONTROL_UNAUTHORIZED_BEHAVIOR:-ignore}
# ── M3.1: Runbook runner token ───────────────────────────────────────
# X-Control-Token for POST /api/runbooks/internal/runs (sofiia-console)
- SOFIIA_CONTROL_TOKEN=${SOFIIA_CONTROL_TOKEN:-}
# M3.4: Control channel safety — rate limiting + cooldown
- CONTROL_ROOM_RPM=${CONTROL_ROOM_RPM:-60}
- CONTROL_OPERATOR_RPM=${CONTROL_OPERATOR_RPM:-30}
- CONTROL_RUN_NEXT_RPM=${CONTROL_RUN_NEXT_RPM:-20}
- CONTROL_COOLDOWN_S=${CONTROL_COOLDOWN_S:-2.0}
# M2.3: Persistent event deduplication
- PERSISTENT_DEDUPE=${PERSISTENT_DEDUPE:-1}
- BRIDGE_DATA_DIR=${BRIDGE_DATA_DIR:-/app/data}
- PROCESSED_EVENTS_TTL_H=${PROCESSED_EVENTS_TTL_H:-48}
- PROCESSED_EVENTS_PRUNE_BATCH=${PROCESSED_EVENTS_PRUNE_BATCH:-5000}
- PROCESSED_EVENTS_PRUNE_INTERVAL_S=${PROCESSED_EVENTS_PRUNE_INTERVAL_S:-3600}
# M4.0: agent discovery
- DISCOVERY_RPM=${DISCOVERY_RPM:-20}
# M5.0: node-aware routing
- BRIDGE_ALLOWED_NODES=${BRIDGE_ALLOWED_NODES:-NODA1}
- BRIDGE_DEFAULT_NODE=${BRIDGE_DEFAULT_NODE:-NODA1}
- BRIDGE_ROOM_NODE_MAP=${BRIDGE_ROOM_NODE_MAP:-}
# M8.0: Node health + soft-failover thresholds
- NODE_FAIL_CONSEC=${NODE_FAIL_CONSEC:-3}
- NODE_LAT_EWMA_S=${NODE_LAT_EWMA_S:-12.0}
- NODE_EWMA_ALPHA=${NODE_EWMA_ALPHA:-0.3}
# M8.1: Sticky failover TTL (0 = disabled)
- FAILOVER_STICKY_TTL_S=${FAILOVER_STICKY_TTL_S:-300}
# M8.2: HA state persistence
- HA_HEALTH_SNAPSHOT_INTERVAL_S=${HA_HEALTH_SNAPSHOT_INTERVAL_S:-60}
- HA_HEALTH_MAX_AGE_S=${HA_HEALTH_MAX_AGE_S:-600}
# M9.0: Two-step confirmation TTL for dangerous commands (0 = disabled)
- CONFIRM_TTL_S=${CONFIRM_TTL_S:-120}
- POLICY_EXPORT_RETENTION_DAYS=${POLICY_EXPORT_RETENTION_DAYS:-30}
- POLICY_HISTORY_LIMIT=${POLICY_HISTORY_LIMIT:-100}
# M11 soak: NEVER set to true in production
- DEBUG_INJECT_ENABLED=${DEBUG_INJECT_ENABLED:-false}
# ── M2.2: Mixed room guard rails ────────────────────────────────────
# Fail-fast if any room defines more agents than this