feat(matrix-bridge-dagi): M4–M11 + soak infrastructure (debug inject endpoint)

Includes all milestones M4 through M11:
- M4: agent discovery (!agents / !status)
- M5: node-aware routing + per-node observability
- M6: dynamic policy store (node/agent overrides, import/export)
- M7: Prometheus alerts + Grafana dashboard + metrics contract
- M8: node health tracker + soft failover + sticky cache + HA persistence
- M9: two-step confirm + diff preview for dangerous commands
- M10: auto-backup, restore, retention, policy history + change detail
- M11: soak scenarios (CI tests) + live soak script

Soak infrastructure (this commit):
- POST /v1/debug/inject_event (guarded by DEBUG_INJECT_ENABLED=false)
- _preflight_inject() and _check_wal() in soak script
- --db-path arg for WAL delta reporting
- Runbook sections 2a/2b/2c: Step 0 and Step 1 exact commands

Made-with: Cursor
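
The guard on the debug inject endpoint could look roughly like the sketch below. It is not taken from the bridge source: the helper names `debug_inject_enabled` and `handle_inject_event`, the set of accepted truthy strings, and the 404/202 status codes are all assumptions made for illustration. Only the flag name `DEBUG_INJECT_ENABLED` and its `false` default come from this commit.

```python
import os
from typing import Mapping, Optional, Tuple

# Assumed set of truthy spellings; the real bridge may parse the flag differently.
_TRUTHY = {"1", "true", "yes", "on"}


def debug_inject_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when DEBUG_INJECT_ENABLED is explicitly set to a truthy value."""
    env = os.environ if env is None else env
    return env.get("DEBUG_INJECT_ENABLED", "false").strip().lower() in _TRUTHY


def handle_inject_event(
    payload: dict, env: Optional[Mapping[str, str]] = None
) -> Tuple[int, dict]:
    """Hypothetical handler body for POST /v1/debug/inject_event."""
    if not debug_inject_enabled(env):
        # With the flag off, answer as if the route did not exist.
        return 404, {"error": "not found"}
    # With the flag on, acknowledge the synthetic event (real code would enqueue it).
    return 202, {"accepted": True, "event_type": payload.get("type")}
```

Defaulting to "disabled" whenever the variable is unset or unrecognised keeps a misconfigured production deployment on the safe side.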
@@ -67,6 +67,41 @@ services:
- BRIDGE_CONTROL_ROOMS=${BRIDGE_CONTROL_ROOMS:-}
# "ignore" (silent) | "reply_error" (⛔ reply to unauthorised attempts)
- CONTROL_UNAUTHORIZED_BEHAVIOR=${CONTROL_UNAUTHORIZED_BEHAVIOR:-ignore}
# ── M3.1: Runbook runner token ───────────────────────────────────────
# X-Control-Token for POST /api/runbooks/internal/runs (sofiia-console)
- SOFIIA_CONTROL_TOKEN=${SOFIIA_CONTROL_TOKEN:-}
# M3.4: Control channel safety — rate limiting + cooldown
- CONTROL_ROOM_RPM=${CONTROL_ROOM_RPM:-60}
- CONTROL_OPERATOR_RPM=${CONTROL_OPERATOR_RPM:-30}
- CONTROL_RUN_NEXT_RPM=${CONTROL_RUN_NEXT_RPM:-20}
- CONTROL_COOLDOWN_S=${CONTROL_COOLDOWN_S:-2.0}
# M2.3: Persistent event deduplication
- PERSISTENT_DEDUPE=${PERSISTENT_DEDUPE:-1}
- BRIDGE_DATA_DIR=${BRIDGE_DATA_DIR:-/app/data}
- PROCESSED_EVENTS_TTL_H=${PROCESSED_EVENTS_TTL_H:-48}
- PROCESSED_EVENTS_PRUNE_BATCH=${PROCESSED_EVENTS_PRUNE_BATCH:-5000}
- PROCESSED_EVENTS_PRUNE_INTERVAL_S=${PROCESSED_EVENTS_PRUNE_INTERVAL_S:-3600}
# M4.0: agent discovery
- DISCOVERY_RPM=${DISCOVERY_RPM:-20}
# M5.0: node-aware routing
- BRIDGE_ALLOWED_NODES=${BRIDGE_ALLOWED_NODES:-NODA1}
- BRIDGE_DEFAULT_NODE=${BRIDGE_DEFAULT_NODE:-NODA1}
- BRIDGE_ROOM_NODE_MAP=${BRIDGE_ROOM_NODE_MAP:-}
# M8.0: Node health + soft-failover thresholds
- NODE_FAIL_CONSEC=${NODE_FAIL_CONSEC:-3}
- NODE_LAT_EWMA_S=${NODE_LAT_EWMA_S:-12.0}
- NODE_EWMA_ALPHA=${NODE_EWMA_ALPHA:-0.3}
# M8.1: Sticky failover TTL (0 = disabled)
- FAILOVER_STICKY_TTL_S=${FAILOVER_STICKY_TTL_S:-300}
# M8.2: HA state persistence
- HA_HEALTH_SNAPSHOT_INTERVAL_S=${HA_HEALTH_SNAPSHOT_INTERVAL_S:-60}
- HA_HEALTH_MAX_AGE_S=${HA_HEALTH_MAX_AGE_S:-600}
# M9.0: Two-step confirmation TTL for dangerous commands (0 = disabled)
- CONFIRM_TTL_S=${CONFIRM_TTL_S:-120}
- POLICY_EXPORT_RETENTION_DAYS=${POLICY_EXPORT_RETENTION_DAYS:-30}
- POLICY_HISTORY_LIMIT=${POLICY_HISTORY_LIMIT:-100}
# M11 soak: NEVER set to true in production
- DEBUG_INJECT_ENABLED=${DEBUG_INJECT_ENABLED:-false}
# ── M2.2: Mixed room guard rails ────────────────────────────────────
# Fail-fast if any room defines more agents than this
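
As a reading aid for the M8.0 thresholds in the hunk above, here is a minimal sketch of how a consecutive-failure counter and a latency EWMA might combine into a health verdict. The class `NodeHealth`, its method names, and the exact comparison rules are assumptions, not the bridge's implementation. Only the default values (3, 12.0, 0.3) are taken from `NODE_FAIL_CONSEC`, `NODE_LAT_EWMA_S`, and `NODE_EWMA_ALPHA`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeHealth:
    """Hypothetical per-node tracker; field defaults mirror the compose defaults."""
    fail_consec: int = 3       # NODE_FAIL_CONSEC
    lat_ewma_s: float = 12.0   # NODE_LAT_EWMA_S
    alpha: float = 0.3         # NODE_EWMA_ALPHA
    consecutive_failures: int = 0
    ewma_latency_s: Optional[float] = None

    def record_success(self, latency_s: float) -> None:
        # A success resets the failure streak and folds the latency into the EWMA.
        self.consecutive_failures = 0
        if self.ewma_latency_s is None:
            self.ewma_latency_s = latency_s
        else:
            self.ewma_latency_s = (
                self.alpha * latency_s + (1 - self.alpha) * self.ewma_latency_s
            )

    def record_failure(self) -> None:
        self.consecutive_failures += 1

    def is_healthy(self) -> bool:
        # Assumed rule: unhealthy on too many consecutive failures or a slow EWMA.
        if self.consecutive_failures >= self.fail_consec:
            return False
        if self.ewma_latency_s is not None and self.ewma_latency_s > self.lat_ewma_s:
            return False
        return True
```

Under these assumptions, a node that answers in 10 s and then 20 s has an EWMA of about 13 s, which already exceeds the 12 s threshold, so a router using this rule would start preferring another allowed node.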