# NODE Architecture Reconciliation Plan (NODE1 + NODE3 + NODE4)
Date: 2026-02-16
Policy: Runtime-first for current state, roadmap-preserving for NODE3/NODE4.
## 1) Documents Confirmed (Legacy/Planning Set)
Found in worktrees (not in current main tree root):
- `.worktrees/origin-main/IMPLEMENTATION-STATUS.md`
- `.worktrees/origin-main/ARCHITECTURE-150-NODES.md`
- `.worktrees/origin-main/infrastructure/auth/AUTH-IMPLEMENTATION-PLAN.md`
- `.worktrees/origin-main/infrastructure/matrix-gateway/README.md`
Identical copies found in:
- `.worktrees/docs-node1-sync/...`
These files are valid architecture/program documents (dated 2026-01-10), but they do not exactly reflect the NODE1 runtime code state as of 2026-02-16.
## 2) Current Runtime Truth (NODE1)
- Runtime root: `/opt/microdao-daarion`
- Router/Gateway/Swapper healthy.
- Canary suite passing:
  - `ops/canary_all.sh`
  - `ops/canary_senpai_osr_guard.sh`
- Router endpoint contract in runtime:
  - active: `POST /v1/agents/{agent_id}/infer`
  - not active: `POST /route`
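
The route contract above can be expressed as a small helper that callers and checks share, so nothing falls back to the inactive `POST /route` form. This is a minimal sketch, not the router's own code: the function names and the `example-agent` id are hypothetical, and only the two route shapes come from this document.

```python
from urllib.parse import quote

# Route shapes taken from the runtime contract in this section.
ACTIVE_ROUTE = ("POST", "/v1/agents/{agent_id}/infer")
INACTIVE_ROUTE = ("POST", "/route")


def infer_endpoint(agent_id: str) -> tuple[str, str]:
    """Return (method, path) for the active inference route of one agent."""
    if not agent_id:
        raise ValueError("agent_id must be non-empty")
    method, template = ACTIVE_ROUTE
    # Percent-encode the id so it always stays a single path segment.
    return method, template.format(agent_id=quote(agent_id, safe=""))


def is_active_route(method: str, path: str) -> bool:
    """True only for POST /v1/agents/<non-empty id>/infer."""
    parts = path.strip("/").split("/")
    return (
        method.upper() == "POST"
        and len(parts) == 4
        and parts[:2] == ["v1", "agents"]
        and parts[2] != ""
        and parts[3] == "infer"
    )
```

A check of this kind could flag any client or config entry that still targets `/route`; the request body is deliberately left out because its schema is not described here.
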
## 3) NODE3/NODE4 Policy (Do NOT remove from architecture)
NODE3/NODE4 remain part of target architecture and deployment plan.
Current status (observed 2026-02-16):
- From laptop: `212.8.58.133:33147` and `:33148` unreachable.
- From NODE1: `212.8.58.133:8880` timeout, `:33147/:33148` no route.
Interpretation:
- This is a connectivity/runtime availability issue, not an architecture removal decision.
- Keep NODE3/NODE4 in docs and topology as `planned/temporarily_unreachable`.
## 4) Operating Model Until Connectivity Restored
Use explicit mode labeling:
- `ACTIVE`: reachable and health-checked.
- `DEGRADED`: included in architecture but currently unreachable.
- `DISABLED`: intentionally turned off (not the case for NODE3/NODE4 now).
Current recommendation:
- NODE1: `ACTIVE`
- NODE3: `DEGRADED`
- NODE4: `DEGRADED`
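
The three labels can be derived mechanically from two observations per node: whether it is intentionally enabled, and whether it passed reachability and health checks. The sketch below is one possible encoding; the `classify` function and its parameter names are assumptions for illustration, while the label names and the current per-node values come from this section.

```python
from enum import Enum


class NodeStatus(Enum):
    ACTIVE = "ACTIVE"      # reachable and health-checked
    DEGRADED = "DEGRADED"  # in the architecture, but currently unreachable
    DISABLED = "DISABLED"  # intentionally turned off


def classify(enabled: bool, reachable: bool, healthy: bool) -> NodeStatus:
    """Map raw observations to one of the three mode labels."""
    if not enabled:
        return NodeStatus.DISABLED
    if reachable and healthy:
        return NodeStatus.ACTIVE
    # Enabled but failing a check: keep the node visible as DEGRADED.
    return NodeStatus.DEGRADED


# Current recommendation from this section, expressed as observations.
current = {
    "NODE1": classify(enabled=True, reachable=True, healthy=True),
    "NODE3": classify(enabled=True, reachable=False, healthy=False),
    "NODE4": classify(enabled=True, reachable=False, healthy=False),
}
```

Keeping `DISABLED` as a decision input rather than a check result means a network outage can never be mistaken for an intentional shutdown.
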
## 5) Reconciliation Rules
1. Do not delete NODE3/NODE4 docs, routes, or architecture references.
2. Make external generation dependencies conditional on reachability checks.
3. Runtime registries/config must not advertise unavailable external agents as locally active.
4. Keep roadmap docs (150 nodes, auth, matrix gateway) as strategic references; do not treat them as runtime contract files.
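
Rules 1 and 3 pull in opposite directions: entries must not be deleted, yet unavailable agents must not be advertised as active. One way to satisfy both is to annotate registry entries instead of filtering them. The sketch below assumes a hypothetical entry shape (`id` and `node` keys) and hypothetical agent ids; the real schema of `config/agent_registry.yml` is not described in this document.

```python
def annotate_registry(agents: list[dict], node_status: dict[str, str]) -> list[dict]:
    """Return copies of registry entries, flagged by the status of their node.

    No entry is dropped (rule 1); an entry is marked available only when
    its node is ACTIVE (rule 3). Nodes missing from node_status are
    treated as UNKNOWN and therefore not available.
    """
    annotated = []
    for agent in agents:
        status = node_status.get(agent["node"], "UNKNOWN")
        entry = dict(agent)  # copy, so the source registry is left untouched
        entry["node_status"] = status
        entry["available"] = status == "ACTIVE"
        annotated.append(entry)
    return annotated


# Hypothetical entries; agent ids are placeholders, not real registry content.
registry = [
    {"id": "local-agent", "node": "NODE1"},
    {"id": "external-generation-agent", "node": "NODE3"},
]
view = annotate_registry(registry, {"NODE1": "ACTIVE", "NODE3": "DEGRADED"})
```

Consumers that list agents would then read `available` rather than infer availability from mere presence in the registry.
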
## 6) Action Plan (No Risk to Production)
1. Create a single "Architecture Status Board" document that maps:
   - planned topology (NODE1/2/3/4...)
   - current health/reachability per node
   - last verified timestamp.
2. Add preflight checks for external node dependencies in deployment scripts:
   - TCP check
   - service health check
   - fallback behavior logging.
3. Resolve registry drift:
   - align `config/agent_registry.yml` and generated registry artifacts on NODE1 runtime.
4. After NODE3/NODE4 connectivity returns:
   - run connectivity proof
   - run media generation smoke
   - switch node status from `DEGRADED` to `ACTIVE`.
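
The preflight step in item 2 and the status board row in item 1 can share one routine: run a TCP check, then a service health check, log the fallback when either fails, and return a timestamped entry. The following is a sketch under stated assumptions. The function names and the result keys are hypothetical, and the health check is passed in as a callable because the health endpoints of NODE3/NODE4 are not specified here.

```python
import logging
import socket
from datetime import datetime, timezone
from typing import Callable

log = logging.getLogger("preflight")


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # timeouts, refused connections, "no route to host"
        return False


def preflight(
    node: str,
    host: str,
    port: int,
    health_check: Callable[[str, int], bool],
    tcp_check: Callable[[str, int], bool] = tcp_reachable,
) -> dict:
    """Check one external node and return a status board entry for it."""
    reachable = bool(tcp_check(host, port))
    # Only probe service health when the port is reachable at all.
    healthy = reachable and bool(health_check(host, port))
    status = "ACTIVE" if healthy else "DEGRADED"
    if status == "DEGRADED":
        # Fallback behavior logging: record why the node is skipped.
        log.warning(
            "%s preflight failed (tcp=%s, health=%s); continuing without it",
            node, reachable, healthy,
        )
    return {
        "node": node,
        "status": status,
        "tcp": reachable,
        "health": healthy,
        "last_verified": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }
```

A deployment script could call `preflight` once per external dependency and write the returned entries to the status board. A node moves from `DEGRADED` to `ACTIVE` only after both checks pass, which matches the sequence in item 4.
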
## 7) Decision Summary
- Keep NODE3/NODE4 in architecture and planning.
- Use runtime-first truth for what is currently active.
- Maintain explicit degraded-mode status instead of silent exclusion.