microdao-daarion/docs/runbooks/NODE_ARCH_RECONCILIATION_PLAN_2026-02-16.md

NODE Architecture Reconciliation Plan (NODE1 + NODE3 + NODE4)

Date: 2026-02-16
Policy: Runtime-first for current state, roadmap-preserving for NODE3/NODE4.

1) Documents Confirmed (Legacy/Planning Set)

Found in worktrees (not in current main tree root):

  • .worktrees/origin-main/IMPLEMENTATION-STATUS.md
  • .worktrees/origin-main/ARCHITECTURE-150-NODES.md
  • .worktrees/origin-main/infrastructure/auth/AUTH-IMPLEMENTATION-PLAN.md
  • .worktrees/origin-main/infrastructure/matrix-gateway/README.md

Same copies found in:

  • .worktrees/docs-node1-sync/...

These files are valid architecture/program documents (dated 2026-01-10), but they do not exactly reflect the NODE1 runtime code state as of 2026-02-16.

2) Current Runtime Truth (NODE1)

  • Runtime root: /opt/microdao-daarion
  • Router/Gateway/Swapper healthy.
  • Canary suite passing:
    • ops/canary_all.sh
    • ops/canary_senpai_osr_guard.sh
  • Router endpoint contract in runtime:
    • active: POST /v1/agents/{agent_id}/infer
    • not active: POST /route

3) NODE3/NODE4 Policy (Do NOT remove from architecture)

NODE3/NODE4 remain part of target architecture and deployment plan.

Current status (observed now):

  • From laptop: 212.8.58.133:33147 and :33148 unreachable.
  • From NODE1: 212.8.58.133:8880 timeout, :33147/:33148 no route.

Interpretation:

  • This is a connectivity/runtime availability issue, not an architecture removal decision.
  • Keep NODE3/NODE4 in docs and topology as planned/temporarily_unreachable (labeled DEGRADED in the operating model in section 4).

4) Operating Model Until Connectivity Restored

Use explicit mode labeling:

  • ACTIVE: reachable and health-checked.
  • DEGRADED: included in architecture but currently unreachable.
  • DISABLED: intentionally turned off (not the case for NODE3/NODE4 now).

Current recommendation:

  • NODE1: ACTIVE
  • NODE3: DEGRADED
  • NODE4: DEGRADED
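
The three labels can be derived mechanically from observed facts instead of being assigned by hand. The sketch below is one possible mapping, under stated assumptions: the names `NodeStatus` and `classify` are hypothetical, and treating a reachable but unhealthy node as DEGRADED is an assumption, since this plan only defines DEGRADED for unreachable nodes.

```python
from enum import Enum


class NodeStatus(Enum):
    ACTIVE = "ACTIVE"
    DEGRADED = "DEGRADED"
    DISABLED = "DISABLED"


def classify(reachable: bool, healthy: bool, intentionally_off: bool = False) -> NodeStatus:
    """Map observed facts about a node to one of the three mode labels."""
    if intentionally_off:
        # An operator decision takes precedence over any probe result.
        return NodeStatus.DISABLED
    if reachable and healthy:
        return NodeStatus.ACTIVE
    # Assumption: reachable-but-unhealthy is also reported as DEGRADED.
    return NodeStatus.DEGRADED
```

With the observations recorded in section 3, `classify(reachable=False, healthy=False)` yields DEGRADED for NODE3 and NODE4, which matches the recommendation above.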

5) Reconciliation Rules

  1. Do not delete NODE3/NODE4 docs, routes, or architecture references.
  2. Make external generation dependencies conditional on reachability checks.
  3. Runtime registries/config must not advertise unavailable external agents as locally active.
  4. Keep roadmap docs (150 nodes, auth, matrix gateway) as strategic references; do not treat them as runtime contract files.
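
Rules 1 and 3 together imply a filter rather than a deletion: agents on unreachable nodes stay in the source registry but are withheld from what the runtime advertises. The following is a rough sketch of that idea. The registry shape (a list of dicts with `id` and `node` keys) and the function name `advertised_agents` are hypothetical and do not describe the actual format of the project's registry files.

```python
def advertised_agents(registry, node_status):
    """Split agent ids into (advertised, withheld) by host node status.

    registry: list of {"id": str, "node": str} entries (hypothetical shape).
    node_status: dict mapping node name to "ACTIVE", "DEGRADED", or "DISABLED".
    The registry itself is never modified, so nothing is deleted (rule 1).
    """
    advertised, withheld = [], []
    for agent in registry:
        # Unknown nodes are treated as not active, which is the safe default.
        if node_status.get(agent["node"]) == "ACTIVE":
            advertised.append(agent["id"])
        else:
            withheld.append(agent["id"])
    return advertised, withheld
```

Returning the withheld list explicitly keeps the exclusion visible, so it can be logged or shown on a status board instead of disappearing silently.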

6) Action Plan (No Risk to Production)

  1. Create a single "Architecture Status Board" document that maps:
    • planned topology (NODE1/2/3/4...)
    • current health/reachability per node
    • last verified timestamp.
  2. Add preflight checks for external node dependencies in deployment scripts:
    • TCP check
    • service health check
    • fallback behavior logging.
  3. Resolve registry drift:
    • align config/agent_registry.yml with the generated registry artifacts on the NODE1 runtime.
  4. After NODE3/NODE4 connectivity returns:
    • run connectivity proof
    • run media generation smoke test
    • switch node status from DEGRADED to ACTIVE.
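
The TCP part of the preflight in step 2 could look roughly like the sketch below, which uses only the Python standard library. It separates a timeout from other connection errors (such as refused or no route), mirroring the distinction observed in section 3, and logs when a fallback applies. The names `tcp_check` and `preflight`, the logger name, and the result strings are assumptions for illustration; the service health check is not covered because its endpoint is not specified in this plan.

```python
import logging
import socket

log = logging.getLogger("preflight")


def tcp_check(host: str, port: int, timeout: float = 3.0) -> str:
    """Return 'open', 'timeout', or 'error:<reason>' for one TCP connect attempt."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # Must be caught before OSError, since it is a subclass of it.
        return "timeout"
    except OSError as exc:
        # Covers connection refused, no route to host, DNS failures, etc.
        return f"error:{exc.strerror or exc}"


def preflight(targets: dict, timeout: float = 3.0) -> dict:
    """Check every (host, port) target and log a warning for each failure.

    targets: dict mapping a label to a (host, port) tuple.
    Returns a dict mapping each label to the tcp_check result string.
    """
    results = {}
    for name, (host, port) in targets.items():
        outcome = tcp_check(host, port, timeout)
        results[name] = outcome
        if outcome != "open":
            log.warning("%s at %s:%d not reachable (%s); fallback applies",
                        name, host, port, outcome)
    return results
```

A deployment script could call `preflight` before enabling any external generation dependency and keep a node DEGRADED until every one of its targets reports `open` and the follow-up health and smoke checks pass.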

7) Decision Summary

  • Keep NODE3/NODE4 in architecture and planning.
  • Use runtime-first truth for what is currently active.
  • Maintain explicit degraded-mode status instead of silent exclusion.