Files
microdao-daarion/docs/tasks/TASK_PHASE_NODE1_REPAIR.md
Apple a6e531a098 fix: NODE1_REPAIR - healthchecks, dependencies, SSR env, telegram gateway
TASK_PHASE_NODE1_REPAIR:
- Fix daarion-web SSR: use CITY_API_BASE_URL instead of 127.0.0.1
- Fix auth API routes: use AUTH_API_URL env var
- Add wget to Dockerfiles for healthchecks (stt, ocr, web-search, swapper, vector-db, rag)
- Update healthchecks to use wget instead of curl
- Fix vector-db-service: update torch==2.4.0, sentence-transformers==2.6.1
- Fix rag-service: correct haystack imports for v2.x
- Fix telegram-gateway: remove msg.ack() for non-JetStream NATS
- Add /health endpoint to nginx mvp-routes.conf
- Add room_role, is_public, sort_order columns to city_rooms migration
- Add TASK_PHASE_NODE1_REPAIR.md and DEPLOY_NODE1_REPAIR.md docs

Previous tasks included:
- TASK 039-044: Orchestrator rooms, Matrix chat cleanup, CrewAI integration
2025-11-29 05:17:08 -08:00

15 KiB
Raw Blame History

TASK_PHASE_NODE1_REPAIR.md

Phase name

NODE1_REPAIR — bring NODE1 to a healthy, MVP-ready state.

Goal

  1. All core services on NODE1 are running and healthy in docker ps.
  2. daarion-web serves working UI for:
    • /microdao/daarion (orchestrator room view),
    • /nodes/node-1 (NODE1 status),
    • /agents/... (agents/crew views).
  3. Telegram bot(s) can route a message through telegram-gateway → dagi-router → LLM and return a response.
  4. https://gateway.daarion.city/health returns HTTP 200.
  5. DB schema and code are aligned with the MVP product brief and room/orchestrator features (TASK 039044).

Context (facts — do not "redefine" them in code)

NODE1 (144.76.224.179):

  • docker ps shows multiple services as unhealthy or Restarting:
    • daarion-web,
    • dagi-router,
    • dagi-stt-service,
    • dagi-ocr-service,
    • dagi-web-search-service,
    • dagi-swapper-service,
    • dagi-vector-db-service,
    • dagi-rag-service.
  • Git HEAD on server = TASK 038 (no TASK 039044 applied).
  • daarion-web (Next.js) fails on SSR with:
    • connect ECONNREFUSED 127.0.0.1:80
    • It tries to fetch http://127.0.0.1:80/...
  • daarion-city-service is alive:
    • curl http://localhost:7001/health → healthy
    • But DB schema is missing new columns (e.g. room_role, is_public, sort_order) for orchestrator rooms.
  • dagi-router responds:
    • curl localhost:9102/healthok
    • Docker healthcheck runs python -c "import requests"; requests is not installed → container marked unhealthy.
  • STT/OCR/WebSearch/Swapper:
    • Healthchecks run curl inside slim images without curl installed → false unhealthy.
  • dagi-vector-db-service:
    • Keeps restarting with:
      • AttributeError: module 'torch.utils._pytree' has no attribute 'register_pytree_node'
    • Torch version is incompatible with sentence-transformers.
  • dagi-rag-service:
    • Crashes with:
      • ModuleNotFoundError: No module named 'haystack'
  • telegram-gateway:
    • Logs Temporary failure in name resolution for http://router:9102/route
      • Real service name in Docker is dagi-router, not router.
    • Logs NotJSMessageError when calling msg.ack() ack is used on a non-JetStream subject.
  • https://gateway.daarion.city/health returns 404 (SSL OK but no health endpoint).
  • Because daarion-web is unhealthy, MVP UI for NODE1 (microDAO, nodes, agents) is effectively offline.
  • Product brief requires at least six core flows live for MVP:
    • MicroDAO onboarding,
    • Public channel for guests,
    • MicroDAO chat,
    • Follow-ups,
    • Kanban tasks,
    • Private agent.

Do NOT change these facts; change code/config to fix the system.


Scope

In scope

  • Code and config changes in the main repo:
    • Dockerfiles and docker-compose.yml (and any overrides).
    • daarion-web env/SSR config.
    • daarion-city-service migrations and DB schema updates.
    • dagi-router, STT/OCR/WebSearch/Swapper healthchecks.
    • dagi-vector-db-service dependencies (Torch, sentence-transformers).
    • dagi-rag-service dependencies (Haystack).
    • telegram-gateway configuration and NATS usage.
    • Gateway /health endpoint (backend or nginx, depending on actual stack).
  • Local verification (via docker compose) + instructions for running on NODE1.

Out of scope

  • New product features beyond MVP (no new flows).
  • Large refactors of architecture.
  • Switching to a different LLM stack or DB vendor.

Prerequisites

Before editing:

  1. Inspect repo structure to locate:
    • Docker compose files (e.g. docker-compose.yml, docker-compose.prod.yml).
    • Services:
      • daarion-web,
      • daarion-city-service,
      • dagi-router,
      • dagi-stt-service,
      • dagi-ocr-service,
      • dagi-web-search-service,
      • dagi-swapper-service,
      • dagi-vector-db-service,
      • dagi-rag-service,
      • telegram-gateway,
      • gateway (or equivalent).
    • Migration tooling for daarion-city-service (Alembic / Prisma / Drizzle / etc.).
    • Existing deploy scripts:
      • scripts/deploy-prod.sh,
      • scripts/migrate-prod.sh (or equivalents).
  2. Read:
    • 01_product_brief_mvp.md — especially sections about microDAO, rooms, orchestrator, onboarding, follow-ups, Kanban, private agent.
    • docs/DEPLOY_MIGRATIONS.md or any deployment doc describing DB migrations.
    • microdao — Data Model & Event Catalog (if present in repo/docs) to understand expected DB fields for rooms.

Tasks

1. Bring codebase up to TASK 039044 (rooms / orchestrator) and align DB schema

1.1. Locate tasks 039044 (look under docs/cursor/ / docs/tasks/ / similar).

  • Identify what changes they describe:
    • new fields for rooms (e.g. room_role, is_public, sort_order),
    • any additional tables/relations required for orchestrator rooms and microDAO UI.

1.2. Implement DB/schema changes:

  • Use existing migration framework for daarion-city-service.
  • Create a new migration that:
    • adds missing columns (e.g. room_role, is_public, sort_order) to relevant tables (e.g. rooms),
    • adds any indices or constraints described in the docs,
    • is idempotent and safe to apply on existing prod DB.
  • Ensure migration can run in both dev and prod environments.

1.3. Update daarion-city-service models/ORM to match the new schema.

  • All API endpoints that return rooms/microDAO views must expose these fields (if required by frontend).

1.4. Ensure deploy pipeline uses these migrations:

  • Confirm scripts/migrate-prod.sh (or equivalent) calls the migration tool.
  • If not, update it so that running the script applies the new migration.

1.5. Add/update minimal tests:

  • Unit/integration test for room creation / listing that uses the new fields.
  • At least one test for the orchestrator room API.

2. Fix daarion-web API base URLs and SSR errors

2.1. Locate daarion-web config:

  • .env / .env.production / next.config.js / app/config.ts etc.

2.2. Define correct base URL for city-service:

For server-side calls:

CITY_API_BASE_URL=http://daarion-city-service:7001

For client-side calls (if needed):

NEXT_PUBLIC_CITY_API_BASE_URL=https://gateway.daarion.city/api
# or, for internal-only, http://daarion-city-service:7001

2.3. Update all fetch calls in daarion-web to use these env vars instead of hardcoded http://127.0.0.1:80.

  • Search for 127.0.0.1, localhost, and update to use CITY_API_BASE_URL / NEXT_PUBLIC_CITY_API_BASE_URL.
  • Ensure Next.js server components and API routes read values from process.env.

2.4. Local smoke test:

docker compose up -d daarion-city-service
docker compose up -d --build daarion-web
  • Open http://localhost:<WEB_PORT>/microdao/daarion and check there are no SSR 500 errors.
  • Check /nodes/node-1 and one of /agents/... pages.

3. Fix healthchecks for dagi-router and STT/OCR/WebSearch/Swapper

3.1. dagi-router healthcheck (Python requests)

3.1.1. Locate dagi-router Dockerfile and docker-compose service.

3.1.2. Replace healthcheck that uses python -c "import requests" with an HTTP healthcheck pointing at the service's /health endpoint.

Example docker-compose.yml snippet:

services:
  dagi-router:
    # ...
    healthcheck:
      test: ["CMD-SHELL", "wget -qO- http://localhost:9102/health || exit 1"]
      interval: 10s
      timeout: 3s
      retries: 5

3.1.3. Ensure the image has wget (or curl):

RUN apt-get update && apt-get install -y --no-install-recommends wget \
    && rm -rf /var/lib/apt/lists/*

3.2. STT/OCR/WebSearch/Swapper healthchecks (curl)

3.2.1. For each of:

  • dagi-stt-service,
  • dagi-ocr-service,
  • dagi-web-search-service,
  • dagi-swapper-service,

replace curl-based healthcheck with wget or an equivalent command that is available in the image, or add wget/curl to Dockerfile as above.

Example:

healthcheck:
  test: ["CMD-SHELL", "wget -qO- http://localhost:<PORT>/health || exit 1"]
  interval: 10s
  timeout: 3s
  retries: 5

3.2.2. Rebuild and run locally:

docker compose build dagi-router dagi-stt-service dagi-ocr-service dagi-web-search-service dagi-swapper-service
docker compose up -d dagi-router dagi-stt-service dagi-ocr-service dagi-web-search-service dagi-swapper-service
docker ps
  • Verify STATUS shows healthy after the healthcheck grace period.

4. Fix dagi-vector-db-service dependencies (Torch / sentence-transformers)

4.1. Locate Dockerfile / requirements for dagi-vector-db-service.

4.2. Update Python dependencies to a compatible set, e.g.:

RUN pip install --no-cache-dir "torch==2.4.0" "sentence-transformers==2.6.1"

(or another version pair that is known to work together).

4.3. Rebuild and run:

docker compose build dagi-vector-db-service
docker compose up -d dagi-vector-db-service
docker logs -f dagi-vector-db-service
  • Ensure there is no torch.utils._pytree error and service reaches "ready" state.
  • Add a simple /health endpoint test if not present.

5. Fix dagi-rag-service dependencies (Haystack)

5.1. Locate Dockerfile / requirements for dagi-rag-service.

5.2. Add Haystack dependency, for example:

RUN pip install --no-cache-dir "farm-haystack[all]==1.26.2"

(or the version used locally).

5.3. Rebuild and run:

docker compose build dagi-rag-service
docker compose up -d dagi-rag-service
docker logs -f dagi-rag-service
  • Confirm ModuleNotFoundError: No module named 'haystack' is gone.
  • Add/verify /health endpoint and healthcheck.

6. Fix Telegram gateway configuration and NATS usage

6.1. Router URL (DNS / service name)

6.1.1. Find telegram-gateway service in docker-compose.yml and its env/config.

6.1.2. Set correct router URL:

services:
  telegram-gateway:
    environment:
      # ...
      ROUTER_URL: http://dagi-router:9102

6.1.3. Alternatively, define network alias:

services:
  dagi-router:
    networks:
      default:
        aliases:
          - router

and keep ROUTER_URL=http://router:9102.

6.2. Avoid NotJSMessageError (msg.ack on non-JetStream)

6.2.1. Locate the code where telegram-gateway subscribes to NATS and calls msg.ack().

6.2.2. If the subject is not part of a JetStream stream, remove msg.ack():

# Before
msg = await sub.__anext__()
# ... process ...
await msg.ack()

# After (simple NATS)
msg = await sub.__anext__()
# ... process ...
# no ack for core NATS

6.2.3. If you want JetStream in the future, add TODO comments and separate task; for this phase keep it simple and working.

6.2.4. Local smoke test:

  • Start NATS, dagi-router, and telegram-gateway.
  • Simulate a message (if test tooling exists) and ensure no NotJSMessageError appears.

7. Add /health endpoint for gateway.daarion.city

Depending on implementation:

7.1. If gateway is a backend service (Node/FastAPI/etc.)

7.1.1. Add minimal endpoint:

// Node/Express example
app.get('/health', (req, res) => {
  res.status(200).json({ status: 'ok' });
});

or

# FastAPI example
@app.get("/health")
def health():
    return {"status": "ok"}

7.1.2. Ensure this endpoint is mounted at the top level of the gateway service.

7.2. If gateway.daarion.city is served via nginx

7.2.1. Update nginx config (e.g. /etc/nginx/sites-available/gateway.conf) to include:

location /health {
    return 200 'OK';
    add_header Content-Type text/plain;
}

7.2.2. Reload nginx:

nginx -t && nginx -s reload

7.2.3. Local/container test:

  • curl -k https://gateway.daarion.city/health should return HTTP 200.

8. Deployment flow for NODE1 (instructions)

Agent should prepare / update deployment docs (e.g. docs/DEPLOY_NODE1_REPAIR.md) with:

8.1. Git update on NODE1:

cd /opt/microdao-daarion
git fetch
git checkout main           # or production branch
git pull

8.2. Apply migrations:

./scripts/migrate-prod.sh   # or documented migrations command

8.3. Rebuild and restart only relevant services:

docker compose build \
  daarion-city-service \
  daarion-web \
  dagi-router \
  dagi-stt-service \
  dagi-ocr-service \
  dagi-web-search-service \
  dagi-swapper-service \
  dagi-vector-db-service \
  dagi-rag-service \
  telegram-gateway \
  gateway

docker compose up -d \
  daarion-city-service \
  daarion-web \
  dagi-router \
  dagi-stt-service \
  dagi-ocr-service \
  dagi-web-search-service \
  dagi-swapper-service \
  dagi-vector-db-service \
  dagi-rag-service \
  telegram-gateway \
  gateway

8.4. Quick docker ps check:

  • All listed services must be Up and healthy after grace period.

Acceptance checklist

Task is done when all of the following are true:

  1. Services/health

    • On NODE1, docker ps shows:

      • daarion-web, daarion-city-service,
      • dagi-router, dagi-stt-service, dagi-ocr-service,
      • dagi-web-search-service, dagi-swapper-service,
      • dagi-vector-db-service, dagi-rag-service,
      • telegram-gateway, gateway in state Up and healthy.
    • curl http://localhost:7001/health (city-service) → 200.

    • curl http://localhost:9102/health (dagi-router) → 200.

    • curl -k https://gateway.daarion.city/health → 200.

  2. DB & API

    • DB schema contains required fields for rooms (e.g. room_role, is_public, sort_order), matching Data Model & product brief.
    • Migration for these changes runs successfully on DEV and PROD.
    • API endpoints that frontend uses for microDAO/rooms/orchestrator return the new fields (where specified in docs).
  3. daarion-web UI

    • /microdao/daarion loads without SSR error and displays orchestrator/microDAO context.
    • /nodes/node-1 loads and shows NODE1 data.
    • At least one /agents/... page loads and shows crew/agents data.
    • No ECONNREFUSED 127.0.0.1:80 in daarion-web logs.
  4. Telegram routing

    • telegram-gateway uses the correct router URL (http://dagi-router:9102 or via alias router).
    • No Temporary failure in name resolution or NotJSMessageError in telegram-gateway logs under normal operation.
    • Sending a message through the Telegram bot results in a valid LLM-based reply via dagi-router.
  5. Docs

    • This task file TASK_PHASE_NODE1_REPAIR.md is saved under docs/tasks/ (or the project's task folder).
    • A short deploy how-to for NODE1 (from "git pull" to "docker compose up") is added/updated (e.g. docs/DEPLOY_NODE1_REPAIR.md).

Notes for agents

  • Prefer minimal, targeted changes over large refactors.

  • Reuse existing patterns from other services (Dockerfiles, healthchecks, migrations).

  • When in doubt which version of a library to pin (Torch, Haystack), check:

    • existing working services in this repo,
    • or the versions used in local/dev containers (if recorded in lockfiles).
  • Keep logs and errors in comments / commit messages to help future debugging.