Swapper Service

Version: 1.0.0
Status: Ready for Node #2
Port: 8890

A dynamic model loading service that loads and unloads LLM models on demand to keep memory usage low. The default is single-active mode, in which only one model is loaded at a time.


Overview

Swapper Service provides:

  • Dynamic Model Loading — Load/unload models on-demand
  • Single-Active Mode — Only one model loaded at a time (memory optimization)
  • Model Metrics — Track uptime, request count, load/unload times
  • Ollama Integration — Works with Ollama models
  • REST API — Full API for model management

Features

Model Management

  • Load models on demand
  • Unload models to free memory
  • Track which model is currently active
  • Monitor model uptime and usage

Metrics

  • Current active model
  • Model uptime (hours)
  • Request count per model
  • Load/unload timestamps
  • Total uptime per model

Single-Active Mode

  • Only one model loaded at a time
  • Automatic unloading of previous model when loading new one
  • Optimizes memory usage on resource-constrained systems
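
The swap behaviour listed above can be pictured with a small in-memory model. This is only an illustrative sketch, not the service's actual code: the class name `SingleActiveSwapper` and its `events` log are hypothetical, and real loading and unloading would go through Ollama rather than a Python list.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SingleActiveSwapper:
    """Toy in-memory model of single-active mode (hypothetical, not the real service)."""

    active_model: Optional[str] = None
    events: List[str] = field(default_factory=list)  # ordered load/unload log

    def load(self, name: str) -> None:
        if self.active_model == name:
            return  # already active, nothing to swap
        if self.active_model is not None:
            # Free memory held by the previous model before loading the next one.
            self.unload(self.active_model)
        self.events.append(f"load:{name}")
        self.active_model = name

    def unload(self, name: str) -> None:
        # Unloading a model that is not active is a no-op.
        if self.active_model == name:
            self.events.append(f"unload:{name}")
            self.active_model = None


swapper = SingleActiveSwapper()
swapper.load("deepseek-r1-70b")
swapper.load("qwen2.5-coder-32b")  # implicitly unloads deepseek-r1-70b first
print(swapper.active_model)
print(swapper.events)
```

The point of the sketch is the ordering: the previous model is released before the new one is recorded as active, so at most one model is ever held at a time.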

Quick Start

# Build and start
docker-compose up -d --build swapper-service

# Check health
curl http://localhost:8890/health

# Get status
curl http://localhost:8890/status

# List models
curl http://localhost:8890/models
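
The same three checks can be scripted from Python using only the standard library. This is a minimal sketch: the helper names `endpoint` and `get_json` are made up for illustration, and it assumes the service is reachable on the port given in this README.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8890"  # port taken from this README


def endpoint(path: str, base_url: str = BASE_URL) -> str:
    """Join the base URL and an endpoint path without doubling slashes."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def get_json(path: str, timeout: float = 5.0) -> dict:
    """GET an endpoint and decode its JSON body (needs a running service)."""
    with urllib.request.urlopen(endpoint(path), timeout=timeout) as resp:
        return json.load(resp)
```

With the service running, `get_json("/health")`, `get_json("/status")`, and `get_json("/models")` mirror the curl commands above.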

Local Development

cd services/swapper-service

# Install dependencies
pip install -r requirements.txt

# Set environment variables
export OLLAMA_BASE_URL=http://localhost:11434
export SWAPPER_CONFIG_PATH=./config/swapper_config.yaml

# Run service
python -m app.main

API Endpoints

Health & Status

GET /health

Health check endpoint

Response:

{
  "status": "healthy",
  "service": "swapper-service",
  "active_model": "deepseek-r1-70b",
  "mode": "single-active"
}

GET /status

Get full Swapper service status

Response:

{
  "status": "healthy",
  "active_model": "deepseek-r1-70b",
  "available_models": ["deepseek-r1-70b", "qwen2.5-coder-32b", ...],
  "loaded_models": ["deepseek-r1-70b"],
  "mode": "single-active",
  "total_models": 8
}
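
A client may want to sanity-check a /status payload before acting on it. The checks below are assumptions inferred from the example response (for instance, that the active model always appears in `loaded_models`); the service itself may not guarantee them, and `check_status` is a hypothetical helper.

```python
def check_status(status: dict) -> list:
    """Return a list of consistency problems found in a /status payload."""
    problems = []
    loaded = status.get("loaded_models", [])
    available = status.get("available_models", [])
    active = status.get("active_model")
    if status.get("mode") == "single-active" and len(loaded) > 1:
        problems.append("single-active mode but more than one model is loaded")
    if active is not None and active not in loaded:
        problems.append("active_model is not listed in loaded_models")
    for name in loaded:
        if name not in available:
            problems.append(f"loaded model {name!r} is not in available_models")
    return problems


sample = {
    "status": "healthy",
    "active_model": "deepseek-r1-70b",
    "available_models": ["deepseek-r1-70b", "qwen2.5-coder-32b"],  # list shortened
    "loaded_models": ["deepseek-r1-70b"],
    "mode": "single-active",
    "total_models": 8,
}
print(check_status(sample))
```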

Model Management

GET /models

List all available models

Response:

{
  "models": [
    {
      "name": "deepseek-r1-70b",
      "ollama_name": "deepseek-r1:70b",
      "type": "llm",
      "size_gb": 42,
      "priority": "high",
      "status": "loaded"
    }
  ]
}

GET /models/{model_name}

Get information about a specific model

Response:

{
  "name": "deepseek-r1-70b",
  "ollama_name": "deepseek-r1:70b",
  "type": "llm",
  "size_gb": 42,
  "priority": "high",
  "status": "loaded",
  "loaded_at": "2025-11-22T10:30:00",
  "unloaded_at": null,
  "total_uptime_seconds": 3600.5
}

POST /models/{model_name}/load

Load a model

Response:

{
  "status": "success",
  "model": "deepseek-r1-70b",
  "message": "Model deepseek-r1-70b loaded"
}

POST /models/{model_name}/unload

Unload a model

Response:

{
  "status": "success",
  "model": "deepseek-r1-70b",
  "message": "Model deepseek-r1-70b unloaded"
}
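
Both endpoints are plain POST requests with the model name in the path. The sketch below builds such a request with the standard library; `model_action_request` is a hypothetical helper, and it assumes no request body is required, since none is documented here.

```python
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8890"  # port taken from this README


def model_action_request(model_name: str, action: str) -> urllib.request.Request:
    """Build a POST request for /models/{model_name}/load or /unload."""
    if action not in ("load", "unload"):
        raise ValueError(f"unsupported action: {action}")
    quoted = urllib.parse.quote(model_name, safe="")  # escape any unsafe characters
    return urllib.request.Request(f"{BASE_URL}/models/{quoted}/{action}", method="POST")


req = model_action_request("deepseek-r1-70b", "load")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` sends it. Loading a large model can take a while, so a generous client timeout is probably sensible.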

Metrics

GET /metrics

Get metrics for all models

Response:

{
  "metrics": [
    {
      "model_name": "deepseek-r1-70b",
      "status": "loaded",
      "loaded_at": "2025-11-22T10:30:00",
      "uptime_hours": 1.5,
      "request_count": 42,
      "total_uptime_seconds": 5400.0
    }
  ]
}

GET /metrics/{model_name}

Get metrics for a specific model

Response:

{
  "model_name": "deepseek-r1-70b",
  "status": "loaded",
  "loaded_at": "2025-11-22T10:30:00",
  "uptime_hours": 1.5,
  "request_count": 42,
  "total_uptime_seconds": 5400.0
}
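
In the example responses, `uptime_hours` equals `total_uptime_seconds` divided by 3600 (5400.0 s is 1.5 h). Whether the service always computes it this way is an assumption. The sketch below shows that conversion and a hypothetical `busiest_model` helper for picking the most-requested model out of a /metrics payload.

```python
def uptime_hours(total_uptime_seconds: float) -> float:
    """Convert an uptime in seconds to hours."""
    return total_uptime_seconds / 3600.0


def busiest_model(metrics_response: dict):
    """Return the model name with the highest request_count, or None if empty."""
    entries = metrics_response.get("metrics", [])
    if not entries:
        return None
    return max(entries, key=lambda m: m["request_count"])["model_name"]


sample = {
    "metrics": [
        {
            "model_name": "deepseek-r1-70b",
            "status": "loaded",
            "loaded_at": "2025-11-22T10:30:00",
            "uptime_hours": 1.5,
            "request_count": 42,
            "total_uptime_seconds": 5400.0,
        }
    ]
}
print(uptime_hours(sample["metrics"][0]["total_uptime_seconds"]))
print(busiest_model(sample))
```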

Configuration

Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| OLLAMA_BASE_URL | http://localhost:11434 | Ollama API URL |
| SWAPPER_CONFIG_PATH | ./config/swapper_config.yaml | Path to config file |
| SWAPPER_MODE | single-active | Mode: single-active or multi-active |
| MAX_CONCURRENT_MODELS | 1 | Max concurrent models (for multi-active mode) |
| MODEL_SWAP_TIMEOUT | 30 | Timeout for model swap (seconds) |
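
One way to read these variables with their documented defaults is sketched below. The `SwapperSettings` dataclass and `load_settings` function are hypothetical names, not the service's actual configuration code; only the variable names and defaults come from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

VALID_MODES = ("single-active", "multi-active")


@dataclass(frozen=True)
class SwapperSettings:
    ollama_base_url: str
    config_path: str
    mode: str
    max_concurrent_models: int
    model_swap_timeout: int  # seconds


def load_settings(env: Optional[Mapping[str, str]] = None) -> SwapperSettings:
    """Build settings from environment variables, using the documented defaults."""
    env = os.environ if env is None else env
    mode = env.get("SWAPPER_MODE", "single-active")
    if mode not in VALID_MODES:
        raise ValueError(f"SWAPPER_MODE must be one of {VALID_MODES}, got {mode!r}")
    return SwapperSettings(
        ollama_base_url=env.get("OLLAMA_BASE_URL", "http://localhost:11434"),
        config_path=env.get("SWAPPER_CONFIG_PATH", "./config/swapper_config.yaml"),
        mode=mode,
        max_concurrent_models=int(env.get("MAX_CONCURRENT_MODELS", "1")),
        model_swap_timeout=int(env.get("MODEL_SWAP_TIMEOUT", "30")),
    )


defaults = load_settings({})  # empty mapping: every value falls back to its default
print(defaults)
```

Accepting an explicit mapping instead of always reading `os.environ` keeps the function easy to test.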

Config File (swapper_config.yaml)

swapper:
  mode: single-active
  max_concurrent_models: 1
  model_swap_timeout: 30
  gpu_enabled: true
  metal_acceleration: true

models:
  deepseek-r1-70b:
    path: ollama:deepseek-r1:70b
    type: llm
    size_gb: 42
    priority: high
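
The `path` value appears to combine a backend prefix with the Ollama model tag, which matches the `ollama_name` field returned by GET /models. Assuming that reading is correct, the two parts can be separated by splitting on the first colon only, because the tag itself contains a colon. `split_model_path` is a hypothetical helper.

```python
def split_model_path(path: str) -> tuple:
    """Split 'backend:model-tag' on the first colon only."""
    backend, sep, name = path.partition(":")
    if not sep or not backend or not name:
        raise ValueError(f"expected '<backend>:<model>', got {path!r}")
    return backend, name


print(split_model_path("ollama:deepseek-r1:70b"))
```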

Integration with Router

Swapper Service integrates with DAGI Router through request metadata. The router request names the model to use and the Swapper Service URL:

router_request = {
    "message": "Your request",
    "mode": "chat",
    "metadata": {
        "use_llm": "specialist_vision_8b",  # Swapper will load this model
        "swapper_service": "http://swapper-service:8890"
    }
}

Monitoring

Health Check

curl http://localhost:8890/health

Prometheus Metrics (Future)

  • swapper_active_model — Currently active model
  • swapper_model_uptime_seconds — Uptime per model
  • swapper_model_requests_total — Total requests per model
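
Since these metrics are not implemented yet, the following is only a sketch of how the planned names could be rendered in the Prometheus text exposition format from a /metrics-style list. The `model` label, the metric types, and the `render_prometheus` function are all assumptions.

```python
def render_prometheus(metrics: list) -> str:
    """Render per-model metrics as Prometheus text exposition lines."""
    lines = ["# TYPE swapper_active_model gauge"]
    for m in metrics:
        name = m["model_name"]
        active = 1 if m["status"] == "loaded" else 0  # 1 marks the loaded model
        lines.append(f'swapper_active_model{{model="{name}"}} {active}')
    lines.append("# TYPE swapper_model_uptime_seconds gauge")
    for m in metrics:
        name = m["model_name"]
        lines.append(f'swapper_model_uptime_seconds{{model="{name}"}} {m["total_uptime_seconds"]}')
    lines.append("# TYPE swapper_model_requests_total counter")
    for m in metrics:
        name = m["model_name"]
        lines.append(f'swapper_model_requests_total{{model="{name}"}} {m["request_count"]}')
    return "\n".join(lines) + "\n"


sample_metrics = [
    {
        "model_name": "deepseek-r1-70b",
        "status": "loaded",
        "request_count": 42,
        "total_uptime_seconds": 5400.0,
    }
]
print(render_prometheus(sample_metrics), end="")
```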

Troubleshooting

Model won't load

# Check Ollama is running
curl http://localhost:11434/api/tags

# Check model exists in Ollama (replace model_name with the Ollama tag, e.g. deepseek-r1:70b)
curl http://localhost:11434/api/tags | grep "model_name"

# Check Swapper logs
docker logs swapper-service
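
The grep check above can also be done in Python by parsing the JSON that Ollama's /api/tags endpoint returns, which lists installed models under a `models` array with a `name` field. The helper name `model_in_tags` and the trimmed sample payload are illustrative.

```python
import json


def model_in_tags(tags_payload: dict, ollama_name: str) -> bool:
    """Return True if an Ollama /api/tags payload lists the given model tag."""
    return any(m.get("name") == ollama_name for m in tags_payload.get("models", []))


# Trimmed example payload; a real response carries more fields per model.
sample_tags = json.loads('{"models": [{"name": "deepseek-r1:70b"}]}')
print(model_in_tags(sample_tags, "deepseek-r1:70b"))
print(model_in_tags(sample_tags, "qwen2.5-coder:32b"))
```

Note that the comparison uses the Ollama tag (`deepseek-r1:70b`), not the Swapper model name (`deepseek-r1-70b`).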

Service not responding

# Check if service is running
docker ps | grep swapper-service

# Check health
curl http://localhost:8890/health

# Check logs
docker logs -f swapper-service

Differences: Swapper Service vs vLLM

Swapper Service:

  • Model loading/unloading manager
  • Single-active mode (one model at a time)
  • Memory optimization
  • Works with Ollama
  • Lightweight, simple API

vLLM:

  • High-performance inference engine
  • Continuous serving (models stay loaded)
  • Optimized for throughput
  • Direct GPU acceleration
  • More complex, production-grade

Use Swapper when:

  • Memory is limited
  • Need to switch between models frequently
  • Running on resource-constrained systems (such as the Node #2 MacBook)

Use vLLM when:

  • Need maximum throughput
  • Models stay loaded for long periods
  • Have dedicated GPU resources
  • Production serving at scale

Next Steps

  1. Add to Node #2 Admin Console
    • Display active model
    • Show model metrics (uptime, requests)
    • Allow manual model loading/unloading

  2. Integration with Router
    • Auto-load models based on request type
    • Route requests to appropriate models

  3. Metrics Dashboard
    • Grafana dashboard for Swapper metrics
    • Model usage analytics

Last Updated: 2025-11-22
Maintained by: Ivan Tytar & DAARION Team