# Swapper Service
**Version:** 1.0.0
**Status:** ✅ Ready for Node #2
**Port:** 8890
A dynamic model-loading service that loads and unloads LLM models on demand to optimize memory usage. Supports single-active mode (one model loaded at a time).
---
## Overview
Swapper Service provides:
- **Dynamic Model Loading** — Load/unload models on demand
- **Single-Active Mode** — Only one model loaded at a time (memory optimization)
- **Model Metrics** — Track uptime, request count, load/unload times
- **Ollama Integration** — Works with Ollama models
- **REST API** — Full API for model management
---
## Features
### Model Management
- Load models on demand
- Unload models to free memory
- Track which model is currently active
- Monitor model uptime and usage
### Metrics
- Current active model
- Model uptime (hours)
- Request count per model
- Load/unload timestamps
- Total uptime per model
### Single-Active Mode
- Only one model loaded at a time
- Automatic unloading of the previous model when loading a new one (see the sketch below)
- Optimizes memory usage on resource-constrained systems
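
Conceptually, the swap logic is small. The sketch below illustrates the single-active invariant; the class, the `ollama_client` interface, and the method names are hypothetical, not the service's actual internals:

```python
import asyncio

class SingleActiveSwapper:
    """Illustrative only: keeps at most one model loaded at a time."""

    def __init__(self, ollama_client):
        self.ollama = ollama_client  # assumed client that can load/unload models
        self.active_model = None     # name of the currently loaded model
        self._lock = asyncio.Lock()  # serialize swaps so two loads can't race

    async def ensure_loaded(self, model_name: str) -> None:
        async with self._lock:
            if self.active_model == model_name:
                return  # already active, nothing to do
            if self.active_model is not None:
                # Free memory before pulling in the new model
                await self.ollama.unload(self.active_model)
            await self.ollama.load(model_name)
            self.active_model = model_name
```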
---
## Quick Start
### Docker (Recommended)
```bash
# Build and start
docker-compose up -d swapper-service
# Check health
curl http://localhost:8890/health
# Get status
curl http://localhost:8890/status
# List models
curl http://localhost:8890/models
```
### Local Development
```bash
cd services/swapper-service
# Install dependencies
pip install -r requirements.txt
# Set environment variables
export OLLAMA_BASE_URL=http://localhost:11434
export SWAPPER_CONFIG_PATH=./config/swapper_config.yaml
# Run service
python -m app.main
```
---
## API Endpoints
### Health & Status
#### GET /health
Health check endpoint
**Response:**
```json
{
  "status": "healthy",
  "service": "swapper-service",
  "active_model": "deepseek-r1-70b",
  "mode": "single-active"
}
```
#### GET /status
Get full Swapper service status
**Response:**
```json
{
  "status": "healthy",
  "active_model": "deepseek-r1-70b",
  "available_models": ["deepseek-r1-70b", "qwen2.5-coder-32b", ...],
  "loaded_models": ["deepseek-r1-70b"],
  "mode": "single-active",
  "total_models": 8
}
```
### Model Management
#### GET /models
List all available models
**Response:**
```json
{
  "models": [
    {
      "name": "deepseek-r1-70b",
      "ollama_name": "deepseek-r1:70b",
      "type": "llm",
      "size_gb": 42,
      "priority": "high",
      "status": "loaded"
    }
  ]
}
```
#### GET /models/{model_name}
Get information about a specific model
**Response:**
```json
{
  "name": "deepseek-r1-70b",
  "ollama_name": "deepseek-r1:70b",
  "type": "llm",
  "size_gb": 42,
  "priority": "high",
  "status": "loaded",
  "loaded_at": "2025-11-22T10:30:00",
  "unloaded_at": null,
  "total_uptime_seconds": 3600.5
}
```
#### POST /models/{model_name}/load
Load a model. In single-active mode, the currently loaded model is unloaded first.
**Response:**
```json
{
  "status": "success",
  "model": "deepseek-r1-70b",
  "message": "Model deepseek-r1-70b loaded"
}
```
#### POST /models/{model_name}/unload
Unload a model
**Response:**
```json
{
  "status": "success",
  "model": "deepseek-r1-70b",
  "message": "Model deepseek-r1-70b unloaded"
}
```
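
To exercise these endpoints from Python, a minimal client might look like the following. It uses only the routes documented above; the helper names (`swap_to`, `unload`) are ours, not part of the service:

```python
import requests

BASE_URL = "http://localhost:8890"

def swap_to(model_name: str) -> dict:
    """Load model_name; in single-active mode the service swaps out the previous model itself."""
    resp = requests.post(f"{BASE_URL}/models/{model_name}/load", timeout=60)
    resp.raise_for_status()
    return resp.json()

def unload(model_name: str) -> dict:
    """Explicitly unload a model to free memory."""
    resp = requests.post(f"{BASE_URL}/models/{model_name}/unload", timeout=60)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    print(swap_to("deepseek-r1-70b"))  # {'status': 'success', ...}
```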
### Metrics
#### GET /metrics
Get metrics for all models
**Response:**
```json
{
  "metrics": [
    {
      "model_name": "deepseek-r1-70b",
      "status": "loaded",
      "loaded_at": "2025-11-22T10:30:00",
      "uptime_hours": 1.5,
      "request_count": 42,
      "total_uptime_seconds": 5400.0
    }
  ]
}
```
#### GET /metrics/{model_name}
Get metrics for a specific model
**Response:**
```json
{
  "model_name": "deepseek-r1-70b",
  "status": "loaded",
  "loaded_at": "2025-11-22T10:30:00",
  "uptime_hours": 1.5,
  "request_count": 42,
  "total_uptime_seconds": 5400.0
}
```
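
For quick monitoring before Prometheus support lands, `GET /metrics` can be polled directly. A small sketch using the documented response shape; the helper function itself is illustrative:

```python
import requests

def print_model_metrics(base_url: str = "http://localhost:8890") -> None:
    """Fetch GET /metrics and print a one-line summary per model."""
    metrics = requests.get(f"{base_url}/metrics", timeout=10).json()["metrics"]
    for m in metrics:
        print(f"{m['model_name']}: {m['status']}, "
              f"{m['uptime_hours']:.1f} h up, {m['request_count']} requests")
```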
---
## Configuration
### Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama API URL |
| `SWAPPER_CONFIG_PATH` | `./config/swapper_config.yaml` | Path to config file |
| `SWAPPER_MODE` | `single-active` | Mode: `single-active` or `multi-active` |
| `MAX_CONCURRENT_MODELS` | `1` | Max concurrent models (for multi-active mode) |
| `MODEL_SWAP_TIMEOUT` | `30` | Timeout for model swap (seconds) |
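
Inside the service, these variables would typically be read with `os.getenv`, falling back to the defaults in the table. A minimal sketch; the `Settings` dataclass is illustrative, not the service's actual config object:

```python
import os
from dataclasses import dataclass

@dataclass
class Settings:
    # Defaults mirror the table above; values are read once at import time.
    ollama_base_url: str = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
    config_path: str = os.getenv("SWAPPER_CONFIG_PATH", "./config/swapper_config.yaml")
    mode: str = os.getenv("SWAPPER_MODE", "single-active")
    max_concurrent_models: int = int(os.getenv("MAX_CONCURRENT_MODELS", "1"))
    model_swap_timeout: int = int(os.getenv("MODEL_SWAP_TIMEOUT", "30"))
```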
### Config File (swapper_config.yaml)
```yaml
swapper:
  mode: single-active
  max_concurrent_models: 1
  model_swap_timeout: 30
  gpu_enabled: true
  metal_acceleration: true

models:
  deepseek-r1-70b:
    path: ollama:deepseek-r1:70b
    type: llm
    size_gb: 42
    priority: high
```
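
Parsing this file takes a few lines with PyYAML. A sketch, assuming the layout above:

```python
import yaml  # pip install pyyaml

with open("./config/swapper_config.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["swapper"]["mode"])             # single-active
print(cfg["models"]["deepseek-r1-70b"])   # {'path': 'ollama:deepseek-r1:70b', ...}
```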
---
## Integration with Router
Swapper Service integrates with DAGI Router through metadata:
```python
router_request = {
    "message": "Your request",
    "mode": "chat",
    "metadata": {
        "use_llm": "specialist_vision_8b",  # Swapper will load this model
        "swapper_service": "http://swapper-service:8890"
    }
}
```
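
On the router side, honoring this metadata amounts to one extra call before dispatching the request. A hedged sketch: the router hook shown here is an assumption; only the `/models/{name}/load` route comes from this document:

```python
import requests

def dispatch_with_swapper(router_request: dict) -> None:
    """Illustrative router hook: preload the requested model via Swapper, then dispatch."""
    meta = router_request.get("metadata", {})
    model = meta.get("use_llm")
    swapper_url = meta.get("swapper_service", "http://swapper-service:8890")
    if model:
        # Ask Swapper to make the model active; single-active mode swaps out the old one.
        requests.post(f"{swapper_url}/models/{model}/load", timeout=60).raise_for_status()
    # ... forward router_request to the active model for inference ...
```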
---
## Monitoring
### Health Check
```bash
curl http://localhost:8890/health
```
### Prometheus Metrics (Future)
- `swapper_active_model` — Currently active model
- `swapper_model_uptime_seconds` — Uptime per model
- `swapper_model_requests_total` — Total requests per model
---
## Troubleshooting
### Model won't load
```bash
# Check Ollama is running
curl http://localhost:11434/api/tags
# Check a specific model exists in Ollama (substitute the model name)
curl http://localhost:11434/api/tags | grep "model_name"
# Check Swapper logs
docker logs swapper-service
```
### Service not responding
```bash
# Check if service is running
docker ps | grep swapper-service
# Check health
curl http://localhost:8890/health
# Check logs
docker logs -f swapper-service
```
---
## Differences: Swapper Service vs vLLM
**Swapper Service:**
- Model loading/unloading manager
- Single-active mode (one model at a time)
- Memory optimization
- Works with Ollama
- Lightweight, simple API
**vLLM:**
- High-performance inference engine
- Continuous serving (models stay loaded)
- Optimized for throughput
- Direct GPU acceleration
- More complex, production-grade
**Use Swapper when:**
- Memory is limited
- Need to switch between models frequently
- Running on resource-constrained systems (like the Node #2 MacBook)
**Use vLLM when:**
- Need maximum throughput
- Models stay loaded for long periods
- Have dedicated GPU resources
- Production serving at scale
---
## Next Steps
1. **Add to Node #2 Admin Console**
   - Display active model
   - Show model metrics (uptime, requests)
   - Allow manual model loading/unloading
2. **Integration with Router**
   - Auto-load models based on request type
   - Route requests to appropriate models
3. **Metrics Dashboard**
   - Grafana dashboard for Swapper metrics
   - Model usage analytics
---
**Last Updated:** 2025-11-22
**Maintained by:** Ivan Tytar & DAARION Team
**Status:** ✅ Ready for Node #2