# Swapper Service
**Version:** 1.0.0
**Status:** ✅ Ready for Node #2
**Port:** 8890
A dynamic model-loading service that loads and unloads LLM models on demand to optimize memory usage. Supports single-active mode (one model loaded at a time).
---
## Overview
Swapper Service provides:
- **Dynamic Model Loading** — Load/unload models on demand
- **Single-Active Mode** — Only one model loaded at a time (memory optimization)
- **Model Metrics** — Track uptime, request count, load/unload times
- **Ollama Integration** — Works with Ollama models
- **REST API** — Full API for model management
---
## Features
### Model Management
- Load models on demand
- Unload models to free memory
- Track which model is currently active
- Monitor model uptime and usage
### Metrics
- Current active model
- Model uptime (hours)
- Request count per model
- Load/unload timestamps
- Total uptime per model
### Single-Active Mode
- Only one model loaded at a time
- Automatic unloading of the previous model when loading a new one (see the sketch below)
- Optimizes memory usage on resource-constrained systems
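
Conceptually, the swap logic is small. The sketch below illustrates the single-active invariant; the class, the `ollama_client` interface, and the method names are hypothetical, not the service's actual internals:

```python
import asyncio

class SingleActiveSwapper:
    """Illustrative only: keeps at most one model loaded at a time."""

    def __init__(self, ollama_client):
        self.ollama = ollama_client  # assumed client that can load/unload models
        self.active_model = None     # name of the currently loaded model
        self._lock = asyncio.Lock()  # serialize swaps so two loads can't race

    async def ensure_loaded(self, model_name: str) -> None:
        async with self._lock:
            if self.active_model == model_name:
                return  # already active, nothing to do
            if self.active_model is not None:
                # Free memory before pulling in the new model
                await self.ollama.unload(self.active_model)
            await self.ollama.load(model_name)
            self.active_model = model_name
```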
---
## Quick Start
### Docker (Recommended)
```bash
# Build and start
docker-compose up -d swapper-service
# Check health
curl http://localhost:8890/health
# Get status
curl http://localhost:8890/status
# List models
curl http://localhost:8890/models
```
### Local Development
```bash
cd services/swapper-service
# Install dependencies
pip install -r requirements.txt
# Set environment variables
export OLLAMA_BASE_URL=http://localhost:11434
export SWAPPER_CONFIG_PATH=./config/swapper_config.yaml
# Run service
python -m app.main
```
---
## API Endpoints
### Health & Status
#### GET /health
Health check endpoint
**Response:**
```json
{
  "status": "healthy",
  "service": "swapper-service",
  "active_model": "deepseek-r1-70b",
  "mode": "single-active"
}
```
#### GET /status
Get full Swapper service status
**Response:**
```json
{
  "status": "healthy",
  "active_model": "deepseek-r1-70b",
  "available_models": ["deepseek-r1-70b", "qwen2.5-coder-32b", ...],
  "loaded_models": ["deepseek-r1-70b"],
  "mode": "single-active",
  "total_models": 8
}
```
### Model Management
#### GET /models
List all available models
**Response:**
```json
{
  "models": [
    {
      "name": "deepseek-r1-70b",
      "ollama_name": "deepseek-r1:70b",
      "type": "llm",
      "size_gb": 42,
      "priority": "high",
      "status": "loaded"
    }
  ]
}
```
#### GET /models/{model_name}
Get information about a specific model
**Response:**
```json
{
  "name": "deepseek-r1-70b",
  "ollama_name": "deepseek-r1:70b",
  "type": "llm",
  "size_gb": 42,
  "priority": "high",
  "status": "loaded",
  "loaded_at": "2025-11-22T10:30:00",
  "unloaded_at": null,
  "total_uptime_seconds": 3600.5
}
```
#### POST /models/{model_name}/load
Load a model. In single-active mode, the currently loaded model is unloaded first.
**Response:**
```json
{
  "status": "success",
  "model": "deepseek-r1-70b",
  "message": "Model deepseek-r1-70b loaded"
}
```
#### POST /models/{model_name}/unload
Unload a model
**Response:**
```json
{
  "status": "success",
  "model": "deepseek-r1-70b",
  "message": "Model deepseek-r1-70b unloaded"
}
```
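
To exercise these endpoints from Python, a minimal client might look like the following. It uses only the routes documented above; the helper names (`swap_to`, `unload`) are ours, not part of the service:

```python
import requests

BASE_URL = "http://localhost:8890"

def swap_to(model_name: str) -> dict:
    """Load model_name; in single-active mode the service swaps out the previous model itself."""
    resp = requests.post(f"{BASE_URL}/models/{model_name}/load", timeout=60)
    resp.raise_for_status()
    return resp.json()

def unload(model_name: str) -> dict:
    """Explicitly unload a model to free memory."""
    resp = requests.post(f"{BASE_URL}/models/{model_name}/unload", timeout=60)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    print(swap_to("deepseek-r1-70b"))  # {'status': 'success', ...}
```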
### Metrics
#### GET /metrics
Get metrics for all models
**Response:**
```json
{
  "metrics": [
    {
      "model_name": "deepseek-r1-70b",
      "status": "loaded",
      "loaded_at": "2025-11-22T10:30:00",
      "uptime_hours": 1.5,
      "request_count": 42,
      "total_uptime_seconds": 5400.0
    }
  ]
}
```
#### GET /metrics/{model_name}
Get metrics for a specific model
**Response:**
```json
{
  "model_name": "deepseek-r1-70b",
  "status": "loaded",
  "loaded_at": "2025-11-22T10:30:00",
  "uptime_hours": 1.5,
  "request_count": 42,
  "total_uptime_seconds": 5400.0
}
```
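
For quick monitoring before Prometheus support lands, `GET /metrics` can be polled directly. A small sketch using the documented response shape; the helper function itself is illustrative:

```python
import requests

def print_model_metrics(base_url: str = "http://localhost:8890") -> None:
    """Fetch GET /metrics and print a one-line summary per model."""
    metrics = requests.get(f"{base_url}/metrics", timeout=10).json()["metrics"]
    for m in metrics:
        print(f"{m['model_name']}: {m['status']}, "
              f"{m['uptime_hours']:.1f} h up, {m['request_count']} requests")
```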
---
## Configuration
### Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama API URL |
| `SWAPPER_CONFIG_PATH` | `./config/swapper_config.yaml` | Path to config file |
| `SWAPPER_MODE` | `single-active` | Mode: `single-active` or `multi-active` |
| `MAX_CONCURRENT_MODELS` | `1` | Max concurrent models (for multi-active mode) |
| `MODEL_SWAP_TIMEOUT` | `30` | Timeout for model swap (seconds) |
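
Inside the service, these variables would typically be read with `os.getenv`, falling back to the defaults in the table. A minimal sketch; the `Settings` dataclass is illustrative, not the service's actual config object:

```python
import os
from dataclasses import dataclass

@dataclass
class Settings:
    # Defaults mirror the table above; values are read once at import time.
    ollama_base_url: str = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
    config_path: str = os.getenv("SWAPPER_CONFIG_PATH", "./config/swapper_config.yaml")
    mode: str = os.getenv("SWAPPER_MODE", "single-active")
    max_concurrent_models: int = int(os.getenv("MAX_CONCURRENT_MODELS", "1"))
    model_swap_timeout: int = int(os.getenv("MODEL_SWAP_TIMEOUT", "30"))
```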
### Config File (swapper_config.yaml)
```yaml
swapper:
  mode: single-active
  max_concurrent_models: 1
  model_swap_timeout: 30
  gpu_enabled: true
  metal_acceleration: true

models:
  deepseek-r1-70b:
    path: ollama:deepseek-r1:70b
    type: llm
    size_gb: 42
    priority: high
```
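
Parsing this file takes a few lines with PyYAML. A sketch, assuming the layout above:

```python
import yaml  # pip install pyyaml

with open("./config/swapper_config.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["swapper"]["mode"])             # single-active
print(cfg["models"]["deepseek-r1-70b"])   # {'path': 'ollama:deepseek-r1:70b', ...}
```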
---
## Integration with Router
Swapper Service integrates with DAGI Router through metadata:
```python
router_request = {
    "message": "Your request",
    "mode": "chat",
    "metadata": {
        "use_llm": "specialist_vision_8b",  # Swapper will load this model
        "swapper_service": "http://swapper-service:8890"
    }
}
```
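
On the router side, honoring this metadata amounts to one extra call before dispatching the request. A hedged sketch: the router hook shown here is an assumption; only the `/models/{name}/load` route comes from this document:

```python
import requests

def dispatch_with_swapper(router_request: dict) -> None:
    """Illustrative router hook: preload the requested model via Swapper, then dispatch."""
    meta = router_request.get("metadata", {})
    model = meta.get("use_llm")
    swapper_url = meta.get("swapper_service", "http://swapper-service:8890")
    if model:
        # Ask Swapper to make the model active; single-active mode swaps out the old one.
        requests.post(f"{swapper_url}/models/{model}/load", timeout=60).raise_for_status()
    # ... forward router_request to the active model for inference ...
```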
---
## Monitoring
### Health Check
```bash
curl http://localhost:8890/health
```
### Prometheus Metrics (Future)
- `swapper_active_model` — Currently active model
- `swapper_model_uptime_seconds` — Uptime per model
- `swapper_model_requests_total` — Total requests per model
---
## Troubleshooting
### Model won't load
```bash
# Check Ollama is running
curl http://localhost:11434/api/tags
# Check a specific model exists in Ollama (substitute the model name)
curl http://localhost:11434/api/tags | grep "model_name"
# Check Swapper logs
docker logs swapper-service
```
### Service not responding
```bash
# Check if service is running
docker ps | grep swapper-service
# Check health
curl http://localhost:8890/health
# Check logs
docker logs -f swapper-service
```
---
## Differences: Swapper Service vs vLLM
**Swapper Service:**
- Model loading/unloading manager
- Single-active mode (one model at a time)
- Memory optimization
- Works with Ollama
- Lightweight, simple API
**vLLM:**
- High-performance inference engine
- Continuous serving (models stay loaded)
- Optimized for throughput
- Direct GPU acceleration
- More complex, production-grade
**Use Swapper when:**
- Memory is limited
- Need to switch between models frequently
- Running on resource-constrained systems (like the Node #2 MacBook)
**Use vLLM when:**
- Need maximum throughput
- Models stay loaded for long periods
- Have dedicated GPU resources
- Production serving at scale
---
## Next Steps
1. **Add to Node #2 Admin Console**
   - Display active model
   - Show model metrics (uptime, requests)
   - Allow manual model loading/unloading
2. **Integration with Router**
   - Auto-load models based on request type
   - Route requests to appropriate models
3. **Metrics Dashboard**
   - Grafana dashboard for Swapper metrics
   - Model usage analytics
---
**Last Updated:** 2025-11-22
**Maintained by:** Ivan Tytar & DAARION Team
**Status:** ✅ Ready for Node #2