feat: Add presence heartbeat for Matrix online status
- matrix-gateway: POST /internal/matrix/presence/online endpoint
- usePresenceHeartbeat hook with activity tracking
- Auto away after 5 min inactivity
- Offline on page close/visibility change
- Integrated in MatrixChatRoom component
services/swapper-service/README.md (new file)
# Swapper Service

**Version:** 1.0.0
**Status:** ✅ Ready for Node #2
**Port:** 8890

Dynamic model loading service that manages LLM models on-demand to optimize memory usage. Supports single-active mode (one model loaded at a time).

---
## Overview

Swapper Service provides:

- **Dynamic Model Loading** — Load/unload models on-demand
- **Single-Active Mode** — Only one model loaded at a time (memory optimization)
- **Model Metrics** — Track uptime, request count, load/unload times
- **Ollama Integration** — Works with Ollama models
- **REST API** — Full API for model management

---
## Features

### Model Management

- Load models on-demand
- Unload models to free memory
- Track which model is currently active
- Monitor model uptime and usage

### Metrics

- Current active model
- Model uptime (hours)
- Request count per model
- Load/unload timestamps
- Total uptime per model

### Single-Active Mode

- Only one model loaded at a time
- Automatic unloading of the previous model when loading a new one
- Optimizes memory usage on resource-constrained systems

---
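The single-active swap rule above can be sketched in a few lines of Python. This is an illustrative model of the behavior, not the service's actual implementation; the class and method names (`SingleActiveSwapper`, `load`, `unload`) are hypothetical.

```python
class SingleActiveSwapper:
    """Keeps at most one model loaded; loading a new model evicts the old one."""

    def __init__(self):
        self.active_model = None
        self.events = []  # (action, model) history, for illustration only

    def load(self, model_name: str) -> str:
        if self.active_model == model_name:
            return f"Model {model_name} already loaded"
        if self.active_model is not None:
            # Single-active mode: unload the previous model before loading the new one.
            self.events.append(("unload", self.active_model))
        self.events.append(("load", model_name))
        self.active_model = model_name
        return f"Model {model_name} loaded"

    def unload(self, model_name: str) -> str:
        if self.active_model != model_name:
            return f"Model {model_name} is not loaded"
        self.events.append(("unload", model_name))
        self.active_model = None
        return f"Model {model_name} unloaded"
```

Calling `load` twice with different models leaves only the second one active, with an `unload` event recorded for the first.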
## Quick Start

### Docker (Recommended)

```bash
# Build and start
docker-compose up -d swapper-service

# Check health
curl http://localhost:8890/health

# Get status
curl http://localhost:8890/status

# List models
curl http://localhost:8890/models
```

### Local Development

```bash
cd services/swapper-service

# Install dependencies
pip install -r requirements.txt

# Set environment variables
export OLLAMA_BASE_URL=http://localhost:11434
export SWAPPER_CONFIG_PATH=./config/swapper_config.yaml

# Run service
python -m app.main
```

---
## API Endpoints

### Health & Status

#### GET /health
Health check endpoint.

**Response:**
```json
{
  "status": "healthy",
  "service": "swapper-service",
  "active_model": "deepseek-r1-70b",
  "mode": "single-active"
}
```

#### GET /status
Get full Swapper service status.

**Response:**
```json
{
  "status": "healthy",
  "active_model": "deepseek-r1-70b",
  "available_models": ["deepseek-r1-70b", "qwen2.5-coder-32b", ...],
  "loaded_models": ["deepseek-r1-70b"],
  "mode": "single-active",
  "total_models": 8
}
```
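A minimal client-side check of the `/health` payload shown above, using only the standard library. The parsing helper is separated from the network call so the payload logic can be tested offline; the base URL default matches the documented port but should be whatever your deployment uses.

```python
import json
import urllib.request


def parse_health(body: str) -> bool:
    """Return True if a /health response body reports a healthy service."""
    payload = json.loads(body)
    return payload.get("status") == "healthy"


def check_health(base_url: str = "http://localhost:8890", timeout: float = 5.0) -> bool:
    """Fetch /health and report whether the service says it is healthy."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return parse_health(resp.read().decode())
```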
### Model Management

#### GET /models
List all available models.

**Response:**
```json
{
  "models": [
    {
      "name": "deepseek-r1-70b",
      "ollama_name": "deepseek-r1:70b",
      "type": "llm",
      "size_gb": 42,
      "priority": "high",
      "status": "loaded"
    }
  ]
}
```

#### GET /models/{model_name}
Get information about a specific model.

**Response:**
```json
{
  "name": "deepseek-r1-70b",
  "ollama_name": "deepseek-r1:70b",
  "type": "llm",
  "size_gb": 42,
  "priority": "high",
  "status": "loaded",
  "loaded_at": "2025-11-22T10:30:00",
  "unloaded_at": null,
  "total_uptime_seconds": 3600.5
}
```

#### POST /models/{model_name}/load
Load a model. In single-active mode, the previously loaded model is unloaded first.

**Response:**
```json
{
  "status": "success",
  "model": "deepseek-r1-70b",
  "message": "Model deepseek-r1-70b loaded"
}
```

#### POST /models/{model_name}/unload
Unload a model.

**Response:**
```json
{
  "status": "success",
  "model": "deepseek-r1-70b",
  "message": "Model deepseek-r1-70b unloaded"
}
```
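A hedged sketch of calling the load/unload endpoints above from Python, again with the standard library (a library like `requests` would work equally well). The endpoint paths follow the documentation; the `swap_to` helper name and the 60-second timeout are illustrative choices, and error handling is deliberately minimal.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8890"  # documented Swapper port; adjust for your deployment


def model_action_url(model_name: str, action: str, base_url: str = BASE_URL) -> str:
    """Build the POST URL for a load/unload action on a model."""
    if action not in ("load", "unload"):
        raise ValueError(f"unsupported action: {action}")
    return f"{base_url}/models/{model_name}/{action}"


def swap_to(model_name: str, base_url: str = BASE_URL) -> dict:
    """Load model_name; in single-active mode the service evicts the previous model itself."""
    req = urllib.request.Request(model_action_url(model_name, "load", base_url), method="POST")
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read().decode())
```

Because the service handles eviction in single-active mode, a client never needs to call unload before load; `swap_to` is a single POST.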
### Metrics

#### GET /metrics
Get metrics for all models.

**Response:**
```json
{
  "metrics": [
    {
      "model_name": "deepseek-r1-70b",
      "status": "loaded",
      "loaded_at": "2025-11-22T10:30:00",
      "uptime_hours": 1.5,
      "request_count": 42,
      "total_uptime_seconds": 5400.0
    }
  ]
}
```

#### GET /metrics/{model_name}
Get metrics for a specific model.

**Response:**
```json
{
  "model_name": "deepseek-r1-70b",
  "status": "loaded",
  "loaded_at": "2025-11-22T10:30:00",
  "uptime_hours": 1.5,
  "request_count": 42,
  "total_uptime_seconds": 5400.0
}
```

---
## Configuration

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama API URL |
| `SWAPPER_CONFIG_PATH` | `./config/swapper_config.yaml` | Path to config file |
| `SWAPPER_MODE` | `single-active` | Mode: `single-active` or `multi-active` |
| `MAX_CONCURRENT_MODELS` | `1` | Max concurrent models (for multi-active mode) |
| `MODEL_SWAP_TIMEOUT` | `30` | Timeout for model swap (seconds) |

### Config File (swapper_config.yaml)

```yaml
swapper:
  mode: single-active
  max_concurrent_models: 1
  model_swap_timeout: 30
  gpu_enabled: true
  metal_acceleration: true

models:
  deepseek-r1-70b:
    path: ollama:deepseek-r1:70b
    type: llm
    size_gb: 42
    priority: high
```

---
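Once the config file has been parsed (for example with PyYAML's `yaml.safe_load`), it is a plain dict, and a few sanity checks can catch common mistakes early. The checks below are assumptions based only on the sample config above (the field names `mode`, `max_concurrent_models`, and the `ollama:` path prefix), not on the service's actual validation logic.

```python
def validate_config(cfg: dict) -> list[str]:
    """Return a list of problems found in a parsed swapper config (empty means OK)."""
    problems = []
    swapper = cfg.get("swapper", {})

    # Mode must be one of the two documented values.
    if swapper.get("mode") not in ("single-active", "multi-active"):
        problems.append("swapper.mode must be 'single-active' or 'multi-active'")

    # Single-active mode implies exactly one concurrent model.
    if swapper.get("mode") == "single-active" and swapper.get("max_concurrent_models", 1) != 1:
        problems.append("single-active mode implies max_concurrent_models: 1")

    # Model paths in the sample config reference Ollama models.
    for name, model in cfg.get("models", {}).items():
        if not str(model.get("path", "")).startswith("ollama:"):
            problems.append(f"models.{name}.path should look like ollama:<model-name>")

    return problems
```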
## Integration with Router

Swapper Service integrates with DAGI Router through metadata:

```python
router_request = {
    "message": "Your request",
    "mode": "chat",
    "metadata": {
        "use_llm": "specialist_vision_8b",  # Swapper will load this model
        "swapper_service": "http://swapper-service:8890"
    }
}
```

---
## Monitoring

### Health Check

```bash
curl http://localhost:8890/health
```

### Prometheus Metrics (Future)

- `swapper_active_model` — Currently active model
- `swapper_model_uptime_seconds` — Uptime per model
- `swapper_model_requests_total` — Total requests per model

---
## Troubleshooting

### Model won't load

```bash
# Check that Ollama is running
curl http://localhost:11434/api/tags

# Check that the model exists in Ollama
curl http://localhost:11434/api/tags | grep "model_name"

# Check Swapper logs
docker logs swapper-service
```

### Service not responding

```bash
# Check if the service is running
docker ps | grep swapper-service

# Check health
curl http://localhost:8890/health

# Follow the logs
docker logs -f swapper-service
```

---
## Differences: Swapper Service vs vLLM

**Swapper Service:**
- Model loading/unloading manager
- Single-active mode (one model at a time)
- Memory optimization
- Works with Ollama
- Lightweight, simple API

**vLLM:**
- High-performance inference engine
- Continuous serving (models stay loaded)
- Optimized for throughput
- Direct GPU acceleration
- More complex, production-grade

**Use Swapper when:**
- Memory is limited
- You need to switch between models frequently
- Running on resource-constrained systems (like the Node #2 MacBook)

**Use vLLM when:**
- You need maximum throughput
- Models stay loaded for long periods
- You have dedicated GPU resources
- Serving in production at scale

---
## Next Steps

1. **Add to Node #2 Admin Console**
   - Display active model
   - Show model metrics (uptime, requests)
   - Allow manual model loading/unloading

2. **Integration with Router**
   - Auto-load models based on request type
   - Route requests to appropriate models

3. **Metrics Dashboard**
   - Grafana dashboard for Swapper metrics
   - Model usage analytics

---
**Last Updated:** 2025-11-22
**Maintained by:** Ivan Tytar & DAARION Team
**Status:** ✅ Ready for Node #2