Swapper Service
Version: 1.0.0
Status: ✅ Ready for Node #2
Port: 8890
A dynamic model-loading service that manages LLM models on demand to optimize memory usage. Supports single-active mode (only one model loaded at a time).
Overview
Swapper Service provides:
- Dynamic Model Loading — Load/unload models on demand
- Single-Active Mode — Only one model loaded at a time (memory optimization)
- Model Metrics — Track uptime, request count, load/unload times
- Ollama Integration — Works with Ollama models
- REST API — Full API for model management
Features
Model Management
- Load models on demand
- Unload models to free memory
- Track which model is currently active
- Monitor model uptime and usage
Metrics
- Current active model
- Model uptime (hours)
- Request count per model
- Load/unload timestamps
- Total uptime per model
Single-Active Mode
- Only one model loaded at a time
- Automatic unloading of the previous model when a new one is loaded (see the sketch after this list)
- Optimizes memory usage on resource-constrained systems
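Conceptually, the swap rule looks like the sketch below. It relies on the Ollama convention of an empty /api/generate call to load a model and keep_alive: 0 to evict it; the class and method names are illustrative, not the actual swapper-service code.

```python
import requests

OLLAMA_URL = "http://localhost:11434"

class SingleActiveSwapper:
    """Keeps at most one model resident, mirroring single-active mode (illustrative)."""

    def __init__(self, base_url: str = OLLAMA_URL):
        self.base_url = base_url
        self.active_model = None

    def load(self, model_name: str) -> None:
        # Unload the previously active model first so only one occupies memory.
        if self.active_model and self.active_model != model_name:
            self.unload(self.active_model)
        # An empty generate request asks Ollama to load the model into memory.
        requests.post(f"{self.base_url}/api/generate",
                      json={"model": model_name}, timeout=300)
        self.active_model = model_name

    def unload(self, model_name: str) -> None:
        # keep_alive=0 tells Ollama to evict the model immediately.
        requests.post(f"{self.base_url}/api/generate",
                      json={"model": model_name, "keep_alive": 0}, timeout=60)
        if self.active_model == model_name:
            self.active_model = None
```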
Quick Start
Docker (Recommended)
# Build and start
docker-compose up -d swapper-service
# Check health
curl http://localhost:8890/health
# Get status
curl http://localhost:8890/status
# List models
curl http://localhost:8890/models
Local Development
cd services/swapper-service
# Install dependencies
pip install -r requirements.txt
# Set environment variables
export OLLAMA_BASE_URL=http://localhost:11434
export SWAPPER_CONFIG_PATH=./config/swapper_config.yaml
# Run service
python -m app.main
API Endpoints
Health & Status
GET /health
Health check endpoint
Response:
{
  "status": "healthy",
  "service": "swapper-service",
  "active_model": "deepseek-r1-70b",
  "mode": "single-active"
}
GET /status
Get full Swapper service status
Response:
{
  "status": "healthy",
  "active_model": "deepseek-r1-70b",
  "available_models": ["deepseek-r1-70b", "qwen2.5-coder-32b", ...],
  "loaded_models": ["deepseek-r1-70b"],
  "mode": "single-active",
  "total_models": 8
}
Model Management
GET /models
List all available models
Response:
{
  "models": [
    {
      "name": "deepseek-r1-70b",
      "ollama_name": "deepseek-r1:70b",
      "type": "llm",
      "size_gb": 42,
      "priority": "high",
      "status": "loaded"
    }
  ]
}
GET /models/{model_name}
Get information about a specific model
Response:
{
  "name": "deepseek-r1-70b",
  "ollama_name": "deepseek-r1:70b",
  "type": "llm",
  "size_gb": 42,
  "priority": "high",
  "status": "loaded",
  "loaded_at": "2025-11-22T10:30:00",
  "unloaded_at": null,
  "total_uptime_seconds": 3600.5
}
POST /models/{model_name}/load
Load a model
Response:
{
  "status": "success",
  "model": "deepseek-r1-70b",
  "message": "Model deepseek-r1-70b loaded"
}
POST /models/{model_name}/unload
Unload a model
Response:
{
  "status": "success",
  "model": "deepseek-r1-70b",
  "message": "Model deepseek-r1-70b unloaded"
}
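The endpoints above can also be driven programmatically. The following is an illustrative Python client built with the `requests` library against the documented routes; function names and timeouts are assumptions.

```python
import requests

SWAPPER_URL = "http://localhost:8890"  # service port from this README

def load_model(name: str) -> dict:
    """Load a model; in single-active mode the previous one is unloaded first."""
    resp = requests.post(f"{SWAPPER_URL}/models/{name}/load", timeout=120)
    resp.raise_for_status()
    return resp.json()

def unload_model(name: str) -> dict:
    """Unload a model to free memory."""
    resp = requests.post(f"{SWAPPER_URL}/models/{name}/unload", timeout=60)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    print(load_model("deepseek-r1-70b"))
    print(unload_model("deepseek-r1-70b"))
```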
Metrics
GET /metrics
Get metrics for all models
Response:
{
  "metrics": [
    {
      "model_name": "deepseek-r1-70b",
      "status": "loaded",
      "loaded_at": "2025-11-22T10:30:00",
      "uptime_hours": 1.5,
      "request_count": 42,
      "total_uptime_seconds": 5400.0
    }
  ]
}
GET /metrics/{model_name}
Get metrics for a specific model
Response:
{
  "model_name": "deepseek-r1-70b",
  "status": "loaded",
  "loaded_at": "2025-11-22T10:30:00",
  "uptime_hours": 1.5,
  "request_count": 42,
  "total_uptime_seconds": 5400.0
}
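For dashboards or periodic checks, the metrics endpoint can be polled. The snippet below is a minimal example using the `requests` library; the field names come from the response shown above, the polling interval is an arbitrary choice.

```python
import time
import requests

SWAPPER_URL = "http://localhost:8890"

def print_model_metrics() -> None:
    """Fetch GET /metrics and print uptime and request counts per model."""
    data = requests.get(f"{SWAPPER_URL}/metrics", timeout=10).json()
    for m in data.get("metrics", []):
        print(f'{m["model_name"]}: status={m["status"]} '
              f'uptime={m["uptime_hours"]:.1f}h requests={m["request_count"]}')

if __name__ == "__main__":
    while True:
        print_model_metrics()
        time.sleep(60)  # poll once a minute
```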
Configuration
Environment Variables
| Variable | Default | Description |
|---|---|---|
| OLLAMA_BASE_URL | http://localhost:11434 | Ollama API URL |
| SWAPPER_CONFIG_PATH | ./config/swapper_config.yaml | Path to config file |
| SWAPPER_MODE | single-active | Mode: single-active or multi-active |
| MAX_CONCURRENT_MODELS | 1 | Max concurrent models (for multi-active mode) |
| MODEL_SWAP_TIMEOUT | 30 | Timeout for model swap (seconds) |
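The sketch below shows how these variables could be read with their documented defaults. The names and defaults come from the table; the actual startup code in the service may resolve them differently.

```python
import os

# Defaults mirror the table above; the real service may handle these differently.
OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
SWAPPER_CONFIG_PATH = os.getenv("SWAPPER_CONFIG_PATH", "./config/swapper_config.yaml")
SWAPPER_MODE = os.getenv("SWAPPER_MODE", "single-active")
MAX_CONCURRENT_MODELS = int(os.getenv("MAX_CONCURRENT_MODELS", "1"))
MODEL_SWAP_TIMEOUT = int(os.getenv("MODEL_SWAP_TIMEOUT", "30"))
```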
Config File (swapper_config.yaml)
swapper:
  mode: single-active
  max_concurrent_models: 1
  model_swap_timeout: 30
  gpu_enabled: true
  metal_acceleration: true

models:
  deepseek-r1-70b:
    path: ollama:deepseek-r1:70b
    type: llm
    size_gb: 42
    priority: high
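For illustration, a minimal PyYAML loader for a file with this shape; it assumes `swapper` and `models` are top-level keys as in the example above, and the service's real config parsing may differ.

```python
import os
import yaml  # PyYAML

def load_swapper_config(path: str | None = None) -> dict:
    """Parse swapper_config.yaml and return the settings and model registry."""
    path = path or os.getenv("SWAPPER_CONFIG_PATH", "./config/swapper_config.yaml")
    with open(path, "r", encoding="utf-8") as fh:
        cfg = yaml.safe_load(fh)
    settings = cfg.get("swapper", {})   # mode, timeouts, GPU flags
    models = cfg.get("models", {})      # per-model path, type, size, priority
    return {"settings": settings, "models": models}

if __name__ == "__main__":
    cfg = load_swapper_config()
    print(cfg["settings"].get("mode"), list(cfg["models"]))
```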
Integration with Router
Swapper Service integrates with DAGI Router through metadata:
router_request = {
    "message": "Your request",
    "mode": "chat",
    "metadata": {
        "use_llm": "specialist_vision_8b",  # Swapper will load this model
        "swapper_service": "http://swapper-service:8890"
    }
}
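On the router side, that metadata can drive a swap-then-route flow. The sketch below is only an assumption about how such a hook could look; the actual logic lives in services/router and may differ.

```python
import requests

def route_with_swapper(router_request: dict) -> None:
    """Ensure the requested model is loaded before the request is dispatched."""
    meta = router_request.get("metadata", {})
    model = meta.get("use_llm")
    swapper_url = meta.get("swapper_service", "http://swapper-service:8890")

    if model:
        # Ask the Swapper to load the model; in single-active mode this
        # also unloads whatever was previously active.
        resp = requests.post(f"{swapper_url}/models/{model}/load", timeout=120)
        resp.raise_for_status()

    # ... dispatch router_request to the now-active model ...
```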
Monitoring
Health Check
curl http://localhost:8890/health
Prometheus Metrics (Future)
- swapper_active_model — Currently active model
- swapper_model_uptime_seconds — Uptime per model
- swapper_model_requests_total — Total requests per model
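These metrics are not exported yet. A possible future implementation with the prometheus_client library could look like the sketch below; only the metric names come from the list above, everything else (port, helper names, labeling scheme) is an assumption.

```python
from prometheus_client import Counter, Gauge, start_http_server

# Planned metrics from the list above; every series is labeled by model name.
MODEL_UPTIME = Gauge("swapper_model_uptime_seconds", "Uptime per model", ["model"])
MODEL_REQUESTS = Counter("swapper_model_requests_total", "Total requests per model", ["model"])
ACTIVE_MODEL = Gauge("swapper_active_model", "1 for the currently active model", ["model"])

def set_active(model: str) -> None:
    """Mark `model` as the active one."""
    ACTIVE_MODEL.labels(model=model).set(1)

def record_request(model: str, uptime_seconds: float) -> None:
    """Update the per-model series after a handled request."""
    MODEL_REQUESTS.labels(model=model).inc()
    MODEL_UPTIME.labels(model=model).set(uptime_seconds)

if __name__ == "__main__":
    start_http_server(9090)  # expose /metrics for Prometheus scraping
```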
Troubleshooting
Model won't load
# Check Ollama is running
curl http://localhost:11434/api/tags
# Check model exists in Ollama
curl http://localhost:11434/api/tags | grep "model_name"
# Check Swapper logs
docker logs swapper-service
Service not responding
# Check if service is running
docker ps | grep swapper-service
# Check health
curl http://localhost:8890/health
# Check logs
docker logs -f swapper-service
Differences: Swapper Service vs vLLM
Swapper Service:
- Model loading/unloading manager
- Single-active mode (one model at a time)
- Memory optimization
- Works with Ollama
- Lightweight, simple API
vLLM:
- High-performance inference engine
- Continuous serving (models stay loaded)
- Optimized for throughput
- Direct GPU acceleration
- More complex, production-grade
Use Swapper when:
- Memory is limited
- Need to switch between models frequently
- Running on resource-constrained systems (like Node #2 MacBook)
Use vLLM when:
- Need maximum throughput
- Models stay loaded for long periods
- Have dedicated GPU resources
- Production serving at scale
Next Steps
- Add to Node #2 Admin Console
  - Display active model
  - Show model metrics (uptime, requests)
  - Allow manual model loading/unloading
- Integration with Router
  - Auto-load models based on request type
  - Route requests to appropriate models
- Metrics Dashboard
  - Grafana dashboard for Swapper metrics
  - Model usage analytics
Last Updated: 2025-11-22
Maintained by: Ivan Tytar & DAARION Team
Status: ✅ Ready for Node #2