# 🖥️ System Inventory — DAARION & MicroDAO
**Version:** 1.0.0
**Last Updated:** 2025-01-17
**Server:** GEX44 #2844465 (Hetzner)
---
## 🖥️ Hardware Specifications
### Production Server (144.76.224.179)
**Provider:** Hetzner Dedicated Server GEX44
**Server ID:** #2844465
#### GPU Configuration
**GPU Model:** NVIDIA RTX 4000 SFF Ada Generation
**VRAM:** 20 GB GDDR6
**Architecture:** Ada Lovelace
**CUDA Version:** 12.2
**Driver Version:** 535.274.02
**Current VRAM Usage:**
- Ollama (qwen3:8b): ~5.6 GB
- Vision Encoder (ViT-L/14): ~1.9 GB
- **Total:** ~7.5 GB / 20 GB (37.5% usage)
#### CPU & RAM (Typical GEX44)
- **CPU:** Intel Core i5-13500 (14 cores, 20 threads) or similar
- **RAM:** 64 GB DDR4
- **Storage:** 2x NVMe SSD (RAID configuration)
---
## 🤖 Installed AI Models
### 1. LLM Models (Language Models)
#### Ollama (Local)
**Service:** Ollama
**Port:** 11434
**Status:** ✅ Active
**Installed Models:**
| Model | Size | Parameters | Context | VRAM Usage | Purpose |
|-------|------|-----------|---------|------------|---------|
| **qwen3:8b** | ~4.7 GB | 8B | 32K | ~6 GB | Primary LLM for Router, fast inference |
**API:**
```bash
# List models
curl http://localhost:11434/api/tags
# Generate
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:8b",
  "prompt": "Hello"
}'
```
**Configuration:**
- Base URL: `http://172.17.0.1:11434` (from Docker containers)
- Used by: DAGI Router, DevTools, CrewAI, Gateway
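By default `/api/generate` streams newline-delimited JSON chunks, each carrying a `response` fragment and a final `done: true` object. A minimal Python sketch of assembling that stream (no live service needed; the sample payload below is illustrative, not captured output):

```python
import json

def assemble_ollama_stream(ndjson_lines):
    """Join the `response` fragments from an Ollama /api/generate stream."""
    parts = []
    for line in ndjson_lines:
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):  # final chunk signals completion
            break
    return "".join(parts)

# Illustrative shape of the streamed chunks:
sample = [
    '{"model": "qwen3:8b", "response": "Hel", "done": false}',
    '{"model": "qwen3:8b", "response": "lo", "done": false}',
    '{"model": "qwen3:8b", "response": "", "done": true}',
]
print(assemble_ollama_stream(sample))  # → Hello
```

Pass `"stream": false` in the request body to receive a single JSON object instead.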
---
### 2. Vision Models (Multimodal)
#### OpenCLIP (Vision Encoder Service)
**Service:** vision-encoder
**Port:** 8001
**Status:** ✅ Active (GPU-accelerated)
**Model Details:**
| Model | Architecture | Parameters | Embedding Dim | VRAM Usage | Purpose |
|-------|-------------|-----------|---------------|------------|---------|
| **ViT-L/14** (OpenAI CLIP pretrained weights) | Vision Transformer Large | ~428M | 768 | ~4 GB | Text/image embeddings for RAG |
**Capabilities:**
- ✅ Text → 768-dim embedding (0.1-0.5s on GPU, ~10-15s on CPU)
- ✅ Image → 768-dim embedding (0.3-1s on GPU, ~15-20s on CPU)
- ✅ Text-to-image search (via Qdrant)
- ✅ Image-to-image similarity search (via Qdrant)
- ✅ GPU acceleration: **~20-30x speedup** vs CPU
- ⏳ Zero-shot image classification (planned)
- ⏳ CLIP score calculation (planned)
**API Endpoints:**
```bash
# Text embedding
POST http://localhost:8001/embed/text
# Image embedding (URL)
POST http://localhost:8001/embed/image
# Image embedding (file upload)
POST http://localhost:8001/embed/image/upload
# Health check
GET http://localhost:8001/health
# Model info
GET http://localhost:8001/info
```
**Configuration:**
- Model: `ViT-L-14`
- Pretrained: `openai`
- Device: `cuda` (GPU)
- Normalize: `true`
- Integration: DAGI Router (mode: `vision_embed`, `image_search`)
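Because the service returns L2-normalized vectors (`Normalize: true`), text↔image similarity reduces to a dot product. A toy sketch of the comparison the image search performs (vectors shortened from 768 dims for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity; equals the plain dot product when both vectors are unit-norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

text_vec = [0.6, 0.8, 0.0]   # stand-in for a 768-dim text embedding
image_vec = [0.6, 0.8, 0.0]  # identical direction → similarity 1.0
print(round(cosine(text_vec, image_vec), 4))  # → 1.0
```

In production this ranking happens inside Qdrant (Cosine distance on the `daarion_images` collection), not in application code.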
---
### 3. Embedding Models (Text)
#### BAAI/bge-m3 (RAG Service)
**Service:** rag-service
**Port:** 9500
**Status:** ✅ Active
**Model Details:**
| Model | Type | Embedding Dim | Context Length | Device | Purpose |
|-------|------|---------------|----------------|--------|---------|
| **BAAI/bge-m3** | Dense Retrieval | 1024 | 8192 | CPU/GPU | Text embeddings for RAG |
**Capabilities:**
- ✅ Document embedding for retrieval
- ✅ Query embedding
- ✅ Multi-lingual support
- ✅ Long context (8192 tokens)
**Storage:**
- Vector database: PostgreSQL with pgvector extension
- Indexed documents: Chat messages, tasks, meetings, governance docs
**Configuration:**
- Model: `BAAI/bge-m3`
- Device: `cpu` (can use GPU if available)
- HuggingFace cache: `/root/.cache/huggingface`
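pgvector exposes cosine distance via the `<=>` operator, so retrieval is a single `ORDER BY … LIMIT k` query over the 1024-dim column. A sketch of the query shape (table and column names here are illustrative assumptions, not the service's actual schema):

```python
def build_search_sql(table="rag_documents", vector_col="embedding", k=5):
    """pgvector cosine-distance search; %s is the query-embedding parameter
    bound by the DB driver (e.g. psycopg)."""
    return (
        f"SELECT id, content, {vector_col} <=> %s::vector AS distance "
        f"FROM {table} ORDER BY distance LIMIT {k}"
    )

print(build_search_sql(k=3))
```

The query embedding is produced by BAAI/bge-m3 and passed as the bound parameter; smaller distance means a closer match.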
---
### 4. Audio Models
**Status:** ❌ Not installed yet
**Planned:**
- Whisper (speech-to-text)
- TTS models (text-to-speech)
- Audio classification
---
## 🗄️ Vector Databases
### 1. Qdrant (Image Embeddings)
**Service:** qdrant
**Port:** 6333 (HTTP), 6334 (gRPC)
**Status:** ✅ Active
**Collections:**
| Collection | Vectors | Dimension | Distance | Purpose |
|-----------|---------|-----------|----------|---------|
| **daarion_images** | Variable | 768 | Cosine | Image search (text→image, image→image) |
**Storage:** Docker volume `qdrant-data`
**API:**
```bash
# Health check
curl http://localhost:6333/healthz
# List collections
curl http://localhost:6333/collections
# Collection info
curl http://localhost:6333/collections/daarion_images
```
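A text→image search goes through Qdrant's points-search endpoint (`POST /collections/daarion_images/points/search`). A minimal Python sketch of the request body; the query vector here is truncated for illustration (real ones are 768-dim from the Vision Encoder):

```python
import json

def qdrant_search_body(query_vector, limit=5):
    """Body for POST /collections/daarion_images/points/search."""
    return {
        "vector": query_vector,   # 768-dim CLIP embedding in production
        "limit": limit,           # top-k results
        "with_payload": True,     # return stored metadata (e.g. image URL)
    }

body = qdrant_search_body([0.12, -0.03, 0.88], limit=3)
print(json.dumps(body))
```

Since the collection uses Cosine distance, results come back ranked by similarity to the query embedding.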
---
### 2. PostgreSQL + pgvector (Text Embeddings)
**Service:** dagi-postgres
**Port:** 5432
**Status:** ✅ Active
**Databases:**
| Database | Extension | Purpose |
|----------|-----------|---------|
| **daarion_memory** | - | Agent memory, context |
| **daarion_city** | pgvector | RAG document storage (1024-dim) |
**Storage:** Docker volume `postgres-data`
---
### 3. Neo4j (Graph Memory)
**Service:** neo4j
**Port:** 7687 (Bolt), 7474 (HTTP)
**Status:** ✅ Active (optional)
**Purpose:**
- Knowledge graph for entities
- Agent relationships
- DAO structure mapping
**Storage:** Docker volume (if configured)
---
## 🛠️ AI Services
### 1. DAGI Router (9102)
**Purpose:** Main routing engine for AI requests
**LLM Integration:**
- Ollama (qwen3:8b)
- DeepSeek (optional, API key required)
- OpenAI (optional, API key required)
**Providers:**
- LLM Provider (Ollama, DeepSeek, OpenAI)
- Vision Encoder Provider (OpenCLIP)
- DevTools Provider
- CrewAI Provider
- Vision RAG Provider (image search)
---
### 2. RAG Service (9500)
**Purpose:** Document retrieval and Q&A
**Models:**
- Embeddings: BAAI/bge-m3 (1024-dim)
- LLM: via DAGI Router (qwen3:8b)
**Capabilities:**
- Document ingestion (chat, tasks, meetings, governance, RWA, oracle)
- Vector search (pgvector)
- Q&A generation
- Context ranking
---
### 3. Vision Encoder (8001)
**Purpose:** Text/Image embeddings for multimodal RAG
**Models:**
- OpenCLIP ViT-L/14 (768-dim)
**Capabilities:**
- Text embeddings
- Image embeddings
- Image search (text-to-image, image-to-image)
---
### 4. Parser Service (9400)
**Purpose:** Document parsing and processing
**Capabilities:**
- PDF parsing
- Image extraction
- OCR (via Crawl4AI)
- Q&A generation
**Integration:**
- Crawl4AI for web content
- Vision Encoder for image analysis (planned)
---
### 5. Memory Service (8000)
**Purpose:** Agent memory and context management
**Storage:**
- PostgreSQL (daarion_memory)
- Redis (short-term cache, optional)
- Neo4j (graph memory, optional)
---
### 6. CrewAI Orchestrator (9010)
**Purpose:** Multi-agent workflow execution
**LLM:** via DAGI Router (qwen3:8b)
**Workflows:**
- microDAO onboarding
- Code review
- Proposal review
- Task decomposition
---
### 7. DevTools Backend (8008)
**Purpose:** Development tool execution
**Tools:**
- File operations (read/write)
- Test execution
- Notebook execution
- Git operations (planned)
---
### 8. Bot Gateway (9300)
**Purpose:** Telegram/Discord bot integration
**Bots:**
- DAARWIZZ (Telegram)
- Helion (Telegram, Energy Union)
---
### 9. RBAC Service (9200)
**Purpose:** Role-based access control
**Storage:** SQLite (`rbac.db`)
---
## 📊 GPU Memory Allocation (Estimated)
**Total VRAM:** 20 GB
| Service | Model | VRAM Usage | Status |
|---------|-------|-----------|--------|
| **Vision Encoder** | OpenCLIP ViT-L/14 | ~4 GB | Always loaded |
| **Ollama** | qwen3:8b | ~6 GB | Loaded on demand |
| **Available** | - | ~10 GB | For other models |
**Note:**
- Ollama and the Vision Encoder can run simultaneously (~10 GB total)
- The remaining ~10 GB is available for additional models (audio, larger LLMs, etc.)
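The headroom figure is simple arithmetic over the per-model estimates; a small sketch that flags over-allocation before loading another model (the 20 GB constant comes from the RTX 4000 SFF Ada spec above):

```python
TOTAL_VRAM_GB = 20  # RTX 4000 SFF Ada Generation

def vram_headroom(loaded, total=TOTAL_VRAM_GB):
    """Return remaining VRAM (GB) given a {service: estimated_gb} map."""
    used = sum(loaded.values())
    if used > total:
        raise ValueError(f"over budget: {used} GB > {total} GB")
    return total - used

loaded = {"vision-encoder": 4, "ollama/qwen3:8b": 6}
print(vram_headroom(loaded))  # → 10
```

These are planning estimates only; check actual allocation with `nvidia-smi`.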
---
## 🔄 Model Loading Strategy
### Vision Encoder (Always-On)
- **Preloaded:** Yes (on service startup)
- **Reason:** Fast inference for image search
- **Unload:** Never (unless service restart)
### Ollama qwen3:8b (On-Demand)
- **Preloaded:** No
- **Load Time:** 2-3 seconds (first request)
- **Keep Alive:** 5 minutes (default)
- **Unload:** After idle timeout
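The idle timeout can be tuned per request via the `keep_alive` field of the generate API. A minimal sketch of the request payload:

```python
import json

def generate_payload(prompt, model="qwen3:8b", keep_alive="5m"):
    """Ollama /api/generate body; keep_alive controls how long the model
    stays resident in VRAM after the request (e.g. "5m", "0" to unload)."""
    return {
        "model": model,
        "prompt": prompt,
        "keep_alive": keep_alive,
        "stream": False,  # single JSON response instead of NDJSON stream
    }

print(json.dumps(generate_payload("Hello")))
```

Setting `keep_alive` to `-1` keeps the model loaded indefinitely, effectively turning qwen3:8b into an always-on model like the Vision Encoder.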
### Future Models (Planned)
- **Whisper:** Load on-demand for audio transcription
- **TTS:** Load on-demand for speech synthesis
- **Larger LLMs:** Load on-demand (if VRAM available)
---
## 📈 Performance Benchmarks
### LLM Inference (qwen3:8b)
- **Tokens/sec:** ~50-80 tokens/sec (GPU)
- **Latency:** 100-200ms (first token)
- **Context:** 32K tokens
- **Batch size:** 1 (default)
### Vision Inference (ViT-L/14)
- **Text embedding:** 10-20ms (GPU)
- **Image embedding:** 30-50ms (GPU)
- **Throughput:** 50-100 images/sec (batch)
### RAG Search (BAAI/bge-m3)
- **Query embedding:** 50-100ms (CPU)
- **Vector search:** 5-10ms (pgvector)
- **Total latency:** 60-120ms
---
## 🔧 Model Management
### Ollama Models
**List installed models:**
```bash
curl http://localhost:11434/api/tags
```
**Pull new model:**
```bash
ollama pull llama2:7b
ollama pull mistral:7b
```
**Remove model:**
```bash
ollama rm qwen3:8b
```
**Check model info:**
```bash
ollama show qwen3:8b
```
---
### Vision Encoder Models
**Change model (in docker-compose.yml):**
```yaml
environment:
  - MODEL_NAME=ViT-B-32        # Smaller, faster
  - MODEL_PRETRAINED=openai
```
**Available models:**
- `ViT-B-32` (512-dim, 2 GB VRAM)
- `ViT-L-14` (768-dim, 4 GB VRAM) ← Current
- `ViT-L-14@336` (768-dim, 6 GB VRAM, higher resolution)
- `ViT-H-14` (1024-dim, 8 GB VRAM, highest quality)
---
## 📋 Complete Service List (17 Services)
| # | Service | Port | GPU | Models/Tools | Status |
|---|---------|------|-----|-------------|--------|
| 1 | DAGI Router | 9102 | ❌ | Routing engine | ✅ |
| 2 | Bot Gateway | 9300 | ❌ | Telegram bots | ✅ |
| 3 | DevTools | 8008 | ❌ | File ops, tests | ✅ |
| 4 | CrewAI | 9010 | ❌ | Multi-agent | ✅ |
| 5 | RBAC | 9200 | ❌ | Access control | ✅ |
| 6 | RAG Service | 9500 | ❌ | BAAI/bge-m3 | ✅ |
| 7 | Memory Service | 8000 | ❌ | Context mgmt | ✅ |
| 8 | Parser Service | 9400 | ❌ | PDF, OCR | ✅ |
| 9 | **Vision Encoder** | 8001 | ✅ | **OpenCLIP ViT-L/14** | ✅ |
| 10 | PostgreSQL | 5432 | ❌ | pgvector | ✅ |
| 11 | Redis | 6379 | ❌ | Cache | ✅ |
| 12 | Neo4j | 7687 | ❌ | Graph DB | ✅ |
| 13 | **Qdrant** | 6333 | ❌ | Vector DB | ✅ |
| 14 | Grafana | 3000 | ❌ | Dashboards | ✅ |
| 15 | Prometheus | 9090 | ❌ | Metrics | ✅ |
| 16 | Neo4j Exporter | 9091 | ❌ | Metrics | ✅ |
| 17 | **Ollama** | 11434 | ✅ | **qwen3:8b** | ✅ |
**GPU Services:** 2 (Vision Encoder, Ollama)
**Total VRAM Usage:** ~10 GB (concurrent)
---
## 🚀 Deployment Checklist
### Pre-Deployment (Local)
- [x] Code reviewed and tested
- [x] Documentation updated (WARP.md, INFRASTRUCTURE.md)
- [x] Jupyter Notebook updated
- [x] All tests passing
- [x] Git committed and pushed
### Deployment (Server)
```bash
# 1. SSH to server
ssh root@144.76.224.179
# 2. Pull latest code
cd /opt/microdao-daarion
git pull origin main
# 3. Check GPU
nvidia-smi
# 4. Build new services
docker-compose build vision-encoder
# 5. Start all services
docker-compose up -d
# 6. Verify health
docker-compose ps
curl http://localhost:8001/health # Vision Encoder
curl http://localhost:6333/healthz # Qdrant
curl http://localhost:9102/health # Router
# 7. Run smoke tests
./smoke.sh
./test-vision-encoder.sh
# 8. Check logs
docker-compose logs -f vision-encoder
docker-compose logs -f router
# 9. Monitor GPU
watch -n 1 nvidia-smi
```
---
## 📖 Documentation Index
- **[WARP.md](./WARP.md)** — Developer guide (quick start for Warp AI)
- **[INFRASTRUCTURE.md](./INFRASTRUCTURE.md)** — Server, services, deployment
- **[VISION-ENCODER-STATUS.md](./VISION-ENCODER-STATUS.md)** — Vision Encoder status
- **[VISION-RAG-IMPLEMENTATION.md](./VISION-RAG-IMPLEMENTATION.md)** — Vision RAG complete implementation
- **[docs/cursor/vision_encoder_deployment_task.md](./docs/cursor/vision_encoder_deployment_task.md)** — Deployment task
- **[docs/infrastructure_quick_ref.ipynb](./docs/infrastructure_quick_ref.ipynb)** — Jupyter quick reference
---
## 🎯 Next Steps
### Phase 1: Audio Integration
- [ ] Install Whisper (speech-to-text)
- [ ] Install TTS model (text-to-speech)
- [ ] Integrate with Telegram voice messages
- [ ] Audio RAG (transcription + search)
### Phase 2: Larger LLMs
- [ ] Install Mistral 7B (better reasoning)
- [ ] Install Llama 2 70B (if enough VRAM via quantization)
- [ ] Multi-model routing (task-specific models)
### Phase 3: Advanced Vision
- [ ] Image captioning (BLIP-2)
- [ ] Zero-shot classification
- [ ] Video understanding (frame extraction + CLIP)
### Phase 4: Optimization
- [ ] Model quantization (reduce VRAM)
- [ ] Batch inference (increase throughput)
- [ ] Model caching (Redis)
- [ ] GPU sharing (multiple models concurrently)
---
**Last Updated:** 2025-01-17
**Maintained by:** Ivan Tytar & DAARION Team
**Status:** ✅ Production Ready (17 services, 3 AI models)