14 Commits

Author SHA1 Message Date
Apple
5290287058 feat: implement TTS, Document processing, and Memory Service /facts API
- TTS: xtts-v2 integration with voice cloning support
- Document: docling integration for PDF/DOCX/PPTX processing
- Memory Service: added /facts/upsert, /facts/{key}, /facts endpoints
- Added required dependencies (TTS, docling)
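
The upsert/get/list semantics implied by the three /facts endpoints can be sketched as below. This is a hypothetical in-memory stand-in, not the Memory Service code: the class name, method names, and return values are assumptions, and the commit does not state the HTTP methods or payload shapes.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class FactStore:
    """Hypothetical in-memory model of a key/value facts store."""

    _facts: Dict[str, Any] = field(default_factory=dict)

    def upsert(self, key: str, value: Any) -> bool:
        """Insert or overwrite a fact (cf. /facts/upsert).

        Returns True if the key was newly created, False if it was updated.
        """
        created = key not in self._facts
        self._facts[key] = value
        return created

    def get(self, key: str) -> Optional[Any]:
        """Return one fact by key (cf. /facts/{key}), or None if absent."""
        return self._facts.get(key)

    def list_keys(self) -> List[str]:
        """Return all known keys in sorted order (cf. /facts)."""
        return sorted(self._facts)
```

Calling upsert twice with the same key overwrites the stored value instead of raising, which is the property that distinguishes an upsert from a plain insert.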
2026-01-17 08:16:37 -08:00
Apple
3478dfce5f 🔒 CRITICAL: Removed passwords/API keys from documents + closed NodePort
Some checks failed
Build and Deploy Docs / build-and-deploy (push) Has been cancelled
- Removed all passwords and API keys from the documents
- Replaced them with references to Vault
- Closed the NodePort for the Memory Service (internal only)
- Created SECURITY-ROTATION-PLAN.md
- Created ARCHITECTURE-150-NODES.md (plan for 150 nodes)
- Updated config.py (removed the hardcoded Cohere key)
2026-01-10 09:46:03 -08:00
Apple
1c247ea40c 📝 Update context docs with session logging system
Some checks failed
Build and Deploy Docs / build-and-deploy (push) Has been cancelled
- Added Session Logging System section to INFRASTRUCTURE.md
- Added Git Multi-Remote configuration (GitHub + Gitea + GitLab)
- Updated version to 2.5.0
- Added logging commands reference
- Updated infrastructure_quick_ref.ipynb with new features
- Added SSH tunnel instructions for GitLab access
2026-01-10 04:58:01 -08:00
Apple
744c149300 Add automated session logging system
Some checks failed
Build and Deploy Docs / build-and-deploy (push) Has been cancelled
- Created logs/ structure (sessions, operations, incidents)
- Added session-start/log/end scripts
- Installed Git hooks for auto-logging commits/pushes
- Added shell integration for zsh
- Created CHANGELOG.md
- Documented today's session (2026-01-10)
2026-01-10 04:53:17 -08:00
Apple
e67882fd15 docs: Security Incident #3 - postgres:15-alpine compromised image
CRITICAL SECURITY INCIDENT - Postgres:15-alpine Contains Crypto Miners

Discovered: Jan 9, 2026 20:47 UTC
Resolved: Jan 9, 2026 22:07 UTC
Duration: ~2 hours

Malware Discovered (3 variants):
1. cpioshuf (1764% CPU) - /tmp/.perf.c/cpioshuf
2. ipcalcpg_recvlogical - /tmp/.perf.c/ipcalcpg_recvlogical
3. mysql (933% CPU) - /tmp/mysql

Compromised Image:
- Image: postgres:15-alpine
- SHA: b3968e348b48f1198cc6de6611d055dbad91cd561b7990c406c3fc28d7095b21
- Status: BANNED - DO NOT USE

Impact:
- CPU load: 17+ (critical)
- Multiple containers affected: daarion-postgres, dagi-postgres, docker-db-1
- Dify also compromised (used same image)
- System performance degraded for 2 hours

Resolution:
- Killed all 3 miner variants (PIDs: 2294271, 2310302, 2314793, 2366898)
- Stopped and removed ALL postgres:15-alpine containers
- Deleted compromised image permanently
- Migrated to postgres:14-alpine (verified clean)
- Removed entire Dify installation (precautionary)
- System load normalized: 17+ → 0.40

Root Cause:
- The postgres:15-alpine image pulled from Docker Hub was either:
  * Temporarily compromised on Docker Hub, OR
  * Affected by a supply chain vulnerability in PostgreSQL 15
- Persistent infection: malware embedded in image layers
- Auto-restart: orphan containers kept respawning miners

Actions Taken:
- All miners killed and files removed
- Compromised image deleted and BLOCKED
- Migrated to postgres:14-alpine
- Dify completely removed
- /tmp cleaned of all suspicious files
- System verified clean

Prevention Measures:
1. Pin images by SHA (not tags like :latest or :15-alpine)
2. Implement Trivy/Grype security scanning
3. Monitor CPU spikes (alert if load average exceeds 5)
4. Regular /tmp audits for executables
5. Use --remove-orphans in docker-compose
6. Block orphan container spawning
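
Measure 1 could be enforced with a small check like the one below. It is a minimal sketch: the function names are hypothetical, and the check only looks for a trailing `@sha256:<64 hex chars>` digest in the image reference. It does not verify that the digest is trustworthy.

```python
import re
from typing import Iterable, List

# An image reference counts as pinned when it ends with a sha256 digest.
_DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_pinned_by_digest(image_ref: str) -> bool:
    """Return True if the reference ends with '@sha256:<digest>'."""
    return bool(_DIGEST_RE.search(image_ref))


def unpinned_images(image_refs: Iterable[str]) -> List[str]:
    """Return the references that rely on a mutable tag (or no tag at all)."""
    return [ref for ref in image_refs if not is_pinned_by_digest(ref)]
```

A reference such as postgres:14-alpine would be reported as unpinned, because a tag can later be repointed at a different image while a digest cannot.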

Lessons Learned:
- Official images can be compromised
- Never trust :latest or version tags blindly
- Scan ALL images before deployment
- Monitor /tmp for suspicious executables
- One compromised image can spread (Dify used same postgres)
- Multiple malware variants = fallback payloads
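
The /tmp monitoring mentioned above can be sketched as a scan for regular files with any execute bit set. The function name is hypothetical and this is one possible approach under stated assumptions, not the monitoring actually deployed. It walks hidden directories too, since the miners in this incident lived under /tmp/.perf.c.

```python
import os
import stat
from typing import List

# Any of the user/group/other execute permission bits.
_EXEC_BITS = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def find_executables(root: str) -> List[str]:
    """Return sorted paths of regular files under root that are executable."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat so that symlinks are not followed.
                mode = os.lstat(path).st_mode
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(mode) and mode & _EXEC_BITS:
                hits.append(path)
    return sorted(hits)
```

Running find_executables("/tmp") from a scheduled job and alerting on a non-empty result would cover the audit described here. Legitimate tools that unpack binaries into /tmp would need an allowlist.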

Files Updated:
- INFRASTRUCTURE.md - Added Incident #3 complete documentation

POSTGRES:15-ALPINE IS PERMANENTLY BANNED
Use postgres:14-alpine (verified safe)

Co-Authored-By: Warp <agent@warp.dev>
2026-01-09 13:09:38 -08:00
Apple
778907cf0e docs: add NODE3 (Threadripper PRO + RTX 3090) to infrastructure
Added NODE3 - AI/ML Workstation Specification:

Hardware:
- CPU: AMD Ryzen Threadripper PRO 5975WX (32 cores / 64 threads, 3.6 GHz base)
- RAM: 128GB DDR4
- GPU: NVIDIA GeForce RTX 3090 24GB GDDR6X
  - 10496 CUDA cores
  - CUDA 13.0, Driver 580.95.05
- Storage: Samsung SSD 990 PRO 4TB NVMe
  - Root: 100GB (27% used)
  - Available for expansion: 3.5TB

System:
- Hostname: llm80-che-1-1
- IP: 80.77.35.151:33147
- OS: Ubuntu 24.04.3 LTS (Noble Numbat)
- Container Runtime: MicroK8s + containerd
- Uptime: 24/7

Security Status: CLEAN (verified 2026-01-09)
- No crypto miners detected
- 0 zombie processes
- CPU load: 0.17 (very low)
- GPU utilization: 0% (ready for workloads)

Services Running:
- Port 3000 - Unknown service (needs investigation)
- Port 8080 - Unknown service (needs investigation)
- Port 11434 - Ollama (localhost only)
- Port 27017/27019 - MongoDB (localhost only)
- Kubernetes API: 16443
- K8s services: 10248-10259, 25000

Recommended Use Cases:
- 🤖 Large LLM inference (Llama 70B, Qwen 72B, Mixtral 8x22B)
- 🧠 Model training and fine-tuning
- 🎨 Stable Diffusion XL image generation
- 🔬 AI/ML research and experimentation
- 🚀 Kubernetes-based AI service orchestration

Files Updated:
- INFRASTRUCTURE.md v2.4.0
- docs/infrastructure_quick_ref.ipynb v2.3.0

NODE3 is the most powerful node in the infrastructure:
- Most CPU cores: 32c/64t (vs 16c M4 Max)
- Most RAM: 128GB (vs 64GB)
- Dedicated GPU: RTX 3090 24GB VRAM
- Largest storage: 4TB NVMe (vs 2TB)

Co-Authored-By: Warp <agent@warp.dev>
2026-01-09 05:53:16 -08:00
Apple
21691aa042 docs: document Security Incident #2 - recurring container compromise
Security Incident #2 Emergency Response (Jan 9, 2026):
- Documented second compromise with NEW crypto miners (softirq, vrarhpb)
- Root cause: Docker image auto-restarted after server reboot
- Emergency mitigation completed (processes killed, container/images removed, load normalized)
- Created comprehensive rebuild task document: TASK_REBUILD_DAARION_WEB.md
- Updated INFRASTRUCTURE.md v2.3.0 with Incident #2 timeline and lessons learned
- Updated infrastructure_quick_ref.ipynb v2.2.0 with security status

Critical Changes:
- daarion-web container permanently disabled until secure rebuild
- Docker images DELETED (not just container stopped)
- Enhanced firewall rules (SSH rate limiting, port scan blocking)
- Retry test registered with Hetzner
- System load normalized: 30+ → 4.19
- Zombie processes cleaned: 1499 → 5

Files Created/Updated:
1. TASK_REBUILD_DAARION_WEB.md - Detailed rebuild instructions for Cursor agent
2. INFRASTRUCTURE.md - Added Incident #2 to Security section
3. docs/infrastructure_quick_ref.ipynb - Updated security status and version

Lessons Learned:
- ALWAYS delete Docker images, not just containers
- Auto-restart policies are dangerous for compromised containers
- Complete removal = container + image + restart policy change

Status: Emergency mitigation complete, statement submission pending (deadline: 2026-01-09 12:54 UTC)

Hetzner Incident ID: 10F3971:2A (AbuseID)

Co-Authored-By: Warp <agent@warp.dev>
2026-01-09 02:08:13 -08:00
Apple
a1091b03a3 docs: add Cursor Agent SSH access instructions for NODE1
- Add detailed SSH connection guide for Cursor agents
- Include common commands, safety checks, and troubleshooting
- Add interactive session example and best practices
- Update INFRASTRUCTURE.md with section for Cursor agents
- Update infrastructure_quick_ref.ipynb with SSH access configuration
- Provide complete workflow examples for remote operations

Co-Authored-By: Warp <agent@warp.dev>
2026-01-09 02:08:13 -08:00
Apple
e829fe66f2 docs: security incident resolution & firewall implementation
- Document network scanning incident (Dec 6 2025 - Jan 8 2026)
- Add firewall rules to prevent internal network access
- Deploy monitoring script for scanning attempts
- Update INFRASTRUCTURE.md v2.2.0 with Security section
- Update infrastructure_quick_ref.ipynb v2.1.0
- Root cause: compromised daarion-web container with crypto miner
- Resolution: container removed, firewall applied, monitoring deployed

Co-Authored-By: Warp <agent@warp.dev>
2026-01-09 02:08:13 -08:00
GitHub Action
f0d113e234 docs: auto-update repository information [skip ci] 2025-12-01 09:30:48 +00:00
Apple
3de3c8cb36 feat: Add presence heartbeat for Matrix online status
- matrix-gateway: POST /internal/matrix/presence/online endpoint
- usePresenceHeartbeat hook with activity tracking
- Auto away after 5 min inactivity
- Offline on page close/visibility change
- Integrated in MatrixChatRoom component
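
The presence rules listed above reduce to a small decision function. The sketch below is written in Python for illustration and is not the usePresenceHeartbeat hook itself. The function name, the state strings, and the `page_open` flag are assumptions. Only the 5-minute threshold comes from the commit message.

```python
AWAY_AFTER_SECONDS = 5 * 60  # "Auto away after 5 min inactivity"


def presence_state(now: float, last_activity: float, page_open: bool = True) -> str:
    """Derive a presence state from the last user activity timestamp.

    now and last_activity are timestamps in seconds on the same clock.
    """
    if not page_open:
        # Page closed or hidden: report offline regardless of activity.
        return "offline"
    if now - last_activity >= AWAY_AFTER_SECONDS:
        return "away"
    return "online"
```

A heartbeat loop would call this on each tick and send the result to the gateway only when the state changes, which avoids posting redundant updates.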
2025-11-27 00:19:40 -08:00
Apple
3fa206503e docs: updated documentation - Swapper Service instead of Vision Encoder on Node #1 2025-11-21 00:52:00 -08:00
Apple
31f3602047 feat: infrastructure update with Node #2 and new services 2025-11-21 00:35:03 -08:00
Apple
4601c6fca8 feat: add Vision Encoder service + Vision RAG implementation
- Vision Encoder Service (OpenCLIP ViT-L/14, GPU-accelerated)
  - FastAPI app with text/image embedding endpoints (768-dim)
  - Docker support with NVIDIA GPU runtime
  - Port 8001, health checks, model info API

- Qdrant Vector Database integration
  - Port 6333/6334 (HTTP/gRPC)
  - Image embeddings storage (768-dim, Cosine distance)
  - Auto collection creation

- Vision RAG implementation
  - VisionEncoderClient (Python client for API)
  - Image Search module (text-to-image, image-to-image)
  - Vision RAG routing in DAGI Router (mode: image_search)
  - VisionEncoderProvider integration

- Documentation (5000+ lines)
  - SYSTEM-INVENTORY.md - Complete system inventory
  - VISION-ENCODER-STATUS.md - Service status
  - VISION-RAG-IMPLEMENTATION.md - Implementation details
  - vision_encoder_deployment_task.md - Deployment checklist
  - services/vision-encoder/README.md - Deployment guide
  - Updated WARP.md, INFRASTRUCTURE.md, Jupyter Notebook

- Testing
  - test-vision-encoder.sh - Smoke tests (6 tests)
  - Unit tests for client, image search, routing

- Services: 17 total (added Vision Encoder + Qdrant)
- AI Models: 3 (qwen3:8b, OpenCLIP ViT-L/14, BAAI/bge-m3)
- GPU Services: 2 (Vision Encoder, Ollama)
- VRAM Usage: ~10 GB (concurrent)

Status: Production Ready 
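
The cosine distance used for the image embeddings can be illustrated with a plain-Python ranking sketch. This shows the metric only. It is not the Qdrant client or the Image Search module, and the function names are hypothetical. Real embeddings would be 768-dimensional; the logic is the same for any length.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float],
          items: Dict[str, Sequence[float]],
          k: int) -> List[Tuple[str, float]]:
    """Rank stored embeddings by similarity to the query, best first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in items.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

For text-to-image search the query vector would come from the text encoder and the stored vectors from the image encoder. Because cosine similarity ignores vector length, only the direction of each embedding affects the ranking.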
2025-11-17 05:24:36 -08:00