Apple
a0c3c0cbb5
🚀 Matrix Gateway: basic implementation v1
...
- Matrix Client (connection and sync)
- RBAC Checker (permission checks via Postgres)
- Job Creator (creates jobs from commands)
- NATS Publisher (publishes jobs to streams)
- K8s deployment
- README with documentation
Commands: !embed, !retrieve, !summarize
TODO: Real integration with the Matrix homeserver, result statuses
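A minimal sketch of how a Job Creator might turn the three chat commands into jobs. Only the command names come from the commit message; the function name `parse_command` and the job fields (`job_id`, `type`, `args`) are hypothetical and not taken from the repository.

```python
import uuid
from typing import Optional

# Command names are from the commit message; the mapping values and the
# job layout below are assumptions made for illustration.
SUPPORTED_COMMANDS = {
    "!embed": "embed",
    "!retrieve": "retrieve",
    "!summarize": "summarize",
}


def parse_command(message: str) -> Optional[dict]:
    """Turn a chat message into a job dict, or return None if it is not a command."""
    parts = message.strip().split(maxsplit=1)
    if not parts or parts[0] not in SUPPORTED_COMMANDS:
        return None
    return {
        "job_id": str(uuid.uuid4()),           # hypothetical field
        "type": SUPPORTED_COMMANDS[parts[0]],
        "args": parts[1] if len(parts) > 1 else "",
    }
```

Messages that do not start with a known command return `None`, so ordinary chat traffic would be ignored rather than published.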
2026-01-10 10:40:18 -08:00
Apple
a001636c11
🔧 NATS: standalone mode + streams creation Job
...
- NATS runs in standalone mode (1 replica)
- Fixed server_name via initContainer
- Created a K8s Job for creating streams (via Python)
- Created the create-streams.py script
TODO: Create streams via worker-daemon or after fixing DNS in the Job
2026-01-10 10:32:44 -08:00
Apple
346dfdfb2d
🔧 NATS: fixed deployment.yaml with the correct initContainer
...
- Added an initContainer to substitute server_name
- Used emptyDir for writing the config
- Updated volumeMounts
2026-01-10 10:24:41 -08:00
Apple
a688666fa1
🔧 Worker Daemon: basic implementation v1
...
- Capability Registry (Postgres heartbeat)
- NATS Client (subscription to streams)
- Job Executor (job execution)
- Metrics Exporter (Prometheus)
- Dockerfile for deployment
- Fixed server_name in NATS (emptyDir)
TODO: Real implementation of embed/retrieve/summarize, Matrix Gateway, Auth
2026-01-10 10:24:13 -08:00
Apple
8fe0b58978
🚀 NATS JetStream: K8s deployment + streams + job schema v1
...
- K8s deployment (2 replicas, PVC, initContainer for server_name)
- Streams definitions (MM_ONLINE, MM_OFFLINE, MM_WRITE, MM_EVENTS)
- Job payload schema (JSON v1 with idempotency)
- Worker contract (capabilities + ack/retry)
- Init streams script
- Updated ARCHITECTURE-150-NODES.md (Control-plane vs Data-plane)
TODO: Auth (nkeys), 3+ replicas for prod, worker-daemon implementation
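A minimal sketch of what a v1 job payload with an idempotency key could look like. The stream names are from the commit message; the field names and the SHA-256 hashing scheme are assumptions, since the actual schema is not shown here.

```python
import hashlib
import json

# Stream names are taken from the commit message; the payload fields and the
# hashing scheme are assumptions made for illustration only.
STREAMS = ("MM_ONLINE", "MM_OFFLINE", "MM_WRITE", "MM_EVENTS")


def idempotency_key(job_type: str, params: dict) -> str:
    """Derive a stable key so that resubmitting the same job can be detected."""
    canonical = json.dumps({"type": job_type, "params": params}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_job(stream: str, job_type: str, params: dict) -> dict:
    """Build a job payload for one of the known streams."""
    if stream not in STREAMS:
        raise ValueError(f"unknown stream: {stream}")
    return {
        "schema_version": 1,  # hypothetical field name
        "stream": stream,
        "type": job_type,
        "params": params,
        "idempotency_key": idempotency_key(job_type, params),
    }
```

Serializing with `sort_keys=True` keeps the key independent of dictionary ordering, so a retried job with the same parameters maps to the same key and a worker could skip work it has already acknowledged.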
2026-01-10 10:02:25 -08:00
Apple
3478dfce5f
🔒 CRITICAL: Removed passwords/API keys from documents + closed NodePort
...
- Removed all passwords and API keys from documents
- Replaced them with references to Vault
- Closed NodePort for Memory Service (internal only)
- Created SECURITY-ROTATION-PLAN.md
- Created ARCHITECTURE-150-NODES.md (plan for 150 nodes)
- Updated config.py (removed hardcoded Cohere key)
2026-01-10 09:46:03 -08:00
Apple
f7bf935a21
✅ NODE3: Memory Service migrated from Docker to K8s
...
- NODE3 added to the K3s cluster as a worker (llm80-che-1-1)
- Memory Service runs in K8s on NODE3 (pod: memory-service-node3-*)
- Docker container stopped and removed
- Updated MEMORY-MODULE-STATUS.md v3.1.0
2026-01-10 09:26:59 -08:00
Apple
116bf5f3f3
✅ Memory Service running on all nodes + Cohere API configured
...
- NODE1: Memory Service in K8s (port 30800) ✅
- NODE2: Memory Service in Docker (port 8001) ✅
- NODE3: Memory Service in Docker (port 8001) ✅
- All nodes: Cohere API configured for embeddings ✅
- NODE2: ComfyUI verified (macOS App, port 8000) ✅
- Updated MEMORY-MODULE-STATUS.md v3.0.0
2026-01-10 09:13:20 -08:00
Apple
6b02349300
🧠 Update Memory Module Status v2.1.0
...
- NODE2: PostgreSQL + Agent Memory Schema ✅
- NODE3: ComfyUI installed (v0.8.2, PyTorch+CUDA) ✅
- All nodes now have full memory stack
- Added critical TODOs: Memory Service & Cohere API
2026-01-10 09:00:17 -08:00
Apple
f4ccf7c570
🧠 Complete Memory Stack setup across all nodes
...
- NODE1: Neo4j (K8s), NVIDIA RTX 4000 + CUDA 13.1
- NODE2: Fixed Neo4j & Qdrant containers
- NODE3: Full stack (PostgreSQL + Qdrant + Neo4j)
- Updated MEMORY-MODULE-STATUS.md v2.0.0
2026-01-10 08:26:42 -08:00
Apple
8aee29d42d
📊 Add Memory Module Status Report across all nodes
2026-01-10 08:11:12 -08:00
Apple
eed1e30aca
🔧 Add site/ to .gitignore (mkdocs build output)
2026-01-10 07:57:47 -08:00
Apple
fb4f4a16d5
🔧 Fix GitHub Actions docs workflow
...
- Update mkdocs dependencies to latest versions
- Add permissions for GitHub Pages deployment
- Add workflow_dispatch for manual trigger
- Fix build command with fallback
2026-01-10 07:57:36 -08:00
Apple
90758facae
🧠 Add Agent Memory System with PostgreSQL + Qdrant + Cohere
...
Features:
- Three-tier memory architecture (short/mid/long-term)
- PostgreSQL schema for conversations, events, memories
- Qdrant vector database for semantic search
- Cohere embeddings (embed-multilingual-v3.0, 1024 dims)
- FastAPI Memory Service with full CRUD
- External Secrets integration with Vault
- Kubernetes deployment manifests
Components:
- infrastructure/database/agent-memory-schema.sql
- infrastructure/kubernetes/apps/qdrant/
- infrastructure/kubernetes/apps/memory-service/
- services/memory-service/ (FastAPI app)
Also includes:
- External Secrets Operator
- Traefik Ingress Controller
- Cert-Manager with Let's Encrypt
- ArgoCD for GitOps
2026-01-10 07:52:32 -08:00
Apple
12545a7c76
🏗️ Add DAARION Infrastructure Stack
...
- Terraform + Ansible + K3s + Vault + Consul + Observability
- Decentralized network architecture (own datacenters)
- Complete Ansible playbooks:
  - bootstrap.yml: OS setup, packages, SSH
  - hardening.yml: Security (UFW, fail2ban, auditd, Trivy)
  - k3s-install.yml: Lightweight Kubernetes cluster
- Production inventory with NODE1, NODE3
- Group variables for all nodes
- Security check cron script
- Multi-DC ready with Consul support
2026-01-10 05:31:51 -08:00
Apple
02cfd90b6f
🌐 Add 150 nodes network deployment plan
...
- Created NETWORK-150-NODES-PLAN.md with complete architecture
- Ansible playbooks for automated security and deployment
- Terraform configs for Hetzner infrastructure
- Zero Trust security architecture
- Prometheus federation for monitoring
- Estimated costs and roadmap
- PostgreSQL deployed on NODE1 and NODE3
2026-01-10 05:11:45 -08:00
Apple
1231647f94
🛡️ Add comprehensive Security Hardening Plan
...
- Created SECURITY-HARDENING-PLAN.md with 6 security levels
- Added setup-node1-security.sh for automated hardening
- Added scan-image.sh for pre-deployment image scanning
- Created docker-compose.secure.yml template
- Includes: Trivy, fail2ban, UFW, auditd, rkhunter, chkrootkit
- Network isolation, egress filtering, process monitoring
- Incident response procedures and recovery playbook
2026-01-10 05:05:21 -08:00
Apple
1c247ea40c
📝 Update context docs with session logging system
...
- Added Session Logging System section to INFRASTRUCTURE.md
- Added Git Multi-Remote configuration (GitHub + Gitea + GitLab)
- Updated version to 2.5.0
- Added logging commands reference
- Updated infrastructure_quick_ref.ipynb with new features
- Added SSH tunnel instructions for GitLab access
2026-01-10 04:58:01 -08:00
Apple
744c149300
✨ Add automated session logging system
...
- Created logs/ structure (sessions, operations, incidents)
- Added session-start/log/end scripts
- Installed Git hooks for auto-logging commits/pushes
- Added shell integration for zsh
- Created CHANGELOG.md
- Documented today's session (2026-01-10)
2026-01-10 04:53:17 -08:00
Apple
e67882fd15
docs: Security Incident #3 - postgres:15-alpine compromised image
...
CRITICAL SECURITY INCIDENT - Postgres:15-alpine Contains Crypto Miners
Discovered: Jan 9, 2026 20:47 UTC
Resolved: Jan 9, 2026 22:07 UTC
Duration: ~2 hours
Malware Discovered (3 variants):
1. cpioshuf (1764% CPU) - /tmp/.perf.c/cpioshuf
2. ipcalcpg_recvlogical - /tmp/.perf.c/ipcalcpg_recvlogical
3. mysql (933% CPU) - /tmp/mysql
Compromised Image:
- Image: postgres:15-alpine
- SHA: b3968e348b48f1198cc6de6611d055dbad91cd561b7990c406c3fc28d7095b21
- Status: BANNED - DO NOT USE
Impact:
- CPU load: 17+ (critical)
- Multiple containers affected: daarion-postgres, dagi-postgres, docker-db-1
- Dify also compromised (used same image)
- System performance degraded for 2 hours
Resolution:
- Killed all 3 miner variants (PIDs: 2294271, 2310302, 2314793, 2366898)
- Stopped and removed ALL postgres:15-alpine containers
- Deleted compromised image permanently
- Migrated to postgres:14-alpine (verified clean)
- Removed entire Dify installation (precautionary)
- System load normalized: 17+ → 0.40
Root Cause:
- Official Docker Hub image postgres:15-alpine either:
* Temporarily compromised on Docker Hub, OR
* PostgreSQL 15 has supply chain vulnerability
- Persistent infection: malware embedded in image layers
- Auto-restart: orphan containers kept respawning miners
Actions Taken:
- ✅ All miners killed and files removed
- ✅ Compromised image deleted and BLOCKED
- ✅ Migrated to postgres:14-alpine
- ✅ Dify completely removed
- ✅ /tmp cleaned of all suspicious files
- ✅ System verified clean
Prevention Measures:
1. Pin images by SHA (not tags like :latest or :15-alpine)
2. Implement Trivy/Grype security scanning
3. Monitor CPU spikes (alert if >5 load average)
4. Regular /tmp audits for executables
5. Use --remove-orphans in docker-compose
6. Block orphan container spawning
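Three of the prevention measures above (digest pinning, load-average alerting, and /tmp audits) could be sketched as follows. The threshold of 5 is from the commit message; the function names are hypothetical, and this is an illustration rather than the scripts actually deployed.

```python
import os
import stat
from pathlib import Path

LOAD_ALERT_THRESHOLD = 5.0  # from the commit message: alert if load average > 5


def is_digest_pinned(image_ref: str) -> bool:
    """True if the reference is pinned by digest (name@sha256:<64 hex chars>)."""
    _, sep, digest = image_ref.partition("@sha256:")
    return bool(sep) and len(digest) == 64 and all(
        c in "0123456789abcdef" for c in digest
    )


def load_too_high(threshold: float = LOAD_ALERT_THRESHOLD) -> bool:
    """Compare the 1-minute load average with the threshold (Unix only)."""
    return os.getloadavg()[0] > threshold


def find_executables(root: str = "/tmp") -> list[Path]:
    """List regular files under root that have any execute bit set."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                mode = path.lstat().st_mode
            except OSError:
                continue  # file vanished or is unreadable
            if stat.S_ISREG(mode) and mode & (
                stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
            ):
                found.append(path)
    return found
```

A tag such as `postgres:15-alpine` fails `is_digest_pinned`, while a reference of the form `postgres@sha256:<digest>` passes, which matches the first prevention measure.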
Lessons Learned:
- Official images can be compromised
- Never trust :latest or version tags blindly
- Scan ALL images before deployment
- Monitor /tmp for suspicious executables
- One compromised image can spread (Dify used same postgres)
- Multiple malware variants = fallback payloads
Files Updated:
- INFRASTRUCTURE.md - Added Incident #3 complete documentation
POSTGRES:15-ALPINE IS PERMANENTLY BANNED
Use postgres:14-alpine (verified safe)
Co-Authored-By: Warp <agent@warp.dev>
2026-01-09 13:09:38 -08:00
Apple
f6a2007c77
🛡️ security: Implement full container audit and signed images guide
...
- Added security/audit-all-containers.sh: Automated Trivy scan for all 60+ project images
- Added security/hardening/signed-images.md: Guide for Docker Content Trust (DCT)
- Updated NODE1 with audit script and started background scan
- Results will be saved to /opt/microdao-daarion/logs/audits/
Co-authored-by: Cursor Agent <agent@cursor.sh>
2026-01-09 12:16:30 -08:00
Apple
778907cf0e
docs: add NODE3 (Threadripper PRO + RTX 3090) to infrastructure
...
Added NODE3 - AI/ML Workstation Specification:
Hardware:
- CPU: AMD Ryzen Threadripper PRO 5975WX (32 cores / 64 threads, 3.6 GHz boost)
- RAM: 128GB DDR4
- GPU: NVIDIA GeForce RTX 3090 24GB GDDR6X
  - 10496 CUDA cores
  - CUDA 13.0, Driver 580.95.05
- Storage: Samsung SSD 990 PRO 4TB NVMe
  - Root: 100GB (27% used)
  - Available for expansion: 3.5TB
System:
- Hostname: llm80-che-1-1
- IP: 80.77.35.151:33147
- OS: Ubuntu 24.04.3 LTS (Noble Numbat)
- Container Runtime: MicroK8s + containerd
- Uptime: 24/7
Security Status: ✅ CLEAN (verified 2026-01-09)
- No crypto miners detected
- 0 zombie processes
- CPU load: 0.17 (very low)
- GPU utilization: 0% (ready for workloads)
Services Running:
- Port 3000 - Unknown service (needs investigation)
- Port 8080 - Unknown service (needs investigation)
- Port 11434 - Ollama (localhost only)
- Port 27017/27019 - MongoDB (localhost only)
- Kubernetes API: 16443
- K8s services: 10248-10259, 25000
Recommended Use Cases:
- 🤖 Large LLM inference (Llama 70B, Qwen 72B, Mixtral 8x22B)
- 🧠 Model training and fine-tuning
- 🎨 Stable Diffusion XL image generation
- 🔬 AI/ML research and experimentation
- 🚀 Kubernetes-based AI service orchestration
Files Updated:
- INFRASTRUCTURE.md v2.4.0
- docs/infrastructure_quick_ref.ipynb v2.3.0
NODE3 is the most powerful node in the infrastructure:
- Most CPU cores: 32c/64t (vs 16c M4 Max)
- Most RAM: 128GB (vs 64GB)
- Dedicated GPU: RTX 3090 24GB VRAM
- Largest storage: 4TB NVMe (vs 2TB)
Co-Authored-By: Warp <agent@warp.dev>
2026-01-09 05:53:16 -08:00
Apple
cba2ff47f3
📚 docs(security): Add comprehensive Security chapter
...
## New Security Documentation Structure
/security/
├── README.md                 # Security overview & contacts
├── forensics-checklist.md    # Incident investigation guide
├── persistence-scan.sh       # Quick persistence detector
├── runtime-detector.sh       # Mining/suspicious process detector
└── hardening/
    ├── docker.md             # Docker security baseline
    ├── kubernetes.md         # K8s policies (future reference)
    └── cloud.md              # Hetzner-specific hardening
## Key Components
### Forensics Checklist
- Process analysis commands
- Persistence mechanism detection
- Network connection analysis
- File system inspection
- Authentication audit
- Decision matrix for threat response
### Scripts
- persistence-scan.sh: Cron, systemd, executables, SSH keys
- runtime-detector.sh: Mining process detection with --kill option
### Hardening Guides
- Docker: Secure compose template, Dockerfile best practices
- Kubernetes: NetworkPolicy, PodSecurityStandard, Falco rules
- Cloud: Egress firewall, SSH hardening, fail2ban, monitoring
## Post-Incident Documentation
Based on lessons learned from Incidents #1 and #2 (Jan 2026)
Co-authored-by: Cursor Agent <agent@cursor.sh>
2026-01-09 02:08:13 -08:00
Apple
d77a4769c6
🔒 security(daarion-web): Hardening after crypto-mining incidents
...
## Root Cause Analysis
- Found CRITICAL RCE vulnerability in Next.js 15.0.3 (GHSA-9qr9-h5gf-34mp)
- 10 vulnerabilities total including SSRF, DoS, Auth Bypass
- Attack vector: exposed port 3000 + vulnerable Next.js → remote code execution
## Security Fixes
- Upgraded Next.js: 15.0.3 → 15.5.9 (0 vulnerabilities)
- Upgraded eslint-config-next: 15.0.3 → 15.5.9
## Hardening (New Files)
- apps/web/Dockerfile.secure: Multi-stage build, read-only FS, no shell
- docker-compose.web.secure.yml: Resource limits, cap_drop ALL, localhost bind
- scripts/rebuild-daarion-web-secure.sh: Local secure rebuild with Trivy scan
- scripts/deploy-daarion-web-node1.sh: Production deployment to NODE1
- SECURITY-REBUILD-REPORT.md: Full incident analysis and remediation report
## Key Security Measures
- restart: "no" (until verified)
- ports: 127.0.0.1:3000 (localhost only, use Nginx reverse proxy)
- read_only: true
- cap_drop: ALL
- resources.limits: 1 CPU, 512M RAM
- no-new-privileges: true
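The measures listed above could be checked mechanically. The sketch below audits one Compose service that has already been parsed into a dict; it assumes the usual Compose keys (`restart`, `ports`, `read_only`, `cap_drop`, `security_opt`), and the function name `audit_service` is hypothetical. Resource limits are left out for brevity.

```python
def audit_service(service: dict) -> list[str]:
    """Return human-readable findings for one parsed Compose service definition."""
    findings = []
    # restart must be the quoted string "no"; an unquoted YAML `no` may parse
    # as a boolean and would also be reported here.
    if service.get("restart") != "no":
        findings.append('restart policy is not "no"')
    for port in service.get("ports", []):
        if not str(port).startswith("127.0.0.1:"):
            findings.append(f"port {port} is not bound to localhost")
    if service.get("read_only") is not True:
        findings.append("root filesystem is not read-only")
    if "ALL" not in service.get("cap_drop", []):
        findings.append("cap_drop does not include ALL")
    opts = [str(o).replace(" ", "") for o in service.get("security_opt", [])]
    if "no-new-privileges:true" not in opts:
        findings.append("no-new-privileges is not enabled")
    return findings
```

An empty list means the service satisfies every check in this sketch; it does not prove the container is safe.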
## Related Incidents
- Incident #1 (Jan 8): catcal, G4NQXBp miners
- Incident #2 (Jan 9): softirq, vrarhpb miners
- Hetzner AbuseID: 10F3971:2A
Co-authored-by: Cursor Agent <agent@cursor.sh>
2026-01-09 02:08:13 -08:00
Apple
21691aa042
docs: document Security Incident #2 - recurring container compromise
...
Security Incident #2 Emergency Response (Jan 9, 2026):
- Documented second compromise with NEW crypto miners (softirq, vrarhpb)
- Root cause: Docker image auto-restarted after server reboot
- Emergency mitigation completed (processes killed, container/images removed, load normalized)
- Created comprehensive rebuild task document: TASK_REBUILD_DAARION_WEB.md
- Updated INFRASTRUCTURE.md v2.3.0 with Incident #2 timeline and lessons learned
- Updated infrastructure_quick_ref.ipynb v2.2.0 with security status
Critical Changes:
- daarion-web container permanently disabled until secure rebuild
- Docker images DELETED (not just container stopped)
- Enhanced firewall rules (SSH rate limiting, port scan blocking)
- Retry test registered with Hetzner
- System load normalized: 30+ → 4.19
- Zombie processes cleaned: 1499 → 5
Files Created/Updated:
1. TASK_REBUILD_DAARION_WEB.md - Detailed rebuild instructions for Cursor agent
2. INFRASTRUCTURE.md - Added Incident #2 to Security section
3. docs/infrastructure_quick_ref.ipynb - Updated security status and version
Lessons Learned:
- ALWAYS delete Docker images, not just containers
- Auto-restart policies are dangerous for compromised containers
- Complete removal = container + image + restart policy change
Status: Emergency mitigation complete, statement submission pending (deadline: 2026-01-09 12:54 UTC)
Hetzner Incident ID: 10F3971:2A (AbuseID)
Co-Authored-By: Warp <agent@warp.dev>
2026-01-09 02:08:13 -08:00
Apple
a1091b03a3
docs: add Cursor Agent SSH access instructions for NODE1
...
- Add detailed SSH connection guide for Cursor agents
- Include common commands, safety checks, and troubleshooting
- Add interactive session example and best practices
- Update INFRASTRUCTURE.md with section for Cursor agents
- Update infrastructure_quick_ref.ipynb with SSH access configuration
- Provide complete workflow examples for remote operations
Co-Authored-By: Warp <agent@warp.dev>
2026-01-09 02:08:13 -08:00
Apple
e829fe66f2
docs: security incident resolution & firewall implementation
...
- Document network scanning incident (Dec 6 2025 - Jan 8 2026)
- Add firewall rules to prevent internal network access
- Deploy monitoring script for scanning attempts
- Update INFRASTRUCTURE.md v2.2.0 with Security section
- Update infrastructure_quick_ref.ipynb v2.1.0
- Root cause: compromised daarion-web container with crypto miner
- Resolution: container removed, firewall applied, monitoring deployed
Co-Authored-By: Warp <agent@warp.dev>
2026-01-09 02:08:13 -08:00
GitHub Action
e3a8b7464a
docs: auto-update repository information [skip ci]
2025-12-08 09:30:23 +00:00
Apple
254267afa3
fix: Add timeout to agents API fetch to prevent hanging
2025-12-05 03:25:30 -08:00
Apple
33fcb04f65
fix: Make Redis optional for city rooms online count
...
- Handle Redis connection errors gracefully
- Return rooms even if Redis is unavailable
- This fixes 500 error on /api/city/rooms endpoint
2025-12-05 03:18:58 -08:00
Apple
1f97586699
fix: Handle missing crew_team_key column in agents query
...
- Remove crew_team_key from SELECT (column doesn't exist)
- Use pop() to safely handle crew_team_key in data processing
- This fixes 500 error on /api/agents/list endpoint
2025-12-05 02:52:14 -08:00
Apple
72b76bf29f
fix: Remove non-existent owner_type/owner_id columns from city_rooms queries
...
- Fix get_all_rooms() to not select owner_type/owner_id
- Fix get_city_rooms_for_list() to not select owner_type/owner_id
- Fix get_city_rooms_api() to use space_scope instead of owner_type
- This fixes 500 error on /api/city/rooms endpoint
2025-12-05 02:52:03 -08:00
Apple
ad3026e32d
docs: Document root cause of daily data loss and fix
2025-12-05 02:42:44 -08:00
Apple
b2caee4e0e
fix: CRITICAL - Prevent infinite DROP DATABASE loop
...
ROOT CAUSE: Monitor was doing DROP DATABASE when NODE2 agents were missing,
but the backup didn't have NODE2 agents, causing an infinite loop.
FIX:
- FULL RECOVERY (DROP DATABASE) only when MicroDAOs < 5 (critical data loss)
- SOFT RECOVERY (just sync agents) when MicroDAOs exist but agents missing
- Prefer backup with NODE2 agents (full_backup_with_node2*.sql)
- Never DROP DATABASE if MicroDAOs exist
This prevents the daily data loss issue.
2025-12-05 02:41:43 -08:00
Apple
70b528f5cf
docs: Add documentation for periodic data loss fix
2025-12-05 02:36:49 -08:00
Apple
02a0ea9540
fix: Add NODE2 agent count check to prevent data loss
...
- Check for at least 45 NODE2 agents (out of 50 expected)
- This prevents false positives when only core agents exist
- Better detection of actual data loss
2025-12-05 02:36:36 -08:00
Apple
06fe0c5204
fix: Improve database recovery process
...
- Fix empty variable handling in data checks
- Terminate active connections before dropping database
- Increase agent threshold to 50 (9 core + 50 NODE2)
- Add better logging for agent sync verification
2025-12-05 02:35:57 -08:00
Apple
db3b74e1ba
fix: Integrate asset URL fix into recovery process and update docs
2025-12-03 10:13:19 -08:00
Apple
51fdd0d5da
feat: Add script to fix asset URLs after restore
2025-12-03 10:12:21 -08:00
Apple
94889783a3
fix: Restore asset URLs (logos/banners) after database recovery
...
- Update monitor-db-stability.sh to fix asset URLs after restore
- Convert old /assets/ URLs to MinIO format
- Clear invalid banner URLs
2025-12-03 10:12:16 -08:00
Apple
83b7e8f372
docs: Add database stability fix documentation
2025-12-03 10:00:11 -08:00
Apple
19e8436a02
fix: Add database stability monitoring and improve PostgreSQL config
...
- Add monitor-db-stability.sh for automatic recovery
- Improve PostgreSQL shutdown settings to prevent data loss
- Add checkpoint and WAL settings for better persistence
2025-12-03 09:59:41 -08:00
Apple
0c75ded63a
docs: Update test agents fix documentation with removed script info
2025-12-02 13:59:15 -08:00
Apple
7ac2f9c958
fix: Remove setup-node2-agents.sh that was creating test agents
...
- This script was trying to assign test agents (ag_atlas, etc.) to NODE2
- Use sync-node2-dagi-agents.py instead for loading real agents
- Test agents are now automatically removed by health check
2025-12-02 13:58:58 -08:00
Apple
9995e4ef75
docs: Add test agents fix documentation
2025-12-02 13:57:44 -08:00
Apple
6a76cffb88
fix: Add automatic removal of test agents in health check
...
- Add remove-test-agents.sh script
- Integrate test agent removal into db-health-check.sh
- Prevents test agents (ag_atlas, ag_oracle, ag_builder, ag_greeter) from reappearing
2025-12-02 13:57:28 -08:00
Apple
d128caacf6
docs: Add assets restoration guide
2025-12-02 13:45:57 -08:00
Apple
b27bfc1df5
feat: Add script to restore assets to MinIO and update DB URLs
2025-12-02 13:45:14 -08:00
Apple
8c801c1dab
docs: Add database persistence summary
2025-12-02 13:43:16 -08:00
Apple
2bc00b99a8
docs: Add database persistence documentation and improve docker-compose
2025-12-02 13:42:45 -08:00