Commit Graph

530 Commits

Author SHA1 Message Date
Apple
8aee29d42d 📊 Add Memory Module Status Report across all nodes
2026-01-10 08:11:12 -08:00
Apple
eed1e30aca 🔧 Add site/ to .gitignore (mkdocs build output)
2026-01-10 07:57:47 -08:00
Apple
fb4f4a16d5 🔧 Fix GitHub Actions docs workflow
- Update mkdocs dependencies to latest versions
- Add permissions for GitHub Pages deployment
- Add workflow_dispatch for manual trigger
- Fix build command with fallback
2026-01-10 07:57:36 -08:00
Apple
90758facae 🧠 Add Agent Memory System with PostgreSQL + Qdrant + Cohere
Features:
- Three-tier memory architecture (short/mid/long-term)
- PostgreSQL schema for conversations, events, memories
- Qdrant vector database for semantic search
- Cohere embeddings (embed-multilingual-v3.0, 1024 dims)
- FastAPI Memory Service with full CRUD
- External Secrets integration with Vault
- Kubernetes deployment manifests

Components:
- infrastructure/database/agent-memory-schema.sql
- infrastructure/kubernetes/apps/qdrant/
- infrastructure/kubernetes/apps/memory-service/
- services/memory-service/ (FastAPI app)

Also includes:
- External Secrets Operator
- Traefik Ingress Controller
- Cert-Manager with Let's Encrypt
- ArgoCD for GitOps
2026-01-10 07:52:32 -08:00
Apple
12545a7c76 🏗️ Add DAARION Infrastructure Stack
- Terraform + Ansible + K3s + Vault + Consul + Observability
- Decentralized network architecture (own datacenters)
- Complete Ansible playbooks:
  - bootstrap.yml: OS setup, packages, SSH
  - hardening.yml: Security (UFW, fail2ban, auditd, Trivy)
  - k3s-install.yml: Lightweight Kubernetes cluster
- Production inventory with NODE1, NODE3
- Group variables for all nodes
- Security check cron script
- Multi-DC ready with Consul support
2026-01-10 05:31:51 -08:00
Apple
02cfd90b6f 🌐 Add 150 nodes network deployment plan
- Created NETWORK-150-NODES-PLAN.md with complete architecture
- Ansible playbooks for automated security and deployment
- Terraform configs for Hetzner infrastructure
- Zero Trust security architecture
- Prometheus federation for monitoring
- Estimated costs and roadmap
- PostgreSQL deployed on NODE1 and NODE3
2026-01-10 05:11:45 -08:00
Apple
1231647f94 🛡️ Add comprehensive Security Hardening Plan
- Created SECURITY-HARDENING-PLAN.md with 6 security levels
- Added setup-node1-security.sh for automated hardening
- Added scan-image.sh for pre-deployment image scanning
- Created docker-compose.secure.yml template
- Includes: Trivy, fail2ban, UFW, auditd, rkhunter, chkrootkit
- Network isolation, egress filtering, process monitoring
- Incident response procedures and recovery playbook
2026-01-10 05:05:21 -08:00
Apple
1c247ea40c 📝 Update context docs with session logging system
- Added Session Logging System section to INFRASTRUCTURE.md
- Added Git Multi-Remote configuration (GitHub + Gitea + GitLab)
- Updated version to 2.5.0
- Added logging commands reference
- Updated infrastructure_quick_ref.ipynb with new features
- Added SSH tunnel instructions for GitLab access
2026-01-10 04:58:01 -08:00
Apple
744c149300 Add automated session logging system
- Created logs/ structure (sessions, operations, incidents)
- Added session-start/log/end scripts
- Installed Git hooks for auto-logging commits/pushes
- Added shell integration for zsh
- Created CHANGELOG.md
- Documented today's session (2026-01-10)
2026-01-10 04:53:17 -08:00
Apple
e67882fd15 docs: Security Incident #3 - postgres:15-alpine compromised image
CRITICAL SECURITY INCIDENT - Postgres:15-alpine Contains Crypto Miners

Discovered: Jan 9, 2026 20:47 UTC
Resolved: Jan 9, 2026 22:07 UTC
Duration: ~1 hour 20 minutes (discovery to resolution)

Malware Discovered (3 variants):
1. cpioshuf (1764% CPU) - /tmp/.perf.c/cpioshuf
2. ipcalcpg_recvlogical - /tmp/.perf.c/ipcalcpg_recvlogical
3. mysql (933% CPU) - /tmp/mysql

Compromised Image:
- Image: postgres:15-alpine
- SHA: b3968e348b48f1198cc6de6611d055dbad91cd561b7990c406c3fc28d7095b21
- Status: BANNED - DO NOT USE

Impact:
- CPU load: 17+ (critical)
- Multiple containers affected: daarion-postgres, dagi-postgres, docker-db-1
- Dify also compromised (used same image)
- System performance degraded for 2 hours

Resolution:
- Killed all 3 miner variants (PIDs: 2294271, 2310302, 2314793, 2366898)
- Stopped and removed ALL postgres:15-alpine containers
- Deleted compromised image permanently
- Migrated to postgres:14-alpine (verified clean)
- Removed entire Dify installation (precautionary)
- System load normalized: 17+ → 0.40

Root Cause:
- The official Docker Hub image postgres:15-alpine was either:
  * temporarily compromised on Docker Hub, OR
  * affected by a supply chain vulnerability in PostgreSQL 15
- Persistent infection: malware embedded in image layers
- Auto-restart: orphan containers kept respawning miners

Actions Taken:
- All miners killed and files removed
- Compromised image deleted and BLOCKED
- Migrated to postgres:14-alpine
- Dify completely removed
- /tmp cleaned of all suspicious files
- System verified clean

Prevention Measures:
1. Pin images by SHA (not tags like :latest or :15-alpine)
2. Implement Trivy/Grype security scanning
3. Monitor CPU spikes (alert if >5 load average)
4. Regular /tmp audits for executables
5. Use --remove-orphans in docker-compose
6. Block orphan container spawning
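
Prevention measures 1 and 3 can be illustrated with a small check. This is only a sketch in Python, not a script from this repository: the function names are hypothetical, the digest pattern assumes the usual `name@sha256:<64 hex chars>` reference form, and the load threshold of 5 is taken from measure 3.

```python
import re

# A digest-pinned reference ends in "@sha256:" followed by 64 hex characters.
_DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")

# Threshold from prevention measure 3; the constant name is hypothetical.
LOAD_ALERT_THRESHOLD = 5.0


def is_digest_pinned(image_ref: str) -> bool:
    """Return True if the image reference is pinned by digest rather than by tag."""
    return bool(_DIGEST_RE.search(image_ref))


def load_alert(load_1min: float, threshold: float = LOAD_ALERT_THRESHOLD) -> bool:
    """Return True if the 1-minute load average exceeds the alert threshold."""
    return load_1min > threshold
```

A tag-only reference such as `postgres:14-alpine` is reported as not pinned. On Linux, the current 1-minute load average could be passed in from `os.getloadavg()[0]`.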

Lessons Learned:
- Official images can be compromised
- Never trust :latest or version tags blindly
- Scan ALL images before deployment
- Monitor /tmp for suspicious executables
- One compromised image can spread (Dify used same postgres)
- Multiple malware variants = fallback payloads

Files Updated:
- INFRASTRUCTURE.md - Added Incident #3 complete documentation

POSTGRES:15-ALPINE IS PERMANENTLY BANNED
Use postgres:14-alpine (verified safe)

Co-Authored-By: Warp <agent@warp.dev>
2026-01-09 13:09:38 -08:00
Apple
f6a2007c77 🛡️ security: Implement full container audit and signed images guide
- Added security/audit-all-containers.sh: Automated Trivy scan for all 60+ project images
- Added security/hardening/signed-images.md: Guide for Docker Content Trust (DCT)
- Updated NODE1 with audit script and started background scan
- Results will be saved to /opt/microdao-daarion/logs/audits/

Co-authored-by: Cursor Agent <agent@cursor.sh>
2026-01-09 12:16:30 -08:00
Apple
778907cf0e docs: add NODE3 (Threadripper PRO + RTX 3090) to infrastructure
Added NODE3 - AI/ML Workstation Specification:

Hardware:
- CPU: AMD Ryzen Threadripper PRO 5975WX (32 cores / 64 threads, 3.6 GHz base)
- RAM: 128GB DDR4
- GPU: NVIDIA GeForce RTX 3090 24GB GDDR6X
  - 10496 CUDA cores
  - CUDA 13.0, Driver 580.95.05
- Storage: Samsung SSD 990 PRO 4TB NVMe
  - Root: 100GB (27% used)
  - Available for expansion: 3.5TB

System:
- Hostname: llm80-che-1-1
- IP: 80.77.35.151:33147
- OS: Ubuntu 24.04.3 LTS (Noble Numbat)
- Container Runtime: MicroK8s + containerd
- Uptime: 24/7

Security Status: CLEAN (verified 2026-01-09)
- No crypto miners detected
- 0 zombie processes
- CPU load: 0.17 (very low)
- GPU utilization: 0% (ready for workloads)

Services Running:
- Port 3000 - Unknown service (needs investigation)
- Port 8080 - Unknown service (needs investigation)
- Port 11434 - Ollama (localhost only)
- Port 27017/27019 - MongoDB (localhost only)
- Kubernetes API: 16443
- K8s services: 10248-10259, 25000

Recommended Use Cases:
- 🤖 Large LLM inference (Llama 70B, Qwen 72B, Mixtral 8x22B)
- 🧠 Model training and fine-tuning
- 🎨 Stable Diffusion XL image generation
- 🔬 AI/ML research and experimentation
- 🚀 Kubernetes-based AI service orchestration

Files Updated:
- INFRASTRUCTURE.md v2.4.0
- docs/infrastructure_quick_ref.ipynb v2.3.0

NODE3 is the most powerful node in the infrastructure:
- Most CPU cores: 32c/64t (vs 16c M4 Max)
- Most RAM: 128GB (vs 64GB)
- Dedicated GPU: RTX 3090 24GB VRAM
- Largest storage: 4TB NVMe (vs 2TB)

Co-Authored-By: Warp <agent@warp.dev>
2026-01-09 05:53:16 -08:00
Apple
cba2ff47f3 📚 docs(security): Add comprehensive Security chapter
## New Security Documentation Structure

/security/
├── README.md                    # Security overview & contacts
├── forensics-checklist.md       # Incident investigation guide
├── persistence-scan.sh          # Quick persistence detector
├── runtime-detector.sh          # Mining/suspicious process detector
└── hardening/
    ├── docker.md                # Docker security baseline
    ├── kubernetes.md            # K8s policies (future reference)
    └── cloud.md                 # Hetzner-specific hardening

## Key Components

### Forensics Checklist
- Process analysis commands
- Persistence mechanism detection
- Network connection analysis
- File system inspection
- Authentication audit
- Decision matrix for threat response

### Scripts
- persistence-scan.sh: Cron, systemd, executables, SSH keys
- runtime-detector.sh: Mining process detection with --kill option

### Hardening Guides
- Docker: Secure compose template, Dockerfile best practices
- Kubernetes: NetworkPolicy, PodSecurityStandard, Falco rules
- Cloud: Egress firewall, SSH hardening, fail2ban, monitoring

## Post-Incident Documentation
Based on lessons learned from Incidents #1 and #2 (Jan 2026)

Co-authored-by: Cursor Agent <agent@cursor.sh>
2026-01-09 02:08:13 -08:00
Apple
d77a4769c6 🔒 security(daarion-web): Hardening after crypto-mining incidents
## Root Cause Analysis
- Found CRITICAL RCE vulnerability in Next.js 15.0.3 (GHSA-9qr9-h5gf-34mp)
- 10 vulnerabilities total including SSRF, DoS, Auth Bypass
- Attack vector: exposed port 3000 + vulnerable Next.js → remote code execution

## Security Fixes
- Upgraded Next.js: 15.0.3 → 15.5.9 (0 vulnerabilities)
- Upgraded eslint-config-next: 15.0.3 → 15.5.9

## Hardening (New Files)
- apps/web/Dockerfile.secure: Multi-stage build, read-only FS, no shell
- docker-compose.web.secure.yml: Resource limits, cap_drop ALL, localhost bind
- scripts/rebuild-daarion-web-secure.sh: Local secure rebuild with Trivy scan
- scripts/deploy-daarion-web-node1.sh: Production deployment to NODE1
- SECURITY-REBUILD-REPORT.md: Full incident analysis and remediation report

## Key Security Measures
- restart: "no" (until verified)
- ports: 127.0.0.1:3000 (localhost only, use Nginx reverse proxy)
- read_only: true
- cap_drop: ALL
- resources.limits: 1 CPU, 512M RAM
- no-new-privileges: true
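
The measures listed above could be checked mechanically before a deploy. The following is a hedged Python sketch, not part of the commit: `hardening_gaps` is a hypothetical helper, it assumes the compose service has already been parsed into a dict, that ports use the short string form, and that `no-new-privileges` is set through `security_opt` as `no-new-privileges:true`.

```python
def hardening_gaps(service: dict) -> list:
    """Return the names of hardening settings missing from a compose service mapping."""
    gaps = []
    if service.get("read_only") is not True:
        gaps.append("read_only")
    if "ALL" not in service.get("cap_drop", []):
        gaps.append("cap_drop")
    # Assumption: the option is written as "no-new-privileges:true" in security_opt.
    if "no-new-privileges:true" not in service.get("security_opt", []):
        gaps.append("no-new-privileges")
    # Every published port should bind to localhost only (short string form assumed).
    if not all(str(p).startswith("127.0.0.1:") for p in service.get("ports", [])):
        gaps.append("ports")
    return gaps
```

An empty list means the service satisfies all four checks. Resource limits and the restart policy are left out of this sketch.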

## Related Incidents
- Incident #1 (Jan 8): catcal, G4NQXBp miners
- Incident #2 (Jan 9): softirq, vrarhpb miners
- Hetzner AbuseID: 10F3971:2A

Co-authored-by: Cursor Agent <agent@cursor.sh>
2026-01-09 02:08:13 -08:00
Apple
21691aa042 docs: document Security Incident #2 - recurring container compromise
Security Incident #2 Emergency Response (Jan 9, 2026):
- Documented second compromise with NEW crypto miners (softirq, vrarhpb)
- Root cause: Docker image auto-restarted after server reboot
- Emergency mitigation completed (processes killed, container/images removed, load normalized)
- Created comprehensive rebuild task document: TASK_REBUILD_DAARION_WEB.md
- Updated INFRASTRUCTURE.md v2.3.0 with Incident #2 timeline and lessons learned
- Updated infrastructure_quick_ref.ipynb v2.2.0 with security status

Critical Changes:
- daarion-web container permanently disabled until secure rebuild
- Docker images DELETED (not just container stopped)
- Enhanced firewall rules (SSH rate limiting, port scan blocking)
- Retry test registered with Hetzner
- System load normalized: 30+ → 4.19
- Zombie processes cleaned: 1499 → 5

Files Created/Updated:
1. TASK_REBUILD_DAARION_WEB.md - Detailed rebuild instructions for Cursor agent
2. INFRASTRUCTURE.md - Added Incident #2 to Security section
3. docs/infrastructure_quick_ref.ipynb - Updated security status and version

Lessons Learned:
- ALWAYS delete Docker images, not just containers
- Auto-restart policies are dangerous for compromised containers
- Complete removal = container + image + restart policy change

Status: Emergency mitigation complete, statement submission pending (deadline: 2026-01-09 12:54 UTC)

Hetzner Incident ID: 10F3971:2A (AbuseID)

Co-Authored-By: Warp <agent@warp.dev>
2026-01-09 02:08:13 -08:00
Apple
a1091b03a3 docs: add Cursor Agent SSH access instructions for NODE1
- Add detailed SSH connection guide for Cursor agents
- Include common commands, safety checks, and troubleshooting
- Add interactive session example and best practices
- Update INFRASTRUCTURE.md with section for Cursor agents
- Update infrastructure_quick_ref.ipynb with SSH access configuration
- Provide complete workflow examples for remote operations

Co-Authored-By: Warp <agent@warp.dev>
2026-01-09 02:08:13 -08:00
Apple
e829fe66f2 docs: security incident resolution & firewall implementation
- Document network scanning incident (Dec 6 2025 - Jan 8 2026)
- Add firewall rules to prevent internal network access
- Deploy monitoring script for scanning attempts
- Update INFRASTRUCTURE.md v2.2.0 with Security section
- Update infrastructure_quick_ref.ipynb v2.1.0
- Root cause: compromised daarion-web container with crypto miner
- Resolution: container removed, firewall applied, monitoring deployed

Co-Authored-By: Warp <agent@warp.dev>
2026-01-09 02:08:13 -08:00
GitHub Action
e3a8b7464a docs: auto-update repository information [skip ci]
2025-12-08 09:30:23 +00:00
Apple
254267afa3 fix: Add timeout to agents API fetch to prevent hanging
2025-12-05 03:25:30 -08:00
Apple
33fcb04f65 fix: Make Redis optional for city rooms online count
- Handle Redis connection errors gracefully
- Return rooms even if Redis is unavailable
- This fixes 500 error on /api/city/rooms endpoint
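
The graceful-degradation pattern described here might look roughly like the sketch below. It is an illustration in Python, not the actual endpoint code: the function name, the `id` and `online` fields, and the `fetch_online_counts` callable (standing in for the Redis lookup) are all hypothetical.

```python
from typing import Callable, Dict, List


def rooms_with_online_counts(
    rooms: List[dict],
    fetch_online_counts: Callable[[], Dict[str, int]],
) -> List[dict]:
    """Attach online counts to rooms, defaulting to 0 if the counter backend fails."""
    try:
        counts = fetch_online_counts()
    except Exception:
        # Assumption: any backend failure (e.g. Redis down) degrades to "no counts".
        # Real code would likely narrow this to the client's connection errors.
        counts = {}
    return [{**room, "online": counts.get(room["id"], 0)} for room in rooms]
```

The rooms are returned either way, so a Redis outage shows up as zero online counts instead of a 500 response.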
2025-12-05 03:18:58 -08:00
Apple
1f97586699 fix: Handle missing crew_team_key column in agents query
- Remove crew_team_key from SELECT (column doesn't exist)
- Use pop() to safely handle crew_team_key in data processing
- This fixes 500 error on /api/agents/list endpoint
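
The `pop()` approach mentioned above can be sketched as follows. This is a minimal Python illustration with a hypothetical helper name, not the code from the commit; it relies on `dict.pop(key, default)` not raising `KeyError` when the key is absent.

```python
def strip_optional_column(row: dict, column: str = "crew_team_key") -> dict:
    """Return a copy of row without `column`, whether or not it was present."""
    cleaned = dict(row)
    cleaned.pop(column, None)  # no KeyError when the key is absent
    return cleaned
```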
2025-12-05 02:52:14 -08:00
Apple
72b76bf29f fix: Remove non-existent owner_type/owner_id columns from city_rooms queries
- Fix get_all_rooms() to not select owner_type/owner_id
- Fix get_city_rooms_for_list() to not select owner_type/owner_id
- Fix get_city_rooms_api() to use space_scope instead of owner_type
- This fixes 500 error on /api/city/rooms endpoint
2025-12-05 02:52:03 -08:00
Apple
ad3026e32d docs: Document root cause of daily data loss and fix
2025-12-05 02:42:44 -08:00
Apple
b2caee4e0e fix: CRITICAL - Prevent infinite DROP DATABASE loop
ROOT CAUSE: Monitor was doing DROP DATABASE when NODE2 agents were missing,
but the backup didn't have NODE2 agents, causing an infinite loop.

FIX:
- FULL RECOVERY (DROP DATABASE) only when MicroDAOs < 5 (critical data loss)
- SOFT RECOVERY (just sync agents) when MicroDAOs exist but agents missing
- Prefer backup with NODE2 agents (full_backup_with_node2*.sql)
- Never DROP DATABASE if MicroDAOs exist

This prevents the daily data loss issue.
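
The recovery rule described in this fix can be sketched as a small decision function. This is an illustrative Python sketch, not the monitor script itself: the enum, the function name, and the `agents_missing` flag are hypothetical, while the MicroDAO threshold of 5 comes from the commit message.

```python
from enum import Enum


class Recovery(Enum):
    NONE = "none"
    SOFT = "soft"  # only re-sync agents
    FULL = "full"  # DROP DATABASE and restore from backup


# Threshold taken from the commit message; the constant name is hypothetical.
MIN_MICRODAOS = 5


def choose_recovery(microdao_count: int, agents_missing: bool) -> Recovery:
    """Pick a recovery mode; FULL is reserved for critical MicroDAO loss."""
    if microdao_count < MIN_MICRODAOS:
        return Recovery.FULL
    if agents_missing:
        return Recovery.SOFT
    return Recovery.NONE
```

Because missing agents alone can only ever yield `SOFT`, a backup that lacks NODE2 agents can no longer trigger repeated full restores.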
2025-12-05 02:41:43 -08:00
Apple
70b528f5cf docs: Add documentation for periodic data loss fix
2025-12-05 02:36:49 -08:00
Apple
02a0ea9540 fix: Add NODE2 agent count check to prevent data loss
- Check for at least 45 NODE2 agents (out of 50 expected)
- This prevents false positives when only core agents exist
- Better detection of actual data loss
2025-12-05 02:36:36 -08:00
Apple
06fe0c5204 fix: Improve database recovery process
- Fix empty variable handling in data checks
- Terminate active connections before dropping database
- Increase agent threshold to 50 (9 core + 50 NODE2)
- Add better logging for agent sync verification
2025-12-05 02:35:57 -08:00
Apple
db3b74e1ba fix: Integrate asset URL fix into recovery process and update docs
2025-12-03 10:13:19 -08:00
Apple
51fdd0d5da feat: Add script to fix asset URLs after restore
2025-12-03 10:12:21 -08:00
Apple
94889783a3 fix: Restore asset URLs (logos/banners) after database recovery
- Update monitor-db-stability.sh to fix asset URLs after restore
- Convert old /assets/ URLs to MinIO format
- Clear invalid banner URLs
2025-12-03 10:12:16 -08:00
Apple
83b7e8f372 docs: Add database stability fix documentation
2025-12-03 10:00:11 -08:00
Apple
19e8436a02 fix: Add database stability monitoring and improve PostgreSQL config
- Add monitor-db-stability.sh for automatic recovery
- Improve PostgreSQL shutdown settings to prevent data loss
- Add checkpoint and WAL settings for better persistence
2025-12-03 09:59:41 -08:00
Apple
0c75ded63a docs: Update test agents fix documentation with removed script info
2025-12-02 13:59:15 -08:00
Apple
7ac2f9c958 fix: Remove setup-node2-agents.sh that was creating test agents
- This script was trying to assign test agents (ag_atlas, etc.) to NODE2
- Use sync-node2-dagi-agents.py instead for loading real agents
- Test agents are now automatically removed by health check
2025-12-02 13:58:58 -08:00
Apple
9995e4ef75 docs: Add test agents fix documentation
2025-12-02 13:57:44 -08:00
Apple
6a76cffb88 fix: Add automatic removal of test agents in health check
- Add remove-test-agents.sh script
- Integrate test agent removal into db-health-check.sh
- Prevents test agents (ag_atlas, ag_oracle, ag_builder, ag_greeter) from reappearing
2025-12-02 13:57:28 -08:00
Apple
d128caacf6 docs: Add assets restoration guide
2025-12-02 13:45:57 -08:00
Apple
b27bfc1df5 feat: Add script to restore assets to MinIO and update DB URLs
2025-12-02 13:45:14 -08:00
Apple
8c801c1dab docs: Add database persistence summary
2025-12-02 13:43:16 -08:00
Apple
2bc00b99a8 docs: Add database persistence documentation and improve docker-compose
2025-12-02 13:42:45 -08:00
Apple
488dd13af2 fix: Add database persistence and health check scripts
- Add apply-migrations.sh for automatic migration application
- Add ensure-db-persistence.sh for database integrity checks
- Add db-health-check.sh for periodic health monitoring
- Improve PostgreSQL configuration in docker-compose.db.yml
- Add proper shutdown settings to prevent data loss
2025-12-02 13:41:03 -08:00
Apple
770c6a0dfe feat: Add banner display to MicroDAO list cards
- Add banner background to MicroDAO cards in list view
- Use normalizeAssetUrl for banner URLs
- Add fallback green gradient when banner_url is null
- Banner displays as background with overlay for readability
2025-12-02 09:38:58 -08:00
Apple
7ac64c3183 fix: Add banner_url to MicrodaoDetail response
- Add missing banner_url field when creating MicrodaoDetail
- This fixes issue where banner_url was saved in DB but not returned by /api/microdao/{slug} endpoint
2025-12-02 09:19:36 -08:00
Apple
c968705ec7 docs: Add task for completing branding banners MVP
- Add task to verify upload flow for banners
- Document fallback options for banner_url == null
- Add troubleshooting guide
- Document branding assets guide requirements
2025-12-02 09:13:38 -08:00
Apple
fd710da55d fix: Fix TypeScript errors in assets route and add banner_url to MicrodaoSummary
2025-12-02 09:03:35 -08:00
Apple
cf0b3feee0 fix: Add missing fetchMicrodaoDashboard export
2025-12-02 09:01:10 -08:00
Apple
742c238b3b docs: Add manual test plan for assets proxy debugging
2025-12-02 09:00:35 -08:00
Apple
d659f8fd32 fix: Fix Dockerfile COPY command for correct build context
2025-12-02 08:58:48 -08:00
Apple
bc4338f2c0 fix: Fix Dockerfile build context and ensure normalizeAssetUrl is used everywhere
- Fix Dockerfile to use correct paths (context is already apps/web)
- Ensure normalizeAssetUrl is used when setting preview URLs after upload
- This ensures all asset URLs go through the proxy
2025-12-02 08:58:34 -08:00
Apple
51571b3e61 docs: Add assets proxy fix report with HEAD method support
2025-12-02 08:51:23 -08:00