Apple
778907cf0e
docs: add NODE3 (Threadripper PRO + RTX 3090) to infrastructure
...
Added NODE3 - AI/ML Workstation Specification:
Hardware:
- CPU: AMD Ryzen Threadripper PRO 5975WX (32 cores / 64 threads, 3.6 GHz base)
- RAM: 128GB DDR4
- GPU: NVIDIA GeForce RTX 3090 24GB GDDR6X
  - 10496 CUDA cores
  - CUDA 13.0, Driver 580.95.05
- Storage: Samsung SSD 990 PRO 4TB NVMe
  - Root: 100GB (27% used)
  - Available for expansion: 3.5TB
System:
- Hostname: llm80-che-1-1
- IP: 80.77.35.151:33147
- OS: Ubuntu 24.04.3 LTS (Noble Numbat)
- Container Runtime: MicroK8s + containerd
- Uptime: 24/7
Security Status: ✅ CLEAN (verified 2026-01-09)
- No crypto miners detected
- 0 zombie processes
- CPU load: 0.17 (very low)
- GPU utilization: 0% (ready for workloads)
Services Running:
- Port 3000 - Unknown service (needs investigation)
- Port 8080 - Unknown service (needs investigation)
- Port 11434 - Ollama (localhost only)
- Port 27017/27019 - MongoDB (localhost only)
- Kubernetes API: 16443
- K8s services: 10248-10259, 25000
Recommended Use Cases:
- 🤖 Large LLM inference (Llama 70B, Qwen 72B, Mixtral 8x22B; quantized, with CPU/RAM offload beyond the 24GB of VRAM)
- 🧠 Model training and fine-tuning
- 🎨 Stable Diffusion XL image generation
- 🔬 AI/ML research and experimentation
- 🚀 Kubernetes-based AI service orchestration
Files Updated:
- INFRASTRUCTURE.md v2.4.0
- docs/infrastructure_quick_ref.ipynb v2.3.0
NODE3 is the most powerful node in the infrastructure:
- Most CPU cores: 32c/64t (vs 16c M4 Max)
- Most RAM: 128GB (vs 64GB)
- Dedicated GPU: RTX 3090 24GB VRAM
- Largest storage: 4TB NVMe (vs 2TB)
Co-Authored-By: Warp <agent@warp.dev>
2026-01-09 05:53:16 -08:00
Apple
cba2ff47f3
📚 docs(security): Add comprehensive Security chapter
...
## New Security Documentation Structure
/security/
├── README.md # Security overview & contacts
├── forensics-checklist.md # Incident investigation guide
├── persistence-scan.sh # Quick persistence detector
├── runtime-detector.sh # Mining/suspicious process detector
└── hardening/
    ├── docker.md # Docker security baseline
    ├── kubernetes.md # K8s policies (future reference)
    └── cloud.md # Hetzner-specific hardening
## Key Components
### Forensics Checklist
- Process analysis commands
- Persistence mechanism detection
- Network connection analysis
- File system inspection
- Authentication audit
- Decision matrix for threat response
### Scripts
- persistence-scan.sh: Cron, systemd, executables, SSH keys
- runtime-detector.sh: Mining process detection with --kill option
### Hardening Guides
- Docker: Secure compose template, Dockerfile best practices
- Kubernetes: NetworkPolicy, PodSecurityStandard, Falco rules
- Cloud: Egress firewall, SSH hardening, fail2ban, monitoring
## Post-Incident Documentation
Based on lessons learned from Incidents #1 and #2 (Jan 2026)
Co-authored-by: Cursor Agent <agent@cursor.sh>
2026-01-09 02:08:13 -08:00
Apple
d77a4769c6
🔒 security(daarion-web): Hardening after crypto-mining incidents
...
## Root Cause Analysis
- Found CRITICAL RCE vulnerability in Next.js 15.0.3 (GHSA-9qr9-h5gf-34mp)
- 10 vulnerabilities total including SSRF, DoS, Auth Bypass
- Attack vector: exposed port 3000 + vulnerable Next.js → remote code execution
## Security Fixes
- Upgraded Next.js: 15.0.3 → 15.5.9 (0 vulnerabilities)
- Upgraded eslint-config-next: 15.0.3 → 15.5.9
## Hardening (New Files)
- apps/web/Dockerfile.secure: Multi-stage build, read-only FS, no shell
- docker-compose.web.secure.yml: Resource limits, cap_drop ALL, localhost bind
- scripts/rebuild-daarion-web-secure.sh: Local secure rebuild with Trivy scan
- scripts/deploy-daarion-web-node1.sh: Production deployment to NODE1
- SECURITY-REBUILD-REPORT.md: Full incident analysis and remediation report
## Key Security Measures
- restart: "no" (until verified)
- ports: 127.0.0.1:3000 (localhost only, use Nginx reverse proxy)
- read_only: true
- cap_drop: ALL
- resources.limits: 1 CPU, 512M RAM
- no-new-privileges: true
## Related Incidents
- Incident #1 (Jan 8): catcal, G4NQXBp miners
- Incident #2 (Jan 9): softirq, vrarhpb miners
- Hetzner AbuseID: 10F3971:2A
Co-authored-by: Cursor Agent <agent@cursor.sh>
2026-01-09 02:08:13 -08:00
Apple
21691aa042
docs: document Security Incident #2 - recurring container compromise
...
Security Incident #2 Emergency Response (Jan 9, 2026):
- Documented second compromise with NEW crypto miners (softirq, vrarhpb)
- Root cause: Docker image auto-restarted after server reboot
- Emergency mitigation completed (processes killed, container/images removed, load normalized)
- Created comprehensive rebuild task document: TASK_REBUILD_DAARION_WEB.md
- Updated INFRASTRUCTURE.md v2.3.0 with Incident #2 timeline and lessons learned
- Updated infrastructure_quick_ref.ipynb v2.2.0 with security status
Critical Changes:
- daarion-web container permanently disabled until secure rebuild
- Docker images DELETED (not just container stopped)
- Enhanced firewall rules (SSH rate limiting, port scan blocking)
- Retry test registered with Hetzner
- System load normalized: 30+ → 4.19
- Zombie processes cleaned: 1499 → 5
Files Created/Updated:
1. TASK_REBUILD_DAARION_WEB.md - Detailed rebuild instructions for Cursor agent
2. INFRASTRUCTURE.md - Added Incident #2 to Security section
3. docs/infrastructure_quick_ref.ipynb - Updated security status and version
Lessons Learned:
- ALWAYS delete Docker images, not just containers
- Auto-restart policies are dangerous for compromised containers
- Complete removal = container + image + restart policy change
Status: Emergency mitigation complete, statement submission pending (deadline: 2026-01-09 12:54 UTC)
Hetzner Incident ID: 10F3971:2A (AbuseID)
Co-Authored-By: Warp <agent@warp.dev>
2026-01-09 02:08:13 -08:00
Apple
a1091b03a3
docs: add Cursor Agent SSH access instructions for NODE1
...
- Add detailed SSH connection guide for Cursor agents
- Include common commands, safety checks, and troubleshooting
- Add interactive session example and best practices
- Update INFRASTRUCTURE.md with section for Cursor agents
- Update infrastructure_quick_ref.ipynb with SSH access configuration
- Provide complete workflow examples for remote operations
Co-Authored-By: Warp <agent@warp.dev>
2026-01-09 02:08:13 -08:00
Apple
e829fe66f2
docs: security incident resolution & firewall implementation
...
- Document network scanning incident (Dec 6 2025 - Jan 8 2026)
- Add firewall rules to prevent internal network access
- Deploy monitoring script for scanning attempts
- Update INFRASTRUCTURE.md v2.2.0 with Security section
- Update infrastructure_quick_ref.ipynb v2.1.0
- Root cause: compromised daarion-web container with crypto miner
- Resolution: container removed, firewall applied, monitoring deployed
Co-Authored-By: Warp <agent@warp.dev>
2026-01-09 02:08:13 -08:00
GitHub Action
e3a8b7464a
docs: auto-update repository information [skip ci]
2025-12-08 09:30:23 +00:00
Apple
254267afa3
fix: Add timeout to agents API fetch to prevent hanging
2025-12-05 03:25:30 -08:00
Apple
33fcb04f65
fix: Make Redis optional for city rooms online count
...
- Handle Redis connection errors gracefully
- Return rooms even if Redis is unavailable
- This fixes 500 error on /api/city/rooms endpoint
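The graceful-degradation idea in this commit can be sketched roughly as follows. This is not the actual city-service code: `rooms_with_online_count`, the `fetch_counts` callable, and the `online_count` field are hypothetical names chosen for illustration, and the Redis lookup is abstracted behind a plain callable so the sketch stays self-contained.

```python
import logging
from typing import Any, Callable, Mapping

logger = logging.getLogger(__name__)


def rooms_with_online_count(
    rooms: list[dict[str, Any]],
    fetch_counts: Callable[[], Mapping[str, int]],
) -> list[dict[str, Any]]:
    """Attach online counts to rooms, falling back to 0 if the lookup fails.

    `fetch_counts` stands in for the Redis query (hypothetical); any failure
    there is logged and treated as "no presence data" instead of a 500.
    """
    try:
        counts = fetch_counts()
    except Exception as exc:  # broad on purpose: connection refused, timeout, etc.
        logger.warning("Online counts unavailable, returning rooms without them: %s", exc)
        counts = {}
    # Rooms are always returned; only the count degrades to 0.
    return [{**room, "online_count": counts.get(room["id"], 0)} for room in rooms]
```

With this shape, an unavailable cache only affects the online counter, while the room list itself still comes from the database.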
2025-12-05 03:18:58 -08:00
Apple
1f97586699
fix: Handle missing crew_team_key column in agents query
...
- Remove crew_team_key from SELECT (column doesn't exist)
- Use pop() to safely handle crew_team_key in data processing
- This fixes 500 error on /api/agents/list endpoint
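A minimal sketch of the `pop()` pattern mentioned above, assuming each agent row arrives as a dictionary. The function name `normalize_agent_row` and the choice to re-expose the key as `None` are assumptions for illustration, not the code from this commit.

```python
from typing import Any


def normalize_agent_row(row: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of an agent row that tolerates a missing crew_team_key."""
    data = dict(row)  # copy so the caller's row is not mutated
    # pop() with a default never raises KeyError, even when the column
    # is absent from the SELECT (as it is after this fix).
    crew_team_key = data.pop("crew_team_key", None)
    # Assumption: downstream code expects the key to exist, so set it explicitly.
    data["crew_team_key"] = crew_team_key
    return data
```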
2025-12-05 02:52:14 -08:00
Apple
72b76bf29f
fix: Remove non-existent owner_type/owner_id columns from city_rooms queries
...
- Fix get_all_rooms() to not select owner_type/owner_id
- Fix get_city_rooms_for_list() to not select owner_type/owner_id
- Fix get_city_rooms_api() to use space_scope instead of owner_type
- This fixes 500 error on /api/city/rooms endpoint
2025-12-05 02:52:03 -08:00
Apple
ad3026e32d
docs: Document root cause of daily data loss and fix
2025-12-05 02:42:44 -08:00
Apple
b2caee4e0e
fix: CRITICAL - Prevent infinite DROP DATABASE loop
...
ROOT CAUSE: Monitor was doing DROP DATABASE when NODE2 agents were missing,
but the backup didn't have NODE2 agents, causing an infinite loop.
FIX:
- FULL RECOVERY (DROP DATABASE) only when MicroDAOs < 5 (critical data loss)
- SOFT RECOVERY (just sync agents) when MicroDAOs exist but agents missing
- Prefer backup with NODE2 agents (full_backup_with_node2*.sql)
- Never DROP DATABASE if MicroDAOs exist
This prevents the daily data loss issue.
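The decision rule described in FIX can be sketched as below. The monitor itself appears to be a shell script, so this Python version is only an illustration of the logic; `Recovery`, `choose_recovery`, and the boolean `node2_agents_present` input are hypothetical names, and the threshold of 5 is taken from the message above.

```python
from enum import Enum


class Recovery(Enum):
    NONE = "none"  # data looks healthy, do nothing
    SOFT = "soft"  # just sync agents, keep the database
    FULL = "full"  # DROP DATABASE and restore from backup


MIN_MICRODAOS = 5  # below this, the commit treats the state as critical data loss


def choose_recovery(microdao_count: int, node2_agents_present: bool) -> Recovery:
    """Pick a recovery mode without ever dropping a database that still has MicroDAOs."""
    if microdao_count < MIN_MICRODAOS:
        return Recovery.FULL
    if not node2_agents_present:
        # MicroDAOs exist, only agents are missing: re-sync instead of restoring.
        return Recovery.SOFT
    return Recovery.NONE
```

Because missing NODE2 agents alone can no longer trigger `FULL`, a backup that lacks those agents cannot restart the drop-and-restore cycle.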
2025-12-05 02:41:43 -08:00
Apple
70b528f5cf
docs: Add documentation for periodic data loss fix
2025-12-05 02:36:49 -08:00
Apple
02a0ea9540
fix: Add NODE2 agent count check to prevent data loss
...
- Check for at least 45 NODE2 agents (out of 50 expected)
- This prevents false positives when only core agents exist
- Better detection of actual data loss
2025-12-05 02:36:36 -08:00
Apple
06fe0c5204
fix: Improve database recovery process
...
- Fix empty variable handling in data checks
- Terminate active connections before dropping database
- Increase agent threshold to 50 (9 core + 50 NODE2)
- Add better logging for agent sync verification
2025-12-05 02:35:57 -08:00
Apple
db3b74e1ba
fix: Integrate asset URL fix into recovery process and update docs
2025-12-03 10:13:19 -08:00
Apple
51fdd0d5da
feat: Add script to fix asset URLs after restore
2025-12-03 10:12:21 -08:00
Apple
94889783a3
fix: Restore asset URLs (logos/banners) after database recovery
...
- Update monitor-db-stability.sh to fix asset URLs after restore
- Convert old /assets/ URLs to MinIO format
- Clear invalid banner URLs
2025-12-03 10:12:16 -08:00
Apple
83b7e8f372
docs: Add database stability fix documentation
2025-12-03 10:00:11 -08:00
Apple
19e8436a02
fix: Add database stability monitoring and improve PostgreSQL config
...
- Add monitor-db-stability.sh for automatic recovery
- Improve PostgreSQL shutdown settings to prevent data loss
- Add checkpoint and WAL settings for better persistence
2025-12-03 09:59:41 -08:00
Apple
0c75ded63a
docs: Update test agents fix documentation with removed script info
2025-12-02 13:59:15 -08:00
Apple
7ac2f9c958
fix: Remove setup-node2-agents.sh that was creating test agents
...
- This script was trying to assign test agents (ag_atlas, etc.) to NODE2
- Use sync-node2-dagi-agents.py instead for loading real agents
- Test agents are now automatically removed by health check
2025-12-02 13:58:58 -08:00
Apple
9995e4ef75
docs: Add test agents fix documentation
2025-12-02 13:57:44 -08:00
Apple
6a76cffb88
fix: Add automatic removal of test agents in health check
...
- Add remove-test-agents.sh script
- Integrate test agent removal into db-health-check.sh
- Prevents test agents (ag_atlas, ag_oracle, ag_builder, ag_greeter) from reappearing
2025-12-02 13:57:28 -08:00
Apple
d128caacf6
docs: Add assets restoration guide
2025-12-02 13:45:57 -08:00
Apple
b27bfc1df5
feat: Add script to restore assets to MinIO and update DB URLs
2025-12-02 13:45:14 -08:00
Apple
8c801c1dab
docs: Add database persistence summary
2025-12-02 13:43:16 -08:00
Apple
2bc00b99a8
docs: Add database persistence documentation and improve docker-compose
2025-12-02 13:42:45 -08:00
Apple
488dd13af2
fix: Add database persistence and health check scripts
...
- Add apply-migrations.sh for automatic migration application
- Add ensure-db-persistence.sh for database integrity checks
- Add db-health-check.sh for periodic health monitoring
- Improve PostgreSQL configuration in docker-compose.db.yml
- Add proper shutdown settings to prevent data loss
2025-12-02 13:41:03 -08:00
Apple
770c6a0dfe
feat: Add banner display to MicroDAO list cards
...
- Add banner background to MicroDAO cards in list view
- Use normalizeAssetUrl for banner URLs
- Add fallback green gradient when banner_url is null
- Banner displays as background with overlay for readability
2025-12-02 09:38:58 -08:00
Apple
7ac64c3183
fix: Add banner_url to MicrodaoDetail response
...
- Add missing banner_url field when creating MicrodaoDetail
- This fixes issue where banner_url was saved in DB but not returned by /api/microdao/{slug} endpoint
2025-12-02 09:19:36 -08:00
Apple
c968705ec7
docs: Add task for completing branding banners MVP
...
- Add task to verify upload flow for banners
- Document fallback options for banner_url == null
- Add troubleshooting guide
- Document branding assets guide requirements
2025-12-02 09:13:38 -08:00
Apple
fd710da55d
fix: Fix TypeScript errors in assets route and add banner_url to MicrodaoSummary
2025-12-02 09:03:35 -08:00
Apple
cf0b3feee0
fix: Add missing fetchMicrodaoDashboard export
2025-12-02 09:01:10 -08:00
Apple
742c238b3b
docs: Add manual test plan for assets proxy debugging
2025-12-02 09:00:35 -08:00
Apple
d659f8fd32
fix: Fix Dockerfile COPY command for correct build context
2025-12-02 08:58:48 -08:00
Apple
bc4338f2c0
fix: Fix Dockerfile build context and ensure normalizeAssetUrl is used everywhere
...
- Fix Dockerfile to use correct paths (context is already apps/web)
- Ensure normalizeAssetUrl is used when setting preview URLs after upload
- This ensures all asset URLs go through the proxy
2025-12-02 08:58:34 -08:00
Apple
51571b3e61
docs: Add assets proxy fix report with HEAD method support
2025-12-02 08:51:23 -08:00
Apple
f19d5de52b
fix: Add HEAD method support and fix proxy URL in Next.js assets route
...
- Add HEAD method handler in Next.js route
- Fix proxy URL to use correct city-service endpoint
- Handle HEAD requests properly (return headers only)
- This should fix 405 errors when browser checks image availability
2025-12-02 08:50:13 -08:00
Apple
62f03f0dad
fix: Use api_route for HEAD method support in assets proxy
2025-12-02 08:48:13 -08:00
Apple
d13115e3b0
fix: Fix Request import for HEAD method support in assets proxy
2025-12-02 08:47:45 -08:00
Apple
192631c2eb
fix: Add HEAD method support to assets proxy endpoint
...
- Add HEAD method handler for browser availability checks
- Use stat_object for HEAD requests (more efficient)
- Return proper headers for HEAD requests
- This fixes 405 errors when browser checks image availability
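The HEAD handling described here can be illustrated with a framework-free sketch. The real endpoint lives in city-service and uses MinIO's `stat_object`; in this sketch the `ObjectStore` protocol, `ObjectInfo`, and the `(status, headers, body)` return shape are assumptions made so the example runs without external packages, and the `proxy_asset` signature is hypothetical.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class ObjectInfo:
    size: int
    content_type: str


class ObjectStore(Protocol):
    """Stand-in for the storage client (hypothetical interface)."""

    def stat(self, path: str) -> ObjectInfo: ...
    def read(self, path: str) -> bytes: ...


def proxy_asset(method: str, path: str, store: ObjectStore) -> tuple[int, dict[str, str], bytes]:
    """Serve GET and HEAD for an asset; HEAD uses metadata only."""
    if method not in ("GET", "HEAD"):
        return 405, {"Allow": "GET, HEAD"}, b""
    info = store.stat(path)  # metadata lookup, no object download
    headers = {
        "Content-Type": info.content_type,
        "Content-Length": str(info.size),
    }
    if method == "HEAD":
        # Same headers as GET, but an empty body and no read from storage.
        return 200, headers, b""
    return 200, headers, store.read(path)
```

Answering HEAD from metadata alone is what makes it cheaper than GET: the object body is never fetched from storage.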
2025-12-02 08:47:02 -08:00
Apple
55634eac9b
docs: Add assets proxy debug report
2025-12-02 08:37:40 -08:00
Apple
1ca6a4f55a
feat: Complete assets proxy implementation with documentation
...
- Add comprehensive documentation in docs/ASSETS_PROXY.md
- Add contract comments in normalizeAssetUrl and proxy_asset
- Verify all components use normalizeAssetUrl
- Verify ENV variables are correctly set
- Add troubleshooting guide
2025-12-02 08:36:55 -08:00
Apple
b49d7489ea
fix: Use /api/city/assets/proxy/ for asset URLs instead of /api/assets/
...
- Change normalizeAssetUrl to use working city-service proxy endpoint
- This ensures assets work without assets.daarion.space DNS
2025-12-02 07:46:30 -08:00
Apple
517efc6a16
fix: Add API proxy for MinIO assets to work without assets.daarion.space DNS
...
- Add /api/assets/[...path] proxy route in Next.js
- Add /assets/proxy/{path} endpoint in city-service
- Update normalizeAssetUrl to convert assets.daarion.space URLs to /api/assets/...
- This allows assets to work even if DNS for assets.daarion.space is not configured
2025-12-02 07:43:36 -08:00
Apple
77d7b0b06d
fix: Disable test agents (ag_atlas, ag_builder, ag_greeter, ag_oracle) in migration 013
...
- Comment out INSERT for test agents that keep reappearing
- These are not real agents and should not be created
- Real agents are managed through agents_city_mapping.yaml and sync scripts
2025-12-02 07:14:22 -08:00
Apple
fca48b3eb0
feat(node2): Complete NODE2 setup - guardian, agents, swapper models
...
- Node-guardian running on MacBook and updating metrics
- NODE2 agents (Atlas, Greeter, Oracle, Builder Bot) assigned to node-2-macbook-m4max
- Swapper models displaying correctly (8 models)
- DAGI Router agents showing with correct status (3 active, 1 stale)
- Router health check using node_cache for remote nodes
2025-12-02 07:07:58 -08:00
Apple
240ceba2e8
debug(node2): Change logging to WARNING level for router_healthy
2025-12-02 07:05:54 -08:00