microdao-daarion/docs/SECRETS_ROTATION_RUNBOOK.md

Secrets Rotation Runbook

Last Updated: 2026-01-19
Owner: Platform Team


Overview

This runbook describes the procedure for rotating secrets in the DAARION platform without service downtime.


Secrets Inventory

| Secret | Location | Service | Rotation Frequency |
|---|---|---|---|
| NATS Creds | .env files | All services | Quarterly |
| JWT Secrets | Gateway .env | Gateway | Quarterly |
| DeepSeek API Key | Router .env | Router | As needed |
| Mistral API Key | Router .env | Router | As needed |
| Grok API Key | Router .env | Router | As needed |
| Cohere API Key | Router .env | Router | As needed |
| Qdrant API Key | Memory .env | Memory Service | Quarterly |
| PostgreSQL Password | .env files | All services | Quarterly |
| Neo4j Password | .env files | Router, Memory | Quarterly |
| Telegram Bot Tokens | Gateway .env | Gateway | As needed |
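
To make the "Quarterly" entries actionable, the next due date can be derived from the last rotation date. The sketch below assumes "quarterly" means three calendar months after the previous rotation; the function name is hypothetical and not part of the platform.

```python
import calendar
from datetime import date


def next_quarterly_rotation(last_rotated: date) -> date:
    """Return the date three calendar months after last_rotated.

    Assumption: "quarterly" means +3 months; the day is clamped to the
    length of the target month (e.g. Nov 30 -> Feb 28).
    """
    month_index = last_rotated.month - 1 + 3
    year = last_rotated.year + month_index // 12
    month = month_index % 12 + 1
    day = min(last_rotated.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)
```

For example, a secret last rotated on 2026-01-19 would be due again on 2026-04-19 under this assumption.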

Rotation Procedure

Phase 1: Preparation (5 min)

  1. Backup current secrets:

    # Backup all .env files
    cd /opt/microdao-daarion
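    tar -czf secrets_backup_$(date +%Y%m%d).tar.gz \
      gateway-bot/.env \
      services/*/.env \
      docker-compose.node1.yml
    # The archive contains live secrets; restrict it to the owner
    chmod 600 secrets_backup_$(date +%Y%m%d).tar.gz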
    
  2. Create new secrets:

    • Generate new passwords/keys
    • Store in secure location (not in git)
  3. Verify services are healthy:

    docker ps --format "{{.Names}}: {{.Status}}" | grep -E "(gateway|router|memory)"
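
For the "Generate new passwords/keys" step, one option is Python's standard `secrets` module. This is a minimal sketch, not the platform's prescribed method; the function names and default lengths are assumptions. The password is limited to letters and digits so it needs no escaping in shell, sed, or SQL commands.

```python
import secrets
import string


def generate_password(length: int = 32) -> str:
    """Random alphanumeric password (no characters that need shell/SQL escaping)."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


def generate_signing_secret(num_bytes: int = 48) -> str:
    """URL-safe random string, e.g. for a JWT signing secret."""
    return secrets.token_urlsafe(num_bytes)


if __name__ == "__main__":
    # Print once and copy into the secure store; do not write to git-tracked files.
    print(generate_password())
    print(generate_signing_secret())
```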
    

Phase 2: Dual-Validity Period (10 min)

For API Keys (DeepSeek, Mistral, etc.):

  1. Add new key alongside old:

    # In Router .env (paste the current key literally; not every .env loader expands $VARS)
    DEEPSEEK_API_KEY_OLD=<current_key>
    DEEPSEEK_API_KEY_NEW=<new_key>
    
  2. Update code to prefer the new key and fall back to the old one when the new one is unset:

    import os
    api_key = os.getenv("DEEPSEEK_API_KEY_NEW") or os.getenv("DEEPSEEK_API_KEY_OLD")
    
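  3. Restart Router (if the .env is injected through Compose env_file rather than read by the application at startup, a restart keeps the old environment; recreate the container with docker compose up -d instead):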

    docker restart dagi-router-node1
    
  4. Monitor for 5 minutes:

    # Check logs for errors (2>&1 so the container's stderr also reaches grep)
    docker logs -f dagi-router-node1 2>&1 | grep -i error
    
  5. If stable, remove old key:

    # Remove DEEPSEEK_API_KEY_OLD from .env
    docker restart dagi-router-node1
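
The one-line lookup in step 2 only falls back when the new variable is unset; it does not help if the provider rejects the new key. A sketch of a fallback that retries with the old key on an authentication failure is shown here. `AuthError`, `candidate_keys`, and `call_with_fallback` are hypothetical names, and how the Router actually detects a rejected key is an assumption.

```python
from typing import Callable, List, Mapping


class AuthError(Exception):
    """Hypothetical error raised by the provider call when a key is rejected."""


def candidate_keys(env: Mapping[str, str], name: str = "DEEPSEEK_API_KEY") -> List[str]:
    """Keys to try, in order: <name>_NEW, <name>_OLD, then plain <name>."""
    keys: List[str] = []
    for suffix in ("_NEW", "_OLD", ""):
        value = env.get(name + suffix)
        if value and value not in keys:
            keys.append(value)
    return keys


def call_with_fallback(
    call: Callable[[str], str],
    env: Mapping[str, str],
    name: str = "DEEPSEEK_API_KEY",
) -> str:
    """Invoke call(key) with each candidate key until one is accepted."""
    last_error: Exception = RuntimeError(f"no {name} value configured")
    for key in candidate_keys(env, name):
        try:
            return call(key)
        except AuthError as exc:
            last_error = exc  # key rejected; try the next candidate
    raise last_error
```

In a service this would be called with `os.environ` as `env`. Once the old key is removed from the .env, the loop simply has one candidate left.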
    

Phase 3: Database Password Rotation (15 min)

For PostgreSQL:

  1. Update password in Postgres:

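    # Replace <new_password>; avoid "!" here, interactive bash expands it inside double quotes
    docker exec dagi-postgres psql -U daarion -c "
    ALTER USER daarion WITH PASSWORD '<new_password>';
    "
    
  2. Update all .env files with new password:

    # Update in all services (anchored with ^ so keys such as OLD_DB_PASSWORD are left alone)
    find /opt/microdao-daarion -name ".env" -type f -exec sed -i 's/^DB_PASSWORD=.*/DB_PASSWORD=<new_password>/' {} +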
    
  3. Rolling restart (one service at a time):

    docker restart dagi-memory-service-node1
    sleep 5
    docker restart dagi-router-node1
    sleep 5
    # ... continue for all services
    
  4. Verify connectivity:

    docker exec dagi-memory-service-node1 psql -U daarion -d daarion_main -c "SELECT 1;"
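
The sed substitution in step 2 breaks if the password contains characters that are special to sed (such as `/` or `&`). As an alternative sketch, the same update can be done in Python, which treats the value literally. `set_env_value` is a hypothetical helper, and `DB_PASSWORD` is the key name taken from the sed command above.

```python
import re


def set_env_value(text: str, key: str, value: str) -> str:
    """Return .env content with `key` set to `value`.

    Only lines that start with `key=` are replaced; if the key is absent,
    a new line is appended. The value is inserted literally.
    """
    pattern = re.compile(rf"^{re.escape(key)}=.*$", flags=re.MULTILINE)
    line = f"{key}={value}"
    if pattern.search(text):
        # A function replacement avoids backslash/group interpretation in `value`.
        return pattern.sub(lambda _match: line, text)
    separator = "" if text.endswith("\n") or not text else "\n"
    return f"{text}{separator}{line}\n"
```

Reading each .env file, passing its content through this function, and writing it back would replace the find/sed step; file discovery and permissions are left out of the sketch.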
    

Phase 4: NATS Credentials Rotation (10 min)

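  1. Generate new NATS credentials:

    # Note: the nats CLI has no "server generate-credentials" subcommand.
    # With JWT (operator-mode) auth, user creds are typically issued with nsc;
    # <account> is a placeholder for the account that owns the "worker" user.
    nsc generate creds --account <account> --name worker > worker.creds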
    
  2. Update NATS config:

    # Update docker-compose.node1.yml
    # Add new credentials path
    
  3. Rolling restart services:

    # Restart workers first (they reconnect automatically)
    docker restart crewai-nats-worker
    docker restart parser-pipeline
    
    # Then restart Gateway/Router
    docker restart dagi-gateway-node1
    docker restart dagi-router-node1
    
  4. Verify NATS connectivity:

    # Check JetStream streams
    curl http://localhost:8222/jsz
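
The `/jsz` response is JSON, so the check can be automated instead of read by eye. The sketch below assumes the response carries a top-level `streams` count and, when JetStream is off, a `disabled` flag; verify these field names against the running NATS version. `min_streams` is a hypothetical threshold.

```python
import json
import urllib.request
from typing import List


def fetch_jsz(url: str = "http://localhost:8222/jsz", timeout: float = 5.0) -> dict:
    """Fetch and decode the NATS monitoring /jsz document."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def jetstream_problems(jsz: dict, min_streams: int = 1) -> List[str]:
    """Return a list of human-readable problems; empty means the check passed."""
    problems: List[str] = []
    if jsz.get("disabled"):
        problems.append("JetStream is disabled")
    streams = jsz.get("streams", 0)
    if streams < min_streams:
        problems.append(f"expected at least {min_streams} stream(s), found {streams}")
    return problems
```

Calling `jetstream_problems(fetch_jsz())` after the rolling restart gives a quick pass/fail signal; it does not confirm that individual clients reconnected with the new credentials.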
    

Phase 5: Verification (5 min)

  1. Health checks:

    curl http://localhost:9300/health  # Gateway
    curl http://localhost:9102/health  # Router
    curl http://localhost:8000/health  # Memory
    
  2. Test critical flows:

    • Send test message to Telegram bot
    • Verify agent response
    • Check memory storage
  3. Monitor logs:

    # 2>&1 so the container's stderr also reaches grep
    docker logs --tail 100 dagi-gateway-node1 2>&1 | grep -i error
    docker logs --tail 100 dagi-router-node1 2>&1 | grep -i error
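
The three health checks can be run as one pass/fail step. This is a sketch using only the standard library; the URLs are the ones listed in step 1, while treating any 2xx response as healthy is an assumption about what the `/health` endpoints return.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict

ENDPOINTS: Dict[str, str] = {
    "gateway": "http://localhost:9300/health",
    "router": "http://localhost:9102/health",
    "memory": "http://localhost:8000/health",
}


def is_healthy(url: str, timeout: float = 5.0, opener: Callable = urllib.request.urlopen) -> bool:
    """True if the endpoint answers with a 2xx status within the timeout."""
    try:
        with opener(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a 4xx/5xx response (HTTPError).
        return False


def check_all(endpoints: Dict[str, str] = ENDPOINTS, probe: Callable[[str], bool] = is_healthy) -> Dict[str, bool]:
    """Probe every endpoint and return {service name: healthy?}."""
    return {name: probe(url) for name, url in endpoints.items()}


if __name__ == "__main__":
    for service, ok in check_all().items():
        print(f"{service}: {'ok' if ok else 'FAILED'}")
```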
    

Rollback Procedure

If rotation fails:

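  1. Restore old secrets (if the PostgreSQL password was already changed, also revert it with ALTER USER, since restoring .env files does not change the database):

    cd /opt/microdao-daarion
    tar -xzf secrets_backup_YYYYMMDD.tar.gz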
    
  2. Restart services:

    docker-compose -f docker-compose.node1.yml restart
    
  3. Verify recovery:

    # Run health checks
    # Test critical flows
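
Before relying on a backup for rollback, it can help to confirm that the archive actually contains the files the restore expects. A minimal sketch with the standard `tarfile` module follows; the helper names are hypothetical, and the expected paths should match whatever was passed to tar in Phase 1.

```python
import tarfile
from typing import Iterable, List


def backup_members(archive_path: str) -> List[str]:
    """Sorted list of regular files stored in a .tar.gz backup."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return sorted(m.name for m in archive.getmembers() if m.isfile())


def missing_from_backup(archive_path: str, expected: Iterable[str]) -> List[str]:
    """Paths from `expected` that are not present in the archive."""
    present = set(backup_members(archive_path))
    return sorted(set(expected) - present)
```

For instance, `missing_from_backup("secrets_backup_YYYYMMDD.tar.gz", ["gateway-bot/.env", "docker-compose.node1.yml"])` returning an empty list suggests the archive is usable for those two files.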
    

Emergency Contacts

  • Platform Lead: @platform-lead
  • On-Call: Check PagerDuty
  • Slack Channel: #platform-ops

Notes

  • Never commit secrets to git
  • Use environment variables, not hardcoded values
  • Test rotation in staging first
  • Keep backup of old secrets for 30 days
  • Document any custom rotation procedures
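
The 30-day retention note can be enforced with a small check over the backup file names. The sketch below assumes the `secrets_backup_YYYYMMDD.tar.gz` naming from Phase 1; `expired_backups` is a hypothetical helper that only reports candidates and deletes nothing.

```python
import re
from datetime import date, timedelta
from typing import Iterable, List

BACKUP_RE = re.compile(r"^secrets_backup_(\d{4})(\d{2})(\d{2})\.tar\.gz$")


def expired_backups(names: Iterable[str], today: date, keep_days: int = 30) -> List[str]:
    """Backup file names whose embedded date is more than keep_days before today."""
    expired: List[str] = []
    for name in names:
        match = BACKUP_RE.match(name)
        if not match:
            continue  # not a backup produced by this runbook
        try:
            taken = date(int(match[1]), int(match[2]), int(match[3]))
        except ValueError:
            continue  # digits that do not form a valid date
        if today - taken > timedelta(days=keep_days):
            expired.append(name)
    return sorted(expired)
```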