microdao-daarion/docs/SECRETS_ROTATION_RUNBOOK.md
Apple ef3473db21 snapshot: NODE1 production state 2026-02-09
Complete snapshot of /opt/microdao-daarion/ from NODE1 (144.76.224.179).
This represents the actual running production code that has diverged
significantly from the previous main branch.

Key changes from old main:
- Gateway (http_api.py): expanded from ~40KB to 164KB with full agent support
- Router: new /v1/agents/{id}/infer endpoint with vision + DeepSeek routing
- Behavior Policy: SOWA v2.2 (3-level: FULL/ACK/SILENT)
- Agent Registry: config/agent_registry.yml as single source of truth
- 13 agents configured (was 3)
- Memory service integration
- CrewAI teams and roles

Excluded from snapshot: venv/, .env, data/, backups, .tgz archives

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-02-09 08:46:46 -08:00


# Secrets Rotation Runbook
**Last Updated:** 2026-01-19
**Owner:** Platform Team
---
## Overview
This runbook describes the procedure for rotating secrets in the DAARION platform without service downtime.
---
## Secrets Inventory
| Secret | Location | Service | Rotation Frequency |
|-------|----------|---------|-------------------|
| NATS Creds | `.env` files | All services | Quarterly |
| JWT Secrets | Gateway `.env` | Gateway | Quarterly |
| DeepSeek API Key | Router `.env` | Router | As needed |
| Mistral API Key | Router `.env` | Router | As needed |
| Grok API Key | Router `.env` | Router | As needed |
| Cohere API Key | Router `.env` | Router | As needed |
| Qdrant API Key | Memory `.env` | Memory Service | Quarterly |
| PostgreSQL Password | `.env` files | All services | Quarterly |
| Neo4j Password | `.env` files | Router, Memory | Quarterly |
| Telegram Bot Tokens | Gateway `.env` | Gateway | As needed |
---
## Rotation Procedure
### Phase 1: Preparation (5 min)
1. **Backup current secrets:**
```bash
# Backup all .env files (the archive holds plaintext secrets, so restrict access)
cd /opt/microdao-daarion
BACKUP="secrets_backup_$(date +%Y%m%d).tar.gz"
tar -czf "$BACKUP" \
  gateway-bot/.env \
  services/*/.env \
  docker-compose.node1.yml
chmod 600 "$BACKUP"
```
2. **Create new secrets:**
- Generate new passwords/keys
- Store in secure location (not in git)
3. **Verify services are healthy:**
```bash
docker ps --format "{{.Names}}: {{.Status}}" | grep -E "(gateway|router|memory)"
```
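Generating the new values in step 2 can be scripted. The following is a minimal sketch using Python's standard `secrets` module; the function names, the default lengths, and the restricted alphabet are illustrative assumptions, not part of the DAARION codebase. The alphabet deliberately avoids characters that need escaping in `.env` files, `sed` expressions, or SQL string literals.

```python
import secrets
import string

# Assumption: limiting the alphabet to characters that need no escaping in
# .env files, sed expressions, or SQL string literals.
SAFE_ALPHABET = string.ascii_letters + string.digits + "-_"


def generate_password(length: int = 32) -> str:
    """Return a random password drawn from SAFE_ALPHABET."""
    if length < 16:
        raise ValueError("use at least 16 characters for service passwords")
    return "".join(secrets.choice(SAFE_ALPHABET) for _ in range(length))


def generate_hex_secret(num_bytes: int = 32) -> str:
    """Return a hex-encoded random secret (two hex characters per byte)."""
    return secrets.token_hex(num_bytes)
```

Calling `generate_password()` yields a 32-character value suitable for a database password, and `generate_hex_secret()` yields a 64-character hex string that could serve as a signing secret.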
---
### Phase 2: Dual-Validity Period (10 min)
**For API Keys (DeepSeek, Mistral, etc.):**
1. **Add new key alongside old:**
```bash
# In Router .env (paste the current key literally; plain .env files
# do not reliably expand $DEEPSEEK_API_KEY)
DEEPSEEK_API_KEY_OLD=<current_key>
DEEPSEEK_API_KEY_NEW=<new_key>
```
2. **Update code to try the new key first and fall back to the old one:**
```python
import os
api_key = os.getenv("DEEPSEEK_API_KEY_NEW") or os.getenv("DEEPSEEK_API_KEY_OLD")
```
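3. **Restart Router** (if the key is injected through Compose `env_file`/`environment` rather than read from a mounted `.env`, `docker restart` keeps the old environment; recreate the container instead):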
```bash
docker restart dagi-router-node1
```
4. **Monitor for 5 minutes:**
```bash
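# Check logs for errors for 5 minutes (docker logs replays container stderr on stderr, hence 2>&1)
timeout 300 docker logs -f --since 1m dagi-router-node1 2>&1 | grep -i error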
```
5. **If stable, remove old key:**
```bash
# Remove DEEPSEEK_API_KEY_OLD from .env
docker restart dagi-router-node1
```
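The one-line `os.getenv(...) or os.getenv(...)` selection in step 2 only falls back when the new variable is unset or empty. It does not help if the provider rejects the new key. A possible sketch of a fallback that retries with the old key on rejection is shown below; `AuthError` and the `send` callable are hypothetical stand-ins for whatever the Router's HTTP client raises and calls, and are not taken from the actual code.

```python
import os
from typing import Callable, List, Mapping, Optional, Sequence


class AuthError(Exception):
    """Hypothetical error raised by the request function when a key is rejected."""


def candidate_keys(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the configured keys, new key first, skipping unset or empty values."""
    env = os.environ if env is None else env
    names = ("DEEPSEEK_API_KEY_NEW", "DEEPSEEK_API_KEY_OLD")
    return [env[name] for name in names if env.get(name)]


def call_with_fallback(send: Callable[[str], str], keys: Sequence[str]) -> str:
    """Call send(key) with each key in order until one is accepted."""
    last_error: Optional[AuthError] = None
    for key in keys:
        try:
            return send(key)
        except AuthError as exc:
            last_error = exc  # key rejected, try the next one
    if last_error is None:
        raise RuntimeError("no API key configured")
    raise last_error
```

With this shape, the old key can be removed in step 5 once the logs show that no request needed the fallback.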
---
### Phase 3: Database Password Rotation (15 min)
**For PostgreSQL:**
1. **Update password in Postgres:**
```bash
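# Replace <new_password> with the generated value (avoid "!" inside double quotes in interactive bash)
docker exec dagi-postgres psql -U daarion -c \
  "ALTER USER daarion WITH PASSWORD '<new_password>';"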
```
2. **Update all .env files with new password:**
```bash
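# Update in all services (anchored with ^ so variables that merely end in DB_PASSWORD are untouched;
# escape any /, & or \ in the password, or choose one without them)
find /opt/microdao-daarion -name ".env" -exec sed -i 's/^DB_PASSWORD=.*/DB_PASSWORD=<new_password>/' {} \;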
```
3. **Rolling restart (one service at a time):**
```bash
docker restart dagi-memory-service-node1
sleep 5
docker restart dagi-router-node1
sleep 5
# ... continue for all services
```
4. **Verify connectivity:**
```bash
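# Assumes psql is installed in this container and its environment (PGHOST/PGPASSWORD) points at the database
docker exec dagi-memory-service-node1 psql -U daarion -d daarion_main -c "SELECT 1;"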
```
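The `sed` substitution in step 2 breaks when the password contains `/`, `&`, or `\`. As an alternative, the replacement can be done in Python without any escaping. This is a sketch only: `set_env_value` is a hypothetical helper, and `DB_PASSWORD` is the variable name used in the `sed` command above.

```python
import re


def set_env_value(env_text: str, key: str, value: str) -> str:
    """Return env_text with the line `key=...` replaced by `key=value`.

    Only lines that start with the exact key are changed. The key is appended
    if it is not present.
    """
    pattern = re.compile(rf"^{re.escape(key)}=.*$", flags=re.MULTILINE)
    new_line = f"{key}={value}"
    # A function replacement keeps backslashes in the value literal.
    updated, count = pattern.subn(lambda _match: new_line, env_text)
    if count == 0:
        separator = "" if (not env_text or env_text.endswith("\n")) else "\n"
        updated = f"{env_text}{separator}{new_line}\n"
    return updated
```

A caller would read each `.env` file, pass its contents through `set_env_value(text, "DB_PASSWORD", new_password)`, and write the result back with the original file permissions.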
---
### Phase 4: NATS Credentials Rotation (10 min)
1. **Generate new NATS credentials:**
```bash
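# On NATS server
# Confirm this subcommand exists in the installed nats CLI before relying on it;
# JWT-based deployments usually issue .creds files with `nsc generate creds`.
docker exec dagi-nats-node1 nats server generate-credentials \
  --name worker \
  --output /data/worker.creds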
```
2. **Update NATS config:**
```bash
# Update docker-compose.node1.yml
# Add new credentials path
```
3. **Rolling restart services:**
```bash
# Restart workers first (they reconnect automatically)
docker restart crewai-nats-worker
docker restart parser-pipeline
# Then restart Gateway/Router
docker restart dagi-gateway-node1
docker restart dagi-router-node1
```
4. **Verify NATS connectivity:**
```bash
# Check JetStream streams
curl http://localhost:8222/jsz
```
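The rolling restarts in Phases 3 and 4 rely on fixed pauses or none at all. A possible refinement is to wait until each container reports a running state before moving on. The sketch below shells out to `docker restart` and `docker inspect --format '{{.State.Running}}'`; the function names, the timeout values, and the injectable `run` and `sleep` parameters are assumptions made for illustration and testing. A running state is a weak signal, so the health checks in Phase 5 remain necessary.

```python
import subprocess
import time
from typing import Callable, Sequence


def run_docker(args: Sequence[str]) -> str:
    """Run `docker <args>` and return stdout; raises CalledProcessError on failure."""
    result = subprocess.run(
        ["docker", *args], check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def rolling_restart(
    containers: Sequence[str],
    run: Callable[[Sequence[str]], str] = run_docker,
    timeout_s: float = 60.0,
    poll_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Restart containers one at a time, waiting for each to report running."""
    for name in containers:
        run(["restart", name])
        waited = 0.0
        # docker inspect prints "true" or "false" for .State.Running
        while run(["inspect", "--format", "{{.State.Running}}", name]) != "true":
            if waited >= timeout_s:
                raise TimeoutError(f"{name} did not come back within {timeout_s}s")
            sleep(poll_s)
            waited += poll_s
```

For Phase 4 the call would be `rolling_restart(["crewai-nats-worker", "parser-pipeline", "dagi-gateway-node1", "dagi-router-node1"])`, which preserves the workers-first order.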
---
### Phase 5: Verification (5 min)
1. **Health checks:**
```bash
curl -fsS http://localhost:9300/health  # Gateway
curl -fsS http://localhost:9102/health  # Router
curl -fsS http://localhost:8000/health  # Memory
```
2. **Test critical flows:**
- Send test message to Telegram bot
- Verify agent response
- Check memory storage
3. **Monitor logs:**
```bash
docker logs --tail 100 dagi-gateway-node1 2>&1 | grep -i error
docker logs --tail 100 dagi-router-node1 2>&1 | grep -i error
```
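The three health checks in step 1 can be collapsed into a single pass/fail result. The sketch below uses only the standard library and the endpoints listed above. It assumes a healthy service answers with HTTP 200, which the runbook does not state explicitly; the function names and the injectable `fetch` parameter are illustrative.

```python
import urllib.error
import urllib.request
from typing import Callable, List, Mapping

# Endpoints taken from the health-check step of this runbook.
HEALTH_URLS = {
    "gateway": "http://localhost:9300/health",
    "router": "http://localhost:9102/health",
    "memory": "http://localhost:8000/health",
}


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if no response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # server answered with an error status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, timeout, DNS failure, ...


def failing_services(
    urls: Mapping[str, str] = HEALTH_URLS,
    fetch: Callable[[str], int] = http_status,
) -> List[str]:
    """Return the names of services whose health endpoint did not answer 200."""
    return sorted(name for name, url in urls.items() if fetch(url) != 200)
```

An empty list from `failing_services()` means all three endpoints responded; any names returned point to the service whose logs should be inspected first.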
---
## Rollback Procedure
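If rotation fails (restoring `.env` files does not revert a password already changed inside PostgreSQL; set it back with `ALTER USER` as in Phase 3):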
1. **Restore old secrets:**
```bash
cd /opt/microdao-daarion
tar -xzf secrets_backup_YYYYMMDD.tar.gz
```
2. **Restart services:**
```bash
docker-compose -f docker-compose.node1.yml restart
```
3. **Verify recovery:**
```bash
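curl -fsS http://localhost:9300/health  # Gateway
curl -fsS http://localhost:9102/health  # Router
curl -fsS http://localhost:8000/health  # Memory
# Then repeat the Phase 5 critical-flow tests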
```
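Before extracting a backup over live configuration, it can help to confirm that the archive actually contains the expected `.env` files and no paths that escape the working directory. The following sketch uses the standard `tarfile` module; `env_files_in_backup` is a hypothetical helper, and the path checks are a basic precaution rather than a complete safety audit.

```python
import tarfile
from typing import List


def env_files_in_backup(archive_path: str) -> List[str]:
    """Return the .env member paths in the archive, rejecting unsafe paths."""
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
    for name in names:
        # Absolute paths or ".." components would extract outside the target dir.
        if name.startswith("/") or ".." in name.split("/"):
            raise ValueError(f"unsafe path in archive: {name}")
    return sorted(name for name in names if name.endswith(".env"))
```

If the returned list is missing `gateway-bot/.env` or one of the `services/*/.env` files, the backup is incomplete and the rollback should use an older archive.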
---
## Emergency Contacts
- **Platform Lead:** @platform-lead
- **On-Call:** Check PagerDuty
- **Slack Channel:** #platform-ops
---
## Notes
- **Never commit secrets to git**
- **Use environment variables, not hardcoded values**
- **Test rotation in staging first**
- **Keep backup of old secrets for 30 days**
- **Document any custom rotation procedures**
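
The 30-day retention note can be enforced with a small helper that reads the date embedded in the backup filename. This is a sketch that assumes backups keep the `secrets_backup_YYYYMMDD.tar.gz` naming from Phase 1; `expired_backups` is a hypothetical function, and it only reports candidates, leaving deletion to the operator.

```python
import re
from datetime import date, datetime, timedelta
from typing import Iterable, List

# Matches the filename produced by the Phase 1 backup command.
BACKUP_NAME = re.compile(r"^secrets_backup_(\d{8})\.tar\.gz$")


def expired_backups(
    filenames: Iterable[str], today: date, keep_days: int = 30
) -> List[str]:
    """Return backup filenames whose embedded date is more than keep_days old."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in filenames:
        match = BACKUP_NAME.match(name)
        if not match:
            continue  # not a secrets backup, leave it alone
        try:
            taken = datetime.strptime(match.group(1), "%Y%m%d").date()
        except ValueError:
            continue  # eight digits that do not form a valid date
        if taken < cutoff:
            expired.append(name)
    return sorted(expired)
```

Passing `os.listdir("/opt/microdao-daarion")` and `date.today()` returns the archives that are past retention and can be securely deleted.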