# ✅ Node Registry Service — Deployment Checklist
**Version:** 1.0.0
**Date:** 2025-01-17
**Status:** Ready for Production

---
## 📋 Pre-Deployment Checklist
### Local Verification (Node #2)
- [ ] **Test service locally**
```bash
cd services/node-registry
pip install -r requirements.txt
export NODE_REGISTRY_ENV=development
export NODE_REGISTRY_DB_HOST=localhost
export NODE_REGISTRY_DB_NAME=node_registry
python -m app.main
```
- [ ] **Verify endpoints**
```bash
curl http://localhost:9205/health
curl http://localhost:9205/metrics
curl http://localhost:9205/docs # Interactive API docs
```
- [ ] **Test node registration**
```bash
curl -X POST http://localhost:9205/api/v1/nodes/register \
  -H "Content-Type: application/json" \
  -d '{"hostname": "test-node", "ip": "192.168.1.100", "role": "test", "labels": ["test"]}'
```
- [ ] **Test heartbeat**
```bash
curl -X POST http://localhost:9205/api/v1/nodes/heartbeat \
  -H "Content-Type: application/json" \
  -d '{"node_id": "node-test", "status": "online"}'
```
- [ ] **List nodes**
```bash
curl http://localhost:9205/api/v1/nodes
```
- [ ] **Run tests** (if available)
```bash
cd services/node-registry
pytest tests/
```
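The read-only checks in this section can be bundled into one helper for repeat runs. This is a minimal sketch, not part of the service: the function name `smoke_test` is hypothetical, the paths are the ones used in this checklist, and it assumes `curl` is installed.

```bash
# smoke_test BASE_URL: request each read-only endpoint and report OK/FAIL.
# Returns non-zero if any endpoint fails. The function name is illustrative.
smoke_test() {
  base_url=$1
  failures=0
  for path in /health /metrics /docs /api/v1/nodes; do
    # -f: treat HTTP errors (status >= 400) as failures
    # -sS: no progress meter, but still print error messages
    if curl -fsS -o /dev/null --max-time 5 "${base_url}${path}"; then
      echo "OK   ${path}"
    else
      echo "FAIL ${path}"
      failures=$((failures + 1))
    fi
  done
  [ "$failures" -eq 0 ]
}

# Usage: smoke_test http://localhost:9205
```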
---
## 🚀 Production Deployment (Node #1)
### Step 1: Push to GitHub
- [ ] **Commit changes**
```bash
git add services/node-registry/
git add docker-compose.yml
git add scripts/deploy-node-registry.sh
git add NODE-REGISTRY-*.md
git commit -m "feat: Node Registry Service - Full implementation"
git push origin main
```
### Step 2: Pull on Node #1
- [ ] **SSH to Node #1 and pull latest**
```bash
ssh root@144.76.224.179
cd /opt/microdao-daarion
git pull origin main
```
### Step 3: Initialize Database
- [ ] **Run SQL migration**
```bash
# On Node #1
cd /opt/microdao-daarion
# Copy SQL file to container
docker cp services/node-registry/migrations/001_create_node_registry_tables.sql dagi-postgres:/tmp/
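# Execute migration (-f reads the copy inside the container; a host-side "<" redirect would read the host's /tmp instead)
docker exec dagi-postgres psql -U postgres -f /tmp/001_create_node_registry_tables.sql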
# Verify tables
docker exec dagi-postgres psql -U postgres -d node_registry -c "\dt"
```
- [ ] **Generate secure password**
```bash
# Generate password
PASSWORD=$(openssl rand -base64 32)
# Add to .env
echo "NODE_REGISTRY_DB_PASSWORD=$PASSWORD" >> .env
# Verify the entry exists without printing the secret (expect a count of 1)
grep -c '^NODE_REGISTRY_DB_PASSWORD=' .env
```
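Appending with `>>` adds another `NODE_REGISTRY_DB_PASSWORD` line each time this step is re-run. One way to keep the step repeatable is a small replace-or-add helper. This is a sketch under assumptions: `set_env_var` is a hypothetical name, `.env` is assumed to hold plain `KEY=value` lines, and the key is assumed to contain no regex metacharacters.

```bash
# set_env_var FILE KEY VALUE: replace KEY's line in FILE, or add it if absent.
# Hypothetical helper; assumes plain KEY=value lines.
set_env_var() {
  env_file=$1
  key=$2
  value=$3
  touch "$env_file"
  # Keep every line except an existing assignment of this key.
  # grep exits 1 when it prints nothing, hence "|| true".
  grep -v "^${key}=" "$env_file" > "${env_file}.tmp" || true
  printf '%s=%s\n' "$key" "$value" >> "${env_file}.tmp"
  mv "${env_file}.tmp" "$env_file"
  chmod 600 "$env_file"   # the file holds a secret
}

# Usage: set_env_var .env NODE_REGISTRY_DB_PASSWORD "$(openssl rand -base64 32)"
```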
### Step 4: Build and Start Service
- [ ] **Build Docker image**
```bash
docker-compose build node-registry
```
- [ ] **Start service**
```bash
docker-compose up -d node-registry
```
- [ ] **Check container status**
```bash
docker-compose ps | grep node-registry
docker logs dagi-node-registry --tail 50
```
### Step 5: Configure Firewall
- [ ] **Set UFW rules**
```bash
# Allow from local network
ufw allow from 192.168.1.0/24 to any port 9205 proto tcp comment 'Node Registry - LAN'
# Allow from Docker network
ufw allow from 172.16.0.0/12 to any port 9205 proto tcp comment 'Node Registry - Docker'
# Deny from external
ufw deny 9205/tcp comment 'Node Registry - Block external'
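# Verify rules
ufw status | grep 9205
# Note: ports published by Docker are opened through Docker's own iptables
# rules, which UFW does not manage, so also check how docker-compose.yml binds 9205.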
```
### Step 6: Verify Deployment
- [ ] **Health check**
```bash
curl http://localhost:9205/health
# Expected: {"status":"healthy",...,"database":{"connected":true,...}}
```
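Right after `docker-compose up -d` the container may still be starting, so a single `curl` can fail even when the deployment is fine. A retry loop is one way to handle that. The sketch below is illustrative: `wait_for_health` and its default attempt count and delay are assumptions, not values taken from the service.

```bash
# wait_for_health URL [ATTEMPTS] [DELAY_SECONDS]
# Poll URL until curl gets a non-error HTTP status or the attempts run out.
# Defaults (10 attempts, 3 s apart) are arbitrary illustrative values.
wait_for_health() {
  url=$1
  attempts=${2:-10}
  delay=${3:-3}
  n=1
  while [ "$n" -le "$attempts" ]; do
    if curl -fsS -o /dev/null --max-time 5 "$url"; then
      echo "healthy after ${n} attempt(s)"
      return 0
    fi
    n=$((n + 1))
    sleep "$delay"
  done
  echo "still unhealthy after ${attempts} attempt(s)" >&2
  return 1
}

# Usage: wait_for_health http://localhost:9205/health
```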
- [ ] **Metrics check**
```bash
curl http://localhost:9205/metrics
```
- [ ] **Check database connectivity**
```bash
docker exec dagi-postgres psql -U node_registry_user -d node_registry -c "SELECT COUNT(*) FROM nodes;"
```
### Step 7: Register Nodes
- [ ] **Register Node #1 (Production)**
```bash
# Option A: Using bootstrap tool (if installed on Node #1)
python -m tools.dagi_node_agent.bootstrap \
  --role production-router \
  --labels router,gateway,production \
  --registry-url http://localhost:9205
# Option B: Manual API call
curl -X POST http://localhost:9205/api/v1/nodes/register \
  -H "Content-Type: application/json" \
  -d '{
    "hostname": "gateway.daarion.city",
    "ip": "144.76.224.179",
    "role": "production-router",
    "labels": ["router", "gateway", "production"]
  }'
```
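When registering several nodes by hand, the JSON body of the manual call above can be generated instead of retyped. The following is a sketch only: `build_register_payload` is a hypothetical helper, the field names are copied from the manual call, and labels are passed comma-separated in the same style as the bootstrap tool's `--labels` flag. It performs no JSON escaping, so it assumes the values contain no double quotes or backslashes.

```bash
# build_register_payload HOSTNAME IP ROLE LABELS_CSV
# Print the JSON body for POST /api/v1/nodes/register.
# Assumes no argument contains double quotes or backslashes (no escaping is done).
build_register_payload() {
  node_host=$1
  node_ip=$2
  node_role=$3
  labels_csv=$4
  labels_json=""
  old_ifs=$IFS
  IFS=,                      # split the label list on commas
  for label in $labels_csv; do
    labels_json="${labels_json:+$labels_json, }\"${label}\""
  done
  IFS=$old_ifs
  printf '{"hostname": "%s", "ip": "%s", "role": "%s", "labels": [%s]}\n' \
    "$node_host" "$node_ip" "$node_role" "$labels_json"
}

# Usage (curl reads the body from stdin with -d @-):
#   build_register_payload gateway.daarion.city 144.76.224.179 production-router router,gateway,production |
#     curl -X POST http://localhost:9205/api/v1/nodes/register \
#       -H "Content-Type: application/json" -d @-
```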
- [ ] **Register Node #2 (Development) from MacBook**
```bash
# From Node #2
python -m tools.dagi_node_agent.bootstrap \
  --role development-router \
  --labels router,development,mac,gpu \
  --registry-url http://192.168.1.244:9205
```
- [ ] **Verify node registration**
```bash
# List all nodes
curl http://localhost:9205/api/v1/nodes
# Get specific node
curl http://localhost:9205/api/v1/nodes/node-1-hetzner-gex44
```
---
## 🧪 Post-Deployment Testing
### Functional Tests
- [ ] **Test node listing**
```bash
# All nodes
curl http://144.76.224.179:9205/api/v1/nodes
# Filter by role
curl "http://144.76.224.179:9205/api/v1/nodes?role=production-router"
# Filter by label
curl "http://144.76.224.179:9205/api/v1/nodes?label=gateway"
# Filter by status
curl "http://144.76.224.179:9205/api/v1/nodes?status=online"
```
- [ ] **Test heartbeat updates**
```bash
curl -X POST http://144.76.224.179:9205/api/v1/nodes/heartbeat \
  -H "Content-Type: application/json" \
  -d '{"node_id": "node-1-hetzner-gex44", "status": "online"}'
# Verify heartbeat timestamp updated (-s hides curl's progress meter)
curl -s http://144.76.224.179:9205/api/v1/nodes/node-1-hetzner-gex44 | grep last_heartbeat
```
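The same heartbeat call can be wrapped in a function and scheduled, which is what the "automated heartbeat cron jobs" item under Next Steps asks for. This is a minimal sketch: `send_heartbeat`, the script path, and the one-minute interval are all assumptions, since this checklist does not state how often the registry expects a heartbeat. The endpoint and body are the ones used above.

```bash
# send_heartbeat REGISTRY_URL NODE_ID [STATUS]
# POST one heartbeat using the endpoint and body shown in this checklist.
send_heartbeat() {
  registry_url=$1
  node_id=$2
  hb_status=${3:-online}
  curl -fsS --max-time 10 -X POST "${registry_url}/api/v1/nodes/heartbeat" \
    -H "Content-Type: application/json" \
    -d "{\"node_id\": \"${node_id}\", \"status\": \"${hb_status}\"}"
}

# Example crontab entry (every minute, an assumed interval), with the function
# saved in a hypothetical script /usr/local/bin/node-heartbeat.sh that ends by
# calling: send_heartbeat "$@"
#   * * * * * /usr/local/bin/node-heartbeat.sh http://localhost:9205 node-1-hetzner-gex44 >/dev/null 2>&1
```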
- [ ] **Test role profiles**
```bash
curl http://144.76.224.179:9205/api/v1/profiles/production-router
```
### Network Access Tests
- [ ] **Test from Node #2 (internal network)**
```bash
# From MacBook
curl http://144.76.224.179:9205/health
```
- [ ] **Verify external access blocked**
```bash
# From external machine (should fail or timeout)
curl --max-time 5 http://144.76.224.179:9205/health
```
### Integration Tests
- [ ] **DAGI Router integration** (future)
```bash
# Test router can fetch node list
curl http://dagi-router:9102/api/nodes
```
- [ ] **Prometheus scraping** (future)
```bash
# Verify metrics endpoint is scrapable
curl http://144.76.224.179:9205/metrics | grep node_registry
```
---
## 📊 Monitoring Setup
### Prometheus Configuration
- [ ] **Add scrape job to prometheus.yml**
```yaml
scrape_configs:
  - job_name: 'node-registry'
    static_configs:
      - targets: ['node-registry:9205']
    scrape_interval: 30s
```
- [ ] **Reload Prometheus**
```bash
docker-compose restart prometheus
```
### Grafana Dashboard
- [ ] **Create dashboard for Node Registry**
  - Panel: Node Registry uptime
  - Panel: Total registered nodes
  - Panel: Active vs offline nodes
  - Panel: Nodes by role
  - Panel: Recent heartbeats
### Health Check Alerts
- [ ] **Configure alerting** (optional)
```yaml
# prometheus/alerts/node_registry.yml
groups:
  - name: node_registry
    rules:
      - alert: NodeRegistryDown
        expr: up{job="node-registry"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node Registry is down"
```
---
## 🔄 Operational Tasks
### Regular Maintenance
- [ ] **Weekly: Check node heartbeats**
```bash
docker exec dagi-postgres psql -U postgres -d node_registry -c \
  "SELECT node_id, last_heartbeat, status FROM nodes ORDER BY last_heartbeat DESC;"
```
- [ ] **Weekly: Clean old heartbeat logs** (if needed)
```bash
docker exec dagi-postgres psql -U postgres -d node_registry -c \
  "DELETE FROM heartbeat_log WHERE timestamp < NOW() - INTERVAL '30 days';"
```
- [ ] **Monthly: Review registered nodes**
```bash
curl http://144.76.224.179:9205/api/v1/nodes | jq '.[] | {node_id, role, status, last_heartbeat}'
```
### Backup
- [ ] **Backup node_registry database**
```bash
mkdir -p backups  # the redirect below fails if the directory is missing
docker exec dagi-postgres pg_dump -U postgres node_registry > backups/node_registry_$(date +%Y%m%d).sql
```
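Date-stamped dumps accumulate if the backup runs regularly, so some form of rotation is usually wanted. The sketch below is one possible approach rather than an established project procedure: `prune_backups` is a hypothetical helper and the retention count in the usage line is an arbitrary example. It relies on the `node_registry_YYYYMMDD.sql` naming used above, where sorted order is chronological order.

```bash
# prune_backups DIR KEEP: delete the oldest dumps so that at most KEEP remain.
# Relies on node_registry_YYYYMMDD.sql names, which the shell expands in
# sorted (and therefore chronological) order.
prune_backups() {
  backup_dir=$1
  keep=$2
  # Replace the function's positional parameters with the matching files.
  set -- "$backup_dir"/node_registry_*.sql
  [ -e "$1" ] || return 0   # the glob matched nothing
  while [ "$#" -gt "$keep" ]; do
    rm -f -- "$1"           # $1 is always the oldest remaining dump
    shift
  done
}

# Usage (keeping 14 dumps is an example value): prune_backups backups 14
```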
---
## 📚 Documentation Updates
- [ ] **Update INFRASTRUCTURE.md**
  - Add Node Registry to services table (Port 9205)
  - Add environment variables section
- [ ] **Update SYSTEM-INVENTORY.md**
  - Add node-registry service to inventory
  - Update total service count (17 → 18)
- [ ] **Update WARP.md**
  - Add Node Registry service restart command
  - Add node registration examples
---
## ✅ Final Verification
- [ ] Service running on Node #1
- [ ] Database initialized with schema
- [ ] Firewall configured (internal only)
- [ ] Node #1 registered and heartbeat working
- [ ] Node #2 registered and heartbeat working
- [ ] Health endpoint responding
- [ ] Metrics endpoint responding
- [ ] API endpoints functional
- [ ] Documentation updated
- [ ] Monitoring configured
---
## 🎉 Deployment Complete!
**Node Registry Service is now live and ready for production use.**
### Next Steps:
1. Integrate with DAGI Router for node discovery
2. Set up automated heartbeat cron jobs for each node
3. Add authentication/authorization
4. Implement Prometheus metrics export
5. Create Grafana dashboard
---
**Deployed by:** [Your Name]
**Date:** [Deployment Date]
**Status:** ✅ Production Ready