docs: add node1 runbooks, consolidation artifacts, and maintenance scripts

Apple
2026-02-19 00:14:27 -08:00
parent c57e6ed96b
commit 544874d952
586 changed files with 14065 additions and 22 deletions

@@ -0,0 +1,46 @@
# Observability and Backups
## Observability Stack
- Prometheus config: `monitoring/prometheus/prometheus.yml`.
- Scrape targets: Prometheus itself, `agent-e2e-prober`, `gateway`, `router`, `qdrant`, `grafana`.
- Alert rules: `monitoring/prometheus/rules/node1.rules.yml`.
- Grafana provisioning and dashboards:
  - datasources: `monitoring/grafana/provisioning/datasources/prometheus.yml`
  - dashboards: `monitoring/grafana/dashboards/*.json`
  - alerting: `monitoring/grafana/provisioning/alerting/alerts.yml`
- Loki/OTel/Tempo/Jaeger: no evidence of active compose services in this repo's current manifests.
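
One way to confirm that the scrape targets listed above are actually being collected is to inspect the JSON returned by Prometheus's `/api/v1/targets` endpoint. The following is a minimal sketch, not part of the repo: the job names are assumed to match the target names in this document, and the real `job_name` values in `monitoring/prometheus/prometheus.yml` may differ. The helper `unhealthy_jobs` is hypothetical.

```python
# Assumed job names; verify them against the job_name entries in prometheus.yml.
EXPECTED_JOBS = frozenset(
    {"prometheus", "agent-e2e-prober", "gateway", "router", "qdrant", "grafana"}
)


def unhealthy_jobs(targets_payload, expected=EXPECTED_JOBS):
    """Return {job: "missing" | "down"} for expected jobs that are not fully up.

    `targets_payload` is the decoded JSON body of GET /api/v1/targets.
    An empty dict means every expected job has only healthy targets.
    """
    seen = {}
    for target in targets_payload.get("data", {}).get("activeTargets", []):
        job = target.get("labels", {}).get("job")
        if job is not None:
            seen.setdefault(job, []).append(target.get("health", "unknown"))

    problems = {}
    for job in sorted(expected):
        if job not in seen:
            problems[job] = "missing"
        elif any(health != "up" for health in seen[job]):
            # A job counts as healthy only if all of its targets report "up".
            problems[job] = "down"
    return problems
```

The payload can be fetched with any HTTP client and decoded with `json.loads`; keeping the check as a pure function makes it easy to run against a saved response.
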
## Service-Level Telemetry
- Router exposes `/metrics` (`services/router/main.py`).
- Gateway exposes a metrics endpoint (the compose configuration monitors `/metrics`).
- The SenpAI consumer defines Prometheus metrics in code (`senpai_nats_connected`, reconnect counters).
- Prober exports metrics on port `9108`.
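
The endpoints above serve the Prometheus text exposition format, so a quick spot check can be done by parsing the response body. The sketch below is illustrative only: it assumes `senpai_nats_connected` is an unlabelled gauge where `1` means connected, which this document does not confirm. Both helper names are hypothetical.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition into {sample name (with labels): value}."""
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # Label values may contain spaces, so cut at the closing brace.
            end = line.rfind("}")
            name, rest = line[: end + 1], line[end + 1 :].split()
        else:
            parts = line.split()
            name, rest = parts[0], parts[1:]
        if rest:
            # rest[0] is the value; an optional timestamp may follow it.
            samples[name] = float(rest[0])
    return samples


def nats_connected(text):
    """True if the (assumed unlabelled) senpai_nats_connected gauge equals 1."""
    return parse_metrics(text).get("senpai_nats_connected") == 1.0
```

A missing metric is treated as "not connected", which is usually the safer default for a health check.
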
## Backup and DR
### Data backups
- Scheduled Postgres backup container: `docker-compose.backups.yml` (`SCHEDULE: @every 6h`, with daily, weekly, and monthly retention settings).
- Full backup script: `scripts/backup/backup_all.sh` (Postgres dump + Qdrant snapshots + Neo4j dump + metadata file).
- Restore validation script: `scripts/restore/restore_test.sh`.
### Documentation backups
- `scripts/docs/docs_backup.sh` creates timestamped archives and applies retention rotation.
- `scripts/docs/install_local_cron.sh` installs a locally managed cron block for docs maintenance.
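
The pattern of timestamped archives plus retention rotation can be sketched as follows. This is not the logic of `scripts/docs/docs_backup.sh`, whose contents are not shown here: the archive naming scheme (`docs-<UTC timestamp>.tar.gz`) and the count-based retention policy are assumptions made for illustration.

```python
import tarfile
from datetime import datetime, timezone
from pathlib import Path
from typing import List, Optional

PREFIX = "docs-"    # hypothetical archive name prefix
SUFFIX = ".tar.gz"


def create_archive(src_dir: Path, dest_dir: Path, now: Optional[datetime] = None) -> Path:
    """Write src_dir into a gzip tarball named after the current UTC time."""
    now = now or datetime.now(timezone.utc)
    dest_dir.mkdir(parents=True, exist_ok=True)
    archive = dest_dir / f"{PREFIX}{now.strftime('%Y%m%dT%H%M%SZ')}{SUFFIX}"
    with tarfile.open(str(archive), "w:gz") as tar:
        tar.add(str(src_dir), arcname=src_dir.name)
    return archive


def rotate(dest_dir: Path, keep: int) -> List[Path]:
    """Delete all but the `keep` newest archives and return the deleted paths."""
    # The timestamp format sorts lexicographically, so name order is time order.
    archives = sorted(dest_dir.glob(f"{PREFIX}*{SUFFIX}"))
    stale = archives[:-keep] if keep > 0 else archives
    for path in stale:
        path.unlink()
    return stale
```

Using a sortable timestamp in the file name means rotation does not depend on file modification times, which can change when archives are copied between hosts.
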
## DR Readiness Notes
- The backup script's metadata file and the restore script together provide reproducible path checks.
- The compose-based backup path uses the host bind mount `/opt/backups/postgres:/backups`, so the host must provide that storage.
- Runbooks report a prior backup-image version mismatch; compose currently pins the backup image to `:16`.
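
Because the scheduled backups land on the host under `/opt/backups/postgres`, a simple readiness check is to confirm that the newest file there is younger than the backup interval. The sketch below assumes a 30-minute grace period on top of the 6-hour schedule and scans subdirectories recursively; neither detail comes from the repo, and both function names are hypothetical.

```python
import time
from pathlib import Path
from typing import Optional

BACKUP_DIR = Path("/opt/backups/postgres")  # host side of the bind mount
MAX_AGE_SECONDS = 6 * 3600 + 30 * 60        # 6h schedule plus an assumed 30 min grace


def newest_backup_age(backup_dir: Path, now: Optional[float] = None) -> Optional[float]:
    """Age in seconds of the most recently modified file, or None if there is none."""
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime for p in backup_dir.rglob("*") if p.is_file()]
    return now - max(mtimes) if mtimes else None


def backups_are_fresh(
    backup_dir: Path = BACKUP_DIR,
    max_age: float = MAX_AGE_SECONDS,
    now: Optional[float] = None,
) -> bool:
    """True if at least one backup file exists and is no older than max_age."""
    age = newest_backup_age(backup_dir, now)
    return age is not None and age <= max_age
```

An empty directory is reported as not fresh, so the check also catches a bind mount that points at the wrong host path.
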
## Source pointers
- `monitoring/prometheus/prometheus.yml`
- `monitoring/prometheus/rules/node1.rules.yml`
- `monitoring/grafana/provisioning/datasources/prometheus.yml`
- `monitoring/grafana/provisioning/alerting/alerts.yml`
- `monitoring/grafana/dashboards/nats_memory.json`
- `docker-compose.backups.yml`
- `scripts/backup/backup_all.sh`
- `scripts/restore/restore_test.sh`
- `scripts/docs/docs_backup.sh`
- `scripts/docs/install_local_cron.sh`
- `docs/NODA1-MEMORY-RUNBOOK.md`
- `docs/NODA1-TECHBORGS-PATCHES.md`