feat: add Node Registry, GreenFood, Monitoring, and Utils

Apple
2025-11-21 00:35:41 -08:00
parent 31f3602047
commit e018b9ab68
74 changed files with 13948 additions and 0 deletions

monitoring/README.md

@@ -0,0 +1,225 @@
# DAARION Platform Monitoring
**Stack**: Prometheus + Grafana
**Server**: `144.76.224.179`
---
## 🚀 Quick start
### 1. Deploy to the server
```bash
# From the local machine
cd /Users/apple/github-projects/microdao-daarion
rsync -avz monitoring/ root@144.76.224.179:/opt/microdao-daarion/monitoring/
# On the server
ssh root@144.76.224.179
cd /opt/microdao-daarion/monitoring
docker-compose -f docker-compose.monitoring.yml up -d
```
### 2. Access the web interfaces
- **Prometheus**: http://144.76.224.179:9090
- **Grafana**: http://144.76.224.179:3000
  - Username: `admin`
  - Password: `daarion2025`
---
## 📊 What is monitored?
### Core Services
- **dagi-router** (9102) - Central router
- **telegram-gateway** (8000) - Telegram bots
- **dagi-gateway** (9300) - HTTP Gateway
- **dagi-rbac** (9200) - RBAC Service
### AI/ML Services
- **dagi-crewai** (9010) - CrewAI workflows
- **dagi-vision-encoder** (8001) - Vision AI
- **dagi-parser** (9400) - OCR/PDF parsing
- **dagi-stt** (9000) - Speech-to-Text
- **dagi-tts** (9101) - Text-to-Speech
### Infrastructure
- **nats** (8222) - Message broker
- **dagi-qdrant** (6333) - Vector DB
- **dagi-postgres** (5432) - Main DB
---
## 🎯 Key metrics
### 1. Request Rate
```promql
rate(http_requests_total[5m])
```
### 2. Error Rate
```promql
rate(http_requests_total{status=~"5.."}[5m])
```
### 3. Latency (p95)
```promql
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
```
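To show what the p95 query computes, here is a minimal pure-Python sketch of how `histogram_quantile` estimates a quantile by linear interpolation inside cumulative buckets. The function name `estimate_quantile` and the sample bucket counts are illustrative assumptions, not data from DAARION services, and the sketch skips edge cases that Prometheus itself handles.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs. It must
    contain the +Inf bucket and at least one finite bucket, mirroring the
    `le` label of the *_bucket series.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # The quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower, prev_count = buckets[index - 1] if index > 0 else (0.0, 0.0)
    if count == prev_count:
        return lower
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


# Hypothetical cumulative counts for http_request_duration_seconds_bucket
sample_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = estimate_quantile(0.95, sample_buckets)
print(f"estimated p95: {p95:.4f}s")
```

Because the result is interpolated, its accuracy depends on how closely the bucket boundaries bracket the latencies you care about.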
### 4. LLM Performance
```promql
rate(llm_requests_total[5m])
histogram_quantile(0.95, rate(llm_request_duration_seconds_bucket[5m]))
```
### 5. Telegram Activity
```promql
rate(telegram_messages_total[5m])
telegram_active_chats
```
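The queries in this section can also be run outside the web UI through the Prometheus HTTP API (`/api/v1/query`). The sketch below uses only the Python standard library. The helper names and the sample response values are assumptions made for illustration, and `run_query` needs a reachable Prometheus instance, so the demonstration at the bottom parses a hand-written payload instead.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def parse_vector(payload):
    """Turn an instant-vector response into a list of (labels, value) pairs."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    return [(s["metric"], float(s["value"][1])) for s in payload["data"]["result"]]


def run_query(base_url, promql, timeout=10):
    """Execute the query over HTTP (requires a reachable Prometheus)."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return parse_vector(json.load(resp))


# Offline demonstration: a hypothetical response in the API's JSON format.
sample_payload = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"job": "telegram-gateway"}, "value": [1700000000, "3.5"]},
        ],
    },
}
promql = "rate(telegram_messages_total[5m])"
url = build_query_url("http://localhost:9090", promql)
series = parse_vector(sample_payload)
print(url)
print(series)
```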
---
## 🚨 Alerts
### Critical
- **ServiceDown**: Service has not responded for > 2 min
- **TelegramGatewayDown**: Telegram bots are not working
- **PostgreSQLDown**: Database is unavailable
- **NATSDown**: Message broker is unavailable
- **DiskSpaceCritical**: < 10% disk space free
### Warning
- **HighErrorRate**: 5xx rate > 0.05 errors/sec
- **RouterHighLatency**: P95 > 10s
- **LLMHighLatency**: P95 > 30s
- **DiskSpaceWarning**: < 20% disk space free
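
Duration conditions such as the 2-minute window on ServiceDown come from the `for` clause of a Prometheus alerting rule: the alert stays pending while the condition holds and fires only once it has held continuously for the whole window. The sketch below is a simplified model of that behaviour. The function name `alert_state` and the 30-second evaluation spacing in the example are assumptions for illustration.

```python
def alert_state(samples, hold_seconds):
    """Return 'inactive', 'pending' or 'firing' after the last evaluation.

    `samples` is a chronological list of (timestamp_seconds, condition_is_true)
    pairs, one per rule evaluation.
    """
    active_since = None
    state = "inactive"
    for timestamp, is_true in samples:
        if not is_true:
            # Any evaluation where the condition is false resets the timer.
            active_since = None
            state = "inactive"
            continue
        if active_since is None:
            active_since = timestamp
        held_for = timestamp - active_since
        state = "firing" if held_for >= hold_seconds else "pending"
    return state


# `up == 0` held for 90 seconds: still pending with a 2-minute window.
print(alert_state([(0, True), (30, True), (60, True), (90, True)], 120))
# One more failing evaluation at t=120 crosses the window.
print(alert_state([(0, True), (30, True), (60, True), (90, True), (120, True)], 120))
```

A single successful scrape in the middle of an outage restarts the window, which is why short flaps do not page anyone.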
---
## 📈 Adding metrics to a service
### Python (FastAPI)
```python
import time

from fastapi import FastAPI, Request, Response
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest

app = FastAPI()

# Metrics
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'HTTP request latency', ['method', 'endpoint'])


@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    # Time each request, then record its count and latency
    start_time = time.time()
    response = await call_next(request)
    duration = time.time() - start_time
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.url.path,
        status=response.status_code
    ).inc()
    REQUEST_LATENCY.labels(
        method=request.method,
        endpoint=request.url.path
    ).observe(duration)
    return response


@app.get("/metrics")
async def metrics():
    # Expose all registered metrics in the Prometheus text format
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
```
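The middleware above uses the raw `request.url.path` as the `endpoint` label, so a route that embeds an identifier produces a new time series for every distinct value. One way to keep label cardinality bounded is to normalize the path before labelling. The sketch below assumes that purely numeric and UUID-shaped segments are identifiers. `normalize_path` is a hypothetical helper, not part of FastAPI or `prometheus_client`.

```python
import re

# Matches a canonical 8-4-4-4-12 hexadecimal UUID.
_UUID = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")


def normalize_path(path):
    """Replace numeric and UUID path segments with a fixed placeholder."""
    segments = []
    for segment in path.split("/"):
        if segment.isdigit() or _UUID.match(segment):
            segments.append("{id}")
        else:
            segments.append(segment)
    return "/".join(segments)


print(normalize_path("/users/42/messages"))
print(normalize_path("/telegram/webhook"))
```

In the middleware this would be used as `endpoint=normalize_path(request.url.path)` in both `labels(...)` calls.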
### Add the service to Prometheus
Edit `monitoring/prometheus/prometheus.yml`:
```yaml
scrape_configs:
  - job_name: 'my-new-service'
    static_configs:
      - targets: ['my-service:9999']
    metrics_path: '/metrics'
    scrape_interval: 15s
```
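When several services are added at once, entries like the one above can be generated rather than typed by hand. This is a small sketch that renders one `scrape_configs` entry as text with the same keys. `render_scrape_job` is a hypothetical helper and its defaults are assumptions taken from the example above.

```python
def render_scrape_job(job_name, target, metrics_path="/metrics", scrape_interval="15s"):
    """Render one scrape_configs list entry as YAML text (two-space indentation)."""
    lines = [
        f"  - job_name: '{job_name}'",
        "    static_configs:",
        f"      - targets: ['{target}']",
        f"    metrics_path: '{metrics_path}'",
        f"    scrape_interval: {scrape_interval}",
    ]
    return "\n".join(lines)


snippet = render_scrape_job("my-new-service", "my-service:9999")
print("scrape_configs:")
print(snippet)
```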
---
## 🛠️ Troubleshooting
### Prometheus is not scraping metrics
```bash
# Check target status
curl http://localhost:9090/api/v1/targets
# Check the logs
docker logs dagi-prometheus
# Check the endpoint manually
curl http://dagi-router:9102/metrics
```
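The JSON returned by `/api/v1/targets` is verbose, so it helps to filter it down to the targets that are failing. The sketch below reads the `activeTargets` list and keeps entries whose `health` is not `up`. The helper names are assumptions, and the sample payload, including its error text, is hand-written for illustration.

```python
import json
import urllib.request


def unhealthy_targets(payload):
    """List (job, scrape_url, last_error) for targets whose health is not 'up'."""
    problems = []
    for target in payload["data"]["activeTargets"]:
        if target.get("health") != "up":
            problems.append((
                target.get("labels", {}).get("job", "?"),
                target.get("scrapeUrl", "?"),
                target.get("lastError", ""),
            ))
    return problems


def fetch_targets(base_url="http://localhost:9090", timeout=10):
    """Download the targets report (requires a reachable Prometheus)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=timeout) as resp:
        return json.load(resp)


# Offline demonstration with a hypothetical response.
sample_targets = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "dagi-router", "instance": "dagi-router:9102"},
             "scrapeUrl": "http://dagi-router:9102/metrics",
             "health": "up", "lastError": ""},
            {"labels": {"job": "postgres", "instance": "dagi-postgres:5432"},
             "scrapeUrl": "http://dagi-postgres:5432/metrics",
             "health": "down", "lastError": "hypothetical scrape error"},
        ]
    },
}
failing = unhealthy_targets(sample_targets)
for job, scrape_url, error in failing:
    print(f"{job}: {scrape_url} -> {error}")
```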
### Grafana shows no data
```bash
# Reset the Grafana admin password (if you cannot log in to check the datasource)
docker exec dagi-grafana grafana-cli admin reset-admin-password daarion2025
# Restart Grafana
docker restart dagi-grafana
```
### Reload the Prometheus config without a restart
```bash
curl -X POST http://localhost:9090/-/reload
```
---
## 📚 Useful queries
### Top 10 slowest endpoints
```promql
topk(10, histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])))
```
### Error rate by service
```promql
sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
```
### LLM requests per second
```promql
sum(rate(llm_requests_total[1m])) by (agent_id)
```
### Active Telegram chats
```promql
sum(telegram_active_chats) by (agent_id)
```
---
## 🎯 Next steps
1. ✅ Prometheus + Grafana installed
2. ⏳ Add metrics to DAGI Router
3. ⏳ Add metrics to Telegram Gateway
4. ⏳ Create dashboards in Grafana
5. ⏳ Configure Alertmanager (Slack/Telegram notifications)
6. ⏳ Add Loki for centralized logs
7. ⏳ Add Jaeger for distributed tracing
---
*Updated: 2025-11-18*


@@ -0,0 +1,64 @@
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: dagi-prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/alerts:/etc/prometheus/alerts:ro
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
      - '--web.enable-lifecycle'
    networks:
      - dagi-network
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:9090/-/healthy"]
      interval: 30s
      timeout: 10s
      retries: 3

  grafana:
    image: grafana/grafana:latest
    container_name: dagi-grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards:ro
      - ./grafana/datasources:/etc/grafana/provisioning/datasources:ro
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=daarion2025
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_SERVER_ROOT_URL=http://localhost:3000
      - GF_ANALYTICS_REPORTING_ENABLED=false
      - GF_ANALYTICS_CHECK_FOR_UPDATES=false
    networks:
      - dagi-network
    restart: unless-stopped
    depends_on:
      - prometheus
    healthcheck:
      test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:3000/api/health"]
      interval: 30s
      timeout: 10s
      retries: 3

networks:
  dagi-network:
    external: true

volumes:
  prometheus-data:
    driver: local
  grafana-data:
    driver: local


@@ -0,0 +1,462 @@
{
"annotations": {
"list": [
{
"builtIn": 1,
"datasource": {
"type": "grafana",
"uid": "-- Grafana --"
},
"enable": true,
"hide": true,
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Annotations & Alerts",
"type": "dashboard"
}
]
},
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 0,
"id": null,
"links": [],
"liveNow": false,
"panels": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 10,
"gradientMode": "none",
"hideFrom": {
"tooltip": false,
"viz": false,
"legend": false
},
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "reqps"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 0
},
"id": 1,
"options": {
"legend": {
"calcs": [],
"displayMode": "list",
"placement": "bottom",
"showLegend": true
},
"tooltip": {
"mode": "single",
"sort": "none"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"expr": "rate(http_requests_total[5m])",
"legendFormat": "{{job}} - {{method}} {{endpoint}}",
"refId": "A"
}
],
"title": "HTTP Requests/sec",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 0.05
}
]
},
"unit": "percentunit"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 0
},
"id": 2,
"options": {
"orientation": "auto",
"reduceOptions": {
"values": false,
"calcs": [
"lastNotNull"
],
"fields": ""
},
"showThresholdLabels": false,
"showThresholdMarkers": true
},
"pluginVersion": "9.5.3",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"expr": "sum by (job) (rate(http_requests_total{status_code=~\"5..\"}[5m])) / sum by (job) (rate(http_requests_total[5m]))",
"legendFormat": "{{job}}",
"refId": "A"
}
],
"title": "Error Rate (%)",
"type": "gauge"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 10,
"gradientMode": "none",
"hideFrom": {
"tooltip": false,
"viz": false,
"legend": false
},
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "s"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 8
},
"id": 3,
"options": {
"legend": {
"calcs": [
"mean",
"max"
],
"displayMode": "table",
"placement": "right",
"showLegend": true
},
"tooltip": {
"mode": "multi",
"sort": "none"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
"legendFormat": "p95 - {{job}}",
"refId": "A"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"expr": "histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))",
"legendFormat": "p50 - {{job}}",
"refId": "B"
}
],
"title": "Request Duration (p50, p95)",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
}
]
},
"unit": "short"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 8
},
"id": 4,
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "auto",
"orientation": "auto",
"reduceOptions": {
"values": false,
"calcs": [
"lastNotNull"
],
"fields": ""
},
"textMode": "auto"
},
"pluginVersion": "9.5.3",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"expr": "count(up{job=~\"dagi-.*|telegram-gateway\"} == 1)",
"legendFormat": "Active Services",
"refId": "A"
}
],
"title": "Active Services",
"type": "stat"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 10,
"gradientMode": "none",
"hideFrom": {
"tooltip": false,
"viz": false,
"legend": false
},
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
}
]
},
"unit": "reqps"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 24,
"x": 0,
"y": 16
},
"id": 5,
"options": {
"legend": {
"calcs": [
"mean",
"last"
],
"displayMode": "table",
"placement": "right",
"showLegend": true
},
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"expr": "rate(http_requests_total{job=\"dagi-router\"}[5m])",
"legendFormat": "Router - {{method}} {{endpoint}} [{{status_code}}]",
"refId": "A"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"expr": "rate(http_requests_total{job=\"telegram-gateway\"}[5m])",
"legendFormat": "Gateway - {{method}} {{endpoint}} [{{status_code}}]",
"refId": "B"
}
],
"title": "Requests by Service & Endpoint",
"type": "timeseries"
}
],
"refresh": "5s",
"schemaVersion": 38,
"style": "dark",
"tags": [
"daarion",
"microdao"
],
"templating": {
"list": []
},
"time": {
"from": "now-15m",
"to": "now"
},
"timepicker": {},
"timezone": "",
"title": "DAARION Services Overview",
"uid": "daarion-services",
"version": 0,
"weekStart": ""
}


@@ -0,0 +1,14 @@
apiVersion: 1

providers:
  - name: 'DAARION Dashboards'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /etc/grafana/provisioning/dashboards
      foldersFromFilesStructure: true


@@ -0,0 +1,557 @@
{
"annotations": {
"list": []
},
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 0,
"id": null,
"links": [],
"liveNow": false,
"panels": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [
{
"options": {
"0": {
"color": "red",
"index": 0,
"text": "Down"
},
"1": {
"color": "green",
"index": 1,
"text": "Up"
}
},
"type": "value"
}
],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "red",
"value": null
},
{
"color": "green",
"value": 1
}
]
}
},
"overrides": []
},
"gridPos": {
"h": 6,
"w": 8,
"x": 0,
"y": 0
},
"id": 1,
"options": {
"colorMode": "background",
"graphMode": "none",
"justifyMode": "center",
"orientation": "auto",
"reduceOptions": {
"values": false,
"calcs": [
"lastNotNull"
],
"fields": ""
},
"textMode": "value_and_name"
},
"pluginVersion": "9.5.3",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"expr": "up{job=\"telegram-gateway\"}",
"legendFormat": "Gateway",
"refId": "A"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"expr": "up{job=\"dagi-stt\"}",
"legendFormat": "STT",
"refId": "B"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"expr": "up{job=\"dagi-tts\"}",
"legendFormat": "TTS",
"refId": "C"
}
],
"title": "Service Status",
"type": "stat"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 10,
"gradientMode": "none",
"hideFrom": {
"tooltip": false,
"viz": false,
"legend": false
},
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
}
]
},
"unit": "reqps"
},
"overrides": []
},
"gridPos": {
"h": 6,
"w": 16,
"x": 8,
"y": 0
},
"id": 2,
"options": {
"legend": {
"calcs": [
"mean",
"last"
],
"displayMode": "table",
"placement": "right",
"showLegend": true
},
"tooltip": {
"mode": "multi",
"sort": "none"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"expr": "rate(http_requests_total{job=\"telegram-gateway\",endpoint=\"/telegram/webhook\"}[5m])",
"legendFormat": "Incoming Messages",
"refId": "A"
}
],
"title": "Telegram Messages Rate",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 10,
"gradientMode": "none",
"hideFrom": {
"tooltip": false,
"viz": false,
"legend": false
},
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
}
]
},
"unit": "s"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 6
},
"id": 3,
"options": {
"legend": {
"calcs": [
"mean",
"max"
],
"displayMode": "table",
"placement": "right",
"showLegend": true
},
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job=\"dagi-router\"}[5m]))",
"legendFormat": "Router p95",
"refId": "A"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job=\"telegram-gateway\"}[5m]))",
"legendFormat": "Gateway p95",
"refId": "B"
}
],
"title": "Response Time (p95)",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"hideFrom": {
"tooltip": false,
"viz": false,
"legend": false
}
},
"mappings": []
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 6
},
"id": 4,
"options": {
"legend": {
"displayMode": "table",
"placement": "right",
"showLegend": true,
"values": [
"value"
]
},
"pieType": "pie",
"tooltip": {
"mode": "single",
"sort": "none"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"expr": "sum by (status_code) (increase(http_requests_total{job=\"telegram-gateway\"}[1h]))",
"legendFormat": "{{status_code}}",
"refId": "A"
}
],
"title": "HTTP Status Codes (1h)",
"type": "piechart"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
}
]
},
"unit": "short"
},
"overrides": []
},
"gridPos": {
"h": 4,
"w": 8,
"x": 0,
"y": 14
},
"id": 5,
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "auto",
"orientation": "auto",
"reduceOptions": {
"values": false,
"calcs": [
"sum"
],
"fields": ""
},
"textMode": "auto"
},
"pluginVersion": "9.5.3",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"expr": "sum(increase(http_requests_total{job=\"dagi-stt\"}[1h]))",
"legendFormat": "STT Requests (1h)",
"refId": "A"
}
],
"title": "Voice Messages (1h)",
"type": "stat"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
}
]
},
"unit": "short"
},
"overrides": []
},
"gridPos": {
"h": 4,
"w": 8,
"x": 8,
"y": 14
},
"id": 6,
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "auto",
"orientation": "auto",
"reduceOptions": {
"values": false,
"calcs": [
"sum"
],
"fields": ""
},
"textMode": "auto"
},
"pluginVersion": "9.5.3",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"expr": "sum(increase(http_requests_total{job=\"dagi-tts\"}[1h]))",
"legendFormat": "TTS Requests (1h)",
"refId": "A"
}
],
"title": "Voice Responses (1h)",
"type": "stat"
},
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
}
]
},
"unit": "short"
},
"overrides": []
},
"gridPos": {
"h": 4,
"w": 8,
"x": 16,
"y": 14
},
"id": 7,
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "auto",
"orientation": "auto",
"reduceOptions": {
"values": false,
"calcs": [
"sum"
],
"fields": ""
},
"textMode": "auto"
},
"pluginVersion": "9.5.3",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"expr": "sum(increase(http_requests_total{job=\"dagi-parser\"}[1h]))",
"legendFormat": "Parser Requests (1h)",
"refId": "A"
}
],
"title": "Documents Processed (1h)",
"type": "stat"
}
],
"refresh": "5s",
"schemaVersion": 38,
"style": "dark",
"tags": [
"telegram",
"bots"
],
"templating": {
"list": []
},
"time": {
"from": "now-1h",
"to": "now"
},
"timepicker": {},
"timezone": "",
"title": "Telegram Bots Monitoring",
"uid": "telegram-bots",
"version": 0,
"weekStart": ""
}


@@ -0,0 +1,13 @@
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus  # matches the datasource uid referenced by the dashboards
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
    jsonData:
      timeInterval: "15s"
      queryTimeout: "60s"


@@ -0,0 +1,129 @@
groups:
  - name: DAARION Platform
    interval: 30s
    rules:
      # Service Health Alerts
      - alert: ServiceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
          description: "{{ $labels.job }} has been down for more than 2 minutes"

      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "Error rate is {{ $value }} errors/sec"

      # Router Alerts
      - alert: RouterHighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job="dagi-router"}[5m])) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "DAGI Router high latency"
          description: "95th percentile latency is {{ $value }}s"

      - alert: RouterHighLoad
        expr: rate(http_requests_total{job="dagi-router"}[1m]) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "DAGI Router high load"
          description: "Request rate is {{ $value }} req/sec"

      # Telegram Gateway Alerts
      - alert: TelegramGatewayDown
        expr: up{job="telegram-gateway"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Telegram Gateway is down"
          description: "Telegram bots will not respond"

      - alert: TelegramMessageBacklog
        expr: telegram_message_queue_size > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Telegram message backlog"
          description: "{{ $value }} messages in queue"

      # LLM Performance
      - alert: LLMHighLatency
        expr: histogram_quantile(0.95, rate(llm_request_duration_seconds_bucket[5m])) > 30
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "LLM high latency"
          description: "95th percentile LLM latency is {{ $value }}s"

      - alert: LLMErrorRate
        expr: rate(llm_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High LLM error rate"
          description: "LLM error rate is {{ $value }} errors/sec"

      # Database Alerts
      - alert: PostgreSQLDown
        expr: up{job="postgres"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL is down"
          description: "Database is unavailable"

      # NATS Alerts
      - alert: NATSDown
        expr: up{job="nats"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "NATS is down"
          description: "Message broker is unavailable"

      # Vector DB Alerts
      - alert: QdrantHighMemory
        expr: qdrant_memory_used_bytes / qdrant_memory_total_bytes > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Qdrant high memory usage"
          description: "Memory usage is {{ $value | humanizePercentage }}"

      # Disk Space Alerts
      - alert: DiskSpaceWarning
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space"
          description: "Only {{ $value | humanizePercentage }} disk space left"

      - alert: DiskSpaceCritical
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Critical disk space"
          description: "Only {{ $value | humanizePercentage }} disk space left"


@@ -0,0 +1,124 @@
# Prometheus Configuration for DAARION Platform
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'daarion-prod'
    environment: 'production'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: []
          # - alertmanager:9093

# Load rules once and periodically evaluate them
rule_files:
  - "/etc/prometheus/alerts/*.yml"

# Scrape configurations
scrape_configs:
  # DAGI Router
  - job_name: 'dagi-router'
    static_configs:
      - targets: ['dagi-router:9102']
    metrics_path: '/metrics'
    scrape_interval: 10s

  # Telegram Gateway
  - job_name: 'telegram-gateway'
    static_configs:
      - targets: ['telegram-gateway:8000']
    metrics_path: '/metrics'
    scrape_interval: 10s

  # DAGI Gateway
  - job_name: 'dagi-gateway'
    static_configs:
      - targets: ['dagi-gateway:9300']
    metrics_path: '/metrics'
    scrape_interval: 10s

  # RBAC Service
  - job_name: 'dagi-rbac'
    static_configs:
      - targets: ['dagi-rbac:9200']
    metrics_path: '/metrics'
    scrape_interval: 15s

  # CrewAI Service
  - job_name: 'dagi-crewai'
    static_configs:
      - targets: ['dagi-crewai:9010']
    metrics_path: '/metrics'
    scrape_interval: 15s

  # Parser Service
  - job_name: 'dagi-parser'
    static_configs:
      - targets: ['dagi-parser:9400']
    metrics_path: '/metrics'
    scrape_interval: 20s

  # Vision Encoder
  - job_name: 'dagi-vision-encoder'
    static_configs:
      - targets: ['dagi-vision-encoder:8001']
    metrics_path: '/metrics'
    scrape_interval: 20s

  # DevTools
  - job_name: 'dagi-devtools'
    static_configs:
      - targets: ['dagi-devtools:8008']
    metrics_path: '/metrics'
    scrape_interval: 15s

  # STT Service
  - job_name: 'dagi-stt'
    static_configs:
      - targets: ['dagi-stt:9000']
    metrics_path: '/metrics'
    scrape_interval: 20s

  # TTS Service
  - job_name: 'dagi-tts'
    static_configs:
      - targets: ['dagi-tts:9101']
    metrics_path: '/metrics'
    scrape_interval: 20s

  # Qdrant Vector DB
  - job_name: 'dagi-qdrant'
    static_configs:
      - targets: ['dagi-qdrant:6333']
    metrics_path: '/metrics'
    scrape_interval: 30s

  # NATS (/varz returns JSON, not the Prometheus text format;
  # scraping it needs an exporter such as prometheus-nats-exporter)
  - job_name: 'nats'
    static_configs:
      - targets: ['nats:8222']
    metrics_path: '/varz'
    scrape_interval: 15s

  # PostgreSQL (requires postgres_exporter; 5432 is the database port and
  # does not serve /metrics, so point this target at the exporter, which
  # listens on 9187 by default)
  - job_name: 'postgres'
    static_configs:
      - targets: ['dagi-postgres:5432']
    metrics_path: '/metrics'
    scrape_interval: 30s

  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Host metrics (if node_exporter is installed)
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['host.docker.internal:9100']
    scrape_interval: 30s