# Observability Tool - Documentation

## Overview

Observability Tool provides read-only access to metrics (Prometheus), logs (Loki), and traces (Tempo). Designed for CTO/SRE operations.

## Integration

### Tool Definition

Registered in `services/router/tool_manager.py`:

```python
{
    "type": "function",
    "function": {
        "name": "observability_tool",
        "description": "📊 Metrics, logs, traces...",
        "parameters": {...}
    }
}
```

### RBAC Configuration

Added to `FULL_STANDARD_STACK` in `services/router/agent_tools_config.py`.

## Configuration

Data sources configured in `config/observability_sources.yml`:

```yaml
prometheus:
  base_url: "http://prometheus:9090"
  allow_promql_prefixes:
    - "sum("
    - "rate("
    - "histogram_quantile("

loki:
  base_url: "http://loki:3100"

tempo:
  base_url: "http://tempo:3200"

limits:
  max_time_window_hours: 24
  max_series: 200
  max_points: 2000
  timeout_seconds: 5
```

Override URLs via environment variables:
- `PROMETHEUS_URL`
- `LOKI_URL`
- `TEMPO_URL`

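The override behaviour described above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the helper name `resolve_base_url` is hypothetical, and it assumes that a non-empty environment variable takes precedence over the `base_url` from `config/observability_sources.yml`.

```python
import os

# Defaults mirror the base_url values shown in config/observability_sources.yml.
DEFAULT_BASE_URLS = {
    "prometheus": "http://prometheus:9090",
    "loki": "http://loki:3100",
    "tempo": "http://tempo:3200",
}

# Environment variable that overrides each datasource's base_url.
ENV_OVERRIDES = {
    "prometheus": "PROMETHEUS_URL",
    "loki": "LOKI_URL",
    "tempo": "TEMPO_URL",
}


def resolve_base_url(datasource, env=None):
    """Return the env override if set and non-empty, else the configured default.

    Assumption: the environment wins over the YAML file.
    """
    env = os.environ if env is None else env
    override = env.get(ENV_OVERRIDES[datasource], "").strip()
    # Drop a trailing slash so callers can append API paths safely.
    return (override or DEFAULT_BASE_URLS[datasource]).rstrip("/")
```

Passing `env` explicitly keeps the helper easy to test without mutating the real process environment.
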
## Actions

### 1. metrics_query

Prometheus instant query.

```json
{
  "action": "metrics_query",
  "params": {
    "query": "rate(http_requests_total[5m])",
    "datasource": "prometheus"
  }
}
```

**Allowed PromQL prefixes:**
- `sum(`, `rate(`, `histogram_quantile(`, `avg(`, `max(`, `min(`, `count(`, `irate(`

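A prefix allowlist of this kind can be checked with a simple `startswith` test. The sketch below is an assumption about how such a check might look; the function name `is_query_allowed` is hypothetical, and ignoring leading whitespace is a choice made for the example rather than documented behaviour.

```python
# Prefixes copied from the "Allowed PromQL prefixes" list above.
ALLOWED_PROMQL_PREFIXES = (
    "sum(", "rate(", "histogram_quantile(", "avg(",
    "max(", "min(", "count(", "irate(",
)


def is_query_allowed(query, prefixes=ALLOWED_PROMQL_PREFIXES):
    """True if the query, ignoring leading whitespace, starts with an allowed prefix."""
    # str.startswith accepts a tuple and matches if any element is a prefix.
    return query.lstrip().startswith(tuple(prefixes))
```

For example, `is_query_allowed("rate(http_requests_total[5m])")` is true, while a bare selector such as `http_requests_total` is rejected. A prefix check is coarse: it constrains how a query starts, not everything that follows.
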
### 2. metrics_range

Prometheus range query.

```json
{
  "action": "metrics_range",
  "params": {
    "query": "rate(http_requests_total[5m])",
    "time_range": {
      "from": "2024-01-15T10:00:00Z",
      "to": "2024-01-15T11:00:00Z"
    },
    "step_seconds": 30
  }
}
```

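To show how these parameters could map onto the Prometheus HTTP API, here is a small sketch that builds a `/api/v1/query_range` URL, which takes `query`, `start`, `end`, and `step`. The helper `build_range_url` is hypothetical, and the one-to-one mapping from `time_range` and `step_seconds` is an assumption about the tool's internals.

```python
from urllib.parse import urlencode


def build_range_url(base_url, params):
    """Build a Prometheus /api/v1/query_range URL from metrics_range params."""
    query_string = urlencode({
        "query": params["query"],
        # Prometheus accepts RFC 3339 timestamps for start and end.
        "start": params["time_range"]["from"],
        "end": params["time_range"]["to"],
        # A plain number is interpreted as a step in seconds.
        "step": params["step_seconds"],
    })
    return f"{base_url.rstrip('/')}/api/v1/query_range?{query_string}"
```

`urlencode` percent-encodes the PromQL expression, so brackets and parentheses in the query survive the round trip.
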
### 3. logs_query

Loki log query.

```json
{
  "action": "logs_query",
  "params": {
    "query": "{service=\"gateway\"}",
    "time_range": {
      "from": "2024-01-15T10:00:00Z",
      "to": "2024-01-15T11:00:00Z"
    },
    "limit": 100
  }
}
```

### 4. traces_query

Tempo trace search.

```json
{
  "action": "traces_query",
  "params": {
    "trace_id": "abc123"
  }
}
```

### 5. service_overview

Aggregated service metrics.

```json
{
  "action": "service_overview",
  "params": {
    "service": "gateway",
    "time_range": {
      "from": "2024-01-15T10:00:00Z",
      "to": "2024-01-15T11:00:00Z"
    }
  }
}
```

Returns:
- p95 latency
- error rate
- throughput

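The three values could be derived from PromQL expressions along the following lines. This is only a sketch: the metric names `http_requests_total` and `http_request_duration_seconds_bucket`, the `service` and `status` labels, and the helper `overview_queries` are all assumptions for illustration, not names confirmed by this documentation.

```python
def overview_queries(service, window="5m"):
    """Return PromQL expressions for p95 latency, error rate, and throughput.

    Metric and label names here are hypothetical placeholders.
    """
    selector = f'service="{service}"'
    return {
        # 95th percentile from a latency histogram, aggregated by bucket.
        "p95_latency": (
            "histogram_quantile(0.95, sum by (le) ("
            f"rate(http_request_duration_seconds_bucket{{{selector}}}[{window}])))"
        ),
        # Share of requests answered with a 5xx status.
        "error_rate": (
            f'sum(rate(http_requests_total{{{selector},status=~"5.."}}[{window}]))'
            f" / sum(rate(http_requests_total{{{selector}}}[{window}]))"
        ),
        # Requests per second across all instances.
        "throughput": f"sum(rate(http_requests_total{{{selector}}}[{window}]))",
    }
```

Each expression starts with `histogram_quantile(` or `sum(`, so all three would pass the prefix allowlist. A real implementation would also need to validate or escape `service` before interpolating it into a label matcher.
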
## Security Features

### Query Allowlist
Only allowlisted PromQL prefixes can be used.

### Time Window Limits
- Max 24 hours per query
- Step min: 15s, max: 300s

### Limits
- Max series: 200
- Max points: 2000
- Timeout: 5 seconds

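The time-window, step, and point limits above interact: a 24-hour window at a 15-second step would yield far more than 2000 points. The sketch below shows one way to check a range request against them. The function name `validate_range` is hypothetical, and two things are assumptions: that `max_points` applies per series, and that violations are reported rather than silently clamped.

```python
from datetime import datetime, timedelta

MAX_WINDOW = timedelta(hours=24)
MIN_STEP_SECONDS, MAX_STEP_SECONDS = 15, 300
MAX_POINTS = 2000  # assumed to be a per-series limit


def parse_utc(ts):
    # Older Python versions reject a trailing "Z" in fromisoformat(), so normalise it.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def validate_range(time_range, step_seconds):
    """Return a list of violations; an empty list means the request is acceptable."""
    start, end = parse_utc(time_range["from"]), parse_utc(time_range["to"])
    errors = []
    if end <= start:
        errors.append("'to' must be later than 'from'")
    elif end - start > MAX_WINDOW:
        errors.append("time window exceeds 24 hours")
    if not MIN_STEP_SECONDS <= step_seconds <= MAX_STEP_SECONDS:
        errors.append("step_seconds must be between 15 and 300")
    elif end > start and (end - start).total_seconds() / step_seconds + 1 > MAX_POINTS:
        # A range query returns one point per step, plus the starting point.
        errors.append("range would return more than 2000 points per series")
    return errors
```

The one-hour, 30-second-step example from `metrics_range` produces 121 points and passes every check.
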
### Redaction
Secrets are automatically redacted from logs:
- `api_key=***`
- `token=***`
- `password=***`

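Redaction of this shape can be done with a single regular expression. The following is a minimal sketch, assuming secrets appear as `key=value` pairs terminated by whitespace or `&`; the pattern and the function name `redact` are illustrative, not the tool's actual rules.

```python
import re

# Matches api_key=..., token=..., or password=... up to whitespace or "&".
SECRET_PATTERN = re.compile(r"\b(api_key|token|password)=[^\s&]+", re.IGNORECASE)


def redact(line):
    """Replace the value of each known secret key with ***."""
    # \1 keeps the key name as written; only the value is masked.
    return SECRET_PATTERN.sub(r"\1=***", line)
```

For instance, `redact("GET /v1?api_key=abc123&user=bob")` returns `"GET /v1?api_key=***&user=bob"`. Secrets in other formats, such as JSON fields or headers, would need additional patterns.
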
## Example Usage

### Check Service Latency
```
"Show p95 latency for gateway over the last 30 minutes"
```

### View Error Rate
```
"What is the error rate for router over the last hour?"
```

### Search Logs
```
"Find errors in the gateway logs from the last 2 hours"
```

### Get Trace
```
"Show trace abc123"
```

### Service Overview
```
"Give me an overview of the gateway service"
```

## Testing

```bash
pytest tools/observability_tool/tests/test_observability_tool.py -v
```

Test coverage:
- Valid PromQL queries work
- Invalid PromQL blocked
- Time window limit enforced
- Trace by ID query
- Service overview