docs(platform): add policy configs, runbooks, ops scripts and platform documentation
Config policies (16 files): alert_routing, architecture_pressure, backlog, cost_weights, data_governance, incident_escalation, incident_intelligence, network_allowlist, nodes_registry, observability_sources, rbac_tools_matrix, release_gate, risk_attribution, risk_policy, slo_policy, tool_limits, tools_rollout

Ops (22 files): Caddyfile, calendar compose, grafana voice dashboard, deployments/incidents logs, runbooks for alerts/audit/backlog/incidents/sofiia/voice, cron jobs, scripts (alert_triage, audit_cleanup, migrate_*, governance, schedule), task_registry, voice alerts/ha/latency/policy

Docs (30+ files): HUMANIZED_STEPAN v2.7-v3 changelogs and runbooks, NODA1/NODA2 status and setup, audit index and traces, backlog, incident, supervisor, tools, voice, opencode, release, risk, aistalk, spacebot

Made-with: Cursor
206
docs/tools/observability_tool.md
Normal file
@@ -0,0 +1,206 @@
# Observability Tool - Documentation

## Overview

Observability Tool provides read-only access to metrics (Prometheus), logs (Loki), and traces (Tempo). It is designed for CTO/SRE operations.

## Integration

### Tool Definition

Registered in `services/router/tool_manager.py`:

```python
{
    "type": "function",
    "function": {
        "name": "observability_tool",
        # The description is in Ukrainian: "Metrics, logs, traces..."
        "description": "📊 Метрики, логи, трейси...",
        "parameters": {...}
    }
}
```

### RBAC Configuration

The tool is added to `FULL_STANDARD_STACK` in `services/router/agent_tools_config.py`.

## Configuration

Data sources are configured in `config/observability_sources.yml`:

```yaml
prometheus:
  base_url: "http://prometheus:9090"
  allow_promql_prefixes:
    - "sum("
    - "rate("
    - "histogram_quantile("

loki:
  base_url: "http://loki:3100"

tempo:
  base_url: "http://tempo:3200"

limits:
  max_time_window_hours: 24
  max_series: 200
  max_points: 2000
  timeout_seconds: 5
```

Override URLs via environment variables:
- `PROMETHEUS_URL`
- `LOKI_URL`
- `TEMPO_URL`
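
The override behaviour can be illustrated with a minimal sketch. It assumes that a non-empty environment variable takes precedence over the `base_url` from the config file; the names `DEFAULT_URLS`, `ENV_OVERRIDES`, and `resolve_base_url` are hypothetical and not taken from the tool's source.

```python
import os

# Defaults mirror the base_url values shown in config/observability_sources.yml.
DEFAULT_URLS = {
    "prometheus": "http://prometheus:9090",
    "loki": "http://loki:3100",
    "tempo": "http://tempo:3200",
}

# Environment variable that overrides each datasource URL.
ENV_OVERRIDES = {
    "prometheus": "PROMETHEUS_URL",
    "loki": "LOKI_URL",
    "tempo": "TEMPO_URL",
}


def resolve_base_url(datasource, env=None):
    """Return the env override if set and non-empty, else the config default.

    Assumption: empty or whitespace-only overrides are ignored.
    """
    env = os.environ if env is None else env
    override = env.get(ENV_OVERRIDES[datasource], "").strip()
    # Strip a trailing slash so paths can be appended uniformly.
    return (override or DEFAULT_URLS[datasource]).rstrip("/")
```

Passing `env` explicitly keeps the helper easy to test without mutating the process environment.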

## Actions

### 1. metrics_query

Prometheus instant query.

```json
{
  "action": "metrics_query",
  "params": {
    "query": "rate(http_requests_total[5m])",
    "datasource": "prometheus"
  }
}
```

**Allowed PromQL prefixes:**
- `sum(`, `rate(`, `histogram_quantile(`, `avg(`, `max(`, `min(`, `count(`, `irate(`
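
A prefix allowlist of this kind can be checked with a plain string test. The sketch below uses the prefixes listed above; the function name `is_query_allowed` and the choice to ignore leading whitespace are assumptions, not the tool's confirmed behaviour.

```python
# Prefixes copied from the list above.
ALLOWED_PROMQL_PREFIXES = (
    "sum(", "rate(", "histogram_quantile(",
    "avg(", "max(", "min(", "count(", "irate(",
)


def is_query_allowed(query, prefixes=ALLOWED_PROMQL_PREFIXES):
    """Return True if the query starts with one of the allowlisted prefixes."""
    # str.startswith accepts a tuple and matches any of its members.
    return query.strip().startswith(tuple(prefixes))
```

With this check, `rate(http_requests_total[5m])` passes, while a bare selector such as `http_requests_total` is blocked.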

### 2. metrics_range

Prometheus range query.

```json
{
  "action": "metrics_range",
  "params": {
    "query": "rate(http_requests_total[5m])",
    "time_range": {
      "from": "2024-01-15T10:00:00Z",
      "to": "2024-01-15T11:00:00Z"
    },
    "step_seconds": 30
  }
}
```
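
For orientation, the following sketch shows how these params could map onto the Prometheus HTTP API endpoint `/api/v1/query_range`, which takes `query`, `start`, `end`, and `step`. How the tool itself builds the request is not documented here; `parse_utc` and `build_range_url` are hypothetical helpers.

```python
from datetime import datetime
from urllib.parse import urlencode


def parse_utc(ts):
    """Parse an ISO 8601 timestamp with a trailing 'Z', as used in the examples."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def build_range_url(base_url, params):
    """Translate metrics_range params into a query_range URL (sketch only)."""
    start = parse_utc(params["time_range"]["from"])
    end = parse_utc(params["time_range"]["to"])
    query = urlencode({
        "query": params["query"],
        "start": int(start.timestamp()),  # Unix seconds
        "end": int(end.timestamp()),
        "step": params["step_seconds"],   # resolution step in seconds
    })
    return f"{base_url}/api/v1/query_range?{query}"
```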

### 3. logs_query

Loki log query.

```json
{
  "action": "logs_query",
  "params": {
    "query": "{service=\"gateway\"}",
    "time_range": {
      "from": "2024-01-15T10:00:00Z",
      "to": "2024-01-15T11:00:00Z"
    },
    "limit": 100
  }
}
```

### 4. traces_query

Tempo trace lookup by trace ID.

```json
{
  "action": "traces_query",
  "params": {
    "trace_id": "abc123"
  }
}
```

### 5. service_overview

Aggregated service metrics.

```json
{
  "action": "service_overview",
  "params": {
    "service": "gateway",
    "time_range": {
      "from": "2024-01-15T10:00:00Z",
      "to": "2024-01-15T11:00:00Z"
    }
  }
}
```

Returns:
- p95 latency
- error rate
- throughput
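
As a rough illustration of how these three figures are commonly derived in PromQL, the sketch below builds one query per metric. The metric names (`http_request_duration_seconds_bucket`, `http_requests_total`) and the `service` and `status` labels are assumptions chosen for the example; the queries the tool actually runs are not documented here.

```python
def overview_queries(service, window="5m"):
    """Build illustrative PromQL for p95 latency, error rate, and throughput.

    All metric and label names below are hypothetical.
    """
    sel = f'service="{service}"'
    return {
        # 95th percentile from a latency histogram.
        "p95_latency": (
            "histogram_quantile(0.95, "
            f"sum(rate(http_request_duration_seconds_bucket{{{sel}}}[{window}])) by (le))"
        ),
        # Share of requests that returned a 5xx status.
        "error_rate": (
            f'sum(rate(http_requests_total{{{sel},status=~"5.."}}[{window}]))'
            f" / sum(rate(http_requests_total{{{sel}}}[{window}]))"
        ),
        # Requests per second.
        "throughput": f"sum(rate(http_requests_total{{{sel}}}[{window}]))",
    }
```

Each query starts with `histogram_quantile(` or `sum(`, so it would also pass the prefix allowlist described under `metrics_query`.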

## Security Features

### Query Allowlist
Prometheus queries must start with one of the allowlisted PromQL prefixes; any other query is blocked.

### Time Window Limits
- Max 24 hours per query
- Step min: 15s, max: 300s
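
These bounds can be expressed as a small validation helper. The sketch assumes out-of-range requests are rejected rather than clamped, which this document does not specify; `check_range` and its error messages are hypothetical.

```python
from datetime import datetime, timedelta

MAX_WINDOW_HOURS = 24
MIN_STEP_SECONDS = 15
MAX_STEP_SECONDS = 300


def check_range(time_from, time_to, step_seconds):
    """Return a list of violations; an empty list means the request is valid."""
    start = datetime.fromisoformat(time_from.replace("Z", "+00:00"))
    end = datetime.fromisoformat(time_to.replace("Z", "+00:00"))
    errors = []
    if end <= start:
        errors.append("'to' must be later than 'from'")
    elif end - start > timedelta(hours=MAX_WINDOW_HOURS):
        errors.append(f"time window exceeds {MAX_WINDOW_HOURS} hours")
    if not MIN_STEP_SECONDS <= step_seconds <= MAX_STEP_SECONDS:
        errors.append(
            f"step_seconds must be between {MIN_STEP_SECONDS} and {MAX_STEP_SECONDS}"
        )
    return errors
```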

### Limits
- Max series: 200
- Max points: 2000
- Timeout: 5 seconds
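
To see how the point limit interacts with window and step, a short worked calculation helps. It assumes the 2000-point limit applies per series and that a range query returns one sample per step, including both endpoints; both helper names are hypothetical.

```python
import math

MAX_POINTS = 2000


def points_per_series(window_seconds, step_seconds):
    """Samples returned for one series: one per step, endpoints included."""
    return window_seconds // step_seconds + 1


def min_step_for(window_seconds, max_points=MAX_POINTS):
    """Smallest whole-second step that keeps a window within max_points."""
    return math.ceil(window_seconds / (max_points - 1))


print(points_per_series(3600, 30))   # 121
print(points_per_series(86400, 15))  # 5761
print(min_step_for(86400))           # 44
```

Under these assumptions, a one-hour window at a 30-second step yields 121 points, well within the limit, while a full 24-hour window at the minimum 15-second step would yield 5761 points and needs a step of at least 44 seconds to fit.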

### Redaction
Secrets are automatically redacted from log output:
- `api_key=***`
- `token=***`
- `password=***`
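
One common way to implement this kind of redaction is a regular expression over `key=value` pairs. The sketch below covers only the three keys listed above; the pattern, its case-insensitivity, and the `redact` helper are assumptions and may differ from the tool's actual rules.

```python
import re

# Matches api_key=..., token=..., password=... up to whitespace or '&'.
SECRET_PATTERN = re.compile(r"\b(api_key|token|password)=([^\s&]+)", re.IGNORECASE)


def redact(line):
    """Replace the value of each known secret key with '***'."""
    return SECRET_PATTERN.sub(r"\1=***", line)
```

For example, `redact("GET /v1?api_key=abc123&user=bob")` returns `GET /v1?api_key=***&user=bob`, leaving non-secret parameters intact.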

## Example Usage

### Check Service Latency
```
"Show p95 latency for gateway over the last 30 minutes"
```

### View Error Rate
```
"What is the error rate for router over the last hour?"
```

### Search Logs
```
"Find errors in the gateway logs from the last 2 hours"
```

### Get Trace
```
"Show trace abc123"
```

### Service Overview
```
"Give an overview of the gateway service"
```

## Testing

```bash
pytest tools/observability_tool/tests/test_observability_tool.py -v
```

Test coverage:
- Valid PromQL queries work
- Invalid PromQL blocked
- Time window limit enforced
- Trace by ID query
- Service overview