microdao-daarion/docs/tools/observability_tool.md
Apple 67225a39fa docs(platform): add policy configs, runbooks, ops scripts and platform documentation
Config policies (16 files): alert_routing, architecture_pressure, backlog,
cost_weights, data_governance, incident_escalation, incident_intelligence,
network_allowlist, nodes_registry, observability_sources, rbac_tools_matrix,
release_gate, risk_attribution, risk_policy, slo_policy, tool_limits, tools_rollout

Ops (22 files): Caddyfile, calendar compose, grafana voice dashboard,
deployments/incidents logs, runbooks for alerts/audit/backlog/incidents/sofiia/voice,
cron jobs, scripts (alert_triage, audit_cleanup, migrate_*, governance, schedule),
task_registry, voice alerts/ha/latency/policy

Docs (30+ files): HUMANIZED_STEPAN v2.7-v3 changelogs and runbooks,
NODA1/NODA2 status and setup, audit index and traces, backlog, incident,
supervisor, tools, voice, opencode, release, risk, aistalk, spacebot

Made-with: Cursor
2026-03-03 07:14:53 -08:00

# Observability Tool - Documentation
## Overview
The Observability Tool provides read-only access to metrics (Prometheus), logs (Loki), and traces (Tempo). It is designed for CTO/SRE operations.
## Integration
### Tool Definition
Registered in `services/router/tool_manager.py`:
```python
{
    "type": "function",
    "function": {
        "name": "observability_tool",
        # The description is in Ukrainian: "📊 Metrics, logs, traces..."
        "description": "📊 Метрики, логи, трейси...",
        "parameters": {...}
    }
}
```
### RBAC Configuration
The tool is added to `FULL_STANDARD_STACK` in `services/router/agent_tools_config.py`.
## Configuration
Data sources are configured in `config/observability_sources.yml`:
```yaml
prometheus:
  base_url: "http://prometheus:9090"
  allow_promql_prefixes:
    - "sum("
    - "rate("
    - "histogram_quantile("
loki:
  base_url: "http://loki:3100"
tempo:
  base_url: "http://tempo:3200"
limits:
  max_time_window_hours: 24
  max_series: 200
  max_points: 2000
  timeout_seconds: 5
```
Override URLs via environment variables:
- `PROMETHEUS_URL`
- `LOKI_URL`
- `TEMPO_URL`
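
The override behavior can be illustrated with a minimal sketch. The helper name `resolve_base_url` and the two mappings are hypothetical; only the variable names and default URLs come from this document, and the precedence (environment over config file) is an assumption about how the tool resolves them.

```python
import os

# Defaults copied from the config excerpt above.
DEFAULT_URLS = {
    "prometheus": "http://prometheus:9090",
    "loki": "http://loki:3100",
    "tempo": "http://tempo:3200",
}

# Environment variable names listed above; the mapping itself is illustrative.
ENV_VARS = {
    "prometheus": "PROMETHEUS_URL",
    "loki": "LOKI_URL",
    "tempo": "TEMPO_URL",
}


def resolve_base_url(source: str, env=None) -> str:
    """Return the env override if set and non-empty, else the configured default."""
    env = os.environ if env is None else env
    override = env.get(ENV_VARS[source], "").strip()
    # Strip a trailing slash so paths can be appended consistently.
    return (override or DEFAULT_URLS[source]).rstrip("/")
```

For example, `resolve_base_url("loki", env={"LOKI_URL": "http://localhost:3100/"})` yields `http://localhost:3100`, while an empty environment falls back to the configured default.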
## Actions
### 1. metrics_query
Prometheus instant query.
```json
{
  "action": "metrics_query",
  "params": {
    "query": "rate(http_requests_total[5m])",
    "datasource": "prometheus"
  }
}
```
**Allowed PromQL prefixes:**
- `sum(`, `rate(`, `histogram_quantile(`, `avg(`, `max(`, `min(`, `count(`, `irate(`
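
A prefix allowlist of this kind can be sketched in a few lines. The function name `is_query_allowed` is hypothetical, and the real check in the tool may normalize queries differently; this only shows the idea of rejecting any query that does not start with a listed prefix.

```python
# Prefixes copied from the list above.
ALLOWED_PROMQL_PREFIXES = (
    "sum(", "rate(", "histogram_quantile(", "avg(",
    "max(", "min(", "count(", "irate(",
)


def is_query_allowed(query: str, prefixes=ALLOWED_PROMQL_PREFIXES) -> bool:
    """Return True if the query, ignoring leading whitespace, starts with an allowed prefix."""
    # str.startswith accepts a tuple and matches any of its elements.
    return query.lstrip().startswith(tuple(prefixes))
```

Under this sketch, `rate(http_requests_total[5m])` passes, while a bare selector such as `http_requests_total` or an unlisted function such as `topk(5, up)` is rejected.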
### 2. metrics_range
Prometheus range query.
```json
{
  "action": "metrics_range",
  "params": {
    "query": "rate(http_requests_total[5m])",
    "time_range": {
      "from": "2024-01-15T10:00:00Z",
      "to": "2024-01-15T11:00:00Z"
    },
    "step_seconds": 30
  }
}
```
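
As a rough illustration of how these parameters could map onto the Prometheus HTTP API, the sketch below builds a `/api/v1/query_range` URL. The helper `build_range_url` is hypothetical and the tool's actual request code is not shown in this document; the endpoint and its `query`, `start`, `end`, and `step` parameters are part of the Prometheus API.

```python
from urllib.parse import urlencode


def build_range_url(base_url: str, params: dict) -> str:
    """Translate metrics_range params into a Prometheus query_range URL."""
    time_range = params["time_range"]
    query_string = urlencode({
        "query": params["query"],
        "start": time_range["from"],   # RFC 3339 timestamps are accepted by Prometheus
        "end": time_range["to"],
        "step": params["step_seconds"],  # step in seconds
    })
    return f"{base_url.rstrip('/')}/api/v1/query_range?{query_string}"
```

`urlencode` percent-encodes the PromQL expression, so characters such as `[`, `]`, and `(` are safe in the resulting URL.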
### 3. logs_query
Loki log query.
```json
{
  "action": "logs_query",
  "params": {
    "query": "{service=\"gateway\"}",
    "time_range": {
      "from": "2024-01-15T10:00:00Z",
      "to": "2024-01-15T11:00:00Z"
    },
    "limit": 100
  }
}
```
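
One plausible translation of these parameters into a Loki request is sketched below. The helpers `to_unix_nanos` and `build_logs_url` are hypothetical; the sketch assumes the tool calls Loki's `/loki/api/v1/query_range` endpoint and that timestamps carry an explicit UTC offset or `Z` suffix.

```python
from datetime import datetime
from urllib.parse import urlencode


def to_unix_nanos(iso_ts: str) -> int:
    """Convert an ISO 8601 timestamp (with offset or 'Z') to Unix nanoseconds."""
    # Replace 'Z' so fromisoformat also works on Python versions before 3.11.
    dt = datetime.fromisoformat(iso_ts.replace("Z", "+00:00"))
    return int(dt.timestamp()) * 1_000_000_000 + dt.microsecond * 1_000


def build_logs_url(base_url: str, params: dict) -> str:
    """Translate logs_query params into a Loki query_range URL."""
    time_range = params["time_range"]
    query_string = urlencode({
        "query": params["query"],
        "start": to_unix_nanos(time_range["from"]),
        "end": to_unix_nanos(time_range["to"]),
        "limit": params["limit"],
    })
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{query_string}"
```

Integer arithmetic is used for the nanosecond conversion to avoid floating-point rounding on large epoch values.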
### 4. traces_query
Tempo trace query. The example below looks up a single trace by its ID.
```json
{
  "action": "traces_query",
  "params": {
    "trace_id": "abc123"
  }
}
```
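
A trace lookup by ID could be mapped onto Tempo's `/api/traces/<traceid>` endpoint roughly as follows. The helper `build_trace_url` and the hex-only validation rule are assumptions for illustration; the tool's actual validation is not described here.

```python
import re

# Assumption: trace IDs are hexadecimal strings of at most 32 characters.
TRACE_ID_RE = re.compile(r"[0-9a-fA-F]{1,32}")


def build_trace_url(base_url: str, trace_id: str) -> str:
    """Build a Tempo trace-by-ID URL, rejecting IDs that are not plain hex."""
    if not TRACE_ID_RE.fullmatch(trace_id):
        raise ValueError(f"invalid trace_id: {trace_id!r}")
    return f"{base_url.rstrip('/')}/api/traces/{trace_id}"
```

Validating the ID before interpolating it keeps path separators and query characters out of the request URL.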
### 5. service_overview
Aggregated service metrics.
```json
{
  "action": "service_overview",
  "params": {
    "service": "gateway",
    "time_range": {
      "from": "2024-01-15T10:00:00Z",
      "to": "2024-01-15T11:00:00Z"
    }
  }
}
```
Returns:
- p95 latency
- error rate
- throughput
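
The three values could be derived from PromQL expressions like the ones sketched below. The metric names (`http_request_duration_seconds_bucket`, `http_requests_total`), the `service` and `status` labels, the 5-minute rate window, and the helper `overview_queries` are all assumptions chosen for illustration; the queries the tool actually runs are not documented here.

```python
import re


def overview_queries(service: str) -> dict:
    """Return hypothetical PromQL expressions for p95 latency, error rate, and throughput."""
    # Restrict the service name so it cannot break out of the label selector.
    if not re.fullmatch(r"[A-Za-z0-9_-]+", service):
        raise ValueError(f"invalid service name: {service!r}")
    sel = f'service="{service}"'
    total = f"sum(rate(http_requests_total{{{sel}}}[5m]))"
    errors = f'sum(rate(http_requests_total{{{sel},status=~"5.."}}[5m]))'
    return {
        "p95_latency": (
            "histogram_quantile(0.95, "
            f"sum(rate(http_request_duration_seconds_bucket{{{sel}}}[5m])) by (le))"
        ),
        "error_rate": f"{errors} / {total}",
        "throughput": total,
    }
```

Each expression in this sketch begins with `sum(` or `histogram_quantile(`, so it would also satisfy the PromQL prefix allowlist described in this document.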
## Security Features
### Query Allowlist
Only allowlisted PromQL prefixes can be used.
### Time Window Limits
- Max 24 hours per query
- Step min: 15s, max: 300s
### Limits
- Max series: 200
- Max points: 2000
- Timeout: 5 seconds
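
The time window, step, and point limits above can be combined into a single pre-flight check, sketched here. The function `validate_range` is hypothetical, and rejecting an over-limit request (rather than clamping it) is an assumption; the constants are the values listed in this section.

```python
from datetime import datetime, timedelta

MAX_WINDOW = timedelta(hours=24)
MIN_STEP_S, MAX_STEP_S = 15, 300
MAX_POINTS = 2000


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def validate_range(time_from: str, time_to: str, step_seconds: int) -> int:
    """Validate a range query and return the number of points per series."""
    start, end = parse_ts(time_from), parse_ts(time_to)
    if end <= start:
        raise ValueError("time_range.to must be after time_range.from")
    window = end - start
    if window > MAX_WINDOW:
        raise ValueError("time window exceeds 24 hours")
    if not MIN_STEP_S <= step_seconds <= MAX_STEP_S:
        raise ValueError("step_seconds must be between 15 and 300")
    # One sample at the start plus one per full step inside the window.
    points = int(window.total_seconds() // step_seconds) + 1
    if points > MAX_POINTS:
        raise ValueError("query would return more than 2000 points per series")
    return points
```

With these numbers, a one-hour window at a 30-second step yields 121 points, while a full 24-hour window at the minimum 15-second step would produce 5761 points and be rejected by the point limit.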
### Redaction
Secrets are automatically redacted from log output:
- `api_key=***`
- `token=***`
- `password=***`
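
Redaction of `key=value` secrets like these can be approximated with a single regular expression. This is a minimal sketch: the pattern, the `redact` helper, and the choice of value terminators (whitespace, `&`, quotes) are assumptions, and the tool's real rules may cover more formats.

```python
import re

# Matches api_key=, token=, or password= followed by a value that runs
# until whitespace, '&', or a quote character.
SECRET_RE = re.compile(r"(api_key|token|password)=([^\s&\"']+)", re.IGNORECASE)


def redact(line: str) -> str:
    """Replace secret values with '***' while keeping the key name."""
    return SECRET_RE.sub(lambda m: f"{m.group(1)}=***", line)
```

For instance, `redact("GET /x?api_key=abc123&user=bob")` returns `GET /x?api_key=***&user=bob`, leaving non-secret parameters untouched.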
## Example Usage
### Check Service Latency
```
"Show p95 latency for gateway over the last 30 minutes"
```
### View Error Rate
```
"What is the error rate for router over the last hour?"
```
### Search Logs
```
"Find errors in the gateway logs from the last 2 hours"
```
### Get Trace
```
"Show trace abc123"
```
### Service Overview
```
"Give an overview of the gateway service"
```
## Testing
```bash
pytest tools/observability_tool/tests/test_observability_tool.py -v
```
Test coverage:
- Valid PromQL queries are accepted
- Invalid PromQL queries are blocked
- Time window limit is enforced
- Trace query by ID
- Service overview