Observability Tool - Documentation
Overview
The Observability Tool provides read-only access to metrics (Prometheus), logs (Loki), and traces (Tempo). It is designed for CTO/SRE operations.
Integration
Tool Definition
Registered in services/router/tool_manager.py:
{
  "type": "function",
  "function": {
    "name": "observability_tool",
    "description": "📊 Metrics, logs, traces...",
    "parameters": {...}
  }
}
RBAC Configuration
Added to FULL_STANDARD_STACK in services/router/agent_tools_config.py.
Configuration
Data sources configured in config/observability_sources.yml:
prometheus:
  base_url: "http://prometheus:9090"
  allow_promql_prefixes:
    - "sum("
    - "rate("
    - "histogram_quantile("
loki:
  base_url: "http://loki:3100"
tempo:
  base_url: "http://tempo:3200"
limits:
  max_time_window_hours: 24
  max_series: 200
  max_points: 2000
  timeout_seconds: 5
Override URLs via environment variables:
- PROMETHEUS_URL
- LOKI_URL
- TEMPO_URL
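The override behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual loader: the function name `resolve_base_urls`, the two dictionaries, and the choice to treat an empty variable as unset are all assumptions made for the example.

```python
import os

# Defaults mirror the base_url values shown in config/observability_sources.yml.
DEFAULT_BASE_URLS = {
    "prometheus": "http://prometheus:9090",
    "loki": "http://loki:3100",
    "tempo": "http://tempo:3200",
}

# Environment variable that overrides each data source's base_url.
ENV_OVERRIDES = {
    "prometheus": "PROMETHEUS_URL",
    "loki": "LOKI_URL",
    "tempo": "TEMPO_URL",
}


def resolve_base_urls(env=None):
    """Return base URLs, preferring non-empty environment overrides.

    Hypothetical helper: an empty or whitespace-only variable is treated
    as unset, which is an assumption rather than documented behaviour.
    """
    env = os.environ if env is None else env
    resolved = {}
    for source, default in DEFAULT_BASE_URLS.items():
        override = env.get(ENV_OVERRIDES[source], "").strip()
        # Strip a trailing slash so later path joins stay predictable.
        resolved[source] = (override or default).rstrip("/")
    return resolved
```

Passing the environment as a plain mapping keeps the helper easy to exercise in tests without touching the real process environment.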
Actions
1. metrics_query
Prometheus instant query.
{
  "action": "metrics_query",
  "params": {
    "query": "rate(http_requests_total[5m])",
    "datasource": "prometheus"
  }
}
Allowed PromQL prefixes:
- sum(
- rate(
- histogram_quantile(
- avg(
- max(
- min(
- count(
- irate(
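For orientation, a metrics_query call plausibly ends up as a request to the Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The sketch below only builds the URL; `build_instant_query_url` is a hypothetical helper, and forwarding the 5-second limit as the `timeout` parameter is an assumption about how the tool applies it.

```python
from urllib.parse import urlencode


def build_instant_query_url(base_url, query, timeout_seconds=5):
    """Build a Prometheus instant-query URL (hypothetical helper).

    /api/v1/query and its `query` and `timeout` parameters are part of
    the Prometheus HTTP API; everything else here is illustrative.
    """
    params = urlencode({"query": query, "timeout": f"{timeout_seconds}s"})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


url = build_instant_query_url(
    "http://prometheus:9090", "rate(http_requests_total[5m])"
)
```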
2. metrics_range
Prometheus range query.
{
  "action": "metrics_range",
  "params": {
    "query": "rate(http_requests_total[5m])",
    "time_range": {
      "from": "2024-01-15T10:00:00Z",
      "to": "2024-01-15T11:00:00Z"
    },
    "step_seconds": 30
  }
}
3. logs_query
Loki log query.
{
  "action": "logs_query",
  "params": {
    "query": "{service=\"gateway\"}",
    "time_range": {
      "from": "2024-01-15T10:00:00Z",
      "to": "2024-01-15T11:00:00Z"
    },
    "limit": 100
  }
}
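As a rough illustration of how these parameters could map onto Loki's `query_range` endpoint, the sketch below converts the ISO timestamps to nanosecond Unix epochs and builds the request URL. The helper names `to_unix_nanos` and `build_loki_query_url` are hypothetical, and the tool may translate the parameters differently.

```python
from datetime import datetime
from urllib.parse import urlencode


def to_unix_nanos(timestamp):
    """Convert an ISO 8601 timestamp to a nanosecond Unix epoch."""
    # Older Python versions reject a trailing "Z", so normalise it first.
    parsed = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    # Sub-second precision is dropped; enough for this sketch.
    return int(parsed.timestamp()) * 1_000_000_000


def build_loki_query_url(base_url, params):
    """Build a Loki query_range URL from logs_query params (hypothetical)."""
    query = urlencode({
        "query": params["query"],
        "start": to_unix_nanos(params["time_range"]["from"]),
        "end": to_unix_nanos(params["time_range"]["to"]),
        "limit": params.get("limit", 100),
    })
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{query}"


loki_url = build_loki_query_url("http://loki:3100", {
    "query": '{service="gateway"}',
    "time_range": {"from": "2024-01-15T10:00:00Z", "to": "2024-01-15T11:00:00Z"},
    "limit": 100,
})
```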
4. traces_query
Tempo trace search.
{
  "action": "traces_query",
  "params": {
    "trace_id": "abc123"
  }
}
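A lookup by trace ID would typically go to Tempo's `/api/traces/<traceID>` endpoint. The following sketch builds that URL; `build_trace_url` and the hex-only validation are assumptions added for illustration, not documented behaviour of the tool.

```python
import re

# Assumption: trace IDs are hex strings of at most 32 characters.
TRACE_ID_PATTERN = re.compile(r"[0-9a-fA-F]{1,32}")


def build_trace_url(base_url, trace_id):
    """Build a Tempo trace-by-ID URL (hypothetical helper)."""
    if not TRACE_ID_PATTERN.fullmatch(trace_id):
        # Rejecting other characters keeps arbitrary input out of the path.
        raise ValueError(f"invalid trace id: {trace_id!r}")
    return f"{base_url.rstrip('/')}/api/traces/{trace_id}"
```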
5. service_overview
Aggregated service metrics.
{
  "action": "service_overview",
  "params": {
    "service": "gateway",
    "time_range": {
      "from": "2024-01-15T10:00:00Z",
      "to": "2024-01-15T11:00:00Z"
    }
  }
}
Returns:
- p95 latency
- error rate
- throughput
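The three values returned above could be derived from PromQL expressions along the following lines. The metric names (`http_request_duration_seconds_bucket`, `http_requests_total`) and the `service` and `status` labels are assumptions based on common naming conventions; the queries the tool actually runs are not shown in this document.

```python
import re

# Assumption: service names are simple identifiers, which also prevents
# quote characters from breaking out of the label matcher.
SERVICE_NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def build_overview_queries(service, window="5m"):
    """Return illustrative PromQL for p95 latency, error rate, throughput."""
    if not SERVICE_NAME_PATTERN.fullmatch(service):
        raise ValueError(f"invalid service name: {service!r}")
    selector = f'service="{service}"'
    requests = f"sum(rate(http_requests_total{{{selector}}}[{window}]))"
    errors = (
        f'sum(rate(http_requests_total{{{selector},status=~"5.."}}[{window}]))'
    )
    return {
        "p95_latency": (
            "histogram_quantile(0.95, sum(rate("
            f"http_request_duration_seconds_bucket{{{selector}}}[{window}]"
            ")) by (le))"
        ),
        # Share of requests that returned a 5xx status.
        "error_rate": f"{errors} / {requests}",
        # Requests per second across all instances of the service.
        "throughput": requests,
    }


queries = build_overview_queries("gateway")
```

Each expression begins with `histogram_quantile(` or `sum(`, so all three would pass the prefix allowlist described in this document.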
Security Features
Query Allowlist
Only allowlisted PromQL prefixes can be used.
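A prefix check of this kind can be sketched in a few lines. The function name `is_query_allowed` is hypothetical, and stripping leading whitespace before the comparison is an assumption; the prefixes are the ones listed under metrics_query.

```python
# Prefixes taken from the "Allowed PromQL prefixes" list in this document.
ALLOWED_PROMQL_PREFIXES = (
    "sum(", "rate(", "histogram_quantile(", "avg(",
    "max(", "min(", "count(", "irate(",
)


def is_query_allowed(query, prefixes=ALLOWED_PROMQL_PREFIXES):
    """Return True if the query starts with an allowlisted prefix."""
    # str.startswith accepts a tuple and matches any of its members.
    return query.strip().startswith(tuple(prefixes))
```

Note that a prefix check only constrains how a query begins; it does not parse the rest of the expression.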
Time Window Limits
- Max 24 hours per query
- Step min: 15s, max: 300s
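These limits could be enforced with a check like the one below. It is a sketch under assumptions: `validate_time_range` is a hypothetical name, and whether the tool rejects out-of-range values (as here) or clamps them is not stated in this document.

```python
from datetime import datetime, timedelta

MAX_WINDOW = timedelta(hours=24)  # limits.max_time_window_hours
MIN_STEP_SECONDS = 15
MAX_STEP_SECONDS = 300


def parse_utc(timestamp):
    """Parse an ISO 8601 timestamp, accepting a trailing "Z"."""
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def validate_time_range(time_range, step_seconds=None):
    """Return the window length, raising ValueError on a limit violation."""
    start = parse_utc(time_range["from"])
    end = parse_utc(time_range["to"])
    if end <= start:
        raise ValueError("'to' must be later than 'from'")
    if end - start > MAX_WINDOW:
        raise ValueError("time window exceeds 24 hours")
    if step_seconds is not None and not (
        MIN_STEP_SECONDS <= step_seconds <= MAX_STEP_SECONDS
    ):
        raise ValueError("step_seconds must be between 15 and 300")
    return end - start
```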
Limits
- Max series: 200
- Max points: 2000
- Timeout: 5 seconds
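The point limit interacts with the window and step: a range query yields roughly one point per step, plus one for the start of the window. The arithmetic below assumes the 2000-point cap applies per series, which this document does not confirm; both function names are hypothetical.

```python
import math

MAX_POINTS = 2000  # limits.max_points


def estimate_points(window_seconds, step_seconds):
    """Approximate points per series for a range query."""
    return window_seconds // step_seconds + 1


def min_step_for(window_seconds, max_points=MAX_POINTS):
    """Smallest whole-second step that keeps a window under the cap."""
    return math.ceil(window_seconds / (max_points - 1))
```

Under that assumption, a one-hour window at a 30-second step produces 121 points, while a full 24-hour window needs a step of at least 44 seconds to stay within 2000 points.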
Redaction
Secrets automatically redacted from logs:
- api_key=***
- token=***
- password=***
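Redaction of this form can be approximated with a single regular expression. The pattern below is an illustrative guess that covers only the three keys listed; the tool's real rules may match more formats.

```python
import re

# Matches key=value pairs for the three keys named in this document.
# The value runs until whitespace, "&", or a quote character.
SECRET_PATTERN = re.compile(
    r"\b(api_key|token|password)=([^\s&\"']+)", re.IGNORECASE
)


def redact(line):
    """Replace secret values with *** while keeping the key name."""
    return SECRET_PATTERN.sub(r"\1=***", line)
```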
Example Usage
Check Service Latency
"Show p95 latency for gateway over the last 30 minutes"
View Error Rate
"What is the error rate for router over the last hour?"
Search Logs
"Find errors in the gateway logs over the last 2 hours"
Get Trace
"Show trace abc123"
Service Overview
"Give an overview of the gateway service"
Testing
pytest tools/observability_tool/tests/test_observability_tool.py -v
Test coverage:
- Valid PromQL queries work
- Invalid PromQL blocked
- Time window limit enforced
- Trace by ID query
- Service overview