microdao-daarion/docs/tools/observability_tool.md
Apple 67225a39fa docs(platform): add policy configs, runbooks, ops scripts and platform documentation
2026-03-03 07:14:53 -08:00

Observability Tool - Documentation

Overview

The Observability Tool provides read-only access to metrics (Prometheus), logs (Loki), and traces (Tempo). It is designed for CTO/SRE operations.

Integration

Tool Definition

Registered in services/router/tool_manager.py:

{
    "type": "function",
    "function": {
        "name": "observability_tool",
        "description": "📊 Metrics, logs, traces...",
        "parameters": {...}
    }
}

RBAC Configuration

The tool is added to FULL_STANDARD_STACK in services/router/agent_tools_config.py.

Configuration

Data sources are configured in config/observability_sources.yml:

prometheus:
  base_url: "http://prometheus:9090"
  allow_promql_prefixes:
    - "sum("
    - "rate("
    - "histogram_quantile("

loki:
  base_url: "http://loki:3100"

tempo:
  base_url: "http://tempo:3200"

limits:
  max_time_window_hours: 24
  max_series: 200
  max_points: 2000
  timeout_seconds: 5

Override URLs via environment variables:

  • PROMETHEUS_URL
  • LOKI_URL
  • TEMPO_URL
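
The override behaviour can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: `resolve_base_url` is a hypothetical helper, and treating an empty variable as "not set" is an assumption. The defaults and variable names are the ones listed above.

```python
import os

# Defaults as shown in config/observability_sources.yml above.
DEFAULT_BASE_URLS = {
    "prometheus": "http://prometheus:9090",
    "loki": "http://loki:3100",
    "tempo": "http://tempo:3200",
}

# Environment variable names listed in this document.
ENV_OVERRIDES = {
    "prometheus": "PROMETHEUS_URL",
    "loki": "LOKI_URL",
    "tempo": "TEMPO_URL",
}


def resolve_base_url(datasource: str, env=None) -> str:
    """Hypothetical helper: prefer the env override, else the configured default."""
    env = os.environ if env is None else env
    # Assumption: an empty or whitespace-only variable counts as unset.
    override = env.get(ENV_OVERRIDES[datasource], "").strip()
    # Strip a trailing slash so paths can be appended consistently.
    return (override or DEFAULT_BASE_URLS[datasource]).rstrip("/")
```

For example, `resolve_base_url("prometheus", env={"PROMETHEUS_URL": "http://localhost:9090/"})` would yield `http://localhost:9090`, while an empty environment falls back to the YAML default.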

Actions

1. metrics_query

Prometheus instant query.

{
  "action": "metrics_query",
  "params": {
    "query": "rate(http_requests_total[5m])",
    "datasource": "prometheus"
  }
}

Allowed PromQL prefixes:

  • sum(, rate(, histogram_quantile(, avg(, max(, min(, count(, irate(
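
A prefix allowlist of this kind could be checked along these lines. The sketch assumes a plain "starts with" comparison after trimming whitespace; `is_query_allowed` is a hypothetical name, and the real check in the tool may differ.

```python
# Prefixes copied from the list above.
ALLOWED_PROMQL_PREFIXES = (
    "sum(", "rate(", "histogram_quantile(", "avg(",
    "max(", "min(", "count(", "irate(",
)


def is_query_allowed(query: str, prefixes=ALLOWED_PROMQL_PREFIXES) -> bool:
    """Hypothetical check: the query must begin with an allowlisted prefix."""
    # str.startswith accepts a tuple and returns True if any prefix matches.
    return query.strip().startswith(tuple(prefixes))
```

Under this interpretation `rate(http_requests_total[5m])` passes, while a bare selector such as `http_requests_total` is rejected because it does not start with an allowlisted function call.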

2. metrics_range

Prometheus range query.

{
  "action": "metrics_range",
  "params": {
    "query": "rate(http_requests_total[5m])",
    "time_range": {
      "from": "2024-01-15T10:00:00Z",
      "to": "2024-01-15T11:00:00Z"
    },
    "step_seconds": 30
  }
}

3. logs_query

Loki log query.

{
  "action": "logs_query",
  "params": {
    "query": "{service=\"gateway\"}",
    "time_range": {
      "from": "2024-01-15T10:00:00Z",
      "to": "2024-01-15T11:00:00Z"
    },
    "limit": 100
  }
}
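
This document does not show how the tool calls Loki, so the following is only a sketch of how such params might map onto Loki's `query_range` HTTP endpoint, which accepts `query`, `start`, `end`, and `limit`. `build_loki_request` and `to_unix_ns` are hypothetical helpers, and the default limit of 100 is an assumption taken from the example above.

```python
from datetime import datetime


def to_unix_ns(ts: str) -> int:
    """Convert an ISO 8601 timestamp such as 2024-01-15T10:00:00Z to Unix nanoseconds."""
    dt = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    # Whole seconds only; sub-second precision is dropped in this sketch.
    return int(dt.timestamp()) * 1_000_000_000


def build_loki_request(params: dict, base_url: str = "http://loki:3100"):
    """Hypothetical helper: return (url, query_params) for a Loki range query."""
    time_range = params["time_range"]
    query_params = {
        "query": params["query"],
        "start": to_unix_ns(time_range["from"]),
        "end": to_unix_ns(time_range["to"]),
        # Assumption: fall back to 100 lines when no limit is given.
        "limit": params.get("limit", 100),
    }
    return f"{base_url}/loki/api/v1/query_range", query_params
```

The function only builds the request; sending it, applying the timeout, and redacting the returned lines would be separate steps.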

4. traces_query

Tempo trace query. The example below looks up a single trace by its ID.

{
  "action": "traces_query",
  "params": {
    "trace_id": "abc123"
  }
}

5. service_overview

Aggregated service metrics.

{
  "action": "service_overview",
  "params": {
    "service": "gateway",
    "time_range": {
      "from": "2024-01-15T10:00:00Z",
      "to": "2024-01-15T11:00:00Z"
    }
  }
}

Returns:

  • p95 latency
  • error rate
  • throughput
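
The queries behind these three values are not given here. As a purely illustrative sketch, they could be expressed in PromQL roughly as below. The metric name `http_request_duration_seconds_bucket`, the `service` and `status` labels, and the 5m rate window are all assumptions, and `overview_queries` is a hypothetical helper.

```python
import re


def overview_queries(service: str, window: str = "5m") -> dict:
    """Hypothetical helper: build PromQL strings for p95 latency, error rate, throughput."""
    # Reject anything that could break out of the label matcher.
    if not re.fullmatch(r"[A-Za-z0-9_-]+", service):
        raise ValueError(f"invalid service name: {service!r}")
    sel = f'service="{service}"'
    requests = f"sum(rate(http_requests_total{{{sel}}}[{window}]))"
    errors = f'sum(rate(http_requests_total{{{sel},status=~"5.."}}[{window}]))'
    return {
        # Assumed histogram metric name; adjust to the real instrumentation.
        "p95_latency": (
            "histogram_quantile(0.95, "
            f"sum(rate(http_request_duration_seconds_bucket{{{sel}}}[{window}])) by (le))"
        ),
        "error_rate": f"{errors} / {requests}",
        "throughput": requests,
    }
```

Each generated query starts with `histogram_quantile(` or `sum(`, so it would pass a prefix allowlist like the one described in this document.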

Security Features

Query Allowlist

Only queries that begin with an allowlisted PromQL prefix (allow_promql_prefixes in config/observability_sources.yml) are accepted.

Time Window Limits

  • Max 24 hours per query
  • Step min: 15s, max: 300s

Limits

  • Max series: 200
  • Max points: 2000
  • Timeout: 5 seconds
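
To see how the window, step, and point limits interact, here is a minimal validation sketch. `validate_range` is a hypothetical helper, and counting points per series as `window / step + 1` is an assumption about how `max_points` is applied.

```python
from datetime import datetime, timedelta

# Values taken from the limits listed in this document.
MAX_WINDOW = timedelta(hours=24)
MIN_STEP_SECONDS, MAX_STEP_SECONDS = 15, 300
MAX_POINTS = 2000


def parse_ts(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp with a trailing Z as UTC."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def validate_range(time_from: str, time_to: str, step_seconds: int) -> int:
    """Hypothetical check: return the point count per series or raise ValueError."""
    start, end = parse_ts(time_from), parse_ts(time_to)
    if end <= start:
        raise ValueError("time_range.to must be after time_range.from")
    window = end - start
    if window > MAX_WINDOW:
        raise ValueError("time window exceeds 24 hours")
    if not MIN_STEP_SECONDS <= step_seconds <= MAX_STEP_SECONDS:
        raise ValueError("step_seconds must be between 15 and 300")
    # One sample at the start plus one per full step inside the window.
    points = int(window.total_seconds() // step_seconds) + 1
    if points > MAX_POINTS:
        raise ValueError(f"{points} points exceeds the limit of {MAX_POINTS}")
    return points
```

With the `metrics_range` example (one hour at a 30-second step) this gives 121 points. Under the same assumption, a full 24-hour window needs a step of at least 44 seconds to stay within 2000 points.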

Redaction

Secrets are automatically redacted from logs:

  • api_key=***
  • token=***
  • password=***
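
A redaction pass for the three keys above could look like the following sketch. The regular expression and the `redact` helper are assumptions for illustration. They cover only the unquoted `key=value` form shown in the list, not JSON fields or quoted values.

```python
import re

# Matches api_key=..., token=..., password=... up to whitespace, '&' or a quote.
_SECRET_RE = re.compile(r"\b(api_key|token|password)=([^\s&\"']+)", re.IGNORECASE)


def redact(line: str) -> str:
    """Hypothetical helper: replace secret values with *** and keep the key name."""
    return _SECRET_RE.sub(lambda m: f"{m.group(1)}=***", line)
```

For instance, `redact("GET /v1?api_key=abc123&user=bob")` produces `GET /v1?api_key=***&user=bob`, leaving non-secret parameters untouched.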

Example Usage

Check Service Latency

"Show p95 latency for gateway over the last 30 minutes"

View Error Rate

"What is the error rate for router over the last hour?"

Search Logs

"Find errors in the gateway logs for the last 2 hours"

Get Trace

"Show trace abc123"

Service Overview

"Give me an overview of the gateway service"

Testing

pytest tools/observability_tool/tests/test_observability_tool.py -v

Test coverage:

  • Valid PromQL queries work
  • Invalid PromQL blocked
  • Time window limit enforced
  • Trace by ID query
  • Service overview