Oncall/Runbook Tool - Documentation

Overview

Oncall Tool provides operational information: a services catalog, health checks, deployments, runbooks, and incident tracking. It is read-only for most agents, with gated write access for incidents.

Integration

Tool Definition

Registered in services/router/tool_manager.py:

{
    "type": "function",
    "function": {
        "name": "oncall_tool",
        "description": "📋 Операційна інформація...",
        "parameters": {...}
    }
}

The description string is Ukrainian for "Operational information...".

RBAC Configuration

Added to FULL_STANDARD_STACK in services/router/agent_tools_config.py.

Actions

1. services_list

List all services from docker-compose files and service catalogs.

{
  "action": "services_list"
}

Response:

{
  "services": [
    {"name": "router", "source": "docker-compose.yml", "type": "service", "criticality": "medium"},
    {"name": "gateway", "source": "docker-compose.yml", "type": "service", "criticality": "high"}
  ],
  "count": 2
}

2. service_health

Check health endpoint of a service.

{
  "action": "service_health",
  "params": {
    "service_name": "router",
    "health_endpoint": "http://router-service:8000/health"
  }
}

Security: Only allowlisted internal hosts can be checked.

Allowlist: localhost, 127.0.0.1, router-service, gateway-service, memory-service, swapper-service, crewai-service

Response:

{
  "service": "router",
  "endpoint": "http://router-service:8000/health",
  "status": "healthy",
  "status_code": 200,
  "latency_ms": 15
}
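
The allowlist check described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the function name `is_allowed_health_endpoint` and the restriction to http/https schemes are assumptions; only the host list comes from this document.

```python
from urllib.parse import urlparse

# Hosts copied from the allowlist documented above.
ALLOWED_HEALTH_HOSTS = {
    "localhost",
    "127.0.0.1",
    "router-service",
    "gateway-service",
    "memory-service",
    "swapper-service",
    "crewai-service",
}


def is_allowed_health_endpoint(url: str) -> bool:
    """Hypothetical helper: True only for http(s) URLs on an allowlisted host."""
    parsed = urlparse(url)
    # Assumption: only plain HTTP(S) endpoints are valid health checks.
    if parsed.scheme not in ("http", "https"):
        return False
    # .hostname drops the port and any "user@" prefix, so a URL such as
    # http://router-service@evil.example/ is judged by its real host.
    return parsed.hostname in ALLOWED_HEALTH_HOSTS
```

Comparing the parsed hostname, rather than searching the raw URL string for an allowed name, avoids being fooled by allowlisted names that appear in the userinfo or path part of a URL.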

3. service_status

Get service status and version info.

{
  "action": "service_status",
  "params": {
    "service_name": "router"
  }
}

4. deployments_recent

Get recent deployments from log file or git.

{
  "action": "deployments_recent"
}

Sources (in priority order):

  1. ops/deployments.jsonl
  2. Git commit history (fallback)

Response:

{
  "deployments": [
    {"ts": "2024-01-15T10:00:00", "service": "router", "version": "1.2.0"},
    {"type": "git_commit", "commit": "abc123 Fix bug"}
  ],
  "count": 2
}
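
The source priority above could be implemented along these lines. This is a sketch under assumptions: the function name `recent_deployments`, the `limit` parameter, and the choice to fall back to git when the log file is missing or empty are all hypothetical. Only the file path and the shape of the git entries come from this document.

```python
import json
import subprocess
from pathlib import Path


def recent_deployments(log_path="ops/deployments.jsonl", limit=10):
    """Hypothetical helper: prefer the JSONL log, fall back to git history."""
    path = Path(log_path)
    if path.is_file():
        entries = []
        for line in path.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            try:
                entries.append(json.loads(line))
            except json.JSONDecodeError:
                continue  # skip malformed lines instead of failing the call
        if entries:
            return entries[-limit:]  # the last `limit` lines of the log

    # Fallback (assumed to apply when the log is missing or empty).
    try:
        out = subprocess.run(
            ["git", "log", "-n", str(limit), "--oneline"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return []  # git not installed, or not inside a repository
    return [{"type": "git_commit", "commit": line}
            for line in out.splitlines() if line]
```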

5. runbook_search

Search for runbooks.

{
  "action": "runbook_search",
  "params": {
    "query": "deployment"
  }
}

Search directories: ops/, runbooks/, docs/runbooks/, docs/ops/

Response:

{
  "results": [
    {"path": "ops/deploy.md", "file": "deploy.md"}
  ],
  "query": "deployment",
  "count": 1
}
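
A search over the directories listed above might look like the sketch below. The matching rule is an assumption: the example response returns ops/deploy.md for the query "deployment", which suggests that file content is searched as well as file names, so this sketch checks both. The function name `search_runbooks`, the `root` parameter, and the restriction to `*.md` files are hypothetical.

```python
from pathlib import Path

# Search directories copied from the list above.
RUNBOOK_DIRS = ("ops", "runbooks", "docs/runbooks", "docs/ops")


def search_runbooks(query, root=".", dirs=RUNBOOK_DIRS):
    """Hypothetical helper: case-insensitive match on file name or content."""
    needle = query.lower()
    results = []
    for d in dirs:
        base = Path(root) / d
        if not base.is_dir():
            continue
        # Assumption: runbooks are Markdown files.
        for path in sorted(base.rglob("*.md")):
            try:
                text = path.read_text(encoding="utf-8", errors="ignore")
            except OSError:
                continue  # unreadable file: skip it
            if needle in path.name.lower() or needle in text.lower():
                results.append({
                    "path": path.relative_to(root).as_posix(),
                    "file": path.name,
                })
    return {"results": results, "query": query, "count": len(results)}
```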

6. runbook_read

Read a specific runbook.

{
  "action": "runbook_read",
  "params": {
    "runbook_path": "ops/deploy.md"
  }
}

Security:

  • Only reads from allowlisted directories
  • Path traversal blocked
  • Secrets masked in content
  • Max 200KB per read

Response:

{
  "path": "ops/deploy.md",
  "content": "# Deployment Runbook\n\n...",
  "size": 1234
}
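
The four safeguards listed above can be combined as in the following sketch. Several details are assumptions rather than documented behavior: the function name `read_runbook`, the regular expression used to detect secrets, the `***` mask, and the decision to truncate at 200KB rather than reject larger files.

```python
import re
from pathlib import Path

ALLOWED_DIRS = ("ops", "runbooks", "docs/runbooks", "docs/ops")
MAX_BYTES = 200 * 1024  # the documented 200KB cap

# Hypothetical pattern: masks values in "key = value" / "key: value" pairs.
SECRET_RE = re.compile(
    r"(password|token|secret|api[_-]?key)(\s*[:=]\s*)(\S+)", re.IGNORECASE
)


def read_runbook(runbook_path, root="."):
    """Hypothetical helper: allowlisted, traversal-safe, masked, capped read."""
    root_dir = Path(root).resolve()
    # resolve() collapses ".." segments and symlinks before the check.
    target = (root_dir / runbook_path).resolve()
    allowed = [(root_dir / d).resolve() for d in ALLOWED_DIRS]
    if not any(a in target.parents for a in allowed):
        raise PermissionError(f"path outside allowlisted directories: {runbook_path}")
    with target.open("rb") as f:
        raw = f.read(MAX_BYTES)  # assumption: truncate instead of rejecting
    content = SECRET_RE.sub(r"\1\2***", raw.decode("utf-8", errors="replace"))
    return {"path": runbook_path, "content": content, "size": len(raw)}
```

Checking the resolved path against the resolved allowlist, rather than inspecting the raw string for `..`, also covers absolute paths and symlinks that point outside the allowed directories.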

7. incident_log_list

List incidents.

{
  "action": "incident_log_list",
  "params": {
    "severity": "sev1",
    "limit": 20
  }
}

Response:

{
  "incidents": [
    {
      "ts": "2024-01-15T10:00:00",
      "severity": "sev1",
      "title": "Router down",
      "service": "router"
    }
  ],
  "count": 1
}
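
Filtering by `severity` and `limit` could work roughly as sketched here, assuming incidents are read from ops/incidents.jsonl (the storage file named elsewhere in this document). The function name `list_incidents` and the newest-first ordering are assumptions.

```python
import json
from pathlib import Path


def list_incidents(log_path="ops/incidents.jsonl", severity=None, limit=20):
    """Hypothetical helper: filter the JSONL incident log by severity."""
    path = Path(log_path)
    if not path.is_file():
        return {"incidents": [], "count": 0}
    matches = []
    for line in path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore malformed lines
        if not isinstance(record, dict):
            continue
        if severity is None or record.get("severity") == severity:
            matches.append(record)
    # Assumption: the log is append-only, so reversing gives newest first.
    incidents = matches[::-1][:limit]
    return {"incidents": incidents, "count": len(incidents)}
```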

8. incident_log_append

Add a new incident (gated: requires the tools.oncall.incident_write entitlement).

{
  "action": "incident_log_append",
  "params": {
    "service_name": "router",
    "incident_title": "High latency",
    "incident_severity": "sev2",
    "incident_details": "Router experiencing 500ms latency",
    "incident_tags": ["performance", "router"]
  }
}

RBAC: Only sofiia, helion, and admin can add incidents.

Storage: ops/incidents.jsonl

Response:

{
  "incident_id": "2024-01-15T10:00:00",
  "status": "logged"
}
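
Appending to the log amounts to writing one JSON object per line. The sketch below assumes that the RBAC check has already passed and that the request parameters map onto the stored fields shown in the incident_log_list response (`ts`, `severity`, `title`, `service`). The function name `append_incident`, the `details` and `tags` keys, and the injectable `now` argument are hypothetical.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_incident(params, log_path="ops/incidents.jsonl", now=None):
    """Hypothetical helper: append one incident record as a JSONL line."""
    # The example response uses the timestamp as the incident id.
    ts = (now or datetime.now(timezone.utc)).isoformat(timespec="seconds")
    record = {
        "ts": ts,
        "service": params["service_name"],
        "severity": params["incident_severity"],
        "title": params["incident_title"],
        "details": params.get("incident_details", ""),  # assumed key name
        "tags": params.get("incident_tags", []),        # assumed key name
    }
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Append mode keeps earlier incidents untouched.
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return {"incident_id": ts, "status": "logged"}
```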

Security Features

Health Check Allowlist

Only internal service endpoints can be checked:

  • localhost, 127.0.0.1
  • Service names: router-service, gateway-service, memory-service, swapper-service, crewai-service

Runbook Security

  • Only read from allowlisted directories: ops/, runbooks/, docs/runbooks/, docs/ops/
  • Path traversal blocked
  • Secrets automatically masked

RBAC

  • Read actions: tools.oncall.read (default for all agents)
  • Write incidents: tools.oncall.incident_write (only sofiia, helion, admin)
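
The two entitlements above map onto actions as in this sketch. The entitlement names, the writer list, and the action names come from this document; the function names and the rule that unknown actions are denied are assumptions, and the real configuration in agent_tools_config.py may be structured differently.

```python
READ = "tools.oncall.read"
INCIDENT_WRITE = "tools.oncall.incident_write"

# Agents documented as allowed to write incidents.
INCIDENT_WRITERS = {"sofiia", "helion", "admin"}

# Entitlement required by each documented action.
ACTION_ENTITLEMENT = {
    "services_list": READ,
    "service_health": READ,
    "service_status": READ,
    "deployments_recent": READ,
    "runbook_search": READ,
    "runbook_read": READ,
    "incident_log_list": READ,
    "incident_log_append": INCIDENT_WRITE,
}


def entitlements_for(agent_id):
    """Hypothetical helper: every agent reads; only listed agents write."""
    granted = {READ}
    if agent_id in INCIDENT_WRITERS:
        granted.add(INCIDENT_WRITE)
    return granted


def is_action_allowed(agent_id, action):
    """Hypothetical helper: deny unknown actions (assumed deny-by-default)."""
    required = ACTION_ENTITLEMENT.get(action)
    return required is not None and required in entitlements_for(agent_id)
```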

Data Files

Two empty files were created for data storage:

  • ops/incidents.jsonl - Incident log
  • ops/deployments.jsonl - Deployment log

Example Usage

Check Service Health

"Перевіри health router сервісу" ("Check the health of the router service")

Find Runbook

"Знайди runbook про деплой" ("Find a runbook about deployment")

Read Deployment Runbook

"Відкрий runbook/deploy.md" ("Open runbook/deploy.md")

View Recent Deployments

"Покажи останні деплої" ("Show recent deployments")

Log Incident

"Зареєструй інцидент: router висока затримка, sev2" ("Log an incident: router high latency, sev2")

Testing

pytest tools/oncall_tool/tests/test_oncall_tool.py -v

Test coverage:

  • services_list parses docker-compose
  • runbook_search finds results
  • runbook_read blocks path traversal
  • runbook_read masks secrets
  • incident_log_append allowed for sofiia
  • incident_log_append blocked for regular agents
  • service_health blocks non-allowlisted hosts