Oncall/Runbook Tool - Documentation
Overview
Oncall Tool provides operational information: a services catalog, health checks, recent deployments, runbooks, and incident tracking. It is read-only for most agents, with gated write access for incidents.
Integration
Tool Definition
Registered in services/router/tool_manager.py (the description string is Ukrainian for "Operational information..."):
{
"type": "function",
"function": {
"name": "oncall_tool",
"description": "📋 Операційна інформація...",
"parameters": {...}
}
}
RBAC Configuration
The tool is added to FULL_STANDARD_STACK in services/router/agent_tools_config.py.
Actions
1. services_list
List all services from docker-compose files and service catalogs.
{
"action": "services_list"
}
Response:
{
"services": [
{"name": "router", "source": "docker-compose.yml", "type": "service", "criticality": "medium"},
{"name": "gateway", "source": "docker-compose.yml", "type": "service", "criticality": "high"}
],
"count": 2
}
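A minimal sketch of how services_list could turn an already-parsed docker-compose file into the response above. It assumes the YAML has already been loaded into a dict (for example with PyYAML, not shown). It also assumes criticality comes from a hypothetical `catalog` mapping that defaults to "medium". The real tool's catalog format and defaults are not documented here.

```python
from __future__ import annotations

from typing import Any


def list_services(compose: dict[str, Any], source: str,
                  catalog: dict[str, str] | None = None) -> dict[str, Any]:
    """Build a services_list-style response from a parsed compose file.

    `catalog` is a hypothetical name -> criticality mapping; services missing
    from it default to "medium" (an assumption, not documented behavior).
    """
    catalog = catalog or {}
    services = [
        {
            "name": name,
            "source": source,
            "type": "service",
            "criticality": catalog.get(name, "medium"),
        }
        # docker-compose keeps service definitions under the top-level "services" key
        for name in compose.get("services", {})
    ]
    return {"services": services, "count": len(services)}
```

For example, `list_services({"services": {"router": {}, "gateway": {}}}, "docker-compose.yml", {"gateway": "high"})` yields the two entries shown in the sample response.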
2. service_health
Check health endpoint of a service.
{
"action": "service_health",
"params": {
"service_name": "router",
"health_endpoint": "http://router-service:8000/health"
}
}
Security: Only allowlisted internal hosts can be checked.
Allowlist: localhost, 127.0.0.1, router-service, gateway-service, memory-service, swapper-service, crewai-service
Response:
{
"service": "router",
"endpoint": "http://router-service:8000/health",
"status": "healthy",
"status_code": 200,
"latency_ms": 15
}
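The allowlist rule above can be sketched as a hostname check before any request is made. The host set is copied from this document. The function name and the extra restriction to `http`/`https` schemes are assumptions for illustration.

```python
from urllib.parse import urlparse

# Hosts copied from the documented allowlist.
ALLOWED_HOSTS = frozenset({
    "localhost", "127.0.0.1", "router-service", "gateway-service",
    "memory-service", "swapper-service", "crewai-service",
})


def is_allowed_health_endpoint(url: str) -> bool:
    """Return True only for http(s) URLs whose host is on the allowlist."""
    parsed = urlparse(url)
    # Assumption: only web schemes are acceptable for a health check.
    if parsed.scheme not in ("http", "https"):
        return False
    # .hostname drops the port and any "user@" prefix, and lowercases the host,
    # so "http://router-service@evil.example/" is checked as "evil.example".
    return parsed.hostname in ALLOWED_HOSTS
```

Comparing the parsed hostname, rather than doing a substring match on the URL, is what stops tricks like `router-service.evil.example`.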
3. service_status
Get service status and version info.
{
"action": "service_status",
"params": {
"service_name": "router"
}
}
4. deployments_recent
Get recent deployments from the deployment log or git history.
{
"action": "deployments_recent"
}
Sources (in priority order):
- ops/deployments.jsonl
- Git commit history (fallback)
Response:
{
"deployments": [
{"ts": "2024-01-15T10:00:00", "service": "router", "version": "1.2.0"},
{"type": "git_commit", "commit": "abc123 Fix bug"}
],
"count": 2
}
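A minimal sketch of the source priority, assuming one JSON object per line in the log. In this sketch, git is consulted only when the log is missing or empty. The sample response above mixes both entry types, so the real tool may merge the two sources instead. `git log --oneline` produces the "abc123 Fix bug" form shown in the sample.

```python
from __future__ import annotations

import json
import subprocess
from pathlib import Path


def recent_deployments(log_path: Path, limit: int = 20) -> dict:
    """Return the last `limit` deployments, preferring the JSONL log over git."""
    deployments: list[dict] = []
    if log_path.exists():
        lines = [ln for ln in log_path.read_text(encoding="utf-8").splitlines() if ln.strip()]
        # Keep the newest `limit` lines (the log is append-only).
        deployments = [json.loads(ln) for ln in lines[max(len(lines) - limit, 0):]]
    if not deployments:
        # Fallback: one entry per recent commit, in the "git_commit" shape above.
        try:
            out = subprocess.run(
                ["git", "log", "--oneline", f"--max-count={limit}"],
                capture_output=True, text=True, check=True,
            ).stdout
            deployments = [{"type": "git_commit", "commit": ln} for ln in out.splitlines()]
        except (OSError, subprocess.CalledProcessError):
            deployments = []  # git missing or not a repository
    return {"deployments": deployments, "count": len(deployments)}
```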
5. runbook_search
Search for runbooks.
{
"action": "runbook_search",
"params": {
"query": "deployment"
}
}
Search directories: ops/, runbooks/, docs/runbooks/, docs/ops/
Response:
{
"results": [
{"path": "ops/deploy.md", "file": "deploy.md"}
],
"query": "deployment",
"count": 1
}
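A minimal sketch of runbook_search over the documented directories. The matching rule is an assumption: a case-insensitive substring match against Markdown file names and contents. The real tool's file types and ranking are not documented here.

```python
from __future__ import annotations

from pathlib import Path

SEARCH_DIRS = ("ops", "runbooks", "docs/runbooks", "docs/ops")


def search_runbooks(root: Path, query: str) -> dict:
    """Find *.md files under SEARCH_DIRS whose name or text contains `query`."""
    needle = query.lower()
    results = []
    for rel_dir in SEARCH_DIRS:
        base = root / rel_dir
        if not base.is_dir():
            continue  # a directory that does not exist is simply skipped
        for path in sorted(base.rglob("*.md")):
            text = path.read_text(encoding="utf-8", errors="ignore")
            if needle in path.name.lower() or needle in text.lower():
                results.append({"path": path.relative_to(root).as_posix(), "file": path.name})
    return {"results": results, "query": query, "count": len(results)}
```

With this rule, the query "deployment" finds `ops/deploy.md` through its contents ("# Deployment Runbook"), even though the file name alone does not contain the word.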
6. runbook_read
Read a specific runbook.
{
"action": "runbook_read",
"params": {
"runbook_path": "ops/deploy.md"
}
}
Security:
- Only reads from allowlisted directories
- Path traversal blocked
- Secrets masked in content
- Max 200KB per read
Response:
{
"path": "ops/deploy.md",
"content": "# Deployment Runbook\n\n...",
"size": 1234
}
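The four security rules for runbook_read can be sketched together. Resolving the path before checking it is what defeats `..` traversal and symlink escapes. The secret-masking regex below is a hypothetical pattern for `KEY=value` / `key: value` pairs with secret-like keys, not the tool's documented rule. Here `size` reports the bytes actually read, which is also an assumption.

```python
from __future__ import annotations

import re
from pathlib import Path

ALLOWED_DIRS = ("ops", "runbooks", "docs/runbooks", "docs/ops")
MAX_BYTES = 200 * 1024  # "Max 200KB per read"

# Hypothetical masking rule: hide the value of any secret-looking key.
SECRET_RE = re.compile(
    r"(?i)\b([\w.-]*(?:password|secret|token|api[_-]?key)[\w.-]*)(\s*[:=]\s*)(\S+)"
)


def read_runbook(root: Path, runbook_path: str) -> dict:
    root = root.resolve()
    # resolve() collapses ".." and follows symlinks, so escapes land outside
    # every allowed directory and fail the check below.
    target = (root / runbook_path).resolve()
    allowed = [(root / d).resolve() for d in ALLOWED_DIRS]
    if not any(base in target.parents for base in allowed):
        raise PermissionError(f"path not allowed: {runbook_path}")
    data = target.read_bytes()[:MAX_BYTES]  # enforce the size cap
    content = data.decode("utf-8", errors="replace")
    content = SECRET_RE.sub(lambda m: m.group(1) + m.group(2) + "***", content)
    return {"path": runbook_path, "content": content, "size": len(data)}
```

For example, a line `DB_PASSWORD=hunter2` in a runbook is returned as `DB_PASSWORD=***`, and `ops/../secret.txt` is rejected because it resolves outside `ops/`.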
7. incident_log_list
List incidents.
{
"action": "incident_log_list",
"params": {
"severity": "sev1",
"limit": 20
}
}
Response:
{
"incidents": [
{
"ts": "2024-01-15T10:00:00",
"severity": "sev1",
"title": "Router down",
"service": "router"
}
],
"count": 1
}
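A minimal sketch of incident_log_list over the JSONL store, assuming one incident object per line. Returning the newest incidents first is an assumption. The document does not state the ordering.

```python
from __future__ import annotations

import json
from pathlib import Path


def list_incidents(log_path: Path, severity: str | None = None, limit: int = 20) -> dict:
    """Return up to `limit` incidents, newest first, optionally filtered by severity."""
    incidents = []
    if log_path.exists():
        for line in log_path.read_text(encoding="utf-8").splitlines():
            if line.strip():
                incidents.append(json.loads(line))
    if severity is not None:
        incidents = [i for i in incidents if i.get("severity") == severity]
    # The log is append-only, so reversing file order puts the newest first.
    incidents = incidents[::-1][:limit]
    return {"incidents": incidents, "count": len(incidents)}
```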
8. incident_log_append
Append a new incident (gated: requires the tools.oncall.incident_write entitlement).
{
"action": "incident_log_append",
"params": {
"service_name": "router",
"incident_title": "High latency",
"incident_severity": "sev2",
"incident_details": "Router experiencing 500ms latency",
"incident_tags": ["performance", "router"]
}
}
RBAC: Only sofiia, helion, and admin can add incidents.
Storage: ops/incidents.jsonl
Response:
{
"incident_id": "2024-01-15T10:00:00",
"status": "logged"
}
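A minimal sketch of incident_log_append. It checks the caller against the three documented writer agents and appends one JSON line. The timestamp is returned as the incident id, matching the sample response. Hard-coding the agent set (instead of resolving entitlements from config), the record field names, and using a UTC timestamp are all assumptions.

```python
from __future__ import annotations

import json
from datetime import datetime, timezone
from pathlib import Path

# Agents documented as allowed to write incidents (hard-coded for the sketch).
INCIDENT_WRITERS = frozenset({"sofiia", "helion", "admin"})


def append_incident(log_path: Path, agent_id: str, params: dict) -> dict:
    if agent_id not in INCIDENT_WRITERS:
        raise PermissionError(f"{agent_id} lacks tools.oncall.incident_write")
    # Same "YYYY-MM-DDTHH:MM:SS" shape as the sample; UTC is an assumption.
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")
    record = {
        "ts": ts,
        "severity": params["incident_severity"],
        "title": params["incident_title"],
        "service": params["service_name"],
        "details": params.get("incident_details", ""),
        "tags": params.get("incident_tags", []),
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as f:  # append-only JSONL
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return {"incident_id": ts, "status": "logged"}
```

Using the second-resolution timestamp as the id mirrors the documented response. However, two incidents logged within the same second would share an id.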
Security Features
Health Check Allowlist
Only internal service endpoints can be checked:
- Hosts: localhost, 127.0.0.1
- Service names: router-service, gateway-service, memory-service, swapper-service, crewai-service
Runbook Security
- Only reads from allowlisted directories: ops/, runbooks/, docs/runbooks/, docs/ops/
- Path traversal blocked
- Secrets automatically masked
RBAC
- Read actions: tools.oncall.read (default for all agents)
- Write incidents: tools.oncall.incident_write (only sofiia, helion, admin)
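The read/write split can be sketched as a table from each action to the entitlement it needs. Seven actions map to `tools.oncall.read` and only `incident_log_append` maps to `tools.oncall.incident_write`. The dictionary and `check_access` are hypothetical names. How the router actually resolves an agent's entitlements lives in its own config and is not shown here.

```python
from __future__ import annotations

from typing import AbstractSet

READ = "tools.oncall.read"
WRITE = "tools.oncall.incident_write"

# Hypothetical action -> required entitlement table for the eight actions.
ACTION_ENTITLEMENTS = {
    "services_list": READ,
    "service_health": READ,
    "service_status": READ,
    "deployments_recent": READ,
    "runbook_search": READ,
    "runbook_read": READ,
    "incident_log_list": READ,
    "incident_log_append": WRITE,
}


def check_access(action: str, entitlements: AbstractSet[str]) -> None:
    """Raise unless `entitlements` grants the entitlement `action` requires."""
    required = ACTION_ENTITLEMENTS.get(action)
    if required is None:
        raise ValueError(f"unknown action: {action}")
    if required not in entitlements:
        raise PermissionError(f"{action} requires {required}")
```

With this design, an unknown action fails closed, and adding a new action forces an explicit choice of entitlement.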
Data Files
Empty files are created for data storage:
- ops/incidents.jsonl: incident log
- ops/deployments.jsonl: deployment log
Example Usage
Check Service Health
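"Перевіри health router сервісу" (Check the health of the router service)
Find Runbook
"Знайди runbook про деплой" (Find a runbook about deployment)
Read Deployment Runbook
"Відкрий runbook/deploy.md" (Open runbook/deploy.md)
View Recent Deployments
"Покажи останні деплої" (Show the latest deployments)
Log Incident
"Зареєструй інцидент: router висока затримка, sev2" (Log an incident: router high latency, sev2)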
Testing
pytest tools/oncall_tool/tests/test_oncall_tool.py -v
Test coverage:
- services_list parses docker-compose
- runbook_search finds results
- runbook_read blocks path traversal
- runbook_read masks secrets
- incident_log_append allowed for sofiia
- incident_log_append blocked for regular agents
- service_health blocks non-allowlisted hosts