# Oncall/Runbook Tool - Documentation

## Overview

Oncall Tool provides operational information: services catalog, health checks, deployments, runbooks, and incident tracking. It is read-only for most agents, with gated write access for incidents.

## Integration

### Tool Definition

Registered in `services/router/tool_manager.py`:

```python
{
    "type": "function",
    "function": {
        "name": "oncall_tool",
        # English: "📋 Operational information..."
        "description": "📋 Операційна інформація...",
        "parameters": {...}
    }
}
```

### RBAC Configuration

Added to `FULL_STANDARD_STACK` in `services/router/agent_tools_config.py`.

## Actions

### 1. services_list

List all services from docker-compose files and service catalogs.

```json
{
  "action": "services_list"
}
```

**Response:**
```json
{
  "services": [
    {"name": "router", "source": "docker-compose.yml", "type": "service", "criticality": "medium"},
    {"name": "gateway", "source": "docker-compose.yml", "type": "service", "criticality": "high"}
  ],
  "count": 2
}
```

### 2. service_health

Check health endpoint of a service.

```json
{
  "action": "service_health",
  "params": {
    "service_name": "router",
    "health_endpoint": "http://router-service:8000/health"
  }
}
```

**Security:** Only allowlisted internal hosts can be checked.

**Allowlist:** `localhost`, `127.0.0.1`, `router-service`, `gateway-service`, `memory-service`, `swapper-service`, `crewai-service`

**Response:**
```json
{
  "service": "router",
  "endpoint": "http://router-service:8000/health",
  "status": "healthy",
  "status_code": 200,
  "latency_ms": 15
}
```

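The host check described above could be sketched as follows. This is a minimal illustration, not the tool's actual code: the function name `is_allowed_health_endpoint` and the restriction to `http`/`https` schemes are assumptions; only the host list comes from this document.

```python
from urllib.parse import urlparse

# Hosts copied from the allowlist documented above.
ALLOWED_HEALTH_HOSTS = {
    "localhost", "127.0.0.1", "router-service", "gateway-service",
    "memory-service", "swapper-service", "crewai-service",
}


def is_allowed_health_endpoint(url: str) -> bool:
    """Return True only for http(s) URLs whose host is on the allowlist.

    Hypothetical helper: the name and the scheme restriction are assumptions.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    # `hostname` is lowercased and excludes the port and any userinfo,
    # so "http://router-service@evil.example/" is judged by "evil.example".
    return parsed.hostname in ALLOWED_HEALTH_HOSTS
```

Comparing the parsed hostname, rather than searching the raw URL string, avoids accepting URLs that merely contain an allowlisted name somewhere in the path or userinfo.
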
### 3. service_status

Get service status and version info.

```json
{
  "action": "service_status",
  "params": {
    "service_name": "router"
  }
}
```

### 4. deployments_recent

Get recent deployments from log file or git.

```json
{
  "action": "deployments_recent"
}
```

**Sources (priority):**
1. `ops/deployments.jsonl`
2. Git commit history (fallback)

**Response:**
```json
{
  "deployments": [
    {"ts": "2024-01-15T10:00:00", "service": "router", "version": "1.2.0"},
    {"type": "git_commit", "commit": "abc123 Fix bug"}
  ],
  "count": 2
}
```

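The source priority above could be sketched like this. The function name `recent_deployments`, the `limit` parameter, and the injected `git_fallback` callable are all hypothetical; whether the real tool merges both sources or only falls back is not stated here, so this sketch assumes a pure fallback when the log yields no entries.

```python
import json
from pathlib import Path
from typing import Callable


def recent_deployments(
    log_path: Path,
    git_fallback: Callable[[], list],
    limit: int = 20,  # assumed default, not taken from the tool
) -> dict:
    """Read deployments from a JSONL log, falling back to git history."""
    entries = []
    if log_path.exists():
        for line in log_path.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed lines instead of failing the call
            if isinstance(record, dict):
                entries.append(record)
    if not entries:
        # Assumption: git history is consulted only when the log is empty.
        entries = list(git_fallback())
    if limit > 0:
        entries = entries[-limit:]
    return {"deployments": entries, "count": len(entries)}
```

Passing the git lookup in as a callable keeps the sketch free of subprocess calls and makes the fallback easy to stub in tests.
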
### 5. runbook_search

Search for runbooks.

```json
{
  "action": "runbook_search",
  "params": {
    "query": "deployment"
  }
}
```

**Search directories:** `ops/`, `runbooks/`, `docs/runbooks/`, `docs/ops/`

**Response:**
```json
{
  "results": [
    {"path": "ops/deploy.md", "file": "deploy.md"}
  ],
  "query": "deployment",
  "count": 1
}
```

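A search over the directories listed above might look roughly like the sketch below. The matching rule (case-insensitive substring match on file name or content) and the restriction to `*.md` files are assumptions; the document does not say how the tool ranks or matches results. `search_runbooks` and `repo_root` are hypothetical names.

```python
from pathlib import Path

# Directories copied from the "Search directories" list above.
SEARCH_DIRS = ("ops", "runbooks", "docs/runbooks", "docs/ops")


def search_runbooks(repo_root: Path, query: str) -> dict:
    """Return Markdown files whose name or content contains the query."""
    needle = query.lower()
    results = []
    for rel_dir in SEARCH_DIRS:
        base = repo_root / rel_dir
        if not base.is_dir():
            continue
        for path in sorted(base.rglob("*.md")):
            text = path.read_text(encoding="utf-8", errors="ignore").lower()
            if needle in path.name.lower() or needle in text:
                results.append({
                    "path": path.relative_to(repo_root).as_posix(),
                    "file": path.name,
                })
    return {"results": results, "query": query, "count": len(results)}
```
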
### 6. runbook_read

Read a specific runbook.

```json
{
  "action": "runbook_read",
  "params": {
    "runbook_path": "ops/deploy.md"
  }
}
```

**Security:**
- Only reads from allowlisted directories
- Path traversal blocked
- Secrets masked in content
- Max 200KB per read

**Response:**
```json
{
  "path": "ops/deploy.md",
  "content": "# Deployment Runbook\n\n...",
  "size": 1234
}
```

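The directory allowlist, traversal check, and size cap listed above can be illustrated with a short sketch. The helper names `resolve_runbook` and `read_runbook` are hypothetical, and truncating at 200 KB (rather than rejecting larger files) is an assumption; secret masking is omitted here. `Path.is_relative_to` requires Python 3.9 or later.

```python
from pathlib import Path

# Directories and size cap taken from the security notes above.
ALLOWED_DIRS = ("ops", "runbooks", "docs/runbooks", "docs/ops")
MAX_BYTES = 200 * 1024


def resolve_runbook(repo_root: Path, runbook_path: str) -> Path:
    """Resolve a requested path and reject anything outside allowed dirs."""
    root = repo_root.resolve()
    # resolve() collapses ".." segments and follows symlinks, so the
    # containment check below runs against the real target location.
    candidate = (root / runbook_path).resolve()
    for rel_dir in ALLOWED_DIRS:
        if candidate.is_relative_to((root / rel_dir).resolve()):
            return candidate
    raise PermissionError(f"runbook path not allowed: {runbook_path}")


def read_runbook(repo_root: Path, runbook_path: str) -> dict:
    """Read at most MAX_BYTES of an allowlisted runbook."""
    path = resolve_runbook(repo_root, runbook_path)
    data = path.read_bytes()[:MAX_BYTES]  # assumption: truncate, not reject
    return {
        "path": runbook_path,
        "content": data.decode("utf-8", errors="replace"),
        "size": len(data),
    }
```

Checking the resolved path instead of the raw string means inputs such as `ops/../secret.txt` or an absolute path are rejected without any special-case string filtering.
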
### 7. incident_log_list

List incidents.

```json
{
  "action": "incident_log_list",
  "params": {
    "severity": "sev1",
    "limit": 20
  }
}
```

**Response:**
```json
{
  "incidents": [
    {
      "ts": "2024-01-15T10:00:00",
      "severity": "sev1",
      "title": "Router down",
      "service": "router"
    }
  ],
  "count": 1
}
```

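Filtering by `severity` and `limit` could work along the lines of the sketch below. The function name `list_incidents`, the newest-first ordering, and the assumption that the log is append-ordered are not confirmed by this document.

```python
import json
from pathlib import Path
from typing import Optional


def list_incidents(
    log_path: Path,
    severity: Optional[str] = None,
    limit: int = 20,
) -> dict:
    """Return up to `limit` incidents, optionally filtered by severity."""
    incidents = []
    if log_path.exists():
        for line in log_path.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # ignore malformed lines
            if severity is None or record.get("severity") == severity:
                incidents.append(record)
    # Assumption: the file is append-ordered, so the tail holds the latest
    # entries; they are returned newest first.
    recent = incidents[-limit:] if limit > 0 else []
    recent.reverse()
    return {"incidents": recent, "count": len(recent)}
```
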
### 8. incident_log_append

Add a new incident (gated: requires the `tools.oncall.incident_write` entitlement).

```json
{
  "action": "incident_log_append",
  "params": {
    "service_name": "router",
    "incident_title": "High latency",
    "incident_severity": "sev2",
    "incident_details": "Router experiencing 500ms latency",
    "incident_tags": ["performance", "router"]
  }
}
```

**RBAC:** Only `sofiia`, `helion`, and `admin` can add incidents.

**Storage:** `ops/incidents.jsonl`

**Response:**
```json
{
  "incident_id": "2024-01-15T10:00:00",
  "status": "logged"
}
```

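Appending to the JSONL store might be sketched as follows. The mapping from request parameters to stored fields is inferred from the `incident_log_list` response; the `details` and `tags` field names, the UTC timestamp, and the function name `append_incident` are assumptions. The RBAC check is left out of this sketch.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_incident(log_path: Path, params: dict) -> dict:
    """Append one incident as a JSON line and return its id."""
    # Assumption: the timestamp doubles as the incident id, as in the
    # response example; UTC is chosen here for unambiguous ordering.
    ts = datetime.now(timezone.utc).isoformat(timespec="seconds")
    record = {
        "ts": ts,
        "service": params["service_name"],
        "title": params["incident_title"],
        "severity": params["incident_severity"],
        "details": params.get("incident_details", ""),  # assumed field name
        "tags": params.get("incident_tags", []),  # assumed field name
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return {"incident_id": ts, "status": "logged"}
```

One JSON object per line keeps the write a simple append, so existing entries are never rewritten.
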
## Security Features

### Health Check Allowlist
Only internal service endpoints can be checked:
- `localhost`, `127.0.0.1`
- Service names: `router-service`, `gateway-service`, `memory-service`, `swapper-service`, `crewai-service`

### Runbook Security
- Only read from allowlisted directories: `ops/`, `runbooks/`, `docs/runbooks/`, `docs/ops/`
- Path traversal blocked
- Secrets automatically masked

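Secret masking could be approximated with a regular expression over `key=value` and `key: value` pairs. The keyword list, the `***` replacement, and the helper name `mask_secrets` are assumptions for illustration; the patterns the tool actually uses are not documented here.

```python
import re

# Assumed keywords; a real deployment would likely need a longer list.
SECRET_PATTERN = re.compile(
    r"\b([A-Z0-9_]*(?:password|secret|token|api_key)[A-Z0-9_]*)"
    r"([ \t]*[:=][ \t]*)(\S+)",
    re.IGNORECASE,
)


def mask_secrets(text: str) -> str:
    """Replace values of secret-looking keys with a fixed placeholder."""
    return SECRET_PATTERN.sub(lambda m: f"{m.group(1)}{m.group(2)}***", text)
```

For example, `mask_secrets("API_KEY=abc123")` returns `API_KEY=***`, while text without a matching key is returned unchanged.
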
### RBAC
- Read actions: `tools.oncall.read` (default for all agents)
- Write incidents: `tools.oncall.incident_write` (only sofiia, helion, admin)

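The two entitlements above map onto actions roughly as in the following sketch. The entitlement strings and agent names come from this document; the function names and the in-memory lookup are hypothetical, and the real configuration lives in the router's RBAC files.

```python
READ_ENTITLEMENT = "tools.oncall.read"
WRITE_ENTITLEMENT = "tools.oncall.incident_write"

# Agents allowed to write incidents, as listed above.
INCIDENT_WRITERS = {"sofiia", "helion", "admin"}


def entitlements_for(agent_id: str) -> set:
    """Every agent gets read access; only listed agents get write access."""
    granted = {READ_ENTITLEMENT}
    if agent_id in INCIDENT_WRITERS:
        granted.add(WRITE_ENTITLEMENT)
    return granted


def required_entitlement(action: str) -> str:
    """incident_log_append is the only write action in this tool."""
    if action == "incident_log_append":
        return WRITE_ENTITLEMENT
    return READ_ENTITLEMENT


def is_allowed(agent_id: str, action: str) -> bool:
    """Check whether an agent holds the entitlement an action needs."""
    return required_entitlement(action) in entitlements_for(agent_id)
```
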
## Data Files

Empty files are created for data storage:
- `ops/incidents.jsonl` - Incident log
- `ops/deployments.jsonl` - Deployment log

## Example Usage

### Check Service Health
```
"Check the health of the router service"
```

### Find Runbook
```
"Find a runbook about deployment"
```

### Read Deployment Runbook
```
"Open runbook/deploy.md"
```

### View Recent Deployments
```
"Show recent deployments"
```

### Log Incident
```
"Log an incident: router high latency, sev2"
```

## Testing

```bash
pytest tools/oncall_tool/tests/test_oncall_tool.py -v
```

Test coverage:
- services_list parses docker-compose
- runbook_search finds results
- runbook_read blocks path traversal
- runbook_read masks secrets
- incident_log_append allowed for sofiia
- incident_log_append blocked for regular agents
- service_health blocks non-allowlisted hosts