docs(platform): add policy configs, runbooks, ops scripts and platform documentation

Config policies (16 files): alert_routing, architecture_pressure, backlog, cost_weights, data_governance, incident_escalation, incident_intelligence, network_allowlist, nodes_registry, observability_sources, rbac_tools_matrix, release_gate, risk_attribution, risk_policy, slo_policy, tool_limits, tools_rollout

Ops (22 files): Caddyfile, calendar compose, grafana voice dashboard, deployments/incidents logs, runbooks for alerts/audit/backlog/incidents/sofiia/voice, cron jobs, scripts (alert_triage, audit_cleanup, migrate_*, governance, schedule), task_registry, voice alerts/ha/latency/policy

Docs (30+ files): HUMANIZED_STEPAN v2.7-v3 changelogs and runbooks, NODA1/NODA2 status and setup, audit index and traces, backlog, incident, supervisor, tools, voice, opencode, release, risk, aistalk, spacebot

Made-with: Cursor
292
docs/tools/oncall_tool.md
Normal file
@@ -0,0 +1,292 @@
# Oncall/Runbook Tool - Documentation

## Overview

Oncall Tool provides operational information: services catalog, health checks, deployments, runbooks, and incident tracking. It is read-only for most agents, with gated write access for incidents.

## Integration

### Tool Definition

Registered in `services/router/tool_manager.py`:

```python
{
    "type": "function",
    "function": {
        "name": "oncall_tool",
        # The description is in Ukrainian: "Operational information..."
        "description": "📋 Операційна інформація...",
        "parameters": {...}
    }
}
```

### RBAC Configuration

Added to `FULL_STANDARD_STACK` in `services/router/agent_tools_config.py`.

## Actions

### 1. services_list

List all services from docker-compose files and service catalogs.

```json
{
  "action": "services_list"
}
```

**Response:**

```json
{
  "services": [
    {"name": "router", "source": "docker-compose.yml", "type": "service", "criticality": "medium"},
    {"name": "gateway", "source": "docker-compose.yml", "type": "service", "criticality": "high"}
  ],
  "count": 2
}
```

### 2. service_health

Check the health endpoint of a service.

```json
{
  "action": "service_health",
  "params": {
    "service_name": "router",
    "health_endpoint": "http://router-service:8000/health"
  }
}
```

**Security:** Only allowlisted internal hosts can be checked.

**Allowlist:** `localhost`, `127.0.0.1`, `router-service`, `gateway-service`, `memory-service`, `swapper-service`, `crewai-service`

**Response:**

```json
{
  "service": "router",
  "endpoint": "http://router-service:8000/health",
  "status": "healthy",
  "status_code": 200,
  "latency_ms": 15
}
```
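
The host allowlist for `service_health` could be enforced along the following lines. This is a minimal sketch, not the tool's actual code: the function name `is_allowed_health_endpoint` and the restriction to `http`/`https` schemes are assumptions; only the host names come from the allowlist documented for this action.

```python
from urllib.parse import urlparse

# Hosts taken from the allowlist documented for service_health.
ALLOWED_HEALTH_HOSTS = frozenset({
    "localhost",
    "127.0.0.1",
    "router-service",
    "gateway-service",
    "memory-service",
    "swapper-service",
    "crewai-service",
})


def is_allowed_health_endpoint(url: str) -> bool:
    """Hypothetical helper: accept only http(s) URLs whose host is allowlisted."""
    parsed = urlparse(url)
    # Assumption: only plain HTTP(S) endpoints are valid health checks.
    if parsed.scheme not in ("http", "https"):
        return False
    # `hostname` excludes the port and any userinfo, so a URL such as
    # "http://router-service@example.com/" is judged by "example.com".
    return parsed.hostname in ALLOWED_HEALTH_HOSTS
```

Comparing the parsed `hostname` rather than doing a substring match on the URL avoids accepting addresses that merely contain an allowlisted name.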

### 3. service_status

Get service status and version info.

```json
{
  "action": "service_status",
  "params": {
    "service_name": "router"
  }
}
```

### 4. deployments_recent

Get recent deployments from the log file or git.

```json
{
  "action": "deployments_recent"
}
```

**Sources (priority):**

1. `ops/deployments.jsonl`
2. Git commit history (fallback)

**Response:**

```json
{
  "deployments": [
    {"ts": "2024-01-15T10:00:00", "service": "router", "version": "1.2.0"},
    {"type": "git_commit", "commit": "abc123 Fix bug"}
  ],
  "count": 2
}
```
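
The source priority for `deployments_recent` might look roughly like the sketch below. It assumes the git fallback is used only when the JSONL log yields no entries; the documentation does not say whether the two sources are ever merged. The function name, the `limit` parameter, and the injected `git_fallback` callable are all hypothetical (a real fallback might shell out to `git log --oneline`).

```python
import json
from pathlib import Path
from typing import Callable, Dict, List


def recent_deployments(
    log_path: Path,
    git_fallback: Callable[[], List[Dict]],
    limit: int = 20,  # hypothetical cap, not documented for this action
) -> Dict:
    """Read deployments from a JSONL log, falling back to git history."""
    entries: List[Dict] = []
    if log_path.is_file():
        for line in log_path.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if not line:
                continue
            try:
                entries.append(json.loads(line))
            except json.JSONDecodeError:
                # Assumption: malformed lines are skipped rather than fatal.
                continue
    if not entries:
        # Lower-priority source, consulted only when the log is empty or missing.
        entries = git_fallback()
    entries = entries[-limit:]
    return {"deployments": entries, "count": len(entries)}
```

Passing the fallback in as a callable keeps the file-parsing path testable without a git checkout.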

### 5. runbook_search

Search for runbooks.

```json
{
  "action": "runbook_search",
  "params": {
    "query": "deployment"
  }
}
```

**Search directories:** `ops/`, `runbooks/`, `docs/runbooks/`, `docs/ops/`

**Response:**

```json
{
  "results": [
    {"path": "ops/deploy.md", "file": "deploy.md"}
  ],
  "query": "deployment",
  "count": 1
}
```
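
As an illustration of how such a search could work, the sketch below walks the documented search directories and matches the query against both file names and file contents. The matching rule is an assumption (the example response suggests content matching, since `deploy.md` does not contain the word "deployment" in its name), as are the function name, the `*.md` filter, and the `limit` parameter.

```python
from pathlib import Path
from typing import Dict

# Directories taken from the documented search list.
SEARCH_DIRS = ("ops", "runbooks", "docs/runbooks", "docs/ops")


def search_runbooks(repo_root: Path, query: str, limit: int = 10) -> Dict:
    """Hypothetical case-insensitive search over Markdown runbooks."""
    needle = query.lower()
    results = []
    for rel_dir in SEARCH_DIRS:
        base = repo_root / rel_dir
        if not base.is_dir():
            continue
        # Assumption: runbooks are Markdown files; sorted for stable output.
        for path in sorted(base.rglob("*.md")):
            text = path.read_text(encoding="utf-8", errors="ignore").lower()
            if needle in path.name.lower() or needle in text:
                results.append({
                    "path": path.relative_to(repo_root).as_posix(),
                    "file": path.name,
                })
    results = results[:limit]
    return {"results": results, "query": query, "count": len(results)}
```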

### 6. runbook_read

Read a specific runbook.

```json
{
  "action": "runbook_read",
  "params": {
    "runbook_path": "ops/deploy.md"
  }
}
```

**Security:**

- Only reads from allowlisted directories
- Path traversal blocked
- Secrets masked in content
- Max 200KB per read

**Response:**

```json
{
  "path": "ops/deploy.md",
  "content": "# Deployment Runbook\n\n...",
  "size": 1234
}
```
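
The four guards listed for `runbook_read` can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: the helper names, the secret-matching regex, and the choice to truncate (rather than reject) content beyond 200 KB are all hypothetical.

```python
import re
from pathlib import Path
from typing import Dict

ALLOWED_RUNBOOK_DIRS = ("ops", "runbooks", "docs/runbooks", "docs/ops")
MAX_RUNBOOK_BYTES = 200 * 1024  # the documented 200KB cap

# Hypothetical pattern: masks values in "key = value" / "key: value" pairs.
SECRET_PATTERN = re.compile(
    r"(?i)\b(password|secret|token|api[_-]?key)(\s*[:=]\s*)(\S+)"
)


def resolve_runbook(repo_root: Path, runbook_path: str) -> Path:
    """Resolve a path and reject anything outside the allowlisted directories."""
    root = repo_root.resolve()
    # resolve() collapses ".." segments, so traversal attempts end up outside.
    candidate = (root / runbook_path).resolve()
    for allowed in ALLOWED_RUNBOOK_DIRS:
        if candidate.is_relative_to(root / allowed):  # Python 3.9+
            return candidate
    raise PermissionError(f"runbook path not allowed: {runbook_path}")


def read_runbook(repo_root: Path, runbook_path: str) -> Dict:
    """Read a runbook with size cap and secret masking applied."""
    path = resolve_runbook(repo_root, runbook_path)
    # Assumption: oversized files are truncated, not rejected.
    raw = path.read_bytes()[:MAX_RUNBOOK_BYTES]
    text = raw.decode("utf-8", errors="replace")
    masked = SECRET_PATTERN.sub(r"\1\2***", text)
    return {"path": runbook_path, "content": masked, "size": len(raw)}
```

Checking the resolved path against the allowlist, instead of scanning the input string for `..`, also covers absolute paths such as `/etc/passwd`.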

### 7. incident_log_list

List incidents.

```json
{
  "action": "incident_log_list",
  "params": {
    "severity": "sev1",
    "limit": 20
  }
}
```

**Response:**

```json
{
  "incidents": [
    {
      "ts": "2024-01-15T10:00:00",
      "severity": "sev1",
      "title": "Router down",
      "service": "router"
    }
  ],
  "count": 1
}
```

### 8. incident_log_append

Add a new incident (gated: requires entitlement).

```json
{
  "action": "incident_log_append",
  "params": {
    "service_name": "router",
    "incident_title": "High latency",
    "incident_severity": "sev2",
    "incident_details": "Router experiencing 500ms latency",
    "incident_tags": ["performance", "router"]
  }
}
```

**RBAC:** Only `sofiia`, `helion`, `admin` can add incidents.

**Storage:** `ops/incidents.jsonl`

**Response:**

```json
{
  "incident_id": "2024-01-15T10:00:00",
  "status": "logged"
}
```
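
To make the gated write and the filtered read concrete, here is a minimal sketch of `incident_log_append` and `incident_log_list` over a JSONL file. The function names, the `agent_id` parameter, the use of a UTC timestamp as the incident ID, and the newest-first ordering are assumptions; the writer set mirrors the RBAC rule stated for this action.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, Iterable, Optional

# Agents holding tools.oncall.incident_write, per the RBAC rule.
INCIDENT_WRITERS = frozenset({"sofiia", "helion", "admin"})


def append_incident(
    log_path: Path,
    agent_id: str,
    service_name: str,
    title: str,
    severity: str,
    details: str = "",
    tags: Optional[Iterable[str]] = None,
) -> Dict:
    """Append one incident record; refuse agents without write entitlement."""
    if agent_id not in INCIDENT_WRITERS:
        raise PermissionError(f"{agent_id} lacks tools.oncall.incident_write")
    # Assumption: the timestamp doubles as the incident ID.
    ts = datetime.now(timezone.utc).isoformat(timespec="seconds")
    record = {
        "ts": ts,
        "service": service_name,
        "title": title,
        "severity": severity,
        "details": details,
        "tags": list(tags or []),
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return {"incident_id": ts, "status": "logged"}


def list_incidents(
    log_path: Path, severity: Optional[str] = None, limit: int = 20
) -> Dict:
    """Return up to `limit` incidents, optionally filtered by severity."""
    incidents = []
    if log_path.is_file():
        for line in log_path.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            record = json.loads(line)
            if severity is None or record.get("severity") == severity:
                incidents.append(record)
    # Assumption: append order is chronological, so reverse for newest first.
    incidents = incidents[::-1][:limit]
    return {"incidents": incidents, "count": len(incidents)}
```

An append-only JSONL file keeps each write to a single line, which makes the log easy to tail and to filter line by line.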

## Security Features

### Health Check Allowlist

Only internal service endpoints can be checked:

- `localhost`, `127.0.0.1`
- Service names: `router-service`, `gateway-service`, `memory-service`, `swapper-service`, `crewai-service`

### Runbook Security

- Only reads from allowlisted directories: `ops/`, `runbooks/`, `docs/runbooks/`, `docs/ops/`
- Path traversal blocked
- Secrets automatically masked

### RBAC

- Read actions: `tools.oncall.read` (default for all agents)
- Write incidents: `tools.oncall.incident_write` (only `sofiia`, `helion`, `admin`)

## Data Files

Empty files are created for data storage:

- `ops/incidents.jsonl` - Incident log
- `ops/deployments.jsonl` - Deployment log

## Example Usage

### Check Service Health

```
"Перевіри health router сервісу"
```

English: "Check the health of the router service"

### Find Runbook

```
"Знайди runbook про деплой"
```

English: "Find a runbook about deployment"

### Read Deployment Runbook

```
"Відкрий runbook/deploy.md"
```

English: "Open runbook/deploy.md"

### View Recent Deployments

```
"Покажи останні деплої"
```

English: "Show recent deployments"

### Log Incident

```
"Зареєструй інцидент: router висока затримка, sev2"
```

English: "Log an incident: router high latency, sev2"

## Testing

```bash
pytest tools/oncall_tool/tests/test_oncall_tool.py -v
```

Test coverage:

- services_list parses docker-compose
- runbook_search finds results
- runbook_read blocks path traversal
- runbook_read masks secrets
- incident_log_append allowed for sofiia
- incident_log_append blocked for regular agents
- service_health blocks non-allowlisted hosts