docs(platform): add policy configs, runbooks, ops scripts and platform documentation

Config policies (16 files): alert_routing, architecture_pressure, backlog,
cost_weights, data_governance, incident_escalation, incident_intelligence,
network_allowlist, nodes_registry, observability_sources, rbac_tools_matrix,
release_gate, risk_attribution, risk_policy, slo_policy, tool_limits, tools_rollout

Ops (22 files): Caddyfile, calendar compose, grafana voice dashboard,
deployments/incidents logs, runbooks for alerts/audit/backlog/incidents/sofiia/voice,
cron jobs, scripts (alert_triage, audit_cleanup, migrate_*, governance, schedule),
task_registry, voice alerts/ha/latency/policy

Docs (30+ files): HUMANIZED_STEPAN v2.7-v3 changelogs and runbooks,
NODA1/NODA2 status and setup, audit index and traces, backlog, incident,
supervisor, tools, voice, opencode, release, risk, aistalk, spacebot

Made-with: Cursor
Apple
2026-03-03 07:14:53 -08:00
parent 129e4ea1fc
commit 67225a39fa
102 changed files with 20060 additions and 0 deletions

docs/tools/oncall_tool.md Normal file

@@ -0,0 +1,292 @@
# Oncall/Runbook Tool - Documentation
## Overview
Oncall Tool provides operational information: services catalog, health checks, deployments, runbooks, and incident tracking. It is read-only for most agents, with gated write access for incidents.
## Integration
### Tool Definition
Registered in `services/router/tool_manager.py`:
```python
{
"type": "function",
"function": {
"name": "oncall_tool",
"description": "📋 Operational information...",
"parameters": {...}
}
}
```
### RBAC Configuration
Added to `FULL_STANDARD_STACK` in `services/router/agent_tools_config.py`.
## Actions
### 1. services_list
List all services from docker-compose files and service catalogs.
```json
{
"action": "services_list"
}
```
**Response:**
```json
{
"services": [
{"name": "router", "source": "docker-compose.yml", "type": "service", "criticality": "medium"},
{"name": "gateway", "source": "docker-compose.yml", "type": "service", "criticality": "high"}
],
"count": 2
}
```
### 2. service_health
Check health endpoint of a service.
```json
{
"action": "service_health",
"params": {
"service_name": "router",
"health_endpoint": "http://router-service:8000/health"
}
}
```
**Security:** Only allowlisted internal hosts can be checked.
**Allowlist:** `localhost`, `127.0.0.1`, `router-service`, `gateway-service`, `memory-service`, `swapper-service`, `crewai-service`
**Response:**
```json
{
"service": "router",
"endpoint": "http://router-service:8000/health",
"status": "healthy",
"status_code": 200,
"latency_ms": 15
}
```
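The allowlist check described above could be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the function name `is_allowed_health_endpoint` and the restriction to `http`/`https` schemes are assumptions; only the host names come from the allowlist in this document.

```python
from urllib.parse import urlparse

# Hosts copied from the allowlist above.
ALLOWED_HEALTH_HOSTS = frozenset({
    "localhost", "127.0.0.1", "router-service", "gateway-service",
    "memory-service", "swapper-service", "crewai-service",
})


def is_allowed_health_endpoint(url: str) -> bool:
    """Hypothetical helper: True only for http(s) URLs with an allowlisted host."""
    parsed = urlparse(url)
    # Assumption: only plain HTTP(S) health endpoints are meaningful here.
    if parsed.scheme not in ("http", "https"):
        return False
    # `hostname` drops the port and any userinfo, and lowercases the host,
    # so "http://router-service@evil.example/" is judged by "evil.example".
    return parsed.hostname in ALLOWED_HEALTH_HOSTS
```

Comparing the parsed host rather than searching the raw URL string avoids being fooled by allowlisted names that appear in the userinfo or path part of the URL.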
### 3. service_status
Get service status and version info.
```json
{
"action": "service_status",
"params": {
"service_name": "router"
}
}
```
### 4. deployments_recent
Get recent deployments from the deployment log file, falling back to git history.
```json
{
"action": "deployments_recent"
}
```
**Sources (priority):**
1. `ops/deployments.jsonl`
2. Git commit history (fallback)
**Response:**
```json
{
"deployments": [
{"ts": "2024-01-15T10:00:00", "service": "router", "version": "1.2.0"},
{"type": "git_commit", "commit": "abc123 Fix bug"}
],
"count": 2
}
```
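The source priority above might be implemented along these lines. This is a sketch under assumptions: the helper names, the `limit` parameter, and the choice to fall back to git when the log is missing or empty are illustrative and not taken from the tool's source.

```python
import json
import subprocess
from pathlib import Path


def read_deployment_log(path: Path, limit: int = 20) -> list:
    """Return the last `limit` JSON objects from a JSONL file, or [] if unavailable."""
    if not path.is_file():
        return []
    records = []
    for line in path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of failing the whole read
        if isinstance(record, dict):
            records.append(record)
    return records[-limit:] if limit > 0 else []


def git_commits(limit: int = 20) -> list:
    """Fallback source: recent commits from `git log --oneline`."""
    try:
        out = subprocess.run(
            ["git", "log", "--oneline", "-n", str(limit)],
            capture_output=True, text=True, check=True,
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return []  # git missing or not a repository
    return [{"type": "git_commit", "commit": line} for line in out.splitlines() if line]


def deployments_recent(log_path: Path = Path("ops/deployments.jsonl"), limit: int = 20) -> dict:
    # Assumption: an empty or missing log triggers the git fallback.
    deployments = read_deployment_log(log_path, limit) or git_commits(limit)
    return {"deployments": deployments, "count": len(deployments)}
```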
### 5. runbook_search
Search for runbooks.
```json
{
"action": "runbook_search",
"params": {
"query": "deployment"
}
}
```
**Search directories:** `ops/`, `runbooks/`, `docs/runbooks/`, `docs/ops/`
**Response:**
```json
{
"results": [
{"path": "ops/deploy.md", "file": "deploy.md"}
],
"query": "deployment",
"count": 1
}
```
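A search over the listed directories could look roughly like this. The document does not say how matching works, so this sketch assumes a case-insensitive substring match on file names and contents of `*.md` files; the `root` and `limit` parameters are hypothetical.

```python
from pathlib import Path

# Directories taken from the "Search directories" list above.
SEARCH_DIRS = ("ops", "runbooks", "docs/runbooks", "docs/ops")


def runbook_search(query: str, root: Path = Path("."), limit: int = 20) -> dict:
    """Hypothetical sketch: find Markdown runbooks whose name or text contains `query`."""
    needle = query.lower()
    results = []
    for rel_dir in SEARCH_DIRS:
        base = root / rel_dir
        if not base.is_dir():
            continue  # a missing directory is not an error
        for path in sorted(base.rglob("*.md")):
            try:
                text = path.read_text(encoding="utf-8", errors="ignore")
            except OSError:
                continue  # unreadable file: skip it
            if needle in path.name.lower() or needle in text.lower():
                results.append({
                    "path": path.relative_to(root).as_posix(),
                    "file": path.name,
                })
    results = results[:limit]
    return {"results": results, "query": query, "count": len(results)}
```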
### 6. runbook_read
Read a specific runbook.
```json
{
"action": "runbook_read",
"params": {
"runbook_path": "ops/deploy.md"
}
}
```
**Security:**
- Only reads from allowlisted directories
- Path traversal blocked
- Secrets masked in content
- Max 200KB per read
**Response:**
```json
{
"path": "ops/deploy.md",
"content": "# Deployment Runbook\n\n...",
"size": 1234
}
```
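The four safeguards listed above could be combined as in the sketch below. Several details are assumptions made for illustration: the masking pattern `SECRET_RE`, the choice to truncate (rather than reject) files over 200KB, and the function names. `Path.is_relative_to` requires Python 3.9 or newer.

```python
import re
from pathlib import Path

ALLOWED_DIRS = ("ops", "runbooks", "docs/runbooks", "docs/ops")
MAX_BYTES = 200 * 1024  # "Max 200KB per read"

# Hypothetical masking rule: `key=value` or `key: value` with a secret-looking key.
SECRET_RE = re.compile(
    r"\b([\w-]*(?:password|secret|token|api[_-]?key))(\s*[:=]\s*)\S+",
    re.IGNORECASE,
)


def resolve_runbook(root: Path, runbook_path: str) -> Path:
    """Resolve `runbook_path` and reject anything outside the allowlisted directories."""
    root = root.resolve()
    # resolve() collapses ".." segments and follows symlinks before the check.
    target = (root / runbook_path).resolve()
    for rel_dir in ALLOWED_DIRS:
        if target.is_relative_to((root / rel_dir).resolve()):
            return target
    raise PermissionError(f"path not allowed: {runbook_path}")


def runbook_read(root: Path, runbook_path: str) -> dict:
    target = resolve_runbook(root, runbook_path)
    # Assumption: oversized files are truncated to the limit, not rejected.
    raw = target.read_bytes()[:MAX_BYTES]
    text = raw.decode("utf-8", errors="replace")
    masked = SECRET_RE.sub(r"\1\2***", text)
    return {"path": runbook_path, "content": masked, "size": len(raw)}
```

Checking the resolved path, not the string the caller supplied, is what blocks inputs such as `ops/../../etc/passwd`.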
### 7. incident_log_list
List incidents, optionally filtered by `severity` and capped by `limit`.
```json
{
"action": "incident_log_list",
"params": {
"severity": "sev1",
"limit": 20
}
}
```
**Response:**
```json
{
"incidents": [
{
"ts": "2024-01-15T10:00:00",
"severity": "sev1",
"title": "Router down",
"service": "router"
}
],
"count": 1
}
```
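The `severity` and `limit` parameters suggest filtering logic along these lines. This is only a sketch: the function name is hypothetical, and returning the newest entries first assumes records are stored in append order, which the document does not state.

```python
from typing import Iterable, Optional


def filter_incidents(
    records: Iterable[dict],
    severity: Optional[str] = None,
    limit: int = 20,
) -> dict:
    """Hypothetical helper: keep matching incidents, newest first, at most `limit`."""
    matched = [
        r for r in records
        if severity is None or r.get("severity") == severity
    ]
    # Assumption: `records` is in append (oldest-first) order.
    newest_first = matched[::-1][:max(limit, 0)]
    return {"incidents": newest_first, "count": len(newest_first)}
```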
### 8. incident_log_append
Add a new incident (gated: requires the `tools.oncall.incident_write` entitlement).
```json
{
"action": "incident_log_append",
"params": {
"service_name": "router",
"incident_title": "High latency",
"incident_severity": "sev2",
"incident_details": "Router experiencing 500ms latency",
"incident_tags": ["performance", "router"]
}
}
```
**RBAC:** Only `sofiia`, `helion`, `admin` can add incidents.
**Storage:** `ops/incidents.jsonl`
**Response:**
```json
{
"incident_id": "2024-01-15T10:00:00",
"status": "logged"
}
```
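A gated append to the JSONL log might be sketched as follows. The `agent_id` argument, the stored `details` and `tags` field names, the UTC timestamp format, and raising `PermissionError` on denial are all assumptions; the document only fixes the allowed agents, the storage path, and the response shape.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Agents named in the RBAC line above.
INCIDENT_WRITERS = frozenset({"sofiia", "helion", "admin"})


def incident_log_append(
    agent_id: str,
    params: dict,
    log_path: Path = Path("ops/incidents.jsonl"),
) -> dict:
    """Hypothetical sketch: append one incident record if the agent is entitled."""
    if agent_id not in INCIDENT_WRITERS:
        raise PermissionError(f"{agent_id} may not write incidents")
    ts = datetime.now(timezone.utc).isoformat(timespec="seconds")
    record = {
        "ts": ts,
        "service": params["service_name"],
        "title": params["incident_title"],
        "severity": params["incident_severity"],
        "details": params.get("incident_details", ""),  # assumed field name
        "tags": params.get("incident_tags", []),        # assumed field name
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    # One JSON object per line keeps the file append-only and easy to tail.
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return {"incident_id": ts, "status": "logged"}
```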
## Security Features
### Health Check Allowlist
Only internal service endpoints can be checked:
- `localhost`, `127.0.0.1`
- Service names: `router-service`, `gateway-service`, `memory-service`, `swapper-service`, `crewai-service`
### Runbook Security
- Only read from allowlisted directories: `ops/`, `runbooks/`, `docs/runbooks/`, `docs/ops/`
- Path traversal blocked
- Secrets automatically masked
### RBAC
- Read actions: `tools.oncall.read` (default for all agents)
- Write incidents: `tools.oncall.incident_write` (only sofiia, helion, admin)
## Data Files
Two empty files are created for data storage:
- `ops/incidents.jsonl` - Incident log
- `ops/deployments.jsonl` - Deployment log
## Example Usage
### Check Service Health
```
"Check the health of the router service"
```
### Find Runbook
```
"Find a runbook about deployment"
```
### Read Deployment Runbook
```
"Open runbook/deploy.md"
```
### View Recent Deployments
```
"Show recent deployments"
```
### Log Incident
```
"Log an incident: router high latency, sev2"
```
## Testing
```bash
pytest tools/oncall_tool/tests/test_oncall_tool.py -v
```
Test coverage:
- services_list parses docker-compose
- runbook_search finds results
- runbook_read blocks path traversal
- runbook_read masks secrets
- incident_log_append allowed for sofiia
- incident_log_append blocked for regular agents
- service_health blocks non-allowlisted hosts