docs(platform): add policy configs, runbooks, ops scripts and platform documentation
Config policies (16 files): alert_routing, architecture_pressure, backlog, cost_weights, data_governance, incident_escalation, incident_intelligence, network_allowlist, nodes_registry, observability_sources, rbac_tools_matrix, release_gate, risk_attribution, risk_policy, slo_policy, tool_limits, tools_rollout

Ops (22 files): Caddyfile, calendar compose, grafana voice dashboard, deployments/incidents logs, runbooks for alerts/audit/backlog/incidents/sofiia/voice, cron jobs, scripts (alert_triage, audit_cleanup, migrate_*, governance, schedule), task_registry, voice alerts/ha/latency/policy

Docs (30+ files): HUMANIZED_STEPAN v2.7-v3 changelogs and runbooks, NODA1/NODA2 status and setup, audit index and traces, backlog, incident, supervisor, tools, voice, opencode, release, risk, aistalk, spacebot

Made-with: Cursor
248
docs/release/release_check.md
Normal file
@@ -0,0 +1,248 @@
# release_check: Release Gate

**A single orchestrated job that checks release readiness**
Nodes: NODA2 (dev) + NODA1 (production)

---

## What is it?

`release_check` is an internal task in the Job Orchestrator that runs all release gates sequentially and returns a single structured `pass/fail` verdict.

It replaces running each gate manually.

---

## Gates (sequential)

| # | Gate | Tool | Blocking condition |
|---|------|------|--------------------|
| 1 | **PR Review** | `pr_reviewer_tool` (mode=`blocking_only`) | blocking_count > 0 |
| 2 | **Config Lint** | `config_linter_tool` (strict=true) | blocking_count > 0 |
| 3 | **Contract Diff** | `contract_tool` (fail_on_breaking=true) | breaking_count > 0 |
| 4 | **Threat Model** | `threatmodel_tool` (risk_profile) | unmitigated_high > 0 |
| 5 | **Smoke** *(optional)* | `job_orchestrator_tool` → `smoke_gateway` | job fail |
| 6 | **Drift** *(optional)* | `job_orchestrator_tool` → `drift_check_node1` | job fail |

Gates 1–4 always run (when their input data is provided).
Gates 5–6 run only when `run_smoke=true` / `run_drift=true`.

---

## How to run

### Via job_orchestrator_tool (recommended)

```json
{
  "action": "start_task",
  "agent_id": "sofiia",
  "params": {
    "task_id": "release_check",
    "inputs": {
      "service_name": "router",
      "diff_text": "<unified diff>",
      "openapi_base": "<base OpenAPI spec>",
      "openapi_head": "<head OpenAPI spec>",
      "risk_profile": "agentic_tools",
      "fail_fast": false,
      "run_smoke": true,
      "run_drift": false
    }
  }
}
```

### Via Sofiia (OpenCode/Telegram)

```
"Run release_check for the router service with this diff: ..."
"Do a release gate check"
```

### Dry run (validation only)

```json
{
  "action": "start_task",
  "params": {
    "task_id": "release_check",
    "dry_run": true,
    "inputs": {"service_name": "router"}
  }
}
```

---

## Input parameters (inputs_schema)

| Parameter | Type | Required | Description |
|-----------|------|:--------:|-------------|
| `service_name` | string | yes | Service name |
| `diff_text` | string | no | Unified diff (git diff) |
| `openapi_base` | string | no | OpenAPI base spec (text) |
| `openapi_head` | string | no | OpenAPI head spec (text) |
| `risk_profile` | enum | no | `default` / `agentic_tools` / `public_api` (default: `default`) |
| `fail_fast` | boolean | no | Stop at the first fail (default: `false`) |
| `run_smoke` | boolean | no | Run smoke tests (default: `false`) |
| `run_drift` | boolean | no | Run drift check (default: `false`) |

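The parameters above map directly onto the `start_task` payload. As a minimal sketch (the helper `build_release_check_request` is hypothetical and not part of the repository), a caller could validate the documented constraints locally before submitting the job:

```python
from typing import Any, Optional

# Allowed values as listed in the input parameters table.
RISK_PROFILES = {"default", "agentic_tools", "public_api"}


def build_release_check_request(
    service_name: str,
    agent_id: str = "sofiia",
    diff_text: Optional[str] = None,
    openapi_base: Optional[str] = None,
    openapi_head: Optional[str] = None,
    risk_profile: str = "default",
    fail_fast: bool = False,
    run_smoke: bool = False,
    run_drift: bool = False,
    dry_run: bool = False,
) -> dict[str, Any]:
    """Build a start_task payload for release_check (hypothetical helper)."""
    if not service_name:
        raise ValueError("service_name is required")
    if risk_profile not in RISK_PROFILES:
        raise ValueError(f"unknown risk_profile: {risk_profile!r}")

    inputs: dict[str, Any] = {
        "service_name": service_name,
        "risk_profile": risk_profile,
        "fail_fast": fail_fast,
        "run_smoke": run_smoke,
        "run_drift": run_drift,
    }
    # Optional text inputs are sent only when the caller provides them.
    optional = {
        "diff_text": diff_text,
        "openapi_base": openapi_base,
        "openapi_head": openapi_head,
    }
    inputs.update({k: v for k, v in optional.items() if v is not None})

    params: dict[str, Any] = {"task_id": "release_check", "inputs": inputs}
    if dry_run:
        params["dry_run"] = True
    return {"action": "start_task", "agent_id": agent_id, "params": params}
```

Omitting `openapi_base` or `openapi_head` is deliberate here: the output example in this document shows the contract diff gate reporting `skipped` when they are not provided.
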
---

## Output format

```json
{
  "pass": true,
  "gates": [
    {
      "name": "pr_review",
      "status": "pass",
      "blocking_count": 0,
      "summary": "No blocking issues found",
      "score": 95
    },
    {
      "name": "config_lint",
      "status": "pass",
      "blocking_count": 0,
      "total_findings": 2
    },
    {
      "name": "contract_diff",
      "status": "skipped",
      "reason": "openapi_base or openapi_head not provided"
    },
    {
      "name": "threat_model",
      "status": "pass",
      "unmitigated_high": 0,
      "risk_profile": "default"
    }
  ],
  "recommendations": [],
  "summary": "✅ RELEASE CHECK PASSED in 1234ms. Gates: ['pr_review', 'config_lint', 'threat_model'].",
  "elapsed_ms": 1234.5
}
```

### Gate statuses

| Status | Meaning |
|--------|---------|
| `pass` | Gate passed |
| `fail` | Gate did not pass (blocks the release) |
| `skipped` | No input data was provided (does not block) |
| `error` | Internal gate error |

---

## Interpreting the result

### `pass: true`
All mandatory gates passed → **the release can ship**.

### `pass: false`
At least one gate has `status: fail` → **the release is blocked**.
See `gates[].status == "fail"` and `recommendations` for details.

### `status: error`
The gate could not run (internal error). This is not a `fail`, but it needs attention.

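The rules above can be applied mechanically to a report. The following is a minimal sketch, assuming the report has the shape shown in the output format section; the function name `summarize_report` and the keys of its return value are illustrative only:

```python
from typing import Any


def summarize_report(report: dict[str, Any]) -> dict[str, Any]:
    """Group gate names by status and surface what an operator should look at."""
    by_status: dict[str, list[str]] = {"pass": [], "fail": [], "skipped": [], "error": []}
    for gate in report.get("gates", []):
        status = gate.get("status", "error")
        by_status.setdefault(status, []).append(gate.get("name", "?"))

    return {
        # The top-level verdict is taken as-is from the report.
        "can_release": bool(report.get("pass")),
        # Gates that block the release.
        "failed": by_status["fail"],
        # Not a fail, but should be investigated before shipping.
        "needs_attention": by_status["error"],
        "skipped": by_status["skipped"],
        "recommendations": list(report.get("recommendations", [])),
    }
```

Keeping `error` gates in a separate bucket mirrors the distinction drawn above: they do not count as `fail`, yet they should not be silently ignored.
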
---

## Risk Profiles for Threat Model

| Profile | When to use |
|---------|-------------|
| `default` | Ordinary internal service |
| `agentic_tools` | Service with tool calls and prompt injection risks |
| `public_api` | Public API (rate limiting, WAF, auth hardening) |

---

## Required Entitlements

To run `release_check`, an agent must have:
- `tools.pr_review.gate`
- `tools.contract.gate`
- `tools.config_lint.gate`
- `tools.threatmodel.gate`

Only agents with the `agent_cto` role (sofiia, yaromir) have these entitlements.

---

## Example scenarios

### Quick PR check (no openapi, no smoke)

```json
{
  "service_name": "gateway-bot",
  "diff_text": "...",
  "fail_fast": true
}
```

### Full release pipeline for a public API

```json
{
  "service_name": "router",
  "diff_text": "...",
  "openapi_base": "...",
  "openapi_head": "...",
  "risk_profile": "public_api",
  "run_smoke": true,
  "run_drift": true
}
```

### Threat model only (no diff)

```json
{
  "service_name": "auth-service",
  "risk_profile": "agentic_tools"
}
```

---

## Internal architecture

```
job_orchestrator_tool.start_task("release_check")
  → _job_orchestrator_tool() detects runner="internal"
  → release_check_runner.run_release_check(tool_manager, inputs, agent_id)
      → Gate 1: _run_pr_review()
      → Gate 2: _run_config_lint()
      → Gate 3: _run_dependency_scan()
      → Gate 4: _run_contract_diff()
      → Gate 5: _run_threat_model()
      → [Gate 6: _run_smoke()]
      → [Gate 7: _run_drift()]
      → Gate 8: _run_followup_watch() (policy: off/warn/strict)
      → Gate 9: _run_privacy_watch() (policy: off/warn/strict)
      → Gate 10: _run_cost_watch() (always warn)
      → _build_report()
  → ToolResult(success=True, result=report)
```

Gate numbers in this trace follow the runner's execution order, so they differ from the table in "Gates (sequential)", which does not list the dependency scan or the watch gates.

Each gate calls its tool through `tool_manager.execute_tool()`.
Governance middleware (RBAC, limits, audit) is applied to every gate call.

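The control flow above (run gates in order, optionally stop at the first fail, then build one report) can be sketched as follows. This is an illustration under stated assumptions, not the code of `release_check_runner.py`: the `run_gates` function and the gate callables are hypothetical, and the sketch assumes that an `error` gate does not flip the overall verdict.

```python
from typing import Any, Callable

# A gate takes the task inputs and returns a dict with at least a "status" key.
Gate = Callable[[dict[str, Any]], dict[str, Any]]


def run_gates(gates: list[tuple[str, Gate]], inputs: dict[str, Any]) -> dict[str, Any]:
    """Run gates sequentially and aggregate their results into one report."""
    results: list[dict[str, Any]] = []
    for name, gate in gates:
        try:
            outcome = dict(gate(inputs))
        except Exception as exc:
            # A crashing gate is reported as "error" instead of aborting the run.
            outcome = {"status": "error", "reason": str(exc)}
        outcome["name"] = name
        results.append(outcome)
        # fail_fast stops the pipeline at the first failing gate.
        if outcome.get("status") == "fail" and inputs.get("fail_fast", False):
            break

    failed = [r["name"] for r in results if r.get("status") == "fail"]
    return {"pass": not failed, "gates": results}
```

With `fail_fast` left at its default of `false`, every gate still runs after a failure, which gives the operator the full list of problems in one pass.
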
---

## Files

| File | Purpose |
|------|---------|
| `ops/task_registry.yml` | Registers the `release_check` task |
| `services/router/release_check_runner.py` | Internal runner (gates logic) |
| `config/release_gate_policy.yml` | Gate strictness profiles (dev/staging/prod) |
| `config/slo_policy.yml` | SLO thresholds per service |
| `tests/test_tool_governance.py` | Tests (including release_check fixtures) |
| `tests/test_release_check_followup_watch.py` | Follow-up watch gate tests |

68
docs/release/release_gate_policy.md
Normal file
@@ -0,0 +1,68 @@
# Release Gate Policy

`config/release_gate_policy.yml` is the centralized gate strictness config for the different deployment profiles.

## Profiles

| Profile | Purpose | privacy_watch | cost_watch |
|---------|---------|---------------|------------|
| `dev` | Development | warn | warn |
| `staging` | Staging | **strict** (fail_on error) | warn |
| `prod` | Production | **strict** (fail_on error) | warn |

## Gate modes

| Mode | Behavior |
|------|----------|
| `off` | Gate is skipped entirely (not invoked, not reported) |
| `warn` | Gate always has `pass=True`; findings go to `recommendations` |
| `strict` | Gate can block the release according to its `fail_on` conditions |

## Usage

Pass `gate_profile` in the release_check inputs:

```json
{
  "gate_profile": "staging",
  "run_privacy_watch": true,
  "diff_text": "..."
}
```

## strict mode: privacy_watch

Blocks the release if there are findings whose severity is listed in `fail_on`:

```yaml
privacy_watch:
  mode: "strict"
  fail_on: ["error"]  # only error severity blocks; warning = recommendation
```

For example, `DG-SEC-001` (private key) = error → `release_check.pass = false`.
`DG-LOG-001` (sensitive logger) = warning → does not block in staging/prod.

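The interaction between a gate's mode and its `fail_on` list can be illustrated with a short sketch. The function `evaluate_gate`, its return keys, and the `message` field on a finding are assumptions made for this example; only `id`, `severity`, the three modes, and `fail_on` come from this document.

```python
from typing import Any, Optional


def evaluate_gate(
    mode: str,
    findings: list[dict[str, Any]],
    fail_on: tuple[str, ...] = ("error",),
) -> Optional[dict[str, Any]]:
    """Decide whether a watch gate blocks the release under the given mode."""
    if mode == "off":
        # The gate is not invoked and does not appear in the report.
        return None
    if mode not in ("warn", "strict"):
        raise ValueError(f"unknown gate mode: {mode!r}")

    blocking = [f["id"] for f in findings if f["severity"] in fail_on]
    strict = mode == "strict"
    return {
        # Only strict mode can block; warn mode never does.
        "blocks_release": strict and bool(blocking),
        "blocking_ids": blocking if strict else [],
        # Every finding is surfaced as a recommendation regardless of mode.
        "recommendations": [f"{f['id']}: {f.get('message', '')}".rstrip(": ") for f in findings],
    }
```

For instance, a `DG-SEC-001` finding with severity `error` blocks under `strict`, while the same finding under `warn` only yields a recommendation.
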
## cost_watch

**Always `warn`** in every profile: cost spikes never block a release (they only produce recommendations).

## Backward compatibility

If `gate_profile` is not passed, `dev` is used (warn for privacy and cost).
If `release_gate_policy.yml` is missing, all gates use `warn` (graceful fallback).

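The fallback rules above can be expressed as a small resolver. This is a sketch under assumptions: the parsed policy is taken to be a dict with a top-level `profiles` key (the real layout of `release_gate_policy.yml` is not shown in this document), `resolve_gate_policy` is a hypothetical name, and an unknown profile name is treated like a missing one.

```python
import copy
from typing import Any, Optional


def resolve_gate_policy(
    policy: Optional[dict[str, Any]],
    gate_profile: Optional[str] = None,
) -> dict[str, Any]:
    """Return per-gate settings, applying the documented fallbacks."""
    # Graceful fallback: with no policy file, every gate is warn-only.
    resolved: dict[str, Any] = {
        "privacy_watch": {"mode": "warn"},
        "cost_watch": {"mode": "warn"},
    }
    if policy is None:
        return resolved

    # A missing gate_profile means "dev".
    profile = policy.get("profiles", {}).get(gate_profile or "dev", {})
    for gate, settings in profile.items():
        resolved[gate] = copy.deepcopy(settings)

    # cost_watch is always warn, whatever the profile says.
    resolved["cost_watch"] = {"mode": "warn"}
    return resolved
```
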
## Example output for staging with an error finding

```json
{
  "pass": false,
  "gates": [
    { "name": "privacy_watch", "status": "pass", "errors": 1,
      "top_findings": [{"id": "DG-SEC-001", "severity": "error", ...}],
      "recommendations": ["Remove private key from code..."] }
  ],
  "summary": "❌ RELEASE CHECK FAILED. Failed: []. Errors: [].",
  "recommendations": ["Remove private key from code..."]
}
```

109
docs/release/sofiia-console-v1-readiness.md
Normal file
@@ -0,0 +1,109 @@
# Sofiia Console v1.0 Release Readiness Summary

One-page go/no-go artifact for the release decision on `sofiia-console`.

## 1) Scope & Version

- Service: `sofiia-console`
- Target version / tag: `v1.0` (to be assigned at release cut)
- Git SHAs:
  - sofiia-console: `e75fd33`
  - router: `<set at release window>`
  - gateway: `<set at release window>`
- Deployment target:
  - NODA1: production runtime/data plane
  - NODA2: control plane / sofiia-console
- Date prepared: `<set at release window>`
- Prepared by: `<operator>`

## 2) Production Guarantees

### Reliability

- Idempotent `POST /api/chats/{chat_id}/send` with selectable backend (`inmemory|redis`).
- Multi-node routing covered by E2E tests (NODA1/NODA2 via `infer` monkeypatch path).
- Cursor pagination hardened with tie-breakers (`(ts,id)` / stable ordering semantics).
- Release process formalized via preflight + release runbook + smoke scripts.

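To illustrate why the `(ts,id)` tie-breaker matters for cursor pagination, here is a minimal keyset-pagination sketch using the standard `sqlite3` module. The `messages` table, its columns, and the `fetch_page` function are hypothetical and are not taken from the `sofiia-console` code base.

```python
import sqlite3
from typing import Optional

Cursor = tuple[int, str]  # (ts, id) of the last row already returned


def fetch_page(
    conn: sqlite3.Connection, limit: int, cursor: Optional[Cursor] = None
) -> tuple[list[tuple], Optional[Cursor]]:
    """Return one page ordered by (ts DESC, id DESC) plus the next cursor."""
    if cursor is None:
        rows = conn.execute(
            "SELECT ts, id, body FROM messages ORDER BY ts DESC, id DESC LIMIT ?",
            (limit,),
        ).fetchall()
    else:
        ts, last_id = cursor
        # Rows with the same ts are disambiguated by id, so no row is
        # skipped or repeated when many messages share a timestamp.
        rows = conn.execute(
            "SELECT ts, id, body FROM messages "
            "WHERE ts < ? OR (ts = ? AND id < ?) "
            "ORDER BY ts DESC, id DESC LIMIT ?",
            (ts, ts, last_id, limit),
        ).fetchall()
    # A short page means there is nothing left to read.
    next_cursor = (rows[-1][0], rows[-1][1]) if len(rows) == limit else None
    return rows, next_cursor
```

A cursor on `ts` alone would either drop or duplicate rows whenever a page boundary falls inside a group of equal timestamps; the compound comparison avoids that.
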
### Security

- Rate limiting on send path:
  - per-chat scope
  - per-operator scope
- Strict `/api/audit` protection:
  - key required
  - no localhost bypass
- Structured audit trail:
  - write events for operator actions
  - cursor-based read endpoint
- Secrets rotation runbook documented and operational.

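The two rate-limit scopes listed above can be pictured with an in-memory sliding-window sketch. Everything here is an assumption for illustration (class name, limits, window algorithm); the document only states that the send path is limited per chat and per operator and that the default backend is `inmemory`.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most `limit` hits per key within the last `window_s` seconds."""

    def __init__(self, limit: int, window_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window_s = window_s
        self._clock = clock
        self._hits: dict[str, deque] = defaultdict(deque)

    def has_capacity(self, key: str) -> bool:
        now = self._clock()
        hits = self._hits[key]
        # Drop hits that have aged out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        return len(hits) < self.limit

    def record(self, key: str) -> None:
        self._hits[key].append(self._clock())


def allow_send(chat: SlidingWindowLimiter, operator: SlidingWindowLimiter,
               chat_id: str, operator_id: str) -> bool:
    """Admit a send only if both the per-chat and per-operator scopes allow it."""
    if not (chat.has_capacity(chat_id) and operator.has_capacity(operator_id)):
        return False
    # Record only admitted requests so a rejection in one scope
    # does not consume budget in the other.
    chat.record(chat_id)
    operator.record(operator_id)
    return True
```

Because this state lives in one process, separate instances would each keep their own counters, which is the multi-instance consistency limitation that a shared backend such as Redis addresses.
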
### Operational Controls

- `/metrics` exposed (including rate-limit and idempotency counters).
- Structured JSON logs for send/replay/pagination/error flows.
- Audit retention policy in place (default 90 days).
- Pruning script available (`ops/prune_audit_db.py`: dry-run + batch delete + optional vacuum).
- Release evidence auto-generator available (`ops/generate_release_evidence.sh`).

## 3) Known Limitations / Residual Risks

- Chat index is still local DB-backed; full multi-instance HA for global chat index needs Phase 6 (Redis ChatIndexStore).
- Rate-limit defaults to `inmemory`; multi-instance consistency needs `SOFIIA_RATE_LIMIT_BACKEND=redis`.
- Audit storage is SQLite (single-node storage, non-clustered by default).
- Automatic alerting/paging is not yet enabled; metric observation is primarily manual/runbook-driven.

## 4) Required Release-Day Checks

### Preflight

- `STRICT=1 bash ops/preflight_sofiia_console.sh`

### Deploy order

- NODA2 precheck
- NODA1 rollout
- NODA2 finalize

### Smoke

- `GET /api/health` -> `200`
- `/metrics` reachable
- `bash ops/redis_idempotency_smoke.sh` -> `PASS` (when redis backend is enabled)
- `/api/audit` auth:
  - without key -> `401`
  - with key -> `200`

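The HTTP portion of the smoke list above can be captured as a small checker. This is only a sketch: `run_http_smoke` is a hypothetical name, the `fetch` callable stands in for whatever HTTP client and key header the real scripts use, and treating "reachable" as a `200` response is an assumption.

```python
from typing import Callable, Optional

# fetch(path, api_key) -> HTTP status code. Inject a real client in practice;
# how the key is transmitted is left to that client and is not specified here.
Fetch = Callable[[str, Optional[str]], int]


def run_http_smoke(fetch: Fetch, api_key: str) -> dict[str, bool]:
    """Evaluate the HTTP smoke expectations and return one boolean per check."""
    return {
        "health_200": fetch("/api/health", None) == 200,
        # Assumption: "reachable" is interpreted as a 200 response.
        "metrics_reachable": fetch("/metrics", None) == 200,
        "audit_without_key_401": fetch("/api/audit", None) == 401,
        "audit_with_key_200": fetch("/api/audit", api_key) == 200,
    }
```

Passing the transport in as a callable keeps the expectations testable without a running service.
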
### Post-release

- Verify rate-limit metrics increment under controlled load.
- Verify audit write/read quick check.
- Run retention dry-run:
  - `python3 ops/prune_audit_db.py --dry-run`

## 5) Explicit Go / No-Go Criteria

**GO if all conditions hold:**

- Preflight is `PASS` (or only non-critical `WARN` accepted by operator).
- Smoke checks pass.
- No unexpected 5xx spike during first 5–10 minutes.
- Rate-limit counters and idempotency behavior are within expected range.

**NO-GO if any condition holds:**

- Strict audit auth fails (401/200 behavior broken).
- Redis idempotency A/B smoke fails.
- Audit write/read fails.
- Unexpected 500s on send path.

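The criteria above can be encoded as a simple decision function so the verdict and its reasons are recorded together. This is a sketch, not an existing tool: the `ReleaseSignals` fields and the `decide` function are hypothetical, and it assumes that any unmet GO condition is treated as NO-GO.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReleaseSignals:
    preflight: str = "PASS"                # "PASS", "WARN" or "FAIL"
    warn_accepted: bool = False            # operator accepted non-critical WARN
    smoke_passed: bool = True
    no_5xx_spike: bool = True              # first 5-10 minutes after rollout
    counters_in_range: bool = True         # rate-limit and idempotency behavior
    audit_auth_ok: bool = True             # 401 without key, 200 with key
    redis_smoke_ok: Optional[bool] = None  # None when the redis backend is off
    audit_rw_ok: bool = True
    send_path_500s: int = 0


def decide(s: ReleaseSignals) -> tuple[str, list[str]]:
    """Return ("GO" | "NO-GO", reasons) from the observed signals."""
    reasons: list[str] = []
    if not (s.preflight == "PASS" or (s.preflight == "WARN" and s.warn_accepted)):
        reasons.append(f"preflight is {s.preflight}")
    if not s.smoke_passed:
        reasons.append("smoke checks failed")
    if not s.no_5xx_spike:
        reasons.append("unexpected 5xx spike")
    if not s.counters_in_range:
        reasons.append("rate-limit/idempotency counters out of range")
    if not s.audit_auth_ok:
        reasons.append("strict audit auth broken")
    if s.redis_smoke_ok is False:
        reasons.append("redis idempotency smoke failed")
    if not s.audit_rw_ok:
        reasons.append("audit write/read failed")
    if s.send_path_500s > 0:
        reasons.append(f"{s.send_path_500s} unexpected 500s on send path")
    return ("GO" if not reasons else "NO-GO", reasons)
```
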
## 6) Rollback Readiness Statement

- Rollback method:
  - revert to previous known-good SHA/tag
  - restart affected services via docker compose/systemd as per runbook
- Estimated rollback time: `<set by operator, typically 5-15 min>`
- Mandatory post-rollback smoke:
  - `/api/health`
  - idempotency smoke
  - audit auth/read checks