microdao-daarion/docs/release/sofiia-console-v1-readiness.md


Sofiia Console v1.0 Release Readiness Summary

One-page go/no-go artifact for the sofiia-console release decision.

1) Scope & Version

  • Service: sofiia-console
  • Target version / tag: v1.0 (to be assigned at release cut)
  • Git SHAs:
    • sofiia-console: e75fd33
    • router: <set at release window>
    • gateway: <set at release window>
  • Deployment target:
    • NODA1: production runtime/data plane
    • NODA2: control plane / sofiia-console
  • Date prepared: <set at release window>
  • Prepared by: <operator>

2) Production Guarantees

Reliability

  • Idempotent POST /api/chats/{chat_id}/send with selectable backend (inmemory|redis).
  • Multi-node routing covered by E2E tests (NODA1/NODA2 via infer monkeypatch path).
  • Cursor pagination hardened with a (ts, id) tie-breaker for stable ordering semantics.
  • Release process formalized via preflight + release runbook + smoke scripts.
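
The (ts, id) tie-breaker mentioned above can be illustrated with a minimal keyset-pagination sketch. The `messages` table, its columns, and the `fetch_page` helper are hypothetical and are not taken from the sofiia-console code; the sketch only shows why comparing the pair (ts, id) keeps pages stable when several rows share a timestamp.

```python
import sqlite3
from typing import List, Optional, Tuple

Cursor = Tuple[int, int]  # (ts, id) of the last row on the previous page


def fetch_page(conn: sqlite3.Connection, limit: int,
               after: Optional[Cursor] = None):
    """Return (rows, next_cursor) ordered by (ts, id).

    Hypothetical schema: messages(id INTEGER PRIMARY KEY, ts INTEGER, body TEXT).
    """
    if after is None:
        rows = conn.execute(
            "SELECT ts, id, body FROM messages ORDER BY ts, id LIMIT ?",
            (limit,),
        ).fetchall()
    else:
        ts, row_id = after
        # Rows strictly after the cursor: a later ts, or the same ts and a larger id.
        rows = conn.execute(
            "SELECT ts, id, body FROM messages "
            "WHERE ts > ? OR (ts = ? AND id > ?) "
            "ORDER BY ts, id LIMIT ?",
            (ts, ts, row_id, limit),
        ).fetchall()
    # A full page may have a successor; a short page is the last one.
    next_cursor = (rows[-1][0], rows[-1][1]) if len(rows) == limit else None
    return rows, next_cursor


def demo() -> List[int]:
    """Page through rows that share a timestamp and collect ids in order."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE messages (id INTEGER PRIMARY KEY, ts INTEGER, body TEXT)"
    )
    conn.executemany(
        "INSERT INTO messages (id, ts, body) VALUES (?, ?, ?)",
        [(1, 100, "a"), (2, 100, "b"), (3, 100, "c"), (4, 101, "d")],
    )
    seen: List[int] = []
    cursor: Optional[Cursor] = None
    while True:
        rows, cursor = fetch_page(conn, 2, cursor)
        seen.extend(row[1] for row in rows)
        if cursor is None:
            break
    return seen
```

With a cursor on `ts` alone, the row with id 3 would be skipped or repeated at the page boundary, because it shares `ts = 100` with the last row of the first page.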

Security

  • Rate limiting on send path:
    • per-chat scope
    • per-operator scope
  • Strict /api/audit protection:
    • key required
    • no localhost bypass
  • Structured audit trail:
    • write events for operator actions
    • cursor-based read endpoint
  • Secrets rotation runbook documented and operational.
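
As an illustration of the two rate-limit scopes listed above, here is a minimal in-memory fixed-window sketch. The class and function names, the limits, and the choice of a fixed-window algorithm are assumptions for illustration; the source does not state which algorithm sofiia-console uses.

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowLimiter:
    """In-memory fixed-window counter (illustrative; old windows are never evicted)."""

    def __init__(self, limit: int, window_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self._counts: Dict[Tuple[str, int], int] = {}

    def _bucket(self, key: str) -> Tuple[str, int]:
        # All requests in the same window share one counter.
        return (key, int(self.clock() // self.window_s))

    def would_allow(self, key: str) -> bool:
        return self._counts.get(self._bucket(key), 0) < self.limit

    def record(self, key: str) -> None:
        bucket = self._bucket(key)
        self._counts[bucket] = self._counts.get(bucket, 0) + 1


def allow_send(chat_limiter: FixedWindowLimiter,
               operator_limiter: FixedWindowLimiter,
               chat_id: str, operator_id: str) -> bool:
    """Admit a send only if both the per-chat and per-operator scopes have room."""
    if not (chat_limiter.would_allow(chat_id)
            and operator_limiter.would_allow(operator_id)):
        return False
    # Count the request in both scopes only after both checks passed,
    # so a rejected request does not consume quota in either scope.
    chat_limiter.record(chat_id)
    operator_limiter.record(operator_id)
    return True
```

Because the counters live in process memory, each instance would enforce its own limits, which is the multi-instance consistency gap that a shared backend such as Redis addresses.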

Operational Controls

  • /metrics exposed (including rate-limit and idempotency counters).
  • Structured JSON logs for send/replay/pagination/error flows.
  • Audit retention policy in place (default 90 days).
  • Pruning script available (ops/prune_audit_db.py: dry-run + batch delete + optional vacuum).
  • Release evidence auto-generator available (ops/generate_release_evidence.sh).
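
The dry-run, batch-delete, and optional-vacuum behavior described for the pruning script can be sketched as follows. This is not the content of ops/prune_audit_db.py; the `audit_events` table, its `ts` column (Unix seconds), and the `prune_audit` function are hypothetical stand-ins.

```python
import sqlite3
import time
from typing import Optional


def prune_audit(conn: sqlite3.Connection, retention_days: int = 90,
                batch_size: int = 500, dry_run: bool = True,
                vacuum: bool = False, now: Optional[float] = None) -> int:
    """Delete audit rows older than the retention window.

    Hypothetical schema: audit_events(id INTEGER PRIMARY KEY, ts REAL).
    Returns the number of rows that were (or, in dry-run, would be) deleted.
    """
    current = time.time() if now is None else now
    cutoff = current - retention_days * 86400

    if dry_run:
        # Dry-run only counts candidates and changes nothing.
        (count,) = conn.execute(
            "SELECT COUNT(*) FROM audit_events WHERE ts < ?", (cutoff,)
        ).fetchone()
        return count

    deleted = 0
    while True:
        # Delete in small batches so each transaction stays short.
        cur = conn.execute(
            "DELETE FROM audit_events WHERE id IN "
            "(SELECT id FROM audit_events WHERE ts < ? LIMIT ?)",
            (cutoff, batch_size),
        )
        conn.commit()
        if cur.rowcount == 0:
            break
        deleted += cur.rowcount

    if vacuum:
        # VACUUM must run outside a transaction; the last commit guarantees that.
        conn.execute("VACUUM")
    return deleted
```

Defaulting `dry_run` to `True` is a design choice in this sketch, so that an accidental invocation reports what would be removed instead of deleting.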

3) Known Limitations / Residual Risks

  • Chat index is still local DB-backed; full multi-instance HA for global chat index needs Phase 6 (Redis ChatIndexStore).
  • Rate-limit defaults to inmemory; multi-instance consistency needs SOFIIA_RATE_LIMIT_BACKEND=redis.
  • Audit storage is SQLite (single-node storage, non-clustered by default).
  • Automatic alerting/paging is not yet enabled; metric observation is primarily manual/runbook-driven.

4) Required Release-Day Checks

Preflight

  • STRICT=1 bash ops/preflight_sofiia_console.sh

Deploy order

  • NODA2 precheck
  • NODA1 rollout
  • NODA2 finalize

Smoke

  • GET /api/health -> 200
  • /metrics reachable
  • bash ops/redis_idempotency_smoke.sh -> PASS (when redis backend is enabled)
  • /api/audit auth:
    • without key -> 401
    • with key -> 200
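
The HTTP smoke checks above could be scripted roughly as follows. This is a sketch under assumptions: the `X-API-Key` header name is hypothetical (the source does not say how the audit key is passed), and "reachable" for /metrics is interpreted here as a 200 response.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, List

# A fetcher maps (path, headers) to an HTTP status code.
Fetch = Callable[[str, Dict[str, str]], int]


def http_fetch(base_url: str) -> Fetch:
    """Build a fetcher that issues real GET requests against base_url."""
    def fetch(path: str, headers: Dict[str, str]) -> int:
        req = urllib.request.Request(base_url + path, headers=headers)
        try:
            with urllib.request.urlopen(req, timeout=5) as resp:
                return resp.status
        except urllib.error.HTTPError as exc:
            # 4xx/5xx responses are raised as exceptions; keep the status code.
            return exc.code
    return fetch


def run_smoke(fetch: Fetch, audit_key: str,
              key_header: str = "X-API-Key") -> List[str]:
    """Run the smoke checks and return a list of failure messages (empty = PASS)."""
    checks = [
        ("GET /api/health", "/api/health", {}, 200),
        ("GET /metrics", "/metrics", {}, 200),
        ("audit without key", "/api/audit", {}, 401),
        ("audit with key", "/api/audit", {key_header: audit_key}, 200),
    ]
    failures: List[str] = []
    for name, path, headers, expected in checks:
        got = fetch(path, headers)
        if got != expected:
            failures.append(f"{name}: expected {expected}, got {got}")
    return failures
```

A release-day invocation might look like `run_smoke(http_fetch(base_url), audit_key)`, where `base_url` and `audit_key` are supplied by the operator. Passing the fetcher as a parameter also lets the check logic be exercised against a stub without a running service.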

Post-release

  • Verify rate-limit metrics increment under controlled load.
  • Run a quick audit write/read check.
  • Run retention dry-run:
    • python3 ops/prune_audit_db.py --dry-run

5) Explicit Go / No-Go Criteria

GO if all conditions hold:

  • Preflight is PASS (or only non-critical WARN accepted by operator).
  • Smoke checks pass.
  • No unexpected 5xx spike during first 5-10 minutes.
  • Rate-limit counters and idempotency behavior are within expected range.

NO-GO if any condition holds:

  • Strict audit auth fails (expected 401 without key / 200 with key is broken).
  • Redis idempotency A/B smoke fails.
  • Audit write/read fails.
  • Unexpected 500s on send path.
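
The criteria above can be encoded as a small decision helper. This is an illustrative sketch: the `ReleaseEvidence` fields and the `decide` function are hypothetical, and the `HOLD` outcome (no NO-GO condition triggered, but not every GO condition met) is an assumption, since the criteria do not define that case.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ReleaseEvidence:
    preflight_pass: bool        # PASS, or only operator-accepted non-critical WARN
    smoke_pass: bool
    unexpected_5xx_spike: bool  # observed during the first minutes after rollout
    counters_in_range: bool     # rate-limit counters and idempotency behavior
    audit_auth_ok: bool         # 401 without key, 200 with key
    redis_idempotency_ok: bool  # True if the smoke passed or redis is not enabled
    audit_write_read_ok: bool
    send_path_500s: bool        # unexpected 500s on the send path


def decide(e: ReleaseEvidence) -> Tuple[str, List[str]]:
    """Return (decision, reasons) where decision is GO, NO-GO, or HOLD."""
    # Any single NO-GO condition blocks the release.
    no_go = [
        (not e.audit_auth_ok, "strict audit auth fails"),
        (not e.redis_idempotency_ok, "redis idempotency smoke fails"),
        (not e.audit_write_read_ok, "audit write/read fails"),
        (e.send_path_500s, "unexpected 500s on send path"),
    ]
    reasons = [msg for triggered, msg in no_go if triggered]
    if reasons:
        return "NO-GO", reasons

    # GO requires every GO condition to hold.
    go = [
        (e.preflight_pass, "preflight not PASS"),
        (e.smoke_pass, "smoke checks not passing"),
        (not e.unexpected_5xx_spike, "unexpected 5xx spike"),
        (e.counters_in_range, "counters outside expected range"),
    ]
    missing = [msg for ok, msg in go if not ok]
    if missing:
        # Assumption: treat this as needing operator review, not an automatic GO.
        return "HOLD", missing
    return "GO", []
```

Returning the reasons alongside the decision keeps the output usable as release evidence, since the operator can record exactly which condition blocked the release.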

6) Rollback Readiness Statement

  • Rollback method:
    • revert to previous known-good SHA/tag
    • restart affected services via docker compose/systemd as per runbook
  • Estimated rollback time: <set by operator, typically 5-15 min>
  • Mandatory post-rollback smoke:
    • /api/health
    • idempotency smoke
    • audit auth/read checks