ops(ci): add phase6 smoke automation and CI workflows

2026-03-05 09:19:20 -08:00
parent 4d6e73f352
commit e6e705a38b
6 changed files with 837 additions and 1 deletions
--- a/docs/ops/ci_smoke.md
+++ b/docs/ops/ci_smoke.md
@@ -0,0 +1,59 @@
+# CI Smoke: Phase-6
+
+## Workflows
+- `.github/workflows/phase6-smoke.yml`
+  - `workflow_dispatch`: manual smoke run on NODA1 via SSH.
+  - `workflow_call`: reusable smoke step for deploy workflows (recommended for hard gate).
+  - `workflow_run`: auto-run after successful deploy workflows:
+    - `Deploy Node1`
+    - `deploy-node1`
+    - `deploy-node1-runtime`
+- `.gitea/workflows/phase6-smoke.yml`
+  - `workflow_dispatch`: manual smoke run for Gitea Actions.
+  - `workflow_run`: auto-run after deploy workflows in Gitea.
+
+## Required Secrets
+- `NODA1_SSH_HOST`
+- `NODA1_SSH_USER` (optional, defaults to `root` if empty)
+- `NODA1_SSH_KEY`
+
+## Manual Run
+1. Open Actions (`GitHub` or `Gitea`) -> `phase6-smoke`.
+2. Click `Run workflow` / `Run`.
+3. Optionally override `ssh_host` and `ssh_user`.
+4. Run and wait for the `phase6-smoke` job result.
+
+## On-Deploy Run
+- Triggered automatically only when a configured deploy workflow finishes with `success`.
+- The job retries once on transient failures (SSH/network hiccups).
+- If the smoke run still fails, the workflow is marked failed.
+- For strict deploy gating in the same pipeline, call this workflow via `workflow_call` from the deploy workflow and make the smoke job depend on the deploy job with `needs`.
+  In Gitea, use a same-workflow `needs` gate (or `workflow_run` triggered by the deploy workflow) because `workflow_call` support depends on the runner/version.
+
+Example (`.github/workflows/deploy-node1.yml`):
+```yaml
+jobs:
+  deploy:
+    runs-on: ubuntu-latest
+    steps:
+      - run: echo "deploy..."
+
+  smoke:
+    needs: [deploy]
+    uses: ./.github/workflows/phase6-smoke.yml
+    secrets:
+      NODA1_SSH_HOST: ${{ secrets.NODA1_SSH_HOST }}
+      NODA1_SSH_USER: ${{ secrets.NODA1_SSH_USER }}
+      NODA1_SSH_KEY: ${{ secrets.NODA1_SSH_KEY }}
+```
+
+## Artifacts
+- `phase6-smoke-logs` artifact includes:
+  - `phase6-smoke.log`
+  - per-attempt logs (`phase6-smoke-attempt1.log`, plus `phase6-smoke-attempt2.log` when a retry occurred)
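+
+The retry-once behaviour and the per-attempt log names can be pictured with a minimal Python sketch. It is not the workflow's actual implementation: `run_with_retry` and its parameters are hypothetical, and the real job runs the smoke over SSH rather than as a local command.
+
+```python
+import subprocess
+import sys
+from pathlib import Path
+
+
+def run_with_retry(cmd, log_dir, max_attempts=2):
+    """Run cmd up to max_attempts times, writing one log file per attempt.
+
+    Returns True as soon as an attempt exits with status 0, else False.
+    """
+    log_dir = Path(log_dir)
+    log_dir.mkdir(parents=True, exist_ok=True)
+    for attempt in range(1, max_attempts + 1):
+        # Log names follow the artifact naming listed above.
+        log_path = log_dir / f"phase6-smoke-attempt{attempt}.log"
+        with log_path.open("w") as log:
+            result = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
+        if result.returncode == 0:
+            return True
+    return False
+
+
+# Hypothetical usage:
+#   ok = run_with_retry(["make", "phase6-smoke"], log_dir="logs")
+#   sys.exit(0 if ok else 1)
+```
+
+With `max_attempts=2`, a second log file exists only when the first attempt failed, which matches the artifact listing above.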
+
+## Troubleshooting
+- `Missing SSH host`: add the `NODA1_SSH_HOST` secret or pass the `ssh_host` input.
+- `Missing secret NODA1_SSH_KEY`: add the deploy key as the `NODA1_SSH_KEY` secret.
+- SSH host key issues: the workflow uses `StrictHostKeyChecking=accept-new`, which accepts a previously unknown host but rejects a changed key; if the host key changed, replace the stale known-hosts entry and retry.
+- Remote smoke failure: open the artifact logs and check the state of `/opt/microdao-daarion` on NODA1.
--- a/docs/ops/phase6_anti_silent_tuning.md
+++ b/docs/ops/phase6_anti_silent_tuning.md
@@ -0,0 +1,209 @@
+# Phase-6 Anti-Silent Lessons Tuning (Closed-loop)
+
+## Goal
+Tune anti-silent ACK template selection using evidence from gateway events (`reason + chat_type + template_id + user_signal`) without unsafe auto-mutation.
+
+## Safety Invariants
+- No automatic global policy rewrite.
+- Gateway applies tuning only when `ANTI_SILENT_TUNING_ENABLED=true`.
+- Learner emits tuning lessons only when thresholds are met (`MIN_EVIDENCE`, `MIN_SCORE`).
+- Lessons have TTL (`expires_at`) for rollback-by-expiry.
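+
+Rollback-by-expiry means a consumer only has to compare `expires_at` with the current time. The sketch below is illustrative, not the gateway's code: `is_lesson_active` is a hypothetical helper, and it assumes `expires_at` is stored as an ISO-8601 string (naive values are treated as UTC).
+
+```python
+from datetime import datetime, timezone
+
+
+def is_lesson_active(lesson, now=None):
+    """Return True while the lesson's expires_at is still in the future.
+
+    A missing or unparseable expires_at counts as expired, so a malformed
+    lesson is never applied.
+    """
+    now = now or datetime.now(timezone.utc)
+    raw = lesson.get("expires_at")
+    if not isinstance(raw, str) or not raw:
+        return False
+    try:
+        # Accept a trailing "Z" on Python versions that do not parse it.
+        expires_at = datetime.fromisoformat(raw.replace("Z", "+00:00"))
+    except ValueError:
+        return False
+    if expires_at.tzinfo is None:
+        expires_at = expires_at.replace(tzinfo=timezone.utc)  # assumption: UTC
+    return now < expires_at
+```
+
+Treating bad data as expired keeps the failure mode conservative: the gateway falls back to its default template instead of applying a lesson it cannot validate.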
+
+## Components
+- `services/experience-learner/main.py`
+  - emits `lesson_type=anti_silent_tuning`
+  - computes score: `1 - (w_retry*retry_rate + w_negative*negative_rate + w_suppressed*suppressed_rate)`
+- `gateway-bot/http_api.py`
+  - applies tuning in anti-silent template resolver under feature flag
+  - fail-open on DB/lookup errors
+- `gateway-bot/gateway_experience_bus.py`
+  - DB lookup of active tuning lesson by trigger (`reason=<...>;chat_type=<...>`)
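+
+The score formula and the threshold gate can be sketched as follows. This is a minimal illustration rather than the code in `services/experience-learner/main.py`: `TuningConfig`, `tuning_score`, and `should_emit_lesson` are hypothetical names, the defaults are copied from the learner environment values in this document, and the inclusive `>=` comparisons are an assumption.
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass(frozen=True)
+class TuningConfig:
+    # Defaults copied from the documented learner environment values.
+    min_evidence: int = 20
+    min_score: float = 0.75
+    w_retry: float = 0.6
+    w_negative: float = 0.3
+    w_suppressed: float = 0.1
+
+
+def tuning_score(retry_rate, negative_rate, suppressed_rate, cfg=TuningConfig()):
+    """score = 1 - (w_retry*retry + w_negative*negative + w_suppressed*suppressed)."""
+    penalty = (cfg.w_retry * retry_rate
+               + cfg.w_negative * negative_rate
+               + cfg.w_suppressed * suppressed_rate)
+    return 1.0 - penalty
+
+
+def should_emit_lesson(evidence, score, cfg=TuningConfig()):
+    """Emit only when both thresholds pass (inclusive comparison is assumed)."""
+    return evidence >= cfg.min_evidence and score >= cfg.min_score
+
+
+# Example: 20% retries and 10% negative feedback over 25 events.
+score = tuning_score(retry_rate=0.2, negative_rate=0.1, suppressed_rate=0.0)
+emit = should_emit_lesson(evidence=25, score=score)
+```
+
+Because the default weights sum to 1.0 and each rate lies in [0, 1], the score stays in [0, 1], and retries dominate the penalty.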
+
+## Environment
+
+### Learner
+- `ANTI_SILENT_TUNING_ENABLED=true`
+- `ANTI_SILENT_TUNING_WINDOW_DAYS=7`
+- `ANTI_SILENT_TUNING_MIN_EVIDENCE=20`
+- `ANTI_SILENT_TUNING_MIN_SCORE=0.75`
+- `ANTI_SILENT_TUNING_WEIGHT_RETRY=0.6`
+- `ANTI_SILENT_TUNING_WEIGHT_NEGATIVE=0.3`
+- `ANTI_SILENT_TUNING_WEIGHT_SUPPRESSED=0.1`
+- `ANTI_SILENT_TUNING_TTL_DAYS=7`
+
+### Gateway
+- `ANTI_SILENT_TUNING_ENABLED=false` (default; enable only after the smoke passes)
+- `ANTI_SILENT_TUNING_DB_TIMEOUT_MS=40`
+- `ANTI_SILENT_TUNING_CACHE_TTL_SECONDS=60`
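+
+How the feature flag, the cache TTL, and the fail-open rule interact can be sketched as below. This is an illustration under assumptions, not the resolver in `gateway-bot/http_api.py`: `TuningResolver` is hypothetical, and `lookup` stands in for the DB query (a timeout such as `ANTI_SILENT_TUNING_DB_TIMEOUT_MS` is assumed to surface as an exception).
+
+```python
+import time
+
+
+class TuningResolver:
+    """Fail-open template resolver with a small in-memory TTL cache.
+
+    lookup is a hypothetical callable: trigger -> template id or None.
+    """
+
+    def __init__(self, lookup, enabled=False, cache_ttl_seconds=60,
+                 clock=time.monotonic):
+        self.lookup = lookup
+        self.enabled = enabled
+        self.cache_ttl_seconds = cache_ttl_seconds
+        self.clock = clock
+        self._cache = {}  # trigger -> (cache_expiry, template id or None)
+
+    def resolve(self, reason, chat_type, default_template):
+        """Return (template_id, tuning_applied)."""
+        if not self.enabled:
+            return default_template, False
+        trigger = f"reason={reason};chat_type={chat_type}"
+        now = self.clock()
+        cached = self._cache.get(trigger)
+        if cached is not None and cached[0] > now:
+            preferred = cached[1]
+        else:
+            try:
+                preferred = self.lookup(trigger)
+            except Exception:
+                # Fail open: a tuning lookup must never break the webhook path.
+                return default_template, False
+            self._cache[trigger] = (now + self.cache_ttl_seconds, preferred)
+        if preferred is None:
+            return default_template, False
+        return preferred, True
+```
+
+Caching the "no lesson" result as well keeps the DB out of the hot path for triggers that have no active lesson; failed lookups are deliberately not cached, so the next request retries.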
+
+## Deploy
+```bash
+cd /opt/microdao-daarion
+docker compose -f docker-compose.node1.yml up -d --no-deps --build --force-recreate experience-learner gateway
+```
+
+## Single-Command Smoke (Phase-6.1)
+```bash
+make phase6-smoke
+```
+
+The command runs:
+1. deterministic seed events
+2. learner lesson generation assertion
+3. gateway apply assertion (flag ON)
+4. gateway fallback assertion (flag OFF)
+5. seed cleanup (events + lessons)
+
+Use `PHASE6_CLEANUP=0 make phase6-smoke` to skip cleanup and keep the seeded events and lessons for debugging.
+
+## CI Integration
+- Workflow: `.github/workflows/phase6-smoke.yml`
+- Modes:
+  - `workflow_dispatch` (manual)
+  - `workflow_call` (reusable gate called from deploy workflows)
+  - `workflow_run` after a successful deploy workflow
+- Operations guide: `docs/ops/ci_smoke.md`
+
+## Fixed Smoke (Deterministic)
+
+### 1) Temporary smoke thresholds
+Use low thresholds only for smoke:
+- `ANTI_SILENT_TUNING_MIN_EVIDENCE=3`
+- `ANTI_SILENT_TUNING_MIN_SCORE=0.5`
+- `ANTI_SILENT_TUNING_WINDOW_DAYS=1`
+
+### 2) Seed synthetic gateway events
+```bash
+export PG_CONTAINER='dagi-postgres'
+
+docker exec "$PG_CONTAINER" psql -U daarion -d daarion_memory -c "
+WITH seed AS (
+  SELECT
+    (
+      substr(md5(random()::text || clock_timestamp()::text), 1, 8) || '-' ||
+      substr(md5(random()::text || clock_timestamp()::text), 9, 4) || '-' ||
+      substr(md5(random()::text || clock_timestamp()::text), 13, 4) || '-' ||
+      substr(md5(random()::text || clock_timestamp()::text), 17, 4) || '-' ||
+      substr(md5(random()::text || clock_timestamp()::text), 21, 12)
+    )::uuid AS event_id,
+    now() - (g * interval '1 minute') AS ts,
+    CASE WHEN g <= 3 THEN 'UNSUPPORTED_INPUT' ELSE 'SILENT_POLICY' END AS template_id,
+    CASE WHEN g <= 3 THEN 'retry' ELSE 'none' END AS user_signal
+  FROM generate_series(1,6) g
+)
+INSERT INTO agent_experience_events (
+  event_id, ts, node_id, source, agent_id, task_type, request_id,
+  channel, inputs_hash, provider, model, profile, latency_ms,
+  tokens_in, tokens_out, ok, error_class, error_msg_redacted,
+  http_status, raw
+)
+SELECT
+  event_id,
+  ts,
+  'NODA1',
+  'gateway',
+  'agromatrix',
+  'webhook',
+  'phase6-seed-' || event_id::text,
+  'telegram',
+  md5(event_id::text),
+  'gateway',
+  'gateway',
+  NULL,
+  25,
+  NULL,
+  NULL,
+  true,
+  NULL,
+  NULL,
+  200,
+  jsonb_build_object(
+    'event_id', event_id::text,
+    'ts', to_char(ts AT TIME ZONE 'UTC', 'YYYY-MM-DD\"T\"HH24:MI:SS\"Z\"'),
+    'source', 'gateway',
+    'agent_id', 'agromatrix',
+    'chat_type', 'group',
+    'anti_silent_action', 'ACK_EMITTED',
+    'anti_silent_template', template_id,
+    'policy', jsonb_build_object('sowa_decision', 'SILENT', 'reason', 'unsupported_no_message'),
+    'feedback', jsonb_build_object('user_signal', user_signal),
+    'result', jsonb_build_object('ok', true, 'http_status', 200)
+  )
+FROM seed;
+"
+```
+
+### 3) Trigger learner evaluation with one real event
+```bash
+export GATEWAY_WEBHOOK_URL='http://127.0.0.1:9300/agromatrix/telegram/webhook'
+
+curl -sS -X POST "$GATEWAY_WEBHOOK_URL" \
+  -H 'content-type: application/json' \
+  -d @docs/ops/payloads/phase5_payload_group_unsupported_no_message.json
+```
+
+### 4) Verify tuning lesson exists
+```bash
+docker exec "$PG_CONTAINER" psql -U daarion -d daarion_memory -P pager=off -c "
+SELECT ts,
+       trigger,
+       action,
+       raw->>'lesson_type' AS lesson_type,
+       raw->>'expires_at' AS expires_at,
+       evidence
+FROM agent_lessons
+WHERE raw->>'lesson_type'='anti_silent_tuning'
+ORDER BY ts DESC
+LIMIT 5;
+"
+```
+Expected: lesson with
+- `trigger=reason=unsupported_no_message;chat_type=group`
+- `action=prefer_template=SILENT_POLICY`
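+
+Why the seed should yield `prefer_template=SILENT_POLICY` can be checked by hand. The sketch below replays the six seeded rows under stated assumptions: evidence and rates are aggregated per template, `retry` and `negative` are read from `user_signal`, and the suppressed term is zero because the seed sets no such signal. `pick_template` is a hypothetical helper, not the learner's code.
+
+```python
+from collections import defaultdict
+
+# Mirrors the seed query: g = 1..3 -> UNSUPPORTED_INPUT + retry,
+# g = 4..6 -> SILENT_POLICY + none.
+SEED_EVENTS = (
+    [{"template_id": "UNSUPPORTED_INPUT", "user_signal": "retry"}] * 3
+    + [{"template_id": "SILENT_POLICY", "user_signal": "none"}] * 3
+)
+
+MIN_EVIDENCE, MIN_SCORE = 3, 0.5  # temporary smoke thresholds
+W_RETRY, W_NEGATIVE = 0.6, 0.3    # suppressed term omitted: zero for the seed
+
+
+def pick_template(events):
+    """Return (template_id, score) for the best template passing both thresholds."""
+    stats = defaultdict(lambda: {"n": 0, "retry": 0, "negative": 0})
+    for event in events:
+        s = stats[event["template_id"]]
+        s["n"] += 1
+        if event["user_signal"] in ("retry", "negative"):
+            s[event["user_signal"]] += 1
+    best = None
+    for template_id, s in stats.items():
+        score = 1.0 - (W_RETRY * s["retry"] + W_NEGATIVE * s["negative"]) / s["n"]
+        if s["n"] >= MIN_EVIDENCE and score >= MIN_SCORE:
+            if best is None or score > best[1]:
+                best = (template_id, score)
+    return best
+
+
+best = pick_template(SEED_EVENTS)
+action = f"prefer_template={best[0]}" if best else None
+```
+
+Under these assumptions `UNSUPPORTED_INPUT` scores about 0.4 (all three events were retried) and falls below `MIN_SCORE=0.5`, while `SILENT_POLICY` scores 1.0 with exactly three events, so it is the only candidate that passes both smoke thresholds.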
+
+### 5) Enable gateway tuning and verify apply
+```bash
+# set ANTI_SILENT_TUNING_ENABLED=true for gateway and restart container
+# then replay unsupported payload:
+
+curl -sS -X POST "$GATEWAY_WEBHOOK_URL" \
+  -H 'content-type: application/json' \
+  -d @docs/ops/payloads/phase5_payload_group_unsupported_no_message.json
+
+# verify latest gateway events
+
+docker exec "$PG_CONTAINER" psql -U daarion -d daarion_memory -P pager=off -c "
+SELECT ts,
+       raw->'policy'->>'reason' AS reason,
+       raw->>'chat_type' AS chat_type,
+       raw->>'anti_silent_template' AS template_id,
+       raw->>'anti_silent_tuning_applied' AS tuning_applied,
+       raw->>'anti_silent_action' AS anti_action
+FROM agent_experience_events
+WHERE source='gateway'
+  AND raw->'policy'->>'reason'='unsupported_no_message'
+ORDER BY ts DESC
+LIMIT 10;
+"
+```
+Expected: newest rows have
+- `template_id=SILENT_POLICY`
+- `tuning_applied=true`
+
+## PASS
+- Tuning lesson created only when evidence/score thresholds pass.
+- Gateway does not change template when feature flag is off.
+- With feature flag on, gateway applies active non-expired tuning lesson.
+- Expired lessons are ignored.
+
+## FAIL
+- Tuning lesson appears below evidence/score threshold.
+- Gateway changes template while feature flag is off.
+- Gateway applies expired lesson.
+- Webhook path fails when tuning lookup fails (must stay fail-open).
+
+## Manual Cleanup Query
+```sql
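+DELETE FROM agent_lessons
+WHERE raw->>'lesson_type'='anti_silent_tuning'
+  AND raw->>'seed_test'='true';
+-- Seeded events carry request_id 'phase6-seed-<event_id>' (see the seed step).
+DELETE FROM agent_experience_events
+WHERE request_id LIKE 'phase6-seed-%';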
+```