ops(ci): add phase6 smoke automation and CI workflows
docs/ops/ci_smoke.md
@@ -0,0 +1,59 @@
# CI Smoke: Phase-6

## Workflows
- `.github/workflows/phase6-smoke.yml`
  - `workflow_dispatch`: manual smoke run on NODA1 via SSH.
  - `workflow_call`: reusable smoke step for deploy workflows (recommended for hard gate).
  - `workflow_run`: auto-run after successful deploy workflows:
    - `Deploy Node1`
    - `deploy-node1`
    - `deploy-node1-runtime`
- `.gitea/workflows/phase6-smoke.yml`
  - `workflow_dispatch`: manual smoke run for Gitea Actions.
  - `workflow_run`: auto-run after deploy workflows in Gitea.

## Required Secrets
- `NODA1_SSH_HOST`
- `NODA1_SSH_USER` (optional, defaults to `root` if empty)
- `NODA1_SSH_KEY`

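The way these secrets combine with the optional `ssh_host` / `ssh_user` inputs can be sketched as follows. This is an illustration only: `resolve_ssh_target` is a hypothetical helper, not code from the workflow, and the input-over-secret precedence is an assumption inferred from the word "override" in the Manual Run steps.

```python
from typing import Mapping, Tuple


def resolve_ssh_target(
    inputs: Mapping[str, str], secrets: Mapping[str, str]
) -> Tuple[str, str]:
    """Return (host, user). Hypothetical helper, not part of the workflow."""
    # Assumption: a non-empty manual input wins over the stored secret.
    host = (inputs.get("ssh_host") or secrets.get("NODA1_SSH_HOST") or "").strip()
    user = (inputs.get("ssh_user") or secrets.get("NODA1_SSH_USER") or "").strip()
    if not host:
        raise ValueError("Missing SSH host")
    if not secrets.get("NODA1_SSH_KEY"):
        raise ValueError("Missing secret NODA1_SSH_KEY")
    # NODA1_SSH_USER is optional and defaults to root when empty.
    return host, user or "root"
```

The two error messages mirror the ones listed under Troubleshooting, so a failed resolution maps directly to a documented fix.
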
## Manual Run
1. Open Actions (`GitHub` or `Gitea`) -> `phase6-smoke`.
2. Click `Run workflow` / `Run`.
3. Optionally override `ssh_host` and `ssh_user`.
4. Run and wait for the `phase6-smoke` job result.

## On-Deploy Run
- Triggered automatically only when a configured deploy workflow finishes with `success`.
- The job retries once on transient failures (SSH/network hiccups).
- If the smoke run still fails, the workflow is marked failed.
- For strict deploy gating in the same pipeline, call this workflow via `workflow_call` from the deploy workflow and set `needs`.
  In Gitea, use a same-workflow `needs` gate (or `workflow_run` from the deploy workflow), because `workflow_call` support depends on the runner/version.

Example (`.github/workflows/deploy-node1.yml`):
```yaml
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - run: echo "deploy..."

  smoke:
    needs: [deploy]
    uses: ./.github/workflows/phase6-smoke.yml
    secrets:
      NODA1_SSH_HOST: ${{ secrets.NODA1_SSH_HOST }}
      NODA1_SSH_USER: ${{ secrets.NODA1_SSH_USER }}
      NODA1_SSH_KEY: ${{ secrets.NODA1_SSH_KEY }}
```

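The retry behaviour listed earlier in this section (one retry, then fail) amounts to the control flow below. It is a minimal sketch, not the workflow's actual script: `run_with_one_retry` and the `attempt` callback are hypothetical names, and the delay between attempts is an assumption.

```python
import time
from typing import Callable, List


def run_with_one_retry(
    attempt: Callable[[int], bool], delay_seconds: float = 0.0
) -> List[bool]:
    """Run `attempt` at most twice and return the outcome of each attempt made."""
    outcomes: List[bool] = []
    for number in (1, 2):
        ok = attempt(number)  # e.g. run the remote smoke and write an attempt log
        outcomes.append(ok)
        if ok:
            break
        if number == 1:
            time.sleep(delay_seconds)  # assumed pause before the single retry
    return outcomes
```

The run counts as failed only when the last recorded outcome is `False`; the length of the returned list says whether a second attempt happened.
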
## Artifacts
- `phase6-smoke-logs` artifact includes:
  - `phase6-smoke.log`
  - per-attempt logs (`phase6-smoke-attempt1.log`, plus `phase6-smoke-attempt2.log` when a retry happened)

## Troubleshooting
- `Missing SSH host`: add the `NODA1_SSH_HOST` secret or pass the `ssh_host` input.
- `Missing secret NODA1_SSH_KEY`: add the deploy key secret.
- SSH host key issues: the workflow uses `StrictHostKeyChecking=accept-new`; if the host key changed, remove the stale known-hosts entry (for example with `ssh-keygen -R <host>`) and retry.
- Remote smoke failure: open the artifact logs and check the state of `/opt/microdao-daarion` on NODA1.
docs/ops/phase6_anti_silent_tuning.md
@@ -0,0 +1,209 @@
# Phase-6 Anti-Silent Lessons Tuning (Closed-loop)

## Goal
Tune anti-silent ACK template selection using evidence from gateway events (`reason + chat_type + template_id + user_signal`) without unsafe auto-mutation.

## Safety Invariants
- No automatic global policy rewrite.
- Gateway applies tuning only when `ANTI_SILENT_TUNING_ENABLED=true`.
- Learner emits tuning lessons only when the thresholds are met (`ANTI_SILENT_TUNING_MIN_EVIDENCE`, `ANTI_SILENT_TUNING_MIN_SCORE`).
- Lessons have a TTL (`expires_at`) for rollback-by-expiry.

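Rollback-by-expiry means a lesson simply stops being applied once `expires_at` has passed. A minimal sketch of such a check is shown below. It assumes `expires_at` is stored as an ISO-8601 string and that a lesson without an expiry is treated as inactive; `is_lesson_active` is a hypothetical name, not a function from the gateway.

```python
from datetime import datetime, timezone
from typing import Optional


def is_lesson_active(expires_at: Optional[str], now: Optional[datetime] = None) -> bool:
    """Return True while the lesson has not yet expired."""
    if not expires_at:
        return False  # assumption: a lesson without a TTL is never applied
    now = now or datetime.now(timezone.utc)
    # Accept a trailing "Z" as UTC, which older fromisoformat versions reject.
    expiry = datetime.fromisoformat(expires_at.replace("Z", "+00:00"))
    if expiry.tzinfo is None:
        expiry = expiry.replace(tzinfo=timezone.utc)  # assume naive values are UTC
    return now < expiry
```
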
## Components
- `services/experience-learner/main.py`
  - emits `lesson_type=anti_silent_tuning`
  - computes score: `1 - (w_retry*retry_rate + w_negative*negative_rate + w_suppressed*suppressed_rate)`
- `gateway-bot/http_api.py`
  - applies tuning in anti-silent template resolver under feature flag
  - fail-open on DB/lookup errors
- `gateway-bot/gateway_experience_bus.py`
  - DB lookup of active tuning lesson by trigger (`reason=<...>;chat_type=<...>`)

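The score formula above can be written out directly. The sketch below is not the learner's code: `tuning_score` is a hypothetical name, and the default weights are taken from the `ANTI_SILENT_TUNING_WEIGHT_*` values listed under Environment. Each rate is assumed to be a fraction in `[0, 1]`.

```python
def tuning_score(
    retry_rate: float,
    negative_rate: float,
    suppressed_rate: float,
    w_retry: float = 0.6,
    w_negative: float = 0.3,
    w_suppressed: float = 0.1,
) -> float:
    """Score = 1 - weighted penalty; higher means the template worked better."""
    penalty = (
        w_retry * retry_rate
        + w_negative * negative_rate
        + w_suppressed * suppressed_rate
    )
    return 1.0 - penalty
```

With these weights, a template whose every ACK is followed by a retry scores about 0.4, and one with no retries, negative signals, or suppressions scores 1.0.
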
## Environment

### Learner
- `ANTI_SILENT_TUNING_ENABLED=true`
- `ANTI_SILENT_TUNING_WINDOW_DAYS=7`
- `ANTI_SILENT_TUNING_MIN_EVIDENCE=20`
- `ANTI_SILENT_TUNING_MIN_SCORE=0.75`
- `ANTI_SILENT_TUNING_WEIGHT_RETRY=0.6`
- `ANTI_SILENT_TUNING_WEIGHT_NEGATIVE=0.3`
- `ANTI_SILENT_TUNING_WEIGHT_SUPPRESSED=0.1`
- `ANTI_SILENT_TUNING_TTL_DAYS=7`

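How the learner settings could gate lesson emission is sketched below. This is an assumption-laden illustration: `LearnerTuningConfig`, `load_config`, and `should_emit_lesson` are hypothetical names, the listed values are used as fallback defaults purely for the example, and the inclusive `>=` comparisons are a guess at the threshold semantics.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LearnerTuningConfig:
    enabled: bool
    window_days: int
    min_evidence: int
    min_score: float
    ttl_days: int


def load_config(env: Mapping[str, str] = os.environ) -> LearnerTuningConfig:
    """Read ANTI_SILENT_TUNING_* variables, falling back to the documented values."""
    prefix = "ANTI_SILENT_TUNING_"
    return LearnerTuningConfig(
        enabled=env.get(prefix + "ENABLED", "true").strip().lower() == "true",
        window_days=int(env.get(prefix + "WINDOW_DAYS", "7")),
        min_evidence=int(env.get(prefix + "MIN_EVIDENCE", "20")),
        min_score=float(env.get(prefix + "MIN_SCORE", "0.75")),
        ttl_days=int(env.get(prefix + "TTL_DAYS", "7")),
    )


def should_emit_lesson(cfg: LearnerTuningConfig, evidence: int, score: float) -> bool:
    """Emit only when enabled and both thresholds are met (assumed inclusive)."""
    return cfg.enabled and evidence >= cfg.min_evidence and score >= cfg.min_score
```
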
### Gateway
- `ANTI_SILENT_TUNING_ENABLED=false` (default; turn on only after smoke)
- `ANTI_SILENT_TUNING_DB_TIMEOUT_MS=40`
- `ANTI_SILENT_TUNING_CACHE_TTL_SECONDS=60`

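The gateway settings suggest a short-timeout lookup with a small cache that falls back to the default template on any error. The sketch below shows one way that could look. It is not the gateway's implementation: `TuningLookup` and the `fetch` callback (standing in for the DB query) are hypothetical, and not caching failures is a design assumption.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TuningLookup:
    """Fail-open, TTL-cached lookup of a preferred template per trigger."""

    def __init__(
        self,
        fetch: Callable[[str, float], Optional[str]],  # (trigger, timeout_s) -> template
        timeout_ms: int = 40,
        cache_ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._timeout_s = timeout_ms / 1000.0
        self._ttl = cache_ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[float, Optional[str]]] = {}

    def preferred_template(self, trigger: str, default: str) -> str:
        now = self._clock()
        cached = self._cache.get(trigger)
        if cached is not None and now - cached[0] < self._ttl:
            template = cached[1]
        else:
            try:
                template = self._fetch(trigger, self._timeout_s)
            except Exception:
                return default  # fail-open: never break the webhook path
            self._cache[trigger] = (now, template)
        return template or default
```

Caching a "no lesson" result as well keeps the hot path from hitting the database on every webhook when nothing is tuned for that trigger.
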
## Deploy
```bash
cd /opt/microdao-daarion
docker compose -f docker-compose.node1.yml up -d --no-deps --build --force-recreate experience-learner gateway
```

## Single-Command Smoke (Phase-6.1)
```bash
make phase6-smoke
```

The command runs:
1. deterministic seed events
2. learner lesson generation assertion
3. gateway apply assertion (flag ON)
4. gateway fallback assertion (flag OFF)
5. seed cleanup (events + lessons)

Use `PHASE6_CLEANUP=0 make phase6-smoke` to keep artifacts for debugging.

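The step order and the `PHASE6_CLEANUP` switch can be pictured as a small driver. This is a sketch under assumptions, not the `make` target's script: `run_smoke` is a hypothetical name, and running cleanup even after a failed assertion is an assumption about the intended behaviour.

```python
import os
from typing import Callable, Mapping, Sequence


def run_smoke(
    steps: Sequence[Callable[[], None]],
    cleanup: Callable[[], None],
    env: Mapping[str, str] = os.environ,
) -> None:
    """Run the smoke steps in order; clean up unless PHASE6_CLEANUP=0."""
    try:
        for step in steps:
            step()  # each step raises (e.g. AssertionError) on failure
    finally:
        # Assumption: seeded events and lessons are removed even on failure,
        # unless the operator asked to keep them for debugging.
        if env.get("PHASE6_CLEANUP", "1") != "0":
            cleanup()
```
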
## CI Integration
- Workflow: `.github/workflows/phase6-smoke.yml`
- Modes:
  - `workflow_dispatch` (manual)
  - `workflow_call` (reusable step for deploy workflows)
  - `workflow_run` after successful deploy workflow
- Operations guide: `docs/ops/ci_smoke.md`

## Fixed Smoke (Deterministic)

### 1) Temporary smoke thresholds
Use low thresholds only for smoke:
- `ANTI_SILENT_TUNING_MIN_EVIDENCE=3`
- `ANTI_SILENT_TUNING_MIN_SCORE=0.5`
- `ANTI_SILENT_TUNING_WINDOW_DAYS=1`

### 2) Seed synthetic gateway events
```bash
export PG_CONTAINER='dagi-postgres'

# Note: the SQL is passed inside a double-quoted shell string, so the literal
# double quotes in the to_char() format must be escaped as \"T\" and \"Z\".
docker exec "$PG_CONTAINER" psql -U daarion -d daarion_memory -c "
WITH seed AS (
  SELECT
    (
      substr(md5(random()::text || clock_timestamp()::text), 1, 8) || '-' ||
      substr(md5(random()::text || clock_timestamp()::text), 9, 4) || '-' ||
      substr(md5(random()::text || clock_timestamp()::text), 13, 4) || '-' ||
      substr(md5(random()::text || clock_timestamp()::text), 17, 4) || '-' ||
      substr(md5(random()::text || clock_timestamp()::text), 21, 12)
    )::uuid AS event_id,
    now() - (g * interval '1 minute') AS ts,
    CASE WHEN g <= 3 THEN 'UNSUPPORTED_INPUT' ELSE 'SILENT_POLICY' END AS template_id,
    CASE WHEN g <= 3 THEN 'retry' ELSE 'none' END AS user_signal
  FROM generate_series(1,6) g
)
INSERT INTO agent_experience_events (
  event_id, ts, node_id, source, agent_id, task_type, request_id,
  channel, inputs_hash, provider, model, profile, latency_ms,
  tokens_in, tokens_out, ok, error_class, error_msg_redacted,
  http_status, raw
)
SELECT
  event_id,
  ts,
  'NODA1',
  'gateway',
  'agromatrix',
  'webhook',
  'phase6-seed-' || event_id::text,
  'telegram',
  md5(event_id::text),
  'gateway',
  'gateway',
  NULL,
  25,
  NULL,
  NULL,
  true,
  NULL,
  NULL,
  200,
  jsonb_build_object(
    'event_id', event_id::text,
    'ts', to_char(ts AT TIME ZONE 'UTC', 'YYYY-MM-DD\"T\"HH24:MI:SS\"Z\"'),
    'source', 'gateway',
    'agent_id', 'agromatrix',
    'chat_type', 'group',
    'anti_silent_action', 'ACK_EMITTED',
    'anti_silent_template', template_id,
    'policy', jsonb_build_object('sowa_decision', 'SILENT', 'reason', 'unsupported_no_message'),
    'feedback', jsonb_build_object('user_signal', user_signal),
    'result', jsonb_build_object('ok', true, 'http_status', 200)
  )
FROM seed;
"
```

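To see why this seed should favour `SILENT_POLICY`, the six rows can be tallied by hand. The sketch below mirrors the seed (three `UNSUPPORTED_INPUT` rows with `user_signal=retry`, three `SILENT_POLICY` rows with `none`) and applies the documented score formula with the documented weights. It assumes the learner aggregates per template within one trigger and that the negative signal is named `negative`; the suppressed rate is zero because every seeded row is `ACK_EMITTED`. The real event sent in the next step is not included.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Mirror of the seed query: g = 1..3 and g = 4..6.
SEED_ROWS: List[Tuple[str, str]] = [
    ("UNSUPPORTED_INPUT", "retry") if g <= 3 else ("SILENT_POLICY", "none")
    for g in range(1, 7)
]


def summarize(
    rows: List[Tuple[str, str]], w_retry: float = 0.6, w_negative: float = 0.3
) -> Dict[str, Tuple[int, float]]:
    """Return {template_id: (evidence_count, score)} for the given rows."""
    totals = Counter(template for template, _ in rows)
    retries = Counter(template for template, signal in rows if signal == "retry")
    negatives = Counter(template for template, signal in rows if signal == "negative")
    summary = {}
    for template, n in totals.items():
        penalty = w_retry * retries[template] / n + w_negative * negatives[template] / n
        summary[template] = (n, round(1.0 - penalty, 3))
    return summary


def passing(summary: Dict[str, Tuple[int, float]], min_evidence: int, min_score: float) -> List[str]:
    """Templates that meet both thresholds (assumed inclusive)."""
    return sorted(t for t, (n, s) in summary.items() if n >= min_evidence and s >= min_score)
```

Under the smoke thresholds (`MIN_EVIDENCE=3`, `MIN_SCORE=0.5`), `summarize(SEED_ROWS)` gives three rows of evidence for each template, with `UNSUPPORTED_INPUT` at 0.4 and `SILENT_POLICY` at 1.0, so only `SILENT_POLICY` passes. Under the production thresholds neither would.
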
### 3) Trigger learner evaluation with one real event
```bash
export GATEWAY_WEBHOOK_URL='http://127.0.0.1:9300/agromatrix/telegram/webhook'

curl -sS -X POST "$GATEWAY_WEBHOOK_URL" \
  -H 'content-type: application/json' \
  -d @docs/ops/payloads/phase5_payload_group_unsupported_no_message.json
```

### 4) Verify tuning lesson exists
```bash
docker exec "$PG_CONTAINER" psql -U daarion -d daarion_memory -P pager=off -c "
SELECT ts,
       trigger,
       action,
       raw->>'lesson_type' AS lesson_type,
       raw->>'expires_at' AS expires_at,
       evidence
FROM agent_lessons
WHERE raw->>'lesson_type'='anti_silent_tuning'
ORDER BY ts DESC
LIMIT 5;
"
```
Expected: lesson with
- `trigger=reason=unsupported_no_message;chat_type=group`
- `action=prefer_template=SILENT_POLICY`

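The `trigger` and `action` values are small `key=value` strings, so checking them programmatically is straightforward. The helpers below are hypothetical (not taken from `gateway_experience_bus.py`) and assume the exact `reason=...;chat_type=...` and `prefer_template=...` shapes shown above.

```python
from typing import Dict, Optional


def build_trigger(reason: str, chat_type: str) -> str:
    """Compose the lookup key in the documented format."""
    return f"reason={reason};chat_type={chat_type}"


def parse_trigger(trigger: str) -> Dict[str, str]:
    """Split 'k1=v1;k2=v2' into a dict; segments without '=' are ignored."""
    parts = (segment.split("=", 1) for segment in trigger.split(";") if "=" in segment)
    return {key: value for key, value in parts}


def parse_action(action: str) -> Optional[str]:
    """Return the template from 'prefer_template=<ID>', or None for other actions."""
    key, _, value = action.partition("=")
    return value if key == "prefer_template" and value else None
```
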
### 5) Enable gateway tuning and verify apply
```bash
# set ANTI_SILENT_TUNING_ENABLED=true for gateway and restart container
# then replay unsupported payload:

curl -sS -X POST "$GATEWAY_WEBHOOK_URL" \
  -H 'content-type: application/json' \
  -d @docs/ops/payloads/phase5_payload_group_unsupported_no_message.json

# verify latest gateway events

docker exec "$PG_CONTAINER" psql -U daarion -d daarion_memory -P pager=off -c "
SELECT ts,
       raw->'policy'->>'reason' AS reason,
       raw->>'chat_type' AS chat_type,
       raw->>'anti_silent_template' AS template_id,
       raw->>'anti_silent_tuning_applied' AS tuning_applied,
       raw->>'anti_silent_action' AS anti_action
FROM agent_experience_events
WHERE source='gateway'
  AND raw->'policy'->>'reason'='unsupported_no_message'
ORDER BY ts DESC
LIMIT 10;
"
```
Expected: newest rows have
- `template_id=SILENT_POLICY`
- `tuning_applied=true`

## PASS
- Tuning lesson created only when evidence/score thresholds pass.
- Gateway does not change template when feature flag is off.
- With feature flag on, gateway applies active non-expired tuning lesson.
- Expired lessons are ignored.

## FAIL
- Tuning lesson appears below evidence/score threshold.
- Gateway changes template while feature flag is off.
- Gateway applies expired lesson.
- Webhook path fails when tuning lookup fails (must stay fail-open).

## Manual Cleanup Query
```sql
DELETE FROM agent_lessons
WHERE raw->>'lesson_type'='anti_silent_tuning'
  AND raw->>'seed_test'='true';
```