diff --git a/.agent/workflows/deploy_to_production.md b/.agent/workflows/deploy_to_production.md
index f6f6515..8a7a149 100644
--- a/.agent/workflows/deploy_to_production.md
+++ b/.agent/workflows/deploy_to_production.md
@@ -7,6 +7,40 @@
 **MAIN KNOWLEDGE POINT:** Agents and Users should refer to `.agent/workflows/deployment_reference.md` to understand the full proxy and architecture layout prior to running production debugging.
 
+---
+
+## ⚠️ MANDATORY: Run file sync tests BEFORE committing / deploying
+
+Any change to the following files **requires running the file sync integration tests first**, and all must pass before pushing:
+
+- `agent-node/src/agent_node/core/watcher.py`
+- `agent-node/src/agent_node/core/sync.py`
+- `agent-node/src/agent_node/node.py`
+- `ai-hub/app/core/grpc/services/assistant.py`
+- `ai-hub/app/core/grpc/services/grpc_server.py`
+- `ai-hub/app/core/grpc/core/mirror.py`
+- `ai-hub/app/protos/agent.proto`
+
+**Run the small file tests (Cases 1–4, ~60s) — the minimum bar before any commit:**
+
+// turbo
+```bash
+sshpass -p "a6163484a" ssh -o StrictHostKeyChecking=no axieyangb@192.168.68.113 \
+  "echo 'a6163484a' | sudo -S docker exec \
+  -e SYNC_TEST_BASE_URL=http://127.0.0.1:8000 \
+  -e SYNC_TEST_USER_ID=9a333ccd-9c3f-432f-a030-7b1e1284a436 \
+  ai_hub_service \
+  python3 -m pytest /tmp/test_file_sync.py::TestSmallFileSync -v -s -p no:warnings 2>&1"
+```
+
+> The test file is deployed automatically by `remote_deploy.sh`. If you need to refresh it without a full deploy, see the `/file_sync_tests` workflow.
+
+All 4 small-file cases must show `PASSED`. If any fail, **do not deploy** — diagnose the regression first (see `/file_sync_tests` for troubleshooting).
+
+---
+
+## Deployment Steps
+
 1. **Automated Secret Fetching**: The `scripts/remote_deploy.sh` script will automatically pull the production password from the GitBucket Secret Vault if the `GITBUCKET_TOKEN` is available in `/app/.env.gitbucket`.
 2. **Sync**: Sync local codebase to `/tmp/cortex-hub/` on the server.
 3. **Proto & Version Management**:
@@ -26,20 +60,37 @@
 bash /app/scripts/remote_deploy.sh
 ```
 
-### Post-Deployment (MANDATORY)
-After the script completes, the agent MUST run a basic connectivity check directly inside the production container:
+---
+
+## ⚠️ MANDATORY: Post-Deployment Verification
+
+After the deploy completes, run **both** checks:
+
+### 1. Basic API health check
 
 ```bash
-# SSH into the host and run (you may need to provide the SSH pass or be logged in)
-docker exec ai_hub_service python3 -c "
+sshpass -p "a6163484a" ssh -o StrictHostKeyChecking=no axieyangb@192.168.68.113 \
+  "echo 'a6163484a' | sudo -S docker exec ai_hub_service python3 -c \"
 import requests
-# Ensure you use a valid user_id
 headers = {'X-User-ID': '9a333ccd-9c3f-432f-a030-7b1e1284a436'}
-r = requests.get('http://localhost:8000/api/v1/nodes/test-prod-node/fs/ls?path=.', headers=headers)
+r = requests.get('http://localhost:8000/api/v1/nodes/test-node-1/status', headers=headers)
 print(f'Status: {r.status_code}')
-print(r.text)
-"
+print(r.text[:200])
+\""
 ```
 
-**Expected Outcome**: A `200 OK` status and a JSON body.
+**Expected Outcome**: `200 OK` and `{"status": "online", ...}`.
+
+### 2. File sync regression test (re-run after deploy)
+
+// turbo
+```bash
+sshpass -p "a6163484a" ssh -o StrictHostKeyChecking=no axieyangb@192.168.68.113 \
+  "echo 'a6163484a' | sudo -S docker exec \
+  -e SYNC_TEST_BASE_URL=http://127.0.0.1:8000 \
+  -e SYNC_TEST_USER_ID=9a333ccd-9c3f-432f-a030-7b1e1284a436 \
+  ai_hub_service \
+  python3 -m pytest /tmp/test_file_sync.py::TestSmallFileSync -v -p no:warnings 2>&1"
+```
+
+All 4 must pass. If any fail post-deploy, **roll back immediately** using `git revert` and re-deploy.
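The "all 4 must pass or do not deploy" gate above can be scripted rather than eyeballed. A minimal sketch, assuming the standard pytest transcript format; `run_tests` and the test-case names here are hypothetical stand-ins for the real sshpass/docker-exec invocation shown above:

```shell
# Hypothetical stand-in for the real sshpass/docker-exec pytest command.
# It emits a simulated pytest transcript so the gating logic can be shown.
run_tests() {
  printf '%s\n' \
    "test_file_sync.py::TestSmallFileSync::test_case_1 PASSED" \
    "test_file_sync.py::TestSmallFileSync::test_case_2 PASSED" \
    "test_file_sync.py::TestSmallFileSync::test_case_3 PASSED" \
    "test_file_sync.py::TestSmallFileSync::test_case_4 PASSED" \
    "4 passed in 58.21s"
}

out="$(run_tests)"
passed="$(printf '%s\n' "$out" | grep -c ' PASSED$')"

# Deploy only when all four cases passed and nothing failed or errored.
if [ "$passed" -eq 4 ] && ! printf '%s\n' "$out" | grep -qE 'FAILED|ERROR'; then
  echo "sync tests green - safe to deploy"
else
  echo "sync tests red - do not deploy" >&2
fi
```

Counting `PASSED` lines rather than only grepping for failures guards against the degenerate case where the suite collected zero tests.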
 
 ### Visual Verification (Worst Case)
 If the backend check passes but UI behaviors are suspect or need deeper frontend validation, use the `/frontend_tester` workflow:
diff --git a/.agent/workflows/file_sync_tests.md b/.agent/workflows/file_sync_tests.md
new file mode 100644
index 0000000..b79168e
--- /dev/null
+++ b/.agent/workflows/file_sync_tests.md
@@ -0,0 +1,116 @@
+---
+description: How to run the file sync integration tests against the production server to verify mesh file propagation is working correctly.
+---
+
+# File Sync Integration Tests
+
+Tests live in `ai-hub/integration_tests/test_file_sync.py`.
+They verify real end-to-end file sync across nodes and the Hub mirror.
+
+## Prerequisites
+
+- Production server: `192.168.68.113`
+- Credentials: user `axieyangb`, password `a6163484a`
+- Nodes `test-node-1` and `test-node-2` must be **online** (status: online in the `agent_nodes` table)
+- The test file must be present inside the container at `/tmp/test_file_sync.py`
+
+## Key Environment Variables
+
+| Variable | Purpose | Value |
+|---|---|---|
+| `SYNC_TEST_BASE_URL` | Hub API base URL (inside container) | `http://127.0.0.1:8000` |
+| `SYNC_TEST_USER_ID` | User ID with access to test nodes | `9a333ccd-9c3f-432f-a030-7b1e1284a436` |
+| `SYNC_TEST_NODE1` | First test node (optional, defaults to `test-node-1`) | `test-node-1` |
+| `SYNC_TEST_NODE2` | Second test node (optional, defaults to `test-node-2`) | `test-node-2` |
+
+> **Why this user ID?** The nodes (`test-node-1`, `test-node-2`) are registered under group `4a2e9ec4-4ae9-4ab3-9e78-69c35449ac94`. Only users in that group pass the `_require_node_access` check. `9a333ccd-...` is the admin user (`axieyangb@gmail.com`) who belongs to that group. If you create a new test user, you must add them to that group first.
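The optional node variables fall back to the defaults listed in the table. The test file itself reads them from the environment, so the shell-style expansion below is only an illustration of the assumed defaulting behaviour:

```shell
# Illustration only: mirrors the defaulting listed in the table above.
# Unset first so the sketch is deterministic regardless of the caller's env.
unset SYNC_TEST_NODE1 SYNC_TEST_NODE2
NODE1="${SYNC_TEST_NODE1:-test-node-1}"
NODE2="${SYNC_TEST_NODE2:-test-node-2}"
echo "sync pair: $NODE1 <-> $NODE2"
```

To target different nodes, pass the overrides the same way as the other variables, i.e. add extra `-e SYNC_TEST_NODE1=... -e SYNC_TEST_NODE2=...` flags to the `docker exec` commands below (node names here are placeholders).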
+ +--- + +## Step 1 — Deploy the test file to the container + +Run this whenever the test file changes (it does NOT require a full deploy): + +```bash +sshpass -p "a6163484a" scp -o StrictHostKeyChecking=no \ + /app/ai-hub/integration_tests/test_file_sync.py \ + axieyangb@192.168.68.113:/tmp/test_file_sync.py \ +&& sshpass -p "a6163484a" ssh -o StrictHostKeyChecking=no axieyangb@192.168.68.113 \ + "echo 'a6163484a' | sudo -S docker cp /tmp/test_file_sync.py ai_hub_service:/tmp/test_file_sync.py" +``` + +--- + +## Step 2 — Run all 8 tests (all-in-one) + +// turbo +```bash +sshpass -p "a6163484a" ssh -o StrictHostKeyChecking=no axieyangb@192.168.68.113 \ + "echo 'a6163484a' | sudo -S docker exec \ + -e SYNC_TEST_BASE_URL=http://127.0.0.1:8000 \ + -e SYNC_TEST_USER_ID=9a333ccd-9c3f-432f-a030-7b1e1284a436 \ + ai_hub_service \ + python3 -m pytest /tmp/test_file_sync.py -v -s -p no:warnings 2>&1" +``` + +**Expected runtime:** ~2 minutes (1 min small files + 1 min large files) + +--- + +## Run only small file tests (faster, ~60s) + +// turbo +```bash +sshpass -p "a6163484a" ssh -o StrictHostKeyChecking=no axieyangb@192.168.68.113 \ + "echo 'a6163484a' | sudo -S docker exec \ + -e SYNC_TEST_BASE_URL=http://127.0.0.1:8000 \ + -e SYNC_TEST_USER_ID=9a333ccd-9c3f-432f-a030-7b1e1284a436 \ + ai_hub_service \ + python3 -m pytest /tmp/test_file_sync.py::TestSmallFileSync -v -s -p no:warnings 2>&1" +``` + +--- + +## What the tests cover + +| Case | Scenario | File Size | +|---|---|---| +| 1 | Write from `test-node-1` → `test-node-2` + Hub mirror receive it | Small | +| 2 | Write from server (via node-1) → all client nodes receive it | Small | +| 3 | Delete from server → all client nodes purged | Small | +| 4 | Delete from `test-node-2` → Hub mirror + `test-node-1` purged | Small | +| 5 | Write 20 MB from `test-node-1` → `test-node-2` + Hub mirror (with SHA-256 verify) | 20 MB | +| 6 | Write 20 MB from server → all client nodes receive it | 20 MB | +| 7 | Delete large file from 
server → all nodes purged | 20 MB | +| 8 | Delete large file from `test-node-2` → Hub mirror + `test-node-1` purged | 20 MB | + +--- + +## Troubleshooting + +### Tests are SKIPped +The `requires_nodes` marker skips all tests if the Hub is unreachable. Check: +1. Is `ai_hub_service` running? `docker ps | grep ai_hub_service` +2. Is the Hub healthy? `docker logs --tail 20 ai_hub_service` + +### `Create session failed: {"detail":"Not Found"}` +The API path is wrong — ensure `SESSIONS_PATH = "/api/v1/sessions"` in the test file (production uses `/api/v1/` prefix set by `PATH_PREFIX` env var). + +### `Attach nodes failed` (403 / 404) +The `SYNC_TEST_USER_ID` doesn't have access to the nodes. Verify group membership: +```bash +sshpass -p "a6163484a" ssh -o StrictHostKeyChecking=no axieyangb@192.168.68.113 \ + "echo 'a6163484a' | sudo -S docker exec ai_hub_service python3 -c \ + 'import sqlite3; conn = sqlite3.connect(\"/app/data/ai_hub.db\"); cur = conn.cursor(); \ + cur.execute(\"SELECT id, email, group_id FROM users WHERE id = \\\"YOUR_USER_ID\\\"\"); \ + print(cur.fetchall())'" +``` + +### Case 3 or 4 (delete propagation) fails +This means the `broadcast_delete` in `assistant.py` or the `on_deleted` filter in `watcher.py` regressed. Check: +- `watcher.py` `on_deleted` must filter `.cortex_tmp` / `.cortex_lock` paths +- `grpc_server.py` DELETE handler must guard against temp-file paths +- `assistant.py` must have `broadcast_delete()` method + +### File appears on node-1 but not node-2 +`self.memberships` in `assistant.py` doesn't include both nodes for the session. Check `broadcast_file_chunk` log lines for the destinations list.
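To pull the destinations list out of the Hub logs, filter for `broadcast_file_chunk` lines. A minimal sketch with a simulated log excerpt — the exact log line format is an assumption; in production, pipe `docker logs ai_hub_service` through the same filter instead:

```shell
# Simulated ai_hub_service log excerpt; the line format is an assumption.
logs='INFO broadcast_file_chunk session=s1 destinations=[test-node-1]
INFO broadcast_file_chunk session=s1 destinations=[test-node-1, test-node-2]
INFO heartbeat node=test-node-2'

# Keep only the destination lists; a healthy two-node session should
# eventually show BOTH nodes in the list.
dests="$(printf '%s\n' "$logs" | grep 'broadcast_file_chunk' | grep -o 'destinations=\[[^]]*\]')"
printf '%s\n' "$dests"
```

If `test-node-2` never appears in any `destinations=[...]` entry, the session membership (rather than the chunk transfer itself) is the likely regression point.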