diff --git a/.agent/workflows/troubleshooting.md b/.agent/workflows/troubleshooting.md index 014192c..0cecefe 100644 --- a/.agent/workflows/troubleshooting.md +++ b/.agent/workflows/troubleshooting.md @@ -17,8 +17,19 @@ 2. Verify Temporary File filtering: Ensure the workspace `.watcher` and Server `.mirror` are actively ignoring files ending with `.cortex_tmp` and `.cortex_lock` to prevent recursion loops (Sync Echo). 3. Validate "Empty Workspace" Initialization: When creating `source="empty"` workspaces, verify that the AI Hub actually sends a `START_WATCHING` gRPC signal to the target nodes. Without this signal, nodes will create the empty directory but the `watchdog` daemon will not attach to it, resulting in zero outbound file syncs. 4. Ensure target node is properly receiving "on_closed" events for larger files. Writing a 10MB file via `dd` triggers 60+ `on_modified` events, which can occasionally choke the stream. Implementing an `on_closed` forced-sync is more reliable. +## 2. macOS Ghost Deletion Loop (Infinite Synchronized File Removals) +**Symptoms**: +- A file created natively on a macOS node (via `echo`, UI, or `cp`) syncs upward perfectly, but deletes itself globally a few seconds later. +- Reading the agent logs shows non-stop back-to-back entries mapping: `[📁⚠️] Watcher EMITTING DELETE to SERVER` and `[📁🗑️] Delete Fragment`. +- Atomic network uploads repeatedly cycle identically over and over indefinitely. -## 2. Server Root File Ownership (Jerxie Prod) +**Troubleshooting Steps & Checklist**: +1. Validate Zero-byte payload support: macOS frequently produces two immediate events (Creation: 0 bytes and Modification: 100 bytes). If the watcher chunk reader prematurely breaks on `if not chunk` without validating `index > 0`, the Hub never learns about the file initially, causing an aggressively executed `handle_manifest` remote deletion broadcast to scrub the new file. +2. Validate `FSEvents` `os.replace` behavior: macOS's native kernel watcher queues file deletion events structurally from `os.remove`, but occasionally delivers them hundreds of milliseconds *after* a new network write has arrived utilizing the exact same path! +3. Because `FSEvents` are delayed, when the `on_deleted` event is evaluated against the `last_sync` cache, it finds the newly calculated hash block instead of the suppressed `__DELETED__` lock natively, thereby assuming the user actively erased the file by hand again. +4. Ensure `watcher.py` validates absolute disk existence explicitly before broadcasting deletions: `if os.path.exists(real_src) or os.path.lexists(event.src_path): return`. This decisively breaks the macOS recursive network loop. + +## 3. Server Root File Ownership (Jerxie Prod) **Symptoms**: - Agent nodes sync files to the Hub mirror directory successfully. - The host user (`UID 1000`) is unable to manipulate or delete these files directly from the host machine because the container writes them as `root`. @@ -27,7 +38,7 @@ 1. Use Python's `os.chown` in the Mirror: During the atomic swap phase on the AI Hub (`os.replace()`), forcefully capture the `os.stat()` of the parent directory. 2. Apply `os.chown(tmp_path, parent_stat.st_uid, parent_stat.st_gid)` to the `.cortex_tmp` file immediately before the final swap. This ensures the host user retains ownership of all synced data on NFS/Mounted volumes. -## 3. Ghost Nodes Attaching to New Sessions +## 4. Ghost Nodes Attaching to New Sessions **Symptoms**: - An old or offline node (e.g., `synology-nas`) keeps automatically attaching itself to newly created sessions. - Removing it from the UI or closing previous sessions does not resolve the issue. @@ -91,7 +102,7 @@ echo 'your-password' | sudo -S ls -la /var/lib/docker/volumes/cortex-hub_ai_hub_data/_data/mirrors/session-YOUR_SESSION ``` -## 4. Unresponsive Terminal (Zombie PTY Processes on Local Nodes) +## 5. Unresponsive Terminal (Zombie PTY Processes on Local Nodes) **Symptoms**: - Agent node shows tasks being received in its logs. - Executing bash commands via the AI mesh terminal hangs indefinitely or outputs nothing. @@ -119,7 +130,7 @@ tail -n 100 ~/.cortex/agent.err.log ``` -## 5. Mac Mini Specific Agent Debugging +## 6. Mac Mini Specific Agent Debugging **Symptoms**: - Terminal hangs immediately after a large file sync (e.g., 10MB test file). - Commands like `ls` or `cd` work once, then the PTY bridge deadlocks. @@ -143,7 +154,7 @@ /opt/anaconda3/bin/python3 -c "import os; print(os.path.realpath('/tmp/cortex-sync'))" ``` -## 6. Production Health & Log Monitoring (Jerxie Prod) +## 7. Production Health & Log Monitoring (Jerxie Prod) A set of automation scripts are available in `.agent/utils/` to monitor the production environment at `ai.jerxie.com`. **Prerequisites**: