diff --git a/.agent/workflows/troubleshooting.md b/.agent/workflows/troubleshooting.md
index 0cecefe..4029ba7 100644
--- a/.agent/workflows/troubleshooting.md
+++ b/.agent/workflows/troubleshooting.md
@@ -28,8 +28,18 @@
 2. Validate `FSEvents` `os.replace` behavior: macOS's native kernel watcher queues file deletion events structurally from `os.remove`, but occasionally delivers them hundreds of milliseconds *after* a new network write has arrived utilizing the exact same path!
 3. Because `FSEvents` are delayed, when the `on_deleted` event is evaluated against the `last_sync` cache, it finds the newly calculated hash block instead of the suppressed `__DELETED__` lock natively, thereby assuming the user actively erased the file by hand again.
 4. Ensure `watcher.py` validates absolute disk existence explicitly before broadcasting deletions: `if os.path.exists(real_src) or os.path.lexists(event.src_path): return`. This decisively breaks the macOS recursive network loop.
+## 3. Infinite Sync Echo Loop (Modification Ping-Pong)
+**Symptoms**:
+- A file is created or modified, and then it immediately starts "pulsing" with `.cortex_tmp` and `.cortex_lock` files appearing and disappearing infinitely.
+- High CPU/Network usage on agent nodes.
+- Logs show non-stop `Streaming Sync Started` and `Async File Sync Complete` for the same file.
 
-## 3. Server Root File Ownership (Jerxie Prod)
+**Troubleshooting Steps & Checklist**:
+1. **Hash-Check-First Validation**: Ensure `watcher.py` calculates the file hash **before** attempting to send chunks. If the hash matches `last_sync`, the modification event was likely an "echo" from a previous network write and must be ignored.
+2. **Delayed FSEvents Masking**: macOS often delivers `on_modified` events up to several seconds after a write. Even if the path was suppressed during the actual write, the event might arrive after unsuppression. A pre-emptive hash verification is the most robust defense.
+3. **Filter Internal Temp Files**: Ensure all event handlers (`on_created`, `on_modified`, `on_closed`) explicitly return if the filename ends in `.cortex_tmp` or `.cortex_lock`. This prevents the watcher from trying to sync the synchronization metadata itself.
+
+## 4. Server Root File Ownership (Jerxie Prod)
 **Symptoms**:
 - Agent nodes sync files to the Hub mirror directory successfully.
 - The host user (`UID 1000`) is unable to manipulate or delete these files directly from the host machine because the container writes them as `root`.
@@ -38,7 +48,7 @@
 1. Use Python's `os.chown` in the Mirror: During the atomic swap phase on the AI Hub (`os.replace()`), forcefully capture the `os.stat()` of the parent directory.
 2. Apply `os.chown(tmp_path, parent_stat.st_uid, parent_stat.st_gid)` to the `.cortex_tmp` file immediately before the final swap. This ensures the host user retains ownership of all synced data on NFS/Mounted volumes.
 
-## 4. Ghost Nodes Attaching to New Sessions
+## 5. Ghost Nodes Attaching to New Sessions
 **Symptoms**:
 - An old or offline node (e.g., `synology-nas`) keeps automatically attaching itself to newly created sessions.
 - Removing it from the UI or closing previous sessions does not resolve the issue.
@@ -102,7 +112,7 @@
 echo 'your-password' | sudo -S ls -la /var/lib/docker/volumes/cortex-hub_ai_hub_data/_data/mirrors/session-YOUR_SESSION
 ```
 
-## 5. Unresponsive Terminal (Zombie PTY Processes on Local Nodes)
+## 6. Unresponsive Terminal (Zombie PTY Processes on Local Nodes)
 **Symptoms**:
 - Agent node shows tasks being received in its logs.
 - Executing bash commands via the AI mesh terminal hangs indefinitely or outputs nothing.
@@ -130,7 +140,7 @@
 tail -n 100 ~/.cortex/agent.err.log
 ```
 
-## 6. Mac Mini Specific Agent Debugging
+## 7. Mac Mini Specific Agent Debugging
 **Symptoms**:
 - Terminal hangs immediately after a large file sync (e.g., 10MB test file).
 - Commands like `ls` or `cd` work once, then the PTY bridge deadlocks.
@@ -154,7 +164,7 @@
 /opt/anaconda3/bin/python3 -c "import os; print(os.path.realpath('/tmp/cortex-sync'))"
 ```
 
-## 7. Production Health & Log Monitoring (Jerxie Prod)
+## 8. Production Health & Log Monitoring (Jerxie Prod)
 A set of automation scripts are available in `.agent/utils/` to monitor the production environment at `ai.jerxie.com`.
 
 **Prerequisites**: