diff --git a/.agent/workflows/troubleshooting.md b/.agent/workflows/troubleshooting.md
new file mode 100644
index 0000000..2dd9bdc
--- /dev/null
+++ b/.agent/workflows/troubleshooting.md
@@ -0,0 +1,92 @@
---
description: Troubleshooting Guide for File Sync and Node Issues
---

# Troubleshooting Guide: Sync & Node Debugging

This workflow documents common issues encountered with AI Hub file synchronization and node management, along with the strategies and scripts used to diagnose and resolve them.

## 1. File Synchronization Dropping Large Files
**Symptoms**:
- A node successfully creates a large file (e.g., using `dd` or `cp`).
- The file does not appear on the central AI Hub mirror.
- Other connected nodes do not receive the file chunk broadcasts.

**Troubleshooting Steps & Checklist**:
1. Check Proto3 compliance: Read `payload.offset` directly instead of calling `payload.HasField("offset")`. Proto3 does not track field presence for singular scalar types such as `int64`, so `HasField` raises an error on those fields.
2. Verify temporary-file filtering: Ensure the workspace `.watcher` and server `.mirror` are actively ignoring files ending with `.cortex_tmp` and `.cortex_lock` to prevent recursion loops ("Sync Echo").
3. Validate "Empty Workspace" initialization: When creating `source="empty"` workspaces, verify that the AI Hub actually sends a `START_WATCHING` gRPC signal to the target nodes. Without this signal, nodes will create the empty directory, but the `watchdog` daemon will not attach to it, resulting in zero outbound file syncs.
4. Ensure the target node receives `on_closed` events for larger files. Writing a 10 MB file via `dd` triggers 60+ `on_modified` events, which can occasionally choke the stream. Implementing an `on_closed` forced sync is more reliable.

## 2. Server Root File Ownership (Jerxie Prod)
**Symptoms**:
- Agent nodes sync files to the Hub mirror directory successfully.
- The host user (`UID 1000`) cannot modify or delete these files directly from the host machine, because the container writes them as `root`.

**Troubleshooting Steps & Checklist**:
1. Use Python's `os.chown` in the mirror: During the atomic swap phase on the AI Hub (`os.replace()`), capture the `os.stat()` of the parent directory.
2. Apply `os.chown(tmp_path, parent_stat.st_uid, parent_stat.st_gid)` to the `.cortex_tmp` file immediately before the final swap. This ensures the host user retains ownership of all synced data on NFS/mounted volumes.

## 3. Ghost Nodes Attaching to New Sessions
**Symptoms**:
- An old or offline node (e.g., `synology-nas`) keeps automatically attaching itself to newly created sessions.
- Removing it from the UI or closing previous sessions does not resolve the issue.

**Troubleshooting Steps & Checklist**:
- Verify the Postgres user `preferences` schema. In the Cortex Hub, default node attachments are tied to the user profile, not just to individual prior sessions.
- Run a surgical database manipulation script directly on the backend to erase the ghost node from the `default_node_ids` array.
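The cleanup itself boils down to removing one id from a nested preferences dict. A minimal, self-contained sketch of that transformation (the helper name `strip_default_node` is illustrative, not part of the Hub codebase):

```python
def strip_default_node(prefs, node_id):
    """Return a copy of a user `preferences` dict with `node_id`
    removed from preferences["nodes"]["default_node_ids"]."""
    prefs = dict(prefs or {})
    nodes = dict(prefs.get("nodes", {}))
    # Rebuild the list without the ghost node id.
    nodes["default_node_ids"] = [
        n for n in nodes.get("default_node_ids", []) if n != node_id
    ]
    prefs["nodes"] = nodes
    return prefs


prefs = {"nodes": {"default_node_ids": ["synology-nas", "macbook-m3"]}}
cleaned = strip_default_node(prefs, "synology-nas")
print(cleaned["nodes"]["default_node_ids"])  # ['macbook-m3']
```

The full script below applies the same transformation inside a database session, where SQLAlchemy additionally needs to be told that the JSON column changed.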

**Example Surgical Database Script**:
```python
import sys
sys.path.append("/app")
from app.db.session import get_db_session
from app.db.models import User, Session
from sqlalchemy.orm.attributes import flag_modified

try:
    with get_db_session() as db:
        # Strip the ghost node from every user's preferences dict
        users = db.query(User).all()
        for u in users:
            prefs = u.preferences or {}
            nodes = prefs.get("nodes", {})
            defaults = nodes.get("default_node_ids", [])
            if "synology-nas" in defaults:
                defaults.remove("synology-nas")
                nodes["default_node_ids"] = defaults
                prefs["nodes"] = nodes
                u.preferences = prefs
                # JSON columns mutated in place are not change-tracked;
                # flag_modified forces the UPDATE to be emitted.
                flag_modified(u, "preferences")

        # Clean up any already corrupted sessions
        sessions = db.query(Session).filter(Session.sync_workspace_id == "YOUR_SESSION_ID").all()
        for s in sessions:
            attached = s.attached_node_ids or []
            if "synology-nas" in attached:
                attached.remove("synology-nas")
                s.attached_node_ids = attached
                flag_modified(s, "attached_node_ids")

        db.commit()
        print("Surgical cleanup complete.")
except Exception as e:
    print(f"Error: {e}")
```

## Useful Diagnostic SSH Commands

### Check Agent Watcher Logs
```bash
docker logs cortex-test-1 2>&1 | grep "📁👁️"
```

### Trace File Sync on the Hub
```bash
docker logs ai_hub_service 2>&1 | grep "dd_test_new_live.bin"
```

### Validate Container Mount Ownership
```bash
echo 'your-password' | sudo -S ls -la /var/lib/docker/volumes/cortex-hub_ai_hub_data/_data/mirrors/session-YOUR_SESSION
```
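
## Appendix: Ownership-Preserving Atomic Swap

The `os.chown` + `os.replace` sequence described in section 2 can be sketched as a small self-contained example. The paths here are illustrative stand-ins for the mirror directory, and the `chown` only changes anything when the process runs as root (as it does inside the Hub container); run as an unprivileged user it is a harmless no-op:

```python
import os
import tempfile


def atomic_swap_preserving_owner(tmp_path, final_path):
    """Chown a temp file to match its destination directory's owner,
    then atomically swap it into place with os.replace()."""
    parent_stat = os.stat(os.path.dirname(final_path))
    # Match the parent directory's uid/gid so the host user keeps
    # ownership of synced files (meaningful only when running as root).
    os.chown(tmp_path, parent_stat.st_uid, parent_stat.st_gid)
    os.replace(tmp_path, final_path)  # atomic on the same filesystem


# Demo: write to a .cortex_tmp file, then swap it into place.
with tempfile.TemporaryDirectory() as mirror:
    tmp = os.path.join(mirror, "data.bin.cortex_tmp")
    final = os.path.join(mirror, "data.bin")
    with open(tmp, "wb") as f:
        f.write(b"chunk")
    atomic_swap_preserving_owner(tmp, final)
    print(os.path.exists(final), os.path.exists(tmp))  # True False
```

Doing the `chown` on the `.cortex_tmp` file *before* the swap matters: readers watching `final_path` never observe a root-owned file, because the rename is atomic.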