This workflow documents common issues encountered with AI Hub file synchronization and node management, along with the strategies and scripts used to diagnose and resolve them.
Symptoms:
- Files created or modified on a node (e.g., with dd or cp) fail to sync to the AI Hub.

Troubleshooting Steps & Checklist:
- Read payload.offset directly rather than using payload.HasField("offset") for primitive fields, as Proto3 does not reliably support .HasField for int64.
- Verify that both the node watcher and the server mirror are actively ignoring files ending in .cortex_tmp and .cortex_lock to prevent recursion loops (Sync Echo).
- For source="empty" workspaces, verify that the AI Hub actually sends a START_WATCHING gRPC signal to the target nodes. Without this signal, nodes will create the empty directory but the watchdog daemon will not attach to it, resulting in zero outbound file syncs.
- A single dd transfer triggers 60+ on_modified events, which can occasionally choke the stream. Implementing an on_closed forced-sync is more reliable.

Symptoms:
- The host user (UID 1000) is unable to manipulate or delete synced files directly from the host machine because the container writes them as root.

Troubleshooting Steps & Checklist:
- Apply os.chown in the Mirror: during the atomic swap phase on the AI Hub (os.replace()), capture the os.stat() of the parent directory, then apply os.chown(tmp_path, parent_stat.st_uid, parent_stat.st_gid) to the .cortex_tmp file immediately before the final swap. This ensures the host user retains ownership of all synced data on NFS/mounted volumes.

Symptoms:
- A stale node (synology-nas) keeps automatically attaching itself to newly created sessions.

Troubleshooting Steps & Checklist:
- Inspect the user's preferences schema. In the Cortex Hub, default node attachments are tied to the user profile, not just individual prior sessions.
- Remove the stale node ID from the default_node_ids array.

Example Surgical Database Script:
```python
import sys
sys.path.append("/app")

from app.db.session import get_db_session
from app.db.models import User, Session
from sqlalchemy.orm.attributes import flag_modified

try:
    with get_db_session() as db:
        # Find the specific user and update their preferences dict
        users = db.query(User).all()
        for u in users:
            prefs = u.preferences or {}
            nodes = prefs.get("nodes", {})
            defaults = nodes.get("default_node_ids", [])
            if "synology-nas" in defaults:
                defaults.remove("synology-nas")
                nodes["default_node_ids"] = defaults
                prefs["nodes"] = nodes
                u.preferences = prefs
                flag_modified(u, "preferences")

        # Clean up any already corrupted sessions
        sessions = db.query(Session).filter(
            Session.sync_workspace_id == "YOUR_SESSION_ID"
        ).all()
        for s in sessions:
            attached = s.attached_node_ids or []
            if "synology-nas" in attached:
                attached.remove("synology-nas")
                s.attached_node_ids = attached
                flag_modified(s, "attached_node_ids")

        db.commit()
        print("Surgical cleanup complete.")
except Exception as e:
    print(f"Error: {e}")
```
Diagnostic Commands:

```shell
# Grep the node container logs for watcher activity
docker logs cortex-test-1 2>&1 | grep "📁👁️"

# Trace a specific test file through the hub logs
docker logs ai_hub_service | grep "dd_test_new_live.bin"

# Inspect the mirrored session data on the hub's docker volume
echo 'your-password' | sudo -S ls -la /var/lib/docker/volumes/cortex-hub_ai_hub_data/_data/mirrors/session-YOUR_SESSION
```
Symptoms:
- The agent's ThreadPoolExecutor or PTY bridge is gridlocked because of a previous broken shell session.

Troubleshooting Steps & Checklist:
- Locate the agent process running agent-node/main.py.
- Check the agent logs (~/.cortex/agent.out.log and ~/.cortex/agent.err.log) for any errors or blocked threads.
- Force-kill (kill -9) the main agent process AND all its recursive child processes (to properly clean up spawned PTY shells).
- Restart with bash run.sh and redirect logs appropriately. Note: the agent node logic (agent-node/src/agent_node/main.py) has been updated in v1.0.62 to automatically harvest and kill child PTY processes when clearing orphaned instances on boot, preventing this state across restarts.

```shell
# Locate the agent process ID
pgrep -f "agent_node/main.py"

# Kill main process and all child PTYs (force restart)
mac_pid=$(pgrep -f "agent_node/main.py") && kill -9 $mac_pid

# Restart agent gracefully with output redirected to background logs
cd ~/.cortex/agent-node && bash run.sh > ~/.cortex/agent.out.log 2> ~/.cortex/agent.err.log &

# Tail agent errors locally
tail -n 100 ~/.cortex/agent.err.log
```
Symptoms:
- Commands such as ls or cd work once, then the PTY bridge deadlocks.
- On macOS, /tmp is a symlink (/tmp -> /private/tmp), causing relative path mismatches.

Troubleshooting Steps & Checklist:
- Check open PTY file descriptors with lsof -p <PID> | grep ptmx. If more than 2-3 PTYs are open while idle, or if the number of PTYs doesn't match the number of active terminal tabs, the reader thread has likely crashed.
- macOS exposes /tmp as a symlink. Always use os.path.realpath() in the Watcher to ensure that relpath calculations are consistent between the root and the modified file.
- If the task_queue (default size 100) is full during a sync, unrelated PTY threads will block waiting for space. Bump maxsize to 10000 in node.py to handle 10MB+ sync bursts.
- Send kill -3 <PID> to the agent. This will dump all active Python thread stacks to stdout (viewable in agent.out.log), allowing you to see exactly where the TTY reader is blocked.

Diagnostic Commands:
```shell
# Check file descriptors for PTY leaks on macOS
lsof -p $(pgrep -f "agent_node/main.py") | grep ptmx

# Force a thread dump to logs for deadlock investigation
kill -3 $(pgrep -f "agent_node/main.py")

# Verify the absolute path resolution that the agent sees
/opt/anaconda3/bin/python3 -c "import os; print(os.path.realpath('/tmp/cortex-sync'))"
```
A set of automation scripts is available in .agent/utils/ to monitor the production environment at ai.jerxie.com.
Prerequisites:
- $REMOTE_HOST, $REMOTE_USER, and $REMOTE_PASSWORD must be set in the environment to allow SSH access to the production server.

Available Scripts:
- bash .agent/utils/check_prod_health.sh: Checks AI Hub connectivity, node registration status, and internal browser service reachability.
- bash .agent/utils/check_prod_status.sh: Lists all running Docker containers on the production server.
- bash .agent/utils/get_prod_logs_hub.sh: Tails the last 200 lines of the ai_hub_service logs.
- bash .agent/utils/get_prod_logs_browser.sh: Tails the last 200 lines of the cortex_browser_service logs.

Usage Tips:
- Run check_prod_health.sh to see if the Hub registration endpoint is receiving heartbeats.
- Run get_prod_logs_browser.sh to see if Playwright is encountering resource issues or container crashes.
- After a remote_deploy.sh, use check_prod_status.sh to ensure all services are marked as Up.