
description: Troubleshooting Guide for File Sync and Node Issues

Troubleshooting Guide: Sync & Node Debugging

This workflow documents common issues encountered with the AI Hub file synchronization and node management, along with strategies and scripts used to diagnose and resolve them.

1. File Synchronization Dropping Large Files

Symptoms:

  • A node successfully creates a large file (e.g., using dd or cp).
  • The file does not appear on the central AI Hub mirror.
  • Other connected nodes do not receive the file chunk broadcasts.

Troubleshooting Steps & Checklist:

  1. Check Proto3 compliance: Ensure you are accessing payload.offset directly rather than calling payload.HasField("offset"), since Proto3 does not support .HasField for non-optional scalar fields such as int64 (scalars default to 0 and carry no presence information).
  2. Verify Temporary File filtering: Ensure the workspace .watcher and Server .mirror are actively ignoring files ending with .cortex_tmp and .cortex_lock to prevent recursion loops (Sync Echo).
  3. Validate "Empty Workspace" Initialization: When creating source="empty" workspaces, verify that the AI Hub actually sends a START_WATCHING gRPC signal to the target nodes. Without this signal, nodes will create the empty directory but the watchdog daemon will not attach to it, resulting in zero outbound file syncs.
  4. Ensure target node is properly receiving "on_closed" events for larger files. Writing a 10MB file via dd triggers 60+ on_modified events, which can occasionally choke the stream. Implementing an on_closed forced-sync is more reliable.
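Steps 2 and 4 above can be sketched as a watchdog-style handler. This is a minimal sketch: the handler class, the sync_file callback, and the IGNORED_SUFFIXES constant are illustrative names, not the actual watcher.py implementation.

```python
# Sketch: filter internal temp files and prefer one forced sync on
# file close over dozens of per-chunk on_modified events.
IGNORED_SUFFIXES = (".cortex_tmp", ".cortex_lock")

def should_ignore(path: str) -> bool:
    """Skip sync metadata files to avoid recursion loops (Sync Echo)."""
    return path.endswith(IGNORED_SUFFIXES)

class SyncHandler:
    def __init__(self, sync_file):
        self.sync_file = sync_file  # callback that ships the file to the Hub
        self.dirty = set()          # paths touched since their last sync

    def on_modified(self, path: str) -> None:
        # A 10MB dd write fires 60+ on_modified events; just mark the path dirty.
        if not should_ignore(path):
            self.dirty.add(path)

    def on_closed(self, path: str) -> None:
        # One forced sync when the writer closes the file is far more reliable.
        if not should_ignore(path) and path in self.dirty:
            self.dirty.discard(path)
            self.sync_file(path)
```

The key design choice is that on_modified only marks the path dirty; chunks only go out once on_closed fires.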

2. macOS Ghost Deletion Loop (Infinite Synchronized File Removals)

Symptoms:

  • A file created natively on a macOS node (via echo, the UI, or cp) syncs upward perfectly, but deletes itself globally a few seconds later.
  • The agent logs show non-stop back-to-back entries: [📁⚠️] Watcher EMITTING DELETE to SERVER and [📁🗑️] Delete Fragment.
  • The same atomic network upload cycles identically, over and over, indefinitely.

Troubleshooting Steps & Checklist:

  1. Validate Zero-byte payload support: macOS frequently produces two immediate events (Creation: 0 bytes and Modification: 100 bytes). If the watcher chunk reader prematurely breaks on if not chunk without validating index > 0, the Hub never learns about the file initially, causing an aggressively executed handle_manifest remote deletion broadcast to scrub the new file.
  2. Validate FSEvents os.replace behavior: macOS's native kernel watcher queues file deletion events from os.remove, but occasionally delivers them hundreds of milliseconds after a new network write has landed at the exact same path.
  3. Because FSEvents are delayed, when the on_deleted event is evaluated against the last_sync cache, it finds the freshly calculated hash of the new write instead of the suppressed __DELETED__ lock, and therefore assumes the user erased the file by hand again.
  4. Ensure watcher.py validates absolute disk existence explicitly before broadcasting deletions: if os.path.exists(real_src) or os.path.lexists(event.src_path): return. This decisively breaks the macOS recursive network loop.
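The existence check in step 4 can be sketched as a small guard. This is a sketch: the function name and surrounding wiring are assumptions, not the real watcher.py code.

```python
import os

def should_broadcast_delete(src_path: str) -> bool:
    """Suppress a delete broadcast if the path still exists on disk.

    A delayed FSEvents deletion can arrive after a new network write has
    recreated the same path; broadcasting then would scrub a live file.
    os.path.lexists also catches dangling symlinks that os.path.exists misses.
    """
    real_src = os.path.realpath(src_path)
    if os.path.exists(real_src) or os.path.lexists(src_path):
        return False  # file is back on disk: stale event, do not broadcast
    return True       # genuinely gone: safe to emit the delete
```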

3. Infinite Sync Echo Loop (Modification Ping-Pong)

Symptoms:

  • A file is created or modified, then immediately starts "pulsing", with .cortex_tmp and .cortex_lock files appearing and disappearing indefinitely.
  • High CPU/network usage on agent nodes.
  • Logs show non-stop Streaming Sync Started and Async File Sync Complete for the same file.

Troubleshooting Steps & Checklist:

  1. Hash-Check-First Validation: Ensure watcher.py calculates the file hash before attempting to send chunks. If the hash matches last_sync, the modification event was likely an "echo" from a previous network write and must be ignored.
  2. Delayed FSEvents Masking: macOS often delivers on_modified events up to several seconds after a write. Even if the path was suppressed during the actual write, the event might arrive after unsuppression. A pre-emptive hash verification is the most robust defense.
  3. Filter Internal Temp Files: Ensure all event handlers (on_created, on_modified, on_closed) explicitly return if the filename ends in .cortex_tmp or .cortex_lock. This prevents the watcher from trying to sync the synchronization metadata itself.
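The hash-check-first defense in step 1 can be sketched like this. The shape of the last_sync cache (a path-to-digest dict) is an assumption for illustration.

```python
import hashlib

last_sync: dict = {}  # path -> sha256 hex digest of the last synced content

def file_hash(path: str) -> str:
    """Stream the file through sha256 in 64KB blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()

def needs_sync(path: str) -> bool:
    """Hash before sending chunks: if the content matches last_sync,
    the on_modified event was an echo of our own network write."""
    digest = file_hash(path)
    if last_sync.get(path) == digest:
        return False            # echo: ignore it, break the ping-pong
    last_sync[path] = digest    # record before chunks go out
    return True
```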

4. Server Root File Ownership (Jerxie Prod)

Symptoms:

  • Agent nodes sync files to the Hub mirror directory successfully.
  • The host user (UID 1000) is unable to manipulate or delete these files directly from the host machine because the container writes them as root.

Troubleshooting Steps & Checklist:

  1. Use Python's os.chown in the Mirror: During the atomic swap phase on the AI Hub (os.replace()), capture the os.stat() of the parent directory.
  2. Apply os.chown(tmp_path, parent_stat.st_uid, parent_stat.st_gid) to the .cortex_tmp file immediately before the final swap. This ensures the host user retains ownership of all synced data on NFS/Mounted volumes.
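The two steps above can be sketched as part of the atomic swap. This is a sketch of the idea, not the Hub's actual mirror code; note that chown to a different UID only works when the process runs as root, as it does inside the Hub container.

```python
import os

def atomic_swap_preserving_owner(tmp_path: str, final_path: str) -> None:
    """Swap a .cortex_tmp file into place while matching the parent
    directory's ownership, so the host user (e.g. UID 1000) keeps
    control of files the root container writes into the mirror."""
    parent_stat = os.stat(os.path.dirname(final_path))
    try:
        os.chown(tmp_path, parent_stat.st_uid, parent_stat.st_gid)
    except PermissionError:
        pass  # non-root processes may only chown to their own UID/GID
    os.replace(tmp_path, final_path)  # atomic on the same filesystem
```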

5. Ghost Nodes Attaching to New Sessions

Symptoms:

  • An old or offline node (e.g., synology-nas) keeps automatically attaching itself to newly created sessions.
  • Removing it from the UI or closing previous sessions does not resolve the issue.

Troubleshooting Steps & Checklist:

  • Verify the Postgres User preferences schema. In the Cortex Hub, default node attachments are tied to the user profile, not just individual prior sessions.
  • Run a surgical database manipulation script directly on the backend to erase the ghost node from the default_node_ids array.

Example Surgical Database Script:

import sys
sys.path.append("/app")
from app.db.session import get_db_session
from app.db.models import User, Session
from sqlalchemy.orm.attributes import flag_modified

try:
    with get_db_session() as db:
        # Find the specific user and update their preferences dict
        users = db.query(User).all()
        for u in users:
            prefs = u.preferences or {}
            nodes = prefs.get("nodes", {})
            defaults = nodes.get("default_node_ids", [])
            if "synology-nas" in defaults:
                defaults.remove("synology-nas")
                nodes["default_node_ids"] = defaults
                prefs["nodes"] = nodes
                u.preferences = prefs
                flag_modified(u, "preferences")
                
        # Clean up any already corrupted sessions
        sessions = db.query(Session).filter(Session.sync_workspace_id == "YOUR_SESSION_ID").all()
        for s in sessions:
            attached = s.attached_node_ids or []
            if "synology-nas" in attached:
                attached.remove("synology-nas")
                s.attached_node_ids = attached
                flag_modified(s, "attached_node_ids")
                
        db.commit()
        print("Surgical cleanup complete.")
except Exception as e:
    print(f"Error: {e}")

Useful Diagnostic SSH Commands

Check Agent Watcher Logs

docker logs cortex-test-1 2>&1 | grep "📁👁️"

Trace File Sync on the Hub

docker logs ai_hub_service | grep "dd_test_new_live.bin"

Validate Container Mount Ownership

echo 'your-password' | sudo -S ls -la /var/lib/docker/volumes/cortex-hub_ai_hub_data/_data/mirrors/session-YOUR_SESSION

6. Unresponsive Terminal (Zombie PTY Processes on Local Nodes)

Symptoms:

  • Agent node shows tasks being received in its logs.
  • Executing bash commands via the AI mesh terminal hangs indefinitely or outputs nothing.
  • The ThreadPoolExecutor or PTY bridge is gridlocked because of a previous broken shell session.

Troubleshooting Steps & Checklist:

  1. Check the active Python process list to locate the running agent-node/main.py.
  2. Inspect the agent stdout and stderr logs locally (e.g. ~/.cortex/agent.out.log and ~/.cortex/agent.err.log) for any errors or blocked threads.
  3. If the process is hung on a zombie child shell, force-kill (kill -9) the main agent process AND all its recursive child processes (to properly clean up spawned PTY shells).
  4. Restart the agent using bash run.sh and redirect logs appropriately. Note: The agent node logic (agent-node/src/agent_node/main.py) has been updated in v1.0.62 to automatically harvest and kill child PTY processes when clearing orphaned instances on boot, preventing this state across restarts.
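Step 3 (force-killing the agent and its spawned PTY shells) can be sketched in Python. This is a sketch under stated assumptions: it relies on pgrep being available, and it only kills direct children, not deeper descendants.

```python
import os
import signal
import subprocess

def kill_process_tree(pid: int) -> None:
    """SIGKILL a process and its direct children, so orphaned PTY
    shells do not keep the ThreadPoolExecutor gridlocked."""
    # pgrep -P lists the PIDs whose parent is <pid>
    result = subprocess.run(
        ["pgrep", "-P", str(pid)], capture_output=True, text=True
    )
    for child in result.stdout.split():
        try:
            os.kill(int(child), signal.SIGKILL)
        except ProcessLookupError:
            pass  # child already exited
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # main process already exited
```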

Useful Diagnostic Commands (Local Node)

# Locate the agent process ID
pgrep -f "agent_node/main.py"

# Kill main process and all child PTYs (Force restart)
mac_pid=$(pgrep -f "agent_node/main.py") && { pkill -9 -P "$mac_pid"; kill -9 "$mac_pid"; }

# Restart agent gracefully with output redirected to background logs
cd ~/.cortex/agent-node && bash run.sh > ~/.cortex/agent.out.log 2> ~/.cortex/agent.err.log &

# Tail agent errors locally
tail -n 100 ~/.cortex/agent.err.log

7. Mac Mini Specific Agent Debugging

Symptoms:

  • Terminal hangs immediately after a large file sync (e.g., 10MB test file).
  • Commands like ls or cd work once, then the PTY bridge deadlocks.
  • Symptoms unique to macOS: Symlinked paths (e.g., /tmp -> /private/tmp) causing relative path mismatches.

Troubleshooting Steps & Checklist:

  1. Identify PTY Deadlock: Run lsof -p <PID> | grep ptmx. If more than 2-3 PTYs are open during idle, or if the number of PTYs doesn't match the number of active terminal tabs, the reader thread has likely crashed.
  2. Resolve Symlink Mismatches: macOS handles /tmp as a symlink. Always use os.path.realpath() in the Watcher to ensure that relpath calculations are consistent between the root and the modified file.
  3. Check Queue Saturation: If the orchestrator's task_queue (default size 100) is full during a sync, unrelated PTY threads will block waiting for space. Bump maxsize to 10000 in node.py to handle 10MB+ sync bursts.
  4. Investigate Deadlocks with SIGQUIT: Send kill -3 <PID> to the agent. This will dump all active Python thread stacks to stdout (viewable in agent.out.log), allowing you to see exactly where the TTY reader is blocked.
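The SIGQUIT thread dump in step 4 works when the agent registers a handler at startup; whether the real agent uses Python's stdlib faulthandler for this, as sketched below, is an assumption.

```python
import faulthandler
import signal
import sys

# Register once at agent startup: kill -3 <PID> then dumps every
# Python thread's stack to stderr (captured in the agent logs),
# showing exactly where the PTY reader thread is blocked.
if hasattr(signal, "SIGQUIT"):  # SIGQUIT is not available on Windows
    faulthandler.register(signal.SIGQUIT, file=sys.stderr, all_threads=True)
```

Unlike kill -9, this leaves the process running, so you can inspect the dump and then decide whether a restart is needed.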

Diagnostic Commands:

# Check File Descriptors for PTY leaks on MacOS
lsof -p $(pgrep -f "agent_node/main.py") | grep ptmx

# Force a thread dump to logs for deadlock investigation
kill -3 $(pgrep -f "agent_node/main.py")

# Verify the absolute path resolution that the agent sees
/opt/anaconda3/bin/python3 -c "import os; print(os.path.realpath('/tmp/cortex-sync'))"

8. Production Health & Log Monitoring (Jerxie Prod)

A set of automation scripts is available in .agent/utils/ to monitor the production environment at ai.jerxie.com.

Prerequisites:

  • $REMOTE_HOST, $REMOTE_USER, and $REMOTE_PASSWORD must be set in the environment to allow SSH access to the production server.

Available Scripts:

  • bash .agent/utils/check_prod_health.sh: Checks AI Hub connectivity, node registration status, and internal browser service reachability.
  • bash .agent/utils/check_prod_status.sh: Lists all running Docker containers on the production server.
  • bash .agent/utils/get_prod_logs_hub.sh: Tails the last 200 lines of the ai_hub_service logs.
  • bash .agent/utils/get_prod_logs_browser.sh: Tails the last 200 lines of the cortex_browser_service logs.

When to use:

  • Node appearing 'Offline' in UI: Run check_prod_health.sh to see if the Hub registration endpoint is receiving heartbeats.
  • Browser actions failing: Run get_prod_logs_browser.sh to see if Playwright is encountering resource issues or container crashes.
  • Deployment verification: After running remote_deploy.sh, use check_prod_status.sh to ensure all services are marked as Up.