
description: Troubleshooting Guide for File Sync and Node Issues

Troubleshooting Guide: Sync & Node Debugging

This workflow documents common issues encountered with AI Hub file synchronization and node management, along with the strategies and scripts used to diagnose and resolve them.

1. File Synchronization Dropping Large Files

Symptoms:

  • A node successfully creates a large file (e.g., using dd or cp).
  • The file does not appear on the central AI Hub mirror.
  • Other connected nodes do not receive the file chunk broadcasts.

Troubleshooting Steps & Checklist:

  1. Check proto3 field presence: read payload.offset directly instead of calling payload.HasField("offset"). Proto3 does not track presence for non-optional scalar fields such as int64, so .HasField raises a ValueError for them.
  2. Verify temporary-file filtering: ensure both the workspace watcher and the Hub mirror actively ignore files ending in .cortex_tmp and .cortex_lock to prevent recursion loops (Sync Echo).
  3. Validate "empty workspace" initialization: when creating source="empty" workspaces, verify that the AI Hub actually sends a START_WATCHING gRPC signal to the target nodes. Without this signal, nodes create the empty directory but the watchdog daemon never attaches to it, resulting in zero outbound file syncs.
  4. Ensure the target node receives on_closed events for large files. Writing a 10 MB file via dd triggers 60+ on_modified events, which can occasionally choke the stream; forcing a full sync on on_closed is more reliable.
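The filtering and forced-sync points above can be sketched as a standalone handler. This is an illustrative sketch, not the Hub's actual watcher code: it mirrors watchdog's FileSystemEventHandler interface (on_modified / on_closed) without importing the library, and queue_sync is a hypothetical callback into the sync pipeline. Note that watchdog only emits on_closed from inotify-backed (Linux) observers.

```python
IGNORED_SUFFIXES = (".cortex_tmp", ".cortex_lock")

def should_ignore(path: str) -> bool:
    """Skip sync-internal temp/lock files to avoid Sync Echo recursion."""
    return path.endswith(IGNORED_SUFFIXES)

class SyncHandler:
    """Mirrors watchdog's FileSystemEventHandler on_modified/on_closed hooks."""

    def __init__(self, queue_sync):
        self.queue_sync = queue_sync  # hypothetical callback into the sync pipeline

    def on_modified(self, event):
        # A large write (e.g. a 10 MB dd) emits dozens of on_modified events;
        # treat them as best-effort hints, not the authoritative sync trigger.
        if not event.is_directory and not should_ignore(event.src_path):
            self.queue_sync(event.src_path, final=False)

    def on_closed(self, event):
        # Fired once when the writer closes the file (inotify/Linux only):
        # the reliable point to force a full sync of a large file.
        if not should_ignore(event.src_path):
            self.queue_sync(event.src_path, final=True)
```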

2. Server Root File Ownership (Jerxie Prod)

Symptoms:

  • Agent nodes sync files to the Hub mirror directory successfully.
  • The host user (UID 1000) is unable to manipulate or delete these files directly from the host machine because the container writes them as root.

Troubleshooting Steps & Checklist:

  1. Capture parent-directory ownership in the Mirror: during the atomic swap phase on the AI Hub (os.replace()), take os.stat() of the destination's parent directory.
  2. Apply os.chown(tmp_path, parent_stat.st_uid, parent_stat.st_gid) to the .cortex_tmp file immediately before the final swap. This ensures the host user retains ownership of all synced data on NFS/mounted volumes.
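Steps 1–2 combined, as a minimal sketch; atomic_mirror_write is an illustrative helper name, and the surrounding mirror plumbing (chunk reassembly into the .cortex_tmp file) is assumed:

```python
import os

def atomic_mirror_write(tmp_path: str, final_path: str) -> None:
    """Swap a fully written .cortex_tmp file into place, matching the
    ownership of the destination's parent directory (e.g. host UID 1000)."""
    parent_stat = os.stat(os.path.dirname(final_path))
    # Re-own the temp file before the swap so the final file is never
    # root-owned, even transiently, on NFS/mounted volumes.
    os.chown(tmp_path, parent_stat.st_uid, parent_stat.st_gid)
    os.replace(tmp_path, final_path)  # atomic on the same filesystem
```

Running as root inside the container, the chown succeeds for any target UID; run unprivileged, it is a no-op when the parent is already owned by the current user.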

3. Ghost Nodes Attaching to New Sessions

Symptoms:

  • An old or offline node (e.g., synology-nas) keeps automatically attaching itself to newly created sessions.
  • Removing it from the UI or closing previous sessions does not resolve the issue.

Troubleshooting Steps & Checklist:

  • Verify the Postgres User preferences schema. In the Cortex Hub, default node attachments are tied to the user profile, not just individual prior sessions.
  • Run a surgical database manipulation script directly on the backend to erase the ghost node from the default_node_ids array.

Example Surgical Database Script:

```python
import sys
sys.path.append("/app")
from app.db.session import get_db_session
from app.db.models import User, Session
from sqlalchemy.orm.attributes import flag_modified

try:
    with get_db_session() as db:
        # Find the specific user and update their preferences dict
        users = db.query(User).all()
        for u in users:
            prefs = u.preferences or {}
            nodes = prefs.get("nodes", {})
            defaults = nodes.get("default_node_ids", [])
            if "synology-nas" in defaults:
                defaults.remove("synology-nas")
                nodes["default_node_ids"] = defaults
                prefs["nodes"] = nodes
                u.preferences = prefs
                flag_modified(u, "preferences")

        # Clean up any already corrupted sessions
        sessions = db.query(Session).filter(Session.sync_workspace_id == "YOUR_SESSION_ID").all()
        for s in sessions:
            attached = s.attached_node_ids or []
            if "synology-nas" in attached:
                attached.remove("synology-nas")
                s.attached_node_ids = attached
                flag_modified(s, "attached_node_ids")

        db.commit()
        print("Surgical cleanup complete.")
except Exception as e:
    print(f"Error: {e}")
```

Useful Diagnostic SSH Commands

Check Agent Watcher Logs

```bash
docker logs cortex-test-1 2>&1 | grep "📁👁️"
```

Trace File Sync on the Hub

```bash
docker logs ai_hub_service | grep "dd_test_new_live.bin"
```

Validate Container Mount Ownership

```bash
echo 'your-password' | sudo -S ls -la /var/lib/docker/volumes/cortex-hub_ai_hub_data/_data/mirrors/session-YOUR_SESSION
```

4. Unresponsive Terminal (Zombie PTY Processes on Local Nodes)

Symptoms:

  • Agent node shows tasks being received in its logs.
  • Executing bash commands via the AI mesh terminal hangs indefinitely or outputs nothing.
  • The ThreadPoolExecutor or PTY bridge is deadlocked by a previous broken shell session.

Troubleshooting Steps & Checklist:

  1. Check the active Python process list to locate the running agent-node/main.py.
  2. Inspect the agent stdout and stderr logs locally (e.g. ~/.cortex/agent.out.log and ~/.cortex/agent.err.log) for any errors or blocked threads.
  3. If the process is hung on a zombie child shell, force-kill (kill -9) the main agent process AND all its recursive child processes (to properly clean up spawned PTY shells).
  4. Restart the agent using bash run.sh and redirect logs appropriately. Note: The agent node logic (agent-node/src/agent_node/main.py) has been updated in v1.0.62 to automatically harvest and kill child PTY processes when clearing orphaned instances on boot, preventing this state across restarts.
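Step 3's recursive kill can be sketched in pure Python by building a pid→children map from `ps` output (`ps -axo pid=,ppid=` works on both macOS and Linux); the function names are illustrative, not part of the agent codebase:

```python
import os
import signal
import subprocess

def pid_tree(ps_lines):
    """Parse 'ps -axo pid=,ppid=' output lines into {ppid: [child pids]}."""
    tree = {}
    for line in ps_lines:
        pid, ppid = map(int, line.split())
        tree.setdefault(ppid, []).append(pid)
    return tree

def descendants(tree, root):
    """Collect root plus every transitive child (spawned PTY shells included)."""
    out, stack = [], [root]
    while stack:
        pid = stack.pop()
        out.append(pid)
        stack.extend(tree.get(pid, []))
    return out

def kill_agent_tree(root_pid):
    """Snapshot the process table, then kill -9 the agent and all descendants."""
    ps = subprocess.check_output(["ps", "-axo", "pid=,ppid="], text=True)
    tree = pid_tree(ps.strip().splitlines())
    for pid in descendants(tree, root_pid):
        try:
            os.kill(pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # process already exited between snapshot and kill
```

Snapshotting the tree before killing matters: once the parent dies, its children reparent to init and can no longer be found via their original ppid.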

Useful Diagnostic Commands (Local Node)

```bash
# Locate the agent process ID
pgrep -f "agent_node/main.py"

# Force restart: kill the direct child PTYs first, then the main agent process
# (once the parent dies, children reparent and pkill -P can no longer find them)
mac_pid=$(pgrep -f "agent_node/main.py") && pkill -9 -P "$mac_pid"; kill -9 "$mac_pid"

# Restart agent gracefully with output redirected to background logs
cd ~/.cortex/agent-node && bash run.sh > ~/.cortex/agent.out.log 2> ~/.cortex/agent.err.log &

# Tail agent errors locally
tail -n 100 ~/.cortex/agent.err.log
```