This workflow documents common issues encountered with AI Hub file synchronization and node management, along with the strategies and scripts used to diagnose and resolve them.
Symptoms:
- Files created or modified on a node (e.g., with dd or cp) fail to sync to the AI Hub.

Troubleshooting Steps & Checklist:
- Read payload.offset directly rather than using payload.HasField("offset") for primitive fields, as Proto3 does not reliably support .HasField for int64.
- Verify that both the node watcher and the server mirror are actively ignoring files ending in .cortex_tmp and .cortex_lock to prevent recursion loops (Sync Echo).
- For source="empty" workspaces, verify that the AI Hub actually sends a START_WATCHING gRPC signal to the target nodes. Without this signal, nodes will create the empty directory but the watchdog daemon will not attach to it, resulting in zero outbound file syncs.
- A single dd transfer triggers 60+ on_modified events, which can occasionally choke the stream. Implementing an on_closed forced-sync is more reliable.

Symptoms:
- The host user (UID 1000) is unable to manipulate or delete synced files directly from the host machine because the container writes them as root.

Troubleshooting Steps & Checklist:
- Apply os.chown in the Mirror: during the atomic swap phase on the AI Hub (os.replace()), capture the os.stat() of the parent directory, then apply os.chown(tmp_path, parent_stat.st_uid, parent_stat.st_gid) to the .cortex_tmp file immediately before the final swap. This ensures the host user retains ownership of all synced data on NFS/mounted volumes.

Symptoms:
- A stale node (synology-nas) keeps automatically attaching itself to newly created sessions.

Troubleshooting Steps & Checklist:
- Inspect the user's preferences schema. In the Cortex Hub, default node attachments are tied to the user profile, not just individual prior sessions.
- Remove the stale node ID from the default_node_ids array.

Example Surgical Database Script:
```python
import sys
sys.path.append("/app")

from app.db.session import get_db_session
from app.db.models import User, Session
from sqlalchemy.orm.attributes import flag_modified

try:
    with get_db_session() as db:
        # Find the specific user and update their preferences dict
        users = db.query(User).all()
        for u in users:
            prefs = u.preferences or {}
            nodes = prefs.get("nodes", {})
            defaults = nodes.get("default_node_ids", [])
            if "synology-nas" in defaults:
                defaults.remove("synology-nas")
                nodes["default_node_ids"] = defaults
                prefs["nodes"] = nodes
                u.preferences = prefs
                flag_modified(u, "preferences")

        # Clean up any already corrupted sessions
        sessions = db.query(Session).filter(
            Session.sync_workspace_id == "YOUR_SESSION_ID"
        ).all()
        for s in sessions:
            attached = s.attached_node_ids or []
            if "synology-nas" in attached:
                attached.remove("synology-nas")
                s.attached_node_ids = attached
                flag_modified(s, "attached_node_ids")

        db.commit()
        print("Surgical cleanup complete.")
except Exception as e:
    print(f"Error: {e}")
```
Diagnostic Commands:

```shell
# Grep the node container logs for watcher activity
docker logs cortex-test-1 2>&1 | grep "📁👁️"

# Trace a specific test file through the hub logs
docker logs ai_hub_service | grep "dd_test_new_live.bin"

# Inspect the mirrored session data on the hub's docker volume
echo 'your-password' | sudo -S ls -la /var/lib/docker/volumes/cortex-hub_ai_hub_data/_data/mirrors/session-YOUR_SESSION
```
Symptoms:
- The agent's ThreadPoolExecutor or PTY bridge is gridlocked because of a previous broken shell session.

Troubleshooting Steps & Checklist:
- Locate the agent process running agent-node/main.py.
- Check the agent logs (~/.cortex/agent.out.log and ~/.cortex/agent.err.log) for any errors or blocked threads.
- Force-kill (kill -9) the main agent process AND all its recursive child processes (to properly clean up spawned PTY shells).
- Restart with bash run.sh and redirect logs appropriately. Note: the agent node logic (agent-node/src/agent_node/main.py) has been updated in v1.0.62 to automatically harvest and kill child PTY processes when clearing orphaned instances on boot, preventing this state across restarts.

```shell
# Locate the agent process ID
pgrep -f "agent_node/main.py"

# Kill main process and all child PTYs (force restart)
mac_pid=$(pgrep -f "agent_node/main.py") && kill -9 $mac_pid

# Restart agent gracefully with output redirected to background logs
cd ~/.cortex/agent-node && bash run.sh > ~/.cortex/agent.out.log 2> ~/.cortex/agent.err.log &

# Tail agent errors locally
tail -n 100 ~/.cortex/agent.err.log
```
Symptoms:
- Commands such as ls or cd work once, then the PTY bridge deadlocks.
- On macOS, /tmp is a symlink (/tmp -> /private/tmp), causing relative path mismatches.

Troubleshooting Steps & Checklist:
- Check open PTY file descriptors with lsof -p <PID> | grep ptmx. If more than 2-3 PTYs are open while idle, or if the number of PTYs doesn't match the number of active terminal tabs, the reader thread has likely crashed.
- macOS exposes /tmp as a symlink. Always use os.path.realpath() in the Watcher to ensure that relpath calculations are consistent between the root and the modified file.
- If the task_queue (default size 100) is full during a sync, unrelated PTY threads will block waiting for space. Bump maxsize to 10000 in node.py to handle 10MB+ sync bursts.
- Send kill -3 <PID> to the agent. This will dump all active Python thread stacks to stdout (viewable in agent.out.log), allowing you to see exactly where the TTY reader is blocked.

Diagnostic Commands:
```shell
# Check file descriptors for PTY leaks on macOS
lsof -p $(pgrep -f "agent_node/main.py") | grep ptmx

# Force a thread dump to logs for deadlock investigation
kill -3 $(pgrep -f "agent_node/main.py")

# Verify the absolute path resolution that the agent sees
/opt/anaconda3/bin/python3 -c "import os; print(os.path.realpath('/tmp/cortex-sync'))"
```
A set of automation scripts is available in .agent/utils/ to monitor the production environment at ai.jerxie.com.
Prerequisites:
- $REMOTE_HOST, $REMOTE_USER, and $REMOTE_PASSWORD must be set in the environment to allow SSH access to the production server.

Available Scripts:
- bash .agent/utils/check_prod_health.sh: Checks AI Hub connectivity, node registration status, and internal browser service reachability.
- bash .agent/utils/check_prod_status.sh: Lists all running Docker containers on the production server.
- bash .agent/utils/get_prod_logs_hub.sh: Tails the last 200 lines of the ai_hub_service logs.
- bash .agent/utils/get_prod_logs_browser.sh: Tails the last 200 lines of the cortex_browser_service logs.

Usage Tips:
- Run check_prod_health.sh to see if the Hub registration endpoint is receiving heartbeats.
- Run get_prod_logs_browser.sh to see if Playwright is encountering resource issues or container crashes.
- After a remote_deploy.sh, use check_prod_status.sh to ensure all services are marked as Up.