This workflow documents common issues encountered with AI Hub file synchronization and node management, along with the strategies and scripts used to diagnose and resolve them.
Symptoms:
- Files created in an empty synced workspace (e.g. via dd or cp) never sync upward to the Hub.

Troubleshooting Steps & Checklist:
- When validating chunk offsets, read payload.offset directly rather than using payload.HasField("offset") for primitive fields, as Proto3 does not reliably support .HasField for int64.
- Confirm the node watcher and the Server .mirror are actively ignoring files ending in .cortex_tmp and .cortex_lock to prevent recursion loops (Sync Echo).
- For source="empty" workspaces, verify that the AI Hub actually sends a START_WATCHING gRPC signal to the target nodes. Without this signal, nodes will create the empty directory but the watchdog daemon will not attach to it, resulting in zero outbound file syncs.
- Note that dd triggers 60+ on_modified events, which can occasionally choke the stream. Implementing an on_closed forced-sync is more reliable.
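The offset-handling rule above can be sketched in pure Python. FileChunk and apply_chunk below are illustrative stand-ins, not the actual gRPC message or Hub code: a proto3 int64 field has no presence bit, so an absent offset is indistinguishable from an explicit 0, and 0 must be treated as a valid "start of file" value rather than gated behind HasField or truthiness.

```python
from dataclasses import dataclass

@dataclass
class FileChunk:
    # Stand-in for the generated proto3 message: an absent int64
    # field is indistinguishable from an explicit 0.
    offset: int = 0
    data: bytes = b""

def apply_chunk(buffer: bytearray, chunk: FileChunk) -> None:
    # Read offset directly; 0 is a valid value meaning "start of file",
    # so never gate on HasField() or the truthiness of the offset.
    end = chunk.offset + len(chunk.data)
    if len(buffer) < end:
        buffer.extend(b"\x00" * (end - len(buffer)))
    buffer[chunk.offset:end] = chunk.data

buf = bytearray()
apply_chunk(buf, FileChunk(offset=0, data=b"hello "))  # offset 0 is legal
apply_chunk(buf, FileChunk(offset=6, data=b"world"))
print(bytes(buf))  # b'hello world'
```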
Symptoms:
- A new file (created via echo, the UI, or cp) syncs upward perfectly, but deletes itself globally a few seconds later.
- Logs show [📁⚠️] Watcher EMITTING DELETE to SERVER and [📁🗑️] Delete Fragment.

Troubleshooting Steps & Checklist:
- Chunk validation: if the stream handler checks "if not chunk" without validating index > 0, the Hub never learns about the file initially, and an aggressively executed handle_manifest remote deletion broadcast scrubs the new file.
- FSEvents os.replace behavior: macOS's native kernel watcher queues file deletion events from os.remove, but occasionally delivers them hundreds of milliseconds after a new network write has arrived at the exact same path.
- Because those FSEvents are delayed, when the on_deleted event is evaluated against the last_sync cache it finds the newly calculated hash instead of the suppressed __DELETED__ lock, and therefore assumes the user actively erased the file by hand again.
- Fix: watcher.py validates disk existence explicitly before broadcasting deletions: if os.path.exists(real_src) or os.path.lexists(event.src_path): return. This decisively breaks the macOS recursive deletion loop.
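A minimal sketch of the existence guard quoted above; should_broadcast_delete is an illustrative wrapper name (the real check lives inline in watcher.py's on_deleted handler):

```python
import os
import tempfile

def should_broadcast_delete(src_path: str) -> bool:
    # Guard against stale macOS FSEvents: if the path still exists on
    # disk (a network write re-created it after the queued os.remove),
    # the deletion event is stale and must not be broadcast.
    real_src = os.path.realpath(src_path)
    if os.path.exists(real_src) or os.path.lexists(src_path):
        return False
    return True

# Demo: a file that still exists suppresses the delete broadcast.
fd, path = tempfile.mkstemp()
os.close(fd)
print(should_broadcast_delete(path))  # False: still on disk, stale event
os.remove(path)
print(should_broadcast_delete(path))  # True: genuinely gone
```

The os.path.lexists check matters for dangling symlinks, which os.path.exists reports as absent even though the link itself is still on disk.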
Symptoms:
- .cortex_tmp and .cortex_lock files appear and disappear infinitely.
- Logs repeat Streaming Sync Started and Async File Sync Complete for the same file.

Troubleshooting Steps & Checklist:
- Hash short-circuit: watcher.py calculates the file hash before attempting to send chunks. If the hash matches last_sync, the modification event was likely an "echo" from a previous network write and must be ignored.
- FSEvents can deliver on_modified events up to several seconds after a write. Even if the path was suppressed during the actual write, the event might arrive after unsuppression. A pre-emptive hash verification is the most robust defense.
- Ensure all event handlers (on_created, on_modified, on_closed) explicitly return if the filename ends in .cortex_tmp or .cortex_lock. This prevents the watcher from trying to sync the synchronization metadata itself.
Symptoms:
- The host user (UID 1000) is unable to manipulate or delete synced files directly from the host machine because the container writes them as root.

Troubleshooting Steps & Checklist:
- os.chown in the Mirror: during the atomic swap phase on the AI Hub (os.replace()), capture the os.stat() of the parent directory.
- Apply os.chown(tmp_path, parent_stat.st_uid, parent_stat.st_gid) to the .cortex_tmp file immediately before the final swap. This ensures the host user retains ownership of all synced data on NFS/mounted volumes.
Symptoms:
- A previously used node (synology-nas) keeps automatically attaching itself to newly created sessions.

Troubleshooting Steps & Checklist:
- Inspect the user preferences schema. In the Cortex Hub, default node attachments are tied to the user profile, not just individual prior sessions.
- Remove the stale node ID from the default_node_ids array.

Example Surgical Database Script:
```python
import sys
sys.path.append("/app")

from app.db.session import get_db_session
from app.db.models import User, Session
from sqlalchemy.orm.attributes import flag_modified

try:
    with get_db_session() as db:
        # Find the specific user and update their preferences dict
        users = db.query(User).all()
        for u in users:
            prefs = u.preferences or {}
            nodes = prefs.get("nodes", {})
            defaults = nodes.get("default_node_ids", [])
            if "synology-nas" in defaults:
                defaults.remove("synology-nas")
                nodes["default_node_ids"] = defaults
                prefs["nodes"] = nodes
                u.preferences = prefs
                flag_modified(u, "preferences")

        # Clean up any already corrupted sessions
        sessions = db.query(Session).filter(Session.sync_workspace_id == "YOUR_SESSION_ID").all()
        for s in sessions:
            attached = s.attached_node_ids or []
            if "synology-nas" in attached:
                attached.remove("synology-nas")
                s.attached_node_ids = attached
                flag_modified(s, "attached_node_ids")

        db.commit()
        print("Surgical cleanup complete.")
except Exception as e:
    print(f"Error: {e}")
```
```shell
# Confirm the node watcher attached to the workspace
docker logs cortex-test-1 2>&1 | grep "📁👁️"
# Confirm the Hub received the test file
docker logs ai_hub_service | grep "dd_test_new_live.bin"
# Inspect ownership of the mirrored session directory on the Hub volume
echo 'your-password' | sudo -S ls -la /var/lib/docker/volumes/cortex-hub_ai_hub_data/_data/mirrors/session-YOUR_SESSION
```
Symptoms:
- Commands sent to an agent node hang: the ThreadPoolExecutor or PTY bridge is gridlocked because of a previous broken shell session.

Troubleshooting Steps & Checklist:
- Verify the agent process is running: agent-node/main.py.
- Check the local agent logs (~/.cortex/agent.out.log and ~/.cortex/agent.err.log) for any errors or blocked threads.
- Force-kill (kill -9) the main agent process AND all its recursive child processes (to properly clean up spawned PTY shells).
- Restart with bash run.sh and redirect logs appropriately. Note: the agent node logic (agent-node/src/agent_node/main.py) has been updated in v1.0.62 to automatically harvest and kill child PTY processes when clearing orphaned instances on boot, preventing this state across restarts.

```shell
# Locate the agent process ID
pgrep -f "agent_node/main.py"
# Kill main process and all child PTYs (force restart)
mac_pid=$(pgrep -f "agent_node/main.py") && kill -9 $mac_pid
# Restart agent gracefully with output redirected to background logs
cd ~/.cortex/agent-node && bash run.sh > ~/.cortex/agent.out.log 2> ~/.cortex/agent.err.log &
# Tail agent errors locally
tail -n 100 ~/.cortex/agent.err.log
```
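The v1.0.62 child-harvesting behavior can be approximated with a process-group kill. This is a sketch, not the actual agent code, and it assumes the shell sessions were spawned in their own session (and therefore their own process group):

```python
import os
import signal
import subprocess

def kill_agent_tree(pid: int) -> None:
    # SIGKILL the whole process group so orphaned PTY children
    # die together with the agent process instead of being re-parented.
    os.killpg(os.getpgid(pid), signal.SIGKILL)

# Demo: spawn a throwaway process in its own session, then reap it.
proc = subprocess.Popen(["sleep", "60"], start_new_session=True)
kill_agent_tree(proc.pid)
proc.wait()
print(proc.returncode)  # -9 (terminated by SIGKILL)
```

start_new_session=True is the same isolation the kill relies on: because the child leads its own process group, killpg cannot accidentally take down the caller.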
Symptoms:
- Simple commands like ls or cd work once, then the PTY bridge deadlocks.
- Path confusion on macOS (/tmp -> /private/tmp) causing relative path mismatches.

Troubleshooting Steps & Checklist:
- Check for PTY leaks: lsof -p <PID> | grep ptmx. If more than 2-3 PTYs are open while idle, or if the number of PTYs doesn't match the number of active terminal tabs, the reader thread has likely crashed.
- macOS resolves /tmp as a symlink. Always use os.path.realpath() in the Watcher to ensure that relpath calculations are consistent between the root and the modified file.
- If the task_queue (default size 100) is full during a sync, unrelated PTY threads will block waiting for space. Bump maxsize to 10000 in node.py to handle 10MB+ sync bursts.
- Send kill -3 <PID> to the agent. This dumps all active Python thread stacks to stdout (viewable in agent.out.log), letting you see exactly where the TTY reader is blocked.

Diagnostic Commands:
```shell
# Check file descriptors for PTY leaks on macOS
lsof -p $(pgrep -f "agent_node/main.py") | grep ptmx
# Force a thread dump to logs for deadlock investigation
kill -3 $(pgrep -f "agent_node/main.py")
# Verify the absolute path resolution that the agent sees
/opt/anaconda3/bin/python3 -c "import os; print(os.path.realpath('/tmp/cortex-sync'))"
```
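The realpath rule above can be demonstrated with a symlinked root, mimicking macOS's /tmp -> /private/tmp alias; stable_relpath is an illustrative name for the pattern, not an actual Watcher function:

```python
import os
import tempfile

def stable_relpath(path: str, root: str) -> str:
    # Resolve symlinks on BOTH sides before computing relpath, so a
    # watcher rooted at /tmp and events arriving via /private/tmp
    # agree on the same relative path.
    return os.path.relpath(os.path.realpath(path), os.path.realpath(root))

# Build a symlinked root like macOS's /tmp -> /private/tmp.
base = tempfile.mkdtemp()
real_root = os.path.join(base, "real")
os.mkdir(real_root)
link_root = os.path.join(base, "link")
os.symlink(real_root, link_root)
target = os.path.join(real_root, "file.txt")
open(target, "w").close()

# Naive lexical relpath disagrees depending on which alias produced the path:
print(os.path.relpath(target, link_root))  # wanders through ".."
# Resolving both sides restores the expected answer:
print(stable_relpath(os.path.join(link_root, "file.txt"), real_root))  # file.txt
```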
A set of automation scripts is available in .agent/utils/ to monitor the production environment at ai.jerxie.com.

Prerequisites:
- $REMOTE_HOST, $REMOTE_USER, and $REMOTE_PASSWORD must be set in the environment to allow SSH access to the production server.

Available Scripts:
- bash .agent/utils/check_prod_health.sh: checks AI Hub connectivity, node registration status, and internal browser service reachability.
- bash .agent/utils/check_prod_status.sh: lists all running Docker containers on the production server.
- bash .agent/utils/get_prod_logs_hub.sh: tails the last 200 lines of the ai_hub_service logs.
- bash .agent/utils/get_prod_logs_browser.sh: tails the last 200 lines of the cortex_browser_service logs.

Usage Notes:
- If nodes appear offline, run check_prod_health.sh to see whether the Hub registration endpoint is receiving heartbeats.
- If browser automation fails, run get_prod_logs_browser.sh to see whether Playwright is encountering resource issues or container crashes.
- After remote_deploy.sh, use check_prod_status.sh to ensure all services are marked as Up.