
# Code Review Report: Feature 13 — Distributed File Sync Engine (Ghost Mirror)

This report is a deep-dive audit of the file synchronization and "Ghost Mirror" architecture in `mirror.py`, focusing on 12-Factor App methodology, NFS performance, and filesystem security.


## 🏗️ 12-Factor App Compliance Audit

| Factor | Status | Observation |
| --- | --- | --- |
| VI. Processes | 🔴 Performance Hazard | **Volatile Hash Cache:** `self.hash_cache` (Line 17) is kept in memory only. On a Hub restart (deployment or crash), all file hashes are lost, and the next reconciliation cycle must recursively re-hash every mirrored file. On NFS-backed storage, this "re-hashing wave" causes catastrophic I/O spikes and can delay mesh synchronization by minutes. |
| IX. Disposability | ✅ Success | **Stale Lock Reclamation:** The Hub implements rigorous background cleanup (`purge_stale_locks`, Line 513) so that orphaned `.cortex_lock` and `.cortex_tmp` files from crashed transfers do not accumulate indefinitely. |
| X. Dev/Prod Parity | 🟡 Warning | **Local Path Logic:** The use of `os.chown` (Line 100) with `parent_stat.st_uid` assumes the Hub process has sufficient permissions to change file ownership on the underlying filesystem. This works in a privileged dev container but may fail in restrictive rootless Kubernetes/Docker production environments. |
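The Factor VI hazard can be addressed by backing the cache with a small on-disk store. Below is a minimal sketch of such a store, assuming a hypothetical `hashes.db` SQLite file with `(path, mtime, size, sha256)` rows and mtime/size-based invalidation; the class and method names are illustrative, not taken from `mirror.py`.

```python
import hashlib
import os
import sqlite3


class PersistentHashCache:
    """Hash cache that survives Hub restarts (illustrative sketch).

    A cache entry is reused only if the file's mtime and size are
    unchanged, so stale hashes are never served after an edit.
    """

    def __init__(self, db_path):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS hashes ("
            "path TEXT PRIMARY KEY, mtime REAL, size INTEGER, sha256 TEXT)"
        )

    def get_or_hash(self, path):
        st = os.stat(path)
        row = self.conn.execute(
            "SELECT mtime, size, sha256 FROM hashes WHERE path = ?", (path,)
        ).fetchone()
        if row and row[0] == st.st_mtime and row[1] == st.st_size:
            return row[2]  # cache hit: no re-read from NFS
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MB buffers
                h.update(chunk)
        digest = h.hexdigest()
        self.conn.execute(
            "INSERT OR REPLACE INTO hashes VALUES (?, ?, ?, ?)",
            (path, st.st_mtime, st.st_size, digest),
        )
        self.conn.commit()
        return digest
```

After a restart, reconciliation only re-hashes files whose mtime or size changed, which is what makes sub-second warm starts plausible.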

## 🔍 File-by-File Diagnostic

### 1. `app/core/grpc/core/mirror.py`

The synchronization engine that bridges Hub-side storage and Mesh-node workspaces.

> [!CAUTION]
> **I/O Starvation Risk (Reconciliation)**
> Line 485: `with ThreadPoolExecutor(max_workers=32) as executor:`
> The reconciliation process uses a high thread count for hash verification. While good for parallelism, 32 threads simultaneously performing 1 MB buffered reads from NFS storage can starve other Hub services (such as the database or API) of I/O. **Fix:** Externalize `hash_cache` to a persistent SQLite or Redis store to eliminate redundant re-hashing across Hub reboots.
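Even with a persistent cache, the worker count should be tunable per storage backend rather than hardcoded. A minimal sketch, assuming a hypothetical `CORTEX_HASH_WORKERS` setting (the recommendation below places it in `app/config.py`; an environment variable is used here only to keep the example self-contained):

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Hypothetical tuning knob; mirror.py currently hardcodes 32.
# A low default is friendlier to NFS; SSD-backed Hubs can raise it.
HASH_WORKERS = int(os.environ.get("CORTEX_HASH_WORKERS", "8"))


def reconcile_hashes(paths, hash_fn):
    """Hash files with a configurable worker count (illustrative sketch,
    not the actual mirror.py reconciliation API)."""
    with ThreadPoolExecutor(max_workers=HASH_WORKERS) as executor:
        return dict(zip(paths, executor.map(hash_fn, paths)))
```

Operators can then throttle the reconciliation fan-out without a code change when the Hub shares an NFS mount with latency-sensitive services.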

**Identified Problems:**

- **Symlink Promotion Safety:** `_check_skill_promotion` (Line 177) automatically moves directories via `shutil.move` (Line 219). If the target directory already exists or contains broken symlinks, this operation can fail silently or cause data loss in the global skill registry.
- **Incomplete Immutability Lock:** The block on `.skills/` (Line 59) is robust for symlinks, but it does not prevent a node from "shadowing" a system skill by creating a physical directory of the same name *before* the Hub can symlink it.
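The shadowing gap can be closed with a pre-flight check before the Hub installs a skill symlink. A minimal sketch, assuming a hypothetical helper (none of these names exist in `mirror.py`):

```python
import os


def can_install_skill_link(skills_dir, name):
    """Return True if the Hub may safely place a system-skill symlink.

    A pre-existing *physical* file or directory at the target path means
    a node is shadowing the skill name, and installation must be refused
    (illustrative sketch, not the actual mirror.py logic).
    """
    target = os.path.join(skills_dir, name)
    if os.path.islink(target):
        return True   # already a Hub-managed symlink; safe to replace
    if os.path.exists(target):
        return False  # physical entry is shadowing the system skill
    return True       # nothing there yet; safe to link
```

Running this check (and rejecting the node's directory) before symlinking makes shadowing an explicit, loggable failure instead of a silent precedence bug.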

## 🛠️ Summary Recommendations

1. **Persistent Hash Cache:** Migrate the in-memory `hash_cache` to a local persistent store (e.g., `DATA_DIR/hashes.db`) to enable sub-second reconciliation after Hub restarts.
2. **Configurable Worker Pool:** Move the hardcoded 32-thread limit for hashing/reconciling into `app/config.py` to allow tuning for specific storage backends (SSD vs. NFS).
3. **Atomic Promotion:** Ensure `_check_skill_promotion` uses a transactional move strategy (check → temp → rename) to prevent race conditions during concurrent skill creation.
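Recommendation 3 can be sketched as follows. This is an illustrative check → temp → rename sequence, not `_check_skill_promotion`'s real signature; it relies on `os.rename` being atomic within a single filesystem, and reuses the report's `.cortex_tmp` naming convention for the staging directory.

```python
import os
import shutil
import tempfile


def promote_atomically(src_dir, registry_dir, name):
    """Promote a skill directory into the registry transactionally.

    Readers of the registry never observe a half-copied skill: the copy
    happens in a hidden staging directory on the same filesystem, and
    the final os.rename is atomic (illustrative sketch).
    """
    final = os.path.join(registry_dir, name)
    if os.path.exists(final):
        raise FileExistsError(f"skill {name!r} already promoted")
    # Stage inside registry_dir so the final rename stays on one filesystem.
    staging = tempfile.mkdtemp(dir=registry_dir, prefix=".cortex_tmp_")
    tmp = os.path.join(staging, name)
    try:
        shutil.copytree(src_dir, tmp, symlinks=True)
        os.rename(tmp, final)  # atomic within one filesystem
    finally:
        shutil.rmtree(staging, ignore_errors=True)
    return final
```

A concurrent second promotion of the same name either fails the up-front existence check or loses the `os.rename` race, so the registry never ends up with merged or partial contents.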

This concludes Feature 13. I have persisted this report to /app/docs/reviews/feature_review_mirror_sync.md. Which of these storage optimizations should we prioritize?