
# Code Review Report: Feature 16 — Mesh Orchestration Assistant

This report is a deep-dive audit of the orchestration "Brain" — the `TaskAssistant` service within `assistant.py`. It focuses on 12-Factor App Methodology, Mesh Scalability, and Synchronization Performance.


## 🏗️ 12-Factor App Compliance Audit

| Factor | Status | Observation |
|--------|--------|-------------|
| VI. Processes | 🟡 Warning | **Ephemeral Membership Cache:** The `memberships` map (Line 24) is in-memory. While the Hub correctly implements a `reconcile_node` (Line 136) process to rebuild this state from the database upon node reconnection, a Hub crash during an active sync wave could cause temporary "broadcast orphans" until reconciliation completes. |
| IX. Disposability | ✅ Success | **Optimized File Streaming:** The line-rate push logic (`push_file`, Line 62) uses 4 MB gRPC-optimized chunks and zlib compression. This ensures that massive file transfers can be interrupted and resumed with minimal overhead, maintaining mesh disposability. |
| XI. Logs | ✅ Success | Event emission and logging are well-structured, providing clear visibility into "Drift Detection" and "Symlink Inversion" (skill promotion) events. |
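The chunked streaming behavior credited to `push_file` can be sketched as a generator. The helper name and the independent per-chunk compression are assumptions for illustration; the actual service pairs this with gRPC delivery, which is omitted here:

```python
import zlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB, matching the gRPC-optimized chunk size noted above


def iter_compressed_chunks(path):
    """Yield zlib-compressed 4 MB chunks of a file.

    Each chunk is compressed independently, so a transfer can be
    interrupted and resumed at any chunk boundary — the property that
    keeps the mesh disposable under Factor IX.
    """
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            yield zlib.compress(chunk)
```

A receiver simply calls `zlib.decompress` on each chunk and appends; resuming means re-requesting from the last acknowledged chunk index.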

## 🔍 File-by-File Diagnostic

### 1. `app/core/grpc/services/assistant.py`

The service responsible for translating high-level AI intents (`ls`, `cat`, `write`) into mesh-wide gRPC commands.

> [!WARNING]
> **Performance: Sequential Broadcast Hazard** — Line 218: `for nid in destinations: _send_to_node(nid)`
>
> The `broadcast_file_chunk` logic sends file data sequentially to all nodes in a session. While `node.queue` handles backpressure, a single slow node on a high-latency link will still delay the iteration for every other node in the same session. **Fix:** use a small `ThreadPoolExecutor` (e.g., 4 workers) for `broadcast_file_chunk` so that delivery to fast nodes is not throttled by a single slow node.
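A minimal sketch of the thread-pool fix, assuming a per-node delivery callable (`send_fn` stands in for `_send_to_node`, whose real signature is not shown in this report):

```python
from concurrent.futures import ThreadPoolExecutor

# Shared pool: 4 workers bounds Hub thread usage while still letting
# fast nodes receive data while a slow node is mid-send.
BROADCAST_POOL = ThreadPoolExecutor(max_workers=4)


def broadcast_file_chunk(destinations, chunk, send_fn):
    """Fan one chunk out to all destination nodes in parallel.

    A slow node occupies one worker instead of stalling the whole loop.
    Results are collected so per-node delivery errors surface to the
    caller rather than being silently dropped.
    """
    futures = [BROADCAST_POOL.submit(send_fn, nid, chunk) for nid in destinations]
    return [f.result() for f in futures]
```

Collecting `f.result()` keeps the call synchronous overall (the caller still waits for the slowest node before the *next* chunk), which preserves ordering guarantees while removing head-of-line blocking between nodes.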

**Identified Problems:**

  • **Synchronous NFS Reads:** `push_file` (Line 79) performs synchronous `with open(...)` reads. For clusters with high-concurrency file-sync needs (e.g., distributing a Docker build context), this can saturate Hub I/O and block the orchestration loop.
  • **Skill Promotion Race:** `_check_skill_promotion` (Line 177) relies on regex and `shutil.move`. If multiple nodes attempt to promote the same skill simultaneously from different sessions, a race condition occurs in the `settings.DATA_DIR/skills` folder.
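The promotion race can be closed with a filesystem-level advisory lock. This is a sketch under assumptions: the helper name `promote_skill_atomically`, the lock-file layout, and the idempotency check are all illustrative, and `fcntl` is Unix-only:

```python
import fcntl
import os
import shutil


def promote_skill_atomically(src, skills_dir, name):
    """Move a session-local skill into the shared skills dir under a lock.

    An exclusive flock on a sidecar lock file serializes concurrent
    promotions; the existence check makes the move idempotent if another
    session already promoted the same skill while we waited.
    """
    os.makedirs(skills_dir, exist_ok=True)
    lock_path = os.path.join(skills_dir, ".promotion.lock")
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks until we own the lock
        try:
            dest = os.path.join(skills_dir, name)
            if not os.path.exists(dest):  # another session may have won the race
                shutil.move(src, dest)
            return dest
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
```

If the skills directory lives on NFS, a database-level lock (as suggested in the recommendations below) is safer than `flock`, whose NFS semantics vary by mount options.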

## 🛠️ Summary Recommendations

  1. **Parallel Multi-Node Broadcast:** Refactor `broadcast_file_chunk` to use a thread pool for delivery, decoupling the sync speed of fast nodes from the latency of slow nodes.
  2. **Explicit Flush in Write:** Ensure that `write` (Line 418) explicitly calls `f.flush()` and `os.fsync()` before reporting success to the local mirror, preventing data loss during Hub power failures.
  3. **Atomic Promotion with Locks:** Wrap the `shutil.move` logic in `_check_skill_promotion` with a filesystem-level lock (or DB lock) to prevent corruption during concurrent global skill creation.
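Recommendation 2 can be sketched as a small durable-write helper. The name `durable_write` and the temp-file-plus-rename pattern are illustrative, not the service's actual code:

```python
import os


def durable_write(path, data):
    """Write bytes and force them to stable storage before reporting success.

    flush() drains Python's userspace buffer into the kernel page cache;
    fsync() asks the kernel to commit the cache to disk, guarding against
    Hub power failures. Writing to a temp file and renaming means readers
    never observe a partially written file.
    """
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic on POSIX when src and dst share a filesystem
```

For full crash safety the parent directory should also be fsynced after the rename, at the cost of one extra syscall per write.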

This concludes Feature 16. This report has been persisted to /app/docs/reviews/feature_review_mesh_assistant.md. All primary backend orchestration services have now been audited.