
Code Review Report: Feature 12 — gRPC Orchestration & Mesh Control

This report is a deep-dive audit of the Hub's gRPC server and mesh coordination layer (grpc_server.py), focusing on 12-Factor App compliance, concurrency safety, and distributed resource management.


🏗️ 12-Factor App Compliance Audit

| Factor | Status | Observation |
| --- | --- | --- |
| VI. Processes | 🔴 Major Issue | **Memory Leak in IO Locks:** `self.io_locks` (Line 31) stores a mutex lock for every active file sync. These entries are purged only when a final chunk (`is_final=True`) is processed (Line 380). If a mesh node disconnects or crashes mid-sync (common on poor Wi-Fi), the Hub permanently leaks a `threading.Lock` object in memory. Over months of operation this leads to gradual performance degradation. |
| XII. Admin Tasks | 🟡 Warning | **Embedded Management Loops:** The `MeshMonitor` and `MirrorCleanup` tasks are embedded directly as threads inside the `AgentOrchestrator` servicer (Lines 37-40). These are effectively admin tasks (Factor XII) that should be decoupled from the core gRPC protocol handler for better testability and crash isolation. |
| IX. Disposability | ✅ Success | **Journal Failure Recovery:** The Hub correctly calls `self.journal.fail_node_tasks(node_id)` (Line 292) when a gRPC stream terminates, ensuring that pending AI sub-tasks fail immediately with a diagnostic message rather than hanging indefinitely. |
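The disposability pattern praised above amounts to a `try`/`finally` around the stream consumer. A minimal sketch (the `Journal` stub below is hypothetical; only the `fail_node_tasks(node_id)` call matches the real code):

```python
import threading

class Journal:
    """Minimal stand-in for the Hub's task journal. Everything except the
    fail_node_tasks name is a hypothetical simplification."""
    def __init__(self):
        self.failed_nodes = set()
        self._lock = threading.Lock()

    def fail_node_tasks(self, node_id):
        # Mark every pending sub-task for this node as failed.
        with self._lock:
            self.failed_nodes.add(node_id)

def handle_node_stream(journal, node_id, stream):
    """Consume a node's gRPC stream; fail its pending tasks on any exit path."""
    try:
        for _message in stream:
            pass  # dispatch the message to the orchestrator here
    finally:
        # Runs on clean close, RpcError, and crash alike: pending sub-tasks
        # fail fast with a diagnostic instead of hanging indefinitely.
        journal.fail_node_tasks(node_id)
```

Because the cleanup lives in `finally`, a mid-sync Wi-Fi drop and a graceful shutdown take the same recovery path.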

🔍 File-by-File Diagnostic

1. app/core/grpc/services/grpc_server.py

The bidirectional highway for AI-to-Node communication.

> [!CAUTION]
> **Distributed Handshake Deadlock (LFI Risk), Line 176:** `SyncConfiguration` performs a direct database lookup to validate invite tokens. The switch from HTTP to direct DB calls (M6) prevented one class of deadlock, but running complex, potentially slow SQL queries during the initial gRPC handshake can still block the gRPC thread pool, reducing the Hub's ability to accept new connections during DB latency spikes (e.g., during NFS backups).
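One mitigation is to isolate token validation on a small dedicated executor with a timeout, so a slow query saturates that pool rather than the handshake threads. This is a sketch under assumptions, not the Hub's current code; `validate_invite_token` and `lookup_fn` are hypothetical names:

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Small dedicated pool: a slow query ties up these four threads,
# not the gRPC handshake workers.
_db_pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="db-validate")

def validate_invite_token(lookup_fn, token, timeout_s=2.0):
    """Run lookup_fn(token) off the gRPC thread. Returning None means
    'DB too slow, reject the handshake' rather than blocking the server."""
    future = _db_pool.submit(lookup_fn, token)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        future.cancel()  # best effort; the query thread may still finish
        return None
```

The timeout converts an unbounded stall into a fast, retryable handshake rejection, which is usually the better failure mode during an NFS backup window.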

Identified Problems:

  • Synchronous File Logging: `_monitor_mesh` (Line 63) performs a blocking `with open(...)` write every 30 seconds. This disk I/O occurs on the main Hub thread pool, which should be reserved for low-latency network packet handling.
  • Permissive Sandbox Defaults: `_build_sandbox_policy` (Line 134) defaults unconfigured nodes to PERMISSIVE mode. While user-friendly, this violates the "Secure by Default" principle.
  • Implicit TTL Enforcement: The periodic lock cleanup (Line 93) is a useful safety net, but it does not address the `self.io_locks` memory leak directly.

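The synchronous-logging problem has a standard-library fix: route writes through `logging.handlers.QueueHandler`/`QueueListener` so disk I/O happens on a single background thread. A sketch (the file path and logger name are illustrative, not the Hub's actual values):

```python
import logging
import logging.handlers
import queue

def build_async_mesh_logger(path, name="mesh.monitor"):
    """Route mesh-status writes through a queue so disk I/O happens on one
    background thread instead of the gRPC worker pool."""
    log_queue = queue.Queue(-1)  # unbounded; records are tiny status lines
    file_handler = logging.FileHandler(path)
    file_handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    listener = logging.handlers.QueueListener(log_queue, file_handler)
    listener.start()  # spawns the single writer thread

    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    return logger, listener  # call listener.stop() on Hub shutdown
```

With this in place, `_monitor_mesh` only enqueues a record every 30 seconds; the `open`/`write` cost never touches the gRPC thread pool.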
🛠️ Summary Recommendations

  1. Harden IO Lock Registry: Implement a WeakValueDictionary or an explicit TTL-based cleanup for self.io_locks to prevent memory leaks from abandoned sync operations.
  2. Decouple Management Tasks: Move the mesh monitoring and mirror cleanup threads to the centralized AgentScheduler or a dedicated background worker class.
  3. Strict Sandbox Defaults: Change the default sandbox mode to STRICT for unconfigured nodes, requiring an explicit admin toggle to unlock "PERMISSIVE" bash execution on a new node.
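Recommendation 1 could look like the sketch below. Note that in CPython a bare `threading.Lock` cannot be weakly referenced, so a `WeakValueDictionary` would require a wrapper object; a timestamped registry with periodic purging is simpler and bounds the leak either way. All names here are hypothetical:

```python
import threading
import time

class LockRegistry:
    """TTL-based replacement for a bare self.io_locks dict (sketch)."""

    def __init__(self, ttl_s=3600):
        self._ttl = ttl_s
        self._guard = threading.Lock()
        self._entries = {}  # sync_id -> (lock, last_touched)

    def acquire(self, sync_id):
        """Return the per-sync lock, refreshing its last-touched time."""
        with self._guard:
            lock, _ = self._entries.get(sync_id, (threading.Lock(), None))
            self._entries[sync_id] = (lock, time.monotonic())
        return lock

    def release(self, sync_id):
        """Explicit purge when the final chunk (is_final=True) lands."""
        with self._guard:
            self._entries.pop(sync_id, None)

    def purge_stale(self):
        """Safety net for abandoned syncs (node crashed mid-transfer)."""
        cutoff = time.monotonic() - self._ttl
        with self._guard:
            stale = [k for k, (_, t) in self._entries.items() if t < cutoff]
            for k in stale:
                del self._entries[k]
        return len(stale)
```

Calling `purge_stale()` from the existing cleanup loop (Line 93) would turn the current implicit safety net into an explicit bound on abandoned-lock memory.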

This concludes Feature 12. I have persisted this report to /app/docs/reviews/feature_review_grpc_mesh_control.md. Which of these gRPC hardening tasks should we investigate further?