This report performs a deep-dive audit of the Hub's gRPC server and mesh coordination layer within grpc_server.py, focusing on 12-Factor App Methodology, Concurrency Safety, and Distributed Resource Management.
| Factor | Status | Observation |
|---|---|---|
| VI. Processes | 🔴 Major Issue | Memory Leak in IO Locks: self.io_locks (Line 31) stores a mutex lock for every active file sync. These entries are only purged when a final chunk (is_final=True) is processed (Line 380). If a mesh node disconnects or crashes mid-sync (very common on poor Wi-Fi), the Hub leaks a threading.Lock object in-memory permanently. Over months of operation, this will lead to slow performance degradation. |
| XII. Admin Tasks | 🟡 Warning | Embedded Management Loops: The MeshMonitor and MirrorCleanup tasks are embedded directly as threads inside the AgentOrchestrator servicer (Lines 37-40). These are effectively "management tasks" (Factor XII) that should be decoupled from the core gRPC protocol handler for better testability and crash isolation. |
| IX. Disposability | ✅ Success | Journal Failure Recovery: The Hub correctly calls self.journal.fail_node_tasks(node_id) (Line 292) when a gRPC stream terminates, ensuring that any pending AI sub-tasks are immediately failed with a diagnostic message rather than hanging indefinitely. |
app/core/grpc/services/grpc_server.pyThe bidirectional highway for AI-to-Node communication.
[!CAUTION] Distributed Handshake Deadlock (LFI Risk) Line 176:
SyncConfigurationperforms a direct database lookup to validate invite tokens. While the switch from HTTP to direct DB calls (M6) prevented a specific type of deadlock, performing complex, potentially slow SQL queries during the initial gRPC handshake can still block the gRPC thread pool, reducing the Hub's ability to accept new connections during DB latency spikes (e.g., during NFS backups).
Identified Problems:
_monitor_mesh (Line 63) performs a blocking with open(...) write every 30 seconds. This disk I/O occurs on the main Hub thread pool, which should be reserved for low-latency network packet handling._build_sandbox_policy (Line 134) defaults unconfigured nodes to PERMISSIVE mode. While user-friendly, this violates the "Secure by Default" principle.self.io_locks directly.WeakValueDictionary or an explicit TTL-based cleanup for self.io_locks to prevent memory leaks from abandoned sync operations.AgentScheduler or a dedicated background worker class.STRICT for unconfigured nodes, requiring an explicit admin toggle to unlock "PERMISSIVE" bash execution on a new node.This concludes Feature 12. I have persisted this report to /app/docs/reviews/feature_review_grpc_mesh_control.md. Which of these gRPC hardening tasks should we investigate further?