# Code Review Report: Feature 12 — gRPC Orchestration & Mesh Control

This report presents a deep-dive audit of the Hub's gRPC server and mesh coordination layer within `grpc_server.py`, focusing on **12-Factor App Methodology**, **Concurrency Safety**, and **Distributed Resource Management**.

---

## 🏗️ 12-Factor App Compliance Audit

| Factor | Status | Observation |
| :--- | :--- | :--- |
| **VI. Processes** | 🔴 **Major Issue** | **Memory Leak in IO Locks**: `self.io_locks` (Line 31) stores a mutex lock for every active file sync. These entries are purged only when a final chunk (`is_final=True`) is processed (Line 380). If a mesh node disconnects or crashes mid-sync (common on poor Wi-Fi), the Hub leaks a `threading.Lock` object in memory permanently. Over months of operation this causes unbounded registry growth and gradual performance degradation. |
| **XII. Admin Tasks** | 🟡 **Warning** | **Embedded Management Loops**: The `MeshMonitor` and `MirrorCleanup` tasks are embedded directly as threads inside the `AgentOrchestrator` servicer (Lines 37-40). These are effectively "management tasks" (Factor XII) that should be decoupled from the core gRPC protocol handler for better testability and crash isolation. |
| **IX. Disposability** | ✅ **Success** | **Journal Failure Recovery**: The Hub correctly calls `self.journal.fail_node_tasks(node_id)` (Line 292) when a gRPC stream terminates, ensuring that any pending AI sub-tasks are immediately failed with a diagnostic message rather than hanging indefinitely. |
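
The Processes finding above can be illustrated with a minimal sketch. The class and method names here are illustrative, not the Hub's actual code; the point is the single purge path, which only fires on the final chunk:

```python
import threading

class AgentOrchestrator:
    """Minimal sketch of the io_locks leak pattern (hypothetical names)."""

    def __init__(self):
        self.io_locks = {}  # sync_id -> threading.Lock, one per active sync

    def handle_chunk(self, sync_id, data, is_final):
        # A lock is created lazily for each sync stream.
        lock = self.io_locks.setdefault(sync_id, threading.Lock())
        with lock:
            pass  # ... write chunk to disk ...
        # The ONLY purge path: if the node disconnects before its final
        # chunk arrives, this entry is never removed and the Lock leaks.
        if is_final:
            del self.io_locks[sync_id]
```

Any stream that dies before `is_final` leaves its entry in `self.io_locks` forever.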

---

## 🔍 File-by-File Diagnostic

### 1. `app/core/grpc/services/grpc_server.py`
The bidirectional highway for AI-to-Node communication.

> [!CAUTION]
> **Distributed Handshake Deadlock (LFI Risk)**
> Line 176: `SyncConfiguration` performs a direct database lookup to validate invite tokens. While the switch from HTTP to direct DB calls (M6) prevented a specific type of deadlock, performing complex, potentially slow SQL queries during the initial gRPC handshake can still block the gRPC thread pool, reducing the Hub's ability to accept new connections during DB latency spikes (e.g., during NFS backups).
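
One hedged mitigation: run the token lookup on a small dedicated executor with a hard deadline, so a slow query fails the handshake fast instead of occupying a shared gRPC worker thread for the full DB latency. This is a sketch under assumed names (`validate_with_deadline`, `HandshakeTimeout` are not from the source); in the real servicer the timeout would map to `context.abort(DEADLINE_EXCEEDED, ...)`:

```python
import concurrent.futures

# Dedicated pool for handshake DB lookups, separate from the gRPC
# server's worker threads (sizing is illustrative).
_db_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)

class HandshakeTimeout(Exception):
    """Raised when token validation exceeds its deadline."""

def validate_with_deadline(validate_fn, token, timeout_s=2.0):
    """Run a (possibly slow) DB validation off-thread with a hard deadline."""
    future = _db_executor.submit(validate_fn, token)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The worker thread may still finish later; the RPC is aborted now.
        raise HandshakeTimeout("token validation timed out")
```

This bounds the handshake's worst-case latency but does not make the query itself cheaper; the underlying SQL should still be indexed and kept simple.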

**Identified Problems**:
*   **Synchronous File Logging**: `_monitor_mesh` (Line 63) performs a blocking `with open(...)` write every 30 seconds. This disk I/O occurs on the main Hub thread pool, which should be reserved for low-latency network packet handling.
*   **Permissive Sandbox Defaults**: The `_build_sandbox_policy` (Line 134) defaults unconfigured nodes to `PERMISSIVE` mode. While user-friendly, this violates the "Secure by Default" principle.
*   **Implicit TTL Enforcement**: The periodic lock cleanup (Line 93) is a good safety net, but it does not address the `self.io_locks` leak directly.
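
For the synchronous-logging problem, the standard library already provides an off-thread pattern: the handler thread only enqueues records, and a background listener performs the blocking file write. A minimal sketch (the logger name and file path are illustrative, not the Hub's actual configuration):

```python
import logging
import logging.handlers
import queue

# Off-thread file logging: servicer threads only enqueue records;
# a background listener thread performs the blocking disk write.
log_queue = queue.SimpleQueue()
file_handler = logging.FileHandler("mesh_monitor.log")  # illustrative path
listener = logging.handlers.QueueListener(log_queue, file_handler)
listener.start()

mesh_logger = logging.getLogger("mesh_monitor")
mesh_logger.addHandler(logging.handlers.QueueHandler(log_queue))
mesh_logger.setLevel(logging.INFO)

# _monitor_mesh could now log without touching the disk on the hot path:
mesh_logger.info("mesh heartbeat: %d nodes online", 3)

# On shutdown, listener.stop() flushes the queue and joins the writer thread.
```

The 30-second monitor loop then costs only a queue put on the hot path, keeping the Hub's thread pool free for network handling.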

---

## 🛠️ Summary Recommendations

1.  **Harden IO Lock Registry**: Implement a `WeakValueDictionary` or an explicit TTL-based cleanup for `self.io_locks` to prevent memory leaks from abandoned sync operations.
2.  **Decouple Management Tasks**: Move the mesh monitoring and mirror cleanup threads to the centralized `AgentScheduler` or a dedicated background worker class.
3.  **Strict Sandbox Defaults**: Change the default sandbox mode to `STRICT` for unconfigured nodes, requiring an explicit admin toggle to unlock "PERMISSIVE" bash execution on a new node.
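
Recommendation 1 could be sketched as a TTL registry: each lock entry records its last-touch time, and the existing cleanup loop evicts entries idle past a threshold. This is a hypothetical replacement for the raw `io_locks` dict, not the Hub's implementation:

```python
import threading
import time

class TTLLockRegistry:
    """Hypothetical TTL-based replacement for the raw io_locks dict."""

    def __init__(self, ttl_seconds=3600.0):
        self._ttl = ttl_seconds
        self._guard = threading.Lock()  # protects the registry itself
        self._entries = {}              # sync_id -> (lock, last_touched)

    def acquire_lock(self, sync_id):
        """Fetch (or create) the per-sync lock and refresh its timestamp."""
        with self._guard:
            lock, _ = self._entries.get(sync_id, (threading.Lock(), 0.0))
            self._entries[sync_id] = (lock, time.monotonic())
            return lock

    def release(self, sync_id):
        """Explicit purge on the is_final chunk (the happy path)."""
        with self._guard:
            self._entries.pop(sync_id, None)

    def evict_stale(self):
        """Called from the existing cleanup loop; reaps abandoned syncs."""
        cutoff = time.monotonic() - self._ttl
        with self._guard:
            stale = [k for k, (_, t) in self._entries.items() if t < cutoff]
            for k in stale:
                del self._entries[k]
            return len(stale)
```

An explicit TTL is chosen here over `WeakValueDictionary` because the eviction point is deterministic and observable (the reap count can be logged), which suits the Hub's existing periodic cleanup loop.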

---

**This concludes Feature 12. I have persisted this report to `/app/docs/reviews/feature_review_grpc_mesh_control.md`. Which of these gRPC hardening tasks should we investigate further?**
