This report performs a deep-dive audit of the mesh distribution layer, focusing on node_registry.py and mesh.py through the lens of 12-Factor App Methodology, Pythonic Code Style, and Architectural Stability.
| Factor | Status | Observation |
|---|---|---|
| VI. Processes | 🔴 Major Issue | Terminal History Volatility: LiveNodeRecord stores recent terminal output (terminal_history) in-memory. If the Hub process restarts, the context needed for AI "short-term memory" of recent shell activity is permanently lost. This state should be backed by a fast storage layer like Redis. |
| VII. Port Binding | ✅ Success | The Hub correctly splits concerns between gRPC port binding for node command-and-control and HTTP for the UI/API. |
| XI. Logs | 🟡 Warning | Event Bus Leakage: The emit method (Line 377) mixes application logic (terminal history pruning) with logging. Furthermore, several catch blocks still use print() instead of the configured logger.error(), complicating production audit trails. |
app/core/services/node_registry.pyThe mission-critical layer managing all connected agents and their gRPC command queues.
[!CAUTION] Deadlock Risk in Bounded Executor Line 169:
BoundedThreadPoolExecutoruses a bounded queue (max_queue_size=200). Because_db_upsert_nodeis submitted during registration, if the DB becomes slow or worker threads are saturated, thesubmit()call will block the main registry thread. This would freeze all node registrations and heartbeats Hub-wide. Fix: Useloop.run_in_executorwith a non-blocking queue strategy or an overflow-to-async-task model.
Identified Problems:
subscribe_node and subscribe_user append queues to a dictionary. If a WebSocket client crashes and fails to call unsubscribe, the list of listeners grows indefinitely. The emit loop (Line 445) performs q.put_nowait() but does not catch and remove stagnant queues.ANSI_ESCAPE.sub('', data) (Line 418) is executed for every task_stdout chunk. For high-volume shell output, this will lead to significant CPU spikes on the Hub process.id(q) set (Line 440) is a clever manual deduplication for nodes that belong to a specific user, but it implies the underlying list structures are inefficiently managed.app/core/services/mesh.pyThe integration layer between the raw DB records and the Hub's "Live View".
Identified Problems:
generate_provisioning_script performs a silent try...except and returns error strings. This can lead to the frontend displaying an "Error: ..." string inside a code block, which is difficult for users to debug compared to a 500 status code with structured detail.node_to_user_view is called for every node in a list view. While registry.get_node is fast (Dict lookup), the logic should be optimized to batch-resolve status where possible.emit loop to detect Full or BrokenPipe queue errors and automatically prune stagnant _node_listeners and _user_listeners.terminal_history out of LiveNodeRecord and into an ephemeral store (like Redis or a dedicated cache table) to survive process restarts (Factor VI).logger.exception() with unique error IDs and proper stack traces.Please review this second feature audit. If approved, I will proceed to Feature 3: RAG & Tool Services (The Brain & Hands).