
Code Review Report: Feature 2 — Node Registry & Mesh Monitoring

This report is a deep-dive audit of the mesh distribution layer, focusing on `node_registry.py` and `mesh.py` through the lens of the 12-Factor App Methodology, Pythonic code style, and architectural stability.


🏗️ 12-Factor App Compliance Audit

| Factor | Status | Observation |
| --- | --- | --- |
| VI. Processes | 🔴 Major Issue | Terminal History Volatility: `LiveNodeRecord` stores recent terminal output (`terminal_history`) in memory. If the Hub process restarts, the context needed for the AI's "short-term memory" of recent shell activity is permanently lost. This state should be backed by a fast storage layer such as Redis. |
| VII. Port Binding | ✅ Success | The Hub correctly splits concerns between gRPC port binding for node command-and-control and HTTP for the UI/API. |
| XI. Logs | 🟡 Warning | Event Bus Leakage: The `emit` method (Line 377) mixes application logic (terminal history pruning) with logging. Several catch blocks also still use `print()` instead of the configured `logger.error()`, complicating production audit trails. |
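As a sketch of the Factor VI fix, terminal history can be pushed to a capped Redis list instead of living inside the process. The `TerminalHistoryStore` class and key naming below are hypothetical; `FakeRedis` is an in-memory stand-in so the sketch is self-contained, and its methods map one-to-one onto redis-py's `lpush`/`ltrim`/`lrange`:

```python
from collections import deque

# Stand-in for a real redis.Redis client, so the sketch runs without a server.
# In production the same three calls go straight to redis-py.
class FakeRedis:
    def __init__(self):
        self._lists = {}

    def lpush(self, key, value):
        self._lists.setdefault(key, deque()).appendleft(value)

    def ltrim(self, key, start, stop):
        items = list(self._lists.get(key, deque()))[start:stop + 1]
        self._lists[key] = deque(items)

    def lrange(self, key, start, stop):
        items = list(self._lists.get(key, deque()))
        return items[start:] if stop == -1 else items[start:stop + 1]


class TerminalHistoryStore:
    """Keeps the last `maxlen` terminal lines per node outside the Hub process."""

    def __init__(self, client, maxlen: int = 500):
        self._r = client
        self._maxlen = maxlen

    def append(self, node_id: str, line: str) -> None:
        key = f"term:{node_id}"
        self._r.lpush(key, line)                 # newest line first
        self._r.ltrim(key, 0, self._maxlen - 1)  # drop lines beyond the cap

    def recent(self, node_id: str, n: int = 50) -> list:
        return self._r.lrange(f"term:{node_id}", 0, n - 1)
```

Because the history lives in Redis, a Hub restart loses the live gRPC streams but not the AI's short-term shell context.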

🔍 File-by-File Diagnostic

1. app/core/services/node_registry.py

The mission-critical layer managing all connected agents and their gRPC command queues.

> [!CAUTION]
> **Deadlock Risk in Bounded Executor** (Line 169): `BoundedThreadPoolExecutor` uses a bounded queue (`max_queue_size=200`). Because `_db_upsert_node` is submitted during registration, if the DB becomes slow or the worker threads are saturated, the `submit()` call will block the main registry thread, freezing all node registrations and heartbeats Hub-wide. **Fix:** Use `loop.run_in_executor` with a non-blocking queue strategy or an overflow-to-async-task model.
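A minimal sketch of the suggested fix, assuming an asyncio Hub: `db_upsert_node` here is a hypothetical stand-in for the blocking `_db_upsert_node` call, and the semaphore size is illustrative. `run_in_executor` queues the work without ever blocking the event loop, while the semaphore provides backpressure as an `await` instead of a thread-blocking `submit()`:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the blocking _db_upsert_node DB call.
def db_upsert_node(node_id: str) -> str:
    return f"upserted {node_id}"

_executor = ThreadPoolExecutor(max_workers=4)
_db_sem = None  # created lazily so it binds to the running event loop

async def register_node(node_id: str) -> str:
    global _db_sem
    if _db_sem is None:
        _db_sem = asyncio.Semaphore(64)  # cap in-flight DB writes
    loop = asyncio.get_running_loop()
    # Backpressure is an await, not a blocked thread: a slow DB delays this
    # coroutine only, instead of freezing every registration Hub-wide.
    async with _db_sem:
        return await loop.run_in_executor(_executor, db_upsert_node, node_id)

result = asyncio.run(register_node("node-1"))
```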

Identified Problems:

  • Listener Memory Leak: `subscribe_node` and `subscribe_user` append queues to a dictionary. If a WebSocket client crashes without calling `unsubscribe`, the list of listeners grows indefinitely. The emit loop (Line 445) performs `q.put_nowait()` but does not catch delivery failures and remove stagnant queues.
  • Heavy String Manipulation: `ANSI_ESCAPE.sub('', data)` (Line 418) is executed for every `task_stdout` chunk. For high-volume shell output, this will cause significant CPU spikes on the Hub process.
  • Duplicate Event Processing: the `id(q)` set (Line 440) is a clever manual deduplication for nodes that belong to a specific user, but it implies the underlying listener lists are inefficiently managed.

2. app/core/services/mesh.py

The integration layer between the raw DB records and the Hub's "Live View".

Identified Problems:

  • Template Fragility: `generate_provisioning_script` swallows failures in a silent `try...except` and returns the error message as a string. The frontend can end up rendering "Error: ..." inside a code block, which is far harder for users to debug than a 500 status code with structured detail.
  • N+1 Query Potential: `node_to_user_view` is called for every node in a list view. While `registry.get_node` is fast (a dict lookup), the logic should batch-resolve status where possible.
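The template-fragility point can be sketched as raising a structured exception instead of returning the error text. `ProvisioningError` and this function signature are hypothetical, not mesh.py's actual API; the idea is that the web layer maps the exception to a 500 (or 4xx) response with machine-readable detail:

```python
# Hypothetical structured error; the web layer translates it into an HTTP
# error response instead of rendering "Error: ..." as the script body.
class ProvisioningError(Exception):
    def __init__(self, detail: str, context: dict):
        super().__init__(detail)
        self.detail = detail
        self.context = context

def generate_provisioning_script(template: str, values: dict) -> str:
    try:
        return template.format(**values)
    except KeyError as exc:
        # Surface the failure with context rather than swallowing it.
        raise ProvisioningError(
            f"missing template variable: {exc}",
            {"provided_vars": sorted(values)},
        ) from exc
```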

🛠️ Summary Recommendations

  1. Harden the Event Bus: Update the `emit` loop to detect `QueueFull` or broken-pipe errors and automatically prune stagnant `_node_listeners` and `_user_listeners`.
  2. Externalize Short-Term Memory: Move `terminal_history` out of `LiveNodeRecord` and into an ephemeral store (Redis or a dedicated cache table) so it survives process restarts (Factor VI).
  3. Audit Exception Logging: Standardize all error reporting on `logger.exception()` with unique error IDs and full stack traces.
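Recommendation 3 can be sketched in a few lines; the logger name and `report_error` helper are illustrative, not existing Hub code:

```python
import logging
import uuid

logger = logging.getLogger("cortex_hub")  # logger name is illustrative

def report_error(context: str) -> str:
    """Log the active exception with a unique, user-quotable error ID."""
    error_id = uuid.uuid4().hex[:8]
    # logger.exception() attaches the current stack trace automatically,
    # so it must be called from inside an `except` block.
    logger.exception("error_id=%s context=%s", error_id, context)
    return error_id

try:
    raise ConnectionError("gRPC stream dropped")
except ConnectionError:
    eid = report_error("node_heartbeat")
```

Returning the ID lets the API surface it to the user, so a support report can be matched to the exact stack trace in the audit trail.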

Please review this second feature audit. If approved, I will proceed to Feature 3: RAG & Tool Services (The Brain & Hands).