
# Code Review Report: Feature 11 — Node Mesh Registry & Event Bus

This report is a deep-dive audit of the node registration and event bus logic in `node_registry.py`, focusing on 12-Factor App methodology, memory management, and gRPC concurrency safety.


## 🏗️ 12-Factor App Compliance Audit

| Factor | Status | Observation |
| --- | --- | --- |
| VI. Processes | 🔴 Major Issue | **Ephemeral History Volatility:** Each `LiveNodeRecord` stores a `terminal_history` list (Lines 79, 437) of the last 150 terminal interaction chunks. When the Hub restarts, this history is purged, causing a "Context Amnesia" effect: an AI agent cannot recall what was displayed on a node's terminal in a previous session, even if the node itself stayed online. This state should be persisted to an external store (Redis/SQLite). |
| IX. Disposability | ✅ Success | **Bounded Backpressure:** The Hub implements a custom `BoundedThreadPoolExecutor` (Line 24) that applies natural backpressure when gRPC queues are full, preventing the Hub from crash-looping on Out-Of-Memory (OOM) errors during massive file-sync waves. |
| XI. Logs | 🟡 Warning | **Debug Noise:** Every heartbeat logs a debug message (Line 367). In a mesh of 100+ nodes this generates thousands of lines per minute; while technically correct as "debug", it makes finding actual issues in the debug logs nearly impossible. |
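The bounded-backpressure pattern credited to `BoundedThreadPoolExecutor` (Line 24) can be sketched as follows. This is an assumed reconstruction of how such a class typically works (a semaphore gating `submit()`), not the Hub's actual implementation; the `bound` parameter and class internals are illustrative.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class BoundedThreadPoolExecutor(ThreadPoolExecutor):
    """Sketch: executor whose submit() blocks once `bound` tasks are pending,
    so overloaded callers slow down instead of growing the queue without limit."""

    def __init__(self, max_workers=4, bound=16):
        super().__init__(max_workers=max_workers)
        self._slots = threading.BoundedSemaphore(bound)

    def submit(self, fn, *args, **kwargs):
        self._slots.acquire()  # blocks the caller while the queue is full
        try:
            future = super().submit(fn, *args, **kwargs)
        except Exception:
            self._slots.release()
            raise
        # Free a slot as soon as the task finishes (success or failure).
        future.add_done_callback(lambda _: self._slots.release())
        return future
```

Because `submit()` blocks rather than raising, backpressure propagates naturally to the gRPC handlers feeding the executor, which is what prevents the OOM crash-loop described above.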

## 🔍 File-by-File Diagnostic

### 1. `app/core/services/node_registry.py`

The Hub's central nervous system for mesh coordination.

> [!CAUTION]
> **Memory Fragmentation Hazard (Rolling History).** Line 437: `node.terminal_history = node.terminal_history[-150:]`
> The terminal history is stored as a list of strings. Every time a new chunk arrives, the list is appended to and then sliced. For large shell outputs (build logs, `cat`-ing large files), this leads to significant memory fragmentation and garbage-collection overhead in the Hub process.
> **Fix:** Use a `collections.deque(maxlen=150)` for the `terminal_history` buffer to get O(1) rotation and efficient memory management.
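A minimal sketch of the proposed fix. The `LiveNodeRecord` shape and `append_chunk` helper are hypothetical; only the `terminal_history` attribute name and the 150-chunk cap come from the audited code.

```python
from collections import deque


class LiveNodeRecord:
    """Sketch: record whose history is a bounded deque instead of a list."""

    def __init__(self, history_size=150):
        # append() beyond maxlen drops the oldest chunk in O(1), replacing the
        # per-chunk list reallocation done by `history = history[-150:]`.
        self.terminal_history = deque(maxlen=history_size)

    def append_chunk(self, chunk: str):
        self.terminal_history.append(chunk)
```

The deque never exceeds `maxlen`, so the Line 437 slice assignment becomes unnecessary and can be deleted outright.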

Identified Problems:

- **Zombie Thread Fallback:** Line 107 spawns a raw `threading.Thread(daemon=True)` for message delivery whenever the executor is not ready. In high-load startup scenarios this can produce an explosion of unmanaged threads (thread leaks).
- **Stale Status Window:** Line 133 marks a node "stale" after only 30 seconds. In edge environments with high-latency VPNs (e.g. WireGuard over 5G), this threshold may be too aggressive, causing the UI to "flash" between online and stale.
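One way to cap the fallback path is to budget the raw threads with a semaphore. This sketch is illustrative, not the Hub's code: `deliver_with_fallback`, `_FALLBACK_SLOTS`, and the cap of 8 are all assumptions.

```python
import threading

# Hypothetical cap on raw fallback threads spawned before the executor is ready.
_FALLBACK_SLOTS = threading.BoundedSemaphore(8)


def deliver_with_fallback(executor, fn, *args):
    """Prefer the managed executor; otherwise spawn at most 8 raw threads."""
    if executor is not None:
        return executor.submit(fn, *args)
    if not _FALLBACK_SLOTS.acquire(blocking=False):
        # Budget exhausted: fail fast instead of leaking unmanaged threads.
        raise RuntimeError("fallback thread budget exhausted; delivery dropped")

    def _run():
        try:
            fn(*args)
        finally:
            _FALLBACK_SLOTS.release()

    t = threading.Thread(target=_run, daemon=True)
    t.start()
    return t
```

Dropping (or queueing) deliveries once the budget is exhausted trades a small startup delivery risk for a hard bound on thread count.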

## 🛠️ Summary Recommendations

1. **Switch to a Deque:** Use `collections.deque(maxlen=150)` for `terminal_history` to eliminate the per-chunk list-slicing overhead.
2. **Externalize History:** Move the terminal buffer to a Redis `LPUSH`/`LTRIM` structure if context must survive Hub restarts.
3. **Configurable Thresholds:** Move the "stale" timeout (30 s) and "flapping" window (60 s) into `app/config.py` so they can be tuned per network environment.
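Recommendation 3 could look like the following in `app/config.py`. The `MeshTimings` dataclass and the `HUB_*` environment variable names are suggestions, not existing code.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class MeshTimings:
    """Sketch: tunable mesh-timing knobs with environment-variable overrides."""

    # Seconds without a heartbeat before a node is marked "stale" (Line 133).
    stale_after_s: float = float(os.getenv("HUB_STALE_AFTER_S", "30"))
    # Window for detecting online/stale "flapping".
    flap_window_s: float = float(os.getenv("HUB_FLAP_WINDOW_S", "60"))


TIMINGS = MeshTimings()
```

`node_registry.py` would then compare against `TIMINGS.stale_after_s` instead of a hard-coded 30, letting high-latency deployments raise the threshold without a code change.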

This concludes Feature 11. I have persisted this report to /app/docs/reviews/feature_review_node_registry_bus.md. All major backend components have now been audited. Which of these infrastructure improvements should I implement first?