Newer
Older
cortex-hub / docs / reviews / feature_review_infrastructure.md

Code Review Report: Feature 5 — Mesh Infrastructure & Persistence

This report performs a deep-dive audit of the file-sync and data persistence layer, focusing on mirror.py and agent.py through the lens of 12-Factor App Methodology, Pythonic Code Style, and Mesh Scalability.


🏗️ 12-Factor App Compliance Audit

Factor Status Observation
VI. Processes 🔴 Major Issue Hash Cache Volatility: GhostMirrorManager stores file hash caches exclusively in-memory (self.hash_cache). On every Hub restart, all reconciliation logic for every node is reverted to 0, forcing a full scan and re-hashing of every multi-gigabyte workspace. In dev/prod environments with NFS mounts, this will lead to catastrophic Hub cold-start times.
XI. Logs 🔴 Style Problem Print-Debug Overload: mirror.py uses print() for critical infrastructure events (File Purge, Hash Verification, Sync Complete). These bypass the application's logging configuration, preventing structured monitoring and alerting via log aggregators.
IX. Disposability Success Ghost Mirror Purge: The Hub implemented an active_ids based purge mechanism (Factor IX) to clean up orphaned directories from the filesystem, keeping the disk usage tight.

🔍 File-by-File Diagnostic

1. app/core/grpc/core/mirror.py

The "Ghost Mirror" logic that handles bi-directional file sync between the Hub and Mesh nodes.

[!CAUTION] Performance Hotspot: Redundant Hash Verification write_file_chunk (Line 78) attempts a hash verification on every single-chunk file write. For small files (e.g., project config updates), this is acceptable. However, for continuous streaming of many small files, the Hub CPU will be saturated by redundant SHA256 calculations. Fix: Move hash verification to an asynchronous background task or trust the gRPC transport's built-in integrity and only verify periodically.

Identified Problems:

  • System Immutability Fragility: The "Immutability Lock" (Line 57) relies on reading a .metadata.json via file lookups per chunk. This adds IO overhead for every single file packet received.
  • Atomic Swap Hazards: os.replace (Line 163) is atomic, which is excellent. However, the ownership transfer (os.chown) occurs before the move, which means the .cortex_tmp file might be exposed with the wrong permissions for a split second if the process is killed between the operations. (Minor Race Condition).

2. app/db/models/agent.py

The SQL schema for autonomous agent instances and templates.

Identified Problems:

  • SQL Indexing Gaps: AgentInstance.mesh_node_id and AgentInstance.session_id lack database-level indices. As the agent_instances table grows to thousands of records (from cron/webhooks), agent lookups by node or session will degrade to full-table scans.
  • Mutable Default Hazard: Column(JSON, default={}) (Line 41) can occasionally lead to shared mutable dictionary behavior in some SQLAlchemy versions.
  • Fix: Use default=dict (the callable) to ensure a fresh dictionary is created for every row.

🛠️ Summary Recommendations

  1. Persistent Hash Cache: Move self.hash_cache to a local SQLite database or Redis to ensure reconciliation is instant after a Hub reboot.
  2. Migrate to logger: Immediately replace all print() statements in mirror.py with logger.info() or logger.status() style calls.
  3. Database Indexing: Add indices to mesh_node_id, session_id, and template_id in the agent_instances table to maintain performance at scale.

This concludes Feature 5. All reports have been copied to /app/docs/reviews/. I am ready for your final review or to begin remediating Feature 5's findings.