
Code Review Report: Feature 15 — Task Journaling & Memory Sandboxing

This report is a deep-dive audit of the task-tracking state machine and memory-protection layer in journal.py, focusing on 12-Factor App compliance, memory management, and agent reliability.


🏗️ 12-Factor App Compliance Audit

| Factor | Status | Observation |
| --- | --- | --- |
| VI. Processes | 🔴 Major Issue | **Volatile Journal State:** The `TaskJournal` stores all active task metadata and stream buffers in memory (`self.tasks`, Line 20). If the Hub process restarts (deployment/crash), all currently running agent sub-tasks lose their "result hook," and the AI agent will wait indefinitely for a result that disappeared from the Hub's memory. This state must be synchronized to a persistent store (SQLite/Redis). |
| IX. Disposability | ✅ Success | **Robust Memory Sandboxing:** The Hub's "Head + Tail" buffer strategy (`_trim_stream`, Line 41) is a best-in-class implementation for agentic systems. It prevents the Hub from OOM-crashing during accidental massive stdout bursts while preserving the critical initial context and final status needed by the AI. |
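To make the "Head + Tail" strategy concrete, here is a minimal sketch of how such a trim function can work. The names (`trim_stream`, `HEAD_LIMIT`, `TAIL_LIMIT`, the marker text) are illustrative assumptions, not the actual journal.py implementation:

```python
# Hypothetical sketch of a head+tail buffer trim; limits and names are
# assumptions chosen to match the ~40KB per-task bound described in this report.
HEAD_LIMIT = 20 * 1024  # keep the first ~20KB (initial context)
TAIL_LIMIT = 20 * 1024  # keep the last ~20KB (final status)
MARKER = b"\n... [stream trimmed by hub] ...\n"

def trim_stream(buffer: bytes) -> bytes:
    """Bound a stdout buffer by keeping only its head and tail."""
    if len(buffer) <= HEAD_LIMIT + TAIL_LIMIT:
        return buffer
    return buffer[:HEAD_LIMIT] + MARKER + buffer[-TAIL_LIMIT:]
```

The key property is that memory use per task is bounded regardless of how many megabytes a node writes to stdout, while the agent still sees how the task started and how it ended.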

🔍 File-by-File Diagnostic

1. app/core/grpc/core/journal.py

The Hub's short-term memory for tracking asynchronous node execution.

> [!TIP]
> **Performance: Thread Safety vs. Throughput** (Line 19: `self.lock = threading.Lock()`)
>
> The journal uses a single global lock for all task updates (thought logs, stdout chunks, result fulfillment). For a mesh of 100+ nodes streaming build logs, this lock will become a significant point of contention.
>
> **Fix:** Shard the task registry (e.g., 16 separate dictionaries, each with its own lock) based on the `task_id` hash to improve concurrent update performance.
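A sharded registry along the lines suggested above could look like the following sketch. The class name, shard count, and method names are assumptions for illustration:

```python
# Illustrative sharded-lock registry; not the existing journal.py code.
import threading

NUM_SHARDS = 16

class ShardedTaskRegistry:
    """Task store split into NUM_SHARDS (dict, lock) pairs to cut contention."""

    def __init__(self) -> None:
        self._shards = [({}, threading.Lock()) for _ in range(NUM_SHARDS)]

    def _shard(self, task_id: str):
        # Pick the shard by hashing the task id.
        return self._shards[hash(task_id) % NUM_SHARDS]

    def update(self, task_id: str, **fields) -> None:
        tasks, lock = self._shard(task_id)
        with lock:
            tasks.setdefault(task_id, {}).update(fields)

    def get(self, task_id: str) -> dict:
        tasks, lock = self._shard(task_id)
        with lock:
            return dict(tasks.get(task_id, {}))
```

Updates to tasks in different shards no longer serialize on one global lock, so high-frequency stdout streaming from one node does not block result fulfillment for another.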

Identified Problems:

  • Result Polling Latency: The cleanup task (Line 216) removes completed results after only 120 seconds. If a calling service (such as the UI or a background RAG aggregator) fails to poll within that window due to network latency, the result is lost permanently.
  • Lack of Disk Spilling: The journal is purely RAM-based. While the head+tail buffer limits individual tasks to ~40KB, 1,000 concurrent tasks still consume ~40MB. For high-volume agent clusters, a "Spill-to-Disk" strategy for inactive task buffers would be safer.
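A "Spill-to-Disk" mechanism for inactive buffers can be sketched as two small helpers. The paths and function names below are assumptions, not existing journal.py behavior:

```python
# Hypothetical spill-to-disk sketch for inactive task buffers.
import os
import tempfile

SPILL_DIR = tempfile.gettempdir()  # assumption: a real system would use a dedicated dir

def spill_buffer(task_id: str, buffer: bytes) -> str:
    """Write an inactive task's stream buffer to disk, freeing RAM.

    Returns the path so the journal can restore the buffer on demand.
    """
    path = os.path.join(SPILL_DIR, f"journal-{task_id}.buf")
    with open(path, "wb") as f:
        f.write(buffer)
    return path

def restore_buffer(path: str) -> bytes:
    """Reload a previously spilled buffer when the task becomes active again."""
    with open(path, "rb") as f:
        return f.read()
```

The journal would spill buffers for tasks that have been idle past some threshold and keep only the on-disk path in RAM, capping resident memory even with thousands of dormant tasks.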

🛠️ Summary Recommendations

  1. Persistent Task Index: Record the task_id and assigned node_id in the backend Database upon registration to enable "Re-attachment" logic after a Hub reboot.
  2. Sharded Locking: Move from a single global lock to a sharded lock architecture to support high-frequency token streaming from massive agent clusters.
  3. Configurable Stream Limits: Move the 40KB hardcoded stream limits to app/config.py to allow tuning for specific AI model context windows.

This concludes Feature 15. I have persisted this report to /app/docs/reviews/feature_review_task_journal.md. All major gRPC core components have now been audited. Shall I proceed to the final review of the mesh "Assistant" and STT/TTS providers?