
# Code Review Report: Feature 14 — Global Work Pool & Task Discovery

This report presents a deep-dive audit of the Hub's "claim-based" task distribution engine in `pool.py`, focusing on 12-Factor App Methodology, Distributed Reliability, and Task Lifecycle Safety.


## 🏗️ 12-Factor App Compliance Audit

| Factor | Status | Observation |
| --- | --- | --- |
| VI. Processes | 🔴 Major Issue | **Volatile Task Memory:** The `GlobalWorkPool` stores unclaimed tasks in memory only (`self.available`, Line 8). If the Hub process restarts (deployment, update, or crash), all pending global work is lost, making the system unreliable for scheduled or background "autonomous" tasks. |
| IX. Disposability | ✅ Success | **Thread Safety:** Access to the work pool is correctly gated by a `threading.Lock()` (Line 7), ensuring that no two nodes can claim the same task simultaneously during intensive broadcast bursts. |

## 🔍 File-by-File Diagnostic

### 1. `app/core/grpc/core/pool.py`

The Hub's "Job Board" where tasks are listed for Mesh nodes to claim.
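For orientation, here is a minimal sketch of the structure this review describes. The class and field names (`GlobalWorkPool`, `self.available`, the lock) come from the report; the method signatures and bodies are assumptions for illustration, not the actual `pool.py` code:

```python
import threading
import time


class GlobalWorkPool:
    """Sketch of the Hub's in-memory job board as described in this review.

    Only `available` and the lock are taken from the report; `publish`
    and `claim` are hypothetical reconstructions.
    """

    def __init__(self):
        self._lock = threading.Lock()   # Line 7: gates all pool access
        self.available = {}             # Line 8: task_id -> task dict, memory only

    def publish(self, task_id, payload):
        with self._lock:
            self.available[task_id] = {"payload": payload, "created": time.time()}

    def claim(self, task_id):
        with self._lock:
            if task_id not in self.available:
                return False, None
            # Destructive pop (Line 36): the task vanishes from the Hub
            # even if the claiming node later crashes.
            return True, self.available.pop(task_id)["payload"]
```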

> [!CAUTION]
> **Task Loss Hazard (No Visibility Timeout)**
>
> Line 36: `return True, self.available.pop(task_id)["payload"]`
>
> The `claim` method is globally destructive. Once a node claims a task, it is immediately removed from the pool's memory.

**The Problem:** If a node claims the task but crashes before completing it (power failure, network loss), the task is lost forever. The Hub provides no mechanism to re-assign or time out "active" tasks that never report back.

**Fix:** Replace the simple `pop` with a status-based registry (`available` vs `in_progress`). Implement a "visibility timeout" (e.g., 5 minutes) that moves tasks back to `available` if the claiming node's gRPC stream terminates without reporting success.
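A hedged sketch of that fix, assuming a `node_id` is known from the gRPC stream and that a periodic reaper calls `requeue_expired`. All names and signatures here are illustrative, not the existing `pool.py` API:

```python
import threading
import time

VISIBILITY_TIMEOUT = 300  # seconds; the 5-minute value suggested above (tunable)


class SafeWorkPool:
    """Hypothetical status-based registry: tasks move to `in_progress`
    on claim and return to `available` when their claim times out."""

    def __init__(self):
        self._lock = threading.Lock()
        self.available = {}    # task_id -> task dict
        self.in_progress = {}  # task_id -> {"task", "node_id", "claimed_at"}

    def claim(self, task_id, node_id):
        with self._lock:
            task = self.available.pop(task_id, None)
            if task is None:
                return False, None
            # Park the task instead of destroying it.
            self.in_progress[task_id] = {
                "task": task, "node_id": node_id, "claimed_at": time.time(),
            }
            return True, task["payload"]

    def complete(self, task_id, node_id):
        # Only the node that claimed the task may complete it.
        with self._lock:
            entry = self.in_progress.get(task_id)
            if entry is not None and entry["node_id"] == node_id:
                del self.in_progress[task_id]
                return True
            return False

    def requeue_expired(self, now=None):
        # Move timed-out claims back so another node can retry them.
        now = time.time() if now is None else now
        with self._lock:
            expired = [tid for tid, e in self.in_progress.items()
                       if now - e["claimed_at"] > VISIBILITY_TIMEOUT]
            for tid in expired:
                self.available[tid] = self.in_progress.pop(tid)["task"]
            return expired
```

A background thread (or the reaper from the recommendations below) would call `requeue_expired` on a fixed interval.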

Identified Problems:

- **Hardcoded Cleanup TTL:** The 3600-second (1 hour) task expiration (Line 18) is hardcoded and may be too aggressive for low-priority agent tasks that must wait for a specific node, which could be offline for maintenance.
- **Lack of Priority Support:** The pool is a flat dictionary. In a large mesh, "Admin" tasks or "Security Patches" should be prioritizable over standard background "RAG ingestion" tasks.
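One way to sketch priority support is a heap-ordered pool via Python's `heapq`, with lower numbers meaning more urgent. The priority scale and method names are assumptions for illustration:

```python
import heapq
import itertools
import threading


class PriorityWorkPool:
    """Hypothetical heap-ordered pool so "Admin" / "Security Patch" tasks
    are claimed before background work like RAG ingestion."""

    def __init__(self):
        self._lock = threading.Lock()
        self._heap = []                # (priority, seq, task_id)
        self._tasks = {}               # task_id -> payload
        self._seq = itertools.count()  # tie-breaker: FIFO within a priority level

    def publish(self, task_id, payload, priority=10):
        with self._lock:
            self._tasks[task_id] = payload
            heapq.heappush(self._heap, (priority, next(self._seq), task_id))

    def claim_next(self):
        with self._lock:
            while self._heap:
                _, _, task_id = heapq.heappop(self._heap)
                payload = self._tasks.pop(task_id, None)
                if payload is not None:  # skip stale entries for claimed tasks
                    return task_id, payload
            return None
```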

## 🛠️ Summary Recommendations

1. **Implement Persistence:** Migrate the `available` task map to a Redis HASH or a SQLite `tasks` table so jobs survive Hub reboots.
2. **Add Visibility Timeouts:** Track which `node_id` is currently processing each task and implement a reaper that re-enqueues any task exceeding a specified processing TTL.
3. **Add Task Priority:** Replace `available` with a priority queue, or add a numeric `priority` field, so critical mesh commands are processed first.
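Recommendation 1 might look like the following SQLite-backed sketch. The `tasks` table schema, column names, and methods are assumptions, not the existing `pool.py` interface:

```python
import sqlite3
import time

SCHEMA = """
CREATE TABLE IF NOT EXISTS tasks (
    task_id    TEXT PRIMARY KEY,
    payload    TEXT NOT NULL,
    status     TEXT NOT NULL DEFAULT 'available',
    node_id    TEXT,
    claimed_at REAL
)
"""


class PersistentWorkPool:
    """Hypothetical durable pool: tasks live in a SQLite table, so a Hub
    restart no longer wipes pending work."""

    def __init__(self, path=":memory:"):  # a real deployment would use a file path
        self.db = sqlite3.connect(path)
        self.db.execute(SCHEMA)

    def publish(self, task_id, payload):
        with self.db:  # implicit transaction / commit
            self.db.execute(
                "INSERT OR REPLACE INTO tasks (task_id, payload) VALUES (?, ?)",
                (task_id, payload))

    def claim(self, task_id, node_id):
        with self.db:
            # The UPDATE only succeeds if the task is still available, so
            # two nodes cannot claim the same task.
            cur = self.db.execute(
                "UPDATE tasks SET status='in_progress', node_id=?, claimed_at=? "
                "WHERE task_id=? AND status='available'",
                (node_id, time.time(), task_id))
            if cur.rowcount == 0:  # missing, or already claimed elsewhere
                return None
        row = self.db.execute(
            "SELECT payload FROM tasks WHERE task_id=?", (task_id,)).fetchone()
        return row[0]
```

Keeping the `claimed_at` column alongside `status` means the reaper from recommendation 2 becomes a single UPDATE that flips stale `in_progress` rows back to `available`.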

This concludes Feature 14. I have persisted this report to `/app/docs/reviews/feature_review_work_pool.md`. Shall I implement a basic task-reaper to prevent task loss during node crashes?