This report performs a deep-dive audit of the Hub's "claim-based" task distribution engine within pool.py, focusing on 12-Factor App Methodology, Distributed Reliability, and Task Lifecycle Safety.
| Factor | Status | Observation |
|---|---|---|
| VI. Processes | 🔴 Major Issue | Volatile Task Memory: The GlobalWorkPool stores unclaimed tasks in-memory only (self.available, Line 8). If the Hub process restarts (deployment/update/crash), all pending global work is lost. This makes the system unreliable for scheduled or background "autonomous" tasks. |
| IX. Disposability | ✅ Success | Thread Safety: Access to the work pool is correctly gated by a threading.Lock() (Line 7), ensuring that no two nodes can claim the same task simultaneously during intensive broadcast bursts. |
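For context, the pool described in the table can be sketched as follows. This is a hypothetical reconstruction from the observations above (only `self.available`, the `threading.Lock()`, and the destructive `claim` are attested in the report; `submit` and other names are illustrative):

```python
import threading

class GlobalWorkPool:
    """Minimal reconstruction of the in-memory pool described in the audit."""

    def __init__(self):
        self._lock = threading.Lock()  # gates all pool access (Line 7)
        self.available = {}            # task_id -> {"payload": ...} (Line 8)

    def submit(self, task_id, payload):
        """Hypothetical producer side: list a task on the 'Job Board'."""
        with self._lock:
            self.available[task_id] = {"payload": payload}

    def claim(self, task_id):
        with self._lock:
            if task_id not in self.available:
                return False, None
            # Destructive pop: the task vanishes from memory the moment
            # a node claims it (the Line 36 hazard discussed below).
            return True, self.available.pop(task_id)["payload"]
```

Note that both the volatility issue (everything lives in `self.available`) and the task-loss hazard (the destructive `pop`) are visible in this handful of lines.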
`app/core/grpc/core/pool.py`: the Hub's "Job Board" where tasks are listed for Mesh nodes to claim.
> [!CAUTION]
> **Task Loss Hazard (No Visibility Timeout)**, Line 36:
>
> ```python
> return True, self.available.pop(task_id)["payload"]
> ```
>
> The `claim` method is globally destructive. Once a node claims a task, it is immediately removed from the pool's memory.
>
> **The Problem:** If the node claims the task but crashes before completing it (power failure, network loss), the task is lost forever. The Hub provides no mechanism to re-assign or time out "active" tasks that never report back.
**Fix:** Replace the simple `pop` with a status-based registry (`available` vs `in_progress`). Implement a "Visibility Timeout" (e.g., 5 minutes) so that tasks move back to `available` if the claiming node's gRPC stream terminates without success.
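The suggested fix could be sketched as below. This is a hedged illustration, not the pool.py implementation: the class and method names (`TimeoutWorkPool`, `complete`, `requeue_expired`) are hypothetical, and gRPC stream handling is reduced to an explicit `complete` call:

```python
import threading
import time

VISIBILITY_TIMEOUT = 300  # seconds; the 5-minute window suggested above

class TimeoutWorkPool:
    """Sketch: status-based registry (available vs in_progress) with a
    visibility timeout. Names are illustrative, not from pool.py."""

    def __init__(self, timeout=VISIBILITY_TIMEOUT):
        self._lock = threading.Lock()
        self.available = {}    # task_id -> {"payload": ...}
        self.in_progress = {}  # task_id -> {"payload", "node_id", "claimed_at"}
        self.timeout = timeout

    def claim(self, task_id, node_id):
        with self._lock:
            if task_id not in self.available:
                return False, None
            task = self.available.pop(task_id)
            # Instead of discarding the task, park it in in_progress.
            self.in_progress[task_id] = {
                "payload": task["payload"],
                "node_id": node_id,
                "claimed_at": time.monotonic(),
            }
            return True, task["payload"]

    def complete(self, task_id, node_id):
        """Only when the claiming node reports success is the task dropped."""
        with self._lock:
            task = self.in_progress.get(task_id)
            if task is not None and task["node_id"] == node_id:
                del self.in_progress[task_id]
                return True
            return False

    def requeue_expired(self, now=None):
        """Reaper: move tasks whose visibility timeout elapsed back to
        available, so a crashed node's work is eventually re-assigned."""
        now = time.monotonic() if now is None else now
        with self._lock:
            expired = [tid for tid, t in self.in_progress.items()
                       if now - t["claimed_at"] > self.timeout]
            for tid in expired:
                task = self.in_progress.pop(tid)
                self.available[tid] = {"payload": task["payload"]}
            return expired
```

In a real deployment `requeue_expired` would run periodically (or be triggered by gRPC stream termination); note that it is still in-memory only, so it addresses the task-loss hazard but not the persistence issue from Factor VI.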
Identified Problems:

1. **Persistence:** Move the `available` tasks map to a Redis HASH or a SQLite `tasks` table to ensure job persistence across Hub reboots.
2. **Active-task tracking:** Record which `node_id` is currently processing each task and implement a reaper task to re-enqueue any tasks that exceed a specific processing TTL.
3. **Prioritization:** Change `available` to be a `PriorityQueue` or implement a numeric priority field to ensure critical mesh commands are processed first.

This concludes Feature 14. I have persisted this report to `/app/docs/reviews/feature_review_work_pool.md`. Shall I implement a basic task-reaper to prevent task loss during node crashes?
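For reference, the three recommendations (persistence, node tracking with a reaper TTL, and a priority field) can be combined in a single persisted table. The sketch below uses SQLite; the schema and function names are illustrative assumptions, not the Hub's actual API:

```python
import sqlite3

# Hypothetical schema covering all three recommendations.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tasks (
    task_id    TEXT PRIMARY KEY,
    payload    TEXT NOT NULL,
    status     TEXT NOT NULL DEFAULT 'available',  -- 'available' | 'in_progress'
    node_id    TEXT,                               -- recommendation 2
    claimed_at REAL,
    priority   INTEGER NOT NULL DEFAULT 0          -- recommendation 3
)
"""

def claim_next(conn, node_id, now):
    """Claim the highest-priority available task inside one transaction."""
    with conn:  # sqlite3 connection as context manager: commit or roll back
        row = conn.execute(
            "SELECT task_id, payload FROM tasks "
            "WHERE status = 'available' ORDER BY priority DESC LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE tasks SET status='in_progress', node_id=?, claimed_at=? "
            "WHERE task_id=?",
            (node_id, now, row[0]),
        )
        return row

def reap(conn, now, ttl=300.0):
    """Re-enqueue in_progress tasks whose processing TTL has expired."""
    with conn:
        cur = conn.execute(
            "UPDATE tasks SET status='available', node_id=NULL, claimed_at=NULL "
            "WHERE status='in_progress' AND claimed_at < ?",
            (now - ttl,),
        )
        return cur.rowcount
```

Because the rows survive a process restart, this also resolves the Factor VI "Volatile Task Memory" finding: after a Hub reboot, `available` and stale `in_progress` tasks can be reloaded from the table and reaped.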