
Code Review Report: Feature 6 — Agent Triggers & Schedulers

This report performs a deep-dive audit of the background execution engine, focusing on scheduler.py through the lens of 12-Factor App Methodology, Pythonic Code Style, and Background Task Reliability.


🏗️ 12-Factor App Compliance Audit

| Factor | Status | Observation |
| --- | --- | --- |
| VI. Processes | 🔴 Major Issue | **Distributed Scheduling Hazard:** `AgentScheduler` stores trigger execution state (`_last_run_map`) in memory. In a distributed Hub deployment (multiple replicas), each replica maintains its own map, so a single trigger fires once per replica (e.g., a 1-hour CRON task firing N times across N replicas), leading to data corruption and redundant token costs. This state MUST be handled by a global lock or a persistent "LastRun" timestamp in the DB. |
| XI. Logs | ✅ Success | The scheduler uses structured logger levels correctly for zombie detection and trigger-firing events. |
| IX. Disposability | 🟡 Warning | **Task Orphanage on Shutdown:** When the Hub shuts down, the scheduler loop stops, but `asyncio.create_task` (Line 96) invocations continue until the process is hard-killed. There is no graceful "drain" of active agent runs during process termination. |
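The Disposability gap can be closed by tracking every spawned task and draining in-flight runs before exit. A minimal sketch, assuming the scheduler owns task creation; the `_dispatch`/`shutdown` names and `_active_tasks` attribute are illustrative, not the actual codebase API:

```python
import asyncio


class AgentScheduler:
    """Sketch of a drain-aware shutdown (names are illustrative)."""

    def __init__(self) -> None:
        self._active_tasks: set[asyncio.Task] = set()

    def _dispatch(self, coro) -> None:
        # Track every spawned agent run so shutdown can wait for it.
        task = asyncio.create_task(coro)
        self._active_tasks.add(task)
        task.add_done_callback(self._active_tasks.discard)

    async def shutdown(self, grace_seconds: float = 30.0) -> None:
        # Stop the polling loop first (not shown), then drain active runs.
        if not self._active_tasks:
            return
        done, pending = await asyncio.wait(
            self._active_tasks, timeout=grace_seconds
        )
        for task in pending:
            # Hard-cancel anything still running past the grace period.
            task.cancel()
```

Wiring this into a SIGTERM handler would let Kubernetes-style rolling deploys finish active agent runs instead of orphaning them.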

🔍 File-by-File Diagnostic

1. app/core/orchestration/scheduler.py

The background engine for Zombie Sweeping and CRON/Interval triggers.

> [!CAUTION]
> **Database Connection Leak Risk (Lines 59-123):** The `_cron_trigger_loop` manually instantiates `SessionLocal()` and calls `db.close()` at the end of the loop body. If an exception occurs during CRON processing (e.g., in `croniter` or `AgentExecutor.run`), the `db.close()` call is bypassed, leaving orphaned database connections. **Fix:** Use the existing `with get_db_session() as db:` context manager to guarantee cleanup.
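For reference, the context-manager pattern guarantees `close()` runs even when the loop body raises. The sketch below stubs `SessionLocal` with a fake session purely for illustration; the project's existing `get_db_session` presumably has equivalent semantics:

```python
from contextlib import contextmanager


class _FakeSession:
    """Stand-in for a SQLAlchemy session, for illustration only."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


def SessionLocal():
    # Placeholder mirroring the project's session factory name.
    return _FakeSession()


@contextmanager
def get_db_session():
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()  # runs even when the body raises mid-iteration
```

The loop body then becomes `with get_db_session() as db: ...`, and a `croniter` or executor exception can no longer leak the connection.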

Identified Problems:

  • N+1 Query Pattern: Lines 70 and 107 perform individual AgentInstance queries for every trigger in the list. For a Hub with 100+ agents, this will inundate the database with redundant metadata checks every 30 seconds.
  • Race Condition in Status Check: The scheduler checks instance.status != 'idle' (Line 71). If two replicas both see 'idle' simultaneously before one can update it to 'active', the task will be double-dispatched.
  • Zombie Detection Rigidity: The 3-minute timeout (Line 38) is hardcoded. Short-lived gRPC network blips could cause healthy but slow agents to be prematurely reset to 'idle'.
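The N+1 pattern above collapses into a single JOIN that fetches each trigger together with its instance's status in one round trip. A minimal sketch using stdlib `sqlite3` to show the shape of the query; the table and column names are assumed, and the real code would express this through the ORM:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE agent_instances (id INTEGER PRIMARY KEY, status TEXT);
CREATE TABLE agent_triggers (
    id INTEGER PRIMARY KEY, instance_id INTEGER, cron TEXT
);
INSERT INTO agent_instances VALUES (1, 'idle'), (2, 'active');
INSERT INTO agent_triggers VALUES
    (10, 1, '0 * * * *'),
    (11, 2, '0 * * * *');
""")

# One JOIN replaces a per-trigger AgentInstance lookup: only triggers
# whose instance is idle come back, in a single query per polling tick.
due = conn.execute("""
    SELECT t.id, t.cron, i.status
    FROM agent_triggers AS t
    JOIN agent_instances AS i ON i.id = t.instance_id
    WHERE i.status = 'idle'
""").fetchall()
```

With 100+ agents this turns ~100 status lookups per 30-second tick into one.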

🛠️ Summary Recommendations

  1. Global Execution Lock: Implement a distributed locking mechanism (e.g., SELECT ... FOR UPDATE or a Redis Redlock) to ensure only one replica handles scheduler duties at a time.
  2. Harden DB Sessions: Refactor both loops to use the with get_db_session() as db: context manager.
  3. Batch Lookups: Query AgentInstance and AgentTrigger together using a JOIN to eliminate the N+1 overhead in the polling loop.
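The persistent "LastRun" variant of Recommendation 1 can be sketched as a conditional UPDATE that exactly one replica wins. Illustrative example using SQLite; a production Hub would run the same statement against its real database (or use SELECT ... FOR UPDATE / an advisory lock where supported), and the `trigger_runs` table name is an assumption:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trigger_runs (trigger_id INTEGER PRIMARY KEY, last_run REAL)"
)
conn.execute("INSERT INTO trigger_runs VALUES (10, 0)")
conn.commit()


def claim_run(conn, trigger_id: int, now: float, interval: float) -> bool:
    """Atomically claim a trigger run.

    The WHERE clause only matches when the trigger is actually due, so
    if N replicas race, the database lets exactly one UPDATE succeed.
    """
    cur = conn.execute(
        "UPDATE trigger_runs SET last_run = ? "
        "WHERE trigger_id = ? AND last_run <= ?",
        (now, trigger_id, now - interval),
    )
    conn.commit()
    return cur.rowcount == 1  # True only for the replica that won
```

Replacing the in-memory `_last_run_map` with this claim step addresses both the Factor VI violation and the idle-status race in one move.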

This concludes the Feature 6 review. The report is persisted at /app/docs/reviews/feature_review_scheduler.md.