This report performs a deep-dive audit of the background execution engine, focusing on scheduler.py through the lens of 12-Factor App Methodology, Pythonic Code Style, and Background Task Reliability.
| Factor | Status | Observation |
|---|---|---|
| VI. Processes | 🔴 Major Issue | Distributed Scheduling Hazard: AgentScheduler stores trigger execution state (_last_run_map) in-memory. In a distributed Hub deployment (multiple replicas), each replica will maintain its own map. This will result in duplicate agent triggers (e.g., a 1-hour CRON task firing N times across N replicas), leading to data corruption and redundant token costs. This state MUST be handled by a global lock or a persistent "LastRun" timestamp in the DB. |
| XI. Logs | ✅ Success | The scheduler uses structured logger levels correctly for zombie detection and trigger firing events. |
| IX. Disposability | 🟡 Warning | Task Orphanage on Shutdown: When the Hub shuts down, the scheduler loop stops, but asyncio.create_task (Line 96) invocations continue until the process is hard-killed. There is no graceful "drain" of active agent runs during process termination. |
`app/core/orchestration/scheduler.py`: the background engine for Zombie Sweeping and CRON/Interval triggers.
> [!CAUTION]
> **Database Connection Leak Risk (Lines 59-123):** The `_cron_trigger_loop` manually instantiates `SessionLocal()` and calls `db.close()` at the end of the loop body. If an exception occurs during CRON processing (e.g., in `croniter` or `AgentExecutor.run`), the `db.close()` call is bypassed, leading to orphaned database connections. **Fix:** Use the existing `with get_db_session() as db:` context manager to ensure safe cleanup.
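A minimal sketch of that fix, using a stand-in `FakeSession` so the cleanup behaviour can be shown end to end; in `scheduler.py` the project's real `SessionLocal` factory would be passed instead:

```python
from contextlib import contextmanager

class FakeSession:
    """Stand-in for a SQLAlchemy-style session (assumption for illustration)."""
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True

@contextmanager
def get_db_session(session_factory=FakeSession):
    db = session_factory()
    try:
        yield db
    finally:
        db.close()  # runs even when the loop body raises

# Even if CRON processing blows up mid-iteration, the session is closed:
session_ref = None
try:
    with get_db_session() as db:
        session_ref = db
        raise RuntimeError("croniter blew up")
except RuntimeError:
    pass
print(session_ref.closed)  # True
```

Each iteration of `_cron_trigger_loop` would then wrap its body in `with get_db_session() as db:` instead of pairing a bare `SessionLocal()` with a trailing `db.close()`.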
Identified Problems:

- **N+1 Query Pattern:** The loop issues a separate `AgentInstance` query for every trigger in the list. For a Hub with 100+ agents, this will inundate the database with redundant metadata checks every 30 seconds.
- **Check-Then-Act Race:** Dispatch is guarded only by `instance.status != 'idle'` (Line 71). If two replicas both see `'idle'` simultaneously before one can update it to `'active'`, the task will be double-dispatched.

Recommendations:

- Add a distributed lock (e.g., `SELECT ... FOR UPDATE` or a Redis REDLOCK) to ensure only one replica handles scheduler duties at a time.
- Replace the manual session handling with the existing `with get_db_session() as db:` context manager.
- Fetch `AgentInstance` and `AgentTrigger` together using a JOIN to eliminate the N+1 overhead in the polling loop.

This concludes Feature 6. I have persisted this report to /app/docs/reviews/feature_review_scheduler.md. I am ready for your final instructions or to begin addressing these findings.
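The JOIN recommended above can be sketched as a single query that returns each due trigger together with its instance status, replacing the per-trigger `AgentInstance` lookups. Table and column names (`agent_triggers`, `agent_instances`, `status`, `cron_expr`) are assumptions about the schema, not taken from `scheduler.py`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE agent_instances (id INTEGER PRIMARY KEY, status TEXT);
CREATE TABLE agent_triggers (
    id INTEGER PRIMARY KEY,
    instance_id INTEGER REFERENCES agent_instances(id),
    cron_expr TEXT
);
INSERT INTO agent_instances VALUES (1, 'idle'), (2, 'active');
INSERT INTO agent_triggers VALUES (10, 1, '0 * * * *'), (11, 2, '*/5 * * * *');
""")

# One query per polling cycle instead of one query per trigger:
# fetch every trigger alongside its instance status, filtered to idle instances.
rows = conn.execute("""
    SELECT t.id, t.cron_expr, i.status
    FROM agent_triggers AS t
    JOIN agent_instances AS i ON i.id = t.instance_id
    WHERE i.status = 'idle'
    ORDER BY t.id
""").fetchall()
print(rows)  # [(10, '0 * * * *', 'idle')]
```

In the real codebase the same shape would be expressed through the ORM (e.g., a joined query over the `AgentTrigger` and `AgentInstance` models), and adding `FOR UPDATE` to the row selection would double as the per-trigger lock from the recommendations.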