# Code Review Report: Feature 6 — Agent Triggers & Schedulers

This report performs a deep-dive audit of the background execution engine, focusing on `scheduler.py` through the lens of **12-Factor App Methodology**, **Pythonic Code Style**, and **Background Task Reliability**.

---

## 🏗️ 12-Factor App Compliance Audit

| Factor | Status | Observation |
| :--- | :--- | :--- |
| **VI. Processes** | 🔴 **Major Issue** | **Distributed Scheduling Hazard**: `AgentScheduler` stores trigger execution state (`_last_run_map`) in-memory. In a distributed Hub deployment (multiple replicas), each replica will maintain its own map. This will result in **duplicate agent triggers** (e.g., a 1-hour CRON task firing N times across N replicas), leading to data corruption and redundant token costs. This state MUST be handled by a global lock or a persistent "LastRun" timestamp in the DB. |
| **XI. Logs** | ✅ **Success** | The scheduler uses structured `logger` levels correctly for zombie detection and trigger firing events. |
| **IX. Disposability** | 🟡 **Warning** | **Task Orphanage on Shutdown**: When the Hub shuts down, the scheduler loop stops, but tasks already spawned via `asyncio.create_task` (Line 96) keep running until the process is hard-killed. There is no graceful "drain" of active agent runs during process termination. |
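The Disposability gap above can be closed by tracking spawned tasks and awaiting them on shutdown. A minimal sketch follows; the `AgentScheduler`, `spawn`, and `shutdown` names here are illustrative stand-ins, not the actual class surface:

```python
import asyncio

class AgentScheduler:
    """Sketch: track in-flight agent runs so shutdown can drain them."""

    def __init__(self):
        self._tasks: set[asyncio.Task] = set()

    def spawn(self, coro):
        # Wrap the bare asyncio.create_task call so every run is tracked.
        task = asyncio.create_task(coro)
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)
        return task

    async def shutdown(self, timeout: float = 30.0):
        # Wait for active runs instead of orphaning them on SIGTERM.
        if self._tasks:
            await asyncio.wait(self._tasks, timeout=timeout)

async def main():
    sched = AgentScheduler()
    results = []

    async def run_agent():
        await asyncio.sleep(0.01)
        results.append("done")

    sched.spawn(run_agent())
    await sched.shutdown()
    return results

out = asyncio.run(main())
```

Hooking `shutdown()` into the Hub's termination signal handler would give the process a bounded grace period before exit.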

---

## 🔍 File-by-File Diagnostic

### 1. `app/core/orchestration/scheduler.py`
The background engine for Zombie Sweeping and CRON/Interval triggers.

> [!CAUTION]
> **Database Connection Leak Risk**
> Lines 59-123: The `_cron_trigger_loop` manually instantiates `SessionLocal()` and calls `db.close()` at the end of the loop body. If an exception occurs during CRON processing (e.g., in `croniter` or `AgentExecutor.run`), the `db.close()` call is bypassed, leading to orphaned database connections.
> **Fix**: Use the existing `with get_db_session() as db:` context manager to ensure safe cleanup.
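The context-manager pattern guarantees cleanup on any exit path. A minimal sketch of the shape (`SessionLocal` here is a stand-in counter class, not the app's real session factory, used only to demonstrate that the session closes even when the loop body raises):

```python
from contextlib import contextmanager

class SessionLocal:
    """Stand-in for the app's SQLAlchemy session factory (assumption)."""
    open_count = 0

    def __init__(self):
        SessionLocal.open_count += 1

    def close(self):
        SessionLocal.open_count -= 1

@contextmanager
def get_db_session():
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()  # runs even if the loop body raises

# Even when CRON processing raises, the session is released:
try:
    with get_db_session() as db:
        raise RuntimeError("croniter blew up")
except RuntimeError:
    pass
```

Swapping the manual `SessionLocal()` / `db.close()` pair for this pattern removes the leak without restructuring the loop.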

**Identified Problems**:
*   **N+1 Query Pattern**: Lines 70 and 107 perform individual `AgentInstance` queries for every trigger in the list. For a Hub with 100+ agents, this will inundate the database with redundant metadata checks every 30 seconds.
*   **Race Condition in Status Check**: The scheduler checks `instance.status != 'idle'` (Line 71). If two replicas both see 'idle' simultaneously before one can update it to 'active', the task will be double-dispatched.
*   **Zombie Detection Rigidity**: The 3-minute timeout (Line 38) is hardcoded. Short-lived gRPC network blips could cause healthy but slow agents to be prematurely reset to 'idle'.
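The race condition and the in-memory `_last_run_map` share one fix: persist `last_run` in the database and claim due triggers with a conditional `UPDATE`, so only one replica's write wins. A sketch using stdlib `sqlite3` for illustration (the `agent_triggers` table and `last_run` column names are assumptions; production would use the real schema and, on PostgreSQL, `SELECT ... FOR UPDATE` semantics):

```python
import sqlite3
import time

def claim_trigger(conn, trigger_id, now, interval_s):
    """Atomically claim a due trigger; only one replica's UPDATE matches."""
    cur = conn.execute(
        "UPDATE agent_triggers SET last_run = ? "
        "WHERE id = ? AND last_run <= ?",
        (now, trigger_id, now - interval_s),
    )
    conn.commit()
    return cur.rowcount == 1  # True only for the winning replica

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE agent_triggers (id INTEGER PRIMARY KEY, last_run REAL)")
conn.execute("INSERT INTO agent_triggers VALUES (1, 0)")

now = time.time()
first = claim_trigger(conn, 1, now, 3600)   # first replica claims the trigger
second = claim_trigger(conn, 1, now, 3600)  # second replica's UPDATE matches 0 rows
```

Because the claim and the timestamp write are a single atomic statement, N replicas polling the same trigger dispatch it exactly once per interval.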

---

## 🛠️ Summary Recommendations

1.  **Global Execution Lock**: Implement a distributed locking mechanism (e.g., `SELECT ... FOR UPDATE` or a Redis `REDLOCK`) to ensure only one replica handles scheduler duties at a time.
2.  **Harden DB Sessions**: Refactor both loops to use the `with get_db_session() as db:` context manager.
3.  **Batch Lookups**: Query `AgentInstance` and `AgentTrigger` together using a JOIN to eliminate the N+1 overhead in the polling loop.
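Recommendation 3 can be sketched as a single JOIN replacing the per-trigger `AgentInstance` lookup. Stdlib `sqlite3` stands in for the real session, and the table/column names are assumptions inferred from the model names in this review:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE agent_instances (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE agent_triggers (
        id INTEGER PRIMARY KEY,
        instance_id INTEGER REFERENCES agent_instances(id),
        cron_expr TEXT
    );
    INSERT INTO agent_instances VALUES (1, 'idle'), (2, 'active');
    INSERT INTO agent_triggers VALUES (10, 1, '0 * * * *'), (11, 2, '*/5 * * * *');
""")

# One round-trip fetches every trigger with its instance status,
# replacing N individual AgentInstance queries in the polling loop.
rows = conn.execute(
    "SELECT t.id, t.cron_expr, i.status "
    "FROM agent_triggers t JOIN agent_instances i ON i.id = t.instance_id"
).fetchall()

idle_triggers = [r for r in rows if r[2] == "idle"]
```

With SQLAlchemy, the equivalent would be a single `join()` on the relationship, optionally filtered by status so only dispatchable triggers are fetched at all.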

---

**This concludes Feature 6. I have persisted this report to `/app/docs/reviews/feature_review_scheduler.md`. I am ready for your final instructions or to begin addressing these findings.**
