diff --git a/docs/features/harness_engineering/co_pilot_agent_design.md b/docs/features/harness_engineering/co_pilot_agent_design.md index 18521ef..d54d8d5 100644 --- a/docs/features/harness_engineering/co_pilot_agent_design.md +++ b/docs/features/harness_engineering/co_pilot_agent_design.md @@ -127,8 +127,22 @@ --- ## 7. Implementation Checklist -- [ ] Add `co_worker_enabled` toggle to `AgentInstance` DB model. -- [ ] Implement `pre_run` and `post_run` hooks in `AgentExecutor`. -- [ ] Add UI slider for `rework_threshold`. -- [ ] Develop the logic to purge/write/read the `evaluation.md` in the workspace. -- [ ] Create a "Self-Improvement Loop" that re-dispatches the Main Agent with Co-Worker feedback. + +### 🟢 Stage 1: Data & Models (Foundation) +- [ ] **DB Model Update**: Modify `AgentInstance` DB model to include `co_worker_enabled`, `rework_threshold` (0-100), and `max_reworks` (int). +- [ ] **Workspace Mirroring**: Implement `.cortex/` path creation and JSON `history.log` initialization in the Agent's filesystem jail. + +### 🟢 Stage 2: Orchestration Logic (The Engine) +- [ ] **Pre-Run Analytic Turn**: implement a hook in `agent_loop.py` to prompt the Co-Pilot to generate a request-specific `rubric.md`. +- [ ] **Stage 2A (Stateless Evaluation)**: Call Co-Pilot with stripped context to generate an objective numerical score. +- [ ] **Stage 2B (Delta Discovery)**: Call Co-Pilot with score-anonymized history to generate gap feedback for the Main Agent. +- [ ] **Rework Loop**: Modify `AgentExecutor` to handle recursive re-triggering based on the score vs. threshold comparison. + +### 🟢 Stage 3: UI & Monitoring (Dashboard) +- [ ] **Config UI**: Update `DeployAgentModal.tsx` and `AgentDrillDown` settings with HSL threshold sliders. +- [ ] **Evaluation Tab**: Build a dedicated tab in `AgentDrillDown` to stream `feedback.md` and `history.log`. +- [ ] **Live Status Badges**: Add "Evaluating..." and "Last Quality Score" indicators to the agent management dashboard. + +### 🟢 Stage 4: Testing & Safety +- [ ] **Blind Context Audit**: Verify that the Co-Pilot in Stage 2A receives zero knowledge of previous rounds. +- [ ] **Loop Breaker Test**: Ensure `max_reworks` correctly stops an infinite implementation loop. diff --git a/docs/features/harness_engineering/co_pilot_task_list.md b/docs/features/harness_engineering/co_pilot_task_list.md new file mode 100644 index 0000000..8630592 --- /dev/null +++ b/docs/features/harness_engineering/co_pilot_task_list.md @@ -0,0 +1,40 @@ +# Task List: Co-Pilot Agent Harness Implementation + +This document tracks the progress of the autonomous evaluation and self-improvement loop for the Cortex Hub agents. + +## 🟢 Stage 1: Data & Models (Foundation) +- [ ] **DB Model Update**: Modify the backend `AgentInstance` model (PostgreSQL/MongoDB as applicable) to include: + - `co_worker_enabled`: (Boolean) Default: `False`. + - `rework_threshold`: (Integer) Range 0-100. Default: `80`. + - `max_rework_count`: (Integer) Default: `3`. +- [ ] **Workspace Mirroring**: + - [ ] Create `.cortex/` directory in the agent's unique jail during initialization. + - [ ] Implement `history.log` append logic (JSON format). + +## 🟢 Stage 2: Orchestration Logic (The Engine) +- [ ] **Request-Specific Rubric Generator**: + - [ ] Implement a pre-execution hook in `agent_loop.py`. + - [ ] Prompt the Co-Pilot to generate a task-specific `rubric.md`. +- [ ] **Dual-Stage Post-Run Hook**: + - [ ] **Stage 2A (Blind Rating)**: Implement gRPC/Executor logic to call the Co-Pilot with a stripped context. + - [ ] **Stage 2B (Delta Analysis)**: Implement context-aware gap discovery (Score-Anonymized). +- [ ] **Recursive Execution Logic**: + - [ ] Logic in `AgentExecutor` to recursively re-trigger if `Score < Threshold` and `Reworks < Max`. + +## 🟢 Stage 3: User Interface (Dashboard) +- [ ] **Agent Config Tab**: + - [ ] Add the "Co-Worker Settings" section to `DeployAgentModal.tsx`. + - [ ] Implement HSL-styled sliders for threshold and count. +- [ ] **Evaluation Tab (`AgentDrillDown`)**: + - [ ] Create a real-time markdown renderer for `.cortex/feedback.md`. + - [ ] Build a "Rework History" component that visualizes `history.log` JSON data. +- [ ] **Status Badges**: + - [ ] Display "Evaluating..." state on the agent card during post-run turns. + - [ ] Show a permanent "Quality Score" badge (Green/Yellow/Red) derived from the last log entry. + +## 🟢 Stage 4: Reliability & Testing +- [ ] **Integration Tests**: + - [ ] Test: A task that fails on attempt 1, reworks, and passes on attempt 2. + - [ ] Test: A task that reaches `max_reworks` and stops even if score is still low. +- [ ] **Bias Validation**: + - [ ] Audit logs to ensure Stage 2A truly receives zero context of previous rounds.