diff --git a/.gitignore b/.gitignore
index 7f24cdf..24e5896 100644
--- a/.gitignore
+++ b/.gitignore
@@ -16,4 +16,5 @@
 data/*
 .env.gitbucket
 .env.ai
-**/.env*
\ No newline at end of file
+**/.env*
+app/CaudeCodeSourceCode/*
\ No newline at end of file
diff --git a/docs/features/harness_engineering/co_pilot_agent_design.md b/docs/features/harness_engineering/co_pilot_agent_design.md
index d54d8d5..2023cad 100644
--- a/docs/features/harness_engineering/co_pilot_agent_design.md
+++ b/docs/features/harness_engineering/co_pilot_agent_design.md
@@ -146,3 +146,52 @@
 ### 🟢 Stage 4: Testing & Safety
 - [ ] **Blind Context Audit**: Verify that the Co-Pilot in Stage 2A receives zero knowledge of previous rounds.
 - [ ] **Loop Breaker Test**: Ensure `max_reworks` correctly stops an infinite implementation loop.
+
+---
+
+## 8. Lessons from Claude Code (CC) Architecture
+
+After a deep dive into the Claude Code (recovered) source, we should adopt the following "premium" patterns to harden the Co-Worker system.
+
+### A. Memory Mechanics (The Index Pattern)
+Claude Code uses a two-tier memory system (`MEMORY.md` as an index + topic files). We should adopt this for `.cortex/evaluation.md`.
+- **Implementation**: `evaluation.md` should serve as a **Table of Contents**. Detailed rationales and rework logs should be split into `.cortex/logs/round_N.md`.
+- **Benefit**: Keeps the main evaluation context "concise and cache-friendly" while allowing the agent to "Deep Dive" into previous failures only when necessary.
+- **Reference**: `src/memdir/memdir.ts`
+
+### B. System Prompt Boundaries & "No Gold-Plating"
+Claude Code uses a `SYSTEM_PROMPT_DYNAMIC_BOUNDARY` to optimize caching and has strict "Doing Tasks" edicts.
+- **Edicts to Adopt** (Directly from CC's `# Doing tasks` Section):
+  - *"Don't add features, refactor code, or make 'improvements' beyond what was asked. A bug fix doesn't need surrounding code cleaned up."*
+  - *"Don't add docstrings, comments, or type annotations to code you didn't change."*
+  - *"Don't create helpers, utilities, or abstractions for one-time operations. Three similar lines of code is better than a premature abstraction."*
+  - *"Before reporting a task complete, verify it actually works: run the test, execute the script, check the output."*
+- **The Boundary Pattern**:
+  1. Insert a marker like `__DYNAMIC_BOUNDARY__` after the static system instructions.
+  2. Everything before this marker is cached by the LLM provider (e.g., Anthropic's Prompt Caching).
+  3. Per-session state (CWD, Tool List, Memory) is appended after this marker.
+- **Application**: The Co-Worker will use these edicts as its **Evaluation Criteria**. If the Main Agent adds a comment you didn't ask for, the Co-Worker will flag it as "Gold-Plating" and lower the Quality score.
+- **Reference**: `src/constants/prompts.ts`
+
+### C. Context Compaction Awareness
+Long rework loops will eventually hit token limits.
+- **CC Pattern**: "Microcompact" and "Autocompact" strategy.
+- **Application**: If `Attempts > 2`, the Co-Worker should be instructed to **Summarize the Rework History** instead of providing the full text of previous rounds. This prevents the Main Agent's context from becoming bloated with "criticism noise."
+- **Reference**: `src/query.ts` (`queryLoop` state management).
+
+### D. Visual "Buddy" Status
+Claude Code uses an animated sprite to show state.
+- **Application**: The `AgentCard` UI should display a unique **Co-Worker Avatar** whose expression changes based on the `Blind Score`.
+  - **90+**: Smiling/Approving.
+  - **70-89**: Thinking/Skeptical.
+  - **<70**: Warning/Frustrated signal.
+- **Benefit**: Immediate visual feedback for the user on perceived quality without reading logs.
+- **Reference**: `src/buddy/CompanionSprite.tsx`
+
+### E. The "Directive" Fork Pattern
+When the Co-Worker triggers a rework, the prompt shouldn't just be "Fix this."
+- **CC Pattern**: Sub-agents receive a **Directive** ("Brief the agent like a smart colleague who just walked into the room").
+- **Application**: The Phase 2B feedback should be formatted as a **Direct Command Set**, not a conversational critique.
+  - **Bad**: "I think the code is a bit messy, maybe fix it?"
+  - **Good**: "Directive: Refactor `auth.py:L24` to use the `.env` variable instead of the hardcoded string."
+- **Reference**: `src/tools/AgentTool/prompt.ts`
diff --git a/docs/features/harness_engineering/co_pilot_task_list.md b/docs/features/harness_engineering/co_pilot_task_list.md
index 8630592..5979c6a 100644
--- a/docs/features/harness_engineering/co_pilot_task_list.md
+++ b/docs/features/harness_engineering/co_pilot_task_list.md
@@ -1,40 +1,35 @@
-# Task List: Co-Pilot Agent Harness Implementation
+# Master Index: Co-Pilot Agent Harness Implementation
 
-This document tracks the progress of the autonomous evaluation and self-improvement loop for the Cortex Hub agents.
+This is the central index for tracking the autonomous evaluation system progress. Detailed tasks are split into specific topic files to maintain a lightweight context window during orchestration.
 
-## 🟢 Stage 1: Data & Models (Foundation)
-- [ ] **DB Model Update**: Modify the backend `AgentInstance` model (PostgreSQL/MongoDB as applicable) to include:
-  - `co_worker_enabled`: (Boolean) Default: `False`.
-  - `rework_threshold`: (Integer) Range 0-100. Default: `80`.
-  - `max_rework_count`: (Integer) Default: `3`.
-- [ ] **Workspace Mirroring**:
-  - [ ] Create `.cortex/` directory in the agent's unique jail during initialization.
-  - [ ] Implement `history.log` append logic (JSON format).
+---
 
-## 🟢 Stage 2: Orchestration Logic (The Engine)
-- [ ] **Request-Specific Rubric Generator**:
-  - [ ] Implement a pre-execution hook in `agent_loop.py`.
-  - [ ] Prompt the Co-Pilot to generate a task-specific `rubric.md`.
-- [ ] **Dual-Stage Post-Run Hook**:
-  - [ ] **Stage 2A (Blind Rating)**: Implement gRPC/Executor logic to call the Co-Pilot with a stripped context.
-  - [ ] **Stage 2B (Delta Analysis)**: Implement context-aware gap discovery (Score-Anonymized).
-- [ ] **Recursive Execution Logic**:
-  - [ ] Logic in `AgentExecutor` to recursively re-trigger if `Score < Threshold` and `Reworks < Max`.
+## 📈 Overall Status: [🟡 INITIALIZING]
 
-## 🟢 Stage 3: User Interface (Dashboard)
-- [ ] **Agent Config Tab**:
-  - [ ] Add the "Co-Worker Settings" section to `DeployAgentModal.tsx`.
-  - [ ] Implement HSL-styled sliders for threshold and count.
-- [ ] **Evaluation Tab (`AgentDrillDown`)**:
-  - [ ] Create a real-time markdown renderer for `.cortex/feedback.md`.
-  - [ ] Build a "Rework History" component that visualizes `history.log` JSON data.
-- [ ] **Status Badges**:
-  - [ ] Display "Evaluating..." state on the agent card during post-run turns.
-  - [ ] Show a permanent "Quality Score" badge (Green/Yellow/Red) derived from the last log entry.
+### 1. [Stage 1: Foundation (Data & Models)](./harness_tasks/foundation.md)
+  - **Focus**: DB updates and Mirror System filesystem setup.
+  - **Status**: [🟢 IN PROGRESS]
+  - **Key File**: `.cortex/evaluation.md`
 
-## 🟢 Stage 4: Reliability & Testing
-- [ ] **Integration Tests**:
-  - [ ] Test: A task that fails on attempt 1, reworks, and passes on attempt 2.
-  - [ ] Test: A task that reaches `max_reworks` and stops even if score is still low.
-- [ ] **Bias Validation**:
-  - [ ] Audit logs to ensure Stage 2A truly receives zero context of previous rounds.
+### 2. [Stage 2: Engine (Orchestration Logic)](./harness_tasks/orchestration.md)
+  - **Focus**: Dual-Pass evaluation loop and recursive re-triggering.
+  - **Status**: [⚪ PLANNED]
+  - **Key File**: `agent_loop.py` hooks.
+
+### 3. [Stage 3: Dashboard (User Interface)](./harness_tasks/ui_dashboard.md)
+  - **Focus**: Controls, markdown streaming, and quality badges.
+  - **Status**: [⚪ PLANNED]
+  - **Key File**: `AgentDrillDown.tsx`
+
+### 4. [Stage 4: Quality (Reliability & Testing)](./harness_tasks/reliability.md)
+  - **Focus**: Bias validation and loop breaker stability.
+  - **Status**: [⚪ PLANNED]
+
+---
+
+## 🛠 Lessons from Claude Code (Memory Mechanics Adherence)
+*Pattern: `MEMORY.md` Index + Topic Files*
+
+1. **Lightweight Index**: This file (the index) remains small so it can be loaded into any agent turn without busting the token budget.
+2. **Topic Segregation**: Details for Foundation, Engine, and UI are stored in `/harness_tasks/`. The agent only "reads" the relevant topic file when working on that specific stage.
+3. **Consistency**: Changes to tasks should be made in the topic files; the index only tracks high-level "Status" bubbles.
diff --git a/docs/features/harness_engineering/harness_tasks/foundation.md b/docs/features/harness_engineering/harness_tasks/foundation.md
new file mode 100644
index 0000000..4a4f400
--- /dev/null
+++ b/docs/features/harness_engineering/harness_tasks/foundation.md
@@ -0,0 +1,22 @@
+---
+title: Stage 1 - Data & Models (Foundation)
+status: IN_PROGRESS
+priority: HIGH
+---
+
+## Core Objectives
+Establish the underlying database structure and filesystem mirroring required for the Co-Worker agent's state management.
+
+## Task Breakdown
+- [ ] **DB Model Update**: Modify the backend `AgentInstance` model (PostgreSQL/MongoDB as applicable) to include:
+  - `co_worker_enabled`: (Boolean) Default: `False`.
+  - `rework_threshold`: (Integer) Range 0-100. Default: `80`.
+  - `max_rework_count`: (Integer) Default: `3`.
+- [ ] **Workspace Mirroring**:
+  - [ ] Create `.cortex/` directory in the agent's unique jail during initialization.
+  - [ ] Implement `history.log` append logic (JSON format).
+
+## Claude Code Inspiration: Memory Context
+*Reference: `src/memdir/memdir.ts`*
+- Ensure the `.cortex/` directory exists immediately on agent startup (idempotent initialization).
+- Use a single-line, append-only JSON format for `history.log` to prevent partial write corruption.
diff --git a/docs/features/harness_engineering/harness_tasks/orchestration.md b/docs/features/harness_engineering/harness_tasks/orchestration.md
new file mode 100644
index 0000000..a22c898
--- /dev/null
+++ b/docs/features/harness_engineering/harness_tasks/orchestration.md
@@ -0,0 +1,28 @@
+---
+title: Stage 2 - Orchestration Logic (The Engine)
+status: PLANNED
+priority: CRITICAL
+---
+
+## Core Objectives
+Implement the logic that triggers the Co-Worker agent at pre-run and post-run phases, managing the dual-stage evaluation.
+
+## Task Breakdown
+- [ ] **Request-Specific Rubric Generator**:
+  - [ ] Implement a pre-execution hook in `agent_loop.py`.
+  - [ ] Prompt the Co-Pilot to generate a task-specific `rubric.md`.
+- [ ] **Dual-Stage Post-Run Hook**:
+  - [ ] **Stage 2A (Blind Rating)**: Implement gRPC/Executor logic to call the Co-Pilot with a stripped context.
+  - [ ] **Stage 2B (Delta Analysis)**: Implement context-aware gap discovery (Score-Anonymized).
+- [ ] **Directive-Based Rework Injection**:
+  - [ ] Update the `agent_loop.py` rework trigger logic.
+  - [ ] Instead of passing raw feedback, format the Co-Worker's gaps into a **Directive block** (e.g., *"Actionable Command: Refactor X to resolve Y"*).
+- [ ] **Context Compaction Gate**:
+  - [ ] Implement logic to detect token usage/turn count in the rework loop.
+  - [ ] If `Attempts > 2`, trigger the Co-Pilot to summarize the `.cortex/history.log` and replace the full rework history with a **Compacted Delta** for the Main Agent.
+
+## Claude Code Inspiration: Loop Orchestration
+*Reference: `src/query.ts`*
+- Adopt the `QueryLoop` state object to track `maxOutputTokensRecoveryCount` (or in our case, `reworkCount`) across iterations to avoid losing terminal state.
+- Use the **"Directive Fork"** pattern: In Phase 2B, provide a strict directive rather than just commentary to improve fix accuracy.
+- **Context Management**: Adopt the `Microcompact` and `Autocompact` principles: summarize previous attempts in long sessions to save tokens and focus the agent's attention on the latest delta.
diff --git a/docs/features/harness_engineering/harness_tasks/reliability.md b/docs/features/harness_engineering/harness_tasks/reliability.md
new file mode 100644
index 0000000..652a0a5
--- /dev/null
+++ b/docs/features/harness_engineering/harness_tasks/reliability.md
@@ -0,0 +1,19 @@
+---
+title: Stage 4 - Reliability & Testing
+status: PLANNED
+priority: HIGH
+---
+
+## Core Objectives
+Validate the rework loop's stability and ensure objectivity in the evaluation process.
+
+## Task Breakdown
+- [ ] **Integration Tests**:
+  - [ ] Test: A task that fails on attempt 1, reworks, and passes on attempt 2.
+  - [ ] Test: A task that reaches `max_reworks` and stops even if score is still low.
+- [ ] **Bias Validation**:
+  - [ ] Audit logs to ensure Stage 2A truly receives zero context of previous rounds.
+
+## Claude Code Inspiration: Recovery Circuit Breakers
+*Reference: `src/query.ts`*
+- Ensure the `max_reworks` logic is a hard circuit breaker (similar to `MAX_OUTPUT_TOKENS_RECOVERY_LIMIT`) to avoid infinite loops and runaway costs.
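Reviewer sketch (not part of the patch): the hard circuit breaker described for `max_reworks` in `reliability.md` could look roughly like the following. This is a minimal sketch under stated assumptions, not the actual `agent_loop.py` logic: `LoopState`, `run_with_reworks`, `run_agent`, and `evaluate` are hypothetical names introduced for illustration, while the defaults for `rework_threshold` and `max_rework_count` are taken from the Stage 1 task list.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LoopState:
    """Hypothetical per-request state carried across rework iterations."""
    rework_count: int = 0
    scores: List[int] = field(default_factory=list)


def run_with_reworks(
    run_agent: Callable[[int], str],   # assumed: takes the rework index, returns output
    evaluate: Callable[[str], int],    # assumed: returns a 0-100 quality score
    rework_threshold: int = 80,
    max_rework_count: int = 3,
) -> LoopState:
    """Run the agent, re-triggering while the score is below the threshold.

    The loop performs at most `max_rework_count` reworks, no matter how low
    the last score is.
    """
    state = LoopState()
    while True:
        output = run_agent(state.rework_count)
        score = evaluate(output)
        state.scores.append(score)
        if score >= rework_threshold:
            break  # quality bar met
        if state.rework_count >= max_rework_count:
            break  # hard circuit breaker: stop even though the score is low
        state.rework_count += 1
    return state
```

The two integration tests listed in the task breakdown map onto this sketch directly: a run that fails once and then passes ends with one rework, and a run that never passes ends after exactly `max_rework_count` reworks.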
diff --git a/docs/features/harness_engineering/harness_tasks/ui_dashboard.md b/docs/features/harness_engineering/harness_tasks/ui_dashboard.md
new file mode 100644
index 0000000..0483aa8
--- /dev/null
+++ b/docs/features/harness_engineering/harness_tasks/ui_dashboard.md
@@ -0,0 +1,27 @@
+---
+title: Stage 3 - User Interface (Dashboard)
+status: PLANNED
+priority: MEDIUM
+---
+
+## Core Objectives
+Build the user-facing controls and monitoring tabs for the evaluation loop.
+
+## Task Breakdown
+- [ ] **Agent Config Tab**:
+  - [ ] Add the "Co-Worker Settings" section to `DeployAgentModal.tsx`.
+  - [ ] Implement HSL-styled sliders for threshold and count.
+- [ ] **Evaluation Tab (`AgentDrillDown`)**:
+  - [ ] Create a real-time markdown renderer for `.cortex/feedback.md`.
+  - [ ] Build a "Rework History" component that visualizes `history.log` JSON data.
+- [ ] **Mood-Based Co-Worker Avatar**:
+  - [ ] Create a `CoWorkerAvatar` component to be displayed in the `AgentDrillDown` and `AgentCard`.
+  - [ ] Implement logic to map the numerical `Quality Score` to an avatar mood:
+    - `90-100`: High Approval (Happy/Smiling).
+    - `75-89`: Skeptical (Thinking).
+    - `<75`: Critical (Warn/Stern).
+
+## Claude Code Inspiration: Visual Feedback
+*Reference: `src/buddy/CompanionSprite.tsx`*
+- **Deterministic Avatars**: CC uses a seeded calculation based on user IDs (`userId + SALT`) to determine the "Buddy." While we want a single Co-Worker persona, the **Mood State Tree** (e.g., `HAPPY`, `THINKING`, `WARN`) is directly applicable to our Quality Score mapping.
+- **Personality through Animation**: Consider adding micro-animations to the avatar (e.g., a "Thinking" spin during the Co-Pilot evaluation phase) to match CC's high-polish terminal experience.
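Reviewer sketch (not part of the patch): the Quality Score to mood mapping from `ui_dashboard.md` can be expressed as a small pure function. Python is used here only to illustrate the threshold logic; the real component would presumably be the `CoWorkerAvatar` TSX component named in the task list. `Mood` and `mood_for_score` are hypothetical names. The default cut-offs follow `ui_dashboard.md` (90 and 75); section 8.D of the design doc uses 70 for the lower band, so the thresholds are left as parameters rather than fixed.

```python
from enum import Enum


class Mood(Enum):
    """Mood states named in the Visual Feedback notes."""
    HAPPY = "HAPPY"
    THINKING = "THINKING"
    WARN = "WARN"


def mood_for_score(score: int, happy_min: int = 90, thinking_min: int = 75) -> Mood:
    """Map a 0-100 quality score to an avatar mood (hypothetical helper)."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be in 0..100, got {score}")
    if score >= happy_min:
        return Mood.HAPPY      # 90-100: High Approval
    if score >= thinking_min:
        return Mood.THINKING   # 75-89 by default: Skeptical
    return Mood.WARN           # below the lower cut-off: Critical
```

Calling `mood_for_score(72)` yields `Mood.WARN` with the defaults, while `mood_for_score(72, thinking_min=70)` yields `Mood.THINKING`, which is the difference between the two threshold tables in this patch.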