diff --git a/docs/features/harness_engineering/harness_engineering_design.md b/docs/features/harness_engineering/harness_engineering_design.md new file mode 100644 index 0000000..ccce4b5 --- /dev/null +++ b/docs/features/harness_engineering/harness_engineering_design.md @@ -0,0 +1,208 @@ +# Feature Design: Harness Engineering (AI Orchestrator) + +## 1. Executive Summary & Core Concept +**Harness Engineering (AI Orchestrator)** is a transformative new layer within the platform that evolves our one-on-one AI interactions into a collaborative, automated, multi-agent ecosystem. + +Currently, **Swarm Control** provides a powerful but manual one-to-one developer interface: a user initiates a session, configures nodes, issues prompts, and watches the terminal execution. +**Harness Engineering** takes the Swarm Control concept and fully encapsulates it. Each "session" of Swarm Control is wrapped into an autonomous entity called an **Agent**. Multiple Agents can run concurrently, wait in the background for events, wake up, execute tasks utilizing their dedicated Swarm features, communicate with each other, and go back to sleepβ€”enabling infinite, collaborative execution to achieve complex user requests. + +--- + +## 2. The "Agent" Architecture & Customization + +Every Agent is fundamentally an extended instance of a Swarm Control session, but with persistence, defined roles, and event-driven automation. + +### A. Persona Definition via Markdown +- **System Prompts as Files:** Each Agent's role, constraints, and instructions are defined in an associated `.md` file. By simply editing this Markdown file, developers can customize the exact behavior of the Agent (e.g., "QA Automator", "Code Reviewer", "Database Migrator"). +- **Dynamic Configuration:** When an Agent wakes up, the session engine injects this MD system prompt to initialize the LLM's context. + +### B. 
Swarm Feature Inheritance +Each Agent inherits the full power of the existing Swarm Control configuration: +- Dedicated Chat Window context. +- Dedicated Node Attachments (Execution VMs). +- Multi-node visibility (Live hardware logs, File sync, Terminal execution). +- Specific LLM Session Engine settings (Provider, Model). + +### C. Execution Modes: Hooks & Loop Mode +- **Loop / Autonomous Mode:** An Agent can be configured to run continuously in a loop, pausing for specific outputs and executing follow-ups without user intervention. +- **Webhooks & CRON:** Agents can be put into an `Idle` state where they consume minimal resources, listening for a trigger. + - **Hooks:** Git pushes, Jira tickets, Slack messages, or simple API calls can wake the Agent up, passing the payload as its initial prompt. + - **Periodic:** CRON-like scheduling allows Agents to wake up, scan logs, and report daily. + +### D. Architectural Inspirations (Open Source) +To ensure the Orchestrator design is robust, it adopts key patterns from popular open-source multi-agent frameworks: +- **Token-Efficient Handoffs via Manifests (inspired by OpenAI Swarm):** Agents explicitly route control to another specialized agent using a strict "Handoff Schema". Crucially, **Agents do not pass their entire chat history**. Passing 100k tokens of debugging history to a QA Agent is too expensive and confusing. To keep the context lean, the handoff is not the "story" of the work, but the **Contract of the result**. The originating Agent generates a dense JSON Manifest: + ```json + { + "handoff_id": "task_refactor_v1", + "source_agent": "Lead_Engineer_Agent", + "target_agent": "QA_Tester_Agent", + "status": "SUCCESS", + "artifacts": { + "working_dir": "/tmp/cortex/shared/refactor_delta_01", + "files_changed": ["src/auth.py", "tests/test_auth.py"], + "cli_entrypoint": "pytest tests/test_auth.py" + }, + "summary_for_target": "Refactored JWT logic. Please verify the 401 Unauthorized edge cases." 
+ } + ``` + This tiny JSON object becomes the *only* initial context injected into the target Agent's empty session, saving massive tokens and ensuring a clean start. +- **Hierarchical Role-based Tasks (inspired by CrewAI):** Utilizing our Markdown `.md` templates, agents are assigned strict roles (e.g., Lead Engineer, QA Tester) and collaborate through shared state to achieve multi-step goals. +- **Graph-Based Routing (inspired by LangGraph):** Future iterations can enforce a stateful, graph-based workflow where nodes represent agents and edges define permissible handoff flows, enabling highly deterministic pipelines. +- **Conversational Reflection Loops (inspired by AutoGen):** Designing loops where agents critique and refine each other's outputs adaptively (e.g., a Coder agent writes code, a Reviewer agent critiques it and sends it back for revision) before fulfilling the user's initial request. + +--- + +## 3. User Interface Design (Agent Dashboard) + +The primary UI for Harness Engineering will pivot from a traditional chat interface to an orchestration dashboard. + +### A. Agent Cards Layout +- A grid of interactive **Agent Cards** serving as high-level monitoring for DevOps teams. +- **Card Details:** + - Agent Avatar / Name / Role. + - Current Status Indicator (`🟒 Active`, `🟑 Idle`, `πŸ”΅ Listening`, `πŸ”΄ Error`). + - Active Node Count / Trigger Configuration (`Webhook`). +- **Telemetry Sparklines:** A mini-graph dynamically showing the isolated CPU/Memory usage of the Agent's specific Namespace Jail, alongside a "Token Burn Rate" to visually spot runaway background loops. +- **Interceptor Actions:** Beyond a simple Play/Pause, the card includes a "Global Kill-Switch" and a "Pause on Next Tool Call" button. This allows a user to freeze an agent exactly where it is (mid-thought) to inspect its Jail before it executes a potentially destructive bash command. + +### B. 
Session Drill-Down (Dual-Track View) +Clicking on an Agent Card opens up the "Drill-Down" UI. Because Chat History is only 30% of an autonomous agent's story, this UI uses a **Dual-Track Layout**: +- **Left Pane (The Thought Process):** An observation pane displaying the Agent's conversational loop (Thoughts, Prompts, Terminal Output). Users can type directly into the input box to intervene mid-loop. +- **Right Pane (Live File State):** A live-updating File Tree that directly reuses the existing **Mirror File System (File Sync Engine)**. We *do not* build a new file-state tracker. Instead, the UI simply mounts the existing `FileSystemNavigator.js` React component and passes it a `rootPath` prop locking its view strictly to the Agent's Workspace Jail (e.g., `/tmp/cortex/agent_A/`). This gives instant physical inspection of the files the agent is modifying with zero new backend infrastructure. + +#### Advanced CLI Tools +- **Context-Aware Terminal:** The docked terminal dynamically reflects the structural permissions of the Agent. If the Agent has a **Global Node Lock** (e.g., a Baremetal Orchestrator managing Docker or `nginx` system-wide), the terminal displays the native Mesh root prompt (`[root@ubuntu-server-1 /]#`). If the Agent is a concurrently **Jailed Worker** (e.g., just running `pytest` in isolation), it color-codes as a jail (`[Agent_A@Jail-123 /src]$`). This prevents the human user from accidentally typing a system-wide command when they are actually inside a jailed worker session. +- **Time-Travel Log:** Since agents run autonomously for 4 hours while humans sleep, the terminal includes a "Playback Slider." Instead of just seeing the final successful result, users can scrub backward through the execution logs to pinpoint exactly where an obscure `pip install` loop failed before the agent eventually mitigated it. + +### C. Trigger Configuration & Mechanics +Agents operate autonomously based on conditions defined by the user in the UI. + +1. 
**Manual Triggers (The Play Button):**
+   - *UI:* A prominent "Start/Pause" toggle on the Agent Card.
+   - *Mechanics:* Kicks off the Agent loop exactly once with no external context, relying entirely on its `.md` prompt instructions (e.g., "Run a full system check").
+2. **Scheduled Triggers (CRON):**
+   - *UI:* The user selects a timetable or types a raw cron expression (e.g., `0 * * * *` for hourly).
+   - *Mechanics:* The Hub backend uses a lightweight scheduler (like `apscheduler`). Every hour, the Hub grabs the `AgentInstance` and pushes an empty message into its chat queue (e.g., "SYSTEM: CRON WAKEUP"). The Worker picks it up, runs the AI loop, completes the task, and returns the Agent to `🟑 Idle`.
+3. **Event Webhooks (Push Data & Acknowledge-First Architecture):**
+   - *UI:* Clicking "Generate Webhook" produces a secure URL and secret token (e.g., `https://ai.jerxie.com/webhooks/agents/123/hit?token=abc`). You paste this into external systems like GitHub or Jira.
+   - *Mechanics:* Long-running agent workflows (like compiling code) guarantee that a standard synchronous webhook will time out (e.g., GitHub drops the connection after 30s) and retry rapidly, creating destructive duplicate loops. To solve this, the API strictly enforces an **Acknowledge-First** flow:
+     1. The `ai-hub` API receives the raw JSON webhook.
+     2. The Hub instantly maps the JSON to a User Message, drops the task into the background DB queue, and returns an immediate `HTTP 202 Accepted` back to GitHub, closing the connection.
+     3. The background Agent worker wakes up (`πŸ”΅ Listening` --> `🟒 Active`).
+     4. *Crucial:* The Agent reads its explicit "Hippocampus" (the Persistent Scratchpad `.txt` on the Node) to determine if this new payload is a continuation of a previously crashed/interrupted task, or a brand new one, before it starts working idempotently.
+
+### D. 
Dependency Graph (The "Orchestrator" View)
+As agents begin to natively hand off tasks (passing JSON Manifests), they form a pipeline (e.g., *Frontend Dev* -> *Backend Dev* -> *QA Reviewer*). The UI provides a "Link View" visualizing these connections as edges between nodes. Real-time token flow and "Awaiting Dependencies" states are visualized here to help lead engineers spot pipeline bottlenecks instantly.
+
+---
+
+## 4. Critical User Journeys (CUJs)
+
+### CUJ 1: Creating an Event-Driven PR Reviewer (Webhook)
+**Goal:** The user wants an Agent to automatically review code whenever a Pull Request is opened in their repository.
+1. **Creation:** The user navigates to the Agent Dashboard and clicks `Deploy New Agent`. They upload their customized `github_reviewer.md` persona and attach it to `prod-mesh-node-1`.
+2. **Setup:** The user tabs to the "Trigger Settings" and selects **Webhook**. The UI instantly generates a secret URL (`https://ai.jerxie.com/webhooks/agents/123?token=abc`).
+3. **Context Mapping:** The user defines an incoming JSON mapping instructing the Hub how to read the external webhook: *"A PR was opened! Title: `{{payload.pull_request.title}}`"*.
+4. **Activation:** The user clicks deploy. The Agent card drops into the dashboard with a `πŸ”΅ Listening` status. The user pastes the URL into GitHub, and the journey is complete. The Agent will now wake up automatically when GitHub pushes traffic.
+
+### CUJ 2: Manual Intervention (The Dashboard Quick-Play vs Drill-Down)
+**Goal:** The user wants to manually command an Agent that usually runs on a schedule.
+1. **The Quick-Play:** The user sees a `Log_Archiver` Agent on the dashboard. They want to archive logs right now instead of waiting for the cron job. They hit the **Play Button** on the Agent Card. The Hub silently sends an empty ping, forcing the Agent to run its defined `.md` loop immediately.
+2. 
**The Drill-Down:** The user wants the `Log_Archiver` to ignore `syslog` today and focus on `nginx.log`. They **click the Agent Card**, opening the Drill-Down UI (which looks identical to the Swarm Control chat interface). The user types into the chat box: *"Ignore syslog today, only archive nginx.log."* and hits Enter. This custom user message wakes the Agent from `🟑 Idle` to `🟒 Active`, completely steering its next loop iteration.
+
+---
+
+## 5. Implementation & Modularization Strategy
+
+Since the existing application has a clean Backend API separation (likely leveraging FastAPI/Django for `ai-hub` and React for `frontend`), we can implement this robustly while maintaining flexibility.
+
+### A. Reusing the Backend API
+The "Lightweight AI Flow" principle ensures we don't reinvent the wheel. To start, an Agent is simply a database record that references an existing Session ID.
+- The Agent background runner (Celery or Async task) will literally pretend to be a User, calling the existing backend APIs (`POST /api/v1/sessions/:id/messages`) to trigger the underlying Swarm execution.
+- We expose a wrapper: `POST /api/v1/agents/{id}/trigger` which takes a web payload and translates it into a message for that Agent's underlying Session.
+
+### B. Data Model Adjustments
+New tables/collections needed:
+- `AgentTemplate`: Path to the `.md` persona, default Swarm configs, default node connections.
+- `AgentInstance`: A running version of a template, mapped 1:1 with a `Session ID`, tracking connection states and loop configuration.
+- `AgentTrigger`: Configurations for hooks (`url`, `secret`) or cron schedules.
+
+### C. Phased Evolutionary Implementation Plan
+To build this smoothly, we will prioritize a "Make it Work" MVP using our existing architecture, ensuring the API contract is solid. Once validated, we seamlessly swap the engine underneath to reach our ultimate scale. 
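The loop contract that stays stable while the engine underneath is swapped can be sketched in a few lines. This is a minimal illustration, not the real implementation: `run_agent_loop` and `run_llm_step` are hypothetical names, and the caps mirror the Phase 1 defaults (20-iteration circuit breaker, 15-message sliding window) described in this plan.

```python
MAX_ITERATIONS = 20   # circuit-breaker cap (see Section 7A)
SLIDING_WINDOW = 15   # crude Phase 1 context limit: last 15 messages only

def run_agent_loop(history, run_llm_step):
    """Drive the agent until it signals completion or the breaker trips.

    `run_llm_step` stands in for the LLM call plus tool dispatch; it
    receives the windowed context and returns the agent's next action.
    """
    for _ in range(MAX_ITERATIONS):
        context = history[-SLIDING_WINDOW:]   # sliding-window context
        action = run_llm_step(context)
        history.append(action)
        if action.get("type") == "done":
            return "idle"                     # clean completion
    return "error_suspended"                  # circuit breaker tripped

# Example: an agent that finishes on its third step.
steps = iter([{"type": "tool"}, {"type": "tool"}, {"type": "done"}])
status = run_agent_loop([], lambda ctx: next(steps))
```

Because the loop only depends on `history` and a callable, the same shape works whether it runs inside a FastAPI `BackgroundTask` (Phase 1) or a Celery worker (Phase 3).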
+
+**Phase 1: The Monolithic MVP ("Make it Work")**
+- **Action:** Build the Orchestrator loop directly inside FastAPI using `BackgroundTasks` (zero infrastructure sprawl).
+- **Setup:** Create the `AgentInstance` DB records. Route the background task to simply call existing endpoints: `GET /nodes/{id}/terminal` and `POST /nodes/{id}/dispatch`.
+- **Context Limits:** Use a crude "sliding window" (only send the last 15 messages) to prevent token saturation.
+- **Node Clashing:** Rely on system prompts instructing Agents to use unique `/tmp/{agent_name}/` directories.
+
+**Phase 2: The UI Dashboard & Persona Engine**
+- **Action:** Build the `AgentHarnessPage.js` Card UI.
+- **Setup:** Introduce the ability to mount a dynamic `.md` file as the system prompt for a Session. Allow users to click from the Dashboard directly into the pre-configured `SwarmControlPage` to visually spectate the background Agent working.
+
+**Phase 3: The Engine Migration ("Scale it Up")**
+- **Action:** Lift the background loop out of FastAPI and drop it into a Celery/Redis worker fleet (Path A).
+- **Setup:** Because our MVP was designed to use the Hub's public API to observe/dispatch commands, the Celery workers can be hosted *anywhere* and still interact perfectly with the Hub using simple REST. We achieve horizontal scale without rewriting the core execution logic.
+
+**Phase 4: Agent Collaboration & Advanced Resilience**
+- **Action:** Build out explicit Handoff tools (`handoff_to_agent(target_agent_id, json_manifest)`). By enforcing a JSON Manifest schema, we guarantee token efficiency and prevent "context poisoning" between disparate agents (e.g., Coder -> QA).
+- **Setup:** Introduce Rolling Summarization (replacing the sliding window), Financial Circuit Breakers, and transition Redis TTL locks to handle zombie agent recovery intelligently.
+---
+
+## 6. Conclusion & Future Flexibility
+By abstracting "Swarm Control" into "Agents", we modularize intelligence. 
The backend AI doesn't need to know if it's chatting with a human or triggered by GitHub; it just receives prompts and executes on the Mesh Nodes. This keeps the codebase incredibly DRY (Don't Repeat Yourself) while exponentially increasing the capabilities of the platform. We can iteratively refine the `.md` files without touching backend Python code, providing maximum flexibility for the future.
+
+---
+
+## 7. Background Resilience & Self-Recovery Mechanics
+
+Unlike an interactive chat where a human can instantly see and correct an AI error, background Orchestrator loops require extreme self-healing and failure mechanisms to prevent runaway infrastructure or billing disasters. We engineer resilience at seven layers:
+
+### A. Circuit Breakers (Cost & API Failure Limits)
+Autonomous loops can easily get stuck retrying broken code, rapidly burning provider tokens (`HTTP 429`).
+- **Mechanism:** Every Agent Template is assigned a hard execution cap (e.g., `Max_Iterations: 20`). If an Agent fails to complete its objective within 20 continuous tool calls, the Hub triggers a Circuit Breaker. The Agent halts, flips to `πŸ”΄ Error (Suspended)`, and instantly alerts the Dashboard for manual human intervention via the conversational drill-down. Network drops or OpenAI 502s are gracefully handled via strict exponential backoff (`Tenacity`).
+
+### B. The Zombie Sweeper (State Recovery)
+If the actual `ai-hub` Docker container restarts, or the Python worker runs out of RAM midway through a task, the DB will still say the agent is `🟒 Active`, but no background thread is actually running it.
+- **Mechanism (TTL Leases):** When an async worker takes a job, it acquires a "Lease" on that Agent in the DB with a 2-minute Time-To-Live (TTL). While processing, the worker pings the DB every 60 seconds to extend the lease. If the worker crashes, the lease expires. A lightweight background sweeper checks the DB every 5 minutes and immediately resets any "Zombie" agents to `🟑 Idle`. 
Because our Agents are stateless, the next worker simply reads the chat history and seamlessly resumes the loop exactly where it left off. + +### C. Context Saturation (Rolling Memory Summary) +An Agent in an infinite loop will rapidly exceed the LLM's 100k+ token window if it appends every thought and terminal output linearly. +- **Mechanism:** A background Context Manager constantly measures the Agent's message byte-size. As the Agent approaches the limit, the Hub spins up a fast, cheap LLM model (e.g., Llama3/Claude Haiku) to compress the oldest 50 messages into a dense "Scratchpad Summary." The Agent is then fed a constant-size prompt: `[System Persona] + [Condensed Scratchpad] + [Last 10 Actions]`, ensuring it never crashes from token bloat. + +### D. Node State Concurrency (Clashing Environments) +If multiple autonomous Agents are attached to the *same* physical node simultaneously, their shell commands might collide (e.g., Agent A deletes `/tmp/data` while Agent B is trying to zip it). +- **Mechanism (Global Locks & Jails):** Attempting to parse raw bash strings for "path-based semantics" is incredibly brittle (e.g., an Agent using `cd ..` could escape a path lock). Instead, the MVP exclusively uses **Global Node Locks** (only one Agent can orchestrate a specific Mesh Node at a time). To achieve true concurrency later, we will use **Workspace Jails**. An Agent will be strictly confined by the Node's Sandbox policy to only write to its assigned runtime directory (e.g., `/tmp/cortex/agent_A/`). If an Agent needs to modify a global system config (like `/etc/nginx/`), it must explicitly escalate and halt all other Agents via a Global Node Lock. + +### E. Persistent Headless Logging +The current Swarm UI relies on in-memory WebSockets to display live terminal output. A background Agent might poll the node, but if the logs stream violently fast, crucial output could be dropped from RAM before the polling cycle hits. 
+- **Mechanism:** The Agent Node strictly streams long-running background task outputs into persistent, locked log files on the host disk (e.g., `~/.cortex/logs/{session_id}.log`). The orchestrator natively instructs the Agent to read these concrete files for analysis rather than sniffing the live websocket buffer, guaranteeing zero data loss. + +### F. Idempotency & Crash Artifacts (State Collision) +If an Agent crashes halfway through downloading a dataset or creating a database table, the "Zombie Sweeper" will reset it and the Agent will retry. Without precautions, the Agent will immediately crash again because the `git clone` or `CREATE TABLE` command will throw a "Resource Already Exists" error. +- **Mechanism:** Agent Prompts will be engineered with rigorous Idempotency rules. Agents will be explicitly instructed to *always verify the current state of the filesystem or environment* before executing write-commands upon waking up. If temporary crash artifacts are detected (e.g., partial downloads), the Agent must clean its isolated directory namespace before restarting the task. + +### G. Summary Degradation ("Agent Dementia") +While Rolling Memory Summary (Mechanism C) keeps the Agent below token limits, summarization inherently destroys precise detail. If an agent spent 5 turns fixing a massive regex on line 124 of a 5,000-line file, the summarizer might condense it to: *"Agent reviewed file and fixed regex."* Because the exact path and line number are lost from context, the Agent will suffer from "dementia" and have to re-discover its own work repeatedly in long loops. +- **Mechanism (The Persistent Scratchpad):** We will provide Agents with an explicit **Scratchpad Node Skill**. The Agent will be instructed to treat a physical `.txt` file on the node as its own hippocampus. 
It will actively write exact variables, paths, and immediate next steps to this physical file so that even if the Hub summarizes its chat history, its literal working memory is safely preserved and readable natively by the Agent on every loop tick. + +--- + +## 8. Architecture Scalability & Decentralization + +Running the orchestrator loop (compiling prompts, calling LLMs, interpreting outputs) for 10+ sub-agents directly inside the main `ai-hub` API server is **fundamentally unscalable**. A single Python FastAPI backend will quickly become I/O and CPU bound. + +To achieve infinite, horizontal scale for the Orchestrator as usage grows, we strictly adhere to a decoupled **Worker Fleet Architecture** (Brains in the Cloud, Smart Hands on the Edge). + +### The Worker Fleet Model +Running the orchestrator loop directly inside the main `ai-hub` API server will eventually become CPU/IO bound. Instead, we split the architecture: +1. **The Hub (Stateless API):** Remains strictly a router and execution proxy, caching current state. +2. **The Worker Pool:** We implement an asynchronous worker fleet (scaling from FastAPI `BackgroundTasks` in the MVP, up to `Celery` workers for Enterprise instances). These separate containers exclusively run the long-lived `while` loops, make heavy LLM API calls, and parse prompts. +3. **The Edge Nodes:** The `agent-node` clients remain drastically lightweight. They run no LLM logic locally. + +### Addressing Structural Limitations +To ensure the Worker Fleet remains furiously fast and doesn't buckle under network traffic: +- **Batching Skills:** The Agent prompt is instructed to aggregate commands into bash scripts rather than rapid-firing single-line commands, heavily reducing the gRPC Round Trip Time (RTT). +- **Data-Reduction at Edge:** If the Worker Agent needs to ingest a 100MB repository or parse huge live log files, streaming that data from the Node to the Hub just to feed it into the LLM context will choke the network. 
We mitigate this by building **Data-Reduction Skills** (e.g., `remote_semantic_grep`). The Worker instructs the Node to run the heavy file-parsing locally on its own CPU, and exactly 3 lines of dense, matched text are returned over the wire to the Worker. +- **Dependency Minimalism:** We explicitly avoid integrating complex message brokers like Kafka or heavy workflow engines like Temporal to ensure the platform remains remarkably easy to self-host and maintain. diff --git a/docs/features/harness_engineering/harness_engineering_execution_plan.md b/docs/features/harness_engineering/harness_engineering_execution_plan.md new file mode 100644 index 0000000..30ffd34 --- /dev/null +++ b/docs/features/harness_engineering/harness_engineering_execution_plan.md @@ -0,0 +1,127 @@ +# Harness Engineering: Execution Plan + +Based on the [Harness Engineering Design Document](./harness_engineering_design.md), this execution plan translates the theoretical architecture into concrete, step-by-step implementation tasks for the Cortex codebase. + +We are adopting an evolutionary approach, starting with a Monolithic MVP (Phase 1) using FastAPI `BackgroundTasks` to validate the core loop before scaling to a Celery Worker Fleet. + +--- + +## Area 1: Core Database & Context Scaffolding +*The foundational building blocks required to store and track Agents in the `ai-hub` system.* + +### Task 1.1: Define SQLAlchemy Models +We need to extend the backend relational database to track defining constraints, runtime loops, and external hooks. +- **Action:** Create models inside `/app/ai-hub/app/models/agent.py`. +- **`AgentTemplate`:** The blueprint. + - `id` (UUID) + - `name` (String) + - `description` (String) + - `system_prompt_path` (String) - Path to the `.md` persona file. + - `max_loop_iterations` (Integer) - Circuit breaker cap (default: 20). +- **`AgentInstance`:** The living, breathing run-state. 
+ - `id` (UUID) + - `template_id` (ForeignKey) + - `session_id` (ForeignKey) - Links to the existing Swarm Control `Session` table (where chat history is stored). + - `mesh_node_id` (ForeignKey) - The physical server this agent is locked to. + - `status` (Enum: `active`, `idle`, `listening`, `error_suspended`) + - `current_workspace_jail` (String) - E.g., `/tmp/cortex/agent_abc/`. + - `last_heartbeat` (Timestamp) - Used by the Zombie Sweeper algorithm. +- **`AgentTrigger`:** The Wakeup hooks. + - `id` (UUID) + - `instance_id` (ForeignKey) + - `trigger_type` (Enum: `webhook`, `cron`, `manual`) + - `cron_expression` (String, nullable) + - `webhook_secret` (String, nullable) + - `webhook_mapping_schema` (JSON, nullable) - How to map the incoming JSON to a string prompt. + +### Task 1.2: Define Pydantic Schemas & CRUD Endpoints +- **Action:** Create `/app/ai-hub/app/schemas/agent.py` mirroring the SQLAlchemy models. +- **Action:** Create routers in `/app/ai-hub/app/api/v1/endpoints/agents.py` to allow the Frontend dashboard to Query, Create, and Manage Agents. + - `GET /api/v1/agents` + - `POST /api/v1/agents` + - `PATCH /api/v1/agents/{id}/status` + +### Task 1.3: The Acknowledge-First API Controller +- **Action:** Build the Webhook receiver endpoint. + - `POST /api/v1/agents/{id}/webhook` + - Validate the `?token=` parameter. + - Parse the JSON payload according to `webhook_mapping_schema`. + - Push the mapped string as a `User` message into the associated `Session` database table. + - Dispatch a `FastAPI.BackgroundTask` to wake up the Async Agent Loop. + - Immediately return `HTTP 202 Accepted`. + +--- + +## Area 2: The Agent Execution Loop & Circuit Breakers +*The core autonomous engine that actually "thinks" and "does" inside the backend, heavily fortified with error-handling.* + +### Task 2.1: The Lifecycle Manager & Leases +When the Hub throws an agent into the `BackgroundTasks` queue, the `AgentExecutor` python class takes over. 
+- **Action:** Build `AgentExecutor.run(agent_id)` in `/app/ai-hub/app/core/orchestration/agent_loop.py`. +- **Acquire Lease:** Immediately update the DB `AgentInstance.last_heartbeat = NOW()` and `status = active`. Spawn an `asyncio` background thread to update this `last_heartbeat` every 60 seconds while the loop runs. +- **Node Lock:** Attempt to claim the `Global Node Lock` (if a deployment agent) or verify access to the `/tmp/cortex/agent_x/` Workspace Jail (if a concurrent worker). +- **Idempotency Check:** Instruct the initial LLM prompt to explicitly read its `.txt` local Scratchpad to verify if it is recovering from a previous crash state. + +### Task 2.2: The `while True` Orchestration Loop +- **Action:** Implement the core AI loop integrating with the existing `profile.py` logic. +- **Circuit Breaker:** Initialize `iteration_count = 0`. At the top of the loop, check `if iteration_count >= template.max_loop_iterations`. If true, break the loop, update DB status to `error_suspended` and exit. +- **Exponential Backoff:** Wrap the OpenAI/LLM network call with the `@retry(wait=wait_exponential(multiplier=1, min=2, max=10))` decorator from the `tenacity` library to gracefully absorb `HTTP 502` or `HTTP 429` errors without crashing. + +### Task 2.3: Context Saturation (Rolling memory) +- **Action:** Inside the loop, measure `len(chat_history)`. +- If the token payload exceeds 80k tokens, call a secondary, fast LLM model (`gpt-4o-mini` or `haiku`) with a specific prompt: *"Summarize these past 50 interactions into a single condensed scratchpad paragraph."* +- Replace the raw 50 messages in the DB with this single Summary message. + +### Task 2.4: The JSON Handoff Protocol +- **Action:** Define a native `Tool/Skill` called `handoff_to_agent()`. 
+- Force the LLM to output parameter arguments matching the strict JSON Manifest Schema: + - `target_agent_id` + - `working_dir` + - `files_changed` + - `summary_for_target` +- **Execution:** When this tool is triggered, the `AgentExecutor` explicitly terminates the current Agent's loop, spins up the Target Agent's DB instance, injects the JSON Manifest as its *only* initial message, and pushes the new Agent into the `BackgroundTasks` queue. +--- + +## Area 3: Dashboard React UI (Dual-Track Views) +*The frontend evolution transforming Cortex from an interactive chatbot UI into a high-level "System Orchestrator" console.* + +### Task 3.1: The Agent Harness Dashboard (Grid & Cards) +- **Action:** Create `AgentHarnessPage.jsx` at the router level. Map over `AgentInstance` records to render an `AgentCard`. +- **Telemetry Sparklines:** Integrate `recharts` to render a mini-graph on the card. Query `GET /api/v1/agents/{id}/telemetry` (pulling CPU/Memory metrics isolated by the Agent's specific Linux Sandbox/cgroup). +- **The Interceptor Mode:** Add "Global Kill-Switch" and "Pause on Tool-Call" buttons directly to the card. Clicking Pause triggers a `PATCH` request setting `status = paused_mid_loop`, telling the background `AgentExecutor` to halt execution *before* the next LLM tool execution fires, giving the human time to inspect the Jail state. + +### Task 3.2: The Dual-Track Session Drill-Down +- **Action:** Build the `AgentDrillDown.jsx` view utilizing CSS Grid to split the viewport 50/50. +- **Left Pane (Chat Tracker):** Mount the existing `ChatWindow` component. This streams the live AI "thought process" and allows the human to inject custom prompt overrides to steer the loop. +- **Right Pane (Live Jail Filesystem):** Mount the existing `FileSystemNavigator.jsx` component. + - *Crucial UI Filter:* Do not build a new generic filesystem tracker. 
Pass `rootPath={AgentInstance.current_workspace_jail}` (e.g., `/tmp/cortex/agent_abc/`) as a prop directly to `FileSystemNavigator`. + - The Mirror sync system continues to operate natively, but the UI component mathematically locks the human's visual file-tree into the Agent's sandbox, providing immediate "proof of work" verification. + +### Task 3.3: Context-Aware Terminal & Time-Travel Logs +- **Action:** Update the docked `MultiNodeConsole.jsx`. +- **UI Prompt Regex:** Based on the `AgentInstance.Global_Node_Lock` boolean, dynamically rewrite the PS1 prompt text natively in xterm.js. (e.g., green `[root@app-server] $` vs purple `[Agent_X@Jail] $`). +- **Scrubbing Slider:** Implement a `range` slider input. The terminal reads from `GET /nodes/{id}/persistent_agent_log` rather than ephemeral WebSockets, allowing React to "seek" back through thousands of output lines to pinpoint exactly where an obscure error happened 4 hours prior. + +### Task 3.4: The Dependency Graph (Link View) +- **Action:** Integrate `react-flow-renderer` (or similar node-graph library). +- Query the database to find links where `target_agent_id` maps inside the JSON Manifest of a completed Agent. Visually draw edges connecting "Planner Agent" > "Coder Agent" > "QA Agent" with live token flow rates. +--- + +## Area 4: The Zombie Sweeper & Edge Artifacts +*The asynchronous safety nets that guarantee 100% uptime and completely prevent Agent Dementia during long loops.* + +### Task 4.1: The Zombie Sweeper Service +- **Action:** Create `zombie_sweeper.py` in the Hub workers directory. +- **Scheduler:** Use `apscheduler` (or Celery Beat in Phase 3) to execute this job strictly every 5 minutes. +- **Logic:** Execute `UPDATE agent_instances SET status='idle' WHERE status='active' AND last_heartbeat < (NOW() - INTERVAL '3 minutes')`. 
+- **Requeue:** For every row updated, automatically dispatch a new `BackgroundTasks` execute call, forcing the Hub to instantly pick back up the crashed Agents exactly where they left off. + +### Task 4.2: The "Hippocampus" Persistent Scratchpad +- **Action:** Add a native, mandatory `manage_scratchpad(text)` tool/skill to the Agent's baseline Sandbox. +- **The File Constraint:** Force this tool to strictly parse and append text directly to `{AgentInstance.workspace_jail}/.cortex_memory_scratchpad.txt`. +- **The System Prompt Override:** Inject a hardcoded string at the bottom of the user's `personas.md` file: *"CRITICAL: As your history is summarized, you will lose exact variables. You must continuously write critical facts to your `.cortex_memory_scratchpad.txt` file. Every time you wake up from a sleep or webhook, you MUST `cat` this file first."* + +### Task 4.3: Stateful Headless Logging +- **Action:** Modify the gRPC `agent-node` shell execution handler. +- **File Sink:** Instead of merely streaming the `stdout/stderr` buffer into the live WebSocket, bind a `FileOutputStream` to dynamically write every line of PTY output to `~/.cortex/agent_logs/{session_id}.log` on the physical machine. +- **API Endpoint:** Create the HTTP endpoint `GET /api/v1/nodes/{id}/persistent_agent_log` so the Frontend's Time-Travel UI slider can request specific byte-ranges of this log file indefinitely, guaranteeing zero dropped terminal lines even if the frontend disconnects for hours. diff --git a/docs/features/harness_engineering/harness_engineering_test_plan.md b/docs/features/harness_engineering/harness_engineering_test_plan.md new file mode 100644 index 0000000..0cd58a6 --- /dev/null +++ b/docs/features/harness_engineering/harness_engineering_test_plan.md @@ -0,0 +1,67 @@ +# Harness Engineering: Formal Test Plan + +This document outlines the Quality Assurance (QA) test plan for the Harness Engineering (Autonomous Agents) feature. 
These tests map directly to the Critical User Journeys (CUJs) and edge-case fail-safes defined in the architecture.
+
+---
+
+## Stage 1: The Critical User Journeys
+
+### Test 1: The Event-Driven Webhook (CUJ 1)
+**Objective:** Verify that the "Acknowledge-First" webhook architecture operates flawlessly under load without timing out external providers.
+**Steps:**
+1. Create a new `AgentTemplate` with a simple bash task (e.g., `sleep 45 && echo "DONE"`).
+2. Configure a Webhook Trigger and copy the generated URL.
+3. Use Postman or curl to `POST` a simulated GitHub PR JSON payload to the URL.
+**Expected Results:**
+- [ ] The API must return `HTTP 202 Accepted` in under 500ms.
+- [ ] The Agent status in the UI flips from `🔵 Listening` to `🟢 Active`.
+- [ ] The Agent reads the mapped GitHub payload, executes the 45-second sleep task, prints "DONE", and returns to `🔵 Listening`.
+- [ ] The external caller (Postman) is not blocked waiting for the 45-second sleep to finish.
+
+### Test 2: Interceptor & Manual Override (CUJ 2)
+**Objective:** Verify that a human can instantly freeze and override an autonomous agent mid-iteration.
+**Steps:**
+1. Trigger a background Agent tasked with a complex loop (e.g., iteratively grepping 1000 files).
+2. While the agent is `🟢 Active`, click the **"Pause on Next Tool Call"** Interceptor button on the dashboard card.
+3. Open the Dual-Track Drill-Down UI and view the terminal and chat panes.
+4. Type an explicit override into the chat: *"STOP grepping, just print 'HELLO' and exit."*
+**Expected Results:**
+- [ ] Clicking the Interceptor immediately pauses the backend Executor Python loop *before* it fires the next LLM tool call. Status flips to `🟡 Paused`.
+- [ ] Typing the manual message injects a new high-priority `User` message into the Session DB.
+- [ ] The agent wakes back up to `🟢 Active`, acknowledges the human override, prints "HELLO", and terminates.
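The Interceptor behavior exercised in Test 2 can be sketched as a status gate inside the executor loop. A minimal Python sketch, assuming hypothetical `llm_step`, `execute_tool_call`, and `fetch_status` callables — only the status values (`active`, `paused_mid_loop`) come from the design:

```python
import time

def run_agent_loop(agent_id, llm_step, execute_tool_call, fetch_status, max_iterations=5):
    """Drive one Agent's loop, gating every tool call on the DB status."""
    for _ in range(max_iterations):
        action = llm_step(agent_id)  # ask the LLM for the next tool call
        # Interceptor gate: block *before* the tool fires, not after.
        while fetch_status(agent_id) == "paused_mid_loop":
            time.sleep(1)  # human is inspecting the Jail state
        if fetch_status(agent_id) != "active":
            return "terminated"  # kill-switch or manual override ended the loop
        execute_tool_call(agent_id, action)
    return "max_iterations_reached"
```

The point Test 2 verifies is that the gate sits *before* the tool call fires, so a paused Agent never mutates the Jail while the human is inspecting it.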
+
+---
+
+## Stage 2: Background Resilience & Fail-Safes
+
+### Test 3: The Zombie Sweeper Recovery
+**Objective:** Prove that a hard crash (OOM or server power loss) does not permanently deadlock an Agent task.
+**Steps:**
+1. Start an Agent on a long-running download process using the Playground.
+2. Manually kill the FastAPI `BackgroundTasks` thread or restart the `ai-hub` Docker container entirely mid-way through execution.
+3. The DB will inaccurately show the Agent as `🟢 Active` despite the worker being dead. Wait 5 minutes.
+**Expected Results:**
+- [ ] The `zombie_sweeper` cron job detects `last_heartbeat < (NOW() - 3m)`.
+- [ ] The sweeper flips the Agent to `🟡 Idle` and automatically requeues it.
+- [ ] The new worker reads the `.cortex_memory_scratchpad.txt` (Hippocampus) to determine what was already completed, cleans the directory, and restarts the task automatically without human intervention.
+
+### Test 4: The Circuit Breaker (Max Iterations)
+**Objective:** Prevent runaway billing loops when the LLM gets confused or the code is structurally unfixable.
+**Steps:**
+1. Assign an Agent a task that is logically impossible (e.g., *"Find the string 'SECRET' in a directory that has no files, and keep searching until you find it."*).
+2. Set `max_loop_iterations = 5` in the DB template.
+3. Start the Agent.
+**Expected Results:**
+- [ ] The Agent loops repeatedly, calling `mesh_terminal_control` in vain.
+- [ ] On the 5th loop, the `AgentExecutor` forcibly terminates the loop.
+- [ ] The Agent status flips to `🔴 Error (Suspended)` and raises an explicit, visible warning on the Dashboard Card requiring user acknowledgment.
+
+### Test 5: The Namespace Jail Collision
+**Objective:** Ensure concurrent workers cannot modify global system state or each other's Jails.
+**Steps:**
+1. Create Agent A (assigned to `/tmp/cortex/agent_A/`).
+2. Create Agent B (assigned to `/tmp/cortex/agent_B/`).
+3. 
Instruct Agent A to execute `rm -rf /tmp/cortex/agent_B/*` or `cat /etc/passwd`.
+**Expected Results:**
+- [ ] The internal Sandbox Policy rejects the command and returns `SANDBOX_VIOLATION` into Agent A's chat history.
+- [ ] Agent A cannot interact with Agent B's workspace or any path outside its own Jail.
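The `SANDBOX_VIOLATION` check exercised by Test 5 reduces to path containment. A minimal sketch under stated assumptions: the Jail roots match the test above, but `is_inside_jail` and `check_command_paths` are illustrative helpers, not the platform's actual Sandbox Policy API:

```python
from pathlib import Path

def is_inside_jail(requested_path: str, jail_root: str) -> bool:
    """True only if the path resolves inside the Agent's workspace jail."""
    resolved = Path(jail_root, requested_path).resolve()
    try:
        resolved.relative_to(Path(jail_root).resolve())
        return True
    except ValueError:
        return False  # escaped via ../ traversal or an absolute path

def check_command_paths(paths, jail_root="/tmp/cortex/agent_A"):
    """Return 'SANDBOX_VIOLATION' if any referenced path escapes the jail."""
    for p in paths:
        if not is_inside_jail(p, jail_root):
            return "SANDBOX_VIOLATION"
    return "OK"
```

Resolving before comparing catches both absolute escapes (`/etc/passwd`) and relative traversal (`../agent_B/data.txt`), which covers both malicious commands in the test's step 3.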