
Feature Design: Harness Engineering (AI Orchestrator)

1. Executive Summary & Core Concept

Harness Engineering (AI Orchestrator) is a transformative new layer within the platform that evolves our one-on-one AI interactions into a collaborative, automated, multi-agent ecosystem.

Currently, Swarm Control provides a powerful but manual one-to-one developer interface: a user initiates a session, configures nodes, issues prompts, and watches the terminal execution.
Harness Engineering takes the Swarm Control concept and fully encapsulates it. Each "session" of Swarm Control is wrapped into an autonomous entity called an Agent. Multiple Agents can run concurrently, wait in the background for events, wake up, execute tasks using their dedicated Swarm features, communicate with each other, and go back to sleep, enabling infinite, collaborative execution to achieve complex user requests.


2. The "Agent" Architecture & Customization

Every Agent is fundamentally an extended instance of a Swarm Control session, but with persistence, defined roles, and event-driven automation.

A. Persona Definition via Markdown

  • System Prompts as Files: Each Agent's role, constraints, and instructions are defined in an associated .md file. By simply editing this Markdown file, developers can customize the exact behavior of the Agent (e.g., "QA Automator", "Code Reviewer", "Database Migrator").
  • Dynamic Configuration: When an Agent wakes up, the session engine injects this MD system prompt to initialize the LLM's context.

B. Swarm Feature Inheritance

Each Agent inherits the full power of the existing Swarm Control configuration:

  • Dedicated Chat Window context.
  • Dedicated Node Attachments (Execution VMs).
  • Multi-node visibility (Live hardware logs, File sync, Terminal execution).
  • Specific LLM Session Engine settings (Provider, Model).

C. Execution Modes: Hooks & Loop Mode

  • Loop / Autonomous Mode: An Agent can be configured to run continuously in a loop, pausing for specific outputs and executing follow-ups without user intervention.
  • Webhooks & CRON: Agents can be put into an Idle state where they consume minimal resources, listening for a trigger.
    • Hooks: Git pushes, Jira tickets, Slack messages, or simple API calls can wake the Agent up, passing the payload as its initial prompt.
    • Periodic: CRON-like scheduling allows Agents to wake up, scan logs, and report daily.

D. Architectural Inspirations (Open Source)

To ensure the Orchestrator design is robust, it adopts key patterns from popular open-source multi-agent frameworks:

  • Token-Efficient Handoffs via Manifests (inspired by OpenAI Swarm): Agents explicitly route control to another specialized agent using a strict "Handoff Schema". Crucially, Agents do not pass their entire chat history. Passing 100k tokens of debugging history to a QA Agent is too expensive and confusing. To keep the context lean, the handoff is not the "story" of the work, but the Contract of the result. The originating Agent generates a dense JSON Manifest:

    ```json
    {
      "handoff_id": "task_refactor_v1",
      "source_agent": "Lead_Engineer_Agent",
      "target_agent": "QA_Tester_Agent",
      "status": "SUCCESS",
      "artifacts": {
        "working_dir": "/tmp/cortex/shared/refactor_delta_01",
        "files_changed": ["src/auth.py", "tests/test_auth.py"],
        "cli_entrypoint": "pytest tests/test_auth.py"
      },
      "summary_for_target": "Refactored JWT logic. Please verify the 401 Unauthorized edge cases."
    }
    ```

    This tiny JSON object becomes the only initial context injected into the target Agent's empty session, saving an enormous number of tokens and ensuring a clean start.
  • Hierarchical Role-based Tasks (inspired by CrewAI): Utilizing our Markdown .md templates, agents are assigned strict roles (e.g., Lead Engineer, QA Tester) and collaborate through shared state to achieve multi-step goals.
  • Graph-Based Routing (inspired by LangGraph): Future iterations can enforce a stateful, graph-based workflow where nodes represent agents and edges define permissible handoff flows, enabling highly deterministic pipelines.
  • Conversational Reflection Loops (inspired by AutoGen): Designing loops where agents critique and refine each other's outputs adaptively (e.g., a Coder agent writes code, a Reviewer agent critiques it and sends it back for revision) before fulfilling the user's initial request.
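The Handoff Manifest lends itself to a typed contract. A minimal Python sketch, with field names taken from the JSON example above (the production schema may differ):

```python
from dataclasses import dataclass
import json

@dataclass
class HandoffManifest:
    handoff_id: str
    source_agent: str
    target_agent: str
    status: str                       # "SUCCESS" | "FAILURE"
    artifacts: dict
    summary_for_target: str

    def to_initial_context(self) -> str:
        # Rendered manifest becomes the ONLY message seeded into the
        # target Agent's otherwise-empty session.
        return json.dumps(self.__dict__, indent=2)

manifest = HandoffManifest(
    handoff_id="task_refactor_v1",
    source_agent="Lead_Engineer_Agent",
    target_agent="QA_Tester_Agent",
    status="SUCCESS",
    artifacts={"working_dir": "/tmp/cortex/shared/refactor_delta_01"},
    summary_for_target="Refactored JWT logic. Verify 401 edge cases.",
)
seed = manifest.to_initial_context()
```

Enforcing a fixed schema at handoff time is what keeps the target session's initial context both lean and predictable.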

3. User Interface Design (Agent Dashboard)

The primary UI for Harness Engineering will pivot from a traditional chat interface to an orchestration dashboard.

A. Agent Cards Layout

  • A grid of interactive Agent Cards serving as high-level monitoring for DevOps teams.
  • Card Details:
    • Agent Avatar / Name / Role.
    • Current Status Indicator (🟢 Active, 🟡 Idle, 🔵 Listening, 🔴 Error).
    • Active Node Count / Trigger Configuration (Webhook).
  • Telemetry Sparklines: A mini-graph dynamically showing the isolated CPU/Memory usage of the Agent's specific Namespace Jail, alongside a "Token Burn Rate" to visually spot runaway background loops.
  • Interceptor Actions: Beyond a simple Play/Pause, the card includes a "Global Kill-Switch" and a "Pause on Next Tool Call" button. This allows a user to freeze an agent exactly where it is (mid-thought) to inspect its Jail before it executes a potentially destructive bash command.
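"Pause on Next Tool Call" can be approximated as an interrupt flag checked at the safe boundary before each tool executes, never mid-execution. A hedged sketch (the flag and state names are invented for illustration):

```python
def run_tool_calls(tools, flags):
    # Check the interceptor flag BEFORE each tool call, so the agent
    # freezes at a safe boundary where its Jail can be inspected.
    executed = []
    for tool in tools:
        if flags.get("pause_requested"):
            flags["state"] = "FROZEN"   # frozen mid-thought, pre-execution
            break
        executed.append(tool())
    return executed

flags = {"pause_requested": True}
frozen = run_tool_calls([lambda: "ls"], flags)   # nothing runs
```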

B. Session Drill-Down (Dual-Track View)

Clicking on an Agent Card opens up the "Drill-Down" UI. Because Chat History is only 30% of an autonomous agent's story, this UI uses a Dual-Track Layout:

  • Left Pane (The Thought Process): An observation pane displaying the Agent's conversational loop (Thoughts, Prompts, Terminal Output). Users can type directly into the input box to intervene mid-loop.
  • Right Pane (Live File State): A live-updating File Tree that directly reuses the existing Mirror File System (File Sync Engine). We do not build a new file-state tracker. Instead, the UI simply mounts the existing FileSystemNavigator.js React component and passes it a rootPath prop locking its view strictly to the Agent's Workspace Jail (e.g., /tmp/cortex/agent_A/). This gives instant physical inspection of the files the agent is modifying with zero new backend infrastructure.

Advanced CLI Tools

  • Context-Aware Terminal: The docked terminal dynamically reflects the structural permissions of the Agent. If the Agent has a Global Node Lock (e.g., a Baremetal Orchestrator managing Docker or nginx system-wide), the terminal displays the native Mesh root prompt ([root@ubuntu-server-1 /]#). If the Agent is a concurrently Jailed Worker (e.g., just running pytest in isolation), it color-codes as a jail ([Agent_A@Jail-123 /src]$). This prevents the human user from accidentally typing a system-wide command when they are actually inside a jailed worker session.
  • Time-Travel Log: Since agents run autonomously for 4 hours while humans sleep, the terminal includes a "Playback Slider." Instead of just seeing the final successful result, users can scrub backward through the execution logs to pinpoint exactly where an obscure pip install loop failed before the agent eventually mitigated it.

C. Trigger Configuration & Mechanics

Agents operate autonomously based on conditions defined by the user in the UI.

  1. Manual Triggers (The Play Button):
    • UI: A prominent "Start/Pause" toggle on the Agent Card.
    • Mechanics: Kicks off the Agent loop exactly once with no external context, relying entirely on its .md prompt instructions (e.g., "Run a full system check").
  2. Scheduled Triggers (CRON):
    • UI: The user selects a timetable or types a raw cron expression (e.g., 0 * * * * for hourly).
    • Mechanics: The Hub backend uses a lightweight scheduler (like apscheduler). On schedule, the Hub grabs the AgentInstance and pushes a synthetic wake-up message into its chat queue (e.g., "SYSTEM: CRON WAKEUP"). The Worker picks it up, runs the AI loop, completes the task, and returns the Agent to 🟡 Idle.
  3. Event Webhooks (Push Data & Acknowledge-First Architecture):
    • UI: Clicking "Generate Webhook" produces a secure URL and secret token (e.g., https://ai.jerxie.com/webhooks/agents/123/hit?token=abc). You paste this into external systems like GitHub or Jira.
    • Mechanics: Long-running agent workflows (like compiling code) guarantee a standard synchronous webhook will timeout (e.g., GitHub drops the connection after 30s) and retry rapidly, creating destructive duplicate loops. To solve this, the API strictly enforces an Acknowledge-First flow:
      1. The ai-hub API receives the raw JSON webhook.
      2. The Hub instantly maps the JSON to a User Message, drops the task into the background DB queue, and returns an immediate HTTP 202 Accepted back to GitHub, closing the connection.
      3. The background Agent worker wakes up (🔵 Listening -> 🟢 Active).
      4. Crucially, the Agent reads its explicit "Hippocampus" (the Persistent Scratchpad .txt on the Node) to determine whether this new payload is a continuation of a previously crashed/interrupted task or a brand-new one, before it starts working idempotently.
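The Acknowledge-First flow above can be sketched with a plain in-process queue standing in for the background DB queue (in production the 202 would come from the web framework; names here are illustrative):

```python
import queue

task_queue = queue.Queue()   # stand-in for the background DB queue

def receive_webhook(agent_id, payload):
    # 1) Map the raw JSON to a user message, 2) enqueue it,
    # 3) acknowledge immediately. No agent work runs before the 202.
    task_queue.put({
        "agent_id": agent_id,
        "role": "user",
        "content": f"Webhook event: {payload.get('action', 'unknown')}",
    })
    return 202  # HTTP 202 Accepted: GitHub's connection closes here

status = receive_webhook(123, {"action": "opened"})
```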

D. Dependency Graph (The "Orchestrator" View)

As agents begin to natively Handoff tasks (passing JSON Manifests), they form a pipeline (e.g., Frontend Dev -> Backend Dev -> QA Reviewer). The UI provides a "Link View" visualizing these connections as edges between nodes. Real-time token flow and "Awaiting Dependencies" states are visualized here to help lead engineers spot pipeline bottlenecks instantly.


4. Critical User Journeys (CUJs)

CUJ 1: Creating an Event-Driven PR Reviewer (Webhook)

Goal: The user wants an Agent to automatically review code whenever a Pull Request is opened in their repository.

  1. Creation: The user navigates to the Agent Dashboard and clicks Deploy New Agent. They upload their customized github_reviewer.md persona and attach it to prod-mesh-node-1.
  2. Setup: The user tabs to the "Trigger Settings" and selects Webhook. The UI instantly generates a secret URL (https://ai.jerxie.com/webhooks/agents/123?token=abc).
  3. Context Mapping: The user defines an incoming JSON mapping instructing the Hub how to read the external webhook: "A PR was opened! Title: {{payload.pull_request.title}}".
  4. Activation: The user clicks Deploy. The Agent card drops into the dashboard with a 🔵 Listening status. The user pastes the URL into GitHub, and the journey is complete. The Agent will now wake up automatically when GitHub pushes traffic.
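The context-mapping step can be sketched as a tiny template renderer that resolves {{payload.*}} placeholders against the webhook body (the placeholder syntax matches the example above; the real mapping engine may differ):

```python
import re

def render_mapping(template: str, payload: dict) -> str:
    # Resolve each {{payload.a.b.c}} placeholder by walking the
    # webhook JSON with the dotted key path.
    def resolve(match):
        value = payload
        for key in match.group(1).split("."):
            value = value[key]
        return str(value)
    return re.sub(r"\{\{\s*payload\.([\w.]+)\s*\}\}", resolve, template)

prompt = render_mapping(
    "A PR was opened! Title: {{payload.pull_request.title}}",
    {"pull_request": {"title": "Fix auth bug"}},
)
```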

CUJ 2: Manual Intervention (The Dashboard Quick-Play vs Drill-Down)

Goal: The user wants to manually command an Agent that usually runs on a schedule.

  1. The Quick-Play: The user sees a Log_Archiver Agent on the dashboard. They want to archive logs right now instead of waiting for the cron job. They hit the Play Button on the Agent Card. The Hub silently sends an empty <WAKE_UP> ping, forcing the Agent to run its defined .md loop immediately.
  2. The Drill-Down: The user wants the Log_Archiver to ignore syslog today and focus on nginx.log. They click the Agent Card, opening the Drill-Down UI (which looks identical to the Swarm Control chat interface). The user types into the chat box: "Ignore syslog today, only archive nginx.log." and hits Enter. This custom user message wakes the Agent from 🟡 Idle to 🟢 Active, completely steering its next loop iteration.

5. Implementation & Modularization Strategy

Since the existing application has a clean Backend API separation (likely FastAPI/Django for ai-hub and React for the frontend), we can implement this robustly while maintaining flexibility.

A. Reusing the Backend API

The "Lightweight AI Flow" principle ensures we don't reinvent the wheel. To start, an Agent is simply a database record that references an existing Session ID.

  • The Agent background runner (Celery or Async task) will literally pretend to be a User, calling the existing backend APIs (POST /api/v1/sessions/:id/messages) to trigger the underlying Swarm execution.
  • We expose a wrapper: POST /api/v1/agents/{id}/trigger which takes a web payload and translates it into a message for that Agent's underlying Session.
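A minimal sketch of the trigger wrapper's translation step, with the HTTP client injected so only the routing logic is shown (helper names are assumptions; the endpoint paths come from the text above):

```python
def trigger_agent(agent, payload, post_message):
    # The wrapper only translates: webhook payload -> session message.
    session_id = agent["session_id"]          # Agent maps 1:1 to a Session
    body = {"role": "user", "content": f"TRIGGER: {payload}"}
    return post_message(f"/api/v1/sessions/{session_id}/messages", body)

sent = []
def fake_post(url, body):                     # stand-in for the HTTP client
    sent.append(url)
    return {"status": "queued"}

result = trigger_agent({"session_id": "sess-42"}, {"event": "push"}, fake_post)
```

Because the wrapper speaks the existing messages API, the Agent runner genuinely "pretends to be a User" with no new execution path.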

B. Data Model Adjustments

New tables/collections needed:

  • AgentTemplate: Path to the .md persona, default Swarm configs, default node connections.
  • AgentInstance: A running version of a template, mapped 1:1 with a Session ID, tracking connection states and loop configuration.
  • AgentTrigger: Configurations for hooks (url, secret) or cron schedules.
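Illustrative shapes for the three records, using dataclasses as a stand-in for the real ORM models (field names are assumptions, not the final schema):

```python
from dataclasses import dataclass, field

@dataclass
class AgentTemplate:
    id: int
    persona_md_path: str                 # path to the .md system prompt
    default_swarm_config: dict = field(default_factory=dict)
    default_node_ids: list = field(default_factory=list)

@dataclass
class AgentInstance:
    id: int
    template_id: int
    session_id: str                      # 1:1 with an existing Session
    status: str = "IDLE"                 # IDLE | LISTENING | ACTIVE | ERROR

@dataclass
class AgentTrigger:
    id: int
    agent_instance_id: int
    kind: str                            # "webhook" | "cron" | "manual"
    config: dict = field(default_factory=dict)  # url/secret or cron expr

instance = AgentInstance(id=1, template_id=1, session_id="sess-42")
```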

C. Phased Evolutionary Implementation Plan

To build this smoothly, we will prioritize a "Make it Work" MVP using our existing architecture, ensuring the API contract is solid. Once validated, we seamlessly swap the engine underneath to reach our ultimate scale.

Phase 1: The Monolithic MVP ("Make it Work")

  • Action: Build the Orchestrator loop directly inside FastAPI using BackgroundTasks (zero infrastructure sprawl).
  • Setup: Create the AgentInstance DB records. Route the background task to simply call existing endpoints: GET /nodes/{id}/terminal and POST /nodes/{id}/dispatch.
  • Context Limits: Use a crude "sliding window" (only send the last 15 messages) to prevent token saturation.
  • Node Clashing: Rely on system prompts instructing Agents to use unique /tmp/{agent_name}/ directories.
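The Phase 1 sliding window can be sketched in a few lines (the system prompt is always retained; only conversational history is truncated to the last 15 messages, per the plan above):

```python
WINDOW = 15  # crude Phase 1 cap; replaced by rolling summarization in Phase 4

def sliding_window(history, system_prompt):
    # Always keep the persona; send only the tail of the conversation.
    return [{"role": "system", "content": system_prompt}] + history[-WINDOW:]

history = [{"role": "user", "content": f"msg {i}"} for i in range(40)]
context = sliding_window(history, "You are the Log_Archiver agent.")
```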

Phase 2: The UI Dashboard & Persona Engine

  • Action: Build the AgentHarnessPage.js Card UI.
  • Setup: Introduce the ability to mount a dynamic .md file as the system prompt for a Session. Allow users to click from the Dashboard directly into the pre-configured SwarmControlPage to visually spectate the background Agent working.

Phase 3: The Engine Migration ("Scale it Up")

  • Action: Lift the background loop out of FastAPI and drop it into a Celery/Redis worker fleet (Path A).
  • Setup: Because our MVP was designed to use the Hub's public API to observe/dispatch commands, the Celery workers can be hosted anywhere and still interact perfectly with the Hub using simple REST. We achieve horizontal scale without rewriting the core execution logic.

Phase 4: Agent Collaboration & Advanced Resilience

  • Action: Build out explicit Handoff tools (handoff_to_agent(target_agent_id, json_manifest)). By enforcing a JSON Manifest schema, we guarantee token efficiency and prevent "context poisoning" between disparate agents (e.g., Coder -> QA).
  • Setup: Introduce Rolling Summarization (replacing the sliding window), Financial Circuit Breakers, and transition Redis TTL locks to handle zombie agent recovery intelligently.

6. Conclusion & Future Flexibility

By abstracting "Swarm Control" into "Agents", we modularize intelligence. The backend AI doesn't need to know if it's chatting with a human or triggered by GitHub; it just receives prompts and executes on the Mesh Nodes. This keeps the codebase incredibly DRY (Don't Repeat Yourself) while exponentially increasing the capabilities of the platform. We can iteratively refine the .md files without touching backend Python code, providing maximum flexibility for the future.


7. Background Resilience & Self-Recovery Mechanics

Unlike an interactive chat where a human can instantly see and correct an AI error, background Orchestrator loops require extreme self-healing and failure mechanisms to prevent runaway infrastructure or billing disasters. We engineer resilience at five layers:

A. Circuit Breakers (Cost & API Failure Limits)

Autonomous loops can easily get stuck retrying broken code, rapidly burning provider tokens and hammering rate limits (HTTP 429).

  • Mechanism: Every Agent Template is assigned a hard execution cap (e.g., Max_Iterations: 20). If an Agent fails to complete its objective within 20 continuous tool calls, the Hub triggers a Circuit Breaker. The Agent halts, flips to 🔴 Error (Suspended), and instantly alerts the Dashboard for manual human intervention via the conversational drill-down. Network drops or OpenAI 502s are gracefully handled via strict exponential backoff (Tenacity).
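A sketch of the iteration-cap breaker (retry/backoff for transient API errors is delegated to Tenacity in the real implementation and omitted here):

```python
MAX_ITERATIONS = 20  # per-template hard execution cap

class CircuitBreakerTripped(Exception):
    pass

def run_agent_loop(step_fn, max_iterations=MAX_ITERATIONS):
    # step_fn performs one tool call and returns True when the
    # objective is complete. Exceeding the cap suspends the agent.
    for iteration in range(1, max_iterations + 1):
        if step_fn():
            return iteration
    raise CircuitBreakerTripped("Agent suspended: iteration cap exceeded")
```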

B. The Zombie Sweeper (State Recovery)

If the actual ai-hub Docker container restarts, or the Python worker runs out of RAM midway through a task, the DB will still say the agent is 🟢 Active, but no background thread is actually running it.

  • Mechanism (TTL Leases): When an async worker takes a job, it acquires a "Lease" on that Agent in the DB with a 2-minute Time-To-Live (TTL). While processing, the worker pings the DB every 60 seconds to extend the lease. If the worker crashes, the lease expires. A lightweight background sweeper checks the DB every 5 minutes and immediately resets any "Zombie" agents to 🟡 Idle. Because our Agents are stateless, the next worker simply reads the chat history and seamlessly resumes the loop exactly where it left off.
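The lease mechanics can be sketched with an in-memory map standing in for the DB (TTL of 120 seconds per the text; timestamps are passed explicitly to keep the sketch deterministic):

```python
LEASE_TTL = 120  # seconds

leases = {}  # agent_id -> lease expiry timestamp (stand-in for a DB column)

def acquire_lease(agent_id, now):
    if leases.get(agent_id, 0) > now:
        return False                       # another worker holds the lease
    leases[agent_id] = now + LEASE_TTL
    return True

def heartbeat(agent_id, now):
    leases[agent_id] = now + LEASE_TTL     # extend while still working

def sweep_zombies(statuses, now):
    # Reset agents whose lease expired but whose status is still ACTIVE.
    zombies = [a for a, s in statuses.items()
               if s == "ACTIVE" and leases.get(a, 0) <= now]
    for agent_id in zombies:
        statuses[agent_id] = "IDLE"
    return zombies
```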

C. Context Saturation (Rolling Memory Summary)

An Agent in an infinite loop will rapidly exceed the LLM's 100k+ token window if it appends every thought and terminal output linearly.

  • Mechanism: A background Context Manager constantly measures the Agent's message byte-size. As the Agent approaches the limit, the Hub spins up a fast, cheap LLM model (e.g., Llama3/Claude Haiku) to compress the oldest 50 messages into a dense "Scratchpad Summary." The Agent is then fed a constant-size prompt: [System Persona] + [Condensed Scratchpad] + [Last 10 Actions], ensuring it never crashes from token bloat.
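A sketch of the compression trigger, with the cheap summarizer LLM call stubbed out as a plain function (the byte budget and the 50-message batch are illustrative defaults):

```python
COMPRESS_OLDEST = 50  # oldest messages condensed per compression pass

def maybe_compress(history, summarize, byte_budget=400_000):
    # Measure transcript byte-size; compress only when over budget.
    size = sum(len(m["content"].encode()) for m in history)
    if size <= byte_budget or len(history) <= COMPRESS_OLDEST:
        return history                       # still under budget
    old, recent = history[:COMPRESS_OLDEST], history[COMPRESS_OLDEST:]
    summary = {"role": "system",
               "content": "[Condensed Scratchpad] " + summarize(old)}
    return [summary] + recent                # constant-size head + fresh tail

# Demo with a tiny budget so compression actually triggers:
history = [{"role": "user", "content": "x" * 100} for _ in range(60)]
compressed = maybe_compress(
    history, lambda old: f"{len(old)} older messages condensed",
    byte_budget=1_000)
```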

D. Node State Concurrency (Clashing Environments)

If multiple autonomous Agents are attached to the same physical node simultaneously, their shell commands might collide (e.g., Agent A deletes /tmp/data while Agent B is trying to zip it).

  • Mechanism (Global Locks & Jails): Attempting to parse raw bash strings for "path-based semantics" is incredibly brittle (e.g., an Agent using cd .. could escape a path lock). Instead, the MVP exclusively uses Global Node Locks (only one Agent can orchestrate a specific Mesh Node at a time). To achieve true concurrency later, we will use Workspace Jails. An Agent will be strictly confined by the Node's Sandbox policy to only write to its assigned runtime directory (e.g., /tmp/cortex/agent_A/). If an Agent needs to modify a global system config (like /etc/nginx/), it must explicitly escalate and halt all other Agents via a Global Node Lock.

E. Persistent Headless Logging

The current Swarm UI relies on in-memory WebSockets to display live terminal output. A background Agent might poll the node, but if the logs stream violently fast, crucial output could be dropped from RAM before the polling cycle hits.

  • Mechanism: The Agent Node strictly streams long-running background task outputs into persistent, locked log files on the host disk (e.g., ~/.cortex/logs/{session_id}.log). The orchestrator natively instructs the Agent to read these concrete files for analysis rather than sniffing the live websocket buffer, guaranteeing zero data loss.

F. Idempotency & Crash Artifacts (State Collision)

If an Agent crashes halfway through downloading a dataset or creating a database table, the "Zombie Sweeper" will reset it and the Agent will retry. Without precautions, the Agent will immediately crash again because the git clone or CREATE TABLE command will throw a "Resource Already Exists" error.

  • Mechanism: Agent Prompts will be engineered with rigorous Idempotency rules. Agents will be explicitly instructed to always verify the current state of the filesystem or environment before executing write-commands upon waking up. If temporary crash artifacts are detected (e.g., partial downloads), the Agent must clean its isolated directory namespace before restarting the task.
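A sketch of the idempotent wake-up guard for filesystem state (an analogous IF EXISTS check would guard database DDL; the directory layout is illustrative):

```python
import shutil
import tempfile
from pathlib import Path

def prepare_workspace(workdir: Path) -> str:
    # Verify state BEFORE any write-command: if crash artifacts exist
    # (partial downloads, half-written files), clean the isolated
    # directory, then recreate it fresh.
    if workdir.exists():
        shutil.rmtree(workdir)
        action = "cleaned_and_recreated"
    else:
        action = "created"
    workdir.mkdir(parents=True)
    return action

base = Path(tempfile.mkdtemp()) / "agent_A"
first = prepare_workspace(base)    # clean boot
second = prepare_workspace(base)   # simulated retry after a crash
```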

G. Summary Degradation ("Agent Dementia")

While Rolling Memory Summary (Mechanism C) keeps the Agent below token limits, summarization inherently destroys precise detail. If an agent spent 5 turns fixing a massive regex on line 124 of a 5,000-line file, the summarizer might condense it to: "Agent reviewed file and fixed regex." Because the exact path and line number are lost from context, the Agent will suffer from "dementia" and have to re-discover its own work repeatedly in long loops.

  • Mechanism (The Persistent Scratchpad): We will provide Agents with an explicit Scratchpad Node Skill. The Agent will be instructed to treat a physical .txt file on the node as its own hippocampus. It will actively write exact variables, paths, and immediate next steps to this physical file so that even if the Hub summarizes its chat history, its literal working memory is safely preserved and readable natively by the Agent on every loop tick.
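The Scratchpad skill reduces to two file operations the Agent invokes on every loop tick. A minimal sketch (the scratchpad path and helper names are illustrative):

```python
import tempfile
from pathlib import Path

def scratchpad_write(path: Path, fact: str) -> None:
    # Append one exact fact per line: paths, variables, next steps.
    with path.open("a") as f:
        f.write(fact.rstrip() + "\n")

def scratchpad_read(path: Path) -> str:
    # Re-read literal working memory at the top of every loop tick.
    return path.read_text() if path.exists() else ""

pad = Path(tempfile.mkdtemp()) / "agent_A_scratchpad.txt"
scratchpad_write(pad, "Fixed regex at src/parser.py:124; next: rerun pytest")
memory = scratchpad_read(pad)
```

Even after the Hub summarizes the chat history, the exact path and line number survive in this file, which is what prevents the "dementia" described above.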

8. Architecture Scalability & Decentralization

Running the orchestrator loop (compiling prompts, calling LLMs, interpreting outputs) for 10+ sub-agents directly inside the main ai-hub API server is fundamentally unscalable. A single Python FastAPI backend will quickly become I/O and CPU bound.

To achieve infinite, horizontal scale for the Orchestrator as usage grows, we strictly adhere to a decoupled Worker Fleet Architecture (Brains in the Cloud, Smart Hands on the Edge).

The Worker Fleet Model

To avoid that bottleneck, we split the architecture into three tiers:

  1. The Hub (Stateless API): Remains strictly a router and execution proxy, caching current state.
  2. The Worker Pool: We implement an asynchronous worker fleet (scaling from FastAPI BackgroundTasks in the MVP, up to Celery workers for Enterprise instances). These separate containers exclusively run the long-lived while loops, make heavy LLM API calls, and parse prompts.
  3. The Edge Nodes: The agent-node clients remain drastically lightweight. They run no LLM logic locally.

Addressing Structural Limitations

To ensure the Worker Fleet remains furiously fast and doesn't buckle under network traffic:

  • Batching Skills: The Agent prompt is instructed to aggregate commands into bash scripts rather than rapid-firing single-line commands, heavily reducing the gRPC Round Trip Time (RTT).
  • Data-Reduction at Edge: If the Worker Agent needs to ingest a 100MB repository or parse huge live log files, streaming that data from the Node to the Hub just to feed it into the LLM context will choke the network. We mitigate this by building Data-Reduction Skills (e.g., remote_semantic_grep). The Worker instructs the Node to run the heavy file-parsing locally on its own CPU, and exactly 3 lines of dense, matched text are returned over the wire to the Worker.
  • Dependency Minimalism: We explicitly avoid integrating complex message brokers like Kafka or heavy workflow engines like Temporal to ensure the platform remains remarkably easy to self-host and maintain.
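The data-reduction contract can be sketched locally: the filter runs where the data lives, and only the matches cross the wire (remote_semantic_grep is the hypothetical skill name from the text; a plain substring grep stands in for the semantic matching):

```python
def reduce_at_edge(file_text: str, needle: str, max_lines: int = 3) -> list:
    # Runs ON THE NODE's own CPU: filter the huge file locally and
    # return only a handful of dense matched lines to the Worker.
    matches = [line for line in file_text.splitlines() if needle in line]
    return matches[:max_lines]

# A synthetic 29-line log; only the matches travel back to the Worker.
log = "\n".join(
    f"line {i}: ok" if i % 7 else f"line {i}: ERROR timeout"
    for i in range(1, 30)
)
hits = reduce_at_edge(log, "ERROR")
```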