
Cortex Agent Node: Architecture & Implementation Plan

This document outlines the transition from the current WebSocket (WSS) code-syncing approach to a distributed, secure, multi-agent architecture in which the Cortex Server orchestrates local "Agent Nodes."

🏗️ High-Level Architecture

1. The Cortex Server (Orchestrator)

  • Role: The central brain. Handles AI inference, task planning, and user interface.
  • Communication Hub: Exposes a bidirectional streaming endpoint via gRPC over HTTP/2 to securely manage connections from multiple remote Agent Nodes.
  • Node Registry: Tracks connected nodes, their identities, and their health status. Most importantly, it provides Capability Discovery: the server knows whether a node has, for example, Docker, Python, or Chrome installed before sending it a task.
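
A minimal sketch of how the Node Registry could match tasks to nodes by capability and health. The names `NodeRecord`, `NodeRegistry`, `find_capable`, and the 30-second heartbeat timeout are assumptions made for illustration; the plan does not prescribe them.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class NodeRecord:
    node_id: str
    capabilities: frozenset[str]
    last_heartbeat: float  # seconds, taken from any monotonic clock


class NodeRegistry:
    """In-memory registry of connected nodes (hypothetical helper)."""

    def __init__(self, heartbeat_timeout: float = 30.0) -> None:
        self._nodes: dict[str, NodeRecord] = {}
        self._timeout = heartbeat_timeout  # assumed value, not from the plan

    def register(self, node_id: str, capabilities: set[str], now: float) -> None:
        self._nodes[node_id] = NodeRecord(node_id, frozenset(capabilities), now)

    def heartbeat(self, node_id: str, now: float) -> None:
        self._nodes[node_id].last_heartbeat = now

    def find_capable(self, required: set[str], now: float) -> list[str]:
        """Return IDs of healthy nodes that advertise every required capability."""
        return sorted(
            rec.node_id
            for rec in self._nodes.values()
            if now - rec.last_heartbeat <= self._timeout
            and required <= rec.capabilities
        )
```

Passing `now` explicitly keeps the health check deterministic; a real server would likely read it from `time.monotonic()`.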

2. The Agent Node (Client Software)

  • Role: A lightweight, standalone daemon running on the user's local machine (or specific dev containers).
  • Execution Engine: Receives tasks from the server, executes them locally (using host resources), and streams results back.
  • Capabilities:
    • System Ops: Run bash commands, edit files, list directories.
    • Browser Automation: Control local browsers via CDP (Chrome DevTools Protocol) for UI testing and visual feedback.
    • Auditing: Maintains a strict, append-only local log of every command executed on behalf of the AI, giving the user a transparent trail of what was run and what data was accessed.
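
One way to make the local audit log tamper-evident is to chain entries by hash, so that editing an earlier line invalidates everything after it. The plan itself only requires a strict local log; the hash chain, the JSON-lines layout, and the names `append_entry`, `verify_log`, and `GENESIS` are assumptions of this sketch.

```python
import hashlib
import json
import os

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body: dict) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(path: str, task_id: str, command: str, timestamp: float) -> dict:
    """Append one audit record; call this before the command is executed."""
    prev = GENESIS
    if os.path.exists(path):
        # Re-reading the file is O(n); acceptable for a sketch only.
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    prev = json.loads(line)["hash"]
    body = {"task_id": task_id, "command": command, "ts": timestamp, "prev": prev}
    entry = dict(body, hash=_digest(body))
    with open(path, "a", encoding="utf-8") as f:  # append mode only
        f.write(json.dumps(entry, sort_keys=True) + "\n")
        f.flush()
        os.fsync(f.fileno())  # make sure the record is on disk first
    return entry


def verify_log(path: str) -> bool:
    """Return True if every entry links to its predecessor and matches its hash."""
    prev = GENESIS
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            entry = json.loads(line)
            claimed = entry.pop("hash")
            if entry["prev"] != prev or _digest(entry) != claimed:
                return False
            prev = claimed
    return True
```

A hash chain detects edits but does not prevent them; true immutability would also need file permissions or an external anchor for the latest hash.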

3. Tunneling & Security

  • The "Phone Home" Pattern: To traverse NAT and firewalls (e.g., home routers, corporate networks), the Agent Node initiates an outbound HTTP/2 connection over TLS to the server. The server then pushes tasks down this persistent bidirectional stream.
  • JWT Identity & Authz:
    • Each Agent Node is bootstrapped with a unique identity (Service Account or User-bound token).
    • The node presents a short-lived JWT upon tunnel connection. The server validates the claims to ensure the node is authorized.
  • mTLS: For enterprise-grade security and strict node identity validation, Mutual TLS should be established between the Server and Agent Node.

🛠️ Execution Plan

We will carry out this transition in six phases.

Phase 1: Protocol & Tunnel Proof of Concept (POC)

  • Goal: Establish a reliable, bidirectional gRPC connection that supports retries and backpressure.
  • Tasks:
    • Define the Protobuf schema (agent.proto) with structured messages: TaskRequest (needs task_id, idempotency_key, capability_required), TaskResponse, and Heartbeat.
    • Build a Python gRPC server and client to validate connection multiplexing.
    • Implement gRPC keep-alives and exponential backoff retry logic.
  • Outcome: Server can dispatch an idempotent "Echo" task down the gRPC stream.
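
Before committing to agent.proto, the Phase 1 message shapes and retry behaviour can be prototyped in plain Python. In the sketch below, `TaskRequest` carries the three fields named above; the `payload` field, the `TaskResponse` fields, the `EchoHandler` class, and the backoff parameters are assumptions. The retry schedule uses exponential backoff with full jitter, which is one common variant rather than the only option.

```python
from __future__ import annotations

import random
from dataclasses import dataclass
from typing import Callable, Iterator


@dataclass(frozen=True)
class TaskRequest:
    task_id: str
    idempotency_key: str
    capability_required: str
    payload: bytes = b""  # assumed field, not named in the plan


@dataclass(frozen=True)
class TaskResponse:  # field names are assumptions
    task_id: str
    status: str
    output: bytes


class EchoHandler:
    """Echo task that runs at most once per idempotency key."""

    def __init__(self) -> None:
        self._results: dict[str, TaskResponse] = {}
        self.executions = 0

    def handle(self, request: TaskRequest) -> TaskResponse:
        cached = self._results.get(request.idempotency_key)
        if cached is not None:
            return cached  # a retried request gets the original result
        self.executions += 1
        response = TaskResponse(request.task_id, "ok", request.payload)
        self._results[request.idempotency_key] = response
        return response


def backoff_delays(
    base: float = 0.5,
    cap: float = 30.0,
    attempts: int = 6,
    rng: Callable[[], float] = random.random,
) -> Iterator[float]:
    """Yield reconnect delays: a random fraction of min(cap, base * 2**attempt)."""
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)
        yield ceiling * rng()
```

Because the handler caches by `idempotency_key`, a client that reconnects after a dropped stream can safely resend the same request without the task running twice.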

Phase 2: Security & Identity Implementation

  • Goal: Lock down the tunnel.
  • Tasks:
    • Implement JWT minting for Agent Nodes on the Cortex Server.
    • Require the Agent Client to authenticate during the initial handshake.
    • Associate connected sessions with a specific User/Workspace identity to enforce authorization boundaries.
  • Outcome: Only authenticated nodes can connect; connections are mapped to user sessions.
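
The minting and validation steps can be illustrated with the standard library alone. This sketch assumes HS256 (a shared secret), a `workspace` claim, a five-minute lifetime, and the names `mint_token`, `validate_token`, and `AuthError`; none of these are fixed by the plan, and a production server would more likely use a maintained JWT library and asymmetric keys.

```python
import base64
import hashlib
import hmac
import json
import time


class AuthError(Exception):
    """Raised when a token is malformed, forged, or expired."""


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(secret: bytes, signing_input: str) -> bytes:
    return hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()


def mint_token(secret, node_id, workspace_id, ttl_seconds=300, now=None) -> str:
    issued = int(time.time() if now is None else now)
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {"sub": node_id, "workspace": workspace_id,
              "iat": issued, "exp": issued + ttl_seconds}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    return signing_input + "." + _b64url(_sign(secret, signing_input))


def validate_token(secret, token, now=None) -> dict:
    """Return the claims if the signature and expiry check out."""
    try:
        header_b64, claims_b64, sig_b64 = token.split(".")
        expected = _sign(secret, header_b64 + "." + claims_b64)
        if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
            raise AuthError("bad signature")
        if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
            raise AuthError("unexpected algorithm")
        claims = json.loads(_b64url_decode(claims_b64))
        current = time.time() if now is None else now
        if current >= claims["exp"]:
            raise AuthError("token expired")
    except (ValueError, KeyError, TypeError) as exc:
        raise AuthError("malformed token") from exc
    return claims
```

After a successful handshake, the server could record `claims["sub"]` and `claims["workspace"]` against the stream so that every later task is checked against that identity.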

Phase 3: Core Capabilities & Secure Engine (The Local Sandbox)

  • Goal: Give the Agent Node hands and eyes, safely.
  • Tasks:
    • Capability Negotiation: Agent sends a manifest (node_id, capabilities: {shell: true, fs: true}, platform) on connection.
    • Execution Sandbox: Enforce a strict "Command Sandbox Policy" (allowlist permitted commands and restrict network access).
    • Consent-based Execution: Add a "Strict Mode" where the Agent prompts the local user (Y/N) in the terminal before destructive actions.
    • Audit Interceptor: Every command requested by the server is logged locally (append-only) before execution.
  • Outcome: The Server can safely ask the Client to read /etc/os-release.
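
The sandbox policy and Strict Mode checks could be combined into a single decision function that runs before any command does. The `SandboxPolicy` fields, the three-way `allow` / `confirm` / `deny` result, and the blanket rejection of shell metacharacters are assumptions of this sketch, chosen to be conservative.

```python
import shlex
from dataclasses import dataclass

SHELL_METACHARACTERS = set(";|&<>`$")  # rejected anywhere in the command line


@dataclass(frozen=True)
class SandboxPolicy:
    allowed_commands: frozenset
    destructive_commands: frozenset = frozenset()
    strict_mode: bool = True


def evaluate(policy: SandboxPolicy, command_line: str) -> str:
    """Return 'allow', 'confirm' (ask the local user Y/N), or 'deny'."""
    if SHELL_METACHARACTERS & set(command_line):
        return "deny"  # no chaining, piping, redirection, or substitution
    try:
        argv = shlex.split(command_line)
    except ValueError:
        return "deny"  # unbalanced quotes and similar parse errors
    if not argv or argv[0] not in policy.allowed_commands:
        return "deny"
    if policy.strict_mode and argv[0] in policy.destructive_commands:
        return "confirm"
    return "allow"
```

For example, with `allowed_commands` containing `cat` and `rm` and `destructive_commands` containing `rm`, reading /etc/os-release is allowed outright while any `rm` invocation waits for local consent. The check only makes sense if the approved command is then executed as an argument list without a shell.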

Phase 4: Browser Automation (The "Antigravity" Feature)

  • Goal: Allow the Agent Node to interact with local web apps.
  • Tasks:
    • Implement a lightweight CDP (Chrome DevTools Protocol) integration to attach to an already running browser instance (avoids heavy Playwright dependencies).
    • Create standardized commands like Navigate, Click, CaptureScreenshot.
    • Stream screenshots back over the gRPC tunnel natively using chunked binary frames.
  • Outcome: The Server can instruct the client's local browser to open localhost:8080 and stream a screenshot.
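
Streaming a screenshot as chunked binary frames amounts to splitting the image bytes into ordered pieces and marking the last one. The `ScreenshotChunk` fields and the 64 KiB default chunk size below are assumptions; the actual frame layout would be defined in agent.proto.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class ScreenshotChunk:
    task_id: str
    sequence: int
    data: bytes
    is_last: bool


def chunk_screenshot(
    task_id: str, image: bytes, chunk_size: int = 64 * 1024
) -> Iterator[ScreenshotChunk]:
    """Split image bytes into ordered frames; an empty image yields one empty frame."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    total = max(1, -(-len(image) // chunk_size))  # ceiling division
    for seq in range(total):
        start = seq * chunk_size
        yield ScreenshotChunk(
            task_id, seq, image[start:start + chunk_size], seq == total - 1
        )


def reassemble(chunks: Iterable[ScreenshotChunk]) -> bytes:
    """Rebuild the image on the server, tolerating out-of-order arrival."""
    ordered = sorted(chunks, key=lambda c: c.sequence)
    if not ordered or not ordered[-1].is_last:
        raise ValueError("final chunk missing")
    if [c.sequence for c in ordered] != list(range(len(ordered))):
        raise ValueError("gap or duplicate in chunk sequence")
    return b"".join(c.data for c in ordered)
```

Keeping frames small also keeps each message well under gRPC's default 4 MB receive limit and leaves room for heartbeats and other tasks to interleave on the same stream.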

Phase 5: Concurrency & Task Isolation

  • Goal: Handle multiple simultaneous requests safely without corruption.
  • Tasks:
    • Define a strict Task Isolation Model: File writes use advisory locks; browser actions run in isolated contexts.
    • Implement asynchronous task workers on the Agent Node.
    • Introduce Resource Quotas (cap the Agent Node's CPU and memory usage at a configurable maximum percentage).
  • Outcome: Server can issue 5 simultaneous operations and they complete concurrently without blocking the tunnel or corrupting state.
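
The isolation model can be prototyped with asyncio: a semaphore bounds how many tasks run at once, and a per-path lock serialises writes to the same file. Note that `asyncio.Lock` is used here as an in-process stand-in for the advisory file locks the plan calls for, and the `TaskRunner` class and the dictionary standing in for the filesystem are assumptions for illustration.

```python
import asyncio
from collections import defaultdict


class TaskRunner:
    """Runs file-append tasks concurrently without losing updates."""

    def __init__(self, max_concurrency: int = 5) -> None:
        # Create the runner inside a running event loop (see the usage note).
        self._slots = asyncio.Semaphore(max_concurrency)
        self._path_locks = defaultdict(asyncio.Lock)  # one lock per path

    async def append(self, store: dict, path: str, text: str) -> None:
        async with self._slots:  # bound overall concurrency
            async with self._path_locks[path]:  # serialise writers per path
                current = store.get(path, "")
                await asyncio.sleep(0)  # yield, as real file I/O would
                store[path] = current + text


async def run_demo() -> dict:
    runner = TaskRunner(max_concurrency=5)
    store: dict = {}  # stands in for the filesystem
    await asyncio.gather(
        *(runner.append(store, "notes.txt", str(i)) for i in range(5))
    )
    return store
```

Calling `asyncio.run(run_demo())` issues five simultaneous appends to the same path. Without the per-path lock, the read-modify-write around the `await` would let later writers overwrite earlier ones.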

Phase 6: Frontend UI Integration & Refactoring

  • Goal: Replace the old UI approach with the new system.
  • Tasks:
    • Update the CodingAssistantPage to recognize connected Agent Nodes instead of relying on the old WSS sync logic.
    • Display connected nodes in the UI.
    • Give users a dashboard in the UI for viewing each connected node's audit log.
  • Outcome: A seamless user experience powered by the new architecture.

Before mapping this plan into JIRA/GitBucket issues, we should define the gRPC Protobuf schema (agent.proto) and build the Phase 1 proof-of-concept Python server and client.

Shall I proceed with writing the initial Protobuf definition to solidify the API contract?