
Swarm Control Architecture Analysis

This document provides a deep dive into the internal mechanics of the Cortex Swarm Control system, tracing a command from the user interface to the remote agent node execution and back.

1. System Components

  • Frontend (MultiNodeConsole/ChatWindow): The user interface for interaction.
  • AI Hub (Orchestrator): The central brain managing sessions, RAG, and node communication.
  • Task Assistant/Journal: Manages task registration, signing, and state tracking.
  • Agent Node (Client): Remote worker executing commands via gRPC.
  • AI Sub-Agent: Specialized autonomous loop that monitors PTY progress.

2. Command Flow Trace

Phase 1: Request & Tool Dispatch

  1. User Input: A user types a command or request into the chat.
  2. Main LLM Loop: RagPipeline calls the LLM with the current mesh context. The LLM decides to use mesh_terminal_control.
  3. Tool Dispatching: ToolService intercepts the request. It initializes a Sub-Agent to manage the lifecycle of this specific terminal task.
  4. Task Registration: The AssistantService generates a unique task_id and registers it in the TaskJournal. It signs the command payload using an RSA-PSS signature to ensure integrity.
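The register-and-sign step above can be sketched as follows. This is a minimal illustration, not the actual `AssistantService` code: the helper name `register_task` and the journal shape are assumptions, and HMAC-SHA256 stands in for the real RSA-PSS signature so the sketch stays stdlib-only.

```python
import hashlib
import hmac
import json
import time
import uuid

SIGNING_KEY = b"hub-secret"  # stand-in for the Hub's RSA private key


def register_task(journal: dict, node_id: str, command: str) -> dict:
    """Register a command in the task journal and sign its payload.

    Hypothetical helper: the real system signs with RSA-PSS; HMAC-SHA256
    is used here only to keep the example self-contained.
    """
    task_id = str(uuid.uuid4())
    payload = json.dumps(
        {"task_id": task_id, "node_id": node_id, "command": command},
        sort_keys=True,
    ).encode()
    signature = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    journal[task_id] = {"state": "PENDING", "created": time.time()}
    return {"task_id": task_id, "payload": payload, "signature": signature}
```

The node later recomputes the signature over the received payload and compares it in constant time before executing anything.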

Phase 2: Hub-to-Node Transmission (gRPC)

  1. Queueing: The signed task is put into the target node's outbound queue (node.queue).
  2. Streaming: The AgentOrchestrator service (running a bidirectional gRPC stream) pops the message and sends a ServerTaskMessage (Protobuf) to the connected Agent Node.
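The queue-then-stream handoff can be sketched with an asyncio queue per node. The `Node` dataclass, the `send` callable (standing in for the gRPC stream write of a `ServerTaskMessage`), and the `None` disconnect sentinel are all assumptions for illustration.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    queue: asyncio.Queue = field(default_factory=asyncio.Queue)


async def stream_to_node(node: Node, send) -> None:
    """Pop signed tasks from the node's outbound queue and push each one
    down the bidirectional stream. `send` stands in for the gRPC write;
    a `None` sentinel signals that the node disconnected."""
    while True:
        task = await node.queue.get()
        if task is None:
            break
        await send(task)  # ServerTaskMessage in the real system
```

One such coroutine runs per connected node, so a slow node only backs up its own queue rather than stalling the whole swarm.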

Phase 3: Remote Execution (Agent Node)

  1. Node Reception: The Agent Node's gRPC client receives the message and validates the signature.
  2. Shell Execution: The ShellSkill initializes a pseudo-terminal (PTY) for the command. This provides a stateful session where environment variables and working directories persist.
  3. Real-time Feedback: Standard output (stdout) and error (stderr) are immediately streamed back to the Hub via TaskResult messages.
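The PTY-backed execution step can be sketched with Python's stdlib `pty` module. The real ShellSkill keeps the PTY open across commands for a stateful session; this hypothetical `run_in_pty` helper runs a single command and reads until EOF.

```python
import os
import pty
import subprocess


def run_in_pty(command: str) -> bytes:
    """Run a shell command under a pseudo-terminal and collect its output.

    Illustrative sketch only: output is returned in one piece here, whereas
    the real node streams each chunk back to the Hub as it arrives.
    """
    master, slave = pty.openpty()
    proc = subprocess.Popen(
        ["/bin/sh", "-c", command],
        stdin=slave, stdout=slave, stderr=slave, close_fds=True,
    )
    os.close(slave)  # parent keeps only the master side
    chunks = []
    while True:
        try:
            data = os.read(master, 4096)  # one chunk per read
        except OSError:  # on Linux, EOF on a PTY master raises EIO
            break
        if not data:
            break
        chunks.append(data)
    proc.wait()
    os.close(master)
    return b"".join(chunks)
```

Because the child sees a real terminal, line buffering and prompts behave as they would interactively, which is what makes the streamed feedback useful.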

Phase 4: Monitoring & Decision (The Sub-Agent)

  1. State Tracking: grpc_server.py on the Hub receives the output chunks and updates the TaskJournal. It also broadcasts these via WebSockets for the UI's real-time terminal display.
  2. AI Analysis: The Sub-Agent on the Hub runs in parallel. Every few seconds, it evaluates the thought_history and current stdout of the task.
  3. Branching Agency: The Sub-Agent decides on one of FINISH, WAIT, ABORT, or EXECUTE. If it chooses EXECUTE, it can launch new commands on any node in the swarm (e.g., pivot to Node-B once Node-A is ready).
  4. Autonomy: The Sub-Agent returns the final aggregated result only once it detects that the entire coordinated chain of commands has completed.
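The observation loop above can be sketched as follows. Everything here is an assumption for illustration: `peek` stands in for reading the TaskJournal, `decide` for the LLM call that classifies the task state, and `dispatch` for launching a follow-up command on another node.

```python
import time
from typing import Callable


def monitor_task(
    peek: Callable[[], dict],
    decide: Callable[[dict], str],
    dispatch: Callable[[str], None],
    interval: float = 0.0,
    max_ticks: int = 100,
) -> str:
    """Sub-Agent observation loop (sketch): each tick, peek at the journal
    (thought_history + current stdout), ask the decision function what to
    do, and either stop, branch to a new command, or keep waiting."""
    for _ in range(max_ticks):
        snapshot = peek()
        decision = decide(snapshot)
        if decision in ("FINISH", "ABORT"):
            return decision
        if decision == "EXECUTE":
            # branch: launch a follow-up command elsewhere in the swarm
            dispatch(snapshot.get("next_command", ""))
        time.sleep(interval)  # WAIT between intelligence ticks
    return "ABORT"  # safety valve: give up after max_ticks
```

The `max_ticks` cap is a defensive choice so a confused decision function cannot spin the loop forever.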

Phase 5: Result Aggregation

  1. Main Response: RagPipeline receives the Sub-Agent's final report and feeds it back to the main LLM to generate the final user-facing answer.

3. Architecture Diagram

```mermaid
sequenceDiagram
    participant User as 💻 User (UI)
    participant Hub as 🧠 AI Hub (Orchestrator)
    participant Sub as 🤖 Sub-Agent (Monitor)
    participant Node as 📡 Agent Node (PTY)

    User->>Hub: "List files on Node-A"
    Hub->>Hub: LLM: Use mesh_terminal_control
    Hub->>Sub: Instantiate Sub-Agent
    Sub->>Hub: dispatch_single(cmd, node_a)
    Hub->>Hub: TaskJournal.register(tid)
    Hub->>Node: gRPC: TaskRequest (Signed)

    loop Monitoring Loop
        Node-->>Hub: gRPC: TaskResult (stdout)
        Hub-->>User: WS: task_stdout (Live Terminal)
        Sub->>Hub: peek_journal(tid)
        Sub->>Sub: LLM Decision: WAIT/FINISH/EXECUTE
        opt Branching Action
            Sub->>Hub: dispatch_single(new_cmd, node_b)
            Hub->>Node: gRPC: New Task (Signed)
        end
    end

    Sub->>Hub: Task Done
    Hub->>User: "Here is the list of files..."
```

4. Performance & Bottlenecks Analysis

Current Bottlenecks

  1. Hub Network Serialization: For massive swarms (100+ nodes), the ThreadPoolExecutor in dispatch_swarm may encounter GIL contention or network I/O limits on the Hub.
  2. LLM Latency: The Sub-Agent's "Observation Loop" relies on repeated LLM calls to detect task completion. This adds cost and latency (1-3 seconds per "intelligence" tick).
  3. PTY Memory: Large terminal buffers (scrollback) are stored in memory on both the Hub and the Node. Extremely long-running commands with massive output can lead to memory pressure.
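One common mitigation for the PTY memory pressure described above is a bounded scrollback buffer that discards the oldest chunks. This is a hypothetical sketch, not existing Hub code; the class name and chunk-based granularity are assumptions.

```python
from collections import deque


class ScrollbackBuffer:
    """Bounded scrollback (sketch): keep at most `max_chunks` output
    chunks, silently dropping the oldest, so an extremely long-running
    command cannot grow memory without limit."""

    def __init__(self, max_chunks: int = 1000) -> None:
        # deque with maxlen evicts from the left automatically
        self._chunks: deque[bytes] = deque(maxlen=max_chunks)

    def append(self, chunk: bytes) -> None:
        self._chunks.append(chunk)

    def read(self) -> bytes:
        return b"".join(self._chunks)
```

The trade-off is that early output becomes unrecoverable; if full history matters, older chunks should be spilled to disk instead of dropped.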

Limitations

  1. Synchronous Tool Wait: The RagPipeline currently blocks while waiting for the ToolService to return. This prevents the user from asking follow-up questions while a background task is running, unless no_abort=True is used for asynchronous polling.
  2. Context Window Limits: Passing full terminal history back to the LLM for every monitoring tick can quickly saturate the context window of smaller models.
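A simple way to bound the context cost per monitoring tick is to send only the tail of the terminal buffer to the LLM. The helper below is an assumed sketch (name, marker text, and character budget are all illustrative), not the pipeline's actual truncation logic.

```python
def tail_context(stdout: str, max_chars: int = 4000) -> str:
    """Keep only the tail of the terminal history for a monitoring tick,
    prefixing a marker so the model knows output was truncated."""
    if len(stdout) <= max_chars:
        return stdout
    return "[...output truncated...]\n" + stdout[-max_chars:]
```

Truncating by characters is crude; a token-aware budget, or a rolling summary of the dropped prefix, would preserve more signal at the same cost.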

Potential Issues

  1. Zombie Tasks: If a node disconnects abruptly, the PTY process on the node might continue to run (dangling process) if the TaskCancel message fails to reach it.
  2. Race Conditions: In dispatch_swarm, results are returned in a dictionary keyed by node_id. If multiple tasks are dispatched to the same node in very rapid succession, session_id conflicts must be avoided (mitigated by using unique PTY sessions).
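The unique-PTY-session mitigation can be sketched as keying each dispatch by a fresh session id, so two rapid dispatches to the same node never collide. This is illustrative only: the real `dispatch_swarm` fans out via a ThreadPoolExecutor and keys results by node_id, while this sketch is sequential and keys by (node_id, session_id) to make the collision-avoidance explicit.

```python
import uuid


def dispatch_swarm(nodes, command, dispatch_one):
    """Fan a command out to every node (sketch). A fresh session_id per
    dispatch means repeat dispatches to one node get distinct PTY
    sessions and distinct result keys."""
    results = {}
    for node_id in nodes:
        session_id = uuid.uuid4().hex  # unique PTY session per dispatch
        results[(node_id, session_id)] = dispatch_one(node_id, session_id, command)
    return results
```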

5. Future Recommendations

  • Edge Intelligence: Move the terminal monitoring logic (Sub-Agent) to the local Agent Node to reduce Hub-to-Node traffic and Hub-side LLM costs.
  • WebSocket Binary Compression: Compress terminal output before sending it to the browser to improve performance on high-latency connections.
  • Persistence Layer: Move TaskJournal state to Redis or a database to allow Hub restarts without losing task monitoring state.