Newer
Older
cortex-hub / docs / architecture / agent_node_integration_readiness.md

Agent Node — AI Hub Integration Readiness Review

Prepared: 2026-03-04
Scope: poc-grpc-agent/ (commit 785b387)
Goal: Evaluate what is built, what works, and what must be done before migrating server logic into the AI Hub to support the target user flow:

"The AI Hub acts as the server. It provides a page to download the client-side Agent Node software (with a YAML config). The user deploys the node locally, the Hub detects the live node, and the user can attach that node to a conversation session in the UI."


1. What Is Built — Component Inventory

1.1 gRPC Protocol (protos/agent.proto)

Message / RPC Purpose Status
SyncConfiguration (unary) Node registration + sandbox policy handshake ✅ Done
TaskStream (bidir stream) Persistent command & control channel ✅ Done
ReportHealth (bidir stream) Heartbeat / utilization reporting ✅ Done
RegistrationRequest Node ID, version, auth JWT, capabilities map ✅ Done
SandboxPolicy STRICT / PERMISSIVE mode, allowed/denied commands ✅ Done
TaskRequest Shell payload or BrowserAction, with session_id ✅ Done
TaskResponse stdout/stderr + structured BrowserResponse ✅ Done
FileSyncMessage Bidirectional file sync envelope ✅ Done
SyncControl START/STOP_WATCHING, LOCK, UNLOCK, REFRESH_MANIFEST, RESYNC ✅ Done
DirectoryManifest / FileInfo SHA-256 manifest for drift detection ✅ Done
FilePayload Chunked file transfer with hash verification ✅ Done
SyncStatus OK / ERROR / RECONCILE_REQUIRED + reconcile_paths ✅ Done
BrowserEvent Live console/network event tunneling ✅ Done
WorkPoolUpdate / TaskClaimRequest Work-stealing pool protocol ✅ Done

1.2 Server Side — Orchestrator Components

Component File What It Does Status
AgentOrchestrator services/grpc_server.py gRPC servicer, routes all three RPCs ✅ Done
MemoryNodeRegistry core/registry.py Tracks live nodes by ID in memory ✅ Done (in-memory only)
TaskJournal core/journal.py Async task state tracking (Event-based) ✅ Done
GlobalWorkPool core/pool.py Thread-safe work-stealing task pool ✅ Done
GhostMirrorManager core/mirror.py Server-side file mirror with SHA-256 ✅ Done
TaskAssistant services/assistant.py High-level AI API: dispatch, push, sync ✅ Done
CortexIgnore shared_core/ignore.py .cortexignore / .gitignore filtering ✅ Done
sign_payload / sign_browser_action utils/crypto.py HMAC-SHA256 task signing ✅ Done
Mesh Dashboard grpc_server.py:_monitor_mesh Console health printout every 10s ✅ Done (console only)

Key API Methods on TaskAssistant

Method Description
push_workspace(node_id, session_id) Initial push of all files in Ghost Mirror to a node
push_file(node_id, session_id, rel_path) Targeted single-file push (drift recovery)
broadcast_file_chunk(session_id, sender, chunk) Propagate a node's change to all other nodes
control_sync(node_id, session_id, action) Send any SyncControl action
lock_workspace / unlock_workspace Toggle node-side write lock
request_manifest / reconcile_node Phase 5 drift detection + auto-recovery
dispatch_single(node_id, cmd, session_id) Dispatch shell task (CWD-aware)
dispatch_browser(node_id, action, session_id) Dispatch browser task (Download-to-Sync)

1.3 Client Side — Agent Node Components

Component File What It Does Status
AgentNode node.py Core node orchestrator ✅ Done
SkillManager skills/manager.py ThreadPool routing to shell / browser ✅ Done
ShellSkill skills/shell.py Shell execution, CWD-aware, cancellable ✅ Done
BrowserSkill skills/browser.py Playwright actor, Download-to-Sync ✅ Done
SandboxEngine core/sandbox.py Policy enforcement (STRICT/PERMISSIVE) ✅ Done
NodeSyncManager core/sync.py Local file writes + drift detection ✅ Done
WorkspaceWatcher core/watcher.py watchdog-based file change streaming ✅ Done
CortexIgnore (shared) shared_core/ignore.py Same ignore logic as server ✅ Done
create_auth_token utils/auth.py 10-minute JWT for registration ✅ Done
verify_task_signature utils/auth.py HMAC verification for incoming tasks ✅ Done
get_secure_stub utils/network.py mTLS gRPC channel setup ✅ Done
Config (config.py) agent_node/config.py 12-Factor env-var config ✅ Done
Entry Point (main.py) agent_node/main.py SIGINT/SIGTERM graceful shutdown ✅ Done

Node Capabilities Advertised on Registration

shell: "v1"
browser: "playwright-sync-bridge"

1.4 Security

Mechanism Implementation Status
mTLS Server + Client certs (certs/), CA-signed ✅ Done (self-signed for dev)
JWT Registration 10-min expiry, HS256, shared secret ✅ Done
HMAC Task Signing Each task payload is HMAC-SHA256 signed ✅ Done
Sandbox Policy Server sends allow/deny lists at registration ✅ Done
Path Traversal Guard normpath + prefix check in sync writes ✅ Done
.cortexignore Prevents sensitive files from syncing ✅ Done
Workspace Locking Node ignores user edits during AI writes ✅ Done

1.5 Synchronization — Ghost Mirror System

Feature Status
Server-Primary push (server → node) ✅ Done
Node-Primary delta streaming (node → server) ✅ Done
Multi-node broadcast propagation ✅ Done
SHA-256 hash verification on file receive ✅ Done
Manifest-based drift detection ✅ Done
Automatic drift recovery on reconnect ✅ Done
Chunked file transfer (64KB chunks) ✅ Done
.cortexignore / .gitignore filtering ✅ Done
Dynamic ignore rule reloading ✅ Done
Download-to-Sync (browser → workspace) ✅ Done

2. What Is Missing — Gaps Before AI Hub Integration

🔴 Must-Fix (Blockers for the Target User Flow)

# Gap Why It's a Blocker Recommendation
M1 No REST/WebSocket API surface The Orchestrator only exposes a gRPC port. The AI Hub UI has no way to query node status, trigger syncs, or read task results. Add a thin HTTP layer (FastAPI or Django view) alongside the gRPC server that exposes: GET /nodes, GET /nodes/{id}/status, POST /sessions/{id}/dispatch.
M2 Node Registry is in-memory If the Hub process restarts, all node registrations are lost. Live nodes appear "offline" even if still connected. Back MemoryNodeRegistry with a persistent store (Redis or Postgres) so registrations survive restarts.
M3 No persistent session model session_id is a bare string — there is no concept of which user owns a session, which nodes are attached, or its lifecycle. Add a Session DB model in the Hub: id, user_id, attached_node_ids, created_at, status.
M4 No user → node ownership / authorization Any registered node is visible to all. There is no "this is my node" concept per user account. Add node_owner_id (user_id) at registration time. The Hub must issue a pre-signed invite token via POST /api/nodes/invite before the node connects.
M5 Node download page doesn't exist There is no endpoint or UI page to download the Agent Node software with a pre-configured YAML. Build a "Download Your Node" page in the Hub that generates a config YAML with SERVER_HOST, GRPC_PORT, AGENT_NODE_ID, and the invite token. Bundle the Python package for download.
M6 No live node status in the UI The mesh dashboard only prints to the server console. The Hub UI needs real-time status to show "node is alive". Expose node status via WebSocket or Server-Sent Events so the UI can show 🟢/⚫ for each node in real time.
M7 mTLS certs are developer-only self-signed The certs/ folder contains hardcoded dev certs. Nodes on external networks will fail TLS verification. Either: (a) switch to token-only auth (no mTLS), which is simpler since the Hub already handles HTTPS; or (b) implement a cert-issuance API (/api/nodes/cert-request) backed by an internal CA.
M8 Shared HMAC secret is hardcoded ORCHESTRATOR_SECRET_KEY and AGENT_SECRET_KEY both default to "cortex-secret-shared-key". Any node with this secret can forge tasks. Replace with per-node rotating keys derived from the invite token, or use asymmetric signing (Ed25519).

🟡 Should-Fix (Important for Production Quality)

# Gap Detail Recommendation
S1 No node deregistration When a node's stream closes, its entry stays in registry.nodes forever. list_nodes() returns stale dead entries. Add deregister(node_id) to the TaskStream finally block + a TTL-based cleanup routine.
S2 JWT has 10-minute expiry but no refresh After 10 minutes the JWT is expired but SyncConfiguration runs only once at startup. Either extend TTL to match session length, or implement a token-refresh RPC.
S3 GhostMirrorManager storage root is hardcoded storage_root="/app/data/mirrors" is hardcoded. In the Hub this should be per-user and configurable. Make it configurable via env var; use path /data/mirrors/{user_id}/{session_id}.
S4 No browser task cancellation BrowserSkill.cancel() always returns False. A running browser session cannot be interrupted. Implement cancellation by pushing a sentinel + task_id into the actor queue.
S5 CPU usage reported as hardcoded 1.0 Heartbeat.cpu_usage_percent=1.0 is fake. Load balancing decisions are unreliable. Use psutil.cpu_percent() and psutil.virtual_memory().percent.
S6 Work-stealing jitter is random time.sleep(random.uniform(0.1, 0.5)) for claim jitter is functional but non-deterministic. Use a hash of node_id + task_id for deterministic, replayable distribution.
S7 No reconnection loop on the client If the server is temporarily unavailable, main.py calls sys.exit(1) and dies. Implement an exponential-backoff retry loop (max_retries=10) in run_task_stream() before giving up.
S8 import hashlib missing in node.py _push_full_manifest calls hashlib.sha256 but import hashlib is missing. Will crash with NameError at runtime. Add import hashlib to the top of node.py. Fix immediately.

🟢 Nice-to-Have (Phase 6 / Future)

# Gap Recommendation
N1 No Optimistic Concurrency on file writes Add parent_hash field to FilePayload; reject edits where hash doesn't match server's current version.
N2 Browser events not persisted Console/network events are only printed to console. Store them per-session for the UI to replay.
N3 No streaming task output Shell output is returned only on completion. Add a ProgressEvent to stream stdout lines in real-time.
N4 No structured capability discovery Capabilities are a bare map<string,string>. Structured metadata (OS, Python version, GPU, disk space) would enable smarter task routing.

3. Architecture — Current vs. Target

Current (POC)

[ Orchestrator CLI ] ←gRPC:50051 (mTLS)→ [ Agent Node A ]
       |                                 → [ Agent Node B ]
  /app/data/mirrors (local FS)
  in-memory registry
  console dashboard only

Target (AI Hub Integrated)

[ User Browser ]
     ↓ HTTPS
[ AI Hub (Django/FastAPI) ]
   ├── REST/WS API  →  GET /api/nodes
   │                   POST /api/nodes/invite
   │                   POST /api/sessions/{id}/dispatch
   ├── gRPC Server (port 50051)
   │       ↑ SyncConfiguration / TaskStream / ReportHealth
   │   [ Agent Node A ] ← downloaded + configured by user
   │   [ Agent Node B ]
   ├── DB (Postgres)  → sessions, nodes, users
   └── File Storage   → /data/mirrors/{user_id}/{session_id}/

This is the sequence to complete before declaring "ready to migrate":

  1. [S8] Fix import hashlib in node.py — immediate silent crash risk.
  2. [M8] Fix secret management — per-node invite-based keys. Security baseline.
  3. [M7] Decide TLS strategy — token-only auth removes the cert burden from end users.
  4. [M3] + [M4] Session & Node ownership DB models — data foundation for everything else.
  5. [M1] HTTP/WS API layer — thin FastAPI app alongside gRPC to expose state to the Hub UI.
  6. [M2] Persistent registry — wire MemoryNodeRegistry to Redis/Postgres.
  7. [S1] Node deregistration + TTL — makes the "online nodes" list accurate.
  8. [S7] Client reconnection loop — nodes must survive transient server restarts.
  9. [M5] "Download Your Node" page — the final user-facing feature closing the loop.
  10. [M6] Live node status in UI — WebSocket push so the UI shows 🟢/⚫ in real-time.

[!NOTE] Items M1 through M6 are all about the Hub's integration layer, not the Agent Node client code. The gRPC protocol (.proto) is stable and does not need to change for the initial integration.

[!CAUTION] S8 (import hashlib missing in node.py) will cause a NameError crash in the _push_full_manifest code path. Fix this before any production deployment.