Agent Node — AI Hub Integration Readiness Review

Prepared: 2026-03-04
Scope: poc-grpc-agent/ (commit 785b387)
Goal: Evaluate what is built, what works, and what must be done before migrating server logic into the AI Hub to support the target user flow:

"The AI Hub acts as the server. It provides a page to download the client-side Agent Node software (with a YAML config). The user deploys the node locally, the Hub detects the live node, and the user can attach that node to a conversation session in the UI."

1. What Is Built — Component Inventory

1.1 gRPC Protocol (`protos/agent.proto`)

Message / RPC	Purpose	Status
`SyncConfiguration` (unary)	Node registration + sandbox policy handshake	✅ Done
`TaskStream` (bidir stream)	Persistent command & control channel	✅ Done
`ReportHealth` (bidir stream)	Heartbeat / utilization reporting	✅ Done
`RegistrationRequest`	Node ID, version, auth JWT, capabilities map	✅ Done
`SandboxPolicy`	STRICT / PERMISSIVE mode, allowed/denied commands	✅ Done
`TaskRequest`	Shell payload or BrowserAction, with `session_id`	✅ Done
`TaskResponse`	stdout/stderr + structured `BrowserResponse`	✅ Done
`FileSyncMessage`	Bidirectional file sync envelope	✅ Done
`SyncControl`	START/STOP_WATCHING, LOCK, UNLOCK, REFRESH_MANIFEST, RESYNC	✅ Done
`DirectoryManifest` / `FileInfo`	SHA-256 manifest for drift detection	✅ Done
`FilePayload`	Chunked file transfer with hash verification	✅ Done
`SyncStatus`	OK / ERROR / RECONCILE_REQUIRED + `reconcile_paths`	✅ Done
`BrowserEvent`	Live console/network event tunneling	✅ Done
`WorkPoolUpdate` / `TaskClaimRequest`	Work-stealing pool protocol	✅ Done

1.2 Server Side — Orchestrator Components

Component	File	What It Does	Status
`AgentOrchestrator`	`services/grpc_server.py`	gRPC servicer, routes all three RPCs	✅ Done
`MemoryNodeRegistry`	`core/registry.py`	Tracks live nodes by ID in memory	✅ Done (in-memory only)
`TaskJournal`	`core/journal.py`	Async task state tracking (Event-based)	✅ Done
`GlobalWorkPool`	`core/pool.py`	Thread-safe work-stealing task pool	✅ Done
`GhostMirrorManager`	`core/mirror.py`	Server-side file mirror with SHA-256	✅ Done
`TaskAssistant`	`services/assistant.py`	High-level AI API: dispatch, push, sync	✅ Done
`CortexIgnore`	`shared_core/ignore.py`	`.cortexignore` / `.gitignore` filtering	✅ Done
`sign_payload` / `sign_browser_action`	`utils/crypto.py`	HMAC-SHA256 task signing	✅ Done
Mesh Dashboard	`grpc_server.py:_monitor_mesh`	Console health printout every 10s	✅ Done (console only)

Key API Methods on `TaskAssistant`

Method	Description
`push_workspace(node_id, session_id)`	Initial push of all files in Ghost Mirror to a node
`push_file(node_id, session_id, rel_path)`	Targeted single-file push (drift recovery)
`broadcast_file_chunk(session_id, sender, chunk)`	Propagate a node's change to all other nodes
`control_sync(node_id, session_id, action)`	Send any SyncControl action
`lock_workspace / unlock_workspace`	Toggle node-side write lock
`request_manifest / reconcile_node`	Phase 5 drift detection + auto-recovery
`dispatch_single(node_id, cmd, session_id)`	Dispatch shell task (CWD-aware)
`dispatch_browser(node_id, action, session_id)`	Dispatch browser task (Download-to-Sync)

1.3 Client Side — Agent Node Components

Component	File	What It Does	Status
`AgentNode`	`node.py`	Core node orchestrator	✅ Done
`SkillManager`	`skills/manager.py`	ThreadPool routing to shell / browser	✅ Done
`ShellSkill`	`skills/shell.py`	Shell execution, CWD-aware, cancellable	✅ Done
`BrowserSkill`	`skills/browser.py`	Playwright actor, Download-to-Sync	✅ Done
`SandboxEngine`	`core/sandbox.py`	Policy enforcement (STRICT/PERMISSIVE)	✅ Done
`NodeSyncManager`	`core/sync.py`	Local file writes + drift detection	✅ Done
`WorkspaceWatcher`	`core/watcher.py`	`watchdog`-based file change streaming	✅ Done
`CortexIgnore` (shared)	`shared_core/ignore.py`	Same ignore logic as server	✅ Done
`create_auth_token`	`utils/auth.py`	10-minute JWT for registration	✅ Done
`verify_task_signature`	`utils/auth.py`	HMAC verification for incoming tasks	✅ Done
`get_secure_stub`	`utils/network.py`	mTLS gRPC channel setup	✅ Done
Config (`config.py`)	`agent_node/config.py`	12-Factor env-var config	✅ Done
Entry Point (`main.py`)	`agent_node/main.py`	SIGINT/SIGTERM graceful shutdown	✅ Done

Node Capabilities Advertised on Registration

shell: "v1"
browser: "playwright-sync-bridge"

1.4 Security

Mechanism	Implementation	Status
mTLS	Server + Client certs (`certs/`), CA-signed	✅ Done (self-signed for dev)
JWT Registration	10-min expiry, HS256, shared secret	✅ Done
HMAC Task Signing	Each task payload is HMAC-SHA256 signed	✅ Done
Sandbox Policy	Server sends allow/deny lists at registration	✅ Done
Path Traversal Guard	`normpath` + prefix check in sync writes	✅ Done
`.cortexignore`	Prevents sensitive files from syncing	✅ Done
Workspace Locking	Node ignores user edits during AI writes	✅ Done

1.5 Synchronization — Ghost Mirror System

Feature	Status
Server-Primary push (server → node)	✅ Done
Node-Primary delta streaming (node → server)	✅ Done
Multi-node broadcast propagation	✅ Done
SHA-256 hash verification on file receive	✅ Done
Manifest-based drift detection	✅ Done
Automatic drift recovery on reconnect	✅ Done
Chunked file transfer (64KB chunks)	✅ Done
`.cortexignore` / `.gitignore` filtering	✅ Done
Dynamic ignore rule reloading	✅ Done
Download-to-Sync (browser → workspace)	✅ Done

2. What Is Missing — Gaps Before AI Hub Integration

🔴 Must-Fix (Blockers for the Target User Flow)

#	Gap	Why It's a Blocker	Recommendation
M1	No REST/WebSocket API surface	The Orchestrator only exposes a gRPC port. The AI Hub UI has no way to query node status, trigger syncs, or read task results.	Add a thin HTTP layer (FastAPI or Django view) alongside the gRPC server that exposes: `GET /nodes`, `GET /nodes/{id}/status`, `POST /sessions/{id}/dispatch`.
M2	Node Registry is in-memory	If the Hub process restarts, all node registrations are lost. Live nodes appear "offline" even if still connected.	Back `MemoryNodeRegistry` with a persistent store (Redis or Postgres) so registrations survive restarts.
M3	No persistent session model	`session_id` is a bare string — there is no concept of which user owns a session, which nodes are attached, or its lifecycle.	Add a `Session` DB model in the Hub: `id`, `user_id`, `attached_node_ids`, `created_at`, `status`.
M4	No user → node ownership / authorization	Any registered node is visible to all. There is no "this is my node" concept per user account.	Add `node_owner_id` (user_id) at registration time. The Hub must issue a pre-signed invite token via `POST /api/nodes/invite` before the node connects.
M5	Node download page doesn't exist	There is no endpoint or UI page to download the Agent Node software with a pre-configured YAML.	Build a "Download Your Node" page in the Hub that generates a config YAML with `SERVER_HOST`, `GRPC_PORT`, `AGENT_NODE_ID`, and the invite token. Bundle the Python package for download.
M6	No live node status in the UI	The mesh dashboard only prints to the server console. The Hub UI needs real-time status to show "node is alive".	Expose node status via WebSocket or Server-Sent Events so the UI can show 🟢/⚫ for each node in real time.
M7	mTLS certs are developer-only self-signed	The `certs/` folder contains hardcoded dev certs. Nodes on external networks will fail TLS verification.	Either: (a) switch to token-only auth (no mTLS), which is simpler since the Hub already handles HTTPS; or (b) implement a cert-issuance API (`/api/nodes/cert-request`) backed by an internal CA.
M8	Shared HMAC secret is hardcoded	`ORCHESTRATOR_SECRET_KEY` and `AGENT_SECRET_KEY` both default to `"cortex-secret-shared-key"`. Any node with this secret can forge tasks.	Replace with per-node rotating keys derived from the invite token, or use asymmetric signing (Ed25519).

🟡 Should-Fix (Important for Production Quality)

#	Gap	Detail	Recommendation
S1	No node deregistration	When a node's stream closes, its entry stays in `registry.nodes` forever. `list_nodes()` returns stale dead entries.	Add `deregister(node_id)` to the `TaskStream` `finally` block + a TTL-based cleanup routine.
S2	JWT has 10-minute expiry but no refresh	After 10 minutes the JWT is expired but `SyncConfiguration` runs only once at startup.	Either extend TTL to match session length, or implement a token-refresh RPC.
S3	`GhostMirrorManager` storage root is hardcoded	`storage_root="/app/data/mirrors"` is hardcoded. In the Hub this should be per-user and configurable.	Make it configurable via env var; use path `/data/mirrors/{user_id}/{session_id}`.
S4	No browser task cancellation	`BrowserSkill.cancel()` always returns `False`. A running browser session cannot be interrupted.	Implement cancellation by pushing a sentinel + `task_id` into the actor queue.
S5	CPU usage reported as hardcoded `1.0`	`Heartbeat.cpu_usage_percent=1.0` is fake. Load balancing decisions are unreliable.	Use `psutil.cpu_percent()` and `psutil.virtual_memory().percent`.
S6	Work-stealing jitter is random	`time.sleep(random.uniform(0.1, 0.5))` for claim jitter is functional but non-deterministic.	Use a hash of `node_id + task_id` for deterministic, replayable distribution.
S7	No reconnection loop on the client	If the server is temporarily unavailable, `main.py` calls `sys.exit(1)` and dies.	Implement an exponential-backoff retry loop (`max_retries=10`) in `run_task_stream()` before giving up.
S8	`import hashlib` missing in `node.py`	`_push_full_manifest` calls `hashlib.sha256` but `import hashlib` is missing. Will crash with `NameError` at runtime.	Add `import hashlib` to the top of `node.py`. Fix immediately.

🟢 Nice-to-Have (Phase 6 / Future)

#	Gap	Recommendation
N1	No Optimistic Concurrency on file writes	Add `parent_hash` field to `FilePayload`; reject edits where hash doesn't match server's current version.
N2	Browser events not persisted	Console/network events are only printed to console. Store them per-session for the UI to replay.
N3	No streaming task output	Shell output is returned only on completion. Add a `ProgressEvent` to stream stdout lines in real-time.
N4	No structured capability discovery	Capabilities are a bare `map<string,string>`. Structured metadata (OS, Python version, GPU, disk space) would enable smarter task routing.

3. Architecture — Current vs. Target

Current (POC)

[ Orchestrator CLI ] ←gRPC:50051 (mTLS)→ [ Agent Node A ]
       |                                 → [ Agent Node B ]
  /app/data/mirrors (local FS)
  in-memory registry
  console dashboard only

Target (AI Hub Integrated)

[ User Browser ]
     ↓ HTTPS
[ AI Hub (Django/FastAPI) ]
   ├── REST/WS API  →  GET /api/nodes
   │                   POST /api/nodes/invite
   │                   POST /api/sessions/{id}/dispatch
   ├── gRPC Server (port 50051)
   │       ↑ SyncConfiguration / TaskStream / ReportHealth
   │   [ Agent Node A ] ← downloaded + configured by user
   │   [ Agent Node B ]
   ├── DB (Postgres)  → sessions, nodes, users
   └── File Storage   → /data/mirrors/{user_id}/{session_id}/

4. Recommended Integration Sequence

This is the sequence to complete before declaring "ready to migrate":

[S8] Fix import hashlib in node.py — immediate silent crash risk.
[M8] Fix secret management — per-node invite-based keys. Security baseline.
[M7] Decide TLS strategy — token-only auth removes the cert burden from end users.
[M3] + [M4] Session & Node ownership DB models — data foundation for everything else.
[M1] HTTP/WS API layer — thin FastAPI app alongside gRPC to expose state to the Hub UI.
[M2] Persistent registry — wire MemoryNodeRegistry to Redis/Postgres.
[S1] Node deregistration + TTL — makes the "online nodes" list accurate.
[S7] Client reconnection loop — nodes must survive transient server restarts.
[M5] "Download Your Node" page — the final user-facing feature closing the loop.
[M6] Live node status in UI — WebSocket push so the UI shows 🟢/⚫ in real-time.

[!NOTE] Items M1 through M6 are all about the Hub's integration layer, not the Agent Node client code. The gRPC protocol (.proto) is stable and does not need to change for the initial integration.

[!CAUTION] S8 (import hashlib missing in node.py) will cause a NameError crash in the _push_full_manifest code path. Fix this before any production deployment.