Agent Node — AI Hub Integration Readiness Review
Prepared: 2026-03-04
Scope: poc-grpc-agent/ (commit 785b387)
Goal: Evaluate what is built, what works, and what must be done before migrating server logic into the AI Hub to support the target user flow:
"The AI Hub acts as the server. It provides a page to download the client-side Agent Node software (with a YAML config). The user deploys the node locally, the Hub detects the live node, and the user can attach that node to a conversation session in the UI."
1. What Is Built — Component Inventory
1.1 gRPC Protocol (protos/agent.proto)
| Message / RPC | Purpose | Status |
|---|---|---|
| SyncConfiguration (unary) | Node registration + sandbox policy handshake | ✅ Done |
| TaskStream (bidir stream) | Persistent command & control channel | ✅ Done |
| ReportHealth (bidir stream) | Heartbeat / utilization reporting | ✅ Done |
| RegistrationRequest | Node ID, version, auth JWT, capabilities map | ✅ Done |
| SandboxPolicy | STRICT / PERMISSIVE mode, allowed/denied commands | ✅ Done |
| TaskRequest | Shell payload or BrowserAction, with session_id | ✅ Done |
| TaskResponse | stdout/stderr + structured BrowserResponse | ✅ Done |
| FileSyncMessage | Bidirectional file sync envelope | ✅ Done |
| SyncControl | START/STOP_WATCHING, LOCK, UNLOCK, REFRESH_MANIFEST, RESYNC | ✅ Done |
| DirectoryManifest / FileInfo | SHA-256 manifest for drift detection | ✅ Done |
| FilePayload | Chunked file transfer with hash verification | ✅ Done |
| SyncStatus | OK / ERROR / RECONCILE_REQUIRED + reconcile_paths | ✅ Done |
| BrowserEvent | Live console/network event tunneling | ✅ Done |
| WorkPoolUpdate / TaskClaimRequest | Work-stealing pool protocol | ✅ Done |
1.2 Server Side — Orchestrator Components
| Component | File | What It Does | Status |
|---|---|---|---|
| AgentOrchestrator | services/grpc_server.py | gRPC servicer, routes all three RPCs | ✅ Done |
| MemoryNodeRegistry | core/registry.py | Tracks live nodes by ID in memory | ✅ Done (in-memory only) |
| TaskJournal | core/journal.py | Async task state tracking (Event-based) | ✅ Done |
| GlobalWorkPool | core/pool.py | Thread-safe work-stealing task pool | ✅ Done |
| GhostMirrorManager | core/mirror.py | Server-side file mirror with SHA-256 | ✅ Done |
| TaskAssistant | services/assistant.py | High-level AI API: dispatch, push, sync | ✅ Done |
| CortexIgnore | shared_core/ignore.py | .cortexignore / .gitignore filtering | ✅ Done |
| sign_payload / sign_browser_action | utils/crypto.py | HMAC-SHA256 task signing | ✅ Done |
| Mesh Dashboard | grpc_server.py:_monitor_mesh | Console health printout every 10s | ✅ Done (console only) |
Key API Methods on TaskAssistant
| Method | Description |
|---|---|
| push_workspace(node_id, session_id) | Initial push of all files in Ghost Mirror to a node |
| push_file(node_id, session_id, rel_path) | Targeted single-file push (drift recovery) |
| broadcast_file_chunk(session_id, sender, chunk) | Propagate a node's change to all other nodes |
| control_sync(node_id, session_id, action) | Send any SyncControl action |
| lock_workspace / unlock_workspace | Toggle node-side write lock |
| request_manifest / reconcile_node | Phase 5 drift detection + auto-recovery |
| dispatch_single(node_id, cmd, session_id) | Dispatch shell task (CWD-aware) |
| dispatch_browser(node_id, action, session_id) | Dispatch browser task (Download-to-Sync) |
1.3 Client Side — Agent Node Components
| Component | File | What It Does | Status |
|---|---|---|---|
| AgentNode | node.py | Core node orchestrator | ✅ Done |
| SkillManager | skills/manager.py | ThreadPool routing to shell / browser | ✅ Done |
| ShellSkill | skills/shell.py | Shell execution, CWD-aware, cancellable | ✅ Done |
| BrowserSkill | skills/browser.py | Playwright actor, Download-to-Sync | ✅ Done |
| SandboxEngine | core/sandbox.py | Policy enforcement (STRICT/PERMISSIVE) | ✅ Done |
| NodeSyncManager | core/sync.py | Local file writes + drift detection | ✅ Done |
| WorkspaceWatcher | core/watcher.py | watchdog-based file change streaming | ✅ Done |
| CortexIgnore (shared) | shared_core/ignore.py | Same ignore logic as server | ✅ Done |
| create_auth_token | utils/auth.py | 10-minute JWT for registration | ✅ Done |
| verify_task_signature | utils/auth.py | HMAC verification for incoming tasks | ✅ Done |
| get_secure_stub | utils/network.py | mTLS gRPC channel setup | ✅ Done |
| Config (config.py) | agent_node/config.py | 12-Factor env-var config | ✅ Done |
| Entry Point (main.py) | agent_node/main.py | SIGINT/SIGTERM graceful shutdown | ✅ Done |
Node Capabilities Advertised on Registration
shell: "v1"
browser: "playwright-sync-bridge"
1.4 Security
| Mechanism | Implementation | Status |
|---|---|---|
| mTLS | Server + Client certs (certs/), CA-signed | ✅ Done (self-signed for dev) |
| JWT Registration | 10-min expiry, HS256, shared secret | ✅ Done |
| HMAC Task Signing | Each task payload is HMAC-SHA256 signed | ✅ Done |
| Sandbox Policy | Server sends allow/deny lists at registration | ✅ Done |
| Path Traversal Guard | normpath + prefix check in sync writes | ✅ Done |
| .cortexignore | Prevents sensitive files from syncing | ✅ Done |
| Workspace Locking | Node ignores user edits during AI writes | ✅ Done |
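The "normpath + prefix check" guard listed above can be sketched as follows. This is a minimal illustration of the general technique, not the code in core/sync.py; the function name `resolve_in_workspace` and its signature are assumptions made for the example.

```python
import os


def resolve_in_workspace(workspace_root: str, rel_path: str) -> str:
    """Return the absolute target path, or raise if it escapes the workspace.

    Hypothetical helper: the name and signature are illustrative only.
    """
    root = os.path.normpath(os.path.abspath(workspace_root))
    # normpath collapses "a/../../b" style segments before the check.
    candidate = os.path.normpath(os.path.join(root, rel_path))
    # Compare against root + separator so that "/ws-evil" is not accepted
    # for a root of "/ws" (a bare startswith(root) would let it through).
    if candidate != root and not candidate.startswith(root + os.sep):
        raise ValueError(f"path escapes workspace: {rel_path!r}")
    return candidate
```

Note that normpath is purely lexical: it does not resolve symlinks. If the workspace may contain symlinks pointing outside the root, comparing `os.path.realpath` results would be a stricter variant.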
1.5 Synchronization — Ghost Mirror System
| Feature | Status |
|---|---|
| Server-Primary push (server → node) | ✅ Done |
| Node-Primary delta streaming (node → server) | ✅ Done |
| Multi-node broadcast propagation | ✅ Done |
| SHA-256 hash verification on file receive | ✅ Done |
| Manifest-based drift detection | ✅ Done |
| Automatic drift recovery on reconnect | ✅ Done |
| Chunked file transfer (64KB chunks) | ✅ Done |
| .cortexignore / .gitignore filtering | ✅ Done |
| Dynamic ignore rule reloading | ✅ Done |
| Download-to-Sync (browser → workspace) | ✅ Done |
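To make manifest-based drift detection concrete, here is a small sketch of the idea: hash every file with SHA-256, then compare the server's manifest against the node's to find paths that need to be re-pushed. The function names `build_manifest` and `find_drift` and the plain-dict manifest shape are assumptions for illustration; the real DirectoryManifest / FileInfo messages and ignore filtering are not modeled here.

```python
import hashlib
from pathlib import Path


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's POSIX-style relative path to its SHA-256 hex digest."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            manifest[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest


def find_drift(server: dict[str, str], node: dict[str, str]) -> list[str]:
    """Return paths that are missing on the node or whose hash differs."""
    return sorted(p for p, digest in server.items() if node.get(p) != digest)
```

The list returned by `find_drift` plays the role of reconcile_paths: each entry could then be recovered with a targeted single-file push. Files that exist only on the node are deliberately ignored in this sketch.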
2. What Is Missing — Gaps Before AI Hub Integration
🔴 Must-Fix (Blockers for the Target User Flow)
| # | Gap | Why It's a Blocker | Recommendation |
|---|---|---|---|
| M1 | No REST/WebSocket API surface | The Orchestrator only exposes a gRPC port. The AI Hub UI has no way to query node status, trigger syncs, or read task results. | Add a thin HTTP layer (FastAPI or Django view) alongside the gRPC server that exposes: GET /nodes, GET /nodes/{id}/status, POST /sessions/{id}/dispatch. |
| M2 | Node Registry is in-memory | If the Hub process restarts, all node registrations are lost. Live nodes appear "offline" even if still connected. | Back MemoryNodeRegistry with a persistent store (Redis or Postgres) so registrations survive restarts. |
| M3 | No persistent session model | session_id is a bare string; there is no concept of which user owns a session, which nodes are attached, or its lifecycle. | Add a Session DB model in the Hub: id, user_id, attached_node_ids, created_at, status. |
| M4 | No user → node ownership / authorization | Any registered node is visible to all. There is no "this is my node" concept per user account. | Add node_owner_id (user_id) at registration time. The Hub must issue a pre-signed invite token via POST /api/nodes/invite before the node connects. |
| M5 | Node download page doesn't exist | There is no endpoint or UI page to download the Agent Node software with a pre-configured YAML. | Build a "Download Your Node" page in the Hub that generates a config YAML with SERVER_HOST, GRPC_PORT, AGENT_NODE_ID, and the invite token. Bundle the Python package for download. |
| M6 | No live node status in the UI | The mesh dashboard only prints to the server console. The Hub UI needs real-time status to show "node is alive". | Expose node status via WebSocket or Server-Sent Events so the UI can show 🟢/⚫ for each node in real time. |
| M7 | mTLS certs are developer-only self-signed | The certs/ folder contains hardcoded dev certs. Nodes on external networks will fail TLS verification. | Either: (a) switch to token-only auth (no mTLS), which is simpler since the Hub already handles HTTPS; or (b) implement a cert-issuance API (/api/nodes/cert-request) backed by an internal CA. |
| M8 | Shared HMAC secret is hardcoded | ORCHESTRATOR_SECRET_KEY and AGENT_SECRET_KEY both default to "cortex-secret-shared-key". Any node with this secret can forge tasks. | Replace with per-node rotating keys derived from the invite token, or use asymmetric signing (Ed25519). |
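One way the M8 recommendation could look is sketched below: each node gets its own signing key derived from its invite token, and task payloads are signed and verified with HMAC-SHA256 under that key. The names `derive_node_key`, `sign_task`, and `verify_task` are hypothetical and are not the existing functions in utils/crypto.py or utils/auth.py. The sketch also assumes the invite token is a high-entropy random string, and it does not cover key rotation.

```python
import hashlib
import hmac


def derive_node_key(invite_token: str, node_id: str) -> bytes:
    """Derive a per-node signing key (hypothetical helper).

    HMAC-SHA256 is used here as a simple key-derivation step, with the
    invite token as the key and the node ID as context. A production
    design might prefer HKDF (RFC 5869).
    """
    context = b"task-signing:" + node_id.encode("utf-8")
    return hmac.new(invite_token.encode("utf-8"), context, hashlib.sha256).digest()


def sign_task(key: bytes, payload: bytes) -> str:
    """Return the hex HMAC-SHA256 signature of a task payload."""
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_task(key: bytes, payload: bytes, signature: str) -> bool:
    """Check a signature using a constant-time comparison."""
    return hmac.compare_digest(sign_task(key, payload), signature)
```

With per-node keys, a leaked key only allows forging tasks for that one node, rather than for every node sharing the default secret.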
🟡 Should-Fix (Important for Production Quality)
| # | Gap | Detail | Recommendation |
|---|---|---|---|
| S1 | No node deregistration | When a node's stream closes, its entry stays in registry.nodes forever. list_nodes() returns stale dead entries. | Add deregister(node_id) to the TaskStream finally block + a TTL-based cleanup routine. |
| S2 | JWT has 10-minute expiry but no refresh | The JWT expires after 10 minutes, but SyncConfiguration runs only once at startup, so a long-running node never obtains a fresh token. | Either extend TTL to match session length, or implement a token-refresh RPC. |
| S3 | GhostMirrorManager storage root is hardcoded | storage_root="/app/data/mirrors" is hardcoded. In the Hub this should be per-user and configurable. | Make it configurable via env var; use path /data/mirrors/{user_id}/{session_id}. |
| S4 | No browser task cancellation | BrowserSkill.cancel() always returns False. A running browser session cannot be interrupted. | Implement cancellation by pushing a sentinel + task_id into the actor queue. |
| S5 | CPU usage reported as hardcoded 1.0 | Heartbeat.cpu_usage_percent=1.0 is fake. Load balancing decisions are unreliable. | Use psutil.cpu_percent() and psutil.virtual_memory().percent. |
| S6 | Work-stealing jitter is random | time.sleep(random.uniform(0.1, 0.5)) for claim jitter is functional but non-deterministic. | Use a hash of node_id + task_id for deterministic, replayable distribution. |
| S7 | No reconnection loop on the client | If the server is temporarily unavailable, main.py calls sys.exit(1) and dies. | Implement an exponential-backoff retry loop (max_retries=10) in run_task_stream() before giving up. |
| S8 | import hashlib missing in node.py | _push_full_manifest calls hashlib.sha256 but import hashlib is missing. Will crash with NameError at runtime. | Add import hashlib to the top of node.py. Fix immediately. |
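The S6 recommendation (deterministic claim jitter from a hash of node_id + task_id) could be sketched as below. The function name `claim_delay` is an assumption; the 0.1 to 0.5 second range mirrors the current random.uniform(0.1, 0.5) call.

```python
import hashlib


def claim_delay(node_id: str, task_id: str,
                low: float = 0.1, high: float = 0.5) -> float:
    """Return a repeatable delay in [low, high) for a (node, task) pair.

    Hypothetical helper: the same inputs always give the same delay, so a
    claim race can be replayed exactly, while different nodes still spread
    out across the interval.
    """
    digest = hashlib.sha256(f"{node_id}:{task_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it into [0, 1).
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    return low + fraction * (high - low)
```

A node would then call `time.sleep(claim_delay(node_id, task_id))` before sending its claim, in place of the random sleep.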
🟢 Nice-to-Have (Phase 6 / Future)
| # | Gap | Recommendation |
|---|---|---|
| N1 | No Optimistic Concurrency on file writes | Add parent_hash field to FilePayload; reject edits where hash doesn't match server's current version. |
| N2 | Browser events not persisted | Console/network events are only printed to console. Store them per-session for the UI to replay. |
| N3 | No streaming task output | Shell output is returned only on completion. Add a ProgressEvent to stream stdout lines in real-time. |
| N4 | No structured capability discovery | Capabilities are a bare map<string,string>. Structured metadata (OS, Python version, GPU, disk space) would enable smarter task routing. |
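The optimistic-concurrency check proposed in N1 can be illustrated with an in-memory store: a write is accepted only if the caller's parent_hash matches the hash of the version the server currently holds. `MirrorStore` and `StaleWriteError` are hypothetical names for this sketch and do not correspond to GhostMirrorManager's actual interface.

```python
import hashlib
from typing import Optional


class StaleWriteError(Exception):
    """Raised when a write is based on an outdated version of the file."""


class MirrorStore:
    """Toy in-memory file store with parent-hash checks (illustrative only)."""

    def __init__(self) -> None:
        self._files: dict[str, bytes] = {}

    def current_hash(self, path: str) -> Optional[str]:
        """SHA-256 of the stored content, or None if the file does not exist."""
        content = self._files.get(path)
        return None if content is None else hashlib.sha256(content).hexdigest()

    def write(self, path: str, content: bytes, parent_hash: Optional[str]) -> str:
        """Store content if parent_hash matches the current version."""
        if self.current_hash(path) != parent_hash:
            raise StaleWriteError(f"{path}: edit is based on an outdated version")
        self._files[path] = content
        return hashlib.sha256(content).hexdigest()
```

A new file is created with `parent_hash=None`; every later edit passes the hash returned by the previous accepted write. A rejected writer would need to fetch the current version and retry.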
3. Architecture — Current vs. Target
Current (POC)
[ Orchestrator CLI ] ←gRPC:50051 (mTLS)→ [ Agent Node A ]
        |                              → [ Agent Node B ]
  /app/data/mirrors (local FS)
  in-memory registry
  console dashboard only
Target (AI Hub Integrated)
[ User Browser ]
      ↓ HTTPS
[ AI Hub (Django/FastAPI) ]
  ├── REST/WS API   → GET  /api/nodes
  │                   POST /api/nodes/invite
  │                   POST /api/sessions/{id}/dispatch
  ├── gRPC Server (port 50051)
  │       ↑ SyncConfiguration / TaskStream / ReportHealth
  │     [ Agent Node A ]  ← downloaded + configured by user
  │     [ Agent Node B ]
  ├── DB (Postgres) → sessions, nodes, users
  └── File Storage  → /data/mirrors/{user_id}/{session_id}/
4. Recommended Integration Sequence
This is the sequence to complete before declaring "ready to migrate":
- [S8] Fix import hashlib in node.py: immediate crash risk (NameError at runtime).
- [M8] Fix secret management: per-node invite-based keys. Security baseline.
- [M7] Decide TLS strategy: token-only auth removes the cert burden from end users.
- [M3] + [M4] Session & Node ownership DB models: data foundation for everything else.
- [M1] HTTP/WS API layer: thin FastAPI app alongside gRPC to expose state to the Hub UI.
- [M2] Persistent registry: wire MemoryNodeRegistry to Redis/Postgres.
- [S1] Node deregistration + TTL: makes the "online nodes" list accurate.
- [S7] Client reconnection loop: nodes must survive transient server restarts.
- [M5] "Download Your Node" page: the final user-facing feature closing the loop.
- [M6] Live node status in UI: WebSocket push so the UI shows 🟢/⚫ in real-time.
[!NOTE] Items M1 through M6 are all about the Hub's integration layer, not the Agent Node client code. The gRPC protocol (.proto) is stable and does not need to change for the initial integration.
[!CAUTION] S8 (import hashlib missing in node.py) will cause a NameError crash in the _push_full_manifest code path. Fix this before any production deployment.
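For the client reconnection step (S7) in the sequence above, an exponential-backoff retry loop might look like the sketch below. The helper `run_with_backoff`, its parameters, and the use of `ConnectionError` are assumptions; in the real client the wrapped callable would be the stream-opening code in run_task_stream(), and the caught exception would more likely be `grpc.RpcError`.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_backoff(connect: Callable[[], T],
                     max_retries: int = 10,
                     base_delay: float = 1.0,
                     max_delay: float = 60.0,
                     sleep: Callable[[float], None] = time.sleep) -> T:
    """Call connect(), retrying with exponential backoff on ConnectionError.

    Hypothetical helper: the delay doubles after each failure, is capped at
    max_delay, and gets up to 10% random jitter so that many nodes do not
    reconnect in lockstep after a Hub restart.
    """
    attempt = 0
    while True:
        try:
            return connect()
        except ConnectionError:
            if attempt >= max_retries:
                raise  # give up; the caller decides whether to exit
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
            attempt += 1
```

The `sleep` parameter exists only so the loop can be exercised in tests without real waiting; main.py would call `sys.exit(1)` only after this helper re-raises.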