diff --git a/docs/architecture/agent_node_integration_readiness.md b/docs/architecture/agent_node_integration_readiness.md new file mode 100644 index 0000000..6e2bbc0 --- /dev/null +++ b/docs/architecture/agent_node_integration_readiness.md @@ -0,0 +1,209 @@ +# Agent Node — AI Hub Integration Readiness Review +**Prepared**: 2026-03-04 +**Scope**: `poc-grpc-agent/` (commit `785b387`) +**Goal**: Evaluate what is built, what works, and what **must be done** before migrating server logic into the AI Hub to support the target user flow: + +> *"The AI Hub acts as the server. It provides a page to download the client-side Agent Node software (with a YAML config). The user deploys the node locally, the Hub detects the live node, and the user can attach that node to a conversation session in the UI."* + +--- + +## 1. What Is Built — Component Inventory + +### 1.1 gRPC Protocol (`protos/agent.proto`) + +| Message / RPC | Purpose | Status | +|---|---|---| +| `SyncConfiguration` (unary) | Node registration + sandbox policy handshake | ✅ Done | +| `TaskStream` (bidir stream) | Persistent command & control channel | ✅ Done | +| `ReportHealth` (bidir stream) | Heartbeat / utilization reporting | ✅ Done | +| `RegistrationRequest` | Node ID, version, auth JWT, capabilities map | ✅ Done | +| `SandboxPolicy` | STRICT / PERMISSIVE mode, allowed/denied commands | ✅ Done | +| `TaskRequest` | Shell payload or BrowserAction, with `session_id` | ✅ Done | +| `TaskResponse` | stdout/stderr + structured `BrowserResponse` | ✅ Done | +| `FileSyncMessage` | Bidirectional file sync envelope | ✅ Done | +| `SyncControl` | START/STOP_WATCHING, LOCK, UNLOCK, REFRESH_MANIFEST, RESYNC | ✅ Done | +| `DirectoryManifest` / `FileInfo` | SHA-256 manifest for drift detection | ✅ Done | +| `FilePayload` | Chunked file transfer with hash verification | ✅ Done | +| `SyncStatus` | OK / ERROR / RECONCILE_REQUIRED + `reconcile_paths` | ✅ Done | +| `BrowserEvent` | Live console/network event tunneling | ✅ Done | +| `WorkPoolUpdate` / `TaskClaimRequest` | Work-stealing pool protocol | ✅ Done | + +--- + +### 1.2 Server Side — Orchestrator Components + +| Component | File | What It Does | Status | +|---|---|---|---| +| `AgentOrchestrator` | `services/grpc_server.py` | gRPC servicer, routes all three RPCs | ✅ Done | +| `MemoryNodeRegistry` | `core/registry.py` | Tracks live nodes by ID in memory | ✅ Done (in-memory only) | +| `TaskJournal` | `core/journal.py` | Async task state tracking (Event-based) | ✅ Done | +| `GlobalWorkPool` | `core/pool.py` | Thread-safe work-stealing task pool | ✅ Done | +| `GhostMirrorManager` | `core/mirror.py` | Server-side file mirror with SHA-256 | ✅ Done | +| `TaskAssistant` | `services/assistant.py` | High-level AI API: dispatch, push, sync | ✅ Done | +| `CortexIgnore` | `shared_core/ignore.py` | `.cortexignore` / `.gitignore` filtering | ✅ Done | +| `sign_payload` / `sign_browser_action` | `utils/crypto.py` | HMAC-SHA256 task signing | ✅ Done | +| Mesh Dashboard | `grpc_server.py:_monitor_mesh` | Console health printout every 10s | ✅ Done (console only) | + +#### Key API Methods on `TaskAssistant` +| Method | Description | +|---|---| +| `push_workspace(node_id, session_id)` | Initial push of all files in Ghost Mirror to a node | +| `push_file(node_id, session_id, rel_path)` | Targeted single-file push (drift recovery) | +| `broadcast_file_chunk(session_id, sender, chunk)` | Propagate a node's change to all other nodes | +| `control_sync(node_id, session_id, action)` | Send any SyncControl action | +| `lock_workspace / unlock_workspace` | Toggle node-side write lock | +| `request_manifest / reconcile_node` | Phase 5 drift detection + auto-recovery | +| `dispatch_single(node_id, cmd, session_id)` | Dispatch shell task (CWD-aware) | +| `dispatch_browser(node_id, action, session_id)` | Dispatch browser task (Download-to-Sync) | + +--- + +### 1.3 Client Side — Agent Node Components + +| Component | File | What It Does | Status | +|---|---|---|---| +| `AgentNode` | `node.py` | Core node orchestrator | ✅ Done | +| `SkillManager` | `skills/manager.py` | ThreadPool routing to shell / browser | ✅ Done | +| `ShellSkill` | `skills/shell.py` | Shell execution, CWD-aware, cancellable | ✅ Done | +| `BrowserSkill` | `skills/browser.py` | Playwright actor, Download-to-Sync | ✅ Done | +| `SandboxEngine` | `core/sandbox.py` | Policy enforcement (STRICT/PERMISSIVE) | ✅ Done | +| `NodeSyncManager` | `core/sync.py` | Local file writes + drift detection | ✅ Done | +| `WorkspaceWatcher` | `core/watcher.py` | `watchdog`-based file change streaming | ✅ Done | +| `CortexIgnore` (shared) | `shared_core/ignore.py` | Same ignore logic as server | ✅ Done | +| `create_auth_token` | `utils/auth.py` | 10-minute JWT for registration | ✅ Done | +| `verify_task_signature` | `utils/auth.py` | HMAC verification for incoming tasks | ✅ Done | +| `get_secure_stub` | `utils/network.py` | mTLS gRPC channel setup | ✅ Done | +| Config (`config.py`) | `agent_node/config.py` | 12-Factor env-var config | ✅ Done | +| Entry Point (`main.py`) | `agent_node/main.py` | SIGINT/SIGTERM graceful shutdown | ✅ Done | + +#### Node Capabilities Advertised on Registration +```yaml +shell: "v1" +browser: "playwright-sync-bridge" +``` + +--- + +### 1.4 Security + +| Mechanism | Implementation | Status | +|---|---|---| +| **mTLS** | Server + Client certs (`certs/`), CA-signed | ✅ Done (self-signed for dev) | +| **JWT Registration** | 10-min expiry, HS256, shared secret | ✅ Done | +| **HMAC Task Signing** | Each task payload is HMAC-SHA256 signed | ✅ Done | +| **Sandbox Policy** | Server sends allow/deny lists at registration | ✅ Done | +| **Path Traversal Guard** | `normpath` + prefix check in sync writes | ✅ Done | +| **`.cortexignore`** | Prevents sensitive files from syncing | ✅ Done | +| **Workspace Locking** | Node ignores user edits during AI writes | ✅ Done | + +--- + +### 1.5 Synchronization — Ghost Mirror System + +| Feature | Status | +|---|---| +| Server-Primary push (server → node) | ✅ Done | +| Node-Primary delta streaming (node → server) | ✅ Done | +| Multi-node broadcast propagation | ✅ Done | +| SHA-256 hash verification on file receive | ✅ Done | +| Manifest-based drift detection | ✅ Done | +| Automatic drift recovery on reconnect | ✅ Done | +| Chunked file transfer (64KB chunks) | ✅ Done | +| `.cortexignore` / `.gitignore` filtering | ✅ Done | +| Dynamic ignore rule reloading | ✅ Done | +| Download-to-Sync (browser → workspace) | ✅ Done | + +--- + +## 2. What Is Missing — Gaps Before AI Hub Integration + +### 🔴 Must-Fix (Blockers for the Target User Flow) + +| # | Gap | Why It's a Blocker | Recommendation | +|---|---|---|---| +| **M1** | **No REST/WebSocket API surface** | The Orchestrator only exposes a gRPC port. The AI Hub UI has no way to query node status, trigger syncs, or read task results. | Add a thin HTTP layer (FastAPI or Django view) alongside the gRPC server that exposes: `GET /nodes`, `GET /nodes/{id}/status`, `POST /sessions/{id}/dispatch`. | +| **M2** | **Node Registry is in-memory** | If the Hub process restarts, all node registrations are lost. Live nodes appear "offline" even if still connected. | Back `MemoryNodeRegistry` with a persistent store (Redis or Postgres) so registrations survive restarts. | +| **M3** | **No persistent session model** | `session_id` is a bare string — there is no concept of which user owns a session, which nodes are attached, or its lifecycle. | Add a `Session` DB model in the Hub: `id`, `user_id`, `attached_node_ids`, `created_at`, `status`. | +| **M4** | **No user → node ownership / authorization** | Any registered node is visible to all. There is no "this is my node" concept per user account. | Add `node_owner_id` (user_id) at registration time. The Hub must issue a pre-signed invite token via `POST /api/nodes/invite` before the node connects. | +| **M5** | **Node download page doesn't exist** | There is no endpoint or UI page to download the Agent Node software with a pre-configured YAML. | Build a "Download Your Node" page in the Hub that generates a config YAML with `SERVER_HOST`, `GRPC_PORT`, `AGENT_NODE_ID`, and the invite token. Bundle the Python package for download. | +| **M6** | **No live node status in the UI** | The mesh dashboard only prints to the server console. The Hub UI needs real-time status to show "node is alive". | Expose node status via WebSocket or Server-Sent Events so the UI can show 🟢/⚫ for each node in real time. | +| **M7** | **mTLS certs are developer-only self-signed** | The `certs/` folder contains hardcoded dev certs. Nodes on external networks will fail TLS verification. | Either: (a) switch to token-only auth (no mTLS), which is simpler since the Hub already handles HTTPS; or (b) implement a cert-issuance API (`/api/nodes/cert-request`) backed by an internal CA. | +| **M8** | **Shared HMAC secret is hardcoded** | `ORCHESTRATOR_SECRET_KEY` and `AGENT_SECRET_KEY` both default to `"cortex-secret-shared-key"`. Any node with this secret can forge tasks. | Replace with per-node rotating keys derived from the invite token, or use asymmetric signing (Ed25519). | + +--- + +### 🟡 Should-Fix (Important for Production Quality) + +| # | Gap | Detail | Recommendation | +|---|---|---|---| +| **S1** | **No node deregistration** | When a node's stream closes, its entry stays in `registry.nodes` forever. `list_nodes()` returns stale dead entries. | Add `deregister(node_id)` to the `TaskStream` `finally` block + a TTL-based cleanup routine. | +| **S2** | **JWT has 10-minute expiry but no refresh** | After 10 minutes the JWT is expired but `SyncConfiguration` runs only once at startup. | Either extend TTL to match session length, or implement a token-refresh RPC. | +| **S3** | **`GhostMirrorManager` storage root is hardcoded** | `storage_root="/app/data/mirrors"` is hardcoded. In the Hub this should be per-user and configurable. | Make it configurable via env var; use path `/data/mirrors/{user_id}/{session_id}`. | +| **S4** | **No browser task cancellation** | `BrowserSkill.cancel()` always returns `False`. A running browser session cannot be interrupted. | Implement cancellation by pushing a sentinel + `task_id` into the actor queue. | +| **S5** | **CPU usage reported as hardcoded `1.0`** | `Heartbeat.cpu_usage_percent=1.0` is fake. Load balancing decisions are unreliable. | Use `psutil.cpu_percent()` and `psutil.virtual_memory().percent`. | +| **S6** | **Work-stealing jitter is random** | `time.sleep(random.uniform(0.1, 0.5))` for claim jitter is functional but non-deterministic. | Use a hash of `node_id + task_id` for deterministic, replayable distribution. | +| **S7** | **No reconnection loop on the client** | If the server is temporarily unavailable, `main.py` calls `sys.exit(1)` and dies. | Implement an exponential-backoff retry loop (`max_retries=10`) in `run_task_stream()` before giving up. | +| **S8** | **`import hashlib` missing in `node.py`** | `_push_full_manifest` calls `hashlib.sha256` but `import hashlib` is missing. Will crash with `NameError` at runtime. | Add `import hashlib` to the top of `node.py`. **Fix immediately.** | + +--- + +### 🟢 Nice-to-Have (Phase 6 / Future) + +| # | Gap | Recommendation | +|---|---|---| +| **N1** | **No Optimistic Concurrency on file writes** | Add `parent_hash` field to `FilePayload`; reject edits where hash doesn't match server's current version. | +| **N2** | **Browser events not persisted** | Console/network events are only printed to console. Store them per-session for the UI to replay. | +| **N3** | **No streaming task output** | Shell output is returned only on completion. Add a `ProgressEvent` to stream stdout lines in real-time. | +| **N4** | **No structured capability discovery** | Capabilities are a bare `map`. Structured metadata (OS, Python version, GPU, disk space) would enable smarter task routing. | + +--- + +## 3. Architecture — Current vs. Target + +### Current (POC) +``` +[ Orchestrator CLI ] ←gRPC:50051 (mTLS)→ [ Agent Node A ] + | → [ Agent Node B ] + /app/data/mirrors (local FS) + in-memory registry + console dashboard only +``` + +### Target (AI Hub Integrated) +``` +[ User Browser ] + ↓ HTTPS +[ AI Hub (Django/FastAPI) ] + ├── REST/WS API → GET /api/nodes + │ POST /api/nodes/invite + │ POST /api/sessions/{id}/dispatch + ├── gRPC Server (port 50051) + │ ↑ SyncConfiguration / TaskStream / ReportHealth + │ [ Agent Node A ] ← downloaded + configured by user + │ [ Agent Node B ] + ├── DB (Postgres) → sessions, nodes, users + └── File Storage → /data/mirrors/{user_id}/{session_id}/ +``` + +--- + +## 4. Recommended Integration Sequence + +This is the sequence to complete before declaring "ready to migrate": + +1. **[S8] Fix `import hashlib`** in `node.py` — immediate silent crash risk. +2. **[M8] Fix secret management** — per-node invite-based keys. Security baseline. +3. **[M7] Decide TLS strategy** — token-only auth removes the cert burden from end users. +4. **[M3] + [M4] Session & Node ownership DB models** — data foundation for everything else. +5. **[M1] HTTP/WS API layer** — thin FastAPI app alongside gRPC to expose state to the Hub UI. +6. **[M2] Persistent registry** — wire `MemoryNodeRegistry` to Redis/Postgres. +7. **[S1] Node deregistration + TTL** — makes the "online nodes" list accurate. +8. **[S7] Client reconnection loop** — nodes must survive transient server restarts. +9. **[M5] "Download Your Node" page** — the final user-facing feature closing the loop. +10. **[M6] Live node status in UI** — WebSocket push so the UI shows 🟢/⚫ in real-time. + +> [!NOTE] +> Items **M1 through M6** are all about the Hub's integration layer, not the Agent Node client code. The gRPC protocol (`.proto`) is **stable** and does not need to change for the initial integration. + +> [!CAUTION] +> **S8** (`import hashlib` missing in `node.py`) will cause a `NameError` crash in the `_push_full_manifest` code path. Fix this before any production deployment.