| 2026-04-30 |
Fix node disconnect: health stream restart, policy dispatch, work_pool routing
...
Three bugs causing nodes to drop offline and stay offline after disconnect:
1. on_policy passed the full ServerMessage instead of the sub-message, which
caused "Dispatch Error: mode" on every connection (AgentNode._on_policy_update
accesses policy.mode directly)
2. _health_thread_started was never reset in close(), so the health stream could
not restart after reconnect and the hub watchdog eventually timed out the node
3. work_pool_update had no dispatch case in on_message(), so _handle_work_pool
was dead code, causing the hub to flood nodes with unclaimed task updates
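
Item 2 can be illustrated with a minimal sketch. The class and method names below (HealthStreamSketch, start_health_stream, the starts counter) are hypothetical stand-ins, not the project's actual transport code; only the _health_thread_started flag and the reset in close() come from the message above.

```python
import threading


class HealthStreamSketch:
    """Hypothetical stand-in for the transport's health-stream lifecycle."""

    def __init__(self):
        self._health_thread_started = False
        self._stop = threading.Event()
        self._thread = None
        self.starts = 0  # counts how many times the stream was really started

    def start_health_stream(self):
        # Guard against starting the stream twice within one connection.
        if self._health_thread_started:
            return
        self._health_thread_started = True
        self._stop.clear()
        self.starts += 1
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        # Placeholder loop; a real stream would send heartbeats to the hub here.
        while not self._stop.wait(0.01):
            pass

    def close(self):
        self._stop.set()
        if self._thread is not None:
            self._thread.join(timeout=1.0)
            self._thread = None
        # The fix from item 2: reset the guard so a reconnect can restart the stream.
        self._health_thread_started = False
```

Without the last line of close(), a second start_health_stream() call after a reconnect would return early and no heartbeats would be sent, which matches the watchdog timeout described above.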
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
| 2026-04-28 |
Fix Windows agent offline issues: add run_loop.bat wrapper and fix Sandbox policy handshake
|
Fix thread-unsafe sequence counter causing heapq comparison crash
...
The _send_seq += 1 approach had a race condition: two threads calling
send() simultaneously could load the same counter value, producing two
queue items with identical (priority, seq) tuples. heapq then compares
the third element (protobuf message), which crashes with TypeError.
Replace with itertools.count() whose next() is GIL-atomic in CPython —
each call is a single C-level operation that cannot be interrupted mid-
increment. Also fix the legacy fallback path in LiveNodeRecord.send_message
and remove the duplicate 'import queue' in node_registry.py.
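
A minimal sketch of the counter change, assuming a standard GIL-enabled CPython build. The names send_seq, outbox, send, and Unorderable are illustrative and not taken from the repository; Unorderable stands in for a protobuf message that does not support '<'.

```python
import itertools
import queue
import threading


class Unorderable:
    """Stand-in for a protobuf message: defines no ordering."""


send_seq = itertools.count()   # next(send_seq) hands out each value exactly once
outbox = queue.PriorityQueue()


def send(priority, message):
    # Unique seq values mean the tuple comparison never reaches the message.
    outbox.put((priority, next(send_seq), message))


def worker():
    for _ in range(1000):
        send(1, Unorderable())


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Drain the queue and collect the sequence numbers in pop order.
seqs = []
while not outbox.empty():
    _, seq, _ = outbox.get()
    seqs.append(seq)
```

All items share one priority, so the queue yields them strictly in sequence order and no two entries ever tie on (priority, seq).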
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Fix PriorityQueue crash when protobuf messages are compared as tiebreakers
...
Python's heapq compares tuples element by element, moving on to later
elements only when the earlier ones are equal.
Using timestamp as the second key means two rapidly queued messages at
the same millisecond trigger comparison of the protobuf message objects,
which don't support '<'. Replace timestamp with a monotonic sequence
counter so the message object is never reached in the comparison.
Fixes: Reader thread FATAL exception: '<' not supported between instances
of 'ClientTaskMessage' and 'ClientTaskMessage'
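
The failure mode can be reproduced in a few lines. This is a sketch with a hypothetical Msg class in place of ClientTaskMessage; the timestamp value is made up for illustration.

```python
import heapq


class Msg:
    """Stand-in for a protobuf message; defines no ordering."""


# Two entries with the same priority and the same timestamp tie on the first
# two tuple elements, so the comparison falls through to the Msg objects.
heap = []
heapq.heappush(heap, (1, 1000, Msg()))
try:
    heapq.heappush(heap, (1, 1000, Msg()))
    tie_error = None
except TypeError as exc:
    tie_error = exc

# With a unique sequence number as the second key, the third element is never
# compared and the entries pop in insertion order.
safe_heap = []
for seq in range(3):
    heapq.heappush(safe_heap, (1, seq, Msg()))
popped = [heapq.heappop(safe_heap)[1] for _ in range(3)]
```

Here tie_error holds the TypeError raised by the second push, while the sequence-keyed heap pops cleanly.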
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Add token self-recovery to survive auth failures without SSH access
...
- GrpcMeshTransport: on handshake rejection, calls /api/v1/agent/token-sync
using stable secret_key to fetch fresh invite_token and retries handshake.
Persists recovered token to all known config file locations.
- agent_update.py: new /token-sync endpoint (auth: hub SECRET_KEY header).
- node.py: wire hub_http_url + secret_key into GrpcMeshTransport constructor.
- reinstall_windows_agent.ps1: idempotent all-in-one reinstall script —
kills competing python processes, disables ghost startup bat, syncs config,
optionally updates token, re-registers task with RestartCount=3.
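
The recovery flow in the first bullet might look roughly like the sketch below. Everything here is hypothetical: HandshakeRejected, connect_with_token_recovery, and the three injected callables are illustrative names, and persisting the token only after a successful retry is an assumption about ordering that the message does not state.

```python
from typing import Callable


class HandshakeRejected(Exception):
    """Hypothetical error raised when the hub rejects the invite token."""


def connect_with_token_recovery(
    handshake: Callable[[str], None],
    fetch_fresh_token: Callable[[], str],
    persist_token: Callable[[str], None],
    invite_token: str,
) -> str:
    """Try the handshake; on rejection, sync a fresh token once and retry."""
    try:
        handshake(invite_token)
        return invite_token
    except HandshakeRejected:
        # Assumed to wrap the token-sync request authenticated with secret_key.
        fresh = fetch_fresh_token()
        handshake(fresh)      # a second rejection propagates to the caller
        persist_token(fresh)  # assumed: write to config only after success
        return fresh
```

Passing the network and file operations in as callables keeps the retry logic testable without a hub; the real transport presumably performs these steps inline.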
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
| 2026-04-25 |
refactor done
yangyang xie committed 17 days ago
|