| 2026-04-30 |

Self-recovery: dispatch dict, disable updater in Docker, fix watchdog semantics
...
1. Replace if/elif dispatch chain with _DISPATCH dict (engines/node.py)
Adding a proto message kind now requires exactly one entry in _DISPATCH —
impossible to add a callback slot without the routing case. Missing kinds
emit WARNING immediately. Extractor is colocated with the route so the
wrong-object bug (Bug 1) cannot recur.
2. Honour CORTEX_DISABLE_AUTO_UPDATE env var (updater.py)
In Docker, updates are delivered via image rebuilds. The old behaviour
called sys.exit(0) to spawn a bootstrapper, which Docker restart:always
turned into an infinite boot loop. Setting CORTEX_DISABLE_AUTO_UPDATE=1
in the container env prevents this entirely.
3. Watchdog ticks unconditionally in health reporter (node.py)
Previously the watchdog tick was skipped when the hub was unreachable,
causing os._exit(1) after 300s of any disconnect — even during normal
gRPC reconnect backoff. Now the watchdog proves the reporter thread is
alive regardless of hub state; it only fires if the thread itself deadlocks.
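The _DISPATCH pattern in item 1 can be sketched as follows. This is an illustrative stand-in only: `Node`, the handler names, and the dict-shaped messages approximate the real engines/node.py types and protobuf classes.

```python
import logging

logger = logging.getLogger(__name__)

# Illustrative stand-in for the real node class in engines/node.py.
class Node:
    def __init__(self):
        self.mode = None
        self.pool_size = None

    def _on_policy_update(self, policy):
        self.mode = policy["mode"]

    def _handle_work_pool(self, pool):
        self.pool_size = pool["size"]

# One entry per proto message kind: extractor and handler are colocated,
# so a kind cannot gain a callback slot without a routing case, and the
# handler always receives the sub-message rather than the envelope.
_DISPATCH = {
    "policy": (lambda msg: msg["policy"], Node._on_policy_update),
    "work_pool_update": (lambda msg: msg["work_pool_update"], Node._handle_work_pool),
}

def on_message(node, kind, message):
    entry = _DISPATCH.get(kind)
    if entry is None:
        logger.warning("unhandled message kind: %s", kind)  # loud, not silent
        return
    extract, handle = entry
    handle(node, extract(message))
```

Unknown kinds fall through to a single WARNING branch, so a missing route surfaces in logs instead of vanishing.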
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Harden mesh dispatch against silent failures
...
Three structural improvements to prevent future silent drops:
1. Add else-warning in on_message() dispatch chain — any unhandled proto
message kind now logs a WARNING in production instead of being silently
swallowed. This would have made the missing work_pool_update route
immediately visible.
2. Fix on_ready to use getattr(message, kind) — same safe extraction pattern
now applied consistently to all dispatch branches. Prevents Bug 1 class
from recurring if _on_mesh_ready ever starts using the argument.
3. Remove redundant _health_thread_started boolean from send_health() guard —
_start_health_stream() already gates on thread.is_alive(), so the outer
boolean was a stale fast-path that could mislead future readers back into
the original thread-restart bug.
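The getattr(message, kind) extraction in item 2 can be sketched like this; `ServerMessage` and the handler table are simplified stand-ins, not the real protobuf types.

```python
import logging

logger = logging.getLogger(__name__)

# Simplified stand-in for the proto envelope: each kind is an attribute
# holding the corresponding sub-message.
class ServerMessage:
    def __init__(self, **kinds):
        self.__dict__.update(kinds)

def dispatch(handlers, message, kind):
    handler = handlers.get(kind)
    if handler is None:
        logger.warning("unhandled message kind: %s", kind)
        return
    # Every branch extracts the same way: the handler gets the sub-message
    # named by `kind`, never the whole envelope object.
    handler(getattr(message, kind))
```

Because extraction is uniform, a handler that later starts reading its argument cannot accidentally receive the envelope.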
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Fix node disconnect: health stream restart, policy dispatch, work_pool routing
...
Three bugs causing nodes to drop offline and stay offline after disconnect:
1. on_policy passed the full ServerMessage instead of the sub-message —
caused 'Dispatch Error: mode' on every connection (AgentNode._on_policy_update
accesses policy.mode directly)
2. _health_thread_started never reset in close() — health stream could not
restart after reconnect, so hub watchdog eventually timed out the node
3. work_pool_update had no dispatch case in on_message() — _handle_work_pool
was dead code, causing hub to flood nodes with unclaimed task updates
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
| 2026-04-28 |
Fix Windows agent offline issues: add run_loop.bat wrapper and fix Sandbox policy handshake
|
Fix thread-unsafe sequence counter causing heapq comparison crash
...
The _send_seq += 1 approach had a race condition: two threads calling
send() simultaneously could load the same counter value, producing two
queue items with identical (priority, seq) tuples. heapq then compares
the third element (protobuf message), which crashes with TypeError.
Replace with itertools.count() whose next() is GIL-atomic in CPython —
each call is a single C-level operation that cannot be interrupted
mid-increment. Also fix the legacy fallback path in
LiveNodeRecord.send_message and remove the duplicate 'import queue' in
node_registry.py.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Fix PriorityQueue crash when protobuf messages are compared as tiebreakers
...
Python's heapq orders tuples element by element, so later elements are
compared whenever earlier ones tie. Using timestamp as the second key
means two messages queued within the same millisecond trigger comparison
of the protobuf message objects, which don't support '<'. Replace the
timestamp with a monotonic sequence counter so the message object is
never reached in the comparison.
Fixes: Reader thread FATAL exception: '<' not supported between instances
of 'ClientTaskMessage' and 'ClientTaskMessage'
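The failure mode can be reproduced in a few lines, using a plain class in place of the generated protobuf type:

```python
import heapq

class ClientTaskMessage:  # stand-in: protobuf messages define no __lt__
    pass

heap = []
# An identical (priority, timestamp) prefix forces tuple comparison to
# fall through into slot 3, where '<' between messages is undefined.
heapq.heappush(heap, (5, 1714000000, ClientTaskMessage()))
try:
    heapq.heappush(heap, (5, 1714000000, ClientTaskMessage()))
except TypeError as exc:
    print(exc)  # '<' not supported between instances ...
```

This is exactly the exception quoted in the commit above; any unique second key that sorts before the message prevents it.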
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Add token self-recovery to survive auth failures without SSH access
...
- GrpcMeshTransport: on handshake rejection, calls /api/v1/agent/token-sync
using stable secret_key to fetch fresh invite_token and retries handshake.
Persists recovered token to all known config file locations.
- agent_update.py: new /token-sync endpoint (auth: hub SECRET_KEY header).
- node.py: wire hub_http_url + secret_key into GrpcMeshTransport constructor.
- reinstall_windows_agent.ps1: idempotent all-in-one reinstall script —
kills competing python processes, disables ghost startup bat, syncs config,
optionally updates token, re-registers task with RestartCount=3.
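The recovery flow above can be sketched as follows. The endpoint path comes from the commit, but the header name, the exception type standing in for a handshake rejection, and the transport attributes are illustrative assumptions, not the real API.

```python
import json
import urllib.request

def fetch_fresh_token(hub_http_url, secret_key):
    # Assumed header name; the real endpoint authenticates with the hub
    # SECRET_KEY as described in the commit.
    req = urllib.request.Request(
        hub_http_url + "/api/v1/agent/token-sync",
        headers={"X-Hub-Secret-Key": secret_key},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["invite_token"]

def handshake_with_recovery(transport, fetch_token=fetch_fresh_token):
    try:
        return transport.handshake()
    except PermissionError:  # stand-in for a handshake rejection
        # Use the stable secret_key to recover a fresh invite_token,
        # persist it, then retry the handshake exactly once.
        transport.invite_token = fetch_token(
            transport.hub_http_url, transport.secret_key)
        transport.persist_token(transport.invite_token)
        return transport.handshake()
```

Injecting `fetch_token` keeps the retry logic testable without a live hub.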
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
| 2026-04-26 |
fix(mesh): resolve Windows interactive terminal backspace and reconnect loops, optimize link CSS
yangyang xie
|
| 2026-04-25 |
Fix integration tests, deadlocks, and race conditions in coworker flow
Antigravity AI
|
refactor done
yangyang xie
|
| 2026-04-24 |
half done refactoring
Antigravity AI
|