Store
The API loads a checkpoint, splits tensors evenly into N shards (one per worker), computes a SHA-256 checksum per shard,
and sends each over TCP. Workers verify the checksum and write the shard to disk with a .checksum sidecar.
Failed sends retry with exponential backoff. Each shard is then mirrored to a second worker (replication round), so any
single node failure loses no data.
Gather
The API pulls each shard from its primary worker. If the primary is unreachable, it automatically falls back to the replica
on the next worker. All shards are merged back into one .safetensors file written as merged.safetensors
next to the original checkpoint.
Watcher
Monitors ckpt_root for new .safetensors files. On each trigger: file_sync →
checksum_sync (startup only) → transfer → crosscheck.
Files detected while still being written go to a pending list and are re-evaluated every 10 s.
Discovery
Workers advertise themselves over mDNS (_smoltorrent._tcp.local.) on startup. The coordinator scans the
network and finds all workers without any hardcoded IPs — useful for initial setup or when DHCP reassigns addresses.
On macOS, AirDrop/AWDL discovery runs in parallel for Mac-to-Mac scenarios.