smoltorrent

Shards .safetensors checkpoints across worker nodes over TCP -
with SHA-256 verification, replication, and automatic sync.

Overview

smoltorrent is an educational project for learning distributed systems concepts hands-on. It shards .safetensors ML model checkpoints across a cluster of worker nodes over raw TCP, coordinated from a central machine. Each shard is stored on two workers (replication factor 2), so the failure of any single node loses no data.

Transfers use os.sendfile + mmap end-to-end: tensor bytes move from disk to socket to disk at the kernel level, so the coordinator's RAM usage stays flat regardless of checkpoint size.

Workers are discovered automatically via mDNS, with no hardcoded IPs. On macOS, peer-to-peer AirDrop / AWDL discovery runs in parallel for router-free Mac-to-Mac networking. A watcher daemon monitors your checkpoint directory and pushes new files automatically, with no manual intervention.

The worker cluster - 4× Raspberry Pi 4 (4 GB) in a rack enclosure, connected over Ethernet via a TP-Link LS110P PoE switch. Each Pi runs one smoltorrent worker process.

Performance Results

4 workers · RF2 replication · ~100 Mbps Ethernet · 942 MB checkpoint. Wall-clock times are Prometheus measurements averaged over 9 runs. The sequential baseline was measured directly (242 s/shard × 4 = 968 s).

Figure: Wall-clock time (store, gather, sequential baseline)
Figure: Aggregate throughput (MB/s)
Figure: Per-node bandwidth in MB/s (live Prometheus)
Figure: Avg TCP send latency per direction
Store (RF2): ~321 s · Gather: ~95 s (Prometheus, 9 runs)
Sequential gather: ~968 s (measured) · ~10× speedup
Gather aggregate: 9.9 MB/s · Store aggregate: 5.9 MB/s (2× data, RF2)
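
The aggregate figures follow directly from the checkpoint size and the wall-clock times. The short calculation below reproduces them; the variable names are illustrative only, and the numbers are the ones reported in this section.

```python
CHECKPOINT_MB = 942            # checkpoint size used in the benchmark
REPLICATION_FACTOR = 2         # RF2: every shard is written twice
STORE_S, GATHER_S = 321, 95    # Prometheus wall-clock averages (seconds)
SEQ_SHARD_S, SHARDS = 242, 4   # sequential baseline: seconds per shard


def aggregate_mb_per_s(megabytes, seconds):
    """Total megabytes moved divided by wall-clock seconds."""
    return megabytes / seconds


sequential_s = SEQ_SHARD_S * SHARDS
gather_rate = aggregate_mb_per_s(CHECKPOINT_MB, GATHER_S)
# Store moves twice the data because of replication.
store_rate = aggregate_mb_per_s(CHECKPOINT_MB * REPLICATION_FACTOR, STORE_S)
speedup = sequential_s / GATHER_S

print(f"sequential gather: {sequential_s} s")        # 968 s
print(f"gather aggregate:  {gather_rate:.1f} MB/s")  # 9.9 MB/s
print(f"store aggregate:   {store_rate:.1f} MB/s")   # 5.9 MB/s
print(f"gather speedup:    {speedup:.1f}x")          # 10.2x
```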

Key Features

safetensors · MLX · PyTorch · FastAPI · Prometheus · Grafana · Loki · mDNS / Zeroconf · macOS · Linux

Watcher & Discovery

Watcher

Monitors ckpt_root for new .safetensors files. On each trigger: file_sync → checksum_sync (startup only) → transfer → crosscheck. Files detected while still being written go to a pending list and are re-evaluated every 10 s.
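
How the watcher decides that a file is "still being written" is not spelled out here. One common heuristic, shown below as a minimal sketch, is to treat a file as ready once its size and mtime are unchanged between two consecutive passes. The PendingTracker class and its method names are hypothetical and are not taken from watcher/watch.py.

```python
from pathlib import Path

RECHECK_SECONDS = 10  # re-evaluation interval mentioned above


def snapshot(path):
    """Size and mtime; if either changes, the file is assumed to be in flux."""
    st = path.stat()
    return (st.st_size, st.st_mtime_ns)


class PendingTracker:
    """Hypothetical helper: holds files back until they stop changing."""

    def __init__(self):
        self.pending = {}   # path -> snapshot taken on the previous pass
        self.done = set()   # paths already handed to the pipeline

    def poll(self, ckpt_root):
        """Return files whose snapshot matches the one from the previous pass."""
        ready = []
        for path in sorted(Path(ckpt_root).glob("*.safetensors")):
            if path in self.done:
                continue
            snap = snapshot(path)
            if self.pending.get(path) == snap:
                # Unchanged since last pass: assume the writer has finished.
                ready.append(path)
                self.done.add(path)
                del self.pending[path]
            else:
                # New or still growing: keep it on the pending list.
                self.pending[path] = snap
        return ready
```

A daemon would call poll(ckpt_root) in a loop with time.sleep(RECHECK_SECONDS) between passes and feed each returned path into the sync pipeline. A new file therefore needs at least two passes before it is picked up.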

Discovery

Workers advertise themselves over mDNS (_smoltorrent._tcp.local.) on startup. The coordinator scans the network and finds all workers without any hardcoded IPs - useful for initial setup or when DHCP reassigns addresses. On macOS, AirDrop/AWDL discovery runs in parallel for Mac-to-Mac scenarios.

Pipeline Workflow

End-to-end flow for store (left) and gather (right). The watcher daemon auto-triggers store on new checkpoint files; gather is invoked via CLI or API. Dashed arrows indicate optional or fallback paths.

Figure: smoltorrent pipeline workflow (store and gather flows)

Watcher Boot Sequence

What happens when the cluster starts. Left lane: Mac coordinator. Right lane: Pi workers. Startup-only path (checksum_sync) shown in amber; loop-back on gaps in red.

Watcher boot sequence diagram

Cluster Architecture

Coordinator (any macOS / Linux machine)
  ├── FastAPI server   backend/api.py          ← /store-shard, /gather-shards, /discover
  ├── Watcher daemon   watcher/watch.py         ← auto-syncs new checkpoints
  ├── Discovery        discovery/               ← mDNS + AirDrop device discovery
  └── Workers × N      algorithms/SyncPS/worker.py  ← TCP listener on each worker node

Replication (factor 2):
  Shard 0  →  Worker 1 (primary)  +  Worker 2 (replica)
  Shard 1  →  Worker 2 (primary)  +  Worker 3 (replica)
  Shard 2  →  Worker 3 (primary)  +  Worker 4 (replica)
  Shard 3  →  Worker 4 (primary)  +  Worker 1 (replica)
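
The layout above is a ring: shard i goes to worker i as primary and to the next worker, wrapping around, as replica. A minimal sketch of that placement rule follows; the function names are illustrative, and generalizing the table to arbitrary shard counts and replication factors is an assumption.

```python
def place_shards(num_shards, workers, rf=2):
    """Map each shard index to rf consecutive workers on a ring.

    The first worker in each list is the primary, the rest are replicas.
    """
    if rf > len(workers):
        raise ValueError("replication factor exceeds number of workers")
    n = len(workers)
    return {
        shard: [workers[(shard + k) % n] for k in range(rf)]
        for shard in range(num_shards)
    }


def survives_failure(layout, dead_worker):
    """True if every shard still has at least one live holder."""
    return all(
        any(w != dead_worker for w in holders) for holders in layout.values()
    )


workers = ["Worker 1", "Worker 2", "Worker 3", "Worker 4"]
layout = place_shards(4, workers)
for shard, holders in layout.items():
    print(f"Shard {shard} -> {holders[0]} (primary) + {holders[1]} (replica)")
```

With four workers and rf=2 this reproduces the table, and survives_failure returns True for each worker in turn, which is the single-node failure guarantee.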

Transfer protocol (zero-copy):
  Store: coordinator reads shard range from checkpoint → os.sendfile → Pi writes to shard_N.safetensors via mmap
  Gather: Pi serves raw tensor bytes → coordinator writes directly into pre-allocated merged file via mmap
  Python never holds the checkpoint in RAM on either end — RSS stays flat regardless of file size.

Network: LAN, VPN, or any TCP-reachable topology (~100 Mbps tested over Tailscale)
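
The store path can be illustrated with a small loopback demo: the sender pushes a byte range with os.sendfile, and the receiver writes into a pre-allocated, mmap-ed file with recv_into, so no intermediate bytes object holds the shard. This is a sketch under assumptions, not the project's wire protocol: it has no framing header, the function names are made up for illustration, and the length is assumed to be known to both sides and greater than zero.

```python
import hashlib
import mmap
import os
import socket
import tempfile
import threading


def send_range(sock, src_path, offset, length):
    """Send `length` bytes of src_path, starting at `offset`, via os.sendfile."""
    with open(src_path, "rb") as f:
        sent = 0
        while sent < length:
            n = os.sendfile(sock.fileno(), f.fileno(), offset + sent, length - sent)
            if n == 0:
                raise EOFError("source file shorter than expected")
            sent += n


def recv_into_file(sock, dst_path, length):
    """Receive exactly `length` bytes straight into an mmap-ed file."""
    with open(dst_path, "wb+") as f:
        f.truncate(length)  # pre-allocate so the mapping covers the whole shard
        with mmap.mmap(f.fileno(), length) as mm:
            with memoryview(mm) as view:
                got = 0
                while got < length:
                    n = sock.recv_into(view[got:])
                    if n == 0:
                        raise ConnectionError("peer closed early")
                    got += n
            mm.flush()


def sha256_of(path, chunk=1 << 20):
    """Hash a file in fixed-size chunks so memory use stays bounded."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()


def demo():
    """Ship one byte range over loopback TCP and verify it by SHA-256."""
    tmp = tempfile.mkdtemp()
    src = os.path.join(tmp, "model.safetensors")   # stand-in file, random bytes
    dst = os.path.join(tmp, "shard_0.safetensors")
    with open(src, "wb") as f:
        f.write(os.urandom(1 << 20))
    offset, length = 4096, 256 * 1024

    server = socket.create_server(("127.0.0.1", 0))  # port 0: OS picks a free port
    port = server.getsockname()[1]

    def worker():
        conn, _ = server.accept()
        with conn:
            recv_into_file(conn, dst, length)

    t = threading.Thread(target=worker)
    t.start()
    with socket.create_connection(("127.0.0.1", port)) as s:
        send_range(s, src, offset, length)
    t.join()
    server.close()

    with open(src, "rb") as f:
        f.seek(offset)
        expected = hashlib.sha256(f.read(length)).hexdigest()
    return expected == sha256_of(dst)


if __name__ == "__main__":
    print("checksums match:", demo())
```

The gather direction would mirror this, with the worker calling send_range and the coordinator mapping the merged output file and receiving each shard into its own slice of the mapping.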

Usage

grove (recommended) - no SSH config needed. Workers discover the master over mDNS and self-register. grove_launch.sh starts the API + watcher automatically once all workers have joined:

# Coordinator: advertise and wait for N workers to join
grove start -n 4

# Worker (each Pi or Mac on the same network): TUI - select master, Enter to join
grove join

# -- cluster is now up, API server at localhost:8000 --
grove store --ckpt-path /path/to/checkpoints/model.safetensors
grove gather --ckpt-path /path/to/checkpoints/model.safetensors

# Discover live workers (mDNS)
curl http://<master-ip>:8000/discover

Contributors / code changes - fill in configs/dev-config.yaml with your SSH aliases and Tailscale IPs, then:

bash scripts/launch.sh              # rsync code to all workers (skips configs/)

Port conflicts: If you see OSError: [Errno 98] Address already in use (Errno 48 on macOS), port 8000 (API) or 8001 (watcher metrics) is still bound from a previous run. grove_launch.sh frees both ports automatically and only kills its own grove_api / grove_watcher tmux sessions.
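
To check by hand whether either port is still held before starting the API or watcher, a bind probe like the following sketch works on both Linux and macOS. The port_in_use helper is illustrative and is not part of grove.

```python
import errno
import socket


def port_in_use(port, host="127.0.0.1"):
    """Try to bind the port; EADDRINUSE means something else still holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return True
            raise  # any other error (e.g. permissions) is not a port conflict
        return False


if __name__ == "__main__":
    # 8000 = API, 8001 = watcher metrics (ports named in this README)
    for port in (8000, 8001):
        state = "in use" if port_in_use(port) else "free"
        print(f"port {port}: {state}")
```

Comparing against errno.EADDRINUSE rather than the literal 98 keeps the check portable, since the numeric value differs between Linux and macOS.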

Tested On

Any macOS or Linux machine can act as coordinator or worker. This is the exact setup we developed and tested on:

Coordinator
  Hardware: Apple Mac mini M4
  Chip:     Apple M4 (arm64)
  OS:       macOS 26.2 Tahoe
  Python:   3.13.3
  RAM:      16 GB
  Storage:  256 GB SSD
  Stack:    uv 0.9 · safetensors 0.7 · PyTorch 2.11 · MLX 0.31

Workers × 4
  Hardware: Raspberry Pi 4 Model B Rev 1.5
  Chip:     BCM2711 Cortex-A72 (aarch64)
  OS:       Debian 13 Trixie (kernel 6.12)
  Python:   3.13.5
  RAM:      4 GB
  Storage:  64 GB microSD
  Stack:    uv 0.9 · safetensors 0.7 · PyTorch 2.11

Network: ~100 Mbps Ethernet + Tailscale VPN
Gather: ~1.5 min · Store: ~5 min · 942 MB (RF2, parallel)

Workers can be any Linux or macOS machine - not Pi-specific. Coordinator requires Apple Silicon for MLX tensor ops; workers are platform-agnostic (PyTorch only).

License

smoltorrent is released under the MIT License.