smoltorrent — Distributed ML Checkpoint Sharding

Overview

smoltorrent is a distributed checkpoint storage system that shards .safetensors ML model checkpoints across a cluster of worker nodes, coordinated from a central coordinator machine. Each shard is stored on two workers (replication factor 2), so any single node failure loses no data. A watcher daemon monitors your checkpoint directory and pushes new files automatically — no manual intervention needed.

The coordinator handles tensor splitting and shard distribution; workers store and serve shards over TCP. The wire format is .safetensors only. The coordinator currently uses MLX (Apple Silicon) for tensor ops and converts to torch before sending — workers only need torch and safetensors, with no platform restriction.

Our setup (as an example): Mac mini M4 as the coordinator, 4× Raspberry Pi 4 as workers — but any Linux/macOS machine can fill either role.

Cluster Architecture

Coordinator (any macOS / Linux machine)
  ├── FastAPI server   backend/api.py          ← /store-shard, /gather-shards, /discover
  ├── Watcher daemon   watcher/watch.py         ← auto-syncs new checkpoints
  ├── Discovery        discovery/               ← mDNS + AirDrop device discovery
  └── Workers × N      algorithms/SyncPS/worker.py  ← TCP listener on each worker node

Replication (factor 2):
  Shard 0  →  Worker 1 (primary)  +  Worker 2 (replica)
  Shard 1  →  Worker 2 (primary)  +  Worker 3 (replica)
  Shard 2  →  Worker 3 (primary)  +  Worker 4 (replica)
  Shard 3  →  Worker 4 (primary)  +  Worker 1 (replica)

Network: LAN, VPN, or any TCP-reachable topology (~100 Mbps tested over Tailscale)

How It Works

Store

The API loads a checkpoint, splits tensors evenly into N shards (one per worker), computes a SHA-256 checksum per shard, and sends each over TCP. Workers verify the checksum and write the shard to disk with a .checksum sidecar. Failed sends retry with exponential backoff. Each shard is then mirrored to a second worker (replication round), so any single node failure loses no data.

Gather

The API pulls each shard from its primary worker. If the primary is unreachable, it automatically falls back to the replica on the next worker. All shards are merged back into one .safetensors file written as merged.safetensors next to the original checkpoint.

Watcher

Monitors ckpt_root for new .safetensors files. On each trigger: file_sync → checksum_sync (startup only) → transfer → crosscheck. Files detected while still being written go to a pending list and are re-evaluated every 10 s.

Discovery

Workers advertise themselves over mDNS (_smoltorrent._tcp.local.) on startup. The coordinator scans the network and finds all workers without any hardcoded IPs — useful for initial setup or when DHCP reassigns addresses. On macOS, AirDrop/AWDL discovery runs in parallel for Mac-to-Mac scenarios.

Key Features

Replication factor 2 — each shard lives on two nodes; single-node failure loses no data, gather falls back to replica automatically.
SHA-256 checksums — every shard has a .checksum sidecar; corruption detected at store and re-verified on gather.
Watcher daemon — inotify/watchdog triggers on new files; pending loop handles files still being written.
mDNS discovery — zero-config worker discovery via Zeroconf on all platforms; AirDrop/AWDL on macOS for Mac-to-Mac.
Auto-start — macOS LaunchDaemons bring the cluster up after coordinator reboot; systemd service restarts workers after node reboot.
Monitoring — Prometheus + Grafana + Loki in Docker on the coordinator; no SSH required. Workers expose metrics on 9200+rank.
Fully config-driven — all topology in configs/config.yaml; no IPs hardcoded anywhere.

safetensors MLX PyTorch FastAPI Prometheus Grafana Loki mDNS / Zeroconf macOS Linux

Usage

# Store a checkpoint across worker nodes
python main.py store --ckpt-path /path/to/checkpoints/model.safetensors

# Reassemble from shards → merged.safetensors written next to original
python main.py gather --ckpt-path /path/to/checkpoints/model.safetensors

# Discover workers on the network (mDNS scan)
curl http://<coordinator-ip>:8000/discover

Full Setup Guide

Our Setup

Any macOS or Linux machine can act as coordinator or worker. Here's what we tested with:

Role	Hardware	OS	Python	RAM
Coordinator	Apple Mac mini M4 (e.g.)	macOS 26.2 Tahoe (arm64)	3.13	16 GB
Workers × 4	Raspberry Pi 4 Model B Rev 1.5 (e.g.)	Debian 13 Trixie (aarch64, kernel 6.12)	3.13.5	4 GB

~100 Mbps Ethernet between nodes (tested over Tailscale VPN). ~2 min per 942 MB checkpoint per worker, parallel across all N workers.

License

smoltorrent is released under the MIT License.

Contributions welcome — visit the GitHub repository.