API Reference
The smoltorrent API runs on the coordinator (default localhost:8000).
All store/gather operations stream log lines as they run so you see progress in real time.
No authentication required — it's designed for local cluster use.
Make sure the API and all workers are running before calling any endpoint (see the setup guide).
Store
Shard a checkpoint and push each shard to workers with RF2 replication.
Gather
Pull all shards from workers, verify, and merge into a single file.
Discover
Scan the network via mDNS/AWDL and return all live workers.
/store-shard
Loads the checkpoint at ckpt_path, shards the tensor dict into N equal chunks
(one per worker), serializes each chunk to .safetensors bytes, then dispatches
all N × REDUNDANCY sends in parallel via a ThreadPoolExecutor.
With REDUNDANCY=2 (default), shard i goes to
workers[i] (primary) and workers[(i+1) % N] (replica).
Failed sends are retried with exponential backoff (up to 6 attempts).
Response is a streaming plain-text log — one line per event.
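The placement rule above can be sketched in a few lines. This is a minimal illustration, not the coordinator's actual code: `placement` is a hypothetical helper name, and only the REDUNDANCY=2 case is documented here. Extending the ring rule to larger values is an assumption.

```python
def placement(workers, redundancy=2):
    """Map each shard index to the workers that hold a copy.

    Copy r of shard i goes to workers[(i + r) % N]. With redundancy=2 this is
    the documented primary/replica rule; larger values are an assumption.
    """
    n = len(workers)
    if not 1 <= redundancy <= n:
        raise ValueError("redundancy must be between 1 and the number of workers")
    return {i: [workers[(i + r) % n] for r in range(redundancy)] for i in range(n)}


workers = ["pi4-1", "pi4-2", "pi4-3", "pi4-4"]
for shard, holders in placement(workers).items():
    print(f"shard {shard}: primary={holders[0]} replica={holders[1]}")
```

With four workers this produces 4 × 2 = 8 sends, and every shard lives on two different workers, so losing any single worker still leaves one copy of each shard.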
Query parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| ckpt_path | string | required | Absolute path to the .safetensors checkpoint on the coordinator. Must be under ckpt_root from config; the path relative to ckpt_root is used as the storage key on workers. |
curl
curl -N -X POST \
  "http://localhost:8000/store-shard?ckpt_path=/home/user/checkpoints/run1/step_100/model.safetensors"
Python (httpx)
import httpx

ckpt_path = "/home/user/checkpoints/run1/step_100/model.safetensors"

# timeout=None keeps the connection open for the whole transfer
with httpx.Client(timeout=None) as client:
    with client.stream(
        "POST", "http://localhost:8000/store-shard",
        params={"ckpt_path": ckpt_path},
    ) as resp:
        resp.raise_for_status()
        # Print each log line as soon as the coordinator emits it
        for line in resp.iter_lines():
            print(line)
Use timeout=None: a 942 MB checkpoint with RF2 takes about 5 minutes on 100 Mbps Ethernet.
The response streams progress lines continuously so your connection won't idle.
Streaming response
Failures are reported with ERROR: at the start of a line. A Done: line at the end means all shards were stored successfully; if Done: is missing, at least one shard failed permanently.
/gather-shards
Pulls shards from all workers in parallel, saves each to local disk under
shards/, verifies integrity, then merges them back into a single
merged.safetensors file in the same directory as the original checkpoint.
If a primary worker is down, the gather automatically falls back to the replica.
Pass the same ckpt_path used during store.
Query parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| ckpt_path | string | required | Same absolute path used in /store-shard. The relative path from ckpt_root is used to look up shards on workers. |
curl
curl -N -X POST \
  "http://localhost:8000/gather-shards?ckpt_path=/home/user/checkpoints/run1/step_100/model.safetensors"
Python (httpx)
import httpx

ckpt_path = "/home/user/checkpoints/run1/step_100/model.safetensors"

with httpx.Client(timeout=None) as client:
    with client.stream(
        "POST", "http://localhost:8000/gather-shards",
        params={"ckpt_path": ckpt_path},
    ) as resp:
        resp.raise_for_status()
        body = ""
        # Print each log line as it arrives and keep a copy of the full log
        for line in resp.iter_lines():
            print(line)
            body += line + "\n"

# merged.safetensors is in the same directory as ckpt_path
# e.g. /home/user/checkpoints/run1/step_100/merged.safetensors
Streaming response
The merged output is a standard .safetensors file. Load it with
safetensors.torch.load_file() or mx.load() directly; there is no smoltorrent dependency at inference time.
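As a quick sanity check before loading the merged file, the tensor names and shapes can be read from the file header using only the standard library. The sketch below relies on the published safetensors layout (an 8-byte little-endian header length followed by a JSON header). `list_tensors` is a hypothetical helper for illustration, not part of smoltorrent.

```python
import json
import struct


def list_tensors(path):
    """Return {tensor_name: shape} read from a .safetensors header."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return {name: info["shape"] for name, info in header.items()
            if name != "__metadata__"}
```

Calling `list_tensors("/home/user/checkpoints/run1/step_100/merged.safetensors")` and comparing the result with the original checkpoint's header is one cheap way to confirm that no tensor went missing in the merge.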
Replica fallback
If a primary worker is unreachable, gather automatically tries its replica (the next worker in the ring). The log shows which worker actually served the shard.
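The fallback order can be sketched as follows. This is an illustration under stated assumptions, not the gather implementation: `fetch_shard` and the injected `fetch` callable are hypothetical names, and an unreachable worker is modeled as a ConnectionError.

```python
def fetch_shard(shard_idx, workers, fetch):
    """Try the primary for shard_idx, then its ring replica.

    `fetch(worker, shard_idx)` is a caller-supplied function that returns the
    shard bytes or raises ConnectionError. Returns (serving_worker, data).
    """
    n = len(workers)
    candidates = [workers[shard_idx % n], workers[(shard_idx + 1) % n]]
    last_error = None
    for worker in candidates:
        try:
            return worker, fetch(worker, shard_idx)
        except ConnectionError as exc:
            last_error = exc  # remember the failure and try the next copy
    raise RuntimeError(
        f"shard {shard_idx}: primary and replica both failed"
    ) from last_error


# Demo with a stand-in transport: pi4-2 is down, so shard 1 comes from pi4-3.
def fake_fetch(worker, shard_idx):
    if worker == "pi4-2":
        raise ConnectionError(f"{worker} unreachable")
    return f"shard-{shard_idx}".encode()


served_by, data = fetch_shard(1, ["pi4-1", "pi4-2", "pi4-3", "pi4-4"], fake_fetch)
print(served_by)
```

Returning the serving worker alongside the data makes it easy to log which copy was used, as the gather log does.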
/discover
Scans the local network for smoltorrent worker nodes using mDNS
(_smoltorrent._tcp.local.). On macOS, AirDrop/AWDL discovery runs in
parallel for router-free peer-to-peer Mac-to-Mac networking.
Useful for verifying which workers are alive before a store/gather, or after DHCP reassigns IPs.
Query parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| timeout | float | optional | 10.0 | How long to scan for workers, in seconds. Increase for slow networks. |
curl
curl "http://localhost:8000/discover?timeout=5"
Response
{
  "workers": [
    {"ip": "192.168.1.7", "port": 5001, "rank": 1, "hostname": "pi4-1"},
    {"ip": "192.168.1.5", "port": 5002, "rank": 2, "hostname": "pi4-2"},
    {"ip": "192.168.1.3", "port": 5003, "rank": 3, "hostname": "pi4-3"},
    {"ip": "192.168.1.6", "port": 5004, "rank": 4, "hostname": "pi4-4"}
  ]
}
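To check for missing workers before a store or gather, the response can be compared against the ranks you expect. The sketch below uses only the standard library. The helper names (`discover`, `address_by_rank`, `missing_ranks`) are hypothetical, and the field names are taken from the sample response above.

```python
import json
import urllib.request


def discover(base_url="http://localhost:8000", timeout=5.0):
    """Call /discover and return the parsed JSON payload."""
    url = f"{base_url}/discover?timeout={timeout}"
    # Allow extra time on top of the scan window for the HTTP round trip.
    with urllib.request.urlopen(url, timeout=timeout + 5) as resp:
        return json.load(resp)


def address_by_rank(payload):
    """Map rank -> "ip:port" from a /discover response."""
    return {w["rank"]: f'{w["ip"]}:{w["port"]}' for w in payload["workers"]}


def missing_ranks(payload, expected_ranks):
    """Return the expected ranks that did not show up in the scan."""
    found = {w["rank"] for w in payload["workers"]}
    return sorted(set(expected_ranks) - found)
```

For a four-worker cluster, `missing_ranks(discover(), [1, 2, 3, 4])` returns an empty list when every worker answered the scan.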
/metrics
Prometheus scrape endpoint. Exposes store/gather operation counts, wall-clock histograms, per-rank transfer errors, bytes sent/received, and send duration histograms. Scraped by the Prometheus container in the monitoring stack.
curl http://localhost:8000/metrics/
Note the trailing slash: the path is /metrics/, not /metrics. The Prometheus scrape config in monitoring/prometheus/prometheus.yml already has this configured.
grove CLI
grove is the command-line interface that wraps all API calls. It handles cluster startup, worker discovery, and store/gather without needing to manage curl manually.
Start the cluster
# Start API + watcher + all 4 workers (rsyncs code, installs deps, opens tmux sessions)
bash scripts/launch.sh
# Start API only (workers must already be running)
bash scripts/launch.sh --api-only
# Start specific workers only
bash scripts/launch.sh --workers 1,3
grove CLI is the recommended way to store and gather — it handles streaming, error propagation, and optional tokenizer fetch after gather. The API server must already be running (started via grove start or bash scripts/launch.sh) before either command works.
Store a checkpoint
grove store --ckpt-path ~/checkpoints/run1/step_100/model.safetensors
# Streams the same log as the API directly to your terminal:
# Loaded 290 tensors (942.3 MB) — chunking into 4 shards
# ✓ rank 1 (pi4-1) [round 0]
# ✓ rank 2 (pi4-2) [round 0]
# ...
# Done: 8/8 sends (2x replicated) → run1/step_100
Gather a checkpoint
grove gather --ckpt-path ~/checkpoints/run1/step_100/model.safetensors
# → writes merged.safetensors in the same directory
Discover workers
grove discover
# → lists all live workers found via mDNS
Watcher — automatic mode
The watcher daemon watches your checkpoint directory and calls /store-shard
automatically whenever a new .safetensors file appears and stabilises (no
writes for 1 second). No manual intervention needed — just train, and checkpoints
replicate themselves.
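The "no writes for 1 second" rule can be approximated by polling the file's size and modification time. This is a sketch of the idea only; the real watcher may rely on filesystem events instead, and `wait_until_stable` with its parameters is a hypothetical name.

```python
import os
import time


def wait_until_stable(path, quiet_seconds=1.0, poll_interval=0.2):
    """Block until `path` has not changed for `quiet_seconds`.

    A change is detected by comparing (size, mtime) between polls.
    Returns the final file size in bytes.
    """
    def snapshot():
        st = os.stat(path)
        return st.st_size, st.st_mtime_ns

    last = snapshot()
    last_change = time.monotonic()
    while time.monotonic() - last_change < quiet_seconds:
        time.sleep(poll_interval)
        current = snapshot()
        if current != last:
            # The file is still being written: restart the quiet window.
            last, last_change = current, time.monotonic()
    return last[0]
```

Waiting for a quiet window avoids sharding a checkpoint that the training process is still writing.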
Start the watcher
# Watcher is launched automatically by launch.sh
# To start manually:
uv run python watcher/watch.py
# Watch specific file extensions (default: .safetensors)
uv run python watcher/watch.py --ext .safetensors,.pth
What the watcher does
1. file_sync: push local checkpoint list to all workers
2. checksum_sync: verify existing shards on startup (once)
3. transfer: call /store-shard for any new/changed file
4. crosscheck: confirm all workers have the shard before moving on
config.yaml reference
Lives at configs/config.yaml. Loaded fresh on every API request — no restart needed after edits.
ckpt_root: ~/smolcluster/checkpoints # base directory — ckpt_path must be under this
n_chunks: 4 # number of shards = number of workers
num_workers: 4
devices_config:
  master:
    - host: localhost
      ip: 127.0.0.1
      rank: 0
      port: 5000
  workers:
    - host: pi4-1
      ip: 192.168.1.7
      rank: 1
      port: 5001
    - host: pi4-2
      ip: 192.168.1.5
      rank: 2
      port: 5002
    - host: pi4-3
      ip: 192.168.1.3
      rank: 3
      port: 5003
    - host: pi4-4
      ip: 192.168.1.6
      rank: 4
      port: 5004
ckpt_path passed to the API must be an absolute path that starts with ckpt_root. The relative portion (e.g. run1/step_100) becomes the storage key on every worker.
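A client can derive the storage key itself before calling the API. The sketch below is an assumption based on the run1/step_100 example: it treats the key as the checkpoint's directory relative to ckpt_root, with the file name dropped. `storage_key` is a hypothetical helper, not a smoltorrent function.

```python
from pathlib import Path


def storage_key(ckpt_path, ckpt_root):
    """Return the checkpoint directory relative to ckpt_root, e.g. run1/step_100."""
    root = Path(ckpt_root).expanduser()
    path = Path(ckpt_path).expanduser()
    if not path.is_absolute():
        raise ValueError(f"{path} is not an absolute path")
    try:
        relative = path.relative_to(root)
    except ValueError:
        raise ValueError(f"{path} is not under ckpt_root ({root})") from None
    # Assumed: the key is the directory part; the file name itself is dropped.
    return relative.parent.as_posix()


print(storage_key(
    "/home/user/checkpoints/run1/step_100/model.safetensors",
    "/home/user/checkpoints",
))  # → run1/step_100
```

Running the same check on the client side catches a path outside ckpt_root before any request is sent.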
Streaming responses
Both /store-shard and /gather-shards return
text/plain streaming responses — FastAPI's StreamingResponse
yields one log line per event as it happens. This means:
- Use -N (no-buffer) with curl to see lines as they arrive
- Use client.stream() + iter_lines() in httpx (or stream=True + iter_lines() in requests)
- Set timeout=None, because the response stays open until the operation finishes
- Check for the ERROR: prefix on any line to detect failures
- The final Done: line confirms full success
Check for errors programmatically
import httpx

def store(ckpt_path: str) -> bool:
    """Store a checkpoint; raise RuntimeError if the log reports a failure."""
    lines = []
    with httpx.Client(timeout=None) as client:
        with client.stream("POST", "http://localhost:8000/store-shard",
                           params={"ckpt_path": ckpt_path}) as resp:
            resp.raise_for_status()
            for line in resp.iter_lines():
                lines.append(line)
    body = "\n".join(lines)
    # Any line starting with ERROR means the operation failed
    if any(l.startswith("ERROR") for l in lines):
        raise RuntimeError(f"Store failed:\n{body}")
    # A missing Done: line means at least one shard was not stored
    if not any("Done:" in l for l in lines):
        raise RuntimeError(f"Store did not complete:\n{body}")
    return True
Errors
| Message | Cause | Fix |
|---|---|---|
| ERROR: checkpoint not found | ckpt_path doesn't exist on the coordinator | Check the path. It must be absolute and the file must exist. |
| ERROR: … is not under ckpt_root | ckpt_path is outside ckpt_root in config | Move the checkpoint under ckpt_root or update config. |
| ↻ rank N failed — queuing retry | TCP connect failed to a worker | Transient; retried automatically up to 6 times. Check that the worker is running if it persists. |
| ✗ rank N permanently failed | Worker unreachable after 6 retries | Check worker logs: ssh pi4-N 'tmux attach -t syncps_worker_N' |
| ERROR: N/M shards failed — skipping merge | Gather could not retrieve all shards | At least one worker and its replica are both down. Restore a worker or recover from backup. |
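The retry rows above describe exponential backoff with a cap of 6. A minimal sketch of that pattern follows. It is not the coordinator's code: `send_with_retry` is a hypothetical name, the 0.5 second base delay is an assumption (only the cap is documented), and the cap is treated here as 6 attempts in total.

```python
import time


def send_with_retry(send, max_attempts=6, base_delay=0.5, sleep=time.sleep):
    """Call `send()` until it succeeds or `max_attempts` is reached.

    The delay doubles after each failure: base_delay, 2x, 4x, ...
    Returns the number of attempts used; re-raises the last error on failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            send()
            return attempt
        except ConnectionError:
            if attempt == max_attempts:
                raise  # permanent failure after the final attempt
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter keeps the function easy to test: a test can record the requested delays instead of actually waiting.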