smoltorrent — API Docs

API Reference

The smoltorrent API runs on the coordinator (default localhost:8000). All store/gather operations stream log lines as they run so you see progress in real time. No authentication required — it's designed for local cluster use. Make sure the API and all workers are running before calling any endpoint (setup guide).

POST /store-shard

Store

Shard a checkpoint and push each shard to workers with RF2 replication.

POST /gather-shards

Gather

Pull all shards from workers, verify, and merge into a single file.

GET /discover

Discover

Scan the network via mDNS/AWDL and return all live workers.

POST

/store-shard

Loads the checkpoint at ckpt_path, shards the tensor dict into N equal chunks (one per worker), serializes each chunk to .safetensors bytes, then dispatches all N × REDUNDANCY sends in parallel via a ThreadPoolExecutor. With REDUNDANCY=2 (default), shard i goes to workers[i] (primary) and workers[(i+1) % N] (replica). Failed sends are retried with exponential backoff (up to 6 attempts). Response is a streaming plain-text log — one line per event.

Query parameters

Parameter	Type	Required	Description
`ckpt_path`	`string`	required	Absolute path to the `.safetensors` checkpoint on the coordinator. Must be under `ckpt_root` from config — the relative path is used as the storage key on workers.

curl

bash

curl -N -X POST \
  "http://localhost:8000/store-shard?ckpt_path=/home/user/checkpoints/run1/step_100/model.safetensors"

Python (httpx)

python

import httpx

ckpt_path = "/home/user/checkpoints/run1/step_100/model.safetensors"

with httpx.Client(timeout=None) as client:
    with client.stream(
        "POST", "http://localhost:8000/store-shard",
        params={"ckpt_path": ckpt_path}
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            print(line)

ℹ️ Use timeout=None — a 942 MB checkpoint with RF2 takes ~5 min on 100 Mbps Ethernet. The response streams progress lines continuously so your connection won't idle.

Streaming response

Loaded 290 tensors (942.3 MB) from run1/step_100 — chunking into 4 shards ✓ rank 1 (pi4-1) [round 0] ✓ rank 3 (pi4-3) [round 0] ✓ rank 2 (pi4-2) [round 0] ✓ rank 4 (pi4-4) [round 0] ✓ rank 2 (pi4-2) [round 1] ✓ rank 3 (pi4-3) [round 1] ✓ rank 1 (pi4-1) [round 1] ✓ rank 4 (pi4-4) [round 1] Done: 8/8 sends (2x replicated) → run1/step_100

⚠️ Check for ERROR: at the start of any line. A Done: line at the end means all shards were stored successfully. Missing Done: means at least one shard permanently failed.

POST

/gather-shards

Pulls shards from all workers in parallel, saves each to local disk under shards/, verifies integrity, then merges them back into a single merged.safetensors file in the same directory as the original checkpoint. If a primary worker is down, the gather automatically falls back to the replica. Pass the same ckpt_path used during store.

Query parameters

Parameter	Type	Required	Description
`ckpt_path`	`string`	required	Same absolute path used in `/store-shard`. The relative path from `ckpt_root` is used to look up shards on workers.

curl

bash

curl -N -X POST \
  "http://localhost:8000/gather-shards?ckpt_path=/home/user/checkpoints/run1/step_100/model.safetensors"

Python (httpx)

python

import httpx

ckpt_path = "/home/user/checkpoints/run1/step_100/model.safetensors"

with httpx.Client(timeout=None) as client:
    with client.stream(
        "POST", "http://localhost:8000/gather-shards",
        params={"ckpt_path": ckpt_path}
    ) as resp:
        resp.raise_for_status()
        body = ""
        for line in resp.iter_lines():
            print(line)
            body += line + "\n"

# merged.safetensors is in the same directory as ckpt_path
# e.g. /home/user/checkpoints/run1/step_100/merged.safetensors

Streaming response

✓ shard 0 — saved → /path/to/shards/worker_1/run1/step_100/shard.safetensors ✓ shard 1 — saved → /path/to/shards/worker_2/run1/step_100/shard.safetensors ✓ shard 2 — saved → /path/to/shards/worker_3/run1/step_100/shard.safetensors ✓ shard 3 — saved → /path/to/shards/worker_4/run1/step_100/shard.safetensors Merging 4 shards → /home/user/checkpoints/run1/step_100/merged.safetensors Done: saved → /home/user/checkpoints/run1/step_100/merged.safetensors

✅ The merged file is a standard .safetensors file — load it with safetensors.torch.load_file() or mx.load() directly. No smoltorrent dependency at inference time.

Replica fallback

If a primary worker is unreachable, gather automatically tries its replica (the next worker in the ring). The log shows which worker actually served the shard:

trying replica rank 2 for shard 1 (primary rank 1 failed) ✓ shard 1 — saved → .../worker_2/run1/step_100/shard.safetensors

GET

/discover

Scans the local network for smoltorrent worker nodes using mDNS (_smoltorrent._tcp.local.). On macOS, AirDrop/AWDL discovery runs in parallel for router-free peer-to-peer Mac-to-Mac networking. Useful for verifying which workers are alive before a store/gather, or after DHCP reassigns IPs.

Query parameters

Parameter	Type	Required	Default	Description
`timeout`	`float`	optional	`10.0`	How long to scan for workers in seconds. Increase for slow networks.

curl

bash

curl "http://localhost:8000/discover?timeout=5"

Response

json

{
  "workers": [
    {"ip": "192.168.1.7",  "port": 5001, "rank": 1, "hostname": "pi4-1"},
    {"ip": "192.168.1.5",  "port": 5002, "rank": 2, "hostname": "pi4-2"},
    {"ip": "192.168.1.3",  "port": 5003, "rank": 3, "hostname": "pi4-3"},
    {"ip": "192.168.1.6",  "port": 5004, "rank": 4, "hostname": "pi4-4"}
  ]
}

GET

/metrics

Prometheus scrape endpoint. Exposes store/gather operation counts, wall-clock histograms, per-rank transfer errors, bytes sent/received, and send duration histograms. Scraped by the Prometheus container in the monitoring stack.

bash

curl http://localhost:8000/metrics/

ℹ️ Note the trailing slash — /metrics/ not /metrics. The Prometheus scrape config in monitoring/prometheus/prometheus.yml already has this configured.

grove CLI

grove is the command-line interface that wraps all API calls. It handles cluster startup, worker discovery, and store/gather without needing to manage curl manually.

Start the cluster

bash

# Start API + watcher + all 4 workers (rsyncs code, installs deps, opens tmux sessions)
bash scripts/launch.sh

# Start API only (workers must already be running)
bash scripts/launch.sh --api-only

# Start specific workers only
bash scripts/launch.sh --workers 1,3

grove CLI is the recommended way to store and gather — it handles streaming, error propagation, and optional tokenizer fetch after gather. The API server must already be running (started via grove start or bash scripts/launch.sh) before either command works.

Store a checkpoint

bash

grove store --ckpt-path ~/checkpoints/run1/step_100/model.safetensors

# Streams the same log as the API directly to your terminal:
# Loaded 290 tensors (942.3 MB) — chunking into 4 shards
#   ✓ rank 1 (pi4-1) [round 0]
#   ✓ rank 2 (pi4-2) [round 0]
#   ...
# Done: 8/8 sends (2x replicated) → run1/step_100

Gather a checkpoint

bash

grove gather --ckpt-path ~/checkpoints/run1/step_100/model.safetensors
# → writes merged.safetensors in the same directory

Discover workers

bash

grove discover
# → lists all live workers found via mDNS

Watcher — automatic mode

The watcher daemon watches your checkpoint directory and calls /store-shard automatically whenever a new .safetensors file appears and stabilises (no writes for 1 second). No manual intervention needed — just train, and checkpoints replicate themselves.

Start the watcher

bash

# Watcher is launched automatically by launch.sh
# To start manually:
uv run python watcher/watch.py

# Watch specific file extensions (default: .safetensors)
uv run python watcher/watch.py --ext .safetensors,.pth

What the watcher does

4-phase loop

1. file_sync        — push local checkpoint list to all workers
2. checksum_sync    — verify existing shards on startup (once)
3. transfer         — call /store-shard for any new/changed file
4. crosscheck       — confirm all workers have the shard before moving on

✅ The watcher and direct API calls are not mutually exclusive — you can run both at the same time. The watcher handles new files automatically; direct API calls let you trigger store/gather on demand for specific paths.

config.yaml reference

Lives at configs/config.yaml. Loaded fresh on every API request — no restart needed after edits.

yaml

ckpt_root: ~/smolcluster/checkpoints   # base directory — ckpt_path must be under this
n_chunks: 4                            # number of shards = number of workers
num_workers: 4

devices_config:
  master:
    - host: localhost
      ip: 127.0.0.1
      rank: 0
      port: 5000
  workers:
    - host: pi4-1
      ip: 192.168.1.7
      rank: 1
      port: 5001
    - host: pi4-2
      ip: 192.168.1.5
      rank: 2
      port: 5002
    - host: pi4-3
      ip: 192.168.1.3
      rank: 3
      port: 5003
    - host: pi4-4
      ip: 192.168.1.6
      rank: 4
      port: 5004

⚠️ ckpt_path passed to the API must be an absolute path that starts with ckpt_root. The relative portion (e.g. run1/step_100) becomes the storage key on every worker.

Streaming responses

Both /store-shard and /gather-shards return text/plain streaming responses — FastAPI's StreamingResponse yields one log line per event as it happens. This means:

Use -N (no-buffer) with curl to see lines as they arrive
Use client.stream() + iter_lines() in httpx/requests
Set timeout=None — the response stays open until the operation finishes
Check for ERROR: prefix on any line to detect failures
The final Done: line confirms full success

Check for errors programmatically

python

import httpx

def store(ckpt_path: str) -> bool:
    lines = []
    with httpx.Client(timeout=None) as client:
        with client.stream("POST", "http://localhost:8000/store-shard",
                           params={"ckpt_path": ckpt_path}) as resp:
            resp.raise_for_status()
            for line in resp.iter_lines():
                lines.append(line)

    body = "\n".join(lines)
    if any(l.startswith("ERROR") for l in lines):
        raise RuntimeError(f"Store failed:\n{body}")
    if not any("Done:" in l for l in lines):
        raise RuntimeError(f"Store did not complete:\n{body}")
    return True

Errors

Message	Cause	Fix
`ERROR: checkpoint not found`	`ckpt_path` doesn't exist on the coordinator	Check the path. Must be absolute and the file must exist.
`ERROR: … is not under ckpt_root`	`ckpt_path` is outside `ckpt_root` in config	Move the checkpoint under `ckpt_root` or update config.
`↻ rank N failed — queuing retry`	TCP connect failed to a worker	Transient — retried automatically up to 6 times. Check worker is running if it persists.
`✗ rank N permanently failed`	Worker unreachable after 6 retries	Check worker logs: `ssh pi4-N 'tmux attach -t syncps_worker_N'`
`ERROR: N/M shards failed — skipping merge`	Gather could not retrieve all shards	At least one worker and its replica are both down. Restore a worker or recover from backup.