smoltorrent — Setup Guide

1. Server prerequisites (macOS)

Install Homebrew tools on your Mac mini (or any Apple Silicon machine):

# Homebrew (if not installed)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Required tools
brew install yq       # YAML parser used by launch.sh
brew install uv       # Python package manager

💡 uv, tmux, and node_exporter are also installed automatically on each Pi by launch.sh — no need to install them manually on the workers.
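To confirm the server-side tools are actually on your PATH before moving on, a small check like the one below can help. This is a minimal sketch; the function name missing_tools is hypothetical and not part of the smoltorrent repo.

```shell
# Hypothetical helper (not part of smoltorrent): print each named
# command that cannot be found on PATH.
missing_tools() {
  for mt_tool in "$@"; do
    command -v "$mt_tool" >/dev/null 2>&1 || echo "$mt_tool"
  done
}
```

Calling `missing_tools brew yq uv` prints nothing when all three are installed, and otherwise prints one missing tool name per line.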

2. Pi prerequisites

Follow the Raspberry Pi cluster setup guide to get your Pis networked and SSH-accessible. Then on each Pi:

sudo apt update && sudo apt install -y python3.13 python3.13-venv curl git

⚠️ The Host alias in ~/.ssh/config must exactly match the host field in configs/config.yaml. launch.sh uses those values directly as SSH targets.

Verify SSH access from the server:

ssh pi4-1   # or whatever alias you chose
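With several Pis, checking each alias by hand gets tedious. The sketch below loops over the aliases you pass in and reports which ones accept a non-interactive login. The function name check_ssh is hypothetical; BatchMode and ConnectTimeout are standard OpenSSH options, and BatchMode makes ssh fail instead of prompting for a password, so a failure here usually means key auth is not set up for that host.

```shell
# Hypothetical helper (not part of smoltorrent): try a non-interactive
# SSH login to each alias and report the result per host.
# Returns non-zero if any host could not be reached.
check_ssh() {
  cs_failed=0
  for cs_host in "$@"; do
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$cs_host" true 2>/dev/null; then
      echo "ok   $cs_host"
    else
      echo "FAIL $cs_host"
      cs_failed=1
    fi
  done
  return "$cs_failed"
}
```

For example, `check_ssh pi4-1 pi4-2` prints one status line per alias.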

3. Clone & configure

Clone the repo on the server and install Python dependencies:

git clone https://github.com/YuvrajSingh-mist/smoltorrent
cd smoltorrent
uv sync

Edit configs/config.yaml — set ckpt_root and each worker's host, ip, port, and rank:

ckpt_root: /Users/you/smolcluster/checkpoints
num_workers: 4
n_chunks: 4
devices_config:
  master:
    - host: localhost
      ip: <your-server-ip>
      rank: 0
      port: 5000
  workers:
    - host: pi4-1   # must match Host alias in ~/.ssh/config
      ip: <ip>
      rank: 1
      port: 5001
    - host: pi4-2
      ip: <ip>
      rank: 2
      port: 5002
    # ... one entry per worker
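Because launch.sh uses each host value directly as an SSH target, a mismatch between config.yaml and ~/.ssh/config is an easy mistake to make. The sketch below reads host names from stdin and prints those with no exactly matching Host entry in an ssh config file. The function name missing_ssh_aliases is hypothetical, and the check is deliberately simple: it compares literal alias names only and does not expand wildcard patterns or Include directives.

```shell
# Hypothetical helper (not part of smoltorrent): read host names from
# stdin, print each one that has no literal match on a "Host" line of
# the given ssh config file (default: ~/.ssh/config).
missing_ssh_aliases() {
  msa_config="${1:-$HOME/.ssh/config}"
  while IFS= read -r msa_host; do
    [ -n "$msa_host" ] || continue
    # A Host line may list several aliases; compare each one exactly.
    if ! awk -v want="$msa_host" '
      tolower($1) == "host" { for (i = 2; i <= NF; i++) if ($i == want) found = 1 }
      END { exit (found ? 0 : 1) }
    ' "$msa_config"; then
      echo "$msa_host"
    fi
  done
}
```

Assuming the yq installed by Homebrew (v4 syntax) and the config layout shown above, the worker hosts could be fed in with `yq '.devices_config.workers[].host' configs/config.yaml | missing_ssh_aliases`. No output means every worker host has a matching alias.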

ℹ️ Don't know a Pi's IP yet? Run curl http://localhost:8000/discover after launching — mDNS discovery will find all workers on the network automatically.

4. Launch the cluster

One command rsyncs the codebase to every Pi, installs deps, and starts everything in tmux:

bash scripts/launch.sh

This starts:

syncps_api — FastAPI server on the master (port 8000)
syncps_watcher — Watcher daemon on the master
syncps_worker_N — TCP worker on each Pi (port 5001+)
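To check that the sessions listed above actually came up, you can ask tmux directly. The sketch below uses the standard `tmux has-session` command; the function name missing_sessions is hypothetical.

```shell
# Hypothetical helper (not part of smoltorrent): print each named tmux
# session that does not exist on this machine.
missing_sessions() {
  for ms_name in "$@"; do
    tmux has-session -t "$ms_name" 2>/dev/null || echo "$ms_name"
  done
}
```

On the master, `missing_sessions syncps_api syncps_watcher` prints nothing when both sessions are running. Worker sessions live on the Pis, so check those over SSH, for example `ssh pi4-1 tmux has-session -t syncps_worker_1`.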

Useful launch flags

Flag	What it does
`--dry-run`	Print what would happen, no SSH or launches
`--api-only`	Heartbeat-check workers, start API only
`--workers 1,3`	Launch only specific worker ranks
`--ext .safetensors,.pth`	Override file extensions the watcher monitors

Watch logs

# API / watcher
tmux attach -t syncps_api
tmux attach -t syncps_watcher

# Worker (attach over SSH; -t allocates the terminal tmux needs)
ssh -t pi4-1 tmux attach -t syncps_worker_1

# All logs
tail -f logging/cluster-logs/*.log

Pi worker auto-start (optional)

Install a systemd service on each Pi so workers restart automatically after a Pi reboot — without waiting for the server:

# All 4 workers
bash scripts/install_worker_service.sh

# Specific ranks only
bash scripts/install_worker_service.sh --workers 1,3

# Remove from all
bash scripts/install_worker_service.sh --uninstall

ℹ️ launch.sh also kills and re-launches workers via tmux on every run — systemd and tmux are independent. If both are running, systemd's process will fail to bind the port and retry after 5 s (harmless).

Server auto-start at boot (optional)

Register macOS LaunchDaemons so the entire cluster comes up after a server reboot — no manual intervention:

bash scripts/launch.sh --daemons

This registers two system daemons:

com.smoltorrent.startup — waits for network (pings first worker every 5 s), then runs launch.sh
com.node-exporter — keeps node_exporter running for Grafana system stats
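The wait-for-network behavior described for com.smoltorrent.startup can be pictured as a simple retry loop. The sketch below is illustrative only and is not the script that launch.sh installs; the function name wait_for_host and the retry limit are assumptions added here so the loop cannot run forever.

```shell
# Illustrative sketch (not the installed startup script): ping a host
# every <interval> seconds until it answers or <max_tries> is reached.
# Usage: wait_for_host <host> [interval_seconds] [max_tries]
wait_for_host() {
  wfh_host="$1"
  wfh_interval="${2:-5}"
  wfh_max="${3:-60}"
  wfh_try=0
  until ping -c 1 "$wfh_host" >/dev/null 2>&1; do
    wfh_try=$((wfh_try + 1))
    if [ "$wfh_try" -ge "$wfh_max" ]; then
      echo "gave up waiting for $wfh_host" >&2
      return 1
    fi
    sleep "$wfh_interval"
  done
}
```

A boot script built on this idea might run `wait_for_host pi4-1 5 && bash scripts/launch.sh`, so the launch only starts once the first worker is reachable.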

Verify & check logs

# Check both are registered
sudo launchctl print system/com.smoltorrent.startup
sudo launchctl print system/com.node-exporter

# Startup log (after reboot)
cat /tmp/smoltorrent-startup.log

ℹ️ launchctl print may show last exit code = 1 and state = not running — this is expected. The daemon runs once at boot, launches everything into tmux, then exits. The cluster keeps running in tmux independently.

Remove / uninstall

sudo launchctl bootout system/com.smoltorrent.startup 2>/dev/null || true
sudo rm -f /Library/LaunchDaemons/com.smoltorrent.startup.plist
sudo rm -f /usr/local/bin/smoltorrent_startup.sh

sudo launchctl bootout system/com.node-exporter 2>/dev/null || true
sudo rm -f /Library/LaunchDaemons/com.node-exporter.plist

Monitoring (Prometheus + Grafana + Loki) (optional)

✅ No SSH required. Prometheus scrapes worker metrics directly over TCP on 9200+rank. Everything runs in Docker on the server only.

1. Install Docker (via colima on macOS)

brew install colima docker docker-compose
colima start

2. Configure & start the stack

# Copy and fill in credentials
cp monitoring/.env.example monitoring/.env
# Edit monitoring/.env — set Gmail app password for alert emails

# Start Prometheus + Grafana + Loki
cd monitoring && docker compose up -d

# Grafana → http://<master-ip>:3000  (admin / smoltorrent)

Metrics endpoints

Source	URL	What
Master API	`<master-ip>:8000/metrics`	FastAPI + transfer metrics
Pi worker N	`<pi-ip>:920N/metrics`	Per-worker shard/transfer metrics
All nodes	`<node-ip>:9100/metrics`	System stats (node_exporter)
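Since each worker exposes metrics on port 9200 plus its rank, the URL for any worker can be derived from its IP and rank. The sketch below does that arithmetic; the function name worker_metrics_url is hypothetical, and the plain http:// scheme is an assumption.

```shell
# Hypothetical helper (not part of smoltorrent): build the metrics URL
# for a worker from its IP and rank (port = 9200 + rank).
# The http:// scheme is an assumption.
worker_metrics_url() {
  echo "http://$1:$((9200 + $2))/metrics"
}
```

For a quick manual check of worker 1, run `curl -s "$(worker_metrics_url <pi-ip> 1)" | head`, substituting that Pi's address.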

ℹ️ Logs are shipped from all nodes via Promtail → Loki. See monitoring/README.md in the repo for the full metrics reference and dashboard panel guide.

—

Requirements summary

Dependency	Where	How
Python ≥ 3.13	All nodes	Manual on Pis; already on macOS
uv	All nodes	Auto by launch.sh
tmux ≥ 3.0	All nodes	Auto by launch.sh
yq	Server only	brew install yq
node_exporter	All nodes	Auto by launch.sh
zeroconf	All nodes	Auto by launch.sh (mDNS discovery)
Network (LAN/VPN)	All nodes	Nodes must reach each other over TCP
SSH key auth	Server → Pis	ssh-copy-id (step 2)
Docker + colima	Server only	Monitoring only — no SSH needed