Step 1: Server prerequisites (macOS)

Install Homebrew tools on your Mac mini (or any Apple Silicon machine):

# Homebrew (if not installed)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Required tools
brew install yq       # YAML parser used by launch.sh
brew install uv       # Python package manager
💡 uv, tmux, and node_exporter are also installed automatically on each Pi by launch.sh — no need to install them manually on the workers.

Step 2: Pi prerequisites

Follow the Raspberry Pi cluster setup guide to get your Pis networked and SSH-accessible. Then on each Pi:

sudo apt update && sudo apt install -y python3.13 python3.13-venv curl git
⚠️ The Host alias in ~/.ssh/config must exactly match the host field in configs/config.yaml. launch.sh uses those values directly as SSH targets.
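
The alias requirement above can be checked before launching. The following is a minimal sketch, assuming the worker host names have already been read from configs/config.yaml. The function names `ssh_host_aliases` and `missing_ssh_aliases` are illustrative and not part of the repo, and the parser only handles literal `Host` entries (no wildcards, `Match` blocks, or `Include` files).

```python
def ssh_host_aliases(ssh_config_text: str) -> set[str]:
    """Collect literal Host aliases from the text of an SSH config file."""
    aliases: set[str] = set()
    for raw_line in ssh_config_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Keywords are case-insensitive and may be followed by "=" or spaces.
        parts = line.replace("=", " ", 1).split()
        if not parts or parts[0].lower() != "host":
            continue
        for pattern in parts[1:]:
            # Ignore wildcard and negated patterns; they are not usable aliases.
            if not any(ch in pattern for ch in "*?!"):
                aliases.add(pattern)
    return aliases


def missing_ssh_aliases(ssh_config_text: str, worker_hosts: list[str]) -> list[str]:
    """Return the worker hosts that have no matching Host alias."""
    known = ssh_host_aliases(ssh_config_text)
    return [host for host in worker_hosts if host not in known]


# Hypothetical SSH config and worker list, for illustration only.
sample_config = """
Host pi4-1
    HostName 192.0.2.11
Host pi4-2
    HostName 192.0.2.12
"""
print(missing_ssh_aliases(sample_config, ["pi4-1", "pi4-2", "pi4-3"]))
```

In practice the first argument would be the contents of ~/.ssh/config on the server, and any host returned needs an alias added (or its `host` field corrected) before running launch.sh.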

Verify SSH access from the server:

ssh pi4-1   # or whatever alias you chose

Step 3: Clone & configure

Clone the repo on the server and install Python dependencies:

git clone https://github.com/YuvrajSingh-mist/smoltorrent
cd smoltorrent
uv sync

Edit configs/config.yaml — set ckpt_root and each worker's host, ip, port, and rank:

ckpt_root: /Users/you/smolcluster/checkpoints
num_workers: 4
n_chunks: 4
devices_config:
  master:
    - host: localhost
      ip: <your-server-ip>
      rank: 0
      port: 5000
  workers:
    - host: pi4-1          # must match Host alias in ~/.ssh/config
      ip: <ip>
      rank: 1
      port: 5001
    - host: pi4-2
      ip: <ip>
      rank: 2
      port: 5002
    # ... one entry per worker
ℹ️ Don't know a Pi's IP yet? Run curl http://localhost:8000/discover after launching — mDNS discovery will find all workers on the network automatically.
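
The configuration rules in this step can be sanity-checked before launching. The sketch below is not part of the repo: `validate_cluster_config` is a hypothetical helper that works on the config after it has been loaded into a dictionary. It assumes that `num_workers` should equal the number of worker entries and that ranks must be unique; the IP addresses are documentation placeholders.

```python
REQUIRED_FIELDS = ("host", "ip", "port", "rank")


def validate_cluster_config(config: dict) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems: list[str] = []
    devices = config.get("devices_config", {})
    masters = devices.get("master", [])
    workers = devices.get("workers", [])

    # Assumption: num_workers should match the number of worker entries.
    if config.get("num_workers") != len(workers):
        problems.append(
            f"num_workers is {config.get('num_workers')} "
            f"but {len(workers)} workers are listed"
        )

    # Every node entry needs host, ip, port, and rank.
    for entry in masters + workers:
        missing = [field for field in REQUIRED_FIELDS if field not in entry]
        if missing:
            name = entry.get("host", "<unnamed>")
            problems.append(f"{name}: missing {', '.join(missing)}")

    # Assumption: each rank identifies exactly one node.
    ranks = [entry["rank"] for entry in masters + workers if "rank" in entry]
    duplicates = sorted({rank for rank in ranks if ranks.count(rank) > 1})
    if duplicates:
        problems.append(f"duplicate ranks: {duplicates}")
    return problems


example = {
    "ckpt_root": "/Users/you/smolcluster/checkpoints",
    "num_workers": 2,
    "n_chunks": 4,
    "devices_config": {
        "master": [
            {"host": "localhost", "ip": "192.0.2.10", "rank": 0, "port": 5000}
        ],
        "workers": [
            {"host": "pi4-1", "ip": "192.0.2.11", "rank": 1, "port": 5001},
            {"host": "pi4-2", "ip": "192.0.2.12", "rank": 2, "port": 5002},
        ],
    },
}
print(validate_cluster_config(example))  # prints []
```

Loading the YAML file itself is left out so the sketch needs only the standard library; any YAML parser that returns plain dictionaries and lists would fit in front of it.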

Step 4: Launch the cluster

One command rsyncs the codebase to every Pi, installs deps, and starts everything in tmux:

bash scripts/launch.sh

This starts:

  • syncps_api — FastAPI server on the master (port 8000)
  • syncps_watcher — Watcher daemon on the master
  • syncps_worker_N — TCP worker on each Pi (port 5001+)
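
As a rough picture of what should exist after a launch, the list above can be expanded per worker. This is a sketch only: `expected_processes` is a hypothetical helper, and the rule "worker port = 5000 + rank" is an assumption inferred from the sample config (rank 1 on 5001, rank 2 on 5002), not something launch.sh is confirmed to enforce.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Process:
    session: str          # tmux session name
    node: str             # where the session runs
    port: Optional[int]   # listening port, if any


def expected_processes(worker_hosts: dict[int, str], base_port: int = 5000) -> list[Process]:
    """List the tmux sessions expected after a launch, keyed by worker rank."""
    processes = [
        Process("syncps_api", "master", 8000),
        Process("syncps_watcher", "master", None),
    ]
    for rank in sorted(worker_hosts):
        # Assumption: each worker listens on base_port + rank.
        processes.append(
            Process(f"syncps_worker_{rank}", worker_hosts[rank], base_port + rank)
        )
    return processes


for process in expected_processes({1: "pi4-1", 2: "pi4-2"}):
    print(process.session, process.node, process.port)
```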

Useful launch flags

| Flag | What it does |
| --- | --- |
| --dry-run | Print what would happen, no SSH or launches |
| --api-only | Heartbeat-check workers, start API only |
| --workers 1,3 | Launch only specific worker ranks |
| --ext .safetensors,.pth | Override file extensions the watcher monitors |
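
When launch.sh is driven from a script, the flags above can be assembled programmatically. The sketch below uses only the flags listed in the table; `build_launch_command` is a hypothetical wrapper, and whether the flags can be freely combined is an assumption that the source does not confirm.

```python
import shlex
from typing import Optional, Sequence


def build_launch_command(
    dry_run: bool = False,
    api_only: bool = False,
    workers: Optional[Sequence[int]] = None,
    extensions: Optional[Sequence[str]] = None,
) -> list[str]:
    """Build the argument list for scripts/launch.sh from keyword options."""
    command = ["bash", "scripts/launch.sh"]
    if dry_run:
        command.append("--dry-run")
    if api_only:
        command.append("--api-only")
    if workers:
        # Ranks are passed as one comma-separated value, e.g. "1,3".
        command += ["--workers", ",".join(str(rank) for rank in workers)]
    if extensions:
        command += ["--ext", ",".join(extensions)]
    return command


# Show the command without running it.
print(shlex.join(build_launch_command(dry_run=True, workers=[1, 3])))
```

Passing the returned list to `subprocess.run(..., check=True)` from the repo root would execute it; starting with `dry_run=True` is a low-risk way to confirm the arguments first.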

Watch logs

# API / watcher
tmux attach -t syncps_api
tmux attach -t syncps_watcher

# Worker (attach over SSH; -t allocates the terminal that tmux needs)
ssh -t pi4-1 tmux attach -t syncps_worker_1

# All logs
tail -f logging/cluster-logs/*.log

Optional: Pi worker auto-start

Install a systemd service on each Pi so workers restart automatically after a Pi reboot — without waiting for the server:

# All 4 workers
bash scripts/install_worker_service.sh

# Specific ranks only
bash scripts/install_worker_service.sh --workers 1,3

# Remove from all
bash scripts/install_worker_service.sh --uninstall
ℹ️ launch.sh also kills and re-launches workers via tmux on every run — systemd and tmux are independent. If both are running, systemd's process will fail to bind the port and retry after 5 s (harmless).
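
The bind failure mentioned above follows from a general rule: only one process can listen on a given TCP port at a time. The sketch below illustrates that rule with a hypothetical `port_is_free` helper; it is not the worker's actual code, and it does not reproduce the systemd retry behaviour.

```python
import socket


def port_is_free(host: str, port: int) -> bool:
    """Return True if a TCP socket can currently bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False  # another process already holds the port
    return True


# Demonstration on an ephemeral local port chosen by the OS.
holder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
holder.bind(("127.0.0.1", 0))
holder.listen()
held_port = holder.getsockname()[1]

print(port_is_free("127.0.0.1", held_port))  # False: holder owns the port
holder.close()
print(port_is_free("127.0.0.1", held_port))  # True: the port was released
```

Run on a Pi with the worker's configured port, a `False` result would suggest that one of the two launch mechanisms already has a worker listening there.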

Optional: Server auto-start at boot

Register macOS LaunchDaemons so the entire cluster comes up after a server reboot — no manual intervention:

bash scripts/launch.sh --daemons

This registers two system daemons:

  • com.smoltorrent.startup — waits for network (pings first worker every 5 s), then runs launch.sh
  • com.node-exporter — keeps node_exporter running for Grafana system stats
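
The "wait for network" behaviour described above is a simple poll-and-retry loop. The sketch below shows the general pattern in Python; it is not the contents of the startup script, and `wait_until_reachable` and `ping_once` are hypothetical names. The 5-second interval comes from the description above, while the optional attempt limit is an addition for illustration.

```python
import subprocess
import time
from typing import Callable, Optional


def ping_once(host: str) -> bool:
    """Send a single ICMP echo request; True if the host answered."""
    result = subprocess.run(
        ["ping", "-c", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def wait_until_reachable(
    probe: Callable[[], bool],
    interval_s: float = 5.0,
    max_attempts: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call probe() until it returns True; return the number of attempts used."""
    attempts = 0
    while True:
        attempts += 1
        if probe():
            return attempts
        if max_attempts is not None and attempts >= max_attempts:
            raise TimeoutError(f"not reachable after {attempts} attempts")
        sleep(interval_s)  # wait before the next probe


# Example (not executed here): block until the first worker answers a ping.
# wait_until_reachable(lambda: ping_once("pi4-1"))
```

Injecting `probe` and `sleep` keeps the loop testable without a network or real delays.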

Verify & check logs

# Check both are registered
sudo launchctl print system/com.smoltorrent.startup
sudo launchctl print system/com.node-exporter

# Startup log (after reboot)
cat /tmp/smoltorrent-startup.log
ℹ️ launchctl print may show last exit code = 1 and state = not running — this is expected. The daemon runs once at boot, launches everything into tmux, then exits. The cluster keeps running in tmux independently.

Remove / uninstall

sudo launchctl bootout system/com.smoltorrent.startup 2>/dev/null || true
sudo rm -f /Library/LaunchDaemons/com.smoltorrent.startup.plist
sudo rm -f /usr/local/bin/smoltorrent_startup.sh

sudo launchctl bootout system/com.node-exporter 2>/dev/null || true
sudo rm -f /Library/LaunchDaemons/com.node-exporter.plist

Optional: Monitoring (Prometheus + Grafana + Loki)

No SSH required. Prometheus scrapes each worker's metrics directly on TCP port 9200 + rank (for example, 9201 for rank 1). The whole monitoring stack runs in Docker on the server only.

1. Install Docker (via colima on macOS)

brew install colima docker docker-compose
colima start

2. Configure & start the stack

# Copy and fill in credentials
cp monitoring/.env.example monitoring/.env
# Edit monitoring/.env — set Gmail app password for alert emails

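# Start Prometheus + Grafana + Loki
# (if "docker compose" is not recognised, the Homebrew formula also installs the standalone docker-compose command)
cd monitoring && docker compose up -d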

# Grafana → http://<master-ip>:3000  (admin / smoltorrent)

Metrics endpoints

| Source | URL | Metrics exposed |
| --- | --- | --- |
| Master API | <master-ip>:8000/metrics | FastAPI + transfer metrics |
| Pi worker N | <pi-ip>:920N/metrics | Per-worker shard/transfer metrics |
| All nodes | <node-ip>:9100/metrics | System stats (node_exporter) |

ℹ️ Logs are shipped from all nodes via Promtail → Loki. See monitoring/README.md in the repo for the full metrics reference and dashboard panel guide.
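
The endpoint table above can be expanded into a concrete list of URLs for a given cluster. This is a sketch under stated assumptions: `metrics_urls` is a hypothetical helper, the `http://` scheme is assumed (the table lists only host, port, and path), and the IP addresses are documentation placeholders.

```python
def metrics_urls(master_ip: str, worker_ips: dict[int, str]) -> dict[str, str]:
    """Map a label for each scrape target to its metrics URL."""
    urls = {
        "master-api": f"http://{master_ip}:8000/metrics",
        "master-node": f"http://{master_ip}:9100/metrics",
    }
    for rank, ip in sorted(worker_ips.items()):
        # Worker metrics are served on port 9200 + rank.
        urls[f"worker-{rank}"] = f"http://{ip}:{9200 + rank}/metrics"
        # node_exporter uses the same port on every node.
        urls[f"worker-{rank}-node"] = f"http://{ip}:9100/metrics"
    return urls


for label, url in metrics_urls("192.0.2.10", {1: "192.0.2.11", 2: "192.0.2.12"}).items():
    print(f"{label:16} {url}")
```

Opening one of these URLs in a browser or with curl is a quick way to confirm that a target is up before checking Grafana.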

Requirements summary

| Dependency | Where | How |
| --- | --- | --- |
| Python ≥ 3.13 | All nodes | Manual on Pis; already on macOS |
| uv | All nodes | Auto by launch.sh |
| tmux ≥ 3.0 | All nodes | Auto by launch.sh |
| yq | Server only | brew install yq |
| node_exporter | All nodes | Auto by launch.sh |
| zeroconf | All nodes | Auto by launch.sh (mDNS discovery) |
| Network (LAN/VPN) | All nodes | Nodes must reach each other over TCP |
| SSH key auth | Server → Pis | ssh-copy-id (step 2) |
| Docker + colima | Server only | Monitoring only; no SSH needed |
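
The minimum versions in the table can be checked from the version strings that the tools print. The sketch below is a hypothetical helper, not part of the repo. It assumes the strings come from commands such as `tmux -V` and `python3 --version`, and it compares only the leading dotted number, so a suffix like the "a" in "3.3a" is ignored.

```python
import re


def parse_version(text: str) -> tuple[int, ...]:
    """Extract the first dotted number from a string such as 'tmux 3.3a'."""
    match = re.search(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(version_text: str, minimum: tuple[int, ...]) -> bool:
    """Return True if the version in version_text is at least `minimum`."""
    found = parse_version(version_text)
    # Pad short versions with zeros so that "3" compares equal to "3.0".
    padded = found + (0,) * max(0, len(minimum) - len(found))
    return padded >= minimum


print(meets_minimum("tmux 3.3a", (3, 0)))        # True
print(meets_minimum("Python 3.11.2", (3, 13)))   # False
```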