1

Server prerequisites (macOS)

Install Homebrew tools on your Mac mini (or any Apple Silicon machine):

# Homebrew (if not installed)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Required tools
brew install yq       # YAML parser used by bootstrap.sh and launch.sh
brew install uv       # Python package manager (needed before bootstrap)
💡 uv, tmux, node_exporter, and the Python venv are installed on every Pi automatically by bootstrap.sh in step 4 - no need to install them manually on the workers.
2

Pi prerequisites

Follow the Raspberry Pi cluster setup guide to get your Pis networked. The only things needed on each Pi before bootstrap are git and curl (usually pre-installed on Raspberry Pi OS):

sudo apt update && sudo apt install -y git curl

Set up SSH key auth (server to each Pi)

Generate a key on the server (skip if you already have one) and copy it to each Pi:

# On the server - generate key if needed
ssh-keygen -t ed25519 -C "smoltorrent"

# Copy to each Pi (repeat for each)
ssh-copy-id <pi-user>@<pi-ip>

Add an alias for each Pi in ~/.ssh/config on the server. The alias must exactly match the host field you'll set in configs/config.yaml:

# Append to ~/.ssh/config on the server
Host pi4-1
    HostName <pi-ip>
    User <pi-user>
    IdentityFile ~/.ssh/id_ed25519
    IdentitiesOnly yes

Host pi4-2
    HostName <pi-ip>
    User <pi-user>
    IdentityFile ~/.ssh/id_ed25519
    IdentitiesOnly yes

# ... one block per Pi
⚠️ launch.sh uses the host value from config.yaml as a literal SSH target - it must exactly match the Host alias above. If they don't match, rsync and remote starts will fail.
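
With several Pis, writing these blocks by hand gets repetitive. The following is a minimal sketch that prints one block per Pi in the same shape as the example above. The function name ssh_host_block and the sample IP and user are illustrative assumptions, not part of smoltorrent:

```shell
# Hypothetical helper (not shipped with smoltorrent): print one ~/.ssh/config
# block for a Pi. Arguments: alias, IP address, SSH user.
ssh_host_block() {
  printf 'Host %s\n' "$1"
  printf '    HostName %s\n' "$2"
  printf '    User %s\n' "$3"
  printf '    IdentityFile ~/.ssh/id_ed25519\n'
  printf '    IdentitiesOnly yes\n\n'
}

# Example (review the output before appending it):
#   ssh_host_block pi4-1 192.168.1.7 pi >> ~/.ssh/config
```

Whatever alias you pass as the first argument is the value that has to appear in the host field of configs/config.yaml.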

Verify SSH access from the server (should connect without a password prompt):

ssh pi4-1   # or whatever alias you chose
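
To check every alias in one pass without risking an interactive password prompt, a sketch along these lines may help. BatchMode and ConnectTimeout are standard OpenSSH options. The function name and the SSH_BIN override are assumptions added here for illustration and testing:

```shell
# Sketch: confirm key-based login works for every alias without prompting.
# BatchMode=yes makes ssh fail instead of asking for a password.
# SSH_BIN is a hypothetical override (defaults to ssh), handy for dry runs.
check_ssh_aliases() {
  local failed host
  failed=0
  for host in "$@"; do
    if "${SSH_BIN:-ssh}" -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
      echo "ok   $host"
    else
      echo "FAIL $host"
      failed=1
    fi
  done
  return "$failed"
}

# Example:
#   check_ssh_aliases pi4-1 pi4-2 pi4-3 pi4-4
```

Any alias reported as FAIL would also break rsync and remote starts, since launch.sh uses the same aliases as SSH targets.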
3

Clone & configure

Clone the repo on the server and install Python dependencies:

git clone https://github.com/YuvrajSingh-mist/smoltorrent
cd smoltorrent
uv sync

Add grove to your PATH so you can call it from anywhere:

echo 'export PATH="$HOME/smoltorrent/.venv/bin:$PATH"' >> ~/.zshrc
source ~/.zshrc
grove --help   # should print grove CLI usage
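
Running the echo line more than once appends a duplicate PATH entry to ~/.zshrc each time. If you expect to re-run the setup, a small guard like the one below avoids that. This is a sketch, and append_once is a hypothetical helper, not part of grove:

```shell
# Sketch: append a line to a file only if that exact line is not already there.
# append_once is a hypothetical helper, not part of grove or smoltorrent.
append_once() {
  local line file
  line=$1
  file=$2
  touch "$file"
  # -x matches whole lines only, -F treats the line as a literal string
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example:
#   append_once 'export PATH="$HOME/smoltorrent/.venv/bin:$PATH"' ~/.zshrc
```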

Create your checkpoint root directory (the watcher monitors this folder):

⚠️ macOS TCC restriction: LaunchDaemons (running as root) cannot access ~/Desktop, ~/Documents, or ~/Downloads; macOS silently blocks them. Keep ckpt_root under ~/smolcluster/ (or any other folder in your home directory that is not TCC-protected). The same applies to the repo itself: clone it to ~/smoltorrent/, not to the Desktop.
mkdir -p ~/smolcluster/checkpoints
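
Because the TCC block is silent, it can be worth checking a candidate path before putting it in the config. The sketch below only compares path prefixes against the three folders named above. It does not query macOS, and is_tcc_protected is a hypothetical helper name:

```shell
# Sketch: flag paths inside the three home folders named in the TCC note.
# Pure prefix comparison; pass an already-expanded absolute path.
is_tcc_protected() {
  case "$1" in
    "$HOME/Desktop"|"$HOME/Desktop"/*|"$HOME/Documents"|"$HOME/Documents"/*|"$HOME/Downloads"|"$HOME/Downloads"/*)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

# Example:
#   is_tcc_protected "$HOME/smolcluster/checkpoints" && echo "move ckpt_root elsewhere"
```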

Edit configs/config.yaml

ℹ️ Find your server's LAN IP: ipconfig getifaddr en0 (Ethernet) or en1 (Wi-Fi). Find each Pi's IP from your router's DHCP table, or run arp -a on the server.

Below is the complete working config for a 4-Pi setup. Replace IPs, paths, and SSH aliases with yours:

# ── Paths ─────────────────────────────────────────────────────────────────────
ckpt_root: ~/smolcluster/checkpoints                        # watcher monitors this dir; must exist (mkdir -p above)
data_path: ~/smolcluster/checkpoints/model.safetensors      # used by legacy direct-load scripts only; safe to leave as-is
save_path: ~/smolcluster/received_model/model.safetensors   # gather destination for legacy scripts; safe to leave as-is

# ── Cluster topology ───────────────────────────────────────────────────────────
num_workers: 4   # number of worker nodes
n_chunks: 4      # MUST equal num_workers - each chunk goes to one worker

devices_config:
  master:
    - host: localhost
      ip: 192.168.1.10   # your server's LAN IP (ipconfig getifaddr en0)
      rank: 0
      port: 5000         # internal coord port - not the API port (8000)
  workers:
    - host: pi4-1        # MUST exactly match Host alias in ~/.ssh/config
      ip: 192.168.1.7
      rank: 1            # ranks are 1-based; must be unique and sequential
      port: 5001         # TCP port for shard transfer (NOT Prometheus metrics)
    - host: pi4-2
      ip: 192.168.1.5
      rank: 2
      port: 5002
    - host: pi4-3
      ip: 192.168.1.3
      rank: 3
      port: 5003
    - host: pi4-4
      ip: 192.168.1.6
      rank: 4
      port: 5004
⚠️ n_chunks must equal num_workers - each chunk is assigned to exactly one worker. If they differ, sharding silently under/over-assigns and gather will fail.
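
Since a mismatch fails silently at sharding time, a pre-flight check can catch it earlier. The sketch below encodes the rules stated for the config (n_chunks equals num_workers, ranks are 1-based, unique, and sequential). validate_topology is a hypothetical helper and is not part of bootstrap.sh or launch.sh. The yq calls in the comment assume the v4 syntax of the yq that Homebrew installs:

```shell
# Sketch: check the topology rules before launching.
# Arguments: num_workers, n_chunks, then every worker rank.
validate_topology() {
  local num_workers n_chunks expected rank
  num_workers=$1
  n_chunks=$2
  shift 2
  if [ "$n_chunks" -ne "$num_workers" ]; then
    echo "n_chunks ($n_chunks) must equal num_workers ($num_workers)"
    return 1
  fi
  if [ "$#" -ne "$num_workers" ]; then
    echo "num_workers is $num_workers but $# worker entries were given"
    return 1
  fi
  # After a numeric sort, valid ranks read exactly 1, 2, ..., num_workers.
  expected=1
  for rank in $(printf '%s\n' "$@" | sort -n); do
    if [ "$rank" -ne "$expected" ]; then
      echo "worker ranks must be 1..$num_workers with no gaps or duplicates"
      return 1
    fi
    expected=$((expected + 1))
  done
  echo "topology ok"
}

# Example (run from the repo root):
#   validate_topology \
#     "$(yq '.num_workers' configs/config.yaml)" \
#     "$(yq '.n_chunks' configs/config.yaml)" \
#     $(yq '.devices_config.workers[].rank' configs/config.yaml)
```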

Port map - what uses what

| Port | Where | What |
|---|---|---|
| 8000 | Server | FastAPI - /store-shard, /gather-shards, /discover |
| 8001 | Server | Watcher Prometheus metrics |
| 5001–5004 | Each Pi | Worker TCP listener - raw shard transfer (not HTTP) |
| 9201–9204 | Each Pi | Worker Prometheus metrics (rank+9200) |
| 9100 | All nodes | node_exporter - system stats for Grafana |
| 9101 | All nodes | boot_exporter - last boot timestamp |
| 3000 | Server | Grafana UI |
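
As a quick illustration of how the per-worker ports relate to rank, here is a small sketch. The metrics rule (rank+9200) comes from the table. The 5000+rank rule for the transfer port is only inferred from the sample config (5001 to 5004 for ranks 1 to 4), and the port field in config.yaml remains authoritative:

```shell
# Sketch: derive a worker's ports from its rank.
# metrics = 9200 + rank is stated in the port map;
# transfer = 5000 + rank is an assumption read off the sample config.
worker_ports() {
  local rank
  rank=$1
  echo "rank=$rank transfer=$((5000 + rank)) metrics=$((9200 + rank))"
}

# Example:
#   worker_ports 3
```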
4

Bootstrap all nodes

One command rsyncs the codebase to every Pi and installs all dependencies - uv, tmux, node_exporter, the Python venv, zeroconf, and the boot_exporter systemd service. Run this once before your first launch, or again after adding a new worker.

bash scripts/bootstrap.sh
Once bootstrap completes, every node is fully ready. You can go straight to grove start / grove join (no-SSH path below), or proceed to step 5 to launch via SSH.

Useful flags

| Flag | What it does |
|---|---|
| --workers 1,3 | Bootstrap only the specified worker ranks |
| --dry-run | Print what would run without executing anything |
5

Launch the cluster

launch.sh rsyncs the latest code to every Pi (no dependency install, bootstrap already handled that), kills stale sessions, and starts everything in tmux:

bash scripts/launch.sh
Port warning: launch.sh and grove start forcibly free ports before starting. On the coordinator: 8000 (API) and 8001 (watcher metrics). On each worker Pi: 9200+rank (Prometheus metrics, e.g. 9201–9204). Any process already holding those ports will be killed.

This starts:

  • syncps_api - FastAPI server on the master (port 8000)
  • syncps_watcher - Watcher daemon on the master
  • syncps_worker_N - TCP worker on each Pi (port 5001+)
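
Because anything holding the ports listed in the port warning gets killed, you may want to see what is listening there first. The sketch below wraps lsof, using its standard -nP, -iTCP, and -sTCP:LISTEN options. report_port_holders is a hypothetical helper, not something launch.sh provides:

```shell
# Sketch: list which process (if any) is listening on each given TCP port.
# Skips the lsof header row and prints "command (pid N)" for each listener.
report_port_holders() {
  local port holders
  for port in "$@"; do
    holders=$(lsof -nP -iTCP:"$port" -sTCP:LISTEN 2>/dev/null |
      awk 'NR > 1 {print $1 " (pid " $2 ")"}')
    if [ -n "$holders" ]; then
      echo "port $port in use by: $holders"
    else
      echo "port $port is free"
    fi
  done
}

# Example on the coordinator:
#   report_port_holders 8000 8001
# Example on the rank-2 worker Pi:
#   report_port_holders 9202
```

Without sudo, lsof only reports processes owned by the current user, so a port can show as free while a root-owned process still holds it.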

Verify the cluster is up

# Should return JSON with all 4 workers listed
curl http://localhost:8000/discover

# Should print a process_start_time metric line (confirms the API is serving metrics)
curl -s http://localhost:8000/metrics/ | grep process_start_time

# Check tmux sessions exist
tmux ls   # on the server expect syncps_api and syncps_watcher; syncps_worker_1..4 run on the Pis
Success looks like: /discover returns "workers": [...] with all 4 Pis listed. If a worker is missing, SSH into that Pi and check tmux attach -t syncps_worker_N for the error.
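
The API may take a few seconds to come up after launch, so a single curl can fail spuriously. A retry loop such as the sketch below waits until the endpoint answers. wait_for_http and its default attempt count and delay are assumptions chosen for illustration:

```shell
# Sketch: poll a URL until it returns a successful HTTP status.
# Arguments: url, max attempts (default 30), delay in seconds (default 2).
wait_for_http() {
  local url attempts delay i
  url=$1
  attempts=${2:-30}
  delay=${3:-2}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors; -s silences progress output
    if curl -fs --max-time 5 -o /dev/null "$url"; then
      echo "up after $i attempt(s): $url"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "gave up after $attempts attempts: $url" >&2
  return 1
}

# Example:
#   wait_for_http http://localhost:8000/discover && curl http://localhost:8000/discover
```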

Useful launch flags

| Flag | What it does |
|---|---|
| --dry-run | Print what would happen, no SSH or launches |
| --api-only | Heartbeat-check workers, start API only |
| --workers 1,3 | Launch only specific worker ranks |
| --ext .safetensors,.pth | Override file extensions the watcher monitors |

Watch logs

# API / watcher (on server)
tmux attach -t syncps_api
tmux attach -t syncps_watcher

# Worker logs (SSH into Pi, then attach)
ssh -t pi4-1 tmux attach -t syncps_worker_1

# All logs tailed locally
tail -f logging/cluster-logs/*.log

No-SSH alternative

grove start / join

Once bootstrap has run on every node (step 4), you can skip launch.sh entirely and use grove instead. No SSH config, no manual config.yaml editing - workers discover the master over mDNS and self-register, and the master writes config.yaml automatically. Good for testing and same-network setups.

⚠️ All nodes must be on the same LAN or VPN for mDNS multicast to reach them. For production runs over complex networks, use the SSH setup above.

Master: install once

brew install uv   # if not already installed
git clone https://github.com/YuvrajSingh-mist/smoltorrent
cd smoltorrent && uv sync

# Add grove to your PATH (one time)
echo 'export PATH="$HOME/smoltorrent/.venv/bin:$PATH"' >> ~/.zshrc
source ~/.zshrc

Each worker: install once (Pi or any Linux/macOS)

# Install uv first if not present
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env   # or open a new shell

git clone https://github.com/YuvrajSingh-mist/smoltorrent
cd smoltorrent && uv pip install -e .

Master: advertise and wait for workers

# Run from inside the smoltorrent directory
cd smoltorrent
grove start -n 4   # replace 4 with your worker count

The master advertises itself over mDNS and opens a registration server on port 5999. It will wait until all N workers have joined before launching.

Each worker: find the master and join

# Run from inside the smoltorrent directory on the worker
cd smoltorrent
grove join

A TUI lists all smoltorrent masters found on the network. Select yours and press Enter. The worker registers, receives its rank, and starts the TCP listener immediately - no manual config needed.

Once all N workers have joined:
  • Master writes configs/config.yaml with each worker's IP, rank, and port.
  • Master starts the FastAPI server (syncps_api) and watcher daemon (syncps_watcher) in tmux.
  • Cluster is ready - drop a .safetensors file into ckpt_root and the watcher shards it automatically.
ℹ️ Set ckpt_root in configs/config.yaml after the first join to point at your checkpoint directory. Everything else is managed automatically.
OPT

Pi worker auto-start

Install a systemd service on each Pi so workers restart automatically after a Pi reboot - without waiting for the server:

# All 4 workers
bash scripts/install_worker_service.sh

# Specific ranks only
bash scripts/install_worker_service.sh --workers 1,3

# Remove from all
bash scripts/install_worker_service.sh --uninstall
ℹ️ launch.sh also kills and re-launches workers via tmux on every run - systemd and tmux are independent. If both are running, systemd's process will fail to bind the port and retry after 5 s (harmless).
OPT

Server auto-start at boot

Register macOS LaunchDaemons so the entire cluster comes up after a server reboot - no manual intervention. Run from the smoltorrent directory:

sudo bash scripts/launch.sh --daemons

This registers three system daemons:

  • com.smoltorrent.startup - waits for network (pings first worker every 5 s), then runs launch.sh
  • com.node-exporter - keeps node_exporter running for Grafana system stats
  • com.smoltorrent.boot-exporter - exposes boot time metric on port 9101 for Grafana

Verify & check logs

# Check all three are registered
sudo launchctl print system/com.smoltorrent.startup
sudo launchctl print system/com.node-exporter
sudo launchctl print system/com.smoltorrent.boot-exporter

# Startup log (after reboot)
cat /tmp/smoltorrent-startup.log
cat /tmp/smoltorrent-boot-exporter.log
ℹ️ launchctl print may show last exit code = 1 and state = not running - this is expected. The daemon runs once at boot, launches everything into tmux, then exits. The cluster keeps running in tmux independently.

Remove all daemons

# Cluster startup daemon
sudo launchctl bootout system/com.smoltorrent.startup 2>/dev/null || true
sudo rm -f /Library/LaunchDaemons/com.smoltorrent.startup.plist
sudo rm -f /usr/local/bin/smoltorrent_startup.sh

# node_exporter
sudo launchctl bootout system/com.node-exporter 2>/dev/null || true
sudo rm -f /Library/LaunchDaemons/com.node-exporter.plist

# Boot time exporter
sudo launchctl bootout system/com.smoltorrent.boot-exporter 2>/dev/null || true
sudo rm -f /Library/LaunchDaemons/com.smoltorrent.boot-exporter.plist
OPT

Boot time exporter

utils/boot_exporter.py exposes smoltorrent_boot_time_ms on port 9101 - the metric that powers the Server Last Boot panel in Grafana. It reads the OS boot timestamp directly (sysctl kern.boottime on macOS, /proc/stat on Linux) so the value is always accurate.
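
To make the two boot-time sources concrete, the sketch below reads each one and normalises it to milliseconds. It is an illustration only: the function names are hypothetical, and utils/boot_exporter.py may extract the value differently:

```shell
# Sketch of the two boot-time sources named above, normalised to milliseconds.

# macOS: `sysctl -n kern.boottime` prints a line such as
#   { sec = 1700000000, usec = 250000 } Tue Nov 14 22:13:20 2023
boot_ms_from_sysctl() {
  local fields sec usec
  fields=$(printf '%s\n' "$1" |
    sed -n 's/^{ sec = \([0-9][0-9]*\), usec = \([0-9][0-9]*\) }.*/\1 \2/p')
  sec=${fields% *}
  usec=${fields#* }
  echo $((sec * 1000 + usec / 1000))
}

# Linux: /proc/stat contains a "btime <seconds since epoch>" line.
# The optional argument lets you point at another file with the same format.
boot_ms_from_proc_stat() {
  local btime
  btime=$(awk '$1 == "btime" {print $2}' "${1:-/proc/stat}")
  echo $((btime * 1000))
}

# Examples:
#   boot_ms_from_sysctl "$(sysctl -n kern.boottime)"   # macOS
#   boot_ms_from_proc_stat                              # Linux
```

Comparing the result with the smoltorrent_boot_time_ms value from the exporter is a quick way to confirm the metric is in milliseconds.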

Server (macOS) - registered automatically by --daemons (see "Server auto-start at boot" above). Verify:

curl http://localhost:9101/metrics | grep smoltorrent_boot_time_ms
tail -f /tmp/smoltorrent-boot-exporter.log

Pi workers (Linux) - deployed automatically by bootstrap.sh via SSH. To install manually on a Pi:

# SSH into the Pi first, then run - adjust REPO_DIR to where you cloned smoltorrent
REPO_DIR="$HOME/Desktop/smoltorrent"

sudo tee /etc/systemd/system/smoltorrent-boot-exporter.service <<EOF
[Unit]
Description=smoltorrent boot time exporter (port 9101)
After=network.target

[Service]
Type=simple
User=$(whoami)
WorkingDirectory=$REPO_DIR
ExecStart=$HOME/.local/bin/uv run $REPO_DIR/utils/boot_exporter.py
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now smoltorrent-boot-exporter

# Verify
systemctl is-active smoltorrent-boot-exporter
curl http://localhost:9101/metrics | grep smoltorrent_boot_time_ms

Disable / uninstall

Server (macOS):

sudo launchctl bootout system/com.smoltorrent.boot-exporter 2>/dev/null || true
sudo rm -f /Library/LaunchDaemons/com.smoltorrent.boot-exporter.plist

Pi workers (Linux):

sudo systemctl disable --now smoltorrent-boot-exporter
sudo rm -f /etc/systemd/system/smoltorrent-boot-exporter.service
sudo systemctl daemon-reload
ℹ️ Prometheus scrapes all 5 nodes on port 9101. The Grafana panel queries smoltorrent_boot_time_ms{node="master"} - no * 1000 conversion needed since the exporter already emits milliseconds.
OPT

Monitoring (Prometheus + Grafana + Loki)

No SSH required. Prometheus scrapes worker metrics directly over TCP on 9200+rank. Everything runs in Docker on the server only.

1. Install Docker (via colima on macOS)

brew install colima docker docker-compose
colima start   # must be running before any docker commands
ℹ️ After a Mac reboot colima does not auto-start. Run colima start again, or run bash scripts/launch_monitoring.sh --daemons once to register a LaunchDaemon that handles it automatically on every boot.

2. Configure Telegram alerts (one time)

Create a Telegram bot with @BotFather (/newbot), get your chat ID from @userinfobot, then fill in monitoring/.env:

# Copy the example and fill in your Telegram bot token + chat ID
cp monitoring/.env.example monitoring/.env

3. Start the stack

bash scripts/launch_monitoring.sh
The script runs preflight checks, starts colima if needed, brings up all containers, and waits until Prometheus, Loki, and Grafana are healthy. Grafana opens at http://localhost:3000 - login admin / smoltorrent.

4. Ship Pi worker logs to Loki (one time)

Once the cluster is up (grove start / grove join or launch.sh), run this once from the master. It reads configs/config.yaml, generates a per-Pi Promtail config with the correct IPs and ranks, and installs it as a systemd service on each Pi over SSH:

# All workers
bash scripts/launch_monitoring.sh --install-pi-promtail

# Specific ranks only
bash scripts/launch_monitoring.sh --install-pi-promtail --workers 1,3
ℹ️ Promtail is installed as a systemd service on each Pi - it auto-restarts on Pi reboot. After this step, Pi logs appear live in Grafana → Explore → Loki → {job="smoltorrent"}.

Other commands

# Stop (volumes and history preserved)
bash scripts/launch_monitoring.sh --down

# Auto-start stack on every Mac reboot (run once)
bash scripts/launch_monitoring.sh --daemons

Metrics endpoints

| Source | URL | What |
|---|---|---|
| Master API | <master-ip>:8000/metrics | FastAPI + transfer metrics |
| Pi worker N | <pi-ip>:920N/metrics | Per-worker shard/transfer metrics |
| All nodes | <node-ip>:9100/metrics | System stats (node_exporter) |
ℹ️ See monitoring/README.md for the full metrics reference, dashboard panel guide, and alert rules.

Requirements summary

| Dependency | Where | How |
|---|---|---|
| Python ≥ 3.13 | All nodes | Manual on Pis; already on macOS |
| uv | All nodes | brew install uv on the server (step 1); auto by bootstrap.sh on Pis |
| tmux ≥ 3.0 | All nodes | Auto by bootstrap.sh |
| yq | Server only | brew install yq |
| node_exporter | All nodes | Auto by bootstrap.sh |
| zeroconf | All nodes | Auto by bootstrap.sh (mDNS discovery) |
| Network (LAN/VPN) | All nodes | Nodes must reach each other over TCP |
| SSH key auth | Server → Pis | ssh-copy-id (step 2) |
| Docker + colima | Server only | Monitoring only - no SSH needed |