Server prerequisites (macOS)
Install Homebrew tools on your Mac mini (or any Apple Silicon machine):
# Homebrew (if not installed)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Required tools
brew install yq   # YAML parser used by bootstrap.sh and launch.sh
brew install uv   # Python package manager (needed before bootstrap)
uv, tmux, node_exporter, and the Python venv are installed on every Pi automatically by bootstrap.sh in step 4 - no need to install them manually on the workers.
Pi prerequisites
Follow the Raspberry Pi cluster setup guide to get your Pis networked. The only tools needed on each Pi before bootstrap are git and curl (usually pre-installed on Raspberry Pi OS):
sudo apt update && sudo apt install -y git curl
Set up SSH key auth (server to each Pi)
Generate a key on the server (skip if you already have one) and copy it to each Pi:
# On the server - generate key if needed
ssh-keygen -t ed25519 -C "smoltorrent"

# Copy to each Pi (repeat for each)
ssh-copy-id <pi-user>@<pi-ip>
Add an alias for each Pi in ~/.ssh/config on the server. The alias must exactly match the host field you'll set in configs/config.yaml:
# Append to ~/.ssh/config on the server
Host pi4-1
    HostName <pi-ip>
    User <pi-user>
    IdentityFile ~/.ssh/id_ed25519
    IdentitiesOnly yes

Host pi4-2
    HostName <pi-ip>
    User <pi-user>
    IdentityFile ~/.ssh/id_ed25519
    IdentitiesOnly yes

# ... one block per Pi
launch.sh uses the host value from config.yaml as a literal SSH target - it must exactly match the Host alias above. If they don't match, rsync and remote starts will fail.
Verify SSH access from the server (should connect without a password prompt):
ssh pi4-1 # or whatever alias you chose
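A mismatched alias only surfaces later as an rsync or remote-start failure, so it can help to check all aliases up front. The following is a minimal sketch and not part of smoltorrent: missing_ssh_aliases is a hypothetical helper that prints every alias with no literal Host entry in the given SSH config file. It does not expand wildcard patterns or follow Include directives.

```bash
# Print each alias that has no literal "Host" entry in an SSH config file.
# Usage: missing_ssh_aliases <ssh-config-path> <alias>...
# Hypothetical helper for illustration; not shipped with smoltorrent.
missing_ssh_aliases() {
    config_file=$1
    shift
    for alias_name in "$@"; do
        # A Host line may list several names; compare each one literally.
        if ! awk -v want="$alias_name" '
            tolower($1) == "host" {
                for (i = 2; i <= NF; i++) if ($i == want) found = 1
            }
            END { exit found ? 0 : 1 }
        ' "$config_file"; then
            printf '%s\n' "$alias_name"
        fi
    done
}
```

For example, missing_ssh_aliases ~/.ssh/config pi4-1 pi4-2 pi4-3 pi4-4 prints nothing when every alias is defined.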
Clone & configure
Clone the repo on the server and install Python dependencies:
git clone https://github.com/YuvrajSingh-mist/smoltorrent
cd smoltorrent
uv sync
Add grove to your PATH so you can call it from anywhere:
echo 'export PATH="$HOME/smoltorrent/.venv/bin:$PATH"' >> ~/.zshrc
source ~/.zshrc
grove --help # should print grove CLI usage
Create your checkpoint root directory (the watcher monitors this folder):
Do not put ckpt_root under ~/Desktop, ~/Documents, or ~/Downloads: macOS silently blocks them. Keep ckpt_root under ~/smolcluster/ (or any subfolder of your home directory that is NOT TCC-protected). The same applies to the repo itself: clone to ~/smoltorrent/, not the Desktop.
mkdir -p ~/smolcluster/checkpoints
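A quick guard against picking one of those folders by accident could look like the sketch below. is_tcc_protected is a hypothetical helper, and it only knows about the three folders named above; macOS may protect other locations too. Pass an already-expanded path (use $HOME, not a quoted ~).

```bash
# Succeeds (exit 0) when the path is, or sits under, ~/Desktop, ~/Documents, or ~/Downloads.
# Hypothetical helper; covers only the three folders this guide names.
is_tcc_protected() {
    case "$1" in
        "$HOME"/Desktop|"$HOME"/Desktop/*) return 0 ;;
        "$HOME"/Documents|"$HOME"/Documents/*) return 0 ;;
        "$HOME"/Downloads|"$HOME"/Downloads/*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Usage: is_tcc_protected "$HOME/smolcluster/checkpoints" && echo "pick another ckpt_root".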
Edit configs/config.yaml
To find the server's IP, run ipconfig getifaddr en0 (Ethernet) or en1 (Wi-Fi). Find each Pi's IP from your router's DHCP table, or run arp -a on the server.
Below is the complete working config for a 4-Pi setup. Replace IPs, paths, and SSH aliases with yours:
n_chunks must equal num_workers - each chunk is assigned to exactly one worker. If they differ, sharding silently under/over-assigns and gather will fail.
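Because the mismatch is silent until gather fails, a pre-flight comparison is cheap insurance. This is a sketch only: check_chunk_count is a hypothetical helper, and the commented yq call assumes n_chunks and num_workers are top-level keys in configs/config.yaml, which may not match the real layout.

```bash
# Fail loudly when the chunk count and worker count differ.
# Usage: check_chunk_count <n_chunks> <num_workers>
check_chunk_count() {
    n_chunks=$1
    num_workers=$2
    if [ "$n_chunks" -ne "$num_workers" ]; then
        echo "n_chunks ($n_chunks) != num_workers ($num_workers)" >&2
        return 1
    fi
}

# Example wiring (ASSUMES both values are top-level keys; adjust the paths to your config):
# check_chunk_count "$(yq '.n_chunks' configs/config.yaml)" \
#                   "$(yq '.num_workers' configs/config.yaml)"
```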
Port map - what uses what
| Port | Where | What |
|---|---|---|
| 8000 | Server | FastAPI - /store-shard, /gather-shards, /discover |
| 8001 | Server | Watcher Prometheus metrics |
| 5001–5004 | Each Pi | Worker TCP listener - raw shard transfer (not HTTP) |
| 9201–9204 | Each Pi | Worker Prometheus metrics (rank+9200) |
| 9100 | All nodes | node_exporter - system stats for Grafana |
| 9101 | All nodes | boot_exporter - last boot timestamp |
| 3000 | Server | Grafana UI |
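The per-worker ports in the table follow from the rank. The sketch below spells that out; the function names are hypothetical. The metrics offset (rank+9200) is stated in the table, while the TCP offset (rank+5000) is an assumption inferred from the 5001–5004 range for four workers.

```bash
# Per-worker port arithmetic. Rank is 1-based.
worker_tcp_port()     { echo $((5000 + $1)); }   # ASSUMPTION: inferred from the 5001-5004 range
worker_metrics_port() { echo $((9200 + $1)); }   # stated in the table: rank + 9200

# Print "rank tcp_port metrics_port" for ranks 1..N.
print_worker_ports() {
    rank=1
    while [ "$rank" -le "$1" ]; do
        echo "$rank $(worker_tcp_port "$rank") $(worker_metrics_port "$rank")"
        rank=$((rank + 1))
    done
}
```

Running print_worker_ports 4 lists the ports to allow through any firewall between the server and the Pis.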
Bootstrap all nodes
One command rsyncs the codebase to every Pi and installs all dependencies - uv, tmux, node_exporter, the Python venv, zeroconf, and the boot_exporter systemd service. Run this once before your first launch, or again after adding a new worker.
bash scripts/bootstrap.sh
Once bootstrap finishes, either use grove start / grove join (no-SSH path below), or proceed to step 5 to launch via SSH.
Useful flags
| Flag | What it does |
|---|---|
| --workers 1,3 | Bootstrap only the specified worker ranks |
| --dry-run | Print what would run without executing anything |
Launch the cluster
Rsyncs the latest code to every Pi (no dep install - bootstrap already handled that), kills stale sessions, and starts everything in tmux:
bash scripts/launch.sh
launch.sh and grove start forcibly free ports before starting. On the coordinator: 8000 (API) and 8001 (watcher metrics). On each worker Pi: 9200+rank (Prometheus metrics, e.g. 9201–9204). Any process already holding those ports will be killed.
This starts:
- syncps_api: FastAPI server on the master (port 8000)
- syncps_watcher: Watcher daemon on the master
- syncps_worker_N: TCP worker on each Pi (port 5001+)
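Since launch.sh and grove start kill whatever holds the ports listed in the note above, it may be worth looking at what is listening there before you launch. A minimal sketch, assuming lsof is available (it ships with macOS; a Pi may need it installed); ports_freed and show_listeners are hypothetical helper names:

```bash
# List the ports the launch step frees for a given role, per the note above.
# Usage: ports_freed coordinator   |   ports_freed worker <rank>
ports_freed() {
    case "$1" in
        coordinator) echo "8000 8001" ;;
        worker)      echo $((9200 + $2)) ;;
        *)           echo "usage: ports_freed coordinator|worker <rank>" >&2; return 1 ;;
    esac
}

# Show the process currently listening on each of those ports, if any.
show_listeners() {
    for port in $(ports_freed "$@"); do
        lsof -nP -iTCP:"$port" -sTCP:LISTEN || echo "port $port: no listener"
    done
}
```

Run show_listeners coordinator on the server, or show_listeners worker 2 on the rank-2 Pi. Listeners owned by other users may only show up when lsof runs under sudo.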
Verify the cluster is up
# Should return JSON with all 4 workers listed
curl http://localhost:8000/discover

# Should print the process_start_time metric behind "API Process Uptime" (confirms the API is exposing metrics)
curl -s http://localhost:8000/metrics/ | grep process_start_time

# Check tmux sessions exist
tmux ls   # expect syncps_api and syncps_watcher on the server; syncps_worker_1..4 on the Pis
/discover returns "workers": [...] with all 4 Pis listed. If a worker is missing, SSH into that Pi and check tmux attach -t syncps_worker_N for the error.
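The API can take a moment to come up after launch, so a single curl may fail even on a healthy cluster. One way to handle that is a small retry loop; wait_until below is a hypothetical helper, not something the repo provides. It only checks that the command succeeds, not that every worker is listed in the response.

```bash
# Retry a command until it succeeds or the attempts run out.
# Usage: wait_until <attempts> <delay-seconds> <command> [args...]
wait_until() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        # Sleep only when another attempt follows.
        if [ "$i" -le "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}
```

For example, wait_until 30 2 curl -fsS http://localhost:8000/discover polls for up to about a minute; the -f flag makes curl return non-zero on HTTP errors.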
Useful launch flags
| Flag | What it does |
|---|---|
| --dry-run | Print what would happen, no SSH or launches |
| --api-only | Heartbeat-check workers, start API only |
| --workers 1,3 | Launch only specific worker ranks |
| --ext .safetensors,.pth | Override file extensions the watcher monitors |
Watch logs
# API / watcher (on server)
tmux attach -t syncps_api
tmux attach -t syncps_watcher

# Worker logs (SSH into Pi, then attach)
ssh -t pi4-1 tmux attach -t syncps_worker_1

# All logs tailed locally
tail -f logging/cluster-logs/*.log
No-SSH alternative
grove start / join
Once bootstrap has run on every node (step 4), you can skip launch.sh entirely and use grove instead. No SSH config, no manual config.yaml editing - workers discover the master over mDNS and self-register, and the master writes config.yaml automatically. Good for testing and same-network setups.
Master: Install once
brew install uv   # if not already installed
git clone https://github.com/YuvrajSingh-mist/smoltorrent
cd smoltorrent && uv sync

# Add grove to your PATH (one time)
echo 'export PATH="$HOME/smoltorrent/.venv/bin:$PATH"' >> ~/.zshrc
source ~/.zshrc
Each worker: Install once (Pi or any Linux/macOS)
# Install uv first if not present
curl -LsSf https://astral.sh/uv/install.sh | sh
source "$HOME/.local/bin/env"   # or open a new shell

git clone https://github.com/YuvrajSingh-mist/smoltorrent
cd smoltorrent && uv pip install -e .
Master: Advertise and wait for workers
# Run from inside the smoltorrent directory
cd smoltorrent
grove start -n 4   # replace 4 with your worker count
The master advertises itself over mDNS and opens a registration server on port 5999. It will wait until all N workers have joined before launching.
Each worker: Find the master and join
# Run from inside the smoltorrent directory on the worker
cd smoltorrent
grove join
A TUI lists all smoltorrent masters found on the network. Select yours and press Enter. The worker registers, receives its rank, and starts the TCP listener immediately - no manual config needed.
- Master writes configs/config.yaml with each worker's IP, rank, and port.
- Master starts the FastAPI server (syncps_api) and watcher daemon (syncps_watcher) in tmux.
- Cluster is ready - drop a .safetensors file into ckpt_root and the watcher shards it automatically.
Set ckpt_root in configs/config.yaml after the first join to point at your checkpoint directory. Everything else is managed automatically.
Pi worker auto-start
Install a systemd service on each Pi so workers restart automatically after a Pi reboot - without waiting for the server:
# All 4 workers
bash scripts/install_worker_service.sh

# Specific ranks only
bash scripts/install_worker_service.sh --workers 1,3

# Remove from all
bash scripts/install_worker_service.sh --uninstall
launch.sh also kills and re-launches workers via tmux on every run - systemd and tmux are independent. If both are running, systemd's process will fail to bind the port and retry after 5 s (harmless).
Server auto-start at boot
Register macOS LaunchDaemons so the entire cluster comes up after a server reboot - no manual intervention. Run from the smoltorrent directory:
sudo bash scripts/launch.sh --daemons
This registers three system daemons:
- com.smoltorrent.startup: waits for network (pings first worker every 5 s), then runs launch.sh
- com.node-exporter: keeps node_exporter running for Grafana system stats
- com.smoltorrent.boot-exporter: exposes boot time metric on port 9101 for Grafana
Verify & check logs
# Check all three are registered
sudo launchctl print system/com.smoltorrent.startup
sudo launchctl print system/com.node-exporter
sudo launchctl print system/com.smoltorrent.boot-exporter

# Startup log (after reboot)
cat /tmp/smoltorrent-startup.log
cat /tmp/smoltorrent-boot-exporter.log
launchctl print may show last exit code = 1 and state = not running - this is expected. The daemon runs once at boot, launches everything into tmux, then exits. The cluster keeps running in tmux independently.
Remove all daemons
# Cluster startup daemon
sudo launchctl bootout system/com.smoltorrent.startup 2>/dev/null || true
sudo rm -f /Library/LaunchDaemons/com.smoltorrent.startup.plist
sudo rm -f /usr/local/bin/smoltorrent_startup.sh

# node_exporter
sudo launchctl bootout system/com.node-exporter 2>/dev/null || true
sudo rm -f /Library/LaunchDaemons/com.node-exporter.plist

# Boot time exporter
sudo launchctl bootout system/com.smoltorrent.boot-exporter 2>/dev/null || true
sudo rm -f /Library/LaunchDaemons/com.smoltorrent.boot-exporter.plist
Boot time exporter
utils/boot_exporter.py exposes smoltorrent_boot_time_ms on port 9101 - the metric that powers the Server Last Boot panel in Grafana. It reads the OS boot timestamp directly (sysctl kern.boottime on macOS, /proc/stat on Linux) so the value is always accurate.
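To sanity-check the value the exporter reports, the same two OS sources can be read from the shell. The sketch below is not the code in utils/boot_exporter.py (which is Python); the function names are hypothetical, and it assumes the usual formats: a "btime <seconds>" line in /proc/stat, and "{ sec = ..., usec = ... }" from sysctl -n kern.boottime.

```bash
# Linux: read /proc/stat on stdin and print the boot time in milliseconds.
boot_ms_from_proc_stat() {
    while read -r key value rest; do
        if [ "$key" = "btime" ]; then
            echo $((value * 1000))
            return 0
        fi
    done
    return 1
}

# macOS: take the output of `sysctl -n kern.boottime` as $1 and print milliseconds.
boot_ms_from_kern_boottime() {
    sec=$(printf '%s\n' "$1" | sed -n 's/.*sec = \([0-9][0-9]*\), usec.*/\1/p')
    usec=$(printf '%s\n' "$1" | sed -n 's/.*usec = \([0-9][0-9]*\).*/\1/p')
    [ -n "$sec" ] && [ -n "$usec" ] || return 1
    echo $((sec * 1000 + usec / 1000))
}

# Usage:
#   Linux: boot_ms_from_proc_stat < /proc/stat
#   macOS: boot_ms_from_kern_boottime "$(sysctl -n kern.boottime)"
```

The Linux value has one-second resolution, so it can differ from a sub-second source by up to 999 ms.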
Server (macOS) - registered automatically by --daemons (see step 6). Verify:
curl http://localhost:9101/metrics | grep smoltorrent_boot_time_ms
tail -f /tmp/smoltorrent-boot-exporter.log
Pi workers (Linux) - deployed automatically by bootstrap.sh via SSH. To install manually on a Pi:
# SSH into the Pi first, then run - adjust REPO_DIR to where you cloned smoltorrent
REPO_DIR="$HOME/Desktop/smoltorrent"

sudo tee /etc/systemd/system/smoltorrent-boot-exporter.service <<EOF
[Unit]
Description=smoltorrent boot time exporter (port 9101)
After=network.target

[Service]
Type=simple
User=$(whoami)
WorkingDirectory=$REPO_DIR
ExecStart=$HOME/.local/bin/uv run $REPO_DIR/utils/boot_exporter.py
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now smoltorrent-boot-exporter

# Verify
systemctl is-active smoltorrent-boot-exporter
curl http://localhost:9101/metrics | grep smoltorrent_boot_time_ms
Disable / uninstall
Server (macOS):
sudo launchctl bootout system/com.smoltorrent.boot-exporter 2>/dev/null || true
sudo rm -f /Library/LaunchDaemons/com.smoltorrent.boot-exporter.plist
Pi workers (Linux):
sudo systemctl disable --now smoltorrent-boot-exporter
sudo rm -f /etc/systemd/system/smoltorrent-boot-exporter.service
sudo systemctl daemon-reload
Grafana query: smoltorrent_boot_time_ms{node="master"}. No * 1000 conversion is needed since the exporter already emits milliseconds.
Monitoring (Prometheus + Grafana + Loki)
Each Pi worker exposes its metrics on port 9200+rank. Everything runs in Docker on the server only.
1. Install Docker (via colima on macOS)
brew install colima docker docker-compose
colima start # must be running before any docker commands
After a Mac reboot, run colima start again, or run bash scripts/launch_monitoring.sh --daemons once to register a LaunchDaemon that handles it automatically on every boot.
2. Configure Telegram alerts (one time)
Create a Telegram bot with @BotFather (/newbot), get your chat ID from @userinfobot, then fill in monitoring/.env:
# Copy the example and fill in your Telegram bot token + chat ID
cp monitoring/.env.example monitoring/.env
3. Start the stack
bash scripts/launch_monitoring.sh
4. Ship Pi worker logs to Loki (one time)
Once the cluster is up (grove start / grove join or launch.sh), run this once from the master. It reads configs/config.yaml, generates a per-Pi Promtail config with the correct IPs and ranks, and installs it as a systemd service on each Pi over SSH:
# All workers
bash scripts/launch_monitoring.sh --install-pi-promtail

# Specific ranks only
bash scripts/launch_monitoring.sh --install-pi-promtail --workers 1,3
The shipped worker logs can then be queried with the selector {job="smoltorrent"}.
Other commands
# Stop (volumes and history preserved)
bash scripts/launch_monitoring.sh --down

# Auto-start stack on every Mac reboot (run once)
bash scripts/launch_monitoring.sh --daemons
Metrics endpoints
| Source | URL | What |
|---|---|---|
| Master API | <master-ip>:8000/metrics | FastAPI + transfer metrics |
| Pi worker N | <pi-ip>:920N/metrics | Per-worker shard/transfer metrics |
| All nodes | <node-ip>:9100/metrics | System stats (node_exporter) |
See monitoring/README.md for the full metrics reference, dashboard panel guide, and alert rules.
Requirements summary
| Dependency | Where | How |
|---|---|---|
| Python ≥ 3.13 | All nodes | Manual on Pis; already on macOS |
| uv | All nodes | Auto by bootstrap.sh on Pis; brew install uv on the server |
| tmux ≥ 3.0 | All nodes | Auto by bootstrap.sh |
| yq | Server only | brew install yq |
| node_exporter | All nodes | Auto by bootstrap.sh |
| zeroconf | All nodes | Auto by bootstrap.sh (mDNS discovery) |
| Network (LAN/VPN) | All nodes | Nodes must reach each other over TCP |
| SSH key auth | Server → Pis | ssh-copy-id (step 2) |
| Docker + colima | Server only | Monitoring only - no SSH needed |