Disaster recovery

Restore, clone, and stay online

A complete Pi — OS, containers, data, configuration, tunnel — streamed from S3 to new hardware in one command. This section covers the full flow: preparing a rescue SD card, running the restore, verifying the result, and optionally pairing a hot standby that takes over automatically when prod goes down.


Disaster recovery

Restore to a new Pi

Flash an SD card with a Cloudflare tunnel baked in, boot the Pi, run one command. Full runbook in DR-quickstart.md.

1

Prepare the SD card (WiFi + Cloudflare tunnel)

Run on your Mac. Downloads Pi OS Lite, flashes it, installs cloudflared, and pre-configures your WiFi and Cloudflare tunnel — so the Pi comes up fully SSH-accessible on first boot with no cable and no monitor.

bash extras/firstboot/prepare-sd.sh --flash --cf-api-token <CF_API_TOKEN>

The --cf-api-token flag takes a Cloudflare API token with Tunnel:Edit permission and syncs the tunnel's remote ingress config so any hostname routes correctly through the tunnel. Without it, unknown hostnames return 404. You will be prompted for your SD card disk, WiFi SSID and password, Pi password, tunnel UUID, and Cloudflare hostname.
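For context: a remote-managed Cloudflare tunnel holds an ordered list of ingress rules, and cloudflared requires the last one to be a catch-all, typically `http_status:404`. That catch-all is what an unmatched hostname hits. The sketch below is a minimal illustration under assumptions, not what prepare-sd.sh actually sends. It builds such a rule list and shows the Cloudflare API call that would upload it. The local service (`http://localhost:80`), `CF_ACCOUNT_ID`, `TUNNEL_UUID`, and both function names are hypothetical placeholders.

```shell
# Build a remote-managed tunnel config: one ingress rule per hostname,
# followed by the catch-all rule cloudflared requires, which answers 404.
ingress_json() {
  local service=$1 rules="" host
  shift
  for host in "$@"; do
    rules="${rules}{\"hostname\":\"${host}\",\"service\":\"${service}\"},"
  done
  printf '{"config":{"ingress":[%s{"service":"http_status:404"}]}}\n' "$rules"
}

# Upload the config (not run here). Needs a token with Tunnel:Edit;
# CF_ACCOUNT_ID and TUNNEL_UUID are placeholders.
push_ingress() {
  curl -sS -X PUT \
    "https://api.cloudflare.com/client/v4/accounts/${CF_ACCOUNT_ID}/cfd_tunnel/${TUNNEL_UUID}/configurations" \
    -H "Authorization: Bearer ${CF_API_TOKEN}" \
    -H "Content-Type: application/json" \
    --data "$1"
}

# Example: push_ingress "$(ingress_json http://localhost:80 yourdomain.com www.yourdomain.com)"
```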

2

Run the restore one-liner

Installs dependencies and the AWS CLI, prompts for your AWS credentials, then streams the backup directly from S3 to the target device. Have your AWS access key ID and secret access key ready; retrieve them from your password manager first.

curl -sL pi2s3.com/restore | bash

Interactive mode picks your latest backup and the target device. To run non-interactively:

bash ~/pi2s3/pi-image-restore.sh --date 2026-04-16 --device /dev/nvme0n1 --yes
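The value passed to --date is one of the date-named prefixes in your bucket. The validation step shows one as `2026-04-16/`. If you script restores and want to know which date is newest, for example to log it, here is a minimal sketch. It assumes backups sit directly under `YYYY-MM-DD/` prefixes at a location you substitute for the placeholder bucket path. `latest_backup_date` is a hypothetical helper, not part of pi2s3.

```shell
# Print the newest YYYY-MM-DD/ prefix from `aws s3 ls` output read on stdin.
# Non-date prefixes (such as standby-sync-ready/) are ignored.
latest_backup_date() {
  awk '$1 == "PRE" && $2 ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]\/$/ {
         sub(/\/$/, "", $2); print $2
       }' | sort | tail -n 1
}

# Example against a real bucket (bucket and prefix are placeholders):
# aws s3 ls "s3://my-pi2s3-bucket/" | latest_backup_date
```

ISO dates sort correctly as plain strings, so `sort | tail -n 1` is enough.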

3

Validate (optional, before touching hardware)

Confirm the S3 image exists and is intact from any machine that has AWS access.

bash ~/pi2s3/test-recovery.sh --pre-flash

✔ AWS access OK
✔ Image exists: 2026-04-16/  (3.4 GB)
✔ Manifest: raspberrypi · Pi 5 · Bookworm · nvme0n1
  Estimated restore time: ~12 min
  Run: pi-image-restore.sh --date 2026-04-16 --device /dev/nvme0n1
or

Partial restore: recover a single file or directory

No target device needed. Streams the partition from S3, mounts it via a loop device, and copies the requested path to ./pi2s3-extract-<date>/. Linux only.

# Recover /home/pi from the latest backup
bash ~/pi2s3/pi-image-restore.sh --extract /home/pi

# Recover /etc from a specific date
bash ~/pi2s3/pi-image-restore.sh --extract /etc --date 2026-04-16

# Specify partition (default: largest non-boot = root fs)
bash ~/pi2s3/pi-image-restore.sh --extract /var/lib/docker --partition nvme0n1p2
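Roughly what the extract flow involves, as a hedged sketch. `pick_partition` mirrors the documented default (largest non-boot partition). It assumes the partition list is available as name, size, and filesystem-type lines, with the boot partition identified by its FAT (`vfat`) type. `extract_path` assumes the partition has already been streamed into a raw image file, which is the part pi2s3 does for you. Both function names are hypothetical.

```shell
# Default partition choice: the largest partition that is not the FAT boot
# partition. Input: one "<name> <size-bytes> <fstype>" line per partition.
pick_partition() {
  awk '$3 != "vfat" && $2 + 0 > max + 0 { max = $2; name = $1 }
       END { if (name != "") print name }'
}

# Copy one path out of a partition already restored to a raw image file.
# Needs root and loop-device support (Linux only).
extract_path() {
  local img=$1 src=$2 dest=$3 mnt
  mnt=$(mktemp -d) || return 1
  sudo mount -o loop,ro "$img" "$mnt" || { rmdir "$mnt"; return 1; }
  mkdir -p "$dest"
  sudo cp -a "$mnt$src" "$dest/"
  sudo umount "$mnt"
  rmdir "$mnt"
}

# Example (hypothetical image path):
# extract_path ./nvme0n1p2.raw /home/pi ./pi2s3-extract-2026-04-16
```

Mounting read-only (`ro`) means the extract can never modify the image it reads from.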
or

Restore to a larger device: --resize

Restoring a 128 GB backup to a 256 GB NVMe? Pass --resize and the last partition automatically expands to fill the available space after flash. Uses growpart and resize2fs.

bash ~/pi2s3/pi-image-restore.sh --date 2026-04-16 --device /dev/nvme0n1 --resize
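--resize does this for you. If you restored without the flag and want to grow the disk afterwards, for example from the rescue SD card, this is a hedged sketch of the equivalent manual steps. It assumes the standard Raspberry Pi OS layout (ext4 root as the last partition) and an unmounted target. `part_dev` and `grow_partition` are hypothetical helper names.

```shell
# Build a partition path: names ending in a digit (nvme0n1, mmcblk0) take a
# "p" separator before the partition number; names like sda do not.
part_dev() {
  case $1 in
    *[0-9]) printf '%sp%s\n' "$1" "$2" ;;
    *)      printf '%s%s\n' "$1" "$2" ;;
  esac
}

# Grow partition NUM to the end of DISK, then grow its ext4 filesystem.
# growpart comes from cloud-guest-utils; it exits non-zero if nothing changed.
grow_partition() {
  local disk=$1 num=$2 part
  part=$(part_dev "$disk" "$num")
  sudo growpart "$disk" "$num" || return 1
  # Offline resize2fs wants a fresh check first. e2fsck exit codes 0-3 mean
  # clean or repaired; 4 and above mean errors remain.
  sudo e2fsck -f -p "$part"
  [ $? -lt 4 ] || return 1
  sudo resize2fs "$part"
}

# Example: grow_partition /dev/nvme0n1 2
```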
or

Pi 5 NVMe: throttle writes to prevent PCIe crashes

Some Pi 5 + NVMe combinations trigger a PCIe watchdog reset under sustained writes. --rate-limit 10m caps the decompressed stream feeding partclone at 10 MB/s. Because the limit is applied after gunzip, it controls the actual write rate rather than the download rate. Raise the limit if your hardware stays stable. Requires pv.

sudo apt install pv
bash ~/pi2s3/pi-image-restore.sh --device /dev/nvme0n1 --resize --yes \
  --rate-limit 10m
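Placement matters here. A limiter placed before gunzip would cap compressed bytes, so the real write rate would swing with the compression ratio. A hedged sketch of the pipeline shape and the arithmetic for throttled restore time follows. The S3 URI and target partition are placeholders, the function names are hypothetical, and pi2s3's real pipeline may differ in detail.

```shell
# Convert a pv-style rate such as "10m" or "512k" to bytes per second.
# pv treats K, M and G as powers of 1024.
rate_to_bytes() {
  local n=${1%[kKmMgG]} unit
  unit=${1#"$n"}
  case $unit in
    k|K) echo $((n * 1024)) ;;
    m|M) echo $((n * 1024 * 1024)) ;;
    g|G) echo $((n * 1024 * 1024 * 1024)) ;;
    *)   echo "$n" ;;
  esac
}

# Whole seconds needed to write BYTES of decompressed data at RATE.
secs_at_rate() {
  echo $(( $1 / $(rate_to_bytes "$2") ))
}

# Shape of the throttled pipeline (not run here; arguments are placeholders):
# download -> gunzip -> pv -L (caps decompressed bytes) -> partclone.restore
restore_throttled() {
  aws s3 cp "$1" - | gunzip | pv -q -L "$2" | sudo partclone.restore -s - -o "$3"
}
```

Partclone images carry only used blocks, so the decompressed stream is roughly the data in use. At 10m each GiB takes about 102 s. A root filesystem with 20 GiB in use therefore needs at least 34 minutes of write time.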
or

Wire NVMe boot on a new Pi (--post-restore)

When restoring to a bare NVMe on a different Pi, the SD card PARTUUIDs in /etc/fstab and the root= parameter in /boot/firmware/cmdline.txt must be updated for the new hardware. post-restore-nvme-boot.sh updates both automatically, with no manual editing.

bash ~/pi2s3/pi-image-restore.sh --device /dev/nvme0n1 --resize --yes \
  --rate-limit 10m \
  --post-restore ~/pi2s3/extras/post-restore-nvme-boot.sh

Set NEW_HOSTNAME=my-pi-qa to rename the clone in the same step.
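To show what the boot fix amounts to, here is a minimal sketch of the kind of edit involved; it is not the script's actual code. Raspberry Pi OS refers to partitions as `PARTUUID=<disk-id>-NN`, so repointing boot means swapping the old disk ID for the new one in both files. It assumes the restored root is mounted at a directory you pass in, with the boot partition mounted at `boot/firmware` beneath it. The function names are hypothetical.

```shell
# Swap the disk ID in every PARTUUID=<diskid>-NN reference in fstab and
# cmdline.txt. ROOT is the mounted restored root (assumed layout).
retarget_partuuids() {
  local root=$1 old=$2 new=$3
  sed -i "s/PARTUUID=${old}-/PARTUUID=${new}-/g" \
    "$root/etc/fstab" "$root/boot/firmware/cmdline.txt"
}

# Disk ID of the new target (an MBR ID such as "ab12cd34"), read with blkid.
disk_id() {
  sudo blkid -s PTUUID -o value "$1"
}

# Example (not run here): retarget_partuuids /mnt/restored "$OLD_ID" "$(disk_id /dev/nvme0n1)"
```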

4

Boot & clear stale SSH key

Raspberry Pi OS automatically expands the root filesystem. Clear the old host key on your Mac before connecting.

ssh-keygen -R raspberrypi.local
ssh pi@raspberrypi.local

5

Post-boot verification

Checks filesystem expansion, NVMe mount, all Docker containers, Cloudflare tunnel, cron jobs, MariaDB, HTTP, memory, and load.

bash ~/pi2s3/test-recovery.sh --post-boot

✔ OS: Debian GNU/Linux 12 (bookworm) aarch64
✔ Filesystem expanded (954 GB)
✔ NVMe mounted at /mnt/nvme
✔ Docker: 6/6 containers running
✔ Cloudflare tunnel: active (2 ha_connections)
✔ Cron: pi2s3 backup + app-layer backup present
✔ MariaDB: responding, 42 tables
✔ HTTP: 200 OK on localhost
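If you want two of these checks in your own monitoring, here is a rough equivalent of the Docker and HTTP lines. It assumes Docker's `{{.State}}` format field and a web service on localhost. test-recovery.sh itself may check differently, and the helper names are hypothetical.

```shell
# Summarise container states (one per line on stdin, as printed by
# `docker ps -a --format '{{.State}}'`) as "running/total".
container_summary() {
  awk '{ total++ } $1 == "running" { up++ } END { printf "%d/%d\n", up, total }'
}

# HTTP status code from an endpoint; curl prints "000" if nothing answered.
http_code() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 5 "${1:-http://localhost}"
}

# Live usage on the restored Pi (not run here):
# docker ps -a --format '{{.State}}' | container_summary
# [ "$(http_code)" = 200 ] && echo "HTTP OK"
```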
or

Clone to a second Pi (--post-restore)

Restore to a second Pi and customise it — different hostname, Cloudflare tunnel, or .env variables — before the first boot. The restored root is mounted read-write and your script runs inside it.

bash ~/pi2s3/pi-image-restore.sh --date latest --device /dev/nvme0n1 \
  --post-restore ~/post-restore-office.sh

Template: extras/post-restore-example.sh
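A minimal sketch of one customisation such a script might make: renaming the clone. The function takes the mounted root as an argument. How pi2s3 exposes that path to your script is an assumption here (the standby template documents $RESTORE_ROOT), so check the example template. `set_clone_hostname` is a hypothetical name, and it must run with enough privilege to write the restored files.

```shell
# Rename a restored clone by editing files under its mounted root.
set_clone_hostname() {
  local root=$1 new=$2
  printf '%s\n' "$new" > "$root/etc/hostname"
  # Debian-based systems map the machine's own name to 127.0.1.1.
  sed -i "s/^127\.0\.1\.1[[:space:]].*/127.0.1.1\t$new/" "$root/etc/hosts"
}

# Example inside a post-restore script (RESTORE_ROOT is an assumption):
# set_clone_hostname "$RESTORE_ROOT" my-pi-office
```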

If a restore is interrupted midway (network cut, power loss, wrong device selected), it is safe to rerun. partclone.restore writes blocks independently, so a rerun cleanly overwrites the partial work. Re-apply the partition table with sfdisk before restarting the restore.
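A rerun overwrites whatever is on the target, so the "wrong device" case deserves a guard. Below is a hedged sketch of one guard plus the sfdisk step. Where the saved partition-table dump comes from is not specified here, so `TABLE` is a placeholder. `disk_in_use` and `reapply_table` are hypothetical helpers.

```shell
# Succeeds if any partition of DISK appears in MOUNTS (default /proc/mounts),
# e.g. the disk you are currently booted from.
disk_in_use() {
  local disk=$1 mounts=${2:-/proc/mounts}
  awk -v d="$disk" '
    $1 == d || (substr($1, 1, length(d)) == d && substr($1, length(d) + 1) ~ /^p?[0-9]+$/) { found = 1 }
    END { exit !found }' "$mounts"
}

# Re-apply a saved sfdisk dump to DISK, refusing if the disk is mounted.
reapply_table() {
  local disk=$1 table=$2
  if disk_in_use "$disk"; then
    echo "refusing: $disk has mounted partitions" >&2
    return 1
  fi
  sudo sfdisk "$disk" < "$table"
}

# Example (not run here): reapply_table /dev/nvme0n1 ./nvme0n1.sfdisk
```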


High availability

Hot standby: zero-downtime backup + daily sync

Restore a second Pi from your S3 backup, wire up a failover command, and get both zero-downtime nightly backups and a daily-refreshed standby that can take over in minutes.

Nightly backup flow
02:00  Backup starts       → STANDBY_FAILOVER_CMD (traffic → standby)
        Polls verify URL until standby confirmed serving
        Docker stops, partclone runs, upload to S3
02:30  Backup verified    → STANDBY_FAILBACK_CMD (traffic → primary)
        Polls verify URL until primary confirmed serving
        S3 marker written: standby-sync-ready/latest.json
03:00  Standby cron fires → reads S3 marker, newer backup found
        Reboots to SD, restores NVMe from S3 (~30 min)
        Runs post-restore script (tunnel swap, hostname, .env)
03:30  Standby reboots to NVMe — ready for failover
⚠ Standby data is overwritten on every sync.
Each nightly sync restores a full partition image from production onto the standby NVMe — every file, database row, and config change on the standby is permanently replaced. The standby is always a replica of production, never an independent server. Do not store data on the standby that you expect to survive a sync cycle.
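The standby's 03:00 decision combines two rules from this page. It syncs only when the S3 marker points at a newer backup, and it skips the sync when the primary is down (STANDBY_SYNC_PRIMARY_URL, step 4). Below is a minimal sketch of that decision as a pure function. It assumes the marker and the last-synced record both carry ISO dates; the marker's real JSON fields are not documented here. `should_sync` is a hypothetical name.

```shell
# Should the standby resync tonight?
#   $1  backup date named by the S3 marker (YYYY-MM-DD; assumed format)
#   $2  date of the last completed sync, empty if it has never synced
#   $3  "up" or "down": did the primary answer its health URL?
should_sync() {
  # Primary down: the standby may be serving traffic, so leave it alone.
  [ "$3" = up ] || return 1
  # Never synced: any backup counts as newer.
  [ -n "$2" ] || return 0
  # Compare dates as integers (2026-04-16 -> 20260416).
  [ "$(printf %s "$1" | sed 's/-//g')" -gt "$(printf %s "$2" | sed 's/-//g')" ]
}
```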
1

Set up the standby Pi (Steps 1–9 first)

Follow DR-quickstart.md Steps 1–9 to restore a second Pi from your S3 backup and give it its own SSH hostname. Then run the sync installer on the standby:

# On the STANDBY Pi, after Steps 1–9
bash ~/pi2s3/extras/install-standby-sync.sh

This copies pi2s3 tools onto the SD card, installs a restore agent that runs automatically on SD boot, and adds a cron job that runs every 30 minutes and watches for the S3 sync marker.

2

Wire failover commands (provider-agnostic)

In the primary Pi's config.env, set the two failover hooks and the verify URLs. pi2s3 ships provider scripts for Cloudflare and Route 53, or you can supply any shell command.

# Primary Pi config.env
HOT_STANDBY_ENABLED=true

# Cloudflare DNS example
STANDBY_FAILOVER_CMD="bash ~/pi2s3/extras/failover/cf-dns-swap.sh --to-standby"
STANDBY_FAILBACK_CMD="bash ~/pi2s3/extras/failover/cf-dns-swap.sh --to-primary"

# Route 53 example
# STANDBY_FAILOVER_CMD="bash ~/pi2s3/extras/failover/route53-swap.sh --to-standby"
# STANDBY_FAILBACK_CMD="bash ~/pi2s3/extras/failover/route53-swap.sh --to-primary"

# Backup waits until standby is confirmed serving before stopping Docker
STANDBY_VERIFY_URL="https://yourdomain.com"
STANDBY_PRIMARY_VERIFY_URL="https://yourdomain.com"
STANDBY_FAILOVER_TIMEOUT_SECS=300

# Optional: read current DNS TTL to calibrate propagation wait
STANDBY_VERIFY_DOMAIN="yourdomain.com"

After calling the failover command, pi2s3 reads the DNS TTL via dig and polls STANDBY_VERIFY_URL until HTTP 2xx — so backup only proceeds once standby is actually serving. Write your own provider script using extras/failover/generic-template.sh.
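Here is a sketch of that wait, assuming `dig` and `curl` are installed. Sleeping for one TTL before polling is one reasonable way to use the TTL. How pi2s3 actually calibrates the wait isn't specified here, and the function names are illustrative.

```shell
# TTL in seconds from the first answer line of `dig +noall +answer`.
ttl_from_dig() {
  awk 'NF >= 5 { print $2; exit }'
}

# True if an HTTP status code is 2xx.
is_2xx() {
  case $1 in 2[0-9][0-9]) return 0 ;; *) return 1 ;; esac
}

# Poll URL until it returns 2xx; give up after roughly TIMEOUT seconds.
wait_for_2xx() {
  local url=$1 timeout=$2 interval=${3:-10} waited=0
  until is_2xx "$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url")"; do
    [ "$waited" -ge "$timeout" ] && return 1
    sleep "$interval"
    waited=$((waited + interval))
  done
}

# On the primary (not run here):
# ttl=$(dig +noall +answer "$STANDBY_VERIFY_DOMAIN" | ttl_from_dig)
# sleep "${ttl:-0}"
# wait_for_2xx "$STANDBY_VERIFY_URL" "$STANDBY_FAILOVER_TIMEOUT_SECS"
```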

3

Cloudflare provider: required vars

If using cf-dns-swap.sh, add these to the primary's config.env. The API token needs only the Zone:DNS:Edit permission.

# In primary config.env
CF_ZONE_ID="<your-zone-id>"         # zone dashboard → Overview → Zone ID
CF_API_TOKEN="<dns-edit-token>"      # Cloudflare API token, Zone:DNS:Edit
CF_FAILOVER_DOMAINS="yourdomain.com,www.yourdomain.com"
PROD_TUNNEL_UUID="<production-tunnel-uuid>"
STANDBY_TUNNEL_UUID="<standby-tunnel-uuid>"

# Standby SSH hostname stays permanently on standby UUID — never swaps
# ssh-standby.yourdomain.com  CNAME  <STANDBY_TUNNEL_UUID>.cfargotunnel.com
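What a swap amounts to, as a hedged sketch rather than cf-dns-swap.sh's actual code. Assume each domain in CF_FAILOVER_DOMAINS is a CNAME to `<tunnel-uuid>.cfargotunnel.com`, as with the SSH hostname above. Failover then repoints it from PROD_TUNNEL_UUID to STANDBY_TUNNEL_UUID through Cloudflare's DNS records API: list the records filtered by type and name, then PATCH the content. The API function assumes `jq` is installed and is not exercised here. All function names are hypothetical.

```shell
# CNAME target for each direction (tunnels are addressed as <uuid>.cfargotunnel.com).
swap_target() {
  case $1 in
    --to-standby) printf '%s.cfargotunnel.com\n' "$STANDBY_TUNNEL_UUID" ;;
    --to-primary) printf '%s.cfargotunnel.com\n' "$PROD_TUNNEL_UUID" ;;
    *) echo "usage: swap_target --to-standby|--to-primary" >&2; return 2 ;;
  esac
}

# One domain per line from the comma-separated CF_FAILOVER_DOMAINS.
failover_domains() {
  printf '%s\n' "$CF_FAILOVER_DOMAINS" | tr ',' '\n'
}

# Repoint one CNAME via the Cloudflare DNS records API (not run here; needs jq).
update_cname() {
  local name=$1 target=$2 id
  local api="https://api.cloudflare.com/client/v4/zones/$CF_ZONE_ID/dns_records"
  id=$(curl -s -H "Authorization: Bearer $CF_API_TOKEN" \
         "$api?type=CNAME&name=$name" | jq -r '.result[0].id') || return 1
  curl -s -X PATCH "$api/$id" \
    -H "Authorization: Bearer $CF_API_TOKEN" \
    -H "Content-Type: application/json" \
    --data "{\"content\":\"$target\"}"
}

# Example: for d in $(failover_domains); do update_cname "$d" "$(swap_target --to-standby)"; done
```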
4

Configure the standby Pi's sync settings

In the standby Pi's config.env (same S3 bucket and AWS credentials as the primary), enable sync and point to a post-restore script that re-applies standby-specific config after each restore.

# Standby Pi config.env
HOT_STANDBY_SYNC_ENABLED=true
STANDBY_SYNC_DEVICE="/dev/nvme0n1"

# Confirm primary is up before syncing (skips sync if primary is down
# and this standby is serving traffic — prevents unnecessary downtime)
STANDBY_SYNC_PRIMARY_URL="https://yourdomain.com"

# Runs after each restore: swaps CF tunnel UUID, updates hostname, etc.
# Copy extras/post-restore-standby-example.sh and fill in your values.
STANDBY_POST_RESTORE_SCRIPT="~/pi2s3/extras/post-restore-standby.sh"

Template: extras/post-restore-standby-example.sh. The script receives the restored NVMe mount point as $RESTORE_ROOT and runs before the first boot — swap the CF tunnel UUID here so the standby doesn't come up using production credentials.
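For a locally managed tunnel, the swap usually means rewriting cloudflared's config.yml inside the restored root. Both the `tunnel:` line and the `credentials-file:` path carry the UUID. Below is a minimal sketch under stated assumptions. It assumes that config layout, and that the standby's credentials JSON is already present on the restored filesystem. The config path in the example and the function name are hypothetical; the template above is the authoritative version.

```shell
# Re-point a restored cloudflared config from the production tunnel UUID
# to the standby tunnel UUID. Fails if the production UUID isn't found.
swap_tunnel_uuid() {
  local cfg=$1 prod=$2 standby=$3
  grep -q "^tunnel: *$prod" "$cfg" || { echo "prod tunnel not found in $cfg" >&2; return 1; }
  # Rewrites both the tunnel: line and the credentials-file path.
  sed -i "s/$prod/$standby/g" "$cfg"
}

# Example (config path is an assumption):
# swap_tunnel_uuid "$RESTORE_ROOT/etc/cloudflared/config.yml" "$PROD_TUNNEL_UUID" "$STANDBY_TUNNEL_UUID"
```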

5

SSH access to the standby (no browser auth)

Create a Cloudflare Access service token for the standby SSH hostname and add it to your local SSH config.

# ~/.ssh/config
Host ssh-standby.yourdomain.com
    ProxyCommand cloudflared access ssh --hostname ssh-standby.yourdomain.com --id <CF-Access-Client-Id> --secret <CF-Access-Client-Secret>
    User <pi-username>
    IdentityFile ~/.ssh/your_pi_key
    StrictHostKeyChecking no

Make sure your SSH public key is in ~/.ssh/authorized_keys on the standby: ssh-copy-id -i ~/.ssh/your_pi_key.pub <user>@ssh-standby.yourdomain.com

6

Test the full cycle

# 1. Trigger a manual failover (primary → standby)
bash ~/pi2s3/extras/failover/cf-dns-swap.sh --to-standby
curl -I https://yourdomain.com   # confirm 200 from standby

# 2. Fail back to primary
bash ~/pi2s3/extras/failover/cf-dns-swap.sh --to-primary
curl -I https://yourdomain.com   # confirm 200 from primary

# 3. Trigger a manual sync cycle on standby
ssh ssh-standby.yourdomain.com \
  "bash ~/pi2s3/extras/hot-standby-sync.sh"
# Standby will reboot, restore (~30 min), come back up

# 4. Confirm standby is back and shows today's date
ssh ssh-standby.yourdomain.com "cat ~/pi2s3/.standby-last-synced"

Beyond backup & restore

Clone, recover anywhere, deploy fleets

pi2s3 ships four tools for scenarios beyond single-Pi disaster recovery: staging clones, a recovery USB image, HTTP netboot for recovery with no physical media, and fleet deployment.

Clone / staging environments

Restore a backup to a second Pi and have it come up as a different site — different Cloudflare tunnel, hostname, and .env values — without manual editing after reboot. post-restore-nvme-boot.sh handles the SD PARTUUID and cmdline.txt differences between Pi hardware automatically.

bash pi-image-restore.sh \
  --date latest --device /dev/nvme0n1 \
  --resize --yes --rate-limit 10m \
  --post-restore extras/post-restore-nvme-boot.sh

For app customisation (hostname, CF tunnel, .env, SSH keys): extras/post-restore-example.sh.

Recovery USB image

A pre-built bootable image with pi2s3 and all tools installed. Plug it into any Pi 5 and power on: it logs in automatically and launches the restore wizard. No laptop and no SD card setup are required at restore time; the Pi still needs network access to stream the backup from S3.

# Build your own (Linux, ~15 min)
bash extras/build-recovery-usb.sh

Or download a pre-built image from GitHub Releases. Flash with Raspberry Pi Imager. Default SSH password: recovery.

HTTP netboot (Pi 5)

Configure the Pi 5 EEPROM to fall back to the pi2s3 restore environment over HTTP when no NVMe is present. Power + ethernet is all you need — no physical media at all.

# One-time setup per Pi
bash extras/setup-netboot.sh
sudo reboot

# Force recovery boot immediately
bash extras/setup-netboot.sh --force

Boot files served from boot.pi2s3.com (CloudFront → S3). Terraform in extras/terraform/.
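On a Pi 5, the order in which boot sources are tried comes from the BOOT_ORDER setting in the EEPROM config; `rpi-eeprom-config` prints the current value. This page doesn't say which value setup-netboot.sh writes. As a hedged illustration, the hypothetical helper below decodes a BOOT_ORDER value. Under this decoding, 0xf76 would try NVMe first, fall back to HTTP boot, and then restart the sequence.

```shell
# Decode a Raspberry Pi EEPROM BOOT_ORDER value. Modes are tried starting
# from the least significant nibble, so 0xf76 reads NVMe, HTTP, restart.
decode_boot_order() {
  local v=$(( $1 )) out="" mode
  while [ "$v" -gt 0 ]; do
    mode=$(( v & 0xf ))
    case $mode in
      1)  out="$out SD" ;;
      2)  out="$out NETWORK" ;;
      3)  out="$out RPIBOOT" ;;
      4)  out="$out USB-MSD" ;;
      6)  out="$out NVME" ;;
      7)  out="$out HTTP" ;;
      14) out="$out STOP" ;;
      15) out="$out RESTART" ;;
      *)  out="$out MODE-$mode" ;;
    esac
    v=$(( v >> 4 ))
  done
  echo "${out# }"
}
```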

Fleet deployment

Deploy the same backup to 10–100 Pis from a CSV manifest. Each Pi gets the base image plus a per-Pi post-restore script for hostname, tunnel credentials, and SSH keys.

# fleet.csv
pi-01,192.168.1.101,latest,/dev/nvme0n1,./classroom.sh
pi-02,192.168.1.102,latest,/dev/nvme0n1,./classroom.sh

# Deploy all at once
bash extras/fleet-deploy.sh fleet.csv --parallel

Per-Pi logs saved to fleet-deploy-logs-*/. Example manifest + scripts in extras/fleet-example/.
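The manifest format is simple enough to sanity-check before a long parallel run. The minimal sketch below validates the five columns shown in the example (name, IP, backup date, target device, post-restore script) and prints a dry-run plan. Skipping `#` comment lines and blank lines is an assumption, the output format is illustrative, and `fleet_plan` is a hypothetical name; fleet-deploy.sh may parse the file differently.

```shell
# Validate a fleet CSV and print one dry-run line per Pi.
# Exits non-zero if any row does not have exactly five fields.
fleet_plan() {
  awk -F, '
    /^[ \t]*#/ || /^[ \t]*$/ { next }
    NF != 5 { printf "line %d: expected 5 fields, got %d\n", NR, NF > "/dev/stderr"; bad = 1; next }
    { printf "%s (%s): --date %s --device %s --post-restore %s\n", $1, $2, $3, $4, $5 }
    END { exit bad }
  ' "$1"
}

# Example: fleet_plan fleet.csv && bash extras/fleet-deploy.sh fleet.csv --parallel
```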