Reference

Configuration, monitoring & troubleshooting

Watchdog scripts, security options, PHP-FPM monitoring, and troubleshooting guides. These aren't needed for a basic install — but they're here when something goes wrong or you want to harden and automate further.


Self-healing

Cloudflare tunnel watchdog

An optional monitor that runs every 5 minutes. If your site or tunnel goes down, it automatically recovers through three escalating phases before rebooting the Pi as a last resort.

root cron · every 5 min
Check 1: Any Docker containers stopped?
Check 2: HTTP probe on localhost (5xx or connection failure?)
Check 3: cloudflared ha_connections > 0?
 
All OK → log and exit
 
Phase 1 (0–20 min) → start containers + restart cloudflared
Phase 2 (20–40 min) → docker compose down/up (full stack restart)
Phase 3 (40+ min) → dump diagnostics + reboot Pi (max once/6 h)

Enable

Set CF_WATCHDOG_ENABLED=true in config.env and run bash install.sh --watchdog. That's it.

Push notifications

ntfy.sh alerts fire at every stage: first failure, each phase escalation, recovery, and "manual needed" when the rate limit blocks a reboot.

Rate-limited reboots

Phase 3 reboots are capped to once every 6 hours. If the rate limit is hit, an alert fires instead and the watchdog exits cleanly.
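The escalation timing can be sketched as a pure function of how long the site has been down. This is a minimal illustration, not the watchdog's actual code: the names watchdog_phase and reboot_allowed and their arguments are hypothetical. Only the 20 and 40 minute boundaries and the 6-hour cap come from the phases listed above.

```shell
# Hypothetical helpers; thresholds taken from the phases listed above.

# Print the phase (1, 2 or 3) for a given downtime in seconds.
watchdog_phase() {
  down_secs=$1
  if [ "$down_secs" -lt 1200 ]; then      # 0-20 min
    echo 1
  elif [ "$down_secs" -lt 2400 ]; then    # 20-40 min
    echo 2
  else                                    # 40+ min
    echo 3
  fi
}

# Succeed if the last watchdog reboot was at least 6 hours ago.
# $1 = epoch seconds of the last reboot (0 if never), $2 = now (epoch seconds)
reboot_allowed() {
  [ $(( $2 - $1 )) -ge 21600 ]
}
```

Keeping the decision separate from the actions (restart, compose down/up, reboot) makes the timing easy to check without touching a live stack.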


FPM Monitor

PHP-FPM saturation monitor

When all PHP-FPM workers are exhausted, WordPress serves 504 errors — and WP-Cron goes silent at the worst moment. This monitor runs as a host cron (not inside Docker), so it fires regardless of PHP-FPM state.

host cron · every minute
Check 1: HTTP probe — curl to FPM_PROBE_URL (5xx or timeout?)
Check 2: DB queries — >15 s queries from wordpress user?
Check 3: Orphaned backup lock — pi2s3-lock SLEEP in PROCESSLIST?
 
Saturated for N checks → ntfy alert fires
FPM_AUTO_RESTART=true → docker restart pi_wordpress automatically
Orphaned lock → killed immediately + ntfy (backup must not be running)
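A minimal sketch of how the HTTP probe and the "saturated for N checks" counter might fit together. FPM_PROBE_URL and FPM_SATURATION_THRESHOLD are the documented setting names; the function names, the 10-second timeout, and the idea of keeping a running count are assumptions for illustration.

```shell
# Succeed (exit 0) when a status code counts as a failed probe.
# curl reports 000 when it cannot connect or the request times out.
probe_failed() {
  case "$1" in
    5??|000) return 0 ;;
    *)       return 1 ;;
  esac
}

# Print the new consecutive-failure count:
# increment on a failed probe, reset to 0 on a good one.
# $1 = previous count, $2 = HTTP status code
next_failure_count() {
  if probe_failed "$2"; then
    echo $(( $1 + 1 ))
  else
    echo 0
  fi
}

# Fetch the status code for one probe (the 10 s timeout is an assumption).
probe_status() {
  curl -s -o /dev/null -m 10 -w '%{http_code}' "$1" || true
}
```

A caller would compare the count against FPM_SATURATION_THRESHOLD and alert once it is reached, for example after three failed probes in a row with the default of 3.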

Auto-restart

Set FPM_AUTO_RESTART=true to automatically restart the WordPress container when saturation is detected. An ntfy notification confirms the restart. A cooldown (FPM_RESTART_COOLDOWN, default 20 min) prevents restart loops. Configure both in the Devtools plugin (Debug AI → PHP-FPM Monitor) or directly in config.env.

Orphaned lock detection

After a backup, the DB lock process is killed from inside the container so no orphaned SLEEP(86400) connections survive. The monitor also detects and kills any that slip through, with a 30-min alert cooldown.
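Both database checks come down to filtering the process list. The sketch below parses tab-separated rows such as those printed by mysql -N -B -e 'SHOW FULL PROCESSLIST'. The wordpress user and the 15-second limit come from the checks above; the function names, and the assumption that the lock statement carries a pi2s3-lock marker in its Info column, are illustrative only.

```shell
# Input columns (tab-separated): Id, User, Host, db, Command, Time, State, Info

# Print Ids of queries from user $1 that have run longer than $2 seconds.
slow_query_ids() {
  awk -F '\t' -v user="$1" -v limit="$2" \
    '$2 == user && $5 == "Query" && $6 + 0 > limit + 0 { print $1 }'
}

# Print Ids of connections whose statement contains both the lock marker
# and a SLEEP call (the marker text is an assumption).
orphaned_lock_ids() {
  awk -F '\t' '$8 ~ /pi2s3-lock/ && $8 ~ /SLEEP/ { print $1 }'
}
```

Each printed Id could then be passed to a KILL statement, but only after confirming that no backup is currently running.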

Plugin integration

The CloudScale Cyber & Devtools plugin (Debug AI tab → PHP-FPM Monitor) shows the last saturation event, its reason, and a pre-filled config.env snippet. Set FPM_CALLBACK_URL and FPM_CALLBACK_TOKEN to enable.

# config.env
FPM_SATURATION_THRESHOLD=3       # checks before alerting
FPM_WP_CONTAINER=pi_wordpress
FPM_DB_CONTAINER=pi_mariadb
FPM_AUTO_RESTART=true            # restart container automatically
FPM_RESTART_COOLDOWN=1200        # seconds between restarts (20 min)
FPM_ALERT_COOLDOWN=1800          # seconds between repeat alerts

Install: run crontab -e and add this line:

* * * * * /home/pi/pi2s3/fpm-saturation-monitor.sh 2>/dev/null


Security & Reliability

Backups you can trust

Layered safeguards that catch failures before they become disasters.

Client-side encryption

Each partition image is encrypted with GPG AES-256 before it leaves your Pi. Even full S3 bucket access is useless without the passphrase, which lives only in config.env, never in S3.

# config.env
BACKUP_ENCRYPTION_PASSPHRASE="my-strong-passphrase"

Requires: sudo apt install gpg. Restore script detects encryption automatically.
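For orientation, symmetric AES-256 encryption of a stream with gpg can look like the sketch below. It is not the backup script's actual pipeline: the function names are hypothetical, and passing the passphrase on file descriptor 3 (so it stays out of the process list) is one possible approach that assumes GnuPG 2.1 or later.

```shell
# Encrypt stdin to stdout with a passphrase. $1 = passphrase
encrypt_stream() {
  gpg --batch --quiet --pinentry-mode loopback --passphrase-fd 3 \
      --symmetric --cipher-algo AES256 3<<EOF
$1
EOF
}

# Decrypt stdin to stdout with the same passphrase. $1 = passphrase
decrypt_stream() {
  gpg --batch --quiet --pinentry-mode loopback --passphrase-fd 3 \
      --decrypt 3<<EOF
$1
EOF
}
```

A quick round trip on the Pi confirms the passphrase works before you depend on it: printf 'test' | encrypt_stream "$BACKUP_ENCRYPTION_PASSPHRASE" | decrypt_stream "$BACKUP_ENCRYPTION_PASSPHRASE"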

Bandwidth throttle

Limit upload speed so nightly backups don't saturate your home or office connection. Uses pv in the upload pipeline. No AWS config changes needed.

# config.env
AWS_TRANSFER_RATE_LIMIT="2m"   # 2 MB/s

Requires: sudo apt install pv. Use 500k, 2m, or 1g format.
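The rate format follows pv's -L option, where k, m and g are 1024-based multipliers. The sketch below shows a hypothetical validator for the setting plus the throttle step itself; rate_to_bytes and throttle are illustrative names, not functions from pi2s3.

```shell
# Print the rate in bytes per second, or fail if the value is malformed.
# Accepts plain digits or digits followed by k, m or g (either case).
rate_to_bytes() {
  num=${1%[kKmMgG]}
  case "$num" in ''|*[!0-9]*) return 1 ;; esac   # digits only
  case "$1" in
    *[kK]) echo $(( num * 1024 )) ;;
    *[mM]) echo $(( num * 1024 * 1024 )) ;;
    *[gG]) echo $(( num * 1024 * 1024 * 1024 )) ;;
    *)     echo "$num" ;;
  esac
}

# Throttle whatever is piped through, for example:
#   some_producer | throttle "$AWS_TRANSFER_RATE_LIMIT" | some_uploader
throttle() {
  pv -q -L "$1"
}
```

Because the limit is applied in the pipe rather than in the AWS CLI, it affects only this upload and leaves any other AWS configuration alone.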

Auto-verify after backup

After every upload, the script re-lists S3 to confirm that every uploaded file has a non-zero size. Silent upload failures are caught immediately and reported via push notification.

# config.env (default: true)
BACKUP_AUTO_VERIFY=true

Verification result is included in the ntfy success notification.
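The same check can be run by hand. This sketch filters the output of aws s3 ls --recursive, whose lines have the form "date time size key"; the function name is hypothetical and it assumes object keys contain no spaces.

```shell
# Input: lines as printed by `aws s3 ls s3://your-bucket/prefix/ --recursive`
# Prints the key of every zero-byte object; exit status is 1 if any were found.
find_empty_objects() {
  awk 'NF >= 4 && $3 == 0 { print $4; bad = 1 } END { exit bad }'
}
```

Usage: aws s3 ls s3://your-bucket/ --recursive | find_empty_objects && echo "all objects non-empty"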

Preflight health checks

Before stopping Docker, the script checks that all containers are healthy, that free disk space exceeds the threshold, and that dmesg shows no recent I/O errors. It warns (or aborts) before touching your running stack.

# config.env
PREFLIGHT_MIN_FREE_MB=500
PREFLIGHT_ABORT_ON_WARN=false
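As an illustration of the disk-space part of the preflight, the sketch below compares free space against a minimum in MB. The function names are hypothetical and the real script may measure a different path or use a different method.

```shell
# Print free space in MB on the filesystem that holds path $1.
free_mb() {
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024) }'
}

# Succeed when free space on $1 is at least $2 MB.
disk_ok() {
  [ "$(free_mb "$1")" -ge "$2" ]
}
```

Usage: disk_ok / "$PREFLIGHT_MIN_FREE_MB" || echo "WARN: low disk space"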

Stale backup alert

A separate daily cron job checks S3 for a recent backup and sends a push alert if none is found within STALE_BACKUP_HOURS. Catches silent cron failures, including the Pi being offline.

# config.env (default: true)
STALE_CHECK_ENABLED=true
STALE_BACKUP_HOURS=25
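The staleness test is a timestamp comparison. A minimal sketch, with hypothetical function names: find the newest object in an S3 listing, then compare its age to STALE_BACKUP_HOURS. With a nightly backup, the 25-hour value presumably leaves an hour of slack before alerting.

```shell
# Print the "YYYY-MM-DD HH:MM:SS" stamp of the newest object in an
# `aws s3 ls --recursive` listing read from stdin.
newest_stamp() {
  awk 'NF >= 4 { print $1 " " $2 }' | sort | tail -n 1
}

# Succeed (exit 0) when the newest backup is older than the allowed window.
# $1 = epoch seconds of newest backup, $2 = now (epoch), $3 = hours allowed
backup_is_stale() {
  [ $(( $2 - $1 )) -gt $(( $3 * 3600 )) ]
}
```

Sorting works here because the listing's date and time format orders the same way lexically and chronologically.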

Crash-safe Docker restart

An on_exit trap guarantees containers are restarted even if the backup script crashes mid-imaging. A separate post-backup cron job (30 min after backup) runs a second check to confirm containers came back up.

# config.env (default: true)
POST_BACKUP_CHECK_ENABLED=true

Pre/post backup hooks

Not running Docker? Stop any service (MariaDB, nginx, php-fpm) before imaging and restart it after. Works for native WordPress, Samba, or any systemd service. The crash trap also runs POST_BACKUP_CMD on failure, so services always come back up.

# native WordPress example
PRE_BACKUP_CMD="systemctl stop nginx php8.2-fpm mariadb"
POST_BACKUP_CMD="systemctl start mariadb php8.2-fpm nginx"
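The "always comes back up" guarantee rests on a shell EXIT trap. The sketch below shows the pattern with the two hook variables; run_with_hooks and its argument are hypothetical, and the real script's trap does more than this. Note that no trap can run after SIGKILL or a power loss.

```shell
# Run an imaging command between PRE_BACKUP_CMD and POST_BACKUP_CMD.
# The EXIT trap is set inside a subshell, so the post hook runs whether the
# imaging command succeeds, fails, or exits early.
# $1 = command to run while services are stopped
run_with_hooks() {
  (
    trap 'sh -c "$POST_BACKUP_CMD"' EXIT
    sh -c "$PRE_BACKUP_CMD" || exit 1
    sh -c "$1"
  )
}
```

The exit status of the imaging command is passed through, so a caller can still tell a failed backup from a good one after services have restarted.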

Zero-downtime DB lock

For MariaDB/MySQL setups, pi2s3 issues FLUSH TABLES WITH READ LOCK instead of stopping Docker. The lock is held only while sync and drop_caches flush InnoDB dirty pages, typically under 10 seconds, and is then released. Writes resume immediately while partclone images in the background. A background probe pings your site every 60 seconds to confirm it stays up.

# config.env
DB_CONTAINER="auto"        # auto-detect
PROBE_LATEST_POST=true   # probe latest WP post

Falls back to STOP_DOCKER automatically if no DB container is found or the lock fails.


Watchdog

Self-healing tunnel monitor

Optional. Runs every 5 minutes as root. If your site or Cloudflare tunnel goes down, it recovers automatically through three escalating phases. No SSH needed.

Phase                 Trigger                       Action
Phase 1 (0–20 min)    First failure detected        Restart cloudflared + start stopped containers
Phase 2 (20–40 min)   Still down after 4 attempts   Full docker compose down/up + cloudflared restart
Phase 3 (40+ min)     Still down after 8 attempts   Reboot Pi (rate-limited: max once per 6 hours)

Three checks run every cycle: Docker containers stopped? Local HTTP probe returning 5xx or connection failure? cloudflared ha_connections > 0? Any failure starts the escalation ladder. Recovery is confirmed by re-running all checks after each action.
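The tunnel check can be reproduced by reading cloudflared's Prometheus metrics. In this sketch the function names are hypothetical, the metric name cloudflared_tunnel_ha_connections is assumed to be the one behind ha_connections, and the metrics address depends on how cloudflared was started.

```shell
# Input: Prometheus text from cloudflared's metrics endpoint, for example
#   curl -s http://127.0.0.1:20241/metrics   (address is an assumption)
# Prints the cloudflared_tunnel_ha_connections value, or 0 if it is absent.
ha_connections() {
  awk '$1 == "cloudflared_tunnel_ha_connections" { v = $2 } END { print v + 0 }'
}

# Succeed when at least one tunnel connection is active (reads stdin).
tunnel_up() {
  [ "$(ha_connections)" -gt 0 ]
}
```

Treating a missing metric as 0 means an unreachable or half-started cloudflared counts as down rather than being skipped.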

Push notifications are sent via ntfy at every stage: first failure, each phase escalation, recovery, and stuck-down alert. Pre-reboot diagnostics are dumped to /var/log/pi2s3-watchdog-prediag.log.

# config.env
CF_WATCHDOG_ENABLED=true
CF_SITE_HOSTNAME="yoursite.com"
CF_HTTP_PORT=80

# Then install:
bash ~/pi2s3/install.sh --watchdog

Troubleshooting

Common errors

The three errors that account for most first-run failures.

pigz not found

The installer tries to install pigz automatically. If it fails, install manually:

sudo apt install pigz

The backup falls back to single-threaded gzip if pigz is absent. It will still work, just slower.

AWS credential failure

The preflight check will print: Cannot reach s3://…. Verify credentials on the Pi:

aws s3 ls s3://your-bucket --region af-south-1

Confirm ~/.aws/credentials exists, or that AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are set in the environment. Check the IAM policy includes s3:ListBucket on the bucket ARN (not just the /* path).
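The bucket ARN versus /* distinction is easy to get wrong, so here is a sketch that prints an example policy. s3:ListBucket must target the bucket itself, while object actions target the /* path. The function name is hypothetical and the object actions shown (GetObject, PutObject) are an assumption; your setup may need others.

```shell
# Print an example IAM policy for bucket $1 (placeholder name).
example_policy() {
  bucket=$1
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::${bucket}"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::${bucket}/*"
    }
  ]
}
EOF
}
```

Usage: example_policy your-bucket > policy.json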

partclone device mismatch

Error: partclone: device size mismatch or filesystem not found. Check the detected device:

lsblk   # confirm nvme0n1 or mmcblk0

Set BOOT_DEV=/dev/mmcblk0 explicitly in config.env if auto-detection picks the wrong device.

Undervoltage / restore crash

Pi 5 requires a 27W (5.1V / 5A) USB-C power supply. A marginal PSU or long USB-C cable causes the restore to crash 30–90 seconds in. The restore script warns at startup if voltage is low:

vcgencmd get_throttled   # 0x0 = clean

Use the official Pi 5 PSU (SC1159) or any USB-C PD charger that negotiates 5V/5A. A shorter, thicker USB-C cable helps if the PSU is marginal.
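Any value other than 0x0 is a bitmask. The sketch below decodes it using the bit meanings from the Raspberry Pi documentation: the low bits describe the current state and bits 16 to 19 record events since boot. The function names are hypothetical.

```shell
# Print "$3" when bit mask $2 is set in value $1.
flag() {
  if [ $(( $1 & $2 )) -ne 0 ]; then echo "$3"; fi
}

# Decode the value from `vcgencmd get_throttled` (e.g. throttled=0x50005).
decode_throttled() {
  v=$(( ${1#throttled=} ))
  if [ "$v" -eq 0 ]; then echo "clean"; return 0; fi
  flag "$v" 0x1     "under-voltage now"
  flag "$v" 0x2     "ARM frequency capped now"
  flag "$v" 0x4     "throttled now"
  flag "$v" 0x8     "soft temperature limit now"
  flag "$v" 0x10000 "under-voltage has occurred"
  flag "$v" 0x20000 "ARM frequency capping has occurred"
  flag "$v" 0x40000 "throttling has occurred"
  flag "$v" 0x80000 "soft temperature limit has occurred"
}
```

Usage: decode_throttled "$(vcgencmd get_throttled)". A value of 0x50005, for instance, means the Pi is under-voltage and throttled right now and has been since boot.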

Restore diagnostic script

Something went wrong? One command tells you exactly what broke and exactly how to fix it. Every failure prints a colour-coded verdict and a copy-paste fix command. A summary at the end lists every issue — no scrolling required.

sudo bash extras/diagnose-restore.sh

Saves full output to /var/log/pi2s3-diagnose-TIMESTAMP.log. Share that file when reporting an issue on GitHub.

What it checks
  • Voltage & CPU throttle (live + historical)
  • CPU temperature & rail voltages
  • NVMe SMART health & EEPROM boot order
  • WiFi SSID — connected vs config.env
  • Saved password special-char encoding
  • Corporate proxy & outbound firewall
  • Packet loss to gateway, 8.8.8.8, 1.1.1.1
  • AWS S3 + STS reachability & credentials
  • cmdline.txt PARTUUID cross-check + rootdelay
  • S3 manifest JSON (catches dd-vs-partclone bug)
  • cloud-init status, sentinel file, runcmd output
  • Cloudflare Tunnel — service, config, journal

[FAIL]  blocks restore, fix immediately
[WARN]  may cause issues
[OK]    healthy

Pi won't boot? Solid red LED?

If the Pi shows a solid red LED (no green ACT blink) after a restore, the SD card's cmdline.txt likely has a missing or wrong PARTUUID. Run this on your Mac with the SD card inserted:

bash extras/recover-sd-boot.sh

Auto-detects the SD card at /Volumes/bootfs, shows the broken cmdline.txt, restores from the automatic backup if present, and prints step-by-step recovery instructions.
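To inspect the value by hand, this sketch extracts the PARTUUID that cmdline.txt asks the kernel to mount as root. The function name is hypothetical; /Volumes/bootfs is the mount point mentioned above.

```shell
# Read cmdline.txt contents on stdin and print the root PARTUUID.
# Prints nothing if no root=PARTUUID= argument is present.
cmdline_partuuid() {
  tr ' ' '\n' | sed -n 's/^root=PARTUUID=//p'
}
```

Usage on the Mac: cmdline_partuuid < /Volumes/bootfs/cmdline.txt. Empty output means the argument is missing. Otherwise compare the value with the root partition's PARTUUID on the Pi, which blkid -s PARTUUID -o value reports for a given partition.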

Example output: healthy Pi (all OK)

pi2s3 restore diagnostic
Generated: Mon Apr 28 08:00:00 SAST 2026
Host:      andrewninja-pi-5
Uptime:    up 2 hours, 14 minutes

────────────────────────────────────────────────────────────────
  1. POWER & VOLTAGE
────────────────────────────────────────────────────────────────
  get_throttled = 0x0
  [OK]    No under-voltage
  [OK]    CPU not throttled
  [OK]    Temperature OK
  [OK]    No historical under-voltage
    core         volt=0.8688V
    temp=48.3'C

────────────────────────────────────────────────────────────────
  3. WIFI & NETWORK
────────────────────────────────────────────────────────────────
  [OK]    Connected to 'Baker' — matches config.env
  [OK]    'Baker' (SSID=Baker) — password OK (len=12)

────────────────────────────────────────────────────────────────
  4. INTERNET & AWS REACHABILITY
────────────────────────────────────────────────────────────────
  [OK]    Gateway (192.168.0.1) — 0% packet loss
  [OK]    Internet (8.8.8.8) — 0% packet loss
  [OK]    AWS S3 (af-south-1) (s3.af-south-1.amazonaws.com:443)
  [OK]    s3://mybucket/ accessible (profile=personal)

────────────────────────────────────────────────────────────────
  8. BOOT CONFIG
────────────────────────────────────────────────────────────────
  [OK]    root=PARTUUID=21b5924d-0d80-4332-a32c-3b24a6bde370 → /dev/nvme0n1p2
  [OK]    rootdelay present in cmdline.txt

────────────────────────────────────────────────────────────────
  12. CLOUDFLARE TUNNEL
────────────────────────────────────────────────────────────────
  [OK]    cloudflared 2026.3.0
  [OK]    cloudflared.service  active=active  enabled=enabled
  [OK]    Credentials file present: /etc/cloudflared/84c2c36c-....json

────────────────────────────────────────────────────────────────
  SUMMARY
────────────────────────────────────────────────────────────────
  Ran in 38s.  Report: /var/log/pi2s3-diagnose-20260428_080000.log

  All checks passed — no issues found.

Example output: under-voltage + wrong SSID (2 issues)

────────────────────────────────────────────────────────────────
  1. POWER & VOLTAGE
────────────────────────────────────────────────────────────────
  get_throttled = 0x50005
  [FAIL]  UNDER-VOLTAGE RIGHT NOW — restore WILL crash
         fix: Use official Pi 5 PSU (5.1V / 5A / 27W, SC1159). Short/thin cable also drops voltage.
  [WARN]  Under-voltage event(s) since boot — PSU is marginal
         fix: Upgrade to official Pi 5 PSU or 27W+ USB-C PD charger

────────────────────────────────────────────────────────────────
  3. WIFI & NETWORK
────────────────────────────────────────────────────────────────
  [FAIL]  Connected to 'C_WIFI' but config.env WIFI_SSID='Baker'
    Visible networks:
      Baker                            signal=72
      C_WIFI                           signal=68
         fix: Update WIFI_SSID in config.env to 'C_WIFI', or connect to the right network:
         fix: sudo nmcli dev wifi connect "Baker" password "<password>"

────────────────────────────────────────────────────────────────
  SUMMARY
────────────────────────────────────────────────────────────────
  Ran in 41s.  Report: /var/log/pi2s3-diagnose-20260428_081200.log

  2 issue(s) found:

  ✗ UNDER-VOLTAGE RIGHT NOW — restore WILL crash
  △ Under-voltage event(s) since boot — PSU is marginal
  ✗ Connected to 'C_WIFI' but config.env WIFI_SSID='Baker'