
ZFS e-mail

I do not have a proper monitoring system like Prometheus & Grafana or Nagios. I use a bash script on a cron job that gets the zpool status of each pool and sends me a daily status email. I opted for a daily email over on-fail because I like the daily assurance that the script ran. If I don’t get the email, something is wrong. I check it each morning for pool state, pool status, and any errors.

#!/bin/bash

hal_status=$(ssh [email protected] "zpool status")
gamehenge_status=$(zpool status)
debianbackup_status=$(ssh [email protected] "sudo zpool status")
m4mini_status=$(ssh [email protected] "sudo zpool status")
LOG=zfs_log.txt

cat > $LOG << DMZ

=== daily zpool status ===

=== $(date '+%A, %B %d, %Y %H:%M:%S') ===

=== Hal (samizdat) ===

$hal_status

=== gaia (gamehenge) ===

$gamehenge_status

=== debian-backup (catalyst & magellan) ===

$debianbackup_status

=== mac mini (abraxas) ===

$m4mini_status

DMZ

mail -s "Daily ZFS Status - $(date +%Y-%m-%d)" [email protected] < $LOG

Quite simple, does the job.
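One gap in the simple version: if a host is unreachable, its variable comes back empty and that section of the email is just blank. A minimal sketch of a wrapper that leaves a visible note instead is below. The name status_or_note and the wording of the placeholder are made up for illustration and are not part of the script above.

```shell
# Hypothetical helper: run a command and print its output, or a loud
# placeholder (plus any error text) if it fails or prints nothing.
status_or_note() {
    local label="$1"
    shift
    local out
    if out=$("$@" 2>&1) && [ -n "$out" ]; then
        printf '%s\n' "$out"
    else
        printf '!!! could not get zpool status from %s !!!\n' "$label"
        if [ -n "$out" ]; then
            printf '%s\n' "$out"
        fi
    fi
}

# Example use (the host is a placeholder; BatchMode and ConnectTimeout keep
# an unattended cron run from hanging on a password prompt or a dead host):
# hal_status=$(status_or_note "hal" ssh -o BatchMode=yes -o ConnectTimeout=10 user@hal "zpool status")
# gamehenge_status=$(status_or_note "gamehenge" zpool status)
```

The daily email still arrives either way, but a dead host now shows up as an obvious line rather than an empty gap.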

I’m going to admit I’m not much of a programmer. Between the precise syntax and the digital abstraction, I have never gotten to the point of writing code fluidly; it’s an arduous, one-line-at-a-time process for me. But I wanted to add more features to this simple script. So I did what any amateur modern programmer would do and asked Claude. This is what he came up with.

#!/bin/bash

# ── connection targets ──────────────────────────────────────────────────────
HAL="[email protected]"
DEBIAN="[email protected]"
M4MINI="[email protected]"
LOG="/tmp/zfs_daily_$(date +%Y%m%d).log"
SNAP_COUNT=5  # how many recent snapshots to show per dataset

# ── helper: fetch recent snapshots ─────────────────────────────────────────
# Takes a command prefix (e.g. "ssh user@host sudo") and prints the N most
# recent snapshots per dataset as a compact table, newest first.
get_snapshots() {
    local cmd_prefix="$1"
    # -S creation sorts newest first, so the awk filter below keeps the latest N
    $cmd_prefix zfs list -t snapshot -o name,creation,used -S creation 2>/dev/null \
        | awk -v n="$SNAP_COUNT" '
            NR==1 { next }  # skip header
            {
                # extract dataset name (everything before the @)
                split($1, parts, "@")
                ds = parts[1]
                counts[ds]++
                if (counts[ds] <= n) {
                    print
                }
            }
        ' \
        | column -t
}

# ── helper: pool health summary ────────────────────────────────────────────
# Prints one row per pool (health, size, allocation, capacity, fragmentation)
# so ONLINE/DEGRADED/FAULTED shows up before the verbose status block.
get_pool_health() {
    local cmd_prefix="$1"
    $cmd_prefix zpool list -o name,health,size,alloc,free,cap,frag 2>/dev/null \
        | column -t
}

# ── gather data ─────────────────────────────────────────────────────────────
hal_status=$(ssh $HAL "zpool status")
hal_health=$(get_pool_health "ssh $HAL")
hal_snaps=$(get_snapshots "ssh $HAL")

gamehenge_status=$(sudo zpool status)
gamehenge_health=$(get_pool_health "sudo")
gamehenge_snaps=$(get_snapshots "sudo")

debian_status=$(ssh $DEBIAN "sudo zpool status")
debian_health=$(get_pool_health "ssh $DEBIAN sudo")
debian_snaps=$(get_snapshots "ssh $DEBIAN sudo")

m4mini_status=$(ssh $M4MINI "sudo zpool status")
m4mini_health=$(get_pool_health "ssh $M4MINI sudo")
m4mini_snaps=$(get_snapshots "ssh $M4MINI sudo")

# ── detect any non-ONLINE pools across all hosts ────────────────────────────
# Useful for putting a warning banner at the top of the email subject/body.
ALL_HEALTH="$hal_health
$gamehenge_health
$debian_health
$m4mini_health"

ALERT=""
if echo "$ALL_HEALTH" | grep -qvE "(ONLINE|NAME|health)"; then
    ALERT="⚠ WARNING: DEGRADED OR FAULTED POOL DETECTED — "
fi

# ── build the log ───────────────────────────────────────────────────────────
cat > "$LOG" << DMZ

=== daily zfs status ===
=== ${ALERT}$(date '+%A, %B %d, %Y %H:%M:%S') ===

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  HAL (samizdat)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

--- pool health ---
$hal_health

--- pool status ---
$hal_status

--- last $SNAP_COUNT snapshots per dataset ---
$hal_snaps

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  GAIA (gamehenge)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

--- pool health ---
$gamehenge_health

--- pool status ---
$gamehenge_status

--- last $SNAP_COUNT snapshots per dataset ---
$gamehenge_snaps

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  DEBIAN-BACKUP (catalyst & magellan)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

--- pool health ---
$debian_health

--- pool status ---
$debian_status

--- last $SNAP_COUNT snapshots per dataset ---
$debian_snaps

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  MAC MINI (abraxas)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

--- pool health ---
$m4mini_health

--- pool status ---
$m4mini_status

--- last $SNAP_COUNT snapshots per dataset ---
$m4mini_snaps

DMZ

# ── ship it ─────────────────────────────────────────────────────────────────
mail -s "${ALERT}Daily ZFS Status - $(date +%Y-%m-%d)" [email protected] < "$LOG"
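The grep check in the script treats any line without ONLINE or NAME as trouble, so a blank line from an unreachable host also trips the banner, and a pool whose name happened to contain ONLINE would slip through. A stricter variant, as a minimal sketch: ask zpool list for tab-separated name and health only (the -H flag drops the header) and compare the health field exactly. The helper name find_unhealthy_pools is hypothetical and not part of the script above.

```shell
# Hypothetical helper: read "name<TAB>health" lines on stdin, the format
# produced by `zpool list -H -o name,health`, and print every pool whose
# health is not exactly ONLINE.
find_unhealthy_pools() {
    awk -F'\t' 'NF >= 2 && $2 != "ONLINE" { print $1 " is " $2 }'
}

# Example use on the local host (remote hosts would go through ssh):
# bad=$(zpool list -H -o name,health | find_unhealthy_pools)
# if [ -n "$bad" ]; then ALERT="WARNING: $bad - "; fi
```

Because the pool name and state come back as text, the banner can say which pool is in which state instead of a generic warning.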

There was some fine-tuning to be done, of course. It was showing the full zpool disk capacity rather than the total usable space for each zpool, but that was an easy fix. I also ended up setting the snapshot-count variable to 1 per dataset, which makes it easy to tell they’re all up to date and replicating. It feels bad missing an opportunity to learn. Still, it uses my base script; it just adds features I would have scratched my head over implementing and likely left half finished.
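The capacity fix itself isn't shown here. One way it could be done, sketched under assumptions: zpool list reports raw pool size, which on raidz includes parity, while zfs list on a pool's root dataset reports the space that is actually usable. The function name get_pool_usable and the column layout are hypothetical, and this is not necessarily the fix that was applied.

```shell
# Hypothetical companion to get_pool_health: report usable space per pool
# from each pool's root dataset instead of raw capacity.
get_pool_usable() {
    local cmd_prefix="$1"
    # -H: tab-separated output with no header
    # -d 0: depth limit of zero, so only each pool's root dataset is listed
    $cmd_prefix zfs list -H -d 0 -o name,used,avail 2>/dev/null \
        | awk -F'\t' '
            BEGIN { printf "%-12s %8s %8s\n", "POOL", "USED", "AVAIL" }
            { printf "%-12s %8s %8s\n", $1, $2, $3 }
        '
}

# Example use, following the same prefix convention as the script:
# gamehenge_usable=$(get_pool_usable "sudo")
# hal_usable=$(get_pool_usable "ssh $HAL")
```

The formatting is done in awk rather than column -t so the header can be added after the header-less -H output.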