ZFS Housekeeping
Hello! It’s been a little while since I did anything ZFS-related.
This past holiday weekend I decided to do some office reorganizing, a daunting task I’d been putting off because it meant shutting down the whole homelab. That includes my desktop workstation, NAS, and backup servers. Not something I take lightly. But I’m happy to announce that I finally did it! And it (mostly) went without a hitch!
One of the two biggest issues I encountered came when I went to replace my UPS battery. Since the UPS powers all of the above machines, I had the foresight to plan this for when I rearranged everything. Much to my dismay, after unplugging everything and opening up both the UPS and the package for the replacement battery, it was obvious I had bought the wrong one. It was listed as a replacement for my model of UPS, but that’s what I get for not doing my due diligence, and for buying one off eBay. So that will have to wait for another day.
The other thing that didn’t go as planned was my intention to move one of my backup zpools to another machine. After building my NAS earlier this year to upgrade from the Intel NUC I was using, I left the main zpool on its old Proxmox host and made it a replication target for the NAS. I figured I’d keep the zpool running until the day (yesterday) came to move things around.
I had tried once before to move it to its intended new destination, a 2018 Mac Mini running Debian with the T2 Linux kernel as a backup server. With two pools attached via Thunderbolt, it was already handling backups for the NAS and LXC container backups from Proxmox. I figured I could simply export the zpool, plug it in, import it, and keep things rolling. In practice, I didn’t get very far. With data actively moving around, the server didn’t like it when I plugged in a new pool, and ZFS reported checksum errors on my pools. Scrubs fixed those, and there was no data loss.
Having learned my lesson, this time I took care to shut down the server first. Unfortunately, the issues persisted. After looking for anything resembling a clue in dmesg and journalctl, my best guess is that the Thunderbolt bus simply couldn’t handle three zpools, and it kept falling over. I visualize a sort of digital traffic jam. So I moved the pool back to where it started, my Arch Linux desktop. All things considered, this seems to be an acceptable option. I’ll keep the NAS as the source of truth, since containers running services like Immich save images from my wife’s phone and mine into datasets, but I can still access things locally.
The backup server still fell over occasionally with only two pools connected. I ended up disabling ASPM (Active State Power Management), a PCIe power-saving feature, with a kernel parameter set in GRUB, and I haven’t had an issue since.
I made the following changes:
sudo vim /etc/default/grub
# then add pcie_aspm=off to the existing parameters on this line:
GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=off"
sudo update-grub
sudo reboot
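To confirm the parameter actually took effect after the reboot, one option is to check the running kernel’s command line. This is a minimal sketch, not part of the original steps: `aspm_disabled` is a helper name made up for illustration, and it only looks for the literal `pcie_aspm=off` parameter in the text it is given.

```shell
# Succeeds if the given kernel command line contains pcie_aspm=off.
# (aspm_disabled is a hypothetical helper name, not a standard tool.)
aspm_disabled() {
    # Pad with spaces so the parameter only matches as a whole word.
    case " $1 " in
        *" pcie_aspm=off "*) return 0 ;;
        *) return 1 ;;
    esac
}

# /proc/cmdline holds the parameters the running kernel was booted with.
if aspm_disabled "$(cat /proc/cmdline 2>/dev/null)"; then
    echo "pcie_aspm=off is active"
else
    echo "pcie_aspm=off not found on the kernel command line"
fi
```

If the parameter is missing, the usual suspects are forgetting to run update-grub or editing a different GRUB variable than the one the boot entry uses.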
Once I brought everything back online, I needed to reassess my backup strategy and update my syncoid script. Sanoid and Syncoid are tools made by Jim Salter (2.5 Admins podcast host) to manage ZFS snapshots and replication. I have only a little clue what I’m doing here, and I’m surprised I’ve managed to get as far as I have. The update was a simple change to the host variable in my script. But when testing, I noticed something peculiar, and potentially catastrophic if I ever need to restore from backup. My script kept causing the datasets on my backup pools to roll back to a certain snapshot, and would then replicate everything past that snapshot again. It seems the issue was a combination of an incorrect setting in sanoid.conf and the --delete-target-snapshots flag in my syncoid command. Switching to --no-sync-snap and reviewing and fixing my sanoid.conf snapshot policy seems to have solved the problem.
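As a rough illustration of the kind of invocation described above, here is a sketch of a syncoid call using --no-sync-snap. The dataset names, the host, and the `build_syncoid_cmd` helper are all hypothetical placeholders, not the actual script. The function only prints the command so it can be reviewed before anything is run.

```shell
# All names below are hypothetical placeholders; substitute your own.
SRC_DATASET="tank/data"          # dataset on the source machine
DEST_HOST="backup@backup-server" # the "host variable" for the target
DEST_DATASET="backup/data"       # dataset on the receiving pool

# Print a syncoid command that replicates existing snapshots only:
#   --no-sync-snap  do not create a temporary sync snapshot
#   --recursive     include child datasets
build_syncoid_cmd() {
    printf 'syncoid --no-sync-snap --recursive %s %s:%s\n' "$1" "$2" "$3"
}

build_syncoid_cmd "$SRC_DATASET" "$DEST_HOST" "$DEST_DATASET"
```

With --no-sync-snap, syncoid relies on snapshots that already exist on the source, so this assumes Sanoid (or something else) is taking them there.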
I had Sanoid taking and managing its own snapshots on each machine, when I should have set ‘autosnap=no’ on the backup side. Duh. Pools on the receiving end shouldn’t be making snapshots; they should only be pruning what syncoid delivers. My guess is that syncoid deleting target snapshots, on top of that, was what threw things off. I will definitely be keeping a closer eye on my zpools, which may be difficult because I thought I already was! The deeper I get into ZFS, the more I realize how many moving parts have to work together to keep everything running seamlessly. It makes me appreciate how expeditiously content is delivered all across the web.
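For reference, a receiving-side policy along those lines might look like the sketch below. The dataset name `backup/data` and the retention counts are illustrative assumptions, not the real configuration; the part that matters is `autosnap = no` paired with `autoprune = yes`. The helper just prints an example that can be compared against an existing sanoid.conf.

```shell
# Print an example sanoid.conf policy for a backup (receiving) pool.
# "backup/data" and the retention numbers are placeholders.
print_backup_policy() {
    cat <<'EOF'
[backup/data]
    use_template = backup
    recursive = yes

[template_backup]
    # Do not take snapshots here; only prune what replication delivers.
    autosnap = no
    autoprune = yes
    hourly = 30
    daily = 90
    monthly = 12
    yearly = 0
EOF
}

print_backup_policy
```

The source machine would keep a separate template with `autosnap = yes`, so snapshots are created in exactly one place.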
Now that things are tightened up a bit more, I have that much more confidence in my data and backup strategy. I wonder what new surprise lurks around the next corner? Perhaps I’ll make my next post about getting my Canon Pixma Pro-10 photo printer working on Linux (if I ever get there, I’m this close!)