Posted / February 17, 2025 Last Processed / November 22, 2025

I am my own chaos monkey.

What’s the point of using and learning Linux if you’re not gonna break things through your own carelessness? Even while writing this, I watched a NetworkChuck video where he seemed almost frustrated, because he just “wanted to fool around and break things” and kept running into issues setting up an environment. It’s in our nature as tinkerers and hackers.

Now, one might think breaking things is just dumb, especially when you’re learning to configure and run something as important as your own digital infrastructure. Even a homelab can be mission-critical: mine hosts local entertainment in place of streaming services, plus critical storage holding irreplaceable family photos and documents. Why risk carelessness with something so important? Why would a business ever want to hire someone prone to breaking things?

Isn’t that exactly why Netflix built its own Chaos Monkey? It’s a program literally engineered to break production systems, leaving engineers scrambling to clean up the destruction while millions of people sit at home streaming, clueless to the mayhem happening on the servers. It only works at Netflix’s scale because it acts as a forcing function: knowing something can be knocked out at any moment, engineers have to build in enough redundancy for the service to keep running while they fix the problem. That’s the rationale laid out on the tool’s GitHub page and in Wikipedia’s article on chaos engineering. I’m sure the reality is more complicated, and it sounds like an insane idea, but in practice it makes so much sense. Anything can break at any time, and fear of the chaos monkey keeps you prepared. In my case, the chaos monkey is myself.

I don’t need to program a chaos monkey; I am my own. I can’t tell you how many times, during a late-night Linux session after a drink or two, I’ve overwritten a directory and had to restore the data from backup. Then there’s my most recent chaos engineering success: I was fiddling with replacing failed disks, and a proper ‘zpool offline’ command on the disk kept failing. Against all good sense, after repeated unmount and offline failures I simply ripped the disk out, rationalizing that it was no different from a sudden catastrophic disk failure, and cautiously watched the pool resilver successfully. I knew that even if it failed and I lost the pool, I could rebuild from the several layers of backups I keep on-site and off-site.

The key to being able to have fun and tinker like this, where you can fuck up regularly, is having several layers of backups and redundancy: containerize services, separate storage from them when possible, and back that storage up as well. While we strive to make our computers and servers ephemeral, storage and data simply cannot be. Backups are simple in concept but hard in practice, and often the most expensive part of a setup, which makes them a notoriously hard sell. But they’re the only way to prevent catastrophe.