Welcome to another VSHN.timer! Every Monday, 5 links related to Kubernetes, OpenShift, CI/CD, and DevOps; all stuff coming out of our own chat system, making us think, laugh, or simply work better.
This week we’re going to talk about how teams deal with failures, sometimes of seemingly catastrophic proportions.
1. We will start with this gem by Lisa Seelye from Red Hat. It is a mind-bending article, providing an answer to the basic question: „Why do we externalize our successes but internalize our failures?“ Failures in systems are a matter of „when“ rather than „if“, and healthy teams, including their management, need to embrace them — and to stop calling those situations „mistakes“, treating them instead as „learning opportunities“.
2. Failure can happen to anyone, even to the biggest IaaS provider on the planet. AWS suffered a major outage of its Kinesis service in the US-EAST-1 region on November 25th, 2020. And since Kinesis is used by many other major pieces of AWS infrastructure, this failure rippled through other services like dominoes: first Cognito, dealing with user authentication; then CloudWatch, dealing with systems monitoring; and finally Lambda and EventBridge, both of which depend on CloudWatch. The post-mortem of this outage reads like a detective novel, the fascinating story of a hard day at the core of the cloud.
3. How to deal with failures in OpenShift clusters? The Performance and Scalability team at Red Hat has published a short summary of the three biggest outages they faced in production environments: a rogue DaemonSet taking down a 2000-node cluster, an etcd database that refused to write things down, and the sad results of running etcd on slow storage. Extreme examples for sure, but interesting lessons nonetheless, even though one would rather read about them than experience them first-hand.
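That last etcd failure mode comes down to fsync latency: etcd must persist every write to its write-ahead log, and slow disks stall the whole cluster. The proper way to benchmark this is a tool like fio, but here is a rough Python sketch (our own illustration, not Red Hat's method) to get a feel for a disk's write+fsync latency:

```python
import os
import statistics
import tempfile
import time

def fsync_latencies(path: str, iterations: int = 100,
                    block: bytes = b"x" * 2048) -> list[float]:
    """Time write+fsync cycles (in ms) on the filesystem holding `path`.

    A rough stand-in for a real benchmark such as fio; the point is that
    etcd fsyncs its WAL on every write, so consistently high fsync
    latency on the backing disk directly throttles the cluster.
    """
    latencies = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for _ in range(iterations):
            start = time.perf_counter()
            os.write(fd, block)
            os.fsync(fd)
            latencies.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
    return latencies

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        lat = sorted(fsync_latencies(os.path.join(d, "wal-probe")))
        print(f"median {statistics.median(lat):.2f} ms, "
              f"worst {lat[-1]:.2f} ms")
```

If the worst-case numbers sit in the tens of milliseconds, etcd on that disk is going to have a bad time.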
4. DevOps engineering brings its own load of issues to consider. Take for example the issues related to DNS records, their propagation and validity, and the availability of the systems referenced by them. Blake Stoddard from HEY tells the story of a whole day spent at work „banging his head against the desk“ after failing to RTFM. In this case the manual was RFC 1034, so please go and re-read it now before you hit that „deploy“ button once again.
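The RFC 1034 detail that bites people here is the TTL: resolvers may keep serving a cached record until its TTL runs out, so a changed record only „propagates“ as caches expire. A toy model of that caching rule (the names and structure are ours, purely illustrative):

```python
import time
from dataclasses import dataclass

@dataclass
class CachedRecord:
    value: str         # e.g. an IP address
    ttl: int           # seconds the record may be cached (RFC 1034)
    fetched_at: float  # when the cached answer was obtained

class ToyResolverCache:
    """Minimal model of TTL-based DNS caching: until the TTL expires,
    a resolver answers with the old record, no matter what the
    authoritative server now says."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._cache: dict[str, CachedRecord] = {}

    def resolve(self, name: str, authoritative_lookup) -> str:
        rec = self._cache.get(name)
        if rec and self._clock() - rec.fetched_at < rec.ttl:
            return rec.value  # still fresh: served from cache
        value, ttl = authoritative_lookup(name)
        self._cache[name] = CachedRecord(value, ttl, self._clock())
        return value
```

The practical consequence: if a record has a one-day TTL, lowering it to 60 seconds only takes effect once every cache holding the old record has already expired it — which is exactly why you lower TTLs well before a migration, not during one.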
5. Raphael Michel from Pretix explains how they recovered from a data loss caused by a video file overwritten by mistake… by grepping the raw contents of a disk, looking for the header of the FLV video format. Spoiler alert: it took 7 hours, and it’s absolutely epic.
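File carving like this works because FLV files open with a fixed magic signature: the bytes „F“, „L“, „V“ followed by a version byte (0x01). A minimal sketch (our own, much simpler than a 7-hour recovery) of scanning a raw byte stream for candidate offsets, chunked so it could crawl a disk image far larger than RAM:

```python
FLV_MAGIC = b"FLV\x01"  # 'F' 'L' 'V' + version 1: start of an FLV header

def find_flv_offsets(stream, chunk_size: int = 1 << 20):
    """Yield absolute byte offsets where the FLV signature appears.

    Reads the stream in chunks, keeping a small overlap between chunks
    so a signature straddling a chunk boundary is not missed.
    """
    overlap = len(FLV_MAGIC) - 1
    offset = 0   # total bytes consumed from the stream so far
    tail = b""   # last few bytes of the previous chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        data = tail + chunk
        start = 0
        while (i := data.find(FLV_MAGIC, start)) != -1:
            yield offset - len(tail) + i
            start = i + 1
        tail = data[-overlap:]
        offset += len(chunk)
```

Pointed at something like `open("/dev/sdb", "rb")`, each yielded offset is a spot worth copying out and inspecting as a possible start of the lost video.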
How does your team manage failure? Do you keep a log or do you write post mortems after major outages? Do you have any failure handling tips you would like to share with the community? Get in touch with us through the form at the bottom of this page, and see you next week for another edition of VSHN.timer.
PS: would you like to receive VSHN.timer every Monday in your inbox? Sign up for our weekly VSHN.timer newsletter.
PS2: would you like to watch VSHN.timer on YouTube? Subscribe to our channel vshn.tv and give a „thumbs up“ to our videos.
PS3: check out our previous VSHN.timer editions about incidents and failures: #32, #41, #49, and #66.