Welcome to another VSHN.timer! Every Monday, 5 links related to Kubernetes, OpenShift, CI / CD, and DevOps; all stuff coming out of our own chat system, making us think, laugh, or simply work better.
This week we’re going to talk about how to handle operations and post-mortems in a humane fashion.
1. We all know that whatever can go wrong, inevitably will go wrong. Mature IT teams get into the habit of documenting system failures in “post-mortems”… even though they are seldom read afterwards. Why is that so? Matías E. Fernández provides some insight into creating post-mortems and why they require a certain culture to survive and thrive.
2. In this Brave New World of Kubernetes, teams need to keep an eye on their clusters at all times. But, of all the metrics available in Prometheus, which ones to choose? Charlie Fiskeaux II enumerates the 12 most important metrics to monitor, and why.
3. Speaking of Kubernetes and Prometheus, and particularly useful for all of you microservice fans out there, here’s a guide to setting Service Level Objectives with Prometheus and Linkerd on the Cloud Native Computing Foundation blog.
5. The tool of the week is driftctl, a new open-source project to help you monitor Infrastructure as Code changes without having to execute terraform plan in crontabs and expecting a zero return code at every run.
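For readers who have never set up the cron-based approach mentioned in the last item, here is a minimal sketch of what such a check might look like. It assumes Terraform is installed and the working directory is already initialized; the function names and the status labels are our own, purely for illustration. It relies on the -detailed-exitcode flag of terraform plan, which exits with 0 when there is nothing to change, 1 on error, and 2 when the plan contains changes.

```python
import subprocess

# Exit codes of `terraform plan -detailed-exitcode`:
# 0 = no changes, 1 = error, 2 = changes present.
# The labels on the right are illustrative, not Terraform terminology.
PLAN_STATUS = {0: "in-sync", 1: "error", 2: "drift"}


def classify_plan_exit_code(code: int) -> str:
    """Map a plan exit code to a status label (hypothetical helper)."""
    return PLAN_STATUS.get(code, "unknown")


def check_drift(workdir: str) -> str:
    """Run a plan in `workdir` and report the resulting status.

    `workdir` is assumed to be an initialized Terraform directory.
    """
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(result.returncode)
```

A scheduled job could call check_drift on each environment and alert on anything other than "in-sync", which is roughly the chore that a dedicated drift tool aims to take off your hands.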
Which Prometheus metrics do you pay the most attention to in your monitoring? How do you manage and control drift in your infrastructure? Would you like to share some post-mortem tips and tricks with the community? Get in touch with us, and see you next week for another edition of VSHN.timer.
PS: would you like to receive VSHN.timer every Monday in your inbox? Sign up for our weekly VSHN.timer newsletter.
PS2: check out our previous VSHN.timer editions about incidents and operations: #32, #41, #49, #66, and #75.