VSHN.timer #238: CrowdStrike!
Welcome to another VSHN.timer! Every Monday, 5 links related to Kubernetes, OpenShift, CI / CD, and DevOps; all stuff coming out of our own chat system, making us think, laugh, or simply work better.
What a crazy weekend it was! For many employees, the end of work on Friday came unexpectedly early due to the faulty CrowdStrike update, while countless IT administrators have been putting in extra shifts over the last few days to get Windows systems up and running again. Thousands of travellers were stranded at airports, banks were down and even surgeries had to be postponed. The overall economic damage caused by these outages cannot yet be quantified, but will probably run into the billions.
What has this incident taught us? That extensive testing of updates before rollout is essential!
This may be one of the most serious IT failures of all time, but CrowdStrike and Microsoft are by no means the only companies to have suffered such mishaps. Here are five incidents that illustrate this.
- British Airways
A sudden surge of power after an outage wreaked havoc on British Airways’ data center servers, triggering a colossal IT meltdown that stranded 75,000 passengers on May 27, 2017:
https://www.theguardian.com/business/2017/may/31/ba-it-shutdown-caused-by-uncontrolled-return-of-power-after-outage
- Google
In 2020, Google experienced a significant service outage that, although only lasting around forty-five minutes, had a global impact. Gmail, YouTube, Google Calendar, and Google Home apps all crashed, along with third-party applications relying on Google for authentication. The root cause of the issue was insufficient storage capacity for the company’s authentication services:
https://www.theguardian.com/technology/2020/dec/14/google-suffers-worldwide-outage-with-gmail-youtube-and-other-services-down
- Dyn
One of the biggest distributed denial of service (DDoS) attacks in history hit Dyn, a major DNS provider, in 2016. The assault came in three relentless waves, overpowering the company’s servers and leaving many internet users unable to access popular platforms like Twitter, Spotify, and Netflix:
https://mse238blog.stanford.edu/2018/07/clairemw/the-2016-dyn-attack-and-its-lessons-for-iot-security
- AWS
In December 2021, Amazon Web Services (AWS) suffered a major outage that lasted several hours, disrupting operations for prominent companies like Netflix, Disney, Spotify, DoorDash, and Venmo. Amazon attributed the issue to an automation error that caused multiple systems to malfunction. The outage also blocked access to some cloud services, highlighting that even the largest and most secure cloud providers are not immune to downtime:
https://www.pcmag.com/news/heres-why-a-vital-aws-region-went-down-on-dec-7
- Meta
On October 4, 2021, Meta’s hugely popular social media platforms – Instagram, Facebook, and WhatsApp – went dark for an astonishing six hours. A routine maintenance task gone wrong disconnected all Facebook data centers worldwide, impacting over 10 million users globally:
https://engineering.fb.com/2021/10/05/networking-traffic/outage-details
Were you affected by the CrowdStrike incident? Or did you once contribute to a failure yourself? How do you ensure that updates are rolled out safely? Get in touch with us, and see you next week for another edition of VSHN.timer.
PS: check out our previous VSHN.timer editions about security: #8, #17, #22, #27, #32, #44, #54, #62, #76, #84, #93, #106, #117, #128, #142, #145, #164, #169, #182, #203, #223, #227, #228, #231