22 June 2023 —
Dragoș Abagiu [Site Reliability Engineer — Booking Holdings]
Disasters can be caused by equipment failures, user errors, natural disasters,
malware, and other unexpected events.
At Booking.com, part of the Booking Holdings family of brands, we have established a program to test the impact of these disasters and the recovery mechanisms we have in place. These tests started years ago with simple region evacuations under normal conditions, and later expanded to network-level latency injection, packet dropping, cutting inter-datacenter connections, cutting power feeds, and even region-wide shutdowns.
In this talk, Dragoș Abagiu will share what five years of running this program have taught us: how the reliability of our platform improved, the organizational impact, the automation created for or after the tests, and how the company mitigated real incidents that have occurred since the project started.