Coping with Reality: Chaos Engineering in Action
Tai Huynh
14:45 - 15:30
Pavillon
What if the best way to build resilient systems is to break them intentionally? This 45-minute talk challenges conventional thinking about resilience by diving into the hands-on execution of Chaos Engineering through Gamedays: structured, high-impact events where teams deliberately inject failure to uncover weaknesses before they surface in production and affect customers. Despite its playful name, a Gameday is anything but a game; it’s a methodical, collaborative, and sometimes nerve-racking exercise designed to stretch the limits of our systems and validate our assumptions. We’ll explore what it takes to run an effective Gameday, from selecting the right applications and environments to defining steady states and executing controlled, disruptive chaos experiments. Attendees will gain insight into how Datadog orchestrates Gamedays: curating participants based on system architecture, aligning on steady-state definitions, and incrementally scaling failure scenarios from isolated latency injections to full-scale zonal disruptions. We’ll also discuss a crucial evolution: transitioning from manual, ad-hoc failure injection to a scalable, automated platform that enables safe, rapid, and transparent execution. The ultimate goal is a shift in mindset, from preventing and fearing failure to preparing for it and using it as a catalyst for resilience. By embracing Chaos Engineering through Gamedays, teams do more than prevent outages: they gain deep, actionable insights, foster cross-team collaboration, and build a culture where failure isn’t a setback but a stepping stone.