Chaos testing is a disciplined approach to testing a system’s integrity by proactively simulating failures in a given environment and identifying weaknesses before they lead to unplanned downtime or a negative user experience.
It can also be defined as a method of testing distributed software that purposely introduces failures and faulty scenarios to verify how the system behaves in the face of random disruptions. These disruptions can cause applications to respond unpredictably and break under pressure.
Principles of chaos testing
Chaos engineering is made up of five main principles:
- Identify a steady state: Define a “steady state”, or control, as a measurable system output that indicates normal working behaviour (in most cases, an error rate well below one percent).
- Hypothesize that the system will hold its steady state: Once a steady state has been determined, hypothesize that it will continue in both the control and the experimental conditions.
- Ensure minimal impact to your users: During chaos testing, the goal is to actively try to break or disrupt the system, but it’s important to do so in a way that minimises any negative impact on your users. Your team is responsible for keeping every test focused on a specific area and should be ready to respond to incidents as needed.
- Introduce chaos: Once you are confident that your system is working, your team is prepared, and the impact areas are contained, you can start running your chaos tests. Introduce different variables to simulate real-world scenarios, from a server crash to malfunctioning hardware and severed network connections. It’s best to test in a non-production environment so you can monitor how your service or application reacts to these events without directly affecting the live version and active users.
- Monitor and repeat: With chaos engineering, the key is to test consistently, introducing chaos to pinpoint any weaknesses within your system. The goal of each experiment is to try to disprove the hypothesis from the second principle and, by fixing whatever it exposes, build a more reliable system in the process.
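The five principles above can be sketched as a small experiment loop. What follows is a minimal Python illustration under stated assumptions, not a real chaos tool: `ToyService`, `error_rate`, `run_experiment`, and `ERROR_RATE_THRESHOLD` are hypothetical names invented for this example, and the injected "fault" is simply a flag that takes a simulated primary backend offline.

```python
ERROR_RATE_THRESHOLD = 0.01  # assumed steady state: error rate below one percent


class ToyService:
    """Hypothetical stand-in for a real system with a primary and an optional fallback."""

    def __init__(self, has_fallback: bool) -> None:
        self.has_fallback = has_fallback
        self.primary_up = True

    def handle_request(self) -> bool:
        """Return True when the request succeeds."""
        if self.primary_up:
            return True
        # Primary is down: the request only succeeds if a fallback exists.
        return self.has_fallback


def error_rate(service: ToyService, requests: int = 1000) -> float:
    """Send a batch of requests and return the fraction that failed."""
    failures = sum(1 for _ in range(requests) if not service.handle_request())
    return failures / requests


def run_experiment(service: ToyService) -> dict:
    # Principle 1: confirm the steady state before touching anything.
    baseline = error_rate(service)
    if baseline >= ERROR_RATE_THRESHOLD:
        raise RuntimeError("system is not in steady state; aborting experiment")

    # Principle 2: the hypothesis is that the error rate stays below the
    # threshold while the fault is active.
    # Principles 3 and 4: inject one contained fault and always roll it back.
    service.primary_up = False
    try:
        during_fault = error_rate(service)
    finally:
        service.primary_up = True

    # Principle 5: record the outcome so the experiment can be repeated
    # and compared over time.
    return {
        "baseline": baseline,
        "during_fault": during_fault,
        "hypothesis_held": during_fault < ERROR_RATE_THRESHOLD,
    }


if __name__ == "__main__":
    for has_fallback in (True, False):
        print(has_fallback, run_experiment(ToyService(has_fallback)))
```

In this toy setup the service with a fallback keeps its steady state during the fault, while the one without a fallback fails every request, which disproves the hypothesis and points at the weakness to fix. In a real system the fault injection and the metric would come from your own tooling and monitoring; the `try`/`finally` rollback is the part worth keeping, since it limits the impact if the measurement itself fails.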