What is chaos engineering and how can it benefit businesses?
Chaos engineering can be especially useful for finding single points of weakness, but implementing it requires careful collaboration
Chaos engineering is the discipline of purposefully injecting faults into a system to build confidence in its resiliency. In the hands of IT teams, it can be a powerful tool for probing an organization’s security – but it has to be implemented in the right way to reduce and not add to staff burden.
Chaos engineering isn’t all that different from the stress testing we see in other industries, says Kelly Shortridge, author of Security Chaos Engineering: Sustaining Resilience in Software and Systems. In those industries, organizations run complex simulations across systems at facilities such as water treatment plants or nuclear reactors to see how they behave, end to end, under adverse conditions.
It differs from regular error and vulnerability testing by focusing on unknown scenarios and proactively introducing controlled disruptions to test how a system responds. This lets engineers validate their assumptions about system resilience or, when those assumptions prove wrong, identify hidden weaknesses.
The rise of chaos engineering
Chaos engineering has emerged in response to the growing complexity of software and hardware systems, as organizations recognize the need to look at the bigger picture when testing resilience.
According to data from Cognitive Market Research, the chaos engineering market is set for 9% compound annual growth between 2024 and 2031.
“They realize that looking at components, like an individual library or service’s behavior, won’t give you a real sense of how the system will respond to something bad happening overall,” says Shortridge, also senior director of Portfolio Product Management at Fastly. “That’s why you need to simulate adverse conditions as an input to see how that influences the system as a whole.”
One of the first companies to embrace chaos engineering was Netflix, as a way to ensure that a failure in one of its servers or workloads wouldn’t interrupt streaming for all of its users. The company developed Chaos Monkey, which would purposefully terminate one of its servers while it was running in production to check that the fleet as a whole stayed healthy.
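The idea behind terminating a random server can be illustrated with a toy simulation. This is only a sketch, not Netflix's implementation: the `Fleet` class, the server names and the `terminate_random_server` helper are all hypothetical, and the "fleet" is just an in-memory set rather than real infrastructure.

```python
import random


class Fleet:
    """Toy stand-in for a group of interchangeable servers (hypothetical)."""

    def __init__(self, names):
        self.running = set(names)

    def terminate(self, name):
        self.running.discard(name)

    def handle_request(self):
        # The service counts as healthy while at least one server is running.
        if not self.running:
            raise RuntimeError("no servers left to serve the request")
        return "ok"


def terminate_random_server(fleet, rng):
    """Pick one running server at random and shut it down."""
    victim = rng.choice(sorted(fleet.running))
    fleet.terminate(victim)
    return victim


fleet = Fleet(["web-1", "web-2", "web-3"])
victim = terminate_random_server(fleet, random.Random(0))

# The experiment passes if requests are still served after the loss.
still_serving = fleet.handle_request() == "ok"
```

In a real system the health check would be an actual request against the live service, and the interesting result is whatever breaks that nobody expected to.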
Finding single points of failure
Chaos engineering is particularly useful for finding single points of failure, as Joachim Hershmann, VP analyst at Gartner, explains.
“It’s particularly interesting, in that a lot of organizations have a second server somewhere as their backup, but if you look at the whole chain of applications, it’s surprising how often there’s a point where if that fails, then everything begins to break,” Hershmann tells ITPro.
“It could be there’s no load balancing, or a particular server doesn’t respond how it should. It could even be a person – if there’s only one person who knows what to do in a specific scenario, then that’s a failure point. Identifying these single points of failure is important because if one thing goes wrong, that will impact something else in the system and so on and so on until, well, chaos.”
“Every organization wants to reduce risk, and going forward this is one of the things that will drive the growth in chaos engineering,” says Hershmann. “It may also impact insurance and be used as a way to credibly prove a system is secure.”
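To make the single-point-of-failure idea concrete, here is a minimal sketch that models a chain of applications as a dependency graph and checks which components, if removed, cut the user off from the database. The graph, the node names and both helper functions are assumptions made for illustration; a real chaos experiment would disable actual components rather than reason over a model.

```python
def reachable(graph, start, goal, removed):
    """Depth-first search from start to goal that skips the removed node."""
    if removed in (start, goal):
        return False
    stack, seen = [start], {start}
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        for nxt in graph.get(node, []):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def single_points_of_failure(graph, start, goal):
    """Return every intermediate node whose loss breaks the start-to-goal path."""
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    candidates = nodes - {start, goal}
    return sorted(n for n in candidates if not reachable(graph, start, goal, n))


# Hypothetical chain: two app servers for redundancy, but one load balancer.
graph = {
    "user": ["lb"],
    "lb": ["app-1", "app-2"],
    "app-1": ["db"],
    "app-2": ["db"],
}

spofs = single_points_of_failure(graph, "user", "db")
print(spofs)  # ['lb']
```

The second app server gives redundancy at one layer, yet the lone load balancer in front of it still takes everything down, which is the pattern Hershmann describes.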
Who are the chaos engineers?
As it currently stands, chaos engineering doesn’t have a distinct career path and few people work solely as chaos engineers. Instead, chaos engineering falls largely under platform engineering and is viewed as one of the many tools these engineers have at their disposal to verify that systems work as anticipated.
Initially, the platform engineering team will be responsible for designing and executing chaos experiments, but the organization should ideally move to a federated model down the line. In that model, a central team is responsible for the chaos engineering infrastructure and for assisting teams that need expertise, says David Mooter, principal analyst at Forrester.
“You want creation and execution of chaos tests to be pushed into the application developers,” he explains. “If they remain centralized then one, it won’t scale due to overwhelming that central team, two, it creates an ‘us versus them’ mentality and a finger-pointing culture between the groups when things go wrong, and three, the application developers won’t learn and build intuition for resilience if they’re not directly involved in the experiments.”
Chaos engineering is about learning – from both a technical perspective and an organizational/people perspective – and therefore you should start by bringing together the right stakeholders.
Teams involved could include operations, developers, security, and architects, who, based on their knowledge, should hypothesize what they think will happen. Once the experiment is run, the results can be compared with those hypotheses and the lessons used to inform future decision-making.
“You’re doing the experiment to not only learn what actually happens to the system, but also if you have a contingency plan in place for that,” Hershmann explains, “because maybe your expectations were correct, but maybe not.
“The second learning is if you have a contingency plan in place, does it work? You take this, make changes as necessary then re-run the experiment. This time hopefully everything will kick in and serve its purpose as expected.”
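The hypothesize, run, compare loop described above can be captured in a small record. This is a sketch under stated assumptions: the `Experiment` class, the experiment name and the observation keys are invented for illustration, and the observed values are filled in by hand rather than collected from a real system.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    name: str
    hypothesis: dict  # what the team expects, keyed by observation name
    observed: dict    # what actually happened once the fault was injected

    def surprises(self):
        """Return each observation that differs from the hypothesis."""
        return {
            key: (expected, self.observed.get(key))
            for key, expected in self.hypothesis.items()
            if self.observed.get(key) != expected
        }


exp = Experiment(
    name="kill primary database",
    hypothesis={"failover_triggered": True, "on_call_paged": True},
    observed={"failover_triggered": True, "on_call_paged": False},
)

gaps = exp.surprises()
print(gaps)  # {'on_call_paged': (True, False)}
```

Anything in `gaps` is a candidate fix: change the system or the contingency plan, then re-run the same experiment and check that the gap has closed.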
Starting small and where to focus
When looking to dip your toes into chaos engineering for the first time, experts recommend starting with something small, like verifying your cookies are working as anticipated on your login page.
“If you start with a much grander experiment, which I can’t recommend against enough when you’re starting out, that’s going to be expensive, because you’re going to have to loop in several teams, and level up knowledge,” Shortridge says.
She also recommends starting with a staging or test environment, because “if you do things Netflix-style in a production environment and something does go wrong, it’s a lot more urgent to clean things up.”
When it comes to what to test, Shortridge says you should focus on the areas that would have the biggest business impact.
“It should be the business-critical things that will cost your business money, time, or reputation if something goes wrong,” she explains. “You want to be confident that they’re working as anticipated and you have everything in place to minimize the impact of potential failures.”
Shortridge gives the example of organizations maintaining their compliance with regulations like GDPR by testing their assumptions around a system's ability to anonymize tokens.
“You could inject an unencrypted token – ideally in a test rather than a production environment – to see how the system handles it. The key thing is seeing it end-to-end. You’re not just looking at whether there was an initial rejection, but also downstream.
“Was there an alert? Did your humans notice the alert? What could they do about it – was there enough context for them to intervene and take action? Was the compliance team notified? It’s that end-to-end picture of the system that’s really important,” she concludes.
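As a rough illustration of that end-to-end view, the sketch below injects a plaintext token into a toy handler and then checks every downstream step, not just the initial rejection. Everything here is hypothetical: the `handle_token` function, the `enc:` prefix used to mark an encrypted token, and the event names are assumptions for the example, and the handler deliberately omits the compliance notification so the check has something to find.

```python
def handle_token(token, events):
    """Toy handler that records what the system did with a token."""
    # Assumption for this sketch: encrypted tokens carry an "enc:" prefix.
    if token.startswith("enc:"):
        events.append("accepted")
        return True
    events.append("rejected")
    events.append("alert_raised")
    # Deliberate gap for illustration: nobody notifies the compliance team.
    return False


# The end-to-end expectations, not just the first rejection.
REQUIRED = ["rejected", "alert_raised", "compliance_notified"]

events = []
accepted = handle_token("plaintext-token-123", events)

# Any required step that never happened is a finding from the experiment.
missing = [step for step in REQUIRED if step not in events]
print(missing)  # ['compliance_notified']
```

The rejection itself worked, but the experiment still surfaces a gap further down the chain, which is the kind of finding a check on the first response alone would miss.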
Keri Allan is a freelancer with 20 years of experience writing about technology and has written for publications including the Guardian, the Sunday Times, CIO, E&T and Arabian Computer News. She specialises in areas including the cloud, IoT, AI, machine learning and digital transformation.