In the featured video, Seth Eliot, Principal Reliability Solutions Architect on the AWS Well-Architected team, explains how to test resiliency using chaos engineering and how to set up failure injection testing to validate the resiliency of your service.
At 00:16 he formally starts the session by quoting Werner Vogels (CTO at Amazon):
“We needed to build systems that embrace failure as a natural occurrence”
He explains the quote by saying that failure is part of any sufficiently large and complex system, such as a data center. Failures occur in data centers all the time, for example when hard drives or power supplies fail.
In the quote, ‘embrace failure’ means that we do not let failure affect our services; instead, we find ways to mitigate it. Here ‘we’ means AWS, which is responsible for abstracting those hard drives and power supplies away from customers.
This AWS responsibility is the resiliency of the cloud. When you build your workloads and systems, you are responsible for making them resilient in the cloud, using the services and resources AWS supplies. You must also test that they are resilient, and that is what the session is about.
AWS Well-Architected Framework
From 1:20 Seth explains the Well-Architected Framework and says that if you want to build resilient services, this framework is a great place to start.
The Framework has five pillars, but this session focuses on the ‘Reliability’ pillar. The Reliability pillar defines resiliency as “The ability of the system to recover from infrastructure or service disruptions”.
At 2:31 Seth highlights two of the five design principles of the Reliability pillar:
- Automatically recover from failure
- Test recovery procedures
To show how to use AWS to automatically recover from failure, from 3:07 Seth introduces the Well-Architected Tool. The first question he covers from the tool is how you use fault isolation to protect your workload. The best practice here is to deploy your workload to multiple locations, which in AWS are the Availability Zones.
AWS clusters its data centers in 24 places around the world. The Availability Zones are completely separated in terms of connectivity, physical location and power supply, and are designed so that a failure in one does not affect the others. The workload deployed across these zones uses a three-tier architecture.
So what happens when there is a failure, for example when an EC2 instance fails? Elastic Load Balancing continuously checks the health of these instances and of the application running on them. When an instance fails its health checks, Elastic Load Balancing stops routing traffic to it.
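The routing behavior described above can be illustrated with a small simulation. This is only a sketch: the `Instance` and `LoadBalancer` classes and the round-robin policy are assumptions made for illustration, not how Elastic Load Balancing is implemented.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    instance_id: str
    zone: str
    healthy: bool = True  # result of the most recent health check


class LoadBalancer:
    """Toy round-robin balancer that skips instances failing health checks."""

    def __init__(self, instances):
        self.instances = list(instances)
        self._counter = 0

    def route(self):
        # Only instances that passed their last health check receive traffic.
        targets = [i for i in self.instances if i.healthy]
        if not targets:
            raise RuntimeError("no healthy instances available")
        target = targets[self._counter % len(targets)]
        self._counter += 1
        return target


if __name__ == "__main__":
    fleet = [
        Instance("i-a", "zone-a"),
        Instance("i-b", "zone-b"),
        Instance("i-c", "zone-c"),
    ]
    lb = LoadBalancer(fleet)
    fleet[0].healthy = False  # simulate a failed health check in zone A
    print(sorted({lb.route().zone for _ in range(6)}))
```

Once the zone A instance is marked unhealthy, requests are only ever routed to the remaining two zones.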
From 7:51 Seth shows how auto-recovery works in AWS using the example of three parallel instances and RDS (Relational Database Service), where three RDS instances act as primary, standby and read replica. If the primary becomes unavailable or unhealthy, the standby is promoted to primary and clients continue making requests seamlessly. That was a brief overview of how to design systems or workloads to automatically recover from failure.
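The promotion of a standby can be pictured with a toy model. The `DatabaseCluster` class and the node names below are hypothetical; RDS performs failover internally and exposes nothing like this class.

```python
class DatabaseCluster:
    """Toy model of a primary, a standby and a read replica."""

    def __init__(self):
        self.primary = "db-primary"
        self.standby = "db-standby"
        self.read_replica = "db-replica"
        self.unhealthy = set()

    def mark_unhealthy(self, node):
        self.unhealthy.add(node)

    def write_endpoint(self):
        # Promote the standby if the current primary is unhealthy.
        if self.primary in self.unhealthy:
            if self.standby is None or self.standby in self.unhealthy:
                raise RuntimeError("no healthy node available for promotion")
            self.primary, self.standby = self.standby, None
        return self.primary
```

Clients keep calling `write_endpoint()` and never need to know which node is currently the primary, which is the sense in which requests continue seamlessly.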
Testing Resiliency with Chaos Engineering
At 9:45 Seth gives the definition of chaos engineering: “The discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production”. He then presents chaos engineering as an application of the scientific method, using a loop diagram with the steps: steady state, hypothesis, run experiment, verify, and improve.
From 12:23 the presenter gives some history of chaos engineering. It dates back to 2005, when Game Days were used to train teams to respond to a disaster, and continued in 2010 with Netflix Chaos Monkey, which randomly destroyed servers in different places. The Principles of Chaos were first published on GitHub in 2016. Nowadays we can write our own Availability Zone failure simulations.
Because chaos engineering follows the scientific method, at 15:20 Seth starts a demo of two experiments. The first is a component failure and the second is the failure of a whole Availability Zone.
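The loop of steady state, hypothesis, experiment and verification can be sketched as a small harness. The `Experiment` type and `run_experiment` function are hypothetical names introduced here for illustration; the talk does not show such code.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    hypothesis: str
    steady_state: Callable[[], bool]    # True when the system behaves normally
    inject_failure: Callable[[], None]  # the failure event to run


def run_experiment(experiment: Experiment) -> str:
    # Steady state: do not experiment on a system that is already unhealthy.
    if not experiment.steady_state():
        return "aborted"
    # With the hypothesis stated, run the experiment.
    experiment.inject_failure()
    # Verify: did the steady state hold despite the failure?
    return "confirmed" if experiment.steady_state() else "refuted"


if __name__ == "__main__":
    healthy_zones = {"a", "b", "c"}
    exp = Experiment(
        hypothesis="If one EC2 instance dies, availability is not impacted",
        steady_state=lambda: len(healthy_zones) >= 1,
        inject_failure=lambda: healthy_zones.discard("a"),
    )
    print(exp.hypothesis, "->", run_experiment(exp))
```

A "refuted" result is the useful one: it points at the improvement to make before the loop is run again.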
Experiment: Component Failure and Availability Zone Failure
The architecture for the demo is the one outlined in the section above. There are three Availability Zones, each containing an EC2 instance and RDS. The instances run a simple web server, which gets an image from an S3 bucket and serves it to clients. There are two hypotheses for the experiment:
- If one EC2 instance dies, then availability will not be impacted
- If an entire availability zone dies, then availability will not be impacted
From 16:52 Seth demonstrates the component and Availability Zone failures. As mentioned earlier, there are three EC2 instances, each with its own ID, in three different Availability Zones (A, B and C). A simple website serves an image and also displays Availability Zone metadata; refreshing the page shows the three instances in the three zones serving the site.
Seth now runs a script that takes an instance ID and immediately starts terminating that instance, which is in zone A. On refreshing the website, only zones B and C appear in the metadata. Traffic is no longer served from zone A because its instance is unhealthy.
Overall, there was no availability impact, which confirms the first hypothesis. This was the work of Elastic Load Balancing, which spread the load across the other two instances. If we now look at the Auto Scaling group, we see that it still has three instances; maintaining that count is its job. After a moment, back in the instances view, a new instance in zone A has replaced the dead one and is running and taking traffic (which can also be confirmed on the website).
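The instance-termination step of this first experiment could look roughly like the following with the AWS SDK for Python (boto3). This is a sketch, not the script used in the demo: `fail_instance` is a hypothetical helper, and the EC2 client is passed in so that the function can be exercised without touching a real account.

```python
def fail_instance(ec2_client, instance_id):
    """Terminate one EC2 instance and return its reported state name."""
    response = ec2_client.terminate_instances(InstanceIds=[instance_id])
    # The response lists each instance with its current and previous state.
    return response["TerminatingInstances"][0]["CurrentState"]["Name"]


# Usage against a real account (requires boto3 and AWS credentials):
#   import boto3
#   fail_instance(boto3.client("ec2"), "<instance-id>")
```

Termination is destructive and cannot be undone, so a script like this should only be pointed at instances that an Auto Scaling group is expected to replace.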
At 21:02 Seth starts the second experiment, based on the hypothesis about losing an entire Availability Zone. This time zone C is the target and, as before, a script is run to kill it. If we go back to the EC2 instances view and refresh, we see that with zone C out of commission, the instance in zone C has been killed as well. A new instance starts launching in zone A to replace it. Checking the website shows no impact on availability. This is the resilience discussed at the start, demonstrated through chaos engineering.
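The replacement behavior seen in this experiment can be modeled with a few lines of simulation. The function below is a simplified assumption about how lost capacity is replaced (fewest instances first, ties broken by zone name); the real Auto Scaling placement logic is more involved.

```python
from collections import Counter


def replace_lost_capacity(instance_zones, failed_zone, enabled_zones, desired):
    """Return the zones of running instances after a zone failure.

    instance_zones holds one zone name per running instance.
    """
    survivors = [z for z in instance_zones if z != failed_zone]
    usable = [z for z in enabled_zones if z != failed_zone]
    if not usable:
        raise ValueError("no Availability Zone left to launch into")
    while len(survivors) < desired:
        counts = Counter(survivors)
        # Launch into the usable zone that currently has the fewest instances.
        survivors.append(min(usable, key=lambda z: (counts[z], z)))
    return survivors
```

With one instance each in zones `a`, `b` and `c` and a desired count of three, failing zone `c` leaves `a` and `b` tied, so the replacement goes to `a`, which matches what the demo showed.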
How the Lab Simulates Failure Events
From 24:55 Seth introduces other ways to write failure events. The session used the AWS command line interface, but the same can be done with the AWS SDK from Python or Java. There is also a PowerShell version, which uses the AWS Tools for PowerShell.
The fourth way to simulate the events is not part of the lab environment: AWS Systems Manager, whose main job is to run automations for us on AWS resources.
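Starting such an automation from Python might look like the sketch below. `start_automation_execution` is the boto3 Systems Manager call for this, but the helper name, the document name and the `InstanceId` parameter are assumptions: the parameters an automation accepts depend on the document being run.

```python
def start_failure_automation(ssm_client, document_name, instance_id):
    """Start a Systems Manager automation and return its execution ID."""
    response = ssm_client.start_automation_execution(
        DocumentName=document_name,
        # Automation parameters are passed as lists of strings.
        # "InstanceId" is an assumed parameter name for this sketch.
        Parameters={"InstanceId": [instance_id]},
    )
    return response["AutomationExecutionId"]


# Usage against a real account (requires boto3 and AWS credentials);
# "Hypothetical-FailInstance" stands in for an automation document you own:
#   import boto3
#   start_failure_automation(boto3.client("ssm"), "Hypothetical-FailInstance", "<instance-id>")
```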
Making all of it Real
From 26:15 Seth adds a dose of reality and explains how real conditions might differ from what we have designed so far. We want to get as close to production as possible. What does that look like? First, consider traffic patterns: which traffic should we use when running chaos experiments on a workload? We could record actual user traffic and replay it. Another option is shadow testing. We can also test in production.
As for the environment in which to run chaos tests, AWS and the cloud are well suited to providing one. Finally, there are the simulations themselves. By definition they may not look like the real thing, so we need to keep watching production: how the system operates there and how it reacts when real failures occur. We should keep iterating on and improving our failure events so that the simulations are as close to production as possible.
What Should We do to Make Tests More Resilient Using Chaos Engineering?
Wrapping up, from 29:55 Seth recommends running these kinds of tests regularly and automatically. Just like recovery, testing should be automated. His recommendations are:
- Run tests as part of the CI/CD pipeline
- Start with staging (test, pre-production) environments
- Consider which tests to run in production
- Test resiliency using Chaos Engineering
- Conduct Game Days regularly
In closing, Seth clears up any doubt about the word ‘chaos’ by quoting the authors of the book on chaos engineering:
“This is not about making chaos. It is about making the chaos inherent in the system visible.”