
QCon New York Sessions – Fault injection with Microsoft

When it comes to testing software, many of today's organizations rely heavily on comprehensive testing, especially unit testing, to minimize the risk of outages. But in this session, Michalis Zervos of Microsoft talked to audience members about what some consider the "next generation" of building software resiliency: deliberately forcing the faults you anticipate to actually occur in your software.

"Fault injection," as Zervos refers to it, can be performed on everything from virtual machines to custom applications to hardware. It is a practice Zervos' team at Microsoft actively uses and promotes in order to see not just how a particular service is affected by an unwanted event, but also how dependent services and software are affected.
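
To make the idea concrete, here is a minimal, hypothetical sketch of application-level fault injection in Python (not code from the talk): a wrapper that fails or slows a configurable fraction of calls to a dependency so the caller's resiliency can be observed.

```python
import random
import time
from functools import wraps

def inject_faults(failure_rate=0.05, extra_latency_s=2.0):
    """Fail or slow down a fraction of calls to the wrapped dependency."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            roll = random.random()
            if roll < failure_rate:
                # Simulate the dependency being unavailable.
                raise ConnectionError(f"injected fault in {func.__name__}")
            if roll < failure_rate * 2:
                # Simulate a slow dependency.
                time.sleep(extra_latency_s)
            return func(*args, **kwargs)
        return wrapper
    return decorator

@inject_faults(failure_rate=0.05)
def fetch_user_profile(user_id):
    # The real implementation would call the downstream service here.
    return {"id": user_id}
```

A wrapper like this only exercises failures the application can see for itself; the point of the talk is that faults can also be injected at the virtual machine, operating system and hardware levels.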

Fault injection benefits

Zervos explains some of the reasons to adopt fault injection alongside testing.

“We create ‘storms in the cloud’ to see how it performs under pressure and failure and use that to create resiliency,” he said. And according to Zervos, fault injection can be used for more than just testing resiliency. It can also be used for things like testing new features, training and verifying staged deployments.

Zervos covered the numerous faults that teams could consider injecting, including creating a kernel panic, "hooking" and disrupting critical service code, crashing critical processes and even pulling the power plug on your data center. He also suggested a few publicly available tools that development teams can use to make the process easier, such as Consume.exe, Sysinternals tools and "managed code fault injection" through TestApi, a library of test and utility APIs.
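
Those tools are Windows-specific, but the fault types themselves are general. As a rough, platform-neutral illustration of two of them, the sketch below crashes a named process and holds memory to create resource pressure (in the spirit of Consume.exe); the process name and the third-party psutil package are assumptions made for the example, not details from the talk, and code like this should only ever run against an isolated test system.

```python
import psutil  # third-party process utility library; assumed available

def crash_process(name):
    """Kill every process whose executable name matches, simulating a crash fault."""
    for proc in psutil.process_iter(["name"]):
        if proc.info["name"] == name:
            proc.kill()  # abrupt termination of a "critical" process

def consume_memory(megabytes):
    """Hold a block of memory to create resource pressure."""
    return bytearray(megabytes * 1024 * 1024)  # caller keeps the reference alive

# Example, only in a disposable test environment:
# crash_process("my-critical-service")   # hypothetical process name
# ballast = consume_memory(512)
```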

Zervos did warn audience members that fault injection cannot be performed without certain precautions and considerations if it is to produce accurate results and avoid creating new problems. He cautioned that teams still need to follow fundamental security principles such as least privilege, make extensive use of code signing, create a "safety net" that automatically removes faults should they get out of a tester's control, and have a "kill switch" available, which he said can save developers and testers "a lot of grief."
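
A minimal sketch of the "safety net" and "kill switch" ideas, assuming a simple flag file as the switch (the path and the fault interface are hypothetical): every injected fault carries a deadline after which it is removed automatically, and the switch aborts it early.

```python
import os
import time

KILL_SWITCH_PATH = "/tmp/fault-injection-kill-switch"  # hypothetical operator-controlled flag

class TimedFault:
    """Run a fault with an automatic expiry and a global kill switch."""

    def __init__(self, apply_fault, remove_fault, max_duration_s=60):
        self.apply_fault = apply_fault    # callable that starts the fault
        self.remove_fault = remove_fault  # callable that cleans it up
        self.max_duration_s = max_duration_s

    def run(self):
        self.apply_fault()
        deadline = time.monotonic() + self.max_duration_s
        try:
            while time.monotonic() < deadline:
                if os.path.exists(KILL_SWITCH_PATH):
                    break  # operator pulled the kill switch
                time.sleep(1)
        finally:
            # Safety net: the fault is always removed, even if something above fails.
            self.remove_fault()
```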

Zervos also stressed the importance of extensive verification and reporting when it comes to fault injection, and told audience members that it is useful to manage fault injection from a centralized location.

“If you are not able to verify what happened, you don’t get the most out of your system,” he said.

System architecture for fault injection

Zervos presents his own system architecture in relation to a centralized fault management service.
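
To illustrate what a client of such a centralized fault management service might look like, the sketch below submits a fault request and later reports the observed outcome so the run can be verified. The endpoint, field names and use of the requests library are assumptions made for the example, not details from Zervos' architecture.

```python
import requests  # third-party HTTP client; assumed available

FAULT_SERVICE = "https://faults.example.internal/api"  # hypothetical central service

def schedule_fault(target, fault_type, duration_s):
    """Ask the central service to inject a fault and return its tracking id."""
    resp = requests.post(f"{FAULT_SERVICE}/faults", json={
        "target": target,             # e.g. a VM, service or process name
        "type": fault_type,           # e.g. "kill-process", "memory-pressure"
        "duration_seconds": duration_s,
    })
    resp.raise_for_status()
    return resp.json()["fault_id"]

def report_outcome(fault_id, recovered, notes=""):
    """Record what actually happened so the run can be verified centrally."""
    requests.post(f"{FAULT_SERVICE}/faults/{fault_id}/report", json={
        "recovered": recovered,
        "notes": notes,
    }).raise_for_status()
```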

One of Zervos' final points was that it is not enough to simply perform fault injection every now and again. He stressed that teams need to integrate fault injection as a continuous part of the production cycle and find creative ways to encourage teams to adopt the practice. One suggestion he made was the idea of "recovery games," in which one team member simulates an attack on a particular system and another team member, often a trainee, must record what occurs and take the proper steps to mitigate the risk of an outage. By implementing these types of programs, Zervos said, his organization was able to increase adoption of fault injection and also gain helpful insights into team members' behavior, such as the fact that some spent too much time debugging and not enough time actually mitigating the problem.

“It needs to be part of the engineering process and part of the culture of the company,” Zervos said.
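
One lightweight way to make fault injection continuous rather than an occasional exercise is a scheduler that periodically picks a fault from a catalog, runs it and records whether the system recovered. The sketch below is illustrative only and assumes the catalog entries expose name, run() and verify() attributes.

```python
import logging
import random
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("fault-scheduler")

def run_continuously(fault_catalog, interval_s=3600):
    """Every interval, inject one randomly chosen fault and log whether the system recovered."""
    while True:
        fault = random.choice(fault_catalog)
        log.info("injecting fault: %s", fault.name)
        fault.run()                 # inject the fault (assumed interface)
        recovered = fault.verify()  # check that the system healed itself (assumed interface)
        log.info("fault %s finished, recovered=%s", fault.name, recovered)
        time.sleep(interval_s)
```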

Fault injection recovery game goals

Zervos provides examples of the goals that can be achieved through adoption and training programs such as “recovery games.”

John Billings, a technical lead on one of the infrastructure teams at Yelp who attended Zervos' talk, said he thoroughly enjoyed the session and believes that fault injection is "the next step in actually testing resiliency of production systems."

Billings, who also gave a talk at QCon on the "human side of microservices," said he particularly liked that Zervos spent his time discussing the general principles of fault injection rather than specific technologies. And while his company already makes use of fault injection techniques, he wants to push adoption of the strategy even further internally and hopes that others will do the same.

“Tests can only cover so much that you’ve thought about beforehand,” he said. “If you actually have fault injection happening all the time in production, you get that additional level of reliability that otherwise would be very difficult to achieve.”

Billings also said he liked the idea of introducing "fault injection games" as an approach to encouraging the adoption of this strategy, but believes that these adoption strategies must align with a company's individual culture. For instance, he noted hearing about the idea of a "badge-based system" that awards teams particular badges for completing and adopting certain testing and production techniques.

“You have to experiment and just see what works for your particular culture and your company,” he said.

