When listing skills on their resumes, software testers and engineers may want to consider that an increasing number of companies are looking for a new kind of software expert. These companies aren't just looking for someone who can fix buggy systems and crashed services as problems arise; they are also looking for someone who will spend time breaking those services and systems on purpose.
Purposefully creating situations that can cause services and software to crash or malfunction is called fault injection. This is a QA paradigm that two software engineers from Microsoft believe can mitigate the risks associated with modern software deployment and management, especially in relation to applications and services in the cloud, by helping engineers observe and find fixes for these failures in a controlled manner rather than dealing with them for the first time at an unexpected moment.
How the cloud complicates things
While the cloud has improved businesses' ability to quickly release high-quality, highly capable apps, experts warn that there is a caveat -- those apps are becoming too complex to test manually.
Software is getting more complicated, to the point where the processes cannot be mentally modeled, placing an increased burden on QA, according to Casey Rosenthal, an engineering manager at Netflix. In fact, a company's software systems may be "black boxes" even to the software engineers in charge of building them.
"I think the industry as a whole is moving to a space where systems are getting more and more complex," Rosenthal said. "To the point where even to the software engineers building some systems, those systems themselves are going to be black boxes. You can't introspect a deep learning algorithm, for example; not in any meaningful way."
Added to that complexity is the rapid rate of software deployment. The increased pace of deployment means that organizations need to go beyond testing and find a way to verify that those services are behaving as expected in production, according to Microsoft software engineer Michalis Zervos.
Dmitri Klementiev, principal software engineering manager at Microsoft and leader of the service resilience team, agreed, and said that the rate at which software is being developed and shipped to customers makes it too difficult to do a complete test of the software in a timely fashion. In fact, he doesn't believe that any service developers have the time to test all of their software thoroughly before shipping.
A big part of the issue is the fact that when services go down, they often go down at the most inconvenient times -- namely, in the middle of the night, Rosenthal pointed out.
"Nobody wants to get a call at 3:00 in the morning when their system stops working because one of their servers spontaneously disappeared or stopped responding," he said.
This is not just an inconvenience for the developer, Zervos noted, but can lead to bad decisions being made by a tired mind -- making the consequences even worse.
A new approach is needed across the industry in order to address increasing complexity, Klementiev said. One way to deal with this is to adopt what Zervos and Klementiev call "fault injection."
Fault injection, according to Zervos, is a method by which software engineers, developers or testers can simulate the effect of errors that commonly occur when working with things like cloud services, such as forced increases in resource pressure or a network interruption. There is no prescribed method for fault injection, necessarily -- it is simply a matter of pushing systems to the limit, sometimes to the point of breakage, just to see what will happen and to determine how you should respond as a team.
A novel idea or a recycled concept?
The idea behind fault injection is not new. People have been performing what could be considered variations of fault injection at Microsoft alone for close to 10 years, according to Zervos and Klementiev.
"When we had traditional boxed software here at Microsoft and [at] other companies, they were, for example, doing fault injection in the code ... forcing the code to throw an error, forcing the code to return a wrong value to see how the rest of the application would behave," Zervos said.
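The code-level approach Zervos describes, forcing a function to throw an error or return a wrong value and then watching how the caller copes, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tooling used at Microsoft: FaultInjector, lookup_price and checkout_total are hypothetical names invented for this example.

```python
import contextlib
import functools


class FaultInjector:
    """Hypothetical helper: switches faults on and off at named call sites."""

    def __init__(self):
        self._faults = {}

    @contextlib.contextmanager
    def inject(self, name, *, raises=None, returns=None):
        # While the "with" block is active, calls to the named point either
        # raise the given exception or return the given (wrong) value.
        self._faults[name] = (raises, returns)
        try:
            yield
        finally:
            self._faults.pop(name, None)

    def point(self, name):
        # Decorator that marks a function as a place where faults can be injected.
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                fault = self._faults.get(name)
                if fault is None:
                    return func(*args, **kwargs)  # no fault active: normal call
                raises, returns = fault
                if raises is not None:
                    raise raises
                return returns
            return wrapper
        return decorator


faults = FaultInjector()


@faults.point("lookup_price")
def lookup_price(item):
    # Stand-in for a real dependency (hypothetical price table).
    return {"apple": 1.25}[item]


def checkout_total(items):
    # The code under test: it should degrade gracefully, returning None
    # instead of crashing or summing a nonsensical price.
    total = 0.0
    for item in items:
        try:
            price = lookup_price(item)
        except Exception:
            return None
        if not isinstance(price, (int, float)) or price < 0:
            return None
        total += price
    return total


if __name__ == "__main__":
    print(checkout_total(["apple", "apple"]))  # normal behavior
    with faults.inject("lookup_price", raises=RuntimeError("injected fault")):
        print(checkout_total(["apple"]))       # forced error
    with faults.inject("lookup_price", returns=-1):
        print(checkout_total(["apple"]))       # forced wrong value
```

The faults are scoped to a with block so that they are always switched off again, even if the test itself fails partway through.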
The term fault injection has existed for a long time, Klementiev agreed, but the nature of software delivery in the cloud has changed the ideology. In the past, when running on one- to two-year software delivery cycles, time was on their side when it came to testing products and dealing with bugs.
"We [had] incomparably much more time to test ... we could delay our shipping until we fixed all the found and approved bugs," Klementiev said. "These days, when we ship cloud software, we may ship every day."
In light of these accelerated delivery cycles, fault injection can help fill the gaps where traditional techniques, like unit tests, fall short, Zervos and Klementiev agreed. Those traditional tests don't always provide insight into how a service's dependencies, rather than just its code, affect the overall system, Zervos noted.
"If we want to test, for example, what would happen on this login service that I have -- an authentication service -- if my CPU pressure went up while people were actually hitting my service and increased [it to] 90% CPU usage ... it is almost impossible to try with any of these [traditional] methods," Zervos said.
Fault injection could be compared to the testing method known as "stress testing," Zervos added -- creating more traffic or putting more stress on a service externally. But even this type of test will not provide the kind of information or insight fault injection can provide, including a look at how dependencies will behave in a given situation.
One of the most beneficial aspects of fault injection is that engineers can see the result of a bug or crash and determine a good course of action well before it would occur in real life, Klementiev said. These crashes are bound to happen, and fault injection simply allows engineers to find problems faster and to deal with them in a controlled manner.
This can involve faults that are specific to software as well, Zervos pointed out, such as unexpected exceptions or errors like an OutOfMemoryException. Fault injection allows developers and engineers to see how the rest of the code, as well as any dependent services, will behave in response to that event.
"For example ... you're trying to talk to your SQL server and some packets are lost or the connection is very slow," Zervos said. "This is very easy to test with fault injection."
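A slow or lossy database connection like the one in Zervos' example can be imitated by wrapping the call to the dependency. The Python sketch below is an assumption-laden illustration rather than any vendor's tool: FlakyConnection and query_with_retry are hypothetical names, the "database" is a plain function, and a dropped packet is modeled simply as a ConnectionError.

```python
import random
import time


class FlakyConnection:
    """Hypothetical wrapper that injects dropped calls and extra latency."""

    def __init__(self, query_fn, drop_rate=0.0, delay_s=0.0, seed=0, sleep=time.sleep):
        self._query_fn = query_fn         # the real (or stubbed) dependency
        self.drop_rate = drop_rate        # probability that a call is dropped
        self.delay_s = delay_s            # extra latency added to each call
        self._rng = random.Random(seed)   # seeded so test runs are repeatable
        self._sleep = sleep               # injectable so tests need not wait
        self.attempts = 0                 # how many calls were attempted

    def query(self, sql):
        self.attempts += 1
        if self._rng.random() < self.drop_rate:
            raise ConnectionError("injected fault: connection dropped")
        if self.delay_s > 0:
            self._sleep(self.delay_s)     # simulate a slow link
        return self._query_fn(sql)


def query_with_retry(conn, sql, attempts=3):
    # The code under test: retry a fixed number of times, then give up.
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return conn.query(sql)
        except ConnectionError as err:
            last_error = err
    raise last_error


if __name__ == "__main__":
    def fake_db(sql):
        # Stub standing in for a real SQL server.
        return [("row",)]

    slow = FlakyConnection(fake_db, delay_s=0.2)
    print(query_with_retry(slow, "SELECT 1"))

    dead = FlakyConnection(fake_db, drop_rate=1.0)
    try:
        query_with_retry(dead, "SELECT 1")
    except ConnectionError as err:
        print("gave up after", dead.attempts, "attempts:", err)
```

Passing the sleep function in as a parameter is a design choice for testability: an automated test can substitute a stub and check how the caller handles latency without actually waiting.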
The arguments against fault injection
However, adopting this discipline is not a free ride, Zervos and Klementiev said, particularly when it comes to getting people to actually perform fault injection. Many developers simply can't or don't want to take the time to learn a new way of testing. That makes it critical to understand how to automate fault injection and how to make the process as easy as possible for anyone in the organization who needs to perform it.
Another problem, Zervos said, is that he finds that many engineers are understandably skeptical when encouraged to inject faults into production environments with real customer traffic. To deal with this, it is best to slowly build confidence in your ability to perform fault injections in a test environment. Then, when confidence is high enough, start performing fault injections in production environments.
However, even if the tester, developer or engineer is confident enough in their own ability to do a fault injection in production, they will still need to have total support from the leadership in their company since they are, in essence, breaking things, Klementiev said. This can be a big cultural change, but it is still critical that the leaders buy in.
"If they don't pay [for] it now with a little risk, they can pay much more as a company later," Klementiev said.
This may not be such a simple thing for small companies with limited budgets, customer bases and development or test teams, Zervos and Klementiev noted. However, even if they can't formally perform fault injection, there are ad hoc ways to go about it, Klementiev suggested.
"I think that the simplest fault injection that anyone can do is to just go ahead and pull the plug," Klementiev said. "Just go to [the] data center and turn it off. Or turn off a rack of machines. This is something that I know many companies are doing."
Are developers and testers on board?
This type of discipline has yet to catch on across the industry, Klementiev and Zervos agreed. However, the discipline is heading in a positive direction, according to Zervos, and he believes that the nature of software engineering will make it absolutely essential to jump on board.
"I think that many companies will want to have them [fault injection tests] in the near future as they realize the benefits of investing in resilience and fault injection," he said. "You see an increasing number of enterprises, [over] the last two years, putting [in the] resources and working in this area. This is a sign to me that fault injection testing is going to get bigger and bigger in the cloud service world."
And this is not just a matter of learning the techniques, Klementiev pointed out. Rather, it's about adopting a new philosophy when it comes to software development in the cloud. As he put it, there needs to be a new discipline in the industry, one that becomes part of the culture of cloud services.