“Incident response – what makes it so terribly difficult?” – John Allspaw at QCon New York
“Anomaly response does not happen the way we might imagine it does,” John Allspaw, CTO at Etsy, said in his opening keynote presentation at QCon New York, “Incident Response: Trade-offs Under Pressure.”
Can we trust tools?
One of Allspaw’s first points was that organizations cannot simply rely on tools to understand how and why incidents occur. Instead, teams need to rely on processes and reasoning to respond to anomalies effectively. Nor, he said, can they treat an outage as a mystery that keeps developing over time.
“An outage is not a detective story,” Allspaw said. “It’s static, and it’s there.”
A model of reasoning
In order to properly deal with outage-causing anomalies, Allspaw recommended that organizations implement a “model of reasoning” that does not “distinguish between diagnosis and therapy.”
Avoiding “cognitive fixation”
Listeners were also warned not to fall into the traps of “thematic vagabonding” and “cognitive fixation,” in which those debugging the code become so wrapped up in fixing individual bugs and symptoms that they fail to dig deeper and discover the actual root cause of the issue.
“As one thread of diagnosis comes in, you start running to more,” Allspaw said. He said that avoiding this requires developers and testers to communicate about what they are seeing and not get stuck alone on a path of just fixing bug after bug.
In fact, he provided a list of “prompts” that teams can use to frame particular questions, dividing them into four “stages” of incident response: observations, hypotheses, coordination and suggesting actions. By asking these questions, team members may be able to avoid “cognitive fixation” and get to the root of the problem.
Allspaw also talked about the importance of linking anomalies to any known, recent changes in the code or application and, more importantly, of having peers review your hypotheses.
“Validate the hypothesis that most easily comes to mind,” he said, while also adding that anyone who begins to build confidence about discovering a certain cause of an outage should always check that confidence with a peer review.
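As a rough illustration of linking an anomaly to known, recent changes, the sketch below lists the changes that landed shortly before an anomaly was observed. It is not something shown in the talk: the `Change` record, the `recent_changes` helper, the two-hour window and the sample data are all hypothetical, and a real team would pull this information from its own deploy log.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Change:
    """A hypothetical record of a deploy or config change."""
    description: str
    deployed_at: datetime


def recent_changes(changes, anomaly_at, window=timedelta(hours=2)):
    """Return changes deployed within `window` before the anomaly, newest first."""
    candidates = [
        c for c in changes
        if anomaly_at - window <= c.deployed_at <= anomaly_at
    ]
    return sorted(candidates, key=lambda c: c.deployed_at, reverse=True)


# Made-up sample data, for illustration only.
changes = [
    Change("search index config update", datetime(2024, 1, 1, 9, 0)),
    Change("checkout service deploy", datetime(2024, 1, 1, 13, 30)),
    Change("cache TTL change", datetime(2024, 1, 1, 14, 10)),
]
anomaly_at = datetime(2024, 1, 1, 14, 15)

suspects = recent_changes(changes, anomaly_at)
for change in suspects:
    print(f"{change.deployed_at:%H:%M}  {change.description}")
```

The output is only a list of candidate hypotheses, not a diagnosis. In keeping with Allspaw’s advice, each candidate would still need to be validated and checked with a peer.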