
QCon New York Sessions - Incident Response with Etsy

“Incident response – what makes it so terribly difficult?” – John Allspaw at QCon New York

“Anomaly response does not happen the way we might imagine it does,” John Allspaw, CTO at Etsy, said in his opening keynote presentation at QCon New York, “Incident Response: Trade-offs Under Pressure.”

Can we trust tools?

One of Allspaw's first points was that organizations cannot simply rely on tools to understand how and why incidents occur. Instead, teams need to rely on processes and reasoning in order to truly respond to anomalies. And they cannot, he said, treat these outages as a mystery that keeps developing over time.

John Allspaw on tools at QCon New York

Allspaw believes that tools designed for incident response may never actually simplify the process.

“An outage is not a detective story,” Allspaw said. “It’s static, and it’s there.”

A model of reasoning

In order to properly deal with outage-causing anomalies, Allspaw recommended that organizations implement a “model of reasoning” that does not “distinguish between diagnosis and therapy.”

John Allspaw presents "model of reasoning" at QCon New York

Allspaw presents this model as an ideal strategy for anomaly response.

Avoiding “cognitive fixation”

Listeners were also warned not to fall into the traps of “thematic vagabonding” and “cognitive fixation,” in which those debugging the code become so wrapped up in fixing individual bugs and symptoms that they fail to dig deeper and discover the actual root cause of the issue.

“As one thread of diagnosis comes in, you start running to more,” Allspaw said. He said that avoiding this requires developers and testers to communicate about what they are seeing and not get stuck alone on a path of just fixing bug after bug.

In fact, he provided a list of “prompts” that teams can use to frame particular questions, dividing the questions into four “stages” of incident response: observations, hypotheses, coordination and suggesting actions. By asking these questions, team members may be able to avoid “cognitive fixation” and get to the root of the problem.

Allspaw's "questions to ask" at QCon New York

Allspaw provided a list of question ideas and prompts that can help move anomaly response forward.

Final notes

Allspaw also talked about the importance of linking anomalies to any known, recent changes in the code or application and, more so, of having peers review your hypotheses.

“Validate the hypothesis that most easily comes to mind,” he said, while also adding that anyone who begins to build confidence about discovering a certain cause of an outage should always check that confidence with a peer review.
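The first of these points, linking an anomaly to known recent changes, can be illustrated with a minimal Python sketch. It is not something shown in the talk: the `Change` record, the `recent_changes` helper, the two-hour window and the sample data are all hypothetical, chosen only to show the idea of listing changes that landed shortly before an anomaly so a peer can review them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    """A deploy or config change (hypothetical record, not from the talk)."""
    description: str
    deployed_at: datetime


def recent_changes(changes, anomaly_at, window=timedelta(hours=2)):
    """Return changes deployed within `window` before the anomaly, newest first."""
    candidates = [
        c for c in changes
        if timedelta(0) <= anomaly_at - c.deployed_at <= window
    ]
    return sorted(candidates, key=lambda c: c.deployed_at, reverse=True)


# Made-up sample data for illustration only.
changes = [
    Change("update search ranking config", datetime(2024, 1, 1, 9, 0)),
    Change("deploy checkout service", datetime(2024, 1, 1, 13, 30)),
    Change("rotate cache keys", datetime(2024, 1, 1, 14, 45)),
    Change("deploy after anomaly", datetime(2024, 1, 1, 15, 30)),
]
anomaly_at = datetime(2024, 1, 1, 15, 0)

suspects = recent_changes(changes, anomaly_at)
for change in suspects:
    print(change.deployed_at.isoformat(), change.description)
```

A list like this only produces candidates. In keeping with Allspaw's advice, each one is a hypothesis to validate and to check with a peer, not a conclusion.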

The "punch line" of John Allspaw's talk at QCon New York

Allspaw sums up his presentation at QCon New York by saying teams need to rethink how they approach incident and anomaly response.


Good overview describing seven criteria. A more expansive evaluation framework can be found at

The white paper describes seven distinct criteria categories. PaaS offerings can be evaluated and compared across the following criteria categories:
● Cloud Characteristics: Measures characteristics (i.e. on-demand self-service, resource pooling, rapid elasticity, and measured service) used to distinguish Cloud solutions from traditional application solutions.

● Cloud Dimensions: Measures how widely the solution can be shared (i.e. private, public, community), who is responsible for PaaS environment management (i.e. internal, external), and where the PaaS is located (i.e. on-premise, outsourced)

● Production Ready: Measures PaaS maturity and suitability for enterprise, mission critical level use

● DevOps Activities and Lifecycle Phases: Measures how to design, construct, deploy, and manage applications and services using DevOps practices (i.e. continuous integration, continuous delivery, automated release management, and incremental testing)

● Cloud Architecture: Measures architecture principles, concepts, and patterns enabling applications to dynamically execute parallel workloads across a highly distributed environment

● Platform Services: Measures how completely the PaaS satisfies development of complex applications by providing comprehensive application middleware components and services

● Programming Model: Measures programming languages and frameworks that facilitate building applications and services exhibiting Cloud characteristics

Hi, very nice and interesting article! But I see there is some misleading information about the Microsoft Windows Azure Platform. First, I see Windows Azure as an open public PaaS provider as well as IBM. Also, since June 7th, 2012, Azure has officially supported not only .NET and PHP but also Java and other languages.
Hope my comments help