UML co-founder Grady Booch sometimes describes himself as a ‘software archeologist.’ As an IBM fellow and IEEE Software columnist, he gets to interview teams to find out how important systems were built.
The noted software methodologist recently turned his attention to IBM's Watson, the question-answering natural language processing system that beat Jeopardy champions in a televised ''head-to-head'' competition earlier this year.
Last month at Innovate 2011, Booch went ''under the hood'' to uncover the system architecture of Watson. Booch described key project stratagems and artifacts. These include important design decisions such as Watson's use of pipe and filter architecture patterns for parallelization, and development of an XML-based UIMA-AS (Unstructured Information Management Architecture with Asynchronous Scale-out) scheme.
Booch also briefly discussed next steps for Watson. Now that it has garnered the Jeopardy crown, its underlying architecture and code are expected to be refactored for use in commercial systems that handle unstructured data. Medical diagnosis is a likely first target, with others under consideration.
''I am just a story teller,'' Booch told the Innovate crowd, noting he was not actually involved in the Watson design. He said his job is to ''come in from the outside to uncover the design decisions'' made by the Watson system architects.
What is the essence of architecture? Decisions!
From a 5,000-foot view, IBM's Watson begins by analyzing a question, generating a hypothesis, narrowing down possible answers and then presenting best possible answers according to a scored level of confidence. (This being Jeopardy, of course, the machine takes an answer and tries to come up with a correct question.)
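As a rough illustration of those four stages, here is a minimal Python sketch. It is not Watson's code: the function names, the keyword-overlap scoring and the tiny in-memory ''knowledge'' dictionary are all assumptions made up for this example, standing in for what are far more elaborate components in the real system.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    answer: str
    confidence: float  # toy score between 0 and 1


def analyze_question(clue):
    """Stage 1: reduce the clue to lowercase keywords (toy stand-in for NLP)."""
    return [w.strip(".,?!").lower() for w in clue.split() if len(w) > 3]


def generate_hypotheses(keywords, knowledge):
    """Stage 2: propose every entry that shares a keyword with the clue."""
    candidates = []
    for answer, facts in knowledge.items():
        overlap = len(set(keywords) & facts)
        if overlap:
            candidates.append(Candidate(answer, overlap / len(keywords)))
    return candidates


def narrow(candidates, threshold=0.3):
    """Stage 3: discard weakly supported candidates (threshold is arbitrary)."""
    return [c for c in candidates if c.confidence >= threshold]


def rank(candidates, top_n=3):
    """Stage 4: present the best candidates, highest confidence first."""
    return sorted(candidates, key=lambda c: c.confidence, reverse=True)[:top_n]


def answer(clue, knowledge):
    """Run the four stages in order."""
    keywords = analyze_question(clue)
    return rank(narrow(generate_hypotheses(keywords, knowledge)))


if __name__ == "__main__":
    # Hypothetical knowledge base: answer -> set of associated keywords.
    knowledge = {
        "Grady Booch": {"software", "archeologist", "fellow"},
        "Mary Shaw": {"software", "architecture"},
    }
    for c in answer("This software archeologist is an IBM fellow", knowledge):
        print(c.answer, c.confidence)
```

The point of the sketch is only the shape: each stage consumes the previous stage's output, and the final list carries a confidence score with every answer.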
IBM's Watson comes up with answers – or questions – very quickly, in something like 2.7 seconds.
Watson has a lot of parts – moving and otherwise. There are about 1 million lines of code in Watson, said Booch. These are written mostly in Java and C++, but other languages are involved. The code translates into about 130 software components. From a hardware point of view, there are 90 IBM Power 750 servers carrying the processing load.
''Architecture is the essence,'' said Booch. ''It's where the 'load-bearing walls' are.''
''Ultimately, architecture is about decisions,'' he said.
Among the key design decisions the Watson architects made was to employ a pipe and filter architecture pattern along the lines of that described by software luminaries such as Mary Shaw, David Garlan and Gregor Hohpe. This pattern can be readily parallelized to handle the great mass of data that must be processed. Another important design decision was the selection of UIMA.
UIMA-AS builds upon the relatively little-known UIMA standard, which started life in IBM Research and set the stage for the DeepQA (Deep Question Answering) project, and ultimately the Watson project, becoming an OASIS standard along the way.
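To make the pipe and filter idea concrete, here is a small Python sketch. It is an illustration only, not how Watson implements it: the three filters and the helper names ''pipeline'' and ''run_parallel'' are invented for this example. Each filter takes one input and returns one output, the pipe is simply function composition, and because no filter shares state with another, separate items can flow through the same pipeline concurrently.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def tokenize(text):
    """Filter 1: split raw text into lowercase tokens."""
    return text.lower().split()


def drop_stopwords(tokens):
    """Filter 2: remove a few common words (toy stopword list)."""
    stop = {"the", "a", "of", "is"}
    return [t for t in tokens if t not in stop]


def count(tokens):
    """Filter 3: reduce the token list to its length."""
    return len(tokens)


def pipeline(*filters):
    """Compose filters left to right; each output is piped into the next."""
    def run(item):
        return reduce(lambda data, f: f(data), filters, item)
    return run


def run_parallel(pipe, items, workers=4):
    """Push independent items through the same pipeline concurrently.

    pool.map returns results in input order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(pipe, items))


if __name__ == "__main__":
    pipe = pipeline(tokenize, drop_stopwords, count)
    print(run_parallel(pipe, ["The essence of architecture",
                              "Watson is the winner of Jeopardy"]))
```

A thread pool keeps the sketch short; for CPU-bound filters in Python a process pool, or separate machines as in Watson's case, would be the more realistic choice.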
UIMA-AS includes frameworks, infrastructure and components for the analysis and annotation of unstructured data of the kind a Jeopardy contestant - or a medical doctor - might have to grapple with. UIMA-AS now comes under the aegis of the Apache Software Foundation and, as its name implies, it has improved on the scalability available in its predecessor version.
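For readers unfamiliar with what ''analysis and annotation'' means in practice, the sketch below shows the general idea of stand-off annotation in Python: the original text is left untouched, and each analysis component adds typed spans that point into it by character offset. This is a simplified illustration and not the UIMA API; the ''Document'' and ''Annotation'' classes and both annotator functions are hypothetical names invented here.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    type: str   # label assigned by an annotator, e.g. "Year"
    begin: int  # start offset into the document text
    end: int    # end offset (exclusive)


@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)

    def covered_text(self, ann):
        """Return the slice of text an annotation points at."""
        return self.text[ann.begin:ann.end]


def annotate_years(doc):
    """Toy annotator: mark four-digit years between 1000 and 2099."""
    for m in re.finditer(r"\b(1[0-9]{3}|20[0-9]{2})\b", doc.text):
        doc.annotations.append(Annotation("Year", m.start(), m.end()))


def annotate_capitalized(doc):
    """Toy annotator: mark capitalized words as a crude name detector."""
    for m in re.finditer(r"\b[A-Z][a-z]+\b", doc.text):
        doc.annotations.append(Annotation("CapitalizedWord", m.start(), m.end()))


if __name__ == "__main__":
    doc = Document("Watson won Jeopardy in 2011")
    for annotator in (annotate_years, annotate_capitalized):
        annotator(doc)  # annotators run one after another over the same document
    for ann in doc.annotations:
        print(ann.type, doc.covered_text(ann))
```

Because every annotator reads the same text and only appends spans, annotators can be chained or distributed without rewriting the source document.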
This open source software is crucial to Watson's operation. For communications, the Watson designers used UIMA-AS across JMS, a DeepQA protocol for accessing large in-memory data sets, and an Indri distributed search protocol.
Future refactoring for IBM Watson code
From a use case point of view, Watson has three primary sets of use cases, according to UML co-founder Booch. He describes these as source ingestion (or information gathering), training (or machine learning) and game playing (or the answering itself).
For now, Watson is an IBM Research effort, but systems based on similar software architecture underpinnings could well enter the enterprise. Besides medical diagnosis, finance and telecom are seen as Watson targets.
A lot of work has to be done to get there. New development must occur. Refactoring of code must take place in order to elevate certain features to the status of ''first-class architecture elements.'' This is not uncommon when research projects move to commercial production. Code will be redone so that it is more readable and maintainable. Common methods of configuration management will be applied. All of this, in Booch's words, is part of turning a research project into a product line.
[Ed. Note: This week IBM posted additional background on Watson architecture (''How Watson answers a question'') on YouTube.]