Tough Analytics? Watson to the Rescue

“Quickly Watson, get your service revolver!” Is Watson about to put business intelligence out of its misery? Is the good doctor about to surpass Sherlock Holmes in his ability to solve life’s enduring mysteries? Or are we in jeopardy of falling into another artificial intelligence rabbit hole?

Yes, I know. Although I haven’t found a reference to prove it, I’m pretty sure that IBM Watson, the computer that recently won “Jeopardy!”, is named after one of the founding fathers of IBM, Thomas J. Watson Sr. or Jr., rather than Sherlock Holmes’ sidekick. But the questions above remain highly relevant.

IBM Watson is, of course, an interesting beast. The emphasis in the popular press has been on the physical technology specs: 10 refrigerator-sized cabinets containing approximately 3,000 CPU cores, 15 TB of RAM and 500 GB of disk, running at about 80 teraflops and cooled by two industrial air-conditioning units. But in comparison to some of today’s “big data” implementations, IBM Watson is pretty insignificant. eBay, for example, is running up to 20 petabytes of storage. As of 2010, Facebook’s Hadoop cluster was running on 2,300 servers with over 150,000 cores and 64 TB of memory between them. The world’s current (Chinese) supercomputer champion is running at 2.5 petaflops.
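To put those numbers side by side, here is a small back-of-the-envelope sketch in Python. It uses only the figures quoted above and assumes decimal units (1 TB = 1,000 GB, 1 PB = 1,000 TB); the variable names are mine, purely for illustration.

```python
# Rough scale comparison using only the figures quoted in the text.
# Assumption: decimal units (1 TB = 1,000 GB; 1 PB = 1,000 TB).
GB = 1
TB = 1_000 * GB
PB = 1_000 * TB

# Watson, as described in the popular press (approximate values).
watson = {"cores": 3_000, "ram": 15 * TB, "disk": 500 * GB, "teraflops": 80}

# The "big data" comparison points mentioned in the text.
ebay_storage = 20 * PB
facebook = {"cores": 150_000, "ram": 64 * TB}
top_supercomputer_teraflops = 2_500  # 2.5 petaflops

ratios = {
    "eBay storage vs. Watson disk": ebay_storage / watson["disk"],
    "Facebook cores vs. Watson cores": facebook["cores"] / watson["cores"],
    "Facebook memory vs. Watson RAM": facebook["ram"] / watson["ram"],
    "Top supercomputer vs. Watson speed": top_supercomputer_teraflops / watson["teraflops"],
}

for label, ratio in ratios.items():
    print(f"{label}: about {ratio:,.1f}x")
```

On these figures, eBay’s storage is roughly 40,000 times Watson’s disk, while the supercomputer champion is only about 31 times faster.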

On the other hand, a perhaps more telling comparison is with the size and energy consumption of the human brains that Watson beat, but certainly did not outclass, in the quiz show!

However, what’s really more interesting from a business intelligence viewpoint is the information stored, the architecture employed and the effort expended in optimizing the processing and population of the information store.

We know from the type of knowledge needed in “Jeopardy!” and, indeed, from the possible future applications of the technology discussed by IBM that the raw information input to the system was largely unstructured information, or soft information, as I prefer to call it. During the game, Watson was disconnected from the Internet, so its entire knowledge base was only 500 GB in size. This suggests the use of some very effective artificial intelligence and learning techniques to condense a much larger natural language information base into a much more compact and usable structure prior to the game. Over a period of more than four years, IBM researchers developed DeepQA, a massively parallel, probabilistic, evidence-based architecture that enables Watson to extract and structure meaning from standard textbooks, encyclopedias and other documents. When we recall that the natural language used in such documents contains implicit meaning, is highly contextual, and is often ambiguous or imprecise, we can begin to appreciate the scale of the achievement. A wide variety of AI techniques, such as temporal reasoning, statistical paraphrasing, and geospatial reasoning, were used extensively in this process.
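To give a feel for what “probabilistic, evidence-based” can mean in practice, here is a toy Python sketch. It is emphatically not DeepQA: the evidence names, weights, bias and threshold are all invented for illustration. It only shows the general idea of scoring several candidate answers against separate pieces of evidence, combining the scores into a single confidence, and answering only when that confidence is high enough.

```python
import math

# Hypothetical evidence weights and bias, invented for illustration only.
WEIGHTS = {"text_match": 2.0, "date_agrees": 1.0, "place_agrees": 1.5}
BIAS = -2.5


def confidence(evidence):
    """Combine per-evidence scores (each 0..1) into one probability
    using a logistic function over a weighted sum."""
    z = BIAS + sum(WEIGHTS[name] * score for name, score in evidence.items())
    return 1.0 / (1.0 + math.exp(-z))


def best_answer(candidates, threshold=0.5):
    """Return (answer, confidence) for the top-ranked candidate,
    or (None, confidence) if even the best one is below the threshold."""
    ranked = sorted(candidates, key=lambda c: confidence(c[1]), reverse=True)
    answer, evidence = ranked[0]
    conf = confidence(evidence)
    return (answer, conf) if conf >= threshold else (None, conf)


# Made-up candidates with made-up evidence scores.
candidates = [
    ("candidate A", {"text_match": 0.9, "date_agrees": 0.8, "place_agrees": 0.7}),
    ("candidate B", {"text_match": 0.4, "date_agrees": 0.2, "place_agrees": 0.1}),
]

answer, conf = best_answer(candidates)
print(answer, round(conf, 2))
```

The point of the threshold is that a system of this kind can decline to answer when the evidence is weak, rather than guess.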

Dr. David Ferrucci, leader of the research project, states that no database of questions and answers was used nor was a formal model of the world created in the project. However, he does say that structured data and knowledge bases were used as background knowledge for the required natural language processing. It makes sense to me that such knowledge, previously gathered from human experts, would be needed to contextualize and disambiguate the much larger natural language sources as Watson pre-processed them. Watson’s success in the game suggests to me that IBM have succeeded in using existing human expertise, probably gathered in previous AI tools, to seed a much larger automated knowledge mining process. If so, we are on the cusp of an enormous leap in our ability to reliably extract meaning and context from soft information and to use it in ways long envisaged by proponents of artificial intelligence.

What this means for traditional business intelligence is a moot point. Our focus and experience are directed mainly towards structured, or hard, data. By definition, such data has already been processed to remove or minimize ambiguity in context or content by creating and maintaining a separate metadata store, as I’ve described elsewhere.

However, there is no doubt that the major growth area for business intelligence over the coming years is soft information, which, according to IDC, is growing at a compound annual rate of over 60%, about three times as fast as hard information, and which already accounts for over 95% of the information stored in enterprises. It is in this area, I believe, that Watson will make an enormous impact as the technology, already based on the open-source Apache UIMA (Unstructured Information Management Architecture), moves from research to full-fledged production. There is already significant pent-up demand to gain business advantage by mining and analyzing such information. Progress in releasing the value tied up in soft information has been slowed by a lack of appropriate technology. That is something that Watson and its successors will certainly change.
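Growth rates like these compound quickly. As a rough sketch, the Python below projects the soft share of stored information forward, starting from the 95% share and 60% annual growth quoted above. The 20% rate for hard information is my reading of “about three times as fast”, and the function name and five-year horizon are arbitrary choices for illustration.

```python
def soft_share(years, soft_start=95.0, hard_start=5.0,
               soft_growth=0.60, hard_growth=0.20):
    """Return soft information as a fraction of the total after `years`.

    soft_start / hard_start: today's split (95% / 5%, from the text).
    soft_growth: 60% compound annual growth (from the text).
    hard_growth: 20%, an assumption derived from "three times as fast".
    """
    soft = soft_start * (1 + soft_growth) ** years
    hard = hard_start * (1 + hard_growth) ** years
    return soft / (soft + hard)


for year in range(0, 6):
    print(f"year {year}: soft share = {soft_share(year):.1%}")
```

Under these assumptions, the soft share creeps from 95% to nearly 99% within five years, while the total volume grows roughly tenfold.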

While I have focused so far on the knowledge/information aspects of Watson, that being probably the most relevant aspect for BI experts, there is one other key feature of the technology that should be emphasized. That is Watson’s ability to parse and understand the sort of questions posed in everyday English, with all their implicit assumptions and inherent context. Despite appearances to the contrary in the game show, Watson was not responding to the spoken questions from the quiz master; the computer had no audio input, so the same questions that the human contestants heard were passed to it as text. Even so, speech recognition technology has also advanced significantly, to the stage where very high levels of accuracy can be achieved. (As an aside, I use this technology myself extensively and successfully for all my writing…) The opportunities that this affords in simplifying business users’ communication with computers are immense.

It seems likely that over the next few years this combination of technologies will empower business users to ask the sort of questions that they’ve always dreamed of, and perhaps haven’t even dreamed of yet. They will gain access, albeit indirectly, to a store of information far in excess of what any human mind can hope to amass in a lifetime. And they will receive answers based directly on the sum total of all that information, seeded by the expertise of renowned authorities in their respective fields and analyzed by highly structured and logic-based methods.

Of course, there is a danger here: if a given answer happens to be incorrect, it is difficult to see how the business user would discover the error or be able to figure out why it had been generated.

And that, as Sherlock Holmes never said, is far from “Elementary, my dear Watson!”