
TED Talks on IBM Watson & Bayes' rule in evolution

The secret to IBM Watson is the same one discovered a decade ago in statistical inference research: distance metrics & Bayes' rule. Photo: Wikimedia/mattbuck/CC BY-SA 3.0. This photo originally appeared on our Instagram on January 8, 2015, as the final clue in our reader's puzzle on what ostriches had to do with data science. (The clue was Bayes' rule.)

Some of our most-read articles have been on IBM Watson, including suggestions & possible alternatives. We’ve pushed IBM several times to come up with better demos for Watson in a business context.

This author went to a demonstration of Watson by IBM in August 2014, and witnessed an overglorified AltaVista demonstrated by an engineer. (AltaVista was the dominant Internet search engine prior to Google. Both Google and AltaVista can handle natural-language, question-like syntax in search queries, although users tend not to use it since it just adds boilerplate text.) Similar to AltaVista two decades ago, this Watson demonstration would sometimes respond to questions by coming up with partially related excerpts from various web pages in a small medical database on the web. It had difficulty understanding many simple questions, and the clips it selected weren’t always the most appropriate responses. It didn’t look like this thing was a Jeopardy! champion. As we told the engineer: this is bleeding-edge technology; if you want to sell it to businesses, you need to make the case. Businesses don’t want to invest in an AltaVista rehash from twenty years ago. The Watson engineer explained he had been in a hurry and had just loaded a small Watson demo that would fit onto his laptop.

In practice, the secret to IBM Watson is the same one this author discovered a decade ago while doing statistical inference research in academia: distance metrics. At the time, many artificial intelligence researchers believed symbolic reasoning was adequate for artificial intelligence. (From our results, we knew that statistical inference would quickly blow symbolic reasoning out of the water. It’s essentially a consequence of Moore’s law: statistical inference requires far more computing power than symbolic reasoning, so symbolic reasoning was an appropriate choice in the days when computers were much slower. Once computing power became cheap and abundant, the compute-hungry statistical approach won out.)

So it’s nice to see that IBM has put together some TED Talks about Watson, as well as used Watson to build a system capable of searching TED Talks. (The example they show brings up the Geoffrey West TED Talk that we also used in one of our first articles on the Singularity.)

IBM’s published papers on IBM Watson talk about how it uses a large number of different distance metrics. Among them is the same one we used in our paper a decade ago: Smith-Waterman, which Watson uses to compare questions to answers in Wikipedia (or other data sources). A closer match between question text and a paragraph in Wikipedia implies greater statistical likelihood that you’ve found Wikipedia text with the right answer. (You can then dynamically adjust the size of the text so that it optimally balances conciseness with the statistical probability of correctness. This author’s paper on statistical inference a decade ago did precisely such a dynamic clustering, in a different context, to optimize the informational richness of the answer against its likelihood of being correct.)
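To make this concrete, here is a minimal, character-level sketch of Smith-Waterman local-alignment scoring. This is our illustration, not IBM’s code: Watson’s production metrics operate over tokens and many additional features, but the dynamic-programming core looks like this.

```python
# Minimal character-level Smith-Waterman local-alignment score.
# Illustrative sketch only: the match/mismatch/gap weights are arbitrary.

def smith_waterman(a: str, b: str, match=2, mismatch=-1, gap=-1) -> int:
    """Return the best local-alignment score between strings a and b."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

# A higher score suggests the passage is statistically more likely
# to contain the answer to the question.
question = "what year did the french revolution begin"
passage = "The French Revolution began in 1789 with the Estates-General."
print(smith_waterman(question.lower(), passage.lower()))
```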


However, IBM Watson must do a great deal more to win Jeopardy! Finding a close match between Wikipedia text and a question is, in many cases, far from guaranteeing the correct answer. IBM Watson looks at many other factors, such as the implied historical period of the question (modern medicine in Wikipedia would give the wrong answer to a question about medieval medicine). It has distance metrics for the popularity and authenticity of the data source. (Popular data sources aren’t always right. One example given is the lengths of the borders of South American countries, where a figure frequently quoted by newspapers is, in fact, wrong.) When these different distance metrics are in conflict, Watson applies machine learning to learn which metrics should dominate for any given answer.
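As a toy illustration of that last step, here is how one might learn which metrics should dominate, using logistic regression over metric scores. The feature names and data below are invented for illustration; Watson’s real feature set and learner are far larger.

```python
# Hypothetical sketch: learn how to weight conflicting distance metrics.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: metric scores for one candidate answer, e.g.
# [text_match, period_match, source_popularity, source_authenticity]
X_train = np.array([
    [0.9, 0.1, 0.8, 0.3],   # popular source, but wrong historical period
    [0.6, 0.9, 0.4, 0.9],   # authoritative source, right period
    [0.2, 0.5, 0.9, 0.2],
    [0.8, 0.8, 0.5, 0.8],
])
y_train = np.array([0, 1, 0, 1])  # was the candidate actually correct?

model = LogisticRegression().fit(X_train, y_train)
print(model.coef_)  # learned weights: which metrics dominate

candidate = np.array([[0.7, 0.2, 0.9, 0.4]])
print(model.predict_proba(candidate)[:, 1])  # estimated P(correct)
```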

The corollary, of course, is that much of the development time for a novel Watson solution will be spent writing code to compute distance metrics. For example, in our hypothetical FBI-database Watson implementation, suppose a frequent use case were matching partial license plates using natural language. Let’s say witnesses frequently said things like “the license plate started with NQZ, the suspect had blue eyes, and the last name sounded like Mike.” You could pay an in-demand SQL programmer to write complex SOUNDEX and regex queries against some database, and maybe come back with an answer several hours and several hundred dollars later. Or you could have IBM Watson or another natural-language processing system (hopefully) figure out how to retrieve this information from your natural-language query, using much more computing power but presumably much faster and at lower total cost than the dedicated SQL programmer. In order to do that, however, Watson would probably need new distance metrics written for the things the FBI (or witnesses) commonly search for, such as (in our example) license plates or similar-sounding names. Basically, new distance metrics probably have to be written for anything that doesn’t frequently come up in Jeopardy!
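Here is a minimal sketch of the kind of matching involved, assuming hypothetical record fields and using the classic SOUNDEX phonetic code; a production system would be far more robust.

```python
# Hypothetical sketch: SOUNDEX surname match plus partial-plate prefix match.
# The record fields below are invented for illustration.

def soundex(name: str) -> str:
    """Classic four-character Soundex code, e.g. 'Mike' -> 'M200'."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    out = []
    prev = codes.get(name[0], "")
    for c in name[1:]:
        if c in "hw":               # h and w don't separate duplicate codes
            continue
        d = codes.get(c, "")        # vowels map to "" and reset prev
        if d and d != prev:
            out.append(d)
        prev = d
    return (name[0].upper() + "".join(out) + "000")[:4]

records = [
    {"plate": "NQZ4821", "last_name": "Micke", "eyes": "blue"},
    {"plate": "NQZ7734", "last_name": "Meyers", "eyes": "brown"},
    {"plate": "ABC1234", "last_name": "Mike", "eyes": "blue"},
]

# Witness: plate started with NQZ, last name sounded like "Mike".
hits = [r for r in records
        if r["plate"].startswith("NQZ")
        and soundex(r["last_name"]) == soundex("Mike")]
print(hits)
```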

Not many questions involving similar license plates come up in Jeopardy!, although similar-sounding names might, so perhaps only one new metric has to be written in this example. This would create a new quantitative score comparing two cases in the database exclusively on how similar their license plates are. A separate metric (or at least separate treatment) is needed because similar or matching license plates between two cases is a qualitatively very different signal than some random text match between the cases. The choice of metrics presumably re-imposes some structure on the resulting system and its interpretation of the data. In a more realistic example, an experienced agent would guide the Watson engineers in creating new quantitative metrics based on how agents compare cases or suspects in real life. They might create a metric that compares two artists’ sketches, for example, or scores how similar a sketch is to a photo. Machine learning would then take over to figure out how to integrate the different metrics when formulating responses to natural-language questions. For example: should a matching license plate dominate when the suspects’ appearances are different?
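A sketch of such a dedicated plate-similarity score, again hypothetical: it is kept separate from generic text matching so the learner can weight "plates nearly match" as its own signal. A real metric might also model witness or OCR confusions (O versus 0, and so on).

```python
# Hypothetical plate-similarity score in [0, 1], based on edit distance.

def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def plate_similarity(p1: str, p2: str) -> float:
    """1.0 = identical plates, 0.0 = nothing in common."""
    p1, p2 = p1.upper(), p2.upper()
    return 1.0 - levenshtein(p1, p2) / max(len(p1), len(p2))

print(plate_similarity("NQZ4821", "NQZ4827"))  # ~0.86: near match
print(plate_similarity("NQZ4821", "XKF9035"))  # 0.0: unrelated
```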

These considerations then get at the true cost of a Watson deployment. They also answer the question of what infrastructure should be built out to develop a pre-Watson prototype: if the kind of similarity you’re looking for isn’t asked about on Jeopardy!, write a custom distance metric for it.

Photo: Wikimedia/mattbuck/CC BY-SA 3.0. Black-light office art & artwork: our featured photo is Bayes’ theorem in neon. When this photo was originally published on our Instagram feed, we used it to wrap up our final set of clues in our reader’s puzzle on the relationship between ostriches, Aristotle, and data science.

This is Bayes’ theorem from statistics (and data science) spelled out in blue neon at the Cambridge, UK offices of the data science firm Autonomy. (Apologies to the frequentists, or should we say frequentistas, among our readers. This supposedly rival but in reality complementary branch of statistics is in a holy war with the Bayesians. We’re being satirical here, in a nod to today’s (January 8, 2015) tragic events: even bloggers have been targeted by dictators and fanatics, and these things make us all less free, but more on that later.)
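For reference, the formula spelled out on the sign is:

```latex
% Bayes' theorem: posterior = likelihood x prior / evidence
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```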

The clue was Bayes’ rule. As we and others have argued elsewhere, there are only so many ways you can design an intelligent system. The requirements of convergent evolution dictate that such systems use statistical inference, and specifically Bayes’ rule, in essentially any such system. There is growing evidence in neuroscience that the human brain does, indeed, use Bayes’ rule, hardcoded by evolution. IBM Watson thus necessarily makes use of Bayes’ rule as one of many parts of a complex chain of statistical reasoning and machine-learning heuristics.


Next steps: Check out our YouTube channel for more great info, including our popular "Data Science Careers, or how to make 6-figures on Wall Street" video!