Watson from IBM: Why semantic text tech helps analytics

IBM's Watson natural language processing software is named after IBM founder Thomas J. Watson, Sr., pictured here in a 1920s photo from IBM's corporate archives. Photo: Wikimedia/IBM/CC-BY-SA-3.0

IBM Watson: dramatic potential or nothing new?

There is a new type of software technology from IBM and others that has the potential to dramatically change how we work. In particular, certain types of workplace drudgery may be eliminated. The key is intelligent processing of unstructured text.

You might remember IBM's Watson technology from a few years back. It's a natural language Q&A system named after IBM's founder (pictured). (We're guessing it's also an allusion to the Sherlock Holmes character.) As a publicity stunt, IBM had Watson beat the reigning Jeopardy! champions. At least based on the material on IBM's website, it still isn't clear, technically, what Watson actually is. There are other impressive systems for searching and processing unstructured text and answering human-language questions. (We'll talk about them in a bit.) Aside from some glossy marketing materials, IBM seems coy about why Watson is better or cheaper than the competition. However, few businesses have a need to keep a Jeopardy!- or chess-playing computer on staff. (For those that do, there are competing platforms even in these areas.) So, making the business case is important.

Fortunately, IBM is promising to make more technical details available soon. So let’s turn to why processing unstructured text is extremely useful.

Searches powered by semantic meaning rather than keywords

One of the most basic business uses for semantic text processing is in the area of job searches. When applicants write resumes, they describe their job skills using keywords. The employer has the reverse problem: employers try to imagine the ideal candidate and lace the job description with the keywords that recruiters or candidates are most likely to be searching for. It often doesn't work. There is a great deal of technical and business jargon out there, and closely related specialties will use completely different terminology to describe identical processes. (We could give many examples, but that is somewhat beyond the scope of this article.) Recruiters use their knowledge of different but closely related job functions to expand the set of search keywords. If software understood the underlying semantic meaning of a job description or resume, it could take some of that burden off recruiters, and computers would be able to come up with much better matches.
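
To make that concrete, here's a minimal sketch (assuming scikit-learn is installed) of why pure keyword matching fails and how semantic knowledge helps. The hand-written synonym table below is a toy stand-in for the semantic model a system like Watson would have to learn:

```python
# Keyword overlap misses a match that semantic expansion catches.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

job_ad = "seeking statistician experienced in predictive modeling"
resume = "data scientist skilled in machine learning and forecasting"

# Hypothetical domain knowledge a recruiter carries in their head.
SYNONYMS = {
    "statistician": "data scientist",
    "predictive modeling": "machine learning forecasting",
}

def expand(text):
    """Rewrite jargon into a shared vocabulary before comparing."""
    for term, equivalent in SYNONYMS.items():
        text = text.replace(term, equivalent)
    return text

raw = TfidfVectorizer().fit_transform([job_ad, resume])
print(cosine_similarity(raw)[0, 1])  # near zero: almost no shared keywords

sem = TfidfVectorizer().fit_transform([expand(job_ad), expand(resume)])
print(cosine_similarity(sem)[0, 1])  # much higher: shared meaning surfaced
```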

Recruiters are highly compensated. (20% of the candidate's first year's pay is typical.) Most of this is likely salesmanship. (99% of potential matches are likely rejected by either the candidate or the employer. For each successfully sourced candidate, there may be enormous effort exerted behind the scenes on many unsuccessful prospects. Recruiter compensation reflects this.) Nevertheless, being able to come up with better job search keywords is apparently a valuable skill, and some fraction of recruiters' compensation is due to that skill set. Semantic search technologies like IBM's Watson stand to make significant revenue by streamlining recruiting. (Elance shows up in IBM's marketing video for Watson. They don't say what they're working on, but we're guessing it's something like what we just described.)

Let's give another example. Say you're searching for restaurants. You've got exotic tastes, or maybe dietary restrictions that sites like Zagat don't cater to. You've heard that a lot of fake reviews get written on Yelp, both positive and negative, planted by business owners and their competitors. So, you don't fully trust the ratings on Yelp and similar sites. Instead, you're a member of a food group on a social network. The people in the group have been closely vetted by the group's moderator and are known to share your tastes or restrictions. So, you trust them more than Yelp. The problem is that there are far fewer reviews in your social network's group: although you trust these individual recommendations more, there is less data to work with. How might you proceed? Well, humans are very good at these kinds of problems, which they encounter constantly. You'll integrate the reviews from your social network with the reviews of the same restaurants on Yelp, filtering out suspicious or dubious ones. You'll quickly come up with an excellent aggregate review that combines data from all of these websites. You're able to do this because you (unlike the computer or website) understand the semantic meaning that underlies the reviews, as well as the context surrounding your specific needs or tastes.
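
Here's a minimal sketch of the aggregation a human does implicitly: weight each rating by how much the source is trusted, and drop reviews flagged as suspicious. The trust weights and the suspicion flags are stand-ins for judgments a semantic system would have to make itself:

```python
def aggregate(reviews, trust):
    """reviews: (source, rating, suspicious) tuples; trust: source -> weight."""
    kept = [(src, r) for src, r, suspicious in reviews if not suspicious]
    total = sum(trust[src] * r for src, r in kept)
    return total / sum(trust[src] for src, _ in kept)

reviews = [
    ("yelp", 4.0, False),
    ("yelp", 5.0, True),          # flagged as a likely fake review
    ("food_group", 3.0, False),   # a vetted friend's opinion
]
trust = {"yelp": 0.3, "food_group": 1.0}  # vetted friends count more
print(aggregate(reviews, trust))  # blended rating of about 3.2
```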

Unstructured text in law, law enforcement, and other government databases

The stereotypical lawyer in a TV melodrama deals with boxes and boxes of paper documents. Indeed, some economists have noted that the basic work of a lawyer primarily involves reading documents (aka unstructured text). In many legal issues, data and documents from a variety of disciplines and organizations must be combined to file an application, justify an opinion, or build a legal case. Since each organization or discipline has its own conventions, the resulting legal “meta-document” if you will (the docket or dossier) is necessarily unstructured. This fact hasn’t been lost on the Watson team. IBM’s marketing video expressly mentions legal applications as one of the next implementations for Watson.

Crowd-sourced law enforcement?

A related application might be described as "crowdsourced" law enforcement. (We admit this term sounds a bit scary.) Something like it already works in traffic applications. You can report a road hazard or congested conditions using the Waze app, which alerts other, nearby drivers. Hopefully it alerts the authorities as well, who can remove the hazard or mitigate the congestion. This works because Waze is dealing with structured data: the main data points are a GPS location and a condition code. It's straightforward to handle this with a pre-determined data format, and any review can be almost entirely automated.
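
A sketch of what makes this case so easy to automate: every field is known in advance, so validation and routing need no human reading. (The types below are our own illustration, not Waze's actual format.)

```python
from dataclasses import dataclass
from enum import Enum

class Condition(Enum):
    HAZARD = 1
    CONGESTION = 2

@dataclass
class RoadReport:
    lat: float
    lon: float
    condition: Condition  # a closed vocabulary, not free text

report = RoadReport(lat=34.05, lon=-118.24, condition=Condition.HAZARD)

# Automated review is trivial: just dispatch on the condition code.
if report.condition is Condition.HAZARD:
    print(f"alert drivers near ({report.lat}, {report.lon})")
```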

However, a great deal of unstructured text is needed to process most citizen complaints. We'll see in a bit why this will likely never change: it's the inherent nature of dealing with "bad guys." But let's ignore that for a moment and just look at the data itself.

Take the lowly databases collecting information on U.S. telephone scam calls. This would seem like a simple problem; the main data point is the phone number. Yet these databases contain huge amounts of essentially unstructured data. Why is this? First, there are many such databases. Depending on the exact type of complaint, there may be multiple federal, state, local, and private organizations with overlapping jurisdiction or interest in the complaint. (It could be a commercial call, a fake non-profit survey call, a scam targeting senior citizens, a call to a number on the Do Not Call list, a call to a mobile phone, a text message, etc. Both the FCC and FTC have separate databases for Do Not Call complaints. There are multiple public databases run by private organizations for reporting unwanted or illegal phone calls. And so on. There are even a few completely structured databases, like the blacklist used by Google Voice's spam filter.)

If you're the FTC or the FCC, you're dealing with precious resources. Today, manual review of unstructured data is still needed to allocate those resources. (If you're Google Voice trying to put together a database of spam callers, you might have to undertake a similar process. You might need to read through some of these complaints to help assess their credibility.)

These phone complaint databases are actually one of the easiest problems out there! The key data points are the caller's phone number, the time or frequency of calls, do-not-call-list status, and a few other items that are already structured on the complaint form. Yet a human is still required to read these complaints and decide which of a plethora of regulations (if any) may have been violated.
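
To illustrate the flavor of the problem, here's a hedged sketch in which two hypothetical complaint formats (loosely modeled on the idea of separate FTC and FCC databases, not their real schemas) are normalized into one record. The structured fields merge mechanically; the free-text narrative is the part that still needs a human, or a Watson, to read:

```python
def normalize_ftc(rec):
    return {"number": rec["phone"], "when": rec["date"],
            "on_dnc_list": rec["dnc"], "narrative": rec["comments"]}

def normalize_fcc(rec):
    return {"number": rec["caller_id"], "when": rec["timestamp"],
            "on_dnc_list": rec["registered"], "narrative": rec["description"]}

complaints = [
    normalize_ftc({"phone": "555-0100", "date": "2014-02-01",
                   "dnc": True, "comments": "Robocall about a cruise."}),
    normalize_fcc({"caller_id": "555-0100", "timestamp": "2014-02-03",
                   "registered": True, "description": "Same cruise pitch again."}),
]

# The structured parts answer easy questions automatically...
dnc_violators = {c["number"] for c in complaints if c["on_dnc_list"]}
print(dnc_violators)
# ...but deciding which regulation was violated still means reading narratives.
```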

Internet threats

(Closely related to phone spam databases are Internet threats. You'd think this would be easy, since the main data points are things like IP address and timestamp. However, again, each country and organization has its own, largely unstructured, database. Many countries have multiple organizations, public and private, with overlapping jurisdiction. Then there are the companies that may own the affected machines, each using its own, completely different system. It's highly desirable to aggregate this information to respond to threats in real time. However, it's not currently possible to do this effectively unless you can process unstructured text. We might have a future blog post on this specific topic.)

Unstructured data can’t be “gamed” by the “bad guys”

More significantly, the "bad guys" (if you will) are aware of these databases, and may take steps to evade automated surveillance. They may constantly change their calling phone number to avoid being blacklisted. Thus, real humans are still needed to go into these databases and study the unstructured text. If you see people writing descriptions of the same kind of illegal phone call coming from constantly changing numbers, you want to assign someone to investigate further. This is why unstructured text is required: there is no way to know in advance how scammers or criminals will behave. If you tried to anticipate every possible situation with a pre-designed form, the "bad guys" would modify their behavior to "game" the pre-designed form and avoid surveillance.
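
A toy sketch of why the free text matters here: matching on phone number fails once the scammer rotates numbers, but the narratives still cluster together. (Crude Jaccard word overlap stands in for the much richer semantic matching a real system would need.)

```python
def jaccard(a, b):
    """Fraction of words two complaint narratives share."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

complaints = [
    ("555-0100", "caller claimed to be the IRS and demanded gift cards"),
    ("555-0199", "said he was from the IRS and demanded gift cards"),
    ("555-0123", "recorded message about an expiring car warranty"),
]

# Group complaints whose narratives look alike despite distinct numbers.
pairs = [(a[0], b[0]) for i, a in enumerate(complaints)
         for b in complaints[i + 1:] if jaccard(a[1], b[1]) > 0.3]
print(pairs)  # [('555-0100', '555-0199')]: same scam, different numbers
```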

FBI and Crimestoppers databases and Dr. Watson

The more important examples are things like the FBI IC3 database and the Crimestoppers database. In part for the reasons we just mentioned, unstructured text plays an essential role in the complaint forms used by citizens. So, at present, a human is still required to read each complaint. Since any one human can only read a small part of these databases, they might miss complex crime trends. This is where a technology like Watson might come in. Watson could read every complaint in FBI IC3 or Crimestoppers, and potentially catch significant details in the aggregate unstructured text that would be overlooked by necessarily more myopic humans. (Note that since every scam involving the Internet is potentially an FBI IC3 complaint, this database of citizen complaints is huge. Most complaints are ultimately ignored due to lack of resources. Watson could help here by connecting dots that are otherwise overlooked.)

We've mentioned Yelp for finding restaurants. Although it's controversial (and there may be legal hazards for those making complaints), it also serves as a repository for citizen complaints against businesses. (There are many other companies recording these kinds of complaints, with various levels of complaint vetting, different business models, and different amounts of controversy.) Yelp and similar databases are largely public, so the government (or Yelp's owners) could have Watson analyze this data to find previously undiscovered trends as well.

The IBM Watson team is asking people who have suggestions for how Watson's technology might be used to submit those ideas to them. Since this application occurred to us, we thought we'd just throw it out here. (IBM has specific guidelines. For example, the application must involve unstructured text, as well as human-language questions for Watson to answer. In this case, the questions to Watson would be about the ideal allocation of resources, or whether unusual new patterns were appearing in the aggregate submissions.)

Dr. Watson the spook

We're limiting our discussion here to unclassified government databases, which must be huge and full of unstructured text. Of course, the government also has those "other" databases we've heard so much about recently. We're guessing Watson already passed the background checks for a security clearance, so that one's probably covered. It's pretty obvious the technology hasn't made it around to "more civilian" government applications yet, however.

Forming a posse in the digital age

Maybe semantic processing of text and crowd-sourced law enforcement is how you form a posse in the digital age. (Or maybe we just think that because we're Americans located out West with an engineering bent. Posses have obvious legal and ethical hazards; we're not being entirely serious in our enthusiasm for the crowd-sourced posse thing here. Perhaps, on the emerging-risks side, these new technologies might facilitate a dangerous new vigilantism.)

The problem with unstructured text

To date, computers have mainly been good at processing structured text. This means a designer or software developer needs to anticipate in advance how data will be used. They need to carefully design a database schema or web form that captures these use cases ahead of time.
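
Here's a small sketch of that up-front commitment, using Python's built-in sqlite3 module: every question we might ever want to ask has to be anticipated as a column, in advance.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE complaint (
        phone       TEXT,     -- anticipated use: look up by number
        called_at   TEXT,     -- anticipated use: time-of-day patterns
        on_dnc_list INTEGER   -- anticipated use: DNC enforcement
        -- anything the designer didn't anticipate has nowhere to go
    )
""")
db.execute("INSERT INTO complaint VALUES ('555-0100', '2014-02-01T09:15', 1)")

# Easy: a question the schema anticipated.
rows = db.execute("SELECT COUNT(*) FROM complaint WHERE on_dnc_list = 1")
print(rows.fetchone()[0])
# Hard: "which complaints describe the same scam?" -- no column for that.
```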

Humans, on the other hand, are especially good at processing unstructured data. (We may create a structure around that data after the fact.) From an evolutionary standpoint, this makes sense. The importance of things we encounter may not become clear until much later. Therefore, it's usually not possible to design a structure around this data in advance.

There’s a bear in the woods, or, Watson come quickly.

Usually, a lion is used in these kinds of evolutionary examples. We'll use a bear instead. Let's say there's a bear in the woods. You thought the bear was tame, but in hindsight it turned out to be very dangerous. You previously thought the information you had about bears was not valuable, so you didn't bother to organize it, since that would involve unnecessary effort. If you're able to go back and extract crucial information about bears from this previously unstructured data, your genes are much more likely to be passed on. Similarly, the vast majority of information streaming into your senses will later turn out to be extraneous. By being able to leave this data unstructured until its importance becomes clearer, you save energy (computational effort) and gain an evolutionary advantage.

This example also illustrates nicely why this is a valuable computational technique. If you're able to "lazily" (a technical term) leave data unstructured until its value is certain, you can eliminate much of the significant design, storage, and data-entry cost associated with database schemas.
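
A minimal sketch of that lazy strategy: keep the raw notes cheaply, and only pay the structuring cost later, for the records whose importance has become clear. (The field notes and the extraction rules are our own toy example.)

```python
import re

raw_log = [
    "saw a bear near camp at dawn, seemed tame",
    "berries by the river, nothing notable",
    "bear again at dusk, charged at us near camp",
]

def extract_bear_facts(note):
    """Structure imposed after the fact, once 'bear' proved important."""
    time = re.search(r"dawn|dusk", note)
    return {"time": time.group() if time else None,
            "aggressive": "charged" in note}

# No schema was designed up front; we structure only the notes that
# mention what turned out to matter.
print([extract_bear_facts(n) for n in raw_log if "bear" in n])
```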

We briefly touched on IBM's Watson previously in our article on Cisco's CES talk on the Internet of Things. The Internet of Things will create a lot of data. Some of it, especially in hindsight, will not be optimally structured. And, as our example above illustrated, it is actually more efficient to leave data unstructured when there is a great deal of it and the relative future importance of its various features is uncertain. That's where Watson comes into play.

Apparently, IBM Watson could mine text very well for Jeopardy! answers, somehow. But what is this technology, exactly? Can it do something besides play a mean game of Jeopardy!?

Medicine like playing Jeopardy!?

IBM’s first killer application for Watson is assisting medical doctors in keeping up with the latest research. A huge amount of medical research is published each year. Articles sometimes provide new insights into diagnosis and treatments of disease, especially the more exotic and interesting cases. But the amount of new literature is vast and nearly impossible for specialists to keep up with, let alone your average general practitioner.

Enter Watson, which can understand large quantities of text almost like a human. It can also answer "natural language" questions from humans (medical doctors) and respond to those questions in a natural way, as proved on Jeopardy! (It doesn't really "understand" the way a human does yet. But it is able to create statistical models of the meanings of questions and text. So, when it is asked a question by a human, it is able to find the medical articles that are statistically most likely to be relevant in answering that question, and present that result back to the doctor.)
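
Here's a crude sketch of that core statistical move: score each article against the question, weighting shared terms by their rarity, and return the best matches. (Watson's actual models are presumably far richer; this is only meant to show what "statistically most likely to be relevant" can mean.)

```python
import math
from collections import Counter

abstracts = {
    "A": "novel biomarker improves early diagnosis of pancreatic cancer",
    "B": "statin therapy outcomes in elderly cardiac patients",
    "C": "imaging techniques for early pancreatic tumor detection",
}

def tokens(text):
    return text.lower().split()

# Inverse document frequency: rare terms carry more meaning.
df = Counter(t for text in abstracts.values() for t in set(tokens(text)))
idf = {t: math.log(len(abstracts) / df[t]) for t in df}

def score(question, text):
    shared = set(tokens(question)) & set(tokens(text))
    return sum(idf[t] for t in shared)

question = "how can pancreatic cancer be detected early"
ranked = sorted(abstracts, key=lambda k: score(question, abstracts[k]), reverse=True)
print(ranked)  # ['A', 'C', 'B']: the early-pancreatic-diagnosis articles rank first
```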

Now, to be perfectly fair, this may not be an entirely new concept. (More on why tech to search scientific articles isn't new in a bit.) Moreover, medical articles play to the computer's strengths (much like the game of chess in that other famous IBM exhibition). While they are pretty far along the continuum from structured to unstructured text, medical articles still have more structure than an average newspaper article or short story. Researchers go through a precise ritual when writing a medical article. There's an abstract, introduction, conclusion, and so on. Space is very limited, so researchers concisely describe what they are doing in a set number of words in each section, following a pre-set scientific style. (This is unlike a classic novel by, say, Agatha Christie or Lewis Carroll, which might jump from first-person to third-person narration or switch styles mid-story.) Scientific articles use a lot of jargon. This also plays to the strength of computers, which can have a potentially unlimited vocabulary.

Searching scientific articles isn’t new tech

Scientific articles use citations to link paragraphs and sentences to other scientific articles. These citations follow one of a small number of allowed formats, which provide a standardized reference intended to retrieve the cited article. The vast majority of these other articles, going back several decades, will already be online. Again, advantage computer, since it can instantly retrieve and scan each citation to learn more about the meaning of the article. The poor human specialist must either already be familiar with the cited article (as is sometimes the case with highly cited articles in specialized fields) or spend time retrieving and reading it.

Moreover, in many cases the National Library of Medicine (NLM) and similar groups have electronically annotated cited articles in a machine-readable way. (Each discipline has its own system, but medicine and biology often use the MeSH ontology.) This was originally intended to speed researchers' searches for related articles in PubMed/Medline (the online article-abstract search system set up by the NLM, which took the place of multiple similar commercial services in the 1990s). If you know the MeSH terms for the subjects you are interested in, you can pull the matching abstracts over the Internet via Medline. This, in turn, sometimes allows access to the full-text articles on publisher sites.
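
For instance, here's a hedged sketch of pulling abstracts by MeSH term over the NLM's public E-utilities interface. (This assumes the third-party requests package; the endpoints are the real E-utilities ones, but check NCBI's usage policies and rate limits before running anything like this.)

```python
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# Step 1: search PubMed for articles annotated with a given MeSH term.
ids = requests.get(f"{EUTILS}/esearch.fcgi", params={
    "db": "pubmed",
    "term": '"pancreatic neoplasms"[MeSH Terms]',
    "retmax": 5,
    "retmode": "json",
}).json()["esearchresult"]["idlist"]

# Step 2: fetch the abstracts for those article IDs as plain text.
abstracts = requests.get(f"{EUTILS}/efetch.fcgi", params={
    "db": "pubmed",
    "id": ",".join(ids),
    "rettype": "abstract",
    "retmode": "text",
}).text
print(abstracts[:500])
```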

Thus, a computer system analyzing medical articles like Watson need not strictly limit itself to the human-readable text. It could parse the semi-structured citation data for each paragraph and sentence, and jump from that to the previously human-annotated, machine-readable MeSH terms for each cited article. From these MeSH terms, it could gain further insight into the meaning of the paragraph or sentence referencing the cited article. (Medical articles typically have a great many citations.) In addition to MeSH terms, there is the machine-readable Science Citation Index (and its competitors), which seeks to quantify the quality, importance, or influence of scientific articles by counting the number of times each has been cited in other scientific articles.

However outstanding Watson's ability to process human-readable text is, it would be foolish for Watson not to use this already machine-readable data to gain additional insights. Many other fields of scientific endeavor don't yet have the extensive machine-readable subject annotations, such as the MeSH terms, that the NLM has painstakingly assigned to each article. And other areas of science may rely far more on non-text to convey meaning, such as mathematical equations, theorems, computer code, tables, or graphs. Many fields use far fewer citations, owing to a much smaller body of relevant literature. Although Watson's Jeopardy! championship gives a clear hint, we won't know how significant these differences are until there's a Watson for pure mathematics or geophysical chemistry.

Of course, these other fields lack medicine's business model: physicians have both the financial resources and a real need for a computer assistant. This makes medical articles a low-hanging fruit for both business and technical reasons.

We briefly suggested above that this perhaps isn't all that new a technology. Recall that one of the machine-readable data points associated with a scientific article is a count of how often it has been cited in other articles. This has been used for many decades to attempt to quantify (in a crude way) the quality, popularity, or importance of scientific articles. It was sort of the PageRank of its day. (PageRank, named after Google co-founder Larry Page, was the original Google search engine algorithm.)

Origins of Google

In fact, this is where Google got its start. Recall that Google was originally a PhD project at Stanford to help libraries keep track of scientific articles. Google's founders, then Stanford PhD students, realized that the number of times a web page was linked to could play a role similar to the Science Citation Index. Thus, counting links provided a way to numerically score pages. (Prior to that, search engines mainly just looked at the keywords in each page.)
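
For the curious, here's a toy power-iteration PageRank over a four-page "citation web" of our own invention. The point of the algorithm is that a page cited by well-regarded pages outranks one with the same raw citation count:

```python
links = {  # who links to whom (hypothetical pages)
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(links)
rank = {p: 1 / len(pages) for p in pages}
damping = 0.85

for _ in range(50):  # iterate until the scores settle
    new = {p: (1 - damping) / len(pages) for p in pages}
    for p, outgoing in links.items():
        for q in outgoing:
            new[q] += damping * rank[p] / len(outgoing)
    rank = new

print(sorted(pages, key=rank.get, reverse=True))  # 'C' first: most heavily cited
```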

From the beginning, every Google search has implicitly been a question: a request for the most relevant information on a specific topic. You can even ask Google Jeopardy!-like questions. (Well, you know we mean Jeopardy! answers, since the questions are the answers on Jeopardy!) Originally, the topics of those questions were intended to be scientific, and the pages returned linked to scientific articles.

Of course, Watson is obviously much better at solving Jeopardy! trivia than Google or Google Now. Watson wasn't connected to the Internet during the Jeopardy! challenge. (It had to rely on information it had already downloaded.) It didn't return a web page, but rather the best sentence giving a concise, human-like solution to the trivia problem. It formulated these sentences on its own, by parsing the information in its databanks. (The game requires each solution to be in the grammatical form of a question. This additional twist proved Watson was generating its own, grammatically correct sentences, rather than merely mining existing sentences on web pages.)

Google can do something similar for some simple, common questions that appear to be preprogrammed ("What is the current time in Madrid?"). The Wolfram Alpha engine that Apple's Siri sometimes uses can answer an additional universe of questions that appear to be preprogrammed, and can solve some math problems by harnessing a Wolfram Mathematica engine.

So Watson-like technology is already in products like Siri and Google Now. Of course, these didn't exist (in their present sophistication) at the time IBM ran its Jeopardy! challenge. Both require access to the Internet to answer questions, and often take much longer to respond than Watson was permitted in competition. Often the responses are still lengthy web pages rather than the concise, accurate answers Watson was capable of generating.

So what is Watson, exactly?

Which brings us back to our original problem. What exactly is Watson?

If you visit the Watson developer ecosystem website, you'll find the general public only has access to some glossy marketing brochures and videos. These are high-concept and say little about the underlying technology.

(The Wikipedia article on Watson isn't much more helpful. It mostly discusses the Jeopardy! stunt, at length. We assume this is because so little else is known about the system. It does note a medical application in field trial for lung cancer diagnosis; 90% of the nurses in that field trial now reportedly rely on Watson's judgment. The article cites Wolfram Alpha, which we mention above in connection with Apple's Siri, as the main competitor.)

In recent months, IBM has started to explain a bit more, sort of. They've announced a Watson app cloud, though they're still only letting a small number of companies in at the moment. (Apparently Elance is one of them.) Everyone else gets glossy marketing brochures and videos.

Eventually, it seems, IBM will publish a public API to Watson, as well as provide a hosted cloud service for Watson-enabled apps, a la Google App Engine. (They're taking sign-ups from people interested in the public announcement.)
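
To be clear, IBM hadn't published that API at the time of writing, so the following is a purely hypothetical sketch of what calling a hosted Q&A service might look like; the endpoint, payload, and response shape are all invented for illustration:

```python
import requests

# Hypothetical endpoint and payload; the real Watson API is not yet public.
resp = requests.post(
    "https://watson.example.com/v1/question",  # placeholder URL
    json={"question": "Are new scam patterns appearing in this week's complaints?",
          "corpus": "citizen-complaints"},
)
print(resp.json().get("answers", []))  # assumed response shape
```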

One of the initial problems with Watson is that it apparently required a substantial up-front investment, in the form of a state-of-the-art data warehouse and the staff to run it.

(Our initial suspicions that Watson was built on top of IBM DB2 were quickly confirmed. In addition to significant software licensing costs, the last time we checked, installing and maintaining a DB2 installation was non-trivial. Developers used to free systems like MySQL may not realize it, but multi-million-dollar R&D investments in proprietary fast-join optimization and parallelization technologies go into systems like Oracle, SQL Server, or DB2. This, and the existence of legacy software, is why people put up with the maintenance expense of these systems.)

The Watson cloud

So setting up a cloud ecosystem makes perfect sense. IBM will maintain the technology's complex software and hardware stack. Developers can rent Watson instances, a la the Amazon EC2 model. Companies can focus on writing innovative apps, not maintaining complex data warehouse hardware, software, and support staff. The barrier to entry for innovation drops from very substantial to near zero with the cloud-based model.

Elance is mentioned in IBM’s glossy video. We’ve already discussed why searching job descriptions and resumes in a “natural” way could be huge.

The missing technology for analytics and Internet of Things

Semantic text searching and improved natural language and unstructured data processing are the key missing ingredients for analytics and the Internet of Things. We're sure to have more to say on this key emerging technology in the future.

Next steps: Check out our YouTube channel for more great info, including our popular "Data Science Careers, or how to make 6-figures on Wall Street" video!