Watson from IBM: Why semantic text tech helps analytics
IBM Watson: dramatic potential or nothing new?
There is a new type of software technology from IBM and others that has the potential to dramatically change how we work. In particular, certain types of workplace drudgery may be eliminated. The key is intelligent processing of unstructured text.
You might remember IBM’s Watson technology from a few years back. It’s a natural language Q&A system named after IBM’s founder (pictured). (We’re guessing it’s also an allusion to the Sherlock Holmes character.) As a publicity stunt, IBM had Watson beat the reigning Jeopardy! champion. At least based on the material on IBM’s website, it still isn’t very clear, technically, what Watson is. There are other impressive systems for searching and processing unstructured text and answering human-language questions. (We’ll talk about them in a bit.) Aside from some glossy marketing materials, IBM seems coy about why Watson is better or cheaper than the competition. However, few businesses have a need to keep a Jeopardy! or chess-playing computer on staff. (For those that do, there are competing platforms even in these areas.) So, making the business case is important.
Fortunately, IBM is promising to make more technical details available soon. So let’s turn to why processing unstructured text is extremely useful.
Searches powered by semantic meaning rather than keywords
One of the most basic business uses for semantic text is in the area of job searches. When an applicant writes a resume, they describe their job skills using keywords. The employer has the reverse problem. Employers try to imagine the ideal candidate and lace the job description with relevant keywords that recruiters or candidates will most likely be searching for. It often doesn’t work. There is so much technical and business jargon out there. Closely related specialties will use completely different terminology to describe identical processes. (We can give many examples but this is somewhat beyond the scope of this article.) Recruiters use their knowledge of different but closely related job functions to expand the set of search keywords. If software understood the underlying semantic meaning of a job description or resume, it would be able to take some of that burden off recruiters. Computers would be able to come up with much better matches.
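To make the recruiter's keyword-expansion step concrete, here is a minimal Python sketch. It is not how Watson or any real product works: the `RELATED_TERMS` table, the job titles in it, and the function names are all hypothetical, and the table is written by hand. A true semantic system would have to infer these relationships from the text itself.

```python
from typing import Dict, Iterable, Set

# Hypothetical, hand-written table of closely related job titles.
# A recruiter carries this knowledge in their head; a semantic system
# would need to learn it rather than have it typed in.
RELATED_TERMS: Dict[str, Set[str]] = {
    "data scientist": {"statistician", "quantitative analyst"},
    "devops engineer": {"site reliability engineer", "build engineer"},
}


def expand_query(keywords: Iterable[str],
                 related: Dict[str, Set[str]] = RELATED_TERMS) -> Set[str]:
    """Return the lowercased keywords plus any related terms from the table."""
    expanded: Set[str] = set()
    for keyword in keywords:
        key = keyword.lower()
        expanded.add(key)
        expanded |= related.get(key, set())
    return expanded


def keyword_match(text: str, keywords: Iterable[str]) -> bool:
    """Plain keyword search: true if any keyword appears in the text."""
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


def expanded_match(text: str, keywords: Iterable[str]) -> bool:
    """Keyword search after expanding the query with related terms."""
    return keyword_match(text, expand_query(keywords))


resume = "Ten years as a statistician in the pharmaceutical industry."
print(keyword_match(resume, ["Data Scientist"]))   # False: the exact title is absent
print(expanded_match(resume, ["Data Scientist"]))  # True: "statistician" is a related term
```

The plain search misses the resume because the exact job title never appears in it. The expanded search finds it, but only because someone entered the relationship by hand, which is exactly the burden the article suggests semantic software could take over.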
Recruiters are highly compensated. (20% of the candidate’s first year’s pay is typical.) Most of this is likely salesmanship. (99% of potential matches are likely rejected by either the candidate or the employer. For each successfully sourced candidate, there may be enormous efforts exerted behind the scenes on many unsuccessful prospects. Recruiter compensation reflects this.) Nevertheless, being able to come up with better job search keywords is apparently a valuable skill. Some fraction of recruiters’ compensation is due to that skill set. Semantic search technologies like IBM’s Watson stand to make significant revenue by streamlining recruiting. (Elance shows up in IBM’s marketing video for Watson. They don’t say what they’re working on, but we’re guessing it’s something like what we just described.)
Let’s give another example. Let’s say you’re searching for restaurants. You’ve got exotic tastes, or maybe dietary restrictions that sites like Zagat don’t cater to. You’ve heard that there are a lot of fake reviews on Yelp. These fakes can be positive or negative, written by business owners or by competitors. So, you don’t fully trust the ratings on Yelp and similar sites. Instead, you’re a member of a social network food group. The people in the group have been closely vetted by the group’s moderator. They’re known to share your tastes or restrictions. So, you trust them more than Yelp. The problem is that there are far fewer reviews in your social network’s group. So, although you trust these individual recommendations more, there is less data to work with. How might you proceed? Well, humans are very good at these kinds of problems, which they encounter constantly. You’ll integrate the reviews from your social network with those for the same restaurants on Yelp, filtering out suspicious or dubious reviews. You’ll quickly come up with an aggregate review, often an excellent one, that combines data from all of these web sites. You’re able to do this because you (unlike the computer or web site) understand the semantic meaning that underlies the website reviews as well as the context surrounding your specific needs or tastes.
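The arithmetic half of that mental process can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `Review` record, the source labels, and the trust weights (a vetted group review counting three times as much as an anonymous one). The hard part, deciding which reviews are suspicious, is taken as given in the `suspicious` flag. That judgment is the semantic step this sketch does not attempt.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class Review:
    source: str               # illustrative labels: "group" or "yelp"
    stars: float              # rating on a 1-5 scale
    suspicious: bool = False  # set by a human reader; not computed here


# Assumed trust weights, chosen only for illustration.
TRUST: Dict[str, float] = {"group": 3.0, "yelp": 1.0}


def aggregate_rating(reviews: Iterable[Review]) -> Optional[float]:
    """Trust-weighted mean of the non-suspicious reviews, or None if none remain."""
    weighted_sum = 0.0
    total_weight = 0.0
    for review in reviews:
        if review.suspicious:
            continue  # filter out dubious reviews before averaging
        weight = TRUST.get(review.source, 1.0)  # unknown sources get weight 1.0
        weighted_sum += weight * review.stars
        total_weight += weight
    if total_weight == 0.0:
        return None
    return weighted_sum / total_weight


reviews = [
    Review("group", 5.0),
    Review("yelp", 3.0),
    Review("yelp", 1.0, suspicious=True),  # ignored
]
print(aggregate_rating(reviews))  # (3*5 + 1*3) / (3 + 1) = 4.5
```

Once the trust and suspicion judgments have been made, combining the numbers is trivial. The value a semantic system would add lies in making those judgments from the review text.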
Unstructured text in law, law enforcement, and other government databases
The stereotypical lawyer in a TV melodrama deals with boxes and boxes of paper documents. Indeed, some economists have noted that the basic work of a lawyer primarily involves reading documents (aka unstructured text). In many legal issues, data and documents from a variety of disciplines and organizations must be combined to file an application, justify an opinion, or build a legal case. Since each organization or discipline has its own conventions, the resulting legal “meta-document” if you will (the docket or dossier) is necessarily unstructured. This fact hasn’t been lost on the Watson team. IBM’s marketing video expressly mentions legal applications as one of the next implementations for Watson.
Crowd-sourced law enforcement?
A related application might be described as “crowdsourced” law enforcement. (We admit this term sounds a bit scary.) This sort of already works in traffic applications. You can report a road hazard or congested conditions using the Waze app. This will alert other nearby drivers. Hopefully it alerts the authorities as well, who can remove the hazard or mitigate the congestion. This works because Waze is dealing with structured data. The main data point is the GPS location and a condition code. It’s straightforward to do this with a pre-determined data format. Any review can be almost entirely automated.
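As a rough illustration of why such a report is easy to automate, here is a hypothetical structured record in Python. This is not Waze's actual data format: the `RoadReport` class, the `Condition` codes, and the `is_valid` check are all invented for this sketch.

```python
from dataclasses import dataclass
from enum import Enum


class Condition(Enum):
    """Hypothetical condition codes; a real app would define its own."""
    HAZARD = "hazard"
    CONGESTION = "congestion"


@dataclass(frozen=True)
class RoadReport:
    latitude: float       # degrees, -90 to 90
    longitude: float      # degrees, -180 to 180
    condition: Condition  # one of a fixed set of codes


def is_valid(report: RoadReport) -> bool:
    """Automated review: check that the coordinates are physically possible."""
    return (-90.0 <= report.latitude <= 90.0
            and -180.0 <= report.longitude <= 180.0)


print(is_valid(RoadReport(34.05, -118.24, Condition.HAZARD)))  # True
print(is_valid(RoadReport(120.0, 0.0, Condition.CONGESTION)))  # False: latitude out of range
```

Because every field has a fixed type and a fixed set of legal values, a program can check a report without any human reading it. A free-text complaint offers no such guarantee.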
However, a great deal of unstructured text is needed to process most citizen complaints. We’ll see in a bit why this will likely never change: it’s the inherent nature of dealing with “bad guys.” But let’s ignore that for a moment, and just look at the data itself.
Take the lowly databases collecting information on U.S. telephone scam calls. This would seem like a simple problem. The main data point is the phone number. Yet these databases contain huge amounts of essentially unstructured data. Why is this? First, there are many such databases. Depending on the exact type of complaint there may be multiple federal, state, local, and private organizations with overlapping jurisdiction or interest in the complaint. (It could be a commercial call, fake non-profit survey call, senior citizens scam, call to Do Not Call List, call to mobile phone, text message, etc. Both the FCC and FTC have separate databases for Do Not Call complaints. There are multiple public databases run by private organizations for reporting of unwanted or illegal phone calls. And so on. There are even a few completely structured databases, like the blacklist used by Google Voice’s spam filter.)
If you’re the FTC or the FCC, you’re dealing with precious resources. Today, manual review of unstructured data is still needed to allocate those resources. (If you’re Google Voice trying to put together a database of spam callers, you might have to undertake a similar process. You might need to read through some of these complaints to help assess their credibility.)
These phone complaint databases are actually one of the easiest problems out there! The key data points are the caller’s phone number, the time or frequency of calls, do-not-call list status, and a few other data items that are already structured on the complaint form. Yet a human is still required to read these complaints and decide which of a plethora of regulations may have been violated (if any).
Internet threats
(Closely related to phone spam databases are Internet threats. You’d think this would be easy, since the main data points are things like IP address and timestamp. However, again, each country and organization has its own, largely unstructured, database. Many countries have multiple organizations, public and private, with overlapping jurisdiction. Then there are the companies that may own the affected machines. Each of these uses its own, completely different system. It’s highly desirable to aggregate this information to respond to threats in real time. However, it’s not currently possible to do this effectively unless you can process unstructured text. We might have a future blog post on this specific topic.)
Unstructured data can’t be “gamed” by the “bad guys”
We were curious to know what the folks at IBM thought about some of our proposed uses for Watson, so we posted to the IBM developer forum. Will Sennett of IBM was kind enough to write a detailed response on the IBM site.
Here’s an excerpt: “I’d have to dig a bit more at the FBI assistant example … certainly solutions in the big data and analytics realm that are a great fit for government…. On the HR side, I think you’re spot on. In fact, one of our Watson Mobile Developer
[Watson] application difficulty and complexity is probably dependent on the data ….”
Read his full response on the IBM forum.