In 2011, IBM got a lot of publicity when it demonstrated a computer named Watson, which was designed to answer questions on the game show Jeopardy. Watson was good enough to be competitive with (and ultimately better than) some of the best ever Jeopardy players.

Playing Jeopardy perhaps seems like a frivolous task, and it’s tempting to think that Watson was primarily a public relations coup for IBM. In fact, Watson is a remarkable technological achievement. Jeopardy questions use all sorts of subtle hints, humour and obscure references. For Watson to compete with the best humans it needed to integrate an enormous range of knowledge, as well as many of the best existing ideas in natural language processing and artificial intelligence.

In this post I describe one of Watson’s ancestors, a question-answering system called AskMSR built in 2001 by researchers at Microsoft Research. AskMSR is much simpler than Watson, and doesn’t work nearly as well, but does teach some striking lessons about question-answering systems. My account of AskMSR is based on a paper by Brill, Lin, Banko, Dumais and Ng, and a followup paper by Brill, Dumais and Banko.

The AskMSR system was developed for a US-Government-run workshop called TREC. TREC is an annual multi-track workshop, with each track concentrating on a different information retrieval task. For each TREC track the organizers pose a challenge to participants in the run-up to the workshop. Participants build systems to solve those challenges, the systems are evaluated by the workshop organizers, and the results of the evaluation are discussed at the workshop.

For many years one of the TREC tracks was question answering (QA), and the AskMSR system was developed for the QA track at the 2001 TREC workshop. At the time, many of the systems submitted to TREC’s QA track relied on complex linguistic analysis to understand as much as possible about the meaning of the questions being asked. The researchers behind AskMSR had a different idea. Instead of doing sophisticated linguistic analysis of questions they decided to do a much simpler analysis, but to draw on a rich database of knowledge to find answers – the web. As they put it:

In contrast to many question answering systems that begin with rich linguistic resources (e.g., parsers, dictionaries, WordNet), we begin with data and use that to drive the design of our system. To do this, we first use simple techniques to look for answers to questions on the Web.

As I describe in detail below, their approach was to take the question asked, to rewrite it in the form of a search engine query, or perhaps several queries, and then extract the answer by analysing the Google results for those queries.

The insight they had was that a large and diverse document collection (such as the web) would contain clear, easily extractable answers to a very large number of questions. Suppose, for example, that your QA system is trying to answer the question “Who killed Abraham Lincoln?” Suppose also that the QA system only received limited training data, including a text which read “Abraham Lincoln’s life was ended by John Wilkes Booth”. It requires a sophisticated analysis to understand that this means Booth killed Lincoln. If that were the only training text related to Lincoln and Booth, the system might not make the correct inference. On the other hand, with a much larger document collection it would be likely that somewhere in the documents it would plainly say “John Wilkes Booth killed Abraham Lincoln”. Indeed, if the document collection were large enough this phrase (and close variants) might well be repeated many times. At that point it takes much less sophisticated analysis to figure out that “John Wilkes Booth” is a good candidate answer. As Brill et al put it:

[T]he greater the answer redundancy in the source, the more likely it is that we can find an answer that occurs in a simple relation to the question, and therefore, the less likely it is that we will need to resort to solving the aforementioned difficulties facing natural language processing systems.

To sum it up even more succinctly, the idea behind AskMSR was that: unsophisticated linguistic algorithms + large amounts of data ≥ sophisticated linguistic algorithms + only a small amount of data.

I’ll now describe how AskMSR worked. I limit my discussion to just a single type of question, questions of the form “Who…?” The AskMSR system could deal with several different question types (“When…?”, etc), but the ideas used for the other question types were similar. More generally, my description is an abridgement and simplification of the original AskMSR system, and you should refer to the original paper for more details. The benefit of my simplified approach is that we’ll be able to create a short Python implementation of the ideas.

How AskMSR worked

Rewriting: Suppose we’re asked a question like “Who is the richest person in the world?” If we’re trying to use Google ourselves to answer the question, then we might simply type the question straight in. But Google isn’t (so far as I know) designed with question-answering in mind. So if we’re a bit more sophisticated we’ll rewrite the text of the question to turn it into a query more likely to turn up answers, queries like richest person world or world's richest person. What makes these queries better is that they omit terms like “Who” which are unlikely to appear in answers to the question we’re interested in.

AskMSR used a similar idea of rewriting, based on very simple rules. Consider “Who is the richest person in the world?” Chances are pretty good that we’re looking for a web page with text of the form “the richest person in the world is *”, where * denotes the (unknown) name of that person. So a good strategy is to search for pages matching the text “the richest person in the world is” exactly, and then to extract whatever name comes to the right.

There are a few things to note about how we rewrote the question to get the search query.

Most obviously, we eliminated “Who” and the question mark.

A little less obviously, we moved the verb “is” to the end of the phrase. Moving the verb in this way is common when answering questions. Unfortunately, we don’t always move the verb in the same way. Consider a question such as “Who is the world’s richest person married to?” The best way to move the verb is not to the end of the sentence, but rather to move it between “the world’s richest person” and “married to”, so we are searching for pages matching the text “the world’s richest person is married to *”.

More generally, we’ll rewrite questions of the form Who w0 w1 w2 w3 ... (where w0 w1 w2 ... are words) as the following search queries (note that the quotes matter, and are part of the query):

"w0 w1 w2 w3 ... "
"w1 w0 w2 w3 ... "
"w1 w2 w0 w3 ... "
...

It’s a carpet-bombing strategy, trying all possible options for the correct place for the verb in the text. Of course, we end up with many ridiculous options, like "the world's is richest person married to". However, according to Brill et al, “While such an approach results in many nonsensical rewrites… these very rarely result in the retrieval of bad pages, and the proper movement position is guaranteed to be found via exhaustive search.”

Why does this very rarely result in the retrieval of bad pages? Ultimately, the justification must be empirical. However, a plausible story is that rewrites such as "the world's is richest person married to" don’t make much sense, and so are likely to match only a very small collection of webpages compared to (say) "world's richest person is married to". Because of this, we will see few repetitions in any erroneous answers extracted from the search results for the nonsensical query, and so the nonsensical query is unlikely to result in incorrect answers.

Problems for the author

Contrary to the last paragraph, I think it’s likely that sometimes the rewritten questions do make sense as search queries, but not as ways of searching for answers to the original question. What are some examples of such questions?

Another feature of the queries above is that they are all quoted. Google uses this syntax to specify that an exact phrase match is required. Below we’ll introduce an unquoted rewrite rule, meaning that an exact phrase match isn’t required.

With the rules I’ve described above we rewrite “Who is the world’s richest person?” as both the query “is the world’s richest person” and the query “the world’s richest person is”. There are several other variations, but for now we’ll focus on just these two. For the first query, “is the world’s richest person”, notice that if we find this text in a page then it’s likely to read something like “Bill Gates is the world’s richest person” (or perhaps “Carlos Slim is…”, since Slim has recently overtaken Gates in wealth). If we search on the second query, then returned pages are more likely to read “the world’s richest person is Bill Gates”.

What this suggests is that accompanying the rewritten query, we should also specify whether we expect the answer to appear on the left (L) or the right (R). Brill et al adopted the rule to expect the answer on the left when the verb starts the query, and on the right when the verb appears anywhere else. With this change the specification of the rewrite rules becomes:

["w0 w1 w2 w3 ... ", L]
["w1 w0 w2 w3 ... ", R]
["w1 w2 w0 w3 ... ", R]
...

Brill et al don’t justify this choice of locations. I haven’t thought hard about it, but the few examples I’ve tried suggest that it’s not a bad convention, although I’ll bet you could come up with counterexamples.

For some rewritten queries Brill et al allow the answer to appear on either (E) side. In particular, they append an extra rewrite rule to the above list, which is:

[w1 w2 w3..., E]

This differs in two ways from the earlier rewrite rules. First, the search query is unquoted, so it doesn’t require an exact match, or for the order of the words to be exactly as specified. Second, the verb is entirely omitted. For these reasons it makes sense to look for answers on either side of the search text.

Of course, if we extract a candidate answer from this less precise search then we won’t be as confident in the answer. For that reason we also assign a score to each rewrite rule. Here’s the complete set of rewrite rules we use, including scores. (I’ll explain how we use the scores a little later: for now all that matters is that higher scores are better.)

["w0 w1 w2 w3 ... ", L, 5]
["w1 w0 w2 w3 ... ", R, 5]
["w1 w2 w0 w3 ... ", R, 5]
...
[w1 w2 w3..., E, 2]
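To make the rule set concrete, here is a minimal sketch that generates it from an already-tokenized question. The names `Rewrite` and `rewrite_rules` are my own for illustration and don't come from the AskMSR papers. Like the rest of this post, the sketch assumes the verb is the single word following “Who”.

```python
from collections import namedtuple

# A rewritten query, the side (L, R or E) where we expect the answer,
# and the score attached to the rewrite rule.
Rewrite = namedtuple("Rewrite", ["query", "side", "score"])

def rewrite_rules(words):
    """
    Return the rewrite rules for `words`, the tokenized question with
    "Who" already removed: [w0, w1, w2, ...], where w0 is assumed to
    be the verb.
    """
    verb, rest = words[0], words[1:]
    # Verb first: expect the answer to the left of the phrase.
    rules = [Rewrite('"' + " ".join([verb] + rest) + '"', "L", 5)]
    # Verb moved to every later position: expect the answer on the right.
    for position in range(1, len(rest) + 1):
        moved = rest[:position] + [verb] + rest[position:]
        rules.append(Rewrite('"' + " ".join(moved) + '"', "R", 5))
    # Unquoted query with the verb dropped: answer may be on either side.
    rules.append(Rewrite(" ".join(rest), "E", 2))
    return rules

for rule in rewrite_rules(["is", "the", "worlds", "richest", "person"]):
    print(rule)
```

For a question with n words after the verb this produces n + 2 queries, so the carpet-bombing stays cheap for questions of ordinary length.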

Problems for the author

At the moment we’ve implicitly assumed that the verb is a single word after “Who”. However, sometimes the verb will be more complex. For example, in the sentence “Who shot and killed President Kennedy?” the verb is “shot and killed”. How can we identify such complex compound verbs?

Extracting candidate answers: We submit each rewritten query to Google, and extract Google’s document summaries for the top 10 search results. We then split the summaries into sentences, and each sentence is assigned a score, which is just the score of the rewrite rule the sentence originated from.

How should we extract candidate answers from these sentences? First, we remove text which overlaps the query itself – we’re looking for terms near the query, not the same as the query! This ensures that we don’t answer questions such as “Who killed John Fitzgerald Kennedy?” with “John Fitzgerald Kennedy”. We’ll call the text that remains after deletion a truncated sentence.
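Here is one way the truncation step might look if we also respect the left/right/either markers attached to each rewrite rule. This is a sketch under my own assumptions, not the procedure from the papers: `find_phrase` and `truncated_sentence` are hypothetical helpers, and the sentences are assumed to have already been stripped of punctuation.

```python
def find_phrase(words, phrase_words):
    """
    Return the index of the first case-insensitive occurrence of
    `phrase_words` in `words`, or -1 if the phrase does not occur.
    """
    lowered = [word.lower() for word in words]
    n = len(phrase_words)
    for start in range(len(lowered) - n + 1):
        if lowered[start:start + n] == phrase_words:
            return start
    return -1

def truncated_sentence(sentence, phrase, side):
    """
    Return the words of `sentence` that may contain the answer.
    `phrase` is the rewritten query without its quotes, and `side` is
    "L", "R" or "E".  For "L" and "R" we keep only the words on that
    side of the exact phrase; for "E" we keep the whole sentence.  In
    every case, words that appear in the query are dropped.
    """
    words = sentence.split()
    phrase_words = phrase.lower().split()
    if side in ("L", "R"):
        start = find_phrase(words, phrase_words)
        if start == -1:
            return []  # exact phrase missing, so nothing to extract
        if side == "L":
            words = words[:start]
        else:
            words = words[start + len(phrase_words):]
    return [word for word in words if word.lower() not in phrase_words]

print(truncated_sentence(
    "Bill Gates is the worlds richest person",
    "is the worlds richest person", "L"))
print(truncated_sentence(
    "The worlds richest person is Bill Gates",
    "the worlds richest person is", "R"))
```

Both calls leave just the two words “Bill” and “Gates”, which is what we’d hope to pass on to the n-gram stage.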

With the sentence truncated, the idea then is to look for 1-, 2-, and 3-word strings (n-grams) which recur often in the truncated sentences. To do this we start by listing every n-gram that appears in at least one truncated sentence. This is our list of candidate answers. For each such n-gram we’ll assign a total score. Roughly speaking, the total score is the sum of the scores of all the truncated sentences that the n-gram appears in.

In fact, we can improve the performance of the system by modifying this scoring procedure. We do this by boosting the contribution an n-gram receives when one or more of its words are capitalized. Capitalization likely indicates a proper noun, which makes the n-gram an especially good candidate to be part of the correct answer. The system outputs the n-gram which has the highest total score as its preferred answer.
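The scoring is easy to try out by hand on a few truncated sentences, without going anywhere near a search engine. The sketch below multiplies the rule score by 3 for each capitalized word, which is the factor used in the toy program. The function `score_candidates` and the three example sentences are made up for illustration, and the capitalization test here (first letter is upper case) is a simplification of my own.

```python
from collections import defaultdict

def ngrams(words, n):
    """Return all the n-grams in the list `words`, as tuples."""
    return [tuple(words[j:j + n]) for j in range(len(words) - n + 1)]

def is_capitalized(word):
    """True if `word` starts with an upper-case letter."""
    return word[:1].isupper()

def score_candidates(scored_sentences, boost=3):
    """
    `scored_sentences` is a list of (words, score) pairs: a truncated
    sentence as a list of words, and the score of the rewrite rule it
    came from.  Returns (n-gram, total score) pairs, highest first.
    """
    totals = defaultdict(int)
    for words, score in scored_sentences:
        for n in (1, 2, 3):
            for ngram in ngrams(words, n):
                capitals = sum(1 for word in ngram if is_capitalized(word))
                totals[ngram] += score * boost ** capitals
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Invented truncated sentences: two from quoted rewrites (score 5) and
# one from the unquoted rewrite (score 2).
example = [
    (["Bill", "Gates"], 5),
    (["Bill", "Gates"], 5),
    (["Forbes", "says", "Bill", "Gates", "remains", "in", "world"], 2),
]
best, total = score_candidates(example)[0]
print(" ".join(best))  # Bill Gates
print(total)           # 108
```

The 2-gram “Bill Gates” collects 5 × 9 from each of the first two sentences and 2 × 9 from the third, for a total of 108, comfortably ahead of the single words “Bill” and “Gates” at 36 each.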

Toy program: Below is a short Python 2.7 program which implements the algorithm described above. (It omits the left-versus-right-versus-either distinction, though.) The code is also available on GitHub, together with a small third-party library (via Mario Vilas) that’s used to access Google search results. Here’s the code:

#### mini_qa.py
#
# A toy question-answering system, which uses Google to attempt to
# answer questions of the form "Who... ?"  An example is: "Who
# discovered relativity?"
#
# The design is a simplified version of the AskMSR system developed by
# researchers at Microsoft Research.  The original paper is:
#
# Brill, Lin, Banko, Dumais and Ng, "Data-Intensive Question
# Answering" (2001).
#
# I've described background to this program here:
#
# http://michaelnielsen.org/ddi/how-to-answer-a-question-v1/

#### Copyright and licensing
#
# MIT License - see GitHub repo for details:
#
# https://github.com/mnielsen/mini_qa/blob/blog-v1/mini_qa.py
#
# Copyright (c) 2012 Michael Nielsen

#### Library imports

# standard library
from collections import defaultdict
import re

# third-party libraries
from google import search

def pretty_qa(question, num=10):
    """
    Wrapper for the `qa` function.  `pretty_qa` prints the `num`
    highest scoring answers to `question`, with the scores in
    parentheses.
    """
    print "\nQ: "+question
    for (j, (answer, score)) in enumerate(qa(question)[:num]):
        print "%s. %s (%s)" % (j+1, " ".join(answer), score)

def qa(question):
    """
    Return a list of tuples whose first entry is a candidate answer to
    `question`, and whose second entry is the score for that answer.
    The tuples are ordered in decreasing order of score.  Note that
    the answers themselves are tuples, with each entry being a word.
    """
    answer_scores = defaultdict(int)
    for query in rewritten_queries(question):
        for summary in get_google_summaries(query.query):
            for sentence in sentences(summary):
                for ngram in candidate_answers(sentence, query.query):
                    answer_scores[ngram] += ngram_score(ngram, query.score)
    return sorted(answer_scores.iteritems(),
                  key=lambda x: x[1],
                  reverse=True)

def rewritten_queries(question):
    """
    Return a list of RewrittenQuery objects, containing the search
    queries (and corresponding weighting score) generated from
    `question`.
    """
    rewrites = []
    tq = tokenize(question)
    verb = tq[1] # the simplest assumption, something to improve
    # quoted query with the verb at the start
    rewrites.append(
        RewrittenQuery("\"%s %s\"" % (verb, " ".join(tq[2:])), 5))
    # quoted queries with the verb moved to each later position
    for j in range(2, len(tq)):
        rewrites.append(
            RewrittenQuery(
                "\"%s %s %s\"" % (
                    " ".join(tq[2:j+1]), verb, " ".join(tq[j+1:])),
                5))
    # unquoted query with the verb omitted
    rewrites.append(RewrittenQuery(" ".join(tq[2:]), 2))
    return rewrites

def tokenize(question):
    """
    Return a list containing a tokenized form of `question`.  Works by
    lowercasing, splitting around whitespace, and stripping all
    non-alphanumeric characters.
    """
    return [re.sub(r"\W", "", x) for x in question.lower().split()]

class RewrittenQuery():
    """
    Given a question we rewrite it as a query to send to Google.
    Instances of the RewrittenQuery class are used to store these
    rewritten queries.  Instances have two attributes: the text of the
    rewritten query, which is sent to Google; and a score, indicating
    how much weight to give to the answers.  The score is used because
    some queries are much more likely to give highly relevant answers
    than others.
    """

    def __init__(self, query, score):
        self.query = query
        self.score = score

def get_google_summaries(query):
    """
    Return a list of the top 10 summaries associated to the Google
    results for `query`.  Returns all available summaries if there are
    fewer than 10 summaries available.  Note that these summaries are
    returned as BeautifulSoup.BeautifulSoup objects, and may need to
    be manipulated further to extract text, links, etc.
    """
    return search(query)

def sentences(summary):
    """
    Return a list whose entries are the sentences in the
    BeautifulSoup.BeautifulSoup object `summary` returned from Google.
    Note that the sentences contain alphabetical and space characters
    only, and all punctuation, numbers and other special characters
    have been removed.
    """
    text = remove_spurious_words(text_of(summary))
    sentences = [sentence for sentence in text.split(".") if sentence]
    return [re.sub(r"[^a-zA-Z ]", "", sentence) for sentence in sentences]

def text_of(soup):
    """
    Return the text associated to the BeautifulSoup.BeautifulSoup
    object `soup`.
    """
    return "".join(str(soup.findAll(text=True)))

def remove_spurious_words(text):
    """
    Return `text` with spurious words stripped.  For example, Google
    includes the word "Cached" in many search summaries, and this word
    should therefore mostly be ignored.
    """
    spurious_words = ["Cached", "Similar"]
    for word in spurious_words:
        text = text.replace(word, "")
    return text

def candidate_answers(sentence, query):
    """
    Return all the 1-, 2-, and 3-grams in `sentence`.  Terms appearing
    in `query` are filtered out.  Note that the n-grams are returned
    as a list of tuples.  So a 1-gram is a tuple with 1 element, a
    2-gram is a tuple with 2 elements, and so on.
    """
    filtered_sentence = [word for word in sentence.split()
                         if word.lower() not in query]
    return sum([ngrams(filtered_sentence, j) for j in range(1,4)], [])

def ngrams(words, n=1):
    """
    Return all the `n`-grams in the list `words`.  The n-grams are
    returned as a list of tuples, each tuple containing an n-gram, as
    per the description in `candidate_answers`.
    """
    return [tuple(words[j:j+n]) for j in xrange(len(words)-n+1)]

def ngram_score(ngram, score):
    """
    Return the score associated to `ngram`.  The base score is
    `score`, but it's modified by a factor which is 3 to the power of
    the number of capitalized words.  This biases answers toward
    proper nouns.
    """
    num_capitalized_words = sum(
        1 for word in ngram if is_capitalized(word))
    return score * (3**num_capitalized_words)

def is_capitalized(word):
    """
    Return True or False according to whether `word` is capitalized.
    """
    return word == word.capitalize()

if __name__ == "__main__":
    pretty_qa("Who ran the first four-minute mile?")
    pretty_qa("Who makes the best pizza in New York?")
    pretty_qa("Who invented the C programming language?")
    pretty_qa("Who wrote the Iliad?")
    pretty_qa("Who caused the financial crash of 2008?")
    pretty_qa("Who caused the Great Depression?")
    pretty_qa("Who is the most evil person in the world?")
    pretty_qa("Who wrote the plays of William Shakespeare?")
    pretty_qa("Who is the world's best tennis player?")
    pretty_qa("Who is the richest person in the world?")

If you run the program you’ll see that the results are a mixed bag. When I tested it, it knew that Roger Bannister ran the first four-minute mile, that Dennis Ritchie invented the C programming language, and that Homer wrote the Iliad. On the other hand, the program sometimes gives answers that are either wrong or downright nonsensical. For example, it thinks that Occupy Wall Street caused the financial crash of 2008 (Ben Bernanke also scores highly). And it replies “Laureus Sports Awards” when asked who is the world’s best tennis player. So it’s quite a mix of good and bad results.

While developing the program I used some questions repeatedly to figure out how to improve performance. For example, I often asked it the questions “Who invented relativity?” and “Who killed Abraham Lincoln?” Unsurprisingly, the program now answers both questions correctly! So to make things fairer the questions used as examples in the code aren’t ones I tested while developing the program. They’re still far from a random sample, but at least the most obvious form of bias has been removed.

Much can be done to improve this program. Here are a few ideas:

Problems

To solve many of the problems I describe below it would help to have a systematic procedure to evaluate the performance of the system, and, in particular, to compare the performance of different versions of the system. How can we build such a systematic evaluation procedure?

The program does no sanity-checking of questions. For example, it simply drops the first word of the question. As a result, the question “Foo killed Abraham Lincoln?” is treated identically to “Who killed Abraham Lincoln?” Add some basic sanity checks to ensure the question satisfies standard constraints on formatting, etc.

More generally, the program makes no distinction between sensible and nonsensical questions. You can ask it “Who coloured green ideas sleeping furiously?” and it will happily answer. (Perhaps appropriately, “Chomsky” is one of the top answers.) How might the program figure out that the question doesn’t make sense? (This is a problem shared by many other systems, including Google.)

The program boosts the score of n-grams containing capitalized words. This makes sense most of the time, since a capitalized word is likely to be a proper noun, and is thus a good candidate to be the answer to the question. But it makes less sense when the word is the first word of the sentence. What should be done then?

In the sentences extracted from Google we removed terms which appeared in the original query. This means that, for example, we won’t answer “Who killed John Fitzgerald Kennedy?” with “John Fitzgerald Kennedy”. It’d be good to take this further by also eliminating synonyms, so that we don’t answer “JFK”. How can this be done?

There are occasional exceptions to the rule that answers don’t repeat terms in the question. For example, the correct answer to the question “Who wrote the plays of William Shakespeare?” is, most likely, “William Shakespeare”. How can we identify questions where this is likely to be the case?

How does it change the results to use more than 10 search summaries? Do the results get better or worse?

An alternative approach to using Google would be to use Bing, Yahoo! BOSS, Wolfram Alpha, or Cyc as sources. Or we could build our own source using tools such as Nutch and Solr. How do the resulting systems compare to one another? Is it possible to combine the sources to do better than any single source alone?

Some n-grams are much more common than others: the phrase “he was going” occurs far more often in English text than the phrase “Lee Harvey Oswald”, for example. Because of this, chances are that repetitions of the phrase “Lee Harvey Oswald” in search results are more meaningful than repetitions of more common phrases, such as “he was going”. It’d be natural to modify the program to give less common phrases a higher score. What’s the right way of doing this? Does using inverse document frequency give better performance, for example?

Question-answering is really three problems: (1) understanding the question; (2) figuring out the answer; and (3) explaining the answer. In this post I’ve concentrated on (2) and (to some extent) (1), but not (3). How might we go about explaining the answers generated by our system?
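As one possible starting point for the first problem above (systematic evaluation), here is a sketch of mean reciprocal rank, the kind of measure used in the early TREC QA evaluations. Each question scores 1/rank for the first correct answer among the system's top few answers, or 0 if none is correct, and the scores are averaged over all questions. The function names, the cutoff of 5, and the ranked lists in the example are my own illustrative choices. The rankings are invented, and are not output from the program above.

```python
def reciprocal_rank(ranked_answers, correct_answers, cutoff=5):
    """
    Return 1/rank of the first answer in `ranked_answers` (best first)
    that matches one of `correct_answers`, ignoring case.  Only the
    first `cutoff` answers are considered.  Returns 0.0 if none match.
    """
    accepted = set(answer.lower() for answer in correct_answers)
    for rank, answer in enumerate(ranked_answers[:cutoff], start=1):
        if answer.lower() in accepted:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(results, cutoff=5):
    """
    `results` is a list of (ranked_answers, correct_answers) pairs,
    one pair per test question.  Returns the average reciprocal rank.
    """
    if not results:
        return 0.0
    total = sum(reciprocal_rank(ranked, correct, cutoff)
                for ranked, correct in results)
    return total / len(results)

# Invented rankings for three questions, purely for illustration.
results = [
    (["Roger Bannister", "John Landy"], ["Roger Bannister"]),  # rank 1
    (["Bell Labs", "Dennis Ritchie"], ["Dennis Ritchie"]),     # rank 2
    (["Greek", "Troy"], ["Homer"]),                            # no match
]
print(mean_reciprocal_rank(results))  # 0.5
```

With a fixed set of questions and accepted answers, a number like this makes it possible to tell whether a change to the rewrite rules or the scoring actually helps. Exact string matching is crude, though: a real test set would need to accept variants such as “Bannister” and “Sir Roger Bannister”.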

Discussion

Let’s come back to the idea I described in the opening: unsophisticated algorithms + large amounts of data ≥ sophisticated linguistic algorithms + only a small amount of data.

Variations on this idea have become common recently. In an influential 2009 paper, Halevy, Norvig, and Pereira wrote that:

[I]nvariably, simple models and a lot of data trump more elaborate models based on less data.

They were writing in the context of machine translation and speech recognition, but this point of view has become commonplace well beyond machine translation and speech recognition. For example, at the 2012 O’Reilly Strata California conference on big data, there was a debate on the idea that “[i]n data science, domain expertise is more important than machine learning skill.” The people favouring the machine learning side won the debate, at least in the eyes of the audience. Admittedly, a priori you’d expect this audience to strongly favour machine learning. Still, I expect that just a few years earlier even a highly sympathetic audience would have tilted the other way.

An interesting way of thinking about this idea that data trumps the quality of your models and algorithms is in terms of a tradeoff curve:

The curve shown is one of constant performance: as you increase the amount of training data, you decrease the quality of the algorithm required to get the same performance. The tradeoff between the two is determined by the slope of the curve.

A difficulty with the “more data is better” point of view is that it’s not clear how to determine what the tradeoffs are in practice: is the slope of the curve very shallow (more data helps more than better algorithms), or very steep (better algorithms help more than more data)? To put it another way, it’s not obvious whether to focus on acquiring more data, or on improving your algorithms. Perhaps the correct moral to draw is that this is a key tradeoff to think about when deciding how to allocate effort. At least in the case of the AskMSR system, taking the “more data” idea seriously enabled the team to very quickly build a system that was competitive with other systems which had taken much longer to develop.

A second difficulty with the “more data is better” point of view is, of course, that sometimes more data is extremely expensive to generate. IBM has said that it intends to use Watson to help doctors answer medical questions. Yet for such highly specialized questions it’s not so clear that it will be easy to find data sources which meet the criteria outlined by Brill et al:

[T]he greater the answer redundancy in the source, the more likely it is that we can find an answer that occurs in a simple relation to the question, and therefore, the less likely it is that we will need to resort to solving the aforementioned difficulties facing natural language processing systems.

Data versus understanding: I’ll finish by briefly discussing one possible criticism of the data-driven approach to question-answering. That criticism is that it rejects detailed understanding in favour of a simplistic data-driven associative model. It’s tempting to think that this must therefore be the wrong track, a blind alley that may pay off in the short term, but which will ultimately be unfruitful. I think that point of view is shortsighted. We don’t yet understand how human cognition works, but we do know that we often use after-the-fact rationalization and confabulation, suggesting that the basis for many of our decisions isn’t a detailed understanding, but something more primitive. This isn’t to say that rationalization and confabulation are desirable. But it does mean that it’s worth pushing beyond narrow conceptions of understanding, and trying to determine how much of our intelligence can be mimicked using simple data-driven associative techniques, and by combining such techniques with other ideas.

Acknowledgements: Thanks to Thomas Ballinger, Dominik Dabrowski, Nathan Hoffman, and Allan O’Donnell for comments that improved my code.
