
The Web as a Knowledge-Base for Answering Complex Questions


Abstract

Answering complex questions is a time-consuming activity for humans that requires reasoning and integration of information. Recent work on reading comprehension made headway in answering simple questions, but tackling complex questions is still an ongoing research challenge. Conversely, semantic parsers have been successful at handling compositionality, but only when the information resides in a target knowledge-base. In this paper, we present a novel framework for answering broad and complex questions, assuming answering simple questions is possible using a search engine and a reading comprehension model. We propose to decompose complex questions into a sequence of simple questions, and compute the final answer from the sequence of answers. To illustrate the viability of our approach, we create a new dataset of complex questions, ComplexWebQuestions, and present a model that decomposes questions and interacts with the web to compute an answer. We empirically demonstrate that question decomposition improves performance from 20.8 precision@1 to 27.5 precision@1 on this new dataset.

1 Introduction

Humans often want to answer complex questions that require reasoning over multiple pieces of evidence, e.g., "From what country is the winner of the Australian Open women's singles 2008?". Answering such questions in broad domains can be quite onerous for humans, because it requires searching and integrating information from multiple sources.

Recently, interest in question answering (QA) has surged in the context of reading comprehension (RC), where an answer is sought for a question given one or more documents (Hermann et al., 2015; Joshi et al., 2017; Rajpurkar et al., 2016).

Figure 1: Given a complex question q, we decompose the question to a sequence of simple questions q1, q2, . . . , use a search engine and a QA model to answer the simple questions, from which we compute the final answer a. (In the figure, the final answer is recomposed as ({Cardiff} ∪ {Lviv}) ∩ {Warsaw, Kiev, Lviv, ...} = {Lviv}.)

Neural models trained over large datasets led to great progress in RC, nearing human-level performance. However, analyses of these models (Jia and Liang, 2017; Chen et al., 2016) revealed that they mostly excel at matching questions to local contexts, but struggle with questions that require reasoning. Moreover, RC assumes that documents with the information relevant for the answer are available, but when questions are complex, even retrieving the documents can be difficult.

Conversely, work on QA through semantic parsing has focused primarily on compositionality: questions are translated to compositional programs that encode a sequence of actions for finding the answer in a knowledge-base (KB) (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Krishnamurthy and Mitchell, 2012; Kwiatkowski et al., 2013; Liang et al., 2011). However, this reliance on a manually-curated KB has limited the coverage and applicability of semantic parsers.

In this paper we present a framework for QA that is broad, i.e., it does not assume information is in a KB or in retrieved documents, and compositional, i.e., to compute an answer we must perform some computation or reasoning. Our thesis is that answering simple questions can be achieved by combining a search engine with an RC model. Thus, answering complex questions can be addressed by decomposing the question into a sequence of simple questions, and computing the answer from the corresponding answers. Figure 1 illustrates this idea. Our model decomposes the question in the figure into a sequence of simple questions, each of which is submitted to a search engine; an answer is then extracted from the search results. Once all answers are gathered, a final answer can be computed using symbolic operations such as union and intersection.

To evaluate our framework we need a dataset of complex questions that calls for reasoning over multiple pieces of information. Because an adequate dataset is missing, we created COMPLEXWEBQUESTIONS, a new dataset for complex questions that builds on WEBQUESTIONSSP, a dataset that includes pairs of simple questions and their corresponding SPARQL query. We take SPARQL queries from WEBQUESTIONSSP and automatically create more complex queries that include phenomena such as function composition, conjunctions, superlatives and comparatives. Then, we use Amazon Mechanical Turk (AMT) to generate natural language questions, and obtain a dataset of 34,689 question-answer pairs (and also SPARQL queries that our model ignores). Data analysis shows that examples are diverse and that AMT workers perform substantial paraphrasing of the original machine-generated questions.

We propose a model for answering complex questions through question decomposition. Our model uses a sequence-to-sequence architecture (Sutskever et al., 2014) to map utterances to short programs that indicate how to decompose the question and compose the retrieved answers. To obtain supervision for our model, we perform a noisy alignment from machine-generated questions to natural language questions and automatically generate noisy supervision for training.1 We evaluate our model on COMPLEXWEBQUESTIONS and find that question decomposition substantially improves precision@1 from 20.8 to 27.5. We find that humans are able to reach 63.0 precision@1 under a limited time budget, leaving ample room for improvement in future work.

1 We defer training from question-answer pairs to future work.

To summarize, our main contributions are:

Figure 2: A computation tree for “What city is the birthplace of the author of ‘Without end’, and hosted Euro 2012?”. The leaves are strings, and inner nodes are functions (red) applied to their children to produce answers (blue).

1. A framework for answering complex questions through question decomposition.
2. A sequence-to-sequence model for question decomposition that substantially improves performance.
3. A dataset of 34,689 examples of complex and broad questions, along with answers, web snippets, and SPARQL queries. Our dataset, COMPLEXWEBQUESTIONS, can be downloaded from http://nlp.cs.tau.ac.il/compwebq, and our codebase can be downloaded from https://github.com/alontalmor/WebAsKB.

2 Problem Formulation

Our goal is to learn a model that, given a question q and a black-box QA model for answering simple questions, SIMPQA(•), produces a computation tree t (defined below) that decomposes the question and computes the answer. The model is trained from a set of N question-computation tree pairs {q_i, t_i}_{i=1}^N or question-answer pairs {q_i, a_i}_{i=1}^N.

A computation tree is a tree where leaves are labeled with strings, and inner nodes are labeled with functions. The arguments of a function are its children sub-trees. To compute an answer, or denotation, from a tree, we recursively apply the function at the root to its children. More formally, given a tree rooted at node t, labeled by the function f, that has children c_1(t), . . . , c_k(t), the denotation ⟦t⟧ = f(⟦c_1(t)⟧, . . . , ⟦c_k(t)⟧) is an arbitrary function applied to the denotations of the root's children. Denotations are computed recursively, and the denotation of a string at a leaf is the string itself, i.e., ⟦l⟧ = l. This is closely related to "semantic functions" in semantic parsing, except that we do not interact with a KB, but rather compute directly over the breadth of the web through a search engine. Figure 2 provides an example computation tree for our running example. Notice that words at the leaves are not necessarily in the original question, e.g., "city" is paraphrased to "cities". More broadly, our framework allows paraphrasing questions in any way that is helpful for the function SIMPQA(•). Paraphrasing for better interaction with a QA model has been recently suggested by Buck et al. (2017) and Nogueira and Cho (2016).
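To make the recursion concrete, here is a minimal Python sketch, not the authors' code, of computation-tree evaluation: leaves are strings and inner nodes apply a function to the denotations of their children. The stubbed simpqa function and its toy answer sets (taken from Figure 1) are purely illustrative.

```python
# Minimal sketch of computation-tree evaluation (illustrative only).
# Leaves are strings; an inner node applies its function to the denotations
# of its children, mirroring [[t]] = f([[c1(t)]], ..., [[ck(t)]]).
from dataclasses import dataclass
from typing import Callable, List, Union

@dataclass
class Node:
    fn: Callable[..., set]                 # function labeling an inner node
    children: List[Union["Node", str]]     # sub-trees or leaf strings

def denotation(t: Union["Node", str]):
    """Recursively compute the denotation of a tree; [[l]] = l for a leaf."""
    if isinstance(t, str):
        return t
    return t.fn(*[denotation(c) for c in t.children])

# Toy stand-ins (answer sets taken from Figure 1; not real model output):
def simpqa(question: str) -> set:
    if "birthplace" in question:
        return {"Cardiff", "Lviv"}
    return {"Warsaw", "Kiev", "Lviv"}

def conj(a: set, b: set) -> set:           # CONJ: set intersection
    return a & b

tree = Node(conj, [Node(simpqa, ["what cities hosted Euro 2012"]),
                   Node(simpqa, ["birthplace of the author of 'Without end'"])])
print(denotation(tree))                    # {'Lviv'}
```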

We defined the function SIMPQA(•) for answering simple questions, but in fact it comprises two components in this work. First, the question is submitted to a search engine that retrieves a list of web snippets. Next, an RC model extracts the answer from the snippets. While it is possible to train the RC model jointly with question decomposition, in this work we pre-train it separately and later treat it as a black box.

The expressivity of our QA model is determined by the functions used, which we turn to next.

3 Formal Language

Functions in our formal language take arguments and return values that can be strings (when decomposing or re-phrasing the question), sets of strings, or sets of numbers. Our set of functions includes:

1. SIMPQA(•): Model for answering simple questions, which takes a string argument and returns a set of strings or numbers as answer.

2. COMP(•, •):

This function takes a string containing one unique variable VAR, and a set of answers. E.g., in Figure 2 the first argument is "birthplace of VAR", and the second argument is {KEN FOLLETT, ADAM ZAGAJEWSKI}. The function replaces the variable with the string representation of each answer and returns the union of the results. Formally, COMP(q, A) = ∪_{a∈A} SIMPQA(q/a), where q/a denotes the string produced when replacing VAR in q with a (see the code sketch after this list). This is similar to function composition in CCG (Steedman, 2000), or a join operation in λ-DCS (Liang, 2013), where the string is a function applied to previously-computed values.

3. CONJ(•, •): takes two sets and returns their intersection. Other set operations can be defined analogously. As syntactic sugar, we allow CONJ(•) to take strings as input, which means that we run SIMPQA(•) to obtain a set and then perform the intersection. The root node in Figure 2 illustrates an application of CONJ.

4. ADD(•, •): takes two singleton sets of numbers and returns a set with their addition. Similar functions can be defined analogously. While we support mathematical operations, they were not required in our dataset.
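Below is a minimal sketch of COMP and ADD under the definitions above; SIMPQA is treated as an arbitrary black-box callable from a question string to a set of answer strings, and all names in the sketch are illustrative assumptions.

```python
# Minimal sketch of COMP and ADD from the formal language (illustrative only).
# SIMPQA is any black-box callable from a question string to a set of strings.
from typing import Callable, Set

def comp(question_with_var: str, answers: Set[str],
         simpqa: Callable[[str], Set[str]]) -> Set[str]:
    """COMP(q, A) = union over a in A of SIMPQA(q with VAR replaced by a)."""
    result: Set[str] = set()
    for a in answers:
        result |= simpqa(question_with_var.replace("VAR", a))
    return result

def add(a: Set[float], b: Set[float]) -> Set[float]:
    """ADD: takes two singleton sets of numbers and returns their sum as a set."""
    (x,), (y,) = tuple(a), tuple(b)
    return {x + y}

# Hypothetical usage, following Figure 2:
# comp("birthplace of VAR", {"KEN FOLLETT", "ADAM ZAGAJEWSKI"}, simpqa)
```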

Other logical operations In semantic parsing, superlative and comparative questions like "What is the highest European mountain?" or "What European mountains are higher than Mont Blanc?" are answered by joining the set of European mountains with their elevation. While we could add such functions to the formal language, answering such questions from the web is cumbersome: we would have to extract a list of entities and a numerical value for each. Instead, we handle such constructions using SIMPQA directly, assuming they are mentioned verbatim on some web document. Similarly, negation questions ("What countries are not in the OECD?") are difficult to handle when working against a search engine only, as this is an open-world setup and we do not hold a closed set of countries over which we can perform set subtraction.

In future work, we plan to interface with tables (Pasupat and Liang, 2015) and KBs (Zhong et al., 2017) . This will allow us to perform set operations over well-defined sets, and handle in a compositional manner superlatives and comparatives.

4 Dataset

Evaluating our framework requires a dataset of broad and complex questions that examine the importance of question decomposition. While many QA datasets have been developed recently (Yang et al., 2015; Rajpurkar et al., 2016; Hewlett et al., 2016; Nguyen et al., 2016; Onishi et al., 2016; Hill et al., 2015; Welbl et al., 2017), they lack a focus on the importance of question decomposition.

Most RC datasets contain simple questions that can be answered from a short input document. Recently, TRIVIAQA (Joshi et al., 2017) included a larger proportion of complex questions, but still most do not require reasoning. Moreover, the focus of TRIVIAQA is on answer extraction from documents that are given. We, conversely, highlight question decomposition for finding the relevant documents. Put differently, RC is complementary to question decomposition and can be used as part of the implementation of SIMPQA. In Section 6 we demonstrate that question decomposition is useful for two different RC approaches.

Figure 4: Similarity between MG and NL questions, measured with normalized edit-distance and the DICE coefficient (bars are stacked).

4.1 Dataset Collection

To generate complex questions we use the dataset WEBQUESTIONSSP, which contains 4,737 questions paired with SPARQL queries for Freebase (Bollacker et al., 2008). Questions are broad but simple. Thus, we sample question-query pairs, automatically create more complex SPARQL queries, automatically generate questions that are understandable to AMT workers, and then have them paraphrase those into natural language (similar to Wang et al. (2015)). We compute answers by executing the complex SPARQL queries against Freebase, and obtain broad and complex questions. Figure 6 provides an example of this procedure, and we elaborate next.

Figure 6: Overview of data collection process. Blue text denotes different stages of the term addition, green represents the obj value, and red the intermediate text to connect the new term and seed question
Table 2: Examples and frequency of prevalent phenomena in the NL questions for a manually analyzed subset (see text).
Table 4: precision@1 results on the development set and test set for COMPLEXWEBQUESTIONS.
Table 6: F1 results for COMPLEXQUESTIONS.

Generating SPARQL queries Given a SPARQL query r, we create four types of more complex queries: conjunctions, superlatives, comparatives, and compositions. Table 7 gives the exact rules for generation. For conjunctions, superlatives, and comparatives, we identify queries in WEBQUESTIONSSP whose denotation is a set A, |A| ≥ 2, and generate a new query r′ whose denotation is a strict subset A′, A′ ⊂ A, A′ ≠ ∅. For conjunctions this is done by traversing the KB and looking for SPARQL triplets that can be added and will yield a valid set A′. For comparatives and superlatives we find a numerical property common to all a ∈ A, and add a triplet and restrictor to r accordingly. For compositions, we find an entity e in r, replace e with a variable ?y, and add to r a triplet such that the denotation of that triplet is {e}.

Table 7: Rules for generating a complex query r′ from a query r (’.’ in SPARQL corresponds to logical and). The query r returns the variable ?x, and contains an entity e. We denote by r[e/y] the replacement of the entity e with a variable ?y. pred1 and pred2 are any KB predicates, obj is any KB entity, V is a numerical value, and ?c is a variable of a CVT type in Freebase which refers to events. The last column provides an example for a NL question for each type.

Machine-generated (MG) questions To have AMT workers paraphrase SPARQL queries into natural language, we need to present them in an understandable form. Therefore, we automatically generate a question they can paraphrase. When we generate new SPARQL queries, new predicates are added to the query (Table 7). We manually annotated 687 templates mapping KB predicates to text for different compositionality types (with 462 unique KB predicates), and use those templates to modify the original WEBQUESTIONSSP question according to the meaning of the generated SPARQL query. E.g., the template for ?x ns:book.author.works_written obj is "the author who wrote OBJ". For brevity, we provide the details in the supplementary material.
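As a rough illustration (not the authors' code), a template of this kind could be fused with a seed question as follows. The template strings and the "and is" glue for conjunctions come from the examples in this paper and its supplementary material; the helper name, the dictionary, and the exact string handling are assumptions.

```python
# Rough sketch of building a machine-generated (MG) question from a seed
# question and a predicate template (illustrative; not the authors' code).
TEMPLATES = {
    # KB predicate -> natural-language template with an OBJ placeholder
    "ns:book.author.works_written": "the author who wrote OBJ",
    "ns:film.film.produced_by": "the movie that was produced by OBJ",
}

def mg_conjunction(seed_question: str, predicate: str, obj_name: str) -> str:
    """Fuse a seed question with a new conjunct using the 'and is' glue."""
    conjunct = TEMPLATES[predicate].replace("OBJ", obj_name)
    return seed_question + " and is " + conjunct

# Hypothetical usage:
# mg_conjunction("Who was born in London", "ns:book.author.works_written",
#                "Without End")
# -> "Who was born in London and is the author who wrote Without End"
```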

Question Rephrasing We used AMT workers to paraphrase MG questions into natural language (NL). Each question was paraphrased by one AMT worker and validated by 1-2 other workers. To generate diversity, workers got a bonus if the edit distance of a paraphrase was high compared to the MG question. A total of 200 workers were involved, and 34,689 examples were produced with an average cost of $0.11 per question. Table 7 gives an example for each compositionality type.

A drawback of our method for generating data is that because queries are generated automatically the question distribution is artificial from a semantic perspective. Still, developing models that are capable of reasoning is an important direction for natural language understanding, and COMPLEXWEBQUESTIONS provides an opportunity to develop and evaluate such models.

To summarize, each of our examples contains a question, an answer, a SPARQL query (that our models ignore), and all web snippets harvested by our model when attempting to answer the question. This renders COMPLEXWEBQUESTIONS useful for both the RC and semantic parsing communities.

4.2 Dataset Analysis

COMPLEXWEBQUESTIONS builds on the WEBQUESTIONS dataset (Berant et al., 2013). Questions in WEBQUESTIONS are usually about properties of entities ("What is the capital of France?"), often with some filter for the semantic type of the answer ("Which director", "What city"). WEBQUESTIONS also contains questions that refer to events with multiple entities ("Who did Brad Pitt play in Troy?"). COMPLEXWEBQUESTIONS contains all these semantic phenomena, but we add four compositionality types by generating composition questions (45% of the time), conjunctions (45%), superlatives (5%) and comparatives (5%).

Paraphrasing To generate rich paraphrases, we gave a bonus to workers that substantially modified MG questions. To check whether this worked, we measured surface similarity between MG and NL questions. Using normalized edit-distance and the DICE coefficient, we found that NL questions are different from MG questions and that the similarity distribution has wide support (Figure 4). We also found that AMT workers tend to shorten the MG question (MG avg. length: 16, NL avg. length: 13.18), and use a richer vocabulary (MG # unique tokens: 9,489, NL # unique tokens: 14,282). We created a heuristic for approximating the amount of word re-ordering performed by AMT workers. For every question, we constructed a matrix A, where A_ij is the similarity between token i in the MG question and token j in the NL question. Similarity is 1 if lemmas match, or the cosine similarity according to GloVe embeddings (Pennington et al., 2014) when above a threshold, and 0 otherwise. The matrix A allows us to estimate whether parts of the MG question were re-ordered when paraphrased to NL (details in the supplementary material). We find that in 44.7% of the conjunction questions and 13.2% of the composition questions word re-ordering happened, illustrating that substantial changes to the MG question have been made. Figure 8 illustrates the matrix A for a pair of questions with re-ordering.

Figure 8: Heat map for similarity matrix between an MG and NL question. The red line indicates a known MG split point. The blue line is the approximated NL split point. Below is a graph of each candidate split point score.
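A minimal sketch of how the alignment matrix A described above could be built. The lemmatizer and embedding lookups are passed in as callables, and the 0.5 threshold is an arbitrary placeholder, since the paper does not specify its value.

```python
# Sketch of the MG/NL token-similarity matrix A (illustrative only).
# A[i][j] = 1 if lemmas match, else the GloVe cosine similarity when it
# exceeds a threshold, else 0.
import numpy as np

def similarity_matrix(mg_tokens, nl_tokens, lemma, embed, threshold=0.5):
    """lemma: token -> lemma string; embed: token -> np.ndarray or None."""
    A = np.zeros((len(mg_tokens), len(nl_tokens)))
    for i, m in enumerate(mg_tokens):
        for j, n in enumerate(nl_tokens):
            if lemma(m) == lemma(n):
                A[i, j] = 1.0
                continue
            vm, vn = embed(m), embed(n)
            if vm is not None and vn is not None:
                cos = float(np.dot(vm, vn) /
                            (np.linalg.norm(vm) * np.linalg.norm(vn) + 1e-8))
                if cos > threshold:
                    A[i, j] = cos
    return A
```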

Last, we find that in WEBQUESTIONS almost all questions start with a wh-word, but in COMPLEXWEBQUESTIONS 22% of the questions start with another word, again showing substantial paraphrasing from the original questions.

Qualitative Analysis

We randomly sampled 100 examples from the development set and manually identified prevalent phenomena in the data. We present these types in Table 2 along with their frequency. In 18% of the examples a conjunct in the MG question becomes a modifier of a wh-word in the NL question (WH-MODIFIER). In 22% substantial word re-ordering of the MG question occurred, and in 42% minor word re-ordering occurred ("number of building floors is 50" paraphrased as "has 50 floors"). AMT workers used a synonym in 54% of the examples, omitted words in 27%, and added new lexical material in 29%.

To obtain intuition for operations that will be useful in our model, we analyzed the 100 examples for the types of operations that should be applied to the NL question during question decomposition. We found that splitting the NL question is insufficient, and that in 53% of the cases a word in the NL question needs to be copied to multiple questions after decomposition (row 3 in Table 3 ). Moreover, words that did not appear in the MG question need to be added in 39% of the cases, and words need to be deleted in 28% of the examples.

5 Model And Learning

We would like to develop a model that translates questions into arbitrary computation trees with arbitrary text at the tree leaves. However, this requires training from denotations using methods such as maximum marginal likelihood or reinforcement learning (Guu et al., 2017) that are difficult to optimize. Moreover, such approaches involve issuing large amounts of queries to a search engine at training time, incurring high costs and slowing down training. Instead, we develop a simple approach in this paper. We consider a subset of all possible computation trees that allows us to automatically generate noisy full supervision. In what follows, we describe the subset of computation trees considered and their representation, a method for automatically generating noisy supervision, and a pointer network model for decoding.

Representation We represent computation trees as a sequence of tokens, and consider trees with at most one compositional operation. We denote a sequence of question tokens by q_{i:j} = (q_i, . . . , q_j), and the decoded sequence by z. We consider the following token sequences (see Table 3):

1. SimpQA: The function SIMPQA is applied to the question q without paraphrasing. In prefix notation this is the tree SIMPQA(q).

2. Comp i j: This sequence of tokens corresponds to the computation tree COMP(q_{1:i−1} • VAR • q_{j+1:|q|}, SIMPQA(q_{i:j})), where • is the concatenation operator. This is used for questions where a substring is answered by SIMPQA and the answers replace a variable before computing a final answer (see the code sketch after this list).

3. Conj i j: This sequence of tokens corresponds to the computation tree CONJ(SIMPQA(q_{0:i−1}), SIMPQA(q_j • q_{i:|q|})). The idea is that a conjunction can be answered by splitting the question at a single point, where one token may also be copied to the second part ("film" in Table 3). If nothing needs to be copied, then j = −1.

This representation supports one compositional operation, and a single copying operation is allowed without any re-phrasing. In future work, we plan to develop a more general representation, which will require training from denotations.
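The following sketch shows how a decoded representation could be mapped to the strings sent to SIMPQA, following the Comp/Conj definitions above; 0-based token indices and whitespace tokenization are assumptions chosen to match the examples in Table 3.

```python
# Sketch: turn a decoded representation (op, i, j) into decomposed questions.
# Comp i j -> ("q[:i] VAR q[j+1:]", "q[i:j+1]")
# Conj i j -> ("q[:i]", "q[j] q[i:]")   with j = -1 meaning nothing is copied
def decompose(question: str, op: str, i: int, j: int):
    toks = question.split()
    if op == "Comp":
        first = " ".join(toks[:i] + ["VAR"] + toks[j + 1:])
        second = " ".join(toks[i:j + 1])
    elif op == "Conj":
        first = " ".join(toks[:i])
        copied = [toks[j]] if j >= 0 else []
        second = " ".join(copied + toks[i:])
    else:                                  # SimpQA: no decomposition
        first, second = question, ""
    return first, second

# Examples from Table 3 (0-based indices):
# decompose("Where is the birthplace of the writer of Standup Shakespeare",
#           "Comp", 5, 9)
# -> ("Where is the birthplace of VAR", "the writer of Standup Shakespeare")
# decompose("What film featured Taylor Swift and was directed by Deborah Aquila",
#           "Conj", 5, 1)
# -> ("What film featured Taylor Swift",
#     "film and was directed by Deborah Aquila")
```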

Supervision Training from denotations is difficult as it involves querying a search engine frequently, which is expensive. Therefore, we take advantage of the original SPARQL queries and MG questions to generate noisy programs for composition and conjunction questions. Note that these noisy programs are only used as supervision to avoid the costly process of manual annotation; the model itself does not assume SPARQL queries in any way.

Figure 7: First word in question distribution

We generate noisy programs from SPARQL queries in the following manner: First, we automatically identify composition and conjunction questions. Because we generated the MG question, we can exactly identify the split points (i, j in composition questions and i in conjunction questions) in the MG question. Then, we use a rule-based algorithm that takes the alignment matrix A (Section 4.2) and approximates the split points in the NL question and the index j to copy in conjunction questions. The red line in Figure 8 corresponds to the known split point in the MG question, and the blue one is the approximated split point in the NL question. The details of this rule-based algorithm are in the supplementary material.

Thus, we obtain noisy supervision for all composition and conjunction questions and can train a model that translates a question q to a representation z = z_1 z_2 z_3, where z_1 ∈ {Comp, Conj} and z_2, z_3 are integer indices.

Table 3: Examples for the types of computation trees that can be decoded by our model.
Split | Question | Decomposed questions
SimpQA | "What building in Vienna, Austria has 50 floors" | -
Comp 5 9 | "Where is the birthplace of the writer of Standup Shakespeare" | "Where is the birthplace of VAR"; "the writer of Standup Shakespeare"
Conj 5 1 | "What film featured Taylor Swift and was directed by Deborah Aquila" | "What film featured Taylor Swift"; "film and was directed by Deborah Aquila"

Pointer network The representation z points to indices in the input, and thus pointer networks (Vinyals et al., 2015) are a sensible choice. Because we also need to decode the tokens COMP and CONJ, we use "augmented pointer networks" (Zhong et al., 2017): for every question q, an augmented question is created by appending the tokens "COMP CONJ" to q. This allows us to decode the representation z with one pointer network that at each decoding step points to one token in the augmented question. We encode the augmented question with a one-layer GRU (Cho et al., 2014), and decode z with a one-layer GRU with attention as in Jia and Liang (2016). The only difference is that we decode tokens from the augmented question rather than from a fixed vocabulary. We train the model with a token-level cross-entropy loss, minimizing −∑_j log p_θ(z_j | x, z_{1:j−1}). Parameters θ include the GRU encoder and decoder, and embeddings for unknown tokens (that are not in pre-trained GloVe embeddings (Pennington et al., 2014)).
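The following is a heavily simplified PyTorch sketch of an augmented pointer decoder, not the authors' implementation: the class name, dimensions, single-example interface, start token, and the bilinear attention scoring are all assumptions, and the paper's model follows the attention-based decoder of Jia and Liang (2016), which differs in its details.

```python
# Simplified sketch of an "augmented pointer network" (illustrative only).
# The question is extended with the tokens "COMP CONJ"; a GRU encodes the
# augmented question, and a GRU decoder scores encoder states with attention,
# so the output distribution at each step is over positions of the input.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AugmentedPointerNet(nn.Module):
    def __init__(self, vocab_size, emb_dim=50, hidden_dim=512):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.decoder_cell = nn.GRUCell(emb_dim, hidden_dim)
        self.att = nn.Linear(hidden_dim, hidden_dim, bias=False)

    def loss(self, aug_question_ids, target_positions):
        # aug_question_ids: (seq_len,) ids of question tokens + ["COMP", "CONJ"]
        # target_positions: (out_len,) gold indices into the augmented question
        enc_out, h = self.encoder(self.emb(aug_question_ids).unsqueeze(0))
        enc_out = enc_out.squeeze(0)             # (seq_len, hidden_dim)
        hidden = h.squeeze(0).squeeze(0)         # (hidden_dim,)
        prev_tok = aug_question_ids[-1]          # start symbol (an assumption)
        total = 0.0
        for gold in target_positions:
            hidden = self.decoder_cell(self.emb(prev_tok).unsqueeze(0),
                                       hidden.unsqueeze(0)).squeeze(0)
            scores = enc_out @ self.att(hidden)  # attention scores = pointer scores
            total = total + F.cross_entropy(scores.unsqueeze(0), gold.unsqueeze(0))
            prev_tok = aug_question_ids[gold]    # feed the pointed-to token
        return total
```

At prediction time, an argmax (or beam search) over the pointer scores at each step would replace the gold indices.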

The trained model decodes COMP and CONJ representations, but sometimes using SIMPQA(q) without decomposition is better. To handle such cases we do the following: we assume that we always have access to a score for every answer, provided by the final invocation of SIMPQA (in CONJ questions this score is the maximum of the scores given by SIMPQA for the two conjuncts), and use the following rule to decide whether to use the decoded representation z or SIMPQA(q): given the scores for answers produced by z and the scores given by SIMPQA(q), we return the single answer that has the highest score. The intuition is that the confidence provided by the scores of SIMPQA is correlated with answer correctness. In future work we will train directly from denotations and handle all logical functions in a uniform manner.
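A minimal sketch of this selection rule; the candidate format (answer, score) and the function name are assumptions.

```python
# Sketch: choose between the decomposition-based answers and SIMPQA(q).
# Each candidate is an (answer, score) pair, where the score comes from the
# final invocation of SIMPQA (for CONJ, the max over the two conjuncts).
def select_answer(decomposed_candidates, simpqa_candidates):
    candidates = list(decomposed_candidates) + list(simpqa_candidates)
    if not candidates:
        return None
    best_answer, _ = max(candidates, key=lambda pair: pair[1])
    return best_answer
```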

6 Experiments

In this section, we aim to examine whether question decomposition can empirically improve performance of QA models over complex questions.

Experimental setup We used 80% of the examples in COMPLEXWEBQUESTIONS for training, 10% for development, and 10% for test, training the pointer network on 24,708 composition and conjunction examples. The hidden state dimension of the pointer network is 512, and we used Adagrad (Duchi et al., 2010) combined with L2 regularization and a dropout rate of 0.25. We initialize 50-dimensional word embeddings using GloVe and learn embeddings for missing words.

Simple QA model As our SIMPQA function, we download the web-based QA model of Talmor et al. (2017). This model sends the question to Google's search engine and extracts a distribution over answers from the top-100 web snippets using manually-engineered features. We re-train the model on our data with one new feature: for every question q and candidate answer mention in a snippet, we run RASOR, an RC model by Lee et al. (2016), and add the output logit score as a feature. We found that combining the web-facing model of Talmor et al. (2017) and RASOR resulted in improved performance.

Evaluation For evaluation, we measure precision@1 (p@1), i.e., whether the highest scoring answer returned string-matches one of the correct answers (while answers are sets, 70% of the questions have a single answer, and the average size of the answer set is 2.3).
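A small sketch of the precision@1 computation as described; lowercasing for string matching is an assumption.

```python
# Sketch of precision@1: the top-scoring predicted answer must string-match
# one of the gold answers (lowercasing here is an assumption).
def precision_at_1(predictions, gold_answers):
    """predictions: list per example of (answer, score) pairs;
    gold_answers: list per example of sets of acceptable strings."""
    correct = 0
    for cands, gold in zip(predictions, gold_answers):
        if not cands:
            continue
        top = max(cands, key=lambda p: p[1])[0]
        if top.lower() in {g.lower() for g in gold}:
            correct += 1
    return correct / len(gold_answers)
```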

We evaluate the following models and oracles:

1. SIMPQA: running SIMPQA on the entire question, i.e., without decomposition.
2. SPLITQA: our model, which decomposes the question and computes the final answer from the answers to the decomposed questions.
3. SPLITQAORACLE: SPLITQA with an oracle that chooses whether to decompose or not based on which strategy yields a correct answer; this provides an upper bound on SPLITQA.
4. RCQA: identical to SIMPQA, except that the RC model is replaced with DOCQA, an RC model trained on TRIVIAQA.
5. SPLITRCQA: This is identical to SPLITQA, except that we replace the RC model from Talmor et al. (2017) with DOCQA.
6. GOOGLEBOX: We sample 100 random development set questions and check whether Google returns a box that contains one of the correct answers.
7. HUMAN: We sample 100 random development set questions and manually answer the questions with Google's search engine, including all available information. We limit the amount of time allowed for answering to 4 minutes.

Table 4 presents the results on the development and test sets. SIMPQA, which does not decompose questions, obtained 20.8 p@1, while by performing question decomposition we substantially improve performance to 27.5 p@1. An upper bound with perfect knowledge on when to decompose is given by SPLITQAORACLE at 33.7 p@1.

RCQA obtained lower performance than SIMPQA, as it was trained on data from a different distribution. More importantly, SPLITRCQA outperforms RCQA by 3.4 points, illustrating that this RC model also benefits from question decomposition, despite the fact that it was not created with question decomposition in mind. This shows the importance of question decomposition for retrieving documents from which an RC model can extract answers. GOOGLEBOX finds a correct answer in 2.5% of the cases, showing that complex questions are challenging for search engines.

To conclude, we demonstrated that question decomposition substantially improves performance on answering complex questions using two independent RC models.

Analysis We estimate human performance (HUMAN) at 63.0 p@1. We find that answering complex questions takes roughly 1.3 minutes on average. For questions we were unable to answer, we found that in 27% of the cases the answer was correct but exact string match with the gold answers failed; in 23.1% the time required to compute the answer was beyond our capabilities; for 15.4% we could not find an answer on the web; 11.5% were ambiguous; 11.5% involved paraphrasing errors of AMT workers; and an additional 11.5% did not contain a correct gold answer.

SPLITQA decides whether to decompose a question based on the confidence of SIMPQA. In 61% of the questions the model chooses to decompose the question, and in the rest it sends the question as-is to the search engine. If one of the strategies (decomposition vs. no decomposition) works, our model chooses the right one in 86% of the cases. Moreover, in 71% of these answerable questions, only one strategy yields a correct answer.

We evaluate the ability of the pointer network to mimic our labeling heuristic on the development set. We find that the model outputs the exact correct output sequence 60.9% of the time, and when allowing errors of one word to the left and right (which often does not change the final output), accuracy is 77.1%. Token-level accuracy is 83.0%, or 89.7% when allowing one-word errors. This shows that SPLITQA learned to identify decomposition points in the questions. We also observed that SPLITQA often produced decomposition points that are better than the heuristic, e.g., for "What is the place of birth for the lyricist of Roman Holiday", SPLITQA produced "the lyricist of Roman Holiday", but the heuristic produced "the place of birth for the lyricist of Roman Holiday". Additional examples of SPLITQA question decompositions are provided in Table 5.

ComplexQuestions To further examine the ability of web-based QA models, we run an experiment against COMPLEXQUESTIONS (Bao et al., 2016), a small dataset of question-answer pairs designed for semantic parsing against Freebase.

We ran SIMPQA on this dataset (Table 6) and obtained 38.6 F1 (the official metric), slightly lower than COMPQ, the best system, which operates directly against Freebase.2 By analyzing the training data, we found that we can decompose COMP questions with a rule that splits the question when the words "when" or "during" appear, e.g., "Who was vice president when JFK was president?".3 We decomposed questions with this rule and obtained 39.7 F1 (SPLITQARULE). Analyzing the development set errors, we found that occasionally SPLITQARULE returns a correct answer that fails to string-match with the gold answer. By manually fixing these cases, our development set F1 reaches 46.9 (SPLITQARULE++). Note that COMPQ does not suffer from any string matching issue, as it operates directly against the Freebase KB and thus is guaranteed to output the answer in the correct form. This short experiment shows that a web-based QA model can rival a semantic parser that works against a KB, and that simple question decomposition is beneficial and leads to results comparable to state-of-the-art.

Table 5: Examples for question decompositions from SPLITQA.
Question | Split-1 | Split-2
"Find the actress who played Hailey Rogers, what label is she signed to" | "the actress who played Hailey Rogers" | "Find VAR, what label is she signed to"
"What are the colors of the sports team whose arena stadium is the AT&T Stadium" | "the sports team whose arena stadium is the AT&T Stadium" | "What are the colors of VAR"
"What amusement park is located in Madrid Spain and includes the stunt fall ride" | "What amusement park is located in Madrid Spain and" | "park includes the stunt fall ride"
"Which university whose mascot is The Trojan did Derek Fisher attend" | "Which university whose mascot is The Trojan did" | "university Derek Fisher attend"

7 Related Work

This work is related to a body of work in semantic parsing and RC, in particular to datasets that focus on complex questions such as TRIVIAQA (Joshi et al., 2017), WIKIHOP (Welbl et al., 2017) and RACE (Lai et al., 2017). Our distinction is in proposing a framework for complex QA that focuses on question decomposition. Our work is related to Chen et al. (2017) and Watanabe et al. (2017), who combined retrieval and answer extraction on a large set of documents. We work against the entire web, and propose question decomposition for finding information.

This work is also closely related to Dunn et al. (2017) and Buck et al. (2017): we start with questions directly and do not assume documents are given. Buck et al. (2017) also learn to phrase questions given a black-box QA model, but while they focus on paraphrasing, we address decomposition. Using a black-box QA model is challenging because one cannot assume differentiability, and reproducibility is difficult as black boxes change over time. Nevertheless, we argue that such QA setups provide a holistic view of the problem of QA and can shed light on important research directions going forward.

Another important related research direction is Iyyer et al. (2016), who answered complex questions by decomposing them. However, they used crowdsourcing to obtain direct supervision for the gold decomposition, while we do not assume such supervision. Moreover, they work against web tables, while we interact with a search engine against the entire web.

8 Conclusion

In this paper we propose a new framework for answering complex questions that is based on question decomposition and interaction with the web. We develop a model under this framework and demonstrate it improves complex QA performance on two datasets and using two RC models. We also release a new dataset, COMPLEXWEBQUESTIONS, including questions, SPARQL programs, answers, and web snippets harvested by our model. We believe this dataset will serve the QA and semantic parsing communities, drive research on compositionality, and push the community to work on holistic solutions for QA.

In future work, we plan to train our model directly from weak supervision, i.e., denotations, and to extract information not only from the web, but also from structured information sources such as web tables and KBs.

Dataset

Generating SPARQL queries Given a SPARQL query r, we create four types of more complex queries: conjunctions, superlatives, comparatives, and compositions. For conjunctions, superlatives, and comparatives, we identify SPARQL queries in WEBQUESTIONSSP whose denotation is a set A, |A| ≥ 2, and generate a new query r′ whose denotation is a strict subset A′, A′ ⊂ A, A′ ≠ ∅. We also discard questions that contain the answer within the new machine-generated questions.

For conjunctions this is done by traversing the KB and looking for SPARQL triplets that can be added and will yield a valid set A′.

For comparatives and superlatives this is done by finding a numerical property common to all a ∈ A, and adding a clause to r accordingly.

For compositions, we find an entity e in r, replace e with a variable ?y, and add to r a clause such that the denotation of the clause is {e}. We also discard ambiguous questions that yield more than one answer for entity e. Table 7 gives the exact rules for generation.

Machine-generated (MG) questions To have AMT workers paraphrase SPARQL queries into natural language, we need to present them in an understandable form. Therefore, we automatically generate a question they can paraphrase. When we generate SPARQL queries, new predicates are added to the query (Table 7). We manually annotate 503 templates mapping predicates to text for different compositionality types (with 377 unique KB predicates). We annotate the templates in the context of several machine-generated questions to ensure that the resulting templates are in understandable language. We use those templates to modify the original WEBQUESTIONSSP question according to the meaning of the generated SPARQL query. E.g., the template for ?x ns:book.author.works_written obj is "the author who wrote OBJ". Table 8 shows various examples of such templates. "Obj" is replaced in turn by the actual name, according to Freebase, of the object at hand. Freebase represents events that contain multiple arguments using a special node in the knowledge-base, called CVT, that represents the event and is connected with edges to all event arguments. Therefore, some of our templates include two predicates that go through a CVT node, and they are denoted in Table 8 with '+'.

Figure 6 (stages shown in the figure): 3. New SPARQL term: ?x ns:film.film.produced_by ns:erwin_stoff; 4. New term template: "The movie that was produced by obj".


To fuse the templates with the original WEBQUESTIONSSP natural language questions, templates contain lexical material that glues them back to the question, conditioned on the compositionality type. For example, in CONJ questions we use the coordinating phrase "and is", so that "the author who wrote OBJ" will produce "Who was born in London and is the author who wrote OBJ".

First word distribution We find that in WEBQUESTIONS almost all questions start with a wh-word, but in COMPLEXWEBQUESTIONS 22% of the questions start with another word, again showing substantial paraphrasing from the original questions. Figure 7 shows the distribution of first words in questions.

Table 7: Rules for generating a complex query r′ from a query r ('.' in SPARQL corresponds to logical and). The query r returns the variable ?x, and contains an entity e. We denote by r[e/y] the replacement of the entity e with a variable ?y. pred1 and pred2 are any KB predicates, obj is any KB entity, V is a numerical value, and ?c is a variable of a CVT type in Freebase which refers to events. The last column provides an example for a NL question for each type.
SUPER. | r. ?x pred1 ?n. ORDER BY DESC(?n) LIMIT 1 | "Which school that Sir Ernest Rutherford attended has the latest founding date?"
COMPAR. | r. ?x pred1 ?n. FILTER ?n < V | "Which of the countries bordering Mexico have an army size of less than 1050?"
COMP. | r[e/y]. ?y pred1 obj. | "Where is the end of the river that originates in Shannon Pot?"

Table 8: Template Examples
Freebase Predicate | Template
ns:book.author.works_written | "the author who wrote obj"
ns:aviation.airport.airlines + ns:aviation.airline_airport_presence.airline | "the airport with the obj airline"
ns:award.competitor.competitions_won | "the winner of obj"
ns:film.actor.film + ns:film.performance.film | "the actor that played in the film obj"

Generating Noisy Supervision

We created a heuristic for approximating the amount of global word re-ordering performed by AMT workers and for creating noisy supervision. For every question, we constructed a matrix A, where A_ij is the similarity between token i in the MG question and token j in the NL question. Similarity is 1 if lemmas match, or the cosine similarity according to GloVe embeddings when above a threshold, and 0 otherwise. This allows us to compute an approximate word alignment between the MG question and the NL question tokens and assess whether word re-ordering occurred. For a natural language CONJ question of length n and a machine-generated question of length m with a known split point index r, the algorithm first computes the best point to split the NL question assuming there is no re-ordering. This is done by iterating over all candidate split points p, and returning the split point p*_1 that maximizes:

EQUATION (1): Not extracted; please refer to original document.

We then compute a second candidate split point p*_2 by finding the best split point under the assumption that there is re-ordering in the NL question. We then determine the final split point, and whether re-ordering occurred, by comparing the two scores and using the higher one.

In COMP questions, two split points are returned, representing the beginning and end of the phrase that is to be sent to the QA model. Therefore, if r_1, r_2 are the known split points in the machine-generated question, we return the p_1, p_2 that maximize:

EQUATION: Not extracted; please refer to original document.

2 By adding the output logit from RASOR, we improved test F1 from 32.6, as reported by Talmor et al. (2017), to 38.6.

3 The data is too small to train our decomposition model.