Answering Complex Questions Using Open Information Extraction
Authors
Abstract
While there has been substantial progress in factoid question-answering (QA), answering complex questions remains challenging, typically requiring both a large body of knowledge and inference techniques. Open Information Extraction (Open IE) provides a way to generate semi-structured knowledge for QA, but to date such knowledge has only been used to answer simple questions with retrieval-based methods. We overcome this limitation by presenting a method for reasoning with Open IE knowledge, allowing more complex questions to be handled. Using a recently proposed support graph optimization framework for QA, we develop a new inference model for Open IE, in particular one that can work effectively with multiple short facts, noise, and the relational structure of tuples. Our model significantly outperforms a state-of-the-art structured solver on complex questions of varying difficulty, while also removing the reliance on manually curated knowledge.
1 Introduction
Effective question answering (QA) systems have been a long-standing quest of AI research. Structured curated KBs have been used successfully for this task (Berant et al., 2013; Berant and Liang, 2014) . However, these KBs are expensive to build and typically domain-specific. Automatically constructed open vocabulary (subject; predicate; object) style tuples have broader coverage, but have only been used for simple questions where a single tuple suffices (Fader et al., 2014; Yin et al., 2015) .
Our goal in this work is to develop a QA system that can perform reasoning with Open IE (Banko et al., 2007) tuples for complex multiple-choice questions that require tuples from multiple sentences. Such a system can answer complex questions in resource-poor domains where curated knowledge is unavailable. Elementary-level science exams is one such domain, requiring complex reasoning (Clark, 2015) . Due to the lack of a large-scale structured KB, state-of-the-art systems for this task either rely on shallow reasoning with large text corpora Cheng et al., 2016) or deeper, structured reasoning with a small amount of automatically acquired (Khot et al., 2015) or manually curated knowledge.
Consider the following question from an Alaska state 4th grade science test:
Which object in our solar system reflects light and is a satellite that orbits around one planet? (A) Earth (B) Mercury (C) the Sun (D) the Moon This question is challenging for QA systems because of its complex structure and the need for multi-fact reasoning. A natural way to answer it is by combining facts such as (Moon; is; in the solar system), (Moon; reflects; light), (Moon; is; satellite), and (Moon; orbits; around one planet).
A candidate system for such reasoning, and which we draw inspiration from, is the TABLEILP system of . TABLEILP treats QA as a search for an optimal subgraph that connects terms in the question and answer via rows in a set of curated tables, and solves the optimization problem using Integer Linear Programming (ILP). We similarly want to search for an optimal subgraph. However, a large, automatically extracted tuple KB makes the reasoning context different on three fronts: (a) unlike reasoning with tables, chaining tuples is less important and reliable as join rules aren't available; (b) conjunctive evidence becomes paramount, as, unlike a long table row, a single tuple is less likely to cover the entire question; and (c) again, unlike table rows, tuples are noisy, making combining redundant evidence essential. Consequently, a table-knowledge centered inference model isn't the best fit for noisy tuples.
To address this challenge, we present a new ILP-based model of inference with tuples, implemented in a reasoner called TUPLEINF. We demonstrate that TUPLEINF significantly outperforms TABLEILP by 11.8% on a broad set of over 1,300 science questions, without requiring manually curated tables, using a substantially simpler ILP formulation, and generalizing well to higher grade levels. The gains persist even when both solvers are provided identical knowledge. This demonstrates for the first time how Open IE based QA can be extended from simple lookup questions to an effective system for complex questions.
2 Related Work
We discuss two classes of related work: retrievalbased web question-answering (simple reasoning with large scale KB) and science questionanswering (complex reasoning with small KB).
Web QA: There exist several systems for retrieval-based Web QA problems (Ferrucci et al., 2010; Brill et al., 2002) . While structured KBs such as Freebase have been used in many (Berant et al., 2013; Berant and Liang, 2014; Kwiatkowski et al., 2013) , such approaches are limited by the coverage of the data. QA systems using semistructured Open IE tuples (Fader et al., 2013 (Fader et al., , 2014 Yin et al., 2015) or automatically extracted web tables (Sun et al., 2016; Pasupat and Liang, 2015) have broader coverage but are limited to simple questions with a single query. Science QA: Elementary-level science QA tasks require reasoning to handle complex questions. Markov Logic Networks (Richardson and Domingos, 2006) have been used to perform probabilistic reasoning over a small set of logical rules (Khot et al., 2015) . Simple IR techniques have also been proposed for science tests and Gaokao tests (equivalent to the SAT exam in China) (Cheng et al., 2016) .
The work most related to TUPLEINF is the aforementioned TABLEILP solver. This approach focuses on building inference chains using man-ually defined join rules for a small set of curated tables. While it can also use open vocabulary tuples (as we assess in our experiments), its efficacy is limited by the difficulty of defining reliable join rules for such tuples. Further, each row in some complex curated tables covers all relevant contextual information (e.g., each row of the adaptation table contains (animal, adaptation, challenge, explanation)), whereas recovering such information requires combining multiple Open IE tuples.
3 Tuple Inference Solver
We first describe the tuples used by our solver. We define a tuple as (subject; predicate; objects) with zero or more objects. We refer to the subject, predicate, and objects as the fields of the tuple.
3.1 Tuple Kb
We use the text corpora (S) from Clark et al. 2016to build our tuple KB. S contains 5 ⇥ 10 10 tokens (280 GB of plain text) extracted from Web pages as well as around 80,000 sentences from various domain-targeted sources. For each test set, we use the corresponding training questions Q tr to retrieve domain-relevant sentences from S. Specifically, for each multiplechoice question (q, A) 2 Q tr and each choice a 2 A, we use all non-stopword stemmed tokens in q and a as an ElasticSearch 1 query against S. We take the top 200 hits, run Open IE v4, 2 and aggregate the resulting tuples over all a 2 A and over all questions in Q tr to create the tuple KB (T ). 3
3.2 Tuple Selection
Given a multiple-choice question qa with question text q and answer choices A={a i }, we select the most relevant tuples from T and S as follows.
Selecting from Tuple KB: We use an inverted index to find the 1,000 tuples that have the most overlapping tokens with question tokens tok (qa). 4 We also filter out any tuples that overlap only with tok (q) as they do not support any answer. We compute the normalized TF-IDF score by treating the question, q, as a query and each tuple, t, as a Figure 1 : An example support graph linking a question (top), two tuples from the KB (colored) and an answer option (nitrogen). document:
tf(x, q) = 1 if x 2 q, 0 otherwise idf(x) = log (1 + N/n x ) tf-idf(t, q) = X x2t\q idf (x)
where N is the total number of tuples in the KB and n x is the number of tuples containing x. We normalize the tf-idf score by the number of tokens in t and q, and take the 50 top-scoring tuples T qa .
On-the-fly tuples from text: To handle questions from new domains not covered by the training set, we extract additional tuples on the fly from S (similar to Sharma et al. (2015) ). We perform the same ElasticSearch query described earlier for building T. We ignore sentences that cover none or all answer choices as they are not discriminative. We also ignore long sentences (>300 characters) and sentences with negation 5 as they tend to lead to noisy inference. We then run Open IE on these sentences and re-score the resulting tuples using the Jaccard score 6 due to the lossy nature of Open IE, and finally take the 50 top-scoring tuples T 0 qa .
3.3 Support Graph Search
Similar to TABLEILP, we view the QA task as searching for a graph that best connects the terms in the question (qterms) with an answer choice via the knowledge; see Figure 1 for a simple illustrative example. Unlike standard alignment models used for tasks such as Recognizing Textual Entailment (RTE) (Dagan et al., 2010), however, we must score alignments between a set T qa [ T 0 qa of structured tuples and a (potentially multi-sentence) multiple-choice question qa.
The qterms, answer choices, and tuples fields form the set of possible vertices, V, of the support graph. Edges connecting qterms to tuple fields and tuple fields to answer choices form the set of possible edges, E. The support graph, G(V, E), is a subgraph of G(V, E) where V and E denote "active" nodes and edges, resp. We define an ILP optimization model to search for the best support graph (i.e., the active nodes and edges) as follows.
Variables
The ILP has a binary variable for each qterm (x q ), tuple (x t ), tuple field (x f ), and answer choice (x a ), indicating whether the corresponding graph node is active. There is a binary activity variable (x e ) for each edge e 2 E. For efficiency, we only create a qterm!field edge and a field!choice edge if the corresponding coefficient is no smaller than a certain threshold (0.1 and 0.2, resp.).
Objective Function
The objective function coefficient c e of each edge e(t, h) is determined by a word-overlap score. 7 While TABLEILP used WordNet (Miller, 1995) paths to compute the edge weight, this measure results in unreliable scores when faced with longer phrases found in Open IE tuples.
Compared to a curated KB, it is easy to find Open IE tuples that match irrelevant parts of the questions. To mitigate this issue, we scale the coefficients c q of qterms in our ILP objective to focus on important terms. Since the later terms in a question tend to provide the most critical information, we scale qterm coefficients based on their position in the question. Also, qterms that appear in almost all of the selected tuples tend not to be discriminative as any tuple would support such a qterm. Hence we scale qterm coefficients inversely by the frequency with which they occur in the selected tuples. Appendix A describes the coefficient for qterm as well as other variables in detail.
Constraints
Since Open IE tuples do not come with schema and join rules, we can define a substantially simpler model compared to TABLEILP. This reduces the reasoning capability but also eliminates the reliance on hand-authored join rules and regular expressions used in TABLEILP. We discovered (see empirical evaluation) that this simple model can achieve the same score as TABLEILP on the Regents test (target test set used by TABLEILP) and generalizes better to different grade levels.
We start with a few constraints defining what is an active node or edge, shown as the first groups of constraints in 2, 4, 4, 4, 2) ; the model can be improved with more careful parameter selection spurious edges in the support graph, we limit the number of active edges from an active tuple, question choice, tuple fields, and qterms (second group of constraints in Table 1 ). Our model is also capable of using multiple tuples to support different parts of the question as illustrated in Figure 1 . To avoid spurious tuples that only connect with the question (or choice) or ignore the relation being expressed in the tuple, we add constraints that require each tuple to connect a qterm with an answer choice (third group of constraints in Table 1 ). We also define new constraints based on the Open IE tuple structure. Since an Open IE tuple expresses a fact about the tuple's subject, we require that the subject must be active. To avoid issues such as (Planet; orbit; Sun) matching the sample question in the introduction ("Which object. . .orbits around a planet"), we also add an ordering constraint (fourth group in Table 1) .
We note that TUPLEINF only combines parallel evidence, i.e., each tuple must individually connect words in the question to the answer choice. For reliable multi-hop reasoning using OpenIE tuples, one can add inter-tuple connections to the support graph search, controlled by a small number of rules over Open IE predicates. Learning such rules for the Science domain is an open problem and potential avenue for future work.
4 Experiments
Comparing our method with two state-of-the-art systems for 4th and 8th grade science exams, we demonstrate that (a) TUPLEINF with only automatically extracted tuples significantly outperforms TABLEILP with its original curated knowl- (Howell, 2012) at p = 0.05. We consider two question sets. (1) 4th Grade set (1220 train, 1304 test) is a 10x larger superset of the NY Regents questions , and includes professionally written licensed questions. (2) 8th Grade set (293 train, 282 test) contains 8th grade questions from various states. 8 We consider two knowledge sources:
(1) The Sentence corpus (S) consists of domain-targeted 80K sentences and 280 GB of plain text extracted from web pages used by . This corpus is used as a collection of sentences by the IR solver. It is also used to create the tuple KB T (Sec. 3.1) and on-the-fly questionspecific tuples T 0 qa (Sec. 3.2) for TUPLEINF. (2) TABLEILP uses ⇠70 Curated tables (C) containing about 7,600 rows, designed for 4th grade NY Regents exams.
We compare TUPLEINF with two state-of-theart baselines. IR is a simple yet powerful information-retrieval baseline that selects the answer option with the best matching sentence in a corpus. TABLEILP is the stateof-the-art structured inference baseline developed for science questions. Table 3 : TUPLEINF is complementarity to IR, resulting in a strong ensemble the same knowledge (C+T), 10 the improved selection and simplified model of TUPLEINF 11 results in a statistically significant improvement. Our simple model, TUPLEINF(C + T), also achieves scores comparable to TABLEILP on the latter's target Regents questions (61.4% vs TABLEILP's reported 61.5%) without any specialized rules. Table 3 shows that while TUPLEINF achieves similar scores as the IR solver, the approaches are complementary (structured lossy knowledge reasoning vs. lossless sentence retrieval). The two solvers, in fact, differ on 47.3% of the training questions. To exploit this complementarity, we train an ensemble system which, as shown in the table, provides a substantial boost over the individual solvers. Further, IR + TUPLEINF is consistently better than IR + TABLEILP.
4.1 Results
Finally, in combination with IR and the statistical association based PMI solver (which scores 54.1% by itself) of , TUPLEINF achieves a score of 58.2% on the 4th grade set. This compares favorably to TABLEILP's ensemble score of 56.7%, again attesting to TUPLEINF's strength. 12
5 Error Analysis
We describe four classes of failure of TUPLEINF, and the future work they suggest.
Missing Important Words: Which material will spread out to completely fill a larger container? (A) air (B) ice (C) sand (D) water In this question, we have tuples that support that water will spread out and fill a larger container, but miss the critical word "completely". A method for detecting salient question words would help here.
Lossy IE: Which action is the best method to separate a mixture of salt and water? . . . The IR solver correctly answers this question by using the sentence: Separate the salt and water mixture by evaporating the water. However, TUPLEINF is not able to answer this question as Open IE is unable to extract tuples from this imperative sentence. While the additional structure from Open IE is generally helpful for more robust matching, the conversion to tuples sometimes loses important bits of information.
Bad Alignment: Which of the following gases is necessary for humans to breathe in order to live? (A) Oxygen (B) Carbon dioxide (C) Helium (D) Water vapor TUPLEINF returns "Carbon dioxide" as the answer because of the tuple (humans; breathe out; carbon dioxide). The chunk "to breathe" in the question has a high alignment score to the "breathe out" relation in the tuple, even though they have completely different meaning. An improved phrase alignment module can mitigate this issue.
Out of Scope: Deer live in forest for shelter. If the forest was cut down, which situation would most likely happen? . . . Such questions require modeling a state presented in the question and reasoning over this state, which is out of scope of our solver.
6 Conclusion
We presented a new QA system, TUPLEINF, that can reason over a large, potentially noisy knowledge base of (subject, predicate, object) style tuples, in order to answer complex questions. Our results establish TUPLEINF as a new state-of-theart structured reasoning solver for elementarylevel science that does not rely on curated knowledge and generalizes to higher grade levels. Our error analysis points to lossy IE and textual misalignments as two main causes of failure, suggesting future work around incorporating tuple context and distributional similarity measures.
https://www.elastic.co/products/elasticsearch 2 http://knowitall.github.io/openie 3 Available at http://allenai.org/data.html 4 All tokens are stemmed and stop-word filtered.
containing not, 'nt, or except6 | tok (t) \ tok (qa) | / | tok (t) [ tok (qa) |
w(t, h) =| tok (t) \ tok (h) | / | tok (h) |
See Appendix B for how tables (and tuples) are used by TUPLEINF (and TABLEILP).11 On average, TABLEILP (TUPLEINF) has 3,403 (1,628, resp.) constraints and 982 (588, resp.) variables. TUPLE-INF's ILP can be solved in half the time taken by TABLEILP, reducing the overall question answering time by 68.6%.12 We observed no difference in scores on the 8th grade set.