Question Answering via Integer Programming over Semi-Structured Knowledge

Daniel Khashabi
Tushar Khot
Ashish Sabharwal
P. Clark
Oren Etzioni
D. Roth
IJCAI
2016
View in Semantic Scholar

Abstract

Answering science questions posed in natural language is an important AI challenge. Answering such questions often requires non-trivial inference and knowledge that goes beyond factoid retrieval. Yet, most systems for this task are based on relatively shallow Information Retrieval (IR) and statistical correlation techniques operating on large unstructured corpora. We propose a structured inference system for this task, formulated as an Integer Linear Program (ILP), that answers natural language questions using a semi-structured knowledge base derived from text, including questions requiring multi-step inference and a combination of multiple facts. On a dataset of real, unseen science questions, our system significantly outperforms (+14%) the best previous attempt at structured reasoning for this task, which used Markov Logic Networks (MLNs). It also improves upon a previous ILP formulation by 17.7%. When combined with unstructured inference methods, the ILP system significantly boosts overall performance (+10%). Finally, we show our approach is substantially more robust to a simple answer perturbation compared to statistical correlation methods.

1 Introduction

Answering questions posed in natural language is a fundamental AI task, with a large number of impressive QA systems built over the years. Today's Internet search engines, for instance, can successfully retrieve factoid style answers to many natural language queries by efficiently searching the Web. Information Retrieval (IR) systems work under the assumption that answers to many questions of interest are often explicitly stated somewhere [Kwok et al., 2001] , and all one needs, in principle, is access to a sufficiently large corpus. Similarly, statistical correlation based methods, such as those using Pointwise Mutual Information or PMI [Church and Hanks, 1989] , work under the assumption that many questions can be answered by looking for words that tend to co-occur with the question words in a large corpus.

While both of these approaches help identify correct answers, they are not suitable for questions requiring reasoning, Figure 1 : TableILP searches for the best support graph (chains of reasoning) connecting the question to an answer, in this case June. Constraints on the graph define what constitutes valid support and how to score it (Section 3.3). such as chaining together multiple facts in order to arrive at a conclusion. Arguably, such reasoning is a cornerstone of human intelligence, and is a key ability evaluated by standardized science exams given to students. For example, consider a question from the NY Regents 4th Grade Science Test:

Figure 1: TableILP searches for the best support graph (chains of reasoning) connecting the question to an answer, in this case June. Constraints on the graph define what constitutes valid support and how to score it (Section 3.3).

In New York State, the longest period of daylight occurs during which month? (A) June (B) March (C) December (D) September

We would like a QA system that, even if the answer is not explicitly stated in a document, can combine basic scientific and geographic facts to answer the question, e.g., New York is in the north hemisphere; the longest day occurs during the summer solstice; and the summer solstice in the north hemisphere occurs in June (hence the answer is June). Figure 1 illustrates how our system approaches this, with the highlighted support graph representing its line of reasoning.

Further, we would like the system to be robust under simple perturbations, such as changing New York to New Zealand (in the southern hemisphere) or changing an incorrect answer option to an irrelevant word such as "last" that happens to have high co-occurrence with the question text.

To this end, we propose a structured reasoning system, called TableILP, that operates over a semi-structured knowledge base derived from text and answers questions by chaining multiple pieces of information and combining parallel evidence. 1 The knowledge base consists of tables, each of which is a collection of instances of an n-ary relation defined over natural language phrases. E.g., as illustrated in Figure 1 , a simple table with schema (country, hemisphere) might contain the instance (United States, Northern) while a ternary table with schema (hemisphere, orbital event, month) might contain (North, Summer Solstice, June). TableILP treats lexical constituents of the question Q, as well as cells of potentially relevant tables T , as nodes in a large graph G Q,T , and attempts to find a subgraph G of G Q,T that "best" supports an answer option. The notion of best support is captured via a number of structural and semantic constraints and preferences, which are conveniently expressed in the Integer Linear Programming (ILP) formalism. We then use an off-the-shelf ILP optimization engine called SCIP [Achterberg, 2009] to determine the best supported answer for Q.

Following a recently proposed AI challenge , we evaluate TableILP on unseen elementary-school science questions from standardized tests. Specifically, we consider a challenge set consisting of all nondiagram multiple choice questions from 6 years of NY Regents 4th grade science exams. In contrast to a state-of-theart structured inference method [Khot et al., 2015] for this task, which used Markov Logic Networks (MLNs) [Richardson and Domingos, 2006] , TableILP achieves a significantly (+14% absolute) higher test score. This suggests that a combination of a rich and fine-grained constraint language, namely ILP, even with a publicly available solver is more effective in practice than various MLN formulations of the task. Further, while the scalability of the MLN formulations was limited to very few (typically one or two) selected science rules at a time, our approach easily scales to hundreds of relevant scientific facts. It also complements the kind of questions amenable to IR and PMI techniques, as is evidenced by the fact that a combination (trained using simple Logistic Regression ) of TableILP with IR and PMI results in a significant (+10% absolute) boost in the score compared to IR alone.

Our ablation study suggests that combining facts from multiple tables or multiple rows within a table plays an important role in TableILP's performance. We also show that TableILP benefits from the table structure, by comparing it with an IR system using the same knowledge (the table rows) but expressed as simple sentences; TableILP scores significantly (+10%) higher. Finally, we demonstrate that our approach is robust to a simple perturbation of incorrect answer options: while the simple perturbation results in a relative drop of 20% and 33% in the performance of IR and PMI methods, respectively, it affects TableILP's performance by only 12%. proposed an ensemble approach for the science QA task, demonstrating the effectiveness of a combination of information retrieval, statistical association, rule-based reasoning, and an ILP solver operating on semistructured knowledge. Our ILP system extends their model with additional constraints and preferences (e.g., semantic relation matching), substantially improving QA performance.

2 Related Work

A number of systems have been developed for answering factoid questions with short answers (e.g., "What is the capital of France?") using document collections or databases (e.g., Freebase [Bollacker et al., 2008] , NELL [Carlson et al., 2010] ), for example [Brill et al., 2002; Fader et al., 2014; Ferrucci et al., 2010; Ko et al., 2007; Yih et al., 2014; Yao and Van Durme, 2014; Zou et al., 2014] . However, many science questions have answers that are not explicitly stated in text, and instead require combining information together. Conversely, while there are AI systems for formal scientific reasoning (e.g., [Gunning et al., 2010; Novak, 1977] ), they require questions to be posed in logic or restricted English. Our goal here is a system that operates between these two extremes, able to combine information while still operating with natural language.

The task of Recognizing Textual Entailment (RTE) is also closely related, as QA can be cast as entailment (Does corpus entail question+answer? [Bentivogli et al., 2008] ). However, RTE has primarily focused on the task of linguistic equivalence, and has not addressed questions where some form of scientific reasoning is required. Recent work on Natural Logic [Angeli and Manning, 2014; MacCartney, 2009] has extended RTE to account for the logical structure within language. Our work can be seen as going one step further, to add a layer of structured reasoning on top of this; in fact, we use an RTE engine as a basic subroutine for comparing individual table cells in our ILP formulation.

ILP based discrete optimization has been successful in several NLP tasks [Roth and Yih, 2004; Chang et al., 2010; Berant et al., 2010; Srikumar and Roth, 2011; Goldwasser and Roth, 2011] . While our ILP formulation also operates on natural language text, our focus is on the use of a specific semi-structured table representation for QA. Cohen [2000] studied tables with natural language text requiring soft matching, with a focus on efficiently computing the top few candidates given a database query. In contrast, our system, given a natural language question, (implicitly) seeks to generate a query that produces the most supported answer.

3 Qa As Subgraph Optimization

We begin with our knowledge representation formalism, followed by our treatment of QA as an optimal subgraph selection problem over such knowledge, and then briefly describe our ILP model for subgraph selection.

3.2 Qa As A Search For Desirable Support Graphs

We treat question answering as the task of pairing the question with an answer such that this pair has the best support in the knowledge base, measured in terms of the strength of a "support graph" defined as follows.

Given a multiple choice question Q and tables T , we can define a labeled undirected graph G Q,T over nodes V and edges E as follows. We first split Q into lexical constituents (e.g., non-stopword tokens, or chunks) q = {q } and answer options a = {a m }. For each table T i , we consider its cells t = {t ijk } as well as column headers h = {h ik }. The nodes of G Q,T are then V = q ∪ a ∪ t ∪ h. For presentation purposes, we will equate a graph node with the lexical entity it represents (such as a table cell or a question constituent). The undirected edges of G Q,T are E = ((q ∪ a) × (t ∪ h)) ∪ (t × t) ∪ (h × h) excluding edges both whose endpoints are within a single table.

Informally, an edge denotes (soft) equality between a question or answer node and a table node, or between two table nodes. To account for lexical variability (e.g., that tool and instrument are essentially equivalent) and generalization (e.g., that a dog is an animal), we replace string equality with a phrase-level entailment or similarity function w : E → [0, 1] that labels each edge e ∈ E with an associated score w(e). We use entailment scores (directional) from q to t ∪ h and from t ∪ h to a, and similarity scores (symmetric) between two nodes in t. 2 In the special case of column headers across two tables, the score is (manually) set to either 0 or 1, indicating whether this corresponds to a meaningful join.

Intuitively, we would like the support graph for an answer option to be connected, and to include nodes from the ques- tion, the answer option, and at least one table. Since each table row represents a coherent piece of information but cells within a row do not have any edges in G Q,T (the same holds also for cells and the corresponding column headers), we use the notion of an augmented subgraph to capture the underlying table structure. Let G = (V, E) be a subgraph of G Q,T . The augmented subgraph G + is formed by adding to G edges

(v 1 , v

2 ) such that v 1 and v 2 are in V and they correspond to either the same row (possibly the header row) of a table in T or to a cell and the corresponding column header. Definition 1. A support graph G = G(Q, T, a m ) for a question Q, tables T , and an answer option a m is a subgraph (V, E) of G Q,T with the following basic properties:

1. V ∩ a = {a m }, V ∩ q = φ, V ∩ t = φ;

2. w(e) > 0 for all e ∈ E; 3. if e ∈ E ∩ (t × t) then there exists a corresponding e ∈ E ∩ (h × h) involving the same columns; and 4. the augmented subgraph G + is connected. A support graph thus connects the question constituents to a unique answer option through table cells and (optionally) table headers corresponding to the aligned cells. A given question and tables give rise to a large number of possible support graphs, and the role of the inference process will be to choose the "best" one under a notion of desirable support graphs developed next. We do this through a number of additional structural and semantic properties; the more properties the support graph satisfies, the more desirable it is.

3.3 Ilp Formulation

We model the above support graph search for QA as an ILP optimization problem, i.e., as maximizing a linear objective function over a finite set of variables, subject to a set of linear inequality constraints. A summary of the model is given below. 3 We note that the ILP objective and constraints aren't tied to the particular domain of evaluation; they represent general properties that capture what constitutes a well supported answer for a given question. Table 1 summarizes the notation for various elements of the problem, such as t ijk for cell (j, k) of table i. All core variables in the ILP model are binary, i.e., have domain {0, 1}. For each element, the model has a unary variable capturing whether this element is part of the support graph G, i.e., it is "active". For instance, row r ij is active if at least one cell in row j of table i is in G. The model also has pairwise "alignment" variables, capturing edges of G Q,T . The alignment BASIC PAIRWISE ACTIVITY VARIABLES variable for an edge e in G Q,T is associated with the corresponding weight w(e), and captures whether e is included in G. To improve efficiency, we create a pairwise variable for e only if w(e) is larger than a certain threshold. These unary and pairwise variables are then used to define various types of constraints and preferences, as discussed next.

Table 1: Notation for the ILP formulation.

y (t ijk , t ij k ) cell to cell y (t ijk , q ) cell to question constituent y (h ik , q ) header to question constituent y (t ijk ,

To make the definitions clearer, we introduce all basic variables used in our optimization in Table 2 , and will use them later to define constraints explicitly. We use the notation x (.) to refer to a unary variable parameterized by a single element of the optimization, and y (., .) to refer to a pairwise variable parameterized by a pair of elements. Unary variables represent the presence of a specific element as a node in the support graph G. For example x (T i ) = 1 if and only if the table T i is active in G. Similarly, y (t ijk , q ) = 1 if and only if the corresponding edge is present in G, which we alternatively refer to as an alignment between cell (j, k) of table i and the -th constituent of the question. As previously mentioned, in practice we do not create all possible pairwise variables. Instead we choose the pairs alignment score w(e) exceeds a pre-set threshold. For example, we create

Table 2: Variables used for defining the optimization problem for TableILP solver. All variables have domain {0, 1}.

y (t ijk , t i j k ) only if w(t ijk , t i j k ) ≥ MINCELLCELLALIGNMENT. 4

The objective function is a weighted linear sum over all variables instantiated for a given question answering problem. 5 A small set of auxiliary variables is defined for linearizing complicated constraints.

Constraints are a significant part of our model, used for imposing the desired behavior on the support graph. Due to lack of space, we discuss only a representative subset here. 6 Some constraints relate variables to each other. For example, unary variables are defined through constraints that relate them to the corresponding pairwise variables. For instance, for active row variable x (r ij ), we ensure that it is 1 if and only if at least one cell in row j is active:

x (r ij ) ≥ y (t ijk , * ) , ∀(t ijk , * ) ∈ R ij , ∀i, j, k,

where R ij is collection of pairwise variables with one end in row j of table i.

In the remainder of this section, we outline some of the important characteristics we expect in our model, and provide details of a few illustrative constraints.

Basic Lookup

Consider the following question:

Which characteristic helps a fox find food? (A) sense of smell (B) thick fur (C) long tail (D) pointed teeth In order to answer such lookup-style questions, we generally seek a row with the highest aggregate alignment to question constituents. We achieve this by incorporating the questiontable alignment variables with the alignment scores, w(e), as coefficients and the active question constituents variable with a constant coefficient in the objective function. Since any additional question-table edge with a positive entailment score (even to irrelevant tables) in the support graph would result in an increase in the score, we disallow tables with alignments only to the question (or only to a choice) and add a small penalty for every table used in order to reduce noise in the support graph. We also limit the maximum number of alignments of a question constituent and table cells, in order to prevent one constituent or cell from having a large influence on the objective function and thereby the solution:

( * ,q )∈Q l y ( * , q ) ≤ MAXALIGNMENTSPERQCONS, ∀l

where Q l is the set of all pairwise variables with one end in the question constituent .

Parallel Evidence

For certain questions, evidence needs to be combined from multiple rows of a table. For example, Sleet, rain, snow, and hail are forms of (A) erosion (B) evaporation (C) groundwater (D) precipitation To answer this question, we need to combine evidence from multiple table entries from the weather terms table, (term, type), namely (sleet, precipitation), (rain, precipitation), (snow, precipitation), and (hail, precipitation). To achieve this, we allow multiple active rows in the support graph. Similar to the basic constraints, we limit the maximum number of active rows per table and add a penalty for every active row to ensure only relevant rows are considered for reasoning:

j x (r ij ) ≤ MAXROWSPERTABLE, ∀i

To encourage only coherent parallel evidence within a single table, we limit our support graph to always use the same columns across multiple rows within a table, i.e., every active row has the active cells corresponding to the same set of columns.

Evidence Chaining

Questions requiring chaining of evidence from multiple tables, such as the example in Figure 1 , are typically the most challenging in this domain. Chaining can be viewed as performing a join between two tables. We introduce alignments between cells across columns in pairs of tables to allow for chaining of evidence. To help minimize potential noise introduced by chaining irrelevant facts, we add a penalty for every inter-table alignment and also rely on the 0/1 weights of header-to-header edges to ensure only semantically meaningful table joins are considered.

Semantic Relation Matching

Our constraints so far have only looked at the content of the table cells, or the structure of the support graph, without explicitly considering the semantics of the table schema. By using alignments between the question and column headers (i.e., type information), we exploit the table schema to prefer alignments to columns relevant to the "topic" of the question. In particular, for questions of the form "which X . . .", we prefer answers that directly entail X or are connected to cells that entail X. However, this is not sufficient for questions such as:

What is one way to change water from a liquid to a solid? (A) decrease the temperature (B) increase the temperature (C) decrease the mass (D) increase the mass Even if we select the correct table, say r change-init-fin (c, i, f ) that describes the initial and final states for a phase change event, both choice (A) and choice (B) would have the exact same score in the presence of table rows (increase temperature, solid, liquid) and (decrease temperature, liquid, solid). The table, however, does have the initial vs. final state structure. To capture this semantic structure, we annotate pairs of columns within certain tables with the semantic relationship present between them. In this example, we would annotate the phase change table with the relations: changeFrom(c, i), changeTo(c, f ), and fromTo(i, f ).

Given such semantic relations for table schemas, we can now impose a preference towards question-table alignments that respect these relations. We associate each semantic relation with a set of linguistic patterns describing how it might be expressed in natural language. TableILP then uses these patterns to spot possible mentions of the relations in the question Q. We then add the soft constraint that for every pair of active columns in a table (with an annotated semantic relation) aligned to a pair of question constituents, there should be a valid expression of that relation in Q between those constituents. In our example, we would match the relation fromTo(liquid, solid) in the table to "liquid to a solid" in the question via the pattern "X to a Y" associated with fromTo(X,Y), and thereby prefer aligning with the correct row (decrease temperature, liquid, solid).

4 Evaluation

We compare our approach to three existing methods, demonstrating that it outperforms the best previous structured approach [Khot et al., 2015] and produces a statistically significant improvement when used in combination with IR-based methods . For evaluations, we use a 2-core 2.5 GHz Amazon EC2 linux machine with 16 GB RAM.

Question Set. We use the same question set as , which consists of all non-diagram multiple-choice questions from 12 years of the NY Regents 4th Grade Science exams. 7 The set is split into 108 development questions and 129 hidden test questions based on the year they appeared in (6 years each). All numbers reported below are for the hidden test set, except for question perturbation experiments which relied on the 108 development questions.

Test scores are reported as percentages. For each question, a solver gets a score of 1 if it chooses the correct answer and 1/k if it reports a k-way tie that includes the correct answer. On the 129 test questions, a score difference of 9% (or 7%) is statistically significant at the 95% (or 90%, resp.) confidence interval based on the binomial exact test [Howell, 2012] .

Corpora. We work with three knowledge corpora:

1. Web Corpus: This corpus contains 5 × 10 10 tokens (280 GB of plain text) extracted from Web pages. It was collected by Charles Clarke at the University of Waterloo, and has been used previously by Turney [2013] and . We use it here to compute statistical co-occurrence values for the PMI solver.

2. Sentence Corpus : This includes sentences from the Web corpus above, as well as around 80,000 sentences from various domain-targeted sources for elementary science: a Regents study guide, CK12 textbooks (www.ck12.org), and web sentences with similar content as the course material.

3. Table Corpus (cf. Section 3.1): This includes 65 tables totaling around 5,000 rows, designed based on the development set and study guides, as well as 4 Open IEstyle [Banko et al., 2007] automatically generated tables totaling around 2,600 rows. 8

4.1 Solvers

TableILP (our approach). Given a question Q, we select the top 7 tables from the Table Corpus using the the standard TF-IDF score of Q with tables treated as bag-of-words documents. For each selected table, we choose the 20 rows that overlap with Q the most. This filtering improves efficiency and reduces noise. We then generate an ILP and solve it using the open source SCIP engine [Achterberg, 2009] , returning the active answer option a m from the optimal solution. To check for ties, we disable a m , re-solve the ILP, and compare the score of the second-best answer, if any, with that of a m .

MLN Solver (structured inference baseline). We consider the current state-of-the-art structured reasoning method developed for this specific task by Khot et al. [2015] . We compare against their best performing system, namely Praline, which uses Markov Logic Networks [Richardson and Domingos, 2006 ] to (a) align lexical elements of the question with probabilistic first-order science rules and (b) to control inference. We use the entire set of 47,000 science rules from their original work, which were also derived from same domaintargeted sources as the ones used in our Sentence Corpus.

IR Solver (information retrieval baseline). We use the IR baseline by , which selects the answer option that has the best matching sentence in a corpus. Specifically, for each answer option a i , the IR solver sends q + a i as a query to a search engine (we use Lucene) on the Sentence Corpus, and returns the search engine's score for the top retrieved sentence s, where s must have at least one nonstopword overlap with q, and at least one with a i . The option with the highest Lucene score is returned as the answer.

PMI Solver (statistical co-occurrence baseline). We use the PMI-based approach by , which selects the answer option that most frequently co-occurs with the question words in a corpus. Specifically, it extracts unigrams, bigrams, trigrams, and skip-bigrams from the question and each answer option. For a pair (x, y) of n-grams, their pointwise mutual information (PMI) [Church and Hanks, 1989] in the corpus is defined as log p(x,y) p(x)p(y) where p(x, y) is the cooccurrence frequency of x and y (within some window) in the corpus. The solver returns the answer option that has the largest average PMI in the Web Corpus, calculated over all pairs of question n-grams and answer option n-grams.

4.2 Results

We first compare the accuracy of our approach against the previous structured (MLN-based) reasoning solver. We also compare against IR(tables), an IR solver using table rows expressed as sentences, thus embodying an unstructured approach operating on the same knowledge as TableILP. Table 3 shows, among the two structured inference approaches, TableILP outperforms the MLN baseline by 14%. The preliminary ILP system reported by achieves only a score of 43.8% on this question set. Further, given the same semi-structured knowledge (i.e., the Table Corpus), TableILP is substantially (+10%) better at exploiting the structure than the IR(tables) baseline, which, as mentioned above, uses the same data expressed as sentences.

Table 3: TableILP significantly outperforms both the prior MLN reasoner, and IR using identical knowledge as TableILP

Complementary Strengths

While their overall score is similar, TableILP and IR-based methods clearly approach QA very differently. To assess whether TableILP adds any new capabilities, we considered the 50 (out of 129) questions incorrectly answered by PMI solver (ignoring tied scores). On these unseen but arguably more difficult questions, TableILP answered 27 questions correctly, achieving a score of 54% compared to the random chance of 25% for 4-way multiple-choice questions. Results with IR solver were similar: TableILP scored 24.75 on the 52 questions incorrectly answered by IR (i.e., 47.6% accuracy). Table 4 shows the results, with the final combination at 69% representing a significant improvement over individual solvers. Table 5 summarizes various ILP and support graph statistics for TableILP, averaged across all test questions.

Table 5: TableILP statistics averaged across questions

Ilp Solution Properties

The optimization model has around 50 high-level constraints, which result, on average, in around 4000 inequalities over 1000 variables. Model creation, which includes computing pairwise entailment scores using WordNet, takes 1.9 seconds on average per question, and the resulting ILP is solved by the SCIP engine in 2.1 seconds (total for all four options), using around 1,300 LP iterations for each option. 10 Thus, TableILP takes only 4 seconds to answer a question using multiple rows across multiple tables (typically 140 rows in total), as compared to 17 seconds needed by the MLN solver for reasoning with four rules (one per answer option). While the final support graph on this question set relies mostly on a single table to answer the question, it generally combines information from more than two rows (2.3 on average) for reasoning. This suggests parallel evidence is more frequently used on this dataset than evidence chaining.

4.3 Ablation Study

To quantify the importance of various components of our system, we performed several ablation experiments, summarized in Table 6 No Open IE tables: To evaluate the impact of relatively unstructured knowledge from a large corpus, we removed the tables containing Open IE extractions (Section 3.2). The 9% drop in the score shows that this knowledge is important and TableILP is able to exploit it even though it has a very simple triple structure. This opens up the possibility of extending our approach to triples extracted from larger knowledge bases.

No Lexical Entailment: Finally, we test the effect of changing the alignment metric w (Section 3.2) from WordNet based scores to a simple asymmetric word-overlap measured as score(T, H) = |T ∩H| |H| . Relying on just word-matching results in an 11% drop, which is consistent with our knowledge often being defined in terms of generalities.

4.4 Question Perturbation

One desirable property of QA systems is robustness to simple variations of a question, especially when a variation would make the question arguably easier for humans.

To assess this, we consider a simple, automated way to perturb each 4-way multiple-choice question: (1) query Microsoft's Bing search engine (www.bing.com) with the question text and obtain the text snippet of the top 2,000 hits; (2) create a list of strings by chunking and tokenizing the results;

(3) remove stop words and special characters, as well as any words (or their lemma) appearing in the question; (4) sort the remaining strings based on their frequency; and (5) replace the three incorrect answer options in the question with the most frequently occurring strings, thereby generating a new question. For instance:

In New York State, the longest period of daylight occurs during which month? (A) eastern (B) June (C) history (D) years As in this example, the perturbations (italicized) are often not even of the correct "type", typically making them much easier for humans. They, however, still remain difficult for solvers. For each of the 108 development questions, we generate 10 new perturbed questions, using the 30 most frequently occurring words in step (5) above. While this approach can introduce new answer options that should be considered correct as well, only 3% of the questions in a random sample exhibited this behavior. Table 7 shows the performance of various solvers on the resulting 1,080 perturbed questions. As one might expect, the PMI approach suffers the most at a 33% relative drop. TableILP's score drops as well (since answer type matching isn't perfect), but only by 12%, attesting to its higher resilience to simple question variation.

Table 7: Drop in solver scores (on the development set, rather than the hidden test set) when questions are perturbed

5 Conclusion

Answering real science questions is a challenging task because they are posed in natural language, require extensive domain knowledge, and often require combining multiple facts together. We presented TableILP, a system that can answer such questions, using a semi-structured knowledge base. We treat QA as a subgraph selection problem and then formulate this as an ILP optimization. Most importantly, this formulation allows multiple, semi-formally expressed facts to be combined to answer questions, a capability outside the scope of IR-based QA systems. In our experiments, this approach significantly outperforms both the previous best attempt at structured reasoning for this task, and an IR engine provided with the same knowledge. It also significantly boosts performance when combined with unstructured methods (IR and PMI). These results suggest that the approach is both viable and promising for natural language question answering.

A Appendix: Ilp Model For Tableilp

Table 8: Notation for the ILP formulation.

Table 10. Not extracted; please refer to original document.

Table 9. Not extracted; please refer to original document.

Table 11. Not extracted; please refer to original document.

Table 12. Not extracted; please refer to original document.

Table 13: The set of all constraints used in our ILP formulation. The set of variables and are defined in Table 2. More intuition about constraints is included in Section 3. The sets used in the definition of the constraints are defined in Table 12.

B Appendix: Features In Solver Combination

To combine the predictions from all the solvers, we learn a Logistic Regression model that returns a probability for an answer option, a i , being correct based on the following features.

Solver-independent features: Given the solver scores s j for all the answer options j, we generate the following set of features for the answer option a i , for each of the solvers: 1. Score = s i 2. Normalized score = si j sj 3. Softmax score = exp (si) j exp(sj )

4. Best Option, set to 1 if this is the top-scoring option = I(s i = max s j ) TableILP-specific features: Given the proof graph returned for an option, we generate the following 11 features apart from the solver-independent features:

1. Average alignment score for question constituents 1 y (tijk, t ij k ) w(tijk, t ij k ) − 0.1 y (tijk, q ) w(q , tijk) y (hik, q ) w(q , hik) y (tijk, am) w(tijk, am) y (hik, am) w(hik, am)

Unary Variables x (Ti) 1.0 x (rij) -1.0 x ( ik ) 1.0 x (hik) 0.3 x (tijk) 0.0 x (q ) 0.3

EQUATION (5): Not extracted; please refer to original document.

Collection of basic variables connected to column k of table i

EQUATION (6): Not extracted; please refer to original document.

Collection of basic variables connected to row j of table i:

EQUATION (7): Not extracted; please refer to original document.

Collection of non-choice basic variables connected to row j of table i:

EQUATION (8): Not extracted; please refer to original document.

Collection of non-question basic variables connected to row j of table i: Kij = (t ijk , t ij k ); ∀k, i , j , k ∪ {(t ijk , am); ∀k, m}

Collection of basic variables connected to table i:

EQUATION (10): Not extracted; please refer to original document.

Collection of non-choice basic variables connected to table i: Ni = {(h ik , q ); ∀l}∪ (t ijk , t ij k ); ∀j, k, i , j , k ∪{(t ijk , q ); ∀j, k, l} (11) Collection of basic variables connected to question constituent q :

EQUATION (12): Not extracted; please refer to original document.

Collection of basic variables connected to option m Om = {(t ijk , am); ∀i, j, k} ∪ {(h ik , am); ∀i, k}

Collection of basic variables in column k of table i connected to option m:

EQUATION (14): Not extracted; please refer to original document.

Table 12: All the sets useful in definitions of the constraints in Table 13 .

If any cell in row j of table i is active, the row should be active.

x (rij) ≥ y (t ijk , e) , ∀(t ijk , e) ∈ Rij, ∀i, j, k

If the row j of table i is active, at least one cell in that row must be active as well.

(t ijk ,e)∈R ij

EQUATION (16): Not extracted; please refer to original document.

Column j header should be active if any of the basic variables with one end in this column header are active.

x (h ik ) ≥ y (h ik , e) , ∀(h ik , e) ∈ H ik , ∀i, k

If the header of column j variable is active, at least one basic variable with one end in the end in the header

EQUATION (18): Not extracted; please refer to original document.

Column k is active if at least one of the basic variables with one end in this column are active.

x ( ik ) ≥ y (t ijk , e) , ∀(t ijk , e) ∈ C ik , ∀i, k

If the column k is active, at least one of the basic variables with one end in this column should be active.

(t ijk ,e)∈C ik

EQUATION (20): Not extracted; please refer to original document.

If a basic variable with one end in table i is active, the table variable is active.

EQUATION (21): Not extracted; please refer to original document.

If the table i is active, at least one of the basic variables with one end in the table should be active.

(t,e)∈T i

EQUATION (22): Not extracted; please refer to original document.

If any of the basic variables with one end in option am are on, the option should be active as well.

x (am) ≥ y (x, am) , ∀(e, am) ∈ Om (23)

If the question option am is active, there is at least one active basic element connected to it (e,a)∈Om

EQUATION (24): Not extracted; please refer to original document.

If any of the basic variables with one end in the constituent q , the constituent must be active.

x (q ) ≥ y (e, q ) , ∀(e, q ) ∈ Q l

If the constituent q is active, at least one basic variable connected to it must be active.

(e,q )∈Q l y (e, q ) ≥ x (q )

Choose only a single option.

EQUATION (27): Not extracted; please refer to original document.

There is an upper-bound on the number of active tables; this is to limit the solver and reduce the chance of using spurious tables.

EQUATION (28): Not extracted; please refer to original document.

The number of active rows in each table is upper-bounded.

EQUATION (29): Not extracted; please refer to original document.

The number of active constituents in each question is lower-bounded. Clearly We need to use the question definition in order to answer a question.

EQUATION (30): Not extracted; please refer to original document.

A cell is active if and only if the sum of coefficients of all external alignment to it is at least a minimum specified value (t ijk ,e)∈E i,j,k

EQUATION (31): Not extracted; please refer to original document.

A title is active if and only if the sum of coefficients of all external alignment to it is at least a minimum specified value (e)∈H i,k

y (t ijk , e) ≥ x (t ijk ) × MINACTIVETITLEAGGRALIGNMENT, ∀i, k (32)

If a column is active, at least one of its cells must be active as well.

EQUATION (33): Not extracted; please refer to original document.

At most a certain number of columns can be active for a single option

k y ( ik , am) ≤ MAXACTIVECHOICECOLUMN, ∀i, m (34)

If a column is active for a choice, the table is active too.

EQUATION (35): Not extracted; please refer to original document.

If a table is active for a choice, there must exist an active column for choice.

EQUATION (36): Not extracted; please refer to original document.

If a table is active for a choice, there must be some non-choice alignment. y (Ti, am) ≤ (e,e )∈N i y e, e , ∀i, m

Answer should be present in at most a certain number of tables y (Ti, am) ≤ MAXACTIVETABLECHOICEALIGNMETS, ∀i, m (38)

If a cell in a column, or its header is aligned with a question option, the column is active for question option as well. If a column is active for a choice, the table is active for an option as well.

y ( ik , am) ≤ y (Ti, am) , ∀i, k, m

If the table is active for an option, at least one column is active for a choice y (Ti, am) ≤ k y ( ik , am) , ∀i, m

Create an auxiliary variable x (whichTermIsActive) with objective weight 1.5 and activate it, if there a "which" term in the question. l 1 {q = "which"} ≤ x (whichTermIsActive) 44Create an auxiliary variable x (whichTermIsAligned) with objective weight 1.5. Add a boost if at least one of the table cells/title aligning to the choice happens to have a good alignment ({w(., .) > MINALIGNMENTWHICHTERM}) with the "which" terms, i.e. WHICHTERMSPAN constituents after "which".

i (e 1 ,e 2 )∈T i y (e1, e2) ≥ x (whichTermIsAligned) (45)

A question constituent may not align to more than a certain number of cells (e,q )∈Q l y (e, q ) ≤ MAXALIGNMENTSPERQCONS 46Disallow aligning a cell to two question constituents if they are too far apart; in other words add the following constraint if the two constituents q and q are more than QCONSCOALIGNMAXDIST apart from each other:

EQUATION (47): Not extracted; please refer to original document.

For any two two question constraints that are not more than QCONSCOALIGNMAXDIST apart create an auxiliary binary variable x (cellProximityBoost) and set its weight in the objective function to be 1/(l − l + 1), where l and l are the indices of the two question constituents. With this we boost objective score if a cell aligns to two question constituents that are within a few words of each other

x (cellProximityBoost) ≤ y (t ijk , q ) , x (cellProximityBoost) ≤ y (t ijk , q ) , ∀i, j, k (48)

If a relation match is active, both the columns for the relation must be active

EQUATION (49): Not extracted; please refer to original document.

If a column is active, a relation match connecting to the column must be active

x ( ik ) ≤ k (r ( ik , ik , q , q ) + r ( ik , ik , q , q )), ∀k

If a relation match is active, the column cannot align to the question in an invalid position r ( ik , ik , q , q ) ≤ 1 − y (t ijk ,q ) , whereq ≤ q and t ijk ∈ ik (51)

If a row is active, at least a certain number of its cells must be active

EQUATION (52): Not extracted; please refer to original document.

If row is active, it must have non-choice alignments.

x (rij) ≤ (n,n )∈L ij y (n, n)

If row is active, it must have non-question alignments x (rij) ≤ (n,n )∈K ij y (n, n)

If two rows of a table are active, the corresponding active cell variables across the two rows must match; in other words, the two rows must have identical activity signature

x (rij) + x (r ij ) + x (t ijk )

− x (t ij k ) ≤ 2, ∀i, j, j , k, k (55)

If two rows are active, then at least one active column in which they differ (in tokenized form) must also be active; otherwise the two rows would be identical in the proof graph.

EQUATION (56): Not extracted; please refer to original document.

If a table is active and another table is also active, at least one inter-table active variable must be active;

x (Ti) + x (T i ) + j,k,j ,k y (t ijk , t i j k ) ≥ 1, ∀i, i (57) : The set of all constraints used in our ILP formulation. The set of variables and are defined in Table 2 . More intuition about constraints is included in Section 3. The sets used in the definition of the constraints are defined in Table 12 .

A preliminary version of our ILP model was used in the ensemble solver of. We build upon this earlier ILP formulation, providing further details and incorporating additional syntactic and semantic constraints that improve the score by 17.7%.

In our evaluations, w for entailment is a simple WordNet-based[Miller, 1995] function that computes the best word-to-word alignment between phrases, scores these alignments using WordNet's hypernym and synonym relations normalized using relevant wordsense frequency, and returns the weighted sum of the scores. w for similarity is the maximum of the entailment score in both directions. Alternative definitions for these functions may also be used.

Details of the ILP model may be found in the appendix.

An exhaustive list of the minimum alignment thresholds for creating pairwise variables is inTable 10in the appendix.5 The complete list of weights for unary and pairwise variables is included inTable 9in the appendix.6 The complete list of the constraints is explained inTable 13in the appendix.

These are the only publicly available state-level science exams. http://www.nysedregents.org/Grade4/Science/home.html 8 Table Corpus and the ILP model are available at allenai.org.

Details of the 11 features may be found in the Appendix B. 10 Commercial ILP solvers (e.g., CPLEX, Gurobi) are much faster than the open-source SCIP solver we used for evaluations.