What Does My QA Model Know? Devising Controlled Probes Using Expert Knowledge

Authors

  • Kyle Richardson
  • Ashish Sabharwal

Transactions of the Association for Computational Linguistics, 2020

Abstract

Open-domain question answering (QA) involves many knowledge and reasoning challenges, but are successful QA models actually learning such knowledge when trained on benchmark QA tasks? We investigate this via several new diagnostic tasks probing whether multiple-choice QA models know definitions and taxonomic reasoning—two skills widespread in existing benchmarks and fundamental to more complex reasoning. We introduce a methodology for automatically building probe datasets from expert knowledge sources, allowing for systematic control and a comprehensive evaluation. We include ways to carefully control for artifacts that may arise during this process. Our evaluation confirms that transformer-based multiple-choice QA models are already predisposed to recognize certain types of structural linguistic knowledge. However, it also reveals a more nuanced picture: their performance notably degrades even with a slight increase in the number of “hops” in the underlying taxonomic hierarchy, and with more challenging distractor candidates. Further, existing models are far from perfect when assessed at the level of clusters of semantically connected probes, such as all hypernym questions about a single concept.

1 Introduction

Automatically answering questions, especially in the open-domain setting (i.e., where minimal or no contextual knowledge is explicitly provided), requires bringing to bear a considerable amount of background knowledge and reasoning abilities. For example, knowing the answers to the two questions in Figure 1 requires identifying a specific ISA relation (i.e., that cooking is a type of learned behavior) as well as recalling the definition of a concept (i.e., that global warming is defined as a worldwide increase in temperature). In the multiple-choice setting, which is the variety of question answering (QA) that we focus on in this paper, there is also pragmatic reasoning involved in selecting optimal answer choices (e.g., while greenhouse effect might in some other context be a reasonable answer to the second question in Figure 1, global warming is a preferable candidate). Recent successes in QA, driven largely by the creation of new resources (Zellers et al., 2018; Talmor et al., 2019; Bhagavatula et al., 2019; Khot et al., 2020, etc.) and advances in model pre-training (Radford et al., 2018; Devlin et al., 2019), raise a natural question: do state-of-the-art multiple-choice QA (MCQA) models that excel at standard tasks really have basic knowledge and reasoning skills?

Figure 1: An illustration of our experimental setup and probing methodology.

Most existing MCQA datasets are constructed either through expensive crowd-sourcing (Welbl et al., 2017) or through hand engineering effort; the former makes it possible to collect large amounts of data, but at the cost of losing systematic control over the semantics of the target questions. Hence, given the lack of targeted challenge datasets, it is difficult to run a controlled experiment that answers such a question for QA.

Having definitive empirical evidence of model competence on any given phenomenon requires constructing a wide range of systematic tests. For example, in measuring competence on definitions, not only do we want to see that the model can handle individual questions such as Figure 1.1 inside of benchmark tasks, but that it can answer a wider range of questions that exhaustively cover a broad set of concepts and question perturbations (i.e., systematic adjustments to how the questions are constructed). The same applies to ISA reasoning; not only is it important to recognize in the question in Figure 1.1 that cooking is a learned behavior, but also that cooking is a general type of behavior or, through a few more inferential steps, a type of human activity.

In this paper, we look at systematically constructing such tests by exploiting the vast amounts of structured information contained in various types of expert knowledge such as knowledge graphs and lexical taxonomies. Our general methodology works as illustrated in Figure 1: given any MCQA model trained on a set of benchmark tasks, we systematically generate a set of synthetic dataset probes (i.e., MCQA renderings of the target information) from information in expert knowledge sources. We then use these probes to ask two empirical questions: 1) how well do models trained on benchmark tasks perform on these probing tasks; and 2) can such models be re-trained to master new challenges with minimal performance loss on their original tasks?

While our methodology is amenable to any knowledge source and set of models/benchmark tasks, we focus on probing state-of-the-art transformer models (Devlin et al., 2019; Liu et al., 2019b) in the domain of science MCQA. For sources of expert knowledge, we use WordNet, a comprehensive lexical ontology, and other publicly available dictionary resources. We devise probes that measure model competence in definition and taxonomic knowledge in different settings (including hypernymy, hyponymy, and synonymy detection, and word sense disambiguation). This choice is motivated by the fact that the science domain is considered particularly challenging for QA (Clark et al., 2013; Clark, 2015; Clark et al., 2019), and existing science benchmarks are known to involve widespread use of such knowledge (see Boratko et al. (2018) for analysis), which is also arguably fundamental to more complex forms of reasoning.

We show that accurately probing QA models via synthetic datasets is not straightforward, as unexpected artifacts can easily arise in such data. This motivates our carefully constructed baselines and close data inspection to ensure probe quality.

Our results confirm that transformer-based QA models have a remarkable ability to recognize certain types of knowledge captured in our probes, even without additional fine-tuning. Such models can even outperform strong task-specific models trained directly on our probing tasks (e.g., on definitions, our best model achieves 77% test accuracy without specialized training, as opposed to 51% for a task-specific LSTM-based model). We also show that the same models can be effectively re-fine-tuned on small samples (even 100 examples) of probe data, and that high performance on the probes tends to correlate with a smaller drop in the model's performance on the original QA task.

Our comprehensive assessment reveals several interesting nuances to the overall positive trend. For example, the performance of even the best QA models degrades substantially on our hyponym probes (by 8-15%) when going from 1-hop links to 2-hops. Further, the accuracy of even our best models on the WordNetQA probe drops by 14-44% under our cluster-based analysis, which assesses whether a model knows several facts about each individual concept, rather than just being good at answering isolated questions. State-of-the-art QA models thus have much room to improve even in some fundamental building blocks, namely definitions and taxonomic hierarchies, of more complex forms of reasoning.

2 Related Work

We follow recent work on constructing challenge datasets for probing neural models, which has primarily focused on the task of natural language inference (NLI) (Glockner et al., 2018; Naik et al., 2018; McCoy et al., 2019; Rozen et al., 2019; Warstadt et al., 2019) . Most of this work looks at constructing data through adversarial generation methods, which have also been found useful for creating stronger models (Kang et al., 2018) . There has also been work on using synthetic data of the type we consider in this paper (Poliak et al., 2018a; Geiger et al., 2019; Richardson et al., 2020) . We closely follow the methodology of Richardson et al. (2020) , who use hand-constructed linguistic fragments to probe NLI models and study model re-training using a variant of the inoculation by fine-tuning strategy of Liu et al. (2019a) . In contrast, we focus on probing open-domain MCQA models (see Si et al. (2019) for a related study in the reading comprehension setting) as well as constructing data from much larger sources of structured knowledge.

Our main study focuses on probing the BERT model and fine-tuning approach of Devlin et al. (2019), and other variants thereof, which are all based on the transformer architecture of Vaswani et al. (2017). Related to our efforts, there have been recent studies into the types of relational knowledge contained in large-scale knowledge models (Petroni et al., 2019; Kassner and Schütze, 2019), which, similar to our work, probe models using structured knowledge sources. This prior work, however, primarily focuses on unearthing the knowledge contained in the underlying language models as-is, without further training, using simple (single token) cloze-style probing tasks and templates (similar to what we propose in Section 3). In contrast, we focus on understanding the knowledge contained in language models after they have been trained for a QA end-task using benchmark datasets in which such knowledge is expected to be widespread. Further, our evaluation is done before and after these models are fine-tuned on our probe QA tasks, using a more complex set of QA templates and target inferences.

The use of lexical resources and knowledge graphs such as WordNet to construct datasets has a long history, and has recently appeared in work on adversarial attacks (Glockner et al., 2018; Jia and Liang, 2017) and general task construction (Pilehvar and Camacho-Collados, 2019; Pasupat and Liang, 2015). In the area of MCQA, there is related work on constructing questions from tuples (Jauhar et al., 2016; Talmor et al., 2019), both of which involve standard crowd annotation to elicit question-answer pairs (see also Seyler et al. (2017); Reddy et al. (2017)). In contrast to this work, we focus on generating data in an entirely automatic fashion, which obviates the need for expensive annotation and gives us the flexibility to construct much larger datasets that control a rich set of semantic aspects of the target questions.

Concepts with Definitions: T_d ⊆ {def} × C × D
Concepts with Example Sentences: T_e ⊆ {ex} × C × S
Concepts with Words: T_l ⊆ {lemma} × C × W
ISA Relations (WN only): T_i ⊆ {isa↑, isa↓} × C × C
Final Set of Triples for G: T = T_d ∪ T_e ∪ T_l ∪ T_i

3 Dataset Probes And Construction

Our probing methodology starts by constructing challenge datasets (Figure 1 , yellow box) from a target set of knowledge resources. Each of our probing datasets consists of multiple-choice questions that include a question q and a set of answer choices or candidates {a 1 , ...a N }. This section describes in detail the 5 different datasets we build, which are drawn from two sources of expert knowledge, namely WordNet (Miller, 1995) 1 and the GNU Collaborative International Dictionary of English (GCIDE). 2 We describe each resource in turn, and explain how the resulting dataset probes, which we call WordNetQA and DictionaryQA, are constructed. For convenience, we will describe each source of expert knowledge as a directed, edge-labeled graph G. The nodes of this graph are V = C ∪ W ∪ S ∪ D, where C is a set of atomic concepts, W a set of words, S a set of sentences, and D a set of definitions (see Table 1 for details for WordNet and GCIDE). Each edge of G is directed from an atomic concept in C to another node in V , and is labeled with a relation, such as hypernym or isa ↑ , from a set of relations R (see Table 1 ).

Table 1: A description of the different resources used to construct the probes in terms of abstract triples.

When defining our probe question templates, it will be useful to view G as a set of (relation, source, target) triples T ⊆ R×C×V. Due to their origin in an expert knowledge source, such triples preserve semantic consistency. For instance, when the relation in a triple is def, the corresponding edge maps a concept in C to a definition in D.

To construct probe datasets, we rely on two heuristic functions, defined below for each individual probe: GEN_Q(τ), which generates gold question-answer pairs (q, a) from a set of triples τ ⊆ T and question templates Q, and DISTR(τ′), which generates distractor answer choices {a′_1, ..., a′_{N−1}} based on another set of triples τ′ (where usually τ′ ⊂ τ). For brevity, we will use GEN(τ) to denote GEN_Q(τ), leaving question templates Q implicit.
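To make the GEN and DISTR abstractions concrete, the following is a minimal Python sketch of how a definition probe item could be assembled from (relation, source, target) triples. The data structures, helper names, and template wording are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of GEN and DISTR over (relation, source, target) triples.
# Names and the template wording are illustrative, not the paper's code.
import random
from collections import namedtuple

Triple = namedtuple("Triple", ["rel", "src", "tgt"])

def gen_definition_question(concept, triples):
    """Render a gold (question, answer) pair for a definition probe."""
    by_rel = {(t.rel, t.src): t.tgt for t in triples}
    sentence = by_rel[("ex", concept)]    # example sentence s
    word = by_rel[("word", concept)]      # lemma w appearing in context
    gloss = by_rel[("def", concept)]      # definition d, the gold answer
    question = (f"In the sentence '{sentence}', "
                f"the word '{word}' is best defined as:")
    return question, gloss

def distr(concept, definitions, n_distractors=4):
    """Sample distractor definitions belonging to other concepts."""
    candidates = [d for c, d in definitions.items() if c != concept]
    return random.sample(candidates, min(n_distractors, len(candidates)))

triples = [Triple("ex", "nestle.v.01", "The baby nestled her head"),
           Triple("word", "nestle.v.01", "nestled"),
           Triple("def", "nestle.v.01", "position comfortably")]
print(gen_definition_question("nestle.v.01", triples))
```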

3.1 WordNetQA

WordNet is an English lexical database consisting of around 117k concepts, which are organized into groups of synsets that each contain a gloss (i.e., a definition of the target concept), a set of representative English words (called lemmas), and, in around 33k synsets, example sentences. In addition, many synsets have ISA links to other synsets that express complex taxonomic relations. Table 1 summarizes how we formulate WordNet as a set of triples T of various types. These triples together represent a directed, edge-labeled graph G. Our main motivation for using WordNet, as opposed to a resource such as ConceptNet (Havasi et al., 2007) , is the availability of glosses (D) and example sentences (S), which allows us to construct natural language questions that contextualize the types of concepts we want to probe.

Example Generation GEN(τ ). We build 4 individual datasets based on semantic relations native to WordNet (see Miller et al. (1990) ): hypernymy (i.e., generalization or ISA reasoning up a taxonomy, ISA ↑ ), hyponymy (ISA ↓ ), synonymy, and definitions. To generate a set of questions in each case, we employ a number of rule templates Q that operate over tuples. A subset of such templates is shown in Table 2 . The templates were designed to mimic naturalistic questions we observed in our science benchmarks.

Table 2: Details of the GEN(τ) function used to construct gold question-answer pairs (q, a) from a triple graph G.

For example, suppose we wish to create a question q about the definition of a target concept c ∈ C. We first select a question template from Q that first introduces the concept c and its lemma l ∈ W in context using the example sentence s ∈ S, and then asks to identify the corresponding WordNet gloss d ∈ D, which serves as the gold answer a. The same is done for ISA reasoning; each question about a hypernym/hyponym relation between two concepts c →↑/↓ c′ ∈ T_i (e.g., dog →↑/↓ animal/terrier) first introduces a context for c and then asks for an answer that identifies c′ (which is also provided with a gloss so as to contain all available context).

In the latter case, the rules (isa_r, c, c′) ∈ T_i in Table 2 cover only direct ISA links from c in direction r ∈ {↑, ↓}. In practice, for each c and direction r, we construct tests that cover the set HOPS(c, r) of all direct as well as derived ISA relations of c:

HOPS(c, r) := { (isa_r, c, c′) ∈ T_i } ∪ ⋃_{c′ : (isa_r, c, c′) ∈ T_i} HOPS(c′, r)

This allows us to evaluate the extent to which models are able to handle complex forms of reasoning that require several inferential steps or hops. 3
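A minimal sketch of how the HOPS recursion could be computed over stored ISA triples follows; the isa_links structure, the direction labels, and the 5-hop cap (taken from the footnote) are assumptions for illustration.

```python
# Sketch of the HOPS(c, r) recursion over ISA triples. `isa_links` maps
# (concept, direction) to the set of directly linked concepts; the 5-hop
# cap mirrors the default limit mentioned in the footnote.
def hops(concept, direction, isa_links, max_depth=5):
    """All direct and derived ISA triples reachable from `concept`."""
    if max_depth == 0:
        return set()
    triples = set()
    for nxt in isa_links.get((concept, direction), set()):
        triples.add((direction, concept, nxt))
        triples |= hops(nxt, direction, isa_links, max_depth - 1)
    return triples

# Toy taxonomy: dog -> canine -> animal (going "up")
isa_links = {("dog", "up"): {"canine"}, ("canine", "up"): {"animal"}}
print(sorted(hops("dog", "up", isa_links)))
# [('up', 'canine', 'animal'), ('up', 'dog', 'canine')]
```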

Probe Type, Triple Input τ, Generation Templates from Q, and Example Questions and Answers (q, a):

Definitions (defining words in context).
Triple input τ: (def, c_i, d), (ex, c_i, s), (word, c_i, w)
Template: q. In the sentence [s], the word [w] is best defined as: a. [d]
Example: q. In the sentence "The baby nestled her head", the word nestled is best defined as: a. position comfortably

Hypernymy (ISA↑ reasoning in context; symbolically c_i ⇒ c′_i).
Triple input τ: (def, c′_i, d), (isa↑, c_i, c′_i), (ex, c_i, s), (word, c_i, w), (word, c′_i, w′)
Template: q. In [s], the word or concept [w] is best described as a type of: a. [w′] defined as [d]
Example: q. In "The thief eluded the police", the word or concept eluded is best described as a type of: a. escape event, defined as to run away from...

Hyponymy (ISA↓ reasoning given context; symbolically c_i ⇐ c′_i).
Triple input τ: (def, c′_i, d), (isa↓, c_i, c′_i), (ex, c_i, s), (word, c_i, w), (word, c′_i, w′)
Template: q. Given the context [s], which of the following word or concept is a specific type of [w]? a. [w′] defined as [d]
Example: q. Given the context "they awaited her arrival", which of the following word or concept is a specific type of arrival? a. crash landing, defined as an emergency landing under circumstances where...

Synonymy (related words).
Triple input τ: (def, c_i, d), (word, c_i, w_1), (word, c_i, w_2), ...
Template: q. Which set of words best corresponds to the definition [d]? a. [{w_1, w_2, ...}]
Example: q. Which set of words best corresponds to the definition "a grammatical category in inflected languages governing agreement between nouns and pronouns..."? a. gender, ...

Distractor Generation: DISTR(τ′). An example of how distractors are generated is shown in Figure 2, which relies on similar principles as above. For each concept c, we choose 4 distractor answers that are close in the WordNet semantic space. For example, when constructing hypernymy tests for c from the set HOPS(c, ↑), we build distractors by drawing from HOPS(c, ↓) (and vice versa), as well as from the ℓ-deep sister family of c, defined as follows. The 1-deep sister family is simply c's siblings or sisters, i.e., the other children c̃ ≠ c of the parent node c′ of c. For ℓ > 1, the ℓ-deep sister family also includes all descendants of each c̃ up to ℓ − 1 levels deep, denoted HOPS_{ℓ−1}(c̃, ↓). Formally:

Figure 2: A portion of the WordNet ISA graph (top) and an example distractor function DISTR(τ) (bottom) used to generate distractor choices {a′1, a′2, a′3} for a question q based on information in the graph.

SISTER_ℓ(c) := { x ∈ HOPS_{ℓ−1}(c̃, ↓) | (isa↑, c, c′) ∈ T_i, (isa↑, c̃, c′) ∈ T_i, c̃ ≠ c }

For definitions and synonyms we build distractors from all of these sets (with a similar restriction on the depth of SISTER distractors as noted above). In doing this, we can systematically investigate model performance on a wide range of distractor sets.
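Continuing the sketch above, the ℓ-deep sister family could be gathered as follows. The isa_links layout and helper names are again assumptions, and this reuses the hops function from the earlier sketch.

```python
# Sketch of the l-deep sister family used for distractor selection, reusing
# the `hops` function and `isa_links` structure from the sketch above.
def sister_family(concept, isa_links, depth=1):
    """Siblings of `concept`, plus their descendants up to depth-1 levels."""
    family = set()
    for parent in isa_links.get((concept, "up"), set()):
        for sibling in isa_links.get((parent, "down"), set()):
            if sibling == concept:
                continue
            family.add(sibling)
            if depth > 1:
                # descendants of each sibling, at most depth-1 hops down
                family |= {tgt for _, _, tgt in
                           hops(sibling, "down", isa_links,
                                max_depth=depth - 1)}
    return family
```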

3.1.1 Perturbations And Semantic Clusters

Based on how we generate data, for each concept c (i.e., atomic WordNet synset) and probe type (i.e., definitions, hypernymy, etc.), we have a wide variety of questions related to c that manipulate 1) the complexity of reasoning that is involved (e.g., the number of inferential hops), and 2) the types of distractors (or distractor perturbations) that are employed. We call such sets semantic clusters. As we describe in the next section, semantic clusters allow us to devise new types of evaluation that reveal whether models have comprehensive and consistent knowledge of target concepts (e.g., evaluating whether a model can correctly answer several questions associated with a concept, as opposed to a few disjoint instances). Details of the individual datasets are shown in Table 3. From these sets, we follow Richardson et al. (2020) in allocating a maximum of 3k examples for training and reserve the rest for development and testing. Since we are interested in probing, having large held-out sets allows us to do detailed analysis and cluster-based evaluation.

Table 3: Details of our dataset probes, which includes (for WordNetQA above) the number of unique (q, a) pairs, as well as the total number of all questions including perturbations w/ Perturb. (varied distractor choices).

3.2 DictionaryQA

The DictionaryQA dataset is created from the GCIDE dictionary, which is a comprehensive open-source English dictionary built largely from Webster's Revised Unabridged Dictionary (Webster, 1913). Each entry consists of a word, its part-of-speech, its definition, and an optional example sentence (see Table 4). Overall, 33k entries (out of a total of 155k) contain example sentences/usages. As with the WordNet probes, we focus on this subset so as to contextualize each word being probed. In contrast to WordNet, GCIDE does not have ISA relations or explicit synsets, so we take each unique entry to be a distinct sense. We then use the dictionary entries to create a probe that centers around word-sense disambiguation, as described below.

Table 4: Example dictionary entries for the word gift.

word: gift, pos: n., definition: Anything given; anything voluntarily transferred by one person to another without compensation; a present. entry example: None.
word: gift, pos: n., definition: A bribe; anything given to corrupt. entry example: None.
word: gift, pos: n., definition: Some exceptional inborn quality or characteristic; a striking or special talent or aptitude; ... entry example: the gift of wit; a gift for speaking.

Example and Distractor Generation. To generate gold questions and answers, we use the same generation templates for definitions exemplified in Figure 2 for WordNetQA. To generate distractors, we simply take alternative definitions for the target words that represent a different word sense (e.g., the alternative definitions of gift shown in Table 4 ), as well as randomly chosen definitions if needed to create a 5-way multiple choice question. As above, we reserve a maximum of 3k examples for training. Since we have only 9k examples in total in this dataset (see WordSense in Table 3 ), we also reserve 3k each for development and testing.
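A small sketch of how word-sense distractors could be drawn for DictionaryQA: alternative senses of the same word come first, with randomly chosen definitions as backfill up to a 5-way question. The entry layout and names are assumptions for illustration.

```python
# Sketch of word-sense distractor generation for DictionaryQA: alternative
# definitions of the same word first, random definitions as backfill.
import random

def word_sense_distractors(word, gold_definition, entries, n_choices=5):
    """`entries` maps each word to the list of its dictionary definitions."""
    distractors = [d for d in entries.get(word, []) if d != gold_definition]
    needed = (n_choices - 1) - len(distractors)
    if needed > 0:
        pool = [d for w, defs in entries.items() if w != word for d in defs]
        distractors += random.sample(pool, needed)
    return distractors[: n_choices - 1]
```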

We note that initial attempts to build this dataset through standard random splitting gave rise to certain systematic biases that were exploited by the choice-only baseline models described in the next section, and hence inflated overall model scores. After several efforts at filtering we found that, among other factors, using definitions from entries without example sentences as distractors (e.g., the first two entries in Table 4 ) had a surprising correlation with such biases. This suggests that possible biases involving differences between dictionary entries with and without examples can taint the resulting automatically generated MCQA dataset (for more discussion on the pitfalls involved with automatic dataset construction, see Section 5).

4 Probing Methodology And Modeling

Given the probes above, we can now start to answer the empirical questions posed at the beginning. Our main focus is on looking at transformer-based MCQA models trained in the science domain (using the benchmarks shown in Table 5). In this section, we provide details of MCQA and the target models, as well as several baselines that we use to sanity check our new datasets. To evaluate model competence, we look at a combination of model performance after science pre-training and after additional model fine-tuning using the lossless inoculation strategy of Richardson et al. (2020) (Section 4.2). In Section 4.3, we also discuss a cluster-level accuracy metric for measuring performance over semantic clusters.

Table 5: The MCQA training datasets used. #Question denotes the number of training samples in our version of each dataset, N the number of choices.

4.1 Task Definition And Modeling

Given a dataset D = {(q^(d), {a^(d)_1, ..., a^(d)_N})}^{|D|}_{d=1} consisting of pairs of question stems q and answer choices a_i, the goal is to find the correct answer a_{i*} that correctly answers each q. Throughout this paper, we look at 5-way multiple-choice problems (i.e., where each N = 5).

Question+Answer Encoder. To model this, our investigation centers around the use of the transformer-based (Vaswani et al., 2017) BERT encoder and fine-tuning approach of Devlin et al. (2019) (see also Radford et al. (2018)). For each question and individual answer pair q^(j), a^(j)_i, we assume the following rendering of this input:

q^(j)_{a_i} := [CLS] q^(j) [SEP] a^(j)_i [SEP]

which is run through the pre-trained BERT encoder to generate a representation for q^(j)_{a_i} using the hidden state representation of the [CLS] (i.e., classifier) token, c^(j)_i:

c^(j)_i = BERT(q^(j)_{a_i}) ∈ R^H

The probability of a given answer p^(j)_i is then computed as p^(j)_i ∝ e^{v · c^(j)_i}, which uses an additional set of classification parameters v ∈ R^H that are optimized (along with the full transformer network) using the negative log probability of each correct answer p^(d)_{i*}, normalized over all answer choices, as the final loss:

L = Σ_{d ∈ |D|} − log p^(d)_{i*}

We specifically use BERT-large uncased with whole-word masking, as well as the RoBERTa-large model from Liu et al. (2019b), which is a more robustly trained version of the original BERT model. Our system uses the implementations provided in AllenNLP and Huggingface (Wolf et al., 2019).
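As a rough illustration of the [CLS] q [SEP] a_i [SEP] scoring scheme, the following Hugging Face sketch scores one question against five candidate answers. The checkpoint name and example are assumptions, and the multiple-choice head (the classification parameters v) is randomly initialized here, so it only becomes meaningful after fine-tuning on an MCQA training set as described above.

```python
# Hugging Face sketch of the [CLS] q [SEP] a_i [SEP] multiple-choice scoring.
# The checkpoint and example question are illustrative; the classification
# head is untrained until fine-tuned on an MCQA dataset.
import torch
from transformers import AutoTokenizer, AutoModelForMultipleChoice

name = "bert-large-uncased-whole-word-masking"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMultipleChoice.from_pretrained(name)

question = "In the sentence 'The toddler could count', the word count is a type of:"
choices = ["recite event", "utter event", "spell event",
           "parrot event", "countdown event"]

# One (question, choice) pair per candidate; the tokenizer adds [CLS]/[SEP].
enc = tokenizer([question] * len(choices), choices,
                return_tensors="pt", padding=True)
inputs = {k: v.unsqueeze(0) for k, v in enc.items()}  # (1, num_choices, len)

with torch.no_grad():
    logits = model(**inputs).logits      # one score per answer choice
probs = torch.softmax(logits, dim=-1)    # p_i over the 5 candidates
print(probs)
```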

Baselines and Sanity Checks. When creating synthetic datasets, it is important to ensure that systematic biases, or annotation artifacts (Gururangan et al., 2018), are not introduced into the resulting probes and that the target datasets are sufficiently challenging (or good, in the sense of Hewitt and Liang (2019)). To test for this, we use several of the MCQA baseline models first introduced in Mihaylov et al. (2018), which take inspiration from the LSTM-based models used in Conneau et al. (2017) for NLI, and various partial-input baselines based on these models.

Following the notation from Mihaylov et al. (2018), for any given sequence s of tokens in {q^(j), a^(j)_1, ..., a^(j)_N} in D, an encoding of s is given as:

h^(j)_s = BiLSTM(EMBED(s)) ∈ R^{|s| × 2h}

where h is the dimension of the hidden state in each directional network, and EMBED(·) is an embedding function that assigns token-level embeddings to each token in s. 4 A contextual representation for each s is then built by applying an element-wise max operation over h^(j)_s as follows:

r^(j)_s = max(h^(j)_s) ∈ R^{2h}

With these contextual representations, different baseline models can be constructed. For example, a Choice-Only model, which is a variant of the well-known hypothesis-only baseline used in NLI (Poliak et al., 2018b), scores each choice c_i independently of the question in the following way:

α^(j)_i = W^T r^(j)_{c_i} ∈ R, for W ∈ R^{2h}

and assigns a probability to each answer as p^(j)_i ∝ e^{α^(j)_i}. A slight variant of this model, the Choice-to-Choice model, tries to single out a given answer choice relative to the other choices by scoring all choice pairs α^(j)_{i,i′} = ATT(r^(j)_{c_i}, r^(j)_{c_{i′}}) ∈ R using a learned attention mechanism ATT and finding the choice with the minimal similarity to the other options (for full details, see their original paper). In using these partial-input baselines, which we train directly on each target probe, we can check whether systematic biases related to answer choices were introduced into the data creation process.
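A compact PyTorch sketch of the Choice-Only partial-input baseline described above (BiLSTM over the choice alone, element-wise max pooling, a linear scorer); the dimensions and embedding setup are assumptions rather than the original implementation.

```python
# Sketch of the Choice-Only partial-input baseline: the question is never
# seen, only the answer choice text.
import torch
import torch.nn as nn

class ChoiceOnly(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)           # EMBED(.)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True,
                               bidirectional=True)                # BiLSTM
        self.scorer = nn.Linear(2 * hidden, 1)                    # W in R^{2h}

    def forward(self, choice_tokens):
        # choice_tokens: (num_choices, choice_len) token ids for one question
        h, _ = self.encoder(self.embed(choice_tokens))            # (N, len, 2h)
        r = h.max(dim=1).values                                   # max pooling
        alpha = self.scorer(r).squeeze(-1)                        # (N,) scores
        return torch.log_softmax(alpha, dim=-1)                   # over choices

model = ChoiceOnly(vocab_size=10000)
scores = model(torch.randint(0, 10000, (5, 12)))                  # 5 choices
```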

A Question-to-Choice model, in contrast, uses the contextual representations for each question and individual choice, together with an attention model ATT, to get a score α^(j)_{q,i} = ATT(r^(j)_q, r^(j)_{c_i}) ∈ R as above. Here we also experiment with using ESIM (Chen et al., 2017) to generate the contextual representations r, as well as a simpler VecSimilarity model that measures the average vector similarity 5 between question and answer tokens:

α^(j)_{q,i} = SIM(EMBED(q^(j)), EMBED(c^(j)_i))

In contrast to the models above, these latter baselines are used to check for artifacts between questions and answers that are not captured by the partial-input baselines (see discussion in Feng et al. (2019)) and to ensure that the overall MCQA tasks are sufficiently difficult for our transformer models.
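The VecSimilarity baseline can be sketched as an average pairwise cosine similarity between question and answer tokens under pre-trained word vectors; the word_vectors lookup is an assumed input (e.g., a dict loaded from Word2Vec).

```python
# Sketch of the VecSimilarity baseline: average pairwise cosine similarity
# between question tokens and answer-choice tokens.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def vec_similarity_score(question_tokens, choice_tokens, word_vectors):
    sims = [cosine(word_vectors[q], word_vectors[a])
            for q in question_tokens if q in word_vectors
            for a in choice_tokens if a in word_vectors]
    return sum(sims) / len(sims) if sims else 0.0
```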

4.2 Inoculation And Pre-Training

We train the models introduced above on benchmark tasks in the science domain and look at model performance on our probes with and without additional training on samples of probe data, building on the idea of inoculation from Liu et al. (2019a). Model inoculation is the idea of continuing to train models on new challenge tasks (in our case, separately for each probe) using only a small amount of examples. Unlike in ordinary fine-tuning, the goal is not to learn an entirely re-purposed model, but to improve on (or vaccinate against) particular phenomena (e.g., our synthetic probes) that potentially deviate from a model's original training distribution (but that nonetheless might involve knowledge already contained in the model).

In the variant proposed in Richardson et al. (2020), for each pre-trained (science) model and architecture M_a, we continue training the model on k new probe examples (with a maximum of k = 3,000) under a set of different hyper-parameter configurations j ∈ {1, ..., J}, and identify, for each k, the model M*_{a,k} with the best aggregate performance S on the original (orig) and new task:

M*_{a,k} = argmax_{M ∈ {M^(1)_{a,k}, ..., M^(J)_{a,k}}} AVG(S_new(M), S_orig(M))

As in Richardson et al. (2020), we found all models to be especially sensitive to different learning rates, and performed comprehensive hyper-parameter searches that also manipulate the number of iterations and random seeds used. Using this methodology, we can see how much exposure to new data it takes for a given model to master a new task, and whether there are phenomena that stress particular models (e.g., lead to catastrophic forgetting of the original task). Given the restrictions on the number of fine-tuning examples, our assumption is that when models are able to maintain good performance on their original task during inoculation, the quickness with which they are able to learn the inoculated task provides evidence of prior competence, which is precisely what we aim to probe. To measure performance loss on the original task, we define a model's inoculation cost as the difference in the performance of this model on its original task before and after inoculation.
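A schematic version of this selection step is shown below; train_and_eval and the configuration grid are placeholders standing in for the actual fine-tuning and evaluation loop, which is not spelled out in the paper.

```python
# Schematic inoculation selection: for each probe sample size k, fine-tune
# under several hyper-parameter configurations and keep the model with the
# best average of probe (S_new) and original-task (S_orig) accuracy.
def inoculate(pretrained_model, probe_train, sample_sizes, configs,
              train_and_eval):
    """train_and_eval(model, examples, cfg) -> (tuned_model, s_new, s_orig)."""
    best = {}
    for k in sample_sizes:                    # e.g., [100, 500, 1000, 3000]
        sample = probe_train[:k]
        candidates = [train_and_eval(pretrained_model, sample, cfg)
                      for cfg in configs]     # learning rates, seeds, epochs
        # M*_{a,k} = argmax over configs of AVG(S_new, S_orig)
        best[k] = max(candidates, key=lambda c: (c[1] + c[2]) / 2.0)
    return best
```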

We pre-train on an aggregated training set of the benchmark science exams detailed in Table 5, and create an aggregate development set of around 4k science questions for evaluating overall science performance and inoculation costs. 6 To handle the mismatch between the number of answer choices in these sets, we made all sets 5-way by adding empty answers as needed. We also experimented with a slight variant of inoculation, called add-some inoculation, which involves balancing the inoculation training sets with naturalistic science questions. We reserve the MCQL dataset in Table 5 for this purpose, and experiment with balancing each probe example with a science example (x1 matching) and with adding twice as many science questions (x2 matching, up to 3k) for each new example.

4.3 Evaluating Model Competence

The standard way to evaluate our MCQA models is by looking at the overall accuracy of the correct answer prediction, or what we call instance-level accuracy (as in Table 6). Given the nature of our data and the existence of semantic clusters as detailed in Section 3.1.1 (i.e., sets of questions and answers under different distractor choices and inference complexity), we also measure a model's cluster-level (or strict cluster) accuracy, which requires correctly answering all questions in a cluster. Example semantic clusters are shown in Table 7; in the first case, there are 6 ISA↑ questions (including perturbations) about the concept trouser.n.01 (e.g., involving knowing that trousers are a type of consumer good and garment/clothing), which a model must answer in order to receive full credit.

Table 6: Instance-level accuracy (%) results on all baselines and main models.
Table 7: Example questions and answers/inferences (involving ISA reasoning) that illustrate semantic clusters, as well as model predictions (shown as # correct questions/total # questions with perturbations).

Our cluster-based analysis is motivated by the idea that if a model truly knows the meaning of a given concept, such as the concept of trousers, then it should be able to answer arbitrary questions about this concept without sensitivity to varied distractors. While our strict cluster metric is simplistic, it takes inspiration from work on visual QA (Shah et al., 2019) , and allows us to evaluate how consistent and robust models are across our different probes, and to get insight into whether errors are concentrated on a small set of concepts or widespread across clusters.
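A small sketch of how instance-level and strict cluster-level accuracy relate, assuming each prediction is tagged with a cluster identifier; the field names and example are illustrative.

```python
# Sketch of instance-level vs. strict cluster-level accuracy. A cluster id
# groups all questions (including perturbations) about one concept and
# probe type.
from collections import defaultdict

def accuracies(predictions):
    """`predictions`: list of (cluster_id, is_correct) pairs."""
    clusters = defaultdict(list)
    for cluster_id, is_correct in predictions:
        clusters[cluster_id].append(is_correct)
    instance_acc = sum(c for _, c in predictions) / len(predictions)
    cluster_acc = sum(all(v) for v in clusters.values()) / len(clusters)
    return instance_acc, cluster_acc

preds = [("trouser.n.01", True), ("trouser.n.01", False), ("gift.n.02", True)]
print(accuracies(preds))   # (0.666..., 0.5): one of two clusters fully correct
```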

5 Results And Findings

In this section, we provide results for the empirical questions first introduced in Figure 1, starting with the results of our baseline models.

Are our Probes Sufficiently Challenging? As shown in Table 6, most of our partial-input baselines (i.e., Choice-Only and Choice-to-Choice models) failed to perform well on our dataset probes across a wide range of models, showing that such probes are generally immune from biases relating to how distractors were generated. As already discussed in Section 3.2, however, initial versions of the DictionaryQA dataset had unforeseen biases partly related to whether distractors were sampled from entries without example sentences, which resulted in high Choice-Only-GloVe scores ranging around 56% accuracy before a filtering step was applied to remove these distractors. We had similar issues with the hypernymy probe which, even after a filtering step that used our Choice-to-Choice-GloVe model, still leads to high results on the BERT and RoBERTa choice-only models. Given that several attempts were made to entirely de-duplicate the different splits (both in terms of gold answers and distractor types), the source of these biases is not at all obvious, which shows how easy it is for unintended biases in expert knowledge to appear in the resulting datasets and the importance of having rigorous baselines. We also note the large gap in some cases between the BERT and RoBERTa versus GloVe choice-only models, which highlights the need for having partial-input baselines that use the best available models.

Using a more conventional set of Task-Specific QA models (i.e., the LSTM-based Question-to-Choice models trained directly on the probes), we can see that results are not particularly strong on any of the datasets, suggesting that our probes are indeed sufficiently challenging and largely immune from overt artifacts. The poor performance of the VecSimilarity model (which uses pre-trained Word2Vec embeddings without additional training) provides additional evidence that elementary lexical matching strategies are insufficient for solving any of the probing tasks.

How well do pre-trained MCQA models do? Science models that use non-transformer based encoders, such as the ESIM model with GloVe and ELMO, perform poorly across all probes, in many cases scoring near random chance, showing limits to how well they generalize from science to other tasks even with pre-trained GloVe and ELMO embeddings. In sharp contrast, the transformer models have mixed results, the most striking result being the RoBERTa models on the definitions and synonymy probes (achieving a test accuracy of 77% and 61%, respectively), which outperform several of the task-specific LSTM models trained directly on the probes. At first glance, this suggests that RoBERTa, which generally far outpaces even BERT across most probes, has high competence on definitions and synonyms even without explicit training on our new tasks.

Given the controlled nature of our probes, we can get a more detailed view of how well the science models are performing across different reasoning and distractor types, as shown in the first column of Figure 3 for ESIM and RoBERTa. The ESIM science model without training has uniformly poor performance across all categories, whereas the performance of RoBERTa is more varied. Across all datasets and number of hops (i.e., the rows in the heat maps), model performance for RoBERTa is consistently highest among examples with random distractors (i.e., the first column), and lowest in cases involving distractors that are closest in WordNet space (e.g., sister and ISA, or up/down, distractors of distance k = 1). This is not surprising, given that, in the first case, random distractors are likely to be the easiest category (and the opposite for distractors close in space), but suggests that RoBERTa might only be getting the easiest cases correct.

Figure 3: Combined model accuracies on the different WordNetQA datasets (divided by red lines), broken down (where possible) into number of hops k (rows) and types of distractor sets and hops k′ (columns), across the different stages of inoculation (# ex.). The dashed red lines show some trends related to multi-hop inference. Panels show ESIM+GloVe-Science and RoBERTa-Science with no training, 100 examples, and 3000 examples of inoculation.

Model performance also clearly degrades for hypernymy and hyponymy across all models as the number of hops k increases (see red dashed boxes). For example, performance on problems that involve hyponym reasoning with sister distractors of distance k = 1 (i.e., the second column) degrades from 47% to 15% when the number of hops k increases from 1 to 4. This general tendency persists even after additional fine-tuning, as we discuss next, and gives evidence that models are limited in their capacity for certain types of multi-hop inferences.

As discussed by Petroni et al. (2019), the choice of generation templates can have a significant effect on model performance. The results so far should therefore be regarded as a lower bound on model competence. It is possible that model performance is high for definitions, for example, because the associated templates best align with the science training distribution (which we know little about). For this reason, the subsequent inoculation step is important: it gives the model an opportunity to learn about our target templates and couple this learned knowledge with its general knowledge acquired during pre-training and science training (which is, again, what we aim to probe).

As shown in Figure 4, transformer models tend to learn most tasks fairly quickly while keeping constant scores on their original tasks (i.e., the flat dashed lines observed in plots 1-4), which gives evidence of high competence. In both cases, add-some inoculation proves to be a cheap and easy way to 1) improve scores on the probing tasks (i.e., the solid black and blue lines in plot 1), and 2) minimize loss on science (e.g., the blue and black dashed lines in plots 2-4). The opposite is the case for ESIM (plots 5-6); models are generally unable to simultaneously learn individual probes without degrading on their original task, and adding more science data during inoculation confuses models on both tasks.

Figure 4: Inoculation plots showing accuracy on challenge tasks (red solid lines) and original tasks (red dashed lines) using the best aggregate model M*_{a,k} at each number k of challenge examples (x axis). We also plot the effect of using add-some inoculation, shown in the blue (x1 matching) and black (x2 matching) lines.

As shown in Figure 3, RoBERTa is able to significantly improve performance across most categories even after inoculation with a mere 100 examples (the middle plot), which again provides strong evidence of prior competence. As an example, RoBERTa improves on 2-hop hyponymy inference with random distractors by 18% (from 59% to 77%). After 3k examples, the model has high performance on virtually all categories (the same score increases from 59% to 87%); however, results still tend to degrade as a function of hop and distractor complexity, as discussed above. Despite the high performance of our transformer models after inoculation, model performance on most probes (with the exception of Definitions) averages around 80% for our best models. This suggests that there is still considerable room for improvement, especially for synonymy and word sense, which is a topic that we discuss more in Section 6.

Are Models Consistent across Clusters? Table 8 shows cluster-level accuracies for the different WordNetQA probes. As with performance across the different inference/distractor categories, these results are mixed. For some probes, such as definitions, our best models appear to be rather robust; e.g., our RoBERTa model has a cluster accuracy of 75%, meaning that it can answer all questions perfectly for 75% of the target concepts and that errors are concentrated on a small minority (25%) of concepts. On synonymy and hypernymy, both BERT and RoBERTa appear robust on the majority of concepts, showing that errors are similarly concentrated. In contrast, our best model on hyponymy has an accuracy of 36%, meaning that its errors are spread across many concepts, thus suggesting less robustness. Table 7 shows a selection of semantic clusters involving ISA reasoning, as well as the model performance over different answers (shown symbolically) and perturbations. For example, in the second case, the cluster is based around the concept/synset oppose.v.06 and involves 4 inferences and a total of 24 questions (i.e., inferences with perturbations). Our weakest model, ESIM, answers only 5 out of 24 questions correctly, whereas RoBERTa gets 21/24. In the other cases, RoBERTa gets all clusters correct, whereas BERT and ESIM get none of them correct.

Table 8: Cluster-level accuracies (%) on the WordNetQA dev. sets for inoculated models and best Choice-only model. ∆ shows the absolute difference in percentage points relative to instance-level accuracies.

We emphasize that these results only provide a crude look into model consistency and robustness. Recalling again the details in Table 3 , probes differ in terms of average size of clusters. Hyponymy, in virtue of having many more questions per cluster, might simply be a much more difficult dataset. In addition, such a strict evaluation does not take into account potential errors inside of clusters, which is an important issue that we discuss in the next section. We leave addressing such issues and coming up with more insightful cluster-based metrics for future work.

6 Discussion And Conclusion

We presented several new challenge datasets and a novel methodology for automatically building such datasets from knowledge graphs and taxonomies. We used these to probe state-of-the-art open-domain QA models (centering around models based on variants of BERT). While our general methodology is amenable to any target knowledge resource or QA model/domain, we focus on probing definitions and ISA knowledge using open-source dictionaries and MCQA models trained in the science domain.

We find, consistent with recent probing studies (Petroni et al., 2019) , that transformer-based models have a remarkable ability to answer questions that involve complex forms of relational knowledge, both with and without explicit exposure to our new target tasks. In the latter case, a newer RoBERTa model trained only on benchmark science tasks is able to outperform several task-specific LSTM-based models trained directly on our probing data. When re-trained on small samples (e.g., 100 examples) of probing data using variations of the lossless inoculation strategy from Richardson et al. (2020) , RoBERTa is able to master many aspects of our probes with virtually no performance loss on its original QA task.

These positive results suggest that transformer-based models, especially models additionally fine-tuned on small samples of synthetic data, can be used in place of task-specific models used for querying relational knowledge, as has already been done for targeted tasks such as word sense disambiguation (Huang et al., 2019). Since models seem to already contain considerable amounts of relational knowledge, our simple inoculation strategy, which tries to nudge models to bring out this knowledge explicitly, could serve as a cheaper alternative to recent attempts to build architectures that explicitly incorporate structured knowledge (Peters et al., 2019); we see many areas where our inoculation strategy could be improved for such purposes, including having more complex loss functions that manage old and new information, as well as using techniques that take into account network plasticity (Paik et al., 2019).

The main appeal of using automatically generated datasets is the ability to systematically manipulate and control the complexity of target questions, which allows for more controlled experimentation and new forms of evaluation. Despite the positive results described above, results that look directly at the effect of different types of distractors and the complexity of reasoning show that our best models, even after additional fine-tuning, struggle with certain categories of hard distractors and multi-hop inferences. For some probes, our cluster-based analysis also reveals that errors are widespread across concept clusters, suggesting that models are not always consistent and robust. These results, taken together with our findings about the vulnerability of synthetic datasets to systematic biases, suggest that there is much room for improvement and that the positive results should be taken with a grain of salt. Developing better ways to evaluate semantic clusters and model robustness would be a step in this direction.

We emphasize that using synthetic versus naturalistic QA data comes with important trade-offs. While we are able to generate large amounts of systematically controlled data at virtually no cost or need for manual annotation, it is much harder to validate the quality of such data at such a scale and such varying levels of complexity. Conversely, with benchmark QA datasets, it is much harder to perform the type of careful manipulations and cluster-based analyses we report here. While we assume that the expert knowledge we employ, in virtue of being hand-curated by human experts, is generally correct, we know that such resources are fallible and error-prone. Initial crowd-sourcing experiments that look at validating samples of our data show high agreement across probes and that human scores correlate with the model trends across the probe categories. More details of these studies are left for future work.

1 https://wordnet.princeton.edu/
2 http://gcide.gnu.org.ua/
3 In practice, most WordNet synsets have no more than 5 hops. We use this as a default limit when building datasets.
4 As in Mihaylov et al. (2018), we experiment with using both GloVe (Pennington et al., 2014) and ELMO (Peters et al., 2018) pre-trained embeddings for EMBED.
5 For this we use pre-trained Word2Vec vectors (Mikolov et al., 2013) and cosine as our SIM(·) function.
6 To save space, we do not report the scores on each individual science dataset, but we verified that our best models achieve performance comparable to SOTA. We also note that for our RoBERTa-large model, we used a special set of pre-trained weights that were built via an additional pre-training stage on the RACE dataset (Lai et al., 2017).