AdvEntuRe: Adversarial Training for Textual Entailment with Knowledge-Guided Examples

Dongyeop Kang
Tushar Khot
Ashish Sabharwal
E. Hovy
ACL
2018

Abstract

We consider the problem of learning textual entailment models with limited supervision (5K-10K training examples), and present two complementary approaches for it. First, we propose knowledge-guided adversarial example generators for incorporating large lexical resources in entailment models via only a handful of rule templates. Second, to make the entailment model—a discriminator—more robust, we propose the first GAN-style approach for training it using a natural language example generator that iteratively adjusts to the discriminator’s weaknesses. We demonstrate effectiveness using two entailment datasets, where the proposed methods increase accuracy by 4.7% on SciTail and by 2.8% on a 1% sub-sample of SNLI. Notably, even a single hand-written rule, negate, improves the accuracy of negation examples in SNLI by 6.1%.

1 Introduction

The impressive success of machine learning models on large natural language datasets often does not carry over to moderate training data regimes, where models often struggle with infrequently observed patterns and simple adversarial variations. A prominent example of this phenomenon is textual entailment, the fundamental task of deciding whether a premise text entails (⊑) a hypothesis text. On certain datasets, recent deep learning entailment systems (Parikh et al., 2016; Gong et al., 2018) have achieved close to human level performance. Nevertheless, the problem is far from solved, as evidenced by how easy it is to generate minor adversarial examples that break even the best systems. As Table 1 illustrates, a state-of-the-art neural system for this task, namely the Decomposable Attention Model (Parikh et al., 2016), fails when faced with simple linguistic phenomena such as negation, or a re-ordering of words. This is not unique to a particular model or task. Minor adversarial examples have also been found to easily break neural systems on other linguistic tasks such as reading comprehension (Jia and Liang, 2017).

Table 1: Failure examples from the SNLI dataset: negation (Top) and re-ordering (Bottom). P is premise, H is hypothesis, and S is prediction made by an entailment system (Parikh et al., 2016).

P: The dog did not eat all of the chickens.
H: The dog ate all of the chickens.
S: entails (score 56.5%)

P: The red box is in the blue box.
H: The blue box is in the red box.
S: entails (score 92.1%)

A key contributor to this brittleness is the use of specific datasets such as SNLI (Bowman et al., 2015) and SQuAD (Rajpurkar et al., 2016) to drive model development. While large and challenging, these datasets also tend to be homogeneous. For example, SNLI was created by asking crowd-sourced workers to generate entailing sentences, which then tend to have limited linguistic variations and annotation artifacts (Gururangan et al., 2018). Consequently, models overfit to sufficiently repetitive patterns (and sometimes idiosyncrasies) in the datasets they are trained on. They fail to cover long-tail and rare patterns in the training distribution, or linguistic phenomena such as negation that would be obvious to a layperson.

To address this challenge, we propose to train textual entailment models more robustly using adversarial examples generated in two ways: (a) by incorporating knowledge from large linguistic resources, and (b) using a sequence-to-sequence neural model in a GAN-style framework.

The motivation stems from the following observation. While deep-learning based textual entailment models lead the pack, they generally do not incorporate intuitive rules such as negation, and ignore large-scale linguistic resources such as PPDB (Ganitkevitch et al., 2013) and WordNet (Miller, 1995). These resources could help them generalize beyond specific words observed during training. For instance, while the SNLI dataset contains the pattern two men ⊑ people, it does not contain the analogous pattern two dogs ⊑ animals found easily in WordNet.

Effectively integrating simple rules or linguistic resources in a deep learning model, however, is challenging. Doing so directly by substantially adapting the model architecture (Sha et al., 2016; Chen et al., 2018) can be cumbersome and limiting. Incorporating such knowledge indirectly via modified word embeddings (Faruqui et al., 2015; Mrkšić et al., 2016), as we show, can have little positive impact and can even be detrimental.

Our proposed method, which is task-specific but model-independent, is inspired by data-augmentation techniques. We generate new training examples by applying knowledge-guided rules, via only a handful of rule templates, to the original training examples. Simultaneously, we also use a sequence-to-sequence or seq2seq model for each entailment class to generate new hypotheses from a given premise, adaptively creating new adversarial examples. These can be used with any entailment model without constraining model architecture.

We also introduce the first approach to train a robust entailment model using a Generative Adversarial Network or GAN (Goodfellow et al., 2014) style framework. We iteratively improve both the entailment system (the discriminator) and the differentiable part of the data-augmenter (specifically the neural generator), by training the generator based on the discriminator's performance on the generated examples. Importantly, unlike the typical use of GANs to create a strong generator, we use it as a mechanism to create a strong and robust discriminator.

Our new entailment system, called AdvEntuRe, demonstrates that in the moderate data regime, adversarial iterative data-augmentation via only a handful of linguistic rule templates can be surprisingly powerful. Specifically, we observe 4.7% accuracy improvement on the challenging SciTail dataset (Khot et al., 2018) and a 2.8% improvement on 10K-50K training subsets of SNLI. An evaluation of our algorithm on the negation examples in the test set of SNLI reveals a 6.1% improvement from just a single rule.

2 Related Work

Adversarial example generation has recently received much attention in NLP. For example, Jia and Liang (2017) generate adversarial examples using manually defined templates for the SQuAD reading comprehension task. Glockner et al. (2018) create an adversarial dataset from SNLI by using WordNet knowledge. Automatic methods have also been proposed to generate adversarial examples through paraphrasing. These works reveal how neural network systems trained on a large corpus can easily break when faced with carefully designed unseen adversarial patterns at test time. Our motivation is different. We use adversarial examples at training time, in a data augmentation setting, to train a more robust entailment discriminator. The generator uses explicit knowledge or hand-written rules, and is trained in an end-to-end fashion along with the discriminator.

Incorporating external rules or linguistic resources in a deep learning model generally requires substantially adapting the model architecture (Sha et al., 2016; Kang et al., 2017). This is a model-dependent approach, which can be cumbersome and constraining. Similarly, non-neural textual entailment models have been developed that incorporate knowledge bases. However, these also require model-specific engineering (Raina et al., 2005; Silva et al., 2018).

An alternative is the model- and task-independent route of incorporating linguistic resources via word embeddings that are retro-fitted (Faruqui et al., 2015) or counter-fitted (Mrkšić et al., 2016) to such resources. We demonstrate, however, that this has little positive impact in our setting and can even be detrimental. Further, it is unclear how to incorporate knowledge sources into advanced representations such as contextual embeddings (McCann et al., 2017; Peters et al., 2018). We thus focus on a task-specific but model-independent approach. Logical rules have also been defined to label existing examples based on external resources (Hu et al., 2016). Our focus here is on generating new training examples.

Our use of the GAN framework to create a better discriminator is related to CatGANs (Wang and Zhang, 2017) and TripleGANs (Chongxuan et al., 2017) where the discriminator is trained to classify the original training image classes as well as a new 'fake' image class. We, on the other hand, generate examples belonging to the same classes as the training examples. Further, unlike the earlier focus on the vision domain, this is the first approach to train a discriminator using GANs for a natural language task with discrete outputs.

3 Adversarial Example Generation

We present three different techniques to create adversarial examples for textual entailment. Specifically, we show how external knowledge resources, hand-authored rules, and neural language generation models can be used to generate such examples. Before describing these generators in detail, we introduce the notation used henceforth.

We use lower-case letters for single instances (e.g., x, p, h), upper-case letters for sets of instances (e.g., X, P, H), blackboard bold for models (e.g., D), and calligraphic symbols for discrete spaces of possible values (e.g., class labels C). For the textual entailment task, we assume each example is represented as a triple (p, h, c), where p is a premise (a natural language sentence), h is a hypothesis, and c is an entailment label: (a) entails (⊑) if h is true whenever p is true; (b) contradicts (⋏) if h is false whenever p is true; or (c) neutral (#) if the truth value of h cannot be concluded from p being true. 1

We will introduce various example generators in the rest of this section. Each such generator, G ρ , is defined by a partial function f ρ and a label g ρ . If a sentence s has a certain property required by f ρ (e.g., contains a particular string), f ρ transforms it into another sentence s' and g ρ provides an entailment label from s to s'. Applied to a sentence s, G ρ thus either "fails" (if the pre-requisite isn't met) or generates a new entailment example triple, (s, f ρ (s), g ρ ). For instance, consider the generator for ρ := hypernym(car, vehicle) with the (partial) transformation function f ρ := "Replace car with vehicle" and the label g ρ := entails. f ρ would fail when applied to a sentence not containing the word "car". Applying f ρ to the sentence s = "A man is driving the car" would generate s' = "A man is driving the vehicle", creating the example (s, s', entails).

1 The symbols are based on Natural Logic (Lakoff, 1970) and use the notation of MacCartney and Manning (2012).
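The generator definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `Generator`, `replace_word`, and `apply_generator` are hypothetical, sentences are split on whitespace, and a failed partial function is represented by `None`.

```python
from typing import Callable, NamedTuple, Optional, Tuple


class Generator(NamedTuple):
    """A generator G_rho: a partial transformation f_rho plus a fixed label g_rho."""
    transform: Callable[[str], Optional[str]]  # returns None when f_rho does not apply
    label: str


def replace_word(x: str, y: str) -> Callable[[str], Optional[str]]:
    """Build the partial function 'Replace x with y' over whitespace tokens."""
    def f(sentence: str) -> Optional[str]:
        tokens = sentence.split()
        if x not in tokens:
            return None  # pre-requisite not met: the generator "fails"
        return " ".join(y if tok == x else tok for tok in tokens)
    return f


def apply_generator(gen: Generator, sentence: str) -> Optional[Tuple[str, str, str]]:
    """Return the example triple (s, f_rho(s), g_rho), or None on failure."""
    transformed = gen.transform(sentence)
    if transformed is None:
        return None
    return (sentence, transformed, gen.label)


# The hypernym(car, vehicle) example from the text.
hypernym_car_vehicle = Generator(replace_word("car", "vehicle"), "entails")
```

Keeping the transformation and the label separate mirrors the (f ρ , g ρ ) pair in the text, so the same wrapper could hold a lexical replacement, a negation rule, or a neural model.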

Table 2: Various generators Gρ characterized by their source, (partial) transformation function fρ as applied to a sentence s, and entailment label gρ

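| Generator | Source | ρ | f ρ (s) | g ρ |
|---|---|---|---|---|
| Knowledge Base, G KB | WordNet | hyper(x, y) | Replace x with y in s | ⊑ |
| | WordNet | anto(x, y) | Replace x with y in s | ⋏ |
| | WordNet | syno(x, y) | Replace x with y in s | ⊑ |
| | PPDB | x ≡ y | Replace x with y in s | ⊑ |
| | SICK | c(x, y) | Replace x with y in s | c |
| Hand-authored, G H | Domain knowledge | neg | negate(s) | ⋏ |
| Neural Model, G s2s | Training data | (s2s, c) | G s2s c (s) | c |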

The seven generators we use for experimentation are summarized in Table 2 and discussed in more detail subsequently. While these particular generators are simplistic and one can easily imagine more advanced ones, we show that training using adversarial examples created using even these simple generators leads to substantial accuracy improvement on two datasets.

3.1 Knowledge-Guided Generators

Large knowledge-bases such as WordNet and PPDB contain lexical equivalences and other relationships highly relevant for entailment models. However, even large datasets such as SNLI generally do not contain most of these relationships in the training data. E.g., that two dogs entails animals isn't captured in the SNLI data. We define simple generators based on lexical resources to create adversarial examples that capture the underlying knowledge. This allows models trained on these examples to learn these relationships.

As discussed earlier, there are different ways of incorporating such symbolic knowledge into neural models. Unlike task-agnostic ways of approaching this goal from a word embedding perspective (Faruqui et al., 2015; Mrkšić et al., 2016) or the model-specific approach (Sha et al., 2016; Chen et al., 2018) , we use this knowledge to generate task-specific examples. This allows any entailment model to learn how to use these relationships in the context of the entailment task, helping them outperform the above task-agnostic alternative.

Our knowledge-guided example generators, G KB ρ , use lexical relations available in a knowledge-base: ρ := r(x, y) where the relation r (such as synonym, hypernym, etc.) may differ across knowledge bases. We use a simple (partial) transformation function, f ρ (s):="Replace x in s with y", as described in an earlier example. In some cases, when part-of-speech (POS) tags are available, the partial function requires the tags for x in s and in r(x, y) to match. The entailment label g ρ for the resulting examples is also defined based on the relation r, as summarized in Table 2 .

This idea is similar to Natural Logic Inference or NLI (Lakoff, 1970; Sommers, 1982; Angeli and Manning, 2014) where words in a sentence can be replaced by their hypernym/hyponym to produce entailing/neutral sentences, depending on their context. We propose a context-agnostic use of lexical resources that, despite its simplicity, already results in significant gains. We use three sources for generators:

WordNet (Miller, 1995) is a large, hand-curated semantic lexicon with synonymous words grouped into synsets. Synsets are connected by many semantic relations, from which we use hyponym and synonym relations to generate entailing sentences, and antonym relations to generate contradicting sentences. Given a relation r(x, y), the (partial) transformation function f ρ is the POS-tag matched replacement of x in s with y, and requires the POS tag to be noun or verb. NLI provides a more robust way of using these relations based on context, which we leave for future work.

PPDB (Ganitkevitch et al., 2013) is a large resource of lexical, phrasal, and syntactic paraphrases. We use 24,273 lexical paraphrases in their smallest set, PPDB-S (Pavlick et al., 2015), as equivalence relations, x ≡ y. The (partial) transformation function f ρ for this generator is POS-tag matched replacement of x in s with y, and the label g ρ is entails.

SICK (Marelli et al., 2014) is a dataset with entailment examples of the form (p, h, c), created to evaluate an entailment model's ability to capture compositional knowledge via hand-authored rules. We use the 12,508 patterns of the form c(x, y) extracted by Beltagy et al. (2016) by comparing sentences in this dataset, with the property that for each SICK example (p, h, c), replacing (when applicable) x with y in p produces h. For simplicity, we ignore positional information in these patterns. The (partial) transformation function f ρ is replacement of x in s with y, and the label g ρ is c.

3.2 Hand-Defined Generators

Even very large entailment datasets have no or very few examples of certain otherwise common linguistic constructs such as negation, causing models trained on them to struggle with these constructs. A simple model-agnostic way to alleviate this issue is via a negation example generator whose transformation function f ρ (s) is negate(s), described below, and the label g ρ is contradicts.

negate(s): If s contains a 'be' verb (e.g., is, was), add a "not" after the verb. If not, add a "not" together with a "did" or "do" in front of the verb, chosen based on its tense. E.g., change "A person is crossing" to "A person is not crossing" and "A person crossed" to "A person did not cross." While many other rules could be added, we found that this single rule covered a majority of the cases. Verb tenses are also considered and changed accordingly. Other functions such as dropping adverbial clauses or changing tenses could be defined in a similar manner.
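A toy version of negate(s) might look like the sketch below. It is only an illustration under strong assumptions: the text does not say how verbs and tenses are detected, so a tiny hand-written lookup table (`TOY_VERBS`, hypothetical) stands in for a real POS tagger and lemmatizer, and sentences are split on whitespace.

```python
from typing import Optional

BE_VERBS = {"is", "are", "was", "were", "am"}

# Hypothetical toy lexicon: inflected verb -> (base form, auxiliary by tense).
# A real system would use a POS tagger and lemmatizer instead.
TOY_VERBS = {
    "crossed": ("cross", "did"),
    "ate": ("eat", "did"),
    "cross": ("cross", "do"),
    "eat": ("eat", "do"),
}


def negate(sentence: str) -> Optional[str]:
    """Negate a sentence with the single rule from the text, or return None if it does not apply."""
    tokens = sentence.split()
    # Case 1: a 'be' verb is present, so insert "not" right after it.
    for i, tok in enumerate(tokens):
        if tok.lower() in BE_VERBS:
            return " ".join(tokens[: i + 1] + ["not"] + tokens[i + 1:])
    # Case 2: no 'be' verb, so put "did not" / "do not" in front of the base verb.
    for i, tok in enumerate(tokens):
        if tok.lower() in TOY_VERBS:
            base, aux = TOY_VERBS[tok.lower()]
            return " ".join(tokens[:i] + [aux, "not", base] + tokens[i + 1:])
    return None  # the partial function fails
```

Each successful call yields a (s, negate(s), contradicts) example when wrapped in a generator with label contradicts.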

Both the knowledge-guided and hand-defined generators make local changes to the sentences based on simple rules. It should be possible to extend the hand-defined rules to cover the long tail (as long as they are procedurally definable). However, a more scalable approach would be to extend our generators to trainable models that can cover a wider range of phenomena than hand-defined rules. Moreover, the applicability of these rules generally depends on the context which can also be incorporated in such trainable generators.

3.3 Neural Generators

For each entailment class c, we use a trainable sequence-to-sequence neural model (Sutskever et al., 2014; Luong et al., 2015) to generate an entailment example (s, s', c) from an input sentence s. The seq2seq model, trained on examples labeled c, itself acts as the transformation function f ρ of the corresponding generator G s2s c . The label g ρ is set to c. The joint probability of the seq2seq model is:

EQUATION (2): Not extracted; please refer to original document.

The loss function for training the seq2seq is:

EQUATION (3): Not extracted; please refer to original document.

where L is the cross-entropy loss between the original hypothesis H c and the predicted hypothesis. Cross-entropy is computed for each predicted word w i against the corresponding word in H c , given the sequence of previous words in H c . φ̂ c are the optimal parameters in G s2s c that minimize the loss for class c. We use the single most likely output to generate sentences in order to reduce decoding time.
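Since the equations themselves were not extracted, the following is only a hedged sketch of an objective that matches the verbal description (per-word cross-entropy against the original hypothesis, conditioned on the premise and the previous gold words). It is a standard teacher-forced formulation, not a reproduction of the paper's Equation (3). The symbols X_c (training pairs labeled c) and w_{<i} (the gold words before position i) are notation assumed here.

```latex
\hat{\phi}_c \;=\; \operatorname*{arg\,min}_{\phi_c}
  \sum_{(p,\,h) \in X_c} \; \sum_{i=1}^{|h|}
  -\log P\big(w_i \mid w_{<i},\, p;\ \phi_c\big)
```

Here w_i is the i-th word of the original hypothesis h, so each term penalizes the model for assigning low probability to the gold word at that position.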

3.4 Example Generation

The generators described above are used to create new entailment examples from the training data. For each example (p, h, c) in the data, we can create two new examples: (p, f ρ (p), g ρ ) and (h, f ρ (h), g ρ ).

The examples generated this way using G KB and G H can, however, be relatively easy, as the premise and hypothesis would differ by only a word or so. We therefore compose such simple ("first-order") generated examples with the original input example to create more challenging "second-order" examples. We can create second-order examples by composing the original example (p, h, c) with a generated sentence from the hypothesis, f ρ (h), and from the premise, f ρ (p). Figure 1 depicts how these two kinds of examples are generated from an input example (p, h, c).

Figure 1: Generating first-order (blue) and second-order (red) examples.

First, we consider the second-order example between the original premise and the transformed hypothesis: (p, f ρ (h), ⊕(c, g ρ )), where ⊕, defined in the left half of Table 3, composes the input example label c (connecting p and h) and the generated example label g ρ to produce a new label. For instance, if p entails h and h entails f ρ (h), p would entail f ρ (h). In other words, ⊕(⊑, ⊑) is ⊑. For example, composing ("A man is playing soccer", "A man is playing a game", ⊑) with a generated hypothesis f ρ (h): "A person is playing a game." will give a new second-order entailment example: ("A man is playing soccer", "A person is playing a game", ⊑).

Table 3: Entailment label composition functions ⊕ (left) and ⊗ (right) for creating second-order examples. c and g ρ are the original and generated labels, resp. ⊑: entails, ⋏: contradicts, #: neutral, ?: undefined. Left columns: p ⇒ h (c), h ⇒ h' (g ρ ), p ⇒ h' (⊕). Right columns: p ⇒ h (c), p ⇒ p' (g ρ ), p' ⇒ h (⊗). Table body not extracted; please refer to original document.

Second, we create an example from the generated premise to the original hypothesis: (f ρ (p), h, ⊗(g ρ , c)). The composition function here, denoted ⊗ and defined in the right half of Table 3, is often undetermined. For example, if p entails f ρ (p) and p entails h, the relation between f ρ (p) and h is undetermined, i.e., ⊗(⊑, ⊑) = ?. While this particular composition often leads to undetermined or neutral relations, we use it here for completeness. For example, composing the previous example with a generated neutral premise, f ρ (p): "A person is wearing a cap" would generate an example ("A person is wearing a cap", "A man is playing a game", #).

The composition function ⊕ is the same as the "join" operation in natural logic reasoning (Icard III and Moss, 2014), except for two differences: (a) relations that do not belong to our three entailment classes are mapped to '?', and (b) the exclusivity/alternation relation is mapped to contradicts. The composition function ⊗, on the other hand, does not map to the join operation.
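The first- and second-order construction described in this section can be sketched as follows. The function and variable names are hypothetical, and the composition tables are passed in as lookups because the full Table 3 is not available here. The demo tables contain only the two entries the text states explicitly. Skipping undefined ('?') or missing compositions is an assumption of this sketch.

```python
from typing import Callable, Dict, List, Optional, Tuple

Example = Tuple[str, str, str]      # (premise, hypothesis, label)
Table = Dict[Tuple[str, str], str]  # label composition lookup


def generate_examples(example: Example,
                      f: Callable[[str], Optional[str]],
                      g: str,
                      compose_left: Table,
                      compose_right: Table) -> List[Example]:
    """Create first-order and second-order examples from one (p, h, c)."""
    p, h, c = example
    out: List[Example] = []
    fp, fh = f(p), f(h)
    if fp is not None:
        out.append((p, fp, g))                    # first-order, from the premise
        label = compose_right.get((g, c), "?")    # right table: (g, c)
        if label != "?":
            out.append((fp, h, label))            # second-order: (f(p), h)
    if fh is not None:
        out.append((h, fh, g))                    # first-order, from the hypothesis
        label = compose_left.get((c, g), "?")     # left table: (c, g)
        if label != "?":
            out.append((p, fh, label))            # second-order: (p, f(h))
    return out


def man_to_person(sentence: str) -> Optional[str]:
    """Toy partial function: replace 'man' with 'person' (labelled entails)."""
    tokens = sentence.split()
    if "man" not in tokens:
        return None
    return " ".join("person" if t == "man" else t for t in tokens)


# Only the entries stated in the text; all other cells are left out.
LEFT = {("entails", "entails"): "entails"}
RIGHT = {("entails", "entails"): "?"}

examples = generate_examples(
    ("A man is playing soccer", "A man is playing a game", "entails"),
    man_to_person, "entails", LEFT, RIGHT)
```

With these inputs the result includes the second-order example from the text, pairing "A man is playing soccer" with "A person is playing a game" under the label entails.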

3.5 Implementation Details

Given the original training examples X, we generate the examples from each premise and hypothesis in a batch using G KB and G H . We also generate a new hypothesis per class for each premise using G s2s c . Using all the generated examples to train the model would, however, overwhelm the original training set. For example, our knowledge-guided generators G KB can be applied in 17,258,314 different ways.

To avoid this, we sub-sample our synthetic examples to ensure that they are proportional to the input examples X; specifically, they are bounded to α|X|, where α is tuned for each dataset. Also, as seen in Table 3, our knowledge-guided generators are more likely to generate neutral examples than any other class. To make sure that the labels are not skewed, we also sub-sample the examples to ensure that our generated examples have the same class distribution as the input batch. The SciTail dataset only contains two classes: entails, mapped to ⊑, and neutral, mapped to #. As a result, generated examples that do not belong to these two classes are ignored.

The sub-sampling, however, has a negative side-effect where our generated examples end up using a small number of lexical relations from the large knowledge bases. On moderate datasets, this would cause the entailment model to potentially just memorize these few lexical relations. Hence, we generate new entailment examples for each mini-batch and update the model parameters based on the training+generated examples in this batch.
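The per-batch sub-sampling described above could be implemented along these lines. This is a sketch under assumptions: the function name `subsample` is hypothetical, the per-class quota is rounded down, and generated examples whose label does not occur in the batch are dropped (which also covers the SciTail case).

```python
import random
from collections import Counter, defaultdict
from typing import List, Sequence, Tuple

Example = Tuple[str, str, str]  # (premise, hypothesis, label)


def subsample(generated: Sequence[Example],
              batch_labels: Sequence[str],
              alpha: float,
              rng: random.Random) -> List[Example]:
    """Pick generated examples so the class mix follows the batch and |Z| <= alpha * |X|."""
    by_label = defaultdict(list)
    for ex in generated:
        by_label[ex[2]].append(ex)
    selected: List[Example] = []
    for label, count in Counter(batch_labels).items():
        quota = int(alpha * count)            # per-class budget, rounded down
        pool = by_label.get(label, [])        # labels absent from the batch are never picked
        selected.extend(rng.sample(pool, min(quota, len(pool))))
    return selected
```

Because the quotas are computed from the current batch, calling this once per mini-batch draws a fresh random subset each time, which is the point of regenerating examples per batch rather than once up front.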

The overall example generation procedure goes as follows. For each mini-batch X: (1) randomly choose 3 applicable rules per source and sentence (e.g., replacing men with people based on PPDB in the premise is one rule); (2) produce examples Z all using G KB , G H and G s2s ; (3) randomly sub-select examples Z from Z all to ensure the balance between classes and |Z| = α|X|.

4 AdvEntuRe

Figure 2 shows the complete architecture of our model, AdvEntuRe (ADVersarial training for textual ENTailment Using Rule-based Examples). The entailment model D is shown with the white box and two proposed generators are shown using black boxes. We combine the two symbolic untrained generators, G KB and G H , into a single G rule model. We combine the generated adversarial examples Z with the original training examples X to train the discriminator. Next, we describe how the individual models are trained and finally present our new approach to train the generator based on the discriminator's performance.

Figure 2: Overview of AdvEntuRe, our model for knowledge-guided textual entailment.

4.1 Discriminator Training

We use one of the state-of-the-art entailment models (at the time of its publication) on SNLI, the decomposable attention model (Parikh et al., 2016) with intra-sentence attention, as our discriminator D. The model attends to each word in the hypothesis with each word in the premise, compares each pair of the attentions, and then aggregates them as a final representation. This discriminator model can be easily replaced with any other entailment model without any other change to the AdvEntuRe architecture. We pre-train our discriminator D on the original dataset, X=(P, H, C) using:

EQUATION (5): Not extracted; please refer to original document.

where L is the cross-entropy loss function between the true labels, Y, and the predicted classes, and θ̂ are the learned parameters.
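The pre-training equation was not extracted. As a hedged sketch only, an objective consistent with the notation used elsewhere in this paper, namely the expression L(C; D(X + Z G ; θ)) in Section 4.3 and the argmin in Algorithm 1, would be the following. Writing the true labels as C (from X = (P, H, C)) rather than Y is an assumption of this sketch.

```latex
\hat{\theta} \;=\; \operatorname*{arg\,min}_{\theta} \; L\big(C;\ D(P, H;\ \theta)\big)
```

This is the same form as the adversarial-phase discriminator update, restricted to the original data with no generated examples.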

4.2 Generator Training

Our knowledge-guided and hand-defined generators are symbolic parameter-less methods which are not currently trained. For simplicity, we will refer to the set of symbolic rule-based generators as G rule := G KB ∪ G H . The neural generator G s2s , on the other hand, can be trained as described earlier. We leave the training of the symbolic models for future work.

4.3 Adversarial Training

We now present our approach to iteratively train the discriminator and generator in a GAN-style framework. Unlike traditional GAN (Goodfellow et al., 2014) on image/text generation that aims to obtain better generators, our goal is to build a robust discriminator regularized by the generators (G s2s and G rule ). The discriminator and generator are iteratively trained against each other to achieve better discrimination on the augmented data from the generator and better example generation against the learned discriminator. Algorithm 1 shows our training procedure.

Algorithm 1: Training procedure for AdvEntuRe.

    pretrain discriminator D(θ) on X
    pretrain generators G s2s c (φ) on X
    for number of training iterations do
        for mini-batch B ← X do
            generate examples from G:
                Z G ⇐ G(B; φ)
            balance X and Z G s.t. |Z G | ≤ α|X|
            optimize discriminator:
                θ̂ = argmin θ L D (X + Z G ; θ)
            optimize generator:
                φ̂ = argmin φ L G s2s (Z G ; L D ; φ)
            update θ ← θ̂; φ ← φ̂

First, we pre-train the discriminator D and the seq2seq generators G s2s on the original data X. We alternate the training of the discriminator and generators over K iterations (set to 30 in our experiments).

For each iteration, we take a mini-batch B from our original data X. For each mini-batch, we generate new entailment examples, Z G using our adversarial examples generator. Once we collect all the generated examples, we balance the examples based on their source and label (as described in Section 3.5). In each training iteration, we optimize the discriminator against the augmented training data, X +Z G and use the discriminator loss to guide the generator to pick challenging examples. For every mini-batch of examples X +Z G , we compute the discriminator loss L(C; D(X + Z G ; θ)) and apply the negative of this loss to each word of the generated sentence in G s2s . In other words, the discriminator loss value replaces the cross-entropy loss used to train the seq2seq model (similar to a REINFORCE (Williams, 1992) reward). This basic approach uses the loss over the entire batch to update the generator, ignoring whether specific examples were hard or easy for the discriminator. Instead, one could update the generator per example based on the discriminator's loss on that example. We leave this for future work.
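The alternating procedure described above can be sketched as a plain training loop. This is a skeleton under assumptions, not the authors' code: the discriminator update, generator update, generation, and balancing steps are passed in as callables with hypothetical names, and the generator simply receives the negative of the batch-level discriminator loss as its training signal, following the description in the text.

```python
from typing import Callable, List, Sequence


def adversarial_training(batches: Sequence[Sequence],
                         generate: Callable[[Sequence], List],
                         balance: Callable[[Sequence, List], List],
                         update_discriminator: Callable[[List], float],
                         update_generator: Callable[[List, float], None],
                         num_iterations: int) -> List[float]:
    """Alternate discriminator and generator updates; return the discriminator losses."""
    losses: List[float] = []
    for _ in range(num_iterations):
        for batch in batches:
            # Generate examples from the current generators, then balance them.
            z = balance(batch, generate(batch))
            # Optimize the discriminator on original + generated examples.
            d_loss = update_discriminator(list(batch) + list(z))
            # Guide the generator with the negative of the discriminator loss.
            update_generator(z, -d_loss)
            losses.append(d_loss)
    return losses
```

In a real setup `update_discriminator` would run one optimizer step of the entailment model and return its loss, and `update_generator` would apply the signal to the seq2seq parameters. Both are left abstract here because the text does not specify them further.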

5 Experiments

Our empirical assessment focuses on two key questions: (a) Can a handful of rule templates improve a state-of-the-art entailment system, especially with moderate amounts of training data? (b) Can iterative GAN-style training lead to an improved discriminator?

To this end, we assess various models on the two entailment datasets mentioned earlier: SNLI (570K examples) and SciTail (27K examples). To test our hypothesis that adversarial example based training prevents overfitting in small to moderate training data regimes, we compare model accuracies on the test sets when using 1%, 10%, 50%, and 100% subsamples of the train and dev sets.

We consider two baseline models: D, the Decomposable Attention model (Parikh et al., 2016) with intra-sentence attention using pre-trained word embeddings (Pennington et al., 2014); and D retro , which extends D with word embeddings initialized by retrofitted vectors (Faruqui et al., 2015). The vectors are retrofitted on PPDB, WordNet, FrameNet, and all of these, with the best results for each dataset reported here. Our proposed model, AdvEntuRe, is evaluated in three flavors: D augmented with examples generated by G rule , G s2s , or both, where G rule = G KB ∪ G H . In the first two cases, we create new examples for each batch in every epoch using a fixed generator (cf. Section 3.5). In the third case (D + G rule + G s2s ), we use the GAN-style training.

We use grid search to find the best hyper-parameters for D based on the validation set: hidden size 200 for the LSTM layer, embedding size 300, dropout ratio 0.2, and fine-tuned embeddings.

The ratio between the number of generated vs. original examples, α, is empirically chosen to be 1.0 for SNLI and 0.5 for SciTail, based on validation set performance. Generally, very few generated examples (small α) have little impact, while too many of them overwhelm the original dataset, resulting in worse scores (cf. Appendix for more details). Table 4 summarizes the test set accuracies of the different models using various subsampling ratios for SNLI and SciTail training data.

Table 4: Test accuracies with different subsampling ratios on SNLI (top) and SciTail (bottom).

5.1 Main Results

We make a few observations. First, D retro is ineffective or even detrimental in most cases, except on SciTail when 1% (235 examples) or 10% (2.3K examples) of the training data is used. The gain in these two cases is likely because retrofitted lexical rules help when training data is extremely scarce, but not as the data size increases.

On the other hand, our method always achieves the best result compared to the baselines (D and D retro ). In particular, significant improvements are made in the low-data settings: +2.77% on SNLI (1%) and +9.18% on SciTail (1%). Moreover, the accuracy of D + G rule on SciTail (100%) also outperforms the previous state-of-the-art model for that dataset (DGEM (Khot et al., 2018), which achieves 77.3%) by 1.7%. Among the three different generators combined with D, both G rule and G s2s are useful on SciTail, while G rule is much more useful than G s2s on SNLI. We hypothesize that a seq2seq model trained on large training sets such as SNLI will be able to reproduce the input sentences. Adversarial examples from such a model are not useful since the entailment model uses the same training examples. However, on smaller sets, the seq2seq model would introduce noise that can improve the robustness of the model.

5.2 Ablation Study

To evaluate the impact of each generator, we perform ablation tests against each symbolic generator in D + G rule and the generator G s2s c for each entailment class c. We use a 5% sample of SNLI and a 10% sample of SciTail. The results are summarized in Table 5 .

Table 5: Test accuracies across various rules R and classes C. Since SciTail has two classes, we only report results on two classes of Gs2s

Interestingly, while PPDB (phrasal paraphrases) helps the most (+3.6%) on SNLI, simple negation rules help significantly (+8.2%) on the SciTail dataset. Since most entailment examples in SNLI are minor rewrites by Turkers, PPDB often contains these simple paraphrases. For SciTail, the sentences are authored independently, with limited gains from simple paraphrasing. However, a model trained on only 10% of the dataset (2.3K examples) would end up relying purely on word overlap. We believe that the simple negation examples introduce neutral examples with high lexical overlap, forcing the model to find a more informative signal.

On the other hand, using all classes for G s2s results in the best performance, supporting the effectiveness of the GAN framework for penalizing or rewarding generated sentences based on D's loss. Preferential selection of rules within the GAN framework remains a promising direction.

Table 6 shows examples generated by various methods in AdvEntuRe. As shown, both seq2seq and rule-based generators produce reasonable sentences according to classes and rules. As expected, seq2seq models trained on very few examples generate noisy sentences. The quality of our knowledge-guided generators, on the other hand, does not depend on the training set size, and they still produce reliable sentences.

Table 6: Given a premise P (underlined), examples of hypothesis sentences H’ generated by seq2seq generators Gs2s, and premise sentences P’ generated by rule based generators Grule, on the full SNLI data. Replaced words or phrases are shown in bold. This illustrates that even simple, easy-to-define rules can generate useful adversarial examples.

a person is in a blue dog is in a park
P (or H): a dirt bike rider catches some air going off a large hill
P': G KB(PPDB), ρ = ≡: a dirt motorcycle rider catches some air going off a large hill
P': G KB(SICK), ρ = c, g ρ = #: a dirt bike man on yellow bike catches some air going off a large hill
P': G KB(WordNet), ρ = syno: a dirt bike rider catches some atmosphere going off a large hill
P': G Hand, ρ = neg: a dirt bike rider do not catch some air going off a large hill
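The word-level substitutions in Table 6 can be illustrated with a minimal sketch. It assumes a hypothetical two-entry lexicon (`TOY_LEXICON`) and a hypothetical helper (`generate_variants`); neither is the authors' implementation, and the real resources (PPDB, WordNet, SICK) are much larger and are not loaded here. The sketch only rewrites the sentence and does not assign an entailment label to the new pair.

```python
from typing import Dict, List, Tuple

# Hypothetical mini-lexicon: word -> (replacement, relation name).
# The two entries mirror substitutions visible in Table 6; this is an
# illustrative assumption, not the lexical resources used in the paper.
TOY_LEXICON: Dict[str, Tuple[str, str]] = {
    "bike": ("motorcycle", "PPDB:equivalent"),
    "air": ("atmosphere", "WordNet:synonym"),
}


def generate_variants(
    sentence: str, lexicon: Dict[str, Tuple[str, str]]
) -> List[Tuple[str, str]]:
    """Return (new_sentence, relation) pairs, replacing one word at a time."""
    tokens = sentence.split()
    variants = []
    for i, token in enumerate(tokens):
        if token in lexicon:
            replacement, relation = lexicon[token]
            # Swap only the token at position i; all other tokens are kept.
            new_tokens = tokens[:i] + [replacement] + tokens[i + 1:]
            variants.append((" ".join(new_tokens), relation))
    return variants


premise = "a dirt bike rider catches some air going off a large hill"
for text, relation in generate_variants(premise, TOY_LEXICON):
    print(f"{relation}: {text}")
```

Replacing one word per variant keeps each generated sentence close to the original, which matches the single-substitution examples in the table.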

5.4 Case Study: Negation

For further analysis of the negation-based generator in Table 1, we collect only the negation examples in the test set of SNLI, henceforth referred to as nega-SNLI. Specifically, we extract examples where either the premise or the hypothesis contains "not", "no", "never", or a word that ends with "n't". These triggers do not cover more subtle ways of expressing negation, such as "seldom" or the use of antonyms. nega-SNLI contains 201 examples with the following label distribution: 51 (25.4%) neutral, 42 (20.9%) entails, 108 (53.7%) contradicts. Table 7 shows examples in each category.
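The extraction rule described above can be sketched as a simple filter. The names `NEGATION_PATTERN`, `has_negation`, and `filter_negation` are hypothetical, the three toy examples are invented for illustration and are not SNLI data, and the tokenization details (case-insensitive matching, ASCII apostrophes only) are assumptions rather than the authors' exact procedure.

```python
import re
from typing import List, Tuple

# Triggers listed in the text: "not", "no", "never", or a word ending in "n't".
# Assumption: matching is case-insensitive and only ASCII apostrophes occur.
NEGATION_PATTERN = re.compile(r"\b(?:not|no|never|\w*n't)\b", re.IGNORECASE)

Example = Tuple[str, str, str]  # (premise, hypothesis, label)


def has_negation(sentence: str) -> bool:
    """True if the sentence contains one of the negation triggers."""
    return NEGATION_PATTERN.search(sentence) is not None


def filter_negation(examples: List[Example]) -> List[Example]:
    """Keep examples whose premise or hypothesis contains a trigger."""
    return [ex for ex in examples if has_negation(ex[0]) or has_negation(ex[1])]


# Invented toy examples, for illustration only.
toy_examples: List[Example] = [
    ("a man is sleeping", "the man isn't awake", "entails"),
    ("a dog runs in a park", "a dog is outside", "entails"),
    ("no one is on the street", "the street is crowded", "contradicts"),
]
print(filter_negation(toy_examples))
```

Word boundaries keep words such as "nothing" or "note" from matching, consistent with the statement that subtler forms of negation are not covered.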

Table 7. Not extracted; please refer to original document.

6 Conclusion

We introduced an adversarial training architecture for textual entailment. Our seq2seq and knowledge-guided example generators, trained in an end-to-end fashion, can be used to make any base entailment model more robust. The effectiveness of this approach is demonstrated by the significant improvement it achieves on both SNLI and SciTail, especially in the low to medium data regimes. Our rule-based generators can be expanded to cover more patterns and phenomena, and the seq2seq generator extended to incorporate per-example loss for adversarial training.

Table 8 shows the number of rules and additional examples for G KB.

Figure 3 shows training (dotted) accuracies on sub-sampled training datasets and testing (solid) accuracies on the original test dataset X test of D over different sub-sampling percentages of the training set. Since SciTail (27K) is much smaller than SNLI (570K), SciTail fluctuates a lot at smaller sub-samples, while SNLI converges with just 50% of the examples.

Figure 4 shows test accuracies with different balancing ratios α (x-axis) from 0.5, 1.0, ..., 3.0, where |z| = α * |x| and |x| is fixed as the batch size. The generated examples z are useful up to a point, but the performance quickly degrades for α > 1.0 as they overwhelm the original dataset x.

Table 9 shows the grid search results of retrofitting vectors (Faruqui et al., 2015) with different lexical resources. To obtain the strongest baseline, we choose the best performing vectors for each sub-sample ratio and each dataset. Usually, PPDB and WordNet are the two most useful resources for both SNLI and SciTail.

E In-Depth Analysis: D+R

Figures 5 and 6 show a more in-depth analysis with different sub-sampling ratios on SNLI and SciTail. The dotted line is training accuracy, and the solid red (D + G rule) and solid black (D) lines show testing accuracies.

Figure 6: D +Grule with different ratio for SNLI.

Table 9: Results of the word vectors retrofitted on different lexicons on each dataset. We pick the best vectors for each task and sub-sampling ratio.

A similar approach was used in a parallel work to generate an adversarial dataset from SNLI (Glockner et al., 2018).

Only 211 examples (2.11%) in the SNLI training set contain negation triggers such as not, n't, etc.

https://www.nodebox.net/code/index.php/Linguistics

SNLI has a 96.4%/1.7%/1.7% split and SciTail has an 87.3%/4.8%/7.8% split on train, valid, and test sets, respectively.

This is much less than the full test accuracy of 84.52%.

Figure 4: Effect of balancing ratio between z and x.
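As a rough illustration of the balancing ratio, the sketch below mixes a batch x with about α * |x| generated examples z. The function `mix_batch`, the uniform sampling of generated examples, and the truncation of α * |x| to an integer are all assumptions made for illustration; the text only states that |z| = α * |x| with |x| fixed as the batch size.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def mix_batch(
    original: Sequence[T], generated: Sequence[T], alpha: float, seed: int = 0
) -> List[T]:
    """Combine a batch x with roughly alpha * |x| generated examples z."""
    # Assumption: truncate alpha * |x| to an integer and cap it at the
    # number of generated examples that are actually available.
    num_generated = min(len(generated), int(alpha * len(original)))
    rng = random.Random(seed)
    # Assumption: generated examples are sampled uniformly without replacement.
    sampled = rng.sample(list(generated), num_generated)
    return list(original) + sampled


x = ["x1", "x2", "x3", "x4"]            # original batch, |x| = 4
z_pool = [f"z{i}" for i in range(20)]   # pool of generated examples
for alpha in (0.5, 1.0, 3.0):
    batch = mix_batch(x, z_pool, alpha)
    print(alpha, len(batch))
```

With |x| = 4, α = 3.0 already yields three times as many generated examples as original ones, which gives a sense of how generated examples could overwhelm the original data at large α.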

Figure 5: D +Grule with different ratio for SciTail.