Learning to Rationalize for Nonmonotonic Reasoning with Distant Supervision



The black-box nature of neural models has motivated a line of research that aims to generate natural language rationales to explain why a model made certain predictions. Such rationale generation models, to date, have been trained on dataset-specific crowdsourced rationales, but this approach is costly and is not generalizable to new tasks and domains. In this paper, we investigate the extent to which neural models can reason about natural language rationales that explain model predictions, relying only on distant supervision with no additional annotation cost for human-written rationales. We investigate multiple ways to automatically generate rationales using pre-trained language models, neural knowledge models, and distant supervision from related tasks, and train generative models capable of composing explanatory rationales for unseen instances. We demonstrate our approach on the defeasible inference task, a nonmonotonic reasoning task in which an inference may be strengthened or weakened when new information (an update) is introduced. Our model shows promise in generating post-hoc rationales explaining why an inference is more or less likely given the additional information. However, it mostly generates trivial rationales, reflecting the fundamental limitations of neural language models. Conversely, the more realistic setup of jointly predicting the update or its type and generating a rationale is more challenging, suggesting an important future direction.

1 Introduction

Deep neural models perform increasingly well across NLP tasks, but due to their black-box nature, their success comes at the cost of our understanding of the system. The lack of transparency about why a model made a particular prediction may, among other problems, introduce fairness issues (Dodge et al. 2019), and hide the fact that a model is often right for the wrong reasons due to learning dataset-specific shortcuts and annotation artifacts (Gururangan et al. 2018; Poliak et al. 2018). There is growing interest in NLP in opening the black box, through surrogate models (Ribeiro, Singh, and Guestrin 2016), counterfactual evaluation (Tenney et al. 2020), examining the inner structure of the neural network (Raffel et al. 2017; Jain et al. 2020), or generating natural language explanations. We focus on the latter approach. Recent work by Camburu et al. (2018) and Rajani et al. (2019) collected human-written explanations for the natural language inference (NLI; Bowman et al. 2015) and commonsense question answering (CommonSenseQA; Talmor et al. 2019) tasks and trained models to predict explanations for new instances. Such supervision is not always accessible, is expensive to obtain, and is unlikely to generalize well across datasets.

[Figure 1 content. Premise: "A group of people sitting around a rectangular table having either pieces of paper or laptops in front of them." Hypothesis: "They have a work meeting." Updates: "They are in a conference room." / "They are in a library." Rationale: "A conference room is where people have meetings at work."]

In this work, we explore learning to rationalize using a distant supervision approach without additional annotation cost. We focus on the Defeasible Inference task (δ-NLI; Rudinger et al. 2020) , illustrated in Figure 1 . Given premise and hypothesis sentences, and an update sentence, the goal of the (discriminative) δ-NLI task is to recognize whether the update weakens or strengthens the entailment of the hypothesis by the premise. For example, the update that "the people are in a conference room" strengthens the hypothesis that "they are in a work meeting". An alternative (generative) task is to generate the update given the premise, hypothesis, and update type (strengthener or weakener).

Figure 1: An illustration of the NLI, δ-NLI (Defeasible NLI), and δ-NLI Rationale Generation tasks.

We present the Defeasible Inference Rationale Generation task, with the goal of generating natural language rationales that explain why a hypothesis is more likely after learning about a strengthener update and less likely after learning about a weakener update. To that end, we create the e-δ-NLI dataset by augmenting the δ-NLI dataset with rationales from various sources, including pre-trained language models, knowledge bases, and supervision from a related task. We then train two types of language model-based rationale generation models: post-hoc models that generate a rationale given access to the target values (i.e., the update or update type); and joint models that jointly generate the target value along with the rationale. The overall workflow of our approach is shown in Figure 2.

Figure 2: The complete training process: (1) collecting rationales from various sources, (2) Keeping the top k most helpful rationales; (3) training a generative model. During inference, we apply the generative model directly to the inputs.

We evaluate the models with both automatic and human evaluations. The results of the post-hoc models are promising, with most generated rationales considered relevant and factually correct, and 40% on average considered explanatory. In line with prior work by Kumar and Talukdar (2020), further analysis revealed that models trained to post-hoc rationalize develop strategies to trivially map the target value to one of the several patterns associated with it in the training data, such as "the update implies that hypothesis". We consider the joint setup, in which the model has no access to the target value, to be more realistic. In this challenging setup, which hinders the models' ability to learn trivial shortcuts, performance is worse, warranting future research in this direction. 1

2 Background

Natural Language Inference. Recognizing Textual Entailment (RTE; Dagan et al. 2013), or, in its newer variant, Natural Language Inference (NLI; Bowman et al. 2015) , is defined as a 3-way classification task. Given a premise sentence P and a hypothesis sentence H, the goal is to determine whether P entails, contradicts, or is neutral with H. P is said to entail H if a human reading P would typically infer that H is most likely true. H is neutral if it could be but is not necessarily true given P.

In recent years, several large-scale datasets for the task have been released (e.g. Williams, Nangia, and Bowman 2018; Nie et al. 2020), encouraging training neural models. We focus on the Stanford Natural Language Inference dataset (SNLI; Bowman et al. 2015), in which image captions serve as premises, and hypotheses were crowdsourced.

Explainable NLI. Since deep learning has become the dominant paradigm in NLP research, efforts have been devoted to opening the "black-box" and interpreting neural models' predictions. One approach looks into the model's weights and traces back salient spans from the input that affected the prediction. The attention mechanism (Bahdanau, Cho, and Bengio 2015), which is popular across NLP models, facilitates this through the attention weights (Raffel et al. 2017; Jain et al. 2020). However, whether or not attention weights provide reliable insights into the model's decision-making process is still debatable (Serrano and Smith 2019; Jain and Wallace 2019; Wiegreffe and Pinter 2019).

An alternative approach is to generate natural language explanations for the model's decisions. This is typically done by training a model on free-form human explanations (Camburu et al. 2018; Rajani et al. 2019; Wang et al. 2019; Zellers et al. 2019). However, such supervision is not always available, and is costly to obtain. To that end, we propose a distant supervision approach that requires no additional supervision. Among other data sources, we leverage the e-SNLI dataset (Camburu et al. 2018), in which premise-hypothesis pairs from SNLI have been augmented with human-written explanations for the gold labels.

There are several setups for interpretation methods: (i) ante-hoc: generating the rationale from the input, and providing it to the decision-making model with the input (Lei, Barzilay, and Jaakkola 2016; Bastings, Aziz, and Titov 2019; Kumar and Talukdar 2020) or without it (Jain et al. 2020); (ii) joint: generating the rationale and the label jointly (Narang et al. 2020); and (iii) post-hoc: generating a rationale given the input and the gold or predicted label. The motivation for the first approach is to produce faithful rationales, i.e. rationales representing the model's true decision process. However, there is no guarantee that the decision-making model actually uses the rationales. Moreover, in some cases the selected rationale is not sufficient to make the prediction without the input (Wiegreffe, Marasović, and Smith 2020), while in others, label-specific rationale templates may make the label prediction trivial given the rationale (Kumar and Talukdar 2020). We focus on the latter two approaches: joint and post-hoc, while acknowledging that our rationales are not constructed to be faithful. 2

Defeasible Inference. Defeasible reasoning is a nonmonotonic logic in which valid inferences can become invalid when new information is introduced. For example, "Tweety is a bird" entails that "Tweety flies" unless provided with additional information such as "Tweety is a penguin" (Reiter 1980). Despite being a fundamental mode of human reasoning, modern NLP research paid little attention to nonmonotonic reasoning (e.g. Qin et al. 2019; Bhagavatula et al. 2019). Recently, Rudinger et al. (2020) coupled defeasible reasoning with natural language inference by adding an update sentence U to the premise P and hypothesis H. Expanding the traditional definition, U may either weaken or strengthen H.

Two defeasible inference (δ-NLI) tasks were introduced: discriminative defeasible inference, in which given P, H, and U, the goal is to classify the update as either weakener or strengthener (update type, T); and generative defeasible inference, in which given P, H, and T the goal is to generate an update U with the required type. The dataset for these tasks was built by crowdsourcing update sentences for neutral sentence-pairs from existing NLI datasets. Specifically, we use the SNLI portion of their data.

Unsupervised Knowledge Extraction from Pre-trained LMs. Pre-trained Language Models (LMs) based on the neural transformer architecture (Vaswani et al. 2017), such as GPT2 (Radford et al. 2019), have greatly improved the performance on NLP tasks that require world knowledge and commonsense reasoning. While the best practice is to fine-tune the LM, they may also be used in an unsupervised manner. Petroni et al. (2019) and Davison, Feldman, and Rush (2019) completed commonsense knowledge bases (KB) by converting triplets into free-form text and predicting or scoring the target concept. Tamborrino et al. (2020) leveraged masked LMs to score the plausibility of answer choices in multiple-choice commonsense question answering (QA) tasks. Shwartz et al. (2020) used LMs to generate information-seeking clarification questions (e.g. "What is the definition of...") and their answers for providing relevant knowledge for commonsense QA tasks, which yielded similar performance gains to models utilizing KBs. Similarly, Latcinnik and Berant (2020) used LMs to generate a textual hypothesis which was used by the answer scorer of a multiple choice QA task.

3 e-δ-NLI Dataset

We now describe e-δ-NLI (Explanations for Defeasible NLI). We augmented the δ-NLI dataset described in Section 2 with rationales that explain why a hypothesis is more likely after learning about a strengthener update and less likely after learning about a weakener update. Rather than eliciting rationales from humans, we take a distant supervision approach and gather rationales from various sources, as exemplified in Table 1 and described below.

Table 1: Examples of rationales generated from each of the sources. W stands for a weakener update and S for strengthener.

3.1 Collecting Rationales

Certain spans in the inputs H and U are highly salient for classifying the update type in the discriminative δ-NLI task. We hypothesize that these same spans will be salient for the task of generating rationales. Therefore we use the δ-NLI update type classifier and score each token in the input by its attention weight from the token in the final layer, and extract the set of top 20% non-continuous spans with respect to that score, denoted as S. For example, in Table 1 , the most salient spans are highlighted in orange (hypothesis) and green (update). We use the following sources to extract or generate rationales.
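The span selection step can be pictured with a small sketch. It assumes the per-token attention scores are already available as a plain list (the numbers below are made up for illustration), and it reads "top 20% non-continuous spans" as: take the highest-scoring fraction of tokens, then merge neighbouring picks into spans. The function name `top_salient_spans` is hypothetical and not taken from the released code.

```python
from typing import List, Tuple


def top_salient_spans(tokens: List[str], scores: List[float],
                      fraction: float = 0.2) -> List[Tuple[int, int]]:
    """Return (start, end) ranges (end exclusive) covering the highest-scoring
    `fraction` of tokens, merging adjacent picks into a single span."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not tokens:
        return []
    k = max(1, round(len(tokens) * fraction))
    # Indices of the k highest-scoring tokens, restored to sentence order.
    by_score = sorted(range(len(tokens)), key=lambda i: -scores[i])
    top = sorted(by_score[:k])
    spans = []
    start = prev = top[0]
    for i in top[1:]:
        if i != prev + 1:  # a gap closes the current span
            spans.append((start, prev + 1))
            start = i
        prev = i
    spans.append((start, prev + 1))
    return spans


# Illustrative scores only; real scores would come from the classifier's attention.
tokens = "They are on a busy Manhattan sidewalk selling hotdogs .".split()
scores = [0.01, 0.01, 0.02, 0.01, 0.05, 0.30, 0.25, 0.10, 0.24, 0.01]
spans = top_salient_spans(tokens, scores, fraction=0.3)
print([" ".join(tokens[s:e]) for s, e in spans])
```

With these toy scores and a 30% budget, the selected tokens form two spans, one of which covers two adjacent words.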

Vanilla LM. We generate two types of rationales: definitions and purposes for single spans, and relationships for a pair of spans. We use SpaCy (Honnibal and Montani 2017) to keep only the grammatical salient spans S_G ⊆ S by filtering out stop words and keeping both the entire (noun or verb) phrase and its head for each span. Following Shwartz et al. (2020), we prompt the LM with "[context]. The definition of np is" for each noun phrase in S_G, and "[context]. The purpose of vp is" for verb phrases in S_G. We set the context to the concatenation of premise and hypothesis (P + H) when the target phrase is in the hypothesis, and to P + U when it is from the update.

In addition, we generate the relationship between pairs of spans. We take the top 3 most similar pairs of s_u (subset of S_G originated from U) and s_h (subset of S_G originated from H), judged by the cosine similarity between their word2vec embeddings (Mikolov et al. 2013). 3 We prompt the LM with "P + U + H. The relationship between s_u and s_h is that".
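As a rough illustration of the prompting scheme, the sketch below builds the three prompt types and ranks update/hypothesis span pairs by cosine similarity. The function names and the two-dimensional toy vectors are assumptions made for the example; real word2vec embeddings would be looked up instead, and multi-word spans (which the paper handles via maximum word-level similarity) are not covered here.

```python
import math
from typing import Dict, List, Sequence, Tuple


def definition_prompt(context: str, noun_phrase: str) -> str:
    # `context` is expected to end with sentence-final punctuation already.
    return f"{context} The definition of {noun_phrase} is"


def purpose_prompt(context: str, verb_phrase: str) -> str:
    return f"{context} The purpose of {verb_phrase} is"


def relationship_prompt(p: str, u: str, h: str, s_u: str, s_h: str) -> str:
    return f"{p} {u} {h} The relationship between {s_u} and {s_h} is that"


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_pairs(update_spans: List[str], hypothesis_spans: List[str],
              vectors: Dict[str, Sequence[float]], k: int = 3) -> List[Tuple[str, str]]:
    """Return the k most similar (s_u, s_h) pairs under cosine similarity."""
    scored = [(cosine(vectors[su], vectors[sh]), su, sh)
              for su in update_spans for sh in hypothesis_spans]
    scored.sort(key=lambda item: -item[0])
    return [(su, sh) for _, su, sh in scored[:k]]


# Toy 2-d vectors standing in for word2vec embeddings (illustrative only).
toy_vectors = {"rope": [1.0, 0.2], "mountain": [0.1, 1.0], "climbing": [0.9, 0.4]}
best = top_pairs(["rope", "mountain"], ["climbing"], toy_vectors, k=1)
s_u, s_h = best[0]
print(relationship_prompt("A person climbs a foggy mountain.",
                          "The person is attached to a rope.",
                          "A person is rock climbing.", s_u, s_h))
```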

We use GPT2-M (Radford et al. 2019) via the Transformers package (Wolf et al. 2019). We limit the rationale length to up to 12 tokens, and use Nucleus sampling (Holtzman et al. 2020) with p = 0.35, and temperature = 1.0 to generate at most 20 rationales for each prompt. 4

Knowledge-Enhanced LM. To further instill commonsense knowledge into the LM, we follow Guan et al. (2020) and continue pre-training GPT2-M on triplets from ConceptNet (Speer, Chin, and Havasi 2017) converted to natural language using the templates from Davison, Feldman, and Rush (2019). For example, (a glass of milk, UsedFor, drinking) is converted to "A glass of milk is used for drinking". We train the LM on the transformed triplets for 2 epochs. We then use the LM as previously detailed to generate definitions, purposes, and relationships. We use Nucleus sampling with p = 0.5, temperature = 0.7, and generate up to 5 rationales for each prompt.
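To make the role of the p parameter concrete, here is a minimal sketch of the truncation step in Nucleus (top-p) sampling, written over a toy next-token distribution rather than a real LM. The helper name `nucleus_filter` and the probabilities are illustrative assumptions; in practice the decoding library applies this step internally at every generation step, after temperature scaling.

```python
from typing import Dict


def nucleus_filter(probs: Dict[str, float], p: float) -> Dict[str, float]:
    """Keep the smallest set of most-probable tokens whose cumulative
    probability reaches p, then renormalize over that set."""
    kept: Dict[str, float] = {}
    cumulative = 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {token: prob / total for token, prob in kept.items()}


# Toy next-token distribution (illustrative only).
next_token = {"a": 0.5, "the": 0.25, "to": 0.125, "of": 0.125}
narrow = nucleus_filter(next_token, p=0.35)  # small p: very few candidates survive
wide = nucleus_filter(next_token, p=0.75)    # larger p: more candidates survive
print(narrow, wide)
```

A small p such as 0.35 restricts sampling to the head of the distribution, which favours conservative continuations; a larger p admits more of the tail.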

Table 1 contents (one block per rationale source):

Vanilla LM
  P: [...] pedestrians walking down street filled with vendors and umbrella carts.
  H: The vendors are there for the weekly farmer's market.
  W: They are on a busy Manhattan sidewalk selling hotdogs.
  Rationale: The relationship between "a busy Manhattan sidewalk selling hotdogs" and "weekly farmer's market" is that they both exist in tandem, but not necessarily together.

KG-Enhanced LM
  P: A person wearing red and white climbs a foggy mountain.
  H: A person is rock climbing.
  S: The person is attached to a rope going up the side of the mountain.
  Rationale: The purpose of "rock climbing" is to reach a high place.
  Rationale: The relationship between "rope" and "climbing" is that rope has property used to climb.

COMeT
  P: A baby boy in an elmo chair with lots of toys in the background.
  H: The baby boy in the elmo chair is happy.
  W: The baby boy's mom is wiping tears from his eyes.
  Rationale (H precondition): The baby boy is seen as joyful.
  Rationale (U postconditions): As a result, boy's mom feels to console.

NLI-derived
  P: The brown dog catches a ball in the air.
  H: The dog plays with the ball outside.
  S: The ball skips into the bushes.
  Rationale: Catching a ball in the air implies that the dog plays with the ball.
  Rationale: Bushes are outside.

NLI-derived w/ Highlights
  P: A woman wearing [...] and sunglasses, walks through a shopping outlet.
  H: The woman is buying goods.
  S: The woman is carrying shopping bags.
  Rationale: If a woman is carrying bags, then she is buying goods.

COMeT. COMeT (Bosselut et al. 2019) is an LM-based knowledge base completion model. We use the model trained on ATOMIC (Sap et al. 2019), a commonsense KB consisting of if-then triplets concerning everyday situations, along multiple dimensions. We generate the postconditions following the update (xWant, xEffect, xReact, xAttr, oWant, oEffect, oReact) and the preconditions that lead to the hypothesis (xNeed, xIntent, xAttr). We use beam search with beam size of 5 as the decoding strategy, keeping the entire beam, and replace PersonX with the syntactic subject of the input sentence.

NLI-derived. We repurpose a model for the related task of NLI rationale generation for our task of rationale generation for δ-NLI. To that end, we reproduced the WT5 model suggested by Narang et al. (2020) . The model is based on the T5 encoder-decoder language model (Raffel et al. 2020) , and is trained on the e-SNLI dataset (Camburu et al. 2018) to jointly generate the label (entailment, contradiction) and the rationale for a given premise and hypothesis pair. More concretely, the input consists of the task prefix and the inputs (explain nli premise: P hypothesis: H) while the expected output is label explanation: R. During inference, we set the premise to P+ U, i.e., treating the update as part of the premise, and provide it to the model along with H. The model generates the binary entailment label (excluding neutral) between P+ U and H, and the rationale that explains the label. 5
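The text-to-text format described above can be sketched as two small helpers: one that folds the update into the premise and builds the model input, and one that splits a generated string into label and rationale. The helper names are hypothetical, and the parsing assumes the output follows the "label explanation: R" pattern exactly; a real model may occasionally deviate from it.

```python
from typing import Tuple


def wt5_input(premise: str, update: str, hypothesis: str) -> str:
    """Treat the update as part of the premise (P + U) and add the task prefix."""
    merged_premise = f"{premise} {update}"
    return f"explain nli premise: {merged_premise} hypothesis: {hypothesis}"


def parse_wt5_output(text: str) -> Tuple[str, str]:
    """Split 'label explanation: R' into (label, R); R is empty if the marker is missing."""
    label, marker, rationale = text.partition(" explanation: ")
    if not marker:
        return text.strip(), ""
    return label.strip(), rationale.strip()


model_input = wt5_input("The brown dog catches a ball in the air.",
                        "The ball skips into the bushes.",
                        "The dog plays with the ball outside.")
label, rationale = parse_wt5_output("entailment explanation: Bushes are outside.")
print(model_input)
print(label, "|", rationale)
```

Only the rationale part is kept as distant supervision; the predicted label is discarded, as noted in the accompanying footnote.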

NLI-derived with Highlights. Each instance in e-SNLI highlights salient spans in the input that the annotators considered helpful for explaining the label. We train a variant of the T5-based e-SNLI model that gets (only) the highlighted words as input and outputs the label and the rationale. We then generate rationales for the δ-NLI dataset by applying the model to salient spans in S_G that originated in U or H.

3.2 Filtering Rationales

Following the collection step, each instance in the δ-NLI dataset is now augmented with a list of candidate rationales explaining its label (update type). To further improve the quality of this distant supervision, we rank and keep the best rationales. In particular, we would like to keep the rationales that are most helpful for predicting the label. Ideally, we would want to train a δ-NLI classifier that gets P, H, U, and the rationale as input and outputs the update type. However, this causes a circular problem because we don't yet know which rationales are reliable.

Table 2: The different training setups we experiment with. We add special tokens to mark the boundaries of each input and output span, e.g. [premise] marks the beginning of the premise.

Hence, again we use e-SNLI as a proxy. We train a classifier on e-SNLI that gets the premise, hypothesis, and rationale as inputs and predicts the entailment label (entailment, contradiction). Specifically, we fine-tune a binary RoBERTa classifier (Liu et al. 2019) with the following input format: P R H. For a δ-NLI instance (P, H, T, U) with a set of candidate rationales $\{R_i\}_{i=1}^{N_R}$ (of various sources), we compute: o = NLI(P + U  R_i  H), where o is a 2-dimensional vector representing the confidence of the classifier in each label. We score each rationale by the confidence assigned to the label associated with its update type: strengtheners as entailment and weakeners as contradiction 6, and rank the rationales accordingly. We keep the top 10% ranked rationales for each instance, yielding 8 rationales per instance on average.
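The ranking and filtering logic can be sketched independently of the classifier. The example below assumes the proxy NLI classifier has already produced a confidence per label for each candidate (the numbers are made up), and keeps a fixed percentage of the best-scoring rationales. The function name, the integer `keep_percent` parameter, and the choice to always keep at least one rationale are assumptions for illustration.

```python
from typing import Dict, List, Tuple

# Mapping used for scoring: the label whose confidence counts for each update type.
LABEL_FOR_UPDATE_TYPE = {"strengthener": "entailment", "weakener": "contradiction"}


def filter_rationales(scored: List[Tuple[str, Dict[str, float]]],
                      update_type: str, keep_percent: int = 10) -> List[str]:
    """Rank rationales by the confidence of the label tied to the update type
    and keep the top `keep_percent` percent (at least one)."""
    target = LABEL_FOR_UPDATE_TYPE[update_type]
    ranked = sorted(scored, key=lambda item: -item[1][target])
    k = max(1, len(ranked) * keep_percent // 100)  # integer arithmetic avoids float rounding
    return [rationale for rationale, _ in ranked[:k]]


# Hypothetical classifier confidences for three candidate rationales.
candidates = [
    ("r1", {"entailment": 0.9, "contradiction": 0.1}),
    ("r2", {"entailment": 0.2, "contradiction": 0.8}),
    ("r3", {"entailment": 0.5, "contradiction": 0.5}),
]
print(filter_rationales(candidates, "weakener"))
print(filter_rationales(candidates, "strengthener", keep_percent=67))
```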

We follow the original split for train (80%), test (10%), and development (10%) sets. By augmenting the data with multiple rationales per original δ-NLI instance, the final e-δ-NLI dataset consists of 731,579 training, 15,781 test, and 15,527 development instances. Figure 3 shows the percentage of each rationale source in the dataset and the percentage of rationales that remained after filtering from each source.

Figure 3: Top: Percentage of each source among the rationales in the final e-δ-NLI dataset. Bottom: Percentage of rationales that remained after filtering from each source.

4 Rationale Generation Model

We use the e-δ-NLI dataset to train various generative models with the goal of generating rationales that explain why a hypothesis is more likely after learning about a strengthener update and less likely after learning about a weakener.

Every instance in the e-δ-NLI dataset consists of a premise P, hypothesis H, update type T, update U, and a set of rationales $\{R_i\}_{i=1}^{N_R}$. During training, we treat every (P, H, T, U, R) for $R \in \{R_i\}_{i=1}^{N_R}$ as a separate instance.

4.1 Architecture And Implementation Details

We fine-tune transformer-based pre-trained LMs on the e-δ-NLI dataset. Specifically, we use GPT2-XL (Radford et al. 2019) and Bart-L (Lewis et al. 2020). 7 We use the Transformers package (Wolf et al. 2019), training each model for two epochs with a batch size of 8 (GPT2) or 128 (Bart) on a Quadro RTX 8000 GPU machine.

4.2 Training Objective

We minimize the negative conditional log-likelihood of the output given the input:

$$\mathcal{L} = -\sum_{i=1}^{n} \log p\left(x^{out}_i \mid x^{out}_{<i}, x^{in}\right)$$

where n is the length of the output sequence. In particular, for GPT2, which is a standard LM, the loss is computed over the entire sequence $[x^{in}; x^{out}]$, whereas in Bart, which is an encoder-decoder model, the loss is computed only over the output sequence, $x^{out}$.
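The difference between the two loss computations can be illustrated with plain numbers. The sketch below assumes per-token log-probabilities are already given (the values are invented); it is not the training code, only a picture of which tokens contribute to the sum in each case.

```python
from typing import Sequence


def negative_log_likelihood(token_logprobs: Sequence[float]) -> float:
    """Sum of -log p over the tokens that are included in the loss."""
    return -sum(token_logprobs)


# Invented log-probabilities for a 2-token input and a 3-token output.
input_logprobs = [-2.0, -1.0]
output_logprobs = [-0.5, -0.25, -0.25]

# Decoder-only LM (GPT2-style): every token of [x_in; x_out] contributes.
lm_loss = negative_log_likelihood(input_logprobs + output_logprobs)

# Encoder-decoder (Bart-style): only the output tokens contribute.
seq2seq_loss = negative_log_likelihood(output_logprobs)

print(lm_loss, seq2seq_loss)
```

One consequence is that the decoder-only model also spends capacity on modelling the input fields, while the encoder-decoder objective is focused on the generated span alone.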

We experiment with various training setups described in Table 2 . Our setups can be divided into two categories. The first category is Post-hoc Rationalization, in which the model has access to the target values (i.e., update and update type) and is required to explain it. Our main task in this category is Rationale Generation (1). It is formulated as generating a rationale conditioned on the premise, hypothesis, update, and update type. Similarly, we can generate each of the update type (2) and update (3) given all other fields. These two setups are orthogonal to our goal, but we combine them with (1) in a multi-task setup (4) where we expect them to improve the model's generalizability (Shwartz and Dagan 2018; Zellers et al. 2019) and improve the performance on the main task. The second and more realistic category is Joint Prediction and Rationalization, in which the model jointly predicts either update type (5) or update (6) along with an explanation.
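The input/output layout of the setups described above can be sketched as follows. Only the [premise] marker is named in the paper; the other special tokens, the setup names, and the exact field order are assumptions made for this illustration, and the multi-task setup (4) is modelled simply as the union of setups (1) to (3).

```python
from typing import Dict, List, Tuple

# Special tokens marking each field; all but "[premise]" are hypothetical.
MARKERS = {"P": "[premise]", "H": "[hypo]", "U": "[update]",
           "T": "[type]", "R": "[rationale]"}

# (input fields, output fields) for each setup.
SETUPS: Dict[str, Tuple[List[str], List[str]]] = {
    "rationale": (["P", "H", "U", "T"], ["R"]),        # (1) post-hoc
    "update_type": (["P", "H", "U", "R"], ["T"]),      # (2) post-hoc
    "update": (["P", "H", "T", "R"], ["U"]),           # (3) post-hoc
    "joint_update_type": (["P", "H", "U"], ["T", "R"]),  # (5) joint
    "joint_update": (["P", "H", "T"], ["U", "R"]),       # (6) joint
}


def render(instance: Dict[str, str], fields: List[str]) -> str:
    return " ".join(f"{MARKERS[f]} {instance[f]}" for f in fields)


def build_example(instance: Dict[str, str], setup: str) -> Tuple[str, str]:
    inputs, outputs = SETUPS[setup]
    return render(instance, inputs), render(instance, outputs)


def multi_task_examples(instance: Dict[str, str]) -> List[Tuple[str, str]]:
    """Setup (4): train on setups (1)-(3) together."""
    return [build_example(instance, name)
            for name in ("rationale", "update_type", "update")]


instance = {"P": "p", "H": "h", "U": "u", "T": "strengthener", "R": "r"}
print(build_example(instance, "joint_update_type"))
```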

5 Results

For each combination of rationale generation training setup, we generated a rationale for each instance in the test set using beam search with 5 beams. We evaluated the generated rationales both in terms of automatic metrics and human evaluation. The results are shown in Table 3 .

Table 3: Automatic and human evaluation of rationale generation for the test set. Human evaluation results are presented for strengtheners and weakeners separately (S/W).

5.1 Automatic Evaluation

We used standard n-gram overlap metrics: the precision-oriented BLEU score (Papineni et al. 2002) and recall-oriented ROUGE score (Lin 2004). Specifically, we used BLEU-4 that measures overlap of n-grams up to n = 4, and ROUGE-L that measures longest matching sequences, and compared multiple predictions against multiple distantly supervised rationales as references. The results of the automatic measures are reported in Table 3. In general, GPT2-based models achieve better automatic scores. We also observe additive gain using the multi-task setup on both BLEU and ROUGE scores.
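For intuition about ROUGE-L with multiple references, here is a simplified sketch based on the longest common subsequence (LCS). It uses whitespace tokenization and a plain F1 combination of LCS precision and recall, taking the best score over the references; the official ROUGE implementation differs in tokenization and in how precision and recall are weighted, so the numbers are not comparable to Table 3.

```python
from typing import List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def best_rouge_l(candidate: str, references: List[str]) -> float:
    """Score a candidate against several references and keep the best match."""
    return max(rouge_l_f1(candidate, ref) for ref in references)


score = best_rouge_l("the dog plays", ["a cat sleeps", "the dog plays outside"])
print(round(score, 3))
```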

5.2 Human Evaluation

Since automatic metrics have demonstrated low correlation with human judgments across various NLG tasks (Novikova et al. 2017), and because our automatic metrics only evaluate the generated rationales against the distantly supervised rationales (in place of human-written references), we also conduct a more reliable evaluation using human judges on Amazon Mechanical Turk. We sampled 200 instances, along with a generated rationale for each model. Following Shwartz et al. (2020), we asked workers to determine whether a rationale was 1) grammatical, not entirely grammatical but understandable, or completely not understandable; 2) relevant to the instance (P, H, and U); 3) factually correct or likely true; and 4) explanatory of the update type (i.e. why the strengthener makes the hypothesis more likely or the weakener makes it less likely). To ensure the quality of annotations, we required that the workers be located in the US, UK, or Canada, and have a 99% approval rate for at least 5,000 prior tasks. We aggregated annotations from 3 workers using majority vote. The annotations yielded fair levels of agreement, with Fleiss' Kappa (Landis and Koch 1977) between κ = 0.22 for relevance and κ = 0.37 for being explanatory. We analyze the results from the following perspectives:
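The two aggregation statistics mentioned above, majority vote and Fleiss' Kappa, can be sketched in a few lines. The input layout (one row of per-category counts per item, with the same number of raters for every item) and the toy annotations are assumptions for the example; the code is a textbook implementation, not the authors' evaluation script.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(labels: Sequence[str]) -> str:
    """Most frequent label among the annotators of one item."""
    return Counter(labels).most_common(1)[0][0]


def fleiss_kappa(counts: List[List[int]]) -> float:
    """counts[i][j] = number of raters assigning item i to category j.
    Every row must sum to the same number of raters (at least 2)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    # Observed agreement per item, then averaged over items.
    per_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
                for row in counts]
    observed = sum(per_item) / n_items
    # Chance agreement from the overall category proportions.
    proportions = [sum(row[j] for row in counts) / (n_items * n_raters)
                   for j in range(n_categories)]
    expected = sum(p * p for p in proportions)
    if expected == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (observed - expected) / (1.0 - expected)


# Toy annotations: 3 workers, 2 categories (e.g. explanatory / not explanatory).
perfect = fleiss_kappa([[3, 0], [0, 3], [3, 0], [0, 3]])
mixed = fleiss_kappa([[2, 1], [1, 2]])
print(majority_vote(["yes", "no", "yes"]), perfect, round(mixed, 3))
```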

Best Setup. Across models, most rationales are grammatical or understandable (83%-99%). The best performance is achieved by Rationale BART-L, for which 80% of the rationales were considered relevant, over 55% correct, and between 33% (weakeners) and 47% (strengtheners) explanatory. In general, better rationales are generated for strengtheners than for weakeners.

LM and Objective. The multi-task setup did not improve the rationale generation performance. Among the post-hoc rationalization category, Bart-based models substantially outperformed GPT2-based models.

Post-hoc vs. Joint. In the post-hoc rationalization setups, access to the target values (update more than update type) yielded more explanatory rationales (Expl. score in Table 3), but as discussed in Section 6.2, they are often trivial. The joint setup proved to be extremely challenging, with only 0.5%-8.5% of the rationales considered explanatory.

6.1 Quality Of The Distant Supervision

We study the quality of rationales in the e-δ-NLI dataset through human evaluation. We repeated the same crowdsourcing setup, this time evaluating the distantly supervised rationales (i.e. after filtering) of 100 random instances. Table 4 shows that the quality of the training data is surprisingly worse than that of the generated rationales. Specifically, rationales originating from LMs are often judged as incorrect and non-explanatory, largely due to statements such as "The definition of s is s". Conversely, NLI-derived rationales are identified as the most explanatory ones, in agreement with our filtering step, which kept the highest percentage of NLI-derived rationales (58%). As we show in Section 6.2, most generated rationales are in the format of e-SNLI rationales, which might explain the discrepancy between the quality of the generated rationales and that of the training data (in which only 7.1% of the rationales are NLI-derived rationales).

Table 4: Human evaluation for the distant supervision rationales in the test set. Results (percents) are presented for strengtheners and weakeners separately (S/W).

6.2 Quality Of Generated Rationales

We manually analyzed the rationales generated by the best model (Rationale Bart-L) that were considered grammatical, relevant, and correct by humans.

(1) P: Four individuals are sitting on a small dock by the water as a boat sails by. H: Four people sitting near the ocean. W: They're in Egypt. R: Before, four people needed to go to the beach.

(2) P: Two men in orange uniforms stand before a train and do some work. H: Tall humans working. S: The men can easily touch the top of the train with their hands. R: The men can [...] train with their hands implies that they are working.

(3) P: A cyclist dressed in black and white is pointing. H: A cyclist dressed in black and white points towards the sky. W: A man asked the cyclist which building is the bank. R: Before, a cyclist needed to go to the store.

Table 5: Patterns of rationales generated by Rationale Bart-L that were considered explanatory. H, S, and W stand for Hypothesis, Strengthener and Weakener.
Table 6: Examples for the common error types.
Table 7: Percent of rationales with each error type.

Explanatory. We analyzed the 160 rationales that were considered "explanatory" (94 for strengtheners and 66 for weakeners), and found that almost all of them fit into one of several patterns of rationales that are trivial to generate given the target value (update type). These patterns are displayed in Table 5 . We see this as further motivation to focus on the joint setup in future research.

Non-Explanatory. We sampled and analyzed 100 rationales that were annotated as "non-explanatory" by workers (50-50 for strengtheners and weakeners). We found the following common types of errors, and categorized each rationale into one or more categories. The result is shown in Table 7, and exemplified in Table 6.

(1) Insufficient: providing one of several required reasoning hops.
(2) Incorrect implications: following one of the templates in Table 5, but not making sense.
(3) Incorrect post/pre-conditions: involving wrong inferences about the post-conditions of U or the preconditions of H.
(4) Partially correct: following a pattern in Table 5, incorrectly using part of U or H. For example in Table 6, "the group is on vacation is a rephrasing of resort", instead of rephrasing of "they are at a resort".
(5) Repetitive statements: defining terms or relationships between a pair of terms, by repeating the term ("The definition of s is s").
(6) Wrong template: following wrong templates in Table 5, e.g. generating "X is a rephrasing of Y" when X implies Y ("The people are eating fresh seafood is a rephrasing of sitting near the ocean").
(7) Rationalizing the premise: the rationale explains the premise instead of the hypothesis (e.g. "U implies P").

Table 8: Ablation studies human evaluation. Results (percents) are presented for strengtheners and weakeners (S/W).

We observe that a large portion of errors (especially for weakeners) are of error type (1), where the rationale needs to be completed by another hop of reasoning.

6.3 Ablation Studies

We conduct ablation studies in which we ablate either (i) the filtering step (randomly selecting a rationale from each source), or (ii) all sources besides NLI-derived rationales from our e-δ-NLI dataset. In both cases, we trained the best setup (Rationale Bart-L) and evaluated the results using the same human evaluation setup described in Section 5.2.

The results are reported in Table 8 . Both ablations increase the relevance of rationales while hurting their factual correctness and producing less explanatory weakener rationales. In the case of the second ablation, this is likely due to the fact that most model-generated rationales in the format of the NLI-derived rationales copy parts of the input into label-specific templates, yielding relevant but not necessarily correct or explanatory rationales.

7 Conclusion

We presented an approach for generating rationales for the defeasible inference task, i.e., explaining why a given update either strengthened or weakened the hypothesis. We experimented with various training setups categorized into post-hoc rationalization and joint prediction and rationalization. Rather than collecting human explanations, we chose to train our models with a distant supervision approach that requires no additional annotation cost and may generalize better across datasets. The results indicated that the post-hoc rationalization setup is easier than the joint setup, with many of the post-hoc generated rationales considered by humans as explanatory. Nonetheless, the model's success may be attributed to its access to the update type, which enabled learning a trivial mapping from the update type to rationale templates associated with it in the training data. The joint setup, on the other hand, proved to be more challenging. We hope that future work will focus on jointly predicting a label and generating a rationale, which is a more realistic setup and which may yield less trivial and more faithful rationales.

The code and data are available at: https://github.com/fabrahman/RationaleGen.

Humans also post-hoc rationalize decisions, and such rationalization is known to be flawed (Gazzaniga and LeDoux 2013). For recent works discussing rationale faithfulness, see Hase and Bansal (2020), Jacovi and Goldberg (2020), and Wiegreffe, Marasović, and Smith (2020).

For multi-word spans we use maximum word-level similarity.

Hyper-parameter values were chosen empirically from p ∈ {0.35, 0.5, 0.75}, temperature ∈ {0.7, 1}, #samples ∈ {5, 20}.
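The search space in the footnote above is small enough to enumerate exhaustively. The sketch below lists the combinations, assuming that p is the nucleus (top-p) sampling threshold and that the three values were varied jointly; neither assumption is confirmed by the text, and the names `TOP_P`, `TEMPERATURE`, `NUM_SAMPLES`, and `decoding_grid` are illustrative only.

```python
from itertools import product

# Candidate values quoted in the footnote above.
# Assumption: p is the nucleus (top-p) sampling threshold.
TOP_P = (0.35, 0.5, 0.75)
TEMPERATURE = (0.7, 1.0)
NUM_SAMPLES = (5, 20)


def decoding_grid():
    """Return every (top_p, temperature, num_samples) combination as a dict."""
    return [
        {"top_p": p, "temperature": t, "num_samples": n}
        for p, t, n in product(TOP_P, TEMPERATURE, NUM_SAMPLES)
    ]


grid = decoding_grid()
print(len(grid))  # 3 * 2 * 2 = 12 combinations
```

Each dictionary could then be passed to whatever decoding routine is being tuned, with the best setting picked by inspecting the generated rationales.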

In practice, we only take the rationales and ignore the labels. But if we map entailment to strengthener and contradiction to weakener, we get 64% accuracy on the update type prediction. We note that this is an approximation. The definition of defeasible inference requires that a weakener makes the hypothesis less likely but not necessarily unlikely, while a strengthener makes the hypothesis more likely but not necessarily likely.
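The label-mapping check described above can be sketched as follows. The function and variable names are hypothetical, the example labels are toy data rather than the paper's, and counting unmapped labels such as neutral as errors is an assumption made here for illustration.

```python
# Mapping from NLI labels to defeasible update types, as described above.
NLI_TO_UPDATE = {"entailment": "strengthener", "contradiction": "weakener"}


def update_type_accuracy(nli_labels, gold_update_types):
    """Accuracy of update types obtained by mapping NLI labels.

    Assumption: labels without a mapping (e.g. "neutral") count as errors.
    """
    if len(nli_labels) != len(gold_update_types):
        raise ValueError("label lists must have the same length")
    if not nli_labels:
        return 0.0
    correct = sum(
        NLI_TO_UPDATE.get(nli) == gold
        for nli, gold in zip(nli_labels, gold_update_types)
    )
    return correct / len(nli_labels)


# Toy example (not data from the paper).
predicted = ["entailment", "contradiction", "neutral", "entailment"]
gold = ["strengthener", "weakener", "weakener", "weakener"]
print(update_type_accuracy(predicted, gold))  # 2 of 4 correct -> 0.5
```

Because a weakener only makes the hypothesis less likely rather than unlikely, a mapping of this kind can only approximate the update type.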

Equating strengtheners with entailment and weakeners with contradiction is a simplifying assumption, which is not strictly true.

In our preliminary experiments, we also experimented with T5, but we did not observe any improvements.


  • Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In ICLR.
  • Bastings, J.; Aziz, W.; and Titov, I. 2019. Interpretable Neural Predictions with Differentiable Binary Variables. In ACL.
  • Gazzaniga, M. S.; and LeDoux, J. E. 2013. The integrated mind. Springer Science & Business Media.
  • Guan, J.; Huang, F.; Zhao, Z.; Zhu, X.; and Huang, M. 2020. A knowledge-enhanced pretraining model for commonsense story generation. TACL.
  • Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S.; and Smith, N. A. 2018. Annotation Artifacts in Natural Language Inference Data. In NAACL.
  • Hase, P.; and Bansal, M. 2020. Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior? In ACL.
  • Holtzman, A.; Buys, J.; Du, L.; Forbes, M.; and Choi, Y. 2020. The Curious Case of Neural Text Degeneration. In ICLR.
  • Honnibal, M.; and Montani, I. 2017. spacy 2: Natural language understanding with bloom embeddings, convolutional neural networks and incremental parsing. To appear 7(1).
  • Jacovi, A.; and Goldberg, Y. 2020. Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness? In ACL.
  • Jain, S.; and Wallace, B. C. 2019. Attention is not Explanation. In NAACL.
  • Jain, S.; Wiegreffe, S.; Pinter, Y.; and Wallace, B. C. 2020. Learning to Faithfully Rationalize by Construction. In Jurafsky, D.; Chai, J.; Schluter, N.; and Tetreault, J. R., eds., ACL.
  • Bhagavatula, C.; Le Bras, R.; Malaviya, C.; Sakaguchi, K.; Holtzman, A.; Rashkin, H.; Downey, D.; Yih, W.-t.; and Choi, Y. 2019. Abductive Commonsense Reasoning. In ICLR.
  • Kumar, S.; and Talukdar, P. P. 2020. NILE: Natural Language Inference with Faithful Natural Language Explanations. In Jurafsky, D.; Chai, J.; Schluter, N.; and Tetreault, J. R., eds., ACL.
  • Landis, J. R.; and Koch, G. G. 1977. The measurement of observer agreement for categorical data. Biometrics.
  • Latcinnik, V.; and Berant, J. 2020. Explaining question answering models through text generation. arXiv.
  • Lei, T.; Barzilay, R.; and Jaakkola, T. 2016. Rationalizing Neural Predictions. In EMNLP.
  • Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; and Zettlemoyer, L. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In ACL.
  • Lin, C.-Y. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out.
  • Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv.
  • Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013. Efficient estimation of word representations in vector space. arXiv.
  • Narang, S.; Raffel, C.; Lee, K.; Roberts, A.; Fiedel, N.; and Malkan, K. 2020. WT5?! Training Text-to-Text Models to Explain their Predictions. arXiv.
  • Nie, Y.; Williams, A.; Dinan, E.; Bansal, M.; Weston, J.; and Kiela, D. 2020. Adversarial NLI: A New Benchmark for Natural Language Understanding. In ACL.
  • Novikova, J.; Dušek, O.; Curry, A. C.; and Rieser, V. 2017. Why We Need New Evaluation Metrics for NLG. In EMNLP.
  • Bosselut, A.; Rashkin, H.; Sap, M.; Malaviya, C.; Celikyilmaz, A.; and Choi, Y. 2019. COMET: Commonsense Transformers for Automatic Knowledge Graph Construction. In ACL.
  • Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL.
  • Petroni, F.; Rocktäschel, T.; Riedel, S.; Lewis, P.; Bakhtin, A.; Wu, Y.; and Miller, A. 2019. Language Models as Knowledge Bases? In EMNLP-IJCNLP.
  • Poliak, A.; Naradowsky, J.; Haldar, A.; Rudinger, R.; and Van Durme, B. 2018. Hypothesis Only Baselines in Natural Language Inference. In *SEM.
  • Qin, L.; Bosselut, A.; Holtzman, A.; Bhagavatula, C.; Clark, E.; and Choi, Y. 2019. Counterfactual Story Reasoning and Generation. In EMNLP-IJCNLP.
  • Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language Models are Unsupervised Multitask Learners. OpenAI Blog.
  • Raffel, C.; Luong, M.; Liu, P. J.; Weiss, R. J.; and Eck, D. 2017. Online and Linear-Time Attention by Enforcing Monotonic Alignments. In ICML.
  • Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; and Liu, P. J. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. JMLR.
  • Rajani, N. F.; McCann, B.; Xiong, C.; and Socher, R. 2019. Explain Yourself! Leveraging Language Models for Commonsense Reasoning. In ACL.
  • Reiter, R. 1980. A logic for default reasoning. Artificial Intelligence.
  • Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. "Why should I trust you?" Explaining the predictions of any classifier. In ACM SIGKDD.
  • Rudinger, R.; Shwartz, V.; Hwang, J. D.; Bhagavatula, C.; Forbes, M.; Le Bras, R.; Smith, N. A.; and Choi, Y. 2020. Thinking Like a Skeptic: Defeasible Inference in Natural Language. In Findings of ACL: EMNLP.
  • Bowman, S.; Angeli, G.; Potts, C.; and Manning, C. D. 2015. A large annotated corpus for learning natural language inference. In EMNLP.
  • Sap, M.; Le Bras, R.; Allaway, E.; Bhagavatula, C.; Lourie, N.; Rashkin, H.; Roof, B.; Smith, N. A.; and Choi, Y. 2019. Atomic: An atlas of machine commonsense for if-then reasoning. In AAAI.
  • Serrano, S.; and Smith, N. A. 2019. Is Attention Interpretable? In ACL.
  • Shwartz, V.; and Dagan, I. 2018. Paraphrase to Explicate: Revealing Implicit Noun-Compound Relations. In ACL.
  • Shwartz, V.; West, P.; Le Bras, R.; Bhagavatula, C.; and Choi, Y. 2020. Unsupervised Commonsense Question Answering with Self-Talk. In EMNLP.
  • Speer, R.; Chin, J.; and Havasi, C. 2017. ConceptNet 5.5: an open multilingual graph of general knowledge. In AAAI.
  • Talmor, A.; Herzig, J.; Lourie, N.; and Berant, J. 2019. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. In NAACL.
  • Tamborrino, A.; Pellicanò, N.; Pannier, B.; Voitot, P.; and Naudin, L. 2020. Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning. In ACL.
  • Tenney, I.; Wexler, J.; Bastings, J.; Bolukbasi, T.; Coenen, A.; Gehrmann, S.; Jiang, E.; Pushkarna, M.; Radebaugh, C.; Reif, E.; and Yuan, A. 2020. The Language Interpretability Tool: Extensible, Interactive Visualizations and Analysis for NLP Models. In EMNLP.
  • Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is All you Need. In Neurips.
  • Wang, C.; Liang, S.; Zhang, Y.; Li, X.; and Gao, T. 2019. Does it Make Sense? And Why? A Pilot Study for Sense Making and Explanation. In ACL.
  • Camburu, O.-M.; Rocktäschel, T.; Lukasiewicz, T.; and Blunsom, P. 2018. e-SNLI: Natural Language Inference with Natural Language Explanations. In Neurips.
  • Wiegreffe, S.; and Pinter, Y. 2019. Attention is not not Explanation. In EMNLP-IJCNLP.
  • Wiegreffe, S.; Marasović, A.; and Smith, N. A. 2020. Measuring Association Between Labels and Free-Text Rationales. arXiv.
  • Williams, A.; Nangia, N.; and Bowman, S. 2018. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In NAACL-HLT.
  • Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; and Brew, J. 2019. HuggingFace's Transformers: State-of-the-art Natural Language Processing. arXiv.
  • Zellers, R.; Holtzman, A.; Rashkin, H.; Bisk, Y.; Farhadi, A.; Roesner, F.; and Choi, Y. 2019. Defending Against Neural Fake News. In Neurips.
  • Dagan, I.; Roth, D.; Sammons, M.; and Zanzotto, F. M. 2013. Recognizing Textual Entailment: Models and Applications. Morgan & Claypool Publishers.
  • Davison, J.; Feldman, J.; and Rush, A. 2019. Commonsense Knowledge Mining from Pretrained Models. In EMNLP-IJCNLP.
  • Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL.
  • Dodge, J.; Liao, Q. V.; Zhang, Y.; Bellamy, R. K. E.; and Dugan, C. 2019. Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment. In IUI.