Paragraph-Level Commonsense Transformers with Recurrent Memory

Authors

  • Saadia Gabriel
  • Chandra Bhagavatula
  • Vered Shwartz
  • Ronan Le Bras
  • Maxwell Forbes
  • Yejin Choi
  • AAAI 2021

Abstract

Human understanding of narrative texts requires making commonsense inferences beyond what is stated explicitly in the text. A recent model, COMET, can generate such inferences along several dimensions, such as pre- and post-conditions, motivations, and mental states of the participants. However, COMET was trained on short phrases and is therefore discourse-agnostic: when presented with each sentence of a multi-sentence narrative, it might generate inferences that are inconsistent with the rest of the narrative. We present the task of discourse-aware commonsense inference. Given a sentence within a narrative, the goal is to generate commonsense inferences along predefined dimensions while maintaining coherence with the rest of the narrative. Because such large-scale paragraph-level annotation is costly and hard to obtain, we use available sentence-level annotations to efficiently and automatically construct a distantly supervised corpus. Using this corpus, we train PARA-COMET, a discourse-aware model that incorporates paragraph-level information to generate coherent commonsense inferences from narratives. PARA-COMET captures both semantic knowledge pertaining to prior world knowledge and episodic knowledge involving how current events relate to prior and future events in a narrative. Our results confirm that PARA-COMET outperforms the sentence-level baselines, particularly in generating inferences that are both coherent and novel.

1 Introduction

Narrative understanding is a long-standing challenge in the field of natural language processing (NLP) (Charniak 1972; Winograd 1972). Arguably, the most crucial aspect of narrative understanding is the ability to make implicit commonsense inferences about entities and events in a story and to refine them as the story unfolds (Pettijohn and Radvansky 2016; Williams, Lieberman, and Winston 2017; Rashkin et al. 2018; Qin et al. 2019). This ability is seamless in humans, yet essential for a coherent understanding of narrative text. Can NLP systems explicitly generate the commonsense inferences that a human might implicitly make while reading a narrative?

Figure 1: Discourse-agnostic models generate inferences relevant to the local context, but these generations can often be generic or incorrect at the narrative level. Discourse-aware models take the rest of the context into account to make globally coherent inferences.

Being able to generate commonsense inferences has important practical implications. The recently proposed Commonsense Transformer (COMET; Bosselut et al. 2019) generates commonsense inferences for a given phrase or sentence, capturing pre- and post-conditions along nine inferential dimensions found in the ATOMIC (Sap et al. 2019) knowledge base. 1 The commonsense inferences generated by COMET have been effectively applied to downstream applications such as sarcastic comment generation (Chakrabarty et al. 2020), therapy chatbots (Kearns et al. 2020), abductive natural language generation (Bhagavatula et al. 2019), and automated story plot generation (Ammanabrolu et al. 2020).

However, COMET inferences suffer from a major shortcoming: they are generated for a sentence in isolation and fail to account for the full paragraph-level narrative context. This often results in inferences that are inconsistent or unlikely when considering the previous narrative context. For example, in Figure 1, given only the sentence "Ella walked around the property," one might infer that she did this because she wanted to "get exercise" or "admire the view". While such an inference is reasonable for the sentence in isolation, it is inconsistent given the full context, e.g., "The water bill at Ella's house had been high. Ella walked around the property." In light of the full context, a more reasonable inference is that "She wanted to fix a leak."

We introduce the task of generating implicit discourse-aware commonsense inferences for narrative text, and present PARA-COMET, a transformer-based, controlled generation model for the task. Instead of collecting expensive and cumbersome annotated data as direct supervision for this task, PARA-COMET is distantly supervised through sentence-level inferences obtained either from the COMET model or by heuristically matching a sentence to events found in the ATOMIC knowledge base. To improve the paragraph-level consistency of these inferences, we define a coherence metric that measures the likelihood of each candidate inference in the context of the story.

We show that PARA-COMET generates coherent discourse-aware inferences and performs better than discourse-agnostic baselines in both automated and manual evaluation. Yet, even the best model generates implausible inferences (23% of the inferences), and inferences that contradict the paragraph-level context (in 44% of the stories). This stresses the difficulty of the task and calls for further research. We release our models and data as an initial step towards advancing paragraph-level commonsense understanding. 2

Table 1: Examples generated from the models in this paper: a discourse-agnostic (sentence-level) baseline vs. our discourse-aware PARA-COMET. We highlight the sentence that each inference was generated for in bold. Inferences are marked as plausible (✓) or implausible (✗).

2 Background

Sentence-level Commonsense Inferences. A key component of our distant supervision approach is the availability of sentence-level commonsense inferences. The ATOMIC knowledge base (Sap et al. 2019) consists of such if-then knowledge about causes and effects, agents and themes of events, and their actions and mental states. An ATOMIC entry is encoded as a triple ⟨e_1, d, e_2⟩, where e_1 is an event phrase, d is an inferential dimension, and e_2 is the inference along the given dimension. ATOMIC defines nine inferential dimensions, such as xIntent (the agent's intent) and oEffect (the effect on the patient(s)); see Table 2. The event e_1 and the inference e_2 are natural language templates with the variables PersonX for the agent and PersonY for the (possibly unknown) patient(s). 3 While ATOMIC contains nearly 880K triples, that is not nearly enough to capture the full range of possible events, which is immeasurably vast and impossible to enumerate manually. Furthermore, due to lexical variability, events are rarely found as-is in ATOMIC. To that end, COMET (Bosselut et al. 2019) was developed as a transformer-based knowledge model trained on ATOMIC to generate commonsense inferences for a given phrase or sentence. Thus, both ATOMIC and COMET are natural candidates for obtaining sentence-level commonsense inferences.
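To make the data representation concrete, the following minimal Python sketch shows one way an ATOMIC triple and its natural-language template could be represented. The dimension names come from ATOMIC, but the template strings and the example triple are illustrative rather than the paper's exact Table 2 templates.

```python
from dataclasses import dataclass

# Illustrative templates for a few ATOMIC dimensions; the exact wording of the
# paper's Table 2 templates may differ (these follow the phrasing seen in Table 3).
TEMPLATES = {
    "xIntent": "PersonX wanted: {}",
    "xReact": "PersonX then feels: {}",
    "xWant": "PersonX wants: {}",
    "oWant": "PersonY/Others want: {}",
    "xAttr": "PersonX is seen as: {}",
}

@dataclass(frozen=True)
class AtomicTriple:
    """An ATOMIC entry <e1, d, e2>: event phrase, inferential dimension, inference."""
    e1: str  # event phrase, e.g. "PersonX walks around the property"
    d: str   # dimension, e.g. "xIntent"
    e2: str  # inference along that dimension, e.g. "to find a leak"

    def verbalize(self) -> str:
        """Render the inference with the dimension's natural-language template."""
        return TEMPLATES.get(self.d, self.d + ": {}").format(self.e2)

# Hypothetical example triple (not taken from ATOMIC itself).
triple = AtomicTriple("PersonX walks around the property", "xIntent", "to find a leak")
print(triple.verbalize())  # -> "PersonX wanted: to find a leak"
```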

Table 2: Natural language templates for ATOMIC dimensions.

Reasoning about narratives. A related line of work is script learning, which defines a structured representation for prototypical series of events (Schank and Abelson 1977). An event (e.g., going to a restaurant) is decomposed into components such as the participants (customer, waiter, cook, etc.), subevents (sitting down, asking for menus, etc.), and their various pre- and post-conditions. In later work, scripts were also referred to as "narrative event chains", and multiple methods were developed to learn narrative chains from raw text (Chambers and Jurafsky 2008; Jans et al. 2012; Pichotta and Mooney 2014; Rudinger et al. 2015). Similarly, the Choice of Plausible Alternatives (COPA) task (Roemmele, Bejan, and Gordon 2011) proposes a benchmark for commonsense causal reasoning: it asks which of two alternatives has a causal relationship (either cause or effect) with a given premise. Finally, the temporal ordering of events is often studied along with typical times and durations (Kozareva and Hovy 2011; Granroth-Wilding and Clark 2016; Li, Ding, and Liu 2018; Zhou et al. 2019).

Types of commonsense inferences. While most commonsense work pertains to knowledge such as that captured by ConceptNet (Speer, Chin, and Havasi 2017), in this paper we focus on commonsense based on naive psychology, a core human ability that allows people to reason about mental states such as reactions, intents, goals, and beliefs (Heider 1958). ATOMIC is specifically designed to capture such knowledge, and we focus on this socially motivated commonsense, though our distant supervision approach and proposed model are extensible to other knowledge bases and forms of commonsense.

3 Discourse-Aware Commonsense Inference

Our work is motivated by the question: can NLP systems explicitly generate the commonsense inferences that a human might implicitly make while reading a narrative? To tackle this question, we formalize and introduce the discourse-aware commonsense inference task. 4 Formally, given a narrative with T sentences {S_1, S_2, ..., S_T}, the goal is to generate a set of commonsense inferences along the nine inferential dimensions (Table 2) for each sentence S_i. The set of inferences generated for S_i must also be consistent with the entire narrative. Maintaining consistency with the full narrative context requires reasoning about the relationship between past and future events. Table 1 shows examples of discourse-aware (paragraph-level) and discourse-agnostic (sentence-level) inferences. Sentence-level inferences are often inconsistent with the narrative, for example, the inference that a character needed "to be in a pool" when the earlier context shows they are gardening (first row in Table 1), or that a character wants to "practice more" when it has been established that they are confident in their own abilities (third row).

4 Distant Supervision Approach

Sentence-level inferences (e.g., those obtained from COMET) are inadequate for training models for our proposed task, and obtaining direct supervision of discourse-aware inferences is cumbersome and expensive. Therefore, we use distant supervision to loosely align sentences in a narrative with their discourse-aware commonsense inferences. First, we obtain discourse-agnostic inferences from either the COMET model or the ATOMIC knowledge base. Next, we filter out inferences that are inconsistent with the rest of the narrative (Section 4.3). This yields silver-standard training data for our task. Additionally, we create a smaller-scale validation set by manually validating inferences through a crowdsourced annotation task (Section 4.4).

4.1 Source Of Narratives

The basis for our dataset is the English ROCStories corpus (Mostafazadeh et al. 2016), which consists of 98K five-sentence stories authored by workers on Amazon Mechanical Turk. Understanding these stories requires the kinds of commonsense and temporal inferences we aim to capture. We split the original ROCStories training set into train, dev, and test sets in a 90/5/5 ratio.

4.2 Discourse-Agnostic Inferences

We aim to generate the types of commonsense inferences defined by the ATOMIC knowledge base (Sap et al. 2019) . We obtain discourse-agnostic inferences using either of the following approaches.

Heuristic: For each sentence S_i in the story, we get an initial set of candidate inferences R_i by extracting ATOMIC tuples ⟨e_1, d, e_2⟩ in which e_1 and S_i share either noun phrases or verb phrases. We repurpose the ROUGE metric (Lin 2004) to measure the surface-level relevance of a particular event e_1 to a sentence S_i. Specifically, we compute the ROUGE-1 F1 score, which considers unigrams, and keep the top 10 inferences with respect to the score for each sentence and dimension.
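A rough sketch of this heuristic matching step is shown below. It assumes a hypothetical list of ATOMIC (e1, d, e2) tuples and a caller-supplied `shares_phrase` predicate for the noun/verb-phrase overlap check; ROUGE-1 F1 is computed directly from unigram counts rather than with a specific ROUGE package.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap ROUGE-1 F1 between two whitespace-tokenized strings."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def heuristic_candidates(sentence, atomic_triples, shares_phrase, top_k=10):
    """Keep the top-k ATOMIC inferences per dimension, ranked by ROUGE-1 F1
    between the event e1 and the story sentence.

    `atomic_triples` is a list of (e1, d, e2) tuples; `shares_phrase(e1, sentence)`
    is a caller-supplied predicate checking shared noun/verb phrases (e.g. via spaCy).
    """
    scored = {}
    for e1, d, e2 in atomic_triples:
        if not shares_phrase(e1, sentence):
            continue
        scored.setdefault(d, []).append((rouge1_f1(e1, sentence), e1, d, e2))
    return {d: sorted(cands, reverse=True)[:top_k] for d, cands in scored.items()}
```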

Model-Based: We use COMET to generate commonsense inferences for each sentence S i in the story. We use beam search with a beam size of 10 to obtain a set of inferences for each sentence and dimension combination. See Appendix A.1 for more details on the distant supervision data curation process.

4.3 From Discourse-Agnostic To Discourse-Aware Inferences

The inferences obtained by both the heuristic and the model-based methods (Section 4.2) consider only one sentence at a time. To improve coherence with the rest of the narrative, we filter out inferences that have low coherence with the given narrative. Specifically, inspired by information theory (Shannon 1948; Hale 2001), we define coherence as a measure based on the cross entropy of the story tokens conditioned on a particular candidate knowledge inference. For a tuple ⟨e_1, d, e_2⟩ ∈ R_i matched to a sentence S_i, and a language model Θ, we compute the cross entropy loss of the tokens in the story where ⟨d, e_2⟩ follows S_i: CE(S_1, ..., S_i, ⟨d, e_2⟩, ..., S_5). 5 We use a transformer-based language model, and convert ⟨d, e_2⟩ to natural language using the hand-crafted templates shown in Table 2. In practice, we divide the dimensions into causes (xNeed, xIntent, xAttr) and effects (xWant, xEffect, xReact, oWant, oEffect, oReact). For cause inferences, we compute coherence with the previous and current sentences in the story (as shown in Figure 2). For effect inferences, we use the full story. This allows us to effectively measure how well the extracted inferences follow from past story events or predict future ones.

Figure 2: An example of the different narrative input types used for measuring coherence during distant supervision. Sentences used as input are highlighted in purple. The same strategy is used for determining the input during PARA-COMET inference.

To ensure an equal distribution of inferences across dimensions, we order inferences by coherence score and keep the top 5 inferences for each sentence and dimension type.

5 Here we define the cross entropy loss as CE(t_1, ..., t_n) = -(1/n) Σ_{i=1}^{n} log_2 p_Θ(t_i | t_1, ..., t_{i-1}).

This filtering step is designed to reduce the number of contradicting inferences in our distant supervision corpus.
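A hedged sketch of this coherence filter follows, using an off-the-shelf GPT-2 from the Transformers library as a stand-in for the scoring language model Θ; the specific checkpoint and the cause/effect context construction are assumptions based on Section 4.3 and footnote 5 (the log-base conversion does not change the ranking).

```python
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Assumption: a vanilla pretrained GPT-2 stands in for the scoring LM Theta.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

CAUSE_DIMS = {"xNeed", "xIntent", "xAttr"}

@torch.no_grad()
def coherence_score(story_sents, i, dim, verbalized_inference):
    """Cross entropy (bits per token) of the story with the templated inference
    inserted after sentence i; lower is more coherent (Section 4.3).

    For cause dimensions only the sentences up to and including S_i are used;
    for effect dimensions the full story is used.
    """
    if dim in CAUSE_DIMS:
        context = story_sents[: i + 1] + [verbalized_inference]
    else:
        context = story_sents[: i + 1] + [verbalized_inference] + story_sents[i + 1:]
    ids = tokenizer(" ".join(context), return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss  # mean negative log-likelihood in nats
    return loss.item() / math.log(2)    # convert to bits (log base 2)

def filter_candidates(story_sents, i, dim, candidates, keep=5):
    """Keep the `keep` lowest-cross-entropy inferences for one sentence/dimension."""
    ranked = sorted(candidates, key=lambda inf: coherence_score(story_sents, i, dim, inf))
    return ranked[:keep]
```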

4.4 Validation Set

We validate a subset of the development set through crowdsourcing to obtain a gold evaluation set. We used Amazon Mechanical Turk and asked workers to judge the relevance of inferences for a given sentence within a story, leaving the interpretation of relevance to the best judgement of annotators. 6 Generally, we found that annotators adhered to a strict definition of relevance in which ambiguous inferences that may still be relevant to the story context at some point in the story timeline are labeled as irrelevant. See Table 3 for examples.

Table 3: Examples from the distantly supervised dataset. We highlight the most relevant (i.e., potentially contradictory or supporting) sections in the story for each inference being considered. The LM score is the average token log probability. Inferences are marked as relevant (✓) or irrelevant (✗).

Story excerpt | Inference | LM score
Natalie's favorite movie is The Wizard of Oz... | PersonX wanted: to see the film | -3.384
I was at the grocery store... I see the lines were very long... | PersonX then feels: relieved | -3.408
Jim wanted to learn Spanish. He tried taking a class... | PersonY/Others want: to catch up | -3.602
Our building had a summer bbq party today. The manager took photos... | PersonX wants: to enjoy the party | -3.915
Chris realizes that he rarely watches cable TV anymore. He calls... to cancel... | PersonX wanted: to be a good customer | -3.952
My grandparents lived in Alabama. We used to travel there... I miss traveling there... | PersonX is seen as: sad | -3.518

We randomly sampled 542 stories from the development set; for each story we randomly selected a sentence and a dimension, and annotated the 5 inferences associated with them. Three annotators judged each example, and we used the majority vote as the gold label. We filtered out low-quality annotations by manually checking for workers with low inter-annotator agreement and frequently incorrect labeling. 7 Our annotations yielded fair inter-annotator agreement of Fleiss' κ = 0.338 (Fleiss 1971; p-value < .001). Despite the challenges of this task, this value is comparable to or higher than agreement reported in prior work on evaluating commonsense knowledge. 8 The final evaluation subset consists of 607 inferences, across all dimensions, from 313 unique stories, that were found to be relevant by multiple human annotators (34.29% of the inferences judged).
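For illustration, a small sketch of the majority-vote labeling and Fleiss' κ computation using statsmodels; the rating matrix below is toy data, and the manual worker-level filtering described above is not shown.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy annotation matrix: one row per inference, one column per annotator,
# values are 1 (relevant) / 0 (irrelevant). These numbers are illustrative only.
ratings = np.array([
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
    [1, 0, 1],
])

# Majority vote over 3 annotators gives the gold label per inference.
gold = (ratings.sum(axis=1) >= 2).astype(int)

# Fleiss' kappa: convert (subjects x raters) to (subjects x categories) counts.
counts, _ = aggregate_raters(ratings)
kappa = fleiss_kappa(counts, method="fleiss")
print(gold, round(kappa, 3))
```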

5 Model

We draw inspiration from the distinction between semantic and episodic memory (Tulving and Donaldson 1972), and consider implicit commonsense knowledge in two ways: 1) semantic knowledge, grounded in world knowledge and culture-specific social knowledge (e.g., "leaks lead to high water bills"), and 2) episodic knowledge, grounded in causal understanding and epistemic reasoning, i.e., reasoning that relates past events to current events (e.g., "if a person gets a high water bill, they will want to find out why"). We introduce two variants of the PARA-COMET controlled generation model: a memory-less model that focuses on semantic knowledge drawn from the context, and a model augmented with recurrent memory that allows us to explicitly incorporate episodic knowledge. Figure 3 illustrates generating inferences for a narrative using PARA-COMET with recurrent memory.

Figure 3: An illustration of PARA-COMET with a memory component. The model predicts an inference for a given sentence in the narrative (e.g., the second) and a requested ATOMIC dimension.

6 We restrict annotators to the US only.

7 These were done primarily by workers who spent less than 20 seconds on a HIT. See Appendix A.3 for full details of the annotation process.

8 κ = 0.23 for judging commonsense knowledge triplets in Feldman, Davison, and Rush (2019), and between κ = 0.289 and κ = 0.483 for commonsense story generation in Guan et al. (2020).

Memory-less Model. Given a story context c = {S_1, S_2, ..., S_T} of T sentences and a selected sentence S_i, we set the input to:

x = S_1 || S_2 || ... || S_T || s || d    (1)

where s and d are special tokens: s represents the index of the selected sentence, while d represents the required ATOMIC dimension; || denotes concatenation. In the example in Figure 3, the input provided to the model is:

x = Jordan was writing... <|sent2|> <|xWant|>

We fine-tuned the base GPT-2 model to generate the expected output, which is an inference for the dimension d and sentence S_i.
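A minimal sketch of the input construction in Eq. 1, using the example story from Figure 3; the special-token strings follow the example above (the full list appears in Table 7 of the appendix), and the 1-based sentence indexing is an assumption.

```python
def build_input(story_sents, sent_idx, dim):
    """Concatenate the story with the sentence-index and dimension control tokens
    (Eq. 1). Token strings follow the paper's example; indexing is assumed 1-based."""
    return " ".join(story_sents) + f" <|sent{sent_idx}|> <|{dim}|>"

story = [
    "Jordan was writing a new novel.",
    "She was facing a block on the next chapter.",
    "Jordan decided to take a break from writing.",
    "She went outside and took a nice walk.",
    "After the walk, Jordan was able to write the next chapter.",
]
print(build_input(story, 2, "xWant"))
# -> "Jordan was writing a new novel. ... <|sent2|> <|xWant|>"
```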

Memory-augmented Model. To incorporate inferences generated for other sentences in the story while generating inferences for a given sentence, we extend the model with a recurrent memory component, inspired by episodic memory. M_m ∈ R^{R_m × L_r × H} is the external memory, where R_m is the maximum number of inferences per instance to store in memory, L_r is the maximum inference sequence length, 9 and H is the hidden state dimension.

The memory-augmented model takes as input a memory update matrix M_u ∈ R^{R_u × L_r × H}, where R_u is the number of inferences used to update the memory, and incorporates it into the memory matrix:

M_m = M_m ⊕ f_emb(M_u)    (2)

where ⊕ stands for matrix concatenation, and f_emb is an embedding layer trained jointly with the rest of the model. After the memory is updated, we average M_m across the token dimension to obtain θ_mem ∈ R^{R_m × H}:

θ_mem = (1/L_r) Σ_{l=1}^{L_r} M_{m,l}    (3)

We denote the context representation obtained from GPT-2's hidden states as C_o ∈ R^{L_c × H}, where L_c is the context sequence length. We average it across all tokens, obtaining θ_ctx ∈ R^H. We then prune the memory to the top-k most relevant inferences, measured by the cosine similarity between the memory vectors θ_mem and the context vector θ_ctx. The memory output M_o ∈ R^H is the average of the top-k inferences.

Finally, we reweigh the context representation C_o to incorporate the memory:

C_o = C_o + proj(M_o)    (4)

where proj is a linear projection layer used to project the memory output into the same hidden dimensional space as the context representation. At training time, the memory consists of previously extracted relations from our distant supervision; at test time, it consists of previously generated inferences, recalling the model's prior decisions. For both PARA-COMET model variants, we minimize the cross entropy loss of the entire sequence (input and output).
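The following PyTorch sketch illustrates the memory mechanism of Eqs. 2-4 (store, pool, retrieve by cosine similarity, reweigh the context). It is a simplified illustration rather than the authors' implementation; the vocabulary size, the padding of inferences to a fixed length, and initialization details are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentMemory(nn.Module):
    """Sketch of the memory component: embed and store verbalized inferences (Eq. 2),
    pool over tokens (Eq. 3), retrieve the top-k memories by cosine similarity to the
    context, and add a projection of the retrieved memory to the context (Eq. 4)."""

    def __init__(self, hidden_size: int, max_mem: int = 45, max_len: int = 100):
        super().__init__()
        self.max_mem = max_mem
        self.f_emb = nn.Embedding(50257, hidden_size)     # f_emb; GPT-2 vocab size assumed
        self.proj = nn.Linear(hidden_size, hidden_size)   # proj in Eq. 4
        self.register_buffer("memory", torch.zeros(0, max_len, hidden_size))

    def update(self, inference_ids: torch.LongTensor):
        """Eq. 2: embed new inference token ids (R_u, L_r) and concatenate into M_m,
        keeping at most `max_mem` entries."""
        m_u = self.f_emb(inference_ids)                   # (R_u, L_r, H)
        self.memory = torch.cat([self.memory, m_u], dim=0)[-self.max_mem:]

    def read(self, context_hidden: torch.Tensor, k: int = 1) -> torch.Tensor:
        """Eqs. 3-4: pool memory and context, pick top-k memories, reweigh context."""
        if self.memory.size(0) == 0:
            return context_hidden
        theta_mem = self.memory.mean(dim=1)               # (R_m, H), Eq. 3
        theta_ctx = context_hidden.mean(dim=0)            # (H,)
        sims = F.cosine_similarity(theta_mem, theta_ctx.unsqueeze(0), dim=-1)
        top = theta_mem[sims.topk(min(k, sims.numel())).indices]
        m_o = top.mean(dim=0)                             # memory output M_o
        return context_hidden + self.proj(m_o)            # Eq. 4
```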

6 Experimental Setup

6.1 Training Setup

All models are implemented using the Transformers package (Wolf et al. 2019) and trained for a maximum of 20 epochs. Training uses the Adam optimizer with linear warmup (Kingma and Ba 2015). We simulate a batch size of 16 using gradient accumulation with an actual batch size of 4. The learning rate is 2e-5. All other hyperparameters follow Radford et al. (2019). We retrieve the top k = 1 inference from memory. We use the 124M-parameter version of GPT-2, which was pretrained on WebText. See Appendix A.2 for more details.
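A hedged sketch of this training configuration with the Transformers library; the toy batches, the warmup length, and the use of torch's AdamW in place of Adam are assumptions, while the learning rate, accumulation factor, and GPT-2 size follow the text.

```python
import torch
from torch.optim import AdamW
from transformers import GPT2LMHeadModel, get_linear_schedule_with_warmup

# Toy stand-in batches of token ids; in practice these come from the distant
# supervision corpus, with an actual batch size of 4 per forward pass.
train_loader = [torch.randint(0, 50257, (4, 64)) for _ in range(8)]
ACCUM_STEPS = 4  # 4 x 4 = simulated batch size of 16 via gradient accumulation

model = GPT2LMHeadModel.from_pretrained("gpt2")  # 124M-parameter GPT-2
optimizer = AdamW(model.parameters(), lr=2e-5)   # AdamW stands in for Adam here
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1,                                   # warmup length is an assumption
    num_training_steps=len(train_loader) // ACCUM_STEPS,
)

model.train()
optimizer.zero_grad()
for step, batch in enumerate(train_loader):
    # Cross entropy over the full sequence (input and output), as in Section 5.
    loss = model(batch, labels=batch).loss / ACCUM_STEPS
    loss.backward()
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```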

6.2 Decoding Setup

For decoding, we use beam search with a beam size of b ∈ {1, 10}. The maximum decoding length is 50 tokens. Unlike at training time, where we take a single dimension for each sentence in each story, at decoding time we generate inferences for every dimension and every sentence. For both training and decoding, all experiments are run using 64 Intel(R) Xeon(R) Gold 6130 x86-64 CPUs at 2.10GHz and a Quadro RTX 8000 GPU.
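A short sketch of the decoding setup; a vanilla GPT-2 checkpoint stands in for a fine-tuned PARA-COMET model, and the prompt string is abbreviated.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# A vanilla GPT-2 checkpoint stands in here for a fine-tuned PARA-COMET model.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "Jordan was writing a new novel. ... <|sent2|> <|xWant|>"  # input as in Eq. 1
ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    output = model.generate(
        ids,
        num_beams=10,                        # beam size b in {1, 10}
        max_new_tokens=50,                   # maximum decoding length of 50 tokens
        early_stopping=True,
        pad_token_id=tokenizer.eos_token_id,
    )
# The generated inference is whatever follows the prompt tokens.
print(tokenizer.decode(output[0, ids.size(1):], skip_special_tokens=True))
```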

6.3 Baselines

As a baseline, we use the COMET model, pre-trained on ATOMIC, to generate sentence-level inferences for each sentence in the story. 10 As an additional baseline, we use a retrieval model (BERT-KNN) based on the K-Nearest Neighbor search algorithm (k=1). We embed ATOMIC events using BERT (Devlin et al. 2019) , then find the closest ATOMIC event node for each story sentence to get a set of matching inferences.
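A sketch of how the BERT-KNN baseline could be assembled; mean pooling of BERT hidden states and cosine similarity as the nearest-neighbor metric are assumptions, since the paper only specifies BERT embeddings and k = 1.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def embed(texts):
    """Mean-pooled BERT embeddings; the pooling choice is an assumption."""
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**enc).last_hidden_state
    mask = enc["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)

def knn_inferences(sentence, atomic_events, event_to_inferences, k=1):
    """Retrieve the inferences attached to the k nearest ATOMIC event(s) for a sentence.

    `atomic_events` is a list of event strings and `event_to_inferences` maps an
    event to its ATOMIC inferences (both are hypothetical inputs here).
    """
    event_vecs = embed(atomic_events)   # in practice precomputed once and cached
    query = embed([sentence])
    sims = torch.nn.functional.cosine_similarity(query, event_vecs)
    best = sims.topk(k).indices.tolist()
    return [inf for i in best for inf in event_to_inferences[atomic_events[i]]]
```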

7 Evaluation

We report the performance of all models for automatic evaluation and the top 6 model variations (two COMET variations and four PARA-COMET variations) for human evaluation. For PARA-COMET, we report the variants with and without memory, trained on either the heuristic matching approach (PARA-H) or the model-based approach (PARA-M), as described in Section 4.2.

10 See the original paper for details.

7.1 Human Evaluation

We follow a crowdsourcing setup similar to the validation presented in Section 4.4 to measure the quality of generated inferences. We sampled 336 inferences from 56 unique stories. We show crowdworkers the full story, a specified dimension, and a generated inference, and we specify the assignment of PersonX to the syntactic subject of the sentence. 11 Following Zhang et al. (2017), we ask workers to judge the likelihood of inferences on a 5-point Likert scale: obviously true (5), generally true (4), plausible (3), neutral or unclear (2), and doesn't make sense (1). Table 4 displays the percentage of inferences judged as plausible or true (3-5), the percentage judged plausible (3), and the average rating per inference (using majority voting).

Table 4: Human evaluation results. We highlight the overall best performing model in bold.

Overall, PARA-COMET generations received higher average ratings, between 3.05 and 3.44 points, compared to 2.57 to 2.93 points for the COMET baseline variants. In particular, the memory-augmented variants produced notably more inferences rated as plausible than any other model. We observed that inferences in this category tend to be less obvious, i.e., rather than restating information from the context or producing generic inferences, they recover plausible implicit inferences.

In terms of average rating, the memory-augmented model performed slightly worse than the model without memory, and the model-based supervision type was preferred to the heuristic one. The difference in average ratings between PARA-COMET with and without memory was smaller with model-based supervision (-.02 points) than with heuristic distant supervision (-.16 points), likely because the memory component propagates errors (i.e., previously generated incorrect inferences).

Finally, we found that the "doesn't make sense" category captures cases where models produce inferences that contradict the other sentences in the story. For example, "Jason's mother ran into the kitchen" yielded the inference "PersonX needed to go home", when the story previously indicated Jason's mother was already at home.

7.2 Automatic Evaluation

Similarity to the gold inferences. We follow the ATOMIC and COMET automatic evaluation setup using BLEU (Papineni et al. 2001) , which measures the n-gram overlap between the generated and gold inferences.

Novelty. Following Jastrzebski et al. 2018, we compute novelty by measuring the percentage of generated inferences that do not appear verbatim in ATOMIC. We account for slight paraphrases by counting as novel the generated inferences that have an edit distance ratio of less than 0.95 with all ATOMIC events.
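A small sketch of the novelty check; difflib's similarity ratio stands in for the paper's edit-distance ratio, and the event list is hypothetical.

```python
import difflib

def is_novel(inference: str, atomic_events, threshold: float = 0.95) -> bool:
    """An inference counts as novel if no ATOMIC event matches it with a similarity
    ratio of 0.95 or more. difflib's ratio approximates the edit-distance ratio; the
    exact string metric used in the paper is an assumption."""
    inference = inference.lower().strip()
    return all(
        difflib.SequenceMatcher(None, inference, e.lower().strip()).ratio() < threshold
        for e in atomic_events
    )

# Toy usage with hypothetical events.
events = ["personx walks around the property", "personx goes to the store"]
print(is_novel("personx walks around the yard", events))   # likely True (paraphrase)
print(is_novel("personx goes to the store", events))       # False (verbatim match)
```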

Discourse-level Coherence. We use natural language inference (NLI; Dagan et al. 2013) as a proxy for measuring the narrative-level coherence of the predicted inferences. We define coherence as follows: at the very least, the story must not contradict any of the predictions, and it may possibly entail some of them. We use the pretrained SemBERT model (Zhang et al. 2019), a variant of BERT augmented with explicit semantic role labels, to compute NLI labels (entailment, neutral, contradiction). Since SemBERT was trained on sentence-level NLI tasks, we use each sentence of a narrative as the premise and a generated inference as the hypothesis, and then aggregate across sentences. We also normalize the inferences to more closely match the language the model was trained on, using hand-crafted templates. 12
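A sketch of this per-sentence NLI aggregation; since SemBERT is not readily available as a packaged checkpoint, a generic MNLI model (roberta-large-mnli) is used as a stand-in, and the aggregation rule is an interpretation of the coherence definition above.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Stand-in NLI model; the paper uses SemBERT, which is not packaged on the Hub.
tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli").eval()

@torch.no_grad()
def nli_label(premise: str, hypothesis: str) -> str:
    enc = tok(premise, hypothesis, return_tensors="pt", truncation=True)
    pred = nli(**enc).logits.argmax(dim=-1).item()
    return nli.config.id2label[pred].lower()  # contradiction / neutral / entailment

def story_label(story_sents, inference) -> str:
    """Each story sentence is a premise and the normalized inference is the
    hypothesis; any contradiction makes the pair contradictory, otherwise the
    story entails the inference if any sentence does, otherwise it is neutral.
    The NLI score in Table 5 is the percent of stories labeled entail or neutral."""
    labels = [nli_label(s, inference) for s in story_sents]
    if "contradiction" in labels:
        return "contradiction"
    return "entailment" if "entailment" in labels else "neutral"
```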

Table 5: Performance according to the automatic evaluation metrics. The NLI score is the percent of stories for which the model predicted entail or neutral.

Table 5 provides a summary of the automatic evaluation results on the gold subset. The PARA-COMET variants outperform the sentence-level baselines across all metrics. The novelty results show that the PARA-COMET models are capable of generating inferences that did not appear in the original ATOMIC knowledge graph. The memory-augmented models generated inferences that were more coherent with the story, reducing the percentage of contradicting inferences from 46.15% (for COMET) to 44.41%.

We also report the performance of the PARA-H model without pre-training (i.e., with a randomly initialized GPT-2 model). Pre-training has a small effect on BLEU (0.09-0.26%), indicating that while there is some benefit to the knowledge gained during pre-training, models are also able to generalize from the large amount of distant supervision.

8 Case Study: Personal Narratives

To test the ability of the model to generalize to more complex narratives requiring further pragmatic reasoning (Sap et al.), we apply our models to story/sentence/dimension triples drawn from personal blog posts in the COSMOSQA machine reading comprehension test set (Huang et al. 2019). We found that the PARA-COMET model is effective at predicting inferences in this unsupervised setting, with 49.55% of relations labeled as true and 20.72% labeled as plausible (vs. 20.72% and 27.03% for COMET). Table 6 shows an example of inferences generated for blog posts.

Table 6: Example from COSMOSQA with COMET (beam-10) and PARA-COMET predictions (e.g., PARA-M: PersonX then feels satisfied).

A.1 Efficient Collection Of Distant Supervision

To make the collection of distant supervision more efficient, we parallelize the processing of candidate inferences, computing the language model scores in batches of 130 candidate inferences. We split the process across 3 GPUs, with 30,000 instances handled by each machine. We also cache the extracted verb and noun phrases. The average runtime for 30,000 instances is approximately 5 hours for the heuristic distant supervision and 50 hours for the model-based distant supervision.

A.2 Additional Training Details

We added the tokens shown in Table 7 to the base GPT2 model. The base PARA-COMET models are trained for 8 epochs, which took approximately 1.5 hours per epoch. The memory-augmented models are trained for 9 epochs, with approximately 2 hours per epoch. We used the pre-trained COMeT model with 116M parameters.

Table 7: Special tokens used.

We manually tuned the learning rate based on the validation loss, testing values in the range [0.2, 0.00002]. All other hyperparameters follow the specifications in the original GPT-2 paper. All results are from one trained version of each model and run with a fixed random seed.

A.3 Additional Human Evaluation And Annotation Details

We paid annotators $0.25 per MTurk HIT, which we estimate is equivalent to a pay rate of $10-$15 per hour, and we encouraged annotators to leave feedback and comments. To ensure the quality of annotations, we required that annotators be located in the US and have at least a 98% approval rate over 1,000 prior HITs. In the validation task described in Section 4.4, we instructed the workers to judge the relevance of the inferences as detailed in Figure 4. In the human evaluation task described in Section 7.1, we presented the workers with the following descriptions of the Likert scale, exemplified with the sentence "I really wanted a dog." (we used the term "conclusion" rather than "inference" to simplify the terminology):

Figure 4: Instructions for the relevance annotation task (not extracted; please refer to the original document).

1. Doesn't Make Sense: The conclusion clearly contradicts the story or is nonsensical given the context. Example: "PersonX wanted: colorless green ideas."

2. Unclear or Neutral: The conclusion is completely unrelated to the story context, though not necessarily contradictory. Additional information is needed. Example: "PersonX wanted: a cat."

3. Plausible: The conclusion could be relevant, but is not directly related or slightly off-topic. Example: "PersonX needed: to move to a new house." (Explanation: pets require more space and it's possible PersonX's house is too small for pets).

4. Generally True: The conclusion is not stating information already in the story context, but it follows from the story and is very probable. Example: "PersonX needed: to go to the pet store".

5. Obviously True: The conclusion is clearly relevant to the story context. These conclusions restate information from the original story context. Example: "PersonX wanted: a dog".

Table 8 displays the full human evaluation results, and Table 9 shows additional generated examples.

Table 8: Full human evaluation results.
Table 9: Examples generated from the models in this paper: a discourse-agnostic (sentence-level) baseline vs. our discourse-aware PARA-COMET. We highlight the sentence that each inference was generated for in bold. Inferences are marked as plausible (✓) or implausible (✗).

A.4 Case Study Details

To adapt the dataset to our task, we keep the narrative contexts but leave out the aligned question/answer pairs. We restrict the set to five-sentence narratives, truncating longer ones, so that we can use our models pretrained on the ROCStories distant supervision data.

1 See Table 2 for a full list of inferential dimensions in ATOMIC.

2 Code and data will be made available upon publication.

3 We refer to PersonY in ATOMIC as the patient: one or more people who are affected or acted upon by the action of the verb. We do not make the semantic distinction between patient and theme.

4 We use the term discourse-aware to refer to data/systems that use paragraph-level information. Similarly, discourse-agnostic systems use only sentence-level information.

9 We use a maximum memory size (R_m) of 45 inferences and a maximum sequence length of 100 tokens.

11 We manually corrected incorrect parses, such as those in which the subject of the sentence is not a person.

12 We removed colons from the templates in Table 2, lowercased the inferences, converted verbs to their infinitive form using WordNet, and replaced PersonX with the pronoun "they".

Conclusion

We introduced a new task of discourse-aware commonsense inference over narratives. To target this task, we proposed a new model, PARA-COMET, trained using distant supervision, that captures narrative discourse. Despite the challenges of the task, we demonstrated the effectiveness of our approach using both automatic and human evaluations. In particular, our models were able to generate more implicit and novel discourse-aware inferences. In the future, we are interested in exploring extensions of our work to downstream paragraph- and narrative-level tasks that may benefit from access to commonsense knowledge.