Cosmos QA: Machine Reading Comprehension with Contextual Commonsense Reasoning

Lifu Huang
Ronan Le Bras
Chandra Bhagavatula
Yejin Choi
EMNLP/IJCNLP
2019

Abstract

Understanding narratives requires reading between the lines, which in turn requires interpreting the likely causes and effects of events, even when they are not mentioned explicitly. In this paper, we introduce Cosmos QA, a large-scale dataset of 35,600 problems that require commonsense-based reading comprehension, formulated as multiple-choice questions. In stark contrast to most existing reading comprehension datasets, where the questions focus on factual and literal understanding of the context paragraph, our dataset focuses on reading between the lines over a diverse collection of people's everyday narratives, asking such questions as "what might be the possible reason of ...?" or "what would have happened if ..." that require reasoning beyond the exact text spans in the context. To establish baseline performances on Cosmos QA, we experiment with several state-of-the-art neural architectures for reading comprehension, and also propose a new architecture that improves over the competitive baselines. Experimental results demonstrate a significant gap between machine (68.4%) and human performance (94%), pointing to avenues for future research on commonsense machine comprehension. Dataset, code and leaderboard are publicly available at https://wilburone.github.io/cosmos.

1 Introduction

Reading comprehension requires not only understanding what is stated explicitly in text, but also reading between the lines, i.e., understanding what is not stated yet obviously true (Norvig, 1987).

For example, after reading the first paragraph in Figure 1, we can understand that the writer is not a child, yet needs someone to dress him or her every morning, and appears frustrated with the current situation. Combining these clues, we can infer that the plausible reason for the writer being dressed by other people is that he or she may have a physical disability.

Figure 1: Examples of COSMOS QA. (✓ indicates the correct answer.) Importantly, (1) the correct answer is not explicitly mentioned anywhere in the context paragraph, thus requiring reading between the lines through commonsense inference and (2) answering the question correctly requires reading the context paragraph, thus requiring reading comprehension and contextual commonsense reasoning. Figure not extracted; please refer to original document.

As another example, in the second paragraph of Figure 1, we can infer that the woman was admitted to a psychiatric hospital, although not mentioned explicitly in text, and also that the job of the hospital staff is to stop patients from committing suicide. Furthermore, what the staff should have done, in the specific situation described, was to stop the woman from taking the elevator.

arXiv:1909.00277v1 [cs.CL] 31 Aug 2019

There are two important characteristics of the problems presented in Figure 1 . First, the correct answers are not explicitly mentioned anywhere in the context paragraphs, thus requiring reading between the lines through commonsense inference. Second, selecting the correct answer requires reading the context paragraphs. That is, if we were not provided with the context paragraph for the second problem, for example, the plausible correct answer could have been B or C instead.

In this paper, we focus on reading comprehension that requires contextual commonsense reasoning, as illustrated in the examples in Figure 1. Such reading comprehension is an important aspect of how people read and comprehend text, and yet, relatively less studied in the prior machine reading literature. To support research toward commonsense reading comprehension, we introduce COSMOS QA (Commonsense Machine Comprehension), a new dataset with 35,588 reading comprehension problems that require reasoning about the causes and effects of events, the likely facts about people and objects in the scene, and hypotheticals and counterfactuals. Our dataset covers a diverse range of everyday situations, with 21,886 distinct contexts taken from blogs of personal narratives.

The vast majority (93.8%) of our dataset requires contextual commonsense reasoning, in contrast with existing machine reading comprehension (MRC) datasets such as SQuAD (Rajpurkar et al., 2016), RACE (Lai et al., 2017), Narrative QA (Kočiskỳ et al., 2018), and MCScript (Ostermann et al., 2018), where only a relatively small portion of the questions (e.g., 27.4% in MCScript) require commonsense inference. In addition, the correct answer cannot be found in the context paragraph as a text span, thus we formulate the task as multiple-choice questions for easy and robust evaluation. However, our dataset can also be used for generative evaluation, as will be demonstrated in our empirical study.

To establish baseline performances on COSMOS QA, we explore several state-of-the-art neural models developed for reading comprehension. Furthermore, we propose a new architecture variant that is better suited for commonsense-driven reading comprehension. Still, experimental results demonstrate a significant gap between machine (68.4% accuracy) and human performance (94.0%). We provide detailed analysis to offer insights into potentially promising research directions.

2.1 Context Paragraphs

We gather a diverse collection of everyday situations from a corpus of personal narratives (Gordon and Swanson, 2009) from the Spinn3r Blog Dataset (Burton et al., 2009) . Appendix A provides additional details on data pre-processing.

2.2 Question And Answer Collection

We use Amazon Mechanical Turk (AMT) to collect questions and answers. Specifically, for each paragraph, we ask a worker to craft at most two questions that are related to the context and require commonsense knowledge. We encourage the workers to craft questions from but not limited to the following four categories:

• Causes of events: What may (or may not) be the plausible reason for an event?

• Effects of events: What may (or may not) happen before (or after, or during) an event?

• Facts about entities: What may (or may not) be a plausible fact about someone or something?

• Counterfactuals: What may (or may not) happen if an event happens (or did not happen)?

These 4 categories of questions cover all 9 types of social commonsense of [citation not extracted; please refer to original document]. Moreover, the resulting commonsense also aligns with 19 ConceptNet relations, e.g., Causes, HasPrerequisite and MotivatedByGoal, covering about 67.8% of ConceptNet types. For each question, we also ask a worker to craft at most two correct answers and three incorrect answers. We paid workers $0.7 per paragraph, which is about $14.8 per hour. Appendix B provides additional details on AMT instructions.

2.3 Validation

We create multiple tasks to have humans verify the data. Given a paragraph, a question, a correct answer and three incorrect answers,¹ we ask AMT workers to answer the following sequence of questions: (1) whether the paragraph is inappropriate or nonsensical, (2) whether the question is nonsensical or not related to the paragraph, (3) whether they can determine the most plausible correct answer, (4) if they can determine the correct answer, whether the answer requires commonsense knowledge, and (5) if they can determine the correct answer, whether the answer can be determined without looking at the paragraph.

We follow the same criterion as in Section 2.2 and ask 3 workers to work on each question set. Workers are paid $0.1 per question. We consider a question set valid if at least two workers correctly picked the intended answer and all of the workers judged the paragraph, question and answers to be satisfactory. In total, we obtain 33,219 valid question sets.
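The validity rule above can be illustrated with a minimal Python sketch. The `Judgment` record and its field names are hypothetical and only mirror validation questions (1) to (3); the sketch assumes that "satisfactory" means a worker flagged neither the paragraph nor the question, which is one reading of the criterion rather than the authors' exact implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Judgment:
    """One worker's validation of a question set (hypothetical field names)."""
    paragraph_ok: bool             # paragraph is appropriate and sensible
    question_ok: bool              # question is sensible and related to the paragraph
    picked_answer: Optional[int]   # index of the chosen answer, None if undetermined


def is_valid(judgments: List[Judgment], gold_index: int, min_correct: int = 2) -> bool:
    """Valid if every worker is satisfied and at least `min_correct` pick the gold answer."""
    all_satisfied = all(j.paragraph_ok and j.question_ok for j in judgments)
    n_correct = sum(j.picked_answer == gold_index for j in judgments)
    return all_satisfied and n_correct >= min_correct
```

With three workers, a question set passes when two of them pick the intended answer and none of them flags the paragraph or question.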

2.4 Unanswerable Question Creation

With human validation, we also obtain a set of questions for which workers can easily determine the correct answer without looking at the context or using commonsense knowledge. To take advantage of such questions and encourage AI systems to be more consistent with human understanding, we create unanswerable questions for COSMOS QA. Specifically, from the validation outputs, we collect 2,369 questions for which at least two workers correctly picked the answer and at least one worker determined that it is answerable without looking at the context or requires no common sense. We replace the correct choice of these questions with a "None of the above" choice.
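As a rough sketch of this step, the selection rule and the replacement can be written in a few lines of Python. The function names and the flag lists are hypothetical, and the sketch assumes that after the replacement the "None of the above" choice occupies the position of the original correct answer and becomes the gold label.

```python
NONE_OF_THE_ABOVE = "None of the above"


def is_context_free(picked, gold_index, no_context_flags, no_commonsense_flags):
    """At least two workers picked the gold answer and at least one worker said
    it needs no context or no common sense (flags are per-worker booleans)."""
    n_correct = sum(p == gold_index for p in picked)
    flagged = any(no_context_flags) or any(no_commonsense_flags)
    return n_correct >= 2 and flagged


def make_unanswerable(choices, gold_index):
    """Replace the correct choice with "None of the above" (assumed to stay gold)."""
    new_choices = list(choices)          # copy so the input is not mutated
    new_choices[gold_index] = NONE_OF_THE_ABOVE
    return new_choices, gold_index
```

For example, a question whose answer two of three workers picked correctly and one worker marked as answerable without the paragraph would be converted.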

To create false negative training instances, we randomly sample 70% of the questions from the 33,219 good question sets and replace their least challenging negative answer with "None of the above". Specifically, we fine-tune three BERT² next sentence prediction models on COSMOS: BERT(A|P, Q), BERT(A|P), and BERT(A|Q), where P, Q and A denote the paragraph, question, and answer. BERT(A|∗) denotes the likelihood of an answer A being the next sentence of ∗. The least challenging negative answer is determined by

$$\hat{A} = \arg\min_{A} \Big( \sum_{\forall * \subseteq \{P, Q\}} \mathrm{BERT}(A \mid *) \Big)$$

2.5 Train / Dev / Test Split

We finally obtain 35,588 question sets for our COSMOS dataset. To ensure that the development and test sets are of high quality, we identify a group of workers who excelled in the generation task for questions and answers, and randomly sample 7K question sets authored by these excellent workers as the test set, and 3K question sets as the development set. The remaining questions are all used as the training set. Table 1 shows dataset statistics.

Table 1: Statistics of training, dev and test sets of COSMOS QA.

2.6 Data Analysis

Figure 2 compares frequent trigram prefixes in COSMOS and SQuAD 2.0 (Rajpurkar et al., 2018). Most of the frequent trigram prefixes in COSMOS, e.g., "why", "what may happen", "what will happen", are almost absent from SQuAD 2.0, which demonstrates the unique challenge our dataset contributes.

Figure 2: Distribution of trigram prefixes of questions in COSMOS QA and SQuAD 2.0

Figure 3: Examples of each type of commonsense reasoning in COSMOS QA. (✓ indicates the correct answer.)

Table 2: The distribution of contextual commonsense reasoning types in COSMOS.

We randomly sample 500 answerable questions to manually categorize according to their contextual commonsense reasoning types. Figure 3 shows representative examples. Table 2 shows the distribution of the question types.

• Pre-/post-conditions: causes/effects of an event.

• Motivations: intents or purposes.

• Reactions: possible reactions of people or objects to an event.

• Temporal events: what events might happen before or after the current event.

• Situational facts: facts that can be inferred from the description of a particular situation.

• Counterfactuals: what might happen given a counterfactual condition.

• Other: other types, e.g., cultural norms.
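The trigram-prefix comparison behind Figure 2 can be approximated with a short Python sketch. It assumes lowercased whitespace tokenization, which may differ from the tokenization used to produce the figure, and the function names are illustrative only.

```python
from collections import Counter


def trigram_prefix(question):
    """First three lowercased whitespace tokens of a question."""
    return " ".join(question.lower().split()[:3])


def prefix_distribution(questions, top_k=5):
    """Relative frequency of the `top_k` most common trigram prefixes."""
    counts = Counter(trigram_prefix(q) for q in questions)
    total = sum(counts.values())
    return [(prefix, count / total) for prefix, count in counts.most_common(top_k)]
```

Running `prefix_distribution` separately on the questions of two datasets and comparing the resulting lists gives a rough view of how their question openings differ.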

2.7 BERT With Multiway Attention

Multiway attention (Wang et al., 2018a; Zhu et al., 2018) has been shown to be effective in capturing the interactions between each pair of input paragraph, question and candidate answers, leading to better context interpretation, while BERT fine-tuning (Devlin et al., 2018) also shows its prominent ability in commonsense inference. To further enhance the context understanding ability of BERT fine-tuning, we perform multiway bidirectional attention over the BERT encoding output. Figure 4 shows an overview of the architecture.

Figure 4: Architecture overview of BERT with multiway attention. Solid lines and blocks show the learning of multiway attentive context paragraph representation.

Figure 3 (excerpt), Pre-/Post-Condition example:

P1: We called Sha-sha and your Henry (grandma and grandpa - they came up with those names, don't blame me!) to alert them, and then called Uncle Danny. At around 2 am, with the contractions about 2 minutes apart, we headed to the hospital. When we got there I was only 2 cm dilated, but my blood pressure was high so they admitted me.

Q: Why is everyone rushing to the hospital?

A: There is someone sick at the hospital.

B: There is a sick grandpa.

C: There is a child to be birthed.

D: None of the above choices.

Encoding with Pre-trained BERT: Given a paragraph, a question, and a set of candidate answers, the goal is to select the most plausible correct answer from the candidates. We formulate the input paragraph as $P = \{p_0, p_1, ..., p_n\}$, the question as $Q = \{q_0, q_1, ..., q_k\}$ and a candidate answer as $A = \{a_0, a_1, ..., a_s\}$, where $p_i$, $q_i$ and $a_i$ are the i-th words of the paragraph, question and candidate answer respectively. Following Devlin et al. (2018), given the input P, Q and A, we apply the same tokenizer and concatenate all tokens into a new sequence, which is encoded by BERT. We write $H_P$, $H_Q$, $H_A$ and $H_{QA}$ for the encoding outputs of the paragraph, the question, the answer, and the question together with the answer.

Multiway Attention: To encourage better context interpretation, we perform multiway attention over the BERT encoding output. Taking the paragraph P as an example, we compute three types of attention weights to capture its correlation to the question, the answer, and both the question and answer, and get question-attentive, answer-attentive, and question and answer-attentive paragraph representations:

$$\tilde{H}_P = H_P W_t + b_t$$
$$M^{Q}_{P} = \mathrm{Softmax}(\tilde{H}_P H_Q^{\top}) H_Q$$
$$M^{A}_{P} = \mathrm{Softmax}(\tilde{H}_P H_A^{\top}) H_A$$
$$M^{QA}_{P} = \mathrm{Softmax}(\tilde{H}_P H_{QA}^{\top}) H_{QA}$$

where $W_t$ and $b_t$ are learnable parameters. Next we fuse these representations with the original encoding output of P:

$$F^{Q}_{P} = \sigma([H_P \odot M^{Q}_{P} : H_P - M^{Q}_{P}] W_P + b_P)$$
$$F^{A}_{P} = \sigma([H_P \odot M^{A}_{P} : H_P - M^{A}_{P}] W_P + b_P)$$
$$F^{QA}_{P} = \sigma([H_P \odot M^{QA}_{P} : H_P - M^{QA}_{P}] W_P + b_P)$$

where [:] denotes the concatenation operation and $\odot$ denotes element-wise multiplication. $W_P$ and $b_P$ are learnable parameters for fusing paragraph representations. $\sigma$ denotes the ReLU function. Finally, we apply column-wise max pooling on $[F^{Q}_{P} : F^{A}_{P} : F^{QA}_{P}]$ and obtain the new paragraph representation $F_P$. Similarly, we can also obtain new representations $F_Q$ and $F_A$ for Q and A respectively. We use $F = [F_P : F_Q : F_A]$ as the overall vector representation for the set of paragraph, question and a particular candidate answer.

Classification: For each candidate answer $A_i$, we compute the loss as follows:

$$\mathcal{L}(A_i \mid P, Q) = -\log \frac{\exp(W_f F_i)}{\sum_{j=1}^{4} \exp(W_f F_j)}$$
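To make the attention, fusion, pooling and loss steps of this section concrete, here is a minimal pure-Python sketch that operates on small nested lists instead of real BERT outputs. Several points are assumptions rather than the authors' implementation: the first fusion term is taken to be an element-wise product, the three fused matrices are concatenated along the feature axis before max pooling over tokens, and all function names are illustrative.

```python
import math


def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]


def attend(h_src, h_tgt, w_t, b_t):
    """Softmax((H_src W_t + b_t) H_tgt^T) H_tgt, one output row per source token."""
    h_proj = [[v + b for v, b in zip(row, b_t)] for row in matmul(h_src, w_t)]
    weights = [softmax(row) for row in matmul(h_proj, transpose(h_tgt))]
    return matmul(weights, h_tgt)


def fuse(h, m, w_p, b_p):
    """ReLU([h * m : h - m] W_p + b_p); the product term is an assumption."""
    feats = [[x * y for x, y in zip(hr, mr)] + [x - y for x, y in zip(hr, mr)]
             for hr, mr in zip(h, m)]
    return [[max(v + b, 0.0) for v, b in zip(row, b_p)] for row in matmul(feats, w_p)]


def paragraph_vector(h_p, h_q, h_a, w_t, b_t, w_p, b_p):
    """Fuse question-, answer- and question+answer-attentive views, then max-pool."""
    fused = [fuse(h_p, attend(h_p, tgt, w_t, b_t), w_p, b_p)
             for tgt in (h_q, h_a, h_q + h_a)]       # h_q + h_a stacks the tokens
    rows = [fq + fa + fqa for fq, fa, fqa in zip(*fused)]
    return [max(col) for col in zip(*rows)]          # max over tokens per column


def candidate_loss(scores, gold):
    """Negative log-probability of the gold candidate under a softmax over scores."""
    return -math.log(softmax(scores)[gold])
```

Here `scores` stands for the four values $W_f F_j$ of one question set, so `candidate_loss(scores, i)` corresponds to the loss for candidate $A_i$. The same `paragraph_vector` routine can be reused with the roles of P, Q and A swapped to obtain the question and answer vectors.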

2.8 Baseline Methods

We explore two categories of baseline methods: reading comprehension approaches and pre-trained language model based approaches.

Sliding Window (Richardson et al., 2013) measures the similarity of each candidate answer with each window of m words of the paragraph. Stanford Attentive Reader (Chen et al., 2016) performs a bilinear attention between the question and paragraph for answer prediction. Gated-Attention Reader (Dhingra et al., 2017) performs multi-hop attention between the question and the recurrent neural network based paragraph encoding states. Co-Matching (Wang et al., 2018b) captures the interactions between question and paragraph, as well as answer and paragraph, with attention. Commonsense-RC (Wang et al., 2018a) applies three-way unidirectional attention to model interactions between paragraph, question, and answers. GPT-FT (Radford et al., 2018) is based on a generative pre-trained transformer language model, followed by a fine-tuning step on COSMOS.

Human Performance: To get human performance on COSMOS QA, we randomly sample 200 question sets from the test set, and ask 3 workers from AMT to select the most plausible correct answer. Each worker is paid $0.1 per question set. We combine the predictions for each question by majority vote.

Table 3: Comparison of varying approaches (Accuracy %). Att: Attention, UD: Unidirectional, BD: Bidirectional. The table is truncated after the BERT-FT row in the extracted text; please refer to original document.

| Model | Attention / LM characteristics | Dev | Test |
|---|---|---|---|
| Sliding Window (Richardson et al., 2013) | | 25.0 | 24.9 |
| Stanford Attentive Reader (Chen et al., 2016) | UD | 45.3 | 44.4 |
| Gated-Attention Reader (Dhingra et al., 2017) | Multi-hop UD | 46.9 | 46.2 |
| Co-Matching (Wang et al., 2018b) | UD, UD | 45.9 | 44.7 |
| Commonsense-RC (Wang et al., 2018a) | UD, UD, UD | 47.6 | 48.2 |
| GPT-FT (Radford et al., 2018) | UD | 54.0 | 54.4 |
| BERT-FT (Devlin et al., 2018) | not extracted | not extracted | not extracted |

2.9 Results And Analysis

Table 3 shows the characteristics and performance of varying approaches and human performance (Appendix C shows the implementation details). Most of the reading comprehension approaches apply attention to capture the correlation between paragraph, question and each candidate answer, and tend to select the answer that is semantically closest to the paragraph. For example, in Figure 5, the Commonsense-RC baseline mistakenly selected the choice which has the most overlapping words with the paragraph, without any commonsense reasoning. However, our analysis shows that more than 83% of correct answers in COSMOS QA are not stated in the given paragraphs, thus simply comparing semantic relatedness does not work well. Pre-trained language models with fine-tuning achieve more than 20% improvement over reading comprehension approaches.

Figure 5: Prediction comparison between our approach (B) with Commonsense-RC (C) and BERT-FT (A).

By performing attention over BERT-FT, the performance is further improved, which supports our assumption that incorporating interactive attentions can further enhance the context interpretation of BERT-FT. For example, in Figure 5, BERT-FT mistakenly selected choice A, which could possibly be entailed by the paragraph. However, by performing multiway attention to further enhance the interactive comprehension of context, question and answer, our approach successfully selected the correct answer.

3.1 Ablation Study

Many recent studies have suggested the importance of measuring dataset bias by checking model performance based on partial information of the problem (Gururangan et al., 2018; Cai et al., 2017). Therefore, we report a problem ablation study in Table 4 using BERT-FT as a sim- [remainder of the passage not extracted; please refer to original document]. Ablating other components of the problems causes more significant drops in performance.

Table 4. Not extracted; please refer to original document.

Figure 5 example:

P: I cleaned the two large bottom cupboards and threw a ton of old stuff away. Dustin's parents like to drop off boxes of food like we're refugees or something. It's always appreciated, and some of it is edible. Most of what I threw away was from last year when Dustin's great-aunt was moving into her new apartment home (retirement center) and they cleaned out her pantry.

Q: What is the most likely reason that I decided to clean the cupboards?

A: I was getting tired of having food in the house.

B: We were getting more food and needed to create room.

C: Dustin and I split up and I need to get rid of his old stuff.

D: None of the above choices.

3.2 Knowledge Transfer Through Fine-Tuning

Recent studies (Howard and Ruder, 2018; Min et al., 2017; Devlin et al., 2018) have shown the benefit of fine-tuning on similar tasks or datasets for knowledge transfer. Considering the unique challenge of COSMOS, we explore two related multiple-choice datasets for knowledge transfer: RACE (Lai et al., 2017), a large-scale reading comprehension dataset, and SWAG (Zellers et al., 2018), a large-scale commonsense inference dataset. Specifically, we first fine-tune BERT on RACE or SWAG or both, and directly test on COSMOS to show the impact of knowledge transfer. Furthermore, we sequentially fine-tune BERT on RACE or SWAG and then on COSMOS. As Table 5 shows, with direct knowledge transfer, RACE provides significantly more benefit than SWAG, since COSMOS requires more understanding of the interaction between paragraph, question and each candidate answer. With sequential fine-tuning, SWAG provides better performance, which indicates that with fine-tuning on SWAG, BERT can obtain better commonsense inference ability, which is also beneficial to COSMOS.

Table 5: Knowledge transfer through fine-tuning. (%)

Example from a figure in the original document:

P1: A woman had topped herself by jumping off the roof of the hospital she had just recently been admitted to. She was there because the first or perhaps latest suicide attempt was unsuccessful. She put her clothes on, folded the hospital gown and made the bed. She walked through the unit unimpeded and took the elevator to the top floor.

Q: What would have happened to the woman if the staff at the hospital were doing their job properly?

A: The woman would have been stopped before she left to take the elevator to the top floor and she would have lived.

B: She would have been ushered to the elevator with some company.

C: She would have managed to get to the elevator quicker with some assistance.

D: None of the above choices.

3.3 Error Analysis

We randomly select 100 errors made by our approach from the dev set, and identify 4 phenomena:

Complex Context Understanding: In 30% of the errors, the context requires complicated cross-sentence interpretation and reasoning. Taking P1 in Figure 6 as an example, to correctly predict the answer, we need to combine the context information that the woman attempted suicide before but failed, that she made the bed because she had decided to leave, and that she took the elevator and headed to the roof, and infer that the woman was attempting suicide again.

Inconsistent with Human Common Sense: In 33% of the errors, the model mistakenly selected a choice which is not consistent with human common sense. For example, in P2 of Figure 6, both choice A and choice C could be potentially correct answers. However, from human common sense, it is not safe to leave a baby alone at home.

Multi-turn Commonsense Inference: 19% of the errors are due to multi-turn commonsense inference. For example, in P3 of Figure 6, the model needs to first determine, using common sense, that the cause of the headache is that she chatted with friends and forgot to sleep. Further, with counterfactual reasoning, if she had not chatted with her friends, then she would not have gotten up with a headache.

Unanswerable Questions: 14% of the errors are from unanswerable questions. The model cannot handle "None of the above" properly since it cannot be directly entailed by the given paragraph or the question. Instead, the model needs to compare the plausibility of all the other candidate choices.

Figure 6. Not extracted; please refer to original document.

3.4 Generative Evaluation

In the real world, humans are usually asked to perform contextual commonsense reasoning without being provided with any candidate answers.

To test machines for human-level intelligence, we leverage a state-of-the-art natural language generator, GPT2 (Radford et al., 2019), to automatically generate an answer by reading the given paragraph and question. Specifically, we fine-tune a pre-trained GPT2 language model on all the (Paragraph, Question, Correct Answer) triples of the COSMOS training set; then, given each (Paragraph, Question) pair from the test set, we use GPT2-FT to generate a plausible answer. We automatically evaluate the generated answers against human authored correct answers with varying metrics in Table 6. We also create an AMT task to have 3 workers select all plausible answers among 4 automatically generated answers and a "None of the above" choice for 200 question sets. We consider an answer as correct only if all 3 workers determined it as correct. Figure 7 shows examples of automatically generated answers by pre-trained GPT2 and GPT2-FT as well as human authored correct answers. We observe that by fine-tuning on COSMOS, GPT2-FT generates more accurate answers. Although intuitively there may be multiple correct answers to the questions in COSMOS QA, our analysis shows that more than 84% of the generated answers identified as correct by humans are semantically consistent with the gold answers in COSMOS, which demonstrates that COSMOS can also be used as a benchmark for generative commonsense reasoning. Appendix E shows more details and examples for generative evaluation.
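The unanimity rule used in the human evaluation can be sketched as follows. The data layout is an assumption: each worker's response is represented as the set of indices of the generated answers they marked plausible, with an empty set standing for "None of the above"; the function name is illustrative.

```python
def unanimous_correct(selections, num_answers=4):
    """Mark a generated answer correct only if every worker selected it.

    selections: one set of selected answer indices per worker
                (an empty set stands for "None of the above").
    """
    return [all(i in chosen for chosen in selections) for i in range(num_answers)]
```

For instance, with three workers selecting `{0, 2}`, `{0}` and `{0, 2, 3}`, only answer 0 counts as correct under this rule.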

Figure 7: Examples of human authored correct answers, and automatically generated answers by pre-trained GPT2 and GPT2-FT. (✓ indicates the answer is correct while ✗ shows that the answer is incorrect.) The marks and some system labels were not extracted; please refer to original document.

P1: I cleaned the two large bottom cupboards and threw a ton of old stuff away. Dustin's parents like to drop off boxes of food like we're refugees or something. It's always appreciated, and some of it is edible. Most of what I threw away was from last year when Dustin's great-aunt was moving into her new apartment home (retirement center) and they cleaned out her pantry.

Q: What is the most likely reason that I decided to clean the cupboards?

Human 1: We were getting more food and needed to create room.

GPT2-FT: I had gone through everything before and it was no longer able to hold food.

GPT2: I had never cleaned cupboards before when I moved here.

P2: My head hurts. I had so much fun at a chat with some scrap friends last Saturday night that I forgot to sleep. I ended up crawling into bed around 7AM.

Q: What may have happened if she didn't chat to her scrap friends?

Human 1: She would go to bed and sleep better.

Human 2: She would not have gotten up with a headache.

GPT2: She was so happy that I woke her up early, just in time to get her back to sleep.

Unlabeled (system label not extracted): She would have gotten up early and spend the night in bed.

P3: Bertrand Berry has been announced as out for this Sunday's game with the New York Jets. Of course that comes as no surprise as he left the Washington game early and did not practice yesterday. His groin is now officially listed as partially torn.

Q: What might happen if his groin is not healed in good time?

Human 1: He will be benched for the rest of the season because of his injury.

Unlabeled (system label not extracted): He may miss the next few games.

Unlabeled (system label not extracted): We can expect him to be out for the rest of the week as the season progresses.

Table 6: Generative performance of pre-trained GPT2 and GPT2-FT on COSMOS QA. All automatic metric scores are averaged from 10 sets of sample output.

| Metrics | GPT2 | GPT2-FT |
|---|---|---|
| BLEU (Papineni et al., 2002) | 10.7 | 21.0 |
| METEOR (Banerjee and Lavie, 2005) | 7.2 | 8.6 |
| ROUGE-L (Lin, 2004) | 13.9 | 22.1 |
| CIDEr (Vedantam et al., 2015) | 0.05 | 0.17 |
| BERTScore F1 (Zhang et al., 2019b) | 41.9 | 44.5 |
| Human | 11.0% | 29.0% |

Table 7: Comparison of the COSMOS QA to other multiple-choice machine reading comprehension datasets: P: contextual paragraph, Q: question, A: answers, MC: Multiple-choice, and - means unknown.

4 Related Work

There have been many exciting new datasets developed for reading comprehension, such as SQuAD (Rajpurkar et al., 2016), NEWSQA (Trischler et al., 2017), SearchQA (Dunn et al., 2017), NarrativeQA (Kočiskỳ et al., 2018), ProPara (Mishra et al., 2018), CoQA (Reddy et al., 2018), ReCoRD (Zhang et al., 2018), MCTest (Richardson et al., 2013), RACE (Lai et al., 2017), CNN/Daily Mail (Hermann et al., 2015), Children's Book Test (Hill et al., 2015), and MCScript (Ostermann et al., 2018). Most of these datasets focus on relatively explicit understanding of the context paragraph, thus a relatively small or unknown fraction of the dataset requires commonsense reasoning, if at all. A notable exception is ReCoRD (Zhang et al., 2018), which is designed specifically for challenging reading comprehension with commonsense reasoning. COSMOS complements ReCoRD with three unique challenges: (1) Our context is from weblogs rather than news, thus requiring commonsense reasoning for everyday events rather than news-worthy events. (2) All the answers of ReCoRD are contained in the paragraphs and are assumed to be entities. In contrast, in COSMOS, more than 83% of answers are not stated in the paragraphs, creating unique modeling challenges. (3) COSMOS can be used for generative evaluation in addition to multiple-choice evaluation.

There have also been other datasets focusing specifically on question answering with commonsense, such as CommonsenseQA (Talmor et al., 2018) and Social IQa (Sap et al., 2019), and various other types of commonsense inferences (Levesque et al., 2012; Rahman and Ng, 2012; Gordon, 2016; Roemmele et al., 2011; Mostafazadeh et al., 2017; Zellers et al., 2018). The unique contribution of COSMOS is combining reading comprehension with commonsense reasoning, requiring contextual commonsense reasoning over considerably more complex, diverse, and longer context. Table 7 shows a comprehensive comparison among the most relevant datasets.

There has been a wide range of attention mechanisms developed for reading comprehension datasets (Hermann et al., 2015; Kadlec et al., 2016; Chen et al., 2016; Dhingra et al., 2017; Seo et al., 2016; Wang et al., 2018b). Our work investigates various state-of-the-art approaches to reading comprehension, and provides empirical insights into the design choices that are the most effective for the contextual commonsense reasoning required for COSMOS.

5 Conclusion

We introduced COSMOS QA, a large-scale dataset for machine comprehension with contextual commonsense reasoning. We also presented extensive empirical results comparing various state-of-the-art neural architectures for reading comprehension, and demonstrated a new model variant that leads to the best result. The substantial headroom (25.6%) between the best model performance and human performance encourages future research on contextual commonsense reasoning.

¹ If a question is crafted with two correct answers, we create two question sets, each with one of the correct answers and the same three incorrect answers.

² Throughout the paper, BERT refers to the pre-trained BERT large uncased model from https://github.com/huggingface/pytorch-pretrained-BERT

https://spacy.io/

Figure 8: Performance on COSMOS with various amounts of training data