Are You Smarter Than a Sixth Grader? Textbook Question Answering for Multimodal Machine Comprehension
We introduce the task of Multi-Modal Machine Comprehension (M3C), which aims at answering multimodal questions given a context of text, diagrams and images. We present the Textbook Question Answering (TQA) dataset that includes 1,076 lessons and 26,260 multi-modal questions, taken from middle school science curricula. Our analysis shows that a significant portion of questions require complex parsing of the text and the diagrams and reasoning, indicating that our dataset is more complex compared to previous machine comprehension and visual question answering datasets. We extend state-of-the-art methods for textual machine comprehension and visual question answering to the TQA dataset. Our experiments show that these models do not perform well on TQA. The presented dataset opens new challenges for research in question answering and reasoning across multiple modalities.
[Figure 1, excerpts from an example lesson]
In some ways, a cell resembles a plastic bag full of Jell-O. Its basic structure is a cell membrane filled with cytoplasm. The cytoplasm of a eukaryotic cell is like Jell-O containing mixed fruit. It also contains a nucleus and other organelles.
The image below shows the Prokaryotic cell. A prokaryote is a single-celled organism that lacks a membrane-bound nucleus (karyon), mitochondria, or any other membrane-bound organelle. In the prokaryotes, all the intracellular water-soluble components (proteins, DNA and metabolites) are located together in the cytoplasm enclosed by the cell membrane, rather than in separate cellular compartments.
We obtained a set of instructional diagrams per lesson using the same method as above, de-duplicating diagrams that were already present in the lessons and diagrams that accompanied questions. Rich captions for this set of diagrams were also obtained using crowd-sourcing. Each human subject was provided with examples of rich captions, the lesson and a diagram and was asked to write down rich captions using the vocabulary and scientific concepts explained in the lesson.
[Figure 1, example question] What is the outer surrounding part of the Nucleus?
Question answering (QA) has been a major research focus of the natural language processing (NLP) community for several years and more recently has also gained significant popularity within the computer vision community.
There have been several QA paradigms in NLP, which can be categorized by the knowledge used to answer questions. This knowledge can range from structured and confined knowledge bases (e.g., Freebase [4, 3]) to unstructured and unbounded natural language (e.g., documents on the web). A middle ground between these approaches has been the popular paradigm of Machine Comprehension (MC) [20, 18], where the knowledge (often referred to as the context) is unstructured and restricted in size to a short set of paragraphs.
Question answering in the vision community, referred to as Visual Question Answering (VQA), has become popular, in part due to the availability of large image-based QA datasets [17, 19, 29, 1, 30, 9]. In a sense, VQA is a machine comprehension task, where the question is in natural language form, and the context is the image.
World knowledge is multi-modal in nature, spread across text documents, images and videos. A system that can answer arbitrary questions about the world must learn to comprehend these multi-modal sources of information. We thus propose the task of Multi-Modal Machine Comprehension (M3C), an extension of traditional textual machine comprehension to multi-modal data. In this paradigm, the task is to read a multi-modal context along with a multi-modal question and provide an answer, which may also be multi-modal in nature. This is in contrast with the conventional question answering task, in which the context is usually drawn from a single modality (either language or vision).
In contrast to the VQA paradigm, M3C also has an advantage from a modelling perspective. VQA tasks typically require common sense knowledge to answer many questions, in addition to the image itself. For example, the question "Does this person have 20/20 vision?" from the VQA dataset requires the system to detect eyeglasses and then use the common sense that a person with perfect or 20/20 vision would typically not wear eyeglasses. This need for common sense makes the QA task more interesting, but also leads to an unbounded knowledge resource. Since automatically acquiring common sense knowledge is a very difficult task (with a large body of ongoing research), it is common practice to train systems for VQA solely on the training splits of these datasets. The resulting systems can thus only be expected to answer questions that require common sense knowledge implicitly contained within the questions in the training splits. The knowledge required for M3C, on the other hand, is bounded to the multi-modal context supplied with the question. This makes knowledge acquisition more manageable and makes M3C a good test bed for visual and textual reasoning.
Towards this goal, we present the Textbook Question Answering (TQA) dataset drawn from middle school science curricula (Figure 1). The textual and diagrammatic content in middle school science references fairly complex phenomena that occur in the world. Our analysis in Section 4 shows that parsing this linguistic and visual content is fairly challenging and that a significant proportion of questions posed to students at this level require reasoning. This makes TQA a good test bed for the M3C paradigm. TQA consists of 1,076 lessons containing 78,338 sentences and 3,455 images (including diagrams). Each lesson has a set of questions which are answerable using the content taught in the lesson. The TQA dataset has 26,260 questions, 12,567 of which have an accompanying diagram, and is split into training, validation and test sets at the lesson level.
We describe the Textbook Question Answering (TQA) dataset in Section 3 and provide an in-depth analysis of the lesson contexts, questions and answer sources in Section 4. We also provide baselines in Section 5 using models that have been proven to work well in other MC and VQA tasks. These models extend attention mechanisms between query and context, where the context (visual and textual) is fit within a memory. Our experiments show that these models do not work very well on TQA. This is presumably due to the following reasons: (1) the length of the context (lessons) is very large and training an attention network (such as Memory Networks) of this size is non-trivial, and there are many different modalities of information that need to be combined into the memory; (2) most questions cannot be answered by simple lookup, require information from multiple sentences and/or images, and require non-trivial reasoning; (3) current approaches for multi-hop reasoning work well on synthetic data like bAbI, but are hard to train in a general setting such as this dataset. These challenges offered by the TQA dataset make it a valuable resource for the vision and natural language communities, and we encourage other researchers to work on this challenging task. TQA can be downloaded at http://textbookqa.org.
Visual Question Answering. There has been a surge of interest in the field of language and vision over the past few years, most notably in the area of visual question answering. This has in part been motivated by the availability of large image and video question answering datasets.
The DAQUAR dataset was one of the earliest question answering datasets in the image domain. Soon after, much larger datasets including COCO-QA, FM-IQA, Visual Madlibs and VQA were released. Each of these four datasets obtained images from the Microsoft COCO dataset. While COCO-QA questions were automatically generated, the remaining datasets used human annotators to write questions. In contrast to our TQA dataset, in all these datasets the question is in natural language form, and the context is an image. More recently, Zhu et al. released the Visual7W dataset, which contained multiple choice visual answers in addition to textual answers. While most past works and datasets in the field of question answering in language and vision focused on images, researchers have also made inroads using videos. Tapaswi et al. released the Movie-QA dataset, which requires the system to analyze clips in the movie to answer questions. They also provide movie subtitles, plots and scripts as additional information sources.
The presented TQA dataset differs from the above datasets in the following ways. First, the contexts as well as the questions are multi-modal in nature. Second, in contrast to the above VQA paradigm (learn from question-answer pairs and test on question-answer pairs), TQA uses the proposed paradigm of M3C (read a context and answer questions; learn from context-question-answer tuples and test on context-question-answer tuples). In contrast to the VQA paradigm, which often requires unbounded common-sense knowledge to answer many questions, the M3C paradigm confines the knowledge required to the accompanying context. Another big difference arises from the use of science textbooks and science diagrams in TQA as compared to natural images in past datasets. Science diagrams often represent complex concepts, such as events or systems, that are difficult to portray in a single natural image. Along with the middle school science concepts explained in the lesson text, these images lend themselves more easily to questions that require reasoning. Hence TQA serves as a great QA test bed with confined knowledge acquisition and reasoning.
Early works on visual question answering (VQA) involved encoding the question using a Recurrent Neural Network, encoding the image using a Convolutional Neural Network and combining them to answer the question [1, 17]. Subsequently, attention mechanisms were successfully employed in VQA, whereby either the question in its entirety or the individual words attend to different patches in the image [30, 27, 28]. More recently, attention has been employed both ways, between the text and the image, and shown to be beneficial. The winner of the recent VQA workshop employed Multimodal Compact Bilinear Pooling at the attention layer instead of the commonly used element-wise product/concatenation mechanisms. Our baselines show that networks with standard attention models do not perform very well on the TQA dataset and we discuss the reasons with possible solutions in Section 5.
Machine Comprehension in NLP. Akin to the availability of several VQA datasets in computer vision, the NLP community has introduced several machine comprehension (MC) datasets over the past few years. Cloze datasets (where the system is asked to fill in words that have been removed from a passage), including CNN and DailyMail as well as the Children's Book Test, are a good proxy for the traditional MC tasks and have the added benefit of being automatically produced. More traditional MC datasets such as MCTest were limited in size, but recently larger ones such as the Stanford Question Answering (SQuAD) dataset have been introduced with 100,000 questions.
Attention mechanisms, largely inspired by Bahdanau et al., have become very popular in textual MC systems. There are several variations on the use of attention, including dynamic attention [10, 6], where the attention weights at a time step depend on the attention weights at previous time steps. Another popular technique is based on Memory Networks [26, 27] with a multi-hop approach, where the attention layer is followed by a query summarization stage whose output is then fed into further rounds of attention on the memory.
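The multi-hop idea described above can be illustrated with a small sketch. This is not the baseline used in this paper: the vectors are hand-written rather than learned embeddings, and the query update (adding the attention summary to the query) is one simple choice assumed here for illustration. The function names `attend` and `multi_hop` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(query, memory):
    """One hop: attention weights over memory slots and their weighted sum."""
    weights = softmax([dot(query, slot) for slot in memory])
    summary = [
        sum(w * slot[i] for w, slot in zip(weights, memory))
        for i in range(len(query))
    ]
    return weights, summary


def multi_hop(query, memory, hops=2):
    """Repeat attention, folding each summary back into the query.

    Assumption: the update is a plain vector sum; real systems usually
    apply learned projections at this step.
    """
    weights = []
    for _ in range(hops):
        weights, summary = attend(query, memory)
        query = [q + s for q, s in zip(query, summary)]
    return weights, query


# Toy memory with two slots; the query is aligned with the first slot.
memory = [[1.0, 0.0], [0.0, 1.0]]
weights, final_query = multi_hop([1.0, 0.0], memory, hops=2)
```

In this toy setting the second hop attends even more strongly to the first slot than the first hop did, which is the sharpening effect that repeated rounds of attention are meant to provide.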
The release of the SQuAD dataset has led to a number of new approaches proposed for the task of MC. We extended the approach by Seo et al., which currently lies at position 2 on the SQuAD leaderboard, to adapt it to our Multimodal MC task. Our results show that on the text questions, the absolute accuracy is lower than the accuracy it achieves on the SQuAD dataset. This, along with our analysis in Section 4, indicates that TQA is quite challenging and warrants further research.
3. TQA Dataset
We now describe the Textbook Question Answering dataset and provide an in-depth analysis in Section 4.
3.1. Dataset Structure
The Textbook Question Answering (TQA) dataset is drawn from middle school science curricula. It consists of 1,076 lessons from Life Science, Earth Science and Physical Science textbooks downloaded from http://www.ck12.org. This material conforms to national and state curriculum guidelines and is actively being used by teachers and students in the United States and worldwide.
Lessons. Figure 1 shows an overview of the dataset. Each lesson consists of textual content, in the form of paragraphs of text, as well as visual content, consisting of diagrams and natural images. Each lesson also comes with a Vocabulary Section, which provides definitions of scientific concepts introduced in that lesson, and a Lesson Summary, which is typically restricted to five sentences and summarizes the key concepts in that lesson. In total, the 1,076 lessons consist of 78,338 sentences and 3,455 images. In addition, lessons also contain links to online Instructional Videos (totalling 2,156 videos across all lessons) which explain concepts with more visual illustrations.
Instructional Diagrams. We found that the textual content in the textbooks was very comprehensive and sufficient to understand the concepts presented in the lesson. However, the textual content and image captions did not comprehensively describe the images presented in the lessons. As a result, the lessons were not sufficient to understand the concepts and answer all questions with diagrams. We conjecture that this knowledge gap is filled by teachers in the classroom, explaining a concept and an accompanying diagram on the whiteboard. To bridge this gap in the dataset, we added a small set of diagrams (typically between three and five), which we refer to as Instructional Diagrams, to lessons in the textbooks that have diagram questions (Section 3.2). We also added rich captions describing the scientific concepts illustrated in the diagram. An example is shown in Figure 1.
Questions. Each lesson has a set of multiple choice questions that address concepts taught in that lesson. The number of choices varies from two to seven. TQA has a total of 26,260 questions, including 12,567 that have an accompanying diagram. We henceforth refer to questions with a diagram as diagram questions, and ones without as text questions.
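As a rough illustration of the question structure described above, a multiple choice question could be represented as follows. This is a sketch only: the class name, field names and example values are hypothetical and do not reflect the format of the released dataset files.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Question:
    """A multiple choice question; all field names are illustrative."""

    question_id: str
    text: str
    choices: Dict[str, str]  # choice label -> answer text
    correct: str  # label of the correct choice
    diagram_path: Optional[str] = None  # set only for diagram questions

    def __post_init__(self):
        # The paper states that the number of choices varies from two to seven.
        if not 2 <= len(self.choices) <= 7:
            raise ValueError("expected between two and seven choices")
        if self.correct not in self.choices:
            raise ValueError("correct label must be one of the choices")

    @property
    def is_diagram_question(self) -> bool:
        return self.diagram_path is not None


# Made-up example instances.
text_q = Question("q1", "A prokaryote has a nucleus.",
                  {"a": "True", "b": "False"}, "b")
diagram_q = Question("q2", "What is the outer surrounding part of the Nucleus?",
                     {"a": "option one", "b": "option two", "c": "option three"},
                     "a", diagram_path="diagram_0001.png")
```

Distinguishing the two question types by the presence of a diagram mirrors the terminology used in the rest of the paper (text questions versus diagram questions).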
Dataset Splits. TQA is split into training, validation and test sets at the lesson level. The training set consists of 666 lessons and 15,154 questions, the validation set consists of 200 lessons and 5,309 questions and the test set consists of 210 lessons and 5,797 questions. Occasionally, multiple lessons overlap in the concepts they teach. Care has been taken to group these lessons before splitting the data, so as to minimize the concept overlap between data splits.
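The reported split sizes are consistent with the dataset totals: 666 + 200 + 210 = 1,076 lessons and 15,154 + 5,309 + 5,797 = 26,260 questions. The paper does not spell out how overlapping lessons were grouped, so the following is only a minimal sketch of one way to do a group-aware split, assuming each lesson comes with a set of concept labels. The function names, the target fractions and the greedy assignment rule are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def group_lessons(lesson_concepts):
    """Merge lessons that share at least one concept (union-find)."""
    parent = {lesson: lesson for lesson in lesson_concepts}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_seen = {}  # concept -> first lesson that mentioned it
    for lesson, concepts in lesson_concepts.items():
        for concept in concepts:
            if concept in first_seen:
                parent[find(lesson)] = find(first_seen[concept])
            else:
                first_seen[concept] = lesson

    groups = defaultdict(list)
    for lesson in lesson_concepts:
        groups[find(lesson)].append(lesson)
    return list(groups.values())


def split_by_group(lesson_concepts, fractions=(0.6, 0.2, 0.2), seed=0):
    """Assign whole groups to splits so that no group is divided.

    Assumption: each group goes to the split that is currently furthest
    below its target size; the fractions are illustrative.
    """
    names = ("train", "val", "test")
    groups = group_lessons(lesson_concepts)
    random.Random(seed).shuffle(groups)
    total = len(lesson_concepts)
    splits = {name: [] for name in names}
    for group in groups:
        deficits = {
            name: fractions[i] * total - len(splits[name])
            for i, name in enumerate(names)
        }
        target = max(names, key=lambda name: deficits[name])
        splits[target].extend(group)
    return splits


# Made-up lessons and concept labels.
lessons = {
    "L1": {"cell", "nucleus"},
    "L2": {"nucleus"},
    "L3": {"volcano"},
    "L4": {"volcano", "magma"},
    "L5": {"atom"},
}
splits = split_by_group(lessons)
```

Because groups are assigned as a unit, lessons that share a concept (here L1 and L2, and L3 and L4) always land in the same split, at the cost of split sizes that only approximate the target fractions.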
3.2. Dataset Curation
The lessons in the TQA dataset are obtained from the Life Science, Earth Science and Physical Science Textbooks and Web Concepts downloaded from the CK-12 website. Lessons contain text, images, links to instructional videos, vocabulary definitions and lesson summaries. Questions are obtained from Workbooks and Quizzes from the website. Additional diagram questions and instructional diagrams are obtained using crowd-sourcing.
Diagram Questions. Our initial analysis showed that the number of diagram questions was very small compared to the number of text questions. In part, this is due to the fact that diagram questions are harder to generate. To supplement this set, we obtained a list of scientific topics from each lesson, used these as queries to Google Image Search and downloaded the top results. These were manually filtered down to images that had content similar to the lessons. We thus obtained 2,749 diagrams spread across 85 lessons. Multiple choice questions for these diagrams were then obtained using crowd-sourcing. Each human subject was provided with the full lesson and a diagram and was asked to write down a middle school science question that required the diagram to answer it correctly, and was answerable using the provided lesson.
4. TQA Analysis
In this section we provide an analysis of the lesson contexts, questions, answers and the information content needed to answer questions in the TQA dataset. Figure 2 shows the distribution of the number of sentences and images across the lessons in the dataset. About 50% of lessons have 5-10 images and more than 75% of the lessons have more than 50 sentences. Lessons in TQA are typically longer than the contexts in past MC datasets such as SQuAD, making it difficult to fit the entire context into memory and then attend to it. This suggests the need for either an Information Retrieval based preprocessing step or a hierarchical model such as Hierarchical Memory Networks. Furthermore, the multi-modal nature of the contexts in lessons and questions poses new challenges and warrants further research.
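An Information Retrieval based preprocessing step of the kind suggested above could, for instance, rank lesson sentences against the question and pass only the top few to the model. The sketch below uses a simple TF-IDF overlap score; the scoring function, the tokenizer and the name `retrieve` are assumptions for illustration and are not the method used for the baselines.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(question, sentences, k=3):
    """Return up to k sentences with the highest TF-IDF overlap score."""
    docs = [Counter(tokenize(sentence)) for sentence in sentences]
    n_docs = len(docs)
    # Document frequency: in how many sentences each term occurs.
    doc_freq = Counter(term for doc in docs for term in doc)

    def idf(term):
        # Smoothed inverse document frequency.
        return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0

    query_terms = set(tokenize(question))
    scored = []
    for index, doc in enumerate(docs):
        score = sum(doc[term] * idf(term) for term in query_terms)
        scored.append((score, index))
    # Highest score first; ties keep the original sentence order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [sentences[i] for score, i in scored[:k] if score > 0]


# Made-up lesson sentences.
lesson = [
    "The cytoplasm of a eukaryotic cell contains a nucleus and other organelles.",
    "Volcanoes erupt when magma reaches the surface.",
    "A prokaryote lacks a membrane-bound nucleus.",
]
top = retrieve("Which cell lacks a nucleus?", lesson, k=3)
```

Sentences with no term in common with the question are dropped entirely, so the context handed to a downstream attention model can be much shorter than the full lesson.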
Text Questions. Figure 3(a) shows the distribution of the length of questions in the dataset. This distribution shows that compared to VQA, TQA has longer questions (the mode of the distribution here is 8 compared to 5 for VQA). Figure 3(b) shows the distribution of the questions across the W categories (what, where, when, who, why, how and which). Interestingly, the Other category has a fair number of questions. Further analysis shows that a good fraction of questions written down in standard workbooks are assertive statements as opposed to interrogative statements. This could be another reason why baseline models in Section 5 perform poorly on the dataset.
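The paper does not describe how questions were assigned to W categories. One simple heuristic, assumed here purely for illustration, is to take the first W word that appears in the question and fall back to Other when none is present, which is where assertive statements would end up.

```python
import re
from collections import Counter

W_WORDS = ("what", "where", "when", "who", "why", "how", "which")


def w_category(question):
    """Return the first W word in the question, or "other" if there is none.

    Assumption: a keyword heuristic; a statement that happens to contain
    a W word (e.g. "when" used as a conjunction) would be miscategorized.
    """
    for token in re.findall(r"[a-z]+", question.lower()):
        if token in W_WORDS:
            return token
    return "other"


# Made-up questions.
questions = [
    "What is the outer surrounding part of the Nucleus?",
    "In which layer does weather occur?",
    "The cytoplasm contains organelles.",
]
distribution = Counter(w_category(q) for q in questions)
```

Counting the categories over a set of questions gives a distribution of the kind plotted in Figure 3(b).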
Diagram Questions. The diagrams in the questions of the TQA dataset are similar to the diagrams in the questions of the AI2D dataset presented by Kembhavi et al. in terms of content and complexity. Kembhavi et al. propose using diagram parse graphs to represent diagrams and use a hierarchical representation of constituents and relationships. We analysed AI2D and found that there is a high correlation between the complexity of a diagram (measured by the number of constituents and relationships in the diagram) and the number of text boxes located in that diagram. Figure 3(c) shows the distribution of the number of text boxes across the diagrams in the questions in the TQA dataset as a proxy for the distribution of diagram complexity. This shows that the diagrams in the questions are quite complex, and further analysis below shows that a rich parsing of these diagrams is often needed to answer questions.
4.3. Knowledge Scope To Answer Questions
We also analyze the knowledge scope required to answer questions in the dataset in Figure 4 for each question type. This analysis was performed by human subjects on 250 randomly sampled questions of each type. Figure 4(a) shows the scope needed for text questions. A significant number of text questions require multiple sentences within a paragraph to be answered correctly, and some questions require information spread across the entire lesson. This is in contrast to past MC datasets like SQuAD, where a majority of questions can be answered by one sentence. Figure 4(b) shows the scope for diagram questions. Most questions require parsing the question diagram, and of these, a significant number also need text and images from the context. Figure 4(c) shows the degree of diagram parsing required to answer questions, given that the diagram is needed. Very few questions can be answered with just a classification of the diagram, and more than 50% need a rich structure to be parsed out of the diagram. Finally, Figure 4(d) shows that fewer than 5% of diagram questions can be trivially answered by just the raw OCR text. An example of this case is where only the correct answer option lies within the text boxes in the image and the wrong options are unrelated to the diagram. This analysis shows that questions in the TQA dataset often require multiple pieces of context information presented in multiple modalities, rendering the dataset challenging.
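The trivially answerable case described above can be made concrete with a small check: a question falls into it when exactly one answer option matches a text box read from the diagram. The sketch below assumes the OCR output is already available as a list of strings; the function name and the exact-match rule are assumptions for illustration, and the example values are made up.

```python
def ocr_only_answer(choices, text_boxes):
    """Return the label of the single choice found among the OCR text boxes.

    Returns None when no choice, or more than one choice, matches, since
    the raw OCR text alone then does not identify the answer.
    """
    boxes = {box.strip().lower() for box in text_boxes}
    hits = [
        label
        for label, answer in choices.items()
        if answer.strip().lower() in boxes
    ]
    return hits[0] if len(hits) == 1 else None


# Made-up OCR output and answer options.
boxes = ["Nucleolus", "Nuclear envelope", "Chromatin"]
trivial = ocr_only_answer(
    {"a": "Nuclear envelope", "b": "Chlorophyll", "c": "Magma"}, boxes
)
ambiguous = ocr_only_answer(
    {"a": "Nuclear envelope", "b": "Chromatin", "c": "Nucleolus"}, boxes
)
```

When several options appear in the diagram, as in the second call, the check abstains, which corresponds to the large majority of diagram questions that need more than the raw OCR text.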
4.4. Qualitative Examples
True/False. Several multiple choice questions in the dataset have just two choices: True and False. As one might expect with middle school questions, these are not simple look-up questions but require complex parsing and reasoning. Figure 5 shows three examples. The first requires relating "too high" and "below" and also requires parsing multiple sentences. The second requires parsing the flow chart in the diagram and counting the steps. Counting is a notoriously hard task for present-day QA systems, as has been seen in the VQA dataset. The third requires converting the numerical phrase "2/3" to "two thirds" as opposed to "two and three", and then reasoning that two thirds is more than one third. Figure 6 shows examples of questions