Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs

Authors

  • Ana Marasović
  • Chandra Bhagavatula
  • J. S. Park
  • Ronan Le Bras
  • Noah A. Smith
  • Yejin Choi
  • Findings of EMNLP 2020

Abstract

Natural language rationales could provide intuitive, higher-level explanations that are easily understandable by humans, complementing the more broadly studied lower-level explanations based on gradients or attention weights. We present the first study focused on generating natural language rationales across several complex visual reasoning tasks: visual commonsense reasoning, visual-textual entailment, and visual question answering. The key challenge of accurate rationalization is comprehensive image understanding at all levels: not just the explicit content at the pixel level, but also the contextual content at the semantic and pragmatic levels. We present Rationale^VT Transformer, an integrated model that learns to generate free-text rationales by combining pretrained language models with object recognition, grounded visual semantic frames, and visual commonsense graphs. Our experiments show that free-text rationalization is a promising research direction to complement model interpretability for complex visual-textual reasoning tasks. In addition, we find that integration of richer semantic and pragmatic visual features improves the visual fidelity of rationales.

1 Introduction

Explanatory models based on natural language rationales could provide intuitive, higher-level explanations that are easily understandable by humans (Miller, 2019). In Figure 1, for example, the natural language rationale, given as free text, provides a much more informative and conceptually relevant explanation for the given QA problem than the non-linguistic explanations that are often provided as localized visual highlights on the image. The latter, while pertinent to what the vision component of the model was attending to, cannot convey the full scope of rationales for complex reasoning tasks such as the one illustrated in Figure 1. Indeed, explanations for higher-level conceptual reasoning are best conveyed through natural language, as has been studied in the literature on (visual) NLI (Do et al., 2020; Camburu et al., 2018), (visual) QA (Wu and Mooney, 2019; Rajani et al., 2019), arcade games (Ehsan et al., 2019), fact checking (Atanasova et al., 2020), image classification (Hendricks et al., 2018), motivation prediction (Vondrick et al., 2016), algebraic word problems (Ling et al., 2017), and self-driving cars (Kim et al., 2018).

Figure 1: An illustrative example showing that explaining higher-level conceptual reasoning cannot be well conveyed only through the attribution of raw input features (individual pixels or words); we need natural language.

In this paper, we present the first focused study on generating natural language rationales across several complex visual reasoning tasks: visual commonsense reasoning, visual-textual entailment, and visual question answering. Our study aims to complement the more broadly studied lower-level explanations such as attention weights and gradients in deep neural networks (Simonyan et al., 2014; Zhang et al., 2017; Montavon et al., 2018, among others). Because free-text rationalization is a challenging research question, we assume the gold answer for each instance is available and scope our investigation to justifying that answer.

The key challenge in our study is that accurate rationalization requires comprehensive image understanding at all levels: not just the basic content at the pixel level (recognizing "waitress", "pancakes", and "people" at the table in Figure 1), but also the contextual content at the semantic level (understanding the structural relations among objects and entities through action predicates such as "delivering" and "pointing to") and at the pragmatic level (understanding that the "intent" of the pointing action is to tell the waitress who ordered the pancakes).

We present RATIONALE^VT TRANSFORMER, an integrated model that learns to generate free-text rationales by combining pretrained language models based on GPT-2 (Radford et al., 2019) with visual features. Besides commonly used features derived from object detection (Fig. 2a), we explore two new types of visual features to enrich base models with semantic and pragmatic knowledge: (i) visual semantic frames, i.e., the primary activity and the entities engaged in it, detected by a grounded situation recognizer (Fig. 2b; Pratt et al., 2020), and (ii) commonsense inferences, predicted from an image and an optional event by a visual commonsense graph (Fig. 2c; Park et al., 2020).

We report comprehensive experiments with careful analysis using three datasets with human rationales: (i) visual question answering in VQA-E (Li et al., 2018), (ii) visual-textual entailment in E-SNLI-VE (Do et al., 2020), and (iii) the answer justification subtask of visual commonsense reasoning in VCR (Zellers et al., 2019a). Our empirical findings demonstrate that while free-text rationalization remains a challenging task, newly emerging state-of-the-art models support rationale generation as a promising research direction to complement model interpretability for complex visual-textual reasoning tasks. In particular, we find that integrating richer semantic and pragmatic visual knowledge can lead to rationales with higher visual fidelity, especially for tasks that require higher-level concepts and richer background knowledge.
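As a minimal illustration of such conditioning (a sketch, not our exact implementation), precomputed visual feature vectors can be linearly projected into the language model's embedding space and prepended as a prefix to the tokenized question and gold answer, with the decoder trained to continue the sequence with the rationale. The class name VisualPrefixRationaleGenerator, the feature dimension, and the input packing below are illustrative assumptions.

# Illustrative sketch: condition a GPT-2 decoder on precomputed visual features
# (e.g., pooled object-detection, situation-frame, or commonsense-graph encodings)
# to generate a free-text rationale. Not the released implementation.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel


class VisualPrefixRationaleGenerator(nn.Module):
    """Hypothetical fusion module: visual features are projected into GPT-2's
    embedding space and prepended as a prefix to the tokenized task context
    (question + gold answer); the LM learns to continue with the rationale."""

    def __init__(self, feat_dim: int = 2048, model_name: str = "gpt2"):
        super().__init__()
        self.lm = GPT2LMHeadModel.from_pretrained(model_name)
        self.visual_proj = nn.Linear(feat_dim, self.lm.config.n_embd)

    def forward(self, visual_feats, input_ids, rationale_mask):
        # visual_feats:   (B, R, feat_dim) region/frame/graph feature vectors
        # input_ids:      (B, T) tokens for "question <answer> rationale"
        # rationale_mask: (B, T) 1 on rationale tokens, 0 on the context
        vis_embeds = self.visual_proj(visual_feats)        # (B, R, hidden)
        txt_embeds = self.lm.transformer.wte(input_ids)    # (B, T, hidden)
        inputs_embeds = torch.cat([vis_embeds, txt_embeds], dim=1)

        # Only rationale tokens contribute to the LM loss; the visual prefix
        # and the textual context positions are ignored via label -100.
        prefix_ignore = torch.full(
            visual_feats.shape[:2], -100,
            dtype=torch.long, device=input_ids.device,
        )
        labels = input_ids.masked_fill(rationale_mask == 0, -100)
        labels = torch.cat([prefix_ignore, labels], dim=1)

        return self.lm(inputs_embeds=inputs_embeds, labels=labels).loss

At inference time, the rationale would be decoded autoregressively from the same visual-plus-textual prefix; grounded situation frames and visual commonsense inferences could alternatively be verbalized and appended to the textual context rather than fused as dense features.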