Understanding Convolutional Neural Networks for Text Classification
Abstract
We present an analysis of the inner workings of Convolutional Neural Networks (CNNs) for processing text. CNNs used for computer vision can be interpreted by projecting filters into image space, but for discrete sequence inputs CNNs remain a mystery. We aim to understand the method by which the networks process and classify text. We examine the common hypothesis that filters, accompanied by global max-pooling, serve as ngram detectors. We show that filters may capture several different semantic classes of ngrams by using different activation patterns, and that global max-pooling induces behavior which separates important ngrams from the rest. Finally, we show practical use cases derived from our findings in the form of model interpretability (explaining a trained model by deriving a concrete identity for each filter, bridging the gap between visualization tools in vision tasks and NLP) and prediction interpretability (explaining predictions).
1 Introduction
Convolutional Neural Networks (CNNs), originally invented for computer vision, have been shown to achieve strong performance on text classification tasks (Bai et al., 2018; Kalchbrenner et al., 2014; Wang et al., 2015; Johnson and Zhang, 2015; Iyyer et al., 2015) as well as other traditional Natural Language Processing (NLP) tasks (Collobert et al., 2011) , even when considering relatively simple one-layer models (Kim, 2014) .
As with other architectures of neural networks, explaining the learned functionality of CNNs is still an active research area. The ability to interpret neural models can be used to increase trust in model predictions, analyze errors or improve the model (Ribeiro et al., 2016) . The problem of interpretability in machine learning can be divided into two concrete tasks: Given a trained model, model interpretability aims to supply a structured explanation which captures what the model has learned. Given a trained model and a single example, prediction interpretability aims to explain how the model arrived at its prediction. These can be further divided into white-box and black-box techniques. While recent works have begun to supply the means of interpreting predictions (Alvarez-Melis and Jaakkola, 2017; Lei et al., 2016; Guo et al., 2018) , interpreting neural NLP models remains an under-explored area.
Accompanying their rising popularity, CNNs have seen multiple advances in interpretability when used for computer vision tasks (Zeiler and Fergus, 2014). These techniques unfortunately do not trivially apply to discrete sequences, as they assume a continuous input space used to represent images. Intuitions about how CNNs work on an abstract level also may not carry over from image inputs to text. For example, pooling in CNNs has been used to induce deformation invariance (LeCun et al., 1998), which is likely different from the role it has when processing text.
In this work, we examine and attempt to understand how CNNs process text, and then use this information for the more practical goals of improving model-level and prediction-level explanations.
We identify and refine current intuitions as to how CNNs work. Specifically, current common wisdom suggests that CNNs classify text by working through the following steps (Goldberg, 2016) : 1) 1-dimensional convolving filters are used as ngram detectors, each filter specializing in a closely-related family of ngrams.
2) Max-pooling over time extracts the relevant ngrams for making a decision.
3) The rest of the network classifies the text based on this information.
We refine items 1 and 2 and show that:
• Max-pooling induces a thresholding behavior: values below a given threshold are ignored when making a prediction (i.e., they are irrelevant to it). Specifically, we show an experiment in which 40% of the pooled ngrams on average can be dropped with no loss of performance (Section 4).
• Filters are not homogeneous, i.e. a single filter can, and often does, detect multiple distinctly different families of ngrams (Section 5.3).
• Filters also detect negative items in ngrams: they not only select for a family of ngrams but often actively suppress a related family of negated ngrams (Section 5.4).
We also show that the filters are trained to work with naturally-occurring ngrams, and can be easily misled (made to produce values substantially larger than their expected range) by selected non-natural ngrams. These findings can be used for improving model-level and prediction-level interpretability (Section 6). Concretely: 1) We improve model interpretability by deriving a useful summary for each filter, highlighting the kinds of structures it is sensitive to. 2) We improve prediction interpretability by focusing on informative ngrams and also taking negative cues into account.
2 Background: 1D Text Convolutions
We focus on the task of text classification. We consider the common architecture in which each word in a document is represented as an embedding vector, and a single convolutional layer with m filters is applied, producing an m-dimensional vector for each document ngram. The vectors are combined using max-pooling followed by a ReLU activation. The result is then passed to a linear layer for the final classification.
For an n-word input text $w_1, \dots, w_n$ we embed each symbol as a $d$-dimensional vector, resulting in word vectors $w_1, \dots, w_n \in \mathbb{R}^d$. The resulting $d \times n$ matrix is then fed into a convolutional layer where we pass a sliding window over the text. For each $\ell$-word ngram:

$$u_i = [w_i, \dots, w_{i+\ell-1}] \in \mathbb{R}^{d \times \ell}; \quad 0 \le i \le n - \ell$$
For each filter $f_j \in \mathbb{R}^{d \times \ell}$ we calculate $\langle u_i, f_j \rangle$. The convolution results in a matrix $F \in \mathbb{R}^{n \times m}$. Applying max-pooling across the ngram dimension results in $p \in \mathbb{R}^m$, which is fed into a ReLU non-linearity. Finally, a linear fully-connected layer $W \in \mathbb{R}^{c \times m}$ produces the distribution over classification classes, from which the strongest class is output. Formally:

$$\begin{aligned}
u_i &= [w_i; \dots; w_{i+\ell-1}] \\
F_{ij} &= \langle u_i, f_j \rangle \\
p_j &= \mathrm{ReLU}(\max_i F_{ij}) \\
o &= \mathrm{softmax}(W p)
\end{aligned}$$
In practice, we use multiple window sizes $\ell \in L$, $L \subset \mathbb{N}$, by using multiple convolution layers in parallel and concatenating the resulting $p$ vectors. We note that the methods in this work are applicable to dilated convolutions as well.
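The following PyTorch sketch illustrates this architecture. It is a minimal illustration rather than the exact implementation used in this work; the class name, hyperparameter values, and defaults are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextCNN(nn.Module):
    """Single-layer text CNN with max-over-time pooling (illustrative sketch)."""

    def __init__(self, vocab_size, emb_dim=50, num_filters=50,
                 window_sizes=(2, 3, 4), num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # One Conv1d per window size l; each produces num_filters feature maps.
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, num_filters, kernel_size=l) for l in window_sizes]
        )
        self.fc = nn.Linear(num_filters * len(window_sizes), num_classes)

    def forward(self, token_ids):                      # token_ids: (batch, n)
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, d, n)
        pooled = []
        for conv in self.convs:
            feats = conv(x)                            # (batch, m, n - l + 1): one score per ngram per filter
            p, _ = feats.max(dim=2)                    # max-pooling over the ngram dimension
            pooled.append(F.relu(p))                   # ReLU after pooling, as in the formulation above
        p = torch.cat(pooled, dim=1)                   # concatenate the p vectors across window sizes
        return self.fc(p)                              # class logits; softmax applied in the loss


# Hypothetical usage:
# logits = TextCNN(vocab_size=20000)(torch.randint(0, 20000, (8, 40)))
```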
3 Datasets And Hyperparameters
For the empirical experiments and results presented in this work we use three text classification datasets for sentiment analysis, which involves classifying the input text (user reviews in all cases) as positive or negative. The specific datasets were chosen for their relative variety in size and domain, as well as for the relative simplicity and interpretability of the binary sentiment analysis task.
The three datasets are: a) MR: the sentence polarity dataset v1.0 introduced by Pang and Lee (2005), containing 10k evenly split short movie reviews (sentences or snippets). b) Elec: electronic product reviews for sentiment classification, introduced by Johnson and Zhang (2015). c) Yelp: Yelp user reviews labeled for sentiment polarity.

For word embeddings, we use the pre-trained GloVe Wikipedia 2014-Gigaword 5 embeddings (Pennington et al., 2014), which we fine-tune with the model. We use an embedding dimension of 50, filter sizes of $\ell \in \{2, 3, 4\}$ words, and $m \in \{10, 50\}$ filters. Models are implemented in PyTorch and trained with the Adam optimizer.
4 Identifying Important Features
Current common wisdom posits that filters serve as ngram detectors: each filter searches for a specific class of ngrams, which it marks by assigning them high scores. These highest-scoring detected ngrams survive the max-pooling operation. The final decision is then based on the set of ngrams in the max-pooled vector (represented by the set of corresponding filters). Intuitively, ngrams which any filter scores highly (relative to how it scores other ngrams) are ngrams which are highly relevant for the classification of the text.
In this section we refine this view by attempting to answer the questions: what information about ngrams is captured in the max-pooled vector, and how is it used for the final classification? (footnote 1)
4.1 Informative Vs. Uninformative Ngrams
Consider the pooled vector $p \in \mathbb{R}^m$ on which the classification is based. Each value $p_j = \mathrm{ReLU}(\max_i \langle u_i, f_j \rangle)$ stems from a filter-ngram interaction, and can be traced back to the ngram $u_i = [w_i, \dots, w_{i+\ell-1}]$ that triggered it. Denote the set of ngrams contributing to $p$ as $S_p$. Ngrams not in $S_p$ do not influence the decision of the classifier. But what about the ngrams that are in $S_p$? Previous attempts at prediction-based interpretation of CNNs for text highlight the ngrams in $S_p$ and their scores as means of explaining the prediction. We take here a more refined view. Note that the final classification does not observe the ngram identities directly, but only through the scores assigned to them by the filters. Hence, the information in $p$ must rely on the assigned scores.
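A minimal sketch of how pooled values can be traced back to their source ngrams; the tensor shapes and the helper name are assumptions for illustration, not taken from the paper's code.

```python
import torch

def pooled_ngrams(feature_map, window_size):
    """Trace each pooled value p_j back to the ngram that produced it.

    feature_map: tensor of shape (num_ngrams, num_filters), i.e. F_ij = <u_i, f_j>
    Returns, for every filter j, the max-pooled score and the token span of its ngram in S_p.
    """
    scores, start_idx = feature_map.max(dim=0)                   # max over the ngram dimension
    spans = [(int(i), int(i) + window_size) for i in start_idx]  # [start, end) token spans
    return scores, spans

# Example with random scores for a 10-token text, 4 filters, window size 3:
F_map = torch.randn(10 - 3 + 1, 4)
scores, spans = pooled_ngrams(F_map, window_size=3)
```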
Conceptually, we separate ngrams in $S_p$ into two classes: deliberate and accidental. Deliberate ngrams end up in $S_p$ because they were scored high by their filter, likely because they are informative regarding the final decision. In contrast, accidental ngrams end up in $S_p$ despite having a low score, because no other ngram scored higher than them. These ngrams are likely not informative for the classification decision. Can we tease apart the deliberate and accidental ngrams?
We assume that there is a threshold for each filter, where values above the threshold carry information that is relevant to the classification, while values below the threshold are uninformative and can be ignored for the purpose of classification. We thus search for the threshold that separates the two classes. However, as we cannot measure directly which values $p_j$ influence the final decision, we opt instead to measure the correlation between $p_j$ values and the predicted label for the vector $p$.
The linearity of the decision function $Wp$ allows us to measure exactly how much $p_j$ is weighted toward the logit of label class $k$. The class which filter $f_j$ contributes to is $c_j = \arg\max_k W_{kj}$ (footnote 2). We refer to class $c_j$ as the class identity of filter $f_j$.
By assigning each filter a class identity $c_j$ and comparing it to the predicted label, we arrive at a correlation label: whether the filter's identity class matches the final decision of the network. Concretely, we run the classifier over a set of texts, resulting in pooled vectors $p^i$ and network predictions $c^i$. For each filter $j$ we then consider the values $p^i_j$ and whether $c^i = c_j$. For each filter we obtain a dataset $(p^1_j, c^1 = c_j), \dots, (p^D_j, c^D = c_j)$, and we look for a threshold $t_j$ that separates the $p^i_j$ for which $c^i = c_j$ from those for which $c^i \ne c_j$:

$$(X, Y)_j = \{(p^i_j,\ c^i = c_j) \mid j < m,\ i < D\}$$
In an ideal case, the set is linearly separable and we can easily separate informative from uninformative values: if $p^i_j > t_j$ then the classifier's prediction agrees with the filter's label, and otherwise they disagree. In practice, the set is not separable. We instead work with the purity of a filter-threshold combination, defined as the percentage of informative (correlative) ngrams which were scored above the threshold (footnote 3). Formally, given a threshold dataset $(X, Y)$:

$$\mathrm{purity}(f, t) = \frac{\left|\{(x, y) \in (X, Y)_f \mid x \ge t \wedge y = \mathrm{true}\}\right|}{\left|\{(x, y) \in (X, Y)_f \mid x \ge t\}\right|}$$
We heuristically set the threshold of a filter to the lowest value that achieves a sufficiently high purity (we experimentally find that a purity value of 0.75 works well).
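A sketch of this heuristic, assuming we have already collected one filter's pooled activations and the corresponding agreement labels; the function name and the vectorized formulation are illustrative, not from the paper's code.

```python
import numpy as np

def filter_threshold(values, agrees, min_purity=0.75):
    """Lowest threshold t such that purity(f, t) >= min_purity.

    values: activations p_j of one filter over a dataset, shape (D,)
    agrees: booleans, whether the network prediction matched the filter's class identity
    Returns None if no threshold reaches the requested purity.
    """
    order = np.argsort(values)                  # scan candidate thresholds from low to high
    values, agrees = values[order], agrees[order]
    n = len(values)
    # purity at threshold values[i]: fraction of agreeing examples among those with value >= values[i]
    informative_above = np.cumsum(agrees[::-1])[::-1]   # suffix counts of agreeing examples
    total_above = np.arange(n, 0, -1)                   # suffix counts of all examples
    purity = informative_above / total_above
    ok = np.nonzero(purity >= min_purity)[0]
    return float(values[ok[0]]) if len(ok) else None

# Hypothetical usage on one filter's activations:
t = filter_threshold(np.random.rand(1000), np.random.rand(1000) > 0.4)
```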
In Figure 2b,c we show examples of threshold datasets for a model trained on the MR sentiment analysis task.
Threshold Effectiveness We described a method for obtaining per-filter threshold values. But is the threshold assumption (that items below a given threshold do not participate in the decision) even correct? To assess the quality of the threshold obtained by our proposal and to validate the thresholding assumption, we discard values that do not pass the threshold for each filter and observe the performance of the model. Practically, we replace the ReLU non-linearity with a threshold function:

$$\mathrm{threshold}(x, t) = \begin{cases} x, & \text{if } x \ge t \\ 0, & \text{otherwise} \end{cases}$$

Figure 1 presents the results on the MR dataset (we observed similar results on the Elec dataset), where the threshold is set for each filter separately, based on a shared purity value. If the thresholding assumption is correct and our way of deriving the threshold is effective, we expect not to see a drop in accuracy. Indeed, for a purity value of 0.75, we observe that the model performance improves slightly when replacing the ReLU with a per-filter threshold, indicating that the thresholding model is indeed a good approximation of the feature behavior. The percentage of informative (non-accidental) values in $p$ is roughly a linear function of the purity (Figure 1c). With a purity value of 0.75 (footnote 4), we discard roughly 44% of the values in $p$, and hence 44% of the ngrams in $S_p$.
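A possible PyTorch realization of this replacement; the function name and the per-filter threshold tensor are illustrative.

```python
import torch

def threshold_activation(p, t):
    """Replace ReLU(p) with a per-filter threshold: keep p_j only if it reaches t_j.

    p: pooled vector(s), shape (..., m); t: per-filter thresholds, shape (m,)
    """
    return torch.where(p >= t, p, torch.zeros_like(p))

# e.g. p = torch.randn(8, 50); t = torch.full((50,), 0.3); out = threshold_activation(p, t)
```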
Not all filters behave in a similar way, however. In Figure 2 we show an example of a filter (#6 in the figure) which is especially uninformative: by applying the lowest threshold which satisfies a purity of 0.75, we discard 99.99% of its activations. Therefore, in the experiments in Figure 1, this filter is effectively unused, yet this causes no loss in performance. In essence, the threshold classifier identified and effectively discarded a filter which is not useful to the model.
To summarize, we have validated our assumptions and shown empirically that global max-pooling indeed induces a functionality of separating important from unimportant activation signals using a latent (presumably soft) threshold. For the rest of this work we will assume a known threshold value for every filter in the model, which we can use to identify important ngrams.
5 What Is Captured By A Filter?
Previous work looked at the top-k scoring ngrams for each filter. However, focusing on the top-k does not tell the complete story. We instead look at the set of deliberate ngrams: those that pass the filter's threshold value. Common intuition suggests that each filter is homogeneous and specializes in detecting a specific class of ngrams. For example, a filter may specialize in detecting ngrams such as "had no issues", "had zero issues", and "had no problems". We challenge this view and show that filters often specialize in multiple distinctly different semantic classes by utilizing activation patterns which are not necessarily maximized. We also show that filters may not only identify good ngrams, but may also actively suppress bad ones.
5.1 Slot Activation Vectors
As discussed in Section 2, for each ngram $u = [w_1, \dots, w_\ell]$ and for each filter $f$ we calculate the score $\langle u, f \rangle$. The ngram score can be decomposed as a sum of individual word scores by considering the inner products between every word embedding $w_i$ in $u$ and the parallel slice of $f$:

$$\langle u, f \rangle = \sum_{i=1}^{\ell} \langle w_i, f_{(i-1)d:id} \rangle$$
We refer to the slice $f_{(i-1)d:id}$ as slot $i$ of the filter weights, denoted $f(i)$. Instead of taking the sum of these inner products, we can interpret them directly, saying that $\langle w_i, f(i) \rangle$ captures how much slot $i$ in $f$ is activated by the $i$th word in the ngram (footnote 5). We can now move from examining the activation of an ngram-filter pair $\langle u, f \rangle$, where $u = [w_1; \dots; w_\ell]$, to examining its slot activation vector $(\langle w_1, f(1) \rangle, \dots, \langle w_\ell, f(\ell) \rangle)$. The slot activation vector captures how much each word in the ngram contributes to its activation.
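A small sketch of this decomposition, assuming the filter weights have been reshaped so that each row corresponds to one slot; names and shapes are illustrative.

```python
import torch

def slot_activations(ngram_embeds, filt):
    """Decompose an ngram-filter score into per-slot contributions.

    ngram_embeds: (l, d) word embeddings of the ngram [w_1, ..., w_l]
    filt:         (l, d) filter weights reshaped so that filt[i] is slot i+1
    Returns the slot activation vector (<w_1, f(1)>, ..., <w_l, f(l)>); its sum
    (plus the filter bias, if any) is the ngram's convolution score.
    """
    return (ngram_embeds * filt).sum(dim=1)

# e.g. for l=3, d=50: slots = slot_activations(torch.randn(3, 50), torch.randn(3, 50))
```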
5.2 Naturally Occurring Vs. Possible Ngrams
We distinguish naturally occurring or observed ngrams, which are ngrams that are observed in a large corpus, from possible ngrams, which are any combination of words from the vocabulary. The possible ngrams are a superset of the naturally occurring ones. Given a filter, we can find its top-scoring naturally occurring ngram by searching over all ngrams in a corpus. We can find its top-scoring possible ngram by maximizing each slot value individually. We observe a large and consistent gap in scores between the top-scoring natural ngrams and the top-scoring possible ngrams. In our Elec model, when averaging over all filters, the top naturally occurring ngrams score 30% less than the top possible ngrams. Interestingly, the top-scoring natural ngrams almost never fully activate all slots in a filter. Table 1 shows the top-scoring naturally occurring and possible ngrams for nine filters in the Elec model. In each of the top-scoring natural ngrams, at least one slot receives a low activation. Table 2 zooms in on one of the filters and shows its top-7 naturally occurring ngrams and the top-7 most activating words in each slot. Here, most top-scoring ngrams maximize slot #3 with words such as invaluable and perfect; however, some ngrams, such as "works as good" and "still holding strong", maximize slots #1 and #2 respectively, instead.
Additionally, most top-scoring words do not appear to be utilized in high-scoring ngrams at all. This can be explained as follows: if a word such as crt rarely or never appears in slot #1 alongside other high-scoring words in the other slots, then crt can score highly with no consequence. Since an ngram containing crt at slot #1 will rarely pass the max-pooling layer, its score at that slot is essentially random.

Table 2: Top-k words by slot scores and top-k ngrams by filter scores from the Elec model. In bold are words from the top-k ngrams which appear in the top-k slot words, i.e. words which maximize their slot.
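A sketch of how the top-scoring possible ngram of a filter can be found by maximizing each slot independently over the vocabulary; names and shapes are assumptions for illustration.

```python
import torch

def top_possible_ngram(filt, emb_matrix, k=1):
    """For each slot of a filter, find the k vocabulary words with the highest slot activation.

    filt:       (l, d) filter weights, one row per slot
    emb_matrix: (V, d) word embedding matrix
    Returns per-slot top-k scores and word indices; picking the top word of every slot gives
    the top-scoring *possible* ngram, which need not occur naturally in any corpus.
    """
    slot_scores = emb_matrix @ filt.t()        # (V, l): <w, f(i)> for every word and slot
    return slot_scores.topk(k, dim=0)

# e.g. scores, idx = top_possible_ngram(torch.randn(3, 50), torch.randn(10000, 50), k=7)
```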
On naturally occurring ngrams, the filters do not achieve maximum values in all slots but only in some of them. Why? We consider two hypotheses to explain this behavior: (i) Each filter captures multiple semantic classes of ngrams, and each class has some dominating slots and some non-dominating slots (which we define as a slot activation pattern).
(ii) A slot may not be maximized because it is used not to detect the presence of a word, but rather the absence of one: ensuring that specific words do not occur.
We investigate both hypotheses in Sections 5.3 and 5.4 respectively.
Adversarial Potential
We note in passing that this discrepancy in scores between naturally occurring and possible ngrams can be used to derive adversarial examples that cause a trained model to misclassify. By inserting a few seemingly random ngrams, we can cause filters to activate beyond their expected range, potentially driving the model to misclassification. We reserve this area of exploration for future work.
5.3 Clustering (Hypothesis (I))
We explore hypothesis (i) by clustering the threshold-passing (naturally occurring) ngrams of each filter according to their activation vectors. We use Mean Shift clustering (Fukunaga and Hostetler, 1975; Cheng, 1995), an algorithm that does not require specifying an a-priori number of clusters and does not make assumptions about their shapes. Mean Shift considers the feature vectors as sampled from an underlying probability density function (footnote 6). Each cluster captures a different slot activation pattern. We use the cluster's centroid as the prototypical slot activation for that cluster. Table 3 shows a sample clustering output. The clustering algorithm identified two clusters: one primarily containing ngrams of the pattern DET INTENSITY-ADVERB POSITIVE-WORD, while the second contains ngrams that begin with phrases like go wrong (footnote 7). The centroids of these clusters capture the activation patterns well: low-medium-high and high-high-low for clusters 1 and 2 respectively.
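A sketch of this clustering step using scikit-learn's MeanShift implementation; the slot vectors here are random stand-ins, and the variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import MeanShift

# slot_vectors: (N, l) slot activation vectors of the threshold-passing ngrams of one filter
# (random stand-in here; in practice computed as in Section 5.1).
slot_vectors = np.random.rand(200, 3)

ms = MeanShift()                 # no a-priori number of clusters required
labels = ms.fit_predict(slot_vectors)
centroids = ms.cluster_centers_  # prototypical slot activation pattern of each cluster

for c, centroid in enumerate(centroids):
    members = slot_vectors[labels == c]
    print(f"cluster {c}: {len(members)} ngrams, centroid {np.round(centroid, 2)}")
```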
To summarize, by discarding noisy ngrams which do not pass the filter's threshold and then clustering those that remain according to their slot activation patterns, we arrived at a clearer image of the semantic classes of ngrams that a given filter specializes in capturing. In particular, we reveal that filters are not necessarily homogeneous: a single filter may detect several different semantic patterns, each one of them relying on a different slot activation pattern.
5.4 Negative Ngrams (Hypothesis (Ii))
Our second theory to explain the discrepancy between the activations of naturally occurring and possible ngrams is that certain filter slots are not used to detect a class of highly activating words, but rather to rule out a class of highly negative words. We refer to these as negative ngrams.
For example, Table 3 shows an ngram pattern for which slot #1 contains determiners and other "filler" tokens such as hyphens, periods and commas with relatively weak slot activations. Hypothesis (ii) suggests that this slot may receive a strong negative score for words such as not and n't, causing such negated patterns to drop below the threshold. Indeed, ngrams containing not or n't in slot #1 do not pass the threshold for this filter.
We are interested in a more systematic method of identifying these cases. Identifying negative slot activations would be very useful for understanding the semantics captured by a filter and the reasoning behind the dismissal of an ngram, as we discuss in Sections 6.1 and 6.2 respectively.
We achieve this by searching the below-threshold ngram space for ngrams which are "flipped versions" of above-threshold ngrams. Concretely: given an ngram u which was scored highly by filter f, we search for low-scoring ngrams u′ such that the Hamming distance between u and u′ is low. By doing this for the top-k scoring ngrams per cluster, we arrive at a comprehensive set of negative ngrams. In Table 4 we show a sample output of this algorithm.
Furthermore, we can divide negative ngrams into two cases: 1) Lowering the ngram score below the threshold by replacing high-scoring words with low-scoring words. 2) Lowering the ngram score below the threshold by replacing words with a low positive score with words with a highly-negative score. Case 2 is more interesting because it embodies cases where hypothesis (ii) is relevant. Additionally, it highlights ngrams where a strongly positive word in one slot was negated with another strongly negative word in another slot. Table 4 shows examples in bold.
In order to identify "Case 2" negative ngrams, we heuristically test whether the scores of the "changed" words directly determine whether the activation falls below the threshold: given an already identified negative ngram (with Hamming distance k from its positive counterpart, and thus up to k negative slot activations), if the ngram score without the bottom-k negative slot activations passes the threshold, but the full score including them does not, then the ngram is considered a "Case 2" negative ngram.
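A sketch of the negative-ngram search described above; the helper names and the score_fn interface are hypothetical, not taken from the paper's code.

```python
def negative_ngrams(top_ngram, corpus_ngrams, score_fn, threshold, max_hamming=1):
    """Find below-threshold ngrams that are 'flipped versions' of a high-scoring ngram.

    top_ngram:     tuple of words that passes the filter's threshold
    corpus_ngrams: iterable of naturally occurring ngrams (tuples of words)
    score_fn:      maps an ngram to its filter score
    Returns ngrams within the given Hamming distance whose score falls below the threshold.
    """
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    results = []
    for ng in corpus_ngrams:
        if len(ng) == len(top_ngram) and 0 < hamming(ng, top_ngram) <= max_hamming:
            if score_fn(ng) < threshold:
                results.append(ng)
    return results
```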
6 Interpretability
In this section we show two practical implications of the findings above: improvements in both model-level and prediction-level interpretability of 1D CNNs for text classification.
6.1 Model Interpretability
As in computer vision, we can now interpret a trained CNN model by "visualizing" its filters and interpreting the visible shapes, in other words defining a high-level description of what the filter detects. We propose to associate each filter with the following items: 1) The class which this filter's strong signals contribute to (in the sentiment task: positive or negative); 2) The threshold value for the filter, together with its purity and coverage percentages (which essentially capture how informative this filter is); 3) A list of semantic patterns identified by this filter. Each list item corresponds to a slot-activation cluster. For each cluster we present the top-k ngrams activating it, and for each ngram we specify its total activation, its slot-activation vector, and its list of bottom-k negative ngrams with their activations and slot activations. In particular, by clustering the activated ngrams according to their slot activation patterns and showing the top-k in each cluster, we get a much more refined coverage of the linguistic patterns that are captured by the filter.
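As an illustration only, such a per-filter summary could be assembled into a structure along these lines; the field names are our own, not the paper's.

```python
def summarize_filter(filter_id, class_identity, threshold, purity, coverage, clusters):
    """Assemble the per-filter summary proposed above (names and structure are illustrative).

    clusters: list of dicts, each holding a centroid slot-activation pattern, its top-k ngrams
              (with total and slot activations), and their bottom-k negative ngrams.
    """
    return {
        "filter": filter_id,
        "class": class_identity,   # class the filter's strong signals contribute to
        "threshold": threshold,
        "purity": purity,
        "coverage": coverage,
        "patterns": clusters,
    }
```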
6.2 Prediction Interpretability
Previous prediction-based interpretation attempts traced back the ngrams from the max-pooling layer. Here we improve on these attempts by considering only ngrams that pass the threshold for their filter. This results in a more concise and relevant explanation (Figure 1). Figure 3 shows two examples. Note that in example #1, many negative-class filters were "forced" to choose an ngram in max-pooling despite there not being strongly negative phrases, but those ngrams do not pass the threshold and are thus cleaned from the explanation. Additionally, we can use the individual slot activations to tease apart the contribution of each word in the ngram. Finally, we can also mark cases of negative ngrams (Section 5.4), where an ngram has high slot activations for some words, but these are negated by a highly-negative slot; as a consequence the ngram is not selected by max-pooling, or is selected but does not pass the filter's threshold.

Figure 3: Examples predicted positive and negative respectively by a model trained on the Elec dataset, along with their explanations. Ngrams which passed the threshold are in bold, and case 2 negative ngrams are in italics. For clarity's sake we trained a small model which uses ten filters.
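A sketch of how such a threshold-filtered explanation could be assembled from quantities defined earlier (per-filter thresholds from Section 4, filter class identities, and the pooled ngram spans); all names are illustrative.

```python
def explain_prediction(pooled, pooled_spans, thresholds, filter_classes, tokens):
    """Keep only ngrams whose pooled activation passes its filter's threshold.

    pooled:         (m,) pooled activations for one document
    pooled_spans:   per-filter (start, end) token spans of the max-pooled ngrams
    thresholds:     (m,) per-filter thresholds (Section 4)
    filter_classes: (m,) class identity of each filter
    """
    explanation = []
    for j, (score, (s, e)) in enumerate(zip(pooled, pooled_spans)):
        if score >= thresholds[j]:                       # drop 'accidental' ngrams
            explanation.append((filter_classes[j], " ".join(tokens[s:e]), float(score)))
    return sorted(explanation, key=lambda x: -x[2])      # strongest cues first
```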
7 Conclusion
We have refined several common wisdom assumptions regarding the way in which CNNs process and classify text. First, we have shown that max-pooling over time induces a thresholding behavior on the convolution layer's output, essentially separating features that are relevant to the final classification from features that are not. We used this information to identify which ngrams are important to the classification. We also associate each filter with the class it contributes to. We decompose the ngram score into word-level scores by treating the convolution of a filter as a sum of word-level convolutions, allowing us to examine the word-level composition of the activation. Specifically, by maximizing the word-level activations by iterating over the vocabulary, we observed that filters do not maximize activations at the word level, but instead form slot activation patterns that give different types of ngrams similar activation strengths. This provides empirical evidence that filters are not homogeneous. By clustering high-scoring ngrams according to their slot-activation patterns we can identify the groups of linguistic patterns captured by a filter. We also show that filters sometimes opt to assign negative values to certain word activations in order to cause the ngrams which contain them to receive a low score despite having otherwise highly activating words. Finally, we use these findings to suggest improvements to model-based and prediction-based interpretability of CNNs for text.
Footnotes

1. Although this work focuses on text classification, the findings in this section apply to any neural architecture which utilizes global max pooling, for both discrete and continuous domains. To our knowledge this is the first work that examines the assumption that max-pooling induces classifying behavior. Previously, Ruderman et al. (2018) showed that other assumptions about the functionality of max-pooling, as deformation stabilizers (relevant only in continuous domains), do not necessarily hold true.

2. In the case of non-linear fully-connected layers, the question of how each feature contributes to each class is significantly harder to answer. Possible methods include saliency map methods or gradient-based methods. Recently, Guo et al. (2018) attributed labels to filters using Bayesian inference and other image annotations.

3. The purity metric can be considered as the precision metric for this task.

4. We note that, empirically and intuitively, the more filters we utilize in the network, the less correlation there is between each filter's class and the final classification, as the decision is being made by a greater consensus. This means that demanding a higher purity will be accompanied by lower coverage, relative to other experiments, and more ngrams will be discarded. The "correct" purity level for a filter is then a function of the model and dataset used, and should be investigated using the train or validation datasets.

5. We note that this breakdown does not consider the filter's bias, if one is used. This bias is a single number (per filter) which is added to the sum of slot activations to arrive at the ngram activation which is passed to the max-pooling layer. Bias can be accommodated by appending an additional "bias word" with an embedding vector of [1, ..., 1] to every ngram. Regardless, as this bias is identical for all ngrams for the filter in question, it has no role in identifying which ngrams the filter is most similar to, and we can ignore it in this context.

6. Intuitively, we can think of the sampling noise as the ngram embeddings, and the probability distribution as defined by a function of the filter weights.

7. In the Yelp dataset, go wrong overwhelmingly occurs in a negated context such as "can't go wrong" and "won't go wrong", which explains why it is detected by a positive filter.