Commonly Uncommon: Semantic Sparsity in Situation Recognition
Abstract
Semantic sparsity is a common challenge in structured visual classification problems: when the output space is complex, the vast majority of the possible predictions are rarely, if ever, seen in the training set. This paper studies semantic sparsity in situation recognition, the task of producing structured summaries of what is happening in images, including activities, objects, and the roles objects play within the activity. For this problem, we find empirically that most substructures required for prediction are rare, and current state-of-the-art model performance decreases dramatically if even one such rare substructure exists in the target output. We avoid many such errors by (1) introducing a novel tensor composition function that learns to share examples across substructures more effectively and (2) semantically augmenting our training data with automatically gathered examples of rarely observed outputs using web data. When integrated within a complete CRF-based structured prediction model, the tensor-based approach outperforms existing state of the art by a relative improvement of 2.11% and 4.40% on top-5 verb and noun-role accuracy, respectively. Adding 5 million images with our semantic augmentation techniques gives further relative improvements of 6.23% and 9.57% on top-5 verb and noun-role accuracy.
1. Introduction
Many visual classification problems, such as image captioning [29], visual question answering [2], referring expressions [23], and situation recognition [44], have structured, semantically interpretable output spaces. In contrast to classification tasks such as ImageNet [37], these problems typically suffer from semantic sparsity: there is a combinatorial number of possible outputs, no dataset can cover them all, and the performance of existing models degrades significantly when evaluated on rare or unseen inputs [3, 46, 9, 44].

[Figure 1: Three situations involving carrying, with semantic roles agent (the carrier), item (the carried), agentpart (the part of the agent carrying), and place (where the situation is happening). For carrying, there are many possible carry-able objects (nouns that can fill the item role), which is an example of semantic sparsity. Such rarely occurring substructures are challenging and cause significant errors, affecting not only performance on role-values but also verbs.]

In this paper, we consider situation recognition, a prototypical structured classification problem with significant semantic sparsity, and develop new models and semantic data augmentation techniques that significantly improve performance by better modeling the underlying semantic structure of the task. Situation recognition [44] is the task of producing structured summaries of what is happening in images, including activities, objects, and the roles those objects play within the activity. This problem can be challenging because many activities, such as carrying, have very open-ended semantic roles, such as item, the thing being carried (see Figure 1); nearly any object can be carried and the training data will never contain all possibilities.

[Figure 2: Uncommon target outputs, those observed fewer than 10 times in training (yellow box), are common, constituting 35% of all required predictions, shown as a function of the total number of training examples for the least frequent role-noun pair in each situation. Such semantic sparsity is a central challenge for situation recognition.]

This is a prototypical instance of semantic sparsity: rare outputs constitute
a large portion of required predictions (35% in the imSitu dataset [44], see Figure 2), and current state-of-the-art performance for situation recognition drops significantly when even one participating object has few samples for its role (see Figure 3). We propose to address this challenge in two ways: (1) building models that more effectively share examples of objects between different roles and (2) semantically augmenting our training set to fill in rarely represented noun-role combinations. We introduce a new compositional Conditional Random Field (CRF) formulation that reduces the effects of semantic sparsity by encouraging sharing between nouns in different roles. Like previous work [44], we use a deep neural network to directly predict factors in the CRF. In such models, the required factors are predicted from a global image representation through a linear regression unique to each factor. In contrast, we propose a novel tensor composition function that uses low-dimensional representations of nouns and roles and shares weights across all roles and nouns to score combinations. Our model is compositional: independent representations of nouns and roles are combined to predict factors, allowing for a globally shared representation of nouns across the entire CRF.
This model is trained with a new form of semantic data augmentation that provides extra training samples for rarely observed noun-role combinations. We show that it is possible to generate short search queries that correspond to partial situations (e.g., "man carrying baby" or "carrying on back" for the situations in Figure 1) which can be used for web image retrieval. Such noisy data can then be incorporated in pre-training by optimizing marginal likelihood, effectively performing a soft clustering of values for unlabeled aspects of situations. This data also supports, as we will show, self training, where model predictions are used to prune the set of images before training the final predictor.
Experiments on the imSitu dataset [44] demonstrate that our new compositional CRF and semantic augmentation techniques reduce the effects of semantic sparsity, with strong gains for relatively rare configurations. We show that each contribution helps significantly, and that the combined approach improves performance relative to a strong CRF baseline by 6.23% and 9.57% on top-5 verb and noun-role accuracy, respectively. On uncommon predictions, our methods provide a relative improvement of 8.76% on average across all measures. Together, these experiments demonstrate the benefits of effectively targeting semantic sparsity in structured classification tasks.
2. Background
Situation Recognition: Situation recognition has recently been proposed to model events within images [19, 36, 43, 44], in order to answer questions beyond just "What activity is happening?", such as "Who is doing it?", "What are they doing it to?", and "What are they doing it with?". In general, formulations build on semantic role labeling [17], a problem in natural language processing where verbs are automatically paired with their arguments in a sentence (for example, see [8]). Each semantic role corresponds to a question about an event (for example, in the first image of Figure 1, the semantic role agent corresponds to "who is doing the carrying?" and agentpart corresponds to "how is the item being carried?").
We study situation recognition in imSitu [44], a large-scale dataset of human-annotated situations containing over 500 activities, 1,700 roles, 11,000 nouns, and 125,000 images. imSitu images are collected to cover a diverse set of situations. For example, as seen in Figure 2, 35% of situations annotated in the imSitu development set contain at least one rare role-noun pair. Situation recognition in imSitu is a strong test bed for evaluating methods addressing semantic sparsity: it is large scale, structured, easy to evaluate, and has a clearly measurable range of semantic sparsity across different verbs and roles. Furthermore, as seen in Figure 3, semantic sparsity is a significant challenge for current situation recognition models.

Formal Definition: In situation recognition, we assume discrete sets of verbs $V$, nouns $N$, and frames $F$. Each frame $f \in F$ is paired with a set of semantic roles $E_f$. Every element of $V$ is mapped to exactly one $f$. The verb set $V$ and frame set $F$ are derived from FrameNet [13], a lexicon for semantic role labeling, while the noun set $N$ is drawn from WordNet [34]. Each semantic role $e \in E_f$ is paired with a noun value $n_e \in N \cup \{\varnothing\}$, where $\varnothing$ indicates the value is either not known or does not apply. The set of pairs of semantic roles and their values is called a realized frame, $R_f = \{(e, n_e) : e \in E_f\}$. Realized frames are valid only if each $e \in E_f$ is assigned exactly one noun $n_e$.
Given an image, the task is to predict a situation, $S = (v, R_f)$, specified by a verb $v \in V$ and a realized frame $R_f$ of the frame $f$ associated with $v$.
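To make the formal definition concrete, here is a minimal sketch of these structures in Python; the class and field names are ours, not from the imSitu release, and nouns are shown as words rather than WordNet synset identifiers.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Situation:
    """A situation S = (v, R_f): a verb v plus its realized frame R_f,
    which assigns every semantic role in E_f exactly one noun value
    (None standing in for the empty value: unknown or not applicable)."""
    verb: str
    realized_frame: Dict[str, Optional[str]]  # role -> noun (or None)

# The first carrying situation of Figure 1.
s = Situation(
    verb="carrying",
    realized_frame={
        "agent": "man",
        "item": "baby",
        "agentpart": "chest",
        "place": "outside",
    },
)
```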
3. Methods
This section presents our compositional CRFs and semantic data augmentation techniques. Figure 4 shows an overview of our compositional conditional random field model, which is described below.

3.1. Compositional Conditional Random Field

Conditional Random Field: Our CRF for predicting a situation, $S = (v, R_f)$, given an image $i$ decomposes over the verb $v$ and the semantic role-value pairs $(e, n_e)$ in the realized frame $R_f = \{(e, n_e) : e \in E_f\}$, similarly to previous work [44]. The full distribution, with potentials $\psi_v$ for verbs and $\psi_e$ for semantic roles, takes the form:

$$p(S \mid i; \theta) \propto \psi_v(v, i; \theta) \prod_{(e, n_e) \in R_f} \psi_e(v, e, n_e, i; \theta) \qquad (1)$$
The CRF admits efficient inference: we can enumerate all verbs and their semantic roles, and then sum over all semantic role values that occurred in the dataset. Each potential in the CRF is log-linear:
$$\psi_v(v, i; \theta) = e^{\phi_v(v, i; \theta)} \qquad (2)$$

$$\psi_e(v, e, n_e, i; \theta) = e^{\phi_e(v, e, n_e, i; \theta)} \qquad (3)$$
where $\phi_v$ and $\phi_e$ encode scores computed by a neural network. To learn this model, we assume that for an image $i$ in dataset $Q$ there can, in general, be a set $A_i$ of possible ground truth situations.¹ We optimize the log-likelihood of observing at least one situation $S \in A_i$:

$$\sum_{i \in Q} \log\Big(1 - \prod_{S \in A_i} \big(1 - p(S \mid i; \theta)\big)\Big) \qquad (4)$$
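As a minimal sketch, the objective in Equation 4 for a single image can be computed as below; the function name and NumPy usage are illustrative, not the paper's Caffe implementation.

```python
import numpy as np

def at_least_one_nll(situation_probs):
    """Negative log-likelihood that at least one annotated situation is
    predicted (Equation 4, one image). `situation_probs` holds p(S|i)
    for each ground-truth situation in A_i (imSitu provides three)."""
    p_none = np.prod(1.0 - np.asarray(situation_probs))
    return -np.log(1.0 - p_none)

# Example: the model assigns these probabilities to the three annotations.
loss = at_least_one_nll([0.30, 0.05, 0.10])
```

Note that this objective only requires one of the annotated situations to receive high probability, which is appropriate when annotators legitimately disagree on role values.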
Compositional Tensor Potential: In previous work, the CRF potentials (Equations 2 and 3) are computed using a global image representation, a $p$-dimensional image vector $g_i \in \mathbb{R}^p$, derived from the VGG convolutional neural network [40]. Each potential value is computed by a linear regression with parameters $\theta$ unique to each possible decision of verb and verb-role-noun (we refer to this as image regression in Figure 4); for example, for the verb-role-noun potential in Equation 3:
$$\phi_e(v, e, n_e, i; \theta) = g_i^T \theta_{(v, e, n_e)} \qquad (5)$$
Such a model does not directly represent the fact that nouns are reused between different roles, although the underlying neural network could hypothetically learn to encode such reuse during fine-tuning. Instead, we introduce compositional potentials that make such reuse explicit.
To formulate our compositional potential, we introduce a set of $m$-dimensional vectors $D = \{d_n \in \mathbb{R}^m \mid n \in N\}$, one vector for each noun in $N$, the set of nouns. We create a set of matrices $\{H_{(v,e)} \in \mathbb{R}^{p \times o} \mid (v, e) \in E_f\}$, one matrix for each verb-semantic role pair occurring in all frames $E_f$, that map image representations to $o$-dimensional verb-role representations. Finally, we introduce a tensor of global composition weights, $C \in \mathbb{R}^{m \times o \times p}$. We define a tensor weighting function, $T$, which takes as input a verb $v$, semantic role $e$, noun $n$, and image representation $g_i$:
$$T(v, e, n, g_i) = C \odot \big(d_n \otimes (g_i^T H_{(v,e)}) \otimes g_i\big) \qquad (6)$$
The tensor weighting function constructs an image-specific verb-role representation by multiplying the global image vector with the verb-role matrix, $g_i^T H_{(v,e)}$. Then, it combines the global noun representation, the image-specific role representation, and the global image representation with outer products. Finally, it weights each dimension of the outer product with a weight from $C$. The weights in $C$ indicate which features of the 3-way outer product are important. The final potential is produced by summing all of the elements of the tensor produced by $T$:
$$\phi_e(v, e, n_e, i) = \sum_{x=1}^{m} \sum_{y=1}^{o} \sum_{z=1}^{p} T(v, e, n_e, g_i)[x, y, z] \qquad (7)$$
¹imSitu provides three realized frames per example image.

The tensor produced by $T$ will in general be high dimensional and very expressive. This allows the use of low-dimensional representations for nouns and roles, making the function more robust to small numbers of samples for each noun.
The potential defined in Equation 7 can be equivalently formulated as:

$$\phi_e(v, e, n_e, i) = \mathrm{vec}\big(d_{n_e} \otimes (g_i^T H_{(v,e)})\big)^T A\, g_i \qquad (8)$$
where $A \in \mathbb{R}^{mo \times p}$ is a matrix with the same parameters as $C$ but flattened to lay out the noun and role dimensions together. By aligning terms with Equation 5, one can see that the tensor potential offers an alternative parameterization of the linear regression, one that uses many more general-purpose parameters, those of $C$. Furthermore, it prevents any one parameter from ever being uniquely associated with a single regression, instead compositionally using noun and verb-role representations to build up the parameters of the regression.
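A minimal NumPy sketch of the tensor potential, under our reading of Equations 6-8 (variable names and random initializations are illustrative):

```python
import numpy as np

m, o, p = 32, 32, 1024        # noun, verb-role, and image dims (Section 4)
C = np.random.randn(m, o, p)  # global composition tensor
d = np.random.randn(m)        # noun embedding d_n
H = np.random.randn(p, o)     # verb-role matrix H_(v,e)
g = np.random.randn(p)        # global image representation g_i

# Equations 6-7: weight the 3-way outer product of the noun embedding,
# the image-specific verb-role representation, and the image vector by
# C, then sum all entries of the resulting tensor.
r = g @ H                                   # image-specific verb-role rep
phi = np.einsum('xyz,x,y,z->', C, d, r, g)  # scalar potential

# Equation 8, the equivalent linear-regression view: flatten C over the
# noun and role dimensions into A, build regression weights from the
# outer product of d and r, and regress on g.
A = C.reshape(m * o, p)
phi_alt = np.outer(d, r).reshape(-1) @ A @ g
assert np.allclose(phi, phi_alt)
```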
3.2. Semantic Data Augmentation
Situation recognition is strongly connected to language. Each situation can be thought of as a simple declarative sentence about an activity happening in an image. For example, the first situation in Figure 1 could be expressed as "man carrying baby on chest outside" by knowing the prototypical ordering of semantic roles around verbs and inserting prepositions. This relationship can be used to reduce semantic sparsity: image search can find images that are likely to contain the elements of a situation.
We convert annotated situations to phrases for semantic augmentation by exhaustively enumerating all possible sub-pieces of realized situations that occur in the imSitu training set (see Section 4 for implementation details). For example, for the first situation of Figure 1, we get the pieces (carrying, {(agent, man)}), (carrying, {(agent, man), (item, baby)}), etc. Each of these substructures is converted deterministically to a phrase using a template specific to every verb. For example, the template for carrying is "{agent} carrying {item} {with agentpart} {in place}". Partial situations are realized into phrases by taking the first gloss in WordNet of the synset associated with every noun in the substructure, inserting the glosses into the corresponding slots of the template, and discarding unused slots. For example, the phrases for the sub-pieces above are realized as "man carrying" and "man carrying baby." These phrases are used to retrieve images from Google image search and construct a set, $W = \{(i, v, R_f)\}$, of images annotated with a verb and partially complete realized frames, by assigning retrieved images to the sub-piece that generated the retrieval query.²

Pre-training: Images retrieved from the web can be incorporated in a pre-training phase. The retrieved images only have partially specified realized situations as labels. To account for this, we instead compute the marginal likelihood, $\hat{p}$, of the partially observed situations in $W$:
$$\hat{p}(v, R_f \mid i; \theta) = \sum_{R'_f \supseteq R_f} p\big((v, R'_f) \mid i; \theta\big) \qquad (9)$$

where the sum ranges over all complete realized frames $R'_f$ consistent with the partial frame $R_f$.
During pre-training, we optimize the marginal log-likelihood of $W$. This objective provides a partial clustering over the unobserved roles left unlabeled during the retrieval process.

Self Training: Images retrieved from the web contain significant noise. This is especially true for role-noun combinations that occur infrequently, limiting their utility for pre-training. Therefore, we also consider filtering images in $W$ after a model has already been trained on fully supervised data from imSitu. We rank images in $W$ according to $\hat{p}$ as computed by the trained model and filter all those not in the top-$k$ for every unique $R_f$ in $W$. We then pretrain on this subset of $W$, train again on imSitu, and then increase $k$. We repeat this process until the model no longer improves.
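A hypothetical sketch of the query-generation step, using the carrying template quoted above (the chunked template representation and the helper name are our own):

```python
import re

# Verb-specific template from Section 3.2, split into droppable chunks.
CARRYING = ["{agent}", "carrying", "{item}", "with {agentpart}", "in {place}"]

def realize(chunks, roles):
    """Fill a partial role assignment into a template, discarding any
    chunk whose role slot is unassigned; slot-free chunks (the verb)
    are always kept."""
    out = []
    for chunk in chunks:
        slots = re.findall(r"\{(\w+)\}", chunk)
        if all(s in roles for s in slots):
            for s in slots:
                chunk = chunk.replace("{" + s + "}", roles[s])
            out.append(chunk)
    return " ".join(out)

print(realize(CARRYING, {"agent": "man"}))                  # man carrying
print(realize(CARRYING, {"agent": "man", "item": "baby"}))  # man carrying baby
```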
4. Experimental Setup
Models: All models were implemented in Caffe [21] and use a pretrained VGG network [40] for the base image representation, with the final two fully connected layers replaced by two fully connected layers of dimensionality 1024. We finetune all layers of VGG for all models. For our tensor potential, we use noun embedding size $m = 32$, role embedding size $o = 32$, and the final layer of our VGG network as the global image representation, where $p = 1024$. Larger values of $m$ and $o$ appeared to improve results but were too slow to pretrain, so we omit them. In experiments where we use the image regression in conjunction with a compositional potential, we remove regression parameters associated with combinations seen fewer than 10 times in the imSitu training set to reduce overfitting.
Baseline: We compare our models to two alternative methods for introducing effective sharing between nouns. The first baseline (Noun potential in Tables 1 and 2) adds a potential for nouns, independent of roles, into the baseline CRF. We modify the probability of a situation $S$ given an image $i$ from Equation 1 to decompose not only over pairs of roles $e$ and nouns $n_e$ in a realized frame $R_f$, but also over the nouns $n_e$ alone:
$$p(S \mid i; \theta) \propto \psi_v(v, i; \theta) \prod_{(e, n_e) \in R_f} \psi_e(v, e, n_e, i; \theta)\, \psi_{n_e}(n_e, i) \qquad (10)$$
²…because often no phrase could retrieve correct images. Longer phrases tended to have much lower precision.
The added potential, $\psi_{n_e}$, is computed using a regression from a global image representation for each unique $n_e$. The second baseline we consider is compositional but does not use a tensor-based composition method. The model instead constructs many verb-role representations and combines them with noun representations using inner products (Inner product composition in Tables 1 and 2). In this model, as in the tensor model of Section 3, we use a global image representation $g_i \in \mathbb{R}^p$ and a set of noun vectors $d_n \in \mathbb{R}^m$, one for every noun $n$. We also assume $t$ verb-role matrices $H_{j,(v,e)} \in \mathbb{R}^{o \times p}$, $j = 1, \dots, t$, for every verb-role pair in $E_f$. We compute the corresponding potential as in Equation 11:
$$\phi_e(v, e, n_e, i) = \sum_{j=1}^{t} d_{n_e} \cdot \big(H_{j,(v,e)}\, g_i\big) \qquad (11)$$
The model is motivated by compositional models used for semantic role labeling [14] and allows us to trade off reducing the number of parameters associated with nouns against expressivity. We grid searched values of $t$ such that $t \cdot o$ was at most 256, the largest network we could afford to run, with $o = m$, a requirement of the inner product. We found the best setting at $t = 16$, $o = m = 16$.
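A sketch of this baseline potential under our reading of Equation 11, with the best grid-search shapes from above (variable names are ours):

```python
import numpy as np

t, o, p = 16, 16, 1024        # t verb-role maps; o = m = 16
H = np.random.randn(t, o, p)  # the t matrices H_j for one verb-role pair
d = np.random.randn(o)        # noun embedding d_n, with m = o
g = np.random.randn(p)        # global image representation g_i

# Equation 11: sum over j of inner products between the noun embedding
# and the t image-specific verb-role representations H_j g_i.
phi = sum(d @ (H[j] @ g) for j in range(t))
# equivalently: phi = np.einsum('joq,o,q->', H, d, g)
```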
Decoding: We experimented with two decoding methods for finding the best scoring situation under the CRF models. Systems which used the compositional potentials performed better when first predicting a verb $v_m$ using the max-marginal over semantic roles:

$$v_m = \arg\max_v \sum_{(e, n_e)} p(v, R_f \mid i)$$

and then predicting the realized frame $R_f^m$ with maximum score for $v_m$:

$$R_f^m = \arg\max_{R_f} p(v_m, R_f \mid i).$$
All other systems performed better maximizing jointly for both verb and realized frame.
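The two strategies can be sketched as follows, with a brute-force table of situation probabilities standing in for the CRF's factorized inference (function and variable names are ours):

```python
def decode_two_stage(p):
    """p[v][rf] = p(v, R_f | i). Pick the verb by its marginal over
    realized frames, then the best frame for that verb (the strategy
    that worked best with compositional potentials)."""
    v_m = max(p, key=lambda v: sum(p[v].values()))
    rf_m = max(p[v_m], key=p[v_m].get)
    return v_m, rf_m

def decode_joint(p):
    """Jointly maximize over verb and realized frame (the strategy that
    worked best for all other systems)."""
    return max(((v, rf) for v in p for rf in p[v]),
               key=lambda pair: p[pair[0]][pair[1]])
```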
Optimization: All models were trained with stochastic gradient descent with momentum 0.9 and weight decay 5e-4. Pretraining in semantic augmentation was conducted with an initial learning rate of 1e-3, gradient clipping at 100, and batch size 360. When training on imSitu data, we use an initial learning rate of 1e-5. For all models, the learning rate was reduced by a factor of 10 when the model did not improve on the imSitu dev set.
Semantic Augmentation
In experiments with semantic augmentation, images were retrieved using Google image search. We retrieved 200 medium-sized, full-color, safe-search-filtered images per query phrase. We produced over 1.5 million possible query phrases from the imSitu training set, the majority extremely rare. We limited the phrases to those that occur between 10 and 100 times in imSitu, and for phrases that occur between 3 and 10 times we accepted only those containing at most one noun. Roughly 40k phrases were used to retrieve 5 million images from the web. All duplicate images occurring in imSitu were removed. For pretraining, we ran all experiments up to 50k updates (roughly 4 epochs). For self training, we only self train on rare realized frames (those occurring 10 or fewer times in the imSitu training set). Self training yielded diminishing gains after two iterations; we ran the first iteration at $k = 10$ and the second at $k = 20$.

[Table 2: Situation prediction results on the rare portion of the imSitu development set. Results are divided into models trained only on imSitu data (rows 1-5) and models that use web data through semantic data augmentation, marked +SA (rows 6-8). Models marked +reg also include the image regression potentials used in the baseline. Semantic data augmentation with the baseline hurts for rare cases; semantic augmentation yields larger relative improvements on rare cases, and a composition-based model is required to realize these gains.]
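The phrase-selection rule described above can be summarized in a short sketch (the helper name is hypothetical; counts are occurrences in the imSitu training set):

```python
def keep_phrase(train_count, num_nouns):
    """Keep query phrases seen 10-100 times in imSitu; for phrases seen
    3-10 times, keep only those with at most one noun; drop the rest."""
    if 10 <= train_count <= 100:
        return True
    if 3 <= train_count < 10:
        return num_nouns <= 1
    return False
```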
Evaluation
We use the standard data split for imSitu [44], with 75k train, 25k development, and 25k test images. We follow the evaluation setup defined for imSitu, evaluating verb predictions (verb), semantic role-value pair predictions (value), and full structure correctness (value-all). We report accuracy at top-1, top-5, and given the ground truth verb, as well as the average across all measures (mean). We also report performance on examples requiring rare predictions (10 or fewer examples in the imSitu training set).
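As a sketch of these measures for a single image, assuming a ranked list of predicted situations and the gold annotations (our own helper, not the official imSitu evaluation script):

```python
def score_topk(predictions, gold, k=5):
    """predictions: ranked [(verb, {role: noun}), ...]; gold: the
    annotated situations for one image, all sharing a verb. Returns
    (verb, value, value-all) accuracy indicators at top-k."""
    top = predictions[:k]
    gold_verb = gold[0][0]
    verb_ok = any(v == gold_verb for v, _ in top)
    value = value_all = 0.0
    for v, rf in top:
        if v != gold_verb:
            continue
        # a role-value is counted correct if any annotator used that noun
        hits = [any(g_rf.get(role) == noun for _, g_rf in gold)
                for role, noun in rf.items()]
        value = max(value, sum(hits) / len(hits))
        value_all = max(value_all, float(all(hits)))
    return float(verb_ok), value, value_all
```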
5. Results
Compositional Tensor Potential: Our results on the full imSitu dev set are presented in Table 1, rows 1-5. Overall results demonstrate that adding a noun potential (row 2) and our baseline composition model (row 3) are ineffective and perform worse than the baseline CRF (row 1). We hypothesize that systematic variation in object appearance between roles is challenging for these models. Our tensor composition model (row 4) is able to better capture such variation and effectively share information among nouns, reflected by improvements in value and value-all accuracy given ground truth verbs while maintaining high top-1 and top-5 verb accuracy. However, as expected, many situations cannot be predicted compositionally from nouns alone (consider that a horse sleeping looks very different from a horse swimming, and nothing like a person sleeping). Combining the image regression potential and our tensor composition potential (row 5) yields the best performance, indicating that they model complementary aspects of the problem. Our final model (row 5), trained only on imSitu data, outperforms the baseline on every measure, improving by over 1.70 points overall.
Results on the rare portion of the imSitu dataset are presented in Table 2, rows 1-5. Our final model (row 5) provides the best overall performance (mean column) on rare cases among models trained only on imSitu data, improving by 0.64 points on average. All models struggle to predict entire structures correctly (value-all columns), indicating that rare predictions are extremely hard to get completely correct; on this measure the baseline model, which only uses image regression potentials, performs best. We hypothesize that image regression potentials may allow the model to more easily coordinate predictions across roles simultaneously, because role-noun combinations that always co-occur will always have the same set of regression weights.

Semantic Data Augmentation: Our results on the full imSitu development set are presented in Table 1, rows 6-8. Overall results indicate that semantic data augmentation helps all models, while our tensor model (row 7) benefits more than the baseline (row 6). Self training improves the tensor model slightly (row 8), making it perform better on top-1 and top-5 predictions but hurting performance given gold verbs. On average, our final model outperforms the baseline CRF trained on identical data by 2.04 points.
Results on the rare portion of the imSitu dataset are presented in Table 2, rows 6-8. Surprisingly, on rare cases semantic augmentation hurts the baseline CRF (row 6). Rare-instance image search results are extremely noisy: on close inspection, many of the returned results do not contain the target activity at all but instead contain the target nouns. We hypothesize that without an effective global noun representation, the baseline CRF cannot extract meaningful information from such extra data. On the other hand, our tensor model (row 7) improves on these rare cases overall, and with self training improves further (row 8).
Overall Results: Experiments show that (a) our tensor model performs better in comparable data settings, (b) our semantic augmentation techniques largely benefit all models, and (c) our tensor model benefits more from semantic augmentation. We also present our full top-5 verb performance across all numbers of samples in Figure 5. While our compositional CRF with semantic augmentation outperforms the baseline CRF, both models continue to struggle on uncommon cases. Our techniques give the most benefit for examples requiring predictions of structures seen between 5 and 35 times, while providing somewhat less benefit to even rarer ones. Making further improvements on extremely rare outputs is challenging future work.
We also evaluated our models on the imSitu test set exactly once. The results are summarized in Table 3 for the full imSitu test set and in Table 4 for the rare portion. General trends established on the imSitu dev set are supported. We provide examples in Figure 6 of predictions our final system made on rare examples from the development set.

[Figure 5: Our final compositional CRF with semantic data augmentation outperforms the baseline CRF on rare cases (fewer than 10 training examples), but both models continue to struggle with semantic sparsity. For our final model, the largest improvements relative to the baseline are for cases with 5-35 examples in the training set.]
6. Related Work
Learning to cope with semantic sparsity is closely related to zero-shot or k-shot learning. Attribute-based learning [24, 25, 12], cross-modal transfer [39, 28, 15, 26], and using text priors [32, 18] have all been proposed, but they study classification or other simplified settings. For the structured case, image captioning models [45, 22, 7, 11, 33, 20, 35, 31] have been observed to suffer from a lack of diversity and generalization [42]. Recent efforts to gain insight into such issues extract subject-verb-object (SVO) triplets from captions and count prediction failures on rare tuples [3]; situation recognition generalizes to verbs with more than two arguments.

[Figure 6: Output from our final model on development examples containing rare role-noun pairs. The first row contains examples where the model correctly predicts the entire structure in the top-5 (top-5, value-all). We highlight the particular role-noun pairs that make the examples rare with a yellow box and give the number of occurrences in the imSitu training set. The second row contains examples where the verb was correctly predicted in the top-5 but not all values were predicted correctly. We highlight incorrect predictions in red. Many such predictions occur zero times in the training set (e.g., the third image in the second row). All systems struggle with such cases.]

Compositional models have been explored in a number of natural language processing applications, such as sentiment analysis [41], dependency parsing [27], text similarity [4], and visual question answering [1], as effective tools for combining natural language elements for prediction. Recently, bilinear pooling [30] and compact bilinear pooling [16] have been proposed as second-order feature representations for tasks such as fine-grained recognition and visual question answering. We build on such methods, using low-dimensional embeddings of semantic units and expressive outer product computations.
Using the web as a resource for image understanding has been studied through NEIL [6], a system which continuously queries for concepts discovered in text, and LEVAN [10], which can create detectors from user-specified queries. Web supervision has also been explored for pretraining convolutional neural networks [5], for fine-grained bird classification [5], and for common sense reasoning [38]. Yet we are the first to explore the connection between semantic sparsity and language for automatically generating queries for semantic web augmentation, and we show improvement on a large-scale, fully supervised structured prediction task.
7. Conclusion
We studied situation recognition, a prototypical instance of a structured classification problem with significant semantic sparsity. Despite the fact that the vast majority of possible output configurations are rarely observed in the training data, we showed that it is possible to introduce new compositional models that effectively share examples among required outputs, together with semantic data augmentation techniques, that significantly improve performance. In the future, it will be important to introduce similar techniques for related problems with semantic sparsity and to generalize these ideas to zero-shot learning.
While these templates do not generate completely fluent phrases, preliminary experiments found them sufficiently accurate for image search.