Learning Interpretable Spatial Operations in a Rich 3D Blocks World


Abstract

In this paper, we study the problem of mapping natural language instructions to complex spatial actions in a 3D blocks world. We first introduce a new dataset that pairs complex 3D spatial operations with rich natural language descriptions that require complex spatial and pragmatic interpretations such as "mirroring", "twisting", and "balancing". This dataset, built on the simulation environment of Bisk, Yuret, and Marcu (2016), contains language that is significantly richer and more complex, while also doubling the size of the original dataset in the 2D environment with 100 new world configurations and 250,000 tokens. In addition, we propose a new neural architecture that achieves competitive results while automatically discovering an inventory of interpretable spatial operations (Figure 5).

One of the longstanding challenges of AI, first introduced with SHRDLU in the early 70s (Winograd 1971), is to build an agent that can follow natural language instructions in a physical environment. The ultimate goal is to create systems that can interact in the real world using rich natural language. However, due to the complex interdisciplinary nature of the challenge (Harnad 1990), which spans several fields of AI, including robotics, language, and vision, most existing studies make varying degrees of simplifying assumptions.

On one end of the spectrum is rich robotics paired with simple constrained language (Roy and Reiter 2005; Tellex et al. 2011), as acquiring a large corpus of natural language grounded with a real robot is prohibitively expensive (Misra et al. 2014; Thomason et al. 2017). On the other end of the spectrum are approaches based on simulation environments, which support broader deployment at the cost of unrealistic simplifying assumptions about the world (Bisk, Yuret, and Marcu 2016; Wang, Liang, and Manning 2016). In this paper, we seek to reduce the gap between these two complementary research efforts by introducing a new level of complexity to both the environment and the language associated with the interactions.

Lifting Grid Assumptions We find that language situated in a richer world leads to richer language. One such example is presented in Figure 1. To correctly place the UPS block, the system must understand the complex physical, spatial, and pragmatic meaning of language, including: (1) the 3D concept of a tower, (2) that new or fourth reference an assumed future, and (3) that mirror implies an axis and reflection. However, concepts such as these are often outside the scope of most existing language grounding systems.

Figure 1: Example language instruction in our new dataset. The action requires fine-grained positioning and utilizes a complex concept: mirror.

In this work, we introduce a new dataset that allows for learning significantly richer and more complex spatial language than previously explored. Building on the simulator provided by Bisk, Yuret, and Marcu (2016) , we create roughly 13,000 new crowdsourced instructions (9 per action), nearly doubling the size of the original dataset in the 2D blocks world introduced in their previous work. We address the challenge of realism in the simulated data by introducing three crucial but previously absent complexities:

1. 3D block structures (lifting 2D assumptions)
2. Fine-grained real-valued locations (lifting grid assumptions)
3. Rotational, angled movements (lifting grid assumptions)

Learning Interpretable Operators In addition, we introduce an interpretable neural model for learning spatial operations in the rich 3D blocks world. In particular, instead of using a single layer conditioned on the language to interpret operations, we have the model choose which parameters to apply via a softmax over the possible parameter vectors. Specifically, by having the model decide for each example which parameters to use, the model picks among 32 different networks, deciding which is appropriate for a given sentence. Learning these networks and when to apply them enables the model to cluster spatial functions. Secondly, by encouraging low entropy in the selector, the model converges to nearly one-hot representations during training. A side effect of this decision is that the final model exposes an API which can be used interactively for focusing the model's attention and choosing its actions. We will exploit this property when generating the plots in Figure 5, which show the meaning of each learned function. Our model remains fully end-to-end trainable despite choosing its own parameters and composable structure, leading to a modular network structure similar to (Andreas et al. 2016).

Figure 2: Example goal states in our work as compared to the previous Blocks dataset. Our work extends theirs to include rotations, 3D construction, and human-created designs. This has a dramatic effect on the language used. Rich worlds facilitate rich language: the most frequent relations in their data are left, up, right, directly, above, and until, while the most common new relations in ours include degrees, rotate, clockwise, covering, corner, top, down, below, bottom, slide, space, between, 45, layer, mirror, arch, towers, equally, twist, and balance.
Figure 5: A 2D projected visualization of the operations our model discovered. Common operations are described in Table 2. Short arrows are mostly in 3D, and nearly all operations exhibit different behaviors depending on location in the world.

The rest of the paper is organized as follows. We first discuss related work, then introduce our new dataset, followed by our new model. We then present empirical evaluations, analysis of the internal representations, and error analysis. We conclude with a discussion of future work.

https://groundedlanguage.github.io/

In principle, we could work over an RGB rendering of the world, but doing so would add layers of vision complexity that do not help address the dominant language understanding problems.

We did not perform a grid search over parameters, but we did find that the 2D model performed better when a ReLU and Batch Normalization (Ioffe and Szegedy 2015) were used. Finally, the depth values and kernel were set to 1 when training exclusively in 2D.

Related Work

Advances in robotics, language, and vision are all applicable to this domain. The intersection of robotics and language has seen impressive results in grounding visual attributes (Kollar, Krishnamurthy, and Strimel 2013; Matuszek et al. 2014), spatial reasoning (Steels and Vogt 1997; Roy 2002; Guadarrama et al. 2013), and action taking (MacMahon, Stankiewicz, and Kuipers 2006; Yu and Siskind 2013). For example, recent work (Thomason et al. 2015) has shown how these instructions can be combined with exploration on physical robots to follow instructions and learn representations online.

Within computer vision, Visual Question Answering (Antol et al. 2015) has been widely popular. Unfortunately, it is unclear what models are learning and how much they are understanding versus memorizing bias in the training data (Ribeiro, Singh, and Guestrin 2016). Datasets and models have also recently been introduced for visual reasoning (Johnson et al. 2017; Santoro et al. 2017) and referring expressions (Kazemzadeh et al. 2014; Mao et al. 2016).

Finally, within the language community, interest in action understanding follows naturally from research in semantic parsing (Andreas and Klein 2015; Artzi and Zettlemoyer 2013) . Here, the community has traditionally been focused on more complex and naturally occurring text, though this has not always been possible for the navigation domain.

Simultaneously, work within NLP (Bisk, Marcu, and Wong 2016; Wang, Liang, and Manning 2016) and Robotics (Li et al. 2016) returned to the question of action taking and scene understanding in SHRDLU style worlds. The goal with this modern incarnation was to truly solicit natural language from humans without limiting their vocabulary or referents. This was an important step in moving towards unconstrained language understanding.

The largest corpus was provided by Bisk, Yuret, and Marcu (2016). In this work, the authors presented pairs of scenes with simulated blocks to users of Amazon's Mechanical Turk. Turkers would then describe the actions or instructions that their imagined collaborator needs to perform to transform the input scene into the target (e.g. moving a block to the side of another). An important aspect of this dataset is that participants assume they are speaking to another human. This means they do not limit their vocabulary or space of referents, do not simplify their grammar, and do not even write carefully. The annotators assume that whoever reads what they submit is capable of error correction, spatial reasoning, and complex language understanding. This provides an important, and realistic, basis for training artificial language understanding agents. Follow-up work has investigated advances to language representations (Pišl and Mareček 2017), spatial reasoning (Tan and Bansal 2018), and reinforcement learning approaches for mapping language to latent action sequences (Misra, Langford, and Artzi 2017).

Creating Realistic Data

To facilitate closing the gap between simulation and reality, blocks should not have perfect locations, orderings, or alignments. They should have jitter, nuanced alignments, rotations, and the haphazard construction of real objects. Figure 2 shows examples of how our new configurations aim to capture that realism (right) as compared to previous work (left). Previous work created target configurations by downsampling MNIST (LeCun et al. 1998) digits. This enabled them to create interpretable but unrealistic 2D final representations, and the order in which blocks were combined was determined by a heuristic to simulate drawing/writing. In our data, we solicited creations from people around our lab and their children, none of whom were affiliated with the project. They built whatever they wanted (an open concept domain), in three dimensions, and were allowed to rotate the blocks. For example, the animal on the left is an elephant whose trunk, tail, and legs curve. Additionally, because humans built the configurations, we were able to capture the order in which blocks were placed for a more natural trajectory. Realism brings with it important new challenges, discussed below.

Table 1: Corpus statistics for our dataset as compared to previous work (Bisk 16), and the total statistics when combined.

Real Valued Coordinate Spaces The discretized world seen in several recent datasets (Bisk, Yuret, and Marcu 2016; Wang, Liang, and Manning 2016) simplifies spatial reasoning. Simple constructions like left and right can be reduced to exact offsets that do not require context specific interpretations (e.g. right = +[1, 0, 0]). In reality, these concepts depend on the scene around them. For example, in the rightmost image of Figure 2 , it is correct to say that the McDonald's block is right of Adidas, but also that SRI is right of Heineken, despite both having different offsets. The modifier mirroring disambiguates the meaning for us.

Semantically Irrelevant Noise

It is important to note that with realism comes noise. Occasionally, an annotator may bump a block or shift the scene a little. Despite repeated efforts to clean and curate the data, most people did not consider this noise noteworthy because it was semantically irrelevant to the task. For example, if while performing an action, a nearby block jostles, it does not change the holistic understanding of the scene. For this reason, we only evaluate the placement of the block that "moved the furthest." This is a baby step towards building models invariant to changes in the scene orthogonal to the goal.

Physics One concession we were forced to make was relaxing physics. Unlike prior work (Wang et al. 2017) , we insisted that the final configurations roughly adhere to physics (e.g. minimizing overhangs, no floating blocks, limited intersection), but we found volunteers too often gave up if we forced them to build entirely with physics turned on. This also means that intermediary steps that in the real world require a counter-weight can be constructed one step at a time.

Language Our new corpus contains nearly all of the concepts of previous work, but introduces many more. Figure 2 shows the most common relations in prior work, and the most common new concepts. We see that these predominantly focus on rotation (degrees, clockwise, ...) and 3D construction (arch, balance, ...), but higher level concepts like mirroring or balancing pose fundamentally new challenges.

Corpus Statistics

Our new dataset comprises 100 configurations split 70-20-10 between training, testing, and development. Each configuration has between five and twenty steps (and blocks). We present type and token statistics in Table 1, where we use NLTK's (Bird, Klein, and Loper 2009) treebank tokenizer. This yields higher token counts for previous work than originally reported, due to different assumptions about punctuation.

Not all of our annotators made use of the full 20 blocks. As such, we have fewer utterances than the original dataset for the same number of goal configurations. Yet, we find that the instructions for completing our tasks are more nuanced and therefore result in slightly longer sentences on average. Finally, we note that while the datasets are similar, there are significant enough differences that one should not simply assume that training on the combined dataset will necessarily yield a "better" model on either one individually. There are important linguistic and spatial reasoning differences between the two that make our proposed data much more difficult. We present all modeling results on both subsets of the data and the full combined dataset.

Evaluation And Angles

We follow the evaluation setup of prior work and report the average distance (L2, in block lengths) between where a block should be placed and the model's prediction. This metric naturally extends to 3D.

$$\text{Error} = \frac{1}{N} \sum_{i=1}^{N} \left\lVert p_i - \hat{p}_i \right\rVert_2 \quad (1)$$

We also devise a metric for evaluating rotations. In our released data, 1 we captured block orientations as quaternions. This allows for a complete and accurate re-rendering of the exact block orientations produced by our annotators. However, the most semantically meaningful angle is the Eulerian rotation around the Y-axis. We will therefore evaluate error as the minimal angle between the ground truth and prediction in radians as:

$$\text{Error}_\theta = \min\left(|\theta - \hat{\theta}|,\ 2\pi - |\theta - \hat{\theta}|\right) \quad (2)$$
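For concreteness, a minimal sketch of how these two metrics can be computed (the helper names and the assumption that positions are already expressed in block lengths are ours, not the paper's):

```python
import numpy as np

def target_error(pred_xyz, gold_xyz):
    """Mean L2 distance between predicted and gold placements, in block lengths (Eq. 1)."""
    diffs = np.linalg.norm(np.asarray(pred_xyz) - np.asarray(gold_xyz), axis=-1)
    return float(diffs.mean())

def rotation_error(pred_theta, gold_theta):
    """Mean minimal angle, in radians, between predicted and gold Y-axis rotations (Eq. 2)."""
    diff = np.abs(np.asarray(pred_theta) - np.asarray(gold_theta)) % (2 * np.pi)
    return float(np.minimum(diff, 2 * np.pi - diff).mean())
```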

Example Phenomena

In the following example, nine instructions (three per annotator) are provided for the proper placement of McDonald's. We see a diverse set of concepts that include counting, abstract notions like mirror or parallel, geometric concepts like a square or row, and even constraints specified by three different blocks.

Later in the same task, the agent will be asked to rotate a block and place it between the two stacks. We present here just a few excerpts wherein the same action is described in five different ways.

Figure 3: Our target prediction model uses the sentence to produce distributions over operations and blocks (arguments). The argument values illuminate regions of the world before the selected operation is applied. This final representation is used to predict offsets in (x, y, z, θ) space. In practice, two bi-LSTMs were used and the final vector contains rotation information.


1. Rotate SRI to the right ...
2. rotate it 45 degrees clockwise ...
3. only half of one rotation so its corners point where its edges did ...
4. the logo faces the top right corner of the screen ...
5. Spin SRI slightly to the right and then set it in the middle of the 4 stacks

Completing these instructions requires understanding angles, a new set of verbs (rotate, spin, ...), and references to the block's previous orientation. The final example indicates that a spin is necessary, but assumes that the goal of having the block balance between the two stacks is sufficient information to choose the right angle.

The world knowledge and concepts necessary to complete this task are well beyond the ability of any systems we are currently aware of or expect to be built in the near future. Our goal is to provide data and an environment which more accurately reflects the complexity of grounding language to actions. Where previous work broadened the community's understanding of the types of natural language people use by recreating a blocks world with real human annotators, we felt they did not go far enough in really covering the space of actions and therefore language naturally present in even this constrained world.

Model

In addition to our dataset, we propose an end-to-end trainable model that is both competitive in performance and has an easily interpretable internal representation. The model takes a natural language instruction for block manipulation and a 3D representation of the world as input, and outputs where the chosen block should be moved. The model can be broken down into three primary components: a language encoder, softmax attention over arguments (blocks) and operations, and a convolutional network that predicts the target location. Our overall model architecture is shown in Figure 3. By keeping the model modular we can both control the bottlenecks that learning must use for representation and provide ourselves post hoc access to interpretable scene and action representations (explored further in the interpretability section). Without these bottlenecks, the model would be free to represent sentences and operations with arbitrary N-dimensional vectors.

Language Encoder

As is common, we use bidirectional LSTMs (Hochreiter and Schmidhuber 1997; Schuster and Paliwal 1997) to encode the input sentence. We use two LSTMs: one for predicting blocks to attend to, one for choosing the operations to apply. Both LSTMs share a vocabulary embedding matrix, but have no other means of communication. We experimented with using a single LSTM as well as conditioning one on the other, but found it degraded performance.

Once we have produced representations for arguments h_a and operations h_o, we multiply each by its own feedforward layer, then apply a softmax to produce distributions over the 20 blocks and 32 operations, d_a and d_op respectively.

$$d_a = \mathrm{softmax}(W_a h_a), \qquad d_{op} = \mathrm{softmax}(W_{op} h_o) \quad (3)$$
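A minimal PyTorch-style sketch of this encoder; the final-timestep pooling, layer sizes, and class name are illustrative assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InstructionEncoder(nn.Module):
    """Two bi-LSTMs over a shared vocabulary embedding: one head produces the
    argument (block) distribution d_a, the other the operation distribution d_op."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128, n_blocks=20, n_ops=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)      # shared between the two LSTMs
        self.arg_lstm = nn.LSTM(embed_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.op_lstm = nn.LSTM(embed_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.arg_head = nn.Linear(2 * hidden_dim, n_blocks)   # W_a -> d_a over 20 blocks
        self.op_head = nn.Linear(2 * hidden_dim, n_ops)       # W_op -> d_op over 32 operations

    def forward(self, tokens):                                # tokens: (B, T) word indices
        x = self.embed(tokens)                                # (B, T, E)
        h_a, _ = self.arg_lstm(x)                             # (B, T, 2H)
        h_o, _ = self.op_lstm(x)
        # Pool by taking the final timestep of each bi-LSTM as the sentence representation.
        d_a = F.softmax(self.arg_head(h_a[:, -1]), dim=-1)
        d_op = F.softmax(self.op_head(h_o[:, -1]), dim=-1)
        return d_a, d_op
```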

Argument Softmax The first output of our model is an attention over the block IDs. The input world is represented by a 3D tensor of IDs.² We can convert this to a one-hot representation and multiply it by the distribution to get an attention per "pixel" (hereafter referred to as the argument attention map) equal to the model's confidence. In practice we found that the model was better able to learn when the attention map was multiplied by 10. This may be due to parameter initialization. Additionally, we do not allow the model to attend to the background, so it is masked out (the result is shown in Figure 3).

We use the operator * to represent the inner product.

$$A_{i,j,k} = 10 \cdot \mathbf{1}\left[\text{World}_{i,j,k} \neq \text{background}\right] \cdot \left(\text{onehot}(\text{World}_{i,j,k}) * d_a\right) \quad (4)$$
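A small sketch of this attention-map construction (the background ID and the assumption that voxel IDs index directly into d_a are ours):

```python
import torch

def argument_attention_map(world_ids, d_a, background_id=0, scale=10.0):
    """Sketch of Eq. 4: each voxel receives the model's confidence in the block it
    contains; background voxels are masked out and the map is scaled by 10."""
    # world_ids: (B, D, H, W) integer block IDs; d_a: (B, n_blocks) softmax output.
    flat = world_ids.flatten(1)                                  # (B, D*H*W)
    attn = torch.gather(d_a, 1, flat).view(world_ids.shape)      # confidence per voxel
    mask = (world_ids != background_id).float()                  # do not attend to background
    return scale * attn * mask                                   # (B, D, H, W)
```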

Operation Softmax The second distribution we predict is over functions for spatial relations. Here the model needs to choose how far and in what directions to go from the blocks it has chosen to focus on. Unfortunately, there is no a priori set of such functions, as we have specifically chosen not to pretrain or bias the model in this capacity, so the model must perform a type of clustering where it simultaneously chooses a weighted sum of functions and trains their values. As noted previously, for the sake of interpretability, we force the encoding for operations (d_op) to be a latent softmax distribution over 32 logits. The final operation vector that is passed along to the convolutional model is computed as:

$$v_{op} = M_{op}\, d_{op} \quad (5)$$

Here, M_op is a set of 32 basis vectors. The output vector v_op is a weighted average over all 32 basis vectors, using d_op to weight each individual basis. The goal of this formulation is that each of the 32 basis vectors can be independently interpreted by replacing d_op with a 1-hot vector, allowing us to see what type of spatial operation each vector represents. The choice of 32 basis vectors was an empirical one. We only experimented with powers of two, and a better value quite likely exists.
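As a sketch, this basis bank can be implemented as a single learned matrix (the basis dimensionality here is an assumption):

```python
import torch
import torch.nn as nn

class OperationBasis(nn.Module):
    """Sketch of Eq. 5: v_op = M_op d_op, a bank of 32 learned basis vectors
    mixed by the operation softmax; a 1-hot d_op selects a single basis vector."""
    def __init__(self, n_ops=32, op_dim=32):
        super().__init__()
        self.M_op = nn.Parameter(0.1 * torch.randn(op_dim, n_ops))  # columns are basis vectors

    def forward(self, d_op):
        # d_op: (B, n_ops) softmax weights -> v_op: (B, op_dim)
        return d_op @ self.M_op.t()
```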

Predicting A Location

The second half of our pipeline features a convolutional model that combines the encoded operation and argument blocks with the world representation to determine the final location of the block-to-move.

Given the aforementioned argument attention map (a tensor A of size B × D × H × W × 1), our model starts by applying the operation vector v_op at every location of the map, weighted by each location's attention score. This creates a world representation of size B × D × H × W × |v_op|. We then pass this world through two convolutional layers using tanh or relu nonlinearities.
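A sketch of this broadcast-and-convolve step (filter sizes follow the Implementation Details section; the 'same' padding, which keeps the world dimensions unchanged, is an assumption):

```python
import torch
import torch.nn as nn

class WorldConvNet(nn.Module):
    """Sketch: broadcast v_op over the attended world, then apply two 3D convolutions."""
    def __init__(self, op_dim=32):
        super().__init__()
        self.conv1 = nn.Conv3d(op_dim, op_dim, kernel_size=(4, 5, 5), padding='same')
        self.conv2 = nn.Conv3d(op_dim, op_dim, kernel_size=(4, 3, 3), padding='same')

    def forward(self, attn_map, v_op):
        # attn_map: (B, D, H, W); v_op: (B, C) -> world: (B, D, H, W, C)
        world = attn_map.unsqueeze(-1) * v_op[:, None, None, None, :]
        world = world.permute(0, 4, 1, 2, 3)      # channels-first for Conv3d
        world = torch.tanh(self.conv1(world))
        world = torch.tanh(self.conv2(world))
        return world.permute(0, 2, 3, 4, 1)       # back to (B, D, H, W, C)
```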

In order to predict the final location for the block-to-move, we apply a final 1 × 1 × 1 convolutional layer to predict offsets and their respective confidences for each location relative to a coordinate grid (8 values total). The coordinate grid is a constant 3D tensor generated by uniformly sampling points across each coordinate axis to achieve the desired resolution. Given the coordinate grid, the goal of the learned convolutional model is to, at every sampled point, predict offsets for x, y, z, θ, as well as a confidence for each predicted offset. This formulation was similarly used for keypoint localization in (Singh, Hoiem, and Forsyth 2016). Let g_x(i, j, k) be the x coordinate of the sampled grid point at grid location (i, j, k), and let d_x(i, j, k) and c_x(i, j, k) be the respective offsets and confidences; then the final predicted x̂ coordinate for the block-to-move is computed as:

$$\hat{x} = \sum_{i,j,k} c_x(i, j, k)\,\left(g_x(i, j, k) + d_x(i, j, k)\right) \quad (6)$$

Here, the confidences c_x(i, j, k) are softmax normalized across all grid points. Predictions for ŷ and ẑ are computed similarly. We compute θ̂ without a coordinate grid, such that:

$$\hat{\theta} = \sum_{i,j,k} c_\theta(i, j, k)\, d_\theta(i, j, k)$$
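A sketch of this prediction head; the tensor layout and helper name are assumptions:

```python
import torch
import torch.nn.functional as F

def predict_coordinates(head_out, grid):
    """Confidence-weighted offsets over a coordinate grid (Eq. 6 and the θ equation above).
    head_out: (B, D, H, W, 8) -- offsets (d_x, d_y, d_z, d_theta) followed by their logits.
    grid: (D, H, W, 3) constant tensor of uniformly sampled (x, y, z) coordinates."""
    offsets, logits = head_out[..., :4], head_out[..., 4:]
    # Softmax-normalize each channel's confidences across all grid points.
    conf = F.softmax(logits.flatten(1, 3), dim=1).view_as(logits)
    # x, y, z: grid coordinate plus predicted offset, weighted by confidence.
    xyz = (conf[..., :3] * (grid.unsqueeze(0) + offsets[..., :3])).sum(dim=(1, 2, 3))
    # theta: confidence-weighted offset only, with no coordinate grid.
    theta = (conf[..., 3] * offsets[..., 3]).sum(dim=(1, 2, 3))
    return xyz, theta                              # (B, 3) and (B,)
```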

Implementation Details

Our model is trained end-to-end using Adam (Kingma and Ba 2014) with a batch size of 32. The convolutional aspect of the model has 3 layers and operates on a world representation of dimensions 32 × 4 × 64 × 64 × 32 (batch, depth, height, width, channels). The first convolutional layer uses a filter of size 4 × 5 × 5 and the second of size 4 × 3 × 3, each followed by a tanh nonlinearity for the 3D model.³ Both layers output a tensor with the same dimensions as the input world. The final prediction layer is a 1 × 1 × 1 filter that projects the 32-dimensional vector at each location down to 8 values, as detailed in the previous section. We further include an entropy term to encourage peakier distributions in the argument and operation softmaxes.
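The entropy term might look like the following sketch (its weight is an assumption; the paper does not state one):

```python
import torch

def entropy_penalty(d_a, d_op, weight=0.1):
    """Sketch of the regularizer encouraging peaky argument/operation softmaxes;
    the returned value is added to the training loss."""
    eps = 1e-8
    h_a = -(d_a * (d_a + eps).log()).sum(dim=-1).mean()
    h_op = -(d_op * (d_op + eps).log()).sum(dim=-1).mean()
    return weight * (h_a + h_op)
```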

Interpretability And Visualizing The Model

One of the features of our model is its interpretability, which we ensured by placing information bottlenecks within the architecture. By designing the language-to-operation encoding process as predicting a probability distribution over a set of learned basis vectors, we can interpret each vector as a separate operation and visualize the behaviors of each operation vector individually.

Figure 4: Interpolations of operations 23 (north) and 26 (east) being applied at nine locations around the world.

Visualizing Operations We generated Figure 5 by placing a single block in the world, moving it around a 9-by-9 grid, and passing a 1-hot operation choice vector to our model. We then plot a vector from the block's center to the predicted target location. We see many simple and expected relationships (left, right, ...), but importantly we see that the operations are location-specific functions, not simply offsets. Operations on the edges of the world are more fine-grained, and many move directly to a region of the world (e.g. 9 = "center"), not simply by an offset. It is also possible that some of the more dramatic edge vectors serve as a failsafe mechanism for misinterpreted instructions. In particular, nearly all of the operations, when applied in the bottom right corner, redirect to the center of the board rather than off of it.

Table 2: Utterances with low entropy in their Op predictions were mapped to their corresponding argmax dimension. We extract the relevant phrases here for common dimensions.

Additionally, while shown here in 2D, all of our predictions are actually made in 3D and contain rotation predictions.

Table 3: A comparison of our interpretable model with previous results (top) in addition to our performance on our new corpus (v2). Finally, we show how training jointly on both corpora has only a very moderate effect on performance, indicating the complementarity of the data. Target values are error measurements in block-lengths (lower is better).

Interpolating Operations The 1-hot operations can be treated like API calls where several can be called at the same time and interpolated. Figure 4 shows the predicted offsets when interpolating operations 23 (north) and 26 (east). There are two important takeaways from this. First, we see that when combined, we can sweep out angles in the first quadrant to reference them all. Second, we see that magnitudes and angles change as we move closer to the edges of the world. This result is intuitive and desired. Specifically, a location like "to the right" has a variable interpretation depending on how much space exists in the world, and the model is trying to make sure not to push a block off the table. In practice, our analysis found very few clear cases of the model using this power. More commonly, mass would be split between two very similar operations or the sentence was a compound construction (left of X and above Y). We did find that operation 11 correlated with the description between but it is difficult to divine why from the grid. An important extension for future work will be to construct a model which can apply multiple operations to several distinct arguments.
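As a usage sketch, interpolating two discovered operations amounts to mixing their 1-hot codes before Eq. 5 (the indices below follow the operation numbering in Figure 4, and the helper name is ours):

```python
import torch

def mix_operations(i, j, alpha, n_ops=32):
    """Blend two learned operations by mixing their 1-hot codes; the result can
    be fed to the model in place of the language-predicted d_op."""
    d_op = torch.zeros(1, n_ops)
    d_op[0, i] = alpha
    d_op[0, j] = 1.0 - alpha
    return d_op

# e.g. mix_operations(23, 26, 0.5) sweeps between "north" (23) and "east" (26)
```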

Linguistic Paraphrase Using the validation data, we clustered sentences by their predicted operation vectors. To pick out phrases we only look at sentences with very low entropy distributions (highly confident) and we present our findings in Table 2 . We see that specifications range from short one word indicators (e.g. below) to nearly complete sentences (on the east side of the nvidia cube so that there is no space in between them). This also touches on the fact that several operations have the same direction but different magnitudes. Specifically, operation 23 means far above, not directly, and we see this in the visualized grid as well.

Results

In Table 3, we compare our model against existing work, and evaluate on both the original Blocks data (v1) and our new corpus (v2). While our primary goal was the creation of an interpretable model and the introduction of new spatial and linguistic phenomena, it is important to see that our model also performs well. We note three important results: First, we see that our model outperforms the original model of Bisk 16, and is only slightly weaker than Pišl 17. Our technique does outperform theirs when given the correct source block, so it is possible that we can match their performance with tuning.

Second, our results indicate that the new data (v2) is harder than v1, both in terms of isolating the correct block to move (91% vs. 98% accuracy) and average error (1.15 vs. 0.80) in the End-to-End setting. Further, a model trained on the union of our corpora improved in source prediction on both the v1 and v2 test sets, but target location performance was either unaffected or slightly deteriorated. This indicates to us that the new dataset is in fact complementary and adds new constructions. Finally, our model has an average rotation error of 0.058 radians (three degrees). In validation, 46% of predictions require a rotation, and 1,374 of 1,491 predictions are within 2 degrees of the correct orientation. The remainder have dramatically larger errors (36 at 30°, 81 at 45°). This means that the model is learning to interpret the scene and utterance correctly in the vast majority of cases.

Error Analysis

Several of our model's worst performing examples are included in Table 4 . The model's error is presented alongside the goal configuration and misunderstood instruction.

Table 4: Several of our worst performing results. Errors are in block lengths, the images are the goal configuration, and the instructions have been lowercased and tokenized.

The first example specifies the goal location using an abstract concept (tower) and the offset (equidistant) implies recognition of a larger pattern. The second example specifies the goal location in terms of "the 4 stacks", again without naming any of them and in 3D. Finally, the third demonstrates a particularly nice phenomenon in human language where a plan is specified, the speaker provides categorizing information to enable its recognition, and then can use this newly defined concept as a referent. No models to our knowledge have the ability to dynamically form new concepts in this manner.

place the block that is to the right of the stella block as the highest block on the board. it should be in line with the bottom block.

Table 5: Example utterance which requires both understanding that highest is a 3D concept, and inferring that the 2D concept of a line has been rotated to be in the z-dimension.

Rotations Despite a strong performance by the model on rotations, there are a number of cases that were completely overlooked. Upon inspection, these appear to be predominantly cases where the rotation is not explicitly mentioned, but instead assumed or implied:

• place toyota on top of sri in the same direction .

• take toyota and place it on top of sri .

• ... making part of the inside of the curve of the circle .

The first two should be the focus of immediate future work, as they only require assuming that a new block keeps the orientation of the existing block below it unless there is a compelling reason (e.g. balance) to rotate it. The third case returns to our larger discussion on understanding geometric shapes and is probably out of scope for most approaches.

Conclusions

This work presents a new model which moves beyond simple spatial offset predictions (+x, +y, +z) to learn functions which can be applied to the scene. We achieve this without losing interpretability. In addition, we introduce a new corpus of 10,000 actions and 250,000 tokens which contains a plethora of new concepts (subtle movements, balance, rotation) to advance research in action understanding.