Learning Interpretable Spatial Operations in a Rich 3D Blocks World
In this paper, we study the problem of mapping natural language instructions to complex spatial actions in a 3D blocks world. We first introduce a new dataset that pairs complex 3D spatial operations to rich natural language descriptions that require complex spatial and pragmatic interpretations such as "mirroring", "twisting", and "balancing". This dataset, built on the simulation environment of Bisk, Yuret, and Marcu (2016), attains language that is significantly richer and more complex, while also doubling the size of the original dataset in the 2D environment with 100 new world configurations and 250,000 tokens. In addition, we propose a new neural architecture that achieves competitive results while automatically discovering an inventory of interpretable spatial operations (Figure 5)
One of the longstanding challenges of AI, first introduced as SHRDLU in early 70s (Winograd 1971) , is to build an agent that can follow natural language instructions in a physical environment. The ultimate goal is to create systems that can interact in the real world using rich natural language. However, due to the complex interdisciplinary nature of the challenge (Harnad 1990) , which spans across several fields in AI, including robotics, language, and vision, most existing studies make varying degrees of simplifying assumptions.
On one end of the spectrum is rich robotics paired with simple constrained language (Roy and Reiter 2005; Tellex et al. 2011) , as acquiring a large corpus of natural language grounded with a real robot is prohibitively expensive (Misra et al. 2014; Thomason et al. 2017) . On the other end of the spectrum are approaches based on simulation environments, which support broader deployment at the cost of unrealistic simplifying assumptions about the world (Bisk, Yuret, and Marcu 2016; Wang, Liang, and Manning 2016) . In this paper, we seek to reduce the gap between two complementary research efforts by introducing a new level of complexity to both the environment and the language associated with the interactions. Lifting Grid Assumptions We find that language situated in a richer world leads to richer language. One such example is presented in Figure 1 . To correctly place the UPS block, the system must understand the complex physical, spatial, and pragmatic meaning of language including: (1) the 3D concept of a tower, (2) that new or fourth are referencing an assumed future, and (3) that mirror implies an axis and reflection. However, concepts such as above are often outside the scope of most existing language grounding systems.
In this work, we introduce a new dataset that allows for learning significantly richer and more complex spatial language than previously explored. Building on the simulator provided by Bisk, Yuret, and Marcu (2016) , we create roughly 13,000 new crowdsourced instructions (9 per action), nearly doubling the size of the original dataset in the 2D blocks world introduced in their previous work. We address the challenge of realism in the simulated data by introducing three crucial but previously absent complexities:
1. 3D block structures (lifting 2D assumptions) 2. Fine-grained real valued locations (lifting grid assumptions)
3. Rotational, angled movements (lifting grid assumptions)
Learning Interpretable Operators In addition, we introduce an interpretable neural model for learning spatial operations in the rich 3D blocks world. In particular, in our model instead of using a single layer conditioned on the language for interpreting the operations, we have the model choose which parameters to apply via a softmax over the
The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18) Bisk, Yuret, and Marcu (2016) This work
Freq Relations: left, up, right, directly, above, until New Relations: degrees, rotate, clockwise, covering, corner, top, down, below, bottom, slide, space, between 45, layer, mirror, arch, towers, equally, twist, balance, . .. Figure 2 : Example goal states in our work as compared to the previous Blocks dataset. Our work extends theirs to include rotations, 3D construction, and human created designs. This has a dramatic effect on the language used. Rich worlds facilitate rich language, above are the most common relations in their data and the most common new relations in ours. possible parameter vectors to use. Specifically, by having the model decide for each example which parameters to use, the model picks among 32 different networks, deciding which is appropriate for a given sentence. Learning these networks and when to apply them enables the model to cluster spatial functions. Secondly, by encouraging low entropy in the selector, the model converges to nearly one-hot representations during training. A side effect of this decision is that the final model exposes an API which can be used interactively for focusing the model's attention and choosing its actions. We will exploit this property when generating plots in Figure 5 showing the meaning of each learned function. Our model is still fully end-to-end trainable despite choosing its own parameters and composeable structure, leading to a modular network structure similar to (Andreas et al. 2016) .