Target-driven visual navigation in indoor scenes using deep reinforcement learning
Two less-addressed issues of deep reinforcement learning are (1) lack of generalization capability to new goals, and (2) data inefficiency, i.e., the model requires several (and often costly) episodes of trial and error to converge, which makes it impractical to apply to real-world scenarios. In this paper, we address these two issues and apply our model to target-driven visual navigation. To address the first issue, we propose an actor-critic model whose policy is a function of the goal as well as the current state, which allows better generalization. To address the second issue, we propose the AI2-THOR framework, which provides an environment with high-quality 3D scenes and a physics engine. Our framework enables agents to take actions and interact with objects, so we can collect a large number of training samples efficiently. We show that our proposed method (1) converges faster than state-of-the-art deep reinforcement learning methods, (2) generalizes across targets and scenes, (3) generalizes to a real robot scenario with a small amount of fine-tuning (although the model is trained in simulation), and (4) is end-to-end trainable and does not need feature engineering, feature matching between frames, or 3D reconstruction of the environment.
Many tasks in robotics involve interactions with physical environments and objects. One of the fundamental components of such interactions is understanding the correlation and causality between actions of an agent and the changes of the environment as a result of the action. Since the 1970s, there have been various attempts to build a system that can understand such relationships. Recently, with the rise of deep learning models, learning-based approaches have gained wide popularity.
In this paper, we focus on the problem of navigating a space to find a given target goal using only visual input.* Successful navigation requires learning the relationships between actions and the environment, which makes the task well suited to a deep reinforcement learning (DRL) approach. However, general DRL approaches are designed to learn a policy that depends only on the current state, with the target goal implicitly embedded in the model parameters. Hence, new model parameters must be learned for every new target, which is problematic since training DRL agents is computationally expensive. To achieve higher adaptability and flexibility, we introduce a target-driven model. Our model takes the visual task objective as an input and learns a policy that jointly embeds the target goal and the current state. Essentially, the agent learns to take its next action conditioned on both its current state and the target, rather than its current state only, so there is no need to re-train the model for new targets. A key intuition that we rely on is that different training episodes share information. For example, agents explore common routes during the training stage while being trained to find different targets. Various scenes also share generalizable aspects (e.g., a fridge is usually near a microwave). In short, we exploit the fact that learning for new targets is easier with models that have already been trained for other targets.

* This work is part of the Plato project of the Allen Institute for Artificial Intelligence (AI2) and was done while the first author was an intern at AI2.
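To make the conditioning concrete, the following toy sketch shows a policy whose action distribution is a function of both the current observation and the target observation, so changing the target changes the behavior without retraining. It is a minimal NumPy illustration under assumed shapes and names, not the paper's actual deep siamese actor-critic network:

```python
import numpy as np

rng = np.random.default_rng(0)

class TargetDrivenPolicy:
    """Toy policy conditioned on BOTH the current and the target observation.

    Illustrative only: the paper uses a deep siamese actor-critic network;
    here a single shared linear embedding stands in for the learned encoder.
    """

    def __init__(self, obs_dim, embed_dim, n_actions):
        # Shared embedding applied to state and target (siamese-style).
        self.W_embed = rng.normal(0.0, 0.1, (obs_dim, embed_dim))
        # Policy head over the joint (state, target) embedding.
        self.W_pi = rng.normal(0.0, 0.1, (2 * embed_dim, n_actions))

    def action_probs(self, state_obs, target_obs):
        # Embed state and target with the SAME weights, then fuse them.
        joint = np.concatenate([state_obs @ self.W_embed,
                                target_obs @ self.W_embed])
        logits = joint @ self.W_pi
        e = np.exp(logits - logits.max())  # stable softmax
        return e / e.sum()

policy = TargetDrivenPolicy(obs_dim=16, embed_dim=8, n_actions=4)
state = rng.normal(size=16)
probs_a = policy.action_probs(state, target_obs=rng.normal(size=16))
probs_b = policy.action_probs(state, target_obs=rng.normal(size=16))
# Same state, different targets -> different action distributions,
# which is exactly what a state-only policy cannot express.
```

A state-only DRL policy would map `state` to a single fixed distribution; here the target enters the forward pass, so new goals only require a new target observation.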
Unfortunately, training and quantitatively evaluating DRL algorithms in real environments is often tedious. One reason is that running systems in a physical space is time consuming. Furthermore, acquiring large-scale action and interaction data of real environments is not trivial via common image dataset collection techniques. To this end, we developed one of the first simulation frameworks with high-quality 3D scenes, called The House Of inteRactions (AI2-THOR). Our simulation framework enables us to collect a large number of visual observations for action and reaction in different environments. For example, an agent can freely navigate (i.e., move and rotate) in various realistic indoor scenes, and is able to have low- and high-level interactions with the objects (e.g., applying a force or opening/closing a microwave).
We evaluate our method for the following tasks: (1) Target generalization, where the goal is to navigate to targets that have not been used during training within a scene. (2) Scene generalization, where the goal is to navigate to targets in scenes not used for training. (3) Real-world generalization, where we demonstrate navigation to targets using a real robot. Our experiments show that we outperform the state-of-the-art DRL methods in terms of data efficiency for training. We also demonstrate the generalization aspect of our model. In summary, in this paper, we introduce a novel reinforcement learning model that generalizes across targets and scenes. To learn and evaluate reinforcement learning models, we create a simulation framework with high-quality rendering that enables visual interactions for agents. We also demonstrate real robot navigation using our model generalized to the real world with a small amount of fine-tuning.
II. Related Work
There is a large body of work on visual navigation; we provide a brief overview of the most relevant approaches. Map-based navigation methods require a global map of the environment to make navigation decisions. One of the main advantages of our method over these approaches is that it does not need a prior map of the environment. Another class of navigation methods reconstructs a map on the fly and uses it for navigation, or goes through a human-guided training phase to build the map. In contrast, our method does not require a map of the environment, as it makes no assumptions about the landmarks of the environment, nor does it require a human-guided training phase. Map-less navigation methods are common as well; these methods mainly focus on obstacle avoidance given the input image. Our method is considered map-less, but it possesses implicit knowledge of the environment. Surveys of visual navigation methods provide a broader overview of this literature.
Note that our approach is not based on feature matching or 3D reconstruction. Moreover, it does not require supervised training for recognizing distinctive landmarks.
Reinforcement Learning (RL) has been used in a variety of applications: policy gradient approaches for locomotion of a four-legged robot and for learning motor primitives, RL-based obstacle detection using a monocular camera, autonomous helicopter flight, automating the data collection process for mapping, kernel-based RL algorithms for large-scale settings, and decision making in ATARI games. In contrast to these approaches, our model uses deep reinforcement learning to handle high-dimensional sensory inputs.
Recently, methods that integrate deep learning with RL have shown promising results: deep Q-networks that play ATARI games, a search algorithm based on the integration of Monte-Carlo tree search with deep RL that beat the world champion in the game of Go, a deep RL approach in which the parameters of the deep network are updated by multiple asynchronous copies of the agent in the environment, and a deep RL approach that directly maps raw images to torques at robot motors. Our work deals with much more complex inputs than ATARI games, or images taken in a lab setting with a constrained background. Additionally, our method generalizes to new scenes and new targets, while the mentioned methods must be re-trained for a new game or after a change in the game rules.

[Fig. 2: Screenshots of our framework and other simulated learning frameworks: ALE, ViZDoom, UETorch, Project Malmo, SceneNet, TORCS, SYNTHIA, Virtual KITTI.]
There have been some efforts to develop learning methods that can generalize to different target tasks. In contrast, our model takes the target goal directly as an input without the need for re-training.
Recently, physics engines have been used to learn the dynamics of real-world scenes from images. In this work, we show that a model trained in simulation can generalize to real-world scenarios.
III. AI2-THOR Framework
To train and evaluate our model, we require a framework for performing actions and perceiving their outcomes in a 3D environment. Integrating our model with different types of environments is a main requirement for its generalization. Hence, the framework should have a plug-and-play architecture such that different types of scenes can be easily incorporated. Additionally, the framework should have a detailed model of the physics of the scene so that movements and object interactions are properly represented.
For this purpose, we propose The House Of inteRactions (AI2-THOR) framework, which is designed by integrating a physics engine (Unity 3D) with a deep learning framework (TensorFlow). The general idea is that the rendered images of the physics engine are streamed to the deep learning framework, and the deep learning framework issues a control command based on the visual input and sends it back to the agent in the physics engine. Similar frameworks have been proposed before, but the main advantages of our framework are as follows: (1) The physics engine and the deep learning framework communicate directly, in contrast to approaches that separate the physics engine from the controller. Direct communication is important because the feedback from the environment can be used immediately for online decision making. (2) We tried to mimic the appearance distribution of real-world images as closely as possible. For example, some frameworks work on Atari games, which are 2D environments limited in terms of appearance, while others are collections of synthetic scenes that are non-photo-realistic and do not follow the distribution of real-world scenes in terms of lighting, object appearance, textures, background clutter, etc. This realism is important for enabling generalization to real-world images.
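The render-decide-act loop between the engine and the learner can be sketched as below. This is a toy stand-in with hypothetical class and action names, not the actual AI2-THOR API; a real engine would stream image frames where this sketch passes a scalar position:

```python
# Minimal sketch of the render -> decide -> act loop described above.
# ToySimulator, toy_controller, and the action strings are illustrative
# assumptions, not part of the AI2-THOR API.

class ToySimulator:
    """Stands in for the physics engine: renders an observation per step."""
    def __init__(self):
        self.position = 0

    def render(self):
        return self.position  # a real engine would return a rendered frame

    def step(self, action):
        if action == "MoveAhead":
            self.position += 1

def toy_controller(observation, target):
    # Target-driven decision: keep moving until observation matches target.
    return "MoveAhead" if observation < target else "Stop"

sim = ToySimulator()
trajectory = []
for _ in range(10):
    obs = sim.render()               # engine streams the observation ...
    action = toy_controller(obs, 3)  # ... learner returns a command ...
    trajectory.append(action)
    if action == "Stop":
        break
    sim.step(action)                 # ... engine executes it immediately
print(trajectory)  # ['MoveAhead', 'MoveAhead', 'MoveAhead', 'Stop']
```

The point of the sketch is advantage (1) above: the controller's command is executed by the engine within the same loop iteration, so each decision is conditioned on the most recent observation.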
To create indoor scenes for our framework, we provided reference images to artists, who created 3D scenes with textures and lighting similar to the images. So far we have 32 scenes belonging to 4 common scene types in a household environment: kitchen, living room, bedroom, and bathroom. On average, each scene contains 68 object instances.
The advantage of using a physics engine for modeling the world is that it is highly scalable (training a robot in real houses is not easily scalable). Furthermore, training the models can be performed more cheaply and safely (e.g., the actions of a robot might damage objects). One main drawback of using synthetic scenes is that the details of the real world are under-modeled. However, recent advances in the graphics community make it possible to have a rich representation of real-world appearance and physics, narrowing the discrepancy between the real world and simulation. Fig. 2 provides a qualitative comparison between a scene in our framework and example scenes in other frameworks and datasets. As shown, our scenes better mimic the appearance properties of real-world scenes. In this work, we focus on navigation, but the framework can be used for more fine-grained physical interactions, such as applying a force, grasping, or object manipulations such as opening and closing a microwave. Fig. 3 shows a few examples of high-level interactions. We will provide Python APIs with our framework for an AI agent to interact with the 3D scenes.