The Conference on Computer Vision and Pattern Recognition (CVPR) is one of the world’s top computer vision (CV) conferences. CVPR 2019 runs June 15 through June 21 in Long Beach, California, and the list of accepted papers for the prestigious gathering has now been released.

A total of 1300 papers were accepted from a record-high 5165 submissions this year, and one standout already garnering attention is Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation. The paper reportedly received three “Strong Accept” ratings in peer review and ranked No. 1, according to University of California, Santa Barbara NLP Group Director William Wang, who is also one of the paper’s authors.

SNS posts from William Wang regarding the paper

The paper proposes a new method for vision-language navigation (VLN) tasks that combines the strengths of both reinforcement learning and self-supervised imitation learning.

“Turn right and head towards the kitchen. Then turn left, pass a table and enter the hallway. Walk down …”

The method enables a robot (embodied agent) to navigate to a target position within a 3D environment by following natural language instructions that reference environmental landmarks, much like how humans give directions.

One of the challenges for VLN tasks is that natural language navigational instructions are usually based on a global view of the complete path, while an embodied agent can only observe its local visual scene. Because the agent cannot access a top-down view of the global trajectory, it has to convert the natural language instructions into a global visual trajectory, and then gradually explore and navigate to the target position based on a series of local visual scenes.

The paper proposes two approaches to tackle these VLN problems: Reinforced Cross-Modal Matching (RCM) and Self-Supervised Imitation Learning (SIL). RCM matches instructions to trajectories while also evaluating whether the path being executed is consistent with the instructions given. SIL, meanwhile, is used mainly for exploring unseen environments by imitating the agent’s own past successful decisions.
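This article does not detail how SIL is implemented, so the following is only a toy sketch of the general idea of "imitating past successful decisions": for each instruction, keep the best-scoring trajectory seen so far so that it can later serve as an imitation target. The class name BestTrajectoryBuffer, the toy_score function and the string-based trajectories are all hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict, List, Tuple

Trajectory = List[str]  # hypothetical: a trajectory as a list of landmark names


class BestTrajectoryBuffer:
    """Hypothetical buffer keeping the best-scoring trajectory per instruction."""

    def __init__(self) -> None:
        self._best: Dict[str, Tuple[float, Trajectory]] = {}

    def update(
        self,
        instruction: str,
        candidates: List[Trajectory],
        score_fn: Callable[[str, Trajectory], float],
    ) -> Trajectory:
        """Score new candidates and return the best trajectory seen so far."""
        best_score, best_traj = self._best.get(instruction, (float("-inf"), []))
        for traj in candidates:
            score = score_fn(instruction, traj)
            if score > best_score:
                best_score, best_traj = score, list(traj)
        self._best[instruction] = (best_score, best_traj)
        return best_traj


def toy_score(instruction: str, traj: Trajectory) -> float:
    """Assumed stand-in scorer: count trajectory steps named in the instruction."""
    words = set(instruction.lower().split())
    return float(sum(1 for step in traj if step in words))


buffer = BestTrajectoryBuffer()
instruction = "turn right then enter the kitchen"
first = buffer.update(instruction, [["left", "hallway"], ["right", "kitchen"]], toy_score)
# A later, worse rollout does not replace the stored trajectory.
second = buffer.update(instruction, [["left"]], toy_score)
```

In a real system the stored trajectory would be used as a supervision signal for the navigation policy, and the scorer would be a learned model rather than a word count.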

The two methods were evaluated on the Room-to-Room (R2R) dataset, and five evaluation metrics were reported. Success rate weighted by inverse Path Length (SPL) is regarded as the most appropriate metric because it measures both effectiveness (whether the agent reaches the target) and efficiency (how long a path it takes to get there).
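As a rough illustration of how SPL combines success and path efficiency, here is a minimal sketch based on the commonly used definition: each successful episode contributes the ratio of the shortest-path length to the larger of the shortest-path length and the length of the path actually taken, and the result is averaged over all episodes. The function name spl and the example numbers are assumptions made for illustration, not values from the paper.

```python
from typing import Sequence


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Success rate weighted by inverse path length, averaged over episodes."""
    if not (len(successes) == len(shortest_lengths) == len(path_lengths)):
        raise ValueError("all inputs must have the same length")
    if len(successes) == 0:
        raise ValueError("at least one episode is required")

    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        if shortest <= 0:
            # Assumption for this sketch: start and goal are always distinct.
            raise ValueError("shortest path length must be positive")
        if success:
            # A direct path scores 1.0; a detour is penalized proportionally.
            total += shortest / max(taken, shortest)
    return total / len(successes)


# Hypothetical episodes: a direct success, a success with a detour, a failure.
score = spl(
    successes=[True, True, False],
    shortest_lengths=[10.0, 10.0, 10.0],
    path_lengths=[10.0, 20.0, 5.0],
)
```

In this made-up example the three episodes contribute 1.0, 0.5 and 0.0, so the agent scores 0.5 even though it succeeded in two of three episodes, which is how SPL penalizes wandering.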

Detailed results of the evaluations are shown in the following tables. The SPL score improves significantly, from 28% to 35%, when adopting RCM in comparison with previous state-of-the-art (SOTA) methods.

The table below shows that a policy trained with SIL considerably narrows the performance gap between seen and unseen environments (from ~30% to ~11%).

The response so far to Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation suggests that it may be a candidate for CVPR 2019’s prestigious best paper award. You can read the paper on arXiv.