
VisionTrap

Vision-Augmented Trajectory Prediction Guided by Textual Descriptions

Seokha Moon1  Hyun Woo1  Hongbeen Park1  Haeji Jung1  Reza Mahjourian2  Hyung-gun Chi3  Hyerin Lim4  Sangpil Kim1  Jinkyu Kim1

1 Korea University  2 UT Austin  3 Purdue University   4 Hyundai Motor Company

Author's email: shmoon96@korea.ac.kr

ECCV 2024

Paper | Dataset

Our research introduces VisionTrap, a novel method that significantly enhances trajectory prediction for autonomous vehicles by integrating visual cues from surround-view cameras and textual descriptions generated by Vision-Language Models. Additionally, we release the nuScenes-Text dataset, which augments the nuScenes dataset with rich textual descriptions to support further research.

Abstract

In the realm of autonomous driving, accurately predicting the future trajectories of road agents is crucial for ensuring safety and efficiency. Traditional trajectory prediction methods primarily rely on past trajectories and high-definition (HD) maps. While these inputs provide valuable information, they often miss out on essential contextual cues such as the intentions of pedestrians, road conditions, and dynamic interactions between agents.

Why Vision?

  • Contextual Understanding: Surround-view cameras capture rich visual data that includes human gestures, gazes, and road conditions. These visual cues provide critical context that can significantly influence an agent’s behavior.
  • Real-time Insights: Visual inputs allow the model to understand and react to real-time changes in the environment, such as sudden movements of pedestrians or changes in traffic signals.

Why Textual Descriptions?

  • Semantic Guidance: Textual descriptions generated by Vision-Language Models (VLMs) offer high-level semantic information that can guide the model’s learning process. These descriptions can highlight important aspects of the scene, such as “a pedestrian is carrying stacked items and is expected to remain stationary.”
  • Enhanced Supervision: By refining these textual descriptions with a Large Language Model (LLM), we provide clear and precise guidance to the model, improving its ability to learn relevant features from the visual data.

Method

Our approach integrates visual and textual data to enhance trajectory prediction through four components; a minimal sketch of the overall data flow follows the list.

  1. Per-agent State Encoder: Processes each agent’s past trajectories and attributes, encoding them into context features.
  2. Visual Semantic Encoder: Integrates multi-view images and map data into a unified BEV feature, capturing crucial visual context. A Scene-Agent Interaction module then models how each agent interacts with its surrounding environment.
  3. Text-driven Guidance Module: Uses textual descriptions to supervise the model, aligning visual features with semantic information through contrastive learning. This module is used only during training; an illustrative loss sketch appears below.
  4. Trajectory Decoder: Predicts each agent's future positions from the enriched state embeddings.
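
To make the data flow concrete, the minimal sketch below wires the four components together in PyTorch. The module internals, tensor shapes, and hyperparameters are illustrative assumptions for exposition, not the released implementation.

# Hypothetical sketch of the VisionTrap data flow (not the official implementation).
# Module internals, shapes, and hyperparameters below are simplified assumptions.
import torch
import torch.nn as nn

class PerAgentStateEncoder(nn.Module):
    """Encodes each agent's past trajectory and attributes into a state embedding."""
    def __init__(self, in_dim=6, hidden=128):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, batch_first=True)

    def forward(self, past):                     # past: [num_agents, T_past, in_dim]
        _, h = self.gru(past)
        return h.squeeze(0)                      # [num_agents, hidden]

class VisualSemanticEncoder(nn.Module):
    """Fuses a BEV feature map (from multi-view images + map) with agent states."""
    def __init__(self, bev_channels=64, hidden=128):
        super().__init__()
        self.proj = nn.Linear(bev_channels, hidden)
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)

    def forward(self, agent_feats, bev):         # bev: [H*W, bev_channels]
        ctx = self.proj(bev).unsqueeze(0)        # [1, H*W, hidden]
        q = agent_feats.unsqueeze(0)             # [1, num_agents, hidden]
        out, _ = self.attn(q, ctx, ctx)          # scene-agent interaction via cross-attention
        return (q + out).squeeze(0)              # residual fusion -> [num_agents, hidden]

class TrajectoryDecoder(nn.Module):
    """Decodes enriched agent embeddings into K candidate future trajectories."""
    def __init__(self, hidden=128, horizon=12, num_modes=6):
        super().__init__()
        self.horizon, self.num_modes = horizon, num_modes
        self.head = nn.Linear(hidden, num_modes * horizon * 2)

    def forward(self, feats):                    # feats: [num_agents, hidden]
        out = self.head(feats)
        return out.view(-1, self.num_modes, self.horizon, 2)   # (x, y) per future step

# Wiring the modules together; the text-driven guidance acts only as a training loss.
state_enc, vis_enc, decoder = PerAgentStateEncoder(), VisualSemanticEncoder(), TrajectoryDecoder()
past = torch.randn(5, 8, 6)                      # 5 agents, 8 past steps, 6 state attributes
bev = torch.randn(50 * 50, 64)                   # flattened 50x50 BEV grid
pred = decoder(vis_enc(state_enc(past), bev))
print(pred.shape)                                # torch.Size([5, 6, 12, 2])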

This comprehensive methodology leverages both visual and textual cues to significantly improve the accuracy and reliability of trajectory predictions in autonomous driving.
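
The text-driven guidance of step 3 can be illustrated with a small CLIP-style symmetric contrastive (InfoNCE) loss that pulls each agent's fused embedding toward the embedding of its matched description and away from the others. The one-to-one pairing scheme and temperature below are assumptions for illustration; the loss is applied only during training and adds no inference-time cost.

# Illustrative CLIP-style contrastive loss for text-driven guidance (training only).
# The one-to-one pairing of agents with descriptions and the temperature are assumptions.
import torch
import torch.nn.functional as F

def text_guidance_loss(agent_emb, text_emb, temperature=0.07):
    """agent_emb: [N, D] fused agent features; text_emb: [N, D] matched description embeddings."""
    a = F.normalize(agent_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.t() / temperature             # [N, N] cosine-similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetric InfoNCE: each agent should match its own description, and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Example: 5 agents with 128-dim embeddings from the encoders and a text encoder.
loss = text_guidance_loss(torch.randn(5, 128), torch.randn(5, 128))
print(loss.item())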

Qualitative Results

We demonstrate the effectiveness of VisionTrap by comparing trajectory predictions with and without the Visual Semantic Encoder and Text-driven Guidance Module. The examples below show how incorporating visual and textual data significantly improves prediction accuracy.

nuScenes-Text Dataset

The nuScenes-Text dataset enriches the nuScenes dataset with detailed annotations for every object in each frame, providing three versions of each description generated from the surround camera views. We removed location-specific information such as 'left', 'right', or 'away from the ego car' to prevent confusion, and refined the descriptions using an LLM for clarity. These annotations capture diverse semantic details, such as agent behaviors, semantic features, and environmental conditions.
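
To show how such per-object annotations might be consumed, the sketch below assumes one record per object per frame with three description variants keyed by nuScenes tokens; the field names and file layout are hypothetical and may differ from the released dataset.

# Hypothetical nuScenes-Text annotation record and loader (field names are assumptions).
import json

example_record = {
    "sample_token": "<nuScenes frame token>",
    "instance_token": "<nuScenes object instance token>",
    "descriptions": [                            # three refined description variants
        "A pedestrian carrying stacked items, expected to remain stationary.",
        "A person standing still while holding a stack of boxes.",
        "A stationary pedestrian with both hands occupied by items."
    ]
}

def load_text_annotations(path):
    """Maps (sample_token, instance_token) -> list of description strings."""
    with open(path) as f:
        records = json.load(f)
    return {(r["sample_token"], r["instance_token"]): r["descriptions"] for r in records}

# Usage sketch:
# ann = load_text_annotations("nuscenes_text/train.json")
# descriptions = ann[(sample_token, instance_token)]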

nuScenes-Text Dataset Architecture




BibTeX

If you use our code or data, please cite:

                
@article{moon2024visiontrap,
  title={VisionTrap: Vision-Augmented Trajectory Prediction Guided by Textual Descriptions},
  author={Moon, Seokha and Woo, Hyun and Park, Hongbeen and Jung, Haeji and Mahjourian, Reza and Chi, Hyung-gun and Lim, Hyerin and Kim, Sangpil and Kim, Jinkyu},
  journal={arXiv preprint arXiv:2407.12345},
  year={2024}
}