V2V-GoT: Vehicle-to-Vehicle Cooperative Autonomous Driving with Multi-Modal Large Language Models and Graph-of-Thoughts

1NVIDIA, 2Carnegie Mellon University, *equally advising



V2V-GoT: Graph-of-thoughts reasoning framework for vehicle-to-vehicle cooperative autonomous driving. All Connected Autonomous Vehicles (CAVs) share their perception features with the Multimodal Large Language Model (MLLM), as illustrated by the grey arrows. Any CAV can ask the MLLM to provide a suggested future trajectory or to answer perception or prediction questions. The MLLM fuses the perception features from all CAVs and performs inference by following the graph-of-thoughts. If two QA nodes are connected by a directed edge in the graph, as illustrated by the black arrows, the answer of the parent QA node is used as input context for the child QA node.

Overview

We propose the first graph-of-thoughts framework specifically designed for MLLM-based cooperative autonomous driving. Our graph-of-thoughts incorporates our novel ideas of occlusion-aware perception and planning-aware prediction. We curate the V2V-GoT-QA dataset and develop the V2V-GoT model for training and testing the cooperative driving graph-of-thoughts. Our experimental results show that our method outperforms other baselines on cooperative perception, prediction, and planning tasks. For more details, please refer to our paper on arXiv.
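To make the inference flow concrete, below is a minimal sketch (our own illustration, not the authors' released implementation) of how answers could propagate through such a graph-of-thoughts: QA nodes are processed in topological order, and each node's context is assembled from the answers of its parent nodes. The node dictionary, edge list, and the `mllm_answer` callback are hypothetical placeholders.

```python
from collections import defaultdict

def run_graph_of_thoughts(nodes, edges, mllm_answer):
    """Answer QA nodes in topological order, feeding parent answers
    into each child's context.

    nodes: dict mapping node id -> question string
    edges: list of (parent_id, child_id) directed edges
    mllm_answer: callable(question, context) -> answer string
                 (hypothetical wrapper around the MLLM)
    """
    parents, children = defaultdict(list), defaultdict(list)
    indegree = {n: 0 for n in nodes}
    for p, c in edges:
        parents[c].append(p)
        children[p].append(c)
        indegree[c] += 1

    answers = {}
    ready = [n for n in nodes if indegree[n] == 0]
    while ready:
        node = ready.pop()
        # Context = concatenated answers of all parent QA nodes.
        context = " ".join(answers[p] for p in parents[node])
        answers[node] = mllm_answer(nodes[node], context)
        for c in children[node]:
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    return answers
```

Since the graph is a directed acyclic graph, each node is answered only after all of its parents, so downstream planning questions can build on upstream perception and prediction answers.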


V2V-GoT-QA Dataset

Our V2V-GoT-QA dataset includes 9 types of perception, prediction, and planning QA samples. Our proposed occlusion-aware perception questions (Q1 - Q4) consider visible, occluding, and invisible objects. Our proposed planning-aware prediction questions (Q5 - Q7) include prediction from perception features and prediction conditioned on other CAVs' currently planned future trajectories. Our planning questions (Q8 - Q9) provide suggested action settings and waypoints of future trajectories that avoid potential collisions, as illustrated in the following figures.

Illustration of V2V-GoT-QA's 9 types of QA pairs: Perception (Q1 - Q4), Prediction (Q5 - Q7), and Planning (Q8 - Q9). The black arrows pointing at the MLLM indicate the perception data from the CAVs. The other colored arrows represent referenced, predicted, or suggested future trajectories. Colored stars represent objects' current locations as well as referenced, predicted, or suggested future waypoints.
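For concreteness, one way to organize such QA samples is sketched below. This is a hypothetical schema for illustration only, not the released dataset format, and all field names are assumptions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class GoTQASample:
    """Hypothetical schema for one V2V-GoT-QA sample (illustrative only)."""
    scene_id: str            # scenario / log identifier
    timestep: int            # frame index at which the question is asked
    question_type: str       # one of "Q1" ... "Q9"
    ego_cav_id: str          # the CAV asking the question
    question: str            # natural-language question
    answer: str              # ground-truth natural-language answer
    parent_ids: List[str] = field(default_factory=list)  # parent QA nodes in the graph
```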


V2V-GoT Model

Our V2V-GoT model uses the perception features at the current and previous timesteps from all CAVs as the input to the projection layers of the MLLM to generate visual tokens. The MLLM takes the visual tokens and the language tokens from the question and its context as input, and generates the final answer in natural language.

Model diagram of our proposed V2V-GoT for MLLM-based cooperative autonomous driving.
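As a rough illustration of the data flow described above, the sketch below (written under our own assumptions, not the released code) projects the stacked CAV perception features into visual tokens and concatenates them with the language tokens before the MLLM backbone decodes the answer. The class name, module names, and tensor shapes are hypothetical.

```python
import torch
import torch.nn as nn

class CooperativeMLLMSketch(nn.Module):
    """Minimal sketch of the fusion path: CAV perception features ->
    projection layer -> visual tokens, concatenated with language tokens.
    Shapes and module names are illustrative assumptions."""

    def __init__(self, feat_dim=256, token_dim=1024):
        super().__init__()
        self.projector = nn.Linear(feat_dim, token_dim)  # stand-in for the projection layers

    def forward(self, cav_features, language_tokens):
        # cav_features: (num_cavs, num_timesteps, num_feats, feat_dim)
        #   perception features from all CAVs at the current and previous timesteps
        # language_tokens: (num_lang_tokens, token_dim)
        #   embedded question plus context (parent-node answers)
        visual_tokens = self.projector(cav_features).flatten(0, 2)  # (num_visual_tokens, token_dim)
        # The MLLM backbone (not shown here) would consume this joint token
        # sequence and decode the answer as natural language.
        return torch.cat([visual_tokens, language_tokens], dim=0)
```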


Qualitative Results

The following figures show V2V-GoT's qualitative results on the V2V-GoT-QA testing split. For more details, please refer to our paper on arXiv.