We propose a novel problem setting that integrates an LLM into cooperative autonomous driving, together with the proposed Vehicle-to-Vehicle Question-Answering (V2V-QA) dataset and benchmark. We also propose a baseline method, Vehicle-to-Vehicle Large Language Model (V2V-LLM), which uses an LLM to fuse perception information from multiple connected autonomous vehicles (CAVs) and answer driving-related questions: grounding, notable object identification, and planning. Experimental results show that our proposed V2V-LLM can serve as a promising unified model architecture for performing various tasks in cooperative autonomous driving, and it outperforms baseline methods that use other fusion approaches. Our work also opens a new research direction that can improve the safety of future autonomous driving systems. For more details, please refer to our paper on arXiv.
The following table summarizes the differences between our V2V-QA dataset and recent related autonomous driving datasets. V2V-QA is a question-answering dataset that covers multiple vehicles in real-world cooperative driving scenarios.
Our V2V-QA includes grounding, notable object identification, and planning question-answer pairs, as illustrated in the following figures.
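To make the three question types more concrete, below is a hypothetical illustration of what a single V2V-QA record might look like. The field names and wording here are assumptions for illustration only, not the released dataset schema; please refer to the dataset files for the actual format.

```python
# Hypothetical V2V-QA record (illustrative only; not the released schema).
grounding_example = {
    "task": "grounding",  # other tasks: "notable_object_identification", "planning"
    "question": (
        "Ego vehicle: is there an object near the reference location "
        "shared by the other CAV? If so, where is it?"
    ),
    "answer": "Yes, there is a vehicle near the reference location in the ego frame.",
}
```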
Our V2V-LLM takes the individual perception features of every CAV as the vision input and a question as the language input, and it generates an answer as the language output.
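The sketch below shows one way such a fusion could be wired up: each CAV's perception features are projected into the language model's embedding space and concatenated into a single visual token sequence, which would then be prepended to the question's token embeddings before generation. This is a minimal, assumed sketch (the class name, feature dimensions, and shared-projector design are ours), not the authors' implementation.

```python
# Minimal sketch of LLM-based fusion of per-CAV perception features (assumed design).
import torch
import torch.nn as nn


class V2VFeatureFuser(nn.Module):
    """Projects each CAV's perception features into LLM token embeddings."""

    def __init__(self, perception_dim: int = 256, llm_dim: int = 4096):
        super().__init__()
        # Shared linear projector from perception feature size to LLM embedding size.
        self.projector = nn.Linear(perception_dim, llm_dim)

    def forward(self, cav_features: list) -> torch.Tensor:
        # cav_features: one (num_tokens, perception_dim) tensor per CAV.
        projected = [self.projector(f) for f in cav_features]
        # Concatenate all CAVs' visual tokens into a single sequence.
        return torch.cat(projected, dim=0)


# Example: two CAVs, each contributing 16 perception tokens.
fuser = V2VFeatureFuser()
visual_tokens = fuser([torch.randn(16, 256), torch.randn(16, 256)])
# visual_tokens has shape (32, 4096); these tokens would be prepended to the
# question's token embeddings, and the LLM would generate the answer text.
print(visual_tokens.shape)
```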
The following figures show V2V-LLM's qualitative results on V2V-QA's testing split. For more details, please refer to our paper on arXiv.
@ARTICLE{chiu2025v2vllm,
title={V2V-LLM: Vehicle-to-Vehicle Cooperative Autonomous Driving with Multi-Modal Large Language Models},
author={Chiu, Hsu-kuang and Hachiuma, Ryo and Wang, Chien-Yi and Smith, Stephen F. and Wang, Yu-Chiang Frank and Chen, Min-Hung},
journal={arXiv preprint arXiv:2502.09980},
year={2025}
}