For questions about MoD-DPO, please contact:
Ashutosh Chaubey — achaubey@usc.edu
Mohammad Soleymani — soleymani@ict.usc.edu
Omni-modal large language models (omni LLMs) have recently achieved strong performance across audiovisual understanding tasks, yet they remain highly susceptible to cross-modal hallucinations arising from spurious correlations and dominant language priors. In this work, we propose Modality-Decoupled Direct Preference Optimization (MoD-DPO), a simple and effective framework for improving modality grounding in omni LLMs. MoD-DPO introduces modality-aware regularization terms that explicitly enforce invariance to corruptions in irrelevant modalities and sensitivity to perturbations in relevant modalities, thereby reducing unintended cross-modal interactions. To further mitigate over-reliance on textual priors, we incorporate a language-prior debiasing penalty that discourages hallucination-prone text-only responses. Extensive experiments across multiple audiovisual hallucination benchmarks demonstrate that MoD-DPO consistently improves perception accuracy and hallucination resistance, outperforming previous preference optimization baselines under similar training budgets. Our findings underscore the importance of modality-faithful alignment and demonstrate a scalable path toward more reliable and resilient multimodal foundation models.
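As a rough illustration only (not the paper's exact formulation), the objective described above can be sketched as a standard DPO loss augmented with three penalties. All function names, weights (`lam_inv`, `lam_sens`, `lam_text`), and the exact form of each term below are assumptions made for this sketch:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def dpo_term(lp_w: float, lp_l: float, ref_w: float, ref_l: float, beta: float) -> float:
    """Standard DPO loss: prefer the chosen response y_w over the rejected y_l,
    measured against a frozen reference model's log-probs (ref_w, ref_l)."""
    margin = beta * ((lp_w - ref_w) - (lp_l - ref_l))
    return -math.log(sigmoid(margin))

def mod_dpo_loss(lp_w, lp_l, ref_w, ref_l,
                 lp_w_irr_corrupt,   # log-prob of y_w when the IRRELEVANT modality is corrupted
                 lp_w_rel_corrupt,   # log-prob of y_w when the RELEVANT modality is corrupted
                 lp_w_text_only,     # log-prob of y_w given the text prompt alone
                 beta=0.1, lam_inv=1.0, lam_sens=1.0, lam_text=1.0):
    """Hypothetical sketch of a modality-decoupled preference objective."""
    # (1) base preference term on the original audiovisual input
    loss = dpo_term(lp_w, lp_l, ref_w, ref_l, beta)
    # (2) invariance: corrupting the irrelevant modality should not move
    #     the likelihood of the chosen response
    loss += lam_inv * abs(lp_w - lp_w_irr_corrupt)
    # (3) sensitivity: corrupting the relevant modality should lower that
    #     likelihood, so treat the corrupted input as an implicit "rejected" sample
    loss += lam_sens * -math.log(sigmoid(beta * (lp_w - lp_w_rel_corrupt)))
    # (4) language-prior debiasing: penalize responses the model would already
    #     produce from text alone (hallucination-prone language priors)
    loss += lam_text * -math.log(sigmoid(beta * (lp_w - lp_w_text_only)))
    return loss
```

Terms (2) and (3) decouple the modalities by pulling the model toward ignoring irrelevant inputs while staying sensitive to relevant ones, and term (4) corresponds to the language-prior debiasing penalty.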
MoD-DPO and MoD-DPO++ are evaluated on AVHBench and the Curse of Multi-Modalities (CMM) benchmark using two reference models: Qwen 2.5 Omni (7B) and MiniCPM-O 2.6 (8B).
AVHBench Results (Accuracy / F1)

| Method | Audio-driven Video Halluc. (Acc.) | Audio-driven Video Halluc. (F1) | Video-driven Audio Halluc. (Acc.) | Video-driven Audio Halluc. (F1) | AV Matching (Acc.) | AV Matching (F1) |
|---|---|---|---|---|---|---|
| **Other Omni LLMs** | | | | | | |
| VideoLLaMA 2 | 79.23 | 79.16 | 75.07 | 73.71 | 52.93 | 23.93 |
| VITA-1.5 | 67.17 | 65.36 | 54.01 | 52.78 | 46.85 | 38.78 |
| OmniVinci | 61.36 | 61.16 | 58.56 | 46.88 | 54.32 | 34.14 |
| Qwen 3 Omni | 83.54 | 83.47 | 76.46 | 72.00 | 58.52 | 32.02 |
| **Qwen 2.5 Omni** | | | | | | |
| Qwen 2.5 Omni | 84.15 | 83.51 | 77.38 | 73.39 | 54.69 | 17.85 |
| + DPO | 84.39 | 83.42 | 79.68 | 77.28 | 59.32 | 32.94 |
| + OmniDPO | 85.34 | 84.23 | 80.77 | 80.39 | 61.50 | 38.52 |
| + MoD-DPO | 87.66 | 87.61 | 82.48 | 80.98 | 69.07 | 58.53 |
| + MoD-DPO++ | 88.19 | 88.15 | 83.40 | 82.10 | 69.68 | 59.71 |
| **MiniCPM-O 2.6** | | | | | | |
| MiniCPM-O 2.6 | 83.36 | 83.30 | 74.54 | 73.74 | 54.26 | 54.15 |
| + DPO | 82.91 | 82.88 | 78.86 | 75.22 | 54.56 | 54.52 |
| + OmniDPO | 84.96 | 84.95 | 75.39 | 75.04 | 56.86 | 56.84 |
| + MoD-DPO | 87.08 | 87.02 | 79.00 | 78.87 | 60.57 | 60.53 |
| + MoD-DPO++ | 87.26 | 87.23 | 79.49 | 79.38 | 60.66 | 60.64 |
MoD-DPO++ achieves relative accuracy gains of up to 27% on the audiovisual matching task over the reference models, and consistently outperforms all baselines in both accuracy and F1.
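For reference, the 27% figure is a relative gain computed from the AV Matching accuracies in the AVHBench table above (Qwen 2.5 Omni reference vs. MoD-DPO++):

```python
# Relative AV Matching accuracy gain of MoD-DPO++ over the Qwen 2.5 Omni reference
base, ours = 54.69, 69.68   # values from the AVHBench table above
gain_pct = (ours - base) / base * 100
print(f"{gain_pct:.1f}% relative gain")  # ~27.4%
```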
CMM Benchmark Results

| Method | Overall Perception Accuracy (PA) | Overall Hallucination Resistance (HR) |
|---|---|---|
| **Other Omni LLMs** | | |
| VideoLLaMA 2 | 71.7 | 81.1 |
| VITA-1.5 | 72.2 | 57.1 |
| OmniVinci | 89.2 | 69.4 |
| Qwen 3 Omni | 95.0 | 75.3 |
| **Qwen 2.5 Omni** | | |
| Qwen 2.5 Omni | 86.4 | 84.6 |
| + DPO | 85.2 | 84.6 |
| + OmniDPO | 86.6 | 84.7 |
| + MoD-DPO | 88.8 | 86.2 |
| + MoD-DPO++ | 89.2 | 87.2 |
| **MiniCPM-O 2.6** | | |
| MiniCPM-O 2.6 | 85.6 | 80.4 |
| + DPO | 85.2 | 80.3 |
| + OmniDPO | 86.4 | 80.6 |
| + MoD-DPO | 88.0 | 82.5 |
| + MoD-DPO++ | 88.3 | 83.6 |
On CMM, MoD-DPO++ achieves overall gains of 3–4% over the reference models. The improvement on the language dominance task is notably larger for MoD-DPO++ than for MoD-DPO, demonstrating the efficacy of the language-prior debiasing penalty.
General Benchmark Results

| Method | DailyOmni (AV) | MVBench (Video) | MMAU (Audio) |
|---|---|---|---|
| **Qwen 2.5 Omni** | | | |
| Qwen 2.5 Omni | 47.34 | 69.61 | 64.62 |
| + DPO | 51.44 | 68.21 | 65.19 |
| + OmniDPO | 50.07 | 68.89 | 65.52 |
| + MoD-DPO | 53.00 | 70.95 | 65.77 |
| + MoD-DPO++ | 53.82 | 71.02 | 66.33 |
| **MiniCPM-O 2.6** | | | |
| MiniCPM-O 2.6 | 30.55 | 62.56 | 66.75 |
| + DPO | 34.13 | 63.18 | 66.91 |
| + OmniDPO | 34.33 | 64.42 | 67.73 |
| + MoD-DPO | 35.71 | 64.15 | 68.98 |
| + MoD-DPO++ | 36.60 | 64.32 | 68.30 |
While the baselines yield inconsistent gains on general benchmarks, MoD-DPO++ improves consistently over the reference models across all three benchmarks (DailyOmni, MVBench, MMAU), demonstrating that reducing hallucinations also benefits general audiovisual understanding.
@inproceedings{chaubey2026moddpo,
  title={MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization},
  author={Chaubey, Ashutosh and Pang, Jiacheng and Soleymani, Mohammad},
  booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}
Research was sponsored by the Army Research Office and was accomplished under Cooperative Agreement Number W911NF-25-2-0040. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.