For questions about MoD-DPO, please contact:
Ashutosh Chaubey — achaubey@usc.edu
Mohammad Soleymani — soleymani@ict.usc.edu
Omni-modal large language models (omni LLMs) have recently achieved strong performance across audiovisual understanding tasks, yet they remain highly susceptible to cross-modal hallucinations arising from spurious correlations and dominant language priors. In this work, we propose Modality-Decoupled Direct Preference Optimization (MoD-DPO), a simple and effective framework for improving modality grounding in omni LLMs. MoD-DPO introduces modality-aware regularization terms that explicitly enforce invariance to corruptions in irrelevant modalities and sensitivity to perturbations in relevant modalities, thereby reducing unintended cross-modal interactions. To further mitigate over-reliance on textual priors, we incorporate a language-prior debiasing penalty that discourages hallucination-prone text-only responses. Extensive experiments across multiple audiovisual hallucination benchmarks demonstrate that MoD-DPO consistently improves perception accuracy and hallucination resistance, outperforming previous preference optimization baselines under similar training budgets. Our findings underscore the importance of modality-faithful alignment and demonstrate a scalable path toward more reliable and resilient multimodal foundation models.
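As a rough illustration only (not the paper's exact formulation), the objective described above can be sketched as a standard DPO loss augmented with three penalties. All function names, weights (`lam_inv`, `lam_sens`, `lam_text`), and the exact form of each term below are assumptions made for this sketch:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def dpo_term(lp_w: float, lp_l: float, ref_w: float, ref_l: float, beta: float) -> float:
    """Standard DPO loss: prefer the chosen response y_w over the rejected y_l,
    measured against a frozen reference model's log-probs (ref_w, ref_l)."""
    margin = beta * ((lp_w - ref_w) - (lp_l - ref_l))
    return -math.log(sigmoid(margin))

def mod_dpo_loss(lp_w, lp_l, ref_w, ref_l,
                 lp_w_irr_corrupt,   # log-prob of y_w when the IRRELEVANT modality is corrupted
                 lp_w_rel_corrupt,   # log-prob of y_w when the RELEVANT modality is corrupted
                 lp_w_text_only,     # log-prob of y_w given the text prompt alone
                 beta=0.1, lam_inv=1.0, lam_sens=1.0, lam_text=1.0):
    """Hypothetical sketch of a modality-decoupled preference objective."""
    # (1) base preference term on the original audiovisual input
    loss = dpo_term(lp_w, lp_l, ref_w, ref_l, beta)
    # (2) invariance: corrupting the irrelevant modality should not move
    #     the likelihood of the chosen response
    loss += lam_inv * abs(lp_w - lp_w_irr_corrupt)
    # (3) sensitivity: corrupting the relevant modality should lower that
    #     likelihood, so treat the corrupted input as an implicit "rejected" sample
    loss += lam_sens * -math.log(sigmoid(beta * (lp_w - lp_w_rel_corrupt)))
    # (4) language-prior debiasing: penalize responses the model would already
    #     produce from text alone (hallucination-prone language priors)
    loss += lam_text * -math.log(sigmoid(beta * (lp_w - lp_w_text_only)))
    return loss
```

Terms (2) and (3) decouple the modalities by pulling the model toward ignoring irrelevant inputs while staying sensitive to relevant ones, and term (4) corresponds to the language-prior debiasing penalty.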
MoD-DPO and MoD-DPO++ are evaluated on AVHBench and the Curse of Multi-Modalities (CMM) benchmark using two reference models: Qwen 2.5 Omni (7B) and MiniCPM-O 2.6 (8B).
AVHBench Results (Accuracy / F1)

| Method | Audio-driven Video Halluc. (Acc.) | Audio-driven Video Halluc. (F1) | Video-driven Audio Halluc. (Acc.) | Video-driven Audio Halluc. (F1) | AV Matching (Acc.) | AV Matching (F1) |
|---|---|---|---|---|---|---|
| **Other Omni LLMs** | | | | | | |
| VideoLLaMA 2 | 79.23 | 79.16 | 75.07 | 73.71 | 52.93 | 23.93 |
| VITA-1.5 | 67.17 | 65.36 | 54.01 | 52.78 | 46.85 | 38.78 |
| OmniVinci | 61.36 | 61.16 | 58.56 | 46.88 | 54.32 | 34.14 |
| Qwen 3 Omni | 83.54 | 83.47 | 76.46 | 72.00 | 58.52 | 32.02 |
| **Qwen 2.5 Omni** | | | | | | |
| Qwen 2.5 Omni | 84.15 | 83.51 | 77.38 | 73.39 | 54.69 | 17.85 |
| + DPO | 84.39 | 83.42 | 79.68 | 77.28 | 59.32 | 32.94 |
| + OmniDPO | 85.34 | 84.23 | 80.77 | 80.39 | 61.50 | 38.52 |
| + MoD-DPO | 87.66 | 87.61 | 82.48 | 80.98 | 69.07 | 58.53 |
| + MoD-DPO++ | 88.19 | 88.15 | 83.40 | 82.10 | 69.68 | 59.71 |
| **MiniCPM-O 2.6** | | | | | | |
| MiniCPM-O 2.6 | 83.36 | 83.30 | 74.54 | 73.74 | 54.26 | 54.15 |
| + DPO | 82.91 | 82.88 | 78.86 | 75.22 | 54.56 | 54.52 |
| + OmniDPO | 84.96 | 84.95 | 75.39 | 75.04 | 56.86 | 56.84 |
| + MoD-DPO | 87.08 | 87.02 | 79.00 | 78.87 | 60.57 | 60.53 |
| + MoD-DPO++ | 87.26 | 87.23 | 79.49 | 79.38 | 60.66 | 60.64 |
MoD-DPO++ achieves relative accuracy gains of up to 27% on the audiovisual matching task over the reference models, and consistently outperforms all baselines in both accuracy and F1.
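For reference, the 27% figure is a relative gain computed from the AV Matching accuracies in the AVHBench table above (Qwen 2.5 Omni reference vs. MoD-DPO++):

```python
# Relative AV Matching accuracy gain of MoD-DPO++ over the Qwen 2.5 Omni reference
base, ours = 54.69, 69.68   # values from the AVHBench table above
gain_pct = (ours - base) / base * 100
print(f"{gain_pct:.1f}% relative gain")  # ~27.4%
```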
CMM Benchmark Results

| Method | Overall Perception Accuracy (PA) | Overall Hallucination Resistance (HR) |
|---|---|---|
| **Other Omni LLMs** | | |
| VideoLLaMA 2 | 71.7 | 81.1 |
| VITA-1.5 | 72.2 | 57.1 |
| OmniVinci | 89.2 | 69.4 |
| Qwen 3 Omni | 95.0 | 75.3 |
| **Qwen 2.5 Omni** | | |
| Qwen 2.5 Omni | 86.4 | 84.6 |
| + DPO | 85.2 | 84.6 |
| + OmniDPO | 86.6 | 84.7 |
| + MoD-DPO | 88.8 | 86.2 |
| + MoD-DPO++ | 89.2 | 87.2 |
| **MiniCPM-O 2.6** | | |
| MiniCPM-O 2.6 | 85.6 | 80.4 |
| + DPO | 85.2 | 80.3 |
| + OmniDPO | 86.4 | 80.6 |
| + MoD-DPO | 88.0 | 82.5 |
| + MoD-DPO++ | 88.3 | 83.6 |
On CMM, MoD-DPO++ achieves overall gains of 3–4% over the reference models. The improvement on the language dominance task is notably larger for MoD-DPO++ than for MoD-DPO, demonstrating the efficacy of the language-prior debiasing penalty.
General Benchmark Results

| Method | DailyOmni (AV) | MVBench (Video) | MMAU (Audio) |
|---|---|---|---|
| **Qwen 2.5 Omni** | | | |
| Qwen 2.5 Omni | 47.34 | 69.61 | 64.62 |
| + DPO | 51.44 | 68.21 | 65.19 |
| + OmniDPO | 50.07 | 68.89 | 65.52 |
| + MoD-DPO | 53.00 | 70.95 | 65.77 |
| + MoD-DPO++ | 53.82 | 71.02 | 66.33 |
| **MiniCPM-O 2.6** | | | |
| MiniCPM-O 2.6 | 30.55 | 62.56 | 66.75 |
| + DPO | 34.13 | 63.18 | 66.91 |
| + OmniDPO | 34.33 | 64.42 | 67.73 |
| + MoD-DPO | 35.71 | 64.15 | 68.98 |
| + MoD-DPO++ | 36.60 | 64.32 | 68.30 |
While the baselines yield inconsistent gains on general benchmarks, MoD-DPO++ improves consistently over the reference models across all three benchmarks (DailyOmni, MVBench, MMAU), demonstrating that reducing hallucinations also benefits general audiovisual understanding.
@inproceedings{chaubey2026moddpo,
  title={MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization},
  author={Chaubey, Ashutosh and Pang, Jiacheng and Soleymani, Mohammad},
  booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}
Research was sponsored by the Army Research Office and was accomplished under Cooperative Agreement Number W911NF-25-2-0040. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.