Unveiling Bias in Multimodal Models
Abstract
Vision Language Models (VLMs) have significantly advanced multimodal understanding by effectively combining visual and textual modalities for various applications, including image captioning, visual question answering, and video summarization. However, despite their capabilities, these models exhibit pronounced modality biases, predominantly relying on textual inputs over visual data. This thesis systematically evaluates unimodal biases in state-of-the-art VLMs, highlighting their impact on performance and proposing strategies for bias mitigation, including Prompting, Interleaved Vision-Text Projection (IVTP), and Cross-Attention Projection. Our experimental evaluations on the MMWorld dataset demonstrate that targeted mitigation strategies substantially enhance modality balance and model robustness. The findings underscore the importance of architectural adjustments and training methodologies for ensuring equitable multimodal integration, paving the way for more reliable and robust multimodal AI systems.
Permanent Link
http://digital.library.wisc.edu/1793/95270
Type
Thesis
Description
Senior Honors Thesis, Department of Computer Sciences, University of Wisconsin-Madison

