
dc.contributor.advisor   Sala, Frederic
dc.contributor.author    Prabhu, Yogesh
dc.date.accessioned      2025-05-28T20:10:32Z
dc.date.available        2025-05-28T20:10:32Z
dc.date.issued           2025
dc.identifier.uri        http://digital.library.wisc.edu/1793/95270
dc.description           Senior Honors Thesis, Department of Computer Sciences, University of Wisconsin-Madison   en_US
dc.description.abstract  Vision Language Models (VLMs) have significantly advanced multimodal understanding by effectively combining visual and textual modalities for various applications, including image captioning, visual question answering, and video summarization. However, despite their capabilities, these models exhibit pronounced modality biases, predominantly relying on textual inputs over visual data. This thesis systematically evaluates unimodal biases in state-of-the-art VLMs, highlighting their impact on performance and proposing strategies for bias mitigation, including Prompting, Interleaved Vision-Text Projection (IVTP), and Cross-Attention Projection. Our experimental evaluations using the MMWorld dataset demonstrate that targeted mitigation strategies substantially enhance modality balance and model robustness. The findings underscore the importance of architectural adjustments and training methodologies in ensuring equitable multimodal integration, paving the way for more reliable and robust multimodal AI systems.   en_US
dc.title                 Unveiling Bias in Multimodal Models   en_US
dc.type                  Thesis   en_US

