Show simple item record

dc.contributor.advisor: Lee, Yong Jae
dc.contributor.author: Nguyen, Le Thien Phuc
dc.date.accessioned: 2026-01-26T20:57:44Z
dc.date.available: 2026-01-26T20:57:44Z
dc.date.issued: 2025
dc.identifier.uri: http://digital.library.wisc.edu/1793/96385
dc.description: Senior Honors Thesis, Department of Computer Sciences, University of Wisconsin-Madison
dc.description.abstract: "Multimodal large language models (MLLMs) are expected to jointly interpret vision, audio, and language, yet existing video benchmarks rarely assess fine-grained reasoning about human speech. Many tasks remain visually solvable or only coarsely evaluate speech, offering limited insight into whether models can align who speaks, what is said, and when it occurs. We introduce AV-SpeakerBench, a curated benchmark of 3,212 multiple-choice questions focused on speaker-centric audiovisual reasoning in real-world videos. It features: (1) a speaker-centered formulation that treats speakers—not scenes—as the core reasoning unit; (2) fusion-grounded question design embedding audiovisual dependencies into question semantics; and (3) expert-curated annotations ensuring temporal precision and cross-modal validity. Comprehensive evaluations show that the Gemini family consistently outperforms open-source systems, with Gemini 2.5 Pro achieving the best results. Among open models, Qwen3-Omni-30B approaches Gemini 2.0 Flash but remains far behind Gemini 2.5 Pro, primarily due to weaker audio-visual fusion rather than visual perception. We believe AV-SpeakerBench establishes a rigorous foundation for advancing fine-grained audiovisual reasoning in future multimodal systems."
dc.subject: Benchmark
dc.subject: Evaluation
dc.subject: Multimodal Large Language Model
dc.subject: Audiovisual understanding
dc.title: See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models
dc.type: Thesis

