    • MINDS@UW Home
    • MINDS@UW Madison
    • College of Letters and Science, University of Wisconsin–Madison
    • College of Letters & Science Honors Program Senior Honors Theses
    • Natural Sciences
    • Computer Sciences
    • View Item
    •   MINDS@UW Home
    • MINDS@UW Madison
    • College of Letters and Science, University of Wisconsin–Madison
    • College of Letters & Science Honors Program Senior Honors Theses
    • Natural Sciences
    • Computer Sciences
    • View Item
    JavaScript is disabled for your browser. Some features of this site may not work without it.

    See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

    File(s)
    Honors Thesis (29.46 MB)
    Date
    2025
    Author
    Nguyen, Le Thien Phuc
    Advisor(s)
    Lee, Yong Jae
    Abstract
    Multimodal large language models (MLLMs) are expected to jointly interpret vision, audio, and language, yet existing video benchmarks rarely assess fine-grained reasoning about human speech. Many tasks remain visually solvable or only coarsely evaluate speech, offering limited insight into whether models can align who speaks, what is said, and when it occurs. We introduce AV-SpeakerBench, a curated benchmark of 3,212 multiple-choice questions focused on speaker-centric audiovisual reasoning in real-world videos. It features: (1) a speaker-centered formulation that treats speakers—not scenes—as the core reasoning unit; (2) fusion-grounded question design embedding audiovisual dependencies into question semantics; and (3) expert-curated annotations ensuring temporal precision and cross-modal validity. Comprehensive evaluations show that the Gemini family consistently outperforms open-source systems, with Gemini 2.5 Pro achieving the best results. Among open models, Qwen3-Omni-30B approaches Gemini 2.0 Flash but remains far behind Gemini 2.5 Pro, primarily due to weaker audio-visual fusion rather than visual perception. We believe AV-SpeakerBench establishes a rigorous foundation for advancing fine-grained audiovisual reasoning in future multimodal systems.
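    The abstract describes a multiple-choice benchmark scored per item. As a rough illustration only, a sketch of how such a benchmark item and its accuracy metric might look in Python — the field names (`video_id`, `choices`, `answer`) and the scoring rule are assumptions for illustration, not the actual AV-SpeakerBench schema or evaluation code:

    ```python
    from dataclasses import dataclass

    @dataclass
    class MCQItem:
        # Hypothetical item schema; the real AV-SpeakerBench format may differ.
        video_id: str       # source video the question is grounded in
        question: str       # speaker-centric audiovisual question
        choices: list[str]  # multiple-choice options
        answer: int         # index of the correct choice

    def accuracy(items: list[MCQItem], predictions: list[int]) -> float:
        """Fraction of items where the predicted choice index matches the answer key."""
        correct = sum(1 for item, pred in zip(items, predictions) if pred == item.answer)
        return correct / len(items)

    items = [
        MCQItem("vid_001", "Who is speaking when the phone rings?",
                ["The host", "The guest", "An off-screen narrator", "No one"], 1),
        MCQItem("vid_002", "What does the second speaker say after the door closes?",
                ["A greeting", "A warning", "A question", "Nothing"], 2),
    ]
    print(accuracy(items, [1, 2]))  # → 1.0
    ```

    A model's overall score under this sketch is simply mean per-item accuracy over the 3,212 questions.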
    Subject
    Benchmark
    Evaluation
    Multimodal Large Language Model
    Audiovisual understanding
    Permanent Link
    http://digital.library.wisc.edu/1793/96385
    Type
    Thesis
    Description
    Senior Honors Thesis, Department of Computer Sciences, University of Wisconsin–Madison
    Part of
    • Computer Sciences

    Contact Us | Send Feedback