    HYBRID JIT–CUDA GRAPH OPTIMIZATION FOR LOW-LATENCY LARGE LANGUAGE MODEL INFERENCE

    File(s)
    Main File (723.3 KB)
    Date
    2025-12
    Author
    Yadav, Divakar Kumar
    Department
    Computer Science
    Advisor(s)
    Zhao, Tian
    Abstract
    Large Language Models (LLMs) deliver strong performance but suffer from high inference latency and unstable kernel-launch overhead. This thesis introduces a Hybrid JIT–CUDA Graph Runtime that combines static CUDA Graph replay with JIT-compiled dynamic kernels to achieve deterministic, low-latency autoregressive decoding. A rolling graph-generation mechanism enables variable sequence lengths while preserving static-shape constraints. Evaluated on an NVIDIA H100 GPU, the system reduces Time-to-First-Token (TTFT) by 45–80%, achieves the lowest and most stable P99 latency, and provides the fastest end-to-end 500-token generation compared with HuggingFace (PyTorch Eager) and TensorRT-LLM. The results demonstrate that integrating JIT flexibility with CUDA Graph determinism offers an effective approach for high-performance LLM inference.
    Subject
    Computer science
    Permanent Link
    http://digital.library.wisc.edu/1793/96431
    Type
    thesis
    Part of
    • UW Milwaukee Electronic Theses and Dissertations
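
    Editor's note: the capture/replay technique the abstract describes can be illustrated with PyTorch's public torch.cuda.CUDAGraph API. The sketch below is illustrative only, not the thesis's runtime: the Linear layer stands in for a real decode step, and the shapes, warm-up count, and buffer names are assumptions.

    import torch

    # Stand-in for one fixed-shape decode step (hypothetical; a real runtime
    # would capture the transformer's forward pass).
    model = torch.nn.Linear(4096, 4096).cuda().eval()
    static_input = torch.zeros(1, 4096, device="cuda")  # capture needs static shapes

    # Warm up on a side stream so lazy kernel/allocator initialization
    # happens before capture, as the PyTorch docs recommend.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        with torch.no_grad():
            for _ in range(3):
                model(static_input)
    torch.cuda.current_stream().wait_stream(side)

    # Capture the step once into a graph.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        with torch.no_grad():
            static_output = model(static_input)

    # Replay: refill the captured input buffer in place, then launch the
    # entire step as a single graph launch instead of many kernel launches.
    static_input.copy_(torch.randn(1, 4096, device="cuda"))
    graph.replay()
    print(static_output.norm().item())

    Because replay reuses the captured buffers and shapes, variable-length decoding cannot be served by one graph alone; the abstract's rolling graph-generation mechanism and JIT-compiled kernels address exactly that static-shape constraint.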
