HYBRID JIT–CUDA GRAPH OPTIMIZATION FOR LOW-LATENCY LARGE LANGUAGE MODEL INFERENCE
Date
2025-12
Author
Yadav, Divakar Kumar
Department
Computer Science
Advisor(s)
Zhao, Tian
Abstract
Large Language Models (LLMs) deliver strong performance but suffer from high inference latency and unstable kernel-launch overhead. This thesis introduces a Hybrid JIT–CUDA Graph Runtime that combines static CUDA Graph replay with JIT-compiled dynamic kernels to achieve deterministic, low-latency autoregressive decoding. A rolling graph-generation mechanism enables variable sequence lengths while preserving static-shape constraints. Evaluated on an NVIDIA H100 GPU, the system reduces Time-to-First-Token (TTFT) by 45–80%, achieves the lowest and most stable P99 latency, and provides the fastest end-to-end 500-token generation compared with HuggingFace (PyTorch Eager) and TensorRT–LLM. The results demonstrate that integrating JIT flexibility with CUDA Graph determinism offers an effective approach for high-performance LLM inference.
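The abstract's central idea, reconciling variable sequence lengths with CUDA Graphs' static-shape requirement, can be illustrated with a small sketch. This is not the thesis's implementation; the bucket sizes, class, and function names below are hypothetical, and actual CUDA Graph capture/replay is replaced by a string stub so the shape-bucketing logic stands alone.

```python
# Illustrative sketch (hypothetical, not the thesis code): map variable
# sequence lengths onto a small set of static-shape buckets, so each
# captured graph sees one fixed input shape and can be replayed cheaply.

from dataclasses import dataclass, field

BUCKETS = [128, 256, 512, 1024]  # hypothetical static sequence-length buckets


def pick_bucket(seq_len: int, buckets=BUCKETS) -> int:
    """Return the smallest static bucket that fits seq_len (inputs padded up)."""
    for b in buckets:
        if seq_len <= b:
            return b
    raise ValueError(f"sequence length {seq_len} exceeds largest bucket")


@dataclass
class GraphCache:
    """Lazily 'captures' one graph per bucket; later hits replay it."""
    captured: dict = field(default_factory=dict)
    captures: int = 0   # cold path: JIT-compile + graph capture
    replays: int = 0    # hot path: near-zero launch overhead

    def run(self, seq_len: int) -> str:
        bucket = pick_bucket(seq_len)
        if bucket not in self.captured:
            # First request at this static shape: capture (stubbed as a string).
            self.captured[bucket] = f"graph<{bucket}>"
            self.captures += 1
        else:
            # Static shape already captured: replay the recorded graph.
            self.replays += 1
        return self.captured[bucket]


cache = GraphCache()
for n in (100, 120, 300, 90):  # variable decode lengths
    cache.run(n)
# lengths 100, 120, 90 share the 128 bucket; 300 triggers a 512 capture
```

In a real runtime the stub would be a `torch.cuda.CUDAGraph` (or equivalent) captured over padded, fixed-shape buffers, while shapes falling outside the captured set would fall back to JIT-compiled dynamic kernels, which is the hybrid split the abstract describes.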
Subject
Computer science
Permanent Link
http://digital.library.wisc.edu/1793/96431
Type
thesis
