    •   MINDS@UW Home
    • MINDS@UW Madison
    • College of Engineering, University of Wisconsin--Madison
    • Department of Electrical and Computer Engineering
    • Theses--Electrical Engineering

    Designing Efficient Barriers and Semaphores for Graphics Processing Units

    File(s)
    DESIGNING EFFICIENT BARRIERS AND SEMAPHORES - Rohan Mahapatra.pdf (1015.Kb)
    Date
    2020-05
    Author
    Mahapatra, Rohan
    Abstract
    General-purpose GPU applications that use fine-grained synchronization to enforce ordering between many threads accessing shared data have become increasingly popular. Thus, it is imperative to create more efficient GPU synchronization primitives for these applications. Accordingly, in recent years there has been a push to establish a single, unified set of GPU synchronization primitives. However, unlike CPUs, modern GPUs support synchronization primitives poorly. In particular, inefficient support for atomics, which are used to implement fine-grained synchronization, makes it challenging to implement efficient algorithms. Therefore, as GPU algorithms are scaled to millions or billions of threads, existing GPU synchronization primitives either scale poorly or suffer from livelock or deadlock because of increased contention between threads accessing the same shared synchronization objects. In this work, we seek to overcome these inefficiencies by designing more efficient, scalable GPU global barriers and semaphores. In particular, we show how multi-level sense-reversing barriers and priority mechanisms for semaphores can be extended from prior CPU implementations and applied to the GPU's unique processing model in order to improve the performance and scalability of GPU synchronization primitives. Our results show that the proposed designs significantly improve performance compared to state-of-the-art solutions like CUDA Cooperative Groups, and scale to an order of magnitude more threads than prior open-source algorithms, avoiding livelock as the algorithms scale.
    Overall, across three modern GPUs, the proposed barrier implementation reduces atomic traffic by 50% and improves performance by an average of 26% over a GPU tree barrier algorithm and by an average of 30% over CUDA Cooperative Groups for four full-sized benchmarks; the new semaphore implementation improves performance by an average of 65% compared to prior GPU semaphore implementations.
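
For readers unfamiliar with the sense-reversing barriers mentioned in the abstract, the following is a minimal CPU-side sketch of the classic single-level (centralized) variant in C++. It is not the thesis's multi-level GPU design. The names `SenseBarrier` and `run_demo` are hypothetical and chosen only for illustration. Each thread flips a private sense flag on arrival; the last thread to arrive resets the counter and publishes the new sense, which releases the waiting threads. Because the sense alternates between episodes, the same barrier object can be reused without a separate reset phase.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Centralized sense-reversing barrier (illustrative CPU sketch only;
// not the multi-level GPU barrier proposed in the thesis).
class SenseBarrier {
public:
    explicit SenseBarrier(int num_threads)
        : num_threads_(num_threads), remaining_(num_threads), sense_(false) {}

    // Each thread passes its own private flag, initialized to false.
    void wait(bool& local_sense) {
        local_sense = !local_sense;  // flip the sense for this episode
        if (remaining_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            // Last arriver: reset the counter first, then release the waiters.
            remaining_.store(num_threads_, std::memory_order_relaxed);
            sense_.store(local_sense, std::memory_order_release);
        } else {
            // Spin until the shared sense matches this episode's sense.
            while (sense_.load(std::memory_order_acquire) != local_sense) {
                std::this_thread::yield();
            }
        }
    }

private:
    const int num_threads_;
    std::atomic<int> remaining_;  // threads still to arrive in this episode
    std::atomic<bool> sense_;     // shared sense, flipped once per episode
};

// Hypothetical helper: returns true if, in every round, each thread sees
// all increments of that round once it has passed the barrier.
bool run_demo(int num_threads, int rounds) {
    SenseBarrier barrier(num_threads);
    std::atomic<int> arrivals{0};
    std::atomic<bool> ok{true};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            bool local_sense = false;
            for (int r = 0; r < rounds; ++r) {
                arrivals.fetch_add(1);
                barrier.wait(local_sense);  // all increments of round r are done
                if (arrivals.load() != num_threads * (r + 1)) ok.store(false);
                barrier.wait(local_sense);  // nobody starts round r+1 early
            }
        });
    }
    for (auto& w : workers) w.join();
    return ok.load();
}
```

Calling `run_demo(4, 100)` exercises the barrier 200 times with four threads. In this sketch every thread decrements and spins on the same two variables, which is the kind of contention on shared synchronization objects that the abstract describes; a multi-level design would presumably spread that traffic across several counters.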
    Permanent Link
    http://digital.library.wisc.edu/1793/80527
    Type
    Thesis
    Part of
    • Theses--Electrical Engineering
