Project date: Apr. 2021 – Sep. 2021
My team of 3 started this project for the 3-day San Diego Supercomputer Center GPU Hackathon. We worked to improve the existing SigProfilerExtractor GPU code, which was implemented using PyTorch.
The code I wrote is now part of the main SigProfilerExtractor program. It ran 14x faster and was 3x more memory-efficient than the original GPU code, which was itself already an order of magnitude faster than the basic CPU implementation.
The existing codebase is entirely in Python, so we used pybind11, a header-only C++ library that exposes compiled C/C++ code to Python as an importable extension module. All of the core functions were written in CUDA C++, NVIDIA's C++-based language for GPU programming.
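To make the Python-to-CUDA bridge concrete, here is a minimal sketch of a CUDA function exposed to Python through pybind11. It is not the project's code: the module name `gpu_ops`, the `scale` function, and the toy kernel are all hypothetical, chosen only to show the shape of such a binding.

```cuda
// gpu_ops.cu -- hypothetical file and module name, for illustration only.
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>   // converts std::vector<float> to/from Python lists
#include <cuda_runtime.h>
#include <stdexcept>
#include <vector>

namespace py = pybind11;

// Toy kernel: multiply every element of x by factor, one thread per element.
__global__ void scale_kernel(float* x, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= factor;
}

// Turn a CUDA error into a C++ exception; pybind11 surfaces it in Python
// as a RuntimeError.
static void check(cudaError_t err) {
    if (err != cudaSuccess) throw std::runtime_error(cudaGetErrorString(err));
}

// Host wrapper: copy the input to the GPU, run the kernel, copy it back.
// For brevity, device memory is not released on the error paths.
std::vector<float> scale(std::vector<float> values, float factor) {
    const int n = static_cast<int>(values.size());
    if (n == 0) return values;
    const size_t bytes = n * sizeof(float);

    float* d_x = nullptr;
    check(cudaMalloc(&d_x, bytes));
    check(cudaMemcpy(d_x, values.data(), bytes, cudaMemcpyHostToDevice));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;  // round up
    scale_kernel<<<blocks, threads>>>(d_x, factor, n);
    check(cudaGetLastError());

    // This blocking copy also waits for the kernel to finish.
    check(cudaMemcpy(values.data(), d_x, bytes, cudaMemcpyDeviceToHost));
    check(cudaFree(d_x));
    return values;
}

// Defines the Python module; the first argument must match the name of the
// compiled extension file.
PYBIND11_MODULE(gpu_ops, m) {
    m.doc() = "Toy CUDA extension exposed through pybind11";
    m.def("scale", &scale, py::arg("values"), py::arg("factor"));
}
```

Once compiled with nvcc into a Python extension module, this would be used from Python as `import gpu_ops` followed by `gpu_ops.scale([1.0, 2.0], 3.0)`, so the rest of the Python codebase does not need to know the work happens on the GPU.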
- We used NVIDIA’s Nsight Systems and Nsight Compute along with several built-in command-line commands to profile our code’s performance.
- cuBLAS is NVIDIA's GPU-accelerated implementation of the BLAS linear algebra routines, built on highly optimized parallel algorithms in CUDA. We replaced all basic matrix operations with cuBLAS functions to speed up the code.
- CUDA Graphs capture the operations performed during each iteration as a graph and launch them together, which greatly reduces the downtime between operations.
- We used a concept called “handles” to expedite memory transfers between system memory and GPU memory.
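
The cuBLAS and CUDA Graphs points above can be sketched together. The example below is an illustration under stated assumptions, not the project's code: the function name `accumulate_with_graph`, the 2x2 matrices, and the choice of a single matrix multiply as the "iteration" are all hypothetical. It records one cuBLAS call into a graph via stream capture and then replays that graph several times. The `cublasHandle_t` is the library context that every cuBLAS call requires.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdexcept>
#include <vector>

static void cuda_ok(cudaError_t e) {
    if (e != cudaSuccess) throw std::runtime_error(cudaGetErrorString(e));
}
static void cublas_ok(cublasStatus_t s) {
    if (s != CUBLAS_STATUS_SUCCESS) throw std::runtime_error("cuBLAS call failed");
}

// Hypothetical demo: runs C += A * B `launches` times by replaying a graph.
// For brevity, resources are not released on the error paths.
std::vector<float> accumulate_with_graph(int launches) {
    const int n = 2;  // n x n matrices, stored column-major as cuBLAS expects
    const std::vector<float> hA = {1.0f, 2.0f, 3.0f, 4.0f};
    const std::vector<float> hB = {1.0f, 0.0f, 0.0f, 1.0f};  // identity
    const size_t bytes = hA.size() * sizeof(float);

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cuda_ok(cudaMalloc(&dA, bytes));
    cuda_ok(cudaMalloc(&dB, bytes));
    cuda_ok(cudaMalloc(&dC, bytes));
    cuda_ok(cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice));
    cuda_ok(cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice));
    cuda_ok(cudaMemset(dC, 0, bytes));

    cudaStream_t stream;
    cuda_ok(cudaStreamCreate(&stream));
    cublasHandle_t handle;
    cublas_ok(cublasCreate(&handle));
    cublas_ok(cublasSetStream(handle, stream));  // cuBLAS work goes to `stream`

    // One "iteration": C = alpha * A * B + beta * C, i.e. C += A * B.
    const float alpha = 1.0f, beta = 1.0f;
    auto iteration = [&]() {
        cublas_ok(cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                              &alpha, dA, n, dB, n, &beta, dC, n));
    };

    // Warm-up call outside capture so one-time library setup is not
    // recorded, then reset C to zero.
    iteration();
    cuda_ok(cudaMemsetAsync(dC, 0, bytes, stream));

    // Record the iteration into a graph instead of executing it.
    cudaGraph_t graph = nullptr;
    cudaGraphExec_t exec = nullptr;
    cuda_ok(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal));
    iteration();
    cuda_ok(cudaStreamEndCapture(stream, &graph));
    cuda_ok(cudaGraphInstantiateWithFlags(&exec, graph, 0));

    // Replay: each launch submits the whole recorded iteration at once.
    for (int i = 0; i < launches; ++i) cuda_ok(cudaGraphLaunch(exec, stream));
    cuda_ok(cudaStreamSynchronize(stream));

    std::vector<float> hC(hA.size());
    cuda_ok(cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost));

    cuda_ok(cudaGraphExecDestroy(exec));
    cuda_ok(cudaGraphDestroy(graph));
    cublas_ok(cublasDestroy(handle));
    cuda_ok(cudaStreamDestroy(stream));
    cuda_ok(cudaFree(dA));
    cuda_ok(cudaFree(dB));
    cuda_ok(cudaFree(dC));
    return hC;
}
```

Because B is the identity here, each replay adds A to C, which makes the result easy to check by hand. In a real workload the captured iteration would contain many kernels and library calls, and the saving comes from submitting them as one unit rather than launching each separately. The file needs to be linked against cuBLAS (for example with `-lcublas`).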