Mutational Signature Extraction using NMF in CUDA

Project date: Apr. 2021 – Sep. 2021

My team of 3 started this project for the 3-day San Diego Supercomputer Center GPU Hackathon. We worked to improve the existing SigProfilerExtractor GPU code, which was implemented using PyTorch.

The code that I wrote is now included in the main SigProfilerExtractor program. It ran 14x faster and was 3x more memory efficient than the original code, which was already an order of magnitude faster than the basic CPU implementation.

Implementation

The existing codebase is entirely in Python, so we used pybind11, a header-only C++ library that exposes compiled C/C++ code to Python as an importable module. All of the core functions were written in CUDA, NVIDIA’s GPU programming platform, which extends C++.

Optimizations

  • We used NVIDIA’s Nsight Systems and Nsight Compute, along with several command-line tools, to profile our code’s performance.
  • cuBLAS is NVIDIA’s GPU-accelerated implementation of the BLAS linear algebra routines. We replaced all basic matrix operations with cuBLAS functions to speed up the code.
  • CUDA Graphs records the operations performed during each iteration as a graph and launches them together, which greatly reduces the launch overhead and idle time between operations.
  • We used a concept called “handles” to expedite memory transfers between system memory and GPU memory.
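
To show why matrix operations were the natural target for optimization, here is a minimal pure-Python sketch of NMF. It assumes the widely used multiplicative-update rule for the Frobenius objective; this write-up does not state which update rule the project actually used, and the function names (nmf, matmul, frobenius_error) are hypothetical. V stands for a non-negative input matrix, factored into W and H. Every iteration is a handful of matrix products and element-wise operations, which is the kind of work that maps onto GPU BLAS routines.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - WH, used here only to monitor the fit."""
    wh = matmul(w, h)
    return sum(
        (v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0]))
    ) ** 0.5


def nmf(v, rank, iterations=200, eps=1e-9, seed=0):
    """Factor non-negative V (n x m) into W (n x rank) and H (rank x m).

    Assumption: multiplicative updates, chosen for illustration only.
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h
```

Calling nmf(v, rank=2) on a small non-negative matrix returns two non-negative factors, and frobenius_error(v, w, h) reports how closely their product reproduces the input.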

Pan-Precancer Genomic Analysis

Project date: Mar. 2019 – Sep. 2021

The goal of this project was to create a compendium of mutational signatures for all precancers in an attempt to better understand the nature of cancer progression.

Pipeline Design

To analyze mutational signatures, the first step is variant calling. In this case, that meant finding differences between the DNA of a normal sample and a tumor sample. The pipeline we designed started with alignment and used an ensemble method to identify variants with a greater degree of certainty than any single variant calling program could provide alone.

Variant Filtering

To ensure our results had as little noise as possible, we employed several filters on the variants.

Ensemble Learning

We used an ensemble of several reliable variant callers to identify variants from each sample. Our filter retained only the variants that were called by multiple variant callers, in an attempt to reduce false positives.
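
The consensus idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the pipeline's actual code: the function name consensus_variants, the min_callers threshold of 2, and the representation of a variant as a (chromosome, position, ref, alt) tuple are all hypothetical.

```python
from collections import Counter


def consensus_variants(calls_by_caller, min_callers=2):
    """Keep variants reported by at least `min_callers` different callers.

    `calls_by_caller` maps a caller name to an iterable of variants, where
    each variant is a hashable key such as (chrom, pos, ref, alt).
    The default threshold is an assumption made for this sketch.
    """
    counts = Counter()
    for variants in calls_by_caller.values():
        # set() ensures a caller that reports a variant twice counts once
        counts.update(set(variants))
    return {variant for variant, n in counts.items() if n >= min_callers}
```

Raising min_callers trades sensitivity for specificity: fewer false positives survive, but so do fewer true variants that only one caller happened to detect.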

Variant Effect Predictor (VEP)

The concept of driver mutations in cancer is that the development of a cancerous tumor is primarily caused by “driver” mutations, while other random mutations that do not significantly impact the cancer’s progression are called “passenger” mutations. Mutational signatures are mostly concerned with the driver mutations. To reflect this, we identified the effects of every variant using Ensembl’s Variant Effect Predictor (VEP) software.

Allele Frequency

A concern when looking at cancers and precancers is the likelihood that a tumor sample is contaminated with normal tissue. This would cause a portion of the DNA sequences to reflect the normal genome rather than the tumor’s genome. To combat this, we removed variants whose allele frequency was too low, that is, variants where too small a proportion of the sequences carried the mutated allele rather than the normal allele.
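
A minimal Python sketch of this kind of filter follows. It assumes each variant comes with read counts for the normal (reference) allele and the mutated (alternate) allele. The field names ref_reads and alt_reads, the function names, and the 0.1 cutoff are hypothetical, since the threshold the project used is not stated here.

```python
def variant_allele_frequency(ref_reads, alt_reads):
    """Fraction of reads at a site that carry the mutated allele."""
    total = ref_reads + alt_reads
    return alt_reads / total if total else 0.0


def filter_by_vaf(variants, min_vaf=0.1):
    """Drop variants whose allele frequency falls below `min_vaf`.

    Each variant is a dict with hypothetical 'ref_reads' and 'alt_reads'
    keys; the default cutoff is an assumption made for this sketch.
    """
    return [
        v for v in variants
        if variant_allele_frequency(v["ref_reads"], v["alt_reads"]) >= min_vaf
    ]
```

A site with no reads at all gets a frequency of 0.0 and is filtered out, which avoids a division by zero on uncovered positions.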

Pan-Precancer Analysis

I analyzed 2000+ precancer samples and generated several unpublished figures illustrating our findings.