Muhammad Haseeb is a Ph.D. candidate at the Knight Foundation School of Computing and Information Sciences (KFSCIS), FIU, and has been working with Dr. Fahad Saeed since 2018. His Ph.D. research focuses on the development of High-Performance Computing (HPC) and CPU-GPU methods for scalable acceleration of computational proteomics analyses on top-500 supercomputers. Apart from his doctoral research, he worked as an Application Performance Intern at the Berkeley Lab during the summers of 2020 and 2021, where he developed CPU-GPU algorithms for scientific applications as well as performance analysis and optimization tools. During his time as a Ph.D. student, he published 6 peer-reviewed papers in top venues including Nature Computational Science, IEEE IPDPS, and IEEE BIBM, along with 2 patents and a Springer book. Haseeb was awarded the “2021 Best Graduate Student Research Award” by the KFSCIS, FIU for his research work. After graduation, he will join the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory (DOE).
For the past 30 years, computational proteomics researchers have strived to improve the efficiency of database peptide search algorithms that deduce peptides from high-throughput Mass Spectrometry (MS) data. In database peptide search, the experimentally acquired MS data are searched against large-scale databases of theoretically simulated MS data to find the best match. Tera-scale data volumes and the data-intensive nature of the search algorithms lead to execution times of several days (or weeks) with existing methods. Rapid advances in high-throughput MS instruments and increased interest in systems biology, multi-omics, and cancer proteomics studies drive the need for faster database peptide search methods.
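To make the search step concrete, the following is a minimal sketch of matching one experimental spectrum against a small database of theoretical spectra. It assumes a simple shared-peak-count score with a fixed m/z tolerance, which is only one of many possible scoring schemes and is not necessarily the one used in this work. The function names, the tolerance value, and the toy m/z values are all hypothetical.

```python
from bisect import bisect_left


def shared_peak_count(query_mz, theoretical_mz, tol=0.01):
    """Count query peaks lying within `tol` of at least one theoretical peak."""
    theo = sorted(theoretical_mz)
    count = 0
    for mz in query_mz:
        # First theoretical peak that is >= (mz - tol).
        i = bisect_left(theo, mz - tol)
        if i < len(theo) and theo[i] <= mz + tol:
            count += 1
    return count


def best_match(query_mz, database, tol=0.01):
    """Return the (peptide, score) pair with the highest shared-peak count."""
    scored = ((pep, shared_peak_count(query_mz, peaks, tol))
              for pep, peaks in database.items())
    return max(scored, key=lambda item: item[1])


# Toy data: illustrative m/z values only, not real fragment masses.
toy_database = {
    "peptide_A": [100.0, 200.0, 300.0],
    "peptide_B": [150.0, 250.0, 350.0],
}
toy_query = [100.004, 200.002, 351.0]
result = best_match(toy_query, toy_database)
```

Even in this toy form, every query spectrum is compared against every database entry, so the cost grows with the product of the two sizes. This is the growth that makes tera-scale searches so expensive.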
Our work addresses this problem by accelerating the database peptide search workflow through optimal use of the heterogeneous distributed-memory resources in modern supercomputers. The first part of this work presents an HPC framework, called HiCOPS, which efficiently parallelizes the database peptide search problem across supercomputer nodes, achieving a 10x speedup over existing HPC methods with 75% strong-scaling efficiency. The second part presents a novel data structure that reduces the size of tera-scale databases by 50% without incurring any loss in processing (query) speed. The third part presents GPU-accelerated algorithms, kernels, and data pipelines that further accelerate the HiCOPS framework by 4x, yielding an overall improvement of 40x over the existing state-of-the-art serial and/or HPC computational infrastructure.
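The figures quoted above can be related with a short sketch. It assumes the usual definition of strong-scaling efficiency (achieved speedup divided by the ideal speedup for the added nodes, at a fixed problem size) and assumes that the 10x and 4x improvements compose multiplicatively. The abstract does not state either definition, and the timings and node counts below are hypothetical, chosen only to reproduce a 75% efficiency.

```python
def strong_scaling_efficiency(t_base, n_base, t_scaled, n_scaled):
    """Achieved speedup divided by ideal speedup at a fixed problem size."""
    speedup = t_base / t_scaled
    ideal = n_scaled / n_base
    return speedup / ideal


def combined_speedup(*factors):
    """Multiply independent speedup factors into one overall factor."""
    total = 1.0
    for factor in factors:
        total *= factor
    return total


# Hypothetical run: 1 node takes 100 s; 16 nodes take 100/12 s (a 12x speedup).
efficiency = strong_scaling_efficiency(100.0, 1, 100.0 / 12, 16)

# The 10x (distributed-memory) and 4x (GPU) factors from the abstract.
overall = combined_speedup(10, 4)
```

Under these assumptions, a 12x speedup on 16 nodes corresponds to 75% efficiency, and the two acceleration stages multiply to the stated 40x.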