Muhammad Usman Tariq
Muhammad Usman Tariq is a Ph.D. candidate at the Knight Foundation School of Computing and Information Sciences at Florida International University, supervised by Professor Fahad Saeed. Usman completed his Bachelor’s in Electrical Engineering at the University of Engineering and Technology in Lahore, Pakistan, in 2014. After working for two years as a Software Engineer at Xavor Corporation, he moved to the U.S. to pursue his Master’s in Computer Engineering at Western Michigan University. There, his research focused on parallel computing, graph sampling, and FPGAs, and he also served as a Teaching Assistant in microprocessing and digital logic labs. Since starting his Ph.D. in 2018, Usman has concentrated on research at the intersection of deep learning, bioinformatics, and computational biology. He has published multiple papers in journals and conference proceedings, addressing complex problems in the fields of deep learning, proteomics, computational chemistry, and mass spectrometry through innovative approaches. In 2022, he was a research intern at Facebook (now Meta), working on creative machine learning in the Facebook App Monetization (FAM) team.
Accurately and efficiently identifying protein sequences and structures is a foundational aspect of biological sciences, laying the groundwork for our comprehension of intricate molecular interactions. High-throughput Liquid Chromatography-Tandem Mass Spectrometry (LC-MS/MS) serves as a pivotal technology in this identification process and is indispensable in shotgun proteomics. LC-MS/MS operates by fragmenting peptides, which are subsections of protein sequences. Each fragmented peptide yields a spectrum, which constitutes the output of the MS instrument. This process generates thousands of spectra that can be analyzed through computational or machine-learning techniques to deduce peptide sequences. The mass spectrometry-based peptide database search is a key computational method in this process. It matches the experimentally acquired mass spectra against a database of theoretical peptides. This search is crucial because it bridges the gap between the mass spectrometry data and actual peptide sequences, thereby leading to protein identification. Its significance is further emphasized by its ability to translate high-dimensional LC-MS/MS data into actionable biological intelligence, thus contributing to a nuanced understanding of protein structures and their functions. Conventional mass spectrometry-based peptide database searches, which usually rely on numerical algorithms, often run into problems such as inconsistencies, misidentifications, and statistical drawbacks such as a high False Discovery Rate (FDR). Consequently, over 70% of mass spectrometry data remains unidentified, underlining the immediate need for enhanced strategies.
This dissertation presents machine-learning solutions to overcome these hurdles, centering on three techniques: 1) A novel deep cross-modal similarity network, called SpeCollate, designed to produce relevant embeddings for both spectra and peptides, permitting direct comparison via L2 distance without the need for theoretical spectra or heuristic functions. 2) An attention-driven network aimed at reducing redundant and potentially erroneous matches by predicting distinct peptide attributes directly from the MS spectra using embeddings generated by SpeCollate. 3) Uncertainty assessment of the embeddings created by these deep learning architectures, introducing new metrics to quantify the model’s output uncertainty, thereby contributing to informed decision-making, particularly in clinical contexts. Experimental findings reveal that our methodology identifies peptides at a rate 1.5 times greater than existing state-of-the-art methods and reduces the search space by up to 14x. Even after accounting for computational overheads, this translates to a 10x acceleration on real-world datasets. Moreover, our uncertainty metrics achieve a 0.94 AUC-ROC in retrieving high-quality spectra and a 0.99 AUC-ROC in distinguishing in-distribution from out-of-distribution instances. To conclude, the dissertation combines the devised methods into an integrated framework designed to improve peptide identification from mass spectrometry data, drastically reduce search times, and raise reliability scores. Overall, the methodologies and metrics presented in this work mark a considerable advance in computational proteomics, delivering improved performance, efficiency, and reliability.
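To illustrate what comparing spectra and peptides "via L2 distance" can look like once both live in a shared embedding space, here is a minimal NumPy sketch. It is not the SpeCollate implementation: the function name `l2_search` is hypothetical, the toy vectors stand in for embeddings that the trained networks would produce, and a brute-force search is assumed for simplicity.

```python
import numpy as np

def l2_search(spectrum_embs, peptide_embs, top_k=1):
    """Hypothetical helper: for each spectrum embedding, return the indices
    and L2 distances of the top_k closest peptide embeddings."""
    # Squared pairwise distances via ||s - p||^2 = ||s||^2 - 2 s.p + ||p||^2
    s_sq = np.sum(spectrum_embs ** 2, axis=1, keepdims=True)
    p_sq = np.sum(peptide_embs ** 2, axis=1)
    sq = s_sq - 2.0 * spectrum_embs @ peptide_embs.T + p_sq
    sq = np.maximum(sq, 0.0)  # guard against tiny negative rounding errors
    # Rank peptides by distance for every spectrum and keep the best top_k
    order = np.argsort(sq, axis=1)[:, :top_k]
    dists = np.sqrt(np.take_along_axis(sq, order, axis=1))
    return order, dists

# Toy 2-D "embeddings" (assumed values, for illustration only)
peptide_embs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]])
spectrum_embs = np.array([[0.9, 0.1], [0.0, 2.5]])

best_idx, best_dist = l2_search(spectrum_embs, peptide_embs, top_k=1)
```

In this sketch, each spectrum is assigned the peptide whose embedding is nearest, so no theoretical spectrum has to be simulated at search time. A production system would likely replace the brute-force ranking with an indexed or batched search.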