School of Computing and Information Sciences
Camilo Valdes is a Ph.D. candidate in the Bioinformatics Research Group (BioRG) at the School of Computing and Information Sciences, working on cloud computing, machine learning, and genomics under the supervision of Professor Giri Narasimhan. He completed two B.Sc. degrees, in Computer Science and Biology, at FIU, and then worked as a research analyst at the University of Miami’s Center for Computational Science. He has published papers in top journals such as Nature, the Proceedings of the National Academy of Sciences, and Oxford Bioinformatics, as well as in the proceedings of Intelligent Systems for Molecular Biology (ISMB) and the European Conference on Computational Biology (ECCB). He also received a $25,000 research award from Amazon to develop scalable analysis tools and methods for metagenomics.
A microbiome is a collection of microorganisms that inhabit a particular environmental niche, such as the human body or soil. Profiling a microbiome is a critical task that tells us which microorganisms are present and in what proportions.
A powerful approach for profiling microbiomes is metagenomic whole-genome sequencing (mWGS), followed by a bioinformatic analysis that uses a reference collection of microbial genomes to infer the profile. Advances in recent years have steadily reduced the cost of sequencing genomes and microbiomes. This has resulted in ever-increasing collections of reference genomic sequences and enormous numbers of extremely large and complex metagenomic datasets. Analyzing these massive datasets poses a computational challenge, because the profiling task requires enormous indexes of the reference genomes and generates extremely large intermediate results.
As a first step, we propose novel, accurate, and scalable methods for analyzing massive mWGS datasets while using a “complete” reference genome collection that reflects the entirety of our current knowledge. The goal is to create high-resolution profiles that deliver correct estimates of metagenomic bacterial populations. The proposed methods leverage streaming architectures and distributed computational pipelines to create dynamic computational clusters that can process large amounts of DNA sequencing data at scale, and that run much faster and significantly cheaper than other state-of-the-art tools.