Jimeng Sun

Georgia Tech School of Computational Science and Engineering


Lecture Information:
  • October 30, 2015
  • 2:00 PM
  • ECS: 241

Speaker Bio

Jimeng Sun is an Associate Professor in the School of Computational Science and Engineering in the College of Computing at the Georgia Institute of Technology. Prior to joining Georgia Tech, he was a research staff member at the IBM TJ Watson Research Center. His research focuses on health analytics using electronic health records and data mining, especially on designing novel tensor analysis and similarity learning methods and developing large-scale predictive modeling systems. He has published over 70 papers and filed over 20 patents (5 granted). He received the ICDM best research paper award in 2008, the SDM best research paper award in 2007, and the KDD Dissertation runner-up award in 2008. Dr. Sun received his B.S. in Computer Science from Hong Kong University of Science and Technology in 2002 and his PhD in Computer Science from Carnegie Mellon University in 2007.

Abstract

As the adoption of electronic health records (EHRs) has grown, EHRs are now composed of a diverse array of data, including structured information and unstructured clinical progress notes. Two unique challenges need to be addressed in order to utilize EHR data in clinical research and practice:
1) Computational phenotyping: How to turn complex and messy EHR data into meaningful clinical concepts or phenotypes?
2) Scalable predictive modeling: How to efficiently construct and validate clinical predictive models from EHR data?
In this talk, we discuss our approaches to these challenges. For computational phenotyping, we represent EHR data as interconnected high-order relations, i.e., tensors (e.g., tuples of patient-medication-diagnosis, patient-lab, and patient-symptoms), and then develop expert-guided sparse nonnegative tensor factorization for extracting multiple phenotype candidates from EHR data. Most of the phenotype candidates are considered clinically meaningful and have predictive power. For predictive modeling, we introduce CloudAtlas, a cloud-based parallel predictive modeling system using big data infrastructure including Hadoop and Spark. Besides parallel model building, CloudAtlas can accurately estimate the running time and cost of a predictive modeling workflow and then provision the proper cluster on demand in the cloud. In particular, we demonstrate that CloudAtlas can achieve a 40X speedup plus a 40% cost saving compared to traditional sequential execution on large EHR datasets.
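To make the tensor representation concrete, the following is a minimal sketch and not the speaker's method. It builds a patient-medication-diagnosis count tensor from a handful of made-up tuples and fits a plain nonnegative CP factorization with multiplicative updates. The expert-guided and sparsity components described in the abstract are not modeled here, and the names build_tensor and nonneg_cp, the toy records, and the rank are all assumptions chosen for illustration.

```python
import numpy as np

def build_tensor(tuples):
    """Count co-occurrences of (patient, medication, diagnosis) tuples."""
    axes = [sorted({t[k] for t in tuples}) for k in range(3)]
    index = [{name: i for i, name in enumerate(axis)} for axis in axes]
    X = np.zeros(tuple(len(axis) for axis in axes))
    for p, m, d in tuples:
        X[index[0][p], index[1][m], index[2][d]] += 1.0
    return X, axes

def nonneg_cp(X, rank, n_iter=200, seed=0, eps=1e-9):
    """Plain nonnegative CP via multiplicative updates (least-squares loss).

    Each factor is multiplied by a ratio of nonnegative terms, so factors
    that start positive stay nonnegative throughout.
    """
    rng = np.random.default_rng(seed)
    A = rng.random((X.shape[0], rank)) + 0.1
    B = rng.random((X.shape[1], rank)) + 0.1
    C = rng.random((X.shape[2], rank)) + 0.1
    for _ in range(n_iter):
        A *= np.einsum("ijk,jr,kr->ir", X, B, C) / (A @ ((B.T @ B) * (C.T @ C)) + eps)
        B *= np.einsum("ijk,ir,kr->jr", X, A, C) / (B @ ((A.T @ A) * (C.T @ C)) + eps)
        C *= np.einsum("ijk,ir,jr->kr", X, A, B) / (C @ ((A.T @ A) * (B.T @ B)) + eps)
    return A, B, C

# Hypothetical toy records: two patients per condition.
records = [
    ("p1", "metformin", "diabetes"),
    ("p2", "metformin", "diabetes"),
    ("p3", "lisinopril", "hypertension"),
    ("p4", "lisinopril", "hypertension"),
]
X, (patients, meds, diagnoses) = build_tensor(records)
A, B, C = nonneg_cp(X, rank=2)

# Each rank-one component is a phenotype candidate: column r of B weights
# medications, column r of C weights diagnoses, and column r of A gives
# each patient's membership in that candidate.
X_hat = np.einsum("ir,jr,kr->ijk", A, B, C)
```

In this sketch, each column triple (A[:, r], B[:, r], C[:, r]) plays the role of one phenotype candidate, and X_hat is the low-rank reconstruction that can be compared against X.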