Jun Li

Florida International University


Lecture Information:
  • February 9, 2018
  • 2:00 PM
  • ECS 241

Speaker Bio

Jun Li is an assistant professor in the School of Computing and Information Sciences at Florida International University. He received his Ph.D. from the Department of Electrical and Computer Engineering, University of Toronto, in 2017, and his B.S. and M.S. degrees from the School of Computer Science, Fudan University, China, in 2009 and 2012, respectively. His research focuses on large-scale distributed storage systems with erasure coding. Bridging the gap between theory and practice, his research addresses both the theoretical and practical challenges of deploying erasure coding in distributed storage systems with high performance and low resource consumption.

Abstract

Distributed storage systems store a significant amount of data on a large number of commodity servers. In such systems, server failures are common and can lead to data loss. Conventionally, a distributed storage system stores replicated data to tolerate server failures, which incurs steep storage overhead. Recently, erasure coding has been increasingly replacing replication, thanks to its lower storage overhead at the same level of failure tolerance. However, traditional erasure codes, such as Reed-Solomon codes, suffer from high network and disk I/O overhead, as well as low parallelism during distributed data processing. In this talk, I will start with a general introduction to erasure coding in the context of distributed storage systems. I will then present our recent work on improving the I/O performance of erasure-coded data in distributed storage systems while preserving the desirable properties of existing erasure codes. On the input side, we take advantage of existing properties of erasure codes to improve the overall throughput of writing erasure-coded data into distributed storage systems. On the output side, we design novel codes to enhance parallelism when data are read in parallel by distributed data analytics systems.