Shantenu Jha

Rutgers University, Electrical and Computer Engineering


Lecture Information:
  • April 24, 2024
  • 12:00 PM
  • PG5: 134

Speaker Bio

Shantenu is an Assistant Professor at Rutgers University, and a Visiting Scientist at the School of Informatics (University of Edinburgh) and at University College London. Before moving to Rutgers, he was the lead for Cyberinfrastructure Research and Development at the CCT at Louisiana State University. His research interests lie at the triple point of Applied Computing, Cyberinfrastructure R&D and Computational Science. Shantenu is the lead investigator of the SAGA project (http://www.saga-project.org), which is a community standard and is part of the official middleware/software stack of most major Production Distributed Cyberinfrastructure, such as US NSF’s XSEDE and the European Grid Infrastructure. His research has been funded by multiple NSF awards, the US Department of Energy, the US National Institutes of Health (NIH), and the UK EPSRC (OMII-UK project and Research theme at the e-Science Institute). He received the NSF CAREER Award in 2013 and has won several prestigious awards at ACM/IEEE Supercomputing and the International Supercomputing Series. He seeks fearless and revolutionary young minds to join the RADICAL (thinking) group! Away from work, Jha tries middle-distance running and biking, tends to be an economics junkie, enjoys reading and writing random musings, and tries to use his copious amounts of free time with a conscience.

Abstract

Several important science and engineering problems require the coordinated execution of multiple high-performance simulations. Common scenarios include, but are not limited to, “an ensemble of tasks”, “loosely-coupled simulations of tightly-coupled simulations” and “multi-component multi-physics simulations”. We posit that the tools and capabilities to support the flexible yet scalable requirements of multiple simulations are limited. A promising way to overcome this surprisingly common limitation is the use of “Pilot-Jobs”, which can be defined as container jobs that provide multi-level scheduling capabilities via an application-level overlay on system schedulers. We discuss both the theory and practice of Pilot abstractions: specifically, we introduce the P* Model of Pilot-abstractions, and present “SAGA-BigJob” as a SAGA-based extensible, interoperable and scalable implementation of the P* Model. We then discuss several science problems that have used or are using BigJob to execute multiple simulations at unprecedented scales on a range of supercomputers and distributed supercomputing infrastructure such as (the US) NSF XSEDE.