Scaling-out CloudBLAST: Deploying Elastic MapReduce across Geographically Distributed Virtualized Resources for BLAST

Project Information

Discipline
Electrical and Related Engineering (106) 
Subdiscipline
14.09 Computer Engineering 
Orientation
Research 
Abstract

This project proposes and evaluates an approach to the parallelization, deployment, and management of embarrassingly parallel bioinformatics applications (e.g., BLAST) that integrates several emerging technologies for distributed computing. In particular, it evaluates scaling out applications on a geographically distributed system formed by resources from distinct cloud providers, which we refer to as a sky-computing system. Such environments are inherently disconnected and heterogeneous in performance, so scaling out applications efficiently, in terms of both management and performance, requires combining and extending several existing technologies.

Intellectual Merit

An end-to-end approach to sky computing is proposed, integrating the following technologies and techniques:
  • an Infrastructure-as-a-Service cloud toolkit (Nimbus) to create virtual machines (VMs) on demand, with contextualization services that facilitate the formation and management of a logical cluster
  • a virtual network (ViNe) to connect VMs that are on private networks or protected by firewalls
  • virtual networking (TinyViNe) to overcome additional connectivity limitations imposed by cloud providers or middleware
  • a MapReduce framework (Hadoop) for parallel, fault-tolerant execution of unmodified applications
  • extensions to Hadoop to handle inputs such as those required by BLAST
  • skewed task distribution to deal with resource imbalance
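The last two techniques can be illustrated with a minimal sketch. It is not the project's Hadoop extension: the function names, site names, and relative speeds below are hypothetical, and the proportional (largest-remainder) split is only one plausible way to skew task distribution toward faster resources. The first function keeps each FASTA record whole so a query sequence is never cut in the middle; the second assigns more tasks to sites assumed to be faster.

```python
from typing import Dict, List


def split_fasta_records(text: str) -> List[str]:
    """Split FASTA text into whole records, each starting at a '>' header line."""
    records: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        # A new header closes the record collected so far.
        if line.startswith(">") and current:
            records.append("\n".join(current))
            current = []
        if line.strip():  # skip blank lines
            current.append(line)
    if current:
        records.append("\n".join(current))
    return records


def skewed_shares(num_tasks: int, speeds: Dict[str, float]) -> Dict[str, int]:
    """Assign task counts proportional to each site's assumed relative speed.

    Uses largest-remainder rounding so the counts always sum to num_tasks.
    """
    if num_tasks < 0:
        raise ValueError("num_tasks must be non-negative")
    if not speeds or any(s <= 0 for s in speeds.values()):
        raise ValueError("speeds must be non-empty and strictly positive")

    total = sum(speeds.values())
    exact = {site: num_tasks * s / total for site, s in speeds.items()}
    shares = {site: int(quota) for site, quota in exact.items()}

    # Hand out the tasks lost to rounding, largest fractional part first
    # (ties broken by site name so the result is deterministic).
    leftover = num_tasks - sum(shares.values())
    order = sorted(speeds, key=lambda site: (-(exact[site] - shares[site]), site))
    for site in order[:leftover]:
        shares[site] += 1
    return shares


if __name__ == "__main__":
    # Hypothetical input and speeds, for illustration only.
    queries = split_fasta_records(">q1\nACGT\nGGTA\n>q2\nTTAA\n")
    print(len(queries), "query records")
    print(skewed_shares(10, {"site_a": 2.0, "site_b": 1.0, "site_c": 1.0}))
```

In practice the relative speeds would have to be measured or estimated for each site; the values used here are placeholders.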

Broader Impacts

The outcomes of this project are made available in the form of publications, demos, appliances, presentations, and tutorials. This material can be transformative in two ways: it can accelerate future computer engineering developments by building on a proven, integrated cloud-based solution, and it can facilitate the use of complex systems by non-experts in the field of bioinformatics by offering an end-to-end solution to run BLAST that does not require in-depth knowledge of the underlying cyberinfrastructure technologies.

Project Contact

Project Lead
Andrea Matsunaga (ammatsun) 
Project Manager
Andrea Matsunaga (ammatsun) 
Project Members
Mauricio Tsugawa  

Resource Requirements

Hardware Systems
  • alamo (Dell OptiPlex at TACC)
  • foxtrot (IBM iDataPlex at UF)
  • hotel (IBM iDataPlex at U Chicago)
  • sierra (IBM iDataPlex at SDSC)
 
Use of FutureGrid

Perform experiments to evaluate the proposed solution and develop a tutorial.

Scale of Use

Every available system, in blocks of a few days, for running large experiments (already provided). A few VMs for upgrading and maintaining the existing solution.

Project Timeline

Submitted
05/14/2012 - 19:14