Exploring map/reduce frameworks for users of traditional HPC

Abstract

The map/reduce paradigm has become a critical part of keeping pace with the "deluge of data" generated by domain sciences such as genetics, cosmology, and high-energy physics. While data-enabled science is becoming widely accepted as a key component of scientific discovery, an understanding of exactly how to transform terabytes of raw data into useful information is not nearly as widespread. Map/reduce's roots in Java and web-oriented technologies have created a perceptible barrier to entry for users of "traditional" high-performance computing (HPC) whose core competencies include languages such as Fortran and C. Although the map/reduce framework can be generalized to any language, practical knowledge of how to extend map/reduce to the applications and languages with which HPC users are comfortable remains scarce. Thus, a gap exists between the potential and realized applications of data-intensive computing with map/reduce. This project therefore aims to develop a practical understanding of existing map/reduce frameworks and methods among the HPC professionals (e.g., XSEDE User Services staff) who provide guidance to the HPC user community within XSEDE and on an ad hoc basis. By developing this hands-on working knowledge of applying map/reduce methodologies in traditional HPC domains, we hope to provide more practical and useful guidance to traditional HPC users whose research now involves data-intensive computation. This will involve (1) exploring map/reduce in the context of traditional HPC languages (Fortran, C, and Python) via Hadoop streaming and MapReduce MPI's native support, (2) evaluating the performance and ease of use of these frameworks on existing scientific problems, (3) developing documentation and boilerplate code for potential users, and (4) establishing tutorials for migrating existing Fortran/C/Python kernels to distributed map/reduce.
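
A sense of what Hadoop streaming asks of a traditional HPC user is given by the minimal sketch below: a mapper and a reducer written as ordinary Python filters that read records from standard input and write tab-delimited key/value pairs to standard output. The file names and the key-counting task are illustrative assumptions, not a description of any specific application in this project.

    #!/usr/bin/env python
    # mapper.py (illustrative name) -- a Hadoop streaming mapper is just a
    # filter: records arrive on stdin, key/value pairs leave on stdout.
    import sys

    for line in sys.stdin:
        for token in line.split():
            # Hadoop streaming treats the text before the first tab as the key.
            print("%s\t%d" % (token, 1))

    #!/usr/bin/env python
    # reducer.py (illustrative name) -- all values for a given key arrive on
    # consecutive lines because Hadoop sorts mapper output by key.
    import sys

    current_key, count = None, 0
    for line in sys.stdin:
        key, _, value = line.rstrip("\n").partition("\t")
        if key != current_key:
            if current_key is not None:
                print("%s\t%d" % (current_key, count))
            current_key, count = key, 0
        count += int(value)
    if current_key is not None:
        print("%s\t%d" % (current_key, count))

Because both scripts are plain executables, the pair can be developed and debugged entirely outside of Hadoop, e.g. with "cat input.txt | ./mapper.py | sort | ./reducer.py".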

Intellectual Merit

This project aims to evaluate the feasibility, applicability, and ease of applying the map/reduce methodology and existing frameworks to problems in the domain sciences. The significance of this work, though, lies in exploring these areas from the perspective of a traditional HPC user rather than that of a cloud-services provider or a developer of a specific map/reduce product. A model application for this evaluation will be the digestion of gene sequence variations from Variant Call Format (VCF) files into a form suitable for loading into a genomics database application. This simple yet common task is representative of a widespread bottleneck in traditional scientific analyses, data reduction, which will only worsen as the level of parallelism in simulations increases.
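
To make the VCF example concrete, the following mapper is a minimal sketch assuming tab-delimited VCF records whose first five columns are CHROM, POS, ID, REF, and ALT, as defined by the VCF specification; keying the output by chromosome and the exact fields retained for the database load are illustrative choices rather than a description of any particular genomics pipeline.

    #!/usr/bin/env python
    # vcf_mapper.py (hypothetical name) -- reduce raw VCF records to only the
    # fields a downstream genomics database needs, one variant per line.
    import sys

    for line in sys.stdin:
        if line.startswith("#"):
            continue                      # skip VCF header and metadata lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue                      # ignore malformed records
        chrom, pos, _id, ref, alt = fields[:5]
        # Key by chromosome so that all variants on a chromosome reach the
        # same reducer, which could then batch them into database load files.
        print("%s\t%s,%s,%s" % (chrom, pos, ref, alt))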

Broader Impact

The goal of this project is inherently broad in that its purpose is to develop an understanding of how the map/reduce methodology can be used by computational scientists of all domains with as low an entry barrier as possible. By equipping the professionals who work directly with users of high-performance computers (XSEDE User Services staff) with an understanding of the strengths of map/reduce frameworks and of how to apply them using the native languages of user applications, this project will narrow the gap between traditional HPC and these emerging data-intensive techniques. This knowledge will be made publicly available online as short written tutorials, performance comparisons, and whitepapers published in website, wiki, and blog formats.

Use of FutureGrid

FutureGrid is the enabling technology behind this project. While the members of this project have access to compute cycles on traditional HPC resources by virtue of being XSEDE users and staff, it is not sensible to compete with the XSEDE user base for those cycles simply to test map/reduce frameworks. Maintaining a low barrier to entry is a key goal, and since a rich VM-based map/reduce ecosystem already exists for technologies such as Hadoop and Twister, FutureGrid is an ideal platform for rapidly assessing how suitable these technologies are for key domain-specific problems. Should the deployment of "virtual appliances" prove to be an effective way of allowing researchers to easily adapt map/reduce to typical problems, FutureGrid is also an excellent development testbed for such appliances.

Furthermore, FutureGrid resources are equipped with hardware and software relevant to traditional HPC users, such as InfiniBand interconnects and the Torque resource manager. This allows the project to be as realistic and relevant as possible when assessing both frameworks (e.g., MapReduce MPI over InfiniBand) and ease of use (e.g., adapting existing Torque scripts to use map/reduce). Finally, FutureGrid was architected with scientific research in mind and already has a broad community and knowledge base that have made tremendous progress in making these cloud-oriented technologies accessible to researchers. This vastly reduces the startup overhead for the specific goals of this project.
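
In terms of ease of use, the relevant comparison is often a single command: the line that replaces, for example, an mpirun invocation inside an existing Torque job script. The sketch below, written in Python for consistency with the other examples, launches a Hadoop streaming job over mapper and reducer scripts like those sketched above; the location of the streaming jar and the HDFS input/output paths are assumptions that depend on the particular Hadoop installation.

    #!/usr/bin/env python
    # run_streaming.py (hypothetical) -- launch a Hadoop streaming job; in
    # practice this is a single shell command inside an existing Torque script.
    import subprocess

    STREAMING_JAR = "/path/to/hadoop-streaming.jar"   # install-dependent path

    subprocess.check_call([
        "hadoop", "jar", STREAMING_JAR,
        "-input",   "vcf_input/",        # HDFS directory of raw records
        "-output",  "vcf_output/",       # HDFS directory for reduced output
        "-mapper",  "mapper.py",
        "-reducer", "reducer.py",
        "-file",    "mapper.py",         # ship the scripts out with the job
        "-file",    "reducer.py",
    ])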

Scale Of Use

This project is not directly funded and, as such, is subject to intermittent bursts of demand as members' time allows. However, it is intended to be a long-term, ongoing effort that continually assesses new methodologies as they emerge.

FG-344
Glenn K. Lockwood
University of California San Diego
Status: Active
