Wide-area distributed file system for MapReduce applications on the FutureGrid platform

Project Information

Discipline
Computer Science (401) 
Orientation
Research 
Abstract

Map/Reduce is a software framework introduced by Google for processing large datasets in parallel across a large number of compute nodes. It has received significant attention as a programming model for various scientific problems on a wide variety of computing architectures, including large-scale compute clusters, GPGPUs, and multi-core architectures. However, some data-intensive computing applications, such as those used in the Large Hadron Collider (LHC) computing project, earth observation sciences, and biomedical applications, require large-scale distributed data processing across multiple data centers connected by high-speed networks. In this project, we aim to develop a software framework for Map/Reduce across distributed data centers in support of data-intensive applications.

Intellectual Merit

"We will develop a Wide area Distributed File System, a high-performance, large-scale file system that integrates existing data management toolkits. The Cyberaide Farm is not expected to provide a complete set of file system functionalities; however, it will deliver high-performance data transfer and replication management across distributed data centers. We will also develop a Cyberaide Farm Runtime System, which extends the Apache Hadoop framework and provides a set of APIs for the developer and user communities."

Broader Impacts

Our work will be implemented and tested on distributed production data centers powered by the FutureGrid project. We will use standard data-intensive computing benchmarks, high-performance engineering applications, and data-intensive social computing applications to evaluate our implementations.

Project Contact

Project Lead
Lizhe Wang (lizhe) 
Project Manager
Lizhe Wang (lizhe) 

Resource Requirements

Hardware Systems
  • hotel (IBM iDataPlex at U Chicago)
  • india (IBM iDataPlex at IU)
  • sierra (IBM iDataPlex at SDSC)
  • xray (Cray XT5m at IU)
 
Use of FutureGrid

We plan to use FutureGrid as a test bed to set up several clusters and deploy a wide-area parallel file system for MapReduce applications.

Scale of Use

Distributed Hadoop clusters spanning a wide area, each with at least 8 compute nodes.

Project Timeline

Submitted
12/12/2010 - 12:18 
Completed
01/18/2013