Wide area distributed file system for MapReduce applications on FutureGrid platform
Project Information
- Discipline
- Computer Science (401)
- Orientation
- Research
Map/Reduce is a software framework introduced by Google for processing large datasets in parallel across many compute nodes. Map/Reduce has received significant attention as a programming model for scientific problems on a wide variety of computing architectures, including large-scale compute clusters, GPGPUs, and multi-core architectures. However, some data-intensive computing applications, such as those used in the Large Hadron Collider (LHC) computing project, earth observation sciences, and biomedical applications, require large-scale distributed data processing across multiple data centers connected by high-speed networks. In this project, we aim to develop a software framework for Map/Reduce across distributed data centers in support of data-intensive applications.
Intellectual Merit
We will develop a Wide area Distributed File System, a high performance large-scale file system that integrates existing data management toolkits. The Cyberaide Farm is not expected to provide a complete set of file system functionalities; however, it will deliver high performance data transfer and replication management across distributed data centers. We will develop a Cyberaide Farm Runtime System, which extends the Apache Hadoop framework and provides a set of APIs to accommodate the developer and user communities.
Broader Impacts
Our work will be implemented and tested on distributed production data centers powered by the FutureGrid project. We will use standard data-intensive computing benchmarks, high performance engineering applications, and data-intensive social computing applications to evaluate our implementations.
Project Contact
- Project Lead
- Lizhe Wang (lizhe)
- Project Manager
- Lizhe Wang (lizhe)
Resource Requirements
- Hardware Systems
- hotel (IBM iDataPlex at U Chicago)
- india (IBM iDataPlex at IU)
- sierra (IBM iDataPlex at SDSC)
- xray (Cray XT5m at IU)
We plan to use FutureGrid as a test bed to set up several clusters and deploy a wide-area parallel file system for MapReduce applications.
Scale of Use
Distributed Hadoop clusters spanning a wide area, each with at least 8 compute nodes.
Project Timeline
- Submitted
- 12/12/2010 - 12:18
- Completed
- 01/18/2013