Evaluation of Hadoop for IO-intensive applications

Project Information

Discipline
Computer Science (401) 
Orientation
Research 
Abstract

One advantage of MapReduce is its data-affinity-aware scheduling, which makes MapReduce more efficient than traditional HPC systems for data-intensive applications. In this project, we evaluate the performance of Hadoop for IO-intensive applications.

Intellectual Merit

We closely investigate how Hadoop performs when running IO-intensive applications. The execution time of a MapReduce job on Hadoop is affected by many factors. We select several important ones (e.g., input data size and the number of nodes) and measure how each affects job run time.
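
The measurement approach described above can be sketched as a small parameter sweep. This is only an illustration under stated assumptions, not the project's actual harness: the names sweep, summarize, and run_job are hypothetical, and run_job stands in for whatever code submits a Hadoop job with the given input size and node count and blocks until the job finishes.

```python
import itertools
import statistics
import time
from typing import Callable, Dict, Iterable, List, Tuple

# A configuration is (input size in GB, number of nodes). Both factors come
# from the project description; the units are an assumption for illustration.
Config = Tuple[int, int]


def sweep(
    run_job: Callable[[int, int], None],
    input_sizes_gb: Iterable[int],
    node_counts: Iterable[int],
    repeats: int = 3,
) -> Dict[Config, List[float]]:
    """Time run_job once per repeat for every (input size, node count) pair.

    run_job is a hypothetical callable that is assumed to submit the job
    and return only after the job has completed.
    """
    results: Dict[Config, List[float]] = {}
    for size_gb, nodes in itertools.product(input_sizes_gb, node_counts):
        timings: List[float] = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_job(size_gb, nodes)
            timings.append(time.perf_counter() - start)
        results[(size_gb, nodes)] = timings
    return results


def summarize(results: Dict[Config, List[float]]) -> Dict[Config, float]:
    """Reduce the repeated timings of each configuration to their median."""
    return {config: statistics.median(t) for config, t in results.items()}


if __name__ == "__main__":
    # Stand-in job that does no real work; replace with a real job launcher.
    def fake_job(size_gb: int, nodes: int) -> None:
        pass

    for config, seconds in summarize(sweep(fake_job, [1, 2], [20, 60])).items():
        print(config, seconds)
```

Repeating each configuration and reporting the median is one way to reduce the influence of run-to-run noise; other summary statistics would serve equally well.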

Broader Impacts

This project helps Hadoop users understand how the factors we considered influence performance. Users can then tune those factors to maximize performance in their specific environments.

Project Contact

Project Lead
Zhenhua Guo (zhguo) 
Project Manager
Zhenhua Guo (zhguo) 

Resource Requirements

Hardware Systems
  • hotel (IBM iDataPlex at U Chicago)
  • india (IBM iDataPlex at IU)
  • sierra (IBM iDataPlex at SDSC)
 
Use of FutureGrid

We wish to use bare-metal machines to test how IO-intensive applications perform on Hadoop in FutureGrid.

Scale of Use

We plan to use 20-60 HPC nodes.

Project Timeline

Submitted
10/28/2010 - 15:21 
Completed
09/06/2012