Investigation of Data Locality and Fairness in MapReduce

Project Information

Discipline
Computer Science (401) 
Orientation
Research 
Abstract

Traditional High-Performance Computing (HPC) environments separate compute and storage resources and adopt a "bring data to compute" strategy. MapReduce is a data-parallel model that uses the same set of nodes for both compute and storage. As a result, data affinity is integrated into the scheduling algorithm to bring compute to data. In data-intensive computing, data locality matters more than before because it can significantly reduce network traffic. In this project, we investigate the data locality of MapReduce in detail. Specifically: 1) we summarize important system factors and theoretically deduce the relationship between those factors and data locality; 2) we analyze the state-of-the-art Hadoop scheduling algorithms to investigate their performance; 3) we propose new scheduling algorithms that yield optimal data locality; 4) we integrate data locality and fairness; 5) we compare our algorithms with the default Hadoop scheduling algorithm.
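
To make the "bring compute to data" idea concrete, the following is a minimal Python sketch of a greedy, locality-first task assignment. It is an illustration only: it is neither the algorithm proposed in this project nor Hadoop's actual scheduler code, and the function name, task names, and node names are all hypothetical. Each idle node takes the first pending task whose input has a replica on that node, and falls back to any pending task otherwise.

```python
from typing import Dict, List, Optional, Set, Tuple


def greedy_locality_schedule(
    task_replicas: Dict[str, Set[str]],
    idle_nodes: List[str],
) -> Tuple[Dict[str, str], float]:
    """Assign at most one pending task to each idle node, preferring local data.

    task_replicas maps a task name to the set of nodes holding its input data.
    Returns (task -> node assignment, fraction of assigned tasks that are local).
    """
    pending = dict(task_replicas)  # copy; dicts preserve insertion order
    assignment: Dict[str, str] = {}
    local_count = 0

    for node in idle_nodes:
        if not pending:
            break
        # First pending task that has a replica of its input on this node.
        local: Optional[str] = next(
            (task for task, nodes in pending.items() if node in nodes), None
        )
        if local is not None:
            task = local
            local_count += 1
        else:
            # No local candidate: run a remote task rather than idle.
            task = next(iter(pending))
        assignment[task] = node
        del pending[task]

    ratio = local_count / len(assignment) if assignment else 0.0
    return assignment, ratio


# Hypothetical example: three tasks, three idle nodes.
tasks = {"t1": {"A", "B"}, "t2": {"A"}, "t3": {"C"}}
assignment, ratio = greedy_locality_schedule(tasks, ["A", "B", "D"])
print(assignment)  # {'t1': 'A', 't2': 'B', 't3': 'D'}
print(ratio)       # 1 of 3 tasks runs on a node holding its data
```

In this toy input the greedy choice places t1 on node A, which leaves t2 without a local slot, so only one of three tasks is data-local. Assigning t2 to A and t1 to B instead would make two tasks local, which illustrates why a scheduler that considers all pending tasks and idle slots together can achieve better locality than a per-slot greedy choice.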

Intellectual Merit

This project addresses an important issue in MapReduce: data locality. Our proposed algorithms yield optimal data locality and can substantially reduce the time spent moving data. The integration of data locality and fairness allows users to choose the tradeoff that best fits their environments and requirements.

Broader Impacts

In the era of data-intensive computing, data locality is critical because moving extremely large amounts of data during processing is inefficient. This project helps researchers better understand MapReduce data locality in a quantitative way. In addition, it produces conclusions and results that lay a foundation for further research on data-parallel systems.

Project Contact

Project Lead
Zhenhua Guo (zhguo) 
Project Manager
Zhenhua Guo (zhguo) 

Resource Requirements

Hardware System
  • hotel (IBM iDataPlex at U Chicago)
 
Use of FutureGrid

We ran extensive simulation experiments on FutureGrid bare metal machines.

Scale of Use

We used 1 to 5 HPC nodes.

Project Timeline

Submitted
09/06/2012 - 16:33 
Completed
10/18/2012