Investigation of Data Locality and Fairness in MapReduce
Abstract
Traditional High-Performance Computing (HPC) environments separate compute and storage resources and adopt a "bring data to compute" strategy. MapReduce is a data-parallel model that uses the same set of nodes for both compute and storage. As a result, data affinity is integrated into the scheduling algorithm to bring compute to data. In data-intensive computing, data locality matters more than ever because it can significantly reduce network traffic. In this project, we investigate the data locality of MapReduce in detail: 1) we summarize the important system factors and theoretically derive the relationship between those factors and data locality; 2) we analyze the state-of-the-art Hadoop scheduling algorithms to evaluate their performance; 3) we propose new scheduling algorithms that yield optimal data locality; 4) we integrate data locality and fairness; 5) we compare our algorithms with the default Hadoop scheduling algorithm.
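To make "bring compute to data" concrete, the following is a minimal sketch of a greedy, locality-first task assignment together with a simple locality metric. It is an illustration only, not the algorithms proposed in this project nor Hadoop's actual scheduler; the names `assign_tasks`, `locality_ratio`, `replicas`, and `idle_nodes` are hypothetical, and the sketch assumes each task reads a single input block and each idle node has one free slot.

```python
from typing import Dict, List, Set


def assign_tasks(replicas: Dict[str, Set[str]], idle_nodes: List[str]) -> Dict[str, str]:
    """Greedily give each idle node one pending task.

    replicas maps a task id to the set of nodes that store a replica of
    the task's input block (hypothetical, simplified model).
    """
    pending = list(replicas)  # task ids in submission order
    assignment: Dict[str, str] = {}
    for node in idle_nodes:
        if not pending:
            break
        # Prefer the first pending task whose input is stored on this node;
        # otherwise fall back to the task at the head of the queue.
        local = next((t for t in pending if node in replicas[t]), None)
        task = local if local is not None else pending[0]
        pending.remove(task)
        assignment[task] = node
    return assignment


def locality_ratio(replicas: Dict[str, Set[str]], assignment: Dict[str, str]) -> float:
    """Fraction of assigned tasks that run on a node holding their input."""
    if not assignment:
        return 0.0
    local = sum(1 for task, node in assignment.items() if node in replicas[task])
    return local / len(assignment)


if __name__ == "__main__":
    replicas = {"t1": {"n1", "n2"}, "t2": {"n2"}, "t3": {"n3"}}
    greedy = assign_tasks(replicas, ["n2", "n1", "n4"])
    print(greedy, locality_ratio(replicas, greedy))
```

In this toy input the greedy pass places `t1` on `n2`, which leaves no local node for `t2`, so only one of three tasks runs locally. Assigning `t1` to `n1` and `t2` to `n2` instead would make two of three tasks local, which shows why the order of per-node decisions affects the locality that a scheduler achieves.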
Intellectual Merit
This project addresses an important issue in MapReduce: data locality. Our proposed algorithms yield optimal data locality and can dramatically reduce data movement time. The integration of data locality and fairness allows users to choose the best tradeoff for their environments and requirements.
Broader Impact
In the era of data-intensive computing, data locality is critical because moving extremely large amounts of data during processing is inefficient. This project helps researchers understand MapReduce data locality quantitatively. In addition, it produces conclusions and results that lay the foundation for further research on data-parallel systems.
Use of FutureGrid
We ran extensive simulation experiments on FutureGrid bare-metal machines.
Scale of Use
We used 1 to 5 HPC nodes.
Publications
- [Guo:2012:IDL:2287016.2287022] Guo, Z., G. Fox, and M. Zhou, "Investigation of data locality and fairness in MapReduce", ACM, 2012
- [fg-261-05-2012-a] Guo, Z., G. Fox, and M. Zhou, "Investigation of Data Locality in MapReduce", IEEE Computer Society, 05/2012
Results
The detailed results of this project are presented in two papers: "Investigation of data locality and fairness in MapReduce" [1] and "Investigation of Data Locality in MapReduce" [2].
References
- [Guo:2012:IDL:2287016.2287022] Guo, Z., G. Fox, and M. Zhou, "Investigation of data locality and fairness in MapReduce", Proceedings of the Third International Workshop on MapReduce and its Applications, Delft, The Netherlands, ACM, pp. 25–32, 2012.
- [fg-261-05-2012-a] Guo, Z., G. Fox, and M. Zhou, "Investigation of Data Locality in MapReduce", Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012), Ottawa, Canada, IEEE Computer Society, pp. 419–426, 05/2012.