Distributed MapReduce
Abstract
Data is now generated at a rapid pace all over the world. Because this data is widely distributed and often geographically distant from the data centers that process it, analyzing it is a major challenge both for large companies and for pay-as-you-go cloud services. In earlier work we experimentally compared the performance of different Hadoop architectures on widely scattered data sets and proposed performing the computation as close to the data as possible. We also pointed out that several factors must be balanced when the data to be processed resides outside the cloud. http://www-users.cs.umn.edu/~cardosa/cardosa-mapred11.pdf
This project will carry out further experiments on the implementation of this distributed MapReduce system.
Intellectual Merit
This project will advance distributed systems research, particularly on problems where moving data is a costly part of the overall computation workflow (for example, scientific data that must be imported into a computing cluster). It will also produce an improved Hadoop prototype that performs better on widely distributed data sets. The prototype will be open-sourced so that it can be used in other scientific experiments.
Our previously published paper showed that this problem is worth pursuing:
http://www-users.cs.umn.edu/~cardosa/cardosa-mapred11.pdf
Broader Impact
The project involves students ranging from undergraduates to Ph.D. candidates. It also strengthens collaboration among several research groups, including networking, databases, and distributed systems. The finished software could be used for further research in areas such as spatial data mining and social networks.
Use of FutureGrid
Run simulations of different data transfer strategies;
Deploy our modified Hadoop;
Run experiments with the modified Hadoop to evaluate its performance.
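As a rough illustration of the first item, the kind of comparison such a simulation might make can be sketched as follows. This is a minimal sketch, not the project's simulator: the cost model, the `Site` fields, the `output_ratio` parameter, and all numbers are hypothetical assumptions introduced here for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Site:
    """A data source location (all fields are illustrative assumptions)."""
    name: str
    data_gb: float           # input data stored at this site
    wan_gb_per_s: float      # assumed bandwidth from this site to the central cluster
    compute_gb_per_s: float  # assumed local processing rate at this site


def centralized_time(sites: List[Site], central_compute_gb_per_s: float) -> float:
    """Copy all input to one central cluster, then run the job there.

    Assumes sites upload in parallel, so the transfer phase lasts as long
    as the slowest upload.
    """
    transfer = max(s.data_gb / s.wan_gb_per_s for s in sites)
    compute = sum(s.data_gb for s in sites) / central_compute_gb_per_s
    return transfer + compute


def local_time(sites: List[Site], output_ratio: float,
               central_compute_gb_per_s: float) -> float:
    """Process data where it lives, then ship only the results for a final merge.

    output_ratio is the assumed size of a site's result relative to its input.
    """
    per_site = [
        s.data_gb / s.compute_gb_per_s
        + (s.data_gb * output_ratio) / s.wan_gb_per_s
        for s in sites
    ]
    merge = sum(s.data_gb * output_ratio for s in sites) / central_compute_gb_per_s
    return max(per_site) + merge


# Hypothetical two-site scenario.
example_sites = [
    Site("site_a", data_gb=100.0, wan_gb_per_s=0.1, compute_gb_per_s=1.0),
    Site("site_b", data_gb=50.0, wan_gb_per_s=0.05, compute_gb_per_s=0.5),
]

if __name__ == "__main__":
    print("centralized:", centralized_time(example_sites, 2.0))
    print("local:", local_time(example_sites, 0.1, 2.0))
```

Under this model, computing near the data wins when results are much smaller than inputs and wide-area links are slow. As `output_ratio` approaches 1, the advantage shrinks, which is one example of the balance between factors mentioned in the abstract.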
Scale of Use
A few VMs per experiment.