Distributed MapReduce

Project Information

Discipline
Computer Science (401) 
Subdiscipline
11.07 Computer Science 
Orientation
Research 
Abstract

Data is now generated rapidly at locations all over the world. Because this data is widely dispersed and often geographically distant from the data centers that process it, it poses a major challenge both for large companies' data analysis and for on-demand cloud services. In experiments on widely scattered data sets, we compared the performance of several Hadoop architectures and proposed performing the computation as close to the data as possible. We also pointed out that, when the data to be processed resides outside the cloud, several factors must be balanced (see http://www-users.cs.umn.edu/~cardosa/cardosa-mapred11.pdf). This project will carry out further experiments on the implementation of this distributed MapReduce system.
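The trade-off described above can be illustrated with a rough cost model. The following Python sketch is not the model from the paper; the `Site` fields, the `output_ratio` parameter, and all numbers are hypothetical assumptions chosen only to show how transfer cost and compute cost might be balanced. It compares copying all input to one central cluster against running the job at each site and shipping only the reduced output.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Site:
    """A data source; every field here is an illustrative assumption."""
    name: str
    data_gb: float            # input data held at this site, in gigabytes
    uplink_gbps: float        # bandwidth to the central cluster, in gigabits/s
    compute_gb_per_s: float   # local processing throughput, in gigabytes/s


def transfer_seconds(gigabytes: float, gbps: float) -> float:
    # Convert gigabytes to gigabits before dividing by link speed.
    return gigabytes * 8 / gbps


def centralized_time(sites: List[Site], central_gb_per_s: float) -> float:
    """Copy all input to the central cluster (in parallel), then process it."""
    slowest_copy = max(transfer_seconds(s.data_gb, s.uplink_gbps) for s in sites)
    compute = sum(s.data_gb for s in sites) / central_gb_per_s
    return slowest_copy + compute


def distributed_time(sites: List[Site], output_ratio: float,
                     central_gb_per_s: float) -> float:
    """Process at each site, ship the reduced output, then combine centrally.

    output_ratio is the assumed size of a site's output relative to its input.
    """
    per_site = [
        s.data_gb / s.compute_gb_per_s
        + transfer_seconds(s.data_gb * output_ratio, s.uplink_gbps)
        for s in sites
    ]
    combine = sum(s.data_gb * output_ratio for s in sites) / central_gb_per_s
    return max(per_site) + combine


# Hypothetical example: two remote sites and one central cluster.
sites = [
    Site("site-a", data_gb=100, uplink_gbps=1.0, compute_gb_per_s=1.0),
    Site("site-b", data_gb=50, uplink_gbps=0.5, compute_gb_per_s=1.0),
]
for ratio in (0.1, 1.0):
    print(ratio,
          centralized_time(sites, central_gb_per_s=2.0),
          distributed_time(sites, ratio, central_gb_per_s=2.0))
```

Under these assumed numbers, computing near the data is faster when the job shrinks its input substantially, while a job whose output is as large as its input gains nothing from local processing. This is one example of the kind of factor that has to be balanced.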

Intellectual Merit

This project will advance research in distributed systems, particularly on problems where moving data is a costly part of the overall computation workflow (for example, scientific data that must be imported into computing clusters). It will also help build an improved Hadoop prototype that performs better on widely distributed data sets; the prototype will be open-sourced so that it can be used for other scientific experiments. Our previously published paper shows that this problem is worth pursuing: http://www-users.cs.umn.edu/~cardosa/cardosa-mapred11.pdf

Broader Impacts

The students involved range from Ph.D. students to undergraduates. The project also strengthens collaborative research across several research groups, including networking, databases, and distributed systems. The finished software could be used in further research on topics such as spatial data mining and social networks.

Project Contact

Project Lead
Chenyu Wang (wang2143) 
Project Manager
Chenyu Wang (wang2143) 
Project Members
Jerome Mitchell, Bingjing Zhang  

Resource Requirements

Hardware Systems
  • foxtrot (IBM iDataPlex at UF)
  • hotel (IBM iDataPlex at U Chicago)
  • india (IBM iDataPlex at IU)
 
Use of FutureGrid

Run simulations of different data transfer strategies; deploy our modified Hadoop; run experiments with the modified Hadoop to measure its performance.

Scale of Use

A few VMs per experiment

Project Timeline

Submitted
07/18/2011 - 13:50