Using Map/Reduce in FutureGrid

As the computing landscape becomes increasingly data-centric, data-intensive computing environments are poised to transform scientific research. In particular, MapReduce-based programming models and run-time systems, such as the open-source Hadoop system, have increasingly been adopted by researchers with data-intensive problems in areas including bioinformatics, data mining and analytics, and text processing.
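
For readers new to the programming model mentioned above, the following is a minimal, single-process sketch of the map, shuffle, and reduce steps, using word counting as the example. It is an illustration only and does not use Hadoop or Twister; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen for this sketch, not taken from any FutureGrid tutorial.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all values emitted for the same key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse the grouped values for one key into a total."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence inside a single process."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["map reduce map", "reduce on FutureGrid"]))
```

In a real MapReduce run-time, the map and reduce steps run in parallel across many nodes and the shuffle moves data between them; the sketch only shows the data flow.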

FutureGrid provides capabilities that allow users to experiment with MapReduce applications and middleware, including the widely used Hadoop platform and the iterative map/reduce Twister platform. There are several ways to use MapReduce platforms in the testbed. This page helps you select the FutureGrid capabilities best suited to your goals and links to the corresponding tutorials.

MapReduce on Physical Machines

While some MapReduce systems run on virtual machines, many dedicated Hadoop deployments run the Hadoop run-time and data-processing applications on physical machines to avoid I/O virtualization overheads. FutureGrid currently offers two main approaches for deploying Hadoop on physical machines. The first uses "MyHadoop", where Hadoop tasks are instantiated dynamically through an HPC scheduler (Torque). The second uses "SalsaHadoop", where a 'one-click script' starts Hadoop automatically on the HPC nodes that have been obtained, and tasks are submitted directly to the Hadoop master. FutureGrid also supports Twister, a lightweight iterative MapReduce runtime, on the HPC cluster.

Associated tutorials:

Basic High Performance Computing [novice]
Running Hadoop as a batch job using MyHadoop [novice]
Running SalsaHadoop (one-click Hadoop) on HPC environment [beginner]
Running Twister on HPC environment [beginner]

MapReduce on Virtual Machines

Running Hadoop on virtual machines gives users the flexibility to customize the Hadoop runtime system and any additional middleware as desired, e.g. for research on novel MapReduce middleware approaches. Currently, Hadoop images can be deployed on FutureGrid resources in the following ways:

Using Eucalyptus on FutureGrid [novice]
Running SalsaHadoop on Eucalyptus [intermediate]
Running FG-Twister on Eucalyptus [intermediate]
Running Twister on Eucalyptus [intermediate]

Education / Training with MapReduce

FutureGrid offers educational virtual appliances that let users deploy virtual private clusters on which Hadoop tasks can be run using Condor. This approach lets users experiment with Hadoop not only on FutureGrid but also on virtual clusters running on their own resources. Currently, Hadoop virtual appliances can be deployed on FutureGrid resources in the following ways:

Running a Grid Appliance on FutureGrid [novice]
Running Condor tasks on the Grid Appliance [novice]
Running Hadoop tasks on the Grid Appliance [novice]
Running Hadoop WordCount on FutureGrid [novice]
Running Hadoop Blast on FutureGrid [novice]
Running Twister Kmeans on FutureGrid [novice]
Running Twister Blast on FutureGrid [novice]
