Using Map/Reduce in FutureGrid

As the computing landscape becomes increasingly data-centric, data-intensive computing environments are poised to transform scientific research. In particular, MapReduce-based programming models and run-time systems, such as the open-source Hadoop system, have increasingly been adopted by researchers with data-intensive problems in areas including bioinformatics, data mining and analytics, and text processing.
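For readers new to the programming model, the following is a minimal single-process sketch of the map, shuffle, and reduce steps, using word counting as the text-processing example. It is plain Python and does not use Hadoop or Twister; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration. A real run-time would distribute these steps across many nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as a MapReduce run-time does
    between the map and reduce steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence in a single process."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["map reduce map", "reduce data"]))
```

Because the map step treats each record independently and the reduce step treats each key independently, both can in principle run in parallel, which is what makes the model attractive for data-intensive problems.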

FutureGrid provides capabilities that allow users to experiment with MapReduce applications and middleware, including the widely used Hadoop platform and the iterative MapReduce platform Twister. There are several ways to use MapReduce platforms in the testbed. This page helps you select the FutureGrid capabilities best suited to your goals and links to the corresponding tutorials.


MapReduce on Physical Machines


Although some MapReduce systems run on virtual machines, many dedicated Hadoop deployments run the Hadoop run-time and data-processing applications on physical machines to avoid I/O virtualization overheads. FutureGrid currently offers two main approaches for deploying Hadoop on physical machines. The first uses "MyHadoop", where Hadoop tasks are instantiated dynamically through an HPC scheduler (Torque). The second uses "SalsaHadoop", where a "one-click" script starts Hadoop automatically on the allocated HPC nodes and tasks are submitted directly to the Hadoop master. FutureGrid also supports Twister, a lightweight iterative MapReduce runtime, on the HPC cluster.

Associated tutorials:


MapReduce on Virtual Machines


Running Hadoop on virtual machines gives users the flexibility to customize the Hadoop runtime system and any additional middleware as desired, for example for research on novel MapReduce middleware approaches. Currently, Hadoop images can be deployed on FutureGrid resources in the following ways:


Education / Training with MapReduce


FutureGrid offers educational virtual appliances that allow users to deploy virtual private clusters where Hadoop tasks can be deployed using Condor. This approach lets users experiment with Hadoop and virtual clusters not only on FutureGrid but also on their own resources. Currently, Hadoop virtual appliances can be deployed on FutureGrid resources in the following ways: