Hadoop WordCount

Number:
Author: Tak-Lon Stephen Wu
Improvements: 
Version: 1.0
Date: 2013-07-02

Hadoop WordCount

WordCount is a simple program which counts the number of occurrences of each word in a given text input data set. WordCount fits very well with the MapReduce programming model making it a great example to understand the Hadoop Map/Reduce programming style. You can download the WordCount source code from Big Data for Science tutorial.

 

Acknowledge

This page was originally designed by SalsaHPC group for Big Data for Science Workshop; you can see the original pages here.

Requirements

  1. Log in to FutureGrid Cluster and obtain compute nodes. (HPC / Eucalyptus)
  2. Start SalsaHadoop/Hadoop on compute nodes. (SalsaHadoop Tutorial)
  3. Download and unzip WordCount source code from Big Data for Science tutorial.
  4. Linux command experience

1. Download and unzip WordCount under $HADOOP_HOME

Assuming you start SalsaHadoop/Hadoop with setting $HADOOP_HOME=~/hadoop-0.20.203.0, and are running the master node on i55, download and unzip the WordCount source code from Big Data for Science tutorial under $HADOOP_HOME.
[taklwu@i55 ~]$ cd $HADOOP_HOME
[taklwu@i55 hadoop-0.20.203.0]$ wget http://salsahpc.indiana.edu/tutorial/source_code/Hadoop-WordCount.zip
[taklwu@i55 hadoop-0.20.203.0]$ unzip Hadoop-WordCount.zip

2. Execute

Hadoop-WordCount

First, we need to upload the input files (any text format file) into Hadoop distributed file system (HDFS):
[taklwu@i55 hadoop-0.20.203.0]$ bin/hadoop fs -put $HADOOP_HOME/Hadoop-WordCount/input/ input
[taklwu@i55 hadoop-0.20.203.0]$ bin/hadoop fs -ls input

Here, $HADOOP_HOME/Hadoop-WordCount/input/ is the local directory where the program inputs are stored. The second "input" represents the remote destination directory on the HDFS.

After uploading the inputs into HDFS, run the WordCount program with the following commands. We assume you have already compiled the word count program.

[taklwu@i55 hadoop-0.20.203.0]$ bin/hadoop jar $HADOOP_HOME/Hadoop-WordCount/wordcount.jar WordCount input output

If Hadoop is running correctly, it will print hadoop running messages similar to the following:

WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
11/11/02 18:34:46 INFO input.FileInputFormat: Total input paths to process : 1
11/11/02 18:34:46 INFO mapred.JobClient: Running job: job_201111021738_0001
11/11/02 18:34:47 INFO mapred.JobClient:  map 0% reduce 0%
11/11/02 18:35:01 INFO mapred.JobClient:  map 100% reduce 0%
11/11/02 18:35:13 INFO mapred.JobClient:  map 100% reduce 100%
11/11/02 18:35:18 INFO mapred.JobClient: Job complete: job_201111021738_0001
11/11/02 18:35:18 INFO mapred.JobClient: Counters: 25
...

3. Monitoring Hadoop

We can also monitor the job status using lynx, a text browser, on i136 based Hadoop monitoring console. Assuming the Hadoop Jobtracker is running on i55:9003, you can type:

[taklwu@i136 ~]$ lynx i55:9003

4. Check the result

After finishing the job, please use theses commands to check the output:

[taklwu@i55 ~]$ cd $HADOOP_HOME
[taklwu@i55 ~]$ bin/hadoop fs -ls output
[taklwu@i55 ~]$ bin/hadoop fs -cat output/*

Here, "output" is the HDFS directory where the result stored. The result will look like the following:

you." 15
you; 1
you? 2
you?" 23
young 42

5. Finishing the Map-Reduce process

After finishing the job, please use this command to kill the HDFS and Map-Reduce daemon:

[taklwu@i55 hadoop-0.20.203.0]$ bin/stop-all.sh