Twister Kmeans

Author: Tak-Lon Stephen Wu
Version: 0.1
Date: 2011-11-07

In statistics and machine learning, Kmeans clustering is a method of cluster analysis that aims to partition n observations into k clusters, where each observation belongs to the cluster with the nearest mean [wikipedia].

In each iteration of Twister Kmeans, all map tasks receive the same input data (the current cluster centers), and each computes partial cluster centers by going through the 3D data points it owns. A reduce task averages the partial cluster centers and produces the cluster centers for the next step. Once the main program receives these new cluster centers, it calculates the difference between the new and the previous cluster centers, then determines whether it needs to execute another cycle of MapReduce computation.

Figure 1. The workflow of Kmeans application with Twister MapReduce framework
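
The iteration described above can be sketched in plain Python. This is only an illustration, not the Twister implementation: the function names, the convergence threshold, and the choice to represent "partial cluster centers" as per-cluster sums and counts (so that the reduce step forms a correctly weighted average) are all assumptions, and the map tasks run one after another in a single process instead of on worker nodes.

```python
import math

def nearest(point, centers):
    """Return the index of the center closest to point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])))

def map_task(points, centers):
    """One map task: per-cluster coordinate sums and counts for its own points."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for point in points:
        k = nearest(point, centers)
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += point[d]
    return sums, counts

def reduce_task(partials, old_centers):
    """Combine all partial results into the centers for the next iteration."""
    dim = len(old_centers[0])
    new_centers = []
    for k, old in enumerate(old_centers):
        total = sum(counts[k] for _, counts in partials)
        if total == 0:
            new_centers.append(list(old))  # empty cluster: keep its previous center
        else:
            new_centers.append([sum(sums[k][d] for sums, _ in partials) / total
                                for d in range(dim)])
    return new_centers

def kmeans(partitions, centers, threshold=1e-3, max_loops=100):
    """Main program: repeat map and reduce until the centers stop moving."""
    for loop in range(1, max_loops + 1):
        partials = [map_task(part, centers) for part in partitions]
        new_centers = reduce_task(partials, centers)
        # Largest distance any center moved in this iteration (assumed stop criterion).
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < threshold:
            break
    return centers, loop
```

Each element of partitions stands for the data points owned by one map task, so calling kmeans(partitions, initial_centers) returns the final centers together with the number of loops executed.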

The following tutorial shows the steps for running the built-in Twister Kmeans application in the Twister 0.9 package.


Prerequisites

  1. Log in to the FutureGrid cluster and obtain compute nodes. (HPC / Eucalyptus)
  2. Start ActiveMQ and Twister on the compute nodes. (Twister on FutureGrid Tutorial)
  3. Download and unzip the Twister Kmeans source code from the Big Data for Science tutorial.
  4. Install Apache Ant on the working node.
  5. Have basic experience with Linux commands.
  6. Open two command prompts: one for the Twister driver/master directory under $TWISTER_HOME/bin, and another for the Twister Kmeans examples directory $TWISTER_HOME/samples/kmeans/bin/.

1. Check if Twister-Kmeans-0.9.jar exists

From now on, you will need two command prompts open: one for the Twister Driver under $TWISTER_HOME/bin, and another for the Kmeans application under $TWISTER_HOME/samples/kmeans/bin. Detailed instructions for running Twister Kmeans can also be found in $TWISTER_HOME/samples/kmeans/README.txt.

Assume that you started Twister with $TWISTER_HOME=~/twister-0.9, that the driver/master is running on i55, and that the ActiveMQ broker is running on i56. First, check that the Kmeans executable $TWISTER_HOME/apps/Twister-Kmeans-0.9.jar exists. If it cannot be found, run ant in the Twister Kmeans examples directory $TWISTER_HOME/samples/kmeans/

[taklwu@i55 bin]$ ls -l $TWISTER_HOME/apps
total 20
-rw------- 1 taklwu users 13876 Oct 21 13:24 Twister-Kmeans-0.9.jar

2. Prepare input data directory across worker nodes

Before the input data set is generated, we need to create a common directory on all nodes in order to store the data locally. On the Twister Driver command prompt under $TWISTER_HOME/bin, create the kmeans directory using the Twister driver script ./ mkdir kmeans

[taklwu@i55 bin]$ ./ mkdir kmeans
i55:/N/u/taklwu/twister-0.9/data/kmeans created.
mkdir: cannot create directory `/N/u/taklwu/twister-0.9/data/kmeans': File exists
i56:/N/u/taklwu/twister-0.9/data/kmeans created.

The warning "mkdir: cannot create directory `/N/u/taklwu/twister-0.9/data/kmeans': File exists" is shown because we are using a shared file system on the HPC nodes, so the directory already exists when the second node tries to create it. It will not be shown if we run Twister Kmeans on Eucalyptus.

3. Generate Kmeans input data

After creating the kmeans directories on all worker nodes, go to the Kmeans application command prompt under $TWISTER_HOME/samples/kmeans/bin and run the script that generates the Kmeans input data set and distributes it to all worker nodes.

Script usage: ./  [init clusters file][num of clusters][vector length][sub dir][data file prefix][number of files to generate][number of data points]
e.g. ./ init_clusters.txt 2 3 /kmeans km_data 80 80000

[taklwu@i55 bin]$ ./ init_clusters.txt 2 3 /kmeans km_data 80 80000

kmeans args.len:7
JobID: kmeans-data-gen2b1ce8c3-fe6d-11e0-9a94-e966ca4f6cd6
Oct 24, 2011 2:22:45 PM org.apache.activemq.transport.failover.FailoverTransport doReconnect
INFO: Successfully connected to tcp://i56:61616
0    [main] INFO  cgl.imr.client.TwisterDriver  - MapReduce computation termintated gracefully.
9    [Thread-1] DEBUG cgl.imr.client.ShutdownHook  - Shutting down completed.

Here "sub dir" refers to the directory where you want the data files to be saved remotely. This is a sub directory under data_dir of all the nodes.
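
To make the generator arguments more concrete, here is a rough local sketch of what such a step could look like. Everything in it is an assumption for illustration: the function name, the one-point-per-line text layout, the prefix_N.txt file naming, the value range, and the reading of "number of data points" as a total that is split evenly over the files. The real Twister generator may behave differently, and it also distributes the files to the worker nodes, which this sketch does not do.

```python
import random
from pathlib import Path

def generate_kmeans_data(out_dir, prefix, num_files, num_points,
                         vector_length=3, num_clusters=2, seed=0):
    """Write num_points random vectors, split evenly over num_files text files.

    Hypothetical layout: one point per line, coordinates separated by spaces.
    Returns the written paths and a list of random initial cluster centers.
    """
    rng = random.Random(seed)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    per_file, extra = divmod(num_points, num_files)
    paths = []
    for i in range(num_files):
        count = per_file + (1 if i < extra else 0)  # spread the remainder
        path = out / f"{prefix}_{i}.txt"
        with path.open("w") as fh:
            for _ in range(count):
                coords = (f"{rng.uniform(0, 500):.4f}" for _ in range(vector_length))
                fh.write(" ".join(coords) + "\n")
        paths.append(path)
    # Initial centers: num_clusters random vectors in the same (arbitrary) range.
    init = [[rng.uniform(0, 500) for _ in range(vector_length)]
            for _ in range(num_clusters)]
    return paths, init
```

With the example arguments above, a call such as generate_kmeans_data("data/kmeans", "km_data", 80, 80000) would write 80 files of 1000 points each under this sketch's assumptions.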

4. Create Partition File

Whether you generate the data using the method above or manually, you need to create a partition file to run the application. On the Twister Driver command prompt under $TWISTER_HOME/bin, run the script to create the partition file (file location metadata) according to the input data set distribution.

Script usage: ./ [common directory][file filter][partition file]
e.g. ./ /kmeans km_data

[taklwu@i55 bin]$ ./ /kmeans km_data
Oct 24, 2011 2:28:50 PM org.apache.activemq.transport.failover.FailoverTransport doReconnect
INFO: Successfully connected to tcp://i56:61616
Partition file created.
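
Conceptually, the partition file records which data file lives on which node so that each map task can be placed next to its input. The sketch below illustrates that idea only. The function name and the (map task number, host, path) tuple layout are hypothetical and do not describe Twister's actual partition file format.

```python
def build_partition_entries(files_by_host, common_dir, file_filter):
    """Return (map_task_no, host, path) tuples for data files matching the filter.

    files_by_host maps a host name to the file paths stored on that node.
    The tuple layout is hypothetical, not Twister's partition file format.
    """
    entries = []
    task_no = 0
    for host in sorted(files_by_host):
        for path in sorted(files_by_host[host]):
            directory, _, name = path.rpartition("/")
            # Keep only files inside the common directory whose name matches the filter.
            if directory.endswith(common_dir) and name.startswith(file_filter):
                entries.append((task_no, host, path))
                task_no += 1
    return entries
```

Here common_dir and file_filter play the role of the first two script arguments (for example /kmeans and km_data), and every matching file yields one entry that a map task could be configured from.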

5. Run Kmeans Clustering

Once the above steps have succeeded, you can run the following shell script to start the Kmeans clustering application. On the Kmeans application command prompt under $TWISTER_HOME/samples/kmeans/bin, run the application script to submit the job to the Twister Driver/Master daemon.

Script usage: ./ [init clusters file][number of map tasks][partition file]
e.g. ./ init_clusters.txt 80 $TWISTER_HOME/bin/

[taklwu@i55 bin]$ ./ init_clusters.txt 80 $TWISTER_HOME/bin/
JobID: kmeans-map-reduce52c1b91f-fe6e-11e0-9e5d-3fed4ed93ecd
Oct 24, 2011 2:31:01 PM org.apache.activemq.transport.failover.FailoverTransport doReconnect
INFO: Successfully connected to tcp://i56:61616
0    [main] INFO  cgl.imr.client.TwisterDriver  - Configure Mappers through the partition file, please wait....
4226 [main] INFO  cgl.imr.client.TwisterDriver  - Configuring Mappers through the partition file is completed.
252.4857784991884 , 373.4822574603571 , 245.93135222874267 ,
244.99837316981603 , 123.22713052183707 , 252.94566387185583 ,
Total Time for kemeans : 6.487
Total loop count : 7
5891 [main] INFO  cgl.imr.client.TwisterDriver  - MapReduce computation termintated gracefully.
Kmeans clustering took 6.502 seconds.
5893 [Thread-1] DEBUG cgl.imr.client.ShutdownHook  - Shutting down completed.
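
The two comma-separated rows of numbers in the output above appear to be the final cluster centers (two clusters, vector length 3); this reading is an assumption based on the arguments used in this tutorial. Under that assumption, a small helper such as the hypothetical parse_centers below could pull the centers out of captured console output for further processing.

```python
def parse_centers(output_lines):
    """Extract rows of floats from lines such as '252.48 , 373.48 , 245.93 ,'."""
    centers = []
    for line in output_lines:
        fields = [f.strip() for f in line.split(",")]
        fields = [f for f in fields if f]  # drop the empty field after a trailing comma
        if len(fields) < 2:
            continue  # no comma-separated values on this line
        try:
            centers.append([float(f) for f in fields])
        except ValueError:
            continue  # not a numeric row (log message, timestamp, ...)
    return centers
```

Log lines, timestamps, and the timing summary are skipped because they do not consist solely of comma-separated numbers.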