Hadoop Blast

Number:
Author: Tak-Lon Stephen Wu
Improvements: 
Version: 0.1
Date: 2011-11-01

Hadoop Blast

BLAST (Basic Local Alignment Search Tool) is one of the most widely used bioinformatics applications. It is written in C++, and the version we are using is v2.2.23. Hadoop Blast is a Hadoop program that helps Blast, a bioinformatics application, utilize the computing capability of Hadoop. The database used in the following settings is a 241 MB subset of the non-redundant (nr) protein sequence database, which is 8.5 GB in full.

You can download the Hadoop Blast source code and the customized Blast program and database archive (BlastProgramAndDB.tar.gz) from the Big Data for Science tutorial.

Acknowledgement

This page was originally designed by the SalsaHPC group for the Big Data for Science Workshop; you can see the original pages here.

Requirements

  1. Log in to a FutureGrid cluster and obtain compute nodes (HPC / Eucalyptus).
  2. Start SalsaHadoop/Hadoop on the compute nodes (see the SalsaHadoop Tutorial; a minimal start-up sketch follows this list).
  3. Download and unzip the Hadoop Blast source code from the Big Data for Science tutorial.
  4. Download the customized Blast binary and database archive BlastProgramAndDB.tar.gz.
  5. Linux command-line experience.
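
For step 2, here is a minimal start-up sketch, assuming Hadoop is already configured on the allocated nodes as described in the SalsaHadoop Tutorial (host names and paths are illustrative):

[taklwu@i55 hadoop-0.20.203.0]$ bin/start-all.sh
[taklwu@i55 hadoop-0.20.203.0]$ bin/hadoop dfsadmin -report    # confirm the DataNodes have registered before continuing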

1. Download Hadoop Blast under $HADOOP_HOME

Assuming you start SalsaHadoop/Hadoop with $HADOOP_HOME=~/hadoop-0.20.203.0 and the master node is running on i55, download the Hadoop Blast source code and the customized Blast program and database archive (BlastProgramAndDB.tar.gz) from the Big Data for Science tutorial into $HADOOP_HOME.
[taklwu@i55 ~]$ cd $HADOOP_HOME
[taklwu@i55 hadoop-0.20.203.0]$ wget http://salsahpc.indiana.edu/tutorial/source_code/Hadoop-Blast.zip
[taklwu@i55 hadoop-0.20.203.0]$ wget http://salsahpc.indiana.edu/tutorial/apps/BlastProgramAndDB.tar.gz
[taklwu@i55 hadoop-0.20.203.0]$ unzip Hadoop-Blast.zip
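
Optionally, you can sanity-check the downloads before going on; the following commands only list the unpacked source directory and the first few entries of the database archive:

[taklwu@i55 hadoop-0.20.203.0]$ ls Hadoop-Blast
[taklwu@i55 hadoop-0.20.203.0]$ tar -tzf BlastProgramAndDB.tar.gz | head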

2. Prepare Hadoop Blast

Assuming the program is already stored in $HADOOP_HOME/Hadoop-Blast, we need to copy the input files and the Blast program and database archive (BlastProgramAndDB.tar.gz) onto HDFS.

[taklwu@i55 hadoop-0.20.203.0]$ bin/hadoop fs -put $HADOOP_HOME/Hadoop-Blast/blast_input HDFS_blast_input
[taklwu@i55 hadoop-0.20.203.0]$ bin/hadoop fs -ls HDFS_blast_input
[taklwu@i55 hadoop-0.20.203.0]$ bin/hadoop fs -copyFromLocal $HADOOP_HOME/BlastProgramAndDB.tar.gz BlastProgramAndDB.tar.gz
[taklwu@i55 hadoop-0.20.203.0]$ bin/hadoop fs -ls BlastProgramAndDB.tar.gz
  • Line 1 pushes all the Blast input files (FASTA-formatted queries) from local disk onto the HDFS directory "HDFS_blast_input".
  • Line 2 lists the pushed files in the HDFS directory "HDFS_blast_input".
  • Line 3 copies the Blast program and database archive (BlastProgramAndDB.tar.gz) from $HADOOP_HOME onto HDFS; it will be used later as a distributed cache.
  • Line 4 double-checks the pushed Blast program and database archive "BlastProgramAndDB.tar.gz" on HDFS.
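
If you want to confirm that everything was copied completely, you can also compare the sizes reported on HDFS with the local files (optional):

[taklwu@i55 hadoop-0.20.203.0]$ bin/hadoop fs -du HDFS_blast_input
[taklwu@i55 hadoop-0.20.203.0]$ bin/hadoop fs -du BlastProgramAndDB.tar.gz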

3. Execute Hadoop-Blast

After deploying those required files onto HDFS, run the Hadoop Blast program with the following commands:

[taklwu@i55 hadoop-0.20.203.0]$ bin/hadoop jar $HADOOP_HOME/Hadoop-Blast/executable/blast-hadoop.jar BlastProgramAndDB.tar.gz \
 bin/blastx /tmp/hadoop-taklwu-test/ db nr HDFS_blast_input HDFS_blast_output '-query #_INPUTFILE_# -outfmt 6 -seg no -out #_OUTPUTFILE_#'

Here is the description of the above command:

bin/hadoop jar Executable BlastProgramAndDB_on_HDFS bin/blastx Local_Work_DIR db nr HDFS_Input_DIR Unique_HDFS_Output_DIR '-query #_INPUTFILE_# -outfmt 6 -seg no -out #_OUTPUTFILE_#'

Parameter                   Description
Executable                  The full path of the Hadoop-Blast jar program, e.g. $HADOOP_HOME/Hadoop-Blast/executable/blast-hadoop.jar
BlastProgramAndDB_on_HDFS   The archive name of the Blast program and database on HDFS, e.g. BlastProgramAndDB.tar.gz
Local_Work_DIR              The local directory for storing temporary Blast output, e.g. /tmp/hadoop-test/
HDFS_Input_DIR              The HDFS directory where the input files are stored, e.g. HDFS_blast_input
Unique_HDFS_Output_DIR      An HDFS directory that does not exist yet, used for storing the output files, e.g. HDFS_blast_output
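
Conceptually, each map task unpacks the distributed-cache archive into Local_Work_DIR, substitutes one input file for #_INPUTFILE_# and a local output path for #_OUTPUTFILE_#, and then invokes the Blast binary. Roughly, a single map task ends up running something like the following local command; the file names, the unpacked path, and the appended -db db/nr flag are illustrative assumptions, not taken from the source:

# illustrative only: what one map task effectively executes for a single query file
/tmp/hadoop-taklwu-test/bin/blastx -query input_0.fa -outfmt 6 -seg no -out output_0 -db db/nr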

If Hadoop is running correctly, it will print job progress messages similar to the following:
11/11/01 19:31:08 INFO input.FileInputFormat: Total input paths to process : 16
11/11/01 19:31:08 INFO mapred.JobClient: Running job: job_201111021738_0002
11/11/01 19:31:09 INFO mapred.JobClient:  map 0% reduce 0%
11/11/01 19:31:31 INFO mapred.JobClient:  map 18% reduce 0%
11/11/01 19:31:34 INFO mapred.JobClient:  map 50% reduce 0%
11/11/01 19:31:53 INFO mapred.JobClient:  map 75% reduce 0%
11/11/01 19:32:04 INFO mapred.JobClient:  map 100% reduce 0%
...
Job Finished in 191.376 seconds
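
While the job is running, you can also query its progress from the command line; the job ID below is the one printed in the log above:

[taklwu@i55 hadoop-0.20.203.0]$ bin/hadoop job -list
[taklwu@i55 hadoop-0.20.203.0]$ bin/hadoop job -status job_201111021738_0002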

4. Monitoring Hadoop

We can also monitor the job status from i136 with lynx, a text-based browser, pointed at the Hadoop monitoring console. Assuming the Hadoop JobTracker is running on i55:9003:

[taklwu@i136 ~]$ lynx i55:9003
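
If you would rather use a graphical browser on your own workstation, one option is to forward the JobTracker web port over SSH; the user name and login host below are placeholders for your own FutureGrid account details:

yourlaptop$ ssh -L 9003:i55:9003 username@futuregrid-login-host
yourlaptop$ # then open http://localhost:9003 in a local browser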

In addition, all the outputs will be stored in the HDFS output directory (e.g. HDFS_blast_output).

[taklwu@i55 hadoop-0.20.203.0]$ bin/hadoop fs -ls HDFS_blast_output
[taklwu@i55 hadoop-0.20.203.0]$ bin/hadoop fs -cat HDFS_blast_output/*
BG3:2_30MNAAAXX:7:1:981:1318/1  gi|298916876|dbj|BAJ09735.1|    100.00  11      0       0       3       35      9       19      7.0     27.7
BG3:2_30MNAAAXX:7:1:981:1318/1  gi|298708397|emb|CBJ48460.1|    100.00  11      0       0       3       35      37      47      7.0     27.7
BG3:2_30MNAAAXX:7:1:981:1318/1  gi|298104210|gb|ADI54942.1|     100.00  11      0       0       3       35      11      21      7.0     27.7
BG3:2_30MNAAAXX:7:1:981:1318/1  gi|297746593|emb|CBM42053.1|    100.00  11      0       0       3       35      11      21      7.0     27.7
BG3:2_30MNAAAXX:7:1:981:1318/1  gi|297746591|emb|CBM42052.1|    100.00  11      0       0       3       35      11      21      7.0     27.7
BG3:2_30MNAAAXX:7:1:981:1318/1  gi|297746589|emb|CBM42051.1|    100.00  11      0       0       3       35      11      21      7.0     27.7
BG3:2_30MNAAAXX:7:1:981:1318/1  gi|297746587|emb|CBM42050.1|    100.00  11      0       0       3       35      11      21      7.0     27.7
BG3:2_30MNAAAXX:7:1:981:1318/1  gi|297746585|emb|CBM42049.1|    100.00  11      0       0       3       35      11      21      7.0     27.7
BG3:2_30MNAAAXX:7:1:981:1318/1  gi|297746583|emb|CBM42048.1|    100.00  11      0       0       3       35      11      21      7.0     27.7
BG3:2_30MNAAAXX:7:1:981:1318/1  gi|297746581|emb|CBM42047.1|    100.00  11      0       0       3       35      11      21      7.0     27.7

...
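
To keep a local copy of the results, you can pull the whole output directory out of HDFS; the local destination path here is up to you:

[taklwu@i55 hadoop-0.20.203.0]$ bin/hadoop fs -get HDFS_blast_output $HADOOP_HOME/blast_output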

5. Finishing the Map-Reduce process

After the job finishes, use the following command to stop the HDFS and Map-Reduce daemons:

[taklwu@i55 hadoop-0.20.203.0]$ bin/stop-all.sh
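
To verify that the daemons have actually stopped, you can list the remaining Java processes (jps is part of the JDK); no NameNode, DataNode, JobTracker, or TaskTracker entries should remain:

[taklwu@i55 hadoop-0.20.203.0]$ jps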