Analyzing Large-scale Cancer Genomics Sequencing Data with Next Generation Sequencing (NGS) Data Analysis Tools in Hybrid Cloud

Project Information

Discipline
Biology (603) 
Orientation
Research 
Abstract

I, together with future members of this project, will use the FutureGrid computing resource to apply NGS data analysis tools within the MapReduce framework of the recently published HugeSeq computational pipeline (http://www.nature.com/nbt/journal/v30/n3/full/nbt.2134.html) to analyze large-scale cancer genomics sequencing data* in a hybrid cloud. The goal is to detect and annotate SNPs (single nucleotide polymorphisms), Indels (insertions or deletions), SVs (structural variations such as translocations and inversions) and CNVs (copy number variants) in these data. The project will use Virtual Machines (VMs) and hybrid cloud computing to accelerate NGS data analyses, both to keep pace with the speed at which NGS data are generated and to leverage new advances in genomics and personalized (sometimes called “individualized” or, the preferred term, “precision”) medicine.
 
*E.g., exome and whole-genome sequence data publicly available from the Sequence Read Archive (http://blogs.nature.com/news/2011/10/sequencing_data_archive_resurf.html), the Cancer Genome Atlas (http://cancergenome.nih.gov/), the new Cancer Genomics Hub at UCSC (https://cghub.ucsc.edu/), the Catalogue of Somatic Mutations in Cancer database (http://www.sanger.ac.uk/resources/databases/cosmic.html), the Pediatric Cancer Genome Project (http://www.stjude.org/whole-genome-data), and the International Cancer Genome Consortium (http://www.sanger.ac.uk/research/areas/humangenetics/icgc.html).

Intellectual Merit

This project will use the resources provided by FutureGrid to explore, test and analyze different NGS data analysis tools for detection and annotation of all types of genetic variations (e.g. SNPs, Indels, SVs and CNVs) in large-scale cancer genomics sequencing data, using the MapReduce framework of the HugeSeq computational pipeline. The project will also exploit Virtual Machines and hybrid cloud computing to develop a portable, stand-alone, hybrid cloud-enabled VM software package (bundled with the aforementioned NGS tools in the HugeSeq MapReduce framework) that lets researchers in genomic medicine run computationally intensive NGS data analyses easily in a hybrid cloud. The hybrid model keeps sensitive data in the private cloud while providing (especially to those without extensive bioinformatics resources) the scalability, computational resources and cost-effectiveness of the public cloud.
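
As an illustration only of the map/reduce pattern referred to above, the following minimal sketch groups variant records by chromosome and counts each variant type within each group. It is not HugeSeq code: the record layout, the sample values and all function names are assumptions made for this example.

```python
from collections import Counter, defaultdict

# Hypothetical variant records as (chromosome, variant_type) pairs.
# These values are made up for illustration; they are not real data.
RECORDS = [
    ("chr1", "SNP"),
    ("chr1", "Indel"),
    ("chr2", "SNP"),
    ("chr1", "SNP"),
    ("chr2", "CNV"),
    ("chr17", "SV"),
]


def map_phase(records):
    """Emit one (chromosome, variant_type) key/value pair per record."""
    for chrom, vtype in records:
        yield chrom, vtype


def shuffle(pairs):
    """Group emitted values by key, as happens between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Count variant types independently within each chromosome group."""
    return {chrom: dict(Counter(vtypes)) for chrom, vtypes in groups.items()}


def count_variants(records):
    """Run the three phases in sequence on an in-memory list of records."""
    return reduce_phase(shuffle(map_phase(records)))


if __name__ == "__main__":
    for chrom, counts in sorted(count_variants(RECORDS).items()):
        print(chrom, counts)
```

Because each group is reduced independently of the others, the groups could in principle be processed on separate VMs, which is the property that makes this pattern attractive for cloud execution.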

Broader Impacts

The proposed portable hybrid cloud-enabled VM package developed in this project will be available (1) for download as open source software through SourceForge, (2) to researchers in medical genomics and the NGS data analysis research community at large, for analyzing large-scale sequence data in a hybrid cloud to detect and annotate all types of genetic variations (SNPs, Indels, SVs and CNVs) in genomic sequences, and (3) for education, training and outreach on NGS data analyses in the cloud through online tutorials, online classes freely accessible worldwide through Coursera (https://www.coursera.org/), webinars, and workshops.

Project Contact

Project Lead
Linda McMahan (mcmahan) 
Project Manager
Linda McMahan (mcmahan) 

Resource Requirements

Hardware Systems
  • alamo (Dell optiplex at TACC)
  • hotel (IBM iDataPlex at U Chicago)
  • india (IBM iDataPlex at IU)
  • sierra (IBM iDataPlex at SDSC)
 
Use of FutureGrid

The plan is to store small-scale cancer genomics sequence data on FutureGrid and to create and test a VM there, in order to test and apply NGS analysis tools using the MapReduce framework of the HugeSeq computational pipeline in a hybrid cloud. The VM will then be made available to researchers in cancer genomics, genomic medicine and the NGS data analysis research community at large.

Scale of Use

A few VMs will be used to test and process small-scale cancer genomics sequence data. The running time of the VMs will depend on the NGS analysis tools being tested in the HugeSeq MapReduce framework. If possible, I would like long-term use of the service to store small-scale sequence data and to continue exploring, testing and analyzing different NGS data analysis tools for detection and annotation of all types of genetic variations in genomic sequence data.

Project Timeline

Submitted
08/28/2012 - 14:47