CloVR - Cloud Virtual Resource for Automated Sequence Analysis From Your Desktop

Project Information

Discipline
Biology (603) 
Orientation
Research 
Abstract

Background: Second-generation sequencing platforms (e.g. 454, Illumina, SOLiD) have recently made genomic tools affordable and increased their popularity in the broader research community. However, demands on computational resources and the lack of standardized analysis tools increasingly represent a bottleneck in the bioinformatics analysis of large-scale sequence data. Results: Here, we present the Cloud Virtual Resource (CloVR) software package, which takes advantage of two technologies, Virtual Machines and Cloud computing, to provide a new community resource for sequence analysis that is suitable for large-scale sequencing projects. CloVR is available as an Open Source virtual machine at http://clovr.org and bundles pre-installed and pre-configured bioinformatics tools into automated pipelines. With the CloVR virtual machine, users can run supported pipelines on their local computers or use scalable on-demand Cloud computing services to perform CPU-intensive tasks over the Internet, without having to install additional software. To support a large variety of sequencing projects, the CloVR virtual machine is composed of separate sequence analysis tracks. Each track within CloVR comprises the entire suite of Open Source software tools necessary to support a fully automated analysis as required in a typical genomics project. Currently supported applications include BLAST search (CloVR-Search), single microbial whole-genome shotgun (WGS) assembly and annotation (CloVR-Microbe), metagenomic WGS assembly, gene prediction and BLAST comparison (CloVR-Metagenomics), and 16S phylogeny (CloVR-16S). CloVR currently supports VMware for local execution and, for Cloud execution, the commercial Amazon EC2 Cloud (http://aws.amazon.com/ec2/) and the free academic Nimbus Science Clouds (http://www.scienceclouds.org/).
Conclusion: CloVR is a genomics tool that enables any researcher with a sequencing machine and an Internet connection to perform complex and computationally demanding sequence analysis and join the genomic revolution.

Intellectual Merit

Less than 15 years after the first complete sequencing of a bacterial genome, sequence analysis has become an integral part of nearly all research areas in biology. Sequencing expenses have recently dropped sharply thanks to affordable second-generation sequencing technology, leading to the establishment of an increasing number of small genome sequencing facilities. Despite the increased rate of sequence generation, there has not been a commensurate increase in access to the computational resources needed for high-quality sequence processing and analysis. In particular, some investigators new to the field who are now obtaining next-generation sequencing platforms may be insufficiently prepared to take full advantage of their own high-throughput sequencing devices. This proposal intends to close this technological gap by making state-of-the-art sequence analysis software more accessible to researchers without extensive bioinformatics resources. The proposal describes the development of a portable, stand-alone software package, using Virtual Machines (VM), that will incorporate readily available, open source tools for genome analysis. The VM design will provide two main advantages, allowing users to 1) circumvent complex software installations and 2) avoid performance bottlenecks of local computing networks. First, the VM package will include fully operational bioinformatics pipelines within a single executable file that is compatible with all computer operating systems and makes further software installations unnecessary. Second, because the VM is portable, the processing of large sequence data sets can be outsourced to large distributed computing networks called compute clouds.
The analysis protocols provided in the VM package will replicate and extend established bioinformatics protocols and include tools for whole genome and metagenome annotation and comparative analysis, including sequence assembly, gene prediction, functional annotation, metabolic pathway reconstruction, and phylogenetic classification. The availability of the proposed open source and cloud-enabled VM package will make microbial genome sequencing more usable for a broad user community.

Broader Impacts

CloVR is a two-year-old project, funded by NSF and NIH, to simplify and automate large-scale bioinformatics for sequence analysis. A beta release of the VM is already available, along with a set of beta testers, and we plan a full stable release in the next few months. The grant described the broader impacts as follows:

1) Release of portable VM tool package. The proposed VM package will be made available as a work in progress with at least two trial and four production releases during the funded period. It will be available for download as an open source software tool through the project webpage.

2) Education, training and outreach. The aim of this proposal is to increase the availability of state-of-the-art microbial sequence analysis tools to small research groups. The VM package will be extensively advertised and documented through publications, conference presentations, the project website and an online blog. In addition, an online seminar ("webinar") will be offered to teach the basics of microbial genome analysis using a test set of sequence data distributed together with the VM package. The VM package and accompanying online classes will be developed in close dialogue with the scientific community, whose members will have the opportunity to subscribe to the online blog and its associated discussion forum.

3) De-centralized microbial sequence analysis. Genome analysis can provide significant benefits to many areas of microbial research. The release of next-generation sequencing technologies promotes a new model of affordable, de-centralized microbial sequence analysis with benefits for the entire scientific community. The proposed portable, open source, microbial sequence analysis package will contribute to the success of this model.

Project Contact

Project Lead
Samuel Angiuoli (angiuoli) 
Project Manager
Samuel Angiuoli (angiuoli) 
Project Members
Saliya Ekanayake  

Resource Requirements

Hardware System
  • I don't care (what I really need is a software environment and I don't care where it runs)
 
Use of FutureGrid

We plan to install and test the CloVR VM on FutureGrid and make it available to project collaborators and the community at large. We have done extensive testing on Amazon EC2 and on two free academic clouds: DIAG (http://diagcomputing.org/), which uses Nimbus, and the now defunct Magellan (http://magellan.alcf.anl.gov/), which used Eucalyptus. We are eager to add FutureGrid to this list.

Scale of Use

A dozen or so VMs, several times a week, to test and process experimental data.

Project Timeline

Submitted
08/24/2011 - 11:24