IPython pipelines for training life sciences researchers on NGS data analysis

Project Information

Discipline
Biology (603) 
Subdiscipline
26.0402 Molecular Biology 
Orientation
Research 
Abstract

Genomes and transcriptomes encapsulate the story of living things in the form of nucleic acid sequences. Petabytes of sequence data are now publicly available, the product of recent advances in DNA sequencing technology (Next-Generation Sequencing, NGS). To exploit these resources effectively, research scientists, clinicians and students in the life sciences need to become familiar with (1) the Linux command line, (2) at least one scripting language (e.g., Perl or Python) and (3) high-performance computing. This project uses NGS data, high-performance computing and IPython running on Linux virtual machines (VMs) to elucidate molecular mechanisms of epigenetic gene regulation, RNA polymerase function and RNA-directed DNA methylation (RdDM) in the model organism Arabidopsis. Over the course of the project (~1 year), postdoctoral researchers and graduate students will be trained to use standard bioinformatic tools for NGS data analysis, including bowtie, samtools, R packages and scientific Python. We are developing Python scripts that string these tools together into pipelines for genetic mapping of single nucleotide polymorphisms (SNPs) and analysis of RNA sequencing (RNA-seq) datasets. We will use Linux VMs on FutureGrid to tackle important research questions in the field of RdDM, while developing training materials (Git repositories and IPython notebooks) for teaching advanced undergraduate students, graduate students and postdocs to use command line tools for short read alignment and analysis.
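
As an illustration of what "stringing these tools together" can look like, the following is a minimal sketch of a Python wrapper that chains bowtie and samtools for one single-end library. It is not the project's actual pipeline. The function names, the index prefix "tair10" and the file names are hypothetical, and the samtools commands assume the samtools 1.x command line syntax (older releases take different arguments for "sort").

```python
import shlex
import subprocess
from pathlib import Path


def build_alignment_commands(index_prefix, fastq, out_dir, threads=4):
    """Return the ordered commands for aligning one single-end library.

    All names here are illustrative; nothing is executed by this function.
    """
    out_dir = Path(out_dir)
    sample = Path(fastq).name.split(".")[0]  # "col0" from "col0.fastq"
    sam = out_dir / f"{sample}.sam"
    bam = out_dir / f"{sample}.bam"
    sorted_bam = out_dir / f"{sample}.sorted.bam"
    return [
        # bowtie (version 1): -S writes SAM output, -p sets the thread count
        ["bowtie", "-S", "-p", str(threads),
         str(index_prefix), str(fastq), str(sam)],
        # convert SAM to BAM
        ["samtools", "view", "-b", "-o", str(bam), str(sam)],
        # coordinate-sort the alignments (samtools 1.x syntax assumed)
        ["samtools", "sort", "-o", str(sorted_bam), str(bam)],
        # index the sorted BAM for random access
        ["samtools", "index", str(sorted_bam)],
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command; run it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical inputs: a bowtie index named "tair10" and reads in col0.fastq
    cmds = build_alignment_commands("tair10", "col0.fastq", "results")
    run_pipeline(cmds, dry_run=True)
```

Building the commands as lists before running them keeps the pipeline easy to inspect in an IPython notebook: a trainee can print the exact command lines in a dry run, compare them with what they would type at the shell, and only then execute them.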

Intellectual Merit

Our future publications in the fields of epigenetic gene regulation and RdDM will link to IPython notebooks that document the custom VM images and pipelines used in the bioinformatic aspects of our work. This project will develop highly reproducible approaches for NGS data analysis using cloud infrastructure.

Broader Impacts

Git repositories and IPython notebooks created by this project will be made publicly available for use in training life sciences researchers and students on the Linux command line and NGS data analysis. Once these training materials have been tested on a small scale at Indiana University, they can be adapted for Software Carpentry bootcamps in the United States and abroad.

Project Contact

Project Lead
Todd Blevins (toddblev) 
Project Manager
Todd Blevins (toddblev) 

Resource Requirements

Hardware Systems
  • alamo (Dell PowerEdge at TACC)
  • hotel (IBM iDataPlex at U Chicago)
  • india (IBM iDataPlex at IU)
  • sierra (IBM iDataPlex at SDSC)
  • xray (Cray XT5m at IU)
 
Use of FutureGrid

The project will provision several xlarge VMs in the OpenStack environment for 1-2 week intervals. Custom-installed bioinformatic tools on these VMs will give up to six student and postdoc users remote access to NGS analysis pipelines.

Scale of Use

Up to four VMs for bioinformatic analyses and student/postdoc training, as well as cycles on the india, sierra or xray HPC clusters as available.

Project Timeline

Submitted
07/14/2013 - 18:41