IPython pipelines for training life sciences researchers on NGS data analysis

Project Information

Discipline: Biology (603)
Subdiscipline: 26.0402 Molecular Biology
Orientation: Research

Abstract

Genomes and transcriptomes encapsulate the story of living things in the form of nucleic acid sequences. Petabytes of sequence data are now publicly available, the product of recent advances in DNA sequencing technology (next-generation sequencing, NGS). To exploit these resources effectively, research scientists, clinicians and students in the life sciences need to become familiar with (1) the Linux command line, (2) at least one scripting language (e.g., Perl or Python) and (3) high-performance computing. This project uses NGS data, high-performance computing and IPython running on Linux virtual machines (VMs) to elucidate molecular mechanisms of epigenetic gene regulation, RNA polymerase function and RNA-directed DNA methylation (RdDM) in the model organism Arabidopsis. Over the course of the project (~1 year), postdoctoral researchers and graduate students will be trained to use standard bioinformatic tools for NGS data analysis, including bowtie, samtools, R packages and the scientific Python libraries. We are developing Python scripts that chain these tools into pipelines for genetic mapping of single nucleotide polymorphisms (SNPs) and analysis of RNA sequencing (RNA-seq) datasets. We will use Linux VMs on FutureGrid to tackle important research questions in the field of RdDM while developing training materials (Git repositories and IPython notebooks) for teaching advanced undergraduates, graduate students and postdocs to use command-line tools for short-read alignment and analysis.
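As an illustration of what "stringing tools together" can look like in a notebook cell, the following is a minimal sketch and not the project's actual pipeline. It only builds the command lines for one alignment step and prints them. The function names, file names and directory layout are hypothetical. The flags assume bowtie version 1 (`-S` for SAM output, `-p` for threads) and samtools 1.x (`sort -o`), and should be checked against the installed versions.

```python
import shlex
import subprocess
from pathlib import Path


def build_alignment_commands(index_prefix, fastq, out_dir, threads=4):
    """Return command lines that align one FASTQ file, then sort and index the result.

    Nothing is executed here; all paths are hypothetical placeholders.
    """
    out_dir = Path(out_dir)
    sample = Path(fastq).name.split(".")[0]  # "col0_rep1.fastq" -> "col0_rep1"
    sam = out_dir / f"{sample}.sam"
    bam = out_dir / f"{sample}.sorted.bam"
    return [
        # bowtie (version 1): -S writes SAM output, -p sets the thread count
        ["bowtie", "-S", "-p", str(threads), str(index_prefix), str(fastq), str(sam)],
        # coordinate-sort the alignments and write them as BAM (samtools 1.x syntax)
        ["samtools", "sort", "-o", str(bam), str(sam)],
        # index the sorted BAM so downstream tools can access regions directly
        ["samtools", "index", str(bam)],
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    cmds = build_alignment_commands("indexes/tair10", "reads/col0_rep1.fastq", "aligned")
    run_pipeline(cmds, dry_run=True)  # prints the three commands without running them
```

Keeping command construction separate from execution lets trainees inspect each step in the notebook before anything runs on the VM, and the same list of commands can be logged alongside the results for reproducibility.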

Intellectual Merit

Our future publications in the fields of epigenetic gene regulation and RdDM will link to IPython notebooks that document the custom VM images and pipelines used in bioinformatic aspects of our work. This project will develop highly reproducible approaches for NGS data analysis using cloud infrastructure.

Broader Impacts

Git repositories and IPython notebooks created by this project will be made publicly available for use in training life sciences researchers and students on the Linux command line and NGS data analysis. Once these training materials have been tested on a small scale at Indiana University, they can be adapted for Software Carpentry bootcamps in the United States and abroad.

Project Contact

Project Lead: Todd Blevins (toddblev)
Project Manager: Todd Blevins (toddblev)

Resource Requirements

Hardware Systems

alamo (Dell OptiPlex at TACC)
hotel (IBM iDataPlex at U Chicago)
india (IBM iDataPlex at IU)
sierra (IBM iDataPlex at SDSC)
xray (Cray XM5 at IU)

Use of FutureGrid

The project will provision several xlarge VMs in the OpenStack environment for one- to two-week intervals. Each VM will carry custom-installed bioinformatic tools and provide remote access to NGS analysis pipelines for up to six student and postdoc users.

Scale of Use

The project will use up to four VMs for bioinformatic analyses and student/postdoc training, as well as cycles on the india, sierra or xray HPC clusters as available.

Project Timeline

Submitted: 07/14/2013 - 18:41
