I399 Bioinformatics and Cyberinfrastructure project - 1000 Genomes protein analysis

Project Information

Discipline
Genetics (610) 
Subdiscipline
11.07 Computer Science 
Orientation
Research 
Abstract

Recent improvements in sequencing technology ("next-gen" sequencing platforms) have sharply reduced the cost of sequencing. The 1000 Genomes Project is the first project to sequence the genomes of a large number of people in order to provide a comprehensive resource on human genetic variation. Its goal is to find most genetic variants that have frequencies of at least 1% in the populations studied. While recent work has focused on sequence alignment and nucleotide matching, there is a strong need for protein sequencing and comparison across the 697 currently sequenced datasets. This project will examine protein synthesis at a low level in order to identify differences between members of the population, which may lead to a better understanding of how proteins differ between individuals.

Intellectual Merit

This project will study the protein biosynthesis processes across the 697 currently sequenced members of the 1000 Genomes Project [http://www.1000genomes.org/]. This work will identify the computational pipeline needed for a direct comparison of protein peptide sequences within the currently sequenced data. This pipeline will employ the latest cloud computing technologies to streamline development and minimize computation time while supporting a wide variety of workloads.
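
As a rough illustration of what a direct peptide comparison could involve, the following Python sketch translates two coding DNA sequences with the standard genetic code and lists the positions where the resulting peptides differ. It is not the project's pipeline. The function names (translate, peptide_differences) and the example sequences are hypothetical, and the sketch assumes that in-frame coding sequences have already been extracted for each individual and that they differ only by substitutions, not insertions or deletions.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate an in-frame coding DNA sequence, stopping at the first stop codon."""
    cds = cds.upper()
    peptide = []
    # Ignore any trailing bases that do not form a complete codon.
    for i in range(0, len(cds) - len(cds) % 3, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # "X" marks an unrecognized codon
        if aa == "*":
            break
        peptide.append(aa)
    return "".join(peptide)


def peptide_differences(reference, sample):
    """Return (1-based position, reference residue, sample residue) for each mismatch.

    Only positions present in both peptides are compared.
    """
    return [
        (pos, ref_aa, sample_aa)
        for pos, (ref_aa, sample_aa) in enumerate(zip(reference, sample), start=1)
        if ref_aa != sample_aa
    ]


# Hypothetical sequences: the sample carries a single T-to-C substitution.
reference_peptide = translate("ATGGAAGTGAAATAA")
sample_peptide = translate("ATGGAAGCGAAATAA")
print(reference_peptide, sample_peptide)
print(peptide_differences(reference_peptide, sample_peptide))
```

A position-by-position comparison like this breaks down once insertions or deletions shift the reading frame, so a real pipeline would likely need a proper alignment step before comparing peptides.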

Broader Impacts

This knowledge discovery process and pipeline may be of great interest to members of the 1000 Genomes Project who are currently looking to employ these processes en masse. Furthermore, undergraduate informatics students will be immersed in the project from the beginning, giving them an in-depth view of the research process of knowledge discovery.

Project Contact

Project Lead
Andrew Younge (ajyounge) 
Project Manager
Andrew Younge (ajyounge) 
Project Members
Fuxiao Xin, Alex Wu, Ryan Konz, Tony Gao, Blaine Rothrock, Kym Pagel, Zixuan Li, Jonathan Gold  

Resource Requirements

Hardware Systems
  • alamo (Dell optiplex at TACC)
  • hotel (IBM iDataPlex at U Chicago)
  • india (IBM iDataPlex at IU)
  • sierra (IBM iDataPlex at SDSC)
 
Use of FutureGrid

We intend to create and deploy a multitude of virtual machines to perform the protein peptide biosynthesis analysis, and to use one or more storage devices to hold the large datasets in-house, either in a SAN or in S3 or EBS storage.

Scale of Use

Initially the scale of use will be minimal, as the goal of the I399 project is only to outline this process and perform some elementary analysis. If demand for this system increases, computation may need to ramp up drastically.

Project Timeline

Submitted
02/01/2011 - 07:45