I399 Bioinformatics and Cyberinfrastructure project - 1000 Genomes protein analysis
Project Information
- Discipline
- Genetics (610)
- Subdiscipline
- 11.07 Computer Science
- Orientation
- Research
Recent improvements in sequencing technology ("next-gen" sequencing platforms) have sharply reduced the cost of sequencing. The 1000 Genomes Project is the first project to sequence the genomes of a large number of people in order to provide a comprehensive resource on human genetic variation. Its goal is to find most genetic variants that have frequencies of at least 1% in the populations studied. While recent work has addressed sequence alignment and nucleotide matching, there is still a strong need for protein sequencing and comparison across the 697 currently sequenced datasets. This project will examine protein synthesis at a low level in order to identify differences between members of the population, which may lead to a better understanding of how proteins differ between individuals.
Intellectual Merit
This project will study the protein biosynthesis processes across the 697 currently sequenced members of the 1000 Genomes Project [http://www.1000genomes.org/]. This work will identify the computational pipeline needed for a direct comparison between protein peptide sequences within the currently sequenced data. The pipeline will use current cloud computing technologies to streamline development and minimize computation time while supporting a wide variety of workloads.
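As a rough illustration of what a direct comparison between peptide sequences could involve, the following minimal Python sketch translates two coding DNA sequences with the standard genetic code and lists the positions where the resulting peptides differ. It is not the project's pipeline: the function names `translate` and `peptide_differences` and the short example sequences are hypothetical, and a real comparison would work from aligned, annotated coding regions rather than raw strings.

```python
from itertools import product, zip_longest

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): AMINO_ACIDS[i]
    for i, codon in enumerate(product(BASES, repeat=3))
}


def translate(dna):
    """Translate a coding DNA sequence into a peptide, stopping at a stop codon."""
    dna = dna.upper()
    peptide = []
    # Walk the sequence one complete codon (three bases) at a time.
    for i in range(0, len(dna) - len(dna) % 3, 3):
        residue = CODON_TABLE.get(dna[i:i + 3], "X")  # "X" marks an unknown codon
        if residue == "*":  # stop codon ends translation
            break
        peptide.append(residue)
    return "".join(peptide)


def peptide_differences(first, second):
    """Return (1-based position, residue in first, residue in second) for each mismatch."""
    return [
        (pos, a, b)
        for pos, (a, b) in enumerate(zip_longest(first, second, fillvalue="-"), start=1)
        if a != b
    ]


# Hypothetical example: two individuals differing by one nucleotide in one codon.
individual_a = translate("ATGGCCTAA")
individual_b = translate("ATGGTCTAA")
print(individual_a, individual_b, peptide_differences(individual_a, individual_b))
```

A position-by-position comparison like this only makes sense for sequences that are already aligned; insertions or deletions would need a proper alignment step first.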
Broader Impacts
This knowledge discovery process and pipeline may be of great interest to the 1000 Genomes Project members who are currently looking to employ these processes en masse. Furthermore, undergraduate informatics students will be immersed in the project from the beginning, giving them an in-depth view of the research process of knowledge discovery.
Project Contact
- Project Lead
- Andrew Younge (ajyounge)
- Project Manager
- Andrew Younge (ajyounge)
- Project Members
- Fuxiao Xin, Alex Wu, Ryan Konz, Tony Gao, Blaine Rothrock, Kym Pagel, Zixuan Li, Jonathan Gold
Resource Requirements
- Hardware Systems
- alamo (Dell optiplex at TACC)
- hotel (IBM iDataPlex at U Chicago)
- india (IBM iDataPlex at IU)
- sierra (IBM iDataPlex at SDSC)
We intend to create and deploy many virtual machines to carry out the protein peptide biosynthesis analysis, and to use one or more storage devices to hold the large datasets in-house, either in a SAN or in S3 or EBS storage.
Scale of Use
Initially the scale of use will be minimal, as the goal of the I399 project is only to outline this process and perform some elementary analysis. If demand for this system increases, computation may need to ramp up drastically.
Project Timeline
- Submitted
- 02/01/2011 - 07:45