Modeling Biological Networks

IV.1 Coordinators
IV.2 Participants
IV.3 Introduction
IV.4 Background and Significance
IV.5 Research Plan
IV.6 Specific Subprojects

IV.7 Connection to Specific Projects 2 (cytoskeleton) and 3 (organogenesis)
IV.8 Timeline


IV.6.iii Subproject 3 - Regulatory Words in Eukaryotic Genomes:

This subproject attempts to combine the classical statistical approach to identifying significance with the search for non-exponential behavior that motivates network analysis.

IV.6.iii.a Introduction:

We are exploring the statistical structure of the Drosophila genome with the goal of identifying regulatory words and clusters of words.

One ambitious goal of biology is to learn to decipher the regulatory information in a genome. Just as we are now able to scan for regions that encode proteins, we would like to be able to "read" the information that specifies transcription factor binding and stage-specific gene expression. This goal is very far from realization. At present, we study particular DNA segments, using the sequence to identify candidate regulatory regions and the transcription factor families that probably bind to them. We use biochemistry to narrow and confirm the identifications, and genetic tests to learn which binding factors matter in any particular developmental setting.

Complete genome sequences have stimulated a search to identify regulatory "words." One approach that has had considerable success in small genomes and some success in larger ones is to correlate expression patterns with the presence of particular sequence patterns. The increasing volume of expression data emerging from microarray studies (Subproject 5) aids this approach.

An ongoing collaboration of Consortium members is testing a different approach. We cannot identify words of known biological significance by their overall abundance: short words the size of a protein binding site (4-12 bp), whether known important transcription factor binding sites or words of no known significance, are often either underrepresented or overrepresented in genomes. Overall abundance also provides no guidance in distinguishing the significant instances of a particular word from the rest. For example, the transcription initiation consensus sequence 5'-TCAGT-3' appears 50,119 times in the sequenced (euchromatic) portion of Drosophila chromosome 3R, about 10,000 fewer times (39 standard deviations) than expected given overall base composition. However, we have no way to interpret this startling statistic.
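The kind of abundance statistic quoted above can be sketched as follows. This is a minimal illustration, not the Consortium's actual procedure: it assumes an independent-base model built from the sequence's own base composition, counts a single strand, and uses a simple binomial variance that ignores corrections for self-overlapping words. The function name `word_zscore` is hypothetical.

```python
import math
from collections import Counter


def word_zscore(sequence, word):
    """Return (observed, expected, z) for `word` in `sequence`.

    Assumptions (not taken from the proposal): bases are independent,
    base frequencies are estimated from the sequence itself, only one
    strand is counted, and the variance is the plain binomial one.
    """
    seq = sequence.upper()
    word = word.upper()
    k = len(word)
    n_positions = len(seq) - k + 1
    if n_positions <= 0:
        raise ValueError("sequence is shorter than the word")

    # Base composition of the sequence (ambiguous bases such as N are skipped).
    base_counts = Counter(b for b in seq if b in "ACGT")
    total = sum(base_counts.values())
    freqs = {b: base_counts[b] / total for b in "ACGT"}

    # Probability of the word at any one position under independence.
    p = math.prod(freqs[b] for b in word)

    # Observed count, allowing overlapping occurrences.
    observed = sum(1 for i in range(n_positions) if seq[i:i + k] == word)

    expected = n_positions * p
    sd = math.sqrt(n_positions * p * (1.0 - p))
    z = (observed - expected) / sd if sd > 0 else 0.0
    return observed, expected, z


if __name__ == "__main__":
    obs, exp, z = word_zscore("ACGT" * 25, "ACGT")
    print(obs, round(exp, 3), round(z, 1))
```

As a rough consistency check on the figures in the text, a deficit of about 10,000 occurrences at 39 standard deviations implies a standard deviation near 256, which is the same order as the square root of an expected count near 60,000 (about 245).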

We know that significant transcription factor (TF) binding sites cluster. They occur "near" other instances of themselves or near other TF binding sites. Molecular explanations for this clustering include the observations that TFs often do not bind strongly on their own but bind in cooperation with other TFs, and that bound TFs often function in cooperation with neighboring TFs. "Enhancer sites" are localized clusters of TF binding sites, often associated with specific phenotypes (expression in tissue X, etc.). Numerous papers document self-clusters of biologically significant words, and clustering has come to dominate thinking about gene regulation in eukaryotes.
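One simple way to look for self-clustering of a word is to compare the gaps between consecutive occurrences with what a random placement would give. The sketch below is an illustration under stated assumptions, not the method used in this subproject: it assumes that randomly placed occurrences have approximately exponentially distributed gaps with rate equal to the word's overall density, and it reports the fraction of gaps at or below a chosen cutoff. The names `find_positions`, `short_gap_fractions`, and `max_gap` are hypothetical.

```python
import math


def find_positions(sequence, word):
    """Return start positions of every (possibly overlapping) occurrence."""
    positions = []
    start = sequence.find(word)
    while start != -1:
        positions.append(start)
        start = sequence.find(word, start + 1)
    return positions


def short_gap_fractions(positions, seq_length, max_gap):
    """Return (observed, expected) fraction of gaps no longer than max_gap.

    The expected value assumes occurrences are scattered at random, so
    gaps are roughly exponential with rate = occurrences / seq_length.
    An observed fraction well above the expected one suggests clustering.
    """
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if not gaps:
        raise ValueError("need at least two occurrences to form a gap")
    rate = len(positions) / seq_length
    observed = sum(1 for g in gaps if g <= max_gap) / len(gaps)
    expected = 1.0 - math.exp(-rate * max_gap)
    return observed, expected


if __name__ == "__main__":
    # Toy positions: two tight groups of three sites in a 2,000-base region.
    toy_positions = [0, 5, 10, 1000, 1005, 1010]
    obs, exp = short_gap_fractions(toy_positions, seq_length=2000, max_gap=10)
    print(obs, round(exp, 4))
```

A departure of the observed gap distribution from the exponential form is one concrete reading of "non-exponential behavior"; a fuller analysis would compare the whole distribution, not a single cutoff.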