Modeling Biological Networks

IV.1 Coordinators
IV.2 Participants
IV.3 Introduction
IV.4 Background and Significance
IV.5 Research Plan
IV.6 Specific Subprojects

IV.7 Connection to Specific Projects 2 (cytoskeleton) and 3 (organogenesis)
IV.8 Timeline


IV.6.iv.c.2 Uncovering dynamical constraints using Singular Value Decomposition:

Understanding the logic of gene expression requires global approaches beyond simple clustering. The numerical algorithm called singular value decomposition (SVD) greatly assists the analysis of microarray expression data (Holter et al., 2000; Alter, Brown and Botstein, 2000; Holter et al., 2001). SVD searches for an optimal set of independent parameters to describe m measured states. The following simple example illustrates the notion of optimal independent parameters.

Consider a hypothetical biochemical reaction with a protein kinase (K), inorganic phosphate (S), the enzyme target of the kinase (E), and its phosphorylated form (P). The traditional way to describe this reaction is to keep track of all four concentrations as a function of time: cK(t), cS(t), cE(t) and cP(t). In most cases, a smaller number of independent parameters describes the reaction more naturally. In this hypothetical biochemical reaction, the kinase concentration does not change (as it only acts as a catalyst) and inorganic phosphate is abundant in the cell. If we suppose that P is the inactive form of enzyme E, we can discard cK and cS, and consider only cE and cP. Since P and E are merely two different forms of the same protein, the sum and the difference of their concentrations (c1=cE+cP and c2=cE-cP) are more meaningful parameters. Thus the number of independent parameters in this example is two and the two appropriate parameters are linear combinations of the original ones.
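The example above can be checked numerically. The following is a minimal sketch, assuming made-up time courses for the four concentrations (the functional forms and constants are illustrative only and do not come from any measurement). After removing each species' mean, the number of non-negligible singular values counts the independent parameters.

```python
import numpy as np

# Hypothetical time courses in arbitrary units; all values are assumptions.
t = np.linspace(0.0, 10.0, 50)
c_K = np.full_like(t, 1.0)       # kinase: constant, acts only as a catalyst
c_S = np.full_like(t, 100.0)     # phosphate: abundant, effectively constant
c_E = 2.0 + np.exp(-0.5 * t)     # unphosphorylated enzyme (assumed form)
c_P = 1.0 + 0.5 * np.sin(t)      # phosphorylated enzyme (assumed form)

data = np.vstack([c_K, c_S, c_E, c_P])                # 4 species x 50 time points
centered = data - data.mean(axis=1, keepdims=True)    # subtract each species' mean

# Singular values are returned in descending order.
singular_values = np.linalg.svd(centered, compute_uv=False)

# Count singular values that are not numerically zero (relative tolerance).
n_independent = int(np.sum(singular_values > 1e-10 * singular_values[0]))
print(n_independent)
```

The constant rows for K and S vanish after centering, so only the two enzyme forms contribute, in agreement with the two parameters c1 and c2 identified above.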

SVD makes this search for optimal parameters systematic. More sophisticated methods like Principal Component Analysis further extend our ability to identify fundamental parameters. We will try to determine the presence or absence of constraints on the number of states a transcriptome, and hence the genetic regulatory network, can achieve. We will then attempt to correlate these data to our clustering-based modules (Subproject 1) and the known regulation of E. coli metabolism.

While microarray expression data can potentially clarify the dynamics of the E. coli metabolic network, important obstacles remain. Data from expression arrays are inherently noisy, and we need to understand the nature of this "genetic" noise and its effect on data quality. Our knowledge regarding genetic regulatory networks is so limited that the regulation of gene expression seems intractably complex.

To begin to understand the characteristics and constraints of genetic regulatory networks, we will initially use two publicly available microarray data sets: one for E. coli (Arfin et al., 2000) and the other for S. cerevisiae (Hughes et al., 2000). The first set provides steady-state mRNA expression data for wild-type and integration host factor (IHF) mutant E. coli strains sampled eight different times. The second set provides steady-state mRNA expression data for wild-type S. cerevisiae sampled sixty-three separate times and 300 single cDNA microarray measurements for a series of externally or internally perturbed yeast strains.

The prevailing steady-state hypothesis of transcriptome activity implies that, for a given gene i, the histogram of its expression level across different arrays should follow a short-ranged distribution, e.g., Gaussian or Poisson. The variability in the expression level of gene i would then lie within a certain range. Expression data points for any given gene in the available data sets are too few to determine such distributions. However, examining all genes together provides excellent statistics. For similar reasons, good statistics for the comparison of transcriptomes require that we examine all mRNA expression data together. The original analyses of the gene expression data found fluctuations in the expression patterns of many genes (Hughes et al., 2000; Arfin et al., 2000), providing a measure of the magnitude of noise resulting either from slightly altered growth conditions in a given experiment (external noise) or from naturally occurring stochastic fluctuations in the expression level of a given gene (internal noise). We will then compare the internal noise (by comparing across genes) and the external noise (by comparing across transcriptomes) with the observed fluctuations using standard statistical approaches.
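One possible way to pool all genes into a single distribution is sketched below. It assumes synthetic Gaussian data and a per-gene standardization (z-score) before pooling; neither the normalization nor the array sizes are specified in the text, so both are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_arrays = 2000, 8      # synthetic sizes, chosen only for illustration

# Synthetic log-ratios: each gene gets its own mean and spread (assumed Gaussian).
gene_mean = rng.normal(0.0, 0.5, size=(n_genes, 1))
gene_sd = rng.uniform(0.05, 0.3, size=(n_genes, 1))
e = gene_mean + gene_sd * rng.standard_normal((n_genes, n_arrays))

# Standardize each gene across its arrays, then pool all genes into one sample.
row_mean = e.mean(axis=1, keepdims=True)
row_sd = e.std(axis=1, ddof=1, keepdims=True)
z = (e - row_mean) / row_sd
pooled = z.ravel()

# Histogram of the pooled, standardized fluctuations.
counts, edges = np.histogram(pooled, bins=30)
```

With only eight arrays per gene, each standardized value is bounded in magnitude by (n - 1)/sqrt(n), about 2.47 here, so the pooled histogram is better compared against a small-sample reference distribution than against a unit Gaussian.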

We also plan to investigate the effect of genetic network perturbations. At present, only expression profiles for S. cerevisiae cells with single non-lethal gene ablations and pharmacological treatments (Hughes et al., 2000) are available. Deleting a gene alters the underlying genetic network by removing a node. The cells then develop a new steady state, allowing them to function without the deleted gene product. An important question to address will be the degree of difference between the new steady state and that of the wild-type. Some perturbations may result only in small changes in the use of the underlying genetic network architecture, with only genes directly interacting with the deleted gene affected significantly. Others may fundamentally reorganize gene expression landscapes across the whole cell.

Our analysis of microarray data sets will use the base 10 logarithmic ratios of their relative gene expression levels. We will first arrange the data sets into matrices in which the rows represent genes and the columns represent individual microarray measurements. Our statistical characterization uses the following notation. The data matrix, e, has N rows (each containing the expression levels of one gene) and m columns (each containing the expression levels of all genes in one experiment, i.e., the given measured transcriptome). The expression level of the ith gene in the jth array is eij. The average expression level of this gene throughout the m arrays is $\bar{e}_{i\cdot} = \frac{1}{m}\sum_{j=1}^{m} e_{ij}$, and the variance of the expression level of the same gene is $\sigma_{i\cdot}^{2} = \frac{1}{m}\sum_{j=1}^{m} (e_{ij} - \bar{e}_{i\cdot})^{2}$. The average expression level of genes in one array is $\bar{e}_{\cdot j} = \frac{1}{N}\sum_{i=1}^{N} e_{ij}$, and the variance of the expression level in the same array is $\sigma_{\cdot j}^{2} = \frac{1}{N}\sum_{i=1}^{N} (e_{ij} - \bar{e}_{\cdot j})^{2}$.
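The construction of the data matrix and its row and column statistics can be sketched as follows. The intensity values, the names `sample` and `reference`, and the two-channel layout are hypothetical stand-ins rather than the format of the cited data sets, and the population (divide by m or N) normalization of the variance is an assumption.

```python
import numpy as np

# Hypothetical intensities for N = 5 genes measured on m = 4 arrays.
rng = np.random.default_rng(1)
sample = rng.uniform(100.0, 1000.0, size=(5, 4))      # assumed test channel
reference = rng.uniform(100.0, 1000.0, size=(5, 4))   # assumed reference channel

# N x m data matrix of base-10 logarithmic expression ratios.
e = np.log10(sample / reference)

gene_mean = e.mean(axis=1)    # average of gene i over the m arrays
gene_var = e.var(axis=1)      # variance of gene i over the arrays (divides by m)
array_mean = e.mean(axis=0)   # average over all genes within array j
array_var = e.var(axis=0)     # variance over all genes within array j (divides by N)
```

Passing `ddof=1` to `var` would give the unbiased sample variance instead, which may be preferable when m is small.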

We will apply SVD to the microarray matrices to characterize each of the transcriptomes by a corresponding column vector that simultaneously encompasses all measured relative gene expression levels. We call the jth column vector, corresponding to the jth microarray, cj. The {cj}j=1,m vectors are embedded in an N-dimensional vector space, RN, so each cj vector has N components.

SVD is a linear transformation that finds within RN an m-dimensional subspace, Sm, with basis vectors {bj}j=1,m. Sm must fulfill two conditions. First, it must contain all column vectors of e, i.e., it can fully represent e. Denote the representations of the cj vectors in Sm by cj' and the representation of e by e'. The cj' vectors are m-dimensional, so e' is an m x m matrix whose columns are the cj' vectors. Second, the representations of the {bj}j=1,m vectors in Sm, the {bj'}j=1,m vectors, should be orthonormal eigenvectors of e'. Thus SVD performs a subspace search within an N-dimensional vector space, RN, and a principal axis transformation within the computed subspace, Sm.

SVD provides the following data. The column vectors of the N x m matrix, u, are the {bj}j=1,m vectors. The diagonal components of the diagonal m x m matrix, w, contain the eigenvalues (singular values) from the principal axis transformation. The rows of the m x m matrix, v^T, are the {bj'}j=1,m vectors. We can summarize the transformation as e = u w v^T. The ith diagonal component, wi, of the matrix w defines the relative weight of the corresponding bi basis vector as $w_i^{2} / \sum_{k=1}^{m} w_k^{2}$.
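The decomposition can be illustrated with NumPy's thin SVD. This is a sketch on a synthetic random matrix standing in for real expression data; the sizes are arbitrary, and normalizing the squared diagonal entries of w to sum to one is an assumed convention for the relative weights.

```python
import numpy as np

rng = np.random.default_rng(2)
N, m = 200, 6                        # synthetic sizes: N genes, m arrays
e = rng.standard_normal((N, m))      # stand-in for a log-ratio data matrix

# Thin SVD: u is N x m, w holds the m diagonal entries, vt is the m x m v^T.
u, w, vt = np.linalg.svd(e, full_matrices=False)

# The three factors reproduce the data matrix: e = u w v^T.
reconstructed = u @ np.diag(w) @ vt

# Coordinates of the columns of e in the basis given by the columns of u:
# an m x m matrix, so the m-dimensional subspace fully represents e.
e_prime = u.T @ e

# Relative weight of each basis vector (assumed normalization, sums to 1).
relative_weight = w**2 / np.sum(w**2)
```

Because `u @ e_prime` recovers e, no information is lost by working in the m-dimensional subspace. A few dominant entries in `relative_weight` would indicate that a small number of basis vectors account for most of the expression pattern.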