Journal of Bacteriology, November 2003, p. 6392-6399, Vol. 185, No. 21
Genome-Scale Analysis of the Uses of the Escherichia coli Genome: Model-Driven Analysis of Heterogeneous Data Sets
Timothy E. Allen,1 Markus J. Herrgård,1 Mingzhu Liu,2 Yu Qiu,2 Jeremy D. Glasner,2,3 Frederick R. Blattner,2 and Bernhard Ø. Palsson1*
Department of Bioengineering, University of California-San Diego, La Jolla, California 92093-0412,1 Genetics Department,2 Animal Health and Biomedical Sciences, University of Wisconsin-Madison, Madison, Wisconsin 53706,3
Received 27 February 2003/Accepted 23 July 2003
Previously, a sequence-based framework for calculating the metabolic costs of expressing a gene and synthesizing its gene product was established (2). These costs are calculated directly from the DNA sequence, and estimates of ribosomal content can be used to scale the total protein-producing capacity of a cell and the requisite costs. The established framework, when scaled to account for all the genes in the Escherichia coli K-12 strain MG1655 genome (7), should allow explicit calculation of the material and energy costs required for expressing the entire genome, in addition to the costs for synthesizing the resulting proteome. Fundamental values for cellular biomass requirements have been experimentally measured for E. coli (22), but these values have never been calculated directly from the merging of sequence data with high-throughput gene expression data. Previous sequence-based cost estimates for protein synthesis have been calculated from expression estimates based on codon usage (1) but have not integrated actual expression or mRNA half-life data. A method for integrating such heterogeneous data sets would provide fundamental material and energy cost values, estimated effective promoter strengths on a genome scale, and the genome location distribution of gene expression in prokaryotes. Expression profiling has been used to identify genes whose expression changes under shifting environmental conditions (4, 24, 31, 40, 46). A variety of methods have been developed with which to analyze these data, including coexpression pattern analysis for operon prediction (33), dimensionality reduction techniques (16, 18), and several types of clustering methods (3). A model-driven means by which to interpret and analyze expression data, however, has not been established.
The availability of sequence data, expression data, and, most recently, global mRNA half-life data (6, 36) has created a need for such a structured analysis and integration of these disparate data sets. We developed a method that accomplishes this goal and used it to study the overall cost of maintaining a particular expression state, the distribution of individual effective promoter strengths, and the corresponding genome location-dependent characteristics of gene expression.
Similarly, if the abundance of a particular transcript (mi) relative to the total mRNA content (mrel,i = mi/mtot, where
Calculation of transcription state.
The transcription state is defined as the vector of all transcription rates in the genome,
where
The transcription initiation rates (
It is assumed that transcription elongation is not limiting for protein synthesis, since once transcription initiation occurs, ribosomes may bind to the unfinished mRNA transcript and translation may commence at a rate comparable to the mRNA elongation rate (42).
In a steady state, the transcription rate must balance the mRNA degradation rate:
It is therefore possible to reconcile data for mRNA concentrations, effective promoter strengths, and mRNA degradation rates in the following manner:
The effective promoter strengths, which depend on both the intracellular conditions and the regulation present, can thus be calculated globally if large-scale mRNA concentration data (35, 44) and mRNA half-life data (6, 36) are available. If log-phase growth is assumed, the number of copies of each promoter per cell can be estimated from each gene's position on the chromosome and the growth rate of the cell (8). Since these effective promoter strengths are essentially normalized transcription rate constants, they are subject to regulation. Thus, the variance of each effective promoter strength across many data sets becomes a useful quantity. The vector of all effective promoter strengths, q = (q1...qN), constitutes the promoter activation state of the genome, where N is the number of coding sequences in the genome.
Metabolic cost of RNA synthesis.
The synthesis rate of each mRNA transcript, which determines the nucleotide triphosphates required, is set by the effective promoter strength for each (ith) gene. Neither the mRNA elongation rate nor the free RNAP concentration is assumed to be limiting for the synthesis rate of each transcript (8, 28). In the absence of large-scale promoter strength data, however, the transcription rate for each transcript may be estimated from the relative mRNA amounts (estimated from expression data) and from available mRNA decay rates (6, 36) (equation 1). It is possible to normalize the nucleotides required for mRNA maintenance when the total mRNA concentration ([mRNA]tot) at a given growth rate is known (8).
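The steady-state balance described above (transcription balancing mRNA degradation, normalized by promoter copy number) can be illustrated with a short sketch. Because the display equations are missing from this copy, the exact form used by the authors is not shown here; the code below is only one plausible reading. It assumes first-order mRNA decay with rate constant ln(2)/half-life and the standard Cooper-Helmstetter gene dosage formula for exponentially growing cells. The function names, the C and D period defaults, and the units are illustrative assumptions, not values taken from the paper.

```python
import math


def gene_copies_per_cell(rel_pos, c_min=40.0, d_min=20.0, tau_min=40.0):
    """Average copies per cell of a gene during exponential growth.

    rel_pos is the gene's relative distance from the replication origin
    (0 = origin, 1 = terminus). c_min and d_min are the C and D periods and
    tau_min is the doubling time, all in minutes. The defaults are
    illustrative assumptions only.
    """
    return 2.0 ** ((c_min * (1.0 - rel_pos) + d_min) / tau_min)


def effective_promoter_strength(mrna_copies, half_life_min, rel_pos,
                                c_min=40.0, d_min=20.0, tau_min=40.0):
    """Transcripts initiated per promoter copy per minute at steady state.

    Assumed balance: q * promoter_copies = k_deg * mrna_copies,
    with k_deg = ln(2) / half_life_min.
    """
    k_deg = math.log(2.0) / half_life_min
    copies = gene_copies_per_cell(rel_pos, c_min, d_min, tau_min)
    return k_deg * mrna_copies / copies


if __name__ == "__main__":
    # Hypothetical gene: 2 transcripts per cell, 5-min half-life, at the terminus.
    print(effective_promoter_strength(2.0, 5.0, 1.0))
```

Under this reading, two genes with the same transcript abundance and half-life receive different effective promoter strengths if one lies near the origin, because the origin-proximal gene is present in more copies per cell.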
Metabolic cost of protein synthesis.
The total protein synthesis rate (i.e., the overall capacity of the cell to synthesize protein) is limited by the number of ribosomes available to the cell (8, 19). Additionally, the relative abundance of each transcript (mrel,i) determines the weighting of the synthesis rate for each protein since all mRNA transcripts compete for the pool of available ribosomes. This disregard for the potential effect of transcript length on ribosomal occupancy is probably valid since the messages are not necessarily saturating. In fact, the number of ribosomes in a typical E. coli cell is about 1 order of magnitude greater than the total number of messages (22). Thus, an upper boundary for each protein synthesis rate (
where β is the maximal protein synthesis capacity of the cell (in number of peptide bonds formed per cell per unit of time; about 340,000 peptide bonds per cell per s [8]) as limited by the number of ribosomes present and ai is the number of amino acids in each protein. The corresponding amino acid costs for supporting the upper boundaries for protein synthesis rates can be directly calculated from the known sequence. Additionally, the energy cost required for ribosomal binding, translocation along the ribosomes, and tRNA charging can be calculated for each protein synthesis rate.
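As a rough illustration of the upper boundary described above, the sketch below distributes the total peptide-bond capacity among proteins in proportion to relative transcript abundance and divides by protein length. The display equation is missing from this copy, so the form used here (capacity times mrel,i divided by ai) is an assumption inferred from the surrounding definitions. It also treats each protein of ai residues as costing ai peptide bonds, a simplification. The function names and example sequences are hypothetical.

```python
BETA = 340_000.0  # peptide bonds per cell per second, the value quoted in the text


def max_protein_synthesis_rates(m_rel, lengths, beta=BETA):
    """Upper bound on each protein's synthesis rate (proteins per cell per second).

    m_rel   : relative transcript abundances (renormalized here to sum to 1)
    lengths : number of amino acids in each protein
    Assumed form: rate_i <= beta * m_rel_i / a_i.
    """
    total = sum(m_rel)
    return [beta * (m / total) / a for m, a in zip(m_rel, lengths)]


def amino_acid_demand(rates, sequences):
    """Amino acid consumption (residues per cell per second) implied by the rates."""
    demand = {}
    for rate, seq in zip(rates, sequences):
        for residue in seq:
            demand[residue] = demand.get(residue, 0.0) + rate
    return demand


if __name__ == "__main__":
    # Two hypothetical proteins with equal transcript abundance.
    rates = max_protein_synthesis_rates([0.5, 0.5], [100, 200])
    print(rates)
    print(amino_acid_demand([2.0, 1.0], ["MKK", "MA"]))
```

With this form, the sum of rate times length over all proteins equals the total capacity, so the cell's ribosomes are fully allocated regardless of how the transcript pool is distributed.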
Analyzing genome location-dependent patterns in gene expression.
Calculation of the transcription state of the genome requires a means of analyzing potential patterns in expression along the chromosome. Wavelet transform techniques (5) can be used to analyze and visualize the genome location-dependent variability of gene expression. While standard Fourier transforms allow identification of periodic patterns in stationary signals, wavelet transforms allow identification of both periodic and nonperiodic localized patterns and do not assume a stationary signal. In this work we used the continuous wavelet transform, which is better suited for visualizing patterns than its discrete counterpart (21). The continuous wavelet transform of signal x(t) [W(t,a)] (in our case, effective promoter strengths along the genome), is defined as
where g([t' - t]/a) is the wavelet transform filter centered at t and the width of the filter (a) is used to determine the scale at which patterns are analyzed. By choosing the filter function (g) we can extract different types of patterns from the data. Here we used the Morlet wavelet defined as g(t) = cos(5t)exp(-t^2/2), which is particularly well suited for studying localized periodic patterns in data (5). The wavelet transform can be visualized by using a scalogram that displays the transform W(t,a) as a contour plot with location along the genome (t) on one axis and the scale (a) on the other axis. We evaluated the significance of the spatial patterns extracted through wavelet analysis by randomizing the gene order in the E. coli genome and recomputing the transform for each randomized genome. A P value for each individual W(t,a) was then calculated based on 1,000 randomized genomes by computing the number of times that a specific |W*(t,a)| for a randomized genome was larger than the true |W(t,a)|.
Experimental methods and normalization.
All mRNA expression data were generated from E. coli grown in batch culture as described in detail elsewhere (J. D. Glasner, T. Durfee, Y. Qiu, M. Liu, Y. Kang, C. Herring, C. R. Richmond, G. Plunkett 3rd, N. T. Perna, R. Mau, D. Frisch, S. Hinsa, S. Fendrick, G. Nodalski, P. Borelli, S. Phillips, N. Hermersan, and F. R. Blattner, unpublished data) and are available online (13; http://asap.ahabs.wisc.edu/annotation/php/logon.php). In most experiments we used the sequenced K-12 strain MG1655. Seventeen experiments involved strains derived from MG1655 with single ORF disruptions, and in 2 experiments (single spotted array hybridizations) we used strains DH5alpha and DH10B. In 39 of 49 experiments we used cells harvested at the early exponential phase growth, and in 10 experiments we used cells from late-exponential-phase or stationary-phase cultures.
In 46 of the experiments we used cells grown in a MOPS (morpholinepropanesulfonic acid)-based minimal medium, while in 3 experiments we used Luria-Bertani media. Glucose was used as the carbon source in most minimal medium experiments (43 of 49 experiments), and in the other experiments we used acetate, glycerol, or proline as the carbon source. Data were collected by hybridization of fluorescently labeled cDNAs to either Affymetrix E. coli antisense oligonucleotide arrays (as described by Rosenow et al. [32]) or microarrays of spotted ORF-length PCR fragments (as described by Yang and Ames [44]). The oligonucleotide arrays contained probes for both ORFs and intergenic regions, but only the data corresponding to ORFs were considered in this study. For each ORF on the Affymetrix array we calculated the average difference value using the Microarray Suite software (Affymetrix, Inc., Santa Clara, Calif.). For spotted arrays the signal intensity for each ORF was taken to be the average intensity of duplicate spots on the array. Fluorescently labeled genomic DNA was used as a reference for the spotted arrays and thus provided an absolute measure of expression. To convert the signal values to estimates of transcript abundance, the simplifying assumption was made that for each experiment an average E. coli cell in the population contained 10,000 (gene-size) mRNA transcripts (22). The signal for each ORF on each array was scaled by the factor 10,000/sum of the signal intensities for each array. When replicate hybridizations were available, the scaled signal values were averaged across arrays. A small number of spots on each spotted microarray were disregarded when we averaged across replicates because of poor-quality PCR, spotting, or hybridization. For this reason the sums of the estimates for the numbers of copies per cell are slightly lower than 10,000 and vary across the spotted cDNA array experiments.
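The scaling step described above is simple enough to sketch directly. The code below scales one array's signals so that they sum to 10,000 transcripts per cell and then averages scaled values across replicate arrays. The dictionary layout, function names, and ORF labels are hypothetical, and the exact handling of discarded spots in the original analysis is not specified in the text, so the averaging shown here (use whichever replicates report a given ORF) is an assumption.

```python
def scale_to_copies(signals, total_transcripts=10_000.0):
    """Scale one array's signals so that they sum to total_transcripts.

    signals maps an ORF identifier to its raw signal intensity.
    """
    signal_sum = sum(signals.values())
    factor = total_transcripts / signal_sum
    return {orf: factor * value for orf, value in signals.items()}


def average_replicates(scaled_arrays):
    """Average scaled values across replicate arrays.

    An ORF missing from a replicate (for example, a discarded spot) is
    averaged over the replicates that do report it.
    """
    orfs = set().union(*scaled_arrays)
    averaged = {}
    for orf in orfs:
        values = [array[orf] for array in scaled_arrays if orf in array]
        averaged[orf] = sum(values) / len(values)
    return averaged


if __name__ == "__main__":
    # Two hypothetical replicate arrays with two ORFs each.
    rep1 = scale_to_copies({"orfA": 1.0, "orfB": 3.0})
    rep2 = scale_to_copies({"orfA": 3.0, "orfB": 7.0})
    print(average_replicates([rep1, rep2]))
```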
Metabolic cost of genome expression.
The cost of expressing the E. coli genome was calculated for a number of different steady-state mRNA concentration distributions. A number of random distributions were probed, as were mRNA concentrations derived directly from the 49 gene expression data sets generated in this study. All of these cost calculations were normalized by using parameters corresponding to a cell with a 40-min doubling time (Table 1). Thus, for the mRNA maintenance cost, the mRNA concentrations were normalized to a specified total mRNA concentration (
Measured in vivo expression profiles.
The material and energy costs were then calculated for mRNA concentration distributions derived from available experimentally determined gene expression data, and the resulting costs and CVs are shown in Table 1. Gene expression data sets from 49 separate experiments (corresponding to 91 hybridizations, including 41 Affymetrix arrays and 50 spotted cDNA arrays) were generated as described above, and the numbers of transcript copies per cell were estimated for most of the 4,290 coding sequences in E. coli for each data set. For the spotted arrays, the numbers of transcript copies per cell were estimated from microarrays normalized by using genomic DNA as described above. The experimental conditions from which the data were derived varied widely and included exponential and stationary-phase growth in glucose minimal medium, exponential growth in acetate and glycerol minimal media, response to acid shock, response to cold shock, response to heat shock, growth in media containing an antibiotic, growth in Luria-Bertani broth, and various deletions grown on glucose minimal medium. In order to examine if the observed relative cost invariance was true for data sets available elsewhere, additional data sets were obtained from previous studies (41). The results for these data sets (data not shown) were comparable to those from our laboratory and did not alter the overall findings of this study.
Cost comparisons.
The averages and CVs from each computation of metabolic costs were compared. The variance in the results among the 400 random simulations was essentially negligible (all CVs were <1%). The 49 simulations resulting from expression data exhibited slightly higher variation (the average CV for the amino acid demands was 3.6%), but no CV was higher than 10% (for the tryptophan cost).
There was not a statistically significant difference in the costs for any of the amino acids or nucleotides resulting from randomly distributed mRNA concentrations or data-based simulations. The mean protein length was about 40 amino acids shorter for the data-based simulations than would be expected if the mRNA distribution were random. The highest CVs for the data-based cost calculations were for tryptophan (10.0%), cysteine (8.6%), and lysine (7.3%), and the lowest were for isoleucine (1.3%), threonine (1.3%), and asparagine (1.4%). The amino acid composition of a related strain of E. coli (B/r) has been experimentally determined (22), and the calculated costs for E. coli K-12 correlate relatively well with the biomass data (results not shown).
Distribution of estimated effective promoter strengths.
Using global mRNA half-life data (6), we calculated the effective promoter strength for each of the 49 sets of mRNA concentrations estimated from expression data (which included expression data from a variety of experimental conditions). The mean effective promoter strength and the corresponding CV were plotted for each of the 3,817 genes for which both expression data and half-life data were available (Fig. 1A). (Table 1 indicates the parameters used for calculation of promoter strengths.) In this analysis, the CV could be thought of as a measure of the extent to which a gene was subject to regulation under the experimental conditions tested. The highest expression levels generally corresponded to ribosomal protein components and associated protein synthesis enzymes, structural proteins, and membrane pore proteins (as classified by Serres and Riley [38]). Although the majority of CVs (the CVs for 60.9% of the 3,817 mean effective promoter strengths) fell between 50 and 100%, 115 genes had standard deviations that were equal to or greater than double their average expression levels.
Over one-fifth of the genes (876 genes or 22.9%) had CVs of less than 50% (Fig. 2a).
Genome location-dependent patterns in gene expression.
In order to elucidate potential genome location-dependent patterns in gene expression, wavelet transforms were applied to the effective promoter strength data as described above. Sliding averages of the calculated effective promoter strengths obtained by using Savitzky-Golay smoothing (Fig. 1B and C) indicated that there was nonrandom genome location-dependent variability along the E. coli chromosome. In particular, there appeared to be a periodic large-scale pattern of regions with high average expression. This pattern was present both in the data sets generated from Affymetrix experiments and in the data sets generated from spotted array experiments, implying that the observed pattern was not likely to be an artifact of the experimental platform (Fig. 1). In order to elucidate this pattern and other more subtle spatial patterns in the data, continuous wavelet and Fourier transforms were applied to the effective promoter strength data. The continuous wavelet transform of the average effective promoter strengths estimated from the 20 Affymetrix GeneChip experiments performed in this study (using the Morlet wavelet [5]) was represented in a scalogram (Fig. 3a). The major feature of the transform was the clear periodic pattern at a scale of approximately 600 kb. This pattern was observed in the spotted array data sets and was also detected by using other types of wavelet filters, such as the Marr wavelet used by Murray et al. (21), indicating that the observed pattern was not an artifact due to either the experimental platform or the particular transform used (results not shown).
The observed periodic pattern appeared in all the individual effective promoter strength data sets computed by using different expression profiles and hence did not seem to be specific to any particular experimental conditions. No such pattern was observed in the raw mRNA half-life data. A periodic pattern was, however, detected in the raw gene expression data (data not shown), but the pattern was somewhat less well defined than that in the effective promoter strength data. Since the effective promoter strengths were corrected for differential mRNA decay rates and distance from the replication origin, they seemed to be a more appropriate measure of the actual transcription rate than mRNA expression data alone. Analysis of gene functional classes whose members are preferentially located in particular regions of high or low average expression within the periodic pattern (Fig. 3b) may elucidate the relationship between the observed periodicity and E. coli cellular function. Flagellar and other cell motility-related genes and genes encoding ribosomal and other translation-related proteins are preferentially located in one or more of the high-expression regions. On the other hand, genes involved in major metabolic functions, such as energy metabolism, carbon utilization, and transport, tend to be located in the low-expression regions. Furthermore, genes in certain functional classes are typically strongly enriched in only one or two of the high- or low-expression regions, indicating that there are potentially distinct roles for each of these regions. Note that the only data generated were data for protein-encoding ORFs. Thus, the rRNA and tRNA transcription rates were not considered in the analysis of genome location-dependent patterns.
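The wavelet analysis and the gene-order randomization test can be illustrated with a small, pure-Python sketch. It evaluates a discretized continuous wavelet transform with the Morlet filter g(t) = cos(5t)exp(-t^2/2) at a single location and scale and estimates a P value by shuffling the signal. Several choices here are assumptions rather than details taken from the paper: the 1/sqrt(a) normalization, indexing by gene order with unit spacing instead of chromosomal position in kilobases, truncating the filter at four scale widths, and ignoring the circularity of the chromosome. The function names and the synthetic signal are hypothetical.

```python
import math
import random


def morlet(t):
    """Morlet filter as given in the text: cos(5t) * exp(-t^2 / 2)."""
    return math.cos(5.0 * t) * math.exp(-t * t / 2.0)


def cwt_at(x, t, a):
    """Discretized W(t, a) for signal x at index t and scale a.

    Assumed form: a**-0.5 * sum over t' of x[t'] * g((t' - t) / a),
    with the sum truncated to the window |t' - t| <= 4a.
    """
    half_width = int(math.ceil(4.0 * a))
    lo = max(0, t - half_width)
    hi = min(len(x), t + half_width + 1)
    total = sum(x[tp] * morlet((tp - t) / a) for tp in range(lo, hi))
    return total / math.sqrt(a)


def permutation_p_value(x, t, a, n_perm=1000, seed=0):
    """Fraction of shuffled signals whose |W*(t, a)| exceeds the observed |W(t, a)|."""
    rng = random.Random(seed)
    observed = abs(cwt_at(x, t, a))
    shuffled = list(x)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(cwt_at(shuffled, t, a)) > observed:
            exceed += 1
    return exceed / n_perm


if __name__ == "__main__":
    # Synthetic signal that matches the filter's periodicity at scale 10.
    signal = [math.cos(5.0 * (i - 100) / 10.0) for i in range(201)]
    print(cwt_at(signal, 100, 10.0))
    print(permutation_p_value(signal, 100, 10.0, n_perm=200))
```

Computing `cwt_at` over a grid of locations and scales would give the values displayed in a scalogram; the sketch evaluates single points only to keep the cost of the permutation test small.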
The apparent invariance of the costs for maintaining any expression state of the genome implies that the metabolic resources required to maintain a particular transcription and proteomic state are relatively constant and independent of external conditions. This invariance does not hold true, however, if a gene or small subset of genes with atypical amino acid composition is expressed at a level that is orders of magnitude higher than the level of expression of the rest of the genes (data not shown). Thus, microbes genetically engineered to express a particular protein at a high level may experience significant phenotypic effects associated with the cost imposed by such atypical expression. It is also possible that the dynamic range of microarrays and gene chips becomes limiting if a few transcripts are expressed at a very high level and therefore saturate the signal on the arrays (9, 30). To test the significance of this effect, cost simulations were performed in which the top 0.1% of genes with the highest expression levels were assigned values for number of copies per cell that were 10% higher than the level reported by the arrays. The highest CV was increased to just over 20% (for tryptophan), while the average CV of the amino acid costs increased from 3.6 to 8.1%, suggesting that a limited dynamic range in the experimental technology could have some effect on the calculated costs. Finally, it is possible that the observed invariance may have been due to a lack of probing the experimental conditions that would most alter the relative amino acid costs required for expression. However, the conditions chosen were quite varied, and hence we expected there to be differences in the overall metabolic costs between the conditions if such differences exist. The variation in effective promoter strength was computed for the entire genome. In general, no clear patterns were found between gene category and variation in expression level.
There was also no observed functional class bias either in the effective promoter strengths or in the variance across 49 different calculations. It is worth noting that these computations were biased by the experimental conditions under which each expression profile was measured. To better ascertain genes that are subject to regulation, it will be necessary to test more varied growth conditions (e.g., growth on other carbon sources, anaerobic growth, growth during diauxic shifts, etc.). If M9 medium (which contains a relatively large amount of phosphate) were used instead of MOPS medium, for example, one might expect the genes involved in the phosphate regulon to exhibit altered effective promoter strengths (and, consequently, increased CVs in the subsequent analysis), thus revealing the extent to which these genes were differentially regulated under the changing medium conditions (43). As more data sets are included in this type of integrated analysis, a better gauge of the variability in gene expression should be obtained, thus more completely revealing the extent to which each gene is subject to regulation.
An approximately 600-kb periodic genome location-dependent pattern in gene expression in the E. coli genome was detected by performing wavelet analysis of the effective promoter strength data generated in this study. The origin and significance of this pattern, however, are not clear. One possible explanation for the observed pattern is the existence of topological domains with potentially different levels of supercoiling in the E. coli chromosome (39). It has been estimated that there are 43 ± 10 such domains so that the average domain size is approximately 100 kb (39). No significant 100-kb periodicity was detected in the wavelet analysis except for particular localized patterns (Fig. 3a), although an irregular periodicity at a sliding average of 100 genes (
As genome-scale data, including mRNA expression data, mRNA half-lives, and proteomic data, are becoming more widely available, the need for integrating these heterogeneous data types is becoming stronger (26). As this study demonstrated, a higher-order biological analysis can be performed based upon the integration of multiple data types that cannot be done based on an analysis of individual data sets. Such integrated data analysis is enabled by genome-scale in silico models. Different data types demand a model to explicitly relate their values, thus revealing emergent properties that would otherwise be inaccessible (15). The proposed model integrates three types of genome-scale data: sequence, gene expression data, and mRNA half-life data. This structured framework constitutes a novel means by which to analyze expression data and interpret the expression state of a cell. The scalability of the methods used to generate these data should greatly facilitate future integration of the genomic expression state with existing genome-scale metabolic models. This method therefore constitutes an important step in our progress towards achieving truly genome-scale integrated models of cellular function.