Journal of Bacteriology, October 2002, p. 5733-5745, Vol. 184, No. 20
Correlations between Shine-Dalgarno Sequences and Gene Features Such as Predicted Expression Levels and Operon Structures
Jiong Ma,1 Allan Campbell,1 and Samuel Karlin2*
Department of Biological Sciences1 and Department of Mathematics,2 Stanford University, Stanford, California 94305
Received 29 April 2002/Accepted 22 July 2002
The SD sequence plays an important role in formation of the initiation complex by base-pairing with the anti-SD sequence found at the 3' end of 16S rRNA. This has been demonstrated by extensive experiments with Escherichia coli (9, 19, 46), other bacteria, and even archaea (8, 32, 35, 37). The SD sequences could be different subsequences of the complementary sequence of the anti-SD sequence (see Table 1); however, most SD sequences are slight variations of the GGAGG core (43). The effectiveness of an SD sequence is determined by both its base-pairing potential with the anti-SD sequence and its spacing from the start codon (10, 34). The aligned spacing of the SD sequences (see the legend to Fig. 1A for definition) generally varies from 5 to 13 bases, with optimal spacings of about 8 to 10 bases for E. coli genes (7, 34). Although it is not mandatory in translation initiation, a strong SD sequence may compensate for a weak start codon and counteract mRNA secondary structures that hinder access to the start (10, 55).
The main objective of this paper is to investigate, in 30 complete prokaryotic genomes (available as of June 2001), the correlation between the presence of an SD sequence and predicted expression levels of genes based on codon usage biases, functional gene classes, type of start codon, and distance between successive genes.
Detection of SD sequences.
To detect putative SD sequences, we calculated the free energy (designated
There are several rationales in favor of a threshold of -4.4 kcal/mol. (i) An effective SD sequence usually binds to the core CCUCC of the anti-SD sequence, which is conserved in all but one of the genomes (Table 1). It seems unlikely that the basic mechanism of the SD interaction will change from genome to genome, given the conservation of the core anti-SD motif. Thus, we define the SD sequences GGAG, GAGG, and AGGA as core SD motifs, all of which have a free energy of binding of -4.4 kcal/mol. (ii) Constrained by the base composition of the anti-SD sequence, its complementary motifs with a free energy of binding greater than -4.4 kcal/mol most often bind parts other than the core CCUCC and are likely to be random motifs. Thus, we have designated this threshold to exclude these random motifs. (iii) We analyzed several genomes with SD sequences defined by the core SD motifs GAGG, GGAG, and AGGA (SD sequences were defined as sequences harboring any of these motifs) and obtained highly concordant results (data not shown). Also, we relaxed the stringency and accepted the 3-bp motifs GGA, GAG, and AGG as SD sequences. The final results are consistent (see Supplementary Data Fig. S-1 and Table S-1; all supplementary data can be accessed at http://gnomic.stanford.edu/jiongm/SD/). For most genes, this cutoff value effectively leaves only one or no SD sequence in the 5' region (20 nucleotides). In rare cases there might be two or more competing motifs that qualify as SD sequences. When this happened, we chose the one with the lowest free energy of binding. The aligned spacing of an SD sequence is defined as the number of bases between the first base of the start codon and the U in the core anti-SD motif CCUCC (Table 1) in the duplex formed (Fig. 1A). The aligned spacing of the SD sequence GGAGG in Fig. 1A is 7 bases. There are 22 possible spacings (0 to 21 bases). However, generally more than 80% of all the SD sequences occur at spacings of 5 to 13 bases to the start codon (see below).
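The motif-based variant mentioned in rationale (iii) can be illustrated with a minimal sketch. It does not compute free energies; it only scans the 20-nucleotide 5' region for the core motifs GGAG, GAGG, and AGGA. The function names are hypothetical, and the spacing convention (bases lying strictly between the mRNA base paired with the U of CCUCC and the first base of the start codon) is an assumed reading of the definition above, not a confirmed one.

```python
# Core SD motifs mapped to the index of the A that would pair with the U of
# the anti-SD core CCUCC (mRNA AGGAGG pairs antiparallel with CCUCCU).
CORE_MOTIFS = {"GGAG": 2, "GAGG": 1, "AGGA": 3}


def find_core_sd_hits(upstream):
    """Return (motif, spacing) for every core motif in the upstream region.

    `upstream` is the region immediately 5' of the start codon, written
    5'->3', so its last base sits at position -1.
    Assumption: spacing counts the bases strictly between the U-paired A
    and the first base of the start codon.
    """
    seq = upstream.upper().replace("U", "T")
    hits = []
    for i in range(len(seq) - 3):
        motif = seq[i:i + 4]
        if motif in CORE_MOTIFS:
            a_index = i + CORE_MOTIFS[motif]
            hits.append((motif, len(seq) - a_index - 1))
    return hits


def sd_percent(upstream_regions):
    """Percentage of regions that contain at least one core motif."""
    with_sd = sum(1 for region in upstream_regions if find_core_sd_hits(region))
    return 100.0 * with_sd / len(upstream_regions)


example = "TTTTTTTAGGAGGTTTTTTT"  # made-up 20-nt region containing AGGAGG
print(find_core_sd_hits(example))
print(sd_percent([example, "T" * 20]))
```

In this made-up example the three overlapping core motifs all report the same spacing, because they belong to the same duplex with the anti-SD sequence.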
Theoretical measures of gene expression.
We used a method introduced by Karlin and Mrázek (22) to assess codon biases of a class of genes (or a single gene) relative to a second class of genes . Let G be a group of genes with average codon frequencies g(x,y,z) for the codon triplet (x,y,z) such that
where {pa(F)} is the average amino acid frequency of the genes of F . When no ambiguity is likely, we refer to B(F|G) as the codon bias of F with respect to G . The assessments of equation 1 can be made for any two gene groups from the same genome or from different genomes . In particular, there are four classes of genes as standards: C, all genes; ribosomal proteins (RP); transcription, translation processing factors (TF); and major chaperon/degradation (CH) genes functioning in protein folding, trafficking, and secretion . Qualitatively, gene g is predicted to be highly expressed (PHX) if B(g|C) is high while B(g|RP), B(g|CH), and B(g|TF) are low (i.e., the codon usage of g is very different from that of the average genes but rather similar to that of the gene classes RP, CH, and TF) . Similarly, g is putatively alien (PA) if its codon biases relative to the four classes are all large . Predicted expression levels of g with respect to these standards are calculated by
We propose a general expression measure for g as follows:
Other weights can also be used and give similar results .
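As an illustration of these measures, the sketch below assumes the form usually given for the cited method (22): B(F|G) = sum over amino acids a of p_a(F) times the sum over the codons of a of |f(x,y,z) - g(x,y,z)|, with codon frequencies normalized to 1 within each amino acid, and E(g) = B(g|C) / [0.5 B(g|RP) + 0.25 B(g|CH) + 0.25 B(g|TF)]. Both formulas and the weights are assumptions recalled from that method rather than taken from the text here, and all function names are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code in TCAG order; "*" marks stop codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_counts(seqs):
    """Count sense codons over in-frame coding sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper().replace("U", "T")
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if CODON_TABLE.get(codon, "*") != "*":
                counts[codon] += 1
    return counts


def synonymous_freqs(counts):
    """Codon frequencies normalized within each amino acid, plus amino acid totals."""
    totals = Counter()
    for codon, n in counts.items():
        totals[CODON_TABLE[codon]] += n
    freqs = {c: n / totals[CODON_TABLE[c]] for c, n in counts.items()}
    return freqs, totals


def codon_bias(counts_f, counts_g):
    """Assumed form of B(F|G): amino-acid-weighted sum of |f - g| over codons."""
    f, aa_totals = synonymous_freqs(counts_f)
    g, _ = synonymous_freqs(counts_g)
    n = sum(aa_totals.values())
    bias = 0.0
    for codon, aa in CODON_TABLE.items():
        if aa != "*" and aa_totals[aa] > 0:
            bias += (aa_totals[aa] / n) * abs(f.get(codon, 0.0) - g.get(codon, 0.0))
    return bias


def expression_level(g, c, rp, ch, tf, weights=(0.5, 0.25, 0.25)):
    """Assumed form of E(g); the weights are illustrative defaults."""
    denom = (weights[0] * codon_bias(g, rp)
             + weights[1] * codon_bias(g, ch)
             + weights[2] * codon_bias(g, tf))
    return codon_bias(g, c) / denom


gene = codon_counts(["TTTTTT"])       # toy gene using only TTT for Phe
all_genes = codon_counts(["TTCTTC"])  # toy "average" class using only TTC
standard = codon_counts(["TTTTTC"])   # toy standard class using both equally
print(codon_bias(gene, all_genes))
print(expression_level(gene, all_genes, standard, standard, standard))
```

The sketch does not guard against a zero denominator, which arises when a gene's codon usage is identical to all three standard classes.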
Definition of PHX and putative alien (PA) gene classes.
We defined a gene as PHX if the following two conditions were satisfied: (i) at least two among the three expression values ERP(g), ECH(g), and ETF(g) exceeded 1.05 and (ii) the overall expression level E(g) was
Logistic regression analysis.
To study the correlation between SD presence and E(g) values, we used a logistic regression model (18) . Considering a genome with n genes (each
where
where β1 is the regression coefficient and the measure of correlation. The SPSS statistics software (version 6.1.4; SPSS Inc., Chicago, Ill.) was used for the model fitting. The estimated β1 (as β) and the estimated standard error are given for each genome in Table 2; other estimated parameters not shown include β0 and the P value for a likelihood ratio test of the regression.
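A single-predictor logistic regression of this kind can be sketched as follows, assuming the standard form logit(p_i) = β0 + β1 E(g_i), where p_i is the probability that gene i carries an SD sequence. The sketch uses plain Newton-Raphson iterations rather than SPSS, the data are made up, and the function names are hypothetical.

```python
import math


def fit_logistic(x, y, iterations=50):
    """Fit logit(p) = b0 + b1*x by Newton-Raphson; return (b0, b1, se_b1)."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p          # gradient of the log-likelihood
            g1 += (yi - p) * xi
            h00 += w              # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    # Standard error from the inverse information matrix of the last
    # iteration, which is effectively at the converged estimates.
    se_b1 = math.sqrt(h00 / det)
    return b0, b1, se_b1


def log_likelihood(x, y, b0, b1):
    ll = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        ll += math.log(p) if yi else math.log(1.0 - p)
    return ll


def likelihood_ratio_p(x, y, b0, b1):
    """P value of the likelihood ratio test against the intercept-only model."""
    p_null = sum(y) / len(y)
    b0_null = math.log(p_null / (1.0 - p_null))
    lr = 2.0 * (log_likelihood(x, y, b0, b1) - log_likelihood(x, y, b0_null, 0.0))
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(max(lr, 0.0) / 2.0))


# Made-up data: E(g) values and SD presence (1) or absence (0).
e_values = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4]
has_sd = [0, 0, 1, 0, 1, 0, 1, 1, 0, 1]
b0, b1, se_b1 = fit_logistic(e_values, has_sd)
print(b1, se_b1, likelihood_ratio_p(e_values, has_sd, b0, b1))
```

Newton-Raphson fails when the two classes are perfectly separated by the predictor, so a production analysis would need a safeguard for that case.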
Using the free-energy method and a cutoff value of -4.4 kcal/mol, all the SD sequences detected were at least 4 bases in length, and most harbored the motif GGAG, GAGG, or AGGA (e.g., 88% of the SD sequences in Escherichia coli K-12) . In some natural mRNAs an SD sequence can consist of a weaker motif, e.g., AAGG, with a
Of the 30 genomes, 22 had an SD% exceeding 40% for all genes. Bacillus subtilis and Thermotoga maritima registered the highest SD%, 89.4% and 90.1%, respectively. The lowest genome SD% occurred for Rickettsia prowazekii, Mycoplasma genitalium, Mycoplasma pneumoniae, Halobacterium sp. strain NRC-1, Thermoplasma acidophilum, Sulfolobus solfataricus, and Pseudomonas aeruginosa, each at around 20%. In general, fast-growing bacteria, gram-negative thermophiles, spirochetes, methanogens, and hyperthermophilic archaea achieved relatively high SD%, while obligate intracellular parasites, surface parasites, pathogens, and cyanobacteria had diminished genome SD%. We carried out a simulation study to determine whether these SD% values represent real DNA elements or just random motifs. For each genome, we generated 100 (1,000 for Escherichia coli K-12) data sets of random sequences 20 nucleotides long according to the base composition of the original 20-nucleotide 5' end sequence data set, each with the same number of sequences as in the given genome. SD sequences were detected and SD% was calculated for each set of these random sequences. The SD% values shown in Table 1 were found to represent real motifs in all the genomes except for Mycoplasma genitalium and Halobacterium sp. strain NRC-1, as assessed by distributions of the SD% for these simulated data sets (the probability of these SD% values coming from random sequences was <0.01).
Correlation between SD presence and predicted gene expression levels.
It is known that not all genes contain an SD sequence. In some genomes, the majority of genes do not have such a motif (Table 1). Although an SD sequence is not compulsory for the translation of many genes (21), it may still be effective for genes that contain such a motif. This raises the question of how the SD sequences are distributed in different gene classes. First we examined SD sequences for the RP genes.
Primarily highly expressed during fast growth, the RP gene class showed a very high SD%, around 80% in most genomes (Table 2) . Even for genomes with a low overall SD%, the RP SD% was significantly high . For example, the SD% was 85.7% for RP genes in Thermoplasma acidophilum (23.5% for the genome) and 58.5% for RP in Sulfolobus solfataricus (23.0% for the genome) . This is consistent with a greater SD presence for highly expressed genes .
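The simulation study used to separate real SD motifs from random ones can be sketched roughly as below. This is a simplified illustration: SD detection is reduced to the presence of a core motif (GGAG, GAGG, or AGGA) instead of the free-energy calculation, the input sequences are made up, and the function names are hypothetical.

```python
import random
from collections import Counter

CORE_MOTIFS = ("GGAG", "GAGG", "AGGA")


def sd_percent(seqs):
    """Percentage of sequences containing at least one core motif."""
    hits = sum(1 for s in seqs if any(m in s for m in CORE_MOTIFS))
    return 100.0 * hits / len(seqs)


def simulate_sd_percent(seqs, n_sets=100, seed=0):
    """SD% of random data sets drawn from the pooled base composition of `seqs`.

    Each simulated set has the same number of sequences, and the same
    sequence length, as the observed set.
    """
    rng = random.Random(seed)
    composition = Counter("".join(seqs))
    bases = sorted(composition)
    weights = [composition[b] for b in bases]
    length = len(seqs[0])
    results = []
    for _ in range(n_sets):
        simulated = ["".join(rng.choices(bases, weights=weights, k=length))
                     for _ in seqs]
        results.append(sd_percent(simulated))
    return results


def empirical_p(observed, simulated):
    """Fraction of simulated SD% values at least as large as the observed one."""
    return sum(1 for v in simulated if v >= observed) / len(simulated)


# Made-up data: half of the 5' regions carry AGGAGG, half carry no motif.
observed_seqs = ["TTTTTTTAGGAGGTTTTTTT"] * 50 + ["T" * 20] * 50
observed = sd_percent(observed_seqs)
simulated = simulate_sd_percent(observed_seqs, n_sets=100)
print(observed, empirical_p(observed, simulated))
```

An empirical probability below 0.01 corresponds to the criterion used above for calling the observed SD% a real signal.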
We then divided the genes of a genome ( To verify the positive correlation of SD presence and gene expression levels, we applied logistic regression analysis. The regression coefficient β and its estimated standard error for each genome are given in Table 2. All but six genomes (Borrelia burgdorferi, Bacillus subtilis, Mycoplasma genitalium, Methanococcus jannaschii, Halobacterium sp. strain NRC-1, and Pyrobaculum aerophilum) recorded a significant positive correlation between SD presence and E(g) values (P < 0.01 for a likelihood ratio test of the regression). For the genomes of Borrelia burgdorferi, Methanococcus jannaschii, and Halobacterium sp. strain NRC-1, the P value for the likelihood ratio test was between 0.05 and 0.1, indicating a relatively strong correlation. Of the three genomes that did not record a significant correlation (P > 0.1), Mycoplasma genitalium had the lowest SD% (10.8%); Bacillus subtilis was among the highest in SD%; and Pyrobaculum aerophilum was low at about 23% (Tables 1 and 2).
Since all the data sets used were original genome annotations, a reasonable concern was that incorrect annotations of the gene start sites may have affected the accuracy of our SD analysis. To better determine how the genome data would compare to more reliable data sets, we analyzed the SD% for genes from several human-curated Escherichia coli K-12 data sets and achieved very similar results, as shown in Table 3. The data sets on essentiality were from the Profiling of E. coli Chromosome (PEC) database (http://www.shigen.nig.ac.jp/ecoli/pec/). The PEC data set classifies all E. coli genes into three groups: genes essential for cell growth ("essential"; total of 191 genes), those dispensable for cell growth ("nonessential"), and those unknown to be essential or nonessential ("unknown"), mainly using information from the literature.
The "verified" (total, 656 genes) data set was extracted from EcoMap12 (http://bmb.med.miami.edu/EcoGene/EcoWeb/), which consists of genes whose starts have been confirmed by N-terminal protein sequencing (41) . There are 65 genes in the verified set whose start sites were incorrectly annotated in the NCBI genome (4), giving an accuracy of about 90% for start site annotation, which is consistent with the average accuracy estimated for various gene-finding programs (25) .
To further reduce potential errors caused by annotation inaccuracies, we compiled a "single-start genes" data set for each genome, which consists of genes with a single start codon (AUG, GUG, or UUG as the first codon) within 90 nucleotides of their annotations. Of the 65 wrongly annotated genes in the E. coli "verified" data set, the correct start was found within 30 codons of the annotations for 54 (83%). Therefore, the single-start genes may have a chance of <0.02 of being wrongly annotated if the error rate for the genome annotations is 10%, or only about 0.04 if the error rate reaches as high as 25% in certain genomes, as estimated by some authors (3). In general, these genes constitute about 26% of a genome (29% PHX genes, 25% PMX, and 26% PA; see Supplementary Data Table S-2). Compared to the whole-genome data, they registered highly comparable SD% for the three gene classes PHX, PMX, and PA, indicating that the inaccuracies in start site prediction could only slightly affect the validity of our results obtained from genome annotations (see Supplementary Data Table S-2). There was also evidence suggesting that wrong starts are likely to be distributed evenly among the different classes of genes (PHX, PMX, and PA) that we used and thus would not significantly affect our comparisons of SD presence between PHX and PMX or PA gene classes. Of the 65 E. coli genes with incorrect starts mentioned above, 20% were PHX, 77% were PMX, and 3% were PA, indicating that incorrect annotations do not tend to bias strongly toward PMX or PA genes. Taken together, our results on the correlation of SD presence and predicted expression levels have been verified by both human-curated E. coli data sets and the high-quality single-start gene data sets. The validity of the results holds despite the existence of a few incorrectly predicted gene start sites in the genome data.
It is also evident that the increased SD% for PHX genes is not due solely to the presence of RP genes, as shown in Table 3 for Escherichia coli K-12 . The collection of PHX genes, excluding RP genes, achieved an SD% similar to that of the complete PHX class for the verified, essential, and whole-genome data sets (Table 3) . The results corroborate our assignment of genes as PHX based on codon usage, even in the many prokaryotes for which little direct information on protein abundances is available . Although many factors affect protein abundances, a high rate of translational initiation is essential to achieve a high level of expression and is the factor most simply observed by genome analysis .
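The error bounds quoted for the single-start genes can be checked with a short calculation. The reading assumed here is that a single-start gene can only be wrongly annotated when the correct start lies outside the 30-codon window, which was the case for 11 of the 65 misannotated genes in the verified set; the residual error is then approximated as the genome error rate times 11/65. This is one plausible reconstruction of the arithmetic, and the function name is hypothetical.

```python
WRONG_STARTS = 65     # misannotated genes in the "verified" E. coli set
WITHIN_WINDOW = 54    # of those, correct start found within 30 codons
VERIFIED_TOTAL = 656  # size of the "verified" set


def residual_error(genome_error_rate):
    """Approximate chance that a single-start gene is still misannotated."""
    outside_window = (WRONG_STARTS - WITHIN_WINDOW) / WRONG_STARTS
    return genome_error_rate * outside_window


print(round(WITHIN_WINDOW / WRONG_STARTS, 2))       # fraction within the window
print(round(1 - WRONG_STARTS / VERIFIED_TOTAL, 2))  # start site accuracy
print(round(residual_error(0.10), 3))
print(round(residual_error(0.25), 3))
```

Under this reading, a 10% genome error rate gives a residual error just under 0.02 and a 25% rate gives about 0.04, matching the figures in the text.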
SD sequences for PHX and PMX genes.
We also tried to determine whether the SD sequences of RP and PHX genes are stronger than those of PMX genes in terms of base-pairing potential with the anti-SD sequence and with respect to their aligned spacings, which reflect the two major determinants of the strength of an SD sequence (17) . Ringquist et al . (34) showed experimentally that the SD sequence UAAGGAGG is about fourfold more effective than AAGGA . The former SD has a
We first determined the OAS for each genome based on the distribution of SD spacings for all the genes in general and the PHX and RP gene classes in particular . The genomes of Escherichia coli K-12 and Pyrococcus abyssi are shown as two examples in Fig . 1 . The OAS are 7, 8, and 9 bases for Escherichia coli K-12 and 9, 10, and 11 bases for Pyrococcus abyssi (Fig . 1B and C) . Notably, 6, 7, and 8 bases are the most occupied SD spacings for PMX genes from Escherichia coli K-12, whereas 7, 8, and 9 bases are preferred by PHX and RP genes (Fig . 1B) . In fact, no SD sequence for the Escherichia coli K-12 RP genes occurs at an aligned spacing of 6 bases . Assuming that the SD sequences for RP genes are the most optimal, the three aligned spacings of 7, 8, and 9 bases were chosen as the OAS for SD sequences in Escherichia coli K-12 . These OAS agree excellently with experimental evidence that 8 to 10 bases are optimal for SD sequences in Escherichia coli K-12 genes (7, 34) . These also indicate that SD sequences for PHX genes may have a distribution closer to the actual optimal spacings than PMX genes . For the genomes of Haemophilus influenzae, Vibrio cholerae, Campylobacter jejuni, Helicobacter pylori 26695, Chlamydophila pneumoniae, and Chlamydia trachomatis, the OAS were determined in a way similar to that used for Escherichia coli K-12 . In other genomes, the OAS were aligned spacings occupied by the largest fraction of SD sequences for both PHX and PMX genes, e.g., for Pyrococcus abyssi (Fig . 1C; see also Supplementary Data Fig . S-2) . However, the SD sequences in the genomes of Mycoplasma genitalium and Pyrobaculum aerophilum were spread to all positions . Their OAS were chosen in the same way but may not represent optimal spacings (see Supplementary Data Fig . S-3) . Table 1 displays the OAS for each genome . In general, bacterial genomes attain similar OAS, with position 8 being the most common optimal spacing . 
Archaeal genomes show a preference for OAS about 2 bases longer than that of most bacterial genomes, usually at positions of 9 to 11 bases (Table 1, Fig . 1B and C) .
We display in Fig . 2 for each genome the mean
The third group consisted of Aquifex aeolicus, Thermotoga maritima, and all the archaea . The SD sequences in this cluster were the strongest, except for the genomes with a very low genome SD% (Halobacterium sp . strain NRC-1, Sulfolobus solfataricus, and Pyrobaculum aerophilum) . In Bacillus subtilis, Aquifex aeolicus, Thermotoga maritima, and the euryarchaea, the SD sequences for RP genes were significantly higher in OAS% and significantly lower in mean
It was previously suggested that there is no direct correlation between the affinity of the SD sequence for the anti-SD sequence and the efficiency of initiation complex formation under certain experimental conditions (10) . An SD interaction that involves the center of the anti-SD sequence, CCUCC, may be more efficient in facilitating translation initiation than when it involves off-center sequences (24) . This could explain the results of Ringquist et al . (34) and also the twofold-higher yields for GAGGU (
Since a majority of the SD sequences that we detected involved interaction with the core anti-SD sequence, it might be reasonable to speculate that a lower mean
Variation of SD% for different functional gene classes.
We also tried to find out whether SD presence is correlated with certain gene classes by assessing the SD% for different functional classes defined in the Cluster of Orthologous Groups (COG) database (50, 51). The two COG categories that are persistently highest in SD% are J (translation, ribosome structure, and biogenesis) and C (energy production and conversion) (see Supplementary Data Table S-3), consistent with the recognition that most genes in these groups are PHX (22). In contrast, the COG categories with low SD% include L (DNA replication, recombination and repair), M (cell envelope biogenesis, outer membrane), and I (lipid metabolism) (see Supplementary Data Table S-3). Genes in these classes usually attain the expression levels of PMX genes (22). Thus, variations in SD% for different COG classes seem to reflect an association with the expression levels of the genes in the class.
Relationship between SD presence and start codon.
Most genes rely on AUG as a start codon, while GUG and UUG are used sparsely (Table 4). Moreover, genes with an AUG start codon tend to have a higher SD% than genes with either GUG or UUG. The increase was significant in 12 genomes and most pronounced in the five euryarchaeal genomes with SD% exceeding 40% (Table 4).
We have shown that SD presence is significantly correlated with predicted gene expression levels in most prokaryotic genomes. In particular, the RP genes and more generally the PHX genes display a higher SD% than the PMX genes (i.e., the average genes). Also, in some genomes the SD sequences of RP and PHX genes are closer to optimal in both base-pairing potential with the anti-SD sequence and spacing to the start codon (Fig. 2). This provides further evidence that the SD sequence is important in translation of these genes. A strong SD sequence may also work together with other features of the highly expressed genes, e.g., the stronger start codon AUG and favorable secondary structure around the translation initiation region (16), that improve the translation initiation efficiency.
Relationship between SD presence and distance between successive genes.
The intergenic distance (Dg) is another important feature of prokaryotic genes that might correlate with the SD presence. For ease of discussion, we refer to the Dg of gene g as the distance (in base pairs) from g's start codon to the end of its immediate upstream gene in the same orientation. Negative values of Dg signify genes that overlap their immediate upstream genes. In most genomes, the most prevalent value of Dg is -4 bp (the junction is always AUGA; also see reference 38), which is observed for 7.8% of genes on average and for as many as 18% in Thermotoga maritima. The median Dg in a genome varies from 9 bp for Campylobacter jejuni and 11 bp for both Thermotoga maritima and Mycoplasma genitalium to 187 bp for Methanococcus jannaschii and 201 bp for Halobacterium sp. strain NRC-1 (see Supplementary Data Table S-4). In most archaeal genomes, the SD% for genes with a Dg of -4 bp is markedly higher than the SD% for all the other genes, at a level comparable to the SD% of the RP genes.
In contrast, many genomes recorded a reduced SD% for the collection of genes with a Dg of >20 bp, compared to genes with a Dg of <20 bp . This is especially valid for all the archaeal genomes (see Supplementary Data Table S-4) . We then assessed SD% for genes with different Dg ranges . Since the SD% does not show much variation among the groups with a Dg of greater than 30 bp, we focused on genes with a Dg of below 30 bp, which on average constitute 35% of a genome . We divided all the genes in a genome into seven Dg groups: genes with a Dg below -20 bp; five groups with a Dg of from -20 to 30 bp, with 10-bp intervals; and genes with a Dg exceeding 30 bp (see Supplementary Data Table S-5) . In most genomes, each group contained more than 30 genes . The gene group with a Dg of -10 to 0 bp was the largest among the five groups of 10-bp intervals . Figure 3 shows the SD% for these Dg groups .
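The grouping of genes by Dg can be sketched as follows. The interval boundaries are an assumption: the text does not say which end of each 10-bp interval is inclusive, so half-open intervals [low, low + 10) are used here, with Dg of exactly 30 bp placed in the last group. The function names and the input data are hypothetical.

```python
from collections import defaultdict


def dg_group(dg):
    """Assign a Dg value (bp) to one of seven groups (assumed half-open bins)."""
    if dg < -20:
        return "< -20"
    if dg >= 30:
        return ">= 30"
    low = (dg // 10) * 10  # floor to a multiple of 10; also correct for negatives
    return f"{low} to {low + 10}"


def sd_percent_by_group(genes):
    """SD% per Dg group from (dg, has_sd) pairs."""
    totals = defaultdict(int)
    with_sd = defaultdict(int)
    for dg, has_sd in genes:
        group = dg_group(dg)
        totals[group] += 1
        if has_sd:
            with_sd[group] += 1
    return {group: 100.0 * with_sd[group] / totals[group] for group in totals}


# Made-up genes: (Dg in bp, SD sequence present?)
genes = [(-4, True), (-4, True), (-1, False), (5, False), (150, True), (200, False)]
print(dg_group(-4))
print(sd_percent_by_group(genes))
```

With these bins, the most prevalent overlap of -4 bp falls in the "-10 to 0" group, which the text identifies as the largest of the five 10-bp groups.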
Genes with a Dg of 0 to 20 bp may have strong biases in base composition in their translation initiation region because their 5' end is located in the regions around the stop codon of the upstream gene (49) . Rocha et al . (35) found that the 6 bases following the stop codon in Bacillus subtilis genes are AU rich . Such biases could discount the occurrence of an SD sequence, which might be the reason for the somewhat reduced SD% for the group with a Dg of 0 to 10 bp in bacterial genomes (Fig . 3) . On the other hand, Eyre-Walker (12) showed that Escherichia coli K-12 genes overlapping a downstream gene tend to have low codon preferences at the 3' end, which would more easily enable the presence of an SD for the downstream gene (e.g., with a Dg of -20 to 0 bp) .
The archaeal genomes revealed a common trend distinctive from the bacteria . The genes with a Dg of less than 20 bp (Fig . 3F) or less than 10 bp (Fig . 3G) were strongly biased with an extant SD compared to genes with a larger Dg . This was even more emphatic for genomes with less than 30% overall SD%, especially for gene groups with a Dg of between -20 and 10 bp (Fig . 3G) . These increased SD% were again not correlated with higher expression levels (data not shown) . It is interesting that Bacillus subtilis, Aquifex aeolicus, and Thermotoga maritima were distinctively like bacteria in their relationship between Dg and SD presence (Fig . 3D), even though they were very similar to the archaea in the SD sequences with respect to
Relationship between SD presence and operon structure. The greatly increased SD presence in genes in close proximity to their upstream genes led us to investigate the connection between the SD sequence and operon structure . Apparently many genes in the groups with a Dg of -20 to 20 bp are genes within operons (38) . It has been suggested that operon structure might have arisen during the evolution of both bacteria and archaea by thermoreduction from a common thermophilic ancestor (14) . The operon structures in the two kingdoms thus might have some common features, such as the SD sequence . The high SD presence suggests that the SD sequences may play an essential role in translation of these genes . We analyzed SD sequences for 391 documented operons from Escherichia coli K-12 (each with at least two genes) extracted from the RegulonDB database (39) . Of the 601 internal genes within these operons, 69.2% had a Dg of between -20 and 30 bp, compared to only 6.6% of the 391 initial operon genes . The SD% was 71.0% for genes within operons and 67.3% for initial genes . We then conducted a more general analysis over the 30 genomes . Based on the Dg, we partitioned the genes in a genome into three classes, types I to III, as illustrated in Fig . 4A . Type I consists of genes at least 100 bp in distance from both the upstream and downstream genes; type I genes are presumably single genes . Type II consists of genes with a Dg larger than 50 bp and followed by at least two consecutive downstream genes with a Dg below 20 bp; type II genes are likely initial genes of operons . Type III comprises all genes with a Dg below 20 bp following a type II gene; type III genes are likely genes within operons . The three classes encompass about half of a genome . We found that more than one third of the type II and type III genes in Escherichia coli K-12 were present in the 391 known operons, and most of them were also predicted to be operons by Salgado's method (38) . 
On average, there were three type III genes following each type II gene (see Supplementary Data Table S-6) . Figure 4B presents the SD% for these three gene classes .
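The partition into type I, II, and III genes can be illustrated with a minimal sketch. It simplifies the setting to a list of Dg values for consecutive genes in the same orientation, treats a missing neighbor (first or last gene) as infinitely distant, and uses the Dg of the next gene as the downstream distance. These simplifications and the function name are assumptions for illustration only.

```python
def classify_genes(dg):
    """Label genes as type "I", "II", "III", or None from consecutive Dg values.

    dg[i] is the distance (bp) from gene i to its upstream neighbor;
    None means there is no upstream neighbor.
    """
    n = len(dg)
    up = [float("inf") if d is None else d for d in dg]
    down = up[1:] + [float("inf")]  # downstream distance = Dg of the next gene
    types = [None] * n

    # Type I: at least 100 bp from both the upstream and downstream genes.
    for i in range(n):
        if up[i] >= 100 and down[i] >= 100:
            types[i] = "I"

    # Type II: Dg > 50 bp, followed by at least two consecutive genes with
    # Dg < 20 bp. Type III: the run of genes with Dg < 20 bp that follows.
    i = 0
    while i < n:
        if up[i] > 50 and i + 2 < n and up[i + 1] < 20 and up[i + 2] < 20:
            types[i] = "II"
            j = i + 1
            while j < n and up[j] < 20:
                types[j] = "III"
                j += 1
            i = j
        else:
            i += 1
    return types


# Made-up Dg values for eight consecutive genes.
print(classify_genes([None, 150, 120, 5, -4, 10, 300, 60]))
```

Genes that satisfy none of the three definitions stay unlabeled, which is consistent with the three classes covering only about half of a genome.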
This conservation was even more significant in the genomes where the overall SD% was very low and/or no correlation between the SD presence and predicted expression levels was observed . Such genomes included those of Mycoplasma genitalium, Mycoplasma pneumoniae, Synechocystis sp . strain PCC6803, Halobacterium sp . strain NRC-1, Sulfolobus solfataricus, and Pyrobaculum aerophilum (Fig . 2, 3, and 4B) . Thus, it is tempting to speculate that the SD sequence may have coevolved with the operon gene structure in both bacteria and archaea (14) . The correlation of SD presence with gene expression levels might have been established later . This would explain the observation that, in all archaeal genomes and Aquifex aeolicus, PHX genes with a Dg of below 50 bp recorded a significantly higher SD% than other PHX genes (data not shown) . The RP genes are both highly expressed and profusely expressed in operons, and not surprisingly, they always attained the highest SD% (Table 2) . The archaeal genomes provide an excellent system with which to analyze the evolution of both the SD sequence and the bacterial translation mechanism utilizing the SD-anti-SD interaction . Some euryarchaea (Thermoplasma acidophilum and Halobacterium sp . strain NRC-1), and especially crenarchaea (Sulfolobus solfataricus and Pyrobaculum aerophilum), seem to have gradually lost conservation of both the anti-SD and the SD sequences (Table 1; Fig . 2) . Accumulating evidence suggests that many single genes, or initial genes of operons, in these genomes are translated through leaderless mRNA by mechanisms that do not involve the SD-anti-SD interaction (45, 47, 52) . The SD sequence may thus become dispensable for these genes . However, for genes within operons, the SD sequence appears to be particularly important, evidenced by the prevalence of the SD motifs in those genes (Fig . 3F and G) . Experimental evidence supporting this hypothesis has been reported for Sulfolobus solfataricus (8) . 
SD presence and other gene features. It has been suggested that the SD sequence is especially important in a genome where an S1 ribosomal protein is missing, e.g., Bacillus subtilis, which has only a reduced S1 homologue and achieves the second highest SD% of all the genomes (Table 1) (35) . However, we did not find such a correlation for other genomes . Three bacteria (Ureaplasma urealyticum, Mycoplasma genitalium, and Mycoplasma pneumoniae) and all archaeal genomes did not have an S1 or any S1 homologues . But, unlike Bacillus subtilis, the genomes of Ureaplasma urealyticum, Mycoplasma genitalium, and Mycoplasma pneumoniae recorded a very low SD% (Table 1) . On the other hand, genomes with an S1 gene can achieve very high SD%, e.g., Thermotoga maritima, which had the highest SD% (Table 1) . Thus, SD presence is not correlated with the presence or absence of an S1 RP gene . Also, the SD sequence seems to be uncorrelated with factors such as copy number of the 16S rRNA, G+C content, total number of genes, gene length, or lifestyle (data not shown) . Further comments. Given the correlation between the SD sequence and other gene features, especially expression levels and distances between successive genes, it is suggested that the SD sequence should be incorporated in algorithms for gene start determination, expression level prediction, and operon prediction to improve accuracy . Most of the genomes studied in this report were annotated with the programs GeneMark (20, 26) and GLIMMER (40) or a combination of automatic gene-finding methods and similarity searches in protein databases . Now SD information has been incorporated in recent programs, such as GeneMark.hmm and GeneMarkS (3, 25) . It appears to work well for genomes with high SD%, such as low-G+C gram-positive bacteria (e.g., Listeria monocytogenes [15]) . However, for many genomes, the SD% is around 30 to 50% and thus would provide only marginal improvements (36) . 
On the other hand, the relationship between SD presence and intergenic distances may contribute greatly to operon predictions, an important part of prokaryotic genomics . No highly reliable method to date has been developed for operon prediction (38) . Also, little is known about operons in archaeal genomes . Our findings that archaeal genes that are presumably within operons have remarkably increased SD presence should help in developing an effective method for operon characterization in these genomes . Recently, the crystal structures of both the 50S and 30S complexes of the bacterial ribosome have been determined at high resolution (2, 41, 56) . A structure of the 80S ribosome from Saccharomyces cerevisiae was also reported (48) . These accomplishments greatly augment our understanding of the mechanisms of protein synthesis at the atomic level (5, 6, 29-31, 33, 44) . Furthermore, Yusupova et al . (58) directly observed the path of mRNA in the 70S ribosome from Thermus thermophilus at 7 Å resolution . The model mRNA was based on the phage T4 gene 32 mRNA except that the SD sequence was expanded to AAGGAGGU . They found that about 30 nucleotides are bound to the 30S subunit (15 bp upstream of the initiator to 15 bp downstream), which is roughly the whole translation initiation region . The SD interaction was clearly observed to form a helix, which was accommodated in a cleft formed by 16S rRNA elements and the ribosomal proteins S11 and S18 (58) . These results provide additional proof that the SD interaction can be an important part of translation initiation . The SD sequence in the mRNA, AAGGAGGU, had an aligned spacing of 7 bases . It is interesting that of the 67 AAGGAGGU SD sequences in the 21 bacterial genomes (Table 1), only 4 occurred at an aligned spacing of 7 bases, while 10, 19, and 12 conferred 8, 9, and 10 bases of spacing, respectively . A total of 55 (82.1%) were present at a spacing larger than 7 bases . 
Thus, an aligned spacing of 9 bases would most likely be preferable for the mRNA in the structure. There are apparently structural constraints that require such an optimal spacing, and three-dimensional simulation studies based on the structure using different SD sequences and spacings could provide insights into these structural constraints and a better understanding of the SD interaction.