Journal of Bacteriology, April 2002, p. 2260-2272, Vol. 184, No. 8
Evolutionary Analysis by Whole-Genome Comparisons
Arvind K. Bansal¹ and Terrance E. Meyer²*
Department of Computer Science, Kent State University, Kent, Ohio 44242,¹ Department of Biochemistry, University of Arizona, Tucson, Arizona 85721²
Received 20 August 2001/Accepted 11 January 2002
Whole-genome analysis is still in its infancy, and various means of analyzing the data are currently being explored (4, 16, 22, 27, 33, 47, 61, 65). Genome sizes range from ca. 0.5 to 10 Mb for bacteria and from 3 Mb to 3 Gb for eukaryotes. The number of genes per genome varies from fewer than 500 to more than 26,000, and the average gene size is ca. 1 kb. There are very few truly universal genes, and the extent of identity among proteins having similar structures and functions varies from highly conserved to barely recognizable. Among the homologs or similar proteins, there are those that share descent from a common ancestor and presumably have the same function (orthologs) and those that have been duplicated and are more likely to have different functions (paralogs). There is no way to know from sequence comparisons alone which genes are true orthologs and which are paralogs, but it is common practice in whole-genome studies to label as orthologs, and to assign tentative functions to, those homologs with the greatest identity above a certain threshold. Duplicated genes are not uncommon, and split genes and gene fusions are also not unusual (4, 5, 25). Many genes are arranged in groups or operons and may be cotranscribed, but rearrangement of gene order from one species to the next is more often the rule than the exception (4, 5, 27). The presence of introns makes it difficult to identify eukaryotic genes, and best estimates place the number of human genes in the neighborhood of 26,000 to 39,000 (68). Thus, whole-genome analysis is not trivial, and it may take some time before agreement is reached on the best way to analyze the data. Nevertheless, we have now compared the genomes of 37 species in novel ways and have reached conclusions that differ from those of previous studies that considered a smaller number of species.
We validated our technique with experimentally known orthologs in E. coli and B. subtilis and reported comparisons of E. coli with H. influenzae (see http://www.cs.kent.edu/). Our analysis showed that the genes in very closely related microbes shared larger alignment fragments, and the fragment size varied from better than 90% of the length of the smaller gene for very closely related genomes to ca. 20% for the most distantly related genomes. This varied for different genes (including ribosomal proteins), making it quite difficult to perform any normalization based upon genome relatedness or to develop a broader classification of the set of genomes. Nevertheless, we felt that this should be examined in greater depth to give the reader a better sense of the problem. For example, more than 80% of one of the fully conserved genes, E. coli rpsK (130-amino-acid ribosomal protein), matched with orthologs in all other genomes, while E. coli pheS (327-amino-acid phenylalanyl-tRNA synthetase alpha chain) matched only a 161-amino-acid fragment of Aeropyrum pernix APE2302 (473-amino-acid protein). For the two Bacillus spp., the two Chlamydia spp., and the two E. coli strains, the median matching length was better than 90%. For Saccharomyces cerevisiae versus Ureaplasma urealyticum, for C. elegans versus Borrelia burgdorferi, or for C. elegans versus Mycoplasma genitalium, the median matching length was in the range of 23% ± 4%. For comparisons among the bacteria or among the archaea, the median matching length was ca. 82% ± 8%. For the most divergent prokaryotes, the median matching length was in the range of 51% ± 8%. The median matching length between S. cerevisiae and the prokaryotes was 38% ± 8%, with extremes of 21% and 57%. The median matching length for C. elegans and the prokaryotes was ca. 31% ± 7%, with extremes of 19% and 42%. The matching length for S. cerevisiae with C. elegans was a surprisingly low 32%.
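The matching-length figures above can be reproduced in outline as follows. This is a minimal sketch, not the code used in the study: it assumes that "matching length" means the aligned fragment length divided by the length of the smaller gene, and the function name and the list `hypothetical_pairs` are illustrative inventions. Only the pheS numbers come from the text.

```python
from statistics import median

def matching_fraction(aligned_len, len_a, len_b):
    """Aligned fragment length as a fraction of the smaller of the two genes."""
    return aligned_len / min(len_a, len_b)

# pheS example from the text: a 161-aa aligned fragment between the
# 327-aa E. coli protein and the 473-aa A. pernix APE2302 protein.
phes_fraction = matching_fraction(161, 327, 473)

# Hypothetical (aligned, length A, length B) triples for one genome pair,
# included only to show how a median matching length would be taken.
hypothetical_pairs = [(161, 327, 473), (110, 130, 128), (250, 400, 390)]
median_fraction = median(matching_fraction(*p) for p in hypothetical_pairs)

print(f"pheS matching fraction: {phes_fraction:.2f}")
print(f"median over hypothetical pairs: {median_fraction:.2f}")
```

Under this reading, the pheS match covers roughly half of the smaller gene, which is in line with the range reported for the most divergent prokaryotes.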
There is no evidence that the eukaryotes match larger segments with either the archaea or the bacteria. It is also clear that we need to analyze greater numbers of organisms in each of the major categories to refine this avenue of research. Despite the choice of a relaxed cutoff, the BLAST phase did not pick up every homolog. However, such omissions were rare. They were detected through the loss of transitivity in ortholog relationships for some genes; that is, if gene X in genome A was orthologous to gene Y in genome B and gene Y in genome B was orthologous to gene Z in genome C, then it was not always the case that gene X in genome A was identified as a homolog of gene Z in genome C after BLAST comparison. It was also found that, despite our choosing unique or best homologs and pruning all remaining homologs involving two genes (or gene fragments) in orthologous gene pairs, the group of orthologs clustered after merging the results of all 37 genomes often contained two genes with very similar functionality from the same genome. This phenomenon of multiple similar genes increased when the definition of orthologs allowed smaller best-matching fragments to be included, suggesting that, for distantly related genomes, pairwise genome comparison may also pick up some paralogs or genes which can substitute for the functionality. We removed all of the gene pairs that had orthologous gene fragments less than 30% of the size of the largest orthologous fragment in the cluster. The cutoff of 30% was guided by the fact that many fully conserved genes are ribosomal proteins as small as 93 amino acids (E. coli rpsS). Since the artifact cutoff size was taken as 30 amino acids, smaller gene fragments matching ribosomal proteins will be treated as artifacts, and the results would be inconsistent if the cutoff ratio were <30%. This cutoff also ensures that we restrict the number of closely related paralogs.
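The 30% pruning rule can be sketched as below. This is an illustrative reading of the rule rather than the authors' implementation. The cluster is assumed to be a mapping from a gene label to the length (in amino acids) of its aligned fragment, and every gene label and length except the 93-amino-acid rpsS figure is hypothetical.

```python
def prune_small_fragments(cluster, min_ratio=0.30):
    """Drop members whose aligned fragment is below min_ratio of the largest one.

    cluster maps a gene label to its aligned fragment length in amino acids.
    """
    if not cluster:
        return {}
    largest = max(cluster.values())
    return {gene: size for gene, size in cluster.items()
            if size >= min_ratio * largest}

# Hypothetical cluster; only the 93-aa length of E. coli rpsS is from the text.
cluster = {"ecoli_rpsS": 93, "bsub_rpsS": 90, "hypothetical_fragment": 25}
pruned = prune_small_fragments(cluster)
print(pruned)
```

With a largest fragment of 93 amino acids, the 30% rule gives a floor of 27.9 amino acids, which sits just under the 30-amino-acid artifact cutoff mentioned in the text.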
To account for orthologs missed because of the incompleteness of the BLAST comparison, we used the transitivity relationship described above. Care was taken so that the same alignment fragments were used for the transitive relationship. However, this sometimes resulted in two "orthologous genes" being identified from the same genome: the first was identified by direct comparison of genomes A and C (in the absence of the gene pair missed by the BLAST comparison), and the second was derived by the transitive relationship. It was very difficult to find the exact identity score for orthologs derived by transitivity, and the presence of any false positives (caused by homologs missing from the BLAST phase) in the transitivity chain may cause a paralog to be picked up. In such cases, the following criteria were used to select an ortholog: (i) only one gene in each genome, the one with the largest aligned fragment (which was >20% larger than the nearest candidates), was retained if the other gene fragment from the same genome was <75% of the size of the largest fragment; (ii) genes identified by direct pairwise genome comparison were preferred over derived orthologs if the difference between the sizes of the two gene fragments was <20%; and (iii) all genes were retained if their size was >75% of the size of the largest fragment. Derived orthologs (using transitivity) and clusters of orthologs were obtained by using a greedy algorithm, the starting point being a set of clusters of fully conserved genes identified from individual genome comparisons with all other genomes. The clusters with orthologs in the maximum number of genomes were joined first.
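The transitive derivation step can be illustrated with a small sketch. It is a simplification under stated assumptions: genes are represented as hypothetical `(genome, gene)` tuples, ortholog pairs come from direct pairwise comparison, and the check that the same alignment fragments are used along the chain (which the text requires) is omitted.

```python
from collections import defaultdict
from itertools import combinations

def derive_transitive_pairs(direct_pairs):
    """Return ortholog pairs implied by X~Y and Y~Z but absent from direct_pairs.

    Each gene is a (genome, gene_name) tuple. Pairs within one genome are skipped.
    """
    neighbors = defaultdict(set)
    for a, b in direct_pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)
    direct = {frozenset(pair) for pair in direct_pairs}

    derived = set()
    for linked in neighbors.values():
        for x, z in combinations(sorted(linked), 2):
            different_genomes = x[0] != z[0]
            if different_genomes and frozenset((x, z)) not in direct:
                derived.add(frozenset((x, z)))
    return derived

# Gene X in genome A matches Y in B, and Y matches Z in C,
# but BLAST missed the direct X-Z pair.
direct_pairs = [
    (("A", "X"), ("B", "Y")),
    (("B", "Y"), ("C", "Z")),
]
derived = derive_transitive_pairs(direct_pairs)
print(derived)
```

A derived pair found this way is only a candidate. The selection criteria (i) to (iii) above would still have to be applied when a direct and a derived ortholog from the same genome compete.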
Two clusters of orthologs (containing common orthologs) were merged by using the following criterion: if the sizes of the largest gene fragments in the two clusters differed by <20% and both largest genes were orthologs, then the clusters were merged completely; otherwise, only genes whose size was within 80% of the size of the ortholog were copied from the second cluster into the first cluster. The rationale for this criterion is that genomes having orthologs with larger aligned fragments are more likely to be closely related. However, the fact that two genes from two genomes are orthologous to a common smaller fragment of a third genome does not ensure that the two genomes having the larger fragments are closely related unless this is established by direct pairwise comparison of those two genomes. Genes specific to a set of organisms were found by first identifying all of the clusters of orthologs containing at least one genome in the set and then filtering out all of the clusters that contained microbes which are not members of the set.
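The two-step search for set-specific genes described at the end of this paragraph can be sketched as follows. The data layout is an assumption made for illustration: each cluster is a collection of hypothetical `(genome, gene)` tuples, and the genome and gene labels are invented.

```python
def clusters_specific_to(clusters, genome_set):
    """Keep clusters that touch genome_set and contain no genome outside it."""
    genome_set = set(genome_set)
    specific = []
    for cluster in clusters:
        genomes = {genome for genome, _gene in cluster}
        touches_set = bool(genomes & genome_set)        # step 1: at least one member
        only_members = genomes <= genome_set            # step 2: no outsiders
        if touches_set and only_members:
            specific.append(cluster)
    return specific

# Hypothetical clusters over invented genome labels.
clusters = [
    [("arch1", "g1"), ("arch2", "g7")],                  # archaeal genomes only
    [("arch1", "g2"), ("bact1", "g9")],                  # mixed, so filtered out
    [("bact1", "g3"), ("bact2", "g4")],                  # no archaeal genome
]
archaeal_only = clusters_specific_to(clusters, {"arch1", "arch2", "arch3"})
print(archaeal_only)
```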
E. coli does not appear to be specifically related to B. subtilis, but ca. 34% of its genome is represented by orthologs in B. subtilis. If a line is drawn from the origin of Fig. 3C through the point representing B. subtilis, it may be taken to represent the approximate empirical relationship between orthologs and genome size. Thus, ca. 8.2% of the E. coli genes are expected to be present as orthologs in a bacterial genome of 1,000 genes. H. influenzae contains 1,717 protein-coding genes; thus, E. coli should share ca. 14% of its genome with H. influenzae, based on the sizes of their genomes, if they are not specifically related. In fact, 31% of the E. coli genome is represented by orthologs in H. influenzae, more than twice the number expected by size alone, indicating that they are in fact specifically related. Likewise, V. cholerae should share 32% of the E. coli genome based on size but actually shares 47%, indicating that they, too, are related. Tiny Buchnera sp. should share <5% of the E. coli genome by this measure but contains 13%, again indicating a specific relationship. Neisseria meningitidis appears to be related to E. coli by this criterion but to a lesser extent, as seen by a percentage only slightly greater than that represented by the scatter in the line. A self-consistent picture is obtained when other species are plotted in a similar manner to those in Fig. 3. Thus, for those species apparently related to E. coli, a greater-than-expected percentage of, e.g., V. cholerae genes is present in E. coli, H. influenzae, Pasteurella multocida, N. meningitidis, and Buchnera sp. Likewise, the largest percentage of H. influenzae genes is present in P. multocida, E. coli, V. cholerae, N. meningitidis, and Buchnera sp., and so on.
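The size correction used in this paragraph amounts to a line through the origin, and the H. influenzae case can be checked with a few lines of code. This sketch assumes a strictly linear relationship with the slope quoted in the text (8.2% of the E. coli genes per 1,000 genes in the other genome); the function names are illustrative.

```python
SLOPE_PCT_PER_1000_GENES = 8.2  # from the text, for E. coli as the reference genome

def expected_share_pct(n_genes, slope=SLOPE_PCT_PER_1000_GENES):
    """Percentage of the reference genome expected in a genome of n_genes by size alone."""
    return slope * n_genes / 1000.0

def fold_over_expected(observed_pct, n_genes, slope=SLOPE_PCT_PER_1000_GENES):
    """Observed share divided by the share expected from genome size."""
    return observed_pct / expected_share_pct(n_genes, slope)

# H. influenzae: 1,717 protein-coding genes, 31% of the E. coli genome observed.
h_inf_expected = expected_share_pct(1717)
h_inf_fold = fold_over_expected(31.0, 1717)
print(f"expected {h_inf_expected:.1f}%, observed 31%, fold {h_inf_fold:.1f}")
```

The result, about 14% expected against 31% observed, matches the "more than twice" stated above.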
It has been argued on the basis of 16S rRNA comparisons that archaebacteria are so different that they belong in a separate domain equivalent in rank to that of the bacteria and of the eukaryotes (74) . If archaebacteria do belong in a separate domain, then that difference should be reflected in the gene content as was found above for the eukaryotes in comparison with bacteria . In fact, the archaea group with the bacteria in all cases, that is, within error limits they share as many of their genes with the bacteria as distantly related bacteria of the same size share with one another and on this basis should not be considered to be in a separate domain . Mayr (40) previously argued on different grounds that archaebacteria did not deserve the status of a higher taxon, and our data on gene content are consistent with that conclusion . The archaebacteria fall on the line defined by B . subtilis in Fig . 3C, indicating that, on a normalized basis, they share no more nor less of the E . coli genes than do B . subtilis, P . aeruginosa, Mycobacterium tuberculosis, or Synechocystis sp . Aeropyrum pernix is an exception in that it consistently has somewhat fewer of the bacterial genes than do the other archaebacteria although it is still within error limits of being the same . The plot of the percentage of the Methanobacterium thermoautotrophicum genes present in other species in Fig . 3D shows that the archaebacteria are specifically related to one another as first proposed by Woese (71) . Although they do not belong in a separate domain according to our analysis, they do represent a major subdivision of bacteria such as the coliforms, the actinomycetes, and the bacilli . As expected by its lifestyle, Methanococcus jannaschii shares significantly more genes with Methanobacterium thermoautotrophicum than do other species . Archaeoglobus fulgidus also shares a surprising number of genes with the methanogens . 
Pyrococcus abyssi and Pyrococcus horikoshii are specifically related, as are Halobacterium sp . and Thermoplasma acidophilum, but to a lesser extent . Aeropyrum pernix is only slightly above the scatter in the line defined by P . aeruginosa and the other species of bacteria and in fact Aquifex aeolicus and Thermotoga maritima show greater apparent similarity in their gene content to the methanogens than does Aeropyrum pernix as previously observed by Nelson et al . (44) . This is yet another indication that the archaebacteria are not deserving of placement in a higher taxon . The above analysis is based upon gene content . However, another, independent, criterion of relatedness is the degree of similarity in pairs of orthologs . One should observe approximately the same relationship among species using any gene, provided it is neither a paralog nor resulting from recent gene transfer . When all of the orthologs in a pair of species are aligned and the mean percentage identities are determined, they provide an overall measure of relatedness that is superior to any single gene comparison . By simultaneously comparing all orthologs, any errors resulting from misidentification and misalignment are minimized . As shown in Fig . 5, the orthologous proteins range from highly conserved to barely recognizable . However, the percent identity of orthologs is more or less normally distributed except where the species are very closely or very distantly related, in which cases the distributions are markedly skewed and the median identity becomes a better measure of relatedness than does the mean . Such skew does not qualitatively affect the conclusions and was not taken into account because those comparisons are obvious . The mean of the distribution was calculated for all orthologs of each species pair as shown in Fig . 6 . These data are plotted in Fig . 
7, where it is apparent that the mean of the mean percent identity for recognizable orthologs in all species is 36.9% (a number very similar to the 36.2% median identity observed by Nolling et al. [47] for Clostridium acetobutylicum in comparison to other species). The standard deviation for this plot is 1.8; thus, species pairs beyond two or three standard deviations (95.4 to 99.7% of the comparisons), i.e., at 40.5 to 42.3% mean identity in their orthologs, may be considered significantly more closely related than the average, as highlighted in Fig. 6.
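The two thresholds follow directly from the mean and standard deviation quoted above. The short sketch below recomputes them and expresses any mean identity as a number of standard deviations above the overall mean. It assumes the distribution is treated as normal, as the text does, and the function names are illustrative.

```python
MEAN_IDENTITY = 36.9  # mean of the mean percent identities, from the text
STD_DEV = 1.8         # standard deviation of that distribution, from the text

def threshold(k):
    """Mean identity lying k standard deviations above the overall mean."""
    return MEAN_IDENTITY + k * STD_DEV

def z_score(mean_identity_pct):
    """Number of standard deviations by which a species pair exceeds the mean."""
    return (mean_identity_pct - MEAN_IDENTITY) / STD_DEV

two_sd = round(threshold(2), 1)
three_sd = round(threshold(3), 1)
print(two_sd, three_sd)
```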
More distantly related species than those in the same genus can also be recognized by the mean percentage identity in their orthologs, a finding which is consistent with the fraction of the genome present in other species . Thus, E . coli, H . influenzae, P . multocida, Buchnera sp., and V . cholerae are specifically related to one another by both criteria . Buchnera sp., which has 99% of its genome represented in E . coli (1.5 times expected), shows 57% mean identity in its orthologs . H . influenzae, which has 76% of its genes represented in E . coli (1.5 times expected), also shows 57% identity in its orthologs . P . multocida has 75% of its genome in E . coli (1.5 times expected) and shows 57% mean identity in its orthologs . V . cholerae has 52% of its genome in E . coli (1.5 times expected) and shows 55% mean identity in its orthologs . H . influenzae has 81% of its genome in P . multocida (three times expected) and shows 72% mean identity in its orthologs . In fact, P . multocida and H . influenzae are similar enough to be considered members of the same genus by this measure . P . multocida and H . influenzae are slightly closer to E . coli than to V . cholerae, with 70% of their genomes in V . cholerae (1.5 times expected) and 55% mean identity of orthologs . Buchnera sp . is also closer to E . coli than to either H . influenzae or V . cholerae . Although the fraction of the smaller genomes present in E . coli is not very large (ca . 1.5 times), it is more significant for comparisons among the smaller species (up to three- to fourfold) . The gram-negative bacteria P . aeruginosa, N . meningitidis, and Xylella fastidiosa are somewhat more divergent from E . coli and relatives, with ca . 48 to 44% mean identity of their orthologs, respectively . 
These comparisons are still quite significant at more than four standard deviations from the mean, and the gene contents for many of the comparisons for the group as a whole are greater than that expected by size alone . In this instance, the mean identity of orthologs appears to give a clearer picture of relatedness than does gene content . There are species that show even less similarity to the coliforms than do P . aeruginosa, N . meningitidis, and X . fastidiosa and thus fall into a marginal category . Although individual comparisons are not in themselves very significant since they are within one standard deviation of the normal distribution, the Rickettsia prowazekii orthologs are consistently more like the coliforms and relatives than they are to other species, i.e., with 39 to 40% mean identity and as many as twice the number of expected orthologs in some cases . It could be argued that D . radiodurans is slightly closer to this group of species and even shows 41% mean identity in orthologs to P . aeruginosa, although it is not obvious from gene content . Campylobacter jejuni and Helicobacter pylori are clearly related to one another at the level of 47% mean identity in their orthologs and show more than three times greater similarity in gene content than expected . They consistently show slightly greater similarity to the coliforms and relatives than to other species (with twice the expected gene content but with only an insignificant 38 to 39% mean identity) . Aquifex aeolicus and T . maritima are related to one another at the level of 41% mean identity and 2.5 to 3 times the expected gene content . They also seem to be related to C . jejuni and H . pylori, with 2 to 2.5 times the expected gene content but with only 38 to 39% mean identity of orthologs . This similarity appears to extend to P . multocida, H . influenzae, and N . meningitidis . However, these results need to be corroborated by additional studies . 
Treponema pallidum and Borrelia burgdorferi show marginal similarity to one another at 40% mean identity in orthologs, but with more than five times the expected similarity in gene content . They also show greater-than-expected similarity in gene content to a number of both gram-negative and gram-positive species, but they cannot be assigned to a specific group at present . Although there is no specific relationship to the coliforms or to the spirochaetes, it is also possible to show a weak relationship between U . urealyticum and the mycoplasmas (M . genitalium and M . pneumoniae) at 40 to 41% mean identity in orthologs and a much more significant nine times the expected gene content . However, Thermoplasma acidophilum is not part of this small group . Lactococcus lactis shares twice as many genes as expected with B . subtilis and shows 44% mean identity in its orthologs . The Mycoplasma spp . and U . urealyticum share more genes with B . subtilis and L . lactis than expected by factors of 2 to 3, but the mean identity of orthologs is not significant (37 to 39%) . In this instance, gene content appears to be more informative than does the mean identity of orthologs . The two measures of relatedness need not necessarily give the same result, although it is more believable when they track together . In fact, there may be some limitations on the utility of mean identity of orthologs as discussed below . Of the eight archaebacteria analyzed, six can be clearly related to one another, and the remaining two are marginally related . The two Pyrococcus spp . (Pyrococcus abyssi and Pyrococcus horikoshii) are the most closely related species at more than 79% mean identity of orthologs and with more than seven times as many shared genes as expected . The two methanogens show the second most significant relationship to one another at 46% mean identity of orthologs and more than five times as many shared genes as expected . 
They are slightly more distant from the Pyrococcus species at 44% mean identity and with three to five times as many shared genes as expected. Archaeoglobus fulgidus is more distant at 43 to 44% mean identity to the first four and with three to four times as many shared genes as expected. Aeropyrum pernix is even less related to these five species at 39 to 41% mean identity, with slightly greater similarity to Pyrococcus and three to four times as many shared genes as expected. Halobacterium sp. and Thermoplasma acidophilum are the most distant from the other archaebacteria in that they show a marginal 38 to 40% mean identity in orthologs and share about three times as many genes as expected based upon size alone. The evolutionary position of Aquifex aeolicus and T. maritima is very interesting for several reasons. They are reported to be among the most ancient of bacteria and share many biochemical similarities with the archaebacteria. For example, they are extreme thermophiles and contain ether-linked lipids. As indicated above, their orthologs average 41% identity to one another, and they share nearly three times as many genes as expected. They also appear to share some similarity to the coliforms and related gram-negative bacteria. All of the archaebacteria show two- to fourfold more shared genes with Aquifex aeolicus and T. maritima than expected, although Pyrococcus abyssi is the only species in which mean identity of orthologs is slightly closer than average (to T. maritima, 40%). On the other hand, from the standpoint of Aquifex aeolicus and T. maritima, neither shows significantly greater similarity to the archaebacteria than to the other bacteria. Thus, these two species appear to share characteristics of both gram-negative bacteria and archaebacteria, but further study is required to determine their precise relationships.
An interesting observation from whole-genome analysis is that not all species can be specifically related to the others; some, such as Mycobacterium tuberculosis or Synechocystis sp., are no closer to any other species than average by either criterion. Three relatively large groups of related species do stand out: the coliforms and certain other gram-negative species, B. subtilis and certain other gram-positive species, and the archaebacteria. The "tree of life" based upon 16S rRNA purports to relate all species in a hierarchical fashion. However, we know that there is significant gene transfer and duplication among species of bacteria that might compromise single-gene comparisons. The simultaneous analysis of all orthologs should minimize the negative effects of gene transfer, duplication, and misalignment on evolutionary studies, but even that does not allow all species to be specifically related to one another, since we have found that the mean percent identity of orthologs approaches a limit of ca. 37% for the most distantly related species and is not reliable for taxa much above the family level. It is thought that slowly evolving genes (those showing the largest overall percent identity) can reveal relationships for the most distantly related species. However, we believe that the slowly evolving genes only make it easier to align sequences and that all orthologous genes will give virtually the same result, barring unforeseen gene transfer or paralogy and provided that they can be aligned unequivocally. There is an implied assumption that there is a positive correlation between slowly evolving genes (those with high sequence identity) and conserved genes (those found in the majority of species), but neither this assumption nor its inverse is necessarily correct.
It is also commonly assumed but not necessarily true that the most highly conserved genes are unlikely to be duplicated or transferred and therefore result in more reliable evolutionary trees . The individual comparison of more than a dozen proteins that are found in all species in our study (data not shown) revealed that species which are clearly related as deduced from whole-genome analyses are also obvious from single-gene comparisons . However, the distantly related and marginal species from whole-genome comparisons cannot be precisely and consistently positioned in single-gene trees . The single-gene analysis was particularly uncertain concerning the position of the two methanogens with respect to Archaeoglobus, but mean identity of orthologs and gene content clearly showed that the methanogens are more closely related to one another than to Archaeoglobus, as they should be considering their lifestyles . Halobacterium is generally placed among the methanogens in rRNA trees, and its position is uncertain from the other single-gene comparisons, but it is clearly one of the most divergent of the archaebacteria from whole-genome analysis . Aeropyrum pernix is thought to be the most divergent of the archaebacteria included in this study but is closer to the methanogens than is Halobacterium sp . based upon whole-genome analysis . Thus, we believe that whole-genome analysis can resolve at least some of the uncertainties from single-gene analyses and should be the method of choice where whole-genome sequences are available .
It was previously shown that the percentage of orthologs in one species is related to the size of the other genome (4, 61), although it was not quantified . We can empirically determine that relationship for each species by comparison with a large, presumably distantly related species . We chose B . subtilis for comparison of most of the gram-negative bacteria and P . aeruginosa for most of the gram-positive species . Synechocystis sp., Mycobacterium tuberculosis, or D . radiodurans would also work since they do not appear to have any near relatives in our database . Thus, ca . 8% or 340 of the E . coli genes should be found in unrelated species for every 1,000 genes they contain . We have found that this holds true for most of the larger bacterial species . However, as the genome size of the first species drops below ca . 2,000 genes, then the percentage of genes found in the other species increases to twice that frequency for the smallest genomes sequenced to date . That is probably because there is a greater percentage of essential genes in the smaller genomes . Thus, ca . 16% or 90 of the Buchnera sp . genes should be found among every 1,000 genes of other species based upon the size of the genomes alone . Either way it is viewed, from the Buchnera sp . or the E . coli standpoint, these two species share more genes than expected by factors of 1.4- to 2.9-fold (i.e., an actual 13.1% of the E . coli genome versus an expected 4.5% or an actual 99.5% of the Buchnera sp . genome versus an expected 70%), which suggests that they are specifically related . We can obtain some measure of the significance of this comparison by estimating the deviation from the fitted line in the D . radiodurans plot (2.8%, which is about the same as apparent for the E . coli plot and about half of that in the Buchnera sp . plot) . This indicates that the greater-than-expected similarity between E . coli and Buchnera sp . is also significantly greater than the average deviation . 
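As a check on the range quoted above, the two fold values are simply the ratios of actual to expected percentages, taken from the E. coli and the Buchnera sp. standpoints, respectively (all four percentages are from the text):

```latex
\[
\frac{13.1\%}{4.5\%} \approx 2.9,
\qquad
\frac{99.5\%}{70\%} \approx 1.4
\]
```

The larger ratio arises from the E. coli standpoint because the expected share there is small, so the same set of shared genes represents a larger relative excess.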
Although Snel et al . (61) recognized the effects of genome size on the numbers of shared genes, they nevertheless constructed a gene content tree, apparently without correction for size, that they found to be similar to rRNA trees . We did not construct a tree because the corrections for genome size are not sufficiently precise and because not all species can be specifically related by this measure . That is, some species do not deviate significantly more than the scatter in the plot . However, we were able to determine that eight species of coliforms and gram-negative bacteria are specifically related to one another, that six species of bacilliform and gram-positive bacteria are related, and that the eight species of archaebacteria are also specifically related to one another . Even so, that is hardly sufficient information to build a tree or trees . We have also established that the archaea are in fact just bacteria in terms of their gene content . Eukaryotes are clearly different from both archaebacteria and other bacteria in terms of the numbers of shared orthologs . That is, the eukaryotes have far fewer genes in common with the bacteria or prokaryotes as a whole than expected based upon the size relationship established above . If eukaryotes freely shared genetic information with bacteria to the same extent that bacteria do among themselves, then ca . 50% of the E . coli genome should be present in S . cerevisiae (2,145 genes as opposed to the actual 706 orthologs or one-third of the expected amount) . The observation by Olsen et al . (48) that the distinction between eukaryotes and prokaryotes has become blurred as a result of rRNA comparisons is clearly not supported by whole-genome analysis . There is in fact a very distinct separation in terms of gene content . Although the method we used to normalize the gene content data was also used by Tekaia et al . (65), they came to very different conclusions than we do . 
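The "one-third" figure for S. cerevisiae follows from the two gene counts given above. The second relation below only restates the 50% expectation and implies a total of roughly 4,290 E. coli genes, a value inferred here from the text's numbers rather than stated in it:

```latex
\[
\frac{706}{2145} \approx 0.33,
\qquad
\frac{2145}{0.50} = 4290
\]
```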
This could be because they did not take into account the effect of size of the genomes on the apparent similarity . In addition to normalization of the data to the percentage of the genome present in other species, they performed correspondence analysis and constructed several trees . Unfortunately, their genomic trees changed topology depending upon whether or not Mycoplasma was included . The number of organisms in the data set should have no effect on topology at all, but this is another common problem with evolutionary studies . In addition to the lack of consistency in their trees, the initial apparent relationships they found were not plausible . We know from other studies that E . coli and H . influenzae are specifically related, but they appear on different branches of the first two genomic trees of Tekaia et al . (65) . When only the unique genes of each genome were compared (duplications ignored), E . coli and H . influenzae were properly clustered . However, the lack of consistency in the three reported trees is disturbing . It was noted that there was a strong resemblance between the genomic trees and the 16S rRNA trees, but one has to ask which RNA trees because it is also true that the various published rRNA trees lack consistency and change topology as new species are added to existing trees or as the order of addition is changed . Another approach to whole-genome analysis is the comparison of clusters of orthologous groups of proteins (COGs) employed by Natale et al . (43) . Although their focus was on archaebacteria, these authors considered many of the same species as in our study and constructed trees based upon the co-occurrence of COGs or families of proteins rather than on individual genes . These trees showed several improbably close relationships such as between E . coli and B . subtilis or between H . influenzae and H . 
pylori which these authors recognized as anomalous but explained that it was due to a mixed reflection of phylogenetic relationships and similarities in gene repertoires related to the lifestyles of the organisms . However, those are largely the same thing; one determines phylogenetic relationships based upon similarities in gene content which are related to lifestyles . If all bacteria had the same lifestyle, they would largely have the same gene content and vice versa . Natale et al . (43) normalized their data by dividing the identities by the number of unique COGs but failed to account for the effect of genome size, which is basically the same problem as with the Snel et al . (61) and Tekaia et al . (65) analyses . Natale et al . also failed to explain how the bacteria acquired similar lifestyles, by common descent, by gene transfer, or by a combination of the two . Thus, we believe that gene content is a mixed reflection of phylogenetic relationships and gene transfer . Yet another approach to whole-genome comparison is through determination of the "genomic signature" (22) . Evolutionarily isolated species, e.g., the archaebacteria, are expected to contain a number of unique genes . However, genomic signature analysis presupposes that one already knows how the organisms are related before the analysis is performed . It is therefore not entirely objective, which is the same problem as with 16S rRNA signature analysis (73) . It is possible to find this type of "signature" for any grouping of organisms one wishes to emphasize . It can be useful for designing oligonucleotide probes for ecological studies, but it has little or no evolutionary value . Nevertheless, Graham et al . (22) found 351 clusters of 1,149 genes that were present in at least two of six species of archaebacteria but not in other kinds of bacteria or eukaryotes . These authors apparently found fewer clusters in all nine species considered in their study . 
Our analysis shows that there are 490 orthologs involving 1,694 genes that are specific to two or more of eight archaebacterial genomes. At least 22 orthologs are specifically conserved in all eight species of archaebacteria but not in other species. This is to be contrasted with the 45 orthologs that are conserved in all 37 species considered in our study, including the archaebacteria and eukaryotes. Characteristics previously considered to be unique to archaebacteria include the enzymes and cofactors necessary for methanogenesis, which Graham et al. (22) included in their study but which are also found in the aerobic methylotrophic bacteria that carry out the reverse reaction, the oxidation of methane to CO2 (10). Another such characteristic is the presence of ether-linked lipids, which we now know are also found in Aquifex aeolicus and T. maritima and are fairly common in nature (56). For example, they are even present in the mesophilic sulfate-reducing bacteria Desulfosarcina and Desulforhabdus spp. (55). Their presence in archaebacteria is probably related to the need to maintain the integrity of membranes at hyperthermophilic growth temperatures. We have in fact found distant relationships between the Aquifex aeolicus and T. maritima genomes, and between these genomes and those of the archaebacteria (also observed by Nelson et al. [44]), that may in part account for the shared presence of ether-linked lipids. There is also evidence that some of the other supposedly unique coenzymes of archaebacteria, such as 2-mercaptoethanesulfonate, are present in other kinds of bacteria (35). The cell walls of archaebacteria are supposed to be unique but, in fact, the glycoprotein that is their single constituent is also common in the more complex cell walls of other bacteria (59). For example, gram-positive and gram-negative bacteria contain peptidoglycan as well as glycoprotein, and gram-negative bacteria contain lipopolysaccharide in addition to peptidoglycan and glycoprotein.
Thus, the supposedly unique biological characteristics of archaebacteria are in fact not unique when examined in more detail.
There have been a number of studies with single genes and proteins other than 16S rRNA (7). It is likely that they give conflicting results partly because of the problem of paralogy and partly because of the inherent imprecision in current alignment techniques. Alignment is one of the most important but neglected variables in sequence comparisons. In fact, we believe that alignment or, more precisely, misalignment is one of the main causes of the lack of consistency in previously published trees. That is, variation caused by approximations in the choice of penalties for gaps and mismatches in alignment matrices is likely to be one reason that the topology of trees changes when additional species are added or when the order of addition is altered. Furthermore, we cannot completely eliminate paralogous genes, even for 16S rRNA, which is assumed to be rarely, if ever, transferred between species (72). At least one instance of gene transfer of 16S rRNA has in fact been observed (75). Furthermore, there are two disparate rRNA operons in Haloarcula marismortui (13), and more are likely to become apparent as the number of whole-genome sequences increases.
Yet another problem is that all proteins and rRNA should have a limit to change that is determined by the structure-function relationship (42). As two species diverge, their proteins should asymptotically approach such a limit to change. At the limit, divergence is equally balanced by convergent mutations (the sum of back and parallel mutations), resulting in a steady state that is characteristic for each protein having a well-defined function. Changes in the steady state can result from duplications and alterations in function, which we recognize as one form of paralogy. Practically speaking, trees cannot be constructed for data near the limit of change for a given gene.
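The asymptotic approach to a limit to change can be made concrete with a simple saturation model. This is an illustrative sketch only, not the model of reference 42: it assumes that the fractional identity $I(t)$ between two orthologs relaxes exponentially from 1 toward a limit $I_L$ at a constant rate $k$, and all three symbols are introduced here for illustration.

```latex
\begin{align}
\frac{dI}{dt} &= -k\,\bigl(I - I_L\bigr), \qquad I(0) = 1, \\
I(t) &= I_L + (1 - I_L)\,e^{-kt}, \\
t(I) &= -\frac{1}{k}\,\ln\frac{I - I_L}{1 - I_L}, \\
\left|\frac{dt}{dI}\right| &= \frac{1}{k\,(I - I_L)} \;\longrightarrow\; \infty
\quad \text{as } I \to I_L .
\end{align}
```

Under this assumed model, a fixed uncertainty in the measured identity translates into an ever larger uncertainty in the inferred divergence time as $I$ approaches $I_L$, which is one way to see why branch lengths, and hence trees, become unreliable near the limit.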
The limit of change for genes and proteins has to be empirically determined but, to date, has only been established for some cytochromes (42). It is expected that, as the number of different genes compared increases, limits to change will become generally recognized and errors due to paralogy and misalignment should be minimized. We cannot say how many genes are necessary, but the number of orthologs found in whole-genome comparisons should be more than sufficient. When we plotted the frequency of percent identity of orthologs in pairs of species, we found a more or less normal distribution for most comparisons. Only the distributions for the most similar and most divergent species were markedly skewed. The mean of the distribution for all orthologs in a species pair provides a measure of how closely related the species are. When the frequencies of the means for all species pairs were plotted, they, too, could be fit by a normal distribution. We postulate that the distribution of the means more or less defines the average limit to change for all orthologous proteins as defined in this study. This limit is ca. 37% identity, which is significantly greater than the 5% expected from amino acid composition alone but very similar to the 36% median identity determined by Nolling et al. (47) for a smaller number of species than was considered here. Only species comparisons significantly outside the normal distribution, which we have taken to be two or more standard deviations above the mean (i.e., 40% identity or more), permit reasonable inferences about specific relationships. We found relatively few such exceptional comparisons. However, they are consistent with the analysis we performed on gene content. Our study is also in agreement with that of Nolling et al. (47), which was published after we completed our analysis and had no influence on the outcome.
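The two-step summary described above (a mean identity per species pair, then a cutoff two standard deviations above the mean of those means) can be sketched as follows. This is a minimal illustration, not the software used in this study: the function names, the species labels, and the identity values are hypothetical, and the sample standard deviation is an assumed choice.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def pair_means(identities: Dict[Pair, List[float]]) -> Dict[Pair, float]:
    """Mean percent identity over all orthologs of each species pair."""
    return {pair: mean(values) for pair, values in identities.items()}


def exceptional_pairs(means: Dict[Pair, float],
                      n_sd: float = 2.0) -> Tuple[float, List[Pair]]:
    """Return the cutoff (mean of the pair means plus n_sd sample standard
    deviations) and the pairs whose mean identity reaches it."""
    values = list(means.values())
    cutoff = mean(values) + n_sd * stdev(values)
    flagged = sorted(pair for pair, m in means.items() if m >= cutoff)
    return cutoff, flagged


# Hypothetical percent identities of orthologs for each species pair.
identities = {
    ("sp1", "sp2"): [30, 42],
    ("sp1", "sp3"): [35, 39],
    ("sp1", "sp4"): [33, 43],
    ("sp1", "sp5"): [31, 41],
    ("sp2", "sp3"): [36, 38],
    ("sp2", "sp4"): [34, 42],
    ("sp2", "sp5"): [32, 42],
    ("sp3", "sp4"): [30, 42],
    ("sp3", "sp5"): [37, 39],
    ("sp4", "sp5"): [55, 65],   # one unusually similar pair
}

means = pair_means(identities)
cutoff, flagged = exceptional_pairs(means)
print(f"cutoff = {cutoff:.1f}% identity")   # roughly 53.9 for this toy data
print("specifically related pairs:", flagged)
```

In this toy data set a single outlier inflates both the mean and the standard deviation, so the cutoff sits far above the bulk of the pairs. With real data the cutoff should be estimated from many species pairs, as was done here for the 37 genomes.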
Our results strongly suggest that all-encompassing evolutionary trees cannot or should not be constructed for bacterial comparisons that approach such a limit to change. Gene content does not have the same limitations on its use, and that may be why some specific relationships were observed from gene content but not indicated by the mean identity of orthologs. Nevertheless, using our approach to whole-genome analysis, which considers both gene content and the mean identity of orthologous genes, we have found specific relationships among the majority of the 37 species considered in this study.