What Is a Genome?

In biology, the genome of an organism is the whole of its hereditary information, encoded in DNA (or, for some viruses, RNA). This includes both the genes and the non-coding sequences. More precisely, the genome of an organism is a complete DNA sequence of one set of chromosomes; for example, one of the two sets that a diploid individual carries in every somatic cell. When people say that the genome of a sexually reproducing species has been "sequenced," they are typically referring to a determination of the sequences of one set of autosomes and one of each type of sex chromosome, which together represent both of the possible sexes. Even in species that exist in only one sex, what is described as "a genome sequence" may be a composite from the chromosomes of various individuals. In general use, the phrase genetic makeup is sometimes used conversationally to mean the genome of a particular individual or organism. The study of the global properties of genomes of related organisms is usually referred to as genomics, which distinguishes it from genetics, which generally studies the properties of single genes or groups of genes.

Types of genomes

Most biological entities more complex than a virus sometimes or always carry additional genetic material besides that which resides in their chromosomes. In some contexts, such as sequencing the genome of a pathogenic microbe, "genome" is meant to include this auxiliary material, which is carried in plasmids. In such circumstances, "genome" describes all of the genes and non-coding DNA that have the potential to be present. In vertebrates such as humans, however, "genome" carries the typical connotation of only chromosomal DNA. So although human mitochondria contain genes, these genes are not considered part of the genome. In fact, mitochondria are sometimes said to have their own genome, often referred to as the "mitochondrial genome".
Genomes and genetic variation

Note that a genome does not capture the genetic diversity or the genetic polymorphism of a species. For example, the human genome sequence could in principle be determined from just half the DNA of one cell from one individual. Learning what variations in DNA underlie particular traits or diseases requires comparisons across individuals. This point explains the common usage of "genome" (which parallels a common usage of "gene") to refer not to any particular DNA sequence, but to a whole family of sequences that share a biological context. Although this concept may seem counterintuitive, it is the same concept that says there is no particular shape that is the shape of a cheetah. Cheetahs vary, and so do the sequences of their genomes. Yet both the individual animals and their sequences share commonalities, so one can learn something about cheetahs and "cheetah-ness" from a single example of either.

Genome projects

The Human Genome Project was organized to map and to sequence the human genome. Other genome projects include mouse, rice, the plant Arabidopsis thaliana, the puffer fish, and bacteria such as E. coli. Many genomes have been sequenced by various genome projects. The cost of sequencing continues to drop, and it is possible that eventually an individual's genome could be sequenced for several thousand dollars (US).

Genome evolution

Genomes are more than the sum of an organism's genes and have traits that may be measured and studied without reference to the details of any particular genes and their products. Researchers compare traits such as chromosome number, chromosome size, gene order, codon usage bias, and G-C content to determine what mechanisms could have produced the great variety of genomes that exist today. Duplications play a major role in shaping the genome.
Duplications may range from extension of short tandem repeats, to duplication of a cluster of genes, all the way to duplications of entire chromosomes or even entire genomes. Such duplications are probably fundamental to the creation of genetic novelty. Horizontal gene transfer is invoked to explain the often extreme similarity between small portions of the genomes of two organisms that are otherwise very distantly related. Horizontal gene transfer seems to be common among many microbes. Also, eukaryotic cells seem to have experienced a transfer of some genetic material from their chloroplast and mitochondrial genomes to their nuclear chromosomes.

Genomics is the study of an organism's genome and the use of the genes. It deals with the systematic use of genome information, associated with other data, to provide answers in biology, medicine, and industry. Genomics has the potential to offer new therapeutic methods for the treatment of some diseases, as well as new diagnostic methods. Other applications are in the food and agriculture sectors. The major tools and methods related to genomics are bioinformatics, genetic analysis, measurement of gene expression, and determination of gene function. Genomics appeared in the 1980s and took off in the 1990s with the initiation of genome projects for several species. The related field of genetics is the study of genes and their role in inheritance.

The first genome to be sequenced in its entirety was that of bacteriophage ΦX174 (5,386 bp) in 1977. The first free-living organism to have its genome sequenced was Haemophilus influenzae (1.8 Mb) in 1995, and since then genomes have been sequenced at a rapid pace. A rough draft of the human genome was completed by the Human Genome Project in early 2001 amid much fanfare. Comparison of genomes has resulted in some surprising biological discoveries.
If a particular DNA sequence or pattern is present among many members of a clade, that sequence is said to have been conserved among the species. Evolutionary conservation of a DNA sequence may imply that it confers a relative selective advantage to the organisms that possess it. Conservation also suggests that the sequence has functional significance. It may be a protein-coding sequence or a regulatory region. Experimental investigation of some of these sequences has shown that some are transcribed into small RNA molecules, although the functions of these RNAs were not immediately apparent. The identification of similar sequences (including many genes) in two distantly related organisms, but not in other members of one of the clades, has led to the theory that these sequences were acquired by horizontal gene transfer. This phenomenon is most prominent in thermophilic bacteria, where it seems that genes were transferred from Archaea to Eubacteria. It has also been noticed that bacterial genes exist in eukaryotic nuclear genomes and that these genes generally encode mitochondrial and plastid proteins, giving support to the endosymbiotic theory of the origin of these organelles.

It is often stated that a particular organism shares X percent of its DNA with humans. This number indicates the percentage of base pairs that are identical between the two species. Here is a list of genetic similarity to humans, with sources, where known. While these numbers come from various secondary sources, the data may have originated from measures of DNA-DNA hybridization or from direct sequence comparisons.

Informally, omics is a neologism referring to a field of study in biology ending in the suffix -omics, such as genomics or proteomics. The related neologism omes refers to the objects of study of such fields, such as the genome or proteome, respectively (omes stems from the Greek for 'all', 'every' or 'complete').
The original use of the suffix "ome" was in the word "genome", which refers to the complete genetic makeup of an organism. Because of the success of large-scale quantitative biology projects such as genome sequencing, the suffix "ome" has been extended to a host of other contexts. Bioinformaticians and some molecular biologists were amongst the first scientists to apply the "ome" suffix widely. New 'omes' are currently being defined by people in the field of bioinformatics. Observers have claimed that they vie for the most ridiculous 'ome', much like the humorous names assigned to genes in the field of Drosophila developmental genetics. For example, son of sevenless and Darth Vader are both genes involved in regulating the cell cycle. The omes are a useful way for computational biologists to encapsulate a particular class of cellular processes or information-processing mechanisms.

Bioinformatics or computational biology is the use of mathematical and informational techniques, including statistics, to solve biological problems, usually by creating or using computer programs, mathematical models or both. One of the main areas of bioinformatics is the data mining and analysis of the data gathered by the various genome projects. Other areas are sequence alignment, protein structure prediction, systems biology, protein-protein interactions and virtual evolution. In summary, the various genome projects produce many long lists of letters, and one of the roles of bioinformatics is to attempt to determine the words, grammar, sentences and, ultimately, meaning (functional significance) of those letters. Many hope that developments in this field will ultimately help in the discovery of cures for various diseases, including cancer. Since the Epstein-Barr virus was sequenced in 1984, the DNA sequences of more and more organisms have been stored in electronic databases.
These data are analyzed to determine genes that code for proteins, as well as regulatory sequences. A comparison of genes within a species or between different species can show similarities between protein functions, or relations between species (the use of molecular systematics to construct phylogenetic trees). With the growing amount of data, it becomes impossible to analyze DNA sequences manually. Today, computer programs are used to find similar sequences in the genomes of dozens of organisms, within billions of nucleotides. These programs can compensate for mutations (exchanged, deleted or inserted bases) in the DNA sequence, in order to identify sequences that are related, but not identical.

A variant of this sequence alignment is used in the sequencing process itself. So-called shotgun sequencing (which was used, for example, by Celera Genomics to sequence the human genome) does not give a sequential list of nucleotides, but instead the sequences of thousands of small DNA fragments (each about 600 nucleotides long). The ends of these fragments overlap and, aligned in the right way, make up the complete genome. Shotgun sequencing yields sequence data quickly, but the task of reassembling the fragments can be quite complicated for larger genomes. In the case of the Human Genome Project, it took several months on a supercomputer array to align them correctly. Shotgun sequencing is generally preferred for smaller genomes, such as those of bacteria, and is often used at least partially on organisms with much larger genomes.

Another aspect of bioinformatics in sequence analysis is the automatic search for genes and regulatory sequences within a genome. Not all of the nucleotides within a genome are genes. Within the genomes of higher organisms, large parts of the DNA do not serve any obvious purpose. This so-called junk DNA may, however, contain unrecognized functional elements.
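The fragment-overlap idea behind shotgun assembly, as described in this passage, can be sketched in a few lines of Python. This is only a toy greedy merger under strong assumptions (error-free reads, exact overlaps, no repeats, a single strand); the names `overlap` and `greedy_assemble`, the `min_len` parameter, and the example reads are all hypothetical and do not correspond to the software used by Celera Genomics or the Human Genome Project.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    max_len = min(len(a), len(b))
    for size in range(max_len, min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the largest exact overlap."""
    reads = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_len, best_pair = 0, None
        for a, b in permutations(reads, 2):
            size = overlap(a, b, min_len)
            if size > best_len:
                best_len, best_pair = size, (a, b)
        if best_pair is None:  # no overlaps left: keep the contigs separate
            break
        a, b = best_pair
        reads.remove(a)
        reads.remove(b)
        reads.append(a + b[best_len:])  # join, writing the shared part once
    return reads


# Hypothetical reads covering the toy sequence "ATGGCGTACGTTAGC"
reads = ["ATGGCGTAC", "GTACGTTA", "CGTTAGC"]
contigs = greedy_assemble(reads)
print(contigs)
```

Real assemblers must also tolerate sequencing errors, handle both strands and resolve repeated regions, which is a large part of why reassembly becomes so costly for larger genomes.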
Bioinformatics helps to bridge the gap between genome and proteome projects, for example in the use of DNA sequence for protein identification.

Biochemistry, 2000 Sep 5, 39(35), 10877 - 83 Sensitive monitoring of the dynamics of a membrane-bound transport protein by tryptophan phosphorescence spectroscopy; Broos J et al.; This paper presents a tryptophan phosphorescence spectroscopy study on the membrane-bound mannitol transporter, EII(mtl), from E. coli. The protein contains four tryptophans at positions 30, 42, 109, and 117. Phosphorescence decays in buffer at 1 degree C revealed large variations of the triplet lifetimes of the wild-type protein and four single-tryptophan-containing mutants. They ranged from <70 microseconds for the tryptophan at position 109 to 55 ms for the residue at position 30, attesting to widely different flexibilities of the tryptophan microenvironments. The decay of all tryptophans is multiexponential, reflecting multiple stable conformations of the protein. Both mannitol binding and enzyme phosphorylation had large effects on the triplet lifetimes. Mannitol binding induces a more ordered structure near the mannitol binding site, and the decay becomes significantly more homogeneous. In contrast, enzyme phosphorylation induces a large relaxation of the protein structure at the reporter sites. The implications of these structural changes on the coupling mechanism between the transport and the phosphorylation activity of EII(mtl) are discussed. Taken as a whole, our data show that tryptophan phosphorescence spectroscopy is a very sensitive technique to explore conformational dynamics in membrane proteins.
Biochemistry, 2000 Sep 5, 39(35), 10812 - 22 Proteolysis of the exodomain of recombinant protease-activated receptors: prediction of receptor activation or inactivation by MALDI mass spectrometry; Loew D et al.; Protease-activated receptors (PARs) mediate cell activation after proteolytic cleavage of their extracellular amino terminus. Thrombin selectively cleaves PAR1, PAR3, and PAR4 to induce activation of platelets and vascular cells, while PAR2 is preferentially cleaved by trypsin. In pathological situations, other proteolytic enzymes may be generated in the circulation and could modify the responses of PARs by cleaving their extracellular domains. To assess the ability of such proteases to activate or inactivate PARs, we designed a strategy for locating cleavage sites on the exofacial NH(2)-terminal fragments of the receptors. The first extracellular segments of PAR1 (PAR1E) and PAR2 (PAR2E) expressed as recombinant proteins in Escherichia coli were incubated with a series of proteases likely to be encountered in the circulation during thrombosis or inflammation. Kinetic and dose-response studies were performed, and the cleavage products were analyzed by MALDI-TOF mass spectrometry. Thrombin cleaved PAR1E at the Arg41-Ser42 activation site at concentrations known to induce cellular activation, supporting a native conformation of the recombinant polypeptide. Plasmin, calpain and leukocyte elastase, cathepsin G, and proteinase 3 cleaved at multiple sites and would be expected to disable PAR1 by cleaving COOH-terminal to the activation site. Cleavage specificities were further confirmed using activation site defective PAR1E S42P mutant polypeptides. Surface plasmon resonance studies on immobilized PAR1E or PAR1E S42P were consistent with cleavage results obtained in solution and allowed us to determine affinities of PAR1E-thrombin binding. FACS analyses of intact platelets confirmed the cleavage of PAR1 downstream of the Arg41-Ser42 site.
Mass spectrometry studies of PAR2E predicted activation of PAR2 by trypsin through cleavage at the Arg36-Ser37 site, no effect of thrombin, and inactivation of the receptor by plasmin, calpain and leukocyte elastase, cathepsin G, and proteinase 3. The inhibitory effect of elastase was confirmed on native PAR1 and PAR2 on the basis of Ca(2+) signaling studies in endothelial cells. It was concluded that none of the main proteases generated during fibrinolysis or inflammation appears to be able to signal through PAR1 or PAR2. This strategy provides results which can be extended to the native receptor to predict its activation or inactivation, and it could likewise be used to study other PARs or protease-dependent processes.

Biochemistry, 2000 Sep 5, 39(35), 10730 - 8 Steady-state kinetic mechanism of recombinant avocado ACC oxidase: initial velocity and inhibitor studies; Brunhuber NM et al.; The gaseous plant hormone ethylene modulates a wide range of biological processes, including fruit ripening. It is synthesized by the ascorbate-dependent oxidation of 1-aminocyclopropane-1-carboxylate (ACC), a reaction catalyzed by ACC oxidase. Recombinant avocado (Persea americana) ACC oxidase was expressed in Escherichia coli and purified in milligram quantities, resulting in high levels of ACC oxidase protein and enzyme activity. An optimized assay for the purified enzyme was developed that takes into account the inherent complexities of the assay system. Fe(II) and ascorbic acid form a binary complex that is not the true substrate for the reaction and enhances the degree of ascorbic acid substrate inhibition. The K(d) value for Fe(II) (40 nM, free species) and the K(m) values for ascorbic acid (2.1 mM), ACC (62 microM), and O(2) (4 microM) were determined. Fe(II) and ACC exhibit substrate inhibition, and a second metal binding site is suggested.
Initial velocity measurements and inhibitor studies were used to resolve the kinetic mechanism through the final substrate binding step. Fe(II) binding is followed by either ascorbate or ACC binding, with ascorbate being preferred. This is followed by the ordered addition of molecular oxygen and the last substrate, leading to the formation of the catalytically competent complex. Both Fe(II) and O(2) are in thermodynamic equilibrium with their enzyme forms. The binding of a second molecule of ascorbic acid or ACC leads to significant substrate inhibition. ACC and ascorbate analogues were used to confirm the kinetic mechanism and to identify important determinants of substrate binding.

Biochemistry, 2000 Sep 5, 39(35), 10711 - 9 Interaction of flavodoxin with cobalamin-dependent methionine synthase; Hall DA et al.; Cobalamin-dependent methionine synthase catalyzes the transfer of a methyl group from methyltetrahydrofolate to homocysteine, forming tetrahydrofolate and methionine. The Escherichia coli enzyme, like its mammalian homologue, is occasionally inactivated by oxidation of the cofactor to cob(II)alamin. To return to the catalytic cycle, the cob(II)alamin forms of both the bacterial and mammalian enzymes must be reductively remethylated. Reduced flavodoxin donates an electron for this reaction in E. coli, and S-adenosylmethionine serves as the methyl donor. In humans, the electron is thought to be provided by methionine synthase reductase, a protein containing a domain with a significant degree of homology to flavodoxin. Because of this homology, studies of the interactions between E. coli flavodoxin and methionine synthase provide a model for the mammalian system. To characterize the binding interface between E. coli flavodoxin and methionine synthase, we have employed site-directed mutagenesis and chemical cross-linking using carbodiimide and N-hydroxysuccinimide.
Glutamate 61 of flavodoxin is identified as a cross-linked residue, and lysine 959 of the C-terminal activation domain of methionine synthase is assigned as its partner. The mutation of lysine 959 to threonine results in a diminished level of cross-linking, but has only a small effect on the affinity of methionine synthase for flavodoxin. Identification of these cross-linked residues provides evidence in support of a docking model that will be useful in predicting the effects of mutations observed in mammalian homologues of E. coli flavodoxin and methionine synthase.

Biochemistry, 2000 Sep 5, 39(35), 10702 - 10 Structural analysis of glyceraldehyde 3-phosphate dehydrogenase from Escherichia coli: direct evidence of substrate binding and cofactor-induced conformational changes; Yun M et al.; The crystal structures of glyceraldehyde 3-phosphate dehydrogenase (GAPDH) from Escherichia coli have been determined in three different enzymatic states, NAD(+)-free, NAD(+)-bound, and hemiacetal intermediate. The NAD(+)-free structure reported here has been determined from monoclinic and tetragonal crystal forms. The conformational changes in GAPDH induced by cofactor binding are limited to the residues that bind the adenine moiety of NAD(+). Glyceraldehyde 3-phosphate (GAP), the substrate of GAPDH, binds to the enzyme with its C3 phosphate in a hydrophilic pocket, called the "new P(i)" site, which is different from the originally proposed binding site for inorganic phosphate. This observed location of the C3 phosphate is consistent with the flip-flop model proposed for the enzyme mechanism {Skarzynski, T., Moody, P. C., and Wonacott, A. J. (1987) J. Mol. Biol. 193, 171-187}. Via incorporation of the new P(i) site in this model, it is now proposed that the C3 phosphate of GAP initially binds at the new P(i) site and then flips to the P(s) site before hydride transfer.
A superposition of NAD(+)-bound and hemiacetal intermediate structures reveals an interaction between the hydroxyl oxygen at the hemiacetal C1 of GAP and the nicotinamide ring. This finding suggests that the cofactor NAD(+) may stabilize the transition state oxyanion of the hemiacetal intermediate in support of the flip-flop model for GAP binding.

Biochemistry, 2000 Sep 5, 39(35), 10662 - 76 Evolution of enzymatic activity in the enolase superfamily: structure of o-succinylbenzoate synthase from Escherichia coli in complex with Mg2+ and o-succinylbenzoate; Thompson TB et al.; The X-ray structures of the ligand free (apo) and the Mg(2+)*o-succinylbenzoate (OSB) product complex of o-succinylbenzoate synthase (OSBS) from Escherichia coli have been solved to 1.65 and 1.77 Å resolution, respectively. The structure of apo OSBS was solved by multiple isomorphous replacement in space group P2(1)2(1)2(1); the structure of the complex with Mg(2+)*OSB was solved by molecular replacement in space group P2(1)2(1)2. The two domain fold found for OSBS is similar to those found for other members of the enolase superfamily: a mixed alpha/beta capping domain formed from segments at the N- and C-termini of the polypeptide and a larger (beta/alpha)(7)beta barrel domain. Two regions of disorder were found in the structure of apo OSBS: (i) the loop between the first two beta-strands in the alpha/beta domain; and (ii) the first sheet-helix pair in the barrel domain. These regions are ordered in the product complex with Mg(2+)*OSB. As expected, the Mg(2+)*OSB pair is bound at the C-terminal end of the barrel domain. The electron density for the phenyl succinate component of the product is well-defined; however, the 1-carboxylate appears to adopt multiple conformations. The metal is octahedrally coordinated by Asp(161), Glu(190), and Asp(213), two water molecules, and one oxygen of the benzoate carboxylate group of OSB.
The loop between the first two beta-strands in the alpha/beta motif interacts with the aromatic ring of OSB. Lys(133) and Lys(235) are positioned to function as acid/base catalysts in the dehydration reaction. Few hydrogen bonding or electrostatic interactions are involved in the binding of OSB to the active site; instead, most of the interactions between OSB and the protein are either indirect via water molecules or via hydrophobic interactions. As a result, evolution of both the shape and the volume of the active site should be subject to few structural constraints. This would provide a structural strategy for the evolution of new catalytic activities in homologues of OSBS and a likely explanation for how the OSBS from Amycolatopsis also can catalyze the racemization of N-acylamino acids {Palmer, D. R., Garrett, J. B., Sharma, V., Meganathan, R., Babbitt, P. C., and Gerlt, J. A. (1999) Biochemistry 38, 4252-4258}.

Biochemistry, 2000 Sep 5, 39(35), 10656 - 61 Site-directed sulfhydryl labeling of the lactose permease of Escherichia coli: helix X; Venkatesan P et al.; Helix X in the lactose permease of Escherichia coli contains two residues that are irreplaceable with respect to active transport, His322 and Glu325, as well as Lys319, which is charge-paired with Asp240 in helix VII. Structural and dynamic features of transmembrane helix X are investigated here by site-directed thiol modification of 14 single-Cys replacement mutants with N-{(14)C}ethylmaleimide (NEM) in right-side-out membrane vesicles. Permease mutants with a Cys residue at position 326, 327, 329, 330, or 331 in the cytoplasmic half of the transmembrane domain are alkylated by NEM at 25 degrees C, a mutant with Cys at position 315 at the periplasmic surface is labeled in the presence of substrate exclusively, and mutants with Cys at positions 317, 318, 320, 321, 324, 328, 332, or 333 do not react with NEM under the conditions tested.
Binding of substrate causes increased labeling of a Cys residue at position 315 and decreased labeling of Cys residues at positions 326, 327, and 329. Studies with methanethiosulfonate ethylsulfonate indicate that Cys residues at positions 326, 329, 330, and 331 in the cytoplasmic half are accessible to the aqueous phase from the periplasmic face of the membrane. Ligand binding results in clear attenuation of solvent accessibility of Cys at position 326 and a marginal increase in accessibility of Cys at position 327 to solvent. The findings indicate that the cytoplasmic half of helix X is more reactive/accessible to thiol reagents and more exposed to solvent than the periplasmic half. Furthermore, positions that reflect ligand-induced conformational changes are located on the same face of helix X as Lys319, His322, and Glu325.

Biochemistry, 2000 Sep 5, 39(35), 10649 - 55 Site-directed sulfhydryl labeling of the lactose permease of Escherichia coli: N-ethylmaleimide-sensitive face of helix II; Venkatesan P et al.; Cys-scanning mutagenesis of helix II in the lactose permease of Escherichia coli {Frillingos, S., Sun, J. et al. (1997) Biochemistry 36, 269-273} indicates that one face contains positions where Cys replacement or Cys replacement followed by treatment with N-ethylmaleimide (NEM) significantly inactivates the protein. In this study, site-directed sulfhydryl modification is utilized in situ to study this face of helix II. {(14)C}NEM labeling of 13 single-Cys mutants, including the nine NEM-sensitive Cys replacements, in right-side-out membrane vesicles is examined. Permease mutants with a single-Cys residue in place of Gly46, Phe49, Gln60, Ser67, or Leu70 are alkylated by NEM at 25 degrees C in 10 min, and mutants with Cys in place of Thr45 and Ser53 are labeled only in the presence of ligand, while mutants with Cys in place of Ile52, Ser56, Leu57, Leu62, Phe63, or Leu65 do not react.
Binding of substrate leads to a marked increase in labeling of Cys residues at positions 45, 49, or 53 in the periplasmic half of helix II and a slight decrease in labeling of Cys residues at positions 60 or 67 in the cytoplasmic half. Labeling studies with methanethiosulfonate ethylsulfonate (MTSES) show that positions 45 and 53 are accessible to solvent in the presence of ligand only, while positions 46, 49, 67, and 70 are accessible to solvent in the absence or presence of ligand. Position 60 is also exposed to solvent, and substrate binding causes a decrease in solvent accessibility. The findings demonstrate that the NEM-sensitive face of helix II participates in ligand-induced conformational changes. Remarkably, this membrane-spanning face is accessible to the aqueous phase from the periplasmic side of the membrane. In the following paper in this issue {Venkatesan, P., Hu, Y., and Kaback, H. R. (2000) Biochemistry 39, 10656-10661}, the approach is applied to helix X.

Biochemistry, 2000 Sep 5, 39(35), 10641 - 8 Site-directed sulfhydryl labeling of the lactose permease of Escherichia coli: helix VII; Venkatesan P et al.; Site-directed sulfhydryl modification in situ is employed to investigate structural and dynamic features of transmembrane helix VII and the beginning of the periplasmic loop between helices VII and VIII (loop VII/VIII). Essentially all of the Cys-replacement mutants in the periplasmic half of the helix and the portion of loop VII/VIII tested are labeled by N-{(14)C}ethylmaleimide (NEM). In contrast, with the exception of two mutants at the cytoplasmic end of helix VII, none of the mutants in the cytoplasmic half react with the alkylating agent. Labeling of most of the mutants is unaltered by ligand at 25 degrees C. However, at 4 degrees C, conformational changes induced by substrate binding become apparent.
In the presence of ligand, permease mutants with a Cys residue at position 241, 242, 244, 245, 246, or 248 undergo a marked increase in labeling, while the reactivity of a Cys at position 238 is slightly decreased. Labeling of the remaining Cys-replacement mutants is unaffected by ligand. Studies with methanethiosulfonate ethylsulfonate (MTSES), a hydrophilic impermeant thiol reagent, show that most of the positions that react with NEM are accessible to MTSES; however, the two NEM-reactive mutants at the cytoplasmic end of helix VII and position 236 in the middle of the membrane-spanning domain are not. The findings demonstrate that positions in helix VII that reflect ligand-induced conformational changes are located in the periplasmic half and accessible to the aqueous phase from the periplasmic face of the membrane. In the following papers in this issue (Venkatesan, P., Lui, Z., Hu, Y., and Kaback H. R.; Venkatesan, P., Hu, Y., and Kaback H. R.), the approach is applied to helices II and X.

Biochemistry, 2000 Sep 5, 39(35), 10613 - 8 Interaction between two discontiguous chain segments from the beta-sheet of Escherichia coli thioredoxin suggests an initiation site for folding; Tasayco ML et al.; The approach of comparing folding and folding/binding processes is exquisitely poised to narrow down the regions of the sequence that drive protein folding. We have dissected the small single alpha/beta domain of oxidized Escherichia coli thioredoxin (Trx) into three complementary fragments (N, residues 1-37; M, residues 38-73; and C, residues 74-108) to study them in isolation and upon recombination by far-UV CD and NMR spectroscopy. The isolated fragments show a minimum of ellipticity at ca. 197 nm in their far-UV CD spectra without concentration dependence, chemical shifts of H(alpha) that are close to the random coil values, and no medium- and long-range NOE connectivities in their three-dimensional NMR spectra.
These fragments behave as disordered monomers. Only the far-UV CD spectra of binary or ternary mixtures that contain N- and C-fragments are different from the sum of their individual spectra, which is indicative of folding and/or binding of these fragments. Indeed, the cross-peaks corresponding to the rather hydrophobic beta(2) and beta(4) regions of the beta-sheet of Trx disappear from the (1)H-(15)N HSQC spectra of isolated labeled N- and C-fragments, respectively, upon addition of the unlabeled complementary fragments. The disappearing cross-peaks indicate interactions between the beta(2) and beta(4) regions, and their reappearance at lower temperatures indicates unfolding and/or dissociation of heteromers that are predominantly held by hydrophobic forces. Our results argue that the folding of Trx begins by zippering two discontiguous and rather hydrophobic chain segments (beta(2) and beta(4)) corresponding to neighboring strands of the native beta-sheet.

Enzyme Microb Technol, 2000 Oct 1, 27(7), 475 - 481 Molecular cloning and expression of a novel family A endoglucanase gene from Fibrobacter succinogenes S85 in Escherichia coli; Cho KK et al.; A Fibrobacter succinogenes S85 gene that encodes an endoglucanase hydrolysing CMC and xylan was cloned and expressed in Escherichia coli DH5 by using pUC19 vector. Recombinant plasmid DNA from a positive clone hydrolysing CMC and xylan was designated as pCMX1, harboring a 2,043 bp insert. The entire nucleotide sequence was determined, and an open-reading frame (ORF) was deduced. The nucleotide sequence accession number of the cloned gene sequence in Genbank is U94826. The endoglucanase gene cloned in this study does not have amino acid sequence homology to the other endoglucanase genes from F. succinogenes S85, but does show sequence homology to family 5 (family A) of glycosyl hydrolases from several species. The ORF encodes a polypeptide of 654 amino acids with a measured molecular weight of 81.3 kDa on SDS-PAGE.
Putative signal sequences, a Shine-Dalgarno-type ribosomal binding site and promoter sequences (-10) related to the consensus promoter sequences were deduced. The recombinant endoglucanase produced by E. coli harboring pCMX1 was partially purified and characterized. The N-terminal sequence of the endoglucanase was Ala-Gln-Pro-Ala-Ala, which matched the deduced amino acid sequence. The temperature range and pH for optimal activity of the purified enzyme were 55-65 degrees C and 5.5, respectively. The enzyme was most stable at pH 6 but unstable below pH 4, with a K(m) value of 0.49% CMC and a V(max) value of 152 U/mg.
Curr Microbiol, 2000 Oct, 41(4), 295-9 Overexpression of protein disulfide isomerase in Aspergillus; El-Adawi H et al.; One of the major problems with the production of biotechnologically valuable proteins has been the purification of the product. For Escherichia coli and Saccharomyces cerevisiae, there are several techniques for the purification of intracellular proteins, but these are time consuming and often result in poor yields. Purification can be considerably facilitated if the product is secreted from the host cell. In the work presented, we have constructed an expression vector (pSGNH2) for the secretion of protein disulfide isomerase (PDI; EC 5.3.4.1) from Aspergillus niger, in which the retention signal His-Asp-Glu-Leu (H-D-E-L) was modified to Ala-Leu-Glu-Gln (A-L-E-Q) via the polymerase chain reaction (PCR) method. The PDI gene was placed under the control of the A. oryzae alpha-amylase promoter. This expression vector was transformed into A. niger NRRL3, resulting in PDI secretion into the medium. The catalytic activity of overexpressed PDI from A. niger was indistinguishable from that of PDI isolated from bovine liver. With further strain improvement and optimization of culture conditions, it could be possible to raise the PDI production to the bioprocessing scale.
J Ocul Pharmacol Ther, 2000 Aug, 16(4), 353-61 Constitutive cyclooxygenase-1 and induced cyclooxygenase-2 in isolated human iris inhibited by S(+) flurbiprofen; van Haeringen NJ et al.; The purpose of the present study was to characterize the isoforms of cyclooxygenase (COX) in the human iris before and after stimulation with lipopolysaccharide (LPS) and to determine the selectivity of the nonsteroidal anti-inflammatory drug (NSAID), S(+) flurbiprofen, for inhibition of COX-1 and COX-2 in homogenates of this tissue. Spotblots were made of extracts of human iris in the absence and presence of LPS plus acetylsalicylic acid (aspirin). After reacting with anti-COX-1 and anti-COX-2 immunoglobulin G, the presence of both immunoreactive COX enzymes was substantiated using an indirect immunoperoxidase method. Authentic COX-1 and COX-2 were used as controls. Using an enzyme immunoassay (EIA), the production of prostaglandin E2 (PGE2) was quantified in tissue homogenates of human iris under the same conditions as described above. S(+) flurbiprofen was added to tissue homogenates in order to determine the inhibitory effect on PGE2 production. Half maximal inhibitory concentrations (IC50) of S(+) flurbiprofen for the PGE2 production in the tissue homogenates were determined from concentration inhibition curves. The selectivity of S(+) flurbiprofen for inhibition of COX-1 was expressed as the ratio of IC50 for COX-2/COX-1. Spotblots of nonstimulated iris extracts showed positive staining for COX-1 immunoreactivity (-ir) only. After incubation with LPS plus acetylsalicylic acid, positive staining was observed for both COX-1-ir and COX-2-ir. Concentrations of PGE2 released from homogenates of untreated iris varied from 1.5-4 ng/ml, and of LPS-stimulated tissue from 10-20 ng/ml of assay mixture. S(+) flurbiprofen inhibited PGE2 production of untreated tissue homogenates at an IC50 of 8 x 10(-10) M whereas, in the stimulated tissue, IC50 was found to be 3 x 10(-6) M.
The selectivity of S(+) flurbiprofen for inhibition of constitutively present COX-1, relative to the inhibition of induced COX-2, was 3,600. Our results indicate that specific expression of COX isoforms in normal human iris was substantiated at the protein level by immunoreaction on spotblots. COX-1 represents the constitutively present enzyme, and COX-2 appears after stimulation with LPS. At the functional level, S(+) flurbiprofen possesses a specificity for COX-1 in inhibiting PGE2 production.
J Ocul Pharmacol Ther, 2000 Aug, 16(4), 345-52 Flurbiprofen and enantiomers in ophthalmic solution tested as inhibitors of prostanoid synthesis in human blood; van Haeringen NJ et al.; The purpose of this study was to assess the selectivity and potency of the nonsteroidal anti-inflammatory drug (NSAID), flurbiprofen, and its enantiomers in their inhibition of cyclooxygenase-1 (COX-1) and cyclooxygenase-2 (COX-2). An assay was used with freshly drawn, heparinized human whole blood, incubated with 25 microM calcium ionophore A23187 during 60 min to produce thromboxane B2 (TXB2) by activity of COX-1 in platelets. Incubation with E. coli lipopolysaccharide (LPS) during 24 hr produced prostaglandin E2 (PGE2) by induction of COX-2 in monocytes, suppressing any possible contribution of COX-1 activity by the addition of acetylsalicylic acid. Concentration inhibition curves were determined with racemic, S(+), and R(-) flurbiprofen in final concentrations ranging from 10(-3) to 10(-10) M. The stereoselectivity of S(+) flurbiprofen vs. R(-) flurbiprofen, expressed as the reciprocal of the ratio of the concentrations giving 50% inhibition (IC50), is 340 for COX-1 and 56 for COX-2. The selectivity for COX-1 vs. COX-2, expressed as the reciprocal ratio of the IC50, was 32 for racemic, 16 for S(+), and 5.3 for R(-) flurbiprofen. Meloxicam in the same assay showed COX-2 selectivity with a ratio of 0.19.
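The selectivity figures quoted in the two flurbiprofen abstracts above are ratios of IC50 values. The following is a minimal sketch of that arithmetic, assuming only the rounded IC50 values stated in the iris abstract (8 x 10(-10) M for untreated tissue and 3 x 10(-6) M for LPS-stimulated tissue). The function name `selectivity_ratio` is hypothetical and is not taken from either paper. Because the inputs are rounded, the recomputed ratio only approximates the reported value of 3,600.

```python
def selectivity_ratio(ic50_cox2_molar: float, ic50_cox1_molar: float) -> float:
    """Return IC50(COX-2) / IC50(COX-1).

    A ratio above 1 means a lower concentration is needed to inhibit COX-1,
    i.e. the compound is COX-1 selective; a ratio below 1 means COX-2 selective.
    """
    if ic50_cox1_molar <= 0 or ic50_cox2_molar <= 0:
        raise ValueError("IC50 values must be positive")
    return ic50_cox2_molar / ic50_cox1_molar


# Rounded values from the iris abstract (assumed inputs for illustration).
ratio = selectivity_ratio(ic50_cox2_molar=3e-6, ic50_cox1_molar=8e-10)
print(f"COX-2/COX-1 IC50 ratio: {ratio:.0f}")
```

On this convention, the ratio of 0.19 reported for meloxicam is below 1 and therefore indicates COX-2 selectivity.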
Mol Endocrinol, 2000 Sep, 14(9), 1377-86 A novel glucocorticoid receptor binding element within the murine c-myc promoter; Ma T et al.; In the course of analyzing the murine c-myc promoter response to glucocorticoid, we have identified a novel glucocorticoid response element that does not conform to the consensus glucocorticoid receptor-binding sequence. This c-myc promoter element has the sequence CAGGGTACATGGCGTATGTGTG, which has very little sequence similarity to any known response element. Glucocorticoids activate c-myc/reporter constructs that contain this element. Deletion of these sequences from the c-myc promoter increases basal activity of the promoter and blocks glucocorticoid induction. Insertion of this element into SV40/reporters inhibits basal reporter gene activity in the absence of glucocorticoids. Glucocorticoids stimulate activity of reporters that contain this element. Recombinant glucocorticoid receptor binds to this element in vitro. An unidentified cellular repressor also binds to this element. The activated glucocorticoid receptor displaces this protein(s). We conclude that the glucocorticoid receptor binds to the c-myc promoter in competition with this protein, which is a repressor of transcription. To our knowledge, no glucocorticoid response element with such properties has ever been reported.
J Chromatogr A, 2000 Aug 18, 890(1), 37-43 High-performance chromatofocusing using linear and concave pH gradients formed with simple buffer mixtures. II. Separation of proteins; Kang X et al.; The separation of proteins using high-performance chromatofocusing with linear or concave pH gradients formed using simple mixtures of buffering species in the elution buffer is investigated experimentally. The separation achieved is comparable to that using polyampholyte elution buffers with these types of systems.
More specifically, protein band widths at one half of the band height in the range between 0.1 and 0.025 pH units were observed, and good resolution was achieved of protein variants differing by a single amino acid residue in separation times of 30 min or less. An especially useful elution buffer is investigated that contains only four buffering species and that produces a linear pH gradient in the range between pH 9.5 and 6.0 when used together with a particular high-performance column packing made specifically for chromatofocusing. This elution buffer and column packing combination is evaluated by using it for the chromatofocusing of equine myoglobin and human hemoglobin variants. Additional applications are described in which a polyethyleneimine derivatized silica column packing and a pH gradient that is concave in shape are used for the separation of proteins in an E. coli cell lysate.
Ann Neurol, 2000 Sep, 48(3), 330-5 Late-onset optic atrophy, ataxia, and myopathy associated with a mutation of a complex II gene; Birch-Machin MA et al.; Genetic defects affecting the mitochondrial respiratory chain are an important cause of neurological disease. Previously, we identified a family with complex II deficiency and late-onset neurodegenerative disease with progressive optic atrophy, ataxia, and myopathy. The affected family members are now shown to carry a C-to-T transition in one allele of the nuclear gene encoding the flavoprotein subunit of complex II. Mutation of the equivalent base in Escherichia coli generates an inactive enzyme unable to bind flavin adenine dinucleotide covalently. Compatible with these findings, our patients have an approximate 50% decrease in complex II and succinate dehydrogenase activity. These results suggest that genetic defects of nuclear-encoded subunits of the mitochondrial respiratory chain can result in late-onset neurodegenerative disease.
Sheng Wu Gong Cheng Xue Bao, 2000 Mar, 16(2), 169-72 [Reactivation of denatured lysozyme with immobilized molecular chaperones GroE]; Dong XY et al.; The molecular chaperones GroEL and GroES were expressed in recombinant E. coli and purified by anion exchange chromatography. The renaturation of denatured lysozyme with free and immobilized GroEL/ES or GroEL was studied. We show here that free GroEL alone could reactivate the denatured lysozyme to a relative activity of over 90%. The immobilized GroEL was also effective in promoting lysozyme refolding. Moreover, the optimal temperature (i.e., 37 degrees C) and pH (i.e., 6 to 8) for the immobilized GroEL-facilitated lysozyme refolding operation were determined. Under the optimal conditions, up to 85% of the lysozyme activity could be recovered. In addition, the immobilized GroEL was used five times without loss of its renaturation ability, indicating its potential for use in practical downstream bioprocesses.
Sheng Wu Gong Cheng Xue Bao, 2000 Mar, 16(2), 158-60 [Cloning of taxadiene synthase cDNA from the cell line of Taxus cuspidata]; Hu GW et al.; Taxadiene synthase plays an important role in taxol biosynthesis. RT-PCR was used to clone a taxadiene synthase cDNA fragment from cells of T. cuspidata. The cDNA was cloned into the vector pGEM and transformed into E. coli JM109. The cloned cDNA, named pCBMZ, was further confirmed by Southern blotting and was sequenced. The result showed that the taxadiene synthase cDNA of Taxus cuspidata is highly homologous to that of Taxus brevifolia.
Sheng Wu Gong Cheng Xue Bao, 2000 Mar, 16(2), 150-4 [Research on renaturation of recombinant human pro-urokinase expressed from Escherichia coli]; Zhu H et al.; Recombinant human pro-urokinase forms insoluble inclusion bodies when overexpressed in Escherichia coli, and it must be denatured and renatured in vitro before it acquires activity.
This study aimed to increase the renaturation yield of denatured pro-urokinase. We have evaluated the basic renaturation conditions of pro-urokinase through qualitative and quantitative analysis of pH, temperature, denaturant concentration, protein concentration, and the ratio of reduced to oxidized thiol reagents. The effects of nonspecific additives, step-wise dilution and urea gradient dialysis have also been compared. Optimal conditions for pro-urokinase renaturation, giving a yield of about 20%-30%, have been obtained.
Sheng Wu Gong Cheng Xue Bao, 2000 Mar, 16(2), 134-6 [Cloning and sequence of cDNA encoding ACC synthase specifically expressed in banana fruit]; Wang XL et al.; A cDNA encoding ACC synthase in banana pulp was amplified by RT-PCR and cloned in E. coli. The 5' terminal region of the ACC synthase transcript was determined by using the 5' RACE procedure. The results showed that the ACC synthase cDNA in banana pulp is 1752 bp in length, including 74 bp of 5' untranslated region, 1461 bp of coding region, which encodes a polypeptide of 486 amino acids, and 217 bp of 3' untranslated region. Northern blot analysis indicates that the ACC synthase mRNA is specifically transcribed in banana fruits.
Cancer Gene Ther, 2000 Aug, 7(8), 1179-87 Escherichia coli thymidylate synthase expression protects human cells from the cytotoxic effects of 5-fluorodeoxyuridine more effectively than human thymidylate synthase overexpression; Parsels LA et al.; In this study, we compared the relative abilities of human thymidylate synthase (hTS) and Escherichia coli thymidylate synthase (eTS) expression to confer resistance to the cytotoxic effects of treatment with the TS inhibitor 5-fluorodeoxyuridine (FdUrd). G418-selected clones expressing either form of the protein were significantly more resistant than the lacZ-expressing clone, VALZ2, to FdUrd-induced cytotoxicity.
Although eTS-expressing clones expressed 2- to 3-fold more TS protein than hTS-overexpressing clones, the representative eTS-expressing clone, VAEG8, and hTS-overexpressing clone, VAHGC, were equally sensitive to an FdUrd-induced loss of clonogenicity; in addition, a large fraction of either form of exogenously expressed TS appeared to be inactive in the intact cell. The clones differed, however, in their responses to leucovorin (LV). Although LV significantly enhanced FdUrd-induced TS inhibition, growth inhibition, and cytotoxicity in VAHGC cells, it had no effect on these parameters in VAEG8 cells. These results suggest that eTS may more efficiently confer resistance to FdUrd plus LV when expressed for the purposes of a "host protection" strategy in vivo.
Protein Sci, 2000 Aug, 9(8), 1559-66 Static light scattering studies of OmpF porin: implications for integral membrane protein crystallization; Hitscherich C Jr et al.; Integral membrane proteins carry out some of the most important functions of living cells, yet relatively few details are known about their structures. This is due, in large part, to the difficulties associated with preparing membrane protein crystals suitable for X-ray diffraction analysis. Mechanistic studies of membrane protein crystallization may provide insights that will aid in determining future membrane protein structures. Accordingly, the solution behavior of the bacterial outer membrane protein OmpF porin was studied by static light scattering under conditions favorable for crystal growth. The second osmotic virial coefficient (B22) was found to be a predictor of the crystallization behavior of porin, as has previously been found for soluble proteins. Both tetragonal and trigonal porin crystals were found to form only within a narrow window of B22 values located at approximately -0.5 to -2 x 10(-4) mol mL g(-2), which is similar to the "crystallization slot" observed for soluble proteins.
The B22 behavior of protein-free detergent micelles proved very similar to that of porin-detergent complexes, suggesting that the detergent's contribution dominates the behavior of protein-detergent complexes under crystallizing conditions. This observation implies that, for any given detergent, it may be possible to construct membrane protein crystallization screens of general utility by manipulating the solution properties so as to drive detergent B22 values into the crystallization slot. Such screens would limit the screening effort to the detergent systems most likely to yield crystals, thereby minimizing protein requirements and improving productivity.
Protein Sci, 2000 Aug, 9(8), 1530-9 Function of a conserved sequence motif in biotin holoenzyme synthetases; Kwon K et al.; The biotin holoenzyme synthetases (BHS) are essential enzymes in all organisms that catalyze post-translational linkage of biotin to biotin-dependent carboxylases. The primary sequences of a large number of these enzymes are now available and homologies are found among all. The glycine-rich sequence, GRGRXG, constitutes one of the homologous regions in these enzymes and, based on its similarity to sequences found in a number of mononucleotide binding enzymes, has been proposed to function in ATP binding in the BHSs. In the Escherichia coli enzyme, the only member of the family for which a three-dimensional structure has been determined, the conserved sequence is found in a partially disordered surface loop. Mutations in the sequence have previously been isolated and characterized in vivo. In this work these single-site mutants, G115S, R118G, and R119W, of the E. coli BHS have been purified and biochemically characterized with respect to binding of small molecule substrates and the intermediate in the biotinylation reaction.
Results of this characterization indicate that, rather than functioning in ATP binding, this glycine-rich sequence is required for binding the substrate biotin and the intermediate in the biotinylation reaction, biotinyl-5'-AMP. These results are of general significance for understanding structure-function relationships in biotin holoenzyme synthetases.
Protein Sci, 2000 Aug, 9(8), 1519-29 Ligand binding and thermodynamic stability of a multidomain protein, calmodulin; Masino L et al.; Chemical and thermal denaturation of calmodulin has been monitored spectroscopically to determine the stability for the intact protein and its two isolated domains as a function of binding of Ca2+ or Mg2+. The reversible urea unfolding of either isolated apo-domain follows a two-state mechanism with relatively low deltaG(o)20 values of approximately 2.7 (N-domain) and approximately 1.9 kcal/mol (C-domain). The apo-C-domain is significantly unfolded at normal temperatures (20-25 degrees C). The greater affinity of the C-domain for Ca2+ causes it to be more stable than the N-domain at [Ca2+] ≥ 0.3 mM. By contrast, Mg2+ causes a greater stabilization of the N- rather than the C-domain, consistent with measured Mg2+ affinities. For the intact protein (+/-Ca2+), the bimodal denaturation profiles can be analyzed to give two deltaG(o)20 values, which differ significantly from those of the isolated domains, with one domain being less stable and one domain more stable. The observed stability of the domains is strongly dependent on solution conditions such as ionic strength, as well as specific effects due to metal ion binding. In the intact protein, different folding intermediates are observed, depending on the ionic composition. The results illustrate that a protein of low intrinsic stability is liable to major perturbation of its unfolding properties by environmental conditions and liganding processes and, by extension, mutation.
Hence, the observed stability of an isolated domain may differ significantly from the stability of the same structure in a multidomain protein. These results address questions involved in manipulating the stability of a protein or its domains by site-directed mutagenesis and protein engineering.
Protein Sci, 2000 Aug, 9(8), 1497-502 Bypassing the kinetic trap of serpin protein folding by loop extension; Im H et al.; The native form of some proteins, such as strained plasma serpins (serine protease inhibitors) and the spring-loaded viral membrane fusion proteins, is in a metastable state. The metastable native form is thought to be a folding intermediate in which conversion into the most stable state is blocked by a very high kinetic barrier. In an effort to understand how the spontaneous conversion of the metastable native form into the most stable state is prevented, we designed mutations of alpha1-antitrypsin, a prototype serpin, which can bypass the folding barrier. Extending the reactive center loop of alpha1-antitrypsin converts the molecule into a more stable state. Remarkably, a 30-residue loop extension allows conversion into an extremely stable state, which is comparable to the relaxed cleaved form. Biochemical data strongly suggest that the strain release is due to the insertion of the reactive center loop into the major beta-sheet, A sheet, as in the known stable conformations of serpins. Our results clearly show that extending the reactive center loop is sufficient to bypass the folding barrier of alpha1-antitrypsin and suggest that the constraint held by polypeptide connection prevents the conversion of the native form into the lowest energy state.
Protein structure prediction is another important application of bioinformatics. The amino acid sequence of a protein, the so-called primary structure, can be easily determined from the sequence of the gene that codes for it.
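As an illustration of how a primary structure can be read off a gene sequence, here is a minimal sketch in Python. It assumes the standard (nuclear) genetic code, a coding sequence that starts in frame at its first base, and input containing only the letters A, C, G and T; any other letter raises a KeyError. The names `CODON_TABLE` and `translate` are hypothetical choices for this example, and the input sequence is made up.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order for each
# codon position; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(coding_dna: str) -> str:
    """Translate an in-frame coding DNA sequence into one-letter amino acids.

    Translation stops at the first stop codon; trailing bases that do not
    form a complete codon are ignored.
    """
    sequence = coding_dna.upper()
    protein = []
    for start in range(0, len(sequence) - 2, 3):
        amino_acid = CODON_TABLE[sequence[start:start + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Made-up example sequence: ATG GCT CAG CCT GCT GCT TAA
print(translate("ATGGCTCAGCCTGCTGCTTAA"))  # prints MAQPAA
```

The lookup gives only the linear order of residues; it says nothing about how the chain folds.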
However, the protein can function correctly only if it is folded in a very specific way (that is, if it has the correct secondary, tertiary and quaternary structure). Predicting this folding from the amino acid sequence alone is quite difficult. Several methods for computer prediction of protein folding are currently (as of 2004) under development.
One of the key principles in bioinformatics is homology. In the genomic branch of bioinformatics, homology is used to predict the function of a gene: if gene A is homologous to gene B, whose function is known, gene A is likely to have a similar function. In the structural branch of bioinformatics, homology is used to determine which parts of a protein are important in structure formation and in interaction with other proteins. In a technique called homology modelling, this information is used to predict the structure of a protein once the structure of a homologous protein is known. This currently remains the only way to predict protein structures reliably. One example is the homology between hemoglobin in humans and the hemoglobin in legumes (leghemoglobin). Both serve the same purpose of transporting oxygen in the organism. Although the two proteins have very different amino acid sequences, their structures are virtually identical, which reflects their nearly identical purposes.
Systems biology involves the use of computer simulations of cellular subsystems (such as the networks of metabolites and enzymes that comprise metabolism, signal transduction pathways and gene regulatory networks) to both analyze and visualize the complex connections of these cellular processes.
Artificial life, or virtual evolution, attempts to understand evolutionary processes via the computer simulation of simple (artificial) life forms.
DNA-DNA hybridization is a method in genetics to measure the degree of genetic similarity between DNA sequences.
The technique is usually used to determine the genetic "distance" between two species. When several species are compared in this way, the similarity values allow the species to be arranged in a phylogenetic tree; it is therefore one possible approach to carrying out molecular systematics.
Method
The DNA from the two species to be compared is extracted, purified and cut into short pieces (e.g., 600-800 base pairs). The DNA double strand is then separated by heating into two single strands. The single-stranded DNA is then allowed to anneal with the DNA pieces of the other species. The more similar the DNA, the more of the pieces will anneal and form a hybrid (hence the name) double strand. Strands with a high degree of similarity bind more firmly and require more energy to separate: they separate at a higher temperature than dissimilar strands. To assess this "melting temperature", the mixture is heated in small steps. At each step, samples are tested for the amount of single- and double-stranded DNA. This results in a profile from which the amount of similar DNA, and thus the degree of genetic similarity, can be determined.
Advantages and disadvantages
This technique was considered a good one because it took all possible ways of aligning the sequences into account, so that the melting temperature would be a good average. However, it is no longer considered the best approach, since sequences can now be aligned computationally. Hardly any approach other than computational alignment is used currently, although the choice of sequences for the comparison is a major source of contention. Not all sequences evolve at the same rate. Some are so critical that changes are lost when they cause loss of function of the gene product. Finding the corresponding genes in more distant organisms can also be difficult.
Some approaches have considered using non-coding sequences, since these are believed not to be subject to selection, so that all mutations in them would be retained faithfully. If one assumes a constant rate of background mutation, these mutations would indicate the age of the sequence lineage. However, these too are difficult to use for comparisons across genera.
The mitochondrial genome is the genetic material of the mitochondria. The mitochondria are organelles that reproduce themselves semi-autonomously when the eukaryotic cells that they occupy divide. The genetic material forming the mitochondrial genome is similar in structure to prokaryotic genetic material. It is formed of a single circular DNA molecule. Mitochondria are thought to have arisen from intracellular bacterial symbionts; this is called the endosymbiotic theory. The mitochondria of a sexually reproducing animal come only from the mother. The mitochondrial DNA of a human being is essentially the same as that of his or her mother. As a result, mitochondrial genetic diseases can affect both males and females, but can only be transmitted by females to their offspring. Compared to the nuclear genome, the mitochondrial genome possesses some very interesting features: All the genes are carried on a single circular DNA molecule. The genetic material is not bounded by a nuclear envelope. The DNA is not packed with proteins. The genome contains little non-coding ("junk") DNA. Some codons do not follow the universal rules in translation. Some bases are part of two different genes, serving as the last base of one gene and the first base of the next.
Diploid (meaning double in Greek) cells have two copies of each autosome (non-sex chromosome), usually one from the mother and one from the father.
Most somatic cells (body cells) of higher organisms are diploid or polyploid (three or more copies of each chromosome, often found in plants), whereas their reproductive cells are usually haploid (they have only one copy of each chromosome). During reproduction, haploid sex cells (gametes) from both parents generally merge to form a diploid cell, the zygote, with unique genetic properties, which quickly becomes the embryo.
A somatic cell is a type of cell in an organism, such as the human body. Cells can be divided into two types: those that are part of the germline and those that are not. Somatic cells are the cells that are not part of the germline. Your liver, your heart and your hands are made entirely of somatic cells. Somatic cells are also called "body cells." The cells in the germline include the gametes (sperm or ovum), the cells that produce the gametes (such as gametocytes), and even the zygote, because it leads to the production of the gametes. In contrast, genetic material in the cells of your liver, arm or hand will never be passed on to your children. In humans, somatic cells have 46 chromosomes, making them diploid cells. Gametes, in contrast, have only 23 chromosomes, making them haploid cells. (Not all non-somatic cells have only 23 chromosomes, however: consider the zygote.) Somatic cells can be used in cloning, by a process called somatic cell nuclear transfer.
Genome projects are scientific endeavours that aim to map the genome of a living being or of a species (be it an animal, a plant, a fungus, a bacterium, an archaean, a protist or a virus), that is, the complete set of genes carried by this living being or virus. The Human Genome Project was such a project. Some have argued that the era of genomics is one of the more fundamental advances in human history.
The Human Genome Project (HGP) endeavoured to map the human genome down to the nucleotide (or base pair) level and to identify all of the 20,000-25,000 genes present in it.
History
The $3 billion project was founded in 1990 by the United States Department of Energy and the U.S. National Institutes of Health, and was expected to take 15 years. Due to widespread international cooperation and advances in the field of genomics (especially in sequence analysis), as well as huge advances in computing technology, a rough draft of the genome was finished in 2000 (announced jointly by US President Bill Clinton and British Prime Minister Tony Blair on June 26, 2000), two years earlier than planned. The consortium comprised China, France, Germany, Japan, the United Kingdom and the United States. On April 14, 2003, a joint press release announced that the project had been successfully completed, with 99% of the genome sequenced with 99.99% accuracy. Another reason for the accelerated work was competition from the commercially financed effort at Celera Genomics, which used a new method called shotgun sequencing. Celera Genomics planned to patent all the genes it found, whereas the gene sequences found by the original publicly funded HGP are, in line with the so-called Bermuda Statement (February 1996), made freely available to the public, 24 hours a day. This sort of competition proved to be very good for the Project. Although the working draft was announced in June 2000, it was not until February 2001 that Celera and the HGP scientists published actual details of their drafts. Special issues of Nature (which published the publicly funded project's scientific paper) and Science (which published Celera's paper) contained descriptions of the methods used to produce the draft sequence, as well as analysis of that sequence. It is hoped that these drafts will provide a "scaffold" of about 90% of the genome upon which gaps can be closed.
Each draft sequence has been checked at least four to five times to increase "depth of coverage", or accuracy. Approximately 47% of the draft was high-quality sequence; the final version will have been checked eight to nine times, giving an error rate of just 1 in 10,000 bases. The Human Genome Project is one of a number of international genome projects in biology, each aimed at sequencing the DNA of a specific organism. While the human DNA sequence offers the most tangible benefits, important developments in biology and medicine are predicted as a result of the sequencing of model organisms including mice, fruit flies, zebrafish, yeast, nematodes and many microbial organisms and parasites. In October 2004, researchers of the HGP announced a new estimate of 20,000 to 25,000 genes in the human genome. Previously 30,000 to 40,000 had been predicted, while estimates at the start of the project ranged as high as 100,000.
Goals
The goals of the original HGP were not only to determine all 3 billion base pairs in the human genome with a minimal error rate, but also to identify all the genes in this vast amount of data. This part of the project is still ongoing, although a preliminary count indicates about 25,000 genes in the human genome, which is far fewer than most scientists predicted. Another goal of the HGP was to develop faster, more efficient methods for DNA sequencing and sequence analysis and to transfer these technologies to industry. Today, the sequence of human DNA is stored in databases and is available to everyone on the Internet. The U.S. National Center for Biotechnology Information (and sister organizations in Europe and Japan) houses the genomic sequence, along with sequences of known and hypothetical genes and proteins. Other organizations, such as the University of California, Santa Cruz, and ENSEMBL, present additional data and annotation, together with powerful tools for visualizing and searching through it.
Computer programs have been developed to analyse that data, as the data itself is next to useless without interpretation. The process of identifying the boundaries of genes and other features in raw DNA sequence is called annotation and is the domain of bioinformatics. While expert biologists make the best annotators, such annotation proceeds slowly, and computer programs are increasingly used to meet the high-throughput demands of genome sequencing projects. The best current technologies for annotation make use of statistical models that take advantage of parallels between DNA sequences and human language, using concepts from computer science such as formal grammars. Every human has a unique genomic sequence, so the data published by the HGP do not represent the exact sequence of each and every individual's genome. It is the combined genome of a small number of anonymous donors. The HGP genome is a scaffold for future work in identifying differences between individuals. Most of the current effort in identifying differences between individuals involves single nucleotide polymorphisms.

Benefits

The work on automated interpretation of genome data has just begun. It is hoped that the knowledge gained by understanding the genome will boost the fields of medicine and biotechnology, eventually leading to cures for cancer, Alzheimer's disease and other diseases. For example, a biological researcher investigating a certain form of cancer may have narrowed down their search to a particular gene. By visiting the human genome database on the World Wide Web, this researcher can examine what other scientists have written about this gene, including (potentially) its three-dimensional structure, its function(s), its evolutionary relationships to other human genes, or to genes in mice or yeast or fruit flies, possible detrimental mutations, interactions with other genes, body tissues in which this gene is activated, diseases associated with this gene...
the list of datatypes is long, one reason why bioinformatics is so challenging. One particularly exciting technology arising from genomics is the DNA microarray (also called DNA chip), an array of probes for simultaneously measuring the amount of each of the 20,000+ human genes present in a given sample. This has aroused great interest as a potential diagnostic tool for science and medicine. It seems likely that there will be many more downstream technologies as a result of the Human Genome Project. On a more philosophical level, the analysis of similarities between DNA sequences from different organisms is opening new avenues in the study of evolution. In many cases, evolutionary questions can now be framed in terms of molecular biology; indeed, many major evolutionary milestones (the emergence of the ribosome and organelles, the development of embryos with body plans, the vertebrate immune system) can be related to the molecular level. Many questions about the similarities and differences between humans and our closest relatives (the primates, and indeed the other mammals) are expected to be illuminated by the data from this project.

In chemistry, sequence analysis is a technique used to determine the sequence of a polymer formed of several monomers. In molecular biology and genetics, the same process is called simply "sequencing." The term "sequence analysis" in biology implies subjecting a DNA or amino acid sequence to sequence alignment, sequence database searches, repeated sequence searches, or other bioinformatics methods on a computer.

In the field of bioinformatics, a sequence database is a large collection of DNA, protein, or other sequences stored on a computer. A database can include sequences from only one organism, as in databases including all the proteins in Saccharomyces cerevisiae, or it can include sequences from all organisms whose DNA has been sequenced. Sequence databases can be searched using a variety of methods.
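As a rough illustration of what searching a sequence database can mean, the sketch below ranks the records of a tiny in-memory collection by how many short subsequences (k-mers) they share with a query. This is a deliberately simplified stand-in and not the algorithm of any real search tool; the names `kmers`, `rank_by_shared_kmers` and `toy_db`, and the sequences themselves, are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def rank_by_shared_kmers(query, database, k=3):
    """Rank database records by the number of distinct k-mers shared with the query.

    database maps a record name to its sequence. Records sharing no k-mer
    with the query are dropped. Ties are broken alphabetically by name.
    """
    query_kmers = kmers(query, k)
    scored = []
    for name, seq in database.items():
        shared = len(query_kmers & kmers(seq, k))
        if shared > 0:
            scored.append((name, shared))
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored


# Hypothetical toy database; real databases hold millions of records.
toy_db = {
    "rec1": "ATGGCGTACGTTAG",
    "rec2": "TTTTTTTTTTTT",
    "rec3": "GGATGGCGTAC",
}

hits = rank_by_shared_kmers("ATGGCGTAC", toy_db)
for name, shared in hits:
    print(name, shared)
```

Counting shared k-mers ignores the order and spacing of matches, so it is only a crude measure of similarity; it is used here because it fits in a few lines.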
The most common is probably searching for a sequence similar to a certain target protein or gene whose sequence is already known to the user. The BLAST program is a method of this type. A major problem with all the large genetic sequence databases is that records are deposited in them from a wide range of sources, from individual researchers to large genome sequencing centers. As a result, the sequences themselves, and especially the biological annotations attached to these sequences, vary tremendously in quality. Also there is much redundancy, as multiple labs often submit numerous sequences that are identical, or nearly identical, to others in the databases. Many annotations are based not on laboratory experiments, but on the results of sequence similarity searches for previously-annotated sequences. Of course, once a sequence has been annotated based on similarity to others, and itself deposited in the database, it can also become the basis for future annotations. This leads to the transitive annotation problem because there may be several such annotation transfers by sequence similarity between a particular database record and actual wet-lab experimental information. Therefore, one must always regard the biological annotations in major sequence databases with a considerable degree of skepticism, unless they can be verified by reference to published papers describing high-quality experimental data, or at least by reference to a human-curated sequence database. Heredity, similarity between parents and offspring. In biology, offspring resemble their parents because the offspring inherit genes, carried on DNA molecules, from their parents (see Genetics). The word heredity is also used in a non-biological sense, in human affairs, to refer to the inheritance of cultural or material goods, such as religious or political beliefs, or land or money. This article is mainly concerned with biological heredity. 
In human beings, and many related forms of life, inheritance occurs by a set of detailed mechanisms, some (but not all) of which are well understood. In molecular terms, heredity is due to DNA. The DNA codes for genes, and the genes specify particular proteins. The DNA acts as a set of coded instructions for building a body, given a particular environment. At a reproductive level, inheritance is a sexual process. The offspring contain two copies of each gene, inherited from their two parents. At a cellular level, inheritance proceeds via meiosis, a special kind of cell division that produces the gametes (eggs in females, sperm in males). In meiosis, the two copies of each gene are reduced to a single copy. When male and female gametes combine, the double set is restored. This pattern of heredity is called Mendelian (see Mendel’s Laws). However, the particular hereditary mechanism that is used in modern humans is only one of many ways in which heredity occurs in all of life. Viruses, bacteria, and other microbes use different hereditary mechanisms. In this article, we look at the variety of hereditary mechanisms, and how they are related to the forms of life that use them. Moreover, theoretical biologists have thought up several further ways in which heredity could conceivably occur though in fact it does not. We shall also look at some of these theoretical hereditary mechanisms to see how they shed some light on why life uses the hereditary mechanisms it does rather than some alternative mechanisms.

II The Mechanisms of Heredity in the Main Forms of Life

Heredity is one of the main defining features of life as a whole. The existence of life probably requires three conditions to be met: reproduction, heredity, and variation. Any entities that possess these three attributes will be able to evolve, and may evolve into something we recognize as life. The three conditions are related. Heredity is impossible without reproduction.
However, reproduction without heredity is possible. For example, fires can reproduce: a spark from one fire may ignite a second fire elsewhere. But the attributes of the “offspring” fire—its size, its duration, the pattern of flickering flames—depend on local features such as the supplies of combustible material, oxygen, and the wind, rather than on attributes of the “parental” fire. Fires show reproduction without heredity and do not evolve by natural selection; they are not alive. In life, reproduction is of a kind that produces heredity; offspring tend to resemble their parents. All life reproduces by “template reproduction”; that is, the parental hereditary molecule acts as a template for the production of the offspring hereditary molecule. Template reproduction is the best-known example of a method of reproduction that produces heredity. However, not all template reproduction takes the same form as DNA replication. Other examples of template reproduction include photocopying, old-fashioned printing with metal-typeset or blocks of woodcut, and industrial processes in which molten metal is poured into a mould. All these types of template reproduction produce some form of heredity. Fires, by contrast, do not spread by template reproduction. The third feature of life, variation, is also related to the hereditary process. Variation ultimately arises because of errors, or mutations, in heredity. But the amount of variation in a population depends on the shuffling and reshuffling of genetic variants that already exist as well as the creation of new variants by mutation. The shuffling is effected by a process called recombination, and recombination is another feature of the hereditary mechanism. During meiosis, the two gene-sets that an individual inherits from its two parents are shuffled before they are sent into the gametes. The offspring in the next generation show a greater range of variation in consequence. 
Thus, heredity is an essential property of life—not only is heredity one of the three defining conditions of life but it also influences the other two conditions. Without heredity, life would not exist. Moreover, the details of the hereditary mechanism influence the form that life takes on. Over evolutionary time, the hereditary mechanism has changed from simple beginnings at the origin of life to the forms seen in modern life. At each stage, the details of how inheritance occurs constrain the form that life can have. Indeed, many biologists think of the history of life as a series of ways in which genetic information is passed on from one generation to the next. Advances during evolution have depended on changes in the way that heredity occurs. (It is worth noting that hereditary mechanisms have not changed in order to cause any future evolutionary events. However, once the hereditary mechanism changes for some reason, certain other evolutionary changes may become more likely in consequence.) The main changes that have occurred during evolution in the form of heredity are now discussed in detail. A The Origin of Heredity The origin of heredity, together with reproduction, represents the transition from chemistry to biology. We have only inferential and incomplete knowledge of how the transition occurred. No one has observed the origin of life; indeed it is not even known to the nearest hundred million years when life on Earth originated. Heredity and reproduction in all life forms depend on base pairing. In modern DNA the bases are four molecules symbolized by A, C, G, and T. A binds to T, G binds to C. A base sequence such as GCTT will reproduce into CGAA, which in turn reproduces into GCTT. (Modern DNA can only reproduce with the assistance of many enzymes; but the earliest replicating molecules probably reproduced without enzymatic assistance.) 
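The copying rule just described (A pairs with T, G pairs with C) can be written out as a short sketch. It follows the simplified picture in the text, in which each base is replaced by its partner position by position; strand direction and the enzymes involved are ignored. The names `PAIR` and `template_copy` are illustrative only.

```python
# Pairing rules as stated in the text: A binds to T, G binds to C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def template_copy(seq):
    """Return the sequence formed by pairing each base of seq with its partner."""
    return "".join(PAIR[base] for base in seq)


first_copy = template_copy("GCTT")       # the complementary sequence
second_copy = template_copy(first_copy)  # copying the copy restores the original
print(first_copy, second_copy)
```

Two rounds of copying return the starting sequence, which is why a template mechanism of this kind produces heredity.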
The earliest replicating molecules probably relied on pairing between molecules other than A, G, C, and T; but we have several reasons to think that some kind of base pairing was used. One reason is that all life uses base pairing now. Indeed, the evolutionary biologist Leslie Orgel once remarked that if you imagine stripping away all the details of Earthly life one by one to try to define what is essential to all life, you can remove almost everything—skin and bones, eating and breathing, cells, even enzymes—but finally you are left with base pairing, like the smile of the Cheshire cat. Secondly, base pairing has the right theoretical attributes for a hereditary system that can allow the evolution of Earthly life. Base pairing allows what is called unlimited heredity. Theoreticians distinguish between mechanisms of limited heredity, which permit inheritance for only a small number of states, and mechanisms of unlimited heredity, which permit heredity among a large, or practically infinite, number of states. As an example of a system with limited heredity, consider autocatalytic cycles (cycles that catalyse themselves). Freeman Dyson has discussed how autocatalytic cycles may have arisen near the origin of life. Thus, a set of chemicals that can be symbolized by X, Y, and Z may catalyse one another in a cycle of the form X → Y, Y→ Z, Z → X. A system of X, Y, and Z can perpetuate itself, and generate more of X, Y, and Z. However, only very simple evolution is possible here. There might be a second autocatalytic cycle, X’ → Y’ → Z’. Then XYZ systems might compete with X’Y’Z’ systems, and one or other might increase in frequency depending on the local conditions. But there are only two inherited states (XYZ or X’Y’Z’), and evolution is confined to fluctuations in their relative frequencies. In general, the number of states of autocatalytic systems is likely to be limited, because only certain sets of chemicals will form autocatalytic cycles. 
The evolution of life, in its modern complexity and variety, requires unlimited heredity. Base pairing permits this property. For example, with four different bases, a sequence of only five bases can have over 1,000 forms: AAAAA, AAAAC, AAACG, and so on. With a longer sequence of bases, the number of heritable states soon becomes astronomical. The evolution of life with the kind of complexity that we see in life on Earth would have required unlimited heredity, and that probably required base pairing or something like it. B The RNA World The earliest living systems probably had heredity but no metabolism. (In genetic terms, they had genotypes but no phenotypes.) The molecule just replicated, without doing anything more. The next stage is for life to catalyse reactions, in the way that modern enzymes do, altering the conditions around the replicating molecule such that it can copy itself more effectively. Biologists suspect that, early in the history of life, there was an “RNA world”. In the RNA world, RNA acted as the hereditary molecule and also catalysed reactions. In most of modern life, DNA is the hereditary molecule and it codes for proteins that catalyse metabolic reactions. In life forms with DNA, heredity is separate from catalysis, and the hereditary molecule contains coded information. In the RNA world, the RNA molecules acted as ribozymes: that is the RNA molecule itself acted as an enzyme rather than coding for a protein that in turn acted as an enzyme. A system in which heredity and catalysis are two functions of the same molecule is a simpler system than one in which the two functions are performed by separate molecules. The simpler system is likely to have come first in evolution. However, the possibilities in the RNA world would have been limited. The need for an RNA molecule to act as a catalyst constrains what shape it can have, and therefore its base sequence. 
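Returning to the count of heritable states given above for base pairing, the figure for a five-base sequence can be checked with a few lines of code. With four bases there are 4 to the power n possible sequences of length n. The function names below are illustrative, and the enumeration is only practical for short sequences.

```python
from itertools import product

BASES = "ACGT"


def count_sequences(length):
    """Number of distinct base sequences of the given length."""
    return len(BASES) ** length


def enumerate_sequences(length):
    """List every sequence of the given length (feasible only for small lengths)."""
    return ["".join(combo) for combo in product(BASES, repeat=length)]


five_base = enumerate_sequences(5)
print(count_sequences(5), five_base[:3])

# The count grows exponentially with length.
for n in (5, 10, 100):
    print(n, count_sequences(n))
```

By contrast, the autocatalytic example in the text has only two heritable states, however long the system persists.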
C Heredity by DNA DNA cannot act as a catalyst, because all DNA molecules, whatever their base sequence, have the same structure—the famous double helix. When life evolved from using RNA to using DNA as its hereditary molecule, the separation opened up between the coding hereditary molecule and the protein catalysts. The hereditary molecule was now freed from a constraint, as it no longer needed to function as a catalyst, and could evolve a much wider variety of coding sequences. There is a second reason why, once DNA evolved as the hereditary molecule, life could evolve to be more complex than before. The inheritance of DNA is much more accurate than the inheritance of RNA. A few modern life forms use RNA as the hereditary material. For example, some viruses, such as HIV (the agent of AIDS) and the influenza virus, reproduce via RNA. They have a mutation rate of approximately one error per 10,000 to 100,000 bases. Modern life forms that use DNA all make errors approximately once in 1 to 10 billion bases. DNA-based life forms copy themselves about 100,000 times more accurately than RNA-based life forms. Not all the improvement is due to the use of DNA rather than RNA, but some of it is. The accuracy of heredity influences how long the hereditary molecule can be. The distinguished German chemist Manfred Eigen has shown that, approximately, a hereditary molecule cannot evolve to be longer than the inverse of its error rate. A hereditary molecule with an error rate of 1 in 1,000 cannot evolve to be more than about 1,000 bases long. If the molecule is any longer, it accumulates too many errors and the life form is unsustainable. Thus, when life evolved from the RNA world to the DNA world, the hereditary molecule could evolve to be much longer. A longer hereditary molecule can code for a more complex life form than a shorter hereditary molecule. The evolution of DNA was for this reason a key step in the rise of complex life on Earth. 
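The arithmetic behind the length limit can be sketched as follows, taking the approximation quoted in the text at face value: the longest sustainable hereditary molecule is about the inverse of the per-base error rate. The function names are illustrative, and exact fractions are used only to avoid rounding noise.

```python
from fractions import Fraction


def max_sustainable_length(error_rate):
    """Approximate upper bound on length: the inverse of the per-base error rate."""
    return 1 / error_rate


def expected_errors(length, error_rate):
    """Average number of copying errors in one copy of a molecule of this length."""
    return length * error_rate


# Error rates quoted in the text (errors per base copied).
rna_low_fidelity = Fraction(1, 10**4)
rna_high_fidelity = Fraction(1, 10**5)
dna_low_fidelity = Fraction(1, 10**9)
dna_high_fidelity = Fraction(1, 10**10)

print(max_sustainable_length(Fraction(1, 1000)))        # the 1-in-1,000 example
print(max_sustainable_length(rna_low_fidelity),
      max_sustainable_length(rna_high_fidelity))
print(max_sustainable_length(dna_low_fidelity),
      max_sustainable_length(dna_high_fidelity))

# Ratio of fidelity between the DNA and RNA figures.
print(rna_low_fidelity / dna_low_fidelity)
```

At the limit, a molecule picks up on average about one error per copy, which is one way to read the threshold.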
Without DNA, life would be constrained to be simple replicating molecules such as probably existed in the RNA world. In modern life, only a few simple viruses use RNA as their hereditary material. Larger viruses use DNA, and all cellular life—from single-celled bacteria to the largest plants and animals—uses DNA. Without DNA, there would be no trees or horses, no seaweed or fish. D The Origin of Meiosis In the hereditary mechanisms we have considered so far, the offspring inherits a full copy of its parent’s DNA (or RNA). Inheritance is clonal, or asexual. The offspring is genetically identical to its parent, except for mutations. The next major transition in heredity is the evolution of meiosis; the same transition also resulted in the evolution of sex and of Mendelian heredity. Mendelian heredity is the kind of inheritance discovered by Gregor Mendel. A life form with Mendelian heredity has, at least in some part of its life cycle, two copies of its hereditary material (one inherited from the father, the other from the mother). In meiosis, the two copies are reduced to a single copy. The single copy then combines with a second copy from another individual. The reason why sex and meiosis exist is uncertain; there are two main theories at present. Sex and meiosis may help life to cope with mutational error, by increasing the efficiency with which natural selection removes harmful mutations; or they may help life to cope with infectious disease, by increasing the range of novel disease-resistance genotypes. Mendelian heredity, with sexual reproduction and meiosis, is associated with at least three other big features of life. One is complexity. Simpler life forms, such as bacteria, tend to be clonal, and it is the larger, multicellular life forms that use sex and meiosis. This could be because the more complex life forms suffer more from mutation, or more from infectious disease, than do simpler life forms. 
Secondly, meiotic, sexual life comes in the form of distinct species. For example, we can recognize distinct species such as human beings and chimpanzees. Over time, a species produces a distinct lineage, as it reproduces itself from one generation to the next. Genes are shuffled around within a species by interbreeding; but genes are not usually passed between species. Each species evolves independently. On a grand scale, the result is a tree-like pattern of evolution. Lineages branch off, evolve apart, and either produce new branches or go extinct. Before the origin of meiosis, and in non-meiotic life now (such as in bacteria), life either was not arranged in distinct lineages or the lineages were much less clear-cut than in meiotic life forms. Bacteria show a range of forms, but they are less clearly organized into distinct lineages than are multicellular animals and plants. Thirdly, Mendelian life forms evolve faster than pre-Mendelian life forms. In the fossil record, pre-Mendelian forms such as bacteria hardly change at all for hundreds, even thousands, of millions of years. Mendelian heredity evolved in an early eukaryote (cellular life form in which the DNA is carried in a distinct nucleus within the cell). The time of origin of eukaryotes is uncertain, but it may be somewhat before 2 billion years ago. In the fossil record, there is little evidence of rapid evolution this early. The tree-like pattern of evolution, with distinct branches that undergo change, radiating, splitting, and going extinct, does not clearly emerge in fossils until about 550 million years ago. However, molecular evidence suggests that the pattern is much older, dating back from at least 1 to 1.5 billion years ago. It is likely that the origin of meiosis in eukaryotes set the stage for evolution to proceed in its most familiar form—with distinct evolving lineages and a tree-like branching pattern. 
Before that there may have been less clear lineages, with more of a blur of forms that hardly changed, and without clear extinction events.

E The Origin of Multicellular and Weismannist Heredity

The simplest modern life forms with meiosis are single-celled creatures such as Paramecium. Meiosis probably originated in single-celled life forms in the past. Multicellular life then evolved from those single-celled ancestors. The earliest multicellular life forms were probably bundles, or rows, of cells, with all the cells having much the same form. But at some point, cell differentiation evolved: that is, a life form containing more than one kind of cell. A human body contains perhaps 200 or so cell types, such as skin cells, blood cells, and muscle cells, and they all develop during the life of an individual. As life forms with many cell types evolved, a new kind of heredity also arose. Biologists distinguish between germ cells and somatic cells within a body. The germ cells are the cells that reproduce the next generation. They are the sperm and egg cells, together with their precursor cells. The line of germ cells, or germ-line, is potentially immortal as it reproduces down the generations. The somatic cells, or soma, consist of all the rest: all the cells in the body that die when the body dies, leaving no descendants. Blood cells, brain cells, liver cells, and so on are all somatic cells. In human beings, and in most other animals, the somatic cells do not contribute genetically to the offspring in the next generation. Heredity is entirely carried out by the germ cells. This strict distinction between germ and somatic cells is often referred to as Weismannist heredity, after the German biologist August Weismann. Biologists before Weismann had not made this distinction, or at least had not appreciated its importance.
Charles Darwin, for example, put forward a theory of heredity, which he called his “provisional hypothesis of pangenesis”, in which all the parts of the body sent hereditary information to the reproductive cells. We now know this does not in fact happen. The DNA in the germ cells is segregated off from the rest of the body. In some colonial animals, and many plants, the distinction between germ and somatic cells is less rigid than in humans. In them, somatic cells may be capable of reproduction. For example, in marine animals such as sponges, an individual animal may be smashed to pieces by wave or rock action; the pieces may then regenerate into whole new animals. In sponges, most offspring are produced from specialized reproductive cells within the adult body, but it is also possible for offspring to be formed from other cells. Humans probably evolved from ancestors that had partly Weismannist heredity, as in sponges, and over time the division between reproductive germ cells and non-reproductive somatic cells has increased. Leo Buss has described the evolution of Weismannist heredity in plants and animals in The Evolution of Individuality. Richard Dawkins, in The Extended Phenotype (1982), thought about imaginary life forms in which the next generation is not formed from specialized reproductive cells. He suggested that any such life form would have to be simpler than Earthly multicellular life. Gene, unit of inheritance, a piece of the genetic material that determines the inheritance of a particular characteristic, or group of characteristics. Genes are carried by chromosomes in the cell nucleus and are arranged in a line along each chromosome. Every gene occupies a place, or locus, on the chromosome. Consequently, the word locus has become loosely interchangeable with the word gene. The genetic material is deoxyribonucleic acid, or DNA (see Nucleic Acids), a molecule that forms the “backbone” of the chromosome. 
Because the DNA in each chromosome is a single, long, thin, continuous molecule, the genes must be parts of that molecule; and because DNA is a chain of minute subunits known as nucleotide bases, each gene includes many bases. Four different kinds of bases exist in the chain—adenine, guanine, cytosine, and thymine—and their sequence in a gene determines its properties. Genes exert their effects through the molecules they produce. The immediate products of a gene are molecules of ribonucleic acid (RNA); these are copies of the DNA, except that RNA has the base uracil instead of thymine. The RNA molecules from some genes play a direct part in the metabolism of the organism, but most are used to make protein. Proteins are chains of subunits known as amino acids, and the sequence of bases in the RNA determines the sequence of amino acids in the protein by means of the genetic code (see Genetics: The Genetic Code). The sequence of amino acids in a protein dictates whether it will become part of the structure of the organism, or whether it will become an enzyme for promoting a particular chemical reaction. Thus, changes in the DNA can produce changes that affect the structure or the chemistry of an organism. Due to the complexity of living organisms, usually a number of genes will influence a process or major feature, but sometimes just a few or even one gene will affect an organism considerably. For example, it has been demonstrated in mice that just one gene can significantly affect memory. Inserting an extra copy of the gene that produces a component of the neuron receptor N-methyl-D-aspartate (NMDA) into the mouse genome causes a higher NMDA activity in the mouse’s brain and the mouse learns quicker and has a better memory than non-modified mice. Conversely, mice lacking this NR2B gene have impaired learning and memory. 
Scientists believe NMDA facilitates the creation of bonds between neurons that permit the association of two distinct stimuli, such as touching something hot and the sensation of pain. This ability is regarded by many scientists to be at the core of memory and learning. It is hoped that such work will aid drug design or gene-based therapies for memory loss in humans if there is sufficient similarity with human brain chemistry. The nucleotide bases in DNA that code the structure of RNAs and proteins are not the only components of genes; groups of bases adjacent to the coding sequences affect the quantities and dispositions of gene products. In higher organisms (animals and plants, rather than bacteria and viruses), the noncoding sequences outnumber the coding ones by a factor of ten or more, and the functions of these noncoding regions are largely unknown. This means that geneticists cannot yet set precise limits on the sizes of animal and plant genes in general. However, with the current advances in genetic mapping (most notably in the Human Genome Project) more and more information on the nature of genes is being gathered. For example, it is now known that the genome of the fruit fly Drosophila melanogaster contains 13,601 genes, and the human chromosome 22 contains an estimated 34.5 million building blocks of DNA, which comprise at least 545 genes and 134 pseudogenes (DNA sequences that resemble genes but do not instruct the cell to produce proteins). Further work on mapped genomes should reveal how certain genetic sequences determine specific protein synthesis and hence structure, metabolic functions, and processes. In 1994, taking advantage of new capabilities developed by the genome project, DOE initiated the Microbial Genome Program to sequence the genomes of bacteria useful in energy production, environmental remediation, toxic waste reduction, and industrial processing. 
A follow-on program, Genomics:GTL, builds on data and resources from the Human Genome Project, the Microbial Genome Program, and systems biology. GTL will accelerate understanding of dynamic living systems for solutions to DOE mission challenges in energy and the environment. Despite our reliance on the inhabitants of the microbial world, we know little of their number or their nature: estimates are that less than 0.01% of all microbes have been cultivated and characterized. Microbial genome sequencing will help lay a foundation for knowledge that will ultimately benefit human health and the environment. The economy will benefit from further industrial applications of microbial capabilities. Information gleaned from the characterization of complete microbial genomes will lead to insights into the development of such new energy-related biotechnologies as photosynthetic systems, microbial systems that function in extreme environments, and organisms that can metabolize readily available renewable resources and waste material with equal facility. Expected benefits also include development of diverse new products, processes, and test methods that will open the door to a cleaner environment. Biomanufacturing will use nontoxic chemicals and enzymes to reduce the cost and improve the efficiency of industrial processes. For example, microbial enzymes have been used to bleach paper pulp, stone wash denim, remove lipstick from glassware, break down starch in brewing, and coagulate milk protein for cheese production. In the health arena, microbial sequences may help researchers find new human genes and shed light on the disease-producing properties of pathogens. Microbial genomics will also help pharmaceutical researchers gain a better understanding of how pathogenic microbes cause disease. Sequencing these microbes will help reveal vulnerabilities and identify new drug targets.
Gaining a deeper understanding of the microbial world also will provide insights into the strategies and limits of life on this planet. Data generated in this young program have helped scientists identify the minimum number of genes necessary for life and confirm the existence of a third major kingdom of life. Additionally, the new genetic techniques now allow us to establish more precisely the diversity of microorganisms and identify those critical to maintaining or restoring the function and integrity of large and small ecosystems; this knowledge also can be useful in monitoring and predicting environmental change. Finally, studies on microbial communities provide models for understanding biological interactions and evolutionary history. Los Angeles Times Date: 02/11/01 22:15 WASHINGTON -- Scientists conducting the first thorough survey of human DNA have made a remarkable discovery: To create the complex organism known as a human takes only about twice the genes of a fruit fly or a roundworm. At the same time, though scientists once described the overwhelming majority of the human genetic code as "junk" with little apparent purpose, it is brimming with remnants of long-dead genes and bits of DNA that "live" amid the genes like parasites, reshuffling and reshaping them over time. This suggests that the "junk" plays a role in the process of evolution and deserves intensive study. The researchers also concluded that men, far more than women, produce the genetic mutations that bring disease into the human family and allow evolution to move forward. These and other findings will be reported today by the two teams of researchers that raced each other to map the chemical composition of human DNA, the inherited material within most cells that controls basic cellular operations and plays a role in most disease. That race ended in a tie in June, giving both teams a claim to one of the most important scientific achievements in history. 
Since then the teams -- one a private U.S. company, the other an international group funded largely by the U.S. government and a British charity -- have been scouring their data in order to conduct the first broad analysis of what lies within DNA. More than a dozen papers will be published on their findings in the journals Science and Nature. The results were to be released today, but a British newspaper broke the embargoed story in its Sunday edition. The two teams' conclusions, which largely agree with each other, say little directly about potential cures for disease, though that is ultimately the major goal of the research. Still, by revealing more about how genes work, as well as where in the DNA they are clustered, the work over time will help researchers studying a large variety of illnesses. Working independently, the two teams have concluded that humans have somewhere from 26,000 to 40,000 genes, with the best bet being less than 35,000. One team, led by Celera Genomics Corp., said the number could be as low as 26,000. Most experts until recently believed it took 100,000 genes or more to build and operate a human. "That's a `knock you over with a feather' kind of result," said Francis Collins, leader of the international team and director of the National Human Genome Research Institute. "On one level, it's a blow to the pride of our species: How can we hold our heads up if we have only a few more genes than a worm?" Collins said. "But what it tells you is that our complexity arises from some other source, and we will have to start looking for it." The reports offer evidence of how the body accomplishes so much with such a small gene set. Where simpler organisms rely on each gene to produce a single protein, human cells are able to skip over parts of genes at times, allowing each gene to produce on average three versions of a protein, and sometimes as many as five. 
This suggests that to understand the root causes of disease, researchers will have to intensively study how proteins, as well as genes, interact and how they go awry. Proteins are the workhorses of the body, handling such basic tasks as turning food into energy, signaling among cells and growing from an embryo into a child. Moreover, said Collins, the studies suggest that each human protein can handle more functions than those in a simpler organism. "If a worm protein is made to clip another protein ... then it's like a cutting knife that does one simple thing," he said. But the analogous human protein "would be like a Cuisinart -- it would have lots of settings and dice and slice and have more flexibility." The teams also found evidence of how DNA changes over time, causing organisms to evolve. In one surprising conclusion, the international team found that about 220 genes did not evolve in a straight line from all animals that came before humans. Instead, human ancestors adopted these genes millions of years ago directly from bacteria. In essence, this shows evolution making use of whatever material it found at hand, Collins suggested. Intriguingly, at least one of the genes in question plays a role in depression. Still, an independent scientist questioned the finding. "I think it's more likely that the gene transfer went the other way, from vertebrates to bacteria," said Philip Green of the University of Washington in Seattle. The two teams also found hundreds of thousands of copies of mysterious DNA bits that act like parasites, detaching themselves from the genome and then reinserting themselves in a new location. Their existence had been known, but there are far more of them than scientists had thought. Moreover, one type of these parasites has the ability to move other pieces of DNA with them. The international team found that it could move more DNA than previously known, suggesting that these elements play a role in reshuffling genes to form new ones. 
Another type was found to exist only where genes are plentiful, and it may help the body respond to extreme stress, the international team said. All this, as well as evidence of genes that became inactive long ago, exists in the 98 percent of DNA that does not produce proteins. Once, scientists considered this vast material to be junk. Now they are finding that much of the material has, or had, some kind of biological function. In addition to studying how genes work, the teams determined that genes are not scattered uniformly throughout DNA. Instead, some areas of the genome, as the sum total of human DNA is called, are packed densely with genes, like a busy, urban area, while others are like "deserts," barren of genes. Researchers are not sure why this clustering occurs.

In April 2000 the Subcommittee on Energy and Environment of the Committee on Science of the U.S. House of Representatives conducted hearings on the status and benefits of genome sequencing in the public and private sectors. Speakers included representatives of the U.S. HGP and Celera Genomics, members of Congress, and the director of the Office of Science and Technology Policy. Robert Waterston, director of the HGP sequencing center at Washington University, St. Louis, pointed to fruitful data sharing by the HGP and the private sector. Examples included (1) collaborations led by the pharmaceutical company Merck to develop partial sequences identifying genes and (2) the fruit fly sequencing project by Celera and the HGP. Examples of private-sector enrichment of public data included the SNP consortium, which generated a publicly available map containing human DNA variations. In September 2000, Celera Genomics announced a reference database with more than 2.8 million unique SNPs, including those screened from public-sector databases. In October a public-private consortium announced the joint sequencing of the laboratory mouse.
Also, a Monsanto-University of Washington project generated a draft sequence of the rice plant genome for release to the public. These efforts show the value of sharing data to increase knowledge and ensure future discoveries for mutual benefit. Neal Lane (formerly Assistant to the President for Science and Technology and Director of the Office of Science and Technology Policy) echoed the importance of partnerships between public and private sectors in his testimony to the House committee. His observations follow. "Sequencing the genome...is only the beginning of genomics," he said. "It is the first step into a future of discoveries and innovations that genomics will enable, that the public and private sectors must pursue together...An expanding, evolving partnership has made human genomic discoveries possible and is now poised to make those discoveries beneficial for everyone...I believe that the policies we have pursued will help to strengthen this partnership, allowing genomic discoveries and innovations to move steadily forward for the benefit of our nation and for all humankind." Completed in 2003, the Human Genome Project (HGP) was a 13-year project coordinated by the U.S. Department of Energy and the National Institutes of Health. During the early years of the HGP, the Wellcome Trust (U.K.) became a major partner; additional contributions came from Japan, France, Germany, China, and others. What is DNA sequencing? DNA sequencing, the process of determining the exact order of the 3 billion chemical building blocks (called bases and abbreviated A, T, C, and G) that make up the DNA of the 24 different human chromosomes, was the greatest technical challenge in the Human Genome Project. Achieving this goal has helped reveal the estimated 20,000-25,000 human genes within our DNA as well as the regions controlling them. The resulting DNA sequence maps are being used by 21st-century scientists to explore human biology and other complex phenomena.
Meeting Human Genome Project sequencing goals by 2003 required continual improvements in sequencing speed, reliability, and costs. Previously, standard methods were based on separating DNA fragments by gel electrophoresis, which was extremely labor intensive and expensive. Total sequencing output in the community was about 200 Mb for 1998. In January 2003, the DOE Joint Genome Institute alone sequenced 1.5 billion bases for the month. Gel-based sequencers use multiple tiny (capillary) tubes to run standard electrophoretic separations. These separations are much faster because the tubes dissipate heat well and allow the use of much higher electric fields to complete sequencing in shorter times. Whose genome was sequenced in the public (HGP) and private projects? The human genome reference sequences do not represent any one person’s genome. Rather, they serve as a starting point for broad comparisons across humanity. The knowledge obtained is applicable to everyone because all humans share the same basic set of genes and genomic regulatory regions that control the development and maintenance of their biological structures and processes. In the international public-sector Human Genome Project (HGP), researchers collected blood (female) or sperm (male) samples from a large number of donors. Only a few of many collected samples were processed as DNA resources. Thus the donor identities were protected so neither donors nor scientists could know whose DNA was sequenced. DNA clones from many different libraries were used in the overall project. Technically, it is much easier to prepare DNA cleanly from sperm than from other cell types because of the much higher ratio of DNA to protein in sperm and the much smaller volume in which purifications can be done. Using sperm does provide all chromosomes for study, including equal numbers of sperm with the X (female) or Y (male) sex chromosomes. 
However, HGP scientists also used white cells from the blood of female donors so as to include female-originated samples. Scientific vs Commercial Goals The HGP's commitment from the outset was to create a scientific standard (an entire reference genome). Most private-sector human genome sequencing projects, however, focused on gathering just enough DNA to meet their customers' needs, probably in the 95% to 99% range for gene-rich, potentially lucrative regions. Such private data continue to be enriched greatly by accurate free public mapping (location) and sequence information. Celera's shotgun sequencing strategy, for example, created millions of tiny fragments that had to be ordered and oriented computationally using HGP research results. Most data at Celera, Incyte, and other genomics information-based companies are proprietary or available only for a fee. In addition, companies are filing numerous patent applications to stake claims to genes and other potentially important DNA fragments. More than the Reference Sequence DNA sequencing will continue to be a major emphasis for the foreseeable future as gene sequences are surveyed across various populations. Both the DOE and NIH genome programs continue to support the development of fully integrated and innovative approaches to rapid, low-cost sequencing. Other HGP goals from the final 5-year plan were to enhance bioinformatics (computational) resources to support future research and commercial applications. The HGP also aimed to explore gene function through comparative mouse-human studies, train future scientists, study human variation, and address critical societal issues arising from the increased availability of human genome data and related analytical technologies. In the Celera Genomics private-sector project, DNAs from a few different genomes were mixed up and processed for sequencing.
The DNA resources used for these studies came from anonymous donors of European, African, American (North, Central, South), and Asian ancestry. The lead scientist of Celera Genomics at that time, Craig Venter, has since acknowledged that his DNA was one of those in the pool. Many small regions of DNA that vary among individuals (called polymorphisms) also were identified during the HGP, mostly single nucleotide polymorphisms (SNPs). Most SNPs are without physiological effect, although a minority contribute to the delightful and beneficial diversity of humanity. A much smaller minority of polymorphisms affect an individual’s susceptibility to disease and response to medical treatments. Although the HGP has been completed, SNP studies continue in the International HapMap Project, whose goal is to identify patterns of SNP groups (called haplotypes, or “haps”). The DNA samples for the HapMap came from a total of 270 individuals: Yoruba people in Ibadan, Nigeria; Japanese in Tokyo; Han Chinese in Beijing; and the French Centre d’Etude du Polymorphisme Humain (CEPH) resource. Who sequenced the human genome? Human Genome Project research was funded at many laboratories around the U.S. by the Department of Energy (DOE), the National Institutes of Health (NIH), or both. A list of the major U.S. Human Genome Project research sites can be found here. Other researchers at numerous colleges, universities, and laboratories throughout the United States have also received DOE and NIH funding for human genome research. At any given time, the DOE Human Genome Program has funded about 100 separate principal investigators. For DOE-funded projects, see Research. To see a list of NIH-funded projects, visit their grants database. In addition, many large and small private U.S. companies are conducting genome research. For more on the genomics research partnership between the public and private sectors, see the Human Genome Project and the Private Sector Fact Sheet. 
At least 18 other countries have participated in the Human Genome Project. What is the difference between draft sequence and finished sequence? In generating the draft sequence (released in June 2000), scientists determined the order of base pairs in each chromosomal area at least 4 to 5 times (4x to 5x) to ensure data accuracy and to help with reassembling DNA fragments in their original order. This repeated sequencing is known as genome "depth of coverage." Draft sequence data are mostly in the form of 10,000 basepair-sized fragments whose approximate chromosomal locations are known. To generate the high-quality reference sequence, completed in April 2003, additional sequencing was done to close gaps, reduce ambiguities, and allow for only a single error every 10,000 bases, the agreed-upon standard for the HGP. Investigators believe that a high-quality sequence is critical for recognizing regulatory components of genes that are very important in understanding human biology and such disorders as heart disease, cancer, and diabetes. The finished version provides an estimated 8x to 9x coverage of each chromosome. What genomes have been sequenced completely? The small genomes of several viruses and bacteria and the much larger genomes of three higher organisms have been completely sequenced; they are bakers' or brewers' yeast (Saccharomyces cerevisiae), the roundworm (Caenorhabditis elegans), and the fruit fly (Drosophila melanogaster). In October 2001 the draft sequence of the pufferfish Fugu rubripes, the first vertebrate after the human, was completed; and scientists finished the first genetic sequence of a plant, that of the weed Arabidopsis thaliana, in December 2000. Many more genomes have been completed since then. For information on published and unpublished genomes, see Genomes Online Database (GOLD). What nonhuman genome sequencing projects are supported by the U.S. Department of Energy? A list of microbial genome sequencing projects supported by the U.S. 
Department of Energy Microbial Genome Program is available on the program's Web site. What happens now that the human genome sequence is completed? The working draft DNA sequence and the more polished 2003 version represent an enormous achievement, akin in scientific importance, some say, to developing the periodic table of elements. And, as in most major scientific advances, much work remains to realize the full potential of the accomplishment. Early explorations into the human genome, now joined by projects on the genomes of a number of other organisms, are generating data whose volume and complex analyses are unprecedented in biology. Genomic-scale technologies will be needed to study and compare entire genomes, sets of expressed RNAs or proteins, gene families from a large number of species, variation among individuals, and the classes of gene regulatory elements. Deriving meaningful knowledge from DNA sequence will define biological research through the coming decades and require the expertise and creativity of teams of biologists, chemists, engineers, and computational scientists, among others. A sampling follows of some research challenges in genetics--what we still won't know, even with the full human sequence in hand. Single nucleotide polymorphisms or SNPs (pronounced "snips") are DNA sequence variations that occur when a single nucleotide (A, T, C, or G) in the genome sequence is altered. For example, a SNP might change the DNA sequence AAGGCTAA to ATGGCTAA. For a variation to be considered a SNP, it must occur in at least 1% of the population. SNPs, which make up about 90% of all human genetic variation, occur every 100 to 300 bases along the 3-billion-base human genome. Two of every three SNPs involve the replacement of cytosine (C) with thymine (T). SNPs can occur in both coding (gene) and noncoding regions of the genome. Many SNPs have no effect on cell function, but scientists believe others could predispose people to disease or influence their response to a drug.
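The SNP definition above can be illustrated with a minimal sketch in Python. It compares two aligned sequences of equal length base by base, using the AAGGCTAA / ATGGCTAA example from the text, and applies the 1% population-frequency rule. The function names find_single_base_differences and is_snp are hypothetical, chosen only for illustration; real variant calling works from sequencing reads and alignments and is considerably more involved.

```python
def find_single_base_differences(reference, variant):
    """Return (position, reference_base, variant_base) for each mismatch.

    Positions are 1-based. Both sequences are assumed to be already
    aligned and of equal length (an assumption made for this sketch).
    """
    if len(reference) != len(variant):
        raise ValueError("sequences must have the same length")
    return [
        (position, ref_base, var_base)
        for position, (ref_base, var_base)
        in enumerate(zip(reference, variant), start=1)
        if ref_base != var_base
    ]


def is_snp(variant_frequency, threshold=0.01):
    """Apply the rule from the text: a variation counts as a SNP only
    if it occurs in at least 1% of the population."""
    return variant_frequency >= threshold


# The example from the text: AAGGCTAA changed to ATGGCTAA.
differences = find_single_base_differences("AAGGCTAA", "ATGGCTAA")
print(differences)    # [(2, 'A', 'T')]

# Hypothetical frequencies, used only to show the 1% threshold.
print(is_snp(0.07))   # True
print(is_snp(0.001))  # False
```

The comparison reports a single difference at the second base, where A is replaced by T. Whether that difference counts as a SNP depends on how common it is in the population, which is why the frequency check is kept as a separate step.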
Although more than 99% of human DNA sequences are the same across the population, variations in DNA sequence can have a major impact on how humans respond to disease; environmental insults such as bacteria, viruses, toxins, and chemicals; and drugs and other therapies. This makes SNPs of great value for biomedical research and for developing pharmaceutical products or medical diagnostics. SNPs are also evolutionarily stable --not changing much from generation to generation --making them easier to follow in population studies. Scientists believe SNP maps will help them identify the multiple genes associated with such complex diseases as cancer, diabetes, vascular disease, and some forms of mental illness. These associations are difficult to establish with conventional gene-hunting methods because a single altered gene may make only a small contribution to the disease. Several groups worked to find SNPs and ultimately create SNP maps of the human genome. Among these groups were the U.S. Human Genome Project (HGP) and a large group of pharmaceutical companies called the SNP Consortium or TSC project. The likelihood of duplication among the groups was small because of the estimated 3 million SNPs, and the potential payoff was high. In addition to the pharmacogenomic, diagnostic, and biomedical research implications, SNP maps are helping to identify thousands of additional markers along the genome, thus simplifying navigation of the much larger genome map generated by researchers in the HGP. How can SNPs be used as risk factors in disease development? SNPs do not cause disease, but they can help determine the likelihood that someone will develop a particular disease. One of the genes associated with Alzheimer's, apolipoprotein E or ApoE, is a good example of how SNPs affect disease development. This gene contains two SNPs that result in three possible alleles for this gene: E2, E3, and E4. 
Each allele differs by one DNA base, and the protein product of each gene differs by one amino acid. Each individual inherits one maternal copy of ApoE and one paternal copy of ApoE. Research has shown that an individual who inherits at least one E4 allele will have a greater chance of getting Alzheimer's. Apparently, the change of one amino acid in the E4 protein alters its structure and function enough to make disease development more likely. Inheriting the E2 allele, on the other hand, seems to indicate that an individual is less likely to develop Alzheimer's. Of course, SNPs are not absolute indicators of disease development. Someone who has inherited two E4 alleles may never develop Alzheimer's, while another who has inherited two E2 alleles may. ApoE is just one gene that has been linked to Alzheimer's. Like most common chronic disorders such as heart disease, diabetes, or cancer, Alzheimer's is a disease that can be caused by variations in several genes. The polygenic nature of these disorders is what makes genetic testing for them so complicated. The answer to this question is based on information provided by the Genome News Network. Human Genome Project SNP Mapping Goals In 1998, as part of their last five-year plan, the DOE and NIH Human Genome Programs established goals to identify and map SNPs. These goals were as follows: Develop technologies for rapid, large-scale identification and scoring of SNPs and other DNA sequence variants. Identify common variants in the coding regions of most identified genes. Create a SNP map of at least 100,000 markers. Develop the intellectual foundations for studies of sequence variation. Create public resources of DNA samples and cell lines. What is The SNP consortium (TSC)? In April 1999, ten large pharmaceutical companies and the U.K. Wellcome Trust philanthropy announced the establishment of a consortium headed by Arthur L. Holden to find and map 300,000 common SNPs. 
The goal was to generate a widely accepted, high-quality, extensive, publicly available map using SNPs as markers evenly distributed throughout the human genome. In the end, many more SNPs (1.8 million total) were discovered than planned originally. Now that the SNP discovery phase of the TSC project is essentially complete, the emphasis has shifted to studying SNPs in populations. Various TSC member laboratories are genotyping a subset of SNPs as part of the Allele Frequency Project. The goal of the TSC allele frequency/genotype project is to determine the frequency of certain SNPs in three major world populations. See the TSC Web site for more information. Who are members of the SNP consortium? The international member companies, which together committed at least $30 million, are APBiotech, AstraZeneca Group PLC, Aventis, Bayer Group AG, Bristol-Myers Squibb Co., F. Hoffmann-La Roche, Glaxo Wellcome PLC, IBM, Motorola, Novartis AG, Pfizer Inc., Searle, and SmithKline Beecham PLC. The Wellcome Trust contributed at least $14 million. Laboratories funded by these companies to identify SNPs are located at the Whitehead Institute, Sanger Centre, Washington University (St. Louis), and Stanford University. Data management and analysis take place at Cold Spring Harbor Laboratory. See Consortium Updates: News Related to The SNP Consortium SNP Consortium Collaborates with HGP, Publishes First Progress Reports, Human Genome News article Nov. 2000. International SNP Meeting Updates, Human Genome News article Nov. 2000. Why should private companies fund a publicly accessible genome map? The SNP consortium views its map as a way to make available an important, precompetitive, high-quality research tool that will spark innovative work throughout the research and industrial communities. The map will be a powerful research tool to enhance the understanding of disease processes and facilitate the discovery and development of safer and more effective medications. 
Whose DNA was analyzed to create the consortium's SNP map? The SNP consortium used DNA resources from a pool of samples obtained from 24 individuals representing several racial groups. This is a subset of the DNA reference panel for SNP identification collected by the NIH National Human Genome Research Institute. The anonymous, voluntary DNA contributions were made with informed consent specifically for this use. Are SNP data available to the public? SNP data were made available through a consortium Web site at quarterly intervals during the project's first year and at monthly intervals during the second year. This cycle of releases ceased in the fall of 2001 once the discovery phase was finished, but with the recent additions of genotype and allele frequency data, a new data dump took place in the fall of 2002. Besides the TSC Web site, SNP data is also available from the following resources: dbSNP database -- From the National Center for Biotechnology Information (NCBI). HGVbase (Human Genome Variation Database) - A human gene-based polymorphism database. For tips on how to use these and other databases, see the Gene Mutation Resources at Gene Gateway, an online guide for learning about genes, proteins, and genetic disorders. GENOME VARIATIONS Genome variations are differences in the sequence of DNA from one person to the next. Just as you can look at two people and tell that they are different, you could, with the proper chemicals and laboratory equipment, look at the genomes of two people and tell that they are different, too. In fact, people are unique in large part because their genomes are unique. How different is one human genome from another? The more closely related two people are, the more similar their genomes. Scientists estimate that the genomes of non-related people—any two people plucked at random off the street—differ at about 1 in every 1,200 to 1,500 DNA bases, or "letters." 
Whether that's a little or a lot of variation depends on your perspective. There are more than three million differences between your genome and anyone else's. On the other hand, we are all 99.9 percent the same, DNA-wise. (By contrast, we are only about 99 percent the same as our closest relatives, chimpanzees.) Most genome variations are relatively small and simple, involving only a few bases—an A substituted for a T here, a G left out there, a short sequence such as CT added somewhere else, for example. Your genome probably doesn't contain long stretches of DNA that someone else's lacks. If the genome were a book, every person's book would contain the same paragraphs and chapters, arranged in the same order. Each book would tell more or less the same story. But my book might contain a typo on page 303 that yours lacks, and your book might use a British spelling on page 135—"colour"—where mine uses the American spelling—"color." If every human genome is different, what does it mean to sequence "the" human genome? With the June 26, 2000, announcement by the publicly funded Human Genome Project (HGP) and Celera Genomics that the draft sequence of the human genome was essentially complete, the complementary aspects of the public and private sectors sequencing projects were realized. After the spring of 1998, when Celera Genomics announced its sequencing goal, other private companies also declared their intention to sequence or map genomic regions to varying degrees. Some people questioned whether the HGP and the private sector were duplicating work, and they wondered who would win the race to sequence the human genome. Although the HGP and private companies did have overlapping sequencing goals, their finish lines were different because their ultimate goals were not the same. In a sense, through its policy of open data release, the HGP has all along facilitated the research of others. 
Additionally, the HGP funds projects at small companies to devise needed technologies. DOE, NIH, the National Institute for Standards and Technology, and other governmental funding sources also supported further application and commercialization of HGP-generated resources. HGP products spurred a boom in such spin-off programs as the NIH Cancer Genome Anatomy Project and the DOE Microbial Genome Program. Genomes of numerous animals, plants, and microbes are being sequenced, and the number of private endeavors is increasing. Technology transfer from developers to users and participation in collaborative, multidisciplinary projects closely unite researchers at academic, industrial, and governmental laboratories. The complete human genome sequence announced in June 2000 is a "representative" genome sequence based on the DNA of just a few individuals. The scientific paper was published in the February 16, 2001 issue of Science. Over the longer term, scientists will study DNA from many different people to identify where and what variations between individual genomes exist. Sequencing a genome is such a Herculean task that capturing its person-to-person variability on the first pass would be next to impossible. But that doesn't mean that the representative sequence we have now will be useless; far from it. The vast majority of the genome's sequence is the same from one person to the next, with the same genes in the same places. In other words, my genome is a pretty good approximation of yours, and if scientists sequenced your genome they would learn a lot about mine. Moreover, since every person's genome is unique, no one person is any more or less "representative" than any other and it hardly matters whose genome is sequenced first. Why is every human genome different? Every human genome is different because of mutations: "mistakes" that occur occasionally in a DNA sequence.
When a cell divides in two, it makes a copy of its genome, then parcels out one copy to each of the two new cells. Theoretically, the entire genome sequence is copied exactly, but in practice a wrong base is incorporated into the DNA sequence every once in a while, or a base or two might be left out or added. These mistakes ("changes" might be a more accurate word, because they are not always bad news) are called mutations. When a mutation occurs in a sex cell (a sperm or an egg), it can be passed along to the next generation of people. Your genome contains about 100 "new" mutations: changes that occurred as your parents' bodies made the egg and sperm cells that became you. These genome variations are uniquely yours. Other variations in your genome arose many generations ago and have been passed down from parent to child over the years, until they ended up in you. You probably share each one of these older variations with many other people all over the world, but still, no one else has the exact same combination of variations that you have. Where are genome variations found? Variations are found all throughout the genome, on every one of the 46 human chromosomes. But this variation is by no means distributed evenly: It's not as if there is one difference every 1,000 bases as regular as rain. Instead, some parts of the genome are "hot spots" of variability, with hundreds of possible variations of a sequence. Other parts of the genome, meanwhile, don't vary much at all between individuals; in scientific parlance, they are said to be "stable." The majority of variations are found outside of genes, in the "extra" or "junk" DNA that generally does not affect a person's characteristics. Mutations in these parts of the genome are usually not harmful, so variations can accumulate without causing any problems. Genes, by contrast, tend to be stable because mutations that occur in genes are often harmful to an individual, and thus less likely to be passed on.
What kinds of genome variations are there? Genome variations include mutations and polymorphisms. Technically, a polymorphism (a term that comes from the Greek words "poly," or "many," and "morphe," or "form") is a DNA variation in which each possible sequence is present in at least 1 percent of people. For example, a place in the genome where 93 percent of people have a T and the remaining 7 percent have an A is a polymorphism. If one of the possible sequences is present in less than 1 percent of people (99.9 percent of people have a G and 0.1 percent have a C), then the variation is called a mutation. What is functional genomics? Comparative genomics involves the use of computer programs that can line up
multiple genomes and look for regions of similarity among them. Some of these
sequence-similarity tools are accessible to the public over the Internet. One of
the most widely used is BLAST, which is available from the National Center for
Biotechnology Information. BLAST is a set of programs designed to perform
similarity searches on all available sequence data. For instructions on how to
use BLAST, see the tutorial Sequence similarity searching using NCBI BLAST
available through Gene Gateway, an online guide for learning about genes,
proteins, and genetic disorders. I know of only a few cases in which no mouse counterpart can be found for a particular human gene, and for the most part we see essentially a one-to-one correspondence between genes in the two species. The exceptions generally appear to be of a particular type --genes that arise when an existing sequence is duplicated. Gene duplication occurs frequently in complex genomes; sometimes the duplicated copies degenerate to the point where they no longer are capable of encoding a protein. However, many duplicated genes remain active and over time may change enough to perform a new function. Since gene duplication is an ongoing process, mice may have active duplicates that humans do not possess, and vice versa. These appear to make up a small percentage of the total genes. I believe the number of human genes without a clear mouse counterpart, and vice versa, won't be significantly larger than 1% of the total. Nevertheless, these novel genes may play an important role in determining species-specific traits and functions. However, the most significant differences between mice and humans are not in the number of genes each carries but in the structure of genes and the activities of their protein products. Gene for gene, we are very similar to mice. What really matters is that subtle changes accumulated in each of the approximately 30,000 genes add together to make quite different organisms. Further, genes and proteins interact in complex ways that multiply the functions of each. In addition, a gene can produce more than one protein product through alternative splicing or post-translational modification; these events do not always occur in an identical way in the two species. A gene can produce more or less protein in different cells at various times in response to developmental or environmental cues, and many proteins can express disparate functions in various biological contexts. 
Thus, subtle distinctions are multiplied by the more than 30,000 estimated genes. Sequencing Technologies and Biological Resources. Other major factors in cost and time reduction were greatly improved sequencing instruments and efficient biological resources such as the following: DOE-funded research on capillary-based DNA sequencing contributed to the development of the two major sequencing machines. The core optical system concept of the Perkin-Elmer 3700 sequencing machine (used by Celera and others) was pioneered with DOE support. The instrumentation concepts that matured as the MegaBACE sequencer were pioneered by Richard Mathies (University of California, Berkeley). The DOE JGI chose this sequencing hardware platform after competitive trials. DNA sequencing originally was done with radiolabeled DNA fragments. Today, DOE improvements to fluorescent dyes decrease the amount of DNA needed and increase the accuracy of sequencing data. Bacterial artificial chromosome (BAC) clones, developed in the DOE program, became the preferred starting resource in sequencing procedures because of their superior stability and large size. A critical component of public- and private-sector sequencing, BACs were used to assemble both the draft and final human DNA reference sequences. Further extending the usefulness of BACs, the DOE HGP funded the production of sequence tag connectors (STCs) from BAC ends. This early information enabled the selection of optimal BACs for complete sequencing, thus saving time and money. STC use for the HGP was advocated by Craig Venter and Nobelist Hamilton Smith (Celera), and Leroy Hood (now at the Institute for Systems Biology). A Successful Transformation These successes transferred much of the repetitive labor from humans to automated machines. In addition, new software for data processing both eased and accelerated human decision making.
Over the last decade, advances in instrumentation, automation, and computation have transformed the entire process. Further innovations, however, still are needed for completing many large sequences and increasing the effectiveness of sequencing. The often-quoted statement that we share over 98% of our genes with apes (chimpanzees, gorillas, and orangutans) actually should be put another way. That is, there is more than 95% to 98% similarity between related genes in humans and apes in general. (Just as in the mouse, quite a few genes probably are not common to humans and apes, and these may influence uniquely human or ape traits.) Similarities between mouse and human genes range from about 70% to 90%, with an average of 85% similarity but a lot of variation from gene to gene (e.g., some mouse and human gene products are almost identical, while others are nearly unrecognizable as close relatives). Some nucleotide changes are “neutral” and do not yield a significantly altered protein. Others, but probably only a relatively small percentage, would introduce changes that could substantially alter what the protein does. Put these alterations in the context of known inherited human diseases: a
single nucleotide change can lead to inheritance of sickle cell disease, cystic
fibrosis, or breast cancer. A single nucleotide difference can alter protein
function in such a way that it causes a terrible tissue malfunction. Single
nucleotide changes have been linked to hereditary differences in height, brain
development, facial structure, pigmentation, and many other striking
morphological differences; due to single nucleotide changes, hands can develop
structures that look like toes instead of fingers, and a mouse's tail can
disappear completely. Single-nucleotide changes in the same genes but in
different positions in the coding sequence might do nothing harmful at all.
Evolutionary changes are the same as these sequence differences that are linked
to person-to-person variation: many of the average 15% nucleotide changes that
distinguish humans and mouse genes are neutral; some lead to subtle changes,
whereas others are associated with dramatic differences. Add them all together,
and they can make quite an impact, as evidenced by the huge range of metabolic,
morphological, and behavioral differences we see among organisms. When researchers isolate human genes with unknown functions, they can create knockout mice with these genes and observe the results. Instead of creating merely the mouse equivalent of the human gene, researchers are able to reproduce and express actual human genes and their corresponding proteins in mice. Subsequent offspring will inherit not only the instructions coded by their original mouse genome, but also the traits coded for by the inserted human DNA. This helps researchers understand health and disease by observing how genes work in cells. Knockout mice have many benefits. They not only allow researchers to
determine gene function and understand diseases at the molecular level, but they
also aid scientists in testing new drugs and devising novel therapies. Initial sequencing and analysis of the human genome International Human Genome Sequencing Consortium The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence. The rediscovery of Mendel's laws of heredity in the opening weeks of the 20th century1-3 sparked a scientific quest to understand the nature and content of genetic information that has propelled biology for the last hundred years. The scientific progress made falls naturally into four main phases, corresponding roughly to the four quarters of the century. The first established the cellular basis of heredity: the chromosomes. The second defined the molecular basis of heredity: the DNA double helix. The third unlocked the informational basis of heredity, with the discovery of the biological mechanism by which cells read the information contained in genes and with the invention of the recombinant DNA technologies of cloning and sequencing by which scientists can do the same. The last quarter of a century has been marked by a relentless drive to decipher first genes and then entire genomes, spawning the field of genomics. The fruits of this work already include the genome sequences of 599 viruses and viroids, 205 naturally occurring plasmids, 185 organelles, 31 eubacteria, seven archaea, one fungus, two animals and one plant. Here we report the results of a collaboration involving 20 groups from the United States, the United Kingdom, Japan, France, Germany and China to produce a draft sequence of the human genome. 
The draft genome sequence was generated from a physical map covering more than 96% of the euchromatic part of the human genome and, together with additional sequence in public databases, it covers about 94% of the human genome. The sequence was produced over a relatively short period, with coverage rising from about 10% to more than 90% over roughly fifteen months. The sequence data have been made available without restriction and updated daily throughout the project. The task ahead is to produce a finished sequence, by closing all gaps and resolving all ambiguities. Already about one billion bases are in final form and the task of bringing the vast majority of the sequence to this standard is now straightforward and should proceed rapidly. The sequence of the human genome is of interest in several respects. It is the largest genome to be extensively sequenced so far, being 25 times as large as any previously sequenced genome and eight times as large as the sum of all such genomes. It is the first vertebrate genome to be extensively sequenced. And, uniquely, it is the genome of our own species. Much work remains to be done to produce a complete finished sequence, but the vast trove of information that has become available through this collaborative effort allows a global perspective on the human genome. Although the details will change as the sequence is finished, many points are already clear. The genomic landscape shows marked variation in the distribution of a number of features, including genes, transposable elements, GC content, CpG islands and recombination rate. This gives us important clues about function. For example, the developmentally important HOX gene clusters are the most repeat-poor regions of the human genome, probably reflecting the very complex coordinate regulation of the genes in the clusters. There appear to be about 30,000–40,000 protein-coding genes in the human genome—only about twice as many as in worm or fly. 
However, the genes are more complex, with more alternative splicing generating a larger number of protein products. The full set of proteins (the 'proteome') encoded by the human genome is more complex than those of invertebrates. This is due in part to the presence of vertebrate-specific protein domains and motifs (an estimated 7% of the total), but more to the fact that vertebrates appear to have arranged pre-existing components into a richer collection of domain architectures. Hundreds of human genes appear likely to have resulted from horizontal transfer from bacteria at some point in the vertebrate lineage. Dozens of genes appear to have been derived from transposable elements. Although about half of the human genome derives from transposable elements, there has been a marked decline in the overall activity of such elements in the hominid lineage. DNA transposons appear to have become completely inactive and long-terminal repeat (LTR) retroposons may also have done so. The pericentromeric and subtelomeric regions of chromosomes are filled with large recent segmental duplications of sequence from elsewhere in the genome. Segmental duplication is much more frequent in humans than in yeast, fly or worm. Analysis of the organization of Alu elements explains the longstanding mystery of their surprising genomic distribution, and suggests that there may be strong selection in favour of preferential retention of Alu elements in GC-rich regions and that these 'selfish' elements may benefit their human hosts. The mutation rate is about twice as high in male as in female meiosis, showing that most mutation occurs in males. Cytogenetic analysis of the sequenced clones confirms suggestions that large GC-poor regions are strongly correlated with 'dark G-bands' in karyotypes. 
Recombination rates tend to be much higher in distal regions (around 20 megabases (Mb)) of chromosomes and on shorter chromosome arms in general, in a pattern that promotes the occurrence of at least one crossover per chromosome arm in each meiosis. More than 1.4 million single nucleotide polymorphisms (SNPs) in the human genome have been identified. This collection should allow the initiation of genome-wide linkage disequilibrium mapping of the genes in the human population. In this paper, we start by presenting background information on the project and describing the generation, assembly and evaluation of the draft genome sequence. We then focus on an initial analysis of the sequence itself: the broad chromosomal landscape; the repeat elements and the rich palaeontological record of evolutionary and biological processes that they provide; the human genes and proteins and their differences and similarities with those of other organisms; and the history of genomic segments. (Comparisons are drawn throughout with the genomes of the budding yeast Saccharomyces cerevisiae, the nematode worm Caenorhabditis elegans, the fruitfly Drosophila melanogaster and the mustard weed Arabidopsis thaliana; we refer to these for convenience simply as yeast, worm, fly and mustard weed.) Finally, we discuss applications of the sequence to biology and medicine and describe next steps in the project. A full description of the methods is provided as Supplementary Information on Nature's web site. We recognize that it is impossible to provide a comprehensive analysis of this vast dataset, and thus our goal is to illustrate the range of insights that can be gleaned from the human genome and thereby to sketch a research agenda for the future.
Background to the Human Genome Project The Human Genome Project arose from two key insights that emerged in the early 1980s: that the ability to take global views of genomes could greatly accelerate biomedical research, by allowing researchers to attack problems in a comprehensive and unbiased fashion; and that the creation of such global views would require a communal effort in infrastructure building, unlike anything previously attempted in biomedical research. Several key projects helped to crystallize these insights, including: (1) The sequencing of the bacterial viruses ΦX174 (refs 4, 5) and lambda (ref. 6), the animal virus SV40 (ref. 7) and the human mitochondrion (ref. 8) between 1977 and 1982. These projects proved the feasibility of assembling small sequence fragments into complete genomes, and showed the value of complete catalogues of genes and other functional elements. (2) The programme to create a human genetic map to make it possible to locate disease genes of unknown function based solely on their inheritance patterns, launched by Botstein and colleagues in 1980 (ref. 9). (3) The programmes to create physical maps of clones covering the yeast (ref. 10) and worm (ref. 11) genomes to allow isolation of genes and regions based solely on their chromosomal position, launched by Olson and Sulston in the mid-1980s. (4) The development of random shotgun sequencing of complementary DNA fragments for high-throughput gene discovery by Schimmel (ref. 12) and Schimmel and Sutcliffe (ref. 13), later dubbed expressed sequence tags (ESTs) and pursued with automated sequencing by Venter and others (refs 14-20). Substantial public-sector R&D investment often was needed in feasibility demonstrations before such start-up ventures as those by Celera Genomics, Incyte, and Human Genome Sciences could begin. In turn, these companies furnished valuable commercial services that the government could not provide, and the taxes returned by their successes easily repay fundamental public investments.
Following are a few key public R&D contributions that made some current genomics ventures commercially feasible. These examples describe DOE investments, but substantial commitments by NIH and the Wellcome Trust in the United Kingdom were equally important. Scientific Infrastructure. The scientific foundation for a human genome initiative existed at the national laboratories before DOE established the first genome project in 1986. Besides expertise in a number of areas critical to genomic research, the laboratories had a long history of conducting large multidisciplinary projects. Genomic Science and Pioneering Technology. GenBank, the world's DNA sequence repository, was developed at Los Alamos National Laboratory (LANL) and later transferred to the National Library of Medicine. Chromosome-sorting capabilities developed at LANL and Lawrence Livermore National Laboratory enabled the development of DNA clone libraries representing the individual chromosomes. These libraries were a crucial resource in genome sequencing. Sequencing Strategies. When the HGP was initiated, vital automation tools and high-throughput sequencing technologies had to be developed or improved. The cost of sequencing a single DNA base was about $10 then; by 2001, sequencing costs had fallen about 100-fold to $0.10 to $0.20 a base and still are dropping rapidly. DOE-funded enhancements to sequencing protocols, chemical reagents, and enzymes contributed substantially to increasing efficiencies. The commercial marketing of these reagents has greatly benefitted basic R&D, genome-scale sequencing, and lower-cost commercial diagnostic services. The idea of sequencing the entire human genome was first proposed in discussions at scientific meetings organized by the US Department of Energy and others from 1984 to 1986 (refs 21, 22).
A committee appointed by the US National Research Council endorsed the concept in its 1988 report23, but recommended a broader programme, to include: the creation of genetic, physical and sequence maps of the human genome; parallel efforts in key model organisms such as bacteria, yeast, worms, flies and mice; the development of technology in support of these objectives; and research into the ethical, legal and social issues raised by human genome research. The programme was launched in the US as a joint effort of the Department of Energy and the National Institutes of Health. In other countries, the UK Medical Research Council and the Wellcome Trust supported genomic research in Britain; the Centre d'Etude du Polymorphisme Humain and the French Muscular Dystrophy Association launched mapping efforts in France; government agencies, including the Science and Technology Agency and the Ministry of Education, Science, Sports and Culture supported genomic research efforts in Japan; and the European Community helped to launch several international efforts, notably the programme to sequence the yeast genome. By late 1990, the Human Genome Project had been launched, with the creation of genome centres in these countries. Additional participants subsequently joined the effort, notably in Germany and China. In addition, the Human Genome Organization (HUGO) was founded to provide a forum for international coordination of genomic research. Several books24-26 provide a more comprehensive discussion of the genesis of the Human Genome Project. Through 1995, work progressed rapidly on two fronts. The first was construction of genetic and physical maps of the human and mouse genomes27-31, providing key tools for identification of disease genes and anchoring points for genomic sequence. The second was sequencing of the yeast32 and worm33 genomes, as well as targeted regions of mammalian genomes34-37. 
These projects showed that large-scale sequencing was feasible and developed the two-phase paradigm for genome sequencing. In the first, 'shotgun', phase, the genome is divided into appropriately sized segments and each segment is covered to a high degree of redundancy (typically, eight- to tenfold) through the sequencing of randomly selected subfragments. The second is a 'finishing' phase, in which sequence gaps are closed and remaining ambiguities are resolved through directed analysis. The results also showed that complete genomic sequence provided information about genes, regulatory regions and chromosome structure that was not readily obtainable from cDNA studies alone. In 1995, genome scientists considered a proposal38 that would have involved producing a draft genome sequence of the human genome in a first phase and then returning to finish the sequence in a second phase. After vigorous debate, it was decided that such a plan was premature for several reasons. These included the need first to prove that high-quality, long-range finished sequence could be produced from most parts of the complex, repeat-rich human genome; the sense that many aspects of the sequencing process were still rapidly evolving; and the desirability of further decreasing costs. Instead, pilot projects were launched to demonstrate the feasibility of cost-effective, large-scale sequencing, with a target completion date of March 1999. The projects successfully produced finished sequence with 99.99% accuracy and no gaps39. They also introduced bacterial artificial chromosomes (BACs)40, a new large-insert cloning system that proved to be more stable than the cosmids and yeast artificial chromosomes (YACs)41 that had been used previously. The pilot projects drove the maturation and convergence of sequencing strategies, while producing 15% of the human genome sequence. With successful completion of this phase, the human genome sequencing effort moved into full-scale production in March 1999. 
The idea of first producing a draft genome sequence was revived at this time, both because the ability to finish such a sequence was no longer in doubt and because there was great hunger in the scientific community for human sequence data. In addition, some scientists favoured prioritizing the production of a draft genome sequence over regional finished sequence because of concerns about commercial plans to generate proprietary databases of human sequence that might be subject to undesirable restrictions on use42-44. The consortium focused on an initial goal of producing, in a first production phase lasting until June 2000, a draft genome sequence covering most of the genome. Such a draft genome sequence, although not completely finished, would rapidly allow investigators to begin to extract most of the information in the human sequence. Experiments showed that sequencing clones covering about 90% of the human genome to a redundancy of about four- to fivefold ('half-shotgun' coverage; see Box 1) would accomplish this45, 46. The draft genome sequence goal has been achieved, as described below. The second sequence production phase is now under way. Its aims are to achieve full-shotgun coverage of the existing clones during 2001, to obtain clones to fill the remaining gaps in the physical map, and to produce a finished sequence (apart from regions that cannot be cloned or sequenced with currently available techniques) no later than 2003. Hierarchical shotgun sequencing Soon after the invention of DNA sequencing methods47, 48, the shotgun sequencing strategy was introduced49-51; it has remained the fundamental method for large-scale genome sequencing52-54 for the past 20 years. The approach has been refined and extended to make it more efficient. For example, improved protocols for fragmenting and cloning DNA allowed construction of shotgun libraries with more uniform representation. 
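The redundancy figures quoted above (four- to fivefold for 'half-shotgun' coverage, eight- to tenfold for full shotgun) can be made concrete with a small sketch. It assumes the standard idealized Poisson model of random shotgun sampling: reads land uniformly at random, the target contains no repeats, and any overlap, however short, is detectable. The clone size and read length used in the example are hypothetical round numbers, not values taken from the project.

```python
import math


def coverage(num_reads: int, read_length: int, target_length: int) -> float:
    """Fold redundancy c = (number of reads * read length) / target length."""
    return num_reads * read_length / target_length


def fraction_covered(c: float) -> float:
    """Expected fraction of bases hit by at least one read (idealized model)."""
    return 1.0 - math.exp(-c)


def expected_gaps(num_reads: int, c: float) -> float:
    """Expected number of gaps between contigs, approximately N * exp(-c)."""
    return num_reads * math.exp(-c)


if __name__ == "__main__":
    clone_length = 150_000  # hypothetical large-insert clone, in bases
    read_length = 500       # hypothetical read length, in bases
    for fold in (4, 8):
        n = fold * clone_length // read_length
        c = coverage(n, read_length, clone_length)
        print(fold, n, round(fraction_covered(c), 4), round(expected_gaps(n, c), 2))
```

Under these assumptions, doubling the redundancy from fourfold to eightfold cuts the expected number of gaps in a clone from roughly twenty to fewer than one, which is the intuition behind treating the lower figure as a draft and the higher one as the starting point for finishing. Real data depart from the model because of repeats and cloning bias.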
The practice of sequencing from both ends of double-stranded clones ('double-barrelled' shotgun sequencing) was introduced by Ansorge and others (ref. 37) in 1990, allowing the use of 'linking information' between sequence fragments. The application of shotgun sequencing was also extended by applying it to larger and larger DNA molecules: from plasmids (about 4 kilobases (kb)) to cosmid clones (40 kb; ref. 37), to artificial chromosomes cloned in bacteria and yeast (100–500 kb; ref. 55) and bacterial genomes (1–2 megabases (Mb); ref. 56). In principle, a genome of arbitrary size may be directly sequenced by the shotgun method, provided that it contains no repeated sequence and can be uniformly sampled at random. The genome can then be assembled using the simple computer science technique of 'hashing' (in which one detects overlaps by consulting an alphabetized look-up table of all k-letter words in the data). Mathematical analysis of the expected number of gaps as a function of coverage is similarly straightforward (ref. 57). Practical difficulties arise because of repeated sequences and cloning bias. Small amounts of repeated sequence pose little problem for shotgun sequencing. For example, one can readily assemble typical bacterial genomes (about 1.5% repeat) or the euchromatic portion of the fly genome (about 3% repeat). By contrast, the human genome is filled (more than 50%) with repeated sequences, including interspersed repeats derived from transposable elements, and long genomic regions that have been duplicated in tandem, palindromic or dispersed fashion (see below). These include large duplicated segments (50–500 kb) with high sequence identity (98–99.9%), at which mispairing during recombination creates deletions responsible for genetic syndromes. Such features complicate the assembly of a correct and finished genome sequence. There are two approaches for sequencing large repeat-rich genomes.
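The 'hashing' idea mentioned above, a look-up table of all k-letter words used to find reads that may overlap, can be sketched in a few lines. This is a minimal illustration rather than the method of any particular assembler: it uses a Python dictionary instead of an alphabetized table, the function names are hypothetical, and it only proposes candidate pairs that a real assembler would go on to verify by alignment.

```python
from collections import defaultdict
from itertools import combinations


def kmer_index(reads, k):
    """Map every k-letter word to the set of read ids that contain it."""
    index = defaultdict(set)
    for read_id, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(read_id)
    return index


def candidate_overlaps(reads, k):
    """Return pairs of read ids that share at least one k-letter word."""
    pairs = set()
    for ids in kmer_index(reads, k).values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


if __name__ == "__main__":
    # Toy reads: the first two share the word CGGA; the third shares nothing.
    reads = ["ACGTACGGA", "CGGATTTCA", "GGGGGGGGG"]
    print(candidate_overlaps(reads, k=4))
```

The sketch also shows why repeats are troublesome: two reads drawn from different copies of a repeated sequence share k-letter words just as truly overlapping reads do, so the table alone cannot tell them apart.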
The first is a whole-genome shotgun sequencing approach, as has been used for the repeat-poor genomes of viruses, bacteria and flies, using linking information and computational analysis to attempt to avoid misassemblies. The second is the 'hierarchical shotgun sequencing' approach, also referred to as 'map-based', 'BAC-based' or 'clone-by-clone'. This approach involves generating and organizing a set of large-insert clones (typically 100–200 kb each) covering the genome and separately performing shotgun sequencing on appropriately chosen clones. Because the sequence information is local, the issue of long-range misassembly is eliminated and the risk of short-range misassembly is reduced. One caveat is that some large-insert clones may suffer rearrangement, although this risk can be reduced by appropriate quality-control measures involving clone fingerprints (see below). The two methods are likely to entail similar costs for producing finished sequence of a mammalian genome. The hierarchical approach has a higher initial cost than the whole-genome approach, owing to the need to create a map of clones (about 1% of the total cost of sequencing) and to sequence overlaps between clones. On the other hand, the whole-genome approach is likely to require much greater work and expense in the final stage of producing a finished sequence, because of the challenge of resolving misassemblies. Both methods must also deal with cloning biases, resulting in under-representation of some regions in either large-insert or small-insert clone libraries. There was lively scientific debate over whether the human genome sequencing effort should employ whole-genome or hierarchical shotgun sequencing. Weber and Myers58 stimulated these discussions with a specific proposal for a whole-genome shotgun approach, together with an analysis suggesting that the method could work and be more efficient. 
Green59 challenged these conclusions and argued that the potential benefits did not outweigh the likely risks. In the end, we concluded that the human genome sequencing effort should employ the hierarchical approach for several reasons. First, it was prudent to use the approach for the first project to sequence a repeat-rich genome. With the hierarchical approach, the ultimate frequency of misassembly in the finished product would probably be lower than with the whole-genome approach, in which it would be more difficult to identify regions in which the assembly was incorrect. Second, it was prudent to use the approach in dealing with an outbred organism, such as the human. In the whole-genome shotgun method, sequence would necessarily come from two different copies of the human genome. Accurate sequence assembly could be complicated by sequence variation between these two copies—both SNPs (which occur at a rate of 1 per 1,300 bases) and larger-scale structural heterozygosity (which has been documented in human chromosomes). In the hierarchical shotgun method, each large-insert clone is derived from a single haplotype. Third, the hierarchical method would be better able to deal with inevitable cloning biases, because it would more readily allow targeting of additional sequencing to under-represented regions. And fourth, it was better suited to a project shared among members of a diverse international consortium, because it allowed work and responsibility to be easily distributed. As the ultimate goal has always been to create a high-quality, finished sequence to serve as a foundation for biomedical research, we reasoned that the advantages of this more conservative approach outweighed the additional cost, if any. A biotechnology company, Celera Genomics, has chosen to incorporate the whole-genome shotgun approach into its own efforts to sequence the human genome. 
Their plan60, 61 uses a mixed strategy, involving combining some coverage with whole-genome shotgun data generated by the company together with the publicly available hierarchical shotgun data generated by the International Human Genome Sequencing Consortium. If the raw sequence reads from the whole-genome shotgun component are made available, it may be possible to evaluate the extent to which the sequence of the human genome can be assembled without the need for clone-based information. Such analysis may help to refine sequencing strategies for other large genomes. Technology for large-scale sequencing Sequencing the human genome depended on many technological improvements in the production and analysis of sequence data. Key innovations were developed both within and outside the Human Genome Project. Laboratory innovations included four-colour fluorescence-based sequence detection62, improved fluorescent dyes63-66, dye-labelled terminators67, polymerases specifically designed for sequencing68-70, cycle sequencing71 and capillary gel electrophoresis72-74. These studies contributed to substantial improvements in the automation, quality and throughput of collecting raw DNA sequence75, 76. There were also important advances in the development of software packages for the analysis of sequence data. The PHRED software package77, 78 introduced the concept of assigning a 'base-quality score' to each base, on the basis of the probability of an erroneous call. These quality scores make it possible to monitor raw data quality and also assist in determining whether two similar sequences truly overlap. The PHRAP computer package then systematically assembles the sequence data using the base-quality scores. The program assigns 'assembly-quality scores' to each base in the assembled sequence, providing an objective criterion to guide sequence finishing. The quality scores were based on and validated by extensive experimental data. 
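The relationship between a base-quality score and the probability of an erroneous call can be illustrated with the standard Phred convention, Q = -10 log10(P). The text above does not spell out the formula, so treat it as an assumption of this sketch; it is, however, consistent with the figures quoted later in this article (a score above 40 for fewer than 1 error per 10,000 bases, above 30 for fewer than 1 per 1,000). The function names are illustrative and are not part of the PHRED or PHRAP packages.

```python
import math


def error_probability(quality: float) -> float:
    """Probability that a base call is wrong, given its quality score."""
    return 10.0 ** (-quality / 10.0)


def quality_score(p_error: float) -> float:
    """Quality score corresponding to a given error probability."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def expected_errors(qualities) -> float:
    """Expected number of miscalled bases in a sequence of quality scores."""
    return sum(error_probability(q) for q in qualities)


if __name__ == "__main__":
    for q in (20, 30, 40):
        print(q, error_probability(q))
    # Expected errors in a hypothetical 1,000-base stretch at quality 30.
    print(expected_errors([30] * 1000))
```

Because the scale is logarithmic, each increase of 10 in the score corresponds to a tenfold drop in the error probability, which makes it easy to sum expected errors over a region when deciding where finishing work is needed.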
Another key innovation for scaling up sequencing was the development by several centres of automated methods for sample preparation. This typically involved creating new biochemical protocols suitable for automation, followed by construction of appropriate robotic systems. Coordination and public data sharing The Human Genome Project adopted two important principles with regard to human sequencing. The first was that the collaboration would be open to centres from any nation. Although potentially less efficient, in a narrow economic sense, than a centralized approach involving a few large factories, the inclusive approach was strongly favoured because we felt that the human genome sequence is the common heritage of all humanity and the work should transcend national boundaries, and we believed that scientific progress was best assured by a diversity of approaches. The collaboration was coordinated through periodic international meetings (referred to as 'Bermuda meetings' after the venue of the first three gatherings) and regular telephone conferences. Work was shared flexibly among the centres, with some groups focusing on particular chromosomes and others contributing in a genome-wide fashion. The second principle was rapid and unrestricted data release. The centres adopted a policy that all genomic sequence data should be made publicly available without restriction within 24 hours of assembly79, 80. Pre-publication data releases had been pioneered in mapping projects in the worm11 and mouse genomes30, 81 and were prominently adopted in the sequencing of the worm, providing a direct model for the human sequencing efforts. We believed that scientific progress would be most rapidly advanced by immediate and free availability of the human genome sequence. The explosion of scientific work based on the publicly available sequence data in both academia and industry has confirmed this judgement. 
The deluge of data and related technologies generated by the Human Genome Project (HGP) and other genomic research presents a broad array of commercial opportunities. Seemingly limitless applications cross boundaries from medicine and food to energy and environmental resources, and predictions are that life sciences may become the largest sector in the U.S. economy. Established companies are scrambling to retool, and many new ventures are seeking a role in the information revolution with DNA at its core. IBM, Compaq, DuPont, and major pharmaceutical companies are among those interested in the potential for targeting and applying genome data. In the genomics corner alone, dozens of small companies have sprung up to sell information, technologies, and services to facilitate basic research into genes and their functions. These new entrepreneurs also offer an abundance of genomic services and applications, including additional databases with DNA sequences from humans, animals, plants, and microbes. Other applications include gene fragments to use for drug development and target identification and evaluation, identification of candidate genes, and RNA expression information revealing gene activity. Products include protein profiles; particular genotypes associated with such specific medically important phenotypes as disease susceptibility and drug responsiveness; hardware, software, and reagents for DNA sequencing and other DNA-based tests; microarrays (DNA chips) containing tens of thousands of known DNA and RNA fragments for research or clinical use; and DNA analysis software. Generating the draft genome sequence Generating a draft sequence of the human genome involved three steps: selecting the BAC clones to be sequenced, sequencing them and assembling the individual sequenced clones into an overall draft genome sequence. A glossary of terms related to genome sequencing and assembly is provided in Box 1. 
The draft genome sequence is a dynamic product, which is regularly updated as additional data accumulate en route to the ultimate goal of a completely finished sequence. The results below are based on the map and sequence data available on 7 October 2000, except as otherwise noted. At the end of this section, we provide a brief update of key data. Quality assessment The draft genome sequence already covers the vast majority of the genome, but it remains an incomplete, intermediate product that is regularly updated as we work towards a complete finished sequence. The current version contains many gaps and errors. We therefore sought to evaluate the quality of various aspects of the current draft genome sequence, including the sequenced clones themselves, their assignment to a position in the fingerprint clone contigs, and the assembly of initial sequence contigs from the individual clones into sequence-contig scaffolds. Nucleotide accuracy is reflected in a PHRAP score assigned to each base in the draft genome sequence and available to users through the Genome Browsers (see below) and public database entries. A summary of these scores for the unfinished portion of the genome is shown in Table 9. About 91% of the unfinished draft genome sequence has an error rate of less than 1 per 10,000 bases (PHRAP score > 40), and about 96% has an error rate of less than 1 in 1,000 bases (PHRAP > 30). These values are based only on the quality scores for the bases in the sequenced clones; they do not reflect additional confidence in the sequences that are represented in overlapping clones. The finished portion of the draft genome sequence has an error rate of less than 1 per 10,000 bases. Individual sequenced clones. We assessed the frequency of misassemblies, which can occur when the assembly program PHRAP joins two nonadjacent regions in the clone into a single initial sequence contig. 
The frequency of misassemblies depends heavily on the depth and quality of coverage of each clone and the nature of the underlying sequence; thus it may vary among genomic regions and among individual centres. Most clone misassemblies are readily corrected as coverage is added during finishing, but they may have been propagated into the current version of the draft genome sequence and they justify caution for certain applications. We estimated the frequency of misassembly by examining instances in which there was substantial overlap between a draft clone and a finished clone. We studied 83 Mb of such overlaps, involving about 9,000 initial sequence contigs. We found 5.3 instances per Mb in which the alignment of an initial sequence contig to the finished sequence failed to extend to within 200 bases of the end of the contig, suggesting a possible false join in the assembly of the initial sequence contig. In about half of these cases, the potential misassembly involved fewer than 400 bases, suggesting that a single raw sequence read may have been incorrectly joined. We found 1.9 instances per Mb in which the alignment showed an internal gap, again suggesting a possible misassembly; and 0.5 instances per Mb in which the alignment indicated that two initial sequence contigs that overlapped by at least 150 bp had not been merged by PHRAP. Finally, there were another 0.9 instances per Mb with various other problems. This gives a total of 8.6 instances per Mb of possible misassembly, with about half being relatively small issues involving a few hundred bases. Some of the potential problems might not result from misassembly, but might reflect sequence polymorphism in the population, small rearrangements during growth of the large-insert clones, regions of low-quality sequence or matches between segmental duplications. Thus, the frequency of misassemblies may be overstated. 
On the other hand, the criteria for recognizing overlap between draft and finished clones may have eliminated some misassemblies. Layout of the sequenced clones. We assessed the accuracy of the layout of sequenced clones onto the fingerprinted clone contigs by calculating the concordance between the positions assigned to a sequenced clone on the basis of in silico digestion and the position assigned on the basis of BAC end sequence data. The positions agreed in 98% of cases in which independent assignments could be made by both methods. The results were also compared with well studied regions containing both finished and draft genome sequence. These results indicated that sequenced clone order in the fingerprint map was reliable to within about half of one clone length (100 kb). A direct test of the layout is also provided by the draft genome sequence assembly itself. With extensive coverage of the genome, a correctly placed clone should usually (although not always) show sequence overlap with its neighbours in the map. We found only 421 instances of 'singleton' clones that failed to overlap a neighbouring clone. Close examination of the data suggests that most of these are correctly placed, but simply do not yet overlap an adjacent sequenced clone. About 150 clones appeared to be candidates for being incorrectly placed. Alignment of the fingerprint clone contigs. The alignment of the fingerprint clone contigs with the chromosomes was based on the radiation hybrid, YAC and genetic maps of STSs. The positions of most of the STSs in the draft genome sequence were consistent with these previous maps, but the positions of about 1.7% differed from one or more of them. Some of these disagreements may be due to errors in the layout of the sequenced clones or in the underlying fingerprint map. However, many involve STSs that have been localized on only one or two of the previous maps or that occur as isolated discrepancies in conflict with several flanking STSs. 
Many of these cases are probably due to errors in the previous maps (with error rates for individual maps estimated at 1–2%100). Others may be due to incorrect assignment of the STSs to the draft genome sequence (by the electronic polymerase chain reaction (e-PCR) computer program) or to database entries that contain sequence data from more than one clone (owing to cross-contamination). Graphical views of the independent data sets were particularly useful in detecting problems with order or orientation (Fig. 5). Areas of conflict were reviewed and corrected if supported by the underlying data. In the version discussed here, there were 41 sequenced clones falling in 14 sequenced-clone contigs with STS content information from multiple maps that disagreed with the flanking clones or sequenced-clone contigs; the placement of these clones thus remains suspect. Four of these instances suggest errors in the fingerprint map, whereas the others suggest errors in the layout of sequenced clones. These cases are being investigated and will be corrected in future versions. Assembly of the sequenced clones. We assessed the accuracy of the assembly by using a set of 148 draft clones comprising 22.4 Mb for which finished sequence subsequently became available104. The initial sequence contigs lack information about order and orientation, and GigAssembler attempts to use linking data to infer such information as far as possible104. Starting with initial sequence contigs that were unordered and unoriented, the program placed 90% of the initial sequence contigs in the correct orientation and 85% in the correct order with respect to one another. In a separate test, GigAssembler was tested on simulated draft data produced from finished sequence on chromosome 22 and similar results were obtained. Some problems remain at all levels. 
First, errors in the initial sequence contigs persist in the merged sequence contigs built from them and can cause difficulties in the assembly of the draft genome sequence. Second, GigAssembler may fail to merge some overlapping sequences because of poor data quality, allelic differences or misassemblies of the initial sequence contigs; this may result in apparent local duplication of a sequence. We have estimated by various methods the amount of such artefactual duplication in the assembly from these and other sources to be about 100 Mb. On the other hand, nearby duplicated sequences may occasionally be incorrectly merged. Some sequenced clones remain incorrectly placed on the layout, as discussed above, and others (< 0.5%) remain unplaced. The fingerprint map has undoubtedly failed to resolve some closely related duplicated regions, such as the Williams region and several highly repetitive subtelomeric and pericentric regions (see below). Detailed examination and sequence finishing may be required to sort out these regions precisely, as has been done with chromosome Y89. Finally, small sequenced-clone contigs with limited or no STS landmark content remain difficult to place. Full utilization of the higher resolution radiation hybrid map (the TNG map) may help in this95. Future targeted FISH experiments and increased map continuity will also facilitate positioning of these sequences. Genome coverage We next assessed the nature of the gaps within the draft genome sequence, and attempted to estimate the fraction of the human genome not represented within the current version. Gaps in draft genome sequence coverage. There are three types of gap in the draft genome sequence: gaps within unfinished sequenced clones; gaps between sequenced-clone contigs, but within fingerprint clone contigs; and gaps between fingerprint clone contigs. 
The first two types are relatively straightforward to close simply by performing additional sequencing and finishing on already identified clones. Closing the third type may require screening of additional large-insert clone libraries and possibly new technologies for the most recalcitrant regions. We consider these three cases in turn. We estimated the size of gaps within draft clones by studying instances in which there was substantial overlap between a draft clone and a finished clone, as described above. The average gap size in these draft sequenced clones was 554 bp, although the precise estimate was sensitive to certain assumptions in the analysis. Assuming that the sequence gaps in the draft genome sequence are fairly represented by this sample, about 80 Mb or about 3% (likely range 2–4%) of sequence may lie in the 145,514 gaps within draft sequenced clones. Whose genome was sequenced in the public (HGP) and private projects? The human genome reference sequences do not represent any one person’s genome. Rather, they serve as a starting point for broad comparisons across humanity. The knowledge obtained is applicable to everyone because all humans share the same basic set of genes and genomic regulatory regions that control the development and maintenance of their biological structures and processes. In the international public-sector Human Genome Project (HGP), researchers collected blood (female) or sperm (male) samples from a large number of donors. Only a few of many collected samples were processed as DNA resources. Thus the donor identities were protected so neither donors nor scientists could know whose DNA was sequenced. DNA clones from many different libraries were used in the overall project. Technically, it is much easier to prepare DNA cleanly from sperm than from other cell types because of the much higher ratio of DNA to protein in sperm and the much smaller volume in which purifications can be done. 
Using sperm does provide all chromosomes for study, including equal numbers of sperm with the X (female) or Y (male) sex chromosomes. However, HGP scientists also used white cells from the blood of female donors so as to include female-originated samples. In the Celera Genomics private-sector project, DNAs from a few different genomes were pooled and processed for sequencing. The DNA resources used for these studies came from anonymous donors of European, African, American (North, Central, South), and Asian ancestry. The lead scientist of Celera Genomics at that time, Craig Venter, has since acknowledged that his DNA was one of those in the pool. Many small regions of DNA that vary among individuals (called polymorphisms) also were identified during the HGP, mostly single nucleotide polymorphisms (SNPs). Most SNPs are without physiological effect, although a minority contribute to the delightful and beneficial diversity of humanity. A much smaller minority of polymorphisms affect an individual’s susceptibility to disease and response to medical treatments. Although the HGP has been completed, SNP studies continue in the International HapMap Project, whose goal is to identify patterns of SNP groups (called haplotypes, or “haps”). The DNA samples for the HapMap came from a total of 270 individuals: Yoruba people in Ibadan, Nigeria; Japanese in Tokyo; Han Chinese in Beijing; and the French Centre d’Etude du Polymorphisme Humain (CEPH) resource. The gaps between sequenced-clone contigs but within fingerprint clone contigs are more difficult to evaluate directly, because the draft genome sequence flanking many of the gaps is often not precisely aligned with the fingerprinted clones. However, most are much smaller than a single BAC. In fact, nearly three-quarters of these gaps are bridged by one or more individual BACs, as indicated by linking information from BAC end sequences. 
We measured the sizes of a subset of gaps directly by examining restriction fragment fingerprints of overlapping clones. A study of 157 'bridged' gaps and 55 'unbridged' gaps gave an average gap size of 25 kb. Allowing for the possibility that these gaps may not be fully representative and that some restriction fragments are not included in the calculation, a more conservative estimate of gap size would be 35 kb. This would indicate that about 150 Mb or 5% of the human genome may reside in the 4,076 gaps between sequenced-clone contigs. This sequence should be readily obtained as the clones spanning them are sequenced. The size of the gaps between fingerprint clone contigs was estimated by comparing the fingerprint maps to the essentially completed chromosomes 21 and 22. The analysis shows that the fingerprinted BAC clones in the global database cover 97–98% of the sequenced portions of those chromosomes86. The published sequences of these chromosomes also contain a few small gaps (5 and 11, respectively) amounting to some 1.6% of the euchromatic sequence, and do not include the heterochromatic portion. This suggests that the gaps between contigs in the fingerprint map contain about 4% of the euchromatic genome. Experience with closure of such gaps on chromosomes 20 and 7 suggests that many of these gaps are less than one clone in length and will be closed by clones from other libraries. However, recovery of sequence from these gaps represents the most challenging aspect of producing a complete finished sequence of the human genome. As another measure of the representation of the BAC libraries, Riethman109 has found BAC or cosmid clones that link to telomeric half-YACs or to the telomeric sequence itself for 40 of the 41 non-satellite telomeres. Thus, the fingerprint map appears to have no substantial gaps in these regions. Many of the pericentric regions are also represented, but analysis is less complete here (see below). Representation of random raw sequences. 
In another approach to measuring coverage, we compared a collection of random raw sequence reads to the existing draft genome sequence. In principle, the fraction of reads matching the draft genome sequence should provide an estimate of genome coverage. In practice, the comparison is complicated by the need to allow for repeat sequences, the imperfect sequence quality of both the raw sequence and the draft genome sequence, and the possibility of polymorphism. Nonetheless, the analysis provides a reasonable view of the extent to which the genome is represented in the draft genome sequence and the public databases. We compared the raw sequence reads against both the sequences used in the construction of the draft genome sequence and all of GenBank using the BLAST computer program. Of the 5,615 raw sequence reads analysed (each containing at least 100 bp of contiguous non-repetitive sequence), 4,924 had a match of 97% identity with a sequenced clone, indicating that 88 ± 1.5% of the genome was represented in sequenced clones. The estimate is subject to various uncertainties. Most serious is the proportion of repeat sequence in the remainder of the genome. If the unsequenced portion of the genome is unusually rich in repeated sequence, we would underestimate its size (although the excess would consist of repeated sequence). We examined those raw sequences that failed to match by comparing them to the other publicly available sequence resources. Fifty (0.9%) had matches in public databases containing cDNA sequences, STSs and similar data. An additional 276 (or 43% of the remaining raw sequence) had matches to the whole-genome shotgun reads discussed above (consistent with the idea that these reads cover about half of the genome). We also examined the extent of genome coverage by aligning the cDNA sequences for genes in the RefSeq dataset110 to the draft genome sequence. 
We found that 88% of the bases of these cDNAs could be aligned to the draft genome sequence at high stringency (at least 98% identity). (A few of the alignments with either the random raw sequence reads or the cDNAs may be to a highly similar region in the genome, but such matches should affect the estimate of genome coverage by considerably less than 1%, based on the estimated extent of duplication within the genome (see below).) These results indicate that about 88% of the human genome is represented in the draft genome sequence and about 94% in the combined publicly available sequence databases. The figure of 88% agrees well with our independent estimates above that about 3%, 5% and 4% of the genome reside in the three types of gap in the draft genome sequence. Finally, a small experimental check was performed by screening a large-insert clone library with probes corresponding to 16 of the whole-genome shotgun reads that failed to match the draft genome sequence. Five hybridized to many clones from different fingerprint clone contigs and were discarded as being repetitive. Of the remaining eleven, two fell within sequenced clones (presumably within sequence gaps of the first type), eight fell in fingerprint clone contigs but between sequenced clones (gaps of the second type) and one failed to identify clones in the fingerprint map (gaps of the third type) but did identify clones in another large-insert library. Although these numbers are small, they are consistent with the view that much of the remaining genome sequence lies within already identified clones in the current map. Estimates of genome and chromosome sizes. Informed by this analysis of genome coverage, we proceeded to estimate the sizes of the genome and each of the chromosomes (Table 8). Beginning with the current assigned sequence for each chromosome, we corrected for the known gaps on the basis of their estimated sizes (see above). 
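The gap-size arithmetic behind these corrections can be retraced with a minimal sketch. The gap counts and mean gap sizes are the figures quoted above. The genome size of roughly 3,200 Mb used as the denominator is the estimate reported in this section and is treated here as an assumption. The function names are hypothetical and introduced only for illustration.

```python
ASSUMED_GENOME_MB = 3200  # assumption: total genome size estimate from this section


def gap_total_mb(n_gaps: int, mean_gap_bp: float) -> float:
    """Total sequence (in Mb) expected to lie in n_gaps gaps of a given mean size."""
    return n_gaps * mean_gap_bp / 1e6


def percent_of_genome(size_mb: float, genome_mb: float = ASSUMED_GENOME_MB) -> float:
    """Express a size in Mb as a percentage of the assumed genome size."""
    return 100 * size_mb / genome_mb


def covered_percent(gap_percents) -> float:
    """Percentage of the genome represented once all gap classes are subtracted."""
    return 100 - sum(gap_percents)


# Gaps within draft sequenced clones: 145,514 gaps averaging 554 bp.
within_clone_mb = gap_total_mb(145_514, 554)

# Gaps between sequenced-clone contigs: 4,076 gaps at the conservative 35 kb estimate.
between_contig_mb = gap_total_mb(4_076, 35_000)

if __name__ == "__main__":
    print(f"within clones:   {within_clone_mb:.1f} Mb "
          f"({percent_of_genome(within_clone_mb):.1f}% of genome)")
    print(f"between contigs: {between_contig_mb:.1f} Mb "
          f"({percent_of_genome(between_contig_mb):.1f}% of genome)")
    # Rounded shares quoted in the text for the three gap types: 3%, 5% and 4%.
    print(f"estimated coverage: {covered_percent([3, 5, 4])}%")
```

The computed totals (about 81 Mb and 143 Mb) are somewhat below the rounded figures of 80 Mb or 3% and 150 Mb or 5% quoted in the text, which is in line with the stated uncertainty in those estimates. Subtracting the three rounded shares from 100% reproduces the 88% coverage figure.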
We attempted to account for the sizes of centromeres and heterochromatin, neither of which are well represented in the draft sequence. Finally, we corrected for around 100 Mb of artefactual duplication in the assembly. We arrived at a total human genome size estimate of around 3,200 Mb, which compares favourably with previous estimates based on DNA content. The Human Genome Project was completed in 2003. An important aspect of the project was functional and comparative genomics. This page details that research. Understanding the function of genes and other parts of the genome is known as functional genomics. The Human Genome Project is just the first step in understanding humans at the molecular level. Though the sequencing phase of the project is complete, work is still ongoing to determine the function of many of the human genes. Efficient interpretation of the functions of human genes and other DNA sequences requires that resources and strategies be developed to enable large-scale investigations across whole genomes. A technically challenging first priority is to generate complete sets of full-length cDNA clones and sequences for human and model-organism genes. Other functional-genomics goals include studies into gene expression and control, creation of mutations that cause loss or alteration of function in nonhuman organisms, and development of experimental and computational methods for protein analyses. Functional Genomics Technology Goals Generate sets of full-length cDNA clones and sequences that represent human genes and model organisms. Support research on methods for studying functions of nonprotein-coding sequences. Develop technology for comprehensive analysis of gene expression. Improve methods for genome-wide mutagenesis. Develop technology for large-scale protein analyses. Comparative Genomics The functions of human genes and other DNA regions often are revealed by studying their parallels in nonhumans. 
To enable such comparisons, HGP researchers have obtained complete genomic sequences for the bacterium Escherichia coli, the yeast Saccharomyces cerevisiae, the roundworm Caenorhabditis elegans, the fruitfly Drosophila melanogaster, the laboratory mouse, and many other organisms. The availability of complete genome sequences generated both inside and outside the HGP is driving a major breakthrough in fundamental biology as scientists compare entire genomes to gain new insights into evolutionary, biochemical, genetic, metabolic, and physiological pathways. HGP planners stress the need for a sustainable sequencing capacity to facilitate future comparisons. Comparative Genomics Goals Complete the sequence of the roundworm C. elegans genome by 1998. Complete the sequence of the fruitfly Drosophila genome by 2002. Develop an integrated physical and genetic map for the mouse, generate additional mouse cDNA resources, and complete the sequence of the mouse genome by 2008. Identify other useful model organisms and support appropriate genomic studies. The U.S. Department of Energy (DOE) Office of Science supports innovative high-impact and peer-reviewed science. Its missions include a range of such difficult challenges as environmental-waste cleanup, energy production, carbon sequestration, and biotechnology. To aid in carrying out its missions, DOE initiated the Microbial Genome Program (MGP) in late 1994 as a spinoff of its Human Genome Program. Scientists expect to find a vast repertoire of useful functions in the microbial world that could be applied to solving challenges in the human world. The MGP and the closely linked Genomics:GTL program are generating novel insights into both the biological underpinnings of climate change and the role of microbes in the overall processing of metals, carbon, radionuclides, and nitrogen. Scientists are only beginning to appreciate the power of microbial sequencing for generating new and testable hypotheses and advancing science. 
Why Microbes? Microbes, which make up most of the earth’s biomass, have evolved for some 3.8 billion years. They have been found in virtually every environment, surviving and thriving in extremes of heat, cold, radiation, pressure, salt, acidity, and darkness. Often in these environments, no other forms of life are found and the only nutrients come from inorganic matter. The diversity and range of their environmental adaptations indicate that microbes long ago “solved” many problems for which scientists are still actively seeking solutions. Potential Microbial Applications Researchers have only scratched the surface of microbial biodiversity. Knowledge about the enormous range of microbial capacities has broad and far-reaching implications for environmental, energy, health, and industrial applications. Cleanup of toxic-waste sites worldwide. Production of novel therapeutic and preventive agents and pathways. Energy generation and development of renewable energy sources (e.g., methane and hydrogen). Production of chemical catalysts, reagents, and enzymes to improve efficiency of industrial processes. Management of environmental carbon dioxide, which is related to climate change. Detection of disease-causing organisms and monitoring of the safety of food and water supplies. Use of genetically altered bacteria as living sensors (biosensors) to detect harmful chemicals in soil, air, or water. Understanding of specialized systems used by microbial cells to live in natural environments with other cells. To explore the possibilities for new applications, in 1994 the U.S. Department of Energy (DOE) established the Microbial Genome Program (MGP) as a companion to its Human Genome Program (HGP). From the start, the MGP experienced remarkable success, and microbial genomics has become one of the most exciting and high-profile fields in biology today. 
A principal goal of this spin-off project is to determine the complete DNA sequence—the genome—of a number of nonpathogenic microbes that may be useful to DOE in carrying out its missions (nonpathogenic microbes do not cause disease). The microbes chosen for genomic sequencing were selected with broad input from the scientific community. "The microbial diversity of the program is an absolute treasure trove for [research in] biotechnology, ecology, evolution, and bioremediation," notes David Schlessinger (National Institute on Aging). Only a few years ago, scientists could not have imagined having full access to the genetic structure of more than a few such organisms. Today, a number of complete microbial genomes, many supported by DOE's MGP, have been sequenced, and the rate of reported new genome sequences is increasing rapidly. (For a current listing, see Web site.) These DNA sequences, along with those from many viruses and more complex organisms such as fruitfly, roundworm, and yeast, are freely available in public databases. This information is being used by governmental, academic, medical, and industrial scientists. The number of possible applications of this information is staggering. Sequenced genomes provide us with a genetic "parts" list; the next challenge is to explore how these parts come together to form a functioning organism. The major focus of the DOE MGP will continue to be on genomic sequencing of microbes relevant to DOE missions. To avoid "starting from scratch" in sequencing new microbes, investigators are developing novel strategies to cost-effectively determine the DNA sequence of microbes that are very closely related to others whose sequence already is known. Additionally, the MGP is developing new tools to study how groups of genes work together to produce specific products or determine particular behaviors. 
Other objectives are to mine genomic information from sequenced microbes, improve tools for annotation and analysis of sequence data, develop high-throughput methods for determining gene function and gene expression, and develop methods for examining protein-protein and protein-nucleic acid interaction. The future promises many exciting developments as the fruits of the MGP mature. Already, we have become more appreciative of the extent of the microbial world's effect on earth, realizing how little we know about this kingdom and wondering at its potential benefits to our world if only we are wise enough to discover them. Benefits of the Program Imagine! A future in which we can use "super bugs" to detect chemical contamination in soil, air, and water and clean up oil spills and chemicals in landfills; cook and heat with natural gas collected from a backyard septic tank or bottled at a local waste-treatment facility; obtain affordable alcohol-based fuels and solvents from cornstalks, wood chips, and other plant by-products; and produce new classes of antibiotics and process food and chemicals more efficiently. These scenarios represent only a few of the possible ways that microbes—the invisible bacteria, archaea, protozoa, and fungi that inhabit our environment, our bodies, our food and water, and even the air we breathe—can be harnessed to serve humankind. Technological advances developed over the last decade, particularly in genetic research conducted as part of the international Human Genome Project, are enabling researchers to learn about microbes at their most fundamental level and to begin to ask questions about how the basic parts work together to form a functioning organism. The answers may challenge accepted scientific thought and offer beneficial applications in areas important to DOE's Biological and Environmental Research (BER) program, among them bioremediation, global climate change, biotechnology, and energy production. Why Microbes? 
By some estimates, microbes make up about 60% of the earth's biomass, yet less than 1% of microbial species have been identified. Microbes play a critical role in natural biogeochemical cycles. Because most do not cause disease in humans, animals, or plants and are difficult to culture, they have received little attention. Microbes have been found surviving and thriving in an amazing diversity of habitats, in extremes of heat, cold, radiation, pressure, salinity, and acidity, often where no other life forms could exist. Identifying and harnessing their unique capabilities, which have evolved over 3.8 billion years, will offer us new solutions to longstanding challenges in environmental and waste cleanup, energy production and use, medicine, industrial processes, agriculture, and other areas. Scientists also are starting to appreciate the role played by microbes in global climate processes, and we can expect insights about both the biological underpinnings of climate change and the contributions of microbes to earth's biosphere. Their capabilities soon will be added to the list of traditional commercial uses for microbes in the brewing, baking, dairy, and other industries. In 1995, the MGP's first full year, DOE funded four microbial genome sequencing projects focused on the bacterium Mycoplasma genitalium and three other microbes. Now fully characterized, the tiny M. genitalium genome—thought to have the smallest genome of any known free-living bacterium—provides a model for a minimal set of genes necessary for life. Its genome contains only 580,000 base pairs of DNA and yet encodes 470 genes. Future studies on this and other minimal genomes will help increase our understanding of more complex genomes. Among the oldest life forms known, the archaea make up one of three phylogenetic or evolutionary domains into which all life is classified. The other two are the eukarya and the bacteria. 
Archaea found thriving in extreme environments of heat and cold, acidity, pressure, and salinity are known as extremophiles ("extreme-loving" organisms). Understanding the biological mechanisms underlying their hardiness may help researchers develop new industrial, biomedical, and environmental applications. Microbes may, for example, contain enzymes that are effective in driving chemical reactions in extreme environments. Some may provide enzymes useful in research; one such "extremozyme" derived from a bacterium living in hot springs in Yellowstone National Park has become critical to current protocols for sequencing any genome, including that of humans. Other microbes have metabolic processes with potential for breaking down toxic waste or even producing methane, an energy source. Comparisons of the genomes of organisms from all three domains are helping scientists better understand the evolution of all living things. Descriptions of MGP-supported research on some other microbes follow.

Methanococcus jannaschii was among the first archaea chosen for sequencing. In 1996, completion of its sequencing and analysis confirmed that the "tree of life" has three domains, a hypothesis first advanced nearly 20 years before by Carl Woese (University of Illinois) but not given much credence at the time. The single-celled M. jannaschii was isolated from a sample collected beneath more than 8000 feet of water at the base of a deep-sea thermal vent on the floor of the Pacific Ocean. The microbe lives without the sunlight, oxygen, and organic carbon important to most other forms of life and uses carbon dioxide, nitrogen, and hydrogen expelled from the thermal vent for its life functions. When the entire DNA sequence of M. jannaschii was determined, scientists found that about 65% of its potential gene sequences were not related to any gene previously discovered, representing an exciting area for future investigation.
The archaeon Archaeoglobus fulgidus and the bacterium Thermotoga maritima have potential for practical applications in industry and government-funded environmental remediation. Because they thrive in water temperatures above the boiling point, these organisms may provide DOE, the Department of Defense, and private companies with heat-stable enzymes for use in industrial processes. These processes could include conversion of wastes to useful chemicals. A. fulgidus has the added capability of surviving at the high pressures associated with deep oil wells, and T. maritima metabolizes simple and complex carbohydrates, including glucose, sucrose, starch, xylan, and cellulose. Cellulose and xylan are the most abundant biopolymers on earth and, through their conversion to fuels such as ethanol, have major potential as sources of renewable energy. Comparisons of the genomic sequences of these two microbes will contribute to a greater understanding of evolutionary relationships as well as high-temperature protein function. The archaeon Pyrobaculum aerophilum, first isolated from a boiling marine vent, thrives at temperatures close to the maximum tolerated by living systems (113 °C). Unlike most hyperthermophiles, P. aerophilum is able to withstand exposure to oxygen and can thus be manipulated more easily in the laboratory. Also, the proteins encoded by hyperthermophilic genomes are more stable than those of organisms living in more temperate environments. The bacterium Shewanella putrefaciens, which can grow with or without oxygen, is an excellent model system for manipulating organisms for remediation. Whole-genome sequencing will elucidate metabolic pathways including those involved in corrosion, consumption of toxic organic pollutants, and removal of toxic metals and radiation waste by conversion to insoluble forms.
Other organisms that could be of great genetic and biochemical interest are present in extreme surface environments but are almost impossible to grow in the laboratory. The MGP funds a project to identify and determine the abundance and activity of novel hard-to-cultivate organisms in two extreme surface environments in the arid southwestern United States. Preliminary samples indicate that most of these bacterial species contain few similarities to previously described cultivated bacteria. These collections offer a rich resource for identifying and isolating novel species with potentially unique sets of genes as well as proteins with environmental, energy, biotechnological, and other applications.

Sequencing Candidates
How Microbes are Chosen for Study

Introduction

Over the last 10 years, sequencing of a range of microorganisms that live in a wide diversity of environments has provided a considerable information base. This base is useful for scientific research related not only to DOE missions but also to those of other federal agencies and U.S. industry. Nonetheless, the preponderance of species in the environment remains largely unknown to science. Many are thought to grow as part of interdependent consortia in which one species supplies a nutrient necessary for the growth of another. Little is known of the organization, membership, or functioning of the consortia, especially those involved in environmental processes of DOE interest. Fungi and small multicellular eukaryotes play important roles in the environment as well. Genomic analyses of sequenced microbes have suggested that processes such as lateral gene transfers at various times in the evolutionary history of some microbial lineages may have blurred the understanding of their phylogenetic relationships.
Genomic analyses are badly needed of microbial consortia and species refractory to laboratory culture that play important roles in environments challenged with metals, radionuclides, and chlorinated solvents or involved in carbon sequestration. To this end, the DOE Biological and Environmental Research (BER) program periodically seeks nominations of candidate microbes, microbial consortia, and organisms 250 Mb or smaller for draft genomic sequencing (6 to 8x coverage). Sequencing is carried out at the DOE Production Genomics Facility of the Joint Genome Institute (JGI) to support such BER programs as the Climate Change Research Program, Natural and Accelerated Bioremediation Research (NABIR) Program, Environmental Management Science Program (EMSP), Microbial Genome Program (MGP), Ocean Science Program, and Genomics:GTL Program. A subset of selected organisms may be identified for sequence finishing. The Human Genome Project (HGP), sponsored in the United States by the Department of Energy and the National Institutes of Health, has created the field of genomics --understanding genetic material on a large scale. The medical industry is building upon the knowledge, resources, and technologies emanating from the HGP to further understanding of genetic contributions to human health. As a result of this expansion of genomics into human health applications, the field of genomic medicine was born. Genetics is playing an increasingly important role in the diagnosis, monitoring, and treatment of diseases. Diagnosing and Predicting Disease and Disease Susceptibility All diseases have a genetic component, whether inherited or resulting from the body's response to environmental stresses like viruses or toxins. The successes of the HGP have even enabled researchers to pinpoint errors in genes--the smallest units of heredity--that cause or contribute to disease. 
The ultimate goal is to use this information to develop new ways to treat, cure, or even prevent the thousands of diseases that afflict humankind. But the road from gene identification to effective treatments is long and fraught with challenges. In the meantime, biotechnology companies are racing ahead with commercialization by designing diagnostic tests to detect errant genes in people suspected of having particular diseases or of being at risk for developing them. An increasing number of gene tests are becoming available commercially, although the scientific community continues to debate the best way to deliver them to the public and medical communities that are often unaware of their scientific and social implications. While some of these tests have greatly improved and even saved lives, scientists remain unsure of how to interpret many of them. Also, patients taking the tests face significant risks of jeopardizing their employment or insurance status. And because genetic information is shared, these risks can extend beyond them to their family members as well. Disease Intervention Explorations into the function of each human gene--a major challenge extending far into the 21st century --will shed light on how faulty genes play a role in disease causation. With this knowledge, commercial efforts are shifting away from diagnostics and toward developing a new generation of therapeutics based on genes. Drug design is being revolutionized as researchers create new classes of medicines based on a reasoned approach to the use of information on gene sequence and protein structure function rather than the traditional trial-and-error method. Drugs targeted to specific sites in the body promise to have fewer side effects than many of today's medicines. The potential for using genes themselves to treat disease--gene therapy--is the most exciting application of DNA science. It has captured the imaginations of the public and the biomedical community for good reason. 
This rapidly developing field holds great potential for treating or even curing genetic and acquired diseases, using normal genes to replace or supplement a defective gene or to bolster immunity to disease (e.g., by adding a gene that suppresses tumor growth).

By the Numbers

The human genome contains 3164.7 million chemical nucleotide bases (A, C, T, and G). The average gene consists of 3000 bases, but sizes vary greatly, with the largest known human gene being dystrophin at 2.4 million bases. The total number of genes is estimated at 30,000, much lower than previous estimates of 80,000 to 140,000 that had been based on extrapolations from gene-rich areas as opposed to a composite of gene-rich and gene-poor areas. The order of almost all (99.9%) nucleotide bases is exactly the same in all people. The functions are unknown for over 50% of discovered genes.

The Wheat from the Chaff

Less than 2% of the genome codes for proteins. Repetitive sequences that do not code for proteins (sometimes called "junk DNA") make up at least 50% of the human genome. Repetitive sequences are thought to have no direct functions, but they shed light on chromosome structure and dynamics. Over time, these repeats reshape the genome by rearranging it, thereby creating entirely new genes or modifying and reshuffling existing genes. During the past 50 million years, a dramatic decrease seems to have occurred in the rate of accumulation of repeats in the human genome.

How It's Arranged

The human genome's gene-dense "urban centers" are predominantly composed of the DNA building blocks G and C. In contrast, the gene-poor "deserts" are rich in the DNA building blocks A and T. GC- and AT-rich regions usually can be seen through a microscope as light and dark bands on chromosomes. Genes appear to be concentrated in random areas along the genome, with vast expanses of noncoding DNA between.
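As a rough illustration of the contrast between GC-rich and AT-rich regions described above, the following Python sketch computes the fraction of G and C bases in fixed-size windows along a sequence. The function names, the window and step sizes, and the toy sequence are assumptions made for this example only; an analysis of real chromosome data would use much larger windows and would need to decide how to treat ambiguous bases.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C among the A, C, G, T bases in seq (0.0 if none)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def windowed_gc(seq: str, window: int, step: int) -> list[tuple[int, float]]:
    """GC fraction of every full window, returned as (start, fraction) pairs."""
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


# Toy sequence invented for illustration: a GC-only stretch, then an AT-only one.
toy = "GCGCGGCCGC" + "ATATTAAATT"
for start, fraction in windowed_gc(toy, window=10, step=10):
    print(start, fraction)
```

Plotting such window values along a chromosome is one simple way to see where GC-rich and AT-rich stretches fall; it does not by itself identify genes.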
Stretches of up to 30,000 C and G bases repeating over and over often occur adjacent to gene-rich areas, forming a barrier between the genes and the "junk DNA." These CpG islands are believed to help regulate gene activity. Chromosome 1 has the most genes (2968), and the Y chromosome has the fewest (231). How the Human Genome Compares with That of Other Organisms Unlike the human's seemingly random distribution of gene-rich areas, many other organisms' genomes are more uniform, with genes evenly spaced throughout. Humans have on average three times as many kinds of proteins as the fly or worm because of mRNA transcript "alternative splicing" and chemical modifications to the proteins. This process can yield different protein products from the same gene. Humans share most of the same protein families with worms, flies, and plants, but the number of gene family members has expanded in humans, especially in proteins involved in development and immunity. The human genome has a much greater portion (50%) of repeat sequences than the mustard weed (11%), the worm (7%), and the fly (3%). Although humans appear to have stopped accumulating repeated DNA over 50 million years ago, there seems to be no such decline in rodents. This may account for some of the fundamental differences between hominids and rodents, although gene estimates are similar in these species. Scientists have proposed many theories to explain evolutionary contrasts between humans and other organisms, including those of life span, litter sizes, inbreeding, and genetic drift. Variations and Mutations Scientists have identified about 1.4 million locations where single-base DNA differences (SNPs) occur in humans. This information promises to revolutionize the processes of finding chromosomal locations for disease-associated sequences and tracing human history. The ratio of germline (sperm or egg cell) mutations is 2:1 in males vs females. 
Researchers point to several reasons for the higher mutation rate in the male germline, including the greater number of cell divisions required for sperm formation than for eggs.

What We Still Don't Understand: A Checklist for Future Research

Exact gene number, exact locations, and functions
Gene regulation
DNA sequence organization
Chromosomal structure and organization
Noncoding DNA types, amount, distribution, information content, and functions
Coordination of gene expression, protein synthesis, and post-translational events
Interaction of proteins in complex molecular machines
Predicted vs experimentally determined gene function
Evolutionary conservation among organisms
Protein conservation (structure and function)
Proteomes (total protein content and function) in organisms
Correlation of SNPs (single-base DNA variations among individuals) with health and disease
Disease-susceptibility prediction based on gene sequence variation
Genes involved in complex traits and multigene diseases
Complex systems biology, including microbial consortia useful for environmental restoration
Developmental genetics, genomics

Applications, Future Challenges

Deriving meaningful knowledge from the DNA sequence will define research through the coming decades to inform our understanding of biological systems. This enormous task will require the expertise and creativity of tens of thousands of scientists from varied disciplines in both the public and private sectors worldwide. The draft sequence already is having an impact on finding genes associated with disease. Over 30 genes have been pinpointed and associated with breast cancer, muscle disease, deafness, and blindness. Additionally, finding the DNA sequences underlying such common diseases as cardiovascular disease, diabetes, arthritis, and cancers is being aided by the human variation maps (SNPs) generated in the HGP in cooperation with the private sector.
These genes and SNPs provide focused targets for the development of effective new therapies. One of the greatest impacts of having the sequence may well be in enabling an entirely new approach to biological research. In the past, researchers studied one or a few genes at a time. With whole-genome sequences and new high-throughput technologies, they can approach questions systematically and on a grand scale. They can study all the genes in a genome, for example, or all the transcripts in a particular tissue or organ or tumor, or how tens of thousands of genes and proteins work together in interconnected networks to orchestrate the chemistry of life. Post-sequencing projects are well under way worldwide (see Genomics:GTL). These explorations will result in a profound, new, and more comprehensive understanding of complex living systems, with applications to agriculture, human health, energy, global climate change, and environmental remediation, among others.

What is gene therapy?

Genes, which are carried on chromosomes, are the basic physical and functional units of heredity. Genes are specific sequences of bases that encode instructions on how to make proteins. Although genes get a lot of attention, it is the proteins that perform most life functions and even make up the majority of cellular structures. When genes are altered so that the encoded proteins are unable to carry out their normal functions, genetic disorders can result. Gene therapy is a technique for correcting defective genes responsible for disease development. Researchers may use one of several approaches for correcting faulty genes:

A normal gene may be inserted into a nonspecific location within the genome to replace a nonfunctional gene. This approach is most common.
An abnormal gene could be swapped for a normal gene through homologous recombination.
The abnormal gene could be repaired through selective reverse mutation, which returns the gene to its normal function.
The regulation (the degree to which a gene is turned on or off) of a particular gene could be altered.

How does gene therapy work?

In most gene therapy studies, a "normal" gene is inserted into the genome to replace an "abnormal," disease-causing gene. A carrier molecule called a vector must be used to deliver the therapeutic gene to the patient's target cells. Currently, the most common vector is a virus that has been genetically altered to carry normal human DNA. Viruses have evolved a way of encapsulating and delivering their genes to human cells in a pathogenic manner. Scientists have tried to take advantage of this capability and manipulate the virus genome to remove disease-causing genes and insert therapeutic genes. Target cells such as the patient's liver or lung cells are infected with the viral vector. The vector then unloads its genetic material containing the therapeutic human gene into the target cell. The generation of a functional protein product from the therapeutic gene restores the target cell to a normal state.

Some of the different types of viruses used as gene therapy vectors:

Retroviruses - A class of viruses that can create double-stranded DNA copies of their RNA genomes. These copies can be integrated into the chromosomes of host cells. Human immunodeficiency virus (HIV) is a retrovirus.
Adenoviruses - A class of viruses with double-stranded DNA genomes that cause respiratory, intestinal, and eye infections in humans. Adenoviruses are among the viruses that can cause the common cold.
Adeno-associated viruses - A class of small, single-stranded DNA viruses that can insert their genetic material at a specific site on chromosome 19.
Herpes simplex viruses - A class of double-stranded DNA viruses that infect a particular cell type, neurons. Herpes simplex virus type 1 is a common human pathogen that causes cold sores.
Besides virus-mediated gene-delivery systems, there are several nonviral options for gene delivery. The simplest method is the direct introduction of therapeutic DNA into target cells. This approach is limited in its application because it can be used only with certain tissues and requires large amounts of DNA. Another nonviral approach involves the creation of an artificial lipid sphere with an aqueous core. This liposome, which carries the therapeutic DNA, is capable of passing the DNA through the target cell's membrane. Therapeutic DNA also can get inside target cells by chemically linking the DNA to a molecule that will bind to special cell receptors. Once bound to these receptors, the therapeutic DNA constructs are engulfed by the cell membrane and passed into the interior of the target cell. This delivery system tends to be less effective than other options. Researchers also are experimenting with introducing a 47th (artificial human) chromosome into target cells. This chromosome would exist autonomously alongside the standard 46, neither affecting their workings nor causing any mutations. It would be a large vector capable of carrying substantial amounts of genetic code, and scientists anticipate that, because of its construction and autonomy, the body's immune system would not attack it. A problem with this potential method is the difficulty in delivering such a large molecule to the nucleus of a target cell.

Arch Biochem Biophys, 2000 Jun 15, 378(2), 210 - 5 Mechanistic studies of Escherichia coli adenosine-5'-phosphosulfate kinase; Satishchandran C et al.; Adenosine-5'-phosphosulfate kinase (APS kinase) catalyzes the formation of 3'-phosphoadenosine 5'-phosphosulfate (PAPS), the major form of activated sulfate in biological systems. The enzyme from Escherichia coli has complex kinetic behavior, including substrate inhibition by APS and formation of a phosphorylated enzyme (E-P) as a reaction intermediate.
The presence of a phosphorylated enzyme potentially enables the steady-state kinetic mechanism to change from sequential to ping-pong as the APS concentration decreases. Kinetic and equilibrium binding measurements have been used to evaluate the proposed mechanism. Equilibrium binding studies show that APS, PAPS, ADP, and the ATP analog AMPPNP each bind at a single site per subunit; thus, substrates can bind in either order. When ATPgammaS replaces ATP as substrate the V(max) is reduced 535-fold, the kinetic mechanism is sequential at each APS concentration, and substrate inhibition is not observed. The results indicate that substrate inhibition arises from a kinetic phenomenon in which product formation from ATP binding to the E.APS complex is much slower than paths in which product formation results from APS binding either to the E.ATP complex or to E-P. APS kinase requires divalent cations such as Mg(2+) or Mn(2+) for activity. APS kinase binds one Mn(2+) ion per subunit in the absence of substrates, consistent with the requirement for a divalent cation in the phosphorylation of APS by E-P. The affinity for Mn(2+) increases 23-fold when the enzyme is phosphorylated. Two Mn(2+) ions bind per subunit when both APS and the ATP analog AMPPNP are present, indicating a potential dual metal ion catalytic mechanism.

Anal Biochem, 2000 Jun 15, 282(1), 65 - 9 Genetically fused protein A-luciferase for immunological blotting analyses; Zhang XM et al.; The gene expression plasmid pMALU5 for the fusion protein of protein A (SpA) with a complete sequence of firefly luciferase (Luc) was constructed. The fused gene was expressed in Escherichia coli, and the resulting SpA-Luc fusion protein was purified by one-step affinity chromatography on IgG-Sepharose. The protein retained both activities: IgG binding capability of protein A and enzymatic activity of luciferase.
Blotting analyses were performed with the fusion protein to determine a tumor marker of alpha-fetoprotein (AFP). AFP was detected at the lowest detection limit of 5 pg by dot blotting and Western blotting. The SpA-Luc fusion protein provides a highly selective, sensitive, and versatile marker for blotting analyses.

Anal Biochem, 2000 Jun 15, 282(1), 39 - 45 Catalytic chromatography; Jurado LA et al.; Catalytic chromatography exploits both specific biological affinity and catalytic specificity to selectively purify enzymes. Two different applications are presented. Purification of EcoRI restriction endonuclease to apparent homogeneity was accomplished in a single step with significantly greater yield and purification than was obtained with affinity chromatography. An attempt to purify the multiple DNA polymerase activities of Escherichia coli was also developed. Five well-resolved peaks of DNA polymerase activity were fractionated. In this new chromatographic mode, the enzyme binds immobilized substrate coupled to a column in the absence of some required cofactor. When the missing cofactor is added, the enzyme converts substrate to product and selectively elutes from the column.

Spec Care Dentist, 1999 May-Jun, 19(3), 128 - 34 Respiratory pathogen colonization of the dental plaque of institutionalized elders; Russell SL et al.; Although it has been established that aspiration of pharyngeal bacteria is the major route of infection in the development of nosocomial pneumonia, colonization of the pharyngeal mucosa by respiratory pathogens has been shown to be a transient phenomenon. It has been suggested that the dental plaque may constitute an additional, possibly more stable, reservoir of respiratory pathogens.
The purpose of this study was to assess the prevalence of oral colonization by potential respiratory pathogens in a group of elderly (mean age = 75.9 yrs) chronic-care-facility residents (n = 28) and a group of age-, gender-, and race-matched outpatient control subjects (n = 30), with specific attention to plaque present on tooth, denture, and oral mucosal surfaces. Plaque scores on teeth and dentures were significantly higher in the chronic-care-facility (CCF) subjects than in the dental outpatient control (DOC) subjects (PII 2.3 vs. 1.2 and denture plaque 1.4 vs. 0.3). While no subjects in the DOC group were found to be colonized with respiratory pathogens (> 1.0% of the cultivable aerobic flora), 14.3% (4/28) of the CCF subjects were found to be colonized. Oral colonization with respiratory pathogens in CCF subjects was associated with the presence of chronic obstructive pulmonary disease (COPD) and higher plaque scores. These results suggest that deficient dental plaque control and the presence of COPD may be related to respiratory pathogen colonization of dental plaque in chronic-care-facility residents.

Cell Calcium, 2000 May, 27(5), 257 - 67 Differential modulation of inositol 1,4,5-trisphosphate receptor type 1 and type 3 by ATP; Maes K et al.; Binding of ATP to the inositol 1,4,5-trisphosphate receptor (IP(3)R) results in a more pronounced Ca(2+) release in the presence of inositol 1,4,5-trisphosphate (IP(3)). Two recently published studies demonstrated a different ATP sensitivity of IP(3)-induced Ca(2+) release in cell types expressing different IP(3)R isoforms. Cell types expressing mainly IP(3)R3 were less sensitive to ATP than cell types expressing mainly IP(3)R1 (Missiaen L, Parys JB, Sienaert I et al. Functional properties of the type 3 InsP(3) receptor in 16HBE14o- bronchial mucosal cells. J Biol Chem 1998;273: 8983-8986; Miyakawa T, Maeda A, Yamazawa T et al. Encoding of Ca(2+) signals by differential expression of IP(3) receptor subtypes.
EMBO J 1999;18: 1303-1308). In order to investigate the difference in ATP sensitivity between IP(3)R isoforms at the molecular level, microsomes of Sf9 insect cells expressing full-size IP(3)R1 or IP(3)R3 were covalently labeled with ATP by using the photoaffinity label 8-azido{alpha-(32)P}ATP. ATP labeling of the IP(3)R was measured after immunoprecipitation of IP(3)Rs with isoform-specific antibodies, SDS-PAGE and Phosphorimaging. Unlabeled ATP inhibited covalent linking of 8-azido{alpha-(32)P}ATP to the recombinant IP(3)R1 and IP(3)R3 with an IC(50) of 1.6 microM and 177 microM, respectively. MgATP was as effective as ATP in displacing 8-azido{alpha-(32)P}ATP from the ATP-binding sites on IP(3)R1 and IP(3)R3, and in stimulating IP(3)-induced Ca(2+) release from permeabilized A7r5 and 16HBE14o- cells. The interaction of ATP with the ATP-binding sites on IP(3)R1 and IP(3)R3 was different from its interaction with the IP(3)-binding domains, since ATP inhibited IP(3) binding to the N-terminal 581 amino acids of IP(3)R1 and IP(3)R3 with an IC(50) of 353 microM and 4.0 mM, respectively. The ATP-binding sites of IP(3)R1 bound ATP much better than ADP, AMP and particularly GTP, while IP(3)R3 displayed a much broader nucleotide specificity. These results therefore provide molecular evidence for a differential regulation of IP(3)R1 and IP(3)R3 by ATP.

Neuroimmunomodulation, 2000, 8(1), 8 - 12 Effect of interleukin-11 on body temperature in afebrile and febrile rats; Gourine AV et al.; Interleukin (IL)-11 is a member of the gp130 family of cytokines. In contrast to IL-6 (another gp130 cytokine), IL-11 does not induce fever in humans. In the present study, the effect of recombinant human IL-11 (hrIL-11) injected intracerebroventricularly on body temperature of afebrile and febrile rats was studied.
Results showed that: (i) hrIL-11 in doses of 5, 50 and 500 ng injected into the cerebral ventricles does not alter body temperature in rats; (ii) febrile response induced by intraperitoneal injection of E. coli endotoxin (50 microg/kg) was initiated more rapidly in rats injected with 500 ng of hrIL-11 in the cerebral ventricles, and (iii) the enhancement of the initial phase of fever induced by hrIL-11 was not accompanied by changes in plasma concentrations of IL-6 and tumor necrosis factor (TNF). These results indicate that hrIL-11 is not pyrogenic when administered into the brain ventricles. The data obtained also demonstrate that central application of hrIL-11 alters body temperature in conditions of pyrogenic stimulation, but that this effect is not due to the alterations in plasma concentrations of IL-6 or TNF. These data suggest that during the development of the systemic inflammatory response, activation of gp130 subunit becomes effective in altering body temperature.

Proc Natl Acad Sci U S A, 2000 Jul 5, 97(14), 8057 - 62 Differences in the polar clustering of the high- and low-abundance chemoreceptors of Escherichia coli; Lybarger SR et al.; The chemosensory complexes in Escherichia coli are localized predominantly in large aggregates at one or both of the cell poles; however, neither the role of the polar localization nor the role of the clustering is understood. In E. coli, the two classes of chemoreceptors or transducers, high- and low-abundance, differ in their ability to support chemotaxis when expressed as the sole chemoreceptor type in the cell. In this study, we examined both the contribution of individual chemoreceptors to polar clustering and the ability of each chemoreceptor type to cluster in the absence of all others. We found that polar clustering of methyl-accepting chemotaxis proteins (MCPs) is not dependent on any one chemoreceptor type.
Remarkably, when expressed individually at similar levels, the chemoreceptors display differential clustering abilities. The high-abundance transducers cluster at the cell pole almost as well as do the MCPs in cells expressing all four species, whereas the low-abundance transducers, although polar, are not particularly clustered. CheA and CheW distributions in strains expressing only one chemoreceptor type coincide with MCP localization, indicating that the low-abundance chemoreceptors are competent for ternary complex formation but are defective in aggregation. These studies reveal that, in contrast to our previous model, polarity of the chemoreceptors is independent of clustering, suggesting that the polar localization of the chemoreceptors is not simply caused by diffusion limitations on large protein aggregates.

Proc Natl Acad Sci U S A, 2000 Jul 5, 97(14), 7963 - 8 Exploring the sequence space for tetracycline-dependent transcriptional activators: novel mutations yield expanded range and sensitivity; Urlinger S et al.; Regulatory elements that control tetracycline resistance in Escherichia coli were previously converted into highly specific transcription regulation systems that function in a wide variety of eukaryotic cells. One tetracycline repressor (TetR) mutant gave rise to rtTA, a tetracycline-controlled transactivator that requires doxycycline (Dox) for binding to tet operators and thus for the activation of P(tet) promoters. Despite the intriguing properties of rtTA, its use was limited, particularly in transgenic animals, because of its relatively inefficient inducibility by doxycycline in some organs, its instability, and its residual affinity to tetO in absence of Dox, leading to elevated background activities of the target promoter. To remove these limitations, we have mutagenized tTA DNA and selected in Saccharomyces cerevisiae for rtTA mutants with reduced basal activity and increased Dox sensitivity.
Five new rtTAs were identified, of which two have greatly improved properties. The most promising new transactivator, rtTA2(S)-M2, functions at a 10-fold lower Dox concentration than rtTA, is more stable in eukaryotic cells, and causes no background expression in the absence of Dox. The coding sequences of the new reverse TetR mutants fused to minimal activation domains were optimized for expression in human cells and synthesized. The resulting transactivators allow stringent regulation of target genes over a range of 4 to 5 orders of magnitude in stably transfected HeLa cells. These rtTA versions combine tightness of expression control with a broad regulatory range, as previously shown for the widely applied tTA.

J Biol Chem, 2000 Sep 1, 275(35), 27311-5
Identification that KfiA, a protein essential for the biosynthesis of the Escherichia coli K5 capsular polysaccharide, is an alpha-UDP-GlcNAc glycosyltransferase. The formation of a membrane-associated K5 biosynthetic complex requires KfiA, KfiB, and KfiC; Hodson N et al.; The Escherichia coli K5 capsular polysaccharide consists of the repeat structure -4)GlcA-beta(1,4)-GlcNAc-alpha(1- and requires the KfiA, KfiB, KfiC, and KfiD proteins for its synthesis. Previously, the KfiC protein was shown to be a beta-UDP-GlcA glycosyltransferase, and KfiD was shown to be a UDP-Glc dehydrogenase. Here, we demonstrate that KfiA is an alpha-UDP-GlcNAc glycosyltransferase and that biosynthesis of the K5 polysaccharide involves the concerted action of the KfiA and KfiC proteins. By site-directed mutagenesis, we determined that the acidic motif of DDD, which is conserved between the C family of glycosyltransferases, is essential for the enzymatic activity of KfiA. In addition, by Western blot analysis, we determined that association of KfiA with the cytoplasmic membrane requires KfiC but not KfiB, whereas the interaction of KfiC with the cytoplasmic membrane was dependent on both KfiA and KfiB.
Likewise, KfiB was only detectable in cytoplasmic membrane fractions when both KfiA and KfiC were present. These data suggest that the interaction between the KfiA, KfiB, and KfiC proteins is essential for the stable association of these proteins with the cytoplasmic membrane and the biosynthesis of the K5 polysaccharide.

J Biol Chem, 2000 Sep 22, 275(38), 29207-16
A novel plant glutathione S-transferase/peroxidase suppresses Bax lethality in yeast; Kampranis SC et al.; The mammalian inducer of apoptosis Bax is lethal when expressed in yeast and plant cells. To identify potential inhibitors of Bax in plants, we transformed yeast cells expressing Bax with a tomato cDNA library and selected for cells surviving after the induction of Bax. This genetic screen allows for the identification of plant genes that inhibit, either directly or indirectly, the lethal phenotype of Bax. Using this method, a number of cDNA clones were isolated, the most potent of which encodes a protein homologous to the class theta glutathione S-transferases. This Bax-inhibiting (BI) protein was expressed in Escherichia coli and found to possess glutathione S-transferase (GST) and weak glutathione peroxidase (GPX) activity. Expression of Bax in yeast decreases the intracellular levels of total glutathione, causes a substantial reduction of total cellular phospholipids, diminishes the mitochondrial membrane potential, and alters the intracellular redox potential. Co-expression of the BI-GST/GPX protein brought the total glutathione levels back to normal and re-established the mitochondrial membrane potential but had no effect on the phospholipid alterations. Moreover, expression of BI-GST/GPX in yeast was found to significantly enhance resistance to H(2)O(2)-induced stress.
These results underline the relationship between oxidative stress and Bax-induced death in yeast cells and demonstrate that the yeast-based genetic strategy described here is a powerful tool for the isolation of novel antioxidant and antiapoptotic genes.

Biol Reprod, 2000 Jul, 63(1), 347-53
Immunization of male mice with luteinizing hormone-releasing hormone fusion proteins reduces testicular and accessory sex gland function; Quesnell MM et al.; Genes for ovalbumin-luteinizing hormone-releasing hormone 7 (LHRH-7) and thioredoxin-LHRH-7 fusion proteins (containing seven LHRH inserts) were constructed by cassette and mismatch mutagenesis and expressed in Escherichia coli. In experiment 1, 10 microgram of either ovalbumin-LHRH-7 or thioredoxin-LHRH-7 were suspended in Z-max adjuvant and injected three times at 4-wk intervals into postpubertal male BALB/c mice. In experiment 2, the fusion proteins were suspended in Immumax adjuvant and administered in equimolar quantities (0.4 nmol per injection) to postpubertal male BALB/c mice. In addition to injection of these two proteins alone, the proteins were also administered in different sequences or together in a mixture. Both LHRH fusion proteins induced significant antibody titers, which resulted in a significant decrease in vesicular gland and anterior prostate weight (measure of biological response) in both experiments. Vesicular gland and anterior prostate weight and LHRH antibody titers were significantly correlated in experiments 1 (r = -0.64) and 2 (r = -0.53). Percentage of animals responding to treatment varied from 40-60% in experiment 1 and from 11-89% in experiment 2, with the highest responses in treatments that used a combination of both fusion proteins. The variation in responders and nonresponders was evaluated by estimating antibody K(D) from displacement curves. Part, but not all, of the high antibody nonresponders can be explained by antibody affinity.
Am J Physiol Gastrointest Liver Physiol, 2000 Jun, 278(6), G895-904
Heterogeneity of detergent-insoluble membranes from human intestine containing caveolin-1 and ganglioside G(M1); Badizadegan K et al.; In intestinal epithelia, cholera and related toxins elicit a cAMP-dependent chloride secretory response fundamental to the pathogenesis of toxigenic diarrhea. We recently proposed that specificity of cholera toxin (CT) action in model intestinal epithelia may depend on the toxin's cell surface receptor ganglioside G(M1). Binding G(M1) enabled the toxin to elicit a response, but forcing the toxin to enter the cell by binding the closely related ganglioside G(D1a) rendered the toxin inactive. The specificity of ganglioside function correlated with the ability of G(M1) to partition CT into detergent-insoluble glycosphingolipid-rich membranes (DIGs). To test the biological plausibility of these hypotheses, we examined native human intestinal epithelia. We show that human small intestinal epithelia contain DIGs that distinguish between toxin bound to G(M1) and G(D1a), thus providing a possible mechanism for enterotoxicity associated with CT. We find direct evidence for the presence of caveolin-1 in DIGs from human intestinal epithelia but find that these membranes are heterogeneous and that caveolin-1 is not a structural component of apical membrane DIGs that contain CT.

Rapid Commun Mass Spectrom, 2000, 14(12), 1047-57
A tandem quadrupole/time-of-flight mass spectrometer with a matrix-assisted laser desorption/ionization source: design and performance; Loboda AV, Krutchinsky AN, Bromirski M, Ens W, Standing KG; A matrix-assisted laser desorption/ionization (MALDI) source has been coupled to a tandem quadrupole/time-of-flight (QqTOF) mass spectrometer by means of a collisional damping interface. Mass resolving power of about 10,000 (FWHM) and accuracy in the range of 10 ppm are observed in both single-MS mode and MS/MS mode.
Sub-femtomole sensitivity is obtained in single-MS mode, and a few femtomoles in MS/MS mode. Both peptide mass mapping and collision-induced dissociation (CID) analysis of tryptic peptides can be performed from the same MALDI target. Rapid spectral acquisition (a few seconds per spectrum) can be achieved in both modes, so high-throughput protein identification is possible. Some information about fragmentation patterns was obtained from a study of the CID spectra of singly charged peptides from a tryptic digest of E. coli citrate synthase. Reasonably successful automatic sequence prediction (>90%) is possible from the CID spectra of singly charged peptides using the SCIEX Predict Sequence routine. Ion production at pressures near 1 Torr (rather than in vacuum) is found to give reduced metastable fragmentation, particularly for higher mass molecular ions.

Cryobiology, 2000 May, 40(3), 187-209
The enhancement of the ability of mouse sperm to survive freezing and thawing by the use of high concentrations of glycerol and the presence of an Escherichia coli membrane preparation (Oxyrase) to lower the oxygen concentration; Mazur P et al.; The cryobiological preservation of mouse spermatozoa has presented difficulties in the form of poor motilities or irreproducibility. We have hypothesized several underlying problems. One is that published studies have used concentrations of the cryoprotectant glycerol that are substantially lower (<0.3 M) than the approximately 1 M concentrations that are optimal for most mammalian cells. Another may arise from the known high susceptibility of mouse sperm to free radical damage. We have been able to obtain high motilities in 0.8 M glycerol provided that the exposure time is held to approximately 5 min to minimize toxicity and provided that the glycerol is added and removed stepwise to minimize osmotic shock.
Since free radical damage in mouse sperm is proportional to the oxygen concentration, we have determined the consequences of reducing the oxygen to <3% of atmospheric by maintaining the sperm in contact with an Escherichia coli membrane preparation, Oxyrase, from the moment of collection throughout the assessment of motility. Prior studies have shown that the procedure significantly reduces damage from centrifugation and osmotic shock. In the experiments reported here we obtained approximately 50% motility relative to untreated controls when suspensions containing 3.8% Oxyrase were exposed for approximately 5 min to a solution of 0.8 M glycerol and 0.17 M (10%) raffinose in a supplemented PBS and then frozen at approximately 25 degrees C/min to -75 degrees C. In the absence of Oxyrase, the normalized motility dropped to 31%. The protection by Oxyrase was in part a consequence of minimizing centrifugation damage, but in part it reflected a reduction in freeze-thaw damage. Preliminary experiments indicate that the number of motile sperm after cryopreservation in Oxyrase is higher when the sperm are collected without swim-up than when they are collected by swim-up. This is in part because more cells are collected in the absence of swim-up and in part because of a greater protective effect of Oxyrase on those cells. The minimum temperature in these initial experiments was limited to -75 degrees C to avoid the potential contribution of other injurious factors between -75 and -196 degrees C.

Int Arch Allergy Immunol, 2000 May, 122(1), 66-75
Isolation of a lactose-binding protein with monocyte/macrophage chemotactic activity. Biological and physicochemical characteristics; Yamanaka T et al.; BACKGROUND: We established a T cell line, STO-5, which constitutively produced monocyte/macrophage chemotactic activity via human T cell lymphoma-leukemia-virus-induced transformation of normal human T cells.
METHODS: We isolated and purified a lactose-binding protein, MCF-pl5-L (MW of about 50 kD, pI of about 5), from a conditioned medium of STO-5. Using highly purified MCF-pl5-L, we elucidated its biological and physicochemical properties in comparison with C5a and MCP-1. RESULTS: MCF-pl5-L exhibited an evident dose-dependent monocyte chemotactic activity (MCA). MCF-pl5-L had little or no affinity for heparin, unlike chemokines such as MCP-1. We further found that MCF-pl5-L exhibited potent chemotactic activity not only for monocytes but also for alveolar macrophages. In contrast, C5a and MCP-1 failed to show evident chemotactic activity for alveolar macrophages though they did show MCA. MCF-pl5-L failed to exhibit evident eosinophil and neutrophil chemotactic activities, indicating that its chemotactic activity is selective for monocytes/macrophages. Regarding the biological functions of MCF-pl5-L other than MCA and chemotactic activity for alveolar macrophages, we found that MCF-pl5-L, but not C5a and MCP-1, could prolong the life span of alveolar macrophages, probably by inhibiting apoptosis of macrophages, and stimulate the production of TNF-alpha from macrophages. CONCLUSIONS: These results suggest that MCF-pl5-L plays a role as an immune modulator for monocytes/macrophages in the site.

J Gen Virol, 2000 Jul, 81(Pt 7), 1649-58
Mutational analysis of hepatitis C virus NS3-associated helicase; Paolini C et al.; Nonstructural protein 3 (NS3) of hepatitis C virus contains a bipartite structure consisting of an N-terminal serine protease and a C-terminal DEXH box helicase. To investigate the roles of individual amino acid residues in the overall mechanism of unwinding, a mutational-functional analysis was performed based on a molecular model of the NS3 helicase domain bound to ssDNA, which has largely been confirmed by a recently published crystal structure of the NS3 helicase-ssDNA complex.
Three full-length mutated NS3 proteins containing Tyr(392)Ala, Val(432)Gly and Trp(501)Ala single substitutions, respectively, together with a Tyr(392)Ala/Trp(501)Ala double-substituted protein were expressed in Escherichia coli and purified to homogeneity. All individually mutated forms showed a reduction in duplex unwinding activity, single-stranded polynucleotide binding capacity and polynucleotide-stimulated ATPase activity compared to wild-type, though to different extents. Simultaneous replacement of both Tyr(392) and Trp(501) with Ala completely abolished all these enzymatic functions. On the other hand, the introduced amino acid substitutions had no influence on NS3 intrinsic ATPase activity and proteolytic efficiency. The results obtained with Trp(501)Ala and Val(432)Gly single-substituted enzymes are in agreement with a recently proposed model for NS3 unwinding activity. The mutant phenotype of the Tyr(392)Ala and Tyr(392)Ala/Trp(501)Ala enzymes, however, represents a completely novel finding.

Plant Physiol, 2000 Jun, 123(2), 733-42
ACX3, a novel medium-chain acyl-coenzyme A oxidase from Arabidopsis; Froman BE et al.; In a database search for homologs of acyl-coenzyme A oxidases (ACX) in Arabidopsis, we identified a partial genomic sequence encoding an apparently novel member of this gene family. Using this sequence information we then isolated the corresponding full-length cDNA from etiolated Arabidopsis cotyledons and have characterized the encoded recombinant protein. The polypeptide contains 675 amino acids. The 34 residues at the amino terminus have sequence similarity to the peroxisomal targeting signal 2 of glyoxysomal proteins, including the R-{I/Q/L}-X5-HL-XL-X15-22-C consensus sequence, suggesting a possible microsomal localization.
Affinity purification of the encoded recombinant protein expressed in Escherichia coli followed by enzymatic assay showed that this enzyme is active on C8:0- to C14:0-coenzyme A with maximal activity on C12:0-coenzyme A, indicating that it has medium-chain-specific activity. These data indicate that the protein reported here is different from previously characterized classes of ACX1, ACX2, and short-chain ACX (SACX), both in sequence and substrate chain-length specificity profile. We therefore designate this new gene AtACX3. The temporal and spatial expression patterns of AtACX3 during development and in various tissues were similar to those of the AtSACX and other genes expressed in glyoxysomes. Currently available database information indicates that AtACX3 is present as a single copy gene.

Plant Physiol, 2000 Jun, 123(2), 711-24
Cytochrome P450-dependent metabolism of oxylipins in tomato. Cloning and expression of allene oxide synthase and fatty acid hydroperoxide lyase; Howe GA et al.; Allene oxide synthase (AOS) and fatty acid hydroperoxide lyase (HPL) are plant-specific cytochrome P450s that commit fatty acid hydroperoxides to different branches of oxylipin metabolism. Here we report the cloning and characterization of AOS (LeAOS) and HPL (LeHPL) cDNAs from tomato (Lycopersicon esculentum). Functional expression of the cDNAs in Escherichia coli showed that LeAOS and LeHPL encode enzymes that metabolize 13- but not 9-hydroperoxide derivatives of C(18) fatty acids. LeAOS was active against both 13S-hydroperoxy-9(Z),11(E),15(Z)-octadecatrienoic acid (13-HPOT) and 13S-hydroperoxy-9(Z),11(E)-octadecadienoic acid, whereas LeHPL showed a strong preference for 13-HPOT. These results suggest a role for LeAOS and LeHPL in the metabolism of 13-HPOT to jasmonic acid and hexenal/traumatin, respectively. LeAOS expression was detected in all organs of the plant. In contrast, LeHPL expression was predominant in leaves and flowers.
Damage inflicted on leaves by chewing insect larvae led to an increase in the local and systemic expression of both genes, with LeAOS showing the strongest induction. Wound-induced expression of LeAOS also occurred in the def-1 mutant that is deficient in octadecanoid-based signaling of defensive proteinase inhibitor genes. These results demonstrate that tomato uses genetically distinct signaling pathways for the regulation of different classes of wound responsive genes.

Mol Med, 2000 Feb, 6(2), 126-35
A secreted tumor-suppressor, mac25, with activin-binding activity; Kato MV; BACKGROUND: mac25 is a follistatin (FS)-like protein that has a growth-suppressing effect on a p53-deficient osteosarcoma cell line (Saos-2). The protein exhibits a strong homology to FS, an activin-binding protein, and part of its sequence includes the consensus sequence of the member of the Kazal serine protease inhibitor family. MATERIALS AND METHODS: Localization of mac25 protein was analyzed using mac25 protein fused with green fluorescent protein (GFP). Recombinant mac25 protein was expressed in E. coli and purified. The recombinant mac25 protein was added to culture medium for analysis of growth suppression and cell cycle analysis. Binding of mac25 protein to activin A was studied by immunoprecipitation and Western blot analysis. RESULTS: mac25 protein was localized in the cytoplasm and secreted into culture medium. Addition of recombinant mac25 protein (10(-7) M) into the culture medium induced significant suppression of the growth of human cervical carcinoma cells (HeLa) and murine embryonic carcinoma cells (P19), as well as osteosarcoma cells (Saos-2). mac25 protein was co-immunoprecipitated with activin A, a result that suggests that mac25 may be a secreted tumor-suppressor that binds activin A. CONCLUSION: mac25 exhibits homology to insulin-like growth factor-binding proteins (IGF-BPs) and to fibroblast growth factor receptor.
The multi-functional nature of mac25 protein may be important for growth-suppression and/or cellular senescence.

Mol Med, 2000 Feb, 6(2), 96-103
Intravenous injection of an adenovirus encoding hepatocyte growth factor results in liver growth and has a protective effect against apoptosis; Phaneuf D et al.; BACKGROUND: Hepatocyte growth factor/scatter factor (HGF/SF) is a pleiotropic cytokine with mitogenic, motogenic and morphogenic effects for a wide variety of cells. Previous studies have reported that the in vivo infusion in normal, untreated mice of recombinant HGF results in low levels of DNA synthesis and liver proliferation. In this study, we examined whether liver regeneration could be obtained by the in vivo injection of a recombinant adenoviral vector encoding human HGF (Ad.CMV.rhHGF) in normal, intact mice. MATERIALS AND METHODS: C57BL/6 mice were infused intravenously with doses increasing from 1 to 4 x 10(11) particles of the recombinant human HGF (rhHGF) adenoviral vector or with a control virus encoding Escherichia coli beta-galactosidase (Ad.CMV.lacZ). At day 5, mice were sacrificed and evaluated for the presence of hepatocyte mitogenesis and liver regeneration (5-bromo-2'-deoxyuridine (BrdU) assays and liver weight determination) and for the presence of liver damage (serum alanine amino-transferase (ALT) measurements and TUNEL assays). RESULTS: In vivo administration of rhHGF stimulated DNA synthesis of hepatocytes and liver weight in a dose-dependent fashion. The maximal effect was seen after the infusion of 3 x 10(11) particles, which resulted at day 5 in a >130% increase in relative liver mass with little cytopathic effect. In contrast, administration of the lacZ adenoviral vector caused little hepatocyte replication, but induced high levels of serum ALT (approximately 3 times higher than the rhHGF vector) and significant apoptotic cell death.
CONCLUSIONS: This study shows that a single injection of Ad.CMV.rhHGF alone is able to induce, in vivo and in a very short period of time, robust DNA synthesis and liver proliferation in normal mice without liver injury or partial hepatectomy. This recombinant adenoviral vector has a lower toxicity than the control lacZ adenovirus. This suggests that HGF may have a protective effect against adenovirus-induced pathology.

Biochim Biophys Acta, 2000 Jun 21, 1492(1), 269-70
Extraordinarily high density of unrelated genes showing overlapping and intraintronic transcription units; Misener SR et al.; The cloning of pyrroline 5-carboxylate reductase from Drosophila melanogaster was accomplished by cDNA complementation of an Escherichia coli proline auxotroph. The corresponding P5cr gene is tightly clustered with three other expressed coding regions. A bidirectional promoter, an overlapping 3'UTR and an intraintronic sequence may all be found in only 4.3 kb, making this the most densely clustered region of unrelated genes in any eukaryote.

Biochim Biophys Acta, 2000 Jun 21, 1492(1), 94-9
K(+) and Mg(2+) ions promote the self-splicing of the td intron RNA inhibited by spectinomycin; Park IK et al.; The effects of Mg(2+) and K(+) ions on the self-splicing inhibition of the td (thymidylate synthase gene) intron RNA by spectinomycin were investigated. The maximum splicing activity occurred at 20 mM KCl. The K(m) and V(max) values for GTP in the presence of 5 mM Mg(2+) are 2.25 microM and 0.55 min(-1), whereas those for GTP in the presence of both 5 mM Mg(2+) and 5 mM K(+) are 1.23 microM and 0.46 min(-1), respectively. Spectinomycin at 10 mM concentration inhibited the splicing by about 10%, but at 20 mM concentration, the splicing rate was inhibited by about 63%. The splicing inhibition by the low concentration of spectinomycin was overcome markedly as the concentration of Mg(2+) ion was raised.
At 30 mM spectinomycin, however, the splicing inhibition was not significantly affected by increasing the concentration of Mg(2+). A similar activation of the splicing rate was observed as the concentration of K(+) ion was increased. The concentration of K(+) ion required for the normal recovery of the splicing was much higher than that of Mg(2+) ion. Unlike Mg(2+) ion, 30 mM K(+) ion effectively alleviated the splicing inhibition by spectinomycin at its high concentration. The results indicate that K(+) and Mg(2+) ions may show mechanistically different interactions with spectinomycin in the self-splicing reaction of the td intron RNA.

J Biol Chem, 2000 Jun 23, 275(25), 18698-703
A thermodynamic coupling mechanism for the disaggregation of a model peptide substrate by chaperone secB; Panse VG et al.; Molecular chaperones prevent protein aggregation in vivo and in vitro. In a few cases, multichaperone systems are capable of dissociating aggregated state(s) of substrate proteins, although little is known of the mechanism of the process. SecB is a cytosolic chaperone, which forms part of the precursor protein translocation machinery in Escherichia coli. We have investigated the interaction of the B-chain of insulin with chaperone SecB by light scattering, pyrene excimer fluorescence, and electron spin resonance spectroscopy. We show that SecB prevents aggregation of the B-chain of insulin. We show that SecB is capable of dissociating soluble B-chain aggregates as monitored by pyrene fluorescence spectroscopy. The kinetics of dissociation of the B-chain aggregate by SecB has been investigated to understand the mechanism of dissociation. The data suggest that SecB does not act as a catalyst in dissociation of the aggregate to individual B-chains; rather, it binds the small population of free B-chains with high affinity, thereby shifting the equilibrium from the ensemble of the aggregate toward the individual B-chains.
Thus SecB can rescue aggregated, partially folded/misfolded states of target proteins by a thermodynamic coupling mechanism when the free energy of binding to SecB is greater than the stability of the aggregate. Pyrene excimer fluorescence and ESR methods have been used to gain insights on the bound state conformation of the B-chain to chaperone SecB. The data suggest that the B-chain is bound to SecB in a flexible extended state in a hydrophobic cleft on SecB and that the binding site accommodates approximately 10 residues of substrate.

J Biol Chem, 2000 Sep 1, 275(35), 26821-7
A key point mutation (V156E) affects the structure and functions of human apolipoprotein A-I; Cho KH et al.; A naturally occurring point mutant of human apolipoprotein A-I (apoA-I), V156E, which is associated with extremely low plasma apoA-I and high density lipoprotein (HDL) levels, and coronary artery disease (Huang, W., Sasaki, J., Matsunaga, A., Nanimatsu, H., Moriyama, K., Han, H., Kugi, M., Koga, T., Yamaguchi, K., and Arakawa, K. (1998) Arterioscler. Thromb. Vasc. Biol. 18, 389-396), was produced in an Escherichia coli expression system. The purified recombinant proapoA-I V156E mutant was examined for its structural and functional properties, both in the lipid-free and lipid-bound states. In the lipid-free form the mutant protein exhibited small changes in conformation, but was more stable, and quite resistant to self-association, compared with control apoA-I. The V156E mutant was able to interact with phospholipid (PL) at high PL:protein ratios (95:1, mol/mol), but was inefficient in forming reconstituted HDL (rHDL) complexes at lower PL:protein ratios (40:1). In the lipid-bound, rHDL state, the mutant protein was somewhat more alpha-helical and formed a larger complex (110 Å) than control apoA-I (97 Å).
Furthermore, the rHDL particles containing the V156E mutant did not rearrange to smaller particles in the presence of low density lipoproteins, and had minimal reactivity with lecithin-cholesterol acyltransferase (LCAT), compared with rHDL particles made with control apoA-I. These results suggest a key role for Val-156, or the adjacent central region of apoA-I, in the modulation of apoA-I conformation, stability, and self-association in solution, and in the formation of small HDL, the conformational adaptability of apoA-I leading to structural rearrangements of HDL, and the activation of LCAT.

Chest, 2000 Jun, 117(6), 1720-7
Effects of eicosapentaenoic and gamma-linolenic acids (dietary lipids) on pulmonary surfactant composition and function during porcine endotoxemia; Murray MJ et al.; STUDY OBJECTIVES: To investigate whether a diet enriched with fish and borage oils, with their high polyunsaturated fatty acid (PUFA) content, alters surfactant composition and function during endotoxemia. DESIGN: Prospective, randomized, blinded, controlled animal study. SETTING: Research laboratory at a medical center. PARTICIPANTS: Thirty-six 15- to 25-kg, disease-free, castrated male pigs. DIETS AND MEASUREMENTS: Three groups of pigs (n = 12 per group) were fed for 8 days diets containing either omega-6 fatty acids (FAs) (corn oil; diet A), or omega-3 FAs (fish oil; diet B), or a combination of omega-6 and omega-3 FAs (borage and fish oils; diet C). Eight of 12 pigs in each group received a 0.1-mg/kg bolus of Escherichia coli endotoxin followed by a continuous infusion (0.075 mg/kg/h). One lung was subsequently isolated ex vivo, and pressure-volume curves were measured. The contralateral lung was lavaged, and surfactant was analyzed for total and individual phospholipids and FA composition. Minimum and maximum surface tension were measured by bubble surfactometry.
RESULTS: Pigs fed either diet B or C had increased oleic acid (C(18:1) omega-9), eicosapentaenoic acid (EPA; C(20:5) omega-3), docosahexaenoic acid (C(22:6) omega-3), and total omega-3 and monounsaturated FAs in their surfactant PUFA pools. The relative percentages of linoleic acid (C(18:2) omega-6) and total omega-6 FAs were significantly lower in pigs fed diets B and C compared with diet A. Concentrations of palmitic acid (C(16:0)), the primary FA in surfactant, tended to be lower in pigs fed diets B and C. There were no demonstrable effects on surfactant function or pulmonary compliance. CONCLUSIONS: Diets containing EPA or EPA and gamma-linolenic acid altered the PUFA composition of pulmonary surfactant, but without demonstrable effects on surfactant function during porcine endotoxemia.

Chem Res Toxicol, 2000 Jun, 13(6), 523-8
Mutagenicity of the 1-nitropyrene-DNA adduct N-(deoxyguanosin-8-yl)-1-aminopyrene in Escherichia coli located in a nonrepetitive CGC sequence; Bacolod MD et al.; 1-Nitropyrene, a common environmental pollutant, forms a major DNA adduct, N-(deoxyguanosin-8-yl)-1-aminopyrene (dG(AP)). Mutational spectra of randomly introduced dG(AP) in Escherichia coli included many different types of mutations. However, a prior site-specific study in a CGCG(AP)CG sequence showed only CpG deletions and +1 frame shifts. To further explore the context effects of dG(AP) in mutagenesis, in this work this adduct was incorporated into a nonrepetitive CGC sequence in single-stranded M13mp7L2 DNA. Upon replication of this construct in repair-competent E. coli, one-base deletions and base substitutions were detected. The -1 frame shifts, whose frequency increased 3-6-fold with SOS (to an average frequency of 1.5%), involved deletion of the adjacent C residues. The base substitutions (approximately 2.2%) included targeted G-to-T and G-to-C transversions, whose frequencies did not increase with SOS.
This suggests that dG(AP) mutagenesis is highly dependent on the local DNA sequence.

What is the current status of gene therapy research?

The Food and Drug Administration (FDA) has not yet approved any human gene therapy product for sale. Current gene therapy is experimental and has not proven very successful in clinical trials. Little progress has been made since the first gene therapy clinical trial began in 1990. In 1999, gene therapy suffered a major setback with the death of 18-year-old Jesse Gelsinger. Jesse was participating in a gene therapy trial for ornithine transcarbamylase deficiency (OTCD). He died from multiple organ failure 4 days after starting the treatment. His death is believed to have been triggered by a severe immune response to the adenovirus carrier.

Another major blow came in January 2003, when the FDA placed a temporary halt on all gene therapy trials using retroviral vectors in blood stem cells. FDA took this action after it learned that a second child treated in a French gene therapy trial had developed a leukemia-like condition. Both this child and another who had developed a similar condition in August 2002 had been successfully treated by gene therapy for X-linked severe combined immunodeficiency disease (X-SCID), also known as "bubble baby syndrome." FDA's Biological Response Modifiers Advisory Committee (BRMAC) met at the end of February 2003 to discuss possible measures that could allow a number of retroviral gene therapy trials for treatment of life-threatening diseases to proceed with appropriate safeguards. FDA has yet to make a decision based on the discussions and advice of the BRMAC meeting.

What factors have kept gene therapy from becoming an effective treatment for genetic disease?
Short-lived nature of gene therapy - Before gene therapy can become a permanent cure for any condition, the therapeutic DNA introduced into target cells must remain functional, and the cells containing the therapeutic DNA must be long-lived and stable. Problems with integrating therapeutic DNA into the genome and the rapidly dividing nature of many cells prevent gene therapy from achieving any long-term benefits. As a result, patients will have to undergo multiple rounds of gene therapy.
Immune response - Any time a foreign object is introduced into human tissues, the immune system is designed to attack the invader. There is always a risk of stimulating the immune system in a way that reduces the effectiveness of gene therapy. Furthermore, the immune system's enhanced response to invaders it has seen before makes it difficult for gene therapy to be repeated in patients.
Problems with viral vectors - Viruses, while the carrier of choice in most gene therapy studies, present a variety of potential problems to the patient: toxicity, immune and inflammatory responses, and gene control and targeting issues. In addition, there is always the fear that the viral vector, once inside the patient, may recover its ability to cause disease.
Multigene disorders - Conditions or disorders that arise from mutations in a single gene are the best candidates for gene therapy. Unfortunately, some of the most commonly occurring disorders, such as heart disease, high blood pressure, Alzheimer's disease, arthritis, and diabetes, are caused by the combined effects of variations in many genes. Multigene or multifactorial disorders such as these would be especially difficult to treat effectively using gene therapy.
For more information on different types of genetic disease, see Genetic Disease Information.
What are some recent developments in gene therapy research?
A University of California, Los Angeles, research team gets genes into the brain using liposomes coated in a polymer called polyethylene glycol (PEG). The transfer of genes into the brain is a significant achievement because viral vectors are too big to get across the "blood-brain barrier." This method has potential for treating Parkinson's disease. See Undercover genes slip into the brain at NewScientist.com (March 20, 2003).
RNA interference, or gene silencing, may be a new way to treat Huntington's disease. Short pieces of double-stranded RNA (short interfering RNAs, or siRNAs) are used by cells to degrade RNA of a particular sequence. If an siRNA is designed to match the RNA copied from a faulty gene, then the abnormal protein product of that gene will not be produced. See Gene therapy may switch off Huntington's at NewScientist.com (March 13, 2003).
A new gene therapy approach repairs errors in messenger RNA derived from defective genes. The technique has potential to treat the blood disorder thalassaemia, cystic fibrosis, and some cancers. See Subtle gene therapy tackles blood disorder at NewScientist.com (October 11, 2002).
Gene therapy for treating children with X-SCID (severe combined immunodeficiency), or the "bubble boy" disease, is stopped in France when the treatment causes leukemia in one of the patients. See 'Miracle' gene therapy trial halted at NewScientist.com (October 3, 2002).
Researchers at Case Western Reserve University and Copernicus Therapeutics are able to create tiny liposomes 25 nanometers across that can carry therapeutic DNA through pores in the nuclear membrane. See DNA nanoballs boost gene therapy at NewScientist.com (May 12, 2002).
Sickle cell disease is successfully treated in mice. See Murine Gene Therapy Corrects Symptoms of Sickle Cell Disease from the March 18, 2002, issue of The Scientist.
What are some of the ethical considerations for using gene therapy?
Some Questions to Consider...
What is normal and what is a disability or disorder, and who decides?
Are disabilities diseases? Do they need to be cured or prevented?
Does searching for a cure demean the lives of individuals presently affected by disabilities?
Is somatic gene therapy (which is done in the adult cells of persons known to have the disease) more or less ethical than germline gene therapy (which is done in egg and sperm cells and prevents the trait from being passed on to future generations)? In cases of somatic gene therapy, the procedure may have to be repeated in future generations.
Preliminary attempts at gene therapy are exorbitantly expensive. Who will have access to these therapies? Who will pay for their use?
© 2005 Transgalactic Ltd (manufacturer of Bioscreen C software), P.O. Box 1393, 00101 Helsinki, Finland
Last modified: May 25, 2005