|








| |
Scientific
Publications - Work Done by Microbiology Reader Bioscreen C
| United States Patent Application |
20040210398 |
| Kind Code |
A1 |
| Palsson, Bernhard O. ; et al. |
October 21, 2004 |

Methods and systems to identify operational reaction pathways
Abstract
The present invention provides a method of refining a biosystem reaction
network. The method consists of: (a) providing a mathematical representation of
a biosystem; (b) reconciling said mathematical representation of said biosystem;
(c) determining differences between observed behavior of a biosystem and in
silico behavior of said mathematical representation of said biosystem under
similar conditions; (d) modifying a structure of said mathematical
representation of said biosystem, and (e) determining differences between said
observed behavior of said biosystem and in silico behavior of said modified
mathematical representation of said biosystem under similar conditions.

| Inventors: |
Palsson, Bernhard O.; (La Jolla, CA) ;
Covert, Markus W.; (South Pasadena, CA) ; Herrgard, Markus; (La
Jolla, CA) |
| Correspondence Name and Address:
|
GRAY CARY WARE & FREIDENRICH LLP
4365 EXECUTIVE DRIVE
SUITE 1100
SAN DIEGO
CA
92121-2133
US
|
| Assignee Name and Adress: |
The Regents of the University of California
Oakland
CA
|
| Serial No.: |
833584 |
| Series Code: |
10 |
| Filed: |
April 27, 2004 |
| U.S. Current Class: |
702/19 |
| U.S. Class at Publication: |
702/019 |
| Intern'l Class: |
G06F 019/00; G01N 033/48; G01N
033/50 |

Claims

We claim:
1. A method of refining a biosystem reaction network, comprising: (a) providing
a mathematical representation of a biosystem; (b) reconciling said mathematical
representation of said biosystem; (c) determining differences between observed
behavior of a biosystem and in silico behavior of said mathematical
representation of said biosystem under similar conditions; (d) modifying a
structure of said mathematical representation of said biosystem, and (e)
determining differences between said observed behavior of said biosystem and in
silico behavior of said modified mathematical representation of said biosystem
under similar conditions.
2. The method of claim 1, further comprising, repeating steps (d) and (e) until
behavioral differences are minimized, wherein satisfaction of a predetermined
accuracy criteria indicates an improvement in said biosystem reaction network.
3. The method of claim 1, further comprising repeating steps (c) through (e)
under different conditions.
4. The method of claim 3, further comprising performing iterations until
behavioral differences are minimized.
5. The method of claim 4, further comprising exhausting said iterations to
produce an improved biosystem reaction network representing an optimal biosystem
reaction network.
6. The method of claim 1, wherein said biosystem is a prokaryotic cell, or
biological process thereof.
7. The method of claim 6, wherein said prokaryotic organism is selected from the
group consisting of E. coli, B. subtilis, H. influenzae and H. pylori.
8. The method of claim 6, wherein said biological process is metabolism.
9. The method of claim 1, wherein said biosystem is a eukaryotic cell, or
biological process thereof.
10. The method of claim 9, wherein said eukaryotic organism is selected from the
group consisting of S. cerevisiae and H. sapiens.
11. The method of claim 10, wherein said biological process is metabolism.

Description

CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This application is a continuation-in-part of U.S. Ser. No. 10/367,248,
filed Feb. 14, 2003, now pending, which under 35 U.S.C. .sctn.119(e) claims
benefit of 60/419,023 filed Oct. 15, 2002 (now abandoned); and Application
Serial No. 60/562,055, filed Apr. 13, 2004 (now pending), the contents of which
are herein incorporated by reference in their entirety.
BACKGROUND OF THE INVENTION
[0002] This invention relates generally to the construction of in silico model
organisms and, more specifically, methods and systems specifying operational
reaction pathways and for the generation of optimal in silico models of actual
organisms.
[0003] Therapeutic agents, including drugs and gene-based agents, are being
rapidly developed by the pharmaceutical industry with the goal of preventing or
treating human disease. Dietary supplements, including herbal products, vitamins
and amino acids, are also being developed and marketed by the nutraceutical
industry. Additionally, efforts for faster and more effective methods for
biological fermentation and other bioprocessing of food stuffs and industrial
compounds has been under development. Faster and more efficient production of
crops and other agricultural products is also yet another area of intense
development in the food industry.
[0004] Because of the complexity of biochemical reaction networks in and between
cells of an organism, even relatively minor perturbations caused by a
therapeutic agent, change in a dietary component or environmental or growth
conditions, can affect hundreds of biochemical reactions. Such changes or
perturbations can lead to both desirable and undesirable effects in any
therapeutic, industrial or agricultural process involving living cells. It would
therefore be beneficial if a particular process could predict the effects on a
living system such as a cell or organism of such perturbations.
[0005] However, current approaches to therapeutic, industrial and agricultural
development for compounds and processes used therein do not take into account
the effect of perturbations on cellular behavior at the level of accuracy needed
for efficient and economical production of products. In order to design
effective methods of manipulating cellular activities for the optimization of
such processes or to achieve the optimal intended effect of an applied a
compound, it would be helpful to understand cellular behavior from an integrated
perspective.
[0006] However, cellular behaviors involve the simultaneous function and
integration of many interrelated genes, gene products and chemical reactions.
Because of this interconnectivity, it is difficult to predict a priori the
effect of a change in a single gene or gene product, or the effect of a drug or
an environmental factor, on cellular behavior. The ability to accurately predict
cellular behavior under different conditions would be extremely valuable in many
areas of medicine and industry. For example, if it were possible to predict
which gene products are suitable drug targets, it would considerably shorten the
time it takes to develop an effective antibiotic or anti-tumor agent. Likewise,
if it were possible to predict the optimal fermentation conditions and genetic
make-up of a microorganism for production of a particular industrially important
product, it would allow for rapid and cost-effective improvements in the
performance of these microorganisms.
[0007] Thus, there exists a need for models and modeling methods that can be
used to accurately simulate and effectively analyze the behavior of cells and
organisms under a variety of conditions. The present invention satisfies this
need and provides related advantages as well.
SUMMARY OF THE INVENTION
[0008] The present invention provides a method of refining a biosystem reaction
network. The method consists of: (a) providing a mathematical representation of
a biosystem; (b) reconciling said mathematical representation of said biosystem;
(c) determining differences between observed behavior of a biosystem and in
silico behavior of said mathematical representation of said biosystem under
similar conditions; (d) modifying a structure of said mathematical
representation of said biosystem, and (e) determining differences between said
observed behavior of said biosystem and in silico behavior of said modified
mathematical representation of said biosystem under similar conditions.
BRIEF DESCRIPTION OF THE DRAWINGS
[0009] FIG. 1 shows a schematic diagram for steps involved in determining
operational pathways of a biochemical reaction network.
[0010] FIG. 2A shows a schematic representation of systemic reaction pathways as
one branch of a regulatory tree with the regulated genes shown on the horizontal
axis.
[0011] FIG. 2B shows a process by which mathematical representations of
biosystems can be improved in an iterative fashion using algorithmic approaches
and targeted experimentation.
[0012] FIG. 3 shows a phase plane for succinate for an in silico-generated
metabolic flux profile of core metabolism in E. coli was prepared.
[0013] FIG. 4 shows phase I of a phase plane for a flux distribution matrix
generated with the E. coli core metabolism using the oxygen and succinate input
values show next to the figure.
[0014] FIG. 5 shows an Singular Value Decomposition (SVD) analysis on the flux
matrix shown in FIG. 4.
[0015] FIG. 6 shows the contribution level of each condition, or point shown in
phase I of the FIG. 4 phase plane, for various modes obtained from SVD.
[0016] FIG. 7 shows the contribution level of each condition, or point shown in
phase I of the FIG. 4 phase plane, for various modes obtained from SVD.
[0017] FIG. 8 shows the reduced set of extreme pathways for succinate that is
presented in Table 2.
[0018] FIG. 9 shows a schematic diagram of flux balance analysis (FBA) and
convex analysis to identify extreme and operational pathways of the invention.
[0019] FIG. 10 shows decomposed flux vectors using the modes obtained from SVD
of P for the extreme pathways of the red blood cell (RBC) metabolic network.
[0020] FIG. 11 shows a histogram of the first five modes of the SVD analysis
shown in FIG. 10 under maximum (Max), moderate (Mid) and nominal state (no load)
oxidative and energy loads.
[0021] FIG. 12 shows a schematic diagram for building large-scale in silico
models of complex biological processes.
[0022] FIG. 13 shows the localization of single nucleotide polymorphism clusters
found in clinically diagnosed glucose-6-phosphate dehydrogenase (G6PD) patients.
[0023] FIG. 14 shows the toleration of oxidative load between chronic and
non-chronic hemolytic anemia states having G6PD SNPs.
[0024] FIG. 15 shows the characterization and toleration of energy loads for
glycolytic states harboring different pyruvate kinase (PK) SNP variants.
[0025] FIG. 16 shows the reconciliation of legacy and empirical data sets for
regulatory networks of yeast and E. coli.
[0026] FIG. 17 shows a schematic diagram of an algorithm for reconciliation of
data sets and iterative improvement of a mathematical or in silico model.
[0027] FIG. 18 shows a skeleton network of core metabolism and regulation,
together with a table containing relevant chemical reactions and regulatory
rules which govern the transcriptional regulation.
[0028] FIG. 19 shows calculation of the expression of regulated genes in an
actual organism and model system resulting from phase I of an iterative process
of the invention.
[0029] FIG. 20 shows computed flux distributions using flux balance analysis (FBA)
for the aerobic growth without regulation using an in silico model of the
invention.
[0030] FIG. 21 shows the results of a growth phenotype study. Panel (a) is a
comparison of high-throughput phenotyping array data with in silico predictions
for an E. coli regulatory network. Panel (b) shows the results for individual
knockout strains under each environmental condition whereas panel (c) summaries
the environments or knockout strains for which the fraction of agreement between
regulatory model determinations and observed phenotypes meet a threshold.
[0031] FIG. 22 shows (a) a characterization and accuracy of a regulatory network
related to the aerobic-anaerobic shift; (b) a comparison of predicted and
observed expression changes for a base in silico and a refinement of that model,
and (c) the results of a perturbation analysis identifying transcription factors
responsible for gene expression change.
[0032] FIG. 23 shows biosystem network reconciliation and refinement for the
expansion of multiple interrelated networks by applying phenotyping and gene
expression data in connection with in silico model results.
[0033] FIG. 24 shows a sensitivity analysis of the phenotype cutoff parameter
used for data normalization.
DETAILED DESCRIPTION OF THE INVENTION
[0034] The invention provides methods and systems for determining the
interaction, integration and coordination of a set of components of a biosystem.
The invention can thus be used to rapidly and systematically specify a
reconstructed biochemical reaction network at the genome-scale and to relate the
activity of the components and their interaction to a specific phenotype or
physiological state. Understanding which components are operational under
particular conditions allows for improved methods of engineering desirable
functions into living cells, fixing malfunctioning circuits, and controlling
endogenous circuits by the proper manipulation of the cells' environment.
Furthermore, a rapid method for characterizing a biochemical network allows for
the characterization of a virtually uncharacterized biosystem with a minimum of
experimental effort.
[0035] The invention provides a method for determining the operational pathways
of a biochemical reaction network. The invention method is practiced by (a)
providing a biochemical reaction network, comprised of reactions which can be
regulated; (b) providing a set of experimental data which represent various
physiological or pathological states of the biosystem under given conditions;
(c) determining a set of systemic pathways which define the biosystem in whole
or in part; (d) determining a set of phenomenological reaction pathways which
describe the experimental states of the biosystem; and (e) determining the
operational pathways common to both the systemic and phenomenological pathways
sets both at whole-genome and biosystem subcomponent scale (FIG. 1).
[0036] As used herein, the term "reaction" is intended to mean a chemical
conversion that consumes a substrate or forms a product. A conversion included
in the term can occur due to the activity of one or more enzymes that are
genetically encoded by an organism, or can occur spontaneously in a cell or
organism. A conversion included in the term can be, for example, a conversion of
a substrate to a product, such as one due to nucleophilic or electrophilic
addition, nucleophilic or electrophilic substitution, elimination, reduction or
oxidation. A conversion included in the term can also be a change in location,
such as a change that occurs when a reactant is transported across a membrane or
from one compartment to another. The substrate and product of a reaction can be
differentiated according to location in a particular compartment, even though
they are chemically the same. Thus, a reaction that transports a chemically
unchanged reactant from a first compartment to a second compartment has as its
substrate the reactant in the first compartment and as its product the reactant
in the second compartment. The term "reaction" also includes a conversion that
changes a macromolecule from a first conformation, or substrate conformation, to
a second conformation, or product conformation. Such conformational changes can
be due, for example, to transduction of energy due to binding a ligand such as a
hormone or receptor, or from a physical stimulus such as absorption of light. It
will be understood that when used in reference to an in silico biochemical
reaction network, a "reaction" is intended to be a representation of a
conversion as described above.
[0037] As used herein, the term "reactant" is intended to mean a chemical that
is a substrate or a product of a reaction. The term can include substrates or
products of reactions catalyzed by one or more enzymes encoded by an organism's
genome, reactions occurring in an organism that are catalyzed by one or more
non-genetically encoded catalysts, or reactions that occur spontaneously in a
cell or organism. Metabolites are understood to be reactants within the meaning
of the term. It will be understood that when used in the context of an in silico
model or data structure, a reactant is understood to be a representation of
chemical that is a substrate or product of a reaction.
[0038] As used herein the term "substrate" is intended to mean a reactant that
can be converted to one or more products by a reaction. The term can include,
for example, a reactant that is to be chemically changed due to nucleophilic or
electrophilic addition, nucleophilic or electrophilic substitution, elimination,
reduction or oxidation or that is to change location such as by being
transported across a membrane or to a different compartment. The term can
include a macromolecule that changes conformation due to transduction of energy.
[0039] As used herein, the term "product" is intended to mean a reactant that
results from a reaction with one or more substrates. The term can include, for
example, a reactant that has been chemically changed due to nucleophilic or
electrophilic addition, nucleophilic or electrophilic substitution, elimination,
reduction or oxidation or that has changed location such as by being transported
across a membrane or to a different compartment. The term can include a
macromolecule that changes conformation due to transduction of energy.
[0040] As used herein, the term "regulatory reaction" is intended to mean a
chemical conversion or interaction that alters the activity of a catalyst. A
chemical conversion or interaction can directly alter the activity of a catalyst
such as occurs when a catalyst is post-translationally modified or can
indirectly alter the activity of a catalyst such as occurs when a chemical
conversion or binding event leads to altered expression of the catalyst. Thus,
transcriptional or translational regulatory pathways can indirectly alter a
catalyst or an associated reaction. Similarly, indirect regulatory reactions can
include reactions that occur due to downstream components or participants in a
regulatory reaction network. When used in reference to a data structure or in
silico model, the term is intended to mean a first reaction that is related to a
second reaction by a function that alters the flux through the second reaction
by changing the value of a constraint on the second reaction.
[0041] A regulatory reaction can further include information about inhibitory or
inducing effects of an active or inactive regulator on transcription of a gene.
For example, a regulatory reaction may have one or more regulators associated
with it which effect transcription of a gene.
[0042] A regulatory reaction can further include information about the
interaction of regulators which influence gene expression. For example a
regulatory reaction may have a combination of two or more regulators associated
with it which are dependent upon each other to effect transcription of a gene.
[0043] A regulatory reaction can further include information in the form of
Boolean logic statements which indicates the interaction and dependency of
regulators for transcription of a particular gene. For example, a particular
gene may have a Boolean logic assigned to it which describes the necessary
regulators and regulatory interactions required for expression of that gene.
[0044] As used herein, the term "regulator" refers to a substance which
regulates transcription, post-transcriptional modification or activity of one or
more genes, proteins, mRNA transcripts. Such a regulator may be a regulatory
protein, small molecule and the like.
[0045] As used herein, the term "regulatory event" is intended to mean a
modifier of the flux through a reaction that is independent of the amount of
reactants available to the reaction. A modification included in the meaning of
the term can be a change in the presence, absence, or amount of an enzyme that
catalyzes a reaction. A modifier included in the term can be a regulatory
reaction such as a signal transduction reaction or an environmental condition
such as a change in pH, temperature, redox potential or time. It will be
understood that when used in reference to an in silico model or data structure a
regulatory event is intended to be a representation of a modifier of the flux
through a reaction that is independent of the amount of reactants available to
the reaction.
[0046] As used herein, the term "reaction network" refers to a representation of
the functional interrelationships between a collection of reactions and reaction
components. Reaction components included in a reaction network can be any
component involved in a reaction, such as a substrate, product, enzyme,
cofactor, activator, inhibitor, transporter, and the like. Functional
interrelationships include, for example, those between a substrate and its
product; those between a substrate or product and the enzyme that catalyzes the
conversion from substrate to product; those between an enzyme and its cofactor,
activator or inhibitor; those between a receptor and a ligand or other pairs of
macromolecules that physically interact; those between a macromolecule and its
transporter; those between proteins involved in transcriptional regulation and
their DNA-binding sites in regulatory regions regulating specific target genes;
and the like.
[0047] A reaction network can further include information regarding the
stoichiometry of reactions within the network. For example, a reaction component
can have a stoichiometric coefficient assigned to it that reflects the
quantitative relationship between that component and other components involved
in the reaction.
[0048] A reaction network can further include information regarding the
reversibility of reactions within the network. A reaction can be described as
occurring in either a reversible or irreversible direction. Reversible reactions
can either be represented as one reaction that operates in both the forward and
reverse direction or be decomposed into two irreversible reactions, one
corresponding to the forward reaction and the other corresponding to the
backward reaction.
[0049] A reaction network can include both intra-system reactions and exchange
reactions. Intra-system reactions are the chemically and electrically balanced
interconversions of chemical species and transport processes, which serve to
replenish or drain the relative amounts of certain reactants. Exchange reactions
are those which constitute sources and sinks, allowing the passage of reactants
into and out of a compartment or across a hypothetical system boundary. These
reactions represent the demands placed on the biological system. As a matter of
convention the exchange reactions are further classified into demand exchange
and input/output exchange reactions. Input/output exchange reactions are used to
allow components to enter or exit the system. A demand exchange reaction is used
to represent components that are required to be produced by the cell for the
purposes of creating a new cell, such as amino acids, nucleotides,
phospholipids, and other biomass constituents, or metabolites that are to be
produced for alternative purposes.
[0050] A reaction network can further include both metabolic and regulatory
reactions. Metabolic reactions can be represented by stoichiometry and
reversibility while regulatory reactions can be represented by Boolean logic
statements which both depend on and effect the presence or absence, activity or
inactivity of metabolic or regulatory proteins.
[0051] A reaction network can be represented in any convenient manner. For
example, a reaction network can be represented as a reaction map with
interrelationships between reactants indicated by arrows. For mathematical
manipulation according to the methods of the invention, a reaction network can
conveniently be represented as a set of linear algebraic equations or presented
as a stoichiometric matrix. A stoichiometric matrix, S, can be provided, which
is an m.times.n matrix where m corresponds to the number of reactants and n
corresponds to the number of reactions in the network. Stoichiometric matrices
and methods for their preparation and use are described, for example, in
Schilling et al., Proc. Natl. Acad. Sci. USA 95:4193-4198 (1998). As a further
example, a reaction network can conveniently be represented as a set of linear
algebraic equations and Boolean logic equations. The Boolean logic equations may
be evaluated and lead to the removal or addition of certain reactions from the
stoichiometric matrix, due to the inhibitory or inducing effect of regulatory
events. Such a representation is described, for example, in Covert M W,
Schilling C H, Palsson B. J. Theor Biol. 213:73-88 (2001).
[0052] The invention methods can be practiced with reaction networks of either
low or high complexity, such as networks that include substantially all of the
reactions that naturally occur for a particular biosystem. Thus, a reaction
network can include, for example, at least about 10, 50, 100, 150, 250, 400,
500, 750, 1000, 2500, 5000 or more reactions, which can represent, for example,
at least about 5%, 10%, 20%, 30%, 50%, 60%, 75%, 90%, 95% or 98% of the total
number of naturally occurring reactions for a particular biosystem.
[0053] A reaction network represents reactions that participate in one or more
biosystems. As used herein, the term "biosystem" refers to an entire organism or
cell therefrom, or to a "biological process" that occurs in, to or by the
organism or cell. Thus, a reaction network can represent reactions that occur at
the whole organismal, whole cell or subcellular level. Additionally, the
reaction network may represent interactions between different organisms or
cells.
[0054] The term "organism" refers both to naturally occurring organisms and to
non-naturally occurring organisms, such as genetically modified organisms. An
organism can be a virus, a unicellular organism, or a multicellular organism,
and can be either a eukaryote or a prokaryote. Further, an organism can be an
animal, plant, protist, fungus or bacteria. Exemplary organisms include
pathogens, and organisms that produce or can be made to produce commercially
important products, such as therapeutics, enzymes, nutraceuticals and other
macromolecules. Examples of organisms include Arabidopsis thaliana, Bacillus
subtilis, Bos taurus, Caenorhabditis elegans, Chlamydomonas reihardtii, Danio
rerio, Dictyostelium discoideum, Drosophila melanogaster, Escherichia coli,
hepatitis C virus, Haemophilus influenzae, Helicobacter pylori, Homo sapiens,
Mus musculus, Mycoplasma pneumoniae, Oryza sativa, Plasmodium falciparum,
Pnemocystis carinii, Rattus norvegicus, Saccharomyces cerevisiae,
Schizosaccharomyces pombe, Takifugu rubripes, Xenopus laevis, Zea mays, and the
like.
[0055] A "biological process" of an organism or cell refers to a physiological
function that requires a series of integrated reactions. A biological process
can be, for example, cellular metabolism; cell motility; signal transduction
(including transduction of signals initiated by hormones, growth factors,
hypoxia, cell-substrate interactions and cell-cell interactions); cell cycle
control; transcription; translation; degradation; sorting; repair;
differentiation; development; apoptosis; and the like. Biological process are
described, for example, in Stryer, L., Biochemistry, W.H. Freeman and Company,
New York, 4th Edition (1995); Alberts et al., Molecular Biology of The Cell,
Garland Publishing, Inc., New York, 2nd Edition (1989); Kuby, Immunology, 3rd
Edition, W.H. Freeman & Co., New York (1997); and Kornberg and Baker, DNA
Replication, W.H. Freeman and Company, New York, 2nd Edition (1992).
[0056] In one embodiment, the biosystem includes the biological process of
cellular metabolism, and the reaction network representing the biosystem,
referred to as a "metabolic reaction network," includes cellular metabolic
reactions. A basic review of cellular metabolism can be found, for example, in
Stryer, L., Biochemistry, W.H. Freeman and Company, New York, 4th Edition
(1995). Cellular metabolism can be usefully divided into central and peripheral
metabolic reactions. Central metabolism includes reactions that belong to
glycolysis, pentose phosphate pathway (PPP), tricarboxylic acid (TCA) cycle and
respiration. Peripheral metabolism, which includes all metabolic reactions that
are not part of central metabolism, includes reactions involved in the
biosynthesis of an amino acid, degradation of an amino acid, biosynthesis of a
purine, biosynthesis of a pyrimidine, biosynthesis of a lipid, metabolism of a
fatty acid, biosynthesis of a cofactor, metabolism of a cell wall component,
transport of a metabolite or metabolism of a carbon source, nitrogen source,
phosphate source, oxygen source, sulfur source, hydrogen source or the like.
[0057] In another embodiment, the biosystem includes the biological process of
transcriptional regulation, and the reaction network representing the biosystem,
referred to as a "transcriptional regulatory reaction network," includes
cellular transcriptional regulatory reactions. A basic review of cellular
transcriptional regulation can be found, for example, in Alberts et al.,
Molecular Biology of The Cell, Garland Publishing, Inc., New York, 2nd Edition
(1989). Transcriptional regulatory events may be grouped by the types of genes
regulated, for example those genes associated with metabolism, cell cycle,
flagellar biosynthesis and the like.
[0058] In another embodiment, the biosystem includes the biological processes of
cellular metabolism and transcriptional regulation and the reaction network
representing the biosystem includes both metabolic and transcriptional
regulatory reactions.
[0059] A reaction network that includes substantially all of the reactions of a
whole organism or cell, or substantially all of the reactions of a particular
biological process of an organism or cell, is referred to as a "genome-scale"
reaction network. Genome-scale reaction networks representing the metabolism of
various organisms have been described, including E. coli (PCT publication WO
00/46405); H. pylori (Schilling et al., J. Bacteriol. 184:4582-4593 (2002)); and
H. influenzae Edwards J. S. and Palsson B. O. J. Biol. Chem. 274:17410-6
(2001)).
[0060] For other biosystems, genome-scale reaction networks can be prepared by
methods known in the art. Generally, these methods involve first generating a
comprehensive list of reactions that are capable of occurring in the organism,
cell or biosystem, and determining their interconnectivity. The list can include
reactions determined from an analysis of the annotated genome of the organism,
supplemented as required from scientific literature and from experimental data.
Also included can be transport reactions, biomass composition demands, growth
associated energy requirements, and the like.
[0061] The genome sequences of a large number of animals, plants, protists,
fungi, bacteria and viruses have been completed or are in progress (see, for
example, genome entries in The Institute for Genome Research (TIGR) database (www.tigr.org/tdb/)
and in the NCBI Entrez Genome database (www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome)).
Other World Wide Web-based sources of annotated genome sequence information and
reconstructed network information include EcoCyc, Metabolic pathways database (MPW),
Kyoto Encyclopedia of Genes and Genomes (KEGG), What is There (WIT) and Biology
Workbench.
[0062] For organisms whose genomes have not yet been sequenced, a variety of
methods for obtaining the genomic sequence are known in the art. In most
large-scale genome sequencing methods, every step from isolating DNA, cloning or
amplifying DNA, preparing sequencing reactions, and separating and detecting
labeled fragments to obtain sequence, is automated (Meldrum, Genome Res. 10:
1081-1092 (2000)). Most methods use a combination of sequencing methods, such as
a combination of random shotgun sequencing with a directed finishing phase.
Other methods use a whole-genome shotgun approach, in which random fragments of
total genomic DNA are subcloned directly, and high-throughput sequencing is used
to provide redundant coverage of the genome. Another approach is to sequence
each end of every BAC in a genome library, and match a finished sequence to a
BAC end sequence to select the next clone (Venter et al., Science 280:1540-1542
(1998); Waterston et al, Science 282:53-54 (1998)).
[0063] For a newly sequenced genome, the open reading frames (ORFs) or coding
regions may be distinguished from the rest of the DNA sequence by variety of
methods. Determining the location of an ORF in a DNA sequence, its strand, and
nucleotide composition may be conducted by searching for gene signals (e.g.,
promoters, binding sites, start and stop codon, etc.) or by analzying gene
content (e.g., codon preference, positional base frequency, etc.), or a
combination of both methods. Algorithms and computational tools are available to
determine the ORFs of an entire DNA sequence using these methods available
through institutes such as the University of Wisconsin Genetics Computer Group
and National Center for Biotechnology Information. Furthermore, other
computational algorithms have been developed by which bacterial or eukaryotic
genes may be identified by algorithmic methods such as hidden Markov models,
which routinely find more than 99% of protein-coding regions and RNA genes (Pevzner,
"Computational molecular biology: an algorithmic approach," in Computational
Molecular Biology. Cambridge, Mass.:MIT Press, xviii, p.314 (2000); Baldi et
al., "Bioinformatics: the machine learning approach," in Adaptive Computation
and Machine Learning. Cambridge, Mass.: MIT Press xviii, p. 351 (1998); Fraser
et al., Nature 406:799-803 (2000)).
[0064] In order to assign function to the coding regions, newly identified ORFs
are searched against databases containing genes and protein sequences of known
function for sequence similarity. Several algorithms such as the BLAST and FASTA
family of programs have been developed and are available publically by which the
similarity of a functionally unknown ORF may be determined against functionally
annotated genes. A major portion of unidentified genes in a newly sequence
organism can be assigned functionally with this procedure.
[0065] If the putative function of a gene is not established by gene or protein
sequence similarity, other techniques such as gene clustering by function or
location may be used to assess the role of a gene in the network. Gene products
that participate in the same overall function can constitute a pathway in the
cell. "Missing links" in a pathway constructed from an initial sequence
annotation suggests the existence of genes that have not yet been identified.
Searching the sequence against other organisms provides clues about the possible
nucleotide sequence of the missing genes, which in turn facilitates targeting
functionality of the unassigned coding regions. Algorithms have been developed
that perform this procedure in various genome databases such as KEGG and WIT. In
addition, genes of the neighboring location may be clustered into operons that
are regulated and function in a coordinated fashion when the DNA sequence is
compared to that of other organisms. From the annotated genetic information,
together with biochemical and physiological information, the interrelatedness of
reactions and reaction components is determined and the reaction network is
completed.
[0066] In addition to defining the ORFs or coding regions of the genome,
regulatory regions can be defined by variety of methods. Regulatory regions
contain binding sites for transcriptional regulators and components of the
transcriptional machinery. These sites determine the specificity of
transcriptional regulation as the ability of transcriptional regulators to
regulate the gene controlled by the regulatory region. The methods to identify
regulatory regions and sites include comparing non-coding regions of closely
related genomes to identify highly conserved segments of the genome that may
correspond to regulatory regions. Groups of non-coding regions of a genome can
also be searched for commonly occurring sequence fragments to identify specific
binding site patterns in the genome. These groups can be defined for example by
similarity in biological function of the genes controlled by the regulatory
regions. In addition existing definitions of binding site patterns for specific
transcriptional regulators stored in specific databases such as Saccharomyces
Promoter Database (Zhu and Zhang, Bioinformatics 15:607-611 (1999)) or TRANSFAC
(Wingender et al., Nucl. Acids Res. 29:281-283 (2001)) can be used to search the
genome for new binding sites for a regulator. Identifying regulatory sites for
specific transcription regulators allows establishing potential target genes
regulated by these regulators and thus suggesting new regulatory reactions to be
added to the regulatory network.
[0067] As used herein, the term "reaction pathway" refers to a route through a
reaction network through which reaction components, regulatory information or
signaling molecules can potentially flow. It will be appreciated that the actual
amount and/or rate of substrate to product conversion through a reaction pathway
(also known as "flux") is a function of the physiological state of the biosystem
under consideration, and that reaction pathways (including operational, extreme
and phenomenological reaction pathways as described below) are generally
specified in connection with the physiological state of the biosystem. The term
"physiological state" is intended to refer to any specified internal and
external parameters that affect, or are likely to affect, flux through a
biosystem. Parameters that can affect flux include, for example, the actual or
intended inputs to the biosystem (such as the carbon, nitrogen, phosphorus,
sulfur or hydrogen source; the presence or amount of oxygen, nutrients,
hormones, growth factors, inhibitors and the like); the actual or intended
outputs of the biological system (such as biomass components, secreted products
and the like) and environmental variables (such as temperature, pH and the
like). Other parameters that can affect flux include, for example, the state of
differentiation or transformation of the cell; cell age; its contact with a
substrate or with neighboring cells; the addition or deletion of expressed
genes; and the like.
[0068] As used herein, term "systemic reaction pathway" refers to a reaction
pathway identified by an automated method applied to a suitable representation
of a reaction network. The method may involve mathematical or algorithmic
operations to identify the reaction pathways, and it may include user definable
parameters that influence the identification of reaction pathways. The systemic
reaction pathways need not to be unique and they may only apply to a subset of
the reaction network.
[0069] Methods of identifying systemic reaction pathways using convex analysis
have been described in the art. Such methods include, for example,
stoichiometric network analysis (SNA) (Clarke, Cell Biophys. 12:237-253 (1988);
elementary mode analysis (Schuster et al., Trends Biotech. 17:53-60 (1999); and
extreme pathway analysis (Schilling et al., J. Theor. Biol. 203:229-248 (2000);
Schilling et al., Biotechnol. Bioeng. 71:286-306 (2001)). The distinctions
between these types of analysis are described in Schilling et al. supra (2000).
[0070] In one embodiment, the systemic reaction pathway is an extreme pathway.
The term "extreme pathway" refers to a systemically independent pathway that
spans a convex, high-dimensional space that circumscribes all potential steady
state flux distributions achievable by a defined reaction network.
[0071] It will be understood that the steps needed to "provide" a set of
systemic reaction pathways for use in the invention methods will depend on the
amount and type of information already available regarding the biosystem and
reaction network. For certain biosystems and physiological states, sets of
extreme reaction pathways have been described in the art. For example, extreme
pathways for a human red blood cell metabolic network are described in Wiback et
al., Biophys. J. 83:808-818 (2002). Extreme pathways for a H. influenzae
metabolic network are described in Schilling et al., J. Theor. Biol. 203:249-283
(2000) and Papin et al., J. Theor. Biol. 215:67-82 (2002). Extreme pathways for
a H. pylori metabolic network are described in Price et al., Genome Res.
12:760-769 (2002).
[0072] Extreme reaction pathways can also be determined de novo, using methods
known the art (Schilling et al. supra (2000); Schilling et al. supra (2001)).
Appropriate stoichiometric and thermodynamic constraints can be imposed on the
intrasystem and exchange reactions in the reaction network under steady-state
conditions. Constraints can also be imposed on the input and output of reactants
to and from the biosystem. Optionally, regulatory constraints can also be
imposed (Covert et al., J. Theor. Biol. 213:73-88 (2001); Covert and Palsson, J.
Biol. Chem. 277:28058-28064 (2002)). This results in a system of linear
equalities and inequalities that can be solved using convex analysis. The
solution space corresponds geometrically to a convex polyhedral cone in
high-dimensional space emanating from the origin, which is referred to as the
steady state "flux cone." Within this flux cone lie all of the possible
steady-state solutions, and hence all the allowable flux distributions of the
biosystem. The extreme pathways correspond to vectors that define the edges of
the flux cone.
[0073] In another embodiment, the systemic reaction pathway is one branch of a
regulatory tree. The regulated genes of a biosystem may be depicted as shown in
FIG. 2A with the regulated genes shown on the horizontal axis. In a Boolean
representation, each protein and each gene may be considered "on" or "off"
(active or inactive, respectively). The combination of the activity state of all
genes and proteins in a biosystem may be considered a "systemic regulatory
pathway" or a "systemic signaling pathway".
[0074] In another embodiment, the systemic reaction pathway is a set of
regulators and regulatory reactions influencing the activity of a regulated gene
or the set of genes regulated by a regulator or a group of regulators. These
sets may be identified by analyzing the connectivity of a regulatory network
represented as a graph and identifying nodes in the network connected to a
particular node (regulator or regulated gene). The smallest possible set of such
kind is one involving one regulatory reaction between a regulator and a target
gene.
[0075] As used herein, the term "phenomenological reaction pathway" refers to a
reaction pathway defined through analyzing experimental data to describe the
state of the biosystem in whole or part. The data types that can be used to
define phenomenological reaction pathways include but are not limited to
transcriptomic, proteomic, metabolomic, fluxomic, protein-protein interaction,
and DNA-binding site occupancy data. The data analysis methods used to define
the phenomenological pathways from the experimental data include but are not
limited to systems identification, statistical, algorithmic, or signal
processing techniques.
[0076] Phenomenological information about the reactions and reactants of a
biosystem can be determined by methods known in the art, and can be either
qualitative or quantitative. For example, phenomenological information can be
obtained by determining transcription of genes, expression or interactions of
proteins, production of metabolites or other reactants, or use of reactions in
the biosystem. By analogy to the term "genome," such information, when obtained
at the scale of substantially the whole organism or cell, is called,
respectively, the "transcriptome," "proteome," "metabolome" and "fluxome."
[0077] Methods of determining gene expression at the transcriptome scale (also
known as "transcriptomics") are known in the art and include, for example, DNA
microarray methods, which allow the simultaneous analysis of all transcripts
simultaneously (Shena et al., Science 270:467-470 (1995); DeRisi et al., Science
278:680-686 (1997)) and serial analysis of gene expression (SAGE) methods (Velculescu
et al., Trends Genet. 16:423-425 (2000)); Methods of determining protein
expression (also known as "proteomics") are also known in the art. Expression
proteomic methods generally involve separation of proteins, such as by
two-dimensional gel electrophoresis, followed by protein imaging using
radiolabels, dyes or stains. Separated proteins are then identified using
methods such as peptide mass fingerprinting by mass spectrometry and
peptide-sequence tag analysis by nano-electrospray (Blackstock et al., Trends
Biotechnol. 17:121-127 (1999)).
[0078] Method for determining interactions between biological molecules in the
cell at a large scale are also known in the art. Protein-protein interaction
information, which allows inferences as to a protein's function, can be
obtained, for example, using large-scale two-hybrid analysis to identify
pairwise protein interactions (Fromont-Racine et al., Nat. Genet. 16:277-282
(1997). Indirect protein-DNA interaction information can be obtained using
chromatin immunoprecipitation chip (ChIP-ChIP) methods, which allows the
genome-scale identification of genomic binding sites of DNA-binding proteins and
genomic targets of transcription factors (Iyer et al., Nature 409:533-538
(2001)).
[0079] Methods of determining the complement of metabolites in a cell (also
known as "metabolomics") are also known in the art and include, for example,
nuclear magnetic resonance (NMR) spectroscopy such as 13C-NMR; mass spectroscopy
such as gas chromatography/time-of-flight mass spectroscopy (GC/TOFMS); and
liquid chromatography (Fiehn, Plant Mol. Biol. 48:155-171 (2002); Phelps et al.,
Curr. Opin. Biotech. 13:20-24 (2002)).
[0080] Likewise, methods of measuring the fluxes through reaction pathways (also
known as "fluxomics") are known in the art, such as metabolic flux ratio
analysis (METAFoR) (Sauer et al., J. Bacteriol. 181:6679-6688 (1999)). METAFoR
quantifies the relative abundance of intact carbon bonds in biomass constituents
that originate from uniformly isotopically labeled precursor molecules, which
reflects the metabolic pathways used.
[0081] By repeatedly varying the physiological state of the biosystem, either
experimentally or in silico, a series of phenomenological measurements at
different states can be obtained or predicted. These data can be organized in
vectorial form and represented in matrix or tabular formats. For example, a set
of gene array expression data can be organized as a matrix where each row is a
gene, each column is an experiment, and each value is an expression level or
ratio. As another example, a set of fluxome data can be organized as a matrix
where each row is a reaction, each column is an experiment and each value is a
flux level or ratio. As a further example, a set of phenotypic data can be
organized as a matrix where each row is an experiment, each column is an
environmental component (such as nutrients, waste products, or biomass) and each
value is a rate of uptake, secretion, or growth.
[0082] The phenomenological information can be analyzed by various methods known
in the art, such as methods of system identification, statistical data analysis,
combinatorial algorithms, or signal processing to determine a set of
phenomenological reaction pathways.
[0083] Methods of system identification are known in the art and include, for
example, various types of clustering analysis methods (reviewed in Sherlock et
al., Curr. Opin. Immunol. 12:201-205 (2000)). Clustering methods can be applied
to experimental data in matrix or tabular formats to extract groups of genes
that are co-expressed. These groups that can either be disjoint or overlapping
can be used as definitions of phenomenological pathways. Alternatively, a data
vector within each cluster can be chosen to be a representative phenomenological
pathway for that cluster--this vector could for example be the mean value of the
data points within the cluster also known as the centroid of the cluster.
[0084] Clustering analysis methods include, for example, hierarchical clustering
analysis (Eisen et al., Proc. Natl. Acad. Sci. USA 95:14863-14868 (1998); Wen et
al., Proc. Natl. Acad. Sci. USA 95:334-339 (1998)), whereby single reactant
profiles are successively joined to form nodes, which are then joined further.
The process continues until all individual profiles and nodes have been joined
to form a single hierarchical tree. Clustering analysis methods also include
divisive clustering analysis (Alon et al., Proc. Natl. Acad. Sci. USA
96:6745-6750 (1999)), in which two vectors are initialized randomly, and each
reactant is assigned to one of the two vectors using a probability function. The
vectors are iteratively recalculated to form the centroids of the two clusters,
and each cluster is successively split in the same manner until each cluster
consists of a single profile. Clustering analysis methods also include methods
in which the data is partitioned into reasonably homogeneous groups. Clustering
methods that incorporate partitioning include, for example, self-organizing maps
(Kohenen, "Self Organizing Maps," Berlin: Springer (1995); Tamayo et al., Proc.
Natl. Acad. Sci. USA 96:2907-2912 (1999)) and k-means clustering (Everitt,
"Cluster Analysis 122," London: Heinemann (1974)).
[0085] Another method of system identification is principal component analysis
of the data, which is closely related to the singular value decomposition (SVD)
of the data matrix (Holter et al., Proc. Natl. Acad. Sci. USA 97:8409-9414
(2000); Alter et al., Proc. Natl. Acad. Sci. USA 97:10101-10106 (2000); Holter
et al., Proc. Natl. Acad. Sci. USA 98:1693-1698 (2001)). Principal component
analysis is a statistical technique for determining the key variables in a
multidimensional data set that explain the differences in the observations, and
can be used to simplify the analysis and visualization of multidimensional data
sets. SVD is a linear transformation of data, such as gene expression data, from
genes x arrays space to reduced diagonalized "eigengenes" x "eigenarrays" space,
where the eigengenes (or eigenarrays) are unique orthonormal superpositions of
the genes (or arrays). After normalization and sorting of the data, the
individual genes and arrays become grouped according to similar regulation and
function, or similar physiological state, respectively. Principal component and
SVD analysis output a set of vectors in the data space (e.g. n dimensional where
n is the number of genes) ordered by how much of the variability in the data
each vector each principal component or mode captures. These vectors can each be
interpreted as phenomenological pathways describing the major modes of usage of
the gene/protein complement of the organism under specific conditions that the
experiments analyzed represent.
[0086] Software for various types of large-scale data analysis, including
hierarchical clustering, self-organizing maps, K-means clustering and principal
component analysis, is known in the art or can be developed for a particular
application. Exemplary analysis software includes "XCluster" (see genome-www.stanford.edu/.about.sherlock/cluster.html
on the World Wide Web), "Cluster" software (see rana.lbl.gov/EisenSoftware.htm
on the World Wide Web) and "Genesis" software (see genome.tugraz.at/Software/Gen-
esis/Description.html on the World Wide Web).
[0087] The skilled person can determine which method, or which combination of
methods, is suitable to analyze phenomenological information to determine a set
of phenomenological reaction pathways.
[0088] As used herein, the term "operational reaction pathway" refers to a
systemic reaction pathway of a biosystem that is feasible taking into account
the reactants present in, or fluxes through, the biosystem. Operational reaction
pathways thus constitute a subset of systemic reaction pathways that are likely
to actually exhibit flux in the biosystem. The subset of systemic pathways that
are consistent with phenomenological information about the biosystem are
determined to identify operational reaction pathways consistent with the
reactants present or reaction fluxes through the biosystem.
[0089] Once a set of systemic reaction pathways and a set of phenomenological
reaction pathways have been provided, the two sets are compared, and common
pathways identified. As described above, the two sets of pathways can be
represented in vectorial form, or in the form of groups of genes participating
in the pathways, or in other convenient ways. There are a number of mathematical
methods known in the art by which two vectors or two groupings can be compared.
[0090] For example, the two sets of vectors can be compared using a number of
measures for pairwise similarity between vectors including: (1) Euclidean
distance, which corresponds to the squared distance between two points in space,
or in this case tow vectors, taking into account both the direction and the
magnitude of the vectors (Hubbard J. H. and Hubbard B. B. Vector Calculus,
Linear Algebra, and Differential Forms, Prentice-Hall (1999)); (2) Pearson
correlation coefficient, which measures the angle between two vectors whose
length is normalized to one, and is thus independent of the length of the
vectors (Larsen R. J. and Marx M. L. An Introduction to Mathematical Statistics
and Applications, Prentice Hall, New Jersey (1986)); or (3) Jackknife
correlation coefficient, which is similar to Pearson correlation coefficient,
but is corrected for the effect of single outliers components of the vectors to
provide a more robust measure (Heyer et al., Genome Res. 9:1106-1115 (1999)).
Other methods for comparing vectors are known in the art.
[0091] Similarly, methods for comparing groupings of genes based on systemic and
phenomenological definitions include: (1) the Rand index, which measures the
overlap between two different groupings of the same set of genes (Yeung K. Y et
al. Bioinformatics 17:177 (2001)); and (2) correspondence analysis, which
provides a two-dimensional graphical representation of the agreement between two
groupings such that the systemic and phenomenological pathways that are most
similar to each other are shown to be located closest to each other (Johnson R.
A. and Wichern D. W., Applied Multivariate Statistical Analysis, 5th Ed.,
Prentice Hall, New Jersey (2002)).
[0092] The skilled person can determine which method, or which combination of
methods, is suitable for comparing systemic reaction pathways and
phenomenological reaction pathways to identify operational reaction pathways.
[0093] The invention also provides a method determining the effect of a genetic
polymorphism on whole cell function. The method consists of: (a) generating a
reaction network representing a biosystem with a genetic polymorphism-mediated
pathology; (b) applying a biochemical or physiological condition stressing a
physiological state of the reaction network, and (c) determining a sensitivity
to the applied biochemical or physiological condition in the stressed
physiological state compared to a reaction network representing a normal
biosystem, wherein the sensitivity is indicative of a phenotypic consequence of
the genetic polymorphism-mediated pathology. The biochemical or physiological
condition can be, for example, a change in flux load, pH, reactants, or products
as well as system or subsystem changes such as those in oxidative or energy
load.
[0094] Briefly, the above methods for analyzing physiological states of a
biosystem, comparing them to systemic reaction pathways and determining one or
more operational reaction pathways can similarly be employed to determine the
effect of genetic polymorphisms on a biosystem or subcomponent thereof. For
example, phenomenological information used for comparison with systemic
reactions can be obtained from either actual or simulated genetic mutations of
enzymes or other polypeptides. Changes in activity of the enzyme or polypeptide
due to the mutation can be obtained from sources describing the defect or
estimated based on available information or predictive computations using a
variety of methods well known in the art. The activities that can be assessed
include, for example, catalytic function of an enzyme or binding activity of a
polypeptide such as a transcription regulator.
[0095] In silico models constituting a reaction network of a genetic
polymorphism can be constructed as described previously and the effect of the
polymorphism can be assessed in context of the biosystem as a whole. Conditions
that the reaction network are subjected to can be varied and the effect of
single or multiple, combined polymorphisms can be determined on whole biosystem
function or as the polymorphism relates to subsystems thereof. For example,
systemic pathways or operational pathways can be calculated in the presence or
absence of the genetic polymorphism. Comparison of systemic pathways,
operational pathways or a phenotypic manifestation between the two reaction
networks can be performed to determine the differences, if any, between the
native reaction network and the polymorphic counterpart. Such differences can
include, for example, creation of a new systemic or operational pathway,
omission of such a pathway and changes in the rate or magnitude of such a
pathway. The result of such changes between the normal and polymorphic states
also will reveal the consequential impact on biochemical or physiological
function or on phenotypic expression of the genetic polymorphism.
[0096] Conditions that can be varied include, for example, any biochemical or
physiological component of the system. Such conditions can be either external to
the biosystem including, for example, external environmental growth conditions
such as temperature, pH, carbon source and other input/output reactions which
allow components to enter or exit the biosystem. Alternatively, such biochemical
or physiological conditions can be internal to the biosystem. Specific examples
of internal conditions include, for example, exchange reactions indicative of
sources and sinks allowing passage of reactants across a system or subsystem
boundary, intra-system reactions that replenish or drain reactants, and demand
reactions which represent categories of components produced by the cell.
Biochemical or physiological conditions internal to the biosystem also can
include changes in pH, utilization of carbon sources, availability of
metabolites, cofactors, substrates and products. Other changed internal
conditions can include, for example, alterations in system loads such as
oxidative or energy load on its corresponding subsystem. Various other
biochemical or physiological conditions well known to those skilled in the art
can similarly be varied in the methods of the invention to obtain comparative
reaction network simulations for determining the effect of a genetic
polymorphism on biosystem function.
[0097] Altering or changing a condition for each biosystem will generally be
sufficient for a comparison between a native and a counterpart polymorhic
biosystem. However, the effect can be enhanced when the biochemical or
physiological condition is applied to the native and polymorphic biosystem at a
magnitude sufficient to stress the biosystem or a correlative subsystem thereof.
For example, where the activity of a polymorphic enzyme is altered only slightly
compared to its native counterpart, the difference in activity may not
substantially affect cellular function within an activity range tested. In part,
an insignificant impact on cellular function can be due to the production of
sufficient product to perform normal cellular activity regardless of an activity
deficiency. However, where the activity of the polymorphic enzyme is tested
under stressed conditions, it can be unable to fulfill the added cellular demand
due to the additional work required of the system. Accordingly, under stressed
conditions, a comparison of the native reaction network functioning and that of
the polymorphic reaction network will more readily reveal those activity effects
of the polymorphic enzyme due to failure of product production under excess
requirements.
[0098] The term "stress" or "stressing" as used in reference to applying a
biochemical or physiological condition is intended to mean placing a biosystem,
reaction network or subsystem thereof under a state of strain or influence of
extra effort. The stress can be mild or intense so long as it applies demands,
loads or effort on the components extra to that under the normal or nominal
state of the biosystem, reaction network or subsystem thereof. Therefore,
stressing a system state is intended to include imposing a condition that causes
the system to exert additional effort toward achieving a goal. Specific examples
of applying a biochemical or physiological condition to a biosystem that
stresses a physiological state is described further below in Example III.
[0099] Genetic polymorphisms can constitute, for example, single nucleotide
polymorphisms (SNPs) and well as multiple changes within a encoding gene
resulting in a polymorphic region within the gene or its polypeptide coding
region. Polymorphisms in gene or coding region structure can alter the
expression levels of the harboring nucleic acid, activity of the encoded
polypeptide or both. Polymorphisms well known to those skilled in the art of
genetics and genomics include, for example, allelic variants of genes, SNPs and
polymorphic regions of a referenced nucleic acid. Specific examples, of genetic
polymorphisms include those variations in coding sequence described in Example
III for glucose-6-phosphate dehydrogenase (G6PD) and pyruvate kinase (PK).
Numerous other genetic polymorphisms and their associated diseases are similarly
well known to those skilled in the art.
[0100] Given the teachings and guidance provided herein, the methods of the
invention for determining the effect of a genetic polymorphism on cellular
function can be used with any known or subsequently determined genetic
polymorphism. Similarly, the linkage between the genetic defect and the
pathology mediated also can be previously known or subsequently determined.
Moreover, and as described further below, it can be used to diagnose previously
undetermined genetic polymorphisms that alter an activity of an enzyme or
polypeptide. However, by determining the effect of the defect in the context of
a whole biosystem, a more accurate phenotype and assessment of the functional
abilities of the biosystem can be obtained. Accurate determination of phenotypic
and functional attributes of such complicated systems can be advantageously
applied for a more meaningful treatment of the genetic polymorphism-mediated
disease.
[0101] Sensitivities of the polymorphic enzyme to the stressed condition can be
more or less pronounced depending on which polymorphisim is incorporated into
the reaction system, the degree of polypeptide activity change due to the
polymorphism and the level of stress that is exerted on the system. Those
skilled in the art will know or can determine, given the teachings and guidance
provided herein, what sensitivities are indicative of a particular polymorphic
enzyme or other polypeptide. For example, glucose-6-phosphate dehydrogenase
(G6PD) functions in the oxidative branch of the pentose pathway and is sensitive
to changes in maximum velocity (V.sub.max) and cofactor binding affinity (K.sub.i-NADPH).
Enzymes with alterations in these activities result in changed in oxidative
requirements which can be used as indicators of the metabolic state for G6PD's
having altered activity. For example, one sensitive indicator of the metabolic
state of the biosystem is the NADPH/NADP ratio. This ratio can be measured under
stressed conditions and compared between the polymorphic reaction network with
that of the normal network to determine the phenotypic and functional changes on
the biosystem. As described further below in Example III, polymorphic enzymes
having alterations in these G6PD activities can be distinguished in the methods
of the invention as those mediating non-chronic and chronic hemolytic anemia.
[0102] Similarly, pyruvate kinase (PK) functions in glycolysis and is sensitive
to changes in V.sub.max and the affinity for substrates such as
phosphoenolpyruvate (K.sub.PEP). Alterations in these activities result in
changes in ATP concentration, and 2,3 DPG concentration. Sensitive indicators of
V.sub.max and K.sub.PEP can include, for example, the concentration of ATP when
the biosystem is under maximum energy loads or stress compared to normal
conditions. As with G6PD, polymorphic PK enzymes having alterations in these
activities show that anemic patients have a diminished ability to deviate from
the normal homeostatic state.
[0103] For determining the effect on function, a reaction network specifying the
activity of the polymorphic enzyme is constructed and the system is stressed as
described above. Sensitivity to the stressed condition compared to that of the
normal or native reaction network can then be determined using a variety of
indicators. Those described above for G6PD and PK are exemplary indicators for
enzyme activity. Those skilled in the art will understand, given the teachings
and guidance provided herein that other indicators of biochemical or
physiological activity of the particular enzyme or polypeptide being assessed
can be used in the methods of the invention. For example, essentially any
measure of substrate, product, cofactor, or other metabolite can be used as an
indicator of polypeptide activity. Such indicators can be assessed directly or
indirectly such as by measuring the products of downstream reactions and the
like. Moreover, ratios of such indicators or of general indicators of a
particular biochemical or physiological state can similarly be used. For
example, ATP, and energy cofactors such as NADPH and NADP are general indicators
of the oxidative state and energy charge, respectively, of a biosystem.
[0104] Changes in activity under stressed conditions of such biochemical or
physiological indicators will identify the change in function of the biosystem
due to the altered activity as well as show the phenotypic consequences of the
polymorphic enzyme. For example, the inability of a biosystem to respond to
excess oxidative or energy requirements can show, for example, that the
polymorphic enzyme is unable to adequately produce components within its
assigned subsystem to handle the increased work requirements caused by the
stress. A functional biosystem change can correspond to, for example, altered
demands and products that are produced as well as changes in flux or pathways
which compensate the deficient enzyme activity. A phenotypic outcome can be, for
example, inhibition of biosystem proliferation, decrease in biosystem mass or
even biosystem lysis and death.
[0105] The methods of the invention also can be used for diagnosis of a genetic
polymorphism-mediated pathology. The methods described above can be used to
generate a biosystem reaction network representing activities of suspected
genetic polymorphism. The biosystem reaction network can be stressed as
described above and the reaction network containing the suspected polymorphic
enzyme activity compared to that of a normal reaction network. A change in
function or phenotype of the suspected polymorphic network compared to the
normal will indicate that the genetic alteration is linked to the enzyme
deficiency. Those skilled in the art will understand that a plurality of
suspected enzyme defects can be identified and linked to a particular disease
given the teachings and guidance provided herein. For example, those skilled in
the art can use activity measurements from a suspected patient in the creation
of a plurality of reaction networks. Comparison of the function or phenotype of
the networks harboring suspect activities with normal networks will identify the
differences in function or phenotype and whether any of such identified
differences are sufficient to result in a pathological condition.
[0106] Therefore, the invention provides a method of diagnosing a genetic
polymorphism-mediated pathology. The method consists of: (a) applying a
biochemical or physiological condition stressing a physiological state of a
reaction network representing a biosystem with a genetic polymorphism-mediated
pathology, the applied biochemical or physiological condition correlating with
the genetic polymorphism-mediated pathology, and (b) measuring one or more
biochemical or physiological indicators of the pathology within the reaction
network, wherein a change in the one or more biochemical or physiological
indicators in the stressed state compared to an unstressed physiological state
indicates the presence of a genetic polymorphism corresponding to the pathology.
[0107] The invention further provides a method of reconciling biosystem data
sets. The method consists of: (a) providing a first regulatory network
reconstructed from legacy data comprising a plurality of hierarchical regulatory
events; (b) providing a second regulatory network obtained from empirical data,
and (c) determining a consistency measure between the hierarchical regulatory
events in the first regulatory network and elements in the second regulatory
network, wherein a high degree of the consistency measure for the hierarchical
regulatory events indicates the validity of the first regulatory network or a
subcomponent thereof.
[0108] The method of the invention for reconciling data sets is useful for
determining the accuracy of a biosystem model as well as for identifying new
components, linkages, networks and subnetwork of a biosystem model. The model
can be based on scientifically proven data, mathematical interpretations as well
as on pure computational analysis or even theoretical prediction. Regardless of
the source of a biosystem model, the method for reconciling data sets compares
the model or a data set representation thereof to another source of data to
identify the consistency between one model or data set and that of the
comparison model or data set. The degree of consistency between the two models
or data sets thereof will show how accurate the initial model is to its
corresponding natural biosystem.
[0109] Data sets representing whole biosystems can be reconciled using the
methods of the invention as well as any substructure thereof. Substructures can
consist of subnetworks or modules of the biosystem reaction network. While the
exact boundaries of subnetworks and boundaries can vary depending on the
assessment criteria used, one feature is that such substructures can be
evaluated, analyzed or identified as a unit in itself. Criteria for boundary
determination can include, for example, functional attributes, structural
attributes and graphical or mathematical separateness, for example. Specific
examples of subnetworks or modules of a biosystem have been described above and
below and are further shown in FIG. 16 and its associated Example IV. Other
examples are well known to those skilled in the art and can be employed in the
methods of the invention given the teachings provided herein.
[0110] Data sets applicable for comparison can include a broad range of
different types and sizes. For example, the data sets can contain a large and
complex number of diverse data elements or components of the reaction network.
Alternatively, the data sets can be small and relatively simple such as when
comparing subnetworks or modules of the reaction network. Those skilled in the
art will appreciate that the more inclusive each data set for comparison is with
respect to its system components, the more accurate and reliable will be the
consistency measure. However, those skilled in the art will know, or can
determine, a reliable means to compensate for inherent differences based on the
character of one or both of the initial data sets. Therefore, the method of the
invention can be used for reconciling data sets where the pair of data sets for
comparison can be either large or small, or diverse or simple, as well as for
comparison where the data sets within the pair are either large or small, or
diverse or simple with respect to each other.
[0111] As used herein, the term "legacy" or "legacy data" is intended to refer
to known information or data such as that obtainable from literature, other
reports, computational data, databases or a combination thereof. The information
can be obtained from the public domain or previously known by the user's own
investigations. The term therefore is intended to include secondary data that
has received the benefit of scientific evaluation and considerations toward the
system to which it pertains, the scientific authenticity or the theory which it
promotes. Legacy data in essentially any obtainable form can be used in the
methods of the invention and can include, for example, literary, graphical,
electronic, mathematical or computational forms as well as functional
equivalents and transformations thereof. Given the teachings and guidance
provided herein, those skilled in the art will known how to use a particular
format either directly or following transformation into a useful format for
representing a reaction network of the invention. A variety of such useful
formats have been described above and below and others are well known to those
skilled in the art.
[0112] As used herein, the term "empirical" or "empirical data" refers to data
based on primary factual information, observation, or direct sense experience.
Empirical data is therefore intended to refer to raw data or primary data that
has not received the benefit of scientific evaluation and considerations toward
the system to which it pertains, the scientific authenticity or the theory which
it promotes. The term is intended to include, for example, data, data sets or
equivalent transformational forms thereof corresponding to gene expression data,
protein activity data and the like. It can include, for example, large, high
throughput datasets such as that obtainable by genomic, proteomic,
transcriptomic, metabolic and fluxomic data acquisition as well as small data
sets obtainable by a variety of research methods well known to those skilled in
the art. Other forms of primary data well known to those skilled in the art can
similarly be employed in the methods of the invention.
[0113] Useful attributes of reconciling data sets include, for example, both
validation of known reaction network and subnetwork models as well as the
identification or discovery of new subnetworks or modules thereof. Validation of
an existing model is useful in itself because it authenticates previous
scientific theories as well as subsequent discoveries based on the original
model. Similarly, invalidation of a network model can be useful, for example,
because it informs the user that components, links or scientific premises may be
omitted from the network model as a whole. Moreover, reconciliation of data sets
can identify subnetworks or modules of the biosystem reaction network model by
showing differential validation of a particular subsystem or of several
subsystems within the whole. For example, discovery of new subnetworks or
identification of valid subnetworks within the whole can occur when some, but
not all, modules within the biosystem network are reconciled. Identifications
are particularly striking where the subnetwork or module thereof constitute
relatively independent entities within the biosystem reaction network or are
relatively decoupled from the body of the biosystem network. Finally,
information gained from reconciliation of data sets and validation of whole
networks, subnetworks or modules thereof can be used to refine the network or
subnetworks by altering the model determining whether the altered model
reconciles with the comparative data set.
[0114] Validation and discovery methods of the invention are applicable to
essentially any form or format of the reaction network. For example, data sets
can be reconciled where a reaction network is represented by an in silico model,
a mathematical representation thereof, a statistical representation,
computational representation, graphical representation or any of a variety of
other formats well known to those skilled in the art.
[0115] Reconciliation of data sets allows for the validation of essentially any
causal relationships within the compared biosystem networks. For example, the
method for reconciliation of data sets can be employed on data sets specifying
all types of reaction networks described herein. Therefore, the method is
applicable to reaction networks corresponding to a metabolic reaction network, a
regulatory reaction network, a transcriptional reaction network or a
genome-scale reaction network, or any combination thereof. To perform the method
of reconciliation, a first reaction network can be provided that is
reconstructed from legacy data. As described previously, the legacy data can be
obtained from a secondary source that has assembled primary data into a working
model of the biosystem network components. The first reaction network is
compared with a second reaction network obtained from empirical data. The
empirical data can consist of, for example, any primary data representing an
activity or other attribute of the components within the biosystem.
[0116] A comparison of data sets can be accomplished by, for example, any method
known to those skilled in the art that provides a measure of consistency between
the network representation and the empirical data. In one embodiment a
consistency measure is determined between the empirical data and the legacy
data, or the legacy-derived network model by, for example, grouping the network
components into hierarchical organization of reaction categories. The reaction
categories are useful for determining consistency measurements between the data
sets to be reconciled. The reaction categories can include, for example,
reactants and products, reaction fluxes, metabolic reactions, regulatory
reactions and regulatory events. Moreover, the reaction categories can be
arbitrary, or based on, for example, functional criteria, statistical criteria,
or molecular associations so long as the categories provide an acceptable
framework for obtaining a consistency measure between the legacy-derived network
and the empirical data set.
[0117] Exemplary reaction categories for the specific embodiment of a regulatory
reaction network are described further below in Example IV. Briefly, elements of
a regulatory network can be separated into, for example, three categories based
on functional interactions. These categories include, for example, pair-wise
regulatory interactions, target-regulator units and regulons. Given the
teachings and guidance provided herein, categories other than these for
regulatory networks as well as categories for other types of reaction networks
can be identified or generated by those skilled in the art. For example, other
types of categories can include anabolic or catabolic reactions or cell
signaling functions. The particular type of category will depend on the type of
reaction network to be reconciled and the measure of consistency selected to be
used in the method of the invention.
[0118] Consistency of the data sets to be reconciled can be determined by a
variety of methods well known to those skilled in the art. Such methods can be
employed to generate a value for each of category or element within a network
that can be analyzed for significance. For example, in the above exemplary
reaction categories, consistency measurements for pair-wise interactions can be
obtained, for example, by Pearson correlation coefficients whereas consistency
measurements for target-regulator units can be determined by, for example,
multiple correlation coefficients. Further, consistency measurements for
regulons can be determined by, for example, the average within regulon
correlation. Other methods well known in the art also can be employed and
include, for example, mutual information-based measures (Cover T M & Thomas J.
A., Elements of Information Theory, Wiley (1991)), or nonlinear regression
methods (Hastie T, Thibshirani R & Friedman J., The Elements of Statistical
Learning, Springer (2001)). The mutual information measures require
discretization of the original data, but allow incorporating nonlinear
dependencies that are not accounted for by Pearson or multiple correlation
coefficients. Similarly non-linear correlation measures can be used as
consistency metrics, but their added flexibility compared to linear correlation
may result in overestimating the consistency between empirical data and a
proposed network structure. The statistical significance of particular values of
a consistency measure can be determined to assess whether the legacy data and
empirical data constitute a good fit. A high degree of consistency measure, such
as those that are statistically significant, indicate that the two networks,
subnetworks or subcomonents reconcile. Further, those data sets that reconcile
either as to the whole network or a subnetwork thereof indicate a validation of
the legacy model whereas those that are inconsistent indicate a divergence
between the legacy-derived model and the empirical data.
[0119] The invention further provides a method of refining a biosystem reaction
network. The method consists of: (a) providing a mathematical representation of
a biosystem; (b) determining differences between observed behavior of a
biosystem and in silico behavior of the mathematical representation of the
biosystem under similar conditions; (c) modifying a structure of the
mathematical representation of the biosystem; (d) determining differences
between the observed behavior of the biosystem and in silico behavior of the
modified mathematical representation of the biosystem under similar conditions,
and (e) repeating steps (d) and (e) until behavioral differences are minimized,
wherein satisfaction of a predetermined accuracy criteria indicates an
improvement in the biosystem reaction network.
[0120] The method can further include the steps of: (f) determining a behavior
of the biosystem under different conditions, and (g) repeating steps (b) through
(e) of the method for refining a biosystem reaction network under the different
conditions. The method for refining a biosystem reaction network can
additionally include repeating steps (f) and (g) until the minimized behavioral
differences are exhausted, wherein the improved biosystem reaction network
representing an optimal biosystem reaction network.
[0121] The methods of the invention can also be applied in a general process by
which mathematical representations of biosystems can be improved in an iterative
fashion using algorithmic approaches and targeted experimentation. Many
biological systems are incompletely characterized and additional experimentation
can be required to reconstruct a reaction network of these systems. For such a
process to converge quickly on an optimal model, an iterative experimentation
can be systematized. FIG. 2B exemplifies such a procedure, which is further
described in Example V.
[0122] The model building process can begin with a statement of model scope and
accuracy. Alternatively, the model building process can proceed in the absence
of such a predetermined assessment of scope or accuracy but terminated once a
desired scope or accuracy is ultimately obtained.
[0123] The purpose for building the model leads to specification of expected
accuracy and the scope of capabilities that the model is to have. The scope of a
model can range from, for example, describing a single pathway to a genome-scale
description of a wild type strain of an organism. An even broader scope would be
to include sequence variations and thus insist that a model describes all the
variants of the wild type strain.
[0124] The accuracy can be based on, for example, qualitative or quantitative
criteria. A useful model can be qualitative and be able to make statements that
predict, for example, that the growth rate of an organism is reduced when a
particular gene product is inhibited under a particular growth condition. A
quantitative model can insist, within measurement error, on predicting the
percent reduction in growth rate of inhibition of all the gene products under
one or more growth conditions. The extent of the iterative model-building
process is therefore dictated and predetermined by the user who can specify a
required scope and accuracy of the model to be generated.
[0125] A reconstructed biochemical reaction network can be envisioned as a model
of an experimental system. In this regard, it is a duplicate of an actual
organism that is capable of flexible manipulation and study under any conditions
that is desirable to subject the actual organism to. One advantage of a
reconstructed biosystem reaction network, or an in silico version thereof, is
that it is capable of generating an immense amount of information that
characterizes the function and phenotype of the organism. The accuracy of the in
silico model can also be determined by, for example, using the methods described
above for reconciliation and determining the consistency of the reconstructed
network with that of empirical data obtained from the actual organism. The
availability of both an actual organism and a reconstructed model of the
organism that is readily manipulable can be used synergistically to harness the
power of in silico models for reliable and accurate predictions of organism
behavior and function.
[0126] An approach to reconstructing an in silico model of a biosystem is
through iterative refinement of a biochemical reaction network. The refinement
of a model can be accomplished by assessing a particular function of the actual
organism and incorporating into the model new information gained from that
particular study. Because the model is an duplicate of the organism, deviations
in performance from the model compared to the actual organism when performed
under similar conditions will yield data indicating that additions, omissions or
revisions to the in silico that can account for the deviations. By successive
iterations of studies duplicating conditions that the actual and in silico
organisms are subjected to, altering the model structure to correct and be
consistent with the empirical data obtained from the actual organism and
repeating the condition or subjecting the pair to different conditions, the
accuracy of the model to predict function and phenotype of the actual organism
will successively increase.
[0127] Briefly, studies can be performed with the actual organism under defined
conditions prescribed by an experiment design algorithm. Similarly, the in
silico model that describes the actual organism can be used to simulate the
behavior of the actual organism under the same conditions. Based on the
available data at any given time, if the model fails to meet the desired scope
or accuracy requirements, further studies can be performed to improve the model.
These studies can be designed using, for example, a systematic procedure to
step-wise or incrementally probe network function. One approach to probe network
function can be, for example, to incrementally move from a robust or validated
subsystem of the network to less validated parts. Another approach can be, for
example, to target different types functions or different types of methods for
probing function. Particular examples of such targeted methods of study include,
for example, genomic knock-outs, expression profiling, protein-protein
interactions, and the like. Therefore, the content and capabilities of the in
silico model are subject to iterative updates.
[0128] The decision on what experiments to perform can be determined, for
example, based on the nature of the deviation and the requirements in an
accuracy specification. Deviations can include a gene expression array that is
not predicted correctly by the model, a set of calculated flux values which does
not match the experimentally-determined fluxome under given conditions, or a set
of phenotypes, for example, growth, secretion and/or uptake rates, which shows
discrepancy from model predictions. Experiments which could be performed to
resolve such discrepancies include perturbation analysis wherein one or more
genes thought to be responsible for the discrepancy are knocked out, upon which
the resulting organism is characterized using transcriptomics, fluxomics and the
like, or environmental analysis wherein one or more component of the
extracellular environment thought to contribute to model deviations is removed
and the system is re-characterized.
[0129] Algorithms can be devised that design such experiments automatically. An
algorithm which can be used in the case of gene expression can be, for example
(1) determine the gene(s) which exhibit a discrepancy from the predictions of
the model, (2) use the regulatory network model to identify the regulatory
protein(s) which control the gene(s) in step (1), (3) knockout one or more genes
in the organism which encode one or more regulatory proteins (4) perform the
same transcriptome experiment under the same environmental conditions but with
the new knockout strain. A second such algorithm which could be used in the case
of a high-throughput phenotype study with a reconstructed metabolic network
could be (1) determine the phenotype(s) which exhibit discrepancy (e.g., growth
rates do not correlate), (2) systematically add all biochemical reactions, one
or more at a time, until the model prediction matches the observed phenotype(s),
(3) identify gene locus/loci with significant sequence similarity to identified
enzymes which catalyze the reaction(s) in step (2), (4) clone and characterize
the gene in step (3) to verify whether it can catalyze the predicted reaction(s).
The inputs algorithm are several, including the present model, the data that it
has been tested against, the magnitude and nature of deviations, and so forth.
The output from the algorithm can be component experiments of whole organism
experiments.
[0130] An algorithm can identify, for example, missing components in the model
and request that specific biochemical, protein-DNA binding, protein-protein
interaction, or enzyme kinetic activity experiments be performed. As described
above, the missing components in the two above examples would be regulatory
interactions and identified enzymes. If these studies reveal missing components
of the model appropriate model updates are performed.
[0131] An algorithm can be facilitated by, for example, the inclusion of
additional data from whole cell behavior. It may request that growth,
transcription profiling, metabolic profiling, DNA-transcription factor binding
state, or proteomic experiments be performed under one or more environmental
conditions in order to obtain sufficient information to allow model updating.
[0132] Given a set of inputs such as gene deletions or environmental inputs, the
response of the biochemical reaction network can be examined both actually and
computationally. The actual system will yield an observed response characterized
through phenomenological pathways of the system, while the model of the actual
system will predict a response characterized by the systemic pathways of the
system. The observed and computed responses can be compared to identify
operational pathways as described previously. The difference in the measured and
computed cellular functions under the defined conditions where the experiment is
performed can be characterized, for example, as an "error". This difference
corresponds to those systemic pathways that are not operational. The error can
then be used to update the model.
[0133] Model update also can be accomplished by, for example, using an algorithm
for updating parameters in the model so that the model error is minimized. As
identified in Example VI, an algorithm for characterization of a regulatory
network can be, for example, (1) obtain the activity of each protein as
predicted by the model, (2) for each protein, generate a rule based on the
activity of the given protein which results in the correct expression value for
T5a, (3) recalculate the overall expression array for the regulated genes, (4)
evaluate the difference between the criterion for model accuracy by determining
the new model error, and (5) choose the model(s) with the lowest error as the
new model for future iterations. Following optimal model updates are
implemented, the remaining "error" between corrected model predictions and
actual responses can be used to design new studies to further probe the system.
The process can be repeated, for example, one or more times to further update
the model based on these new studies and until a desired scope or accuracy is
obtained.
[0134] Model updates that can minimize error on a round of the iterative
reconstruction process can be non-unique or very similar to each other in
generating optimal model updates. To preserve the availability of such data and
increase the efficiency of subsequent rounds, alternative model updates can be
stored, for example, so that they capable of being retrieved and available for
subsequent use on further rounds of iterative model building. Additionally, a
collection of experimental outcomes can be stored as a historic record of the
behavioral data or phenotypic data that has been obtained on a particular
organism. Model updates and design algorithms can be optionally capable of
querying this database during execution. Various other records and system data
can be alternatively stored for later efficient utilization in one or more steps
of the iterative process. Such computational approaches are well known in the
art and can be routinely implemented given the teachings and guidance provided
herein.
[0135] Further, combinations and permutations of the various methods of the
invention can be combined in any desired fashion to facilitate the model
building process or to augment a purpose or implementation of the method.
Additionally, single or other "off-line" studies can be performed and the
information generated used in any of the methods of the invention to facilitate,
augment or optimize results or implementation. For example, in addition to
studies designed for the iterative process, in some cases specific pair-wise
interactions among molecules can be probed in separate off-line studies to
further characterize individual molecular components.
[0136] Advantageous properties of the iterative model-building procedure include
convergence of system components into an operative and optimal representation of
the actual organism and efficiency of constructing such a model. Efficiency in
convergence is important since it will minimize the number of studies that need
to be performed.
[0137] It is understood that modifications which do not substantially affect the
activity of the various embodiments of this invention are also included within
the definition of the invention provided herein. Accordingly, the following
examples are intended to illustrate but not limit the present invention.
EXAMPLE I
Decomposing a Set of Phenomenological Flux Distributions for the E. coli Core
Metabolic Network in Order to Identify Operational Extreme Pathways
[0138] This example shows how a set of phenomenological pathways (flux
distributions) can be decomposed into dominant modes these modes can be compared
with a set of systemic pathways (extreme pathways) to identify operational
reaction pathways of a metabolic reaction network (E. coli core metabolism).
[0139] An in silico-generated metabolic flux profile of core metabolism in E.
coli was prepared. The reactions were taken from table 6.3 of Schilling, "On
Systems Biology and the Pathway Analysis of Metabolic Networks," Department of
Bioengineering, University of California, San Diego: La Jolla. p. 198-241
(2000), with the exception that reaction pntAB was not included, and instead of
T3P2 in reaction tktA2, T3P1 was used. The reaction list is tabulated in Table
1.
[0140] The flux profile, which is the input matrix for Singular Value
Decomposition (SVD) analysis, consists of 57 fluxes (rows) and 7 conditions in
each phase (columns). The phase plane for succinate for this system is presented
in FIG. 3; generation of Phase Planes is described in (Edwards J S, Ramakrishna
R, Palsson B O. Characterizing the metabolic phenotype: a phenotype phase plane
analysis. Biotechnol Bioeng. Jan. 5, 2002; 77(1):27-36). The points on FIG. 3
were chosen to define the upper limit of oxygen and succinate available to the
system. Each point, therefore, represents a different condition (or column of
the flux matrix) in constructing the flux profile.
[0141] SVD analysis was performed on each phase (each of the 7 conditions)
separately. The decomposition of the flux matrix, A, results in three distinct
matrices U (the left singular matrix), E (singular value matrix), and V (right
singular matrix):
A=U.epsilon.VT
[0142] For phase I of the phase plane, the flux distribution matrix was
generated with the E. coli core metabolism using the oxygen and succinate input
values that are tabulated next to FIG. 4. The points lie on phase I as shown.
[0143] SVD analysis on the flux matrix revealed that there is only one dominant
mode in phase I as demonstrated by the singular value fractions shown in FIG. 5.
Therefore, there is a common expression that dominates nearly all of the
system's behavior in this phenotypic phase, which can be called a phase
invariant singular value.
[0144] The contribution level of each condition (i.e. each point shown in Phase
I of the phase plane) is shown in FIGS. 6 and 7 for various modes obtained from
SVD. The weight that each mode has on the overall contribution of a pathway is
seen by how far the curve of that mode is from the zero contribution level
(horizontal zero level). Also, for each mode, the expression level increases
with the condition number which shows how fluxes increase in the pathway
represented by that mode. These representations provide information regarding
where on the phase plane the point lies relative to other points (i.e. at a
higher or lower growth rate). Thus, not only is information provided about the
dominant modes, but also additional information is provided on biomass
production rate. The slope of the first dominant mode ("first mode") should
correspond to the slope of growth rate. The first mode captures nearly 100% of
the overall contribution.
[0145] To compare the results from SVD and with the results from pathway
analysis, extreme pathways of the core E. coli system were calculated, using
succinate as the sole carbon source. The reduced set of extreme pathways for
succinate is presented in Table 2 (adopted from Schilling, supra (2000), Table
6.6) and shown in FIG. 8.
[0146] For the Phase I analysis described above, to compare the extreme pathways
with the 1st mode, the genes were arranged in the same order and fluxes were
normalized by succinate uptake rate. The angles between the I st mode and each
of the 12 extreme pathways were calculated and sorted in descending order. Also,
the number of different fluxes (i.e. fluxes that are zero in one case and
non-zero in the other case or have opposite signs) and the net flux difference
between the first mode and each pathway were calculated and sorted in the same
fashion. Table 3 provides the results of this analysis.
[0147] This analysis shows that the first mode in phase I is exactly equivalent
to the line of optimality (i.e. P.sub.--33). It also shows that following this
pathway, the first mode is the closest to pathways 32, 30, and so on. Therefore,
column angle not only shows what pathways best describe flux distribution in
phase I in the order of similarity, but it also shows how similar they are
amongst themselves.
[0148] The analysis was repeated for Phases II and III, and for all phases
together. When all phases were analyzed by SVD together, again a single dominant
mode was identified (FIG. 14), with relatively low entropy (4.80E-3). The angle
between this mode and each of the 12 extreme pathways was calculated. Table 4
provides the results of this analysis. By this analysis, the dominant mode was
closest to extreme pathways 33 and 32 shown in Table 2.
EXAMPLE II
Identifying Human Red Blood Cell Extreme Pathways Corresponding to
Physiologically Relevant Flux Distributions
[0149] This example shows how a set of phenomenological pathways (flux
distributions) generated by a kinetic model can be compared with the modal
decomposition of a set of systemic pathways (extreme pathways) to identify
dominant regulatory modes of a metabolic reaction network (human red blood cell
metabolism).
[0150] The extreme pathways of the red blood cell (RBC) metabolic network have
been computed (Wiback, S. J. & Palsson, B. O. Biophysical Journal 83, 808-818
(2002)). Here, SVD analysis was applied to the extreme pathway matrix, P, formed
by these pathways. A full kinetic model of the entire metabolic network of the
RBC has been developed (Jamshidi, N., Edwards, J. S., Fahland, T., Church, G.
M., Palsson, B. O. Bioinformatics 17, 286-7 (2001); Joshi, A. & Palsson, B. O.
Journal of Theoretical Biology 141, 515-28 (1991)), and was used to generate
flux vectors (v) for physiologically relevant states. These flux vectors were
decomposed using the modes obtained from SVD of P.
[0151] The rank of the V.sub.max-scaled RBC extreme pathway matrix, P, was 23.
The first mode represents 47% of the variance (FIG. 10F). Combined, the first
five modes capture 86% of the variance of the solution space, while the first
nine modes capture 95% of its variance.
[0152] The first five modes of P are shown on the metabolic maps in FIG.
10(A-E). The first mode shows low flux values though the adenosine reactions,
higher fluxes through the glycolytic reactions, with an exit through the R/L
shunt, and the highest flux levels through the pentose phosphate pathway. This
map describes the principal variance of the steady-state solution space. The
subsequent modes describe the next directions of greatest variance in the
steady-state solution space (FIG. 10). Movement along a mode in the positive
direction corresponds to increasing the fluxes shown in red and decreasing those
shown in green. Since the modes are required to be orthogonal, they specifically
describe the directions of variance in the cone that are independent from each
other. The subsequent modes can be interpreted biochemically as follows:
[0153] The second mode describes the flux split between glycolysis and the
pentose phosphate pathway. If the contribution of this mode is added to the
first mode it would lead to decreased flux through the pentose phosphate pathway
and reduced production of NADPH. The increased glycolytic flux exits through the
Rapoport-Leubering (R/L) shunt leading to decreased ATP production since ATP is
used in upper glycolysis and not recovered in lower glycolysis. The production
of NADH increases.
[0154] The third mode describes the glycolytic pathway down to pyruvate with
production of ATP and NADH. It also describes lowered dissipation of ATP as a
consequence of AMP dissipation by AMPase. This mode has a significant ATP
production.
[0155] The fourth mode describes the flux split between lower glycolysis and the
R/L shunt. It thus naturally interacts biochemically with the second mode. The
fourth mode further describes an increase in ATP dissipation via the AMPase-AK
cycle leading to little net production of ATP, and interacts with mode three.
[0156] The fifth mode is actually one of the extreme pathways. It describes
importing pyruvate and converting it to lactate, thus dissipating one NADH. It
thus will be important in balancing NADH redox metabolism.
[0157] As shown below the first five modes account for most of the RBC's
physiological states.
[0158] The nominal state (no additional metabolic load) of the red blood cell
metabolic network was calculated using a full kinetic model and is shown on the
RBC metabolic map (FIG. 10G). This nominal physiologic steady state of the RBC
was decomposed into 23 modes (FIG. 10H). The relative error remaining in the
reconstructed solution after the addition of each mode to the reconstruction of
the nominal steady state fell sharply (FIG. 10H). After the contribution of the
first five modes, the reconstructed nominal state had a relative error of 0.013
(RE(5) =0.013).
[0159] An inspection of the first five modes (FIG. 10A-E) demonstrates how they
reconstruct the physiologic steady state solution. Relative to the first mode
(FIG. 10A), adding the second mode (FIG. 10B) increases the flux through the
first half of glycolysis, decreases the flux through the pentose phosphate
reaction, and decreases NADPH production, all of which moves the reconstructed
solution significantly towards the physiologic steady state (FIG. 10G). Adding
the third mode (FIG. 10C) increases the flux through all of glycolysis,
particularly through lower glycolysis. The addition of the fourth mode (FIG.
10D) appropriately decreases the amount of 23DPG that is produced and instead
sends that flux through lower glycolysis. Finally, the addition of the fifth
mode increases the flux from pyruvate to lactate, which leads essentially to the
steady state solution where lactate is the primary output of glycolysis. Thus,
the significant features of the physiologic steady state are captured within the
first five modes. A regulatory structure that can move the solution along these
five independent directions in the solution space will be able to generate the
desired physiological state.
[0160] Steady-state flux distributions for two load levels of NADPH, ATP, and
NADH were calculated using the RBC kinetic model. These pairs of load levels
each represented the maximum load the in silico RBC could withstand, as well as
one value chosen within the tolerated load range. NADPH loads simulate
physiologic states corresponding to the red blood cell's response to oxidative
free radicals. The maximum NADPH load is 2.5 mM/hr. The ATP loads simulate
conditions of increased energy loads, such as in hyperosmotic media. The maximum
ATP load is 0.37 mM/hr. Two NADH loads, important for methemoglobin reduction in
the RBC, were also applied. These six computed flux vectors thus represent
extreme physiological states of the RBC, and help designate the region of
physiologically meaningful states within the steady-state solution space.
[0161] The modal composition of each of the six "stressed" steady state flux
solutions gives significant weighting to the first five modes (FIG. 10H). In
addition, some "fine tuning" appears in modes 7 to 11. All of the other modes
are essentially insignificant in reconstructing these solutions to the RBC
kinetic model.
[0162] The application of metabolic loads changed the weighting of the first
five modes to reconstruct the appropriate metabolic flux distribution (FIG.
10H,I). Increases in the NADPH load resulted in a substantial increase of the
weighting on the first mode, increasing the flux through the pentose phosphate
reactions and thus elevating the production of NADPH. The weightings on the
second, third, fourth, and fifth modes decrease with the application of higher
NADPH loads largely because as NADPH production is maximized the flux
distribution approaches that of the first mode. The reduction in the weighting
of the second mode, however, is the most dramatic. The application of increasing
ATP loads resulted in little change in the values of the weightings on all of
the first five modes. The application of ATP load is handled in the RBC by a
decrease in an ATP-consuming futile cycle, with the ATP generated instead being
used instead to satisfy the load imposed upon the cell. Thus, the usage of an
ATP-dissipiating futile cycle in the unstressed state of the RBC acts to dampen
the effects of changing ATP loads, allowing the RBC to respond to changing ATP
loads with little change in the overall flux distribution in the cell. Related
experimental findings have demonstrated that the concentration of ATP in the RBC
does not change much as environmental conditions change within specified limits,
as a result of this buffer, but then changes dramatically when the ATP load is
pushed beyond those limits. The application of the NADH loads resulted in a
significant decrease of all the mode weightings because the length of the flux
vector decreases. The weighting on the fifth mode decreased most dramatically
since it consumes NADH when utilized in the positive direction and thus had
needed to be scaled down.
[0163] After the inclusion of the first five modes, the relative error (RE(5))
of all the reconstructed solutions ranged from 0.005 to 0.018. In all six cases,
the first five modes reconstructed at least 98% of the steady state solutions.
Thus, the physiologically relevant portion of the steady-state solution space
appears to be only 5 dimensional, and therefore there are effectively only five
degrees of freedom to the problem of regulating red cell metabolism.
[0164] Decomposition of the extreme pathway vectors into the modes shows that
the most important mode, in the reconstruction, is often not one of the first
five modes (FIG. 10J). Thus, many portions of the allowable solution space, as
defined by the extreme pathways, are poorly characterized by the first five
modes, which effectively reconstruct each solution to the full RBC kinetic
model. Thus, many of the extreme pathways are physiologically irrelevant and
they can be identified using SVD of P, if the approximate location of
physiologically meaningful solutions is known.
[0165] Study of regulation of metabolism has historically focused on the
identification and characterization of individual regulatory events. Now that we
can reconstruct full metabolic reaction networks we can address the need for
regulation from a network-based perspective. This study has focused on
interpreting regulation from a network-based perspective using singular value
decomposition of the extreme pathway matrix for human red blood cell metabolism.
Two main results were obtained. First, the dominant modes obtained by SVD
interpret RBC metabolic physiology well. Second, the first five modes
effectively characterize all the relevant physiological states of the red cell.
[0166] RBC metabolic physiology is well interpreted by the dominant modes
obtained for SVD. Using the calculated modes, seven physiologically relevant
solutions to the full RBC kinetic model were reconstructed. The RE(5) for these
solutions was within 0.017. Thus, the first five modes can be used to
essentially completely recapture each of the physiologically relevant kinetic
solutions. However, most of the extreme pathways could not be reconstructed to
such a high degree by the first five modes. Thus, the first five modes
represented the space relevant to solutions to the full kinetic model better
than they did to the space as a whole, even though they were calculated to
optimize their description of the entire space. This fact suggests that
developing constraints-based methods that take into account kinetics and
metabolomics will result in defining a solution space that is much smaller than
the space circumscribed by the extreme pathways.
[0167] The results obtained herein were based on the topology of the metabolic
network and knowledge of some V.sub.max values. The next step to bridge the gap
between the network-based results and the study of individual regulatory events
is to find the best ways to pair candidate regulatory molecules and the systemic
regulatory needs. In control theory this is known as the `loop-pairing` problem
(Seborg, D. E., Edgar, T. F. & Mellichamp, D. A. Process dynamics and control
(Wiley, N.Y., 1989)). As a part of its solution we may have to relax the need
for strict orthonormality of the modes and look for oblique modal bases that are
more in line with the underlying biochemistry of the network.
[0168] Taken together, this study presents a network-based approach to studying
regulatory networks and defines the degrees of freedom of the regulatory
problem. This method calculates the modalities needed to enable the metabolic
network to navigate its solution space and thus could be used to infer candidate
regulatory loops of metabolic systems for which the regulation is largely
unknown. Further, based upon their contribution to the steady-state solution
space, these regulatory loops can potentially be ordered in terms of their
importance to the reconstruction of the space. Network-based approaches to
studying regulation, such as the one offered herein, complement component-based
studies and provide a potential framework to better understand the interaction
of regulatory components needed to achieve the regulatory demands of the cell.
EXAMPLE III
In Silico Assessment of the Phenotypic Consequences of Red Blood Cell Single
Nucleotide Polymorphisms
[0169] The following example illustrates the application of the described
methods to analysis of phenomenological pathways defined through pathological
data.
[0170] The Human Genome Project (HGP) is now essentially complete. One result of
the HGP is the definition of single nucleotide polymorphisms (SNPs) and their
effects on the development of human disease. Although the number of SNPs in the
human genome is expected to be a few million, it is estimated that only 100,000
to 200,000 will effectively define a unique human genotype. A subset of these
SNPs are believed to be "informative" with respect to human disease (Syvanen,
A., 2001. Accessing genetic variation: Genotyping single nucleotide
polymorphisms. Nat Rev Genet 2: 930-942). Many of these SNPs will fall into
coding regions while others will be found in regulatory regions. The human
genotype-phenotype relationship is very complex and it will be hard to determine
the causal relationship between sequence variation and physiological function.
One way to deal with this intricate relationship is to build large-scale in
silico models of complex biological processes (FIG. 12). Defects or alterations
in the properties of a single component in complex biological processes can be
put into context of the rest by using an in silico model. In this work, recent
data on SNPs in key red blood cell enzymes (FIG. 12a) and corresponding
alterations in their kinetic properties (FIG. 12b) were used in an in silico red
blood cell model (FIG. 12c) to calculate the overall effect of SNPs on whole
cell function (FIG. 12d).
[0171] The study of variations in the kinetic properties of red blood cell
enzymes is not merely an academic study of the quality of a mathematical model,
but has real utility in the clinical diagnosis and treatment of enzymopathies
and can provide a link to the underlying sequence variation (FIG. 12). Here, an
in silico model is used to study SNPs in two of the most frequent red blood cell
enzymopathies: glucose-6-phosphate dehydrogenase (G6PD) and pyruvate kinase (PK).
[0172] For both enzyme deficiencies, clinical data was obtained from the
published literature to determine measured values for the various kinetic
parameters (V.sub.max's, Km's, Ki's) associated with each clinically diagnosed
variant. These numerical values were then used in the iii silico model (Jamshidi,
N., Edwards, J. S., Fahland, T., Church, G. M., Palsson, B. O. Bioinformatics
17, 286-7 (2001)) and sensitivities to various oxidative and energy loads (above
normal, baseline values) were simulated. The results are interpreted with
respect to the genetic basis of the enzymopatby in an attempt to establish a
direct link between the genotype and phenotype (FIG. 12).
[0173] Glucose-6-phosphate dehydrogenase (G6PD) catalyzes the first step in the
oxidative branch of the pentose pathway (FIG. 12c) and is thus of critical
importance in maintaining the red blood cell's resistance to oxidative stresses.
G6PD is the most common erythrocyte enzymopathy, affecting approximately 400
million people worldwide.
[0174] G6PD from normal patients and patients with hemolytic anemia have been
characterized on the molecular level. A total of 61 G6PD class I variants have
been described at the molecular level. Of the 61 class I chronic variants, 55
are the result of SNPs involving amino acid changes, 5 result from frame
deletions and one results from a splicing defect (Fiorelli, G., F. M. d.
Montemuros and M. D. Cappellini, Bailliere's Clinical Haematology 13: 35-55
(2000)).
[0175] Clinically diagnosed SNPs cluster around important, active regions of
G6PD enzyme including the dimer interface and substrate binding sites (FIG.
13a). Numerical values of G6PD kinetic parameters were varied in silico to
determine the sensitivity of red blood cell metabolic functions to these changes
in enzyme function. The most sensitive parameters were found to be V.sub.max and
Ki-NADPH. The NADPH/NADP ratio proved to be the most informative indicator of
metabolic status as it was the most sensitive to changes in these two parameters
and it gives an indication as to the oxidative state of the cell (Kirkman, H.
N., G. D. Gaetani, E. H. Clemons and C. Mareni, Journal of Clinical
Investigation 55: 875-8 (1975)). For each documented variant there appears to be
no direct correlation between V.sub.max and Ki-NADPH (FIG. 13b). Clinically,
G6PD deficiencies are broken down
|