A number of different Markov models of DNA sequence evolution have been proposed. This is because evolutionary processes vary between genomes and between different regions of a genome; for example, different evolutionary processes apply to coding and noncoding regions. These models differ mostly in the parametrization of the rate matrix and in the modeling of rate variation.
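Continuous-time Markov chains have the usual transition matrices which are, in addition, parameterized by time, $t$. Specifically, if $E_1, E_2, E_3, E_4$ are the states, then the transition matrix is $P(t) = \big(p_{ij}(t)\big)$, where each individual entry $p_{ij}(t)$ is the probability that state $E_i$ will change to state $E_j$ in time $t$.
Example: We would like to model the substitution process in DNA sequences (e.g. Jukes–Cantor, Kimura, etc.) in a continuous-time fashion. The corresponding transition matrices will look like:

$$P(t) = \begin{pmatrix} p_{AA}(t) & p_{AG}(t) & p_{AC}(t) & p_{AT}(t) \\ p_{GA}(t) & p_{GG}(t) & p_{GC}(t) & p_{GT}(t) \\ p_{CA}(t) & p_{CG}(t) & p_{CC}(t) & p_{CT}(t) \\ p_{TA}(t) & p_{TG}(t) & p_{TC}(t) & p_{TT}(t) \end{pmatrix}$$

where the top-left and bottom-right 2 × 2 blocks correspond to transition probabilities (changes within the purines A, G or within the pyrimidines C, T) and the top-right and bottom-left 2 × 2 blocks correspond to transversion probabilities (changes between a purine and a pyrimidine).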
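The block structure described above can be illustrated with a minimal sketch. The function name `substitution_type` and the base ordering A, G, C, T are assumptions made for this illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
ORDER = "AGCT"  # assumed row/column ordering of the 4 x 4 matrix


def substitution_type(x, y):
    """Classify a change from base x to base y."""
    if x == y:
        return "none"
    # A change that stays inside one chemical class is a transition.
    same_class = ({x, y} <= PURINES) or ({x, y} <= PYRIMIDINES)
    return "transition" if same_class else "transversion"


# Print the 4 x 4 layout: the diagonal 2 x 2 blocks hold the transitions,
# the off-diagonal 2 x 2 blocks hold the transversions.
for x in ORDER:
    print(x, [substitution_type(x, y) for y in ORDER])
```

With this ordering, every off-diagonal entry of the two diagonal blocks is a transition and every entry of the two off-diagonal blocks is a transversion.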
Assumption: If at some time $t_0$ the Markov chain is in state $E_i$, then the probability that at time $t_0 + t$ it will be in state $E_j$ depends only upon $i$, $j$ and $t$. This then allows us to write that probability as $p_{ij}(t)$.
Theorem: Continuous-time transition matrices satisfy:

$$P(t + \tau) = P(t)\,P(\tau)$$
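The theorem can be checked numerically. The sketch below assumes, purely as a concrete example, the closed-form Jukes–Cantor transition matrix with diagonal entries $\tfrac14 + \tfrac34 e^{-\mu t}$ and off-diagonal entries $\tfrac14 - \tfrac14 e^{-\mu t}$; the helper names are hypothetical.

```python
import math


def jc69_p(t, mu=1.0):
    """Assumed Jukes-Cantor transition matrix P(t) as a 4 x 4 nested list."""
    e = math.exp(-mu * t)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


def matmul(a, b):
    """Plain matrix product of two square nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def max_abs_diff(a, b):
    """Largest entrywise absolute difference between two 4 x 4 matrices."""
    return max(abs(a[i][j] - b[i][j]) for i in range(4) for j in range(4))


t, tau = 0.3, 1.1
lhs = jc69_p(t + tau)                   # P(t + tau)
rhs = matmul(jc69_p(t), jc69_p(tau))    # P(t) P(tau)
print(max_abs_diff(lhs, rhs))           # tiny value, of the order of rounding error
```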
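Consider a DNA sequence of fixed length m evolving in time by base replacement. Assume that the processes followed by the m sites are Markovian, independent, identically distributed and constant in time. For a fixed site, let

$$\mathbf{P}(t) = \big(p_A(t),\ p_G(t),\ p_C(t),\ p_T(t)\big)^{\mathsf T}$$

be the column vector of probabilities of states $A$, $G$, $C$ and $T$ at time $t$. Let

$$\mathcal{E} = \{A, G, C, T\}$$

be the state-space. For two distinct $x, y \in \mathcal{E}$, let $\mu_{xy}$ be the transition rate from state $x$ to state $y$. Similarly, for any $x \in \mathcal{E}$, let:

$$\mu_x = \sum_{y \neq x} \mu_{xy}$$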
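The changes in the probability distribution $p_A(t)$ for small increments of time $\Delta t$ are given by:

$$p_A(t + \Delta t) = p_A(t) - p_A(t)\,\mu_A\,\Delta t + \sum_{x \neq A} p_x(t)\,\mu_{xA}\,\Delta t$$

In other words (in frequentist language), the frequency of $A$'s at time $t + \Delta t$ is equal to the frequency at time $t$ minus the frequency of the lost $A$'s plus the frequency of the newly created $A$'s.
Similarly for the probabilities $p_G(t)$, $p_C(t)$ and $p_T(t)$.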
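We can write these compactly as:

$$\mathbf{P}(t + \Delta t) = \mathbf{P}(t) + Q^{\mathsf T}\,\mathbf{P}(t)\,\Delta t$$

where,

$$Q = \begin{pmatrix} -\mu_A & \mu_{AG} & \mu_{AC} & \mu_{AT} \\ \mu_{GA} & -\mu_G & \mu_{GC} & \mu_{GT} \\ \mu_{CA} & \mu_{CG} & -\mu_C & \mu_{CT} \\ \mu_{TA} & \mu_{TG} & \mu_{TC} & -\mu_T \end{pmatrix}$$

or, alternately:

$$\mathbf{P}'(t) = Q^{\mathsf T}\,\mathbf{P}(t)$$

where $Q$ is the rate matrix, whose entry in row $x$ and column $y$ is the rate $\mu_{xy}$ from state $x$ to state $y$. The transpose appears because $\mathbf{P}(t)$ is a column vector. Note that by definition, the rows of $Q$ sum to zero.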
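The small-time update described above (each frequency loses what leaves the state and gains what arrives from the other states) can be iterated directly. This is a minimal sketch; the rate values in `mu` are hypothetical and chosen only for illustration.

```python
ORDER = "AGCT"

# Hypothetical rates mu[x][y] from base x to base y (illustrative values only).
mu = {
    "A": {"G": 0.4, "C": 0.1, "T": 0.1},
    "G": {"A": 0.3, "C": 0.1, "T": 0.2},
    "C": {"A": 0.1, "G": 0.2, "T": 0.5},
    "T": {"A": 0.1, "G": 0.1, "C": 0.4},
}


def euler_step(p, dt):
    """One update: p_y loses p_y * mu_y * dt and gains sum_x p_x * mu_xy * dt."""
    new_p = {}
    for y in ORDER:
        lost = p[y] * sum(mu[y].values()) * dt
        gained = sum(p[x] * mu[x][y] for x in ORDER if x != y) * dt
        new_p[y] = p[y] - lost + gained
    return new_p


# Start from a site that is certainly an A and evolve for 1000 small steps.
p = {"A": 1.0, "G": 0.0, "C": 0.0, "T": 0.0}
for _ in range(1000):
    p = euler_step(p, dt=0.01)
print(p, sum(p.values()))
```

Because every unit of probability lost by one state is gained by another, the four probabilities keep summing to 1 (up to rounding), which is the discrete counterpart of the rows of the rate matrix summing to zero.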
If all the transition probabilities $p_{xy}(t)$ are positive, i.e. if all states $x, y \in \mathcal{E}$ communicate, then the Markov chain has a stationary distribution $\Pi = \{\pi_x,\ x \in \mathcal{E}\}$, where each $\pi_x$ is the proportion of time spent in state $x$ after the Markov chain has run for infinite time, and this probability does not depend upon the initial state of the process. Such a Markov chain is called ergodic. In DNA evolution, under the assumption of a common process for each site, the stationary frequencies $\pi_A, \pi_G, \pi_C, \pi_T$ correspond to equilibrium base compositions.
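Definition: A Markov process is stationary if its current distribution is the stationary distribution, i.e. $\mathbf{P}(t) = \Pi$. Thus, by using the differential equation above, the time derivative of every component vanishes:

$$\frac{d\pi_y}{dt} = \sum_{x \neq y} \pi_x\,\mu_{xy} - \pi_y\,\mu_y = 0 \quad \text{for every } y \in \mathcal{E}$$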
Definition: A stationary Markov process is time reversible if (in the steady state) the amount of change from state $x$ to $y$ is equal to the amount of change from $y$ to $x$ (although the two states may occur with different frequencies). This means that:

$$\pi_x\,\mu_{xy} = \pi_y\,\mu_{yx}$$
Not all stationary processes are reversible; however, almost all DNA evolution models assume time reversibility, which is considered to be a reasonable assumption.
Under the time reversibility assumption, let $s_{xy} = \mu_{xy}/\pi_y$; then it is easy to see that:

$$s_{xy} = s_{yx}$$
Definition: The symmetric term $s_{xy}$ is called the exchangeability between states $x$ and $y$. In other words, $s_{xy}$ is the fraction of the frequency of state $x$ that results from transitions from state $y$ to state $x$.
Corollary: The 12 off-diagonal entries of the rate matrix $Q$ (note that the off-diagonal entries determine the diagonal entries, since the rows of $Q$ sum to zero) can be completely determined by 9 numbers; these are 6 exchangeability terms and 3 stationary frequencies $\pi_x$ (since the stationary frequencies sum to 1).
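The corollary can be made concrete with a minimal sketch that builds a rate matrix from 6 exchangeabilities and 3 frequencies, using the relation $\mu_{xy} = s_{xy}\,\pi_y$. The function name and all numerical inputs are hypothetical.

```python
from itertools import combinations

ORDER = "AGCT"


def reversible_rate_matrix(s, pi3):
    """Build Q with Q[x][y] = s_xy * pi_y for x != y; diagonals make rows sum to 0.

    s   -- dict mapping frozenset({x, y}) to the exchangeability s_xy
    pi3 -- dict with the frequencies of A, G and C; T gets the remainder
    """
    pi = dict(pi3)
    pi["T"] = 1.0 - sum(pi.values())
    q = {x: {} for x in ORDER}
    for x in ORDER:
        for y in ORDER:
            if x != y:
                q[x][y] = s[frozenset((x, y))] * pi[y]
        q[x][x] = -sum(q[x].values())  # only off-diagonals are present here
    return q, pi


# Hypothetical inputs: 6 exchangeabilities + 3 frequencies = 9 numbers.
s = {frozenset(pair): rate for pair, rate in
     zip(combinations(ORDER, 2), [2.0, 0.5, 0.7, 0.9, 0.6, 2.5])}
q, pi = reversible_rate_matrix(s, {"A": 0.3, "G": 0.2, "C": 0.2})

row_sums = [sum(q[x].values()) for x in ORDER]
# Time reversibility: pi_x * mu_xy == pi_y * mu_yx for every pair.
balance_gap = max(abs(pi[x] * q[x][y] - pi[y] * q[y][x])
                  for x, y in combinations(ORDER, 2))
# Stationarity: sum_x pi_x * Q[x][y] == 0 for every y.
stationarity_gap = max(abs(sum(pi[x] * q[x][y] for x in ORDER)) for y in ORDER)
print(row_sums, balance_gap, stationarity_gap)
```

All three printed quantities should be zero up to rounding, showing that a matrix built this way is reversible with stationary distribution `pi`.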
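JC69 is the simplest substitution model. It makes several assumptions: it assumes equal base frequencies ($\pi_A = \pi_G = \pi_C = \pi_T = \tfrac14$) and equal mutation rates. The only parameter of this model is therefore μ, the overall substitution rate.
The rate matrix is

$$Q = \begin{pmatrix} * & \tfrac{\mu}{4} & \tfrac{\mu}{4} & \tfrac{\mu}{4} \\ \tfrac{\mu}{4} & * & \tfrac{\mu}{4} & \tfrac{\mu}{4} \\ \tfrac{\mu}{4} & \tfrac{\mu}{4} & * & \tfrac{\mu}{4} \\ \tfrac{\mu}{4} & \tfrac{\mu}{4} & \tfrac{\mu}{4} & * \end{pmatrix}$$

where each diagonal entry $*$ is chosen so that its row sums to zero, and the corresponding transition probability matrix is

$$P(t) = \begin{pmatrix} \tfrac14 + \tfrac34 e^{-t\mu} & \tfrac14 - \tfrac14 e^{-t\mu} & \tfrac14 - \tfrac14 e^{-t\mu} & \tfrac14 - \tfrac14 e^{-t\mu} \\ \tfrac14 - \tfrac14 e^{-t\mu} & \tfrac14 + \tfrac34 e^{-t\mu} & \tfrac14 - \tfrac14 e^{-t\mu} & \tfrac14 - \tfrac14 e^{-t\mu} \\ \tfrac14 - \tfrac14 e^{-t\mu} & \tfrac14 - \tfrac14 e^{-t\mu} & \tfrac14 + \tfrac34 e^{-t\mu} & \tfrac14 - \tfrac14 e^{-t\mu} \\ \tfrac14 - \tfrac14 e^{-t\mu} & \tfrac14 - \tfrac14 e^{-t\mu} & \tfrac14 - \tfrac14 e^{-t\mu} & \tfrac14 + \tfrac34 e^{-t\mu} \end{pmatrix}$$

Distance between two sequences is given by

$$d = -\tfrac34 \ln\left(1 - \tfrac43\,p\right)$$

where p is the proportion of sites that differ between the two sequences.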
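As an illustration, the JC69 distance in its standard form $d = -\tfrac34 \ln(1 - \tfrac43 p)$ can be computed from two aligned sequences. The sketch below uses hypothetical function names and made-up sequences.

```python
import math


def jc69_distance(p):
    """JC69 distance d = -3/4 * ln(1 - 4p/3) for a proportion p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the logarithm to be defined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def proportion_different(seq1, seq2):
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)


# Hypothetical aligned sequences, for illustration only.
p = proportion_different("ACGTACGTAC", "ACGTTCGTAA")
print(p, jc69_distance(p))
```

The distance is always at least as large as p, because it corrects for multiple substitutions at the same site; it is undefined once p reaches 3/4, the value expected for unrelated sequences under this model.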
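The K80 model distinguishes between transitions (A <-> G, i.e. from purine to purine, or C <-> T, i.e. from pyrimidine to pyrimidine), which occur at rate α, and transversions (from purine to pyrimidine or vice versa), which occur at rate β; the transition/transversion rate ratio is α/β.
It also assumes equal base frequencies ($\pi_A = \pi_G = \pi_C = \pi_T = \tfrac14$).
Rate matrix (rows and columns ordered A, G, C, T; each diagonal entry $*$ makes its row sum to zero):

$$Q = \begin{pmatrix} * & \alpha & \beta & \beta \\ \alpha & * & \beta & \beta \\ \beta & \beta & * & \alpha \\ \beta & \beta & \alpha & * \end{pmatrix}$$

The Kimura two-parameter distance is given by:

$$d = -\tfrac12 \ln\left(1 - 2p - q\right) - \tfrac14 \ln\left(1 - 2q\right)$$

where p is the proportion of sites that show transitional differences and q is the proportion of sites that show transversional differences.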
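A minimal sketch of the Kimura two-parameter distance in its usual form $d = -\tfrac12 \ln(1 - 2p - q) - \tfrac14 \ln(1 - 2q)$, together with a helper that counts transitional and transversional differences. Function names and sequences are hypothetical.

```python
import math

PURINES = {"A", "G"}


def k80_distance(p, q):
    """Kimura two-parameter distance from transition (p) and transversion (q) proportions."""
    a = 1.0 - 2.0 * p - q
    b = 1.0 - 2.0 * q
    if a <= 0.0 or b <= 0.0:
        raise ValueError("sequences too divergent: logarithm undefined")
    return -0.5 * math.log(a) - 0.25 * math.log(b)


def transition_transversion_proportions(seq1, seq2):
    """Return (p, q): fractions of sites with transitional / transversional differences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    ts = tv = 0
    for x, y in zip(seq1, seq2):
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            ts += 1  # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            tv += 1  # purine <-> pyrimidine
    n = len(seq1)
    return ts / n, tv / n


# Hypothetical aligned sequences, for illustration only.
p, q = transition_transversion_proportions("ACGTACGTAC", "GCGTTCGTAA")
print(p, q, k80_distance(p, q))
```

When transversional differences are exactly twice as frequent as transitional ones (q = 2p), the expression reduces to the JC69 distance for a total proportion p + q of differing sites.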
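The F81 model (Felsenstein, 1981) allows unequal base frequencies (the frequencies $\pi_A, \pi_G, \pi_C, \pi_T$ need not be equal) but does not distinguish between transitions and transversions.
Rate matrix:

$$Q = \begin{pmatrix} * & \pi_G & \pi_C & \pi_T \\ \pi_A & * & \pi_C & \pi_T \\ \pi_A & \pi_G & * & \pi_T \\ \pi_A & \pi_G & \pi_C & * \end{pmatrix}$$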
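The HKY85 model distinguishes between transitions and transversions (rate parameters α and β, with ratio α/β).
It allows unequal base frequencies (the frequencies $\pi_A, \pi_G, \pi_C, \pi_T$ need not be equal).
Rate matrix:

$$Q = \begin{pmatrix} * & \alpha\pi_G & \beta\pi_C & \beta\pi_T \\ \alpha\pi_A & * & \beta\pi_C & \beta\pi_T \\ \beta\pi_A & \beta\pi_G & * & \alpha\pi_T \\ \beta\pi_A & \beta\pi_G & \alpha\pi_C & * \end{pmatrix}$$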
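The T92 model (Tamura, 1992) uses one frequency parameter only, $\pi_{GC}$:

$$\pi_G = \pi_C = \frac{\pi_{GC}}{2}$$

$$\pi_A = \pi_T = \frac{1 - \pi_{GC}}{2}$$

Rate matrix (with transition and transversion rate parameters α and β):

$$Q = \begin{pmatrix} * & \alpha\,\tfrac{\pi_{GC}}{2} & \beta\,\tfrac{\pi_{GC}}{2} & \beta\,\tfrac{1-\pi_{GC}}{2} \\ \alpha\,\tfrac{1-\pi_{GC}}{2} & * & \beta\,\tfrac{\pi_{GC}}{2} & \beta\,\tfrac{1-\pi_{GC}}{2} \\ \beta\,\tfrac{1-\pi_{GC}}{2} & \beta\,\tfrac{\pi_{GC}}{2} & * & \alpha\,\tfrac{1-\pi_{GC}}{2} \\ \beta\,\tfrac{1-\pi_{GC}}{2} & \beta\,\tfrac{\pi_{GC}}{2} & \alpha\,\tfrac{\pi_{GC}}{2} & * \end{pmatrix}$$

The evolutionary distance between two noncoding sequences according to this model is given by

$$d = -h \ln\left(1 - \frac{p}{h} - q\right) - \frac12\,(1 - h) \ln\left(1 - 2q\right)$$

where $h = 2\theta(1 - \theta)$, $\theta \in (0, 1)$ is the GC content, and p and q are the proportions of sites that show transitional and transversional differences, respectively.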
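A minimal sketch of this distance, assuming the commonly quoted form of Tamura's (1992) formula, $d = -h \ln(1 - p/h - q) - \tfrac12 (1 - h) \ln(1 - 2q)$ with $h = 2\theta(1 - \theta)$, where p and q are the transitional and transversional proportions. The function names are hypothetical; the two-parameter Kimura distance is included only for comparison.

```python
import math


def t92_distance(p, q, theta):
    """Tamura (1992) distance for transition/transversion proportions p, q and GC content theta."""
    if not 0.0 < theta < 1.0:
        raise ValueError("GC content theta must lie strictly between 0 and 1")
    h = 2.0 * theta * (1.0 - theta)
    a = 1.0 - p / h - q
    b = 1.0 - 2.0 * q
    if a <= 0.0 or b <= 0.0:
        raise ValueError("sequences too divergent: logarithm undefined")
    return -h * math.log(a) - 0.5 * (1.0 - h) * math.log(b)


def k80_distance(p, q):
    """Kimura two-parameter distance, used here only as a reference value."""
    return -0.5 * math.log(1.0 - 2.0 * p - q) - 0.25 * math.log(1.0 - 2.0 * q)


# Hypothetical proportions; theta = 0.5 means no GC bias.
print(t92_distance(0.1, 0.2, 0.5), k80_distance(0.1, 0.2))
print(t92_distance(0.1, 0.2, 0.35))
```

With θ = 0.5 we get h = 0.5 and the formula coincides with the Kimura two-parameter distance, which is a quick sanity check on any implementation.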
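The TN93 model distinguishes between the two different types of transition, i.e. (A <-> G) is allowed to have a different rate from (C <-> T). Transversions are all assumed to occur at the same rate, but that rate is allowed to be different from both of the rates for transitions.
TN93 also allows unequal base frequencies (the frequencies $\pi_A, \pi_G, \pi_C, \pi_T$ need not be equal).
Rate matrix (with purine transition parameter $\alpha_1$, pyrimidine transition parameter $\alpha_2$ and transversion parameter $\beta$):

$$Q = \begin{pmatrix} * & \alpha_1\pi_G & \beta\pi_C & \beta\pi_T \\ \alpha_1\pi_A & * & \beta\pi_C & \beta\pi_T \\ \beta\pi_A & \beta\pi_G & * & \alpha_2\pi_T \\ \beta\pi_A & \beta\pi_G & \alpha_2\pi_C & * \end{pmatrix}$$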
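The relationship between these models can be sketched in code. The example assumes the usual parameterization in which the rate from base x to base y is a factor times $\pi_y$: one factor for A <-> G, one for C <-> T and one for transversions. The function and parameter names (`alpha1`, `alpha2`, `beta`) are assumptions of this sketch.

```python
ORDER = "AGCT"
PURINES = {"A", "G"}


def tn93_rate_matrix(pi, alpha1, alpha2, beta):
    """Return Q as nested dicts; Q[x][y] is the rate from x to y, rows sum to zero.

    alpha1 -- factor for purine transitions (A <-> G)
    alpha2 -- factor for pyrimidine transitions (C <-> T)
    beta   -- factor for transversions
    """
    q = {}
    for x in ORDER:
        row = {}
        for y in ORDER:
            if x == y:
                continue
            if x in PURINES and y in PURINES:
                factor = alpha1
            elif x not in PURINES and y not in PURINES:
                factor = alpha2
            else:
                factor = beta
            row[y] = factor * pi[y]
        row[x] = -sum(row.values())  # diagonal entry balances the row
        q[x] = row
    return q


def hky85_rate_matrix(pi, alpha, beta):
    """HKY85 as the special case with a single transition factor."""
    return tn93_rate_matrix(pi, alpha, alpha, beta)


pi = {"A": 0.3, "G": 0.2, "C": 0.2, "T": 0.3}  # hypothetical frequencies
q = tn93_rate_matrix(pi, alpha1=2.0, alpha2=3.0, beta=1.0)
for x in ORDER:
    print(x, [round(q[x][y], 3) for y in ORDER])
```

Setting `alpha1 == alpha2` gives an HKY85-style matrix, setting all three factors equal gives an F81-style matrix, and additionally using equal frequencies gives K80 or JC69 respectively.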
GTR is the most general neutral, independent, finite-sites, time-reversible model possible. It was first described in a general form by Simon Tavaré in 1986.
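The GTR parameters consist of an equilibrium base frequency vector, $\Pi = (\pi_1, \pi_2, \pi_3, \pi_4)$, giving the frequency at which each base occurs at each site, and the rate matrix

$$Q = \begin{pmatrix} -(x_1 + x_2 + x_3) & x_1 & x_2 & x_3 \\ \frac{\pi_1 x_1}{\pi_2} & -\left(\frac{\pi_1 x_1}{\pi_2} + x_4 + x_5\right) & x_4 & x_5 \\ \frac{\pi_1 x_2}{\pi_3} & \frac{\pi_2 x_4}{\pi_3} & -\left(\frac{\pi_1 x_2}{\pi_3} + \frac{\pi_2 x_4}{\pi_3} + x_6\right) & x_6 \\ \frac{\pi_1 x_3}{\pi_4} & \frac{\pi_2 x_5}{\pi_4} & \frac{\pi_3 x_6}{\pi_4} & -\left(\frac{\pi_1 x_3}{\pi_4} + \frac{\pi_2 x_5}{\pi_4} + \frac{\pi_3 x_6}{\pi_4}\right) \end{pmatrix}$$

where $x_1, \dots, x_6$ are the substitution rate parameters.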
Therefore, GTR (for four characters, as is often the case in phylogenetics) requires 6 substitution rate parameters, as well as 4 equilibrium base frequency parameters. However, the four frequencies sum to 1, so only 3 of them are free, which leaves 9 parameters; one of these merely sets μ, the overall number of substitutions per unit time. When measuring time in substitutions (μ = 1), only 8 free parameters remain.
In general, to compute the number of parameters, you count the number of entries above the diagonal in the matrix, i.e. for n trait values per site $\frac{n^2 - n}{2}$, then add n − 1 for the equilibrium frequencies (which sum to 1), and subtract 1 because μ is fixed. You get

$$\frac{n^2 - n}{2} + (n - 1) - 1 = \frac{n^2 + n}{2} - 2$$

For example, for an amino acid sequence (there are 20 "standard" amino acids that make up proteins), you would find there are 208 parameters. However, when studying coding regions of the genome, it is more common to work with a codon substitution model (a codon is three bases and codes for one amino acid in a protein). There are $4^3 = 64$ codons, but the rates for transitions between codons which differ by more than one base are assumed to be zero. Each codon has 9 single-base neighbours, so only 64 × 9 / 2 = 288 codon pairs have a nonzero rate. Hence, there are $288 + 63 - 1 = 350$ parameters.
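The counting ingredients used in this paragraph can be checked by brute force. This is a minimal sketch with hypothetical function names; it counts only the pairs of states, leaving aside how frequencies and the overall rate are tallied.

```python
from itertools import product

BASES = "AGCT"


def unordered_pairs(n):
    """Number of entries above the diagonal of an n x n matrix."""
    return n * (n - 1) // 2


def single_base_codon_pairs():
    """Count unordered pairs of codons that differ at exactly one position."""
    codons = ["".join(c) for c in product(BASES, repeat=3)]
    count = 0
    for i, c1 in enumerate(codons):
        for c2 in codons[i + 1:]:
            if sum(a != b for a, b in zip(c1, c2)) == 1:
                count += 1
    return count


print(unordered_pairs(4), unordered_pairs(20), unordered_pairs(64))  # 6 190 2016
print(single_base_codon_pairs())  # 288
```

Restricting changes to a single base per step therefore removes most of the 2016 possible codon pairs from the model.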