
Re: Phylogenetics was Re: "Ratite" polyphyly and paleognathous dromornithids

David Marjanovic <david.marjanovic@gmx.at> wrote:

> Sorry for the delay. I got busy in meatspace...

I love that term. :-)

> That's great! I have to read up on those.

I should have mentioned that the mixture model method has problems of
its own -- Matsen & Steel (2007) showed that a mixture of trees with
the same topology but different branch lengths can result in several
different topologies fitting the data perfectly (= having exactly the
same likelihood score). However, mixture models can be useful for
detecting heterotachy in the first place (returning to the original
topic, Smith et al. based their conclusion that their paleognath
phylogeny isn't misled by heterotachy on the fact that MCMC sampled
only one branch length for most branches on their tree), and there are
other models designed to address heterotachy in the data. There are
covarion models that do the same thing for rate variation across the
tree that the gamma parameter does for rate variation across sites,
allowing sites to switch between several rate categories as they
evolve (Galtier 2001; Wang et al. 2007).

> As far as I know, ML uses the parsimony-uninformative characters to
> estimate the parameters of the model; that doesn't make it non-phylogenetic
> -- quite the opposite. It still uses synapomorphies to build the trees,
> right?

It does not. When a probabilistic analysis (ML or Bayes) calculates
the likelihood score of a tree, it sums the likelihood over all
possible transformations of each character. It assigns every possible
character state to every single internal node in the tree. Suppose you
have the following four-taxon tree based (for simplicity) on a single
binary character:

X--Y-- 0
|  `-- 0
`--Z-- 1
   `-- 1

(I really hope the spaces won't get merged.)

The tree is rooted, so let's say there is an outgroup and its
character state is zero. Under parsimony, you can say that the
grouping of two zeros (stemming from the node Y) is united by a
symplesiomorphy -- it wouldn't survive on the final (consensus) tree
because there are two other equally parsimonious trees that don't
contain that grouping. Probabilistic methods consider this
possibility, too, and calculate its likelihood using a model of
evolution and branch lengths -- the relevant parameters are estimated
from all characters in the data set, not just from the
parsimony-uninformative ones, and are tweaked along with the topology
during the analysis. However, ML and BI _also_ consider the seven
remaining possibilities (there are 2 character states and 3 internal
nodes, so there are 2^3 = 8 possible state assignments), and calculate
their probabilities as well. In some of them, the zero-zero grouping
is united by a homoplasy:

X--1-- 0
|  `-- 0
`--Z-- 1
   `-- 1

And finally, there is a character-state assignment where the zero-zero
grouping is united by a synapomorphy and it's the one-one grouping
that is held together by a symplesiomorphy:

1--0-- 0
|  `-- 0
`--1-- 1
   `-- 1

When you sum the probabilities of all possible state assignments, you
get the likelihood of a character. Do it for every character in your
data set, calculate the product of the resulting likelihoods, and you
have the likelihood score of a tree. Then you are left with two
options: you can either use some kind of heuristics and try to find
the tree with the highest likelihood by tweaking the topology, branch
lengths, and values of substitution model parameters (maximum
likelihood); or you can multiply the likelihood by prior probabilities
of the current parameter values -- including topology -- to obtain the
posterior, let MCMC sample the posterior distribution of tree
topologies and other parameters, and when the chain reaches
convergence, summarize the sample as a majority-rule consensus tree
with branch lengths averaged over all trees in the sample containing
that branch (Bayesian inference).
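To make the summation concrete, here is a minimal Python sketch for the four-taxon tree above. It assumes a symmetric two-state model, equal branch lengths of 0.1 expected changes, and a uniform prior on the root state instead of an outgroup; all of these choices, and the function names, are mine and are there for illustration only.

```python
from itertools import product
from math import exp


def p_change(i, j, t):
    """Transition probability i -> j along a branch of length t
    under a symmetric two-state model (assumed for illustration)."""
    same = 0.5 + 0.5 * exp(-2.0 * t)
    return same if i == j else 1.0 - same


def site_likelihood(tips, t=0.1):
    """Likelihood of one binary character on the tree ((a,b),(c,d)).

    X is the root, Y the ancestor of a and b, Z the ancestor of c and d.
    Every branch has the same (hypothetical) length t.
    """
    a, b, c, d = tips
    per_assignment = {}
    total = 0.0
    # 2 states at each of 3 internal nodes -> 2^3 = 8 assignments
    for x, y, z in product((0, 1), repeat=3):
        p = 0.5  # uniform prior on the root state (an assumption)
        p *= p_change(x, y, t) * p_change(x, z, t)
        p *= p_change(y, a, t) * p_change(y, b, t)
        p *= p_change(z, c, t) * p_change(z, d, t)
        per_assignment[(x, y, z)] = p
        total += p
    return total, per_assignment


total, parts = site_likelihood((0, 0, 1, 1))
for (x, y, z), p in sorted(parts.items(), key=lambda kv: -kv[1]):
    print(f"X={x} Y={y} Z={z}  p={p:.6f}")
print(f"site likelihood = {total:.6f}")
```

None of the eight assignments is discarded; the improbable ones simply contribute very little to the sum. For a whole data set you would add up the logarithms of the per-character likelihoods rather than multiply the raw values, which avoids numerical underflow.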

Of course, this account is too simplistic*, but it should be evident
that the distinction between synapomorphies and symplesiomorphies
doesn't enter into the procedure at all. Parsimony is the only method
that makes that distinction because it simultaneously optimizes the
topology and the character states at its internal nodes, whereas
probabilistic methods integrate the ancestral states out. That's their
advantage: while it's improbable that the most recent common ancestor
of two "ones" was a "zero", it cannot be ruled out, and parsimony
doesn't take that possibility into account. However, you can still
retrieve synapomorphies from a probabilistic analysis _a posteriori_
-- the simplest way is to constrain a parsimony program to find the
ML/Bayesian tree and then check the resulting character state
optimization. That's how Lee & Worthy (2011) were able to say what
characters support their likelihood tree -- the relationship between
nodes on the tree and individual characters isn't as clear-cut with
likelihood or Bayesian trees as it is with parsimony trees. In fact,
one might argue this is the only reason we should care about
synapomorphies at all: they provide the "explanatory connection"
between a tree (a result of a statistical analysis) on the one hand
and a phylogenetic history on the other (Morrison 2012). However, this
certainly doesn't mean you have to separate apomorphies from
plesiomorphies in order to _infer_ a phylogeny, and methods that
attempt to do such a thing (different variants of parsimony) are
actually routinely outperformed by methods that do not. In fact, even
the ancestral state reconstruction might be better left to
probabilistic methods that can combine the uncertainty about the
existence of any particular node with the uncertainty about the
ancestral state on that node.
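For comparison, this is roughly what the parsimony side looks like. The sketch below is a bare-bones Fitch down-pass on trees written as nested tuples (a representation I made up for this example); it returns the candidate state set at the root and the minimum number of changes, and it is not meant to stand in for any real parsimony program.

```python
def fitch(node):
    """Fitch down-pass on a binary tree given as nested 2-tuples.

    Leaves are observed states (ints). Returns (state_set, n_changes).
    """
    if not isinstance(node, tuple):
        return {node}, 0
    left_set, left_cost = fitch(node[0])
    right_set, right_cost = fitch(node[1])
    common = left_set & right_set
    if common:
        # children agree on at least one state: no extra change needed
        return common, left_cost + right_cost
    # children disagree: take the union and count one change
    return left_set | right_set, left_cost + right_cost + 1


# Outgroup (state 0) plus the four ingroup taxa from the example
with_grouping = (0, ((0, 0), (1, 1)))       # zero-zero clade present
without_grouping = (0, (0, (0, (1, 1))))    # zero-zero clade absent

print(fitch(with_grouping))
print(fitch(without_grouping))
```

Both trees need a single change, which is why the zero-zero grouping can't survive on a parsimony consensus tree: nothing but a symplesiomorphy holds it together.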

*There are good articles that give a much more complete picture about
statistical phylogenetics. Lewis (1998) is excellent and the
likelihood chapter in the recent second edition of _Phylogenetics:
Theory and Practice_ (Wiley & Lieberman 2011) also does a good job
explaining the method, although the book has been criticized as being
too parsimony-oriented (Morrison 2012). Paul O. Lewis's lecture slides
cover pretty difficult concepts (such as MCMC) in a very intuitive
way; they are freely available on the web.

> What makes NJ (and UPGMA and WPGMA...) non-phylogenetic is that they take
> all the differences between any two taxa, average them into a single number
> (the percentage of similarity), assemble these into a distance matrix, and
> then work on the distance matrix. There is no attempt in there to
> distinguish synapomorphies from symplesiomorphies. That's why these
> algorithms can be, and are, used for entirely non-phylogenetic problems like
> the similarities between faunas at different sites.

I strongly disagree. I can see how UPGMA or WPGMA can be used to
cluster entities that don't have any phylogenetic history, but there
is no way you can do it using, say, NJ with the HKY85 distance
correction. You can't interpret the results as anything other than a
phylogenetic hypothesis. It doesn't matter that NJ doesn't distinguish
synapomorphies from plesiomorphies -- neither does ML or BI (see
above), both of which demonstrably outperform the methods that do. The
important thing is that NJ attempts to infer evolutionary distances
from observed distances, which would make no sense at all if the data
haven't evolved on a tree.
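Here is a small Python sketch of what such a distance correction does. I'm using JC69 instead of HKY85 because its correction is a one-line formula; the two sequences are invented for the example and the function names are my own.

```python
from math import log


def p_distance(seq_a, seq_b):
    """Observed proportion of sites that differ between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return diffs / len(seq_a)


def jc69_distance(p):
    """JC69-corrected distance (expected substitutions per site).

    Undefined once the observed distance reaches the saturation
    level of 0.75 expected for unrelated sequences.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("JC69 distance is undefined for p >= 0.75")
    return -0.75 * log(1.0 - 4.0 * p / 3.0)


# Made-up aligned sequences, for illustration only
p = p_distance("ACGTACGTAC", "ACGTTCGTAA")
d = jc69_distance(p)
print(f"observed p = {p:.3f}, JC69-corrected d = {d:.3f}")
```

The corrected distance is always at least as large as the observed one, because the model adds back the multiple hits at the same site that the raw comparison can't see. That step presupposes a substitution process running along the branches of a tree, so it has no meaningful counterpart for something like faunal lists.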

> That must be why people use ML and BI instead of phenetic methods these
> days.

I don't think that's the reason. The loss of information about
individual characters is a big downside of distance-based methods
(more so in morphology, though; there just aren't that many
interesting things you can say about evolution of individual
nucleotides) and ML or BI certainly perform better than NJ, but they
don't care about synapomorphies either. In fact, there are die-hard
cladists who would call them "phenetic" as well for this precise
reason.

> Fair enough; that's why ML and BI were developed.

ML wasn't developed to relax the assumptions of parsimony; parsimony
was developed as a fast approximation to ML.

> I'll have to read those; however, I don't see a reason to assume a priori
> that any two characters (that aren't correlated) would evolve at the same
> speed ( = have the same set of branch-length parameters).

That's where the statistical approach to phylogenetics is useful.
Let's suppose you are right. Then we might indeed want to give a
separate set of branch lengths to each character in the data set, and
it's mathematically guaranteed that by doing this, we arrive at
parsimony: Tuffley & Steel (1997) proved that an ML analysis with a
different (= separately parameterized) JC69 model for each character
will give the same results as unweighted parsimony (unless, as
discovered by a later study, you impose certain restrictions on the
substitution process). This is sometimes referred to as "no common
mechanism" (NCM). Is it a good idea? Some have argued that it is,
because the NCM model is so general that it would apply to almost any
data set (Farris 2008). And that's generally true of statistical
models: the more parameters they have, the more realistic they are.

However, it has a huge drawback: in order to actually capture reality,
the estimates of those parameters must be close to their real values,
and "the power of a given amount of data to estimate several
parameters accurately is generally low" (Steel 2005:309). If you have
2,000 sites from a single locus (... OK, it's a big locus), you might
be able to estimate the 9 free parameters of one GTR+gamma model
common to all of the sites (5 relative rate parameters, 3 frequency
parameters, and the alpha parameter of the gamma distribution) with a
reasonable level of precision. However, the NCM model described above
needs 2,000 parameters estimated accurately from the same amount of
data, which is madness. That's why the information criteria used in
the model choice (AIC, BIC) compare not only the realism of different
models (likelihoods) but also their simplicity (number of parameters),
and why they try to find a reasonable trade-off between the two. It
can be proved that AIC will _never_ choose the NCM model (Holder et
al. 2010).
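The trade-off is easy to see in code. The sketch below computes AIC = 2k - 2 ln L, counting the free substitution parameters plus the 2n - 3 branch lengths of an unrooted binary tree. The log-likelihoods and the number of taxa are placeholders I invented, not results from any real analysis.

```python
# Free parameters of the substitution model alone (no rate heterogeneity):
# JC69 has none, HKY85 has kappa + 3 frequencies, GTR has 5 rates + 3 frequencies.
SUBSTITUTION_PARAMS = {"JC69": 0, "HKY85": 4, "GTR": 8}


def n_free_params(model, gamma, n_taxa):
    """Substitution parameters, optional gamma shape, and branch lengths
    of an unrooted binary tree shared by all sites."""
    branch_lengths = 2 * n_taxa - 3
    return SUBSTITUTION_PARAMS[model] + (1 if gamma else 0) + branch_lengths


def aic(log_likelihood, n_params):
    """Akaike information criterion; lower values are preferred."""
    return 2.0 * n_params - 2.0 * log_likelihood


# Hypothetical log-likelihoods for a 10-taxon data set
candidates = {
    "JC69": (-5200.0, n_free_params("JC69", False, 10)),
    "GTR+gamma": (-5100.0, n_free_params("GTR", True, 10)),
}
for name, (lnl, k) in candidates.items():
    print(f"{name}: k = {k}, AIC = {aic(lnl, k):.1f}")
```

Each extra parameter costs two AIC units, so it has to raise the log-likelihood by more than one unit to pay for itself.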

Surprisingly, it's not that the gain in likelihood is compromised by
the huge number of parameters: there is no gain in likelihood at all.
Huelsenbeck et al. (2011) took six empirical data sets and used MCMC
to calculate their marginal likelihoods given the fixed (MP) topology
and a variety of models including JC69, HKY85, GTR+gamma, and their
NCM versions. In every single case the marginal likelihoods of the
best common-mechanism models (usually HKY85+gamma or GTR+gamma)
exceeded the likelihood of any submodel of the NCM model by many
orders of magnitude. Complex submodels of the NCM often outperformed
oversimplified CM models such as JC69, but _only_ when they allowed
branch lengths to be shared across sites. So there is a very good
reason to assume the common mechanism: it fits the data better than
the NCM.

>> On the other hand, you can use a model to correct the data for
>> unobserved changes, just as with neighbor-joining, and subject the
>> resulting data matrix to a parsimony analysis (= to a
>> maximum-likelihood analysis using one of the "parsimony models").
>> Steel et al. (1993) described how to do it, it's still called
>> parsimony, nobody does it. Apparently it's philosophically
>> objectionable.
> Huh. Maybe it was too computation-intensive for 1993, so people forgot about
> it?

Maybe, but the philosophical criticism seems to be more common
(Siddall & Kluge 1997; Doyle & Davis 1998).

> Oh, so long-branch repulsion would be expected, right?

Actually, no, and that was the whole point of the paper: there is no
such thing as long-branch repulsion. If there is little enough data,
ML may fail to unite the long branches in the Farris zone -- it finds
the correct tree in over 30% of cases, because there are 3 possible
unrooted trees for 4 taxa and not enough information on the short
internal branch to choose between them. It does not fail because of
"long-branch repulsion"; it fails because it isn't prone to
long-branch attraction. On the other hand, parsimony can't distinguish
the handful of synapomorphies on the short internal branch from
homoplasies on the two long terminal branches, so it sticks the long
branches together with ridiculously inflated support -- and thus finds
the right tree more often than ML. However, unlike parsimony in the
Felsenstein zone, ML isn't inconsistent in the Farris zone, and if you
give it more data, it _will_ converge on the correct topology sooner
or later.
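The "3 possible unrooted trees" figure comes from the standard count of unrooted binary trees for n labelled taxa, (2n - 5)!!. A short Python sketch (the function name is mine):

```python
def n_unrooted_trees(n_taxa):
    """Number of unrooted binary trees on n labelled tips: (2n - 5)!!"""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # multiply the odd numbers 3, 5, ..., 2n - 5
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


for n in (4, 5, 10):
    print(n, n_unrooted_trees(n))
```

With only three candidate trees, a method that has no information about the internal branch and picks among them without bias will be right about a third of the time.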

"This behavior of parsimony in the extreme regions of the Felsenstein
and inverse-Felsenstein zones is analogous to an oracle who responds
to any question by responding “0.492.” If the question asked is, “What
is the sum of 0.450 and 0.042?” or “What is 3 times 0.164?” the oracle
will answer correctly, but presumably once interrogators realized that
the answer was always the same regardless of the question, they would
not be ready to give up their electronic calculators. There are times
when “I don’t know” is a better answer than a confident guess that has
a high probability of being incorrect."

-- Swofford et al. 2001:535

> So its bias to long-branch attraction is strong enough to overcome
> long-branch repulsion. Good to know.

I think you misunderstand what long-branch repulsion was supposed to
be (I don't say "what long-branch repulsion is", because the
phenomenon simply doesn't exist). Siddall (1998) coined the term for
an alleged bias of ML towards keeping long branches apart -- it was
never supposed to be a problem for parsimony, or something that
parsimony would have to overcome, but a property of maximum
likelihood.

>> UPGMA grouped the short branches together (also correctly) because of
>> their symplesiomorphies.
> Were there enough of those left, or were they independent reversals?

They were genuine symplesiomorphies. UPGMA found the correct topology
by grouping the short branches together, not the long ones, so there's
no need to worry about homoplastic reversals.

>> When the model is misspecified, posterior probabilities
>> can be either inflated or too conservative.
> Then apparently the former happens a lot. If you look through publications,
> almost every node in a Bayesian tree has a PP of 0.99 or 1.00.

Well, not all high PPs are necessarily inflated!

> Estimating models clearly isn't a nontrivial problem. Unfortunately, I don't
> know how the successor to ModelTest does it, or what happened to
> MrModelTest.

I suppose that should read "is a nontrivial problem". That's true,
although there is some hope that the problem could be avoided entirely
with reversible jump MCMC, which makes it possible to move among models
with different numbers of parameters. This way the model of evolution
itself becomes just another random variable, and the results of the
analysis can be averaged across its multiple values.

> Thanks! I'll check them out from Monday onwards.

You're welcome, glad to be of any help. :-)


Doyle JJ, Davis JI 1998 Homology in molecular phylogenetics: a
parsimony perspective. 101–31 _in_ Soltis DE, Soltis PS, Doyle JJ,
eds. _Molecular Systematics of Plants II: DNA Sequencing_. Boston:
Kluwer Academic Publishers

Farris JS 2008 Parsimony and explanatory power. Cladistics 24: 825–47

Galtier N 2001 Maximum-likelihood phylogenetic analysis under a
covarion-like model. Mol Biol Evol 18(5): 866–73

Holder M, Lewis PO, Swofford DL 2010 The Akaike information criterion
will not choose the no common mechanism model. Syst Biol 59(4): 477–85

Huelsenbeck JP, Alfaro ME, Suchard MA 2011 Biologically-inspired
phylogenetic models strongly outperform the no-common-mechanism
model. Syst Biol 60(2): 225–32

Lee MSY, Worthy TH 2011 Likelihood reinstates _Archaeopteryx_ as a
primitive bird. Biol Lett 8(2): 299–303

Lewis PO 1998 Maximum likelihood as an alternative to parsimony for
inferring phylogeny using nucleotide sequence data. 132–63 _in_ Soltis
DE, Soltis PS, Doyle JJ, eds. _Molecular Systematics of Plants II: DNA
Sequencing_. Boston: Kluwer Academic Publishers

Matsen FA, Steel M 2007 Phylogenetic mixtures on a single tree can
mimic a tree of another topology. Syst Biol 56(5): 767–75

Morrison DA 2012 [Review of] Phylogenetics: The Theory and Practice of
Phylogenetic Systematics, 2nd edition. Syst Biol

Siddall ME 1998 Success of parsimony in the four-taxon case:
long-branch repulsion by likelihood in the Farris zone. Cladistics 14:

Siddall ME, Kluge AG 1997 Probabilism and phylogenetic inference.
Cladistics 13: 313–36

Steel M 2005 Should phylogenetic models be trying to ‘fit an elephant’?
Trends Genet 21(6): 307–9

Swofford DL, Waddell PJ, Huelsenbeck JP, Foster PG, Lewis PO, Rogers
JS 2001 Bias in phylogenetic estimation and its relevance to the
choice between parsimony and likelihood methods. Syst Biol 50(4):

Tuffley C, Steel M 1997 Links between maximum likelihood and maximum
parsimony under a simple model of site substitution. Bull Math Biol
59: 581–607

Wang HC, Spencer M, Susko E, Roger AJ 2007 Testing for covarion-like
evolution in protein sequences. Mol Biol Evol 24(1): 294–305

Wiley EO, Lieberman BS 2011 _Phylogenetics: The Theory and Practice of
Phylogenetic Systematics, 2nd edition_. Hoboken: Wiley-Blackwell

David Černý