
Re: Morpho v molecular (was Re: Tinamous: living dinosaurs)

Mickey Mortimer <mickey_mortimer111@msn.com> wrote:

> This doesn't mean it's easy to combine genes in analyses.  As has been
> mentioned, genes need different models.  This slows down analyses, and
> you need to get them right.  I don't see how different evolutionary
> rates in different genes is an issue if you have enough genes to work
> with though.  Any noise should cancel itself out given enough data.
> The only way it wouldn't is if it's not random, and which molecular
> artifacts can cause that over multiple genes?  Base composition bias
> maybe, but that's easily identified.

The assumption that the noise should cancel itself out given enough
data sounds entirely reasonable.  This relies on the premise that the
phylogenetic signal is additive whereas the random noise is averaged
across multigene datasets.  However, how do you deal with non-random
noise - which strictly speaking isn't actually noise, but bias?
Biases can be caused by base composition, codon usage, or
transition-transversion ratios.  Such biases may be easy to identify
(especially if compositional), but what if these biases are actually
part of the phylogenetic signal itself?!  Can you (or should you)
discriminate between biases introduced by shared evolutionary history
(good) and biases that result from homoplasy (bad)?
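
To make "easy to identify" concrete, here is a minimal sketch of a
compositional check: a Pearson chi-square statistic for homogeneity of
base counts across taxa.  The function names and the dict-of-strings
alignment format are my own assumptions for illustration, not any
particular package's API.

```python
from collections import Counter

BASES = "ACGT"


def base_counts(seq):
    """Count A, C, G and T in one sequence, ignoring gaps and ambiguity codes."""
    counts = Counter(base for base in seq.upper() if base in BASES)
    return [counts[base] for base in BASES]


def composition_chi_square(alignment):
    """Pearson chi-square statistic for homogeneity of base composition.

    `alignment` is assumed to map taxon name -> aligned sequence.
    Returns (statistic, degrees_of_freedom).
    """
    table = [base_counts(seq) for seq in alignment.values()]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    if grand_total == 0:
        raise ValueError("alignment contains no unambiguous bases")

    statistic = 0.0
    for row, row_total in zip(table, row_totals):
        for observed, col_total in zip(row, col_totals):
            # Expected count if every taxon shared the pooled composition.
            expected = row_total * col_total / grand_total
            if expected > 0:
                statistic += (observed - expected) ** 2 / expected

    # Degrees of freedom assume all four bases occur somewhere in the alignment.
    dof = (len(table) - 1) * (len(BASES) - 1)
    return statistic, dof
```

A large statistic flags taxa whose composition departs from the pooled
average.  It cannot say whether two taxa share a skew through common
ancestry or acquired it independently, which is the distinction at
issue here.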

The thing to recall about bias (compositional, codon usage,
transition-transversion ratio) is that it's only homoplastic ('bad')
if it arises independently in two or more lineages, and therefore
qualifies as 'noise' (i.e., is not reflective of a shared evolutionary
history = the phylogenetic signal).  Bias can also occur by common
ancestry.  To use an analogy in morphology-based analyses, shared
adaptive traits can either arise from a common ancestor, or
convergently (homoplasy).  For example, are the shared aquatic
characters of hupehsuchians and ichthyosaurs a result of convergence
within two independently aquatic lineages, or the product of
inheritance from a common ancestor that possessed these aquatic
adaptations?  Would it be useful to exclude all aquatic-related
characters?  Similarly, are the aerial locomotor adaptations of colugos
(Dermoptera) and bats (Chiroptera) a product of shared ancestry, or
homoplastic?  Morphology- and molecular-based phylogenies disagree on
this point, with the latter finding that the shared aerial locomotor
characters must be homoplastic.

In molecular-based analyses, biases of any kind introduce a pattern (=
structure) in the dataset that is not random noise.  If you remove any
and all biases, you run the risk of removing some phylogenetically
informative characters.  Again, to give an analogy in the
morphological realm, one study of pterosaur affinities (Bennett, 1996)
removed hindlimb characters on the basis that they were functionally
correlated with bipedal, digitigrade locomotion in dinosaurs and
pterosaurs.  Unsurprisingly, the dinosaur-pterosaur link broke down
when hindlimb characters were excluded.

Your contention is that any homoplasy will eventually be swamped by the
phylogenetic signal by expanding the dataset, because adding more data
will override the random noise.  But this only works if (1) the noise
is random, and (2) there is sufficient phylogenetic signal to override
any noise.  This may not always be the case, and third codon positions
may be especially problematic.  As you know, in protein-coding genes,
each codon is composed of three bases.  The third bases of codons tend
to vary more than first bases, which in turn tend to vary more than
second bases.  This is due to relative levels of degeneracy of the
genetic code at these three positions.  For this reason, the third
codon position (the most rapidly changing position) is often
invaluable for discerning recent divergences.  However, for deep
divergences, the third positions become saturated.  At this point, the
third codon positions cease to be useful for phylogenetic analysis.
This means that a whole third of the dataset is not merely useless for
retaining the phylogenetic signal (more, if the first position also
becomes saturated), but acts as a potential source of homoplasy if the
base substitutions are not evenly distributed.  This non-random
'noise' at the third position can therefore create structure that
conflicts with the phylogenetic signal, especially in deep
divergences, yet is easily mistaken for it.
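
Saturation can be illustrated with the Jukes-Cantor (JC69) model, under
which the expected proportion of differing sites between two sequences
is p = 3/4 * (1 - exp(-4d/3)), where d is the number of substitutions
per site.  The sketch below is only a toy: the relative rates for the
three codon positions and the two divergence values are assumed,
illustrative numbers, and JC69 itself assumes equal base frequencies,
so it shows the loss of signal but not the added problem of bias.

```python
from math import exp


def jc69_expected_p(d):
    """Expected proportion of differing sites under JC69 after d substitutions per site."""
    return 0.75 * (1.0 - exp(-4.0 * d / 3.0))


# Hypothetical relative rates (third > first > second); not measured values.
RELATIVE_RATE = {"first": 1.0, "second": 0.5, "third": 5.0}

# Hypothetical divergences, in substitutions per site at first positions.
DIVERGENCE = {"recent": 0.05, "deep": 0.6}

for label, d in DIVERGENCE.items():
    for position, rate in RELATIVE_RATE.items():
        p = jc69_expected_p(d * rate)
        # As p approaches the 0.75 ceiling, further substitutions no longer
        # change the observed difference, so the position stops tracking time.
        print(f"{label:6s} {position:6s} p = {p:.3f}")
```

With these assumed rates the third position is the most informative of
the three at the recent divergence, and the closest to the 0.75 ceiling
at the deep one.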

> I don't disagree that working out the best method to combine genes is
> difficult or that developing these methods is more important than just
> throwing more genes into the analysis, but my point still stands that
> unanalysed genes are a huge resource for testing molecular phylogenies
> and that I don't know of any times (post-2001, as David notes) that
> adding more genes to an analysis has led to changing a well supported
> result to one newly congruent with morphological

But if the "well supported result" is itself an artifact of the genes
themselves (including the method by which the algorithm seeks to
extract the phylogenetic signal), then adding more genes is not likely
to overturn this result.  My argument is that unless the algorithm gets
it right, adding more genes will compound the problem - it will give
the same wrong tree, but with better support.
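
A deliberately crude way to see this: treat each gene as an independent
"vote" that favours the wrong grouping with probability q_wrong, and
ask how often the majority is wrong.  This is a toy model of my own, not
how any real concatenated or species-tree analysis works; the function
name and the values of q_wrong are assumptions chosen for illustration.

```python
from math import comb


def prob_majority_wrong(n_genes, q_wrong):
    """Probability that more than half of n_genes independent genes favour the wrong grouping."""
    if n_genes < 1 or n_genes % 2 == 0:
        raise ValueError("use a positive, odd number of genes to avoid ties")
    needed = n_genes // 2 + 1
    # Upper tail of a binomial(n_genes, q_wrong) distribution.
    return sum(
        comb(n_genes, k) * q_wrong**k * (1.0 - q_wrong) ** (n_genes - k)
        for k in range(needed, n_genes + 1)
    )


for q_wrong in (0.4, 0.6):  # signal slightly dominant vs. bias slightly dominant
    for n_genes in (1, 11, 101):
        p = prob_majority_wrong(n_genes, q_wrong)
        print(f"q_wrong={q_wrong}  genes={n_genes:3d}  P(majority wrong)={p:.3f}")
```

When the misleading pattern is the minority (q_wrong below 0.5), adding
genes drives the error towards zero, which is your point.  When a shared
bias tips each gene even slightly the other way, adding genes drives the
error towards certainty, which is mine.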

My concern is not that all molecular trees are bad - I don't believe
that at all.  My concern is the assumption that given enough taxa and
enough genes, a molecular analysis will always get it right.  Whereas
most nodes are recovered courtesy of the "true" phylogenetic signal,
others may be supported by homoplastic 'noise', which generates
structure that is not phylogenetic.