Re: Morpho v molecular (was Re: Tinamous: living dinosaurs)

> So do you string all the genes together, and analyze the concatenated
> multigene dataset as a single "supergene" using common parameters -
> and hope that all the quirks of individual genes "come out in the
> wash" to generate the "true" phylogeny.  Or do you try and deal with
> potential heterogeneities across different genes by partitioning the
> multigene dataset - and analyze the different genes separately, each
> with its own parameters, and combine them all at the end as a kind of
> gene 'supertree'.  Both approaches have their pros and cons, and each
> has its own proponents and detractors.  To be honest, I don't know
> which approach is better at capturing the true phylogeny.

Ask me again in two years, but until then I can only say that partitioned 
analyses at least allow you to exclude those genes where you find a bad 
signal-to-noise ratio (SNR), i.e. those which yield phylogenies that are 
categorically refuted by other evidence (fossils, for example).

Another problem with multigene datasets (under either method) is that they 
combine sequences evolving at different speeds. If you compare cytB and RAG-1 
and ND2 sequences for Accipitridae, you'll find that each locus has its best 
resolution (where support is consistently high except for very short branches) 
in a different part of the phylogeny, corresponding to a different period in 
time. Before that period (i.e. further from the present) the noise is too 
large. After that period (i.e. closer to the present) the signal becomes too 
weak. Either way the SNR drops off.

It is easy to see that, depending on your choice of loci, "more genes" can mean 
"making the SNR decline badly across all periods of time". If I combined 1 
kilobasepair of the three sequences above, I'd probably get worse results than 
I'd get from each locus individually: with individual loci, I'd get good 
resolution in a short period of time and crap resolution otherwise. Using loci 
with non-overlapping "optimal resolution periods", you'll have the crap 
resolution cancel out the good resolution.

For cytB, RAG-1/2 and ND2/4 in accipitrids, this problem is not so bad as to 
render the results problematic, but you can already see that some branches are 
less well resolved in the combined dataset than in the best-resolving 
individual locus. E.g. cytB resolves best back at about 5-20 Ma, whereas the 
other two don't (they evolve more slowly), and combining the three will 
decrease resolution at the intrageneric/sister genus levels.

If you go further back in time - say, the base of Neoaves - it gets risky 
indeed. The Hackett et al. study was monumental in its scope, but honestly 
nobody knows which of the numerous strange sister groupings they found are good 
and which are artefacts. And it's sad to see that the International 
Ornithological Congress 
fairly blindly relied on the results for their taxonomy. Because almost none of 
it has been tested (only "Coronaves/Metaves" has, and the results were not 

And this is another problem of these huge multigene studies: I'd prefer half 
the number of loci BUT ALSO some dedicated attempt at hypothesis-testing versus 
a huge number of loci but only the barest of testing (if any at all) at all 

The fact that we can do high-throughput sequencing doesn't alleviate the need 
to test, test, test. 

If you find a weakly-supported "clade" that is unusual, don't just claim it's 
good and true and submit your results to Science or Nature (which will publish 
any crap these days as long as it's not too obviously flawed and as long as 
it's sufficiently "novel" to splash the journal's name all across the daily 

Rather, try to determine whether you have actually found something that all 
previous researchers have overlooked, or whether it's simply due to some error 
somewhere. In the above example, remove either branch of the "clade" from the 
dataset and see what happens to the other. If you find that it groups in 
another position with a) better (quantitative) support in the DNA study and b) 
better (qualitative) support from the fossil record and from fossil and modern 
biogeography (essentially, minimizing the number of open-ocean crossings), your 
new "clade" should NOT be touted as genuine.

But this is not how it's being done. "Mihi itch" is not a disease of the past; 
people simply seem to like their own name behind a nomen too much to exercise 
restraint (the 6 authors of _Raptorex_ come to mind).

But as I said, ask me again in 2 years; hopefully I can then give some more 
detailed examples. Until then, you may want to consult the literature on 
Hoatzin relationships, where we had precisely this phenomenon: each locus 
yields a phylogenetic hypothesis with better-than-equivocal (but still not 
entirely convincing) support, but taken together and from a falsificationist 
perspective, they simply cancel each other out leaving NO satisfying hypothesis 
at all.

Perhaps the only feasible approach, until this phenomenon has been studied in 
detail in and by itself, is building simple additive supertrees from those 
clades which show extremely high support (95% and above) in each particular 
dataset. Passeriformes phylogeny could be well resolved by this approach 
(Jonsson & Fjeldsa supertree).
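
To make the "additive supertree" idea a bit more concrete, here is a rough 
Python sketch of the filtering step only. Everything in it is an assumption for 
illustration: the taxon and locus names are placeholders, the support values 
are invented, all datasets are assumed to sample the same taxa, and the rule of 
discarding strongly supported clades that contradict each other is my own 
simplification, not a published method. It does not build the tree itself; it 
only picks the clades one would be willing to build it from.

```python
from itertools import combinations


def compatible(a, b):
    """Two clades can coexist in one tree if they are nested or disjoint."""
    return a <= b or b <= a or a.isdisjoint(b)


def well_supported_clades(datasets, threshold=0.95):
    """Collect clades with support >= threshold from every dataset, then
    drop any clade that contradicts another clade passing the threshold.

    `datasets` maps a dataset name to a list of (taxa, support) pairs,
    where `taxa` is a set of taxon names and `support` is in [0, 1].
    Assumes all datasets sample the same taxa (a simplification).
    """
    strong = set()
    for clades in datasets.values():
        for taxa, support in clades:
            if support >= threshold:
                strong.add(frozenset(taxa))

    # Clades that are strongly supported but mutually contradictory
    # "cancel each other out" and are left unresolved.
    conflicting = set()
    for a, b in combinations(strong, 2):
        if not compatible(a, b):
            conflicting.update((a, b))

    return strong - conflicting


# Hypothetical input: two loci, five placeholder taxa, invented supports.
example = {
    "locus1": [({"A", "B"}, 0.99), ({"A", "B", "C"}, 0.70), ({"D", "E"}, 0.96)],
    "locus2": [({"A", "B"}, 0.97), ({"C", "D"}, 0.98), ({"D", "E"}, 0.60)],
}

kept = well_supported_clades(example)
print(sorted(sorted(clade) for clade in kept))
```

In this made-up case only the A+B grouping survives: D+E (locus1) and C+D 
(locus2) both pass the 95% cutoff but contradict each other, so neither goes 
into the supertree, and A+B+C never passes the cutoff in the first place.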