
Re: Learning cladistics (was Re: Dinosaur Web Pages' Re-Opening)



First, I would like to point out, I actually agree that cladistic analysis
is the best method currently available for estimating phylogenies.  I use
it regularly.  I just think it can be improved upon.

[Note: the main meat of this article is in the section on statistics,
further down; the earlier part deals only with minor matters of relative
weighting.]

At 08:50 PM 9/9/97 -0500, Jonathan R. Wagner wrote:
>Stan (the man) Friesen writes:
>>True, but the question is, are character transitions the only hypotheses
>>implied by a cladistic analysis?
>        No. As stratocladists point out, assumptions of the fragmentary
>nature of the fossil record are also implied. However, since these
>hypotheses are easily demonstrated to be more likely than not, this is
>not, IMHO, such a far-out concept.

True.  But I would much prefer to estimate the completeness with respect to
some set of lineages than to assume it either way.  One can certainly use
continuity of the fossil record *within* each subclade as a surrogate for
completeness.  In other words, do we really need to *assume* either the
completeness *or* the incompleteness of the fossil record?

Also, other things being more or less equal (that is, within a few steps),
the cladogram that requires fewer ghost lineages is, IMHO, more likely.

>(amongst other agendas). However, it does not take very complex 
>algebra to determine that we simply do not have enough of a fossil
>record preserved for terrestrial vertebrates to falsify the concept 
>of ghost lineages ...

I think this depends on the group.

For birds I have no problem postulating tens of millions of years of ghost
lineage.

A comparable-length ghost lineage of prosauropod-like forms that are
nonetheless not prosauropods is another matter (which is what you get if
some prosauropod group is not ancestral to sauropods).

>>dichotomies,
> ...
>        It is likely, however, that as one views the history of life 
>at a wider and wider scope, the relations amongst *major* taxa will
>be dichotomous.

OK, I will concede this.

It is at the mid- to small scale that the issue is most significant.  For
instance, in the relationships among the various hadrosaurs, or the various
bullatosaur subgroups.
>
>>For instance, *real* evolution almost certainly is NOT
>>restricted to just dichotomous branching - species tend to bud
>     Also, I should point out that the idea of "species budding" is
>incompatible with the Biological Species Concept. It seems clear that 
>the BSC dictates that the departure of a "coherent" body of genetic 
>material (one or more populations) from the interbreeding whole 
>nullifies the species as defined ("a group of populations linked by 
>continuous interbreeding").

I would add the phrase "at a given time" to this definition.

>Thus, although the larger collection of populations may not be 
>affected morphologically, it is BY DEFINITION a new species just as 
>the "off-branching" population(s) is/are.

I would not define things so.  That leads to strange results.  The set of
populations which maintains the most continuity with the preceding set of
populations is best treated as belonging to the same species.  Now, if
more or less equal branching turns out to be more common than now seems
likely, this point would have validity, since then deciding which new
population set continues the old set becomes arbitrary.

The unequal branching model of speciation developed by Ernst Mayr, and
used as the basis of punctuated equilibrium theory by Eldredge and Gould,
almost invariably involves some sort of break in continuity by one population
(change in local habitat, catastrophic isolation, polyploidy [often with
hybridization - oops, there goes single ancestry], peripheral habitat
fragmentation, etc.).

>        Gripping hand is, however, if you expect to speak of species 
>as real natural units, you must first consider if they are or not...

I would consider them so, at any given time.
>
>>and the parent species may easily be polymorphous for traits
>>that become fixed in descendents
>   I do not find this to be a problem with cladistics any more than I
>find the recapitulation of phylogeny in ontogeny to be a problem. Both 
>are considerations, neither is fatal.

I agree it isn't *fatal*.  It just means that the results are less robust
than one might wish.

I wish more cladistic analyses took this into consideration.  Even most of
those I have seen that attempt to treat it do not, IMHO, go quite far
enough with it.  I have seen at most one study that actually explicitly
deduced polymorphisms for the interior (ancestral) nodes.
>
>>However my real beef is with the lack of statistical testing.
>        [The following is merely my opinion. Please, anyone who has
>knowledge of this problem (like biological knowledge, not just
>computer-modelling knowledge), please speak up!]
>        Cladistics is not a statistical procedure. Parsimony is a 
>method of choosing between phylogenetic trees.

But what I am saying is that choosing between trees of closely spaced
parsimony *should* be a statistical procedure!  Due to sampling limitations
(choice of a subset of the available characters, incompleteness of the
fossil record, incompleteness of some character sets, and so on) a certain
amount of random variation in parsimony scores is unavoidable!  Indeed, the
mere act of choosing a set of characters is in itself sufficient to
partially randomize the parsimony score.

This means that judging the utility of a parsimony score requires some
estimate of the expected variation.  Trees that differ in parsimony by less
than the expected range of variation simply *cannot* be validly
distinguished by the method!  Acting as if such relatively small
differences really mean anything is simply invalid reasoning.  If the
differences *could* be the result of random variation due to character
choice and incompleteness, then one cannot really conclude that the
differences are due to any other cause.

On this I am very emphatic.  I am, by training, a statistician (even if I
am not earning my living that way).  Short of doing a cladogram based on
whole-genome DNA sequences that includes every known species in a group, I
cannot think of any way to avoid *some* random variation in parsimony scores.
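
To make the sort of variation I mean concrete, here is a rough sketch in
Python of one way to estimate it: bootstrap-resample the characters (the
columns of the data matrix) and look at the spread of parsimony scores for
a single fixed tree.  The tree, taxon names, and matrix below are invented,
and the step-counting is just the textbook Fitch count for a rooted binary
tree, so treat this as an illustration of the idea rather than anyone's
published method.

import random

def fitch_length(tree, states):
    """Return (state_set, extra_steps) for one character on a rooted binary tree.
    `tree` is a leaf name (str) or a (left, right) tuple;
    `states` maps leaf name -> character state."""
    if isinstance(tree, str):                      # leaf: its own state, no steps
        return {states[tree]}, 0
    (lset, lsteps), (rset, rsteps) = (fitch_length(t, states) for t in tree)
    common = lset & rset
    if common:                                     # children agree: no extra step
        return common, lsteps + rsteps
    return lset | rset, lsteps + rsteps + 1        # children disagree: one step

def tree_score(tree, matrix):
    """Sum of Fitch steps over all characters (columns of `matrix`)."""
    n_chars = len(next(iter(matrix.values())))
    return sum(
        fitch_length(tree, {t: matrix[t][i] for t in matrix})[1]
        for i in range(n_chars)
    )

# Toy data: four taxa scored for eight binary characters (entirely made up).
matrix = {
    "A": [0, 0, 0, 1, 0, 1, 0, 0],
    "B": [0, 1, 0, 1, 1, 1, 0, 1],
    "C": [1, 1, 1, 0, 1, 0, 1, 1],
    "D": [1, 1, 1, 0, 0, 0, 1, 0],
}
tree = (("A", "B"), ("C", "D"))

# Bootstrap: resample characters with replacement, rescore the same tree,
# and look at the spread of the resulting parsimony scores.
random.seed(1)
n_chars = len(matrix["A"])
scores = []
for _ in range(1000):
    cols = [random.randrange(n_chars) for _ in range(n_chars)]
    resampled = {t: [matrix[t][c] for c in cols] for t in matrix}
    scores.append(tree_score(tree, resampled))

mean = sum(scores) / len(scores)
sd = (sum((s - mean) ** 2 for s in scores) / (len(scores) - 1)) ** 0.5
print("observed score:", tree_score(tree, matrix))
print("bootstrap mean: %.2f, sd: %.2f" % (mean, sd))

The same resampling spread could then be compared against the score
difference between two competing trees.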

> It cannot tell you how likely it is
>that any one is right, in part because we have no objective way of 
>telling.

It is true that Maximum Likelihood Estimation is in general the best method
for solving this kind of problem.  But it is not the only one.  Simply
estimating the expected variance in the parsimony score, and then treating
all values within about 2*sigma of the best score as effectively equal
would be satisfactory.  (I would have to look up the correct multiplier for
a one-sided 95% confidence interval: the multiplier 1.96 commonly seen is
for two-sided confidence intervals).
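
(For reference, the one-sided 95% multiplier works out to about 1.645; it
can be read straight off the standard normal quantile function, e.g. in
Python with nothing beyond the standard library:

from statistics import NormalDist

z_one_sided = NormalDist().inv_cdf(0.95)    # ~1.645, one-sided 95% bound
z_two_sided = NormalDist().inv_cdf(0.975)   # ~1.960, two-sided 95% interval
print("one-sided: %.3f, two-sided: %.3f" % (z_one_sided, z_two_sided))

).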

>This is the same difficulty as in George's desire to test cladistics 
>using artificially generated trees.


> In statistics, some of the more basic
>confidence interval tests presume that the data are normally 
>distributed, or at least some statistics are provided from which we
>may make assumptions which often prove sufficient.

There are some non-parametric estimation methods, though I have not figured
out any way to apply them in this case.

However, many of the normal-distribution based tests are actually quite
robust to deviations from normality.  The main effect is often just to
change the real confidence level - that is, the notional 95% confidence
level is actually some other value, like 92%.  Often the *power* of the
test is affected more than the confidence level.

Given the sources of variation in parsimony scores I discuss above,
assuming *approximate* normality of the error is actually a reasonable
approach (I think one can invoke some variant of the central limit
theorem).
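
A quick simulation illustrates the robustness point (the exponential
distribution, sample size, and trial count are arbitrary choices for
illustration, nothing specific to cladistic data): build nominal 95%
confidence intervals for a mean from strongly skewed data and see how far
the actual coverage drifts from 95%.

import random

def covers_true_mean(n=30, true_mean=1.0, z=1.96):
    # Skewed (exponential) data with a known mean, then the usual
    # normal-theory interval for the sample mean.
    xs = [random.expovariate(1.0 / true_mean) for _ in range(n)]
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs) / (n - 1)
    half = z * (s2 / n) ** 0.5
    return (m - half) <= true_mean <= (m + half)

random.seed(2)
trials = 20000
hits = sum(covers_true_mean() for _ in range(trials))
print("nominal 95%% interval, actual coverage: %.1f%%" % (100.0 * hits / trials))

The actual coverage comes out a few percentage points under the nominal
95%, which is the kind of shift I mean.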

> Without a knowledge of the basic structure of
>phylogeny, what can we test our trees against? What assumptions can 
>we make?

There have been some papers discussing this issue with regard to DNA-based
trees.  I just have not yet figured out how to extend these methods to
character-based trees.

>        Does this mean that a two or three step difference in a small
>dataset is actually more significant? Or are we assuming a minimum 
>dataset size. Careful, Stan, I got blasted from all corners last time 
>I suggested any such thing (and quite rightly, I might add).

Oops.  I did goof here.

In fact, with a sufficiently small data set one may be unable to validly
distinguish among any trees at all!!  I plumb forgot about the deleterious
effect of a small sample on power!

> Once again, you are applying
>the model of statistics to what,

No, I forgot a basic rule of thumb in statistics: one needs an adequate
sample to get useful results.  I was concentrating so hard on significance
I forgot power!  (There are even methods of estimating how large a sample
one needs to reach a certain level of distinction).
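
As an illustration of that kind of estimate, here is the usual
normal-approximation sample-size formula for detecting a given mean
difference, with entirely made-up numbers standing in for the per-character
spread and the difference to be resolved (so a sketch, not a recipe for any
particular data set):

from statistics import NormalDist

def required_sample(delta, sigma, alpha=0.05, power=0.80):
    """Approximate n for a one-sided test to detect a mean difference `delta`
    when each observation has standard deviation `sigma`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    return ((z_alpha + z_beta) * sigma / delta) ** 2

# e.g. to resolve a mean difference of 0.1 steps per character when each
# character contributes about 0.5 steps of spread (invented numbers):
print("characters needed: %.0f" % required_sample(delta=0.1, sigma=0.5))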

I get three whacks with a wet noodle!  I should definitely have known better.

> IMHO, is an inherently unstatistical
>procedure. Although the sum of character transformations does affect 
>the parsimony algorithm, this is not really a mathematical procedure. 

Statistics is about events and measurement, not mathematical procedures.

>        For example, in one dataset, the blastopore forming the anus may be
>the one character transformation which unites Deuterostoma [sic?], but boy
>is it a doozy!

Well, except that the state of this character is intermediate in some minor
phyla, and is even variable at the species level in some groups! [This case
is simply not as clear cut as you might think].

>      The point of parsimony is, however, to choose one tree. ...
>        Did I miss your point?

Yes.  I am saying one cannot always validly choose only one tree given the
available data.

>        And how, pray, does one calculate this? I just had Baby 
>Stats, I can take the math...

I wish I knew!  The lack of such a thing for character-based cladistics is
its biggest weakness!  It may well be why we can get such divergent trees
for the same group from different authors!

>        DNA cladograms, amongst other things, are based on a very limited
>set of options, G, A, T, and C. This would seem to give them a much firmer
>base to work from. Also, some of the assumptions under which DNA cladists
>work are not the same ("molecular clocks" come to mind).

The most recent DNA statistical method I saw did not seem to depend on this
limited vocabulary of possible changes (though I admit I did not really
study it in depth to be sure).

>Missing data will *always* be a problem.

Yep.  The trick is to try to estimate *how* *much* of a problem!  That is
what statistics is for!
>
>>My main diffidence is in accepting weakly supported clades as real.  I tend
>>to treat them as an over-resolved polychotomy.
>        Y'know, this isn't all that bad a method. Despite what I said above
>about the "one or two steps" thing, I believe you are perfectly within your
>right to evaluate the characters supporting a node and consequently doubt
>the validity of the node. Indeed, the best thing you can do is recode or
>discard any characters for which you can justify such a treatment, then
>re-run the analysis. This is much better than the Altangerel et al. (the
>_Erlicosaurus_ skull paper) treatment of Holtz 1994a, where an entire tree
>topology was discarded because of perceived discrepancies in the coding of a
>few characters, yet no subsequent re-analysis was performed.
>

Yuck.  I agree here.

[Note, I would still prefer a proper statistical analysis, but the above is
the best I can do for now].


Actually, I did come up with one other "quick trick", just not a very
powerful one.  Compute the parsimony scores for the couple of hundred or so
best trees, and then look for any gaps in the covered range.  Then treat
the gap with the lowest score as the tentative "confidence limit".  I
*think* this will tend to somewhat over-estimate the confidence interval,
so anything below that gap should be safely significant.  (Of course many
data sets may have NO real gap, in which case this method is useless - it
is a crude, inadequate trick anyhow).
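
In code form the trick amounts to something like this (the scores and the
minimum gap size below are made up for illustration):

def gap_cutoff(scores, min_gap=2):
    """Return the score just below the lowest gap of at least `min_gap` steps,
    or None if the covered range has no such gap."""
    ordered = sorted(set(scores))
    for lo, hi in zip(ordered, ordered[1:]):
        if hi - lo >= min_gap:
            return lo            # everything at or below this score is "safe"
    return None

best_trees = [120, 120, 121, 121, 122, 123, 123, 127, 128, 129]  # toy scores
print(gap_cutoff(best_trees))    # -> 123 (a 4-step gap up to 127)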

--------------
May the peace of God be with you.         sarima@ix.netcom.com
                                          sfriesen@netlock.com