
- *To*: "Jonathan R. Wagner" <znc14@ttacs1.ttu.edu>
- *Subject*: Re: Learning cladistics (was Re: Dinosaur Web Pages' Re-Opening)
- *From*: Stanley Friesen <sarima@ix.netcom.com>
- *Date*: Wed, 10 Sep 1997 23:56:15 -0700
- *Cc*: dinosaur@usc.edu, wagner@ttacs1.ttu.edu, th81@umail.umd.edu
- *In-reply-to*: <01INGALVWEV68WYRVA@TTACS.TTU.EDU>
- *Reply-to*: sarima@ix.netcom.com
- *Sender*: owner-dinosaur@usc.edu

First, I would like to point out that I actually agree that cladistic analysis is the best method currently available for estimating phylogenies. I use it regularly. I just think it can be improved upon. [Note: the main meat of this article is in the section on statistics, further down; the earlier part is just minor matters of relative weighting.]

At 08:50 PM 9/9/97 -0500, Jonathan R. Wagner wrote:
> Stan (the man) Friesen writes:
>> True, but the question is, are character transitions the only
>> hypotheses implied by a cladistic analysis?
> No. As stratocladists point out, assumptions of the fragmentary
> nature of the fossil record are also implied. However, since these
> hypotheses are easily demonstrated to be more likely than not, this
> is not, IMHO, such a far-out concept.

True. But I would much prefer to analyze the completeness with regard to some set of lineages than to assume it either way. One can certainly use continuity of the fossil record *within* each subclade as a surrogate for completeness. In other words, do we really need to *assume* the completeness *or* the incompleteness of the fossil record?

Also, other things being more or less equal (that is, within a few steps), the cladogram that requires fewer ghost lineages is, IMHO, more likely.

> (amongst other agendas). However, it does not take very complex
> algebra to determine that we simply do not have enough of a fossil
> record preserved for terrestrial vertebrates to falsify the concept
> of ghost lineages ...

I think this depends on the group. For birds I have no problem postulating tens of millions of years of ghost lineage. A ghost lineage of comparable length, made up of prosauropod-like forms that are nevertheless not prosauropods, is another matter (and that is what you get if no prosauropod group is ancestral to sauropods).

>> dichotomies,
> ...
> It is likely, however, that as one views the history of life
> at a wider and wider scope, the relations amongst *major* taxa will
> be dichotomous.

OK, I will concede this.
It is at the mid- to small scale that the issue is most significant. For instance, in the relationships among the various hadrosaurs, or the various bullatosaur subgroups.

>> For instance, *real* evolution almost certainly is NOT
>> restricted to just dichotomous branching - species tend to bud
> Also, I should point out that the idea of "species budding" is
> incompatible with the Biological Species Concept. It seems clear that
> the BSC dictated that the departure of a "coherent" body of genetic
> material (one or more populations) from the interbreeding whole
> nullifies the species as defined ("a group of populations linked by
> continuous interbreeding").

I would add the phrase "at a given time" to this definition.

> Thus, although the larger collection of populations may not be
> affected morphologically, it is BY DEFINITION a new species just as
> the "off-branching" population(s) is/are.

I would not define things so. This gets into strange results. The set of populations which maintains the most continuity with the preceding set of populations is best treated as belonging to the same species. Now, if more or less equal branching turns out to be more common than now seems likely, this point would have validity, as then deciding which new population set continues the old set becomes arbitrary.

The unequal branching model of speciation developed by Ernst Mayr, and used as the basis of punctuated equilibrium theory by Eldredge and Gould, almost invariably involves a sort of break in continuity by one population (change in local habitat, catastrophic isolation, polyploidy [often with hybridization - oops, there goes single ancestry], peripheral habitat fragmentation, etc.).

> Gripping hand is, however, if you expect to speak of species
> as real natural units, you must first consider if they are or not...

I would consider them so, at any given time.
>> and the parent species may easily be polymorphous for traits
>> that become fixed in descendants
> I do not find this to be a problem with cladistics any more than I
> find the recapitulation of phylogeny in ontogeny to be a problem. Both
> are considerations, neither is fatal.

I agree it isn't *fatal*. It just means that the results are less robust than one might wish. I wish more cladistic analyses took this into consideration. Even most of those I have seen that attempt to treat it do not, IMHO, go quite far enough with it. I have seen at most one study that actually explicitly deduced polymorphisms for the interior (ancestral) nodes.

>> However my real beef is with the lack of statistical testing.
> [The following is merely my opinion. Please, anyone who has
> knowledge of this problem (like biological knowledge, not just
> computer-modelling knowledge), please speak up!]
> Cladistics is not a statistical procedure. Parsimony is a
> method of choosing between phylogenetic trees.

But what I am saying is that choosing between trees of closely spaced parsimony *should* be a statistical procedure! Due to sampling limitations (choice of a subset of the actual characters, incompleteness of the fossil record, incompleteness of some character sets, and so on) a certain amount of random variation in parsimony scores is unavoidable! Indeed, the mere act of choosing a set of characters is in itself sufficient to partially randomize the parsimony score.

This means that judging the utility of a parsimony score requires some estimate of the expected variation. Trees that differ in parsimony by less than the expected range of variation simply *cannot* be validly distinguished by the method! Acting as if such relatively small differences really mean anything is simply invalid reasoning. If the differences *could* be the result of random variation due to character choice and incompleteness, then one cannot really conclude that the differences are due to any other cause.
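[Illustrative sketch only, not a method worked out in this post. One hedged way to put a number on the "expected variation" due to character choice is to resample the characters with replacement (a bootstrap over characters) and watch how much the difference in parsimony score between two candidate trees moves around. The sketch assumes unordered characters scored with Fitch's algorithm on rooted binary trees; the names `fitch_steps`, `parsimony_score`, `bootstrap_score_difference`, and the toy matrix are all hypothetical.]

```python
import random
import statistics

def fitch_steps(tree, states):
    """Fitch's algorithm for one unordered character.
    tree: a taxon name (leaf) or a 2-tuple of subtrees.
    states: dict mapping taxon name -> observed state.
    Returns (possible ancestral states, number of changes)."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_steps = fitch_steps(tree[0], states)
    right_set, right_steps = fitch_steps(tree[1], states)
    shared = left_set & right_set
    if shared:
        return shared, left_steps + right_steps
    return left_set | right_set, left_steps + right_steps + 1

def parsimony_score(tree, matrix, columns):
    """Total number of steps over the chosen character columns."""
    total = 0
    for col in columns:
        states = {taxon: row[col] for taxon, row in matrix.items()}
        total += fitch_steps(tree, states)[1]
    return total

def bootstrap_score_difference(tree_a, tree_b, matrix, replicates=1000, seed=0):
    """Resample characters with replacement; return the mean and standard
    deviation of score(tree_b) - score(tree_a) across the replicates."""
    rng = random.Random(seed)
    n_chars = len(next(iter(matrix.values())))
    diffs = []
    for _ in range(replicates):
        cols = [rng.randrange(n_chars) for _ in range(n_chars)]
        diffs.append(parsimony_score(tree_b, matrix, cols)
                     - parsimony_score(tree_a, matrix, cols))
    return statistics.mean(diffs), statistics.stdev(diffs)

# Hypothetical toy data: four taxa, four binary characters.
toy_matrix = {
    "A": [0, 0, 0, 1],
    "B": [0, 0, 1, 1],
    "C": [1, 1, 1, 0],
    "D": [1, 1, 0, 0],
}
tree_a = (("A", "B"), ("C", "D"))
tree_b = (("A", "C"), ("B", "D"))

mean_diff, sd_diff = bootstrap_score_difference(tree_a, tree_b, toy_matrix)
print(mean_diff, sd_diff)
```

[If the observed difference in steps between two trees is small relative to the spread reported here, the data arguably cannot tell the two trees apart. The bootstrap only addresses variation from character choice, not incompleteness of the fossil record.]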
On this I am very emphatic. I am, by training, a statistician (even if I am not earning my living that way). Short of doing a cladogram based on whole-genome DNA sequences that includes every known species in a group, I cannot think of any way to avoid *some* random variation in parsimony scores.

> It cannot tell you how likely it is
> that any one is right, in part because we have no objective way of
> telling.

It is true that Maximum Likelihood Estimation is in general the best method for solving this kind of problem. But it is not the only one. Simply estimating the expected variance in the parsimony score, and then treating all values within about 2*sigma of the best score as effectively equal, would be satisfactory. (I would have to look up the correct multiplier for a one-sided 95% confidence interval: the multiplier 1.96 commonly seen is for two-sided confidence intervals.)

> This is the same difficulty as in George's desire to test cladistics
> using artificially generated trees.
> In statistics, some of the more basic
> confidence interval tests presume that the data are normally
> distributed, or at least some statistics are provided from which we
> may make assumptions which often prove sufficient.

There are some non-parametric estimation methods, though I have not figured out any way to apply them in this case. However, many of the normal-distribution-based tests are actually quite robust to deviations from normality. The main effect is often just to change the real confidence level - that is, the notional 95% confidence level is actually some other value, like 92%. Often the *power* of the test is affected more than the confidence level.

Given the sources of variation in parsimony scores I discuss above, assuming *approximate* normality of the error is actually a reasonable approach (I think one can invoke some variant of the central limit theorem).

> Without a knowledge of the basic structure of
> phylogeny, what can we test our trees against?
> What assumptions can we make?

There have been some papers discussing this issue with regard to DNA-based trees. I just have not yet figured out how to extend these methods to character-based trees.

> Does this mean that a two or three step difference in a small
> dataset is actually more significant? Or are we assuming a minimum
> dataset size. Careful, Stan, I got blasted from all corners last time
> I suggested any such thing (and quite rightly, I might add).

Oops. I did goof here. In fact, with a sufficiently small data set one may be unable to validly distinguish among any trees at all!! I plumb forgot about the deleterious effect on power of a small sample!

> Once again, you are applying
> the model of statistics to what,

No, I forgot a basic rule of thumb in statistics: one needs an adequate sample to get useful results. I was concentrating so hard on significance I forgot power! (There are even methods of estimating how large a sample one needs to reach a certain level of distinction.) I get three whacks with a wet noodle! I should definitely have known better.

> IMHO, is an inherently unstatistical
> procedure. Although the sum of character transformations does affect
> the parsimony algorithm, this is not really a mathematical procedure.

Statistics is about events and measurement, not mathematical procedures.

> For example, in one dataset, the blastopore forming the anus may be
> the one character transformation which unites Deuterostoma [sic?],
> but boy is it a doozy!

Well, except that the state of this character is intermediate in some minor phyla, and is even variable at the species level in some groups! [This case is simply not as clear-cut as you might think.]

> The point of parsimony is, however, to choose one tree.
...
> Did I miss your point?

Yes. I am saying one cannot always validly choose only one tree given the available data.

> And how, pray, does one calculate this? I just had Baby
> Stats, I can take the math...

I wish I knew!
The lack of such a thing for character-based cladistics is its biggest weakness! It may well be why we can get such divergent trees for the same group from different authors!

> DNA cladograms, amongst other things, are based on a very limited
> set of options, G, A, T, and C. This would seem to give them a much
> firmer base to work from. Also, some of the assumptions under which
> DNA cladists work are not the same ("molecular clocks" come to mind).

The most recent DNA statistical method I saw did not seem to depend on this limited vocabulary of possible changes (though I admit I did not really study it in depth to be sure).

> Missing data will *always* be a problem.

Yep. The trick is to try to estimate *how* *much* of a problem! That is what statistics is for!

>> My main diffidence is in accepting weakly supported clades as real.
>> I tend to treat them as an over-resolved polychotomy.
> Y'know, this isn't all that bad a method. Despite what I said above
> about the "one or two steps" thing, I believe you are perfectly
> within your right to evaluate the characters supporting a node and
> consequently doubt the validity of the node. Indeed, the best thing
> you can do is recode or discard any characters for which you can
> justify such a treatment, then re-run the analysis. This is much
> better than the Altangerel et al. (the _Erlikosaurus_ skull paper)
> treatment of Holtz 1994a, where an entire tree topology was discarded
> because of perceived discrepancies in the coding of a few characters,
> yet no subsequent re-analysis was performed.
> Yuck.

I agree here. [Note: I would still prefer a proper statistical analysis, but the above is the best I can do for now.]

Actually, I did come up with one other "quick trick", just not a very powerful one. Compute the parsimony scores for the couple of hundred or so best trees, and then look for any gaps in the covered ranges. Then treat the gap with the lowest score as the tentative "confidence limit".
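[Illustrative sketch only. The "quick trick" can be read as follows: sort the distinct parsimony scores of the best trees and report the first place where consecutive scores jump by more than one step. What counts as a "gap" is not defined in the post, so the `min_gap` threshold and the function name `lowest_gap` are assumptions, and the scores below are made up.]

```python
def lowest_gap(scores, min_gap=2):
    """Return (last score below the gap, first score above it) for the
    lowest gap in the sorted distinct scores, or None if there is no gap.
    A 'gap' is assumed to mean a jump of at least min_gap steps."""
    ordered = sorted(set(scores))
    for lower, upper in zip(ordered, ordered[1:]):
        if upper - lower >= min_gap:
            return lower, upper
    return None

# Hypothetical tree lengths for the best trees found by a search.
example_scores = [100, 100, 101, 102, 105, 106, 110]
print(lowest_gap(example_scores))
```

[Under this reading, trees scoring at or below the first value returned would be treated as not distinguishable from the best tree. A result of `None` corresponds to a data set with no real gap.]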
I *think* this will tend to somewhat over-estimate the confidence interval, so anything below the last gap should be safely significant. (Of course many data sets may have NO real gap, in which case this method is useless - it is a crude, inadequate trick anyhow.)

--------------
May the peace of God be with you.
sarima@ix.netcom.com
sfriesen@netlock.com

**References**: **Re: Learning cladistics (was Re: Dinosaur Web Pages' Re-Opening)**, *From:* "Jonathan R. Wagner" <znc14@ttacs1.ttu.edu>
