PAUP* activity

table of contents

  • expected learning outcome
  • getting started
  • exercise 1: basic data manipulation and parsimony analysis
  • exercise 2: distance and likelihood analyses

expected learning outcome

The objective of this activity is to help you become familiar with using PAUP* for several different analysis types including parsimony, distance, and likelihood. In addition you will also gain understanding of numerous fundamental aspects of phylogenetic analysis and models of molecular evolution.

getting started

The data set we will use for this PAUP* exercise is called simulated-dna-data.nex. It is a simulated data set consisting of DNA sequences for 9 organisms and 2500 sites (taxon 9 is the outgroup). You can assume that the simulation was a homogeneous Markov process (i.e., all sites evolving according to the same model, and the model does not change over the tree.

exercise 1: basic data manipulation and parsimony analysis

  1. Start PAUP*, executing the simulated-dna-data.nex data file.
    $ paup simulated-dna-data.nex; [ “$” represents the Linux/MacOS prompt ] (or)
    $ paup
    paup> exec simulated-dna-data.nex;
  2. Conduct an exact tree search to find the most parsimonious tree under simple parsimony with equal weights.
    paup> bandb; (for example; could also do alltrees)
    What is the length of the optimal tree?
    [toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]4269 steps[/toggle]
  3. Examine the topology of the tree.
    paup> showtrees; (for example; could also do describe)
  4. Change the outgroup status so that the only outgroup taxon is taxon 9 and re-examine the tree topology.
    paup> outgroup 9;
    (or)
    paup> outgroup ‘taxon_9’;

    • What happens to the shape of the tree?
      [toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]Tree is rooted at a different position.[/toggle]
    • Inspect the parsimony scores.
      paup> pscores; (could also do describe)
      What happens to the tree length?
      [toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]No change in tree length. The length under the parsimony criterion with any freely reversible character type is invariant to the position of the root.[/toggle]
  5. Do a heuristic search using the default seetings in PAUP*. Then perform another heuristic search using the random-addition-sequence method with 100 replicates.
    paup> hsearch;
    paup> hsearch/addseq=random nrep=100;
    Does it look like the optimal trees are hard to find for this data set?
    [toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]No, they are not hard to find. All 100 replicates find the same tree. (Note: this is not a typical result.)[/toggle]
  6. “Describe” the most-parsimonious tree as a phylogram.
    describe/plot=p;
  7. Perform a bootstrap analysis with 1,000 replicates and 10 random-addition-sequence replicates per bootstrap replicate.
    paup> bootstrap nreps=1000/addseq=random nreps=10;
    Which grouping on the most-parsimonious tree is least well supported?
    [toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]The 4,6 clade.[/toggle]
  8. What is the shortest tree that makes taxa 1 and 6 monophyletic with respect to the remaining taxa? (hint: use a constraint tree)
    paup> constraints 1and6 = ((1,6)); (“1and6” can be any name you like)
    Equivalently, but more typing:
    paup> constraints 1and6 = ((1,6),2,3,4,5,7,8,9);
    paup> hsearch enforce constraints=1and6;
    [toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]4273 steps[/toggle]

exercise 2: distance and likelihood analyses

  1. Reset all program options to their “factory defaults”.
    paup> factory;
  2. Perform neighbor-joining analyses using uncorrected (“p”), Jukes-Cantor, Kimura 2-parameter, HKY85, Tamura-Nei, and GTR distances. Examine the distance matrix in each case.
    paup> dset dist=jc; showdist; nj;
    paup> dset dist=k2p; showdist; nj;
    paup> dset dist=hky; showdist; nj;
    paup> dset dist=tamnei; showdist; nj;
    paup> dset dist=gtr; showdist; nj;

    • How does the distance model chosen affect the tree found by NJ for this data set?
      [toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]Changing the distance measure does not greatly affect the tree. For this example, the NJ tree is the same for the first 4 models, but the last tree is slightly different. This would not always be the case, depending on your data set.[/toggle]
    • How does the distance model chosen affect the magnitude of the pairwise distance estimates?
      [toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]In general, the more complex the model used for distance calculation, the greater the inferred distance, because it better corrects for unobserved substitutions (multiple hits).[/toggle]
    • Are smaller distances less affected or more affected by the choice of a distance model?
      [toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]Less affected. When taxa are more recently diverged, fewer multiple hits will have occurred and the distance correction has less effect.[/toggle]
  3. Using the Tamura-Nei distance, search for a tree under the minimum evolution (distance) criterion.
    paup> set criterion=distance;
    paup> dset objective=me dist=tamnei;
    paup> hsearch;

    What is the tree score under this criterion?
    [toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]ME score = 2.00162[/toggle]
  4. Now evaluate the tree using the Fitch-Margoliash weighted least-squares criterion (inverse squared weighting).
    paup> dscores/objective=ls power=2;
    What is the tree score under this criterion?
    [toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]SS = 0.03281 (APSD=3.062)[/toggle]
  5. Perform a heuristic search for an optimal tree under the likelihood criterion using the default model settings in PAUP*.
    paup> set criterion=likelihood;
    paup> hsearch;

    “Describe” the tree and examine the topology. How does it compare to the trees found previously using parsimony and distance methods?
    [toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]The likelihood score for the tree is -19327.97.[/toggle]
  6. Set up a model with unequal base frequencies and two substitution types with ti/tv ratio estimated from the data (LSet command). Use the “Likelihood Scores” (LScores) command to estimate the ti/tv ratio on the tree found in the previous step.
    paup> lset tratio=estimate basefreq=estimate;
    paup> lscores;

    (or you can do it all in one command)
    lscores/tratio=estimate basefreq=estimate;

    • What is the estimate of the ti/tv ratio?
      [toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]0.985[/toggle]
    • How does it compare to the previous value (i.e., the default setting)?
      [toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”] About half. (note that the default tratio is 2.0).[/toggle]
    • What effect did optimization of the ti/tv ratio have on the likelihood score for the tree?
      [toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”] Dramatic improvement: ln L = – 19114.70; over 200 log-likelihood units.[/toggle]
  7. Fix the ti/tv ratio and base frequencies to the values you estimated in the step above. Perform another heuristic search using the modified settings.
    lset tratio=previous basefreq=prev;
    hsearch;
    (note: in a real analysis, you would probably want to work harder than this)
    describe;
    Did the tree topology change?
    [toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”] No, the tree is the same.[/toggle]
  8. Explore several models that include among-site rate variation.
    lset tratio=estimate;
    lscores/rates=gamma shape=estimate basefreq=estimate pinv=0;

    gamma shape ==> 1.046, ln L = -18742.18
    lscores/rates=equal pinv=estimate;
    pinv ==> 0.200, ln L = -18857.50
    lscores/rates=gamma shape=estimate pinv=estimate;
    pinv ==> 9 ( 10-6, shape ==> 1.046, ln L = – 18742.18 )
    Based on the tree topology found in the previous search, does it appear that all sites are evolving at the same rate?
    [toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]Rates are not equal. Under any of the models with among-site rate variation, the likelihood-score improvement is huge with only one or two additional parameters.[/toggle] If not, what model of among-site rate variation do you think best explains the data?
    [toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]Rates follow a gamma distribution. In this case, adding an invariable-sites category to a model already containing gamma-distributed rates does not improve the fit.[/toggle]
  9. Try using simpler DNA substitution rate matrices (e.g., JC, F81, K2P).
    lset rates=gamma shape=est pinv=0; (i.e., model for rates selected above)
    lscores/nst=1 basefreq=equal;
    lscores/nst=1 basefreq=estimate;
    lscores/nst=2 basefreq=equal tratio=estimate;

    Can the data be explained adequately using a simpler DNA substitution rate matrix?
    [toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]Likelihood scores for these three models are -19230.54, -18970.60, and -19055.09, respectively. All of these are clearly significantly worse than the HKY+G model (>300 log-likelihood units difference or more, formal LRT would be overkill).[/toggle]
  10. Evaluate Tamura-Nei and GTR models and perform likelihood ratio tests; helpful: Chi-Square Calculator.
    lscores/basefreq=estimate rates=gamma shape=estimate nst=6 rmatrix=estimate; (GTR)
    lscores/rclass=(a b a a c a); (Tamura-Nei)
    GTR model: ln L = -18731.11
    TamNei model: ln L = -18732.66
    HKY model: ln L = -18742.18 (from above)
    delta(TamNei vs. HKY) = 2(18742.18 – 18732.66) = 19.04
    df = 1, P-value < 0.00001 (1-probability given by calculator above)
    delta(GTR vs. TamNei) = 2(18732.66 – 18731.11) = 3.10
    df = 3, P-value 0.3765 (1-probability given by calculator above)
    Is a more complex model than HKY85 (with among-site rate variation) justified according to a likelihood ratio test?
    [toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]HKY model strongly rejected in favor of Tamura-Nei. GTR model is unnecessarily complex (fails to reject Tamura-Nei).[/toggle]
  11. Because the data were simulated on a known phylogeny, we know what the correct tree topology is. Based on all of the previous analyses, which topology do you believe is the correct one? (hint: getting this right may require additional tree searching!)
    [toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”](re-estimate Tamura-Nei + G parameters)[/toggle] lscores/nst=6 rclass=(abaaca) rmat=estimate basefreq=est rates=gamma shape=estimate;
    lset basefreq=prev rmat=prev shape=prev;
    set criterion=likelihood;
    hsearch;

    This search gets the true tree that generated the data:
    true tree = (((((1,6),8),4),5),(2,(3,7)))
    (To continue the successive approximations approach, we re-estimate parameters on this new tree, and repeat the search using the new parameters.)
    lscores/rmatrix=estimate basefreq=estimate shape=estimate;
    ln L = -18732.50
    lset rmatrix=previous basefreq=previous shape=previous;
    hsearch;

    Same tree is found, so we stop.