## PAUP* activity

• expected learning outcome
• getting started
• exercise 1: basic data manipulation and parsimony analysis
• exercise 2: distance and likelihood analyses

#### expected learning outcome

The objective of this activity is to help you become familiar with using PAUP* for several different analysis types including parsimony, distance, and likelihood. In addition you will also gain understanding of numerous fundamental aspects of phylogenetic analysis and models of molecular evolution.

#### getting started

The data set we will use for this PAUP* exercise is called simulated-dna-data.nex. It is a simulated data set consisting of DNA sequences for 9 organisms and 2500 sites (taxon 9 is the outgroup). You can assume that the simulation was a homogeneous Markov process (i.e., all sites evolving according to the same model, and the model does not change over the tree.

#### exercise 1: basic data manipulation and parsimony analysis

1. Start PAUP*, executing the simulated-dna-data.nex data file.
\$ paup simulated-dna-data.nex; [ “\$” represents the Linux/MacOS prompt ] (or)
\$ paup
paup> exec simulated-dna-data.nex;
2. Conduct an exact tree search to find the most parsimonious tree under simple parsimony with equal weights.
paup> bandb; (for example; could also do alltrees)
What is the length of the optimal tree?
3. Examine the topology of the tree.
paup> showtrees; (for example; could also do describe)
4. Change the outgroup status so that the only outgroup taxon is taxon 9 and re-examine the tree topology.
paup> outgroup 9;
(or)
paup> outgroup ‘taxon_9’;

• What happens to the shape of the tree?
• Inspect the parsimony scores.
paup> pscores; (could also do describe)
What happens to the tree length?
[toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]No change in tree length. The length under the parsimony criterion with any freely reversible character type is invariant to the position of the root.[/toggle]
5. Do a heuristic search using the default seetings in PAUP*. Then perform another heuristic search using the random-addition-sequence method with 100 replicates.
paup> hsearch;
Does it look like the optimal trees are hard to find for this data set?
[toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]No, they are not hard to find. All 100 replicates find the same tree. (Note: this is not a typical result.)[/toggle]
6. “Describe” the most-parsimonious tree as a phylogram.
describe/plot=p;
7. Perform a bootstrap analysis with 1,000 replicates and 10 random-addition-sequence replicates per bootstrap replicate.
Which grouping on the most-parsimonious tree is least well supported?
8. What is the shortest tree that makes taxa 1 and 6 monophyletic with respect to the remaining taxa? (hint: use a constraint tree)
paup> constraints 1and6 = ((1,6)); (“1and6” can be any name you like)
Equivalently, but more typing:
paup> constraints 1and6 = ((1,6),2,3,4,5,7,8,9);
paup> hsearch enforce constraints=1and6;

#### exercise 2: distance and likelihood analyses

1. Reset all program options to their “factory defaults”.
paup> factory;
2. Perform neighbor-joining analyses using uncorrected (“p”), Jukes-Cantor, Kimura 2-parameter, HKY85, Tamura-Nei, and GTR distances. Examine the distance matrix in each case.
paup> dset dist=jc; showdist; nj;
paup> dset dist=k2p; showdist; nj;
paup> dset dist=hky; showdist; nj;
paup> dset dist=tamnei; showdist; nj;
paup> dset dist=gtr; showdist; nj;

• How does the distance model chosen affect the tree found by NJ for this data set?
[toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]Changing the distance measure does not greatly affect the tree. For this example, the NJ tree is the same for the first 4 models, but the last tree is slightly different. This would not always be the case, depending on your data set.[/toggle]
• How does the distance model chosen affect the magnitude of the pairwise distance estimates?
[toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]In general, the more complex the model used for distance calculation, the greater the inferred distance, because it better corrects for unobserved substitutions (multiple hits).[/toggle]
• Are smaller distances less affected or more affected by the choice of a distance model?
[toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]Less affected. When taxa are more recently diverged, fewer multiple hits will have occurred and the distance correction has less effect.[/toggle]
3. Using the Tamura-Nei distance, search for a tree under the minimum evolution (distance) criterion.
paup> set criterion=distance;
paup> dset objective=me dist=tamnei;
paup> hsearch;

What is the tree score under this criterion?
4. Now evaluate the tree using the Fitch-Margoliash weighted least-squares criterion (inverse squared weighting).
paup> dscores/objective=ls power=2;
What is the tree score under this criterion?
5. Perform a heuristic search for an optimal tree under the likelihood criterion using the default model settings in PAUP*.
paup> set criterion=likelihood;
paup> hsearch;

“Describe” the tree and examine the topology. How does it compare to the trees found previously using parsimony and distance methods?
6. Set up a model with unequal base frequencies and two substitution types with ti/tv ratio estimated from the data (LSet command). Use the “Likelihood Scores” (LScores) command to estimate the ti/tv ratio on the tree found in the previous step.
paup> lset tratio=estimate basefreq=estimate;
paup> lscores;

(or you can do it all in one command)
lscores/tratio=estimate basefreq=estimate;

• What is the estimate of the ti/tv ratio?
• How does it compare to the previous value (i.e., the default setting)?
• What effect did optimization of the ti/tv ratio have on the likelihood score for the tree?
7. Fix the ti/tv ratio and base frequencies to the values you estimated in the step above. Perform another heuristic search using the modified settings.
lset tratio=previous basefreq=prev;
hsearch;
(note: in a real analysis, you would probably want to work harder than this)
describe;
Did the tree topology change?
8. Explore several models that include among-site rate variation.
lset tratio=estimate;
lscores/rates=gamma shape=estimate basefreq=estimate pinv=0;

gamma shape ==> 1.046, ln L = -18742.18
lscores/rates=equal pinv=estimate;
pinv ==> 0.200, ln L = -18857.50
lscores/rates=gamma shape=estimate pinv=estimate;
pinv ==> 9 ( 10-6, shape ==> 1.046, ln L = – 18742.18 )
Based on the tree topology found in the previous search, does it appear that all sites are evolving at the same rate?
[toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]Rates are not equal. Under any of the models with among-site rate variation, the likelihood-score improvement is huge with only one or two additional parameters.[/toggle] If not, what model of among-site rate variation do you think best explains the data?
9. Try using simpler DNA substitution rate matrices (e.g., JC, F81, K2P).
lset rates=gamma shape=est pinv=0; (i.e., model for rates selected above)
lscores/nst=1 basefreq=equal;
lscores/nst=1 basefreq=estimate;
lscores/nst=2 basefreq=equal tratio=estimate;

Can the data be explained adequately using a simpler DNA substitution rate matrix?
[toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]Likelihood scores for these three models are -19230.54, -18970.60, and -19055.09, respectively. All of these are clearly significantly worse than the HKY+G model (>300 log-likelihood units difference or more, formal LRT would be overkill).[/toggle]
10. Evaluate Tamura-Nei and GTR models and perform likelihood ratio tests; helpful: Chi-Square Calculator.
lscores/basefreq=estimate rates=gamma shape=estimate nst=6 rmatrix=estimate; (GTR)
lscores/rclass=(a b a a c a); (Tamura-Nei)
GTR model: ln L = -18731.11
TamNei model: ln L = -18732.66
HKY model: ln L = -18742.18 (from above)
delta(TamNei vs. HKY) = 2(18742.18 – 18732.66) = 19.04
df = 1, P-value < 0.00001 (1-probability given by calculator above)
delta(GTR vs. TamNei) = 2(18732.66 – 18731.11) = 3.10
df = 3, P-value 0.3765 (1-probability given by calculator above)
Is a more complex model than HKY85 (with among-site rate variation) justified according to a likelihood ratio test?
11. Because the data were simulated on a known phylogeny, we know what the correct tree topology is. Based on all of the previous analyses, which topology do you believe is the correct one? (hint: getting this right may require additional tree searching!)
lset basefreq=prev rmat=prev shape=prev;
set criterion=likelihood;
hsearch;

This search gets the true tree that generated the data:
true tree = (((((1,6),8),4),5),(2,(3,7)))
(To continue the successive approximations approach, we re-estimate parameters on this new tree, and repeat the search using the new parameters.)
lscores/rmatrix=estimate basefreq=estimate shape=estimate;
ln L = -18732.50
lset rmatrix=previous basefreq=previous shape=previous;
hsearch;

Same tree is found, so we stop.