modsel-sim-tutorial

Download the gzip archive containing the materials for this exercise into the workshop_materials directory:

cd ~/workshop_materials
wget http://evomicsorg.wpengine.netdna-cdn.com/wp-content/uploads/2016/06/modsel.tar.gz

Model selection using jModelTest

For most of this tutorial we'll be using PAUP*, but we'll do one simple example with jModelTest here; refer to the manual at the link above for details on how to use the program. We'll use the primate-mtdna.nex data file.

To make it easier to run jModelTest, first move the run-jmodeltest script to your ~/bin directory and make it executable:

cd ~/workshop_materials/modsel_sim
mv run-jmodeltest ~/bin
chmod a+x ~/bin/run-jmodeltest

Now we will evaluate 56 models on a tree obtained using BioNJ with Jukes-Cantor distances. We'll use both the corrected AIC (AICc) and the Bayesian Information Criterion (BIC):

run-jmodeltest -d primate-mtdna.nex -s 7 -t fixed -g 4 -f -i -AICc -BIC

Model selection using PAUP*

First, move the latest PAUP* executable into your ~/bin directory (this directory precedes the system directories in your PATH setting, and will therefore override the previously installed version).

cd ~/workshop_materials/modsel_sim
mv paup4a152_ubuntu64.gz ~/bin/paup.gz
cd ~/bin
gunzip paup.gz
chmod a+x paup

While we're in the ~/bin directory, let's do one more thing that will simplify our subsequent work. Create a "symlink" to the system's RAxML binary so that we can invoke RAxML simply by typing raxml.

ln -s /usr/local/bin/raxmlHPC raxml

cd ~/workshop_materials/modsel_sim

GUI versions of PAUP* are also available for MacOSX and Windows. If you want to install them on your laptop, you can download them from this page: http://people.sc.fsu.edu/~dswofford/paup_test/

Load the primate-mtdna.nex file into your favorite editor (e.g., nano, vi, or emacs). After I have made some comments, exit the editor without saving any changes.

For the first run, we will type PAUP* input commands interactively. Start the program and execute the primate-mtdna.nex data file:

paup primate-mtdna.nex -L log.txt

The -L switch above opens a log file that will receive all output that is displayed on the screen during the run.

Now, we will attempt to exactly reproduce the jModelTest analysis. First, we need a tree, so we'll compute a BioNJ tree using Jukes-Cantor distances:

dset distance=jc;
nj bionj=y;

We'll use the automodel command to perform model selection. First, let's see what the available options and current settings are:

automodel ?

automodel

The AICc criterion chooses the HKY+G model, and this model is automatically set up in PAUP* for subsequent analysis. You can compare the output from PAUP* to the previously obtained jModelTest results.

We can now search for a maximum likelihood tree in PAUP*. First, we have to change the active optimality criterion to maximum likelihood (the default criterion is parsimony, not because I think parsimony is a good method, but because it can be used on any kind of character data).

set criterion=likelihood;

Remember that you can always abbreviate commands and options as long as they remain unambiguous, so the above command could also have been entered as:

set crit=l;

hsearch;

describe/plot=phylogram;

A somewhat safer searching strategy is to use different starting points, similar to RAxML, Garli, and other programs. The usual way to this in PAUP* is to use "random addition sequences" to generate different starting trees by stepwise addition. For difficult data sets, many starting trees for branch swapping will be different, and the searches may land in different parts of "tree space." If all random-addition sequence starts end up finding the same tree, you can be reasonably confident that you have found the optimal tree(s).

For the sake of completeness, let's do a random-addition-sequence search using the current model settings:

hsearch addseq=random nreps=50;

Obviously, this is an "easy" data set as most, if not all, starting points end up finding the same tree.

PAUP* now has the ability to initiate searches using RAxML and Garli (and I plan to add PhyML, FastTree2, etc., as well). The current PAUP* model settings are used to determine the option settings for these programs. Input and/or config files are created, the program is invoked as an external command, and the resulting trees are imported back into PAUP*. For example, let's look at the options available for the raxml command in PAUP*:

raxml ?

lset shape=est basefreq=est rmatrix=est;
raxml;

lscores;

Note that the likelihood score from PAUP* is slightly worse than the one reported by RAxML. This is because PAUP* is using the Tamura-Nei model rather than the more general GTR model, which RAxML cannot use. To validly compare likelihoods, we need to set up the model in PAUP* to match the one used by RAxML. This involves undoing the restriction on the GTR substitution rates imposed by Tamura-Nei:

lset rclass=(abcdef);
lscores;

The previous (Tamura-Nei) model had set rclass=(abaaca), so that all 4 of the transversions are occurring at the same rate, but each of the transitions are potentially occurring at distinct rates. After the change to 'rclass' PAUP* gets a slightly better likelihood than RAxML. Presumably, it just did a slightly better job at optimizing the model parameters and branch lengths (I have worked very hard on this aspect of the program).

Partitioned analyses:

PAUP* can do "partitioned" analyses in which different models are assigned to different subsets of sites (e.g., genes, or codon positions, or a combination of these). The easiest way to set up a partitioned analysis is to use the autopartition command.

Partitioning begins by defining a set of basic "blocks", which are the smallest divisions of sites. The goal is to see if an adequate model can be achieved by combining some of these blocks into larger ones, thereby reducing model unnecessary model complexity (and speeding searches because fewer parameters need to be estimated).

The primate-mtdna.nex data file defines a partition called codons that places sites into one of four categories: 1st, 2md, and 3rd positions, plus non-protein-coding (tRNA) sites. Let's first look at the options:

autopartition ?

autopartition partition=codons;

Partitioning by BIC combines 1st position and noncoding blocks into the same larger subset.

You can evaluate the likelihood of the resulting model with the lscores command:

lscores;

Before starting a search, you should fix all model parameters to the MLEs obtained above rather than optimizing parameters during the search (which is exceedingly slow!)

lset fixall;
lscores;

The second lscores command above verifies that the parameters have been fixed to the MLE values. Now you can do a search as before:

set crit=like;
hsearch;
describe/plot=p;

Wrap-up

The above is intended to just give a flavor of how PAUP* works. We've barely scratched the surface with respect to what it can do. PAUP* can perform parsimony analysis under a variety of cost schemes, neighbor-joining and criterion-based distance searches using a large number of distance transformations, exact tree searches for smaller numbers of taxa, consensus tree calculation, agreement subtrees, distances between trees, analysis of base frequency composition, likelihood under amino-acid models, likehood for discrete character data, and much more. But it's time to move on to simulations...

Part 2: Exploring Models and Hypothesis Testing using Simulation

Why use simulation?

What simulation is not good for:

How simulation works (in a nutshell):

Your first simulation (maybe)

We'll use PAUP* as our simulator for this exercise. There are other good simulators out there, notably:

The nice thing about PAUP*'s (new) simulation capability is that you don't have to write pipelines to pass simulated data to inference programs -- it's all internal!

Use your favorite editor to type in the following Nexus script and save it as sim3taxa.nex (I'll let you use pre-made files after this, but entering one file the hard way will be good for your soul. The simulation is very simple: a three species tree.

#nexus
[ This example just simulates three sequences under the HKY model ]
begin paup;
    cd *;
    set warntree=no notifybeep=no;
end;

begin taxa;
    dimensions ntax=3;
    taxlabels X Y Z;
end;

begin trees;
    tree 1 = [&R] ((X:0.05,Y:0.05):0.45,Z:0.5);
end;

begin dnasim;   
    simdata nchar=1000;
    lset nst=6 basefreq=(.1 .2 .3 .4) rmatrix=(1 4 1 1 4 1);
    truetree source=memory treenum=1 showtruetree=brlens;
    beginsim nreps=1 seed=0 monitor=y;
        [ Ordinarily, there will be commands in here, but we're just
          getting started.  You don't have to include this comment,
          but if you do don't forget to close the opening square
          bracket with a closing square bracket! ]
    endsim;
end;

paup sim3taxa.nex;

Fix any typos that PAUP* complains about, and keep executing as above until the file is accepted without errors or warnings. Then look at your simulated data by typing the command:

showmatrix;

Raise your hand when you see sequence data. When everyone has more or less gotten to this point, we'll proceed together.

Download and unzip some additional files that we'll work through as a group, using the link provided on the web site:

Part 1: Fundamentals of Model Selection

Preliminaries