table of contents
- expected learning outcome
- getting started
- exercise 1: start a basic nucleotide run
- exercise 2: monitor an ongoing run
- exercise 3: inspect the final results of a run
- exercise 4: constrain a search
- exercise 5: bootstrap analyses
- exercise 6: use checkpointing
- exercise 7: use a partitioned model
- exercise 8: use a custom amino acid model
- exercise 9: use a codon model
- exercise 10: use gap (insertion-deletion) models
- exercise 11: use the GARLI web service
- exercise 12: further exercises
expected learning outcome
In this activity you will learn about the basics of GARLI, as well as some more advanced features. Once you have completed the activity, you should be able to analyze various types of data sets with different evolutionary models, monitor runs in progress, checkpoint and restart runs, and understand the output from analyses. You should also be able to perform constrained searches, bootstrap analyses, and post-processing. Finally, you should have gained some experience with the GARLI web service.
getting started
GARLI reads all of its settings from a configuration file. By default it looks for a file named garli.conf, but another configuration file can be specified as a command line argument after the executable (e.g., if the executable were named garli, you would type garli myconfig.conf). Note that most of the settings typically do not need to be changed from their default values, at least for exploratory runs. We will only experiment with some of the interesting settings in this activity, but detailed discussions of all settings can be found on the support web site.
The config file is divided into two parts: [general] and [master]. The [general] section specifies settings such as the data set to be read, starting conditions, model settings, and output files. To start running a new data set, the only setting that must be specified is the datafname, which specifies the file containing the data matrix in Nexus, PHYLIP, or FASTA format. The [master] section specifies settings that control the functioning of the genetic algorithm itself, and typically the default settings can be used. Basic template configuration files are included with any download of the program.
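For orientation, a stripped-down config file has the shape sketched below. This is only a sketch (the file name mydata.nex and the placeholder lines are ours, and this is not a runnable configuration); the template files shipped with the program list every setting.

    [general]
    datafname = mydata.nex
    ofprefix = run1
    searchreps = 2
    ...remaining data, model, and output settings...

    [master]
    ...genetic algorithm settings, usually left at their defaults...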
We will be using a small, 29-taxon by 1218-nucleotide mammal data set drawn from a much larger data set (44 taxa, 17 kb over 20 genes) presented in Murphy et al. (2001). The genes included here are RAG1 and RAG2 (Recombination Activating Genes 1 and 2). This is a difficult data set because the internal branches are quite short and the terminal branches quite long. GARLI performs less repeatably on this data set than on many others of similar size, so consider it a “worst case” data set; there are definitely local topological optima. The trees inferred from these genes also do not match our understanding of the relationships of these taxa in many places, but that is not really important for our purposes here.
exercise 1: start a basic nucleotide run
cd wme_jan2015/activities/GARLI/01-03.basicNucleotide
- Open the garli.conf file in a text editor (for example, emacs garli.conf).
There is not much that needs to be changed in the config file to start a preliminary run. In this case, a number of changes from the defaults have been made so that the example is more instructive and a bit faster (therefore, do NOT use the settings in this file as defaults for any serious analyses). You will still need to change a few things yourself. Note that the configuration file specifies that the program perform two independent search replicates (searchreps = 2). Also note that taxon 1 (Opossum) is set as the outgroup (outgroup = 1).
Make the necessary changes:
- Set datafname = murphy29.rag1rag2.nex
- Set ofprefix = run1. This will tell the program to begin the name of all output files with “run1…”.
Now we need to set the model of sequence evolution for GARLI to use. Modeltest has previously been run on this data set (you can peruse the output in the Modeltest directory). The best-fitting model under the AIC criterion is SYM+I+G (6 rates of substitution, equal base frequencies, an estimated proportion of invariant sites, gamma-distributed rate heterogeneity). This is close to the default GTR+I+G model set in the configuration file, so we only need to change the equilibrium base frequency settings.
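For reference, SYM+I+G corresponds to the following model settings in garli.conf (the file for this exercise already matches this apart from the state frequencies; four gamma rate categories is the conventional choice and is what we assume here):

    ratematrix = 6rate
    statefrequencies = equal
    ratehetmodel = gamma
    numratecats = 4
    invariantsites = estimate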
- Set statefrequencies = equal
- Save the file.
- Start GARLI by typing Garli on the command line in the basicNucleotide directory. (Garli should be on your PATH, so the executable itself does not need to be in the current directory.) GARLI will always look for a garli.conf file in the directory that you launch it from, regardless of where the executable is actually located.
You will see a bunch of text scroll by, informing you about the data set and the run that is starting. Most of this information is not very important, but if the program stops, be sure to read it for any error messages. The output will also contain information about the progress of the initial optimization phase, and as the program runs it will continue to log information to the screen. This output contains the current generation number, current best lnL score, optimization precision, and the generation at which the last change in topology occurred. All of the screen output is also written to a log file, in this case named run1.screen.log, so you can come back and look at it later.
exercise 2: monitor an ongoing run
- Look in the basicNucleotide directory, and note the files that have been created by the run.
- Open run1.log00.log in a text editor. This logs the current best lnL, runtime, and optimization precision over the course of the run. It is useful for plotting the lnL over time. Next, we will look at the file that logs all of the information that is output to the screen.
- Open run1.screen.log in a text editor. This file contains an exact copy of the screen output of the program. It can be useful when you go back later and want to know what you did. In particular, check the “Model Report” near the start to ensure that the program is using the correct model.
- Now let’s look at the current best topology. This is contained in a file called run1.best.current.tre. This file is updated every saveevery generations, so it is always easy to see the current best tree during a search. (Do not use this as a stopping criterion and kill the run when you like the tree, though!)
- Open run1.best.current.tre in FigTree by typing figtree at the command line, and examine the tree.
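If you prefer to watch the run from the command line rather than repeatedly reopening files, the standard Unix tail utility (nothing GARLI-specific) will stream the log as it grows:

    tail -f run1.log00.log

Press control-C to stop watching; this does not affect the run itself.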
exercise 3: inspect the final results of a run
Hopefully at least one of the search replicates has finished by now. Examine how the topology and branch lengths changed over the entire run. The run1.rep1.treelog00.tre file contains each successively better-scoring topology encountered during the first search replicate. This file can be interesting to look at, but in general it will not be very useful to you; by default it is not created at all.
- Open run1.rep1.treelog00.tre in FigTree.
- Click through all of the trees. Note how the tree slowly changes over the course of the run. We can also get other information from the treelog file.
- Open the run1.rep1.treelog00.tre file in a text editor. You will see a normal Nexus trees block. Each tree line includes comments in square brackets containing the lnL of that individual, the type of topological mutation that created it (mut = 1 is NNI; 2 and 4 are SPR; 8 and 16 are local SPR), and the model parameters of that individual. For example:
tree gen1= [&U] [-10286.10914 mut=8][ M1 r 1 4 1 1 4 e 0.25 0.25 0.25 0.25 a 0.922 p 0.394 ]
(The “M1” in the model string indicates that this is the first model.) If you scroll through and look at the mutation types, you will probably notice that a mix of all three topological mutation types was creating better trees early on, but the very local NNI mutations dominate at the end of the run. As a comment in each tree specification you will see the model parameters that were associated with each tree during the run. They are specified with a simple code:
r = relative rate matrix
e = equilibrium frequencies
a = alpha shape parameter of gamma rate heterogeneity distribution
p = proportion of invariable sites
The information that you really want from the program is the best tree found in each search replicate and the globally best tree across all replicates. After each individual replicate finishes, the best trees from all of the replicates completed thus far are written to the .best.all.tre file. When all replicates have finished, the best tree across all replicates is written to the .best.tre file.
The config files used here are set up to collapse internal branches that have an MLE length of zero. This may result in final trees that have polytomies. This is generally the behavior that one would want. Note that the likelihoods of the trees will be identical, regardless of whether or not the branches are collapsed.
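The setting responsible for this behavior is collapsebranches in the [general] section of garli.conf; in the config files used here it is enabled:

    collapsebranches = 1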
When the two search replicates have completed, we can more closely examine the results.
First, take a look at the end of the .screen.log file. You will see a report of the scores of the final tree from each search replicate, an indication of whether they have the same topology, and a table comparing the parameter values estimated on each final tree.
There are two possibilities:
- The search replicates found the same best tree. You should see essentially identical lnL values and parameter estimates, and the screen.log file should indicate that the trees are identical.
- The search replicates found two different trees. This is almost certainly because one or both were trapped in local topological optima. You will notice that the lnL values are somewhat different. The parameter estimates will be similar but not identical. The search settings may influence whether searches are entrapped in local optima, but generally the default settings are appropriate.
We can also evaluate and compare the results of our two search replicates with PAUP*. Being able to open the results of one program in another for further analysis is a good skill to have.
In the event you do not have PAUP* available, here is the PAUP* output for the steps below: PAUP* output
First, open run1.best.all.tre and remove the following line from the file: clear;
In the basicNucleotide directory, execute the murphy29.rag1rag2.nex file in PAUP*. (At the command line, type paup murphy29.rag1rag2.nex.)
Now execute the run1.best.all.tre file in PAUP*. (In PAUP*, type execute run1.best.all.tre.)
You should use the command “execute”, NOT “gettrees”, on this tree file. Because GARLI writes a PAUP block to the file in addition to the trees block, executing it loads into PAUP* both the best tree found in each replicate and the parameter values estimated on the best tree.
In PAUP*, type lscore.
The PAUP block contained in the tree file tells PAUP* to score the tree without actually doing any optimization, using the exact parameter values and branch lengths estimated by GARLI. The lnL score you get for the better of the two replicates should be very near the score output by GARLI. You will also see the score of the best tree from the other replicate, but it may not exactly match the score output by GARLI because the model parameter values used were those estimated on the best tree.
To verify that GARLI did a good job of estimating the branch lengths and parameter values, we can have PAUP* reoptimize them:
In PAUP*, type lset userbr=no rmat=estimate base=equal rates=gamma shape=estimate pinv=estimate.
This tells PAUP* to optimize the branch lengths and all parameters of the SYM+I+G model.
In PAUP*, type lscore. (This may take a few minutes.)
This tells PAUP* to use GARLI’s tree, but to optimize everything else. After a bit you will see the optimized lnL score and parameter values for each tree as estimated by PAUP*. These should be almost identical to what you saw when PAUP* used GARLI’s parameter values, indicating that GARLI did a good job of optimizing. If the scores do differ between the programs, those reported by PAUP* here are the true maximized lnL values of the trees, and are the ones you should take note of.
Since we’ve already loaded the GARLI trees into PAUP*, we can also use PAUP* to compare the two trees and see if they differ (and by how much).
In PAUP*, type treedist.
This will show what is termed the “Symmetric Tree Distance” between the trees. It is a measure of how similar they are, and is the number of branches that appear in only one of the trees. If the trees are identical, the distance will be zero. The maximal distance between two fully resolved trees is 2*(# sequences – 3).
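For example, with the 29 sequences used here, two fully resolved trees can differ by at most 2 × (29 − 3) = 52.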
If the trees are different, we can calculate a consensus tree in PAUP* and see exactly where they agree. Note that in general you should choose the best-scoring tree as your ML estimate instead of a consensus.
In PAUP*, type contree /strict=yes majrule=yes LE50=yes.
This will show a strict consensus and a majority rule consensus. The strict consensus completely collapses branches that conflict, but since GARLI already collapsed zero length branches, it is hard to tell where the conflict is. The majority rule consensus will show 50% support for branches that were only found in one tree, but it is not possible to show them all in a single tree.
A final note: obviously you can visually inspect and compare your final trees in FigTree or another tree viewer, but that is not a quantitative comparison.
exercise 4: constrain a search
If you looked carefully at any of the trees you’ve inferred (and know something about mammals), you may have noticed that the relationships are somewhat strange (and definitely wrong) in places. One relationship that this small data set apparently resolves correctly is the sister relationship of Whale and Hippo. This relationship (often termed the “Whippo” hypothesis) was once controversial, but is now fairly well accepted. If we are particularly interested in this relationship, we might want to know how strongly the data support it. One way of doing this would be to simply look at the bootstrap support for it, but we might be interested in a more rigorous hypothesis test such as a simulation-based parametric bootstrap test, a Shimodaira-Hasegawa (SH) test, or an Approximately Unbiased (AU) test. We won’t go into those here, but Goldman et al. (2000) provide a detailed discussion of many of the available tests for comparing topological hypotheses.
The first step in applying one of the topological hypothesis tests is to find the best topology that does NOT contain the Whippo relationship. This is done by using a constrained topology search. In this case we want a negative (also called converse) constraint that causes GARLI to search through tree space while avoiding any tree that places Whale and Hippo as sisters.
GARLI allows constraints to be specified in a number of ways. We will do it by specifying the bipartition to be constrained using the “dot-star” format. This is a simple way of specifying a particular taxonomic grouping that uses a string of periods and asterisks, with one character for each taxon. All of the taxa represented with periods are on one side of the branch being specified, and all taxa represented with asterisks are on the other. For example, to constrain a grouping of taxa 1 and 2 in a data set of 8 taxa, the string would be **...... (this is equivalent to ..******). In this case, the taxa we want to constrain are numbers 20 and 21, out of 29 total taxa.
- In the constrainedNucleotide directory, use a text editor to create a new (empty) text file named whippoNegative.con.
- On the first line, type in the string specifying the (Whale, Hippo) grouping (i.e., 19 periods, 2 asterisks, 8 periods). Note that GARLI will also take constraint trees, which is how PAUP* reads constraints. Constraints can either be positive (MUST be in the inferred tree) or negative (also called converse; CANNOT be in the inferred tree). The constraint type is specified to GARLI by putting a + or - at the start of the constraint string in the constraint file.
- Add a - to the beginning of the string that you just created, so that it looks like this: -...................**........
- Save the file. Now we need to tell GARLI to use the constraint. The garli.conf file in this directory has already been set up to be similar to the one we used during the unconstrained search earlier, so we only need to make minimal changes.
- In the constrainedNucleotide directory, edit the garli.conf file and set constraintfile = whippoNegative.con.
- Set ofprefix = constrainedRun1.
- Save the config file.
- Start GARLI.
Constrained searches can make searching through treespace more difficult (you can think of it as walls being erected in treespace that make getting from tree X to tree Y more difficult), so you may see that the two constrained search replicates result in very different lnL scores. When the run finishes, note the difference in lnL between the best tree that you found earlier and the best constrained tree. This difference is a measure of how strongly the data support the presence of the (Whale, Hippo) group relative to its absence. Unfortunately, we can’t simply do a likelihood ratio test here to test for the significance of the difference, because we have no expectation for how this test statistic should be distributed under the null hypothesis. That is what the parametric bootstrap, SH, or AU test would tell us, but is well beyond the scope of this activity.
Note the lnL difference between the best overall and best constrained trees.
exercise 5: bootstrap analyses
It won’t be practical to run bootstrap analyses during an in-class exercise, but you will find the output files from a GARLI bootstrap analysis of the Murphy29 data set in the bootstrapNucleotide directory. When performing a bootstrap analysis, GARLI will generate a number of bootstrap-resampled data sets and perform a full tree search on each. The single best tree resulting from each replicate is written to a file named <ofprefix>.boot.tre.
GARLI does not currently calculate the bootstrap proportions or the bootstrap consensus tree from the set of trees found over the bootstrap replicates. An external program must be used for that, and good options are PAUP*, the CONSENSE program from the PHYLIP package (trees will need to be converted to Newick format), and a nice program called Sumtrees from the DendroPy package (it requires a Python installation).
Since we are already familiar with it, we will post-process our bootstrap results using PAUP*.
In the event you do not have PAUP* available, here is the PAUP* output for the section below: PAUP* output
- In the bootstrapNucleotide directory, execute the murphy29.rag1rag2.nex data in PAUP* (type paup murphy29.rag1rag2.nex).
- In PAUP*, load the bootstrap trees with the command gettrees file=nucboot.boot.tre.
- In PAUP*, make the majority rule consensus tree with the command contree /strict=no majrule=yes LE50=yes grpfreq=no.
This will display the majority-rule bootstrap consensus tree, including branches appearing in less than 50% of the trees (LE50). You will notice that some parts of the tree are very poorly supported, while others have high support. It is somewhat comforting that the parts of the tree that we know are resolved incorrectly receive low support. This is precisely why phylogenetic estimates MUST be evaluated in light of some measure of confidence, be it bootstrap values or posterior probabilities.
exercise 6: use checkpointing
One useful feature of GARLI is its ability to write “checkpoint” files during a run. If the run is stopped before finishing, either intentionally (e.g., need to reboot system, someone else needs to use system) or unintentionally (e.g., system crash, careless coworker closes the program), it may then be restarted from the checkpoint to resume where it left off. Note that a run restarted this way will give EXACTLY the same result as if it had never been terminated.
- In the checkpoint directory, edit the configuration file and set writecheckpoints = 1.
- Start GARLI.
- After the run progresses for a few seconds, stop it by pressing control-C, closing the window that it is running in, or killing it some other way.
- Edit the configuration file again and set restart = 1.
- Start GARLI again.
Note that it restarts from about where it left off. A run can be stopped and restarted with checkpointing as many times as desired. If you stop and restart it again, no further changes would need to be made to the config file because restart is already set to 1.
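In other words, the only two [general] settings touched in this exercise are:

    writecheckpoints = 1
    restart = 1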
exercise 7: use a partitioned model
Partitioned models are those that divide alignment columns into discrete subsets a priori, and then apply independent substitution submodels to each. There are a nearly infinite number of ways that an alignment could be partitioned and have submodels assigned, so not surprisingly, configuration of these analyses is more complex.
Note that although some models such as gamma rate heterogeneity allow variation in some aspects of the substitution process across sites, a model in which sites are assigned to categories a priori is more statistically powerful IF the categories represent “real” groupings that show similar evolutionary tendencies.
Running a partitioned analysis requires several steps:
- Decide how you want to divide up the data. By gene and/or by codon position are common choices.
- Decide on specific substitution submodels that will be applied to each subset of the data.
- Specify the divisions of the data (subsets) using a charpartition command in a Nexus sets block in the same file as the alignment.
- Configure the proper substitution submodels for each data subset.
- Run GARLI.
Note that detailed instructions and examples are available on the following GARLI wiki page: Using partitioned models
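As a concrete illustration, a sets block for a two-gene alignment might look like the one below. (The charset coordinates here are invented for illustration; the real boundaries for this data set are in murphy29.rag1rag2.charpart.nex.)

    begin sets;
        charset rag1 = 1-620;
        charset rag2 = 621-1218;
        charpartition bygene = rag1subset:rag1, rag2subset:rag2;
    end;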
On to the actual exercise…
- In the basicPartition directory, open murphy29.rag1rag2.charpart.nex in a text editor. Scroll down to the bottom of the file, where a Nexus sets block with a bunch of comments appears. Notice how the charset commands are used to assign names to groups of alignment columns. Notice the charpartition command, which is what tells GARLI how to make the subsets that it will use in the analysis.
- Decide how you will divide up the data for your partitioned analysis. For this exercise it is up to you. There are a few sample charpartitions that appear in the data file. If you want to use one of those, remove the bracket comments around it. If you are feeling bold, make up some other partitioning scheme and specify it with a charpartition. Save the file.
- Now we tell GARLI how to assign submodels to the subsets that you chose. The following is a table of the models chosen by the program Modeltest for each subset of the data. Look up the model for each of the subsets in the partitioning scheme that you chose. Don’t worry if you don’t know what they mean.
sites      rag1        rag2       rag1+rag2
all        GTR+I+G     K80+I+G    SYM+I+G
1st pos    GTR+G       SYM+G      GTR+I+G
2nd pos    K81uf+I+G   TrN+G      GTR+I+G
3rd pos    TVM+G       K81uf+G    TVM+G
1st+2nd    GTR+I+G     TrN+I+G    TVM+I+G

- In the basicPartition directory, open the garli.conf file. Everything besides the models should already be set up. Scroll down a bit until you see several sections headed like this: [model1], [model2], etc. This is where you will enter the model settings for each subset, in normal GARLI model format, in the same order as the subsets were specified in the charpartition. The headings [model1], etc. MUST appear before each model, and the numbering MUST begin with model 1. For example, if you created three subsets, you’ll need three models listed here. Open the garli_model_specs.txt file. This file will make it much easier to figure out the proper model configuration entries to put into the garli.conf file.
- In the garli_model_specs.txt file, find the models that appeared for your chosen subsets in the table above. For example, if you were looking to assign a model to rag2 2nd positions, the model from the table would be “TrN+G”. Find the line that reads “#TrN+G” and copy the 6 lines below it. Now paste those into the garli.conf file, right below a bracketed [model#] line with the proper model number. (A sketch of what finished model sections look like appears at the end of this exercise.)
- Start GARLI.
- Peruse the output in the .screen.log file, particularly looking at the parameter estimates and likelihood scores. Note the “Subset rate multiplier” parameters, which assign different mean rates to the various subsets. Note that the likelihood score of the partitioning scheme you chose could be compared to the likelihoods of other schemes with the AIC criterion. Details on how to do that appear on the partitioning page of the GARLI wiki: Using partitioned models
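For example, if your charpartition created two subsets and you assigned them GTR+G and SYM+G, the finished model sections would look roughly like the sketch below (copy the exact entries from garli_model_specs.txt rather than retyping them; this is only meant to show the shape):

    [model1]
    datatype = nucleotide
    ratematrix = 6rate
    statefrequencies = estimate
    ratehetmodel = gamma
    numratecats = 4
    invariantsites = none

    [model2]
    datatype = nucleotide
    ratematrix = 6rate
    statefrequencies = equal
    ratehetmodel = gamma
    numratecats = 4
    invariantsites = none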
exercise 8: use a custom amino acid model
In addition to nucleotide-based analyses, GARLI can analyze these protein-coding genes at the amino acid level. We might prefer this to nucleotide analyses under certain conditions, particularly when our sequences are quite dissimilar and silent substitutions at the third codon position are saturated. Unlike nucleotide models, the amino acid models also implicitly take into account how well particular amino acids can substitute for one another in proteins. However, by ignoring silent substitutions, amino acid models do throw out information that may be informative at some level, and that has the potential to worsen phylogenetic estimates.
One handy feature of GARLI is that it can do the translation from nucleotide to amino acid sequences internally. So, you can use the same alignment file as input for nucleotide, amino acid and codon analyses and only make changes in the config file. Note that as of version 2.0, GARLI allows three genetic codes: the standard code, the vertebrate mitochondrial code and the invertebrate mitochondrial code.
As with nucleotide analyses, the first step is to choose a model. The program Prottest is similar to Modeltest, and does a good job of this. One unfortunate fact is that Prottest does require an amino acid alignment, so a nucleotide alignment will have to be translated first.
Note that GARLI does not yet implement many of the models considered by Prottest, although some of these models can be specified through the GARLI web service. In this case we already ran Prottest on this data set, and the model it chose is what is usually termed the Jones or JTT model (Jones, Taylor and Thornton, 1992), with gamma rate heterogeneity. You can look at the Prottest output in the Prottest directory if you like.
However, to make things more interesting, we will be using an altogether different model, one that GARLI does not implement internally. This is the “LG” model (from Le and Gascuel, 2008), and it will give us an example of how model parameter values can be passed to GARLI, rather than having it estimate them. The procedure is the same regardless of whether the model is at the nucleotide, amino acid, or codon level.
Set up the config file:
- In the customAA directory, open the garli.conf file.
- Change the datatype from dna to codon-aminoacid. (codon-aminoacid tells GARLI to do the codon to amino acid translation internally; datatype = aminoacid should be used for actual amino acid alignments.)
- Change both the ratematrix and statefrequencies settings to fixed. This tells GARLI that the values of those parameters will be provided to it from a file and should not be estimated.
- Set invariantsites = none.
- Change ofprefix to AARun1.
- Set streefname = LGmodel.mod. This file is already in the customAA directory, and contains all of the parameter values of the LG model in a format that GARLI can use. It is rather scary looking, so don’t worry about the details.
- Save the file.
- Start GARLI.
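Collected together, the settings changed in this exercise are the following (everything else in the file stays as it was):

    datatype = codon-aminoacid
    ratematrix = fixed
    statefrequencies = fixed
    invariantsites = none
    ofprefix = AARun1
    streefname = LGmodel.mod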
You will notice that the amino acid analysis runs significantly more slowly than the nucleotide one, so the config file is set to only do a single search replicate. Once the run finishes (approx. 10-15 minutes on a slow machine) you can use FigTree to visually compare the trees from the nucleotide and amino acid analyses, or use PAUP* for more quantitative comparisons. You will notice that the two types of analyses give very different results. In this case the nucleotide results are much more reasonable. You will also notice that the amino acid data do not support the Whippo relationship.
exercise 9: use a codon model
Codon-based analyses are very slow. We won’t do a complete search here, but will set up a configuration file for one and start it to get a feel for the speed. We’ll tell GARLI to use a model more or less equivalent to the Goldman-Yang 1994 model.
- In the basicCodon directory, edit the configuration file. Some of it is already set up for this data set. We only need to set the details of the codon model.
- Set datatype = codon
- Set ratematrix = 2rate
- Set statefrequencies = f3x4
- Set ratehetmodel = none
- Set numratecats = 1
- Set invariantsites = none
- Save the file, start GARLI.
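For reference, the codon-model portion of the finished config file should read:

    datatype = codon
    ratematrix = 2rate
    statefrequencies = f3x4
    ratehetmodel = none
    numratecats = 1
    invariantsites = none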
You’ll quickly notice that the codon-based analysis is very slow, and you can kill it by typing ctrl-C.
exercise 10: use gap (insertion-deletion) models
GARLI implements two models appropriate for incorporating gap information into analyses of fixed alignments. These are the “DIMM” model (Dollo Indel Mixture Model) and a variant of the Mkv or “morphology” model. Both decompose an aligned matrix into two: one that contains normal sequence data, and the other that encodes only the gaps as 0/1 characters. The two matrices are then analyzed under a partitioned model. The DNA portion is treated normally, while the 0/1 matrix is treated differently under the two models:
- The DIMM model assumes a very strict definition of homology of gap characters, and does not allow multiple insertions within a single alignment column. This also implies a “Dollo” assumption, meaning that after a base is deleted, another cannot be inserted in the same column. This also means that it infers rooted trees. The DIMM model extracts very strong signal from the gap information, but also ends up being VERY strongly affected by alignment error.
- The Mkv gap variant works on columns of 0/1 characters, but assumes that columns of entirely 0 (fully gap columns) will not be seen. This model DOES allow multiple transitions back and forth between 0 and 1 (gap and base). It does not extract as much information as DIMM, but is also not as sensitive to alignment error.
Preparing the data:
To use the DIMM or Mkv gap models, we first need to create a “gap coded” matrix that is just a direct copy of the DNA matrix with gaps as “0” and bases as “1”. This is done with an external program called “gapcode”. Scripts are provided to format Nexus or FASTA data sets for you, and to start the analyses. For the demo, the data set we’ll use is called forTutorial.nex.
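The recoding itself is simple. For a single aligned row it works like this (made-up sequence, for illustration only):

    DNA matrix row:   ACG--TA-C
    gap-coded row:    111001101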
- Format data for gap analysis: from the command line in the gapModels or gapModels-Win directory, type ./prepareNexusData.sh forTutorial.nex or prepareNexusData.bat forTutorial.nex.
- This will create some alignments in the preparedData directory. (These files are actually already there in the demo package, in case you have difficulty running gapcode.)
- You can also use these same scripts to prepare your own Nexus or FASTA alignment.
Running GARLI gap models:
Now we run the DIMM and Mkv gap analyses by using other scripts:
- Type ./runGarli.dna+gapModels.sh or runGarli.dna+gapModels.bat.
- This will do several GARLI searches on the same data. First a quick run will be done to generate starting trees for the other analyses. Then DNA-only searches will be run, as well as the DIMM and Mkv gap model searches.
- The runs will create output that you can look at in the garliOutput directory. Looking at the XXX.screen.log files will tell you about the details of the run and parameter estimates. The XXX.best.tre files contain the best trees found in each search.
- Note that the DIMM model indicates its inferred root by adding a dummy outgroup taxon named ROOT to the tree files. On OS X and Linux, the prepareData scripts will create alignment files containing this taxon in the preparedData directory. This hasn’t yet been automated on Windows.
- If you use the above prepareData scripts on your own data set, these runGarli scripts should also work properly to automatically analyze your data.
exercise 11: use the GARLI web service
We have developed a GARLI web service that is backed by a powerful grid computing system. You will use it for this exercise.
- Go to the Create Job page.
- At the top, enter an email address. This is how you will receive notifications about the status of your job.
- Enter a job name. This is completely arbitrary, but if you would like to help us track down problems, use your last name (e.g., Zwickl) as the job name. Recall that when you ran GARLI from the command line in exercise 1, you instructed GARLI to perform two independent search replicates (searchreps = 2). Now you will be performing an adaptive search, which begins with 10 search replicates. Each search replicate will run on a different processing node in our grid system. (You can learn more about the adaptive search, as well as other GARLI web service features, in our Systematic Biology publication.)
- We will be using the same data file that we used in exercise 1, but we need to upload it to the web server.
- Click on the Sequence data file field, and browse to the basicNucleotide directory.
- Select the murphy29.rag1rag2.nex data file. Now we need to set the model of sequence evolution for GARLI to use. We’ve already determined in exercise 1 that the best-fitting model under the AIC criterion is SYM+I+G, so we only need to change the equilibrium base frequency settings.
- Under Model Parameters > Non-partitioned Analysis > Nucleotide, change State frequencies to “equal”.
- Fill in the CAPTCHA at the bottom of the page.
- Click the Create Job button. You should receive an email notification confirming that the job was submitted. The amount of time it takes to compute the job and get the results back will depend on a number of factors, so you should probably do something else while you’re waiting (like finish reading about the GARLI web service). When the job is complete, you should get another email notification containing a link to a page where you can download your results.
- On the Job Status page, enter your email address and the Job ID from the email you received.
- Click the Get Job Status button. If your job is complete, a list of files should appear. The best_tree.tre file contains the tree with the highest likelihood score among the ten search replicates. You may also peruse the analysis summary, some of the graphs that are generated, and the visualization of the best tree if that is available.
Some questions to think about and discuss:
- In the past, how have you evaluated the thoroughness of your search?
- What are the advantages of the adaptive search?
- Did you find the post-processing statistics and graphics helpful?
The web service is available for you to use any time. You will gain some additional features, such as the ability to manage your jobs more easily, if you register for an account on molecularevolution.org via the home page. The GARLI web service includes support for partitioned analyses (including a “guided mode” to ease specification of relatively simple analyses), a variety of amino acid models, a “clone job” feature, as well as sophisticated analysis post-processing including visualization of the best tree or bootstrap consensus tree. If you have time, you could try performing either the partitioned analysis in exercise 7 or the LG amino acid model analysis in exercise 8 using the web service.
exercise 12: further exercises
If you have your own data set of interest, now would be a good time to give it a try in GARLI. You can use the config file in the basicNucleotide directory as a template, or just use the GARLI web service.
You might also try doing a constrained amino acid search that forces Whale and Hippo to be sister, since they are not sister in the ML amino acid tree. You can use the same constraint file as before; simply change the “-” at the start of the constraint string to a “+” to denote that it is a positive constraint. Comparing the likelihood of the constrained and unconstrained amino acid trees will give a measure of the support of the amino acid data against the Whippo hypothesis.
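With that single edit, the constraint file (perhaps saved under a new name such as whippoPositive.con) would contain the one line:

    +...................**........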