table of contents
- expected learning outcome
- jModelTest exercise
expected learning outcome
The objective of this activity is to help you become familiar with the evaluation of models of nucleotide sequence evolution using the software jModelTest (Posada 2008), an extended version of the popular program ModelTest. While ModelTest could only be used in combination with PAUP*, jModelTest includes PHYML, ReadSeq, and Consense (of the PHYLIP package) and can be used as a standalone application. More importantly, jModelTest allows the optimization of base trees for every individual model, model selection according to a new decision-theoretic criterion, and the construction of model-averaged phylogenies.
jModelTest exercise
- Download the data set percidae.fasta, an alignment of cytochrome b sequences.
- Go to the directory where you saved the jModelTest application and double-click jModelTest.jar to open it. If you are using Ubuntu Linux, you may have to open a terminal, navigate to the folder in which this file is located, and type the command java -jar jModelTest.jar to start the program.
- Go to File > Load DNA alignment and open the data set file percidae.fasta.
- Click Analysis > Compute likelihood scores to start the analysis.
- A dialog box will appear that allows you to specify a number of likelihood settings, including the number of substitution schemes to be tested. The other settings specify whether models with unequal base frequencies (+F), a proportion of invariable sites (+I), and rate variation among sites with a number of rate categories in a discretized gamma distribution (+G) should be included. You are also asked to pick one of four options for the base tree used in the likelihood calculations: Fixed BIONJ-JC, Fixed user topology, BIONJ, and ML optimized. Keep the default settings for the number of substitution schemes, base frequencies, and rate variation, but pick BIONJ to calculate the base tree. Calculations might take a few minutes depending on the speed of your computer. (Fixed BIONJ-JC is the option that was implemented in ModelTest. If you would like to use hierarchical Likelihood Ratio Tests (hLRT) to identify the best-fitting model, you have to pick one of the fixed tree topologies. However, according to Posada and Buckley (2004), the Akaike Information Criterion and Bayesian approaches are preferable to hLRTs because, among other reasons, hLRTs depend on a chosen significance level and require the models to be nested; the test statistic behind the hLRT is written out after the exercise steps. Focusing on these alternative model selection criteria, we are free to use the more advanced settings BIONJ or ML optimized, in which the tree topology is optimized for every model. Maximum Likelihood is arguably better than BIONJ (an improved version of the Neighbor Joining algorithm), but it is also computationally more intensive, which is why we will use the BIONJ algorithm for this activity.)
- Click on Analysis. You’ll see that the AIC, BIC, and DT calculations are now available, while the hLRT calculations are still grayed out. This is because we chose to optimize the tree topology for each model rather than use a fixed topology, which the hLRT requires.
- Now click on Results > Show results table. A window will appear showing you the calculated likelihoods for every single model (again, note that the trees may differ between models). Also shown are the partition schemes, the number of parameters included in every model, the calculated base frequencies, and the transition and transversion rates. Note that the tables for the AIC, AICc, BIC, and DT results are still grayed out. You can sort the table by likelihood by clicking on the column header (-lnL). Which model has the highest likelihood? Is it the model with the largest number of parameters? How is it possible that models with more parameters can have lower likelihoods than models with fewer parameters?
- We are now going to choose the best-fitting models according to the Akaike Information Criterion (AIC) (Akaike 1974), the Akaike Information Criterion corrected for small sample sizes (AICc) (Hurvich and Tsai 1989), the Bayesian Information Criterion (BIC) (Schwarz 1978), and a decision-theoretic performance-based approach (DT) (Minin et al. 2003); the AIC formula is written out after the exercise steps. Go back to Analysis and click Do AIC calculations …. A new window will appear with the Akaike Information Criterion settings. For the moment, do not check the Use AICc correction checkbox, as we would first like to use the uncorrected AIC. Make sure, however, that the next two checkboxes (Calculate parameter importances and Do model averaging) are checked. Leave the confidence interval at 100%. Click Do AIC calculations. Calculations should finish almost instantly. (If you intend to run PAUP* with the model chosen by the AIC, check Write PAUP* block. A PAUP* block containing the PAUP* commands for the selected model will then be written to the output and can be added to your PAUP* input file.)
- Click on Results > Show results table again. The AIC table is no longer grayed out. Click AIC. Find the model with the lowest Akaike Information Criterion value by clicking on the column header AIC. The weight column indicates to what extent this model was preferred over the competing models; a sketch of how such weights are computed from the AIC values is given after the exercise steps.
- Close the results table window and go back to Analysis > Do AIC calculations …. This time, check the Use AICc correction checkbox to invoke the AIC corrected for small sample sizes. In this case we need to specify a sample size; we use the preset sample size of 1140. Click Do AICc calculations. (Note that for large sample sizes the AICc converges towards the AIC, so the AICc should always be used, regardless of sample size [Burnham and Anderson 2004]; the correction term is written out after the exercise steps.)
- Repeat the same steps for the Bayesian Information Criterion (BIC) and the decision-theoretic performance-based approach (DT); the BIC formula and a brief note on the DT criterion are given after the exercise steps. Make sure that Calculate parameter importances and Do model averaging are checked.
- Once all calculations are finished, go back to Results > Show results table and examine which models have been chosen by the different criteria. (In case of disagreement between the selected models, you will have to choose which model to use for your phylogenetic analyses. Here are some suggestions. As stated above, the AICc should always be preferred over the AIC. The DT is a newly implemented criterion, the weights of which “are very gross and should be used with caution” [Posada 2008; jModelTest manual]. This leaves the AICc and the BIC as the most reliable criteria. Which of these two to prefer is probably a matter of taste; you are on the safe side if you run phylogenetic analyses with both the model chosen by the AICc and the one chosen by the BIC.)
- jModelTest also allows for model-averaged phylogenies, which are consensus trees (strict or majority rule) of all the optimized base trees. However, it currently relies on Consense of the PHYLIP package for this feature, which may not work if the taxon names are longer than 10 characters (Link to forum discussion); a sketch for shortening the names in the input file is given after the exercise steps. You may test this for yourself: click Analysis > Model-averaged phylogeny. A window will open with the Phylogenetic averaging settings. Leave the default settings and click Run. The output will list some settings, the models, and the line Model averaged phylogeny =, after which the phylogeny should presumably appear. This feature may be very helpful in future versions of jModelTest.
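The following notes expand on some of the steps above. As mentioned in the likelihood settings step, the hLRT approach compares two nested models with a likelihood ratio test. A minimal statement of the test statistic, in standard notation (general background, not output taken from jModelTest):

```latex
% Likelihood ratio test between a simpler model M0 (k0 free parameters,
% maximized log-likelihood ln L0) and a more general nested model M1
% (k1 parameters, ln L1). Under M0 the statistic is asymptotically
% chi-square distributed with k1 - k0 degrees of freedom.
\[
  \mathrm{LR} = 2\,\bigl(\ln L_{1} - \ln L_{0}\bigr) \;\sim\; \chi^{2}_{\,k_{1}-k_{0}}
\]
```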
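For reference, the Akaike Information Criterion used in the Do AIC calculations step is defined as follows, where ln L is the maximized log-likelihood of a model and k its number of free parameters (standard definition; smaller values are better):

```latex
% Akaike Information Criterion: the maximized log-likelihood penalized by
% twice the number of free parameters; the model with the smallest AIC is
% selected.
\[
  \mathrm{AIC} = -2\ln L + 2k
\]
```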
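The weight column in the AIC results table can be read as Akaike weights. Here is a minimal Python sketch of that calculation, using made-up AIC values for three hypothetical models rather than the actual percidae.fasta results:

```python
import math

# Hypothetical AIC scores for three competing models (illustrative values only,
# not results from the percidae.fasta data set).
aic = {"GTR+I+G": 15230.4, "GTR+G": 15236.1, "HKY+I+G": 15241.9}

# Akaike weight of model i: w_i = exp(-delta_i / 2) / sum_j exp(-delta_j / 2),
# where delta_i is the difference between model i's AIC and the minimum AIC.
best = min(aic.values())
rel_likelihood = {m: math.exp(-(v - best) / 2) for m, v in aic.items()}
total = sum(rel_likelihood.values())
weights = {m: r / total for m, r in rel_likelihood.items()}

for model, w in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{model}: Akaike weight = {w:.3f}")
```

A weight close to 1 means the model is strongly preferred over all competitors; weights spread across several models indicate that no single model clearly wins.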
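The Use AICc correction checkbox replaces the AIC by its small-sample corrected form, which depends on the sample size n (the preset value of 1140 presumably corresponds to the number of sites in the alignment):

```latex
% Small-sample corrected AIC: the correction term 2k(k+1)/(n-k-1) shrinks
% towards zero as the sample size n grows relative to the number of
% parameters k, which is why the AICc converges to the AIC for large samples.
\[
  \mathrm{AICc} = \mathrm{AIC} + \frac{2k\,(k+1)}{n-k-1}
\]
```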
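The Bayesian Information Criterion used in the corresponding calculation step differs from the AIC only in the penalty per parameter, which grows with the sample size n. The DT criterion of Minin et al. (2003) has no comparably simple closed form: roughly speaking, it selects the model that minimizes the expected error in the resulting branch-length estimates, averaged over the candidate models.

```latex
% Bayesian Information Criterion: like the AIC, smaller values are better,
% but each parameter is penalized by ln(n) instead of 2, so for large n the
% BIC tends to favour simpler models than the AIC does.
\[
  \mathrm{BIC} = -2\ln L + k\ln n
\]
```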
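If you want to experiment with the model-averaged phylogeny despite the PHYLIP name-length limitation, one workaround is to shorten the taxon names in the FASTA file before loading it. The sketch below assumes a simple FASTA layout with one header line per sequence; the output file name and the idea of suffixing duplicate names are illustrative choices, not anything prescribed by jModelTest:

```python
# Shorten FASTA taxon names to at most 10 characters so that downstream
# PHYLIP programs (such as Consense) can handle them. Truncated names that
# collide are made unique with a numeric suffix.
LIMIT = 10
seen: dict[str, int] = {}

def shorten(name: str) -> str:
    short = name[:LIMIT]
    if short in seen:
        seen[short] += 1
        suffix = str(seen[short])
        short = short[: LIMIT - len(suffix)] + suffix
    else:
        seen[short] = 0
    return short

with open("percidae.fasta") as src, open("percidae_short.fasta", "w") as dst:
    for line in src:
        if line.startswith(">"):
            dst.write(">" + shorten(line[1:].strip()) + "\n")
        else:
            dst.write(line)
```

Keep a record of the original names so that you can relabel the resulting consensus tree afterwards.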