Table of contents
- Expected learning outcomes
- Getting started
- Exercise 1: basic run in MrBayes
- Exercise 2: assessing output quality in Tracer
- Exercise 3: partitioning data in MrBayes
- Exercise 4: using stepping-stone analysis to assess the improvement of partitioning
Expected learning outcomes
The objective of this activity is to become familiar with MrBayes, a program for running Bayesian Phylogenetic analysis. We shall cover the input and analysis of a dataset, the summarisation and assessment of an MCMC run, the partitioning of a multi-gene dataset and whether such partitioning improves the resulting tree.
Getting started
To load MrBayes you type ‘mb’ on the command line. You will need to download:
- primates.nex for exercise 1
- 16S.nex for exercise 2
- rRNA.nex for exercises 3
- rRNA parts output files for exercise 4.
These are arranged in the MrBayesTutorial zip file.
Exercise 1: basic run in MrBayes
Reading in data
The MrBayes program is started by typing the command
mb
on the command line.
Input files to and commands within MrBayes are similar to those for PAUP*. A nexus data file is read in using the command
execute <filename>
Read in the primates.nex file by using the command
execute primates.nex
If done correctly a summary of the file contents are displayed.
Q. What kind of data are the characters in this file?
Q. How long is the alignment?
Setting the evolutionary model
Once the data is read in we must choose the correct evolutionary model. Here we shall use a GTR+I+gamma model of evolution. This is done using the lset command. First we want to find out what the current parameter settings are.
You can find out about any command and its settings by typing
help <command>
This gives a list of all the parameters and their meanings for the given command. It also lists the current settings of the parameters in a table at the end.
To get such information for lset type
help lset
Q. What is the current number of substitution types (nst)?
Q. What is the current rate estimation parameter setting?
To set the model to GTR+I+gamma we need to change the nst to 6 and the rates to invgamma. This is done by typing
lset nst=6 rates=invgamma
You can use the help command to observe the changes to the parameters.
Other parameters such as the number of gamma categories (ngammacat) or the use of a Gibbs sampling approach can be changed here also but we will leave them as the default for this tutorial.
Setting the priors
Priors for the model can be set using the prset command. Here you can set options such as use of molecular clock or enforcing specific nucleotide frequencies. We shall leave the priors as default, which is often appropriate for most analyses.
Checking the evolutionary model
The model can now be checked using the showmodel command. Type
showmodel
Here we can see we have 7 parameters. The individual parameter settings can be observed using the command showparams.
Q. What is the prior placed upon the nucleotide frequencies?
Q. What shape do we have for our gamma distribution of site rates?
Q. What is the prior on the branch lengths?
We now have our data ready and our evolutionary model set up. We must now set-up our MCMC parameters and run the analysis.
Setting up the MCMC run
The MCMC parameters can be viewed using
help mcmc
We want to run the analysis with only 10,000 generations and 1 heated chain. MrBayes counts the number of heated chains as n-1 if n >1. E.g. if the parameter nchains is 4 then we have 1 cold and 3 heated chains. Also note that by default 2 separate runs are undertaken so that posterior probability sampling can be compared (next tutorial).
To change MCMC parameters without starting the analysis we use the mcmcp command.
Q. Set the MCMC parameters to 10000 generations and 1 heated chain.
NOTE: In actual analysis you often want more than 10k generations but for time we shall use only 10k. Large datasets will also likely require more than 1 heated chain.
Running the analysis
Start the analysis by typing
mcmc
The analysis will now begin. In the chain results section each line contains the following information:
Column 1: generation number
Columns 2 and 3: the 2 chains in run 1
Columns 4 and 5: the 2 chains in run 2
Column 6: the time it took to complete this number of generations.
A star separates the two runs. Thus if 4 chains were used, columns 2-5 would be run 1 and 6-9 would be run two. The numbers in columns 2-5 are the log likelihoods of the given chain. The square brackets indicate which chain is currently the cold chain.
When asked if you wish to continue the analysis after 10k generations, say no. This is because the standard deviation of split frequencies is so low we can stop here.
Q. In the help mcmc output, can you see how you would set a defined standard deviation (convergence diagnostic) to stop the analysis at?
Summarising the parameter output
We shall now summarise the model parameters and use the within-MrBayes tools for checking for convergence.
Model parameters for each sampled generation are written to the .p output files. We can summarise these files using the sump command. We wish to discard 25% of the samples as burn-in so we use the following command:
sump burnin=250
This burn-in within MrBayes is the number of samples to discard, not the number of generations (as is sometimes asked for in other programs).
Q. Can you determine another way to get a 25% burin without using the burnin flag?
The first thing to look at is the plot of generation versus the log probability. This should look like white noise (i.e. no obvious trends up or down) if we have achieved good sampling of the posterior probability distribution.
Q. Do you see any trends? Are we happy we have run enough generations?
Under the plot there is a table of all the sampled values. Two measures of convergence are listed here. The ESS (estimated sample size), which should be at least 100 (preferably above 200), and the PSRF (Potential Scale Reduction Factor), which should be close to 1 if we have a good sample. If not, a longer run is needed.
Q. Do we need to rerun the analysis with a longer generation time?
Summarising the topology output
Next we wish to summarise the trees and branch lengths to get our resulting phylogenetic tree. This is done using the sumt command. We shall run it, again with 25 % burn-in, with the command
sumt burnin=250
This will show you the support for each partition along with the posterior probabilities of each branch (first tree) and the branch lengths (second tree). It also tells you how many trees account for 99% of the posterior probabilities.
Q. How many different topologies are contained within the 99% credibility set for this data?
This also creates the consensus tree resulting form the analysis.
As the analysis is now finished you can quit MrBayes by typing
quit
Viewing the consensus tree and bipartition posterior probabilities
Once the MrBayes run is finished and we are happy with the summaries we can view the consensus tree. FigTree is a program that is very good for viewing and outputting trees for publication.
Open FigTree by typing
figtree
on the command line (after you have quit MrBayes or in a separate terminal window). Within FigTree open the file by using ‘File’>’open’ and navigate to where the ‘primates.nex.con.tree’ file was created. The phylogram will appear in the main window.
The main task to do is to display the bipartition posterior probabilities on each node. Check the box beside ‘Node Labels’ on the left-hand side. You will see digits appear beside each node. These are by default the node ages. We wish to show the bipartition posterior probabilities so click the arrow beside the checkbox you just checked. Now the options for the node label display can be seen. Select ‘prob’ from the display drop-down menu. Now each node displays the bipartition posterior probabilities as estimated by MrBayes. These trees can then be exported to PDF for publication if needed.
Exercise 2: assessing output quality in Tracer
Creating a short MCMC run
Using the commands learned in exercise 1, run an analysis of the 16S.nex file with the following options:
A GTR+gamma evolutionary model
A 10,000 generation MCMC run with 2 heated chains
The filename for the MCMC run should be 16S_10K. Find out how to set this using the help mcmc command.
Q. Can you tell how often the chain is sampled?
Q. How many samples do we finish with? (hint (chain length/sample frequency)+1)
Analysing the output of an MCMC run in Tracer
Tracer is a program that allows for analysis of trace files from Bayesian MCMC runs. It allows for analysis of both individual runs and combined runs with measurements such as ESS estimates, density curves and trace plots (also called caterpillar plots).
Open up tracer and input the two .p files created from your 16S MCMC run. This done by choosing ‘File’>’Import Trace File’, setting the format to ‘All files’ and clicking on a .p file (or shift-click on multiple files). Now we can assess whether we have run the analysis for long enough to reach convergence.
The first measurement of sampling is the ESS, as was output from sump in MrBayes. The ESS (Effective Sample Size) is the number of effectively independent draws from the posterior. If there is a low interval between samples during the run there will be a high amount of autocorrelation, resulting in a low ESS no matter how long the chain is run. Thus the ESS gives you an idea of the balance between the length of the chain and how frequent the posterior was sampled during the run. Generally the ESS should be above 100 to be sure that sufficient sampling occurred. If it is below 100 it will be highlighted in red. Ideally it should be above 200, thus values between 100 and 200 are shown in gold.
Click ‘Combined’ in the top left box to see outputs from both trace files. Under ‘Traces’ the parameters and their associated ESS is displayed.
Q. Do you think the runs were long enough to effectively sample each distribution?
Next we can look at the raw trace plot to see if we have run the chain long enough for convergence. Click on the ‘Trace’ button in the top righthand corner. A good indicator of whether we have a sufficient amount of sampling is that the trace line should not swing wildly between values. The line should hover around a value with small variances on either side (making it look like a hairy caterpillar).
Q. Do you think the raw trace plot shows sufficient sampling?
Q. If a longer run is needed, how much longer should be run for? (hint, you need to get the lowest ESS value above 200).
Q. Instead of creating longer chains, how else could we increase the ESS?
Analysis of a sufficiently sampled MrBayes run
Close down the trace analysis and load in the .p files of the 16S_1M run that is supplied for you. This analysis was created the same way as you did above but with a 1 million generation MCMC run. This takes much longer and thus was pre-computed for you.
Q. How many samples were taken from the combined runs? (hint: same sample frequency)
Q. Do you think sufficient sampling has been undertaken now?
If it has been run sufficiently, we can begin to do some extra analysis on the files. Click the ‘Marginal Density’ button. Here we can look at and compare the posterior probability density of parameters. Lets look at the rate of substitution between nucleotides. Click on r(A< – >C) under ‘Traces’. The density should have a bell shaped curve if sampled sufficiently. Now shift-click the remaining rate traces. This should overlay all six on one plot. Click ‘legend’ at the bottom right to know which trace is which.
Q. Is there the same density curve for each rate of substitution?
Q. Which substitution appears to occur more often?
Q. Was a GTR model suitable for this analysis?
Repeat with the frequencies of each nucleotide (pi).
Q. Which nucleotide has the highest frequency?
How to run MrBayes on empty
To check your distributions under the prior only set up MrBayes as you would for a normal run but use the data=no flag in the MCMC command
Exercise 3: partitioning data in MrBayes
Partitioning a dataset in MrBayes
MrBayes can analyse more than one type of characters. There are four types accepted:
- Nucleotide sequences
- Amino acid sequences
- Morphological data
- Restriction data
The input nexus file may contain one or more of these types. The data may then need to be partitioned based on each type or, in the case used here, into separate sections of the same data type (e.g. different genes). This allows for different evolutionary models and priors to be assigned to each partition in order to better contribute to the final analysis.
We shall use a dataset of 5S, 16S and 23S rRNA genes to demonstrate data partitioning. Open rRNA.nex and see how each gene is sectioned. Comments are within square brackets for convenience. Begin MrBayes and load in the rRNA.nex file as before. Partitioning of data is first done by using the charset command such as:
charset <name> = <range>
For example, the 5S gene is characters 1 to 108 so we create this character set with the command
charset 5S = 1-108
Q. How would you repeat this for the 16S and 23S genes? Do so now.
Now we shall instruct MrBayes to create a partitioning of the data using these character sets:
partition partition_by_gene = 3: 5S, 16S, 23S
This creates a partition named ‘partition_by_gene’ with 3 sections and gives the character set for each section. We set the partition by using:
set partition=partition_by_gene
Setting evolutionary models and priors per partition
Now that the data is partitioned we can set an evolutionary model for each character set. The lset command is used here with the ‘applyto’ parameter. For example, to apply an nst of 2 to the 5S set (the first partition) we use:
lset applyto=(1) nst=2;
We can set parameters to one partition, multiple, or all. We shall set the 16S and 23S partitions to be nst=6 and then a gamma distribution to all.
lset applyto=(2,3) nst=6
lset applyto=(all) rates=gamma
The same can be done for priors using the prset command
Q. How would you set the rate prior for all partitions to be variable? Do so now.
We may also wish to unlink certain priors between partitions so that they may vary independently. This is done with the unlink command. For example we can unlink the substitution rates (revmat) and character state frequencies (statefreq) by:
unlink revmat=(all) statefreq=(all)
Q. How would you unlink the gamma shape parameter? (hint use help unlink). Do so now.
Running a partitioned MCMC analysis
Running the MCMC analysis of a partitioned dataset is the same as we did in exercise 1.
Q. Run the MCMC with a 10,000 generation time and an output filename of rRNA_part_10K
Q. Is this run long enough? Check with either Tracer or sump commands.
Exercise 4: using stepping-stone analysis to assess the improvement of partitioning
Partitioning the data may lead to better marginal likelihood scores of your runs though also may not always be necessary, adding additional parameters without improving the likelihood score. A simple way to test whether partitioning is needed to improve the fit of the data is to run the stepping-stone method for accurate estimation of model likelihoods for Bayesian model choice using Bayes factor. This method can also be used to test any two scenarios. E.g. can test whether a strict or relaxed molecular clock is required.
To use the stepping stone method we use the ss command. This has many similarities to the mcmc command. You can set a number of generations and the diagonal frequency as with MCMC.
As this analysis can take some time the diagnositics have been precomputed for you. Open the rRNA_part.nex file in a text editor. This file is a MrBayes batch file which runs multiple commands without the need for typing them individually. Comments are within the square brackets. Here you can see all the commands used for the analysis. This batch file reads in the rRNA.nex datafile and runs MCMC and ss on the data with and without partitioning as was created in exercise 3. The rRNA_part_log.txt file contains all the output from this run, including the marginal likelihoods from the stepping-stone analysis.
Q. Does partitioning improve the marginal likelihood score? (hint search for ‘stepping-stone’ in the log file)