Table of contents
- Expected learning outcomes
- Getting started
- Exercise 1: Produce an ML phylogeny of 16s
- Exercise 2: Produce an ML phylogeny of rag1
- Exercise 3: Compare results
Expected learning outcomes
This exercise is supposed to teach you only the most important functions of RAxML. You will learn how to bootstrap Maximum Likelihood (ML) trees in order to assess node support, and how to partition alignment files to allow different substitution models for different regions of the alignment.
We are going to work with the 16s and rag1 alignments that we built during the Multiple Sequence Alignment exercise, and we will produce bootstrapped ML phylogenies for both alignments. Please make sure that you are logged in to the Workshop’s Amazon Machine Image (AMI) using a console window of either the Terminal application (if you’re on Linux or Mac OSX), or PuTTY (if you’re on Windows).
Using this console window, navigate to the directory for the RAxML activity:
Make sure that RAxML is installed:
(if you see ‘/usr/local/bin/raxml’, things are looking good)
Next, have a look at the impressive help text of RAxML:
Scroll back up to the beginning of the RAxML help text. Close to the top, you’ll see that raxml could be started as easily as
raxmlHPC -s sequenceFileName -n outputFileName -m substitutionModel
where ‘sequenceFileName’ and ‘outputFileName’ would have to be replaced with the actual sequence and output file names, and a substitution model would have to be chosen to replace ‘substitutionModel’ (note that for our installation, we start the program with ‘raxml’, not ‘raxmlHPC’, only because we named it that way). As RAxML is designed for large datasets, the choice of substitution models for nucleotide sequences is limited to rather parameter-rich models of the GTR family, such as GTR+Gamma, GTR+Gamma+I, GTRCAT, and GTRCAT+I. Of these, the latter two are similar to GTRGAMMA, but are computationally a lot faster (in case you want to read up on it later, Paul Lewis has posted a few slides to explain the benefits of the GTRCAT model here). The author of RAxML, Alexis Stamatakis, strongly and convincingly argues against the use of the proportion of invariable sites (the +I) in the manual of RAxML 7.04 (on page 20), so we’ll focus only on the GTRCAT model in this exercise (that said, if you browse through the RAxML help text, you may notice the newly implemented ASC_GTRCAT and ASC_GTRGAMMA models which correct for ascertainment bias in SNP data, and could prove very useful, but will not be covered here).
The RAxML help text shows that a large number of algorithms are implemented in RAxML, which you can choose with the the ‘-f’ option. If ‘-f’ is not specified, the ‘new rapid hill-climbing’ algorithm will be used. This is fine if you don’t need bootstrap support values, however we would like to have these, so we choose ‘-f a’ to run a search for the ML tree and a rapid bootstrap analysis in one run.
Exercise 1: Produce an ML phylogeny of 16s
So, let’s try to run a ML search with bootstaps for the 16s sequence data, using the alignment that was filtered with BMGE, the file called 16s_filtered.phy.
We’ll start with as little command line options as possible, and learn along the way which other options we need. Only for the reason that I find this easier than remembering all necessary options before the run. As explained above, we’ll use the GTRCAT model, and the ‘-f a’ option to run a ML search and bootstraps in the same run. And we choose ’16s_filtered.out’ as part of all result file names. So let’s try
raxml -s 16s_filtered.phy -n 16s_filtered.out -m GTRCAT -f a
As you can see, RAxML does not seem to be happy with our minimalistic choice of options, and asks for a random number seed for the ML search, which should be specified with the ‘-x’ option. Random number seeds are used to initiate the heuristic search, and you will always obtain exactly identical results if you use the same random number seed repeatedly. So if you want to run replicate analyses of the same dataset, make sure to use different number seeds each time, or the resulting identity of the resulting trees will be purely artificial, not supportive of the tree being the true ML estimate. Anyway, let’s try again:
raxml -s 16s_filtered.phy -n 16s_filtered.out -m GTRCAT -f a -x 123
(Pro tip: If you’re already tired of choosing different random number seeds yourself, you could type ‘$RANDOM’ instead of ‘123’, this will pick one number at random between 0 and 99999)
RAxML is still not happy, and would also like to know the number of bootstrap replicates, which are to be specified with the option ‘-#’ or ‘-N’ (those two are synonymous, and we’ll use ‘-N’ here). As the RAxML help text explains, one can either specify a fixed number of replicates, such as ‘-N 100’, or one could let RAxML decide itself how many bootstrap replicates should be run, by using one of the automatic ‘bootstopping criteria’ autoMR, autoMRE, autoMRE_IGN, or autoFC. According to this paper of the RAxML authors, autoMRE performs best and is fast enough for up to a few thousand taxa, so we type
raxml -s 16s_filtered.phy -n 16s_filtered.out -m GTRCAT -f a -x 123 -N autoMRE
It seems that we still need to specify yet another option, which is the random number seed for the parsimony inference (the parsimony inference is used to generate a starting tree). It is specified with ‘-p’:
raxml -s 16s_filtered.phy -n 16s_filtered.out -m GTRCAT -f a -x 123 -N autoMRE -p 456
RAxML should now be running. Have a look at the ‘Important Warnings’ that RAxML reports. Do you think one should change the alignment based on these warnings? What is the proportion of gaps and undetermined characters in the alignment? How long does RAxML take per bootstrap replicate? Feel free to cancel the run after a few bootstrap replicates, as the complete analysis might take too long for this exercise (result files for this run are already prepared in ~/wme_jan2015/activities/RAxML/res/, we’ll have a look at these later). To cancel the RAxML run, type ctrl-c (hold down the control key and press ‘c’).
Exercise 2: Produce an ML phylogeny of rag1
We will now run RAxML for the rag1 alignment. With protein-coding sequences, we might expect different rates of substitutions for the first, the second, and the third codon position, and to accomodate these, we will split the alignment into three partitions. Fortunately, we do not need to rearrange the alignment sites so that all first codon positions are next to each other, followed by all second positions, and then all third positions (albeit, if we had to, we could use the codonpos Nexus file that we had created with AliView during the Multiple Sequence Alignment exercise and turn it into phylip format). Instead, RAxML, just like other phylogenetic tools, allows the specification of discontinuous partitions, which facilitates partitioning according to codon position.
To produce a separate file with the partitioning information, use
Now type the following text to define partitions according to codon position:
DNA, codon1 = 1-1458\3
DNA, codon2 = 2-1458\3
DNA, codon3 = 3-1458\3
As is probably self-explanatory, ‘DNA’ tells RAxML that these partitions refer to DNA data (rather than amino acid sequences or morphological characters), ‘codon1’, ‘codon2’, and ‘codon3’ are names for the individual partitions (you’re free to choose these as you like), and ‘2-1458\3’ for example specifies that each third site, counting from position 2 (thus sites 2, 5, 8,…) should be considered part of this partition.
Save the file with the ctrl-o key combination. Confirm that you want to save it with name ‘partitions.txt’ by pressing enter, then exit with the ctrl-x key combination.
Run RAxML for the rag1 alignment, and specify the name of the file with the partition information with the option ‘-q’:
raxml -s rag1_aln.phy -n rag1_aln.out -m GTRCAT -f a -x 345 -N autoMRE -p 678 -q partitions.txt
This time, no warnings are generated, as the alignment does not contain identical sequences or completely undetermined columns. Have a look at the reported proportion of gaps and completely undetermined characters and note that RAxML recognized that we specified three partitions. How long does RAxML need now for each bootstrap? Again, cancel the run with the ctrl-c key combination.
Exercise 3: Compare results
Results for the two RAxML runs can be found in the ‘res’ subdirectory of the RAxML activity directory. Change directory and have a look at the files in this directory:
Have a look at the info files of the two different RAxML runs called ‘RAxML_info.16s_filtered.out’ and ‘RAxML_info.rag1_aln.out’. These basically contain the same information that RAxML reports to the screen during the run.
How long did RAxML take to run for the two input files? What is the tree length and the likelihood of the two trees?
The maximum likelihood trees with bootstrap support values can be found in files RAxML_bipartitions.16s_filtered.out and RAxML_bipartitions.rag1_aln.out. In order to be able to open them in a tree viewer, replace the extension ‘.out’ of the three filenames with ‘.tre’:
mv RAxML_bipartitions.16s_filtered.out RAxML_bipartitions.16s_filtered.tre
mv RAxML_bipartitions.rag1_aln.out RAxML_bipartitions.rag1_aln.tre
Before downloading the tree files to your own computers, there’s one more step to be done while logged in to the AMI. We have prepared a python script with which you can quickly determine the mean bootstrap support value in the two trees. To use it, type
python3 ~/wme_jan2015/scripts/get_mean_bootstrap_support.py RAxML_bipartitions.16s_filtered.tre
python3 ~/wme_jan2015/scripts/get_mean_bootstrap_support.py RAxML_bipartitions.rag1_aln.tre
Which of the two trees has the better mean bootstrap support?
Use scp or Cyberduck to download the tree files to your own computer. If using scp, the command would look similar to this (you’ll have to replace the public DNS with your actual one):
scp [email protected]:~/wme_jan2015/activities/RAxML/res/RAxML_bipartitions.*.tre .
(note the period sign at the end of this command)
If you do not yet have a tree viewer installed on your own computer, visit http://tree.bio.ed.ac.uk/software/figtree/ to download and install the FigTree executable for your operating system. If you should have problems installing this program, you could alternativly use X2Go to open FigTree on the AMI (it is already installed there).
- Open the trees in FigTree. When prompted that the nodes/branches of the tree are labelled, simply click ‘OK’. This prompt indicates that bootstrap support values are stored as part of the tree. To display these support values, click the checkbox ‘Node labels’ on the left-hand side of the FigTree window. Then click the triangle to the left of this checkbox, which opens a menu. Click on the element to the right of ‘Display’, which currently says ‘Node ages’. Choose ‘labels’ instead of ‘Node ages’, this will display the bootstrap support values. To change the way the tree is ordered, click ‘Increasing Node Order’ in the ‘Tree’ menu in the menu bar. Compare the overall look of the two trees.