Multiple Sequence Alignment

Lecture notes
Expected learning outcomes
Getting started
Exercise 1: Alignment and visualization with MAFFT and AliView
Exercise 2: Automated alignment filtering with BMGE
Exercise 3: Alignment and visualization of protein-coding sequences
Exercise 4: Codon-based manual alignment curation
Exercise 5: Adding further sequences to the alignment

Lecture notes

Multiple Sequence Alignment Introduction

Expected learning outcomes

The objective of this activity is to become familiar with multiple sequence alignment options and the visualization and editing of alignments, both manually and in an automated fashion, and with both non-coding and coding sequences.

Getting started

Multiple sequence alignment is an essential part of all phylogenetics workflows. We’re going to use sets of orthologuous sequences for two molecular markers, 16S and RAG1, for the same 294 taxa of teleost fishes with up to 250 million years of divergence. Despite their similar evolutionary history in this group, the two markers have very different characteristics. The mitochondrial 16S gene codes for a RNA subunit of the mitochondrial ribosome and contains many regions with high substitution rates, while the nuclear RAG1 gene encodes the Recombination-activating protein 1, with an overall well-conserved amino acid sequence. We will use MAFFT, one of the fastest and best-performing tools to align the two sets of sequences, and visualize the resulting alignments with a relatively new program called AliView. Alignments will be edited manually and automatically with the software BMGE, which determines the most reliable alignment positions based on the proportion of missing data and their entropy score.

Before we start, please make sure you have recent versions of MAFFT, AliView, and BMGE installed. Of these, AliView and BMGE should be sitting in your Applications directory. AliView can simply be started by double-clicking, but BMGE can only be started from the command line (we’ll get to that later) – for now, just make sure that the applications are there. To test whether a new version of MAFFT is installed and running fine, open a Terminal window (assuming that you’re using a UNIX-based OS such as Linux or Mac) and run

mafft --version

If you see something like ‘v7.205 (2014/10/20)’, you’re good to go. However, remember the MAFFT installation path which you can find out with

which mafft

You’ll need this installation path later in the exercise to run MAFFT from within AliView. If any of the applictions are missing, please visit these links and download the latest versions:

MAFFT: http://mafft.cbrc.jp/alignment/software/
AliView: http://www.ormbunkar.se/aliview/
BMGE: ftp://ftp.pasteur.fr/pub/GenSoft/projects/BMGE/
Download the tar-zipped directory containing nucleotide sequences in fasta format, for the two molecular markers 16S and RAG1: MSA.tgz
Open a Terminal window and navigate to the directory, in which you saved the file MSA.tgz.
Uncompress file MSA.tgz using the following command:

tar -xzf MSA.tgz

If the above does not work (happened to me on Linux), you could try this instead:

gunzip -c MSA.tgz | tar zxvf –

Exercise 1: Alignment and visualization with MAFFT and AliView

In this exercise, we’ll use the 16s sequence set to demonstrate alignment with MAFFT and alignment visualization and editing with AliView.

Using the Terminal window, change directory to the newly created directory MSA:

cd MSA
Using ls, you’ll see three subdirectories called 16s, rag1, and rag1_add. Change into the subdirectory called 16s:

cd 16s
The directory contains 294 fasta files. Have a look at the first of these:

less Acanthapomotis.fasta

(quit this view by typing q)

Like this file, all 294 files in this subdirectory contain only a single sequence in fasta format, and both the fasta header and the file name indicate the species identifier (also note that all species identifiers have the same number of letters, none of them contain special characters etc.). This is just an example of how I organize my data, but in any case, I strongly recommend the consistent use of species identifiers in phylogenetic analyses, since many phylogenetic applications and file formats will not work with actual latin or common species names.
Concatenate all single-entry fasta files to one unaligned fasta file containing all 294 sequences of the 16s gene:

cat *.fasta > 16s.fasta
This is how the combined fasta file looks like:

less 16s.fasta

(you can scroll through the file by pressing the space bar, and again, quit by typing q)
We’re about to run MAFFT. While we’re only going to need some of the most basic functions of MAFFT, you may want to have a look at all the available options:

mafft --help
Now use MAFFT to align all 16S sequences and save these to a new file (this should take about 15 sec):

mafft 16s.fasta > 16s_aln.fasta

The file 16s_aln.fasta contains all aligned sequences in fasta format.

As you see, running MAFFT is fast and easy. The default settings of MAFFT usually work well, but we’ll explore other options for alignment penalties below.
Have a first look at the aligned fasta file:

less 16s_aln.fasta

(once more, use q to quit)
Start AliView, and open file 16s_aln.fasta using the ‘Open File’ in the ‘File’ menu (or Command-O).
To get an overview of the alignment, zoom in and out using the + and – keys on your keyboard (you may also have to hold the shift key).
Without closing the alignment window, switch back to the Terminal. We’ll produce a second alignment of the same sequences, this time with an increased gap opening penalty, using MAFFT’s ‘--op’ option. Effectively, this reduces the score of alignments with a large number of gaps, and will therefore tend to result in less gappy alignments. To save this alignment with a different name than the first, run

mafft --op 2 16s.fasta > 16s_op2_aln.fasta
Also open this alignment in AliView, again using ‘Open File’ in the ‘File’ menu or the Command-O key combination.
Zoom in and out as before, and compare 16s_op2_aln.fasta with 16s_aln.fasta. Toggle between the two open alignment windows, using the Alt-Tab key combination on Linux or the Command-` key combination on a Mac. Compare the total alignment lengths shown in the status bar at the bottom right (2309 bp in 16s_aln.fasta; 2223 bp in 16s_op2_aln.fasta).
Once more, switch back to the Terminal without closing any alignment windows, and produce a third alignment of the same sequences, this time with an increased gap extension penalty using MAFFT’s ‘--ep’ option:

mafft --ep 1 16s.fasta > 16s_ep1_aln.fasta
Open file 16s_ep1_aln.fasta in AliView as before.
In each of the three open alignment windows: Zoom in so that you can see the last 500 bp of the alignment. Note that a single sequence bridges two large gaps around bp 2040 and bp 2210 in 16s_aln.fasta, while gaps are placed differently in 16s_op2_aln.fasta and far more smaller gaps are found in the same region in 16s_ep1_aln.fasta.
Close 16s_op2_aln.fasta and 16s_ep1_aln.fasta. We will from now on focus on the alignment produced with default parameters, 16s_aln.fasta.
In the AliView window of 16s_aln.fasta, place the cursor on the sequence that bridges the first of the two large gaps near the end of the alignment (around bp 2040) and zoom in (using the + key) until you can see the labels of the individual bases. You’ll see that the taxon responsible for these gaps is called ‘Balistecaprisc’. It appears that the sequence alignment for this taxon is correct up to bp 2023, but that the sequence is not homologous to other taxa between bp 2023 and the end of the alignment.
Use the cursor to select position 2024 of the ‘Balistecaprisc’ sequence. Use ‘Expand selection Right’ in the ‘Selection’ menu to select bp 2024-2354 of the ‘Balistecaprisc’ sequence.
Remove this part of the ‘Balistecaprisc’ sequence using ‘Clear selected bases’ in the ‘Edit’ menu.
After removing this part of the ‘Balistecaprisc’ sequence, the two large gaps near the end are not bridged by any sequence anymore. Remove these gaps entirely using ‘Delete gap-only columns’ in the ‘Edit’ menu.
Save the alignment and note how the alignment length shown at the bottom right of the alignment window is now 2244 bp.
Have a look at the region around bp 1760, which appears to be poorly aligned. Use the cursor to click in the ruler area (above the alignment) and select the columns 1728-1800. We chose these boundaries as these boundary sites appear to be reliably aligned, in contrast to the alignment block between these boundaries.
In the ‘Align’ menu, click ‘Change default Aligner program > for realigning all (or selected blocks)’.
Click the third radio button to select ‘Mafft-globalpair’ as the default algorithm for realignment. Make sure that the specified MAFFT installation path is correct, and confirm with ‘OK’.
Click ‘Realign selected block’ in the ‘Align’ menu.
Does the alignment look more reliable now? Once more, remove gap-only columns, and save the alignment file.

Exercise 2: Automated alignment filtering with BMGE

Next, we’ll use BMGE to automatically remove sites with questionable homology from the alignment.

Go back to the Terminal window, without closing the AliView window. Type

java -jar /Applications/BMGE-1.1/BMGE.jar -i 16s_aln.fasta -t DNA -of 16s_filtered.fasta -oh 16s_filtered.html

(if the file BMGE.jar is not in /Applications/BMGE-1.1/ on your machine, you’ll have to change this for the actual program file path)

With the above command, BMGE writes a filtered alignment in fasta format in file 16s_filtered.fasta, and visualizes the filtered alignment in HTML format in file 16s_filtered.html.
Open file 16s_filtered.html in a browser. It’s relatively large (8.6 MB) but should load fine in Firefox, Safari, or Chrome.
Scroll through the alignment and note the black alignment blocks.
At the very top of the alignment, you’ll see two measures plotted for each site in light grey and black. The gap proportion is shown with light gray equal signs and ranges from 0 to 1. Black colons indicate what the authors of BMGE call a ‘smoothed entropy-like score’. Basically, this is a measure of the nucleotide diversity at this site (please see the BMGE publication for more information). You’ll note that the black alignment blocks coincide with regions of low gap proportion and low entropy, which are be the most suitable alignment positions for phylogenetic inference.

Our selection of alignment blocks is based on default settings of BMGE for the entropy score cut-off (option -h), the gap rate cut-off (-g), and the minimum block size (-b). By default, BMGE selects sites with an entropy score below 0.5 (-h 0.5) and a gap proportion below 0.2 (-g 0.2), and only if these form a block of at least 5 sites with these properties (-b 5).
Repeat the BMGE block selection with custom settings for entropy score cut-off, gap rate cut-off, and minimum block size, and note how this changes the overall number of selected sites and the distribution of selected blocks in the alignment. For example, increase the allowed proportion of gaps using -g 0.3:

java -jar /Applications/BMGE-1.1/BMGE.jar -i 16s_aln.fasta -t DNA -g 0.3 -of 16s_filtered.fasta -oh 16s_filtered_g03.html
The standard output of BMGE to the Terminal tells you how many sites (characters) remain selected. Note the difference between the last two runs.
With file 16s_filtered.html still open in your browser, scroll to between bp 1150 and bp 1240. You’ll see several low-entropy regions that are not marked in black because a large proportion of sequences contain gaps for these sites. Also open 16s_filtered_g03.html in your browser, and compare it with file 16s_filtered.html. Note how sites 1229 to 1238 are selected in 16s_filtered_g03.html, but not in 16s_filtered.html.
Also try out different combinations of -g 0.5, -h 0.2, and -b 20.
Let’s try to execute BMGE from within AliView. In AliView, the unfiltered file 16s_aln.fasta should still be open. Click on ‘Edit external command’ in the ‘External commands’ menu. You’ll see that two external commands (FastTree + FigTree and Textedit) are already defined. If you would like to use these (but this is not part of this exercise), you’ld have to select the checkboxes to the left of the ‘Display name’ fields, and make sure that the programs to be executed are correctly installed. For now, we’ll define a third command to run BMGE. Select the third checkbox and write ‘BMGE’ in the third ‘Display name’ field. In the command window to the right of this field, write the following:

java -jar /Applications/BMGE-1.1/BMGE.jar -i CURRENT_ALIGNMENT_FASTA -t DNA -g 0.5 -of TEMP_OUT_FILE
ALIVIEW_OPEN TEMP_OUT_FILE

(make sure that the path to BMGE.jar is the correct one and that there’s a line break after the first ‘TEMP_OUT_FILE’)

Confirm with ‘OK’, then apply this command to the current alignment by clicking ‘BMGE’ in the ‘External commands’ menu. A new alignment window will open with the filtered alignment that should now be about 974 positions in length. Save this alignment as 16s_filtered.fasta using ‘Save as Fasta’ in the ‘File’ menu (the dialog window probably suggests to save to a temporary file directory, if so, change this to save to the 16s directory). Confirm that you want to overwrite the existing file of the same name.
Also save the alignment in nexus and philip format, using ‘Save as Nexus’ and ‘Save as Phylip (full names & padded)’ from the ‘File’ menu. Quit AliView.

Exercise 3: Alignment and visualization of protein-coding sequences

In the next two exercises, we’ll align and visualize protein-coding sequences of the RAG1 gene, and learn to use translation patterns to improve the alignment.

In the Terminal window, change the current directory to the directory containing the RAG1 sequences in fasta format:

cd ../rag1
As before, combine all single-sequence fasta files to a file containing all 294 sequences:

cat *.fasta > rag1.fasta
Check that the combined fasta file looks as expected:

less rag1.fasta

(I’m sure you know how to quit this view by now)
Again use MAFFT and default settings to align all sequences and save the alignment in fasta format as rag1_aln.fasta:

mafft rag1.fasta > rag1_aln.fasta
Also make sure that the aligned fasta file looks alright:

less rag1_aln.fasta
Restart AliView and open file rag1_aln.fasta using ‘Open File’ in the ‘File’ menu.
Zoom out and scroll to bp 3000-4000.
Click the fourth icon from the left in the tool bar to translate nucleotide sequence to amino acids (or the Command-T key combination on a Mac).

Exercise 4: Codon-based manual alignment curation

We’ll learn how to identify the correct reading frame for amino acid translation, and how this information can help to improve the alignment.

Click the sigma sign icon in the centre of the tool bar to count stop codons. How many stop codons does AliView count?
Change the reading frame using the box to the right of the sigma sign icon and also count stop codons for the second and third reading frame. Note how the overall colour pattern of the alignment changes when you select different reading frames. Does one reading frame appear more conserved than the other two? Which reading frame has the lowest number of stop codons? Use this reading frame from now.
Notice how the main part of the alignment has four gap regions: bp 2902-2903, bp 3189-3218, bp 3779-3787, bp 3799-3804, and bp 4175-4182. The first and the last of these gap regions contain a number of gap columns that can not be divided by 3 and thus lead to a shift of the reading frame in most taxa. With the currently selected reading frame (reading frame 1), stop codons appear only to the left of the first gap region and to the right of the last gap region, e.g. on columns 2893-2895 and columns 4192-4194.
Find the sequence that bridges the first and the last of the above gap regions. Obviously, this sequence is misaligned between bp 2897 and 2961, and between bp 4135 and 4282. Select and remove these fragments with ‘Clear selected bases’ in the ‘Edit’ menu.
Remove the resulting gap-only sites with ‘Delete gap-only columns’ in the ‘Edit’ menu.
The reading frame has now changed for most of the alignment. To correct for this, select the entire alignment using ‘Select all’ in the ‘Selection’ menu, and reset the coding pattern for all sites using the icon near the center that shows a number ‘1’ over a color bar (alternatively, click ‘Set selection as coding (selection starting with codon position=1)’ in the ‘Selection’ menu). AliView might need a few seconds to execute this command. Once again, find the reading frame with the lowest number of stop codons.
Multiple sequences extend into the gap region at bp 3187-3216. This appears to be a flexible region of the translated protein, and extensions into this gap appear to have evolved multiple times independently. We assume that these extensions are not homologous to each other. Since gap boundary positions can be misaligned relative to the sequence that bridges the gap, we conservatively remove the broken codon to the left and right of this gap. Select (for all taxa) sites 3185-3217 and remove them using ‘Delete selected’ in the ‘Edit’ menu.
Similarly, remove the other gaps by deleting bp 3743-3754, and bp 3752-3757 (in this order).
Trim off the beginning and the end of the alignment, but so that the remaining alignment starts with a first codon position and ends with a third codon position (only for the reason that it facilitates partitioning in phylogenetic analyses). To do so, select site 2776 by clicking in the ruler area above the alignment, and extend the selection to the beginning of the alignment using ‘Expand selection Left’ in the ‘Selection’ menu. Use ‘Delete Selected’ in the ‘Edit’ menu.
Remove the tail of the alignment, from bp 1462 to the end, again by selecting site 1462, using ‘Expand selection Right’ in the ‘Selection’ menu, and ‘Delete selected’ in the ‘Edit’ menu
Once more, count the number of stop codons using the sigma sign icon in the centre of the toolbar. Save the alignment file, but leave the AliView window open.

Exercise 5: Adding further sequences to the alignment

AliView allows the addition of sequences to an existing alignment, which can be handy when some sequence data becomes available at later stages.

In order to add sequences to the current alignment, first click ‘Change default Aligner program > when pasting new sequences (profile-alignment)’ in the ‘Align’ menu of AliView. Choose the second radio button to select MAFFT with option ‘-add’ as the aligner program when adding sequences. Make sure the MAFFT installation path is correctly specified.
You’ll find sequences for four additional taxa in directory rag1_add. Use the Terminal window to change to this directory:

cd ../rag1_add

and have a look at the files in it:

ls -l
Again, combine the four additional sequence files in fasta format to a single fasta file:

cat *.fasta > rag1_add.fasta
Open rag1_add.fasta in a text editor, such as Gedit or Emacs on Linux, TextEdit on the Mac, or Notepad on Windows. Select all text and copy to the clipboard.
Go back to the AliView alignment window and use ‘Add and align sequences from clipboard (fasta)’ in the ‘Align’ menu.
If the sequence addition has changed the reading frame, again select the entire alignment with ‘Select all’ in the ‘Selection’ menu, and then click the icon with a number ‘1’ near the center of the tool bar. Make sure the correct reading frame is selected.
The addition of these four sequences has again opened two gap regions. Remove sites 409-411 and sites 967-969, and also remove the tail of the new sequences starting at bp 1462.
The four new sequences were added at the bottom of the alignment, and taxon names are not sorted alphabetically anymore. Sort them using ‘Sort sequences by name’ in the ‘View’ menu.
Save the edited alignment in fasta format and phylip format, using ‘Save as Fasta’ and ‘Save as Phylip (full names & padded)’ from the ‘File’ menu, and confirm that you want to overwrite rag1_aln.fasta.
Also save the alignment as a nexus file sorted according to codon position, using ‘Save as codonpos Nexus (codonpos as charsets – excluded removed)’ in the ‘File’ menu. We will continue to work with these files in a later exercise.

Table of contents