Multiple Sequence Alignment (AMI version)

Lecture notes
Expected learning outcomes
Getting started
Exercise 1: Alignment and visualization with MAFFT and AliView
Exercise 2: Automated alignment filtering with BMGE
Exercise 3: Alignment and visualization of protein-coding sequences
Exercise 4: Codon-based manual alignment curation
Exercise 5: Adding further sequences to the alignment

Lecture notes

Multiple Sequence Alignment Introduction

Expected learning outcomes

The objective of this activity is to become familiar with multiple sequence alignment options and the visualization and editing of alignments, both manually and in an automated fashion, and with both non-coding and coding sequences.

Getting started

Multiple sequence alignment is an essential part of all phylogenetics workflows. We’re going to use sets of orthologuous sequences for two molecular markers, 16S and RAG1, for the same 294 taxa of teleost fishes with up to 250 million years of divergence. Despite their similar evolutionary history in this group, the two markers have very different characteristics. The mitochondrial 16S gene codes for a RNA subunit of the mitochondrial ribosome and contains many regions with high substitution rates, while the nuclear RAG1 gene encodes the Recombination-activating protein 1, with an overall well-conserved amino acid sequence.

We will use MAFFT and MUSCLE, two of the fastest and best-performing tools to align the two sets of sequences, and visualize the resulting alignments with a relatively new program called AliView. Alignments will be edited manually and automatically with the software BMGE, which determines the most reliable alignment positions based on the proportion of missing data and their entropy score.

In this AMI version of this exercise, you will install and run the alignment viewer AliView locally on your machine, but MAFFT and BMGE will be executed remotely on the Amazon Machine Image (AMI). The reason for this is that the graphical user interface (GUI) of AliView is likely to work much better when installed locally, while command line tools like MAFFT and BMGE can easily be operated remotely. The alignment tool MUSCLE comes included in AliView, and can be triggered from the AliView GUI. We will only use it for alignment refinement as MUSCLE is not quite as fast as MAFFT (one could also install MAFFT so that it can be triggered from within AliView, however, this will not be needed for this exercise). Please make sure that you are familiar with connecting to, and transferring files to and from the AMIs provided by the Workshop, using tools such as scp or Cyberduck.

Before we start, please visit http://www.ormbunkar.se/aliview/ to download and install the most recent stable version of the software AliView, unless the software is present on your computer already. Next, you should make sure that you can access the Workshop AMI with ssh (if you’re on Linux or OSX) or PuTTY (if you’re on Windows). For the rest of this exercise, ‘console window’ will refer to the tool that you use to log in to the AIM on the command line, meaning either the Terminal or PuTTY, depending on your operating system. Use a console window to log in to the AIM and make sure that MAFFT and BGME are installed there and running fine. In the console window, run

mafft --version
If you see something like ‘v7.215 (2014/12/17)’, you’re good to go. Also check out the MAFFT installation path with

which mafft
Navigate to the directory containing the exercise material:
cd ~/wme_jan2015/activities/MSA/
Have a look at the directory’s content:
ls -l

Exercise 1: Alignment and visualization with MAFFT and AliView

In this exercise, we’ll use the 16s sequence set to demonstrate alignment with MAFFT and alignment visualization and editing with AliView.

Change directory to the newly created directory MSA:
cd MSA
Using ls, you’ll see three subdirectories called 16s, rag1, and rag1_add. Change into the subdirectory called 16s:
cd 16s
The directory contains 294 fasta files. Have a look at the first of these:
less Acanthapomotis.fasta

(quit this view by typing q)

If you’re not familiar with fasta format yet, this is how they look like. Each record consists of an id and a sequence, of which the id is always on a single line that starts with a ‘>’ symbol, followed by one or more lines containing the sequence. Like this file, all 294 files in this subdirectory contain only a single record in fasta format, and both the record id and the file name indicate the species identifier (also note that all species identifiers have the same number of letters, none of them contain special characters etc.). This is just an example of how I organize my data, but in any case, I strongly recommend the consistent use of species identifiers in phylogenetic analyses, since many phylogenetic applications and file formats will not work with actual latin or common species names.
Concatenate all single-entry fasta files to one unaligned fasta file containing all 294 sequences of the 16s gene:
cat *.fasta > 16s.fasta
This is how the combined fasta file looks like:
less 16s.fasta

(you can scroll through the file by pressing the space bar, and again, quit by typing q)
We’re about to run MAFFT. While we’re only going to need some of the most basic functions of MAFFT, you may want to have a look at all the available options:
mafft --help
Now use MAFFT to align all 16S sequences and save these to a new file (this should take about 15 sec):
mafft 16s.fasta > 16s_aln.fasta

The file 16s_aln.fasta contains all aligned sequences in fasta format.

As you see, running MAFFT is fast and easy. The default settings of MAFFT usually work well, but we’ll explore other options for alignment penalties below.
Have a first look at the aligned fasta file:
less 16s_aln.fasta

(once more, use q to quit)
Before we turn to AliView to view and edit the aligned fasta file, we’ll produce a second alignment of the same sequences, this time with an increased gap opening penalty, using MAFFT’s ‘--op‘ option. Effectively, this reduces the score of alignments with a large number of gaps, and will therefore tend to result in less gappy alignments. To save this alignment with a different name than the first, run
mafft --op 2 16s.fasta > 16s_op2_aln.fasta
Finally, we produce a third alignment of the same sequences, this time with an increased gap extension penalty using MAFFT’s ‘--ep‘ option:
mafft --ep 1 16s.fasta > 16s_ep1_aln.fasta
Download the three alignment files 16s_aln.fasta, 16s_op2_aln.fasta, and 16s_ep1_aln.fasta using scp or Cyberduck and store them somewhere on your own computer. If using scp, the command should look roughly look this:
scp [email protected]:~/wme_jan2015/activities/MSA/16s/16s_aln.fasta .

Note that if you use scp you’ll have to type this not into the console window with which you’ve logged in to the AMI, but into a second console window. And you’ll have to replace the public DNS of the AMI in the above command (the ec2-54-146-138-136 part) with your actual one.
Start AliView locally on your computer by double-clicking the application icon, and open file 16s_aln.fasta using the ‘Open File’ in the ‘File’ menu (or Command-O).
To get an overview of the alignment, zoom in and out using the + and – keys on your keyboard (you may also have to hold the shift key).
Without closing the first AliView window, also open 16s_op2_aln.fasta in AliView, again using ‘Open File’ in the ‘File’ menu.
Zoom in and out as before, and compare 16s_op2_aln.fasta with 16s_aln.fasta. Toggle between the two open alignment windows, using the Alt-Tab key combination on Linux or the Command-` key combination on a Mac. Compare the total alignment lengths shown in the status bar at the bottom right (2309 bp in 16s_aln.fasta; 2223 bp in 16s_op2_aln.fasta).
Again without closing the first two AliView windows, also open file 16s_ep1_aln.fasta in AliView as before.
In each of the three open alignment windows: Zoom in so that you can see the last 500 bp of the alignment. Note that a single sequence bridges two large gaps around bp 2040 and bp 2210 in 16s_aln.fasta, while gaps are placed differently in 16s_op2_aln.fasta and far more smaller gaps are found in the same region in 16s_ep1_aln.fasta.
Close 16s_op2_aln.fasta and 16s_ep1_aln.fasta. We will from now on focus on the alignment produced with default parameters, 16s_aln.fasta.
In the AliView window of 16s_aln.fasta, place the cursor on the sequence that bridges the first of the two large gaps near the end of the alignment (around bp 2040) and zoom in (using the + key) until you can see the labels of the individual bases. You’ll see that the taxon responsible for these gaps is called ‘Balistecaprisc’. It appears that the sequence alignment for this taxon is correct up to bp 2023, but that the sequence is not homologous to other taxa between bp 2023 and the end of the alignment.
Use the cursor to select position 2024 of the ‘Balistecaprisc’ sequence. Use ‘Expand selection Right’ in the ‘Selection’ menu to select bp 2024-2354 of the ‘Balistecaprisc’ sequence.
Remove this part of the ‘Balistecaprisc’ sequence using ‘Clear selected bases’ in the ‘Edit’ menu.
After removing this part of the ‘Balistecaprisc’ sequence, the two large gaps near the end are not bridged by any sequence anymore. Remove these gaps entirely using ‘Delete gap-only columns’ in the ‘Edit’ menu.
Save the alignment and note how the alignment length shown at the bottom right of the alignment window is now 2244 bp.
Have a look at the region around bp 1760, which appears to be poorly aligned. Use the cursor to click in the ruler area (above the alignment) and select the columns 1728-1800. We chose these boundaries as these boundary sites appear to be reliably aligned, in contrast to the alignment block between these boundaries.
Click ‘Realign selected block’ in the ‘Align’ menu. This will trigger MUSCLE to realign only the selected sites 1728-1800.
Does the alignment look more reliable now? You can use ‘Undo’ and ‘Redo’ to compare the alignment before and after realigning (it could be that you have to press these buttons twice due to a small bug). After realignment, remove newly created gap-only columns, and save the alignment file locally on your machine with ‘Save’ in the ‘File’ menu.
Use scp or Cyberduck to transfer the modified alignment file in fasta format back to the AMI, save it there in the 16s subdirectory of the MSA exercise directory. If using scp, the command should look more or less like this (again, this would need to be typed into the second console window, and the public DNS would have to be changed):
scp 16s_aln.fasta [email protected]:~/wme_jan2015/activities/MSA/16s/

Exercise 2: Automated alignment filtering with BMGE

Next, we’ll use BMGE to automatically remove sites with questionable homology from the alignment.

Go back to the console window that you used to log in to the AMI.
Make sure you’re in the right place by typing pwd (which stands for ‘print working directory’) and you should see ‘/home/ubuntu/wme_jan2015/activities/MSA/16s’ as an answer. If not, move to that directory. Then type

java -jar /home/ubuntu/wme_jan2015/software/BMGE-1.12/BMGE.jar -i 16s_aln.fasta -t DNA -of 16s_filtered.fasta -oh 16s_filtered.html

With the above command, BMGE writes a filtered alignment in fasta format in file 16s_filtered.fasta, and visualizes the filtered alignment in HTML format in file 16s_filtered.html. With BMGE, alignment filtering is based on the gap rate cut-off (-g), a minimum block size (-b), and by what the authors of BMGE call a ‘smoothed entropy-like score’ (option -h). Basically, this is a measure of the nucleotide diversity at this site (please see the BMGE publication for more information). By default, BMGE selects sites with an entropy score below 0.5 (-h 0.5) and a gap proportion below 0.2 (-g 0.2), and only if these form a block of at least 5 sites with these properties (-b 5). Before downloading the filtered alignment in fasta and html format to your own computer, let’s run BMGE again with custom settings for entropy score cut-off, gap rate cut-off, and minimum block size, and note how this changes the overall number of selected sites and the distribution of selected blocks in the alignment. For example, increase the allowed proportion of gaps using -g 0.3:

java -jar /home/ubuntu/wme_jan2015/software/BMGE-1.12/BMGE.jar -i 16s_aln.fasta -t DNA -g 0.3 -of 16s_filtered_g03.fasta -oh 16s_filtered_g03.html
The standard output of BMGE to the console window tells you how many sites (characters) remain selected. Note the difference between the last two runs.
Download the filtered fasta and html files from the AMI to your computer, with scp or Cyberduck. If using scp, the command would be something like
scp [email protected]:~/wme_jan2015/activities/MSA/16s/16s_filtered* .

(and again, the same advice applies to this and all below scp commands: type it into a second console window and replace the public DNS)
On your own computer, open file 16s_filtered.html in a browser. It’s relatively large (8.6 MB) but should load fine in Firefox, Safari, or Chrome.
Scroll through the alignment and note the black alignment blocks.
At the very top of the alignment, you’ll see two measures plotted for each site in light grey and black. The gap proportion is shown with light gray equal signs and ranges from 0 to 1 and black colons indicate the entropy score. You’ll note that the black alignment blocks coincide with regions of low gap proportion and low entropy, which are be the most suitable alignment positions for phylogenetic inference.
With file 16s_filtered.html still open in your browser, scroll to between bp 1150 and bp 1240. You’ll see several low-entropy regions that are not marked in black because a large proportion of sequences contain gaps for these sites. Also open 16s_filtered_g03.html in your browser, and compare it with file 16s_filtered.html. Note how sites 1229 to 1238 are selected in 16s_filtered_g03.html, but not in 16s_filtered.html.
This step is just to explain that BMGE could also be run from within AliView, if it was installed locally on your machine (which it probably isn’t). In AliView, click on ‘Edit external command’ in the ‘External commands’ menu. You’ll see that two external commands (FastTree + FigTree and Textedit) are already defined. If you would like to use these (but this is not part of this exercise), you’ld have to select the checkboxes to the left of the ‘Display name’ fields, and make sure that the programs to be executed are correctly installed. Similarly, you could define a third command to run BMGE. To do so, you would select the third checkbox and write ‘BMGE’ in the third ‘Display name’ field. In the command window to the right of this field, you would have to write the following:
java -jar /Applications/BMGE-1.12/BMGE.jar -i CURRENT_ALIGNMENT_FASTA -t DNA -g 0.5 -of TEMP_OUT_FILE
ALIVIEW_OPEN TEMP_OUT_FILE

(there’s a line break after TEMP_OUT_FILE)

Obviously, for this to work you would have to install BMGE in /Applications/BMGE-1.12/, as specified in the above command. You could then apply this command to the current alignment by simply clicking ‘BMGE’ in the ‘External commands’ menu.

Exercise 3: Alignment and visualization of protein-coding sequences

In the next two exercises, we’ll align and visualize protein-coding sequences of the RAG1 gene, and learn to use translation patterns to improve the alignment.

In the console window with which you logged in to the AMI, change the current directory to the directory containing the RAG1 sequences in fasta format:
cd ../rag1
As before, combine all single-sequence fasta files to a file containing all 294 sequences:
cat *.fasta > rag1.fasta
Check that the combined fasta file looks as expected:
less rag1.fasta

(I’m sure you know how to quit this view by now)
Again use MAFFT and default settings to align all sequences and save the alignment in fasta format as rag1_aln.fasta:
mafft rag1.fasta > rag1_aln.fasta
Also make sure that the aligned fasta file looks alright:
less rag1_aln.fasta
Download rag1_aln.fasta to your own computer using scp or Cyberduck. With scp, the command would be similar to this:
scp [email protected]:~/wme_jan2015/activities/MSA/rag1/rag1_aln.fasta .
Open file rag1_aln.fasta in AliView (on your own computer) using ‘Open File’ in the ‘File’ menu.
Zoom out and scroll to bp 3000-4000.
Click the fourth icon from the left in the tool bar to translate nucleotide sequence to amino acids (or the Command-T key combination on a Mac).

Exercise 4: Codon-based manual alignment curation

We’ll learn how to identify the correct reading frame for amino acid translation, and how this information can help to improve the alignment.

Click the sigma sign icon in the centre of the tool bar to count stop codons. How many stop codons does AliView count?
Change the reading frame using the box to the right of the sigma sign icon and also count stop codons for the second and third reading frame. Note how the overall colour pattern of the alignment changes when you select different reading frames. Does one reading frame appear more conserved than the other two? Which reading frame has the lowest number of stop codons? Use this reading frame from now on.
Notice how the main part of the alignment has five gap regions: bp 2902-2903, bp 3189-3218, bp 3779-3787, bp 3799-3804, and bp 4175-4182. The first and the last of these gap regions contain a number of gap columns that can not be divided by 3 and thus lead to a shift of the reading frame in most taxa. With the currently selected reading frame (reading frame 1), stop codons appear only to the left of the first gap region and to the right of the last gap region, e.g. on columns 2893-2895 and columns 4192-4194.
Find the sequence that bridges the first and the last of the above gap regions. Obviously, this sequence is misaligned between bp 2897 and 2961, and between bp 4148 and 4282. Select and remove these fragments with ‘Clear selected bases’ in the ‘Edit’ menu.
Remove the resulting gap-only sites with ‘Delete gap-only columns’ in the ‘Edit’ menu.
The reading frame has now changed for most of the alignment. To correct for this, select the entire alignment using ‘Select all’ in the ‘Selection’ menu, and reset the coding pattern for all sites using the icon near the center that shows a number ‘1’ over a color bar (alternatively, click ‘Set selection as coding (selection starting with codon position=1)’ in the ‘Selection’ menu). AliView might need a few seconds to execute this command. Once again, find the reading frame with the lowest number of stop codons.
Multiple sequences extend into the gap region at bp 3187-3216. This appears to be a flexible region of the translated protein, and extensions into this gap appear to have evolved multiple times independently. We assume that these extensions are not homologous to each other. Since gap boundary positions can be misaligned relative to the sequence that bridges the gap, we conservatively remove the broken codon to the left and right of this gap. Select (for all taxa) sites 3185-3217 and remove them using ‘Delete selected’ in the ‘Edit’ menu.
Similarly, remove the other gaps by deleting bp 3743-3754, and bp 3752-3757 (in this order).
Trim off the beginning and the end of the alignment, but so that the remaining alignment starts with a first codon position and ends with a third codon position (only for the reason that it facilitates partitioning in phylogenetic analyses). To do so, select site 2776 by clicking in the ruler area above the alignment, and extend the selection to the beginning of the alignment using ‘Expand selection Left’ in the ‘Selection’ menu. Use ‘Delete Selected’ in the ‘Edit’ menu.
Remove the tail of the alignment, from bp 1462 to the end, again by selecting site 1462, using ‘Expand selection Right’ in the ‘Selection’ menu, and ‘Delete selected’ in the ‘Edit’ menu
Once more, count the number of stop codons using the sigma sign icon in the centre of the toolbar. Save the alignment file, but leave the AliView window open.

Exercise 5: Adding further sequences to the alignment

AliView allows the addition of sequences to an existing alignment, which can be handy when some sequence data becomes available at later stages.

You’ll find sequences for four additional taxa in subdirectory rag1_add of the directory containing the activity material on the AMI. Use the console window with which you’ve logged in to the AMI to change to this directory:
cd ../rag1_add

and have a look at the files in it:

ls -l
Again, combine the four additional sequence files in fasta format to a single fasta file:
cat *.fasta > rag1_add.fasta
Download file rag1_add.fasta to your own computer, using scp or Cyberduck. With scp, the command would look more or less like this:
scp [email protected]:~/wme_jan2015/activities/MSA/rag1_add/rag1_add.fasta .
Open rag1_add.fasta in a text editor, such as Gedit or Emacs on Linux, TextEdit on the Mac, or Notepad on Windows. Select all text and copy to the clipboard.
Go back to the AliView alignment window and use ‘Add and align sequences from clipboard (fasta)’ in the ‘Align’ menu. This will again use MUSCLE, which comes along with AliView, to align the additional sequences.
If the sequence addition has changed the reading frame, again select the entire alignment with ‘Select all’ in the ‘Selection’ menu, and then click the icon with a number ‘1’ near the center of the tool bar. Make sure the correct reading frame is selected.
The addition of these four sequences has again opened two gap regions. Remove sites 409-411 and sites 967-972, and also remove the tail of the new sequences starting at bp 1459.
The four new sequences were added at the bottom of the alignment, and taxon names are not sorted alphabetically anymore. Sort them using ‘Sort sequences by name’ in the ‘View’ menu.
Save the edited alignment in fasta format and phylip format, using ‘Save as Fasta’ and ‘Save as Phylip (full names & padded)’ from the ‘File’ menu, and confirm that you want to overwrite rag1_aln.fasta. In case you’re not familiar with the phylip format yet, you might want to take this opportunity to have a look at it in a text editor. As you will see, phylip format is particularly simple, consisting of the number of taxa and characters on the very first line, followed by one line per record, on which the id and the sequence are separated by whitespace characters.
Also save the alignment as a nexus file sorted according to codon position, using ‘Save as codonpos Nexus (codonpos as charsets – excluded removed)’ in the ‘File’ menu. This feature can be useful when partitioning based on codon position is used in a subsequent phylogenetic analysis. Again, if you’re not familiar with nexus format yet, open the nexus file in a text editor to see how this format is different from fasta and phylip format.