Multiple sequence alignment trimming is a common step in phylogenomic workflows and aims to reduce errors/uninformative sites in alignments. Reducing total alignment size can save downstream computation time, but excessive trimming can be a source of error.

Here, we will be trimming multiple sequence alignments using multiple strategies and examining the resulting alignment lengths.

Note, steps within objectives that have fill-in-the-blank prompts are indicated as such using blue color font. Please fill in these prompts. Additionally, if the software you are trying to use isn’t in your path, it is likely in ~/software.


Protocol

This objective is divided in five parts:

The dataset for this practical will be amino acid FASTA files.


1. Download and examine the dataset

  1. With your Guacamole terminal download the data set, which is available here: https://github.com/JLSteenwyk/2024_phylogenomics_workshop/raw/main/trimming/alignment_trimming_data.tar.gz.

Hint:

$ wget http://...
  1. Confirm that you have downloaded the tar zipped directory alignment_trimming_data.tar.gz
  2. Unzip the directory

Hint:

# the command is 'tar' ("tape archive")

solution:

$ tar -zxvf alignment_trimming_data.tar.gz
  1. Change directory into the newly unzipped directory
  2. Examine the contents of the directory using the ls command
  3. How many files are in alignment_trimming_data?


2. Trim multiple sequence alignments using ClipKIT

ClipKIT is a tool for multiple sequence alignment trimming that aims to keep phylogenetically informative sites and discard others. Publication and online documentation.

Steps:

  1. Examine the ClipKIT help message by executing the following: clipkit -h
  2. Briefly explain what smart-gap, kpic, and kpi trimming modes refer to.

Solution:

Different trimming modes.

smart-gap: dynamic determination of gaps threshold
kpic: keeps parismony informative and constant sites
kpi: keep only parsimony informative sites
  1. Write separate for loops that trim using the smart-gap and kpic, and kpi approaches. Challenge: write a nested for loop that trims using both methods.

Hint:

The general structure of a for loop is:
for i in $(<list of files>); do <action> ; done

Separate for loop solution:

$ for i in *mafft; do clipkit $i -m smart-gap -o $i.clipkit_smart-gap; done
$ for i in *mafft; do clipkit $i -m kpic -o $i.clipkit_kpic; done
$ for i in *mafft; do clipkit $i -m kpi -o $i.clipkit_kpi; done

Nested for loop solution:

$ for i in *mafft; do for j in $(echo "smart-gap kpic kpi"); do clipkit $i -m $j -o $i.clipkit_${j} ; done ; done


3. Trim multiple sequence alignments using BMGE

BMGE uses entropy to identify highly divergent sites in multiple sequence alignments and removes them. Publication and documentation.

Steps:

  1. Examine the help message by executing the following: BMGE -?
  2. Which parameter specifies the entropy threshold?

Separate for loop solution:

-h <thr_max> : real number corresponding to the maximum entropy threshold (ranges from 0 to 1; default: 0.5)
  1. Write separate for loops that trim using the an entropy threshold of 0.7 and 0.5. Challenge: write a nested for loop that trims using both methods.

Separate for loop solution:

$ for i in *mafft; do BMGE -i $i -t AA -h 0.7 -o $i.BMGE0.7 ; done
$ for i in *mafft; do BMGE -i $i -t AA -h 0.5 -o $i.BMGE0.5 ; done

Nested for loop solution:

$ for i in *mafft; do for j in $(echo "0.5 0.7"); do BMGE -i $i -t AA -h $j -o $i.BMGE${j} ; done ; done


4. Trim multiple sequence alignments using trimAl

TrimAl removes highly divergent sites using a conservation threshold. Publication and documentation.

  1. Examine the trimAl help message by executing the following: trimal -h
  2. How are the three trimming modes specified?

Solution:

-gappyout, -strict , & -strictplus
  1. Write separate for loops that trim using the gappyout and strictplus approaches. Challenge: write a nested for loop that trims using both methods

Separate for loop solution:

$ for i in *mafft; do trimal -in $i -out $i.trimal_gappyout -gappyout ; done
$ for i in *mafft; do trimal -in $i -out $i.trimal_strictplus -strictplus ; done

Nested for loop solution:

$ for i in *mafft; do for j in $(echo "gappyout strictplus"); do trimal -in $i -out $i.trimal_$j -${j} ; done ; done


5. Examine the resulting alignment lengths using PhyKIT

PhyKIT is a command-line bioinformatics tool to facilitate processing and analysis of multiple sequence alignments and phylogenetic trees. Publication and documentation.

  1. Navigate the PhyKIT online documentation.
  2. On the left navigation panel, click “Usage,” which will bring up four options: “General usage,” “Alignment-based functions,” “Tree-based functions,” Alignment- and tree-based functions.”
  3. Since we are calculating lengths of alignments, click “Alignment-based functions” and then “Alignment length”
  4. Write a for loop to calculate the alignment length of each trimmed alignment using PhyKIT. Note, PhyKIT can be called two ways:
  • phykit or the alias
  • pk_. The primary is a long-hand version and the latter is a short hand version that allows you to “tab out” all PhyKIT commands. Throughout, we will use the latter.

Solution:

$ for i in OG00*; do echo -e -n "$i\t"; pk_aln_len $i ; done
  1. For each ortholog, which trimming strategy removed the most sites and which removed the fewest?

  2. Let’s examine other ways these alignments may differ. We will use PhyKIT to calculate diverse metrics that summarize the information content of these alignments. Specifically, we will calculate the following:

Execute these commands and while the analysis is running, read about each metric in the PhyKIT documentation.

for i in $(ls OG00*); do
  echo -e -n "$i\talignment_length_no_gaps\t" ; pk_aln_len_no_gaps $i | awk '{print $1}';
  echo -e -n "$i\tpairwise_identity\t" ; pk_pairwise_id $i  | grep "mean" | awk '{print $NF}';
  echo -e -n "$i\tparsimony_informative_sites\t" ; pk_parsimony_informative_sites $i | awk '{print $1}';
done |tee alignment_information_content.txt
  1. Examine the resulting output. For ortholog OG0010342, which trimming strategy had the longest alignment length with no gaps?

Solution:

ClipKIT with the smart-gap parameter and trimAl with the gappyout parameter


6. Finished early? Try this challenge :D

  1. Run this thought experiment: How would you visualize the differences between each trimming strategy?

solution:

1. Conduct dimensional reduction analysis using principal component analysis (or PCA).
See this example from the ClipKIT manuscript https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001007.
In this figure, methods that are far away from others produced drastically different results.
For example, in the left panel, BMGE with an entropy threshold of 0.3 performed very different from the other trimming approaches.

2. Plot boxplots (or violin plots, histograms, etc.) with one panel for each metric.
  1. Finished really early? Try making a plot and show a TA!