Multiple sequence alignment trimming is a common step in phylogenomic workflows that aims to remove erroneous or uninformative sites from alignments. Reducing total alignment size can save downstream computation time, but excessive trimming can itself be a source of error.
Here, we will be trimming multiple sequence alignments using multiple strategies and examining the resulting alignment lengths.
Note, steps within objectives that have fill-in-the-blank prompts are indicated as such using blue color font. Please fill in these prompts. Additionally, if the software you are trying to use isn’t in your path, it is likely in ~/software.
This objective is divided into five parts:
The dataset for this practical will be amino acid FASTA files.
Hint:
$ wget http://...
alignment_trimming_data.tar.gz
Hint:
# the command is 'tar' ("tape archive")
Solution:
$ tar -zxvf alignment_trimming_data.tar.gz
ls command
ClipKIT is a tool for multiple sequence alignment trimming that aims to keep phylogenetically informative sites and discard others. Publication and online documentation.
Steps:
clipkit -h
Solution:
Different trimming modes.
smart-gap: dynamic determination of gaps threshold
kpic: keep parsimony informative and constant sites
kpi: keep only parsimony informative sites
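To make the kpi/kpic distinction concrete: a site is conventionally called parsimony informative when at least two different characters each occur in at least two sequences, and constant when all sequences carry the same character. The sketch below classifies the columns of an aligned FASTA file with awk. It only illustrates those definitions and is not ClipKIT's implementation; the function name `classify_sites` is hypothetical, and ignoring gap characters (`-`) when counting is an assumption.

```shell
# Print "<column> <class>" for every column of an aligned FASTA file ($1).
# Assumption: gaps ("-") are ignored when counting characters.
classify_sites() {
    awk '
        /^>/ { n++; next }                 # a header starts a new sequence
        { seq[n] = seq[n] $0 }             # join wrapped sequence lines
        END {
            len = length(seq[1])
            for (c = 1; c <= len; c++) {
                split("", count)           # reset the per-column counts
                for (s = 1; s <= n; s++) {
                    ch = toupper(substr(seq[s], c, 1))
                    if (ch != "-") count[ch]++
                }
                states = 0; repeated = 0
                for (ch in count) {
                    states++                           # distinct characters
                    if (count[ch] >= 2) repeated++     # seen at least twice
                }
                if (repeated >= 2)    kind = "parsimony_informative"
                else if (states == 1) kind = "constant"
                else                  kind = "other"
                print c, kind
            }
        }
    ' "$1"
}
```

Under these definitions, kpi would keep only the columns labeled `parsimony_informative`, while kpic would also keep those labeled `constant`.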
Hint:
The general structure of a for loop is:
for i in $(<list of files>); do <action> ; done
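As a minimal, tool-free sketch of that pattern, the loop below runs over placeholder files in a scratch directory and only prints the output name each iteration would produce. The function name `demo_loop`, the file names, and the `.trimmed` suffix are made up for illustration.

```shell
# Hypothetical dry run: nothing is trimmed, names are only printed.
demo_loop() {
    tmpdir=$(mktemp -d)
    # Empty placeholder files standing in for real alignments.
    touch "$tmpdir/geneA.mafft" "$tmpdir/geneB.mafft"
    # The shell expands the glob into the list of matching files.
    for i in "$tmpdir"/*.mafft; do
        echo "$(basename "$i") -> $(basename "$i").trimmed"
    done
    rm -r "$tmpdir"
}

demo_loop
```

Replacing the `echo` with a real command, and the suffix with one that records the trimming strategy, gives the shape of the loops used in this practical.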
Separate for loop solution:
$ for i in *mafft; do clipkit $i -m smart-gap -o $i.clipkit_smart-gap; done
$ for i in *mafft; do clipkit $i -m kpic -o $i.clipkit_kpic; done
$ for i in *mafft; do clipkit $i -m kpi -o $i.clipkit_kpi; done
Nested for loop solution:
$ for i in *mafft; do for j in smart-gap kpic kpi; do clipkit "$i" -m "$j" -o "$i.clipkit_${j}" ; done ; done
BMGE uses entropy to identify highly divergent sites in multiple sequence alignments and removes them. Publication and documentation.
Steps:
BMGE -?
Solution:
-h <thr_max> : real number corresponding to the maximum entropy threshold (ranges from 0 to 1; default: 0.5)
Separate for loop solution:
$ for i in *mafft; do BMGE -i $i -t AA -h 0.7 -o $i.BMGE0.7 ; done
$ for i in *mafft; do BMGE -i $i -t AA -h 0.5 -o $i.BMGE0.5 ; done
Nested for loop solution:
$ for i in *mafft; do for j in 0.5 0.7; do BMGE -i "$i" -t AA -h "$j" -o "$i.BMGE${j}" ; done ; done
trimAl removes highly divergent sites using a conservation threshold. Publication and documentation.
trimal -h
Solution:
-gappyout, -strict, and -strictplus
Separate for loop solution:
$ for i in *mafft; do trimal -in $i -out $i.trimal_gappyout -gappyout ; done
$ for i in *mafft; do trimal -in $i -out $i.trimal_strictplus -strictplus ; done
Nested for loop solution:
$ for i in *mafft; do for j in gappyout strictplus; do trimal -in "$i" -out "$i.trimal_$j" "-${j}" ; done ; done
PhyKIT is a command-line bioinformatics tool to facilitate processing and analysis of multiple sequence alignments and phylogenetic trees. Publication and documentation.
Solution:
$ for i in OG00*; do echo -e -n "$i\t"; pk_aln_len $i ; done
For each ortholog, which trimming strategy removed the most sites and which removed the fewest?
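One way to approach this question is to tabulate, for every trimmed file, how many columns were removed relative to its untrimmed alignment. The sketch below assumes the naming used in this practical (each trimmed file is the `*mafft` file name plus a `.suffix`) and measures alignment length with awk as the length of the first sequence, as a stand-in for `pk_aln_len`. The function names `fasta_aln_len` and `sites_removed` are hypothetical.

```shell
# Length of the first sequence in an aligned FASTA file ($1). All rows of
# an alignment share this length. Stand-in for pk_aln_len, not PhyKIT.
fasta_aln_len() {
    awk '
        /^>/ { if (++n > 1) exit; next }   # stop at the second header
        { len += length($0) }              # sum wrapped sequence lines
        END { print len + 0 }
    ' "$1"
}

# For each untrimmed *mafft file in the current directory, print
# "<trimmed file><TAB><columns removed>", largest number first.
sites_removed() {
    for raw in *mafft; do
        [ -e "$raw" ] || continue          # no match: glob stays literal
        full=$(fasta_aln_len "$raw")
        for trimmed in "$raw".*; do
            [ -e "$trimmed" ] || continue
            kept=$(fasta_aln_len "$trimmed")
            printf '%s\t%d\n' "$trimmed" "$((full - kept))"
        done
    done | sort -t "$(printf '\t')" -k2,2nr
}
```

Running `sites_removed` in the directory that holds the alignments lists the most aggressive strategy for each ortholog near the top and the most conservative near the bottom.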
Let’s examine other ways these alignments may differ. We will use PhyKIT to calculate diverse metrics that summarize the information content of these alignments. Specifically, we will calculate the following:
Execute these commands and while the analysis is running, read about each metric in the PhyKIT documentation.
for i in OG00*; do
    echo -e -n "$i\talignment_length_no_gaps\t"; pk_aln_len_no_gaps "$i" | awk '{print $1}'
    echo -e -n "$i\tpairwise_identity\t"; pk_pairwise_id "$i" | grep "mean" | awk '{print $NF}'
    echo -e -n "$i\tparsimony_informative_sites\t"; pk_parsimony_informative_sites "$i" | awk '{print $1}'
done | tee alignment_information_content.txt
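Because the loop writes one `file<TAB>metric<TAB>value` row per measurement, the resulting table can be queried with awk. As a hedged sketch, the hypothetical function `top_per_metric` below reports, for each metric, the file with the largest value. It assumes every row has exactly those three tab-separated fields and that the third field is numeric.

```shell
# $1: tab-separated table with columns file, metric, value.
# Prints "<metric><TAB><file with the largest value><TAB><value>".
top_per_metric() {
    awk -F '\t' '
        !($2 in best) || $3 + 0 > best[$2] {
            best[$2] = $3 + 0      # numeric copy used for comparisons
            shown[$2] = $3         # original text, printed unchanged
            file[$2] = $1
        }
        END {
            for (m in best) printf "%s\t%s\t%s\n", m, file[m], shown[m]
        }
    ' "$1" | sort
}
```

For example, `top_per_metric alignment_information_content.txt` prints one line per metric. Whether a larger value is desirable depends on the metric, so the output is a starting point for comparison rather than a ranking of strategies.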
Solution:
ClipKIT with the smart-gap parameter and trimAl with the gappyout parameter
Solution:
1. Conduct dimensional reduction analysis using principal component analysis (or PCA).
See this example from the ClipKIT manuscript https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001007.
In this figure, methods that are far away from others produced drastically different results.
For example, in the left panel, BMGE with an entropy threshold of 0.3 performed very differently from the other trimming approaches.
2. Plot boxplots (or violin plots, histograms, etc.) with one panel for each metric.
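Both suggestions are easier to carry out from a wide table with one row per alignment and one column per metric. The following is a minimal sketch of that reshaping, assuming the three-column `file<TAB>metric<TAB>value` layout of alignment_information_content.txt. The function name `long_to_wide` is hypothetical, and the PCA or plotting itself is left to whichever statistics environment you prefer.

```shell
# $1: tab-separated table with columns file, metric, value.
# Prints a header row, then one row per file with one column per metric.
long_to_wide() {
    awk -F '\t' '
        !($1 in seen_file)   { seen_file[$1] = 1;   files[++nfiles] = $1 }
        !($2 in seen_metric) { seen_metric[$2] = 1; metrics[++nmetrics] = $2 }
        { value[$1, $2] = $3 }
        END {
            printf "file"
            for (j = 1; j <= nmetrics; j++) printf "\t%s", metrics[j]
            printf "\n"
            for (i = 1; i <= nfiles; i++) {
                printf "%s", files[i]
                for (j = 1; j <= nmetrics; j++) {
                    key = files[i] SUBSEP metrics[j]
                    # Missing measurements are written as NA.
                    printf "\t%s", (key in value) ? value[key] : "NA"
                }
                printf "\n"
            }
        }
    ' "$1"
}
```

Rows and columns keep the order in which files and metrics first appear in the input, so the output lines up with the order of the original loop.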