A concatenated supermatrix is used as input to infer a putative species phylogeny using the concatenation method. In this workshop practical, we will perform the common steps of creating a concatenated supermatrix, a partition file, and determining the appropriate models of evolution for partitions.
Note, steps within objectives that have fill-in-the-blank prompts are
indicated as such using blue color
font. Please fill in these prompts. Additionally, if the software you
are trying to use isn’t in your path, it is likely in
~/software
.
This objective is divided in four parts:
The dataset for this practical will be amino acid FASTA files.
Hint:
$ wget http://...
trimming_data
Hint:
# the command is 'tar' ("tape archive")
Solution:
$ tar -zxvf partitioning_and_concatenation_data.tar.gz
Inferring a putative species tree using the concatenation method requires constructing a supermatrix of all your genes. In this step, we will familiarize ourselves with the dataset and create a supermatrix using PhyKIT.
PhyKIT is a command-line bioinformatics tool to process and analyze multiple sequence alignments and phylogenetic trees. Publication and documentation.
Steps: 1. Change directory to partitioning_and_concatenation_data [using the cd command] 2. Examine the contents of the directory [using the ls command] 3. How many FASTA files are there?
Hint:
# they are aligned and trimmed using mafft and trimal and have
# “.fa.mafft.trimal” as suffixes
Solution:
There are five files.
Hint:
# use “grep” and “wc -l”
Solution:
There are 20 taxa.
Solution:
phykit -h
For every PhyKIT function, there are aliases. For example, the
treeness function can be executed using: phykit treeness
or
pk_treeness
. The purposes of aliases is to allow “tabbing”
through commands, saving some keystrokes.
Solution:
ls EOG092D* > alignments.txt
pk_create_concat -a alignments.txt -p concat
And to use the file, we must first change permissions:
chmod 777 partition2nexus.sh
bash partition2nexus.sh concat.partition > concat.partition.nex
Different genes may have different models of sequence evolution that best describes their evolution. As an example using models for nucleotide sequences, the simplest model is JC69 (Jukes and Cantor, 1969) where assumptions of equal base frequencies and mutation rates is assumed for A, T, C, and G while the GTR (generalized time-reversible ) model is the most general and may require up to six substitution rate parameters and four equilibrium base frequency parameters. As you may imagine, the evolution of some sequences may be best described by GTR as opposed to JC69 (or vice versa).
There are numerous methods to determine the best fit models for sequences. In fact, many phylogenetic software (e.g., RAxML and IQ-tree) have this built into existing frameworks, which is why AUTO was specified in the model section for ‘auto selection’. However, one may want to determine the best-fit model separately for various reasons (e.g., to split up the task into smaller chunks).
Here, we will use IQTREE to determine the best-fit model for our data as well as how many partitions are best.
Steps: 1. Determine the number of partitions that can describe the data using a scheme similar to jModelTest/ProtTest within IQTREE using the following command:
iqtree2 -s concat.fa -spp concat.partition.nex -m TESTMERGEONLY -pre concat_partition_jProt
Hint:
# answers to these questions can be found in concat_partition_testing.iqtree
# file
Solution:
Find the portion of the file that is as follows:
ID Name Type Seq Site Unique Infor Invar Const
1 part1+part3+part4 AA 20 1345 527 418 832 832
2 part2 AA 20 573 301 253 286 286
3 part5 AA 20 287 256 207 46 46
This communicates that there are three final partitions and that part1+part3+part4 were combined into a single partition.
The best fit models are detailed on the following line in the results file
Best-fit model according to BIC: LG+I+G4:part1+part3+part4,LG+I+G4:part2,LG+G4:part5
iqtree2 -s concat.fa -spp concat.partition.nex -m MF+MERGE -pre concat_partition_MFP
Note, this may take longer. Be patient and check on how your neighbor is doing. This takes about 5 min.
Hint:
# answers to these questions can be found in .iqtree results file
Solution:
Find the portion of the file that is as follows:
ID Name Type Seq Site Unique Infor Invar Const
1 part1+part3+part4 AA 20 1345 527 418 832 832
2 part2 AA 20 573 301 253 286 286
3 part5 AA 20 287 256 207 46 46
This communicates that there are three final partitions and that part1+part3+part4 were combined into a single partition.
The best fit models are detailed on the following line in the results file
Best-fit model according to BIC: LG+R3:part1+part3+part4,LG+R4:part2,LG+G4:part5
These results differ from TESTMERGEONLY because partition 1 and 2 had better fitting free rate models. As described in IQTREE documentation,
the FreeRate model generalizes the +G model by relaxing the assumption of Gamma-distributed rates.
The number describes how many rate categories there are. So, R3 has three rate categories.
MERGE
from the model testing argument. For the sake of
faster computation, test standard substitution models in the
concatenated matrix.iqtree2 -s concat.fa -m TESTONLY -pre concat_partition_single_model
Solution
# What was the best fitting substitution model?
Best-fit model according to BIC: LG+I+G4
# Would the model that was selected be different if AIC was used instead of BIC to evaluate model fit?
Check the .log file, where you will find the following:
Akaike Information Criterion: LG+F+I+G4
Corrected Akaike Information Criterion: LG+F+I+G4
Bayesian Information Criterion: LG+I+G4
Best-fit model: LG+I+G4 chosen according to BIC
AIC and AICc will have the extra parameter +F, which stands for empirical base frequencies.
LG+F+G
. This
site-homogeneous model applies one reversible substitution matrix and
the same character frequencies for all sites in a data matrix.
Researchers may want to incorporate greater biological realism by
allowing nucleotide or amino acid equilibrium frequencies to differ
across sites. Models that incorporate this complexity into a maximum
likelihood framework are termed profile mixture models. In doing so,
profile mixture models help combat saturation by multiple substitutions,
a well known source of phylogenomic error.Test model fit across the site homogenous model LG
and
LG
with varying amino acid equilibrium frequencies across
sites using the following command, which takes about 10 minutes:
iqtree2 -s concat.fa -m TESTONLY -mset LG+G4,LG+C20,LG+C60 -pre concat_partition_site_heterogeneous
Solution:
20 and 60 refer to the number of components in the mixture; in other words, complexity in site-heterogeneity. The more complex, the higher the value.
Thus, C60 is more complex than C20. Available models range from C10 to C60 with a step of 10.
During phylogenomic subsampling, subsamples of loci in a full phylogenomic data matrix are used to re-infer organismal histories. This can help identify unstable bipartitions in phylogenetic trees (more to come about phylogenomic incongruence later). Subsampling is typically done by selecting, for example, half of the loci in a full data matrix with the desirable feature associated with phylogenetic signal.
Here, we will conduct phylogenomic subsampling using the following metrics of information content: * alignment length * evolutionary rate * relative composition variability * long branch score * saturation * treeness
Because some metrics rely on information from single-locus phylogenies (e.g., long branch score), we first need to infer phylogenetic trees for each ortholog.
for
loop to infer single-locus
phylogenetic trees.for i in $(ls EOG092D*); do iqtree2 -s $i -pre $i -m LG+F+G4 ; done
for i in $(ls EOG092D*treefile | sed 's/.treefile//g'); do
echo -e -n "$i\talignment_length\t" ; pk_aln_len $i ;
echo -e -n "$i\tevo_rate\t" ; pk_evo_rate $i.treefile ;
echo -e -n "$i\trcv\t" ; pk_rcv $i ;
echo -e -n "$i\tlb_score\t" ; pk_lbs $i.treefile | grep "median" | awk '{print $NF}';
echo -e -n "$i\tsaturation\t" ; pk_saturation -a $i -t $i.treefile ;
echo -e -n "$i\ttreeness\t" ; pk_treeness $i.treefile ;
done |tee gene_wise_information_content.txt
What do the -e and -n arguments of echo do?
Note, PhyKIT implements other methods that can guide phylogenomic subsampling strategies and summarizing the information content of phylogenomic data matrices. See here for a tutorial in the PhyKIT documentation https://jlsteenwyk.com/PhyKIT/tutorials/index.html#summarizing-information-content.
Solution:
Alignment length: EOG092D1MLK.fa.mafft.trimal; EOG092D1QZM.fa.mafft.trimal; EOG092D1YWG.fa.mafft.trimal
Evolutionary rate: EOG092D1MLK.fa.mafft.trimal; EOG092D1YWG.fa.mafft.trimal; EOG092D23QW.fa.mafft.trimal
Relative composition variability: EOG092D1MLK.fa.mafft.trimal; EOG092D1YWG.fa.mafft.trimal; EOG092D23QW.fa.mafft.trimal
Long branch score: EOG092D1QZM.fa.mafft.trimal; EOG092D23QW.fa.mafft.trimal; EOG092D2PES.fa.mafft.trimal
Saturation: EOG092D1MLK.fa.mafft.trimal; EOG092D1QZM.fa.mafft.trimal; EOG092D23QW.fa.mafft.trimal
Treeness: EOG092D1YWG.fa.mafft.trimal; EOG092D23QW.fa.mafft.trimal; EOG092D2PES.fa.mafft.trimal
Solution:
Evolutionary rate & relative composition variability
evolutionary rate
and
relative composition variability
. To do so, execute the
following commands:for i in $(echo "alignment_length evo_rate lb_score saturation treeness"); do
echo $i ;
if [ "$i" = "evo_rate" ] || [ "$i" = "lb_score" ]; then # if lower value is better
grep "$i" gene_wise_information_content.txt | sort -k3,3 -rn | tail -n 3 | awk '{print $1}' > aln_list_${i}.txt ;
else
grep "$i" gene_wise_information_content.txt | sort -k3,3 -rn | head -n 3 | awk '{print $1}' > aln_list_${i}.txt ;
fi
pk_create_concat -a aln_list_${i}.txt -p concat_${i} ;
done
LG+F+G4
, and the -fast
parameter, which reduces the expansiveness of tree search.for i in $(ls concat*fa); do
iqtree2 -s $i -pre $i -m LG+F+G4 -fast
done
# Note, this may slightly differ depending on stochastic variation between runs.
ls concat_*treefile | grep -v "partition" | while read i ; do pk_rf_dist concat.fa.treefile $i ; done
BONUS: visualize these phylogenies using lessons learned in previous labs to identify what bipartitions differ.
Solution:
This may suggest numerous things including:
* five genes are insufficient to infer species trees (this is often, almost always, true)
* there is a lot of noise among single genes (this is often true too)
* evolutionary relationships may be difficult to infer for many other reasons including insufficient taxon sampling, a history of raditions, etc.
This practice of phylogenomic subsampling underscores how incongruence can be rampant among small sets of genes. Stay tuned for the 2nd week of the workshop where will do a deep dive on more methods to detect and ameliorate incongruence during phylogenomic inference!