A concatenated supermatrix is used as input to infer a putative species phylogeny using the concatenation method. In this workshop practical, we will perform the common steps of creating a concatenated supermatrix, a partition file, and determining the appropriate models of evolution for partitions.

Note, steps within objectives that have fill-in-the-blank prompts are indicated as such using blue color font. Please fill in these prompts. Additionally, if the software you are trying to use isn’t in your path, it is likely in ~/software.


Protocol

This objective is divided in four parts:

The dataset for this practical will be amino acid FASTA files.



1. Download and examine the dataset

  1. With your Guacamole terminal download the data set using the following URL: https://github.com/JLSteenwyk/2024_phylogenomics_workshop/raw/main/partitioning_and_concatenation/partitioning_and_concatenation_data.tar.gz.

Hint:

$ wget http://...
  1. Confirm that you have downloaded the tar zipped directory trimming_data
  2. Unzip the directory

Hint:

# the command is 'tar' ("tape archive")

Solution:

$ tar -zxvf partitioning_and_concatenation_data.tar.gz
  1. Change directory into the newly unzipped directory [using the cd command]
  2. Examine the contents of the directory [using the ls command]
  3. How many directories are in FILES_concat_and_model_testing_practical?



2. Concatenate the amino acid FASTA files with PhyKIT

Inferring a putative species tree using the concatenation method requires constructing a supermatrix of all your genes. In this step, we will familiarize ourselves with the dataset and create a supermatrix using PhyKIT.

PhyKIT is a command-line bioinformatics tool to process and analyze multiple sequence alignments and phylogenetic trees. Publication and documentation.


Steps: 1. Change directory to partitioning_and_concatenation_data [using the cd command] 2. Examine the contents of the directory [using the ls command] 3. How many FASTA files are there?

Hint:

# they are aligned and trimmed using mafft and trimal and have
# “.fa.mafft.trimal” as suffixes

Solution:

There are five files.
  1. All FASTA files have the same set of taxa. How many taxa are in each FASTA file?

Hint:

# use “grep” and “wc -l”

Solution:

There are 20 taxa.
  1. PhyKIT will be used for constructing the concatenated matrix. PhyKIT comes with a diverse suite of utilities. Examine all of the PhyKIT functions by examining the help message.

Solution:

phykit -h

For every PhyKIT function, there are aliases. For example, the treeness function can be executed using: phykit treeness or pk_treeness. The purposes of aliases is to allow “tabbing” through commands, saving some keystrokes.

  1. Now, use PhyKIT to create a concatenated supermatrix. First create a file with all the alignment names, then use that as a parameter to run PhyKIT. Give the output the prefix “concat”.

Solution:

ls EOG092D* > alignments.txt
pk_create_concat -a alignments.txt -p concat
  1. Three output files were created.
  • What does each file contain? Hint: PhyKIT documentation provides this information.
  • What is the length of the concatenated alignment?
  • The partition file is in RAxML-style format. What information does this format provide?
  1. Certain programs require nexus formatted partition files. To prepare you for creating nexus files, modify concat.partition to be in nexus format with the following command using an “in-house script”, which can be downloaded from GitHub using the following link: https://raw.githubusercontent.com/JLSteenwyk/2024_phylogenomics_workshop/main/partitioning_and_concatenation/partition2nexus.sh

And to use the file, we must first change permissions:

chmod 777 partition2nexus.sh
bash partition2nexus.sh concat.partition > concat.partition.nex
  1. Examine the contents of the resulting file



3. Determine models of sequence evolution using IQ-TREE2

Different genes may have different models of sequence evolution that best describes their evolution. As an example using models for nucleotide sequences, the simplest model is JC69 (Jukes and Cantor, 1969) where assumptions of equal base frequencies and mutation rates is assumed for A, T, C, and G while the GTR (generalized time-reversible ) model is the most general and may require up to six substitution rate parameters and four equilibrium base frequency parameters. As you may imagine, the evolution of some sequences may be best described by GTR as opposed to JC69 (or vice versa).

There are numerous methods to determine the best fit models for sequences. In fact, many phylogenetic software (e.g., RAxML and IQ-tree) have this built into existing frameworks, which is why AUTO was specified in the model section for ‘auto selection’. However, one may want to determine the best-fit model separately for various reasons (e.g., to split up the task into smaller chunks).

Here, we will use IQTREE to determine the best-fit model for our data as well as how many partitions are best.


Steps: 1. Determine the number of partitions that can describe the data using a scheme similar to jModelTest/ProtTest within IQTREE using the following command:

iqtree2 -s concat.fa -spp concat.partition.nex -m TESTMERGEONLY -pre concat_partition_jProt
  1. Examine the resulting outputs to determine the following:
  • How many partitions were made?
  • What partitions were combined?
  • What models best fit each partition?
  • Which metric is used to evaluate model fit? BIC, AIC, or AICc
    • Note, BIC, AIC, and AICc are Bayesian methods to evaluate model fit among a finite set of models.

Hint:

# answers to these questions can be found in concat_partition_testing.iqtree
# file

Solution:

Find the portion of the file that is as follows:

  ID  Name               Type   Seq Site    Unique  Infor   Invar   Const
   1  part1+part3+part4  AA 20  1345    527 418 832 832
   2  part2              AA 20  573 301 253 286 286
   3  part5              AA 20  287 256 207 46  46

This communicates that there are three final partitions and that part1+part3+part4 were combined into a single partition.

The best fit models are detailed on the following line in the results file

Best-fit model according to BIC: LG+I+G4:part1+part3+part4,LG+I+G4:part2,LG+G4:part5
  1. Determine the number of partitions that can describe the data using the ModelFinderPlus scheme within IQTREE using the following command:
iqtree2 -s concat.fa -spp concat.partition.nex -m MF+MERGE -pre concat_partition_MFP

Note, this may take longer. Be patient and check on how your neighbor is doing. This takes about 5 min.

  1. Examine the resulting outputs to determine the following:
  • How many partitions were made?
  • What partitions were combined?
  • What models best fit each partition?
  • How do these results using TESTMERGEONLY and MF+MERGE differ?

Hint:

# answers to these questions can be found in .iqtree results file

Solution:

Find the portion of the file that is as follows:

  ID  Name               Type   Seq Site    Unique  Infor   Invar   Const
   1  part1+part3+part4  AA 20  1345    527 418 832 832
   2  part2              AA 20  573 301 253 286 286
   3  part5              AA 20  287 256 207 46  46

This communicates that there are three final partitions and that part1+part3+part4 were combined into a single partition.

The best fit models are detailed on the following line in the results file

Best-fit model according to BIC: LG+R3:part1+part3+part4,LG+R4:part2,LG+G4:part5


These results differ from TESTMERGEONLY because partition 1 and 2 had better fitting free rate models. As described in IQTREE documentation,
the FreeRate model generalizes the +G model by relaxing the assumption of Gamma-distributed rates.

The number describes how many rate categories there are. So, R3 has three rate categories.
  1. Alternatively, a single best fitting substitution model can be determined by excluding the partition file and removing the MERGE from the model testing argument. For the sake of faster computation, test standard substitution models in the concatenated matrix.
iqtree2 -s concat.fa -m TESTONLY -pre concat_partition_single_model
  1. Examine the resulting outputs and address the following:
  • What was the best fitting substitution model?
  • Would the model that was selected be different if AIC was used instead of BIC to evaluate model fit?
  • Refer to IQ-TREE documentation to better understand what exactly is different. IQTREE Substitution-Models Documentation

Solution

# What was the best fitting substitution model?
Best-fit model according to BIC: LG+I+G4

# Would the model that was selected be different if AIC was used instead of BIC to evaluate model fit?
Check the .log file, where you will find the following:
Akaike Information Criterion:           LG+F+I+G4
Corrected Akaike Information Criterion: LG+F+I+G4
Bayesian Information Criterion:         LG+I+G4
Best-fit model: LG+I+G4 chosen according to BIC

AIC and AICc will have the extra parameter +F, which stands for empirical base frequencies.
  1. Rather than testing all possible substutiton models, researchers may want to test specific subsets. For example, suppose the best fitting substitution model for a given gene is LG+F+G. This site-homogeneous model applies one reversible substitution matrix and the same character frequencies for all sites in a data matrix. Researchers may want to incorporate greater biological realism by allowing nucleotide or amino acid equilibrium frequencies to differ across sites. Models that incorporate this complexity into a maximum likelihood framework are termed profile mixture models. In doing so, profile mixture models help combat saturation by multiple substitutions, a well known source of phylogenomic error.

Test model fit across the site homogenous model LG and LG with varying amino acid equilibrium frequencies across sites using the following command, which takes about 10 minutes:

iqtree2 -s concat.fa -m TESTONLY -mset LG+G4,LG+C20,LG+C60 -pre concat_partition_site_heterogeneous
  1. Examine the resulting outputs and address the following:

Solution:

20 and 60 refer to the number of components in the mixture; in other words, complexity in site-heterogeneity. The more complex, the higher the value.
Thus, C60 is more complex than C20. Available models range from C10 to C60 with a step of 10.


4. Phylogenomic subsampling based on information content of phylogenetic trees and alignments

During phylogenomic subsampling, subsamples of loci in a full phylogenomic data matrix are used to re-infer organismal histories. This can help identify unstable bipartitions in phylogenetic trees (more to come about phylogenomic incongruence later). Subsampling is typically done by selecting, for example, half of the loci in a full data matrix with the desirable feature associated with phylogenetic signal.

Here, we will conduct phylogenomic subsampling using the following metrics of information content: * alignment length * evolutionary rate * relative composition variability * long branch score * saturation * treeness

Because some metrics rely on information from single-locus phylogenies (e.g., long branch score), we first need to infer phylogenetic trees for each ortholog.

  1. To do so, we will use some short-cuts to speed up computation, such as using standard substitution model testing, not ModelFinder Plus. Execute the following for loop to infer single-locus phylogenetic trees.
for i in $(ls EOG092D*); do iqtree2 -s $i -pre $i -m LG+F+G4 ; done
  1. Calculate each metric of information content for individual orthologs using PhyKIT. To do so, execute the following command; note, while this command is running, move onto the the next two steps.
for i in $(ls EOG092D*treefile | sed 's/.treefile//g'); do
  echo -e -n "$i\talignment_length\t" ; pk_aln_len $i ;
  echo -e -n "$i\tevo_rate\t" ; pk_evo_rate $i.treefile ;
  echo -e -n "$i\trcv\t" ; pk_rcv $i ;
  echo -e -n "$i\tlb_score\t" ; pk_lbs $i.treefile | grep "median" | awk '{print $NF}';
  echo -e -n "$i\tsaturation\t" ; pk_saturation -a $i -t $i.treefile ;
  echo -e -n "$i\ttreeness\t" ; pk_treeness $i.treefile ;
done |tee gene_wise_information_content.txt

What do the -e and -n arguments of echo do?

  1. Refer to PhyKIT documentation and describe each of the differ metrics. What does each one refer to? For each metric are relatively higher or lower values better? Note, assume that you want genes with lower evolutionary rates; this is often preferred for datasets of anciently diverged organisms because genes with lower evolutionary rates are less likely to succumb to saturation by multiple substitutions.

Note, PhyKIT implements other methods that can guide phylogenomic subsampling strategies and summarizing the information content of phylogenomic data matrices. See here for a tutorial in the PhyKIT documentation https://jlsteenwyk.com/PhyKIT/tutorials/index.html#summarizing-information-content.

  1. Suppose we want to subsamping using 60% of the total data matrix (3 / 5 genes). For each metric, which three genes should be concatenated together? Note, assume that you want genes with lower evolutionary rates.

Solution:

Alignment length: EOG092D1MLK.fa.mafft.trimal; EOG092D1QZM.fa.mafft.trimal; EOG092D1YWG.fa.mafft.trimal
Evolutionary rate: EOG092D1MLK.fa.mafft.trimal; EOG092D1YWG.fa.mafft.trimal; EOG092D23QW.fa.mafft.trimal
Relative composition variability: EOG092D1MLK.fa.mafft.trimal; EOG092D1YWG.fa.mafft.trimal; EOG092D23QW.fa.mafft.trimal
Long branch score: EOG092D1QZM.fa.mafft.trimal; EOG092D23QW.fa.mafft.trimal; EOG092D2PES.fa.mafft.trimal
Saturation: EOG092D1MLK.fa.mafft.trimal; EOG092D1QZM.fa.mafft.trimal; EOG092D23QW.fa.mafft.trimal
Treeness: EOG092D1YWG.fa.mafft.trimal; EOG092D23QW.fa.mafft.trimal; EOG092D2PES.fa.mafft.trimal
  1. Which metrics had the same best scoring sets of genes?

Solution:

Evolutionary rate & relative composition variability
  1. Create concatenated matrices using the best scoring three genes, according to each metric. Create only one matrix for subsampling based on evolutionary rate and relative composition variability. To do so, execute the following commands:
for i in $(echo "alignment_length evo_rate lb_score saturation treeness"); do
  echo $i ;

  if [ "$i" = "evo_rate" ] || [ "$i" = "lb_score" ]; then # if lower value is better
    grep "$i" gene_wise_information_content.txt | sort -k3,3 -rn | tail -n 3 | awk '{print $1}' > aln_list_${i}.txt ; 
  else
    grep "$i" gene_wise_information_content.txt | sort -k3,3 -rn | head -n 3 | awk '{print $1}' > aln_list_${i}.txt ;
  fi

  pk_create_concat -a aln_list_${i}.txt -p concat_${i} ; 
done
  1. Infer quick and dirty phylogenies from each concatenated matrix using IQTREE. To speed up computation, we will use a predefined substitution matrix, LG+F+G4, and the -fast parameter, which reduces the expansiveness of tree search.
for i in $(ls concat*fa); do
  iqtree2 -s $i -pre $i -m LG+F+G4 -fast
done
  1. Compare the phylogenies inferred using subsampled data matrices to the phylogenetic tree inferred using the full data matrix. Which subsampling strategies resulted in a different phylogenetic tree?
# Note, this may slightly differ depending on stochastic variation between runs.
ls concat_*treefile | grep -v "partition" | while read i ; do pk_rf_dist concat.fa.treefile $i ; done

BONUS: visualize these phylogenies using lessons learned in previous labs to identify what bipartitions differ.

  1. In many instances, each subsampling strategy resulted in a different phylogenetic tree. What does this suggest?

Solution:

This may suggest numerous things including:
* five genes are insufficient to infer species trees (this is often, almost always, true)
* there is a lot of noise among single genes (this is often true too)
* evolutionary relationships may be difficult to infer for many other reasons including insufficient taxon sampling, a history of raditions, etc.

This practice of phylogenomic subsampling underscores how incongruence can be rampant among small sets of genes. Stay tuned for the 2nd week of the workshop where will do a deep dive on more methods to detect and ameliorate incongruence during phylogenomic inference!