Internode certainty and related measures (Salichos and Rokas 2013; Kobert et al. 2016) have proven to be powerful methods to examine bipartition support. This has proven to be especially true when bootstrap values become unreliable (e.g., when an input alignment is very long).

In this practical, we will learn how to calculate internode certainty and related measures to examine bipartition support and support for alternative topologies. We will be examining the relationships between filamentous fungi known to be medically and technologically significant. More specifically, we will be examining the relationships among Aspergillus, Penicillium, Monascus, Xeromyces, Penicilliopsis, and outgroup taxa (Uncinocarpus reesii and Coccidioides posadasii) using phylogenies from Steenwyk et al. (2019) mBio.

Note, steps within objectives that have fill-in-the-blank prompts are indicated as such using blue color font. Please fill in these prompts. Additionally, if the software you are trying to use isn’t in your path, it is likely in ~/software.


Protocol

This objective is divided in seven parts:

As noted above, the dataset for this practical comes from Steenwyk et al. (2019) mBio.


1. Download and examine the dataset

  1. With your Guacamole terminal download the data set, which is available here: https://github.com/JLSteenwyk/2024_phylogenomics_workshop/raw/main/incongruence/confidence_measures.tar.gz

Hint:

$ wget http://...
  1. Confirm that you have downloaded the tar zipped directory confidence_measures.tar.gz
  2. Unzip the directory

Hint:

# the command is 'tar' ("tape archive")

Solution:

$ tar -zxvf confidence_measures.tar.gz
  1. Change directory into the newly unzipped directory
  2. Examine the contents of the directory using the ls command
  3. How many files are in confidence_measures?


2. How to collapse branches by a support threshold

As described in the lecture, it can be helpful to collapse poorly support branches before, for example, species tree infernece using summary tree methods. Collapsing poorly supported branches can help account for gene tree uncertainty because the summary tree methods assume gene trees are otherwise accurate. To do so, we will use the PhyKIT function collapse_branches.

Steps: 1. Change directory into FILES_collapsing_across_thresholds. 2. Write a nested for loop and collapse the branches of each phylogenetic tree with thresholds of 80, 70, and 60 using the collapse_branches function.

Solution

for j in $(echo "60 70 80"); do for i in $(ls *treefile) ; do echo $i $j ; pk_collapse_branches $i -s $j -o $i.${j}collapsed ; done ; done
  1. Count the number of internal branches in each file and each threshold (including no collapsing) using a nested for loop. Make the output be nicely.

Hint

# The first question is how do we count internal branches in a phylogenetic tree? 
# I know an indirect way we can do this using the `internal_branch_stats` function of PhyKIT.

# PhyKIT functions that print out summary statistics (mean, medians, etc.) have a verbose option, -v/--verbose.
# When using the verbose option, raw data is outputted instead. Thus, we can obtain internal branch length information
# using the `internal_branch_stats` function but use the verbose option (-v/--verbose) and get the length of each internal branch.
# Thereafter, we can count how many internal branches there are and calculate the difference to see how many were collapsed.

Solution

# copy and paste this one-liner
echo -e "file\tthreshold\tBranch cnt-uncoll\tBranch cnt-coll\tNum. collapsed branches" ; for i in *treefile ; do cnt=$(pk_internal_branch_stats $i -v | wc -l) ; for j in $(echo "60 70 80"); do coll_cnt=$(pk_internal_branch_stats $i.${j}collapsed -v | wc -l) ; num_coll=$(expr $cnt - $coll_cnt) ; echo -e "$i\t$j\t$cnt\t$coll_cnt\t$num_coll" ; done ; done

# one-liner written out as more readable with comments so you can better understand
echo -e "file\tthreshold\tBranch cnt-uncoll\tBranch cnt-coll\tNum. collapsed branches" ; 
for i in *treefile ; do
  cnt=$(pk_internal_branch_stats $i -v | wc -l) # count internal branches in the uncollapsed case
  for j in $(echo "60 70 80"); do
    coll_cnt=$(pk_internal_branch_stats $i.${j}collapsed -v | wc -l) # count internal branches in the collapsed case
    num_coll=$(expr $cnt - $coll_cnt) # substract the collapsed number of internal branches from the uncollapsed number of internal branches 
    echo -e "$i\t$j\t$cnt\t$coll_cnt\t$num_coll"
  done ;
done

file    threshold   Branch cnt-uncoll   Branch cnt-coll Num. collapsed branches
OG0002050.fa.mafft.clipkit.treefile 60  21  20  1
OG0002050.fa.mafft.clipkit.treefile 70  21  20  1
OG0002050.fa.mafft.clipkit.treefile 80  21  17  4
OG0002051.fa.mafft.clipkit.treefile 60  21  19  2
OG0002051.fa.mafft.clipkit.treefile 70  21  18  3
OG0002051.fa.mafft.clipkit.treefile 80  21  14  7
OG0002053.fa.mafft.clipkit.treefile 60  21  20  1
OG0002053.fa.mafft.clipkit.treefile 70  21  20  1
OG0002053.fa.mafft.clipkit.treefile 80  21  19  2

where

  • column 1: Tree being examined
  • column 2: Collapsing threshold
  • column 3: Number of internal branches in uncollapsed tree
  • column 4: Number of internal branches in collapsed tree
  • column 5: The number of collapsed branches
  1. Which tree had the most branches collapsed and under what collapsing threshold? Solution
OG0002051.fa.mafft.clipkit.treefile had 7 branches that were collapsed
  1. Examine the tree in OG0002051.fa.mafft.clipkit.treefile.70collapsed. Are shallow or deep branches collapsed? Why is your answer more likely to be collapsed?

Hint

# This command allows you to quickly view a phylogenetic tree in the terminal by printing a tree in ASCII characters.
pk_print_tree OG0002051.fa.mafft.clipkit.treefile.70collapsed -r

# the -r argument removes branch length information making it easier to see which branches got collapsed

Solution

Often, deeper branches in a phylogeny will have lower support values because more ancient divergences are more difficult to accurately infer. Thus, deeper, not shallower, branches are more likely to be collapsed.


4. Identify bipartitions with low internode certainty values

To examine internode certainty values among bipartitions in our putative species tree, we will first execute a script that replaces bootstrap values with internode support values and forgo examination of internode certainty all values. We will then view the tree in FigTree or similar software. Note, if you’d like to use a webservice like phylo.io, you can cat the tree file and copy and paste the resulting text into the Tree data box on the phylo.io website.

Steps:

  1. To report internode certainty values as support values, we will use the file FILES_scripts_and_matrices/report_ic_as_internode_support.py. To understand how to use the script, execute the following command:
python FILES_scripts_and_matrices/report_ic_as_internode_support.py -h
  1. Based on the help message printed from the previous command, execute the script on the following tree RAxML_IC_Score_BranchLabels.Asp_Pen_phylo.ic to report internode certainty values as support values.

Solution

python FILES_scripts_and_matrices/report_ic_as_internode_support.py -t RAxML_IC_Score_BranchLabels.Asp_Pen_phylo.ic
  1. Open up the resulting file, RAxML_IC_Score_BranchLabels.Asp_Pen_phylo.ic.ic.tree, using FigTree or your preferred tree viewer.

  2. In Steenwyk et al. (2019) mBio, we further scrutinized bipartitions with internode certainty values <0.1. Our reasoning is that internodes with this value or lower have ~5 evaluation trees that support the reference topology and ~2 or more phylogenies that support an alternative topology - we consider this substantial incongruence.

  3. Identify the internode with an internode certainty below 0.1 and specify the value and which internode that is in the phylogeny below. Note, you will have midpoint root the phylogeny first, or place the root at ancestor of U. reesii and C. posadasii.

Solution:

The branch at the ancestor of M. ruber and P. camemberti has a low IC value.

5. Determine gene support frequency and concordance factors for the reference and alternative topology at the contentious internode

  • To determine gene support frequency, we will use RAxML. Note, you have previously calculate gene support frequencies with PhyKIT when you used the pk_ptt function.
  • To determine concordance factors, we will use IQ-Tree.

Steps:

  1. Examine the verboseSplits output from RAxML, which is file RAxML_verboseSplits.Asp_Pen_phylo.ic

  2. The bipartition for the contentious internode of question is second to last instance of the string ‘partition:’, which appears as the following: partition:

---** ***** * xx/yy/zz
---- **** * xx/yy/zz
--*** ----- - xx/yy/zz

where xx/yy/zz represents the following:

  • xx: the number of genes that support a given bipartition,
  • yy: the gene support frequency of the given bipartition, and • zz: the internode certainty value for the given internode.

We will focus on the first two topologies which consider whether the genus Penicilliopsis is more closely related to outgroup taxa or if Xeromyces and Monascus is. In other words, which genus/genera Penicilliopsis or Xeromyces and Monascus is sister to all other members of this fungal family?

What is the number of genes and gene support frequency that support the two topologies?

Solution:

First topology:
375/29.137529

Second topology:
340/26.418026
  1. Now, calculate gene and site concordance factors in the following single command.
iqtree2 -t Asp_Pen_phylo.subset.new --gcf 1287.trees --scf 100 -p FILES_AA_fastas_subset/ --prefix gcf1297_scf100_ref

where:

  • -t specifies the putative species tree,
  • --gcf represents the 1287.trees observed single gene trees,
  • --scf 100 represents that 100 quartets will be randomly for each internal branch for computing site concordance factors,
  • -p specifies the directory that has the single gene fasta files, and
  • --prefix specifies the prefix of the output files.
  1. There are four resulting files:
  • a log file,
  • a cf.tree file with gene concordance factors,
  • a cf.branch file with internal branch identifiers, and
  • a cf.stat file with tab-separated values of gene and site concordance as well as gene and site discordance factors for each internode.

For now, we will focus on gene and site concordance factors.

  1. Reexamine the contentious bipartition in the gcf1297_scf100_ref.cf.tree file, which can be viewed in FigTree or similar software. What is the gene concordance and site concordance factor for the contentious bipartition?

Solution:

0.155/32.1
  1. Conduct the same analysis using the alternative topology as an input by executing the following command:
iqtree2 -t Asp_Pen_phylo.subset.alt.new --gcf 1287.trees --scf 100 -p FILES_AA_fastas_subset/ --prefix gcf1297_scf100_alt
  1. Examine the contentious bipartition where Penicilliopsis zonata splits from Aspergillus and Penicillium in file gcf1297_scf100_alt.cf.tree. What is the gene concordance factor and site concordance factor for this bipartition?

Solution:

0/36.2

6. Calculate gene-wise and site-wise phylogenetic signal

To determine gene-wise and site-wise phylogenetic signal, we will follow protocol first described in Shen et al. (2017), Nature Ecology and Evolution and calculate gene-wise and site-wise log-likelihood scores. We will use IQ-Tree to facilitate the analysis.

Steps:

  1. Calculating gene-wise and site-wise log-likelihood scores takes too much time – so do not execute the following commands. The following commands are provided because I wanted you to have access and an understanding of how the results files you will be using were created.
# Command 1
iqtree2 -s FILES_scripts_and_matrices/concat.fa -seed 78913467814 -st AA -pre likelihood_ref -spp FILES_scripts_and_matrices/partition.file -te Asp_Pen_phylo.subset.new -wpl -wsl

# Command 2
iqtree2 -s concatenation.fasta -seed 13469781343 -st AA -pre likelihood_alt -spp partition.file -te FILES_GLS_files/concatenation.alt.topology -wpl -wsl
  1. In the previous two commands, what do the following specify? Hint, it will be useful to use their documentation – see http://www.iqtree.org/doc/Command-Reference.
  • -spp
  • -te
  • -wpl
  • -wsl
  1. Create summary files with gene-wise and site-wise log likelihood scores using a custom script provided for you. To do so, first change directories into FILES_gene_and_site_lh_values with the following command:
cd FILES_gene_and_site_lh_values

Next examine how to use the script that will create summary files for you using the following command:

bash ../FILES_scripts_and_matrices/create_GLS_summary.sh -h

Based on the help message, create gene-wise and site-wise log likelihood score summary files using the following command:

bash ../FILES_scripts_and_matrices/create_GLS_summary.sh ../FILES_scripts_and_matrices/partition.file likelihood_ref.partlh likelihood_alt.partlh likelihood_ref.sitelh likelihood_alt.sitelh
  1. We will now examine the contents of gene-wise_logli_scores.txt and site-wise_logli_scores.txt. To do so, we will plot the results from each file in RStudio using ggplot2. First, establish an RStudio server connection in a new window and load ggplot2 using the following command:
library(ggplot2)

If you are not in the working directory with your results files, use setwd('path') to change to the appropriate directory. Make sure you are in the appropriate directory with the getwd() command.

  1. Next, read in the data tables using the following command:
gene <- read.table("gene-wise_logli_scores.txt", sep = "\t")

where

  • read.table() is an R function to read in a table or text file and
  • sep = “ specifies that the table is tab (or”) delimited.
  1. Make a plot of the gene-wise log likelihood scores using the following command:
ggplot(gene, aes(V1, V4, color = V5)) + geom_bar(stat = "identity") + ggtitle("GLS (gene-wise)") +
    xlab("Genes") + ylab("deltaGLS") + theme_classic() + theme(axis.text.x = element_blank(),
    axis.ticks.x = element_blank()) + scale_color_manual(values = c("#56B4E9", "#999999",
    "#E69F00"))
  1. There is one gene with overwhelming phylogenetic signal. To determine what gene this is, sort the data stored as gene according to the column of ΔGLS values using the following command and examine the fourth column, which is ΔGLS:
head(gene[order(-gene$V4), ])
  1. What gene has a high ΔGLS in favor of the reference topology?

Solution

EOG091N0K4M has a ΔGLS of 113.180
  1. To determine if removing this gene will switch the topology at the given internode, calculate the sum of log likelihood values after removing the gene-wise log likelihood score for this gene. To do so, execute the following command:
sum(gene[(gene$V4 != 113.18), ]$V4)

where

  • sum() will calculate the sum of a set of numbers and
  • gene[(gene$V4!=113.180),]$V4 will print the set of numbers excluding the highly opinionated gene.
  1. Does the topology change if you remove this gene?

Solution

sum w/ 400.72
sum w/out 287.54

No, because the sum of phylogenetic signal is still (+).
  1. Repeat v and vi for site-wise_logli_scores.txt by executing the following commands:
# read in site-wise log likelihood scores file
site <- read.table("site-wise_logli_scores.txt", sep = "\t")
# create plot
ggplot(site, aes(V1, V4, color = V5)) + geom_bar(stat = "identity") + ggtitle("SLS (site-wise)") +
    xlab("Sites") + ylab("deltaSLS") + theme_classic() + theme(axis.text.x = element_blank(),
    axis.ticks.x = element_blank()) + scale_color_manual(values = c("#56B4E9", "#999999",
    "#E69F00"))

Note, this plot may take 5-10 minutes to generate. If you don’t want to wait, see FILES_results/FILES_gene-wise_site-wise_log_likelihood/site_wise.png for a premade PNG file of the same plot that would be generated.

  1. Is there one site with overwhelming signal?

Solution

No

7. Bonus

  1. Using only starting data of gene trees, use internode certainty (and/or related measures) to rank the following datasets (which are found in FILES_bonus_datasets) from most consistent among single-gene trees to least. The four datasets are:
  • Eutheria, a lineage of mammals,
  • budding yeasts,
  • filamentous fungi, and
  • mammals

Each dataset has both SC-OGs and SNAP-OGs. For this bonus question, use both SC-OGs and SNAP-OGs.


Light hints for overall workflow

1. You will have to infer a summary tree (aka, species tree estimate) first. How do you infer a species tree from a collection of single gene trees?

2. Use the PhyKIT function `bipartition_support_stats` to get summary statistics support statistics in a phylogeny. Support statistics have to be reported in the position bootstrap statistics are reported.


Detailed hints for overall workflow

1. To do so, first infer a summary tree using ASTRAL and then use the input trees as evaluation trees to calculate internode certainty and related measures in RAxML.
astral.jar -i all_trees.tre -o output_tree

2. Calculate internode certainty
raxmlHPC -f i -t output_tree -z all_trees.tre -m GTRCAT -n output_tree.ic -C

3. Use `report_ic_as_internode_support.py` to reformat IC values as bootstrap values

4. Use the PhyKIT function `bipartition_support_stats` to get summary statistics of internode certainty values across the tree
# https://jlsteenwyk.com/PhyKIT/usage/index.html#bipartition-support-statistics
phykit bipartition_support_stats input.tree

5. Keep track of your summary statistic of choice across each dataset and rank the datasets by which has the most to least certainty.


Solution

Gotcha!!!!

No solution will be provided here. Grab a TA, discuss with them and show them your results.
  1. For each dataset, calculate summary statistics of internode certainty (or related measures) between SC-OGs and SNAP-OGs. How is certainity similar or different between SC-OGs and SNAP-OGs?

Solution

Gotcha again!!!!

No solution will be provided here. Grab a TA, discuss with them and show them your results.