Internode certainty and related measures (Salichos and Rokas 2013; Kobert et al. 2016) have proven to be powerful methods to examine bipartition support. This has proven to be especially true when bootstrap values become unreliable (e.g., when an input alignment is very long).
In this practical, we will learn how to calculate internode certainty and related measures to examine bipartition support and support for alternative topologies. We will be examining the relationships between filamentous fungi known to be medically and technologically significant. More specifically, we will be examining the relationships among Aspergillus, Penicillium, Monascus, Xeromyces, Penicilliopsis, and outgroup taxa (Uncinocarpus reesii and Coccidioides posadasii) using phylogenies from Steenwyk et al. (2019) mBio.
Note, steps within objectives that have fill-in-the-blank prompts are
indicated as such using blue color
font. Please fill in these prompts. Additionally, if the software you
are trying to use isn’t in your path, it is likely in
~/software
.
This objective is divided in seven parts:
As noted above, the dataset for this practical comes from Steenwyk et al. (2019) mBio.
Hint:
$ wget http://...
confidence_measures.tar.gz
Hint:
# the command is 'tar' ("tape archive")
Solution:
$ tar -zxvf confidence_measures.tar.gz
ls
commandAs described in the lecture, it can be helpful to collapse poorly support branches before, for example, species tree infernece using summary tree methods. Collapsing poorly supported branches can help account for gene tree uncertainty because the summary tree methods assume gene trees are otherwise accurate. To do so, we will use the PhyKIT function collapse_branches.
Steps: 1. Change directory into
FILES_collapsing_across_thresholds
. 2. Write a nested for
loop and collapse the branches of each phylogenetic tree with thresholds
of 80, 70, and 60 using the collapse_branches
function.
Solution
for j in $(echo "60 70 80"); do for i in $(ls *treefile) ; do echo $i $j ; pk_collapse_branches $i -s $j -o $i.${j}collapsed ; done ; done
for
loop. Make the output be nicely.Hint
# The first question is how do we count internal branches in a phylogenetic tree?
# I know an indirect way we can do this using the `internal_branch_stats` function of PhyKIT.
# PhyKIT functions that print out summary statistics (mean, medians, etc.) have a verbose option, -v/--verbose.
# When using the verbose option, raw data is outputted instead. Thus, we can obtain internal branch length information
# using the `internal_branch_stats` function but use the verbose option (-v/--verbose) and get the length of each internal branch.
# Thereafter, we can count how many internal branches there are and calculate the difference to see how many were collapsed.
Solution
# copy and paste this one-liner
echo -e "file\tthreshold\tBranch cnt-uncoll\tBranch cnt-coll\tNum. collapsed branches" ; for i in *treefile ; do cnt=$(pk_internal_branch_stats $i -v | wc -l) ; for j in $(echo "60 70 80"); do coll_cnt=$(pk_internal_branch_stats $i.${j}collapsed -v | wc -l) ; num_coll=$(expr $cnt - $coll_cnt) ; echo -e "$i\t$j\t$cnt\t$coll_cnt\t$num_coll" ; done ; done
# one-liner written out as more readable with comments so you can better understand
echo -e "file\tthreshold\tBranch cnt-uncoll\tBranch cnt-coll\tNum. collapsed branches" ;
for i in *treefile ; do
cnt=$(pk_internal_branch_stats $i -v | wc -l) # count internal branches in the uncollapsed case
for j in $(echo "60 70 80"); do
coll_cnt=$(pk_internal_branch_stats $i.${j}collapsed -v | wc -l) # count internal branches in the collapsed case
num_coll=$(expr $cnt - $coll_cnt) # substract the collapsed number of internal branches from the uncollapsed number of internal branches
echo -e "$i\t$j\t$cnt\t$coll_cnt\t$num_coll"
done ;
done
file threshold Branch cnt-uncoll Branch cnt-coll Num. collapsed branches
OG0002050.fa.mafft.clipkit.treefile 60 21 20 1
OG0002050.fa.mafft.clipkit.treefile 70 21 20 1
OG0002050.fa.mafft.clipkit.treefile 80 21 17 4
OG0002051.fa.mafft.clipkit.treefile 60 21 19 2
OG0002051.fa.mafft.clipkit.treefile 70 21 18 3
OG0002051.fa.mafft.clipkit.treefile 80 21 14 7
OG0002053.fa.mafft.clipkit.treefile 60 21 20 1
OG0002053.fa.mafft.clipkit.treefile 70 21 20 1
OG0002053.fa.mafft.clipkit.treefile 80 21 19 2
where
OG0002051.fa.mafft.clipkit.treefile had 7 branches that were collapsed
OG0002051.fa.mafft.clipkit.treefile.70collapsed
. Are shallow or deep branches collapsed? Why is your
answer more likely to be collapsed?Hint
# This command allows you to quickly view a phylogenetic tree in the terminal by printing a tree in ASCII characters.
pk_print_tree OG0002051.fa.mafft.clipkit.treefile.70collapsed -r
# the -r argument removes branch length information making it easier to see which branches got collapsed
Solution
Often, deeper branches in a phylogeny will have lower support values because more ancient divergences are more difficult to accurately infer. Thus, deeper, not shallower, branches are more likely to be collapsed.
To examine internode certainty values among bipartitions in our
putative species tree, we will first execute a script that replaces
bootstrap values with internode support values and forgo examination of
internode certainty all values. We will then view the tree in FigTree or
similar software. Note, if you’d like to use a webservice like phylo.io, you can cat
the tree
file and copy and paste the resulting text into the
Tree data
box on the phylo.io website.
Steps:
FILES_scripts_and_matrices/report_ic_as_internode_support.py
.
To understand how to use the script, execute the following command:python FILES_scripts_and_matrices/report_ic_as_internode_support.py -h
RAxML_IC_Score_BranchLabels.Asp_Pen_phylo.ic
to report
internode certainty values as support values.Solution
python FILES_scripts_and_matrices/report_ic_as_internode_support.py -t RAxML_IC_Score_BranchLabels.Asp_Pen_phylo.ic
Open up the resulting file,
RAxML_IC_Score_BranchLabels.Asp_Pen_phylo.ic.ic.tree
, using
FigTree or your preferred tree viewer.
In Steenwyk et al. (2019) mBio, we further scrutinized bipartitions with internode certainty values <0.1. Our reasoning is that internodes with this value or lower have ~5 evaluation trees that support the reference topology and ~2 or more phylogenies that support an alternative topology - we consider this substantial incongruence.
Identify the internode with an internode certainty below 0.1 and specify the value and which internode that is in the phylogeny below. Note, you will have midpoint root the phylogeny first, or place the root at ancestor of U. reesii and C. posadasii.
Solution:
The branch at the ancestor of M. ruber and P. camemberti has a low IC value.
pk_ptt
function.Steps:
Examine the verboseSplits output from RAxML, which is file
RAxML_verboseSplits.Asp_Pen_phylo.ic
The bipartition for the contentious internode of question is second to last instance of the string ‘partition:’, which appears as the following: partition:
---** ***** * xx/yy/zz
---- **** * xx/yy/zz
--***
----- - xx/yy/zz
where xx/yy/zz
represents the following:
We will focus on the first two topologies which consider whether the genus Penicilliopsis is more closely related to outgroup taxa or if Xeromyces and Monascus is. In other words, which genus/genera Penicilliopsis or Xeromyces and Monascus is sister to all other members of this fungal family?
What is the number of genes and gene support frequency that support the two topologies?
Solution:
First topology:
375/29.137529
Second topology:
340/26.418026
iqtree2 -t Asp_Pen_phylo.subset.new --gcf 1287.trees --scf 100 -p FILES_AA_fastas_subset/ --prefix gcf1297_scf100_ref
where:
For now, we will focus on gene and site concordance factors.
gcf1297_scf100_ref.cf.tree
file, which can be viewed in
FigTree or similar software. What is the gene
concordance and site concordance factor for the contentious
bipartition?Solution:
0.155/32.1
iqtree2 -t Asp_Pen_phylo.subset.alt.new --gcf 1287.trees --scf 100 -p FILES_AA_fastas_subset/ --prefix gcf1297_scf100_alt
gcf1297_scf100_alt.cf.tree
. What is the gene
concordance factor and site concordance factor for this
bipartition?Solution:
0/36.2
To determine gene-wise and site-wise phylogenetic signal, we will follow protocol first described in Shen et al. (2017), Nature Ecology and Evolution and calculate gene-wise and site-wise log-likelihood scores. We will use IQ-Tree to facilitate the analysis.
Steps:
# Command 1
iqtree2 -s FILES_scripts_and_matrices/concat.fa -seed 78913467814 -st AA -pre likelihood_ref -spp FILES_scripts_and_matrices/partition.file -te Asp_Pen_phylo.subset.new -wpl -wsl
# Command 2
iqtree2 -s concatenation.fasta -seed 13469781343 -st AA -pre likelihood_alt -spp partition.file -te FILES_GLS_files/concatenation.alt.topology -wpl -wsl
cd FILES_gene_and_site_lh_values
Next examine how to use the script that will create summary files for you using the following command:
bash ../FILES_scripts_and_matrices/create_GLS_summary.sh -h
Based on the help message, create gene-wise and site-wise log likelihood score summary files using the following command:
bash ../FILES_scripts_and_matrices/create_GLS_summary.sh ../FILES_scripts_and_matrices/partition.file likelihood_ref.partlh likelihood_alt.partlh likelihood_ref.sitelh likelihood_alt.sitelh
library(ggplot2)
If you are not in the working directory with your results files, use
setwd('path')
to change to the appropriate directory. Make
sure you are in the appropriate directory with the getwd()
command.
gene <- read.table("gene-wise_logli_scores.txt", sep = "\t")
where
ggplot(gene, aes(V1, V4, color = V5)) + geom_bar(stat = "identity") + ggtitle("GLS (gene-wise)") +
xlab("Genes") + ylab("deltaGLS") + theme_classic() + theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank()) + scale_color_manual(values = c("#56B4E9", "#999999",
"#E69F00"))
head(gene[order(-gene$V4), ])
Solution
EOG091N0K4M has a ΔGLS of 113.180
sum(gene[(gene$V4 != 113.18), ]$V4)
where
sum()
will calculate the sum of a set of numbers
andgene[(gene$V4!=113.180),]$V4
will print the set of
numbers excluding the highly opinionated gene.Solution
sum w/ 400.72
sum w/out 287.54
No, because the sum of phylogenetic signal is still (+).
# read in site-wise log likelihood scores file
site <- read.table("site-wise_logli_scores.txt", sep = "\t")
# create plot
ggplot(site, aes(V1, V4, color = V5)) + geom_bar(stat = "identity") + ggtitle("SLS (site-wise)") +
xlab("Sites") + ylab("deltaSLS") + theme_classic() + theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank()) + scale_color_manual(values = c("#56B4E9", "#999999",
"#E69F00"))
Note, this plot may take 5-10 minutes to generate. If you don’t want
to wait, see
FILES_results/FILES_gene-wise_site-wise_log_likelihood/site_wise.png
for a premade PNG file of the same plot that would be generated.
Solution
No
Each dataset has both SC-OGs and SNAP-OGs. For this bonus question, use both SC-OGs and SNAP-OGs.
Light hints for overall workflow
1. You will have to infer a summary tree (aka, species tree estimate) first. How do you infer a species tree from a collection of single gene trees?
2. Use the PhyKIT function `bipartition_support_stats` to get summary statistics support statistics in a phylogeny. Support statistics have to be reported in the position bootstrap statistics are reported.
Detailed hints for overall workflow
1. To do so, first infer a summary tree using ASTRAL and then use the input trees as evaluation trees to calculate internode certainty and related measures in RAxML.
astral.jar -i all_trees.tre -o output_tree
2. Calculate internode certainty
raxmlHPC -f i -t output_tree -z all_trees.tre -m GTRCAT -n output_tree.ic -C
3. Use `report_ic_as_internode_support.py` to reformat IC values as bootstrap values
4. Use the PhyKIT function `bipartition_support_stats` to get summary statistics of internode certainty values across the tree
# https://jlsteenwyk.com/PhyKIT/usage/index.html#bipartition-support-statistics
phykit bipartition_support_stats input.tree
5. Keep track of your summary statistic of choice across each dataset and rank the datasets by which has the most to least certainty.
Solution
Gotcha!!!!
No solution will be provided here. Grab a TA, discuss with them and show them your results.
Solution
Gotcha again!!!!
No solution will be provided here. Grab a TA, discuss with them and show them your results.