Introgression detection

Tree-based Introgression Detection

Michael Matschiner, 18 January 2025

Slides

Background

Comparisons of the frequencies of tree topologies, inferred from sequence alignments from across the genome, can serve as a useful complement to SNP-based analyses for the detection of past introgression events. Importantly, introgression tests based on such sets of phylogenies could be robust to conditions that may produce misleading results in the ABBA-BABA test. In particular, the D-statistic assumes identical substitution rates for all species and the absence of homoplasies so that derived sites can only be shared if they had the same origin; the possibility of multiple independent substitutions at the same site is ignored. These conditions assumed by the ABBA-BABA test are likely to hold in sets of recently diverged species, but may be problematic when more divergent species are compared. Phylogenetic approaches based on sequence alignments can therefore serve to verify or reject patterns of introgression determined with SNP-based methods.

Learning goals

Know how to extract a large set of sequence alignment blocks from a whole-genome alignment, and how to identify those most suitable for phylogenetic analysis.
Find out how to produce a phylogenies for the filtered set of alignment blocks.
Use the set of phylogenies to estimate a species tree and assess support for introgression.

How to run this activity

For this activity, you will need to connect to the Amazon instance with a terminal window on your machine, using SSH (ssh [email protected]), or you can use the terminal that comes with Guacamole (ec2-XX-XXX-XX-XXX.compute-1.amazonaws.com:8080/guacamole/). Then navigate to the activity folder: cd ~/workshop_materials/27_tree_based_introgression_detection

Requirements

PAUP*: The software PAUP* is a general-utility program for phylogenetic inference. A graphical user interface has been developed for Windows (and also for MacOS, but that is now no longer functional). For this activity, the command-line version installed on the Amazon instances is recommended and can (in principle) be executed with the command ~/software/paup.
IQ-TREE: IQ-TREE (Minh et al., 2020) is a modern tool for rapid phylogenetic inference under maximum likelihood. We are here using IQ-TREE v.2, a significant update from IQ-TREE v.1. The program is installed on your Amazon instance and can be executed with the command iqtree2.
ASTRAL:The program ASTRAL (Zhang et al., 2018) allows efficient and accurate estimation of species trees from gene trees. The program is installed on your Amazon instance and can be executed with the command java -jar ~/software/Astral/astral.5.7.8.jar.
PhyloNet: The PhyloNet program (Than et al. 2008) implements a range of methods for the inference of species trees and networks in a maximum-likelihood, Bayesian, or parsimony framework. It is installed on your Amazon instance and can be executed with the command java -jar ~/software/PhyloNet/PhyloNet.jar.
FigTree : The program FigTree is a very intuitive and useful tool for the visualization and (to a limited extent) manipulation of phylogenies encoded in Newick format. This program can easily be installed on your machine, using executables for Mac OS X, Linux, and Windows provided on https://github.com/rambaut/figtree/releases (I recommend to use v.1.4.4 if you want to install FigTree on your machine). Alternatively, you can use the FigTree installation on the Amazon instances through Guacamole.

Notes

All text in red on gray background indicates commands or file names that you can type or copy to the terminal. Text in black on gray background refers to file names, paths, or terminal outputs.
Questions or tasks are indicated with Q. If you get stuck, check the answer-box:

Show me the answer!

Not now already. Only if you’re really stuck! First, try to find the answer yourself!

Outline
Dataset
Familiarizing yourself with the whole-genome alignment
Generating a set of gene trees
Estimating the species tree from gene trees with ASTRAL
Detecting introgression based on tree-topology asymmetry
- Calculating D_tree
- Plotting D_tree
Model-based inferrence of species networks
- Estimating species networks with PhyloNet
- Visualizing species networks with IcyTree

Outline

In this activity I am going to demonstrate how phylogenies from across the genome – “gene trees” – can be used to infer past introgression events. The alignment blocks used for phylogenetic inference will be extracted from a chromosome-scale alignment, and they will be filtered according to their proportion of missing data and their frequency of recombination breakpoints to identify the most suitable alignment blocks for phylogenetic analysis. This phylogenetic analysis will then be conducted for each selected alignment block based on maximum likelihood with IQ-TREE. The resulting set of block phylogenies will first be used to infer a species tree with ASTRAL. Then, the same set of phylogenies will be used to determine asymmetry among alternative phylogenetic topologies for each species trio, and as in the application of D-statistics, this asymmetry will allow conclusions about past introgression events. Finally, the set of phylogenies will be analyzed with PhyloNet to assess support for alternative models of diversitifcation with and without introgression.

Dataset

The dataset used in this activity is part of a whole-genome alignment, prepared with the program Progressive Cactus (Armstrong et al., 2020). More details about the preparation of this whole-genome alignment, including repeat masking of the input assemblies, can be found here. This whole-genome alignment contains sequences for five species of the cichlid genus Neolamprologus of Lake Tanganyika: Neolamprologus brichardi (“neobri”), Neolamprologus gracilis (“neogra”), Neolamprologus marunguensis (“neomar”), Neolamprologus olivaceous (“neooli”), and Neolamprologus pulcher (“neopul”). Additionally, the whole-genome alignment contains sequences for Nile tilapia (Oreochromis niloticus; “orenil”), which will be used as an outgroup. Most assemblies used for the whole-genome alignment can only be considered draft assemblies, as they are heavily fragmented and based exclusively on short-read Illumina reads. Nevertheless, the whole-genome alignment contains sufficient blocks with lengths over 1,000 bp, that can be useful for phylogenetic analyses. To save computing time, this activity does not use the complete whole-genome alignment, but only alignment blocks mapping to chromosome 5 of the Nile tilapia (Oreochromis niloticus; “orenil”) genome (Conte et al., 2019).

If you would like to run the activity on your machine you can download the whole-genome alignment file cichlids_chr5.maf here. On the Amazon instance, the same file can be found in ~/workshop_materials/27_tree_based_introgression_detection/data.

Familiarizing yourself with the whole-genome alignment

The whole-genome alignment (even though this alignment contains only data for a single chromosome, I will continue to refer to it as “whole-genome alignment”) cichlids_chr5.maf was originally produced in the reference-free, binary HAL format, but has already been converted to human-readable (but reference-based) MAF format with the tool hal2maf.

Have a look at the file cichlids_chr5.maf, e.g. with the command less. Try to figure out the structure of this file, if necessary with the help of the format definition.

Show me the answer!

To see the file content, this command is suitable:
less -S ~/workshop_materials/27_tree_based_introgression_detection/data/cichlids_chr5.maf

This command should show the following output:

##maf version=1 scoring=N/A
# hal ((neomar:0.0045736,neogra:0.004315)Anc1:0.00021445,((neopul:0.0033453,neooli:0.0032482)Anc3:0.00094829,neobri:0.0049009)Anc2:0.00032849,orenil:0.046583)Anc0;

a
s       orenil.NC_031970.2      0       51272   +       39714817        gtgtattggttttagttacctttagcgtatgtgctagttagcgatagctatggcagcacagttatttgattgttttgggcttgttcctgctttgtttgttccccctgcaaa

a
s       orenil.NC_031970.2      51272   1659    +       39714817        atctcatcaactaatctgtcaccagtgttgggcaagttactttgaaaaagtaattcaattatagttactagttcttcaaaaaagtaactgagttagtaactgagttacaat
    			s       neobri.NW_006272085.1   2638449 205     -       2741720 ATCTCATCAACTAATCTGTCAC-------------------------------------------------------------------------------------------------
...

What we see there are first a tree in Newick format on the second line of the file, that was used to build the alignment. Below that, we see, separated by empty lines, alignment blocks initiated with the letter “a” and consisting of sequences marked with the letter s.

Think about the type of alignments that we would need for phylogenetic analyses, and whether we can see those in file cichlids_chr5.maf.
Show me the answer!

The first two alignment blocks do not contain many sequences; the first contains only one singe sequence and the second contains two. What we want for phylogenetic analyses are alignment blocks with one sequences for each of the six included cichlid species, and these should also have a certain minimum length. When you scroll down to see more of the file content, you’ll eventually find alignment blocks that fulfill these criteria.

Generating a set of gene trees

In this part of the activity, we will extract suitable alignment blocks for phylogenetic inference from the whole-genome alignment, and use these to generate gene trees. In other cases where a whole-genome alignment is not available, sets of gene trees could alternatively be based on alignments of orthologous markers, or on chromosome-scale alignments produced by mapping reads to a reference assembly, followed by variant calling.

Extracting alignment blocks from a whole-genome alignment

To extract blocks from the whole-genome alignment cichlids_chr5.maf, we’ll need to use a custom script, written in Python. This script can be found online here.

Ideally, alignments used for phylogenetic analysis should have as little missing data as possible, be as informative as possible, and show no signs of within-alignment recombination. Thus, we are going to extract alignments from the whole-genome alignment while filtering for information content (completeness and number of polymorphic sites), but to reduce the probability of within-alignment recombination, we will also quantify signals of recombination per alignment and remove those alignments for which these signals are strongest.

As the length of all alignments, we use 1,000 bp, assuming that this length is a good compromise between increasing probability of recombination with longer alignments and decreasing phylogenetic signal with shorter alignments. For a more thorough analysis, however, it might be worth testing a range of different alignment lengths.

Make sure that you are in the right directory for this activity: cd ~/workshop_materials/27_tree_based_introgression_detection
Download the script to extract alignment blocks from a whole-genome alignment in MAF format to your working directory on the Amazon instance: wget http://evoinformatics.group/make_alignments_from_maf.py
Familiarize yourself with this script by calling its help text: python make_alignments_from_maf.py -h
Based on the information gained from the help text, execute the script make_alignments_from_maf.py to extract up to a maximum of 10,000 alignments with a length of 1,000 bp in Nexus format from the whole-genome alignment cichlids_chr5.maf, using only those with sequences for all six species of the whole-genome alignment, a completeness of at least 95%, and at least 10 polymorphic sites. Save the resulting alignment blocks in a new directory named short_alignments, with the prefix “cichlids”.
Show me the code!
- Maximum of 10,000 alignments: -n 10000
- Length of 1,000 bp: -l 1000 -k 1000
- Nexus format: -f nexus
- Sequences for all six species: -x 6
- Completeness of at least 95%: -c 0.95
- At least 10 polymorphic sites: -m 10
- Output directory short_alignments: -p short_alignments
- Prefix “cichlids”: cichlids (as second argument, after the MAF input file name as first argument)
The full command is therefore python make_alignments_from_maf.py data/cichlids_chr5.maf cichlids -n 10000 -l 1000 -k 1000 -f nexus -x 6 -c 0.95 -m 10 -p short_alignments
Have a look at the screen output of the script. Q How many alignments did it write? And how many polymorphic sites do they have?

Show me the answers!
The output should indicate, in its last line, that 5,657 alignments were written, and the lines above should indicate that most alignments have between 20 and 60 polymorphic sites.
Have a look at a few of these alignment files with the less command to get a feeling for their completeness, and to see the specifics of the Nexus format.
(Optionally) To also get a sense of the variation present in the alignments, you could copy a few of them to your local computer and visualize them there with an alignment viewer such as the excellent program AliView (Larsson, 2014). Q Which of the species seems to be the most divergent?

Show me the answer!
You’ll probably find that most of the variation distinguishes the outgroup sequence “orenil” from the five other sequences, and that there may therefore be relatively little information for the inference of relationships among the five ingroup species of the genus Neolamprologus.

Alignment block filtering

As a measure of within-alignment recombination, we are going to quantify the number of so-called hemiplasies in each alignment, and we will use this measure as a criterion for filtering. Simply put, a hemiplasy is a site that supports a grouping that is in conflict with a grouping supported by another site of the same alignment. Possible explanations for hemiplasy include sequencing errors but within-alignment recombination in combination with incomplete lineage sorting can also produce them. And because within-alignment recombination can lead to unreliable phylogenetic inference, this means that alignments with none or few hemiplasies are more suitable for phylogenetic inference than other alignments with many hemiplasies (Maynard Smith and Smith, 1998).

Fortunately, the number of hemiplasies can easily be approximated as the difference between the number of variable sites and the so-called parsimony score of an alignment, which is the lowest number of substitutions on a phylogeny that is required to produce the alignment (the result is only an approximation because some hemiplasies may increase the parsimony-score by more than 1 if three or more substitutions are required to explain them). To calculate the number of variable sites, a custom Ruby script called get_number_of_variable_sites.rb can be used.

Download the script get_number_of_variable_sites.rb from the GitHub repository of the study by Ronco et al. (2021) to your current directory on Saga:
wget https://raw.githubusercontent.com/cichlidx/ronco_et_al/master/lt_phylogenomics/src/get_number_of_variable_sites.rb
Before moving on to calculate parsimony scores, first test this script with the first Nexus file in directory short_alignments to make sure that it’s working:
nex=`ls short_alignments/*.nex | head -n 1` ruby get_number_of_variable_sites.rb ${nex}
This should write just a single number to the screen.

The calculation of parsimony scores requires the software PAUP*, developed by Workshop on Molecular Evolution legend Dave Swofford. Conveniently, we can use another script from the GitHub repository of Ronco et al. (2021) to run PAUP* and calculate the parsimony score for an input file in Nexus format.

Download the script get_parsimony_score.sh from the GitHub repository of Ronco et al. (2021):
wget https://raw.githubusercontent.com/cichlidx/ronco_et_al/master/lt_phylogenomics/src/get_parsimony_score.sh
This script performs several steps: It writes a command file for PAUP* which specifies the task that PAUP* should perform as well as the input file, then it runs PAUP* with this command file, it reads the output of PAUP* to extract the parsimony score, reports that score, and finally it removes the temporary files that were written by PAUP* to clean up the directory.
Before running the script get_parsimony_score.sh, we have to make a single change in the script: We need to delete lines 3 and 4, which have the following content:
# Load the paup module. module load paup/4.0a163
We can delete these two lines with the following command:
sed -i '3,4d' get_parsimony_score.sh
Then check that the script works by running it with the first Nexus file in directory short_alignments, using the following commands:
nex=`ls short_alignments/*.nex | head -n 1` bash get_parsimony_score.sh ${nex}

What am I supposed to see?
The parsimony score for the first alignment should be 49.
Compare the parsimony score, calculated in the last step for the first alignment, with the number of variable sites, which you also already calculated for the first alignment. Q Are they identical? And if not why not?

Show me the answers!
The parsimony score is likely larger than the number of variable sites. As explained above, this is the result of hemiplasies in the alignment.
Now, use the two different scripts (get_number_of_variable_sites.rb and get_parsimony_score.sh) jointly to quantify the number of hemiplasies for all alignments (as the difference between the parsimony score and the number of variable sites), and to record this information in a new table that will be named short_alignment_stats.txt. For this to work, you will need to write a new BASH script that calls both scripts for each alignment, stores the resulting numbers, subtracts them from each other, and writes the difference into the new file short_alignment_stats.txt.
OMG, how is this supposed to work?
Try the following script:
```
echo -e "id\tn_variable_sites\tn_hemiplasies" > short_alignment_stats.txt
for nex in short_alignments/*.nex
do
	id=`basename ${nex%.nex}`
	n_variable_sites=`ruby get_number_of_variable_sites.rb ${nex}`
	parsimony_score=`bash get_parsimony_score.sh ${nex}`
	n_hemiplasies=$((${parsimony_score}-${n_variable_sites}))
	echo -e "${id}\t${n_variable_sites}\t${n_hemiplasies}" >> short_alignment_stats.txt
done
```
Note, however, that executing this script for all alignments would likely take several hours. There would be ways to parallelize the script, but these might be too distracting from what we really want to learn here. Thus, if you already executed the script, please cancel it with Ctrl-C and remove the file that it started to write with rm short_alignment_stats.txt. Then, continue by downloading the prepared version of file short_alignment_stats.txt: wget http://evoinformatics.group/short_alignment_stats.txt
Have a look at the content of file short_alignment_stats.txt, using less. You’ll see that the file contains a table with three columns, of which the first lists the alignment ID, the second the number of variable sites in the alignment, and the third the number of hemiplasies in the alignment.

It is likely that the numbers of variable sites and hemiplasies correlate with each other, because without variable sites, an alignment could not have any hemiplasies. However, the most suitable alignments for phylogenetic inference are those with a large number of variable sites (thus, much phylogenetic information) and a small number of hemiplasies. We are therefore going to filter the alignments based on these two criteria. But before, we will need to decide on thresholds for the two criteria. To allow us to determine suitable thresholds on the minimum number of variables sites and the maximum number of hemiplasies in alignments, we will plot the two measures against each other.

Write a new R script named plot_n_variable_and_n_hemiplasies.r with the following content:

table <- read.table("short_alignment_stats.txt", header=T)
pdf("short_alignment_stats.pdf", height=7, width=7)
plot(table$n_variable_sites, table$n_hemiplasies, xlim=c(0,100), xlab="Variable sites", ylab="Hemiplasies", pch=16, col = rgb(red=0, green=0, blue=0, alpha=0.2))
dev.off()

Execute the R script plot_n_variable_and_n_hemiplasies.r: Rscript plot_n_variable_and_n_hemiplasies.r
This should produce a file called short_alignment_stats.pdf
Copy file short_alignment_stats.pdf from the Amazon instance to your local computer with scp, and open it in a PDF viewer, or have a look at it through Guacamole.

I'm really too lazy for this.
OK, then have a look at the plot here:

Q Which minimum number of variable sites and maximum number of hemiplasies would you consider a reasonable choice for selection of alignments for phylogenetic analyses?

Yes, good question.
We want the number of variable sites to be relatively large and that of hemiplasies to be close to zero, so that phylogenetic analyses are well-informed by the variable sites, but not biased due to hemiplasies. The plot indicates that we might still have a sufficient number of alignments even if we apply strict filters with a minimum number of 50 variable sites and not even a single hemiplasy.
Try to write a BASH script named filter_alignments.sh that takes file short_alignment_stats.txt as input, copying all alignments that match our chosen filters to a new directory named short_alignments_filtered.
Seriously?
It's not actually so hard. The following script will do this, applying a minimum number of 50 variable sites and a maximum number of 0 hemiplasies (excluding all alignments with hemiplasies):
```
mkdir -p short_alignments_filtered
cat short_alignment_stats.txt | tail -n +2 | awk '$2 >= 50' | awk '$3 == 0' | cut -f 1 > tmp.txt
while read line
do
	echo "Selecting file short_alignments/${line}.nex"
	cp short_alignments/${line}.nex short_alignments_filtered
done < tmp.txt
rm tmp.txt
```
If you haven't yet done so, execute the script filter_alignments.sh: bash filter_alignments.sh
Q How many alignments have been moved to directory short_alignments_filtered?

Show me the answer!
364 alignments should have passed the filtering and should therefore have been copied to directory short_alignments_filtered. We can count these with ls short_alignments_filtered/*.nex | wc -l

Inferring gene trees from alignment blocks using IQ-TREE

In this part of the tutorial, we are going to infer phylogenies for each filtered alignment; the set of inferred phylogenies will subsequently be used to determine asymmetries in the topologies of species trios which can serve as an indicator for introgression. We are going to use IQ-TREE for maximum-likelihood phylogenetic inference. We will need to specify the ID of the outgroup "orenil" so that the resulting trees are rooted.

Familiarize yourself with the options available in IQ-TREE. To see a short summary of all options, just execute IQ-TREE without any arguments: iqtree2

The screen output of the above command tells you how to see all the options available to IQ-TREE; use the corresponding command and briefly scroll over the resulting screen output to get an impression of what IQ-TREE can do.

Run IQ-TREE with the first filtered alignment as input to get a sense of how long it will run, and how to interpret its output.

Show me the code!

These two commands would be suitable to get the name of the first filtered alignment, and to run IQ-TREE with it as input:

nex=`ls short_alignments_filtered/*.nex | head -n 1`
iqtree2 -s ${nex} -o orenil

As mentioned above, the outgroup needs to be specified, as this is here done with -o orenil.
By not specifying a substitution model we allow IQ-TREE to automatically select the best-fitting model itself.

Find the output files written by IQ-TREE for this first phylogenetic analysis. These are in the same directory and have the same name as the input alignment, but with additional file extensions.

Show me the code!

You can find these for example with ls -l short_alignments_filtered/*.nex.*

The file that's most important for us, because it contains the inferred phylogenet tree, is the one with the file ending in .treefile

Have a look at the content of the file ending in .treefile using less. You should see that this file consists of a single line, encoding the phylogeny in Newick format.

If you're unfamiliar with the Newick tree format, now would be a good time to look up the Wikipedia page explaining the format.
Visualize the inferred phylogeny. There are a few options to do so: You could copy the file over to your own machine, and open it in FigTree there or you could use Guacamole and FigTree on the Amazon instance to open it there. You could also just copy the content of the file (the single line with the Newick string) and paste it into an empty new FigTree window on your own machine. And if you don't have a FigTree installation on your own machine, you could open the IcyTree website and visualize the tree there online by clicking "File" in the top left menu, then "Enter tree directly...", and then pasting the Newick string into the window that opens.

I'm again too lazy for this.
Sorry, you may need to do this yourself; as the trees that you inferred might differ from those that I got. IQ-TREE is not guaranteed to find the same maximum-likelihood tree each time that it is run on a certain alignment.
To facilitate the next step, remove the output of the first IQ-TREE run from directory short_alignments_filtered: rm short_alignments_filtered/*.nex.* (Make sure to type the period after "nex" in this command, or the alignements themselves will be deleted)
To run IQ-TREE for all filtered alignments, write and run a BASH script named run_iqtree_filtered.sh.
How exactly?
Write the following code to the new file run_iqtree_filtered.sh:
```
for nex in short_alignments_filtered/*.nex
 do
 	iqtree2 -s ${nex} -o orenil
done
```
Then, execute this file with bash run_iqtree_filtered.sh
Combine all inferred phylogenies into a single file called short_alignments_filtered.trees.

Hmmm?
cat short_alignments_filtered/*.treefile > short_alignments_filtered.trees
Make sure that the above command worked and you have a file named short_alignments_filtered.trees with tree information in it, and if so, you could clean up files no longer needed with rm short_alignments_filtered/*.nex.*
Like before with the single phylogeny, open file short_alignments_filtered.trees in FigTree or IcyTree, either on your own machine after copying the file or its content there, or on the Amazon instance through Guacamole. If using FigTree, the arrows at the top of the window allow you to browse through the different phylogenies. Q After seeing a few of the phylogenies, what is your impression of these trees?

OK, but what do you think?
Most of the alignments contain polytomies and short branches separating the five Neolamprologus species, indicating that most of the variable sites separate the outgroup from the ingroup, but that there is rather little phylogenetic information about the relationships of the ingroup species. Thus, individual gene trees contain very little information about their relationship. Nevertheless, in their combination, they should be informative about the species relationships and past introgression events.

Estimating the species tree from gene trees with ASTRAL

ASTRAL infers the species tree by searching for the topology that agrees with the largest number of species quartets included in the set of gene trees. In addition, branch lengths are estimated in coalescent units based on the number of quartets that are in conflict with a branch of the species tree. Unlike concatenation, the species-tree approach of ASTRAL has been shown to be statistically consistent under the multi-species coalescent model (Mirarab et al., 2014), meaning that its species-tree estimate is guaranteed to converge to the correct species tree with increasing size of the dataset.

Even though we won't need any non-default settings, you may want to familiarize yourself with the available ASTRAL options by calling the program with java -jar ~/software/Astral/astral.5.7.8.jar --help
Run ASTRAL with the file containing all inferred phylogenies, short_alignments_filtered.trees, as input, and specifying species.tre as output file name.

Show me the command!
java -jar ~/software/Astral/astral.5.7.8.jar -i short_alignments_filtered.trees -o species.tre

The output file named species.tre now contains just a single species tree, annotated with posterior probabilities as node support.
Visualize again the file species.tre with FigTree or IcyTree to see the inferred species tree.

Detecting introgression based on tree-topology asymmetry

Analogous to the logic behind the D-statistic, asymmetry in the topologies of species trios can be used to identify past introgression. In the absence of introgression, we would expect that gene trees of three species P1, P2, and P3 most frequently support the topology of the true species tree (e.g. a clade formed by P1 and P2 if these are in fact sister species), and that the two alternative topologies (e.g. a clade formed by P1 and P3 or a clade formed by P2 and P3) should occur equally frequently. A significant difference in the frequencies of the two alternative topologies is therefore not compatible with a null hypothesis of no introgression, meaning that such a significant difference would support the occurrence of past introgression. An important difference to the D-statistic is that the analysis of tree asymmetry can be more robust to rate variation among species (Koppetsch et al., 2024).

Calculating D_tree

To analyze the frequencies of alternative trio topologies in the set of trees, we can use another script from the GitHub repository of Ronco et al. (2021), the Ruby script analyze_tree_asymmetry.rb.

Download the script analyze_tree_asymmetry.rb to your current directory on your Amazon instance.

Again with wget?
Yes, like this: wget https://raw.githubusercontent.com/cichlidx/ronco_et_al/master/age_determination/introgression_detection/src/analyze_tree_asymmetry.rb

This script should work well with any file that contains a set of trees in Newick format. It determines the set of species in these trees and then tests for each possible species trio how often two of the three species form a pair relative to the third species. These frequencies are then used to place the three species in positions P1, P2, and P3 so that

N_(P1,P2) > N_(P2,P3) > N_(P1,P3)

where N_(P1,P2) is the frequency at which P1 and P2 form a pair, N_(P2,P3) is the frequency of pairs formed by P2 and P3, and N_(P1,P3) is the frequency of pairs formed by P1 and P3. An equivalent to the D-statistic, that we call D_tree is then calculated as

D_tree = (N_(P2,P3) - N_(P1,P3)) / (N_(P2,P3) + N_(P1,P3))

and a p-value for the null hypothesis of no asymmetry between N_(P1,P3) and N_(P2,P3) is calculated based on a binomial test.

To quantify tree asymmetry, run the script analyze_tree_asymmetry.rb with the tree file short_alignments_filtered.trees, and specify that the output should be written to a new file named short_alignments_filtered_topology_frequencies.txt, using the following command:
ruby analyze_tree_asymmetry.rb short_alignments_filtered.trees short_alignments_filtered_topology_frequencies.txt
Have a look at the file short_alignments_filtered_topology_frequencies.txt produced by script analyze_tree_asymmetry.rb, for example using less.

How do I read these results?
In file short_alignments_filtered_topology_frequencies.txt, each row includes results for one species trio, with the frequencies of the three alternative topologies in columns 4 to 6, followed by the equivalent to the D-statistic, D_tree, in the seventh column and the p-value in the last. For the species trios that include Oreochromis niloticus ("orenil"), the frequencies N_(P1,P3) and N_(P2,P3) are all 0, which is not surprising because we had set this species to be the outgroup. The reason why in these trios only a proportion of the trees support N_(P1,P2) is that IQ-TREE places the outgroup with another species in a polytomy at the root, and the script analyze_tree_asymmetry.rb ignores trios connected by polytomies. But since the frequencies of topologies in species trios that include the outgroup are by definition never asymmetric, we can ignore these comparisons anyway. More conclusive results are given for species trios within the ingroup, for example for the trio of "neooli", "neobri", and "neogra", where around 150 trees pair "neooli" and "neobri" and around 120 trees pair "neobri" and "neogra", but only around 100 trees pair "neobri" and "neogra" (note that these numbers can vary quite a bit when relatively few trees are used - as is the case here). If the D_tree value is positive and the p-value is below some significance threshold (e.g. 0.05), this asymmetry can not be explained by incomplete lineage sorting alone.

Q Which species trio has the highest D_tree value in this table?

Show me the answer!
The trio with the highest D_tree is probably the one with "neooli", "neopul", and "neobri", in which around 200 trees pair "neooli" and "neopul" and around 120 trees pair "neopul" and "neobri" but only around 70 trees pair "neooli" and "neobri". The D_tree value for this trio should be above 0.2 and the p-value should be below 0.001.

Plotting D_tree

A good way of plotting the asymmetry among alternative trio topologies, and the significance of this asymmetry, is in the form of a heatmap. This heatmap plot can be generated with another Ruby script GitHub repository of Ronco et al. (2021), plot_tree_asymmetry.rb.

Download the Ruby script plot_tree_asymmetry.rb from the GitHub repository of Ronco et al. (2021).

I know, I need wget, but what is the path?
That would be https://raw.githubusercontent.com/cichlidx/ronco_et_al/master/age_determination/introgression_detection/src/plot_tree_asymmetry.rb, so use wget https://raw.githubusercontent.com/cichlidx/ronco_et_al/master/age_determination/introgression_detection/src/plot_tree_asymmetry.rb

The script plot_tree_asymmetry.rb requires the following three command-line arguments: the name of the file produced by script analyze_tree_asymmetry.rb (which is short_alignments_filtered_topology_frequencies.txt), the name of a file that lists the order in which the species should be listed along the heatmap axes, and the name of the output file, which will be in scalable vector graphic (SVG) format. Thus, we still need an input that specifyies the order in which species should be plotted.

Write a new file named short_alignments_species_order.txt in which you list species (their short names, e.g. "orenil") in the order in which they appear in the species generated with ASTRAL, when this tree is opened in FigTree.
I need an example.
This file could for example have the following content:
```
orenil
neomar
neogra
neobri
neopul
neooli
```
Execute script plot_tree_asymmetry.rb with files short_alignments_filtered_topology_frequencies.txt and short_alignments_species_order.txt as input, and name the output file short_alignments_filtered_topology_frequencies.svg: ruby plot_tree_asymmetry.rb short_alignments_filtered_topology_frequencies.txt short_alignments_species_order.txt short_alignments_filtered_topology_frequencies.svg
Open the output file short_alignments_filtered_topology_frequencies.svg with a program capable of reading files in SVG format, either through Guacamole or on your own machine after copying the file there. Perhaps the easiest way to do this is with a browser such as Firefox.
Q How do you interpret this plot? Do we have any evidence for introgression between species, and if so, how strong is it?

I don't know!
In principle, this question is better answered with the table in file short_alignments_filtered_topology_frequencies.txt that we already had a look at above. But the plot gives a good overview, again showing that the strongest signals support introgression between "neopul" and "neobri", followed by a weaker signal for introgression between "neogra" and "neobri".

Model-based inferrence of species networks

If there is still time, you are welcome to continue with this optional part of the activity.

Estimating species networks with PhyloNet

We are going to run PhyloNet to test alternative models of diversification with and without hybridization, and we will compare these models on the basis of their estimated likelihoods. The tested models will represent species trees in which reticulation edges connect branches that exchanged genes through hybridization. With such reticulation edges, the "species trees" in fact are no longer trees, but "species networks". PhyloNet implements various methods, of which we will use InferNetwork_ML, a method for the inference of species networks from sets of input trees under maximum likelihood.

First, to apply PhyloNet to the set of trees, these must be converted to Nexus format. If you're unfamiliar with this format, you may look up the Wikipedia page for it. The Nexus file must contain the keyword "#NEXUS" on the first line, followed by two information blocks - one for the trees and one with run commands for PhyloNet. More information about this format can be found on the PhyloNet website for the InferNetwork_ML method.

Try to write a file in the Nexus format shown on the above website, using the set of trees inferred with IQ-TREE. Note that each tree must have an ID, and that these IDs must be unique. Name this new file phylonet.nex.

Help please.

One way to write this new file is to use a BASH script that includes the following commands:

# Set the name of the nexus file.
nex=phylonet.nex

# Write a nexus file with all trees.
echo -e "#NEXUS\n\nBEGIN TREES;\n" > ${nex}
count=1
while read line
do
	echo -n "Tree gt${count} = " >> ${nex}
	echo ${line} >> ${nex}
	count=$(( ${count} + 1 ))
done < short_alignments_filtered.trees
echo -e "\nEND;\n\n" >> ${nex}

# Add a phylonet block to the nexus file.
echo -e "BEGIN PHYLONET;\n" >> ${nex}
echo -e "InferNetwork_ML (all) 1 -bl -pl 2;" >> ${nex}
echo -e "\nEND;" >> ${nex}

Have a look at the content of file phylonet.nex, for example with less, to make sure that it matches the format on the PhyloNet website. On the line with the InferNetwork_ML command at the end of the file, you'll see that instead of the list of gene trees that should be provided, the first parameter that is passed to the command is "(all)". This is a special value that is understood by the command as a shorthand for all gene tree IDs. The second parameter that is passed to the command is the number "1", meaning that we allow the inference of up to 1 reticulation edge in this analysis. The -bl option is invoked, specifying that we want to use the branch lengths of the gene trees in the inference. And finally, -numProcessors 2 specifies that two processors should be used to speed up the analysis.
Next, run the PhyloNet analysis with file phylonet.nex as input: java -jar ~/software/PhyloNet/PhyloNet.jar phylonet.nex
This analysis should finish in less than 2 minutes.
You should see that the Phylonet screen output lists results after each of 5 runs. This is the default number of searches for the maximum-likelihood species network that PhyloNet performs. In each case, the log-likelihood is reported on the same line as the network, such as in this example:
```
Results after run #1
-103.92929554204092: ((orenil:0.0)I1#H1:0.1521092737805993::0.0014873743766175402,((neomar:0.0,neogra:0.0)I4:0.0,(neobri:0.0,(neooli:0.0,(neopul:0.0,I1#H1:0.0::0.9985126256233825)I6:0.0)I5:0.0)I3:0.0)I2:0.1521092737805993)I0;
```
The log-likelihood of the above species network, given the set of input trees, is therefore -103.92929554204092. Each run started from a different randomly selected point in the likelihood surface. Unless the search for the maximum likelihood got stuck on local optima of the likelihood surface, all five log-likelihoods and networks should be identical (if this is not the case, you might want to repeat the PhyloNet analysis with a higher number of runs, which you can specify with the -x option).
Following the results from the five runs, PhyloNet reports the network with the best log-likelihood (which again should be identical to at least one of the first five) on the line after "Inferred Network #1":
```
Inferred Network #1:
((orenil:0.008164571975)#H1:0.14608423975387927::0.0014873743766175402,(#H1:0.024493715925::0.9985126256233825,((neomar:0.0,neogra:0.0):0.0,(neobri:0.0,(neooli:0.0,neopul:0.0):0.0):0.0):0.0326582879):0.12159052382887928);
```
If PhyloNet in fact found support for a reticulation edge (we allowed maximally one), the network should contain two occurrences of the keyword "#H1", indicating the connection points of the reticulation edge. The specification is then not strictly in Newick format anymore, but instead in "extended Newick" format, which is described in Cardona et al. (2008). The format is extremely difficult to read, and FigTree is unable to read it. To visualize the network, other programs must be used. One tool capable of reading and visualizing extended Newick format is Dendroscope (Huson et al. 2007), but we'll here use an easier-to-use alternative, the browser-based tree viewer IcyTree (Vaughan 2017).

Visualizing species networks with IcyTree

To see how species networks in extended Newick format are visualized with IcyTree, open the IcyTree website, click "File" in the top left menu and then "Enter tree directly...". Paste the following string in extended Newick format into the window that opened: "(((A,#H1),B),(C)#H1);".
As you may guess, this text encodes a species network in which "A" and "B" are sister species, and a hybridization branch connects "A" with the outgroup "C", and this should be shown as such by IcyTree.
You could browse the style settings of IcyTree to learn a bit about how to optimize the visualization. For example, you could change the width of the branches by clicking "Edge width" > "Increase" in IcyTree's "Style" menu. You may need to click on it multiple times to see a difference. You could achieve the same by hitting the "+" key on your keyboard multiple times.
You could also to increase the font size for the tip labels, by clicking multiple times on "Font size" > "Increase" in the "Style" menu.
Copy the string in extended Newick format from the PhyloNet screen output (the line after "Inferred Network #1:") and visualize the species network in IcyTree in the same way as the above example network. Note that IcyTree may not be able to visualize networks with zero-length branches correctly.
Save an image of the species network by clicking "Export tree as..." > "PNG image" in IcyTree's "File" menu, and post that image on the Discord channel "#27jan_tree_based_introgression".
Repeat the PhyloNet analysis, this time ignoring branch length information. To do so, remove the -bl option from the PhyloNet command in file phylonet.nex and restart the analysis. Once this analysis is done, again visualize the inferred network with IcyTree. This analysis may take longer than the previous one (around 30 minutes) in which branch length information was used. Again, post any resulting networks on the Discord channel. Q Does the species network inferred without branch lengths differ from the previously inferred one?
If time is left, also repeat the PhyloNet analysis without allowing any reticulation (replace InferNetwork_ML (all) 1; with InferNetwork_ML (all) 0; in file phylonet.nex), and with allowing not just one but two reticulation edges (use InferNetwork_ML (all) 2;). Q How do these different affect the inferred species network? Is the one reticulation edge inferred in the first analysis (where only one edge was allowed) one of the two reticulation edges inferred in the last analysis? Is the topology of the network (actually not a network, but a tree) inferred without a reticulation edge identical to that of the species tree inferred with ASTRAL?
Compare the log-likelihoods for the networks with zero, one, and two reticulation edges if PhyloNet should have found support for reticulation edges. Assuming that the networks with more reticulation edges are identical to those with less reticulation edges except for the added reticulation edge, we can use a likelihood-ratio test to assess whether or not the additional reticulation edge significantly improved the likelihood. The table below may be helpful for this comparison.

Difference between log-likelihoods p-value

< 1.92073 > 0.05

> 1.92073 < 0.05

> 3.31745 < 0.01

> 5.41375 < 0.001

Difference between log-likelihoods	p-value
< 1.92073	> 0.05
> 1.92073	< 0.05
> 3.31745	< 0.01
> 5.41375	< 0.001

Tree-based Introgression Detection

Slides

Background

Learning goals

How to run this activity

Requirements

Notes

Show me the answer!

Table of contents

Outline

Dataset

Familiarizing yourself with the whole-genome alignment

Show me the answer!

Show me the answer!

Generating a set of gene trees

Extracting alignment blocks from a whole-genome alignment

Show me the code!

Show me the answers!

Show me the answer!

Alignment block filtering

What am I supposed to see?

Show me the answers!

OMG, how is this supposed to work?

I'm really too lazy for this.

Yes, good question.

Seriously?

Show me the answer!

Inferring gene trees from alignment blocks using IQ-TREE

Show me the code!

Show me the code!

I'm again too lazy for this.

How exactly?

Hmmm?

OK, but what do you think?

Estimating the species tree from gene trees with ASTRAL

Show me the command!

Detecting introgression based on tree-topology asymmetry

Calculating Dtree

Again with wget?

How do I read these results?

Show me the answer!

Plotting Dtree

I know, I need wget, but what is the path?

I need an example.

I don't know!

Model-based inferrence of species networks

Estimating species networks with PhyloNet

Help please.

Visualizing species networks with IcyTree

Calculating D_tree

Plotting D_tree