Tutorial: Topology Weighting

Activity by Simon Martin ([email protected]), 1 February 2018


Requirements

For Part 1:

  • Python 2.7
  • Numpy 1.10+
  • R 3.0+
  • ete 3 (Python module)
  • msms

For Part 2:


Introduction

Topology weighting is a means to quantify relationships between taxa that are not necessarily monophyletic. It provides a summary of a complex genealogy by considering simpler “taxon topologies” and quantifying the proportion of sub-trees that match each taxon topology. The method we use to compute the weightings is called Twisst: Topology weighting by iterative sampling of sub-trees.

In this practical we will use a simulation to explore how topology weightings provide a summary of the genealogical history. We will then try to infer topology weights across our simulated chromosome using neighbour joining trees inferred for narrow windows.

Workflow Overview

In the first part of the practical we will simulate a fairly complex genealogical history including recombination gene flow and selection. We will then compute topology weightings for all genealogies across this genomic region. We will plot the weightings in R to see what these tell us about the relationships among the simulated populations. In the second part of the practical, we will create a simulated sequence alignment and use it as if it was genomic data we had generated. We will infer genealogies for sequence windows across the region
Finally, we will compute weightings for the inferred trees and compare these with the true weightings.

Preparation

In the command line, navigate to the folder: ~/workshop_materials/13_topology_weighting. Then, download the scripts required for this tutorial from github

git clone https://github.com/simonhmartin/genomics_general
git clone https://github.com/simonhmartin/twisst

Part 1. Analysis of simulated genealogies

Coalescent simulation

Our simulation includes five populations each represented by 8 haploid samples. The populations split in the order (((A,B),(C,D)),E). In other words, A and B are sister taxa, C and D are sister taxa, and E is an outgroup to the clade of A, B, C and D. We will add gene flow between C and B. Finally, we will add a beneficial allele to poulation C, which should sweep through population B (i.e. adaptive introgression). The simulated region will be 500 Kb in length, with a population recombination rate of 0.01. The selected locus will be at the centre of the region.

We use msms to perform a coalescent simulation with selection.

The command below will simulate the scenario described above. Note that to modify the evolutionary history, you can change the population joins (-ej flags, times are given in coalescent units of 4N generations), migration (-m flag), selection strength (-SAA and -SAa) and selection initiation parameters (-SI).

java -Xmx1500M -jar /home/wpsg/software/msms/lib/msms.jar 40 1 -T -I 5 8 8 8 8 8 -ej 0.5 2 1 -ej 0.5 4 3 -ej 1 3 1 -ej 1.5 5 1 -m 2 3 0.1 -r 5000 500000  -SAA 1000 -SAa 1000 -Sp 0.5 -SI 0.1 5 0 0 1 0 0 -N 100000 | grep ";" > sim/msms_5of8_l500k_r5k_sweep.txt

The final part of the command extracts only the tree objects from the output, as these are all we need.

We can view the simulated trees using less -S sim/msms_5of8_l500k_r5k_sweep.txt (type q at any time to exit). Each line represents the genealogy for a non-recombining ‘block’ of sequence, with the genealogies separated by recombination events. Most blocks are only a few base pairs long (lengths given at the start of each line). You might have noticed that adjacent genealogies are usually very similar, and sometimes identical. This is because they are usually separated by only a single recombination event, making them highly non-independent.
We can check the number of blocks by counting the lines in the file: wc -l sim/msms_5of8_l500k_r5k_sweep.txt.

Next we extract the start and end positions of each genealogy, which will be used for plotting later. We also make a file of trees with the block sizes removed.

python genomics_general/phylo/parse_ms_trees.py --msOutput sim/msms_5of8_l500k_r5k_sweep.txt --outTrees sim/msms_5of8_l500k_r5k_sweep.trees.gz --outData sim/msms_5of8_l500k_r5k_sweep.data.tsv.gz

The outputs of this command are both zipped with gzip to save space. You can view the files using zcat sim/msms_5of8_l500k_r5k_sweep.data.tsv.gz | less -S (type q to exit).


Twisst

We run Twisst to compute the weightings for each topology. The core script is twisst.py, but we use the script run_twisst_parallel.py to run the script with multi-threading.

The only information Twisst requires is which group each sample belongs to. This may be determined by species, phenotype or geography. In our case we know the groupings because we simulated the data. Samples 1:8 will be group ‘A’, 8:16 group ‘B’ etc. Instead of specifying the groups on the command line, we use a pre-made groups file (Note that if you change the number of samples per group, you will need a different groups file.

python twisst/run_twisst_parallel.py --threads 2 -t sim/msms_5of8_l500k_r5k_sweep.trees.gz --method complete -g A -g B -g C -g D -g E --groupsFile sim/sim.pop.txt --outputTopos sim/ABCDE.topos | gzip > sim/msms_5of8_l500k_r5k_sweep.weights.tsv.gz

The command above tells twisst to use the complete method. This will compute the exact weightings by considering all possible sub-trees. The main output sim/msms_5of8_l500k_r5k_sweep.weights.tsv.gz has a line for each input tree with fifteen columns, one for each of the possible topologies describing the relationship between the five groups. Each value gives the number of subtrees that match the given topology. Because complete weighting was used, the values in each line should sum to the total number of possible subtrees (8 5 = 32768).

The other output file is sim/ABCDE.topos, which contains the 15 possible subtree topologies for a single sample from A, B, C, D and E.

Plotting

(NOTE: All commands in this section will be performed in R.)

Open Rstudio in the browser (http://YOUR_IP.compute-1.amazonaws.com:8787/) and start a new R script.

First we will import the APE library and use it to plot the 15 topologies.

#Change the working directory to '13_topology_weighting'
setwd("~/workshop_materials/13_topology_weighting")

#Heres a set of 15 colourful colours for the 15 topologies these are from https://en.wikipedia.org/wiki/Help:Distinguishable_colors
cols = c(
"#0075DC", #Blue
"#2BCE48", #Green
"#FFA405", #Orpiment
"#5EF1F2", #Sky
"#FF5005", #Zinnia
"#005C31", #Forest
"#00998F", #Turquoise
"#FF0010", #Red
"#9DCC00", #Lime
"#003380", #Navy
"#F0A3FF", #Amethyst
"#740AFF", #Violet
"#426600", #Quagmire
"#C20088", #Mallow
"#94FFB5") #Jade

#Import ape
library(ape)

#read topologies file
topos = read.tree(file="sim/ABCDE.topos")
par(mfrow = c(3,5), mar = c(1,1,2,1), xpd=NA)

for (n in 1:length(topos)){
    plot.phylo(topos[[n]], type = "clad", edge.color=cols[n], edge.width=5, label.offset=.4, cex=1)
    mtext(side=3,text=paste0("topo",n))
  }

Now we want to plot the weightings for each or these topologies. First we define the files containing the weights for each genealogy, and the start and end positions for each block along the chromosome.

#weights file with a column for each topology
weights_file = 'sim/msms_5of8_l500k_r5k_sweep.weights.tsv.gz'

#coordinates file for each window
window_data_file = 'sim/msms_5of8_l500k_r5k_sweep.data.tsv.gz'

Read in the weights file and then normalise the weights so that each row sums to one. We do this by dividing by the sum total of each row.

weights = read.table(weights_file, header = T)

#normalise rows so weights sum to 1
weights = weights / apply(weights, 1, sum)

Read in block (i.e. window) coordinates, and add columns giving the window midpoint and number of sites per window, which will be useful for plotting.

window_data = read.table(window_data_file, header = T)
window_data$mid = (window_data$start + window_data$end) / 2
window_data$sites <- window_data$end - window_data$start + 1

Some functions to help with plotting are provided in a separate script plot_twisst.R. We first load that script.

source("code/plot_twisst.R")

We start by plotting the average weightings for each of the 15 topologies.

barplot(apply(weights, 2, mean), col = cols)

The predominant colour is that corresponding to Topology 3, which matches the populations branching pattern in the simulation. The fact that other topologies are also represented indicates that lineage sorting does not always follow the population branching pattern. In particular, we expect to see some representation of Topology 12 in which B groups with C, and is nested within the C, D clade. This topology is expected to be more common than under null expectations due to the gene flow from population C to B. Note that Topology 13, which also groups B and C, is not as common. Topology weighting is therefore informative about the direction of gene flow. Gene flow that occurs strictly from C to B will only tend to increase the abundance of Topology 12. Can you see why?

Now we will plot the raw weightings for a short section of the simulated region: the first 10 Kb

par(mfrow =c(1,1), mar = c(3,3,1,1), xpd = FALSE)
plot.weights(weights_dataframe=weights, positions=cbind(window_data$start,window_data$end),
  line_cols=NA, lwd= 0, fill_cols=cols, xlim =c(0,10000),stacked=TRUE)

You will see coloumns of colour of varying width. Each column corresponds to a single block with a single genealogy. Some blocks are all one colour. That indicates that all subtrees in that block have the same topology, indicating a consistent and completely sorted genealogy. Other columns have two or more colours stacked upon one another, indicating that the genealogy has a more complex evolutionary history, with more than one topology represented among the subtrees.

If you repeat the above plot using an expanded scale (e.g. xlim = c(0,100000)), you will notice that you begin to lose resolution, with the narrower blocks becoming invisible (this depends on the resolution of your monitor, and you can often get much better resolution if you export to an image file).

One way around the resolution problem is to use a smoothing function. A function is provided in plot_twisst that will smooth all weightings using locally weighted smoothing.

weights_smooth = smooth.weights(window_positions=window_data$mid, weights_dataframe = weights,
  span = 0.05, window_sites=window_data$sites)

Now we can plot the smoothed weights for the wntire 500 Kb region without any resolution problem.

plot.weights(weights_dataframe=weights_smooth, positions=window_data$mid,
  line_cols=cols, fill_cols=cols, xlim =c(0,500000),stacked=TRUE)

This reveals the clear shift in topology weightings that occurs in the vicinity of the selected locus.

Try on your own

How do the weightings change if we make changes to the simulation, such as decreasing the split times, or increasing the rate of migration? (You might need to look at the msms manual to figure out what to change.)


Part 2. Infering weightings from sequence data

Above we have used the ‘true’ genealogies as they were simulated. In most situations, all we have is sequence data, and its evolutionary history has to be inferred. To demonstrate how this is done for topology weighting, we will use the same simulated genealogies as above to simulate sequence data, and then see whether we can recover the same result using topology weighting based on inferred genealogies.

Simulating sequences

To get sequence data, we will use seq-gen. Seq-gen can take the complete set of genealogies and simulate a single ‘recombinant’ sequence alignment for our 40 samples. It requires that we provide the number of ‘partitions’, which corresponds to the number of distinct genealogies for the region. It also requires a model of molecular evolution (we will use Hasegawa, Kishino and Yano 1985) and a scaling factor to convert the branch lengths from coalescent units (4N generations) to mutational distances. We will use 1%.

(NOTE: We’re back on the command line now.)

partitions=$(wc -l sim/msms_5of8_l500k_r5k_sweep.txt)

seq-gen -mHKY -l 500000 -s 0.01 -p $partitions < sim/msms_5of8_l500k_r5k_sweep.txt | gzip > sim/msms_5of8_l500k_r5k_sweep.seqgen.phy.gz

The output is a sequence alignment in Phylip sequential format. This carries no information about where the recombination breakpoints are, except for the information caried in the sequences themselves. We will use a simple approach of dividing the sequence into narrow windows and inferring the genealogy for each window.

We will use some Python scripts to do this. First we need to convert the sequences into a more usable format, in which each row is a site and individual genotypes are given in columns.

(NOTE: Conveniently, the simulation provides haploid sequences. Typical genomic data from diploid samples should ideally be phased so that individual haplotypes can be distinguished for tree inference.)

python genomics_general/seqToGeno.py --seqFile sim/msms_5of8_l500k_r5k_sweep.seqgen.phy.gz \
 --format phylip --genoFile sim/msms_5of8_l500k_r5k_sweep.seqgen.geno.gz

This file contains both invariant and variant sites (SNPs). We only need the latter to make trees, so we will filter the file using another python script.

python genomics_general/filterGenotypes.py -i sim/msms_5of8_l500k_r5k_sweep.seqgen.geno.gz --minAlleles 2 \
 -o sim/msms_5of8_l500k_r5k_sweep.seqgen.SNP.geno.gz  --threads 2

Infering trees for windows and computing weights

We now have a file of SNPs distributed across our 500 Kb genomic region. We want to infer trees in windows, and we want to select a window size that is narrow enough to capture the variation in genealogical history across the chromosome, but contains sufficient SNPs to provide the necessary power for tree inference. We will use 50 SNP windows in this example.

The next script reads the SNP file in windows and then infers a tree for each window using Phyml. Phyml is capable of maximum likelihood inference, but here we will not use optimisation, so the trees output will be Neighbour-Joining trees, inferred using the BIONJ algorithm.

(NOTE: a copy of the script genomics.py needs to be in the same directory for this script to run.)

cp genomics_general/genomics.py genomics_general/phylo/
python genomics_general/phylo/phyml_sliding_windows.py -g sim/msms_5of8_l500k_r5k_sweep.seqgen.SNP.geno.gz \
 --windType sites -w 50 --prefix sim/msms_5of8_l500k_r5k_sweep.seqgen.SNP.w50.phyml_bionj --model HKY85 --optimise n --threads 2

This generates two output files: sim/msms_5of8_l500k_r5k_sweep.seqgen.SNP.w50.phyml_bionj.trees.gz contains the trees for each window. sim/msms_5of8_l500k_r5k_sweep.seqgen.SNP.w50.phyml_bionj.data.tsv contains the coordinates for each window.

Finally, we can compute the weightings for each of these inferred window trees.

python twisst/run_twisst_parallel.py -t sim/msms_5of8_l500k_r5k_sweep.seqgen.SNP.w50.phyml_bionj.trees.gz \
 -g A -g B -g C -g D -g E --groupsFile sim/sim.pop.txt --method complete --threads 2 |
gzip > sim/msms_5of8_l500k_r5k_sweep.seqgen.SNP.w50.phyml_bionj.weights.tsv.gz

Plotting inferred weights

As we did above, we can now plot the weights for these inferred trees across the chromosome. This can be done in the same R script as before.

(NOTE: We’re in R again.)

As before we read in the weights and window data files, and normalise the weights. Here the objects have ‘NJ50’ in their name, indicating that they are the values from 50 SNP neighbour-joining trees.

weights_NJ50_file = "sim/msms_5of8_l500k_r5k_sweep.seqgen.SNP.w50.phyml_bionj.weights.tsv.gz"
window_data_NJ50_file = "sim/msms_5of8_l500k_r5k_sweep.seqgen.SNP.w50.phyml_bionj.data.tsv"
weights_NJ50 = read.table(weights_NJ50_file, header = T)
#normalise rows so weights sum to 1
weights_NJ50 = weights_NJ50 / apply(weights_NJ50, 1, sum)
window_data_NJ50 = read.table(window_data_NJ50_file, header = T)

Because we’re now working with real data, there may be some lines of missing data that need to be excluded. For example, tree building failed due to a lack of informative sites.

#define good rows as those withot any NA values
good_rows = which(is.na(apply(weights_NJ50,1,sum)) == F)
#retain only good rows in the weights and window data objects
weights_NJ50 <- weights_NJ50[good_rows,]
window_data_NJ50 = window_data_NJ50[good_rows,]

We then smooth the weightings as before, and then plot our smoothed inferred weightings, and below them the “true” weightings from the original simulation.

weights_NJ50_smooth = smooth.weights(window_positions=window_data_NJ50$mid, weights_dataframe = weights_NJ50,
span = 0.05, window_sites=window_data_NJ50$sites)
par(mfrow = c(2,1), mar = c(3,3,1,1))
plot.weights(weights_dataframe=weights_NJ50_smooth, positions=window_data_NJ50$mid,
  line_cols=NA, fill_cols=cols, xlim =c(0,500000),stacked=TRUE)
plot.weights(weights_dataframe=weights_smooth, positions=window_data$mid,
  line_cols=NA, fill_cols=cols, xlim =c(0,500000),stacked=TRUE)

In your own time:

What effect does changing the window size have on inference? Can you explain this behaviour in terms of the tradeoff between power and resolution?
Can you identify introgression between any other populations?