Tutorial: ABBA-BABA statistics
Simon Martin ([email protected])
ABBA BABA statistics (also called D statistics) provide a simple and powerful test for a deviation from a strict bifurcating evolutionary history. They are therefore frequently used to test for introgression using genome-scale SNP data.
In this practical we use genomic data from several populations of Heliconius butterflies to explore the utility of ABBA BABA statistics.
The first part of the practical will demonstrate typical computation of ABBA BABA statistics using genome-wide SNP data. Starting with genotype data from multiple individuals, we first infer allele frequencies at each SNP. We then compute the D statistic and use a block jackknife method to test for a significant deviation from the null expectation of D =0. Finally we estimate f the ‘admixture proportion’.
We will study multiple races from three species:
Heliconius melpomene , Heliconius timareta and Heliconius cydno. These species have partially overlapping ranges and they are thought to hybridise where they occur in sympatry. Our sample set includes two pairs of sympatric races of H. melpomene and H. cydno from Panama and the western slopes of the Andes in Colombia. There are also two pairs of sympatric races of H. melpomene and H. timareta from the eastern slopes of the Andes in Colombia and Peru. We hypothesize that hybridisation between species in sympatry will lead to sharing of genetic variation between H. cydno and the sympatric races of H. melpomene from the west, and between H. timareta and the corresponding sympatric races of H. melpomene from the east. This should include sharing of wing patterning alleles between H. melpomene and H. timareta , which are co-mimetics. There is also another race of H. melpomene from French Guiana that is allopatric from both H. timareta and H. cydno , which should have not experienced recent genetic exchange with either species and therefore serves as a control. Finally, there are two samples from an outgroup species Heliconius numata silvana .
In the command line, navigate to the folder:
Download python scripts required for this tutorial from github:
git clone https://github.com/simonhmartin/genomics_general
Part 1: Genome wide test for introgression
In its simplest formulation the ABBA BABA test relies on counts of sites in the genome that match the ABBA and BABA genotype patterns. That is, given three ingroup populations and an outgroup with the relationship (((P1,P2),P3),O), and given a single (haploid) genome sequence representing each population (i.e., H1, H2 and H3), ABBA sites are those at which H2 and H3 share a derived allele (‘B’), while H1 has the ancestral state (‘A’), as defined by the outgroup sample. Likewise, BABA represents sites at which H1 and H3 share the derived state.
Ignoring recurrant mutation, the two SNP patterns can only come about if some parts of the genome have genealogies that do not follow the ‘species tree’, but instead group H2 with H3 or H1 with H3. If the populations split fairly recently, such ‘discordant’ genealogies are expected to occur in some parts of the genome due to variation in lineage sorting. In the absence of any deviation from a strict bifurcating topology, we expect roughly equal proportions of the genome to show the two discordant genealogies (((H2,H3),H1),O) and (((H1,H3),H2),O). By counting ABBA and BABA SNPs across the genome (or a large proportion of it), we are therefore approximating the proportion of the genome represented by the two discordant genealogies, and we can test for any deviation from the 1:1 expectation. A deviation could come about as a result of gene flow between populations 3 and 2 for example, although it could also indicate other phenomena that break our assumptions, such as ancestral population structure, or variable substitution rates.
To quantify the deviation from the expected ratio, we calculate D , which is the difference in the sum of ABBA and BABA patterns across the genome, divided by their sum:
D = [sum(ABBA) – sum(BABA)] / [sum(ABBA) + sum(BABA)]
If we have multiple samples from each population, then counting ABBA and BABA sites is less straghtforward. One option is to consider only sites at which all samples from the same population share the same allele, but that will discard a large amount of useful data. A preferable option is to use the allele frequencies at each site to quantify the extent to which the genealogy is skewed toward the ABBA or BABA pattern. This is effectively equivalent to counting ABBA and BABA SNPs using all possible sets of four haploid samples at each site. ABBA and BABA are therefore no longer binary states, but rather numbers between 0 and 1 that represent the frequency of allele combinations matching each genealogy. They are computed based on the frequency of the derived allele ( p ) and ancestral allele (1- p ) in each population as follows:
ABBA = (1- p1 ) x p2 x p3 x (1- pO )
BABA = p1 x (1- p2 ) x p3 x (1- pO )
Genome wide allele frequencies
To compute these values from population genomic data, we need allele frequencies. We will compute these from the Heliconius genotype data provided using a python script. The input file has already been filtered to contain only bi-allelic sites. The frequencies script requires that we define populations. These are defined in the file
python genomics_general/freq.py -g data/bar92.DP8MP4BIMAC2HET75dist1K.geno.gz -p mpg -p ros -p vul -p mal -p ama -p chi -p zel -p flo -p txn -p slv --popsFile data/bar92.pop.txt --target derived -o data/bar92.DP8MP4BIMAC2HET75dist1K.derFreq.tsv.gz
--target derived we obtain the frequency of the derived allele in each population at each site. This is based on using the final population specified ( H. numata silvana , or ‘slv ’) as the outgroup. Sites at which this population is not fixed for the ancestral state are discarded.
Genome wide ABBA BABA analysis
(NOTE: here we’re working in Rstudio)
Start a new R script (File > New File > R Script).
First set your working directory to
~/workshop_materials/12_abba_baba using the command
Now we define functions for computing the ABBA and BABA proportions at each site. These will take as input the frequency of the derived allele in populations P1, P2 and P3 (i.e. p1, p2 and p3 ). (The frequency of the ancestral allele in the outgroup will be 1 at all sites, by definition, so this can be ignored).
abba = function(p1, p2, p3) (1 - p1) * p2 * p3 baba = function(p1, p2, p3) p1 * (1 - p2) * p3
These functions will return a single value if
p3 are single values, but they will return a vector of values for each site if
p3 are each vectors.
We then define a function for D. We give it a single input dataframe, which must contain columns named ABBA and BABA. This will be useful for the block jackknifing procedure below.
D.stat = function(dataframe) (sum(dataframe$ABBA) - sum(dataframe$BABA)) / (sum(dataframe$ABBA) + sum(dataframe$BABA))
Read in our frequencies.
freq_table = read.table("data/bar92.DP8MP4BIMAC2HET75dist1K.derFreq.tsv.gz", header=T, as.is=T) nrow(freq_table)
Now, to compute D, we need to define populations P1, P2 and P3. We will start with an obvious and previously published test case: we will ask whether there is evidence of introgression between H. melpomene rosina ( ros ) and H. cydno chioneus ( chi ). These will be P2 and P3 respectively. P1 will be our allopatric population H. melpomene melpomene from French Guiana ( mpg ).
We set these populations and then compute D by extracting the derived allele frequencies for all SNPs for the three populations.
P1 = "mpg" P2 = "ros" P3 = "chi" ABBA = abba(freq_table[,P1], freq_table[,P2], freq_table[,P3]) BABA = baba(freq_table[,P1], freq_table[,P2], freq_table[,P3]) ABBA_BABA_df = as.data.frame(cbind(ABBA,BABA)) D = D.stat(ABBA_BABA_df)
Indeed there is a nearly two-fold excess of ABBA over BABA , giving a strongly positive D statistic, indicative of excess shared ancestry between P2 and P3. However, we currently don’t know whether this result is statistically robust. In particular, we don’t know whether the excess of ABBA is evenly distributed across the genome. If it results from odd ancestry at just one locus, we would have less confidence that there has been significant introgression.
To test for a consistent genome-wide signal we use a block-jackknife procedure. In this case, bootstrap resampling would not be appropriate, because of non-independence among sites.
The Jackknife procedure allows us to compute the variance of D despite non-independence among sites. We achieve this by computing the standard deviation for so-called ‘pseudovalues’ of the mean genome-wide D , where each pseudovalue is computed by excluding a defined block of the genome taking the difference between the mean genome-wide D and D computed when the block is computed. The block size needs to exceed the distance at which autocorrelation occurs. In our case, we will use a block size of 1 Mb. In fact we know that linkage disequilibrium decays to background levels at a distance of well below 1 Mb, so this is a conservative block size.
To define the start and end positions of each block, we need to know the lengths of each chromosome. These are provided in the file
"data/Hmel2_chrom_lengths.txt" . We load this file and then make a vector of chromosome lengths with the chromosome names (“chr1”, chr2” etc. as the names of this vector).
chrom_table = read.table("data/Hmel2_chrom_lengths.txt") chrom_lengths = chrom_table[,2] names(chrom_lengths) = chrom_table[,1]
Some code to help with the block jackknifing is provided in a separate script, which we can import now.
We then use a predefined function to define the coordinates of each 1Mb block on each chromosome.
blocks = get_genome_blocks(block_size=1e6, chrom_lengths=chrom_lengths) n_blocks = nrow(blocks)
This gives a table with a row for each block, giving its start and end position and the chromosome it is on.
For each jackknife iteration, we need to exclude all the sites within the given block. We make a list of sites that we will retain in each iteration: those sites that do not fall within each block.
indices = get_genome_jackknife_indices(chromosome=freq_table$scaffold, position=freq_table$position, block_info=blocks)
Now we can run the block jackknifing procedure. We provide the the D statistic function we created earlier, which it will use each iteration. We also provide the input ABBA BABA dataframe, and the block indices.
D_sd = get_jackknife_sd(FUN=D.stat, input_dataframe=as.data.frame(cbind(ABBA,BABA)), jackknife_indices=indices)
This provides an unbiased estimate of the standard deviation of D . From this we can estimate the standard error, Z score and p-value for the test of whether D deviates significantly from zero.
D_err <- D_sd/sqrt(n_blocks) D_Z <- D / D_err D_p <- 2*pnorm(-abs(D_Z))
In this case the deviation is hugely significant.
In your own time
What do we find if we change the identity of P1, P2 and P3? We have data for nine populations, so there are many questions that we can ask. For example, we can test for recent admixture by using vulcanus (vul) as P1. It is more closely related to rosina, and the test is only sensitive to introgression that occurred subsequent to the P1-P2 split.
We can also try entirely different combinations of taxa. Is there similalrly extensive introgression between H. timareta (flo and txn) and the eastern melpomene populations like malleti (mal) and amaryllis (ama)?
Estimating the admixture proportion
The D statistic provides a powerful test for introgression, but it does not quantify the proportion of the genome that has been shared. A related method has been developed to estimate f , admixture proportion.
The idea behind this approach is that we compare the observed excess of ABBA over BABA sites, to that which would be expected under complete admixture. To approximate the expectation under complete admixture we re-count ABBA and BABA but substituting a second population of the P3 species in the place of P2. If you lack a second population, you can simply split your P3 samples into two. In this case, we have two populations to represent each species, so if we’re using H. cydno chioneus ( chi ) as P3a, we can use H. cydno zelinde ( zel ) as P3b).
P1 = "mpg" P2 = "ros" P3a = "chi" P3b = "zel"
We then compute ABBA and BABA using P1, P2 and P3a, as above, as well as using P1, P3b and P3a.
ABBA_1_2_3a = abba(freq_table[,P1], freq_table[,P2], freq_table[,P3a]) BABA_1_2_3a = baba(freq_table[,P1], freq_table[,P2], freq_table[,P3a]) ABBA_1_3b_3a = abba(freq_table[,P1], freq_table[,P3b], freq_table[,P3a]) BABA_1_3b_3a = baba(freq_table[,P1], freq_table[,P3b], freq_table[,P3a])
Finally, we can estimate f as the observed difference between ABBA and BABA to that we see when we substitute P3b for P2.
f = (sum(ABBA_1_2_3a) - sum(BABA_1_2_3a))/ (sum(ABBA_1_3b_3a) - sum(BABA_1_3b_3a))
This reveals that nearly 30% of the genome has been shared between H. melpomene and H. cydno in sympatry. The admixture proportion can be interpreted as the average proportion of foreign ancestry in any given haploid genome. It can also be interpreted as the expected frequency of foreign alleles at any given site in the genome in a population. However, the second interpretation is more problematic, because we expect certain parts of the genome to be more or less prone to admixture. We will consider this variation in part 2.
Part 2. ABBA BABA statistics in sliding windows
As described above, these statistics are designed to be used at the whole-genome scale. It may seem meaningless or even dangerous to try to apply these at a narrow genomic scale. However, they have in fact proved useful for visually exploring the signatures of introgression, and even quantifying admixture on a small scale.
The D statistic is not well suited for comparing admixture levels across the genome, because it’s absolute value depends on factors such as the effective population size, which can vary across the genome. The f estimator above is better, because it by definition reflects the admixture proportion, but it is highly sensitive to stochastic error at the small scale. A modified f estimator was therefore developed for this purpose that is more robust to the error introduced by using small numbers of SNPs. While the conventional f estimator assumes that P3 is the donor population and P2 the recipient, f d infers the donor on a site-by-site basis.
Here we are interested in introgression of wing patterning alleles around the gene optix on Heliconius chromosome 18. We use H. melpomene amaryllis from Peru as P3, and it’s sympatric co-mimic H. timareta thelxinoe as P2. We use H. timareta florencia , which has a different wing pattern, as P1.
Note that here H. timareta florencia is not allopatric with respect to H. melpomene , so we do not expect this analysis to reveal the total extent of admixture between the species. Instead, it should specifically highlight parts of the genome at which P2 and P3 share variation that is not shared by P1. This is therefore ideal for identifying wing patterning alleles, as this should be one of only a few genomic regions at which the two H. timareta races are strongly distinct.
Sliding window analysis
We wil use a python script to analyse sequence data for a single chromosome and compute the ABBA BABA statistics for overlapping windows. We use a window size of 50 kb, sliding in increments of 20 kb. The input file contains SNP data for 92 samples from 10 populations for chromosome 18. We also include a lower cutoff for the minimum number of SNPs required per 50 Kb window (1000).
(NOTE this is on the command line)
python genomics_general/ABBABABAwindows.py -g data/bar92.DP8HET75M40MP1SNP.chr18.geno.gz -f phased -o data/bar92.DP8HET75M40MP1SNP.chr18.ABBABABA_flo_txn_ama_slv.w50m1s10.csv.gz -P1 flo -P2 txn -P3 ama -O slv --popsFile data/bar92.pop.txt -w 50000 -m 1000 -s 20000 --minData 0.5 --T 2
Plotting window statistics
(NOTE this is back in Rstudio)
Load the window stats file.
stats_table = read.csv("data/bar92.DP8HET75M40MP1SNP.chr18.ABBABABA_flo_txn_ama_slv.w50m1s10.csv.gz", as.is=T)
f d is meaningless when D is negative, as it is designed to quantify the excess of ABBA over BABA only when an excess exists. We therefore convert all f d values to 0 at sites where D is negative.
stats_table$fd = ifelse(stats_table$fd < 0, 0, stats_table$fd)
However, another modified statistic, f dM , has been developed to simultaneously quantify admixture bewteen P3 and P2 and between P3 and P1. The script also computes f dM .
We can begin by plotting the raw ABBA and BABA values across the chromosome. We add shading around optix .
par(mfrow = c(3,1), mar = c(4,4,1,1)) plot(0,cex=0,xlim=c(0,17e6),ylim=c(0,300),ylab="Count",xlab="Position") #add shading around optix rect(1000000,0,1250000,300, col = "gray80", border=NA) lines(stats_table$mid, stats_table$ABBA, type = "l", col = "red") lines(stats_table$mid, stats_table$BABA, type = "l", col = "blue") legend("top", legend = c("ABBA","BABA"),col = c("red","blue"), lty=1, bty = "n", ncol=2)
This indicates a huge excess of ABBA in the vicinity of optix .
We can then add plots of f d and f dM .
plot(0,cex=0,xlim=c(0,17e6),ylim=c(0,1),ylab="fd",xlab="Position") rect(1000000,0,1250000,1, col = "gray80", border=NA) lines(stats_table$mid, stats_table$fd, type = "l", col = "black") plot(0,cex=0,xlim=c(0,17e6),ylim=c(-.5,.5),ylab="fdM",xlab="Position") rect(1000000,-.5,1250000,.5, col = "gray80", border=NA) lines(stats_table$mid, stats_table$fdM, type = "l", col = "black")
This reveals that there are in fact multiple other parts of the chromosome that show evidence of considerable admixture. This shows the danger of only considering the ABBA and BABA difference. This will depend on various factors, and a fairly small difference can still be consistent with considerable introgression.
In your own time
What happens when we use different sample sets, such as an allopatric population for P1?