Introduction

Population Genetics is centred around understanding genetic variation.

  • Where does it come from?
  • Why does it persist?
  • How is it distributed aross the genome or across the species’ range?

Towards this end, population geneticists are interested in the fundamental processes of mutation, migration, recombination, selection and drift.

In this lab session, we’ll be touching on most of these and covering a range of common foundation analyses to understand the evolutionary history of a SNP dataset. The goals are for you to get a feel of a basic population genetics workflow through a whistle-stop tour of methods. We could spend an entire day on each step of the analysis, but given we only have 3 hours we’ll be learning a little about a few things as opposed to a deep-dive on one thing.

It’s expected that there are a range of abilities in the cohort, and consequently some may progress quicker than others. At the end of each section there are additional exercises that are optional for anyone who is working quickly.

It’s also important to learn how to troubleshoot errors and problems on your own! Remember that for command-line programs we can usually type --help, or in R type ?my_function, to see documentation.


The Dataset

We’ll be working with a simulated dataset that includes a number of conditions and events that we will uncover over the course of the session. The workshop is designed around the idea of building up an understanding of a SNP dataset that we may have limited a-priori knowledge of, which can be common for example when working with sequencing data from natural populations. Starting with exploratory summary analyses, we will generate and test hypotheses to explore evolutionary questions.

Let’s pretend our data comes from a study into the a hypothetical, yet MIRACULOUS, conservation success story of Scottish Wild Cats spreading south into England from their Highlands refugia. In this study, samples were collected from a number of individuals from the ancestral Highlands population (p0) and at four further locations moving south through the Scottish borders and into England, at Kielder Forest (p1), Leeds (p2), Norfolk (p3), and London (p4).

Scottish Wild Cats are back!

The only information we will start with is a simple tree summary of the 5 populations, which we will predict from the known North to South colonisation history:

plot(ape::read.tree(text = '(Highlands_p0,(Kielder_p1,(Leeds_p2,(Norfolk_p3,London_p4))));'))


Prepare the working directory

Create a new directory at the following path and move into it: ~/workshop_materials/population_genomics/workdir

# Change to our workdir
mkdir ~/workshop_materials/population_genomics/workdir
cd ~/workshop_materials/population_genomics/workdir

Also, define the following variables in your command line. Variables help keep scripts tidy, and should we want to change the inputs, it is much simpler to change one line of code that defines the variable than every line of code where that variable is used!

VCF=popgen_demo_krumlov23_maf05.vcf.gz

DATASET=popgen_demo_krumlov23

We will also be using R for some analyses and to visualise results, so open up R and load the following packages:

  • vcfR
  • ggplot2
  • data.table
  • adegenet
  • GenotypePlot

Set our working directory (using setwd()) to our working directory as above.

# Load our libraries
library(vcfR)
library(ggplot2)
library(data.table)
library(adegenet)
library(GenotypePlot)

# Set our working directory
setwd("~/workshop_materials/population_genomics/workdir")

Overview of VCF format

Get a general sense of the dataset we’ll be working with:

  • How many individuals are there?
  • How many chromosomes and what are their sizes?
  • How many SNPs?

We’ll start by filtering our VCF based on a minor allele frequency filter of 5%. (You can do this with vcftools or bcftools)

bcftools view --min-af 0.05:minor ../popgen_demo_krumlov23.vcf.gz -Oz -o popgen_demo_krumlov23_maf05.vcf.gz

Step 1 - Population Structure

Visualising population structure can be a good place to start when interpreting a fresh SNP dataset. Population structure provides us with a sense of how individuals are related, whether those relationships form clusters that we might consider populations, and whether there are any unexpected outliers or obvious trends in our data that might be suggestive of sequencing or sampling issues. For instance, if our data has been sequenced in batches or with an unbalanced design, sequencing artifacts may cause individuals to cluster according to when or how they were sequenced (batch effects). Males and females may also be clear if SNPs include those coming from sex-linked regions.


1a) Principal Component Analysis - PCA

PCA is a powerful tool for quickly inferring and visualising population structure. It is an ordination technique that seeks to produce a set of uncorrelated axes (principal components, e.g. PC1, PC2 etc) that maximise genetic variation among all included individuals. In short, this is done by ‘rotating’ the data in multivariate space to find the single axis that explains the most variation (PC1), and repeatedly introducing orthogonal axes that contain increasingly smaller proportions of total variation until all variation is explained.

Filter SNPs for linkage

To perform PCA, the first step is to filter our SNP dataset for linkage. Linkage disequilibrium or ‘LD’ describes the non-random associations or correlation between SNPs. LD is particularly strong among SNPs that are physically close together because they are likely to be inherited in common haplotype blocks. LD can also be caused by other things, such as selection, and can occur among SNPs that are not physically close together. When considering population structure, we want to remove as much LD as possible because linked variants carry redundant information about inheritance. We therefore want each SNP to represent an independent variable in our analysis (it’s also computationally more efficient to work with fewer SNPs if we can!).

There are many approaches available to filter our data for linkage. We could do this in R using a package like SNPRelate (and you’re welcome to give this a go). One of the most commonly used approaches is Plink’s --indep-pairwise, which removes SNPs that are in linkage based on a correlation cut-off (usually 0.2) within windows of a user-defined size. For our window size, we’ll do non-overlapping windows of 50 SNPs, but feel free to explore how changing this parameter influences the number of SNPs that are left following pruning. This software runs on the command-line.

Plink will provide us a file of unlinked SNPs in a file suffixed *.prune.in, which we can use to pull SNPs out of our VCF for other analyses.

Hints: You’ll also need to use the --allow-extra-chr option.

# Linkage prune with plink
# We'll prune in non-overlapping windows of 50 SNPs
plink --vcf $VCF --allow-extra-chr \
--indep-pairwise 50 50 0.2 \
--out $DATASET

Explore results in R

Now let’s explore the results! Return to R and read in the PCA outputs, which will be suffixed .eigenval and .eigenvec.

# Read in the eigenvector scores
plink_scores = read.table('popgen_demo_krumlov23.eigenvec')

# Rename the eigenvecs cols
colnames(plink_scores) = c('ind','ind2',paste0('PC',1:(ncol(plink_scores) - 2)))

# Read in the eigenvalues
plink_eigenval = read.table('popgen_demo_krumlov23.eigenval')

Read your inputs into two data.frames called plink_scores and plink_eigenval, that look like this:

head(plink_scores)
##    ind ind2       PC1          PC2        PC3         PC4          PC5
## 1 p0.0 p0.0 -0.122078  0.012935200 0.00320871 -0.00245191 -0.000822215
## 2 p0.1 p0.1 -0.125386  0.011319500 0.00728584 -0.00351268  0.003910970
## 3 p0.2 p0.2 -0.126704  0.007575760 0.00503239  0.00133178  0.001004640
## 4 p0.3 p0.3 -0.125776  0.005454740 0.00618741  0.00470080  0.000722623
## 5 p0.4 p0.4 -0.125732  0.014062200 0.00147493  0.00100268 -0.003479150
## 6 p0.5 p0.5 -0.123756 -0.000981481 0.00597981  0.00843322  0.002228040
##           PC6          PC7         PC8         PC9         PC10         PC11
## 1  0.00281681 -0.003645220  0.00834636 -0.00479214  0.004214160  9.96145e-05
## 2  0.00893109  0.000240292  0.00916124 -0.00692168 -0.000162668 -7.71035e-03
## 3 -0.01115630 -0.007243930 -0.00222996  0.00576070 -0.004564770  2.10456e-03
## 4 -0.00511106 -0.003291240 -0.00463956  0.00567235  0.001458280  3.50304e-03
## 5  0.00287493  0.001770390  0.00348909 -0.01090230 -0.002004750 -5.69118e-03
## 6 -0.01823700 -0.006327550 -0.01425580  0.02202560 -0.000862466  6.02781e-03
##          PC12        PC13        PC14        PC15        PC16        PC17
## 1 -0.00895890  0.01018750  0.02175990  0.00890439 -0.00743525  0.01685170
## 2 -0.00452610 -0.00194529  0.01643850  0.01370070 -0.01946810  0.00366498
## 3  0.00596315 -0.01390820 -0.01420920 -0.00781733  0.00768012 -0.00359343
## 4  0.00989684 -0.01301740 -0.02102160 -0.00780664  0.01567650 -0.00980370
## 5 -0.00415022  0.01257410  0.00746004  0.00368681 -0.00689611  0.01690440
## 6  0.01204670 -0.03076720 -0.03431040 -0.01517650  0.03400560 -0.02972850
##          PC18        PC19         PC20
## 1 -0.01956960 -0.00557496 -0.000799148
## 2  0.00800027  0.00526460  0.007826790
## 3 -0.00352997 -0.00887329  0.001173550
## 4  0.00530733  0.00455785  0.001065890
## 5  0.01108530  0.00668424  0.008758650
## 6 -0.00448433 -0.00214756  0.000397279
head(plink_eigenval)
##         V1
## 1 48.15550
## 2 22.78070
## 3 14.02390
## 4 10.69700
## 5  4.20565
## 6  3.81928

A good place to start is to look at what proportion of genetic variation is explained by each principal component. This information is provided by the eigenvalues. The eigenvalues represent scaling parameters that describe the relative importance of each axis.

The more genetic variation that is explained by our major axes tells us how strong population structure is; populations that are separate on a PC axis that explains 1% of variation are less distinct than populations that separate on a PC axis that explains 20% of genetic variation.

In order to get the proportion of variation explained by each, we need to divide each eigenvalue by the sum of all eigenvalues.

# Examine the eigenvals
plink_eigenval
# Convert these to % variance explained
plink_eigenval$var_explained = 100 * plink_eigenval$V1 / sum(plink_eigenval$V1)
# Check the variance explained
plink_eigenval

We’ll now plot the PCA results using the plink_scores data. The scores represent the transformed value of each individual in each axis. So by visualising these, we can see where individuals are plotted relative to other individuals. Individuals that are closer together are more closely-related.

You can do this using any plotting software you’re familiar, but the example will be in ggplot2. We’re usually interested in visualising the PC1 scores on the x-axis and PC2 sores on the y-axis, and it helps to include the % variance explained on each axis as part of the label. The PC scores are found in the eigenvecs file, and describe the transformed value of each individual on each axis. Produce a scatterplot that shows each individual’s position in PCA space, with axis labels in the format ‘PC1 (xx %)’

Hint: You can use the paste0() or paste() function to produce an axis label that pulls the variance explained value directly from your eigenvals table to produce an axis label!

Hopefully, your figure looks something like this…

# Plot the PCA results
# PC1 and PC2
ggplot(plink_scores,aes(PC1,PC2)) + 
  geom_point() +
  xlab(paste0('PC1 (',round(plink_eigenval$var_explained[1],2),'%)')) +
  ylab(paste0('PC2 (',round(plink_eigenval$var_explained[2],2),'%)'))

Visualising the results on PC1 and PC2 provides us with a sense of population structure. The distance between points is analagous to how related individuals are, so ‘populations’ appear as clusters of individuals. F1 hybrids between populations appear equidistant between clusters. PC1 and PC2 however only represent some of the total variation, so it can help to explore additional axes. For example, PC1 and PC2 give the appearance of 4 clusters, but if we plot PC2 and PC3, we can see 5 clusters…

# PC2 and PC3
ggplot(plink_scores,aes(PC2,PC3)) + 
  geom_point() +
  xlab(paste0('PC2 (',round(plink_eigenval$var_explained[2],2),'%)')) +
  ylab(paste0('PC3 (',round(plink_eigenval$var_explained[3],2),'%)'))

In fact, to get a view of population structure that incorporates all of our PC axes, we can simply calculate the distance among all individuals using the dist() function. This calculates the euclidean distance among all individuals, which is analogous to how closely related they are.

Don’t worry about doing this (unless you’d really like to!), but if we plot the distance in PC space between 50 random individuals we can see clearly that distance is lower among indivduals from the same population, and our ancestral Highlands (p0) population is more distinct than the others.

# Extract only the scores
only_scores = plink_scores[,3:ncol(plink_scores)]
# Multiple each column by its eigenvalue
for(i in 1:nrow(plink_eigenval)){
  only_scores[,i] = only_scores[,i] * plink_eigenval$V1[i]
}

# Plot a random sample of 50 individuals
rownames(only_scores) = plink_scores$ind
pca_dists = dist(only_scores[sample(1:nrow(only_scores),50),])
pheatmap::pheatmap(as.matrix(pca_dists))

Depending on time…

The sense of population structure that PCA provides is highly context-dependent. It depends on which individuals we include, how the data was originally filtered, and how many SNPs we included. Explore how your PCA results change if you:

  • Remove all individuals that have p0 in their name.
  • Include only 100 SNPs. How few SNPs can you use to re-create the pattern of population structure above?
  • Filter the original dataset for only singletons, rather than a minor allele frequency of 5%

Try and write a function called plot_plink_PCA() that lets you plot the results of a plink PCA analysis by providing it with the path to the eigenvecs and eigenvals files.


1b) Discriminant Analysis of Principal Components - DAPC

Whilst PCA is powerful and intuitive, it doesn’t specifically perform any tests of whether there are populations and how many might be statistically relevant. DAPC is an extension of PCA that includes a step to determine how many populations best describe the data. Having identified a likely number of populations (k populations), we can then re-analyse the data in a way that seeks to maximise genetic variation among our k clusters, as opposed to among all individuals as in PCA.

To do this, we’ll need to produce a VCF that only includes our linkage-pruned SNPs from the previous section. We can do this using bcftools and the view and --include options.

# We'll also make a new VCF of the linkage-pruned SNPs
bcftools view --include ID==@$DATASET.prune.in $VCF -Oz -o ${DATASET}_maf05_LDpruned.vcf.gz

We can then read this pruned VCF into R using the read.vcfR() function, and then convert to genind format using the vcfR2genind() function

# Read in the LD-pruned VCF
LDpruned_vcf = read.vcfR("popgen_demo_krumlov23_maf05_LDpruned.vcf.gz")

# Convert our VCF to genind format
genind_input = vcfR2genind(LDpruned_vcf)

The first step for our DAPC analysis is to identify the most likely number of clusters. To do this, we use the find.clusters() function. There are two options to bear in mind here, the max.n.clust and n.pca options. These options restrict the exploration of parameters to limits of our choice. Set each of these at 20, but feel free to explore how these affect the analysis.

# Identify the most likely k-value for number of clusters
# When we run this, we see steep declines in BIC up to k = 5, so we'll pick that for our cluster N
grp <- find.clusters(genind_input,
                     max.n.clust=20,
                     n.pca = 20,)

## Choose the number of clusters (>=2):

This function shows us how model fit changes as we fit it with different numbers of clusters. We can see a sharp decline in BIC (improving model fit) up to a k of 5, and only moderate improvements at k of 6+. This implies that 5 clusters is the best fit for our data. Set k = 5.

grp <- find.clusters(genind_input,
                     n.clust=5,
                     n.pca = 20)

As well as helping us choose the most likely number of clusters in our data, this function also assigns our individuals into groups. Assuming you’ve saved your clusters to a variable called grp, we can look at these by running grp$grp.

head(grp$grp)
## p0.0 p0.1 p0.2 p0.3 p0.4 p0.5 
##    4    4    4    4    4    4 
## Levels: 1 2 3 4 5

These groups are given an arbitrary number as an ID, so let’s just quickly change this (you can just run this bit of code, don’t worry about it!)

# Fetch the grp number for each actual pop
new_grp_names = lapply(paste0('p',0:4),function(p){
  as.character(grp$grp[grep(p,names(grp$grp))][1])
})
names(new_grp_names) = paste0('p',0:4)

# Recode factor levels by name
levels(grp$grp) <- new_grp_names

We will now attach these cluster IDs to our plink PCA scores from earlier, and re-plot the PCA results with the points now coloured according to their assigned groups. If using ggplot(), we can also now add stat_ellipse() to plot a 95% confidence ellipse around each cluster.

Hint: You can do this easily by making a data.frame with 2 columns called ‘ind’ and ‘pop’, by taking names(grp$grp) and grp$grp. You can then use the merge() function to attach these to your PC scores data.frame

# We can check these clusters against our PCA results by merging them...
ind_groups = data.frame(ind = names(grp$grp),
                        pop = grp$grp)
plink_scores_ids = merge(plink_scores,ind_groups)

# Plot together, now colouring by group assignment...
ggplot(plink_scores_ids,aes(PC1,PC2,colour = pop)) +
  geom_point() + 
  stat_ellipse()

Now that we’ve got our groups, we can perform DAPC to maximise variation among them. To do this, we’ll run the dapc() function, providing it with our genind input, our group assignments, the n.pca option from before, and we’ll set n.da = 4 so we can produce 4 discriminant axes to separate our 5 groups. Plot the output of this using the scatter() function.

# We can then use these groups for DAPC
# We'll retain all discriminant functions seeing as we want to describe diffs between 5 clusters
dapc_res <- dapc(genind_input, 
                 grp$grp,
                 n.pca = 20,
                 n.da = 4)
# Display results in a scatter plot
scatter(dapc_res)

Probability of group assignment

A useful feature of DAPC is that we can check how confident the analysis is of our group assignments. This information is stored in the dapc_res$posterior slot. This matrix describes the probability that each individual is assigned to each group. Use this matrix to produce a data.frame that describes, for each individual, the cluster it has been assigned to and the probability.

Hint: To work through the matrix, you can either use a forloop or the apply() function to find the max probability

# A useful feature of dapc is we get a membership probability
head(dapc_res$posterior)
##      p0 p1 p2 p3 p4
## p0.0  1  0  0  0  0
## p0.1  1  0  0  0  0
## p0.2  1  0  0  0  0
## p0.3  1  0  0  0  0
## p0.4  1  0  0  0  0
## p0.5  1  0  0  0  0
# And we can use this to pull out the group membership and the probability
head(data.frame(ind = rownames(dapc_res$posterior),
                cluster = apply(dapc_res$posterior,1,which.max),
                prob = apply(dapc_res$posterior,1,max)))
##       ind cluster prob
## p0.0 p0.0       1    1
## p0.1 p0.1       1    1
## p0.2 p0.2       1    1
## p0.3 p0.3       1    1
## p0.4 p0.4       1    1
## p0.5 p0.5       1    1

As a final step for this analysis, we can produce an admixture plot that visualises the probability of group assignment for each individual as grouped bars. The easiest way to do this is with the compoplot() function, but if you’re feeling confident with ggplot() you could also do this using the raw data from the matrix above.

# We can also plot these populations in a structure like plot as well
compoplot(dapc_res, posi="bottomright",
          txt.leg=paste("Cluster", 1:5), lab="",
          ncol=1, xlab="individuals", col=funky(5))

Depending on time…

  • DAPC includes a number of steps to validate results. This includes using the alpha score and cross-validation. How to do these is described in an in-depth DAPC tutorial by the developer (link at the bottom of the document, section 4). Give this a go and optimise the number of retained PC’s.

  • Explore how choosing different values of k influences the probability of group assignment. It is common to show admixture plots like the one above for multiple k values of 2 to k. These plots are informative in seeing hierarchical population structure.

  • Re-run the DAPC analysis with 100 random SNPs. DAPC, like any statistical process, is sensitive to statistical power. With fewer SNPs, we have less power to detect structure.


Step 2 - Genetic/Nucleotide Diversity (Pi)

Nucleotide diversity describes the proportion of nucleotides that are expected to differ between two given sequences taken from a population. A greater estimate of nucleotide diversity implies there is more genetic variation in a given population or in a given region of the genome.

This is another simple analysis that can quickly tell us a lot about the data we’re working with. Nucleotide diversity varies extensively across the genome; regions of high diversity may reflect regions under balancing selection or mutation hotspots, whilst regions exhibiting low diversity may be under strong selection or in regions of low recombination and high background selection.

Calculate Pi on the command-line

There are many ways to calculate genetic diversity, here we’ll use vcftools and the --window-pi option. We’ll run this individually for each population, seeing as we are interested in within-population nucleotide diversity. To do this, we will need to produce a popmap file, which is simply a text file for each population that lists the individuals in that population.

Use for-loops to first produce a popmap text file for each individual, and then use this popmap and the --keep option of vcftools to calculate pi for each population in 20kb windows. Run this over the original maf-filtered VCF, rather than the linkage-pruned VCF!

##### CALCULATING PI ######
# What window size shall we look at?
WINDOW_SIZE=20000

# We want to do this on a per-population basis, so we'll start by making popmaps
for i in {0..4}
do
    # Use grep to pull individual IDs from the vcf into a file
    bcftools query -l $VCF | grep "p${i}" > p${i}.popmap
done

# We want to do this using a for-loop as we'll be running it within each of our 5 populations
for i in {0..4}
do
    # Nice little message to tell us what we're doing
    echo ">>> Starting pi calculation for pop${i}"

    # Make a variable to store pop
    POP=p${i}

    # Calculate nucleotide diversity with vcftools
    vcftools --gzvcf $VCF \
    --keep ${POP}.popmap \
    --window-pi $WINDOW_SIZE \
    --out ${POP}

done

Visualise Nucleotide Diversity in R

Having calculated pi, read each of the outputs back into R and plot pi along the genome for each of the populations. Align your plots vertically so that you can see any trends across populations. Consider:

  • Where is nucleotide diversity high or low?
  • What is the genome-wide mean? Does this vary across populations?
  • Are there any regions specific to individual populations that show high/low diversity? What might cause these?

Hint: If you format your pi results with a column that has the population ID, you can use facet_wrap() in ggplot() to align your plots vertically

# Identify all the pi results and read them into a list
pi_files = list.files(pattern = 'windowed.pi')
pi_res = lapply(pi_files,read.table,header = T)

# We can loop through our list and add an identifer to each
for(i in 1:length(pi_res)){
  pi_res[[i]]$pop = paste0('p',i - 1)
}

# Collapse these into a list
all_pi_res = rbindlist(pi_res)

# And we can now plot together with facet_wrap
ggplot(all_pi_res,aes(x = BIN_START,y = PI)) +
  geom_line() +
  facet_wrap(~pop,ncol = 1) +
  labs(x = 'Chromosome Position (bp)',
       y = expression(Nucleotide~Diversity~(pi)))

We can also take all of our windowed estimates of nucleotide diversity and compare the average nucleotide diversity among populations, either by calculating the mean or plotting. Try both now…

# Mean nuc div per population
all_pi_res[,.(mean(PI)),by = pop]
##    pop           V1
## 1:  p0 0.0004821958
## 2:  p1 0.0007959663
## 3:  p2 0.0007576140
## 4:  p3 0.0008544700
## 5:  p4 0.0007201156
# Visualise differences in nucleotide diversity
ggplot(all_pi_res,aes(x = pop,y = PI)) +
  geom_boxplot() +
  labs(x = 'Population',y = expression(Windowed~pi))

Our results for nucleotide diversity indicate that the ancestral Highlands (p0) population has lower nucleotide diversity than the other populations.

  • In light of our results above on population structure, what is the most likely explanation for this?

Note: To calculate an unbiased estimate of genetic diversity, we need to know not only where genetic variants are observed, but also where they are absent due to issues with read-mapping. This issue is discussed in-depth in Korunes and Samuk 2021 - pixy: Unbiased estimation of nucleotide diversity and divergence in the presence of missing data. This is not an issue here as our data is simulated, however should be kept in mind for real sequencing data.


Step 3 - Identifying Selective Sweeps

Two of our populations exhibit reduced diversity around 5 Mb (p2 and p4), which may be indicative of selection acting on these regions. Selective sweeps are a common mode by which adaptation can occur in the genome. In this mode, an adaptive variant appears in a population (either through mutation or introgression) and rapidly increases in frequency. Because of linkage, the haplotype carrying the adaptive variant also increases in frequency, subsequently removing local haplotype diversity around the region harbouring the adaptive variant. This process creates a signature of reduced local diversity that we can test for.

Below shows a simplified schematic of how sweeps reduce local diverisity (ignoring opportunities for recombination). As the haplotype carrying the adaptive mutation increases in frequency, it replaces other haplotypes carrying diversity. In this example, we lose 4 SNPs from the population as the adaptive haplotype fixes.

Hard seletive sweep example

The SweepFinder2 method

Sweepfinder2 scans the genome for sweep-like regions by first modelling how background selection and recombination rate affect local genetic diversity across the genome. It then uses this estimate (called a B-value) to evaluate whether at a given region of the genome the neutral allele frequency spectrum deviates from expectations using a Conditional Likelihood Ratio (CLR) test.

See the short article at DeGeorgio et al. 2016 - SWEEPFINDER2: increased sensitivity, robustness and flexibility for more detail.

A note on the VCF we are using…

In our previous analyses we used a VCF filtered for a minor allele frequency of 5%. This is ok for some analyses, but for analyses that are focussed on the site-frequency spectrum (SFS) it will be missing important information on low frequency variants because these have been removed.

We do maf-filtering to remove low frequency variants that are more likely to be erroneous. So best-practise for analyses like this is to find a balance between data quality and retaining low-frequency sites. For instance, we might remove only SNPs that have an allele count of 1 (singletons).

The script to do this singleton-filtering is below, but in the interest of reducing running time (and I already know it gives the same result) we will just go ahead and do our sweep analysis using our maf 5%-filtered VCF because it has fewer SNPs and runs quicker.

# Filter all SNPs and remove singletons
bcftools view --min-ac 2 ../popgen_demo_krumlov23.vcf.gz -Oz -o popgen_demo_krumlov23_mac2.vcf.gz

Run SweepFinder2

To run Sweepfinder2, we will first need to produce files in the requested input format. The input file is a four column text-file that has:

  • Column 1: position = Position of SNP
  • Column 2: x = Allele counts for that SNP in the population
  • Column 3: n = The total number of alleles in the population (2N)
  • Column 4: folded = A column of ‘1’s that tell Sweepfinder2 that we are giving it a ’folded’ site frequency estimate (we don’t know ancestral/derived states).

Hint: You can use vcftools with the --counts2 option to get allele counts!

For e.g.

##    position  x   n folded
## 1:       63 32 100      1
## 2:      573  0 100      1
## 3:     1198 37 100      1
## 4:     1600 24 100      1
## 5:     1834  0 100      1
## 6:     2199  0 100      1

After making the formatted input file, we can run SweepFinder2 by including the -sg option to define window size (let’s stick with 20kb…)

Put all of this together in a command-line forloop. In each loop you should:

  • Assign the relevant population map
  • Filter the $VCF_SFS for only individuals in the popmap
  • Calculate the allele counts for all SNPs
  • Convert our counts to SweepFinder2’s input format
  • Run SweepFinder2
# We want to do this using a for-loop as we'll be running it within each of our 5 populations
for i in {0..4}
do

    # Make a variable to store pop
    POP=p${i}

    # Get counts from the VCF files and only keep individuals from this pop
    vcftools --gzvcf $VCF \
    --keep ${POP}.popmap \
    --counts2 \
    --out ${POP}
    
    # Now we need to edit the format of the outputted file to SweepFinder standard
    tail -n+2 ${POP}.frq.count | awk -v OFS="\t" '{print $2,$6,$4,"1"}' > temp
    # Add header
    echo -e 'position\tx\tn\tfolded' | cat - temp > ${POP}_SF.in

    # Keep tidy
    rm -f ${POP}*frq* $POP.log

    # Run sweepfinder in windows
    SweepFinder2 -sg $WINDOW_SIZE ${POP}_SF.in ${POP}_SF.out

done

This takes around 10 minutes to run, so feel free to grab a coffee or take a break!

Plot SF results

Having run these, read in the results as above and plot the outputs. If you look especially carefully you should just about make out two peaks of elevated likelihood suggestive of two selective sweeps, one in p2 (Leeds) and one in p4 (Buckingham Palace).

# Read in and plot sweepfinder results ------------------------------------
sf_files = list.files(pattern = 'SF.out')
sf_res = lapply(sf_files,read.table,header = T)

# We can loop through our list and add an identifer to each
for(i in 1:length(sf_res)){
  sf_res[[i]]$pop = paste0('p',i - 1)
}

# Collapse these into a list
all_sf_res = rbindlist(sf_res)

# And we can now plot together with facet_wrap
ggplot(all_sf_res,aes(x = location,y = LR)) +
  geom_line() +
  facet_wrap(~pop,ncol = 1) +
  labs(x = 'Chromosome Position (bp)',y = 'CLR')

Note: Detecting selective sweeps is easiest when identifying ‘hard’ sweeps, in which a mutation has arisen at very low frequency and increased in frequency until it has become fixed. Selective sweeps in which a mutation has increased in frequency from an already appreciable frequency (a ‘soft’ sweep), or has yet to fix within a population (an ‘incomplete’ sweep) are more difficult to detect, yet may be more common in natural populations.

Depending on time…

  • Selective sweeps are one way that selection manipulates genetic diversity within the genome, but selection manifests its way in many forms and there are many ways to detect it. Another common approach is to examine genetic differentiation between populations using metrics such as FST. Use vcftools and the --weir-fst-pop option to calculate FST between pairs of populations in sliding windows. Compare these results to our Sweepfinder2 results.

Step 4 - Adaptive Introgression and D-statistics

Having observed that two of our populations show evidence of selection having occurred at the same region, a question we might next be interested in asking is how this has occurred.

Recall that p2 and p4 were sampled in Leeds and London, which are both much more urban environments than the others. Perhaps our wild cats are undergoing local adaptation to urban environments, and the evolution of a gene within our sweep region facilitates this.

But by what mechanism have both populations undergone evolution at the same genomic region? One possibility is that novel mutations have arisen at the same or similar locus in each population and each has experienced an independent selective sweep. In this sense, a gene around 5Mb has been modified multiple times, perhaps in different ways, but in a way that has convergently increased local fitness in p2 and p4.

An alternative is that an adaptive variant arose in one population and was introduced to the other following introgression between them. Alternatively, the adaptive variant may have arisen at some point in the metapopulation history prior to the population split, and was subsequently inherited in each before undergoing selection and fixation. In this sense, the gene was modified only once but was transferred or inherited into both populations and selection acted multiple times on it independently.

These mechanisms offer up testable hypotheses in terms of what the underlying tree structure should like at the region where the sweeps were detected. If mutations are independent, we can predict that the tree structure at the sweep site should reflect the tree given in at the start of the worksheet, since the haplotypes carrying each adaptive mutation will be unique to each population. Alternatively, if adaptive introgression is the mechanism we can predict the structure of the tree at the selected site will place both populations together, even though they are not sisters. This is because the haplotype carrying the adaptive variant will have come from whichever population the mutation arose in, and subsequently there is a single clade for the adaptive haplotype.

Let’s now explore adaptive introgression among the populations to test this.

ABBA-BABA and D-Statistics

\[ D = \frac{ABBA - BABA}{ABBA + BABA} \]

ABBA-BABA (No, not that ABBA)

ABBA-BABA is a straightforward concept originally designed to detect introgression between humans and neanderthals. It assumes that given a tree structure of an outgroup and a focal trio, that there should be an approximately equal number of alleles that show the ‘ABBA’ pattern relative to the ‘BABA’ pattern. In this case, D ~ 0. Excesses of either ABBA or BABA sites suggests that introgression might have taken place within the assumed tree, either between P1 and P3 (excess BABA) or P2 and P3 (excess ABBA). (P1 and P2 are assumed to be sister-groups).

D scores thus range from -1 (excess BABA) to +1 (excess ABBA), and can be tested for significance using a jackknife procedure in which part of the genome is excluded and pseudo-D scores are re-calculated.

Running Dsuite

Dsuite provides a fast and powerful set of functions to calculate D-statistics and other introgression statistics (such as fbranch statistics).

To run Dsuite, we will provide it with our assumed tree from the start of the document, which you can save to a text file. We’ll assume that the ancestral Highlands population represents the outgroup (p0). This tree is important because it tells Dsuite our a-priori understanding of the relationship among populations.

The Newick tree for our populations looks like this (Outgroup,(p1,(p2,(p3,p4)))). Save this to a text file called popgen_demo.nwk.

# Give Dsuite a phylogenetic tree to work with based on stepping-stone model
echo "(Outgroup,(p1,(p2,(p3,p4))))" > popgen_demo.nwk

We’ll also need to make a slightly different popmap, where all of our previous popmaps are combined together with an additional column that gives the population identifier. In this new column, replace p0 with ‘Outgroup’.

##    p0.0 Outgroup
## 1: p0.1 Outgroup
## 2: p0.2 Outgroup
## 3: p0.3 Outgroup
## 4: p0.4 Outgroup
## 5: p0.5 Outgroup
## 6: p0.6 Outgroup
# Make a full popmap that includes all inds in col1 and their pop in col2
for i in {0..4}
do
    # We'll cat them together and add a column
    # Add a sed command to attach the second column
    cat p${i}.popmap | sed "s/$/\tp${i}/" >> $DATASET.popmap
done

# Make p0 the outgroup
sed -ie 's/\tp0/\tOutgroup/g' $DATASET.popmap

We can now run Dsuite, giving it our Newick tree, popmap and VCF (use our maf of 5% VCF). This will calculate D-statistics among all possible trios of p1 - p4.

# Run Dsuite...
Dsuite Dtrios -t popgen_demo.nwk $VCF $DATASET.popmap 

Dsuite should run quickly, and gives us 3 output files that are suffixed:

  • _tree.txt = ABBA-BABA, taking P1 and P2 as sisters in our tree.
  • _Dmin.txt = ABBA-BABA, taking P1 and P2 as the populations that minimise D
  • _BBAA.txt = ABBA-BABA, taking P1 and P2 as the populations that maximise BBAA sites.

Explore each of these outputs and see how they differ. Each output includes the D-statistic, it’s Z-score and p-value, and the numbers of sites with different patterns for each trio.

# Look at these results
# Dmin
cat popgen_demo_krumlov23_Dmin.txt
# BBAA
cat popgen_demo_krumlov23_BBAA.txt
# Tree
cat popgen_demo_krumlov23_tree.txt

We can see that BBAA and Dmin outputs both assume that our Leeds (p2) and London (p4) populations are sisters, i.e. P1 and P2. However, we ‘know’ that this is not the case. The tree.txt file maintains our tree order, and suggests there is strong evidence for introgression between London and Leeds. Indeed, if introgression is strong enough it may overwhelm the original tree, which is why we see discordance in these outputs. Supplying the tree therefore allows us to test specific hypotheses regarding introgression.

D-statistics across the genome

These results tell us that it is highly likely that introgression has occurred between London and Leeds. But this does not confirm that our selective sweep region is the cause of this.

To look into this, we can run Dsuite again using the Dinvestigate option. We just need to tell it which trio we’d like to investigate by giving it a text file in the format P1\tP2\tP3. In this case, P1 = p3, P2 = p4 and P3 = p2. This tells Dsuite we are interested in introgression specifically between p4 and p2.

# Make our test trios file
echo -e "p3\tp4\tp2" > test_trio.txt

Dinvestigate calculates D-statistics in sliding windows of N SNPs, allowing us to see where ABBA-BABA imbalance is particularly high. Specify a window size of 100 SNPs, and slides of 50 SNPs.

# Run Dinvestigate
Dsuite Dinvestigate $VCF \
    $DATASET.popmap \
    test_trio.txt \
    -w 100,50 \
    -n $DATASET

This should run quickly. Finally, read the output back into R (p3_p4_p2_localFstats_popgen_demo_krumlov23_100_50.txt) and plot the results. These show us very clearly that the region around our sweep signatures is driving the introgression results. This implies that the same haplotype is responsible for the sweep in both populations.

# Read in Dinvestigate and analyse ----------------------------------------
introgression_winds = fread("p3_p4_p2_localFstats_popgen_demo_krumlov23_100_50.txt")
introgression_winds$midpoint = rowMeans(introgression_winds[,.(windowStart,windowEnd)])

# Plot
ggplot(introgression_winds,aes(midpoint,y = f_dM)) +
  geom_point() +
  geom_smooth() +
  labs(y = 'fdM',x = 'Chromosome Position (bp)')

Confirm the result by visualisation

There is no substitute for visualising raw data and really seeing the statistical signature we are picking up. We can do this by isolating the region around 5 Mb and plotting it for all populations using a package called GenotypePlot.

To do this, we just need to:

  • Read in our Dsuite popmap into R.
  • Read in our full maf-filtered VCF using vcfR
  • Make a logical vector that says which SNPs are between 4.9 and 5.1Mb.
  • Use this filtering vector to subset our vcf by rows.

We can make genotype plots with and without clustering the individuals (see ?genotype_plot). These show very clearly that at 5 Mb, populations p2 and p4 share the same adaptive haplotype.

Hint: Use the vcf_object option with genotype_plot

# We'll need our popmaps
popmap = read.table('popgen_demo_krumlov23.popmap')

# And we'll need the full VCF
full_vcf = read.vcfR('popgen_demo_krumlov23_maf05.vcf.gz')

# Use general R principles to filter the full VCF for the region of interest
# SNP data is stored in the fix slot
# head(full_vcf@fix)
# Make SNPs to keep
to_keep = which(as.integer(full_vcf@fix[,2]) > 4.9e6 &
                  as.integer(full_vcf@fix[,2]) < 5.1e6)

# Let's plot 100kb either side of 5Mb
geno_plot = genotype_plot(vcf_object = full_vcf[to_keep,],
                          popmap = popmap,
                          snp_label_size = 10000)
combine_genotype_plot(geno_plot)

And add some clustering by setting cluster = T

# Do the same again but run some clustering
cluster_plot = genotype_plot(vcf_object = full_vcf[to_keep,],
                            popmap = popmap,
                            snp_label_size = 10000,
                            cluster = T)
combine_genotype_plot(cluster_plot)


What have we learnt?

So starting with a set of SNPs of which we had a limited a-piori knowledge of, we have demonstrated:

  • Evidence of population structure:
    • PCA shows clustering behaviour indicative of populations. The distance between individuals sampled within clusters is greater than between. This tells us the likelihood of breeding within the population is greater than breeding between.
    • DAPC tells us that there are most likely 5 clusters in our data, and we can assign individuals into these clusters with high confidence.
  • The genomic landscape of genetic variation:
    • We have quantified nucleotide diversity across the genome within each population. This reveals population-specific trends and trends across all populations that reflect intra-genomic processes such as recombination rate variation.
    • Reduced genetic variation in the Highlands p0 is expected given its greater isolation from the other populations. With lower rates of migration, the effective population size is reduced as fewer individuals can contribute novel variation.
  • Evidence of selection:
    • Our selective sweep analysis revealed evidence of selection acting at the 5 Mb locus in two populations. This was characterised by extreme local reductions in genetic diversity that exceeded the intra-genomic expectations.
  • Evidence of introgression:
    • We quantified evidence of introgression between all English populations (p1 - p4) through D-statistics; with strong support for introgression between Leeds (p2) and London (p4) populations.
    • This signature is particularly strong around 5 Mb, indicating that our observed sweeps are driven by the same haplotype and represent adaptive introgression.
    • We confirmed this by visualising the clustering of haplotypes.