Population Genetics is centred around understanding genetic variation.
Towards this end, population geneticists are interested in the fundamental processes of mutation, migration, recombination, selection and drift.
In this lab session, we’ll be touching on most of these and covering a range of common foundational analyses to understand the evolutionary history of a SNP dataset. The goal is for you to get a feel for a basic population genetics workflow through a whistle-stop tour of methods. We could spend an entire day on each step of the analysis, but given we only have 3 hours we’ll be learning a little about a few things as opposed to deep-diving into one thing.
It’s expected that there are a range of abilities in the cohort, and consequently some may progress quicker than others. At the end of each section there are additional exercises that are optional for anyone who is working quickly.
It’s also important to learn how to troubleshoot errors and problems on your own! Remember that for command-line programs we can usually type --help, or in R type ?my_function, to see documentation.
We’ll be working with a simulated dataset that includes a number of conditions and events that we will uncover over the course of the session. The workshop is designed around the idea of building up an understanding of a SNP dataset that we may have limited a priori knowledge of, which can be common, for example, when working with sequencing data from natural populations. Starting with exploratory summary analyses, we will generate and test hypotheses to explore evolutionary questions.
Let’s pretend our data comes from a study into a hypothetical, yet MIRACULOUS, conservation success story of Scottish Wild Cats spreading south into England from their Highlands refugia. In this study, samples were collected from a number of individuals from the ancestral Highlands population (p0) and at four further locations moving south through the Scottish borders and into England, at Kielder Forest (p1), Leeds (p2), Norfolk (p3), and London (p4).
Scottish Wild Cats are back!
The only information we will start with is a simple tree summary of the 5 populations, which we will predict from the known North to South colonisation history:
plot(ape::read.tree(text = '(Highlands_p0,(Kielder_p1,(Leeds_p2,(Norfolk_p3,London_p4))));'))
Create a new directory at the following path and move into it:
~/workshop_materials/population_genomics/workdir
# Change to our workdir
mkdir ~/workshop_materials/population_genomics/workdir
cd ~/workshop_materials/population_genomics/workdir
Also, define the following variables in your command line. Variables help keep scripts tidy, and should we want to change the inputs, it is much simpler to change one line of code that defines the variable than every line of code where that variable is used!
VCF=popgen_demo_krumlov23_maf05.vcf.gz
DATASET=popgen_demo_krumlov23
We will also be using R for some analyses and to visualise results, so open up R and load the following packages:
vcfR
ggplot2
data.table
adegenet
GenotypePlot
Set your working directory (using setwd()) to the workdir created above.
# Load our libraries
library(vcfR)
library(ggplot2)
library(data.table)
library(adegenet)
library(GenotypePlot)
# Set our working directory
setwd("~/workshop_materials/population_genomics/workdir")
First, get a general sense of the dataset we’ll be working with: how many individuals does it contain, and how many SNPs?
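If you’re unsure where to start, here is one possible set of commands (a sketch using bcftools; the raw VCF sits one directory up, as in the filtering command below):

# How many individuals does the VCF contain?
bcftools query -l ../popgen_demo_krumlov23.vcf.gz | wc -l
# How many variant records (SNPs)?
bcftools view -H ../popgen_demo_krumlov23.vcf.gz | wc -l
# A fuller summary of the callset
bcftools stats ../popgen_demo_krumlov23.vcf.gz | less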
We’ll start by filtering our VCF to keep SNPs with a minor allele frequency of at least 5%. (You can do this with vcftools or bcftools.)
bcftools view --min-af 0.05:minor ../popgen_demo_krumlov23.vcf.gz -Oz -o popgen_demo_krumlov23_maf05.vcf.gz
Visualising population structure can be a good place to start when interpreting a fresh SNP dataset. Population structure provides us with a sense of how individuals are related, whether those relationships form clusters that we might consider populations, and whether there are any unexpected outliers or obvious trends in our data that might be suggestive of sequencing or sampling issues. For instance, if our data has been sequenced in batches or with an unbalanced design, sequencing artifacts may cause individuals to cluster according to when or how they were sequenced (batch effects). Males and females may also separate clearly if the SNPs include sites from sex-linked regions.
PCA is a powerful tool for quickly inferring and visualising population structure. It is an ordination technique that seeks to produce a set of uncorrelated axes (principal components, e.g. PC1, PC2 etc) that maximise genetic variation among all included individuals. In short, this is done by ‘rotating’ the data in multivariate space to find the single axis that explains the most variation (PC1), and repeatedly introducing orthogonal axes that contain increasingly smaller proportions of total variation until all variation is explained.
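To make the mechanics concrete, here is a tiny stand-alone illustration in base R using an invented genotype matrix (this is not part of the workshop pipeline; the data are made up purely for demonstration):

# Toy PCA: 6 individuals x 10 SNPs, genotypes coded as 0/1/2 allele counts
set.seed(42)
geno <- matrix(sample(0:2, 60, replace = TRUE), nrow = 6)
pca <- prcomp(geno)
# Proportion of total variance captured by each PC
summary(pca)$importance['Proportion of Variance', ]
# Each individual's position (score) on PC1 and PC2
pca$x[, 1:2]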
To perform PCA, the first step is to filter our SNP dataset for linkage. Linkage disequilibrium or ‘LD’ describes the non-random associations or correlation between SNPs. LD is particularly strong among SNPs that are physically close together because they are likely to be inherited in common haplotype blocks. LD can also be caused by other things, such as selection, and can occur among SNPs that are not physically close together. When considering population structure, we want to remove as much LD as possible because linked variants carry redundant information about inheritance. We therefore want each SNP to represent an independent variable in our analysis (it’s also computationally more efficient to work with fewer SNPs if we can!).
There are many approaches available to filter our data for linkage. We could do this in R using a package like SNPRelate (and you’re welcome to give this a go). One of the most commonly used approaches is Plink’s --indep-pairwise, which removes SNPs that are in linkage based on a correlation cut-off (usually 0.2) within windows of a user-defined size. For our window size, we’ll use non-overlapping windows of 50 SNPs, but feel free to explore how changing this parameter influences the number of SNPs that are left following pruning. This software runs on the command line.
Plink will provide us with a list of unlinked SNPs in a file suffixed *.prune.in, which we can use to pull SNPs out of our VCF for other analyses.

Hint: You’ll also need to use the --allow-extra-chr option.
# Linkage prune with plink
# We'll prune in non-overlapping windows of 50 SNPs
plink --vcf $VCF --allow-extra-chr \
--indep-pairwise 50 50 0.2 \
--out $DATASET
Having produced a set of SNPs that are pruned for linkage, we can perform PCA on the unlinked SNPs. Sticking with Plink, we can do this using the --pca option.

Hints: We will need the --extract option to pull out our unlinked SNPs, and we will again need to use the --allow-extra-chr option.
# Perform PCA using our linkage-pruned SNPs
plink --vcf $VCF \
--allow-extra-chr \
--extract $DATASET.prune.in \
--pca \
--out $DATASET
Now let’s explore the results! Return to R and read in the PCA outputs, which are suffixed .eigenval and .eigenvec.
# Read in the eigenvector scores
plink_scores = read.table('popgen_demo_krumlov23.eigenvec')
# Rename the eigenvecs cols
colnames(plink_scores) = c('ind','ind2',paste0('PC',1:(ncol(plink_scores) - 2)))
# Read in the eigenvalues
plink_eigenval = read.table('popgen_demo_krumlov23.eigenval')
Read your inputs into two data.frames called plink_scores and plink_eigenval, that look like this:
head(plink_scores)
## ind ind2 PC1 PC2 PC3 PC4 PC5
## 1 p0.0 p0.0 -0.122078 0.012935200 0.00320871 -0.00245191 -0.000822215
## 2 p0.1 p0.1 -0.125386 0.011319500 0.00728584 -0.00351268 0.003910970
## 3 p0.2 p0.2 -0.126704 0.007575760 0.00503239 0.00133178 0.001004640
## 4 p0.3 p0.3 -0.125776 0.005454740 0.00618741 0.00470080 0.000722623
## 5 p0.4 p0.4 -0.125732 0.014062200 0.00147493 0.00100268 -0.003479150
## 6 p0.5 p0.5 -0.123756 -0.000981481 0.00597981 0.00843322 0.002228040
## PC6 PC7 PC8 PC9 PC10 PC11
## 1 0.00281681 -0.003645220 0.00834636 -0.00479214 0.004214160 9.96145e-05
## 2 0.00893109 0.000240292 0.00916124 -0.00692168 -0.000162668 -7.71035e-03
## 3 -0.01115630 -0.007243930 -0.00222996 0.00576070 -0.004564770 2.10456e-03
## 4 -0.00511106 -0.003291240 -0.00463956 0.00567235 0.001458280 3.50304e-03
## 5 0.00287493 0.001770390 0.00348909 -0.01090230 -0.002004750 -5.69118e-03
## 6 -0.01823700 -0.006327550 -0.01425580 0.02202560 -0.000862466 6.02781e-03
## PC12 PC13 PC14 PC15 PC16 PC17
## 1 -0.00895890 0.01018750 0.02175990 0.00890439 -0.00743525 0.01685170
## 2 -0.00452610 -0.00194529 0.01643850 0.01370070 -0.01946810 0.00366498
## 3 0.00596315 -0.01390820 -0.01420920 -0.00781733 0.00768012 -0.00359343
## 4 0.00989684 -0.01301740 -0.02102160 -0.00780664 0.01567650 -0.00980370
## 5 -0.00415022 0.01257410 0.00746004 0.00368681 -0.00689611 0.01690440
## 6 0.01204670 -0.03076720 -0.03431040 -0.01517650 0.03400560 -0.02972850
## PC18 PC19 PC20
## 1 -0.01956960 -0.00557496 -0.000799148
## 2 0.00800027 0.00526460 0.007826790
## 3 -0.00352997 -0.00887329 0.001173550
## 4 0.00530733 0.00455785 0.001065890
## 5 0.01108530 0.00668424 0.008758650
## 6 -0.00448433 -0.00214756 0.000397279
head(plink_eigenval)
## V1
## 1 48.15550
## 2 22.78070
## 3 14.02390
## 4 10.69700
## 5 4.20565
## 6 3.81928
A good place to start is to look at what proportion of genetic variation is explained by each principal component. This information is provided by the eigenvalues, which are scaling parameters that describe the relative importance of each axis.
The amount of genetic variation explained by the major axes tells us how strong population structure is; populations that separate on a PC axis explaining 1% of variation are less distinct than populations that separate on a PC axis explaining 20% of genetic variation.
To get the proportion of variation explained by each axis, we divide each eigenvalue by the sum of all eigenvalues.
# Examine the eigenvals
plink_eigenval
# Convert these to % variance explained
plink_eigenval$var_explained = 100 * plink_eigenval$V1 / sum(plink_eigenval$V1)
# Check the variance explained
plink_eigenval
We’ll now plot the PCA results using the plink_scores data. The scores represent the transformed value of each individual on each axis, so by visualising these we can see where individuals fall relative to other individuals. Individuals that are closer together are more closely related.
You can do this using any plotting software you’re familiar with, but the example will be in ggplot2. We’re usually interested in visualising the PC1 scores on the x-axis and the PC2 scores on the y-axis, and it helps to include the % variance explained on each axis as part of the label. Produce a scatterplot that shows each individual’s position in PCA space, with axis labels in the format ‘PC1 (xx %)’.
Hint: You can use the paste0() or paste() function to produce an axis label that pulls the variance-explained value directly from your eigenvals table!
Hopefully, your figure looks something like this…
# Plot the PCA results
# PC1 and PC2
ggplot(plink_scores,aes(PC1,PC2)) +
geom_point() +
xlab(paste0('PC1 (',round(plink_eigenval$var_explained[1],2),'%)')) +
ylab(paste0('PC2 (',round(plink_eigenval$var_explained[2],2),'%)'))
Visualising the results on PC1 and PC2 provides us with a sense of population structure. The distance between points is analogous to how related individuals are, so ‘populations’ appear as clusters of individuals. F1 hybrids between populations appear equidistant between clusters. PC1 and PC2, however, only represent some of the total variation, so it can help to explore additional axes. For example, PC1 and PC2 give the appearance of 4 clusters, but if we plot PC2 and PC3, we can see 5 clusters…
# PC2 and PC3
ggplot(plink_scores,aes(PC2,PC3)) +
geom_point() +
xlab(paste0('PC2 (',round(plink_eigenval$var_explained[2],2),'%)')) +
ylab(paste0('PC3 (',round(plink_eigenval$var_explained[3],2),'%)'))
In fact, to get a view of population structure that incorporates all of our PC axes, we can simply calculate the distance among all individuals using the dist() function. This calculates the Euclidean distance among all individuals, which is analogous to how closely related they are.
Don’t worry about doing this (unless you’d really like to!), but if we plot the distance in PC space between 50 random individuals we can see clearly that distance is lower among individuals from the same population, and our ancestral Highlands (p0) population is more distinct than the others.
# Extract only the scores
only_scores = plink_scores[,3:ncol(plink_scores)]
# Multiply each column by its eigenvalue
for(i in 1:nrow(plink_eigenval)){
only_scores[,i] = only_scores[,i] * plink_eigenval$V1[i]
}
# Plot a random sample of 50 individuals
rownames(only_scores) = plink_scores$ind
pca_dists = dist(only_scores[sample(1:nrow(only_scores),50),])
pheatmap::pheatmap(as.matrix(pca_dists))
The sense of population structure that PCA provides is highly context-dependent. It depends on which individuals we include, how the data was originally filtered, and how many SNPs we included. Explore how your PCA results change if you:

- Remove the ancestral Highlands individuals, i.e. those with p0 in their name.

Try to write a function called plot_plink_PCA() that lets you plot the results of a plink PCA analysis by providing it with the paths to the .eigenvec and .eigenval files.
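If you get stuck, here is one possible sketch of such a function (one of many reasonable designs; it assumes Plink’s default two leading ID columns, as in our files above):

# A sketch: read Plink PCA outputs and plot two chosen PCs
plot_plink_PCA <- function(eigenvec_path, eigenval_path, pcs = c(1, 2)){
  scores <- read.table(eigenvec_path)
  colnames(scores) <- c('ind', 'ind2', paste0('PC', 1:(ncol(scores) - 2)))
  eigenvals <- read.table(eigenval_path)
  var_explained <- 100 * eigenvals$V1 / sum(eigenvals$V1)
  # Build axis labels that include the % variance explained
  axis_labs <- paste0('PC', pcs, ' (', round(var_explained[pcs], 2), '%)')
  ggplot(scores, aes(.data[[paste0('PC', pcs[1])]],
                     .data[[paste0('PC', pcs[2])]])) +
    geom_point() +
    xlab(axis_labs[1]) +
    ylab(axis_labs[2])
}
# For example:
plot_plink_PCA('popgen_demo_krumlov23.eigenvec', 'popgen_demo_krumlov23.eigenval')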
Whilst PCA is powerful and intuitive, it doesn’t specifically perform any tests of whether there are populations and how many might be statistically relevant. DAPC is an extension of PCA that includes a step to determine how many populations best describe the data. Having identified a likely number of populations (k populations), we can then re-analyse the data in a way that seeks to maximise genetic variation among our k clusters, as opposed to among all individuals as in PCA.
To do this, we’ll need to produce a VCF that only includes our linkage-pruned SNPs from the previous section. We can do this using bcftools with the view command and the --include option.
# We'll also make a new VCF of the linkage-pruned SNPs
bcftools view --include ID==@$DATASET.prune.in $VCF -Oz -o ${DATASET}_maf05_LDpruned.vcf.gz
We can then read this pruned VCF into R using the read.vcfR() function, and then convert it to genind format using the vcfR2genind() function.
# Read in the LD-pruned VCF
LDpruned_vcf = read.vcfR("popgen_demo_krumlov23_maf05_LDpruned.vcf.gz")
# Convert our VCF to genind format
genind_input = vcfR2genind(LDpruned_vcf)
The first step for our DAPC analysis is to identify the most likely number of clusters. To do this, we use the find.clusters() function. There are two options to bear in mind here: max.n.clust and n.pca. These options restrict the exploration of parameters to limits of our choice. Set each of these to 20, but feel free to explore how they affect the analysis.
# Identify the most likely k-value for number of clusters
# When we run this, we see steep declines in BIC up to k = 5, so we'll pick that for our cluster N
grp <- find.clusters(genind_input,
                     max.n.clust = 20,
                     n.pca = 20)
## Choose the number of clusters (>=2):
This function shows us how model fit changes as we fit it with different numbers of clusters. We can see a sharp decline in BIC (improving model fit) up to a k of 5, and only moderate improvements at k of 6+. This implies that 5 clusters is the best fit for our data. Set k = 5.
grp <- find.clusters(genind_input,
n.clust=5,
n.pca = 20)
As well as helping us choose the most likely number of clusters in our data, this function also assigns our individuals into groups. Assuming you’ve saved your clusters to a variable called grp, we can look at these by running grp$grp.
head(grp$grp)
## p0.0 p0.1 p0.2 p0.3 p0.4 p0.5
## 4 4 4 4 4 4
## Levels: 1 2 3 4 5
These groups are given an arbitrary number as an ID, so let’s just quickly change this (you can just run this bit of code, don’t worry about it!)
# Fetch the grp number for each actual pop
new_grp_names = lapply(paste0('p',0:4),function(p){
as.character(grp$grp[grep(p,names(grp$grp))][1])
})
names(new_grp_names) = paste0('p',0:4)
# Recode factor levels by name
levels(grp$grp) <- new_grp_names
We will now attach these cluster IDs to our plink PCA scores from earlier, and re-plot the PCA results with the points now coloured according to their assigned groups. If using ggplot(), we can also now add stat_ellipse() to plot a 95% confidence ellipse around each cluster.

Hint: You can do this easily by making a data.frame with 2 columns called ‘ind’ and ‘pop’, by taking names(grp$grp) and grp$grp. You can then use the merge() function to attach these to your PC scores data.frame.
# We can check these clusters against our PCA results by merging them...
ind_groups = data.frame(ind = names(grp$grp),
pop = grp$grp)
plink_scores_ids = merge(plink_scores,ind_groups)
# Plot together, now colouring by group assignment...
ggplot(plink_scores_ids,aes(PC1,PC2,colour = pop)) +
geom_point() +
stat_ellipse()
Now that we’ve got our groups, we can perform DAPC to maximise variation among them. To do this, we’ll run the dapc() function, providing it with our genind input, our group assignments, and the n.pca option from before, and we’ll set n.da = 4 so we produce 4 discriminant axes to separate our 5 groups. Plot the output of this using the scatter() function.
# We can then use these groups for DAPC
# We'll retain all discriminant functions seeing as we want to describe diffs between 5 clusters
dapc_res <- dapc(genind_input,
grp$grp,
n.pca = 20,
n.da = 4)
# Display results in a scatter plot
scatter(dapc_res)
A useful feature of DAPC is that we can check how confident the analysis is in our group assignments. This information is stored in the dapc_res$posterior slot. This matrix describes the probability that each individual is assigned to each group. Use this matrix to produce a data.frame that describes, for each individual, the cluster it has been assigned to and the probability.

Hint: To work through the matrix, you can either use a for-loop or the apply() function to find the max probability.
# A useful feature of dapc is we get a membership probability
head(dapc_res$posterior)
## p0 p1 p2 p3 p4
## p0.0 1 0 0 0 0
## p0.1 1 0 0 0 0
## p0.2 1 0 0 0 0
## p0.3 1 0 0 0 0
## p0.4 1 0 0 0 0
## p0.5 1 0 0 0 0
# And we can use this to pull out the group membership and the probability
head(data.frame(ind = rownames(dapc_res$posterior),
cluster = apply(dapc_res$posterior,1,which.max),
prob = apply(dapc_res$posterior,1,max)))
## ind cluster prob
## p0.0 p0.0 1 1
## p0.1 p0.1 1 1
## p0.2 p0.2 1 1
## p0.3 p0.3 1 1
## p0.4 p0.4 1 1
## p0.5 p0.5 1 1
As a final step for this analysis, we can produce an admixture plot that visualises the probability of group assignment for each individual as grouped bars. The easiest way to do this is with the compoplot() function, but if you’re feeling confident with ggplot() you could also do this using the raw data from the matrix above; a ggplot() sketch follows the compoplot example below.
# We can also plot these populations in a structure like plot as well
compoplot(dapc_res, posi="bottomright",
txt.leg=paste("Cluster", 1:5), lab="",
ncol=1, xlab="individuals", col=funky(5))
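If you’d like to try the ggplot() route, here is one possible sketch that reshapes the posterior matrix into long format (the object names follow those created above):

# Melt the posterior matrix to long format: one row per individual x cluster
post_long = data.frame(ind = rep(rownames(dapc_res$posterior), ncol(dapc_res$posterior)),
                       cluster = rep(colnames(dapc_res$posterior), each = nrow(dapc_res$posterior)),
                       prob = as.vector(dapc_res$posterior))
# Stacked bars of membership probability per individual
ggplot(post_long, aes(x = ind, y = prob, fill = cluster)) +
  geom_col() +
  theme(axis.text.x = element_blank()) +
  labs(x = 'Individuals', y = 'Membership probability')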
DAPC includes a number of steps to validate results, including the alpha score and cross-validation. How to do these is described in an in-depth DAPC tutorial by the developer (link at the bottom of the document, section 4). Give this a go and optimise the number of retained PCs.
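If you want a starting point, here is a minimal sketch of both approaches in adegenet, assuming the dapc_res, genind_input and grp objects from above (n.rep here is an illustrative choice):

# 1. The alpha-score penalises retaining too many PCs
optim.a.score(dapc_res)
# 2. Cross-validation; xvalDapc() expects an allele-count matrix with NAs imputed
geno_mat <- tab(genind_input, NA.method = "mean")
xval <- xvalDapc(geno_mat, grp$grp, n.pca.max = 20, n.rep = 30)
# Report the number of PCs that minimised error in cross-validation
xval$`Number of PCs Achieving Lowest MSE`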
Explore how choosing different values of k influences the probability of group assignment. It is common to show admixture plots like the one above for multiple k values from 2 to k. These plots are informative for seeing hierarchical population structure.
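One possible sketch for this exercise, looping over an illustrative range of k values (n.da = k - 1 simply retains all discriminant functions for each k):

# Re-run clustering and DAPC for a range of k values and plot membership
for(k in 2:5){
  grp_k <- find.clusters(genind_input, n.clust = k, n.pca = 20)
  dapc_k <- dapc(genind_input, grp_k$grp, n.pca = 20, n.da = k - 1)
  compoplot(dapc_k, lab = "", ncol = 1)
}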
Re-run the DAPC analysis with 100 random SNPs. DAPC, like any statistical process, is sensitive to statistical power. With fewer SNPs, we have less power to detect structure.
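One possible way to do the subsampling in adegenet (a sketch; sample() makes this stochastic, so set.seed() keeps it reproducible):

# Subsample 100 random loci from the genind object and re-run DAPC
set.seed(1)
genind_100 <- genind_input[loc = sample(locNames(genind_input), 100)]
grp_100 <- find.clusters(genind_100, n.clust = 5, n.pca = 20)
dapc_100 <- dapc(genind_100, grp_100$grp, n.pca = 20, n.da = 4)
scatter(dapc_100)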
Nucleotide diversity describes the proportion of nucleotides that are expected to differ between two given sequences taken from a population. A greater estimate of nucleotide diversity implies there is more genetic variation in a given population or in a given region of the genome.
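To make this concrete: a π of 0.001 means that two sequences sampled at random from the population are expected to differ, on average, at 1 in every 1,000 sites.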
This is another simple analysis that can quickly tell us a lot about the data we’re working with. Nucleotide diversity varies extensively across the genome; regions of high diversity may reflect regions under balancing selection or mutation hotspots, whilst regions exhibiting low diversity may be under strong selection or in regions of low recombination and high background selection.
There are many ways to calculate genetic diversity; here we’ll use vcftools and the --window-pi option. We’ll run this individually for each population, seeing as we are interested in within-population nucleotide diversity. To do this, we will need to produce a popmap file for each population, which is simply a text file that lists the individuals in that population.
Use for-loops to first produce a popmap text file for each population, and then use these popmaps with the --keep option of vcftools to calculate pi for each population in 20 kb windows. Run this over the original maf-filtered VCF, rather than the linkage-pruned VCF!
##### CALCULATING PI ######
# What window size shall we look at?
WINDOW_SIZE=20000
# We want to do this on a per-population basis, so we'll start by making popmaps
for i in {0..4}
do
# Use grep to pull individual IDs from the vcf into a file
bcftools query -l $VCF | grep "p${i}" > p${i}.popmap
done
# We want to do this using a for-loop as we'll be running it within each of our 5 populations
for i in {0..4}
do
# Nice little message to tell us what we're doing
echo ">>> Starting pi calculation for pop${i}"
# Make a variable to store pop
POP=p${i}
# Calculate nucleotide diversity with vcftools
vcftools --gzvcf $VCF \
--keep ${POP}.popmap \
--window-pi $WINDOW_SIZE \
--out ${POP}
done
Having calculated pi, read each of the outputs back into R and plot pi along the genome for each of the populations. Align your plots vertically so that you can see any trends across populations. Consider how pi varies along the genome, and whether any regions or populations stand out.
Hint: If you format your pi results with a column that has the population ID, you can use facet_wrap() in ggplot() to align your plots vertically
# Identify all the pi results and read them into a list
pi_files = list.files(pattern = 'windowed.pi')
pi_res = lapply(pi_files,read.table,header = T)
# We can loop through our list and add an identifier to each
for(i in 1:length(pi_res)){
pi_res[[i]]$pop = paste0('p',i - 1)
}
# Collapse the list into a single data.table
all_pi_res = rbindlist(pi_res)
# And we can now plot together with facet_wrap
ggplot(all_pi_res,aes(x = BIN_START,y = PI)) +
geom_line() +
facet_wrap(~pop,ncol = 1) +
labs(x = 'Chromosome Position (bp)',
y = expression(Nucleotide~Diversity~(pi)))
We can also take all of our windowed estimates of nucleotide diversity and compare the average nucleotide diversity among populations, either by calculating the mean or plotting. Try both now…
# Mean nuc div per population
all_pi_res[,.(mean(PI)),by = pop]
## pop V1
## 1: p0 0.0004821958
## 2: p1 0.0007959663
## 3: p2 0.0007576140
## 4: p3 0.0008544700
## 5: p4 0.0007201156
# Visualise differences in nucleotide diversity
ggplot(all_pi_res,aes(x = pop,y = PI)) +
geom_boxplot() +
labs(x = 'Population',y = expression(Windowed~pi))
Our results for nucleotide diversity indicate that the ancestral Highlands (p0) population has lower nucleotide diversity than the other populations.
Note: To calculate an unbiased estimate of genetic diversity, we need to know not only where genetic variants are observed, but also where they are absent due to issues with read-mapping. This issue is discussed in-depth in Korunes and Samuk 2021 - pixy: Unbiased estimation of nucleotide diversity and divergence in the presence of missing data. This is not an issue here as our data is simulated, but it should be kept in mind for real sequencing data.
Two of our populations exhibit reduced diversity around 5 Mb (p2 and p4), which may be indicative of selection acting on these regions. Selective sweeps are a common mode by which adaptation can occur in the genome. In this mode, an adaptive variant appears in a population (either through mutation or introgression) and rapidly increases in frequency. Because of linkage, the haplotype carrying the adaptive variant also increases in frequency, subsequently removing local haplotype diversity around the region harbouring the adaptive variant. This process creates a signature of reduced local diversity that we can test for.
Below is a simplified schematic of how sweeps reduce local diversity (ignoring opportunities for recombination). As the haplotype carrying the adaptive mutation increases in frequency, it replaces other haplotypes carrying diversity. In this example, we lose 4 SNPs from the population as the adaptive haplotype fixes.
Hard selective sweep example
Sweepfinder2 scans the genome for sweep-like regions by first modelling how background selection and recombination rate affect local genetic diversity across the genome. It then uses this estimate (called a B-value) to evaluate whether, at a given region of the genome, the neutral allele frequency spectrum deviates from expectations using a Composite Likelihood Ratio (CLR) test.
See the short article at DeGiorgio et al. 2016 - SweepFinder2: increased sensitivity, robustness and flexibility for more detail.
In our previous analyses we used a VCF filtered for a minor allele frequency of 5%. This is ok for some analyses, but for analyses that are focussed on the site-frequency spectrum (SFS) it will be missing important information on low frequency variants because these have been removed.
We do maf-filtering to remove low-frequency variants that are more likely to be erroneous. So best practice for analyses like this is to find a balance between data quality and retaining low-frequency sites. For instance, we might remove only SNPs that have an allele count of 1 (singletons).
The script to do this singleton-filtering is below, but in the interest of reducing running time (and I already know it gives the same result) we will just go ahead and do our sweep analysis using our maf 5%-filtered VCF because it has fewer SNPs and runs quicker.
# Filter all SNPs and remove singletons
bcftools view --min-ac 2 ../popgen_demo_krumlov23.vcf.gz -Oz -o popgen_demo_krumlov23_mac2.vcf.gz
To run Sweepfinder2, we will first need to produce files in the requested input format. The input file is a four-column text file with:

- position = position of the SNP
- x = allele count for that SNP in the population
- n = the total number of alleles in the population (2N)
- folded = a column of ‘1’s that tells Sweepfinder2 that we are giving it a ‘folded’ site frequency estimate (we don’t know ancestral/derived states)

Hint: You can use vcftools with the --counts2 option to get allele counts!

For example:
## position x n folded
## 1: 63 32 100 1
## 2: 573 0 100 1
## 3: 1198 37 100 1
## 4: 1600 24 100 1
## 5: 1834 0 100 1
## 6: 2199 0 100 1
After making the formatted input file, we can run SweepFinder2 by including the -sg option to define window size (let’s stick with 20kb…).
Put all of this together in a command-line for-loop. In each loop you should:

- subset the VCF for only individuals in the popmap ($VCF_SFS if you made the singleton-filtered file, otherwise $VCF as we are doing here)
- extract allele counts with vcftools --counts2
- reformat the counts into the SweepFinder2 input format
- run SweepFinder2 on the formatted input

# We want to do this using a for-loop as we'll be running it within each of our 5 populations
for i in {0..4}
do
# Make a variable to store pop
POP=p${i}
# Get counts from the VCF files and only keep individuals from this pop
vcftools --gzvcf $VCF \
--keep ${POP}.popmap \
--counts2 \
--out ${POP}
# Now we need to edit the format of the outputted file to SweepFinder standard
tail -n+2 ${POP}.frq.count | awk -v OFS="\t" '{print $2,$6,$4,"1"}' > temp
# Add header
echo -e 'position\tx\tn\tfolded' | cat - temp > ${POP}_SF.in
# Keep tidy
rm -f ${POP}*frq* $POP.log
# Run sweepfinder in windows
SweepFinder2 -sg $WINDOW_SIZE ${POP}_SF.in ${POP}_SF.out
done
This takes around 10 minutes to run, so feel free to grab a coffee or take a break!
Having run these, read in the results as above and plot the outputs. If you look especially carefully you should just about make out two peaks of elevated likelihood suggestive of two selective sweeps, one in p2 (Leeds) and one in p4 (Buckingham Palace).
# Read in and plot sweepfinder results ------------------------------------
sf_files = list.files(pattern = 'SF.out')
sf_res = lapply(sf_files,read.table,header = T)
# We can loop through our list and add an identifier to each
for(i in 1:length(sf_res)){
sf_res[[i]]$pop = paste0('p',i - 1)
}
# Collapse the list into a single data.table
all_sf_res = rbindlist(sf_res)
# And we can now plot together with facet_wrap
ggplot(all_sf_res,aes(x = location,y = LR)) +
geom_line() +
facet_wrap(~pop,ncol = 1) +
labs(x = 'Chromosome Position (bp)',y = 'CLR')
Note: Detecting selective sweeps is easiest when identifying ‘hard’ sweeps, in which a mutation has arisen at very low frequency and increased in frequency until it has become fixed. Selective sweeps in which a mutation has increased in frequency from an already appreciable frequency (a ‘soft’ sweep), or has yet to fix within a population (an ‘incomplete’ sweep) are more difficult to detect, yet may be more common in natural populations.
Additional exercise: Use vcftools and the --weir-fst-pop option to calculate FST between pairs of populations in sliding windows, and compare these results to our Sweepfinder2 results (one possible sketch follows the next paragraph).

Having observed that two of our populations show evidence of selection having occurred at the same region, a question we might next be interested in asking is how this has occurred.
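For the FST exercise above, here is a minimal sketch using vcftools (one of several reasonable approaches; shown for a single population pair, reusing the popmaps and $WINDOW_SIZE from earlier):

# Windowed Weir & Cockerham FST between p2 and p4
vcftools --gzvcf $VCF \
    --weir-fst-pop p2.popmap \
    --weir-fst-pop p4.popmap \
    --fst-window-size $WINDOW_SIZE \
    --out p2_vs_p4
# Repeat for other pairs, then compare windows of high FST to the CLR peaks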
Recall that p2 and p4 were sampled in Leeds and London, which are both much more urban environments than the others. Perhaps our wild cats are undergoing local adaptation to urban environments, and the evolution of a gene within our sweep region facilitates this.
But by what mechanism have both populations undergone evolution at the same genomic region? One possibility is that novel mutations have arisen at the same or similar locus in each population and each has experienced an independent selective sweep. In this sense, a gene around 5 Mb has been modified multiple times, perhaps in different ways, but in a way that has convergently increased local fitness in p2 and p4.
An alternative is that an adaptive variant arose in one population and was introduced to the other following introgression between them. Alternatively, the adaptive variant may have arisen at some point in the metapopulation history prior to the population split, and was subsequently inherited in each before undergoing selection and fixation. In this sense, the gene was modified only once but was transferred or inherited into both populations and selection acted multiple times on it independently.
These mechanisms offer up testable hypotheses in terms of what the underlying tree structure should look like at the region where the sweeps were detected. If mutations are independent, we can predict that the tree structure at the sweep site should reflect the tree given at the start of the worksheet, since the haplotypes carrying each adaptive mutation will be unique to each population. Alternatively, if adaptive introgression is the mechanism, we can predict that the structure of the tree at the selected site will place both populations together, even though they are not sisters. This is because the haplotype carrying the adaptive variant will have come from whichever population the mutation arose in, and subsequently there is a single clade for the adaptive haplotype.
Let’s now explore adaptive introgression among the populations to test this.
\[ D = \frac{ABBA - BABA}{ABBA + BABA} \]
ABBA-BABA (No, not that ABBA)
ABBA-BABA is a straightforward concept originally designed to detect introgression between humans and Neanderthals. It assumes that, given a tree structure of an outgroup and a focal trio, there should be an approximately equal number of alleles showing the ‘ABBA’ pattern relative to the ‘BABA’ pattern; in this case, D ~ 0. An excess of either ABBA or BABA sites suggests that introgression might have taken place within the assumed tree, either between P1 and P3 (excess BABA) or P2 and P3 (excess ABBA). (P1 and P2 are assumed to be sister groups.)
D scores thus range from -1 (excess BABA) to +1 (excess ABBA), and can be tested for significance using a jackknife procedure in which part of the genome is excluded and pseudo-D scores are re-calculated.
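For example, if a trio had 600 ABBA sites and 400 BABA sites, then D = (600 - 400) / (600 + 400) = 0.2, an excess of ABBA suggestive of introgression between P2 and P3 (these counts are invented purely for illustration).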
Dsuite provides a fast and powerful set of functions to calculate D-statistics and other introgression statistics (such as f-branch statistics).
To run Dsuite, we will provide it with our assumed tree from the start of the document, which you can save to a text file. We’ll assume that the ancestral Highlands population represents the outgroup (p0). This tree is important because it tells Dsuite our a priori understanding of the relationships among populations.
The Newick tree for our populations looks like this: (Outgroup,(p1,(p2,(p3,p4)))). Save this to a text file called popgen_demo.nwk.
# Give Dsuite a phylogenetic tree to work with based on stepping-stone model
echo "(Outgroup,(p1,(p2,(p3,p4))))" > popgen_demo.nwk
We’ll also need to make a slightly different popmap, where all of our previous popmaps are combined together with an additional column that gives the population identifier. In this new column, replace p0 with ‘Outgroup’.
## p0.0 Outgroup
## 1: p0.1 Outgroup
## 2: p0.2 Outgroup
## 3: p0.3 Outgroup
## 4: p0.4 Outgroup
## 5: p0.5 Outgroup
## 6: p0.6 Outgroup
# Make a full popmap that includes all inds in col1 and their pop in col2
for i in {0..4}
do
# We'll cat them together and add a column
# Add a sed command to attach the second column
cat p${i}.popmap | sed "s/$/\tp${i}/" >> $DATASET.popmap
done
# Make p0 the outgroup
sed -i 's/\tp0/\tOutgroup/g' $DATASET.popmap
We can now run Dsuite, giving it our Newick tree, popmap and VCF (use our maf 5% VCF). This will calculate D-statistics among all possible trios of p1-p4.
# Run Dsuite...
Dsuite Dtrios -t popgen_demo.nwk $VCF $DATASET.popmap
Dsuite should run quickly, and gives us 3 output files with the following suffixes:

- _tree.txt = ABBA-BABA, taking P1 and P2 as sisters in our tree.
- _Dmin.txt = ABBA-BABA, taking P1 and P2 as the populations that minimise D.
- _BBAA.txt = ABBA-BABA, taking P1 and P2 as the populations that maximise BBAA sites.

Explore each of these outputs and see how they differ. Each output includes the D-statistic, its Z-score and p-value, and the numbers of sites with different patterns for each trio.
# Look at these results
# Dmin
cat popgen_demo_krumlov23_Dmin.txt
# BBAA
cat popgen_demo_krumlov23_BBAA.txt
# Tree
cat popgen_demo_krumlov23_tree.txt
We can see that the BBAA and Dmin outputs both assume that our Leeds (p2) and London (p4) populations are sisters, i.e. P1 and P2. However, we ‘know’ that this is not the case. The tree.txt file maintains our tree order, and suggests there is strong evidence for introgression between London and Leeds. Indeed, if introgression is strong enough it may overwhelm the original tree, which is why we see discordance in these outputs. Supplying the tree therefore allows us to test specific hypotheses regarding introgression.
These results tell us that it is highly likely that introgression has occurred between London and Leeds. But this does not confirm that our selective sweep region is the cause of this.
To look into this, we can run Dsuite again using the Dinvestigate option. We just need to tell it which trio we’d like to investigate by giving it a text file in the format P1\tP2\tP3. In this case, P1 = p3, P2 = p4 and P3 = p2. This tells Dsuite we are interested in introgression specifically between p4 and p2.
# Make our test trios file
echo -e "p3\tp4\tp2" > test_trio.txt
Dinvestigate calculates D-statistics in sliding windows of N SNPs, allowing us to see where ABBA-BABA imbalance is particularly high. Specify a window size of 100 SNPs, and slides of 50 SNPs.
# Run Dinvestigate
Dsuite Dinvestigate $VCF \
$DATASET.popmap \
test_trio.txt \
-w 100,50 \
-n $DATASET
This should run quickly. Finally, read the output back into R (p3_p4_p2_localFstats_popgen_demo_krumlov23_100_50.txt) and plot the results. These show us very clearly that the region around our sweep signatures is driving the introgression results. This implies that the same haplotype is responsible for the sweep in both populations.
# Read in Dinvestigate and analyse ----------------------------------------
introgression_winds = fread("p3_p4_p2_localFstats_popgen_demo_krumlov23_100_50.txt")
introgression_winds$midpoint = rowMeans(introgression_winds[,.(windowStart,windowEnd)])
# Plot
ggplot(introgression_winds,aes(midpoint,y = f_dM)) +
geom_point() +
geom_smooth() +
labs(y = 'fdM',x = 'Chromosome Position (bp)')