Population Genetics is centred around understanding genetic variation.
Towards this end, population geneticists are interested in the fundamental processes of mutation, migration, recombination, selection and drift.
In this lab session, we’ll be touching on most of these and covering a range of common foundation analyses to understand the evolutionary history of a SNP dataset. The goals are for you to get a feel of a basic population genetics workflow through a whistle-stop tour of methods. We could spend an entire day on each step of the analysis, but given we only have 3 hours we’ll be learning a little about a few things as opposed to a deep-dive on one thing.
It’s expected that there are a range of abilities in the cohort, and consequently some may progress quicker than others. At the end of each section there are additional exercises that are optional for anyone who is working quickly.
It’s also important to learn how to troubleshoot errors and problems
on your own! Remember that for command-line programs we can usually type
--help
, or in R type ?my_function
, to see
documentation.
We’ll be working with a simulated dataset that includes a number of conditions and events that we will uncover over the course of the session. The workshop is designed around the idea of building up an understanding of a SNP dataset that we may have limited a-priori knowledge of, which can be common for example when working with sequencing data from natural populations. Starting with exploratory summary analyses, we will generate and test hypotheses to explore evolutionary questions.
Let’s pretend our data comes from a study into the a hypothetical, yet MIRACULOUS, conservation success story of Scottish Wild Cats spreading south into England from their Highlands refugia. In this study, samples were collected from a number of individuals from the ancestral Highlands population (p0) and at four further locations moving south through the Scottish borders and into England, at Kielder Forest (p1), Leeds (p2), Norfolk (p3), and London (p4).
The only information we will start with is a simple tree summary of the 5 populations, which we will predict from the known North to South colonisation history:
plot(ape::read.tree(text = '(Highlands_p0,(Kielder_p1,(Leeds_p2,(Norfolk_p3,London_p4))));'))
Create a new directory at the following path and move into it:
~/workshop_materials/population_genomics/workdir
# Change to our workdir
mkdir ~/workshop_materials/population_genomics/workdir
cd ~/workshop_materials/population_genomics/workdir
Also, define the following variables in your command line. Variables help keep scripts tidy, and should we want to change the inputs, it is much simpler to change one line of code that defines the variable than every line of code where that variable is used!
VCF=popgen_demo_krumlov23_maf05.vcf.gz
DATASET=popgen_demo_krumlov23
We will also be using R for some analyses and to visualise results, so open up R and load the following packages:
vcfR
ggplot2
data.table
adegenet
GenotypePlot
Set our working directory (using setwd()
) to our working
directory as above.
# Load our libraries
library(vcfR)
library(ggplot2)
library(data.table)
library(adegenet)
library(GenotypePlot)
# Set our working directory
setwd("~/workshop_materials/population_genomics/workdir")
Get a general sense of the dataset we’ll be working with:
We’ll start by filtering our VCF based on a minor allele frequency filter of 5%. (You can do this with vcftools or bcftools)
bcftools view --min-af 0.05:minor ../popgen_demo_krumlov23.vcf.gz -Oz -o popgen_demo_krumlov23_maf05.vcf.gz
Visualising population structure can be a good place to start when interpreting a fresh SNP dataset. Population structure provides us with a sense of how individuals are related, whether those relationships form clusters that we might consider populations, and whether there are any unexpected outliers or obvious trends in our data that might be suggestive of sequencing or sampling issues. For instance, if our data has been sequenced in batches or with an unbalanced design, sequencing artifacts may cause individuals to cluster according to when or how they were sequenced (batch effects). Males and females may also be clear if SNPs include those coming from sex-linked regions.
PCA is a powerful tool for quickly inferring and visualising population structure. It is an ordination technique that seeks to produce a set of uncorrelated axes (principal components, e.g. PC1, PC2 etc) that maximise genetic variation among all included individuals. In short, this is done by ‘rotating’ the data in multivariate space to find the single axis that explains the most variation (PC1), and repeatedly introducing orthogonal axes that contain increasingly smaller proportions of total variation until all variation is explained.
To perform PCA, the first step is to filter our SNP dataset for linkage. Linkage disequilibrium or ‘LD’ describes the non-random associations or correlation between SNPs. LD is particularly strong among SNPs that are physically close together because they are likely to be inherited in common haplotype blocks. LD can also be caused by other things, such as selection, and can occur among SNPs that are not physically close together. When considering population structure, we want to remove as much LD as possible because linked variants carry redundant information about inheritance. We therefore want each SNP to represent an independent variable in our analysis (it’s also computationally more efficient to work with fewer SNPs if we can!).
There are many approaches available to filter our data for linkage.
We could do this in R using a package like SNPRelate (and you’re welcome
to give this a go). One of the most commonly used approaches is Plink’s
--indep-pairwise
, which removes SNPs that are in linkage
based on a correlation cut-off (usually 0.2) within windows of a
user-defined size. For our window size, we’ll do non-overlapping windows
of 50 SNPs, but feel free to explore how changing this parameter
influences the number of SNPs that are left following pruning. This
software runs on the command-line.
Plink will provide us a file of unlinked SNPs in a file suffixed
*.prune.in
, which we can use to pull SNPs out of our VCF
for other analyses.
Hints: You’ll also need to use the --allow-extra-chr
option.
# Linkage prune with plink
# We'll prune in non-overlapping windows of 50 SNPs
plink --vcf $VCF --allow-extra-chr \
--indep-pairwise 50 50 0.2 \
--out $DATASET
Having produced a set of SNPs that are pruned for linkage, we can
perform PCA on the unlinked SNPs. Sticking with Plink, we can do this
using the --pca
option.
Hints: We will need the --extract
option to pull our
unlinked SNPs, and we will again need to use the
--allow-extra-chr
option.
# Perform PCA using our linkage-pruned SNPs
plink --vcf $VCF \
--allow-extra-chr \
--extract $DATASET.prune.in \
--pca \
--out $DATASET
Now let’s explore the results! Return to R and read in the PCA
outputs, which will be suffixed .eigenval
and
.eigenvec
.
# Read in the eigenvector scores
plink_scores = read.table('popgen_demo_krumlov23.eigenvec')
# Rename the eigenvecs cols
colnames(plink_scores) = c('ind','ind2',paste0('PC',1:(ncol(plink_scores) - 2)))
# Read in the eigenvalues
plink_eigenval = read.table('popgen_demo_krumlov23.eigenval')
Read your inputs into two data.frames called
plink_scores
and plink_eigenval
, that look
like this:
head(plink_scores)
## ind ind2 PC1 PC2 PC3 PC4 PC5
## 1 p0.0 p0.0 -0.122078 0.012935200 0.00320871 -0.00245191 -0.000822215
## 2 p0.1 p0.1 -0.125386 0.011319500 0.00728584 -0.00351268 0.003910970
## 3 p0.2 p0.2 -0.126704 0.007575760 0.00503239 0.00133178 0.001004640
## 4 p0.3 p0.3 -0.125776 0.005454740 0.00618741 0.00470080 0.000722623
## 5 p0.4 p0.4 -0.125732 0.014062200 0.00147493 0.00100268 -0.003479150
## 6 p0.5 p0.5 -0.123756 -0.000981481 0.00597981 0.00843322 0.002228040
## PC6 PC7 PC8 PC9 PC10 PC11
## 1 0.00281681 -0.003645220 0.00834636 -0.00479214 0.004214160 9.96145e-05
## 2 0.00893109 0.000240292 0.00916124 -0.00692168 -0.000162668 -7.71035e-03
## 3 -0.01115630 -0.007243930 -0.00222996 0.00576070 -0.004564770 2.10456e-03
## 4 -0.00511106 -0.003291240 -0.00463956 0.00567235 0.001458280 3.50304e-03
## 5 0.00287493 0.001770390 0.00348909 -0.01090230 -0.002004750 -5.69118e-03
## 6 -0.01823700 -0.006327550 -0.01425580 0.02202560 -0.000862466 6.02781e-03
## PC12 PC13 PC14 PC15 PC16 PC17
## 1 -0.00895890 0.01018750 0.02175990 0.00890439 -0.00743525 0.01685170
## 2 -0.00452610 -0.00194529 0.01643850 0.01370070 -0.01946810 0.00366498
## 3 0.00596315 -0.01390820 -0.01420920 -0.00781733 0.00768012 -0.00359343
## 4 0.00989684 -0.01301740 -0.02102160 -0.00780664 0.01567650 -0.00980370
## 5 -0.00415022 0.01257410 0.00746004 0.00368681 -0.00689611 0.01690440
## 6 0.01204670 -0.03076720 -0.03431040 -0.01517650 0.03400560 -0.02972850
## PC18 PC19 PC20
## 1 -0.01956960 -0.00557496 -0.000799148
## 2 0.00800027 0.00526460 0.007826790
## 3 -0.00352997 -0.00887329 0.001173550
## 4 0.00530733 0.00455785 0.001065890
## 5 0.01108530 0.00668424 0.008758650
## 6 -0.00448433 -0.00214756 0.000397279
head(plink_eigenval)
## V1
## 1 48.15550
## 2 22.78070
## 3 14.02390
## 4 10.69700
## 5 4.20565
## 6 3.81928
A good place to start is to look at what proportion of genetic variation is explained by each principal component. This information is provided by the eigenvalues. The eigenvalues represent scaling parameters that describe the relative importance of each axis.
The more genetic variation that is explained by our major axes tells us how strong population structure is; populations that are separate on a PC axis that explains 1% of variation are less distinct than populations that separate on a PC axis that explains 20% of genetic variation.
In order to get the proportion of variation explained by each, we need to divide each eigenvalue by the sum of all eigenvalues.
# Examine the eigenvals
plink_eigenval
# Convert these to % variance explained
plink_eigenval$var_explained = 100 * plink_eigenval$V1 / sum(plink_eigenval$V1)
# Check the variance explained
plink_eigenval
We’ll now plot the PCA results using the plink_scores
data. The scores represent the transformed value of each individual in
each axis. So by visualising these, we can see where individuals are
plotted relative to other individuals. Individuals that are closer
together are more closely-related.
You can do this using any plotting software you’re familiar, but the
example will be in ggplot2
. We’re usually interested in
visualising the PC1 scores on the x-axis and PC2 sores on the y-axis,
and it helps to include the % variance explained on each axis as part of
the label. The PC scores are found in the eigenvecs file, and describe
the transformed value of each individual on each axis. Produce a
scatterplot that shows each individual’s position in PCA space, with
axis labels in the format ‘PC1 (xx %)’
Hint: You can use the paste0()
or
paste()
function to produce an axis label that pulls the
variance explained value directly from your eigenvals table to produce
an axis label!
Hopefully, your figure looks something like this…
# Plot the PCA results
# PC1 and PC2
ggplot(plink_scores,aes(PC1,PC2)) +
geom_point() +
xlab(paste0('PC1 (',round(plink_eigenval$var_explained[1],2),'%)')) +
ylab(paste0('PC2 (',round(plink_eigenval$var_explained[2],2),'%)'))
Visualising the results on PC1 and PC2 provides us with a sense of population structure. The distance between points is analagous to how related individuals are, so ‘populations’ appear as clusters of individuals. F1 hybrids between populations appear equidistant between clusters. PC1 and PC2 however only represent some of the total variation, so it can help to explore additional axes. For example, PC1 and PC2 give the appearance of 4 clusters, but if we plot PC2 and PC3, we can see 5 clusters…
# PC2 and PC3
ggplot(plink_scores,aes(PC2,PC3)) +
geom_point() +
xlab(paste0('PC2 (',round(plink_eigenval$var_explained[2],2),'%)')) +
ylab(paste0('PC3 (',round(plink_eigenval$var_explained[3],2),'%)'))
In fact, to get a view of population structure that incorporates all
of our PC axes, we can simply calculate the distance among all
individuals using the dist()
function. This calculates the
euclidean distance among all individuals, which is analogous to how
closely related they are.
Don’t worry about doing this (unless you’d really like to!), but if
we plot the distance in PC space between 50 random individuals we can
see clearly that distance is lower among indivduals from the same
population, and our ancestral Highlands (p0
) population is
more distinct than the others.
# Extract only the scores
only_scores = plink_scores[,3:ncol(plink_scores)]
# Multiple each column by its eigenvalue
for(i in 1:nrow(plink_eigenval)){
only_scores[,i] = only_scores[,i] * plink_eigenval$V1[i]
}
# Plot a random sample of 50 individuals
rownames(only_scores) = plink_scores$ind
pca_dists = dist(only_scores[sample(1:nrow(only_scores),50),])
pheatmap::pheatmap(as.matrix(pca_dists))
The sense of population structure that PCA provides is highly context-dependent. It depends on which individuals we include, how the data was originally filtered, and how many SNPs we included. Explore how your PCA results change if you:
p0
in their name.Try and write a function called plot_plink_PCA()
that
lets you plot the results of a plink PCA analysis by providing it with
the path to the eigenvecs
and eigenvals
files.
Whilst PCA is powerful and intuitive, it doesn’t specifically perform
any tests of whether there are populations and how many might be
statistically relevant. DAPC is an extension of PCA that includes a step
to determine how many populations best describe the data. Having
identified a likely number of populations (k populations
),
we can then re-analyse the data in a way that seeks to maximise genetic
variation among our k
clusters, as opposed to among all
individuals as in PCA.
To do this, we’ll need to produce a VCF that only includes our
linkage-pruned SNPs from the previous section. We can do this using
bcftools
and the view
and
--include
options.
# We'll also make a new VCF of the linkage-pruned SNPs
bcftools view --include ID==@$DATASET.prune.in $VCF -Oz -o ${DATASET}_maf05_LDpruned.vcf.gz
We can then read this pruned VCF into R using the
read.vcfR()
function, and then convert to genind format
using the vcfR2genind()
function
# Read in the LD-pruned VCF
LDpruned_vcf = read.vcfR("popgen_demo_krumlov23_maf05_LDpruned.vcf.gz")
# Convert our VCF to genind format
genind_input = vcfR2genind(LDpruned_vcf)
The first step for our DAPC analysis is to identify the most likely
number of clusters. To do this, we use the find.clusters()
function. There are two options to bear in mind here, the
max.n.clust
and n.pca
options. These options
restrict the exploration of parameters to limits of our choice. Set each
of these at 20, but feel free to explore how these affect the
analysis.
# Identify the most likely k-value for number of clusters
# When we run this, we see steep declines in BIC up to k = 5, so we'll pick that for our cluster N
grp <- find.clusters(genind_input,
max.n.clust=20,
n.pca = 20,)
## Choose the number of clusters (>=2):
This function shows us how model fit changes as we fit it with
different numbers of clusters. We can see a sharp decline in BIC
(improving model fit) up to a k
of 5, and only moderate
improvements at k
of 6+. This implies that 5 clusters is
the best fit for our data. Set k = 5
.
grp <- find.clusters(genind_input,
n.clust=5,
n.pca = 20)
As well as helping us choose the most likely number of clusters in
our data, this function also assigns our individuals into groups.
Assuming you’ve saved your clusters to a variable called
grp
, we can look at these by running
grp$grp
.
head(grp$grp)
## p0.0 p0.1 p0.2 p0.3 p0.4 p0.5
## 4 4 4 4 4 4
## Levels: 1 2 3 4 5
These groups are given an arbitrary number as an ID, so let’s just quickly change this (you can just run this bit of code, don’t worry about it!)
# Fetch the grp number for each actual pop
new_grp_names = lapply(paste0('p',0:4),function(p){
as.character(grp$grp[grep(p,names(grp$grp))][1])
})
names(new_grp_names) = paste0('p',0:4)
# Recode factor levels by name
levels(grp$grp) <- new_grp_names
We will now attach these cluster IDs to our plink PCA scores from
earlier, and re-plot the PCA results with the points now coloured
according to their assigned groups. If using ggplot()
, we
can also now add stat_ellipse()
to plot a 95% confidence
ellipse around each cluster.
Hint: You can do this easily by making a data.frame with 2 columns
called ‘ind’ and ‘pop’, by taking names(grp$grp)
and
grp$grp
. You can then use the merge()
function
to attach these to your PC scores data.frame
# We can check these clusters against our PCA results by merging them...
ind_groups = data.frame(ind = names(grp$grp),
pop = grp$grp)
plink_scores_ids = merge(plink_scores,ind_groups)
# Plot together, now colouring by group assignment...
ggplot(plink_scores_ids,aes(PC1,PC2,colour = pop)) +
geom_point() +
stat_ellipse()
Now that we’ve got our groups, we can perform DAPC to maximise
variation among them. To do this, we’ll run the dapc()
function, providing it with our genind input, our group assignments, the
n.pca
option from before, and we’ll set
n.da = 4
so we can produce 4 discriminant axes to separate
our 5 groups. Plot the output of this using the scatter()
function.
# We can then use these groups for DAPC
# We'll retain all discriminant functions seeing as we want to describe diffs between 5 clusters
dapc_res <- dapc(genind_input,
grp$grp,
n.pca = 20,
n.da = 4)
# Display results in a scatter plot
scatter(dapc_res)
A useful feature of DAPC is that we can check how confident the
analysis is of our group assignments. This information is stored in the
dapc_res$posterior
slot. This matrix describes the
probability that each individual is assigned to each group. Use this
matrix to produce a data.frame that describes, for each individual, the
cluster it has been assigned to and the probability.
Hint: To work through the matrix, you can either use a forloop or the apply() function to find the max probability
# A useful feature of dapc is we get a membership probability
head(dapc_res$posterior)
## p0 p1 p2 p3 p4
## p0.0 1 0 0 0 0
## p0.1 1 0 0 0 0
## p0.2 1 0 0 0 0
## p0.3 1 0 0 0 0
## p0.4 1 0 0 0 0
## p0.5 1 0 0 0 0
# And we can use this to pull out the group membership and the probability
head(data.frame(ind = rownames(dapc_res$posterior),
cluster = apply(dapc_res$posterior,1,which.max),
prob = apply(dapc_res$posterior,1,max)))
## ind cluster prob
## p0.0 p0.0 1 1
## p0.1 p0.1 1 1
## p0.2 p0.2 1 1
## p0.3 p0.3 1 1
## p0.4 p0.4 1 1
## p0.5 p0.5 1 1
As a final step for this analysis, we can produce an admixture plot
that visualises the probability of group assignment for each individual
as grouped bars. The easiest way to do this is with the
compoplot()
function, but if you’re feeling confident with
ggplot()
you could also do this using the raw data from the
matrix above.
# We can also plot these populations in a structure like plot as well
compoplot(dapc_res, posi="bottomright",
txt.leg=paste("Cluster", 1:5), lab="",
ncol=1, xlab="individuals", col=funky(5))
DAPC includes a number of steps to validate results. This includes using the alpha score and cross-validation. How to do these is described in an in-depth DAPC tutorial by the developer (link at the bottom of the document, section 4). Give this a go and optimise the number of retained PC’s.
Explore how choosing different values of k
influences the probability of group assignment. It is common to show
admixture plots like the one above for multiple k
values of
2 to k
. These plots are informative in seeing hierarchical
population structure.
Re-run the DAPC analysis with 100 random SNPs. DAPC, like any statistical process, is sensitive to statistical power. With fewer SNPs, we have less power to detect structure.
Nucleotide diversity describes the proportion of nucleotides that are expected to differ between two given sequences taken from a population. A greater estimate of nucleotide diversity implies there is more genetic variation in a given population or in a given region of the genome.
This is another simple analysis that can quickly tell us a lot about the data we’re working with. Nucleotide diversity varies extensively across the genome; regions of high diversity may reflect regions under balancing selection or mutation hotspots, whilst regions exhibiting low diversity may be under strong selection or in regions of low recombination and high background selection.
There are many ways to calculate genetic diversity, here we’ll use
vcftools
and the --window-pi
option. We’ll run
this individually for each population, seeing as we are interested in
within-population nucleotide diversity. To do this, we will need to
produce a popmap file, which is simply a text file for each
population that lists the individuals in that population.
Use for-loops to first produce a popmap text file for each
individual, and then use this popmap and the
--keep
option of vcftools
to calculate pi for
each population in 20kb windows. Run this over the original maf-filtered
VCF, rather than the linkage-pruned VCF!
##### CALCULATING PI ######
# What window size shall we look at?
WINDOW_SIZE=20000
# We want to do this on a per-population basis, so we'll start by making popmaps
for i in {0..4}
do
# Use grep to pull individual IDs from the vcf into a file
bcftools query -l $VCF | grep "p${i}" > p${i}.popmap
done
# We want to do this using a for-loop as we'll be running it within each of our 5 populations
for i in {0..4}
do
# Nice little message to tell us what we're doing
echo ">>> Starting pi calculation for pop${i}"
# Make a variable to store pop
POP=p${i}
# Calculate nucleotide diversity with vcftools
vcftools --gzvcf $VCF \
--keep ${POP}.popmap \
--window-pi $WINDOW_SIZE \
--out ${POP}
done
Having calculated pi, read each of the outputs back into R and plot pi along the genome for each of the populations. Align your plots vertically so that you can see any trends across populations. Consider:
Hint: If you format your pi results with a column that has the population ID, you can use facet_wrap() in ggplot() to align your plots vertically
# Identify all the pi results and read them into a list
pi_files = list.files(pattern = 'windowed.pi')
pi_res = lapply(pi_files,read.table,header = T)
# We can loop through our list and add an identifer to each
for(i in 1:length(pi_res)){
pi_res[[i]]$pop = paste0('p',i - 1)
}
# Collapse these into a list
all_pi_res = rbindlist(pi_res)
# And we can now plot together with facet_wrap
ggplot(all_pi_res,aes(x = BIN_START,y = PI)) +
geom_line() +
facet_wrap(~pop,ncol = 1) +
labs(x = 'Chromosome Position (bp)',
y = expression(Nucleotide~Diversity~(pi)))