Population Genetics is centred around understanding genetic variation.
Towards this end, population geneticists are interested in the fundamental processes of mutation, migration, recombination, selection and drift.
In this lab session, we’ll be touching on most of these and covering a range of common foundational analyses to understand the evolutionary history of a SNP dataset. The goal is for you to get a feel for a basic population genetics workflow through a whistle-stop tour of methods. We could spend an entire day on each step of the analysis, but given we only have 3 hours we’ll be learning a little about a few things rather than doing a deep dive on one thing.
It’s expected that there are a range of abilities in the cohort, and consequently some may progress quicker than others. At the end of each section there are additional exercises that are optional for anyone who is working quickly.
It’s also important to learn how to troubleshoot errors and problems on your own! Remember that for command-line programs we can usually type --help, or in R type ?my_function, to see documentation.
We’ll be working with a simulated dataset that includes a number of conditions and events that we will uncover over the course of the session. The workshop is designed around the idea of building up an understanding of a SNP dataset that we may have limited a priori knowledge of, which is common, for example, when working with sequencing data from natural populations. Starting with exploratory summary analyses, we will generate and test hypotheses to explore evolutionary questions.
Let’s pretend our data comes from a study into a hypothetical, yet MIRACULOUS, conservation success story of Scottish Wild Cats spreading south into England from their Highlands refugium. In this study, samples were collected from a number of individuals from the ancestral Highlands population (p0) and at four further locations moving south through the Scottish borders and into England, at Kielder Forest (p1), Leeds (p2), Norfolk (p3), and London (p4).
Scottish Wild Cats are back!
The only information we will start with is a simple tree summary of the 5 populations, which we predict from the known north-to-south colonisation history:
plot(ape::read.tree(text = '(Highlands_p0,(Kielder_p1,(Leeds_p2,(Norfolk_p3,London_p4))));'))
Create a new directory at the following path and move into it:
~/workshop_materials/population_genomics/workdir
# Create our workdir and move into it
mkdir ~/workshop_materials/population_genomics/workdir
cd ~/workshop_materials/population_genomics/workdir
Also, define the following variables in your command line. Variables help keep scripts tidy, and should we want to change the inputs, it is much simpler to change one line of code that defines the variable than every line of code where that variable is used!
VCF=popgen_demo_krumlov23_maf05.vcf.gz
DATASET=popgen_demo_krumlov23
We will also be using R for some analyses and to visualise results, so open up R and load the following packages:
vcfR
ggplot2
data.table
adegenet
GenotypePlot
Set the working directory (using setwd()) to the workdir we created above.
# Load our libraries
library(vcfR)
library(ggplot2)
library(data.table)
library(adegenet)
library(GenotypePlot)
# Set our working directory
setwd("~/workshop_materials/population_genomics/workdir")
Get a general sense of the dataset we’ll be working with:
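One quick way to do this is to read the raw VCF into R with vcfR and print the resulting object, which reports the number of samples and variants. This is just a sketch, and assumes the raw file sits one directory above our workdir, as in the filtering command below.
# Optional: a quick look at the raw VCF with vcfR
raw_vcf = read.vcfR('../popgen_demo_krumlov23.vcf.gz')
# Printing the object reports the number of samples, chromosomes and variants
raw_vcf
# Sample names (the first column of the genotype matrix is FORMAT, so drop it)
colnames(raw_vcf@gt)[-1]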
We’ll start by filtering our VCF with a minor allele frequency (MAF) cut-off of 5%. (You can do this with vcftools or bcftools.)
bcftools view --min-af 0.05:minor ../popgen_demo_krumlov23.vcf.gz -Oz -o popgen_demo_krumlov23_maf05.vcf.gz
Visualising population structure can be a good place to start when interpreting a fresh SNP dataset. Population structure provides us with a sense of how individuals are related, whether those relationships form clusters that we might consider populations, and whether there are any unexpected outliers or obvious trends in our data that might be suggestive of sequencing or sampling issues. For instance, if our data has been sequenced in batches or with an unbalanced design, sequencing artifacts may cause individuals to cluster according to when or how they were sequenced (batch effects). Males and females may also separate clearly if the SNPs include some from sex-linked regions.
PCA is a powerful tool for quickly inferring and visualising population structure. It is an ordination technique that seeks to produce a set of uncorrelated axes (principal components, e.g. PC1, PC2, etc.) that maximise genetic variation among all included individuals. In short, this is done by ‘rotating’ the data in multivariate space to find the single axis that explains the most variation (PC1), then repeatedly introducing orthogonal axes that capture successively smaller proportions of the total variation until all variation is explained.
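If you’d like to build some intuition for how this works, below is an optional toy example (nothing to do with our dataset) using base R’s prcomp on a small random genotype-like matrix; the matrix and object names here are made up purely for illustration.
# Optional toy example: PCA on a small random 'genotype' matrix
set.seed(42)
toy_genos = matrix(rbinom(20 * 100, size = 2, prob = 0.3), nrow = 20)
toy_pca = prcomp(toy_genos)
# Each successive PC explains a smaller proportion of the total variance
summary(toy_pca)$importance['Proportion of Variance', 1:5]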
To perform PCA, the first step is to filter our SNP dataset for linkage. Linkage disequilibrium or ‘LD’ describes the non-random associations or correlation between SNPs. LD is particularly strong among SNPs that are physically close together because they are likely to be inherited in common haplotype blocks. LD can also be caused by other things, such as selection, and can occur among SNPs that are not physically close together. When considering population structure, we want to remove as much LD as possible because linked variants carry redundant information about inheritance. We therefore want each SNP to represent an independent variable in our analysis (it’s also computationally more efficient to work with fewer SNPs if we can!).
There are many approaches available to filter our data for linkage. We could do this in R using a package like SNPRelate (and you’re welcome to give this a go; there’s an optional sketch after the plink command below). One of the most commonly used approaches is Plink’s --indep-pairwise, which removes SNPs that are in linkage based on a correlation cut-off (usually 0.2) within windows of a user-defined size. For our window size, we’ll use non-overlapping windows of 50 SNPs, but feel free to explore how changing this parameter influences the number of SNPs that are left following pruning. This software runs on the command-line.
Plink will provide us the unlinked SNPs in a file suffixed *.prune.in, which we can use to pull SNPs out of our VCF for other analyses.
Hints: You’ll also need to use the --allow-extra-chr option.
# Linkage prune with plink
# We'll prune in non-overlapping windows of 50 SNPs
plink --vcf $VCF --allow-extra-chr \
--indep-pairwise 50 50 0.2 \
--out $DATASET
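If you’d like to try the SNPRelate route mentioned above, a minimal sketch might look something like the following. Note this is an assumption-laden illustration: the GDS filename is arbitrary, and SNPRelate’s LD measure and threshold are not directly equivalent to plink’s r² cut-off, so the number of retained SNPs will differ.
# Optional alternative: LD pruning in R with SNPRelate (a sketch, not required)
library(SNPRelate)
# Convert the VCF into SNPRelate's GDS format (output filename is arbitrary)
snpgdsVCF2GDS('popgen_demo_krumlov23_maf05.vcf.gz', 'popgen_demo_krumlov23.gds',
              method = 'biallelic.only')
genofile = snpgdsOpen('popgen_demo_krumlov23.gds')
# Prune in sliding windows of 50 SNPs; ld.threshold here is a correlation, not
# an r^2, so this is not an exact match to plink's --indep-pairwise 50 50 0.2
pruned = snpgdsLDpruning(genofile, slide.max.n = 50, ld.threshold = 0.2,
                         method = 'corr', autosome.only = FALSE)
# pruned is a list of retained SNP IDs per chromosome
keep_snps = unlist(pruned)
length(keep_snps)
snpgdsClose(genofile)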
Having produced a set of SNPs that are pruned for linkage, we can perform PCA on the unlinked SNPs. Sticking with Plink, we can do this using the --pca option.
Hints: We will need the --extract option to pull our unlinked SNPs, and we will again need to use the --allow-extra-chr option.
# Perform PCA using our linkage-pruned SNPs
plink --vcf $VCF \
--allow-extra-chr \
--extract $DATASET.prune.in \
--pca \
--out $DATASET
Now let’s explore the results! Return to R and read in the PCA outputs, which will be suffixed .eigenval and .eigenvec.
# Read in the eigenvector scores
plink_scores = read.table('popgen_demo_krumlov23.eigenvec')
# Rename the eigenvecs cols
colnames(plink_scores) = c('ind','ind2',paste0('PC',1:(ncol(plink_scores) - 2)))
# Read in the eigenvalues
plink_eigenval = read.table('popgen_demo_krumlov23.eigenval')
Your inputs should be read into two data.frames called plink_scores and plink_eigenval, which look like this:
head(plink_scores)
## ind ind2 PC1 PC2 PC3 PC4 PC5
## 1 p0.0 p0.0 -0.122078 0.012935200 0.00320871 -0.00245191 -0.000822215
## 2 p0.1 p0.1 -0.125386 0.011319500 0.00728584 -0.00351268 0.003910970
## 3 p0.2 p0.2 -0.126704 0.007575760 0.00503239 0.00133178 0.001004640
## 4 p0.3 p0.3 -0.125776 0.005454740 0.00618741 0.00470080 0.000722623
## 5 p0.4 p0.4 -0.125732 0.014062200 0.00147493 0.00100268 -0.003479150
## 6 p0.5 p0.5 -0.123756 -0.000981481 0.00597981 0.00843322 0.002228040
## PC6 PC7 PC8 PC9 PC10 PC11
## 1 0.00281681 -0.003645220 0.00834636 -0.00479214 0.004214160 9.96145e-05
## 2 0.00893109 0.000240292 0.00916124 -0.00692168 -0.000162668 -7.71035e-03
## 3 -0.01115630 -0.007243930 -0.00222996 0.00576070 -0.004564770 2.10456e-03
## 4 -0.00511106 -0.003291240 -0.00463956 0.00567235 0.001458280 3.50304e-03
## 5 0.00287493 0.001770390 0.00348909 -0.01090230 -0.002004750 -5.69118e-03
## 6 -0.01823700 -0.006327550 -0.01425580 0.02202560 -0.000862466 6.02781e-03
## PC12 PC13 PC14 PC15 PC16 PC17
## 1 -0.00895890 0.01018750 0.02175990 0.00890439 -0.00743525 0.01685170
## 2 -0.00452610 -0.00194529 0.01643850 0.01370070 -0.01946810 0.00366498
## 3 0.00596315 -0.01390820 -0.01420920 -0.00781733 0.00768012 -0.00359343
## 4 0.00989684 -0.01301740 -0.02102160 -0.00780664 0.01567650 -0.00980370
## 5 -0.00415022 0.01257410 0.00746004 0.00368681 -0.00689611 0.01690440
## 6 0.01204670 -0.03076720 -0.03431040 -0.01517650 0.03400560 -0.02972850
## PC18 PC19 PC20
## 1 -0.01956960 -0.00557496 -0.000799148
## 2 0.00800027 0.00526460 0.007826790
## 3 -0.00352997 -0.00887329 0.001173550
## 4 0.00530733 0.00455785 0.001065890
## 5 0.01108530 0.00668424 0.008758650
## 6 -0.00448433 -0.00214756 0.000397279
head(plink_eigenval)
## V1
## 1 48.15550
## 2 22.78070
## 3 14.02390
## 4 10.69700
## 5 4.20565
## 6 3.81928
A good place to start is to look at what proportion of genetic variation is explained by each principal component. This information is provided by the eigenvalues. The eigenvalues represent scaling parameters that describe the relative importance of each axis.
The amount of genetic variation explained by the major axes tells us how strong population structure is; populations that separate on a PC axis that explains 1% of variation are less distinct than populations that separate on a PC axis that explains 20% of genetic variation.
In order to get the proportion of variation explained by each, we need to divide each eigenvalue by the sum of all eigenvalues.
# Examine the eigenvals
plink_eigenval
# Convert these to % variance explained
plink_eigenval$var_explained = 100 * plink_eigenval$V1 / sum(plink_eigenval$V1)
# Check the variance explained
plink_eigenval
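A quick scree plot of these values can help visualise how the variance is spread across axes (optional, here using base R’s barplot):
# Optional: scree plot of % variance explained per PC
barplot(plink_eigenval$var_explained,
        names.arg = paste0('PC', 1:nrow(plink_eigenval)),
        las = 2, ylab = '% variance explained')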
We’ll now plot the PCA results using the plink_scores data. The scores represent the transformed value of each individual on each axis, so by visualising these we can see where individuals sit relative to other individuals. Individuals that are closer together are more closely related.
You can do this using any plotting software you’re familiar with, but the example will be in ggplot2. We’re usually interested in visualising the PC1 scores on the x-axis and the PC2 scores on the y-axis, and it helps to include the % variance explained on each axis as part of the label. The PC scores are found in the eigenvecs file and describe the transformed value of each individual on each axis. Produce a scatterplot that shows each individual’s position in PCA space, with axis labels in the format ‘PC1 (xx %)’.
Hint: You can use the paste0() or paste() function to produce an axis label that pulls the variance explained value directly from your eigenvals table!
Hopefully, your figure looks something like this…
# Plot the PCA results
# PC1 and PC2
ggplot(plink_scores,aes(PC1,PC2)) +
geom_point() +
xlab(paste0('PC1 (',round(plink_eigenval$var_explained[1],2),'%)')) +
ylab(paste0('PC2 (',round(plink_eigenval$var_explained[2],2),'%)'))
Visualising the results on PC1 and PC2 provides us with a sense of population structure. The distance between points is analogous to how related individuals are, so ‘populations’ appear as clusters of individuals. F1 hybrids between populations appear equidistant between clusters. However, PC1 and PC2 only represent some of the total variation, so it can help to explore additional axes. For example, PC1 and PC2 give the appearance of 4 clusters, but if we plot PC2 and PC3, we can see 5 clusters…
# PC2 and PC3
ggplot(plink_scores,aes(PC2,PC3)) +
geom_point() +
xlab(paste0('PC2 (',round(plink_eigenval$var_explained[2],2),'%)')) +
ylab(paste0('PC3 (',round(plink_eigenval$var_explained[3],2),'%)'))
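As an optional extra, if the sample IDs follow the pX.Y format shown in the plink_scores table above (an assumption based on that output), we can strip everything after the ‘.’ to recover a population label and colour the points by it, which makes the clusters much easier to interpret:
# Optional: colour individuals by population, assuming IDs like 'p0.12'
plink_scores$pop = sub('\\..*', '', plink_scores$ind)
ggplot(plink_scores, aes(PC2, PC3, colour = pop)) +
  geom_point() +
  xlab(paste0('PC2 (', round(plink_eigenval$var_explained[2], 2), '%)')) +
  ylab(paste0('PC3 (', round(plink_eigenval$var_explained[3], 2), '%)'))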
In fact, to get a view of population structure that incorporates all of our PC axes, we can simply calculate the distance among all individuals using the dist() function. This calculates the Euclidean distance among all individuals, which is analogous to how closely related they are.
Don’t worry about doing this (unless you’d really like to!), but if we plot the distance in PC space between 50 random individuals we can see clearly that distance is lower among individuals from the same population, and our ancestral Highlands (p0) population is more distinct than the others.
# Extract only the scores
only_scores = plink_scores[,3:ncol(plink_scores)]
# Multiply each column by its eigenvalue to weight each axis by its importance
for(i in 1:nrow(plink_eigenval)){
only_scores[,i] = only_scores[,i] * plink_eigenval$V1[i]
}
# Plot a random sample of 50 individuals
rownames(only_scores) = plink_scores$ind
pca_dists = dist(only_scores[sample(1:nrow(only_scores),50),])
pheatmap::pheatmap(as.matrix(pca_dists))
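If you’re curious, a rough numerical check of this (again assuming the pX.Y sample naming) is to compare mean pairwise distances within and between populations:
# Optional: compare mean PC-space distances within vs between populations
dist_mat = as.matrix(dist(only_scores))
diag(dist_mat) = NA
pops = sub('\\..*', '', rownames(only_scores))
same_pop = outer(pops, pops, '==')
mean(dist_mat[same_pop], na.rm = TRUE)   # mean within-population distance
mean(dist_mat[!same_pop], na.rm = TRUE)  # mean between-population distance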