Introduction

Population Genetics is centred on understanding genetic variation.

  • Where does it come from?
  • Why does it persist?
  • How is it distributed across the genome or across the species’ range?

Towards this end, population geneticists are interested in the fundamental processes of mutation, migration, recombination, selection and drift.

In this lab session, we’ll be touching on most of these and covering a range of common foundational analyses to understand the evolutionary history of a SNP dataset. The goal is for you to get a feel for a basic population genetics workflow through a whistle-stop tour of methods. We could spend an entire day on each step of the analysis, but given we only have 3 hours we’ll be learning a little about a few things rather than deep-diving on one thing.

It’s expected that there is a range of abilities in the cohort, and consequently some may progress more quickly than others. At the end of each section there are additional, optional exercises for anyone who is working quickly.

It’s also important to learn how to troubleshoot errors and problems on your own! Remember that for command-line programs we can usually type --help, or in R type ?my_function, to see documentation.


The Dataset

We’ll be working with a simulated dataset that includes a number of conditions and events that we will uncover over the course of the session. The workshop is designed around the idea of building up an understanding of a SNP dataset that we may have limited a priori knowledge of, as is common, for example, when working with sequencing data from natural populations. Starting with exploratory summary analyses, we will generate and test hypotheses to explore evolutionary questions.

Let’s pretend our data comes from a study into a hypothetical, yet MIRACULOUS, conservation success story of Scottish Wild Cats spreading south into England from their Highlands refugia. In this study, samples were collected from a number of individuals in the ancestral Highlands population (p0) and at four further locations moving south through the Scottish Borders and into England, at Kielder Forest (p1), Leeds (p2), Norfolk (p3), and London (p4).

Scottish Wild Cats are back!

The only information we will start with is a simple tree summary of the 5 populations, which we predict from the known north-to-south colonisation history:

plot(ape::read.tree(text = '(Highlands_p0,(Kielder_p1,(Leeds_p2,(Norfolk_p3,London_p4))));'))


Prepare the working directory

Create a new directory at the following path and move into it: ~/workshop_materials/population_genomics/workdir

# Create our workdir and move into it
mkdir -p ~/workshop_materials/population_genomics/workdir
cd ~/workshop_materials/population_genomics/workdir

Also, define the following variables in your command line. Variables help keep scripts tidy, and should we want to change the inputs, it is much simpler to change one line of code that defines the variable than every line of code where that variable is used!

VCF=popgen_demo_krumlov23_maf05.vcf.gz

DATASET=popgen_demo_krumlov23

We will also be using R for some analyses and to visualise results, so open up R and load the following packages:

  • vcfR
  • ggplot2
  • data.table
  • adegenet
  • GenotypePlot

Set the working directory (using setwd()) to the workdir we created above.

# Load our libraries
library(vcfR)
library(ggplot2)
library(data.table)
library(adegenet)
library(GenotypePlot)

# Set our working directory
setwd("~/workshop_materials/population_genomics/workdir")

Overview of VCF format

Get a general sense of the dataset we’ll be working with; one way to check these in R is sketched after the list:

  • How many individuals are there?
  • How many chromosomes and what are their sizes?
  • How many SNPs?
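
A minimal sketch using vcfR, assuming the raw, unfiltered VCF sits one directory above our workdir (as in the filtering command below):

# Read the raw VCF; vcfR prints a summary as it reads
vcf <- read.vcfR('../popgen_demo_krumlov23.vcf.gz')

# Number of individuals (the first column of the gt slot is FORMAT)
ncol(vcf@gt) - 1

# Number of SNPs per chromosome, and in total
table(getCHROM(vcf))
nrow(vcf@fix)

# Chromosome names and lengths are recorded in the header contig lines
queryMETA(vcf, element = 'contig')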

We’ll start by filtering our VCF based on a minor allele frequency filter of 5%. (You can do this with vcftools or bcftools)

bcftools view --min-af 0.05:minor ../popgen_demo_krumlov23.vcf.gz -Oz -o popgen_demo_krumlov23_maf05.vcf.gz

Step 1 - Population Structure

Visualising population structure can be a good place to start when interpreting a fresh SNP dataset. Population structure provides us with a sense of how individuals are related, whether those relationships form clusters that we might consider populations, and whether there are any unexpected outliers or obvious trends in our data that might be suggestive of sequencing or sampling issues. For instance, if our data has been sequenced in batches or with an unbalanced design, sequencing artifacts may cause individuals to cluster according to when or how they were sequenced (batch effects). Males and females may also separate clearly if the SNP set includes variants from sex-linked regions.


1a) Principal Component Analysis - PCA

PCA is a powerful tool for quickly inferring and visualising population structure. It is an ordination technique that seeks to produce a set of uncorrelated axes (principal components, e.g. PC1, PC2 etc) that maximise genetic variation among all included individuals. In short, this is done by ‘rotating’ the data in multivariate space to find the single axis that explains the most variation (PC1), then repeatedly introducing orthogonal axes that capture successively smaller proportions of the total variation until all variation is explained.
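
To make this concrete, here is a toy sketch on hypothetical random genotypes (not our dataset), using base R’s prcomp() to show how each successive axis captures a smaller share of the variation:

# Toy PCA: 100 hypothetical individuals x 50 SNPs coded as 0/1/2
set.seed(1)
geno <- matrix(rbinom(100 * 50, size = 2, prob = 0.3), nrow = 100)
toy_pca <- prcomp(geno)

# Proportion of variance explained by the first 5 PCs
summary(toy_pca)$importance['Proportion of Variance', 1:5]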

Filter SNPs for linkage

To perform PCA, the first step is to filter our SNP dataset for linkage. Linkage disequilibrium or ‘LD’ describes the non-random associations or correlation between SNPs. LD is particularly strong among SNPs that are physically close together because they are likely to be inherited in common haplotype blocks. LD can also be caused by other things, such as selection, and can occur among SNPs that are not physically close together. When considering population structure, we want to remove as much LD as possible because linked variants carry redundant information about inheritance. We therefore want each SNP to represent an independent variable in our analysis (it’s also computationally more efficient to work with fewer SNPs if we can!).
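
For intuition, the pairwise LD statistic that pruning tools use is essentially the squared correlation (r²) between genotype columns. A toy sketch with two hypothetical SNPs:

# Two hypothetical SNPs coded as 0/1/2 genotypes across 8 individuals
snp1 <- c(0, 0, 1, 1, 2, 2, 0, 1)
snp2 <- c(0, 0, 1, 1, 2, 2, 1, 1)

# Squared genotype correlation; pruning drops one of a pair above the cut-off (e.g. 0.2)
cor(snp1, snp2)^2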

There are many approaches available to filter our data for linkage. We could do this in R using a package like SNPRelate (and you’re welcome to give this a go). One of the most commonly used approaches is Plink’s --indep-pairwise, which removes SNPs in linkage based on a squared-correlation (r²) cut-off (usually 0.2) within windows of a user-defined size. For our window size, we’ll use non-overlapping windows of 50 SNPs, but feel free to explore how changing this parameter influences the number of SNPs that are left following pruning. This software runs on the command-line.

Plink will write the IDs of unlinked SNPs to a file suffixed *.prune.in, which we can use to pull SNPs out of our VCF for other analyses. We then feed this file back to Plink with --extract and add --pca to produce the PCA outputs themselves.

Hint: You’ll also need to use the --allow-extra-chr option.

# Linkage prune with plink, in non-overlapping windows of 50 SNPs
plink --vcf $VCF --allow-extra-chr \
--indep-pairwise 50 50 0.2 \
--out $DATASET

# Run the PCA on the pruned SNPs (writes the .eigenvec and .eigenval files)
plink --vcf $VCF --allow-extra-chr \
--extract $DATASET.prune.in \
--pca --out $DATASET

Explore results in R

Now let’s explore the results! Return to R and read in the PCA outputs, which will be suffixed .eigenval and .eigenvec.

# Read in the eigenvector scores
plink_scores = read.table('popgen_demo_krumlov23.eigenvec')

# Rename the eigenvecs cols
colnames(plink_scores) = c('ind','ind2',paste0('PC',1:(ncol(plink_scores) - 2)))

# Read in the eigenvalues
plink_eigenval = read.table('popgen_demo_krumlov23.eigenval')

Your inputs should end up in two data.frames called plink_scores and plink_eigenval that look like this:

head(plink_scores)
##    ind ind2       PC1          PC2        PC3         PC4          PC5
## 1 p0.0 p0.0 -0.122078  0.012935200 0.00320871 -0.00245191 -0.000822215
## 2 p0.1 p0.1 -0.125386  0.011319500 0.00728584 -0.00351268  0.003910970
## 3 p0.2 p0.2 -0.126704  0.007575760 0.00503239  0.00133178  0.001004640
## 4 p0.3 p0.3 -0.125776  0.005454740 0.00618741  0.00470080  0.000722623
## 5 p0.4 p0.4 -0.125732  0.014062200 0.00147493  0.00100268 -0.003479150
## 6 p0.5 p0.5 -0.123756 -0.000981481 0.00597981  0.00843322  0.002228040
##           PC6          PC7         PC8         PC9         PC10         PC11
## 1  0.00281681 -0.003645220  0.00834636 -0.00479214  0.004214160  9.96145e-05
## 2  0.00893109  0.000240292  0.00916124 -0.00692168 -0.000162668 -7.71035e-03
## 3 -0.01115630 -0.007243930 -0.00222996  0.00576070 -0.004564770  2.10456e-03
## 4 -0.00511106 -0.003291240 -0.00463956  0.00567235  0.001458280  3.50304e-03
## 5  0.00287493  0.001770390  0.00348909 -0.01090230 -0.002004750 -5.69118e-03
## 6 -0.01823700 -0.006327550 -0.01425580  0.02202560 -0.000862466  6.02781e-03
##          PC12        PC13        PC14        PC15        PC16        PC17
## 1 -0.00895890  0.01018750  0.02175990  0.00890439 -0.00743525  0.01685170
## 2 -0.00452610 -0.00194529  0.01643850  0.01370070 -0.01946810  0.00366498
## 3  0.00596315 -0.01390820 -0.01420920 -0.00781733  0.00768012 -0.00359343
## 4  0.00989684 -0.01301740 -0.02102160 -0.00780664  0.01567650 -0.00980370
## 5 -0.00415022  0.01257410  0.00746004  0.00368681 -0.00689611  0.01690440
## 6  0.01204670 -0.03076720 -0.03431040 -0.01517650  0.03400560 -0.02972850
##          PC18        PC19         PC20
## 1 -0.01956960 -0.00557496 -0.000799148
## 2  0.00800027  0.00526460  0.007826790
## 3 -0.00352997 -0.00887329  0.001173550
## 4  0.00530733  0.00455785  0.001065890
## 5  0.01108530  0.00668424  0.008758650
## 6 -0.00448433 -0.00214756  0.000397279
head(plink_eigenval)
##         V1
## 1 48.15550
## 2 22.78070
## 3 14.02390
## 4 10.69700
## 5  4.20565
## 6  3.81928

A good place to start is to look at what proportion of genetic variation is explained by each principal component. This information is provided by the eigenvalues, which are scaling parameters describing the relative importance of each axis.

The proportion of genetic variation explained by the major axes tells us how strong population structure is; populations that separate on a PC axis explaining 1% of variation are less distinct than populations that separate on an axis explaining 20% of genetic variation.

In order to get the proportion of variation explained by each, we need to divide each eigenvalue by the sum of all eigenvalues.

# Examine the eigenvals
plink_eigenval
# Convert these to % variance explained
plink_eigenval$var_explained = 100 * plink_eigenval$V1 / sum(plink_eigenval$V1)
# Check the variance explained
plink_eigenval

We’ll now plot the PCA results using the plink_scores data. The scores represent the transformed value of each individual on each axis, so by visualising them we can see where individuals fall relative to one another. Individuals that are closer together are more closely related.

You can do this using any plotting software you’re familiar with, but the example will be in ggplot2. We’re usually interested in visualising the PC1 scores on the x-axis and the PC2 scores on the y-axis, and it helps to include the % variance explained as part of each axis label. Produce a scatterplot that shows each individual’s position in PCA space, with axis labels in the format ‘PC1 (xx %)’

Hint: You can use the paste0() or paste() function to build an axis label that pulls the variance-explained value directly from your eigenvals table!

Hopefully, your figure looks something like this…

# Plot the PCA results
# PC1 and PC2
ggplot(plink_scores,aes(PC1,PC2)) + 
  geom_point() +
  xlab(paste0('PC1 (',round(plink_eigenval$var_explained[1],2),'%)')) +
  ylab(paste0('PC2 (',round(plink_eigenval$var_explained[2],2),'%)'))

Visualising the results on PC1 and PC2 provides us with a sense of population structure. The distance between points is analogous to how related individuals are, so ‘populations’ appear as clusters of individuals. F1 hybrids between populations appear equidistant between clusters. PC1 and PC2, however, only represent some of the total variation, so it can help to explore additional axes. For example, PC1 and PC2 give the appearance of 4 clusters, but if we plot PC2 and PC3, we can see 5 clusters…

# PC2 and PC3
ggplot(plink_scores,aes(PC2,PC3)) + 
  geom_point() +
  xlab(paste0('PC2 (',round(plink_eigenval$var_explained[2],2),'%)')) +
  ylab(paste0('PC3 (',round(plink_eigenval$var_explained[3],2),'%)'))

In fact, to get a view of population structure that incorporates all of our PC axes, we can simply calculate the distance among all individuals using the dist() function. This calculates the Euclidean distance among all individuals, which is analogous to how closely related they are.

Don’t worry about doing this (unless you’d really like to!), but if we plot the distance in PC space between 50 random individuals we can see clearly that distance is lower among individuals from the same population, and our ancestral Highlands (p0) population is more distinct than the others.

# Extract only the scores
only_scores = plink_scores[,3:ncol(plink_scores)]
# Multiply each column by its eigenvalue
for(i in 1:nrow(plink_eigenval)){
  only_scores[,i] = only_scores[,i] * plink_eigenval$V1[i]
}

# Plot a random sample of 50 individuals
rownames(only_scores) = plink_scores$ind
pca_dists = dist(only_scores[sample(1:nrow(only_scores),50),])
pheatmap::pheatmap(as.matrix(pca_dists))

Depending on time…

The sense of population structure that PCA provides is highly context-dependent. It depends on which individuals we include, how the data was originally filtered, and how many SNPs we included. Explore how your PCA results change if you:

  • Remove all individuals that have p0 in their name.
  • Include only 100 SNPs. How few SNPs can you use to re-create the pattern of population structure above?
  • Filter the original dataset to retain only singletons, rather than applying a minor allele frequency filter of 5%

Try to write a function called plot_plink_PCA() that lets you plot the results of a plink PCA analysis by providing it with the paths to the .eigenvec and .eigenval files.
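
If you get stuck, here is one possible sketch of such a function. It assumes plink’s default .eigenvec layout (two ID columns followed by the PC scores), as used above:

# A sketch of plot_plink_PCA(); x_pc/y_pc choose which axes to plot
plot_plink_PCA <- function(eigenvec_path, eigenval_path, x_pc = 1, y_pc = 2){
  # Read and rename the scores as before
  scores <- read.table(eigenvec_path)
  colnames(scores) <- c('ind', 'ind2', paste0('PC', 1:(ncol(scores) - 2)))

  # Convert eigenvalues to % variance explained
  eigenval <- read.table(eigenval_path)
  var_explained <- 100 * eigenval$V1 / sum(eigenval$V1)

  # Scatterplot of the requested axes, labelled with variance explained
  ggplot(scores, aes(.data[[paste0('PC', x_pc)]], .data[[paste0('PC', y_pc)]])) +
    geom_point() +
    xlab(paste0('PC', x_pc, ' (', round(var_explained[x_pc], 2), '%)')) +
    ylab(paste0('PC', y_pc, ' (', round(var_explained[y_pc], 2), '%)'))
}

# For example:
plot_plink_PCA('popgen_demo_krumlov23.eigenvec', 'popgen_demo_krumlov23.eigenval')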


1b) Discriminant Analysis of Principal Components - DAPC

Whilst PCA is powerful and intuitive, it doesn’t formally test whether there are populations or how many are statistically supported. DAPC is an extension of PCA that includes a step to determine how many populations best describe the data. Having identified a likely number of populations (k populations), we can then re-analyse the data in a way that seeks to maximise genetic variation among our k clusters, as opposed to among all individuals as in PCA.

To do this, we’ll need to produce a VCF that only includes our linkage-pruned SNPs from the previous section. We can do this using bcftools and the view and --include options.

# We'll also make a new VCF of the linkage-pruned SNPs
bcftools view --include "ID==@${DATASET}.prune.in" $VCF -Oz -o ${DATASET}_maf05_LDpruned.vcf.gz

We can then read this pruned VCF into R using the read.vcfR() function, and then convert it to genind format using the vcfR2genind() function.

# Read in the LD-pruned VCF
LDpruned_vcf = read.vcfR("popgen_demo_krumlov23_maf05_LDpruned.vcf.gz")

# Convert our VCF to genind format
genind_input = vcfR2genind(LDpruned_vcf)

The first step for our DAPC analysis is to identify the most likely number of clusters. To do this, we use the find.clusters() function. There are two options to bear in mind here: the max.n.clust and n.pca options. These options restrict the exploration of parameters to limits of our choice. Set each of these to 20, but feel free to explore how they affect the analysis.

# Identify the most likely k-value for number of clusters
# When we run this, we see steep declines in BIC up to k = 5, so we'll pick that for our cluster N
grp <- find.clusters(genind_input,
                     max.n.clust=20,
                     n.pca = 20)

## Choose the number of clusters (>=2):

This function shows us how model fit changes as we fit it with different numbers of clusters. We can see a sharp decline in BIC (improving model fit) up to a k of 5, and only moderate improvements at k of 6+. This implies that 5 clusters is the best fit for our data. Set k = 5.

grp <- find.clusters(genind_input,
                     n.clust=5,
                     n.pca = 20)

As well as helping us choose the most likely number of clusters in our data, this function also assigns our individuals into groups. Assuming you’ve saved your clusters to a variable called grp, we can look at these by running grp$grp.

head(grp$grp)
## p0.0 p0.1 p0.2 p0.3 p0.4 p0.5 
##    4    4    4    4    4    4 
## Levels: 1 2 3 4 5

These groups are given an arbitrary number as an ID, so let’s just quickly change this (you can just run this bit of code, don’t worry about it!)

# Fetch the grp number for each actual pop
new_grp_names = lapply(paste0('p',0:4),function(p){
  as.character(grp$grp[grep(p,names(grp$grp))][1])
})
names(new_grp_names) = paste0('p',0:4)

# Recode factor levels by name
levels(grp$grp) <- new_grp_names

We will now attach these cluster IDs to our plink PCA scores from earlier, and re-plot the PCA results with the points now coloured according to their assigned groups. If using ggplot(), we can also now add stat_ellipse() to plot a 95% confidence ellipse around each cluster.

Hint: You can do this easily by making a data.frame with 2 columns called ‘ind’ and ‘pop’, by taking names(grp$grp) and grp$grp. You can then use the merge() function to attach these to your PC scores data.frame

# We can check these clusters against our PCA results by merging them...
ind_groups = data.frame(ind = names(grp$grp),
                        pop = grp$grp)
plink_scores_ids = merge(plink_scores,ind_groups)

# Plot together, now colouring by group assignment...
ggplot(plink_scores_ids,aes(PC1,PC2,colour = pop)) +
  geom_point() + 
  stat_ellipse()

Now that we’ve got our groups, we can perform DAPC to maximise variation among them. To do this, we’ll run the dapc() function, providing it with our genind input, our group assignments, the n.pca option from before, and we’ll set n.da = 4 so we can produce 4 discriminant axes to separate our 5 groups. Plot the output of this using the scatter() function.

# We can then use these groups for DAPC
# We'll retain all discriminant functions seeing as we want to describe diffs between 5 clusters
dapc_res <- dapc(genind_input, 
                 grp$grp,
                 n.pca = 20,
                 n.da = 4)
# Display results in a scatter plot
scatter(dapc_res)

Probability of group assignment

A useful feature of DAPC is that we can check how confident the analysis is of our group assignments. This information is stored in the dapc_res$posterior slot. This matrix describes the probability that each individual is assigned to each group. Use this matrix to produce a data.frame that describes, for each individual, the cluster it has been assigned to and the probability.

Hint: To work through the matrix, you can either use a for-loop or the apply() function to find the max probability

# A useful feature of dapc is we get a membership probability
head(dapc_res$posterior)
##      p0 p1 p2 p3 p4
## p0.0  1  0  0  0  0
## p0.1  1  0  0  0  0
## p0.2  1  0  0  0  0
## p0.3  1  0  0  0  0
## p0.4  1  0  0  0  0
## p0.5  1  0  0  0  0
# And we can use this to pull out the group membership and the probability
head(data.frame(ind = rownames(dapc_res$posterior),
                cluster = apply(dapc_res$posterior,1,which.max),
                prob = apply(dapc_res$posterior,1,max)))
##       ind cluster prob
## p0.0 p0.0       1    1
## p0.1 p0.1       1    1
## p0.2 p0.2       1    1
## p0.3 p0.3       1    1
## p0.4 p0.4       1    1
## p0.5 p0.5       1    1

As a final step for this analysis, we can produce an admixture plot that visualises the probability of group assignment for each individual as grouped bars. The easiest way to do this is with the compoplot() function, but if you’re feeling confident with ggplot() you could also do this using the raw data from the matrix above.

# We can also plot these populations in a structure like plot as well
compoplot(dapc_res, posi="bottomright",
          txt.leg=paste("Cluster", 1:5), lab="",
          ncol=1, xlab="individuals", col=funky(5))
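
If you’d like to try the ggplot() route, a minimal sketch built from the posterior matrix might look like this (reshaping with data.table’s melt(), which we loaded earlier):

# Reshape the posterior matrix to long format: one row per individual x cluster
post_long <- melt(data.table(ind = rownames(dapc_res$posterior),
                             dapc_res$posterior),
                  id.vars = 'ind',
                  variable.name = 'cluster',
                  value.name = 'prob')

# Stacked bars of assignment probability per individual
ggplot(post_long, aes(x = ind, y = prob, fill = cluster)) +
  geom_col() +
  labs(x = 'Individuals', y = 'Assignment probability') +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())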

Depending on time…

  • DAPC includes a number of steps to validate results, including using the alpha score and cross-validation. How to do these is described in an in-depth DAPC tutorial by the developer (link at the bottom of the document, section 4). Give this a go and optimise the number of retained PCs (a starting sketch for cross-validation follows this list).

  • Explore how choosing different values of k influences the probability of group assignment. It is common to show admixture plots like the one above for a range of k values, from 2 up to the chosen k. These plots are informative for seeing hierarchical population structure.

  • Re-run the DAPC analysis with 100 random SNPs. DAPC, like any statistical process, is sensitive to statistical power. With fewer SNPs, we have less power to detect structure.
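
For the cross-validation exercise above, a starting sketch with adegenet’s xvalDapc() might look like this (the tab() call converts the genind to a matrix, mean-imputing missing data):

# Cross-validation: vary retained PCs and measure assignment success on held-out individuals
xval <- xvalDapc(tab(genind_input, NA.method = 'mean'), grp$grp,
                 n.pca.max = 20, training.set = 0.9,
                 result = 'groupMean', n.rep = 30)

# The number of PCs achieving the highest mean assignment success
xval$`Number of PCs Achieving Highest Mean Success`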


Step 2 - Genetic/Nucleotide Diversity (Pi)

Nucleotide diversity describes the proportion of nucleotides that are expected to differ between two given sequences taken from a population. A greater estimate of nucleotide diversity implies there is more genetic variation in a given population or in a given region of the genome.
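
As a toy illustration of this definition (three hypothetical 5-bp sequences, nothing to do with our dataset), pi is the mean number of pairwise differences divided by the sequence length:

# Three hypothetical sequences sampled from one population
seqs <- c('ATGCA', 'ATGTA', 'TTGCA')

# Count mismatches for every pair of sequences
pair_diffs <- combn(seqs, 2, FUN = function(p){
  sum(strsplit(p[1], '')[[1]] != strsplit(p[2], '')[[1]])
})

# Pi = mean pairwise differences per site (here 0.267)
mean(pair_diffs) / nchar(seqs[1])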

This is another simple analysis that can quickly tell us a lot about the data we’re working with. Nucleotide diversity varies extensively across the genome; regions of high diversity may reflect balancing selection or mutation hotspots, whilst regions of low diversity may be under strong selection, or sit in regions of low recombination and high background selection.

Calculate Pi on the command-line

There are many ways to calculate genetic diversity; here we’ll use vcftools and the --window-pi option. We’ll run this individually for each population, seeing as we are interested in within-population nucleotide diversity. To do this, we will need popmap files, which are simply text files, one per population, listing the individuals in that population.

Use for-loops to first produce a popmap text file for each population, and then use these popmaps and the --keep option of vcftools to calculate pi for each population in 20kb windows. Run this over the original maf-filtered VCF, rather than the linkage-pruned VCF!

##### CALCULATING PI ######
# What window size shall we look at?
WINDOW_SIZE=20000

# We want to do this on a per-population basis, so we'll start by making popmaps
for i in {0..4}
do
    # Use grep to pull individual IDs from the vcf into a file
    bcftools query -l $VCF | grep "p${i}" > p${i}.popmap
done

# We want to do this using a for-loop as we'll be running it within each of our 5 populations
for i in {0..4}
do
    # Nice little message to tell us what we're doing
    echo ">>> Starting pi calculation for pop${i}"

    # Make a variable to store pop
    POP=p${i}

    # Calculate nucleotide diversity with vcftools
    vcftools --gzvcf $VCF \
    --keep ${POP}.popmap \
    --window-pi $WINDOW_SIZE \
    --out ${POP}

done

Visualise Nucleotide Diversity in R

Having calculated pi, read each of the outputs back into R and plot pi along the genome for each of the populations. Align your plots vertically so that you can see any trends across populations. Consider:

  • Where is nucleotide diversity high or low?
  • What is the genome-wide mean? Does this vary across populations?
  • Are there any regions specific to individual populations that show high/low diversity? What might cause these?

Hint: If you format your pi results with a column that has the population ID, you can use facet_wrap() in ggplot() to align your plots vertically

# Identify all the pi results and read them into a list
pi_files = list.files(pattern = 'windowed.pi')
pi_res = lapply(pi_files,read.table,header = T)

# We can loop through our list and add an identifier to each
for(i in 1:length(pi_res)){
  pi_res[[i]]$pop = paste0('p',i - 1)
}

# Collapse these into a single data.table
all_pi_res = rbindlist(pi_res)

# And we can now plot together with facet_wrap
ggplot(all_pi_res,aes(x = BIN_START,y = PI)) +
  geom_line() +
  facet_wrap(~pop,ncol = 1) +
  labs(x = 'Chromosome Position (bp)',
       y = expression(Nucleotide~Diversity~(pi)))
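
To tackle the genome-wide mean question above, a quick data.table summary of the combined results does the trick:

# Genome-wide mean pi for each population
all_pi_res[, .(mean_pi = mean(PI)), by = pop]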