Practical session: Inferring demographic parameters from the SFS with composite likelihood method implemented in fastsimcoal2








Objectives of this practical session


  1. What is the SFS?
  • Derived (unfolded) SFS from genotypic data
  • Multi-dimensional SFS (2D-SFS)
  1. Does the SFS contain information about demography?
  • Infer parameters under simple models with fastsimcoal2 (simulated data)
  • Input files for fastsimcoal2
  • Settings to run fastsimcoal2
  • Analyse the output of fastsimcoal2
  • Does our model describe well the data? Fit of expected SFS under the model with observed SFS
  1. Application to real dataset from human populations
  • Apply fastsimcoal2 for a model with two populations
  • Compare results with and without gene flow

REQUIREMENTS

For this practical, we will use R (and R Studio) and call fastsimcoal2 directly from R.
We will then visualize the results in R. It requires the following R packages:

  • plotrix
  • diagram
  • RColorBrewer

You can install these packages with the following commands in R (they are already installed in the AMI, you only need to load the library to use it).

install.packages("plotrix")
install.packages("diagram")
install.packages("RColorBrewer")
library("RColorBrewer")

1. What is the site frequency spectrum (SFS)?

1.1. Build the derived SFS from genotypic data

The SFS (also called Allele frequency spectrum – AFS) is a summary of genome-wide data that describes the number of single nucleotide polymorphisms (SNPs) with a given frequency in our sample.

Consider the following genotypic data matrix where each entry codes for the genotypes as 0 (homozygote for reference allele), 1 (heterozygote), 2 (homozygote for alternative allele). In this matrix each row corresponds to a site in the genome, and each column corresponds to an individual. This is similar to what we can obtain from a variant call format (VCF) file.

Let’s compute the observed SFS from the above matrix, assuming it comes from a single population.

Before starting, answer the following questions:

  • Q1: What is the x-axis of the SFS? What is the range of possible values for the x-axis?

      The x-axis of the SFS contains the absolute frequency of the derived allele in the sample. Thus, it can take values from zero (if at a site all individuals are homozygote for ancestral allele) to \(2*n\) (if all the \(n\) individuals are homozygote for the derived allele). Note that we need to multiply by 2 because individuals are diploid.

  • Q2: What is the y-axis of the SFS? What is the range of possible values for the y-axis?

      The y-axis of the SFS contains the number of sites in the genome with a given absolute frequency. Thus, it can take values between zero (if there are no sites in the genome with a given frequency) to \(S\), where \(S\) is the number of sites.

Now, we can compute the observed SFS. First, we need to compute the absolute allele frequency for each SNP.

Individuals
Derived allele frequency
ind1 ind2 ind3 ind4 ind5 Absolute frequency
snp1 0 1 0 1 0 2
snp2 0 1 0 1 0 2
snp3 0 1 1 1 2 5
snp4 0 1 1 0 0 2
snp5 0 1 0 0 2 3
snp6 1 0 0 1 1 3
snp7 1 0 0 0 0 1
snp8 0 0 1 0 0 1
snp9 0 0 0 0 1 1
snp10 0 0 0 0 1 1
snp11 0 0 1 0 0 1
snp12 0 1 2 1 0 4

Second, we need to count the number of sites with a given allele frequency.

Results.

You should obtain the following table, which can be represented as a barplot. For instance, we can see in the SFS that there are 5 sites with a frequency of 1, 2 sites with a frequency of 3, and zero sites with a frequency larger than 6.

Abs. Frequency of derived allele
0 1 2 3 4 5 6 7 8 9 10
Number of sites 0 5 3 2 1 1 0 0 0 0 0

1.2. Multi-dimensional SFS (2D-SFS)

The SFS can be extended to any number of populations. For instance, if we had sampled data from two populations, the 2D-SFS corresponds to a matrix where the entry (i,j) corresponds to the count of sites with frequency i in population 1, and frequency j in population 2.

Consider the following genotype matrix for 12 SNPs genotyped at 5 individuals from 2 populations, 2 diploid individuals from Pop1, and 3 diploid individuals from Pop2. Each column corresponds to one individual and each row to one SNP. Genotypes are coded as 0, 1, 2 as in Exercise 1.1.

Pop 1
Pop 2
P1_ind1 P1_ind2 P2_ind1 P2_ind2 P2_ind3
snp1 1 0 1 0 0
snp2 1 1 1 0 0
snp3 0 0 0 0 0
snp4 0 1 0 0 0
snp5 0 0 0 0 0
snp6 0 0 1 0 0
snp7 0 1 1 0 0
snp8 0 0 0 1 0
snp9 0 0 1 0 1
snp10 1 0 1 0 1
snp11 0 1 2 0 0
snp12 0 1 2 1 1

Exercise 1.2.1.
Use a piece of paper and a pen to compute the observed 2 dimensional SFS from the above matrices with the genotypes for each population.

The 2D SFS can be seen as a matrix. Before starting answer the following questions:

  • How many rows has the 2D SFS for this dataset? What affects the number of rows?

  • How many columns has this 2D SFS for this dataset? What affects the number of columns?

  • What is the range of possible values for each entry of the matrix?

First, you need to compute the absolute allele frequency for each SNP in each population.
Second, you need to count the number of sites with a given allele frequency in each population.

Example of a 2D-SFS

The following matrix contains the 2D-SFS of 50,000 simulated sites from two populations, with a sample size of 2 diploids for pop1 and 3 diploids for pop2. Hence, we have 7 rows and 5 columns. Each entry (i,j) contains the number of SNPs with a derived allele frequency of i in pop1, and j in pop2. The sum of the entries in this matrix is the total number of sites, which in this case is 50,000. This matrix can be visualy represented by a heatmap, as in the figure below.

Abs. Frequency of derived allele in Pop1
0 1 2 3 4
Abs. Frequency of derived allele in Pop2 0 23688 3450 897 194 27
1 4640 2000 889 338 75
2 1790 1310 929 485 175
3 699 832 763 568 256
4 240 492 627 569 393
5 87 223 417 560 623
6 20 72 228 424 1020

Exercise 1.2.2.
Based on the above matrix with the 2D SFS, answer the following questions.

  • How many sites have 1 heterozygote individual in pop1 and 1 heterozygote individual in pop2?

  • How many sites have a derived allele frequency of 4 in pop1 and a derived allele frequency of 4 in pop2?

  • Can you identify the entry corresponding to the sites where all individuals are heterozygote?

Exercise 1.2.3.
Based on the above matrix with the 2D SFS, obtain the marginal 1D SFS.

Abs. Frequency of derived allele in Pop1
Marginal SFS Pop2
0 1 2 3 4
Abs. Frequency of derived allele in Pop2 0 23688 3450 897 194 27 28256
1 4640 2000 889 338 75 7942
2 1790 1310 929 485 175 4689
3 699 832 763 568 256 3118
4 240 492 627 569 393 2321
5 87 223 417 560 623 1910
6 20 72 228 424 1020 1764
Marginal SFS Pop1
31164 8379 4750 3138 2569 50000

2. Does the SFS contain information about demographic history?

Infer parameters under simple models with fastsimcoal2 with simulated data

We will analyse the models simulated according to the following scenarios.


Data simulated according to the following scenario.

2.1 Input files: DATA

The input files with the observed SFS are in the folder: “./FscInputFiles/”:

  • 2PopDiv20Mb_jointDAFpop1_0.obs: 2D observed SFS for the model with two populations.

Exercise 2.1. Have a look at the input files using a text editor (e.g. RStudio editor, gedit, TextPad, Notepad). You can read them with R using the read.table() function.


# set the working directory
setwd("/home/popgen/workshop_materials/22_fastsimcoal")
# read the observed SFS
# 2D from two populations with different effective sizes that diverged some time ago
pop2d <- read.table("./FscInputFiles/2PopDiv20Mb_jointDAFpop1_0.obs", skip=1, stringsAsFactors = F, header=T, row.names = 1)
# 
head(pop2d)

The observed derived SFS is:

d0_0 d0_1 d0_2 d0_3 d0_4 d0_5 d0_6
d1_0 19969298 8235 2294 945 368 152 49
d1_1 4188 1615 1142 703 471 234 112
d1_2 1204 853 815 674 532 370 221
d1_3 415 429 539 556 535 471 363
d1_4 94 199 296 390 543 695 0

Have a look at the observed SFS file. Does the observed SFS include the monomorphic sites?

The entry for the monomorphic sites is the one with a derived frequency of zero in pop1 and in pop2, i.e. the entry [1,1] corresponding to the 1st row and 1st column.

Input files 2.2: DEFINITION OF MODELS

Exercise 2.2.
Have a look at the TPL and EST files for the models shown above. Use a texteditor (e.g. RStudio editor, gedit, TextPad) to have a look at these files.

Each model is defined in a TPL file. The parameters we want to infer are given a parameter name tag (e.g. NPOP, TDIV). Have a look at the TPL file on the folder “FscInputFiles”, corresponding to the model:

  • 2PopDiv20Mb.tpl

The search range for each parameter is defined in the EST file. For each parameter we specify the search range using the corresponding parameter name tag. Have a look at the EST file on the folder “FscInputFiles”, corresponding to the two model:

  • 2PopDiv20Mb.est

In sum, there are two steps:

  1. Define the model with the TPL file, specifying the parameter tags for those parameters we want to infer. NOTE: different parameters need to be given different parameter name tags, and name tags cannot be a sub-string another one (e.g. NPOP1, NPOP1_2 cannot be used because NPOP1 is a sub-string of NPOP1_2). For this reason, we usually use a “$” at the start and end of a tag, e.g. $NPOP1$, $NPOP1_2$.

  2. Define the search range for each parameter in the EST file.

For further information on definition of models check the manual and the material given during the lecture about specification of models in fastsimcoal2. NOTE: the TPL and EST files need to have the same filename, but have different extensions: [filename].est and [filename].tpl.

TO DISCUSS
From looking at the TPL and EST file:
– What is the mutation rate assumed?
– Do we need a mutation rate estimate to be able to run fastsimcoal2?
– Can we estimate the mutation rate from the SFS?

2.3. Settings to run fastsimcoal2

To run fastsimcoal2 we need to specify the following options:

  • the number of coalescent simulations to approximate the expected SFS (-n option). This should be larger than 100,000. We recommend to use 200,000 to 1,000,000.

  • the number of optimization cycles (-L option). This should be at least 20, but values between 50 and 100 are recommended.

  • use option -M indicating that we will run the likelihood optimization to infer parameters based on the SFS.

  • use option -d for the derived SFS and -m for the MAF SFS.

  • use option -0 if you do not have monomorphic sites in your observed SFS. NOTE: in this case you need to fix a parameter (e.g. a split time or an effective size), and all parameters will be relative to that fixed parameter.

  • use option -q to have a quite mode, without printing lots of information.

  • specify option -C1 such that the likelihood is computed for all entries with at least 1 SNP. If you specify -Cx then all entries with less than x SNPs are pooled together. This option is useful when there are many entries in the observed SFS with few SNPs and with a limited number of SNPS to avoid overfitting.

  • specify options for multithread. -c1 -B1 for a single core, -c4 -B4 for 4 cores using 4 batches.

2.4. Run fastsimcoal2 from R

The following scripts will run 1 fastsimcoal2 optimization, starting from a random initial starting value. When dealing with real data one needs to repeat at least 20 to 100 runs (i.e. runs with different initial starting values), and then we select the run that maximizes the likelihood (i.e. with the highest MaxEstLhood).

# load functions to run fastsimcoal2 and to process the output 
source("utilFscFunctions.r")
source("ParFileInterpreter_VS.r")

# create a list that saves all the required settings to run fastsimcoal2
settings <- list()
# path to fastsicoal2 executable file
# in this case the fastsimcoal2 executable is in the folder /home/popgen/software/fsc26_linux64/
settings$pathToFsc <- "/home/popgen/software/fsc26_linux64/"
# DEFINE OBSERVED SFS INPUT FILE
# path to input files (working directory)
# In this case, the input files are in the folder ./FscInputFiles/
settings$pathTo_InputFile <- "./FscInputFiles/"
# DEFINE THE MODEL WITH EST AND TPL FILES 
# tags to the TPL and EST files. Files should be "tag".tpl and "tag".est
# in this case files are 2PopDiv20Mb.tpl and 2PopDiv20Mb.est
settings$TPL_EST_file_tag <- "2PopDiv20Mb"
# number of coalescent simulations
settings$n_coalsims <- 100000
# number of optimization cycles
settings$n_cycles <- 20

# IF YOU WANT TO RUN FASTSIMCOAL2 FROM THE COMMAND LINE
# get the fastsimcoal2 command line
cmd <- get_fsc2_commandline(settings)
cmd
# IF YOU WANT TO RUN FASTSIMCOAL2 BY CALLING IT FROM R
# Only works in Unix systems
# run fastsimcoal2
run_fsc2(settings)

Running fastsimcoal2 should take less than 2min. You will get a print of the optimization procedure, with the likelihood at each iteration. For instance, I obtained the following in one of the runs:

Estimation of parameters by conditional maximization via Brent algorithm (initial lhood = -356267)
Iter 1 Curr best params: 24107 20448 12188 1119 lhood=-138019
Iter 2 Curr best params: 13304 5500 9889 1012 lhood=-137381
Iter 3 Curr best params: 20621 5307 9790 1046 lhood=-137361

Iter 20 Curr best params: 20414 5155 9895 1013 lhood=-137360

The likelihood is given in units of log10. Hence, the closer to zero the higher the likelihood. We can see that the likelihood increased from -356267 to -137360 in 20 cycles of the optimization. We can also see that after 1 iteration the parameters were (in same order as in EST file): NPOP1 24107, NPOP2 20448, NANC 12188, TDIV 1119 and at iteration 20 the values were NPOP1 20414, NPOP2 5155, NANC 9895, TDIV 1013.

How to interpret fastsimcoal2 results? Fastsimcoal2 will output several files, which contain important information. The most relevant information for most users is:

  • *.bestlhoods : file with the Maximum likelihood parameter estimates and corresponding likelihood. This is what we want – parameter estimates!

  • *_DAFpop0.txt : file with the expected SFS obtained with the parameters that maximized the likelihood during optimization. This is needed to visually check the fit of the SFS. Or we can use this to recompute the likelihood of other observed SFS (e.g. without linked sites) given our model.

  • *.simparam : file with an example of the settings to run the simulations. It is useful to check this file when you have errors.

Exercise 2.4
Obtain parameter estimates and maximum likelihood.
Follow the code below that will read the output files from fastsimcoal2 to obtain the parameter estimates.

# read the file with the maximum likelihood estimates
maxlhoodEstimates <- read.table(paste(settings$pathTo_InputFile, "/",
                                      settings$TPL_EST_file_tag, "/",
                                      settings$TPL_EST_file_tag, ".bestlhoods",
                                      sep=""), header=T)
# look at the parameter estimates
maxlhoodEstimates

The maxlhoodEstimates contain a matrix where each column corresponds to a parameter name tag and the corresponding parameter estimate. You have the result of 1 run. In the analysis of real data sets we need to run several independent runs (different starting values), and select the parameter estimates of the run attaining the maximum likelihood estimate (MaxEstLhood).

TO DISCUSS:
– What are the parameter estimates of your run?

NOTE: Difference between MaxObsLhood and MaxEstLhood:

MaxObsLhood: is the maximum possible value for the likelihood if there was a perfect fit of the expected to the observed SFS, i.e. if the expected SFS was the relative observed SFS.

MaxEstLhood: is the maximum likelihood estimated according to the model parameters.

The better the fit, the smaller the difference between MaxObsLhood and MaxEstLhood

So, in your case, all participants should have the same MaxObsLhood because all students are looking at the same OBSERVED SFS, but each one of you will have a different MaxEstLhood because you have run different optimization cycles, starting from different initial values.

Exercise 2.5
– Assess the fit of the expected SFS under the model to the observed SFS.

Follow the code below that will read the output files from fastsimcoal2 to obtain the observed and expected SFS.

# Fit of the model expected SFS to the observed SFS

# Tag for the end of the observed SFS file
obsfileend <- "_jointDAFpop1_0"

# Read the observed SFS - SNP counts
obssfs <- read.table(paste(settings$pathTo_InputFile,
                    settings$TPL_EST_file_tag, obsfileend, ".obs",
                    sep=""), skip=1, stringsAsFactors = F, header=T)

# Read the expected SFS - PROBABILITIES
expsfs <- read.table(paste(settings$pathTo_InputFile,
                    settings$TPL_EST_file_tag, "/",
                    settings$TPL_EST_file_tag, obsfileend, ".txt",
                    sep=""), header=T)

# Plot the fit of the 2D SFS, including of the marginal 1D SFS
# the function plot2dSFS is defined in the utilfunctions.r
# you need to give as input the following arguments
#    obsSFS : matrix with observed SFS (counts)
#    expSFS : matrix with expected SFS (probabilities)
#    xtag : string with the label of x-axis
#    ytag : string with the label of y-axis
#    minentry : number with the minimum entry in the SFS (all entries with less than this are pooled together)
plot2dSFS(obsSFS=obssfs, expSFS=expsfs, xtag="Pop2", ytag="Pop1", minentry=1)

# Draw a scenario of the model parameter estimates assuming a generation time of 1 year
# path to the maxL.par file that is a par file with the parameters that maximize the likelihood
path_to_maxL_file <- paste(settings$pathTo_InputFile, settings$TPL_EST_file_tag, "/",settings$TPL_EST_file_tag, "_maxL", sep="")
parFileInterpreter(args=path_to_maxL_file, pop.names=c("Pop1","Pop2"), gentime=1, printPDF=FALSE)
## [1] "./FscInputFiles/2PopDiv20Mb/2PopDiv20Mb_maxL"

Note the difference between the observed SFS and the expected SFS. The observed SFS shows the number of sites with a given frequency (counts, which are integer values), whereas the expected SFS shows the probability of finding a site with a given absolute allele frequency in both populations (i.e. probabilities with values between 0 and 1.0).

DISCUSSION
– Is the method able to infer the correct parameters?
– Does the expected SFS fit the observed SFS?
– Does the SFS contain information about the demographic history of populations?
– Time estimates are given in generations. How to convert to years?


3. Application to real genome-wide SNP data from human populations

Apply fastsimcoal2 to infer parameters of a model with two populations

The folder “./Henn_et_al_data” contains exome data from two human populations (Namibian San, SAN; https://en.wikipedia.org/wiki/San_people, and Mexican Mayan, MAYA, https://en.wikipedia.org/wiki/Maya_peoples) published in Henn et al. (2015) PNAS (http://www.pnas.org/content/113/4/E440.abstract). This data was kindly provided by Stephan Peischl (http://www.bioinformatics.unibe.ch/about_us/staff/dr_peischl_stephan/index_eng.html), as is a small sub-set of the original data. As such, it cannot be used outside the scope of this course. The original data is available at NCBI Sequence Read Archive (accession no. SRP036155).

The folder “./Henn_et_al_data” contains the following files:

  • 2D-SFS for the two populations, assuming that we sequenced a total of 44Mb

  • Original genotypic matrix

  • TPL and EST files for a model without gene flow

In this case pop1 is Maya with 8 diploid individuals sampled (16 gene copies), and pop2 is San with 6 diploid individuals sampled (12 gene copies).

Exercise 3.1.
Have a look at the input files (obsSFS, EST and TPL files). Answer the following questions:
– Are monomorphics sites included in the observed SFS?
– What is the sample size from each population?
– What is the mutation rate?
– How many historical events does the model include?
– What are the parameter search ranges?

If the monomorphic sites are not included we cannot use the mutation rate as a clock to get inferences. When we do not have the number of monomorphic sites, all parameters must be specified in relation to a given fixed parameter (e.g. an effective size, or a time of event). Here I obtained the number of monomorphic sites assuming that a total of 44MB from the exome was sequenced and that sites that are not SNPs are monomorphic, i.e. #monomorphic = 44e6 – #SNPs.

Exercise 3.2.
Infer the time of divergence between two human populations, San from Africa and Maya from America using fastsimcoal2.
Consider a model without gene flow and with a potential bottleneck in the Maya population.

Use the command lines below for the model without gene flow. Note that we assume that the Maya population could have potentially a bottleneck associated with the split time, lasting for a fixed number of generations. Have a look at the TPL and EST files to see how the bottleneck was