Inferring demographic parameters from the SFS with composite likelihood method implemented in fastsimcoal2
Vitor C. Sousa
Objectives of this practical session
- What is the SFS?
- Derived (unfolded) SFS from genotypic data
- Does the SFS contain information about demography?
- Infer parameters under simple models with fastsimcoal2 (simulated data)
- Input files for fastsimcoal2
- Settings to run fastsimcoal2
- Analyse the output of fastsimcoal2
- Does our model describe well the data? Fit of expected SFS under the moded with observed SFS
- Application to real dataset from human populations
- Apply fastsimcoal2 for a model with two populations
- Compare results with and without gene flow
- Application SFS obtained with ANGSD for human populations
- Apply fastsimcoal2 for cases with no monomorphic sites, with relative parameters
For this practical, we will use the R Studio of the Amazon instance and call fastsimcoal2 from command line of the Amazon instance.
# Code in grey code blocks are commands to be run in Rstudio on the Amazon instance.
# Code in black code blocks are examples of commands to be run in the terminal of the Amazon instance.
The practical requires the following R packages:
You can install these packages with the following commands in R.
install.packages("plotrix") install.packages("diagram") install.packages("RColorBrewer") library("RColorBrewer")
1. What is the site frequency spectrum (SFS)?
1.1. Build the derived SFS from genotypic data
The SFS (also called Allele frequency spectrum – AFS) is a summary of genome-wide data that describes the number of single nucleotide polymorphisms (SNPs) with a given frequency in our sample.
Consider the following genotypic data matrix where each entry codes for the genotypes as 0 (homozygote for reference allele), 1 (heterozygote), 2 (homozygote for alternative allele). In this matrix each row corresponds to a site in the genome, and each column corresponds to an individual. This is similar to what we can obtain from a variant call format (VCF) file.
Let’s compute the observed SFS from the above matrix, assuming it comes from a single population.
Before starting, answer the following two questions:
What is the x-axis of the SFS? What is the range of possible values for the x-axis? (click on the triangle to see the answer)
Answer: The x-axis of the SFS contains the absolute frequency of the derived allele in the sample. Thus, it can take values from zero (if at a site all individuals are homozygote for ancestral allele) to \(2*n\) (if all the \(n\) individuals are homozygote for the derived allele). Note that we need to multiply by 2 because individuals are diploid.
What is the y-axis of the SFS? What is the range of possible values for the y-axis? (click on the triangle to see the answer)
Answer: The y-axis of the SFS contains the number of sites in the genome with a given absolute frequency. Thus, it can take values between zero (if there are no sites in the genome with a given frequency) to \(S\), where \(S\) is the number of sites.
Build the observed SFS given the genotypes, following the steps below.
First, we need to compute the absolute allele frequency for each SNP.
Second, we need to count the number of sites with a given allele frequency.
You should obtain the following table, which can be represented as a barplot. For instance, we can see in the SFS that there are 5 sites with a frequency of 1, 2 sites with a frequency of 3, and zero sites with a frequency larger than 6.
|Number of sites||0||5||3||2||1||1||0||0||0||0||0|
1.2. Multi-dimensional SFS (2D-SFS)
The SFS can be extended to any number of populations. For instance, if we had sampled data from two populations, the 2D-SFS corresponds to a matrix where the entry (i,j) corresponds to the count of sites with frequency i in population 1, and frequency j in population 2.
Example of a 2D-SFS
The following matrix contains the 2D-SFS of 50,000 simulated sites from two populations, with a sample size of 2 diploids for pop1 and 3 diploids for pop2. Hence, we have 5 rows and 7 columns. Each entry (i,j) contains the number of SNPs with a derived allele frequency of i in pop1, and j in pop2. The sum of the entries in this matrix is the total number of sites, which in this case is 50,000. This matrix can be visualy represented by a heatmap, as in the figure below.
Based on the above matrix with the 2D SFS, answer the following questions.
How many sites have 1 heterozygote individual in pop1 and 1 heterozygote individual in pop2?
How many sites have a derived allele frequency of 4 in pop1 and a derived allele frequency of 4 in pop2?
Identify the entry of the 2D-SFS matrix corresponding to the sites where all individuals in both populations are heterozygote.
2. Does the SFS contain information about demographic history?
Infer parameters under simple models with fastsimcoal2 with simulated data
We will analyse the simulated SFS (our observed SFS) according to the scenario below. In real life the observed SFS would have been obtained from your VCF files.
2.1 Input files: DATA
The input files with the observed SFS are in the folder: “./FscInputFiles/”:
- TwoPops_2DSFS.obs: 2D observed SFS for the model with two populations.
Exercise 2.1. Have a look at the input files using a text editor (e.g. RStudio editor, gedit, TextPad, Notepad). You can read them with R using the read.table() function.
# set the working directory setwd("~/workshop_materials/a09_fastsimcoal2") # read the observed SFS # 2D from two populations with different sample sizes pop2d <- read.table("./FscInputFiles/TwoPops_2DSFS.obs", skip=1, stringsAsFactors = F, header=T, row.names = 1)
The observed derived SFS is (here the identifiers di_j represent the pop \(i\) and SNP frequency \(j\)):
Have a look at the observed SFS file. Does the observed SFS includes the monomorphic sites?
The entry for the monomorphic sites is the one with a derived frequency of zero in pop1 and in pop2, i.e. the entry [1,1] corresponding to the 1st row and 1st column.
Input files 2.2: DEFINITION OF MODELS
Have a look at the TPL and EST files for the models shown above. Use a texteditor (e.g. RStudio editor, gedit, TextPad) to have a look at these files.
We will consider a model without gene flow. That model and the parameters are defined in the files “2PopDiv_NoMig.est” and “2PopDiv_NoMig.tpl”. Each model has a corresponding TPL file. The parameters we want to infer are given a parameter name tag (e.g. $NPOP$, $TDIV$). Have a look at the TPL file on the folder “FscInputFiles”, corresponding to the model:
The search range for each parameter is defined in the EST file. For each parameter we specify the search range using the corresponding parameter name tag.
Have a look at the EST file on the folder “FscInputFiles”, corresponding to the No migration model.
Note the keyword “absoluteResize” that means that the resize is given in absolute effective size, rather than a relative effective size. Note the search ranges and the keyword “bounded” in the line of \(TDIV\) parameters, which means that the range of that parameter is bounded. For all the other parameters the upper bound will be increased if the likelihood is maximized near the maximum value.
In sum, there are two steps:
Define the model with the TPL file, specifying the parameter tags for those parameters we want to infer. NOTE: different parameters need to be given different name tags, and name tags cannot be a sub-string another one (e.g. NPOP1, NPOP1_2 cannot be used because NPOP1 is a sub-string of NPOP1_2). For this reason, we usually use a “$” at the start and end of a tag, e.g. $NPOP1$, $NPOP1_2$.
Define the search range for each parameter in the EST file.
For further information on definition of models check the manual and the material given during the lecture about specification of models in fastsimcoal2.
NOTE: the TPL and EST files need to have the same filename, but have different extensions: [filename].est and [filename].tpl.
TO DISCUSS Exercise 2.2.2
From looking at the TPL and EST file:
– What is the mutation rate assumed?
– Do we need a mutation rate estimate to be able to run fastsimcoal2?
– Can we estimate the mutation rate from the SFS using fastsimcoal2?
A: The average mutation rate is shown in the last line of the TPL file. In this case it is assumed to be 2.5e-8 mutations/site/generation. You do not need a mutation rate estimate to run fastsimcoal2. If you have the number of monomorphic sites and a mutation rate you can estimate the parameters with time in generations. Otherwise, if you do not have a mutation rate and/or you do not have the number of monomorphic sites, then you can still run fastsimcoal2 but in that case you will need to define a reference effective size or a reference time for an event. All the other parameters would be estimated in relation to that reference parameter. For instance, you could define that the effective size of one of the populations is the reference \(NPOP1\) and fixed to a value, e.g. 10,000, and then you can estimate the relative size of other populations (e.g., $NPOP2$/$NPOP1$), even if you do not have monomorphic sites and/or a mutation rate. Although in theory it is possible to estimate an average mutation rate, we do not recommend this as you would need to fix an effective size.
When estimating parameters with the SFS you CANNOT estimate jointly the mutation rate and the other parameters. You either need to give a fixed mutation rate or fix one effective size as a reference. This is different from other methods, such as Approximate Bayesian Computation where you simulate data and you can jointly estimate both mutation rate and demographic history parameters.
2.3. Settings to run fastsimcoal2
To run fastsimcoal2 we need to specify the following options:
the number of coalescent simulations to approximate the expected SFS (-n option). This should be larger than 100,000. We recommend to use 200,000 to 1,000,000.
the number of optimization cycles (-L option). This should be at least 20, but values between 50 and 100 are recommended.
use option -M indicating that we will run the likelihood optimization to infer parameters based on the SFS.
use option -d for the derived SFS and -m for the MAF SFS.
use option -0 if you do not have monomorphic sites in your observed SFS. NOTE: in this case you need to fix a parameter (e.g. a split time or an effective size), and all parameters will be relative to that fixed parameter.
use option -q to have a quite mode, without printing lots of information.
specify option -C1 such that the likelihood is computed for all entries with at least 1 SNP. If you specify -Cx then all entries with less than x SNPs are pooled together. This option is useful when there are many entries in the observed SFS with few SNPs and with a limited number of SNPS to avoid overfitting.
specify options for multithread. -c1 -B1 for a single core, -c4 -B4 for 4 cores using 4 batches.
2.4. Run fastsimcoal2 (a single run only to exemplify)
The following scripts will run one fastsimcoal2 optimization, starting from a random initial starting value. When dealing with real data one needs to repeat at least 20 to 100 runs (i.e., runs with different initial starting values), and then we select the run that maximizes the likelihood (i.e., the run with the highest MaxEstLhood).
# load functions to run fastsimcoal2 and to process the output source("utilFscFunctions.r") source("ParFileInterpreter_VS.r") # create a list that saves all the required settings to run fastsimcoal2 settings <- list() # path to fastsicoal2 executable file # in this case the fastsimcoal2 executable is in the folder ./fastsimcoal2/fsc26_linux64/ settings$pathToFsc <- "./fastsimcoal2/fsc27_linux64/" # executable filename settings$Fsc <- "fsc2702" # DEFINE OBSERVED SFS INPUT FILE settings$obsfile <- "TwoPops_2DSFS.obs" # path to input files (working directory) # In this case, the input files are in the folder ./FscInputFiles/ settings$pathTo_InputFile <- "./FscInputFiles/" # Tag for the end of the observed SFS file # this depends on whether it is a 1D SFS, 2D SFS or --multiSFS (check fastsimcoal2 manual) settings$obsfileend <- "_jointDAFpop1_0" # DEFINE THE MODEL WITH EST AND TPL FILES # tags to the TPL and EST files. Files should be "tag".tpl and "tag".est # in this case files are 2PopDiv_NoMig.tpl and 2PopDiv_NoMig.est settings$TPL_EST_file_tag <- "2PopDiv_NoMig" # number of coalescent simulations settings$n_coalsims <- 100000 # number of optimization cycles settings$n_cycles <- 20 # get the command lines to copy files and run fastsimcoal2 cmd <- get_fsc2_commandline(settings)
# print command lines to copy files cmd$copyfiles # rename observed file name cmd$renameobs # print command line to run fastsimcoal2 cmd$fsc2cmd
Copy the above lines and run then in the terminal of your Amazon instance.
The lines should be the following:
# Go to the server # Go to the folder ~/workshop_materials/a09_fastsimcoal2/ cd ~/workshop_materials/a09_fastsimcoal2/ # copy the lines that you get from the code above cp ./fastsimcoal2/fsc27_linux64/fsc2702 ./FscInputFiles/;chmod +x ./FscInputFiles/fsc2702;cd ./FscInputFiles/; cp TwoPops_2DSFS.obs 2PopDiv_NoMig_jointDAFpop1_0.obs ./fsc2702 -t 2PopDiv_NoMig.tpl -e 2PopDiv_NoMig.est -L 20 -n 100000 -d -M -q -C1 -c4 -B4;cd ..
Running fastsimcoal2 should take less than 2min. You will get a print of the optimization procedure, with the likelihood at each iteration. For instance, I obtained the following in one of the runs:
Estimation of parameters by conditional maximization via Brent algorithm (initial lhood = -356267)
Iter 1 Curr best params: 24107 20448 12188 1119 lhood=-138019
Iter 2 Curr best params: 13304 5500 9889 1012 lhood=-137381
Iter 3 Curr best params: 20621 5307 9790 1046 lhood=-137361
Iter 20 Curr best params: 20414 5155 9895 1013 lhood=-137360
The likelihood is given in units of log10. Hence, the closer to zero the higher the likelihood. We can see that the likelihood increased from -356267 to -137360 in 20 cycles of the optimization. We can also see that after 1 iteration the parameters were (in same order as in EST file): NPOP1 24107, NPOP2 20448, NANC 12188, TDIV 1119 and at iteration 20 the values were NPOP1 20414, NPOP2 5155, NANC 9895, TDIV 1013.
How to interpret fastsimcoal2 results?
Fastsimcoal2 will output several files, which contain important information. The most relevant information for most users is:
*.bestlhoods : file with the Maximum likelihood parameter estimates and corresponding likelihood. This is what we want – parameter estimates!
*_DAFpop0.txt : file with the expected SFS obtained with the parameters that maximized the likelihood during optimization. This is needed to visually check the fit of the SFS. Or we can use this to recompute the likelihood of other observed SFS (e.g. without linked sites) given our model.
*.simparam : file with an example of the settings to run the simulations. It is useful to check this file when you have errors.
Obtain parameter estimates and maximum likelihood.
Follow the code below that will read the output files from fastsimcoal2 to obtain the parameter estimates.
# read the file with the maximum likelihood estimates maxlhoodEstimates <- read.table(paste(settings$pathTo_InputFile, settings$TPL_EST_file_tag, "/", settings$TPL_EST_file_tag, ".bestlhoods", sep=""), header=T, check.names = F) # replace the X. at the start of each parameter name paramnames <- colnames(maxlhoodEstimates) eval <- which(substr(paramnames,1,2)=="X.") # get params starting with X. paramnames[eval] <- substr(paramnames[eval],start=3, stop=nchar(paramnames[eval])) # get only the substring after X. for those parameteres colnames(maxlhoodEstimates) <- paramnames # replace column names # look at the parameter estimates maxlhoodEstimates
## $NPOP1$ $NPOP2$ $NANC$ $TDIV$ MaxEstLhood MaxObsLhood ## 1 19595 5402 9867 1044 -137360.1 -137353.9
The maxlhoodEstimates contain a matrix where each column corresponds to a parameter name tag and the corresponding parameter estimate. You have the result of 1 run. In the analysis of real data sets we need to run several independent runs (different starting values), and select the parameter estimates of the run attaining the maximum likelihood estimate (MaxEstLhood).
Exercise 2.5 TO DISCUSS:
– What are the parameter estimates of your run?
– What is the difference between MaxObsLhood and MaxEstLhood?
NOTE: Difference between MaxObsLhood and MaxEstLhood:
MaxObsLhood: is the maximum possible value for the likelihood if there was a perfect fit of the expected to the observed SFS, i.e. if the expected SFS was the relative observed SFS.
MaxEstLhood: is the maximum likelihood estimated according to the model parameters.
The better the fit, the smaller the difference between MaxObsLhood and MaxEstLhood
So, in your case, all participants should have the same MaxObsLhood because all students are looking at the same OBSERVED SFS, but each one of you will have a different MaxEstLhood because you have run different optimization cycles, starting from different initial values.
– Assess the fit of the expected SFS under the model to the observed SFS.
Follow the code below that will read the output files from fastsimcoal2 to obtain the observed and expected SFS.
# Fit of the model expected SFS to the observed SFS # Read the observed SFS - SNP counts obssfs <- read.table(paste(settings$pathTo_InputFile, settings$TPL_EST_file_tag, settings$obsfileend, ".obs", sep=""), skip=1, stringsAsFactors = F, header=T) # Read the expected SFS - PROBABILITIES expsfs <- read.table(paste(settings$pathTo_InputFile, settings$TPL_EST_file_tag, "/", settings$TPL_EST_file_tag, settings$obsfileend, ".txt", sep=""), header=T) # Plot the fit of the 2D SFS, including of the marginal 1D SFS # the function plot2dSFS is defined in the utilfunctions.r # you need to give as input the following arguments # obsSFS : matrix with observed SFS (counts) # expSFS : matrix with expected SFS (probabilities) # xtag : string with the label of x-axis # ytag : string with the label of y-axis # minentry : number with the minimum entry in the SFS (all entries with less than this are pooled together) plot2dSFS(obsSFS=obssfs, expSFS=expsfs, xtag="Pop2", ytag="Pop1", minentry=1)
Note the difference between the observed SFS and the expected SFS. The observed SFS shows the number of sites with a given frequency (counts, which are integer values), whereas the expected SFS shows the probability of finding a site with a given absolute allele frequency in both populations (i.e. probabilities with values between 0 and 1.0).
# Draw a scenario of the model parameter estimates assuming a generation time of 1 year # path to the maxL.par file that is a par file with the parameters that maximize the likelihood layout(1) path_to_maxL_file <- paste(settings$pathTo_InputFile, settings$TPL_EST_file_tag, "/",settings$TPL_EST_file_tag, "_maxL", sep="") parFileInterpreter(args=path_to_maxL_file, pop.names=c("Pop1","Pop2"), gentime=1, printPDF=FALSE)
##  "./FscInputFiles/2PopDiv_NoMig/2PopDiv_NoMig_maxL"
## Loading required package: shape
– Is the method able to infer the correct parameters?
– Does the expected SFS fit the observed SFS?
– Does the SFS contain information about the demographic history of populations?
– Time estimates are given in generations. How to convert to years?
3. Application to real genome-wide SNP data from human populations
Apply fastsimcoal2 to infer parameters of a model with two populations
The folder “./Henn_et_al_data” contains exome data from two human populations (Namibian San, SAN; https://en.wikipedia.org/wiki/San_people, and Mexican Mayan, MAYA, https://en.wikipedia.org/wiki/Maya_peoples) published in Henn et al. (2015) PNAS (http://www.pnas.org/content/113/4/E440.abstract). This data was kindly provided by Stephan Peischl (http://www.bioinformatics.unibe.ch/about_us/staff/dr_peischl_stephan/index_eng.html), as is a small sub-set of the original data. As such, it cannot be used outside the scope of this course. The original data is available at NCBI Sequence Read Archive (accession no. SRP036155).
The folder “./Henn_et_al_data” contains the following files:
2D-SFS for the two populations, assuming that we sequenced a total of 44Mb
Original genotypic matrix
TPL and EST files for a model without gene flow
In this case pop1 is Maya with 8 diploid individuals sampled (16 gene copies), and pop2 is San with 6 diploid individuals sampled (12 gene copies).
Have a look at the input files (obsSFS, EST and TPL files). Answer the following questions:
– Are monomorphics sites included in the observed SFS?
– What is the sample size from each population?
– What is the mutation rate?
– How many historical events does the model include?
– What are the parameter search ranges?
If the monomorphic sites are not included we cannot use the mutation rate as a clock to get inferences. When we do not have the number of monomorphic sites, all parameters must be specified in relation to a given fixed parameter (e.g. an effective size, or a time of event). Here I obtained the number of monomorphic sites assuming that a total of 44MB from the exome was sequenced and that sites that are not SNPs are monomorphic, i.e. #monomorphic = 44e6 – #SNPs.
Infer the time of divergence between two human populations, San from Africa and Maya from America using fastsimcoal2.
Consider a model without gene flow and with a potential bottleneck in the Maya population.
Use the command lines below for the model without gene flow. Note that we assume that the Maya population could have potentially a bottleneck associated with the split time, lasting for a fixed number of generations. Have a look at the TPL and EST files to see how the bottleneck was implemented.
# load functions to run fastsimcoal2 and to process the output source("utilFscFunctions.r") source("ParFileInterpreter_VS.r") # create a list that saves all the required settings to run fastsimcoal2 settings <- list() # path to fastsicoal2 executable file settings$pathToFsc <- "./fastsimcoal2/fsc27_linux64/" # executable filename settings$Fsc <- "fsc2702" # DEFINE OBSERVED SFS INPUT FILE settings$obsfile <- "Maya_San_DSFS.obs" # path to input files (working directory) # In this case, the input files are in the folder ./Henn_et_al_data/ settings$pathTo_InputFile <- "./Henn_et_al_data/" # DEFINE THE MODEL WITH EST AND TPL FILES # tags to the TPL and EST files. Files should be "tag".tpl and "tag".est settings$TPL_EST_file_tag <- "NoMig_San_Maya" # Tag for the end of the observed SFS file # this depends on whether it is a 1D SFS, 2D SFS or --multiSFS (check fastsimcoal2 manual) settings$obsfileend <- "_jointDAFpop1_0" # number of coalescent simulations settings$n_coalsims <- 100000 # number of optimization cycles settings$n_cycles <- 20 # get the fastsimcoal2 command line cmd <- get_fsc2_commandline(settings)
Copy the command lines to the server (Amazon instance) and run fastsimcoal2. If you type the following you should get the following command lines
# print command lines to copy files cmd$copyfiles # rename observed file name cmd$renameobs # print command line to run fastsimcoal2 cmd$fsc2cmd
Go to the Amazon instance and copy the command lines to run them in the terminal. This run will take longer than previous runs, approx. ~10 min.
The lines should be the following:
# Go to the folder ~/workshop_materials/a09_fastsimcoal2/ cd ~/workshop_materials/a09_fastsimcoal2/ # copy the lines that you get from the code above cp ./fastsimcoal2/fsc27_linux64/fsc2702 ./Henn_et_al_data/;chmod +x ./Henn_et_al_data/fsc2702;cd ./Henn_et_al_data/; cp Maya_San_DSFS.obs NoMig_San_Maya_jointDAFpop1_0.obs; ./fsc2702 -t NoMig_San_Maya.tpl -e NoMig_San_Maya.est -L 20 -n 100000 -d -M -q -C1 -c4 -B4;cd ..
FASTSIMCOAL2 IS NOW RUNNING. Once the run is finished.
Example of results of one run for the model without migration. Copy the following commands to RStudio.
# read the file with the maximum likelihood estimates maxlhoodEstimates <- read.table(paste(settings$pathTo_InputFile, settings$TPL_EST_file_tag, "/", settings$TPL_EST_file_tag, ".bestlhoods", sep=""), header=T, check.names = F) # replace the X. at the start of each parameter name paramnames <- colnames(maxlhoodEstimates) eval <- which(substr(paramnames,1,2)=="X.") # get params starting with X. paramnames[eval] <- substr(paramnames[eval],start=3, stop=nchar(paramnames[eval])) # get only the substring after X. for those parameteres colnames(maxlhoodEstimates) <- paramnames # replace column names # print the estimates maxlhoodEstimates
## $NPOP1$ $NPOP2$ $NANC$ $NBOTP1$ $TDIV$ $TBOT_END$ MaxEstLhood MaxObsLhood ## 1 7696 34346 18553 83 3333 1906 -476812.2 -475819.9
These are the estimates I obtained, but you will very likely have different results because your run started from a different starting value.
Now, lets compare the fit of the expected SFS under the model to the observed data.
# Read the observed SFS # 2D from two human populations: San and Maya maya_san_2dsfs <- read.table(paste(settings$pathTo_InputFile, settings$TPL_EST_file_tag, settings$obsfileend,".obs",sep=""), skip=1, stringsAsFactors = F, header=T, row.names = 1) # Read the expected SFS expectedSFS <- read.table(paste(settings$pathTo_InputFile, settings$TPL_EST_file_tag, "/", settings$TPL_EST_file_tag, settings$obsfileend, ".txt", sep=""), header=T) # Plot the SFS fit plot2dSFS(maya_san_2dsfs, expectedSFS, xtag="San", ytag="Maya", minentry=1)
# Plot the relative 2D-SFS fit plot_relDiff2dSFS(maya_san_2dsfs, expectedSFS, xtag="San", ytag="Maya", minentry=1)
# Draw a scenario of the model parameter estimates assuming a generation time of 29 years # path to the maxL.par file that is a par file with the parameters that maximize the likelihood layout(1) path_to_maxL_file <- paste(settings$pathTo_InputFile, settings$TPL_EST_file_tag, "/",settings$TPL_EST_file_tag, "_maxL", sep="") parFileInterpreter(args=path_to_maxL_file, pop.names=c("Maya","San"), gentime=29, printPDF=FALSE)
##  "./Henn_et_al_data/NoMig_San_Maya/NoMig_San_Maya_maxL"