Inferring demographic parameters from the SFS with the composite likelihood method implemented in fastsimcoal2







Objectives of this practical session


  1. What is the SFS?
  • Derived (unfolded) SFS from genotypic data
  2. Does the SFS contain information about demography?
  • Infer parameters under simple models with fastsimcoal2 (simulated data)
  • Input files for fastsimcoal2
  • Settings to run fastsimcoal2
  • Analyse the output of fastsimcoal2
  • Does our model describe the data well? Fit of the expected SFS under the model to the observed SFS
  3. Application to a real dataset from human populations
  • Apply fastsimcoal2 for a model with two populations
  • Compare results with and without gene flow
  4. Application to the SFS obtained with ANGSD for human populations
  • Apply fastsimcoal2 for cases with no monomorphic sites, with relative parameters

Presentation slides

Notes on the fastsimcoal2 input files


REQUIREMENTS

For this practical, we will use RStudio on the Amazon instance and call fastsimcoal2 from the command line of the Amazon instance.

# Code in grey code blocks is to be run in RStudio on the Amazon instance.
# Code in black code blocks gives examples of commands to be run in the terminal of the Amazon instance.

The practical requires the following R packages:

  • plotrix

  • diagram

  • RColorBrewer

You can install these packages with the following commands in R.

install.packages("plotrix")
install.packages("diagram")
install.packages("RColorBrewer")
library("RColorBrewer")

1. What is the site frequency spectrum (SFS)?

1.1. Build the derived SFS from genotypic data

The SFS (also called Allele frequency spectrum – AFS) is a summary of genome-wide data that describes the number of single nucleotide polymorphisms (SNPs) with a given frequency in our sample.

Consider the following genotypic data matrix where each entry codes for the genotypes as 0 (homozygote for reference allele), 1 (heterozygote), 2 (homozygote for alternative allele). In this matrix each row corresponds to a site in the genome, and each column corresponds to an individual. This is similar to what we can obtain from a variant call format (VCF) file.

Let’s compute the observed SFS from the above matrix, assuming it comes from a single population.

Before starting, answer the following two questions:

What is the x-axis of the SFS? What is the range of possible values for the x-axis?

Answer: The x-axis of the SFS shows the absolute frequency of the derived allele in the sample. Thus, it can take values from zero (if at a site all individuals are homozygous for the ancestral allele) to \(2n\) (if all \(n\) individuals are homozygous for the derived allele). Note that we need to multiply by 2 because individuals are diploid.

What is the y-axis of the SFS? What is the range of possible values for the y-axis?

Answer: The y-axis of the SFS shows the number of sites in the genome with a given absolute frequency. Thus, it can take values from zero (if there are no sites in the genome with a given frequency) to \(S\), where \(S\) is the total number of sites.

Exercise 1.1
Build the observed SFS given the genotypes, following the steps below.

First, we need to compute the absolute allele frequency for each SNP.

      ind1 ind2 ind3 ind4 ind5  Absolute derived allele frequency
snp1 0 1 0 1 0 2
snp2 0 1 0 1 0 2
snp3 0 1 1 1 2 5
snp4 0 1 1 0 0 2
snp5 0 1 0 0 2 3
snp6 1 0 0 1 1 3
snp7 1 0 0 0 0 1
snp8 0 0 1 0 0 1
snp9 0 0 0 0 1 1
snp10 0 0 0 0 1 1
snp11 0 0 1 0 0 1
snp12 0 1 2 1 0 4

Second, we need to count the number of sites with a given allele frequency.

Results.

You should obtain the following table, which can be represented as a barplot. For instance, we can see in the SFS that there are 5 sites with a frequency of 1, 2 sites with a frequency of 3, and zero sites with a frequency larger than 5.

Abs. frequency of derived allele   0 1 2 3 4 5 6 7 8 9 10
Number of sites                    0 5 3 2 1 1 0 0 0 0 0
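If you want to verify your counts in R, the following minimal sketch reproduces the table above from the genotype matrix (entered here by hand):

# genotype matrix from Exercise 1.1 (rows = SNPs, columns = individuals)
geno <- matrix(c(0,1,0,1,0,
                 0,1,0,1,0,
                 0,1,1,1,2,
                 0,1,1,0,0,
                 0,1,0,0,2,
                 1,0,0,1,1,
                 1,0,0,0,0,
                 0,0,1,0,0,
                 0,0,0,0,1,
                 0,0,0,0,1,
                 0,0,1,0,0,
                 0,1,2,1,0), ncol=5, byrow=TRUE)
# derived allele count per SNP
freq <- rowSums(geno)
# number of diploid individuals
n <- ncol(geno)
# count sites at each frequency 0..2n (tabulate bins start at 1, hence the +1)
sfs <- tabulate(freq + 1, nbins=2*n + 1)
names(sfs) <- 0:(2*n)
barplot(sfs, xlab="Absolute frequency of derived allele", ylab="Number of sites")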

1.2. Multi-dimensional SFS (2D-SFS)

The SFS can be extended to any number of populations. For instance, if we had sampled data from two populations, the 2D-SFS corresponds to a matrix where the entry (i,j) corresponds to the count of sites with frequency i in population 1, and frequency j in population 2.

Example of a 2D-SFS

The following matrix contains the 2D-SFS of 50,000 simulated sites from two populations, with a sample size of 2 diploids for pop1 and 3 diploids for pop2. Hence, we have 7 rows (derived allele frequencies 0 to 6 in pop2) and 5 columns (derived allele frequencies 0 to 4 in pop1). Each entry (i,j) contains the number of SNPs with a derived allele frequency of i in pop2 and j in pop1. The sum of the entries in this matrix is the total number of sites, in this case 50,000. This matrix can be visually represented by a heatmap, as in the figure below.

                Abs. frequency of derived allele in Pop1
Pop2 freq.      0      1      2      3      4
0           23688   3450    897    194     27
1            4640   2000    889    338     75
2            1790   1310    929    485    175
3             699    832    763    568    256
4             240    492    627    569    393
5              87    223    417    560    623
6              20     72    228    424   1020
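As a minimal sketch of how such a matrix can be built in R (freq1 and freq2 are hypothetical vectors holding, for each SNP, the derived allele count in pop1 and pop2):

# hypothetical per-SNP derived allele counts: pop1 in 0..4, pop2 in 0..6
freq1 <- c(0, 1, 2, 0, 4)
freq2 <- c(1, 1, 3, 0, 6)
# rows = frequency in pop2, columns = frequency in pop1, as in the matrix above
sfs2d <- table(factor(freq2, levels=0:6), factor(freq1, levels=0:4))
sfs2d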

Exercise 1.2
Based on the above matrix with the 2D SFS, answer the following questions.

  • How many sites have 1 heterozygote individual in pop1 and 1 heterozygote individual in pop2?

  • How many sites have a derived allele frequency of 4 in pop1 and a derived allele frequency of 4 in pop2?

  • Identify the entry of the 2D-SFS matrix corresponding to the sites where all individuals in both populations are heterozygous.


2. Does the SFS contain information about demographic history?

Infer parameters under simple models with fastsimcoal2 with simulated data

We will analyse the simulated SFS (our observed SFS) according to the scenario below. In real life the observed SFS would have been obtained from your VCF files.


“Data simulated according to the following scenario.”

2.1 Input files: DATA

The input files with the observed SFS are in the folder "./FscInputFiles/":

  • TwoPops_2DSFS.obs: 2D observed SFS for the model with two populations.

Exercise 2.1. Have a look at the input files using a text editor (e.g. RStudio editor, gedit, TextPad, Notepad). You can read them with R using the read.table() function.

# set the working directory 
setwd("~/workshop_materials/a09_fastsimcoal2")
# read the observed SFS
# 2D from two populations with different sample sizes
pop2d <- read.table("./FscInputFiles/TwoPops_2DSFS.obs", skip=1, stringsAsFactors = F, header=T, row.names = 1)

The observed derived SFS is shown below (the identifiers d\(i\)_\(j\) represent derived allele frequency \(j\) in pop \(i\)):

d0_0 d0_1 d0_2 d0_3 d0_4 d0_5 d0_6
d1_0 19969298 8235 2294 945 368 152 49
d1_1 4188 1615 1142 703 471 234 112
d1_2 1204 853 815 674 532 370 221
d1_3 415 429 539 556 535 471 363
d1_4 94 199 296 390 543 695 0

Have a look at the observed SFS file. Does the observed SFS include the monomorphic sites?
The entry for the monomorphic sites is the one with a derived allele frequency of zero in both pop1 and pop2, i.e. the entry [1,1] corresponding to the 1st row and 1st column.
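You can check this entry directly in R, using the pop2d table read above:

# number of monomorphic sites: derived allele absent from both populations
pop2d[1, 1]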

2.2 Input files: DEFINITION OF MODELS

Exercise 2.2.
Have a look at the TPL and EST files for the models shown above, using a text editor (e.g. RStudio editor, gedit, TextPad).

We will consider a model without gene flow. That model and its parameters are defined in the files "2PopDiv_NoMig.est" and "2PopDiv_NoMig.tpl". Each model has a corresponding TPL file. The parameters we want to infer are given a parameter name tag (e.g. $NPOP$, $TDIV$). Have a look at the TPL file in the folder "FscInputFiles", corresponding to the model:

  • 2PopDiv_NoMig.tpl

The search range for each parameter is defined in the EST file. For each parameter we specify the search range using the corresponding parameter name tag.

Have a look at the EST file in the folder "FscInputFiles", corresponding to the no-migration model.

  • 2PopDiv_NoMig.est

Note the keyword "absoluteResize", which means that the resize is given as an absolute effective size, rather than a relative one. Note the search ranges and the keyword "bounded" in the line of the \(TDIV\) parameter, which means that the range of that parameter is bounded. For all the other parameters, the upper bound will be increased if the likelihood is maximized near the current maximum value.
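For orientation, the sketch below shows what a [PARAMETERS] block in an EST file typically looks like; the tag names, ranges and distributions here are illustrative, not the actual contents of 2PopDiv_NoMig.est:

[PARAMETERS]
//#isInt? #name    #dist. #min  #max
1  $NPOP1$  unif   100   100000  output
1  $NANC$   unif   100   100000  output
1  $TDIV$   unif   100   5000    output bounded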

In sum, there are two steps:

  1. Define the model with the TPL file, specifying the parameter tags for those parameters we want to infer. NOTE: different parameters need to be given different name tags, and name tags cannot be a sub-string of another one (e.g. NPOP1, NPOP1_2 cannot be used because NPOP1 is a sub-string of NPOP1_2). For this reason, we usually use a "$" at the start and end of a tag, e.g. $NPOP1$, $NPOP1_2$.

  2. Define the search range for each parameter in the EST file.

For further information on definition of models check the manual and the material given during the lecture about specification of models in fastsimcoal2.

NOTE: the TPL and EST files need to have the same filename, but have different extensions: [filename].est and [filename].tpl.

TO DISCUSS Exercise 2.2.2
From looking at the TPL and EST file:
– What mutation rate is assumed?
– Do we need a mutation rate estimate to be able to run fastsimcoal2?
– Can we estimate the mutation rate from the SFS using fastsimcoal2?

A: The average mutation rate is given in the last line of the TPL file. In this case it is assumed to be 2.5e-8 mutations/site/generation. You do not strictly need a mutation rate estimate to run fastsimcoal2. If you have the number of monomorphic sites and a mutation rate, you can estimate absolute parameters, with times in generations. Otherwise, if you do not have a mutation rate and/or the number of monomorphic sites, you can still run fastsimcoal2, but you will need to define a reference effective size or a reference time for an event. All the other parameters are then estimated relative to that reference parameter. For instance, you could fix the effective size of one of the populations, $NPOP1$, to a value such as 10,000, and then estimate the relative size of the other populations (e.g. $NPOP2$/$NPOP1$), even without monomorphic sites and/or a mutation rate. Although in theory it is possible to estimate an average mutation rate, we do not recommend this, as you would need to fix an effective size.

VERY IMPORTANT

When estimating parameters with the SFS you CANNOT jointly estimate the mutation rate and the other parameters. You either need to give a fixed mutation rate or fix one effective size as a reference. This is different from other methods, such as Approximate Bayesian Computation, where you simulate data and can jointly estimate both the mutation rate and the demographic history parameters.
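One way to see this: the expected SFS depends only on population-scaled compound parameters, e.g. for a single population of effective size \(N_e\),

\[ \theta = 4 N_e \mu, \qquad \tau = \frac{T}{2 N_e}, \]

so multiplying \(N_e\) by a constant while dividing \(\mu\) by the same constant (and rescaling times \(T\) accordingly) leaves the expected SFS unchanged. The data therefore cannot separate \(\mu\) from the demographic parameters unless one of them is fixed.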

2.3. Settings to run fastsimcoal2

To run fastsimcoal2 we need to specify the following options:

  • the number of coalescent simulations used to approximate the expected SFS (-n option). This should be larger than 100,000. We recommend using 200,000 to 1,000,000.

  • the number of optimization cycles (-L option). This should be at least 20, but values between 50 and 100 are recommended.

  • use option -M indicating that we will run the likelihood optimization to infer parameters based on the SFS.

  • use option -d for the derived SFS and -m for the MAF SFS.

  • use option -0 if you do not have monomorphic sites in your observed SFS. NOTE: in this case you need to fix a parameter (e.g. a split time or an effective size), and all parameters will be relative to that fixed parameter.

  • use option -q for quiet mode, which avoids printing lots of information.

  • specify option -C1 such that the likelihood is computed for all entries with at least 1 SNP. If you specify -Cx, then all entries with fewer than x SNPs are pooled together. This option is useful to avoid overfitting when the observed SFS has many entries with few SNPs and a limited total number of SNPs.

  • specify options for multithreading: -c1 -B1 for a single core, -c4 -B4 for 4 cores using 4 batches.
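Putting these options together, a typical invocation looks like the following template (a generic sketch with placeholder file names; the concrete command for this exercise is given in section 2.4):

./fsc2702 -t model.tpl -e model.est -M -d -n 200000 -L 50 -q -C1 -c4 -B4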

2.4. Run fastsimcoal2 (a single run only to exemplify)

The following code will run one fastsimcoal2 optimization, starting from a random initial value. When dealing with real data, one needs to repeat at least 20 to 100 runs (i.e., runs with different initial starting values) and then select the run that maximizes the likelihood (i.e., the run with the highest MaxEstLhood).

# load functions to run fastsimcoal2 and to process the output 
source("utilFscFunctions.r")
source("ParFileInterpreter_VS.r")

# create a list that saves all the required settings to run fastsimcoal2
settings <- list()
# path to the fastsimcoal2 executable file
# in this case the fastsimcoal2 executable is in the folder ./fastsimcoal2/fsc27_linux64/
settings$pathToFsc <- "./fastsimcoal2/fsc27_linux64/"
# executable filename
settings$Fsc <- "fsc2702"
# DEFINE OBSERVED SFS INPUT FILE
settings$obsfile <- "TwoPops_2DSFS.obs"
# path to input files (working directory)
# In this case, the input files are in the folder ./FscInputFiles/
settings$pathTo_InputFile <- "./FscInputFiles/"
# Tag for the end of the observed SFS file 
# this depends on whether it is a 1D SFS, 2D SFS or --multiSFS (check fastsimcoal2 manual)
settings$obsfileend <- "_jointDAFpop1_0"
# DEFINE THE MODEL WITH EST AND TPL FILES 
# tags to the TPL and EST files. Files should be "tag".tpl and "tag".est
# in this case files are 2PopDiv_NoMig.tpl and 2PopDiv_NoMig.est
settings$TPL_EST_file_tag <- "2PopDiv_NoMig"
# number of coalescent simulations
settings$n_coalsims <- 100000
# number of optimization cycles
settings$n_cycles <- 20

# get the command lines to copy files and run fastsimcoal2 
cmd <- get_fsc2_commandline(settings)
# print command lines to copy files
cmd$copyfiles
# rename observed file name
cmd$renameobs
# print command line to run fastsimcoal2
cmd$fsc2cmd

Copy the above lines and run them in the terminal of your Amazon instance.

The lines should be the following:

# Go to the server
# Go to the folder ~/workshop_materials/a09_fastsimcoal2/
cd ~/workshop_materials/a09_fastsimcoal2/

# copy the lines that you get from the code above
cp ./fastsimcoal2/fsc27_linux64/fsc2702 ./FscInputFiles/;chmod +x ./FscInputFiles/fsc2702;cd ./FscInputFiles/;
cp TwoPops_2DSFS.obs 2PopDiv_NoMig_jointDAFpop1_0.obs
./fsc2702 -t 2PopDiv_NoMig.tpl -e 2PopDiv_NoMig.est -L 20 -n 100000 -d -M -q -C1 -c4 -B4;cd ..

Running fastsimcoal2 should take less than 2 min. You will get a printout of the optimization procedure, with the likelihood at each iteration. For instance, I obtained the following in one of my runs:

Estimation of parameters by conditional maximization via Brent algorithm (initial lhood = -356267)
Iter 1 Curr best params: 24107 20448 12188 1119 lhood=-138019
Iter 2 Curr best params: 13304 5500 9889 1012 lhood=-137381
Iter 3 Curr best params: 20621 5307 9790 1046 lhood=-137361

Iter 20 Curr best params: 20414 5155 9895 1013 lhood=-137360

The likelihood is given in log10 units; hence, the closer to zero, the higher the likelihood. We can see that the likelihood increased from -356267 to -137360 over 20 cycles of the optimization. We can also see that after 1 iteration the parameters were (in the same order as in the EST file): NPOP1 24107, NPOP2 20448, NANC 12188, TDIV 1119, and at iteration 20 the values were NPOP1 20414, NPOP2 5155, NANC 9895, TDIV 1013.

How to interpret fastsimcoal2 results?

Fastsimcoal2 will output several files, which contain important information. The most relevant information for most users is:

  • *.bestlhoods : file with the maximum likelihood parameter estimates and the corresponding likelihood. This is what we want – parameter estimates!

  • *_jointDAFpop1_0.txt (here; the exact suffix depends on the type of SFS, e.g. *_DAFpop0.txt for a 1D SFS) : file with the expected SFS obtained with the parameters that maximized the likelihood during the optimization. This is needed to visually check the fit of the SFS. We can also use it to recompute the likelihood of another observed SFS (e.g. without linked sites) given our model.

  • *.simparam : file with an example of the settings to run the simulations. It is useful to check this file when you have errors.

Exercise 2.4
Obtain parameter estimates and maximum likelihood.
Follow the code below that will read the output files from fastsimcoal2 to obtain the parameter estimates.

# read the file with the maximum likelihood estimates
maxlhoodEstimates <- read.table(paste(settings$pathTo_InputFile,
settings$TPL_EST_file_tag, "/",
settings$TPL_EST_file_tag, ".bestlhoods",
sep=""), header=T, check.names = F)
# replace the X. at the start of each parameter name
paramnames <- colnames(maxlhoodEstimates) 
eval <- which(substr(paramnames,1,2)=="X.") # get params starting with X.
paramnames[eval] <- substr(paramnames[eval], start=3, stop=nchar(paramnames[eval])) # keep only the substring after X. for those parameters
colnames(maxlhoodEstimates) <- paramnames # replace column names

# look at the parameter estimates
maxlhoodEstimates
##   $NPOP1$ $NPOP2$ $NANC$ $TDIV$ MaxEstLhood MaxObsLhood
## 1   19595    5402   9867   1044   -137360.1   -137353.9

maxlhoodEstimates contains a table where each column corresponds to a parameter name tag with the corresponding parameter estimate, plus the likelihoods. You have the result of 1 run. In the analysis of real data sets we need to perform several independent runs (with different starting values) and select the parameter estimates of the run attaining the highest maximum likelihood (MaxEstLhood).
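For example, if you collected the .bestlhoods files of several independent runs of the same model in one folder (a hypothetical layout; adapt the pattern to your own file names), you could select the best run like this:

# read all .bestlhoods files in the current folder into one table
# (assumes all runs are of the same model, so columns match)
runfiles <- list.files(pattern="\\.bestlhoods$", full.names=TRUE)
runs <- do.call(rbind, lapply(runfiles, read.table, header=TRUE, check.names=FALSE))
# keep the run with the highest MaxEstLhood
runs[which.max(runs$MaxEstLhood), ]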

Exercise 2.5 TO DISCUSS:
– What are the parameter estimates of your run?
– What is the difference between MaxObsLhood and MaxEstLhood?

NOTE: Difference between MaxObsLhood and MaxEstLhood:

MaxObsLhood: the maximum possible value of the likelihood, attained if there were a perfect fit of the expected to the observed SFS, i.e. if the expected SFS equalled the relative observed SFS.

MaxEstLhood: the maximum likelihood estimated according to the model parameters.

The better the fit, the smaller the difference between MaxObsLhood and MaxEstLhood.

So, in this case, all participants should have the same MaxObsLhood, because everyone is looking at the same OBSERVED SFS, but each of you will have a different MaxEstLhood, because your optimizations started from different initial values.
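You can quantify this difference directly from the estimates read in above:

# difference between the best achievable and the achieved likelihood (log10 units);
# the closer to zero, the better the fit
maxlhoodEstimates$MaxObsLhood - maxlhoodEstimates$MaxEstLhood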

Exercise 2.6
– Assess the fit of the expected SFS under the model to the observed SFS.

Follow the code below that will read the output files from fastsimcoal2 to obtain the observed and expected SFS.

# Fit of the model expected SFS to the observed SFS

# Read the observed SFS - SNP counts
obssfs <- read.table(paste(settings$pathTo_InputFile,
settings$TPL_EST_file_tag, settings$obsfileend, ".obs",
sep=""), skip=1, stringsAsFactors = F, header=T)

# Read the expected SFS - PROBABILITIES
expsfs <- read.table(paste(settings$pathTo_InputFile,
settings$TPL_EST_file_tag, "/",
settings$TPL_EST_file_tag, settings$obsfileend, ".txt",
sep=""), header=T)

# Plot the fit of the 2D SFS, including of the marginal 1D SFS
# the function plot2dSFS is defined in utilFscFunctions.r
# you need to give as input the following arguments
#    obsSFS : matrix with observed SFS (counts)
#    expSFS : matrix with expected SFS (probabilities)
#    xtag : string with the label of x-axis
#    ytag : string with the label of y-axis
#    minentry : number with the minimum entry in the SFS (all entries with less than this are pooled together)
plot2dSFS(obsSFS=obssfs, expSFS=expsfs, xtag="Pop2", ytag="Pop1", minentry=1)

Note the difference between the observed SFS and the expected SFS. The observed SFS shows the number of sites with a given frequency (counts, which are integer values), whereas the expected SFS shows the probability of finding a site with a given absolute allele frequency in both populations (i.e. probabilities with values between 0 and 1.0).

# Draw a scenario of the model parameter estimates assuming a generation time of 1 year
# path to the maxL.par file that is a par file with the parameters that maximize the likelihood
layout(1)
path_to_maxL_file <- paste(settings$pathTo_InputFile, settings$TPL_EST_file_tag, "/",settings$TPL_EST_file_tag, "_maxL", sep="")
parFileInterpreter(args=path_to_maxL_file, pop.names=c("Pop1","Pop2"), gentime=1, printPDF=FALSE)
## [1] "./FscInputFiles/2PopDiv_NoMig/2PopDiv_NoMig_maxL"
## Loading required package: shape

TO DISCUSS
– Is the method able to infer the correct parameters?
– Does the expected SFS fit the observed SFS?
– Does the SFS contain information about the demographic history of populations?
– Time estimates are given in generations. How to convert to years?


3. Application to real genome-wide SNP data from human populations

Apply fastsimcoal2 to infer parameters of a model with two populations

The folder “./Henn_et_al_data” contains exome data from two human populations (Namibian San, SAN; https://en.wikipedia.org/wiki/San_people, and Mexican Mayan, MAYA, https://en.wikipedia.org/wiki/Maya_peoples) published in Henn et al. (2015) PNAS (http://www.pnas.org/content/113/4/E440.abstract). This data was kindly provided by Stephan Peischl (http://www.bioinformatics.unibe.ch/about_us/staff/dr_peischl_stephan/index_eng.html), and is a small sub-set of the original data. As such, it cannot be used outside the scope of this course. The original data is available at NCBI Sequence Read Archive (accession no. SRP036155).

The folder “./Henn_et_al_data” contains the following files:

  • 2D-SFS for the two populations, assuming that we sequenced a total of 44Mb

  • Original genotypic matrix

  • TPL and EST files for a model without gene flow

In this case pop1 is Maya with 8 diploid individuals sampled (16 gene copies), and pop2 is San with 6 diploid individuals sampled (12 gene copies).

Exercise 3.1.
Have a look at the input files (obsSFS, EST and TPL files). Answer the following questions:
– Are monomorphic sites included in the observed SFS?
– What is the sample size from each population?
– What is the mutation rate?
– How many historical events does the model include?
– What are the parameter search ranges?

If the monomorphic sites are not included, we cannot use the mutation rate as a clock for inference. When we do not have the number of monomorphic sites, all parameters must be specified relative to a given fixed parameter (e.g. an effective size, or the time of an event). Here I obtained the number of monomorphic sites by assuming that a total of 44 Mb of the exome was sequenced and that sites that are not SNPs are monomorphic, i.e. #monomorphic = 44e6 – #SNPs.
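A sketch of that computation (obs2dsfs is a hypothetical matrix holding the SNP counts of the 2D-SFS, with the [1,1] entry initially zero):

# total length assumed sequenced
total.sites <- 44e6
# every entry except [1,1] is a SNP count
n.snps <- sum(obs2dsfs) - obs2dsfs[1, 1]
# fill in the monomorphic entry
obs2dsfs[1, 1] <- total.sites - n.snps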

Exercise 3.2.
Infer the time of divergence between two human populations, San from Africa and Maya from America using fastsimcoal2.
Consider a model without gene flow and with a potential bottleneck in the Maya population.

Use the command lines below for the model without gene flow. Note that we assume the Maya population could potentially have had a bottleneck associated with the split time, lasting a fixed number of generations. Have a look at the TPL and EST files to see how the bottleneck was implemented.

# load functions to run fastsimcoal2 and to process the output 
source("utilFscFunctions.r")
source("ParFileInterpreter_VS.r")

# create a list that saves all the required settings to run fastsimcoal2
settings <- list()
# path to the fastsimcoal2 executable file
settings$pathToFsc <- "./fastsimcoal2/fsc27_linux64/"
# executable filename
settings$Fsc <- "fsc2702"
# DEFINE OBSERVED SFS INPUT FILE
settings$obsfile <- "Maya_San_DSFS.obs"
# path to input files (working directory)
# In this case, the input files are in the folder ./Henn_et_al_data/
settings$pathTo_InputFile <- "./Henn_et_al_data/"
# DEFINE THE MODEL WITH EST AND TPL FILES 
# tags to the TPL and EST files. Files should be "tag".tpl and "tag".est
settings$TPL_EST_file_tag <- "NoMig_San_Maya"
# Tag for the end of the observed SFS file 
# this depends on whether it is a 1D SFS, 2D SFS or --multiSFS (check fastsimcoal2 manual)
settings$obsfileend <- "_jointDAFpop1_0"
# number of coalescent simulations
settings$n_coalsims <- 100000
# number of optimization cycles
settings$n_cycles <- 20

# get the fastsimcoal2 command line
cmd <- get_fsc2_commandline(settings)

Copy the command lines to the server (Amazon instance) and run fastsimcoal2. If you type the following in RStudio, you will get the command lines to copy:

# print command lines to copy files
cmd$copyfiles
# rename observed file name
cmd$renameobs
# print command line to run fastsimcoal2
cmd$fsc2cmd

Go to the Amazon instance and run the command lines in the terminal. This run will take longer than previous runs, approx. 10 min.

The lines should be the following:

# Go to the folder ~/workshop_materials/a09_fastsimcoal2/
cd ~/workshop_materials/a09_fastsimcoal2/

# copy the lines that you get from the code above
cp ./fastsimcoal2/fsc27_linux64/fsc2702 ./Henn_et_al_data/;chmod +x ./Henn_et_al_data/fsc2702;cd ./Henn_et_al_data/; 
cp Maya_San_DSFS.obs NoMig_San_Maya_jointDAFpop1_0.obs;
./fsc2702 -t NoMig_San_Maya.tpl -e NoMig_San_Maya.est -L 20 -n 100000 -d -M -q -C1 -c4 -B4;cd ..

FASTSIMCOAL2 IS NOW RUNNING. Once the run is finished, process the output in RStudio as follows.

Example of results of one run for the model without migration. Copy the following commands to RStudio.

# read the file with the maximum likelihood estimates
maxlhoodEstimates <- read.table(paste(settings$pathTo_InputFile,
settings$TPL_EST_file_tag, "/",
settings$TPL_EST_file_tag, ".bestlhoods",
sep=""), header=T, check.names = F)
# replace the X. at the start of each parameter name
paramnames <- colnames(maxlhoodEstimates) 
eval <- which(substr(paramnames,1,2)=="X.") # get params starting with X.
paramnames[eval] <- substr(paramnames[eval], start=3, stop=nchar(paramnames[eval])) # keep only the substring after X. for those parameters
colnames(maxlhoodEstimates) <- paramnames # replace column names
# print the estimates
maxlhoodEstimates
##   $NPOP1$ $NPOP2$ $NANC$ $NBOTP1$ $TDIV$ $TBOT_END$ MaxEstLhood MaxObsLhood
## 1    7696   34346  18553       83   3333       1906   -476812.2   -475819.9

These are the estimates I obtained, but you will very likely have different results because your run started from a different starting value.


Now, let's compare the fit of the expected SFS under the model to the observed data.

# Read the observed SFS
# 2D from two human populations: San and Maya
maya_san_2dsfs <- read.table(paste(settings$pathTo_InputFile,
settings$TPL_EST_file_tag, settings$obsfileend,".obs",sep=""), 
skip=1, stringsAsFactors = F, header=T, row.names = 1)
# Read the expected SFS
expectedSFS <- read.table(paste(settings$pathTo_InputFile,
settings$TPL_EST_file_tag, "/",
settings$TPL_EST_file_tag, settings$obsfileend, ".txt",
sep=""), header=T)

# Plot the SFS fit
plot2dSFS(maya_san_2dsfs, expectedSFS, xtag="San", ytag="Maya", minentry=1)    

# Plot the relative 2D-SFS fit
plot_relDiff2dSFS(maya_san_2dsfs, expectedSFS, xtag="San", ytag="Maya", minentry=1)

# Draw a scenario of the model parameter estimates assuming a generation time of 29 years
# path to the maxL.par file that is a par file with the parameters that maximize the likelihood
layout(1)
path_to_maxL_file <- paste(settings$pathTo_InputFile, settings$TPL_EST_file_tag, "/",settings$TPL_EST_file_tag, "_maxL", sep="")
parFileInterpreter(args=path_to_maxL_file, pop.names=c("Maya","San"), gentime=29, printPDF=FALSE)
## [1] "./Henn_et_al_data/NoMig_San_Maya/NoMig_San_Maya_maxL"

DISCUSSION
– What are the estimated effective sizes?
– Is there evidence for a bottleneck in Maya?
– What are the estimated times of divergence?
– Does the expected SFS fit the observed SFS?
– Does the fit of the expected SFS give you an idea of how to improve the model?

Exercise 3.3.
Modify the input files to run a model with constant migration between Maya and San.

You already have the solutions in the files and folder called "Mig_San_Maya". You can look at those files if you want to check the solutions.

Check the EST and TPL files carefully. In the TPL file, notice the "keep" keyword in the definition of the migration matrix in the historical events: the "nomig" keyword indicates that there was no migration at times older than that event. Also notice the "absoluteResize" keyword, which means that the population size is not given as a relative value. In the EST file, notice the "bounded" and "paramInRange" keywords, which are useful to define parameters that have bounded upper limits or that depend on other parameter values.

# DEFINE THE MODEL WITH EST AND TPL FILES 
# tags to the TPL and EST files. Files should be "tag".tpl and "tag".est
settings$TPL_EST_file_tag <- "Mig_San_Maya"
# number of coalescent simulations
settings$n_coalsims <- 100000
# number of optimization cycles
settings$n_cycles <- 20

# RUN FASTSIMCOAL2 FROM THE COMMAND LINE
# get the fastsimcoal2 command line
cmd <- get_fsc2_commandline(settings)
# print command lines to copy files
cmd$copyfiles
# rename observed file name
cmd$renameobs
# print command line to run fastsimcoal2
cmd$fsc2cmd

Run fastsimcoal2 on the server (Amazon instance). This run will take longer than previous runs, approx. 10 min.

Run the following command lines in RStudio.

# read the file with the maximum likelihood estimates
maxlhoodEstimates <- read.table(paste(settings$pathTo_InputFile,
settings$TPL_EST_file_tag, "/",
settings$TPL_EST_file_tag, ".bestlhoods",
sep=""), header=T, check.names = F)
# look at the parameter estimates
maxlhoodEstimates
##   $NPOP1$ $NPOP2$ $NANC$ $NBOTP1$ $TDIV$  $N1M12$   $N2M21$ $TMIG_STOP$
## 1    9767   35258  15378       38   4702 0.506294 0.1263726         149
##   $TBOT_END$     $MIG12$     $MIG21$ MaxEstLhood MaxObsLhood
## 1       1402 5.18347e-05 3.58425e-06   -476464.1   -475819.9
# Read the observed SFS
# 2D from two human populations: San and Maya
maya_san_2dsfs <- read.table(paste(settings$pathTo_InputFile,
settings$TPL_EST_file_tag, settings$obsfileend,".obs",sep=""), 
skip=1, stringsAsFactors = F, header=T, row.names = 1)
# Read the expected SFS
expectedSFS <- read.table(paste(settings$pathTo_InputFile,
settings$TPL_EST_file_tag, "/",
settings$TPL_EST_file_tag, settings$obsfileend, ".txt",
sep=""), header=T)

# Plot the SFS fit
plot2dSFS(maya_san_2dsfs, expectedSFS, xtag="San", ytag="Maya", minentry=1)    

# Plot the relative 2D-SFS fit
plot_relDiff2dSFS(maya_san_2dsfs, expectedSFS, xtag="San", ytag="Maya", minentry=1)

# Draw a scenario of the model parameter estimates assuming a generation time of 29 years
# path to the maxL.par file that is a par file with the parameters that maximize the likelihood
layout(1)
path_to_maxL_file <- paste(settings$pathTo_InputFile, settings$TPL_EST_file_tag, "/",settings$TPL_EST_file_tag, "_maxL", sep="")
parFileInterpreter(args=path_to_maxL_file, pop.names=c("Maya","San"), gentime=29, printPDF=FALSE)
## [1] "./Henn_et_al_data/Mig_San_Maya/Mig_San_Maya_maxL"

DISCUSSION
– What are the estimated effective sizes?
– What are the estimated times of divergence?
– Does the expected SFS fit the observed SFS?
– How are the parameter values affected by adding migration to the model? Would you change your conclusions?
– What are the estimated migration rates?
– Is the model with migration better than a model without migration?
– Are these models sufficient to explain the data?
– Here we analyzed data from exome sequencing. Can we trust these demographic estimates?
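One common way to address the model-comparison questions above is an information criterion such as AIC, after converting the log10-likelihoods reported by fastsimcoal2 to natural logs. This is a hedged sketch, not part of the tutorial's required steps; the k values below are placeholders, and you should count the estimated (non-derived) parameters in each EST file:

# AIC from a log10 maximum likelihood; smaller is better
aic <- function(k, maxEstLhood_log10) 2*k - 2*log(10)*maxEstLhood_log10
aic(k=6,  maxEstLhood_log10=-476812.2)  # e.g. no-migration model (placeholder k)
aic(k=10, maxEstLhood_log10=-476464.1)  # e.g. migration model (placeholder k)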

4. Application to SFS obtained from low coverage whole genome data using ANGSD

The folder “./ANGSD_SFS” contains the 2D-SFS you obtained in the ANGSD tutorial, from 10 Yoruba and 10 CEU individuals. It contains the following files:

  • 2D-SFS for the two populations (NoMig_ANGSD_yrbceu_rel_DSFS.obs)

  • TPL and EST files for a model without gene flow
    In this case pop1 corresponds to 10 Yoruba diploid individuals (20 gene copies), and pop2 corresponds to 10 CEU diploid individuals (20 gene copies).

Exercise 4.1.
Have a look at the input files (obsSFS, EST and TPL files). Answer the following questions:
– Are monomorphic sites included in the observed SFS?
– Given that there are no monomorphic sites, what is the parameter set as a reference?

For this observed SFS, there are two key differences compared with exercise 3:

  • it is coded as a single line rather than a 2D matrix (use the --multiSFS argument);

  • it does not include the monomorphic sites (use the -0 argument).

Since the monomorphic sites are not included, we cannot use the mutation rate as a clock for inference. In the TPL and EST files all parameters are relative to the reference parameter, in this case NANC (the ancestral population size), which we fixed at 40,000. You can still estimate relative parameters, such as the relative time of divergence and relative effective sizes.

Exercise 4.2.
Infer the relative time of divergence between Yoruba and CEU.
Consider a model without gene flow and with a potential bottleneck in CEU population at the time of divergence from Yoruba.

Go to the Amazon instance, copy the command lines you have assembled in RStudio, and run them in the terminal. This run will take longer than previous runs, approx. 10 min.

If you need help assembling the R code, it is given below.
# create a list that saves all the required settings to run fastsimcoal2
settings <- list()
# DEFINE OBSERVED SFS INPUT FILE
settings$obsfile <- "NoMig_ANGSD_yrbceu_rel_DSFS.obs"
# path to input files (working directory)
# In this case, the input files are in the folder ./ANGSD_SFS/
settings$pathTo_InputFile <- "./ANGSD_SFS/"
# DEFINE THE MODEL WITH EST AND TPL FILES 
# tags to the TPL and EST files. Files should be "tag".tpl and "tag".est
settings$TPL_EST_file_tag <- "NoMig_ANGSD_yrbceu_rel"
# observed SFS file end
settings$obsfileend <- "_DSFS"

# read the file with the maximum likelihood estimates
maxlhoodEstimates <- read.table(paste(settings$pathTo_InputFile,
settings$TPL_EST_file_tag, "/",
settings$TPL_EST_file_tag, ".bestlhoods",
sep=""), header=T, check.names = F)
# replace the X. at the start of each parameter name
paramnames <- colnames(maxlhoodEstimates) 
eval <- which(substr(paramnames,1,2)=="X.") # get params starting with X.
paramnames[eval] <- substr(paramnames[eval], start=3, stop=nchar(paramnames[eval])) # keep only the substring after X. for those parameters
colnames(maxlhoodEstimates) <- paramnames # replace column names
# print the estimates
maxlhoodEstimates

The lines should be the following:

# Go to the folder ~/workshop_materials/a09_fastsimcoal2/
cd ~/workshop_materials/a09_fastsimcoal2/

# copy the lines that you get from the code above
cp ./fastsimcoal2/fsc27_linux64/fsc2702 ./ANGSD_SFS/;chmod +x ./ANGSD_SFS/fsc2702;cd ./ANGSD_SFS/; 

# run fastimcoal2 with the -0 and --multiSFS option
./fsc2702 -t NoMig_ANGSD_yrbceu_rel.tpl -e NoMig_ANGSD_yrbceu_rel.est -L 20 -n 100000 -d -0 --multiSFS -M -q -C1 -c4 -B4

Example of results of one run for the model without migration.

##   $RELNPOP1$ $RELNPOP2$ $NBOTP2$ $RELTDIV$ $TDIV$ $NPOP1$ $NPOP2$ MaxEstLhood
## 1   2.882062  0.5743272       93  0.059426   2377  115282   22973   -18008.74
##   MaxObsLhood
## 1   -17311.15

As in exercise 3, you will have different results because your run started from a different starting value.


Now, let's compare the fit of the expected SFS under the model to the observed data.

# Read the observed SFS
# 2D from two human populations: Yoruba and CEU
yrb_ceu_2dsfs <- scan(paste(settings$pathTo_InputFile,
settings$TPL_EST_file_tag, settings$obsfileend,".obs",sep=""), 
skip=2)
# Read the expected SFS
yrb_ceu_expSFS <- scan(paste(settings$pathTo_InputFile,
settings$TPL_EST_file_tag, "/",
settings$TPL_EST_file_tag, settings$obsfileend, ".txt",
sep=""), skip=2)

# convert the single-line multiSFS into a 2D-SFS matrix
# (21 columns = 2*10 gene copies + 1 frequency classes per population)
obsmat <- matrix(yrb_ceu_2dsfs, ncol=21, byrow=T)
expmat <- matrix(yrb_ceu_expSFS, ncol=21, byrow=T)

# Plot the SFS fit
plot2dSFS(obsmat, expmat, xtag="Yoruba", ytag="CEU", minentry=1)    

# Draw a scenario of the model parameter estimates assuming a generation time of 29 years
# path to the maxL.par file that is a par file with the parameters that maximize the likelihood
layout(1)
path_to_maxL_file <- paste(settings$pathTo_InputFile, settings$TPL_EST_file_tag, "/",settings$TPL_EST_file_tag, "_maxL", sep="")
parFileInterpreter(args=path_to_maxL_file, pop.names=c("Yoruba","CEU"), gentime=29, printPDF=FALSE)
## [1] "./ANGSD_SFS/NoMig_ANGSD_yrbceu_rel/NoMig_ANGSD_yrbceu_rel_maxL"

Exercise 4.3. DISCUSSION
– Given the lack of monomorphic sites, what are the estimated times of divergence and effective sizes?
– If the reference NANC was actually 20000, what would be the estimated divergence time?
– Does the expected SFS fit the observed SFS?
– Given the observed SFS and the fit of the expected SFS, are you confident you have enough data?