fastsimcoal2 activity - Evolution and Genomics

Practical session: Inferring demographic parameters from the SFS with composite likelihood method implemented in fastsimcoal2

Vitor C. Sousa, 26 January 2018

Objectives of this practical session

What is the SFS?

Derived (unfolded) SFS from genotypic data
Multi-dimensional SFS (2D-SFS)

Does the SFS contain information about demography?

Infer parameters under simple models with fastsimcoal2 (simulated data)
Input files for fastsimcoal2
Settings to run fastsimcoal2
Analyse the output of fastsimcoal2
Does our model describes well the data? Fit of expected SFS under the moded with observed SFS

Application to real dataset from human populations

Apply fastsimcoal2 for a model with two populations
Compare results with and without gene flow

Advanced

Folded vs unfolded SFS
How to deal with missing data?

Requirements

For this practical, we will use R Studio in the browser (open http://YOUR_IP.compute-1.amazonaws.com:8787) and call fastsimcoal2 directly from R.
We will then visualize the results in R, which requires the following R packages:

plotrix
diagram
RColorBrewer

You can load the libraries with the following commands in R:

library(plotrix)
library(diagram)
library(RColorBrewer)

1. What is the site frequency spectrum (SFS)? (45 min)

1.1. Build the derived SFS from genotypic data

The SFS (also called Allele frequency spectrum – AFS) is a summary of genome-wide data that describes the number of single nucleotide polymorphisms (SNPs) with a given frequency in our sample.

Consider the following genotypic data matrix where each entry codes for the genotypes as 0 (homozygote for reference allele), 1 (heterozygote), 2 (homozygote for alternative allele). In this matrix each column corresponds to a site, and each row corresponds to an individual. This is similar to what we can obtain from a variant call format (VCF) file.

	snp1	snp2	snp3	snp4	snp5	snp6	snp7	snp8	snp9	snp10	snp11	snp12
ind1	1	0	0	0	0	0	0	0	0	0	0	1
ind2	0	0	1	1	1	0	1	0	1	0	1	0
ind3	0	1	2	1	1	1	0	0	0	0	0	0
ind4	0	0	1	0	1	0	0	0	1	0	1	1
ind5	0	0	0	0	2	0	2	1	0	1	0	1

Exercise 1.1.1 (10 min).
Use a piece of paper and a pen to compute the observed SFS from the above matrix.

Before starting, answer the following questions:
– What is the x-axis of the SFS? What is the range of possible values for the x-axis?
– What is the y-axis of the SFS? What is the range of possible values for the y-axis?

First, you need to compute the absolute allele frequency for each SNP.
Second, you need to count the number of sites with a given allele frequency.

Exercise 1.1.2.
Compute and plot the derived site frequency spectrum from the dataset in file “~/workshop_materials/06_fastsimcoal2/Data/example.GT” with R.

Use the R code given below. First, read the file using the function read.table(). Second, compute the allele frequency for each SNP using the function colSums(). Third, count the number of sites with a given allele frequenc using the function table(). Fourth, plot the SFS with the barplot() function.

This dataset was simulated from a single stationary population with a constant effective size, consisting of a sample of 5 diploid individuals genotyped at 10,000 sites.

IMPORTANT: In this exercise we assume that the ancestral/derived state of each allele can be determined (e.g. comparing with data from a related species), and that the reference allele corresponds to the ancestral state, and the alternative allele to the derived state.


# set the working directory
setwd("~/workshop_materials/06_fastsimcoal2")

# set the folder Data as the working directory. Please change in your computer accordingly.
path_to_datafile <- "./Data/"
# number of SNPs
nsnps <- 10000
# number of individuals
nind <- 5
# read the genotype data as a matrix
genomat <- matrix(scan(file=paste(path_to_datafile,"example.GT", sep="")), nrow=nind, ncol=nsnps, byrow=T) 
# Exercise: get the observed SFS with the colSums() and table() functions.
all_freq <- colSums(genomat)
obs.sfs <- table(all_freq)
# to plot use the barplot() function.

Results.

You should obtain something like the following barplot on the left.

The plot on the left shows the SFS including the monomorphic sites, whereas the right plot shows the relative SFS, including only the polymorphic sites. The solid line on the right corresponds to the theoretical expectation for a stationary population of constant size.

TO DISCUSS
The SFS is an efficient way to sumarize genome-wide variation in a sample.

How many entries does the SFS has? Does it depend on the number of SNPs or number of individuals?
Why do we expect most SNPs to have a low frequency in a stationary (constant-size) population?

1.2. Multi-dimensional SFS (2D-SFS)

So far we have been looking at 1D SFS. But this can be extended to any number of populations. For instance, if we had sampled data from two populations, the 2D-SFS corresponds to a matrix where the entry (i,j) corresponds to the count of sites with frequency i in population 1, and frequency j in population 2.

Consider the following genotype matrix for 12 SNPs genotyped at 5 individuals from 2 populations, 2 diploid individuals from Pop1, and 3 diploid individuals from Pop2. Each row corresponds to one individual and each column to one SNP. Genotypes are coded as 0, 1, 2 as in Exercise 1.1.

Pop1

	snp1	snp2	snp3	snp4	snp5	snp6	snp7	snp8	snp9	snp10	snp11	snp12
Pop1 ind1	1	0	0	1	0	0	0	0	1	0	0	0
Pop1 ind2	0	0	1	0	0	1	0	0	0	1	2	0

Pop2

	snp1	snp2	snp6	snp7	snp8	snp9	snp10	snp11	snp12
Pop2 ind1	1	1	0	1	0	1	1	1	1
Pop2 ind2	0	0	1	0	0	0	1	0	1
Pop2 ind3	0	0	0	0	1	1	0	1	2

Exercise 1.2.1 (10 min).
Use a piece of paper and a pen to compute the observed 2 dimensional SFS from the above matrices with the genotypes for each population.

The 2D SFS can be seen as a matrix. Before starting answer the following questions:

How many rows has the 2D SFS for this dataset? What affects the number of rows?
How many columns has this 2D SFS for this dataset? What affects the number of columns?
What is the range of possible values for each entry of the matrix?

First, you need to compute the absolute allele frequency for each SNP in each population.
Second, you need to count the number of sites with a given allele frequency in each population.

Example of a 2D-SFS

The following matrix contains the 2D-SFS of 50,000 simulated sites from two populations, with a sample size of 2 diploids for pop1 and 3 diploids for pop2. Hence, we have 5 rows and 7 columns. Each entry (i,j) contains the number of SNPs with a derived allele frequency of i in pop1, and j in pop2. The sum of the entries in this matrix is the total number os sites, which in this case is 50,000.

	0	1	2	3	4	5	6
0	23718	4770	1804	683	288	85	22
1	3428	1919	1370	826	495	214	93
2	873	866	853	783	584	428	219
3	225	329	450	533	541	557	471
4	23	72	155	251	413	649	1010

Exercise 1.2.2.
Based on the above matrix, answer the following questions.

How many sites have 1 heterozygote individual in pop1 and 1 heterozygote individual in pop2?
How many sites have a derived allele frequency of 4 in pop1 and a derived allele frequency of 4 in pop2?
Can you identify the entry corresponding to the sites where all individuals are heterozygote?

2. Does the SFS contain information about demography? (60 min)

2.1. Infer parameters under simple models with fastsimcoal2 (simulated data)

We will analyse the models simulated according to the following scenarios.

2.1.2 Input files: DATA

The input files with the observed SFS are in the folder: “~/workshop_materials/06_fastsimcoal2/FscInputFiles/”. There are two files:

1PopExpInst20Mb_DAFpop0.obs: 1D observed SFS for the model with a single population.
2PopDiv20Mb_jointDAFpop1_0.obs: 2D observed SFS for the model with two populations.

Exercise 2.1.2.
Have a look at the input files.

You can do this using a text editor (e.g. RStudio editor, gedit, TextPad, Notepad). You can also read them with R using the read.table() function using the code below.

# read the observed SFS
# 1D from a population that underwent an expansion
pop1d <- read.table("./FscInputFiles/1PopExpInst20Mb_DAFpop0.obs", skip=1, stringsAsFactors = F, header=T)

# 2D from two populations with different Ne that diverged some time ago
pop2d <- read.table("./FscInputFiles/2PopDiv20Mb_jointDAFpop1_0.obs", skip=1, stringsAsFactors = F, header=T, row.names = 1)

For the sample from a single population, the observed derived SFS is:

d0_0	d0_1	d0_2	d0_3	d0_4	d0_5	d0_6	d0_7	d0_8	d0_9	d0_10
19988389	10543	355	163	127	128	78	82	84	51	0

For the sample from two populations, the observed derived SFS is:

	d0_0	d0_1	d0_2	d0_3	d0_4	d0_5	d0_6
d1_0	19969298	8235	2294	945	368	152	49
d1_1	4188	1615	1142	703	471	234	112
d1_2	1204	853	815	674	532	370	221
d1_3	415	429	539	556	535	471	363
d1_4	94	199	296	390	543	695	0

The SFS is the data used to infer the parameters with fastsimcoal2. You do not need to give more information! So, all the genome-wide data is summarized by the SFS and no other information is required. From this data we can infer parameters and compute the likelihood.

TO DISCUSS:
Can the observed SFS include the monomorphic sites?
Can the observed SFS include linked sites?

2.1.3 Input files: DEFINITION OF MODELS

Exercise 2.1.3.
Have a look at the TPL and EST files for the models shown above.

Each model is defined in a TPL file. The parameters we want to infer are given a parameter name tag (e.g. NPOP, TDIV). There are two TPL files on the folder “~/workshop_materials/06_fastsimcoal2/FscInputFiles”, corresponding to the two models.

1PopExpInst20Mb.tpl
2PopDiv20Mb.tpl

The search range for each parameter is defined in the EST file. For each parameter we specify the search range using the corresponding parameter name tag. There are two EST files on the folder: “~/workshop_materials/06_fastsimcoal2/FscInputFiles”, corresponding to the two models.

1PopExpInst20Mb.est
2PopDiv20Mb.est

In sum, there are two steps:

Define the model with the TPL file, specifying the parameter tags for those parameters we want to infer. NOTE: different parameters need to be given different parameter name tags, and name tags cannot be a sub-string another one (e.g. NPOP1, NPOP1_2 cannot be used because NPOP1 is a sub-string of NPOP1_2).
Define the search range for each parameter in the EST file.

For further information on definition of models check the manual and the material given during the lecture. NOTE: the TPL and EST files need to have the same filename, but have different extensions: [filename].est and [filename].tpl.

TO DISCUSS
What is the mutation rate assumed? Do we need a mutation rate?

2.1.3.1 Settings to run fastsimcoal2

To run fastsimcoal2 we need to specify the following options:

the number of coalescent simulations to approximate the expected SFS (-n option). This should be larger than 100,000. We recommend to use 200,000 to 1,000,000.
the number of optimization cycles (-L option). This should be at least 20, but values between 50 and 100 are recommended.
use option -M indicating that we will run the likelihood optimization to infer parameters based on the SFS.
use option -d for the derived SFS and -m for the MAF SFS.
use option -q to have a quite mode, without printing lots of information.
specify option -C1 such that the likelihood is computed for all entries with at least 1 SNP. If you specify -Cx then all entries with less than x SNPs are pooled together. This option is useful when there are many entries in the observed SFS with few SNPs and with with a limited number of SNPS to avoid overfitting.
specify options for multithread. -c1 -B1 for a single core, -c4 -B4 for 4 cores using 4 batches.

2.2. Run fastsimcoal2 from R

The following scripts will run 1 fastsimcoal2 optimization, starting from a random initial starting value. When dealing with real data one needs to repeat at least 20 to 100 runs (i.e. runs with different initial starting values), and then we select the run that maximizes the likelihood.

Exercise 2.2.1.
– Run fastsimcoal2 to estimate the parameters of the single population model.

Follow the R commands below. Try to understand what is done at each step of the code.

# load functions to run fastsimcoal2 and to process the output 
source("utilFscFunctions.r")

# create a list that saves all the required settings to run fastsimcoal2
settings <- list()
# path to fastsicoal2 executable file
settings$pathToFsc <- "./fastsimcoal2/fsc26_linux64/"
# path to input files (working directory)
settings$pathTo_InputFile <- "./FscInputFiles/"
# path to TPL and EST file (tags)
settings$TPL_EST_file_tag <- "1PopExpInst20Mb"
# number of coalescent simulations
settings$n_coalsims <- 100000
# number of optimization cycles
settings$n_cycles <- 20

# IF YOU WANT TO RUN FASTSIMCOAL2 FROM THE COMMAND LINE
# get the fastsimcoal2 command line
cmd <- get_fsc2_commandline(settings)

# IF YOU WANT TO RUN FASTSIMCOAL2 BY CALLING IT FROM R
# run fastsimcoal2
run_fsc2(settings)

Running fastsimcoal2 should take less than 20 seconds. You will get a print of the optimization procedure, with the likelihood at each iteration. For instance, I obtained the following in one of the runs:

Estimation of parameters by conditional maximization via Brent algorithm (initial lhood = -111487)
Iter 1 Curr best params: 4224 55 74883 lhood=-48455.9
Iter 2 Curr best params: 4224 55 74883 lhood=-48455.9
Iter 3 Curr best params: 4224 55 74883 lhood=-48455.9
Iter 4 Curr best params: 4219 114 2852 lhood=-47560.3
Iter 5 Curr best params: 82085 344 2223 lhood=-45117.9
… Iter 20 Curr best params: 566385 532 2046 lhood=-45045.8

The likelihood is given in units of log10. Hence, the closer to zero the higher the likelihood. We can see that the likelihood increased from -111487 to -45045.8 in 20 cycles of optimization. We can also see that after 1 iteration the parameters were NPOP 4224, NANC 55, TCHANGE 74883, and that at iteration 20 the values were NPOP 566385, NANC 532, TCHANGE 2046.

How to interpret fastsimcoal2 results?

Fastsimcoal2 will output several files, which contain important information. The most relevant information for most users is:

[filetag].bestlhoods : file with the Maximum likelihood parameter estimates and corresponding likelihood. This is what we want: the parameter estimates!
[filetag]_DAFpop0.txt : file with the expected SFS obtained with the parameters that maximized the likelihood during optimization. This is needed to visually check the fit of the SFS.
[filetag].simparam : file with an example of the settings to run the simulations. This is useful to check when you have errors. Many times errors in specification of models can be detected in this file.

Exercise 2.2.2
Obtain parameter estimates and maximum likelihood.

Follow the code below that will read the output files from fastsimcoal2 to obtain the parameter estimates.

# read the file with the maximum likelihood estimates
maxlhoodEstimates <- read.table(paste(settings$pathTo_InputFile, "/",
                                      settings$TPL_EST_file_tag, "/",
                                      settings$TPL_EST_file_tag, ".bestlhoods",
                                      sep=""), header=T)
# look at the parameter estimates
maxlhoodEstimates

##     NPOP NANC TCHANGE MaxEstLhood MaxObsLhood
## 1 525242  540    2037   -45045.87   -45043.21

The maxlhoodEstimates contain a matrix where each column corresponds to a parameter name tag and the corresponding parameter estimate. You have the result of 1 run. In the analysis of real data sets we need to run several independent runs (different starting values), and select the parameter estimates of the run attaining the maximum likelihood estimate (MaxEstLhood).

TO DISCUSS:

What are the parameter estimates? What is the likelihood?
Time estimates are given in generations. How to convert to years?

Difference between MaxObsLhood and MaxEstLhood

MaxObsLhood: is the maximum possible value for the likelihood if there was a perfect fit of the expected to the observed SFS, i.e. if the expected SFS was the relative observed SFS.
MaxEstLhood: is the maximum likelihood estimated according to the model parameters.

The better the fit, the smaller the difference between MaxObsLhood and MaxEstLhood

So, in your case, all students should have the same MaxObsLhood because all students are looking at the same OBSERVED SFS, but each one of you will have a different MaxEstLhood because you have run optimization cycles starting from different initial values.

Exercise 2.2.3
– Assess the fit of the expected SFS under the model to the observed SFS.

Follow the code below that will read the output files from fastsimcoal2 to obtain the parameter estimates.

# Assess the fit of the expected SFS into the observed SFS
# Read the expected SFS
# 1D from a population that underwent an expansion
obsfileend <- "_DAFpop0"
pop1d <- read.table(paste(settings$pathTo_InputFile,
                    settings$TPL_EST_file_tag, obsfileend, ".obs",
                    sep=""), skip=1, stringsAsFactors = F, header=T)
expectedSFS <- read.table(paste(settings$pathTo_InputFile,
                    settings$TPL_EST_file_tag, "/",
                    settings$TPL_EST_file_tag, obsfileend, ".txt",
                    sep=""), header=T)
# discard the monomorphic sites from the observed SFS to compare with the expected SFS
observedSFS <- pop1d
# observed SFS just with SNPs (polymorphic sites)
observedSFS[1] <- 0
# Plot the observed vs expected SFS
plot_fitSFS_1d(observedSFS, expectedSFS, c(1,max(observedSFS)*1.1), "pop1")

Make a copy of the file with the parameters that maximized the likelihood for your run, “*.bestlhoods” and rename it with YOUR_NAME.bestlhoods. Then, copy the renamed file to the shared folder: “scp YOUR_NAME.bestlhoods [email protected]:~/all_likelihoods” such that we have a set of all the runs to analyse.

DISCUSSION

Is the method able to infer the correct parameters?
Does the expected SFS fit the observed SFS?
Does the SFS contain information about the demographic history of populations?

Exercise 2.3. (optional) – Run for the model with two populations by modifying the code above.

3. Application to real data from human population (60 min)

3.1. Apply fastsimcoal2 for a model with two populations

The folder “~/workshop_materials/06_fastsimcoal2/Henn_et_al_data” contains exome data from two human populations (Namibian San, SAN; https://en.wikipedia.org/wiki/San_people, and Mexican Mayan, MAYA, https://en.wikipedia.org/wiki/Maya_peoples) published in Henn et al. (2015) PNAS (http://www.pnas.org/content/113/4/E440.abstract). This data was kindly provided by Stephan Peischl (http://www.bioinformatics.unibe.ch/about_us/staff/dr_peischl_stephan/index_eng.html), as is a small sub-set of the original data. As such, it cannot be used outside the scope of this course. The original data is available at NCBI Sequence Read Archive (accession no. SRP036155).

The folder “~/workshop_materials/06_fastsimcoal2/Henn_et_al_data” contains the following files:
– 2D-SFS for the two populations, assuming that we sequenced a total of 44Mb
– Original genotypic matrix
– TPL and EST files for a model without gene flow

In this case pop1 is Maya with 8 diploid individuals sampled (16 gene copies), and pop2 is San with 6 diploid individuals sampled (12 gene copies).

Exercise 3.1.1. Have a look at the input files and answer the following questions:

Are monomorphics sites included in the observed SFS?
What is the sample size from each population?
What is the mutation rate?
How many historical events does the model include?
What are the parameter search ranges?

If the monomorphic sites are not included we cannot use the mutation rate as a clock to get inferences. When we do not have the number of monomorphic sites, all parameters must be specified in relation to a given fixed parameter (e.g. an effective size, or a time of event). Here I obtained the number of monomorphic sites assuming that a total of 44MB from the exome was sequenced and that sites that are not SNPs are monomorphic, i.e. #monomorphic = 44e6 – #SNPs.

Exercise 3.1.2.
Infer the time of divergence between two human populations, San from Africa and Maya from America using fastsimcoal2

Start by considering a model without gene flow.
Then, consider a model with gene flow.

1. create an EST and TPL file for the model with migration (if you’re unsure how to do that you may have a look at the files in subdirectory “solutions_for_files_with_migration” – but try creating them yourself first).
1. Run fastsimcoal2 for the model with gene flow.

Use the script below for the model without gene flow (the run takes about 5 min).

# load functions to run fastsimcoal2 and to process the output 
source("utilFscFunctions.r")
source("ParFileInterpreter-VS1.r")

# create a list that saves all the required settings to run fastsimcoal2
settings <- list()
# path to fastsicoal2 executable file
settings$pathToFsc <- "./fastsimcoal2/fsc26_linux64/"
# path to input files (working directory)
settings$pathTo_InputFile <- "./Henn_et_al_data/"
# path to TPL and EST file (tags)
settings$TPL_EST_file_tag <- "NoMig_San_Maya"
# number of coalescent simulations
settings$n_coalsims <- 100000
# number of optimization cycles
settings$n_cycles <- 20

# IF YOU WANT TO RUN FASTSIMCOAL2 FROM THE COMMAND LINE
# get the fastsimcoal2 command line
cmd <- get_fsc2_commandline(settings)

# IF YOU WANT TO RUN FASTSIMCOAL2 BY CALLING IT FROM R
# run fastsimcoal2
run_fsc2(settings)

# read the file with the maximum likelihood estimates
maxlhoodEstimates <- read.table(paste(settings$pathTo_InputFile, "/",
                                      settings$TPL_EST_file_tag, "/",
                                      settings$TPL_EST_file_tag, ".bestlhoods",
                                      sep=""), header=T)
# look at the parameter estimates
maxlhoodEstimates

# Read the observed SFS
obsfileend <- "_jointDAFpop1_0"
# 2D from two human populations: San and Maya
maya_san_2dsfs <- read.table(paste(settings$pathTo_InputFile, "/", settings$TPL_EST_file_tag, obsfileend,".obs",sep=""), skip=1, stringsAsFactors = F, header=T, row.names = 1)
# Read the expected SFS
expectedSFS <- read.table(paste(settings$pathTo_InputFile,
                    settings$TPL_EST_file_tag, "/",
                    settings$TPL_EST_file_tag, obsfileend, ".txt",
                    sep=""), header=T)

# discard the monomorphic sites from the observed SFS
pol_obs_2d <- as.matrix(maya_san_2dsfs)
# observed SFS just with SNPs (polymorphic sites)
pol_obs_2d[c(1,1)] <- 0
# get the marginal 1d sfs for pop 1
marginal1d_sfs <- rowSums(pol_obs_2d)
exp.marginal1d_sfs <- rowSums(expectedSFS)
# Plot the observed vs expected SFS
plot_fitSFS_1d(marginal1d_sfs, exp.marginal1d_sfs, c(1,max(marginal1d_sfs)*1.1), "Maya")

# get the marginal 1d sfs for pop 2
marginal1d_sfs <- colSums(pol_obs_2d)
exp.marginal1d_sfs <- colSums(expectedSFS)
# Plot the observed vs expected SFS
plot_fitSFS_1d(marginal1d_sfs, exp.marginal1d_sfs, c(1,max(marginal1d_sfs)*1.1), "San")

# Plot the 2D-SFS fit
plot2dSFS(maya_san_2dsfs, expectedSFS, xtag="San", ytag="Maya", minentry=1)    
plot_relDiff2dSFS(maya_san_2dsfs, expectedSFS, xtag="San", ytag="Maya", minentry=1)

# Draw a scenario of the model parameter estimates assuming a generation time of 29 years
# pdf(file=paste(settings$TPL_EST_file_tag,"_model.pdf", sep=""), width=8, height=10)
# path to the maxL.par file that is a par file with the parameters that maximize the likelihodo
path_to_maxL_file <- paste(settings$pathTo_InputFile, settings$TPL_EST_file_tag, "/",settings$TPL_EST_file_tag, "_maxL", sep="")
parFileInterpreter(args=path_to_maxL_file, pop.names=c("Maya","San"), gentime=29, printPDF=FALSE)
# dev.off()

Modify the script for the model with gene flow between San and Maya.

DISCUSSION

What are the estimated effective sizes?
What are the estimated times of divergence?
Does the expected SFS fit the observed SFS?
How are the parameter values affected by adding migration? Would you change your conclusions?
What are the estimated migration rates?
Is the model with migration better than a model without migration?
Are these models sufficient to explain the data?
Here we analyzed data from exome sequencing. Can we trust these demographic estimates?

4. ADVANCED – folded SFS and missing data (EXTRA TIME)

4.1. Folded vs unfolded SFS

To obtain the SFS in the previous exercises we assumed we could determine the ancestral and derived state of alleles. This is usually done by comparing our data with outgroup species. We further assumed that genotypes were coded according to the derived state of alleles (i.e., reference allele corresponded to the ancestral state, and the alternative allele to the derived). The resulting SFS is thus called “derived SFS” or “unfolded SFS”. However, in many cases, especially for non-model organisms, we do not have data from closely related species (outgroup) to determine the ancestral state of mutations. In such cases we can summarize the data as a function of the “minor allele frequency” SFS or “folded SFS”.

EXERCISE 4.1.
Compute the folded (minor allele frequency) SFS for the following genotypic matrix.

Follow the steps below. The “minor allele frequency” refers to the frequency of the less frequent allele at a particular site. If we go back to our genotype matrix:

snp1	snp2	snp3	snp4	snp5	snp6	snp7	snp8	snp9	snp10	snp11	snp12
0	0	2	0	1	0	1	0	0	0	1	2
1	0	1	1	2	0	2	0	1	0	0	2
0	1	1	0	2	1	1	0	1	0	0	1
1	0	2	0	2	0	2	0	0	0	1	1
0	0	2	2	2	0	1	1	0	1	1	1

Assuming that allles were determined arbitrarily as reference (allele as in reference genome) or alternative (allele different from reference), and genotypes are coded as 0 (homozygote for reference allele), 1 (heterozygote), 2 (homozygote for alternative allele) we can obtain the minor allele frequency spectrum as follows. First, compute the allele frequency of the reference and alternative state for each site.

The corresponding allele frequencies are:

	1	2	3	4	5	6	7	8	9	10	11	12
allele1	8	9	2	7	1	9	3	9	8	9	7	3
allele2	2	1	8	3	9	1	7	1	2	1	3	7

Now, for each site we need to decide what is the reference allele, which for the folded SFS is the allele with the minor allele frequency, i.e. the allele with a frequency less than 50%. In this example the allele1 is the minor allele for sites 3, 5, 7 and 12, whereas the allele2 is the minor allele for sites 1, 2, 4, 6, 8, 9, 10 and 11.

Next, we compute the “minor allele SFS” by counting how many sites have a given minor allele frequency.
To compute the SFS entry for the minor allele frequency of 1, we count the number of sites with a minor allele frequency of 1.

SFS entry 1: 5 (snps 2, 5, 6, 8, 10)
SFS entry 2: 3 (snps 1, 3, 9)
SFS entry 3: 4 (snps 4, 7, 11, 12)
SFS entries from 4 to 10: 0

Another way to compute the minor allele SFS is to obtain the SFS assuming that the alternative allele is the derived. Then add the entries with the same “minor allele frequency”. Note that alleles with a derived allele of 1 or 9 (out of 10) correspond to a minor allele frequency of 1, that alleles with a derived allele of 2 or 8 (out of 10) correspond to a minor allele frequency of 2, and so on. Hence, the minor allele (or folded) SFS can be computed by adding the entries of a derived SFS leading to the same minor allele frequency, as if we were folding the derived SFS at the 50% allele frequency.

EXERCISE 4.2.
compute the MAF SFS for the genotypic matrix in the file “~/workshop_materials/06_fastsimcoal2/Data/example.GT”.

Follow the steps below. Here we will use the SFS computed assuming that reference allele is the ancestral and then add the entries with the same “minor allele frequency”.

Another way to see this is that we need to add the entries with the same color to obtain the “minor allele SFS (MAF SFS)” or “folded SFS”.

library(RColorBrewer)
par(mfrow=c(1,2))
colvs <- brewer.pal(6, "Set1")
barplot(obs.sfs, xlab="derived allele frequency", ylab="number of sites", col=colvs[c(1:6,5:1)])
# set the maf sfs the same as obs.sfs
maf.obs.sfs <- as.numeric(obs.sfs)
# all the entries of maf.sfs > (nind/2) set to zero
maf.obs.sfs[c((nind+2):((2*nind)+1))] <- 0
# add the entry 1 to (nind*2)+1, 2 to (nind*2)-1, ...
maf.obs.sfs[c(1:(nind))] <- obs.sfs[c(1:(nind))]+obs.sfs[c(((2*nind)+1):(nind+2))]  
# str(maf.obs.sfs)
barplot(maf.obs.sfs, xlab="minor allele frequency", ylab="number of sites", col=colvs, main="Minor allele frequency (MAF) SFS", names.arg =c(0:(nind*2)))

Below I provide you with an R function to compute the MAF (folded) from an unfolded (derived) SFS.

# GETMAF
# Given an unfolded SFS get the minor allele frequency spectrum (folded SFS)
# INPUT
#   unfolded.sfs : numeric vector of size (nindividuals*2)+2 with the unfolded SFS
# NOTE! Only works for unfolded.sfs with an odd number of entries!!
getMaf <- function(unfolded.sfs) {
  # set the maf sfs the same as obs.sfs
  maf <- as.numeric(unfolded.sfs)
  # define the length of the SFS and half the SFS
  n <- length(maf)
  nhalf <- ceiling(n/2)
  if(n %% 2 == 0) {
    stop("This getMaf function only works for SFS with odd number of entries (for data from diploid individuals.)")
  }
  # all the entries of maf.sfs > (n/2) set to zero, where n is 2*nindividuals
  maf[(nhalf+1):n] <- 0
  # add the entry 1 to n, 2 to n-1, and so on...
  maf[1:(nhalf-1)] <- unfolded.sfs[1:(nhalf-1)]+
                              unfolded.sfs[n:(nhalf+1)]
  maf
}

# Apply function to get maf.obs.sfs
maf.obs.sfs <- getMaf(obs.sfs)

4.2. How to deal with missing data?

In most cases data contains missing data because we could not determine the genotype for a given individual at a particular site. When computing the observed SFS missing data pose some problems, as the x-axis in the SFS corresponds the absolute allele frequency (not the relative frequency). A genotype matrix with missing data could be as follows:

	snp1	snp2	snp3	snp4	snp5	snp6	snp7	snp8	snp9	snp10	snp11	snp12
ind1	1	0	0	0	0	0	0	0	0	0	0	1
ind2	NA	0	1	1	1	NA	1	0	1	0	1	0
ind3	0	1	2	NA	1	1	0	NA	0	0	0	0
ind4	0	0	1	0	1	0	0	NA	1	0	1	1
ind5	0	0	0	0	2	0	2	1	0	1	0	1

Consider sites snp1 and snp2 that both have one heterozygote individual for the derived allele, and all the remaining with ancestral allele. (Note: here I assume we know the ancestral state). If there was no missing data both snps would contribute to the entry 1 of the SFS. However, here snp2 has no missing data, and hence we have a frequency of 1 out of 10 gene copies (5 diploids), whereas snp1 has a frequency of 1 out of 8 gene copies (4 diploids) because one individual is missing. Which entry should snp1 contribute to? It is not easy to answer this question!

To deal with missing data we have three options:

Discard sites with missing data, such that we only keep sites where all individuals were genotyped. This usually leads to a huge loss of data.
Find the minimum sample size across all sites (call this min_ss ), and for each site resample min_ss individuals, such that we create an SFS with data for all sites. This can be extended to take into account linkage, sampling blocks of data. See the paper Bagley et al (2017) Mol Ecol (https://doi.org/10.1111/mec.13972) and scripts available on dryad to do this block sampling without missing data: http://datadryad.org/resource/doi:10.5061/dryad.vh75r/2
Use the methods that take into account genotype uncertainty for NGS data. Implemented in ANGSD software. (http://www.popgen.dk/angsd/index.php/ANGSD)

In the script below I exemplify how we could get the SFS by resampling genotypes such that we have min_ss individuals per site with data. The required info is a genotypic matrix.

# number of SNPs
nsnps <- 10000
# number of individuals
nind <- 5

# read the genotype matrix with missing data
md_geno <- matrix(scan(file="./Data/example_missing_data.GT"), nrow=nind, ncol=nsnps, byrow=T)

# find minimum sample size
md_persite <- colSums(is.na(md_geno))
min_ss <- nind-max(md_persite)

# Go through each site, sample min_ss genotypes for each site without missing data
no_ms_geno <- apply(md_geno, 2, function(each_site) {sample(each_site[!is.na(each_site)], size=min_ss, replace = F)})
# Check that there is no missing data
if(sum(is.na(no_ms_geno))>0) {stop("Error: there is still missing data after resampling!")}

# Get the SFS for the dataset without missing data
obs.sfs <- table(colSums(no_ms_geno))
# Plot the unfolded and folded SFS
par(mfrow=c(1,2))
# unfolded
barplot(obs.sfs, xlab="derived allele frequency", ylab="number of sites")
# folded
maf.obs.sfs <- getMaf(obs.sfs)
barplot(maf.obs.sfs, xlab="minor allele frequency", ylab="number of sites", main="Minor allele frequency (MAF) SFS", names.arg =c(0:(min_ss*2)))