BayeScan exercise - Evolution and Genomics

Running BayeScan

The method has been implemented in C++ and there are command-line versions of BayeScan for all operating systems (OS X, Linux, and Windows). There is a GUI version only for Windows. Here we will use the command-line version.

Input file

The best approach is to prepare files in FSTAT, popgen, or Arlequin format. Then use PGDSpider to convert them into GESTE/BayScan format.

Output files

There are a large number of files that can be obtained by running BayeScan. The most important ones, which are also the ones printed by default, are:
*_Verif.txt: you should use this file in order to verify that BayeScan read the input files properly.
*.sel: provides the trace of some of the parameters estimated by the model. This include the loglikelihood and the local population Fst’s. This file is used to verify convergence of the chain.
*_fst.txt: this file is used to interpret the results. In other words, to identify the outlier loci.

Command line

If you type bayescan, you get the list of options with a brief explanation.

[toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”] —————————
| BayeScan 2.0 usage: |
—————————
-help Prints this help
—————————
| Input |
—————————
alleles.txt Name of the genotypes data input file
-d discarded Optional input file containing list of loci to discard
-snp Use SNP genotypes matrix
—————————
| Output |
—————————
-od . Output file directory, default is the same as program file
-o alleles Output file prefix, default is input file without the extension
-fstat Only estimate F-stats (no selection)
-all_trace Write out MCMC trace also for alpha paremeters (can be a very large file)
—————————
| Parameters of the chain |
—————————
-threads n Number of threads used, default is number of cpu available
-n 5000 Number of outputted iterations, default is 5000
-thin 10 Thinning interval size, default is 10
-nbp 20 Number of pilot runs, default is 20
-pilot 5000 Length of pilot runs, default is 5000
-burn 50000 Burn-in length, default is 50000
—————————
| Parameters of the model |
—————————
-pr_odds 10 Prior odds for the neutral model, default is 10
-lb_fis 0 Lower bound for uniform prior on Fis (dominant data), default is 0
-hb_fis 1 Higher bound for uniform prior on Fis (dominant data), default is 1
-beta_fis Optional beta prior for Fis (dominant data, m_fis and sd_fis need to be set)
-m_fis 0.05 Optional mean for beta prior on Fis (dominant data with -beta_fis)
-sd_fis 0.01 Optional std. deviation for beta prior on Fis (dominant data with -beta_fis)
-aflp_pc 0.1 Threshold for the recessive genotype as a fraction of maximum band intensity, default is 0.1
—————————
| Output files |
—————————
-out_pilot Optional output file for pilot runs
-out_freq Optional output file for allele frequencies[/toggle]

The output files we will use were generated with the following commands. We have already run these commands for you, so you will be able to use the output files straightaway. Still, let’s take a moment to understand the commands.

Long run that leads to results that are reliable because convergence of the MCMC was achieved. The prior odds are somewhat small (=10) for a dataset of 5000 SNPs, so the number of false positives is likely to be large unless we use a very stringent FDR threshold.

bayescan HsIMM-Us100-2.txt -o bayescanLongRunOD10 -n 5000 -thin 10 -nbp 20 -pilot 5000 -burn 50000 -pr_odds 10

Long run (convergence of the MCMC was achieved). The prior odds are much larger than in the previous case and more appropriate for 5000 SNPs, so the number of false positives will be smaller.
bayescan HsIMM-Us100-2.txt -o bayescanLongRunOD100 -n 5000 -thin 10 -nbp 20 -pilot 5000 -burn 50000 -pr_odds 100

A very important parameter of the method is the prior odds (PO) of the neutral model, highlighted in red in the above commands. This parameter indicates our skepticism about the possibility that a given locus is under selection. In other words, it indicates how much more likely we think the neutral model is (which does not include the locus-specific effect α) compared to the model with selection (i.e., with the locus-specific effect α).

For example, a PO = 10 indicates that we think the neutral model is 10 times more likely than the model with selection. The larger the PO, the more conservative the test of selection is. In principle, this parameter will not influence the results much if we have a large amount of data including many populations and many individuals per population. However, very frequently the number of populations or sample sizes are limited (e.g. less than 20 populations) so we do need to pay attention to this parameter.

The PO value that should be used depends on how many loci are included in the data set. If there are less than 1000 loci, then PO = 10 is reasonable, but with more loci (say between 1000 and 10000) PO = 100 or larger is a better choice. With millions of markers, as is the case in GWAS, values as large as 10000 may be necessary.

Interpreting the results

After verifying that the chain has converged using the methods described in subsection “Evaluating Convergence” (see below), then we can interpret the results. For this we focus on the *_fst.txt file, which includes six columns.

On the Amazon cloud, please find the BayeScan output files of both runs here:
~/wpsg_2016/activities/bayescan

In the *_fst.txt file, the first column is the SNP index or ID as given in the input file. The next three (prob, log10(P0), and qval) are related to the test of local adaptation, and therefore the model including the locus-specific effect. These include the posterior probability for the model including selection, the logarithm of the posterior odds for the model with selection, and the q-value for the model with selection. The fifth column gives the size of the locus-specific effect (alpha parameter). The last one provides the locus-specific FST averaged over all populations.

In order to decide if a locus is a good candidate for being under the influence of selection we use the q_value.

Exercises

As is typical of all MCMC methods, BayeScan is computationally intensive and a single run can take several hours unless you use a workstation or a cluster with many processors. Thus, we won’t be able to generate the output here. Instead, we have produced several output files using the commands provided earlier in the tutorial. Nevertheless, you can start a run to see how the run progresses for half an hour or so. In the meantime, we will begin working with the output files we produced previously.

1) You will find two outputs of BayeScan in this folder:
~/wpsg_2016/activities/bayescan
They correspond to runs with prior odds 10 and 100.

How many outlier loci do you identify if the target FDR is (a) 0.01, (b) 0.001, and (c) 0.000? Explain how the results change depending on the prior odds.

Evaluating convergence

We will use CODA, an R package that implements several convergence tests. If you are using the cloud for this tutorial, you will either need to use a remote desktop or RStudio through the browser in order to view plots. Everything other than viewing plots can be done with a plain SSH connection.

Start R from inside one of the output folders.

Load the package
library(coda)

Read the data
chain <- read.table(“FILENAME.sel”,header=TRUE)

Create an MCMC object with the correct thinning interval (in the exercises below it is 10).
chain <- mcmc(chain,thin=10)

Always start with a plot. A rough way of verifying convergence is to plot the trace and the posterior distribution of some of the parameters.
plot(chain)

Obtaining summary statistics for the MCMC chain. We can obtain mean, standard deviation, naïve standard error of the mean (ignoring autocorrelation of the chain) and time-series standard error (taking into account autocorrelation) as well as
quantiles of the sample distribution. For this we use the command:
summary(chain)

Is the sample from the MCMC large enough? An important step is to verify that the sample size used to estimate the posteriors is large enough. The effective sample size on which the parameter estimates are based can be much smaller than the sample size used (in our example 5000) because the MCMC algorithm explores the parameter space by moving in small steps. Thus, two consecutive values will be strongly correlated; that’s why we use a thinning interval (in our examples it is 10 iterations).
We can check the correlation between sampled parameter values that our thinned chain uses for estimation:
autocorr.diag(chain)

If there is some correlation, then the effective sample size will be smaller than the sample size used as input. We can verify this with the command:
effectiveSize(chain)

As you will see when doing the exercises, the effective size of the likelihood sample is much smaller than the input value (5000), while those of the Fst parameters are less affected by the correlation. It is clear why this is the case: the correlation decreases much more rapidly for the Fst’s than for the likelihood.

Has the chain converged? Now we can use some more formal approaches for the detection of non-convergence.
A very simple test is Geweke’s convergence diagnostic based on the comparison of the means of the first and last parts of a Markov chain. The diagnostic reports the z-scores for each parameter. For example, with α = 0.05, the critical values of z are – 1.96 and +1.96. We reject H0 (equality of means => convergence) if z < -1.96 or z > +1.96.
geweke.diag(chain, frac1=0.1, frac2=0.5)

Another popular test is Heidelberg and Welch’s convergence diagnostic, which is also based on a single chain:
heidel.diag(chain, eps=0.1, pvalue=0.05)

It is generally better to run more than one chain and then use tests that compare them. One popular test is the Gelman and Rubin’s diagnostic. To use it you need first to combine two chains into a single mcmc list:
combined = mcmc.list(chain1,chain2)
plot(combined)
gelman.diag(combined)
gelman.plot(combined,ask)

The gelman.diag is based on a comparison of between and within chain variances. If the chains converged, these two variances should be equal. The output of gelman.diag are the scale reduction factors for each parameter. A factor of 1 means that between- and within-chain variance are the same; larger values mean that they are fairly different, indicating convergence problems. The rule of thumb is that values below 1.1 or so are OK but, being a rule of thumb, this is not an extremely rigorous criterion.

A very useful graphic representation of this convergence test is given by the command gelman.plot, which shows you the trace of the scale-reduction over time (chain steps). Thus, it is very useful to see whether a low chain reduction is also stable (you will see that sometimes the factors go down and then go up again). More importantly it allows us to find out how long the burn-in should be in order to eliminate any bias that arises from the starting point of the chain. The length of the burn-in depends on both the method and the dataset. Values between 1000-10000 steps are fairly common and lead to low computational overhead but there are many population genetics methods that can require a much longer burn-in (sometimes millions of steps).

Survey

Please let us know how you got along with the exercise in this very short opinion poll.