## Probability foundation for population genomics – activity

Alex Buerkle, 21 January 2020

This document contains questions to accompany the code in buerkle_code.R (also located in ~/workshop_materials/21_probability/) and lecture presentations about the corresponding material. At the end there are some additional notes regarding Bayesian parameter estimation and some aspects of the models we used.

Requirements

This exercise uses R, JAGS, and the rjags package within R.
Note, there is also an Introduction to R activity to help you with the basics of R and R studio.

## Generative F-model

### Illustration of drift variance

Use the following R code (line 26 of buerkle_code.R) to plot the distribution of allele frequencies across populations with: $$\pi = 0.1, 0.5$$ and $$F_{\mathrm ST}=0.01, 0.4$$

p<-seq(0,1,0.01)
plot(p, dbeta(p, shape1=0.5 *  (-1 + 1/0.4), # Fst 0.4
shape2=(1-0.5)* (-1 + 1/0.4)), col="red",  type="l", ylab="density",
xlab="p", ylim=c(0,10))
lines(p, dbeta(p, shape1=0.5 *  (-1 + 1/0.01), # Fst 0.01
shape2=(1-0.5)* (-1 + 1/0.01)), col="purple")
abline(v=0.5, col="blue") # ancestral allele frequency

Repeat this for an initial frequency ($$\pi$$) of 0.1. Experiment with different parameters.

1. Q What effect does a larger $$F_{\mathrm ST}$$ have on resulting allele frequencies?
2. Q What effect does the starting frequency have on the distribution of allele frequencies and the variance introduced by $$F_{\mathrm ST}$$?
3. Q Which results best illustrate the lack of one-to-one correspondence between process and empirical pattern?

### F$$_{\mathrm{ST}}$$ as a function of allele frequency

We have a generative model for how drift can influence allele frequencies. Let’s use this model to create reasonable population genetic data and investigate the relationship between global allele frequency and F$$_{\mathrm{ST}}$$.

We will use Hudson’s F$$_{\mathrm{ST}}$$ as a summary statistic on the variance in allele frequencies in a pair of populations, but we could also use other definitions of F$$_{\mathrm{ST}}$$.

Using the code on lines 38–76, we can learn about the relationship between global allele frequency and F$$_{\mathrm{ST}}$$.

1. Q How would you describe the pattern?
2. Q What are the implications of this?
3. Q How will this affect observations of F$$_{\mathrm{ST}}$$ of loci across the genome?

## Statistical model for allelic diversity

Use the code on lines 79–201 to generate simulated data with a known $$\theta$$ parameter and use MCMC to learn about recovering $$\theta$$ estimates from the simulated data.

1. Examine the plots for estimates of $$\theta$$ at each of the 500 retained steps from each of your three replicate chains.

• Q Does it appear that you have proper mixing within chains?

• Q What would an example of poor mixing look like?

• Q Does it appear that you good convergence among the three chains?

• Q What would an example of poor convergence look like?

2. Q Does the 95% credible interval for $$\theta$$
include the true value that you simulated (sim.theta
on line 83 of the code)? What does this mean?
3. For a simulation that sets sim.theta to 5:

• Q how well do the estimates agree if you simulated 20 individuals, compared to their agreement if you sampled 100 individuals?

You’ll need to simulate new data for each (nind$$\leftarrow$$20 and nind$$\leftarrow$$100) and run the MCMC on each simulated data set.

• Q What do you think is the cause of the difference?

4. Compare a simulation with sim.theta$$\leftarrow$$5 to a simulation with sim.theta$$\leftarrow$$0.1.

• Q How does this affect the genome-wide mean $$H_e$$ statistic on the data?
• Q Does this fit with your expectations about
how $$\theta$$
and $$H_e$$ relate to diversity?

## Statistical F-model

Use the code on lines 203–271 to generate simulated data with a known $$\theta$$ parameter and use MCMC to learn about recovering $$\theta$$ estimates from the simulated data.

1. Q How realistic is it to assume all populations and loci share a single $$F_{\mathrm{ST}}$$ parameter?
2. Q How could we relax this assumption?
3. Q Please explain the plot on line 226 (code also below) for a simulation with simFst set to 0.01 and again for a simulation with simFst set to 0.3. What are the noteworthy aspects?
4. For a simulation that sets simFst to 0.01:

• Q (lines 270–271) Does the 95% credible interval for $$F_{\mathrm{ST}}$$ include the true value that you simulated (simFst=0.01 on line 8 of the program)?
• Q What does this mean?
5. For a simulation that sets simFst to 0.3:

• Q Does the 95% credible interval for $$F_{\mathrm{ST}}$$ include the true value of $$F_{\mathrm{ST}} = 0.3$$?
• Q What do you think is a likely cause of the discrepancy?
6. Compare a plot of a PCA of populations simulated with simFst$$\leftarrow$$0.01 to populations simulated with simFst$$\leftarrow$$0.3.

• Q In which of these simulations are individuals more readily assigned to clusters in the PC2 vs. PC1 plot?
• Q Does the percentage of variation explained by PC1 scale with simFst?
7. Q Compare a plot of a PCA of populations simulated with simFst$$\leftarrow$$0.01 to a PCA of the very similar simulation to which a small number of loci were added that were more differentiated (line 306). What is the effect on clustering?

## Essentials of Bayesian estimation

Given that we have been working with Bayesian inference, here is some additional description of some key elements of this approach.

1. Bayesian methods result in a probability density for the model parameters, given the data (the posterior probability).

2. Probabilities provide a robust framework for scientific inference – model choice, strength of evidence, etc.

3. The probability density fully describes all information about the parameters and our uncertainty about them.

4. The measure of uncertainty for each individual parameter incorporates uncertainty for all parameters.

5. Densities can readily be obtained for estimates of transformed parameters.

6. Computers make it possible to obtain a very good description of the probability density of the parameters, given the data.

Bayesian inference is one of three principal approaches to inference and parameter estimation. The others are Likelihood analysis and Frequentist statistics. We’ll prefer Bayes for the attributes in the previous list. Most of these things are not true for Likelihood or Frequentist methods. Likelihood and Bayesian estimates will often agree approximately on the point estimate of parameters. Bayesian methods are preferred for their probabilistic measures of uncertainty, which are critical for scientific inference.

Bayes theorem is the basis for Bayesian models and is simply elaborated for more complex, hierarchical models. The equation for the posterior probability is:

$P(\theta | \mathrm{data}) = \frac{P(\mathrm{data} | \theta) \cdot P(\theta)}{P(\mathrm{data})}$

1. analytical solutions for the posterior distribution are available in simple cases like our initial beta-binomial example. They provide a “closed form” solution for the posterior probability density in the form of an integral. Quantiles (0.025, median, 0.975) and expectation (mean) can be calculated directly, as for any probability distribution.

2. more typically, we require simulation (MCMC) methods to obtain stochastic samples from the posterior distribution. We use algorithms for MCMC to draw samples from posterior and fully characterize it. This is typical in population genomics.

3. $$P(\mathrm{data})$$: it is a normalizing constant for the sum $$P(\mathrm{data})$$ across all possible values of $$\theta$$. Makes it so that the posterior density sums to one and is a proper probability distribution. We typically will not be able to calculate $$P(\mathrm{data})$$. MCMC methods allow us to draw samples from the posterior and ensure that the estimated probability density function (pdf) sums to one.

$$P(\theta | \mathrm{data}) \propto P(\mathrm{data} | \theta) \cdot P(\theta)$$

4. We can obtain quantiles (2.5% to 97.5% is the 95% ETPI or credible interval for true parameter) and expectation (mean) from posterior probability distribution.

5. Approximate Bayesian Computation (ABC) –- this provides a method when the equation of $$P(Data|\theta)$$ is not known, but it can be simulated, for example with the coalescent (but not just with the coalescent).