Probability in evolutionary genomics

Milan Malinsky, 16th January 2025



Background and Objectives

Thinking about probabilities is one of the cornerstones of population and speciation genomics. We often aim to evaluate the strength of evidence for a particular conclusion and to quantify the uncertainties associated with it. We also often want to estimate parameters of evolutionary models, e.g., of populations. In this exercise, we are going to explore examples of using probability in (evolutionary) genomics.

Learning goals:

    • Get familiar with probabilities and likelihood in the context of genomics
    • Appreciate some sources of uncertainty: (i) base calling errors; (ii) finite sampling of reads from diploid individuals; (iii) finite sampling of individuals from populations
    • Recognise the Binomial distribution and where it applies
    • Use R code to explore these questions and plot the results
    • Consider how an appreciation of probability may affect your experimental choices and interpretation in the context of estimating FST


Table of contents

1) Base calling error rates
2) Genotype likelihoods
3) Genome coverage distributions
4) Estimating FST
5) Conclusions and outlook


1) Base calling error rates

Most genomic data analyses start with DNA reads. You are probably familiar with the output of a DNA sequencer in the standard FASTQ format. Here is one DNA read (150bp):
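A FASTQ record consists of four lines: a header beginning with @, the DNA sequence, a separator line (+), and the base quality string, with one quality character per base. For illustration, here is a short, entirely hypothetical record of this form (a real 150bp read would have 150 bases and 150 quality characters):

@read1_hypothetical_example
GATTTGGGGTTCAAAGCAGT
+
,:F,FFFFF:FFFFF,FFFF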

The base qualities output by the sequencing machine are encoded so that each character corresponds to a number (a score). For example, ',' = 11, ':' = 25, and 'F' = 37. The full translation list can be found, for example, on this Illumina webpage. The quality scores correspond to base calling error probabilities. If we denote the score as Q, the relationship is Q = -10*log_{10}(P_{error}). Therefore, we can calculate the probability of an incorrect base call as: P_{error} = 10^{-Q/10}
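These scores follow the standard Phred+33 convention: the score is the quality character's ASCII code minus 33 (consistent with the values above: ',' has ASCII code 44, and 44 - 33 = 11). A minimal R sketch of the conversion (the function name qualToP is ours, for illustration):

R code:
# Convert a base quality string to per-base error probabilities,
# assuming the standard Phred+33 encoding (ASCII code minus 33)
qualToP <- function(qualityString) {
  Q <- utf8ToInt(qualityString) - 33  # quality scores
  10^(-Q/10)                          # error probabilities
}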

Q How many errors do you expect in five bases with the following base quality string: ,:F,F?

Show me the answer!

    • The quality scores corresponding to the sequence of five characters ,:F,F are 11, 25, 37, 11, and 37.
    • The probabilities corresponding to those quality scores are: 0.0794, 0.0032, 0.0002, 0.0794, and 0.0002
    • The expected number of errors is: 0.0794 + 0.0032 + 0.0002 + 0.0794 + 0.0002 ≈ 0.16
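With the qualToP sketch from above, the same calculation is a single line of R:

R code:
sum(qualToP(",:F,F"))  # 0.1624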

Q The average per-base error rate in Illumina reads is about 1%. Assuming a constant per-base error rate of 0.01, what would be the probability of seeing at least one error in a 150bp read? What about the probability of seeing two or more errors?

Show me the answer!

We can consider this problem as a sequence of 150 success/failure experiments (so-called Bernoulli trials) with the probability of failure p_{fail} = 0.01 in each trial. Those of you with some background in statistics will recognise that the total number of failures (base calling errors), let’s denote it by X, has the Binomial distribution. In mathematical notation: X \sim Binomial(150,0.01).

To answer our question, we can then use the Binomial Cumulative Distribution Function, for example as implemented in R:

    • The probability of seeing at least one error: pbinom(0, 150, 0.01, lower.tail=FALSE)
    • The probability of seeing at least two errors: pbinom(1, 150, 0.01, lower.tail=FALSE)
    • The probability of seeing at least three errors: pbinom(2, 150, 0.01, lower.tail=FALSE)

As you can see in the R output, the probabilities are approximately 78%, 44% and 19%.

Q In the example above, what would be the probability of seeing exactly two sequencing errors in the read?

Show me the answer!

We can calculate this from the above results. The probability of seeing exactly two sequencing errors is the probability of seeing two or more errors minus the probability of seeing three or more errors.

Translated to R code: pbinom(1, 150, 0.01, lower.tail=FALSE) - pbinom(2, 150, 0.01, lower.tail=FALSE)

This is approximately 44% - 19% = 25%.

Alternatively we can answer this question directly, using the R implementation of the Binomial Probability Mass Function: dbinom(2, 150, 0.01)
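As an aside, when n is large and p is small, the Binomial(n, p) distribution is closely approximated by the Poisson distribution with mean n*p, which we will meet again in section 3. A quick check for our example:

R code:
dbinom(2, 150, 0.01)   # 0.2525, the exact Binomial probability
dpois(2, 150*0.01)     # 0.2510, the Poisson approximation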



2) Genotype likelihoods

Most of us study diploid organisms. This means that when we sequence DNA of an individual, the reads come as a random mixture of the two homologous copies for each chromosome. The presence of heterozygous sites – that is, the loci where the two homologous chromosomes differ – presents a statistical challenge when estimating genotypes.

Consider a biallelic locus and let the reference allele be represented as 0 and the alternative allele as 1. The possible diploid genotypes are then the homozygous 0/0 and 1/1, and the heterozygous 0/1.

First, let us consider a scenario where, at a particular locus in the genome, we have one perfect read carrying the 0 allele. For now, we are going to simplify our lives by ignoring base calling errors and (unrealistically) assuming that reads are perfect. We are also going to assume (again unrealistically) that we can place/map all the reads unambiguously to their matching place in the reference genome.

Q What are the likelihoods of the three possible genotypes? In other words, what is the probability of observing the data (the one read with 0 allele) given the genotypes?

Show me the answer!

    • The probability of a read carrying the 0 allele, denoted r_0, given the homozygous 0/0 genotype is: P(r_0 \mid 0/0) = 1
    • The probability of a read carrying the 0 allele, denoted r_0, given the heterozygous 0/1 genotype is: P(r_0 \mid 0/1) = 0.5
    • The probability of a read carrying the 0 allele, denoted r_0, given the homozygous 1/1 genotype is: P(r_0 \mid 1/1) = 0

These are the likelihoods, i.e.: \mathcal{L}(0/0 \mid r_0) = P(r_0 \mid 0/0) = 1, similarly \mathcal{L}(0/1 \mid r_0) = P(r_0 \mid 0/1) = 0.5, and \mathcal{L}(1/1 \mid r_0) = P(r_0 \mid 1/1) = 0.

In the above example, the 0/0 genotype has the highest likelihood, and, therefore, 0/0 is the maximum likelihood estimate of the genotype. However, on the basis of one read, we cannot be certain that 0/0 is the correct genotype. There is the possibility that the read with the 0 allele could have originated from a heterozygous 0/1 locus. This type of uncertainty is represented by the Genotype Quality (GQ) field in a VCF file. Specifically, the Genotype Quality field shows the difference between the Phred-scaled likelihoods of the most likely and the second most likely genotypes (e.g. see details for GQ in GATK output).

The Genotype Qualities are logarithmically scaled in the same way as the base error probabilities described above, i.e. -10*log_{10}. Therefore, for the above example, GQ=-10*log_{10}(\mathcal{L}(0/1 \mid r_0)) - (-10*log_{10}(\mathcal{L}(0/0 \mid r_0))).

Q Can you calculate GQ for this ‘single perfect read’ example?

Show me the answer!

    • In numbers: -10*log_{10}(0.5) - (-10*log_{10}(1)) = 3.01 - 0 = 3.01
    • In R code: GQ = -10*log10(0.5) - (-10*log10(1))
    • In the VCF file, the number is rounded to the nearest integer to save space. So GQ = 3 for the 0/0 call.

It is common practice to filter variants based on GQ and set genotypes with GQ falling below a particular threshold to missing. This threshold is often set to keep variants with GQ >=20.

Q With one perfect read we got GQ=3, which is substantially below the desired GQ >= 20. What can we do to improve the genotype quality?

Show me the answer!

We need more reads covering the variant. We need to increase our sequencing coverage.

As we increase the read depth, all additional reads are going to carry the 0 allele, because the true genotype is 0/0 and the reads have no base-calling errors. Therefore, the likelihood of the (correct) 0/0 genotype is going to always remain 1. However, as we add more reads (evidence), the likelihood of the (erroneous) 0/1 genotype is going to decrease.

Q If the true genotype is 0/0, how many perfect reads (how much read depth) would we need to get GQ >= 20?

Show me the answer!

We need the likelihood of the erroneous 0/1 genotype to decrease sufficiently that -10*log_{10}(\mathcal{L}(0/1 \mid reads)) > 20. Since -10*log_{10}(0.01) = 20, this means we require the likelihood of the 0/1 genotype to decrease below 0.01.

Let’s remind ourselves that the likelihood of the 0/1 genotype is the probability of seeing the read(s), given this genotype. If the 0/1 genotype were correct, any read would have 50% probability of carrying the 0 allele and 50% probability of carrying the 1 allele. The probability of seeing one read carrying the 0 allele is then 0.5. The reads are independent; therefore, the probability of seeing two reads carrying the 0 allele is 0.5 * 0.5 = 0.25. The probability of seeing three reads carrying the 0 allele is 0.5 * 0.5 * 0.5 = 0.125. And so on… It turns out that 0.5^6 = 0.0156 and 0.5^7 = 0.008. Therefore we need 7 perfect reads to get GQ >= 20.

Perhaps you can see that we can again consider this problem as a sequence of success/failure experiments, where each read is an ‘experiment’. Therefore, the likelihood of the 0/1 genotype can be calculated using the Cumulative Distribution Function of the Binomial distribution. For example:

    • If seven out of seven reads carry the 0 allele: pbinom(0, 7, 0.5, lower.tail = T).
    • If one out of the seven reads carries the 1 allele: pbinom(1, 7, 0.5, lower.tail = T).
    • If ten out of ten reads carry the 0 allele: pbinom(0, 10, 0.5, lower.tail = T).
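To see how GQ grows with read depth in this idealised setting, here is a small sketch (the helper name gqForDepth is ours; it assumes n perfect reads that all carry the 0 allele, so the likelihood of 0/1 is 0.5^n while the likelihood of 0/0 stays at 1):

R code:
# GQ of a 0/0 call supported by n perfect reads, all carrying allele 0
gqForDepth <- function(n) -10*log10(0.5^n) - (-10*log10(1))
round(gqForDepth(1:8), 1)  # 3.0 6.0 9.0 12.0 15.1 18.1 21.1 24.1 - crosses 20 at n = 7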


3) Genome coverage distributions

The GQ >= 20 threshold corresponds to getting about one in 100 genotype calls wrong, which may seem acceptable, and the above example suggests that about 7 perfect reads are needed to make a genotype call with this confidence in a diploid individual. We might therefore be tempted to sequence each individual to 7x genome coverage. However, we should remember that this would be the average. Some sites will be covered by more reads, some by fewer reads.

If we assume that each read has an equal chance of coming from anywhere in the genome (which is not quite true, but close enough in whole genome sequencing; see e.g. this paper), the coverage at any one site (let’s denote it X) has a Poisson distribution. So, if the coverage is 7x, we have X \sim Poisson(7). The plot below shows this distribution:


Code: barplot(dpois(0:20,7),names=0:20, main="mean = 7x",xlab="Coverage",ylab="Probability")

Q The Probability Mass Function of the Poisson distribution in R is dpois() and the Cumulative Distribution Function is ppois(). Can you use R to find the proportion of sites with less than 7x coverage? How about less than 3x coverage?

Show me the answer!

For the first question, P(X < 7), it might be tempting to conclude quickly that this should be 50%, because 7x is the mean. However, the Poisson distribution is not symmetric: it is bounded below at zero. This makes sense, of course; we cannot have less than zero read coverage. The answers are:

    • P(X < 7) = 0.45; Both sum(dpois(0:6,7)) and ppois(6,7,lower.tail=T) provide this answer
    • P(X < 3) = 0.03; The R code: sum(dpois(0:2,7)) or ppois(2,7,lower.tail=T)

Q How do the probabilities of sites with less than 7x coverage and less than 3x coverage change if the mean coverage is increased to 10x?

Show me the answer!

With X \sim Poisson(10), we find the following:

    • P(X < 7) = 0.13; The R code: sum(dpois(0:6,10)) or ppois(6,10,lower.tail=T)
    • P(X < 3) = 0.003; The R code: sum(dpois(0:2,10)) or ppois(2,10,lower.tail=T)

As you can see, a modest increase in mean coverage (from 7 to 10x) results in a substantial decrease in the number of low coverage sites.

Q When designing a sequencing experiment, you may consider a question such as: What average coverage should I aim for if I want 99% of (mappable) sites covered by at least 7 reads?

Show me the answer!

We can plot P(X < 7) as a function of mean coverage.

The R code:
plot(10:20,ppois(6,10:20,lower.tail=T),xlab="Mean coverage",ylab="P(X < 7)")
abline(h=0.01,col="red")

We can see that the proportion of sites with coverage below 7x falls below 1% when the mean genome-wide coverage is increased to 15x.
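Instead of reading the answer off the plot, we can also ask R directly for the smallest mean coverage that meets the criterion, using the same ppois() logic:

R code:
candidates <- 10:20
min(candidates[ppois(6, candidates, lower.tail=T) < 0.01])  # 15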


4) Estimating FST (Associated R code)

There can be some ambiguity in the usage of the term FST: not everyone means the same thing by it. One can talk of FST as an evolutionary parameter; in this case, the same evolutionary parameter can give rise to different allele frequencies. We also talk of FST as a deterministic summary of allele frequencies. Such summaries are designed to be estimators of the evolutionary parameter, but sometimes we refer to them simply as FST. In this exercise, we are going to use the Hudson estimator of FST, as defined in equation (10) of this paper, which also describes other estimators and discusses many practical choices involved in applying them to modern genomic data.

FST estimators are a function of allele frequencies in a population or in multiple populations. A common use case in modern population and speciation genomic studies is to calculate pairwise FST estimates to assess differentiation between two populations at different loci along the genome. The pairwise Hudson FST estimator is:

\hat{F}_{ST}^{Hudson} = \frac{(\tilde{p}_1-\tilde{p}_2)^2 - \frac{\tilde{p}_1*(1-\tilde{p}_1)}{n_1-1} - \frac{\tilde{p}_2*(1-\tilde{p}_2)}{n_2-1}}{\tilde{p}_1*(1-\tilde{p}_2) + \tilde{p}_2*(1-\tilde{p}_1)}

where \tilde{p}_1 is the sample allele frequency in population 1, \tilde{p}_2 is the sample allele frequency in population 2, and n_1 and n_2 are the haploid sample sizes. This means that if we sample 4 diploid individuals from population 1, the sample size is n_1=8.
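A direct R translation of this estimator may look as follows (a minimal sketch; the function names are ours). When several SNPs are combined below, the numerators and denominators are averaged separately (a ‘ratio of averages’, as recommended in the paper cited above); for a single SNP this reduces to the equation itself:

R code:
# Numerator and denominator of the Hudson FST estimator;
# p1, p2 are sample allele frequencies, n1, n2 are haploid sample sizes
hudsonNum <- function(p1, p2, n1, n2) (p1-p2)^2 - p1*(1-p1)/(n1-1) - p2*(1-p2)/(n2-1)
hudsonDen <- function(p1, p2) p1*(1-p2) + p2*(1-p1)
# For vectors of per-SNP frequencies, combine SNPs as a ratio of averages
hudsonFst <- function(p1, p2, n1, n2) sum(hudsonNum(p1, p2, n1, n2)) / sum(hudsonDen(p1, p2))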

In the following exercises, we are going to explore how the finite sampling of individuals affects estimates of allele frequencies and how this, in turn, affects FST estimates. At this stage, we are going to ignore base and genotype calling uncertainties and assume we have perfect genotype calls. This allows us to explore directly the effect of the number of individuals in our sample.

We start by focussing on allele frequency estimates. Consider a biallelic SNP with a minor allele frequency (the frequency of the less common allele) of 0.45. In other words, 45% of the allele copies (chromosomes) in the population carry the less common allele, e.g., an A, and the remaining 55% carry the more common (major) allele, e.g., a T.

Q How accurately can we expect to estimate this population allele frequency if we sequence four diploid individuals from this population? What about if we sequence 10 individuals? How about 20 individuals?

Show me the answer!

The Binomial distribution provides the answer again! If we encode the major allele as 0 and the minor allele as 1, then the probability of each observation of the minor allele in our sample is 0.45. Each observation is a Bernoulli trial, and 0.45 is the probability of success (observing the minor allele). The number of minor alleles in our sample, let's denote it by X, is then distributed as: X \sim Binomial(8,0.45) for 4 diploid individuals, X \sim Binomial(20,0.45) for 10 diploid individuals, and X \sim Binomial(40,0.45) for 20 diploid individuals. If we divide X by the sample size, we get the distribution of sample allele frequencies. We can see the three distributions for 4, 10, and 20 diploid individuals below:


R code:
plot((0:8)/8,dbinom(0:8, size = 8, prob=0.45),pch=16,col="red",xlab="Sample allele frequency",ylab="Probability",cex=1.2,type='b',main="Population allele frequency = 0.45")
points((0:20)/20,dbinom(0:20, size = 20, prob=0.45),pch=16,col="blue",cex=1.2,type='b')
points((0:40)/40,dbinom(0:40, size = 40, prob=0.45),pch=16,col="orange",type='b',cex=1.2)
legend("topright",legend=c("4 diploid individuals","10 diploid individuals", "20 diploid individuals"),pch=16,col=c("red","blue","orange"))

As expected, the higher the number of sequenced individuals, the more accurate are the allele frequency estimates (the distributions are narrower).

At the same time, we can see that all three distributions have rather large variances. With four individuals, there is a 31% probability that our sample allele frequency is going to be 0.25 or less, or 0.75 or more, i.e., very far from the true population allele frequency of 0.45: sum(dbinom(c(0:2,6:8), size = 8, prob=0.45)). This probability drops to 6% with ten diploid individuals (sum(dbinom(c(0:5,15:20), size = 20, prob=0.45))), and to 0.7% with 20 diploid individuals (sum(dbinom(c(0:10,30:40), size = 40, prob=0.45))).

Q What about if the true population minor allele frequency is p=0.15? Does the variance of sample estimates from 4, 10, and 20 diploid individuals increase or decrease, compared with p=0.45?

Show me the answer!

The distributions have become narrower: there is a better chance of obtaining an accurate estimate of this minor allele frequency. In fact, it is straightforward to show this formally. The variance of a random variable with a Binomial distribution is Var[X] = n*p*(1-p), where n is the sample size. With p = 0.15 we get p*(1-p) = 0.15*0.85 = 0.1275, which is about half of the value for p = 0.45, where p*(1-p) = 0.45*0.55 = 0.2475.
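The factor p*(1-p) is largest at p = 0.5, so allele frequencies near 0.5 are the hardest to estimate accurately. A quick way to visualise this:

R code:
curve(x*(1-x), from=0, to=0.5, xlab="Minor allele frequency p", ylab="p*(1-p)")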

Let's say that a polymorphism has a minor allele frequency of 0.45 in population 1 and 0.15 in population 2. With these values, the Hudson estimator returns FST = 0.194 for this SNP.
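This population value can be checked by plugging the true allele frequencies into the estimator with the finite-sample correction terms dropped (they vanish as n_1 and n_2 grow large):

R code:
(0.45-0.15)^2 / (0.45*(1-0.15) + 0.15*(1-0.45))  # 0.1935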

Next, we are going to explore how the uncertainties in allele frequency estimation translate into variation in pairwise FST estimates for a single SNP. Using R, we have drawn a hundred thousand observations from the sample allele frequency distributions with 4, 10, and 20 diploid individuals from each of the two populations and then applied Hudson's FST estimator. We obtained the following FST values:

It becomes apparent that the FST estimates with four diploid individuals are all over the place. In fact, it is not much better with 10 diploid individuals. Only with 20 individuals do we start getting a distribution that appears centred around the true FST of 0.194 (this value is denoted by the red vertical lines in the plots above).

The variances are huge. However, this does not mean that the Hudson estimator does not work: on average, we are getting the right answer, or at least very close to it (there is a slight underestimation). As indicated in the figure, the mean of the observations was around 0.18, reasonably close to 0.194, in all analyses, whether with 4, 10, or 20 individuals.
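A minimal sketch of this single-SNP experiment for 4 diploid individuals per population, reusing the hudsonNum and hudsonDen helpers sketched above (note that samples in which both populations happen to be monomorphic give a zero denominator, i.e. NaN, and are excluded from the mean):

R code:
n <- 8  # haploid sample size for 4 diploid individuals
p1hat <- rbinom(100000, n, 0.45)/n  # sample allele frequencies, population 1
p2hat <- rbinom(100000, n, 0.15)/n  # sample allele frequencies, population 2
fst <- hudsonNum(p1hat, p2hat, n, n) / hudsonDen(p1hat, p2hat)
hist(fst, breaks=50); abline(v=0.194, col="red")
mean(fst, na.rm=TRUE)  # on average close to, though slightly below, 0.194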

Because single SNP estimates have such large variances, genome scans for differentiation usually average over multiple SNPs. Let's explore what happens to the variance of the FST estimator if we average over five, ten, or twenty independent SNPs. You can execute the provided R code and make your own plots, or click on the expanding boxes below.

See the results for five SNPs

The allele frequencies for which we show results below are:
p1 = 0.375,0.018,0.159,0.273,0.064
p2 = 0.393,0.308,0.188,0.278,0.49

(You could also use R to simulate five random minor allele frequencies: round(runif(5,max=0.5),3).)

The Hudson FST calculated from these population allele frequencies (p1 and p2) is 0.136. The distributions of FST estimates from sample allele frequencies with 4, 10, and 20 diploid individuals are shown below:

As expected, the FST distributions based on 5 SNP averages have considerably lower variances than single SNP estimates. Estimation based on ten diploid individuals is starting to look possible.
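To reproduce this kind of five-SNP window experiment, one can again reuse the helpers sketched above (assumptions as before; here with 10,000 replicates of 4 diploid individuals per population):

R code:
p1 <- c(0.375,0.018,0.159,0.273,0.064)
p2 <- c(0.393,0.308,0.188,0.278,0.49)
n <- 8  # 4 diploid individuals per population
fstWin <- replicate(10000, {
  p1hat <- rbinom(length(p1), n, p1)/n
  p2hat <- rbinom(length(p2), n, p2)/n
  hudsonFst(p1hat, p2hat, n, n)  # ratio of averages across the five SNPs
})
hist(fstWin, breaks=50); abline(v=0.136, col="red")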

See the results for ten SNPs

The allele frequencies for which we show results below are:
p1 = 0.421,0.154,0.055,0.197,0.062,0.043,0.094,0.343,0.314,0.024
p2 = 0.03,0.423,0.052,0.407,0.157,0.038,0.285,0.4,0.172,0.352

The Hudson FST calculated from these population allele frequencies is 0.138, similar to the five-SNP case. Again, the distributions of FST estimates from sample allele frequencies with 4, 10, and 20 diploid individuals are shown below:

As expected, the sampling distributions narrow further with 10 SNPs. However, using only four individuals from each population still seems like a bad idea.

See the results for 20 SNPs

The allele frequencies for which we show results below are:
p1 = 0.421, 0.154, 0.055, 0.197, 0.062, 0.043, 0.094, 0.343, 0.314, 0.024, 0.35, 0.344, 0.076, 0.406, 0.325, 0.369, 0.413, 0.033, 0.242, 0.027
p2 = 0.03, 0.423, 0.052, 0.407, 0.157, 0.038, 0.285, 0.4, 0.172, 0.352,0.013, 0.300, 0.200, 0.095, 0.425, 0.104, 0.154, 0.365, 0.184, 0.060

The Hudson FST calculated from these population allele frequencies is 0.138, again very similar to the five- and ten-SNP cases.

The sampling distributions narrow even further with 20 SNPs. Even using only 4 diploid individuals is starting to look reasonable with this window size. And with 20 diploid individuals, we are getting very accurate FST estimates, with a standard deviation of only 0.02.

It can be overwhelming to try to compare all the above results. Luckily, because the mean population FST value was very similar in the above examples, it makes sense to overlay the inferred probability density plots to see all the results at a glance in a single plot.

See the density overlay of results for different window sizes

Q Considering the above results, how does this inform your experimental/analysis choices regarding the number of individuals to sequence and the appropriate window size? Let's discuss this in class.


5) Conclusions and outlook

In this activity, you have explored a number of simple statistical concepts, such as probabilities, likelihood, probability distributions (Binomial, Poisson), and distribution functions in the context of evolutionary genomics. You have also seen how you can gain information about probabilities using a number of simple, but powerful functions implemented in the R environment for statistical computing. Such insights can inform your experimental design choices (e.g., sequencing instrument, depth of coverage, number of individuals sequenced from each population), and they can also help you to evaluate the strength of evidence provided by your data.

Having explored base calling error rates, finite read sampling, and finite sampling of individuals, you might be wondering how to combine these sources of uncertainty into a unified framework. This is indeed possible, and the math is not so complicated, but it is beyond the scope of this introductory exercise.

In practice, variant callers such as GATK and bcftools integrate base calling errors and depth of coverage information into the Genotype Quality field.