Statistics with R for Genome Analyses

Hannah Tavalire and Bill Cresko

January 2019

git fetch
git status
git merge origin/master

That stupid chair again

How did I get here?





Applied math, statistics and computation are the foundations of genomic analysis


Period

Goals for today

WE WILL COVER TODAY

  • Probability and distributions
  • Different flavors of statistical inference
  • Parameter estimation and hypothesis testing
  • Power analyses
  • Linear models
  • Multivariate approaches
  • Application to genomic and phenotypic datasets
  • Helping you get even more comfortable with R


EXTRA AT THE END OF THE SLIDES

  • Clinical trials as a guide to experimental design
  • Reporting statistics in manuscripts

Data wrangling and exploratory data analysis (EDA)

A biological example to get us started



  • Say you perform an experiment on two different strains of stickleback fish, one from an ocean population (RS) and one from a freshwater lake (BP), by rearing them microbe-free.


  • Microbes in the gut are known to interact with the gut epithelium in ways that lead to a proper maturation of the immune system.

A biological example to get us started

  • You carry out an experiment by treating multiple fish from each strain so that some of them have a conventional microbiota, and some are inoculated with only one bacterial species.


  • You then measure the levels of gene expression in the stickleback gut using RNA-seq.


  • You suspect that the sex of the fish might be important so you track it too.

A biological example to get us started


Collecting Data with Analyses in Mind



  • How should the data set be organized to best analyze it?
  • What are the key properties of the variables?
  • Why does it matter for learning R?
  • Why does it matter for performing statistical analyses in R?

Data set rules of thumb (aka Tidy Data)



  • Store a copy of data in non-proprietary formats
  • Keep an uncorrected copy of the raw file when doing analyses
  • Maintain effective metadata about the data
  • When you add observations to a database, add rows
  • When you add variables to a database, add columns
  • A column of data should contain only one data type

Tidyverse family of packages

  • Hadley Wickham and others have written R packages to modify data

  • These packages do many of the same things as base functions in R

  • However, they are specifically designed to do them faster and more easily

  • Wickham also wrote the package ggplot2 for creating elegant graphics

  • GG stands for ‘Grammar of Graphics’

Example of a tibble
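
The original slide showed a tibble printout; a minimal sketch of what one might look like for the stickleback experiment (the column names and values here are illustrative, not the course data):

```r
library(tibble)

# A tibble prints its dimensions and the type of each column,
# and shows only the first ten rows by default
fish <- tibble(
  fish_id   = 1:4,
  strain    = c("RS", "RS", "BP", "BP"),
  treatment = c("conventional", "mono", "conventional", "mono"),
  sex       = c("F", "M", "F", "M")
)
fish
```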

Key functions in dplyr for vectors


  • Pick observations by their values with filter().
  • Reorder the rows with arrange().
  • Pick variables by their names with select().
  • Create new variables with functions of existing variables with mutate().
  • Collapse many values down to a single summary with summarise().

filter(), arrange() & select()
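
The original slides stepped through these verbs live; a minimal sketch, assuming the `starwars` tibble that ships with dplyr:

```r
library(dplyr)

# filter(): keep only the rows matching a logical condition
humans <- filter(starwars, species == "Human")

# arrange(): reorder rows, here from tallest to shortest
by_height <- arrange(humans, desc(height))

# select(): keep only the named columns
select(by_height, name, height, mass)
```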

mutate() & transmute()

mutate() adds a new variable that is a function of existing variable(s)

transmute() returns only the newly created variable(s), dropping the originals
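
A sketch of the difference, again using dplyr's `starwars` tibble for illustration:

```r
library(dplyr)

# mutate() keeps all existing columns and appends the new one
mutate(starwars, bmi = mass / (height / 100)^2)

# transmute() returns only the column(s) it creates
transmute(starwars, bmi = mass / (height / 100)^2)
```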

group_by() & summarize()


This first function allows you to aggregate data by values of categorical variables (factors)

Once you have done this aggregation, you can then calculate values (in this case the mean) of other variables split by the new aggregated levels of the categorical variable

group_by() & summarize()


  • Note - you can get a lot of missing values!
  • That’s because aggregation functions obey the usual rule of missing values:
    • if there’s any missing value in the input, the output will be a missing value.
    • fortunately, all aggregation functions have an na.rm argument which removes the missing values prior to computation
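
A minimal sketch of the aggregation pattern and the na.rm fix, again assuming the `starwars` tibble:

```r
library(dplyr)

starwars %>%
  group_by(species) %>%
  summarise(
    mean_height    = mean(height),               # NA for any group with a missing height
    mean_height_ok = mean(height, na.rm = TRUE)  # drop the NAs first
  )
```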

R INTERLUDE | Complete Exercises 2.3-2.4

Graphical communication

GGPlot2 and the Grammar of Graphics



  • GG stands for Grammar of Graphics
  • A good paragraph uses good grammar to convey information
  • A good figure uses good grammar in the same way
  • Seven general components used to create most figures

GGPlot2 and the Grammar of Graphics


Graphical representation | general approaches


  1. Distributions of data
    • location
    • spread
    • shape
  2. Associations between variables
    • relationship among two or more variables
    • differences among groups in their distributions

Graphical representation | general approaches


  1. Distributions of data
    • bar graph
    • histogram
    • box plot
  2. Associations between variables
    • pie chart
    • grouped bar graph
    • mosaic plot
    • box plot
    • scatter plot
    • dot plot (‘stripchart’)

Box Plot

  • Displays median, first and third quartile, range, and extreme observations
  • Can be combined with mean and standard error of the mean
  • Concise way to visualize many aspects of distribution

Scatter Plot


  • Displays association between two numerical variables
  • Goal is association, not magnitude or frequency
  • Points fill the space available

Examples of the good, the bad, and the ugly of graphical representation



  • Examples of bad graphs and how to improve them.
  • Courtesy of K.W. Broman
  • www.biostat.wisc.edu/~kbroman/topten_worstgraphs/

Ticker tape parade

A line to no understanding

A cup of hot nothing

A bake sale of pie charts

Whack a mole

Best practices



“Graphical excellence is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space”

— Edward Tufte

Principles of effective display



  • Show the data
  • Encourage the eye to compare differences
  • Represent magnitudes honestly and accurately
  • Draw graphical elements clearly, minimizing clutter
  • Make displays easy to interpret

“Above all else show the data” | Tufte 1983

“Maximize the data to ink ratio, within reason” | Tufte 1983

Draw graphical elements clearly, minimizing clutter

“A graphic does not distort if the visual representation of the data is consistent with the numerical representation” – Tufte 1983

Represent magnitudes honestly and accurately

How Fox News makes a figure …

“Graphical excellence begins with telling the truth about the data” – Tufte 1983

GGPlot2 and nice figures

GGPlot2 and the Grammar of Graphics


The geom_bar function

Now try this…

and this…

and finally this…
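
The original slides built the plot up step by step; a minimal sketch of that kind of progression, assuming the `mpg` data bundled with ggplot2:

```r
library(ggplot2)

# Counts of cars in each class
ggplot(mpg, aes(x = class)) +
  geom_bar()

# ... now map a second variable to fill
ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar()

# ... and change how the filled bars are positioned
ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = "dodge")
```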

The geom_histogram and geom_freqpoly functions

geom_histogram() makes a histogram

geom_freqpoly() makes a frequency polygon
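
A sketch of both, again assuming ggplot2's `mpg` data:

```r
library(ggplot2)

# Histogram: binned counts of a continuous variable
ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2)

# Frequency polygon: the same counts drawn as lines,
# which makes overlaying groups easy
ggplot(mpg, aes(x = hwy, colour = drv)) +
  geom_freqpoly(binwidth = 2)
```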

The geom_boxplot function

Boxplots are very useful for visualizing data

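
A minimal sketch with `mpg`, including the mean overlaid as a point, as the slide describes:

```r
library(ggplot2)

# Distribution of highway mileage by vehicle class,
# with the group means overlaid as points
ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", colour = "red")
```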

The geom_point & geom_smooth functions

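
A sketch with `mpg`: the points show the raw association, and geom_smooth() layers a fitted line with its confidence band on top:

```r
library(ggplot2)

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm")
```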

Combining geoms

Adding labels

Themes
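
A combined sketch of these last three slides, assuming `mpg` again: multiple geoms, labels via labs(), and a built-in theme:

```r
library(ggplot2)

ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Engine displacement (L)",
       y = "Highway mileage (mpg)",
       colour = "Drive train") +
  theme_bw()   # swap the default grey background for a plain one
```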

Arranging Multiple Figures | Flexdashboard

  • Modify YAML header to specify graph orientation
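
A minimal sketch of such an R Markdown YAML header for the flexdashboard package (the title is a placeholder); `orientation` controls whether charts fill rows or columns:

```yaml
---
title: "My dashboard"
output:
  flexdashboard::flex_dashboard:
    orientation: rows    # charts fill rows; "columns" is the default
---
```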

Arranging Multiple Figures | Flexdashboard


R INTERLUDE | Complete Exercise 2.5


Statistical foundations for genomics

Probability and distributions

Different flavors of inferential statistics


  • Frequentist Statistics
    • Classical or standard approaches
    • Null hypothesis testing
    • Genomic data provides challenges


  • Hierarchical Probabilistic Modeling
    • Maximum Likelihood
    • Bayesian Analyses
    • Machine Learning

R (& Python) can be used for both flavors

  • Stan is a platform for statistical modeling and computation
  • Bayesian statistical inference with MCMC
  • approximate Bayesian inference with variational inference
  • maximum likelihood estimation with optimization
  • probability functions & linear algebra

What is probability?

  • Frequency interpretation

    “Probabilities are understood as mathematically convenient approximations to long run relative frequencies.”

  • Subjective interpretation

    “A probability statement expresses the opinion of some individual regarding how certain an event is to occur.”

Random variables & probability


  • Probability is the expression of belief in some future outcome

  • A random variable can take on different values with different probabilities

  • The sample space of a random variable is the universe of all possible values

Random variables & probability


  • The sample space can be represented by a
    • probability distribution (discrete)
    • probability density function (continuous)
    • algebra and calculus are used for each respectively
    • probabilities of a sample space always sum to 1.0


  • There are many families or forms of distributions or PDFs
    • depends on the dynamical system they represent
    • the exact instantiation of the form depends on their parameter values
    • we are often interested in estimating parameters

Bernoulli distribution

  • Describes the expected outcome of a single event with probability p

  • Example of flipping of a fair coin once


\[Pr(X=\text{Head}) = \frac{1}{2} = 0.5 = p \]

\[Pr(X=\text{Tails}) = \frac{1}{2} = 0.5 = 1 - p \]

Bernoulli distribution

  • If the coin isn’t fair then \(p \neq 0.5\)
  • However, the probabilities still sum to 1

    \[ p + (1-p) = 1 \]

  • Same is true for other binary possibilities
    • success or failure
    • yes or no answers
    • choosing an allele from a population based upon allele frequencies

Probability rules


  • Flip a coin twice
  • Represent the first flip as ‘X’ and the second flip as ‘Y’



\[ Pr(\text{X=H and Y=H}) = p*p = p^2 \]
\[ Pr(\text{X=H and Y=T}) = p*(1-p) \]
\[ Pr(\text{X=T and Y=H}) = (1-p)*p \]
\[ Pr(\text{X=T and Y=T}) = (1-p)*(1-p) = (1-p)^2 \]

For a fair coin, \(p = 1-p = 0.5\), so each of the four outcomes has probability \(0.25\)

Probability rules


  • Probability that the H and T can occur in any order


\[ \text{Pr(X=H and Y=T) or Pr(X=T and Y=H)} = \] \[ p(1-p) + (1-p)p = 2p(1-p) \]

  • These are the ‘and’ and ‘or’ rules of probability
    • ‘and’ means multiply the probabilities
    • ‘or’ means sum the probabilities
    • most probability distributions can be built up from these simple rules

Expectation and Moments of Distributions


  • Distributions have moments that can be estimated
  • The 1st, 2nd, 3rd, and 4th moments correspond to the mean, variance, skewness, and kurtosis
  • The expectation or mean of a random variable X is:

\[E[X] = \sum_{\text{all x}}^{}xP(X=x) = \mu\]

  • Often we want to know how dispersed the random variable is around its mean.
  • One measure of dispersion is the variance

\[Var(X) = E[(X-\mu)^2] = E[X^2] - \mu^2 = \sigma^2\]

Joint probability



\[Pr(X,Y) = Pr(X) * Pr(Y)\]


  • Note that this is true for two independent events
  • However, for two non-independent events we also have to take into account their covariance
  • To do this we need conditional probabilities

Conditional probability


  • For two independent variables

\[Pr(Y|X) = Pr(Y)\text{ and }Pr(X|Y) = Pr(X)\]


  • For two non-independent variables

\[Pr(Y|X) \neq Pr(Y)\text{ and }Pr(X|Y) \neq Pr(X)\]

  • Variables that are non-independent have a shared variance, which is also known as the covariance
  • Covariance calculated on variables standardized to a mean of zero and a unit standard deviation is the correlation

What is Likelihood vs. Probability?

  • The probability of an event is the proportion of times that the event would occur if we repeated a random trial over and over again under the same conditions.

  • The likelihood is a conditional probability of a parameter value given a set of data

  • That is, the likelihood of a population parameter equaling a specific value, given the data

\[L(\text{parameter} \mid \text{data}) = Pr(\text{data} \mid \text{parameter})\]

  • This is the likelihood function, which can have a maximum

A Bayesian looks at estimation

  • A way of quantifying uncertainty of the population parameter, not our estimate of it, using prior information about the probability distribution of the parameter.


\[ P(\theta|d) = \frac{P(d|\theta)P(\theta)}{P(d)}\]

where

  • \(P(\theta|d)\) = posterior probability distribution
  • \(P(d|\theta)\) = likelihood function for \(\theta\)
  • \(P(\theta)\) = prior probability distribution
  • \(P(d)\) = probability of the data (the likelihood averaged over the prior)

Who wants to win a car?

  • Pretend that there are three doors, and behind one is a car.
  • You choose one of the doors, and then Monty Hall opens one of the two remaining doors.
  • At this point you have the choice of changing doors or staying with your original choice.
  • What should you do?
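
A quick simulation settles it (a sketch; the door numbering is arbitrary): your first pick is wrong 2/3 of the time, and whenever it is wrong, switching always wins, because Monty has removed the other losing door.

```r
set.seed(3)

# Simulate many games in which you always pick door 1
n <- 1e5
car <- sample(1:3, n, replace = TRUE)   # where the car actually is

mean(car == 1)   # stay with door 1: wins ~1/3 of games
mean(car != 1)   # switch to the remaining door: wins ~2/3
```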

Common probability distributions in genomics | Many of these are thanks to Sally Otto at UBC

Discrete Probability Distributions

Geometric Distribution

  • If a single event has two possible outcomes, the probability of having to wait k trials until the first “one” appears is given by the geometric distribution
  • The probability that the first “one” appears on the first trial is p, and the probability that it first appears on the second trial is (1-p)*p
  • Generalizing, the probability of k-1 failures followed by the first success is:

\[P(X=k)=(1-p)^{k-1}p\]

  • mean = \(\frac{1}{p}\)
  • variance = \(\frac{(1-p)}{p^2}\)

Geometric Distribution

  • Example: If the probability of extinction of an endangered population is estimated to be 0.1 every year, what is the expected time until extinction?
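
From the mean formula above, the answer is 1/p = 10 years. A quick check in R (note that rgeom() counts failures *before* the first success, so add 1 to get the trial number):

```r
p <- 0.1                  # yearly extinction probability

1 / p                     # expected years until extinction: 10

# Check by simulation
mean(rgeom(1e5, p) + 1)   # approximately 10
```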

Binomial Distribution

  • A binomial distribution results from the combination of several independent Bernoulli events

  • Example
    • Pretend that you flip 20 fair coins
      • or collect alleles from a heterozygote
    • Now repeat that process and record the number of heads
    • We expect that most of the time we will get approximately 10 heads
    • Sometimes we get many fewer heads or many more heads

Binomial Distribution

  • The distribution of probabilities for each combination of outcomes is

\[\large f(k) = {n \choose k} p^{k} (1-p)^{n-k}\]

  • n is the total number of trials
  • k is the number of successes
  • p is the probability of success
  • q is the probability of failure
  • For the binomial, as with the Bernoulli, p = 1 - q

Binomial Probability Distribution
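
A figure like the one on this slide can be sketched with base R's dbinom(); here for 20 fair coin flips:

```r
# Probability mass function for the number of heads in 20 fair flips
k <- 0:20
barplot(dbinom(k, size = 20, prob = 0.5), names.arg = k,
        xlab = "Number of heads", ylab = "Probability")
```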

Negative Binomial Distribution

  • Extension of the geometric distribution, describing the waiting time until r “ones” have appeared
  • Probability of the \(r^{th}\) “one” appearing on the \(k^{th}\) trial:

\[P(X=k)=\binom{k-1}{r-1}p^{r-1}(1-p)^{k-r}\,p\]


which simplifies to

\[P(X=k)=\binom{k-1}{r-1}p^{r}(1-p)^{k-r}\]

  • mean = \(\frac{r}{p}\)
  • variance = \(r(1-p)/p^2\)

Negative Binomial Distribution

  • Example: If a predator must capture 10 prey before it can grow large enough to reproduce, what would the mean age of onset of reproduction be if the probability of capturing a prey on any given day is 0.1?
  • Notice that the variance is quite high (~1000) and that the distribution looks quite skewed
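
Using the formulas above, the mean is r/p = 100 days with variance r(1-p)/p² = 900. A sketch in R (rnbinom() counts failures before the r-th success, so add r to get the trial number):

```r
r <- 10; p <- 0.1

r / p                # mean day of the 10th capture: 100
r * (1 - p) / p^2    # variance: 900 (hence the ~1000 above)

# Check by simulation
x <- rnbinom(1e5, size = r, prob = p) + r
mean(x); var(x)      # approximately 100 and 900
```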

Poisson Probability Distribution

  • Another common situation in biology is when each trial is discrete, but the number of occurrences of each outcome is counted

  • Some examples are
    • counts of snails in several plots of land
    • observations of the firing of a neuron in a unit of time
    • count of genes in a genome binned to units of 500 AA
  • Just like before you have ‘successes’, but
    • now you count them for each replicate
    • the replicates now are units of area or time
    • the values can now range from 0 to a large number

Poisson Probability Distribution


  • For example, you can examine 1000 genes
    • count the number of base pairs in the coding region of each gene
    • what is the probability of observing a gene with ‘r’ bp?
  • Pr(Y=r) is the probability that the number of occurrences of an event y equals a count r in the total number of trials


\[Pr(Y=r) = \frac{e^{-\mu}\mu^r}{r!}\]

Poisson Probability Distribution

  • Note that this is a single parameter function because \(\mu = \sigma^2\)
  • The two together are often just represented by \(\lambda\)

\[Pr(Y=r) = \frac{e^{-\lambda}\lambda^r}{r!}\]

  • This means that for a variable that is truly Poisson distributed:
    • the mean and variance should be equal to one another
    • variables that are approximately Poisson distributed but have a larger variance than mean are often called ‘overdispersed’
    • quite common in RNA-seq and microbiome data
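
A minimal simulation of the mean-variance point; the negative binomial here stands in as one standard model for overdispersed counts:

```r
# A true Poisson variable: mean and variance match
x <- rpois(1e5, lambda = 4)
mean(x); var(x)      # both approximately 4

# Overdispersed counts (variance > mean), as in RNA-seq
y <- rnbinom(1e5, mu = 4, size = 2)
mean(y); var(y)      # ~4 and ~12 (= mu + mu^2/size)
```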

Poisson Probability Distribution | gene length by bins of 500 nucleotides

Poisson Probability Distribution | increasing parameter values of \(\lambda\)

Continuous probability distributions


P(observation lies within dx of x) = f(x)dx

\[P(a\leq X \leq b) = \int_{a}^{b} f(x) dx\]


Remember that the integral over the entire range is one

\[\int_{-\infty}^{\infty} f(x) dx = 1\]

Continuous probabilities


E[X] may be found by integrating the product of x and the probability density function over all possible values of x:

\[E[X] = \int_{-\infty}^{\infty} xf(x) dx \]


\(Var(X) = E[X^2] - (E[X])^2\), where the expectation of \(X^2\) is

\[E[X^2] = \int_{-\infty}^{\infty} x^2f(x) dx \]

Uniform Distribution


\[E[X] = \int_{a}^{b} x\frac{1}{b-a} dx = \frac{(a+b)}{2} \]


Exponential Distribution


\[f(x)=\lambda e^{-\lambda x}\]


  • E[X] can be found by integrating \(xf(x)\) from 0 to infinity


  • leading to the result that


  • \(E[X] = \frac{1}{\lambda}\)
  • \(Var(X) = \frac{1}{\lambda^2}\)

Exponential Distribution

  • For example, let \(\lambda\) equal the instantaneous death rate of an individual.
  • The lifespan of the individual would be described by an exponential distribution (assuming that \(\lambda\) does not change over time).

Gamma Distribution

  • The gamma distribution generalizes the exponential distribution.
  • It describes the waiting time until the \(r^{th}\) event for a process that occurs randomly over time at a rate \(\lambda\) :


\[f(x) = \frac{e^{-\lambda x}(\lambda x)^{r-1}}{(r-1)!}\lambda\]


\[ \text{Mean} = \frac{r}{\lambda} \]
\[ \text{Variance} = \frac{r}{\lambda^2} \]

Gamma Distribution


  • Example: If, in a PCR reaction, DNA polymerase synthesizes new DNA strands at a rate of 1 per millisecond, how long until 1000 new DNA strands are produced?


  • Assume that DNA synthesis does not deplete the pool of primers or nucleotides in the chamber, so that each event is independent of other events in the PCR chamber.
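
With r = 1000 events and λ = 1 per millisecond, the mean wait is r/λ = 1000 ms with variance r/λ² = 1000. A sketch:

```r
r <- 1000       # strands to synthesize
lambda <- 1     # strands per millisecond

r / lambda      # mean waiting time: 1000 ms
r / lambda^2    # variance: 1000, so sd is about 31.6 ms

# Probability that all 1000 strands are done within 1050 ms
pgamma(1050, shape = r, rate = lambda)
```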

The Gaussian or Normal Distribution

Log-normal PDF | Continuous version of Poisson (-ish)


Transformations to ‘normalize’ data





Binomial to Normal | Categorical to continuous

The Normal (aka Gaussian) | Probability Density Function (PDF)

Normal PDF

Normal PDF | A function of two parameters

(\(\mu\) and \(\sigma\))

\[\large f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}\]

where \(\pi \approx 3.14159\) and \(e \approx 2.71828\)

To write that a variable (v) is distributed as a normal distribution with mean \(\mu\) and variance \(\sigma^2\), we write the following:

\[\large v \sim \mathcal{N} (\mu,\sigma^2)\]

Normal PDF | estimates of mean and variance

Estimate of the mean from a single sample

\[\Large \bar{x} = \frac{1}{n}\sum_{i=1}^{n}{x_i} \]

Estimate of the variance from a single sample

\[\Large s^2 = \frac{1}{n-1}\sum_{i=1}^{n}{(x_i - \bar{x})^2} \]

Normal PDF

Why is the Normal special in biology?


Parent-offspring resemblance

Genetic model of complex traits

Distribution of \(F_2\) genotypes | really just binomial sampling

Why else is the Normal special?

  • The normal distribution is immensely useful because of the central limit theorem, which says that the mean of many random variables independently drawn from the same distribution is distributed approximately normally
  • One can think of numerous situations, such as
    • when multiple genes contribute to a phenotype
    • or that many factors contribute to a biological process
  • In addition, whenever there is variance introduced by stochastic factors the central limit theorem holds
  • Thus, normal distributions occur throughout genomics
  • It’s also the basis of the majority of classical statistics
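
A minimal simulation of the central limit theorem: means of draws from a decidedly non-normal (uniform) distribution are themselves approximately normal.

```r
# Each value is the mean of 50 draws from a uniform distribution
sample_means <- replicate(1e4, mean(runif(50)))

# Approximately normal, centred on 0.5, despite the
# flat distribution of the raw draws
hist(sample_means, breaks = 50)
```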

z-scores of normal variables


  • Often we want to make variables more comparable to one another
  • For example, consider measuring the leg length of mice and of elephants
    • Which animal has longer legs in absolute terms?
    • Which has longer legs proportional to its body size?
  • A good way to answer these last questions is to use ‘z-scores’

z-scores of normal variables


  • z-scores are standardized to a mean of 0 and a standard deviation of 1
  • We can modify any normal distribution to have a mean of 0 and a standard deviation of 1
  • Another term for this is the standard normal distribution


\[\huge z_i = \frac{(x_i - \bar{x})}{s}\]
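
A sketch of the same calculation in R, by hand and with the built-in scale() helper:

```r
x <- c(4.1, 5.0, 6.2, 5.5, 4.8)

z  <- (x - mean(x)) / sd(x)   # by hand
z2 <- as.numeric(scale(x))    # built-in helper, same result

mean(z); sd(z)                # 0 and 1 (up to rounding)
```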

R Interlude | Complete Exercise 3.1

Hypothesis testing, test statistics, p-values

What is a hypothesis?


  • A statement of belief about the world
  • Need a critical test to
    • accept or reject the hypothesis
    • compare the relative merits of different models
  • This is where statistical sampling distributions come into play
  • Statistical distributions are built upon sampling distributions

Hypothesis tests

  • What is the probability that we would reject a true null hypothesis?

  • What is the probability that we would accept a false null hypothesis?

  • How do we decide when to reject a null hypothesis and support an alternative?

  • What can we conclude if we fail to reject a null hypothesis?

  • What parameter estimates of distributions are important to test hypotheses?

Null and alternative hypotheses | population distributions

Statistical sampling distributions


  • Statistical tests provide a way to perform critical tests of hypotheses
  • Just like raw data, statistics are random variables and depend on sampling distributions of the underlying data
  • The particular form of the statistical distribution depends on the test statistic and parameters such as the degrees of freedom that are determined by sample size.

Statistical sampling distributions


  • In many cases we create a null statistical distribution that models the distribution of a test statistic under the null hypothesis.
  • Similar to point estimates, we calculate an observed test statistic value for our data
  • Then see how probable it was by comparing against the null distribution
  • The probability of seeing that value or greater is called the p-value of the statistic

Four common statistical distributions

The t-test and t sampling distribution

\[\large t = \frac{(\bar{y}_1-\bar{y}_2)}{s_{\bar{y}_1-\bar{y}_2}} \]

where

\[ s_{\bar{y}_1-\bar{y}_2} = \sqrt{s_p^2\left(\frac{1}{n_1}+\frac{1}{n_2}\right)} \]

which is the standard error of the mean difference (\(s_p^2\) is the pooled sample variance)

The t-test and t sampling distribution | one-tailed test

The t-test and t sampling distribution | two-tailed test

Assumptions of parametric t-tests


  • The theoretical t-distributions for each degree of freedom were calculated for populations that:
    • are normally distributed
    • have equal variances (if comparing two means)
    • consist of independent (randomly drawn) observations
  • This is an example of a parametric test
  • What do you do if the there is non-normality?
    • nonparametric tests such as Mann-Whitney-Wilcoxon
    • randomization tests to create a null distribution
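
Both routes are one-liners in R; a sketch with simulated data standing in for two groups:

```r
set.seed(1)
x <- rnorm(20, mean = 10, sd = 2)
y <- rnorm(20, mean = 12, sd = 2)

t.test(x, y, var.equal = TRUE)   # classical two-sample t-test
wilcox.test(x, y)                # nonparametric alternative
```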

Type 1 and Type 2 errors


Components of hypothesis testing

  • p-value = the probability of obtaining a test statistic as extreme as, or more extreme than, the observed one if the null hypothesis is true
  • \(\alpha\) = critical value of p-value cutoff for experiments. The Type I error we are willing to tolerate.
  • \(\beta\) = cutoff for probability of accepting a false null hypothesis
  • Power = the probability that a test will reject a false null hypothesis (1 - \(\beta\)). It depends on effect size, sample size, chosen \(\alpha\), and population standard deviation
  • Multiple testing = performing the same or similar tests multiple times - need to correct alpha value
  • Can correct for multiple testing using a tax on \(\alpha\) (e.g. Bonferroni) or by directly estimating a False Discovery Rate (FDR)
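
Both corrections are built into base R's p.adjust(); a sketch with made-up p-values:

```r
p_values <- c(0.001, 0.008, 0.020, 0.041, 0.30)

p.adjust(p_values, method = "bonferroni")   # pay the full tax
p.adjust(p_values, method = "BH")           # Benjamini-Hochberg FDR
```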

Statistical power


  • Type 1 error - \(\alpha\) - incorrectly rejecting a true null hypothesis
    • This is saying that there is an effect when there isn’t
  • Type 2 error - \(\beta\) - incorrectly accepting a false null hypothesis
    • This is saying that there isn’t an effect when there is
  • Power is the probability of rejecting a false null hypothesis
  • Mostly we shoot for a power of around 80%
  • Power can be calculated post hoc or a priori

Power | the things one needs to know


\[ Power \propto \frac{(ES)(\alpha)(\sqrt n)}{\sigma}\]

  • Power is proportional to the combination of these parameters

    • ES - effect size; how large is the change of interest?
    • \(\alpha\) - significance level (usually 0.05)
    • n - sample size
    • \(\sigma\) - standard deviation among experimental units within the same group.
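
For t-tests, base R's power.t.test() solves for whichever of these quantities is left unspecified; a sketch with illustrative numbers:

```r
# n per group needed for 80% power to detect a difference of 2
# units when the within-group sd is 3, at alpha = 0.05
power.t.test(delta = 2, sd = 3, sig.level = 0.05, power = 0.80)

# Conversely: the power achieved with n = 20 per group
power.t.test(n = 20, delta = 2, sd = 3, sig.level = 0.05)
```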

Power | what we usually want to know


Power | rough calculation

R Interlude | Complete Exercises 3.2-3.4


Linear Models and Regression

Parent offspring regression

Linear Models - a note on history

Linear Models - a note on history

Bivariate normality

Covariance and correlation

A linear model to relate two variables

Parameter Estimation | Ordinary Least Squares (OLS)


  • Algorithmic approach to parameter estimation
  • One of the oldest & best developed
  • Used extensively in linear models (ANOVA and regression)
  • By itself only produces a single best estimate (no confidence intervals)
  • Many OLS estimators have been replaced by ML estimators

Many approaches are linear models


  • Flexible - applicable to many different study designs
  • Provides a common set of tools
    • lm in R for fixed effects
    • nlme or lme4 for random effects
  • Includes tools to estimate parameters:
    • sizes of effects like the slope
    • difference in means among categories
  • Is easier to work with, especially with multiple variables

Many approaches are linear models


  • Linear regression
  • Single factor ANOVA
  • Analysis of covariance (ANCOVA)
  • Multiple regression
  • Multi-factor ANOVA
  • Repeated-measures ANOVA

Plethora of linear models


  • General Linear Model (GLM) - two or more continuous variables

  • General Linear Mixed Model (GLMM) - a continuous response variable with a mix of continuous and categorical predictor variables

  • Generalized Linear Model - a GLM that doesn’t assume normality of the response

  • Generalized Additive Model (GAM) - a model that doesn’t assume linearity

Linear models


All can be written in the form

response variable = intercept + (explanatory_variables) + random_error

in the general form:

\[ Y=\beta_0 +\beta_1*X_1 + \beta_2*X_2 +... + \epsilon\]

where \(\beta_0, \beta_1, \beta_2, ....\) are the parameters of the linear model
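
A minimal sketch of fitting this model in R, with simulated data so the true parameters are known:

```r
set.seed(2)
x <- runif(100, 0, 10)
y <- 1.5 + 0.8 * x + rnorm(100, sd = 1)   # true beta0 = 1.5, beta1 = 0.8

fit <- lm(y ~ x)   # the intercept is included automatically
summary(fit)       # estimates, standard errors, t-tests, R^2
```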

linear model parameters


Linear models in R


  • Need to fit the model and then ‘read’ the output
  • This is similar to what you learned from Malachi about GGPlot2 figures
  • In general you will ask R to fit linear models and then do additional analyses on the output
  • Note that there are a number of new variables created in the output

Model fitting and hypothesis tests in regression


\[H_0 : \beta_0 = 0\] \[H_0 : \beta_1 = 0\]

full model - \(y_i = \beta_0 + \beta_1*x_i + error_i\)

reduced model - \(y_i = \beta_0 + 0*x_i + error_i\)

  1. fits a “reduced” model without slope term (\(H_0\))
  2. fits the “full” model with slope term added back
  3. compares fit of full and reduced models using an F test
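
This comparison can be made explicit with anova(); a sketch assuming the simulated x and y from the lm() example above:

```r
full    <- lm(y ~ x)   # slope term included
reduced <- lm(y ~ 1)   # intercept-only model (the H0 fit)

anova(reduced, full)   # F test; matches the slope test in summary(full)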

Model fitting in regression


Hypothesis tests in linear regression


Hypothesis tests in linear regression


Residual Analysis | did we meet our assumptions?



  • Independent errors (residuals)
  • Equal variance of residuals in all groups
  • Normally-distributed residuals
  • Robustness to departures from these assumptions is improved when sample size is large and design is balanced

Residual Analysis

\[y_i = \beta_0 + \beta_1 * x_i + \epsilon_i\]
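
Assuming the fit from the earlier lm() sketch, base R's plot() method gives the standard residual diagnostics:

```r
fit <- lm(y ~ x)      # the fit from the earlier sketch

par(mfrow = c(2, 2))  # residuals vs fitted, QQ plot, scale-location,
plot(fit)             # and residuals vs leverage, in one panel
```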

Residual Analysis

Residual Plots | Spotting assumption violations

Anscombe’s quartet | what would residual plots look like for these?

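
Anscombe's quartet ships with R as the `anscombe` data frame, so the point is easy to check yourself:

```r
# All four x-y pairs give essentially the same fitted line ...
fits <- lapply(1:4, function(i) {
  lm(anscombe[[paste0("y", i)]] ~ anscombe[[paste0("x", i)]])
})
lapply(fits, coef)           # intercepts ~3.0, slopes ~0.5

# ... but the residual plots differ dramatically
plot(fits[[2]], which = 1)   # clear curvature in data set 2
```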

Multiple Linear Regression - Goals

  • To develop a better predictive model than is possible from models based on single independent variables.
  • To investigate the relative individual effects of each of the multiple independent variables above and beyond the effects of the other variables.
  • The individual effects of each of the predictor variables on the response variable can be depicted by single partial regression lines.
  • The slope of any single partial regression line (the partial regression slope) represents the rate of change, or effect, of that specific predictor variable on the response variable, holding all of the other predictor variables constant at their respective mean values.

Multiple Linear Regression | Additive and multiplicative models of 2 or more predictors



Additive model \[y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + ... + \beta_jx_{ij} + \epsilon_i\]


Multiplicative model (with two predictors) \[y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \beta_3x_{i1}x_{i2} + \epsilon_i\]
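
A sketch of both fits, assuming a hypothetical data frame `dat` with columns y, x1, and x2:

```r
fit_add  <- lm(y ~ x1 + x2, data = dat)   # additive model
fit_mult <- lm(y ~ x1 * x2, data = dat)   # expands to x1 + x2 + x1:x2

anova(fit_add, fit_mult)   # F test of the interaction term (beta3)
```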

Multiple Linear Regression | Additive and multiplicative models


Multiple linear regression assumptions


  • linearity
  • normality
  • homogeneity of variance
  • The absence of multi-collinearity
    • a predictor variable must not be linearly predicted by - or correlated with - any combination of the other predictor variables.

Checking for multi-collinearity
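
One common check is variance inflation factors; a sketch using vif() from the car package and the same hypothetical `dat`:

```r
library(car)   # provides vif()

# VIFs much above ~5-10 are usually taken to signal
# problematic collinearity among the predictors
vif(lm(y ~ x1 + x2, data = dat))
```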

R Interlude | Exercise 4.1

ANOVA

ANOVA

  • Stands for ANalysis of VAriance
  • Core statistical procedure in biology
  • Developed by R.A. Fisher in the early 20th Century
  • The core idea is to ask how much variation exists within vs. among groups
  • ANOVAs are linear models that have categorical predictor and continuous response variables
  • The categorical predictors are often called factors, and can have two or more levels (important to specify in R)
  • Each factor will have a hypothesis test
  • The levels of each factor may also need to be tested

ANOVA | Let’s start with an example

  • Percent time that male mice experiencing discomfort spent “stretching”
  • Data are from an experiment in which mice experienced mild discomfort (the result of an injection of 0.9% acetic acid into the abdomen)
  • The injected mice were then kept in:
    • isolation
    • with a companion mouse not injected or
    • with a companion mouse also injected and exhibiting “stretching” behaviors associated with discomfort

From Langford, D. J.,et al. 2006. Science 312: 1967-1970

ANOVA | Let’s start with an example

In words:

stretching = intercept + treatment

  • The model statement includes a response variable, a constant, and an explanatory variable.
  • The only difference with regression is that here the explanatory variable is categorical.
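
A sketch of the fit in R, assuming a hypothetical data frame `mice` with a numeric `stretching` column and a three-level `treatment` column:

```r
mice$treatment <- factor(mice$treatment)   # declare it categorical

fit <- aov(stretching ~ treatment, data = mice)
summary(fit)   # F test for the treatment factor

# Equivalently, as a linear model
summary(lm(stretching ~ treatment, data = mice))
```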

ANOVA | Let’s start with an example

ANOVA

ANOVA | Conceptually similar to regression

ANOVA | Statistical results table

ANOVA | F-ratio calculation

ANOVA | One or more predictor variables


  • One-way ANOVAs just have a single factor
  • Multi-factor ANOVAs have more than one factor
  • The interaction among factors needs to be considered
  • Multi-factor ANOVAs
    • Factorial - two or more factors and their interactions
    • Nested - the levels of one factor are contained within the levels of another factor
    • The models can be quite complex

Multifactor ANOVA | Assumptions


  • Normally distributed groups
    • robust to non-normality if equal variances and sample sizes
  • Equal variances across groups
    • okay if largest-to-smallest variance ratio < 3:1
    • problematic if there is a mean-variance relationship among groups
  • Observations in a group are independent
    • randomly selected
    • don’t confound group with another factor

Two Factor Factorial Designs


Interpretation | significant main and interaction effects


Interaction plots
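
A sketch of a two-factor fit and a base-R interaction plot, assuming a hypothetical data frame `d` with response y and factors A and B:

```r
fit <- aov(y ~ A * B, data = d)   # main effects plus A:B interaction
summary(fit)

# Non-parallel lines in the plot suggest an interaction
with(d, interaction.plot(A, B, y))
```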

R Interlude | Exercise 4.2


Multivariate Statistics

What are multivariate statistics?


  • General - more than one variable recorded from a number of experimental sampling units
  • Specific - two or more response variables that likely co-vary
  • Goals of multivariate statistics
    • Data reduction and simplification (PCA and PCoA)
    • Organization of objects (Cluster Analysis and MDS)
    • Testing the effects of a factor on linear combinations of variables (MANOVA and DFA)

Conceptual overview of multivariate statistics


Multivariate statistics definitions


  • Some definitions first
  • i = 1 to n objects and j = 1 to p variables
  • Measure of center of a multivariate distribution = the centroid
  • Multivariate statistics uses eigenanalysis of either matrices of covariances of variables (p-by-p), or dissimilarities of objects (n-by-n)
  • Matrix and linear algebra are therefore very useful in multivariate statistics

Multivariate statistics conceptual overview

Eigenanalysis


\[z_{ik} = c_1y_{i1} + c_2y_{i2} + c_3y_{i3} + \dots + c_py_{ip}\]


  • Derives linear combinations of the original variables that best summarize the total variation in the data
  • These new linear combinations become new variables themselves
  • Each object can now have a score for the new variables

Eigenvalues


  • Also called
    • characteristic or latent roots or
    • factors
  • Rearranging the variance in the association matrix so that the first few derived variables explain most of the variation that was present between objects in the original variables
  • The eigenvalues can also be expressed as proportions or percents of the original variance explained by each new derived variable (also called components or factors)

Eigenvectors


  • Lists of the coefficients or weights showing how much each original variable contributes to each new derived variable
  • The linear combinations can be solved to provide a score (\(z_{ik}\)) for each object
  • There are the same number of derived variables as there are original variables (p)
  • The newly derived variables are extracted sequentially so that they are uncorrelated with each other
  • The eigenvalues and eigenvectors can be derived using either spectral decomposition of the p-by-p matrix or singular value decomposition of the original matrix
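
A minimal sketch with base R's prcomp() (which uses singular value decomposition under the hood), on the built-in iris measurements:

```r
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

summary(pca)    # proportion of variance explained by each PC
pca$rotation    # eigenvectors: how each variable loads on each PC
head(pca$x)     # scores of the objects on the new derived variables
```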

Correlation vs. dissimilarity matrices

  • Principal Component Analysis (PCA) and Correspondence Analysis (CA) use covariance or correlation of variables.
  • Principal Coordinate Analysis (PCoA), Cluster Analysis and Multidimensional Scaling (MDS) use dissimilarity indices.
  • The scaling of the derived latent variables can therefore differ between analyses that use covariance of variables as compared to dissimilarity indices.
  • Whether the derived variable can be considered metric (e.g. on a ratio scale) or non-metric (e.g. on an ordinal scale) depends on the analysis.
  • As a consequence the downstream use of the derived variables can change depending upon the analysis.

Correlation vs. dissimilarity matrices


Dissimilarity indices for continuous variables


How many PC’s or PCo’s should I examine?


What else can I do with the z-scores of the new PCs and metric PCoAs?


  • They’re nice new variables that you can use in any analysis you’ve learned previously!!
  • You can perform single or multiple regression of your PCs on other continuous variables
  • For example, ask whether they correlate with an environmental gradient or body mass index.
  • If you have one or more grouping variables you can use ANOVA on each newly derived PC.
  • NOTE - non-metric PCoA or NMDS values cannot just be put into linear models!!

R Interlude | Exercises 4.3-4.4

Get more advanced and bring it all home sisters and brothers

Design principles for planning a good experiment

What is an experimental study?

  • In an experimental study the researcher assigns treatments to units
  • In an observational study nature does the assigning of treatments to units
  • The crucial advantage of experiments derives from the random assignment of treatments to units
  • Random assignment, or randomization, minimizes the influence of confounding variables

Mount Everest example

Survival of climbers of Mount Everest is higher for individuals taking supplemental oxygen than those who don’t.

Why?

Mount Everest example

  • One possibility is that supplemental oxygen (explanatory variable) really does cause higher survival (response variable).
  • The other is that the two variables are associated because other variables affect both supplemental oxygen and survival.
  • Use of supplemental oxygen might be a benign indicator of a greater overall preparedness of the climbers that use it.
  • Variables (like preparedness) that distort the causal relationship between the measured variables of interest (oxygen use and survival) are called confounding variables
  • They are correlated with the variable of interest, and therefore prevent a decision about cause and effect.
  • With random assignment, no confounding variables will be associated with treatment except by chance.

Clinical Trials

  • The gold standard of experimental designs is the clinical trial
  • Experimental design in all areas of biology have been informed by procedures used in clinical trials
  • A clinical trial is an experimental study in which two or more treatments are assigned to human subjects
  • The design of clinical trials has been refined because the cost of making a mistake with human subjects is so high
  • Experiments on nonhuman subjects are simply called “laboratory experiments” or “field experiments”

Example of a clinical trial

  • Transmission of the HIV-1 virus via sex workers contributes to the rapid spread of AIDS in Africa
  • The spermicide nonoxynol-9 had shown in vitro activity against HIV-1, which motivated a clinical trial by van Damme et al. (2002).
  • They tested whether a vaginal gel containing the chemical would reduce the risk of acquiring the disease by female sex workers.
  • Data were gathered on a volunteer sample of 765 HIV-free sex-workers in six clinics in Asia and Africa.
  • Two gel treatments were assigned randomly to women at each clinic.
  • One gel contained nonoxynol-9 and the other a placebo.
  • Neither the subjects nor the researchers making observations at the clinics knew who received the treatment and who got the placebo.

Example of a clinical trial


Design components of a clinical trial

The goal of experimental design is to eliminate bias and to reduce sampling error when estimating and testing effects of one variable on another.

  • To reduce bias, the experiment included:
    • Simultaneous control group: study included both the treatment of interest and a control group (the women receiving the placebo).
    • Randomization: treatments were randomly assigned to women at each clinic.
    • Blinding: neither the subjects nor the clinicians knew which women were assigned which treatment.

Design components of a clinical trial

  • To reduce the effects of sampling error, the experiment included:
    • Replication: study was carried out on multiple independent subjects.
    • Balance: number of women was nearly equal in the two groups at every clinic.
    • Blocking: subjects were grouped according to the clinic they attended, yielding multiple repetitions of the same experiment in different settings (“blocks”).

Simultaneous control group

  • In clinical trials either a placebo or the currently accepted treatment should be provided.
  • In experiments requiring intrusive methods to administer treatment, such as
    • injections
    • surgery
    • restraint
    • confinement
  • the control subjects should be perturbed in the same way as the other subjects, except for the treatment itself, as far as ethical considerations permit.

Simultaneous control group

  • The “sham operation”, in which surgery is carried out without the experimental treatment itself, is an example.
  • In field experiments, applying a treatment of interest may physically disturb the plots receiving it and the surrounding areas, perhaps by researchers trampling the ground.
  • Ideally, the same disturbance should be applied to the control plots.

Randomization

  • The researcher should randomize assignment of treatments to units or subjects
  • Chance rather than conscious or unconscious decision determines which units end up receiving the treatment and which the control
  • A completely randomized design is one in which treatments are assigned to all units by randomization

Randomization

  • Randomization breaks the association between possible confounding variables and the explanatory variable
  • Randomization doesn’t eliminate the variation contributed by confounding variables, only their correlation with treatment
  • Randomization ensures that variation from confounding variables is similar between the different treatment groups.

Randomization

  • Randomization should be carried out using a random process:
    • List all n subjects, one per row, in a computer spreadsheet.
    • Use the computer to give each individual a random number.
    • Assign treatment A to those subjects receiving the lowest numbers and treatment B to those with the highest numbers.
  • Other ways of assigning treatments to subjects are almost always inferior because they do not eliminate the effects of confounding variables.
  • “Haphazard” assignment, in which the researcher chooses a treatment while trying to make it random, has repeatedly been shown to be non-random and prone to bias.

Blinding

  • Blinding is the process of concealing information from participants (sometimes including researchers) about which subjects receive which treatment.
  • Blinding prevents subjects and researchers from changing their behavior, consciously or unconsciously, as a result of knowing which treatment they were receiving or administering.
  • For example, studies showing that acupuncture has a significant effect on back pain are limited to those without blinding (Ernst and White 1998).

Blinding

  • In a single-blind experiment, the subjects are unaware of the treatment that they have been assigned.
  • Treatments must be indistinguishable to subjects, which prevents them from responding differently according to knowledge of treatment.
  • Blinding can also be a concern in non-human studies where animals respond to stimuli

Blinding

  • In a double-blind experiment the researchers administering the treatments and measuring the response are also unaware of which subjects are receiving which treatments
    • Researchers sometimes have pet hypotheses, and they might treat experimental subjects in different ways depending on their hopes for the outcome
    • Many response variables are difficult to measure and require some subjective interpretation, which makes the results prone to a bias
    • Researchers are naturally more interested in the treated subjects than the control subjects, and this increased attention can itself result in improved response

Blinding

  • Reviews of medical studies have revealed that studies carried out without double-blinding exaggerated treatment effects by 16% on average compared with studies carried out with double-blinding (Jüni et al. 2001).
  • Experiments on non–human subjects are also prone to bias from lack of blinding.

Blinding

  • Bebarta et al. (2003) reviewed 290 two-treatment experiments carried out on animals or on cell lines. The odds of detecting a positive effect of treatment were more than threefold higher in studies without blinding than in studies with blinding.
  • Blinding can be incorporated into experiments on nonhuman subjects using coded tags that identify the subject to a “blind” observer without revealing the treatment (and who measures units from different treatments in random order).

Replication

  • The goal of experiments is to estimate and test treatment effects against the background of variation between individuals (“noise”) caused by other variables
  • One way to reduce noise is to make the experimental conditions constant
  • In field experiments, however, highly constant experimental conditions might be neither feasible nor desirable
  • By limiting the conditions of an experiment, we also limit the generality of the results
  • Another way to make treatment effects stand out is to include extreme treatments and to replicate the data.

Replication

  • Replication is the assignment of each treatment to multiple, independent experimental units.
  • Without replication, we would not know whether response differences were due to the treatments or just chance differences between the treatments caused by other factors.
  • Studies that use more units (i.e. that have larger sample sizes) will have smaller standard errors and a higher probability of getting the correct answer from a hypothesis test.
  • Larger samples mean more information, and more information means better estimates and more powerful tests.

Replication

  • Replication is not about the number of plants or animals used, but the number of independent units in the experiment. An “experimental unit” is the independent unit to which treatments are assigned.
  • The figure shows three experimental designs used to compare plant growth under two temperature treatments (indicated by the shading of the pots). The first two designs are un-replicated.

Pseudoreplication

Balance

  • A study design is balanced if all treatments have the same sample size.
  • Conversely, a design is unbalanced if there are unequal sample sizes between treatments.
  • Balance is a second way to reduce the influence of sampling error on estimation and hypothesis testing.
  • To appreciate this, look again at the equation for the standard error of the difference between two treatment means.

Balance

  • For a fixed total number of experimental units, \(n_1 + n_2\), the standard error is smallest when \(n_1\) and \(n_2\) are equal.
  • Balance has other benefits. For example, ANOVA is more robust to departures from the assumption of equal variances when designs are balanced or nearly so.

Blocking

  • Blocking is the grouping of experimental units that have similar properties. Within each block, treatments are randomly assigned to experimental units.
  • Blocking essentially repeats the same, completely randomized experiment multiple times, once for each block.
  • Differences between treatments are only evaluated within blocks, and in this way the component of variation arising from differences between blocks is discarded.

Blocking | Paired designs

  • For example, consider the design choices for a two-treatment experiment to investigate the effect of clear cutting on salamander density.
  • In the completely randomized (“two-sample”) design we take a random sample of forest plots from the population and then randomly assign each plot to either the clear-cut treatment or the no clear-cut treatment.
  • In the paired design we take a random sample of forest plots and clear-cut a randomly chosen half of each plot, leaving the other half untouched.

Blocking | Paired designs

  • In the paired design, measurements on adjacent plot-halves are not independent. This is because they are likely to be similar in soil, water, sunlight, and other conditions that affect the number of salamanders.
  • As a result, we must analyze paired data differently than when every plot is independent of all the others, as in the case of the two-sample design.
  • Paired design is usually more powerful than completely randomized design because it controls for a lot of the extraneous variation between plots or sampling units that sometimes obscures the effects we are looking for.

Blocking | Paired designs

Blocking | Randomized complete block design

  • RCB design is analogous to the paired design, but may have more than two treatments. Each treatment is applied once to every block.
  • As in the paired design, treatment effects in a randomized block design are measured by differences between treatments exclusively within blocks.
  • By accounting for some sources of sampling variation blocking can make differences between treatments stand out.
  • Blocking is worthwhile if units within blocks are relatively homogeneous, apart from treatment effects, and units belonging to different blocks vary because of environmental or other differences.

What if you can’t do experiments?

  • Experimental studies are not always feasible, in which case we must fall back upon observational studies.
  • The best observational studies incorporate as many of the features of good experimental design as possible to minimize bias (e.g., blinding) and the impact of sampling error (e.g., replication, balance, blocking, and even extreme treatments) except for one: randomization.
  • Randomization is out of the question, because in an observational study the researcher does not assign treatments to subjects.
  • Two strategies are used to limit the effects of confounding variables on a difference between treatments in a controlled observational study: matching, and adjusting for known confounding variables (covariates).

How to present your statistical results

Style of a results section

  • Write the text of the Results section concisely and objectively.
  • The passive voice will likely dominate here, but use the active voice as much as possible.
  • Use the past tense.
  • Avoid repetitive paragraph structures. Do not interpret the data here.

Function of a results section

  • The function is to objectively present your key results, without interpretation, in an orderly and logical sequence using both text and illustrative materials (Tables and Figures).

  • The results section always begins with text, reporting the key results and referring to figures and tables as you proceed.

  • The text of the Results section should be crafted to follow this sequence and highlight the evidence needed to answer the questions/hypotheses you investigated.

  • Important negative results should be reported, too. Authors usually write the text of the results section based upon the sequence of Tables and Figures.

Summaries of the statistical analyses

  • May appear either in the text (usually parenthetically) or in the relevant Tables or Figures (in the legend or as footnotes to the Table or Figure).
  • Each Table and Figure must be referenced in the text portion of the results, and you must tell the reader what the key result(s) is that each Table or Figure conveys.
  • Tables and Figures are assigned numbers separately and in the sequence that you will refer to them from the text.
    • The first Table you refer to is Table 1, the next Table 2 and so forth.
    • Similarly, the first Figure is Figure 1, the next Figure 2, etc.

Summaries of the statistical analyses

  • Each Table or Figure must include a brief description of the results being presented and other necessary information in a legend.
    • Table legends go above the Table; tables are read from top to bottom.
    • Figure legends go below the figure; figures are usually viewed from bottom to top.
  • When referring to a Figure from the text, “Figure” is abbreviated as “Fig.”, e.g., “(Fig. 1)”. “Table” is never abbreviated, e.g., “Table 1”.

Example

  • For example, suppose you asked the question, “Is the average height of male students the same as female students in a pool of randomly selected Biology majors?” You would first collect height data from large random samples of male and female students. You would then calculate the descriptive statistics for those samples (mean, SD, n, range, etc) and plot these numbers. Suppose you found that male Biology majors are, on average, 12.5 cm taller than female majors; this is the answer to the question. Notice that the outcome of a statistical analysis is not a key result, but rather an analytical tool that helps us understand what is our key result.

Differences, directionality, and magnitude

  • Report your results so as to provide as much information as possible to the reader about the nature of differences or relationships.

  • If you are testing for differences among groups, and you find a significant difference, it is not sufficient to simply report that “groups A and B were significantly different”. How are they different and by how much?

  • Much more informative to say “Group A individuals were 23% larger than those in Group B”, or, “Group B pups gained weight at twice the rate of Group A pups.”

  • Report the direction of differences (greater, larger, smaller, etc) and the magnitude of differences (% difference, how many times, etc.) whenever possible.

Statistical results in text

  • Statistical test summaries (test name, p-value) are usually reported parenthetically in conjunction with the biological results they support. This parenthetical reference should include the statistical test used, the value, degrees of freedom and the level of significance.

  • For example, if you found that the mean height of male Biology majors was significantly larger than that of female Biology majors, you might report this result (in blue) and your statistical conclusion (shown in red) as follows:

    • “Males (180.5 ± 5.1 cm; n=34) averaged 12.5 cm taller than females (168 ± 7.6 cm; n=34) in the pool of Biology majors (two-sample t-test, t = 5.78, 33 d.f., p < 0.001).”

Statistical results in text

  • If the summary statistics are shown in a figure, the sentence above need not report them specifically, but must include a reference to the figure where they may be seen:

    • “Males averaged 12.5 cm taller than females in the pool of Biology majors (two-sample t-test, t = 5.78, 33 d.f., p < 0.001; Fig. 1).”

Statistical results in text

  • Always enter the appropriate units when reporting data or summary statistics.
    • for an individual value you would write, “the mean length was 10 cm”, or, “the maximum time was 140 min.”
    • When including a measure of variability, place the unit after the error value, e.g., “…was 10 ± 2.3 m”.
    • Likewise place the unit after the last in a series of numbers all having the same unit. For example: “lengths of 5, 10, 15, and 20 m”, or “no differences were observed after 2, 4, 6, or 8 min. of incubation”.