git fetch
git status
git merge origin/master
Applied math, statistics and computation are the foundations of genomic analysis
Period
WE WILL COVER TODAY
R
EXTRA AT THE END OF THE SLIDES
R
?R
?
Hadley Wickham and others have written R
packages to modify data
These packages do many of the same things as base functions in R
However, they are specifically designed to do them faster and more easily
Wickham also wrote the package ggplot2
for elegant graphics creation
GG stands for ‘Grammar of Graphics’
dplyr for vectors
filter()
arrange()
select()
mutate()
summarise()
filter(), arrange() & select()
mutate() & transmute()
mutate() adds a new variable that is a function of other variable(s) and keeps the existing variables
transmute() computes the new variable in the same way, but keeps only the new variable and drops the old ones
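A minimal sketch of the difference between the two functions, assuming the dplyr package is installed. The `trips` data frame and its column names are made up for illustration.

```r
library(dplyr)

# Hypothetical data: distance driven and fuel used on three trips
trips <- data.frame(dist_km = c(100, 250, 400), fuel_l = c(8, 18, 30))

# mutate() keeps the existing columns and appends the new one
with_new <- mutate(trips, l_per_100km = fuel_l / dist_km * 100)

# transmute() returns only the columns created in the call
only_new <- transmute(trips, l_per_100km = fuel_l / dist_km * 100)

names(with_new)
names(only_new)
```

Comparing the two `names()` results shows which columns each function retains.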
group_by() & summarize()
This first function allows you to aggregate data by values of categorical variables (factors)
Once you have done this aggregation, you can then calculate values (in this case the mean) of other variables split by the new aggregated levels of the categorical variable
group_by() & summarize()
The na.rm argument removes the missing values prior to computation
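A short sketch of grouping, summarizing, and the effect of `na.rm`, assuming dplyr is installed. The `measurements` data frame is invented for illustration.

```r
library(dplyr)

# Hypothetical measurements with one missing value in the "mut" group
measurements <- data.frame(
  genotype = c("wt", "wt", "mut", "mut", "mut"),
  weight   = c(10, 12, NA, 7, 9)
)

by_genotype <- measurements %>%
  group_by(genotype) %>%
  summarise(
    mean_raw   = mean(weight),                # NA if any value in the group is missing
    mean_clean = mean(weight, na.rm = TRUE)   # missing values dropped before averaging
  )

by_genotype
```

Without `na.rm = TRUE`, a single missing value makes the whole group mean `NA`.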
“Graphical excellence is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space”
— Edward Tufte
Draw graphical elements clearly, minimizing clutter
Represent magnitudes honestly and accurately
“Graphical excellence begins with telling the truth about the data” – Tufte 1983
geom_bar function
Now try this…
and this…
and finally this…
geom_histogram and geom_freqpoly functions
With geom_histogram you can make a histogram
geom_freqpoly allows you to make a frequency polygon
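A possible example of both geoms, assuming ggplot2 is installed and using its bundled `mpg` data. The choice of `hwy` and of `binwidth = 2` is arbitrary and only for illustration.

```r
library(ggplot2)

# Histogram of highway mileage: counts drawn as bars
p_hist <- ggplot(data = mpg, mapping = aes(x = hwy)) +
  geom_histogram(binwidth = 2)

# Frequency polygon: the same binned counts drawn as a connected line
p_poly <- ggplot(data = mpg, mapping = aes(x = hwy)) +
  geom_freqpoly(binwidth = 2)

p_hist
p_poly
```

Frequency polygons are often easier to compare across groups, for example by adding `color = class` inside `aes()`.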
geom_boxplot function
Boxplots are very useful for visualizing data
ggplot(data=mpg, mapping=aes(x=reorder(class, hwy, FUN=median), y=hwy)) +
  geom_boxplot() +
  coord_flip()
geom_point & geom_smooth functions
ggplot(data=mpg) +
  geom_point(mapping=aes(x=displ, y=hwy)) +
  geom_smooth(mapping=aes(x=displ, y=hwy), method = "loess")

ggplot(data=mpg, aes(displ, hwy)) +
  geom_point(aes(color=class)) +
  geom_smooth(se=FALSE, method="loess") +
  labs(title = "Fuel efficiency generally decreases with engine size",
       subtitle = "Two seaters (sports cars) are an exception because of their light weight",
       caption = "Data from fueleconomy.gov")
---
title: "Flexdashboard Options"
output:
  flexdashboard::flex_dashboard:
    vertical_layout: fill   # 'fill' or 'scroll'
    # or
    orientation: rows
---
Probability is the expression of belief in some future outcome
A random variable can take on different values with different probabilities
The sample space of a random variable is the universe of all possible values
Describes the expected outcome of a single event with probability p
Example of flipping of a fair coin once
\[Pr(X=\text{Head}) = \frac{1}{2} = 0.5 = p \]
\[Pr(X=\text{Tails}) = \frac{1}{2} = 0.5 = 1 - p \]
\[ Pr(\text{X=H and Y=H}) = p*p = p^2 \]
\[ Pr(\text{X=H and Y=T}) = p*(1-p) \]
\[ Pr(\text{X=T and Y=H}) = (1-p)*p \]
\[ Pr(\text{X=T and Y=T}) = (1-p)*(1-p) = (1-p)^2 \]
For a fair coin, p = 1 - p = 0.5, so each of these equals 0.25
H and T can occur in any order
\[ \text{Pr(X=H and Y=T) or Pr(X=T and Y=H)} = \] \[ p(1-p) + (1-p)p = 2p(1-p) \]
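The two-flip probabilities can be checked by enumerating the sample space. This is a small sketch assuming independent flips with Pr(Head) = p. The helper `prob_of()` is a hypothetical name introduced here.

```r
p <- 0.5  # probability of heads for a fair coin

# All four ordered outcomes of two flips
outcomes <- expand.grid(X = c("H", "T"), Y = c("H", "T"),
                        stringsAsFactors = FALSE)

# Probability of a single flip landing on the given side
prob_of <- function(side) ifelse(side == "H", p, 1 - p)

# Independence: joint probability is the product of the two flips
outcomes$prob <- prob_of(outcomes$X) * prob_of(outcomes$Y)

# One head and one tail, in either order
one_head_one_tail <- sum(outcomes$prob[outcomes$X != outcomes$Y])

outcomes
one_head_one_tail
```

Changing `p` away from 0.5 shows how the four outcome probabilities diverge while still summing to one.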
\[E[X] = \sum_{\text{all x}}^{}xP(X=x) = \mu\]
\[Var(X) = E[(X-\mu)^2] = E[X^2] - \mu^2 = \sigma^2\]
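As a worked example of the expectation and variance of a discrete random variable, here is a sketch for a fair six-sided die, using the definition Var(X) = E[(X - mu)^2]. The die is an illustrative choice, not part of the original slides.

```r
x  <- 1:6            # possible values of the die
px <- rep(1 / 6, 6)  # each value is equally likely

mu      <- sum(x * px)             # E[X]
var_def <- sum((x - mu)^2 * px)    # E[(X - mu)^2]
var_alt <- sum(x^2 * px) - mu^2    # E[X^2] - (E[X])^2

mu
var_def
var_alt
```

Both variance calculations agree, which is a quick check that the two forms are equivalent.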
\[Pr(X,Y) = Pr(X) * Pr(Y)\]
\[Pr(Y|X) = Pr(Y)\text{ and }Pr(X|Y) = Pr(X)\]
\[Pr(Y|X) \neq Pr(Y)\text{ and }Pr(X|Y) \neq Pr(X)\]
The probability of an event is the proportion of times that the event would occur if we repeated a random trial over and over again under the same conditions.
The likelihood of a parameter value is the probability of the observed data given that parameter value
The likelihood of a population parameter equaling a specific value, given the data
L[parameter|data] = Pr[data|parameter]
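One way to see L[parameter|data] = Pr[data|parameter] in practice is to evaluate the probability of fixed data over a grid of candidate parameter values. The data here (7 heads in 10 flips) are invented for illustration.

```r
heads <- 7   # hypothetical observed data
flips <- 10

# Candidate values for the probability of heads
p_grid <- seq(0, 1, by = 0.01)

# Likelihood of each candidate: Pr[data | p]
likelihood <- dbinom(heads, size = flips, prob = p_grid)

# The candidate with the highest likelihood
p_hat <- p_grid[which.max(likelihood)]
p_hat
```

The data stay fixed while the parameter varies, which is what distinguishes a likelihood curve from a probability distribution over outcomes.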
\[ P(\theta|d) = \frac{P(d|\theta)P(\theta)}{P(d)}\]
where
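The probability of needing k trials before the first “one” appears is given by the geometric distribution
The probability that the first “one” appears on the first trial is p, but the probability that the first “one” appears on the second trial is (1-p)*p
The probability of k-1 failures before the first success is:
\[P(X=k)=(1-p)^{k-1}p\]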
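The geometric formula can be compared with R's built-in `dgeom()`. Note that `dgeom()` is parameterized by the number of failures before the first success, so it is called with k - 1. The value p = 0.2 is arbitrary.

```r
p <- 0.2   # probability of success on each trial
k <- 1:10  # trial on which the first success occurs

manual  <- (1 - p)^(k - 1) * p     # formula from the slide
builtin <- dgeom(k - 1, prob = p)  # k - 1 failures, then a success

all.equal(manual, builtin)
```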
A binomial distribution results from the combination of several independent Bernoulli events
\[\large f(k) = {n \choose k} p^{k} (1-p)^{n-k}\]
n is the total number of trials
k is the number of successes
p is the probability of success
q is the probability of not success
p = 1-q
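A quick check of the binomial formula against R's `dbinom()`. The values of n and p are arbitrary choices for illustration.

```r
n <- 10    # total number of trials
p <- 0.3   # probability of success
k <- 0:n   # every possible number of successes

manual  <- choose(n, k) * p^k * (1 - p)^(n - k)  # formula from the slide
builtin <- dbinom(k, size = n, prob = p)

all.equal(manual, builtin)
sum(manual)  # probabilities over all k sum to one
```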
r “ones” have appeared.
\[P(X=k)={k-1 \choose r-1}p^{r-1}(1-p)^{k-r}p\]
which simplifies to
\[P(X=k)={k-1 \choose r-1}p^{r}(1-p)^{k-r}\]
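The negative binomial probability, written with a binomial coefficient for the k - 1 trials that precede the final success, can be checked against `dnbinom()`. R counts failures rather than total trials, so the function is called with k - r. The values of r and p are arbitrary.

```r
r <- 3           # number of successes required
p <- 0.4         # probability of success on each trial
k <- r:(r + 20)  # total number of trials needed

manual  <- choose(k - 1, r - 1) * p^r * (1 - p)^(k - r)
builtin <- dnbinom(k - r, size = r, prob = p)  # k - r failures

all.equal(manual, builtin)
```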
Another common situation in biology is when each trial is discrete, but the number of observations of each outcome is observed/counted
Pr(Y=r) is the probability that the number of occurrences of an event y equals a count r in the total number of trials
\[Pr(Y=r) = \frac{e^{-\mu}\mu^r}{r!}\]
\[Pr(y=r) = \frac{e^{-\lambda}\lambda^r}{r!}\]
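A small check of the Poisson formula against `dpois()`, which also illustrates that the mean and variance both equal lambda. The value lambda = 2.5 is arbitrary, and counts above 50 are ignored because their probability is negligible here.

```r
lambda <- 2.5
r <- 0:50  # counts; the tail beyond 50 is negligible for this lambda

manual  <- exp(-lambda) * lambda^r / factorial(r)  # formula from the slide
builtin <- dpois(r, lambda)

pois_mean <- sum(r * manual)
pois_var  <- sum((r - pois_mean)^2 * manual)

all.equal(manual, builtin)
c(mean = pois_mean, variance = pois_var)
```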
P(observation lies within dx of x) = f(x)dx
\[P(a\leq X \leq b) = \int_{a}^{b} f(x) dx\]
Remember that the integral of the density over all possible values equals one
\[\int_{-\infty}^{\infty} f(x) dx = 1\]
E[X] may be found by integrating the product of x and the probability density function over all possible values of x:
\[E[X] = \int_{-\infty}^{\infty} xf(x) dx \]
\(Var(X) = E[X^2] - (E[X])^2\), where the expectation of \(X^2\) is
\[E[X^2] = \int_{-\infty}^{\infty} x^2f(x) dx \]
\[E[X] = \int_{a}^{b} x\frac{1}{b-a} dx = \frac{(a+b)}{2} \]
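These integrals can be approximated numerically with `integrate()`. The sketch below uses a uniform distribution on [a, b]. The endpoints 2 and 6 are arbitrary.

```r
a <- 2
b <- 6
f <- function(x) dunif(x, min = a, max = b)  # density 1 / (b - a) on [a, b]

total <- integrate(f, a, b)$value                      # should be 1
ex    <- integrate(function(x) x * f(x), a, b)$value   # E[X]
ex2   <- integrate(function(x) x^2 * f(x), a, b)$value # E[X^2]
vx    <- ex2 - ex^2                                    # Var(X)

c(total = total, mean = ex, variance = vx)
```

The numerical mean matches (a + b) / 2, and the variance matches the known closed form (b - a)^2 / 12.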
\[f(x)=\lambda e^{-\lambda x}\]
E[X] can be found by integrating \(xf(x)\) from 0 to infinity
\[f(x) = \frac{e^{-\lambda x}(\lambda x)^{r-1}}{(r-1)!}\lambda\]
\[ Mean = \frac{r}{\lambda} \] \[ Variance = \frac{r}{\lambda^2} \]
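The gamma mean and variance can be verified numerically. This sketch assumes an integer shape r, so that (r - 1)! is defined, and arbitrary values r = 3 and lambda = 2. The function name `gamma_pdf()` is introduced here for illustration.

```r
r      <- 3  # shape: number of events waited for
lambda <- 2  # rate

# Gamma density written out by hand
gamma_pdf <- function(x) exp(-lambda * x) * (lambda * x)^(r - 1) / factorial(r - 1) * lambda

# Agreement with R's built-in density at a few points
x_check <- c(0.1, 0.5, 1, 2)
all.equal(gamma_pdf(x_check), dgamma(x_check, shape = r, rate = lambda))

g_mean <- integrate(function(x) x * gamma_pdf(x), 0, Inf)$value
g_ex2  <- integrate(function(x) x^2 * gamma_pdf(x), 0, Inf)$value
g_var  <- g_ex2 - g_mean^2

c(mean = g_mean, variance = g_var)  # compare with r / lambda and r / lambda^2
```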
where \[\large \pi \approx 3.14159\]
\[\large e \approx 2.71828\]
To write that a variable (v) is distributed as a normal distribution with mean \(\mu\) and variance \(\sigma^2\), we write the following:
\[\large v \sim \mathcal{N} (\mu,\sigma^2)\]
Estimate of the mean from a single sample
\[\Large \bar{x} = \frac{1}{n}\sum_{i=1}^{n}{x_i} \]
Estimate of the variance from a single sample
\[\Large s^2 = \frac{1}{n-1}\sum_{i=1}^{n}{(x_i - \bar{x})^2} \]
\[\huge z_i = \frac{(x_i - \bar{x})}{s}\]
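The three estimators can be computed by hand and compared with R's built-in functions. The sample values below are made up.

```r
x <- c(4.1, 5.3, 6.0, 4.8, 5.9)  # hypothetical sample
n <- length(x)

x_bar <- sum(x) / n                    # sample mean
s2    <- sum((x - x_bar)^2) / (n - 1)  # sample variance (n - 1 denominator)
z     <- (x - x_bar) / sqrt(s2)        # z-scores

all.equal(x_bar, mean(x))
all.equal(s2, var(x))
all.equal(z, as.numeric(scale(x)))
```

After standardizing, the z-scores have mean 0 and standard deviation 1.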
What is the probability that we would reject a true null hypothesis?
What is the probability that we would accept a false null hypothesis?
How do we decide when to reject a null hypothesis and support an alternative?
What can we conclude if we fail to reject a null hypothesis?
What parameter estimates of distributions are important to test hypotheses?
\[\large t = \frac{(\bar{y}_1-\bar{y}_2)}{s_{\bar{y}_1-\bar{y}_2}} \]
where
which is the calculation for the standard error of the mean difference
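A sketch of the t statistic computed by hand. It assumes the pooled-variance form of the standard error of the mean difference (equal variances in the two groups), which is one common choice and may not be the exact formula the slide showed. The two samples are invented.

```r
y1 <- c(5.1, 4.9, 6.2, 5.8, 5.5)  # hypothetical group 1
y2 <- c(4.2, 4.8, 5.0, 4.4, 4.1)  # hypothetical group 2
n1 <- length(y1)
n2 <- length(y2)

# Pooled variance (assumption: equal variances in both groups)
sp2 <- ((n1 - 1) * var(y1) + (n2 - 1) * var(y2)) / (n1 + n2 - 2)

# Standard error of the difference between the two means
se_diff <- sqrt(sp2 * (1 / n1 + 1 / n2))

t_manual  <- (mean(y1) - mean(y2)) / se_diff
t_builtin <- unname(t.test(y1, y2, var.equal = TRUE)$statistic)

c(manual = t_manual, builtin = t_builtin)
```

By default `t.test()` uses the Welch correction instead, so `var.equal = TRUE` is needed for the two values to match.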
\[ Power \propto \frac{(ES)(\alpha)(\sqrt n)}{\sigma}\]
Power is proportional to the combination of these parameters
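The dependence of power on sample size can be explored with `power.t.test()` from the stats package. The effect size, standard deviation, and sample sizes below are arbitrary illustrative values.

```r
sample_sizes <- c(5, 10, 20, 40)  # per-group n

# Power of a two-sample t-test for a fixed effect size and sd
power_by_n <- sapply(sample_sizes, function(n) {
  power.t.test(n = n, delta = 1, sd = 1.5, sig.level = 0.05)$power
})

data.frame(n = sample_sizes, power = round(power_by_n, 3))
```

Holding the other arguments fixed and varying `delta`, `sd`, or `sig.level` shows the remaining terms of the proportionality in the same way.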
lm in R for fixed effects
nlme or lme4 for random effects
General Linear Model (GLM) - two or more continuous variables
General Linear Mixed Model (GLMM) - a continuous response variable with a mix of continuous and categorical predictor variables
Generalized Linear Model - a GLM that doesn’t assume normality of the response
Generalized Additive Model (GAM) - a model that doesn’t assume linearity
All can be written in the form
response variable = intercept + (explanatory_variables) + random_error
in the general form:
\[ Y=\beta_0 +\beta_1*X_1 + \beta_2*X_2 +... + \epsilon\]
where \(\beta_0, \beta_1, \beta_2, ....\) are the parameters of the linear model
ggplot2 figures
\[H_0 : \beta_0 = 0\] \[H_0 : \beta_1 = 0\]
full model - \(y_i = \beta_0 + \beta_1*x_i + error_i\)
reduced model - \(y_i = \beta_0 + 0*x_i + error_i\)
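The full and reduced models can be fit with `lm()` and compared with `anova()`. This sketch uses R's built-in `cars` data (stopping distance versus speed) purely as a convenient example.

```r
# Full model: intercept and slope
full <- lm(dist ~ speed, data = cars)

# Reduced model: slope fixed at zero, intercept only
reduced <- lm(dist ~ 1, data = cars)

# F-test of whether the slope term improves the fit
comparison <- anova(reduced, full)
p_value <- comparison[["Pr(>F)"]][2]

comparison
p_value
```

A small p-value is evidence against the null hypothesis that the slope equals zero.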
\[y_i = \beta_0 + \beta_1 * x_i + \epsilon_i\]
Additive model \[y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + ... + \beta_jx_{ij} + \epsilon_i\]
Multiplicative model (with two predictors) \[y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \beta_3x_{i1}x_{i2} + \epsilon_i\]
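In R's formula syntax, the additive model uses `+` and the multiplicative model uses `*`, which expands to both main effects plus their interaction. The simulated data and coefficient values below are assumptions made for illustration.

```r
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)

# Simulated response with a true interaction effect of 0.8
y <- 2 + 1.5 * x1 - 0.5 * x2 + 0.8 * x1 * x2 + rnorm(n, sd = 0.5)

additive       <- lm(y ~ x1 + x2)  # beta_0 + beta_1*x1 + beta_2*x2
multiplicative <- lm(y ~ x1 * x2)  # adds the beta_3*x1*x2 term

names(coef(additive))
names(coef(multiplicative))
anova(additive, multiplicative)  # tests the interaction term
```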
From Langford, D. J., et al. 2006. Science 312: 1967-1970
In words:
stretching = intercept + treatment
i = 1 to n objects and j = 1 to p variables
(p-by-p), or dissimilarities of objects (n-by-n)
\[Z_{ik} = c_1y_{i1} + c_2y_{i2} + c_3y_{i3} + ... + c_py_{ip}\]
(p)
p-by-p matrix or singular value decomposition of the original matrix
Get more advanced and bring it all home sisters and brothers
Survival of climbers of Mount Everest is higher for individuals taking supplemental oxygen than those who don’t.
Why?
The goal of experimental design is to eliminate bias and to reduce sampling error when estimating and testing effects of one variable on another.
The function is to objectively present your key results, without interpretation, in an orderly and logical sequence using both text and illustrative materials (Tables and Figures).
The results section always begins with text, reporting the key results and referring to figures and tables as you proceed.
The text of the Results section should be crafted to follow this sequence and highlight the evidence needed to answer the questions/hypotheses you investigated.
Important negative results should be reported, too. Authors usually write the text of the results section based upon the sequence of Tables and Figures.
Report your results so as to provide as much information as possible to the reader about the nature of differences or relationships.
If you are testing for differences among groups, and you find a significant difference, it is not sufficient to simply report that “groups A and B were significantly different”. How are they different and by how much?
It is much more informative to say “Group A individuals were 23% larger than those in Group B”, or, “Group B pups gained weight at twice the rate of Group A pups.”
Report the direction of differences (greater, larger, smaller, etc) and the magnitude of differences (% difference, how many times, etc.) whenever possible.
Statistical test summaries (test name, p-value) are usually reported parenthetically in conjunction with the biological results they support. This parenthetical reference should include the statistical test used, the value of the test statistic, the degrees of freedom and the level of significance.
For example, if you found that the mean height of male Biology majors was significantly larger than that of female Biology majors, you might report this result (in blue) and your statistical conclusion (shown in red) as follows:
If the summary statistics are shown in a figure, the sentence above need not report them specifically, but must include a reference to the figure where they may be seen: