R for Bioinformatic Analyses

Hannah Tavalire and Bill Cresko - University of Oregon

January 2019 - Cesky Krumlov

Lecture 1 - Using R for Biostatistical Analyses

But first a beautiful chair

Before we talk about how amazing `R` is…

navigate into this directory: ~/workshop_materials/evomics_stats_2019/
type ‘git pull’

Why use `R`?

R is a statistical programming language (derived from S)
Superb data management & graphics capabilities
You can write your own functions
Powerful and flexible
Runs on all computer platforms
Well established system of packages and documentation
Active development and dedicated community
Can use a nice GUI front end such as RStudio
Reproducibility
- keep your scripts to see exactly what was done
- distribute these with your data
- embed your R analyses in polished RMarkdown files
FREE

`R` resources

The R Project Homepage: http://www.r-project.org
Quick R Homepage: http://www.statmethods.net
Bioconductor: http://www.bioconductor.org
An Introduction to R (long!): http://cran.r-project.org/doc/manuals/R-intro.html
R for Data Science: https://r4ds.had.co.nz
Google - tutorials, guides, demos, packages and more

Running `R`

Need to make sure that you have R installed
- locally or on a server
- https://www.r-project.org
Run R from the command line
- just type R
- can run it locally as well as on clusters
Install an R Integrated Development Environment (IDE)
- RStudio: http://www.rstudio.com
- Makes working with R much easier, particularly for a new R user
- Run on Windows, Mac or Linux OS
- We’re running as a server on the AWS instances

`RStudio`

Exercise 1.1 - Exploring `RStudio`

Open RStudio by adding :8787 to your AMI url
Take a few minutes to familiarize yourself with the RStudio environment by locating the following features:
- See what types of new files can be made in RStudio by clicking the top left icon, then open a new R script.
- The windows clockwise from top left are: the code editor, the workspace and history, the plots and files window, and the R console.
- In the plots and files window, click on the packages and help tabs to see what they offer.
Now open the file called Exercises_for_R_Lectures.Rmd in /workshop_materials/evomics_stat_2019/03.Exercises/
- This file will serve as your digital notebook for parts of the workshop and contains the other exercises.

Introduction to `RMarkdown`

`RMarkdown`

A great way to embed R code into descriptive files to keep your life organized
You can insert R chunks into Rmarkdown documents
You will be doing more with markdown on Thursday

Exercise 1.2 - Intro to `RMarkdown` Files

Take a few minutes to familiarize yourself with RMarkdown files by completing exercise 1.2 in your exercises document.

BASICS of `R`

Commands can be submitted through
- terminal, console or scripts
- can be embedded as code chunks in RMarkdown
On these slides, code chunks are evaluated and their output is shown
- output is shown after the two # symbols
- the number in [] is the index of the first value printed on that line
R follows the normal order of mathematical operations (PEMDAS)

BASICS of `R`

Input code chunk and then output

4 * 4

## [1] 16

Input code chunk and then output

(4 + 3 * 2^2)

## [1] 16
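To make the order of operations concrete, here is a small sketch (the variable names are made up for illustration) showing how parentheses change the result of the same numbers:

```r
# Exponentiation is done first, then multiplication, then addition
no_parens   <- 4 + 3 * 2^2       # read as 4 + (3 * (2^2))
add_first   <- (4 + 3) * 2^2     # parentheses force the addition first
square_last <- ((4 + 3) * 2)^2   # innermost parentheses are evaluated first
neg_square  <- -2^2              # ^ is applied before the unary minus

print(c(no_parens, add_first, square_last, neg_square))
```

When in doubt, adding parentheses costs nothing and makes the intent obvious.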

Assigning Variables

A better way to do this is to assign variables
Variables are assigned values using the <- operator.
Variable names must begin with a letter, but other than that, just about anything goes.
Do keep in mind that R is case sensitive.

Assigning Variables

x <- 2
x * 3

## [1] 6

y <- x * 3
y - 2

## [1] 4

These do not work

3y <- 3
3*y <- 3
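A short sketch of names that do work, and of case sensitivity. The names `weight`, `Weight` and `fish_2` are hypothetical; `make.names()` is a base R helper that converts a string into a syntactically valid name.

```r
weight <- 10   # valid: starts with a letter
Weight <- 20   # a different variable, because R is case sensitive
fish_2 <- 30   # digits and underscores are fine after the first character

print(weight == Weight)   # the two variables hold different values

# Ask R how it would repair the invalid name "3y"
valid_version <- make.names("3y")
print(valid_version)
```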

Arithmetic operations on functions

Arithmetic operations can be performed easily on variables as well as numbers, and built-in functions can be applied to them too.

x <- 12
x + 2

## [1] 14

x^2

## [1] 144

log(x)

## [1] 2.484907

Arithmetic operations on functions

Note that the last of these, log, is a built-in R function, so the object it acts on (its argument) needs to be put in parentheses
These parentheses will be important, and we’ll come back to them later when we add arguments after the object in the parentheses
The outcome of calculations can be assigned to new variables as well, and the results can be checked using the print command
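As a preview of what adding arguments inside the parentheses looks like, here is a minimal sketch using `log()` and `round()`; the value 100 is arbitrary.

```r
x <- 100
natural_log <- log(x)                      # default: natural logarithm (base e)
log_base_10 <- log(x, base = 10)           # the base argument changes the behaviour
rounded     <- round(natural_log, digits = 2)

print(natural_log)
print(log_base_10)
print(rounded)
```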

Arithmetic operations on functions

y <- 67
print(y)

## [1] 67

x <- 124
z <- (x * y)^2
print(z)

## [1] 69022864

STRINGS

Operations can be performed on character variables as well
Note that “characters” need to be set off by quotation marks to differentiate them from numbers
The c stands for concatenate
Note that we are using the same variable names as we did previously, which means that we’re overwriting our previous assignment
A good rule of thumb is to use new names for each variable, and make them short but still descriptive

STRINGS

x <- "I Love"
print(x)

## [1] "I Love"

y <- "Biostatistics"
print(y)

## [1] "Biostatistics"

z <- c(x, y)
print(z)

## [1] "I Love"        "Biostatistics"
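Note that `c()` keeps the two strings as separate elements of a vector. If you want a single string instead, base R's `paste()` is one option. A small sketch with hypothetical variable names:

```r
first  <- "I Love"
second <- "Biostatistics"

as_vector <- c(first, second)       # two separate elements
as_string <- paste(first, second)   # one element, joined with a space

print(length(as_vector))
print(length(as_string))
print(as_string)
```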

VECTORS

In general R thinks in terms of vectors
- an ordered set of character, factor or numerical values (e.g., “I Love”)
- it will benefit any R user to try to write scripts with that in mind
- it will simplify most things
Vectors can be assigned directly using the ‘c()’ function and then entering the exact values.

VECTORS

n <- c(2, 3, 4, 2, 1, 2, 4, 5, 10, 8, 9)
print(n)

##  [1]  2  3  4  2  1  2  4  5 10  8  9

z <- n + 3
print(z)

##  [1]  5  6  7  5  4  5  7  8 13 11 12
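The same idea applies to any operation: R works on every element of a vector at once. A sketch with a shortened, made-up vector, including what happens when the two vectors differ in length (the shorter one is recycled):

```r
n <- c(2, 3, 4, 2, 1, 2)

doubled   <- n * 2          # every element is multiplied by 2
recycled  <- n + c(0, 100)  # c(0, 100) is reused three times to match length 6
above_two <- n > 2          # comparisons are element-wise too

print(doubled)
print(recycled)
print(sum(above_two))       # number of elements greater than 2
```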

FACTORS

The vector x currently holds character values (“I Love”), so R treats it as a character vector.
Sometimes we would like to treat the characters as if they were units for subsequent calculations.
These are called factors, and we can redefine our character variables as factors.
This might seem a bit strange, but it’s important for statistical analyses where we might want to see the mean or variance for two different treatments.

FACTORS

x_factor <- as.factor(x)
print(x_factor)

## [1] I Love
## Levels: I Love

Note that factor levels are reported alphabetically
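A factor with a single level is not very informative, so here is a sketch with a hypothetical treatment variable that has several levels. It shows the alphabetical default and one way to set the order yourself.

```r
treatment <- c("wet", "dry", "wet", "mixed", "dry", "wet")
treatment_factor <- factor(treatment)

print(levels(treatment_factor))       # alphabetical by default
print(table(treatment_factor))        # number of observations per level
print(as.integer(treatment_factor))   # the integer codes stored underneath

# Levels can also be given in an explicit order
reordered <- factor(treatment, levels = c("wet", "mixed", "dry"))
print(levels(reordered))
```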

FACTORS

We can also determine how R “sees” a variable using str() or class() functions.
This is a useful check when importing datasets or verifying that you assigned a class correctly

str(x)

##  chr "I Love"

class(x)

## [1] "character"

Types or ‘classes’ of vectors of data

Types of vectors of data

int stands for integers
dbl stands for doubles, or real numbers
chr stands for character vectors, or strings
dttm stands for date-times (a date + a time)
lgl stands for logical, vectors that contain only TRUE or FALSE
fctr stands for factors, which R uses to represent categorical variables with fixed possible values
date stands for dates

Types of vectors of data

Logical vectors can take only three possible values:
- FALSE
- TRUE
- NA which is ‘not available’.
Integer and double vectors are known collectively as numeric vectors.
- In R numbers are doubles by default.
Integers have one special value: NA, while doubles have four:
- NA
- NaN which is ‘not a number’
- Inf
- -Inf
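These special values are easy to produce and to test for. A minimal sketch using base R only:

```r
special <- c(1 / 0, -1 / 0, 0 / 0, NA)   # Inf, -Inf, NaN, NA
print(special)

print(is.na(special))       # TRUE for both NaN and NA
print(is.nan(special))      # TRUE only for NaN
print(is.finite(special))   # FALSE for all four

print(typeof(1))    # numbers are doubles by default
print(typeof(1L))   # the L suffix makes an integer
```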

Basic Statistics

Many functions exist to operate on vectors.

mean(n)
median(n)
var(n)
log(n)
exp(n)
sqrt(n)
sum(n)
length(n)
sample(n, replace = T)  #has an additional argument (replace=T)

Arguments modify or direct the function in some way
- There are many arguments for each function, some of which are defaults
- Tab complete is helpful to view argument options
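A small sketch of how an argument changes what a function does, using the vector `n` from the earlier slide. `formals()` is a base R function that lists a function's arguments, which is handy when tab completion is not available.

```r
n <- c(2, 3, 4, 2, 1, 2, 4, 5, 10, 8, 9)

plain_mean   <- mean(n)               # all defaults
trimmed_mean <- mean(n, trim = 0.1)   # drops the most extreme 10% at each end first

print(plain_mean)
print(trimmed_mean)

# List the arguments of sample(); replace defaults to FALSE
print(names(formals(sample)))
```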

Getting Help

Getting Help on any function is very easy - just type a question mark and the name of the function.
There are functions for just about anything within R and it is easy enough to write your own functions if none already exist to do what you want to do.
In general, function calls have a simple structure: a function name, a set of parentheses and an optional set of parameters/arguments to send to the function.
Help pages exist for all functions that, at a minimum, explain what parameters exist for the function.

Getting Help

help(mean)
`?`(mean)
example(mean)
help.search("mean")
apropos("mean")
args(mean)

Creating vectors

Creating a vector of new data by entering it by hand can be a drag
However, it is also very easy to use functions such as
- seq
- sample

Creating vectors

What do the arguments mean?

seq_1 <- seq(0, 10, by = 0.1)
print(seq_1)

##   [1]  0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0  1.1  1.2  1.3
##  [15]  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3  2.4  2.5  2.6  2.7
##  [29]  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7  3.8  3.9  4.0  4.1
##  [43]  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1  5.2  5.3  5.4  5.5
##  [57]  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5  6.6  6.7  6.8  6.9
##  [71]  7.0  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9  8.0  8.1  8.2  8.3
##  [85]  8.4  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3  9.4  9.5  9.6  9.7
##  [99]  9.8  9.9 10.0

Creating vectors

seq_2 <- seq(10, 0, by = -0.1)
print(seq_2)

##   [1] 10.0  9.9  9.8  9.7  9.6  9.5  9.4  9.3  9.2  9.1  9.0  8.9  8.8  8.7
##  [15]  8.6  8.5  8.4  8.3  8.2  8.1  8.0  7.9  7.8  7.7  7.6  7.5  7.4  7.3
##  [29]  7.2  7.1  7.0  6.9  6.8  6.7  6.6  6.5  6.4  6.3  6.2  6.1  6.0  5.9
##  [43]  5.8  5.7  5.6  5.5  5.4  5.3  5.2  5.1  5.0  4.9  4.8  4.7  4.6  4.5
##  [57]  4.4  4.3  4.2  4.1  4.0  3.9  3.8  3.7  3.6  3.5  3.4  3.3  3.2  3.1
##  [71]  3.0  2.9  2.8  2.7  2.6  2.5  2.4  2.3  2.2  2.1  2.0  1.9  1.8  1.7
##  [85]  1.6  1.5  1.4  1.3  1.2  1.1  1.0  0.9  0.8  0.7  0.6  0.5  0.4  0.3
##  [99]  0.2  0.1  0.0

Creating vectors

seq_square <- (seq_2) * (seq_2)
print(seq_square)

##   [1] 100.00  98.01  96.04  94.09  92.16  90.25  88.36  86.49  84.64  82.81
##  [11]  81.00  79.21  77.44  75.69  73.96  72.25  70.56  68.89  67.24  65.61
##  [21]  64.00  62.41  60.84  59.29  57.76  56.25  54.76  53.29  51.84  50.41
##  [31]  49.00  47.61  46.24  44.89  43.56  42.25  40.96  39.69  38.44  37.21
##  [41]  36.00  34.81  33.64  32.49  31.36  30.25  29.16  28.09  27.04  26.01
##  [51]  25.00  24.01  23.04  22.09  21.16  20.25  19.36  18.49  17.64  16.81
##  [61]  16.00  15.21  14.44  13.69  12.96  12.25  11.56  10.89  10.24   9.61
##  [71]   9.00   8.41   7.84   7.29   6.76   6.25   5.76   5.29   4.84   4.41
##  [81]   4.00   3.61   3.24   2.89   2.56   2.25   1.96   1.69   1.44   1.21
##  [91]   1.00   0.81   0.64   0.49   0.36   0.25   0.16   0.09   0.04   0.01
## [101]   0.00

Creating vectors

seq_square_new <- (seq_2)^2
print(seq_square_new)

##   [1] 100.00  98.01  96.04  94.09  92.16  90.25  88.36  86.49  84.64  82.81
##  [11]  81.00  79.21  77.44  75.69  73.96  72.25  70.56  68.89  67.24  65.61
##  [21]  64.00  62.41  60.84  59.29  57.76  56.25  54.76  53.29  51.84  50.41
##  [31]  49.00  47.61  46.24  44.89  43.56  42.25  40.96  39.69  38.44  37.21
##  [41]  36.00  34.81  33.64  32.49  31.36  30.25  29.16  28.09  27.04  26.01
##  [51]  25.00  24.01  23.04  22.09  21.16  20.25  19.36  18.49  17.64  16.81
##  [61]  16.00  15.21  14.44  13.69  12.96  12.25  11.56  10.89  10.24   9.61
##  [71]   9.00   8.41   7.84   7.29   6.76   6.25   5.76   5.29   4.84   4.41
##  [81]   4.00   3.61   3.24   2.89   2.56   2.25   1.96   1.69   1.44   1.21
##  [91]   1.00   0.81   0.64   0.49   0.36   0.25   0.16   0.09   0.04   0.01
## [101]   0.00
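Rather than comparing the two printouts by eye, you can ask R whether the two approaches agree. A minimal sketch using `all.equal()`, which compares numbers up to a small tolerance:

```r
seq_2 <- seq(10, 0, by = -0.1)

square_by_product <- seq_2 * seq_2
square_by_power   <- seq_2^2

print(all.equal(square_by_product, square_by_power))
```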

R Interlude

Complete Exercises 1.3-1.6

Drawing samples from distributions

Here is a way to create your own data sets that are random samples…

x <- rnorm(n = 10000, mean = 0, sd = 10)
y <- sample(1:10000, 10000, replace = T)
xy <- cbind(x, y)
plot(xy)

Drawing samples from distributions

x <- rnorm(10000, 0, 10)
y <- sample(1:10000, 10000, replace = T)
xy <- cbind(x, y)
hist(x)

Drawing samples from distributions

You’ve probably figured out that y from the last example is drawing numbers with equal probability.
What if you want to draw from a distribution?
Again, play around with the arguments in the parentheses to see what happens.

x <- rnorm (10000, 0, 10)
y <- sample (???, 10000, replace = ???)

Drawing samples from distributions

dnorm() generates the probability density, which can be plotted using the curve() function.
Note that the curve is added to the plot using add=TRUE

x <- rnorm(1000, 0, 100)
hist(x, xlim = c(-500, 500))
curve(50000 * dnorm(x, 0, 100), xlim = c(-500, 500), add = TRUE, 
    col = "Red")
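The factor of 50000 above rescales the density to the count axis of the histogram; it presumably corresponds to the sample size times the bin width, but that is an assumption. An alternative sketch that avoids the scaling is to plot the histogram on the density scale with `freq = FALSE`:

```r
set.seed(1)
x <- rnorm(1000, mean = 0, sd = 100)

# freq = FALSE plots densities, so the bar areas sum to 1
h <- hist(x, freq = FALSE, xlim = c(-500, 500))
curve(dnorm(x, mean = 0, sd = 100), add = TRUE, col = "red")

# Check: total area of the bars
print(sum(h$density * diff(h$breaks)))
```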

Visualizing Data in `R`

Visualizing Data

So far you’ve been visualizing just the list of output numbers
Except for the last example where I snuck in a hist function.
You can also visualize all of the variables that you’ve created using the plot function (as well as a number of more sophisticated plotting functions).
Each of these is called a high level plotting function, which sets the stage
Low level plotting functions will tweak the plots and make them beautiful

Visualizing Data

seq_1 <- seq(0, 10, by = 0.1)
plot(seq_1, xlab = "space", ylab = "function of space", type = "p", 
    col = "red")

Putting plots in a single figure

The first line of the lower script tells R that you are going to create a composite figure that has two rows and two columns (on next slide)
- Can you tell how?

seq_1 <- seq(0, 10, by = 0.1)
seq_2 <- seq(10, 0, by = -0.1)

par(mfrow = c(2, 2))
plot(seq_1, xlab = "time", ylab = "p in population 1", type = "p", 
    col = "red")
plot(seq_2, xlab = "time", ylab = "p in population 2", type = "p", 
    col = "green")
plot(seq_square, xlab = "time", ylab = "p2 in population 2", 
    type = "p", col = "blue")
plot(seq_square_new, xlab = "time", ylab = "p in population 1", 
    type = "l", col = "yellow")
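The layout set by `par(mfrow = ...)` stays in force for later plots. A common pattern, sketched here with placeholder data, is to save the old settings and restore them when you are done:

```r
old_par <- par(mfrow = c(1, 2))   # par() returns the previous settings

plot(1:10, type = "p", col = "red")    # left panel
plot(10:1, type = "l", col = "blue")   # right panel

par(old_par)                      # back to one plot per figure
print(par("mfrow"))
```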

Putting plots in a single figure

R Interlude

Complete Exercises 1.7-1.8

Working with Imported Datasets in `R`

Creating Data Frames in `R`

As you have seen, in R you can generate your own random data set drawn from nearly any distribution very easily.
Often we will want to use collected data.
Now, let’s make a dummy dataset to get used to dealing with data frames
- Set up three variables (habitat, temp and elevation) as vectors

habitat <- factor(c("mixed", "wet", "wet", "wet", "dry", "dry", 
    "dry", "mixed"))
temp <- c(3.4, 3.4, 8.4, 3, 5.6, 8.1, 8.3, 4.5)
elevation <- c(0, 9.2, 3.8, 5, 5.6, 4.1, 7.1, 5.3)

Creating Data Frames in R

Create a data frame where vectors become columns

mydata <- data.frame(habitat, temp, elevation)
row.names(mydata) <- c("Reedy Lake", "Pearcadale", "Warneet", 
    "Cranbourne", "Lysterfield", "Red Hill", "Devilbend", "Olinda")
head(mydata)

##             habitat temp elevation
## Reedy Lake    mixed  3.4       0.0
## Pearcadale      wet  3.4       9.2
## Warneet         wet  8.4       3.8
## Cranbourne      wet  3.0       5.0
## Lysterfield     dry  5.6       5.6
## Red Hill        dry  8.1       4.1

Now you have a hand-made data frame with row names

R Interlude: Reading in Data Frames in R

A strength of R is being able to import data from an external source
- Create the same table that you did above in a spreadsheet using LibreOffice
- Export it to comma separated and tab separated text files for importing into R.
- On the next slide, the first two commands read in a comma-delimited file, whereas the third reads a tab-delimited file
- In all cases the header and row.names arguments indicate that there is a header row and row label column
- Note that the name of the file by itself will have R look in the current working directory, whereas a full path can also be used

Reading in Data Frames in R

YourFile <- read.table("yourfile.csv", header = T, row.names = 1, 
    sep = ",")
YourFile <- read.csv("yourfile.csv", header = T, row.names = 1, 
    sep = ",")
YourFile <- read.table("yourfile.txt", header = T, row.names = 1, 
    sep = "\t")

Exporting Data Frames in R

write.csv(YourFile, "yourfile.csv", quote = F, row.names = T)  # write.csv always uses a comma; setting sep is ignored with a warning
write.table(YourFile, "yourfile.txt", quote = F, row.names = T, 
    sep = "\t")

you will get more practice with this during the next R interlude
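A quick way to convince yourself that export and import work is a round trip through a temporary file. This sketch uses a cut-down, hypothetical version of the data frame and `tempfile()` so nothing in your working directory is touched.

```r
mydata <- data.frame(habitat = c("mixed", "wet", "dry"),
                     temp = c(3.4, 8.4, 5.6),
                     row.names = c("Reedy Lake", "Warneet", "Lysterfield"))

csv_path <- tempfile(fileext = ".csv")   # a throwaway file location
write.csv(mydata, csv_path, quote = FALSE, row.names = TRUE)

# row.names = 1 tells R that the first column holds the row labels
reread <- read.csv(csv_path, header = TRUE, row.names = 1)
print(reread)
```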

Where we left off…

Use :8787 to access RStudio again
Please rename your Exercises document before doing any git pulls today, so as not to overwrite your work!
- You can do this through the terminal window in RStudio using the mv command
Working with imported datasets and reading and writing datasets
Next up: indexing!
But first- a note about arguments…

Arguments in `R` Functions

Sometimes R can guess what you mean because of order…

x <- rnorm(1000, 0, 10)  #n=, mean=, sd=
x[1:10]

##  [1]  17.600298   1.043059  -2.164877   5.635400   4.439286  -9.417456
##  [7] -26.140246   3.868278  -7.574435  12.027035

But sometimes if the order isn’t right, you can confuse R and get something you really didn’t want…

x2 <- rnorm(10, 1000, 0)  #n=, mean=, sd=
x2

##  [1] 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000

Arguments in `R` Functions

A work-around and best-practice: include the arguments!!

set.seed(145)
x <- rnorm(n = 1000, mean = 0, sd = 10)  #n=, mean=, sd=
x[1:10]

##  [1]  6.869129 10.663631  5.367006 19.060287 10.631596 13.703436  5.277918
##  [8]  4.030967 11.677516  7.926794

set.seed(145)
x2 <- rnorm(sd = 10, n = 1000, mean = 0)  #n=, mean=, sd=
x2[1:10]

##  [1]  6.869129 10.663631  5.367006 19.060287 10.631596 13.703436  5.277918
##  [8]  4.030967 11.677516  7.926794

Notice we also set the seed to replicate our sample results!
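To see what the seed does, compare draws made with and without resetting it. A minimal sketch:

```r
set.seed(145)
first_draw <- rnorm(n = 5, mean = 0, sd = 10)

set.seed(145)                                   # reset to the same seed
second_draw <- rnorm(n = 5, mean = 0, sd = 10)
third_draw  <- rnorm(n = 5, mean = 0, sd = 10)  # no reset before this one

print(identical(first_draw, second_draw))   # same seed, same numbers
print(identical(first_draw, third_draw))    # the generator has moved on
```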

Indexing in data frames

Next up - indexing just a subset of the data
This is a very important feature in R, that allows you to analyze just a subset of the data.

print(YourFile[, 2])
print(YourFile$temp)
print(YourFile[2, ])
plot(YourFile$temp, YourFile$elevation)

Indexing in data frames

You can also assign whole columns, or single values, from a data set to a new variable

x <- (YourFile[, 2])
y <- (YourFile$temp)
z <- (YourFile$elevation)
plot(y, z)
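Indexing also works with conditions and with row names. The sketch below rebuilds the hand-made data frame from earlier (as `mydata`, rather than the imported `YourFile`) so that it runs on its own.

```r
mydata <- data.frame(
  habitat = factor(c("mixed", "wet", "wet", "wet", "dry", "dry", "dry", "mixed")),
  temp = c(3.4, 3.4, 8.4, 3, 5.6, 8.1, 8.3, 4.5),
  elevation = c(0, 9.2, 3.8, 5, 5.6, 4.1, 7.1, 5.3),
  row.names = c("Reedy Lake", "Pearcadale", "Warneet", "Cranbourne",
                "Lysterfield", "Red Hill", "Devilbend", "Olinda"))

warm_sites <- mydata[mydata$temp > 5, ]               # rows where temp exceeds 5
wet_temps  <- mydata$temp[mydata$habitat == "wet"]    # temps for wet sites only
one_value  <- mydata["Warneet", "elevation"]          # a single cell, by name

print(warm_sites)
print(wet_temps)
print(one_value)
```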

Indexing in data frames

You can perform operations on particular levels of a factor
Note that the first argument is the numerical column vector, and the second is the factor column vector.
The third is the operation. Reversing the first two does not work
- Tab complete will tell you the correct order for arguments

tapply(YourFile$temp, YourFile$habitat, mean)
tapply(YourFile$temp, YourFile$habitat, var)
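Here is a self-contained sketch of the same calculation on the hand-made data, alongside `aggregate()`, a base R alternative that returns a data frame instead of a named vector. The data frame name `mydata` is used in place of `YourFile`.

```r
mydata <- data.frame(
  habitat = factor(c("mixed", "wet", "wet", "wet", "dry", "dry", "dry", "mixed")),
  temp = c(3.4, 3.4, 8.4, 3, 5.6, 8.1, 8.3, 4.5))

# Numeric vector first, grouping factor second, function third
means <- tapply(mydata$temp, mydata$habitat, mean)
print(means)

# The same result using a formula: temp split by habitat
by_group <- aggregate(temp ~ habitat, data = mydata, FUN = mean)
print(by_group)
```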

R Interlude

Complete Exercises 1.9-1.10

Lecture 2 - Collaboration, Documentation and Reproducibility

Collaboration using Git and GitHub

Git and GitHub

https://learngitbranching.js.org/

Clone the repository

First make a new directory into which you will clone our course repository
- This will prevent you from overwriting any of the documents you have edited
- And it’s good practice to do it again
You should work through the terminal application and use Unix to do this
Open the terminal and navigate to your new directory and type the following:

git clone https://github.com/wcresko/evomics_stat_2019.git

Update the repository

Now to update the repository you just need to use these commands

git status
git fetch
git status
git merge origin/master

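The first command, git status, just tells you if anything has changed
git fetch then downloads any updates without touching your files, and git merge origin/master applies them
This is much safer than git pull, which fetches and merges in a single step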

Git and GitHub Interlude: Exercise 2.1

Please read the directions carefully to prevent pushing to the wrong repository

Clarity using Markdown and LaTeX

What is markdown?

Lightweight formal markup languages are used to add formatting to plaintext documents
- Adding basic syntax to the text will make elements look different once rendered/knit
- Available in many base editors (e.g., Atom text editor)
You then need a markdown application with a markdown processor/parser to render your text files into something more exciting
- Static and dynamic outputs!
- pdf, HTML, presentations, websites, scientific articles, books etc

What are knitr and Pandoc?

knitr is an R package that renders R Markdown files
Pandoc is a general tool for converting markdown files into other formats
https://pandoc.org
Can include math using LaTeX
GitHub will render markdown directly
Markdown can easily be rendered within most editors now
Within RStudio just use the knit button to render markdown
Markdown syntax is very easy

Formatting text

*Italic* or _Italic_
**Bold** or __Bold__

Italic or Italic
Bold or Bold

Formatting text

> "You know the greatest danger facing us is ourselves, an irrational fear of the unknown. 
But there’s no such thing as the unknown — only things temporarily hidden, temporarily not understood."
>
> --- Captain James T. Kirk

“You know the greatest danger facing us is ourselves, an irrational fear of the unknown. But there’s no such thing as the unknown — only things temporarily hidden, temporarily not understood.”

— Captain James T. Kirk

Formatting lists

- list_element
    - sub_list_element  #double tab to indent
    - sub_list_element  #double tab to indent
    - sub_list_element  #double tab to indent
- list_element
    - sub_list_element  #double tab to indent
# note the space after each dash - this is important!

list_element
- sub_list_element
- sub_list_element
- sub_list_element
list_element
- sub_list_element

Formatting lists

1. One
2. Two
3. Three
4. Four

One
Two
Three
Four

Inserting images or URLs

[Link](https://commonmark.org/help/)
![Image](https://i1.wp.com/evomics.org/wp-content/uploads/2012/07/20120115-IMG_0297.jpg)

Link

What is LaTeX?

Pronounced «Lah-tech» or «Lay-tech» (to rhyme with «Bertolt Brecht»)
A document preparation system for high-quality typesetting
It is most often used for medium-to-large technical or scientific documents
Can be used for almost any form of publishing.
Typesetting journal articles, technical reports, books, and slide presentations
Allows for precise mathematical statements
https://www.latex-project.org

What is LaTeX?

LaTeX is not a word processor!
LaTeX encourages authors not to worry too much about the appearance of their documents but to concentrate on getting the right content.
Control over large documents containing sectioning, cross-references, tables and figures.
Typesetting of complex mathematical formulas.
Automatic generation of bibliographies and indexes.
Multi-lingual typesetting
https://bookdown.org/yihui/bookdown/

What is LaTeX?

Importantly, LaTeX can be included right into RMarkdown documents
The following slides have some examples

Operators and Symbols

$$ \large a^x, \sqrt[n]{x}, \vec{\jmath}, \tilde{\imath}$$

\[ \large a^x, \sqrt[n]{x}, \vec{\jmath}, \tilde{\imath}\]

$$ \large \alpha, \beta, \gamma$$

\[ \large \alpha, \beta, \gamma\]

Operators and Symbols

$$ \large\approx, \neq, \nsim $$

\[ \large\approx, \neq, \nsim \]

$$\large \partial, \mathbb{R}, \flat$$

\[\large \partial, \mathbb{R}, \flat\]

Equations

Binomial sampling equation

$$\large f(k) = {n \choose k} p^{k} (1-p)^{n-k}$$

\[\large f(k) = {n \choose k} p^{k} (1-p)^{n-k}\]

Poisson Sampling Equation

$$\large Pr(Y=r) = \frac{e^{-\mu}\mu^r}{r!}$$

\[\large Pr(Y=r) = \frac{e^{-\mu}\mu^r}{r!}\]

Integrals

$$\iint xy^2\,dx\,dy =\frac{1}{6}x^2y^3$$

\[\iint xy^2\,dx\,dy =\frac{1}{6}x^2y^3\]

Matrix formulations

$$  \begin{matrix}
        -2 & 1 & 0 & 0 & \cdots & 0  \\
        1 & -2 & 1 & 0 & \cdots & 0  \\
        0 & 1 & -2 & 1 & \cdots & 0  \\
        0 & 0 & 1 & -2 & \ddots & \vdots \\
        \vdots & \vdots & \vdots & \ddots & \ddots & 1  \\
        0 & 0 & 0 & \cdots & 1 & -2
    \end{matrix} $$

\[ \begin{matrix} -2 & 1 & 0 & 0 & \cdots & 0 \\ 1 & -2 & 1 & 0 & \cdots & 0 \\ 0 & 1 & -2 & 1 & \cdots & 0 \\ 0 & 0 & 1 & -2 & \ddots & \vdots \\ \vdots & \vdots & \vdots & \ddots & \ddots & 1 \\ 0 & 0 & 0 & \cdots & 1 & -2 \end{matrix} \]

Including LaTeX and Code into Markdown Files

Explicit inclusion of code and mathematical equations helps with reproducibility
Need to designate the ‘environment’ as being code or math
Can be included in-line or in ‘chunks’

In-line versus fenced

This equation, $y=\frac{1}{2}$, is included inline

This equation, \(y=\frac{1}{2}\), is included inline

Whereas this equation, $$y=\frac{1}{2}$$, is put on a separate line

Whereas this equation \[y=\frac{1}{2}\] is put on a separate line

Markdown is very flexible

You can import RMarkdown templates into RStudio and open as a new Rmarkdown file
Better yet there are packages that add functionality
When you install the package it will show up in the ‘From Template’ section of the ‘new file’ startup screen
There are packages to make
- books
- journal articles
- slide shows
- interactive exercises
- many more
Some of these use ‘Shiny’
- an interactive web based application
- allows users to input and get output

Final Thoughts on Markdown, LaTeX and GitHub

Many forms/flavors of markdown
- RMarkdown is one flavor of markdown, and HTML is a related (heavier) markup language
- There is a GitHub flavored markdown
- Once you learn one, all the others are very easy
The goal is increased collaboration and reproducibility
- Allows you to easily work with others by sharing the markdown file
- Allows formal representation of code and math
- Allows others to run your code directly
- Allows you to produce reports for non-technical audiences
- All files are easily shared on GitHub
Once you start using Markdown you won’t stop…

R Interlude | Exploring RMarkdown

Exercise 2.2

Data wrangling and exploratory data analysis (EDA)

A biological example to get us started

Say you perform an experiment on two different strains of stickleback fish, one from an ocean population (RS) and one from a freshwater lake (BP) by making them microbe free. Microbes in the gut are known to interact with the gut epithelium in ways that lead to a proper maturation of the immune system.

A biological example to get us started

You carry out an experiment by treating multiple fish from each strain so that some of them have a conventional microbiota, and some are inoculated with only one bacterial species. You then measure the levels of gene expression in the stickleback gut using RNA-seq. You suspect that the sex of the fish might be important so you track it too.

A biological example to get us started

Collecting Data with Analyses in Mind

How should the data set be organized to best analyze it?
What are the key properties of the variables?
Why does that matter for learning R?
Why does that matter for performing statistical analyses?

Data set rules of thumb (aka Tidy Data)

Store a copy of data in non-proprietary formats
Leave an uncorrected file when doing analyses
Maintain effective metadata about the data
When you add observations to a database, add rows
When you add variables to a database, add columns
A column of data should contain only one data type

Tidyverse family of packages

Hadley Wickham and others have written R packages to modify data
These packages do many of the same things as base functions in R
However, they are specifically designed to do them faster and more easily
Wickham also wrote the package ggplot2 for creating elegant graphics
GG stands for ‘Grammar of Graphics’

Example of a tibble

Key functions in `dplyr` for vectors

Pick observations by their values with filter().
Reorder the rows with arrange().
Pick variables by their names with select().
Create new variables with functions of existing variables with mutate().
Collapse many values down to a single summary with summarise().

`filter()`, `arrange()` & `select()`

filter(flights, month == 11 | month == 12)

arrange(flights, year, month, day)

select(flights, year, month, day)

`mutate()` & `transmute()`

This function will add a new variable that is a function of other variable(s)

mutate(flights, gain = arr_delay - dep_delay, hours = air_time/60, 
    gain_per_hour = gain/hours)

This function keeps only the new variables and drops all of the existing ones

transmute(flights, gain = arr_delay - dep_delay, hours = air_time/60, 
    gain_per_hour = gain/hours)
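The difference is easiest to see on a tiny table. This sketch assumes the dplyr package is installed and uses a made-up two-row data frame (`trips`) in place of `flights`:

```r
library(dplyr)

# Hypothetical stand-in for the flights data
trips <- data.frame(dep_delay = c(5, 10),
                    arr_delay = c(15, 4),
                    air_time  = c(120, 60))

added    <- mutate(trips, gain = arr_delay - dep_delay, hours = air_time / 60)
only_new <- transmute(trips, gain = arr_delay - dep_delay, hours = air_time / 60)

print(names(added))      # original columns plus the new ones
print(names(only_new))   # just the new columns
```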

`group_by( )` & `summarize( )`

This first function allows you to aggregate data by values of categorical variables (factors)

by_day <- group_by(flights, year, month, day)

Once you have done this aggregation, you can then calculate values (in this case the mean) of other variables split by the new aggregated levels of the categorical variable

summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))

`group_by( )` & `summarize( )`

Note - you can get a lot of missing values!
That’s because aggregation functions obey the usual rule of missing values:
- if there’s any missing value in the input, the output will be a missing value.
- fortunately, all aggregation functions have an na.rm argument which removes the missing values prior to computation
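The missing-value rule can be seen with a single base R call. A minimal sketch with a made-up vector of delays:

```r
dep_delay <- c(4, NA, 10, -2)   # hypothetical delays with one missing value

with_na    <- mean(dep_delay)                 # one NA makes the whole result NA
without_na <- mean(dep_delay, na.rm = TRUE)   # NA removed before averaging
n_missing  <- sum(is.na(dep_delay))           # worth reporting alongside the mean

print(with_na)
print(without_na)
print(n_missing)
```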

R INTERLUDE | Complete Exercises 2.3-2.4

Graphical Communication

ggplot2 and the Grammar of Graphics

GG stands for ‘Grammar of Graphics’
A good paragraph uses good grammar to convey information
A good figure uses good grammar in the same way
Seven general components can be used to create most figures

Graphical representation | general approaches

Distributions of data
- location
- spread
- shape
Associations between variables
- relationship among two or more variables
- differences among groups in their distributions

Graphical representation | general approaches

Distributions of data
- bar graph
- histogram
- box plot
Associations between variables
- pie chart
- grouped bar graph
- mosaic plot
- box plot
- scatter plot
- dot plot ‘stripchart’

Box Plot

Displays median, first and third quartile, range, and extreme observations
Can be combined with mean and standard error of the mean
Concise way to visualize many aspects of distribution

Scatter Plot

Displays association between two numerical variables
Goal is association not magnitude or frequency
Points fill the space available

Examples of the good, bad and the ugly of graphical representation

Examples of bad graphs and how to improve them.
Courtesy of K.W. Broman
www.biostat.wisc.edu/~kbroman/topten_worstgraphs/

Ticker tape parade

A line to no understanding

A cup of hot nothing

A bake sale of pie charts

Whack-a-mole

Graphical communication best practices

“Graphical excellence is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space”

— Edward Tufte

Principles of effective display

Show the data
Encourage the eye to compare differences
Represent magnitudes honestly and accurately
Draw graphical elements clearly, minimizing clutter
Make displays easy to interpret

“Above all else show the data” | Tufte 1983

“Maximize the data to ink ratio, within reason” | Tufte 1983

Draw graphical elements clearly, minimizing clutter

“A graphic does not distort if the visual representation of the data is consistent with the numerical representation” – Tufte 1983

Represent magnitudes honestly and accurately

How Fox News makes a figure …

“Graphical excellence begins with telling the truth about the data” – Tufte 1983

Using ggplot2 to make nice figures

The `geom_bar` function

library(ggplot2)
ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut))

The `geom_bar` function

Now try this…

ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, colour = cut))

The `geom_bar` function

and this…

ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, fill = cut))

The `geom_bar` function

and finally this…

ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, fill = clarity), 
    position = "dodge")

The `geom_histogram` and `geom_freqpoly` functions

With this function you can make a histogram

ggplot(data = diamonds) + geom_histogram(mapping = aes(x = carat), 
    binwidth = 0.5)

The `geom_histogram` and `geom_freqpoly` functions

This allows you to make a frequency polygon

ggplot(data = diamonds) + geom_freqpoly(mapping = aes(x = carat), 
    binwidth = 0.5)

The `geom_boxplot` function

Boxplots are very useful for visualizing data

ggplot(data = diamonds, mapping = aes(x = cut, y = price)) + 
    geom_boxplot()

The `geom_boxplot` function

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + geom_boxplot() + 
    coord_flip()

The `geom_boxplot` function

ggplot(data = mpg, mapping = aes(x = reorder(class, hwy, FUN = median), 
    y = hwy)) + geom_boxplot() + coord_flip()

The `geom_point` & `geom_smooth` functions

ggplot(data = diamonds, mapping = aes(x = x, y = y)) + geom_point()

The `geom_point` & `geom_smooth` functions

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + 
    facet_wrap(~class, nrow = 2)

The `geom_point` & `geom_smooth` functions

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + 
    facet_grid(drv ~ cyl)

The `geom_point` & `geom_smooth` functions

ggplot(data = mpg) + geom_smooth(mapping = aes(x = displ, y = hwy), 
    method = "loess")

Combining geoms

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + 
    geom_smooth(mapping = aes(x = displ, y = hwy), method = "loess")
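When several geoms share the same mapping, it can be written once in `ggplot()` and inherited by every layer. A sketch of the same combined plot written that way:

```r
library(ggplot2)

# Mappings given to ggplot() are inherited by every layer,
# so aes(x = displ, y = hwy) only needs to be written once
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
    geom_point() +
    geom_smooth(method = "loess")
p
```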

Adding labels

ggplot(data = mpg, aes(displ, hwy)) + geom_point(aes(color = class)) + 
    geom_smooth(se = FALSE, method = "loess") + 
    labs(title = "Fuel efficiency generally decreases with engine size", 
        subtitle = "Two seaters (sports cars) are an exception because of their light weight", 
        caption = "Data from fueleconomy.gov")
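`labs` can also replace the axis and legend titles, which otherwise default to the variable names. A short sketch; the label text here is illustrative wording, not taken from the dataset documentation.

```r
library(ggplot2)

# x, y and colour set the axis titles and the legend title
p <- ggplot(data = mpg, aes(displ, hwy)) +
    geom_point(aes(colour = class)) +
    geom_smooth(se = FALSE, method = "loess") +
    labs(x = "Engine displacement (L)",
         y = "Highway fuel economy (mpg)",
         colour = "Car type")
p
```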

Themes
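Themes control the non-data parts of a plot, such as the background, grid lines and legend placement. A minimal sketch using two of the built-in themes; the particular choices below are illustrative.

```r
library(ggplot2)

p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
    geom_point(mapping = aes(colour = class))

# A complete built-in theme replaces the default grey background
p_bw <- p + theme_bw()

# theme() adjusts individual elements, here the legend position
p_custom <- p + theme_classic() + theme(legend.position = "bottom")

p_bw
p_custom
```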

Arranging Multiple Figures - Flexdashboard

Modify the YAML header to specify the layout of the graphs

---
title: "Flexdashboard Options"
output: 
  flexdashboard::flex_dashboard:
    vertical_layout: fill    # 'fill' or 'scroll'
    orientation: rows        # 'rows' or 'columns'
---

Or set the `data-width` attribute on a column or chart heading to resize it (try this in the template): https://rmarkdown.rstudio.com/flexdashboard/
