R for Bioinformatic Analyses

Hannah Tavalire and Bill Cresko - University of Oregon

January 2019 - Cesky Krumlov

Lecture 1 - Using R for Biostatistical Analyses

But first a beautiful chair

Why use R?

  • R is a statistical programming language (derived from S)
  • Superb data management & graphics capabilities
  • You can write your own functions
  • Powerful and flexible
  • Runs on all computer platforms
  • Well established system of packages and documentation
  • Active development and dedicated community
  • Can use a nice GUI front end such as RStudio
  • Reproducibility
    • keep your scripts to see exactly what was done
    • distribute these with your data
    • embed your R analyses in polished RMarkdown files
  • FREE

R resources

Running R

  • Need to make sure that you have R installed
  • Run R from the command line
    • just type R
    • can run it locally as well as on clusters
  • Install an R Integrated Development Environment (IDE)
    • RStudio: http://www.rstudio.com
    • Makes working with R much easier, particularly for a new R user
    • Runs on Windows, Mac or Linux OS
    • We’re running as a server on the AWS instance

RStudio

Exercise 1.1 - Exploring RStudio

  • Open RStudio by adding :8787 to your AWS URL
  • Take a few minutes to familiarize yourself with the RStudio environment by locating the following features:
    • See what types of new files can be made in RStudio by clicking the top-left icon, then open a new R script.
    • The windows clockwise from top left are: the code editor, the workspace and history, the plots and files window, and the R console.
    • In the plots and files window, click on the packages and help tabs to see what they offer.
  • Now open the file called Exercises_for_R_Lectures.Rmd
    • This file will serve as your digital notebook for parts of the workshop and contains the other exercises.

Introduction to RMarkdown

RMarkdown

Exercise 1.2 - Intro to RMarkdown Files

  • Take a few minutes to familiarize yourself with RMarkdown files by completing exercise 1.2 in your exercises document.
  • You will be doing all of your work in RMarkdown

BASICS of R

  • Commands can be submitted through
    • terminal, console or scripts
    • can be embedded as code chunks in RMarkdown
  • On these slides, code chunks are evaluated and their output is shown
    • output appears after the two # symbols
    • the number in [] is the index of the first output item on that line
  • R follows the normal order of mathematical operations (PEMDAS)
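
As a minimal sketch of the evaluation order, the expressions below are illustrative assumptions rather than the chunks from the original slides. They show exponents being evaluated before multiplication, multiplication before addition, and parentheses overriding both.

```r
# Illustrative expressions (assumed for this sketch, not taken from the slides)
plain    <- 4 * 4          # a single multiplication
by_rules <- 4 + 3 * 2^2    # 2^2 first, then 3 * 4, then 4 + 12
grouped  <- (4 + 3) * 2^2  # parentheses first: 7 * 4

print(plain)     # [1] 16
print(by_rules)  # [1] 16
print(grouped)   # [1] 28
```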

BASICS of R

Input code chunk and then output

## [1] 16

Assigning Variables

  • A better way to do this is to assign variables
  • Variables are assigned values using the <- operator.
  • Variable names must begin with a letter (or a dot that is not followed by a digit); after that, letters, digits, dots and underscores are allowed.
  • Do keep in mind that R is case sensitive.
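
A short sketch of assignment with the <- operator follows. The values are assumptions, chosen so that the two printed results match the output shown on the next slide.

```r
x <- 2        # assign the value 2 to x (assumed value)
x * 3         # use x in a calculation; prints [1] 6

y <- x * 3    # store the result in a second variable
y - 2         # prints [1] 4

X <- 10       # R is case sensitive: X is a different variable from x
```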

Assigning Variables

## [1] 6
## [1] 4

These do not work
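
The rejected assignments from the original slide are not preserved here. As a hedged illustration, the hypothetical attempts below use parse() to check whether R accepts each line as valid syntax, without actually running it.

```r
# Hypothetical assignment attempts; only the last uses a legal variable name
attempts <- c("3y <- 3", "my var <- 3", "y3 <- 3")

# parse() checks the syntax of each line without evaluating it
is_valid <- vapply(
  attempts,
  function(code) {
    tryCatch({
      parse(text = code)
      TRUE
    }, error = function(e) FALSE)
  },
  logical(1)
)

print(is_valid)  # FALSE for the first two attempts, TRUE for the last
```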

Arithmetic operations on functions

  • Arithmetic operations can be performed on variables as easily as on numbers, and functions can be applied to them as well.
## [1] 14
## [1] 144
## [1] 2.484907
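
One set of commands that would produce the three results above is sketched here. The starting value of x is an assumption inferred from that output.

```r
x <- 12   # assumed value; it reproduces the three results above

x + 2     # addition on a variable: [1] 14
x^2       # exponentiation: [1] 144
log(x)    # natural logarithm, a built-in function: [1] 2.484907
```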

Arithmetic operations on functions

  • Note that the last of these - log - is a built-in function of R, and therefore the argument of the function needs to be put in parentheses
  • These parentheses will be important, and we’ll come back to them later when we add arguments after the object in the parentheses
  • The outcome of calculations can be assigned to new variables as well, and the results can be checked using the print command
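
A brief sketch of storing results in new variables and checking them with print() is given below. The input values are assumptions that happen to reproduce the output on the next slide, and the last line shows one extra argument (base) placed after the first argument inside the parentheses.

```r
y <- 67             # assumed value
print(y)            # [1] 67

x <- 124            # assumed value
z <- (x * y)^2      # store the outcome of a calculation
print(z)            # [1] 69022864

log_z <- log(z, base = 10)  # the base argument switches to a base-10 logarithm
print(log_z)
```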

Arithmetic operations on functions

## [1] 67
## [1] 69022864

STRINGS

  • Operations can be performed on character variables as well
  • Note that “characters” need to be set off by quotation marks to differentiate them from numbers
  • The c stands for concatenate
  • Note that we are using the same variable names as we did previously, which means that we’re overwriting our previous assignment
  • A good rule of thumb is to use new names for each variable, and make them short but still descriptive
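
The points above can be sketched as follows. The variable names are hypothetical and follow the rule of thumb of using new, descriptive names instead of overwriting x and y.

```r
first_part  <- "I Love"          # character values are wrapped in quotation marks
second_part <- "Biostatistics"

# c() concatenates the two strings into one character vector of length 2
phrase <- c(first_part, second_part)

print(first_part)
print(second_part)
print(phrase)
length(phrase)
```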

STRINGS

## [1] "I Love"
## [1] "Biostatistics"
## [1] "I Love"        "Biostatistics"

VECTORS

  • In general R thinks in terms of vectors
    • an ordered set of character, factor or numerical values (“I Love”)
    • it will benefit any R user to try to write scripts with that in mind
    • it will simplify most things
  • Vectors can be assigned directly using the ‘c()’ function and then entering the exact values.
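
As a sketch of vectorized thinking, the values below are copied from the output on the next slide, while the variable name n is an assumption. Adding 3 to the vector adds 3 to every element at once, with no loop required.

```r
# Values taken from the slide output; the name n is assumed
n <- c(2, 3, 4, 2, 1, 2, 4, 5, 10, 8, 9)
print(n)

# Arithmetic is applied element by element
n_plus_3 <- n + 3
print(n_plus_3)
```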

VECTORS

##  [1]  2  3  4  2  1  2  4  5 10  8  9
##  [1]  5  6  7  5  4  5  7  8 13 11 12

FACTORS

  • The variable x is now what is called a character vector (“I Love”).
  • Sometimes we would like to treat the character values as categories, or grouping units, for subsequent calculations.
  • These are called factors, and we can redefine our character variables as factors.
  • This might seem a bit strange, but it’s important for statistical analyses where we might want to see the mean or variance for two different treatments.

FACTORS

## [1] I Love
## Levels: I Love
  • Note that factor levels are reported alphabetically
##  chr "I Love"
## [1] "character"
  • We can also determine how R “sees” a variable using str() or class() functions.
  • This is a useful check when importing datasets or verifying that you assigned a class correctly
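
A minimal sketch of converting characters to factors and checking how R sees each object is shown here. The treatment vector is a hypothetical example added to show that levels are sorted alphabetically.

```r
x <- "I Love"
x_factor <- as.factor(x)   # redefine the character value as a factor
print(x_factor)

str(x)                     # compact structure: chr "I Love"
class(x)                   # "character"
class(x_factor)            # "factor"

# Hypothetical treatment labels; levels are reported alphabetically
treatment <- factor(c("wet", "dry", "wet", "mixed"))
levels(treatment)          # "dry" "mixed" "wet"
```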

Types or ‘classes’ of vectors of data

Types of vectors of data

  • int stands for integers

  • dbl stands for doubles, or real numbers

  • chr stands for character vectors, or strings

  • dttm stands for date-times (a date + a time)

  • lgl stands for logical, vectors that contain only TRUE or FALSE

  • fctr stands for factors, which R uses to represent categorical variables with fixed possible values

  • date stands for dates

Types of vectors of data

  • Logical vectors can take only three possible values:
    • FALSE
    • TRUE
    • NA which is ‘not available’.
  • Integer and double vectors are known collectively as numeric vectors.
    • In R numbers are doubles by default.
  • Integers have one special value: NA, while doubles have four:
    • NA
    • NaN which is ‘not a number’
    • Inf
    • -Inf
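
These types and special values can be checked directly in the console. The sketch below uses only base R functions; the variable names are illustrative.

```r
type_default <- typeof(1)    # "double": numbers are doubles by default
type_integer <- typeof(1L)   # "integer": the L suffix requests an integer
type_logical <- typeof(TRUE) # "logical"

not_a_number <- 0 / 0        # NaN
pos_infinite <- 1 / 0        # Inf
neg_infinite <- -1 / 0       # -Inf
missing_int  <- NA_integer_  # the one special integer value

is.na(not_a_number)          # TRUE: NaN also counts as missing
is.nan(NA)                   # FALSE: NA is not NaN
is.finite(pos_infinite)      # FALSE
```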

Basic Statistics

Many functions exist to operate on vectors.

  • Arguments modify or direct the function in some way
    • There are many arguments for each function, some of which are defaults
    • Tab complete is helpful to view argument options
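
As a hedged example of functions operating on a vector, the sketch below reuses the numeric values from the VECTORS slide (the name n is an assumption). The na.rm argument shows how an argument changes a function's default behavior.

```r
n <- c(2, 3, 4, 2, 1, 2, 4, 5, 10, 8, 9)

mean(n)      # arithmetic mean
median(n)    # middle value: 4
var(n)       # sample variance
sd(n)        # sample standard deviation

# Arguments direct the function: by default a missing value makes the mean NA
with_missing <- c(n, NA)
mean(with_missing)                 # NA
mean(with_missing, na.rm = TRUE)   # NA is dropped before averaging
```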

Getting Help

  • Getting Help on any function is very easy - just type a question mark and the name of the function.
  • There are functions for just about anything within R and it is easy enough to write your own functions if none already exist to do what you want to do.
  • In general, function calls have a simple structure: a function name, a set of parentheses and an optional set of parameters/arguments to send to the function.
  • Help pages exist for all functions that, at a minimum, explain what parameters exist for the function.
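
A small sketch of looking up a function's arguments follows. The help calls are shown as comments because they open a help page in an interactive session; args() and formals() work anywhere.

```r
# In an interactive session, either of these opens the help page for mean():
#   ?mean
#   help("mean")

# args() prints the function's arguments and their defaults
args(sample)

# formals() returns the same information as a list that can be inspected
sample_args <- names(formals(sample))
print(sample_args)  # "x" "size" "replace" "prob"
```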

Creating vectors

  • Creating a vector of new data by entering it by hand can be a drag
  • However, it is also very easy to use functions such as
    • seq
    • sample
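
A sketch of seq() is given below. The arguments are inferred from the output on the following slides, and the variable names are assumptions: the first two arguments are the start and end values, and by is the step size.

```r
seq_1 <- seq(0.0, 10.0, by = 0.1)    # 0 to 10 in steps of 0.1
seq_2 <- seq(10.0, 0.0, by = -0.1)   # a negative step counts down from 10 to 0
seq_square <- seq_2^2                # squaring is applied to every element

length(seq_1)      # 101 values
head(seq_1)
head(seq_2)
head(seq_square)
```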

Creating vectors

  • What do the arguments mean?
##   [1]  0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0  1.1  1.2  1.3
##  [15]  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3  2.4  2.5  2.6  2.7
##  [29]  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7  3.8  3.9  4.0  4.1
##  [43]  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1  5.2  5.3  5.4  5.5
##  [57]  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5  6.6  6.7  6.8  6.9
##  [71]  7.0  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9  8.0  8.1  8.2  8.3
##  [85]  8.4  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3  9.4  9.5  9.6  9.7
##  [99]  9.8  9.9 10.0

Creating vectors

##   [1] 10.0  9.9  9.8  9.7  9.6  9.5  9.4  9.3  9.2  9.1  9.0  8.9  8.8  8.7
##  [15]  8.6  8.5  8.4  8.3  8.2  8.1  8.0  7.9  7.8  7.7  7.6  7.5  7.4  7.3
##  [29]  7.2  7.1  7.0  6.9  6.8  6.7  6.6  6.5  6.4  6.3  6.2  6.1  6.0  5.9
##  [43]  5.8  5.7  5.6  5.5  5.4  5.3  5.2  5.1  5.0  4.9  4.8  4.7  4.6  4.5
##  [57]  4.4  4.3  4.2  4.1  4.0  3.9  3.8  3.7  3.6  3.5  3.4  3.3  3.2  3.1
##  [71]  3.0  2.9  2.8  2.7  2.6  2.5  2.4  2.3  2.2  2.1  2.0  1.9  1.8  1.7
##  [85]  1.6  1.5  1.4  1.3  1.2  1.1  1.0  0.9  0.8  0.7  0.6  0.5  0.4  0.3
##  [99]  0.2  0.1  0.0

Creating vectors

##   [1] 100.00  98.01  96.04  94.09  92.16  90.25  88.36  86.49  84.64  82.81
##  [11]  81.00  79.21  77.44  75.69  73.96  72.25  70.56  68.89  67.24  65.61
##  [21]  64.00  62.41  60.84  59.29  57.76  56.25  54.76  53.29  51.84  50.41
##  [31]  49.00  47.61  46.24  44.89  43.56  42.25  40.96  39.69  38.44  37.21
##  [41]  36.00  34.81  33.64  32.49  31.36  30.25  29.16  28.09  27.04  26.01
##  [51]  25.00  24.01  23.04  22.09  21.16  20.25  19.36  18.49  17.64  16.81
##  [61]  16.00  15.21  14.44  13.69  12.96  12.25  11.56  10.89  10.24   9.61
##  [71]   9.00   8.41   7.84   7.29   6.76   6.25   5.76   5.29   4.84   4.41
##  [81]   4.00   3.61   3.24   2.89   2.56   2.25   1.96   1.69   1.44   1.21
##  [91]   1.00   0.81   0.64   0.49   0.36   0.25   0.16   0.09   0.04   0.01
## [101]   0.00

Drawing samples from distributions

  • Here is a way to create your own data sets that are random samples…
  • Again, play around with the arguments in the parentheses to see what happens.
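
One way to draw a random sample is sketched here with sample(). The range, sample size and seed are assumptions picked for illustration; set.seed() only makes the draw repeatable.

```r
set.seed(42)  # assumed seed, so the same numbers are drawn on every run

# Draw 20 values from 1 to 10000, each equally likely, with replacement
y <- sample(1:10000, 20, replace = TRUE)
print(y)

# The same function also works on character values, for example nucleotides
bases <- sample(c("A", "C", "G", "T"), 12, replace = TRUE)
print(bases)
```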

Drawing samples from distributions

  • You’ve probably figured out that y from the last example is drawing numbers with equal probability.
  • What if you want to draw from a distribution?
  • Again, play around with the arguments in the parentheses to see what happens.
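
Drawing from a named distribution can be sketched with the r* family of functions. The parameter values and seed below are assumptions for illustration.

```r
set.seed(1)  # assumed seed for repeatability

# 1000 draws from a normal distribution with mean 0 and standard deviation 100
x <- rnorm(1000, mean = 0, sd = 100)
mean(x)   # close to 0
sd(x)     # close to 100

# 10 draws from a binomial distribution: successes out of 20 trials
counts <- rbinom(10, size = 20, prob = 0.5)
print(counts)
```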

Drawing samples from distributions

  • dnorm() generates the probability density, which can be plotted using the curve() function.
  • Note that the curve is added to the plot using add=TRUE
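
A hedged sketch of overlaying the density on a histogram is shown below. The sample, its parameters and the plot labels are assumptions; freq = FALSE puts the histogram on the density scale so that the curve from dnorm() lines up with the bars.

```r
set.seed(1)  # assumed seed for repeatability
draws <- rnorm(1000, mean = 0, sd = 100)

# Histogram on the density scale
hist(draws, freq = FALSE, xlim = c(-500, 500),
     main = "Normal sample with density curve", xlab = "value")

# curve() evaluates the expression over x; add = TRUE draws on the existing plot
curve(dnorm(x, mean = 0, sd = 100), add = TRUE, col = "red")

# The density at the mean is the highest point of the curve
peak <- dnorm(0, mean = 0, sd = 100)
print(peak)
```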

Visualizing Data in R

Visualizing Data

  • So far you’ve been visualizing just the list of output numbers
  • Except for the last example where I snuck in a hist function.
  • You can also visualize all of the variables that you’ve created using the plot function (as well as a number of more sophisticated plotting functions).
  • Each of these is called a high level plotting function, which sets the stage
  • Low level plotting functions will tweak the plots and make them beautiful
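
The split between high level and low level plotting functions can be sketched as follows. The simulated data, colors and labels are assumptions for illustration.

```r
set.seed(2)  # assumed seed for repeatability
x <- rnorm(100)
y <- 2 * x + rnorm(100)

# High level function: creates the plot and sets the stage
plot(x, y, main = "Simulated data", xlab = "x", ylab = "y")

# Low level functions: add to the existing plot
fit <- lm(y ~ x)
abline(fit, col = "blue")                         # fitted regression line
points(mean(x), mean(y), pch = 19, col = "red")   # mark the centroid
legend("topleft", legend = c("linear fit", "mean"),
       col = c("blue", "red"), lty = c(1, NA), pch = c(NA, 19))
```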

Visualizing Data

Putting plots in a single figure

  • The first line of the lower script tells R that you are going to create a composite figure that has two rows and two columns (on next slide)
    • Can you tell how?
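
The original script is not preserved here, but a composite figure of this kind is commonly set up with par(mfrow = ...). The sketch below assumes that approach; the four panels are arbitrary examples.

```r
# mfrow = c(2, 2) requests two rows and two columns, filled row by row
old_par <- par(mfrow = c(2, 2))
panel_layout <- par("mfrow")

set.seed(3)  # assumed seed for repeatability
x <- rnorm(200)

hist(x, main = "Histogram")
plot(x, main = "Index plot")
boxplot(x, main = "Boxplot")
qqnorm(x, main = "Normal Q-Q plot")

par(old_par)  # restore the previous single-panel layout
```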

Putting plots in a single figure

R Interlude

Complete Exercises 1.3-1.8

Working with Imported Datasets in R

Creating Data Frames in R

  • As you have seen, in R you can generate your own random data set drawn from nearly any distribution very easily.
  • Often we will want to use collected data.
  • Now, let’s make a dummy dataset to get used to dealing with data frames
    • Set up three variables (habitat, temp and elevation) as vectors

Creating Data Frames in R

  • Create a data frame where vectors become columns
##             habitat temp elevation
## Reedy Lake    mixed  3.4       0.0
## Pearcadale      wet  3.4       9.2
## Warneet         wet  8.4       3.8
## Cranbourne      wet  3.0       5.0
## Lysterfield     dry  5.6       5.6
## Red Hill        dry  8.1       4.1
  • Now you have a hand-made data frame with row names
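
The data frame above can be rebuilt with a sketch like the following. The values and row names are copied from the printed table, while the object names site_names and field_data are assumptions.

```r
# Three variables as vectors (values taken from the table above)
habitat   <- factor(c("mixed", "wet", "wet", "wet", "dry", "dry"))
temp      <- c(3.4, 3.4, 8.4, 3.0, 5.6, 8.1)
elevation <- c(0.0, 9.2, 3.8, 5.0, 5.6, 4.1)

site_names <- c("Reedy Lake", "Pearcadale", "Warneet",
                "Cranbourne", "Lysterfield", "Red Hill")

# Each vector becomes a column; row.names labels the rows
field_data <- data.frame(habitat, temp, elevation, row.names = site_names)

print(field_data)
str(field_data)
```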

R Interlude: Reading in Data Frames in R

  • A strength of R is being able to import data from an external source
    • Create the same table that you did above in a spreadsheet using LibreOffice
    • Export it to comma separated and tab separated text files for importing into R.
    • The first will read in a comma-delimited file, whereas the second is a tab-delimited
    • In both cases the header and row.names arguments indicate that there is a header row and row label column
    • Note that a file name by itself will have R look in the current working directory (CWD), whereas a full path can also be used
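
A self-contained sketch of the import step follows. Instead of a spreadsheet export, it writes a small table to temporary comma separated and tab separated files and reads them back. The use of read.table() with a sep argument is an assumption about the original code, and the file and object names are hypothetical.

```r
field_data <- data.frame(
  habitat   = c("mixed", "wet", "dry"),
  temp      = c(3.4, 8.4, 5.6),
  elevation = c(0.0, 3.8, 5.6),
  row.names = c("Reedy Lake", "Warneet", "Lysterfield")
)

# Temporary files stand in for the exported spreadsheet files
csv_file <- tempfile(fileext = ".csv")
tsv_file <- tempfile(fileext = ".tsv")

# col.names = NA writes an empty header cell above the row label column
write.table(field_data, csv_file, sep = ",",  quote = FALSE, col.names = NA)
write.table(field_data, tsv_file, sep = "\t", quote = FALSE, col.names = NA)

# header = TRUE: the first line holds column names
# row.names = 1: the first column holds the row labels
from_csv <- read.table(csv_file, header = TRUE, row.names = 1, sep = ",")
from_tsv <- read.table(tsv_file, header = TRUE, row.names = 1, sep = "\t")

print(from_csv)
print(from_tsv)
```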

Reading in Data Frames in R

Exporting Data Frames in R

  • You will get more practice with this during the next R interlude

Indexing in data frames

  • Next up - indexing just a subset of the data
  • This is a very important idea in R, that you can analyze just a subset of the data.

Indexing in data frames

  • You can also assign subsets of values, or single values, from a data set to a new variable
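
Indexing and assignment can be sketched with the dummy data set from earlier. The object names field_data, wet_sites and first_temp are assumptions; square brackets take a row index and then a column index.

```r
field_data <- data.frame(
  habitat   = factor(c("mixed", "wet", "wet", "wet", "dry", "dry")),
  temp      = c(3.4, 3.4, 8.4, 3.0, 5.6, 8.1),
  elevation = c(0.0, 9.2, 3.8, 5.0, 5.6, 4.1),
  row.names = c("Reedy Lake", "Pearcadale", "Warneet",
                "Cranbourne", "Lysterfield", "Red Hill")
)

field_data[1, 2]          # row 1, column 2: 3.4
field_data$temp           # a whole column by name
field_data["Warneet", ]   # a whole row by row name
field_data[1:3, ]         # the first three rows

# Assign a subset, or a single value, to a new variable
wet_sites  <- field_data[field_data$habitat == "wet", ]
first_temp <- field_data$temp[1]

print(wet_sites)
print(first_temp)
```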

Indexing in data frames

  • You can perform operations on particular levels of a factor
  • Note that the first argument is the numerical column vector, and the second is the factor column vector.
  • The third is the operation. Reversing the first two does not work - convince yourself of this by altering the order.
    • Tab complete will tell you the correct order for arguments
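
The description above matches the argument order of base R's tapply(); the original code is not preserved, so that choice is an assumption. The sketch computes the mean temperature for each habitat level of the dummy data set.

```r
field_data <- data.frame(
  habitat = factor(c("mixed", "wet", "wet", "wet", "dry", "dry")),
  temp    = c(3.4, 3.4, 8.4, 3.0, 5.6, 8.1)
)

# First argument: numerical column; second: factor column; third: the operation
habitat_means <- tapply(field_data$temp, field_data$habitat, mean)
print(habitat_means)

# Other operations work the same way, for example the count per level
habitat_counts <- tapply(field_data$temp, field_data$habitat, length)
print(habitat_counts)
```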

R Interlude

Complete Exercises 1.9-1.10

Lecture 2 - Collaboration, Documentation and Reproducibility

Collaboration using Git and GitHub

Git and GitHub

https://learngitbranching.js.org/