R for Bioinformatic Analyses

Hannah Tavalire and Bill Cresko - University of Oregon

January 2019 - Cesky Krumlov

Lecture 1 - Using R for Biostatistical Analyses

But first a beautiful chair

Before we talk about how amazing R is…

  • navigate into this directory: ~/workshop_materials/evomics_stats_2019/
  • type ‘git pull’

Why use R?

  • R is a statistical programming language (derived from S)
  • Superb data management & graphics capabilities
  • You can write your own functions
  • Powerful and flexible
  • Runs on all computer platforms
  • Well established system of packages and documentation
  • Active development and dedicated community
  • Can use a nice GUI front end such as RStudio
  • Reproducibility
    • keep your scripts to see exactly what was done
    • distribute these with your data
    • embed your R analyses in polished RMarkdown files
  • FREE

R resources

Running R

  • Need to make sure that you have R installed
  • Run R from the command line
    • just type R
    • can run it locally as well as on clusters
  • Install an R Integrated Development Environment (IDE)
    • RStudio: http://www.rstudio.com
    • Makes working with R much easier, particularly for a new R user
    • Runs on Windows, Mac or Linux OS
    • We’re running RStudio as a server on the AWS instances

RStudio

Exercise 1.1 - Exploring RStudio

  • Open RStudio by adding :8787 to your AMI URL
  • Take a few minutes to familiarize yourself with the RStudio environment by locating the following features:
    • See what types of new files can be made in RStudio by clicking the top left icon, then open a new R script.
    • The windows clockwise from top left are: the code editor, the workspace and history, the plots and files window, and the R console.
    • In the plots and files window, click on the packages and help tabs to see what they offer.
  • Now open the file called Exercises_for_R_Lectures.Rmd in /workshop_materials/evomics_stat_2019/03.Exercises/
    • This file will serve as your digital notebook for parts of the workshop and contains the other exercises.

Introduction to RMarkdown

RMarkdown

Exercise 1.2 - Intro to RMarkdown Files

  • Take a few minutes to familiarize yourself with RMarkdown files by completing exercise 1.2 in your exercises document.

BASICS of R

  • Commands can be submitted through
    • terminal, console or scripts
    • can be embedded as code chunks in RMarkdown
  • These slides evaluate code chunks and show their output
    • output is shown after the two # symbols
    • the number in [] is the index of the first item printed on that line
  • R follows the normal order of mathematical operations (PEMDAS)

BASICS of R

Input code chunk and then output

## [1] 16

Input code chunk and then output

## [1] 16
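
As a minimal sketch of operator precedence, the first two inputs below both evaluate to 16. The expressions are assumed for illustration and are not necessarily the ones used on the original slides.

```r
# Plain multiplication
4 * 4            # 16

# Exponent first, then multiplication, then addition: 4 + (3 * 4)
4 + 3 * 2^2      # 16

# Parentheses override the default order: (4 + 3) * 4
(4 + 3) * 2^2    # 28
```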

Assigning Variables

  • A better way to do this is to assign variables
  • Variables are assigned values using the <- operator.
  • Variable names must begin with a letter (or a dot not followed by a digit) and may contain only letters, digits, dots and underscores.
  • Do keep in mind that R is case sensitive.

Assigning Variables

## [1] 6
## [1] 4
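
One set of assignments that would produce the two values above is sketched here. The starting value of 2 is an assumption chosen to match the output.

```r
x <- 2        # assign the value 2 to x
x * 3         # 6

y <- x * 3    # store the result in a new variable
y - 2         # 4
```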

These do not work
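
The assignments below are hypothetical examples of names that R rejects. They are stored as text and evaluated inside try() so that the chunk itself still runs; typed directly at the console, each one stops with an error.

```r
# Hypothetical invalid assignments, kept as text so this chunk can run
bad_code <- c(
  "3y <- 3",     # a name cannot start with a digit
  "3*y <- 3"     # the left-hand side must be a name, not an expression
)

# TRUE for every line of code that raises an error when evaluated
failed <- sapply(bad_code, function(code) {
  result <- try(eval(parse(text = code)), silent = TRUE)
  inherits(result, "try-error")
})
print(failed)
```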

Arithmetic operations on variables

  • Arithmetic operations can be performed easily on variables as well as numbers.
## [1] 14
## [1] 144
## [1] 2.484907
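
The three results above are consistent with x holding the value 12. That value is inferred from the output, so treat it as an assumption.

```r
x <- 12
x + 2     # 14
x^2       # 144
log(x)    # natural logarithm of 12, about 2.484907
```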

Arithmetic operations on variables

  • Note that the last of these - log - is a built-in function of R, and therefore the argument of the function needs to be put in parentheses
  • These parentheses will be important, and we’ll come back to them later when we add further arguments after the first one in the parentheses
  • The outcome of calculations can be assigned to new variables as well, and the results can be checked using the print() function

Arithmetic operations on variables

## [1] 67
## [1] 69022864
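
A sketch of assigning a calculation to a new variable and checking it with print(). The value 124 for x is an assumption chosen so that the result matches the output above.

```r
y <- 67
print(y)          # 67

x <- 124
z <- (x * y)^2    # (124 * 67)^2
print(z)          # 69022864
```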

STRINGS

  • Operations can be performed on character variables as well
  • Note that “characters” need to be set off by quotation marks to differentiate them from numbers
  • The c in c() stands for combine (often read as concatenate)
  • Note that we are using the same variable names as we did previously, which means that we’re overwriting our previous assignment
  • A good rule of thumb is to use new names for each variable, and make them short but still descriptive

STRINGS

## [1] "I Love"
## [1] "Biostatistics"
## [1] "I Love"        "Biostatistics"
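
The output above can be reproduced with assignments along these lines. The variable names x, y and z are assumed here.

```r
x <- "I Love"
print(x)

y <- "Biostatistics"
print(y)

z <- c(x, y)   # combine the two strings into one character vector
print(z)
length(z)      # 2
```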

VECTORS

  • In general R thinks in terms of vectors
    • an ordered set of character, factor or numerical values (e.g. “I Love”)
    • it will benefit any R user to try to write scripts with that in mind
    • it will simplify most things
  • Vectors can be created directly by passing the exact values to the ‘c()’ function.

VECTORS

##  [1]  2  3  4  2  1  2  4  5 10  8  9
##  [1]  5  6  7  5  4  5  7  8 13 11 12
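
A minimal sketch of vectorized arithmetic that matches the two lines of output above. The name n is an assumption.

```r
n <- c(2, 3, 4, 2, 1, 2, 4, 5, 10, 8, 9)
print(n)

n_plus <- n + 3   # the 3 is added to every element
print(n_plus)
```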

FACTORS

  • The vector x is now what is called a character vector (“I Love”).
  • Sometimes we would like to treat the character values as categories (groups) for subsequent calculations.
  • These are called factors, and we can redefine our character variables as factors.
  • This might seem a bit strange, but it’s important for statistical analyses where we might want to see the mean or variance for two different treatments.

FACTORS

## [1] I Love
## Levels: I Love
  • Note that factor levels are reported alphabetically
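
A short sketch of converting a character vector to a factor. The second vector is a hypothetical addition to show the alphabetical ordering of levels, which a single value cannot demonstrate.

```r
x <- "I Love"
x_factor <- as.factor(x)
print(x_factor)

# With more than one distinct value, levels are sorted alphabetically
z <- c("I Love", "Biostatistics")
z_factor <- as.factor(z)
levels(z_factor)   # "Biostatistics" "I Love"
```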

FACTORS

  • We can also determine how R “sees” a variable using the str() or class() functions.
  • This is a useful check when importing datasets or verifying that you assigned a class correctly
##  chr "I Love"
## [1] "character"
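
For example, the checks below report the class before and after conversion to a factor. The variable names are assumed.

```r
x <- "I Love"
str(x)            #  chr "I Love"
class(x)          # "character"

x_factor <- as.factor(x)
class(x_factor)   # "factor"

n <- c(2, 3, 4)
class(n)          # "numeric"
```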

Types or ‘classes’ of vectors of data

Types of vectors of data

  • int stands for integers
  • dbl stands for doubles, or real numbers
  • chr stands for character vectors, or strings
  • dttm stands for date-times (a date + a time)
  • lgl stands for logical, vectors that contain only TRUE or FALSE
  • fctr stands for factors, which R uses to represent categorical variables with fixed possible values
  • date stands for dates

Types of vectors of data

  • Logical vectors can take only three possible values:
    • FALSE
    • TRUE
    • NA which is ‘not available’.
  • Integer and double vectors are known collectively as numeric vectors.
    • In R numbers are doubles by default.
  • Integers have one special value: NA, while doubles have four:
    • NA
    • NaN which is ‘not a number’
    • Inf
    • -Inf
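
The types and special values listed above can be inspected directly. This is a small illustrative sketch using only base R functions.

```r
typeof(1)        # "double": numbers are doubles by default
typeof(1L)       # "integer": the L suffix requests an integer
typeof("a")      # "character"
typeof(TRUE)     # "logical"

# Special values of doubles
1 / 0            # Inf
-1 / 0           # -Inf
0 / 0            # NaN

is.na(NA)        # TRUE
is.na(NaN)       # TRUE: NaN also counts as missing
is.nan(NA)       # FALSE: NA is not NaN
```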

Basic Statistics

Many functions exist to operate on vectors.

  • Arguments modify or direct the function in some way
    • Most functions take several arguments, some of which have default values
    • Tab completion is helpful to view argument options
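
A minimal sketch of a few summary functions applied to a vector, and of an argument changing a function's behavior. The vector n is an assumed example.

```r
n <- c(2, 3, 4, 2, 1, 2, 4, 5, 10, 8, 9)

mean(n)      # arithmetic mean
median(n)    # middle value, here 4
var(n)       # sample variance
sd(n)        # sample standard deviation, equal to sqrt(var(n))

# With a missing value present, mean() returns NA unless
# na.rm = TRUE is supplied (the default is na.rm = FALSE).
m <- c(n, NA)
mean(m)                 # NA
mean(m, na.rm = TRUE)   # same as mean(n)
```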

Getting Help

  • Getting Help on any function is very easy - just type a question mark and the name of the function.
  • There are functions for just about anything within R and it is easy enough to write your own functions if none already exist to do what you want to do.
  • In general, function calls have a simple structure: a function name, a set of parentheses and an optional set of parameters/arguments to send to the function.
  • Help pages exist for all functions that, at a minimum, explain what parameters exist for the function.

Getting Help
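
A few ways of asking for help are sketched below, using mean() and sd() as example functions.

```r
?mean           # open the help page for mean()
help("mean")    # the same request written as a function call
args(sd)        # list the arguments of sd() and their defaults
```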

Creating vectors

  • Creating a vector of new data by entering it by hand can be a drag
  • However, it is also very easy to use functions such as
    • seq
    • sample

Creating vectors

  • What do the arguments mean?
##   [1]  0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0  1.1  1.2  1.3
##  [15]  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3  2.4  2.5  2.6  2.7
##  [29]  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7  3.8  3.9  4.0  4.1
##  [43]  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1  5.2  5.3  5.4  5.5
##  [57]  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5  6.6  6.7  6.8  6.9
##  [71]  7.0  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9  8.0  8.1  8.2  8.3
##  [85]  8.4  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3  9.4  9.5  9.6  9.7
##  [99]  9.8  9.9 10.0
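
A call of the following form produces a sequence like the one above. The name seq_1 is an assumption; the arguments are the start value, the end value and the step size.

```r
seq_1 <- seq(0.0, 10.0, by = 0.1)   # from 0 to 10 in steps of 0.1
print(seq_1)
length(seq_1)                       # 101 values
```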

Creating vectors

##   [1] 10.0  9.9  9.8  9.7  9.6  9.5  9.4  9.3  9.2  9.1  9.0  8.9  8.8  8.7
##  [15]  8.6  8.5  8.4  8.3  8.2  8.1  8.0  7.9  7.8  7.7  7.6  7.5  7.4  7.3
##  [29]  7.2  7.1  7.0  6.9  6.8  6.7  6.6  6.5  6.4  6.3  6.2  6.1  6.0  5.9
##  [43]  5.8  5.7  5.6  5.5  5.4  5.3  5.2  5.1  5.0  4.9  4.8  4.7  4.6  4.5
##  [57]  4.4  4.3  4.2  4.1  4.0  3.9  3.8  3.7  3.6  3.5  3.4  3.3  3.2  3.1
##  [71]  3.0  2.9  2.8  2.7  2.6  2.5  2.4  2.3  2.2  2.1  2.0  1.9  1.8  1.7
##  [85]  1.6  1.5  1.4  1.3  1.2  1.1  1.0  0.9  0.8  0.7  0.6  0.5  0.4  0.3
##  [99]  0.2  0.1  0.0
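
Swapping the endpoints and using a negative step gives the descending sequence. Again, the name seq_2 is assumed.

```r
seq_2 <- seq(10.0, 0.0, by = -0.1)  # a negative step counts down
print(seq_2)
length(seq_2)                       # 101 values
```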

Creating vectors

##   [1] 100.00  98.01  96.04  94.09  92.16  90.25  88.36  86.49  84.64  82.81
##  [11]  81.00  79.21  77.44  75.69  73.96  72.25  70.56  68.89  67.24  65.61
##  [21]  64.00  62.41  60.84  59.29  57.76  56.25  54.76  53.29  51.84  50.41
##  [31]  49.00  47.61  46.24  44.89  43.56  42.25  40.96  39.69  38.44  37.21
##  [41]  36.00  34.81  33.64  32.49  31.36  30.25  29.16  28.09  27.04  26.01
##  [51]  25.00  24.01  23.04  22.09  21.16  20.25  19.36  18.49  17.64  16.81
##  [61]  16.00  15.21  14.44  13.69  12.96  12.25  11.56  10.89  10.24   9.61
##  [71]   9.00   8.41   7.84   7.29   6.76   6.25   5.76   5.29   4.84   4.41
##  [81]   4.00   3.61   3.24   2.89   2.56   2.25   1.96   1.69   1.44   1.21
##  [91]   1.00   0.81   0.64   0.49   0.36   0.25   0.16   0.09   0.04   0.01
## [101]   0.00
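
The squared values above can be obtained in at least two equivalent ways, sketched here under the assumption that the descending sequence is stored in seq_2. Both operate element by element.

```r
seq_2 <- seq(10.0, 0.0, by = -0.1)

seq_square     <- seq_2 * seq_2   # element-by-element multiplication
seq_square_new <- seq_2^2         # the same result using the power operator

print(seq_square)
all.equal(seq_square, seq_square_new)   # TRUE
```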

R Interlude

Complete Exercises 1.3-1.6

Drawing samples from distributions

  • Here is a way to create your own data sets that are random samples…
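
One possible way to do this is sketched below with rnorm(), which draws from a normal distribution. The choice of distribution, sample size, parameters and seed are all assumptions made for illustration.

```r
set.seed(42)                          # fix the seed so the draws are repeatable

x <- rnorm(1000, mean = 0, sd = 1)    # 1000 draws from a standard normal

length(x)                             # 1000
mean(x)                               # close to 0
sd(x)                                 # close to 1
```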