Introduction to R

Angela Fuentes-Pardo, 20 January 2020

Table of contents


  1. Introduction
  2. Why R and RStudio?
  3. How to install R packages
  4. R documentation and getting help
  5. Operators
  6. Data types
  7. Data structures
  8. Functions
  9. Data visualisation
  10. Read and write data
  11. Other considerations
  12. R Markdown
  13. Session info
  14. Further readings and acknowledgements
  15. Some population genomic analysis in R

1. Introduction


Background and Aim

Genomic datasets generally consist of thousands to millions of genotypes obtained from numerous individuals and populations. Such large datasets therefore cannot be analysed with programs that cap the size of a table (e.g. a Microsoft Excel worksheet can hold at most 1 048 576 rows and 16 384 columns). In this tutorial you will be introduced to the basic functionality of R, an environment for statistical computing and one of the most popular tools for the statistical analysis and plotting of genomic data. To run this tutorial smoothly, we will use RStudio, a graphical interface for R.

Requirements

  • R
  • RStudio

How to do this activity

We will run this activity in RStudio in the web browser. To connect, use: http://ec2-XX-XXX-XXX-XXX.compute-1.amazonaws.com:8787. An RStudio login page should appear on screen. Use ‘popgen’ as the username and the password written on the whiteboard. Press the “Enter” key and voilà! You are ready to start having fun with R and RStudio 😀

2. Why R and RStudio?


R is a language and interactive environment for statistical analysis. It is open source and runs on Windows and UNIX-based operating systems (e.g. Linux and macOS). Besides having many built-in (“base”) libraries, R is highly extensible. There is a large user community world-wide that is actively developing tools for a variety of applications, including genomics. These characteristics make data analysis in R highly reproducible and shareable.

After installation, the typical way to run R is through the R console in a terminal window: type R and press Enter. You know the R console is active when you see the prompt (>). To execute code in R, just type the commands next to the prompt and hit Enter. R then performs the requested computations and prints the result to the console.

For this tutorial we will use RStudio, a graphical user interface for R that facilitates code writing and development. RStudio is also open source and runs on the major operating systems. Its workspace has four panels:

  • the console (bottom-left)
  • a syntax-highlighting text editor (top-left)
  • an environment exploration panel that lists variables loaded in memory and command history (top-right)
  • a panel for file management, plotting, package installation, and help documentation (bottom-right)

To run code in RStudio, you can either type it directly into the console panel, or execute it from the text editor by pressing the “Run” button (in the top-right corner of the text editor panel) or by pressing Ctrl + Enter (on a Mac you can also use Cmd + Enter). By default, the button (and the key combination) executes the current line. To run several lines (or the entire script) at once, first select them and then hit the “Run” button.

3. How to install R packages


R packages are available online from one of these main repositories: CRAN, Bioconductor, and GitHub.

  • CRAN stands for the Comprehensive R Archive Network. It consists of a group of servers that store R packages and their documentation (for more information go to https://cran.r-project.org).
    To install packages available in CRAN using the console, use the function install.packages(). The name of the package must be written between single '' or double "" quotes inside the parentheses:

    install.packages("ggplot2")

    To start using the installed package, load it in memory using the function library():

    library(ggplot2)

    Installing packages with RStudio is equally easy. Go to the “Packages” tab (bottom-right panel) and click “Install”, then type in the package name, and hit the “Install” button.

  • Bioconductor is an archive of R packages for the analysis of high-throughput genomic data (for more information go to https://bioconductor.org). To install packages available in Bioconductor, first install the package BiocManager, and then use its function install() to install your package of interest:
    # if the package BiocManager is not available, install it
    if (!requireNamespace("BiocManager", quietly = TRUE))
      install.packages("BiocManager")
    
    # install the package of interest
    BiocManager::install("qvalue")

    Note that in the previous code there are some lines with a # at the beginning. These are comments that we can add to the code for clarity and they are actually not executed by the interpreter.

  • GitHub is one of many repositories where individuals store their code. To download an R package from GitHub, first you need to install the package devtools and then use the function install_github():
    if (!require("devtools")) install.packages("devtools")  # install the package devtools if not available
    library(devtools)  # load devtools
    install_github("stephenturner/qqman")  # install the package qqman from the GitHub page of the developer
    library(qqman)  # load the library qqman
    4. R documentation and getting help


    The documentation of a function from a package loaded in memory can be retrieved by typing ? followed by the name of the function in the console. Use ?? followed by a keyword to search the help pages of all installed packages, including those not yet loaded in memory. Some packages also have a long-form guide that can be explored using the function vignette():

    ?plot
    ??ggplot
    vignette("qqman")  # check out the cool Manhattan plots you can create with qqman

    Within RStudio, documentation can be searched in the “Help” tab (bottom-right panel).
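A few other base R helpers can be handy when you do not remember the exact name of a function. The short sketch below only uses functions that ship with R; the search pattern is just an illustration.

```r
# Open the help page of a function (equivalent to typing ?mean)
help("mean")

# List the objects on the search path whose name starts with "read."
readFunctions <- apropos("^read\\.")
readFunctions

# Run the examples shown at the bottom of a help page;
# local = TRUE keeps the example objects out of your workspace
example(mean, local = TRUE)
```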

    R is a fantastic tool but it might involve a steep learning curve. Sometimes you might run out of ideas on how to address a particular programming problem, or not remember how to change the labels of the plot axes. The good news is that you are not alone! There is a large online community of R users that contribute to forums where people post and solve common questions. Examples of these are the RStudio community and Stack Overflow. Remember, the internet is your best friend!

    5. Operators


    Assignment operator

    To create an object (or variable), give it a name and assign a value to it using the assignment operator <- or =. In this tutorial, we will use <-. In RStudio, you can type this operator in one go with the keyboard combination Alt and - on a PC, or Option and - on a Mac:

    size <- 8
    age = 77
    
    size  # type the name of an object to print it on screen
    age

    Arithmetic operators

    Some common arithmetic operators are addition +, subtraction -, multiplication *, division /, and exponentiation ^:

    5 + 2
    5 - 2
    5 * 2
    5 / 2
    5 ^ 2
    
    # using variables that hold numbers
    x <- 7
    y <- 1
    
    x + y
    x - y
    x * y
    x / y
    x ^ y
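Two more arithmetic operators are worth knowing, together with the order in which R evaluates an expression. The values below are arbitrary and only serve as an illustration.

```r
# Integer division and remainder (modulo)
quotient  <- 17 %/% 5   # 3
remainder <- 17 %% 5    # 2

# Multiplication is evaluated before addition unless parentheses say otherwise
noParens   <- 2 + 3 * 4     # 14
withParens <- (2 + 3) * 4   # 20

# ^ is evaluated before the unary minus
negSquare <- -2 ^ 2     # -4
posSquare <- (-2) ^ 2   # 4

c(quotient, remainder, noParens, withParens, negSquare, posSquare)
```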

    Logical operators

    The comparison operators below return the logical values TRUE or FALSE depending on the result of comparing two values:

    m <- 2
    n <- 3
    
    m == n  # is m equal to n?
    m != n  # is m different from n?
    m > n  # is m greater than n?
    m >= n  # is m greater than or equal to n?
    m < n  # is m less than n?
    m <= n  # is m less than or equal to n?
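Comparisons can be combined with the operators & (and), | (or) and ! (not). A small sketch with made-up values:

```r
lower <- 2
upper <- 3

bothTrue  <- (lower < upper) & (upper < 10)   # TRUE: both comparisons hold
eitherOne <- (lower > upper) | (upper == 3)   # TRUE: the second comparison holds
negated   <- !(lower == upper)                # TRUE: lower is not equal to upper

# These operators also work element by element on vectors
values  <- c(1, 5, 10)
inRange <- values > 2 & values < 10   # FALSE TRUE FALSE
inRange
```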

    6. Data types


    The most common data types in R are: numeric (decimal or real), integer, character (text strings), logical (TRUE or FALSE), and factor. In the example below, note that characters are written between quotes while numbers are not:

    x <- 1
    y <- "1"
    
    class(x)  # class() is a function that returns the class of a given object
    class(y)
    class(7)
    class("7")
    class("Evomics")
    class(T)  # T is shorthand for TRUE; writing TRUE in full is safer because T can be reassigned
    class(FALSE)
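Values can be converted from one data type to another with the as.*() family of functions, and tested with the is.*() family. The snippet below is a minimal illustration using base R only.

```r
intClass <- class(7L)   # the suffix L creates an integer: "integer"

isNum <- is.numeric("7")       # FALSE: "7" is a character string
asNum <- as.numeric("7") + 1   # 8: the string is converted to a number first
asChr <- as.character(7)       # "7"
asInt <- as.integer(3.9)       # 3: the decimal part is dropped, not rounded
asLgl <- as.logical("TRUE")    # TRUE

intClass
```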

    A factor is a vector that represents a categorical variable. Its distinct categories are called “levels”, and they can be built from either numeric or character values. Internally, though, a factor is stored as a vector of integers, each associated with a level label. Levels are sorted in alphabetical order by default, but their order can be changed manually.

    # create a vector
    eyeColors <- c('brown','brown','green','blue','grey','blue','green')
    
    # create a factor object using the function factor()
    factorEye <- factor(eyeColors)
    
    # print the factor
    print(factorEye)
    print(nlevels(factorEye)) # use the nlevels() function to find out how many levels the factor has
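To see the default ordering of levels and how to change it, here is a short sketch. It uses its own object names (eyes, eyesDefault, eyesOrdered) so that it does not overwrite the objects created above.

```r
eyes <- c("brown", "brown", "green", "blue", "grey", "blue", "green")

# Default: levels are sorted alphabetically
eyesDefault <- factor(eyes)
levels(eyesDefault)   # "blue" "brown" "green" "grey"

# Set the order of the levels manually with the argument levels
eyesOrdered <- factor(eyes, levels = c("brown", "green", "blue", "grey"))
levels(eyesOrdered)

# The integer codes stored internally follow the order of the levels
as.integer(eyesOrdered)   # 1 1 2 3 4 3 2

# Count the observations per level
table(eyesOrdered)
```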

    7. Data structures


    The main data structures (or classes) in R are: vectors, matrices, arrays, data frames, and lists.

    Vectors

    A vector is an object consisting of a one-dimensional set of elements of the same data type.

    x <- rep(1,10)  # use ?rep to find out what this function is capable of doing
    x
    
    foo <- c("first", "second", "third")
    foo

    To combine vectors use the c() function:

    x <- rep(1,5)
    y <- 1:5
    
    c(x,y)

    You can perform arithmetic calculations on vectors using the mathematical operators seen before. Note that such calculations are applied to each element of the vector:

    x + y
    x / y
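When the two vectors differ in length, R repeats ("recycles") the shorter one until it matches the longer one. A minimal sketch with made-up vectors:

```r
shortVec <- c(10, 20)
longVec  <- 1:6

# shortVec is reused as 10 20 10 20 10 20
recycled <- longVec + shortVec   # 11 22 13 24 15 26
recycled

# A single number is recycled in the same way
doubled <- longVec * 2           # 2 4 6 8 10 12
doubled
```

If the length of the longer vector is not a multiple of the length of the shorter one, R still computes a result but prints a warning.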

    To access one element of a vector, specify its index (or position in the vector starting from 1) inside of []:

    vec <- c("1st", "2nd", "3rd", "4th", "5th")
    vec
    vec[4]

    To subset or extract a part of the vector (e.g. from the 2nd to the 4th element) use:

    vec[2:4]
    
    length(vec)  # Explore how many elements the vector "vec" has with the length() function
    vec[3:length(vec)]
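Besides positions, you can subset a vector with negative indices (to drop elements) or with a logical condition. The object names below are only for illustration.

```r
ranks  <- c("1st", "2nd", "3rd", "4th", "5th")
scores <- c(3, 8, 5, 10, 1)

firstLast <- ranks[c(1, 5)]   # pick the 1st and 5th elements
noFirst   <- ranks[-1]        # a negative index drops that element

highScores <- scores[scores > 4]   # keep elements that meet a condition: 8 5 10
highPos    <- which(scores > 4)    # positions that meet the condition: 2 3 4
highRanks  <- ranks[scores > 4]    # a condition on one vector can subset another

highRanks
```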

    Matrices

    A matrix is a vector with two dimensions (rows and columns). Its elements must be of the same data type:

    mx <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, ncol = 2)
    mx
    class(mx)
    str(mx)

    Check the dimensions of the matrix (number of rows and number of columns):

    dim(mx)

    To retrieve one element of the matrix, give its coordinates within [] in the order [row, column]:

    mx[3, 1]

    Subset one column or one row of the matrix:

    mx[, 1]  # 1st column
    mx[3, ]  # 3rd row
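By default matrix() fills the matrix column by column. The sketch below contrasts this with the argument byrow = TRUE and shows how to transpose a matrix with t().

```r
byCol <- matrix(1:6, nrow = 3, ncol = 2)                 # filled column by column
byRow <- matrix(1:6, nrow = 3, ncol = 2, byrow = TRUE)   # filled row by row

byCol
byRow

byCol[3, 1]   # 3
byRow[3, 1]   # 5

# t() swaps rows and columns
transposed <- t(byCol)
dim(transposed)   # 2 3
```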

    Arrays

    An array is a vector with one or more dimensions that can only contain elements of the same data type (a matrix is a two-dimensional array). Here we create a three-dimensional array:

    m <- c(5,9,3)
    n <- c(10,11,12,13,14,15)
    
    arr <- array(c(m, n), dim = c(3,3,2))  # create a 3 x 3 x 2 array; the 9 input values are recycled to fill its 18 cells
    print(arr)
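Elements of an array are retrieved with one index per dimension. In this small sketch the array is filled with the numbers 1 to 24 so that it is easy to see where each value ends up.

```r
cube <- array(1:24, dim = c(2, 3, 4))   # 2 rows, 3 columns, 4 layers
dim(cube)

# The first index changes fastest when the array is filled
cube[2, 1, 1]   # 2
cube[1, 2, 1]   # 3
cube[1, 1, 2]   # 7

# Leaving an index empty returns everything along that dimension
lastLayer <- cube[, , 4]   # a 2 x 3 matrix
lastLayer
```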

    Lists

    A list is a special type of vector in which each element can be of a different data type and can have a different length:

    myList <- list(head(cars), "I love coding in R", c(0.5, 0.7, 1.0, 0.1))
    myList

    Use double square brackets to extract the content of elements of a list:

    myList[[1]]
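The difference between single and double square brackets is easy to miss: single brackets return a (smaller) list, while double brackets return the content of one element. A short sketch with a hypothetical list called items:

```r
items <- list(1:3, "text", c(TRUE, FALSE))

singleClass <- class(items[1])     # "list": still a list, with one element
doubleClass <- class(items[[1]])   # "integer": the content of the first element

# Once extracted, the content can be indexed as usual
secondValue <- items[[1]][2]       # 2

# Single brackets can return several elements at once
subList <- items[2:3]
length(subList)                    # 2
```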

    You can name the elements of a list and retrieve them by name using the $ sign:

    myList <- list(data = head(cars), mss = "I love coding in R", freq = c(0.5, 0.7, 1.0, 0.1))
    
    myList$freq
    myList$mss
    names(myList)  # Explore the names of the elements in the list

    Data frames

    A data frame is a type of list in which every element (column) has the same length, while different columns can hold different data types.
    Besides column names, data frames also have row names, which can be useful for filtering and subsetting data. Note, though, that computations only apply to the table contents, not to the row or column names.
    Let’s create a data frame from scratch:

    df <- data.frame(Numbers = 1:10, Letters = c("a", "e", "i", "o", "u", "x", "y", "z", "h", "m"), stringsAsFactors = FALSE)
    df

    Explore the data frame using these various functions:

    class(df)
    str(df)
    nrow(df)
    ncol(df)
    rownames(df)
    colnames(df)
    head(df)
    tail(df)
    dim(df)
    
    # change the row names manually
    rownames(df) <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
    rownames(df)
    df

    An element of a data frame can be extracted by specifying the row and column index within []:

    df[3, 2]

    To subset a column of a data frame, you can use its index or name:

    df[, 2]
    df$Letters
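Rows can also be filtered with a logical condition, which is how data frames are most often subset in practice. The sketch below uses a small, made-up data frame called vowels so that the objects created above stay untouched.

```r
vowels <- data.frame(Numbers = 1:5,
                     Letters = c("a", "e", "i", "o", "u"),
                     stringsAsFactors = FALSE)

# Keep the rows that meet a condition (note the comma: all columns are kept)
highRows <- vowels[vowels$Numbers > 3, ]
highRows

# Combine a row condition with a column name
picked <- vowels[vowels$Letters %in% c("a", "u"), "Numbers"]   # 1 5

# subset() does the same with a more compact syntax
lowLetters <- subset(vowels, Numbers <= 2, select = Letters)

# Add a new column computed from an existing one
vowels$Squared <- vowels$Numbers^2
vowels
```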

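    You can join two data frames either row- or column-wise using the functions rbind() and cbind(), respectively. Note that rbind() matches the columns of data frames by name, so their order may differ between the two tables, while cbind() simply places the tables side by side, even if this duplicates column names: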

    df1 <- data.frame(Letters = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J"), Numbers = 1:10)
    df2 <- data.frame(Numbers = 21:30, Letters = c("Z", "I", "J", "A", "E", "F", "G", "B", "C", "D"))
    
    df1
    df2
    
    cbind(df1, df2)
    rbind(df1, df2)
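When two tables share an identifier column, the base function merge() can join them by matching values instead of by position. This goes slightly beyond rbind() and cbind(); the data frames and column names below are made up for illustration.

```r
depthTable <- data.frame(Sample = c("s1", "s2", "s3"),
                         Depth = c(10, 25, 17),
                         stringsAsFactors = FALSE)
popTable <- data.frame(Sample = c("s3", "s1", "s4"),
                       Pop = c("north", "south", "north"),
                       stringsAsFactors = FALSE)

# Keep only the samples present in both tables
merged <- merge(depthTable, popTable, by = "Sample")
merged

# Keep every sample of the first table; missing values are filled with NA
mergedAll <- merge(depthTable, popTable, by = "Sample", all.x = TRUE)
mergedAll
```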

    8. Functions


    A function is a group of statements that performs a specific task. Functions have four main components: name, arguments (values that are passed to the function, optional), body (or statements), and return value. They have this general structure:

    function_name <- function(argument1, argument2, ...) {
      Statements
    }

    R has many built-in functions; let’s explore some of them:

    x <- c(1:50)
    x
    
    min(x)
    max(x)
    sum(x)
    mean(x)
    log(x)
    x/2
    round(x/2)
    sort(x)

    To set the working directory, or to find out which directory we are currently working in, use:

    # set a working directory
    setwd("/home/popgen/workshop_materials/20_R_intro")
    
    # the same can be achieved in this way
    path <- "/home/popgen/workshop_materials/20_R_intro"  # create a character vector with the path to the working directory
    setwd(path)
    
    getwd()  # print working directory

    Let’s create a function from scratch:

    calcMean <- function(x) {
      sum(x) / length(x)
    }
    
    vec <- c(1,2,3,4,5,6,7,8,9,10)  # create a numeric vector
    calcMean(vec)  # apply the new function to the vector
    
    mean(vec)  # Hey, there is already a base function that calculates the mean
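Functions can also have arguments with default values and an explicit return value. The function below, calcSE(), is a hypothetical example written for this tutorial: it computes the standard error of the mean and can optionally drop missing values first.

```r
calcSE <- function(values, na.rm = FALSE) {
  # optionally remove missing values before computing
  if (na.rm) {
    values <- values[!is.na(values)]
  }
  se <- sd(values) / sqrt(length(values))
  return(se)
}

calcSE(c(2, 4, 4, 4, 5, 5, 7, 9))      # na.rm takes its default value, FALSE
calcSE(c(2, 4, NA, 6))                 # returns NA because of the missing value
calcSE(c(2, 4, NA, 6), na.rm = TRUE)   # the missing value is dropped first
```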

    9. Data visualisation


    Base graphics

    R is an excellent tool to generate publication-quality plots. Many graphics functions come with the default installation of R; these are known as “base” graphics.
    Use the command demo(graphics) to get an idea of the different plots that can be created with R base graphics.

    Let’s make a scatterplot using the base dataset called “iris”:

    data(iris)
    plot(iris$Petal.Length, iris$Petal.Width)
    
    plot(iris$Petal.Length, iris$Petal.Width, xlab = "Petal length", ylab = "Petal width")  # modify the x and y axis labels

    Now colour the points by the categorical value (factor) of the column “Species”:

    plot(iris$Petal.Length, iris$Petal.Width, xlab = "Petal length", ylab = "Petal width", col = iris$Species)
    legend("bottomright", legend = levels(iris$Species), col = seq_along(levels(iris$Species)), pch = 1)

    Grid graphics

    Grid graphics are an alternative to base R graphics. They provide more control and flexibility to modify plot appearance. ggplot2 is perhaps the most popular grid-graphics package nowadays.
    Let’s use it to create the same plot we did before:

    library(ggplot2)
    
    ggplot(iris, aes(x = Petal.Length, y = Petal.Width, colour = Species))+
      geom_point()+
      labs(x = "Petal length", y = "Petal width")

    To learn more about how to customise plots generated with ggplot, check out its website https://ggplot2.tidyverse.org or https://ggplot2.tidyverse.org/reference/theme.html.

    10. Read and write data


    There are specific functions to import and export text data in a variety of formats. In general, write.table() is used for saving data as a delimited text file (e.g. tab-delimited), while write.csv() is for saving data as a comma-delimited file. Use ?write.table and ?write.csv to find out which arguments each function takes.
    Save two of the previously created data frames:

    write.table(df1, "example1.txt", sep = "\t", row.names = FALSE, quote = FALSE)
    write.csv(df2, "example2.csv", row.names = FALSE)

    Let’s review the arguments used for saving these files:

    • df1 is the object we want to save
    • "example1.txt" is the name (as a character string) of the output file
    • sep = "\t" sets the tab character as the delimiter
    • row.names = FALSE means that row names are not written to the output file
    • quote = FALSE means that character strings are not surrounded by double quotes (by default they are)

    Now load these files, assigning them to different objects:

    # load a tab-delimited (.txt) text file
    dat1 <- read.table("example1.txt", header = TRUE, sep = "\t", stringsAsFactors = FALSE)
    
    # load a comma-delimited (.csv) file
    dat2 <- read.csv("example2.csv", header = TRUE)

    Explore what the imported data look like:

    head(dat1)
    str(dat1)
    
    head(dat2)
    str(dat2)
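A quick way to confirm that nothing was lost when writing and reading a file is to compare the reloaded table with the original. The sketch below works on a small, made-up data frame and a temporary file, so it does not touch the files created above.

```r
original <- data.frame(Id = 1:3,
                       Pop = c("north", "south", "north"),
                       stringsAsFactors = FALSE)

tmpFile <- tempfile(fileext = ".txt")   # temporary file used only for this check
write.table(original, tmpFile, sep = "\t", row.names = FALSE, quote = FALSE)

reloaded <- read.table(tmpFile, header = TRUE, sep = "\t", stringsAsFactors = FALSE)

# all.equal() returns TRUE when the two tables match
sameTable <- all.equal(original, reloaded)
sameTable

unlink(tmpFile)   # delete the temporary file
```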

    Traditional R functions (e.g. read.table()) load an entire dataset into memory at once and can be slow. This becomes a problem when dealing with large genomic datasets, as they often amount to several GB. An alternative to read.table() is fread(), a function from the package data.table that reads large files much faster and can load only the columns or rows of interest (see its arguments select and nrows) instead of the whole file.
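As a rough sketch of how to load only part of a file, the example below writes a tiny, made-up table to a temporary file, reads its first rows with read.table(), and then reads two of its columns with fread(). The fread() step only runs if the package data.table is installed; the file content and column names are invented for illustration.

```r
# Create a small example file (a stand-in for a large genomic table)
bigFile <- tempfile(fileext = ".txt")
write.table(data.frame(Chr = 1:3, Pos = c(100, 200, 300), Fst = c(0.1, 0.5, 0.2)),
            bigFile, sep = "\t", row.names = FALSE, quote = FALSE)

# Base R: read only the first two rows with the argument nrows
firstRows <- read.table(bigFile, header = TRUE, sep = "\t", nrows = 2)
firstRows

# data.table: read only the columns of interest with the argument select
if (requireNamespace("data.table", quietly = TRUE)) {
  twoCols <- data.table::fread(bigFile, select = c("Chr", "Fst"))
  print(twoCols)
}

unlink(bigFile)   # delete the temporary file
```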
    To save plots using the command line, you first need to open a graphics device, then create the plot, and finally close the device. There are several functions to open a graphics device, which are generally named after the format in which the plot is saved (e.g. png(), pdf()):

    # To save the scatter plot we made before in png format
    png(filename = "scatterplot-in-base-graphics.png")
    
    plot(iris$Petal.Length, iris$Petal.Width, xlab = "Petal length", ylab = "Petal width", col = iris$Species)
    legend("bottomright", legend = levels(iris$Species), col = seq_along(levels(iris$Species)), pch = 1)
    
    dev.off()

    To save a plot created with ggplot(), you can either use the function ggsave() or do it in the traditional way, in which case you first need to store the plot in a variable and then print it while the graphics device is open:

    p <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
      geom_point() +
      labs(x = "Petal length", y = "Petal width")
    
    ggsave(filename = "scatterplot-in-ggplot-graphics1.png", plot = p)
    
    # alternatively
    png(filename = "scatterplot-in-ggplot-graphics2.png")
    print(p)
    dev.off()

    In RStudio, you can easily save a plot by using the “Export” tab.

    11. Other considerations


    • Rows are designated for observations and columns for variables
    • A column can only contain one data type
    • Avoid giving objects the same name as R functions, e.g. c or t
    • When naming columns, avoid using . or - and do not start a name with a number. If a separator is desired, use _ instead
    • When naming objects, prefer camelCase (e.g. camelStyle) over dots (e.g. camel.style)
    • Factors are not equivalent to characters, so pay attention to the structure of imported data. In R versions before 4.0.0, characters are converted to factors unless stringsAsFactors = FALSE is set (from R 4.0.0 onwards, FALSE is the default), and functions do not work the same for characters and factors

    Further recommendations on R style can be found here http://jef.works/R-style-guide/

    12. R Markdown


    Nothing makes sense in science if it is not reproducible. Just as molecular biologists keep a notebook in which they record the different experiments conducted in the lab, we should keep detailed notes of the code and analyses we perform.
    You can create R notebooks using Markdown and RStudio. Markdown is a lightweight markup language for formatting plain text. To learn more about R Markdown, go to this website https://rmarkdown.rstudio.com/lesson-2.html or check out the R Markdown reference guide https://rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf

    13. Session info


    With the function sessionInfo() you can record the R version and the packages that were used in this tutorial. It is always good practice to save this kind of information for reproducibility.
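One possible way to keep this record is to write the output of sessionInfo() to a text file next to your results. In the sketch below the file name and its location (a temporary directory) are placeholders; in a real project you would probably save the file in your working directory.

```r
info <- sessionInfo()
print(info)          # R version, operating system, and attached packages

R.version.string     # just the R version, as a character string

# Save the printed output to a text file
outFile <- file.path(tempdir(), "sessionInfo.txt")   # placeholder location
writeLines(capture.output(print(info)), outFile)
```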

    14. Further readings and acknowledgements


    One last thing to run in the console:

    library(meme)
    meme("http://i0.kym-cdn.com/entries/icons/mobile/000/000/745/success.jpg", "Mission accomplished", "Tell me more about R")

    Congratulations on completing the R tutorial! 🎉 We hope you found it useful and are motivated to continue learning about R. Below are a few links that served as great inspiration for compiling this tutorial.

    In case you want to run this tutorial on your own machine in the future, install R and RStudio by following the instructions given on their respective websites. For R go to https://www.r-project.org/ and for RStudio go to https://www.rstudio.com/. Happy coding!

  • Some population genomic analysis in R