Introduction to R
Angela Fuentes-Pardo, 20 January 2020
Table of contents
- Introduction
- Why R and RStudio?
- How to install R packages
- R documentation and getting help
- Operators
- Data types
- Data structures
- Functions
- Data visualisation
- Read and write data
- Other considerations
- R Markdown
- Session info
- Further readings and acknowledgements
- Some population genomic analysis in R
1. Introduction
Background and Aim
Genomic datasets generally consist of thousands to millions of genotypes obtained from numerous individuals and populations. Such large datasets cannot be analysed with programs that have restricted capacity (e.g. Microsoft Excel can handle up to 1 048 576 rows and 16 384 columns). In this tutorial you will be introduced to the basic functionalities of R, an environment for statistical computing and one of the most popular software tools for the statistical analysis and plotting of genomic data. To run this tutorial smoothly, we will use RStudio, a graphical interface for R.
Requirements
- R
- RStudio
How to do this activity
We will run this activity in RStudio in the web browser. To connect, use: http://ec2-XX-XXX-XXX-XXX.compute-1.amazonaws.com:8787
An RStudio login page should appear on screen. Use ‘popgen’ as the username and the password written on the whiteboard. Press the “Enter” key and voilà! You are ready to start having fun with R and RStudio 😀
2. Why R and RStudio?
R is a language and interactive environment for statistical analysis. It is open source and runs on Windows and UNIX-based operating systems (e.g. Linux and macOS). Besides having many built-in (“Base”) libraries, R is highly extensible. There is a large worldwide user community that is actively developing tools for a variety of applications, including genomics. These characteristics make data analysis in R highly reproducible and shareable.
After installation, the typical way to run R is using the R console in a terminal window: type R in the terminal and press Enter. You know the R console is enabled when you see the prompt (>). To execute code in R, just type the commands next to the prompt and hit Enter. R then performs the requested computations and prints the result to the console.
For this tutorial we will use RStudio, a graphical user interface for R that facilitates code writing and development. RStudio is also open source and runs on the major operating systems. It is a working space that has four panels:
- the console (bottom-left)
- a syntax-highlighting text editor (top-left)
- an environment exploration panel that lists variables loaded in memory and command history (top-right)
- a panel for file management, plotting, package installation, and help documentation (bottom-right)
To run code using RStudio, you can either type it directly into the console panel, or execute it from the text editor by pressing the “Run” button (on the top-right corner of the text editor panel) or by pressing the keys Ctrl + Enter at the same time (on a Mac you can also use Cmd + Enter). This button (and the key combination) will execute the current line by default. To run several lines (or the entire code) at once, first select them and then hit the “Run” button.
3. How to install R packages
R packages are available online from one of these main repositories: CRAN, Bioconductor, and GitHub.
To install packages available in CRAN using the console, use the function install.packages(). The name of the package must be written between single '' or double "" quotes inside the parentheses:
install.packages("ggplot2")
To start using the installed package, load it in memory using the function library():
library(ggplot2)
Installing packages with RStudio is equally easy. Go to the “Packages” tab (bottom-right panel) and click “Install”, then type in the package name, and hit the “Install” button.
To install packages available in Bioconductor, first install the package BiocManager, and then your package of interest:
# if the package BiocManager is not available, then install it
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
# install the package of interest
BiocManager::install("qvalue")
Note that in the previous code there are some lines with a # at the beginning. These are comments that we can add to the code for clarity; they are not executed by the interpreter.
To install packages available in GitHub, first install the package devtools and then use the function install_github():
if (!require("devtools")) install.packages("devtools") # install the package devtools if not available
library(devtools) # load devtools
install_github("stephenturner/qqman") # install the package qqman from the github page of the developer
library(qqman) # load the library qqman
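Before installing anything, it can be handy to check which packages are already available. The lines below are only a sketch of one possible way to do this: the object names wanted, isInstalled and notInstalled are made up for illustration, and nothing is installed by this code.

```r
# packages we would like to use ("stats" ships with R; the others appear in the examples above)
wanted <- c("stats", "ggplot2", "qqman")

# requireNamespace() returns TRUE if a package can be loaded and FALSE otherwise
isInstalled <- vapply(wanted, requireNamespace, logical(1), quietly = TRUE)
isInstalled

# names of the packages that still need to be installed
notInstalled <- wanted[!isInstalled]
notInstalled
```

The packages listed in notInstalled could then be installed with the function that matches their repository, as shown above.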
4. R documentation and getting help
R documentation for a function in a package loaded in memory can be easily retrieved by typing in the console ? followed by the name of the function of interest. Use ?? followed by a keyword to search the documentation of all installed packages, including those not yet loaded in memory. Some packages also have a long-form guide or documentation that can be explored using the function vignette():
?plot
??ggplot
vignette("qqman") # check out the cool Manhattan plots you can create with qqman
Within RStudio, documentation can be searched in the “Help” tab (bottom-right panel).
R is a fantastic tool but it might involve a steep learning curve. Sometimes you might run out of ideas on how to address a particular programming problem or not remember how to change the labels of the plot axes. The good news is that you are not alone! There is a large online community of R users that contribute to forums where people post and solve common questions. Examples of these are the RStudio community and Stack Overflow. Remember, the internet is your best friend!
5. Operators
Assignment operator
To create an object (or variable), give it a name and assign a value to it using the assignment operator <- or =. In this tutorial, we will use <-. In RStudio, you can easily write this operator in one go with the keyboard combination Alt + - on a PC, or Option + - on a Mac:
size <- 8
age = 77
size # type the name of an object to print it on screen
age
Arithmetic operators
Some common arithmetic operators are addition +, subtraction -, multiplication *, division /, and exponentiation ^:
5 + 2
5 - 2
5 * 2
5 / 2
5 ^ 2
# using variables that hold numbers
x <- 7
y <- 1
x + y
x - y
x * y
x / y
x ^ y
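Two more arithmetic operators that often come in handy are integer division %/% and modulo %% (the remainder of a division). The short sketch below also illustrates operator precedence; the object names are arbitrary and chosen only for this example.

```r
quotient <- 7 %/% 2    # integer division: 3
remainder <- 7 %% 2    # modulo (remainder of the division): 1
quotient
remainder

# exponentiation is evaluated before the unary minus and before multiplication
negSquare <- -2 ^ 2    # same as -(2 ^ 2): -4
parenSquare <- (-2) ^ 2 # parentheses change the order: 4
mixed <- 2 + 3 * 4     # multiplication before addition: 14
negSquare
parenSquare
mixed
```

When in doubt about the order of evaluation, use parentheses to make it explicit.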
Logical operators
These operators compare two values and return the logical values TRUE or FALSE depending on the result of the comparison:
m <- 2
n <- 3
m == n # is equal?
m != n # is not equal?
m > n # is it greater than?
m >= n # is it greater than or equal to?
m < n # is it less than?
m <= n # is it less than or equal to?
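Comparisons can also be combined with the operators & (AND), | (OR), and ! (NOT). A minimal sketch, reusing the values of m and n from above; the names of the result objects are arbitrary:

```r
m <- 2
n <- 3

bothTrue <- (m < n) & (m != n)    # AND: TRUE only if both comparisons are TRUE
oneFalse <- (m > n) & (m != n)    # FALSE, because the first comparison is FALSE
eitherTrue <- (m > n) | (m == 2)  # OR: TRUE if at least one comparison is TRUE
negated <- !(m == n)              # NOT: turns FALSE into TRUE and vice versa

bothTrue
oneFalse
eitherTrue
negated
```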
6. Data types
The most common data types in R are: numeric (decimal or real), integer, character (text strings), logical (TRUE or FALSE), and factor. In the example below, note that characters are written between quotes while numbers are not:
x <- 1
y <- "1"
class(x) # class() is a function that prints the data class of a given object
class(y)
class(7)
class("7")
class("Evomics")
class(T) # Same as using class(TRUE)
class(FALSE)
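Objects can be converted from one data type to another with the as.*() family of functions, and tested with the is.*() family. The sketch below shows a few of them; the object names are arbitrary.

```r
y <- "1"
class(y)                 # "character"

yNum <- as.numeric(y)    # convert the text "1" to the number 1
class(yNum)              # "numeric"
yNum + 1                 # 2

z <- 5L                  # the suffix L creates an integer
class(z)                 # "integer"

zChr <- as.character(z)  # convert the integer back to text
zChr                     # "5"

isNum <- is.numeric(yNum) # test whether an object is numeric
isNum
```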
A factor is a vector that stores a categorical variable. The different categories are called “levels”, which can be built from either numeric or character values. Internally though, the values are stored as integers, each associated with a level label. Levels are sorted in alphabetical order by default, but their order can be manually changed.
# create a vector
eyeColors <- c('brown','brown','green','blue','grey','blue','green')
# create a factor object using the function factor()
factorEye <- factor(eyeColors)
# print the set factor
print(factorEye)
print(nlevels(factorEye)) # use the nlevels() function to know how many levels the set factor has
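To change the order of the levels manually, one option is the levels argument of factor(). This is a minimal sketch; the object name factorEye2 and the chosen order are just for illustration.

```r
eyeColors <- c("brown", "brown", "green", "blue", "grey", "blue", "green")

# default: levels in alphabetical order
factorEye <- factor(eyeColors)
levels(factorEye)        # "blue" "brown" "green" "grey"

# set the order of the levels explicitly
factorEye2 <- factor(eyeColors, levels = c("brown", "green", "blue", "grey"))
levels(factorEye2)       # "brown" "green" "blue" "grey"

# the internal integer codes follow the order of the levels
as.integer(factorEye2)   # 1 1 2 3 4 3 2

# count how many times each level occurs
table(factorEye2)
```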
7. Data structures
The main data structures (or classes) in R are: vectors, matrices, arrays, data frames, and lists.
Vectors
A vector is an object consisting of a one-dimensional set of elements of the same data type.
x <- rep(1,10) # use ?rep() to know what this function is capable of doing
x
foo <- c("first", "second", "third")
foo
To combine vectors use the c() function:
x <- rep(1,5)
y <- 1:5
c(x,y)
You can perform arithmetic calculations on vectors using the mathematical operators seen before. Note that such calculations are applied to each element of the vector:
x + y
x / y
To access one element of a vector, specify its index (or position in the vector, starting from 1) inside square brackets []:
vec <- c("1st", "2nd", "3rd", "4th", "5th")
vec
vec[4]
To subset or extract a part of the vector (e.g. from the 2nd to the 4th element) use:
vec[2:4]
length(vec) # Explore how many elements the vector "vec" has with the length() function
vec[3:length(vec)]
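Besides positions, vectors can be subset with negative indices (to drop elements) or with a logical condition (to keep the elements for which the condition is TRUE). A short sketch; the vector nums and the result names are made up for this example.

```r
vec <- c("1st", "2nd", "3rd", "4th", "5th")

allButFirst <- vec[-1]         # a negative index drops that element
firstAndLast <- vec[c(1, 5)]   # several positions can be given with c()
allButFirst
firstAndLast

nums <- c(4, 8, 15, 16, 23, 42)
nums > 10                      # a comparison returns one TRUE/FALSE per element
bigNums <- nums[nums > 10]     # keep only the elements greater than 10
bigNums                        # 15 16 23 42
```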
Matrices
A matrix is a vector with two dimensions (rows and columns). Its elements must be of the same data type:
mx <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, ncol = 2)
mx
class(mx)
str(mx)
Explore the dimensions (number of rows and columns) of the matrix:
dim(mx)
To retrieve one element of the matrix, use its coordinates within [], in the form [row, column]:
mx[3, 1]
Subset one column or one row of the matrix:
mx[, 1] # 1st column
mx[3, ] # 3rd row
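Note that matrix() fills the matrix column by column by default. The sketch below contrasts this with the argument byrow = TRUE; the object names mxCol and mxRow are arbitrary.

```r
# default: values are filled column by column
mxCol <- matrix(1:6, nrow = 3, ncol = 2)
mxCol
mxCol[3, 1]   # 3

# byrow = TRUE fills the matrix row by row instead
mxRow <- matrix(1:6, nrow = 3, ncol = 2, byrow = TRUE)
mxRow
mxRow[3, 1]   # 5

# t() transposes a matrix (rows become columns)
dim(t(mxCol)) # 2 3
```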
Arrays
An array is a vector with one or more dimensions (a matrix is the two-dimensional case) that can only contain elements of the same data type. Here we create a three-dimensional array:
m <- c(5,9,3)
n <- c(10,11,12,13,14,15)
arr <- array(c(m, n), dim = c(3,3,2)) # create an array using the vectors m and n as input; the 9 values are recycled to fill the 18 cells
print(arr)
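Elements of an array are retrieved in the same way as in a matrix, with one index per dimension. A minimal sketch for the three-dimensional array created above, using the form [row, column, layer]:

```r
m <- c(5, 9, 3)
n <- c(10, 11, 12, 13, 14, 15)
arr <- array(c(m, n), dim = c(3, 3, 2))

dim(arr)       # 3 3 2
arr[2, 1, 1]   # row 2, column 1, layer 1: 9
arr[1, 3, 2]   # row 1, column 3, layer 2: 13
arr[, , 2]     # the whole second layer (a 3 x 3 matrix)
```

Because the nine input values are recycled, both layers of this particular array hold the same values.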
Lists
A list is a special type of vector where each element can be of a different data type and can have a different length:
myList <- list(head(cars), "I love coding in R", c(0.5, 0.7, 1.0, 0.1))
myList
Use double square brackets to extract the content of elements of a list:
myList[[1]]
You can name the elements of a list and retrieve them by name using the $ sign:
myList <- list(data = head(cars), mss = "I love coding in R", freq = c(0.5, 0.7, 1.0, 0.1))
myList$freq
myList$mss
names(myList) # Explore the names of the elements in the list
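Single and double square brackets behave differently on lists: [] returns a (smaller) list, while [[]] returns the content of the element itself. A short sketch using the list from above; the names sub1 and sub2 are arbitrary.

```r
myList <- list(data = head(cars), mss = "I love coding in R", freq = c(0.5, 0.7, 1.0, 0.1))

sub1 <- myList["freq"]     # single brackets: a list with one element
sub2 <- myList[["freq"]]   # double brackets: the numeric vector itself

class(sub1)                # "list"
class(sub2)                # "numeric"

# once extracted, the element can be indexed like any other vector
myList[["freq"]][2]        # 0.7
```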
Data frames
A data frame is a type of list where every element (column) has the same length and the columns can be of different data types.
Besides column names, data frames also have row names, which can be useful for filtering and subsetting data. Note though, that computations will only apply to the actual table, not to the row or column names.
Let’s create a data frame from scratch:
df <- data.frame(Numbers = 1:10, Letters = c("a", "e", "i", "o", "u", "x", "y", "z", "h", "m"), stringsAsFactors = FALSE)
df
Explore the data frame using these various functions:
class(df)
str(df)
nrow(df)
ncol(df)
rownames(df)
colnames(df)
head(df)
tail(df)
dim(df)
# change the row names manually
rownames(df) <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
rownames(df)
df
An element of a data frame can be extracted by specifying the row and column index within []:
df[3, 2]
To subset a column of a data frame, you can use its index or name:
df[, 2]
df$Letters
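Rows can be selected by name or filtered with a logical condition placed in the row position of []. The sketch below rebuilds the data frame from above so that it runs on its own; the result names rowC, big and vowels are made up for illustration.

```r
df <- data.frame(Numbers = 1:10,
                 Letters = c("a", "e", "i", "o", "u", "x", "y", "z", "h", "m"),
                 stringsAsFactors = FALSE)
rownames(df) <- LETTERS[1:10]   # LETTERS is a built-in vector "A", "B", "C", ...

# select a row by its name
rowC <- df["C", ]
rowC

# keep only the rows where Numbers is greater than 7
big <- df[df$Numbers > 7, ]
big

# keep only the rows where Letters is a vowel
vowels <- df[df$Letters %in% c("a", "e", "i", "o", "u"), ]
vowels
```

Note the comma after the condition: leaving the column position empty means that all columns are returned.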
You can join two data frames either row- or column-wise. For this you can use the functions rbind() or cbind(), respectively:
df1 <- data.frame(Letters = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J"), Numbers = 1:10)
df2 <- data.frame(Numbers = 21:30, Letters = c("Z", "I", "J", "A", "E", "F", "G", "B", "C", "D"))
df1
df2
cbind(df1, df2)
rbind(df1, df2)
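It is worth looking closely at what these two functions do when the columns are in a different order. In the small sketch below (two-row data frames made up for this example), rbind() matches the columns by name, while cbind() simply places the tables side by side and keeps the duplicated column names.

```r
df1 <- data.frame(Letters = c("A", "B"), Numbers = 1:2, stringsAsFactors = FALSE)
df2 <- data.frame(Numbers = 21:22, Letters = c("Z", "I"), stringsAsFactors = FALSE)

# rbind() stacks the rows and matches the columns by name
stacked <- rbind(df1, df2)
stacked
colnames(stacked)       # "Letters" "Numbers"

# cbind() puts the columns next to each other
sideBySide <- cbind(df1, df2)
sideBySide
colnames(sideBySide)    # "Letters" "Numbers" "Numbers" "Letters"
```

Duplicated column names make later subsetting by name ambiguous, so it is usually a good idea to rename the columns before or after using cbind().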
8. Functions
A function is a group of statements that performs a specific task. Functions have four main components: name, arguments (values that are passed to the function, optional), body (or statements), and return value. They have this general structure:
function_name <- function(argument1, argument2, ...) {
  Statements
}
R has many built-in functions. Let’s explore some of them:
x <- c(1:50)
x
min(x)
max(x)
sum(x)
mean(x)
log(x)
x/2
round(x/2)
sort(x)
To set a working directory, or to find out which directory we are currently working in, use:
# set a working directory
setwd("/home/popgen/workshop_materials/20_R_intro")
# the same can be achieved in this way
path <- "/home/popgen/workshop_materials/20_R_intro" # create a character vector with the path to the working directory
setwd(path)
getwd() # print working directory
Let’s create a function from scratch:
calcMean <- function(x) {
  sum(x) / length(x) # the value of the last statement is returned
}
vec <- c(1,2,3,4,5,6,7,8,9,10) # create a numeric vector
calcMean(vec) # apply the new function to the vector
mean(vec) # Hey, there is already a base function that calculates the mean
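Arguments can be given a default value, which is used when the caller does not supply one. The sketch below is a hypothetical variant of the function above (the name calcMeanNoNA and the argument removeNA are made up for this example) that can optionally ignore missing values (NA) and uses an explicit return():

```r
calcMeanNoNA <- function(x, removeNA = FALSE) {
  # drop the missing values only if the caller asks for it
  if (removeNA) {
    x <- x[!is.na(x)]
  }
  return(sum(x) / length(x))
}

vecNA <- c(1, 2, 3, NA)

withNA <- calcMeanNoNA(vecNA)                      # NA, the missing value propagates
withoutNA <- calcMeanNoNA(vecNA, removeNA = TRUE)  # 2
withNA
withoutNA
```

The base function mean() offers the same behaviour through its argument na.rm.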
9. Data visualisation
Base graphics
R is an excellent tool to generate publication-quality plots. Many graphic functions come by default with the installation of R; these are known as “base” graphics.
Use the command demo(graphics) to get an idea of the different plots that can be created with R base graphics.
Let’s make a scatterplot using the base dataset called “iris”:
data(iris)
plot(iris$Petal.Length, iris$Petal.Width)
plot(iris$Petal.Length, iris$Petal.Width, xlab = "Petal length", ylab = "Petal width") # modify the x and y axis labels
Now color the points by the categorical value (factor) of the column “Species”:
plot(iris$Petal.Length, iris$Petal.Width, xlab = "Petal length", ylab = "Petal width", col=iris$Species)
legend("bottomright", legend = levels(iris$Species), col = seq_along(levels(iris$Species)), pch = 1) # one legend entry and one colour per species
Grid graphics
Grid graphics are an alternative to base R graphics. They provide more control and flexibility to modify plot appearance. ggplot2 is perhaps the most popular grid-graphics package nowadays.
Let’s use it to create the same plot we did before:
library(ggplot2)
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
  geom_point() +
  labs(x = "Petal length", y = "Petal width")
To learn more about how to customise plots generated with ggplot2, check out its website https://ggplot2.tidyverse.org or https://ggplot2.tidyverse.org/reference/theme.html.
10. Read and write data
There are specific functions to import and export text data in a variety of formats. In general, write.table() is used for saving data as a delimited text file (e.g. tab-delimited when sep = "\t"), while write.csv() is for saving data as a comma-delimited file. Use ?write.table and ?write.csv to see which arguments each function accepts.
Save two of the previously created data frames:
write.table(df1, "example1.txt", sep = "\t", row.names = FALSE, quote = FALSE)
write.csv(df2, "example2.csv", row.names = FALSE)
Let’s review the arguments used for saving these files:
- df1 is the object we want to save
- "example1.txt" is the name (as a character string) of the output file
- sep = "\t" sets the tab character as delimiter
- row.names = FALSE means that row names are not written to the output file
- quote = FALSE means that character strings are not surrounded by double quotes (by default they are)
Now load these files, assigning them to new objects:
# load a tab-delimited (.txt) text file
dat1 <- read.table("example1.txt", header = TRUE, sep = "\t", stringsAsFactors = FALSE)
# load a comma-delimited (.csv) file
dat2 <- read.csv("example2.csv", header = TRUE)
Explore what the imported data look like:
head(dat1)
str(dat1)
head(dat2)
str(dat2)
Traditional R functions (e.g. read.table()) load an entire dataset into memory at once and can be slow. This becomes a problem when dealing with large genomic datasets, as they generally consist of several GB. An alternative to read.table() is fread(), a function from the package data.table that parses large files much faster and can load only the columns or rows you need (arguments select, nrows, and skip), which reduces memory use.
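As a rough sketch of how fread() can be used, the code below writes a tiny tab-delimited file to a temporary location and reads back only one column and the first three rows. The file content and object names are made up for this example, and the code assumes nothing about your setup: if the package data.table is not installed, it falls back to read.table() with the argument nrows.

```r
# create a small example file in a temporary location
tmp <- tempfile(fileext = ".txt")
toy <- data.frame(id = 1:5, value = c(0.1, 0.2, 0.3, 0.4, 0.5))
write.table(toy, tmp, sep = "\t", row.names = FALSE, quote = FALSE)

if (requireNamespace("data.table", quietly = TRUE)) {
  # read only the column "value" and the first 3 rows
  dt <- data.table::fread(tmp, select = "value", nrows = 3)
} else {
  # base R fallback: read the first 3 rows, then keep the column "value"
  dt <- read.table(tmp, header = TRUE, sep = "\t", nrows = 3)["value"]
}
dt
```

Reading only what is needed matters little for a five-row file, but it can make a big difference with files of several GB.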
To save plots using the command line, you first need to open a graphics device, create the plot, and then close the device. There are several functions to save graphics, which are generally named after the format in which the plots are saved (e.g. png(), pdf()):
# To save the scatter plot we made before in png format
png(filename = "scatterplot-in-base-graphics.png")
plot(iris$Petal.Length, iris$Petal.Width, xlab = "Petal length", ylab = "Petal width", col=iris$Species)
legend("bottomright", legend = levels(iris$Species), col = seq_along(levels(iris$Species)), pch = 1) # one legend entry and one colour per species
dev.off()
To save a plot created with ggplot(), you can either use the function ggsave(), or do it in the traditional way, except that you first need to save the plot in a variable and then print it while the device is open:
p <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
geom_point() +
labs(x = "Petal length", y = "Petal width")
ggsave(filename = "scatterplot-in-ggplot-graphics1.png", plot = p)
# alternatively
png(filename = "scatterplot-in-ggplot-graphics2.png")
print(p)
dev.off()
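The same open-plot-close pattern works for other formats. The sketch below saves a base plot as a PDF and sets the size of the figure; the file name is made up, and the file is written to a temporary directory here so that the example does not clutter your working directory.

```r
# build the path of the output file inside R's temporary directory
outFile <- file.path(tempdir(), "scatterplot-example.pdf")

pdf(file = outFile, width = 6, height = 4)   # open the device; size is given in inches
plot(iris$Petal.Length, iris$Petal.Width, xlab = "Petal length", ylab = "Petal width")
dev.off()                                    # close the device to finish writing the file

file.exists(outFile)                         # check that the file was created
```

Forgetting dev.off() is a common mistake: the file stays incomplete until the device is closed.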
In RStudio, you can easily save a plot by using the “Export” tab.
11. Other considerations
- Rows are designated for observations and columns for variables
- A column can only contain one data type
- Avoid giving objects the same name as R functions, e.g. c() or t()
- When naming columns, avoid using . or - or starting a name with a number. If a separator is desired for naming columns, use _ instead
- When naming objects, prefer camel case (e.g. camelStyle) over dots (e.g. camel.style)
- Factors are not equivalent to characters, thus pay attention to the structure of imported data. In R versions before 4.0.0, if stringsAsFactors = FALSE is not set, characters will be converted to factors, and functions do not work the same for characters and factors
Further recommendations on R style can be found here http://jef.works/R-style-guide/
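To see why the difference between characters and factors matters, consider converting text that looks like numbers. A minimal sketch with made-up values:

```r
chr <- c("10", "20", "30")   # numbers stored as text
fct <- factor(chr)           # the same values stored as a factor

fromChr <- as.numeric(chr)   # 10 20 30, as expected
fromFct <- as.numeric(fct)   # 1 2 3, the internal integer codes of the levels!
fromChr
fromFct

# to recover the original numbers from a factor, convert to character first
fixed <- as.numeric(as.character(fct))
fixed                        # 10 20 30
```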
12. R Markdown
Nothing makes sense in science if it is not reproducible. Just as molecular biologists keep a notebook where they record the different experiments conducted in the lab, we should keep detailed notes of the code and analyses we perform.
You can create R notebooks using Markdown and RStudio. Markdown is a lightweight markup language for formatting plain text. To learn more about R Markdown, go to this website https://rmarkdown.rstudio.com/lesson-2.html or check out the R Markdown reference guide https://rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf
13. Session info
With the function sessionInfo() you can record the R version and the packages that were used in this tutorial. It is always good practice to save this kind of information for reproducibility.
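One possible way to keep this information is to write it to a text file next to your results. In the sketch below the file name is made up, and the file is written to a temporary directory only to keep the example self-contained.

```r
info <- sessionInfo()   # collect the R version, operating system, and loaded packages
print(info)

# capture what would be printed on screen and write it to a text file
infoFile <- file.path(tempdir(), "sessionInfo.txt")
writeLines(capture.output(print(info)), infoFile)

file.exists(infoFile)
```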
14. Further readings and acknowledgements
One last thing to run in the console:
library(meme)
meme("http://i0.kym-cdn.com/entries/icons/mobile/000/000/745/success.jpg", "Mission accomplished", "Tell me more about R")
Congratulations on completing the R tutorial! 🎉 We hope you found it useful and are motivated to continue learning about R. Below are a few links that served as great inspiration to compile this tutorial.
- An Introduction to R, manual by the r-project https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf
- Intro to R, by Data Carpentry https://datacarpentry.org/R-genomics/01-intro-to-R.html
- Introduction to R, by the Monash Bioinformatics Platform https://datacarpentry.org/R-genomics/01-intro-to-R.html
In case you want to run this tutorial on your own machine in the future, install R and RStudio by following the instructions given on their respective websites. For R go to https://www.r-project.org/ and for RStudio go to https://www.rstudio.com/. Happy coding!