1 Introduction

Welcome to the R and ggplot2 session. The purpose of this file is to guide you through the exercises. In this session, we will work in RStudio and use an R Markdown file. R Markdown documents are good for keeping the R code, plots and descriptions (structured text) in one place. It enables you to make nice reports from the data that can be exported as html, pdf etc. You can find more information on R Markdown here and a cheatsheet for R Markdown here.

Some general useful information:

If you want to insert a new code chunk, you can do it with ctrl + alt + i (Mac: opt + cmd + i). Pressing ctrl + enter (Mac: cmd + enter) sends the line of code where the cursor is (or the selected code) to the console and runs it. You can run whole chunk by pressing the green triangle on the right side.

2 Load all libraries and set working directory

Load all the packages you need using the syntax from the slides. The code is included below.

# In case that you need to install a package before loading it, you can simply use the command install.packages("package_name")
library(ggplot2)
library(data.table)
library(dplyr)
library(tidyr)
library(purrr)
library(ape)
library(tidyverse)
library(stringr)

You can check which packages are already installed with the command below. In case that you need a new package, you could get it from CRAN, Bioconductor or look for the respective GitHub repository.

# If you would like to list your already installed packages
library()

# Check your current R version and packages
sessionInfo()

You can set a working folder in R. It is the default folder where you read and write files. We will set it to the r_ggplot2 folder where our training data can be found. However, you can read and write files from/to other directories if you provide the right path.

# Check the current working folder
getwd()
## [1] "/Users/Janina/Documents/Teaching/Teaching"
# Set the working folder
#setwd("/home/genomics/workshop_materials/r_ggplot2")

3 Loading and checking your data

Dataset description

This dataset comes from Milos’s experiment with Arabidopsis thaliana plants. Three genotypes of plants were used for the experiment - wt (wild type) and psbo1cr and psbo2cr mutants. The mutants have non-functional psbO1 or psbO2 genes, encoding slightly different paralogs of the PsbO protein. PsbO is a subunit of the photosystem II, one of the main components of photosynthetic apparatus in a chloroplast. Plants completely without a PsbO protein (double mutants) would die, however, the single mutants are able to survive and can be used to study the differences between PsbO1 and PsbO2 proteins.

The purpose of the experiment was to see whether the mutations give the plants some advantages or disadvantages under drought (water stress) conditions or salt stress conditions. The plants were cultivated for several weeks under normal condition (control), drought stress (less watering) and salt stress (watering with NaCl solution). At the end of the experiment, the chlorophyll fluorescence was measured with a chlorophyll fluorescence imaging device, giving us area of the leaf rosette and Fv/Fm (denoted by the QY_max column of the dataset below).

Loading the dataset

Load the dataset and take a look at it:

# Import the data and assign it the name "plantData" using <-
plantData <- read.csv(file = "plantData.csv")

With the above command you have loaded your dataset into R and can start working on it!

Some explanations:

  • Everything that is after # is regarded as comment and not code.
  • plantData is a name of the new object.
  • read.csv is a function that we used to read a comma-separated file ending with .csv.
  • In the brackets () after the function name are the arguments to the function (now we have only the file argument, if there would be more, you would separate them with comma ,). It is not needed to name the arguments if they are in the right order (you can use only plantData <- read.csv("plantData.csv")).
  • <- is used to assign the output of the function to the plantData object.

Now we do some check ups on our dataset. You do not need to run these commands but they are very useful to get familiar with a dataset.

The first commands are good for big datasets. As our dataset is quite small, you can also view the entire dataset in the console by just running the name of the dataset.

# Print the first six rows of the dataset
head(plantData)

# Summary of the dataset
summary(plantData)

# What type of data do I have?
class(plantData)

# Column names of the dataset
colnames(plantData)

# Count the number of rows in your dataset
nrow(plantData)

# Printing whole dataset in the console (usable for small datasets)
plantData

# look at a selection of columns
# This command shows the first 5 rows of the first and the third column
plantData[1:5, c(1, 3)]
# c() function combines several values in one object (vector in R terminology)
# 1:5 makes vector of values 1, 2, 3, 4, 5
# the square brackets are used for subsetting of dataframe or matrix in form [row, columns]

# Viewing the dataset in separate window
View(plantData)

The class() command tells you what type of data object you are dealing with. In this case, we have a data.frame object. You can read more about the different data types and data objects in R here.

Column description

The column names in this dataset are mostly self-explanatory.

  • condition defines whether the plant was cultivated under control conditions (c), salt stress (s) or drought stress (d).
  • plant_id is the ID of the particular plant.
  • genotype denotes the genotype of the particular plant.
  • time is a measurement of time in minutes (time since the start of all measurements). It could be useful to check whether or not the measurements were influenced by the time duration the plants were standing by the machine and waiting for the measurement.
  • size_mm2 is the area of the leaf rosette from the top view.
  • QY_max is a maximum quantum yield of photosystem II, a term Fv/Fm is also frequently used for this variable. It is the most important variable that we measured, giving us some measure of how well the photosystem II works. The value is mostly around 0.83 for healthy plants. Unhealthy plants have lower values.
  • Columns NPQ_L1 to NPQ_L4 are measurements of non-photochemical quenching in different time points after providing actinic light to the plant. We will use them later. Non-photochemical quenching is a process of safe dissipation of excitation energy to heat in order to protect the photosynthetic machinery in the case that there is more light than the photosynthetic apparatus is able to use. As chlorophyll fluorescence quenching analysis is not easy to explain, we will not go into details now. However, you can read e.g. this article if you are interested in that.

Load additional dataset for repotting of plants

We have a list of the plants that were repotted during the experiment. The info about repotting can be found in a separate file. We would like to add that info in case repotting influenced the results of this experiment.

Exercise 1: Load the info_repotted.csv dataset.

First, use the function read.csv() to read the data from file "~/workshop_materials/r_ggplot2/info_repotted.csv", and assign it to a variable called df (dataframe). Then you can look at the data using some of the functions used above.

# Import additional info which plants have been repotted
df <- read.csv("info_repotted.csv")
df
##            info
## 1  c09_repotted
## 2  c12_repotted
## 3  c36_repotted
## 4  s09_repotted
## 5  s15_repotted
## 6  s21_repotted
## 7  s34_repotted
## 8  s38_repotted
## 9  d03_repotted
## 10 d15_repotted
## 11 d19_repotted
## 12 d27_repotted

Several tips:

  • Use the Tab completion as in the Unix terminal. It works both for file paths, object names and function names.
  • F1 will bring help for the function where the cursor is.

If you succeeded, you can see that the list in the file is not in a very useful format. We first need to split the column info into two columns. Then we can intersect the two dataframes based on the column plant_id (thus adding the term “repotted” to the corresponding plants). For similar tasks like extracting something from longer strings, regular expressions are often very useful, so it is well worth to learn them.

Manipulate data with dplyr or Base R

We will show you different possibilities to manipulate & filter your data. For every single step that you would like to perform on your data, there are several methods existing in R. We will present the dplyr/tidyr package and the Base R syntax here, please run both possibilities and compare the results. In the end (for your own scripts and analysis) you should find the method which is most suitable for you.

dplyr/tidyr The dplyr package uses a similar syntax to Unix, the %>% symbol in dplyr simply represents the | (pipe) symbol in Unix. Find a cheat sheet here

Base R The base R language doesn’t use this, an input for base R should be an argument in brackets (x = input). (Actually, there is an base R version of pipe in last versions of R but it is not used much.) The dollar sign $ can be used to access a column in data.frame by its name. Here we will use df$info to access the column named info in the data.frame named info. Similarly, you can use plantData$plant_id to access the plant_id column of plantData data.frame.

# Split columns

## Possibility 1: use "separate" function from the tidyr package
df2 <- df %>% 
  tidyr::separate(col = 1, into = c("plant_id", "info"), sep = "_")

## Possibility 2: use sub from base R and regular expressions to exchange part of the line for nothing
df.2b1 <- sub(pattern = "_.*$", replacement = "", x = df$info) # exchange the part after underscore for nothing ("")
df.2b2 <- sub(pattern = "^.*_", replacement = "", x = df$info) # exchange the part before underscore for nothing ("")
df.2b <- data.frame(plant_id = df.2b1, info = df.2b2) # put the two vectors to data.frame
# Regular expression explanation:
# _ means literally _
# . means any character (number, alphabet, etc.)
# * means that the preceding character is used 0 or more times
# $ means the end of the line
# ^ means the beginning of the line
# So the first regular expression ("_.*$") means that the sub function should search for underscore followed by any character that is there 0 or more times, which should be followed by the end of the line.


# Intersect both datasets to add repotting info based on plant_id
## Possibility 1: use merge from base R
merging_a <- merge(plantData, df.2b, by="plant_id", all=TRUE)

## Possibility 2: use the dplyr package
merging_b <- plantData %>% 
  dplyr::left_join(df.2b, by=c("plant_id"))

# We will continue with the merging_b dataframe, so we will rename it back to plantData (note that this overwrites the original plantData variable, which is fine in our case, but if you wanted to keep your older dataframe, you can assign it to a unique name.)

plantData <- merging_b

Clean-up

We will clean-up our dataset before we create some plots. First, we will re-name the columns NPQ_L1, NPQ_L2, NPQ_L3 and NPQ_L4 into NPQ_t1, NPQ_t2, NPQ_t3 and NPQ_t4 because they represent timepoints of measurements. Here is the code how to do it for NPQ_L1 and NPQ_L2 column.

# Rename columns
# The pattern for dplyr is rename(new_col_name=old_col_name).
# All your changes will not be overwritten unless you assign it a name with <-
plantData <- plantData %>%
  dplyr::rename(NPQ_t1 = NPQ_L1, NPQ_t2 = NPQ_L2)

Exercise 2: Rename columns NPQ_L3 and NPQ_L4 into NPQ_t3 and NPQ_t4.

plantData <- plantData %>%
  dplyr::rename(NPQ_t3 = NPQ_L3, NPQ_t4 = NPQ_L4)

# Another example how to rename your columns based on position of column
# colnames(plantData)[7:10] <- c("NPQ_t1", "NPQ_t2", "NPQ_t3", "NPQ_t4")  # The line is "commented" to not run it.

Now we will add a column which explains the condition better (c -> control, d -> drought, s -> salt). We will name the new column condition_info.

# Create a new column condition_info
plantData <- plantData %>%
  dplyr::mutate(condition_info = case_when(
    (condition == "c") ~ "Control",
    (condition == "d") ~ "Drought",
    (condition == "s") ~ "Salt"
    ))

# Move the new condition_info column next to the old condition column (new columns are added to the far right side of the df by default)
plantData <- plantData %>% 
  dplyr::relocate(condition_info, .after = condition)

mutate() can also be used in conjunction with ifelse() to create a new column based on conditions applied to an existing column. Think of ifelse() as the quick decision-maker. It takes a look at each piece of the data and asks a question. If the answer is yes (TRUE), it gives one response; if is no (FALSE), it gives another one.

We can use these two functions to create a new column called size_category based on the values of size_mm2.

# Add a new column size_category using mutate and ifelse
plantData <- plantData %>%
  dplyr::mutate(size_category = ifelse(is.na(size_mm2), "Not measured", # Check for NA values first
                                       ifelse(size_mm2 < 500, "Small",  # Condition for small size
                                              ifelse(size_mm2 <= 2500, "Medium", # Condition for medium size
                                                     "Large"))) # Default condition for large size
  )

In this example: is.na(size_mm2) checks if the size_mm2 value is not available. If this is TRUE, “Not measured” is returned. size_mm2 < 500 checks if the size_mm2 value is less than 500. If this is TRUE, “Small” is returned. size_mm2 <= 2500 checks if the size_mm2 value is less than or equal to 2500. If this is TRUE, “Medium” is returned. If none of the above conditions are TRUE, “Large” is returned by default.

After that, we will remove plants that have died during the experiment. These plants have NA values in columns size_mm2 and QY_max. The NPQ values were not measured for all plants, and we are not going to plot those columns yet, so it is fine for now that there are some NA values there.

# Remove all NA values from size_mm2 & QY_max column, as this indicates that plants died during the experiment and we do not have data for them
pD_clean <- plantData %>%
  filter(!is.na(size_mm2), !is.na(QY_max))

Exercise 3: Check how many lines we have removed.

Hint: Use the nrow() function or dim() function on the original plantData data.frame and on the new pD_clean data.frame.

# Dimensions of the original data.frame (number of rows and number of columns)
dim(plantData)
# Dimensions of the cleaned data.frame
dim(pD_clean)

Exercise 4: Filter the plantData dataset so that you will only keep the plants that have died in a new dataframe. Make sure to assign it a new name, so you do not overwrite your original dataset which we will use for plotting in the next section.

pD_deadPlants <- plantData %>%
  filter(is.na(size_mm2), is.na(QY_max))

Exercise Optional: Let’s say we are working with the plantData dataframe and want to simplify the analysis by focusing on only a few columns. Use select() to create a new dataframe that includes only the plant_id, genotype, condition, and the newly created size_category columns.

simplifiedPlantData <- plantData %>%
  dplyr::select(plant_id, genotype, condition, size_category)

Congrats! You are now good to go to create some nice plots with your pD_clean dataset! :)

4 Creating plots with ggplot2

There are at least three primary graphics programs available within the R environment. A package for base R graphics is installed by default and provides a simple mechanism to quickly create graphs. lattice is another graphics package that attempts to improve on base R graphics by providing better defaults and the ability to easily display multivariate relationships. In particular, the package supports the creation of trellis graphs - graphs that display a variable or the relationship between variables, conditioned on one or more other variables. Finally, ggplot2 is a graphics program based on the grammar of graphics ideology, and will be the primary focus of this course due to the extensive documentation and resources available for it.

The syntax for a basic plot using ggplot2 starts with the ggplot() function and we specify the required data (pD_clean) and variables to be plotted as shown below. The geometric layer elements are added on with a + followed by a specific geom_*() function. Learn more about geoms by running help.search() or ??geom_ below to get a list of available geoms and to familiarize yourself with the help documentation. Use geom_point() to display the scatter plot. Run help(geom_point) to check out it’s help documentation.

4.1 Plot one genotype first

Start with making a basic boxplot of just the wt plants.

Exercise 5: Make a new dataframe called pD_clean_wt that will contain only the plants with genotype “wt”.

Hint: Use the function filter from the package dplyr. As an argument for the filter function you should use genotype == "wt". We use the double equal sign ==, because the algorithm is going through each row of the dataframe and checking whether the genotype is wt or something else, returning TRUE or FALSE. Single equal sign = is used for assigning values to arguments.

# Subset your data to only have wt genotype
pD_clean_wt <- pD_clean %>%
  filter(genotype == "wt")

We will use some of the “geoms” (geometric objects) introduced earlier to make our plots today. While there are many geoms possible (list of geoms), we will focus on the following geoms in this session:

  • geom_boxplot() - A classic box and whiskers plot
  • geom_point() - A plot with data points (like a typical scatterplot, but it can also plot a numeric variable against a categorical variable)
  • geom_jitter() - Similar to geom_point() but it adds some jitter to the points in case they’re overlapping (more applicable when plotting a numeric variable with a categorical variable)
  • geom_smooth() - Typically used as an additional layer to the scatterplot to help identify trends or patterns in the data.

Now make the boxplot.

The main ggplot function needs as arguments the dataframe and what variables in the dataframe need to be plotted. The dataframe is simply provided to the data argument. Information on what is plotted is provided using the aes (aesthetic mappings) function. This allows us to tell ggplot what variables we want to plot (that is, which columns of our dataframe should be on the x and y axes).

# Creating the background of ggplots
p0 <- ggplot(data = pD_clean_wt, aes(x=condition_info, y=QY_max))
p0

# Given this background, we can now add a boxplot "layer" to the background. ggplot is modular, so we can sequentially add layers to it. 

# Basic boxplot of condition_info with QY_max
p1 <- ggplot(data = pD_clean_wt, aes(x=condition_info, y=QY_max)) +
  geom_boxplot()
p1

# Fun fact - you can also make the p1 plot using the code p0 + geom_boxplot() 

You see that you can create an object with the output from the ggplot() function. When you run the name of the object (p1 in this case), the plot will show.

Comment on biology: The result is little bit puzzling, as the stressed plants under drought conditions and salt stress conditions have higher values of QY_max. However, the differences are very small.

Exercise 6: Make a similar boxplot that will show size of the plants (size_mm2) instead of QY_max.

p2 <- ggplot(pD_clean_wt, aes(x=condition_info, y=size_mm2)) +
  geom_boxplot()
p2

Now we will make a stripplot (single-axis scatter plot) to have a better sense of the real data.

# Basic scatterplot with size_mm2
p3 <- ggplot(pD_clean_wt, aes(x=condition_info, y=size_mm2)) + 
  geom_point()
p3

If we have multiple values (or data points) several times, we cannot see them. Thus, it is better make the points “jitter” a bit (compare p3 above and p3a below for clarity on “jitter”). In ggplot, this can be done using the function geom_jitter(). The width argument within geom_jitter defines how scattered the points are (in our case, it will scatter the points horizontally). You can try to omit the width argument, which will default to a value 0.4, according to “help”, but as you’ll notice, the points from the categories on the x-axis end up being too close to each other to distinguish which category they are from.

p3a <- ggplot(pD_clean_wt, aes(x=condition_info, y=size_mm2)) + 
  geom_jitter(width = 0.1)
p3a

Exercise 7: Make the same stripplot for the genotype psbo1cr, showing QY_max values for different conditions.

Hint: You should first filter the dataframe called pD_clean to get only the plants of the right genotype. Then, make the plot.

# Subset your data to only have psbo1cr genotype
pD_clean_psbo1cr <- pD_clean %>%
  filter(genotype == "psbo1cr")
p4 <- ggplot(pD_clean_psbo1cr, aes(x=condition_info, y=QY_max)) + 
  geom_jitter(width = 0.1)
p4

Comment on biology: You can see that the QY_max values for psbo1cr are lower than for the wt, but they do not differ much for the different conditions.

Compare the code for the boxplots and stripplots. The first line is the same for both types of plots, and then we can add either boxplots or points or jitter points as layers, highlighting the modular nature of ggplot.

Now we will plot the correlation of size_mm2 vs QY_max (size is dependent on QY_max).

# Plot correlation as scatterplot
p5 <- ggplot(pD_clean_wt, aes(x=QY_max, y=size_mm2)) + 
  geom_point()
p5

# Add a regression line with standard error and a title to the plot
p6 <- ggplot(pD_clean_wt, aes(x=QY_max, y=size_mm2)) + 
  geom_point() + 
  geom_smooth(method = lm) + 
  ggtitle("QY_max vs size_mm2 scatterplot for wt genotype")
p6

# We can colour the points according to condition
p6a <- ggplot(pD_clean_wt, aes(x=QY_max, y=size_mm2)) + 
  geom_point(aes(colour = condition_info)) + 
  geom_smooth(method = lm) +
  ggtitle("QY_max vs size_mm2 scatterplot for wt genotype")
p6a

Comment on biology: As before, this is again a bit puzzling. We would expect rather positive correlation than negative. The cause is that the stressed wt plants had slightly higher QY_max, but they were smaller. However, the differences in QY_max are very small.

Exercise 8: Make similar plot for control conditions and all genotypes.

Hint: You should first filter the dataframe called pD_clean to get only the plants of the right condition_info. Then, make the plot.

pD_clean_control <- pD_clean %>%
  filter(condition_info == "Control")
p6b <- ggplot(pD_clean_control, aes(x=QY_max, y=size_mm2)) + 
  geom_point(aes(colour = genotype)) + 
  geom_smooth(method = lm) +
  ggtitle("QY_max vs size_mm2 scatterplot for control condition")
p6b