R intro with ggplot - Evolution and Genomics

Introduction

In this exercise we will be going through some very introductory steps for using R effectively. We will read in, manipulate, analyze and export data. We will be using RStudio which is a user friendly graphical interface to R. Please be aware that R has an extremely diverse developer ecosystem and is a very function rich tool. The steps used to complete each step of this exercise can be completed in a variety of ways. The steps shown here just demonstrate one possible solution.

You can either double-click the RStudio logo in the desktop of your DNS server or type ‘http://Your-DNS-id:8787’ in your local browser.

!!! RStudio seems to crash in the cloud very often. So, we highly recommend you to save the R session as often as you would like to as you go along with this exercise.

Important to remember! You can use tab-completion like in bash to complete your commands and up arrow to go through your history. You can also get help with any R function while in R! This can be done by typing a ? ahead of the command:

?read.table

Additionally, the internet has a large number of useful resources:

The R Project Homepage: http://www.r-project.org
Quick R Homepage: http://www.statmethods.net
Bioconductor: http://www.bioconductor.org
An Introduction to R (long!): http://cran.r-project.org/doc/manuals/R-intro.html
Google – there are tons of tutorials, guides, demos, packages and more

In this exercise we will be looking at and analyzing data in a “data frame”. A data frame is basically R’s table format. The data frame we will be using is viral abundance in the stool of healthy or sick individuals.

The context of the data is not important for completing the exercise. The goal of this exercise is to familiarize you with working with data in R, so the lessons learned working with this data set should be extendable to a variety of uses.

Getting data into R

You should first set your working directory (setwd) to the R tutorial directory.

setwd("/home/genomics/workshop_data/R_tutorial")

You should be able to find all the files you need for this exercise in this directory. You can check the files you have in this directory by using the command:

list.files()

Then you should use the read.table function to read the files into RStudio.

[box]Remember, you can get detailed information and examples for any R function by preceding the ?. For example, ?read.table.[/box]

For simplicity, we will just rename our data tables “healthy” and “sick”:

healthy <- read.table("myoviridae_healthy.txt")

sick <- read.table("myoviridae_sick.txt")

Note that when a file outside of R is referenced it must appear in quotes. Go ahead and take a look at the data frame by simply typing the R objects healthy and then sick.

You should see the full data tables spill out on the screen. Since this data table is large it will be difficult to look at in its entirety, fortunately we can use some basic commands to view small slices of the full data table. You can slice data using the following convention:

R_object[rows,columns]

The rows and columns can be separated by a : to describe a range. For example, if we just wanted to look at the first 3 rows of a our data we would type:

healthy[1:3,]

To look at the first three columns we would type:

sick[,1:3]

Note the importance of the placement of the comma for selecting either rows or columns of data.

You can also use the head command (type ?head to get an idea of what it does) to display the top portion of our data table.

?head

Activity

Exercise 1: Look at the first few rows of the bac data table using the head function:

[toggle hide=”yes” border=”yes” style=”white”]

head(healthy)

[/toggle]

You should spend some time slicing the data table up in various ways.

You can specify a column of data using the $ before the column name. Try defining the Tevenvirinae column using $Tevenvirinae on the sick data frame you just imported.

[toggle hide=”yes” border=”yes” style=”white”]

[/toggle]

You should become comfortable with defining subsets of the data table before moving forward. Please spend some time defining various subsets of the data table and observing the output.

Exercise 2: Creating new data tables from pre-existing data tables.

You can create new data tables with subsets of the original data table. You do this by assigning a subset of data using <- . The basic convention for creating a new data table (or any other data structure) is:

new_R_object <- data.frame(old_R_object(functions))

For example, create a new data table with just Tevenvirinae. For simplicity, just use the *_tev so you won’t have to type Tevenvirinae any more. Remember, tab-completion is supported in RStudio!

healthy_tev <- data.frame(healthy$Tevenvirinae)

sick_tev <- data.frame(sick$Tevenvirinae)

Look at your newly created data frames so they look correct.

Exercise 3: Data frame transformations

Packages can be installed from command input, or via searching/installing in RStudio. Packages are typically stored in the Comprehensive R Archive Network or CRAN, but they can also be pulled from GitHub or loaded manually.

For this exercise we will use the vegan package from CRAN archive. Vegan is a well-developed community ecology package for R which implements a number of ordination methods and diversity analysis on ecological data. You generally install packages in R using the install.packages() command. But to make it quick for you, we have already installed this package in the cloud!

With R packages, you need to load them into your R session using the library command:

library("vegan")

While there are many native R functions for transforming data, we will take advantage of the decostand functions of vegan to do some common ecological data transformations. It’s outside the scope of this short R introduction to explain the transformation functions in the vegan package, but you can read about vegan here, and learn more about the decostand function and view some examples by typing:

?decostand

Let’s start by transforming our healthy and sick data frames using the total method of decostand. The basic syntax for this is below. The transformation method can be substituted, and you should name your R object something memorable such as healthy_total:

new_R_object <- decostand(data.frame, method=”total”)

[toggle hide=”yes” border=”yes” style=”white”]

healthy_total <- decostand(healthy, method="total")

[/toggle]

Do the same thing for the sick data frame.

Repeat this procedure for the healthy and sick data frames, but instead of using total normalization use Hellinger normalization.

[toggle hide=”yes” border=”yes” style=”white”]

healthy_hellinger <- decostand(healthy, method="hellinger")

sick_hellinger <- decostand(sick, method="hellinger")

[/toggle]

At the end of this exercise you should end up with four new R objects. Two should be total normalized for both healthy and sick, and two for Hellinger normalized for both healthy and sick.

If you accidentally made a data frame that you no longer want, it can be removed using the rm command. For example rm(test) will remove the data frame named test. Your environment should look more-or-less like the picture below.

You should now be comfortable with:

Subsetting data frames by rows or columns
The concept of installing packages from the CRAN archive and load them into R
How to apply commonly used ecological data transformations to a data frame using the decostand function in vegan
How to remove data using the rm command

If you do not understand these basic concepts go back and review as they will be important for moving forward.

Exercise 4: Use the summary function on descriptive data to quickly quantify each type of sample in the data table.

The summary function is quite useful and a great tool that does precisely what it sounds like. It summarizes the given data and provides basic metrics and statistics. This can be very useful for generating quick overviews of factorial data which in many studies takes the form of metadata tables.

Now, import the metadata files “healthy_metadata.txt” and “sick_metadata.txt” into your R session!

[toggle hide=”yes” border=”yes” style=”white”]

healthy_metadata <- read.table("healthy_metadata.txt")

sick_metadata <- read.table("sick_metadata.txt")

[/toggle]

Run the summary function on each newly imported data frame to get a quick overview of the metadata associated with this study.

[toggle hide=”yes” border=”yes” style=”white”]

summary(healthy_metadata)

summary(sick_metadata)

[/toggle]

You can also produce summary data for all of the data in the healthy and sick data frames. If you do this you will get a lot of information that will pour through the screen. Go ahead and try it out.

Exercise 5: Basic plotting.

Boxplots in R use the conventions detailed in the figure below and are useful for describing the variance in a set of numerical data.

boxplot_description

Let’s make a boxplot comparing the ages in both our healthy and sick metadata data frames. Using the boxplot function, attempt to make the figure below using the format boxplot(x1, x2). Try to do this before revealing the solution building on what you learned from above.

[toggle hide=”yes” border=”yes” style=”white”]

boxplot(healthy_metadata$Age, sick_metadata$Age)

[/toggle]

Notice how this boxplot doesn’t have a lot of titles or other information. Let’s do some manipulations to this graph to try and make it a little more informative.

Read through the boxplot options using

?boxplot

and try to recreate something that approximates the graph below. Try to see how far you can get before looking at the hidden answer and don’t worry if you can’t get the color or line width exactly as it is in this figure. If this is your first time using R it is unlikely you will know all of the commands to completely reproduce this graph, but give it a try.

[toggle hide=”yes” border=”yes” style=”white”]

boxplot(healthy_metadata$Age, sick_metadata$Age, col="light blue", names=c("healthy", "sick"), lwd=3, main="Comparison of Age Between Groups", ylab="Age")

[/toggle]

An explanation of each of these modifiers is below:

– main adds a main title to the plot

– ylab adds a y-axis label

– names: adds “healthy” and “sick” labels to the x-axis. This one is a bit tricky and you have to use the names function in box plots. Use the ?boxplot help page for assistance and remember that text strings should be enclosed in quotes. This is basically how you label the x-axis

– col: adds color to the box plot, in this case we used light blue

– lwd: increased the width of the boxplot lines from the default of 1 to 3

Exercise 6: Defining layouts.

R has powerful graphical layout tools. These layout options allow you to plot several graphs next to one another in a very controlled manner. There are a variety of ways to define these layouts, but the simplest and most frequently used way is to define the layout paramaters using the par function.

For example, the following command will define a 2×2 layout for graphing:

par(mfrow=c(2,2))

While this would define a single row with three columns (1×3)

par(mfrow=c(1,3))

1x3

These settings are maintained by R until you change them. To get back to the default layout you can simply enter:

par(mfrow=c(1,1))

Define a 1×3 layout and make 3 boxplots comparing the abundances of Tevenvirinae, PhiCD119likevirus and Clostridium_phage_c.st between healthy and sick individuals.

[toggle hide=”yes” border=”yes” style=”white”]

boxplot(healthy$Tevenvirinae, sick$Tevenvirinae)
boxplot(healthy$PhiCD119likevirus, sick$PhiCD119likevirus)
boxplot(healthy$Clostridium_phage_c.st, sick$Clostridium_phage_c.st)

[/toggle]

Now attempt to draw the same plot, but use the Hellinger normalized data you generated previously.

[toggle hide=”yes” border=”yes” style=”white”]

boxplot(healthy_hellinger$Tevenvirinae, sick_hellinger$Tevenvirinae)
boxplot(healthy_hellinger$PhiCD119likevirus, sick_hellinger$PhiCD119likevirus)
boxplot(healthy_hellinger$Clostridium_phage_c.st, sick_hellinger$Clostridium_phage_c.st)

[/toggle]

You can immediately see the impact that Hellinger normalization had on the sample data. However, the graph is still difficult to interpret. Try to use the skills you obtained from previous Exercises to put together a graph similar to the one below.

[toggle hide=”yes” border=”yes” style=”white”]

par(mfrow=c(1,3))

boxplot(healthy_hellinger$Tevenvirinae, sick_hellinger$Tevenvirinae, ylim=c(0,1), col="salmon", lwd=2, names=c("Healthy", "Sick"), main="Tevenvirinae")

boxplot(healthy_hellinger$PhiCD119likevirus, sick_hellinger$PhiCD119likevirus, ylim=c(0,1), col="yellow", lwd=2, names=c("Healthy", "Sick"), main="PhiCD119likevirus")

boxplot(healthy_hellinger$Clostridium_phage_c.st, sick_hellinger$Clostridium_phage_c.st, ylim=c(0,1), col="steel blue", lwd=2, names=c("Healthy", "Sick"), main="Clostridium_phage_c.st")

par(mfrow=c(1,1))

[/toggle]

Exercise 7: More with packages and drawing heatmaps

In this exercise we will work with the library pheatmap designed to produce high-quality heatmaps.

You can now load this package into the R session like before:

library("pheatmap")

Once loaded you should review its documentation with

?pheatmap

Heatmap visualization can benefit from data normalization to diminish the challenges associated with discerning differences between very large and small values. There are a number of ways to normalize data (log, sqrt, chi-square transform amongst others). For this exercise we will continue to use the Hellinger normalized data used in previous exercises.

Taking guidance from the pheatmap help file attempt to generate the heatmap shown below.

[toggle hide=”yes” border=”yes” style=”white”]

pheatmap(healthy_hellinger, cluster_cols=FALSE, cellwidth=8, cellheight=8, main="Healthy")

pheatmap(sick_hellinger, cluster_cols=FALSE, cellwidth=8, cellheight=8, main="Sick")

[/toggle]

[box]Of note, pheatmap doesn’t utilize the par functions like boxplot does in the previous examples. You will get one heatmap per page and need to move forward and backward to see both plots.[/box]

Exercise 8: Exporting Data

Tabular data can be exported using the write.table function in R. You can also specify the deliminator.

To export your newly normalized hellinger R object of the healthy data to analyze in another program requiring a tab-deliminated file type, you could simply type:

write.table(healthy_hellinger, file="healthy_hellinger.txt", sep="\t")

If you would like to export to Excel format you can do so using the xlsReadWrite package (not installed).

Exporting plots in RStudio is accomplished using the Export tab in the plot window. A variety of formats and sizing options are available. Try it out!

Now you finished the basic introductory R exercise!

The phyloseq package you will use later in the workshop is heavily dependent on the R package ggplot . Continue with ggplot exercises here!