Background

This exercise gives a basic introduction to ggplot2, which is part of the tidyverse suite developed by Hadley Wickham. ggplot2 is an extensive plotting library that builds a large number of plot types from a structured grammar.

The capabilities of R and ggplot2 go far beyond the time we have available within the Workshop. The goal of this set of exercises is primarily to get you familiar with the syntax and basic functionality of ggplot2. To this end, we have focused this material on a specific subset of R and ggplot2 functions.

If you would like to learn more about using ggplot2 there are extensive on-line resources, many of them linked here: http://ggplot2.tidyverse.org.

Exercise 1: Basic intro using the classic Iris data set

ggplot2 uses a basic syntax framework (called a grammar in ggplot2) for all plot types. A basic ggplot2 plot consists of the following components:

  • data: your data, in the form of a data frame
  • aes (aesthetics): how your data are represented
    • x, y, color, size, shape
  • geom (geometry): the geometries of the plotted objects
    • points, lines, bars, etc.
  • More … we won’t delve into this much here, but there are extensive additional components for a variety of other tasks, such as setting the basic plot theme

The components in the list above correspond to the actual commands you will enter into the console to build your plot: data, aes and geom. These are your basic building blocks for all ggplot2 plots.

For our first example data set we will use the classic Iris Flower Data Set introduced by Ronald Fisher in 1936. It is included with the base R installation and is regularly used in introductory statistics courses as it has features that are easy to discriminate and plot. The data are made up of morphological measurements of three species of iris flowers: Iris setosa, Iris versicolor and Iris virginica.

Step 1: Generate basic summary statistics for the iris data set

summary(iris)

This command will provide you with summary statistics (min, max, mean, median, 1st and 3rd quartile boundaries, etc.) about the iris data. The result should look like this:

iris_summary
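
Beyond summary(), a few other base R functions give a quick feel for the data frame before plotting. The sketch below is an optional addition to this step, not part of the original exercise; it only uses functions that ship with base R.

```r
# iris ships with base R (datasets package), so nothing needs to be loaded.
str(iris)             # column names and types: four numeric columns plus one factor
head(iris, n = 5)     # first five rows of the data frame
table(iris$Species)   # number of flowers measured per species
levels(iris$Species)  # the species names used to colour the plots later on
```

Knowing which columns are numeric and which are categorical helps when deciding what to map to x, y and col in the plots that follow.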

Step 2: Basic ggplot2 example

Prior to using any library in R you need to load it. To load ggplot2 use the following command:

library(ggplot2)

Go ahead and enter the command below and we will break it down after it has successfully run:

ggplot(data = iris, aes(x=Sepal.Length, y=Sepal.Width, col = Species)) + geom_point()

This should generate a plot that looks like this:

iris_ggplot_point

The plot displays one point (geom_point) for each row of the iris data frame. Sepal Width is plotted along the y-axis (y=Sepal.Width) and Sepal Length along the x-axis (x=Sepal.Length). Each point is colored according to the flower species.

This is a minimal example of ggplot code. All ggplot commands begin with ggplot, followed by the data, then a basic description of how the data map onto the plot using various aes (aesthetic) parameters. After that, you add (+) the plot geometry using geom_point or another geom such as geom_bar, geom_line or geom_area. You can see a list of all base ggplot2 geometries at: http://docs.ggplot2.org/current/.
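
The same command can be built up one piece at a time, which makes the role of each part easier to see. This is a minimal sketch; the object names base_plot, scatter and line_plot are made up for illustration.

```r
library(ggplot2)

# Step 1: data plus aesthetic mapping. On its own this draws empty axes,
# because no geometry has been added yet.
base_plot <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, col = Species))

# Step 2: add a geometry layer with `+`.
scatter <- base_plot + geom_point()

# The same base can be reused with a different geom.
line_plot <- base_plot + geom_line()

# Typing the object name displays the plot.
scatter
```

Swapping geom_point() for another geom changes how the data are drawn without touching the data or the aesthetic mapping.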

To test your knowledge on implementing a basic ggplot2 command, try to draw a box plot of the iris data using species as the x-axis category and Petal Length as the y-axis variable. The code is given below the plot, but try to generate it before looking!

iris_boxplot_1

Boxplot of Petal Lengths categorized by flower species

ggplot(data = iris, aes(x=Species, y=Petal.Length, col = Species)) + geom_boxplot()

As you can see, the x-axis was defined by a categorical instead of a numeric variable. You could easily adjust this code to display the Sepal Length, Sepal Width or another variable on the y-axis in place of Petal Length.
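
As a quick illustration of that adjustment, the sketch below swaps Petal.Length for Sepal.Width and leaves everything else unchanged. The object name sepal_box is only illustrative.

```r
library(ggplot2)

# Same boxplot as before, with a different numeric variable on the y-axis.
sepal_box <- ggplot(data = iris, aes(x = Species, y = Sepal.Width, col = Species)) +
  geom_boxplot()

sepal_box
```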

Step 3: Modifying plots

In the previous examples we have left empty parentheses () after specifying the geom (e.g. geom_boxplot()). However, you can fill these parentheses with additional plotting specifications.

You can view all available parameters in R using the ? command (e.g. ?geom_boxplot for boxplots) or in the on-line documentation (http://docs.ggplot2.org/current/geom_boxplot.html).
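
If you prefer to stay in the console, base R can also list a function's arguments directly. The sketch below is one way to do that, assuming ggplot2 is installed; the variable name boxplot_args is made up for illustration.

```r
library(ggplot2)

# Open the help page (uncomment in an interactive session):
# ?geom_boxplot

# List the named arguments that geom_boxplot() accepts.
boxplot_args <- names(formals(geom_boxplot))
print(boxplot_args)

# args() prints the full signature, including default values.
args(geom_boxplot)
```

Arguments such as notch appear in this list, which is a useful starting point for the next modification.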

Let’s take advantage of this and update our boxplots to have notches and to display the outlier points in black instead of the category color. Please attempt to generate the plot yourself using the information in the help or on-line documentation prior to revealing the code.

iris_boxplot_2

Notched Boxplot with Black Outliers Categorized by Flower Species

ggplot(data = iris, aes(x=Species, y=Petal.Length, col = Species)) + geom_boxplot(notch = TRUE, outlier.colour = "black")

You can also add geometries together by separating them with a + symbol just as you added the first geometry. For example, you can add a geom_point to the geom_boxplot to display all of the data.

Using this concept, try to generate the following plot. Of note, I reduced the size of the points in geom_point so that you could still view the black outlier points. I then used ggtitle to add the title.

iris_boxplot_3

Petal Lengths by Species : Titled Boxplot with Points

ggplot(data = iris, aes(x=Species, y=Petal.Length, col = Species)) + geom_boxplot(notch = TRUE, outlier.colour = "black") + geom_point(size=0.05) + ggtitle("Petal Lengths by Species")
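
With geom_point, all points of a species sit on one vertical line and overlap. A common variation, shown here as an optional sketch rather than as part of the exercise, is geom_jitter, which spreads the points horizontally. The width, size and alpha values are arbitrary choices for illustration.

```r
library(ggplot2)

box_jitter <- ggplot(data = iris, aes(x = Species, y = Petal.Length, col = Species)) +
  geom_boxplot(notch = TRUE, outlier.colour = "black") +
  # Spread the raw points sideways so overlapping values become visible.
  geom_jitter(width = 0.2, size = 0.5, alpha = 0.6) +
  ggtitle("Petal Lengths by Species")

box_jitter
```

Note that outliers are then drawn twice: once in black by geom_boxplot and once, shifted sideways, by geom_jitter.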

At this stage you should have a working understanding of the basic elements of a ggplot and how modifiers can be used to create new and interesting ways to display your data. There are a large number of built-in geometries and modifiers that can be used to get the result you want.

In addition, there is a community of developers who have created ggplot2-styled libraries. So if you want a geometry that doesn’t exist in base ggplot2, you should search for a library that may have what you want. The official ggplot2 extension library has a large number of new geometries which you can install (http://www.ggplot2-exts.org/gallery/). There are also well-developed libraries that leverage ggplot2, such as PhyloSeq, which we will use later on.

Step 4: Faceting plots

A powerful feature of ggplot is its ability to ‘facet’ data by a categorical variable. There are a number of ways to do this. The simplest way to explain this is by example:

ggplot(data = iris, aes(x=Sepal.Length, y=Sepal.Width, col = Species)) + geom_point() + facet_grid(~Species)

facet_plot

The facet_grid command instructed ggplot2 to divide, or facet, the data across the categories defined in Species.
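
A few other ways to facet the same plot are sketched below for comparison. facet_wrap and the scales argument are standard ggplot2 features, but they are not covered in the original exercise, and the object names are only illustrative.

```r
library(ggplot2)

scatter <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
  geom_point()

# One panel per species, laid out side by side (similar to facet_grid(~Species)).
by_column <- scatter + facet_wrap(~Species)

# One panel per species, stacked in rows instead of columns.
by_row <- scatter + facet_grid(Species ~ .)

# Let each panel choose its own axis ranges instead of sharing them.
free_axes <- scatter + facet_wrap(~Species, scales = "free")

by_row
```

In facet_grid, variables to the left of the ~ define rows and variables to the right define columns.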

You will practice more with this concept below and it is used extensively in the PhyloSeq tutorial material.

Step 5: Saving and displaying plots

The above examples just pushed the plots to the standard graphics viewer available in RStudio. If you want to save a plot, you can do so by assigning it to a new object in R.

For example, to save the first plot you drew you could save it to an object called plot1 with the following code:

plot1 <- ggplot(data = iris, aes(x=Sepal.Length, y=Sepal.Width, col = Species)) + geom_point()

A new R object will be added to your environment. You can call this object and display it by just typing the name of the object (plot1).

Saved ggplot2 objects can be directly modified. So instead of rewriting the code to add a title to the plot, you can just add ggtitle("Plot Title") using the following syntax:

plot1 + ggtitle("Plot Title")
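
One detail worth knowing: adding to a saved plot returns a new plot and leaves the saved object untouched, so the result has to be assigned if you want to keep it. The sketch below illustrates this; plot2 and the axis label text are made up for the example.

```r
library(ggplot2)

plot1 <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
  geom_point()

# Displays a titled plot, but plot1 itself still has no title.
plot1 + ggtitle("Plot Title")

# Assign the result to keep the modifications.
plot2 <- plot1 +
  ggtitle("Plot Title") +
  xlab("Sepal length") +
  ylab("Sepal width")

plot2
```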

Once you have a plot just how you like it, you can export it as a PNG, JPEG, TIFF, BMP, SVG or EPS using the dropdown menu labelled Export in the plotting window.
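
Plots can also be written to a file from code with ggplot2's ggsave function, which is convenient when the export needs to be repeatable. This is a sketch under assumptions: the file name, the temporary output directory and the dimensions are illustrative choices, not part of the exercise.

```r
library(ggplot2)

plot1 <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
  geom_point()

# Hypothetical output location: a temporary directory, so nothing in your
# project folder is overwritten.
out_file <- file.path(tempdir(), "iris_scatter.png")

# ggsave() picks the file type from the extension.
# Width and height are given in inches by default.
ggsave(filename = out_file, plot = plot1, width = 6, height = 4, dpi = 300)

file.exists(out_file)
```

Changing the extension in out_file (for example to .pdf) changes the output format without any other edits.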

Step 6: Examples on your own

Using the concepts learned above, work on your own to reproduce the following plots using the diamonds data set, which is included with the ggplot2 library. The data contain the prices and other attributes of almost 54,000 diamonds.
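
Before attempting the plots, it can help to look at which columns the diamonds data frame offers. The sketch below only inspects the data; it does not reveal which variables the challenge plots use.

```r
library(ggplot2)  # diamonds is loaded together with ggplot2

dim(diamonds)             # number of rows and columns
str(diamonds)             # column names and types
summary(diamonds$price)   # distribution of a numeric column
table(diamonds$cut)       # counts for a categorical column
```

Numeric columns such as carat and price suit x and y aesthetics, while categorical columns such as cut, color and clarity suit col, fill or faceting.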

Plot #1: Faceted scatterplot

challenge_plot_1

Plot #2: Colored Histogram

challenge_plot_2

Plot #3: Bar plot

Flipped_barplot


Exercise 2: RMarkdown and reusable code

As you saw in Exercise 1, the way in which ggplot2 code is put together can become unwieldy as you add elements. A solution to tracking your code is to use RMarkdown documents which serve as running “notebooks” of your commands. You can then edit, re-execute or share these commands with other people so that they can reproduce your results.

The best way to understand how RMarkdown documents work is to create one yourself. You can do this in RStudio using the Menu commands “File -> New File -> R Markdown”. You will be presented with a window asking for the Title and Author of the new document. You can enter a title and author now or later.

When you complete these steps, you should have a new window/tab in RStudio that looks similar to this:

rmarkdown

This template will be generated every time you generate a new RMarkdown document. You can edit the entire document and there are extensive modifications and controls that can be added to the document which you can read about in the documentation at http://rmarkdown.rstudio.com. But for the purposes of the Workshop you only need to know a few basic things:

  1. There is an optional (but very useful!) header at the top of the document, contained within a pair of --- lines (three hyphens). You can add simple text to this area, and there are specific options such as output: html_document that control some of the behavior of the final document.
  2. All R code is contained within Code Chunks. A code chunk is everything between an opening ```{r} and a closing ```. For example, the final command to generate the box plot in Exercise 1 would look like this within a code chunk:

rmarkdown_chunk

  3. Code chunks are editable and re-usable. You can now edit the code in the chunk and then reissue it to the console using the options in the Run menu at the top of the RMarkdown document.
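
To make the header and chunk structure concrete, the sketch below writes a minimal RMarkdown file from R. It is an illustration only: the title text, file name and temporary directory are hypothetical, and in practice you would normally create the file through the RStudio menu as described above.

```r
# Build the three-backtick chunk delimiter programmatically.
fence <- strrep("`", 3)

rmd_lines <- c(
  "---",                              # header opens with three hyphens
  'title: "ggplot2 exercises"',       # hypothetical title
  "output: html_document",
  "---",                              # header closes with three hyphens
  "",
  paste0(fence, "{r}"),               # code chunk opens
  "library(ggplot2)",
  "ggplot(data = iris, aes(x = Species, y = Petal.Length, col = Species)) +",
  '  geom_boxplot(notch = TRUE, outlier.colour = "black") +',
  "  geom_point(size = 0.05) +",
  '  ggtitle("Petal Lengths by Species")',
  fence                               # code chunk closes
)

# Hypothetical location: a temporary directory.
rmd_file <- file.path(tempdir(), "ggplot2_notebook.Rmd")
writeLines(rmd_lines, rmd_file)

# Print the file to see the header and the chunk as they appear in the editor.
cat(readLines(rmd_file), sep = "\n")

# Optional, requires the rmarkdown package and pandoc:
# rmarkdown::render(rmd_file)
```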

That’s it! You now have a digital lab notebook with editable R commands that you can use for your analysis. It is highly recommended that you keep all of your R code within code chunks. It will save you a tremendous amount of time when it comes to parameterizing your plot.

There is much more to learn about RMarkdown documents and how they can be used for analysis sharing and automated report generation. If you would like to learn more you can read through the full RMarkdown documentation at: http://rmarkdown.rstudio.com.