Load Packages, Import Data

There thousands of publicly available R extensions, AKA “packages”. They are not all installed on your system, and only core R packages are loaded when you start a new R session. The library() function is used to load non-core packages so that subsequent commands are searched against the functions and data in these packages as well as the core.

I like to also show the package version number in my Rmd workflows (using packageVersion()) so that I know what was used at the time the code was executed.

I also like to include a date stamp. There’s one above, but here’s how you can include it in your code:

gsub("[[:punct:][:space:]]", "-", Sys.time())

gsub("[[:punct:][:space:]]", "-", Sys.time())
## [1] "2016-01-18-15-55-00"

Load phyloseq and other packages

  1. Load the phyloseq, data.table, and ggplot2, packages using the library() command.
  2. Check the version number of each package using the packageVersion() command.
library("phyloseq"); packageVersion("phyloseq")
## Warning: replacing previous import by 'ggplot2::Position' when loading
## 'phyloseq'
## Warning: replacing previous import by 'scales::alpha' when loading
## 'phyloseq'
## [1] '1.14.0'
library("data.table"); packageVersion("data.table")
## [1] '1.9.6'
library("ggplot2"); packageVersion("ggplot2")
## [1] '2.0.0'

Load Pre-Organized Data from Previous Section

In the previous section you organized our Moving Pictures example data using phyloseq tools, and then saved this data from your R session (system memory, RAM) to the hard disk as a compressed, serialized file (extension “.RDS”). This is a handy way to store complicated single data objects for later use, without needing to define and document one or more file formats. This data was imported from 3 or 4 QIIME output files in various locations, but now this one self-describing file is all you need.

  1. Load your Moving Pictures RDS data file into your R session using the readRDS() function, and store the result as an object named mp.
mp = readRDS("mp-phyloseq-lab.RDS")

Reminder: For storing many objects from an R session at once, try save() or save.image(), and load().


Initial exploration of data

Explore the data. Executing a command in the terminal that is just a data object invokes the print() or summary() functions, which usually give summary details about the data if it is a complex object (like ours), or show the data directly if it is simple. If it is simple but large, you might hit the streaming limits in your console, so try and check sizes first. The “Environment” box in your RStudio console usually tells you these details as well.

  1. “print” to screen the mp object
  2. This will give you other details about the object, as well as a few key functions that can be used to access parts of it. This data is too large to show all in the console, with the exception of phy_tree() and refseq(), which have their own special print summary functions.

  3. Print mp to the screen.
  4. Get the number of samples and number of taxa (OTUs) directly, using the nsamples() and ntaxa() functions.
  5. The sample_data and tax_table components of this phyloseq object have their own special variables, namely sample_variables() and rank_names(). Use these functions on mp. What do you get? What does it mean?
  6. Find details about the phylogenetic tree in this data using the phy_tree() function. What is it telling you? How many leaves are in the tree? How many internal nodes?
  7. You could attempt to plot this tree with the default plot() function, but it has way too many leaves to be legible. Also, the default plot function is limited, so we’ll come back to this and how to use the phyloseq function plot_tree() later on.

mp
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 3426 taxa and 34 samples ]
## sample_data() Sample Data:       [ 34 samples by 11 sample variables ]
## tax_table()   Taxonomy Table:    [ 3426 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 3426 tips and 3424 internal nodes ]
nsamples(mp)
## [1] 34
ntaxa(mp)
## [1] 3426
sample_variables(mp)
##  [1] "X.SampleID"               "BarcodeSequence"         
##  [3] "LinkerPrimerSequence"     "SampleType"              
##  [5] "Year"                     "Month"                   
##  [7] "Day"                      "Subject"                 
##  [9] "ReportedAntibioticUsage"  "DaysSinceExperimentStart"
## [11] "Description"
rank_names(mp)
## [1] "Kingdom" "Phylum"  "Class"   "Order"   "Family"  "Genus"   "Species"

Summarize sequencing depths, in general and by category

Create histograms that summarize sequencing depth in this data. Start with overall, but then also create category-specific histograms to help evaluate the balance in this experiment.

For this we are going to use the ggplot2 package.

  1. Define a data.frame or data.table that contains the total number of reads per sample using the sample_sums() function, and the geom_histogram() geometric object from ggplot2.
  2. Separate these histograms by SampleType in separate plot panels using the facet_wrap() or facet_grid() functions.
  3. Add an informative title using the ggtitle() function.
  4. Modify the axis labels using the xlab() and ylab() functions.
  5. Make a grid of panels by SampleType and Subject using the facet_grid() function. Does it look like there are any imbalances in total counts or number of samples among these critical groups?
sdt = data.table(as(sample_data(mp), "data.frame"),
                 TotalReads = sample_sums(mp), keep.rownames = TRUE)
setnames(sdt, "rn", "SampleID")
pSeqDepth = ggplot(sdt, aes(TotalReads)) + geom_histogram() + ggtitle("Sequencing Depth")
pSeqDepth
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

pSeqDepth + facet_wrap(~SampleType)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

pSeqDepth + 
  facet_grid(SampleType ~ Subject) + 
  ggtitle("Seq. Depth by Subject and Body Site")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The lowest number of reads per sample was 1114. That is low for typical new projects, but not that unusual compared to the rest of the samples in this tutorial dataset. From these plots, I don’t recommend filtering any samples based on depth alone, but keep in mind that in some cases you might, mainly when the number of read counts in a sample is so low that it is apparent that something went wrong with that sample. We will keep an eye out for artifacts and odd behavior (outliers) as we go along.

The plots above also do not indicate strong biases in sequencing depth by group or subject. Do you agree?

Sequencing depth across time

Here we want to check for any associations with sequencing depth and time. This is worthwhile to check early on in an experiment, as metadata variables like sequencing depth should be uncorrelated with key experimental variables. Why is that? What would it mean if there was a strong correlation between depth and variable?

  1. Create a ggplot graphic in which the horizontal axis is mapped to DaysSinceExperimentStart, the vertical axis is mapped to TotalReads, and the color variable is mapped to Subject.
  2. Make this a scatter plot using the geom_point() function.
  3. Add a title the same way you did in the previous graphic. Label the graphic Sequencing Depth vs. Time.
  4. Scale the vertical axis to base-10 logarithm using the scale_y_log10() layer function. Does this improve the plot?
  5. What is the minimum number of reads for any sample? (That is, what is the smallest library size?). Does this seem like a problem? Would you remove any samples?

Note that you should enter Subject as as.factor(Subject) to let ggplot2 know that this is really a discrete variable, even though it is encoded as a numeric vector of 1 and 2. Alternatively you could modify the Subject variable to be a character or factor, or you could add a new variable that holds Subject with a discrete data type; however, I prefer to not modify the original data unless necessary or useful, and the as.factor() expression is an elegant option in this case. Why might it be a good idea to avoid modifying key variables in the data?

ggplot(sdt, aes(DaysSinceExperimentStart, TotalReads, color = as.factor(Subject))) + 
  geom_point(size = 5) + 
  geom_smooth(method = lm) +
  ggtitle("Sequencing Depth vs. Time") +
  scale_y_log10() +
  facet_grid(SampleType ~ .)

sdt[, min(TotalReads)]
## [1] 1114

Filter Taxa

Filtering rare taxa is usually a necessary step in this type of data. We will actually perform the filtering elsewhere. For now, just explore the distribution of taxa across the dataset. There are two obvious measures to consider right away: (1) prevalence - the number of samples in which a taxa appears, and (2) total counts - the total number (or proportion) of observations of a taxa across all samples.

I like to plot them together on the same plot, as well as histograms of each. This will help determine what appears to be reasonable thresholds that define “unreasonably high” or “unhelpfully low” presence of a taxa in the data. Note the subjectivity in this last statement. Filtering is often subjective. We can be honest about that. Our goal is to both justify and document our decision for filtering. The goal of filtering is to remove noise from the data, in this case noisy variables that are unlikely to provide any useful information. Our criteria for this filtering should not include variables from which we want to infer biological insights later on. Why is that?

Taxa total counts histogram

  1. How many singleton taxa are there (OTUs that occur in just one sample, one time)?
  2. How many doubletons are there (OTU that occurs just twice)?
  3. Create a histogram of the total counts of each OTU.
  4. Calculate the cumulative sum of OTUs that would be filtered at every possible value of such a threshold, from zero to the most-observed OTU. This one is tricky. Feel free to glance at the answers. I used some data.table magic to make this easier
  5. Plot the cumulative sum of OTUs against the total counts using ggplot2 commands to make a scatter plot, and save this object as pCumSum, then “print” it to the terminal to render a graphic. What behavior do you see in the data? Are most of the OTUs in the table rare? Where would you set the threshold?
  6. To help clarify, zoom-in on the region between zero and 100 total counts, by “adding” the following layer to pCumSum: pCumSum + xlim(0, 100) Now where would you set the threshold?
tdt = data.table(tax_table(mp),
                 TotalCounts = taxa_sums(mp),
                 OTU = taxa_names(mp))
ggplot(tdt, aes(TotalCounts)) + 
  geom_histogram() + 
  ggtitle("Histogram of Total Counts")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# How many singletons?
tdt[(TotalCounts <= 0), .N]
## [1] 0
tdt[(TotalCounts <= 1), .N]
## [1] 0
# None. This data has been filtered already. 
# How many doubletons?
tdt[(TotalCounts <= 2), .N]
## [1] 991
# taxa cumulative sum
taxcumsum = tdt[, .N, by = TotalCounts]
setkey(taxcumsum, TotalCounts)
taxcumsum[, CumSum := cumsum(N)]
# Define the plot
pCumSum = ggplot(taxcumsum, aes(TotalCounts, CumSum)) + 
  geom_point() +
  xlab("Filtering Threshold, Minimum Total Counts") +
  ylab("OTUs Filtered") +
  ggtitle("OTUs that would be filtered vs. the minimum count threshold")
pCumSum

# Zoom-in
pCumSum + xlim(0, 100)
## Warning: Removed 168 rows containing missing values (geom_point).

Taxa prevalence histogram, and fast_melt()

I’ve included a special function with the course materials, called fast_melt(), that I like. It is not yet incorporated into the phyloseq package. You can make it available to your session by “sourcing” the R code file that defines it. Use:

source("taxa_summary.R", local = TRUE)

In fact, now that you’ve done that, run the following code:

mdt = fast_melt(mp)
prevdt = mdt[, list(Prevalence = sum(count > 0), 
                    TotalCounts = sum(count)),
             by = TaxaID]

If you’ve used the data.table package before you might guess what this does. Briefly, the fast_melt() function “melts” the OTU table into a long format with three main columns: SampleID, TaxaID, and count. This allows us to do some additional data.table magic. For the rest of this section, go ahead and peak at the answers and come back (try not to peak ahead!).

I define Prevalence here as the number of times an OTU is observed at least once. That is, it is the number of samples in which each OTU was non-zero. I find this to be a more useful filtering and diagnostic criteria.

  1. What is the plot showing?
  2. Why might prevalence be a useful filtering criteria?
  3. Where would you set a prevalence threshold?
# Source the file that defines the new functions
source("taxa_summary.R", local = TRUE)
mdt = fast_melt(mp)
prevdt = mdt[, list(Prevalence = sum(count > 0), 
                    TotalCounts = sum(count)),
             by = TaxaID]
ggplot(prevdt, aes(Prevalence)) + 
  geom_histogram() + 
  ggtitle("Histogram of Taxa Prevalence")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# How many singletons?
prevdt[(Prevalence <= 0), .N]
## [1] 0
prevdt[(Prevalence <= 1), .N]
## [1] 506
# None. This data has been filtered already. 
# How many doubletons?
prevdt[(Prevalence <= 2), .N]
## [1] 1555
# taxa cumulative sum
prevcumsum = prevdt[, .N, by = Prevalence]
setkey(prevcumsum, Prevalence)
prevcumsum[, CumSum := cumsum(N)]
pPrevCumSum = ggplot(prevcumsum, aes(Prevalence, CumSum)) + 
  geom_point() +
  xlab("Filtering Threshold, Prevalence") +
  ylab("OTUs Filtered") +
  ggtitle("OTUs that would be filtered vs. the minimum count threshold")
pPrevCumSum

Prevalence vs. Total Count Scatter plot

  1. Make sure you executed the code (you peeked!) from the answers that defines prevdt.
  2. Now use ggplot2 and prevdt to plot TotalCounts versus Prevalence as a scatter plot.
  3. Does this help you decide on a filtering threshold?
ggplot(prevdt, aes(Prevalence, TotalCounts)) + 
  geom_point(size = 4, alpha = 0.75) + 
  scale_y_log10()

Extra Credit

  1. Subset to the most abundant 9 phyla, and map these to color in a ggplot2 scatter plot.
addPhylum = unique(copy(mdt[, list(TaxaID, Phylum)]))
# Join by TaxaID
setkey(prevdt, TaxaID)
setkey(addPhylum, TaxaID)
prevdt <- addPhylum[prevdt]
showPhyla = prevdt[, sum(TotalCounts), by = Phylum][order(-V1)][1:9]$Phylum
setkey(prevdt, Phylum)
ggplot(prevdt[showPhyla], 
       mapping = aes(Prevalence, TotalCounts, color = Phylum)) + 
  geom_point(size = 4, alpha = 0.75) + 
  scale_y_log10()