There thousands of publicly available R extensions, AKA “packages”. They are not all installed on your system, and only core R packages are loaded when you start a new R session. The library()
function is used to load non-core packages so that subsequent commands are searched against the functions and data in these packages as well as the core.
I like to also show the package version number in my Rmd workflows (using packageVersion()
) so that I know what was used at the time the code was executed.
I also like to include a date stamp. There’s one above, but here’s how you can include it in your code:
gsub("[[:punct:][:space:]]", "-", Sys.time())
gsub("[[:punct:][:space:]]", "-", Sys.time())
## [1] "2016-01-18-15-55-00"
library()
command.packageVersion()
command.library("phyloseq"); packageVersion("phyloseq")
## Warning: replacing previous import by 'ggplot2::Position' when loading
## 'phyloseq'
## Warning: replacing previous import by 'scales::alpha' when loading
## 'phyloseq'
## [1] '1.14.0'
library("data.table"); packageVersion("data.table")
## [1] '1.9.6'
library("ggplot2"); packageVersion("ggplot2")
## [1] '2.0.0'
In the previous section you organized our Moving Pictures example data using phyloseq tools, and then saved this data from your R session (system memory, RAM) to the hard disk as a compressed, serialized file (extension “.RDS
”). This is a handy way to store complicated single data objects for later use, without needing to define and document one or more file formats. This data was imported from 3 or 4 QIIME output files in various locations, but now this one self-describing file is all you need.
readRDS()
function, and store the result as an object named mp
.mp = readRDS("mp-phyloseq-lab.RDS")
Reminder: For storing many objects from an R session at once, try save()
or save.image()
, and load()
.
Explore the data. Executing a command in the terminal that is just a data object invokes the print()
or summary()
functions, which usually give summary details about the data if it is a complex object (like ours), or show the data directly if it is simple. If it is simple but large, you might hit the streaming limits in your console, so try and check sizes first. The “Environment” box in your RStudio console usually tells you these details as well.
mp
objectThis will give you other details about the object, as well as a few key functions that can be used to access parts of it. This data is too large to show all in the console, with the exception of phy_tree()
and refseq()
, which have their own special print summary functions.
mp
to the screen.nsamples()
and ntaxa()
functions.sample_data
and tax_table
components of this phyloseq object have their own special variables, namely sample_variables()
and rank_names()
. Use these functions on mp
. What do you get? What does it mean?phy_tree()
function. What is it telling you? How many leaves are in the tree? How many internal nodes?You could attempt to plot this tree with the default plot()
function, but it has way too many leaves to be legible. Also, the default plot function is limited, so we’ll come back to this and how to use the phyloseq function plot_tree()
later on.
mp
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 3426 taxa and 34 samples ]
## sample_data() Sample Data: [ 34 samples by 11 sample variables ]
## tax_table() Taxonomy Table: [ 3426 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 3426 tips and 3424 internal nodes ]
nsamples(mp)
## [1] 34
ntaxa(mp)
## [1] 3426
sample_variables(mp)
## [1] "X.SampleID" "BarcodeSequence"
## [3] "LinkerPrimerSequence" "SampleType"
## [5] "Year" "Month"
## [7] "Day" "Subject"
## [9] "ReportedAntibioticUsage" "DaysSinceExperimentStart"
## [11] "Description"
rank_names(mp)
## [1] "Kingdom" "Phylum" "Class" "Order" "Family" "Genus" "Species"
Create histograms that summarize sequencing depth in this data. Start with overall, but then also create category-specific histograms to help evaluate the balance in this experiment.
For this we are going to use the ggplot2 package.
sample_sums()
function, and the geom_histogram()
geometric object from ggplot2.facet_wrap()
or facet_grid()
functions.ggtitle()
function.xlab()
and ylab()
functions.SampleType
and Subject
using the facet_grid()
function. Does it look like there are any imbalances in total counts or number of samples among these critical groups?sdt = data.table(as(sample_data(mp), "data.frame"),
TotalReads = sample_sums(mp), keep.rownames = TRUE)
setnames(sdt, "rn", "SampleID")
pSeqDepth = ggplot(sdt, aes(TotalReads)) + geom_histogram() + ggtitle("Sequencing Depth")
pSeqDepth
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
pSeqDepth + facet_wrap(~SampleType)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
pSeqDepth +
facet_grid(SampleType ~ Subject) +
ggtitle("Seq. Depth by Subject and Body Site")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The lowest number of reads per sample was 1114. That is low for typical new projects, but not that unusual compared to the rest of the samples in this tutorial dataset. From these plots, I don’t recommend filtering any samples based on depth alone, but keep in mind that in some cases you might, mainly when the number of read counts in a sample is so low that it is apparent that something went wrong with that sample. We will keep an eye out for artifacts and odd behavior (outliers) as we go along.
The plots above also do not indicate strong biases in sequencing depth by group or subject. Do you agree?
Here we want to check for any associations with sequencing depth and time. This is worthwhile to check early on in an experiment, as metadata variables like sequencing depth should be uncorrelated with key experimental variables. Why is that? What would it mean if there was a strong correlation between depth and variable?
DaysSinceExperimentStart
, the vertical axis is mapped to TotalReads
, and the color variable is mapped to Subject
.geom_point()
function.scale_y_log10()
layer function. Does this improve the plot?Note that you should enter Subject as as.factor(Subject)
to let ggplot2 know that this is really a discrete variable, even though it is encoded as a numeric vector of 1
and 2
. Alternatively you could modify the Subject
variable to be a character or factor, or you could add a new variable that holds Subject
with a discrete data type; however, I prefer to not modify the original data unless necessary or useful, and the as.factor()
expression is an elegant option in this case. Why might it be a good idea to avoid modifying key variables in the data?
ggplot(sdt, aes(DaysSinceExperimentStart, TotalReads, color = as.factor(Subject))) +
geom_point(size = 5) +
geom_smooth(method = lm) +
ggtitle("Sequencing Depth vs. Time") +
scale_y_log10() +
facet_grid(SampleType ~ .)