This is the first section of the phyloseq lab. In this section you will learn how to import data from a common format and how to manipulate, investigate, and merge data together into a single “experiment-level” object. You will also learn how to save this object into a compressed and self-describing file, that you can easily and quickly load into a new R session without repeating the entire parsing and organizing process.
The phyloseq object that you create and save in this section will be used in the next section; but don’t worry! If you have trouble or don’t finish a copy of the object we are trying to make in this tutorial is included with these materials, the file named: mp-phyloseq-lab-00.RDS
A secondary goal of this section is for you to become familiar or even comfortable reading and using the documentation of R functions that accompanies every good R package. Don’t give up too quickly! Every time you get stuck and start searching for an example you are practicing how to get around R on your own. Each of the functions we will use are well-documented, and the tutorial text provides each main function that you will use.
In programming there are many ways to accomplish a task. I have provided an answer key in the form of a working example of code that accomplishes the task being asked of you. If your version of the code does not match exactly, that is okay, yours is still correct as long as it works! Please try not to use the answer key until you have exhausted your other options, including an earnest attempt at exploring the documentation of the functions mentioned in each prompt, as well as your own trial-and-error attempts to accomplish the task.
wget
, tar
, etc..tgz
), or uncompressed file. Note this location or save the path as an object. If you do not have the relative path immediately available from the earlier QIIME lab, you can download the lab to the current directory from the following link: ftp://ftp.microbio.me/qiime/tutorial_files/moving_pictures_tutorial-1.9.0.tgz
Hint, you do not need to leave your R session to do this. R can initiate system commands (bash) using the system()
function. So in this case: system("wget ftp://ftp.microbio.me/qiime/tutorial_files/moving_pictures_tutorial-1.9.0.tgz")
system("tar -xvzf moving_pictures_tutorial-1.9.0.tgz")
# Define paths
dataDirectoryPath = "moving_pictures_tutorial-1.9.0/illumina/precomputed-output/otus"
biomFile = "otu_table_mc2_w_tax_no_pynast_failures.biom"
fullInputPath = file.path(dataDirectoryPath, biomFile)
# Only attempt download if file not already present
if( !file.exists(fullInputPath) ){
# Download to current directory
system("wget ftp://ftp.microbio.me/qiime/tutorial_files/moving_pictures_tutorial-1.9.0.tgz")
# Unzip
system("tar -xvzf moving_pictures_tutorial-1.9.0.tgz")
}
There is a specific biom file we want to work with. The QIIME tutorial might have many.
otu_table_mc2_w_tax_no_pynast_failures.biom
. Hint: use the autocomplete feature of system paths by typing the tab button when the cursor is within a pair of quotation marks. The path within the tutorial is: moving_pictures_tutorial-1.9.0/illumina/precomputed-output/otus
fullInputPath
file.exists()
file.info()
# Define paths
dataDirectoryPath = "moving_pictures_tutorial-1.9.0/illumina/precomputed-output/otus"
biomFile = "otu_table_mc2_w_tax_no_pynast_failures.biom"
fullInputPath = file.path(dataDirectoryPath, biomFile)
fullInputPath
## [1] "moving_pictures_tutorial-1.9.0/illumina/precomputed-output/otus/otu_table_mc2_w_tax_no_pynast_failures.biom"
# Check for existence of biom file.
file.exists(fullInputPath)
## [1] TRUE
file.info(fullInputPath)
## size
## moving_pictures_tutorial-1.9.0/illumina/precomputed-output/otus/otu_table_mc2_w_tax_no_pynast_failures.biom 1065793
## isdir
## moving_pictures_tutorial-1.9.0/illumina/precomputed-output/otus/otu_table_mc2_w_tax_no_pynast_failures.biom FALSE
## mode
## moving_pictures_tutorial-1.9.0/illumina/precomputed-output/otus/otu_table_mc2_w_tax_no_pynast_failures.biom 644
## mtime
## moving_pictures_tutorial-1.9.0/illumina/precomputed-output/otus/otu_table_mc2_w_tax_no_pynast_failures.biom 2015-01-16 20:18:52
## ctime
## moving_pictures_tutorial-1.9.0/illumina/precomputed-output/otus/otu_table_mc2_w_tax_no_pynast_failures.biom 2016-01-18 14:23:54
## atime
## moving_pictures_tutorial-1.9.0/illumina/precomputed-output/otus/otu_table_mc2_w_tax_no_pynast_failures.biom 2016-01-18 14:24:01
## uid
## moving_pictures_tutorial-1.9.0/illumina/precomputed-output/otus/otu_table_mc2_w_tax_no_pynast_failures.biom 1001
## gid
## moving_pictures_tutorial-1.9.0/illumina/precomputed-output/otus/otu_table_mc2_w_tax_no_pynast_failures.biom 1001
## uname
## moving_pictures_tutorial-1.9.0/illumina/precomputed-output/otus/otu_table_mc2_w_tax_no_pynast_failures.biom genomics
## grname
## moving_pictures_tutorial-1.9.0/illumina/precomputed-output/otus/otu_table_mc2_w_tax_no_pynast_failures.biom genomics
library()
command.packageVersion()
command.library("phyloseq"); packageVersion("phyloseq")
## Warning: replacing previous import by 'ggplot2::Position' when loading
## 'phyloseq'
## Warning: replacing previous import by 'scales::alpha' when loading
## 'phyloseq'
## [1] '1.14.0'
library("biomformat"); packageVersion("biomformat")
## [1] '0.99.2'
library("ggplot2"); packageVersion("ggplot2")
## [1] '2.0.0'
We want to use a special import function that is not yet fully integrated into phyloseq. This is to accomodate the new version of the biom format that is based on HDF5 and has become the default output from QIIME.
biomformat::read_biom()
, and save the result as an object named biom1
. Make sure to specify the package with the ::
notation, because the older biom package also includes a read_biom()
function. Save this data, the output of biomformat::read_biom()
as biom1
. (The phyloseq package will be migrating to biomformat in the next release.)import_biom()
function can already properly interpret biom1
and convert it to proper phyloseq format. Save this, the output of import_biom()
as mp0
.tax_table(mp0) <- tax_table(mp0)[, 1:7]
This is to fix a weird importing artifact. But now you know how to do it :-)print()
function, or simply typing the object name and pressing enter. What does it show you?rank_names()
function. What are the taxonomic ranks available in this data?biom1 = biomformat::read_biom(biom_file = fullInputPath)
## Found more than one class "file" in cache; using the first, from namespace 'BiocGenerics'
## Found more than one class "file" in cache; using the first, from namespace 'BiocGenerics'
## Warning in strsplit(msg, "\n"): input string 1 is invalid in this locale
names(biom1)
## [1] "id" "format" "format_url"
## [4] "type" "generated_by" "date"
## [7] "matrix_type" "matrix_element_type" "rows"
## [10] "columns" "shape" "data"
# treefilename = treeFile1
mp0 = import_biom(biom1, parseFunction = parse_taxonomy_greengenes)
## Warning in parseFunction(i$metadata$taxonomy): No greengenes prefixes were found.
## Consider using parse_taxonomy_default() instead if true for all OTUs.
## Dummy ranks may be included among taxonomic ranks now.
## Warning in parseFunction(i$metadata$taxonomy): No greengenes prefixes were found.
## Consider using parse_taxonomy_default() instead if true for all OTUs.
## Dummy ranks may be included among taxonomic ranks now.
## Warning in parseFunction(i$metadata$taxonomy): No greengenes prefixes were found.
## Consider using parse_taxonomy_default() instead if true for all OTUs.
## Dummy ranks may be included among taxonomic ranks now.
## Warning in parseFunction(i$metadata$taxonomy): No greengenes prefixes were found.
## Consider using parse_taxonomy_default() instead if true for all OTUs.
## Dummy ranks may be included among taxonomic ranks now.
## Warning in parseFunction(i$metadata$taxonomy): No greengenes prefixes were found.
## Consider using parse_taxonomy_default() instead if true for all OTUs.
## Dummy ranks may be included among taxonomic ranks now.
## Warning in parseFunction(i$metadata$taxonomy): No greengenes prefixes were found.
## Consider using parse_taxonomy_default() instead if true for all OTUs.
## Dummy ranks may be included among taxonomic ranks now.
## Warning in parseFunction(i$metadata$taxonomy): No greengenes prefixes were found.
## Consider using parse_taxonomy_default() instead if true for all OTUs.
## Dummy ranks may be included among taxonomic ranks now.
## Warning in parseFunction(i$metadata$taxonomy): No greengenes prefixes were found.
## Consider using parse_taxonomy_default() instead if true for all OTUs.
## Dummy ranks may be included among taxonomic ranks now.
## Warning in parseFunction(i$metadata$taxonomy): No greengenes prefixes were found.
## Consider using parse_taxonomy_default() instead if true for all OTUs.
## Dummy ranks may be included among taxonomic ranks now.
## Warning in parseFunction(i$metadata$taxonomy): No greengenes prefixes were found.
## Consider using parse_taxonomy_default() instead if true for all OTUs.
## Dummy ranks may be included among taxonomic ranks now.
## Warning in parseFunction(i$metadata$taxonomy): No greengenes prefixes were found.
## Consider using parse_taxonomy_default() instead if true for all OTUs.
## Dummy ranks may be included among taxonomic ranks now.
## Warning in parseFunction(i$metadata$taxonomy): No greengenes prefixes were found.
## Consider using parse_taxonomy_default() instead if true for all OTUs.
## Dummy ranks may be included among taxonomic ranks now.
# Keep only the first 7 ranks
tax_table(mp0) <- tax_table(mp0)[, 1:7]
# Explore the data.
rank_names(mp0)
## [1] "Kingdom" "Phylum" "Class" "Order" "Family" "Genus" "Species"
mp0
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 3426 taxa and 34 samples ]
## tax_table() Taxonomy Table: [ 3426 taxa by 7 taxonomic ranks ]
If not, it should be in: moving_pictures_tutorial-1.9.0/illumina/map.tsv
import_qiime_sample_data()
function, and save this as an object named qsd
.merge_phyloseq()
function.Read out the sample variables that are now present in your object.
qsd = import_qiime_sample_data("moving_pictures_tutorial-1.9.0/illumina/map.tsv")
mp1 = merge_phyloseq(mp0, qsd)
mp1
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 3426 taxa and 34 samples ]
## sample_data() Sample Data: [ 34 samples by 11 sample variables ]
## tax_table() Taxonomy Table: [ 3426 taxa by 7 taxonomic ranks ]
sample_variables(mp1)
## [1] "X.SampleID" "BarcodeSequence"
## [3] "LinkerPrimerSequence" "SampleType"
## [5] "Year" "Month"
## [7] "Day" "Subject"
## [9] "ReportedAntibioticUsage" "DaysSinceExperimentStart"
## [11] "Description"
Read phylogenetic tree. Add it to the data object.
treeFile1
read_tree()
function to import the tree. Save it as tree1
.mp2
. Again, to do this you’ll use merge_phyloseq()
with arguments mp1
and tree1
.treeFile1 = "moving_pictures_tutorial-1.9.0/illumina/precomputed-output/otus/rep_set.tre"
tree1 = read_tree(treeFile1)
tree1
##
## Phylogenetic tree with 3426 tips and 3424 internal nodes.
##
## Tip labels:
## New.CleanUp.ReferenceOTU647, 14030, New.CleanUp.ReferenceOTU858, 187582, 193769, 369227, ...
## Node labels:
## , 0.779, 0.834, 0.576, 0.184, 0.874, ...
##
## Unrooted; includes branch lengths.
class(tree1)
## [1] "phylo"
# Add to the data object
mp2 = merge_phyloseq(mp1, tree1)
mp2
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 3426 taxa and 34 samples ]
## sample_data() Sample Data: [ 34 samples by 11 sample variables ]
## tax_table() Taxonomy Table: [ 3426 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 3426 tips and 3424 internal nodes ]
The OTU centroids, often called the “representative sequences”, is the set of sequences that represent the center of an OTU in sequence space. There is one such sequence per OTU, and they are all different from one another. In principle they should be at least as dissimilar from one another as the OTU radius that you defined during OTU clustering. This is commonly 3% sequence dissimilarity, but this valuse is arbitrary, and you may select a different, smaller value in practce.
moving_pictures_tutorial-1.9.0/illumina/precomputed-output/otus/rep_set.fna
.Biostrings::readDNAStringSet()
function for reading fasta files, and save this DNAStringSet object as bs1
.names(bs1) <- gsub("\\s.+$", "", names(bs1))
bs1
to your data object using the merge_phyloseq()
function, and save the result as mp3
.ntaxa()
function on the unmerged objects and on mp3
.intersect(taxa_names(object1), taxa_names(object2))
. Note that if there are many taxa, you might want to save the result of intersect()
as an object and query it in pieces, or its length()
, rather than fill up your terminal with OTU identifiers.repseqFile = "moving_pictures_tutorial-1.9.0/illumina/precomputed-output/otus/rep_set.fna"
bs1 = Biostrings::readDNAStringSet(repseqFile)
## Found more than one class "file" in cache; using the first, from namespace 'BiocGenerics'
## Found more than one class "file" in cache; using the first, from namespace 'BiocGenerics'
## Found more than one class "file" in cache; using the first, from namespace 'BiocGenerics'
# Remove non-OTU information from sequence name
names(bs1) <- gsub("\\s.+$", "", names(bs1))
sum(names(bs1) %in% taxa_names(mp2))
## [1] 3426
# Add to the data object
mp3 = merge_phyloseq(mp2, bs1)
mp3
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 3426 taxa and 34 samples ]
## sample_data() Sample Data: [ 34 samples by 11 sample variables ]
## tax_table() Taxonomy Table: [ 3426 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 3426 tips and 3424 internal nodes ]
## refseq() DNAStringSet: [ 3426 reference sequences ]
You’ve got an object you want to use for later analysis (or many of them!); AND it is well-organized and self-describing with a tree, sequences, taxonomy, sample data, and of course the counts of each OTU in each sample (OTU table). Before doing further analysis, let’s end this session by saving our hard-earned object as a compressed, serialized R object. These objects
Save the file as mp-phyloseq-lab.RDS, using the saveRDS()
function.
# Easier without loop
saveRDS(mp2, "mp-phyloseq-lab.RDS")
mapply()
function to loop over two vectors vectors simultaneously, one with the objects (a list) and the other a character vector of the file names you will store.mget()
and paste0()
functions to store a list of the 4 mp*
objects we created is a helpful fist step.# Define object names
mpObjNames = paste0("mp", 0:3)
# Define list of the different stages of our object build
mpList = mget(mpObjNames)
# Use mapply() function to loop over each object
mapply("saveRDS",
mpList,
paste0("mp", 0:3, ".RDS"),
SIMPLIFY = TRUE)
## $mp0
## NULL
##
## $mp1
## NULL
##
## $mp2
## NULL
##
## $mp3
## NULL
If you want to store many objects as one file, or you want to store all the objects at the end of your session, then you may find that save()
or save.image()
to be easier to use than saveRDS()
. The files are usually given the extension .RData
instead of .RDS
, and the reverse operation (loading the stored data into R session) is accomplished with the load()
function, rather than readRDS()
.
Try for yourself!
save(list = paste0("mp", 0:3), file = "myObjList.RData")
This is an R Markdown document (or the HTML result of one). Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
You can also embed plots, for example:
plot(cars)