MG-RAST - Evolution and Genomics

mg-RAST (metagenomics Rapid Annotation using Subsystem Technology) is a web-accessible system that provides a suite of tools for the analysis and visualization of metagenomic data. The system is an adaptation of the RAST server, which was originally implemented to allow high-quality annotation of complete microbial genomes using SEED data. The microbial SEED data are still used in the mg-RAST analysis pipeline; however, numerous other resources have been added to the system to enhance the taxonomic and functional classification of microbial sequences. More recently, the greengenes, RDP-II and European ribosomal RNA databases have been added to enable 16S rRNA classification of metagenomic data sets.

The mg-RAST server is made available using technology established at Argonne National Laboratory and the University of Chicago. Registration with the site is required. User submissions and analyses remain confidential; however, it is possible to make your data ‘public’ and compare them with other public data sets. At its core, the system annotates individual sequence fragments, providing taxonomic and functional classification within a single metagenome and in comparison between multiple metagenomes. These data are presented using various visualization methods that can be adjusted on the fly.

Currently the server handles direct upload of files in fasta, fastq and sff format. Files larger than 50 MB can be uploaded in zip or gzip format. Both fasta and fastq files need to be submitted in plain-text ASCII format. Quality information can be supplied for a fasta file by submitting a second file with the same prefix followed by .qual. Multiplexed sequence data files can be parsed by submitting a descriptive multiplex identifier (MID) file in plain-text ASCII format.
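The upload rules above can be checked locally before submitting. The following is a minimal Python sketch and is not part of mg-RAST: the function names, the reading of "50M" as 50 megabytes, and the reading of "same prefix" as the file name without its extension are all assumptions made for illustration.

```python
from pathlib import Path

# Assumption: "50M" in the text means 50 megabytes.
SIZE_LIMIT_BYTES = 50 * 1024 * 1024


def is_plain_ascii(data: bytes) -> bool:
    """Return True if every byte in the file content is 7-bit ASCII."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True


def looks_like_fasta(text: str) -> bool:
    """Rough check: first non-empty line is a '>' header and a sequence follows."""
    lines = [line for line in text.splitlines() if line.strip()]
    return len(lines) >= 2 and lines[0].startswith(">")


def expected_qual_name(fasta_name: str) -> str:
    """Name of the quality file, assuming 'prefix' means the name minus its extension."""
    return Path(fasta_name).stem + ".qual"


def needs_compression(size_bytes: int) -> bool:
    """Files above the size limit should be zipped or gzipped before upload."""
    return size_bytes > SIZE_LIMIT_BYTES
```

For example, `expected_qual_name("reads.fasta")` gives `reads.qual`. If the server expects a different naming scheme, only that one function needs to change.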

expected learning outcomes

This exercise is designed to introduce you to the mg-RAST analysis platform. You will …

getting started

Access the mg-RAST homepage at http://metagenomics.anl.gov/. You do not need to be a registered user in order to explore the publicly available data sets; however, if you would like to use mg-RAST to analyze your own data, you will need to request access via the registration page. The requests are handled manually and it may take a few days before you are granted access.

There is normally a warning message underneath the title indicating that mg-RAST has been optimized for use with the Firefox browser. There are some browser-to-browser issues with visualization of certain diagrams. This exercise was tested using both Safari Version 5.1 (7534.48.3) and Firefox version 7.01 with no notable conflicts. However, if you notice something is not appearing or looks unusual it can frequently be explained by a browser conflict. Switching between browsers will normally correct these situations.

navigating the mg-RAST site

The icons at the top-right portion of the page are common to all pages within mg-RAST. You can mouse over each icon to see where you will be taken by selecting it (e.g. home, browse metagenomes, analyze metagenomes). These icons are useful for navigating throughout the mg-RAST system while performing an analysis, so it is worthwhile becoming familiar with where they take you.

[hr]

exercise 1: browsing metagenomes

Select the icon to be taken to the browse metagenome window. This page provides a table that lists all publicly available metagenomes processed using the mg-RAST system (by default 25 samples at a time). Limited information about each sample is provided in the table, and some summary information about the entirety of the mg-RAST metagenome library is listed on the left of the page and above the table.

The table has basic searching and sorting functions. By default the following fields are shown: project, name, biome and type. Additional fields can be added by selecting the (…) column header at the right of the table. You should spend some time exploring the table to give yourself a good idea of what types of samples are available for analysis.

After you are done exploring the available data, click clear table filters towards the top of the page to reset the table. We will now locate and examine two publicly available metagenome projects from the JCVI Global Ocean Sampling Expedition. These two samples will be used for subsequent exercises.

Using the search boxes in the table, locate and select the sample taken from the Dirty Rock diving site on Cocos Island, Costa Rica. Answer the following questions about the data and metadata associated with this sample.

Question 1: How many sequences were uploaded?

Question 2: What type of sequencing chemistry/technology was used to produce this data?

Question 3: How many sequences remain post quality-control (QC) analysis? Why do you think this is?

[toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]120,671. The quality control was completed before the data were uploaded to mg-RAST. This should be considered if these data are used for comparative analysis, since there is no way to reproduce the current data without access to the original data and the specific QC parameters.[/toggle]

Question 4: How many other metagenomes are available for the “marine habitat” biome?

[toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]256. This answer can be found by selecting the Find Metagenomes within this Biome link.[/toggle]

Question 5: How many reads were assigned to the eukaryota using the mg-RAST taxonomic assignment pipeline?

[toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]25,898 (15.8%). Most easily determined by hovering over the correct wedge of the pie chart.[/toggle]

Question 6: What is the most abundant phylum identified in this sample?

[toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]Cyanobacteria. Note: the description of the table used to identify this information indicates that it is displaying “species richness”; however, by default the plot is actually displaying phylum.[/toggle]

Question 7: How many sequences were assigned a function-based similarity to a known sequence in the KEGG Database with an e-value exponent between -10 and -20 (that is, an e-value between e-10 and e-20)?

[toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]7,960. Most easily determined by hovering over the correctly colored bar in the KEGG chart.[/toggle]

Question 8: Why are so few sequences assigned function based on similarity to a protein sequence in the SwissProt database?

[toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]Compared to many of the other databases on the list, there are relatively few sequences in the SwissProt database (current stats compared to the TrEMBL database stats).[/toggle]

The second sample we will use was obtained from an ocean sample collected at a depth of 1.7 meters in 25 °C water. Identify the sample meeting these criteria in the metagenome browser and record the following information for it.

Sample Location:

Number of Sequences Submitted:

Most abundant species (phylum):

[hr]

exercise 2: analyzing metagenomes

For all of exercise 2 we will be working with a single metagenome. If you are not already examining the Dirty Rock diving site data, please return to that page.

Click on the icon next to the sample name (not in the master menu at the top-right) to be taken to the analysis portal.

The analysis portal is roughly divided into three panels. The upper-left panel titled “Analysis Views” sets global options for how your analysis will be completed. The “Organism” header provides options for how you would like to perform taxonomic assignment of the metagenome sequence data. In addition, it provides an option for the production of a recruitment plot that maps your metagenome sequences to a genome sequence in the database. If you are interested in performing a functional analysis you need to select “hierarchical analysis” or “annotation” in the appropriate section.

The upper-right panel is where you will select the metagenome(s) you would like to analyze. If you selected the analysis icon from the Dirty Rock sample page, 4441593.3 should currently be displayed at the top of this window. If you had instead selected the icon from the master menu at the top-right of the page, this entry would be blank. You can also search the public metagenomes from this page to identify and select the Dirty Rock metagenome.

In addition, during the mg-RAST assembly, annotation and taxonomic assignment processes, a large amount of data about the analysis is saved and made available to you in this panel. For example, the BLAST scores are saved and linked to the sequence data. This enables you to set a similarity-based threshold for your analysis. If you want to limit your analysis to sequences that were annotated based on a BLAST e-value of < e-10, you can set this parameter here. Criteria can also be set for the minimum % identity cut-off and the minimum alignment length, tightening or loosening the thresholds that each sequence must meet in order to be included in your analysis.
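The effect of these three thresholds can be illustrated with a short Python sketch. This is not mg-RAST code: the `Hit` record, its field names and the default cut-offs are hypothetical and only mirror the e-value, % identity and alignment length criteria described above.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical record for the best database match of one read."""
    read_id: str
    evalue: float      # BLAST e-value of the match
    identity: float    # percent identity, 0 to 100
    align_len: int     # alignment length


def passes(hit, max_evalue=1e-10, min_identity=0.0, min_align_len=0):
    """Keep a hit only if it meets all three thresholds."""
    return (
        hit.evalue < max_evalue
        and hit.identity >= min_identity
        and hit.align_len >= min_align_len
    )


def filter_hits(hits, **thresholds):
    """Return the hits that would be included in the analysis."""
    return [hit for hit in hits if passes(hit, **thresholds)]
```

Lowering `max_evalue` or raising `min_identity` and `min_align_len` makes the filter stricter, so fewer reads contribute to the charts generated afterwards.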

Embedded into the upper-right table is a section titled “Data Visualization”. In reality, it is more than just a data visualization menu. This is where you will specify how you would like the metagenomics data to be parsed, clustered and analyzed and will most likely be the menu you utilize the most.

The bottom window is organized as a series of tabs. With a new analysis, there should only be two tabs: one titled “workbench” and the other “getting started”. The getting started tab is always available and provides general usage information. The Workbench tab stores data that you assign to it. This will become clearer as we proceed through the exercise. A new tab is added to this window every time you select a new Data Visualization and click the Generate button.

Make sure that the 4441593.3 metagenome is selected and set the e-value threshold to e-10. Leave all of the other parameters at their defaults, select the radio button under the bar chart and click the Generate button.

Each analysis will take anywhere from a few seconds to a few minutes to complete. This is largely dependent on the size of your data, but can also be affected by server load. You know your analysis is running if you see a spinning wheel at the bottom of the Analysis View window.

Once the analysis is complete, you should see a new tab in the bottom window titled “Organism barchart 1”. Tabs will continue to be added in this manner as you complete additional analyses. This is convenient, as you can always return to a previous analysis.

Bar charts are interactive and you can “drill down” through taxonomic space by selecting a specific bar. Select several different bars on the chart and see how this works. You can always return to the default by selecting the draw button towards the top of the tab.
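Conceptually, drilling down amounts to recounting reads one rank below the bar that was selected. The Python sketch below illustrates the idea only; the lineage tuples and the `drill_down` function are hypothetical and do not reflect how mg-RAST stores or queries its data.

```python
from collections import Counter


def drill_down(lineages, path=()):
    """Count reads per taxon at the rank directly below `path`.

    `lineages` holds one tuple per read, ordered from the highest rank
    downwards, for example ("Bacteria", "Proteobacteria", ...).
    `path` is the sequence of bars selected so far; an empty path
    corresponds to the default top-level chart.
    """
    path = tuple(path)
    depth = len(path)
    counts = Counter()
    for lineage in lineages:
        # Keep reads under the selected taxon that are classified one rank deeper.
        if lineage[:depth] == path and len(lineage) > depth:
            counts[lineage[depth]] += 1
    return counts
```

Calling `drill_down(lineages)` gives the counts for the top-level chart, and passing a longer `path` corresponds to clicking successively deeper bars.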

By default the data are displayed following a normalization procedure described here. If you prefer to view the raw read counts, you can select this from the menu and redraw the bar chart by selecting draw. Do this now and answer the following basic questions about the Dirty Rock data set.

Question 9: How many sequences are proteobacterial?

Question 10: How many Salmonella sequences were identified in the sample?

In order to store a specific subset of data for future analysis you can select them and move them to the Workbench using the to Workbench button. Do this now for all proteobacterial sequences. You can confirm that the sequences are in your Workbench by selecting the Workbench tab.

The proteobacterial sequences you just assigned to your Workbench can be organized into a tree representation of their taxonomic relationships by selecting the Use features from Workbench option in the Data Selection window and then selecting the radio button under “tree” in the Data Visualization window. Once these two options are selected, click on the Generate button to draw a tree representation of the proteobacterial taxa in the sample. Once complete, select the new tree tab to visualize the results.

The default is to display the data at the taxonomic level of “order” and color each order according to “phylum”. Altering this configuration can aid in visualizing taxonomic distribution in your sample. Select several options to understand how taxon level coloring changes visual interpretation of the data.

Additional options

Rarefaction curves can also be generated for one or several metagenomes at a time. A rarefaction curve estimates species richness for a given number of sampled individuals: it plots the number of species as a function of the number of individuals sampled. On the left, a steep slope indicates that a large fraction of the species diversity remains to be discovered. If the curve flattens towards the right, a reasonable number of individuals has been sampled, and more intensive sampling is likely to yield only a few additional species.
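To make the shape of the curve concrete, the sketch below computes expected richness with the standard analytical rarefaction formula, in which a species with N_i of the N individuals is missed by a random subsample of size n with probability C(N - N_i, n) / C(N, n). This is a textbook formula chosen for illustration; whether mg-RAST computes its curves this way is not stated here, and the function names are hypothetical.

```python
from math import comb


def expected_richness(counts, n):
    """Expected number of species in a random subsample of n individuals.

    `counts` lists the number of individuals (reads) observed per species.
    """
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the total number of individuals")
    all_subsamples = comb(total, n)
    # For each species, add the probability that at least one individual is drawn.
    return sum(1 - comb(total - c, n) / all_subsamples for c in counts if c > 0)


def rarefaction_curve(counts, step=1):
    """Return (n, expected richness) pairs from 0 up to the full sample size."""
    total = sum(counts)
    sizes = list(range(0, total, step)) + [total]
    return [(n, expected_richness(counts, n)) for n in sizes]
```

A community dominated by one species produces a curve that rises slowly and keeps climbing, while an even community flattens out after relatively few individuals.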

exercise 3: comparing two metagenomes

Note: You may want to reload the analysis window before doing this to clear any defaults set from the previous exercise. You can do this by simply clicking in the upper-right of the window.

One of the more powerful applications of mg-RAST is the ability to compare multiple metagenomes. This is made possible by the normalization procedure, which you should familiarize yourself with by reading the description here. Following normalization, you should be able to compare data gathered from vastly different locations and environments. Keep in mind, however, that samples collected at different times, processed with different DNA extraction protocols and sequenced on different machines may carry underlying effects that normalization alone does not correct. Rigorous follow-up studies should be used to confirm findings uncovered by this type of comparative analysis.

This exercise will compare the Dirty Rock metagenome analyzed in the previous exercise with a metagenome collected on the same voyage but from the Caribbean Sea off Key West, Florida. The Dirty Rock metagenome we have been working with can be identified as sample GS025, and the Caribbean Sea metagenome is GS015. You can search for these metagenomes in the Data Selection Window on the analysis page if you first change the window that says projects to public. This will enable you to search all publicly available metagenomes and add them to your analysis. Do this now and create a bar chart using the default settings.

Select all of the sequences from the Eukaryota and add them to your workbench. This should result in 29,296 sequences on your workbench. Visualize them as a tree structure at the maximum level of family and colored by phylum.

The default stacked bar chart may not be the most informative to display differences between the samples, so switch to the ‘bar chart’ and examine the results. Overall you should see a high level of congruence between the data.

exercise 4: comparing multiple metagenomes

Note: You may want to reload the analysis window before doing this to clear any defaults set from the previous exercise. You can do this by simply clicking in the upper-right of the window.

mg-RAST can perform some statistical analyses on multiple metagenomes that are not relevant when comparing only two metagenomes. These are Principal Coordinates Analysis (PCoA) and the calculation of p-values for assessing the difference between user-defined groups. For the following example we will be comparing relatively distinct metagenomes.
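For readers unfamiliar with PCoA, the sketch below shows the classical procedure in plain Python: build a distance matrix between samples, double-center the squared distances, and take the leading eigenvector as the first ordination axis. It is an illustration under stated assumptions and not mg-RAST's implementation: Euclidean distance between abundance profiles is assumed, only the first axis is extracted (by power iteration), and all function names are hypothetical.

```python
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two abundance profiles (assumed metric)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def first_pcoa_axis(profiles, iterations=200):
    """Coordinates of each sample on the first principal coordinate."""
    n = len(profiles)
    # Step 1: squared pairwise distances, scaled by -1/2.
    a = [[-0.5 * euclidean(p, q) ** 2 for q in profiles] for p in profiles]
    # Step 2: double-centering (the matrix is symmetric, so row means equal column means).
    row_mean = [sum(row) / n for row in a]
    grand_mean = sum(row_mean) / n
    b = [[a[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
         for i in range(n)]
    # Step 3: power iteration for the leading eigenvector and eigenvalue.
    v = [float(i + 1) for i in range(n)]
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return [0.0] * n  # all samples identical
        v = [x / norm for x in w]
        eigenvalue = norm
    # Step 4: scale the unit eigenvector by the square root of its eigenvalue.
    return [x * sqrt(eigenvalue) for x in v]
```

Samples with similar profiles receive similar coordinates, which is why related metagenomes appear as clusters in the plot. The sign of an axis is arbitrary, so a mirrored plot carries the same information.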

Note: Google’s Chrome browser does not correctly display the PCoA results. Completing this exercise will require using Safari or Firefox.

Refresh your analysis page, add the following projects to the Selected Metagenomes window and perform a PCoA on them:

[unordered_list style=”arrow”]

Cow Rumen — 710F6
Cow Rumen –640F6
Cow Rumen –80F6
LeanMouseCecumMic2005
ObeseMouseCecumMic2005
FishHealGutKentSTMic20060504
FishHealSlimKentSTMic20060504
Fish — Healthy gut bacteria

[/unordered_list]

When the PCoA is complete, you should see a plot with three distinct groups. You can define these groups by selecting each spot and color-coding it according to the group set in the dropdown box above the plot. For example, if the box displays Group 1, every sample you click will go into group 1. Change the box to Group 2 in order to select samples you would like to place into that group, and so on. There are a variety of ways this page allows you to select groups, so explore the menu options. When you are done you should see a plot that looks like the figure below. Depending on the order in which you selected the groups, the colors might be different; however, the groupings are what is important.

Click “Store Grouping” to save your work before moving to the next step.

mg-RAST is currently only able to perform a t-test comparison on each of these groups using the bar chart module. So go ahead and create a bar chart on these data. When the bar chart is complete, select the calculate p-values option which is towards the top of the window to calculate p-values for each grouping you specified in the PCoA. mg-RAST automatically selects the statistical test to be used on the data. This selection process is described in the table below.
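As a rough illustration of what a two-group t-test compares, the sketch below computes Welch's t statistic and its approximate degrees of freedom for the abundance of one taxon in two groups of samples. The choice of Welch's variant is an assumption, since the text only says that mg-RAST selects the test automatically, and the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return (t statistic, degrees of freedom) for two groups of abundances.

    Each group needs at least two samples, and the groups must not both
    have zero variance.
    """
    n_a, n_b = len(group_a), len(group_b)
    # Squared standard error of each group mean.
    se_a = variance(group_a) / n_a
    se_b = variance(group_b) / n_b
    t = (mean(group_a) - mean(group_b)) / sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df
```

Turning the statistic into a p-value requires the cumulative distribution of Student's t, which the Python standard library does not provide, so that step is left out of the sketch. A large absolute t with few degrees of freedom can still correspond to a modest p-value, which is one reason groups with very few samples should be interpreted cautiously.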

For these data, each of the groupings specified in the PCoA is highly supported at the higher taxonomic levels. However, it can become more interesting as you burrow down through the data. For example, which class of proteobacteria is NOT different amongst these groupings?

[toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]Gammaproteobacteria are the only class of proteobacteria not statistically different between the groups specified using the PCoA (p = 0.0981).[/toggle]

references

The Metagenomics RAST server – A public resource for the automatic phylogenetic and functional analysis of metagenomes. F. Meyer, D. Paarmann, M. D’Souza, R. Olson, E. M. Glass, M. Kubal, T. Paczian, A. Rodriguez, R. Stevens, A. Wilke, J. Wilkening, R. A. Edwards. BMC Bioinformatics 2008, 9:386