We will provide updates to the VM to enable you to analyze your own samples easily using the pipeline. Please go to our web site for instructions on how to do it:
http://algbio.cs.uni-duesseldorf.de/vm-dist-support/pps-tutorial-ck/howto.html
PhyloPythiaS is a composition-based taxonomic assignment software for metagenomic samples which uses a collection of structural support vector machines (SVMs). We have now combined it with an automated search for relevant clades and reference data to create a taxonomic model of the relevant clades for your metagenome sequence sample.
It requires a two-step procedure: first, you identify the clades to be included in the model and train the model using reference sequence data. To get good results, it is important to choose appropriate clades for modeling. To guide the selection process, one can use marker gene analysis to analyze the taxonomic content of a sample. In the second step, you use this model to assign your sequence fragments to these clades.
In our tutorial we will be assigning an assembled metagenome sequence data set from the gut of the Australian Tammar Wallaby (Pope et al. PNAS 2010). We will be working with a part of the data set, 5994 contigs, which are part of scaffolds with two or more contigs.
We first have to decide on what clades we want to model. It should be the clades from which the metagenome sample contigs originate. We might have some prior knowledge from additional 16S rRNA studies and this can help us to decide. We also need reference sequences, known to originate from these clades, to train our model. To learn more about relevant clades and find training sequence, we will first search in our metagenome sequence sample for a set of marker genes (5S, 16S, 23S and others), perform taxonomic assignment of them, and decide which clades to model based on these results.
Some of the steps require more time than we have in the tutorial or more memory than might be available on your system. In this case we can skip the step and continue with pre-computed data that was placed on the VM.
The VM is in the general OVA format and supports VirtualBox and VMware Player. When you have imported the VM into your desktop virtualization software, go to the machine settings dialog and make sure that there is a shared folder with the name “tmp”. It can point to a folder with a name of your choice on your host system, but make sure its name is “tmp” and that the auto-mount checkbox is unchecked. The default under Linux is the /tmp folder. Once defined, this folder can be reached via /host-tmp/ and used to exchange data with your host system or to place temporary data outside of the VM. Under the VM settings, also make sure that the assigned amount of memory (2 GB) is suitable for your computer.
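Once the VM has booted, you can double-check the assigned memory from inside the guest. A quick sketch using standard Linux tools; the printed value depends on your settings:

```shell
# Inside the VM: print total memory in MB
# (should be around 2048 if the default 2 GB assignment was kept)
free -m | awk 'NR==2 {print $2 " MB total"}'
```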
Updates
The VM, as delivered at the workshop, should be updated to include the latest changes. Ensure that the programs in the VM have network access (e.g., open a browser and go to a web page) and run the following commands.
Get the updated tutorial onto the desktop:
fsexec get-tutorial
Fix a bug in the run.py control script:
fsexec fix-runpy
Verify that the script works and define the alias used in the rest of this tutorial:
/apps/ppstutorial/tools/pPPS/run.py -h
alias runppm=/apps/ppstutorial/tools/pPPS/run.py
Step by step
Optional steps are colored in purple.
The pipeline has been installed to /apps/ppstutorial. Open the file manager or use the terminal and go to this folder.
cd /apps/ppstutorial/
In the terminal, list folders using
ls
Here is a description of the folders and their contents:
| name | description |
|---|---|
| data | |
| tools | software used in the pipeline |
| analysis | subfolders contain the samples and settings for the analysis |
Next we go into the analysis sub-folder containing the Tammar Wallaby sample.
cd /apps/ppstutorial/analysis/TW/
Have a look at the files, again.
ls
| name | description |
|---|---|
| contigs_tw.fna | sample contigs to analyze |
| config.cfg | pipeline configuration |
| config_PPS.txt | PhyloPythiaS configuration |
| pps_tw2009.* | predictions from the publication (used for comparison) |
| expert_tw | expert data for this sample that can be used to refine the model before training |
| output | the place where the result files are created |
| working | the place where all temporary files are created |
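If you want to confirm that the sample file matches the description above, you can count the FASTA records with a small helper. A sketch; the count of 5994 contigs is only expected for the unmodified tutorial file:

```shell
# Count contigs in a FASTA file: each record starts with a '>' header line
count_contigs() {
  grep -c '^>' "$1"
}
# e.g. count_contigs contigs_tw.fna   (should report 5994 for this sample)
```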
Have a look at the config files using a text editor. The graphical editor is called leafpad; on the command line you can also use “less”, e.g.
less config_PPS.txt
Exit with “q”.
To locate the marker genes on the sample fragments, we use hidden Markov models that were constructed from multiple sequence alignments. The marker gene analysis is currently divided into two steps, one for 5S/16S/23S and another for an extended set of genes (AMPHORA). Once the genes are found, their taxonomic origin will be determined using the RDP classifier (as implemented in mothur).
Run the 5S/16S/23S step using
runppm -c config.cfg -n
If you have less than 2 GB of memory assigned to the VM, you should skip this step and use the pre-computed files.
Run the AMPHORA step using
runppm -c config.cfg -g
We can now prepare the results of the marker gene analysis to be used with PhyloPythiaS.
runppm -c config.cfg -o s16 mg
This is the step where we can refine the clades for training, e.g. if we already have some knowledge about the taxonomic content of the metagenome sequence sample. The file “ncbids.txt” in the working directory contains the NCBI taxonomic IDs of the clades that will be modeled. If you want, you can look them up on the NCBI website. You can also add expert sequence data to the folder “sampleSpecificDir”. To do this, we simply copy the expert-curated FASTA files from the folder /apps/ppstutorial/analysis/TW/expert_tw into sampleSpecificDir.
cd /apps/ppstutorial/analysis/TW
cp expert_tw/*.fas working/sampleSpecificDir/
The files are named according to the NCBI taxonomy ID and have the extension .fas. They contain the following amounts of sequence, and the taxonomy IDs represent the following species:
| bpCount | species | ncbid |
|---|---|---|
| 209kb | Bacteria;Firmicutes;Clostridia;Clostridiales;Lachnospiraceae;uncultured Lachnospiraceae bacterium | 297314 |
| 129kb | Bacteria;Firmicutes;Erysipelotrichi;Erysipelotrichales;Erysipelotrichaceae;uncultured Erysipelotrichaceae bacterium | 331630 |
| 390kb | Bacteria;Proteobacteria;Gammaproteobacteria;Aeromonadales;Succinivibrionaceae;uncultured Succinivibrionaceae bacterium | 538960 |
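The bpCount values above can be reproduced with a small shell helper. A sketch: it counts all non-header, non-newline characters, so it assumes the files contain only sequence and header lines; the file name in the example follows the taxonomy-ID naming described above and is an assumption:

```shell
# Approximate basepair count of a FASTA file: drop header lines,
# drop newlines, count the remaining characters
bp_count() {
  grep -v '^>' "$1" | tr -d '\n' | wc -c
}
# e.g. bp_count working/sampleSpecificDir/297314.fas   (file name assumed)
```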
When the modeled clades are satisfactory, we can continue to train a classification model for our sample using some of the sample data and reference genomes. This step takes between 2 and 5 hours, can use 3 GB or more of memory, and will not be possible on everybody's computer. For this reason you should skip this step and continue with the prediction step using the pre-computed models. The NCBI genomes needed for training are not contained in the VM by default. If you want to do training for any sample, go to /apps/ppstutorial/data/genomes and follow the instructions in the README to populate this folder. The genomes will take another 4 to 5 GB on your hard disk. If you want to do the training later, shut down the VM and adjust the main memory to 3 GB.
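Before downloading the genomes, it may be worth checking that enough disk space is available. A quick sketch; the mount point holding /apps may differ in your setup:

```shell
# Show free space on the root filesystem; you need roughly
# 4 to 5 GB extra for the NCBI genomes
df -h / | awk 'NR==2 {print $4 " available"}'
```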
runppm -c config.cfg -t
The actual prediction uses the models to assign each sequence in the sample file to a taxon. This step is triggered using:
runppm -c config.cfg -p c
Notice: if you run this step multiple times, you currently have to remove the temporary file “contigs_tw.fna.ids.nox.fna” in the working folder.
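For example, run this from the /apps/ppstutorial/analysis/TW folder before repeating the prediction. A defensive sketch that is a no-op if the file is not there:

```shell
# Remove the stale intermediate file so the prediction step starts clean
if [ -f working/contigs_tw.fna.ids.nox.fna ]; then
  rm working/contigs_tw.fna.ids.nox.fna
fi
```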
Finally, to populate the output folder with the results, you run:
runppm -c config.cfg -r -s
Go into the output folder and inspect the result files.
cd /apps/ppstutorial/analysis/TW/output
ls
or use the file manager.
You can also look at pie charts displaying the relative proportions of taxonomic assignments (in terms of number of fragments, not basepairs!) in the folder /apps/ppstutorial/analysis/TW/working. These are the files ending with .svg. You can open them with most web browsers, for instance the file /apps/ppstutorial/analysis/TW/working/contigs_tw.fna.ids.pie_species.svg.
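To see which charts are available, you can list them from the terminal. A sketch; firefox in the comment is just one possible viewer and may not be installed in your VM:

```shell
# List the pie-chart SVGs produced by the pipeline; prints a note
# if the prediction step has not been run yet
ls /apps/ppstutorial/analysis/TW/working/*.svg 2>/dev/null || echo "no SVG files yet"
# to view one, e.g.:
#   firefox /apps/ppstutorial/analysis/TW/working/contigs_tw.fna.ids.pie_species.svg
```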