PhyloPythiaS Pipeline Tutorial



Updates to the VM to analyze your own samples

We will provide updates to the VM to enable you to analyze your own samples very easily using the pipeline. Please go to our web site for instructions on how to do this:

http://algbio.cs.uni-duesseldorf.de/vm-dist-support/pps-tutorial-ck/howto.html

Introduction

PhyloPythiaS is a composition-based taxonomic assignment software for metagenomic samples which uses a collection of structural support vector machines (SVMs). We have now combined this with an automated search for relevant clades and reference data to create a taxonomic model of the relevant clades for your metagenome sequence sample.

It requires a two-step procedure: first, you identify the clades to be included in the model and train the model using reference sequence data. To get good results, it is important to choose appropriate clades for modeling; marker gene analysis of the sample's taxonomic content can guide this selection. In the second step, you use the trained model to assign your sequence fragments to these clades.

Scenario

In this tutorial we will assign an assembled metagenome sequence data set from the gut of the Australian Tammar Wallaby (Pope et al. PNAS 2010). We will be working with a part of the data set: 5994 contigs that are part of scaffolds with two or more contigs.
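As a quick sanity check of a sample file, you can count the contigs in a FASTA file by counting its header lines. The snippet below builds a tiny stand-in file so it runs anywhere; on the VM you would point the same grep at /apps/ppstutorial/analysis/TW/contigs_tw.fna instead.

```shell
# Count sequences in a FASTA file: every record starts with a '>' header.
# Toy stand-in input; on the VM, run the grep line against contigs_tw.fna.
cat > /tmp/toy_contigs.fna <<'EOF'
>contig_1
ACGTACGTACGT
>contig_2
GGCCTTAAGGCC
EOF
grep -c '^>' /tmp/toy_contigs.fna
```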

We first have to decide on what clades we want to model. It should be the clades from which the metagenome sample contigs originate. We might have some prior knowledge from additional 16S rRNA studies and this can help us to decide. We also need reference sequences, known to originate from these clades, to train our model. To learn more about relevant clades and find training sequence, we will first search in our metagenome sequence sample for a set of marker genes (5S, 16S, 23S and others), perform taxonomic assignment of them, and decide which clades to model based on these results.

Virtual machine notice

Some of the steps require more time than we have in the tutorial, or more memory than might be available on your system. In these cases we can skip the step and continue with pre-computed data that was put on the VM.

The VM is distributed in the general OVA format and supports VirtualBox and VMware Player. When you have imported the VM into your desktop virtualization software, go to the machine settings dialog and make sure that there is a shared folder with the name “tmp”. It can point to a folder with any name on your host system, but its share name must be “tmp” and the auto-mount checkbox must be unchecked. The default under Linux is the /tmp folder. Once defined, this folder can be reached inside the VM via /host-tmp/ and can be used to exchange data with your host system or to place temporary data outside of the VM. In the VM settings, also make sure that the assigned amount of memory (2 GB) is suitable for your computer.

Updates

The VM, as delivered at the workshop, should be updated to include the latest changes. Ensure that the programs in the VM have network access (e.g. open a browser and go to a web page) and run the following commands.

Get updated tutorial on the desktop:

fsexec get-tutorial

Fix a bug in run.py control script:

fsexec fix-runpy

Pipeline control

There is a control script called “run.py” which runs each individual step of the pipeline. You can list its options using

/apps/ppstutorial/tools/pPPS/run.py -h

For convenience, we can define a temporary alias:

alias runppm=/apps/ppstutorial/tools/pPPS/run.py
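The alias lasts only for the current terminal session. To keep it across sessions you can append it to your shell startup file. The snippet below uses a stand-in file so it is self-contained; on the VM you would append to ~/.bashrc instead.

```shell
# Persist the alias by appending it to a shell startup file.
# Stand-in file here; on the VM, use ~/.bashrc instead of /tmp/demo_bashrc.
rcfile=/tmp/demo_bashrc
echo 'alias runppm=/apps/ppstutorial/tools/pPPS/run.py' >> "$rcfile"
grep 'runppm' "$rcfile"
```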

Step by step

Optional steps are colored in purple.

The pipeline has been installed to /apps/ppstutorial. Open the file manager, or use the terminal and go to this folder.

cd /apps/ppstutorial/

In the terminal, list folders using

ls

Here is a description of the folders and their contents:

name      description
----      -----------
data      marker gene information; genome sequences for training; NCBI taxonomy
tools     software used in the pipeline: hmmer, AMPHORA/rna_hmm3, PhyloPythiaS
analysis  subfolders contain the samples and settings for the analysis



Next we go into the analysis sub-folder containing the Tammar Wallaby sample.

cd /apps/ppstutorial/analysis/TW/

Have a look at the files again.

ls

name            description
----            -----------
contigs_tw.fna  sample contigs to analyze
config.cfg      pipeline configuration
config_PPS.txt  PhyloPythiaS configuration
pps_tw2009.*    predictions from the publication (used for comparison)
expert_tw       expert data for this sample that can be used to refine the model before training
output          the place where the result files are created
working         the place where all temporary files are created



Have a look at the config files using a text editor. The graphical editor is called leafpad, while on the command line you can also use “less”, e.g.

less config_PPS.txt

Exit with “q”.

To locate the marker genes on the sample fragments, we use hidden Markov models that were constructed from multiple sequence alignments. The marker gene analysis is currently divided into two steps, one for 5S/16S/23S and another for an extended set of genes (AMPHORA). Once the genes are found, their taxonomic origin is determined using the RDP classifier (as implemented in mothur).

Run the 5S/16S/23S step using

runppm -c config.cfg -n

If you have less than 2 GB of memory assigned to the VM, you should skip this step and use the pre-computed files.

Run the AMPHORA step using

runppm -c config.cfg -g

We can now prepare the results of the marker gene analysis to be used with PhyloPythiaS.

runppm -c config.cfg -o s16 mg

This is the step where we can refine the clades for training, e.g. if we already have some knowledge about the taxonomic content of the metagenome sequence sample. The file “ncbids.txt” in the working directory contains the NCBI taxonomic IDs of the clades that will be modeled. If you want, you can look them up on the NCBI Taxonomy website. You can also add expert sequence data to the folder “sampleSpecificDir”. To do this, we simply copy the expert-curated FASTA files from /apps/ppstutorial/analysis/TW/expert_tw into sampleSpecificDir.

cd /apps/ppstutorial/analysis/TW
cp expert_tw/*.fas working/sampleSpecificDir/
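If the marker gene results missed a clade you know is present, you can also edit ncbids.txt by hand. The sketch below appends one taxon ID and removes duplicates; it works on a stand-in file, and the added ID (186803, the family Lachnospiraceae) is only an illustrative example. On the VM you would edit working/ncbids.txt instead.

```shell
# Append an extra clade to the list of modeled clades and de-duplicate.
# Stand-in file and example taxon ID (186803, Lachnospiraceae);
# on the VM, edit working/ncbids.txt instead.
mkdir -p /tmp/working_demo
printf '297314\n331630\n538960\n' > /tmp/working_demo/ncbids.txt
echo '186803' >> /tmp/working_demo/ncbids.txt
sort -n -u -o /tmp/working_demo/ncbids.txt /tmp/working_demo/ncbids.txt
cat /tmp/working_demo/ncbids.txt
```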

The files are named after the NCBI taxonomy ID and have the extension .fas. They contain the following amounts of sequence, and the taxonomy IDs represent the following species:



bpCount  ncbid   species
-------  ------  -------
209 kb   297314  Bacteria;Firmicutes;Clostridia;Clostridiales;Lachnospiraceae;uncultured Lachnospiraceae bacterium
129 kb   331630  Bacteria;Firmicutes;Erysipelotrichi;Erysipelotrichales;Erysipelotrichaceae;uncultured Erysipelotrichaceae bacterium
390 kb   538960  Bacteria;Proteobacteria;Gammaproteobacteria;Aeromonadales;Succinivibrionaceae;uncultured Succinivibrionaceae bacterium

When the modeled clades are satisfactory, we can continue and train a classification model for our sample using some of the sample data and reference genomes. This step takes between 2 and 5 hours, can use 3 GB or more of memory, and will not be possible on every computer. For this reason you should skip this step and continue with the prediction step using the pre-computed models. The NCBI genomes needed for training are not contained in the VM by default. If you want to do training for any sample, go to /apps/ppstutorial/data/genomes and follow the instructions in the README to populate this folder; the genomes will take another 4 to 5 GB on your hard disk. If you want to do the training later, shut down the VM and increase the assigned main memory to 3 GB.


runppm -c config.cfg -t

The actual prediction uses the models to assign each sequence in the sample file to a taxon. This step is triggered using:

runppm -c config.cfg -p c

Note: if you run this step multiple times, you currently have to remove the temporary file “contigs_tw.fna.ids.nox.fna” in the working folder first:

rm working/contigs_tw.fna.ids.nox.fna

Finally, to populate the output folder with the results, you run:

runppm -c config.cfg -r -s

Go into the output folder and inspect the result files.

cd /apps/ppstutorial/analysis/TW/output
ls

or use the file manager.
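To get a quick per-taxon overview of the assignments, you can tally the taxon column of a prediction file. The snippet below is a sketch on toy data and assumes a tab-separated "contig, taxon ID" layout; check the actual column layout of the files in the output folder before adapting it.

```shell
# Toy per-taxon summary of assignments. The tab-separated
# "contig<TAB>taxon_id" layout is an assumption; inspect the real
# files in output/ for the exact format before adapting this.
printf 'contig_1\t297314\ncontig_2\t297314\ncontig_3\t538960\n' > /tmp/toy_pred.tsv
cut -f2 /tmp/toy_pred.tsv | sort | uniq -c | sort -rn
```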

You can also look at pie charts displaying the relative proportions of taxonomic assignments (in terms of the number of fragments, not basepairs!) in the folder /apps/ppstutorial/analysis/TW/working. These are the files ending with .svg. You can open them with most web browsers, for instance the file /apps/ppstutorial/analysis/TW/working/contigs_tw.fna.ids.pie_species.svg.