Table of contents
- Expected learning outcome
- Part 1: A naive attempt
- Part 2: Evaluate the assembly
- Part 3: Improve the initial assembly
Expected learning outcome
The objective of this activity is to help you understand the basics of genome assembly, using typical tools and technology available today (January 2016). You will learn how to run an assembler on real bacterial data. More importantly, you will learn how to evaluate an assembly with relevant quality metrics. This lab also covers several techniques to obtain assemblies of reasonable quality.
Part 1: A naive attempt
Some assembly tutorials will tell you exactly what steps to take, what software to run and what parameters to try. This is in stark contrast with what you might experience in the real world: your libraries just came out of the sequencer, and no-one tells you what is the best assembler for this data, let alone how to set the parameters for it. In this first part of the tutorial, we would like to let you run an assembly with minimal instructions. However, if this is the first time you are using Linux and command-line software, we realize that you might get stuck. Thus, detailed instructions on how to perform this part are given below. On the other hand, if you have some experience with Linux and the command line, we encourage you to try to perform this part without the detailed instructions.
You are given a bacterial dataset to assemble: 500,000 reads from V. cholerae (genome size: ~4 Mbp) obtained using the Illumina MiSeq sequencer. The insert size is 150 bp, so the pairs are overlapping by quite a bit (you could merge them, but let’s not do that at first). The reads are in the directory called assembly/subsampled_illumina Use the files called SRR531199_1.500k.fastq and SRR531199_2.500k.fastq. Note that these are paired-end reads.
The goal of Part 1 is to obtain a quick, initial assembly of the V. cholerae genome using these reads. You could use any of the assemblers which have been pre-installed on your Instance. “But, which one”, you ask? Well, let’s assume that we don’t know. The point of this part is to let you to take initiatives and pick one! So, for this first attempt, if the assembler asks for any parameter (e.g. the k-mer size), try to guess a reasonable value but do not over-think it. At this point, it does not matter if the assembly is of poor quality. In the next Part, you will measure the quality of your initial assembly and recognize that it may be improved. Once you have generated your first assembly, move to Part 2. If you are stuck, don’t panic, and follow the detailed steps below.
Spades is an assembler designed for prokaryotic and small eukaryotic genomes. It does an excellent job (probably the best) at assembling bacteria, either multi-cell or single-cell data, and small metagenomes. It is multi-k and takes generally longer time and memory than other assemblers. It also supports assembly of IonTorrent and PacBio data for small genomes.
Velvet is a very popular assembler for Illumina data. However, it can require quite a lot of memory on >100 Mbp genomes. The bacterial dataset we are using here is small, so Velvet is still viable option.
GATB-Pipeline & Minia
GATB-Pipeline is a de novo assembly pipeline. It takes as input an Illumina dataset and can assemble large genomes (e.g. human) quickly, using small resources. It cannot take as input PacBio data. Its command line is similar to SPAdes. It is multi-k, and based on Minia assembler. [Note: GATB-Pipeline is an unpublished assembly pipeline, partly written by Rayan. He believes it will give you nice assemblies, but the software hasn’t been peer reviewed. So, in actual projects, use at your own risks!]
Download is available here: http://gatb-pipeline.gforge.inria.fr/versions/bin/gatb-pipeline-1.171.tar.gz
NOTE: GATB has already been downloaded for the workshop
Or maybe you will want to use just Minia, which is an assembler that can create contigs very quickly, but only for a single k value (like Velvet).
Many other assemblers
NOTE: These aren’t installed on the AMI, but are for your future reference.
The list of available assemblers is huge and the selection here is by no means complete. However, all of these listed are commonly used and should do a great job assembling this dataset.
SOAPdenovo 2 http://sourceforge.net/projects/soapdenovo2/
We encourage you to try solving Part 1 without looking at these steps. But if you are stuck, you may reveal a set of detailed steps by clicking on the link below. If these steps remain unclear, do not hesitate to ask a lab instructor for clarification.[toggle title_open=”Hide detailed steps” title_closed=”Show detailed steps” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”yes”]
We’ll use GATB-Pipeline to obtain an assembly of the given reads.
Now make sure that your current directory is where you want to want your assembly to be. You may enter these commands to create a new temporary directory and make it your current directory: [box]mkdir ~/workshop_data/assembly/gatb-assembly
cd ~/workshop_data/assembly/gatb-assembly [/box]
Let’s put a symbolic link of the reads into this directory. It is to avoid typing the long path to the reads.[box] ln -s ~/workshop_data/assembly/subsampled_illumina/SRR531199_?.500k.fastq ./ [/box] (don’t forget the “./” at the end, it means that the link is to be put inside the current directory. Note: Everything should be typed on one line)
Now we are ready to assemble. Enter the following command to assemble: [box]
gatb -1 SRR531199_1.500k.fastq -2 SRR531199_2.500k.fastq [/box] It will take around 20 minutes. If all goes fine, your assembly with GATB-Pipeline will be in the file assembly.fasta. Meanwhile, have a look on how to run another assembler, I highly recommend SPAdes. The command to run it is similar to GATB-Pipeline.[/toggle]
Part 2: evaluate the assembly
(~ 20 mins + ~ 20 mins for optional steps)
Exercise: evaluate your assembly with QUAST
Run QUAST on your output file. The syntax is below. You will want to specify a descriptive name for the output directory where QUAST will deposit a large number of output files. Use the parameter “–scaffolds” if your assembler produced scaffolds (Spades will, Minia won’t). [box]quast.py -o output_directory assembly_file [–scaffolds] [/box] Note that the “–scaffold” argument is with two dashes (‘-‘ + ‘-‘) but is displayed as a single dash on this web page. Note that if specifiying the assembly_file also specify the directory. Move into your QUAST output directory and examine the report.txt file. You may also take a look at the HTML file. A description of the output can be found on the QUAST web site: http://quast.bioinf.spbau.ru/manual.html.
Exercise: report and compare assembly metrics
Please fill in this Google Docs spreadsheet to compare your assembly with the other participants. It will ask you for the name of the assembler you used (SPAdes, GATB-Pipeline, etc..), and some statistics that can all be found in the QUAST report.
Take a look at the assembly qualities from the other participants. At this point, you may be tempted to re-run your assembly with better parameters. This is in fact what we will do in Part 3.
- Just to get a rough idea, assemblies that have a contig N50 around 100 Kbp can be considered to be “okay”-“good”. Although, note that there is no formal consensus about that in the community. They are “barely acceptable” when they have a contig N50 in the order of 10 kbp. This is because 10 Kbp is roughly the average length of genes. How was your initial assembly?
- Based on the results reported by the other participants, is there a single metric which reflects the overall quality of an assembly?
Optional: annotate the assembly
You may use the tool Prokka to annotate the assembly. It is installed on your Instance. In case you wish to install Prokka on another machine later, there are instructions are the end of this document. To run Prokka, just type in the same folder as your assembly: [box]prokka assembly_file[/box]
Note: if your prokka command fails, try running this command first: [box]export LC_ALL=C[/box]
Note: Prokka will produce an error message if your header line is greater than 20 characters long
“Contig ID must <= 20 chars long: scaffold1scaffold1NODE_1_length_1420_cov_33.173943”
To overcome this, use this short bash script (!!do not copy and paste!!, write it out! because double quotes do not print well):
- How many annotated proteins were predicted by Prokka? (hint: it is the number of entries in the output .faa file)
- Across all participants, how do the different metrics compare to the number of annotated genes?
Optional: assess the accuracy of your assembly
You may use the refence genome of V. Cholerae provided with the image: vcholerae_h1.fasta
Give this reference genome as an input to QUAST using the parameter “-R reference_file.fna”.
- How many large / small misassemblies were made by the assembler?
- What is the actual genome coverage?
Part 3: improve the initial assembly
(1 hour, until end of session)
The goal of this part is for you to generate a much better assembly of V. cholerae given the reads that you have. Feel free to take a look at the parameters reported by the other participants in the Google Docs spreadsheet to achieve this goal. Below, we highlight several steps that you may additionally choose to take in order to improve your initial assembly. These steps are given in no specific order, and you may choose to follow only some of them. This part is open-ended.
Use PacBio data with SPAdes
Long reads are good, they can really help an assembler to resolve repetitions.
PacBio data is provided: SRR497965.fastq and SRR497966.fastq in the directory assembly/pacbio
These reads are not paired. They just happen to be in two different files. You could use just one or both.
Give either or both of those dataset as input to SPAdes (provide either or both files with the option “–pacbio”, with two dashes), in addition to the Illumina data, for greatly improved contiguity. SPAdes needs Illumina data along with PacBio, it cannot process PacBio alone. GATB-Pipeline cannot take PacBio at all.
By the way: don’t specify a single “-k” value for SPAdes, it is a multi-k assembler, so by default it will use a mix of k values.
You might run into memory or time issues if you give both PacBio input files, as it will become a larger dataset to assemble.
Run a different assembler
With genome assembly, and also transcriptome assembly, getting a second opinion (i.e., running a different assembler) is recommended for any serious project. Assemblers are designed very differently, and there isn’t a single best assembler for all datasets.
Use more data
You were given only 500,000 reads, because that was sufficient to get a decent assembly of this organism. But the directory called assembly/high_coverage_illumina contains the full dataset (SRR531199_?.fastq). You may want to assemble it, to see if adding more reads improves the assembly. How many reads need to be sequenced to get the best possible assembly is a recurring question, and this is often difficult or impossible to predict before doing the actual assembly.
Some assemblers have good “default” parameters, some do not. In both cases, informed tuning of parameters can lead to better assemblies. E.g. SPAdes runs well with default parameters. By default, it assembles using three k values that it auto-detects. A little-known fact is that it will do even better if you set to execute with even more k values. On the other hand, Velvet requires that you specify a k value. It also generally performs better if you execute velvetg with the parameters “-cov_cutoff auto -exp_cov auto” rather than without. Using a specific value of -cov_cutoff might also even improve upon auto. The value of k also has a great influence on the results of Velvet and Minia. You may look at the Google Docs to pick better parameters. There also exists tools that can help you find optimal parameters.
Here two different approaches to estimate parameters are presented.
VelvetOptimizer (not installed today) http://bioinformatics.net.au/software.velvetoptimiser.shtml
VelvetOptimizer executes Velvet for many values of k and picks the best result. It also optimizes the cov_cutoff parameter. It may take a long time to run, but it’s worth it.
KmerGenie (installed!) http://kmergenie.bx.psu.edu/ [Note: software written by the lab instructor]
Kmergenie examines the read files and returns what it thinks is the best k and the best coverage cutoff (cov_cutoff for Velvet). For our dataset, both Fastq files should be input to Kmergenie. To do this, create a text file with one read path per line. Then run Kmergenie with the name of the text file as input. You can take a look at the generated HTML report, it contains a plot of the predicted genome size vs k. The predicted best k is the maximum value in this plot. To ascertain that the predicted k of Kmergenie is accurate, make sure that the plot is smooth and concave.
The V. cholerae H1 strain read datasets were downloaded from the SRA.
Prokka is installed on the Amazon AMI for this workshop. There is no need to install it during this workshop if you are using the Amazon AMI. The instructions below are provided as a guide if you wish to install Prokka on another machine after the workshop.
The following instructions can be helpful when you want to later use Prokka outside of the workshop. Installing Prokka can take a long time (> 1 hour) because it is a large download size and the many dependencies. So, if you are reading this lab while attending the workshop, and you do not have Prokka installed, then just skip the annotation step (will take too long to install).
You may use Homebrew/Linuxbrew with Homebrew-science to easily install it using the ‘brew install prokka’ command. Otherwise, Prokka can be downloaded at this address: http://www.vicbioinformatics.com/software.prokka.shtml Do not download just the main script, you need the whole package (note: it is a 2 GB file). Also, the dependencies can be quite difficult to install. This note has helped me to install them manually: http://2013-caltech-workshop.readthedocs.org/en/latest/prokka-annotation.html
This year, we will offer a price for the best assembly. A single metric will be used to estimate the quality of an assembly: NGA50. NGA50 is the NG50 of the aligned assembly (see course slides). It is computed for you by QUAST when you run it with the assembly and add “-R” with the reference genome. Using NGA50 alone is just for show; using just a single metric is generally not a reasonable way to evaluate assemblies. Furthermore, NGA50 is nice but cannot be computed without an accurate reference genome, therefore it cannot be used in actual de novo projects.