Table of contents
- Expected learning outcome
- Part 1: A naive attempt
- Part 2: Evaluate the assembly
- Part 3: Improve the initial assembly
Expected learning outcome
The objective of this activity is to help you understand the basics of genome assembly, using typical tools and technology available today (January 2015). You will learn how to run an assembler on real bacterial data. More importantly, you will learn how to evaluate an assembly with relevant quality metrics. This lab also covers several techniques to obtain assemblies of reasonable quality.
Part 1: A naive attempt
Most assembly tutorials will tell you exactly what steps to take, what software to run and what parameters to try. This is in stark contrast with what you might experience in the real world: your libraries just came out of the sequencer, and no one knows what is the best assembler for your data, let alone how to set the parameters for it. In this first part of the tutorial, we would like to let you run an assembly with minimal instructions. However, if this is the first time you are using Linux and command-line software, we realize that you might get stuck. Thus, detailed instructions on how to perform this part are given below. On the other hand, if you have some experience with Linux and the command line, we encourage you to try to perform this part without the detailed instructions.
You are given a bacterial dataset to assemble: 500,000 reads from V. cholerae (genome size: ~4 Mbp) obtained using the Illumina MiSeq sequencer. The insert size is 150 bp, so the pairs are overlapping by quite a bit (you could merge them, but let’s not do that at first). The reads are in the directory called assembly. Use the files called SRR531199_1.500k.fastq and SRR531199_2.500k.fastq. Note that these are paired-end reads.
The goal of Part 1 is to obtain a quick, initial assembly of the V. cholerae genome using these reads. You may use any of the recommended assemblers. Try to guess reasonable values for the mandatory parameters (e.g. the k-mer size) but do not over-think it. At this point, it does not matter if the assembly is of poor quality. In the next Part, you will measure the quality of your initial assembly and recognize that it is, indeed, poor. Once you have generated your first assembly, move to Part 2. If you are stuck, please follow the detailed steps below.
Spades is an assembler designed for prokaryotic and small eukaryotic genomes. It does an excellent job (probably the best) at assembling bacteria, either multi-cell or single-cell data, and small metagenomes. It takes generally longer time and memory than other assemblers. It also supports assembly of IonTorrent and PacBio data for small genomes.
Velvet is probably the most popular assembler. It works well with any sort of genomic Illumina data. However, Velvet uses quite a lot of memory (will need several hundreds of GB of RAM for > 100 Mbp genomes). The bacterial dataset we are using here is small, so Velvet should not consume more than 1 GB of RAM.
Minia takes as input an Illumina dataset and can assemble large genomes (human-sized) on a desktop computer. Minia creates contigs, in other words it does not use paired-end information to produce scaffolds. For the dataset provided in this lab, the insert size is small, so an assembly formed only by contigs might be of sufficient quality. [Note: Minia was co-written by the lab instructor]
Many other assemblers
The list of available assemblers is huge and the selection here is by no means complete. However, all of these listed are commonly used and should do a great job assembling this dataset.
SOAPdenovo 2 http://sourceforge.net/projects/soapdenovo2/
We encourage you to try solving Part 1 without looking at these steps. But if you are stuck, you may reveal a set of detailed steps by clicking on the link below. If these steps remain unclear, do not hesitate to ask a lab instructor for clarification.[toggle title_open=”Hide detailed steps” title_closed=”Show detailed steps” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”yes”]
We’ll use Minia to obtain an assembly of the given reads. Normally, Minia is installed in your virtual machine/amazon AMI but it’s an old version. For the following instructions to work, you need to download a new version. You can copy-paste these instructions to install it: [box] curl http://gatb-tools.gforge.inria.fr/versions/bin/minia-1.0.4-Linux.tar.gz | tar xz
sudo cp minia-1.0.4-Linux/bin/minia /usr/local/bin/ [/box]
Now make sure that your current directory is where you want to want your assembly to be. You may enter these commands to create a new temporary directory and make it your current directory: [box]mkdir /tmp/minia-assembly
cd /tmp/minia-assembly [/box] To generate the list of reads to assemble, enter this command (this is a single line): [box] ls -1 ~/workshop_data/assembly/SRR531199_?.500k.fastq > list_reads.txt [/box] Note that this command is ‘ls -1’ not ‘ls -l’. Now we are ready to assemble. Let’s assume that you choose the following parameters for Minia: k=31 and a kmer coverage threshold of 2. We will set the output prefix as: vcholerae_initial. Enter the following command to assemble: [box] /usr/local/bin/minia -in list_reads.txt -kmer-size 31 -abundance 2 -out vcholerae_initial [/box] If all went fine, your assembly should be the file vcholerae_initial.contigs.fa.
What you’ve just ran is the latest cutting-edge Minia version, it doesn’t have a proper manual yet. The previous version of Minia is here, and it has a manual: http://minia.genouest.org/[/toggle]
Part 2: evaluate the assembly
(~ 20 mins + ~ 20 mins for optional steps)
Exercise: evaluate your assembly with QUAST
Run QUAST on your output file. The syntax is below. You will want to specify a descriptive name for the output directory where QUAST will deposit a large number of output files. Use the parameter “–scaffolds” if your assembler produced scaffolds (Velvet and Spades will, Minia won’t). [box]quast.py -o output_directory assembly_file [–scaffolds] [/box] Note that the “–scaffold” argument is with two dashes (‘-‘ + ‘-‘) but is displayed as a single dash on this web page. Move into your QUAST output directory and examine the report.txt file. You may also take a look at the HTML file. A description of the output can be found on the QUAST web site: http://quast.bioinf.spbau.ru/manual.html.
Exercise: report and compare assembly metrics
Please fill in this Google Docs spreadsheet to compare your assembly with the other participants. It will ask you for the name of the assembler you used (Velvet, Minia, etc..), and some statistics that can all be found in the QUAST report.
Take a look at the assembly qualities from the other participants. At this point, you may be tempted to re-run your assembly with better parameters. This is in fact what we will do in Part 3.
- Assemblies start to be “just acceptable” when they have a contig N50 of > 10 kbp. Was it the case with your initial assembly?
- Based on the results reported by the other participants, is there a single metric which reflects the overall quality of an assembly?
Optional: annotate the assembly
You may use the tool Prokka to annotate the assembly. It is installed in the VM. In case you wish to install Prokka on another machine later, there are instructions are the end of this document. To run Prokka, just type in the same folder as your assembly: [box]prokka assembly_file[/box]
Note: if your prokka command fails, try running this command first: [box]export LC_ALL=C[/box]
- How many annotated proteins were predicted by Prokka? (hint: it is the number of entries in the output .faa file)
- Across all participants, how do the different metrics compare to the number of annotated genes?
Optional: assess the accuracy of your assembly
You may use the refence genome of V. Cholerae provided with the image: vcholerae_h1.fasta
Give this reference genome as an input to QUAST using the parameter “-R reference_file.fna”.
- How many large / small misassemblies were made by the assembler?
- What is the actual genome coverage?
Part 3: improve the initial assembly
(1 hour, until end of session)
The goal of this part is for you to generate a much better assembly of V. cholerae given the reads that you have. Feel free to take a look at the parameters reported by the other participants in the Google Docs spreadsheet to achieve this goal. Below, we highlight several steps that you may additionally choose to take in order to improve your initial assembly. These steps are given in no specific order, and you may choose to follow only some of them. This part is open-ended.
Use PacBio data with SPAdes
PacBio data is provided: SRR497965.fastq and SRR497966.fastq
Give it as input to SPAdes (with –pacbio), in addition to the Illumina data, for greatly improved contiguity. By the way: don’t specify a -k value for SPAdes, it is a multi-k assembler, so by default it will use a very mix of k values.
Run a different assembler
With genome assembly, and also transcriptome assembly, getting a second opinion (i.e., running a different assembler) is recommended for any serious project. Assemblers are designed very differently, and there isn’t a single best assembler for all datasets.
Use better parameters
Most assemblers do not have “default” parameters, thus they cannot be executed with zero parameters. (A notable exception is Spades.) You will need to specify a k value and probably other values. For instance, Velvet generally works better if you execute velvetg with the parameters “-cov_cutoff auto -exp_cov auto” rather than without. Using a specific value of -cov_cutoff might even improve upon auto (often the case if you are assembling data which may contain two or more populations of DNA molecules). The value of k also has a great influence on the results of Velvet and Minia. You may look at the Google Docs to pick better parameters There also exists tools that can help you find optimal parameters.
Here two software to estimate parameters are presented. They are independent, you do not need to run one after the other.
VelvetOptimizer (included in velvet) http://bioinformatics.net.au/software.velvetoptimiser.shtml
VelvetOptimizer executes Velvet for many values of k and picks the best result. It also optimizes the cov_cutoff parameter. It may take a long time to run.
KmerGenie http://kmergenie.bx.psu.edu/ [Note: software written by the lab instructor]
Kmergenie examines the read files and returns what it thinks is the best k and the best coverage cutoff (cov_cutoff for Velvet, min_abundance for Minia). For our dataset, both Fastq files should be input to Kmergenie. To do this, create a text file with one read path per line. Then run Kmergenie with the name of the text file as input.
Here is a more advanced usage of Kmergenie: take a look at the generated HTML report, it contains a plot of the predicted genome size vs k. The predicted best k is the maximum value in this plot. To ascertain that the predicted k of Kmergenie is accurate, make sure that the plot is smooth and concave. Sometimes the curve is not smooth and has small variations. Then, it may be a good idea to try a higher k than the one predicted by Kmergenie. Typically, if the estimated number of genomic k-mers is relatively stable for a range of k values (and then drops), then the highest k value of that range is a good candidate.
Use more data
You were given only 500,000 reads, because that was sufficient to get a decent assembly of this organism. But the directory called assembly contains the full dataset (SRR531199_?.fastq). You may want to assemble it, to see if adding more reads improves the assembly. How many reads need to be sequenced to get the best possible assembly is a recurring question, and this is often difficult or impossible to predict before doing the actual assembly.
The V. cholerae H1 strain read datasets were downloaded from the SRA.
Prokka is installed on the Amazon AMI for this workshop. There is no need to install it during this workshop if you are using the Amazon AMI. The instructions below are provided as a guide if you wish to install Prokka on another machine after the workshop.
The following instructions can be helpful when you want to later use Prokka outside of the workshop. Installing Prokka can take a long time (> 1 hour) because it is a large download size and the many dependencies. So, if you are reading this lab while attending the workshop, and you do not have Prokka installed, then just skip the annotation step (will take too long to install).
You may use Homebrew/Linuxbrew with Homebrew-science to easily install it using the ‘brew install prokka’ command. Otherwise, Prokka can be downloaded at this address: http://www.vicbioinformatics.com/software.prokka.shtml Do not download just the main script, you need the whole package (note: it is a 2 GB file). Also, the dependencies can be quite difficult to install. This note has helped me to install them manually: http://2013-caltech-workshop.readthedocs.org/en/latest/prokka-annotation.html