Table of contents
- Expected learning outcome
- Part 1: A naive attempt
- Part 2: Evaluate the assembly
- Part 3: Improve the initial assembly
Expected learning outcome
The objective of this activity is to help you understand the basics of genome assembly, using typical tools and technology available today (January 2014). You will learn how to run an assembler on real bacterial data. More importantly, you will learn how to evaluate an assembly with relevant quality metrics. This lab also covers several techniques to obtain assemblies of reasonable quality.
Part 1: A naive attempt
Most assembly tutorials will tell you exactly what steps to take, what software to run and what parameters to try. This is in stark contrast with what you might experience in the real world: your libraries just came out of the sequencer, and no one knows what is the best assembler for your data, let alone how to set the parameters for it. In this first part of the tutorial, we would like to let you run an assembly with minimal instructions. However, if this is the first time you are using Linux and command-line software, we realize that you might get stuck. Thus, detailed instructions on how to perform this part are given below. On the other hand, if you have some experience with Linux and the command line, we encourage you to try to perform this part without the instructions.
You are given a bacterial dataset to assemble: 500,000 reads from V. cholerae (genome size: ~4 Mbp) obtained using the Illumina MiSeq sequencer. The insert size is 335 bp, so the pairs are overlapping (you could try merging them, but that is optional). The reads are in the directory called assembly. Use the files called vcholerae_reads_1.trimmed.subsampled.250k.fastq.gz and vcholerae_reads_2.trimmed.subsampled.250k.fastq.gz. Note that these are paired-end reads.
The goal of Part 1 is to obtain a quick, initial assembly of the V. cholerae genome using these reads. You may use any of the recommended assemblers. Try to guess reasonable values for the mandatory parameters (e.g. the k-mer size) but do not over-think it. At this point, it does not matter if the assembly is of poor quality. In the next Part, you will measure the quality of your initial assembly and recognize that it is, indeed, poor. Once you have generated your first assembly, move to Part 2. If you are stuck, please follow the detailed steps below.
Velvet is probably the most popular assembler. It works well with any sort of genomic Illumina data. However, Velvet uses quite a lot of memory (will need several hundreds of GB of RAM for > 100 Mbp genomes). The bacterial dataset we are using here is small, so Velvet should not consume more than 1 GB of RAM.
Minia takes as input an Illumina dataset and can assemble large genomes (human-sized) on a desktop computer. Minia creates contigs, in other words it does not use paired-end information to produce scaffolds. For the dataset provided in this lab, the insert size is small, so an assembly formed only by contigs might be of sufficient quality. [Note: Minia was co-written by the lab instructor]
Spades is an assembler designed for small genomes. It does an excellent job (probably the best to date) at assembling bacteria, either multi-cell or single-cell data, and small metagenomes. It takes generally longer time and memory than other assemblers. Version 3.0 just came out (with IonTorrent and PacBio support), but version 2.5 is the one included in the VM.
Many other assemblers
The list of available assemblers is huge and the selection here is by no means complete. However, all of these listed are commonly used and should do a great job assembling this dataset.
SOAPdenovo 2 http://sourceforge.net/projects/soapdenovo2/
We encourage you to try solving Part 1 without looking at these steps. But if you are stuck, you may reveal a set of detailed steps by clicking on the link below. If these steps remain unclear, do not hesitate to ask a lab instructor for clarification.
We’ll use Minia to obtain an assembly of the given reads. Normally, Minia is installed in your virtual machine/amazon AMI. To verify this, just type minia in the command line, and make sure that the output start with:
/minia fasta_file kmer_size min_abundance estimated_genome_size prefix
If it says “command not found”, then Minia is not installed. In that case, you can copy-paste these instructions to install it:
make -C minia-1.6067
sudo cp minia-1.6067/minia /usr/local/bin/
Part 2: evaluate the assembly
(~ 20 mins + ~ 20 mins for optional steps)
Exercise: evaluate your assembly with QUAST
Run QUAST on your output file. The syntax is below. You will want to specify a descriptive name for the output directory where QUAST will deposit a large number of output files. Use the parameter “–scaffolds” if your assembler produced scaffolds (Velvet and Spades will, Minia won’t).
Exercise: report and compare assembly metrics
Please fill in this Google Docs spreadsheet to compare your assembly with the other participants. It will ask you for the name of the assembler you used (Velvet, Minia, etc..), and some statistics that can all be found in the QUAST report.
Take a look at the assembly qualities from the other participants. At this point, you may be tempted to re-run your assembly with better parameters. This is in fact what we will do in Part 3.
- Assemblies start to be “just acceptable” when they have a contig N50 of > 10 kbp. Was it the case with your initial assembly?
- Based on the results reported by the other participants, is there a single metric which reflects the overall quality of an assembly?
Optional: annotate the assembly
You may use the tool Prokka to annotate the assembly. It is installed in the VM. In case you wish to install Prokka on another machine later, there are instructions are the end of this document. To run Prokka, just type in the same folder as your assembly:
Note: if your prokka command fails, try running this command first:
- How many annotated proteins were predicted by Prokka? (hint: it is the number of entries in the output .faa file)
- Across all participants, how do the different metrics compare to the number of annotated genes?
Optional: assess the accuracy of your assembly
You may download the refence genome of V. Cholerae here: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Vibrio_cholerae_O1_biovar_El_Tor_N16961_uid57623
Download (Tip: you can use wget) and concatenate the two .fna files to create the reference genome. Give this reference genome as an input to QUAST using the parameter “-R reference_file.fna”.
- How many large / small misassemblies were made by the assembler?
- What is the actual genome coverage?
Part 3: improve the initial assembly
(1 hour, until end of session)
The goal of this part is for you to generate a much better assembly of V. cholerae given the reads that you have. Feel free to take a look at the parameters reported by the other participants in the Google Docs spreadsheet to achieve this goal. Below, we highlight several steps that you may additionally choose to take in order to improve your initial assembly. These steps are given in no specific order, and you may choose to follow only some of them. This part is open-ended.
Run a different assembler
With genome assembly, and also transcriptome assembly, getting a second opinion (i.e., running a different assembler) is recommended for any serious project. Assemblers are designed very differently, and there isn’t a single best assembler for all datasets.
Use better parameters
Most assemblers do not have “default” parameters, thus they cannot be executed with zero parameters. (A notable exception is Spades.) You will need to specify a k value and probably other values. For instance, Velvet generally works better if you execute velvetg with the parameters “-cov_cutoff auto -exp_cov auto” rather than without. Using a specific value of -cov_cutoff might even improve upon auto (often the case if you are assembling data which may contain two or more populations of DNA molecules). The value of k also has a great influence on the results of Velvet and Minia. You may look at the Google Docs to pick better parameters There also exists tools that can help you find optimal parameters.
Here two software to estimate parameters are presented. They are independent, you do not need to run one after the other.
VelvetOptimizer (included in velvet) http://bioinformatics.net.au/software.velvetoptimiser.shtml
VelvetOptimizer executes Velvet for many values of k and picks the best result. It also optimizes the cov_cutoff parameter. It may take a long time to run.
Run VelvetOptimizer and choose the range of kmers to be tested using the -s (smallest kmer) and -e (largest kmer) . In our case we have two fastq files with short reads so we will use -shortPaired -separate (see this page for examples of modern Velvet command lines). Note the single quotes for the -f option:
~/software/VelvetOptimiser-2.2.5/VelvetOptimiser.pl -s SmallestKmer -e LargestKmer -f ‘-fastq -shortPaired -separate ~/assembly/vcholerae_reads_1.trimmed.subsampled.250k.fastq ~/assembly/vcholerae_reads_2.trimmed.subsampled.250k.fastq’
KmerGenie http://kmergenie.bx.psu.edu/ [Note: software written by the lab instructor]
Kmergenie examines the read files and returns what it thinks is the best k and the best coverage cutoff (cov_cutoff for Velvet, min_abundance for Minia). For our dataset, both Fastq files should be input to Kmergenie. To do this, create a text file with one read path per line. Then run Kmergenie with the name of the text file as input.
Here is a more advanced usage of Kmergenie: take a look at the generated HTML report, it contains a plot of the predicted genome size vs k. The predicted best k is the maximum value in this plot. To ascertain that the predicted k of Kmergenie is accurate, make sure that the plot is smooth and concave. Sometimes the curve is not smooth and has small variations. Then, it may be a good idea to try a higher k than the one predicted by Kmergenie. Typically, if the estimated number of genomic k-mers is relatively stable for a range of k values (and then drops), then the highest k value of that range is a good candidate.
Here is an example on how to run Kmergenie. The -o determines the prefix of all the output files.:
echo ~/assembly/vcholerae_reads_1.trimmed.subsampled.250k.fastq > reads_list.txt
echo ~/assembly/vcholerae_reads_2.trimmed.subsampled.250k.fastq >> reads_list.txt
~/software/kmergenie-1.5924/kmergenie reads_list.txt -o kmergenie
Use more data
You were given only 500,000 reads, because that was sufficient to get a decent assembly of this organism. But the directory called assembly contains the full dataset (vcholerae_reads_?.trimmed.
The V. cholerae dataset was taken from the GAGE-B paper: http://www.ncbi.nlm.nih.gov/pubmed/23665771
This paper also describes and discusses assemblies for this dataset.
Prokka is installed on the Amazon AMI for this workshop. There is no need to install it during this workshop if you are using the Amazon AMI. The instructions below are provided as a guide if you wish to install Prokka on another machine after the workshop.
Installing Prokka can take a long time (> 1 hour) because it is a large download size and the many dependencies. The following instructions are for when you want to later use Prokka outside of the workshop. If you are reading this while attending the workshop assembly lab, and you do not have Prokka installed, then just skip the annotation step (will take too long to install). Prokka can be downloaded at this address: http://www.vicbioinformatics.com/software.prokka.shtml Do not download just the main script, you need the whole package (note: it is a 2 GB file). Also, the dependencies can be quite difficult to install. This note has helped me to install them: http://2013-caltech-workshop.readthedocs.org/en/latest/prokka-annotation.html