Table of contents
- Expected learning outcome
- Part 1: A naive attempt
- Part 2: QC the data while you wait
- Part 3: Evaluate the assembly
- Part 4: Improve the initial assembly
- Treasure hunt
Expected learning outcome
The objective of this activity is to help you understand the basics of genome assembly, using typical tools and technology available today (January 2018). You will learn how to run an assembler on real bacterial data (2nd and 3rd generation sequencing data). More importantly, you will learn how to evaluate an assembly with relevant quality metrics. This lab also covers several techniques to obtain assemblies of reasonable quality.
Part 1: A naive attempt
Some assembly tutorials will tell you exactly what steps to take, what software to run and what parameters to set. This is in stark contrast with what you might experience in the real world: your libraries just came out of the sequencer, and you do not know what is the best assembler for your data, let alone if your data is adequate for assembly at all. In this first part of the tutorial, we would like to let you run an assembly with minimal instructions. However, if this is the first time you are using Linux and command-line software, we realize that you might get stuck. Thus, detailed instructions on how to perform this part are given below. On the other hand, if you have some experience with Linux and the command line, we encourage you to try to perform this part without the detailed instructions.
Story: Imagine you’ve been sent to Haiti, it’s October 2010 and there is a cholera outbreak. “Luckily” you are at sequencing center nearby and they have given you bacterial isolates to sequence. So you’re generating some Illumina runs and also some runs from a new technology called PacBio. Now the task is to assemble the data in order to later check the phylogeny of that strain, what makes it different from other strains in terms of SNPs and SVs, etc.. And remember: don’t drink the water, it’s a known vector of contamination. I’ve heard that Czech beer is safe though.
You are given several sequencing datasets from the same organism: V. cholerae, which has a genome size of ~4 Mbp. The goal is to perform an assembly of the V. cholerae genome. It is known that the genome has 2 chromosomes and has around 3,800 annotated genes.
The following datasets are provided in the Workshop AMI but can also be downloaded, see
“low-coverage” Illumina dataset:
- 500,000 paired reads, each read is of length 99bp, were sequenced using the Illumina MiSeq.
- Insert size is 150 bp. So the pairs are overlapping by quite a bit (you could merge them, but let’s not do that at first).
- Reads are in the directory called ~/workshop_materials/assembly/subsampled_illumina. Use the files called SRR531199_1.500k.fastq and SRR531199_2.500k.fastq. Note that these are paired-end reads.
“high-coverage” Illumina dataset:
- Same as the previous one, but with more reads.
- SRR531199_1.fastq and SRR531199_2.fastq
- There are PacBio reads (10kb SMRTbell template prep), which are relatively old (2012) compared to the recent evolution of the technology.
- Each file is a different run. Note: you do not have to use all the runs.
- Recent PacBio sequencing of a different strain:
- Do not mix it with the other datasets, unless you really know what you’re doing.
The goal of Part 1 is to obtain a quick, initial assembly of the V. cholerae genome using any of these datasets. You could use any of the assemblers which have been pre-installed on your Instance. “But, which one”, you ask? “And also, which of these datasets should I use?” Well, the point of this part is to let you to take initiatives and pick one! So, for this first attempt, if the assembler asks for any parameter, try to guess a reasonable value but do not over-think it. At this point, it does not matter if the assembly is of poor quality. In the next part, you will measure the quality of your initial assembly and recognize that it could possibly be improved. Once you have generated your first assembly, move to Part 2. If you are stuck, don’t panic, either ask a TA or follow the detailed steps below.
Q: What is the approximate coverage of the low-coverage Illumina dataset?
The coverage of a read dataset can be calculated as the number of bases inside the reads, divided by the approximate length of the genome.
Spades is an Illumina assembler designed for prokaryotic and small eukaryotic genomes. It does an excellent job (probably the best) at assembling bacteria, either multi-cell or single-cell data, and small metagenomes. It is multi-k and takes generally longer time and memory than other assemblers. It also supports assembly IonTorrent, and PacBio data.
Canu is one of the leading assemblers for PacBio data. It gives very good results, however in the context of this workshop it might take a while to run (significantly more than half an hour).
Note: consider using the command-line argument “useGrid=false” and/or “stopOnReadQuality=false” if you run into troubles. Also Canu may reserve more memory than your system has: cnsThreads=2 and cnsMemory=1 should solve it.
Miniasm is a PacBio assembler, but a little bit of a “UFO”: it is super-fast, but generated assemblies are imperfect. In the sense that they are structurally correct, but at the nucleotide level, there are many many mismatches and indels. This is because Miniasm does not correct the reads, it just creates an assembly without the final consensus step. Assemblies then need to be polished. E.g. using the Racon software. But since such polishing software may take a long time to run, we won’t cover it here in this lab.
Many other assemblers
NOTE: Many of these aren’t installed on the AMI.
The list of available assemblers is huge and the selection here is by no means complete. However, all of these listed are commonly used and should do a great job assembling this dataset.
Minia pipeline http://github.com/GATB/gatb-pipeline
SOAPdenovo 2 http://sourceforge.net/projects/soapdenovo2/
Unicycler (hybrid) https://github.com/rrwick/Unicycler
We encourage you to try solving Part 1 without looking at these steps. But if you are stuck, you may reveal a set of detailed steps by clicking on the link below. If these steps remain unclear, do not hesitate to ask a lab instructor for clarification.
Now make sure that your current directory is where you want to want your assembly to be. Move to the assembly directory:
NOTE FOR NEXT YEAR : let’s not use symbolic links, they’re confusing. Let’s put a symbolic link of the reads into this directory. It is to avoid typing the long path to the reads.
Now we are ready to assemble. Enter the following command to assemble:
spades.py -1 SRR531199_1.500k.fastq -2 SRR531199_2.500k.fastq -o spades_assembly
By the way: don’t specify a single “-k” value for SPAdes, it is a multi-k assembler, so by default it will use a mix of k values.
If you cannot see the detailed steps and you are using Safari, use Chrome or Firefox.
Part 2: QC the data while you wait
While you are waiting for the assembly to finish, let us open another terminal and run some quality checks on the data.
Exercise: data quality
Run the FASTQC and KmerGenie programs on the Illumina data that you are assembling. KmerGenie generates an HTML report showing a k-mer histogram. To give both paired reads files to KmerGenie, type:
ls -1 [location of your read files]/SRR*.fastq > list_fastq
Observe that “ls -1” is the command “ls” followed by a dash and the number 1; it returns one file per line matching the SRR*.fastq expression.
Q: Based on the KmerGenie histogram, what is roughly the sequencing coverage of this data?
One can estimate the sequencing coverage of a dataset by finding the gaussian peak of k-mers. If the peak is at 20, this indicates that a non-erroneous k-mer is seen around 20 times in the genome, which correspond to a sequencing coverage a bit above 20.
Optional Exercise: who’s in there?
If your first assembly has completed, you may skip this exercise and come back to it later, for example when another assembly is running.
This dataset is supposedly a single bacteria strain. If it was a single cell, there would be only one genome in there. But this is a collection of cells. Could it be that there is some variations between the genomes in those cells? We do not have a reference genome yet, so how to find out..? Easy! k-mer methods! Let’s call some SNPs without a reference. For this, we will use the DiscoSNP software.
Note that DiscoSNP has an unusual way to specify arguments. You need to first create a text file containing the two paired-end read files, like so:
ls -1 file1.fastq file2.fastq > list_fastq.txt
Then it requires a text file containing just the name of that file
echo list_fastq.txt > input.txt
then run discosnp with the argument “-r input.txt”. The command is run_discosnp++.sh: to get it, just type “run_disco” and then the [TAB] key
Q: Are there SNPs? If so, what does that mean?
Notice that DiscoSNP outputs a VCF but there are no reference positions. This is quite normal, because there is no reference to align the sequences to. DiscoSNP also outputs a FASTA file containing the two paths of each bubble found.
Optionally, for the most advanced users.. To go further: after you are done assembling, map the reads back to the assembly (e.g. using BWA-MEM) and call SNPs using samtools. Is there some concordance between the SNPs found by DiscoSNPs and the ones found using samtools?
Part 3: evaluate the assembly
(~ 20 mins + ~ 20 mins for optional steps)
Exercise: evaluate your assembly with QUAST
Run QUAST on your output file. The syntax is below. You will want to specify a descriptive name for the output directory where QUAST will deposit a large number of output files. Use the parameter “–scaffolds” if your assembler produced scaffolds (Spades will, Minia won’t).
Look out for two potentially confusing parts:
- “# contigs” is actually the number of scaffolds >= 500 bp, because by default QUAST considers sequences longer than 500 bp, and when you give it a scaffolds file as input, it considers that they are “contigs”.
- The “scaffolds_broken” column corresponds to contigs extracted by QUAST from the scaffolds file.
Assess the accuracy of your assembly
You may use the reference genome of V. Cholerae provided: ~/workshop_materials/assembly/reference/vcholerae_h1.fasta
Give this reference genome as an input to QUAST using the parameter “-R reference_file.fna”.
Q: How many large / small misassemblies were made by the assembler?
These statistics can be found in the QUAST output, under the “# misassemblies” and “# local misassemblies” fields respectively.
Q: What is the actual genome coverage?
Not to be mixed up with the read coverage, the “genome coverage” of an assembly is percentage of the genome that is covered by the assembly. It corresponds to the “Genome Fraction” field in the QUAST output. Recall that, in contrast, the coverage of a sequencing dataset is the average number of times a genome base is covered by a read.
Exercise: report and compare assembly metrics
Please fill in this Google Docs spreadsheet to compare your assembly with the other participants. It will ask you for the name of the assembler you used (SPAdes, Canu, etc..), and some statistics that can all be found in the QUAST report.
Take a look at the assembly qualities from the other participants. At this point, you may be tempted to re-run your assembly with better parameters. This is in fact what we will do in Part 3.
Just to get a rough idea, assemblies that have a contig N50 around 100 Kbp can be considered to be “okay”-“good”. Although, note that there is no formal consensus about that in the community. They are “barely acceptable” when they have a contig N50 in the order of 10 kbp. This is because 10 Kbp is roughly the average length of genes.
Q: Would you say that your initial assembly was satisfactory, according to the informal metrics above?
Q: Based on the results reported by the other participants, is there a single metric which reflects the overall quality of an assembly?
Visualize your assembly with bandage
Use the Bandage software to visualize the assembly graph. It is generally the file that ends with “.gfa” (“.fastg” for SPAdes) in the same folder as the one that end with “.fasta”.
- Load the .gfa file in Bandage
- Observe the graph
Q: Is your assembly graph connected? i.e. are there many disjoint components, or a single large one.
This provides some indications of whether you had sufficient coverage to perform the assembly. It could also mean that the sequencer produced some sequences that did not overlap others.
Q: Does the graph has many branching nodes?
This is indicative of variants or repetitions that could not have been resolved by the assembler.
Optional: annotate the assembly
You may use the tool Prokka to annotate the assembly. It is installed on your Instance. In case you wish to install Prokka on another machine later, there are instructions are the end of this document. To run Prokka, just type in the same folder as your assembly:
Note: if your prokka command fails, try running this command first:
If Prokka says: “Contig ID must <= 20 chars long: scaffold1scaffold1NODE_1_length_1420_cov_33.173943” Please let a TA know!
- How many annotated proteins were predicted by Prokka? (hint: it is the number of entries in the output .faa file)
- Across all participants, how do the different metrics compare to the number of annotated genes?
Part 4: improve the initial assembly
(1 hour, until end of session)
The goal of this part is for you try a few different assembly strategies: Illumina-only, PacBio-only, hybrid. The goal is to see which strategy achieves a good compromise between assembly quality and cost of sequencing, given the V. cholerae runs that we have. Feel free to take a look at the parameters reported by the other participants in the Google Docs spreadsheet to achieve this goal. Below, we highlight several steps that you may additionally choose to take in order to improve your initial assembly. These steps are given in no specific order, and you may choose to follow only some of them. This part is open-ended.
Use PacBio data with SPAdes
Long reads are good, they can really help an assembler to resolve repetitions. These reads are not paired. Give any subset of the PacBio runs as input to SPAdes (with the option “–pacbio”, with two dashes), in addition to the Illumina data, for greatly improved contiguity. SPAdes needs Illumina data along with PacBio, it cannot process PacBio alone.
You might run into memory or time issues if you give too many PacBio input file.
Use more Illumina data
The low-coverage dataset has only 500,000 reads, because that was sufficient to get a decent assembly of this organism. But the directory called assembly/high_coverage_illumina contains the full dataset (SRR531199_?.fastq). You may want to assemble it, to see if adding more reads improves the assembly. How many reads need to be sequenced to get the best possible assembly is a recurring question, and this is often difficult or impossible to predict before doing the actual assembly.
Assemble pacbio reads directly
Try to assemble the PacBio data without any Illumina data. We recommend that you assemble the ERR1716491 dataset. Even if it’s a different strain than the H1 strain in the Illumina data, the genome isn’t too different. The reason why we recommend using that dataset is that the older 2012 PacBio data do not seem to assemble well with modern PacBio tools. Feel free to try and investigate if you like, though.
The ERR1716491 dataset has high coverage and may take too much time to assemble. Instead of taking the full dataset, you may want to assemble a subsampled dataset (same strategy as with the low-cov/high-cov Illumina datasets). PacBio assemblers have a minimal coverage requirement, of approximately 30x. Compute the number of bases in each run and then choose the right number of PacBio runs to give to the PacBio assembler (e.g. Canu or miniasm). E.g. to subsamble the first 50,000 reads (so, 200,000 lines) out of a FASTQ dataset, run:
head -n 200000 [dataset.fastq] > [dataset.50k.fastq]
Q: When quality-checking the PacBio-only assembly of that strain, why is the genome coverage lower than with the Illumina assembly?
Optional Q: try assembling the SRR* files. (They won’t assemble well.) Given the output of the assembler, what do you think is the issue?
Try again with a different assembler
With genome assembly, and also transcriptome assembly, getting a second opinion (i.e., running a different assembler) is recommended for any serious project. Assemblers are designed very differently, and there isn’t a single best assembler for all datasets.
The V. cholerae H1 strain read datasets were downloaded from the SRA. Here is a set of commands to download the datasets if you don’t have access to the Workshop AMI anymore. The script uses the fastq-dump program from the SRA toolkit. If you don’t have it, install it using Conda: conda install -c bioconda sra-tools.
bash fastq-dump --split-files SRR531199 head -n 2000000 SRR531199_1.fastq > SRR531199_1.500k.fastq head -n 2000000 SRR531199_2.fastq > SRR531199_2.500k.fastq for dataset in SRR497965 SRR497966 SRR497967 SRR497968 ERR1716491 do fastq-dump -M 1000 --table SEQUENCE $dataset done
Prokka is installed on the Amazon AMI for this workshop. There is no need to install it during this workshop if you are using the Amazon AMI. The instructions below are provided as a guide if you wish to install Prokka on another machine after the workshop.
The following instructions can be helpful when you want to later use Prokka outside of the workshop. Installing Prokka can take a long time (> 1 hour) because it is a large download size and the many dependencies. So, if you are reading this lab while attending the workshop, and you do not have Prokka installed, then just skip the annotation step (will take too long to install).
You may use Homebrew/Linuxbrew with Homebrew-science to easily install it using the ‘brew install prokka’ command. Otherwise, Prokka can be downloaded at this address: http://www.vicbioinformatics.com/software.prokka.shtml Do not download just the main script, you need the whole package (note: it is a 2 GB file). Also, the dependencies can be quite difficult to install. This note has helped me to install them manually: http://2013-caltech-workshop.readthedocs.org/en/latest/prokka-annotation.html
Assemble a small genome and decode a secret message
A little known part of Cesky Krumlov workshop history is that centuries ago, there existed pirates who had access to very primitive DNA sequencers. There is an old legend saying that they hid a small bounty somewhere in the city. They left clues, of course. I was able to recover one but was never to find the treasure. The clue is actually a set of Oxford Megapore reads from the 16th century. Back then the technology and file formats were a bit different, but I’m sure those reads contain valuable information. I’m ready to give them to you, perhaps you’ll be able to assemble them so that they reveal a secret message..
To get started, just type:
curl -sL http://rayan.chikhi.name/hunt.2018 | bash
Then.. you are on your own..