Table of contents

Expected learning outcome

The objective of this activity is to help you understand the basics of genome assembly, using typical tools and technology available today (January 2019). You will learn how to run an assembler on real bacterial data (3rd generation sequencing data). More importantly, you will learn how to evaluate an assembly with relevant quality metrics. This lab also covers several techniques to obtain assemblies of reasonable quality.

Part 1: A naive attempt

(30mins-1hour)

Some assembly tutorials will tell you exactly what steps to take, what software to run and what parameters to set. This is in stark contrast with what you might experience in the real world: your libraries just came out of the sequencer, and you do not know what is the best assembler for your data, let alone if your data is adequate for assembly at all. In this first part of the tutorial, we would like to let you run an assembly with minimal instructions. However, if this is the first time you are using Linux and command-line software, we realize that you might get stuck. Thus, detailed instructions on how to perform this part are given below. On the other hand, if you have some experience with Linux and the command line, we encourage you to try to perform this part without the detailed instructions.

Exercise

Story: Imagine you’ve been sent to Haiti, it’s October 2010 and there is a cholera outbreak. “Luckily” you are at sequencing center nearby and they have given you bacterial isolates to sequence. So you’re doing a PacBio run and also a Illumina run, just in case. (Disregard the fact that PacBio wasn’t really widely available in 2010.) Now the task is to assemble the data in order to do various analyses later: check the phylogeny of that strain, what makes it different from other strains in terms of gene content, SNPs, structural variants, etc.. And remember: don’t drink the water, it’s a known vector of contamination; Czech beer is perfectly safe though.

You are given several sequencing datasets from the same organism:  V. cholerae, which has a genome size of ~4 Mbp. The goal is to perform an assembly of the V. cholerae genome. It is known that the genome has 2 chromosomes and has around 3,800 annotated genes.

Afficher l'image d'origine

The following datasets are provided in the Workshop AMI (in ~/workshop_materials/assembly/) but can also be downloaded by using the ERRxxxxxx identifiers on the Sequence Read Archive. See how at the end of this tutotorial.

  • PacBio sequencing (these are raw, uncorrected “CLR” reads, not “CCS” corrected reads):
    • ERR1716491.subsampled.fasta
  • Illumina sequencing (paired-end interleaved, insert size 150 bp):
    • SRR531199_interleaved_500k.fastq

The goal of Part 1 is to obtain a quick, initial assembly of the V. cholerae genome using any of these datasets. You could use any of the assemblers which have been pre-installed on your Instance. “But, which one”, you ask? Well, the point of this part is to let you to take initiatives and pick one! So, for this first attempt, if the assembler asks for any parameter, try to guess a reasonable value but do not over-think it. At this point, it does not matter if the assembly is of poor quality. In the next part, you will measure the quality of your initial assembly and recognize that it could possibly be improved. Once you have generated your first assembly, move to Part 2. If you are stuck, don’t panic, either ask a TA or follow the detailed steps below.

Q: What is the approximate read coverage of the partial PacBio dataset?

The read coverage of a dataset can be calculated as the number of bases inside the reads, divided by the approximate length of the genome. To get the number of bases in a FASTA file: hint.

Pre-installed assemblers

Miniasm

Miniasm is a PacBio assembler, but a little bit of a “UFO”: it is super-fast, but generated assemblies are imperfect. In the sense that they are structurally correct, but at the nucleotide level, there are many many mismatches and indels. This is because Miniasm does not correct the reads, it just creates an assembly without the final consensus step. Assemblies then need to be polished. E.g. using the Racon software. But since such polishing software may take a long time to run, we won’t cover it here in this lab.

Github: https://github.com/lh3/miniasm/

Spades

Spades is an Illumina assembler designed for prokaryotic and small eukaryotic genomes. It does an excellent job (probably the best) at assembling bacteria with short read sequencing, either multi-cell or single-cell data, and also small metagenomes. It is multi-k and takes generally longer time and memory than other assemblers. It also supports hybrid assembly of Illumina+PacBio data.

Website: http://cab.spbu.ru/software/spades/

Manual: http://cab.spbu.ru/files/release3.11.1/manual.html

Flye 

Flye is a PacBio assembler that uses a somewhat different strategy than Canu. On this workshop’s dataset, it takes also quite a long time to run (somewhere between 30mins and 1 hour). But, unlike Canu which is not designed to run on small memory machines, Flye can in principle assemble a bacterial dataset in 2 GB memory.

Github: https://github.com/fenderglass/Flye

Manual: https://github.com/fenderglass/Flye/blob/flye/docs/USAGE.md#quickusage

Ra 

Ra is a PacBio assembler which is based on existing components: minimap, racon (polishing), and a layout stage called “rala”. As of 2019, it is not published and not very well-documented, but a recent benchmark shows that it performs well. It can also optionally take Illumina reads as supplementary input for polishing. But for our dataset, it looks like that step is not straightforward.

Github: https://github.com/rvaser/ra

Manual: there doesn’t seem to be any. Just make sure to redirect the output to a fasta file. E.g.: 

ra -x pb reads.fasta -t 2 > assembly.fasta 

Canu

Canu is one of the leading assemblers for PacBio data. It gives very good results, however in the context of this workshop it might take a while to run (significantly more than half an hour) or not run at all.

Github: https://github.com/marbl/canu

Quick start: http://canu.readthedocs.io/en/latest/quick-start.html

Note: consider using the command-line argument “useGrid=false” and/or “stopOnReadQuality=false” if you run into troubles. Also Canu asks for more memory than your system has: cnsThreads=2 and cnsMemory=1 may be able to solve it. The TA’s couldn’t get Canu to run this year, but they didn’t try hard at changing the parameters like ‘cnsMemory’ or ‘corMemory’. Maybe you will succeed.

Many other assemblers

NOTE: Many of these aren’t installed on the AMI.

The list of available assemblers is huge and the selection here is by no means complete. However, all of these listed are commonly used and should do a great job assembling this dataset.

Pacbio assemblers

wtdbg2 https://github.com/ruanjue/wtdbg2

FALCON https://github.com/PacificBiosciences/FALCON

SMARTdenovo https://github.com/ruanjue/smartdenovo

Unicycler (hybrid) https://github.com/rrwick/Unicycler

Illumina assemblers

Minia pipeline http://github.com/GATB/gatb-pipeline

Velvet http://www.ebi.ac.uk/~zerbino/velvet/

SGA https://github.com/jts/sga

ABySS http://www.bcgsc.ca/platform/bioinfo/software/abyss

SOAPdenovo 2 http://sourceforge.net/projects/soapdenovo2/

Megahit https://github.com/voutcn/megahit

 

Detailed steps

We encourage you to try solving Part 1 without looking at these steps. But if you are stuck, you may reveal a set of detailed steps by clicking on the link below. If these steps remain unclear, do not hesitate to ask a lab instructor for clarification.

Detailed Steps

We’ll use miniasm to obtain a quick assembly of the PacBio dataset.

Now make sure that your current directory is where you want to want your assembly to be. Move to the assembly directory:

cd ~/workshop_materials/assembly/

Let’s make a nice little folder that will contain the result of our miniasm assembly.

mkdir miniasm

cd miniasm

Now we are ready to assemble. Enter the following command to assemble:

minimap2 -x ava-pb -t8 ../ERR1716491.subsampled.fasta ../ERR1716491.subsampled.fasta | gzip -1 > reads.paf.gz
miniasm -f ../ERR1716491.subsampled.fasta reads.paf.gz > reads.gfa
# the following command is a very intimidating command that converts a GFA file to a FASTA file
# it was taken from here: https://www.biostars.org/p/169516/
awk '/^S/{print ">"$2"\n"$3}' reads.gfa | fold > assembly.fa

These commands follow exactly the README at: https://github.com/lh3/miniasm. Note: when running minimap2, you need to enter the reads twice, this is not a typo. minimap2 performs alignment of all reads against themselves.

If you cannot see the detailed steps and you are using Safari, use Chrome or Firefox.

 

Part 2: QC the data while you wait

(~ 15 mins + ~ 20 mins for optional steps)

While you are waiting for the assembly to finish,  let us open another terminal and run some quality checks on the data.

Exercise: Have a look at the PacBio data yourself

While the assembly completes, let’s open another terminal and see a little bit what the data looks like.

Q: Print the first 4 lines of the PacBio data file. (Unsure how?) What do you observe?

Let’s get some more statistics about the PacBio data.

Q: How many reads are there? (Unsure how?)

Q: What is the mean read length? (Here’s one way to compute it)

Optional Exercise: Look at data quality of the Illumina dataset

Run the FASTQC and KmerGenie programs on the Illumina data. KmerGenie generates an HTML report showing a k-mer histogram. To run kmergenie, type:

kmergenie SRR531199_interleaved_500k.fastq

Q: Based on the KmerGenie histogram, what is roughly the sequencing coverage of this data?

One can estimate the sequencing coverage of a dataset by finding the gaussian peak of k-mers. If the peak is at 20, this indicates that a non-erroneous k-mer is seen around 20 times in the genome, which correspond to a sequencing coverage a bit above 20.

Note: doing this analysis with PacBio data won’t work. This is because of the high sequencing error rate of PacBio data, which prevents straightforward k-mer analysis.

Part 3: evaluate the assembly

(~ 20 mins + ~ 20 mins for optional steps)

Exercise: evaluate your assembly with QUAST

Run QUAST on your output file. The syntax is below. You will want to specify a descriptive name for the output directory where QUAST will deposit a large number of output files.

quast.py -o output_directory assembly_file
  Move into your QUAST output directory and examine the report.txt file. You may also take a look at the HTML file. A description of the output can be found on the QUAST web site: http://quast.bioinf.spbau.ru/manual.html.

Look out for two potentially confusing parts:

  • “# contigs” is actually the number of scaffolds >= 500 bp, because by default QUAST considers sequences longer than 500 bp, and when you give it a scaffolds file as input, it considers that they are “contigs”.

Note also, if you are running QUAST on an Illumina assembly, e.g. SPAdes, you might want to use the parameter “–split-scaffolds” (note there are actually two dashes – – but the webpage displays only one) to let QUAST know that SPAdes produced scaffolds. SPAdes produces many files. One is “contigs.fasta”, which unsuprisingly contains contigs, and another is “scaffolds.fasta”, which contains scaffolds. If you ran QUAST on a PacBio assembly, since a PacBio assembler never creates scaffolds (only contigs), you are fine.

  • The “scaffolds_broken” column corresponds to contigs extracted by QUAST from the scaffolds file.

Assess the accuracy of your assembly

You may use the reference genome of V. Cholerae provided: ~/workshop_materials/assembly/vcholerae_h1.fasta

In real life you will likely not have a reference genome, but if you have a closely related genome you can use it as reference and get some nice stats from QUAST such as ‘genome fraction’ (=genome coverage).

Give this reference genome as an input to QUAST using the parameter “-r reference_file.fna”.

Q: How many large / small misassemblies were made by the assembler?

These statistics can be found in the QUAST output, under the “# misassemblies” and “# local misassemblies” fields respectively.

Q: What is the genome coverage?

Not to be mixed up with the read coverage, the genome coverage of an assembly is percentage of the genome that is covered by at least one base of the assembly. It corresponds to the “Genome Fraction” field in the QUAST output. Recall that, in contrast, the  coverage of a sequencing dataset is the average number of times a genome base is covered by a read.

Exercise: report and compare assembly metrics

Please fill in this Google Docs spreadsheet to compare your assembly with the other participants. It will ask you for the name of the assembler you used (SPAdes, Canu, etc..), and some statistics that can all be found in the QUAST report.

https://docs.google.com/spreadsheets/d/1WZsYoK8Bbs73JhFEi3B70tS8uYvUDnvJYLV67R6H-5k/edit

Take a look at the assembly qualities from the other participants. At this point, you may be tempted to re-run your assembly with better parameters. This is in fact what we will do in Part 4.

Just to get a rough idea, assemblies that have a contig N50 around 100 Kbp can be considered to be “okay”/”good”. Although, note that there is no formal consensus about that in the community. They are “barely acceptable” when they have a contig N50 in the order of 10 kbp. This is because 10 Kbp is roughly the average length of genes.

Q: Would you say that your initial assembly was satisfactory, according to the informal metrics above?

Q: Based on the results reported by the other participants, is there a single metric which reflects the overall quality of an assembly?

Visualize your assembly with bandage

Use the Bandage software to visualize the assembly graph. It is generally the file that ends with “.gfa” (“.fastg” for SPAdes) in the same folder as the one that end with “.fasta”.

  • Load the .gfa file in Bandage
  • Observe the graph

Q: Is your assembly graph connected? i.e. are there many disjoint components, or a single large one.

This provides some indications of whether you had sufficient coverage to perform the assembly. It could also mean that the sequencer produced some sequences that did not overlap others.

Q: Does the graph has many branching nodes?

This is indicative of variants or repetitions that could not have been resolved by the assembler.

 

Optional: annotate the assembly

You may use the tool Prokka to annotate the assembly. It is installed on your Instance. In case you wish to install Prokka on another machine later, there are instructions are the end of this document. To run Prokka, just type in the same folder as your assembly:

prokka assembly_file

Note: if your prokka command fails, try running this command first:

export LC_ALL=C

If Prokka says: “Contig ID must <= 20 chars long: scaffold1scaffold1NODE_1_length_1420_cov_33.173943” Please let a TA know!

 

  • How many annotated proteins were predicted by Prokka? (hint: it is the number of entries in the output .faa file)
  • Across all reported assemblies in the Google Docs, what is the variability of the number of annotated genes?

 

Part 4: improve the initial assembly, analyse the assembly

(1 hour, until end of session)

The goal of this part is for you try a few different assembly strategies: Illumina-only, PacBio-only, hybrid. The goal is to see which strategy achieves a good compromise between assembly quality and cost of sequencing, given the V. cholerae runs that we have. Feel free to take a look at the parameters reported by the other participants in the Google Docs spreadsheet to achieve this goal. Below, we highlight several steps that you may additionally choose to take in order to improve your initial assembly. These steps are given in no specific order, and you may choose to follow only some of them. This part is open-ended.

 

Try again with a different assembler

With genome assembly (and also transcriptome assembly), getting a second opinion i.e., running a different assembler is recommended for any serious project. Assemblers are designed very differently, and there isn’t a single best assembler for all datasets.

Can you create a better assembly than your first attempt, using any other assembler e.g. from the list above?

Optional: see if adding more PacBio coverage helps

More coverage for the PacBio data is available in the ERR1716491 folder;

To use it, concatenate all the FASTA:

cd ~/workshop_materials/assembly/

cat ERR1716491/*.fasta > all_reads.fasta

Q: What is the coverage of the full dataset?

Q: Does this enable to create an even better assembly?

 

optional: Find out differences between Illumina and pacbio datasets

Actually those the Illumina and the Pacbio datasets aren’t from the same exact strain (the PacBio data is from Bengladesh). To determine the differences between the strains, what you could do is: align reads from the Illumina dataset to the Pacbio-only assembly, and call SNPs.  To call SNPs, you could for example use samtools mpileup: http://samtools.sourceforge.net/mpileup.shtml

Q: How many SNPs were found between the two strains?

As a check, do the same procedure by aligning reads from the Illumina dataset to the Illumina-only assembly.

Q: Does SNP calling of a dataset on its own assembly report “false positive” SNPs?

 

Treasure hunt

Afficher l'image d'origine

Do some bioinformatics, decode a secret message, find a treasure

A little known part of Cesky Krumlov workshop history is that centuries ago, there was an ancient civilization that had primitive DNA sequencers. They were very advanced, had the same kind of pets as we do, such as cats, salamanders and birds. There is an old legend saying that they hid a small bounty somewhere in the city. They left clues, of course. I was able to recover one but was never to find the treasure. I’m ready to give that clue to you, perhaps you’ll be able to assemble them so that they reveal a secret message..

Here’s the clue: http://rayan.chikhi.name/hunt

Then.. you are on your own..

 

 

Appendix

Data source

The V. cholerae H1 strain read datasets were downloaded from the SRA. Here is a set of commands to download the datasets if you don’t have access to the Workshop AMI anymore. The script uses the fastq-dump program from the SRA toolkit. If you don’t have it, install it using Conda: conda install -c bioconda sra-tools. Using fastq-dump isn’t a good idea to download PacBio data, see here why.

fastq-dump --split-files SRR531199
head -n 2000000 SRR531199_1.fastq > SRR531199_1.500k.fastq
head -n 2000000 SRR531199_2.fastq > SRR531199_2.500k.fastq

# go to https://www.ebi.ac.uk/ena/data/view/ERR1716491 and download all the .bax.h5 files then run
conda install dextractor
find . -name '*.bax.h5' | xargs dextract
cat m*.fasta > ERR1716491.fasta (edited)