Resistance is futile…you will be Assembled!

I totally and shameless just stole my title for this blog from the winner of the Evomics t-shirt competition in 2012 for the Workshop on Genomics.

Dr. Rayan Chikhi
Pennsylvania State University

Topic: Assembly

Rayan start us out today and by saying there is only one ‘process’ for assembly but many different possible outcomes of it:

  • Create a reference genome/assemble a transcriptome
  • Just interested in the genes
  • Find novel insertions
  • Make sense of un-mapped reads
  • Discover SNPs on non-model organisms
  • Validate breakpoints
  • Recover a specific region of interest
  • Explore metagenomics

Even in 2014 assembly is still a difficult process and Rayan highly recommends the GAGE 2012 benchmark paper as a good read illustrating the difficulties in obtaining good assemblies and how at the moment there is no ‘best’ de novo assembler out there right now. Another paper I found evaluating de novo assemblers is from PLoS ONE and cites the GAGE paper: Vezzi F, Narzisi G, Mishra B (2012) Feature-by-Feature – Evaluating De NovoSequence Assembly. PLoS ONE 7(2): e31002.

There is no single program right now that is considered ‘the assembler’. Different assemblers have advantages and disadvantages as well as things they are generally useful and not useful for.

So the state of the research is as such, there are:

  1. Data-specific assemblers
  2. Low-memory assemblers
  3. Best practices (protocols) papers
  4. Assembly techniques for other specific purposes (e.g. variant calling)

Let’s run through the some terms:

  • Assembly: Set of sequences which best approximate the original sequenced material.
  • Read: Any sequence that comes out of the sequencer
  • Paired read: read1, gap  500 bp, read2
  • Mate-pair: read1, gap  1 kbp, read2
  • Single read: Unpaired read
  • k-mer: Any sequence of length k
  • Contig: gap-less assembled sequence
  • Scaffold: sequence which may contain gaps (N)

One of the important things that you need to know for assembly is what a k-mer is.

A k-mer is any sequences with length k.

AGC is a k-mer with k=3
AGCT is a k-mer with k=4
AGCTT is a k-mer with k=5

You hopefully get the idea.

Here’s the first exercise that Rayan had in his presentation (see the solution in his slides):

ex1

There are two essential methods that assemblers use to assemble: de Bruijn graphs and overlap/string graphs and Rayan had some excellent slides illustrating how this all works so I will be putting some of those up; but I highly encourage you to review his slides.
[hr] de Bruijn

Consider a sequence: ACTG (it’s a very short sequence). Consider a k-mer size of 3, totally arbitrary but this is for illustration. If you are constructing a de Bruijn graph for this teeny sequence with k=3 it will look like this:

ACT –> CTG

So first you start with the beginning nucleotide and go 1, 2, 3 then you put an arrow and start with the next nucleotide ‘C’ and go 1, 2, 3. You cannot go further because starting with T wouldn’t give you 3 nucleotides, it would just be TG so it is not a k-mer of 3 therefore isn’t in the graph.

Ex2: AGGTCCAT

AGG –> GGT –> GTC –> TCC –> CCA –> CAT

Get it? Ok lets go in reverse…how do you construct a sequence from a de Bruijn graph for this sequence above?

Here you start with the first three nucleotides making up the first k-mer of k=3, that’s AGG. Now in order to construct the sequence you add the last nucleotide of the next k-mer in order:

AGG –> GGT –> GTC –> TCC –> CCA –> CAT

Reconstruction gives you: AGGTCCAT

What if you have mutations??? Consider the following teeny alignment of sequences, using k=3…lets see what the graph would do:

debruijn1

Consider this sequence being reconstructed using de Bruijn with k=3:

assem2

Now CTG as you can see is a ‘node’ with more than one arrow leading/coming out of it, this means it’s repeated in the sequence we are trying to construct. So any contigs we ‘make’ have to stop at that node…remember how to reconstruct a sequence?

assem3

 

Remember, contigs contain simple paths only that don’t include branching nodes. There are three starting points:

Contig 1: ACT it has to end because it cannot include the node with the branch.
Contig 2: TGCT
Contig 3: CTGAA

Now all the nodes are used and no sequence contains a branched node.

So that’s de Bruijn.

String/Overlap Graphing

This is probably what you are familiar with if you use assemblers for 454 data. It essentially looks like an alignment.

ACTGCT
–CTGCT (overlap of 5)
—–GCTAA (overlap of 3)

string1

Comparing de Bruijn and String:

stringdebruCompare

stringdebruCompare2
for more explanation and examples see Rayan’s slides

Which is better?

Point of information: A string/overlap graph cannot account for a mutation/substitution unless you tell it to (ie. allow 1 SNP error). So as you see above a de Bruijn graph will simply branch off, a string/overlap graph will give you nothing–it can’t make the connections in the alignments because it cannot resolve or wasn’t told to ‘ignore’ the mutation.

  • String/overlap is usually better if you have long reads (k-mers) as long as you adjust to allow for errors.
  • de Bruijn is better for shorter reads (k-mers)
  • For raw PacBio data, there are too many indels to input them to a string graph. The solution is to input corrected data.

In theory assembly will find the path of minimal length that traverses each node (read) once. Contigs consist of ‘simple paths’ only…paths that do not contain a branching node. So multiple contigs can be constructed from a path.

Unfortunately reality is rarely that simple; rather in practice we get a return of a set of paths covering the graph, such that all possible assemblies contain these paths.

Another exercise, see Rayan’s slides for the solution:

ex2

Classical metrics for evaluating assemblies include:

  • Number of contigs/scaffolds
  • Total length of the assembly
  • Length of the largest contig/scaffold
  • Percentage of gaps in scaffolds (’N’)
  • N50/NG50 of contigs/scaffolds
  • Number of predicted genes
  • Number of core genes

More recent metrics include internal consistency and likelihood of an assembly given the reads.

QUAST is a good program to give you estimates on all these programs.

How to evaluate assembly:

  • Length, size, coverage
  • Would you rather have an assembly of short reads with good coverage or an assembly of long reads with mediocre coverage?
  • It’s difficult to rank assemblies but some of the criteria include: number of contigs or scaffolds, length of assembly, length of the largest contig, numbers of gaps, the N50 value, internal consistency and number of predicted genes.

A note about internal consistency: It’s really the only way to detect errors in de novo assemblies though it rarely appears in articles. It is the percentage of paired reads correctly aligned back to the assembly and are called ‘happy pairs’. Recent tools include REAPR and FRCurve

Recent software that was developed to determine Assembly Likelihoods include ALE and CGAL.

Rayan also highlighted a blog post from Lex Nederbragt that talks about assembly uncertainty and gives suggestions.

N50: The scaffold or contig length at which larger sequences cover 50% of the total assembly length.

How about a visual…

assem4

So you have an assembly with 3 contigs–one of length 3, one of length 4 and one of length 1. So the total length of your assembly is 8, while your genome is length 10.

  • Ok, so first you can say you didn’t cover your entire genome
  • Total length of your assembly is 8
  • N50 asks first what is 50% of the assembly length? 4
  • N50 then asks for the length (do you have a contig) of length 4 or greater? Yes we do.
  • So our N50 is 4

What if our genome size was 12 instead of 10?

  • Trick question!! N50 ONLY deals with the length of the assembly.

Ok now lets deal with NG50, also a metric of assembly same EXACT thing as N50 but now in reference to the genome. Now this one will change answers depending on your genome.

  • Refer back to the figure above.
  • Our genome is length 10
  • 50% of our genome = 5
  • Do we have a contig of length 5? Nope
  • In this instance when you don’t have a contig of exactly 5, you now have to find your two largest contigs–3 and 4. Do they add up to greater than or equal to 5? Yes, they add to 7.
  • So now you take the SMALLEST of those two contigs and that becomes your NG50.
  • So your NG50 = 3

This SMALLEST rule also applies to N50. So in the above example if the assembly had been length 10 and you didn’t have a contig of size 5, you would take the two largest contig that equal or are greater than 5. Once you’ve found them, take the SMALLEST of the two as your N50.

solution in Rayan's slides
solution in Rayan’s slides

In general for projects involving the human genome or on that ‘scale’ N50 should be 10-100 kb for contigs and 1-10 Mb for scaffolds in a good assembly. For bacterial ‘sized’ projects, N50 would be 100 kb for contigs and up to the length of the genome for scaffolds–assuming your bacterial genome is big enough to merit scaffolding. My viruses are super tiny…so contigs many times easily span my genomes.

Other parameters aside from N50 and NG50

  • Internal consistency (discussed above): if everything aligns perfectly you’d have an internal consistency of 100
  • Coverage in terms of horizontal coverage of bases and depth, how many reads support the individual bases
  • Assembly errors: they aren’t perfect, they make substitution errors, small indels and mis-joins

Really there is no ‘global’ metric for accuracy. Some assemblers look at the percentage of aligned blocks. Many of the assemblers in the Assemblathon paper were measure for ‘greatness’ based on the number of structural errors in the assemblies. As mentioned above, QUAST is a great tool for obtaining metrics of your assembly that you can compare to decide if it’s a good assembly.

There are tons of assemblers…see Rayan’s slides for all but in general:

The assemblathon papers are a great resource for seeing what different assemblers can do with different data sets and are a highly recommended read.

For RNA-seq Assembly you are looking at short contigs (average length 2 kb), uneven coverage most likely due to varying expression and alternative splicing and your contigs will be used more than once. It follows the same line as regular DNA assembly using de Bruijn graphs with the exception that it reuses contigs.

Rayan wraps up with an assembly pipeline and suggests at each step:pipeline

  • Adapters: remove low quality reads, remove barcodes/adapters/primers trimming and merging of overlapping paired end reads into single reads
  • Correction: Allpaths-LG is quite good, SOAPdenovo, QUAKE (high coverage mitigates the need to do the correction step.
  • Diginorm
  • KmerGenie/PreQC: Optimal kmer size selection. Kmer size varies depending on dataset. Keep in mind, (i) there is a lower limit (ii) also a high limit (iii) ideally you want to set k as high as possible to reduce erroneous kmers (iv) try at least 2 k values (v) Use KmerGenie or PreQC to help.
  • Assembler: already discussed at length.
  • Scaffolder: scaffolding maps paired reads to contigs to order them. Most assemblers include a scaffolder. Scaffolding is where errors are most likely to be made. You can use stand-alone scaffolders: SSPACE, BESST, Bambus2, Opera…or just skip scaffolding because many times contigs are good enough.
  • GapFiller: Fills gaps inside scaffolds. A few assemblers include gap filling (ie. SOAPdenovo2, Allpaths-LG); GapCloser (a part of SOAPdenovo). Or you can use GapFiller or FinIS but they have limitations.
  • Annotation: Prokka (prokaryotes), MAKER (eukaryotes, per Konrad’s suggestion)

Assembly Advice/Reminders:

  • k-mer size will vary: At a low limit k-mer size (when you get ‘too’ low) then depending on the size of your genome you’ll start getting many repeats which will make assembly difficult. At a high limit make sure you have good error correction. Typically use a value of k = readlength -1. Ideally k should be set at the highest level where you have greater than 2 error free k-mers. Additionally, it is good to assemble with more than 1 k value and if you have absolutely NO idea where to start. Start with k = 25 or k = 30 unless you are dealing with uber short reads. And move on from there evaluating k values.
  • Error correction: Make sure you correct your errors! Make sure low quality reads are removed and trimming has been done prior.
  • If you have a large genome, scaffolding is recommended.
  • Generally good to run more than one assembler then compare outputs

And if things get difficult with your organism…perhaps yelling “Resistance if Futile! You will be Assembled!” will help…if not well, now at least you’ve gotten some odd attention from you lab mates who will probably give you plenty of space now for a bit.

Cheers,
Dr. Mel