Our sequencing Guru takes the podium…


Dr. Mike Zody
Broad Institute

Topic: Genomics Study Design

Dr. Zody is a jack of all sequencing trades. He’s been at it for several years, and his career spans the Human Genome Project, vertebrate evolution and positive selection, and genetic links to viral disease. He’s probably seen and heard it all. His slides are excellent, so download them as soon as they go up on the website! His faculty highlight will be up soon, so I will leave the introduction at that and jump right into a rundown of today’s lecture…

He covered four main topics:

  1. Will sequencing address your goals?
  2. Considerations with sequencing
  3. Steps toward obtaining your data
  4. Specific sequencing application considerations: (Resequencing, genome assembly, RNA-seq, ChIP-seq and metagenomics).

Unfortunately we didn’t get to metagenomics, but you’ll hear plenty about metagenomics from Rob, Nick and Daniel later in the workshop.

Genomic Study Design: Does sequencing address your goals?

Sequencing has moved away from sequencing just for the heck of it in an effort to ‘discover something’. It has become more mainstream in cost and time, so experiments are becoming more hypothesis driven again.

What’s your goal? Is your study hypothesis driven? What do you specifically want to look at?

  • Are you interested in variation within and/or between species?
  • DNA binding sites and the genes/sequences they are affiliated with?
  • Gene expression (transcriptomics)?
  • Assembling a genome because there is no reference genome for your organism, or because you want to do comparative genomics?
  • Are you interested in teasing apart mixed populations?
  • Resequencing is great for generating comparative data.
  • Assembly is great for getting reference genomes or comparing genomes.
  • RNA-seq is useful (in an mRNA context) when you have samples whose expression you want to compare.
  • ChIP-seq focuses on the sequences affiliated with DNA-binding sites, so you can home in on particular regions of expression activity.
  • Metagenomics is for mixed populations, when you are interested in the composition and/or how it changes over time.

How much data do you generate?

  • Is your sequencing result the final answer, or just a survey to generate preliminary data for follow-up studies?
  • What are the costs of false positives and false negatives, relative to the cost of the sequence?
  • Four case studies highlight how projects involving SNP discovery might require different amounts of data based on these factors:

Case Study 1: Tumor/normal sequencing: you will need high coverage and high variant calling stringency.

Case Study 2: Microbial evolution: you will need high coverage but can do with low variant calling stringency.

Case Study 3: Vertebrate evolution: you can do with low coverage but you will need high variant calling stringency.

Case Study 4: Population SNP discovery: you can do with low coverage and low variant calling stringency.

See Mike’s slide for more info!

Things that influence your data…

  • Samples: do you need biological replicates? technical replicates? controls? do you have a good reference?
  • Type of library constructed: fragment, paired-end, mate pair
  • Fragment libraries are the least expensive and give one read per fragment. Paired-end libraries give you more data because they read the same fragment from both ends, and they help with assembly. Mate pair libraries are the most complicated: you need a lot of DNA substrate, they span longer fragments, and some platforms will not be able to read the second strand you generate.
  • number of reads: depth, coverage, quality

Mike had a great analogy here for thinking about the number of reads you need for a meaningful result. A common problem in sequencing is that there is not enough data. Suppose a laboratory protocol called for 1.0 micrograms of extracted DNA, but you decided to use 0.1 micrograms instead. What would you expect? Would you get a result? Maybe. Would it be as good a result as you would’ve gotten had you used the amount the protocol asked for? Probably not. Just because it’s an ‘advanced’ technology doesn’t make it fail-proof or remove the need for statistical robustness.

  • length of reads: the longer the better, but long reads of poor quality are worse than short reads of good quality, so don’t go ‘overboard’. Don’t be a low quality base-holder!
  • overall complexity: The number of distinct and randomly spread fragments in your library.

Let’s say you assemble your genome and you have lots of gaps, and all the reads seem to tile in the same areas. You think “OK, perhaps I need to sequence some more to get my coverage up”. So you create another library from that sample and sequence it the same way, thinking that for sure you’ll catch all those pesky gap areas. You assemble it together with your old data and, DOH! Same problem: now you have ridiculous coverage everywhere BUT in your low coverage or gap areas! Low complexity may suggest you need to re-think your laboratory protocols…Oh noes!!! Why!!!
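
To put a rough number on why piling more reads onto a low complexity library stops helping, here is a minimal sketch in Python. It is not from Mike's slides. It assumes an idealized library in which every distinct fragment is equally likely to be sampled, which real libraries violate, and the function names and the example library size are made up for illustration.

```python
import math


def expected_distinct(library_size: float, reads: float) -> float:
    """Expected number of distinct fragments observed after `reads` draws.

    Assumes uniform sampling with replacement from `library_size`
    distinct fragments (an idealization, not a real library).
    """
    return library_size * (1.0 - math.exp(-reads / library_size))


def duplicate_fraction(library_size: float, reads: float) -> float:
    """Fraction of reads that only re-observe an already seen fragment."""
    return 1.0 - expected_distinct(library_size, reads) / reads


if __name__ == "__main__":
    library = 1_000_000  # hypothetical number of distinct fragments
    for reads in (250_000, 1_000_000, 4_000_000, 16_000_000):
        seen = expected_distinct(library, reads)
        dup = duplicate_fraction(library, reads)
        print(f"{reads:>10} reads: {seen:>9.0f} distinct, {dup:.1%} duplicates")
```

Under this model the number of distinct fragments flattens out at the library size, so past a certain point extra sequencing mostly buys duplicates rather than new regions.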

Low complexity could be:

  • A target primer failed to amplify (assuming you are using targeted primers), leading to missing PCR products that could not be sequenced.
  • Do your PCR fragments actually represent what went into your library?
  • Are your fragments the correct size range? Too small and they aren’t flexible enough, too large and they’ll interfere with each other.

The physical machine you use:

Illumina: You can use all types of libraries (fragment, mate pair, paired end), and depending on the specific machine you can get read lengths of 150 to 250 bp.

SOLiD: All types of libraries, read lengths < 75 bp

454 Roche: Fragment and mate pair only, read lengths 450-750 bp

PacBio: Fragment only, very long read lengths (in the thousands of bp).

Considerations for Library Generation and Sequencing:

PCR Bias: There is a lot of PCR that goes into sequencing. If you have a target, you use PCR to enrich that target, in addition to the amplification that occurs for sequencing. Additionally, if you are sequencing an organism that has many secondary structures or an extreme GC content, this usually leads to low quality and/or poor representation of the portions of the genome that contain such structures or GC fluctuations. PCR can also generate chimeras as well as duplicate sequences.
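
If you want a quick look at where GC content swings in your own reference, something like the following Python sketch would do it. It is only an illustration: the window and step sizes are arbitrary choices of mine, and the helper names are hypothetical rather than taken from any tool Mike mentioned.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0  # window with only N or other ambiguity codes
    return (seq.count("G") + seq.count("C")) / called


def windowed_gc(seq: str, window: int, step: int) -> list[tuple[int, float]]:
    """Return (start, GC fraction) for each full window along `seq`."""
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    toy = "ATATATATGCGCGCGCATGCATGC"  # toy sequence, not real data
    for start, gc in windowed_gc(toy, window=8, step=8):
        print(start, f"{gc:.2f}")
```

Windows with very high or very low values are the ones to watch for thin or poor quality coverage.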

A great paper addressing the ways PCR can bias sequencing libraries, specifically Illumina libraries:

Aird et al., 2011. Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biology. 12:R18.

Sequencing Applications: I’m only going to rehash these briefly; I highly suggest you refer back to Mike’s slides!

Resequencing:

  • Useful for: SNP detection/discovery, population sequencing, comparative genomics, structural variant discovery. Ideally you will need a good reference genome, though, and that reference needs to be complete, accurate, and representative of the samples you are sequencing.
  • Sequence depth will depend on what you are sequencing:
  • Haploid/Bacterial/Viral >10x
  • Diploid >30x
  • Aneuploid or Somatic >50x
  • Population variant sequencing >200x
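
To turn depth targets like these into a read count, the usual back-of-the-envelope relation is mean depth = (number of reads × read length) / genome size. The Python sketch below just rearranges that. The genome sizes and read length in the example are round numbers I picked for illustration, and it ignores reads lost to duplicates, poor quality, or failure to map, so treat the result as a lower bound.

```python
import math


def mean_depth(num_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Average depth if every read maps and coverage is perfectly even."""
    return num_reads * read_length_bp / genome_size_bp


def reads_needed(genome_size_bp: int, target_depth: float, read_length_bp: int) -> int:
    """Minimum number of reads required to reach `target_depth` on average."""
    return math.ceil(genome_size_bp * target_depth / read_length_bp)


if __name__ == "__main__":
    # Hypothetical examples: a 5 Mb bacterium at 10x and a 3 Gb diploid at 30x,
    # both with 100 bp reads.
    print(reads_needed(5_000_000, 10, 100))
    print(reads_needed(3_000_000_000, 30, 100))
```

Because the depths in the list are minimums and real coverage is uneven, you would normally budget for more reads than this gives you.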

Mike had some great slides illustrating the fall-off in accuracy when you have genome structural problems like extreme GC content, which can affect overall quality and depth/coverage in your genome. They also showed what kind of read coverage you would need for effective SNP discovery.


Genome Assembly:

You do this when you want a reference genome. It is also an alternative method for discovering SNPs and structural variants. What is your preference for your reference? Do you want something truly representative of what’s being studied, even though it may be hard to sequence, or do you go for ‘inbred’ and sequence something that might be less specific yet still applicable to research by other investigators of that organism?

In terms of the coverage for assembly:

Illumina, Ion, SOLiD = 50-100x
454/PacBio = 20x

But the more coverage the better, because de novo assembly won’t work without it. Also, long reads help, mate pair libraries are preferred, and you’ll have to make allowances for GC content and for repeat regions in your genome, which will be hard to sequence.
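
For a feel of how depth relates to gaps, the classic Lander-Waterman approximation says that with reads placed uniformly at random, the chance a given base is hit by no read at mean depth c is about e^(-c). The Python sketch below is my own illustration of that idealized model, not something from the lecture. It ignores repeats, GC bias, sequencing errors, and the overlap an assembler needs between reads, all of which are reasons practical coverage targets sit well above what this model alone would suggest.

```python
import math


def uncovered_fraction(depth: float) -> float:
    """Expected fraction of bases with zero reads at mean depth `depth`.

    Poisson (Lander-Waterman) approximation: reads land uniformly at random.
    """
    return math.exp(-depth)


def expected_uncovered_bases(genome_size_bp: int, depth: float) -> float:
    """Expected number of bases with no read coverage at all."""
    return genome_size_bp * uncovered_fraction(depth)


if __name__ == "__main__":
    genome = 5_000_000  # hypothetical 5 Mb genome
    for depth in (1, 5, 10, 20):
        missing = expected_uncovered_bases(genome, depth)
        print(f"{depth:>3}x: about {missing:,.1f} bases expected with no coverage")
```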


RNA-seq:

You can do this to look at global expression of mRNA. It can help with annotation of genomes and looking at transcripts.

The pipeline of work:

  • Extract RNA
  • enrich for mRNA
  • convert to cDNA
  • Fragment your cDNA
  • Library construction
  • Sequencing

As with previous analyses you have similar considerations:

  • Sample numbers: Are you looking for differential gene expression? Novel gene discovery? How many replicates you need will depend on the biological variability of the process you are interested in. How do you figure that out? Well, hopefully someone has done an expression/microarray study and you can glean your starting values from that. In general, fewer reads with more replicates will give you a better idea of variation, while more reads and fewer replicates will give you better statistical support for your inferences.
  • Identification and Quantification of transcripts: What do you expect given previous studies or your own studies and of the total transcripts how many are actually involved in the biological process you are interested in?
  • Read length: The longer the better, but in general nothing less than 75 bp. Doing this with 454 libraries is problematic: reads can’t get ‘too long’, because otherwise they’ll span two exons/coding regions and then it’s just a mess to tease apart.
  • Analysis: You can align first or assemble first. With aligning first, you’ll probably get complete reconstruction even at low coverage, but you need a decent reference. Assembling first doesn’t need a reference and will recover high abundance transcripts with good quality, but it won’t resolve low abundance transcripts very well.
  • Strand Specific Libraries (Levin 2010 Nature Methods paper): Make for easier annotation with better resolution of overlapping genes; however, there are extra steps and costs involved.



Example RNA-seq runs:

  • Human expression (per condition): ¼ lane HiSeq, 76 bp paired
  • Vertebrate annotation (per tissue): ¼ lane HiSeq, 101 bp paired, strand-specific
  • Bacterial and fungal annotation: 1/12 lane HiSeq, 101 bp paired, strand-specific



Mike listed several caveats in his slides with respect to RNA-seq, such as: comparisons across experiments, current programs being of variable quality and not ‘perfect’, PCR duplicates, and variance.

and finally…


ChIP-seq:

ChIP-seq characterizes chromatin domains, identifies protein binding sites, and quantifies the occupancy of those sites. With ChIP-seq you have to take into account:

  • Enrichment and Frequency of target site (depends on # of reads)
  • Level of binding over background
  • Specificity of capture (i.e. antibody)
  • Variability in the system/using replicates, choice of control (depends on # of samples)
  • Paired ends are really not needed
  • Don’t particularly need highly accurate reads
  • Reads only have to be long enough to be placed uniquely
  • Repeat regions are often excluded.

Mike has a wealth of knowledge built from over a decade in this field, so I hope you will take advantage of his availability at the workshop…

Many thanks to our Guru for coming down from his place of meditation and relaying some incredibly insightful considerations and suggestions built from his years of experience over a span of organisms.

…Dr. Mel