The ever dynamic Chris Wheat took the stage this morning to give us a continuation of great keynotes and master classes from our ridiculously intelligent and talented faculty…I am in utter awe.
Holy crap I feel like I’ve gotten the honor of sitting through keynote after keynote at #evomics2014, truly awesome.
— Mel Melendrez (@UseqMiseq) January 24, 2014
I feel like I’m watching a TEDx talk on ecological #genomics from Chris Wheat…this is great! #evomics2014
— Mel Melendrez (@UseqMiseq) January 24, 2014
Topic: Ecological Genomics
So this morning Chris gives us a sobering and yet hopeful talk about the state of ecological genomics and how we should approach our study designs and overall understanding when it comes to the nitty gritty of our studies. By way of introduction Chris gives us some anecdotes:
“What’s published isn’t all the science that goes on…”
“I am going to present a non-typical view of ecological genomics and make you uncomfortable by sharing my nightmares…don’t get me wrong, I love my field, every bit of it, but I want to give you some food for thought and perhaps encourage you to critically assess your results.”
“I use rather strange color patterns…so if your neighbor starts going into seizures, let me know, we’ll stop the talk…”
You would be surprised at the number of replication failures in the literature.
“What if 50% of your favorite studies were wrong? How would that affect your expectations?”
- Of the 49 most cited clinical (biomedical) studies, 45 showed the intervention was effective, and most were randomized controlled trials; of the 34 that were later replicated, 41% were directly contradicted or showed a smaller effect.
- In a mouse cocaine study replicated three times, each lab got a different result: average movement of 600 cm, 700 cm, and >5000 cm.
When, then, are published research findings more likely to be false?
- Small sample size
- When effect sizes are small
- When there is a greater number and lesser preselection of tested relationships
- Where there is greater flexibility in designs, definitions, outcomes and analytical modes
- When there is greater financial interest or prejudice
- When more teams are involved in a scientific field, all chasing after statistical significance.
To paraphrase Mark Twain: “There are lies, damn lies…and genomics?”
- Are datasets too big to fail?
- What do follow-up studies reveal?
- How can we gain confidence in our work?
A really good way to go about learning a technique or the science behind a paper is to recreate what papers found.
- What is the genetic basis of phenotypic traits?
- How do we find genes that matter?
Right now it feels like the pervading paradigm in science is “sequencing our way to answers.”
Chris goes on to ask how we find the genes that matter and what our selection tests are really detecting. It all boils down to ‘power’: what power do we have to detect, say, balancing selection? Power, by the way, is the probability that a test will reject the null hypothesis when the alternative hypothesis is true. Statistical analysis has become more important than ever in genomics, especially since recombination and gene conversion can destroy the footprint of selection. Selective sweeps come in two flavors, hard and soft; Chris used the threespine stickleback as a classic example of a hard sweep at a gene/allele.
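Since power keeps coming up, here is a minimal sketch of the idea (mine, not Chris’s): simulate many experiments where the alternative hypothesis really is true, and count how often a test rejects the null. The effect size, sample sizes, and the 1.96 z-cutoff approximation are all illustrative choices.

```python
# Sketch of statistical power: P(reject H0 | H1 is true), estimated by
# simulation. All parameter values here are arbitrary illustrations.
import random
import statistics

def t_stat(x, y):
    """Welch-style t statistic for two independent samples."""
    mx, my = statistics.mean(x), statistics.mean(y)
    vx, vy = statistics.variance(x), statistics.variance(y)
    return (mx - my) / ((vx / len(x) + vy / len(y)) ** 0.5)

def estimate_power(n, effect, reps=2000, crit=1.96):
    """Fraction of simulated true-effect experiments that reject H0.

    crit=1.96 is a normal-approximation cutoff for alpha = 0.05.
    """
    random.seed(42)  # deterministic for the sketch
    rejections = 0
    for _ in range(reps):
        control = [random.gauss(0.0, 1.0) for _ in range(n)]
        treated = [random.gauss(effect, 1.0) for _ in range(n)]
        if abs(t_stat(treated, control)) > crit:  # reject the null
            rejections += 1
    return rejections / reps

# Small samples + a modest effect = low power: most true effects are missed.
print(estimate_power(n=10, effect=0.5))
print(estimate_power(n=100, effect=0.5))
```

The point of the sketch is the gap between the two numbers: the same true effect is usually detected with n = 100 and usually missed with n = 10, which is exactly the small-sample problem from the list above.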
So hard sweeps are great: hard sweeps happen, and you can see a clear signature of fixation against the background variation of the rest of the genome. But what about soft selective sweeps? Chris bets everyone that soft sweeps, which are a lot harder to detect, probably dominate, or at least happen at a much higher frequency than we imagine.
In terms of detection, when a novel or large-effect mutation is selected to fixation, the probability of detecting it is quite high. However, if you have an old mutation, or a polygenic or ‘soft’ sweep that doesn’t go completely to fixation (though it may rise in the population), you have a low probability of detecting it.
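As a toy illustration of that contrast (not anything from the talk), a bare-bones Wright-Fisher simulation shows why a brand-new beneficial mutation is often lost to drift, while selection on standing variation shifts frequencies more reliably; the population size and selection coefficient below are invented.

```python
# Minimal Wright-Fisher sketch: selection plus genetic drift.
# Parameter values are arbitrary illustrations, not estimates.
import random

def simulate(p0, s, pop_size=1000, generations=2000, seed=1):
    """Return the allele frequency after selection + drift (0 or 1 if absorbed)."""
    random.seed(seed)
    p = p0
    for _ in range(generations):
        # selection: expected frequency after weighting the favored allele by (1 + s)
        w = p * (1 + s) / (p * (1 + s) + (1 - p))
        # drift: binomial sampling of the next generation of pop_size alleles
        count = sum(1 for _ in range(pop_size) if random.random() < w)
        p = count / pop_size
        if p in (0.0, 1.0):  # absorbed: lost or fixed
            break
    return p

print(simulate(p0=1 / 1000, s=0.05))  # new mutation: frequently lost by drift
print(simulate(p0=0.2, s=0.05))       # standing variation: usually marches to fixation
```

Run with many different seeds, the single-copy mutation is lost far more often than the standing variant, and when it does fix it leaves the classic hard-sweep signature; a partial rise from standing variation is the harder-to-detect soft case.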
Chris highlighted the Drosophila 12 Genomes Consortium project, showing some really beautiful work on the tempo and mode of chromosome evolution as you get more and more divergent from D. melanogaster.
- Over >20 My, chromosomal order has been completely reshuffled in Diptera
- There has been selection dynamics across functional categories of genes.
- 33.1% of single-copy orthologues have experienced positive selection on at least a subset of codons.
- They estimate one fixed gene gain/loss across the genome every 60,000 yrs
- An estimated 17 genes are duplicated and fixed in a genome every million years
At this point Chris stops to wonder if we’ve built a comparative genomics “house of cards”
- Really the data scale is too large to thoroughly assess error
- It’s likely 50% of what you think you know is wrong
- All conclusions at some stage rest upon simple bioinformatics and assumptions that get incorporated into seemingly unbiased methods.
His next set of slides goes on to explore the errors in this study that affect the interpretation of the work including gene alignments for detecting positive selection and calibrations used for temporal analysis.
- To assess selection, dN/dS was used; estimates of this metric vary by aligner, and there is little agreement between dN/dS estimates derived from different aligners.
- It is absolutely essential to compare across methods!
- Alignment can have a larger effect than biology: simply where you place an indel can change the overall interpretation and the amino acid translation, and both alignments could still be considered probable, good alignments. (see Chris’ slides)
- In terms of temporal inference, he mentions that we put constraints on biology given our underlying assumptions about how we think the data should behave. For instance, there are no fossils of Drosophila prior to 70 mya, so they constrained their node to 70 mya; release that constraint, however, and the node goes back as far as 115 mya.
“So what’s reality?”
- Bayesian priors can be dangerous if they are not properly informed or properly flexible in their assumptions and distributions; when they aren’t, they heavily influence the posteriors.
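To make the aligner-dependence point concrete, here is a toy, Nei-Gojobori-flavored count of synonymous vs nonsynonymous codon differences. If an aligner placed an indel differently, the codon pairings, and hence these counts, would change. The sequences and the partial codon table are made up for illustration; this is nowhere near a real dN/dS pipeline.

```python
# Toy sketch: classify codon differences between two aligned CDS fragments
# as synonymous or nonsynonymous. Sequences are invented; the codon table
# is a partial copy of the standard genetic code (only codons used below).

CODON_TABLE = {
    "TTT": "F", "TTC": "F",   # phenylalanine
    "CTT": "L", "CTC": "L",   # leucine
    "GAT": "D", "GAA": "E",   # aspartate / glutamate
    "AAA": "K", "AAG": "K",   # lysine
}

def count_differences(seq_a, seq_b):
    """Return (synonymous, nonsynonymous) codon-difference counts."""
    syn = nonsyn = 0
    for i in range(0, len(seq_a), 3):
        ca, cb = seq_a[i:i + 3], seq_b[i:i + 3]
        if ca == cb:
            continue
        if CODON_TABLE[ca] == CODON_TABLE[cb]:
            syn += 1      # same amino acid: synonymous change
        else:
            nonsyn += 1   # amino acid changed: nonsynonymous change
    return syn, nonsyn

syn, nonsyn = count_differences("TTTCTTGATAAA", "TTCCTCGAAAAG")
print(syn, nonsyn)  # prints "3 1"
```

Shift the frame by a misplaced single-base indel and every downstream codon pairing changes, so the same two sequences can yield very different syn/nonsyn counts, which is exactly why dN/dS estimates disagree across aligners.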
So what do you do? Realize that integrative science is challenging and discuss or collaborate with experts to evaluate your approach. Know that DNA and fossils rarely agree and to assess temporal signal in DNA in a robust manner you must reduce prior biases and use lots of DNA data while modeling the likely violations of analysis models.
“What we can measure is by definition uninteresting and what we are interested in is by definition unmeasurable” ~Lewontin 1974
Chris’ take: “What we can assemble in the genome may by definition be uninteresting and what we are interested in is by definition very difficult to sequence and assemble and annotate and estimate (indels, inversions, gene family dynamics, demography, selection, temporal estimates)…”
Determining whether your hypothesis is true isn’t just about getting a good P-value, it’s about cross validation with many techniques. Overcoming the issues and biases just mentioned is a continual challenge.
…and with that challenge in mind, time for coffee
*Play elevator intermission music in your head for a moment while I switch slide decks, might I suggest Frank Sinatra’s The Girl From Ipanema?*
And We’re Back…
“I don’t like putting Bio and Informatics together; they are very separate tool sets: how the world works versus how to manipulate data, and integrating the two is challenging. Much of the data you get back from the core facility or your bioinformatician, you will have to redo.”
So what can you do?
- Get involved, ask intelligent questions about how things are supposed to run.
- Put the Bio back in bioinformatics.
- Double check data from core facility…”Is it your species?”…not joking!
- Use a known dataset for your pipeline so you at least know your pipeline is working.
- Do independent methods converge?
- Reassess our common metrics: e.g. bootstrapping, p-values, outliers, demographic null models
Some things you should know about transcriptome analysis:
- If anyone tells you that you cannot do RNA-seq on a non-model organism that doesn’t have a genome, that’s untrue; we did it.
- Also untrue: the best assessment of an assembly is N50 and the number of contigs. Orthologs and the ortholog hit ratio are also very informative about how good your assembly is, and they are biological metrics, which makes them more robust.
- Also untrue: We’ll have your data back to you in a month!
- Programs like Trinity, edgeR and Bioconductor are fine, but be aware of their limitations. For example, the effect of filtering in Trinity can give you whisker outliers that aren’t really there, and SNP variation in an assembly can be hard to tease apart.
- Most studies are annotation-limited, so gene enrichment is really key
- Validate and solidify all your transcriptional insights; RNA-seq isn’t where it ends, it’s where it begins.
- Single-gene analysis can be restrictive; network analysis is where your statistical power will be
- Make sure you have data relevant to your phenotype and organisms
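A quick sketch of the two assembly metrics contrasted above: N50 is purely structural, while an ortholog hit ratio asks how much of a known gene each contig actually recovers. The contig and ortholog lengths below are invented for illustration.

```python
# Sketch: a structural assembly metric (N50) vs a biology-aware one
# (ortholog hit ratio). All lengths are invented illustration values.

def n50(contig_lengths):
    """Length L such that contigs of length >= L cover half the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length

def ortholog_hit_ratio(hit_len, ortholog_len):
    """Fraction of a known ortholog's length recovered by its best-hit contig."""
    return hit_len / ortholog_len

contigs = [12000, 8000, 5000, 2000, 1000]
print(n50(contigs))  # prints 8000; says nothing about gene completeness

# Per-ortholog ratios expose fragmented or truncated genes that N50 hides:
hits = [(900, 1800), (1500, 1500), (400, 2000)]  # (hit_len, ortholog_len) pairs
print([ortholog_hit_ratio(h, o) for h, o in hits])  # prints [0.5, 1.0, 0.2]
```

Two assemblies can share an impressive N50 while one of them truncates half its genes; a distribution of ortholog hit ratios near 1.0 is a more biologically meaningful sign of a good transcriptome assembly.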
Some more challenges:
- Do differences in performance correlate with differences in fitness?
- DNA alone can’t tell us about selection dynamics in the wild.
- Demand demonstration of natural selection when claiming adaptation
“Genomics is full of adaptive stories but many lack true validation”
So let’s come back to the stickleback for a moment, where researchers found this awesome gene Eda, correlated with loss of body armor in freshwater fish compared to their oceanic counterparts. We can ask: is low armor really adaptive in fresh water? When the selection event was replayed, there was initially a rise in the armored phenotype, but over time the freshwater low-armor phenotype came back; you see the ups and downs, but the allele frequency at the end was no different than at the start: 0.5. What is selection? By definition, a change in gene frequency. So what’s going on? The phenotype could be linked to a different gene that’s in turn linked to Eda, so it remains unclear whether Eda is truly a target of selection.
So how do you find candidate genes? You use classic study systems in the wild, where the validation process is still ongoing; second-generation approaches sometimes find the genes and sometimes don’t (or not easily), because modern tools aren’t designed with such genetic architectures in mind.
From here Chris transitioned into his work on butterflies (which is great work, though he’s still quite critical of his own methods). They found a gene associated with greater flight endurance in butterflies from new populations. How’d they do this? One butterfly, a jar with a lid connected to a hose to measure gas output; scare the hell out of the butterfly so it freaks out and flies around, and measure the CO2. I discuss this paper in his faculty highlight as well. One intriguing thing is that they did parts of the study before next-gen sequencing, and with the gene Sdhd in particular the phenotype wasn’t associated with a point mutation but rather a deletion, something they might not have found using NGS unless they knew to look for it (which they did): assembly of the region would’ve been biased toward the most common allele (no deletion). He found this mapping bias in the Pgi gene as well (a gene he’s been studying most of his career), where only the most common allele popped out.
In conclusion…he encourages us to be, well…scientists, and to question everything, even what we think we know; to cross-validate, use multiple methods, directly compare methods, be interdisciplinary, and not be afraid to replicate studies or reanalyze datasets if something looks ‘fishy’. He mentions his editorship at the Journal of Negative Results and says it’s now open for submissions from genomics.
I encourage you to take a walk through his slides, as he has many more figures from the studies he talks about, and take a look at his faculty highlight as well.
A hopeful reality check is what we got from this TEDx-esque talk, and we thank Chris for his sobering and valuable insight.