Workshop on Genomics 2014 Faculty Highlight
We continue our foray into metagenomic mayhem with our next faculty highlight, Dr. Nick Loman, who has made his indelible mark on how we analyze data and use metagenomics in a clinical setting, specifically for pathogen discovery in an outbreak situation. He is the ‘Sherlock Holmes’, if you will, of the metagenomics world, wielding next generation sequencing as his magnifying glass to reveal the microbial or viral culprit behind an outbreak and bring it to justice.
Today we jump into one of his favorite papers: Loman et al., 2013. A culture-independent sequence-based approach to the investigation of an outbreak of Shiga-toxigenic Escherichia coli O104:H4. JAMA
Nick was gracious enough to entertain several questions I had on the study and elaborate so we get an inside view on the story behind the paper; everyone enjoys a good method to the madness, right? So let’s get to it!
In June 2011 there was an outbreak of Shiga-toxigenic Escherichia coli (STEC) in Germany. There were more than 3000 cases and more than 50 deaths. The CDC issued a warning and as with any outbreak the clock immediately starts ticking as clinicians and researchers struggle to understand and isolate the source so they can control, manage and ultimately cure it. Several papers came out characterizing the strain, looking at outcomes in children, and looking at the paradigm shift with respect to how we look at human pathogenicity of STEC strains. Nick and colleagues decided to use the data from this outbreak to look at the feasibility of using next generation sequencing and metagenomics to develop a framework and set of pipelines that would facilitate faster identification and therefore faster outbreak responses in the future.
Traditionally, clinical bacteriology uses culture techniques to isolate bacteria thought to be responsible for a given disease or outbreak. This is a time-consuming process and does not guarantee success if you are unsure of the pathogen you are attempting to isolate. So, what if you have no clue what your outbreak strain is? Remembering back to my medical bacteriology class, we always had some idea of what we might be aiming for when culturing samples given symptoms of the patient or other metadata. What if you don’t have that? What if all your tests come back negative? What if an outbreak strain is so different that it flies under the radar of the diagnostic tests you subject it to? The STEC responsible for the German outbreak had not previously been seen in the context of epidemic disease and therefore could not be detected easily.
We’re screwed? Perhaps not. Metagenomics has been used in clinical settings before to identify the viral culprits associated with outbreaks, so Nick and colleagues sought to look at the feasibility of extending the platform to clinical diagnostics and identifying bacterial culprits without the need for culturing.
First some jargon/terms, courtesy of Loman’s paper, which you will hear a lot in any genomics workshop, so you might as well start getting used to them now:
Metagenomics: this term is applied to the open ended sequencing of nucleic acids recovered directly from samples without target-specificity or enrichment.
Coverage: The number of times a portion of the genome is sequenced in a sequencing reaction; often expressed as ‘depth of coverage’ and numerically as 1x, 2x, 3x etc.
Environmental Gene Tags (EGTs): Short sequences of DNA that contain genes in whole or in part that can be used to identify and characterize the organisms from which they originate.
Read: A discrete segment of sequence information generated by a sequencing instrument. Read length refers to the number of nucleotides in that segment.
Alright…that should get us started
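To make the coverage definition above concrete, here is a minimal sketch of the usual back-of-the-envelope calculation: average depth is the total number of sequenced bases divided by the genome length. The function name and the example numbers are hypothetical and are not taken from the paper. The sketch also assumes every read comes from the genome of interest, which is rarely true in a metagenomic sample.

```python
def depth_of_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth = total sequenced bases / genome length."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


# Hypothetical example: 500,000 reads of 150 bp against a 5 Mb genome.
example_depth = depth_of_coverage(500_000, 150, 5_000_000)
print(f"{example_depth:.0f}x")  # prints "15x"
```

In a stool sample only a fraction of the reads belong to the outbreak strain, so num_reads should count just the reads that map to that genome.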
What they did…
- Stool samples were collected in Germany between May-July 2011
- High throughput sequencing was done in Oct 2012
- Bioinformatic analysis was conducted in Nov 2012 and Feb 2013
- Traditional culturing techniques were used to identify the bacteria in the sample (conventional pathogen detection). Methods include spreading on selective agar, PCR, ELISA, MALDI-TOF mass spectrometry fingerprinting, and serology to detect antigens.
- DNA extraction
- Library Prep for MiSeq using the Beckman SPRIworks fragment library system I and HiSeq using Nextera XT sample prep kit
- 2 pools which together contained 39 samples in equimolar concentration plus DNA from 5 samples that yielded pathogens other than STEC in 10-fold excess were sequenced on the HiSeq 2500, 2 x 151 rapid paired end sequencing aiming for a density of 800K to 1M clusters per mm^2, yielding 180 Gb in 40 hrs.
- MiSeq run of 8 libraries, 1 run/library (300 cycles, 2 x 150bp; paired end protocol)
- Bioinformatics included: initial screen out of human sequences, de novo assembly, alignment, phylogenetics
- A collection of environmental gene tags (EGTs) was made during assembly: EGTs are sequences that contain genes that can be used to identify the organism.
- Use of the MetaHIT project to provide control samples from healthy individuals to assist in enriching outbreak specific reads by screening out the normal flora (microbial occupants) seen in healthy individuals.
- Constructed a draft genome
- Determine coverage of the outbreak strain
- Used MetaPhlAn tool from the Human Microbiome project to ID pathogens other than the outbreak strain.
A summary flow was presented in the paper that gives a nice overview of the process.
I had some questions regarding their methods which Nick kindly expounded upon…
Your JAMA paper was very interesting, I was curious as to what the time comparisons were between traditional microbiological diagnosis of an outbreak pathogen and your metagenomics/sequencing approach? What is the time frame going from sample to identification assuming you have all resources needed?
“This is an interesting question because the answer depends on what you are comparing. We think, now we have built the pipelines, that you could go from a clinical sample and get it on an instrument same day, after doing DNA extraction and Nextera library prep. Then you are looking at a minimum of 24 hours on a MiSeq to get sequence data back. The bioinformatics analysis is where it gets interesting, it took us several months to figure out what we were doing initially, way too long, but in theory you could detect the pathogen and assemble a genome in under a day, so perhaps 3 days minimum. This is probably a little bit slower than conventional microbiology when looking at a fast-growing, commonly seen species or pathotype. But this outbreak was particularly interesting because although the E. coli was a Shiga toxin-producing strain, it was of an unusual serotype (O104:H4). And unlike the commonest serotype of E. coli that is Shiga toxin-positive, O157, this strain is sorbitol-fermenting. So these things reduce the speed in which the laboratory could be sure about what they are dealing with. The main advantage of a metagenome based approach is that in theory you can detect anything in the sample. And then if you can get the pathogen genome out, that’s a rich source of biological information, such as virulence genes and antibiotic resistance genes, as well as giving typing and SNP-level (SNP: Single Nucleotide Polymorphism) information useful for epidemiological studies. Collecting all that information via conventional microbiology can take weeks or months, or even longer with very slow growing organisms. It also seems likely that the sequencing could be done quicker in future with improvements to instruments!”
I noticed you did have the benefit of EGTs and reference genomes for the work. What might be an approach/advice for the complete ‘unknowns’? Several students in last year’s workshop were working with non-model organisms that had no database or ref genome to assist in assembly/annotation.
“Well, initially in that study we started with a reference-based approach. But when we wrote it up, the peer reviewers quite rightly pointed out that we need to be able to reconstruct the genome from scratch without any prior knowledge. So we built a system using de novo assembly which is entirely reference free. We then used a set of healthy controls, by reusing data from the MetaHIT metagenome sequencing project to try and figure out what in the dataset was never seen in healthy patients. This gave us lots of clues, including the O-antigen determining locus and the Shiga toxin prophage genes, as well as some parts of the plasmids. Then we used that information to cluster those contigs with the other contigs that seemed to be from the same genome, by using coverage information (how many reads mapped to the contigs) across multiple samples, and also paired-ends. We ended up being able to reconstruct the genome very well without any reference. This would in theory be a generalizable approach to pathogen detection, particularly if you had a bunch of samples from patients you thought had the same pathogen.”
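The coverage-based clustering Nick describes can be sketched roughly as follows. This is only an illustration of the idea, not the pipeline from the paper: it assumes you already have per-sample coverage for each contig, it uses a Pearson correlation with an arbitrary 0.9 cutoff (both are my assumptions), and it ignores the paired-end information the authors also used. All contig names and numbers are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length coverage profiles."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-variation signal
    return cov / sqrt(var_x * var_y)


def recruit_contigs(coverage, seeds, threshold=0.9):
    """Return non-seed contigs whose coverage across samples tracks the seeds."""
    n_samples = len(next(iter(coverage.values())))
    # Average the seed contigs into one reference profile per sample.
    seed_profile = [
        sum(coverage[s][i] for s in seeds) / len(seeds) for i in range(n_samples)
    ]
    return sorted(
        name
        for name, profile in coverage.items()
        if name not in seeds and pearson(profile, seed_profile) >= threshold
    )


# Hypothetical coverage (reads per contig, normalised) across four samples.
coverage = {
    "stx_prophage": [12.0, 0.5, 30.0, 2.0],
    "o_antigen": [11.0, 0.4, 28.0, 2.5],
    "contig_a": [10.0, 0.6, 31.0, 1.8],   # rises and falls with the seeds
    "contig_b": [3.0, 25.0, 1.0, 14.0],   # behaves like a different organism
}
recruited = recruit_contigs(coverage, seeds={"stx_prophage", "o_antigen"})
print(recruited)  # prints ['contig_a']
```

The intuition is that contigs from the same genome should be abundant in the same samples and scarce in the same samples, which is why having several samples from patients with the same pathogen helps.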
Are the extraction protocols for HiSeq and MiSeq identical or is there chance for extraction bias?
“The extraction protocols are independent of the sequencer but yes DNA extraction kit biases are a potential issue, particularly with bugs that are hard to crack open. Another issue is that with very low DNA input samples, the DNA extraction kit can actually contribute DNA into the sequencing you do, so you need to be cautious about your findings if you don’t have a negative kit control to sequence against. The sequencer itself is less happy sequencing very high GC regions so there is a possibility of bias there too.”
What they found?
- 1.5 million environmental gene tags (EGTs) with more than half falling into Enterobacteriales
- By subtracting out EGTs that were not present in at least 20 German outbreak samples and EGTs that matched those of healthy individuals (obtained from the MetaHIT project) they ended up with 450 outbreak specific EGTs.
- Outbreak specific EGTs became seeds to recruit other sequences from the dataset to build an accessory genome and do functional annotation…lo and behold, Shiga-toxin genes, O-antigen genes, antibiotic resistance genes, and aggregative adherence fimbriae
- By aligning reads back to the outbreak reference genome they got >10-fold coverage in 10 samples and >1-fold in 26 samples.
- 67% of the samples had Shiga-toxin genes.
- The flagellar H antigen serotype was confirmed
- The MLST sequence type was confirmed
- Other pathogens recovered: C. jejuni, C. difficile, Salmonella, and Campylobacter concisus
- Bacteriophage sequences were also discovered, which was unexpected.
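The EGT subtraction step in the list above boils down to a two-part filter, which can be sketched like this. It is a simplified illustration under my own assumptions: EGTs are represented by plain identifiers, and "matching a healthy EGT" is reduced to set membership, whereas real matching would be done by sequence similarity. The function and variable names are hypothetical.

```python
def outbreak_specific_egts(egt_to_samples, healthy_egts, min_samples=20):
    """Keep EGTs seen in at least min_samples outbreak samples and never in healthy controls.

    egt_to_samples: dict mapping an EGT id to the set of outbreak samples containing it.
    healthy_egts:   set of EGT ids that match sequences from healthy individuals.
    """
    kept = set()
    for egt, samples in egt_to_samples.items():
        widespread = len(samples) >= min_samples
        seen_in_healthy = egt in healthy_egts
        if widespread and not seen_in_healthy:
            kept.add(egt)
    return kept


# Toy data with three samples, so the threshold is lowered to 2 for the demo.
egt_to_samples = {
    "egt_stx2": {"s1", "s2", "s3"},           # widespread, absent from controls
    "egt_gut_commensal": {"s1", "s2", "s3"},  # widespread, but normal flora
    "egt_rare": {"s2"},                       # seen in too few samples
}
healthy_egts = {"egt_gut_commensal"}
seeds = outbreak_specific_egts(egt_to_samples, healthy_egts, min_samples=2)
print(sorted(seeds))  # prints ['egt_stx2']
```

The surviving EGTs play the role of the seeds that were then used to recruit the rest of the outbreak genome.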
Take Home Points…
- It is possible to identify and even build a genome of an outbreak strain using a metagenomics approach. Additionally, other bacteria of interest are recovered as well.
- Caveat to note: Because they discovered multiple candidate pathogens, “there might be some doubt cast as to the reliability of inferring a causal link between the detection of a single potential pathogen and causation of disease…”
- The data presented aren’t a ‘formal evaluation of metagenomics as a diagnostic tool’, and unfortunately with a sensitivity of 67% the technology has not come far enough yet to replace traditional bacteriological techniques in clinical testing….
- However, it is promising and offers a lot of potential and even though it is quite costly, in an outbreak situation the cost could be justified…especially with more difficult to identify pathogens.
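For readers new to diagnostic statistics, the sensitivity figure quoted above is simply the fraction of truly positive samples that the test flags as positive. A minimal sketch, using hypothetical counts that were chosen only to land near 67% and are not taken from this post:

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Sensitivity = TP / (TP + FN): the share of real positives the test detects."""
    total_positives = true_positives + false_negatives
    if total_positives == 0:
        raise ValueError("need at least one truly positive sample")
    return true_positives / total_positives


# Hypothetical counts: 20 culture-confirmed samples detected by sequencing, 10 missed.
example_sensitivity = sensitivity(true_positives=20, false_negatives=10)
print(f"{example_sensitivity:.1%}")
```

Note that sensitivity says nothing about false positives; that is what specificity measures, and a formal diagnostic evaluation would report both.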
Still more Questions…
“How far changes on microbial community composition and synergistic interactions between potential pathogens play a role in the development of pathology?”
Challenges to still overcome?
- Reduce Cost
- Increase Speed
- Simplification of workflows
- Improve sensitivity
…all of which may be attainable as sequencing technological advances march forward.
What do you feel is the most important consideration when conducting a metagenomics study? Or set of considerations?
“As with much genomics science, it is very helpful to have a specific set of questions you wish to answer in advance. And then your experimental design should be appropriate so you have a good chance of answering the questions. This is very specific to the study. And you should develop an understanding of how helpful different approaches will be to answer different questions. For example, 16S phylogenetic profiling will only look at bacteria and your results will be limited to taxonomic information. Whole-genome metagenomics can reveal the functional capacity of a community, but you need transcriptomes to actually know which genes are being transcribed. Then you really need to know about the complexity of your community, as this dictates the amount of sequencing you will need to do. Finally, you should have an idea of the type of bioinformatics analysis that is possible and how to go about doing it!”
- Ahhhh…I see a theme forming, your specific questions and NGS study design…design…design…design
Where do you see your focus going next? Do you think your projects will continue to stay in the pathogen discovery/characterization of genome realm? Bioinformatic pipeline development/programming? or has a different particular topic captured your imagination for the future?
“I really want to push the envelope of metagenomics for pathogen discovery and characterization. Right now we’ve focused on a relatively easy challenge, as the E. coli is present in the samples sometimes at very high abundance. But what happens if you take samples where there are only a few cells, sometimes in a sea of human cells. How can we reduce the human DNA ‘contamination’ to the level where we can detect and characterize pathogens. This might mean going to the single cell level which brings lots of challenges.”
“More generally I am interested in seeing these technologies reaching the clinic and directly benefiting patients, and that means lots of development of bioinformatics algorithms and software to give robust results that are interpretable and useful in a reasonable time scale. We are a long way off this right now!”
Your publication record and internet sites suggest a background of pathogens and programming. It’s interesting when I ask different scientists who also program or conduct bioinformatic analysis what they consider themselves, I get a mix of answers. Some say they are computational biologists, bioinformaticists, geneticist… What do you consider yourself in the field?
“My initial love was for computing, then for medicine (I trained and practiced as a medical doctor), then for genomes. But I suppose I identify myself primarily as a bioinformatician and a genomicist.”
Did your formal education include programming/computer science or did you teach yourself? Many of our students are teaching themselves and last year had many questions about how to turn a ‘biologist’ with little computing background into a ‘programmer/data scientist’ necessary for their research. Do you also have any advice?
“My only formal training in computing was from my dad, who taught me the basics of C programming at around the age of 12! Since then I’ve taught myself, which is probably the way I enjoy learning best, and has traditionally been the main way to learn bioinformatics! The main role of courses like Evomics is to show scientists the ecosystem of bioinformatics and to get their feet wet, and to become inspired to go home and do a bunch more learning to answer their scientific questions.”
I’m going to take a time-out here to highlight Nick’s involvement in bioinformatics training and outreach, as he has been highly vocal and has offered a great deal of insight to aspiring bioinformaticists. Nick co-authored a commentary with Dr. Mick Watson (another NGS powerhouse who is highly vocal on training and education in the field) in Nature Biotechnology entitled “So you want to be a computational biologist?”
They lay out 10 goals that you should keep in mind on your path to becoming a computational biologist with helpful insights on each:
- Understand your goals and choose appropriate methods.
- Set ‘traps’ for your own scripts and other people’s
- You are a scientist, not a programmer
- Use a version control software
- Pipelineitis is a nasty disease
- An Obama frame of mind
- Be suspicious and trust nobody
- The right tool for the right job
- Be a detective
- Someone has already done this? Find them!
There are several helpful tables included in the publication as well such as learning tools, computing terms and tools for the biological software developer.
“The most important starting point is to be brave enough to try and to learn from these resources. Install Linux on your PC, and start working through some learning materials online. You will be astonished what you will be able to achieve very quickly, and ultimately you will have a very rewarding experience!” ~Loman and Watson 2013
Finally, there are a great many aspiring scientists who look up to you as they create and re-create themselves…who was the defining force/mentor or, if you prefer, the defining moment in your life that set you on the scientific path you are on now?
- Mel: And here’s my shameless plug of their book: The Double Helix
And of course because it’s fun…
Three words that you feel sum up your experience(s) with metagenomics?
“Bloody big data”
What is your favorite microbe/organism?
“Either the naked mole rat, or just plain old E. coli”
Nick has made several resources available for anyone who wants to read his blog or dig into his science notebook. Here are the links:
- Pathogens, genes and genomes blog
- Github repositories
- Loman Lab Notebook on GitHub
- xBASE annotation
- Clinicogenomics website
- Next Generation Genomics: World Map of High-Throughput Sequencers
So for now we say goodbye to our Sherlock of the NGS world; we look forward to his presentation and tutorial session at the workshop this year and I have every confidence in his apt placement among the titans of NGS analysis, let’s ask our cultured culture if he agrees…