Ikea: Some assembly required? - Evolution and Genomics

Participant Discussions

So last year, having just read all three books in the Lord of the Rings series, I pooled our student discussions under the theme “One *ome to rule them all”…you can read it here; many of its points are still applicable to the field today. But this year the discussions took a slightly different turn…I thought perhaps I could play off the newest movie that just came out, “The Desolation of Smaug”, but something else struck me more strongly…

Last year we had a student who mentioned how convenient it would be to have an Ikea catalog for bioinformatics/omics science, where you could just go in, pick all the parts pre-fashioned, put them together and voilà, your dataset would be complete and you would know what you have, with minimal assembly required.

Well for one, I will tell you right now some of those Ikea kits are damn hard to assemble! And they don’t always give you every part! Additionally, in our case, there is no bright picture on the front of the ‘box’ of what you are supposed to get at the end of the ‘assembly’ and unfortunately that is the state of the field right now. But frustrations aside…

This year I feel like that catalog was a request of all the groups. Several major themes came out; nearly all groups were concerned about:

What parameters to use?
How do you know you are using the right parameters?
What if you are missing a part (e.g., a reference genome)?
What is the best program (kit if you will)?
The desire for standards?
What is the ‘correct’ way to do things?
What do you do if what you want isn’t ‘created’ yet?

After mulling this over and talking to the faculty and instructors, I think there are a few things that we can try as bioinformaticians, and in trying we will be both challenged and freed…

We have chosen a field that is still evolving. Perhaps you didn’t choose bioinformatics; perhaps you were thrown into bioinformatics due to your love of rodents, invertebrates, the Ibex, Mesozoa, cyanobacteria or, in my case, Dengue fever. Yes, I fell in love with a virus that can cause violent internal bleeding in its worst form. We all have our quirks! In my defense, there are microbiologists who spend the better part of their careers neck deep in sh*t and they love it. I didn’t start out ‘wanting’ to be a bioinformatician; I use computational biology/bioinformatics to accomplish my research goals. I got my Ph.D. during a time when bioinformatics was just emerging: there were no rules, no structure, no classes, no workshops, no guidance really of any sort. It was a free-for-all…and I think you’ll agree that’s pretty scary. But that’s also the point. It’s exciting: no one can ‘tell you what to do’ and order you around, you are your own rise and fall, and you can try and do whatever you want! During my Ph.D. my advisor kept wanting to try new things, use the newest programs and employ the latest algorithm, and I had to explain it all, which was challenging when I had little computing background, no classes, and manuals that were not fantastically informative. As a result my fellow Ph.D. students and I put up a sign in our lab: “Adapt or Die!”, because we really felt that was where we were headed. Because our field is still evolving, we will need to evolve with it and even be a part of the evolutionary process.

There is no ‘meaning of life’ answer in bioinformatics right now. It would be lovely if the answer to life and to the ultimate question was ‘42’ (if you don’t know this reference click the link). But really there is no ‘right’ way to do things right now in bioinformatics. At last count (5 minutes ago), there were 651 bioinformatics applications, 615 references, and 875 URLs in SEQwiki. So there are a lot of options out there, and as the researcher you have to define what it is you want to answer: what is your question, what is your clear path, what is your hypothesis? Once that is clear you have two options: comb through and compare programs that look like they will do what you want, or build your own. Some programs will be fantastically documented, others will not; some programs will eat your computational power inside and out; some simply won’t work. At the workshop you’ve been or will be introduced to a lot of different programs. You’ll notice many slides have other links, references and programs that you can try depending on what it is you want to accomplish.

Personally, I didn’t find a program that summarized the quality of our data (derived from 3 different platforms) exactly how I wanted it, with specific metrics…so we built one. All the metrics could be drawn from the BAM file, so we built a program that specifically summarizes what we were interested in from the BAM. There is also no program that corrects a consensus sequence by automatically inserting ambiguities. This was important to me because I work with a quasispecies: I sometimes cannot call an ‘A’ an ‘A’ because I have more than one ‘variant’ within the viral swarm (population) I sequenced, so I need ambiguity in the consensus to reflect that. So we are building one, leveraging the strength of BAM and VCF files. I am not re-inventing the wheel so much as adapting it for different terrain.
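To make the idea of an ambiguity-aware consensus concrete, here is a minimal sketch in Python. It is not the program described above, just an illustration of the principle: at each position, every base seen in at least some fraction of the reads is kept, and the set of kept bases is written as a standard IUPAC ambiguity code. The function names and the 20% threshold are assumptions made for this example, and the per-position base counts are typed in by hand; in practice they would be derived from a BAM pileup or a VCF.

```python
from collections import Counter

# Standard IUPAC nucleotide codes, keyed by the set of bases each one represents.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def consensus_base(counts, min_fraction=0.2):
    """Return the IUPAC code for the bases seen at one position.

    counts: mapping of base -> number of reads supporting it.
    min_fraction: hypothetical cutoff; a base is kept only if it makes up
    at least this fraction of the reads at the position.
    """
    total = sum(counts.values())
    if total == 0:
        return "N"  # no coverage, so nothing can be called
    kept = frozenset(b for b, n in counts.items() if n / total >= min_fraction)
    # Anything not in the table (no base passing the cutoff, gap characters,
    # unexpected symbols) falls back to N.
    return IUPAC.get(kept, "N")


def consensus(columns, min_fraction=0.2):
    """Build a consensus string from per-position strings of observed bases."""
    return "".join(consensus_base(Counter(col), min_fraction) for col in columns)


# Hand-made example: three positions, four reads each.
example = consensus(["AAAA", "AAGG", "CCCT"])
```

With these toy columns the first position is a clean ‘A’, while the second and third positions each carry two variants above the cutoff and so come out as ambiguity codes rather than a single base.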

There are lots of buttons to push but no push-button solution.

“To do everything”, while admirable, is not an appropriate goal. What is it you are trying to accomplish with your research? What is your goal? Can you write your specific hypothesis? This will help you immensely in narrowing down the methodology, identifying the programs that might be useful, and knowing whether your dataset can even answer the question. As Mike stated, much of next-generation research today has moved away from discovery and is becoming hypothesis driven. Don’t expect the data to tell you your hypothesis. First form your hypothesis based on your literature reading and preliminary methods analysis, then design your experiment and see if your data answers it.

You must crawl before you walk, walk before you run, run before you sprint. I think incremental understanding is underappreciated sometimes, especially when you are jumping fields. Computer science, programming and software engineering are completely separate fields from biology for good reason: they are completely different in mindset and practice, and in some ways require an essential re-wiring of the brain. Don’t have an unrealistic expectation of what you should be able to do in a field you were not originally trained in. Don’t be afraid to back up, slow down and really understand what one tool is telling you, or what the code is saying, before moving to another and/or completely discarding it. I realize the field moves incredibly fast…but not so fast that you don’t have time to understand the building blocks that make the program work and what the program is saying.

Finally…

This is just the beginning… Workshops like this are meant to whet your appetite, to show you what’s possible, and to introduce you to tools and techniques that will hopefully inspire you to create, test, apply and be successful in your research. Despite there being over 600 programs available, you may not find an ‘exact’ program that does what you want. Perhaps the field needs to evolve more to handle larger datasets, and perhaps you are the one to do it: to build or contribute those new models, to adapt current programs or build new ones…to visualize data in new ways to answer your questions…to go where no man has gone before (*Star Trek music in the background*).

Well that’s all fine and inspirational…but where does this leave you at this moment in time in terms of what to do and where to go next?

Daniel had a great suggestion…find a publication you think is really cool or that uses methods you would like to use. Pull the data (which should be publicly available) and try to recreate what they did.
Look on forums or Google Groups. Most big software packages have forums or Google Groups associated with them. BioStars and SEQanswers are good places to start.
Try your dataset out on different programs or using different algorithms and see how they differ. It will help you decide which one makes more sense to use for your research.
Be a super sleuth. You are basically a detective, and all the clues to the murder are in your data. The murderer is not going to make it easy and hand themselves in with a full confession. Work the clues. Work the data. Figure it out. Be a detective. (credit: Dr. Mick Watson)
Be patient. Running beautifully crafted machine learning algorithms to find that perfect, but hidden, signal that reflects the true biology = 1% of your time. Getting data into the correct format, dealing with the fact that no two databases use the same identifiers or the same format, troubleshooting, and removing errors and systematic bias from your data = 99% of your time. This is the true art of bioinformatics. Try and get this done quickly and efficiently, so you can spend more time on the biology. (credit: Dr. Mick Watson)
Be suspicious. If it looks too good to be true, it probably is. A large majority of your “Eureka!” moments will just be errors and systematic bias. Whenever you find an answer, treat it with huge suspicion until you are absolutely sure it’s not an error. (credit: Dr. Mick Watson)
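The suggestion above to run your dataset through different programs and see how they differ can start very simply. The sketch below is one possible way to do it, assuming each program’s output has already been reduced to a list of variant calls written as (chromosome, position, reference, alternate) tuples. The function name `compare_calls` and the example calls are made up for illustration; they do not come from any particular tool or dataset.

```python
def compare_calls(calls_a, calls_b):
    """Compare two collections of variant calls using set arithmetic.

    Each call is assumed to be a hashable tuple such as
    (chromosome, position, reference, alternate).
    """
    a, b = set(calls_a), set(calls_b)
    shared = a & b
    union = a | b
    return {
        "shared": sorted(shared),   # calls both programs agree on
        "only_a": sorted(a - b),    # calls unique to the first program
        "only_b": sorted(b - a),    # calls unique to the second program
        # Jaccard index: shared calls over all calls; 1.0 if both are empty.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


# Invented example calls from two hypothetical programs.
program_a = [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 40, "G", "A")]
program_b = [("chr1", 100, "A", "G"), ("chr2", 40, "G", "A"), ("chr2", 90, "T", "C")]

result = compare_calls(program_a, program_b)
```

The calls that only one program reports are usually the interesting ones: they are where the detective work, and the suspicion, should begin.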

And I think that wraps it up…no Ikea catalog? No problem, we are up to the challenge.
