Eukaryotic Ebullience

Sunday, July 19, 2015

Reflections on #BOSC2015 - keynote and containers

Last weekend marked my first time attending the Bioinformatics Open Source Conference (#BOSC2015), a two-day satellite conference taking place before the huge ISMB/ECCB meeting in Dublin, Ireland. I gave a keynote talk, and my slides are now posted here on Figshare.

I was excited to receive this invitation. BOSC isn't a conference that I would normally carve out time to attend - I'm an open source advocate and end-user of bioinformatics software, but I'm firmly trying to forge a career in the world of academic biological research. I enjoy (and benefit from) being involved in software development projects, but I'm also keenly aware that this type of activity is unfortunately still not considered to be "primary research" and thus could impact my job/promotion prospects.

Initially, I wasn't sure who would be my BOSC audience (researchers vs. bioinformaticians vs. developers), so I decided to generally talk about my personal experiences transitioning from a traditional biological discipline (marine biology/taxonomy) to more computational and interdisciplinary pursuits. And I finally admitted in public (!) that I saw a command line for the first time in my life in the year 2010.

The ideas I wanted to convey were inspired by this Donald Rumsfeld quote. I argue that scientists like me (biologists doing computational work) are continually overwhelmed with the rapid pace of sequencing technology and software. Trying to keep up with too many disciplines at once leads to confusion and paranoia. Often I think I'm perhaps only a little less confused than other people (which makes me an expert in the eyes of my collaborators).

The knowledge of interdisciplinary biologists is thus broken down into three categories:

Known Knowns

The things I know I know - my assessment of my own skills, such as:

I know I can write perl/python/shell scripts
I know I can install and run software - on my laptop, in the cloud, on a cluster
I know how to sign up for training courses
I know how to Google error messages
I know who to ask if I get stuck

Known Unknowns

The things I know I don't know. Skills/knowledge I could have, but do not currently possess.

Mostly this relates to new software, languages, workflows, and file formats. There's always so much jargon that biologists have never been exposed to - phrases like "Hadoop", "Drupal", "Docker" or "Ruby on Rails". Terms that you hear and ask - is that an App? Is that a programming language? Should I learn this? How have I not heard about this before? When will I have time to read up on this?

As much as I try to keep up with the world of bioinformatics and software development, I just can't do it. Too much information. But there's always the worry that one of these jargony-terms is something that should really become a known known.

Unknown Unknowns

The things we (biologists) don't know we don't know. Speaking from personal experience, this is so frustrating for self-taught programmers (with no formal computer science classes or degree). There are huge gaps in my knowledge that never get filled in - because I never realized the gaps existed:

Core computer science concepts, such as a fundamental knowledge of hardware. When I first started talking to HPC system administrators, I realized I had no idea how computers worked. Or that I really needed to know this.

All the possible different file formats (text-based vs. binary) - especially more complex binary formatted files like HDF5 which have been long used in the earth/physical sciences but are only now gaining a foothold in microbial ecology and computational biology

Understanding how software interacts with hardware - in particular, the concept of parallel computing (running bioinformatics tools using shared memory or distributed computing, a distinction which is obfuscated for many beginners)

For me, most of the above things are now "known knowns", but these concepts were ones that eluded me for years even as my computational skills increased.
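To make the text-versus-binary distinction from the list above a bit more concrete, here is a minimal Python sketch that stores the same numbers both ways. The read counts are made up for illustration, and the fixed-width layout is just one simple choice; real formats like HDF5 are far richer than this.

```python
import struct

# Hypothetical read counts for five samples (made-up numbers).
counts = [12, 4096, 7, 65535, 300]

# Text encoding: human-readable, one number per line.
text_bytes = "\n".join(str(c) for c in counts).encode("ascii")

# Binary encoding: each count packed as a fixed-width, little-endian,
# unsigned 32-bit integer (4 bytes per value, whatever its size).
layout = f"<{len(counts)}I"
binary_bytes = struct.pack(layout, *counts)

# The text form can be read by eye; the binary form can only be decoded
# if you already know the layout that was used to write it.
decoded_text = [int(line) for line in text_bytes.decode("ascii").splitlines()]
decoded_binary = list(struct.unpack(layout, binary_bytes))

print(text_bytes)
print(binary_bytes.hex())
print(decoded_text == decoded_binary == counts)
```

The point is that you can open the text version in any editor and troubleshoot it directly, while the binary version is compact and fast to parse but opaque without its specification.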

So I decided that my goal at BOSC was to interact with the other participants and figure out, as a biologist:

Am I being reproducible enough?
Am I using the right tools?
Am I learning the right skills?
Am I being efficient in my quest to break my bad biologist habits (regarding data management, sharing, and reproducibility)?

After two days of talks and discussions, I'm pretty satisfied that I'm ahead of the curve and doing the best I can regarding the above questions.

I may have set off a slight twitter storm by stating that I don't use (and don't really like) Galaxy - which developed into an interesting discussion on containers and workflow systems. Kai Blin has a great blog post up about this topic, "Thoughts on Overengineering Bioinformatics Analyses", and the title sums up my thoughts perfectly. From a biologist's perspective, workflow and container systems are just too impractical to implement when I'm trying to analyze data and push out a manuscript. Each manuscript I write is very different from the next - sure, some things are generalizable (like demultiplexing raw Illumina data), but the data processing and exact analyses need to be customized based on the hypotheses/questions, the type of data (metagenome, rRNA survey, draft genome), and the target organism (eukaryote, prokaryote, viruses, etc.). For example, I work a lot with eukaryotic rRNA and metagenome data, where you have sparse reference databases for genes/genomes, and you have to really dig in with custom analyses and data parsing in order to explore it properly.

The main reasons I don't use containers are:

I work on different university clusters (depending on the project), so installing containers/workflows across all these HPC systems would be a nightmare and massive time suck.

I'm not confident that workflow systems would enable me to customize my analyses in the way I'm used to - so to get a custom system up and running I'd have to specifically employ a developer...which I don't really have the motivation or funding for, especially when I'm fine running everything on the command line myself.

Philosophically - I don't think I should be teaching students/collaborators to use GUI workflow systems. It doesn't help them understand how the data analysis works, and I feel like it promotes a certain "black box" mentality towards bioinformatics. I'd much rather reduce fear of the command line, and get them comfortable exploring the input/output files as a way to troubleshoot and interpret their analyses.

We've gotten used to documenting our analyses using a combination of ipython notebooks and R markdown scripts which get published alongside the manuscript and data files (both raw and processed) - these go on ENA/NCBI, Github and/or Figshare. This seems to be a good system for collaborators with diverse skill sets (even taxonomists can understand plain text files or things they can open in a web browser), and also serves our open science aims.

The most interesting thing about attending BOSC is realizing how diverse the open source bioinformatics world has become. Everyone is excited about something different, and many frameworks are still very much in flux. Workflows and containers may be how biologists carry out reproducible analyses in the future, but right now it's a rabbit hole that I'd prefer to watch from afar.

Saturday, April 5, 2014

If you only read ONE paper this year...

...make it this paper:

Osborne JM, Bernabeu MO, Bruna M, Calderhead B, Cooper J, et al. (2014) Ten Simple Rules for Effective Computational Research. PLoS Comput Biol, 10(3): e1003506. doi:10.1371/journal.pcbi.1003506

INCLUDING the supporting information text, which is very detailed and provides a fantastic trove of resources explaining how to get started.

As someone who has moved from lab-based biology towards computational research, I can tell you that I've learned many of these lessons the hard way. I wish someone had handed me this paper four years ago when I started transitioning fields during my first postdoc. For biologists, this paper provides great advice for how to a) make computational tasks less painful, b) work with computer scientists and bioinformaticians, and c) do better science. If you don't think you need to learn GitHub, that you can get by without it...sure, you can, but you'll probably end up there anyway. And then you'll discover how much easier and more efficient your research will become.

Thursday, March 27, 2014

Lateral Gene Transfer detected in Eukaryotic rRNA genes

This paper is an example of super cool science that also makes me worry. Eukaryotes are known to have lower levels of Lateral Gene Transfer (LGT), and before this paper I assumed that LGT would not impact eukaryotic rRNA genes. However, this is not so according to Yabuki et al. (2014):

Here, we report the first case of lateral transfer of eukaryotic rRNA genes. Two distinct sequences of the 18S rRNA gene were detected from a clonal culture of the stramenopile, Ciliophrys infusionum. One was clearly derived from Ciliophrys, but the other gene originated from a perkinsid alveolate. Genome-walking analyses revealed that this alveolate-type rRNA gene is immediately adjacent to two protein-coding genes (ubc12 and usp39), and the origin of both genes was shown to be a stramenopile (that is, Ciliophrys) in our phylogenetic analyses. These findings indicate that the alveolate-type rRNA gene is encoded on the Ciliophrys genome and that eukaryotic rRNA genes can be transferred laterally.

Why is this paper worrisome? Well, if LGT of rRNA genes is a widespread phenomenon in microbial eukaryotes, it will inflate biodiversity estimates obtained from environmental sequencing studies. If you had an environmental rRNA Illumina dataset, your bioinformatic analysis would show taxonomic assignments for both an alveolate and a stramenopile (detecting 2 taxa from one genome: one true assignment, one false). The authors cite this concern in their conclusion:

These large-scale [environmental] surveys may detect transferred rRNA genes and such transferred rRNA genes may confuse our understanding of the true diversity and distribution of microbial eukaryotes, even if the frequency of lateral transfers of the rRNA gene is rare and the copy numbers of the transferred rRNA gene in environments are low. We agree that environmental rRNA gene surveys with PCR are still useful and effective to estimate the diversity/distribution of microbial eukaryotes. However, the fact that recovered rRNA gene sequences do not always reflect the actual existence of microbial eukaryotes corresponding to these sequences should be kept in mind based on our findings.

In other words, more research is needed to determine exactly how widespread this rRNA LGT phenomenon is in eukaryotes...it may be something else we need to take into account when designing software workflows for environmental sequence data.
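The double-counting concern described above can be illustrated with a tiny Python sketch. Everything in it is hypothetical: the read names and assignments are made up, and the "organism of origin" column is something you would never actually observe in a real environmental survey. It is only there to show how counting taxa from rRNA assignments can overshoot the number of organisms present.

```python
# Hypothetical taxonomic assignments for rRNA reads from one sample.
# Each tuple is (read_id, assigned_taxon, organism_of_origin).
# organism_of_origin is known here only because the data are invented.
reads = [
    ("read1", "stramenopile", "Ciliophrys"),
    ("read2", "alveolate", "Ciliophrys"),  # laterally transferred rRNA copy
    ("read3", "stramenopile", "Ciliophrys"),
]

# What a standard pipeline reports: distinct taxa among the assignments.
observed_taxa = {taxon for _, taxon, _ in reads}

# What is actually in the sample: distinct source organisms.
true_organisms = {organism for _, _, organism in reads}

print(len(observed_taxa))   # taxa inferred from rRNA assignments
print(len(true_organisms))  # organisms actually present
```

In this toy case the survey would report two taxa for a sample containing a single organism.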

Reference:

Yabuki, A., Toyofuku, T., & Takishita, K. (2014). Lateral transfer of eukaryotic ribosomal RNA genes: an emerging concern for molecular ecology of microbial eukaryotes. The ISME Journal, 1–4. doi:10.1038/ismej.2013.252

Wednesday, March 26, 2014

Exploring Statistics for Metagenomic Datasets

Recent lab discussions have made me think a lot about statistical tests we can use to detect and verify differences between metagenomic datasets. Since I don't have a strong background in statistics, my knowledge of this topic is still evolving - the scale and distribution of genomic datasets can be a tricky issue to deal with in a lot of statistical tests, it seems.

Some of the most useful resources I've found so far are as follows (feel free to comment and recommend more resources):

Parks, D. H., & Beiko, R. G. (2010). Identifying biologically relevant differences between metagenomic communities. Bioinformatics, 26(6), 715–721. doi:10.1093/bioinformatics/btq041 (good rundown of the different statistical techniques applied to genomic data, including their implementation in the STAMP pipeline)

Primmer, C. R., Papakostas, S., Leder, E. H., Davis, M. J., & Ragan, M. A. (2013). Annotated genes and nonannotated genomes: cross-species use of Gene Ontology in ecology and evolution research. Molecular Ecology, 22(12), 3216–3241. doi:10.1111/mec.12309 (especially Box 3 - Gene Ontology enrichment tests)

Metagenome Ordination in IMG - provides a good comparison of PCA vs. PCoA vs. NMDS, particularly in regard to how each of these statistics is calculated (and how they differ from one another).

Dinsdale, E. A., Edwards, R. A., Bailey, B. A., Tuba, I., Akhter, S., McNair, K., et al. (2013). Multivariate analysis of functional metagenomes. Frontiers in Genetics, 4, 41. doi:10.3389/fgene.2013.00041 (added to list 5/10/14 - comprehensive and thought-provoking overview of metagenomic data analysis)
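One practical difference between the ordination methods mentioned in the list above is their input: PCA works directly on the sample-by-feature table, while PCoA and NMDS both start from a matrix of pairwise dissimilarities between samples. As a rough sketch of that first step, the Python below builds such a matrix. Note the assumptions: Bray-Curtis is just one common choice of dissimilarity that I picked for illustration, and the gene-family counts are invented.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two equal-length abundance vectors."""
    if len(a) != len(b):
        raise ValueError("abundance vectors must have the same length")
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0  # convention chosen here for two empty samples
    return sum(abs(x - y) for x, y in zip(a, b)) / total

# Hypothetical gene-family counts for three metagenomes.
samples = {
    "A": [10, 0, 5],
    "B": [10, 0, 5],
    "C": [0, 8, 0],
}

# Square, symmetric dissimilarity matrix: the kind of input that
# PCoA and NMDS take in place of the raw count table.
names = sorted(samples)
matrix = [[bray_curtis(samples[i], samples[j]) for j in names] for i in names]

for name, row in zip(names, matrix):
    print(name, row)
```

Identical samples get a dissimilarity of 0 and samples with no shared features get 1; the ordination step then tries to place samples in a few dimensions so that those dissimilarities (or, for NMDS, their rank order) are preserved as well as possible.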

Monday, February 10, 2014

Meeting Announcement: Evolutionary Biology of Caenorhabditis & Other Nematodes

Since I'm on the organizing committee for this upcoming meeting, it's time to start advertising! Abstract submissions are now open (click here for meeting website):

Evolutionary Biology of Caenorhabditis and other Nematodes

June 14-17, 2014, Hinxton, UK

This conference will bring together scientists studying evolutionary processes in diverse nematode groups. In addition to attracting many researchers studying evolution in Caenorhabditis elegans as a model organism (and its closer relatives such as C. briggsae and C. remanei), the meeting will also welcome scientists investigating other free-living groups and the numerous animal- and plant-parasitic nematode species that threaten human health and the global economy. There will be a strong emphasis on genomic approaches and perspectives. The topics highlighted will include experimental evolution, fundamental evolutionary forces, genotype-phenotype relationships, metagenomic analyses, and processes of parasitism. The programme plays a critical role in promoting interaction and collaboration between evolutionary scientists trained in the C. elegans tradition and those focused on other nematode groups.

A limited number of registration bursaries are available for PhD students and junior post-docs to attend this conference (up to 50% of the registration fee).

Abstract and bursary deadline: May 2, 2014
Registration deadline: May 16, 2014

Monday, January 6, 2014

NRC survey: Research Priorities for Marine Science

I received an e-mail from the INDEEP mailing list, asking me to participate in a Virtual Town Hall on marine science research priorities, currently being run by the NRC. Here's the rundown from their website:

The National Research Council, at the request of the National Science Foundation, is seeking guidance from the ocean sciences community on the prioritization of research and facilities for the coming decade. The Decadal Survey of Ocean Sciences (DSOS) committee has been assembled for this task. To fulfill its charge, the DSOS committee is asking for community input via this Virtual Town Hall. To submit your input, please fill out the following identifying information, since anonymous comments will not be collected or posted. The deadline to submit your comments is March 15, 2014.

I figured I'd post my survey answers here (it would be great to generate some discussion about how we can promote greater emphasis on genomic tools and high-throughput sequencing in marine ecosystems - in particular the deep sea):

Across all ocean science disciplines, please list 3 important scientific questions that you believe will drive ocean research over the decade.

1) What is the role of microbial processes in ecosystem function?

2) How do microbes respond to (and impact) climate change?

3) How do we integrate knowledge from different fields (e.g. physical oceanography, biogeochemistry, taxonomy, marine biology) to gain a more comprehensive view of the marine environment?

Within your own discipline, please list 3 important scientific questions that you believe will drive ocean research over the next decade.

1) Characterizing phylogeographic patterns in microbial eukaryotes using genomic data. What is the proportion of cosmopolitan vs. regionally restricted species in different marine habitats?

2) Linking genomic data (DNA, RNA, genome sequences) to the existing body of morphological, ecological and taxonomic data. Particularly important for microbial species where each of these data types exists in discipline-specific silos. How can such linked data further our understanding of marine ecosystems?

3) How do we build accurate models (e.g. using robust algorithms and existing data as training sets) to predict species distributions and the potential impacts of climate change?

Please list 3 ideas for programs, technology, infrastructure, or facilities that you believe will play a major role in addressing the above questions over the next decade. Please consider both existing and new technology/facilities/infrastructure/programs that could be deployed in this timeframe. What mechanisms might be identified to best leverage these investments (interagency collaborations, international partnerships, etc.)?

1) In order to address ecosystem-scale questions, and use cutting-edge methods to do so, the marine science community (particularly ecologists and taxonomists) needs to forge links with researchers in genomics and computational biology. DNA sequencing is largely underutilized in marine environments (notably lacking in the deep-sea), yet it offers a deep, cost-effective view of species, populations, and communities. However, computational expertise is needed to effectively apply genomic tools to marine systems, and that expertise must come from researchers who are knowledgeable about current software and algorithms (workflows optimized for "big data").

2) Funding initiatives or programs emphasizing microbial eukaryotes are needed to complement the (currently much greater) emphasis on bacteria/archaea, macrofauna and megafauna. Meiofauna and protists underpin many key ecosystem processes (e.g. nutrient cycling), but their role in marine habitats is perpetually understudied. We lack even a basic understanding of global biodiversity and species distributions for the majority of microbial metazoan phyla.

3) Marine sampling protocols MUST adopt forward-looking approaches. Ship time is expensive, and samples from habitats such as the deep-sea are precious and difficult to obtain (particularly for researchers in the genomics or computational biology communities, who may not have the professional connections needed to obtain biological samples). Many sample preservation methods do not consider the potential long-term use of a sample; for example, using formalin to preserve sediment immediately destroys the possibility of using that sample for DNA sequencing. There are many alternate sample preservation methods that preserve both DNA and morphological features (e.g. DESS is effective for sampling microbial metazoa). Giving deeper thought to sample collection, and prioritizing DNA preservation from diverse marine environments, is CRITICAL for furthering our understanding of marine biodiversity, biogeography, and ecology.

To give your own input, fill out the survey at this link: http://nas-sites.org/dsos2015/

Sunday, December 29, 2013

PFTF Discussion: Job talks, chalk talks, and teaching demonstrations

Last month we finished off winter quarter with another talk about the academic job search, where UC Davis faculty Siobhan Brady (Plant Sciences) and Sarah Perrault (University Writing Program) gave us the rundown on job talks, chalk talks, and teaching demonstrations.

Firstly, we discussed how these three presentations differ:

Job talk (research talk, about 50 minutes long) - what you did in the past
Chalk talk (research plans, can be 20-90 minutes long) - what you're going to do in the future
Teaching demonstration (can be 25-60 minutes) - mock classroom or course instruction

Then we broached the finer details:

Chalk talks usually put forth the aims listed in your research statement (e.g. they can be the exact points you outlined in your original job application packet). Few institutions will allow any type of presentation aids for chalk talks (and if powerpoint is allowed, you'll usually be limited to just a few slides). The focus of this talk should be on your research goals (both short-term and long-term) as well as your long-term research questions. Some tips for chalk talks:

Think quickly on your feet and show mastery of your field
Introduce your long-term research questions during your job talk
Being conservative here can help - people will see you as practical and thus able to get funding (and show preliminary data, if possible)
Gear your chalk talk towards faculty; talk about methods but don't be too technical: talk about what equipment and personnel you will need
People will interrupt you nonstop (in this sense, a chalk talk is similar to a PhD qualifying exam). Your responses will indicate how much (and how deeply) you have thought about the future.
People often fall apart because of a) nervousness and/or b) falling prey to the potential pitfalls of their subject matter
Most importantly: practice! Practice your chalk talk a lot before your interview, using diverse audiences (faculty, postdocs, etc.). Be critical as you strive to perfect this talk.

Job Talks are perhaps the most familiar, but we hit on some critical points that will ensure a successful presentation:

Keep your slides simple: use black text on a white background. Use 40pt font to make your slides readable in a large room.
Use sentences for slide headings - these are more memorable than topical phrases (studies have shown this is true)
Use clean, simple graphics. Watch some TED talks to get an idea of how to use good visuals.
Beware of humor in a job talk: it can backfire
During questions, if you need to buy time to think you can ask "Can you repeat/rephrase the question?"
Be sure to remain poised and composed if you're battered with questions that seem to come out of left field (composure is what people are looking for). Sometimes these difficult questions are a result of faculty performing out of ego, or for the sake of their peers.

Teaching seminars come in different formats, and we discussed two different scenarios that our speakers had experienced.

The first scenario is more common in teaching universities: a candidate was asked to teach an intro course that was completely different from their own disciplinary subject area. The candidate was given good guidelines and plenty of preparation time (in this case, they were given a course textbook and told to teach the first chapter). At the interview, the audience for this teaching demo consisted of the search committee and undergraduates.
The second scenario was much less structured, with only one instruction: no powerpoint allowed. The candidate was asked to teach a class as if it were an upper-level course, using only the blackboard. The demo only lasted 25 minutes, and the audience was the search committee. In this scenario, a good interview strategy would be to show expertise in a class not currently offered by the university (e.g. to show your potential fit in the Department). For the sake of the audience, it's also prudent to state your learning goals as well as a textbook, chapter, and figures that complement your teaching demo.

We also had an interesting discussion about "illegal questions" - the inevitable queries about your personal life that interviewers aren't supposed to ask (Are you married? Do you have kids? Do you want kids?). Someone suggested that you should be honest - after all, you don't want to insult your possible future colleagues. Another person suggested ways to deflect the question (and address the underlying concern that prompted the question), without answering or insulting: for example, saying something like "If this is about my productivity, I can assure you that I'm first and foremost passionate about my research..." It was interesting to hear different thoughts on this tricky subject.

Overall, our discussion provided a very eye-opening look into the interview process, and I learned a lot. We ended with some final tips and general guidance:

Keep your presentations backed up on a thumb drive during all interviews (just in case!)
At some institutions (UC Davis is one), admin staff will give their input during the faculty hiring process. So when you interview, keep in mind that every single person you encounter is your "audience" during a campus visit.
Keep a supply of water and energy snacks in your bag - interviews are exhausting, and you will need them.
Make a cheat sheet of people based on your interview schedule - note their recent publications, research interests, and other professional activities.
Never bluff answers. It's better to just say "I don't know". Or better yet, "I'd be happy to get back to you" - and then take their name and follow up later with the answer.
For phone or Skype interviews, it's a good idea to dress in interview clothes and book a conference room to make sure you feel professional.
Make a mental note of people throwing their weight around - if other faculty don't shut them down, it might be an indicator of departmental culture (or indicate people that might have power over your career).