Eukaryotic Ebullience : Reflections on #BOSC2015

Last weekend marked my first time attending the Bioinformatics Open Source Conference (#BOSC2015), a two-day satellite conference taking place before the huge ISMB/ECCB meeting in Dublin, Ireland. I gave a keynote talk, and my slides are now posted here on Figshare.

I was excited to receive this invitation. BOSC isn't a conference that I would normally carve out time to attend - I'm an open source advocate and end-user of bioinformatics software, but I'm firmly trying to forge a career in the world of academic biological research. I enjoy (and benefit from) being involved in software development projects, but I'm also keenly aware that this type of activity is unfortunately still not considered to be "primary research" and thus could impact my job/promotion prospects.

Initially, I wasn't sure who would be my BOSC audience (researchers vs. bioinformaticians vs. developers), so I decided to generally talk about my personal experiences transitioning from a traditional biological discipline (marine biology/taxonomy) to more computational and interdisciplinary pursuits. And I finally admitted in public (!) that I saw a command line for the first time in my life in the year 2010.

The ideas I wanted to convey were inspired by this Donald Rumsfeld quote. I argue that scientists like me (biologists doing computational work) are continually overwhelmed by the rapid pace of change in sequencing technology and software. Trying to keep up with too many disciplines at once leads to confusion and paranoia. Often I think I'm perhaps only a little less confused than other people (which makes me an expert in the eyes of my collaborators).

The knowledge of interdisciplinary biologists is thus broken down into three categories:

Known Knowns

The things I know I know - my assessment of my own skills, such as:

I know I can write Perl/Python/shell scripts
I know I can install and run software - on my laptop, in the cloud, on a cluster
I know how to sign up for training courses
I know how to Google error messages
I know who to ask if I get stuck

Known Unknowns

The things I know I don't know. Skills/knowledge I could have, but do not currently possess.

Mostly this relates to new software, languages, workflows, and file formats. There's always so much jargon that biologists have never been exposed to - phrases like "Hadoop", "Drupal", "Docker" or "Ruby on Rails". Terms that you hear and ask - is that an App? Is that a programming language? Should I learn this? How have I not heard about this before? When will I have time to read up on this?

As much as I try to keep up with the world of bioinformatics and software development, I just can't do it. Too much information. But there's always the worry that one of these jargony terms is something that should really become a known known.

Unknown Unknowns

The things we (biologists) don't know we don't know. Speaking from personal experience, this is so frustrating for self-taught programmers (with no formal computer science classes or degree). There are huge gaps in my knowledge that never get filled in - because I never realized the gaps existed:

Core computer science concepts, such as a fundamental knowledge of hardware. When I first started talking to HPC system administrators, I realized I had no idea how computers worked. Or that I really needed to know this.

All the possible different file formats (text-based vs. binary) - especially more complex binary formats like HDF5, which have long been used in the earth/physical sciences but are only now gaining a foothold in microbial ecology and computational biology.

Understanding how software interacts with hardware - in particular, the concept of parallel computing (running bioinformatics tools using shared memory or distributed computing, a distinction that is opaque to many beginners).

For me, most of the above things are now "known knowns", but these concepts were ones that eluded me for years even as my computational skills increased.

So I decided that my goal at BOSC was to interact with the other participants and figure out, as a biologist:

Am I being reproducible enough?
Am I using the right tools?
Am I learning the right skills?
Am I being efficient in my quest to break my bad biologist habits (regarding data management, sharing, and reproducibility)?

After two days of talks and discussions, I'm pretty satisfied that I'm ahead of the curve and doing the best I can regarding the above questions.

I may have set off a slight Twitter storm by stating that I don't use (and don't really like) Galaxy - which developed into an interesting discussion on containers and workflow systems. Kai Blin has a great blog post up about this topic, "Thoughts on Overengineering Bioinformatics Analyses", and the title sums up my thoughts perfectly. From a biologist's perspective, workflow and container systems are just too impractical to implement when I'm trying to analyze data and push out a manuscript. Each manuscript I write is very different from the next - sure, some things are generalizable (like demultiplexing raw Illumina data), but the data processing and exact analyses need to be customized based on the hypotheses/questions, the type of data (metagenome, rRNA survey, draft genome), and the target organism (eukaryote, prokaryote, virus, etc.). For example, I work a lot with eukaryotic rRNA and metagenome data, where the reference databases for genes/genomes are sparse, and you really have to dig in with custom analyses and data parsing in order to explore the data properly.

The main reasons I don't use containers or workflow systems are:

I work on different university clusters (depending on the project), so installing containers/workflows across all these HPC systems would be a nightmare and massive time suck.

I'm not confident that workflow systems would enable me to customize my analyses in the way I'm used to - so to get a custom system up and running I'd have to specifically employ a developer...which I don't really have the motivation or funding for, especially when I'm fine running everything on the command line myself.

Philosophically - I don't think I should be teaching students/collaborators to use GUI workflow systems. It doesn't help them understand how the data analysis works, and I feel like it promotes a certain "black box" mentality towards bioinformatics. I'd much rather reduce fear of the command line, and get them comfortable exploring the input/output files as a way to troubleshoot and interpret their analyses.

We've gotten used to documenting our analyses using a combination of IPython notebooks and R Markdown scripts, which get published alongside the manuscript and data files (both raw and processed) - these go on ENA/NCBI, GitHub and/or Figshare. This seems to be a good system for collaborators with diverse skill sets (even taxonomists can understand plain text files or things they can open in a web browser), and also serves our open science aims.

The most interesting thing about attending BOSC was realizing how diverse the open source bioinformatics world has become. Everyone is excited about something different, and many frameworks are still very much in flux. Workflows and containers may be how biologists carry out reproducible analyses in the future, but right now it's a rabbit hole that I'd prefer to watch from afar.

Sunday, July 19, 2015