
Sunday, July 19, 2015

Reflections on #BOSC2015 - keynote and containers

Last weekend marked my first time attending the Bioinformatics Open Source Conference (#BOSC2015), a two-day satellite conference taking place before the huge ISMB/ECCB meeting in Dublin, Ireland. I gave a keynote talk, and my slides are now posted here on Figshare.

I was excited to receive this invitation. BOSC isn't a conference that I would normally carve out time to attend - I'm an open source advocate and end-user of bioinformatics software, but I'm firmly trying to forge a career in the world of academic biological research. I enjoy (and benefit from) being involved in software development projects, but I'm also keenly aware that this type of activity is unfortunately still not considered to be "primary research" and thus could impact my job/promotion prospects.

Initially, I wasn't sure who would be my BOSC audience (researchers vs. bioinformaticians vs. developers), so I decided to generally talk about my personal experiences transitioning from a traditional biological discipline (marine biology/taxonomy) to more computational and interdisciplinary pursuits. And I finally admitted in public (!) that I saw a command line for the first time in my life in the year 2010.

The ideas I wanted to convey were inspired by this Donald Rumsfeld quote. I argue that scientists like me (biologists doing computational work) are continually overwhelmed with the rapid pace of sequencing technology and software. Trying to keep up with too many disciplines at once leads to confusion and paranoia. Often I think I'm perhaps only a little less confused than other people (which makes me an expert in the eyes of my collaborators).

The knowledge of interdisciplinary biologists is thus broken down into three categories:

Known Knowns


The things I know I know - my assessment of my own skills, such as:
  • I know I can write perl/python/shell scripts
  • I know I can install and run software - on my laptop, in the cloud, on a cluster
  • I know how to sign up for training courses
  • I know how to Google error messages
  • I know who to ask if I get stuck 


Known Unknowns 


The things I know I don't know. Skills/knowledge I could have, but do not currently possess.

Mostly this relates to new software, languages, workflows, and file formats. There's always so much jargon that biologists have never been exposed to - phrases like "Hadoop", "Drupal", "Docker" or "Ruby on Rails". Terms that you hear and ask - is that an App? Is that a programming language? Should I learn this? How have I not heard about this before? When will I have time to read up on this? 

As much as I try to keep up with the world of bioinformatics and software development, I just can't do it. Too much information. But there's always the worry that one of these jargony terms is something that should really become a known known.


Unknown Unknowns 


The things we (biologists) don't know we don't know. Speaking from personal experience, this is so frustrating for self-taught programmers (with no formal computer science classes or degree). There are huge gaps in my knowledge that never get filled in - because I never realized the gaps existed:
  • Core computer science concepts, such as a fundamental knowledge of hardware. When I first started talking to HPC system administrators, I realized I had no idea how computers worked. Or that I really needed to know this.

  • All the possible different file formats (text-based vs. binary) - especially more complex binary formatted files like HDF5 which have been long used in the earth/physical sciences but are only now gaining a foothold in microbial ecology and computational biology

  • Understanding how software interacts with hardware - in particular, the concept of parallel computing (running bioinformatics tools using shared memory or distributed computing, a distinction which is obfuscated for many beginners)

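To illustrate that last distinction (my own toy example, not something from the talk): in the shared-memory model, all the workers run on one machine and see the same RAM, whereas distributed computing spreads jobs across separate cluster nodes with separate memory, communicating over a network (e.g. MPI jobs submitted to a scheduler). A minimal Python sketch of the shared-memory flavor:

```python
# Toy illustration of shared-memory parallelism: worker threads on one
# machine all read the same data in RAM - nothing is copied over a network.
# (In compiled bioinformatics tools this is typically done with OpenMP/pthreads;
# the sequences and function here are made up for illustration.)
from concurrent.futures import ThreadPoolExecutor

def gc_content(seq):
    """Fraction of G/C bases in a sequence - a toy per-read task."""
    return (seq.count("G") + seq.count("C")) / len(seq)

reads = ["GGCC", "ATAT", "GCAT"]  # made-up sequences

# Every worker thread sees the same 'reads' list in shared memory.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(gc_content, reads))

print(results)  # [1.0, 0.0, 0.5]
```

A distributed version of the same task would instead split `reads` into chunks, ship each chunk to a different node, and gather the results back - which is exactly the part that's invisible to beginners running a tool with a `--threads` flag.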
For me, most of the above things are now "known knowns", but these concepts were ones that eluded me for years even as my computational skills increased.

So I decided that my goal at BOSC was to interact with the other participants and figure out, as a biologist:

  • Am I being reproducible enough?
  • Am I using the right tools?
  • Am I learning the right skills?
  • Am I being efficient in my quest to break my bad biologist habits (regarding data management, sharing, and reproducibility)?

After two days of talks and discussions, I'm pretty satisfied that I'm ahead of the curve and doing the best I can regarding the above questions.

I may have set off a slight Twitter storm by stating that I don't use (and don't really like) Galaxy - which developed into an interesting discussion on containers and workflow systems. Kai Blin has a great blog post up about this topic, "Thoughts on Overengineering Bioinformatics Analyses", and the title sums up my thoughts perfectly. From a biologist's perspective, workflow and container systems are just too impractical to implement when I'm trying to analyze data and push out a manuscript. Each manuscript I write is very different from the next - sure, some things are generalizable (like demultiplexing raw Illumina data), but the data processing and exact analyses need to be customized based on the hypotheses/questions, the type of data (metagenome, rRNA survey, draft genome), and the target organisms (eukaryotes, prokaryotes, viruses, etc.). For example, I work a lot with eukaryotic rRNA and metagenome data, where reference databases for genes/genomes are sparse, and you have to really dig in with custom analyses and data parsing in order to explore it properly.
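The demultiplexing step is a good example of why some tasks generalize: at its core it's just binning reads by barcode. A minimal sketch (the barcodes, reads, and function names below are made up for illustration - a real pipeline would read FASTQ, handle quality scores, and tolerate barcode mismatches):

```python
# Hypothetical barcode-to-sample mapping, for illustration only.
BARCODES = {"ACGT": "sampleA", "TGCA": "sampleB"}

def demultiplex(records, barcode_len=4):
    """Assign each (header, seq) record to a sample by its leading
    inline barcode; reads with no matching barcode go to 'unassigned'."""
    bins = {}
    for header, seq in records:
        tag = seq[:barcode_len]
        sample = BARCODES.get(tag, "unassigned")
        # Strip the barcode before binning the read.
        bins.setdefault(sample, []).append((header, seq[barcode_len:]))
    return bins

# Made-up reads: the first four bases are the barcode.
reads = [(">r1", "ACGTTTTTGGGG"), (">r2", "TGCAAAAACCCC"), (">r3", "NNNNGATTACA")]
result = demultiplex(reads)
```

It's everything after this step - which analyses to run on each sample's bin - that has to be rebuilt per project, which is why a one-size-fits-all workflow doesn't buy me much.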

The main reasons I don't use containers are:
  • I work on different university clusters (depending on the project), so installing containers/workflows across all these HPC systems would be a nightmare and massive time suck.

  • I'm not confident that workflow systems would enable me to customize my analyses in the way I'm used to - so to get a custom system up and running I'd have to specifically employ a developer...which I don't really have the motivation or funding for, especially when I'm fine running everything on the command line myself.

  • Philosophically - I don't think I should be teaching students/collaborators to use GUI workflow systems. It doesn't help them understand how the data analysis works, and I feel like it promotes a certain "black box" mentality towards bioinformatics. I'd much rather reduce fear of the command line, and get them comfortable exploring the input/output files as a way to troubleshoot and interpret their analyses.

  • We've gotten used to documenting our analyses using a combination of IPython notebooks and R Markdown scripts which get published alongside the manuscript and data files (both raw and processed) - these go on ENA/NCBI, GitHub and/or Figshare. This seems to be a good system for collaborators with diverse skill sets (even taxonomists can understand plain text files or things they can open in a web browser), and it also serves our open science aims.

The most interesting thing about attending BOSC was realizing how diverse the open source bioinformatics world has become. Everyone is excited about something different, and many frameworks are still very much in flux. Workflows and containers may be how biologists carry out reproducible analyses in the future, but right now it's a rabbit hole that I'd prefer to watch from afar.

Saturday, April 5, 2014

If you only read ONE paper this year...

...make it this paper:

Osborne JM, Bernabeu MO, Bruna M, Calderhead B, Cooper J, et al. (2014) Ten Simple Rules for Effective Computational Research. PLoS Comput Biol, 10(3): e1003506. doi:10.1371/journal.pcbi.1003506

INCLUDING the supporting information text, which is very detailed and provides a fantastic trove of resources explaining how to get started.

As someone who has moved from lab-based biology towards computational research, I can tell you that I've learned many of these lessons the hard way. I wish someone had handed me this paper four years ago, when I started transitioning fields during my first postdoc. For biologists, this paper provides great advice on how to a) make computational tasks less painful, b) work with computer scientists and bioinformaticians, and c) do better science. If you don't think you need to learn GitHub, that you can get by without it...sure, you can, but you'll probably end up there anyway. And then you'll discover how much easier and more efficient your research will become.



Thursday, September 19, 2013

Microbial Phylogenies have the *least accessible* data in systematics

This month in PLoS Biology, "Lost Branches on the Tree of Life" gives us a pretty stark overview of data sharing and accessibility in the systematics community. The paper assessed just how many published phylogeny papers also deposited their corresponding sequence alignments, tree files, and program parameters. The grim news:
...only 16.7%, 1,262 from a total of 7,539 publications surveyed, provided accessible alignments/trees (Figures 1 and 2). Our attempts to obtain datasets directly from authors were only 16% successful (61/375; see Table S4), and we estimate that approximately 70% of existing alignments/trees are no longer accessible. Thus, we conclude that most of the underlying sequence alignments and phylogenetic trees produced by the systematic community during the past several decades are essentially lost, accessible only as static figures in a published journal article with no capacity for subsequent manipulation. Furthermore, when data are deposited, they are often incomplete (e.g., what characters were excluded, accepted taxon names; see Text S1 and Figure S1). Our survey of publications that implemented BEAST revealed that only 11 out of 100 (11%) examined studies provided access to the underlying xml input file, which is critical for reproducing BEAST results. Although funding agencies often require all data to be accessible from funded publications, our results reveal this is more the exception than the rule.
What made me cringe even more is that my discipline (microbial systematics - including microbial eukaryotes, bacteria and archaea) is the worst offender when it comes to data sharing. The green line indicating full data deposition is pretty much flatlining in some years for microbes! (Drew et al. 2013)


I'll be the first to admit that my own data is part of the problem - when I was doing my PhD, no one ever had a conversation with me about data reproducibility and sharing. I made my best effort to publish the supplemental files I thought would be useful, but at that time I wasn't in the loop about scientific reproducibility and best practices for data archiving. For my nematode phylogeny paper in BMC Evolutionary Biology, I did upload the original ARB databases I used to construct and edit the rRNA structural alignments; but in hindsight, this file requires knowledge of the ARB software itself (not an easy package to use), and I didn't even think to publish a FASTA alignment file or a Nexus tree file. Partly this was because my phylogeny papers involved a multitude of topology tests, and I didn't think it was correct to pick just "one tree" to represent my spectrum of results.
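In hindsight, those two formats are the easy part - both are plain text files that anyone can open. A minimal sketch of what depositing them amounts to (the taxa, sequences, and Newick tree below are made up; the real files would hold the published rRNA alignment and trees):

```python
# Hypothetical two-taxon alignment and tree, for illustration only.
alignment = {"Taxon_A": "ACGT-ACGT", "Taxon_B": "ACGTTACGT"}
newick = "(Taxon_A:0.1,Taxon_B:0.2);"

# FASTA alignment: one '>' header line per taxon, aligned sequence below
# (gaps as '-'), so every sequence has the same length.
fasta = "".join(f">{name}\n{seq}\n" for name, seq in alignment.items())

# Minimal NEXUS tree file: the '#NEXUS' magic line plus a TREES block
# wrapping the Newick string.
nexus = "#NEXUS\nbegin trees;\n    tree tree1 = %s\nend;\n" % newick

with open("alignment.fasta", "w") as out:
    out.write(fasta)
with open("tree.nex", "w") as out:
    out.write(nexus)
```

Even for a paper with many topology-test trees, a NEXUS TREES block can simply hold one `tree` line per result - which would have been a better record than none at all.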

I've been thinking about this issue a lot recently, and taking strides to correct my past mistakes. I'm now digging through old PhD files to find my alignments and tree files to contribute a nematode phylogeny for the Open Tree of Life project. I'll also post these data on Figshare so my data will no longer be another "Lost Branch" on the Tree of Life.

Reference:

Drew BT, Gazis R, Cabezas P, Swithers KS, Deng J, Rodriguez R, et al. (2013) Lost Branches on the Tree of Life. PLoS Biology, 11(9): e1001636. doi:10.1371/journal.pbio.1001636