Thursday, September 19, 2013

Microbial Phylogenies have the *least accessible* data in systematics

This month in PLoS Biology, "Lost Branches on the Tree of Life" gives us a pretty stark overview of data sharing and accessibility in the systematics community. The paper assessed just how many published phylogeny papers also deposited their corresponding sequence alignments, tree files, and program parameters. The grim news:
"...only 16.7%, 1,262 from a total of 7,539 publications surveyed, provided accessible alignments/trees (Figures 1 and 2). Our attempts to obtain datasets directly from authors were only 16% successful (61/375; see Table S4), and we estimate that approximately 70% of existing alignments/trees are no longer accessible. Thus, we conclude that most of the underlying sequence alignments and phylogenetic trees produced by the systematic community during the past several decades are essentially lost, accessible only as static figures in a published journal article with no capacity for subsequent manipulation. Furthermore, when data are deposited, they are often incomplete (e.g., what characters were excluded, accepted taxon names; see Text S1 and Figure S1). Our survey of publications that implemented BEAST revealed that only 11 out of 100 (11%) examined studies provided access to the underlying xml input file, which is critical for reproducing BEAST results. Although funding agencies often require all data to be accessible from funded publications, our results reveal this is more the exception than the rule."
What made me cringe even more is that my discipline (microbial systematics, including microbial eukaryotes, bacteria and archaea) is the worst offender when it comes to data sharing. The green line indicating full data deposition in the figure from Drew et al. (2013) is pretty much flatlining in some years for microbes!

I'll be the first to admit that my own data is part of the problem - when I was doing my PhD, no one ever had a conversation with me about data reproducibility and sharing. I made my best effort to publish the supplemental files I thought would be useful, but at that time I wasn't in the loop about scientific reproducibility and best practices for data archiving. For my nematode phylogeny paper in BMC Evolutionary Biology, I did upload the original ARB databases I used to construct and edit the rRNA structural alignments; but in hindsight, these files require knowledge of the ARB software itself (not an easy package to use), and I didn't even think to publish a FASTA alignment file or a Nexus tree file. This was partly because my phylogeny papers involved a multitude of topology tests, and I didn't think it was correct to pick just "one tree" to represent my spectrum of results.

I've been thinking about this issue a lot recently, and making an effort to correct my past mistakes. I'm now digging through old PhD files to find my alignments and tree files so I can contribute a nematode phylogeny to the Open Tree of Life project. I'll also post these data on Figshare so they will no longer be another "Lost Branch" on the Tree of Life.


Drew BT, Gazis R, Cabezas P, Swithers KS, Deng J, Rodriguez R, et al. (2013) Lost Branches on the Tree of Life. PLoS Biology, 11(9):e1001636.
