Eukaryotic Ebullience

Reflections on #BOSC2015 - keynote and containers

2015-07-19T05:44:00.001-07:00

Last weekend marked my first time attending the Bioinformatics Open Source Conference (#BOSC2015), a two-day satellite conference taking place before the huge ISMB/ECCB meeting in Dublin, Ireland. I gave a keynote talk, and my slides are now posted here on Figshare.

I was excited to receive this invitation. BOSC isn't a conference that I would normally carve out time to attend - I'm an open source advocate and end-user of bioinformatics software, but I'm firmly trying to forge a career in the world of academic biological research. I enjoy (and benefit from) being involved in software development projects, but I'm also keenly aware that this type of activity is unfortunately still not considered to be "primary research" and thus could impact my job/promotion prospects.

Initially, I wasn't sure who would be my BOSC audience (researchers vs. bioinformaticians vs. developers), so I decided to generally talk about my personal experiences transitioning from a traditional biological discipline (marine biology/taxonomy) to more computational and interdisciplinary pursuits. And I finally admitted in public (!) that I saw a command line for the first time in my life in the year 2010.

The ideas I wanted to convey were inspired by this Donald Rumsfeld quote. I argue that scientists like me (biologists doing computational work) are continually overwhelmed with the rapid pace of sequencing technology and software. Trying to keep up with too many disciplines at once leads to confusion and paranoia. Often I think I'm perhaps only a little less confused than other people (which makes me an expert in the eyes of my collaborators).

The knowledge of interdisciplinary biologists is thus broken down into three categories:

Known Knowns

The things I know I know - my assessment of my own skills, such as:

I know I can write perl/python/shell scripts
I know I can install and run software - on my laptop, in the cloud, on a cluster
I know how to sign up for training courses
I know how to Google error messages
I know who to ask if I get stuck

Known Unknowns

The things I know I don't know. Skills/knowledge I could have, but do not currently possess.

Mostly this relates to new software, languages, workflows, and file formats. There's always so much jargon that biologists have never been exposed to - phrases like "Hadoop", "Drupal", "Docker" or "Ruby on Rails". Terms that you hear and ask - is that an App? Is that a programming language? Should I learn this? How have I not heard about this before? When will I have time to read up on this?

As much as I try to keep up with the world of bioinformatics and software development, I just can't do it. Too much information. But there's always the worry that one of these jargony-terms is something that should really become a known known.

Unknown Unknowns

The things we (biologists) don't know we don't know. Speaking from personal experience, this is so frustrating for self-taught programmers (with no formal computer science classes or degree). There are huge gaps in my knowledge that never get filled in - because I never realized the gaps existed:

Core computer science concepts, such as a fundamental knowledge of hardware. When I first started talking to HPC system administrators, I realized I had no idea how computers worked. Or that I really needed to know this.

All the possible different file formats (text-based vs. binary) - especially more complex binary formatted files like HDF5 which have been long used in the earth/physical sciences but are only now gaining a foothold in microbial ecology and computational biology

Understanding how software interacts with hardware - in particular, the concept of parallel computing (running bioinformatics tools using shared memory or distributed computing, a distinction which is obfuscated for many beginners)

For me, most of the above things are now "known knowns", but these concepts were ones that eluded me for years even as my computational skills increased.

So I decided that my goal at BOSC was to interact with the other participants and figure out, as a biologist:

Am I being reproducible enough?
Am I using the right tools?
Am I learning the right skills?
Am I being efficient in my quest to break my bad biologist habits (regarding data management, sharing, and reproducibility)?

After two days of talks and discussions, I'm pretty satisfied that I'm ahead of the curve and doing the best I can regarding the above questions.

I may have set off a slight Twitter storm by stating that I don't use (and don't really like) Galaxy - which developed into an interesting discussion on containers and workflow systems. Kai Blin has a great blog post up about this topic, "Thoughts on Overengineering Bioinformatics Analyses", and the title sums up my thoughts perfectly. From a biologist's perspective, workflow and container systems are just too impractical to implement when I'm trying to analyze data and push out a manuscript. Each manuscript I write is very different from the next - sure, some things are generalizable (like demultiplexing raw Illumina data), but the data processing and exact analyses need to be customized based on the hypotheses/questions, the type of data (metagenome, rRNA survey, draft genome), and the target organism (eukaryote, prokaryote, viruses, etc.). For example, I work a lot with eukaryotic rRNA and metagenome data, where you have sparse reference databases for genes/genomes, and you have to really dig in with custom analyses and data parsing in order to explore it properly.

The main reasons I don't use containers are:

I work on different university clusters (depending on the project), so installing containers/workflows across all these HPC systems would be a nightmare and massive time suck.

I'm not confident that workflow systems would enable me to customize my analyses in the way I'm used to - so to get a custom system up and running I'd have to specifically employ a developer...which I don't really have the motivation or funding for, especially when I'm fine running everything on the command line myself.

Philosophically - I don't think I should be teaching students/collaborators to use GUI workflow systems. It doesn't help them understand how the data analysis works, and I feel like it promotes a certain "black box" mentality towards bioinformatics. I'd much rather reduce fear of the command line, and get them comfortable exploring the input/output files as a way to troubleshoot and interpret their analyses.

We've gotten used to documenting our analyses using a combination of IPython notebooks and R Markdown scripts which get published alongside the manuscript and data files (both raw and processed) - these go on ENA/NCBI, GitHub and/or Figshare. This seems to be a good system for collaborators with diverse skill sets (even taxonomists can understand plain text files or things they can open in a web browser), and also serves our open science aims.

The most interesting thing about attending BOSC is realizing how diverse the open source bioinformatics world has become. Everyone is excited about something different, and many frameworks are still very much in flux. Workflows and containers may be how biologists carry out reproducible analyses in the future, but right now it's a rabbit hole that I'd prefer to watch from afar.

If you only read ONE paper this year...

2014-04-05T11:15:00.000-07:00

...make it this paper:

Osborne JM, Bernabeu MO, Bruna M, Calderhead B, Cooper J, et al. (2014) Ten Simple Rules for Effective Computational Research. PLoS Comput Biol, 10(3): e1003506. doi:10.1371/journal.pcbi.1003506

INCLUDING the supporting information text, which is very detailed and provides a fantastic trove of resources explaining how to get started.

As someone who has moved from lab-based biology towards computational research, I can tell you that I've learned many of these lessons the hard way. I wish someone had handed me this paper four years ago when I started transitioning fields during my first postdoc. For biologists, this paper provides great advice for how to a) make computational tasks less painful, b) work with computer scientists and bioinformaticians, and c) do better science. If you don't think you need to learn GitHub, that you can get by without it...sure, you can, but you'll probably end up there anyway. And then you'll discover how much easier and more efficient your research will become.

Lateral Gene Transfer detected in Eukaryotic rRNA genes

2014-03-27T08:00:00.000-07:00

This paper is an example of super cool science that also makes me worry. Eukaryotes are known to have lower levels of Lateral Gene Transfer (LGT), and before this paper I assumed that LGT would not impact eukaryotic rRNA genes. However, this is not so according to Yabuki et al. (2014):

Here, we report the first case of lateral transfer of eukaryotic rRNA genes. Two distinct sequences of the 18S rRNA gene were detected from a clonal culture of the stramenopile, Ciliophrys infusionum. One was clearly derived from Ciliophrys, but the other gene originated from a perkinsid alveolate. Genome-walking analyses revealed that this alveolate-type rRNA gene is immediately adjacent to two protein-coding genes (ubc12 and usp39), and the origin of both genes was shown to be a stramenopile (that is, Ciliophrys) in our phylogenetic analyses. These findings indicate that the alveolate-type rRNA gene is encoded on the Ciliophrys genome and that eukaryotic rRNA genes can be transferred laterally.

Why is this paper worrisome? Well, if LGT of rRNA genes is a widespread phenomenon in microbial eukaryotes, it will confound biodiversity estimates obtained from environmental sequencing studies. If you had an environmental rRNA Illumina dataset, your bioinformatic analysis would show taxonomic assignments for both an alveolate and a stramenopile (detecting two taxa from one genome: one true assignment, one false). The authors cite this concern in their conclusion:

These large-scale [environmental] surveys may detect transferred rRNA genes and such transferred rRNA genes may confuse our understanding of the true diversity and distribution of microbial eukaryotes, even if the frequency of lateral transfers of the rRNA gene is rare and the copy numbers of the transferred rRNA gene in environments are low. We agree that environmental rRNA gene surveys with PCR are still useful and effective to estimate the diversity/distribution of microbial eukaryotes. However, the fact that recovered rRNA gene sequences do not always reflect the actual existence of microbial eukaryotes corresponding to these sequences should be kept in mind based on our findings.

In other words, more research is needed to determine exactly how widespread this rRNA LGT phenomenon is in eukaryotes...it may be something else we need to take into account when designing software workflows for environmental sequence data.
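To make the concern concrete, here is a toy sketch in Python of how a single transferred rRNA gene could inflate a taxon count. The read records, the field names and the `count_taxa` helper are all made up for illustration; this is not data or code from Yabuki et al., just the scenario described above written out.

```python
# Toy illustration: each rRNA read is labelled with the organism it truly came
# from and the taxon a reference-based classifier would assign to it.
# All values are hypothetical.
reads = [
    {"source_organism": "Ciliophrys", "assigned_taxon": "stramenopile"},
    {"source_organism": "Ciliophrys", "assigned_taxon": "stramenopile"},
    # A laterally transferred rRNA copy in the same genome is classified
    # by its sequence, so it looks like a different organism.
    {"source_organism": "Ciliophrys", "assigned_taxon": "alveolate"},
]


def count_taxa(reads):
    """Return (taxa inferred from rRNA assignments, organisms actually present)."""
    inferred = {r["assigned_taxon"] for r in reads}
    actual = {r["source_organism"] for r in reads}
    return len(inferred), len(actual)


inferred, actual = count_taxa(reads)
print(f"taxa inferred from rRNA: {inferred}; organisms actually present: {actual}")
```

In this made-up case the survey reports two taxa even though only one organism is present, which is exactly the kind of false assignment the authors warn about.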

Reference:

Yabuki, A., Toyofuku, T., & Takishita, K. (2014). Lateral transfer of eukaryotic ribosomal RNA genes: an emerging concern for molecular ecology of microbial eukaryotes. The ISME Journal, 1–4. doi:10.1038/ismej.2013.252

Exploring Statistics for Metagenomic Datasets

2014-03-26T08:00:00.000-07:00

Recent lab discussions have made me think a lot about statistical tests we can use to detect and verify differences between metagenomic datasets. Since I don't have a strong background in statistics, my knowledge of this topic is still evolving - the scale and distribution of genomic datasets can be a tricky issue to deal with in a lot of statistical tests, it seems.

Some of the most useful resources I've found so far are as follows (feel free to comment and recommend more resources):

Parks, D. H., & Beiko, R. G. (2010). Identifying biologically relevant differences between metagenomic communities. Bioinformatics, 26(6), 715–721. doi:10.1093/bioinformatics/btq041 (good rundown of the different statistical techniques applied to genomic data, including their implementation in the STAMP pipeline)

Primmer, C. R., Papakostas, S., Leder, E. H., Davis, M. J., & Ragan, M. A. (2013). Annotated genes and nonannotated genomes: cross-species use of Gene Ontology in ecology and evolution research. Molecular Ecology, 22(12), 3216–3241. doi:10.1111/mec.12309 (especially Box 3 - Gene Ontology enrichment tests)

Metagenome Ordination in IMG - provides a good comparison of PCA vs. PCoA vs. NMDS, particularly in regard to how each of these statistics is calculated (and how they differ from one another).

Dinsdale, E. A., Edwards, R. A., Bailey, B. A., Tuba, I., Akhter, S., McNair, K., et al. (2013). Multivariate analysis of functional metagenomes. Frontiers in Genetics, 4, 41. doi:10.3389/fgene.2013.00041 (added to list 5/10/14 - comprehensive and thought-provoking overview of metagenomic data analysis)
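As a rough illustration of how one of the ordination methods mentioned above is calculated, here is a minimal sketch of PCoA (classical multidimensional scaling) in Python with NumPy. It assumes a Bray-Curtis dissimilarity as the input distance; the abundance table is made up, and the function names `bray_curtis` and `pcoa` are my own rather than taken from STAMP, IMG or any of the papers listed. PCoA can start from any distance matrix like this one, whereas PCA works directly on the variables in the data table.

```python
import numpy as np


def bray_curtis(counts):
    """Pairwise Bray-Curtis dissimilarity between rows (samples) of a count table."""
    counts = np.asarray(counts, dtype=float)
    n = counts.shape[0]
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            num = np.abs(counts[i] - counts[j]).sum()
            den = (counts[i] + counts[j]).sum()
            dist[i, j] = dist[j, i] = num / den
    return dist


def pcoa(dist, n_axes=2):
    """Classical PCoA: double-center the squared distances, then eigendecompose."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_axes]   # keep the largest axes
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Negative eigenvalues can occur for non-Euclidean distances; clip to zero.
    coords = eigvecs * np.sqrt(np.clip(eigvals, 0, None))
    return coords, eigvals


# Made-up abundance table: 4 samples (rows) x 3 functional categories (columns).
table = [[10, 0, 5], [8, 1, 6], [0, 12, 3], [1, 10, 4]]
coords, eigvals = pcoa(bray_curtis(table))
print(coords)
```

Each row of `coords` is a sample's position on the first two ordination axes, so samples with similar profiles (here the first two rows of the toy table) should land close together.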

Meeting Announcement: Evolutionary Biology of Caenorhabditis & Other Nematodes

2014-02-10T08:00:00.000-08:00

Since I'm on the organizing committee for this upcoming meeting, it's time to start advertising! Abstract submissions are now open (click here for meeting website):

Evolutionary Biology of Caenorhabditis and other Nematodes

June 14-17, 2014, Hinxton, UK

This conference will bring together scientists studying evolutionary processes in diverse nematode groups. In addition to attracting many researchers studying evolution in Caenorhabditis elegans as a model organism (and its closer relatives such as C. briggsae and C. remanei), the meeting will also welcome scientists investigating other free-living groups and the numerous animal- and plant-parasitic nematode species that threaten human health and the global economy. There will be a strong emphasis on genomic approaches and perspectives. The topics highlighted will include experimental evolution, fundamental evolutionary forces, genotype-phenotype relationships, metagenomic analyses, and processes of parasitism. The programme plays a critical role in promoting interaction and collaboration between evolutionary scientists trained in the C. elegans tradition and those focused on other nematode groups.

A limited number of registration bursaries are available for PhD students and junior post-docs to attend this conference (up to 50% of the registration fee).

Abstract and bursary deadline: May 2, 2014
Registration deadline: May 16, 2014

NRC survey: Research Priorities for Marine Science

2014-01-06T07:00:00.000-08:00

I received an e-mail from the INDEEP mailing list, asking me to participate in a Virtual Town Hall on marine science research priorities, currently being run by the NRC. Here's the rundown from their website:

The National Research Council, at the request of the National Science Foundation, is seeking guidance from the ocean sciences community on the prioritization of research and facilities for the coming decade. The Decadal Survey of Ocean Sciences (DSOS) committee has been assembled for this task. To fulfill its charge, the DSOS committee is asking for community input via this Virtual Town Hall. To submit your input, please fill out the following identifying information, since anonymous comments will not be collected or posted. The deadline to submit your comments is March 15, 2014.

I figured I'd post my survey answers here (it would be great to generate some discussion about how we can promote greater emphasis on genomic tools and high-throughput sequencing in marine ecosystems - in particular the deep sea):

Across all ocean science disciplines, please list 3 important scientific questions that you believe will drive ocean research over the decade.

1) What is the role of microbial processes in ecosystem function?

2) How do microbes respond to (and impact) climate change?

3) How do we integrate knowledge from different fields (e.g. physical oceanography, biogeochemistry, taxonomy, marine biology) to gain a more comprehensive view of the marine environment?

Within your own discipline, please list 3 important scientific questions that you believe will drive ocean research over the next decade.

1) Characterizing phylogeographic patterns in microbial eukaryotes using genomic data. What is the proportion of cosmopolitan vs. regionally restricted species in different marine habitats?

2) Linking genomic data (DNA, RNA, genome sequences) to the existing body of morphological, ecological and taxonomic data. Particularly important for microbial species where each of these data types exists in discipline-specific silos. How can such linked data further our understanding of marine ecosystems?

3) How do we build accurate models (e.g. using robust algorithms and existing data as training sets) to predict species distributions and the potential impacts of climate change?

Please list 3 ideas for programs, technology, infrastructure, or facilities that you believe will play a major role in addressing the above questions over the next decade. Please consider both existing and new technology/facilities/infrastructure/programs that could be deployed in this timeframe. What mechanisms might be identified to best leverage these investments (interagency collaborations, international partnerships, etc.)?

1) In order to address ecosystem-scale questions, and use cutting-edge methods to do so, the marine science community (particularly ecologists and taxonomists) needs to forge links with researchers in genomics and computational biology. DNA sequencing is largely underutilized in marine environments (notably lacking in the deep-sea), yet it offers a deep, cost-effective view of species, populations, and communities. However, computational expertise is needed to effectively apply genomic tools to marine systems, and that expertise must come from researchers who are knowledgeable about current software and algorithms (workflows optimized for "big data").

2) Funding initiatives or programs emphasizing microbial eukaryotes are needed to complement the (currently much greater) emphasis on bacteria/archaea, macrofauna and megafauna. Meiofauna and protists underpin many key ecosystem processes (e.g. nutrient cycling), but their role in marine habitats is perpetually understudied. We lack even a basic understanding of global biodiversity and species distributions for the majority of microbial metazoan phyla.

3) Marine sampling protocols MUST adopt forward-looking approaches. Ship time is expensive, and samples from habitats such as the deep-sea are precious and difficult to obtain (particularly for researchers in the genomics or computational biology communities, who may not have the professional connections needed to obtain biological samples). Many sample preservation methods do not consider the potential long-term use of a sample; for example, using formalin to preserve sediment immediately destroys the possibility of using that sample for DNA sequencing. There are many alternate sample preservation methods that preserve both DNA and morphological features (e.g. DESS is effective for sampling microbial metazoa). Giving deeper thought to sample collection, and prioritizing DNA preservation from diverse marine environments, is CRITICAL for furthering our understanding of marine biodiversity, biogeography, and ecology.

To give your own input, fill out the survey at this link: http://nas-sites.org/dsos2015/

PFTF Discussion: Job talks, chalk talks, and teaching demonstrations

2013-12-29T07:58:00.000-08:00

Last month we finished off winter quarter with another talk about the academic job search, where UC Davis faculty Siobhan Brady (Plant Sciences) and Sarah Perrault (University Writing Program) gave us the rundown on job talks, chalk talks, and teaching demonstrations.

Firstly, we discussed how these three presentations differ:

Job talk (research talk, about 50 minutes long) - what you did in the past
Chalk talk (research plans, can be 20-90 minutes long) - what you're going to do in the future
Teaching demonstration (can be 25-60 minutes) - mock classroom or course instruction

Then we broached the finer details:

Chalk talks usually put forth the aims listed in your research statement (e.g. they can be the exact points you outlined in your original job application packet). Few institutions will allow any type of presentation aids for chalk talks (and if powerpoint is allowed, you'll usually be limited to just a few slides). The focus of this talk should be on your research goals (both short-term and long-term) as well as your long-term research questions. Some tips for chalk talks:

Speak quickly on your feet and show mastery of your field
Introduce your long-term research questions during your job talk
Being conservative here can help - people will see you as practical and thus able to get funding (and show preliminary data, if possible)
Gear your chalk talk towards faculty; talk about methods but don't be too technical: talk about what equipment and personnel you will need
People will interrupt you nonstop (in this sense, a chalk talk is similar to a PhD qualifying exam). Your responses will indicate how much (and how deeply) you have thought about the future.
People often fall apart because of a) nervousness and/or b) falling prey to the potential pitfalls of their subject matter
Most importantly: practice! Practice your chalk talk a lot before your interview, using diverse audiences (faculty, postdocs, etc.). Be critical as you strive to perfect this talk.

Job Talks are perhaps the most familiar, but we hit on some critical points that will ensure a successful presentation:

Keep your slides simple: use black text on a white background. Use 40pt font to make your slides readable in a large room
Use sentences for slide headings - these are more memorable than topical phrases (studies have shown this is true)
Use clean, simple graphics. Watch some TED talks to get an idea of how to use good visuals.
Beware of humor in a job talk: it can backfire
During questions, if you need to buy time to think you can ask "Can you repeat/rephrase the question?"
Be sure to remain poised and composed if you're battered with questions that seem to come out of left field (composure is what people are looking for). Sometimes these difficult questions are a result of faculty performing out of ego, or for the sake of their peers.

Teaching seminars come in different formats, and we discussed two different scenarios that our speakers had experienced.

The first scenario is more common in teaching universities: a candidate was asked to teach an intro course that was completely different from their own disciplinary subject area. The candidate was given good guidelines and plenty of preparation time (in this case, they were given a course textbook and told to teach the first chapter). At the interview, the audience for this teaching demo consisted of the search committee and undergraduates.
The second scenario was much less structured, with only one instruction: no powerpoint allowed. The candidate was asked to teach a class as if it were an upper-level course, using only the blackboard. The demo only lasted 25 minutes, and the audience was the search committee. In this scenario, a good interview strategy would be to show expertise in a class not currently offered by the university (e.g. to show your potential fit in the Department). For the sake of the audience, it's also prudent to state your learning goals as well as a textbook, chapter, and figures that complement your teaching demo.

We also had an interesting discussion about "illegal questions" - the inevitable queries about your personal life that interviewers aren't supposed to ask (Are you married? Do you have kids? Do you want kids?). Someone suggested that you should be honest - after all, you don't want to insult your possible future colleagues. Another person suggested ways to deflect the question (and address the underlying concern that prompted the question), without answering or insulting: for example, saying something like "If this is about my productivity, I can assure you that I'm first and foremost passionate about my research..." It was interesting to hear different thoughts on this tricky subject.

Overall, our discussion provided a very eye-opening look into the interview process, and I learned a lot. We ended with some final tips and general guidance:

Keep your presentations backed up on a thumb drive during all interviews (just in case!)
At some institutions (UC Davis is one), admin staff will give their input during the faculty hiring process. So when you interview, keep in mind that every single person you encounter is your "audience" during a campus visit.
Keep a supply of water and energy snacks in your bag - interviews are exhausting, and you will need them.
Make a cheat sheet of people based on your interview schedule - note their recent publications, research interests, and other professional activities.
Never bluff answers. It's better to just say "I don't know". Or better yet, "I'd be happy to get back to you" - and then take their name and follow up later with the answer.
For phone or Skype interviews, it's a good idea to dress in interview clothes and book a conference room to make sure you feel professional.
Make a mental note of people throwing their weight around - if other faculty don't shut them down, it might be an indicator of departmental culture (or indicate people that might have power over your career).

PFTF discussion: The Academic Interview Process

2013-10-25T07:00:00.000-07:00

The topic at last week's Professors for the Future meeting was the academic interview process! UCD faculty members Julia Simon and Warren Pickett led a great discussion, answering all of our eager questions (and there were indeed many questions). Here are my notes from the meeting, representing a mix of the speakers' slides and other discussion points I jotted down:

Presenting Yourself

Your apparel:

Don't stand out because of what you wear - the search committee is not looking for a mannequin
You should be comfortable in what you wear, and other people should not be made to feel uncomfortable because of what you wear.
The best scenario: your host should not be able to remember what you wore.
It's about what you know and do, not about what you wear.
Practices might be different in law school, business, vet school, etc. - Know your discipline and its general practices.

Note: I disagree with some of the above advice. Dressing yourself for an interview is both a skill and an art (and should be highly personalized). A memorable piece of clothing (shoes, scarf, printed skirt) can be a conversation point, so don't feel that you have to dress boringly to be taken seriously. This topic deserves its own post, but I recommend you check out this article at Inside Higher Ed.

Don't be someone else - be yourself, but be on your "best behavior"

Be positive, and express enthusiasm about the future and the institution
Be prepared to ask questions - to learn about the new environment
Know your viewpoints on the questions of the day (in your discipline)
Keep a lid on your politics - it will not help, and it might hurt. Don't play stupid, but just put on a tolerant front.

Structure of the Interview Schedule

Typically, interviews are a two-day campus visit where you meet with the department chair, the dean, and many other people (e.g. half-hour meetings with prospective colleagues).
Do some homework (IMPORTANT!):

Learn about your schedule and presentations (even though you might only get it 2 days beforehand)
Know something about the department and institution
Know something about several faculty members
Ask people about their interests, as well as talking about yours.

Generally be an interesting person to converse with
Often the search committee is invisible - but they will be paying much more attention than the typical faculty member.
The most important people you'll meet during your interview are:

The faculty in your area of research
The rest of the faculty
The chair (whose job it is to work with you)
The dean (who improves the faculty, while balancing the budget). Although the faculty vote on who to hire, the dean has to sign off on this decision: so be sure not to rub them the wrong way!

Be prepared! Understand the needs of the group (e.g. the department) and the reason for the search. Is the search specific, to fill a particular need? Or is it more general: to shore up the group and its teaching needs, to find the "best person at this time?"

This aspect becomes more important at higher levels of hiring - what is your "vision" (for the group, department, division, college)? What are the existing strengths and imminent challenges?

Things that you can (should) ask the dean/chair:

What is the tenure review process? What is the typical teaching load (and is there a policy for new faculty to have a one-semester reprieve while they adjust to the new job)?
What is the sabbatical policy? What is the family/childcare policy? - but perhaps consider the kind of institution you're interviewing at before asking these two questions

Research Presentations

You will be expected to give one or two talks (and be sure to understand the purpose and audience):

one for a "general" audience, colloquium style for all in the department/group
one or more related to your research specialty - still, do not be overly specific in these talks. Introduce the field, discuss various viewpoints, and finally get to the point(s) you want to convey to the audience. But avoid too many details; let them read your papers.

Your presentation gives the exceedingly important impression of how well you can get your ideas across - essential for both education and for research, and more generally in interactions with colleagues
Caveat: being a good speaker is important; but you don't have to be the best
Most important: know your audience and keep them interested

Your Self-Presentation

Important: know what your plans and hopes are, and be able to articulate them well and field questions

Have a well thought-out research plan
What is the intellectual "hook" that makes you so excited?
Where will the funding support come from? Do you have experience?
What broader requirements do you have? Collaborators, travel, seasonal restrictions?
To what degree are you already connected in the field?

Education: readiness and aspirations

What style of teaching is effective in your field and why?
What new approaches do you find intriguing, exciting?
You can suggest what courses you'd like to teach, but be careful in case a course is someone's "pet".
Good to ask "what are your undergraduates like?" - this is something you'll want to know, but it also shows that you did your homework.

Examples of what NOT to do or say:

"I'm enthusiastic about research now. I may go into administration when I get tired of research" (said to a dean)
Don't paint yourself into a corner by demanding/requesting a larger start-up package than is reasonable. Find out what is customary, perhaps get some input on limitations.
Don't give the impression you need/deserve special considerations (the process needs to be fair to all)
Don't be derogatory about planned social events - complaining about restaurants, food, etc.

Other things to consider

What usually seals the deal:

Research Area Fit (you can't control this - this is a search committee decision)

Whether or not you connect with everyone and can have intellectual discussions during your interview (you can practice and prepare for this)

If you have a phone interview, try to use Skype if at all possible (and remember to look at the camera, and be sure the background is appropriate). It's generally hard to be engaging and sell yourself over a speakerphone.

Be concise in answering questions - aim for a 3-minute answer, then watch for body language and engagement from the person. Your eye contact and body language are also very important - don't look too rigid or too relaxed (no lazing on a sofa).

You'll have to be adaptable for questions about "service" - you can't be sure what they'll ask, and you will have to think on your feet.

Be prepared to defend yourself - search committees may want to challenge candidates. Politely, calmly stand your ground and give good arguments.

The Bottom Line

The search committee and faculty are a group of individuals, with different opinions and means of evaluation. What works for some may not work for others. Search committee members negotiate with one another and present a recommendation to the faculty, who then vote.

The above guidelines may be helpful. But the bottom line is, there is no bottom line.

Intra-Genomic Variation in the Ribosomal Repeats of Nematodes

2013-10-21T06:00:00.000-07:00

Happy to announce our new paper, published last week in PLoS ONE:

Bik HM, Fournier D, Sung W, Bergeron RD, Thomas WK (2013) Intra-Genomic Variation in the Ribosomal Repeats of Nematodes. PLoS ONE 8(10): e78230. doi:10.1371/journal.pone.0078230

This manuscript was in the works for a while, and was based on research carried out by co-author Dave Fournier while he was an undergraduate at UNH. The rationale? To assess the level of variation in rRNA loci within a single nematode genome, as well as between genomes of different nematode species. rRNA is typically present as a repeated, multi-copy locus in eukaryote genomes, which makes it hard (impossible) to correlate gene abundance to organismal abundance in environmental sequencing studies. Unlike for bacteria, there is no known correction that we can apply to "normalize" DNA for species with multiple rRNA copies - every species has multiple copies (sometimes into the thousands!) and we know little about the typical ranges of rRNA copy number across different eukaryote groups.
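To make the copy number problem concrete, here is a minimal sketch of how unequal rRNA copy numbers can distort abundance estimates. Everything in it is an assumption for illustration: the species names, the numbers of individuals, and the copy numbers are invented (they are not values from the paper), and it pretends that every gene copy is sequenced with equal probability, ignoring PCR bias and differences in cell number between organisms.

```python
# Hypothetical community: equal numbers of individuals per species,
# but very different rRNA copy numbers per genome (invented values).
community = {
    "species_A": {"individuals": 100, "rrna_copies": 50},
    "species_B": {"individuals": 100, "rrna_copies": 500},
    "species_C": {"individuals": 100, "rrna_copies": 2000},
}


def relative_abundance(counts):
    """Convert raw counts into proportions that sum to 1."""
    total = sum(counts.values())
    return {name: value / total for name, value in counts.items()}


# What we would like to know: the share of individuals per species.
organisms = relative_abundance(
    {name: sp["individuals"] for name, sp in community.items()}
)

# What rRNA sequencing would see under these assumptions: the share of gene copies.
gene_copies = relative_abundance(
    {name: sp["individuals"] * sp["rrna_copies"] for name, sp in community.items()}
)

for name in community:
    print(
        f"{name}: {organisms[name]:.1%} of individuals, "
        f"{gene_copies[name]:.1%} of rRNA copies"
    )
```

In this toy community the three species are equally abundant, yet the species with the most rRNA copies dominates the gene pool. Without knowing the copy numbers, there is no way to work backwards from the second set of proportions to the first.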

In this manuscript we were asking questions about both rRNA copy number (how many rRNA repeats are present in a genome?) and intragenomic variation (how many of these copies are unique rRNA gene sequences within a genome, and across rRNA variants are there "hotspots" for base polymorphisms?). We wanted to determine if we could spot patterns that govern rRNA copy number and level of intragenomic variation amongst gene copies - taking into account things like genome size and phylogenetic distance.

The result? There doesn't seem to be any pattern determining copy number or intragenomic rRNA variants across species, which kind of makes biodiversity estimates from environmental rRNA studies feel like a shot in the dark. But we DID find some interesting evidence of selection acting on rRNA loci:

By applying the same approach to four C. elegans mutation accumulation lines propagated by repeated bottlenecking for an average of ~400 generations, we find on average a 2-fold increase in repeat copy number (rate of increase in rRNA estimated at 0.0285-0.3414 copies per generation), suggesting that rRNA repeat copy number is subject to selection. Within each Caenorhabditis species, the majority of intragenomic variation found across the rRNA repeat was observed within gene regions (18S, 28S, 5.8S), suggesting that such intragenomic variation is not a product of selection for rRNA coding function.

Divergence and polymorphism are illustrated in the figure below:

Figure 1. Variation observed in nematode ribosomal arrays. (A) Divergence in rRNA repeats observed between the genomes of C. elegans, C. briggsae, C. japonica, and C. remanei; here, base substitutions are denoted as transitions or transversions, while complex polymorphisms represent any type of insertion, deletion, or inversion event. (B) Polymorphic positions in rRNA repeats observed within the genomes of each Caenorhabditis species. Results suggest that the pattern of intragenomic polymorphisms is unique across repeats within a species, whereas patterns of interspecific divergence reflect a strong signature of natural selection for rRNA function.

The data on genomic patterns in eukaryotic rRNA is still very preliminary, and this paper is just a starting point. Hopefully this type of work will inspire similar analyses in other groups - we desperately need more knowledge, particularly for non-model organisms.

The Fisher Files - podcast series on academic productivity

2013-10-17T17:01:00.000-07:00

Last weekend I took a road trip to LA, and I needed something inspiring to make the 7-hour drive on the CA I-5 less tedious.

I came across The Fisher Files, a fabulous podcast series originally recorded by MIT physicist Peter Fisher (since collated and rescued from internet oblivion by the Foonyor Barzane blog). In addition to the typical career advice (PhD/postdoc/junior faculty), there are some great musings on how to promote personal productivity and efficiency in academic life.

I'm über-organized, so a lot of the advice was reassuring - after listening, I think I seem to be doing things right (organizing my calendar, reviewing/assigning work tasks on a weekly basis, etc.). But it was still great to hear how a senior academic manages his career and work life - I always like to try out new things and pick up tips I might not have thought about. For example:

Fisher argues that meetings should ideally adhere to three rules: 1) Always stick to a one-hour timeslot, 2) Always prepare and distribute an agenda, and 3) Move on from a discussion by first summarizing people's thoughts and asking if there are any final, additional points. If people start to repeat ideas, say "we already discussed this" and move on. This can be necessary to cover all agenda items within the allotted time.

For my to-do lists, I've now added a couple subheadings that Fisher suggested - an "agenda" list where I keep track of things I need to discuss with different people (so when they drop by your office, you remember what you need to ask them), and a "waiting on" list where I keep track of outstanding items that require action by others.

So far I've only made it halfway through the podcast series (each topic is about 30 minutes long), but I would definitely recommend it to other researchers!

Fun with Myers-Briggs Assessments!

2013-10-04T06:00:00.000-07:00

This week at PFTF we went through results from our Myers-Briggs Type Indicator Assessments, and used our profiles as a discussion point for careers outside academia.

I guess I'm a bit unusual because I've wanted to do biological research since high school, and even way back then I knew I wanted a PhD and a career in academia. I've never deviated (or considered deviating) from this path. But still, I always find these personality assessment tools fun to look at - particularly from a management perspective, so I know what my weak points are when it comes to dealing with employees and colleagues. However, my profiles on these things are never a surprise.

I am an ENTJ according to Myers-Briggs:

Frank, decisive, assume leadership readily. Quickly see illogical and inefficient procedures and policies, develop and implement comprehensive system to solve organizational problems. Enjoy long-term planning and goal setting. Usually well informed, well read, enjoy expanding their knowledge and passing it onto others. Forceful in presenting their ideas.

Yep, the MBTI has me down to a T. Inefficiency drives me insane. I set daily goals every morning, and I'm already filling in iCal events for 2015. The MBTI rated science and law as two of my top recommended career choices, and I was weighing exactly those two options in high school.

Some people in our group expressed skepticism about (or resistance to) these types of personality assessments, but I think everyone should complete one at some point to gain some insight into their natural personality tendencies. The results can be eye-opening, particularly if you're looking to make a change or searching for the ideal career path.

We finished up with some recommended external readings:

The Search Is Over (Chronicle of Higher Ed) - transitioning out of academia
How to do an Informational Interview (Chronicle of Higher Ed)
Do What You Are: Discover the Perfect Career for You Through the Secrets of Personality Type (Amazon.com - book)

Conflicts of interest and the privatization of the public university

2013-10-03T20:35:00.004-07:00

Today marked the start of our Ethics and Professional Integrity discussion seminar that I'm taking as part of the Professors for the Future (PFTF) program (well technically it's the second class, but I was in Alaska last week and missed the start of fall quarter). The topic was "Conflicts of interest and the privatization of the public university," and we had two readings:

Kezar, A.J. (2005). Challenges for higher education in serving the public good. In A.J. Kezar, T.C. Chambers, & J.C. Burkhardt (Eds), Higher education for the public good (pp. 23-42). San Francisco: Jossey Bass.

This was a perspective on the changing nature of Higher Education, where universities have moved away from their historical roles as social institutions serving the public good, and towards commercialized ventures with strong links to industry. Our discussion group noted that although some of the evidence was a bit one-sided, many of the arguments were spot on: the move towards cost-effective lecture-based courses, increasing numbers of part-time and contract faculty, corporate administrative structures, and privatized and commercialized research.
The one thing this article missed was the impact of technology (not surprising, since today's technological landscape was still very much emerging when the article was published in 2005). That said, we also noted that technology can exacerbate some of the problems in Higher Education (e.g. online courses that bring in substantial tuition money, with little student interaction beyond "ticking boxes").

Shamoo, A.E. & Resnik, D.B. (2003). Conflicts of interest and scientific objectivity. Responsible conduct of research (pp. 139-162). Oxford: Oxford University Press.

We tied this article into the previous reading, basically by arguing that the new landscape of Higher Education is essentially in conflict with itself (the lofty mandate of pure intellectual pursuits vs. the new reality of students as paying customers).
Our discussion emphasized that conflicts of interest are everywhere, and not necessarily bad. However, it is prudent to be aware of these conflicts and disclose them up front whenever possible. The case studies in this reading focused on COIs in the life sciences, but our group noted that the humanities are just as susceptible (e.g. authoring a textbook and requiring students to buy your textbook for a course).

We discussed how the incentive structure and cutthroat competition in academia can promote certain conflicts of interest. There were several case studies where corporate financial interests (accepting funding from pharmaceutical companies) led to unethical research practices. I also argued that paywalled articles and non-open access data put scientists in conflict with what's fundamentally best for the public good--especially if the research was taxpayer-funded in the first place.

I'm really enjoying this discussion seminar - we have a small, lively group representing both the sciences and humanities, so it's been great to hear viewpoints from a diversity of disciplines.

A Quickstart Guide to Navigating University Administration

2013-10-03T15:32:00.003-07:00

I'm currently participating in the “Professors for the Future” (PFTF) program at UC Davis for the 2013-14 academic year (more about that here). PFTF is a year-long competitive fellowship program designed to recognize and develop the leadership skills of grad students and postdocs - I was selected after a nomination and application process. The program is pretty intense, and involves all these things: 1) biweekly meetings focused on careers and professional development, 2) a discussion seminar course on "Ethics and Professional Integrity", 3) a "Seminar on College Teaching" course, 4) spring and fall program retreats, and 5) individual projects (read about my project here).

At our fall retreat, Jeff Gibeling (our Dean of Graduate Studies) gave us a great rundown on University Administration. This was extremely useful, and helped clarify all those various titles you always hear being thrown around (vice-chancellor, provost, dean, chair, etc.). In short, the structure of any given university can be summed up by this neat little diagram:

The Board (or Regents in the UC system) is at the top of the administrative hierarchy. Board members are not usually academics, but rather entrepreneurs, businesspeople, or people who have political connections (here are the UC system board members). They are selected by the governor for 12-year terms, and these appointments are approved by the state Senate.

The President or Chancellor is one step below, representing the administrative head of a university. The name of this position can be confusing - in the UC system, the President is the system-wide administrative head of all the UCs, while the Chancellor is the administrative head of one UC campus. Other universities may have both positions (President and Chancellor) that serve different functions.
The Provost is the Chancellor's second-in-command, and the chief academic officer of the university.

The right-hand side of the above diagram can be considered the "Executive Branch" of a university, encompassing all the Vice Chancellor and Vice/Associate Provost positions. The name and number of these positions varies across universities, and they may or may not be filled by faculty members (it varies according to the job duties of the specific position).

Below the Provost we come to the Colleges, Schools, and Departments. Departments are the most fundamental structure of a university; groups of Departments together form a School or College. [Note that at UC Davis, a "School" offers only graduate and professional training, while a "College" offers both undergraduate and graduate training.] Each School/College is headed by a Dean, and each Department in the School/College is headed by a Chair (who reports to their respective School/College Dean). Department Chairs are senior (usually tenured) faculty members: they may be promoted to this position from the pool of faculty members in a department, or occasionally brought in as an external hire. Chairs are in charge of organizing committees, managing departmental budgets, managing the tenure review process, and overseeing the hiring of new faculty members.

The left-hand side of the above diagram, the Academic (Faculty) Senate, can be considered the "Legislative Branch" of a university. The Senate is the pool of all junior and senior faculty members from different departments, who are organized into lots of different committees (each with its own chair) and act as a governing body. This is where all your academic "service" obligations come in. Apparently UC Davis has about 30 different committees focused on various issues: the Graduate Council, the Undergraduate Council, Research Committee, Tenure and Promotion Committee, Courses Committee, Academic Freedom Committee, and yes, even a "Committee on Committees" (which appoints members and chairs to other committees).

Finally worth pointing out are Graduate Groups - these are specific, interdisciplinary graduate programs that draw faculty members from different departments. The UC Davis Graduate Group in Ecology, as an example, represents 24 different departments on campus!

This is just a quick (and hopefully useful) overview based on the organization of the UC System, and UC Davis in particular. There can be a lot of variability in administrative structure. However, some things are pretty consistent: for example, you'll always find a Vice Chancellor for Research and a Vice Chancellor for Student Affairs.

Microbial Phylogenies have the least accessible data in systematics

2013-09-19T12:36:00.000-07:00

This month in PLoS Biology, "Lost Branches on the Tree of Life" gives us a pretty stark overview of data sharing and accessibility in the systematics community. The paper assessed just how many published phylogeny papers also deposited their corresponding sequence alignments, tree files, and program parameters. The grim news:

...only 16.7%, 1,262 from a total of 7,539 publications surveyed, provided accessible alignments/trees (Figures 1 and 2). Our attempts to obtain datasets directly from authors were only 16% successful (61/375; see Table S4), and we estimate that approximately 70% of existing alignments/trees are no longer accessible. Thus, we conclude that most of the underlying sequence alignments and phylogenetic trees produced by the systematic community during the past several decades are essentially lost, accessible only as static figures in a published journal article with no capacity for subsequent manipulation. Furthermore, when data are deposited, they are often incomplete (e.g., what characters were excluded, accepted taxon names; see Text S1 and Figure S1). Our survey of publications that implemented BEAST revealed that only 11 out of 100 (11%) examined studies provided access to the underlying xml input file, which is critical for reproducing BEAST results. Although funding agencies often require all data to be accessible from funded publications, our results reveal this is more the exception than the rule.

What made me cringe even more is that my discipline (microbial systematics - including microbial eukaryotes, bacteria and archaea) is the worst offender when it comes to data sharing. The green line indicating full data deposition is pretty much flatlining in some years for microbes! (Drew et al. 2013)

I'll be the first to admit that my own data is part of the problem - when I was doing my PhD, no one ever had a conversation with me about data reproducibility and sharing. I made my best effort to publish the supplemental files I thought would be useful, but at that time I wasn't in the loop about scientific reproducibility and best practices for data archiving. For my nematode phylogeny paper in BMC Evolutionary Biology, I did upload the original ARB databases I used to construct and edit the rRNA structural alignments; but in hindsight, this file requires knowledge of the ARB software itself (not an easy package to use), and I didn't even think to publish a FASTA alignment file or a Nexus tree file. Partially this was because my phylogeny papers involved a multitude of topology tests and I didn't think it was correct to pick just "one tree" to represent my spectrum of results.

I've been thinking about this issue a lot recently, and taking strides to correct my past mistakes. I'm now digging through old PhD files to find my alignments and tree files to contribute a nematode phylogeny for the Open Tree of Life project. I'll also post these data on Figshare so my data will no longer be another "Lost Branch" on the Tree of Life.

Reference:

Drew BT, Gazis R, Cabezas P, Swithers KS, Deng J, Rodriguez R, et al. (2013) Lost Branches on the Tree of Life. PLoS Biology, 11(9):e1001636.

Diversity and Dissemination in Scientific Conferences

2013-06-11T05:02:00.000-07:00

I've been swamped with service obligations these past few months, pretty much single-handedly organizing the SMBE Satellite Meeting on Eukaryotic -Omics at UC Davis (and a joint QIIME workshop) last month, as well as serving on the organizing committee for iEvoBio 2013. Both of these conferences aimed to emphasize interdisciplinary agendas (the SMBE meeting focused on high-throughput sequencing in eukaryotes, and iEvoBio on "big data" approaches in biology). But I'm consistently struck by the fact that interdisciplinary research never feels interdisciplinary enough--you're always wanting to reach a broader audience, connect with more diverse researchers, and spread the message of the conference as far and wide as possible.

Lately I've been reflecting on many issues I've encountered related to conference organization, diversity, and dissemination. Some things I've been asking myself lately as a conference organizer:

How do we balance out gender ratios and recruit female speakers?

Female scientists represent a much smaller pool compared to their male counterparts. Scientists in general are overcommitted, and in my experience it's been much harder to secure female speakers because they're fewer in number. For iEvoBio 2013, we decided early on that since we have an all-female organizing committee, we wanted all female keynote speakers (a nod to all the publicity about gender issues in science lately). We approached many different people on the iEvoBio speaker shortlist before finding our keynote speakers. In the end, the iEvoBio committee volunteered me (!) to speak because time was ticking and we just could not secure a second woman speaker. Senior women appear particularly tricky to nail down - we started sending out speaker requests back in Autumn 2012, a full eight months (!) before the event was happening, but still no dice. For the SMBE meeting, I had also started with an initial gender-balanced and career-stage-balanced list, but as time went on this list became increasingly male-biased. So even if meeting organizers are committed to promoting gender diversity, you're grappling with many external factors that inherently seem to work against you.

How do we increase diversity at meetings?

Increasing diversity encompasses a lot of things: ensuring a spectrum of career stages, balanced gender diversity, and participation from underrepresented groups. It has seemed much easier to ensure diversity of career stages (e.g. via travel awards for grad students and postdocs) than to ensure diversity in regard to gender and underrepresented groups. We had even advertised dedicated diversity awards for the SMBE meeting (travel awards targeting females and participants from underrepresented groups), but in the end we had a very small pool of applicants for these awards. I'm sure this is an advertising problem (I doubt I reached faculty at primarily undergrad institutions, or faculty at places like Historically Black Colleges), and a function of the gender/ethnicity ratio amongst scientists, but overall I was left desperately searching for effective ways to increase diversity.

How do we advertise conferences?

This is particularly a concern for interdisciplinary conferences (how do you recruit participants from disparate disciplines, particularly when the organizers themselves are outside those target disciplines?) and newly established events (how do you get people to attend when new or one-off meetings aren't already marked on anyone's calendar?). I really struggled with advertising the SMBE meeting, which fit both of these criteria. I sent out countless e-mail notifications to colleagues and listservs, Tweeted meeting announcements, and blogged about the event. I've begged other people in my professional network to do the same. There are so many different channels but I never know what the right channels are--and there is often no way to gather data on what advertising strategies worked and what didn't. For advertising events, I often feel like I'm flailing around and hoping that people bite. How do you ensure that the information reaches the right eyes?

How do we minimize the administrative burden, particularly when scientist organizers have little/no admin support?

After being buried by the administrative burden of the SMBE meeting, I've been mentally repeating that "I'm never organizing another conference ever again". I'm enthusiastic about meetings--I always learn a great deal from my peers and leave scientific events feeling inspired and motivated. I'm also perpetually optimistic that organizational and service activities won't be much of a time suck at all, as long as I gradually stay on top of things. However, I grossly underestimated the administrative duties for the SMBE meeting at UC Davis--travel reimbursements, room booking, alcohol permits, website updates, writing the program, sending e-mails, and collating abstracts--these all consumed my life in the weeks leading up to the meeting (and I was lucky to have help from our lab's admin support person). I worried about the meeting not running smoothly or the materials looking "thrown together" and having an unprofessional air, since I was doing all these organizational duties in my "spare" time outside of research. Perhaps running a meeting is easier for faculty members vs. postdocs (faculty may have admin staff of their own, and certainly have grad students/postdocs who can share the organizing duties), but organizing even a small conference fundamentally requires a lot of admin duties regardless of the type of event.

I don't have answers to any of the above questions - only observations and thoughts based on my own experiences. I'd love to hear comments and suggestions from the community - please discuss!

Wrap up of the #SMBEeuks meeting and QIIME workshop at UC Davis

2013-05-05T17:13:00.001-07:00

Thanks to everyone who attended the SMBE Satellite Meeting on Eukaryotic -Omics at UC Davis last week (April 29 - May 2, 2013). The event was a resounding success - it was wonderful to meet participants from such diverse backgrounds, working on different aspects of eukaryotic genomics and biodiversity studies. Many thanks to meeting sponsors SMBE, MOBIO and Illumina for their generous financial support. Fingers crossed for other similar meetings in the future!

For reference, all meeting documents are available here:

Meeting program PDF
Abstract book PDF
List of registered participants (Names, e-mails, affiliations, and Twitter usernames)

Twitter discussions that took place at the meeting each day have been compiled using Storify (a great online tool that collects tweets before Twitter locks them away in their archive):

#SMBEeuks - Day 1 Storify (compiled by Jonathan Eisen)
#SMBEeuks - Day 2 Storify (compiled by Jonathan Eisen)
#SMBEeuks - Days 3 and 4 Storify (compiled by Holly Bik)

Some speakers have posted their slides online - hopefully I can expand this list as I convince more participants to share their talks and posters (updated 5/18):

Meeting welcome and overview of Eukaryotic -Omics at UC Davis - Holly Bik
The need for a phylogeny-driven genomic encyclopedia of Eukaryotes (slides only) - Jonathan Eisen (alternate YouTube link here - for slides and recorded audio)
The pro-shotgun-assembly talk - C. Titus Brown
Predicting loci of functional evolution within ancestral genes - Victor Hanson-Smith
Composition of the Maize Endophytic Microbiome is Correlated with Maize Genotype - Surya Saha
Next-generation sequencing for microbial ecology: alpha diversity, beta diversity, and biases in high-throughput sequencing - Rachel Adams
Host-associated eukaryotic communities - Laura Wegener Parfrey

On Thursday morning we held breakout group sessions to discuss the overall themes at the #SMBEeuks meeting, and put forward some recommendations for increasing the pace of scientific progress in Eukaryotic -Omics fields. A general discussion took place before we broke off into two smaller groups for more specific discussions. Notes are posted here:

Finally, my deepest thanks to Laura Wegener Parfrey, Tony Walters, and Adam Robbins-Pianka for running a fantastic QIIME workshop after the #SMBEeuks meeting. We had a packed room of eager biologists who were ready to pick up some command line expertise. QIIME workshop documents are posted below - additional thanks to microBEnet and the Alfred P. Sloan Foundation, who supported this workshop!

QIIME workshop agenda
List of participants at QIIME workshop (names, e-mails and affiliations)

Primer tests for Fungal ITS regions...plus, statistics!

2013-03-28T04:00:00.000-07:00

Reading a good paper is so inherently satisfying--and if you want to share my satisfaction, I recommend this recent piece of literature:

Bazzicalupo AL, Bálint M, Schmitt I. (2013) Comparison of ITS1 and ITS2 rDNA in 454 sequencing of hyperdiverse fungal communities. Fungal Ecology, 6(1):102–9.

I only wish this paper wasn't paywalled, because it contains quite a bit of useful information that is extremely relevant for the environmental sequencing community.

Firstly, the authors carried out a comparison of ITS primer sets and assessed their ability (and overlap) in recovering different fungal Orders, Families, Genera, and Species. I'm a big fan--these types of primer comparisons are important for figuring out what we might be missing in any given PCR-based approach.

Our results suggest that ITS2 may be more variable and recovers more of the molecular diversity. We confirm an earlier in silico study showing that ITS1 and ITS2 yielded somewhat different taxonomic community compositions when blasted against public databases. However, we demonstrate that both ITS1 and ITS2 reveal similar patterns in community structure when analyzed in a community ecology context. [Bazzicalupo et al. 2013]

Secondly, I feel like I learned some statistics by reading this paper! Or at least, I understood why authors chose the methods they did. I really liked that this paper includes detailed explanation of the statistical tests used to assess the ITS regions and make OTU comparisons. For example:

We compared OTU abundance distributions between the ITS1 and ITS2 datasets at all similarity levels with the Kolmogorov-Smirnov (KS) test to see whether the ITS1 or ITS2 would project higher OTU richness in the samples. KS tests are often used to test the distribution of datasets against other distributions, so one may use it to test if a dataset is e.g. normally distributed (Conover 1999). However, the KS test may also be used to compare the shapes of two empirical distributions. Species abundance distributions contain information about both the richness and evenness, thus the comparison of distributions is more meaningful than comparing the means of distributions with e.g. t-tests (Phillips et al. 2012). [Bazzicalupo et al. 2013]

I don't have a strong statistics background (but I'm very aware that I need to become more competent in this area), and this paper helped me understand what types of statistical tests I could apply to environmental sequence data in future analyses. In this regard, the Bazzicalupo et al. methods section was a great change of narrative, compared to the stats-name-dropping-without-explanation I see so often in other papers.
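For anyone else trying to build intuition, here is a minimal sketch of the idea behind the two-sample KS comparison described in the quote above. It is not the authors' code or data: the abundance values are invented, and the `ks_statistic` helper is a hypothetical name written only for this illustration. It computes just the KS statistic D (the largest gap between two empirical cumulative distributions) and not a p-value; for a real analysis, a statistics library such as SciPy (`scipy.stats.ks_2samp`) would be the more sensible choice.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest vertical gap between the
    empirical cumulative distribution functions of the two samples."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    max_gap = 0.0
    # The empirical CDFs only change at observed values, so it is
    # enough to compare them at every value seen in either sample.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


# Hypothetical OTU abundances (reads per OTU), invented for illustration.
its1_abundances = [1, 1, 2, 3, 5, 8, 40, 120]
its2_abundances = [1, 1, 1, 2, 2, 4, 6, 9, 15, 60]

d = ks_statistic(its1_abundances, its2_abundances)
print(f"KS statistic D = {d:.3f}")
```

D ranges from 0 (the two distributions have identical shapes) to 1 (no overlap at all). Because it compares whole distributions, it responds to differences in both richness and evenness, which is the point the authors make about why it is more informative here than comparing means.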

International Research Coordination Network for Biodiversity of Ciliates

2013-03-27T09:07:00.001-07:00

I was browsing through this NSF report on Dimensions of Biodiversity Projects 2010-2012, and I stumbled across this project which I had no idea even existed!

Upon further investigation, I discovered that this ciliates RCN has a portal website (including a document listing "Grand Challenges" in the study of ciliates). The inaugural meeting took place in September 2012 at NESCent, so it looks like the RCN is still in the early stages.

Frustratingly, the website doesn't seem to have been updated in quite a while (May 2012), so there isn't much new information about workshop outcomes or upcoming RCN activities. I'm excited to keep tabs on this new community - the discussions and outputs will be very relevant to high-throughput environmental sequencing approaches.

Content is King (Part 1): Social Media strategies according to Evan Bailyn

2013-02-06T10:00:00.000-08:00

**This is a cross post from microBEnet, a portal website for the Alfred P. Sloan Foundation's program focused on the Microbiology of the Built Environment. This is one Eisen lab project I'm heavily involved in, since microBEnet thinks a lot about social media.**

Part of what we're trying to do is to put the net in microBEnet. As in, building an online network for an emerging research discipline (Microbiology of the Built Environment) that connects building scientists and engineers with biologists, ecologists and computer scientists.

The internet is a big place. Publicizing a new cause or web portal can be overwhelming, even for those of us who think we know what we're doing.

Since the social media realm is largely considered "untested waters" (and we scientists are hardly Silicon Valley insiders) there's a lot of experimentation. Figuring out what works and what doesn't. Trying new strategies as the web evolves.

Until a few months ago, I was completely unfamiliar with the concept of Search Engine Optimization (SEO). But then I attended a talk by Evan Bailyn, author and web entrepreneur who has extensive practical experience with the worlds of web searches and social media.

Evan's #1 commandment for building an online presence (a brand, your professional reputation, or an online community such as microBEnet)? Create excellent and unique content, frequently - ideally every day. This will not only draw people into your site, but it will dramatically improve your search engine rankings.

To achieve internet domination, Evan outlines his 3-step "Nuclear Football" approach (a meaningless phrase, but you'll remember it):

1) Create content

Content reigns supreme. Your content will define your unique voice--it's what will get people hooked on your site, and separate you from everyone else on the internet. In ecological terms, you must define your niche or face extinction.

Every site should have a blog - this is one of the easiest ways to publish new content. The next step is to define your audience (who will you target?), and figure out what content would be interesting for them. Once you have strategized and set up your site, try to update the blog with new posts every day. However! Don't just post for the sake of posting - to capture audiences there needs to be real passion emanating from your daily content. Don't think about SEO, web traffic or future accolades--it will only create stress and cripple your efforts to create really awesome content. Also, never be afraid to cater to the average person's desire for low-level entertainment (why do you think we use such low-brow humor on Deep Sea News, a marine science blog that I contribute to?). Evan even suggested testing content on forums such as Reddit--what will garner the most interest from your audience?

Evan's general blogging guidelines: posts should be a minimum of 500 words, with a minimum of 3 images. Website images should be high-resolution (don't use those cheesy stock photos), and have captions that will help improve your search engine ranking. Blog posts should have appropriate and descriptive titles; this is vitally important for people finding your web content via search engines.

Wordpress software (what we use for microBEnet) is particularly friendly for SEO. Plugins such as Yoast allow you to manually edit meta page titles (the text that appears in the top bar of your browser application, next to the minimize/maximize buttons - NOT what's on the webpage itself). The meta page title is normally filled in automatically in Wordpress from the blog post title, but editing this to add in a few more keywords will help bring more viewers to your site. Consider blog post and meta page titles as the "prime real estate" for search engines; the order of words doesn't matter, but you should write a meta page title that is readable by both humans and computers.

Specific titles will increase your exposure to your target audience--people who are seriously researching a subject (or about to make a purchase) will use what are called long-tail search terms. For example, someone Googling "microbes" might just be browsing around or looking for a link to Wikipedia, but a person using a more specific phrase like "microbes that live in air conditioners" is desperate for specialized content. The latter person would represent the serious audience for microBEnet - so it's in our best interests to appear high in the search rankings for that given phrase!

Surprisingly, URLs aren't as important as they used to be - these were devalued by Google about a year ago. Even more reason to pay close attention to page titles.

2) Reach out to get links and exposure

Once your website is established, content is being created, and all the SEO tools are in place, it's time to get the word out.

The internet is a big place--even with the best content, your site will lurk in the darkness without external support. Google uses links AND page titles to determine search results. Evan underlined that Google's fundamental search strategy hasn't changed in 8 years, despite frequent tweaks to the algorithm. Algorithm updates are mainly meant to improve Google's ability to eliminate potentially spammy links; Google does not like it when websites pay $$ for links (this is a spammy strategy carried out for the sole purpose of improving search rankings). Spammy links impact a site's "Trust Rank", and will ultimately hurt a site's search engine rankings if (when) Google finds out.

The bottom line: getting people to link to your website is a much more valuable strategy. It shows that your website is a major hub and houses important content. In this respect, getting linked to from high-traffic sites with high search rankings is a definite victory.

So how do you "bait" people to link to you? One tip is to tweak content according to what's hot right now. Customizing blog posts according to time of year (seasons, holidays) or newsworthiness is one way to create unique, compelling posts that can easily hook an audience. Evan commented that people tend to love Top 10 lists and superlatives. Another method is to leverage your online network and ask for links, although this seems to be a time-consuming approach (sending out 20 emails, only getting 1 positive response), and perhaps not always appropriate for science (you don't want to come across as too self-promoting, and you might turn people off).

3) Convert people from viewers to buyers/followers

As your website grows, the final step is to convert casual visitors into captive followers--capitalizing on your web traffic and the people that come to you via search engine terms and external links.

To illustrate the power of Search Engine Optimization, Evan relayed a case study from his own life. His wife had always dreamed of being a theme park designer, but needed to find a way to break into the industry. She realized that theme parks are a small, niche industry where not many people are web-savvy; like science, the theme park industry also has what Evan calls "microcelebrities" - people who are well known within their own specialized community. To get her name out there in the theme park world, Evan's wife set up a website called Entertainment Designer and started producing content that was unique but relevant for the theme park community. In particular, she sent "Interview Request" e-mails to microcelebrities, and started posting the resulting transcripts on the site. This unique content led to steady growth in site traffic over a relatively short timespan (1.5 years from web setup to interview posts). Entertainment Designer ultimately became a hub for the theme park industry, and nowadays Evan's wife acts as an intermediary who offers formal introductions to people who hope to work together in the industry. Through SEO and targeted content, the website enabled her to get her foot in the door, without having any training or previous experience.

The best advice for gaining followers? I think that every case is different, and requires some degree of trial-and-error. But general rules include knowing your target audience AND your chosen medium, and defining what you hope to accomplish in advance.

Conclusions

The internet continues to evolve, and there are some notable trends on the horizon. Google continues to personalize and customize search results--you'll notice your search results will be different depending on whether you're signed in or out of your Google account. When you're signed in, your search results will be affected by 1) your location and 2) your personal search history. This increasing reliance on personalized search will have significant implications for SEO strategies.

Perhaps the biggest secret of all is how to create content which is interesting to people--Evan firmly stated that frequent but boring, run-of-the-mill content isn't worth writing. If you take home one message from this post, take this: Content is King.

To find out more about SEO and social media strategies, I definitely recommend reading Evan Bailyn's books. I just finished Outsmarting Social Media, which we received for free at that talk, but another one I want to read is Outsmarting Google: SEO Secrets to Winning New Business.

SMBE Meeting on Eukaryotic -Omics: April 29-May 2 at #UCDavis

2013-01-24T15:22:00.003-08:00

The website is built, speakers have been lined up, and we're ready to announce it to the world:

Along with my former PI Kelley Thomas at the University of New Hampshire, I received funding from the Society for Molecular Biology and Evolution to host an SMBE Satellite Meeting focused on Eukaryotic -Omics at UC Davis this spring. The meeting dates have been set as April 29-May 2, 2013, and the meeting description is as follows:

The SMBE Satellite Meeting on Eukaryotic -Omics will bring together an interdisciplinary pool of researchers to discuss current efforts, challenges, and future directions for high-throughput sequencing approaches focused on microbial eukaryotes (environmental studies of non-model organisms). The meeting program will encompass investigations of eukaryote biodiversity, ecology, and evolution, using approaches such as rRNA marker genes, shotgun metagenomics, metatranscriptomics, and computational biology tools and software pipelines.

See the meeting website (http://www.smbe.org/eukaryotes/) for program announcements, registration details, and travel award information. We're currently in talks to tack on a QIIME workshop at the end of the meeting (tentative dates May 2-4), so keep an eye out for further details. The official conference hashtag will be #SMBEeuks on Twitter.

STEM diversity has been on my mind a lot lately, particularly given the Eisen lab's obsession with equality in gender representation. So I'm very excited to announce that our call for travel award applications includes a heavy focus on diversity--encouraging early-career applicants as well as those from underrepresented groups. Deadline for abstract submission and travel grant applications is February 22, 2013 - mark it on your calendars!

Our #asm2013 Session: "Phylogenomics and Microbial Species Concepts"

2013-01-16T15:57:00.000-08:00

The preliminary program is out for the 2013 meeting of the American Society for Microbiology, to be held May 18-21 in Denver, Colorado.

I'm very excited to announce two awesome sessions being led by the Eisen lab. First is the session I'm co-convening, entitled "Phylogenomics and Microbial Species Concepts" (session dates and description below). The second session is "Citizen Microbiology: Enhancing Microbiology Education and Research with the Help of the Public", led by Jonathan Eisen and David Coil.

The abstract submission deadline has just been extended to Thursday, January 17th - consider submitting to these two awesome sessions if you're vying for a talk!

Navigating (and drowning in) the flood of PLoS ONE journal articles

2013-01-13T19:37:00.000-08:00

I love PLoS ONE--both the mission of the journal and much of the science that is published there--and for the large part I love the new website redesign. But one thing I'm definitely not feeling is the revamped e-mail alert system.

I will admit it up front: I still abide by some old skool methods for discovering relevant literature. Every week, I pore over the Table of Contents and early article alerts from my favorite journals, neatly delivered to my inbox. Twitter also helps me find a lot of literature, but I find it to be more of a stochastic and unpredictable method (particularly for weeks where my time for social media is limited due to a heavy workload or lots of travel). Plus, being on Pacific Time puts me off kilter with the rest of the world--relevant information is very easily buried in my Tweet stream, even on days when I am looking. So I stick to my e-mail alerts to make sure I don't miss any exciting new science.

Up until a month or so ago, the PLoS ONE e-mail alerts were a behemoth, but they were manageable. The HTML e-mail was nicely formatted with embedded links to a list of articles in fairly specific subject areas, such as "Marine and Aquatic Sciences" and "Evolution and Ecology". It would take a couple of minutes to scan through these sub-categories, but for the most part it was a pretty good way to filter out the research areas which were most certainly not relevant to you. Also, many articles were placed into multiple categories, so an environmental metagenome study using novel analysis methods would be listed under the subject headings for "Computational Biology" and "Evolution and Ecology".

So much to my dismay, I've been going through my holiday-induced backlog of journal alerts and was horrified by the new format for PLoS ONE Table of Contents:

The subject headings that were formerly useful for me have now been completely condensed into the very broad subject headings "Biology and Life Sciences" and "Environmental Sciences and Ecology". Worse, each of these subheadings seems to contain a ridiculous number of articles (I didn't count how many, but I was scrolling for a looooong time before I reached the end of the subsection). And it also seems like I need to be looking through both of the above-mentioned subject headings: there were a few relevant articles peppered amongst lots of non-useful literature in each subheading.

I don't have the time (nor do I want to make the time) to scroll through lots of irrelevant scientific literature essentially looking for a needle in a haystack. So I took the advice of the yellow banner and went to create a custom alert on the PLoS ONE website. Frustratingly, there is no way just to look for new articles within a defined subset of subject areas. You have to include a search term, which immediately narrows your search window. I tried just doing a simple search for "metagenomics", but I was getting a lot of biomedical/clinical articles amongst the interesting ones (and I didn't want to scroll through all 963 articles). Plus, I'm paranoid that my search term isn't catching all the articles that I would want to see. I tried filtering down the articles to a more manageable set, but my attempts did not go over very well:

My final gripe is the subject categories themselves. The checklist of subject terms initially presented under the Advanced Search function is different from the larger list of subject terms listed under "filter by subject area". Are the "filter by subject area" search terms defined based on the articles themselves? I have no idea. I also have no idea what half of the subject terms mean. There's a subject term called "Sequence Analysis" and also one called "Research and Analysis methods" - could/should an article overlap these two terms, or are they referring to two distinctly different things? In my mind these categories seem a bit too vague and redundant to be much use for users. Subject terms also have some glaring errors--there is no "Ecology" category at all!

In the end, I basically gave up. I'm going to begrudgingly go back to that monster of an e-mail Table of Contents.

I'm a firm believer in intuitive web interfaces with powerful user functionality--I don't think any researchers should have to work this hard to complete what is essentially a very simple (and very common) task. It is also in the best interest of journals (and authors) to have their work easily accessible--the articles I'm downloading today may result in future citations, blog posts, social media sharing, etc.

So my pleas for PLoS ONE:

Bring back the old e-mail alert format! Or even better, a new revamped format with even more useful subject categories.
Consolidate and streamline the subject terms - make them consistent between search interfaces, and specific enough that the meaning of each term is obvious. Ideally, each article would have something like Mendeley tags that would function as searchable keywords. If I liked a particular article, there would be a way to view articles with similar keywords--kind of like "Customers who bought item X also looked at these items" on Amazon.com

I know these types of fixes won't necessarily be easy - I don't know how PLoS ONE organizes its article databases, and the things I'm suggesting might require a significant amount of coding and/or manual curation to implement. But I do think this type of organization is imperative for the long-term business model of the journals. Keep the scientists happy!

Dramatically reducing sequencing error via Duplex Tag sequencing

2013-01-07T03:00:00.000-08:00

Note: This is a cross-post from my recent blogging over at The Molecular Ecologist - check out the blog if you haven't heard of it; it's a great resource for biologists grappling with high-throughput sequencing data.

An exciting new study was published in PNAS last month, an open access paper entitled "Detection of ultra-rare mutations by next-generation sequencing". This new method has the potential to open up a new frontier in Next-gen sequencing bioinformatics, since it allows tracking of virtually all PCR and sequencing-generated errors.

In this approach, authors Schmitt et al. used Duplex Tag sequencing - they tacked on a sequence of 12 randomized nucleotides onto Illumina adaptors prior to conducting PCR (where forward and reverse adaptors are labelled with different tags, denoted here by A and B tag sequences). After library prep and sequencing, these primer tags can be tracked in two ways. Firstly, sequences containing the same unique 12 bp in the same orientation (AB or BA) can be informatically grouped together and used to generate a Single-Strand Consensus Sequence (SSCS). Even this simple approach reduces error rates relative to standard quality processing.

However, the real power comes from combining information from reads containing the 12 bp tags in both orientations (AB and BA). The authors showed a dramatic reduction (i.e. near-elimination) of sequencing errors by using information from the Duplex Consensus Sequence (DCS)--information from complementary pools of PCR products representing both sense and antisense DNA strands from the original reference DNA molecule. In DCS, any given mutation must be present across ALL reads (AB and BA oriented tags), or else it is likely to represent spurious sequencing error or first-round PCR error (e.g. in the case of a mutation being present in all sequences used to generate the SSCS but not present on the complementary reads used to generate the DCS).

Overview of Duplex Tag sequencing approach (Schmitt et al. 2012)
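The grouping logic described above can be sketched in a few lines of Python. This is my own toy illustration, not the pipeline from Schmitt et al.: the function names, the short 3 bp tags, the minimum family size of 3 and the 60% agreement cutoff are all assumptions made for the example, and reads are assumed to be equal-length and already converted to the same orientation.

```python
from collections import Counter, defaultdict


def consensus(seqs, min_fraction=0.6):
    """Column-wise majority consensus; ambiguous positions become 'N'."""
    called = []
    for column in zip(*seqs):  # assumes equal-length, pre-aligned reads
        base, count = Counter(column).most_common(1)[0]
        called.append(base if count / len(column) >= min_fraction else "N")
    return "".join(called)


def single_strand_consensus(reads, min_family_size=3):
    """Group reads by ordered tag pair and build one SSCS per family."""
    families = defaultdict(list)
    for first_tag, second_tag, seq in reads:
        families[(first_tag, second_tag)].append(seq)
    return {
        tags: consensus(seqs)
        for tags, seqs in families.items()
        if len(seqs) >= min_family_size
    }


def duplex_consensus(sscs):
    """Pair each AB family with its BA partner and keep only the bases
    that both strands support; disagreements become 'N'."""
    dcs = {}
    for (tag_a, tag_b), strand_one in sscs.items():
        if tag_a >= tag_b:
            continue  # visit each AB/BA pair only once
        strand_two = sscs.get((tag_b, tag_a))
        if strand_two is None:
            continue  # no complementary family was sequenced
        dcs[(tag_a, tag_b)] = "".join(
            x if x == y and x != "N" else "N"
            for x, y in zip(strand_one, strand_two)
        )
    return dcs


# Hypothetical reads: (first tag, second tag, sequence).
reads = (
    [("AAA", "CCC", "ACGA")] * 3    # error copied into every read of one strand
    + [("CCC", "AAA", "ACGT")] * 3  # family from the complementary strand
)
sscs = single_strand_consensus(reads)
print(sscs[("AAA", "CCC")])
print(duplex_consensus(sscs))
```

In this toy case the AB family carries the same wrong base in every read, so the SSCS step cannot remove it. The error only gets masked once the BA family is compared against it, which is exactly the first-round PCR scenario described above.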

Schmitt et al. estimated the error frequency of DCS at 3.8 × 10^-10 (likely even lower, because this estimate assumes all mutations are equally probable, whereas the SSCS data show quite a strong mutational bias). This error rate is unprecedented and astoundingly low, considering that the error rate for standard data processing methods is typically 3.8 × 10^-3.
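To put those two figures side by side, here is a quick back-of-the-envelope calculation in Python. The two rates are the ones quoted above; the one-billion-base dataset size is a hypothetical number picked only to make the arithmetic tangible.

```python
STANDARD_ERROR_RATE = 3.8e-3  # errors per base, standard processing (quoted above)
DCS_ERROR_RATE = 3.8e-10      # errors per base, Duplex Consensus Sequences (quoted above)


def expected_errors(error_rate, bases_sequenced):
    """Expected number of erroneous base calls for a given per-base error rate."""
    return error_rate * bases_sequenced


fold_improvement = STANDARD_ERROR_RATE / DCS_ERROR_RATE

bases = 1_000_000_000  # hypothetical dataset size: one billion bases
print(f"fold improvement: {fold_improvement:.0e}")
print(f"standard: ~{expected_errors(STANDARD_ERROR_RATE, bases):,.0f} expected errors")
print(f"DCS:      ~{expected_errors(DCS_ERROR_RATE, bases):.2f} expected errors")
```

The ratio works out to seven orders of magnitude: millions of expected errors per billion bases with standard processing, versus less than one with DCS.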

The study also contains a number of super cool empirical validations proving the accuracy of the Duplex Tag approach. Error rates from DCS were dramatically reduced to the point that DCS-derived estimates were shown to be even more accurate than in vivo genetic measurements of mutation rates (in this study, the authors used an example from the M13mp2 LacZ assay used to determine mutation rates for reference DNA). In addition, Schmitt et al.'s estimation of mutation rates via DCS was consistently equivalent to previous research in M13mp2 substrates (where mutation has been extensively characterized) and in human mitochondrial DNA. DCS can also be used to pinpoint "hotspot mutational regions" and genomic mutation patterns, by removing artifacts that have previously precluded such analysis. For example, the authors used their Duplex Tag method to identify one such hotspot in the region of replication initiation (D loop) in human mtDNA.

As if that weren't enough, the study also tracked what kind of errors happen in your typical PCR reactions. Typically, there are bursts of DNA mutations that pop up during the first round of PCR, and these strand-specific errors are carried through all subsequent reactions. The authors used a mutagenic protocol to show that oxidative products produced by PCR polymerases cause an excess of C-->T and G-->T mutations, and a characteristic mutation profile in SSCS analyses. These PCR-specific errors can't be identified through SSCS alone (since first-round PCR errors are propagated across all daughter copies), but they are easily pinpointed and eliminated using DCS.

Tracking mutational profiles of first-round PCR errors (Schmitt et al. 2012)

So what do these results mean for future studies? Well first of all, Duplex tagging is an EASY method to apply--there are no extra steps in library prep, just modifications when you order your sequencing adaptors. However, the approach is limited by the need for overlapping fragments (e.g. fragment size limitations imposed by 2 x 250 PE runs on an Illumina MiSeq). The full workflow (deep error correction via DCS) is currently only applicable for shotgun metagenomic sequencing, where you are originally starting with double-stranded fragments of DNA. That being said, amplicon sequencing approaches (rRNA or other marker genes) could still harness the SSCS approach to at least gain some reduction in error rates by pooling reads with the same unique 12 bp tag.

Even applied to metagenomic approaches, Duplex Tag sequencing has the potential to give us an unprecedented view into the rare biosphere - we've always been able to look at abundant taxa, but ~1% sequencing error rates have persistently clouded our view of low-copy taxa (or rare gene variants) in environmental or microbiome samples. By removing virtually all error, future studies can ensure that observed variants are biologically real.

Reference:

Schmitt MW, Kennedy SR, Salk JJ, Fox EJ, Hiatt JB, Loeb LA. (2012) Detection of ultra-rare mutations by next-generation sequencing. Proc Natl Acad Sci USA, 109(36):14508–13.

Defining a DNA barcoding locus for protists

2012-12-13T08:32:00.001-08:00

Last month's PLoS Biology had a community article devoted to the Protist Working Group recently initiated through the Consortium for the Barcode of Life (CBOL). Now, people have mixed (often vehement) opinions about CBOL - I won't go into the history or debate about DNA barcoding here, but it's worth checking out posts by Dave Lunt at EvoPhylo (Rewriting the invention of DNA barcoding) and Jonathan Eisen at the Tree of Life ("Barcoding" researchers keep ignoring microbes and history).

Since its inception, CBOL has been overwhelmingly focused on the mitochondrial Cytochrome c Oxidase 1 locus that can be "universally" amplified (unless you're trying to barcode nematodes or any other taxon where these universal primer sets don't work). However, in recent years I've been pleased to see the formation of different taxon-specific working groups focused on ribosomal RNA genes as "official" DNA barcodes (e.g. ITS for Fungi). Ribosomal loci have a much longer history--and thus more available reference data--in most eukaryote groups, and have pretty much been adopted as de facto barcodes for molecular studies (read: not yet CBOL-approved).

Even if all researchers working on a specific taxon are using the same gene to complement morphological/ecological information, I still strongly support the formation of these CBOL working groups. As more people adopt high-throughput sequencing approaches, we need coordination and interaction across different taxonomic communities. For a given taxon, discussions focused on the barcoding locus will simply get people talking, illuminating what different labs are actually doing and helping us to determine the most useful (although probably not perfect) community standards.

So CBOL working groups essentially gather all the experts within a particular taxon, and have them discuss the merits and drawbacks of different loci for molecular identification of species:

Identifying the standard barcode regions for protists and assembling a reference library are the main objectives of the Protist Working Group (ProWG), initiated by the Consortium for the Barcode of Life (CBOL, http://www.barcodeoflife.org/). The ProWG unites a panel of international experts in protist taxonomy and ecology, with the aim to assess and unify the efforts to identify the barcode regions across all protist lineages, create an integrated plan to finalize the selection, and launch projects that would populate the reference barcode library. (Pawlowski et al. 2012)

I was highly encouraged by the Protist Working Group's stance - instead of trying to force everyone to use a single locus (e.g. defining ONE barcode for all protists), they advocate a much more realistic approach:

Because of their long, independent, and complex evolutionary histories, protists are so genetically variable that it is virtually impossible to find a single universal DNA barcode suitable for all of them. The ProWG consortium therefore recommends a two-step barcoding approach, comprising a preliminary identification using a universal eukaryotic barcode, called the pre-barcode, followed by a species-level assignment using a group-specific barcode (Figure 3). In this nested strategy, the ~500 bp variable V4 region of 18S rDNA is proposed as the universal eukaryotic pre-barcode. Group-specific barcodes (Figure 2C) will then have to be defined separately for each major protistan group, based on comparative studies using the CBOL selection criteria, and much of this work is still to be done. (Pawlowski et al. 2012)

This proposed approach will easily translate to high-throughput studies - you might want to get a broad overview of eukaryote communities with a universal 18S primer set, and then dig deeper into species assemblages by also sequencing other loci (ITS, COX1) targeted at ecologically important groups.

Now all we need is a CBOL Working Group for Nematodes - seriously, why don't we have one yet?!

Reference:

Pawlowski J, Audic S, Adl S, Bass D, Belbahri L, Berney C, et al. (2012) CBOL Protist Working Group: Barcoding Eukaryotic Richness beyond the Animal, Plant, and Fungal Kingdoms. PLoS Biology, 10(11): e1001419.

What am I hiding?

2012-12-01T09:48:00.000-08:00

I was truly inspired by the events and discussions during Open Access Week 2012, held back in October. And I've made a decision: I'm embarking on a plan to go more Open.

After a couple of years keeping notes on a private blog, I've decided to make my online lab book public.

After half a decade of jotting notes in Word documents and musing through scientific questions in my own head, I've decided to start putting these on a dedicated blog (this one).

One of the striking themes across all the meetings I attend is the stunted flow of information, particularly across disciplines. For biologists moving towards increasingly computational projects, researchers often don't know what tools to use, what literature to read, or the most critical factors that should be considered when designing studies.

High-throughput biology requires knowledge of increasingly diverse fields: from taxonomy to computer science, metabolomics to mathematics. At the moment everyone struggles to cope in private (including me for the last few years, retraining as a computational biologist during my postdoc).

"Science Communication" doesn't mean just outreach. It means communication amongst scientists. One of the hardest parts about being an interdisciplinary scientist is simply keeping track of information. It's hard enough to keep up with the literature in your own field, never mind the papers in other fields, or the fact that many things you need to know aren't even published yet. Jeremy Fox of Dynamic Ecology lends a great perspective in his recent Ideas in Ecology and Evolution (IEE) paper, arguing that:

Blogs can cover topics not suitable for peer-reviewed papers in ecology journals. Peer-reviewed journals in ecology traditionally, and rightly, focus on publishing new science, and reviews of existing science. But this by no means exhausts the range of topics that ecologists want and need to discuss.

So here we are, the debut of "Eukaryotic Ebullience". This is my venue to talk hardcore science, and share with you my experiences in high-throughput sequencing approaches focused on eukaryotes--from environmental metagenomics and biodiversity to software development and computational tools. I'm excited to become more involved in the online science community, and determined to carve out more time for blog posts. I think there are many exciting scientific conversations yet to be had, and I look forward to sharing them with you all.