Differentiation in Bioinformatics Careers

Big data is changing the way science is done. How will it affect your choices as a student or young professional?

By Thomas Shafee — Own work, CC BY 4.0, https://commons.wikimedia.org/w/index.php?curid=56961282

Every squint faces the following dichotomy: software engineering vs science. In this article, I’ll describe the differences in skillsets between data science and data engineering, how those roles will affect your teamwork, and how each role relies on or works independently of others when developing statistics, analytics, reports, and salesmanship.

But this article also highlights the need for mentorship. Unlike monoculture graduate programs, hybrid programs keep a minimalist footprint by leveraging existing teachers in adjacent ‘parent’ departments. Such programs develop concentrations that steer students towards or away from deeper study in their comfort zone. Proper mentorship encourages students to take the concentration opposite to their scholastic strengths, but graduate-level coursework alone might not be enough of an introduction to the subject.

Individuals graduating from interdisciplinary graduate programs proceed in markedly different directions compared to their peers in standard graduate programs. In my case, the choice was to differentiate between laboratory research, mathematics, and software engineering.

Before I begin drawing imaginary boundaries on what data scientists or engineers can and cannot do, I’d like to remind you that there is no limit to what you can accomplish, no matter how steep the learning curve is in your new field. You’re reading this article because you’d like to hear what my journey has been like. I’m another millennial navigating an interdisciplinary field, coming from someplace I am strong (biology) and becoming stronger someplace I was not (math and computer science). There are no boundaries, real or otherwise; any description I give of hypothetical roles, or of the strengths and weaknesses of the various categories, is my attempt to organize differences in expectations and roles.

Regardless of where you are in terms of your computer literacy, there are two strengths that you can never have too much of in a new field: communication and research ethics. While the public may believe that research ethics is a matter of phrasing, peeking at data before analysis, or failing to normalize appropriately… in practice it’s most likely the art of practice itself.

Data-driven research can be very convincing. But our business is not to convince others of our opinion of what the data are saying. It is to use data to help make decisions, and sometimes we are too overwhelmed by the volume of data available to us.

This is not a weakness of statisticians per se. Instead, statistics offers this generation some interesting new techniques that the world is not yet familiar with. Additionally, modern computer science and data structure research have produced new tools that are rapidly becoming part of the standard library. Often there are new tools that would help the average researcher make sense of such information, if there were enough time to make the right decisions. And the right decisions often require the most legwork.

As such, bioinformaticians who are interested in mathematical biology produce two important deliverables for the outside world that can be considered ‘data science.’ The first is a wealth of new databases and reports, reformatted and indexed for later use, detailing what experiments were conducted in silico. The second is a new generation of metrics developed by algorithm designers who bring new biological perspectives into the standard data processing pipelines.

Interdisciplinary Science is a Balance of Science, Math, and CompSci

Bioinformaticians come in all different types: labrats™ that moonlight as computer nerds, theoreticians that use programming as that extra bit of magic in their specialized sub-fields, and infrastructure builders that create interesting and advanced architectures that enable whole companies.

Bioinformaticians have to choose how much of each discipline to master. For example, some bioinformaticians pursue knowledge of statistics and mathematical modeling if math is their best language for learning about and/or affecting their environment. Others may avoid the side of bioinformatics that stems from mathematics and computer science in favor of learning just enough math to make sense of the alignment problem, which has enough applications in theoretical biology to make an entire career. To some degree, alignment is one mathematical basis (no pun intended) on which evolutionary theory can grow. For example, the use of multiple sequence alignment in early science education to teach the mathematical foundations of evolutionary biology represents one of bioinformatics’ most interesting problems under investigation.
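To give a feel for how little math is needed to get started on the alignment problem, here is a minimal sketch of pairwise global alignment scoring in the style of Needleman-Wunsch. The function name and the scoring values (+1 match, -1 mismatch, -1 gap) are my own illustrative assumptions, not a recommendation; real tools use substitution matrices and more careful gap penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score of the best global alignment of strings a and b.

    Illustrative scoring only: the default values are assumptions.
    """
    # Row for the empty prefix of a: aligning "" with b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # First column: aligning a[:i] with "" costs i gaps.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            # Either pair a[i-1] with b[j-1], or open a gap in one sequence.
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + pair,   # align the two characters
                            prev[j] + gap,        # gap in b
                            curr[j - 1] + gap))   # gap in a
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
```

Only one row of the scoring table is kept at a time, which is enough for the score but not for recovering the alignment itself; that requires keeping the full table and tracing back through it.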

The difficulty in the aforementioned choice of degree concentration may be made easier by the phrasing of key questions pertaining to the origin of the subject. Interdisciplinary papers and methods often make interesting and sometimes controversial statements or assumptions in the respective disciplines. Sometimes a difficult or longstanding problem may provide enough interest for the student to take the concentration they do not have the most strength in. In my case, the complexity of assembly algorithms was a source of frustration with mathematics, and it was the push I needed to pursue more interesting subfields of mathematics.

A difficult decision was whether to think of computers as a tool of quantitative strength, flexibility and work-life balance, or as the fundamental center of my scientific career. I don’t like sitting in front of a computer all day. But I do enjoy the feeling that multi-tasking on the server gives me, when I know my simulations and theoretical pursuits are progressing even if I’m writing or doing something in the laboratory. The amount of publicly available information about software development is very large, while the fraction of it that is actual computer science wisdom is very low. It feels like four years isn’t enough time to develop a strong foundation for the decision to generalize or specialize.

System complexity certainly drives the growth of interdisciplinary studies. Most research is driven by data to the point where hypotheses aren’t always made explicit. Data is collected, the system is characterized, and only when something doesn’t make sense with the existing model is a hypothesis formalized. In some ways, we are reliant on the techniques, technologies, and assumptions made during the technique’s creation in a way that forbids us from questioning them regularly. And there aren’t many young comedians bringing light to those issues in a good way.

My point is that some sciences aren’t hypothesis-driven anymore; they’re data-driven. The system complexity doesn’t always make for easy or obvious cherry-picked hypotheses. Only after significant study of the system’s normal function can you begin to anticipate what perturbations might be useful to study its response. And in this case the system is typically thousands to hundreds of thousands of ‘genes’ if you go by the high-school definition of what a gene is. So what? So the behavior of any one gene is typically non-linear, there is no absolute unit of gene expression, the resolution required to focus on intervals of gene expression with linear characteristics is expensive, and genes are typically regulated in clusters. It turns out that all of those considerations come into play with multiplexed expression data, even when looking at one simple perturbation experiment.
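The point about there being no absolute unit of gene expression can be made concrete with a small sketch. Raw read counts depend on how deeply each sample was sequenced, so expression is usually reported relative to something else. The function names, the pseudocount, and the toy counts below are all made up for illustration; this is not a substitute for a proper differential expression method.

```python
import math


def counts_per_million(counts):
    """Scale raw counts by the sample total so samples are comparable."""
    total = sum(counts.values())
    return {gene: 1e6 * n / total for gene, n in counts.items()}


def log2_fold_change(control, treated, pseudocount=1.0):
    """Relative expression of treated vs control, per gene.

    The pseudocount (an assumption here) avoids division by zero.
    """
    cpm_control = counts_per_million(control)
    cpm_treated = counts_per_million(treated)
    return {
        gene: math.log2((cpm_treated[gene] + pseudocount)
                        / (cpm_control[gene] + pseudocount))
        for gene in cpm_control
    }


if __name__ == "__main__":
    # Toy numbers: the treated sample was sequenced twice as deeply.
    control = {"geneA": 100, "geneB": 100, "geneC": 800}
    treated = {"geneA": 400, "geneB": 200, "geneC": 1400}
    for gene, lfc in log2_fold_change(control, treated).items():
        print(gene, round(lfc, 2))
```

In the toy data, the raw count for geneB doubles, yet its relative expression is unchanged once sequencing depth is accounted for. Only geneA shows a real increase.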

Data-driven science is effective at building consensus and perhaps at convincing others, but it is ineffective at reasserting and refining existing theoretical foundations, or at challenging, deconstructing, and proposing alternative frameworks for the growth of theory. This is why interdisciplinary education can be so important: in the methods that form the border between two disciplines, assumptions from both fields are at play, and those assumptions may be convenient but not necessarily correct. In this way, I believe a discipline’s dogma is more plastic in interdisciplinary study.

So why would that belief be relevant to a discussion about careers? As you will see, the extent of computer literacy determines perception, but more importantly it assists communication without words: with graphics, numbers, visualizations, or websites. System complexity and computer literacy go hand in hand. Technologies used to create in that way transcend the advertised role of data science as a blunt instrument of fact.

Scientists and Analysts

Bioinformaticians don’t hold the monopoly on data analysis. Most scientists can do simple and frequent regression and hypothesis testing in Excel and have a good intuition for likely vs unlikely. More important, however, is the way that scientists are in tune with testing often and keeping hypotheses simple. In contrast, bioinformaticians may feel overwhelmed by system complexity instead of making simple questions pay off.

Years of advanced statistical training matter much less when you’re side by side with wet-lab scientists who are experts in the hardware and technological trends of their science, and who test hypotheses and develop methods regularly.

The opportunity for bioinformaticians to market their abilities stems from a few key transactions that can be easy for them to monitor:

choice of software and parameters
intuition about calculation speed and bottlenecks
documentation on caveats
calculation fundamentals and competing models

Perhaps the other, more often neglected opportunity is to give scientists positive examples where the calculation informed a decision, or negative examples where the information was misinterpreted.

That said, some bioinformaticians might not even find themselves relating to the next two categories I outline in this article. They might find they relate closely with the scientists and prefer to do light, simple, and elegant modeling tasks with JMP, Excel, or similar. An issue facing those bioinformaticians is competition and job security. Some might read that as opportunity, instead of a warning about insufficient coding abilities. It’s true that some bioinformaticians might prefer job security over other things, but there is opportunity for researchers who can brave the waters of instability with a dash of reproducibility.

Other issues facing bioinformatic analysts include their reliance on others for access to databases, applications, datasets, and available models. Graphical user interfaces often go hand in hand with lower throughput and longer processing times than what is possible at the command line. All that is needed to offset this is attentiveness to the customer and to the convenience of the calculation platform, attention to the reproducibility challenge, and a good working relationship with professionals who can process or ‘munge’ data.
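One modest way to pay attention to the reproducibility challenge, even when the analysis itself happens in a graphical tool, is to write down what was run and on what. The sketch below records the tool, its version, the parameters, and a checksum of the input in a small JSON manifest. The function name, the field names, and the example tool are hypothetical; adapt them to whatever your team already tracks.

```python
import hashlib
import json


def run_manifest(input_bytes, tool, version, parameters):
    """Return a JSON record describing one analysis run.

    All field names are illustrative assumptions, not a standard.
    """
    record = {
        "tool": tool,
        "version": version,
        "parameters": parameters,
        # The checksum ties the record to the exact input that was used.
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
    }
    # Sorted keys keep the manifest stable, so two runs are easy to diff.
    return json.dumps(record, sort_keys=True, indent=2)


if __name__ == "__main__":
    fasta = b">seq1\nACGT\n"  # stand-in for the contents of a real input file
    print(run_manifest(fasta, "example-aligner", "0.0.0", {"gap_penalty": -1}))
```

Saving a manifest like this next to each result makes it possible to answer, months later, which software and parameters produced a given figure.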

Data Scientists

If you believe you have better than average software abilities, you might find that there is a great deal of insight to gain from existing datasets. Indexed, organized datasets could quickly yield insights, models, and application ideas, were there only the personnel to harness them. The data science specialty is often touted as the game-changing new specialty, yet it’s a rebranding of existing analytical and statistical specialties.

Data scientists’ mathematical and software skills let them retrieve information, develop models, tweak parameters, or derive novel metrics and computational solutions. Report generation may be an interesting subfield with the rise of literate programming frameworks like Jupyter and R Markdown. These skills may contrast strongly with those of the scientist or analyst, but can be especially helpful if web development and form generation are in a data scientist’s wheelhouse.

Data scientists can process data with extraordinary throughput, but are in some ways more susceptible to becoming part of the machine that builds upon existing models or paradigms without actually challenging them. The temptation to appear as a data ‘authority figure’ may create antagonism among traditional bench scientists, analysts, and this ‘new’ category of technically inclined computer experts.

The great power of analytics means that data scientists need to feel comfortable with older hardware wherever possible, and to be candid about their needs for collaborations, datasets, or interactions so they might share what they have built. The results-oriented nature of their work and the speed at which they produce results necessitate frequent presentation. Fortunately, there is one category of individuals even further removed from the truly integrated science ideal and more deeply embedded in the IT milieu.

Data Engineers

The final category is newer in its formalization, and has largely grown from the increased size and complexity of data representations. Data engineers have rich software engineering and computer science abilities that let them build critical infrastructure for others with a focus on stability, usability, and efficiency. Precomputing, algorithmic efficiency, and automation may be requirements of applications in their wheelhouse. Enterprise software is often expensive to develop and license; data engineers make a team responsive to its own computational needs in a way that scientists and analysts often cannot. They are invaluable teammates and often provide scripts, REST APIs, databases, automated systems, web applications, and high-performance computing to the teams that they are directly or indirectly supporting.

Unfortunately, they closely resemble IT and may face budgetary restrictions that do not mesh with the throughput- and results-driven design of data engineering teams or of the teams that integrate their products. Moreover, salesmanship may be challenging for data engineers, who may favor simple and elegant application designs and might dread working with designers. Seeking informal and formal feedback on design and on cost may be an important skill for data engineers who want to build credibility inside and outside their teams.