The Pragmatic Data Scientist

5 insights into commercial Machine Learning that all Data Scientists should be aware of.

This article is meant for data scientists, managers of data science teams, and stakeholders. It is my hope that these insights will give you a broad understanding of the challenges in the field of AI, and make you reflect on how to develop pragmatic data science solutions.

TL;DR (A summary of the 5 insights):

In the first insight, “Artificial Intelligence is a moving target”, I take a look at how the AI industry is affected by the fact that there is no clear-cut definition of what AI is. I argue that because of the loose definitions of what constitutes an AI system, marketers rather than engineers decide what is branded “AI”.

In the second insight, “Engineering is the bottleneck”, I argue that the vast majority of companies need AI engineers, not AI scientists. This is because most companies are better off developing pragmatic solutions based on existing research than conducting their own research. I also discuss solid engineering principles that can make or break industrial AI systems.

In the third insight, “Deduction is preferable to induction”, I argue for the importance of creating simple solutions before bringing out the big guns. Simple solutions have a number of desirable properties, such as constituting a good baseline and being easy to maintain. I argue why data scientists should favor a top-down approach over a bottom-up approach and strive to develop a data-independent solution when possible.

In the fourth insight, “Manipulate the problem-model-data-metric equation”, I go into detail about what distinguishes pragmatic data science in the industry from academic and competitive data science. I argue that because industrial engineers can change more of the variables of the equation, they must master a broad range of skills, ranging from data collection techniques to stakeholder management.

In the fifth insight, “Engineering means making the right trade-offs”, I argue that pragmatic engineers must strike a balance between opposing desirable properties of any solution. I explain how the end goals of industrial data science teams differ from those of academic and competitive data science teams.

These insights were originally posted on https://www.botxo.co/blog/.

Discuss on Hacker News: https://news.ycombinator.com/item?id=20261274

Insight 1: Artificial Intelligence is a moving target

In the industry, AI is whatever you want it to be. Internalizing this insight will help you navigate the space of AI-startups and projects.

What is AI?

Artificial Intelligence is defined as any computer system that demonstrates “intelligence”. Since we have no formal definition of what constitutes intelligence, the scope of AI is continuously shifting. Our idea of what constitutes an intelligent computer system is shaped by our perception of human intelligence. Thus, a computer system seems to be artificially intelligent if it:

Makes complex decisions that cannot easily be explained by simple rules.

Is uninterpretable, so that the system makes decisions without being able to explain the underlying reasoning.

Replaces humans, in the sense that the system does a job that used to require human intelligence, or that we could imagine humans doing.

Is novel, meaning that the system does something that we are not used to seeing computers do.

Why is the definition so imprecise?

Systems can potentially be considered to be based on AI if they meet just one of the above criteria. This leaves room for a lot of creative interpretation.

For example: By defining AI as something that replaces human labor, we end up being able to reframe almost all problems as AI problems. Librarians are still indexing books manually, and thus, topic modeling is often considered to be a kind of artificial intelligence.

Because of different, sometimes conflicting interests, the definition of what constitutes an AI system is highly dependent on context.

In some cases, non-AI systems are miscategorized as intelligent because of a genuine lack of understanding of the field. Because machine learning is a relatively novel discipline and the terminology is in constant flux, misunderstandings are unavoidable.

In other cases, the miscategorization stems from deliberate marketing strategies. This is not only the case for external marketing: Projects might be branded as AI internally, inside organizations, for political reasons.

Perhaps most bizarre is the tendency to shoehorn machine learning into solutions that would be better served by traditional methods. This is a consequence of the fact that the AI label can be an advantage both when selling the project (internally as well as externally) and when later using the case for internal and external branding.

The consequences of the loose definition

This loose definition of AI leaves room for a lot of interpretation, which has had some interesting consequences in the industry.

Aspiring data scientists must develop a clear understanding of the discrepancy between academic and practical, industrial terminology. According to Rachel Thomas from fast.ai, one of the most common complaints of data scientists is that they don’t get to build AI systems because the company they work for requires basic digitalization, not advanced machine learning. When investigating job opportunities, it is therefore important to make sure that your talent is aligned with the actual needs of the organization rather than with its (external and/or internal) branding.

Managers looking to expand their data science team must be well informed about the actual problems that new data science roles need to fill. As I argue in the second insight, “Engineering is the bottleneck”, the need for machine learning engineers often exceeds the need for data scientists. Managers must also be wary of candidates who leverage the novelty of the field to pose as more senior than they actually are. This tendency is described in detail here and here.

As is commonly known in the industry: Investors looking to jump on the AI train are too easily sold on solutions that are merely branded as intelligent.

The curious case of RPA

As an example of software that has been rebranded as Artificially Intelligent, consider Robotic Process Automation (RPA) software. RPA systems are desktop applications that allow non-programmers to automate business tasks by creating desktop-level macros. Desktop level automation has existed for many years (the “Automator.app” application has shipped with macOS since 2005).

By rebranding desktop-level macros as “Robots” (The “R” in RPA), the industry is capitalizing on the recent AI hype. Since most non-programmers operate at the desktop level, RPA is considered to be a tool to replace knowledge workers, leading to software license prices that compete with salaries.

RPA systems are undoubtedly tremendously valuable in some specific situations. However, they have little to do with what academics consider to be AI. This fact does not matter much when selling the systems. As with any product, the true value of “AI”-systems is solely determined by the value perceived by the end-users.

AI is whatever you want it to be

To sum up: While navigating the field of AI, it is important to remember that AI is a moving target and is whatever you want it to be. Internalizing this can help data scientists, managers, and investors avoid a lot of common pitfalls. Knowing when to use the loose definition to one’s advantage can be the difference between a project being undersold or oversold.

Insight 2: Solid engineering is the bottleneck because complexity accumulates

Dealing with the complexity of the real world is more difficult, but also more valuable, than solving singular novel problems. I argue that the vast majority of companies are in need of AI engineers, not AI scientists. I also discuss solid engineering principles that can make or break industrial AI systems.

The mess that is engineering

In commercial software engineering and data science, most problems are best solved by applying solid engineering principles, rather than relying on novel research techniques.

Unfortunately, many laymen and managers confuse software engineering with scientific research. The day-to-day work of a software engineer is imagined as a conveyor belt of complex problems that need to be solved. Once a problem is solved, the engineers move on to the next one. The best engineers are the ones who can solve the most difficult problems the fastest. I believe this view represents a misunderstanding of the field.

Only a very small percentage of companies deal with deep, novel research problems that can be dealt with in isolation. Rather than dealing with the complexity of individual problems, software engineering usually revolves around tackling the accumulation of complexity from past solutions.

Software engineering is like maintaining a box of wires

Software engineering is more like maintaining a giant box of wires. The wires must be connected in a certain way for the box to work, and every day, the requirements change slightly. Over time, more and more wires are put in, and it becomes increasingly difficult to grasp the entire system. This is why the progress of software projects stagnates: You have to take yesterday’s solution into account when solving today’s problem. Solving multiple small problems at the same time is by no means easy: Dealing with the complexity of the real world is often harder than tackling singular, novel problems.

The problems regarding the accumulation of complexity also hold true in data science. Complex procedures quickly add up, effectively slowing down every subsequent change. For data scientists, day-to-day work means dealing with practical issues such as legacy data formats, algorithms from different frameworks and languages, and the mess that is GPU programming. It requires good engineering skills to set up stable ETL pipelines and to deal with out-of-core algorithms. Choosing the right tool for the job often requires broad experience with many tools.

The value of broad knowledge

Because complexity in software projects accumulates, good initial decisions are tremendously valuable. Going down the wrong path at the beginning of a project can cost weeks of valuable development time. You can even avoid building entire projects by knowing the right open source tools, so knowing the right tool for the job is invaluable.

This means that having a broad knowledge of most tools and frameworks often beats specialized knowledge of a particular algorithm. This is completely opposite to what is important for an academic career: Deep knowledge of a narrow topic is actively encouraged. In academia, you might spend hours solving an old problem in a new way. For data science in the industry, this is usually a waste of valuable time.

However, it should be mentioned that shallow knowledge with no depth is useless. It goes without saying that you should obtain the theoretical background (e.g. statistics, linear algebra, and computer science) needed to dive deep into these topics when necessary.

To gain a broad knowledge of machine learning algorithms, I encourage you to read through the excellent documentation of the scikit-learn (“sklearn”) library. Because the library contains solutions to a lot of common industry problems, it is an excellent reference for a broad understanding of the field.

If you have only worked on academic classification problems, you might never have had to deal with open-world assumptions. Therefore, you might not know anything about outlier or novelty detection, failing to realize that it is a vital part of the problem you are trying to solve.
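To give a concrete taste of what that part of the problem looks like, here is a minimal novelty-detection sketch using scikit-learn’s IsolationForest. The data is made up purely for illustration:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))  # the "known world"

detector = IsolationForest(random_state=42).fit(X_train)

X_new = np.array([
    [0.1, -0.2],  # resembles the training data: expected inlier
    [6.0, 6.0],   # far outside the known distribution: expected outlier
])
print(detector.predict(X_new))  # 1 = inlier, -1 = outlier/novelty
```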

Because of the ever-growing number of freely available tools, and their increasing complexity, choosing the right tool is more art than science and requires a lot of skill and experience.

The shortage of developers

Unfortunately, the world is experiencing a severe shortage of skilled software engineers. Software is useful whenever automation is possible. Typical businesses have thousands of problems lying around, waiting to be tackled with automation. Until no human is doing work that a machine could do, there will be a need for software engineers. Unfortunately, good software engineering is as much an art as a science, so it is very difficult to teach in a traditional university setting.

The shortage of skilled software developers has had the unfortunate consequence that the bar for what constitutes a software engineer has become quite low. Since the field is in constant flux, and since solutions are growing ever more complex, it is difficult for non-engineers to evaluate candidates. Even the least skilled engineers can find work, meaning that many companies suffer severely from solutions that look like the proverbial box of wires described above.

I encourage you to fight the good fight: Untangle the mess — remember that complexity accumulates.

Conclusion

I believe that the vast majority of companies need AI engineers, not AI scientists. This is because most companies are better off developing pragmatic solutions based on existing research rather than conducting their own research. It is simply not cost-efficient for most companies to invest in novel technology when so much cutting-edge research is freely available, ready to be implemented. I also argue that it is more difficult to become a good engineer because engineering skills are harder to quantify than science skills and harder to teach in a classical university setting.

Insight 3: Deductive reasoning is preferable to inductive reasoning

The simplest solution should be tried first, because it will set a baseline and it is often the optimal solution.

I believe data scientists should favour a top-down approach over a bottom-up approach and strive to develop data-independent solutions when possible.

A primer on the philosophy of logic

There are two ways to reason about any problem.

Deductive reasoning, or the top-down approach, starts with premises and applies general principles to reason about specific instances. As long as the premises hold, the conclusion is true.

Given the premise “Socrates is a man” and the principle “All men are mortal”, we can deduce the knowledge “Socrates is mortal”.

Inductive reasoning, or the bottom-up approach, starts with observations of specific instances and uses them to construct general principles. If the observations are representative of the general case, the conclusion is likely to hold.

If we have observed thousands of swans, and every single one of them was white, we can construct the hypothesis that “All swans are white”.

In the context of software

Most problems in software are tackled with a deductive approach. Our code expresses general rules that can tackle specific cases.

For example, y = 2 * x is a general rule that can tackle all kinds of specific instances of x.

It’s only recently that machine learning has paved the way for a set of induction-based solutions. While statisticians have used the bottom-up approach for ages, machine learning practitioners’ pragmatic, non-mathematically-rigorous, big-data approach has become a whole new paradigm.

Instead of writing y = 2 * x, we provide the computer with specific examples (e.g. “If x is 3 then y is 6”) and use machine learning algorithms to construct models that are generally applicable.
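To make the contrast concrete, here is a minimal sketch of the same toy problem solved both ways. The choice of a linear model is an illustrative assumption; any regressor would do:

```python
from sklearn.linear_model import LinearRegression

# Deductive: we write the general rule ourselves.
def y_deductive(x):
    return 2 * x

# Inductive: we provide specific examples and let the algorithm
# construct the general rule from them.
X_examples = [[1], [2], [3], [4]]
y_examples = [2, 4, 6, 8]
model = LinearRegression().fit(X_examples, y_examples)

print(y_deductive(5))            # 10, exact by construction
print(model.predict([[5]])[0])   # ~10.0, learned from the examples
```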

The software 2.0 stack

Andrej Karpathy, the Director of AI at Tesla, introduced the term “Software 2.0” to describe what I like to call “inductive programming”. I prefer the term “inductive programming” because, unlike “Software 2.0”, it does not imply that this type of programming is superior to deductive programming (aka “Software 1.0”).

I believe the “software 1.0 and 2.0” terminology is the product of the recent hype surrounding AI. The choice between deductive and inductive programming should depend on the problem at hand and the available tools — one is not superior to the other.

The cons

The reason for the boom in induction-based programming is not that it is inherently better than the deductive approach, but that it yields great performance in domains where computers used to lag behind humans. Visual, auditory, and textual recognition used to be human-only domains, but induction-based methods have made computers catch up quickly.

As a data scientist, it is easy to forget, but important to remember, that induction-based approaches have several inherent disadvantages. By neglecting to acknowledge these limitations, we risk using induction-based approaches when a deductive approach is the better choice.

The disadvantages include:

Collecting enough data is time-consuming and cumbersome.

Making sure that the data is representative is difficult.

Managing the collected data takes time and resources.

Reasoning based on the data is difficult.

Computing the solution is more computationally expensive.

The results always carry a degree of uncertainty.

Reconfiguring the problem based on new knowledge is cumbersome.

Since you are not designing the rules, you miss out on insights otherwise gained from iteratively working on the problem.

For these reasons, deductive solutions are almost always easier to implement than their inductive counterparts. Building inductive solutions might get easier with the introduction of new tools, but the inherent problems of induction cannot be mitigated.

An incremental approach

Inexperienced data scientists often approach problems with the most advanced tools available, neglecting deductive methods entirely. I encourage data scientists to adopt an incremental approach, moving from deductive solutions towards more inductive approaches. I believe that one of the key responsibilities of data scientists is to determine which kind of solution to use for a given problem.

The following progression has proven useful when explaining this approach to laymen. The explanation goes something like this, moving from the simplest approach to the most complex:

We start with plain old code (Hard-coded). When this becomes too inflexible, we let the user specify the rules themselves (Rule-based). When this shows poor performance, or becomes too complex to maintain, we gather data and create a statistical model (Statistics). If the problem is too complex for common statistics, we sacrifice mathematical rigor and focus solely on the performance of our models (Shallow Machine Learning). If the problem at hand is very complex, we stack multiple models on top of each other (Deep Learning).

Deductive solutions should always serve as baselines for more sophisticated approaches. If induction-based approaches cannot beat the baseline, there is no reason to move on. It is often the case that the best solution to an industry problem lies at the beginning of this process.
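As a minimal sketch of baseline-first thinking, the following compares a trivial baseline with a more inductive model on a standard dataset. The dataset and model choices are illustrative assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# The simplest possible "solution": always predict the most common class.
baseline = DummyClassifier(strategy="most_frequent")
model = RandomForestClassifier(random_state=0)

print(cross_val_score(baseline, X, y).mean())  # the bar to clear
print(cross_val_score(model, X, y).mean())     # only justified if clearly higher
```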

By using an incremental approach, we save time and avoid introducing unnecessary complexity to our systems. Even if inductive approaches yield better performance, deduction might be preferable due to the ease of interpretability and reconfiguration.

Conclusion

I argue for the importance of creating simple solutions before bringing out the big guns. Creating simple solutions has a number of desirable properties, such as constituting a good baseline and being easy to maintain. In data science, the top-down approach is usually simpler than the bottom-up approach; thus, engineers should strive to develop a data-independent solution when possible.

Insight 4: Manipulate the problem-model-data-metric equation

In industrial data science the best solution is often reached by manipulating the problem, the metric or the data rather than the model.

Industrial, academic and competitive data scientists

Machine learning research is moving forward at a rapid pace. Industrial data scientists have to stay up to date with the newest academic articles if they want to use cutting-edge methods. Because of the close ties to academia, it is a common mistake to copy the methodology of researchers when working on problems in the industry. This leads to suboptimal solutions because the goals of academia differ from those of the industry. Likewise, data scientists who used to participate in competitions, such as the ones hosted by Kaggle, are biased towards a methodology that is suboptimal for data science in the industry.

The 4 components of a data science solution

In academia and competitions, almost all resources go into improving models. In the industry, however, we are often allowed to manipulate all 4 components of the final solution: The problem, the model, the data and the metric.

• The problem is the central issue that we are trying to solve. Problem descriptions should immediately reveal the type of problem, e.g. classification, regression or outlier detection.

An example of a problem description could be: “Given a sentence from a legal document, determine how likely it is that the sentence contains sensitive information.”

• The model is the algorithm or combination of algorithms that are developed to solve the problem, as well as their settings and hyperparameters. For most projects, the model that achieves the best performance is the right choice. However, properties such as interpretability, prediction time or theoretical bounds on the rates of convergence also come into play when choosing a model.

• The data is the datasets used for training and evaluation. Datasets should be representative of the problem, contain little or no noise, and, most importantly, they should be large. Usually, datasets have to be pre-processed before they are ready to be used by the model.

• The metric is the expression used to evaluate how well models are solving the problem. The metric must be automatically computable, even though this often means that we must sacrifice how well the metric fits the problem. Consider problems for which multiple solutions exist, such as text summarization; automatic evaluation of such problems is far from trivial.

How to manipulate each component of the equation

Because the problem, the data and the metric usually cannot be changed in academic or competitive contexts, industrial data scientists often forget to manipulate these components. In the following section, I will elaborate on how data scientists can manipulate each of the 4 components to their advantage.

Manipulating the problem

In competitions, the problem is always to reach the highest measurement on the test data. This cannot be changed. Even if the metric does not accurately measure the intention of the organizer of the contest, the problem does not change. Competitors are allowed to exploit such conflicts of interest.

In a competition where the goal is to determine the price of a taxi fare, competitors might exploit the fact that the metric does not penalize some guesses sufficiently.

In academia, researchers can sometimes change the problems slightly to their advantage, but only problems that can be applied to a wide range of applications are of interest.

Academics will attempt to tackle problems such as assigning a Part-of-Speech category to words in input sentences. In an industry context, problems will usually be domain-specific, such as extracting significant named entities from certain types of legal documents.

In the industry, however, the initial problem should often be modified or completely changed. Many problems posed by laymen should be skipped altogether. Experienced data scientists avoid problems that are impossible to solve and manage the expectations of stakeholders.

Problems can be modified to improve commercial viability in 3 different ways:

1. Narrowing the problem

The initial problem description and the tools available sometimes tackle a problem that is too general. By changing the problem to a narrower use case, the variance of the input data is reduced. It follows that the final solution will achieve equal or better performance.

Rather than building a general document classification system, you might want to build a classification system for documents in a particular domain, such as legal documents.

2. Switching the problem

The initial problem might not correlate with the goal of the end-user. By changing the problem to provide maximum value to the end-user, you might achieve a better perceived performance.

You might realize that your classification problem should assume an open world instead of a closed world, or that it should be transformed into multiple binary classifications with confidence scores, as sketched below.
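Here is a minimal sketch of such a switch, turning a closed-world multiclass problem into per-class binary classifications with confidence scores. The dataset and the confidence threshold are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# One confidence score per class. If no class reaches the chosen
# threshold, the sample can be rejected as "none of the known classes"
# (a simple open-world escape hatch).
probabilities = clf.predict_proba(X[:1])
print(probabilities)
print((probabilities > 0.8).any())  # confident in at least one class?
```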

3. Solving an easier substitute problem

The initial problem might be so difficult that you cannot achieve sufficient performance. Thus, you must settle for solving a less complex problem.

Manipulating the metric

In competitions, the problem is always to reach the highest score on the test data. Thus, the metric has a perfect problem fit and should never be changed.

In academia, the metric is usually defined by previous research that you wish to improve upon. Metrics in academia are usually simple and standardized. The metric will often make solutions seem more capable than they actually are. This is because of an interesting feedback loop:

First, a group of academics writes an article that on the surface seems to achieve impressive results. Then the article gains attention from the industry, and thus from other researchers. Finally, in order to compare results, other researchers must use the same metric. Thus, metrics that make solutions seem impressive on the surface are more likely to remain in use.

Consider the metric used to evaluate POS-taggers. The standard metric is word-level accuracy, and state-of-the-art solutions reach an impressive ~95% accuracy for English. However, 95% word-level accuracy means that 1 in 20 words is classified incorrectly. The result is that every other sentence in a text contains a misclassified word. Considering how many words have unambiguous POS tags, the state-of-the-art performance does not seem so impressive.
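A quick back-of-the-envelope calculation supports this. The average sentence length of 15 words is my own assumption:

```python
word_accuracy = 0.95
avg_sentence_length = 15  # assumed average sentence length

# Probability that an entire sentence is tagged correctly.
p_sentence_correct = word_accuracy ** avg_sentence_length
print(round(p_sentence_correct, 2))  # ~0.46: more than half of all
                                     # sentences contain at least one error
```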

In the industry, the best metric is usually not one of the academic standards. Metrics should aim to measure the value provided to end-users. Inventing a scoring method that reflects user value often requires weighting several domain-specific heuristic measurements. In the industry, we often need to sacrifice mathematical simplicity and comparability with previous results to create a metric with as close a problem fit as possible.

Non-standard metrics can be created as a weighted combination of:

Allowing multi-category assignments for classification tasks.

Heuristic scoring of different categories, so that errors in some categories are penalized differently.

Evaluating the model based on training error as well as evaluation error.

For example, in life-or-death scenarios such as cancer treatment, you might want to penalize false negatives much more heavily than false positives.
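As a minimal sketch, such a weighted metric might look as follows. The weights are illustrative assumptions; the scorer plugs into scikit-learn’s model selection tools:

```python
import numpy as np
from sklearn.metrics import make_scorer

def weighted_error(y_true, y_pred, fn_weight=10.0, fp_weight=1.0):
    """Domain-weighted error: false negatives cost 10x false positives."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    false_negatives = np.sum((y_true == 1) & (y_pred == 0))
    false_positives = np.sum((y_true == 0) & (y_pred == 1))
    return fn_weight * false_negatives + fp_weight * false_positives

# Lower is better, hence greater_is_better=False; the resulting scorer
# can be passed to cross_val_score, GridSearchCV, etc.
scorer = make_scorer(weighted_error, greater_is_better=False)
```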

I advise data scientists to put a great deal of effort into developing and refining their metric to reflect the value provided to end-users. Maintaining a representative metric is similar to building good automatic tests for traditional software: It allows teams to continually refactor solutions, providing confidence that the project is constantly progressing, never moving in the wrong direction.

Manipulating the model

In academia and competitions, the model is usually the only component of the equation that is allowed to be manipulated. Thus, all resources are spent on this component. While academics attempt to invent new models that can be applied to multiple domains, competitive and industrial data scientists focus on improving the models from academia by exploiting the specific constraints of narrow use cases.

In academia, simple models are preferred because they are generally applicable.

In competitions, the simplicity of models does not matter. In fact, a simple model implies that the solution is easily achieved and thus uncompetitive. Winning solutions usually stack multiple models in complicated ensembles to squeeze out every last bit of performance. In other words: Nobody cares how the sausage is made, any hack that improves the performance of the model is viable.

In the industry, model simplicity improves maintainability and decreases turn-around time. This is especially important for applications for which the dataset changes over time because maintainers might want to continually adjust the model accordingly.

In the industry, the choice of model can also depend on additional requirements and trade-offs such as:

Interpretability of results: Interpretability is often very valuable in the development process and is even sometimes a strict requirement.

Prediction time: In many cases, slow prediction times will result in a poor experience for end-users.

Training time: In some cases, end-users train models on their own data, making training time a trade-off.

Zero training error: In a few cases, it is a requirement that the model performs with zero training error.

Because models are the primary subject of interest for academics and competitors alike, there is a wide range of great standard models available in the public domain. It is therefore often unnecessary for industrial data scientists to spend resources on manipulating this component. Most of the hard work is already provided by academia and open source contributors.

Novice data scientists in the industry tend to fall into the trap of spending excessive resources adjusting models. I speculate that this mistake is a result of the following:

Models are the primary subject of interest for academics and competitors alike, so there is a lot of learning material available on how to manipulate them.

Mistakenly using the methodology taught in academia and used in competitions.

Models are considered the “science” of data science and usually require little “engineering”. Data scientists with little practical experience will have a hard time dealing with the complexity of engineering.

Manipulating the data

Andrej Karpathy, the Director of AI at Tesla, has shared estimates of how much of his time went into working on models versus working on data during his PhD and while working at Tesla. In academia, most of his time was spent on models; at Tesla, most of it is spent on data. This illustrates one of the main differences between working as a data scientist in academia and in the industry.

The point is an important one: The bottleneck in commercial AI is data, not algorithms. The data component is where industrial data scientists should spend the majority of their time. Data Visualization and Exploratory Data Analysis (“EDA”) are probably the most important subfields to master to develop good data science solutions. Even in academic and competitive settings, where directly manipulating the data is disallowed, Data Visualization and EDA provide tremendous value.
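As a minimal sketch, a first EDA pass over a new dataset might look like this. The file name is hypothetical:

```python
import pandas as pd

df = pd.read_csv("dataset.csv")  # hypothetical file name

df.info()                 # column types, sizes, and non-null counts
print(df.describe())      # distributions of the numeric columns
print(df.isna().mean())   # fraction of missing values per column
df.hist(figsize=(10, 8))  # quick visual overview of distributions
```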

In academia, data scientists usually work with clean, standardized datasets. The datasets are easily interpreted and have a simple structure. Datasets are usually huge, allowing researchers to experiment with large, complex models.

Data visualization and EDA are often completely overlooked in academia, as fitting solutions to datasets makes them less generally applicable. However, I recommend that researchers spend time on EDA regardless, as data insights can often inspire novel ways to solve general problems.

In competitions, features are often secret, undocumented or simply difficult to interpret. This makes data visualization and EDA even more important.

In the industry, the structure of the dataset is often quite complex. The complexity stems from relationships between columns, correlated or untrustworthy data sources, missing data, noise, and outliers.

The data used to evaluate the standard models provided by academia rarely reflects common business use cases. For some problems, training and test data is inherently different, making it difficult to avoid biased models.

For many domain-specific problems, little or no data is available until users start providing the service with data. This is usually referred to as the “chicken and egg problem” or the “cold start problem”.

When little or no data is available, industrial data scientists have to find ways to obtain more. Being able to effectively obtain more data is often what sets experienced industrial data scientists apart from novices. Common methods include the following (a minimal annotation sketch follows the list):

Finding more data from open public sources.

Annotating data internally. Often, the economics of manually labelling data pays off.

Writing a good annotation description and using paid human-labour services such as Mechanical Turk.

Building graphical user interfaces for data annotation and providing users with incentives to perform the labelling.
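As a minimal sketch of the internal-annotation route, the loop below collects labels from a colleague at the terminal. The sentences, file name, and label scheme are hypothetical; a real setup would use a proper labelling tool:

```python
import json

# Hypothetical unlabelled sentences from the legal-document example.
unlabelled = [
    "The parties agree to the terms set out in Appendix A.",
    "The claimant's social security number is listed below.",
]

with open("annotations.jsonl", "w") as f:
    for text in unlabelled:
        answer = input(f"Sensitive? [y/n] {text!r} > ")
        row = {"text": text, "sensitive": answer.strip().lower() == "y"}
        f.write(json.dumps(row) + "\n")
```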

The benefits of knowing the difference between different types of data science

It is important for industrial data scientists to be aware of the different goals of academia, competitions and the industry. I believe that internalizing these differences will help data scientists:

Remember to modify not only the model, but also the problem, the data, and the metric.

Determine which components to spend the most resources on manipulating.

Avoid false impressions of the performance of academic results.

Yield value by exploiting the gap between academic models and domain-specific implementations.

In recent years, data scarcity has led the field to use a variety of techniques that leverage patterns in datasets other than the target data, such as transfer learning and multitask learning. Weak supervision is also a promising technique in fields where data scientists have enough intuition about particular problems to define reasonably accurate labelling functions.

Data scientists should interpret results after manipulating the whole pipeline. This helps in choosing which components to manipulate in subsequent iterations. More often than not, working on the data component yields the best development-time/performance economics. I encourage you to use exploratory methods such as SHAP to interpret the results of the entire pipeline. Investigating feature importance can aid the development process and can sometimes even provide valuable information to end-users.
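As a minimal sketch, a SHAP analysis of a fitted model might look like this. The dataset and model are illustrative assumptions; TreeExplainer requires a tree-based model:

```python
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = RandomForestRegressor(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Global view: which features drive the model's predictions overall?
shap.summary_plot(shap_values, X)
```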

Conclusion

I have described what distinguishes pragmatic data science in the industry from academic and competitive data science. In industrial data science, engineers can choose not only the algorithms used to solve a particular problem, but also what data the algorithm uses for training. In many cases, it is advantageous to solve a different problem than the one originally formulated and to choose a different measurement of the feasibility of the solution. I argue that because industrial engineers can change more of the variables of the equation, they must master a broad range of skills, ranging from data collection techniques to stakeholder management.

Insight 5: Engineering means making the right trade-offs

Pragmatic engineers make conscious decisions about the trade-offs between the desirable properties of data science projects. The end goals of industrial data science teams differ from those of academic and competitive data science teams.

6 properties to consider

Engineers must deal with the complexity of the real world. This often requires making trade-offs between several desirable properties of the final solution. Pragmatic engineers know this and make conscious decisions about what properties to prioritize.

I propose to consider the following 6 properties for data science projects:

Performance: The performance of the model on the evaluation dataset, as calculated by the metric.

Features: Desirable properties such as fast prediction time, fast training time, easy interpretability of results, and confidence intervals.

Development time: The time required to develop the solution.

Turn-around time: The time required to maintain the solution by making adjustments.

Computational requirements: The storage, memory, CPU, GPU, and TPU requirements.

Skill requirements: The level of education required for data scientists to contribute to the solution. Using uncommon technologies will require retraining time for new maintainers.

Put emphasis on the right properties

A common mistake of novice engineers is to put emphasis on the wrong properties. I suspect this mistake occurs because:

People with engineering-type personalities generally tend to focus on things that can be directly measured, such as computational requirements.

Novice engineers do not consider the final value perceived by end-users.

Novice engineers rely on the same prioritization as academic and competitive data science, for the reasons described at the beginning of the previous insight.

In the following section, I estimate how much engineers focus on each of the properties in academic, competitive and industrial contexts.

For each of the contexts, I have assigned two scores from 0 to 10: how many resources I believe should be spent on the property (“Needed”), and how many resources I believe data scientists actually spend on it (“Actual”). A discrepancy between these two numbers highlights common mis-prioritizations. The scores are obviously very subjective, and I encourage you to write to me if you disagree with them.

1. Performance

(Needed/Actual), Academic: (9/9), Kaggle: (10/10), Industry: (7/9)

Performance is the primary focus of all three contexts.

In academia, researchers only stop optimizing performance when it would result in overfitting to the domain, thus losing general applicability.

In competitions, the focus on performance is pushed to the limit to the degree that competitors are willing to sacrifice all other properties for tiny performance improvements. Competitors will only stop optimizing the performance of their solution when they cannot possibly achieve better performance.

While performance is certainly important in the industry, other properties are usually more important than data scientists realize, creating a discrepancy between the actual and the needed focus on performance. Pragmatic industrial data scientists should stop optimizing performance when the resources would be better spent elsewhere.

2. Features

(Needed/Actual), Academic: (6/5), Kaggle: (0/0), Industry: (6/4).

Features represent a kind of catch-all category that covers a number of desirable properties.

In competitions, features do not matter.

In academia, desirable features are used to argue for the feasibility of solutions with lower performance than competing methods. However, researchers tend to focus on directly measurable properties, so performance often takes focus away from features. Unlike in competitive and industrial contexts, in academia, the general applicability of a solution is itself considered a feature.

(Image caption: a button-laden classic remote control next to a minimalist Apple remote, an example of focusing too much on features and not on end-user experience.)

In the industry, features such as fast prediction times and interpretability are often hard constraints. In settings where they are not, they are often neglected, even though they can have a big impact on end-user experience. Pragmatic industrial scientists realize this and make conscious decisions about the trade-offs between different solutions.

Features of data science projects include:

Fast training time.

Fast prediction time.

General applicability.

Zero training error.

Interpretability of results.

Confidence intervals.

Novelty of the solution (very important in academia).

The ability to run on-device.

The ability to run on encrypted data.

I encourage you to let me know if I have missed any features on this list.

3. Development time

(Needed/Actual), Academic: (7/7), Kaggle: (7/7), Industry: (5/8).

Academics and especially competitors experience hard constraints on development time.

In the industry, it is usually best to trade fast development time for better turn-around time. This can only be achieved by doing everything conceivable to reduce the accumulation of complexity, as described in Insight 2. As in traditional software development, it is often beneficial to reduce development time by building a Minimum Viable Product. Novice managers and engineers tend to focus on the development time of solutions, neglecting turn-around time.

4. Turn-around time

(Needed/Actual), Academic: (3/2), Kaggle: (2/1), Industry: (10/5).

In the industry, turn-around time is usually the most important property to consider. Too few software teams realize this, resulting in solutions that slowly but steadily disintegrate into an intangible mess.

In academia, turn-around time does not matter as much. Once a paper is out, researchers move on to the next problem. However, since the longevity of software projects is usually grossly underestimated, I believe that most academics would benefit from prioritizing turn-around time by adopting proper engineering methods.

5. Computational requirements

(Needed/Actual), Academic: (2/2), Kaggle: (1/1), Industry: (2/5).

In most contexts, computational requirements do not matter much: Computing power is cheap relative to developer salaries. However, when huge amounts of data are available, or when data can be generated, the computational requirements for training set hard limits on the performance of models.

It is my experience that novice engineers mistakenly focus on computational requirements even though this property tends to have a relatively small impact on the feasibility of the final solution. It is usually better to trade off computational requirements for other properties. Pragmatic engineers know the importance of avoiding such premature optimization.

6. Skill requirements

(Needed/Actual), Academic: (1/1), Kaggle: (2/2), Industry: (6/3).

In academia, projects with high skill requirements are expected, and actively encouraged. Proving that you can use the most difficult techniques is a badge of honour, not a drawback.

In competitions, leveraging advanced techniques can give you the edge to outperform your competitors.

In the industry, however, using advanced techniques is a double-edged sword. While it can lead to outperforming the competition, it can also make scaling up the team difficult and expensive. If the same product can be built with lower skill requirements, that is a strictly better solution. Pragmatic engineers realize this and use the most common tools and frameworks when possible. If a commonly used, publicly available tool can do the job, it is usually preferable to both obscure publicly available tools and custom-built solutions.

The results of making conscious decisions about what property to prioritize

I find that there is little to no discrepancy between the needed and the actual prioritization of properties in competitions and academia. In the industry, however, mis-prioritization is the rule rather than the exception. The result is suboptimal solutions that do not provide maximum value to end-users.

In data science, as in engineering in general, one property of a project is often more important to end-users than all the others combined. This is similar to computational performance optimization, where engineers always determine the worst bottleneck first: Since that bottleneck likely has a larger overhead than all the others by a large factor, it is the only part of the solution worth optimizing. Pragmatic engineers realize this and avoid drawing a false equivalence between the importance of different properties.

Conclusion

I believe that pragmatic engineers must strike a balance between opposing desirable properties of the solution. I have explained how the end goals of industrial data science teams differ from those of academic and competitive data science teams, and I argue that in the industry, turn-around time is undervalued, while properties such as computational requirements and development time receive too much attention.

Thank you for reading through my list of insights. I am always happy to receive comments and critique, so feel free to reach out to me.

These insights were initially published on https://www.botxo.co/blog/