In this post, we mention the most important skills a good data scientist should acquire, according to semanti.ca's data scientists and machine learning engineers.

The text has two parts. We start with technical skills that come from statistics and computer science. The second part focuses on skills that are often called "soft": the ones a data scientist acquires by communicating with people and trying to optimize their work habits.

Regression

Linear Regression

Linear regression is the bread and butter of a data scientist. Many of the dependencies between variables (both dependent and independent) in real-world data are approximately linear. No wonder many data scientists start their exploration of a dataset by plotting pairwise comparisons of features and drawing the regression lines.
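
To make the idea of "drawing the regression line" concrete, here is a minimal sketch of ordinary least squares for a single feature. The function name `fit_line` and the toy data are our own assumptions for illustration; in practice one would normally rely on a statistics package.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: y ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the common 1/n factor cancels out).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy data that lies exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```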

Nonlinear Models

In cases where the dependencies between variables are non-linear, non-linear regression techniques such as polynomial regression, segmented regression, k-nearest neighbors, or neural networks can be used.

Classification

Logistic Regression

When the dependent variable is binary, Logistic Regression is very often an appropriate analysis to conduct. Logistic regression is often used to explain the relationship between one dependent binary variable and one or more independent variables. Types of questions that a logistic regression can answer:

How does the probability of getting a heart attack change for every additional mile run during a week or for every bag of chips consumed during a day?

How do the number of cigarettes smoked during a day or the time spent under direct sunlight influence the probability of getting cancer?
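
Questions like the ones above are answered by reading the coefficients of a fitted logistic model. The sketch below does not fit anything: the `intercept` and `coef` values are made up, and it only shows how a one-unit increase in an independent variable multiplies the odds by `exp(coef)`.

```python
import math


def predict_probability(x, intercept, coef):
    """Logistic model: P(y = 1 | x) = 1 / (1 + exp(-(intercept + coef * x)))."""
    return 1.0 / (1.0 + math.exp(-(intercept + coef * x)))


def odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)


# Hypothetical, made-up coefficients (not estimated from any real data).
intercept, coef = -2.0, 0.3

p_at_3 = predict_probability(3, intercept, coef)
p_at_4 = predict_probability(4, intercept, coef)

# One additional unit of x multiplies the odds by exp(coef), whatever the starting x.
odds_ratio = odds(p_at_4) / odds(p_at_3)
print(p_at_3, p_at_4, odds_ratio)
```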

Nonlinear Models

Just as regression has non-linear techniques, so does classification. The most widely used non-linear classification algorithms are Support Vector Machines and other kernel-based techniques. Kernels are mathematical functions that make linear models work in non-linear settings by mapping data to higher dimensions where it exhibits linear patterns. Once the data has been mapped to higher dimensions, we can apply the linear model in the new feature space.

Another popular non-linear method that works for both classification and regression is k-Nearest Neighbors, or k-NN. This is a non-parametric method. The prediction of the value of the dependent variable is made by taking the average over the nearest neighbors of the input example in the space of all data examples (in the case of regression) or the majority value of the dependent variable among the nearest neighbors (in the case of classification). The nearest neighbors of a data point are determined using a metric selected by the data scientist. Popular choices for such a metric are Euclidean distance and cosine similarity.
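
A minimal sketch of k-NN for both tasks, assuming Euclidean distance as the metric. The function names and the toy datasets are hypothetical and only serve to illustrate the averaging and majority-vote steps.

```python
import math
from collections import Counter


def nearest_neighbors(train, query, k):
    """Return the k training examples closest to `query` (Euclidean distance)."""
    return sorted(train, key=lambda example: math.dist(example[0], query))[:k]


def knn_classify(train, query, k=3):
    """Majority value of the dependent variable among the k nearest neighbors."""
    labels = [label for _, label in nearest_neighbors(train, query, k)]
    return Counter(labels).most_common(1)[0][0]


def knn_regress(train, query, k=3):
    """Average value of the dependent variable among the k nearest neighbors."""
    values = [value for _, value in nearest_neighbors(train, query, k)]
    return sum(values) / len(values)


# Hypothetical toy data: (features, value of the dependent variable).
labeled = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
           ((5.0, 5.0), "b"), ((5.2, 4.9), "b")]
numeric = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]

predicted_label = knn_classify(labeled, (1.1, 1.0), k=3)
predicted_value = knn_regress(numeric, (2.1,), k=3)
print(predicted_label, predicted_value)
```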

Discriminant Analysis

During a study of some phenomenon, the data scientist often needs to answer questions of the following form: "are the two groups of examples different?", "on what variables are the two groups most different?", or "can one predict which group a person belongs to using such variables?". Discriminant Analysis is helpful in answering such questions. It is used when the dependent variable is categorical.

The objective of Linear Discriminant Analysis is to develop functions that are linear combinations of the independent variables and that discriminate as well as possible between the categories of the dependent variable. It enables the data scientist to examine whether significant differences exist among the groups in terms of the predictor (independent) variables. It can also be used to evaluate the accuracy of the classification.

Resampling Methods

Resampling is a nonparametric statistical method that consists of drawing repeated samples from the original collection of data points (examples).

In its most common form, resampling involves a randomized selection of examples from the original data with replacement, in such a way that each sample is similar to the original data. Because of the replacement, a sampled collection can contain repeated data points.

Resampling generates a sampling distribution based on the actual data rather than on assumptions about its theoretical distribution. The quality of the resulting estimates depends on the original data being a representative, unbiased sample of the population studied by the data scientist.

Types of resampling

There are four major types of resampling.

Randomization exact test (also called "permutation test") is a tool for constructing sampling distributions. Similarly to bootstrapping, a randomization exact test builds sampling distribution by resampling the observed data. For example, the data scientist can shuffle or permute the observed data. Unlike bootstrapping, this is done without replacement.

Cross-validation. In cross-validation, the data is randomly divided into two or more subsets and test results are validated by comparing across sub-samples. Three types of cross-validation are distinguished by data scientists: simple cross-validation, double cross-validation and multicross-validation.

Jackknife, also known as the Quenouille-Tukey Jackknife or leave-one-out, is a step beyond cross-validation. In Jackknife, the same test is repeated by leaving one subject out each time. This procedure is especially useful when the dispersion of the distribution is wide or extreme scores are present in the data set. In such cases, it is expected that leave-one-out would return a bias-reduced estimation.

Bootstrap. Compared with the Jackknife technique, the resampling strategy in bootstrap is more thorough in terms of the magnitude of replication. Unlike the previous three techniques, the bootstrap employs sampling with replacement. Furthermore, in cross-validation and Jackknife, the size of the subsample is smaller than that of the original dataset, but in bootstrap, every subsample has the same number of examples as the original dataset. As a consequence, the bootstrap method has the advantage of modeling the impacts of the actual sample size.
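
As an illustration of the bootstrap, the sketch below resamples a small dataset with replacement, keeps every resample the same size as the original, and reads a percentile interval off the resulting distribution of means. The data, the number of resamples and the fixed seed are assumptions chosen only for the example.

```python
import random
import statistics


def bootstrap_means(data, n_resamples=2000, seed=0):
    """Means of resamples drawn with replacement, each as large as the original data."""
    rng = random.Random(seed)  # fixed seed only to make the example reproducible
    n = len(data)
    means = []
    for _ in range(n_resamples):
        resample = [rng.choice(data) for _ in range(n)]
        means.append(statistics.fmean(resample))
    return means


def percentile_interval(values, level=0.95):
    """Simple percentile interval read off the sorted bootstrap distribution."""
    ordered = sorted(values)
    lo_index = int((1 - level) / 2 * len(ordered))
    hi_index = int((1 + level) / 2 * len(ordered)) - 1
    return ordered[lo_index], ordered[hi_index]


# Hypothetical measurements.
data = [2.1, 2.5, 2.8, 3.0, 3.3, 3.9, 4.4, 5.0]
means = bootstrap_means(data)
low, high = percentile_interval(means)
print(low, high)
```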

Subset Selection

Subset selection refers to the task of finding a small subset of the available independent variables that does a good job of predicting the dependent variable. The goal of subset selection is to reduce the dimensionality of the data by removing irrelevant and redundant information. This allows, for example, the machine learning algorithm to operate more effectively; it also allows explaining different phenomena using fewer independent variables (simpler explanations are usually better for humans).

Shrinkage

Shrinkage, also known as regularization, changes the classification or regression problem in such a way that complex models are less likely to be selected. Usually, this is done by adding a penalizing term to the objective that the model minimizes. For example, in linear regression, here's how a penalizing term is added to the residual sum of squares:

$$ \min_{w} \; \|Xw - y\|_2^2 + \alpha \|w\|_2^2 $$

Here, \(\alpha \geq 0\) is a complexity parameter that controls the amount of shrinkage: the larger the value of \(\alpha\), the greater the amount of shrinkage and thus the coefficients become more robust to collinearity.
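
To see the effect of \(\alpha\), consider the simplest possible case of the penalized objective above: one feature and no intercept. Setting the derivative with respect to \(w\) to zero gives \(w = \sum_i x_i y_i / (\sum_i x_i^2 + \alpha)\). The sketch below (the function name and data are hypothetical) shows the weight shrinking toward zero as \(\alpha\) grows.

```python
def ridge_weight(xs, ys, alpha):
    """Minimizer of sum((w * x - y)^2) + alpha * w^2 for one feature, no intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)


# Hypothetical toy data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

# alpha = 0 recovers plain least squares; larger alpha shrinks the weight.
weights = {alpha: ridge_weight(xs, ys, alpha) for alpha in (0.0, 1.0, 14.0)}
print(weights)
```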

Dimensionality Reduction

Dimensionality Reduction is an umbrella term that includes multiple techniques. One such technique is Subset Selection, which we saw above. Dimensionality reduction can be used to achieve multiple goals. One may want to reduce the number of independent variables in an example to:

Visualize datapoints in 2D or 3D;

Increase the speed (and accuracy) of a regression, classification or clustering algorithm;

Cluster examples based on the value of each independent variable resulting from the dimensionality reduction procedure.

The most practically useful dimensionality reduction techniques are Principal Component Analysis, Autoencoder and UMAP.

Tree-Based Methods

Decision trees and their ensemble versions, Random Forest and Gradient Boosting, are among the most widely used techniques in machine learning. They handle both classification and regression problems well, with either linear or non-linear dependencies in the data.

A decision tree is a tree where each node represents a feature (an independent variable), each branch represents a decision (rule) and each leaf represents an outcome (a categorical or continuous value of the dependent variable).

Unsupervised Learning

Clustering

Clustering is the problem of finding groups of similar examples in the dataset. The examples are similar according to some similarity metric. This metric is chosen by the data scientist.

k-Means clustering algorithm partitions data into \(k\) distinct clusters based on distance to the centroid of a cluster.

Hierarchical clustering builds a multilevel hierarchy of clusters by creating a cluster tree.
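
A minimal sketch of k-Means, assuming Euclidean distance and hand-picked initial centroids (real implementations usually choose the initial centroids randomly or with a smarter seeding scheme). The function name and the points are hypothetical.

```python
import math


def k_means(points, centroids, n_iterations=10):
    """Alternate between assigning points to the nearest centroid and updating centroids."""
    for _ in range(n_iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = coordinate-wise mean of the cluster (kept as-is if the cluster is empty).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical 2D points forming two visible groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = k_means(points, centroids=[(1.0, 1.0), (8.0, 8.0)])
print(centroids)
```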

Topic Modeling

In its standard form, topic modeling is the problem of finding two kinds of probability distributions describing a collection of text documents: a distribution over topics for each document and a distribution over words for each topic. The most widely used algorithms for topic modeling are Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA).

Topics can be discovered in arbitrary datasets, where each topic would be characterized by a distribution over a collection of independent variables (features).

Topic modeling can also be seen as a form of clustering with soft cluster assignment: every document in the collection can belong to multiple clusters (contain multiple topics) with the probability to be in a certain cluster \(C\) given by the probability that the document contains the topic \(C\).

Density Estimation

Density estimation is the construction of an estimate of the probability density function from the observed data. One approach to density estimation is parametric. Assume that the data are drawn from one of a known parametric family of distributions, for example, the normal distribution with mean \(\mu\) and variance \(\sigma^2\). The density \(f\) underlying the data could then be estimated by finding estimates of \(\mu\) and \(\sigma^2\) from the data and substituting these estimates into the formula for the normal density.

Kernel Density Estimation is a non-parametric technique to estimate the unknown probability distribution of a random variable, based on a sample of points taken from that distribution. The estimation is done using a kernel function, typically a smooth function with a single mode at \(x=0\); for example, a Gaussian bell curve can be used as a kernel function.
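
A minimal sketch of kernel density estimation with a Gaussian kernel. The sample and the bandwidth of 0.5 are assumptions picked by hand for illustration; choosing the bandwidth well is a topic of its own. The last lines check numerically that the estimate integrates to approximately one, as a density should.

```python
import math


def gaussian_kernel(u):
    """Standard normal density: smooth, with a single mode at u = 0."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)


def kde(x, sample, bandwidth):
    """Density estimate at x: average of kernels centred on each sample point."""
    total = sum(gaussian_kernel((x - xi) / bandwidth) for xi in sample)
    return total / (len(sample) * bandwidth)


# Hypothetical sample and hand-picked bandwidth.
sample = [1.0, 1.3, 2.1, 2.4, 5.0]
bandwidth = 0.5

# Crude numerical integration of the estimate over [-5, 10].
step = 0.01
grid = [-5 + i * step for i in range(1501)]
area = sum(kde(x, sample, bandwidth) for x in grid) * step
print(kde(1.2, sample, bandwidth), kde(3.7, sample, bandwidth), area)
```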

Other Statistical Methods

Confidence Interval

The purpose of taking a random sample from an unknown probability distribution and computing a statistic, such as the mean from the data, is to approximate the mean of the distribution. How well the sample statistic estimates the underlying probability distribution value is typically an issue. A confidence interval addresses this issue because it provides a range of values which is likely to contain the distribution's parameter of interest.
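
As a sketch, a confidence interval for the mean can be computed with the normal approximation: mean plus or minus \(z\) standard errors. This is an assumption that suits reasonably large samples; for small samples such as the hypothetical one below, an interval based on the t-distribution would be somewhat wider.

```python
import math
import statistics


def mean_confidence_interval(sample, level=0.95):
    """Normal-approximation interval: mean +/- z * s / sqrt(n)."""
    n = len(sample)
    mean = statistics.fmean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    z = statistics.NormalDist().inv_cdf((1 + level) / 2)  # about 1.96 for 95%
    return mean - z * standard_error, mean + z * standard_error


# Hypothetical measurements.
sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]
low, high = mean_confidence_interval(sample)
print(low, high)
```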

Hypothesis Testing

Hypothesis testing in data science is a way for the data scientist to test the results of a survey or experiment to see if they got meaningful results. Data scientists basically test whether their results are valid by figuring out the odds that the results have happened by chance. If the results may have happened by chance, the experiment will not be repeatable and, therefore, has little use.

A hypothesis is a guess about something. The guess should be testable, either by experiment or observation. For example:

A new medicine might work;

A new way of teaching might work better;

A new advertisement would attract more customers.
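
One way to estimate "the odds that the results have happened by chance" is a permutation test: shuffle the group labels many times and count how often a difference at least as large as the observed one appears. The sketch below applies it to the teaching example; the scores are invented, and the number of permutations and the seed are arbitrary assumptions.

```python
import random
import statistics


def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Share of label shufflings whose mean difference is at least the observed one."""
    rng = random.Random(seed)  # fixed seed only to make the example reproducible
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign examples to groups at random, without replacement
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_permutations


# Hypothetical test scores under the old and the new way of teaching.
old_teaching = [62, 65, 68, 70, 71, 73]
new_teaching = [72, 75, 78, 80, 83, 85]
p_value = permutation_p_value(old_teaching, new_teaching)
print(p_value)
```

A small value suggests that a difference this large would rarely arise from random group assignment alone.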

Time Series Analysis

Time series data often arise when monitoring industrial processes or tracking corporate business metrics. Time series analysis accounts for the fact that data points taken over time may have an internal structure (such as autocorrelation, trend or seasonal variation) that should be accounted for. Such techniques as moving average, exponential smoothing, and recurrent neural networks are often used to analyze time series data.
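
The two simplest techniques mentioned above can be sketched in a few lines. The series, the window size and the smoothing factor `alpha` are hypothetical values chosen for illustration.

```python
def moving_average(series, window):
    """Mean of the last `window` observations, for every position where it is defined."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]


def exponential_smoothing(series, alpha):
    """Each smoothed value blends the new observation with the previous smoothed value."""
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical weekly metric.
series = [10, 12, 11, 15, 14, 18]
ma = moving_average(series, window=3)
es = exponential_smoothing(series, alpha=0.5)
print(ma)
print(es)
```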

Monte-Carlo Simulation

Monte Carlo simulation, or probability simulation, is a technique used to understand the impact of risk and uncertainty in financial, project management, cost, and other forecasting models. Monte Carlo simulations are used to model the probability of different outcomes in a process that cannot easily be predicted due to the intervention of random variables. During a Monte Carlo simulation, values are sampled at random from the input probability distributions. Each set of samples is called an iteration, and the resulting outcome from that sample is recorded. Monte Carlo simulation does this hundreds or thousands of times, and the result is a probability distribution of possible outcomes. In this way, Monte Carlo simulation provides a much more comprehensive view of what may happen. It tells the data scientist not only what could happen, but how likely it is to happen.
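
A minimal sketch of a Monte Carlo simulation for a project management question: how likely is a three-phase project to take longer than 16 weeks? The phases, their triangular input distributions and the 16-week threshold are all invented for the example.

```python
import random
import statistics


def simulate_project_durations(n_iterations=10000, seed=0):
    """Each iteration samples one duration per phase and records the total."""
    rng = random.Random(seed)  # fixed seed only to make the example reproducible
    totals = []
    for _ in range(n_iterations):
        # random.triangular(low, high, mode); durations in weeks (hypothetical).
        design = rng.triangular(2, 6, 3)
        build = rng.triangular(5, 15, 8)
        testing = rng.triangular(1, 5, 2)
        totals.append(design + build + testing)
    return totals


totals = simulate_project_durations()

# The recorded outcomes form an empirical distribution we can query.
probability_over_16_weeks = sum(t > 16 for t in totals) / len(totals)
print(statistics.fmean(totals), probability_over_16_weeks)
```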

Bayesian Statistics

Bayesian statistics is a mathematical procedure that applies probabilities to statistical problems. It gives data scientists the tools to update their beliefs in light of new data. In other words, it is a system for describing epistemological uncertainty using the mathematical language of probability. In the Bayesian paradigm, degrees of belief in states of nature are specified; these are non-negative, and the total belief in all states of nature is fixed to be one. Bayesian statistical methods start with existing prior beliefs and update these using data to give posterior beliefs, which may be used as the basis for inferential decisions.
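
The update from prior to posterior beliefs can be sketched for a finite set of states of nature. In this hypothetical example, a coin is either fair or biased toward heads, and we observe three heads in a row; the states and their likelihoods are assumptions made for illustration.

```python
def bayes_update(prior, likelihood):
    """Posterior belief in each state is proportional to prior times likelihood."""
    unnormalized = {state: prior[state] * likelihood[state] for state in prior}
    evidence = sum(unnormalized.values())
    return {state: value / evidence for state, value in unnormalized.items()}


# Prior beliefs: non-negative and summing to one.
prior = {"fair": 0.5, "biased": 0.5}

# Hypothetical probability of observing heads under each state of nature.
likelihood_of_heads = {"fair": 0.5, "biased": 0.8}

# Observe heads three times, updating the beliefs after each observation.
posterior = prior
for _ in range(3):
    posterior = bayes_update(posterior, likelihood_of_heads)
print(posterior)
```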

Ensemble Methods

Ensemble Methods are algorithms that build strong (accurate) regressors or classifiers by combining predictions obtained from weak (less accurate) regressors or classifiers. Methods such as voting, averaging, boosting, and bagging are usually used to combine the predictions of multiple simple predictors. Another frequently used way to combine the predictions given by a collection of weak predictors is to create a predictor that treats those predictions as independent variables.
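
Voting is the easiest of these to sketch. Below, three hypothetical weak classifiers (simple threshold rules over made-up message features) are combined by majority vote.

```python
from collections import Counter


def majority_vote(classifiers, x):
    """Combine the predictions of several classifiers by taking the most common one."""
    votes = [classify(x) for classify in classifiers]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical weak classifiers, each looking at a single feature.
weak_classifiers = [
    lambda x: "spam" if x["links"] > 3 else "ham",
    lambda x: "spam" if x["exclamations"] > 5 else "ham",
    lambda x: "ham" if x["known_sender"] else "spam",
]

message = {"links": 5, "exclamations": 1, "known_sender": False}
prediction = majority_vote(weak_classifiers, message)
print(prediction)
```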

Neural Networks

Neural networks can approximate almost any function given enough training data and network parameters. Recent advancements made it possible to train very big and deep neural networks. If the explainability of the model is not an issue, a data scientist may prefer using neural networks for classification and regression tasks. Neural networks are especially effective on perceptual tasks, where the input is sound, text, or an image.

Data Indexing and Search Engines

Understanding data indexing in database management systems is a crucial skill for a data scientist as more and more data is now accumulated in corporate databases and data warehouses. Indexing allows data scientists to get the needed data fast. Search engines such as Solr or ElasticSearch have become a commodity, and they are essential for accessing data stored in document-oriented databases like MongoDB, MarkLogic or CouchDB.

Recommender Systems

Recommender systems are algorithms that recommend content to a user based on the user's previous content consumption patterns. Recommender systems can be based on the principles of content-based filtering (they suggest new content to the user based on the similarity between the content consumed in the past and the new piece of content), collaborative filtering (new content is recommended to the user based on the similarity between this user's tastes and those of other users), or a hybrid of the two. Collaborative filtering algorithms can be memory-based, model-based, matrix-factorization-based, and others. Content-based filtering usually uses classification algorithms to predict whether a user will like the content or not.

Association Rules Mining

Association rule mining is a procedure which is meant to find frequent patterns, correlations, associations, or causal structures from datasets found in various kinds of databases such as relational databases, transactional databases, and other forms of data repositories.

For example, given a set of cash register transactions, association rule mining aims to find the rules which enable us to predict the occurrence of a specific item based on the occurrences of the other items in the transaction.

The most widely used algorithms for association rule mining are the Apriori algorithm, the Eclat algorithm, and the FP-growth algorithm.
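
The sketch below does not implement any of these algorithms; it only computes support and confidence, the two measures that are commonly used to score a candidate rule such as "bread implies butter". The transactions are hypothetical.

```python
def support(transactions, itemset):
    """Share of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= transaction for transaction in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the share that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical cash register transactions.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "jam"},
]

rule_support = support(transactions, {"bread", "butter"})
rule_confidence = confidence(transactions, {"bread"}, {"butter"})
print(rule_support, rule_confidence)
```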

Data Segmentation

Data segmentation is the process of taking your data and segmenting it so that you can apply different actions or analyses to different segments.

In marketing, for example, user data segmentation will allow you to communicate a relevant and targeted message to each segment identified. By segmenting your data, you will be able to identify different groups within the customer database and tailor messages to suit your target market.

Data segmentation allows identifying which customers in the database have something in common. Data scientists need to find the groups of people, understand them and make some commercial value from the different groups. In marketing, there are usually three types of segmentation: demographic (by sex, ethnicity or age group), attitudinal (happy and unhappy clients) and behavioral (buying patterns of customers like usage frequency, brand loyalty, benefits needed, etc).

Visualization and Graphs

Data scientists have to be able to not just analyze data, but also communicate it effectively through visualizations and graphs.

A good data scientist has to master data visualization principles to better communicate data-driven findings. They have to know how to use plotting libraries and tools such as ggplot2, matplotlib or Tableau to create custom plots, understand the strengths and weaknesses of widely used plots, and know when to use and when to avoid each of them.

Game Theory

Defined variously as the science of strategy, and as the study of conflict and cooperation between rational decision-makers, game theory essentially embodies an analytic method for understanding and codifying both the structures of conflict and the dynamic interactions shaping behaviors.

Game theory deals with understanding strategic situations. How well an agent (or a player) performs depends not just on their own actions but also on what others do, and vice versa. The basic principle of game theory is to find an optimal solution for a given situation. Game theory covers not just games like poker, football, or chess but also many other important decisions, such as investing, customer engagement, or deciding which job to take. Its applications can be found in various strategic decision-making contexts such as sports, economics, politics, and ecology.

Algorithmic game theory studies various algorithms capable of finding solutions such as Nash Equilibrium in various games.
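
For small two-player games, pure-strategy Nash equilibria can be found by brute force: a pair of actions is an equilibrium if neither player can improve their payoff by changing only their own action. The sketch below applies this check to the classic prisoner's dilemma; the function name and the particular payoff numbers are assumptions for illustration.

```python
def pure_nash_equilibria(payoffs):
    """payoffs[(row_action, col_action)] = (row player's payoff, column player's payoff)."""
    rows = sorted({r for r, _ in payoffs})
    cols = sorted({c for _, c in payoffs})
    equilibria = []
    for r in rows:
        for c in cols:
            row_payoff, col_payoff = payoffs[(r, c)]
            # Neither player gains by deviating while the other keeps their action.
            row_is_best = all(payoffs[(other, c)][0] <= row_payoff for other in rows)
            col_is_best = all(payoffs[(r, other)][1] <= col_payoff for other in cols)
            if row_is_best and col_is_best:
                equilibria.append((r, c))
    return equilibria


# Prisoner's dilemma with illustrative payoffs (negative years in prison).
prisoners_dilemma = {
    ("cooperate", "cooperate"): (-1, -1),
    ("cooperate", "defect"): (-3, 0),
    ("defect", "cooperate"): (0, -3),
    ("defect", "defect"): (-2, -2),
}
equilibria = pure_nash_equilibria(prisoners_dilemma)
print(equilibria)
```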

Data Imputation

In data science, imputation is the process of replacing missing data with artificial values. When substituting for a data point (a complete example), it is known as unit imputation; when substituting for a component of a data point (an independent variable, a feature), it is known as item imputation.

There are three main problems that missing data causes. Missing data can:

introduce a substantial amount of bias;

make the handling and analysis of the data more difficult; and

create inefficiency.

Because missing data can create problems for analyzing data, imputation is seen as a way to avoid pitfalls involved with complete deletion of a subset of the data that have missing values. Moreover, when one or more values are missing in a data point, most statistical packages default to discarding any example that has one or more missing values, which may introduce bias or affect the representativeness of the results. Imputation preserves all examples by replacing missing attributes with an estimated value based on other available information. Imputation theory is constantly developing and thus requires consistent attention to new information regarding the subject. A few of the well-known attempts to deal with missing data include:

mean imputation;

hot deck and cold deck imputation;

listwise and pairwise deletion;

last observation carried forward;

regression imputation;

stochastic imputation; and

multiple imputation.
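
Mean imputation, the first technique in the list, is simple enough to sketch directly. Missing values are represented here by `None`, which is an assumption of this example; the data is hypothetical.

```python
import statistics


def mean_impute(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical feature with two missing values.
ages = [25, None, 35, 40, None]
imputed = mean_impute(ages)
print(imputed)
```

Mean imputation keeps every example but shrinks the variance of the feature, which is one reason the more elaborate techniques in the list exist.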

Survival Analysis

Survival analysis is usually defined as a set of methods for analyzing data where the outcome variable is the time until the occurrence of an event of interest. The event can be death, the occurrence of a disease, marriage, divorce, etc. The time to event or survival time can be measured in days, weeks, months, etc.

For example, if the event of interest is a heart attack, then the survival time can be the time in years until a person develops a heart attack.

In survival analysis, subjects are usually followed over a specified time period and the focus is on the time at which the event of interest occurs. Linear regression cannot usually be used to model the survival time because survival times are typically positive numbers, while linear regression predictions can be negative. Another reason is that linear regression cannot effectively handle the censoring of observations.

Observations are called censored when the information about their survival time is incomplete. This happens when a patient does not experience the event of interest for the duration of the study or when a person drops out of the study before the end of the study observation time and did not experience the event. Censoring is an important issue in survival analysis, representing a particular type of missing data.

Unlike ordinary regression models, survival methods correctly incorporate information from both censored and uncensored observations in estimating important model parameters.
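
One widely used survival method that handles censoring is the Kaplan-Meier estimator; the text above does not name it, so take it as our choice of example. At each time where an event occurs, the survival probability is multiplied by the share of at-risk subjects who did not experience the event. Censored subjects count as "at risk" up to the time they leave the study. The durations and event flags below are hypothetical.

```python
def kaplan_meier(durations, event_observed):
    """Return [(time, survival probability)] for every time at which an event occurred."""
    survival = 1.0
    curve = []
    event_times = sorted({t for t, e in zip(durations, event_observed) if e})
    for t in event_times:
        # Subjects still under observation just before time t (censored ones included).
        at_risk = sum(d >= t for d in durations)
        events = sum(d == t and e for d, e in zip(durations, event_observed))
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical study: time in years; False marks a censored observation.
durations = [2, 3, 3, 5, 8]
event_observed = [True, True, False, True, False]
curve = kaplan_meier(durations, event_observed)
print(curve)
```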

Experiment Design

At the heart of every data science project exists the planning, design, and execution of experiments. Such experiments aim at understanding the data, potentially cleaning it and performing the necessary data analysis for knowledge discovery and decision-making. Without knowing the experimental design processes that are used in practice, researchers may not be able to discover what is really hidden in their data.

Designing an experiment may be viewed as involving the design of three distinct components: the response (what is to be measured), the treatments (combinations of factors related to the research goals and hypotheses), and the experiment (the manner in which the treatments will be applied to the subjects or units being studied).

Experiment design is the cornerstone of A/B testing, which is one of the main tasks of many data scientists today.

Data Wrangling

Data wrangling (also called data munging) is the process of mapping and transforming data from its raw form into another format that is more suitable for analysis. Usually, the data analyzed by data scientists is challenging to work with in its raw form. Some of the imperfections in data include inconsistent string formatting, missing values, bad encoding, or non-standard date formats.

Data Architecture

Data architecture is a set of rules, policies, standards and models that govern and define the type of data collected and how it is used, stored, managed and integrated within an organization and its database systems. It provides a formal approach to creating and managing the flow of data and how it is processed across an organization's systems and applications.

Enterprise data architecture consists of three different layers or processes:

Conceptual/business model, which includes all data entities and provides a conceptual or semantic data model;

Logical/system model, which defines how data entities are linked and provides a logical data model; and

Physical/technology model, which provides the data mechanism for a specific process and functionality, or how the actual data architecture is implemented on underlying technology infrastructure.

Software Engineering

The modern data scientist is a skillful software developer. Most data science algorithms exist as part of packages for programming languages such as R and Python. To be able to write efficient code, data scientists have to master software engineering concepts such as Object-Oriented Programming, data structures and code profiling. Furthermore, to avoid writing explicit loops, which are often a main source of inefficiency in R and Python code, a data scientist has to be able to employ highly efficient vectorized operations.

In the remaining part of this post, we talk about the important soft skills of a successful data scientist.

Critical Thinking

Among the most important skills an effective data scientist should develop is critical thinking. This includes learning how to structure a problem so that it can be solved as a mathematical model. The job of a data scientist is to take real-world problems and transform them into mathematical models that, when automated, create repeatable business processes.

Very often, the product team comes with "wouldn't it be amazing if" ideas about the potential benefit the business could achieve if it better utilized its informational assets. A data scientist must take these statements and break them down into a description of the desired result, determine what data is needed to get that result, and understand how that data can be converted into a model that can be repeated in a systematic fashion.

Business Acumen

A data scientist must have business acumen and the know-how of the elements that make up a successful business model. Otherwise, the technical skills the data scientist has cannot be employed effectively. The data scientist must be able to discern the problems and potential challenges that need to be solved for the business to sustain and grow.

Communication

Most data scientists, especially those who work for an enterprise, have to learn to communicate complex ideas to managers of all levels as well as to engineers.

Data scientists have to engage business stakeholders in a way that captures their attention both emotionally and logically.

When discussing data with business people, data scientists have to learn to utilize the language of the business as opposed to the language of a statistician or a computer scientist. Outcomes and value make more sense to business people than process and complexity. When communicating with business people and decision makers, it is often recommended to focus on the business priorities and keep it as simple and prescriptive as possible.

Data Intuition

This is perhaps one of the most significant soft skills that a data scientist needs. Great data intuition means perceiving patterns where none are observable on the surface and knowing where the value lies in an unexplored pile of data. This makes data scientists more efficient in their work. This is a skill which comes with experience. To develop it, data scientists participate in various boot camps, competitions, open-source, and not-for-profit projects.

