We, at semanti.ca, are constantly monitoring online publications on Artificial Intelligence and Machine Learning. What we see is that there are three major groups of online publications:

Too low-level ones, aimed at initiated readers;

Too high-level ones, which are kind of trying to explain AI to a non-technical reader, but rather confuse the reader even more;

Marketing bullshit.

None of the three groups of publications really helps executives gain confidence in this complex topic. So, below is our guide to Machine Learning for top managers. We believe that this is the guide every top manager is desperately looking for, and you are the lucky one to find it. Congratulations!

First of all, let's clarify the terminology: AI, Machine Intelligence, Machine Learning, Deep Learning, Data Science, Data Mining, Neural Networks, Predictive Analytics, Cognitive Computing... We understand your pain when you read white papers full of buzzwords and try to figure out what the hell the difference between them is. Did you just read the same text yesterday, or is it something new? Is there a difference? If not, why do people use so many different terms to say the same thing?

Terminology

Artificial Intelligence, AI, or sometimes also Machine Intelligence, is the science of making machines that act similarly to living species. It's very vague as a definition, so even a calculator can be called an AI because it mimics what only humans could do just two centuries ago.

Machine Learning (ML) is one of the ways of building an AI. Today, it's the most popular and effective way of making a machine that acts similarly to an animal brain on some very specific problems. When we say that it "acts similarly to", we mean how it looks to an outside observer and NOT how it actually functions.

Deep Learning (DL) is a family of loosely related Machine Learning algorithms that employ the concept of an artificial neural network. This has nothing in common with animal brain neurons, just like Artificial Intelligence has nothing in common with the animal intelligence.

So, if you read an article on Neural Networks/Deep Learning and it has an illustration as below, know that the author is a layman. Do yourself a favor: close the browser tab. You're welcome.



Animal Neuron. Source: Wikipedia.

Don't worry, we will look at artificial neural networks more closely soon. Let's finish with the definitions.

Data Science is not a science. It's a set of methods and skills necessary to analyze data in the era of Big Data. Those methods and skills usually include a significant body of Machine Learning algorithms, hence the confusion between the two.

Big Data is the data that is hard to work with on a conventional computer. The notion of a conventional computer is quite vague, but we can approximate it with the best gamer's PC available on the market. It can be hard to work with the data because of one of the following reasons:

the data is too big to fit in memory or to be stored on a hard drive;

the data comes too fast to be processed by a CPU;

the data comes from too many heterogeneous sources to handle it in one user-friendly desktop application.

Those three reasons are often cited as the three "V": volume, velocity, and variety. If you read an article on Big Data or Data Science and it mentions four "V" (for "veracity") or five "V" (for "value"), know that you are reading marketing crap. Do yourself a favor: close the browser tab. You're welcome.

Predictive Analytics is a strict subset of Machine Learning and statistical techniques focusing on predicting anything based on the data from the past. From statistics, it takes regression analysis; from Machine Learning it takes classification.

Data Mining is mostly a retired buzzword. One of the best books on Machine Learning called Data Mining: Practical Machine Learning Tools and Techniques was originally to be named just "Practical Machine Learning", and the term "data mining" was only added for marketing reasons. As of 2018, the term "data mining" as a buzzword lost the lead to "data science".

Cognitive Computing is a buzzword promoted by IBM to stand out from the competition. It roughly means using Machine Learning to build AI systems.

Modern AI

Research on artificial intelligence has been carried out since the 1960s. Artificial neural networks in a form similar to the modern one were proposed and trained in the mid-1980s. So, what's so different now? The main difference is that, during the first two decades of the 2000s, several important breakthroughs made it possible to train very big neural networks (with billions of parameters). The most important of these breakthroughs were:

Rectified Linear Unit (ReLU);

Long short-term memory (LSTM);

Dropout.

ReLU and LSTM are types of neural network units that made it possible to train very deep neural networks. Dropout made it possible to train neural networks that generalize well. Generalization is the quality of a machine learning system that shows how well the system performs on previously unseen examples.

Other important factors that distinguish modern AI from the AI of the twentieth century are:

the availability of very big training datasets (thanks to the Internet);

the accessibility of very high computing power (thanks to GPUs and cloud computing), and

the availability of high-quality, high-level, accessible machine learning software (thanks to the Internet and the open source movement).

What's So Special About Neural Networks?

Almost any machine learning algorithm becomes better with more data. See, for example, the article "The Unreasonable Effectiveness of Data" written by lead research scientists from Google. However, most algorithms eventually stop improving as the dataset grows, whereas neural networks, thanks to their multi-layer architecture, keep benefiting from bigger datasets and can reach almost perfect performance on a wide range of machine learning tasks, given enough data.

Machine Learning Tasks

Machine learning doesn't try to (and most likely could not) solve every problem that humanity and business face. There are certain tasks that are well suited to machine learning:

Classification,

Regression,

Clustering.

Many real-world engineering problems can be reduced to those three tasks.

Classification

Classification aims to find a mathematical function \(f\) such that given an example \(x\) as input it produces an output \(y\). The output \(y\) belongs to a finite set of labels. The example \(x\) usually is a vector of features; the quantity of possible values of \(x\) is usually infinite. A classical example of a classification task is spam detection. \(y\), in this problem, can take only two values (depending on the machine learning algorithm the values are zero and one or minus one and plus one).

The definition of \(x\) depends on the machine learning engineer. Because \(x\) has to represent an email message somehow, and it has to be a vector of features, one can decide that the vector will have \(10,000\) dimensions; every dimension will correspond to the presence or absence, in the email message, of some specific English word. If a word is present, then the value of the feature would be \(1\); otherwise, it would be \(0\). How to convert an email message into a vector depends entirely on the machine learning engineer. The process is called feature engineering. There's no one correct way to do feature engineering, and it's more an art than a science.

An engineer has to try different ways to extract features from email messages and see which one gives the best result on the problem of spam detection. Other types of features, not based on words, are also possible and can be combined with word-based features:

Whether the sender is in the contact book of the recipient;

Whether the email contains an attachment;

Whether the email or subject contains exclamation marks;

Whether the email contains a link;

Etc. The creativity of the engineer is very important here.
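To make the idea of feature engineering more concrete, here is a minimal Python sketch of one possible way to turn an email into a vector of zeros and ones. Everything in it is an assumption made for illustration: the five-word `VOCABULARY` (a real one would have thousands of words), the function name `email_to_features`, and the choice of the four extra features taken from the list above.

```python
import re

# Hypothetical tiny vocabulary; the text imagines about 10,000 words.
VOCABULARY = ["free", "money", "meeting", "winner", "report"]


def email_to_features(subject, body, sender, contacts, has_attachment):
    """Convert one email into a list of 0/1 features (illustrative only)."""
    text = (subject + " " + body).lower()
    words = set(re.findall(r"[a-z']+", text))
    # One dimension per vocabulary word: 1 if the word is present, 0 if absent.
    features = [1 if word in words else 0 for word in VOCABULARY]
    # Extra features that are not based on words.
    features.append(1 if sender in contacts else 0)         # sender in contact book
    features.append(1 if has_attachment else 0)             # has an attachment
    features.append(1 if "!" in subject or "!" in body else 0)  # exclamation marks
    features.append(1 if re.search(r"https?://", body) else 0)  # contains a link
    return features


x = email_to_features(
    subject="You are a WINNER!",
    body="Claim your free money at http://example.com",
    sender="promo@example.com",
    contacts={"alice@example.com"},
    has_attachment=False,
)
print(x)  # [1, 1, 0, 1, 0, 0, 0, 1, 1]
```

The resulting list is the \(x\) that a classification algorithm would receive; the label \(y\) (spam or not spam) has to come from a human who marked the message.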

Examples of business problems that can be solved as a classification problem:

Churn Modeling: \(x\) represents what is known about the customer and \(y\) will be the prediction of whether the customer will stop engaging with your business in, let's say, the next 6 months.

Customer Segmentation: \(x\) represents what is known about the customer and \(y\) will be the segment the customer belongs to. Defining the segments (what they are, how many of them there should be, etc.) is a preliminary task.

Sentiment Analysis: \(x\) represents a customer's email or a tweet about a product and \(y\) represents whether the customer is satisfied or dissatisfied with the product.

Regression

Regression deals with the situations when we want to predict a certain quantity about an example. Here we aim to find a mathematical function \(f\) such that given an example \(x\) as input it produces an output \(y\), which is a real number. As in classification, \(x\) usually is a vector of features and the number of possible values of \(x\) is usually also infinite.

The feature engineering task in regression is also similar to that of classification.

Business problems that can be solved using regression:

Customer lifetime value modeling, that is predicting the future revenue that an individual customer will bring to your business in a given period. Here, \(x\) represents everything we know about the customer and \(y\) is a positive real number.

Demand analysis, that is predicting how many units of a product consumers will purchase. Here, \(x\) could represent the state of the market and \(y\) is a positive real number. For a company that produces soft drinks, \(x\) could contain such features as the average temperature for the past week and past months, the number of bottles sold during the past week and past months, the number of bottles sold during the same period last year, etc.

Goods price prediction: for example, given everything we know about the house (the number of rooms, the area, the year of construction, zip code, and so on) as \(x\), to predict what would be the price \(y\) of this house.

Stock price prediction: given everything we know about the company (the number of employees, the volume of investments in R&D, the last-year revenue, the total number of clients, the percentage of clients that renew their subscription, the state of the stock market compared to last month (up or down), the current price of the stock, and so on) as \(x\), to predict what would be the price \(y\) of the stock of this company in one month.

Clustering

The two previous tasks, classification and regression, assume that we can make our machine learning system predict either the correct label (spam/not spam) or the correct real number (the house price). To make this happen, we have to train our system by giving it examples of what a correct prediction is. For example, if we want to train our system to predict spam, we have to give examples of what is spam and what is not spam. So, to enable the learning, we have to provide our machine learning algorithms with a sufficiently large collection of correct pairs \(x, y\). Machine learning tasks of this kind are called supervised learning tasks.

Clustering, on the other hand, is an example of an unsupervised machine learning task. In the unsupervised setting, we still have a sufficiently large collection of examples; however, every example is only \(x\); we don't have \(y\)s.

Clustering can be helpful, for example, in Customer Segmentation. Remember that we said above that defining the segments (what they are, how many of them there should be, etc.) is a preliminary task. This can be done using clustering.

The task of clustering is defined like this: given a collection of examples \(X\) and the number of clusters \(n\), find a mathematical function \(f\) that assigns to every example \(x\) from collection \(X\) a cluster id \(i \in \{1, \ldots, n\}\).

Ideally, we would like \(f\) to put similar examples into the same cluster (or, equivalently, to assign the same \(i\) to similar values of \(x\)). Therefore, the job of the machine learning engineer here is not just feature engineering but also metric engineering. Metric engineering is the task of crafting a mathematical function \(m\) which takes two vectors \(x\) and \(x'\) as input and returns a small value if \(x\) is similar to \(x'\) or a high value if \(x\) and \(x'\) are dissimilar.

Two popular choices of metric functions are:

Euclidean distance;

Inverse cosine similarity (also known as cosine distance).

Euclidean distance computes the straight-line distance between two points in space (\(x\) and \(x'\)); the inverse cosine similarity computes one minus the cosine of the angle between two vectors (\(x\) and \(x'\)). The idea is that the bigger the angle between the vectors, the more dissimilar the two examples are. As with feature engineering, metric engineering is an art, and each machine learning engineer could develop their own metrics for every practical business problem.
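For the curious reader, both metrics fit in a few lines of Python. This is only a sketch: the function names `euclidean_distance` and `cosine_distance` are ours, and the second function assumes that neither vector consists of zeros only (the angle is undefined in that case).

```python
import math


def euclidean_distance(x, x_prime):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_prime)))


def cosine_distance(x, x_prime):
    """One minus the cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, x_prime))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_x_prime = math.sqrt(sum(b * b for b in x_prime))
    return 1.0 - dot / (norm_x * norm_x_prime)


print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))     # 1.0 (perpendicular vectors)
```

Note the difference in behavior: two vectors that point in the same direction but have different lengths, such as \((1, 2)\) and \((2, 4)\), are far apart according to the Euclidean distance but have a cosine distance of (approximately) zero.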

One of the primary uses of clustering in business is segmentation: customer, product or store. Similar products can be clustered together into groups based on their attributes like size, brand, use, flavor, etc.; similar stores can be grouped together based on characteristics such as sales, size of the store and size of the customer base, availability of in-store services, etc.

How Machine Learning Algorithms Learn

This is the most important question: how does the machine learn? How, given a collection of pairs \(x, y\), does the machine figure out what the function \(y = f(x)\) should look like, and how can this function then be applied to new values of \(x\), previously unseen by the machine, such that the predicted value of \(y\) is correct most of the time?

We will show you only three pictures you need to have in mind to understand how most machine learning algorithms learn.

Linear Regression

Linear regression is an algorithm that solves the regression problem by making one assumption about the function \(f\). It assumes that \(f\) has the following form: $$ y = f(x) = ax + b $$

So the problem of the machine learning algorithm is to find two values: \(a\) and \(b\), where \(a\) is a vector (of the same dimension as \(x\)) and \(b\) is a scalar.

The machine uses the collection of examples \(x, y\) and finds such values for \(a\) and \(b\) that the following loss function is minimized: $$ loss(y, f(x)) = \sum_{(x, y)} (y - f(x))^2, $$ where \(\Sigma\) means the sum over all training examples. Basically, the loss function defines how good our prediction \(f(x)\) is compared to the correct value \(y\) that we know.

The choice of the loss function is one of the choices the machine learning engineer makes, along with the choice of features (in feature engineering) and the choice of similarity metric (in clustering). If we assume that \(x\) is one-dimensional, then the function \(f\) would look like the blue line in the plot below:



Linear Regression. Source: Towards Data Science.

On the above plot, grey lines indicate the loss (before the square is applied).

Nothing fundamentally changes if \(x\) has many dimensions. The only difference is that instead of the straight line, \(f\) will have the form of a plane (if \(x\) is two-dimensional) or a hyperplane (if \(x\) has three or more dimensions).

Now you know how the simplest form of regression works. Different algorithms can make different assumptions on the form of \(f\) (different from \(ax + b\)), but the principle remains the same:

First, make an assumption on the form of the function \(f\);

Then, define the loss function appropriate for your business problem;

Finally, make the machine find the parameters that minimize the loss on the training data.

The machine will find the parameters by using some learning algorithm. Many of them exist.
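The three-step principle can be sketched in a few lines of Python for a one-dimensional \(x\). The learning algorithm used here is plain gradient descent, which is only one of many possible choices; the function names, the learning rate, the number of steps, and the toy data are all assumptions made for illustration. The gradients are taken of the squared loss divided by the number of examples, which has the same minimum as the sum used in the text.

```python
def predict(a, b, x):
    """Step 1: assume that f has the form a*x + b."""
    return a * x + b


def squared_loss(a, b, xs, ys):
    """Step 2: the loss is the sum of squared differences."""
    return sum((y - predict(a, b, x)) ** 2 for x, y in zip(xs, ys))


def fit_line(xs, ys, learning_rate=0.02, steps=5000):
    """Step 3: find a and b that minimize the loss, using gradient descent."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of the averaged squared loss with respect to a and b.
        grad_a = sum(2 * (predict(a, b, x) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (predict(a, b, x) - y) for x, y in zip(xs, ys)) / n
        # Move both parameters a little in the direction that lowers the loss.
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b


# Toy training data generated from y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]
a, b = fit_line(xs, ys)
print(round(a, 3), round(b, 3))  # 2.0 1.0
```

On this toy data the machine recovers \(a \approx 2\) and \(b \approx 1\), the values that were used to generate the examples, and the loss drops to almost zero.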

Support Vector Machine

Support Vector Machine is a classification algorithm. It is the best illustration of how classification algorithms work. Let's assume that our \(x\) is two-dimensional (\(x = (x_1, x_2)\)) and our \(y\) can take values either \(-1\) or \(+1\). Then the Support Vector Machine algorithm will find the straight line that best separates the examples that have label \(-1\) from the examples that have label \(+1\):



Support Vector Machine. Source: OpenCV Documentation.

On the above plot, the examples with label \(-1\) are represented by squares and those with label \(+1\) are represented by circles. The green line is the one that best separates the two kinds of examples, where "best" means that the distance from the closest member of each group to the green line is maximized.

Again, as with the regression task, nothing fundamentally changes if \(x\) has three or more dimensions. Instead of the line, the machine will find a plane or a hyperplane that best separates the examples belonging to different groups.

Classification algorithms differ from one another in how this line (the one that separates groups of examples with different labels from one another) is drawn. It's not always possible to draw a straight line to separate examples, so more complex algorithms find more complex mathematical functions to draw this line.
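Here is a minimal Python sketch of how a machine could find such a separating line for two-dimensional examples. It is not the exact procedure used by production SVM libraries; it is a simple approximation that repeatedly nudges the line whenever an example is on the wrong side of it or too close to it (technically, sub-gradient descent on a regularized hinge loss). The function names, the toy data, and the values of `reg`, `learning_rate` and `epochs` are assumptions made for illustration.

```python
def train_linear_svm(points, labels, reg=0.01, learning_rate=0.01, epochs=200):
    """Find w and b such that the line w[0]*x1 + w[1]*x2 + b = 0 separates
    the two groups with a wide margin (approximate, illustrative only)."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):
            margin = y * (w[0] * x1 + w[1] * x2 + b)
            # Regularization: slightly shrink w, which favors a wider margin.
            w[0] -= learning_rate * reg * w[0]
            w[1] -= learning_rate * reg * w[1]
            # Hinge loss: only examples that are misclassified or too close
            # to the line (margin below 1) move the line.
            if margin < 1:
                w[0] += learning_rate * y * x1
                w[1] += learning_rate * y * x2
                b += learning_rate * y
    return w, b


def predict(w, b, point):
    """Return +1 or -1 depending on the side of the line the point is on."""
    x1, x2 = point
    return 1 if w[0] * x1 + w[1] * x2 + b >= 0 else -1


# Toy data: three "squares" (label -1) and three "circles" (label +1).
points = [(-2.0, -2.0), (-1.0, -2.0), (-2.0, -1.0), (2.0, 2.0), (1.0, 2.0), (2.0, 1.0)]
labels = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(points, labels)
print([predict(w, b, p) for p in points])  # [-1, -1, -1, 1, 1, 1]
```

Once `w` and `b` are found, classifying a new, previously unseen example only requires checking on which side of the line it falls.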

K-means Clustering

To illustrate clustering, we will take the example of an algorithm called K-means. In this algorithm, the machine learning engineer has to make the following three choices:

How to convert examples into vectors (feature engineering);

How to define the similarity metric (metric engineering);

How many clusters exist in the data (this has to be guessed).

The machine will do the rest.

K-means clustering algorithm works as follows:

1. Randomly (or according to a heuristic) put cluster centers in the proximity of the examples;

2. Assign each example to the most similar cluster center (according to the similarity metric);

3. Compute the new position of each cluster center as the average of all examples assigned to this cluster center;

4. Repeat steps 2 and 3 until the cluster center positions no longer change significantly.
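The steps of K-means translate almost directly into Python. The sketch below makes a few assumptions of its own: the similarity metric is the Euclidean distance, the initial cluster centers are randomly chosen examples, and the algorithm stops when the assignment of examples to clusters no longer changes (at that point the centers no longer move either). The function name `k_means` and the toy data are illustrative.

```python
import math
import random


def k_means(examples, k, iterations=100, seed=0):
    """Group examples (tuples of numbers) into k clusters."""
    rng = random.Random(seed)
    # Put the cluster centers on k randomly chosen examples.
    centers = rng.sample(examples, k)
    assignment = None
    for _ in range(iterations):
        # Assign each example to the closest cluster center.
        new_assignment = [
            min(range(k), key=lambda i: math.dist(x, centers[i])) for x in examples
        ]
        # Stop when nothing changes anymore.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Move each center to the average of the examples assigned to it.
        for i in range(k):
            members = [x for x, c in zip(examples, assignment) if c == i]
            if members:  # an empty cluster keeps its old center
                centers[i] = tuple(sum(v) / len(members) for v in zip(*members))
    return centers, assignment


# Toy data: two obvious groups of three examples each.
examples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, assignment = k_means(examples, k=2)
print(assignment)  # the first three examples share one cluster id, the last three the other
```

Which of the two groups receives id 0 and which receives id 1 depends on the random initialization; only the grouping itself is meaningful.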

To illustrate the algorithm, let's assume that our \(x\) is two-dimensional and there are three clusters. The following animation represents iterations of the clustering algorithm:



K-means Clustering. Source: Towards Data Science.

Generalization

When working on a supervised learning problem, one important concept to understand is that of generalization. We say that a machine learning algorithm generalizes well to previously unseen examples (the examples we didn't use to train the algorithm) if it doesn't make many more prediction errors on the unseen examples than it makes on the data seen during training.

The learning algorithm can predict perfectly all the \(y\)s from the data used for training but predict very poorly the \(y\)s from new data. This problem is called overfitting, or the problem of high variance.

Usually, every machine learning algorithm comes with several hyperparameters that the machine learning engineer can tweak to reduce overfitting. It's generally impossible to completely remove overfitting, so the learning algorithm almost always performs better on the data seen during training. However, techniques such as regularization, which adds a penalty to the loss function for overly complex shapes of the function \(f\), help reduce overfitting.
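As an illustration, one common form of regularization for linear regression (often called L2 or ridge regularization; it is one choice among several) adds the squared length of the vector \(a\) to the loss defined earlier: $$ loss_{reg}(y, f(x)) = \sum_{(x, y)} (y - f(x))^2 + \lambda \lVert a \rVert^2, $$ where \(\lambda \geq 0\) is a hyperparameter (the symbol \(\lambda\) is our notation, not something introduced earlier in the text). With \(\lambda = 0\) we get back the original loss; the larger \(\lambda\) is, the more the machine is pushed towards small values of \(a\), that is, towards simpler shapes of \(f\).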

Special Cases

Supervised learning has several important special tasks:

In multi-label classification, \(y\) is also a vector, so multiple labels can be predicted at once for an \(x\). More generally, the output can have several dimensions: for example, in object localization, the goal is to predict the bounding box of an object in an image, so \(y\) is four-dimensional;

In sequence labeling, \(x\) is a sequence of vectors and \(y\) is a sequence of labels of the same length. Sequence labeling algorithms are frequently used in Natural Language Processing to assign different labels to different words in a sentence;

In Sequence-to-Sequence learning, \(y\) is a sequence that can have a different length than \(x\). This is the scenario in machine translation or spelling correction.

Artificial Neural Networks

Below, we quote a short part of our blog post on neural networks called How Neural Networks Work. The post is very accessible and high-level but gives just enough details so that the interested reader could get a clear idea about neural networks:

Artificial neural networks make another assumption about the function \(f\). They assume that \(f\) is a nested function. You have probably heard of neural network layers. So, for a 4-layer neural network, \(f\) will look like this: $$y = f(x) = f_4(f_3(f_2(f_1(x)))),$$ where \(f_1\), \(f_2\), \(f_3\), and \(f_4\) are simple functions like this: $$f_i = f_i(z) = nonlinear_i(a_i*z + b_i),$$ where \(i\) is called the layer index and can span from 1 to any number of layers. The function \(nonlinear_i\) is a fixed mathematical function chosen by the neural network designer (a human); it doesn't change once chosen. The coefficients \(a_i\) and \(b_i\), to the contrary, for every \(i\), are learned using an optimization algorithm called gradient descent. The gradient descent algorithm finds the best values for all \(a_i\) and \(b_i\) for all layers \(i\) at once. What is considered "best" is also defined by the neural network designer by choosing the loss function.
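The nested form of \(f\) is easy to see in code. The Python sketch below builds a tiny two-layer network and computes its output for one input. All the numbers are made up for illustration: in a real network the coefficients \(a_i\) and \(b_i\) would be found by gradient descent, not written by hand. We also assume that the first layer uses ReLU as its fixed nonlinear function and that the last layer applies no nonlinearity.

```python
def relu(z):
    """A popular fixed nonlinear function: negative values become zero."""
    return [max(0.0, v) for v in z]


def identity(z):
    """No nonlinearity (often used in the last layer for regression)."""
    return list(z)


def layer(a, b, nonlinear):
    """Build f_i(z) = nonlinear_i(a_i * z + b_i) for a matrix a and a vector b."""
    def f_i(z):
        pre_activation = [
            sum(weight * value for weight, value in zip(row, z)) + bias
            for row, bias in zip(a, b)
        ]
        return nonlinear(pre_activation)
    return f_i


# Hand-picked (not learned) coefficients, for illustration only.
f1 = layer(a=[[1.0, -1.0], [0.5, 0.5]], b=[0.0, -1.0], nonlinear=relu)
f2 = layer(a=[[2.0, 1.0]], b=[0.5], nonlinear=identity)


def f(x):
    """The whole network is just a nested function: f(x) = f2(f1(x))."""
    return f2(f1(x))


print(f([3.0, 1.0]))  # [5.5]
```

Adding more layers means nesting more functions of the same kind; training means finding the numbers inside `a` and `b` for every layer so that the loss on the training data is as small as possible.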

Other important details a senior executive should know about neural networks:

They require less feature engineering and can be successfully applied to raw data. Text can be "fed" to the neural network as a sequence of words; an image could be "seen" by the neural network as a matrix of pixels;

Neural networks learn (during training) and compute (at prediction time) multiple representations of the input before producing an output; each layer learns a different level of representation: from low-level (straight lines and curves) to high-level (ears, eyes, and noses);

Since they require less feature engineering, neural networks require significantly more data to train those different levels of input representation;

Since they require significantly more data to train, GPUs (Graphics Processing Units, graphics cards) are used to accelerate training. GPUs are used for training neural networks because they specialize in fast operations on matrices (needed in 3D graphics). Coincidentally, neural network training also requires a lot of mathematical operations on matrices;

Having GPUs is not critical at prediction time; ordinary CPUs can be used;

Neural networks have different architectures depending on the task: Convolutional Neural Networks are usually used to process images; Recurrent Neural Networks are applied to sequences (of words, sound frequencies or stock market prices); Feed Forward Neural Networks are used for classical classification and regression tasks;

Dropout is used as a regularization technique that reduces overfitting.



That's it. Now you know everything an executive should know about machine learning and the modern AI.
