Theorems and Algorithms

We are not going to spend a lot of time here. Google has loads of information on every algorithm under the sun!

There are classification algorithms, clustering algorithms, decision trees, neural networks, basic deduction, boolean, and so on. If you have specific questions, let us know!

Bayes Theorem

Alright, this is probably one of the most popular theorems, and one that most computer-focused people should know about!

There have been several books in the last few years that have discussed it heavily.

What we personally like about Bayes theorem is how well it simplifies complex concepts.

It distills a lot of statistics into a few simple variables.

It fits in with “conditional probability” (e.g. if this event has happened, it affects the probability of some other event happening).

What we enjoy about it is that it lets you estimate the probability of a hypothesis when given certain data points.

Bayes can be used to estimate the probability that someone has cancer based on their age, or that an email is spam based on the words in the message.
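To make that concrete, here is a minimal sketch in Python of Bayes theorem applied to the spam example; the email counts below are invented purely for illustration.

    # Bayes' theorem: P(spam | word) = P(word | spam) * P(spam) / P(word)
    # All counts are made-up numbers, used only to show the arithmetic.
    total_emails = 1000
    spam_emails = 200            # so P(spam) = 0.2
    spam_with_word = 150         # spam emails containing the word "free"
    ham_with_word = 40           # non-spam emails containing the word "free"

    p_spam = spam_emails / total_emails
    p_word_given_spam = spam_with_word / spam_emails
    p_word = (spam_with_word + ham_with_word) / total_emails

    # Probability the email is spam, given that it contains "free"
    p_spam_given_word = p_word_given_spam * p_spam / p_word
    print(f"P(spam | 'free') = {p_spam_given_word:.2f}")  # roughly 0.79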

The theorem is used to reduce uncertainty. It was used in World War II to help predict the locations of U-boats, as well as to work out how the Enigma machine was configured so that German messages could be decrypted.

As you can see, it is quite heavily relied on. Even in modern data science, we use Bayes and its many variants for all sorts of problems and algorithms!

K-Nearest Neighbor Algorithm

K nearest neighbor is one of the easiest algorithms to understand and implement.

Wikipedia even describes it as a form of “lazy learning”.

The concept is less based on statistics and more based on reasonable deduction.

In layman's terms, it looks for the groups of points that are closest to each other.

If we are using k-NN on a two-dimensional model, then it usually relies on something called Euclidean distance (Euclid was a Greek mathematician from very long ago!).

The alternative is the 1-norm (Manhattan or taxicab) distance, so named because it measures the route along square city blocks, the way a car can only move along one street at a time rather than cutting diagonally.

The point is, the objects and models in this space live in two dimensions, like your classic x, y graph.
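To see the difference, here is a quick Python sketch of both distance measures for two points on that x, y plane; the coordinates are arbitrary.

    import math

    # Two points on a classic x, y graph (made-up coordinates)
    a = (1, 2)
    b = (4, 6)

    # Euclidean (2-norm) distance: the straight-line distance between the points
    euclidean = math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

    # Manhattan (1-norm) distance: the "city block" route along the grid
    manhattan = abs(a[0] - b[0]) + abs(a[1] - b[1])

    print(euclidean)  # 5.0
    print(manhattan)  # 7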

k-NN looks at the local group of points nearest to the one you want to classify, up to a specified number of neighbors. That specified number of neighbors is k.

There are specific methodologies for figuring out how large k should be, since it is an input variable that the user or an automated data science system must decide on.
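To show the idea, here is a minimal k-NN sketch in plain Python; the points, labels, and choice of k are made up for illustration.

    import math
    from collections import Counter

    def knn_predict(training_data, new_point, k):
        # training_data is a list of ((x, y), label) tuples.
        # Sort the labelled points by Euclidean distance to the new point.
        by_distance = sorted(
            training_data,
            key=lambda item: math.dist(item[0], new_point),
        )
        # Take the labels of the k nearest neighbors and return the majority vote.
        nearest_labels = [label for _, label in by_distance[:k]]
        return Counter(nearest_labels).most_common(1)[0][0]

    points = [((1, 1), "red"), ((2, 1), "red"), ((8, 9), "blue"), ((9, 8), "blue")]
    print(knn_predict(points, (1.5, 1.2), k=3))  # "red"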

This model, in particular, is great for basic market segmentation, feature clustering, and seeking out groups amongst specific data entries.

Most popular data science libraries allow you to implement this in one to two lines of code.
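For instance, with scikit-learn (assuming it is installed), the whole fit-and-predict step really is a couple of lines; the toy data is the same as in the sketch above.

    from sklearn.neighbors import KNeighborsClassifier

    X = [[1, 1], [2, 1], [8, 9], [9, 8]]
    y = ["red", "red", "blue", "blue"]

    model = KNeighborsClassifier(n_neighbors=3).fit(X, y)
    print(model.predict([[1.5, 1.2]]))  # ['red']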

Bagging/Bootstrap aggregating

Bagging involves creating multiple models of a single algorithm, such as a decision tree, each trained on a different bootstrap sample of the data. Because bootstrapping samples with replacement, some of the data is left out of each tree.

Consequently, the decision trees are built from different samples, which helps address the problem of overfitting to a single sample. Ensembling decision trees in this way reduces the total error because the variance keeps decreasing with each new tree added, without an increase in the bias of the ensemble.
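Here is a minimal sketch of that idea: train several decision trees on different bootstrap samples and take a majority vote. scikit-learn, NumPy, and the toy data are assumptions made purely for illustration.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
    y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

    trees = []
    for _ in range(10):
        # Sample row indices with replacement: some rows repeat, others are left out
        idx = rng.integers(0, len(X), size=len(X))
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

    # The ensemble's prediction is the majority vote of the individual trees
    votes = [tree.predict([[4.5]])[0] for tree in trees]
    print(round(sum(votes) / len(votes)))  # 0 or 1, whichever most trees chose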

A bag of decision trees that uses subspace sampling is referred to as a random forest. Only a selection of the features is considered at each node split which decorrelates the trees in the forest.

Another advantage of random forests is that they have a built-in validation mechanism. Because only a portion of the data is used for each model, an out-of-bag error of the model’s performance can be calculated using the roughly 37% of the sample left out of each tree.
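As a sketch of that built-in validation, scikit-learn's RandomForestClassifier can report the out-of-bag score directly when oob_score=True; the bundled iris dataset below stands in for real data, purely for illustration.

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)
    forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
    forest.fit(X, y)

    # Accuracy estimated on the roughly 37% of rows each tree never saw during training
    print(forest.oob_score_)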