Self-supervised learning may very well be the future of AI, according to some of the most prominent ML researchers.

Self-supervised learning is one of those recent ML techniques that have made waves in the data science community but have so far flown under the radar of the Entrepreneurs and Fortunes of the world; the general public has yet to learn about the concept, yet many AI practitioners deem it revolutionary.

The paradigm holds vast potential for enterprises as well, since it can help tackle deep learning’s most daunting issue: sample inefficiency and the costly training that follows from it.

We strongly believe that, as a business owner, you should acquaint yourself with this admittedly complex subject, and we’ll gladly help.

Hand-crafted Feature Learners and Deep Learning Systems: AI Past vs AI Future

These days, when someone mentions AI’s transformative potential, they’re likely speaking about machine learning. And when you’re reading another sensationalist post about the groundbreaking progress in ML, the authors are probably referring to deep learning (and supervised deep learning in particular).

Since human-generated features tend to be brittle, the AI community is increasingly embracing deep learning systems (for a number of tasks), which can learn data’s distinctive properties by themselves given an objective function.

These neural nets aren’t magic, though; they’re just cleverly applied statistics and linear algebra.

The basic build of a deep learning system is a succession of linear and point-wise non-linear operations. The network first encodes every input as a vector (a list of numbers), then multiplies the received values by the weight matrices of its layers, passes the results through a bank of non-linear functions (such as ReLU), and feeds the final vector to a classifier at the very end of the architecture.
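To make this concrete, here is a minimal sketch of that pattern in PyTorch; the layer sizes (784, 256, 64, 10) are arbitrary choices for illustration, not taken from any specific system:

```python
import torch
import torch.nn as nn

# A minimal sketch of the pattern described above: alternating linear
# layers and point-wise non-linearities, ending in a classifier head.
model = nn.Sequential(
    nn.Linear(784, 256),   # linear operator: matrix multiply + bias
    nn.ReLU(),             # point-wise non-linearity
    nn.Linear(256, 64),
    nn.ReLU(),
    nn.Linear(64, 10),     # classifier over 10 hypothetical classes
)

x = torch.randn(1, 784)    # an input encoded as a vector of numbers
logits = model(x)          # one forward pass through the stack
```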

When the machine fails to output what is expected of it (a correct class/label in supervised learning), data scientists can tweak the network’s parameters (all the modules are trainable) until the results are satisfactory. This is done via gradient descent, with the gradients themselves computed by the backpropagation algorithm.
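Continuing the sketch above, one such training step looks roughly like this; the target label and learning rate are, again, arbitrary illustrative values:

```python
# One training step: backprop computes the gradients, and gradient
# descent uses them to nudge the parameters toward lower loss.
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

target = torch.tensor([3])         # a hypothetical "correct" class label
optimizer.zero_grad()
loss = criterion(model(x), target) # how wrong was the prediction?
loss.backward()                    # backprop: compute the gradients
optimizer.step()                   # gradient descent: update parameters
```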

It turns out computers are exceptionally good at learning functions that map inputs to human-generated labels, under one condition: an enormous amount of labeled data must be fed to them first. There’s also another drawback: these models, trained only to sort inputs into categories, don’t learn much about the inherent properties of the inputs themselves. The feedback the machine receives in supervised learning is scarce, so the networks are naturally very sample-inefficient.

This creates a significant issue: high-quality data is often hard to come by for a lot of companies and obtaining an annotated dataset can prove too costly an undertaking even for large organizations.

Image Classification’s Hidden Value

By this point, you probably have an idea of how image classification challenges (Pascal, ImageNet) work. The machine is fed an image displaying an object and, after processing it, the algorithm outputs a class the object belongs to.

The ImageNet challenge currently covers around 1,000 object classes, so it’s quite impressive when networks achieve high accuracy. But the correct classification itself isn’t the most exciting part: the network also learns feature representations that can be reused in image retrieval, detection, segmentation, depth estimation, and other more complex tasks, and this is far more important.
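As a rough illustration of how those representations get reused, here is a common transfer-learning pattern in PyTorch: a pretrained ImageNet classifier (ResNet-18 is just one convenient stand-in) has its classification head removed so it can serve as a general-purpose feature extractor:

```python
import torch
import torchvision.models as models

# Reuse what a classifier has learned: strip the final classification
# layer and keep the learned feature representations.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()    # drop the 1000-way classifier head
backbone.eval()

image = torch.randn(1, 3, 224, 224)  # a stand-in for a real photo
with torch.no_grad():
    features = backbone(image)       # a 512-dim feature vector, ready
                                     # for retrieval, detection, etc.
```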

How rich and performant the visual representations the model learns are depends on the quality and quantity of the data fed to it during training. And scaling supervised methods up doesn’t work because of the overwhelming expense of the annotation required.

Intuitively, the unsupervised learning paradigm (learning from data without labels) should help with the issue, but so far the ML community hasn’t been able to extract meaningful information from real-life images with purely unsupervised methods. Without labels, it’s hard to even formulate an objective function that would encourage the creation of useful representations; the machine doesn’t know what it is trying to represent.

So, is there another way?

Yes, the answer to the problem might be self-supervised learning, at least according to Yann LeCun and other prominent computer scientists.

This fairly novel technique attempts to extract supervisory signals directly from the input data and thus requires no human involvement. It works well in the text domain where algorithms can rely on context for supervision while creating word representations.

Here is how it happens: the model teaches itself to map input words to feature vectors from which the context around them can be predicted. Though human-generated labels aren’t at play here, the model must still learn a function from certain words to the words near them, so there is supervision coming from the data itself, hence the name self-supervised learning. And because the model has to encode the entire input, it learns much more about its parts’ properties and the relationships between them than a supervised network given a narrow objective would.
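A minimal skip-gram-style sketch in PyTorch shows the idea; the toy corpus, window size, and embedding dimension are all illustrative assumptions:

```python
import torch
import torch.nn as nn

# Skip-gram sketch: learn word vectors by predicting each word's
# neighbours. The "labels" come straight from the text itself.
corpus = "the cat sat on the mat".split()
vocab = {w: i for i, w in enumerate(set(corpus))}

# (center, context) pairs within a window of 1 -- no human annotation.
pairs = [(vocab[corpus[i]], vocab[corpus[j]])
         for i in range(len(corpus))
         for j in (i - 1, i + 1) if 0 <= j < len(corpus)]

embed = nn.Embedding(len(vocab), 16)   # the word vectors we actually want
out = nn.Linear(16, len(vocab))        # predicts the context word
params = list(embed.parameters()) + list(out.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for _ in range(100):
    centers = torch.tensor([c for c, _ in pairs])
    contexts = torch.tensor([c for _, c in pairs])
    loss = loss_fn(out(embed(centers)), contexts)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```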

Context prediction has little value on its own, though, so it’s typically used as a pretext task that tricks the model into learning word representations useful for real-world problems (semantic word similarity, etc.), which we call downstream tasks.

Can this Principle be Applied to Visual Problems?

While we can have a model create a probability distribution over an entire dictionary to predict a word from its context, things are harder in the visual world, where the possibilities are limitless. And yet the machine learning community is hardly discouraged by this: there’s already been substantial progress in the field.

Again, the principle behind SSL is to trick the machine learner into creating intermediate representations of objects (with semantic and structural meaning encoded) while it performs some auxiliary task, and then carry those representations over to other problems – downstream tasks – that are much more valuable to us.

Examples of Self-supervision in Visual Tasks

Rotation is when an algorithm is given rotated copies of an image and is tasked with predicting which rotation was applied (90 degrees, 180 degrees, etc.). The intuition here is that, by solving this task, the model learns to recognize the canonical orientation of objects in images.
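A sketch of how such a dataset might be generated; the batch and image sizes are arbitrary:

```python
import torch

# Rotation pretext task: each image yields four training examples, and
# the "label" is simply which rotation was applied (0, 90, 180, 270).
def rotation_batch(images):            # images: (N, C, H, W)
    rotated, labels = [], []
    for k in range(4):                 # k quarter-turns
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k))
    return torch.cat(rotated), torch.cat(labels)

images = torch.randn(8, 3, 32, 32)     # stand-in batch
x, y = rotation_batch(images)          # 32 examples, labels 0..3 for free
```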

Distortion through data augmentation. Here, images correspond to certain surrogate classes, and multiple slightly distorted examples of each picture are generated through data augmentation techniques (rotation, color shifts, scaling, etc.). The triplet loss function is used instead of the standard per-class classification loss, as it allows scaling the pretext task to an arbitrary number of surrogate classes. Explicit labels are absent, and the model is expected to determine which image examples lie closer to each other (and to the original image) in Euclidean space.
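Here is a rough sketch of that setup; the augmentations, the toy linear encoder, and the margin value are all illustrative placeholders:

```python
import torch
import torch.nn as nn
import torchvision.transforms as T

# Two augmented views of the same image (anchor, positive) should sit
# closer in feature space than a view of a different image (negative).
augment = T.Compose([
    T.RandomResizedCrop(32, scale=(0.5, 1.0)),
    T.ColorJitter(0.4, 0.4, 0.4),
])

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))  # toy encoder
triplet = nn.TripletMarginLoss(margin=1.0)

img_a = torch.rand(1, 3, 32, 32)          # original image A
img_b = torch.rand(1, 3, 32, 32)          # a different image B
loss = triplet(encoder(augment(img_a)),   # anchor: distorted view of A
               encoder(augment(img_a)),   # positive: another view of A
               encoder(augment(img_b)))   # negative: view of B
```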

Position of a patch relative to another patch. In this task, the model attempts to figure out the position of two image patches relative to one another, given a random spatial configuration. To do so, it needs to learn some spatial context first: it must acquire an “understanding” of objects and scenes, since there’s no determining spatial relationships without extracting objects and their parts from the visual input. According to Efros et al., while doing this, the algorithm builds visual representations good enough to be plugged into both object discovery (visual data mining) and object detection tasks later.

To prevent the model from taking shortcuts, such as picking up only trivial signals (matching local patterns, continuing a line across patch boundaries, etc.), the researchers add noise manually: introducing gaps and jitter between patches, scaling some images down, randomly dropping color channels (to counter the tell-tale effects of chromatic aberration), and so on.
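A simplified sketch of how such patch pairs might be sampled; the patch size, gap, and fixed centre location are toy choices (real implementations sample the centre randomly and add the jitter described above):

```python
import torch

# Relative-position pretext task: sample a centre patch and one of its
# 8 neighbours (with a small gap to discourage trivial edge-matching),
# and label the pair with the neighbour's position index.
def sample_pair(image, patch=16, gap=4):        # image: (C, H, W)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    cy, cx = 40, 40                              # fixed centre, for illustration
    label = torch.randint(len(offsets), ()).item()
    dy, dx = offsets[label]
    ny = cy + dy * (patch + gap)
    nx = cx + dx * (patch + gap)
    centre = image[:, cy:cy + patch, cx:cx + patch]
    neighbour = image[:, ny:ny + patch, nx:nx + patch]
    return centre, neighbour, label              # label in 0..7

image = torch.rand(3, 96, 96)
centre, neighbour, label = sample_pair(image)
```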

Randomized jigsaw. Instead of predicting where a certain patch should be placed relative to another patch, the model tries to determine every tile’s position at the same time: it attempts to reassemble a jigsaw puzzle after being given a set of randomly shuffled pieces. The model presented here processes each patch independently using shared weights. Again, slight gaps are included in the sampling so that the machine learner doesn’t get hung up on edge alignment and chromatic aberration; the patches are also converted to grayscale and normalized to zero mean and unit standard deviation. In the end, the representation is extracted by averaging across the nine uniform samples.
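In code, the pretext data for this task might be generated along these lines; note that real implementations predict an index into a fixed set of hand-picked, maximally distinct permutations, for which one random permutation stands in here:

```python
import torch

# Jigsaw pretext task: cut an image into a 3x3 grid, shuffle the tiles
# with a known permutation, and ask the network to recover it.
def jigsaw_example(image, grid=3, tile=32):      # image: (C, 96, 96)
    tiles = [image[:, r * tile:(r + 1) * tile, c * tile:(c + 1) * tile]
             for r in range(grid) for c in range(grid)]
    perm = torch.randperm(grid * grid)           # the "label" to predict
    shuffled = [tiles[i] for i in perm.tolist()]
    return torch.stack(shuffled), perm

image = torch.rand(3, 96, 96)
tiles, perm = jigsaw_example(image)              # 9 tiles + their ordering
```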

Colorization is yet another pretext task showing lots of promise. Since any color image can be used as training data (you just treat the photo’s L channel as input and its ab color channels as the supervisory signal), the problem of expensive training datasets, so pressing in deep learning, can be avoided entirely. The model here acts similarly to an autoencoder (a neural network that creates compressed representations of images from which they can be reconstructed afterward), except that in this case the input and output are different channels: the network tries to map from the grayscale input to a distribution over possible color values.
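Here is a minimal sketch of how the training pairs come for free; it uses scikit-image’s Lab conversion, and the random array stands in for a real photo:

```python
import numpy as np
from skimage import color

# Convert any RGB photo to the Lab colour space: the L (lightness)
# channel becomes the input, the ab (colour) channels the target.
rgb = np.random.rand(64, 64, 3)        # stand-in for a real photo, in [0, 1]
lab = color.rgb2lab(rgb)

L = lab[:, :, :1]                      # grayscale input to the network
ab = lab[:, :, 1:]                     # colour channels it must predict
```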

It is tasked with achieving colorization of grayscale images plausible enough for a human observer to find convincing, but while the network is doing that, the authors say, it also models dependencies between semantics and textures (the ocean is blue, cherries are red, etc.), which will come in handy for other, more complicated tasks.

Summing up

Using supervised learning, data scientists can get machines to perform exceptionally well on certain complex tasks, such as image classification. But the success of these models is predicated on large-scale labeled datasets, which creates issues in the areas where high-quality data is scarce.

Labeling millions of data objects is costly, time-intensive, and unfeasible in many cases.

The self-supervised learning paradigm, which attempts to get machines to derive supervision signals from the data itself (without human involvement), might be the answer to the issue. According to some of the leading AI researchers, it has the potential to improve networks’ robustness and uncertainty estimation while reducing the cost of model training in machine learning.

Want to know more about self-supervised learning, neural networks, and how AI technology can help your business? Reach out to our expert right now.