By the end of this article, you’ll understand how Numerai is using advances in structure-preserving encryption to allow for open participation in the problem of stock market efficiency.

Some MNIST Handwritten Digits

Over the last few years, machine learning algorithms have solved big problems in computer vision. One such problem was teaching an algorithm to recognize handwritten digits in the MNIST dataset. Everyone writes digits differently, so the problem was difficult for computers.

When the dataset first became available in 1998, machine learning algorithms for computer vision were not very accurate. Furthermore, computer hardware was far behind where it is today. But slowly, researchers around the world made progress. By 2012, researchers published machine learning algorithms that demonstrated near-human performance on handwritten digit recognition.

Progress on the MNIST problem came down to:

- New hardware (fast GPUs)
- New machine learning algorithms (convolutional neural networks)
- Open participation (a free dataset)

With MNIST, anyone could download the dataset and participate in progress — so people from around the world did. They had the data they needed to train their algorithms and experiment with new ideas. Without open participation, computers might still not be able to recognize handwritten digits.

Claim: To solve a machine learning problem, open participation is key.

Progress in Efficient Markets

Efficiency in the stock market is not abstract. Inefficiencies are bad for society. The whole world benefits when capital is allocated correctly. It is a very important, but difficult, problem. So if there were a stock market dataset, where would we be in the progress graph for solving it? Is it already efficient — already solved like MNIST?

Asset managers and hedge funds can employ people with machine learning skills, and increasingly they do. They can definitely afford to buy GPUs.

But what’s missing on the stock market is open participation.

It may appear as though individuals have access to more market data than ever before. There are many free data sources like Yahoo! Finance. But most stock market data, surprisingly, is not publicly available. Using just Yahoo! Finance data to build a model is like using only one pixel in an image to learn to recognize handwritten digits.

High-quality stock market data is guarded by data monopolies and hedge funds. Monopolies enjoy being just that, and hedge funds with an information edge aren’t about to part with it. The incentives are such that high-quality datasets will become more secret and more expensive over time. So unlike the MNIST computer vision problem, solving the stock market won’t happen in plain sight. For the stock market, there is no free, high-quality, public dataset for machine learning.

Without training data for their algorithms, data scientists who don’t work on Wall Street have no way to participate in the progress toward more efficient markets. This situation is especially unfortunate when you consider that the field of data science has become increasingly democratized through freely available tools like Theano and TensorFlow, cheap cloud computing resources, free books like The Elements of Statistical Learning, machine learning communities like Kaggle, and MOOCs like Andrew Ng’s Coursera course. Never before has the field been as accessible. Breakthroughs in artificial intelligence could come from anywhere, and yet, as far as the stock market is concerned, many will go nowhere. Unless we can find a way to share the data.

Claim: The stock market is probably somewhat inefficient with respect to new developments in machine learning because only a tiny fraction of the global machine learning talent has access to its data.

A Breakthrough For Sharing

What if there were a way to make expensive market data freely available? But securely, so as not to disclose it?

Encryption is a way to secure data. Ordinarily, if you encrypt data, it becomes useless to a data scientist. But new developments in cryptography are letting us share datasets securely without destroying their utility. Structure-preserving encryption schemes allow machine learning algorithms to learn things even while blind to the raw data.

There are now practical homomorphic encryption schemes, such as the Fan and Vercauteren scheme, which allow one to perform addition and multiplication operations on ciphertexts (high-degree polynomials in an algebraic ring). It turns out that if multiplication and addition are preserved, then structure is too. And since machine learning algorithms only care about structure, this breakthrough means you can run machine learning algorithms on encrypted data.
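The FV scheme itself is involved, but the homomorphic idea can be seen in miniature with textbook RSA, which preserves multiplication: the product of two ciphertexts decrypts to the product of the plaintexts. A toy sketch with deliberately tiny, insecure parameters (this illustrates the homomorphic property only; it is not the FV scheme Numerai refers to):

```python
# Toy textbook RSA showing the multiplicative homomorphic property.
# Tiny insecure parameters -- for illustration only, not the FV scheme.

p, q = 61, 53
n = p * q                      # modulus: 3233
phi = (p - 1) * (q - 1)        # 3120
e = 17                         # public exponent, coprime to phi
d = pow(e, -1, phi)            # private exponent (Python 3.8+ modular inverse)

def encrypt(m):
    return pow(m, e, n)

def decrypt(c):
    return pow(c, d, n)

a, b = 7, 12
# Multiplying ciphertexts corresponds to multiplying plaintexts:
product_cipher = (encrypt(a) * encrypt(b)) % n
print(decrypt(product_cipher))  # -> 84, i.e. a * b
```

Schemes like FV go further, supporting both addition and multiplication on ciphertexts, which is what lets learning algorithms operate on encrypted features.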

Simpler schemes like order-preserving symmetric encryption also provide security in certain settings. New methods from neural cryptography can be used to encrypt data in a form that works with out-of-the-box machine learning tools.
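The order-preserving idea can be sketched with a hypothetical keyed scheme (real OPE constructions are far more careful): map each value through a secret, strictly increasing function. Comparisons survive encryption, so rank-based learners such as decision trees see the same split points on ciphertexts as on plaintexts.

```python
import random

# Toy order-preserving "encryption": a secret strictly increasing mapping.
# Hypothetical sketch only -- real OPE constructions are far more careful.

def ope_encrypt(x, key=42, domain=1000):
    """Ciphertext = cumulative sum of key-derived positive gaps up to x."""
    rng = random.Random(key)                 # the key seeds the secret gaps
    gaps = [rng.randint(1, 10) for _ in range(domain)]
    return sum(gaps[: x + 1])

data = [5, 17, 3, 250, 99]
enc = [ope_encrypt(x) for x in data]

# Because the mapping is strictly increasing, sorting the ciphertexts
# sorts the plaintexts identically.
order_plain = sorted(range(len(data)), key=lambda i: data[i])
order_enc = sorted(range(len(data)), key=lambda i: enc[i])
print(order_plain == order_enc)  # -> True
```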

Claim: Breakthrough encryption techniques can be used to encrypt stock market data while still keeping it useful for machine learning experts.

Numerai

Over the last two and a half years, working with expensive financial data from many sources at a $15 billion asset management company, I came up with a way to turn a small segment of data into a tractable binary classification problem. And I was able to create and train a machine learning algorithm on this data.
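Numerai's actual features and targets are proprietary, so as a generic sketch only: a tractable binary classification setup of this kind pairs rows of anonymized numeric features with a 0/1 target, and any standard classifier can be trained on it. Here with synthetic data and a plain logistic regression, no external libraries:

```python
import math
import random

# Generic sketch of a binary classification setup: anonymized numeric
# features and a 0/1 target. The data here is synthetic -- the real
# features and targets are proprietary.

random.seed(0)

def make_row():
    x = [random.gauss(0, 1) for _ in range(4)]   # 4 anonymous features
    y = 1 if x[0] + 0.5 * x[1] + random.gauss(0, 0.5) > 0 else 0
    return x, y

train = [make_row() for _ in range(500)]

def predict(w, b, x):
    return 1 / (1 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))

# Logistic regression trained by stochastic gradient descent.
w, b, lr = [0.0] * 4, 0.0, 0.1
for _ in range(200):
    for x, y in train:
        g = predict(w, b, x) - y             # gradient of the log loss
        w = [wi - lr * g * xi for wi, xi in zip(w, x)]
        b -= lr * g

accuracy = sum((predict(w, b, x) > 0.5) == (y == 1) for x, y in train) / len(train)
print(f"train accuracy: {accuracy:.2f}")
```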

Using my model, we invested about $50 million for more than a year, and outperformed the market significantly. Anyone can get lucky, but it’s not easy to be right in the right way — the statistical way; the way we measure machine learning algorithms; the way we know Yann LeCun didn’t get lucky when he helped solve MNIST.

Once you have a model in finance that works, you hide it. You hide the techniques you used to build it. You hide the methods you used to improve your data. And most importantly, you hide the data. The financial incentive for secrecy is strong.

But when I learned about homomorphic encryption, I was motivated to find a way to use cryptography to share my dataset with other machine learning experts. I believed that if I could share the data, other people might some day build better models than mine.

So I started Numerai, the first hedge fund that gives its data away for free with structure-preserving encryption, and allows open participation by data scientists around the world.

We launched on December 1st 2015. We rose to the top of r/machinelearning until we were usurped by Elon Musk and Sam Altman’s billion dollar OpenAI project — fair enough. Within ten days, a graduate student from Bangalore with a background in neural networks beat my model.

A few days later, a user from Poland published a blog post about Numerai and shared free code for getting started on our platform. Since then, we have had professors and students at Stanford, Harvard, Carnegie Mellon, UC Berkeley, The Indian Institute of Technology and The University of Cape Town build models on our data. We have users who work as analysts on Wall Street. We have users who work at famous quantitative hedge funds. Users from Google and the Machine Intelligence Research Institute. Users using support vector machines, XGBoost and deep learning algorithms. We have users who are rated top 100, top 50, and top 10 Kaggle Masters. Users from 103 countries.

In our first month, Numerai users uploaded 10,292 prediction sets — a total of 200,098,002 equity price predictions.

I thought Numerai would take much longer to catch on, but our users are already making significant progress reducing the error rate on Numerai’s encrypted stock market dataset.

DEEPAI made the first significant stride, and then DATAGEEK achieved a lower error rate than any other user in December. Since then, the error rate has continued to fall as users experiment with new ideas and discover new techniques. One of the highest ranked users this month is using cutting edge research that he learned at this year’s Neural Information Processing Systems conference in Montreal.

Numerai is now trading user generated predictions in our hedge fund.