In 2013, a group of British intelligence agents noticed something odd. While most efforts to secure digital infrastructure were fixated on blocking bad guys from getting in, few focused on the reverse: stopping them from leaking information out. Based on that idea, the group founded a new cybersecurity company called Darktrace.

The firm partnered with mathematicians at the University of Cambridge to develop a tool that would use machine learning to catch internal breaches. Rather than train the algorithms on historical examples of attacks, however, they needed a way for the system to recognize new instances of anomalous behavior. They turned to unsupervised learning, a technique based on a rare type of machine-learning algorithm that doesn’t require humans to specify what to look for.

Darktrace has zeroed in on an infected device exhibiting anomalous behavior. Darktrace

"It’s very much like the human body’s own immune system," says the company’s co-CEO Nicole Eagan. "As complex as it is, it has this innate sense of what’s self and not self. And when it finds something that doesn’t belong—that’s not self—it has an extremely precise and rapid response."

The vast majority of machine-learning applications rely on supervised learning. This involves feeding a machine massive amounts of carefully labeled data to train it to recognize a narrowly defined pattern. Say you want your machine to recognize golden retrievers. You feed it hundreds or thousands of images of golden retrievers and of things that are not, all the while telling it explicitly which ones are which. Eventually, you end up with a pretty decent golden-retriever-spotting machine.

In cybersecurity, supervised learning works pretty well. You train a machine on the different kinds of threats your system has faced before, and it chases after them relentlessly.

But there are two main problems. For one, it only works with known threats; unknown threats still sneak in under the radar. For another, supervised-learning algorithms work best with balanced data sets—in other words, ones that have an equal number of examples of what it’s looking for and what it can ignore. Cybersecurity data is highly unbalanced: there are very few examples of threatening behavior buried in an overwhelming amount of normal behavior.

A visualization of all the connections within a particular subnetwork. Darktrace

Fortunately, where supervised learning falters, unsupervised learning excels. The latter can look at massive amounts of unlabeled data and find the pieces that don’t follow the typical pattern. As a result, it can surface threats that a system has never seen before and needs few anomalous data points to do so.

When Darktrace deploys its software, it sets up physical and digital sensors around the client’s network to map out its activity. That raw data is funneled to over 60 different unsupervised-learning algorithms that compete with one another to find anomalous behavior.