A reality check on the role of machine learning in cybersecurity

By Ben Dickson for TechTalks

Cybersecurity, a huge industry worth over $100 billion, is regularly subject to buzzwords. Cybersecurity companies often use (or pretend to use) new state-of-the-art technologies to attract customers and sell their solutions. Naturally, with artificial intelligence in one of its craziest hype cycles, we’re seeing plenty of solutions that claim to use machine learning, deep learning and other AI-related technologies to automatically secure the networks and digital assets of their clients.

But contrary to what many companies profess, machine learning is not a silver bullet that will automatically protect individuals and organizations against security threats, says Ilia Kolochenko, CEO of ImmuniWeb, a company that uses AI to test the security of web and mobile applications.

While machine learning and other AI techniques will help improve the speed and quality of cybersecurity solutions, they will not be a replacement for many of the basic practices that companies often neglect.

Artificial intelligence won’t automate cybersecurity

“In cybersecurity today, we overestimate the capacities of machine learning,” Kolochenko says. “When talking about AI, many people have this illusion that they can just plug in software or hardware that is leveraging AI, and it will solve all their problems. It will not.”

According to Kolochenko, one of the main causes of data breaches and security incidents is a lack of visibility into company data and assets. Organizations are growing larger and more fragmented, and they’re not doing a good job of keeping tabs on all their data and computing devices.

“Organizations are becoming so large, so clumsy that they have no idea where their data is stored, who has access to their data, how many devices, cloud storages, IoT devices, etc. they have, and all this leads to a very expansive, continuous and inevitable incidents and data breaches,” Kolochenko says.

This is an area where machine learning won’t help. Organizations need to have proper processes and practices in place to keep a continuous inventory of their digital assets. “If you do not have a process—even a paper-based process—of how you do things, who is responsible, who is accountable, who has the capacity to do continuous inventory, AI will not help,” Kolochenko says.

Machine learning will automate repetitive tasks, if it has the right data

This doesn’t mean, however, that machine learning is without use in cybersecurity. It can still help network administrators identify safe behavior and potential threats by accelerating the process of searching through data.

“AI can support you and accelerate you and take care of some routine time-consuming tasks and free up your team to spend their efforts on really complicated and more important tasks,” Kolochenko says.

Machine learning can specifically help in tasks that can’t be represented in classical rule-based algorithms. “We consider using artificial intelligence only when software solutions that don’t use big data and machine learning can’t provide you with meaningful outcomes, where we don’t know in advance all possible combinations, all possible use cases,” says Kolochenko.

Kolochenko also points out that a prerequisite for using machine learning is having the right training data. Without data of sufficient quantity and quality, AI models will give the wrong signals or produce biased results.

“If you want to make sure the machine learning model will provide you with reasonable answers, you have to make sure that the data is comprehensive and it’s relevant. If you don’t have any data, you’d better reconsider reviewing the use of machine learning,” Kolochenko says, adding that many of the startups that talk about AI and cybersecurity don’t have the data required to solve the problems they advertise. “For every startup the biggest challenge is where to obtain reliable data,” he says.

Machine learning and anomaly detection

The most commonly cited use of AI in cybersecurity is anomaly detection. The idea is to feed a machine learning algorithm a company’s data, let it determine the normal behavior (the baseline), and then detect and block deviations from that norm (the anomalies).
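To make the baseline-and-deviation idea concrete, here is a minimal sketch in Python. It is not how any particular vendor does it: the data (logins per hour), the function names and the three-standard-deviation threshold are all assumptions chosen for illustration, and a simple z-score stands in for a real machine learning model.

```python
from statistics import mean, stdev

def fit_baseline(values):
    """Learn the 'normal' level and spread from historical observations."""
    return mean(values), stdev(values)

def is_anomaly(value, baseline, threshold=3.0):
    """Flag a value more than `threshold` standard deviations from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        # No variation in the history: anything different is a deviation.
        return value != mu
    return abs(value - mu) / sigma > threshold

# Hypothetical data: logins per hour for one account during a quiet period.
history = [4, 5, 6, 5, 4, 6, 5, 5, 4, 6]
baseline = fit_baseline(history)

print(is_anomaly(5, baseline))   # False: a typical value
print(is_anomaly(40, baseline))  # True: far outside the baseline
```

Even in this toy form, the quality of the result depends entirely on whether the history really represents normal behavior.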

In theory, it sounds like a very promising idea and there are several companies that have implemented it with a degree of success. But in practice, cybersecurity and threat detection and prevention are much more complicated.

“We still have companies who try to advertise a particular approach to machine learning, such as unsupervised learning and full automation,” Kolochenko says.

Unsupervised learning is a type of machine learning training in which you provide the algorithm unlabeled data and let it arrange them in clusters and groups based on the common characteristics it finds. Supervised learning, the more common AI training method, requires humans to annotate training data, such as writing the descriptions of images or audio samples.
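As a rough illustration of what "arranging unlabeled data in clusters" means, the sketch below groups plain numbers with a tiny one-dimensional k-means routine. The data (request sizes in kilobytes) and the function name are hypothetical, and real systems cluster many features at once rather than a single number.

```python
def kmeans_1d(points, k=2, iterations=10):
    """Group unlabeled numbers into k clusters (a tiny k-means sketch)."""
    if k < 2:
        raise ValueError("this sketch needs k >= 2")
    # Start from evenly spaced values of the sorted data so runs are repeatable.
    ordered = sorted(points)
    centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Assign every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical unlabeled data: request sizes in KB, with no annotations.
sizes = [1, 2, 3, 2, 98, 101, 99, 100]
centers, clusters = kmeans_1d(sizes)
print(centers)   # [2.0, 99.5]
print(clusters)  # [[1, 2, 3, 2], [98, 101, 99, 100]]
```

The algorithm finds two groups without being told what they mean. Deciding whether the large requests are an attack or a backup job is still a labeling question, which is where supervised data or a human comes in.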

The benefit of unsupervised learning is that it doesn’t need humans to label the training data, a practice that can become costly and slow. It is especially suitable for use cases where data is abundant but annotating it is either impossible (because of the multitude of characteristics and parameters) or would require too much effort.

But there’s no guarantee that a machine learning algorithm trained through unsupervised learning will extract the right correlations, especially when you’re trying to profile a very complex space.

“Unsupervised machine learning is really good for simple tasks, but really depending on the complexity, you may need to shift to rainforest learning, or supervised learning and so on. The more complicated the task is, the more business logic that is not obvious that can’t be clustered, and the more untrivial and illogical the task is, the more human intervention you will need,” Kolochenko says.

Some companies have worked around this by using semi-supervised learning, where they allow their AI models to train through unsupervised learning while employing human analysts to guide and apply corrections where the algorithm makes mistakes. Over time, the AI algorithm learns from both the data and the human feedback and performs much better than it would have with unsupervised learning alone.
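The analyst-correction loop can be sketched in a deliberately simplified form. The class below is hypothetical and is not a full semi-supervised training procedure: it assumes some model already produces a threat score between 0 and 1, and it only shows how human verdicts on mistakes might feed back into the detector.

```python
class FeedbackDetector:
    """Score-threshold detector whose threshold is nudged by analyst verdicts."""

    def __init__(self, threshold=0.5, step=0.1):
        self.threshold = threshold
        self.step = step

    def predict(self, score):
        """Flag the event as a threat when its score reaches the threshold."""
        return score >= self.threshold

    def correct(self, score, is_threat):
        """Apply one analyst verdict; the threshold changes only on a mistake."""
        predicted = self.predict(score)
        if predicted and not is_threat:
            # False positive: become less sensitive.
            self.threshold = min(1.0, self.threshold + self.step)
        elif not predicted and is_threat:
            # Missed threat: become more sensitive.
            self.threshold = max(0.0, self.threshold - self.step)

detector = FeedbackDetector()
print(detector.predict(0.55))            # True: flagged at first
detector.correct(0.55, is_threat=False)  # analyst marks it as benign
print(detector.predict(0.55))            # False: no longer flagged
```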

“We certainly see good progress on the market, and we see companies that leverage machine learning to deliver value to their customers,” Kolochenko says. “It can be demonstrated either by reduction of false positives and increasing detection of threats that were previously undetectable.”

But these improvements are not proportional to the evolving cyberthreats, growing generation of data, and the widening skills gap in the cybersecurity industry. “We’re not keeping up with our own growth. We improved speed, we improved reliability, we reduced noise. But I can’t say that we’ve made a revolution,” Kolochenko says.

Machine learning and application security testing

ImmuniWeb’s AI platform is tailored for identifying vulnerabilities in web and mobile applications. But Kolochenko points out that machine learning is just one of many tools his company uses to root out security holes in the systems of its customers. The general strategy of ImmuniWeb is to use AI to augment the skills of human analysts, not automate the entire process.

“I always tell my customers that machine learning is just one way of performing some processes and tasks, it’s not a replacement,” Kolochenko says.

For simple tasks, such as detection of simple cross-site scripting (XSS) and SQL injection vulnerabilities, the company uses traditional, rule-based tools that have already proven their worth. There’s no need to use machine learning for something that already has a simpler and more practical solution.
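A rule-based check of this kind can be as simple as matching known signatures. The patterns below are illustrative assumptions, not ImmuniWeb’s rules, and real scanners use far richer rule sets, but they show why no training data is needed for the simple cases.

```python
import re

# Illustrative signatures only; production scanners use far richer rules.
RULES = {
    "xss": re.compile(r"<\s*script\b|\bon\w+\s*=|javascript:", re.IGNORECASE),
    "sqli": re.compile(
        r"'\s*(?:or|and)\s+\d+\s*=\s*\d+|\bunion\s+select\b|--\s*$",
        re.IGNORECASE,
    ),
}

def match_rules(text):
    """Return the sorted names of every rule whose pattern appears in `text`."""
    return sorted(name for name, pattern in RULES.items() if pattern.search(text))

print(match_rules("<script>alert(1)</script>"))  # ['xss']
print(match_rules("' OR 1=1 --"))                # ['sqli']
print(match_rules("hello world"))                # []
```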

For more complicated tasks that require consolidating data from various sources and can’t be performed with classic tools, the company uses its own proprietary AI algorithms. “For example, when we need to bypass a particular web application firewall (WAF), it’s not something that classic algorithms will perform well. Our machine learning algorithms jump in and we use aggregated knowledge from our pen tester, from public sources, to try to bypass the WAF in the fastest manner,” Kolochenko says.

But the machine learning algorithms often need help from human pen testers to complete their tasks. “If the AI fails, the issue will be escalated to our people. So, we still have people and we don’t claim that we have unsupervised machine learning,” Kolochenko says. “We have 10 percent of the most complicated tasks—such as CAPTCHAs that can’t be bypassed, or a functionality that has never been seen before—that will be shifted to our people.”
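The escalation pattern described here can be sketched as a simple triage step. The task names, the confidence scores and the 0.8 cutoff are assumptions made up for the example; the article does not say how ImmuniWeb decides that the AI has failed.

```python
def triage(tasks, min_confidence=0.8):
    """Split tasks into those handled automatically and those escalated."""
    automated, escalated = [], []
    for name, confidence in tasks:
        if confidence >= min_confidence:
            automated.append(name)
        else:
            # The model is not confident enough: hand the task to a human.
            escalated.append(name)
    return automated, escalated

# Hypothetical tasks paired with the model's confidence in its own result.
tasks = [
    ("login form test", 0.95),
    ("search field test", 0.90),
    ("CAPTCHA-protected signup", 0.20),
    ("never-seen-before widget", 0.35),
]
automated, escalated = triage(tasks)
print(automated)  # ['login form test', 'search field test']
print(escalated)  # ['CAPTCHA-protected signup', 'never-seen-before widget']
```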

The use of AI in application security testing has enabled the company to scale its efforts. “Compared to traditional penetration testing, where we allocate one percent of our effort to take care of web application penetration testing during the week, we can afford to spend one hour per day and deliver a full report with all vulnerabilities detected, remediation guidelines, in just one business day,” Kolochenko says. “We make our people scalable and augment them using machine learning.”