Phase I: Detection of Automated Scans (Kinda Anomaly Detection)

Machine learning is eating the world. From communication and finance to transportation, manufacturing, and even agriculture, nearly every technology field has been transformed by machine learning and artificial intelligence, or will soon be.

With machine learning offering (potential) solutions to everything under the sun, it is only natural that it be applied to computer security, web-application security, and cyber security, fields which intrinsically provide the robust data sets on which machine learning thrives.

Automated scanners are tools that scan a web application, normally from the outside, looking for security vulnerabilities such as cross-site scripting, SQL injection, command injection, path traversal, and insecure server configuration. More details about automated scanning tools can be found here: https://www.owasp.org/index.php/Category:Vulnerability_Scanning_Tools

I tried to detect automated scans running against my web application using ML and found some really promising results.

Takeaways: Before diving deeply into which algorithm to use, we have to analyze the use case first. There are two conclusions that I made:

1) We have very few automated user agents/points, and those are the anomalies, and

2) The distance between an anomalous user agent and a normal user agent is large.

Challenges of applying ML in security:

More false positives, and detecting the anomalous user agents (automated scans) is very difficult.

Dataset :

4-dimensional data => Avg. request param count, Avg. response time, Avg. response status, Total number of requests.
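As a rough sketch of how these four features could be aggregated per user agent from raw access-log rows (the column names and sample values here are hypothetical, not the actual log schema):

```python
# Hypothetical sketch: collapsing raw access-log rows into one
# 4-dimensional feature vector per user agent. Column names are assumed.
import pandas as pd

logs = pd.DataFrame({
    "user_agent":  ["Mozilla/5.0", "Mozilla/5.0", "sqlmap/1.3", "sqlmap/1.3"],
    "param_count": [2, 3, 14, 18],        # request parameters per request
    "response_ms": [120, 150, 40, 35],    # server response time
    "status":      [200, 200, 404, 500],  # HTTP response status
})

features = logs.groupby("user_agent").agg(
    avg_param_count=("param_count", "mean"),
    avg_response_time=("response_ms", "mean"),
    avg_response_status=("status", "mean"),
    total_requests=("param_count", "size"),
)
print(features)
```

Each row of `features` is then one point in the 4-dimensional space the clustering algorithms below operate on.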

1) After collecting my data, I applied the clustering algorithm DBSCAN to find all the odd ones out, setting epsilon to the average distance between the points and the minimum points per cluster to 10. Groups with fewer than 10 points are detected, and those points are mainly the automated user agents.

The above approach looks like semi-supervised learning, since my data is labeled and I am also tuning epsilon and the minimum number of points with that data.
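A minimal version of this DBSCAN step could look like the following; the `eps` and `min_samples` values and the feature vectors are illustrative stand-ins, not the tuned values from the experiment:

```python
# Sketch of DBSCAN-based outlier detection (scikit-learn).
# eps/min_samples and the data below are illustrative only.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Rows: [avg_param_count, avg_response_time, avg_response_status, total_requests]
X = np.array([
    [2.0, 140, 200, 500],   # normal browser traffic
    [2.5, 150, 200, 480],
    [3.0, 130, 210, 520],
    [2.2, 145, 200, 510],
    [16.0, 30, 420, 9000],  # scanner-like outlier
])

# Scale features so no single dimension dominates the distance metric.
X_scaled = StandardScaler().fit_transform(X)
labels = DBSCAN(eps=1.5, min_samples=3).fit_predict(X_scaled)

# Points labeled -1 fall in no dense cluster: candidate automated agents.
outliers = np.where(labels == -1)[0]
print(outliers)
```

In practice `eps` would be set from the average pairwise distance and `min_samples` to 10, as described above.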

2) Another approach I tried is a one-class SVM. This one is a bit easier: I did a trick by using the previous day's data as the training data for the current day. This looks unfair, but I am impressed by the results ;). This approach learns purely from past days, and if there is an anomalous user agent, I mark it and remove it from the next day's training. (You need an automated way to verify the automated user agent.)
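The train-on-yesterday, score-today loop can be sketched like this; the synthetic data and the `nu`/`gamma` values are assumptions for illustration:

```python
# Sketch of the one-class SVM approach: fit on yesterday's (assumed clean)
# per-user-agent features, then score today's. Data and hyperparameters
# are illustrative only.
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Yesterday's traffic: normal user agents only (synthetic stand-in).
day_prev = rng.normal(loc=[2.5, 140, 200, 500],
                      scale=[0.5, 10, 5, 30], size=(200, 4))
# Today's traffic: same distribution plus one scanner-like agent at the end.
day_now = np.vstack([
    rng.normal(loc=[2.5, 140, 200, 500], scale=[0.5, 10, 5, 30], size=(20, 4)),
    [[16.0, 30, 420, 9000]],
])

scaler = StandardScaler().fit(day_prev)
clf = OneClassSVM(nu=0.05, gamma="scale").fit(scaler.transform(day_prev))

pred = clf.predict(scaler.transform(day_now))  # -1 = anomaly, +1 = normal
print(np.where(pred == -1)[0])
```

Any agent flagged `-1` today would be marked and excluded from tomorrow's training set, matching the loop described above.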

3) I also clubbed k-means with DBSCAN/kNN. First I run k-means with k as 3, 5, 7, 9, and 11. Then I take the points that are farthest from the cluster centers (say the top 50 points in each cluster; with k = 5 this gives 250 points) and run DBSCAN/kNN on the resulting data set.
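The hybrid step can be sketched as below for a single k; the cluster count, top-N cutoff, and synthetic data are small illustrative choices, not the k ∈ {3, 5, 7, 9, 11} sweep or the top-50 cutoff from the text:

```python
# Sketch of the k-means + DBSCAN hybrid: run k-means, keep the points
# farthest from their cluster centers, then run DBSCAN on that reduced
# candidate set. k, top_n, and the data are illustrative only.
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal([0, 0, 0, 0], 0.3, size=(50, 4)),   # normal cluster A
    rng.normal([5, 5, 5, 5], 0.3, size=(50, 4)),   # normal cluster B
    [[20.0, 20.0, 20.0, 20.0]],                    # lone scanner-like point
])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
# Distance of each point to its own cluster center.
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# Keep the top-N farthest points per cluster as outlier candidates.
top_n = 5
candidates = np.concatenate([
    np.where(km.labels_ == c)[0][np.argsort(dist[km.labels_ == c])[-top_n:]]
    for c in range(km.n_clusters)
])

# Second pass: DBSCAN on the reduced candidate set; -1 = final outlier.
sub_labels = DBSCAN(eps=1.0, min_samples=3).fit_predict(X[candidates])
outliers = candidates[sub_labels == -1]
print(outliers)
```

The k-means pass cheaply shrinks the data to the suspicious fringe of each cluster, so the second-pass density check only has to examine a few hundred points instead of the full data set.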