The methods we’ve discussed so far are signature-based. Based on the mechanisms we have identified for how lateral movement might happen, we define business rules and encode them as alerts. This lets us incorporate as much domain expertise and context as possible into our alerts. Done properly, this gives us alerts with a good signal-to-noise ratio.

Although these types of rules are necessary to defend our network, they may not be sufficient because:

The complexity of the network scales much faster than the security team’s added manpower.

Enterprise networks are not static, and continuous monitoring and re-baselining is needed.

Business rules are not robust in detecting novel attacks.

If your network is relatively small, then you can survive with rule-based alerts, because you can probably describe all the normal connections you expect to see and create an alert for anything you don’t consider normal. These hosts should only talk to the DNS and proxy servers. This server should only connect to this database. Only these particular hosts should connect to the production databases…
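As a toy illustration of such a rule (the host roles, ports, and event format below are made up, not taken from any real environment), checking each connection against an allowlist could look something like this:

```python
# Hypothetical allowlist of (source role, destination role, destination port).
ALLOWED_CONNECTIONS = {
    ("workstation", "dns-server", 53),
    ("workstation", "proxy-server", 8080),
    ("app-server", "prod-database", 5432),
}

def check_connection(src_role, dst_role, dst_port):
    """Return an alert message if the connection is not on the allowlist."""
    if (src_role, dst_role, dst_port) not in ALLOWED_CONNECTIONS:
        return f"ALERT: unexpected connection {src_role} -> {dst_role}:{dst_port}"
    return None

# A workstation talking directly to the production database should raise an alert.
print(check_connection("workstation", "prod-database", 5432))
```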

However, as your network grows larger and complexity increases, it becomes almost impossible to keep track of all of this. That is why we want to explore some techniques for network anomaly detection.

The hypothesis is that a malicious host would deviate from the typical behavior. If we are somehow able to capture the characteristics of “normal” in an automated way, then hopefully we are able to catch malicious activities by sifting through the “abnormal”.

How to define and model “normal” well is an area of ongoing research. For this blog post, we will look mainly at two approaches:

Finding outliers in hosts’ behavioral signatures

Finding unexpected connections through new edge/link prediction

Outliers in Host Behavioral Signatures

The idea here is to summarize the behavior of each host over a given time interval into a single vector. With all of these host vectors, we can identify hosts that seem out of place relative to the rest of the network.

Feature Engineering

The first and most important step is to build features for each host in the network.

What features should we get? Well, it depends on what is available to you and what the security domain experts of your organization advise.

As a starting point, we look at what features the paper [13] used. They used a total of 23 features:

RDP Features: SuccessfulLogonRDPPortCount, UnsuccessfulLogonRDPPortCount, RDPOutboundSuccessfulCount, RDPOutboundFailedCount, RDPInboundCount

SQL Features: UnsuccessfulLogonSQLPortCount, SQLOutboundSuccessfulCount, SQLOutboundFailedCount, SQLInboundCount

Successful Logon Features: SuccessfulLogonTypeInteractiveCount, SuccessfulLogonTypeNetworkCount, SuccessfulLogonTypeUnlockCount, SuccessfulLogonTypeRemoteInteractiveCount, SuccessfulLogonTypeOtherCount

Unsuccessful Logon Features: UnsuccessfulLogonTypeInteractiveCount, UnsuccessfulLogonTypeNetworkCount, UnsuccessfulLogonTypeUnlockCount, UnsuccessfulLogonTypeRemoteInteractiveCount, UnsuccessfulLogonTypeOtherCount

Others: NtlmCount, DistinctSourceIPCount, DistinctDestinationIPCount

Other examples of features from [14] are:

Whether at least one address verification failed over the last 24 hours

The maximum outlier score given to an IP address from which the user has accessed the website

The minimum time from login to checkout

The number of different locations from which a user has accessed the website over the last 24 hours.

With these feature vectors, we can build a matrix and start using outlier detection techniques.
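As a rough sketch of how such a matrix could be assembled (the event log schema and column names below are hypothetical, not the exact features from [13]), one option is to count event types per host with pandas:

```python
import pandas as pd

# Hypothetical event log for one time window: one row per event.
events = pd.DataFrame({
    "host":   ["h1", "h1", "h2", "h2", "h3"],
    "event":  ["rdp_out_ok", "rdp_out_fail", "sql_out_ok", "ntlm", "rdp_in"],
    "dst_ip": ["10.0.0.5", "10.0.0.6", "10.0.0.7", "10.0.0.5", "10.0.0.9"],
})

# One row per host: counts of each event type plus distinct destination IPs.
counts = pd.crosstab(events["host"], events["event"])
distinct_dst = events.groupby("host")["dst_ip"].nunique().rename("DistinctDestinationIPCount")
feature_matrix = counts.join(distinct_dst)
print(feature_matrix)
```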

Principal component analysis

We start with a classical way of finding outliers, principal component analysis (PCA) [15], because it is the most visual, in my opinion.

In simple terms, you can think of PCA as a way to compress and decompress data, where the information lost during compression is minimized.

Since most of the data should be normal, the low-rank approximation from PCA will mostly capture that normal data. How does PCA perform with outliers?

Since the outliers do not conform to the correlation structure of the rest of the data, they will have high reconstruction errors. One way to visualize this is to think of the “normal data” as the background activity of the network.

Below we see a more visual example from the numerical linear algebra course of fast.ai [12]. They constructed a matrix from the different frames of the video. The image on the left is one frame of the video. The image in the middle is the low-rank approximation using Robust PCA. The image on the right is the difference, the reconstruction error.

In the example above, the “subjects” of the video typically appear in the reconstruction error. Similarly, we are hoping that anomalous hosts would stand out from the background and have high reconstruction errors.

Note: Classical PCA can be sensitive to outliers (unfortunately), so it might be better to use Robust PCA, discussed in [12].
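To make this concrete, here is a minimal sketch of the reconstruction-error idea using classical PCA from scikit-learn on synthetic data; it is not the Robust PCA variant from [12], and the data and dimensions are made up for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic host-feature matrix: 500 "normal" hosts living on a low-dimensional
# structure, plus a few hosts that do not follow the same correlations.
normal = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 10))
outliers = rng.normal(loc=5.0, size=(5, 10))
X = np.vstack([normal, outliers])

# Compress to a low-rank representation and decompress again.
pca = PCA(n_components=3).fit(X)
X_reconstructed = pca.inverse_transform(pca.transform(X))

# Reconstruction error per host: the injected outliers should score noticeably higher.
reconstruction_error = np.linalg.norm(X - X_reconstructed, axis=1)
print(reconstruction_error[:5])    # typical "normal" hosts
print(reconstruction_error[-5:])   # the injected outliers
```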

Autoencoders

I won’t go into much detail on autoencoders, but you can think of them as a non-linear version of PCA.

The neural network is constructed such that there is an information bottleneck in the middle. By forcing the data through a small number of nodes, the network has to prioritize the most meaningful latent variables, which play a role similar to the principal components in PCA.

Similar to PCA, if the autoencoder is trained on normal data, then it will have a hard time reconstructing the outlier data. The reconstruction error can be used as an “anomaly score”.
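A minimal PyTorch sketch of this idea might look like the following; the layer sizes, training settings, and synthetic data are arbitrary choices for illustration, not the architecture used in [14]:

```python
import torch
from torch import nn

torch.manual_seed(0)

# Synthetic stand-in for the host feature matrix (rows = hosts, columns = features).
X = torch.randn(500, 10)

# A small autoencoder: the 2-unit bottleneck forces the network to learn a
# compressed representation, analogous to the principal components in PCA.
model = nn.Sequential(
    nn.Linear(10, 6), nn.ReLU(),
    nn.Linear(6, 2),              # bottleneck
    nn.Linear(2, 6), nn.ReLU(),
    nn.Linear(6, 10),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

# Train the network to reproduce its own input.
for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(X), X)
    loss.backward()
    optimizer.step()

# Per-host reconstruction error, usable as an anomaly score.
with torch.no_grad():
    anomaly_score = ((model(X) - X) ** 2).mean(dim=1)
print(anomaly_score.topk(5).indices)  # hosts with the largest reconstruction error
```

Hosts with the largest reconstruction errors would be the first candidates for a closer look.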

PCA and autoencoders are some of the components used by [14] for detecting malicious hosts in an unsupervised manner.

Isolation forests and other methods

Another popular way to find anomalous hosts is by using isolation forests, which have been shown to perform better than other methods [13]. Just like many tree-based algorithms, they can handle both numerical and categorical data, and they make few assumptions about the distribution and shape of the data.
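A hedged sketch with scikit-learn’s IsolationForest (the synthetic data and parameters are only illustrative) could look like this:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(500, 10)),          # "normal" hosts
               rng.normal(loc=5.0, size=(5, 10))])  # a few unusual hosts

# contamination is the expected fraction of outliers; treat it as a knob to tune.
iso = IsolationForest(n_estimators=100, contamination=0.01, random_state=0).fit(X)

scores = -iso.score_samples(X)   # higher = more anomalous
labels = iso.predict(X)          # -1 = flagged as an outlier, 1 = inlier
print(np.where(labels == -1)[0])
```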

In the end, we want to use our historical data to learn a function that can aid us in distinguishing what is normal and not normal.

Example decision boundaries learned from different methods using PyOD [16]

If you want to explore different methods for outlier detection, I recommend going through PyOD [16], its implemented algorithms, and the papers they cite.
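Most PyOD detectors share the same fit/score interface, so swapping algorithms is cheap. Here is a small sketch assuming a k-nearest-neighbours detector; the choice of detector and parameters is arbitrary:

```python
import numpy as np
from pyod.models.knn import KNN   # other PyOD detectors follow the same interface

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 10))
X_new = np.vstack([rng.normal(size=(3, 10)), rng.normal(loc=5.0, size=(2, 10))])

clf = KNN(contamination=0.01)
clf.fit(X_train)

print(clf.decision_scores_[:5])       # outlier scores for some training hosts
print(clf.decision_function(X_new))   # scores for new hosts (higher = more anomalous)
print(clf.predict(X_new))             # 1 = outlier, 0 = inlier
```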

If you are interested in learning more about this, I would recommend watching Anomaly Detection: Algorithms, Explanations, Applications [17].

New Edge/Link Prediction

This is an oversimplification of [11]

Unlike the previous methods, where we try to find malicious hosts, here we try to find anomalous edges, where an edge is a connection between a client and a server, or a source and a destination. The source/client can also be a username, with the edges representing authentications.

Now suppose this is the first time we are seeing a particular edge. We might ask: what is the probability of observing this new edge?

To estimate this probability, we need some sort of representation for both the client and the server. We can use categorical covariates, for example:

The subnet of the host

The type of host (endpoint or server)

Job title of the user

Location of the user

With categorical covariates, each represented by binary indicator variables, we can use the interactions of the source and destination covariates to get an idea of how likely the new edge is.
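As a heavily simplified illustration of this idea (not the models from [11] or [18], and using only main effects rather than the source-destination interactions described above), one could one-hot encode the covariates of historical edges, add some sampled non-edges as negatives, and fit a simple classifier to score how plausible a candidate edge is. All the covariate values below are hypothetical:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical historical edges (observed = 1) plus sampled non-edges (observed = 0),
# each described by categorical covariates of the source and destination.
edges = pd.DataFrame({
    "src_subnet": ["10.1", "10.1", "10.2", "10.2", "10.1", "10.3", "10.3", "10.2"],
    "src_type":   ["endpoint", "endpoint", "server", "server",
                   "endpoint", "endpoint", "endpoint", "server"],
    "dst_subnet": ["10.9", "10.9", "10.8", "10.8", "10.9", "10.8", "10.9", "10.9"],
    "dst_type":   ["proxy", "proxy", "database", "database",
                   "proxy", "database", "proxy", "proxy"],
    "observed":   [1, 1, 1, 1, 1, 0, 0, 0],
})

# Binary indicator (one-hot) variables for the covariates.
X = pd.get_dummies(edges.drop(columns="observed"))
y = edges["observed"]

model = LogisticRegression().fit(X, y)

# Probability assigned to a candidate edge with the same covariate encoding.
candidate = X.iloc[[0]]   # stand-in for a newly seen source/destination pair
print(model.predict_proba(candidate)[0, 1])
```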

Getting embeddings for source and destination

Another way to represent a host is to generate some sort of embedding.

Let us say that, from the historical data, we have previously observed the following edges. A host can be a source in some connections and a destination in others.

From this, we can construct an adjacency matrix, where each cell represents a possible edge, and edges that have been observed have a value of 1.

Adjacency matrix from the observed edges

Using this matrix, we can perform some non-negative matrix factorization. This is something that we see in other applications such as recommender systems and collaborative filtering. Through the factorization, we are able to get embeddings for both source and destination.
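Here is a small sketch using plain non-negative matrix factorization from scikit-learn on a toy adjacency matrix; this is not the covariate-aware Poisson model from [18], just the basic factorization idea:

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy adjacency matrix: rows = source hosts, columns = destination hosts,
# 1 where a connection has been observed historically.
A = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 0],
])

# Factorize A ~ W @ H: rows of W are source embeddings,
# columns of H are destination embeddings.
nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=1000)
W = nmf.fit_transform(A)
H = nmf.components_

# Score unseen edges by their reconstructed value.
print(W[4] @ H[:, 1])   # source 4 -> destination 1: consistent with past structure
print(W[4] @ H[:, 3])   # source 4 -> destination 3: close to zero, a surprising edge
```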

Link prediction

The paper [18] shows how to perform the factorization while incorporating the categorical covariates. This is done through Poisson Matrix Factorization (PMF).

After estimating the embeddings and the necessary coefficients, we can estimate how likely new edges are.

I hope this gives you a high-level idea of what we’re trying to do, but of course, estimating the values of α, β, and ɸ in a computationally efficient way is the crux of the problem. Details can be found in [18]. Another paper on edge prediction is [11].

If you are interested in more applications of statistics in cyber security, I suggest watching Nick Heard’s talk Data Science in Cyber-Security and Related Statistical Challenges [19].

CTF Challenges