Recently, Lab41 teamed up with Cyber Reboot (a sister lab) to explore the intersection of deep learning (DL) and cyber security in a software defined network (SDN) environment. We called it Poseidon, based heavily on it being a cool word with the letters s, d, and n in order.

$ cat /usr/share/dict/american-english | tr 'A-Z' 'a-z' | grep '.*s.*d.*n.*' | grep -v \'

The goal was to use predictions about network traffic to automatically update a network's posture. This entailed three main objectives: performing deep learning on packet data, setting up an SDN environment, and scheduling a microservice to connect the two (for more information and code, visit our GitHub page). Since I belong to the cult of deep learning, I was tasked with the first objective. But in order to create something meaningful, I had to first immerse myself in the world of cyber security and then break out of some typical analytical norms. Here is my story.

MY KINGDOM TO ANSWER THESE TWO QUESTIONS

I had the privilege of working with some of the best cyber security experts in the field today. They helped guide the analytical research focus by expressing interest in finding bad people already on the network. There are many cyber security companies that focus their efforts on preventing bad people from getting onto your network, but fewer are focused on finding an intruder who has bypassed such preventions and is already pivoting on the network. In order to identify such unauthorized individuals, we needed to answer two questions: 1) What is on the network? 2) What is it doing? Though we could philosophically banter about how the final algorithm implicitly answers both questions, I will focus on our contributions to answering the second question.

WHAT IS IT DOING?

Through my literature review of network behavior analysis, it became abundantly clear that anomaly detection is the most commonly used approach for identifying anomalous events on a network. The reason is that in network data you have a billion examples of normal traffic and only a few examples of abnormal/bad/malicious traffic. It is tough to build a classifier on such an imbalance of class examples, because the classifier would simply label everything as normal and still produce a classification accuracy of 99.99999%. Anomaly detection algorithms were created for grossly imbalanced datasets: they ignore the abnormal examples and model only the normal, flagging anything that deviates too far from “normal”. The hope of this approach is to catch anything not normal, regardless of whether it is a new or old type of attack.

Unfortunately, there are many drawbacks to modeling only one class. The main drawback is the assumption that you know what normal is. A few years ago, researchers from UC Berkeley and Lawrence Berkeley National Laboratory published a paper on using machine learning for intrusion detection systems (IDS). They stated, “…traffic often exhibits much more diversity than people intuitively expect, which leads to misconceptions about what anomaly detection technology can realistically achieve in operational environments.” The diversity found in networks makes modeling normality difficult, and it can lead to a high rate of false alarms. I decided to turn this anomaly detection problem into a classification problem. Here is how I did it.

ANOMALY CLASSIFICATION

/*BETA FOR DATA SCIENCE FOLKS WHO AREN’T INTO NETWORKING: When two machines are networked together they communicate by sending data packets back and forth. The collection of all of the packets in a conversation between two computers is called a session. This is very similar to how utterances (or data packets) are structured in a dialogue (or a session)*/

First, the right inputs needed to be selected. I didn’t want to deal with the complications of deep packet inspection (compute time, encryption, etc.) so I decided to focus only on packet headers. The raw hex dump of the headers offered a beautifully sequential structure with a very small lexicon (256 hex pairs, or words). Not only are the hex pairs in packet headers sequentially ordered (like words in an utterance), but also the packets themselves are sequentially organized in a session (like utterances in a dialogue). This is a perfect recipe for deep learning consumption. The only thing left was to create anomalous sessions for the classifier.
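The hex-pair lexicon above is easy to picture in code. Here is a minimal sketch of turning raw packet-header bytes into a sequence of hex-pair tokens; the sample header bytes are fabricated for illustration and are not from any real capture.

```python
def header_to_tokens(header_bytes):
    """Map each byte (hex pair) of a packet header to an integer token in [0, 255]."""
    return [b for b in header_bytes]

# A fabricated 8-byte header fragment:
header = bytes.fromhex("45000034ab4d4000")
tokens = header_to_tokens(header)
print(tokens)        # one integer token per hex pair
print(len(tokens))   # 8 hex pairs -> 8 tokens
```

Each token is a "word" in the 256-word vocabulary, and a header is a sentence of such words.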

THE THREE ABBY NORMALS

The trick of switching from anomaly detection to classification is being able to programmatically create or generate anomalies. Recent advances in machine learning (see Generative Adversarial Networks) use competing neural networks to generate examples that are indistinguishable from a training data set. (NOTE: “Adversarial” in this context is not meant to refer to an adversary on the network but rather the competition between the two neural networks.)

In the spirit of GANs, I manually generated abnormal sessions that look almost indistinguishable from normal sessions, using three basic techniques. The first two are similar. In the first technique, the order of the source and destination IP and MAC addresses in all of the packets in a session is switched. In the second, the order of the source and destination ports is switched. The purpose of this approach is to simulate a role reversal between two machines. As an example of machine role reversal, imagine if the server you normally SSH to decides to SSH to your workstation. It is similar to me starting a conversation with my wife by saying, “Honey, I just watched the most moving episode of Grey’s Anatomy;” this is a complete role reversal in a conversation. See the figure below.

Switching the order of the source and destination IP address in all of the packets in a session to simulate a role reversal.

The third abnormal type is accomplished by leaving the source IP in its proper place and swapping out the destination IP address with an IP address the source never talks to (the swap out creates unwanted correlations within the header — these will be investigated in follow-on work). This simulates a conversation that never happens, or should never happen, on the network. It’s similar to me in college telling my friends that I had an engaging conversation with a woman — a conversation that never happened. See the figure below.

Leaving the source IP and swapping out the destination IP with one the source never talks to.
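To make the morphing concrete, here is a sketch of the first and third techniques on a bare IPv4 header (no Ethernet frame, no options), where the source IP sits at byte offsets 12-15 and the destination at 16-19. The offsets and sample addresses are illustrative assumptions, not Poseidon's exact code.

```python
SRC, DST = slice(12, 16), slice(16, 20)  # IPv4 source/destination offsets (no options)

def switch_ips(header: bytes) -> bytes:
    """Technique 1: swap source and destination IPs (role reversal)."""
    out = bytearray(header)
    out[SRC], out[DST] = header[DST], header[SRC]
    return bytes(out)

def swap_out_dst(header: bytes, new_dst: bytes) -> bytes:
    """Technique 3: keep the source, replace the destination with an
    address the source never talks to."""
    out = bytearray(header)
    out[DST] = new_dst
    return bytes(out)

# Fabricated header: 12 zero bytes, then src 10.0.0.1 and dst 10.0.0.2
header = bytes(12) + bytes([10, 0, 0, 1]) + bytes([10, 0, 0, 2])
print(switch_ips(header)[SRC])  # the source field now holds 10.0.0.2
```

The same slicing idea extends to MAC addresses and ports at their respective offsets.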

The implementation is fairly simple. We assume that all of our data is benign. When each session is presented at training time it has a 50/50 chance of remaining as a normal session or being morphed into one of the three abnormal sessions.

P(normal) = 0.5

P(IP switch) = 0.5/3

P(port switch) = 0.5/3

P(IP swap out) = 0.5/3

This allows us to have as many examples of normal sessions as we do abnormal sessions. Next, we need to choose the right algorithm and make sure it is assessing the right parts of the packets.
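The training-time morphing scheme above can be sketched in a few lines: half the sessions stay normal, and the other half is split evenly across the three synthetic abnormal types. The morphing functions themselves are stubbed out here; only the sampling logic is shown.

```python
import random

ABNORMAL_TYPES = ["ip_switch", "port_switch", "ip_swap_out"]

def label_session(rng):
    """Keep the session normal with p = 0.5; otherwise pick one of the
    three abnormal morphs uniformly (p = 0.5/3 each)."""
    if rng.random() < 0.5:
        return "normal"
    return rng.choice(ABNORMAL_TYPES)

rng = random.Random(0)
labels = [label_session(rng) for _ in range(60000)]
print(labels.count("normal") / len(labels))  # roughly 0.5
```

Because the choice is made fresh each time a session is presented, the classes stay balanced over an epoch without duplicating any data on disk.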

HIERARCHICAL RECURRENT NEURAL NETWORKS

Since both the hexadecimal pairs in a packet header and the packet headers in a session have a beautiful sequential order, a Recurrent Neural Network (RNN) is a natural choice for encoding packets and sessions. We will use two RNNs: one to summarize the hex pairs in a packet header, and one to encode all the packets in a session. We call these the Packet RNN and Session RNN respectively. The Packet RNN starts at the beginning of a header and encodes the first hex pair into a vector of numbers. It moves to the second hex pair and combines its representation with the information passed on from the first. Thus, at any pair in the header the Packet RNN is outputting a summary representation of that pair combined with the information from all the pairs before it. It does this sequentially until the last hex pair. We discard all the outputs from each pair in the sequence except for the last. This final output is a lovely summary of all the information in a packet header (see the red boxes in the figure below).

Now that we have a way of encoding and compressing a packet header into numbers, we need to collect these representations and use them to create a session representation. We use a second RNN, the Session RNN, which takes as input the ordered header representations we just created. It starts with the first header representation, combines it with the representation of the second header, and so on until the last packet in the session (see the blue boxes in the figure below).

Hierarchical RNN and dense layer architecture encode and classify sessions.

In the end we are left with a real-valued vector that is a compressed and latent representation of the entire session. This paper (including a lovely generative twist) and this one (adding attention mechanisms) are excellent examples of this architecture.
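The two-level fold can be sketched in a few lines of numpy: a Packet RNN turns each header's hex-pair tokens into a packet vector, and a Session RNN turns those packet vectors into one session vector. The sizes and random (untrained) weights are illustrative assumptions, not the trained Poseidon model.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED, PKT_H, SES_H = 8, 16, 32
E = rng.normal(size=(256, EMBED))            # one embedding per hex-pair token
Wxp = rng.normal(size=(EMBED, PKT_H))        # Packet RNN weights
Whp = rng.normal(size=(PKT_H, PKT_H))
Wxs = rng.normal(size=(PKT_H, SES_H))        # Session RNN weights
Whs = rng.normal(size=(SES_H, SES_H))

def rnn(inputs, Wx, Wh, hidden):
    """Vanilla RNN: fold a sequence into its final hidden state."""
    h = np.zeros(hidden)
    for x in inputs:
        h = np.tanh(x @ Wx + h @ Wh)  # combine this step with the summary so far
    return h

def encode_session(session):
    """session: list of packet headers, each a list of hex-pair tokens."""
    packet_vecs = [rnn(E[tokens], Wxp, Whp, PKT_H) for tokens in session]  # Packet RNN
    return rnn(packet_vecs, Wxs, Whs, SES_H)                               # Session RNN

# Two fabricated four-byte headers standing in for a session:
session = [[0x45, 0x00, 0x00, 0x34], [0x45, 0x00, 0x00, 0x28]]
print(encode_session(session).shape)  # (32,)
```

The real model would use gated cells (LSTM/GRU) and a dense layer on top of the session vector for the normal/abnormal decision, but the hierarchical fold is the core idea.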

ADDING ATTENTION

An attention mechanism is a simple addition to the DL architecture that allows the user to catch a glimpse into its decision process. It effectively turns the neural black box into more of a grey one. The output of the last time step of an RNN, as previously explained, is supposed to be a nice summary of the entire sequence it just digested. But instead of using 100% of the last output, an attention mechanism creates a weighted sum of all the time step outputs (compare the figure below with the one above). These attention weights are part of the algorithm’s learning process and update as more examples pass through the network. This gives it the ability to ignore some parts of the input and emphasize the more important parts of the sequence. We use two attention mechanisms: Packet Attention to focus on the most important parts of the header, and Session Attention to focus on the most important headers in the session.

The same architecture with the addition of two attention mechanisms (in pink).
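The weighted-sum idea is small enough to show directly. In this sketch the scoring vector is random; in the real model it is learned along with everything else, and the same construction is applied once over hex pairs (Packet Attention) and once over packets (Session Attention).

```python
import numpy as np

rng = np.random.default_rng(2)
T, H = 6, 16                       # time steps, hidden size
outputs = rng.normal(size=(T, H))  # one RNN output per time step
v = rng.normal(size=H)             # attention scoring vector (learned in practice)

scores = outputs @ v                              # one scalar score per time step
weights = np.exp(scores) / np.exp(scores).sum()   # softmax -> attention weights
context = weights @ outputs                       # weighted sum replaces the last output

print(weights.sum())   # the weights sum to 1
print(context.shape)   # same size as a single RNN output
```

Plotting `weights` over the input is exactly the kind of visualization shown in the figures below: large weights mark the hex pairs or packets the model leaned on.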

The figure below is a visualization of the two attention types for the first 8 packets of a session that suffers from destination IP swap out. Since we swapped out the destination IP address and left the source IP alone, we would hope that the Packet Attention mechanism would focus on the destination IP portion of the header. And it does! The darker the blue, the more important the Packet Attention deemed that part of the header. Interestingly, it also focuses on the destination port, probably sensing that the two don’t match up very well. The Packet Attention didn’t look at the right parts of every packet in the session, but that is ok: the Session Attention largely ignored the packets that didn’t focus on the destination IP address and port areas. The darker the red, the more important the Session Attention thought that packet was.

Visualization of the two added attention mechanisms for a session that suffers from destination IP swap out. The darker the color the more attention was given to that part to aid in classification.

Follow this link and especially this one for more information on attention.

RESULTS

I tested the accuracy of the classifier on an openly available PCAP file called bigFlows.pcap. The order of the packets in the file is preserved. We use the first 80% of the sessions for model training and the remaining 20% for model testing. Remember, all of the data are presumed to be benign. In reality, some portion of any given network is likely to be compromised. This means the model won’t identify the existing hostility, but it will identify when the attacker tries to spread. The testing data is modified in the same manner as the training data. The results are exciting.

F1 scores for each class and overall classification accuracy.

We expect it to do well in this adversarial scenario. Poor results here would have indicated a need for more model tuning. What should really get you fist pumping is the memory capacity of the RNNs. They can remember the relationship between two IP addresses! The next step was to test it out on a labeled IDS dataset.

ISCX IDS RESULTS

Finding a useful IDS dataset is difficult. A considerable amount of time was spent looking for an applicable dataset. The University of New Brunswick published an IDS dataset in 2012. It consists of seven days of network traffic PCAP files. The details of the data are in the figure below.

I was only interested in days 1 through 3, so the other 4 days were discarded. We use the normal traffic from days 1 and 2 to train the model, using only two of the three abnormal types to define abnormal sessions: IP direction switch and destination IP swap out (i.e., each selected with probability 0.5/2). Once the model had sufficiently learned from the normal and synthetic abnormal data, it was put to the classification test on data from day 3. Below is a confusion matrix of the results from the best model.

                 Predicted normal   Predicted attack
Actual normal         92455                 46
Actual attack          1608               8346

What this matrix tells us is that out of the total actual attack sessions (1608 + 8346 = 9954), the model catches 83.8% (8346/9954 = 0.838) of them, and only 0.5% of its attack flags are false alarms (46/(46 + 8346) = 0.00548). Remember, the neural network defines abnormal based on only two simple attack types, and zero firewall rules. What if we had 5 attack types, or 10, or 20 to teach the neural network what abnormal is? There is room for improvement. By this time your arm should be sore from fist pumping so much. You can try this out now by downloading the Jupyter Notebooks on our GitHub repo.
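The arithmetic behind those percentages falls straight out of the confusion matrix: detection rate is recall on the attack class, and the false-alarm share is the fraction of attack flags that were actually normal sessions.

```python
# Counts from the confusion matrix above.
tn, fp = 92455, 46    # actual normal sessions
fn, tp = 1608, 8346   # actual attack sessions

detection_rate = tp / (tp + fn)     # 8346 / 9954: attacks caught
false_alarm_share = fp / (fp + tp)  # 46 / 8392: attack flags that were wrong

print(round(detection_rate, 3))     # 0.838
print(round(false_alarm_share, 5))  # 0.00548
```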

FUTURE DIRECTION

I am aware that these three threat types are pretty basic when it comes to network security, but their effectiveness on the ISCX dataset was surprising. What would make this all the more awesome is the addition of a generative component as described in this paper. This would allow the classifier to go beyond the dataset and be more robust in catching variations of the same attack type.

The success of this normal/abnormal classifier and attention mechanisms gives hope that this architecture can teach us what is important in headers and sessions for more sophisticated attack types. But finding an interesting event on your network is only part of a complete cyber defense system. Next, you must effectively react to that event. This reaction will be the focus of the next phase of our project.