TCPDump, and the care and feeding of an intelligent SDN

Without good data, we won’t have good data science.

In a previous blog post, we noted that Poseidon’s combined use of machine learning and software defined networking (SDN) gives it an unusual advantage in pursuing its goal: detecting and responding to an attacker’s lateral movement once a network breach has occurred. To make this a reality, Poseidon must be able to answer two deceptively simple questions: 1) what is on the network? and 2) what is it doing?

Building a data set to train on is the first step in any machine learning endeavor, and the old adage “garbage in, garbage out” is as true as ever in the cyber domain. Our lab spent a good portion of 2017 attempting to collect network data to train our models, and the lack of good network data was the single biggest obstacle to our project. What follows is our story, what drove our decision to enhance tcpdump, and an invitation to collaborate on this quest to build more intelligent networks.

A dearth of publicly available (benign) network data.

The Poseidon team decided to focus specifically on “enterprise” environments for our research. Our data scientists required examples of both “good” network traffic (i.e., benign) and “bad” traffic (i.e., lateral movement). Ultimately, we categorized these as “normal” and “abnormal” traffic. What counts as normal obviously varies from network to network (and VLAN to VLAN). These days, the normal activity of an enterprise environment could very well include laptops and mobile devices communicating with enterprise web or file servers, use of cloud services for email or file storage, and web services on printers and coffee machines, in addition to legacy services (e.g., mainframes) long forgotten but still quietly engaged with corporate networks.

To identify abnormal activity, a machine learning model must first accurately identify devices via their baseline normal activity. And therein lies the rub. A simple internet search for public packet captures would yield all kinds of data; e.g., Capture the Flag competitions, worm activity, command and control traffic, samples of specific exploits — the list goes on. But we were looking for traffic profiles of lateral movement and benign activity of common internal devices on an enterprise network. To date, we have been unable to find a large corpus of enterprise network traffic, let alone examples of lateral movement deep within those environments.

Another issue of existing data sets lies in the collection locality.

Collection locality exacerbates the problem. To date, most security research has been done at the network perimeter, i.e., where communication travels to and from the outside world. This is not surprising, given the history of defensive technologies; firewalls, proxies, network intrusion detection devices and the like have been deployed like sentries at network boundaries. But lateral movement generally occurs much further within the network boundary, beyond the visibility of perimeter devices. SDN can make traffic visible at the level of internal physical switch ports (i.e., as close to the device’s network interface as one could reasonably get), which means that Poseidon can capture all of the traffic to and from a target host, including traffic between it and its peers within that boundary. This visibility also means our training data needs to be sourced from unconventional collection points.

There are certainly ways of collecting such internal traffic in a traditional enterprise environment, but we have found that most entities are not keen to share. After all, network traffic carries the details of our conversations, documents, credentials, and other personal information. Even when such content is encrypted in transit, the metadata surrounding the traffic we generate can hint at lifestyle habits and associations we would rather outsiders not be privy to.

In effect, we were unable to source data from examples of the very corporate networks we developed Poseidon to protect. After many conversations about privacy, we boiled the issues and concerns down to two primary points:

Sensitivity of data content within network packets. If we could come up with a way to perform payload sanitization — particularly of the TCP and UDP protocols — we could address the bulk of content-related sensitivity concerns.

Sensitivity of browsing/surfing habits, particularly what external IP addresses were being visited. In addition to scrubbing the content, masking the external IP addresses that a target host communicates with would also address this concern.

Given that the data sets we needed were nonexistent, our first step was to neutralize these issues before we could ask others to collaborate with us. So we started with the most obvious step: collecting from our own networks.

TCPdump: retooling a beloved classic.

Poseidon’s ability to detect lateral movement from an internal target host should not depend on the content of the traffic or the external hosts the target talks to. Put another way, the machine learning algorithms within Poseidon needed the packet headers, not the packet payloads. We still wanted to preserve as much of the rest of this traffic as we could: protocol headers, timestamps, volume and size of packets, flags, etc. Could we achieve this level of surgical payload sanitization? We were determined to answer this affirmatively.

Our initial assumption was that tools would already exist to meet this need, so we tried not to re-invent the wheel. Unfortunately, a brief survey of the tools available on the internet turned up nothing that met our needs. We ran into issues ranging from tools being too old for modern operating systems, to tools being limited to the Windows platform (which was impractical for us), to codebases that were unstable or unsupported.

We eventually concluded that the best path forward was to tackle this ourselves, and we did so by enhancing an existing tool that was already in widespread distribution: tcpdump. Our enhanced version of tcpdump includes two new “long” options:

--no-tcpudp-payload, which removes the payload of a TCP or UDP packet (for IPv4, currently).

--mask-external-address [replacement_IP], which replaces all publicly routable IP addresses with a user-specified replacement_IP value.
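
To make the masking idea concrete, here is a minimal Python sketch of the decision that the second option describes. It is an illustration only, not the code in the tcpdump patch: the function name mask_external and the replacement address used in the example are hypothetical, and the sketch assumes that “publicly routable” can be approximated by the is_global property in Python’s standard ipaddress module.

```python
import ipaddress


def mask_external(addr: str, replacement: str) -> str:
    """Return `replacement` for globally routable addresses, else `addr`.

    Assumption: "publicly routable" is approximated by ipaddress' is_global
    check; the real tcpdump patch may draw the line differently.
    """
    ip = ipaddress.ip_address(addr)
    return replacement if ip.is_global else addr


# Hypothetical replacement value, taken from the TEST-NET-1 documentation range.
REPLACEMENT = "192.0.2.1"

# An internal (RFC 1918) peer is kept; an external host is masked.
internal = mask_external("10.0.0.5", REPLACEMENT)
external = mask_external("8.8.8.8", REPLACEMENT)
```

Because internal addresses pass through unchanged, conversations between peers inside the network boundary remain distinguishable, while every external destination collapses into a single value.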

For example, the following command would save a pcap file of semi-sanitized data (without the payload from TCP and UDP packets), at a fraction of the size of a full Layer 7 collection:

tcpdump --no-tcpudp-payload -w [DUMPFILE.pcap]
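
For readers who want to see what “removing the payload” means at the byte level, the following Python sketch truncates a raw IPv4 packet just after its TCP or UDP header. It is a simplified illustration under stated assumptions, not the logic in the tcpdump patch: the names strip_payload and build_tcp_packet are hypothetical, the input is assumed to start at the IPv4 header (no link-layer framing), and the length and checksum fields are left untouched so the original packet size is still recorded in the header. Whether the patch handles those fields the same way is not something this sketch claims.

```python
import struct

TCP, UDP = 6, 17  # IANA protocol numbers


def strip_payload(ip_packet: bytes) -> bytes:
    """Truncate an IPv4 packet after its TCP/UDP header; pass others through."""
    if len(ip_packet) < 20 or ip_packet[0] >> 4 != 4:
        return ip_packet  # not IPv4 (or too short): leave unchanged
    ihl = (ip_packet[0] & 0x0F) * 4  # IPv4 header length in bytes
    proto = ip_packet[9]
    if proto == TCP:
        if len(ip_packet) < ihl + 13:
            return ip_packet  # TCP header too short to read the data offset
        l4_len = (ip_packet[ihl + 12] >> 4) * 4  # TCP data offset, in bytes
    elif proto == UDP:
        l4_len = 8  # UDP header is always 8 bytes
    else:
        return ip_packet  # e.g. ICMP: not covered by this sketch
    return ip_packet[: ihl + l4_len]


def build_tcp_packet(payload: bytes) -> bytes:
    """Build a minimal IPv4+TCP packet for the demo (checksums left as zero)."""
    src, dst = bytes([10, 0, 0, 5]), bytes([10, 0, 0, 9])
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40 + len(payload),
                     0, 0, 64, TCP, 0, src, dst)
    tcp = struct.pack("!HHIIBBHHH", 49152, 445, 1, 1, 5 << 4, 0x18, 8192, 0, 0)
    return ip + tcp + payload


packet = build_tcp_packet(b"confidential document contents")
sanitized = strip_payload(packet)  # 20-byte IP header + 20-byte TCP header
```

The headers that survive still carry addresses, ports, flags, and the original total length, which is the kind of metadata the models rely on.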

The advantages of this approach include a greatly reduced file size when saving network traffic, and a recognizable tool that could be more widely adopted by network administrators. Tcpdump already resides on millions of computers across the globe; our hope is that if these enhancements are accepted by the project maintainers (in tcpdump on GitHub, it is PR-615 if you would like to track it), the world will eventually receive a tool that encourages more widespread sharing of network captures. With good network data as a base, we have the potential to attract more machine learning colleagues and nurture more research and use cases.

In the meantime, we successfully deployed this on our internal network to develop the initial models now made available in PoseidonML.

An invitation to collaborate.

At this stage, our project has generated 4.45 GB of sanitized labeled data. Out of an initial wish list of 17 device types that we felt were representative of typical corporate networks, we extracted samples from 12 categories. Obviously our experiments are still nascent. There is still much work to do, and we still do not have the rich data set required to adequately train an intelligent SDN environment. But we invite you to partner with us in building data sets, generating new models, and perhaps one day releasing the models that would make intelligent automated security response a reality.

For more information or collaboration, please email the author.

Cyber Reboot has made Poseidon’s SDN interface and machine learning capabilities available on GitHub. Poseidon is probably easiest to deploy from our automation platform called Vent. Our tcpdump modifications can be downloaded here (as a zip).