We recently completed the MagicHour challenge, which used machine learning techniques to conduct dimensionality reduction on millions of raw computer log lines. Our premise was that unsupervised pattern discovery could automatically create distinct signatures for a small set of human-interpretable, higher-level system events. Secure Shell (SSH) sessions and system restarts are examples of system events. Event signatures are predictable, closely related sequences of lower-level system processes that are recorded line by line in the voluminous raw computer logs. A client request for an IP address or the authentication of user credentials are examples of lower-level system processes.

Based on our assessment of the problem and a survey of current research, MagicHour initially included algorithms designed for frequent item set mining and topic modeling. We found that frequent item set mining expected the user to estimate the frequency of the events of interest in advance, and its output typically included an overwhelming number of sequences with weakly linked elements. Topic modeling was extremely effective at identifying closely related system processes in our log files, but it was computationally slow. The accuracy of topic modeling on our log files suggested a viable convergence between natural language processing (NLP) techniques and log file analysis. Recent Lab41 challenges Hermes and Sunny-Side-Up leveraged unique applications of vector representations for words, and we decided to explore how word vectors would perform when applied to computer logs.

Background

There is a plethora of products that examine computer log files to convey trends and detect anomalies. Commercial companies tend to depend on human experts to build data models and create custom parsers when they are confronted with various types of log files. This is a resource-intensive process that does not scale well outside the commercial marketplace. We set out to discover if machine learning could automatically identify meaningful groups of elements in a large log file.

MagicHour’s ultimate goal was to locate closely related collections of items in a large data sample. Frequent item set mining, also known as frequent pattern mining, techniques such as the Apriori algorithm and a memory-optimized variation, FP-Growth, were developed to address this scenario. Frequent item set mining’s best known application might be market basket analysis, which identifies products bought together by customers at a store. We were initially drawn to the analogy of different log lines (products) that occur together frequently in multiple windows of time (baskets).
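To make the market basket analogy concrete, here is a minimal sketch of frequent item set mining over time-window "baskets" of template IDs, using the mlxtend library's FP-Growth implementation. The transactions and template IDs below are illustrative, not drawn from our data:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

# Each "basket" is one time window; each "product" is a template ID.
# These transactions are illustrative, not from the MagicHour data.
transactions = [
    ["425", "1460", "1953"],
    ["425", "1460", "2308"],
    ["1676", "2126"],
    ["425", "1460", "1953"],
]

# One-hot encode the transactions into a boolean DataFrame.
encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit_transform(transactions),
                      columns=encoder.columns_)

# Item sets of template IDs that appear together in at least half
# of the time windows.
print(fpgrowth(onehot, min_support=0.5, use_colnames=True))
```

Note that the min_support threshold is exactly the kind of advance frequency estimate that made frequent item set mining awkward for our scenario.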

Credit: paulshealthblog.com

Our initial research also led us to the Principal Atoms Recognition in Sets (PARIS) algorithm, which was built for log file analysis and for discovering patterns of messages that represent processes in the system. Perfect! PARIS accomplishes this using a strategy similar to topic modeling, specifically Latent Dirichlet Allocation (LDA). Topic modeling and LDA are popular NLP tools based on the premise that a document consists of a small number of topics and that each word’s appearance is attributable to one of the document’s topics. In our case, the documents are log files, the topics are events, and the words are individual log lines.

Coming To Terms

Before we go on, let’s establish terminology. Raw computer log files typically contain millions of lines of messages. MagicHour’s first step was to preprocess the log files, replacing highly variable entities such as IP addresses, directory paths, and usernames with placeholder tokens to reduce the entropy of the lines. This allowed us to identify frequent words in the preprocessed log lines and then generate templates for unique sequences of those frequent words. We used LogCluster by Risto Vaarandi for this template discovery step.
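As an illustration of the preprocessing step, here is a minimal sketch; the regular expressions and placeholder tokens are simplified stand-ins chosen for this example, not the exact patterns MagicHour used:

```python
import re

# Simplified, illustrative patterns; real log preprocessing would need
# broader coverage (hostnames, ports, hex IDs, timestamps, etc.).
PATTERNS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "IPADDRESS"),
    (re.compile(r"(?:/[\w.-]+){2,}"), "FILEPATH"),
    (re.compile(r"\buser=\S+"), "user=USER"),
]

def preprocess(line):
    """Replace high-variability fields with placeholder tokens so that
    similar messages collapse onto the same sequence of frequent words."""
    for pattern, placeholder in PATTERNS:
        line = pattern.sub(placeholder, line)
    return line

print(preprocess("sshd: user=alice from 192.168.1.10 read /var/log/auth.log"))
# -> "sshd: user=USER from IPADDRESS read FILEPATH"
```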

Each log line that matched a template was assigned the unique identifier (ID) for that template. The expectation is that a log file with millions of log lines contains so much message-type repetition that it actually consists of only a few thousand unique templates. At this point we represented every log line in the file as a Unix timestamp and a template ID, and we retained a separate dictionary for the template definitions:

timestamp=1131523501,template=425

template425=USER FILEPATH data_thread() got no answer from any datasource
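A minimal sketch of this encoding step, assuming the templates have already been discovered by LogCluster (the template table and matching logic here are illustrative):

```python
import re

# Illustrative template table; in MagicHour the templates came from
# LogCluster's output rather than being written by hand.
TEMPLATES = {
    425: re.compile(r"USER FILEPATH data_thread\(\) got no answer "
                    r"from any datasource"),
}

def encode(timestamp, preprocessed_line):
    """Represent one preprocessed log line as (Unix timestamp, template ID)."""
    for template_id, pattern in TEMPLATES.items():
        if pattern.search(preprocessed_line):
            return timestamp, template_id
    return timestamp, None  # line matched no known template

print(encode(1131523501,
             "USER FILEPATH data_thread() got no answer from any datasource"))
# -> (1131523501, 425)
```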

Our frequent item set mining and topic modeling algorithms were configured for transactional datasets. We divided log files into transactions, each consisting of the template IDs that occurred in a single time window (e.g., 60 seconds). Our goal was to analyze the transactions in order to determine which template IDs occurred together, as those sequences of template IDs could signify a higher-level system event. To help visualize this, here are two sample transactions from our synthetic data:

1953 1460 425 2308 1535

2126 1676 1954 1326 2327 1848 1497 1224
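A sketch of how such transactions can be built from the encoded lines, assuming fixed-width windows keyed off the Unix timestamp:

```python
from collections import defaultdict

def to_transactions(encoded_lines, window=60):
    """Group template IDs into fixed-width time windows.

    encoded_lines: iterable of (unix_timestamp, template_id) pairs.
    Returns one transaction (a list of template IDs) per window.
    """
    buckets = defaultdict(list)
    for timestamp, template_id in encoded_lines:
        buckets[timestamp // window].append(template_id)
    return [buckets[key] for key in sorted(buckets)]

lines = [(1131523501, 1953), (1131523510, 1460), (1131523530, 425),
         (1131523561, 2126), (1131523580, 1676)]
print(to_transactions(lines))
# -> [[1953, 1460, 425], [2126, 1676]]
```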

Good Vectors Make Good Neighbors

Vector representations of words are used as a feature extraction tool in natural language processing. Word2Vec and Global Vectors for Word Representation (GloVe) are two popular algorithms that leverage vector representations of words. At a high level, these methods map words or phrases to multi-dimensional vectors by evaluating the order and placement of a word relative to its neighboring words. Words with similar vectors tend to be used in similar ways. In a practical sense, word vectors can assist with assessing semantic similarity and identifying synonyms.

City and Zip Code Word Vector Differences, Credit: nlp.stanford.edu

I like to think of word vectors in terms of an airplane seating chart. If you examined your air travel over the past few years, most of the people who sat in your row would be random strangers. Over time, however, a pattern would likely emerge: certain people who sat next to you or a few seats away provide context about who you are. These people could be your spouse, your children, your parents, your colleagues or your partner for travel adventures.

Airplane Seating Chart, Credit: Wikipedia

Word vectors work in a similar way, in that patterns in the proximity of words can provide context about their meaning and relationships. Our hope was that applying word vectors to the order and proximity of log lines and their associated templates could provide context about their role in larger system events. To continue the airplane analogy, we could assign each log line and its associated template a seat in the airplane seating chart based on the order in which its timestamp occurred. We then treat the templates as words and calculate their word vectors.
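As a sketch of the idea, the snippet below treats each transaction as a "sentence" and each template ID as a "word", using gensim's word2vec implementation as a stand-in for the GloVe code we actually ran (the template IDs come from the samples above; the parameter values are arbitrary):

```python
from gensim.models import Word2Vec  # gensim >= 4.0 parameter names

# Each transaction becomes a "sentence" of template-ID "words".
sentences = [
    ["1953", "1460", "425", "2308", "1535"],
    ["2126", "1676", "1954", "1326", "2327", "1848", "1497", "1224"],
    ["1953", "1460", "425", "1535", "2308"],
]

model = Word2Vec(sentences, vector_size=50, window=5,
                 min_count=1, epochs=20, seed=1)

# Templates that repeatedly occur near template 425 should end up
# with similar vectors, hinting that they belong to the same event.
print(model.wv.most_similar("425", topn=3))
```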

Survey Says

We compared GloVe’s output against the frequent item set mining algorithm FP-Growth and the topic modeling algorithm PARIS. All of the algorithms were fed the same synthetic and generated datasets, which allowed us to compare the output against ground truth, evaluate performance, and observe unique characteristics of each algorithm.

GloVe consistently identified approximately 50 percent or more of the seeded events in the synthetic data as either exact matches or valid sub-sequence matches. GloVe tended to nominate a limited number of template sequences that weren’t related to seeded events, and many of those were tied to high-frequency templates. When we tested GloVe against a generated data set with multiple SSH sessions in an auditd file, GloVe correctly proposed a single event that included all of the auditd record types defined in the SSH user login lifecycle.

GloVe’s balance of correctly identified seeded events and a limited number of extraneous event nominations was significantly better than FP-Growth but fell short of PARIS. As one point of comparison, FP-Growth identified all 50 seeded events on a sample run against synthetic data, but it also nominated another 5,000 event sequences beyond the seeded events. GloVe and PARIS each suggested well under 100 extra event sequences for the same set of transactions. Given that dimensionality reduction was a driving motivation in our challenge, we decided that FP-Growth’s deluge of weakly linked results was not a good match for our log file scenario. PARIS was adept at identifying the event sequences in our synthetic data and our generated auditd data, but it performed slowly on large data sets.

What We Learned

We learned three things about applying vector representations for words to computer logs:

1. Putting The Pieces Together

One characteristic that emerged from our testing was that GloVe had a tendency to nominate sub-sequences of seeded events instead of the original full-length sequences. While the sub-sequences can be viewed as valid events on their own, this behavior creates a scenario where additional post-processing work would be needed to stitch the relevant pieces back together into their original signature. An example from the synthetic data:

Seeded Event: 889 716 876 637 494

PARIS output: 889 716 876 637 494 (exact match)

FP-Growth output: 889 716 876 637 494 (exact match)

GloVe output: 637 494 889 and 876 716 (exact match but as two separate sequences)

GloVe’s proclivity to break down seeded events into smaller pieces increased as the number of epochs increased. We believe that the resolution achieved by additional epochs exposes closer relationships between certain sets of the templates inside an event. An analogy is that a family of four could logically be divided into a group of two parents and a group of two children because the children are more similar to each other than the whole family.

Reconstructing the original family would require stitching together the child and parent groups, but it might not be immediately obvious which parent group pairs with which child group. In our synthetic data example above, templates 716 and 876 were always the bookends for the seeded event. GloVe may have assessed that the closer proximity of interior templates 494 637 and 889 indicated a stronger, more consistent relationship.
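One hypothetical way to stitch the pieces back together, sketched under our own assumptions rather than anything MagicHour implemented: greedily merge nominated sub-sequences whose templates almost always co-occur in the same transactions.

```python
from itertools import combinations

def cooccurrence(seq_a, seq_b, transactions):
    """Fraction of transactions containing seq_a templates that also
    contain seq_b templates. transactions is a list of sets."""
    hits_a = [t for t in transactions if set(seq_a) & t]
    if not hits_a:
        return 0.0
    return sum(1 for t in hits_a if set(seq_b) & t) / len(hits_a)

def stitch(subsequences, transactions, threshold=0.9):
    """Greedily merge sub-sequences that nearly always co-occur."""
    merged = [list(s) for s in subsequences]
    changed = True
    while changed:
        changed = False
        for a, b in combinations(merged, 2):
            if cooccurrence(a, b, transactions) >= threshold:
                a.extend(b)        # fold b's templates into a
                merged.remove(b)
                changed = True
                break              # rescan over the new groups
    return merged

windows = [{637, 494, 889, 876, 716}, {637, 494, 889, 876, 716, 1953}]
print(stitch([[637, 494, 889], [876, 716]], windows))
# -> [[637, 494, 889, 876, 716]]
```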

2. Built For The Now Generation

GloVe performed strikingly fast, even when run locally with reasonably large datasets. It quickly became our favorite event discovery algorithm for MagicHour due to its combination of speed and the acceptable quality of its results. Neither PARIS nor FP-Growth was able to keep pace with GloVe’s performance on larger data sets in a local execution environment. FP-Growth is available for a distributed environment in Spark’s Machine Learning library, and GloVe’s predecessor Word2Vec is available as a distributed implementation in the same library. We assessed that PARIS was not well suited to a distributed environment due to the frequent global state calculations inherent in the algorithm.
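For reference, a minimal sketch of the distributed FP-Growth mentioned above, using Spark MLlib's Python API (the transactions are illustrative):

```python
from pyspark import SparkContext
from pyspark.mllib.fpm import FPGrowth

sc = SparkContext(appName="fp-growth-logs")

# Each record is one transaction of template IDs (illustrative values).
transactions = sc.parallelize([
    ["425", "1460", "1953"],
    ["425", "1460", "2308"],
    ["1676", "2126"],
])

# Mine item sets appearing in at least half of the transactions,
# distributing the work across four partitions.
model = FPGrowth.train(transactions, minSupport=0.5, numPartitions=4)
for itemset in model.freqItemsets().collect():
    print(itemset.items, itemset.freq)

sc.stop()
```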

Credit: chartgo.com

3. Clustering Occasionally Gets In The Way

While running the algorithms repeatedly against shifting synthetic data sets, we noticed an instance where GloVe identified an unusually low number of the seeded events. Our investigation revealed that some of the templates used in the seeded events appeared in more than one seeded event fingerprint. Our implementation of GloVe leveraged a clustering mechanism that would only allow a given item to be mapped to one cluster, which means that in our scenario a template could only exist in one proposed event. GloVe typically handled this situation by assigning the duplicate template to one event and reconstructing the impacted seeded event as best it could. However, we also observed occasions when GloVe disregarded some or all of the other seeded events with the shared template. A clustering conflict example from the synthetic data:

Seeded Event #1: 839 201 39 237 177 659 984 890

Seeded Event #2: 160 98 103 39 66 450 374

GloVe output: 39 201 and 839 177 659 984 890 (partial match for event #1 as two separate sequences)

GloVe output: 98 66 450 374 (partial match for event #2 and missing template 39)

This limitation might not be an issue for some datasets. There might also be clustering alternatives to pair with GloVe that would allow a template to exist in multiple clusters if the need arose.
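One hypothetical alternative, sketched under our own assumptions: score each template's vector against every cluster centroid and keep all memberships above a similarity threshold, so a shared template such as 39 could legitimately surface in more than one proposed event.

```python
import numpy as np

def soft_assign(template_vectors, centroids, threshold=0.7):
    """Assign each template to every cluster whose centroid it is
    sufficiently similar to, instead of exactly one cluster.

    template_vectors: dict mapping template_id -> np.ndarray embedding
    centroids: list of cluster centroid vectors (np.ndarray)
    """
    memberships = {}
    for template_id, vec in template_vectors.items():
        sims = [float(np.dot(vec, c) /
                      (np.linalg.norm(vec) * np.linalg.norm(c)))
                for c in centroids]
        memberships[template_id] = [i for i, s in enumerate(sims)
                                    if s >= threshold]
    return memberships
```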

Parting Thoughts

While some caveats exist, we found that vector representations for words can be a viable method for analyzing computer log files. GloVe’s combination of speed and meaningful results could provide a low-cost preview of the important elements in a log file. While topic modeling algorithms such as PARIS provide higher overall accuracy and completeness at the cost of longer processing times, we can envision scenarios where security analysts would be interested in getting a relatively quick first cut through a large batch of data with GloVe.

On a bigger scale, we believe that vector representations for words could be successful for other variations of semi-structured text beyond log files. We expect that ongoing research into word vectors intended to improve performance for traditional natural language processing will carry over similar benefits to analysis of semi-structured text.