Recently, I decided to look into Flink’s Complex Event Processing library (CEP) and see what it was capable of. I found it surprisingly easy to use and I can envision many possible use cases.

What is complex event processing?

Before jumping into how to do CEP, it is worth defining it and explaining why data scientists and engineers should know it. Complex event processing detects patterns in streaming data and sends alerts or notifications based on those patterns. Anyone working with time-critical data that needs to be monitored should know how to use CEP. For instance, the Flink website details how you could use complex event processing to monitor server rack temperatures and send a warning when the same rack exceeds a temperature threshold twice in a row. It then describes how two consecutive warnings could be used to create an alert. This example is pretty good, but it is a bit dated, and I wanted to try CEP out myself with Twitter data.

Getting started:

In my previous article on Flink I described how to set up a stream of Tweets and perform a basic word count of the most common words in the Tweets. Recall that we had a data stream of words in (word,count) format, which I called dataWindowKafka (as we were previously feeding it directly to a Kafka producer). This is what we will primarily be working with in this example.

Now let’s say we are interested in seeing whether a specific word is mentioned more than a set number of times in one window and more than another, greater number of times in the second window. First, let’s extend SimpleCondition so that we do not have to write the code inline as in the documentation.
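As a rough illustration of the idea, here is a minimal, dependency-free sketch of the check such a class might perform. The class name MentionThreshold and the threshold value are my assumptions. In an actual Flink job the same comparison would go in the filter method of a class extending SimpleCondition, with a Tuple2<String, Integer> in place of the Map.Entry used here.

```java
import java.util.AbstractMap;
import java.util.Map;

// Stand-in for the reusable condition class (hypothetical name).
// It holds one threshold so the same class can serve both pattern steps.
final class MentionThreshold {
    private final int minCount; // hypothetical threshold, chosen by the caller

    MentionThreshold(int minCount) {
        this.minCount = minCount;
    }

    // Accepts a (word, count) pair when the count strictly exceeds the threshold.
    boolean filter(Map.Entry<String, Integer> wordCount) {
        return wordCount.getValue() > minCount;
    }
}

final class MentionThresholdDemo {
    public static void main(String[] args) {
        MentionThreshold overTwenty = new MentionThreshold(20);
        System.out.println(overTwenty.filter(new AbstractMap.SimpleEntry<>("trump", 143)));
        System.out.println(overTwenty.filter(new AbstractMap.SimpleEntry<>("flink", 7)));
    }
}
```

Keeping the threshold as a constructor argument is what makes the class reusable: one instance can guard the first step of a pattern and another instance, with a larger value, the second.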

Now that we have a class that’s easy to use, let’s write out our actual CEP code.
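To make the shape of the pattern concrete, here is a plain-Java sketch of what it matches, with no Flink dependency. It assumes the per-window counts for a single word arrive in order, treats both thresholds as hypothetical parameters, and mimics strict contiguity (what next(...) expresses in Flink's Pattern API): a "first" event immediately followed by an "increasing" event. The class and method names are mine.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TwoStepMentionPattern {
    private final int firstMin;      // hypothetical threshold for the "first" step
    private final int increasingMin; // hypothetical threshold for the "increasing" step

    TwoStepMentionPattern(int firstMin, int increasingMin) {
        this.firstMin = firstMin;
        this.increasingMin = increasingMin;
    }

    // counts: per-window counts for ONE word, in arrival order.
    // Returns one map per match, keyed by step name.
    List<Map<String, Integer>> match(List<Integer> counts) {
        List<Map<String, Integer>> matches = new ArrayList<>();
        for (int i = 0; i + 1 < counts.size(); i++) {
            int first = counts.get(i);
            int second = counts.get(i + 1);
            // Both steps use fixed thresholds; the second count is never
            // compared with the first one.
            if (first > firstMin && second > increasingMin) {
                Map<String, Integer> match = new LinkedHashMap<>();
                match.put("first", first);
                match.put("increasing", second);
                matches.add(match);
            }
        }
        return matches;
    }
}

final class TwoStepMentionPatternDemo {
    public static void main(String[] args) {
        TwoStepMentionPattern pattern = new TwoStepMentionPattern(10, 20);
        System.out.println(pattern.match(Arrays.asList(143, 137, 5, 30)));
    }
}
```

Note that a pair such as 143 followed by 137 still matches here, because each step only checks its own fixed threshold.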

In production you would probably pass the output of this stream to Kafka or a database, but for our purposes we will just print it to the console.

manyMentions.print();

// Output:

{increasing=[(trump,137)], first=[(trump,143)]}

{increasing=[(trump s,49)], first=[(trump s,35)]}

{increasing=[( ,42)], first=[( ,29)]}

{increasing=[(i m,11)], first=[(i m,21)]}

We obviously have some tokenization problems, as “trump” and “trump’s” are essentially the same word and blanks should not be included either, but the CEP itself seems to be doing its job. Both the first filter, “first”, and the second filter, “increasing”, are being met. As the output shows, though, the second value isn’t necessarily larger than the first. The reason is that the threshold is static: as long as the second count is greater than 20, the condition returns true even if the prior count was 42. To compare against the previous event we need iterative conditions.
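As an aside, the tokenization issue could be reduced with a small cleanup step before counting. The sketch below is only one possible approach, not what the earlier word-count job does: it lowercases each token, drops a trailing possessive, strips punctuation, and discards empty tokens. The class name and the exact rules are my assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class TweetTokenCleaner {
    // Splits on whitespace and normalizes each token.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            String token = raw
                .replaceAll("['\u2019]s$", "")    // drop a trailing 's (straight or curly apostrophe)
                .replaceAll("[^a-z0-9#@]", "");   // keep letters, digits, hashtags and mentions
            if (!token.isEmpty()) {               // blanks never reach the counter
                out.add(token);
            }
        }
        return out;
    }
}

final class TweetTokenCleanerDemo {
    public static void main(String[] args) {
        System.out.println(TweetTokenCleaner.tokens("Trump\u2019s  rally, Trump's "));
    }
}
```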

Iterative Conditions:

According to the Flink documentation, iterative conditions are

“the most general type of condition. This is how you can specify a condition that accepts subsequent events based on properties of the previously accepted events or a statistic over a subset of them.”

So let’s say that we want to make sure that the frequency of these tweets is actually increasing. The main difference here is that, instead of comparing against a fixed value, the boolean check will be current_event > previous_event. We can do this using iterative conditions.
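The comparison itself can be sketched in plain Java without any Flink dependency. The version below assumes one word's per-window counts arrive in order, keeps a hypothetical fixed threshold for the "first" step, and accepts the next event only if its count is larger than the one already accepted as "first". In Flink this comparison would go in the filter(value, ctx) method of an IterativeCondition, where the earlier event is available through ctx.getEventsForPattern("first"). The names below are mine.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class IncreasingMentionPattern {
    private final int firstMin; // hypothetical threshold for the "first" step

    IncreasingMentionPattern(int firstMin) {
        this.firstMin = firstMin;
    }

    // The iterative part: the current event is judged against the event
    // that was already accepted for the "first" step.
    static boolean isIncreasing(int current, int previouslyAccepted) {
        return current > previouslyAccepted;
    }

    // counts: per-window counts for ONE word, in arrival order.
    List<Map<String, Integer>> match(List<Integer> counts) {
        List<Map<String, Integer>> matches = new ArrayList<>();
        for (int i = 0; i + 1 < counts.size(); i++) {
            int first = counts.get(i);
            int second = counts.get(i + 1);
            if (first > firstMin && isIncreasing(second, first)) {
                Map<String, Integer> match = new LinkedHashMap<>();
                match.put("first", first);
                match.put("increasing", second);
                matches.add(match);
            }
        }
        return matches;
    }
}

final class IncreasingMentionPatternDemo {
    public static void main(String[] args) {
        IncreasingMentionPattern pattern = new IncreasingMentionPattern(20);
        System.out.println(pattern.match(Arrays.asList(143, 137, 29, 42, 21, 11)));
    }
}
```

With this check, a pair like 143 followed by 137 no longer matches, while 29 followed by 42 does.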

I will add the output of this final event soon.

More complex examples

Of course, this example barely scratches the surface of CEP. One can immediately imagine many more use cases. For instance, a company wanting to monitor its brand reputation might seek to detect a possible boycott (in order to respond to it quickly) by using CEP in conjunction with sentiment analysis. The company could use a machine learning algorithm to label every tweet that mentions its brand with a positive or negative sentiment, and then create an alert if the negative tweets exceeded a certain threshold and/or the number of positive tweets.

Another example might be an online movie or shopping site that wants to recommend movies or products to a user. In this instance, the site might look at sequential log data and, if a user visits a certain sequence of films/items within a given span of time, recommend another related film/item.
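To give a feel for the recommendation case, here is a small plain-Java sketch of "a certain sequence within a given span of time". It assumes one user's visits are sorted by timestamp and allows unrelated visits in between, which is roughly what followedBy(...) combined with within(...) expresses in Flink CEP. The Visit type, the method name, and the sample data are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class Visit {
    final String item;
    final long timeMs;

    Visit(String item, long timeMs) {
        this.item = item;
        this.timeMs = timeMs;
    }
}

final class VisitSequenceMatcher {
    // True if the wanted items appear in order (other visits may sit in
    // between) and the whole sequence fits inside spanMs milliseconds.
    // Assumes visits belong to one user and are sorted by timeMs.
    static boolean matchesWithin(List<Visit> visits, List<String> wanted, long spanMs) {
        if (wanted.isEmpty()) {
            return false;
        }
        for (int start = 0; start < visits.size(); start++) {
            Visit first = visits.get(start);
            if (!first.item.equals(wanted.get(0))) {
                continue;
            }
            int matched = 1;
            for (int i = start + 1; i < visits.size() && matched < wanted.size(); i++) {
                Visit v = visits.get(i);
                if (v.timeMs - first.timeMs > spanMs) {
                    break; // too late to complete the sequence from this start
                }
                if (v.item.equals(wanted.get(matched))) {
                    matched++;
                }
            }
            if (matched == wanted.size()) {
                return true;
            }
        }
        return false;
    }
}

final class VisitSequenceMatcherDemo {
    public static void main(String[] args) {
        List<Visit> log = Arrays.asList(
            new Visit("alien", 0L),
            new Visit("toy story", 60_000L),
            new Visit("aliens", 120_000L));
        // One hour window; a match could trigger a recommendation.
        System.out.println(
            VisitSequenceMatcher.matchesWithin(log, Arrays.asList("alien", "aliens"), 3_600_000L));
    }
}
```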

Other Flink CEP Resources