Taking the SYSLOG_INVALID_USERS stream that we created above as our starting point, we can look at how many attempted logins we’re getting per minute:

ksql > SELECT HOST , COUNT ( * ) FROM SYSLOG_INVALID_USERS WINDOW TUMBLING ( SIZE 1 MINUTE ) GROUP BY HOST ; rpi - 03 | 1 rpi - 03 | 2 rpi - 03 | 4

To see which window this refers to we’ll persist the aggregate query as a table:

ksql > CREATE TABLE INVALID_USERS_LOGINS_PER_HOST AS \ SELECT HOST , COUNT ( * ) AS INVALID_LOGIN_COUNT \ FROM SYSLOG_INVALID_USERS \ WINDOW TUMBLING ( SIZE 1 MINUTE ) \ GROUP BY HOST ; Message --------------------------- Table created and running --------------------------- ksql >

Now we can query from it and show the aggregate window timestamp alongside the result:

ksql > SELECT ROWTIME , TIMESTAMPTOSTRING ( ROWTIME , 'yyyy-MM-dd HH:mm:ss' ), \ HOST , INVALID_LOGIN_COUNT \ FROM INVALID_USERS_LOGINS_PER_HOST ; 1521644100000 | 2018 - 03 - 21 14 : 55 : 00 | rpi - 03 | 1 1521646620000 | 2018 - 03 - 21 15 : 37 : 00 | rpi - 03 | 2 1521649080000 | 2018 - 03 - 21 16 : 18 : 00 | rpi - 03 | 1 1521649260000 | 2018 - 03 - 21 16 : 21 : 00 | rpi - 03 | 4 1521649320000 | 2018 - 03 - 21 16 : 22 : 00 | rpi - 03 | 2 1521649080000 | 2018 - 03 - 21 16 : 38 : 00 | rpi - 03 | 2

In the above query I’m displaying the aggregate window start time, ROWTIME (which is epoch), and converting it also to a display string, using TIMESTAMPTOSTRING . We can use this to easily query the stream for a given window of interest. For example, for the window beginning at 2018-03-21 16:21:00 we can see there were four invalid user login attempts. We can easily check the source data for this, using the ROWTIME in the above output for the window (16:21 – 16:22) as the bounds for the predicate:

ksql > SELECT ROWTIME , TIMESTAMPTOSTRING ( ROWTIME , 'yyyy-MM-dd HH:mm:ss' ), MESSAGE \ FROM SYSLOG_INVALID_USERS \ WHERE ROWTIME >= 1521649260000 AND ROWTIME < 1521649320000 ; 1521649272270 | 2018 - 03 - 21 16 : 21 : 12 | rpi - 03 sshd [ 3663 ]: Invalid user 0 from 5 . 188 . 10 . 156 1521649284181 | 2018 - 03 - 21 16 : 21 : 24 | rpi - 03 sshd [ 3667 ]: Invalid user 0101 from 5 . 188 . 10 . 156 1521649299896 | 2018 - 03 - 21 16 : 21 : 39 | rpi - 03 sshd [ 3671 ]: Invalid user 1234 from 5 . 188 . 10 . 156 1521649311508 | 2018 - 03 - 21 16 : 21 : 51 | rpi - 03 sshd [ 3676 ]: Invalid user admin from 5 . 188 . 10 . 156

Taking the aggregate query, we can use it to drive some exception alerting, based on a defined threshold. Let’s say we want to only know when there are four (or more) invalid user login attempts within a given window (in this case, one minute). It’s just standard SQL!

ksql > SELECT HOST , COUNT ( * ) AS INVALID_LOGIN_COUNT \ FROM SYSLOG_INVALID_USERS \ WINDOW TUMBLING ( SIZE 1 MINUTE ) \ GROUP BY HOST \ HAVING COUNT ( * ) >= 4 ; rpi - 03 | 12 rpi - 03 | 5 rpi - 03 | 6 rpi - 03 | 4

This is based on the original source stream, SYSLOG_INVALID_USERS . We could do it another route, using the table that we’d defined above and just filtering that:

ksql > SELECT ROWTIME , TIMESTAMPTOSTRING ( ROWTIME , 'yyyy-MM-dd HH:mm:ss' ), \ HOST , INVALID_LOGIN_COUNT \ FROM INVALID_USERS_LOGINS_PER_HOST \ WHERE INVALID_LOGIN_COUNT >= 4 ; 1520214360000 | 2018 - 03 - 05 01 : 46 : 00 | rpi - 03 | 12 1521632460000 | 2018 - 03 - 21 11 : 41 : 00 | rpi - 03 | 5 1521646620000 | 2018 - 03 - 21 15 : 37 : 00 | rpi - 03 | 6 1521649260000 | 2018 - 03 - 21 16 : 21 : 00 | rpi - 03 | 4

The results are the same—as with the discussion above around data fidelity and premature optimisation of our data pipelines, it depends whether you also have a use for the aggregate table itself too (for example, for streaming to an analytics view), or whether you purely want to drive an anomaly detection from the source stream alone.

Let’s now populate our KSQL_NOTIFY topic with the results of the anomaly detection above:

CREATE TABLE INVALID_LOGIN_NOTIFICATIONS WITH ( KAFKA_TOPIC = 'KSQL_NOTIFY' , VALUE_FORMAT = 'JSON' ) AS \ SELECT TIMESTAMPTOSTRING ( ROWTIME , 'yyyy-MM-dd HH:mm:ss' ) + \ ': :heavy_exclamation_mark: On ' + \ HOST + \ ' there were ' + \ CAST ( INVALID_LOGIN_COUNT AS VARCHAR ) + \ ' invalid user login attempts in the last minute (threshold is >=4)' AS TEXT , \ 'general' AS CHANNEL \ FROM INVALID_USERS_LOGINS_PER_HOST \ WHERE INVALID_LOGIN_COUNT >= 4 ;

Now we see the alerts come through, only when the threshold is breached.

A few points to note: