In a previous post, Getting started with AWS IoT and Tessel, you learned how to send temperature data from your Tessel microcontroller to an AWS IoT topic. You also learned how to set up an alarm that fires whenever the temperature goes above a certain point (in the example, 10 degrees Celsius). But fixed thresholds don’t work well with dynamic datasets. If you want to receive an alert when the temperature becomes unusual, things get more complicated.

Defining the unusual

An unusual temperature needs to be defined first. Statistics can help us with the problem.

The arithmetic mean (in colloquial language, an average) is a measure of central tendency in data (avg).

The standard deviation is a measure that is used to quantify the amount of variation in data (sd).

Look at the following visualization of this idea.

The idea is that most of the values are very close to the average (0 in this case).

To be more precise, 68.2% of the values are within the [avg-sd, avg+sd] range, also called 1-sigma.

Likewise, 95.4% of the values are within the [avg-2sd, avg+2sd] range, also called 2-sigma.

So 4.6% of the values are not within the 2-sigma range. We will define them as unusual values.
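If you want to check these percentages yourself, here is a minimal Python sketch. It assumes a perfect normal distribution and uses the error function from the standard library. The helper name fraction_within is made up for this illustration.

```python
import math


def fraction_within(k: float) -> float:
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2):
    inside = fraction_within(k)
    # Print four decimals so the rounding of the percentages is visible.
    print(f"{k}-sigma: {inside:.4f} inside, {1 - inside:.4f} outside")
```

The exact values are roughly 0.6827 for 1-sigma and 0.9545 for 2-sigma, which is where the rounded percentages above come from.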

Excursion: My first job was in high-frequency / algorithmic trading. I did a lot of real-time analytics on top of financial market data like FX, stock, futures, and options prices. So let me explain the idea of averages in more detail.

The average is a meaningful summary for normally distributed data sets like the one in the figure above. Many phenomena in nature are bell-shaped (normally distributed), for example human height.

If you apply the average to numbers like the personal incomes of the people of a country, things get funny. Assume we have 100 people in our country: 99 people earn $100 while the leader makes $10,000. The average is $199 and the (sample) standard deviation is $990, so an income of $10,000 is roughly a 10-sigma event. For comparison: if you draw one value per day from a normal distribution, you will see a 7-sigma event about once every 1.07 billion years (a quarter of Earth’s history). A 10-sigma event is practically impossible. Still, it’s in the data. The issue is that the data is not normally distributed.
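The numbers in the income example can be reproduced with a short Python sketch. It assumes the sample standard deviation (dividing by n - 1), which is the variant that yields $990, and it assumes one independent draw per day for the waiting time. The function name years_between_events is only illustrative.

```python
import math
import statistics

# 99 people earn $100 and the leader earns $10,000 (numbers from the text).
incomes = [100] * 99 + [10_000]

avg = statistics.mean(incomes)
sd = statistics.stdev(incomes)       # sample standard deviation (n - 1)
z = (10_000 - avg) / sd              # distance of the top income in sigmas


def years_between_events(k: float) -> float:
    """Expected wait in years for a k-sigma event, one normal draw per day."""
    p = math.erfc(k / math.sqrt(2))  # two-sided tail probability
    return 1 / p / 365.25


print(f"avg={avg}, sd={sd:.0f}, z={z:.1f}")
print(f"7-sigma event: once every {years_between_events(7):.3g} years")
```

The z value of about 9.9 is what the text rounds to a 10-sigma event.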

The crux is that we are usually interested in outliers, while most statistics focus on the opposite. Some people even remove outliers from their data to make it normally distributed.

In our temperature example we assume a normal distribution but keep in mind that this assumption is most likely not true. You will not end up with 4.6% unusual temperatures!

Let’s do some math:

temperatures = [ 15 , 10 , 11 , 12 , 14 , 13 , 14 , 15 , 9 , 12 ]

avg = average(temperatures)

sd = standard_deviation(temperatures)



We define a temperature as unusual if it is not in the 2-sigma range of past values.

temperature < avg - 2 * sd

or

temperature > avg + 2 * sd
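As a runnable sketch of the check above, here is the same calculation in Python. It assumes the sample standard deviation from the standard library's statistics module. The names lower, upper, and is_unusual are chosen for this example only.

```python
import statistics

temperatures = [15, 10, 11, 12, 14, 13, 14, 15, 9, 12]

avg = statistics.mean(temperatures)
sd = statistics.stdev(temperatures)  # assumption: sample standard deviation

# The 2-sigma range around the average of the past values.
lower = avg - 2 * sd
upper = avg + 2 * sd


def is_unusual(temperature: float) -> bool:
    """True if the temperature falls outside the 2-sigma range."""
    return temperature < lower or temperature > upper


print(f"avg={avg}, sd={sd:.2f}, range=[{lower:.2f}, {upper:.2f}]")
print(is_unusual(12), is_unusual(17))
```

With these ten values the average is 12.5, so a reading of 12 is normal while 17 lies above the upper bound.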



This works pretty well for static data sets. But in our case, new temperature data arrives every minute.

Sliding windows

What we would like to do is compare the current temperature against the average and standard deviation of the temperature over the past hour. This is called a sliding window: you add new values at the front, and old values fall out at the end once they are older than 1 hour.

Let’s do some math:

temperatures = [...]

temperatures_in_window = sliding_window( 1 h, temperatures)

avg = average(temperatures_in_window)

sd = standard_deviation(temperatures_in_window)



So we can define that a temperature is unusual if it is not in the 2-sigma range of values that are not older than 1 hour.

temperature < avg - 2 * sd

or

temperature > avg + 2 * sd
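To make the sliding window concrete, here is a minimal in-memory sketch in Python. It is not the implementation used later in this post. The class SlidingWindow, its parameters, and the timestamps in seconds are all assumptions made for illustration, and it again uses the sample standard deviation.

```python
import statistics
from collections import deque


class SlidingWindow:
    """Keeps (timestamp, value) pairs that are at most window_seconds old."""

    def __init__(self, window_seconds: float = 3600.0, min_samples: int = 2):
        self.window_seconds = window_seconds
        self.min_samples = min_samples  # stdev needs at least two values
        self.samples = deque()

    def _evict(self, now: float) -> None:
        # Old values fall out at the end of the window.
        while self.samples and self.samples[0][0] < now - self.window_seconds:
            self.samples.popleft()

    def is_unusual(self, now: float, value: float) -> bool:
        """Compare value against the past values in the window, then store it."""
        self._evict(now)
        past = [v for _, v in self.samples]
        self.samples.append((now, value))
        if len(past) < self.min_samples:
            return False  # not enough history to judge
        avg = statistics.mean(past)
        sd = statistics.stdev(past)
        return value < avg - 2 * sd or value > avg + 2 * sd


window = SlidingWindow()
readings = [15, 10, 11, 12, 14, 13, 14, 15, 9, 12]
for minute, temperature in enumerate(readings):
    window.is_unusual(minute * 60.0, temperature)  # one reading per minute
print(window.is_unusual(600.0, 25))
```

The last call prints True, because 25 is far above the 2-sigma range of the ten readings in the window. Recomputing the statistics on every reading is fine for one value per minute. For high-volume streams you would probably want running sums instead.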



Implementation