Median in SQL.

By Leonid Toshchev, Dec. 4, 2014, 2:49 p.m.

A beautiful sunny morning. We are happy as never were. In the last days we saw how the average check was growing. Yesterday we had 10 sales for $5, 5 sales for $10 and 4 sales for $100. Mean was $26.3. It would seem that there were no signs of troubles, but in the evening the situation has changed dramatically. Mean was only $17.6. Nine dollars difference? Why? Because today we have only 2 sales for $100.

When the average check is falling by 60% CEO turns grey overnight. To preserve his health in safety, caring analysts used the median value in the calculations. Median eliminates the influence of the heavy-tailed distributions. So average check will be saved from such strong fluctuations.

How we can calculate the median:

1. Order all values in ascending. 5 5 5 5 5 5 5 5 5 5 10 10 10 10 10 100 100 100 100

2.1. Find the center of the series. 5 5 5 5 5 5 5 5 5 [5] 10 10 10 10 10 100 100 100 100

2.2 If the number of elements of the series is even we will calculate the mean for two central elements.

3. Founded value is median.

It is surprising, but to calculate the median is easier than the mean count. The child who doesn't know addition and subtraction able to properly carry out the calculation.

Many tools can do it for us. Unfortunately, SQL doesn't belong to them.

Fortunately, we still able to calculate the median through SQL. Why? Data aggregation as always: the calculation of the median values of product groups, the median as a condition of the sample data, etc.

Try to find the median price for an order item. The following query works on almost any SQL dialects and is the basis for calculating the median grade:

SELECT x.price FROM positions x, positions y GROUP BY x.price HAVING SUM(SIGN(1-SIGN(y.price-x.price))) = (COUNT(*)+1)/2;

The first thing that catches the eye is lack of sorting, which is a key step in finding the median. Sort replaced by a combination of the SUM and SIGN functions. With Cartesian product we match every price all other prices. Then we are looking for their difference. If the difference is positive or zero, the combination of SIGN functions returns 1. The sum of these results will show number of the prices which less than (or equal) to the current, i.e., it shows what position this price has in the ordered list. If this position is equal to the half of the entire table, then it is our median.

This query doesn't work in two cases. When the median is calculated as the average of the two numbers. And when several numbers in the list have the same value as the median (as it was with number 5 in the first example). Let's fix the query:

SELECT CASE count(tmp.median) WHEN 1 THEN tmp.median WHEN 2 THEN AVG(tmp.median) END as median FROM ( SELECT x.price as median FROM positions x, positions y GROUP BY x.price HAVING COUNT(*) / 2 BETWEEN SUM( CASE 1 - SIGN(y.price - x.price) WHEN 2 THEN 1 ELSE 0 END ) AND SUM(SIGN(1 - SIGN(y.price - x.price))) ) tmp;

With the help of a CASE WHEN clause in subquery we fix the second case. We just check if the half of the table size is between borders of the median values list. With the help of a CASE WHEN clause in outer query we eliminate troubles with calculation of the mean.

You can use this query just replaced the name of the tables and fields on yours. Unfortunately the trick with the calculation position of the price requires a Cartesian product of the tables that is a very heavy process. So the performance of queries on large datasets may leave a lot to be desired.

There is easiest and fastest ways to calculate median in SQL. But all of them depends on the specific RDBMS ability. You can consider the example above as a set of nice and useful tricks which will work everywhere. Just be careful using this method under the big load.

If you have any comments or suggestions - feel free to email me: mail@datamonkey.pro