The most useful feature of the relational database is that it allows us to easily process data in sets, which can be much faster than processing it serially.

When the relational database was first implemented, write-ahead logging and related technologies did not yet exist, which made it difficult to build a database that matched natural set theory. You see, in set theory there is no possibility of duplicates in a set. This seems hard to fathom given the way we are used to dealing with data in a computer, but it becomes intuitive when you “think about how you think”.

When you say, “I have ten fingers and ten toes”, and I ask, “how many is that combined?”, you say 20, of course. Did you stop to add up the number one ten times, and then ten times more? Of course not. You did simple arithmetic to make the math faster, adding 10 + 10. It is a simple fact: when you collapse repeated work into a single computation, you finish faster.

How can we reduce the data size of any set, possibly by orders of magnitude, with almost no work? Simple: we start thinking in sets, the way our brains do.

I am going to use a bigger example than “digits”, because a set that small is too small to show a meaningful difference.

Let's say I have a table of numbers, and I want to sum them all up:

```
mysql> show create table ex1;
+-------+----------------------------------------------------+
| Table | Create Table                                       |
+-------+----------------------------------------------------+
| ex1   | CREATE TABLE `ex1` (
  `val` bigint(20) NOT NULL DEFAULT '0'
) ENGINE=MyISAM DEFAULT CHARSET=latin1                       |
+-------+----------------------------------------------------+
1 row in set (0.00 sec)

mysql> select count(*) from ex1;
+----------+
| count(*) |
+----------+
| 73027220 |
+----------+
1 row in set (0.00 sec)

mysql> select sum(val) from ex1;
+-----------+
| sum(val)  |
+-----------+
| 871537665 |
+-----------+
1 row in set (5.49 sec)
```

Now, what if I structure my data differently? We can “compress” the table in the database by removing the duplicates:

```
mysql> create table ex2 as select val, count(*) from ex1 group by val;
Query OK, 9 rows affected (19.51 sec)
Records: 9  Duplicates: 0  Warnings: 0

mysql> select * from ex2;
+-----+----------+
| val | count(*) |
+-----+----------+
| -10 |        1 |
|  -3 |        1 |
|  -2 |        1 |
|   0 |   250000 |
|   4 |  8000000 |
|   7 | 14680064 |
|  14 | 14680064 |
|  15 | 35417088 |
|  16 |        1 |
+-----+----------+
9 rows in set (0.00 sec)

mysql> select sum(`count(*)`) from ex2;
+-----------------+
| sum(`count(*)`) |
+-----------------+
|        73027220 |
+-----------------+
1 row in set (0.00 sec)
```

This is very useful compression. First, the data is stored as a histogram of the set: the cardinality and the distribution are always known. Second, computation can be done on the compressed data without decompressing it, because a mathematical transformation does not require expanding the rows back out. Third, the count acts as run-length encoding (RLE) for the other attributes. Don't use FLOAT with this method (arguably, don't use it at all): floating-point values don't group reliably on equality, and their sums depend on evaluation order.
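To make the arithmetic concrete, here is a small sketch of the idea using Python's standard sqlite3 module instead of MySQL (the table and column names here are my own, illustrative ones). The sum over the raw rows equals SUM(val * cnt) over the histogram, so the aggregate can be computed on the compressed form:

```python
import sqlite3

# In-memory database with a raw table of duplicated values.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw (val INTEGER NOT NULL)")
con.executemany("INSERT INTO raw VALUES (?)",
                [(v,) for v in [4] * 1000 + [7] * 500 + [-10]])

# "Compress" the set into a histogram: one row per distinct value.
con.execute(
    "CREATE TABLE hist AS SELECT val, COUNT(*) AS cnt FROM raw GROUP BY val")

# The sum is computed on the compressed form without expanding it.
raw_sum = con.execute("SELECT SUM(val) FROM raw").fetchone()[0]
hist_sum = con.execute("SELECT SUM(val * cnt) FROM hist").fetchone()[0]
assert raw_sum == hist_sum  # 4*1000 + 7*500 + (-10) = 7490
```

The histogram query scans three rows instead of 1,501; on the 73-million-row table above, that is the difference the timings show.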

We can do even more interesting things with this concept. Let's say I want to determine whether certain numbers are present in my set.

First, let's create a “search set” table to hold the numbers we are looking for, along with the minimum cardinality we want for each:

```
CREATE TABLE `search_set` (
  `val` bigint(20) DEFAULT NULL,
  `cnt` bigint(20) DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

mysql> insert into search_set values (-2,1),(-3,1),(-10,1),(15,2),(16,1);
Query OK, 5 rows affected (0.00 sec)
Records: 5  Duplicates: 0  Warnings: 0
```

We are looking for at least one minus-two, at least two fifteens, and so on:

```
mysql> select * from compressed join search_set using (val)
       where compressed.cnt >= search_set.cnt;
+-----+----------+------+
| val | cnt      | cnt  |
+-----+----------+------+
| -10 |        1 |    1 |
|  -3 |        1 |    1 |
|  -2 |        1 |    1 |
|  15 | 35417088 |    2 |
|  16 |        1 |    1 |
+-----+----------+------+
5 rows in set (0.00 sec)

mysql> delete from compressed where val=16;
Query OK, 1 row affected (0.00 sec)

mysql> select * from compressed join search_set using (val)
       where compressed.cnt >= search_set.cnt;
+-----+----------+------+
| val | cnt      | cnt  |
+-----+----------+------+
| -10 |        1 |    1 |
|  -3 |        1 |    1 |
|  -2 |        1 |    1 |
|  15 | 35417088 |    2 |
+-----+----------+------+
4 rows in set (0.00 sec)
```
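The same join can be sketched in Python with SQLite (again with illustrative data rather than the full 73-million-row set): a value “matches” when it exists in the compressed histogram with at least the requested count.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE compressed (val INTEGER, cnt INTEGER)")
con.executemany("INSERT INTO compressed VALUES (?, ?)",
                [(-10, 1), (-3, 1), (-2, 1), (0, 250000), (15, 35417088)])

# The values we require, with the minimum cardinality for each.
con.execute("CREATE TABLE search_set (val INTEGER, cnt INTEGER)")
con.executemany("INSERT INTO search_set VALUES (?, ?)",
                [(-2, 1), (15, 2), (16, 1)])

# Join the histogram against the search set; the cnt comparison
# enforces the minimum cardinality.
matches = con.execute("""
    SELECT c.val FROM compressed c
    JOIN search_set s ON c.val = s.val
    WHERE c.cnt >= s.cnt
    ORDER BY c.val""").fetchall()
print(matches)  # 16 is not in the histogram, so only -2 and 15 match
```

Because the histogram has one row per distinct value, this membership-with-cardinality check touches a handful of rows no matter how large the original set was.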

Great, things are looking good. What if we ran the same check on our original relation, before compression? Here is a way to do it without the “search set” trick above.

Also, notice that the delete takes much longer this time, since there is no index on this table.

```
mysql> delete from data where val=16;
Query OK, 1 row affected (3.14 sec)

mysql> select val, count(distinct id) from data
       where val in (-2,-3,-10,15,15,16) group by val;
+-----+--------------------+
| val | count(distinct id) |
+-----+--------------------+
| -10 |                  1 |
|  -3 |                  1 |
|  -2 |                  1 |
|  15 |           35417088 |
+-----+--------------------+
4 rows in set (39.82 sec)
```

We can also check to see if any two numbers sum to zero:
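One way such a check can work on the histogram (a sketch in Python/SQLite; the pairing logic here is my own illustration, not necessarily the exact query used on the original tables) is a self-join of the histogram against its own negation, with a special case for zero, which only pairs with itself when it appears at least twice:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE hist (val INTEGER, cnt INTEGER)")
con.executemany("INSERT INTO hist VALUES (?, ?)",
                [(-10, 1), (-3, 1), (3, 5), (0, 1), (15, 35417088)])

# A pair sums to zero when a value and its negation both appear,
# or when zero itself appears with cardinality >= 2.
pairs = con.execute("""
    SELECT a.val, b.val FROM hist a
    JOIN hist b ON a.val = -b.val AND a.val < b.val
    UNION ALL
    SELECT 0, 0 FROM hist WHERE val = 0 AND cnt >= 2""").fetchall()
print(pairs)  # only (-3, 3); the lone 0 cannot pair with itself
```

Again the work is proportional to the number of distinct values, not the number of rows in the original set.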