I want to take a few minutes for a deep dive into the effect your data model has on storage density when using PostgreSQL. When this topic came up with a customer, I explained my thoughts on the matter, but I realized at the time that I had never done a reasonably careful apples-to-apples test to see just exactly what the effect is, at least for a model sample size of one. So here it is.

Base Use Case

Let's start out with a presumed use-case. I have two tables: one is a parent table and the other a detail table. Think of them as something like orders and order_detail or order_lineitem. Or perhaps, more closely matching my sample code, they might represent sensor data collection, with a parent table for each collection event and a detail table with one row for each sensor collected during that event. Something like this:

CREATE TABLE parent (
  id bigserial PRIMARY KEY,
  pname text NOT NULL,
  pts timestamptz NOT NULL
);
CREATE UNIQUE INDEX parent_pname ON parent(pname, pts);
CREATE INDEX parent_pts ON parent(pts);

CREATE TABLE detail (
  pid int8 NOT NULL REFERENCES parent(id),
  elemname text NOT NULL,
  elemval float8 NOT NULL,
  PRIMARY KEY (pid, elemname)
);
CREATE INDEX detail_elemname ON detail(elemname);

This is a pretty minimal representation but should suffice to illustrate my points. I created some indexes that might be needed depending on the data access patterns of a real life use-case.

INSERT INTO parent (pname, pts)
SELECT 'pname ' || (g.i % 14197)::text,
       now() + (g.i::text || ' seconds')::interval
FROM generate_series(1,100000) AS g(i);

INSERT INTO detail
SELECT p.id,
       'elem_' || (CASE WHEN g.i < 10 THEN '0' || g.i::text
                        ELSE g.i::text END),
       (g.i % 10) + random()
FROM parent AS p, generate_series(0,99) AS g(i);

VACUUM FREEZE ANALYZE;

As you can see I created 100,000 rows in the parent table, and 10 million in the detail table. With this data, each pname should be repeated roughly 7 times, but each time with a different "snapshot" timestamp, thus satisfying the UNIQUE index.

Now I am further going to assume that the data for each sensor snapshot will usually be accessed in an all-or-nothing fashion. For many use-cases this is perfectly true, and in some it might not be.

In any case, let's see how large these two tables and all accompanying storage (indexes, etc.) are:

SELECT pg_size_pretty(pg_total_relation_size('parent'::regclass) +
                      pg_total_relation_size('detail'::regclass));
 pg_size_pretty
----------------
 1178 MB
(1 row)

So just over 1.1 GB of storage consumed.

We will do a quick test to see how long it takes to return one sensor snapshot of data. For fun, the timing method I am using is a custom PL/pgSQL function as shown below. I use a random integer as the key to select one parent row and all of the associated detail rows. I run it one thousand times to average out the effects of caching and other overhead that I am not interested in:

CREATE OR REPLACE FUNCTION timeit(insql text)
RETURNS interval AS $$
DECLARE
  tgtpid bigint;
  startts timestamp;
  sumint interval = '0 seconds';
  rec record;
  i int;
  numiters int := 1000;
BEGIN
  FOR i IN 1..numiters LOOP
    tgtpid := round(100000 * random());
    startts := clock_timestamp();
    EXECUTE insql INTO rec USING tgtpid;
    sumint := sumint + (clock_timestamp() - startts)::interval;
  END LOOP;
  RETURN (sumint / numiters);
END;
$$ LANGUAGE plpgsql;

SELECT timeit($$
  SELECT count(1)
  FROM parent p
  JOIN detail d ON d.pid = p.id
  WHERE p.id = $1
$$);
     timeit
-----------------
 00:00:00.000331

You can see that pulling one sensor snapshot out is pretty darn fast -- about a third of a millisecond (331 microseconds). For the record this is done on a commodity desktop computer, on PostgreSQL 11, with default configuration. Your mileage may vary (YMMV).

Modified Data Model

Now I want to try the first alternative data model. In this one I am combining the entire sensor snapshot's detail data into two arrays, names and values, and storing them inline with the parent tuple. This is consistent with the assumption of an all-or-nothing data access pattern. Below I create the table with more-or-less equivalent indexes, load the same data that is in the original parent and detail tables, run some post loading maintenance, check the data size, and finally time the access for a sensor snapshot:

DROP TABLE parentdetail;
CREATE TABLE parentdetail (
  id int8 PRIMARY KEY,
  pname text NOT NULL,
  pts timestamptz NOT NULL,
  elemnames text[],
  elemvals float8[]
);
CREATE UNIQUE INDEX parentdetail_pname ON parentdetail(pname, pts);
CREATE INDEX parentdetail_pts ON parentdetail(pts);
CREATE INDEX parentdetail_elemnames ON parentdetail USING GIN(elemnames);

INSERT INTO parentdetail
SELECT id, pname, pts, array_agg(elemname), array_agg(elemval)
FROM (
  SELECT p.id, p.pname, p.pts, d.elemname, d.elemval
  FROM parent p
  JOIN detail d ON d.pid = p.id
  ORDER BY p.id, p.pts, d.elemname
) AS ss
GROUP BY id, pname, pts;

VACUUM FREEZE ANALYZE parentdetail;

SELECT pg_size_pretty(pg_table_size('parentdetail'::regclass));
 pg_size_pretty
----------------
 130 MB

SELECT pg_size_pretty(pg_indexes_size('parentdetail'::regclass));
 pg_size_pretty
----------------
 29 MB

SELECT pg_size_pretty(pg_total_relation_size('parentdetail_pkey'));
 pg_size_pretty
----------------
 2208 kB

SELECT pg_size_pretty(pg_total_relation_size('parentdetail_pname'));
 pg_size_pretty
----------------
 4448 kB

SELECT pg_size_pretty(pg_total_relation_size('parentdetail_pts'));
 pg_size_pretty
----------------
 2208 kB

SELECT pg_size_pretty(pg_total_relation_size('parentdetail_elemnames'));
 pg_size_pretty
----------------
 20 MB

SELECT pg_size_pretty(pg_total_relation_size('parentdetail'::regclass));
 pg_size_pretty
----------------
 159 MB

SELECT timeit($$
  SELECT count(1) FROM parentdetail WHERE id = $1
$$);
     timeit
-----------------
 00:00:00.000165
(1 row)

A few observations regarding the above:

The storage required for the exact same data is now reduced to about 13% of the original. This is due in large part to the fact that our detail table is fairly narrow, so PostgreSQL's inherent per-row overhead accounted for a large share of its footprint. TOAST compression of the arrays potentially plays a part as well.
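As a rough sanity check on the per-row overhead claim (a sketch; exact numbers will vary with alignment and platform), you can compare the average user-data payload of a detail row against the heap bytes it actually consumes:

```sql
-- Average size of the user data in each detail row (excludes the
-- ~23-byte tuple header, item pointer, and alignment padding):
SELECT avg(pg_column_size(d.*)) FROM detail d;

-- Versus the actual on-disk heap bytes consumed per row:
SELECT pg_relation_size('detail') / count(*) AS bytes_per_row
FROM detail;
```

The gap between the two figures is the fixed overhead that the array-based model amortizes over 100 elements at a time.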

VACUUM FREEZE ANALYZE is run after data loading. This has several beneficial effects. After bulk loading a table, the tuples have an initial transaction id at which they became visible, and they lack a "set" visibility hint bit. The former must be changed at some point in the future to prevent the row from becoming invisible due to "transaction id wraparound". The latter will get set the next time the row is touched when it is visible to all existing database sessions, even if that touch is nominally read-only (i.e. a SELECT operation). By vacuuming the table with the freeze option, we ensure that the rows are immediately rewritten as "frozen," meaning they don't need to be frozen later, and with the visibility hint bit set, meaning some later SELECT does not need to cause a write to storage. Additionally, the analyze option ensures that PostgreSQL will have all the needed statistics up to date when planning future queries against this table so that the queries can be properly optimized. All of this would eventually happen automatically via autovacuum/autoanalyze, but it is usually advantageous to get it out of the way up front when bulk loading.
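One way to observe the effect described above (a sketch; the second query assumes the pg_visibility contrib extension is installed): right after the VACUUM FREEZE, the table's relfrozenxid should be very recent, and every page should be marked both all-visible and all-frozen in the visibility map:

```sql
-- Age (in transactions) of the oldest unfrozen xid in the table;
-- this should be near zero immediately after a VACUUM FREEZE:
SELECT relname, age(relfrozenxid)
FROM pg_class
WHERE relname = 'parentdetail';

-- Requires: CREATE EXTENSION pg_visibility;
SELECT count(*) FILTER (WHERE all_visible) AS visible_pages,
       count(*) FILTER (WHERE all_frozen)  AS frozen_pages,
       count(*)                            AS total_pages
FROM pg_visibility_map('parentdetail');
```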

The average access time for a sensor snapshot is now about 50% of the (already fast) time for the original data model.

By the way, TOAST is an internal PostgreSQL mechanism which by default tries to compress (and, if necessary, store out-of-line) column data in rows that exceed a threshold of roughly 2040 bytes. However, in our case, the per-tuple size of the new parentdetail table works out to something like:

    23  (tuple overhead)
  +  8  (id)
  + 16  (estimated average for pname)
  +  8  (pts)
  + 24  (array overhead)
  + 1100 (estimated text array data)
  + 24  (array overhead)
  + 800 (estimated float8 array data)
  ------
  2003 bytes
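To check the estimate against reality (a sketch), pg_column_size on a whole-row value reports the actual datum size, after any compression that did occur:

```sql
-- Average stored size of a parentdetail row's user data; tuple header
-- and item pointer are not included in this figure:
SELECT avg(pg_column_size(pd.*))::int AS avg_row_bytes
FROM parentdetail pd;
```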

I'm not 100% sure I got that perfectly correct, but it would suggest that these tuples are not getting compressed. To rectify that, do the following:

DROP TABLE parentdetail;
CREATE TABLE parentdetail (
  id int8 NOT NULL,
  pname text NOT NULL,
  pts timestamptz NOT NULL,
  elemnames text[],
  elemvals float8[]
) WITH (toast_tuple_target = 1024);

INSERT INTO parentdetail
SELECT id, pname, pts, array_agg(elemname), array_agg(elemval)
FROM (
  SELECT p.id, p.pname, p.pts, d.elemname, d.elemval
  FROM parent p
  JOIN detail d ON d.pid = p.id
  ORDER BY p.id, p.pts, d.elemname
) AS ss
GROUP BY id, pname, pts;

ALTER TABLE parentdetail ADD PRIMARY KEY(id);
CREATE UNIQUE INDEX parentdetail_pname ON parentdetail(pname, pts);
CREATE INDEX parentdetail_pts ON parentdetail(pts);
CREATE INDEX parentdetail_elemnames ON parentdetail USING GIN(elemnames);
VACUUM FREEZE ANALYZE parentdetail;

SELECT pg_size_pretty(pg_table_size('parentdetail'));
 pg_size_pretty
----------------
 132 MB

SELECT pg_size_pretty(pg_indexes_size('parentdetail'));
 pg_size_pretty
----------------
 20 MB

SELECT pg_size_pretty(pg_total_relation_size('parentdetail_pkey'));
 pg_size_pretty
----------------
 2208 kB

SELECT pg_size_pretty(pg_total_relation_size('parentdetail_pname'));
 pg_size_pretty
----------------
 3992 kB

SELECT pg_size_pretty(pg_total_relation_size('parentdetail_pts'));
 pg_size_pretty
----------------
 2208 kB

SELECT pg_size_pretty(pg_total_relation_size('parentdetail_elemnames'));
 pg_size_pretty
----------------
 12 MB

SELECT pg_size_pretty(pg_total_relation_size('parentdetail'::regclass));
 pg_size_pretty
----------------
 152 MB

SELECT timeit($$
  SELECT count(1) FROM parentdetail WHERE id = $1
$$);
     timeit
-----------------
 00:00:00.000167

Notice a few things here:

I am now loading the data before creating the primary key or the secondary indexes. If you are adding rows to an existing table that already has data and is being accessed concurrently, you cannot do this; but when bulk loading into a newly created table, it is generally a big win in terms of load speed and WAL (transaction log) generation. The latter matters if you are archiving WAL for Point In Time Recovery (PITR) or binary replication purposes. I have personally seen pathological cases where the amount of WAL generated by a bulk load was reduced from several TB to tens of GB.

The total storage footprint has decreased from 159 MB to 152 MB. This reflects two effects, one beneficial and one detrimental. Recall that the floating point data used for this example was at least partially randomly generated. Random data usually does not compress well, and it may in fact get larger when compressed, which is supported by the fact that the table data increased by 2 MB. It might also be related to the fact that rows are chunked in TOAST tables, meaning more rows overall, which of course adds per-row storage overhead. On the other hand, the index storage decreased by 9 MB. This is likely because the indexes were built after the data was loaded, which produces denser indexes. In particular the GIN index on elemnames is quite a bit more compact.

The access time has remained about the same.

There is one more change I want to make to this table. The index created on elemnames seems unlikely to be very useful in many use-cases. Let's see what happens if we lose it:

DROP INDEX parentdetail_elemnames;

SELECT pg_size_pretty(pg_total_relation_size('parentdetail'::regclass));
 pg_size_pretty
----------------
 141 MB

SELECT timeit($$
  SELECT count(1) FROM parentdetail WHERE id = $1
$$);
     timeit
-----------------
 00:00:00.000165

Not surprisingly this reduces the footprint further without any impact on access speed, since we were not using that index and don't really expect to ever use it. The storage size is down another 7% for a total reduction to about 12% of the original.

Refined Data Model

There is another way we might go further. Do we really need the elemnames array column at all? If our data model is stable, i.e. we always have the same 100 elements in our sensor snapshot, perhaps we do not. For this test we will create and load a new table:

CREATE TABLE parentdetail2 (
  id int8 NOT NULL,
  pname text NOT NULL,
  pts timestamptz NOT NULL,
  elemvals float8[]
);

INSERT INTO parentdetail2
SELECT id, pname, pts, elemvals FROM parentdetail;

ALTER TABLE parentdetail2 ADD PRIMARY KEY(id);
CREATE UNIQUE INDEX parentdetail2_pname ON parentdetail2(pname, pts);
CREATE INDEX parentdetail2_pts ON parentdetail2(pts);
VACUUM FREEZE ANALYZE parentdetail2;

SELECT pg_size_pretty(pg_total_relation_size('parentdetail2'::regclass));
 pg_size_pretty
----------------
 95 MB

SELECT timeit($$
  SELECT count(1) FROM parentdetail2 WHERE id = $1
$$);
     timeit
-----------------
 00:00:00.000164

No real difference in access time again, but look at the storage size. We are now down to about 8% of the original storage size, a reduction by a factor of roughly 12.

Alternative Refined Data Model

Okay, given the assumption that we have a stable model with the same 100 elements every time, why not create a table with those 100 columns? Let's see how that pans out:

CREATE TABLE parentonly (
  id int8 NOT NULL,
  pname text NOT NULL,
  pts timestamptz NOT NULL,
  elem_00 float8 NOT NULL,
  elem_01 float8 NOT NULL,
  elem_02 float8 NOT NULL,
  [...]
  elem_98 float8 NOT NULL,
  elem_99 float8 NOT NULL
);

DO $_$
DECLARE
  rec record;
  sql text;
BEGIN
  FOR rec IN SELECT * FROM parentdetail LOOP
    sql := rec.id || $$,'$$ || rec.pname || $$','$$ || rec.pts || $$',$$;
    sql = sql || array_to_string(rec.elemvals, $$,$$);
    EXECUTE 'INSERT INTO parentonly VALUES(' || sql || ')';
  END LOOP;
END;
$_$;

ALTER TABLE parentonly ADD PRIMARY KEY(id);
CREATE UNIQUE INDEX parentonly_pname ON parentonly(pname, pts);
CREATE INDEX parentonly_pts ON parentonly(pts);
VACUUM FREEZE ANALYZE parentonly;

SELECT pg_size_pretty(pg_total_relation_size('parentonly'::regclass));
 pg_size_pretty
----------------
 95 MB

SELECT timeit($$
  SELECT count(1) FROM parentonly WHERE id = $1
$$);
     timeit
-----------------
 00:00:00.000178

Interestingly the storage size is exactly the same as parentdetail2. This suggests that we are definitely not getting any compression of the elemvals array, which should not be surprising since without the elemnames array the tuple size is well below the toast threshold. The access time is only slightly, but in my testing consistently, slower.

So why would you use one of these models versus the other (talking specifically about parentdetail2 compared to parentonly)? Well, if you frequently need only one or a small number of the element values when doing retrospective analysis of the data, then the column-per-element model is probably a win. On the other hand, if you normally perform analysis on most of the elements, or if you plan to have thousands of elements, the array model is probably best. In the latter case you don't really have the option of using the column-per-element model at all -- PostgreSQL supports at most 1600 columns, and depending on your exact table structure and data, far fewer may be practical.
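For illustration, retrieving a single element under each model might look like this (a sketch using an arbitrary id; note that elem_01 lands in array position 2, since the arrays were aggregated in elemname order and PostgreSQL arrays are 1-based):

```sql
-- Column-per-element model: direct column access, and an ordinary
-- btree expression index could be built on elem_01 if needed
SELECT elem_01 FROM parentonly WHERE id = 42;

-- Array model: positional access; the whole array is detoasted
-- even though we only want one element
SELECT elemvals[2] AS elem_01 FROM parentdetail2 WHERE id = 42;
```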

JSONB Data Model

Finally, I want to see how this would look if we used JSONB instead.

CREATE TABLE parentdetailjsonb (
  id int8 NOT NULL,
  pdoc jsonb
);

INSERT INTO parentdetailjsonb
SELECT id, row_to_json(ss)::jsonb
FROM (
  SELECT id, pname, pts,
         jsonb_object(elemnames, elemvals::text[]) AS elements
  FROM parentdetail
  ORDER BY id, pts
) AS ss;

ALTER TABLE parentdetailjsonb ADD PRIMARY KEY(id);
CREATE UNIQUE INDEX parentdetailjsonb_pname
  ON parentdetailjsonb (((pdoc->'pname')::text), ((pdoc->'pts')::text));
CREATE INDEX parentdetailjsonb_pts
  ON parentdetailjsonb (((pdoc->'pts')::text));
CREATE INDEX parentdetailjsonb_pdoc ON parentdetailjsonb USING GIN(pdoc);
VACUUM FREEZE ANALYZE;

SELECT pg_size_pretty(pg_total_relation_size('parentdetailjsonb'::regclass));
 pg_size_pretty
----------------
 900 MB

SELECT timeit($$
  SELECT count(1) FROM parentdetailjsonb
  WHERE pdoc @> ('{"id": ' || $1 || '}')::jsonb
$$);
     timeit
-----------------
 00:00:00.000433

Note the following:

The total size of table plus indexes is very large again, although not quite as large as in the original two-table scenario. In this case we probably are benefitting from TOAST compression, since JSONB is a compressible datatype.

The access time is quite fast in absolute terms, but relatively slow compared to all previous scenarios including the original two table model.
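A rough way to see whether TOAST is in play for that first point (a sketch; the second figure also includes the small free space and visibility maps) is to compare the main heap size with the total table size including its TOAST relation:

```sql
-- Main heap only vs. everything attributable to the table itself
-- (heap + TOAST + free space map + visibility map, but no indexes):
SELECT pg_size_pretty(pg_relation_size('parentdetailjsonb')) AS main_heap,
       pg_size_pretty(pg_table_size('parentdetailjsonb')
                      - pg_relation_size('parentdetailjsonb')) AS toast_and_maps;
```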

Now, this is a little unfair to JSONB because we have created a GIN index on the entire document, and we are using that index to select our sensor snapshots. The GIN index gives us great flexibility in getting indexed access to various parts of these documents in ways that are not possible given the other table definitions. However, as stated earlier, we are assuming that we do not typically need to access the data in that way. So now look at the following:

DROP INDEX parentdetailjsonb_pdoc;

SELECT pg_size_pretty(pg_table_size('parentdetailjsonb'::regclass));
 pg_size_pretty
----------------
 271 MB

SELECT pg_size_pretty(pg_indexes_size('parentdetailjsonb'::regclass));
 pg_size_pretty
----------------
 15 MB

SELECT pg_size_pretty(pg_total_relation_size('parentdetailjsonb'::regclass));
 pg_size_pretty
----------------
 286 MB

SELECT timeit($$
  SELECT count(1) FROM parentdetailjsonb WHERE id = $1
$$);
     timeit
-----------------
 00:00:00.000167

You can see that the total storage requirement is now reduced significantly, although not as much as with parentdetail2 or parentonly -- it is in fact about 3 times their size, but still only 24% of the original two-table design.
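As an aside, pulling typed values back out of the document model looks something like this (a sketch using an arbitrary id; recall that the element values were stored as text via jsonb_object, hence the cast):

```sql
SELECT (pdoc->>'pname') AS pname,
       ((pdoc->'elements')->>'elem_01')::float8 AS elem_01
FROM parentdetailjsonb
WHERE id = 42;
```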

UPDATE: Composite Types

Adam Brusselback reached out to me after running a similar test with composite types. I wanted to share his results below. In Adam's words:

"Hey Joe, I enjoyed the post. It really does points out how much data model can determine if a problem can be dealt with on a single machine or needs more complicated solutions. Like you said, changing data model can sometimes be a much better way to solve a problem."



"I've used these techniques before to deal with some real world problems in my application. One main difference though, is I usually utilize composite type arrays rather than arrays of each of the elements of the detail table."



"I thought i'd do a little of the same analysis you did in your post but using composite types."



"For example:"

CREATE TYPE detail_type AS (
  elemname text,
  elemval float8
);

CREATE TEMPORARY TABLE parentdetail (
  id int8 PRIMARY KEY,
  pname text NOT NULL,
  pts timestamptz NOT NULL,
  detail_elem detail_type[]
);

INSERT INTO parentdetail
SELECT id, pname, pts, array_agg(row(elemname, elemval)::detail_type)
FROM (
  SELECT p.id, p.pname, p.pts, d.elemname, d.elemval
  FROM parent p
  JOIN detail d ON d.pid = p.id
  ORDER BY p.id, p.pts, d.elemname
) AS ss
GROUP BY id, pname, pts;

CREATE UNIQUE INDEX parentdetail_pname ON parentdetail(pname, pts);
CREATE INDEX parentdetail_pts ON parentdetail(pts);
VACUUM FREEZE ANALYZE parentdetail;

"This gives all the benefits of having a normalized table as far as being able to query more naturally rather than individual arrays of each element, it also makes it easier to keep the related array attributes together. Say you wanted to run this query:"

SELECT pname, sum(elemval)
FROM parent p
JOIN detail d ON d.pid = p.id
WHERE d.elemname = 'elem_01'
GROUP BY 1;

"That gets harder when you have your detail record split into different arrays. Even if it's not the most efficient, you can still do that query with the parentdetail specified above like so:"

SELECT pname, sum(elemval)
FROM (
  SELECT id, pname, pts, (unnest(detail_elem)).*
  FROM parentdetail
) x
WHERE elemname = 'elem_01'
GROUP BY 1;

"Then there are the original test queries you used:"

SELECT timeit($$
  SELECT count(1)
  FROM parent p
  JOIN detail d ON d.pid = p.id
  WHERE p.id = $1
$$);
--00:00:00.000212

SELECT timeit($$
  SELECT count(1) FROM parentdetail WHERE id = $1
$$);
--00:00:00.000087

"Then trying to do something more analytical:"

SELECT timeit($$
  SELECT pts::date, avg(d.elemval), sum(d.elemval)
  FROM parent p
  JOIN detail d ON d.pid = p.id
  WHERE id = $1
  GROUP BY 1
$$);
--00:00:00.000796

SELECT timeit($$
  SELECT pts::date, avg(elemval), sum(elemval)
  FROM (
    SELECT id, pts, (unnest(detail_elem)).elemval
    FROM parentdetail
  ) x
  WHERE id = $1
  GROUP BY 1
$$);
--00:00:00.000259

"Then we have the sizes (I am removed the elemname index because that is not possible using the parentdetail implementation):"

SELECT pg_size_pretty(pg_total_relation_size('parent'::regclass) +
                      pg_total_relation_size('detail'::regclass));
--906 MB

SELECT pg_size_pretty(pg_total_relation_size('parentdetail'::regclass));
--138 MB

Conclusion

Why should you care about all of this? Well, imagine that the projections for your initial data model, ingest rate, and data retention requirements suggest you may need to scale up to, say, 100 TB of storage. You might well want to explore your options for reducing that requirement. We have shown here that in certain -- but I think reasonably common -- circumstances, you may be able to reduce that 100 TB to 8 TB! The lesson is that there is significant wiggle room in many designs, and it is worthwhile to look for options such as these before resorting to sharding or other distributed models.

That's it for now. I hope you find this information useful if not inspiring.