Zedstore - compressed in-core columnar storage

Heikki and I have been hacking for the past few weeks to implement

in-core columnar storage for PostgreSQL. Here's the design and initial

implementation of Zedstore, compressed in-core columnar storage (table

access method). Attaching the patch and link to github branch [1] to

follow along. The objective is to gather feedback on the design and approach.

The implementation has the core basic pieces working, but it is not close

to complete.

Big thank you to Andres, Haribabu and team for the table access method
APIs. We leveraged the APIs to implement zedstore, and they prove to be in
very good shape. We had to enhance them only minimally and in general
didn't have to touch the executor much.

Motivations / Objectives

* Performance improvement for queries selecting a subset of columns

(reduced IO).

* Reduced on-disk footprint compared to heap table. Shorter tuple

headers, and also leveraging compression of similar-type data

* Be a first-class citizen in the Postgres architecture (table data can

just independently live in columnar storage)

* Fully MVCC compliant

* All Indexes supported

* Hybrid row-column store, where some columns are stored together, and

others separately. Provide flexibility of granularity on how to

divide the columns. Columns accessed together can be stored

together.

* Provide better control over bloat (similar to zheap)

* Eliminate need for separate toast tables

* Faster add / drop column or changing data type of column by avoiding

full rewrite of the table.

High-level Design - B-trees for the win!
========================================

To start simple, let's ignore the column store aspect for a moment and

consider it as a compressed row store. The column store is a natural
extension of this concept, explained in the next section.

The basic on-disk data structure leveraged is a B-tree, indexed by
TID. A B-tree is a great data structure: fast and versatile. Note that this
is not referring to the existing B-tree indexes; this is a net new,
separate B-tree used for storing the table data itself.

TID - logical row identifier:

TID is just a 48-bit row identifier. The traditional division into

block and offset numbers is meaningless. In order to find a tuple with

a given TID, one must always descend the B-tree. Having a logical TID
provides the flexibility to move tuples to different pages when page
splits or page merges are performed.
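To make that concrete, here is a rough sketch of a logical TID type (the
names and macros are made up for illustration, not the actual definitions in
the patch):

#include <stdint.h>

/*
 * Illustration only: a zedstore TID is simply a 48-bit integer, carried
 * here in a uint64.  It has no block/offset structure; locating the tuple
 * always means descending the B-tree keyed on this value.
 */
typedef uint64_t zs_tid;

#define ZS_TID_BITS  48
#define ZS_MAX_TID   ((((uint64_t) 1) << ZS_TID_BITS) - 1)

static inline zs_tid
zs_tid_from_uint64(uint64_t val)
{
    return val & ZS_MAX_TID;
}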

The internal pages of the B-tree are super simple and boring. Each
internal page just stores an array of TID and downlink pairs. Let's

focus on the leaf level. Leaf blocks have a short uncompressed header,
followed by btree items. Two kinds of items exist:

- a plain item, holds one tuple or one datum, uncompressed payload
- a "container item", holds multiple plain items, compressed payload

+-----------------------------

| Fixed-size page header:

|

| LSN

| TID low and hi key (for Lehman & Yao B-tree operations)

| left and right page pointers

|

| Items:

|

| TID | size | flags | uncompressed size | lastTID | payload (container item)
| TID | size | flags | uncompressed size | lastTID | payload (container item)

| TID | size | flags | undo pointer | payload (plain item)

| TID | size | flags | undo pointer | payload (plain item)

| ...

|

+----------------------------
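To mirror the item layout in the diagram above, leaf item headers could be
sketched roughly like this (purely illustrative field names and widths, not
the patch's actual on-disk structs):

#include <stdint.h>

typedef uint64_t zs_tid;                /* 48-bit logical TID, in 64 bits here */

/* Common prefix of every leaf item: TID | size | flags */
typedef struct
{
    zs_tid      tid;                    /* first (or only) TID covered by the item */
    uint16_t    size;                   /* total item size in bytes */
    uint16_t    flags;                  /* e.g. a "compressed container" bit */
} zs_item_header;

/* Plain item: one uncompressed tuple or datum, with its own undo pointer. */
typedef struct
{
    zs_item_header hdr;
    uint64_t    undo_ptr;               /* undo record pointer, for MVCC */
    /* uncompressed payload follows */
} zs_plain_item;

/* Container item: many plain items squeezed into one compressed payload. */
typedef struct
{
    zs_item_header hdr;
    uint32_t    uncompressed_size;
    zs_tid      last_tid;               /* highest TID inside the payload */
    /* compressed payload follows */
} zs_container_item;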

Row store
---------

The tuples are stored one after another, sorted by TID. For each

tuple, we store its 48-bit TID, an undo record pointer, and the actual
tuple data, uncompressed.

In uncompressed form, the page can be arbitrarily large. But after

compression, it must fit into a physical 8k block. If on insert or

update of a tuple, the page cannot be compressed below 8k anymore, the

page is split. Note that because TIDs are logical rather than physical

identifiers, we can freely move tuples from one physical page to

another during a page split. A tuple's TID never changes.

The buffer cache caches compressed blocks. Likewise, WAL-logging,
full-page images etc. work on compressed blocks. Decompression is done
on the fly, in backend-private memory, as and when needed while
reading. For some compression schemes, like run-length encoding (RLE) or
delta encoding, tuples can be constructed directly from the compressed data.

Column store
------------

A column store uses the same structure, but we have *multiple* B-trees,

one for each column, all indexed by TID. The B-trees for all columns

are stored in the same physical file.

A metapage at block 0 has links to the roots of the B-trees. Leaf
pages look the same, but instead of storing the whole tuple, they store
just a single attribute. To reconstruct a row with a given TID, a scan

descends down the B-trees for all the columns using that TID, and

fetches all attributes. Likewise, a sequential scan walks all the

B-trees in lockstep. So, in summary, you can imagine Zedstore as a forest of
B-trees, one for each column, all indexed by TIDs.
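Roughly, fetching one row then looks like the following sketch; the helper
functions here are invented names for illustration, not the patch's API:

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t  zs_tid;
typedef uintptr_t Datum;

typedef struct zs_btree zs_btree;       /* one per-attribute B-tree, opaque here */

/* hypothetical helpers */
extern zs_btree *zs_get_attr_tree(void *rel, int attno);
extern bool zs_btree_lookup(zs_btree *tree, zs_tid tid, Datum *datum, bool *isnull);

/*
 * Reconstruct one logical row: descend every column's B-tree with the
 * same TID and collect the datums into values[]/isnull[].
 */
static bool
zs_fetch_row(void *rel, zs_tid tid, int natts, Datum *values, bool *isnull)
{
    for (int attno = 0; attno < natts; attno++)
    {
        zs_btree *tree = zs_get_attr_tree(rel, attno);

        if (!zs_btree_lookup(tree, tid, &values[attno], &isnull[attno]))
            return false;               /* TID not present in this tree */
    }
    return true;
}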

This way of laying out the data also easily allows for a hybrid
row-column store, where some columns are stored together, and others

have a dedicated B-tree. We need user-facing syntax to allow
specifying how to group the columns.

Main reasons for storing data this way

--------------------------------------

* Lay out the data/tuples in a mapped fashion instead of keeping the

logical to physical mapping separate from the actual data. So, keep the
meta-data and data logically in a single stream in the file, avoiding the

need for separate forks/files to store meta-data and data.

* Stick to fixed-size physical blocks. Variable-size blocks pose the need

for increased logical to physical mapping maintenance, plus

restrictions on concurrency of writes and reads to files. Hence we
adopt compression to fit fixed-size blocks, instead of the other way
around.

MVCC

----

MVCC works very similarly to zheap in zedstore. Undo record pointers
are used to implement MVCC. Transaction information is not directly
stored with the data. In zheap, there's a small, fixed number of
"transaction slots" on each page, but zedstore has an undo pointer with
each item directly; in normal cases, the compression squeezes this
down to almost nothing.
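Conceptually, the per-item visibility check then reduces to following that
undo pointer, along these lines (a sketch only; the real undo record
contents and snapshot rules are not spelled out in this mail, and all of
the names below are invented):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t zs_undo_ptr;           /* location of an undo record */

typedef struct
{
    uint32_t    xid;                    /* transaction that created this item version */
    /* ... whatever else is needed to roll the change back ... */
} zs_undo_record;

/* hypothetical helpers */
extern zs_undo_record *zs_fetch_undo_record(zs_undo_ptr ptr);
extern bool xid_visible_to_snapshot(uint32_t xid, void *snapshot);

/*
 * Every item carries its own undo pointer, so visibility is a per-item
 * question: fetch the undo record it points to and check whether the
 * creating transaction is visible to this snapshot.
 */
static bool
zs_item_visible(zs_undo_ptr undo_ptr, void *snapshot)
{
    zs_undo_record *undo = zs_fetch_undo_record(undo_ptr);

    return xid_visible_to_snapshot(undo->xid, snapshot);
}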

Implementation
==============

Insert:

Inserting a new row splits the row into datums. Then, for the first column,
we decide which block to insert it into, pick a TID for it, and
write an undo record for it. The rest of the columns are inserted using
that same TID and point to the same undo position.
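In pseudo-C, that insert flow is roughly the following (illustrative names
only, not the actual functions in the patch):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t  zs_tid;
typedef uint64_t  zs_undo_ptr;
typedef uintptr_t Datum;

/* hypothetical helpers */
extern zs_tid zs_insert_first_attr(void *rel, Datum datum, bool isnull,
                                   zs_undo_ptr *undo_ptr);
extern void   zs_insert_attr(void *rel, int attno, zs_tid tid,
                             Datum datum, bool isnull, zs_undo_ptr undo_ptr);

static zs_tid
zs_insert_row(void *rel, int natts, Datum *values, bool *isnull)
{
    zs_undo_ptr undo_ptr;
    zs_tid      tid;

    /* the first column's tree picks the TID and writes the undo record */
    tid = zs_insert_first_attr(rel, values[0], isnull[0], &undo_ptr);

    /* remaining columns reuse the same TID and point at the same undo position */
    for (int attno = 1; attno < natts; attno++)
        zs_insert_attr(rel, attno, tid, values[attno], isnull[attno], undo_ptr);

    return tid;
}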

Compression:
Items are added to the B-tree in uncompressed form. If a page is full and a new
item can't be added, compression kicks in. The existing uncompressed items
(plain items) of the page are passed to the compressor for
compression. Already-compressed items are added back as is. The page is
rewritten with the compressed data, with the new item added to it. If even
after compression the item can't be added to the page, then a page split happens.

Toast:

When an overly large datum is stored, it is divided into chunks, and

each chunk is stored on a dedicated toast page within the same

physical file. The toast pages of a datum form a list; each page has a
next/prev pointer.

Select:

A property is added to the Table AM to convey whether column projection is
leveraged by the AM for scans. While scanning tables with an AM leveraging
this property, the executor parses the plan, and leverages the target list and
quals to find the columns required for the query. This list is passed down
to the AM on beginscan. Zedstore uses this column projection list to only
pull data from the selected columns. A virtual tuple table slot is used to
pass back the datums for the subset of columns.

The current table AM API requires enhancement here to pass down the column

projection to the AM. The patch showcases two different ways to do this:

* For sequential scans, added a new beginscan_with_column_projection()

API. The executor checks the AM property and, if it leverages column
projection, uses this new API, else the normal beginscan() API.

* For index scans, instead of modifying the begin scan API, added a new

API to specifically pass column projection list after calling begin

scan to populate the scan descriptor, but before fetching the tuples.
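To illustrate the shape of such an enhancement from the caller side, a
sketch could look like this; the exact function names and signatures are
precisely what is up for discussion, so treat all of them as placeholders:

#include <stdbool.h>
#include <stddef.h>

typedef struct ZsScanDesc ZsScanDesc;   /* opaque scan descriptor */

/* hypothetical AM entry points mirroring the two approaches in the patch */
extern ZsScanDesc *zs_beginscan(void *rel, void *snapshot);
extern ZsScanDesc *zs_beginscan_with_column_projection(void *rel, void *snapshot,
                                                       const bool *proj_cols,
                                                       int natts);
/* for index scans, the projection would instead be set after begin scan */
extern void zs_scan_set_column_projection(ZsScanDesc *scan,
                                          const bool *proj_cols, int natts);

/* executor side: choose the projecting variant only if the AM supports it */
static ZsScanDesc *
begin_scan_for_query(void *rel, void *snapshot,
                     const bool *proj_cols, int natts,
                     bool am_supports_projection)
{
    if (am_supports_projection)
        return zs_beginscan_with_column_projection(rel, snapshot, proj_cols, natts);

    return zs_beginscan(rel, snapshot);
}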

Index Support:
Building an index also leverages the columnar storage and only scans the columns
required to build the index. Indexes work pretty much the same as for heap
tables. Data is inserted into the table and the TID for the tuple gets
stored in the index. On index scans, the required column B-trees are scanned
for the given TID and the datums are passed back using a virtual tuple.

Page Format:

A ZedStore table contains different kinds of pages, all in the same
file. The kinds of pages are the meta-page, per-attribute btree internal and
leaf pages, UNDO log pages, and toast pages. Each page type has its own
distinct data storage format.

Block 0 is always a metapage. It contains the block numbers of the

other data structures stored within the file, like the per-attribute

B-trees, and the UNDO log.
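A sketch of what such a block-0 metapage could carry (purely illustrative
field names, not the patch's actual on-disk format):

#include <stdint.h>

typedef uint32_t BlockNo;               /* physical 8k block number */

/*
 * Hypothetical metapage contents for block 0: a directory mapping every
 * attribute to the root block of its B-tree, plus the head of the UNDO log.
 */
typedef struct
{
    uint32_t    magic;                  /* identifies the page as a zedstore metapage */
    uint32_t    nattributes;            /* number of per-attribute B-trees */
    BlockNo     undo_head;              /* first UNDO log page */
    BlockNo     attr_root[];            /* root block per attribute */
} zs_metapage;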

Enhancements to design:
=======================

Instead of compressing all the tuples on a page in one batch, we could

store a small "dictionary", e.g. in page header or meta-page, and use

it to compress each tuple separately. That could make random reads and

updates of individual tuples faster.

When adding a column, we just need to create a new B-tree for the newly added
column and link it to the meta-page. No existing content needs to be
rewritten.

When a column is dropped, we can scan the B-tree of that column, and

immediately mark all the pages as free in the FSM. But we don't

actually have to scan the leaf level: all leaf tuples have a downlink

in the parent, so we can scan just the internal pages. Unless the

column is very wide, that's only a small fraction of the data. That

makes the space immediately reusable for new insertions, but it won't

return the space to the Operating System. In order to do that, we'd

still need to defragment, moving pages from the end of the file closer

to the beginning, and truncate the file.

In this design, we only cache compressed pages in the page cache. If

we want to cache uncompressed pages instead, or in addition to that,

we need to invent a whole new kind of a buffer cache that can deal

with the variable-size blocks.

If you do a lot of updates, the file can get fragmented, with lots of

unused space on pages. Losing the correlation between TIDs and

physical order is also bad, because it will make SeqScans slow, as

they're not actually doing sequential I/O anymore. We can write a

defragmenter to fix things up. Half-empty pages can be merged, and

pages can be moved to restore TID/physical correlation. This format

doesn't have the same MVCC problems with moving tuples around that the

Postgres heap does, so it can fairly easily be done on-line.

Min-Max values can be stored per block to easily skip scanning if the
column values don't fall in the range.

Notes about current patch

=========================

Basic (core) functionality is implemented to showcase and play with.

Two compression algorithms are supported: Postgres pg_lzcompress and

lz4. Compiling the server with --with-lz4 enables LZ4 compression for
zedstore; otherwise pg_lzcompress is the default. LZ4 is definitely much faster
at compressing and decompressing.

Not all the table AM APIs are implemented. Functionality not

implemented yet will ERROR out as not supported. A zedstore table can
be created using the command:

CREATE TABLE <name> (column listing) USING zedstore;

Bulk load can be performed using COPY. INSERT, SELECT, UPDATE and

DELETE work. Btree indexes can be created. Btree and bitmap index
scans work. The test in src/test/regress/sql/zedstore.sql showcases all
the functionality currently working. Updates are currently implemented
as "cold", meaning they always create new items and are not performed in-place.

TIDs currently can't leverage the full 48-bit range but instead need

to be limited to values which are considered valid ItemPointers. Also,
MaxHeapTuplesPerPage poses restrictions on the values it can currently
have. Refer to [7] for details.

Extremely basic UNDO logging has been implemented, just from the MVCC
perspective. MVCC is missing tuple locks right now. Plus, it doesn't
actually perform any undo yet. No WAL logging exists currently, hence
it's not crash safe either.

Helpful functions are provided to find how many pages of each type are present in
a zedstore table and also to find the compression ratio.

For the test mentioned in the thread "Column lookup in a row performance" [6],

a good example query for zedstore, run locally on a laptop with lz4, shows:

postgres=# SELECT AVG(i199) FROM (select i199 from layout offset 0) x; -- heap
         avg
---------------------
 500000.500000000000
(1 row)

Time: 4679.026 ms (00:04.679)

postgres=# SELECT AVG(i199) FROM (select i199 from zlayout offset 0) x; -- zedstore
         avg
---------------------
 500000.500000000000
(1 row)

Time: 379.710 ms

Important note:

---------------

The planner has not been modified yet to leverage the columnar
storage. Hence, plans using the "physical tlist" optimization or similar
tricks that are good for a row store currently miss out on leveraging the
columnar nature. Hence the need for the subquery with OFFSET 0 above, to
disable that optimization and scan only the required column.

The current proposal and discussion is more focused on the AM layer work

first. Hence, it currently intentionally skips discussing planner or
executor "feature" enhancements like adding vectorized execution and
that family of features.

Previous discussions of and implementations for column stores (Vertical
cluster index [2], In-core columnar storage [3] and [4], cstore_fdw [5])
were referred to, to distill down the objectives and come up with a design and
implementation that avoids the concerns raised earlier. Learnings from the

Greenplum Database column store were also leveraged while designing and
implementing this.

Credit: The design is mostly the brainchild of Heikki, or actually his

epiphany to be exact. I acted as an idea bouncing board and contributed
enhancements to it. We are both having a lot of fun writing the
code for this.

References

[1] https://github.com/greenplum-db/postgres/tree/zedstore
[2] https://www.postgresql.org/message-id/flat/CAJrrPGfaC7WC9NK6PTTy6YN-NN%2BhCy8xOLAh2doYhVg5d6HsAA%40mail.gmail.com
[3] https://www.postgresql.org/message-id/flat/20150611230316.GM133018%40postgresql.org
[4] https://www.postgresql.org/message-id/flat/20150831225328.GM2912%40alvherre.pgsql
[5] https://github.com/citusdata/cstore_fdw
[6] https://www.postgresql.org/message-id/flat/CAOykqKfko-n5YiBJtk-ocVdp%2Bj92Apu5MJBwbGGh4awRY5NCuQ%40mail.gmail.com
[7] https://www.postgresql.org/message-id/d0fc97bd-7ec8-2388-e4a6-0fda86d71a43%40iki.fi



Attachment: v1-0001-Zedstore-compressed-in-core-columnar-storage.patch (text/x-patch, 213.7 KB)

Hi,

On 2019-04-08 17:27:05 -0700, Ashwin Agrawal wrote:

> Heikki and I have been hacking recently for few weeks to implement

> in-core columnar storage for PostgreSQL. Here's the design and initial

> implementation of Zedstore, compressed in-core columnar storage (table

> access method).

That's very cool.

> Motivations / Objectives

>

> * Performance improvement for queries selecting subset of columns

> (reduced IO).

> * Reduced on-disk footprint compared to heap table. Shorter tuple

> headers and also leveraging compression of similar type data

> * Be first-class citizen in the Postgres architecture (tables data can

> just independently live in columnar storage)

> * Fully MVCC compliant

> * All Indexes supported

> * Hybrid row-column store, where some columns are stored together, and

> others separately. Provide flexibility of granularity on how to

> divide the columns. Columns accessed together can be stored

> together.

> * Provide better control over bloat (similar to zheap)

> * Eliminate need for separate toast tables

> * Faster add / drop column or changing data type of column by avoiding

> full rewrite of the table.

Is storage going through the bufmgr.c or separately?

> In uncompressed form, the page can be arbitrarily large. But after

> compression, it must fit into a physical 8k block. If on insert or

> update of a tuple, the page cannot be compressed below 8k anymore, the

> page is split. Note that because TIDs are logical rather than physical

> identifiers, we can freely move tuples from one physical page to

> another during page split. A tuple's TID never changes.

When does compression happen? After each modification of the expanded

"page"? Are repeated expansions prevented somehow, e.g. if I

insert/delete rows into the same page one-by-one?

> A metapage at block 0, has links to the roots of the B-trees. Leaf

> pages look the same, but instead of storing the whole tuple, stores

> just a single attribute. To reconstruct a row with given TID, scan

> descends down the B-trees for all the columns using that TID, and

> fetches all attributes. Likewise, a sequential scan walks all the

> B-trees in lockstep.

Does the size of the metapage limit the number of column [groups]? Or is

there some overflow / tree of trees / whatnot happening?

> Insert:

> Inserting a new row, splits the row into datums. Then for first column

> decide which block to insert the same to, and pick a TID for it, and

> write undo record for the same. Rest of the columns are inserted using

> that same TID and point to same undo position.

Is there some buffering? Without that it seems like retail inserts are

going to be pretty slow?

> Property is added to Table AM to convey if column projection is

> leveraged by AM for scans. While scanning tables with AM leveraging

> this property, executor parses the plan. Leverages the target list and

> quals to find the required columns for query. This list is passed down

> to AM on beginscan. Zedstore uses this column projection list to only

> pull data from selected columns. Virtual tuple table slot is used to

> pass back the datums for subset of columns.

>

> Current table am API requires enhancement here to pass down column

> projection to AM. The patch showcases two different ways for the same.

>

> * For sequential scans added new beginscan_with_column_projection()

> API. Executor checks AM property and if it leverages column

> projection uses this new API else normal beginscan() API.

>

> * For index scans instead of modifying the begin scan API, added new

> API to specifically pass column projection list after calling begin

> scan to populate the scan descriptor but before fetching the tuples.

FWIW, I don't quite think this is the right approach. I've only a vague

sketch of this in my head, but I think we should want a general API to

pass that down to *any* scan. Even for heap, not deforming leading

columns that are uninteresting, but precede relevant columns, would be

quite a noticable performance win. I don't think the projection list is

the right approach for that.

> Extremely basic UNDO logging has be implemented just for MVCC

> perspective. MVCC is missing tuple lock right now. Plus, doesn't

> actually perform any undo yet. No WAL logging exist currently hence

> its not crash safe either.

Have you looked at the undo APIs developed for zheap, as discussed on

the list? Seems important that they're suitable for this too.

> Test mentioned in thread "Column lookup in a row performance" [6],

> good example query for zedstore locally on laptop using lz4 shows

>

> postgres=# SELECT AVG(i199) FROM (select i199 from layout offset 0) x; --

> heap

> avg

> ---------------------

> 500000.500000000000

> (1 row)

>

> Time: 4679.026 ms (00:04.679)

>

> postgres=# SELECT AVG(i199) FROM (select i199 from zlayout offset 0) x; --

> zedstore

> avg

> ---------------------

> 500000.500000000000

> (1 row)

>

> Time: 379.710 ms

Well, I'm not sure I'm actually impressed by that. What does the

performance look like if you select i0 instead?

> Important note:

> ---------------

> Planner has not been modified yet to leverage the columnar

> storage. Hence, plans using "physical tlist" optimization or such good

> for row store miss out to leverage the columnar nature

> currently. Hence, can see the need for subquery with OFFSET 0 above to

> disable the optimization and scan only required column.

I'm more and more thinking that we should just nix the physical tlist

stuff and start afresh.

Congrats again, this is cool stuff.

- Andres

On Mon, Apr 8, 2019 at 6:04 PM Andres Freund <andres(at)anarazel(dot)de> wrote:

> Hi,

>

> On 2019-04-08 17:27:05 -0700, Ashwin Agrawal wrote:

> > Heikki and I have been hacking recently for few weeks to implement

> > in-core columnar storage for PostgreSQL. Here's the design and initial

> > implementation of Zedstore, compressed in-core columnar storage (table

> > access method).

>

> That's very cool.

>

>

> > Motivations / Objectives

> >

> > * Performance improvement for queries selecting subset of columns

> > (reduced IO).

> > * Reduced on-disk footprint compared to heap table. Shorter tuple

> > headers and also leveraging compression of similar type data

> > * Be first-class citizen in the Postgres architecture (tables data can

> > just independently live in columnar storage)

> > * Fully MVCC compliant

> > * All Indexes supported

> > * Hybrid row-column store, where some columns are stored together, and

> > others separately. Provide flexibility of granularity on how to

> > divide the columns. Columns accessed together can be stored

> > together.

> > * Provide better control over bloat (similar to zheap)

> > * Eliminate need for separate toast tables

> > * Faster add / drop column or changing data type of column by avoiding

> > full rewrite of the table.

>

> Is storage going through the bufmgr.c or separately?

Yes, below the access method it's pretty much the same as heap. All reads and writes
flow via the buffer cache. The implementation sits nicely in between, just
modifying the access method code to change how data is stored in
pages; above and below the AM everything basically behaves similar to the heap code.

> > In uncompressed form, the page can be arbitrarily large. But after

> > compression, it must fit into a physical 8k block. If on insert or

> > update of a tuple, the page cannot be compressed below 8k anymore, the

> > page is split. Note that because TIDs are logical rather than physical

> > identifiers, we can freely move tuples from one physical page to

> > another during page split. A tuple's TID never changes.

>

> When does compression happen? After each modifcation of the expanded

>

"page"? Are repeated expansions prevented somehow, e.g. if I

> insert/delete rows into the same page one-by-one?

Compression is performed on the new data only when the page becomes full; till
then uncompressed data is added to the page. If even after compression
data cannot be added to the page, then a page split is performed. Already
compressed data is not compressed again on the next insert. A new compressed
block is created for the newly added uncompressed items.

The line of thought we have for delete is to not free the space as soon
as the delete is performed, but instead delay and reuse the deleted space on
the next insertion to the page.

> > A metapage at block 0, has links to the roots of the B-trees. Leaf

> > pages look the same, but instead of storing the whole tuple, stores

> > just a single attribute. To reconstruct a row with given TID, scan

> > descends down the B-trees for all the columns using that TID, and

> > fetches all attributes. Likewise, a sequential scan walks all the

> > B-trees in lockstep.

>

> Does the size of the metapage limit the number of column [groups]? Or is

> there some overflow / tree of trees / whatnot happening?

In the design it doesn't limit the number of columns, as we can have a chain of
meta-pages to store the required meta-data, with page 0 still being the start of the
chain.

> > Insert:

> > Inserting a new row, splits the row into datums. Then for first column

> > decide which block to insert the same to, and pick a TID for it, and

> > write undo record for the same. Rest of the columns are inserted using

> > that same TID and point to same undo position.

>

> Is there some buffering? Without that it seems like retail inserts are

> going to be pretty slow?

Yes, regular buffer cache.

> > Property is added to Table AM to convey if column projection is

> > leveraged by AM for scans. While scanning tables with AM leveraging

> > this property, executor parses the plan. Leverages the target list and

> > quals to find the required columns for query. This list is passed down

> > to AM on beginscan. Zedstore uses this column projection list to only

> > pull data from selected columns. Virtual tuple table slot is used to

> > pass back the datums for subset of columns.

> >

> > Current table am API requires enhancement here to pass down column

> > projection to AM. The patch showcases two different ways for the same.

> >

> > * For sequential scans added new beginscan_with_column_projection()

> > API. Executor checks AM property and if it leverages column

> > projection uses this new API else normal beginscan() API.

> >

> > * For index scans instead of modifying the begin scan API, added new

> > API to specifically pass column projection list after calling begin

> > scan to populate the scan descriptor but before fetching the tuples.

>

> FWIW, I don't quite think this is the right approach. I've only a vague

> sketch of this in my head, but I think we should want a general API to

> pass that down to *any* scan. Even for heap, not deforming leading

> columns that a uninteresting, but precede relevant columns, would be

> quite a noticable performance win. I don't think the projection list is

> the right approach for that.

Sure, would love to hear more on it and can enhance it to make it more
usable for AMs.

> > Extremely basic UNDO logging has be implemented just for MVCC

> > perspective. MVCC is missing tuple lock right now. Plus, doesn't

> > actually perform any undo yet. No WAL logging exist currently hence

> > its not crash safe either.

>

> Have you looked at the undo APIs developed for zheap, as discussed on

> the list? Seems important that they're suitable for this too.

Not in detail yet, but yes, the plan is to leverage the same common framework
and undo log API as zheap. Will look into the details. With the current
zedstore implementation the requirements from the undo are pretty clear.

> > Test mentioned in thread "Column lookup in a row performance" [6],

> > good example query for zedstore locally on laptop using lz4 shows

> >

> > postgres=# SELECT AVG(i199) FROM (select i199 from layout offset 0) x; --

> > heap

> > avg

> > ---------------------

> > 500000.500000000000

> > (1 row)

> >

> > Time: 4679.026 ms (00:04.679)

> >

> > postgres=# SELECT AVG(i199) FROM (select i199 from zlayout offset 0) x;

> --

> > zedstore

> > avg

> > ---------------------

> > 500000.500000000000

> > (1 row)

> >

> > Time: 379.710 ms

>

> Well, I'm not sure I'm actually impressed by that. What does the

> performance look like if you select i0 instead?

Just for a quick test I used 100 instead of 200 columns (with 200 the results
would diverge more); this is what it reports:

postgres=# SELECT AVG(i0) FROM (select i0 from layout offset 0) x; -- heap

avg

------------------------

1.00000000000000000000

(1 row)

Time: 183.865 ms

postgres=# SELECT AVG(i0) FROM (select i0 from zlayout offset 0) x; -- zedstore

avg

------------------------

1.00000000000000000000

(1 row)

Time: 47.624 ms



Hi,

On 09.04.2019 3:27, Ashwin Agrawal wrote:

> Heikki and I have been hacking recently for few weeks to implement

> in-core columnar storage for PostgreSQL. Here's the design and initial

> implementation of Zedstore, compressed in-core columnar storage (table

> access method). Attaching the patch and link to github branch [1] to

> follow along.

Thank you for publishing this patch. IMHO Postgres is really missing

normal support of columnar store and table access method

API is the best way of integrating it.

I wanted to compare memory footprint and performance of zedstore with

standard Postgres heap and my VOPS extension.

As test data I used the TPC-H benchmark (actually only the one lineitem table),
generated with the tpch-dbgen utility with scale factor 10 (~8Gb database).
I attached the script which I used to populate the data (you have to
download, build and run tpch-dbgen yourself; also you can comment out the code
related to VOPS).

Unfortunately I failed to load data in zedstore:

postgres=# insert into zedstore_lineitem_projection (select

l_shipdate,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag::"char",l_linestatus::"char"

from lineitem);

psql: ERROR: compression failed. what now?

Time: 237804.775 ms (03:57.805)

Then I tried to check if there is something in the
zedstore_lineitem_projection table:

postgres=# select count(*) from zedstore_lineitem_projection;

psql: WARNING: terminating connection because of crash of another

server process

DETAIL: The postmaster has commanded this server process to roll back

the current transaction and exit, because another server process exited

abnormally and possibly corrupted shared memory.

HINT: In a moment you should be able to reconnect to the database and

repeat your command.

psql: server closed the connection unexpectedly

This probably means the server terminated abnormally

before or while processing the request.

The connection to the server was lost. Attempting reset: Failed.

Time: 145710.828 ms (02:25.711)

The backend consumed 16GB of RAM and 16Gb of swap and was killed by the OOM

killer (undo log?)

A subsequent attempt to run the same command failed with the following
error:

postgres=# select count(*) from zedstore_lineitem_projection;
psql: ERROR: unexpected level encountered when descending tree

So the only thing I can do at this moment is report the size of the tables on
the disk:

postgres=# select pg_relation_size('lineitem');

pg_relation_size

------------------

10455441408

(1 row)

postgres=# select pg_relation_size('lineitem_projection');

pg_relation_size

------------------

3129974784

(1 row)

postgres=# select pg_relation_size('vops_lineitem_projection');

pg_relation_size

------------------

1535647744

(1 row)

postgres=# select pg_relation_size('zedstore_lineitem_projection');

pg_relation_size

------------------

2303688704

(1 row)

But I do not know how much data was actually loaded into the zedstore table...

Actually the main question is why this table is not empty if INSERT

statement failed?

Please let me know if I can somehow help you to reproduce and
investigate the problem.

--

Konstantin Knizhnik

Postgres Professional: http://www.postgrespro.com

The Russian Postgres Company

Attachment: vstore_bench.sql (application/sql, 1.9 KB)

On 09.04.2019 17:09, Konstantin Knizhnik wrote:

> Hi,

>

> On 09.04.2019 3:27, Ashwin Agrawal wrote:

>> Heikki and I have been hacking recently for few weeks to implement

>> in-core columnar storage for PostgreSQL. Here's the design and initial

>> implementation of Zedstore, compressed in-core columnar storage (table

>> access method). Attaching the patch and link to github branch [1] to

>> follow along.

>

> Thank you for publishing this patch. IMHO Postgres is really missing

> normal support of columnar store and table access method

> API is the best way of integrating it.

>

> I wanted to compare memory footprint and performance of zedstore with

> standard Postgres heap and my VOPS extension.

> As test data I used TPC-H benchmark (actually only one lineitem table

> generated with tpch-dbgen utility with scale factor 10 (~8Gb database).

> I attached script which I have use to populate data (you have to to

> download, build and run tpch-dbgen yourself, also you can comment code

> related with VOPS).

> Unfortunately I failed to load data in zedstore:

>

> postgres=# insert into zedstore_lineitem_projection (select

> l_shipdate,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag::"char",l_linestatus::"char"

> from lineitem);

> psql: ERROR: compression failed. what now?

> Time: 237804.775 ms (03:57.805)

>

>

> Then I try to check if there is something in

> zedstore_lineitem_projection table:

>

> postgres=# select count(*) from zedstore_lineitem_projection;

> psql: WARNING: terminating connection because of crash of another

> server process

> DETAIL: The postmaster has commanded this server process to roll back

> the current transaction and exit, because another server process

> exited abnormally and possibly corrupted shared memory.

> HINT: In a moment you should be able to reconnect to the database and

> repeat your command.

> psql: server closed the connection unexpectedly

> This probably means the server terminated abnormally

> before or while processing the request.

> The connection to the server was lost. Attempting reset: Failed.

> Time: 145710.828 ms (02:25.711)

>

>

> Backend consumes 16GB of RAM and 16Gb of swap and was killed by OOM

> killer (undo log?)

> Subsequent attempt to run the same command is failed with the

> following error:

>

> postgres=# select count(*) from zedstore_lineitem_projection;

> psql: ERROR: unexpected level encountered when descending tree

>

>

> So the only thing I can do at this moment is report size of tables on

> the disk:

>

> postgres=# select pg_relation_size('lineitem');

> pg_relation_size

> ------------------

> 10455441408

> (1 row)

>

>

> postgres=# select pg_relation_size('lineitem_projection');

> pg_relation_size

> ------------------

> 3129974784

> (1 row)

>

> postgres=# select pg_relation_size('vops_lineitem_projection');

> pg_relation_size

> ------------------

> 1535647744

> (1 row)

>

> postgres=# select pg_relation_size('zedstore_lineitem_projection');

> pg_relation_size

> ------------------

> 2303688704

> (1 row)

>

>

> But I do not know how much data was actually loaded in zedstore table...

> Actually the main question is why this table is not empty if INSERT

> statement was failed?

>

> Please let me know if I can somehow help you to reproduce and

> investigate the problem.

Looks like the original problem was caused by the internal postgres

compressor: I have not configured Postgres to use lz4.

When I configured Postgres --with-lz4, data was correctly inserted in

the zedstore table, but it looks like it is not compressed at all:

postgres=# select pg_relation_size('zedstore_lineitem_projection');

pg_relation_size

------------------

9363010640

No wonder that zedstore shows the worst results:

lineitem 6240.261 ms

lineitem_projection 5390.446 ms

zedstore_lineitem_projection 23310.341 ms

vops_lineitem_projection 439.731 ms

An updated version of vstore_bench.sql is attached (sorry, there were some
errors in the previous version of this script).

--

Konstantin Knizhnik

Postgres Professional: http://www.postgrespro.com

The Russian Postgres Company

Attachment: vstore_bench.sql (application/sql, 4.3 KB)

On 09/04/2019 18:00, Konstantin Knizhnik wrote:

> On 09.04.2019 17:09, Konstantin Knizhnik wrote:

>> standard Postgres heap and my VOPS extension.

>> As test data I used TPC-H benchmark (actually only one lineitem table

>> generated with tpch-dbgen utility with scale factor 10 (~8Gb database).

>> I attached script which I have use to populate data (you have to to

>> download, build and run tpch-dbgen yourself, also you can comment code

>> related with VOPS).

Cool, thanks!

>> Unfortunately I failed to load data in zedstore:

>>

>> postgres=# insert into zedstore_lineitem_projection (select

>> l_shipdate,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag::"char",l_linestatus::"char"

>> from lineitem);

>> psql: ERROR: compression failed. what now?

>> Time: 237804.775 ms (03:57.805)

Yeah, it's still early days, it will crash and burn in a lot of cases.

We wanted to publish this early, to gather ideas and comments on the

high level design, and to validate that the table AM API that's in v12

is usable.

> Looks like the original problem was caused by internal postgres

> compressor: I have not configured Postgres to use lz4.

> When I configured Postgres --with-lz4, data was correctly inserted in

> zedstore table, but looks it is not compressed at all:

>

> postgres=# select pg_relation_size('zedstore_lineitem_projection');

> pg_relation_size

> ------------------

> 9363010640

The single-insert codepath isn't very optimized yet. If you populate the

table with large "INSERT ... SELECT ...", you end up with a huge undo

log. Try loading it with COPY.

You can also see how many pages of each type there are with:

select count(*), pg_zs_page_type('zedstore_lineitem_projection', g)

from generate_series(0, pg_table_size('zedstore_lineitem_projection')

/ 8192 - 1) g group by 2;

- Heikki

On 09.04.2019 18:08, Heikki Linnakangas wrote:

> On 09/04/2019 18:00, Konstantin Knizhnik wrote:

>> On 09.04.2019 17:09, Konstantin Knizhnik wrote:

>>> standard Postgres heap and my VOPS extension.

>>> As test data I used TPC-H benchmark (actually only one lineitem table

>>> generated with tpch-dbgen utility with scale factor 10 (~8Gb database).

>>> I attached script which I have use to populate data (you have to to

>>> download, build and run tpch-dbgen yourself, also you can comment code

>>> related with VOPS).

>

> Cool, thanks!

>

>>> Unfortunately I failed to load data in zedstore:

>>>

>>> postgres=# insert into zedstore_lineitem_projection (select

>>> l_shipdate,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag::"char",l_linestatus::"char"

>>>

>>> from lineitem);

>>> psql: ERROR: compression failed. what now?

>>> Time: 237804.775 ms (03:57.805)

>

> Yeah, it's still early days, it will crash and burn in a lot of cases.

> We wanted to publish this early, to gather ideas and comments on the

> high level design, and to validate that the table AM API that's in v12

> is usable.

>

>> Looks like the original problem was caused by internal postgres

>> compressor: I have not configured Postgres to use lz4.

>> When I configured Postgres --with-lz4, data was correctly inserted in

>> zedstore table, but looks it is not compressed at all:

>>

>> postgres=# select pg_relation_size('zedstore_lineitem_projection');

>> pg_relation_size

>> ------------------

>> 9363010640

>

> The single-insert codepath isn't very optimized yet. If you populate

> the table with large "INSERT ... SELECT ...", you end up with a huge

> undo log. Try loading it with COPY.

>

> You can also see how many pages of each type there is with:

>

> select count(*), pg_zs_page_type('zedstore_lineitem_projection', g)

> from generate_series(0,

> pg_table_size('zedstore_lineitem_projection') / 8192 - 1) g group by 2;

>

> - Heikki

postgres=# copy zedstore_lineitem from '/mnt/data/lineitem.tbl'

delimiter '|' csv;

COPY 59986052

Time: 232802.257 ms (03:52.802)

postgres=# select pg_relation_size('zedstore_lineitem');

pg_relation_size

------------------

10346504192

(1 row)

postgres=# select count(*), pg_zs_page_type('zedstore_lineitem', g)

from generate_series(0, pg_table_size('zedstore_lineitem') / 8192 -

1) g group by 2;

count | pg_zs_page_type

---------+-----------------

1 | META

1262308 | BTREE

692 | UNDO

(3 rows)

And now performance is much worse:

Time: 99819.476 ms (01:39.819)

It is strange, because the main advantage of a columnar store is that it

has to fetch only accessed rows.

What I see is that in non-parallel mode (max_parallel_workers_per_gather

= 0)

backend consumes about 11GB of memory. It fits in my desktop RAM (16GB)

and speed is ~58 seconds.

But once I start 4 parallel workers, they cause huge swapping:

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
28195 knizhnik  20   0 11.823g 6.553g 5.072g D  7.6 42.2   0:17.19 postgres
28074 knizhnik  20   0 11.848g 6.726g 5.223g D  7.3 43.3   4:14.96 postgres
28192 knizhnik  20   0 11.854g 6.586g 5.075g D  7.3 42.4   0:17.18 postgres
28193 knizhnik  20   0 11.870g 6.594g 5.064g D  7.3 42.4   0:17.19 postgres
28194 knizhnik  20   0 11.854g 6.589g 5.078g D  7.3 42.4   0:17.09 postgres

which is also strange, because the data should be present in shared buffers.

Konstantin Knizhnik

Postgres Professional: http://www.postgrespro.com

The Russian Postgres Company

On 2019-Apr-09, Konstantin Knizhnik wrote:

> On 09.04.2019 3:27, Ashwin Agrawal wrote:

> > Heikki and I have been hacking recently for few weeks to implement

> > in-core columnar storage for PostgreSQL. Here's the design and initial

> > implementation of Zedstore, compressed in-core columnar storage (table

> > access method). Attaching the patch and link to github branch [1] to

> > follow along.

>

> Thank you for publishing this patch. IMHO Postgres is really missing normal

> support of columnar store

Yep.

> and table access method API is the best way of integrating it.

This is not surprising, considering that columnar store is precisely the

reason for starting the work on table AMs.

We should certainly look into integrating some sort of columnar storage

in mainline. Not sure which of zedstore or VOPS is the best candidate,

or maybe we'll have some other proposal. My feeling is that having more

than one is not useful; if there are optimizations to one that can be

borrowed from the other, let's do that instead of duplicating effort.

--

Álvaro Herrera https://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

On 09.04.2019 18:51, Alvaro Herrera wrote:

> On 2019-Apr-09, Konstantin Knizhnik wrote:

>

>> On 09.04.2019 3:27, Ashwin Agrawal wrote:

>>> Heikki and I have been hacking recently for few weeks to implement

>>> in-core columnar storage for PostgreSQL. Here's the design and initial

>>> implementation of Zedstore, compressed in-core columnar storage (table

>>> access method). Attaching the patch and link to github branch [1] to

>>> follow along.

>> Thank you for publishing this patch. IMHO Postgres is really missing normal

>> support of columnar store

> Yep.

>

>> and table access method API is the best way of integrating it.

> This is not surprising, considering that columnar store is precisely the

> reason for starting the work on table AMs.

>

> We should certainly look into integrating some sort of columnar storage

> in mainline. Not sure which of zedstore or VOPS is the best candidate,

> or maybe we'll have some other proposal. My feeling is that having more

> than one is not useful; if there are optimizations to one that can be

> borrowed from the other, let's do that instead of duplicating effort.

>

There are two different aspects:

1. Store format.

2. Vector execution.

1. VOPS is using a mixed format, something similar to Apache Parquet.

Tuples are stored vertically, but only inside one page.

It tries to minimize trade-offs between true horizontal and true

vertical storage:

first is most optimal for selecting all rows, while second - for

selecting small subset of rows.

To make this approach more efficient, it is better to use large page

size - the default Postgres 8k page is not enough.

From my point of view such a format is better than pure vertical storage

which will be very inefficient if query access larger number of columns.

This problem can be somehow addressed by creating projections: grouping

several columns together. But it requires more space for storing

multiple projections.

2. No matter which format we choose, to take all advantages of

vertical representation we need to use vector operations.

And Postgres executor doesn't support them now. This is why VOPS is

using some hacks, which is definitely not good and not working in all cases.

zedstore is not using such hacks and ... this is why it never can reach

VOPS performance.

The right solution is to add vector operation support to the Postgres

planner and executor.
But that is much harder than developing the columnar store itself.

--

Konstantin Knizhnik

Postgres Professional: http://www.postgrespro.com

The Russian Postgres Company

On 09/04/2019 18:00, Konstantin Knizhnik wrote:

> Looks like the original problem was caused by internal postgres

> compressor: I have not configured Postgres to use lz4.

> When I configured Postgres --with-lz4, data was correctly inserted in

> zedstore table, but looks it is not compressed at all:

>

> postgres=# select pg_relation_size('zedstore_lineitem_projection');

> pg_relation_size

> ------------------

> 9363010640

>

> No wonder that zedstore shows the worst results:

>

> lineitem 6240.261 ms

> lineitem_projection 5390.446 ms

> zedstore_lineitem_projection 23310.341 ms

> vops_lineitem_projection 439.731 ms

>

> Updated version of vstore_bench.sql is attached (sorry, there was some

> errors in previous version of this script). I tried this quickly, too. With default work_mem and no parallelism, and

1 gb table size, it seems that the query chooses a different plan with

heap and zedstore, with a sort+group for zedstore and hash agg for heap.

There's no ANALYZE support in zedstore yet, and we haven't given much

thought to parallelism either. With work_mem='1GB' and no parallelism,

both queries use a hash agg, and the numbers are much closer than what

you saw, about 6 s for heap, and 9 s for zedstore.

- Heikki

On 09.04.2019 19:19, Heikki Linnakangas wrote:

> On 09/04/2019 18:00, Konstantin Knizhnik wrote:

>> Looks like the original problem was caused by internal postgres

>> compressor: I have not configured Postgres to use lz4.

>> When I configured Postgres --with-lz4, data was correctly inserted in

>> zedstore table, but looks it is not compressed at all:

>>

>> postgres=# select pg_relation_size('zedstore_lineitem_projection');

>> pg_relation_size

>> ------------------

>> 9363010640

>>

>> No wonder that zedstore shows the worst results:

>>

>> lineitem 6240.261 ms

>> lineitem_projection 5390.446 ms

>> zedstore_lineitem_projection 23310.341 ms

>> vops_lineitem_projection 439.731 ms

>>

>> Updated version of vstore_bench.sql is attached (sorry, there was some

>> errors in previous version of this script).

>

> I tried this quickly, too. With default work_mem and no parallelism,

> and 1 gb table size, it seems that the query chooses a different plan

> with heap and zedstore, with a sort+group for zedstore and hash agg

> for heap. There's no ANALYZE support in zedstore yet, and we haven't

> given much thought to parallelism either. With work_mem='1GB' and no

> parallelism, both queries use a hash agg, and the numbers are much

> closer than what you saw, about 6 s for heap, and 9 s for zedstore.

>

> - Heikki

Yes, you were right. The plan for zedstore uses GroupAggregate instead

of HashAggregate.

Increasing work_mem force optimizer to use HashAggregate in all cases.

But it doesn't prevent memory overflow in my case.

And it is very strange to me, because there are just 4 groups in the

result, so it should not consume any memory.

Yet another strange thing is that the size of the zedstore table is 10Gb

according to pg_relation_size.

The Q1 query accesses only some subset of the "lineitem" columns, not

touching the largest ones (with text).

I have configured 12Gb shared buffers. And all this 11Gb are used! Looks

like all columns are fetched from the disk.

And looks like except this 11Gb of shard data, backend (and each

parallel worker) is also consuming several gigabytes of heap memory.

As a result total size of used memory during parallel query execution

with 4 workers exceeds 20GB and causes terrible swapping on my system.

--

Konstantin Knizhnik

Postgres Professional: http://www.postgrespro.com

The Russian Postgres Company

On Tue, Apr 9, 2019 at 11:51 AM Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote:

> This is not surprising, considering that columnar store is precisely the

> reason for starting the work on table AMs.

>

> We should certainly look into integrating some sort of columnar storage

> in mainline. Not sure which of zedstore or VOPS is the best candidate,

> or maybe we'll have some other proposal. My feeling is that having more

> than one is not useful; if there are optimizations to one that can be

> borrowed from the other, let's do that instead of duplicating effort.

I think that conclusion may be premature. There seem to be a bunch of

different ways of doing columnar storage, so I don't know how we can

be sure that one size will fit all, or that the first thing we accept

will be the best thing.

Of course, we probably do not want to accept a ton of storage manager

implementations in core. I think if people propose implementations

that are poor quality, or missing important features, or don't have

significantly different use cases from the ones we've already got,

it's reasonable to reject those. But I wouldn't be prepared to say

that if we have two significantly different column store that are both

awesome code with a complete feature set and significantly disjoint

use cases, we should reject the second one just because it is also a

column store. I think that won't get out of control because few

people will be able to produce really high-quality implementations.

This stuff is hard, which I think is also why we only have 6.5 index

AMs in core after many, many years. And our standards have gone up

over the years - not all of those would pass muster if they were

proposed today.

BTW, can I express a small measure of disappointment that the name for

the thing under discussion on this thread chose to be called

"zedstore"? That seems to invite confusion with "zheap", especially

in parts of the world where the last letter of the alphabet is

pronounced "zed," where people are going to say zed-heap and

zed-store. Brr.

--

Robert Haas

EnterpriseDB: http://www.enterprisedb.com

The Enterprise PostgreSQL Company

On Tue, Apr 9, 2019 at 11:29 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Tue, Apr 9, 2019 at 11:51 AM Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>

> wrote:

> > This is not surprising, considering that columnar store is precisely the

> > reason for starting the work on table AMs.

> >

> > We should certainly look into integrating some sort of columnar storage

> > in mainline. Not sure which of zedstore or VOPS is the best candidate,

> > or maybe we'll have some other proposal. My feeling is that having more

> > than one is not useful; if there are optimizations to one that can be

> > borrowed from the other, let's do that instead of duplicating effort.

>

> I think that conclusion may be premature. There seem to be a bunch of

> different ways of doing columnar storage, so I don't know how we can

> be sure that one size will fit all, or that the first thing we accept

> will be the best thing.

>

> Of course, we probably do not want to accept a ton of storage manager

> implementations is core. I think if people propose implementations

> that are poor quality, or missing important features, or don't have

> significantly different use cases from the ones we've already got,

> it's reasonable to reject those. But I wouldn't be prepared to say

> that if we have two significantly different column store that are both

> awesome code with a complete feature set and significantly disjoint

> use cases, we should reject the second one just because it is also a

> column store. I think that won't get out of control because few

> people will be able to produce really high-quality implementations.

>

> This stuff is hard, which I think is also why we only have 6.5 index

> AMs in core after many, many years. And our standards have gone up

> over the years - not all of those would pass muster if they were

> proposed today.

+1

> BTW, can I express a small measure of disappointment that the name for

> the thing under discussion on this thread chose to be called

> "zedstore"? That seems to invite confusion with "zheap", especially

> in parts of the world where the last letter of the alphabet is

> pronounced "zed," where people are going to say zed-heap and

> zed-store. Brr.

Surprised it's felt this thread would initiate the invitation to confusion.
Based on past internal and meetup discussions for quite some time now,
the confusion already exists for the zheap pronunciation because of the reason
mentioned, as the last letter is not pronounced universally the same. Hence we
explicitly called it zedstore, to learn from that and make the pronunciation
worldwide universal for the new thing at least.



On Tue, Apr 9, 2019 at 9:13 AM Konstantin Knizhnik <

k(dot)knizhnik(at)postgrespro(dot)ru> wrote: >

>

> On 09.04.2019 18:51, Alvaro Herrera wrote:

> > On 2019-Apr-09, Konstantin Knizhnik wrote:

> >

> >> On 09.04.2019 3:27, Ashwin Agrawal wrote:

> >>> Heikki and I have been hacking recently for few weeks to implement

> >>> in-core columnar storage for PostgreSQL. Here's the design and initial

> >>> implementation of Zedstore, compressed in-core columnar storage (table

> >>> access method). Attaching the patch and link to github branch [1] to

> >>> follow along.

> >> Thank you for publishing this patch. IMHO Postgres is really missing

> normal

> >> support of columnar store

> > Yep.

> >

> >> and table access method API is the best way of integrating it.

> > This is not surprising, considering that columnar store is precisely the

> > reason for starting the work on table AMs.

> >

> > We should certainly look into integrating some sort of columnar storage

> > in mainline. Not sure which of zedstore or VOPS is the best candidate,

> > or maybe we'll have some other proposal. My feeling is that having more

> > than one is not useful; if there are optimizations to one that can be

> > borrowed from the other, let's do that instead of duplicating effort.

> >

> There are two different aspects:

> 1. Store format.

> 2. Vector execution.

>

> 1. VOPS is using mixed format, something similar with Apache parquet.

> Tuples are stored vertically, but only inside one page.

> It tries to minimize trade-offs between true horizontal and true

> vertical storage:

> first is most optimal for selecting all rows, while second - for

> selecting small subset of rows.

> To make this approach more efficient, it is better to use large page

> size - default Postgres 8k pages is not enough.

>

> From my point of view such format is better than pure vertical storage

> which will be very inefficient if query access larger number of columns.

> This problem can be somehow addressed by creating projections: grouping

> several columns together. But it requires more space for storing

> multiple projections.

Right, storing all the columns in a single page doesn't give any savings on
IO.

> 2. Doesn't matter which format we choose, to take all advantages of

> vertical representation we need to use vector operations.

> And Postgres executor doesn't support them now. This is why VOPS is

> using some hacks, which is definitely not good and not working in all

> cases.

> zedstore is not using such hacks and ... this is why it never can reach

> VOPS performance.

Vectorized execution is orthogonal to the storage format. It can even be
applied to a row store with performance gained. Similarly, a column store without
vectorized execution also gives performance gains, better compression ratios
and such benefits. A column store clubbed with vectorized execution makes it
a lot more performant, agreed. Zedstore currently is focused on having the AM piece
in place, which fits the postgres ecosystem and supports all the features
heap does.



On 09/04/2019 23:24, Ashwin Agrawal wrote:

> BTW, can I express a small measure of disappointment that the name for

> the thing under discussion on this thread chose to be called

> "zedstore"? That seems to invite confusion with "zheap", especially

> in parts of the world where the last letter of the alphabet is

> pronounced "zed," where people are going to say zed-heap and

> zed-store. Brr.

>

> Surprised its felt this thread would initiate the invitation to

> confusion. Based on past internal and meetup discussions for few quite

> sometime now, the confusion already exists for zheap pronunciation

> because of the reason mentioned, as last letter is not pronounced

> universally same. Hence we explicitly called it zedstore to learn from

> and make the pronunciation world wide universal for new thing atleast.

Yeah, you can blame me for the name. It's a pun on zheap. I'm hoping we

come up with a better name before this matures; I'm thinking it could be

just "column store" or something like that in the end, but it's good to

have a more unique name during development.

- Heikki

C-Tree?

Peter Geoghegan

(Sent from my phone)



On Tue, Apr 9, 2019 at 5:57 AM Ashwin Agrawal <aagrawal(at)pivotal(dot)io> wrote:

>

> Heikki and I have been hacking recently for few weeks to implement

> in-core columnar storage for PostgreSQL. Here's the design and initial

> implementation of Zedstore, compressed in-core columnar storage (table

> access method). Attaching the patch and link to github branch [1] to

> follow along.

>

> The objective is to gather feedback on design and approach to the

> same. The implementation has core basic pieces working but not close

> to complete.

>

> Big thank you to Andres, Haribabu and team for the table access method

> API's. Leveraged the API's for implementing zedstore, and proves API

> to be in very good shape. Had to enhance the same minimally but

> in-general didn't had to touch executor much.

>

> Motivations / Objectives

>

> * Performance improvement for queries selecting subset of columns

> (reduced IO).

> * Reduced on-disk footprint compared to heap table. Shorter tuple

> headers and also leveraging compression of similar type data

> * Be first-class citizen in the Postgres architecture (tables data can

> just independently live in columnar storage)

> * Fully MVCC compliant

> * All Indexes supported

> * Hybrid row-column store, where some columns are stored together, and

> others separately. Provide flexibility of granularity on how to

> divide the columns. Columns accessed together can be stored

> together.

> * Provide better control over bloat (similar to zheap)

> * Eliminate need for separate toast tables

> * Faster add / drop column or changing data type of column by avoiding

> full rewrite of the table.

>

> High-level Design - B-trees for the win!

> ========================================

>

> To start simple, let's ignore column store aspect for a moment and

> consider it as compressed row store. The column store is natural

> extension of this concept, explained in next section.

>

> The basic on-disk data structure leveraged is a B-tree, indexed by

> TID. BTree being a great data structure, fast and versatile. Note this

> is not referring to existing Btree indexes, but instead net new

> separate BTree for table data storage.

>

> TID - logical row identifier:

> TID is just a 48-bit row identifier. The traditional division into

> block and offset numbers is meaningless. In order to find a tuple with

> a given TID, one must always descend the B-tree. Having logical TID

> provides flexibility to move the tuples around different pages on page

> splits or page merges can be performed.

>

> The internal pages of the B-tree are super simple and boring. Each

> internal page just stores an array of TID and downlink pairs. Let's

> focus on the leaf level. Leaf blocks have short uncompressed header,

> followed by btree items. Two kinds of items exist:

>

> - plain item, holds one tuple or one datum, uncompressed payload

> - a "container item", holds multiple plain items, compressed payload

>

> +-----------------------------

> | Fixed-size page header:

> |

> | LSN

> | TID low and hi key (for Lehman & Yao B-tree operations)

> | left and right page pointers

> |

> | Items:

> |

> | TID | size | flags | uncompressed size | lastTID | payload (container item)

> | TID | size | flags | uncompressed size | lastTID | payload (container item)

> | TID | size | flags | undo pointer | payload (plain item)

> | TID | size | flags | undo pointer | payload (plain item)

> | ...

> |

> +----------------------------

>

> Row store

> ---------

>

> The tuples are stored one after another, sorted by TID. For each

> tuple, we store its 48-bit TID, a undo record pointer, and the actual

> tuple data uncompressed.

Storing undo record pointer with each tuple can take quite a lot of

space in cases where you can't compress them. Have you thought how

will you implement the multi-locker scheme in this design? In zheap,

we have used undo for the same and it is easy to imagine when you have

separate transaction slots for each transaction. I am not sure how

will you implement the same here.

--

With Regards,

Amit Kapila.

EnterpriseDB: http://www.enterprisedb.com

On 10/04/2019 09:29, Amit Kapila wrote:

> On Tue, Apr 9, 2019 at 5:57 AM Ashwin Agrawal <aagrawal(at)pivotal(dot)io> wrote:

>> Row store

>> ---------

>>

>> The tuples are stored one after another, sorted by TID. For each

>> tuple, we store its 48-bit TID, a undo record pointer, and the actual

>> tuple data uncompressed.

>>

>

> Storing undo record pointer with each tuple can take quite a lot of

> space in cases where you can't compress them.

Yeah. This does depend on compression to eliminate the unused fields
quite heavily at the moment. But you could have a flag in the header to
indicate "no undo pointer needed", and just leave it out when it's not needed.

> Have you thought how will you implement the multi-locker scheme in

> this design? In zheap, we have used undo for the same and it is easy

> to imagine when you have separate transaction slots for each

> transaction. I am not sure how will you implement the same here.

I've been thinking that the undo record would store all the XIDs

involved. So if there are multiple lockers, the UNDO record would store

a list of XIDs. Alternatively, I suppose you could store multiple UNDO

pointers for the same tuple. - Heikki
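
To make the multi-locker idea above concrete, here is a minimal sketch of an
undo record that keeps every locker XID in one place. The struct and field
names are invented for illustration and are not taken from the zedstore patch.

    /* Hypothetical undo record for tuple locks, listing all locker XIDs.
     * Needs postgres.h for TransactionId and FLEXIBLE_ARRAY_MEMBER.
     * Illustration only. */
    typedef struct ZSUndoTupleLockersSketch
    {
        uint16        nlockers;     /* number of entries in lockers[] */
        uint16        lockmode;     /* strongest tuple lock mode taken */
        TransactionId lockers[FLEXIBLE_ARRAY_MEMBER];  /* XIDs holding a lock */
    } ZSUndoTupleLockersSketch;

Whether one record lists all lockers, or each locker gets its own record, is
exactly the trade-off discussed in the follow-ups below.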

On 10.04.2019 10:25, Heikki Linnakangas wrote:

> On 10/04/2019 09:29, Amit Kapila wrote:

>> On Tue, Apr 9, 2019 at 5:57 AM Ashwin Agrawal <aagrawal(at)pivotal(dot)io>

>> wrote:

>>> Row store

>>> ---------

>>>

>>> The tuples are stored one after another, sorted by TID. For each

>>> tuple, we store its 48-bit TID, a undo record pointer, and the actual

>>> tuple data uncompressed.

>>>

>>

>> Storing undo record pointer with each tuple can take quite a lot of

>> space in cases where you can't compress them.

>

> Yeah. This does depend on compression to eliminate the unused fields

> quite heavily at the moment. But you could have a flag in the header

> to indicate "no undo pointer needed", and just leave it out, when it's

> needed.

>

>> Have you thought how will you implement the multi-locker scheme in

>> this design? In zheap, we have used undo for the same and it is easy

>> to imagine when you have separate transaction slots for each

>> transaction. I am not sure how will you implement the same here.

> I've been thinking that the undo record would store all the XIDs

> involved. So if there are multiple lockers, the UNDO record would

> store a list of XIDs. Alternatively, I suppose you could store

> multiple UNDO pointers for the same tuple.

>

> - Heikki

>

I am also a little bit confused about UNDO records and MVCC support in
Zedstore. Actually, a columnar store is mostly needed for analytics on
read-only or append-only data. One of the disadvantages of Postgres is
the rather large per-record space overhead caused by MVCC.
It may be critical if you want to store huge timeseries with a relatively
small number of columns (like measurements of some sensor).
It will be nice to have a storage format which reduces this overhead when
it is not needed (data is not updated).

Right now, even without UNDO pages, the size of zedstore is larger than the
size of the original Postgres table.
It seems to be very strange.

--

Konstantin Knizhnik

Postgres Professional: http://www.postgrespro.com

The Russian Postgres Company

On 10/04/2019 10:38, Konstantin Knizhnik wrote:

> I also a little bit confused about UNDO records and MVCC support in

> Zedstore. Actually columnar store is mostly needed for analytic for

> read-only or append-only data. One of the disadvantages of Postgres is

> quite larger per-record space overhead caused by MVCC.

> It may be critical if you want to store huge timeseries with relatively

> small number of columns (like measurements of some sensor).

> It will be nice to have storage format which reduce this overhead when

> it is not needed (data is not updated).

Sure. Definitely something we need to optimize.

> Right now, even without UNDO pages, size of zedstore is larger than size
> of original Postgres table.
> It seems to be very strange.

If you have a table with a lot of columns, but each column is small,

e.g. lots of boolean columns, the item headers that zedstore currently

stores for each datum take up a lot of space. We'll need to squeeze

those harder to make this competitive. Instead of storing a header for

each datum, if a group of consecutive tuples have the same visibility

information, we could store the header just once, with an array of the

datums, for example. - Heikki
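
As a rough illustration of the idea above, a header shared by a run of
consecutive datums with identical visibility could look something like this;
the struct and its fields are invented for illustration, not the patch's
actual item format.

    /* Hypothetical "array" leaf item: one header amortized over a run of
     * consecutive datums that share the same visibility information. */
    typedef struct ZSArrayItemSketch
    {
        uint64      firsttid;    /* TID of the first datum in the run */
        uint16      ndatums;     /* number of consecutive datums that follow */
        uint16      flags;
        uint64      undoptr;     /* one undo pointer for the whole run */
        char        data[FLEXIBLE_ARRAY_MEMBER];  /* ndatums packed datums */
    } ZSArrayItemSketch;

With a roughly 20-byte header like this shared by, say, 100 boolean datums,
the per-datum header overhead drops from tens of bytes to a fraction of a
byte.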

On Wed, Apr 10, 2019 at 12:55 PM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:

>

> On 10/04/2019 09:29, Amit Kapila wrote:

> > On Tue, Apr 9, 2019 at 5:57 AM Ashwin Agrawal <aagrawal(at)pivotal(dot)io> wrote:

> >> Row store

> >> ---------

> >>

> >> The tuples are stored one after another, sorted by TID. For each

> >> tuple, we store its 48-bit TID, a undo record pointer, and the actual

> >> tuple data uncompressed.

> >>

> >

> > Storing undo record pointer with each tuple can take quite a lot of

> > space in cases where you can't compress them.

>

> Yeah. This does depend on compression to eliminate the unused fields

> quite heavily at the moment. But you could have a flag in the header to

> indicate "no undo pointer needed", and just leave it out, when it's needed.

>

> > Have you thought how will you implement the multi-locker scheme in

> > this design? In zheap, we have used undo for the same and it is easy

> > to imagine when you have separate transaction slots for each

> > transaction. I am not sure how will you implement the same here.

> I've been thinking that the undo record would store all the XIDs

> involved. So if there are multiple lockers, the UNDO record would store

> a list of XIDs.

This will be quite tricky. Whenever a new locker arrives, you first

need to fetch previous undo to see which all XIDs already have a lock

on it. Not only that, it will make discarding undo's way complicated.

We have considered this approach before implementing the current

approach in zheap.

> Alternatively, I suppose you could store multiple UNDO

> pointers for the same tuple.

This will not only make the tuple unnecessarily long but

would make it much harder to reclaim that space once the corresponding

undo is discarded.

--

With Regards,

Amit Kapila.

EnterpriseDB: http://www.enterprisedb.com

On 9/04/19 12:27 PM, Ashwin Agrawal wrote:

> Heikki and I have been hacking recently for few weeks to implement

> in-core columnar storage for PostgreSQL. Here's the design and initial

> implementation of Zedstore, compressed in-core columnar storage (table

> access method). Attaching the patch and link to github branch [1] to

> follow along.

>

Very nice. I realize that it is very early days, but applying this patch
I've managed to stumble over some compression bugs doing some COPYs:

benchz=# COPY dim1 FROM '/data0/dump/dim1.dat'

USING DELIMITERS ',';

psql: ERROR: compression failed. what now?

CONTEXT: COPY dim1, line 458

The log has:

2019-04-11 15:48:43.976 NZST [2006] ERROR: XX000: compression failed.

what now?

2019-04-11 15:48:43.976 NZST [2006] CONTEXT: COPY dim1, line 458

2019-04-11 15:48:43.976 NZST [2006] LOCATION: zs_compress_finish,

zedstore_compression.c:287

2019-04-11 15:48:43.976 NZST [2006] STATEMENT: COPY dim1 FROM

'/data0/dump/dim1.dat'

USING DELIMITERS ',';

The dataset is generated from an old DW benchmark I wrote

(https://sourceforge.net/projects/benchw/). The row concerned looks like:

457,457th interesting measure,1th measure

type,aqwycdevcmybxcnpwqgrdsmfelaxfpbhfxghamfezdiwfvneltvqlivstwralshsppcpchvdkdbraoxnkvexdbpyzgamajfp

458,458th interesting measure,2th measure

type,bjgdsciehjvkxvxjqbhtdwtcftpfewxfhfkzjsdrdabbvymlctghsblxucezydghjrgsjjjnmmqhncvpwbwodhnzmtakxhsg I'll see if changing to LZ4 makes any different. best wishes Mark



On 11/04/19 4:01 PM, Mark Kirkwood wrote:

> On 9/04/19 12:27 PM, Ashwin Agrawal wrote:

>

>> Heikki and I have been hacking recently for few weeks to implement

>> in-core columnar storage for PostgreSQL. Here's the design and initial

>> implementation of Zedstore, compressed in-core columnar storage (table

>> access method). Attaching the patch and link to github branch [1] to

>> follow along.

>>

>>

>

> Very nice. I realize that it is very early days, but applying this

> patch I've managed to stumble over some compression bugs doing some

> COPY's:

>

> benchz=# COPY dim1 FROM '/data0/dump/dim1.dat'

> USING DELIMITERS ',';

> psql: ERROR: compression failed. what now?

> CONTEXT: COPY dim1, line 458

>

> The log has:

>

> 2019-04-11 15:48:43.976 NZST [2006] ERROR: XX000: compression failed.

> what now?

> 2019-04-11 15:48:43.976 NZST [2006] CONTEXT: COPY dim1, line 458

> 2019-04-11 15:48:43.976 NZST [2006] LOCATION: zs_compress_finish,

> zedstore_compression.c:287

> 2019-04-11 15:48:43.976 NZST [2006] STATEMENT: COPY dim1 FROM

> '/data0/dump/dim1.dat'

> USING DELIMITERS ',';

>

> The dataset is generated from and old DW benchmark I wrote

> (https://sourceforge.net/projects/benchw/). The row concerned looks like:

>

> 457,457th interesting measure,1th measure

> type,aqwycdevcmybxcnpwqgrdsmfelaxfpbhfxghamfezdiwfvneltvqlivstwralshsppcpchvdkdbraoxnkvexdbpyzgamajfp

> 458,458th interesting measure,2th measure

> type,bjgdsciehjvkxvxjqbhtdwtcftpfewxfhfkzjsdrdabbvymlctghsblxucezydghjrgsjjjnmmqhncvpwbwodhnzmtakxhsg

>

>

> I'll see if changing to LZ4 makes any different.

>

The COPY works with LZ4 configured.



> On Apr 10, 2019, at 9:08 PM, Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz> wrote:

>

>

>> On 11/04/19 4:01 PM, Mark Kirkwood wrote:

>>> On 9/04/19 12:27 PM, Ashwin Agrawal wrote:

>>>

>>> Heikki and I have been hacking recently for few weeks to implement

>>> in-core columnar storage for PostgreSQL. Here's the design and initial

>>> implementation of Zedstore, compressed in-core columnar storage (table

>>> access method). Attaching the patch and link to github branch [1] to

>>> follow along.

>>>

>>>

>>

>> Very nice. I realize that it is very early days, but applying this patch I've managed to stumble over some compression bugs doing some COPY's:

>>

>> benchz=# COPY dim1 FROM '/data0/dump/dim1.dat'

>> USING DELIMITERS ',';

>> psql: ERROR: compression failed. what now?

>> CONTEXT: COPY dim1, line 458

>>

>> The log has:

>>

>> 2019-04-11 15:48:43.976 NZST [2006] ERROR: XX000: compression failed. what now?

>> 2019-04-11 15:48:43.976 NZST [2006] CONTEXT: COPY dim1, line 458

>> 2019-04-11 15:48:43.976 NZST [2006] LOCATION: zs_compress_finish, zedstore_compression.c:287

>> 2019-04-11 15:48:43.976 NZST [2006] STATEMENT: COPY dim1 FROM '/data0/dump/dim1.dat'

>> USING DELIMITERS ',';

>>

>> The dataset is generated from and old DW benchmark I wrote (https://urldefense.proofpoint.com/v2/url?u=https-3A__sourceforge.net_projects_benchw_&d=DwIDaQ&c=lnl9vOaLMzsy2niBC8-h_K-7QJuNJEsFrzdndhuJ3Sw&r=gxIaqms7ncm0pvqXLI_xjkgwSStxAET2rnZQpzba2KM&m=BgmTkDoY6SKOgODe8v6fpH4hs-wM0H91cLfrAfEL6C0&s=lLcXp_8h2bRb_OR4FT8kxD-FG9MaLBPU7M5aV9nQ7JY&e=). The row concerned looks like:

>>

>> 457,457th interesting measure,1th measure type,aqwycdevcmybxcnpwqgrdsmfelaxfpbhfxghamfezdiwfvneltvqlivstwralshsppcpchvdkdbraoxnkvexdbpyzgamajfp

>> 458,458th interesting measure,2th measure type,bjgdsciehjvkxvxjqbhtdwtcftpfewxfhfkzjsdrdabbvymlctghsblxucezydghjrgsjjjnmmqhncvpwbwodhnzmtakxhsg

>>

>>

>> I'll see if changing to LZ4 makes any different.

>>

>>

>

> The COPY works with LZ4 configured.

Thank you for trying it out. Yes, noticed that for certain patterns
pg_lzcompress() actually requires a much larger output buffer. For example,
for one 86-byte source it required a 2296-byte output buffer. The current
zedstore code doesn't handle this case and errors out. LZ4 works fine for the
same patterns, so would highly recommend using LZ4 only, as its speed is very
fast as well.
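
For readers following along, a minimal sketch of the defensive pattern being
discussed: give pglz a worst-case-sized output buffer and fall back to storing
the data uncompressed when compression does not shrink it. This is not the
patch's actual zedstore compression code; pglz_compress(),
PGLZ_strategy_default and PGLZ_MAX_OUTPUT() are existing PostgreSQL APIs, the
wrapper itself is hypothetical.

    #include "postgres.h"
    #include "common/pg_lzcompress.h"

    /*
     * Caller must allocate 'dst' with at least PGLZ_MAX_OUTPUT(srclen) bytes.
     * Returns the number of bytes stored in dst; *compressed says which form.
     */
    static int32
    compress_or_copy(const char *src, int32 srclen, char *dst, bool *compressed)
    {
        /* pglz_compress() returns -1 when it cannot compress within bounds */
        int32       clen = pglz_compress(src, srclen, dst, PGLZ_strategy_default);

        if (clen >= 0 && clen < srclen)
        {
            *compressed = true;
            return clen;
        }

        /* no savings: store the original bytes as-is */
        memcpy(dst, src, srclen);
        *compressed = false;
        return srclen;
    }

The same fallback works for LZ4: if the compressed size is not smaller than
the input, store the plain item and mark it uncompressed, as suggested further
down in the thread.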

On 11.04.2019 8:03, Ashwin Agrawal wrote:

>> On Apr 10, 2019, at 9:08 PM, Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz> wrote:

>>

>>

>>> On 11/04/19 4:01 PM, Mark Kirkwood wrote:

>>>> On 9/04/19 12:27 PM, Ashwin Agrawal wrote:

>>>>

>>>> Heikki and I have been hacking recently for few weeks to implement

>>>> in-core columnar storage for PostgreSQL. Here's the design and initial

>>>> implementation of Zedstore, compressed in-core columnar storage (table

>>>> access method). Attaching the patch and link to github branch [1] to

>>>> follow along.

>>>>

>>>>

>>> Very nice. I realize that it is very early days, but applying this patch I've managed to stumble over some compression bugs doing some COPY's:

>>>

>>> benchz=# COPY dim1 FROM '/data0/dump/dim1.dat'

>>> USING DELIMITERS ',';

>>> psql: ERROR: compression failed. what now?

>>> CONTEXT: COPY dim1, line 458

>>>

>>> The log has:

>>>

>>> 2019-04-11 15:48:43.976 NZST [2006] ERROR: XX000: compression failed. what now?

>>> 2019-04-11 15:48:43.976 NZST [2006] CONTEXT: COPY dim1, line 458

>>> 2019-04-11 15:48:43.976 NZST [2006] LOCATION: zs_compress_finish, zedstore_compression.c:287

>>> 2019-04-11 15:48:43.976 NZST [2006] STATEMENT: COPY dim1 FROM '/data0/dump/dim1.dat'

>>> USING DELIMITERS ',';

>>>

>>> The dataset is generated from and old DW benchmark I wrote (https://urldefense.proofpoint.com/v2/url?u=https-3A__sourceforge.net_projects_benchw_&d=DwIDaQ&c=lnl9vOaLMzsy2niBC8-h_K-7QJuNJEsFrzdndhuJ3Sw&r=gxIaqms7ncm0pvqXLI_xjkgwSStxAET2rnZQpzba2KM&m=BgmTkDoY6SKOgODe8v6fpH4hs-wM0H91cLfrAfEL6C0&s=lLcXp_8h2bRb_OR4FT8kxD-FG9MaLBPU7M5aV9nQ7JY&e=). The row concerned looks like:

>>>

>>> 457,457th interesting measure,1th measure type,aqwycdevcmybxcnpwqgrdsmfelaxfpbhfxghamfezdiwfvneltvqlivstwralshsppcpchvdkdbraoxnkvexdbpyzgamajfp

>>> 458,458th interesting measure,2th measure type,bjgdsciehjvkxvxjqbhtdwtcftpfewxfhfkzjsdrdabbvymlctghsblxucezydghjrgsjjjnmmqhncvpwbwodhnzmtakxhsg

>>>

>>>

>>> I'll see if changing to LZ4 makes any different.

>>>

>>>

>> The COPY works with LZ4 configured.

> Thank you for trying it out. Yes, noticed for certain patterns pg_lzcompress() actually requires much larger output buffers. Like for one 86 len source it required 2296 len output buffer. Current zedstore code doesn’t handle this case and errors out. LZ4 for same patterns works fine, would highly recommend using LZ4 only, as anyways speed is very fast as well with it.

The internal Postgres lz compressor is really very inefficient compared
with other compression algorithms.
But in any case you should never assume that the size of compressed data
will be smaller than the size of the plain data.
Moreover, if you are trying to compress already compressed data, the
result will almost always be larger.
If the size of the compressed data is larger (or even not significantly
smaller) than the size of the raw data, then you should store the original
data.

lz4 is actually very fast, but it doesn't provide a good compression ratio.

These are my results of compressing pgbench data using different compressors:

Configuration         Size (Gb)   Time (sec)
no compression          15.31        92
zlib (default level)     2.37       284
zlib (best speed)        2.43       191
postgres internal lz     3.89       214
lz4                      4.12        95
snappy                   5.18        99
lzfse
(apple)                  2.80      1099
                         1.69       125

You see that zstd provides an almost 2 times better compression ratio at
almost the same speed.

Konstantin Knizhnik

Postgres Professional: http://www.postgrespro.com

The Russian Postgres Company

On Tue, 9 Apr 2019 at 02:27, Ashwin Agrawal <aagrawal(at)pivotal(dot)io> wrote:

> Heikki and I have been hacking recently for few weeks to implement

> in-core columnar storage for PostgreSQL. Here's the design and initial

> implementation of Zedstore, compressed in-core columnar storage (table

> access method). Attaching the patch and link to github branch [1] to

> follow along.

>

> The objective is to gather feedback on design and approach to the

> same. The implementation has core basic pieces working but not close

> to complete.

>

> Big thank you to Andres, Haribabu and team for the table access method

> API's. Leveraged the API's for implementing zedstore, and proves API

> to be in very good shape. Had to enhance the same minimally but

> in-general didn't had to touch executor much.

>

> Motivations / Objectives

>

> * Performance improvement for queries selecting subset of columns

> (reduced IO).

> * Reduced on-disk footprint compared to heap table. Shorter tuple

> headers and also leveraging compression of similar type data

> * Be first-class citizen in the Postgres architecture (tables data can

> just independently live in columnar storage)

> * Fully MVCC compliant

> * All Indexes supported

> * Hybrid row-column store, where some columns are stored together, and

> others separately. Provide flexibility of granularity on how to

> divide the columns. Columns accessed together can be stored

> together.

> * Provide better control over bloat (similar to zheap)

> * Eliminate need for separate toast tables

> * Faster add / drop column or changing data type of column by avoiding

> full rewrite of the table.

>

> High-level Design - B-trees for the win!

> ========================================

>

> To start simple, let's ignore column store aspect for a moment and

> consider it as compressed row store. The column store is natural

> extension of this concept, explained in next section.

>

> The basic on-disk data structure leveraged is a B-tree, indexed by

> TID. BTree being a great data structure, fast and versatile. Note this

> is not referring to existing Btree indexes, but instead net new

> separate BTree for table data storage.

>

> TID - logical row identifier:

> TID is just a 48-bit row identifier. The traditional division into

> block and offset numbers is meaningless. In order to find a tuple with

> a given TID, one must always descend the B-tree. Having logical TID

> provides flexibility to move the tuples around different pages on page

> splits or page merges can be performed.

>

> The internal pages of the B-tree are super simple and boring. Each

> internal page just stores an array of TID and downlink pairs. Let's

> focus on the leaf level. Leaf blocks have short uncompressed header,

> followed by btree items. Two kinds of items exist:

>

> - plain item, holds one tuple or one datum, uncompressed payload

> - a "container item", holds multiple plain items, compressed payload

>

> +-----------------------------

> | Fixed-size page header:

> |

> | LSN

> | TID low and hi key (for Lehman & Yao B-tree operations)

> | left and right page pointers

> |

> | Items:

> |

> | TID | size | flags | uncompressed size | lastTID | payload (container

> item)

> | TID | size | flags | uncompressed size | lastTID | payload (container

> item)

> | TID | size | flags | undo pointer | payload (plain item)

> | TID | size | flags | undo pointer | payload (plain item)

> | ...

> |

> +----------------------------

>

> Row store

> ---------

>

> The tuples are stored one after another, sorted by TID. For each

> tuple, we store its 48-bit TID, a undo record pointer, and the actual

> tuple data uncompressed.

>

> In uncompressed form, the page can be arbitrarily large. But after

> compression, it must fit into a physical 8k block. If on insert or

> update of a tuple, the page cannot be compressed below 8k anymore, the

> page is split. Note that because TIDs are logical rather than physical

> identifiers, we can freely move tuples from one physical page to

> another during page split. A tuple's TID never changes.

>

> The buffer cache caches compressed blocks. Likewise, WAL-logging,

> full-page images etc. work on compressed blocks. Uncompression is done

> on-the-fly, as and when needed in backend-private memory, when

> reading. For some compressions like rle encoding or delta encoding

> tuples can be constructed directly from compressed data.

>

> Column store

> ------------

>

> A column store uses the same structure but we have *multiple* B-trees,

> one for each column, all indexed by TID. The B-trees for all columns

> are stored in the same physical file.

>

> A metapage at block 0, has links to the roots of the B-trees. Leaf

> pages look the same, but instead of storing the whole tuple, stores

> just a single attribute. To reconstruct a row with given TID, scan

> descends down the B-trees for all the columns using that TID, and

> fetches all attributes. Likewise, a sequential scan walks all the

> B-trees in lockstep.

>

> So, in summary can imagine Zedstore as forest of B-trees, one for each

> column, all indexed by TIDs.

>

> This way of laying out the data also easily allows for hybrid

> row-column store, where some columns are stored together, and others

> have a dedicated B-tree. Need to have user facing syntax to allow

> specifying how to group the columns.

>

>

> Main reasons for storing data this way

> --------------------------------------

>

> * Layout the data/tuples in mapped fashion instead of keeping the

> logical to physical mapping separate from actual data. So, keep the

> meta-data and data logically in single stream of file, avoiding the

> need for separate forks/files to store meta-data and data.

>

> * Stick to fixed size physical blocks. Variable size blocks pose need

> for increased logical to physical mapping maintenance, plus

> restrictions on concurrency of writes and reads to files. Hence

> adopt compression to fit fixed size blocks instead of other way

> round.

>

>

> MVCC

> ----

> MVCC works very similar to zheap for zedstore. Undo record pointers

> are used to implement MVCC. Transaction information is not directly

> stored with the data. In zheap, there's a small, fixed, number of

> "transaction slots" on each page, but zedstore has undo pointer with

> each item directly; in normal cases, the compression squeezes this

> down to almost nothing.

>

>

> Implementation

> ==============

>

> Insert:

> Inserting a new row, splits the row into datums. Then for first column

> decide which block to insert the same to, and pick a TID for it, and

> write undo record for the same. Rest of the columns are inserted using

> that same TID and point to same undo position.

>

> Compression:

> Items are added to Btree in uncompressed form. If page is full and new

> item can't be added, compression kicks in. Existing uncompressed items

> (plain items) of the page are passed to compressor for

> compression. Already compressed items are added back as is. Page is

> rewritten with compressed data with new item added to it. If even

> after compression, can't add item to page, then page split happens.

>
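
The quoted compression flow can be outlined in a few lines; every helper
below is a hypothetical stand-in for the steps described above, not a
function from the patch.

    /* Outline of insert-with-recompression on a leaf page (illustrative;
     * page_add_plain_item, page_compress_plain_items and
     * page_split_and_insert are hypothetical helpers). */
    static void
    leaf_add_item(Page page, Item item, Size size)
    {
        if (page_add_plain_item(page, item, size))     /* fits as-is? */
            return;

        /* page full: compress existing plain items into a container item,
         * leaving already-compressed containers untouched, then retry */
        page_compress_plain_items(page);

        if (page_add_plain_item(page, item, size))
            return;

        /* still does not fit even after compression: split the page */
        page_split_and_insert(page, item, size);
    }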

> Toast:

> When an overly large datum is stored, it is divided into chunks, and

> each chunk is stored on a dedicated toast page within the same

> physical file. The toast pages of a datum form a list, each page has a

> next/prev pointer.

>
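
A rough sketch of the per-datum toast chain just described; the struct is
invented for illustration and the patch's real toast-page layout may differ.

    /* Hypothetical header data for a zedstore toast page: chunks of one
     * oversized datum form a doubly linked list of pages in the same file. */
    typedef struct ZSToastPageSketch
    {
        uint64      tid;          /* row the toasted datum belongs to */
        uint32      total_size;   /* full size of the datum */
        uint32      chunk_size;   /* bytes of the datum on this page */
        BlockNumber prev_page;    /* previous chunk, or InvalidBlockNumber */
        BlockNumber next_page;    /* next chunk, or InvalidBlockNumber */
    } ZSToastPageSketch;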

> Select:

> Property is added to Table AM to convey if column projection is

> leveraged by AM for scans. While scanning tables with AM leveraging

> this property, executor parses the plan. Leverages the target list and

> quals to find the required columns for query. This list is passed down

> to AM on beginscan. Zedstore uses this column projection list to only

> pull data from selected columns. Virtual tuple table slot is used to

> pass back the datums for subset of columns.

>

> Current table am API requires enhancement here to pass down column

> projection to AM. The patch showcases two different ways for the same.

>

> * For sequential scans added new beginscan_with_column_projection()

> API. Executor checks AM property and if it leverages column

> projection uses this new API else normal beginscan() API.

>

> * For index scans instead of modifying the begin scan API, added new

> API to specifically pass column projection list after calling begin

> scan to populate the scan descriptor but before fetching the tuples.

>
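
To illustrate the projection step, a fragment showing how the needed columns
can be collected from a scan's target list and quals with the existing
pull_varattnos() helper; 'plan' and 'scanrelid' are assumed to be in scope in
an executor node, and how the resulting set is handed to the AM is whatever
the new API described above does.

    /* Collect every attribute referenced by the targetlist and quals.
     * Attribute numbers in the resulting set are offset by
     * FirstLowInvalidHeapAttributeNumber. */
    Bitmapset  *needed_attrs = NULL;

    pull_varattnos((Node *) plan->targetlist, scanrelid, &needed_attrs);
    pull_varattnos((Node *) plan->qual, scanrelid, &needed_attrs);
    /* ... pass needed_attrs down to the AM when beginning the scan ... */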

> Index Support:

> Building index also leverages columnar storage and only scans columns

> required to build the index. Indexes work pretty similar to heap

> tables. Data is inserted into the table and the TID for the tuple gets

> stored in index. On index scans, required column Btrees are scanned

> for given TID and datums passed back using virtual tuple.

>

> Page Format:

> ZedStore table contains different kinds of pages, all in the same

> file. Kinds of pages are meta-page, per-attribute btree internal and

> leaf pages, UNDO log page, and toast pages. Each page type has its own

> distinct data storage format.

>

> Block 0 is always a metapage. It contains the block numbers of the

> other data structures stored within the file, like the per-attribute

> B-trees, and the UNDO log.

>

>

> Enhancements to design:

> =======================

>

> Instead of compressing all the tuples on a page in one batch, we could

> store a small "dictionary", e.g. in page header or meta-page, and use

> it to compress each tuple separately. That could make random reads and

> updates of individual tuples faster.

>

> When adding column, just need to create new Btree for newly added

> column and linked to meta-page. No existing content needs to be

> rewritten.

>

> When the column is dropped, can scan the B-tree of that column, and

> immediately mark all the pages as free in the FSM. But we don't

> actually have to scan the leaf level: all leaf tuples have a downlink

> in the parent, so we can scan just the internal pages. Unless the

> column is very wide, that's only a small fraction of the data. That

> makes the space immediately reusable for new insertions, but it won't

> return the space to the Operating System. In order to do that, we'd

> still need to defragment, moving pages from the end of the file closer

> to the beginning, and truncate the file.

>

> In this design, we only cache compressed pages in the page cache. If

> we want to cache uncompressed pages instead, or in addition to that,

> we need to invent a whole new kind of a buffer cache that can deal

> with the variable-size blocks.

>

> If you do a lot of updates, the file can get fragmented, with lots of

> unused space on pages. Losing the correlation between TIDs and

> physical order is also bad, because it will make SeqScans slow, as

> they're not actually doing sequential I/O anymore. We can write a

> defragmenter to fix things up. Half-empty pages can be merged, and

> pages can be moved to restore TID/physical correlation. This format

> doesn't have the same MVCC problems with moving tuples around that the

> Postgres heap does, so it can be fairly easily be done on-line.

>

> Min-Max values can be stored per block to easily skip scanning if
> column values don't fall in the range.

>

> Notes about current patch

> =========================

>

> Basic (core) functionality is implemented to showcase and play with.

>

> Two compression algorithms are supported Postgres pg_lzcompress and

> lz4. Compiling server with --with-lz4 enables the LZ4 compression for

> zedstore else pg_lzcompress is default. Definitely LZ4 is super fast

> at compressing and uncompressing.

>

> Not all the table AM API's are implemented. For the functionality not

> implemented yet will ERROR out with not supported. Zedstore Table can

> be created using command:

>

> CREATE TABLE <name> (column listing) USING zedstore;

>

> Bulk load can be performed using COPY. INSERT, SELECT, UPDATE and

> DELETES work. Btree indexes can be created. Btree and bitmap index

> scans work. Test in src/test/regress/sql/zedstore.sql showcases all

> the functionality working currently. Updates are currently implemented

> as cold, meaning they always create new items and are not performed in-place.

>

> TIDs currently can't leverage the full 48 bit range but instead need

> to limit to values which are considered valid ItemPointers. Also,

> MaxHeapTuplesPerPage pose restrictions on the values currently it can

> have. Refer [7] for the same.

>

> Extremely basic UNDO logging has been implemented just for MVCC

> perspective. MVCC is missing tuple lock right now. Plus, doesn't

> actually perform any undo yet. No WAL logging exists currently, hence
> it's not crash safe either.

>

> Helpful functions to find how many pages of each type is present in

> zedstore table and also to find compression ratio is provided.

>

> Test mentioned in thread "Column lookup in a row performance" [6],

> good example query for zedstore locally on laptop using lz4 shows

>

> postgres=# SELECT AVG(i199) FROM (select i199 from layout offset 0) x;  -- heap

> avg

> ---------------------

> 500000.500000000000

> (1 row)

>

> Time: 4679.026 ms (00:04.679)

>

> postgres=# SELECT AVG(i199) FROM (select i199 from zlayout offset 0) x;  -- zedstore

> avg

> ---------------------

> 500000.500000000000

> (1 row)

>

> Time: 379.710 ms

>

> Important note:

> ---------------

> Planner has not been modified yet to leverage the columnar

> storage. Hence, plans using the "physical tlist" optimization and similar
> optimizations that are good for a row store currently miss out on
> leveraging the columnar nature. Hence the need for the subquery with
> OFFSET 0 above, to disable the optimization and scan only the required
> column.

>

>

>

> The current proposal and discussion is more focused on AM layer work

> first. Hence, currently intentionally skips to discuss the planner or

> executor "feature" enhancements like adding vectorized execution and

> family of features.

>

> Previous discussions or implementations for column store Vertical

> cluster index [2], Incore columnar storage [3] and [4], cstore_fdw [5]

> were referred to distill down objectives and come up with design and

> implementations to avoid any earlier concerns raised. Learnings from

> Greenplum Database column store also leveraged while designing and

> implementing the same.

>

> Credit: Design is mostly the brainchild of Heikki, or actually his
> epiphany to be exact. I acted as an idea bouncing board and contributed

> enhancements to the same. We both are having lot of fun writing the

> code for this.

>

>

> References

> 1] https://github.com/greenplum-db/postgres/tree/zedstore

> 2]

> https://www.postgresql.org/message-id/flat/CAJrrPGfaC7WC9NK6PTTy6YN-NN%2BhCy8xOLAh2doYhVg5d6HsAA%40mail.gmail.com

> 3]

> https://www.postgresql.org/message-id/flat/20150611230316.GM133018%40postgresql.org

> 4]

> https://www.postgresql.org/message-id/flat/20150831225328.GM2912%40alvherre.pgsql

> 5] https://github.com/citusdata/cstore_fdw

> 6]

> https://www.postgresql.org/message-id/flat/CAOykqKfko-n5YiBJtk-ocVdp%2Bj92Apu5MJBwbGGh4awRY5NCuQ%40mail.gmail.com

> 7]

> https://www.postgresql.org/message-id/d0fc97bd-7ec8-2388-e4a6-0fda86d71a43%40iki.fi

>

>

Reading about it reminds me of this work -- TAG column storage
(http://www09.sigmod.org/sigmod/record/issues/0703/03.article-graefe.pdf).
Isn't this storage system inspired by that, with TID as the TAG? It is not
referenced here, so it made me wonder.

--

Regards,

Rafia Sabih



On Tue, 9 Apr 2019 at 20:29, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Tue, Apr 9, 2019 at 11:51 AM Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>

> wrote:

> > This is not surprising, considering that columnar store is precisely the

> > reason for starting the work on table AMs.

> >

> > We should certainly look into integrating some sort of columnar storage

> > in mainline. Not sure which of zedstore or VOPS is the best candidate,

> > or maybe we'll have some other proposal. My feeling is that having more

> > than one is not useful; if there are optimizations to one that can be

> > borrowed from the other, let's do that instead of duplicating effort.

>

> I think that conclusion may be premature. There seem to be a bunch of

> different ways of doing columnar storage, so I don't know how we can

> be sure that one size will fit all, or that the first thing we accept

> will be the best thing.

>

> Of course, we probably do not want to accept a ton of storage manager

> implementations in core. I think if people propose implementations

> that are poor quality, or missing important features, or don't have

> significantly different use cases from the ones we've already got,

> it's reasonable to reject those. But I wouldn't be prepared to say

> that if we have two significantly different column stores that are both

> awesome code with a complete feature set and significantly disjoint

> use cases, we should reject the second one just because it is also a

> column store. I think that won't get out of control because few

> people will be able to produce really high-quality implementations.

>

> This stuff is hard, which I think is also why we only have 6.5 index

> AMs in core after many, many years. And our standards have gone up

> over the years - not all of those would pass muster if they were

> proposed today.

>

> BTW, can I express a small measure of disappointment that the name for

> the thing under discussion on this thread chose to be called

> "zedstore"? That seems to invite confusion with "zheap", especially

> in parts of the world where the last letter of the alphabet is

> pronounced "zed," where people are going to say zed-heap and

> zed-store. Brr.

+1 on Brr. Looks like Thomas's and your thought that having 'z' makes things
popular/stylish, etc. is after all true; I was skeptical back then.

--

Regards,

Rafia Sabih



On 11/04/2019 16:12, Rafia Sabih wrote:

> On Tue, 9 Apr 2019 at 20:29, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

>

> BTW, can I express a small measure of disappointment that the name for

> the thing under discussion on this thread chose to be called

> "zedstore"? That seems to invite confusion with "zheap", especially

> in parts of the world where the last letter of the alphabet is

> pronounced "zed," where people are going to say zed-heap and

> zed-store. Brr.

>

> +1 on Brr. Looks like Thomas and your thought on having 'z' makes

> things popular/stylish, etc. is after all true, I was skeptical back then.

BrrStore works for me, too ;-).

- Heikki

On 4/11/19 10:46 AM, Konstantin Knizhnik wrote:

> This my results of compressing pbench data using different compressors:

>

> Configuration         Size (Gb)   Time (sec)
> no compression          15.31        92
> zlib (default level)     2.37       284
> zlib (best speed)        2.43       191
> postgres internal lz     3.89       214
> lz4                      4.12        95
> snappy                   5.18        99
> lzfse
> (apple)                  2.80      1099
>                          1.69       125

>

>

>

> You see that zstd provides almost 2 times better compression ration

> and almost at the same speed.

What is "(apple) 2.80 1099"? Was that intended to be zstd?

Andreas

On 11.04.2019 16:18, Andreas Karlsson wrote:

>

> On 4/11/19 10:46 AM, Konstantin Knizhnik wrote:

>

>> This my results of compressing pbench data using different compressors:

>>

>> Configuration         Size (Gb)   Time (sec)
>> no compression          15.31        92
>> zlib (default level)     2.37       284
>> zlib (best speed)        2.43       191
>> postgres internal lz     3.89       214
>> lz4                      4.12        95
>> snappy                   5.18        99
>> lzfse
>> (apple)                  2.80      1099
>>                          1.69       125

>>

>>

>>

>> You see that zstd provides almost 2 times better compression ration

>> and almost at the same speed.

>

>

> What is "(apple) 2.80 1099"? Was that intended to be zstd?

>

> Andreas

>

Ugh...

Cut and paste problems.

The whole document can be found here:
http://garret.ru/PageLevelCompression.pdf

lzfse (apple)     2.80    1099
zstd (facebook)   1.69     125

zstd is the compression algorithm proposed by facebook:
https://github.com/facebook/zstd
Looks like it provides the best speed/compression ratio result.

--

Konstantin Knizhnik

Postgres Professional: http://www.postgrespro.com

The Russian Postgres Company

On Thu, Apr 11, 2019 at 3:15 PM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:

> On 11/04/2019 16:12, Rafia Sabih wrote:

> > On Tue, 9 Apr 2019 at 20:29, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> >

> > BTW, can I express a small measure of disappointment that the name

> for

> > the thing under discussion on this thread chose to be called

> > "zedstore"? That seems to invite confusion with "zheap", especially

> > in parts of the world where the last letter of the alphabet is

> > pronounced "zed," where people are going to say zed-heap and

> > zed-store. Brr.

> >

> > +1 on Brr. Looks like Thomas and your thought on having 'z' makes

> > things popular/stylish, etc. is after all true, I was skeptical back

> then.

>

> BrrStore works for me, too ;-).

Also works as a reference to the Finnish climate? (Sorry, couldn't help myself)

--

Magnus Hagander

Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/



On Tue, 9 Apr 2019 at 02:27, Ashwin Agrawal <aagrawal(at)pivotal(dot)io> wrote:

> Heikki and I have been hacking recently for few weeks to implement

> in-core columnar storage for PostgreSQL. Here's the design and initial

> implementation of Zedstore, compressed in-core columnar storage (table

> access method). Attaching the patch and link to github branch [1] to

> follow along.

>

> The objective is to gather feedback on design and approach to the

> same. The implementation has core basic pieces working but not close

> to complete.

>

> Big thank you to Andres, Haribabu and team for the table access method

> API's. Leveraged the API's for implementing zedstore, and proves API

> to be in very good shape. Had to enhance the same minimally but

> in-general didn't had to touch executor much.

>

> Motivations / Objectives

>

> * Performance improvement for queries selecting subset of columns

> (reduced IO).

> * Reduced on-disk footprint compared to heap table. Shorter tuple

> headers and also leveraging compression of similar type data

> * Be first-class citizen in the Postgres architecture (tables data can

> just independently live in columnar storage)

> * Fully MVCC compliant

> * All Indexes supported

> * Hybrid row-column store, where some columns are stored together, and

> others separately. Provide flexibility of granularity on how to

> divide the columns. Columns accessed together can be stored

> together.

> * Provide better control over bloat (similar to zheap)

> * Eliminate need for separate toast tables

> * Faster add / drop column or changing data type of column by avoiding

> full rewrite of the table.

>

> High-level Design - B-trees for the win!

> ========================================

>

> To start simple, let's ignore column store aspect for a moment and

> consider it as compressed row store. The column store is natural

> extension of this concept, explained in next section.

>

> The basic on-disk data structure leveraged is a B-tree, indexed by

> TID. BTree being a great data structure, fast and versatile. Note this

> is not referring to existing Btree indexes, but 