Get a Grip on clumpy Row Distribution – the Teradata PPI Table

For this trick to function properly, the partition column(s) must not be part of the Teradata Primary Index.

At some point in time, we all experienced this problem with table inserts: Although the table rows are distributed evenly across all AMPs, the merge step into the table seems to take forever.

One very likely reason is that we are confronted with the scenario of a SET table and many identical NUPI values. The AMPs are forced to do a duplicate row check for each inserted row. This process can take an enormous amount of time and waste a lot of CPU cycles as each new row is compared in full against all the existing rows having the same Primary Index value.

While this may be solved by switching over to a MULTISET design in the case of nonpartitioned tables, we have another great opportunity to address this problem with Teradata partitioned tables (PPI tables).

Assume we have the following table design like below:

CREATE TABLE MultiLevel_Example

(

the_code INTEGER NOT NULL,

the_date DATE NOT NULL,

the_value DECIMAL(18,0)

) UNIQUE PRIMARY INDEX (the_code)

PARTITION BY (

RANGE_N(the_date BETWEEN DATE ‘2015-01-01‘ AND DATE ‘2015-12-31’ EACH INTERVAL ‘1’ MONTH), RANGE_N(the_code BETWEEN 1 AND 5 EACH 1)

);

As you can see, the partition columns are not part of the Primary Index – this is the prerequisite for the trick to work!

Below is the schematic draw of the internal presentation of the table:

The column labeled “C” is the combined partition, “L1” the first level of partitioning and “L2” the second tier. The column labeled “H” represents the ROWHASH of “The_Code.”

“Whenever we create a PPI table which does not contain all partitioning columns in the Primary Index definition, equal NUPI values are spread across different partitions, reducing the clumps and therefore improving performance as the number of rows to be compared for each duplicate row check becomes significantly smaller.”

Nevertheless, be aware that having the partition columns not in the Primary Index comes as well with a penalty and you have to evaluate advantages and disadvantages:

In the case of a primary index access, all partitions have to be checked to find the searched Primary Index value. Furthermore, merge joins may be executed slower and aggregations are more costly as they have to take place across partitions.

Let us take a short look why this trick will not work for a Teradata PPI table in case that all partition columns are contained in the Primary Index definition:

CREATE TABLE MultiLevel_Example

(

the_code INTEGER NOT NULL,

the_date DATE NOT NULL,

the_value DECIMAL(18,0)

) UNIQUE PRIMARY INDEX (the_code,the_date)

PARTITION BY (

RANGE_N(the_date BETWEEN DATE ‘2015-01-01‘ AND DATE ‘2015-12-31’ EACH INTERVAL ‘1’ MONTH), RANGE_N(the_code BETWEEN 1 AND 5 EACH 1)

);

Take a look at below illustration. As The ROWHASH is calculated over all partition columns, it is by design impossible that the same ROWHASH ends up in different partitions (take your time to prove this):

Here is a link to the Partitioned Primary Index Guide: