Introduction to Teradata Referential Integrity

What is Teradata Referential Integrity? Referential integrity ensures that a value created in a foreign key (FK) column of a referencing (child) table also exists as a primary key (PK) in the referenced (parent) table. In Teradata, the parent table’s column does not have to be specified explicitly as PK, but an FK can reference each unique NOT NULL column. What are the Reasons for using Teradata Referential Integrity Features? 1. Ensure data integrity and consistency by preventing that DELETE, UPDATE, or INSERT statements can corrupt the database

2. No coding required to ensure referential integrity manually

3. Teradata referential integrity handling is optimized and performs better than programming it yourself using SQL or another coding method.

4. Referential Integrity helps the Teradata optimizer to create better execution plans by eliminating unnecessary joins. What are the requirements for the primary key column of the parent table? This column must be UNIQUE. This is given if this column is either defined as a primary key or has a UNIQUE constraint or a unique index (UPI, USI). What happens if I delete a Primary Key value from the parent table and the child table refers to this value? Teradata prevents this. A Primary Key value can only be deleted or changed if no Foreign Key refers to this value. What is the syntax for referential integrity? CREATE TABLE Sale

(

SalesId INTEGER NOT NULL

CustomerId INTEGER REFERENCES Customer (CustmerId),

SalesDate DATE,

Price DECIMAL(10,2))

PRIMARY INDEX (SalesId)

); Can the foreign key columns and primary key columns of the child and parent tables have different data types? No. Both data type and case sensitivity must be identical.

Teradata implements 3 Types of Referential Integrity

1. What is Standard Referential Integrity?

With the Standard Referential Integrity, every INSERT, DELETE, or UPDATE of a Row is checked immediately whether the referential integrity is given.

An additional reference index sub-table is required to ensure referential integrity.

If referential integrity is violated, the statement is not executed, and an error is reported.

Since the referential integrity check is row-by-row, a performance loss must be expected.

The load tools such as TPT, Fastload, and Multiload cannot be used for tables to which standard Referential Integrity is applied.

The Reference Index Sub-Table

For each referential constraint on a child table (that is the one with the FK), a reference index sub-table is required.

The reference index sub-table is used to check whether an operation (UPDATE, INSERT, or DELETE) would violate the referential integrity and prevent it.

The reference index sub-table is distributed to all AMPs using hashing, just like a base table.

The foreign key value determines the rowhash. This ensures that a foreign key’s index rows are always present at the same AMP as the primary key rows (defined by UPI, USI, or explicitly defined as PK).

Hashing the foreign key value ensures that the referential integrity check can be performed locally on each AMP without copying rows. This is an important performance aspect

A reference index row contains the following information:

The ROWID of the foreign key value

The foreign key value itself

The number of rows in the child table containing this foreign key value

The validity flag indicates if the foreign key value is valid (i.e., the value is available in the column of the parent table). The next section will discuss how referential integrity may not be given, even if it is defined.

Adding Standard Referential Integrity to a populated Table – the Creation of Referential Integrity Violations

It is possible to apply standard referential integrity to a populated table:

ALTER TABLE SALE ADD CONSTRAINT FK_SALES FOREIGN KEY (CUSTOMER) REFERENCES CUSTOMER(CustomerId);

The following happens if there is no primary key in the parent table for a certain foreign key value:

The validity flag of the reference index sub-table row is set to “false”

INSERT, UPDATE, or DELETE statements can still be executed, even if there are invalidated index sub-table rows, i.e., problems with the referential integrity in the child table exist.

No new rows with the same invalid value can be inserted!

An error table with the same structure as the child table is created automatically, which receives the invalid rows. The error table is called exactly like the child table, but with a numeric post-fix e.g., for SALE, there is the error table SALE_0 being created.

If there are several foreign key constraints on a child table, an error table is created for each foreign key constraint (these are numbered accordingly).

Error Table Handling for Standard RI

Purge Referential Integrity Violations

As we have seen in the previous section when you add standard referential integrity to a populated table, referential integrity maybe not be fulfilled for all rows.

Here are the methods of how you can restore data integrity after adding referential integrity checking:

ALTER TABLE SALE ADD CONSTRAINT FK_SALES FOREIGN KEY (CUSTOMER) REFERENCES CUSTOMER(CustomerId);

UPDATE SALE SET CustomerId = <ValidCustomerId> WHERE CustomerId = <InvalidCustomerId>;

UPDATE SALE SET CustomerId = NULL WHERE SalesId IN (SELECT SalesId FROM SALES_0);

DELETE FROM SALE WHERE SalesId IN (SELECT SalesId FROM SALES_0);

INSERT INTO Customer VALUES (<MissingCustomerId>,…);

The syntax for adding Standard Referential Integrity

ALTER TABLE SALE ADD CONSTRAINT SalesFK FOREIGN KEY (CustomerID) REFERENCES Customer (CustomerId);

2. What is Batch Referential Integrity?

The difference to standard referential integrity is that it does not check row-by-row, but checks the entire transaction. If only one row does not pass the test, the entire transaction is rolled back.

The advantage of this method that the performance loss can be considerably lower. Teradata also does not need an index sub-table for Batch Referential Integrity.

As with Standard Referential Integrity, the loading tools TPT, Fastload, and Multiload cannot be used on Batch Referential Integrity tables.

Since no reference index sub-table is used, the transaction’s referential integrity is checked differently: by joining the parent and child table.

If the referential integrity is violated by at least one row, the entire transaction is rolled back.

Adding Batch Referential Integrity to a populated Table – the Creation of Referential Integrity Violations

For populated tables, if the foreign key has violations, the ALTER TABLE

will fail with Error 3513. There is no error table written.

The syntax for adding Batch Referential Integrity – WITH CHECK OPTION

ALTER TABLE SALE ADD CONSTRAINT SalesFK FOREIGN KEY (CustomerID) REFERENCES WITH CHECK OPTION Customer (CustomerId);

3. What is Soft Referential Integrity?

Soft referential integrity does not ensure referential integrity. Although referential integrity between tables is defined, it is not enforced.

Why does soft referential integrity exist then? Soft referential integrity exists to help the optimizer create better execution plans without the overhead of batch or standard referential integrity.

Since no check is performed, all loading tools (Fastload, TPT, Multiload) can be used without restrictions.

With Soft Referential Integrity, you promise that data integrity is given. The optimizer trusts that data integrity is given. The wrong result may be returned for a query if this is not the case!

The Syntax for adding Batch Referential Integrity – WITH NO CHECK OPTION

ALTER TABLE SALE ADD CONSTRAINT SalesFK FOREIGN KEY (CustomerID) REFERENCES WITH NO CHECK OPTION Customer (CustomerId);

Referential Integrity in Comparison

Let’s now compare a query over two tables between which referential integrity is defined, and consider how exactly join optimization takes place, and what the result is, depending on the type of referential integrity used.

We select only the columns of the child table (Sale) so that the optimizer can do join elimination.

SELECT s.* FROM Customer c INNER JOIN Sale s ON s.CustomerId = c.CustomerId;

No Referential Integrity is defined

There is always a join taking place (although only columns of the sale table are selected).

If there is a CustomerId in the Sale Table that does not exist in the Customer Table, these rows are lost.

Standard Referential Integrity is defined

ALTER TABLE SALE ADD CONSTRAINT SalesFK FOREIGN KEY (CustomerID) REFERENCES Customer (CustomerId);

No join takes place.

Rows of the Sale Table are returned, for which the CustomerId is not in the Customer Table.



Remember:

This inconsistency could have been caused by adding the referential integrity to an existing and populated table.



Therefore, it is always important to check if there are rows in the error table and correct the errors accordingly, or the result might be unexpected.

Batch Referential Integrity is defined

ALTER TABLE SALE ADD CONSTRAINT SalesFK FOREIGN KEY (CustomerID) REFERENCES WITH CHECK OPTION Customer (CustomerId);

No join takes place.

Since batch referential integrity prevents referential integrity from being violated, there can be no unexpected results.

Soft Referential Integrity is defined

ALTER TABLE SALE ADD CONSTRAINT SalesFK FOREIGN KEY (CustomerID) REFERENCES WITH NO CHECK OPTION Customer (CustomerId);

No join takes place.

Rows of the Sale Table are returned, for which the CustomerId is not in the Customer Table.

When to use what kind of Referential Integrity?

Performance Considerations

Soft Referential Integrity does not cause performance deterioration . Use it if you are sure that your data’s referential integrity is ensured to help the optimizer create a better execution plan.

. Use it if you are sure that your data’s referential integrity is ensured to help the optimizer create a better execution plan. If a row-by-row INSERT, UPDATE, or DELETE occurs, the performance loss is similar for Standard and Batch Referential Integrity. Use it when mainly single rows are processed.

If many rows are processed at once with INSERT/DELETE/UPDATE, the performance loss is usually higher with Standard Referential Integrity than with Batch Referential Integrity if no referential integrity violations exist.

If many rows are processed at once with INSERT/DELETE/UPDATE, and referential integrity violation occurs frequently, the rollback caused by the Batch Referential Integrity might cause substantial performance issues.

Use batch referential integrity for batch inserts of rows only if there are usually no referential integrity instances, as the rollback in case of an error might become a performance issue.

Feature Comparison

The most important features compared

Teradata Referential Integrity – Why is it so rarely used?

It is typically not so critical in a data warehouse if the referential integrity is not 100% given.

One would rather accept a certain inaccuracy than the performance loss caused by standard or batch referential integrity.

Soft referential integrity seems to be a solution, but only if the data’s referential integrity is 100% fulfilled.

This may involve a lot of effort in the ETL process to ensure this. But only then it makes sense to use soft referential integrity because who can afford that queries return the wrong results?

Read also:

Teradata Documentation on this topic