Dynamic data transfer between Teradata and BigQuery

Written by Costin Croitoru, Dan Berbece & Marius Eftene

In this article, we’ll share a migration solution that takes data from Teradata databases, transfers it to BigQuery, and creates a data lake on Google Cloud Platform.

We were at the beginning of our switch from on-premises solutions to Google Cloud Platform, and we needed a method for transferring data from Teradata to GCP BigQuery.

After careful consideration, the solution we came up with was to export data from the Teradata databases to Google Cloud Storage using Teradata Parallel Transporter (TPT) and then load the data into BigQuery using Google Cloud Composer.

This is accomplished in a few stages. The first stage is to create a list of source tables to pull and mark each one as a full or incremental load. For incremental loads, we also need to define which columns are used as the incremental key.

Once we have the list of tables, the next stage is to move the data from Teradata to GCS. The final stage moves the data from GCS into Google BigQuery.

Teradata TPT and Google gsutil run the migration pipelines that move data from Teradata to GCS: this process exports the tables defined in the list and stores the data as CSV files in a GCS bucket. A second process group, GCS2BigQuery, moves the data from GCS into Google BigQuery.

The following sections describe the logic in our data migration pipeline from Teradata to Google BigQuery.

Part 1: Teradata to Google Cloud Storage

Define the list of tables that need to be exported

This process involves determining whether each table will be transferred in full or incremental mode.

If a table is transferred incrementally, we need to define the incremental columns to be used (at most two columns per table: insert_id and update_id) and a day-level partition column.

The incremental logic is based on insert_id/update_id, but applied at day level.

The list of tables to be exported is stored in a config table.
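
A minimal sketch of what such a config table could look like is shown below; the table name, columns, and sample rows are illustrative assumptions, not our exact production schema.

/* Hypothetical config table; names and types are illustrative only. */
CREATE TABLE teradata_db.export_config_table (
    databasename      VARCHAR(128),
    tablename         VARCHAR(128),
    load_type         CHAR(1),       -- 'F' = full load, 'I' = incremental load
    insert_id_column  VARCHAR(128),  -- first incremental key column (optional)
    update_id_column  VARCHAR(128),  -- second incremental key column (optional)
    partition_column  VARCHAR(128)   -- day-level partition column for incremental tables
);

/* One full-load master data table and one incremental movement table. */
INSERT INTO teradata_db.export_config_table
VALUES ('database_name', 'customer_dim', 'F', NULL, NULL, NULL);
INSERT INTO teradata_db.export_config_table
VALUES ('database_name', 'sales_movement', 'I', 'insert_id', 'update_id', 'partition_date');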

Master data tables use a simpler export:

· No backward or forward compatibility.

· Tables are exported from Teradata in full each time.

· In BigQuery, the data is loaded with write_truncate.

· Export from Teradata is done using TPT export.

Movement tables are exported with backward and forward compatibility.

This is done using a DDL history table. If a table has not been exported before, an initial export is performed and a DDL image is saved as the starting point.

The initial-load export is also used for master data tables and tables exported in full mode, but the structure of those tables is overwritten on each export.

The script below is used to populate the ddl_hist table:

SELECT DISTINCT
    LOWER(TRIM(c.databasename)) databasename,
    LOWER(TRIM(c.TABLENAME)) TABLENAME,
    LOWER(TRIM(c.COLUMNNAME)) COLUMNNAME,
    /* map each Teradata column type to its BigQuery equivalent */
    CASE
        WHEN TRIM(UPPER(c.COLUMNTYPE)) = 'BO' THEN 'Record'
        .....
        WHEN TRIM(UPPER(c.COLUMNTYPE)) = '.... ' THEN 'BQ datatype equivalent'
    END DataType,
    CASE
        WHEN TRIM(UPPER(c.COLUMNTYPE)) IN ('BV','CF','CV') AND c.chartype = 1 THEN '(' || TRIM(c.COLUMNLENGTH) || ') '
        .....
        /* calculate column length based on COLUMNTYPE, COLUMNLENGTH, DECIMALTOTALDIGITS and DECIMALFRACTIONALDIGITS */
        ELSE ''
    END DataSize,
    CASE
        WHEN TRIM(UPPER(c.NULLABLE)) = 'N' THEN 'NOT NULL'
        ELSE ''
    END AS NullInd,
    c.defaultvalue,
    dense_rank() OVER (PARTITION BY 1 ORDER BY c.columnid ASC) columnid,
    c.chartype,
    c.COLUMNLENGTH,
    c.DECIMALTOTALDIGITS,
    c.DECIMALFRACTIONALDIGITS,
    0 dropp,   /* dropped-column indicator: the column still exists in Teradata */
    1,         /* initial-load indicator */
    ${last_export_timestamp_id} AS last_export_timestamp_id
FROM DBC.COLUMNSv c
INNER JOIN DBC.TABLESv t
    ON c.TABLENAME = t.TABLENAME
    AND c.DATABASENAME = t.DATABASENAME
JOIN (
    SELECT TABLENAME, DATABASENAME, MAX(columnid) max_column_id, MIN(columnid) min_column_id
    FROM DBC.COLUMNSv
    WHERE LOWER(TRIM(TABLENAME)) = 'table_name'
      AND LOWER(databasename) = 'database_name'
    GROUP BY DATABASENAME, TABLENAME
) c2
    ON c2.TABLENAME = t.TABLENAME
    AND c2.DATABASENAME = t.DATABASENAME
WHERE LOWER(TRIM(t.TABLENAME)) = 'table_name'
  AND LOWER(t.databasename) = 'database_name'
  AND TRIM(UPPER(t.TABLEKIND)) IN ('T','O')
  /* only tables that are not already present in the DDL history table */
  AND NOT EXISTS (
      SELECT 1
      FROM teradata_db.ddl_hist_table c3
      WHERE c3.TABLENAME = t.TABLENAME
        AND c3.DATABASENAME = t.DATABASENAME
  )
;

If the exported table has already been inserted into the DDL history table, we check whether there are DDL changes. If columns were removed from Teradata, dropped_column_ind is set to 1, and each such column is exported with the default value stored in the DefaultValue column.

If a dropped column is added back to the Teradata table, it is exported in the same position with the new values (not with the default anymore) and dropped_column_ind is set back to 0.
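
A sketch of how dropped columns could be detected against the Teradata dictionary is shown below; the table and column names follow the ddl_hist_table used above, but the statement itself is an illustration, not our exact production code.

/* Sketch: flag columns that exist in the DDL history table
   but no longer exist in DBC.COLUMNSv (illustrative only). */
UPDATE teradata_db.ddl_hist_table
SET dropped_column_ind = 1
WHERE LOWER(TRIM(TABLENAME)) = 'table_name'
  AND LOWER(databasename) = 'database_name'
  AND NOT EXISTS (
      SELECT 1
      FROM DBC.COLUMNSv c
      WHERE c.TABLENAME = ddl_hist_table.TABLENAME
        AND c.DATABASENAME = ddl_hist_table.DATABASENAME
        AND c.COLUMNNAME = ddl_hist_table.COLUMNNAME
  );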

When new columns are added, they are always appended at the end of the export file. This allows us to update the BigQuery table structure and import the newly added columns without manual intervention.

We use the same script as in the initial load, with the following changes:

columnid is calculated as follows:

max_column_id+dense_rank() OVER (PARTITION BY 1 order by c.columnid asc ) columnid ,

and (tablename, columnname) is restricted to the newly added columns:

and ( c.TABLENAME, c.COLUMNNAME ) in (
    select TABLENAME, COLUMNNAME from DBC.COLUMNSv t
    where LOWER(TRIM(t.TABLENAME)) = 'table_name'
      AND LOWER(t.databasename) = 'database_name'
    minus
    select TABLENAME, COLUMNNAME from teradata_db.ddl_hist_table t
    where LOWER(TRIM(t.TABLENAME)) = 'table_name'
      AND LOWER(t.databasename) = 'database_name'
)

initial_load_ind is set to 0.

Delta algorithm

We use an export_history_table that keeps the information needed to export in incremental mode. It supports at most two incremental columns per table.

This table contains the information for all historical exports executed from Teradata:

· table name

· partition_date

· insert_id

· update_id

· last_export_timestamp_id — export job unique ID
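
For illustration, the sketch below shows a possible shape for this table and the kind of day-level delta filter that can be derived from it; the column types and the filter itself are assumptions used only to make the idea concrete.

/* Sketch of the export history table (illustrative types and names). */
CREATE TABLE teradata_db.export_history_table (
    databasename             VARCHAR(128),
    tablename                VARCHAR(128),
    partition_date           DATE,
    insert_id                BIGINT,
    update_id                BIGINT,
    last_export_timestamp_id BIGINT   -- export job unique ID
);

/* Sketch of a day-level delta filter: select the day partitions that are new
   or whose insert_id/update_id advanced past the values recorded at the last
   export; those partitions would then be re-exported. */
SELECT a.partition_date
FROM database_name.table_name a
LEFT JOIN teradata_db.export_history_table h
    ON  h.tablename = 'table_name'
    AND h.partition_date = a.partition_date
GROUP BY a.partition_date, h.insert_id, h.update_id
HAVING h.insert_id IS NULL
    OR MAX(a.insert_id) > h.insert_id
    OR MAX(a.update_id) > h.update_id;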

Export crosscheck info

This step is executed only once per table per export. The crosscheck step is split in two:

· Delta export: the crosscheck file contains the number of rows for each exported partition.

· Full data export: one row per table containing the number of rows in Teradata.

The crosscheck info is used later to verify that the number of rows loaded into BigQuery matches the number of rows exported from Teradata.
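
The crosscheck counts can be produced with queries along these lines (the table name is a placeholder):

/* Delta export crosscheck: row count per exported day partition. */
SELECT partition_date, COUNT(*) AS row_cnt
FROM database_name.table_name
GROUP BY partition_date;

/* Full export crosscheck: a single total row count for the table. */
SELECT COUNT(*) AS row_cnt
FROM database_name.table_name;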

Export and execute TPT file

A dynamic TPT script is built based on the information in the DDL history table.

The script also converts each column to the corresponding BigQuery data type.

The TPT script generated here is executed automatically:

select

case when columnid=min_column_id then 'select cast(oreplace(regexp_replace(regexp_replace(regexp_replace(otranslate('|| fast_exp_string ||'||''''#^#''''||'

when columnid=max_column_id then tpt_exp_string || ','''' ~&\\*/\"'''','''' ______''''),''''\\r\

'''','''' ''''),''''\

'''','''' ''''),''''\\r'''','''' '''') ,''''#^#'''',''''~'''') as varchar('||(SUM(size_col) OVER (PARTITION BY 1 order by columnid asc rows unbounded preceding ) )||')) '||'string'||' from '|| databasename||'.'||TABLENAME||' a '||';'

else tpt_exp_string ||'||''''#^#''''||'

end tpt_exp_string,

columnid

from (

select

/* convert to varchar and treat NULL values for all data types */

case when dropped_column_ind =1 then ''''''''''

when DataType in ('float','integer' ) and coalesce(DECIMALFRACTIONALDIGITS,-1) =0 then 'case when a.'||COLUMNNAME||' is null then '''''''' else cast(oreplace(trim(a.'||COLUMNNAME||'),''''.'''','''''''') as varchar('|| trim( cast(cast( trim(DECIMALTOTALDIGITS) as Integer)+cast( trim(DECIMALFRACTIONALDIGITS) as Integer) +2 as varchar(50))) ||')) end '

when DataType in ('float','integer' ) and coalesce(DECIMALFRACTIONALDIGITS,-1) >0 then 'case when a.'||COLUMNNAME||' is null then '''''''' else cast(a.'||COLUMNNAME||' as varchar('|| trim( cast(cast( trim(DECIMALTOTALDIGITS) as Integer)+cast( trim(DECIMALFRACTIONALDIGITS) as Integer) +2 as varchar(50))) ||')) end '

when DataType in ('date') then 'case when a.'||COLUMNNAME||' is null then '''''''' else cast(cast(a.'||COLUMNNAME||' as date format''''YYYY-MM-DD'''') as varchar(10)) end '

when DataType in ('integer') then 'case when a.'||COLUMNNAME||' is null then '''''''' else cast(oreplace(trim(a.'||COLUMNNAME||') ,''''.'''','''''''') as varchar(50)) end '

when DataType in ('time') then 'case when a.'||COLUMNNAME||' is null then '''''''' else cast(cast( a.'||COLUMNNAME||' as time(0) format ''''HH:MI:SS'''') as varchar(50) ) end'

when DataType in ('timestamp') then 'case when a.'||COLUMNNAME||' is null then '''''''' else cast(cast(a.'||COLUMNNAME||' as format ''''YYYY-MM-DD HH:MI:SS'''') as varchar(50) ) end '

when DataType not in ('float','integer' ,'timestamp','date', 'time') then 'case when length(trim( a.'||COLUMNNAME||'))=0 then ''''na'''' when length(REGEXP_REPLACE(a.'||COLUMNNAME||', ''''[\s]'''', '''''''', 1, 0, ''''c''''))=0 then ''''na'''' else cast(trim(REGEXP_REPLACE(a.'||COLUMNNAME||', ''''[\s\t\r

\v\f\x]+'''', '''' '''' )) as varchar('||cast( TRIM(COLUMNLENGTH) as integer)||') ) END '

else 'case when a.'||COLUMNNAME||' is null or length(trim( a.'||COLUMNNAME||'))=0 or length(REGEXP_REPLACE(a.'||COLUMNNAME||', ''''[\s]'''', '''''''', 1, 0, ''''c''''))=0 then '''''''' else cast(trim(REGEXP_REPLACE(a.'||COLUMNNAME||', ''''[\s\t\r

\v\f\x]+'''', '''' '''' )) as varchar('||cast( TRIM(COLUMNLENGTH) as integer)||') ) END '

end fast_exp_string,

DataType,

NullInd,

DataSize,

CASE WHEN TRIM((DataType)) IN ('string') and chartype=1 THEN cast( TRIM(COLUMNLENGTH) as integer)

....similar for remaining datatypes

END size_col,

columnid,

max_column_id,

min_column_id,

databasename,

TABLENAME,

dropped_column_ind

from (

select distinct t.*, c2.max_column_id,c2.min_column_id from teradata_db.ddl_hist_table t

join (

select TABLENAME,DATABASENAME, max(columnid) max_column_id, min(columnid) min_column_id

from teradata_db.ddl_hist_table group by DATABASENAME, TABLENAME

)c2

ON c2.TABLENAME = t.TABLENAME

AND c2.DATABASENAME = t.DATABASENAME

where LOWER(TRIM(t.TABLENAME) ) = 'table_name'

AND LOWER(t.databasename) ='database_name'

)as a

) as b
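
To give a feel for the output, below is a simplified example of the kind of single-column export SELECT this generator might emit for a two-column table; the column names, sizes and NULL handling are illustrative assumptions, and the real generated statement follows the quoting and replacement logic shown above.

/* Simplified illustration of a generated export statement: each column is cast to
   varchar with NULL handling, columns are concatenated with the '#^#' placeholder,
   problem characters and line breaks are cleaned, and the placeholder is finally
   replaced with the '~' delimiter used in the exported CSV files. */
select cast(oreplace(regexp_replace(regexp_replace(regexp_replace(otranslate(
       case when a.customer_id is null then '' else cast(trim(a.customer_id) as varchar(52)) end
       ||'#^#'||
       case when a.customer_name is null then 'na' else cast(trim(a.customer_name) as varchar(100)) end
       ,' ~&\*/"',' ______'),'\r\n',' '),'\n',' '),'\r',' '),'#^#','~') as varchar(152)) string
from database_name.table_name a;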

Export BQ JSON schema

The JSON schema is generated dynamically based on the DDL history table and has backward and forward compatibility.
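
As with the TPT script, the schema can be assembled from the DDL history table with a simple generation query; the sketch below shows the idea using the column names introduced earlier, and it is an illustration rather than our exact script.

/* Sketch: emit one BigQuery JSON schema field per column from the DDL history table. */
SELECT
    '{"name": "' || TRIM(COLUMNNAME) ||
    '", "type": "' || TRIM(DataType) ||
    '", "mode": "' || CASE WHEN NullInd = 'NOT NULL' THEN 'REQUIRED' ELSE 'NULLABLE' END ||
    '"}' AS bq_schema_field,
    columnid
FROM teradata_db.ddl_hist_table
WHERE LOWER(TRIM(TABLENAME)) = 'table_name'
  AND LOWER(databasename) = 'database_name'
ORDER BY columnid;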