As the size of your data grows, implementing an incremental load becomes more attractive, since it avoids a full scan of the original table. In this article, we explain a simple way to perform this task. The first two sections explain how to do it, while the third section covers how to update old data. Throughout, we make the following assumptions:

- MySQL as the storage of the original data.
- BigQuery as the data warehouse.

With those assumptions in place, we'll proceed statement by statement, with a description of each. For the complete procedure and more details (for example, the modules we're using), I suggest you check out the GitHub code. We're also going to focus on the key steps, so for things like establishing the connection to the database or setting up the credentials, a quick Google search will do.

1. Retrieving fresh data from MySQL

# Modules used throughout (see the GitHub code for the full setup)
import datetime
import pandas as pd
import sqlalchemy as db
from google.cloud import bigquery

bqclient = bigquery.Client()

# Obtain the most recent date in the Dataset
sql_bq = '''SELECT MAX(datevariable) FROM Dataset'''
v = bqclient.query(sql_bq).result().to_dataframe()

# Filter the necessary values from the date variable
v = str(v.iloc[interval])  # 'interval' is your position/slice for the date part

# Retrieve data from MySQL
engine = db.create_engine('mysql+pymysql://User:Key@IP')
sql_sg = """SELECT desirableVariables
            FROM Dataset
            WHERE datevariable > '{}' """.format(v)

# Reading with pandas
df_from_query = pd.read_sql_query(sql_sg, engine)

We're assuming that you use a TIME-type variable to identify your most recent observation. For example, in retail, it could be the date when a purchase was performed. In this case, we use the variable v to perform this operation, assuming the date format "Y-m-d H:M:S", which is why a .iloc is convenient for retrieving the part of the date that you'll be using.
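
As a concrete illustration (the name v_df and the slice positions here are hypothetical; in the snippet above the DataFrame is simply called v, and you'd adjust the slicing to your own format):

# Sketch: the MAX() query returns a one-cell DataFrame, so iloc picks out the value.
v_df = bqclient.query(sql_bq).result().to_dataframe()
last_ts = str(v_df.iloc[0, 0])   # e.g. '2021-03-15 08:42:10'
date_only = last_ts[:10]         # keep just 'Y-m-d' if that's the grain you need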

We see that performing an incremental load is just a matter of variable substitution through str.format(). Next, we use pandas to read and write the data to GBQ, because it gives us the option of specifying a schema with the variables' types, as we see in the snippet in the next section.
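
As a side note, if you'd rather not interpolate values directly into SQL strings (str.format() is fragile and open to injection), SQLAlchemy supports bound parameters; a minimal sketch using the same names as above:

from sqlalchemy import text

# Sketch: the same incremental filter with a bound parameter instead of str.format().
sql_sg = text("SELECT desirableVariables FROM Dataset WHERE datevariable > :watermark")
df_from_query = pd.read_sql_query(sql_sg, engine, params={'watermark': v})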

2. Append data to Dataset in GBQ

from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    'path_to_Credential')
project_id = 'Id of project'

# Criteria of Dataset (to_gbq expects the destination in 'dataset.table' form)
table_id = 'Dataset'
schema = [{'name': 'variable1', 'type': 'STRING'},
          {'name': 'variable2', 'type': 'FLOAT'},
          {'name': 'variable3', 'type': 'INTEGER'},
          ....
          {'name': 'variableN', 'type': 'DESIRED_TYPE'}]

# Data Load
df_from_query.to_gbq(table_id, project_id=project_id,
                     if_exists='append',
                     table_schema=schema,
                     credentials=credentials)

In the above code, we notice two major things. First, the variable table_id holds the destination where we're writing the fresh data. Second, if_exists='append' adds this new data to the existing table. Specifying a schema is useful because you're appending data to an already existing table. Now, let's look at the case where we need to refresh data dated before the maximum date variable v.
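
If you want a quick sanity check that the append landed (the table name here follows the placeholders above), one hedged option is to count the rows after the load:

# Sketch: count rows in the destination table after the load (name is a placeholder).
count_sql = 'SELECT COUNT(*) AS n FROM Dataset'
n_rows = bqclient.query(count_sql).result().to_dataframe()['n'][0]
print(f'{n_rows} rows now in Dataset')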

3. In case you need to refresh previous rows

# Time variable
v2 = datetime.datetime.strptime(v, '%Y-%m-%d %H:%M:%S')
v3 = v2 - datetime.timedelta(days=n,
                             hours=v2.hour,
                             minutes=v2.minute,
                             seconds=v2.second)

# Data from the last n days
sql_sgt = """SELECT desiredVariables
             FROM Dataset
             WHERE datevariable >= '{fecha1}'
             AND datevariable <= '{fecha2}' """.format(fecha1=v3, fecha2=v)

df_n_days = pd.read_sql_query(sql_sgt, engine)

In the first two lines of code, when the variable v has non-zero values in the hour, minute, or second, you subtract them so that you consider data from 00:00:00 of the desired day. We're also taking a time interval [v3, v], which covers the dates whose data we'd like to update. For example, a status-type variable needs updating when the condition of the row has changed.
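
To make the arithmetic concrete (the date here is made up for illustration):

# Worked example: with v = '2021-03-15 08:42:10' and n = 7
v2 = datetime.datetime.strptime('2021-03-15 08:42:10', '%Y-%m-%d %H:%M:%S')
v3 = v2 - datetime.timedelta(days=7, hours=v2.hour, minutes=v2.minute, seconds=v2.second)
print(v3)  # 2021-03-08 00:00:00 -- midnight, seven days earlier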

# Temporal table for merging
table_id_t = 'Dataset_n_days'
schema = [{'name': 'variable1', 'type': 'STRING'},
          {'name': 'variable2', 'type': 'FLOAT'},
          {'name': 'variable3', 'type': 'INTEGER'},
          ....
          {'name': 'variableN', 'type': 'DESIREDTYPE'}]

# Replace the auxiliary table entirely on each run
df_n_days.to_gbq(table_id_t,
                 project_id=project_id,
                 if_exists='replace',
                 table_schema=schema,
                 credentials=credentials)

Using an auxiliary table to perform a merge is a convenient way to update the desired dataset; otherwise, you would need a full scan of the table to perform the update with pandas. Thus, Dataset_n_days contains the data from the previous n days that you'll want to update.
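
One optional refinement (a hedged sketch; the table path is a placeholder): give the auxiliary table an expiration so it cleans itself up if the job ever stops running:

# Sketch: auto-expire the auxiliary table after one day (path is a placeholder).
table = bqclient.get_table('your_dataset.Dataset_n_days')
table.expires = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(days=1)
bqclient.update_table(table, ['expires'])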

# Merge Dataset and Temporal Table
sql_merge = '''MERGE Dataset D
               USING Dataset_n_days DND
               ON D.id = DND.id
               WHEN MATCHED THEN
                 UPDATE SET D.variable1 = DND.variable1,
                            D.variable2 = DND.variable2,
                            ....,
                            D.variableN = DND.variableN
               WHEN NOT MATCHED THEN
                 INSERT (id) VALUES (DND.id)
            '''

# Run the merge in BigQuery and wait for it to finish
query_job = bqclient.query(sql_merge,
                           location='location_of_GCP_services')
query_job.result()

Lastly, when there's no match for a row, we simply insert the id variable as a 'ghost' operation, so the statement stays valid without bringing in any other columns.
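
If you instead want genuinely new rows to arrive with all their columns populated, the NOT MATCHED branch can list them explicitly; a hedged variant (column names follow the placeholders above):

# Sketch: insert full rows for unmatched ids instead of the 'ghost' insert.
sql_merge_full = '''MERGE Dataset D
                    USING Dataset_n_days DND
                    ON D.id = DND.id
                    WHEN MATCHED THEN
                      UPDATE SET D.variable1 = DND.variable1,
                                 D.variable2 = DND.variable2
                    WHEN NOT MATCHED THEN
                      INSERT (id, variable1, variable2)
                      VALUES (DND.id, DND.variable1, DND.variable2)
                 '''
bqclient.query(sql_merge_full, location='location_of_GCP_services').result()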

4. Final Considerations

Depending on the flow of your data (by the minute, hour, or day) or how you schedule this task, it could happen that there is no new data to append. If that happens, it's useful to use the next snippet of code:

if df_from_query.shape[0] != 0:
    # proceed with the steps above
    ...
else:
    print("There's no new data to append")

Finally, there's room for improvement in this guideline code, so if you have an observation or comment, or you catch an error, please leave a comment.