Background

Data analysis workflows built on pandas often need to write results to a database as an intermediate or final step. Two considerations drive how I write those results: I only want to insert new records into the database, and I don't want to offload this processing to the database server, because it is cheaper to do on a worker node.

Problem

We only want to insert new rows from a pandas dataframe into a database, ideally identifying them in memory so that the insert is as fast as possible.

Proposed Solution

Create a function that takes a dataframe, a database connection, and a table name, and returns a dataframe containing only the unique rows that are not already in the database table.
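
A minimal sketch of such a function is shown below. It is one possible approach, not a definitive implementation. The function name rows_not_in_table, the key_cols parameter, and the results table are hypothetical names chosen for illustration, and an in-memory SQLite database stands in for the real database. The sketch assumes that rows are identified by one or more key columns, and that those keys fit in the worker's memory.

```python
import sqlite3

import pandas as pd


def rows_not_in_table(df, con, table, key_cols):
    """Return the rows of df whose key_cols values are not yet in table.

    Hypothetical helper: the name and the key_cols parameter are assumptions.
    """
    # Drop duplicate keys within the incoming dataframe first.
    candidates = df.drop_duplicates(subset=key_cols)

    # Fetch only the key columns, so the comparison runs on the worker.
    # Identifiers are interpolated into the SQL, so they must be trusted values.
    cols_sql = ", ".join(f'"{c}"' for c in key_cols)
    existing = pd.read_sql_query(f'SELECT DISTINCT {cols_sql} FROM "{table}"', con)

    # Left anti-join: keep the rows that appear only in the dataframe.
    merged = candidates.merge(existing, on=key_cols, how="left", indicator=True)
    new_rows = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")
    return new_rows.reset_index(drop=True)


# Usage with an in-memory SQLite table that already holds ids 1 and 2.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE results (id INTEGER PRIMARY KEY, value TEXT)")
con.executemany("INSERT INTO results VALUES (?, ?)", [(1, "a"), (2, "b")])

df = pd.DataFrame({"id": [2, 3, 3], "value": ["b", "c", "c"]})
new_rows = rows_not_in_table(df, con, "results", key_cols=["id"])

# Append only the rows that the table does not already contain.
new_rows.to_sql("results", con, if_exists="append", index=False)
```

Reading back only the key columns keeps the transfer small, and the comparison itself happens in pandas rather than on the database server. For very large tables, the key set may need to be filtered or fetched in chunks, which this sketch does not attempt.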