
Trade Me, like many companies around the globe, is leveraging the capabilities of the public cloud. The breadth of managed services and the cost-saving opportunities are just a few of the advantages the cloud brings to businesses.

This was a collaborative project between Business Intelligence and Data Engineering, and this blog post - written by Maikel Penz (Data Engineer) and Lashwin Naidoo (BI Developer) - shares our experience building a data warehouse in the cloud with Snowflake.

Overview

Snowflake is a relational SQL data warehouse provided as a service. It runs on top of AWS, Azure, or Google Cloud. There is no infrastructure management involved, which means that BI developers spend their time delivering business value rather than installing, maintaining, and patching software.

The scope of this project was to understand Snowflake and deliver business value using a targeted but large data set. BI and Data Engineering each played their part: writing the ETL and building the integrations around the technology to feed data in and make it available to consumers.

Moving data into Snowflake

Trade Me runs a Data Lake stored on AWS, which teams around the business use as a powerful data source for building data-driven products. Our goal is to continuously push data from the Data Lake into Snowflake.

The Data Lake

Events are logged from Trade Me’s website and sent to S3. A Data Pipeline (let’s call it the Lake Pipeline) kicks off to process the data files and make them available for consumption.

The Data Lake follows a tiered storage architecture. We split the data into multiple S3 buckets, each with its own retention period, holding data generated at different stages of the Lake Pipeline. This model helps us save on storage costs and gives us access to both raw and processed data files.

Data Lake tiers

Snowpipe

Snowflake provides a continuous data ingestion service called Snowpipe. The service is managed by Snowflake; we only need to call their API and pay for the compute resources used to push data through their pipeline.

Our requirements:

The call to Snowpipe must be plugged into the Lake Pipeline, and we want to send files from the Persisted tier before they arrive in the Lake.

The Lake Pipeline must not be affected when we pause/resume Snowpipe.

In case of failure in the Snowpipe pipeline, we must be able to replay files without duplicating data in the Lake.

There are two approaches to set up Snowpipe to move files from S3 to Snowflake:

1 - Using a feature called auto ingest. This feature uses a Snowflake-managed SQS queue to ingest files. Messages are pushed to the queue by adding it as an event notification target on an S3 bucket or as a subscription to an SNS topic.

2 - Programmatically calling the Snowpipe REST API.

Approach 2 works better for us, given the requirements listed above. The following image introduces the Lake Pipeline and shows how we plugged in a Lambda Function called Snowflake Ingestion. Subscribing a Lambda Function to an SNS topic adds great flexibility: we can manually invoke the function to replay files and add logic to pause the pipeline, all without interfering with the Lake Pipeline.

Lake Pipeline with Snowpipe
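As a rough sketch of what a function like Snowflake Ingestion can look like, here is a Lambda handler built on the snowflake-ingest Python package’s SimpleIngestManager. The account, user, pipe name, key loading, and event shape are all hypothetical placeholders, not our production setup.

```python
import json


def staged_paths_from_sns(event):
    """Extract S3 object keys from SNS records wrapping S3 event
    notifications (a hypothetical event shape for illustration)."""
    paths = []
    for record in event.get("Records", []):
        s3_event = json.loads(record["Sns"]["Message"])
        for s3_record in s3_event.get("Records", []):
            paths.append(s3_record["s3"]["object"]["key"])
    return paths


def load_private_key():
    """Placeholder: load the PEM private key for Snowflake key-pair
    auth (e.g. from AWS Secrets Manager)."""
    raise NotImplementedError


def handler(event, context):
    # Imported inside the handler so the parsing above stays
    # dependency-free; snowflake-ingest provides the REST client.
    from snowflake.ingest import SimpleIngestManager, StagedFile

    manager = SimpleIngestManager(
        account="trademe",                    # hypothetical account
        host="trademe.snowflakecomputing.com",
        user="PIPELINE_USER",                 # hypothetical service user
        pipe="RAW.PUBLIC.EVENTS_PIPE",        # hypothetical pipe name
        private_key=load_private_key(),
    )
    files = [StagedFile(path, None) for path in staged_paths_from_sns(event)]
    if files:
        manager.ingest_files(files)
```

Because the handler only reacts to SNS messages, replaying a file is just a matter of invoking the function again with the same payload, which is what makes this approach satisfy the replay requirement above.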

User Access

Authentication

Authenticating users to Snowflake can be done in multiple ways, from simple username/password to more sophisticated methods like Federated Authentication.

From the beginning we knew that maintaining users directly on Snowflake wouldn’t scale well as more analysts started using the tool. With this in mind, the obvious preference was to authenticate users through Trade Me’s existing user management platform.

As mentioned above, Snowflake provides Federated Authentication. This means that Snowflake acts as the service provider (SP) and the company as the identity provider (IdP). Fortunately, Trade Me’s Single Sign-On is done through Okta, and Okta provides native support for Snowflake.

Integrating Okta with Snowflake is straightforward. Without diving into the details, here is what needs to be done:

1 - On Okta: create a Snowflake application and point the sub-domain to the Snowflake account.

2 - On Snowflake: update the account settings to use the certificate and SSO URL generated in the first step on Okta, and enable SSO login.

3 - Once the above is done, you will see this cool “Single Sign On” button on Snowflake’s login page.

Snowflake’s login page with Single Sign On

Gotcha: AD users are not synced with Snowflake automatically, which means every individual still needs to be created on Snowflake with the appropriate IdP email address.

Access Control

Now that people can log into Snowflake, we need to make sure they are allowed to access the right things.

Snowflake provides a set of System-defined Roles to cover administration tasks like user management and billing. However, besides administration you might also want to configure granular access to objects users interact with, like Databases and Warehouses.

Following the same security principles as most databases, on Snowflake you can configure read/write access to Schemas, Tables and Views. In our case, we created an Analysts role with read access to the data, and separate roles for the ETL (data transformation/aggregation) and the Pipeline (covering the movement of files from the Data Lake to Snowflake).

Warehouses exist for compute. Think of them as machines that don’t hold the data permanently but provide CPU and memory to run ETL and ad hoc querying. To help us better understand our costs, we created a warehouse for Analysts, one for the ETL, and another for backfilling. Permissions were set to make sure an Analyst can only use the shared “Analysts Warehouse”.
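To make the role/warehouse split concrete, here is a minimal sketch of the kind of one-off setup an account administrator could run. All object names (ANALYSTS, ANALYSTS_WH, REPORTING) are hypothetical, not our real schema.

```python
# Hypothetical role and warehouse setup mirroring the split described above.
SETUP_STATEMENTS = [
    "CREATE ROLE IF NOT EXISTS ANALYSTS",
    "CREATE WAREHOUSE IF NOT EXISTS ANALYSTS_WH "
    "WITH WAREHOUSE_SIZE = 'SMALL' AUTO_SUSPEND = 60",
    # Read-only access to the reporting database for analysts.
    "GRANT USAGE ON DATABASE REPORTING TO ROLE ANALYSTS",
    "GRANT USAGE ON SCHEMA REPORTING.PUBLIC TO ROLE ANALYSTS",
    "GRANT SELECT ON ALL TABLES IN SCHEMA REPORTING.PUBLIC TO ROLE ANALYSTS",
    # Analysts may only spend compute on their shared warehouse.
    "GRANT USAGE ON WAREHOUSE ANALYSTS_WH TO ROLE ANALYSTS",
]


def run_setup(cursor):
    """Execute the setup with an already-connected cursor
    (e.g. from snowflake-connector-python)."""
    for statement in SETUP_STATEMENTS:
        cursor.execute(statement)
```

Granting USAGE on only one warehouse is what keeps analyst queries (and their cost) confined to the shared “Analysts Warehouse”.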

ETL Architecture

ETL was one of the big decisions to be made as part of this project. Snowflake doesn’t come with a built-in ETL tool, but instead provides an extensive list of third-party ETL partners.

With plenty of options in front of us, we wanted to find a low-cost solution that would also check a few boxes:

Code-driven ETL, to facilitate deployment and help upskill the BI team toward a more cloud/coding approach;

Facilitated workflow definition;

Flexibility to run the ETL from any environment.

In the end we didn’t find a “one size fits all” solution and instead decided to combine a set of technologies. Let’s now go through the ETL architecture in more detail.

Coding

Snowflake comes with a Python connector, allowing us to fire SQL queries at Snowflake from code to transform and aggregate data. Python is a widely used programming language and lets us extend the ETL with libraries and handle errors outside of Snowflake.
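A minimal sketch of this pattern using the snowflake-connector-python package is shown below. The account, user, warehouse, and table names are hypothetical placeholders; the transform itself is a toy daily aggregation, not one of our real ETL steps.

```python
def build_daily_aggregate_sql(source_table, target_table):
    """Build a toy transform: aggregate raw events per day.
    Table and column names are illustrative only."""
    return (
        f"INSERT INTO {target_table} (event_date, event_count) "
        f"SELECT TO_DATE(event_time), COUNT(*) "
        f"FROM {source_table} GROUP BY TO_DATE(event_time)"
    )


def run_transform():
    # snowflake-connector-python; all connection details are placeholders.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="trademe",      # hypothetical account name
        user="ETL_USER",        # hypothetical service user
        password="...",         # or key-pair / SSO auth instead
        warehouse="ETL_WH",
        database="REPORTING",
    )
    try:
        conn.cursor().execute(
            build_daily_aggregate_sql("RAW_EVENTS", "FACT_EVENTS_DAILY")
        )
    finally:
        conn.close()
```

Keeping the SQL in small builder functions like this is also what lets us unit test the transformations without a live Snowflake connection.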

Workflow Definition

Here we introduce Prefect. Prefect is a workflow management system built to combine Python functions into a robust workflow, where tasks can either run in sequence or take advantage of parallel execution. Besides its power, the other strong reason we chose Prefect was the low effort required to get up and running with it.
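A tiny sketch of what a Prefect flow can look like, using Prefect’s core Flow/task API (the 1.x-era API current when this was written). The extract and aggregate steps are stand-ins, not our real ETL logic.

```python
def extract():
    """Stand-in for pulling raw rows out of Snowflake."""
    return [{"listing": "a", "views": 2}, {"listing": "b", "views": 3}]


def total_views(rows):
    """Stand-in for an aggregation step run against the extract."""
    return sum(row["views"] for row in rows)


def build_flow():
    # Prefect wires plain functions into a dependency graph; tasks
    # run in order (or in parallel where the graph allows).
    from prefect import Flow, task

    extract_task = task(extract)
    aggregate_task = task(total_views)
    with Flow("snowflake-etl") as flow:
        rows = extract_task()
        aggregate_task(rows)
    return flow


# flow = build_flow()
# flow.run()  # executes tasks in dependency order
```

Because the task bodies are plain Python functions, they can be tested in isolation before being wired into the flow.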

Environment

Finally, we had to figure out where to run the ETL and how to schedule it. Prefect provides its own scheduling system, but that would mean keeping a machine running 24/7. To get around this, enter Docker, AWS ECR and AWS Batch.

1 - We package our Python functions along with the Prefect workflow and generate a Docker image.

2 - This image gets pushed to AWS ECR (AWS’s managed container registry).

3 - A job definition gets created on AWS Batch, using the previously uploaded image. AWS Batch dynamically allocates the resources to run the container, so there’s no need to run a machine all the time.

4 - We create a schedule through CloudWatch to fire jobs to AWS Batch and run the ETL.
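Steps 2–4 above can be sketched with boto3. The account ID, region, job queue, and job definition names below are hypothetical placeholders for the resources created in steps 2 and 3.

```python
def ecr_image_uri(account_id, region, repo, tag="latest"):
    """Build the ECR image URI for the image pushed in step 2
    (all values hypothetical)."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo}:{tag}"


def submit_etl_job():
    # boto3's Batch client; queue and job definition names are
    # placeholders for the resources registered in step 3. In our
    # setup the CloudWatch schedule plays this role on a timer.
    import boto3

    batch = boto3.client("batch", region_name="ap-southeast-2")
    return batch.submit_job(
        jobName="snowflake-etl",
        jobQueue="etl-queue",
        jobDefinition="snowflake-etl-definition",
    )
```

Since AWS Batch allocates compute per job, nothing is billed between scheduled runs, which is the whole point of avoiding a 24/7 Prefect server.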

The following image summarizes our approach.

ETL with Snowflake

Data Manipulation

Snowflake uses its own in-house SQL. That said, we found querying in Snowflake intuitive and natural. The query language is based on the ANSI SQL standard, which Snowflake has extended to cater for analysis and working with semi-structured data.

One particularly nifty feature is the LATERAL JOIN. This concept is similar to SQL Server’s CROSS APPLY and Oracle’s LATERAL joins, but can also be used to parse through semi-structured data.

In practice, parsing nested JSON and arrays can be tricky. It’s usually done in code outside of the database, iterating through the nested sets with nested loops and transforming them into tabular form for loading into the data warehouse. The LATERAL JOIN in Snowflake works much like other joins, letting you read through nested subsets by joining them and returning a standard result set.
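To illustrate what a lateral join over nested data does, here is a plain-Python analogue of Snowflake’s FLATTEN table function, with a sketch of the equivalent query in the comment. The payload shape and column names are made up for illustration.

```python
# Conceptually, a query like the following:
#
#   SELECT l.value:listing_id, b.value:amount
#   FROM raw_events e,
#        LATERAL FLATTEN(input => e.payload:listings) l,
#        LATERAL FLATTEN(input => l.value:bids) b
#
# joins each row to every element of its own nested arrays,
# returning a flat, tabular result set - the same effect as the
# nested loops below.
event = {
    "payload": {
        "listings": [
            {"listing_id": 1, "bids": [{"amount": 10}, {"amount": 12}]},
            {"listing_id": 2, "bids": [{"amount": 5}]},
        ]
    }
}

rows = [
    (listing["listing_id"], bid["amount"])
    for listing in event["payload"]["listings"]  # first FLATTEN
    for bid in listing["bids"]                   # second, lateral FLATTEN
]
# rows == [(1, 10), (1, 12), (2, 5)]
```

The difference is that in Snowflake this unnesting happens inside the warehouse, in SQL, instead of in application code before loading.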

Parquet files from the Data Lake are ingested into Snowflake. We land the data in a variant column, stored in a JSON semi-structured format.

Snowflake profiles and optimizes reads of JSON-formatted data. This gave us two approaches for transforming the data into facts and dimensions through our ETL.

1 - The first option was the traditional approach: splitting the data out into columnar form as transient tables, then running transformations on this tabular data.

Traditional approach

2 - For the second approach, we created views on top of the semi-structured data. This removed the step of copying the data into intermediate tables before the transformation step.

Reading off views

After testing both approaches we settled on the first option. In practice, running transformations on the semi-structured data was not as fast as on tabular data. The slowdown became more apparent as we scaled up to larger data sets and ran both sets of ETL end to end.

Tooling

Snowflake provides JDBC and ODBC drivers which connect to a myriad of third-party clients and visualization tools. These drivers cater for authentication via SSO and Okta.

Querying

The Snowflake Worksheets query interface is more than adequate for running ad hoc queries. If you’re looking for additional query features, though, you can easily connect your favorite non-proprietary query tool. We found that DBeaver worked quite well.

Power BI Integration

Trade Me runs Power BI dashboards and reports through a Power BI Gateway. Setting up Snowflake as a data source was fairly straightforward. Power BI, much like other modern visualization tools, has two options for reading data: import mode, where all report data is cached in memory on a scheduled refresh, and direct query, where queries are sent and data retrieved as users interact with the report. Further to this, Power BI and other visualization tools usually run parallel queries against the data warehouse to fetch data more quickly.

Visualization tools optimize under the hood for fast report response times. Coupled with the fact that Snowflake charges by query time, tuning reports to minimize costs becomes a significant matter. It’s important to minimize query times by choosing the connection method suited to the report’s data set size and expected usage. In our case we created a compute warehouse to monitor Power BI report usage, and analysed the report query history to determine the most cost-effective approach for each report.

Final Notes

Snowflake is an excellent tool and seems to be a good fit for Trade Me. We found the UI simple to interact with, and the flexibility to integrate Snowflake with external tools/pipelines puts us in a good place to expand and take on more projects like this in the future.

This project put us in front of new technologies, and we believe that besides the continuous push to learn and explore new tools, we also need to revisit our methods to be ready for the cloud. Alongside the wins we did face some surprises, especially around the cost incurred on large backfills and ETL. Trade Me has a strong on-premises background, and the move to the cloud will force us to improve the way we currently operate.

To sum up, the overall outcome of the project was very positive: we now understand the technology much better, and the small frustrations taught us where to be cautious with it in the future.