Data is the new oil that fuels business today!

But if you’re not a data scientist, trying to make sense of all these metrics can be overwhelming. That’s where Qwiklabs can help. Just released: check out the new BigQuery for Data Warehousing quest. BigQuery is an essential tool for managing and analyzing your data, and you don’t need a PhD to use it.

As you learn more about BigQuery and get better at performing data joins, you will want to know how to get out of a pitfall when you face one. Don’t worry, the lab Troubleshooting and Solving Data Join Pitfalls has you covered.

Let us now dive into this lab and see what it has to offer…

Setup and requirements

Setting up this lab is easy. All you need to do is open the Google Cloud Console by following the given instructions, then open the BigQuery console and keep it ready for use.

Create a new dataset to store your tables

The first task of the lab is straightforward: create a dataset. Just follow the instructions, and once you are done you will see the ecommerce dataset listed under your project.
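The lab creates the dataset through the console UI, but BigQuery DDL can do the same thing from the query editor. This is a sketch; the dataset name ecommerce matches the lab, while the description text is just illustrative:

```sql
-- Create the lab dataset via DDL instead of the console UI
CREATE SCHEMA IF NOT EXISTS ecommerce
OPTIONS (description = 'Dataset for the data join pitfalls lab');
```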

Pin the Lab Project in BigQuery

Next, pin the lab project in BigQuery. You are going to query its datasets frequently in this lab, and pinning makes life easier, especially as you apply what you learn to your everyday tasks and work with multiple projects. Now you are ready to do some real work in BigQuery.

Examine the fields

As you will be working with the ecommerce dataset, have a look at the fields of all the tables in it. This will help you analyze the queries as you move through the lab. Once you are familiar with the fields, try to identify key fields in the tables that could be useful for joins.
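Besides clicking through each table’s schema tab, you can list every column programmatically with INFORMATION_SCHEMA. The project, dataset, and table names below are assumptions based on this lab’s public dataset:

```sql
-- List column names and types for one table in the lab dataset
SELECT column_name, data_type
FROM `data-to-insights.ecommerce.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'all_sessions_raw'
ORDER BY ordinal_position;
```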

Examine the Records

In this step you will examine the records stored in the dataset. Keep an eye on the pagination at the bottom right corner, which shows the number of pages and the number of rows displayed per page. You can use the keyword DISTINCT to show only unique records in the table. It is possible that many SKUs have multiple product names, and many product names have multiple SKU values.

Find the SKUs associated with one product name, and vice versa, by running the queries given in the lab instructions.
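The lab supplies the exact queries; a sketch of the “product names per SKU” check might look like this, assuming the web analytics table is all_sessions_raw with columns productSKU and v2ProductName:

```sql
-- SKUs that map to more than one distinct product name
SELECT
  productSKU,
  COUNT(DISTINCT v2ProductName) AS name_count
FROM `data-to-insights.ecommerce.all_sessions_raw`
GROUP BY productSKU
HAVING COUNT(DISTINCT v2ProductName) > 1
ORDER BY name_count DESC;
```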

Pitfall: Non-Unique key

We are going to use the SKU as the key for joining the tables, so if SKUs are not unique, joins on that column will produce duplicate rows.

First, run a query to find the product names associated with a particular SKU and check whether that SKU always maps to the same product. You will find that the SKU is associated with different product names, but the underlying product is the same.
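A sketch of that spot check, with the table and column names assumed as before and a placeholder where you would paste a real SKU from your earlier results:

```sql
-- All product names recorded for one SKU
-- ('YOUR_SKU' is a placeholder, not a real value from the dataset)
SELECT DISTINCT v2ProductName
FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE productSKU = 'YOUR_SKU';
```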

Before moving on to joining the tables, examine whether SKU is unique in the “products” table.
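One way to check uniqueness is to compare the row count with the distinct key count. This sketch assumes the product table is named products with a SKU column:

```sql
-- If total_rows equals distinct_skus, SKU is a unique key in products
SELECT
  COUNT(SKU) AS total_rows,
  COUNT(DISTINCT SKU) AS distinct_skus
FROM `data-to-insights.ecommerce.products`;
```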

Join pitfall: Unintentional many-to-one SKU relationships

We can now JOIN the inventory dataset with the website analytics dataset using the query given in the lab.

It will show the inventory stock level at three times the actual count, one for each matching record in the website dataset. Try modifying the query to check the sum of the records in inventory; you should see a count of 462. This is known as an unintentional many-to-one relationship.
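The fan-out happens because every website row with a given SKU pulls in the same inventory row. A hedged sketch of the problematic aggregation, assuming the table names from before and a stockLevel column in products:

```sql
-- Pitfall: each website row repeats the same inventory row,
-- so summing stock over the join overcounts the true inventory
SELECT
  w.productSKU,
  SUM(p.stockLevel) AS overcounted_stock
FROM `data-to-insights.ecommerce.all_sessions_raw` AS w
JOIN `data-to-insights.ecommerce.products` AS p
  ON w.productSKU = p.SKU
GROUP BY w.productSKU;
```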

Join pitfall solution: Use distinct SKUs before joining

To eliminate the unintentional many-to-one pitfall, use distinct SKU IDs before joining the datasets. Try it by running the query given in the lab: it aggregates the count and returns only one row per distinct SKU ID. Since the raw data holds three product description rows per SKU, you want to keep only one of them in the result table.
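One common shape for this fix is to deduplicate the many-side in a subquery before joining. A sketch under the same table and column name assumptions:

```sql
-- Solution: deduplicate SKUs on the website side before joining,
-- so each SKU matches its inventory row exactly once
SELECT d.productSKU, p.stockLevel
FROM (
  SELECT DISTINCT productSKU
  FROM `data-to-insights.ecommerce.all_sessions_raw`
) AS d
JOIN `data-to-insights.ecommerce.products` AS p
  ON d.productSKU = p.SKU;
```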

Join pitfall: Losing data records after a join

Next, join the inventory dataset with the products dataset again. Run the query given in the lab instructions and you will see only 1090 records returned. You can investigate why fewer records come back by modifying the query to get more specific results.

As the query result shows, not all the SKUs are present in both datasets; the missing ones explain the lower record count.

Join pitfall solution: Selecting the correct join type and filtering for NULL

The default join type is INNER JOIN, which only returns the SKUs present in both datasets. If you want to see all the SKU records regardless of their presence in the inventory dataset, you can try out the other join types, namely LEFT JOIN, RIGHT JOIN, FULL JOIN and CROSS JOIN. Here is a brief summary of the different types of joins:

INNER JOIN returns only the rows whose key appears in both tables. LEFT JOIN returns every row from the left table, with NULLs where the right table has no match. RIGHT JOIN is the mirror image, keeping every row from the right table. FULL JOIN keeps every row from both tables, with NULLs on whichever side has no match. CROSS JOIN returns every combination of rows from the two tables.

The query given in the next step of the lab performs a LEFT JOIN. The results will return all 1909 records from the left-hand dataset, including those with no match on the right.

Let us write a query to filter for the NULL values on the inventory side of the join. This query returns 819 results, which means 819 rows have a NULL value in the SKU column after the join.
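After a LEFT JOIN, a NULL key on the right side marks a left-side row that found no match, so filtering on it isolates the missing SKUs. A sketch, with table and column names assumed as before:

```sql
-- Keep only the left-side rows that found no match on the right
SELECT DISTINCT w.productSKU
FROM `data-to-insights.ecommerce.all_sessions_raw` AS w
LEFT JOIN `data-to-insights.ecommerce.products` AS p
  ON w.productSKU = p.SKU
WHERE p.SKU IS NULL;
```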

To check whether all the SKUs in the products table are present in the website dataset, and vice versa, run the queries given in the lab.

The first query will return zero results because the SKU ID in the query does not exist in the website dataset.

The second query will return two results: SKUs that exist in the products dataset but have no match in the website dataset.

Add a few more fields from the products dataset to the next query to get more details about the SKUs missing from the website dataset.

Why are two products missing from the website dataset? One of them is an “In Store Only” product and the other is a new product.

The next query you will run returns all the SKUs missing from either dataset. It uses a FULL JOIN, which combines the results of a LEFT JOIN and a RIGHT JOIN.
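A hedged sketch of that FULL JOIN, under the same table and column name assumptions as the earlier examples:

```sql
-- FULL JOIN keeps unmatched rows from both sides;
-- a NULL key on either side flags a SKU missing from that dataset
SELECT
  w.productSKU AS website_sku,
  p.SKU AS products_sku
FROM `data-to-insights.ecommerce.all_sessions_raw` AS w
FULL JOIN `data-to-insights.ecommerce.products` AS p
  ON w.productSKU = p.SKU
WHERE w.productSKU IS NULL
   OR p.SKU IS NULL;
```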

Join pitfall: Unintentional Cross Join

If you do not know the exact relationship between the keys of two datasets, you might unintentionally create a cross join, which can significantly reduce query performance.

To see the effect of an unintentional cross join, let’s make this mistake intentionally! First, create a table named site_wide_promotion with a single field, discount.

Add one record with the value 0.05 to the discount table, then run a query to check all the products in the dataset that are on clearance.
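One way to create the table and seed its first row in a single statement is CREATE TABLE AS SELECT; the exact statement is given in the lab, and the table name here is an assumption:

```sql
-- Create the promotion table with a single discount row
CREATE OR REPLACE TABLE ecommerce.site_wide_promotion AS
SELECT .05 AS discount;
```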

Next, add two more rows to the discount table with the values 0.04 and 0.03, then run the query in the instructions to check the clearance products again.
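Adding the extra rows is a plain DML insert; this sketch assumes the same site_wide_promotion table name as above:

```sql
-- Grow the promotion table to three rows to trigger the fan-out
INSERT INTO ecommerce.site_wide_promotion (discount)
VALUES (.04), (.03);
```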

The query will return 246 results, three times the original count of clearance products. Why? Because each product record is joined with every record in the discount table, which now has 3 rows. This is known as an unintentional cross join.
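Written explicitly as a CROSS JOIN, the fan-out is easy to see. The product table and its name column are assumptions, and the lab’s own query may express the same join differently:

```sql
-- Every clearance product pairs with every one of the 3 discount rows,
-- so the result has (number of clearance products) x 3 rows
SELECT p.SKU, p.name, d.discount
FROM `data-to-insights.ecommerce.products` AS p
CROSS JOIN ecommerce.site_wide_promotion AS d
WHERE p.name LIKE '%Clearance%';
```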

This unintentional cross join can happen even when you perform a normal join: if the join key is not unique on one side, rows fan out in the same way.

To prevent this kind of unintentional performance hit, study the data relationships between your datasets before joining. It is not safe to assume that the keys in a dataset are unique.

We hope you have enjoyed this lab and learned how to get out of a pitfall when you find yourself in one. Hop on to the next lab Working with JSON, Arrays, and Structs in BigQuery.

We hope you enjoy BigQuerying! :)