The ultimate objectives of data capture are, of course, insight and action. For many organizations, however, the gulf remains wide between data capture and well informed data-driven decisions.

As a discipline, data modeling is intended to foster a conscious, outcome-oriented approach to data capture and enable meaningful reporting, visualization, and analysis. More specifically, data modeling can help developers and data professionals to:

fully inventory and understand the available and needed data in their target domains

anticipate the types of analysis to be performed on the data

facilitate presentation and reporting

Data modeling can apply to a wide range of domains. In this blog post, we concentrate on modeling Google Analytics e-commerce data integrated with other back-end retail data.

We approach the retail data model in four phases:

Integrating online and offline data sources, we map out a normalized schema in BigQuery. We populate the normalized schema for staging in BigQuery We replicate the normalized data in BigQuery in a denormalized schema, also within BigQuery but optimized for fast reads as a presentation layer. We then outline several types of analysis that the data model will facilitate: Customer Sentiment Analysis

Customer Service Performance

Online Promotion Influence on Offline Sales (and Vice Versa)

Churn Pervention for Marketing

Determining Real Campaign Impact

Customer Segmentation

Customer Lifetime Value Prediction

Along the way, we take advantage of Google Cloud Platform’s data modeling services to generate additional dimensions against the data already included in the model.

Data Lake, Data Warehouse, Data Mart, Data Model

Before we dive in, let’s review some essential data management concepts:

Data Lake: fairly unstructured, comprehensive data repository. Your data lake might contain files in Cloud Storage or Google Drive or transactional data in Cloud Bigtable.

Data Warehouse: structured, transformed data repository, designed with defined uses cases in mind.

Data Mart: subset of the data warehouse typically oriented towards a specific team, business function, or KPI.

Data Model: model that organizes elements of data and their interrelationships.

A data model can help to clarify requirements for a Data Warehouse or a Data Mart.

Unified, Efficient Presentation Layer

The retail data model that we define in this post will enable a unified presentation layer in BigQuery. We can aggregate and transform the external data sources using a range of Google Cloud Platform components, including:

Cloud Functions to transport data to Google Cloud Storage for staging

Cloud Dataflow to move data from Google Cloud Storage to BigQuery

Cloud Composer to orchestrate data pipelines jobs

The specific architecture is guided by the pipelines that the data model will reference.

The resulting data model, and the physical data warehouse or data marts that it can help to shape, will offer the following benefits:

unification: the data model will consist of online, in-store, and back-end data.

efficiency: since the data model is itself built around different reporting requirements, consumers of the corresponding data marts will be able to efficiently query and present only the data that is needed for reporting and analysis.

Normalized-to-Denormalized Structure

We’ll start the data model in a normalized form but then move to denormalization:

Normalized form represents a systematic approach of decomposing tables to eliminate data redundancy (repetition) and anomalies due to record insertion, update, and deletion. To minimize redundancy, normalized form relies on strong relationships between tables.

represents a systematic approach of decomposing tables to eliminate data redundancy (repetition) and anomalies due to record insertion, update, and deletion. To minimize redundancy, normalized form relies on strong relationships between tables. Denormalized form represents a strategy applied to previously normalized database tables in order to increase performance. With some redundancy and decreased write performance, read performance is improved.

So normalized form favors storage efficiency and relations between entities, while denormalized form represents read efficiency. The first layer of our data model will be normalized; the second layer will be denormalized but will also take advantage of BigQuery’s nested and repeated field structure to preserve some of the benefits of normalization.

Normalized Model Structure

We design the first layer of the model in a Normalized form in a staging area of BigQuery to organize and cleanse the data. It will encompass the following main areas of data:

Customer

Products

Order Transactions

Customer Service

Loyalty Program

Click Stream

Promotions

Staff

The physical tables for this layer of our data model will reside in BigQuery and will be accessed only by the developers rather than used directly for presentation. While BigQuery is not considered an RDBMS, we’ll be able to take advantage of some relational concepts such as logical joins and primary key enforcement to map the relations between tables and ensure a great degree of data integrity and data quality in this first layer of the retail data model.

(view full diagram – click the link and then click within the main panel to display the diagram)

Note that this model represents a subset of a comprehensive retail data model, which could also encompass inventory, finance, and other aspects of data capture for a retail organization.

As shown in the above diagram, which is the initial design for Retail industry relational data model in normalized form.

BigQuery is designed to support denormalized data more than normalized data. The increase in query speed that denormalized data model offers outweighs the storage savings of normalized data, which is generally not as high as a priority in modern systems.

Denormalization for the Presentation Layer:

Based on the best-practice considerations outlined above, we’ll design the presentation layer to be denormalized. Below is a denormalized BigQuery schema corresponding to the normalized schema above.

Can I ever report from the normalized layer?

There are two main considerations for maintaining and reporting in normalized form without replicating in denormalized form as you design your schema for BigQuery: denormalize if large: if a dimension table is larger than 10 gigabytes, it is a good indicator that you should denormalize the dimension table.

keep normalized if you have frequent UPDATEs and DELETEs: if these operations are frequent, keep the dimension tables normalized and the dimension table smaller than 10 gigabytes. There are two main considerations for maintaining and reporting in normalized form without replicating in denormalized form as you design your schema for BigQuery:

Breaking out Reference Fields in Normalized Form

Whether you’re creating a normalized layer that precedes the denormalized presentation layer or you’re using the normalized data layer directly for presentation, you can enhance performance by breaking out frequently updated fields into their own entity within the normalized schema.

Denormalization for the Presentation Layer:

Based on the best-practice considerations outlined above, we’ll design the presentation layer to be denormalized. Below is a denormalized BigQuery schema corresponding to the normalized schema above.

Product:

Product

Product price

Product discount

Product category

Product category discount

Customer and Contract:

Customer

Customer Reward Point Log

Membership type

Service Request

Service Request

Agent

Service Request Feedback

Sales and Inventory

Order

Order Feedback (other data source because 150-character limitation for CD)

Order Items

E-commerce (Click Stream) Session Hits Checkout Step

Promotions

Sales Performance (online or offline channel)

Marketing Campaign

Customer Journey (first activity date, last activity date – its own data mart)

Note that within the tables above, the e-commerce data originates from Google Analytics and is stored in nested, repeated form.

Workflow: Customer Sentiment Analysis

One of the types of analysis that our data model will enable is the analysis of customer sentiment based on survey inputs.

The end-to-end workflow mapped below will include the transfer of data to Google Cloud Storage, the population of the normalized and denormalized layers of the data model in BigQuery, and the generation of additional fields through Google’s Natural Language API.

Step 1: Preparing Files

The source data will be either files or external databases. In the case of an external database, you can export tables compressed to tar.gz files in preparation for Google Cloud Storage.

Step 2: Transfering Files to Google Cloud Storage

There are several ways to transfer data to a Google Cloud Storage bucket, including the two options below:

ETL Job using any data integration tool, such as Talend Batch / Shell scripts using gsutil command in Google Cloud SDK

$ gsutil cp gs:// /

Step 3: Organizing Transferred Files

The Cloud Storage data lake for the retail data will consist of two buckets:

Staging: A region standard bucket where we receive the tar.gz files and decompress them

Archive: depending on the expected frequency of disaster recovery or the need to query for new KPIs, you can configure the archive as either nearline or coldline: [region & (nearline or coldline)] Nearline bucket: choose if expected access may be once per month Coldline bucket: choose if expected access may be once per year



Step 4: Reading Data from GCS Bucket

You can set up a Cloud Function to fire based on a Cloud Storage trigger for when a file is uploaded for processing. The Cloud Function itself can then trigger an Airflow workflow in Cloud Composer.

You can configure a Cloud Composer workflow to perform the following steps in parallel, as shown in the diagram below:

copying data from Google Cloud Storage buckets to normalized tables in BigQuery

generate additional fields to capture sentiment analysis

Cloud Composer Workflow

The Cloud Composer workflow includes the following specific steps:

Extracts the input file that triggered the workflow. The Airflow microservice within Cloud Composer will manage the following steps: Decompress the input Execute a Cloud Dataflow job that process the file that contains customer review to be sent to Natural Language API. Send the result to BigQuery as the normalized layer of the data model Process the other files to be sent to BigQuery, also within the normalized layer

The Dataflow job includes the following Python code for Apache Beam:

Create Pipeline:

argv = [ '--project={0}'.format(PROJECT), '--staging_location=gs://{0}/staging/'.format(BUCKET), '--temp_location=gs://{0}/staging/'.format(BUCKET), '--runner=DataflowRunner', '--job_name=dataextraction{0}'.format(YESTERDAY), '--save_main_session', '--requirements_file=/home/airflow/gcs/dags/requirements.txt', '--num_workers=1', '--max_num_workers=1' ] p = beam.Pipeline(argv=argv) (p | 'ReadData' >> beam.io.textio.ReadFromText(src_path) | 'ParseCSV' >> beam.ParDo(Sentiment()) | 'WriteToBigQuery' >> beam.io.WriteToBigQuery ('[PROJECT-ID]:[DATASET-ID].[TARGET-TABLE]', write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND) ) p.run()

Send a request to Natural Language API and retrieve the Sentiment, Score, and Magnitude fields:

client = language.LanguageServiceClient() document = types.Document(content=content, type=enums.Document.Type.PLAIN_TEXT) sentiment = client.analyze_sentiment(document).document_sentiment

Customer Sentiment Analysis: measure customer sentiment based on any text input received from the customer. Source: : measure customer sentiment based on any text input received from the customer. Source: Kaggle Datasets score of the sentiment ranges between -1.0 (negative) and 1.0 (positive) and

magnitute indicates the perceived strength of emotion expressed. Longer text responses may generate greater magnitudes.

Sentiment Score Magnitude Clearly Positive* 0.8 > 3.0 Clearly Negative* -0.6 > 4.0 Neutral 0.1 0.0 Mixed 0.0 4.0

The Score and Magnitude fields are returned by the Natural Language API. Sentiment is a calculated field that will vary by use case.

Further analysis can be done using the sentiment results. Below is a sample Data Studio dashboard that visualizes a BigQuery data source that includes sentiment data generated from the Natural Language API.

Additional Use Cases

The presentation layer of the data model that we built above will facilitate the following additional analysis use cases.

Online Promotion Influence on Offline Sales

Understand the influence of online promotions on offline sales and vice versa

Inputs

Channel (Online/Offline)

Total amount

Total number of transactions

Number of visits

Customer Segmentation

Segment customers based on behavior (and then use the segments for website personalization or targeted marketing).

Input

Customer Number

Annual_Revenue,

timeOnScreen,

UniqueScreenViews,

Number of Visits

Loyalty_Program,

Lifetime_Value,

Age

Target

Customer Number

Cluster ID

Determining Real Campaign Impact

Measuring the real impact of a campaign based on a prediction without the campaign if we are running campaign for a specific segment of Customer.

We select a segment of customers similar to the segment who were exposed to a campaign.

We select them from a period before the campaign period, try to avoid any seasonality period. We will use some features from the data in order to get a similar segment:

Gender

Age (Segmented)

Total Purchase amount (Segmented)

Number of Visits

Predictive Modeling

Taking advantage of other Google Cloud Platform services such as AutoML tables, we can generate additional data for analysis based on the original retail fields included in our model.

Customer Service Performance

Understand top/bottom performing customer service representatives – understand how customer service drives retention revenue:

Input customer representative ID channel (online chat, phone, email, other) cases resolved average time to resolution average time to first response satisfaction score sales/order revenue

Target customer representative ID predicted sales/orders predicted/revenue



Churn Prevention for Marketing

Identify users who are most likely to churn and use to target them with suitable messaging.

Input Transactions Payment data Credit card expiration date Product/website usage Daily/Weekly/Monthly Average Users per account Power Users Login frequency Number of features used Number of high value / sticky features used Sentiment Result Customer Sentiment Analysis Result Number of Transactions Total Revenue

Target Customer ID Prediction date Probability (NUM) Prediction of churn (BOOL)



Customer Lifetime Value Prediction

Determine characteristics of users most likely to generate high lifetime value after initial conversion and target marketing to other users who demonstrate similar characteristics.

Input

Customer Number

Annual_Revenue

timeOnScreen

UniqueScreenViews

Number of Visits

Loyalty_Program

Age

Target

Customer Number

Predicted Lifetime value

Build Your Foundation

Data modeling is foundational work that will encourage a cleaner and more deliberate data capture that will facilitate better analysis. Using the capabilities of BigQuery and other components of the Google Cloud Platform, we’re able to not only build robust physical data architectures; we can enrich our data model with additional fields output from predictive modeling and target the most relevant users with the most relevant messages.

For any questions about data modeling, data architecture, and advanced analysis, please contact us today with no obligation.

Sharing is caring!