Overview

Your organization's marketing department is requesting marketing-related data for analysis by the growing data science group. You have been chosen as project leader for two important projects that support this new initiative.

The data comes in multiple types from a wide variety of sources. It is intended to provide a 360-degree view of customer buying habits by ingesting social media data, clickstreams, location data, log files, and other customer marketing data. The plan is to use a new data architecture framework called a “Data Lake”.

The marketing group states that the Data Lake can combine customer data from the current CRM platform with social media analytics, a marketing platform that includes buying history, and incident tickets. Together these would help the business understand its most profitable customers, the causes of customer churn, and the promotions or rewards that will increase loyalty. Some of this data may already be available from another corporate data warehouse, Customer Sales, which is stored in an Amazon Redshift data warehouse on AWS.

Confronted with massive volumes and heterogeneous types of data, you realize that delivering insights in a timely manner will require a data storage and analytics solution that offers more agility and flexibility than the current data management systems. The architecture team is recommending the Amazon Web Services Data Lake framework, a new and increasingly popular way to store and analyze data that addresses many of these challenges. The Data Lake will allow the organization to store all of its data, structured and unstructured, in one centralized repository. Since data can be stored as-is, there is no need to convert it to a predefined schema, and you no longer need to know beforehand what questions you want to ask of your data.
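The idea of storing data as-is and deciding on a schema only when a question is asked can be illustrated with a minimal sketch. The sample records, field names, and the helpers read_records and project are hypothetical and exist only for illustration; they are not part of any AWS service.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List

# Hypothetical raw records, kept exactly as each source produced them.
RAW_LINES = [
    '{"source": "clickstream", "customer_id": "c1", "page": "/shoes"}',
    '{"source": "social", "customer_id": "c2", "text": "love it", "likes": 3}',
    '{"source": "crm", "customer_id": "c1", "tier": "gold"}',
]


def read_records(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Parse each JSON line without enforcing any schema."""
    for line in lines:
        yield json.loads(line)


def project(records: Iterable[Dict[str, Any]], fields: List[str]) -> List[Dict[str, Any]]:
    """Apply a schema at read time: keep the requested fields, None when absent."""
    return [{field: rec.get(field) for field in fields} for rec in records]


if __name__ == "__main__":
    # The question ("which tier is each customer in?") is chosen only now.
    for row in project(read_records(RAW_LINES), ["customer_id", "tier"]):
        print(row)
```

Records that lack a requested field still appear in the result with a None value, so a new question can be asked of old data without reloading or reshaping it.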

The Data Lake needs to support the following:

Collect and store any type of data, structured and unstructured, at any scale and at low cost by storing it in the AWS S3 repository

Secure and protect all data stored in the AWS S3 repository

Allow users to search for and find relevant data in the AWS S3 repository

Allow the data science group to quickly and easily perform data analysis on datasets with no preset schema
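One possible way to address the storage, protection, and search requirements above is to give every object a predictable key and to request server-side encryption when it is written. The sketch below only builds the request parameters. The function build_put_request, the raw/source=.../dt=... key layout, and the bucket name are assumptions made for illustration; Bucket, Key, ServerSideEncryption, and Metadata are parameter names accepted by the S3 PutObject operation.

```python
from datetime import date
from typing import Dict, Optional


def build_put_request(bucket: str, source: str, day: date, filename: str,
                      tags: Optional[Dict[str, str]] = None) -> Dict[str, object]:
    """Build PutObject parameters for one raw file (hypothetical helper)."""
    # Assumed key layout: grouping by source and date lets objects be
    # found later by listing a key prefix.
    key = f"raw/source={source}/dt={day.isoformat()}/{filename}"
    return {
        "Bucket": bucket,
        "Key": key,
        "ServerSideEncryption": "AES256",  # ask S3 to encrypt the object at rest
        "Metadata": dict(tags or {}),      # free-form descriptive metadata
    }


if __name__ == "__main__":
    request = build_put_request(
        "example-marketing-lake",          # hypothetical bucket name
        "clickstream",
        date(2024, 1, 15),
        "events.json",
        {"owner": "marketing"},
    )
    print(request["Key"])
```

If boto3 is installed and credentials are configured, a dictionary like this could be passed to an S3 client as put_object(Body=data, **request). Access policies and a searchable catalog would still need to be designed separately.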

The Data Lake will not replace the current Marketing Data Warehouse stored in the Amazon Redshift database. This Data Warehouse currently stores transactional data extracted from several operational systems (mostly PostgreSQL databases) and is used for business intelligence decision making. The Data Lake data will be used by the data science group for ad hoc analytics on unstructured datasets, so they can quickly explore the data and discover new insights without first converting it to a well-defined schema. Meanwhile, the plan calls for a serverless environment in which event-triggered Lambda functions extract, transform, and load some of the Data Lake data into the Marketing Data Warehouse to augment the current conformed transactional data.
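The event-driven extract, transform, and load step described above might look roughly like the following sketch. It assumes the Lambda function is triggered by S3 event notifications and that the incoming files are newline-delimited JSON. The column names and the make_handler, fetch, and load names are hypothetical; in a real deployment fetch would read the object from S3 and load would write rows to Redshift, but both are injected here so the logic can run without AWS access.

```python
import json
from typing import Any, Callable, Dict, List, Tuple
from urllib.parse import unquote_plus

Row = Dict[str, Any]


def objects_from_event(event: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (bucket, key) pairs from an S3 event notification."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 event notifications.
        pairs.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return pairs


def transform(raw: bytes) -> List[Row]:
    """Map newline-delimited JSON onto hypothetical warehouse columns."""
    rows = []
    for line in raw.decode("utf-8").splitlines():
        if not line.strip():
            continue
        rec = json.loads(line)
        rows.append({
            "customer_id": rec.get("customer_id"),
            "event_type": rec.get("type"),
            "amount": float(rec.get("amount") or 0),
        })
    return rows


def make_handler(fetch: Callable[[str, str], bytes],
                 load: Callable[[List[Row]], None]):
    """Build a Lambda-style handler around injected fetch and load callables."""
    def handler(event: Dict[str, Any], context: Any) -> Dict[str, int]:
        loaded = 0
        for bucket, key in objects_from_event(event):
            rows = transform(fetch(bucket, key))
            load(rows)
            loaded += len(rows)
        return {"rows_loaded": loaded}
    return handler
```

Keeping the S3 read and the warehouse write behind injected callables is a design choice for this sketch: it lets the transform be tested locally (for example against LocalStack or plain in-memory data) before any cloud resources are involved.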

To accomplish this daunting initiative, technical services has chosen Commandeer software. Commandeer is a desktop-based cloud management tool that will allow the project's business and data analysts, developers, DevOps personnel, and data scientists to view and edit data, understand the relationships between the Data Lake components, and see summaries for them. These components include DynamoDB, S3, Lambda functions, CloudWatch rules, and other AWS services, both in the cloud and in the local development environment (LocalStack). Commandeer has an easy-to-use, intuitive interface and is an integral part of both projects.