Enterprise companies rely on the “big picture”: an overarching understanding of what is happening in the market in general and around a particular product.

Data analytics makes that information visible through a set of tools that show how things are moving along. As such, it is an indispensable element of the decision-making process.

Understanding the big picture ties every source of information together and presents a clear view of the past, present, and possible future.

In one way or another the big picture affects everything:

day-to-day operations;

long-term planning;

strategic decisions.

The big-picture view is especially important when your company has more than one product and its analytics toolbox is scattered.

One of our clients needed a custom big data analytics system, and that was the task set before the APP Solutions' team of developers and PMs.

ECO: Project Setup

The client had several websites and applications with a similar business purpose. The analytics for each product were separate, so it took considerable time and effort to combine them and assess them as a single overarching view.

The dispersion of the analytics caused several issues:

The information about the users was inconsistent throughout the product line;

There was no real understanding of how target audiences of each product overlap.

The client needed a solution that would gather information from the different sources and unify it in one system.

Our Solution - Cross-Platform Data Analytics System

Since there were several distinct sources of information at play, which were all part of one company, it made sense to construct a nexus point where all the information would come together. This kind of system is called cross-platform analytics or embedded analytics.

Overall system requirements were:

It has to scale easily;

It has to handle big data streams;

It has to produce high-quality analytics from data coming from multiple sources.

In this configuration, the proposed system consists of two parts:

Individual product infrastructure - where data is accumulated;

Data Warehouse infrastructure - where information is processed, stored and visualized.

Combined information streams would present the big picture of product performance and the audience overlap.

The Development Process Step by Step

Step 1: Designing the Data Warehouse

The data warehouse is the centerpiece of the data analytics operation. It is the place where everything comes together and gets presented in an understandable form.

The main requirements for the warehouse were:

Ability to process a large amount of data in real time

Ability to present data analytics results in a comprehensive form.

Because of that, we needed to design a streamlined data flow that would operate without much fuss.

A lot of data comes in as different types of user-related events:

clicks,

conversions,

refunds,

and other input information.

In addition to storing information, we needed to tie it into the analytics system, which required synchronizing the system elements (the individual products) so that the analytics always stay up to date.
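To make the event types above more concrete, here is a minimal sketch of what a single event record could look like before it is stored. The field names (`product`, `user_key`, `payload`) and the `to_row` helper are assumptions made for illustration, not the schema used in the project.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Event types taken from the list above; "other" covers the remaining input.
EVENT_TYPES = {"click", "conversion", "refund", "other"}


@dataclass(frozen=True)
class Event:
    product: str           # website or app that produced the event (assumed field)
    user_key: str          # product-local user identifier (assumed field)
    event_type: str        # must be one of EVENT_TYPES
    occurred_at: datetime  # must be timezone-aware
    payload: dict = field(default_factory=dict)

    def __post_init__(self):
        # Reject malformed events early, before they reach storage.
        if self.event_type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {self.event_type!r}")
        if self.occurred_at.tzinfo is None:
            raise ValueError("occurred_at must be timezone-aware")


def to_row(event: Event) -> dict:
    """Flatten an event into a warehouse-style row with a UTC ISO timestamp."""
    return {
        "product": event.product,
        "user_key": event.user_key,
        "event_type": event.event_type,
        "occurred_at": event.occurred_at.astimezone(timezone.utc).isoformat(),
        "payload": dict(event.payload),
    }
```

Normalizing timestamps to UTC at this point is one way to keep events from different products comparable once they land in the same warehouse.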

We decided to go with cloud infrastructure for its resource management tools and autoscaling features. That made the system capable of sustaining a massive workload without skipping a beat.

Step 2: Refining Data Processing Workflow

The accuracy of data and its relevance are critical indicators of the system working correctly. The project needed a fine-tuned system of data processing with an emphasis on providing a broad scope of results in minimal time.

The key criteria were:

User profile with relevant info and updates

Event history with a layout on different products and platforms
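As an illustration of the two criteria above, the following sketch builds a user profile whose event history is laid out per product. The dictionary keys (`ts`, `product`, `type`) and the `build_profile` function are hypothetical names chosen for this example only.

```python
from collections import defaultdict


def build_profile(user_id, info, events):
    """Return a profile with events grouped by product, oldest first.

    events: iterable of dicts with assumed keys "ts", "product" and "type".
    """
    history = defaultdict(list)
    for ev in sorted(events, key=lambda e: e["ts"]):
        history[ev["product"]].append((ev["ts"], ev["type"]))
    return {"user_id": user_id, "info": dict(info), "history": dict(history)}
```

Grouping by product keeps the per-product timeline readable while still allowing a cross-product view of the same user.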

The system was thoroughly tested to ensure the accuracy of results and efficiency of the processing.

We used BigQuery’s SQL to give the data a proper query interface.

Google Data Studio and Tableau were used to visualize the data in a convenient form because of their flexibility and accessibility.

Step 3: Fine-Tuning Data Gathering Sequence

Before any analytics can happen, the data has to be gathered, and that should be handled with care. The data gathering operation needs a fine-tuned sequence so that everything downstream can work properly.

To collect data from the various products, we developed a piece of JavaScript code that gathers data from different sources. It sends the data over for processing and subsequent visualization in Google Data Studio and Tableau.

This approach is not resource-demanding and is highly efficient for the purpose, which makes the solution cost-effective.

The whole operation looks like this:

Client-side data is gathered by a JavaScript tag

Another part of the data is submitted by individual products server-to-server

The information is sent to the custom analytics server API, which publishes it to the events stream

The data processing application pulls events from the events stream and performs logical operations on the data

The data processing app stores the resulting data in BigQuery
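The sequence above can be sketched end to end with in-memory stand-ins. This is only an illustration of the data flow: a `queue.Queue` plays the role of the events stream and a plain list plays the role of the BigQuery table, and the function names (`analytics_api`, `process`, `run_processor`) are assumptions rather than the project's actual code.

```python
import json
import queue

events_stream = queue.Queue()  # stand-in for the events stream
warehouse_rows = []            # stand-in for a BigQuery table


def analytics_api(raw_body: str, source: str) -> None:
    """Analytics server API: parse an incoming event and publish it to the stream."""
    event = json.loads(raw_body)
    event["source"] = source   # "tag" (client-side) or "server" (server-to-server)
    events_stream.put(event)


def process(event: dict) -> dict:
    """One example of a logical operation: normalize the event type."""
    return {**event, "type": event["type"].strip().lower()}


def run_processor() -> int:
    """Pull every queued event, process it, store it, and return the count."""
    stored = 0
    while True:
        try:
            event = events_stream.get_nowait()
        except queue.Empty:
            return stored
        warehouse_rows.append(process(event))
        stored += 1
```

Calling `analytics_api` once for a client-side event and once for a server-to-server event, then `run_processor()`, moves both events through the stream and into `warehouse_rows`. Decoupling the API from the processor through a stream is what lets each side scale on its own.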

Step 4: Cross-Platform Customer/User Synchronization

The central purpose of the system was to show an audience overlap between various products.

Our solution was to apply cross-platform user profiling based on the digital footprint. That gives the system a unified view of the customer, synchronized across the entire product line.

The solution includes the following operations:

Identification of the user credentials

Credential matching across profiles on different platforms

Merging of the matched profiles into a unified profile that gathers data across the board

Retrospective analysis: analyzing user activity on different products, comparing profiles, and merging the data if there are significant commonalities.
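The matching and merging operations above can be illustrated with a small union-find sketch: profiles that share any credential end up in the same unified profile, even when the link is indirect. This is a simplified stand-in for the project's matching logic, which is not described in detail; the profile layout (`id`, `product`, `credentials`) and both function names are assumptions.

```python
from collections import defaultdict


def merge_profiles(profiles):
    """Group profiles that share any credential.

    profiles: list of dicts such as
        {"id": "shop:1", "product": "shop", "credentials": {"a@x.com"}}
    Returns a list of sets of profile ids, one set per unified profile.
    """
    parent = {p["id"]: p["id"] for p in profiles}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    first_seen = {}  # credential -> first profile id that carried it
    for p in profiles:
        for cred in p["credentials"]:
            if cred in first_seen:
                union(first_seen[cred], p["id"])
            else:
                first_seen[cred] = p["id"]

    groups = defaultdict(set)
    for pid in parent:
        groups[find(pid)].add(pid)
    return list(groups.values())


def audience_overlap(profiles, groups, product_a, product_b):
    """Count unified users that have a profile on both products."""
    product_of = {p["id"]: p["product"] for p in profiles}
    return sum(
        1 for g in groups
        if {product_a, product_b} <= {product_of[pid] for pid in g}
    )
```

With unified profiles in place, the audience overlap between two products is simply the number of unified users present on both.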

Step 5: Maintaining Scalability

The number one priority of any big data operation is the ability to scale according to the required workload.

Data processing requires significant resources to be performed properly. It needs speed (approximately 25 GB/h) and efficiency to be genuinely useful.
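To put the 25 GB/h figure into perspective, the short calculation below converts it into a per-second rate. The decimal definition of a gigabyte and the average event size of 1,000 bytes are assumptions made purely for illustration; the real event size is not stated.

```python
GB = 10**9  # assumption: decimal gigabytes


def throughput(gb_per_hour: float, avg_event_bytes: int):
    """Convert an hourly volume into bytes per second and events per second."""
    bytes_per_sec = gb_per_hour * GB / 3600
    return bytes_per_sec, bytes_per_sec / avg_event_bytes


# avg_event_bytes=1_000 is a hypothetical value, not a measured one.
bytes_per_sec, events_per_sec = throughput(25, avg_event_bytes=1_000)
```

Under these assumptions, 25 GB/h works out to roughly 6.9 MB per second, or about 6,900 events per second sustained.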

The system requirements included:

Being capable of processing large quantities of data within the required timeframe

Being capable of easily integrating new elements

Being open to continuous evolution

To provide the best possible environment for scalability, we used Google Cloud Platform. Its autoscaling features ensure smooth and reliable data processing.

To keep the data processing workflow uninterrupted no matter the workload, we used Apache Beam.

Tech Stack

Google Cloud Platform

Cloud Pub/Sub

Cloud Dataflow

Apache Beam

Java

BigQuery

Cloud Storage

Google Data Studio

Tableau

Project Team

No project would be complete without the team:

Project Manager

Developer

System Architect

DevOps + CloudOps

Conclusion

This project can be considered a big milestone for our team. Over the years, we have worked on different aspects of big data operations and developed many projects that involved data processing and analytics. However, this project gave us a chance to create an entire system from the ground up, integrate it with the existing infrastructure, and bring it all to a completely new level.

During the development of this project, we used more streamlined workflows that made the complete turnaround much faster. Because of that, we managed to deploy an operating prototype of the system ahead of the planned date and dedicated more time to its testing and refinement.