“90% of the data ever created was created in the last 2 years” - unknown

Why change?

“The data is getting bigger and we just can’t cope. No matter how many more blades we add to the server, the queries still run slowly. We've stripped out all the unneeded fields, only uploaded two years' worth of data, optimised the cubes so only the most relevant dimensions are available and finally tuned the queries so they would only pull back the most frequent requests. We can barely get this data set running, and management now want it merged with external sales data to measure the impact of their latest campaign. Do they realise how difficult that is? Taking it offline into MS Excel/Access might work around the problem for now.”

Granted, maybe this is a bit hypothetical (or maybe it isn't). One thing is for certain: the way we did data warehousing and business intelligence in the past may not continue to work for us in the future. Disruptive technologies such as Digital, Social and the Internet of Things (IoT) mean that the sheer volume and variety of data is becoming overwhelming.

How do you fish for business insights in an ocean of data? How can you deal with the ever-growing volume of internal data being created by your core systems such as SAP? How do you pool this data with the rest of the ocean and fish it for new business insights that could never be obtained before?

This post will not answer these questions, but it may give you a glimmer of hope. It is based on a short feasibility project to understand whether it is possible to replace SAP’s Business Warehouse with a big-data solution, namely Google BigQuery.

Best to get my caveat in upfront. I'm not in any way attempting to marginalise the skills, intelligence and sheer tenacity required to build, tune and run a data warehouse (I have many SAP BW, SAS and Cognos scars to show from my previous endeavours). I'm merely attempting in this post to challenge the status quo and maybe to think outside (or even burn) the box.

This pilot would not have been possible without specialist knowledge and expertise from Appsbroker on BigQuery and report writing and TCS for data extraction from SAP. I would highly recommend working with these guys.

Apache Hadoop versus Google BigQuery?

OK, now for the very short history of Big Data (apologies for any inaccuracies). Once upon a time, two opposing technologies, Apache Hadoop and Google BigQuery, traced their family tree and found that they were actually related. They both had the same daddy - Google.

Hadoop was originally based on whitepapers that Google released in 2003 and 2004, introducing the Google File System and MapReduce, both designed to run on commodity hardware. Google had reportedly disclosed a 'diluted' version of the established technology they had been using internally for many years.

Doug Cutting and Michael Cafarella created Hadoop, and Cutting continued its development after joining Yahoo in 2006. The Apache Software Foundation adopted this open-source technology and released it on the world. Hadoop took some time to gain traction but has now developed into a really strong product with a massive ecosystem of supporting tools.

Virtually all of the commercial ‘Big Data’ platforms offered by leading vendors (check the list for yourself) are based on open-source Hadoop. These vendors provide the cloud storage, infrastructure, management and processing power required to productise Hadoop.

BigQuery, on the other hand, may be a more legitimate ‘Big Data’ heir. You don’t have to purchase any additional licences or configure any infrastructure to use it. You simply pay for what you use and Google manage the rest. Google may also have kept their best technology for their own product.

The BigQuery stack

I've seen business intelligence (BI) diagrams with up to 10 layers, which I find a bit daunting and confusing. For the purpose of this post I’ll simplify the stack to 3 layers:

- The ETL (Extract, Transform, Load) layer [on-premise] - automatically gets the transactional/master data from the source (SAP ECC6 in our case), cleanses and organises it, then loads it into a database. We needed a BigQuery-compatible ETL tool to do this. We chose Talend (open-source), although other tools such as Informatica are also compatible.

- The database layer [cloud] - holds the data in logical tables so that it can be 'interrogated'. BigQuery provided the muscle for this part.

- The reporting layer [cloud] - presents the results of the 'interrogation'. Again, a number of BigQuery-compatible tools are available. We chose BIME for the pilot. For simple regular reports we were also able to connect Google Sheets as a (zero-cost) reporting layer. Many reporting tools (for different purposes/mining/analysis techniques) could be simultaneously connected to the same 'one version of the truth' BigQuery database.
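To make the three layers concrete, here is a minimal sketch of the extract-transform-load flow in plain Python. The field names, cleansing rules and in-memory "warehouse" are purely illustrative - in the pilot the ETL layer was Talend and the database layer was BigQuery, not this toy code.

```python
# Illustrative sketch of the ETL flow only - not the pilot's actual Talend jobs.

def extract() -> list[dict]:
    """ETL layer: pull raw transactional rows from the source (SAP in the pilot).
    Hard-coded sample rows stand in for the real extract."""
    return [
        {"doc": "4711", "amount": "100.50", "currency": "GBP "},
        {"doc": "4712", "amount": "-20.00", "currency": "gbp"},
    ]

def transform(rows: list[dict]) -> list[dict]:
    """Cleanse and organise: normalise text fields and convert types."""
    return [
        {
            "doc": r["doc"],
            "amount": float(r["amount"]),
            "currency": r["currency"].strip().upper(),
        }
        for r in rows
    ]

def load(rows: list[dict], table: str, warehouse: dict) -> None:
    """Database layer: append cleansed rows to the target table
    (BigQuery played this role in the pilot)."""
    warehouse.setdefault(table, []).extend(rows)

warehouse: dict = {}
load(transform(extract()), "sales_2016", warehouse)
print(len(warehouse["sales_2016"]))  # 2
```

The reporting layer (BIME or Google Sheets in our case) would then sit on top, querying the loaded tables.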

5 Challenges to overcome

- The ETL process was (as expected) the most challenging aspect of the project. It also consumed the most time (70%) and resource. Included in this challenge was the BigQuery data schema design. Do not underestimate this step: clever data architecture is required to make it work - all the BI skills I commended earlier come into play.

- To reduce query time and cost, staging tables must be included in the BigQuery design. Separate data tables were created for each year for reporting purposes. BigQuery allows for table unions, and this feature can also be used to add additional data sets, such as 'forecasts'. BigQuery also supports a syntax that allows other datasets (e.g. customer, consumer, social) to be easily joined to existing data without requiring pre-processing or reloading.

- The SAP Java Connector (JCo) is critical to the ETL process. It took approximately 5 minutes to upload one month's data. Version 3+ of the connector allows for loading large volumes of data; if you have to use version 2.0 you will need to chunk the data into smaller loads.

- BigQuery caching both decreases response time and lowers processing costs. It is very important to ensure that a high caching ratio (90%+ works) is built into the design.

- Incremental updates and the ability to restate history (retrospectively re-configure master-data hierarchies) had to be handled. Incremental updates relied on the (indexed) time-stamp field recording when the record was created in SAP. Master data tables were held separately and joined to the transactional data as reports were executed (BigQuery is capable of supporting joins). This allowed for flexible master data changes.
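The per-year tables, unions and time-stamp-based incremental filter described above can be sketched as simple SQL builders. The dataset, table and field names (`sap_dw`, `sales_YYYY`, `created_at`) are hypothetical stand-ins, not the pilot's actual schema.

```python
# Hypothetical sketch of the yearly-table union and incremental-load filter.
# All identifiers are illustrative, not the pilot's real BigQuery schema.

def union_years_query(dataset: str, years: list[int]) -> str:
    """Build one query over the separate per-year tables via UNION ALL."""
    selects = [f"SELECT * FROM `{dataset}.sales_{y}`" for y in years]
    return "\nUNION ALL\n".join(selects)

def incremental_filter(base_query: str, last_loaded: str) -> str:
    """Restrict a query to records created after the last successful load,
    mirroring the indexed SAP creation time-stamp approach."""
    return (
        f"SELECT * FROM ({base_query}) "
        f"WHERE created_at > TIMESTAMP('{last_loaded}')"
    )

q = union_years_query("sap_dw", [2014, 2015, 2016])
print(q.count("UNION ALL"))  # 2
```

Because master data is held in separate tables and joined at query time, restating history is just a change to the joined master-data table rather than a reload of the transactional data.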

Success and reasons to continue

We successfully created an end-to-end automated reporting solution: SAP-Talend-BigQuery-BIME (& Sheets). The BIME report results reconciled with the equivalent SAP BW report.

We didn't even test the full power of BigQuery, as a full year's worth of data from SAP barely crossed BigQuery's free allowance of resources. We didn't even touch the sides of what was possible. We have no doubt that we could load all of our historical and ongoing data with no drop in performance.

Reports ran c.20 times faster than the equivalent reports in SAP BW (running on a multi-blade server), at less than 10 seconds each. The BigQuery database layer itself was much faster still; most of the 10 seconds was spent in the reporting front end (which I'm sure could be further tuned). After the first time the reports were run on fresh data they were instantaneous (caching kicked in).
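The "instantaneous on repeat" behaviour works roughly like memoising on the query text: BigQuery serves a repeated, unchanged query from its result cache at no charge. This toy sketch illustrates the effect - `run_query` is a stand-in, not the real BigQuery API.

```python
# Toy illustration of query-result caching, not the BigQuery client API.
import functools

@functools.lru_cache(maxsize=None)
def run_query(sql: str) -> str:
    # Stand-in for scanning the tables (the part that takes time and is billed).
    return f"result-for:{sql}"

run_query("SELECT SUM(amount) FROM sales_2016")  # first run: scans, billed
run_query("SELECT SUM(amount) FROM sales_2016")  # repeat: served from cache
print(run_query.cache_info().hits)  # 1
```

This is why designing for a high cache-hit ratio (the 90%+ figure mentioned earlier) pays off twice, in both latency and cost.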

BIME could be used to provide Executive reports on any mobile device. Imagine instant access to the pulse of the business from an iPhone.

The cost savings of doing BI in this way are significant when compared to traditional methods. With proper caching, the BigQuery costs are a small fraction of the running and infrastructure costs most of us are used to. This also means that a step change in the volume and variety of data sources is economically viable.

To replace SAP BW in its entirety with BigQuery would be a 6-month project, with ETL and BigQuery data architecture accounting for the lion's share of the time.

Conclusion

Our pilot was successful and suggests that BigQuery is a viable option for enterprise BI. The glimmer of hope that I mentioned earlier is more than just a glimmer. Imagine joining all those disparate data pools (financial, customer, campaign, social, IoT) into an enterprise data ocean and being able to mine it for rapid, clear insights that were never possible before. Imagine doing this with a clear cost-benefit advantage.



This is the end of this post. It is based completely on my own personal opinions. In previous posts I have covered how to introduce interventions to kick-start a Google Apps transformation and measure it, tips for getting off Lotus Notes, and how to successfully manage an enterprise-wide change programme using Simon Sinek's methodology. I hope to keep posting on related topics if I continue to get the interest I have received so far - thank you!