NOTE: The opinions expressed in this article are my own and not necessarily those of my employer

In the 2nd quarter of 2017, a new world opened up for me at spilgames.

That world opened when we moved from our data-center-hosted HP Vertica database to Google BigQuery, Google's fully managed analytical data warehouse.

After working for 19 years with 'traditional' non-MPP databases like Oracle and SQL Server, and with MPP systems like Hadoop Hive and Vertica, I personally believe that Google's approach to cloud data warehousing is the right way forward in this field. I cannot describe what a relief it is for ETL developers / data engineers to work with a real DWAAS (DataWarehouse-As-A-Service).

At the moment of writing (June 2018), and as far as my research goes, BigQuery is still the only FULL DWAAS system available, meaning:

- There are no additional components needed, nor is there any setup or configuration required to get started. Just create a Google account and you're there!

- In contrast with other (cloud) analytical DWH providers, the whole notion of cluster sizes and CPU/RAM configurations does not exist. BigQuery is just there! Some might see this as a disadvantage; I don't. Why bother, when BigQuery can dynamically assign up to 2,000+ 'virtual nodes' (called slots) to your workload? OK, it makes the service a black box. But it also puts the responsibility for keeping it working in Google's hands, not in yours. I mean: most of us (in the West) are not collecting our own water from rain / the ground / a river either, with machines in our homes to make it clean and drinkable. We open the tap and expect clean, drinkable water to be there... It is someone else's responsibility to do that work for us, and they are also accountable for it!

- Things like indexes, creating and managing statistics, and query hints don't exist in BigQuery. Based on usage patterns, BigQuery actually moves data around in its Colossus file system and auto-optimizes for usage. Some might see this as a disadvantage; again, I don't. Think about it: would you really care about doing all that work just to keep the damn system working and performing? And then doing it all over again once your data volumes or usage patterns change? How many consultancy hours have been lost over the last decades doing all that work instead of adding business value?

- Compute and storage are fully separated in BigQuery: the Dremel compute engine and the Colossus storage system are connected by an extremely fast network fabric (Jupiter). Data is always stored multi-tenant. Data is not backed up, but you can 'undelete' data up to 2 days in the past using snapshot decorators. So: no filegroups, no disk provisioning, no running out of storage. And since June 2018 you can also migrate away from your difficult-to-maintain Hive instance by importing ORC- and Parquet-formatted data...

- The payment model of BigQuery is pay-for-use, meaning: storing 1 TB of uncompressed data costs 20 USD per month (and 10 USD per month once a table or partition has had no inserts/updates/deletes for 90 days). Querying 1 TB of uncompressed data costs 5 USD. TIPS: as the storage model is columnar, you only pay for the columns you actually SELECT, so don't run blind SELECT * on huge datasets; and if you just want to see what a table contains, use the preview option: it incurs no costs... A sketch of these tips follows after this list.
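To make those tips concrete, here is a minimal Standard SQL sketch; the project, dataset, table, and column names are hypothetical:

    -- Only the referenced columns are scanned and billed,
    -- because BigQuery's storage format is columnar.
    SELECT
      user_id,
      revenue
    FROM `my-project.analytics.game_events`
    WHERE event_date = '2018-06-01';

    -- Avoid this on huge datasets: it scans (and bills) every column.
    -- SELECT * FROM `my-project.analytics.game_events`;

The query validator in the web UI also tells you how many bytes a query will process before you run it, so you can check the cost of a query up front.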

Points of improvement

No system is perfect, of course. Although I see the progress that has been made since that 2nd quarter of 2017, there are some things that need to be improved in the future:

One confusing aspect of BigQuery is the fact that it supports two SQL dialects: what is now called 'Legacy SQL', which does NOT follow the ANSI standard, and 'Standard SQL', which DOES. Standard SQL is relatively new, though, which means that not all the functionality from Legacy SQL has been ported to it yet. It is the branch where all the new development is happening, however, and thus the advised dialect to use. An example of how recently Standard SQL matured: the set operators EXCEPT and INTERSECT were only added in November 2017... An example where you still have to use Legacy SQL is the earlier-mentioned snapshot decorators. Both dialects are sketched below.
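For illustration, a sketch of both dialects; the table and column names are hypothetical:

    -- Standard SQL: note that BigQuery requires the DISTINCT keyword
    -- after the set operators that were added in November 2017.
    SELECT user_id FROM `my-project.analytics.all_users`
    EXCEPT DISTINCT
    SELECT user_id FROM `my-project.analytics.churned_users`;

    -- Legacy SQL: a snapshot decorator reading the table as it was
    -- one hour (3600000 ms) ago. Decorators only work in this dialect.
    SELECT user_id FROM [my-project:analytics.all_users@-3600000];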

Partitioning is not fully mature yet. Currently only ingestion-time and date/time column partitioning are supported. This is especially relevant because partitioning is THE way to control costs: when you filter on time partitions, only the partitions that are part of the result set are charged. You can use sharding though, a feature we use to store game event data per game_id; the pseudo column _TABLE_SUFFIX then lets you filter on individual games / 'push down' to individual shards. This works, but it is not optimal. Both patterns are sketched below.
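A sketch of both patterns; the table and column names are hypothetical:

    -- Time partitioning: only the partitions matching the filter on the
    -- _PARTITIONTIME pseudo column are scanned and charged.
    SELECT event_name, COUNT(*) AS events
    FROM `my-project.analytics.events`
    WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2018-06-01') AND TIMESTAMP('2018-06-07')
    GROUP BY event_name;

    -- Sharding: one table per game (events_<game_id>), queried through a
    -- wildcard table; _TABLE_SUFFIX 'pushes down' to individual shards.
    SELECT event_name, COUNT(*) AS events
    FROM `my-project.analytics.events_*`
    WHERE _TABLE_SUFFIX IN ('1234', '5678')
    GROUP BY event_name;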

Related to the previous point: the whole pay-for-use model does take time to get used to. Trying to keep all your SQL logic virtualized on top of raw event data is therefore a no-go, even if performance allows for it; materializing intermediate results instead, as sketched below, quickly pays off. There is flat-rate pricing, but you need to have 40,000 USD of costs before that becomes attractive...
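A minimal sketch of that materialization pattern, using the Standard SQL DDL that became available around the time of writing (the names are hypothetical):

    -- Materialize a daily aggregate once; subsequent queries hit this
    -- much smaller (and thus much cheaper) table instead of raw events.
    CREATE TABLE `my-project.analytics.daily_revenue` AS
    SELECT
      DATE(event_timestamp) AS day,
      game_id,
      SUM(revenue) AS revenue
    FROM `my-project.analytics.raw_events`
    GROUP BY day, game_id;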

The only limits we have encountered are that BigQuery's 'virtual nodes' can run out of memory when analytical partitioning queries are highly skewed (looking at their partitioning key), and that extremely CPU-heavy queries can also fail with a 'resources exceeded' warning; a hypothetical example of the first case follows below. These examples show that having nothing to tune or configure can also be a downside.
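A hypothetical example of the kind of analytic query that can hit those memory limits: if a single game_id accounts for the bulk of the rows, the slot processing that partition has to handle all of its rows at once.

    -- Skewed analytic (window) query: one oversized game_id partition
    -- can exhaust the memory of the single slot processing it.
    SELECT
      game_id,
      event_timestamp,
      ROW_NUMBER() OVER (PARTITION BY game_id ORDER BY event_timestamp) AS rn
    FROM `my-project.analytics.raw_events`;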

While BigQuery is awesome for huge datasets, the biggest disadvantage I encounter is that even for small datasets you always have 3 to 4 seconds of initialization time. So currently, for these kinds of workloads, it definitely loses to other cloud data warehouses that always have nodes assigned to your instance. The BigQuery team is aware of this weak point though: as far as I can see, they are working on it...

Conclusion

All things considered, I am extremely happy with BigQuery as an MPP data warehouse. It will be tough to go back to a traditional system that just has all this 'overhead of work' that you as an ETL specialist / DWH engineer don't REALLY want to be bothered with...

Last note: trying BigQuery out is really easy: https://cloud.google.com/bigquery/docs/tutorials

If you have any BigQuery questions: just ask them below! I will do my best to answer them!