Looking at recent announcements from open source powerhouses Cloudera, Netflix, Databricks, and so on — we notice a definitive trend. All of these efforts are trying to make Data Lakes act more like Data Warehouses — without all the limitations and baggage associated with traditional Data Warehouse storage.

What do we mean by all this? The industry is trying to supercharge data lakes with a number of common properties, namely:

1. Managed metadata. Dataset & table definitions, schema definitions, partitions — hopefully, all of it maintained autonomously and magically. New partitions are created dynamically as needed. Data flows into the right partitions automatically. Everything is always consistent, error-proof, and up to date.

2. ACID operations & snapshot isolation. No read or write locks. All table operations commit and fail atomically. No race conditions. No data or metadata inconsistencies, no half-completed loads, and no “file-not-founds”.

3. Time travel. Ability to go back to a previous state of a table. Ability to un-delete a table. Read from a table as it was 5 days ago. Not to be confused with quantum physics.

4. Mutations. DML operations. Ability to insert, update, or delete individual rows or entire tables. Partition-level mutation locks. Consistent and ACID-compliant. Correlated mutations. Especially convenient for GDPR and the like.

5. DDL operations. The convenience of managing your tables in a declarative SQL-like dialect, as opposed to file jiu-jitsu, command-line tools, and API calls.

6. Integrated security. Native IAM capability with flexible access controls and integrated authentication. Always-on encryption. Frequent key encryption key rotation. Options for self-managed keys. Compliance with HIPAA, PCI, GDPR, and the like.

7. Secure in-place data sharing. Data silos are a crime against humanity. Data sharing in-place via the aforementioned access controls and integrated authentication. Public datasets for public good.

8. Streaming ingest. Ability to DDoS your storage system with millions of rows of data per second — without impacting your query capacity, and without causing performance regression on storage.

9. True separation of compute and storage. No need to hydrate local SSDs to get maximum performance. Hotspot-free reads. No degradation of IO performance with increased concurrency.

10. Interoperability. Open & bottleneck-free interoperability with Hadoop, Spark, pandas, and open source. Run Hadoop and Spark workloads directly on storage, versus a single-threaded bottleneck like JDBC.

11. File-free world. Schema on write. Users see datasets and tables, not files.
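Several of the items above (DDL, time travel, and mutations) are concrete, everyday operations in BigQuery's Standard SQL dialect. The sketch below uses hypothetical dataset, table, and column names (`mydataset.events`, `user_id`, `event_ts`) purely for illustration:

```sql
-- DDL: create a partitioned, clustered table declaratively — no file juggling.
CREATE TABLE mydataset.events (
  user_id STRING,
  event_ts TIMESTAMP,
  payload STRING
)
PARTITION BY DATE(event_ts)
CLUSTER BY user_id;

-- Time travel: read the table as it was 24 hours ago.
SELECT *
FROM mydataset.events
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR);

-- DML mutation: delete one user's rows, e.g. for a GDPR erasure request.
DELETE FROM mydataset.events
WHERE user_id = 'user-123';
```

Note that BigQuery's time-travel window is bounded (on the order of days, not unlimited), so "read from a table as it was 5 days ago" works, but not arbitrarily far back.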
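On the streaming-ingest point: BigQuery's streaming API accepts rows in batches, and a common client-side pattern is to micro-batch incoming rows rather than send them one at a time (BigQuery's guidance suggests keeping batches small, e.g. at most 500 rows per request). A minimal, stdlib-only sketch of that batching, with the actual API call left as a pluggable stub (`insert_fn` stands in for something like `google.cloud.bigquery.Client.insert_rows_json` — the real wiring is an assumption, not shown):

```python
from itertools import islice
from typing import Any, Callable, Dict, Iterable, Iterator, List

# BigQuery streaming guidance suggests batches of at most ~500 rows per request.
MAX_ROWS_PER_REQUEST = 500


def batched(rows: Iterable[Dict[str, Any]],
            size: int = MAX_ROWS_PER_REQUEST) -> Iterator[List[Dict[str, Any]]]:
    """Yield successive batches of at most `size` rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def stream_rows(rows: Iterable[Dict[str, Any]],
                insert_fn: Callable[[List[Dict[str, Any]]], Any]) -> int:
    """Push rows to a streaming endpoint batch by batch; return rows sent.

    `insert_fn` is a stand-in for the real client call (hypothetical wiring).
    """
    sent = 0
    for batch in batched(rows):
        insert_fn(batch)
        sent += len(batch)
    return sent
```

In production you would also handle per-row insert errors returned by the API and retry transient failures; this sketch only shows the batching shape.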

This is quite a wish list, one that users would find universally convenient and powerful. The good news is that such a storage system exists today. In fact, it’s been around since 2012. It’s called Google BigQuery!

BigQuery storage is fully managed, seamlessly scalable, self-optimizing, and open to streaming & Hadoop workloads. In fact, among Data Warehouses, properties 8, 9, and 10 are unique to BigQuery!

BigQuery: best of both worlds

Google BigQuery storage gives users much more than that:

In-memory BI acceleration layer. BI Engine. Unique to BigQuery.

Clustering. Increased data locality and performance. Automatic & free background coalesce, compaction, and re-clustering. Unique to BigQuery. Don’t spend money (via cluster-time or credits) on vacuuming, rebuilding indexes, or auto re-clustering your data. BigQuery just does it behind the scenes, and for free.

Automatic & free batch ingest. Unique to BigQuery. Don’t spend money (via cluster-time or credits) on ingest into your Data Warehouse. Don’t let ingest workloads compete with analytics workloads.

Automatic & free software, hardware, and security patches. No maintenance windows. No “cluster healing” or failed jobs due to updates. Seamless and transparent maintenance.

Storage localization to compute. Unique to BigQuery. Public IaaS and storage do not offer data locality benefits. This is why some Data Warehouses rely on local SSDs for performance. BigQuery is vertically integrated with its storage system, and is able to place bits as close to compute as possible for maximum performance.

Performance. This goes without saying.

Managed ingest. With BigQuery’s Data Transfer Service, all your Google data (like Doubleclick and Google Analytics) automagically appears in BigQuery. Tail GCS or S3 buckets to make data appear in BigQuery.

Streaming ELT. Tight integration with Dataflow, the powerhouse streaming engine, via the BQ Storage API for reads and the BQ Streaming API for writes.

ML in SQL. Declarative machine learning in SQL. Automatic hyperparameter tuning. Data engineering is done via BigQuery queries. A great introduction to ML for SQL practitioners, and an easy-to-use development tool for machine learning experts.

Scale-invariant spreadsheets. Connected Sheets.

Monitoring & auditing. Stackdriver monitoring. Audit logs. Billing reports.
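The “ML in SQL” point above refers to BigQuery ML, where a model is trained and scored entirely in SQL. A hedged sketch, with hypothetical table and column names (`mydataset.customers`, `churned`, and so on):

```sql
-- Train a logistic regression model declaratively, in SQL (BigQuery ML).
CREATE OR REPLACE MODEL mydataset.churn_model
OPTIONS (model_type = 'logistic_reg') AS
SELECT
  tenure_months,
  monthly_spend,
  churned AS label
FROM mydataset.customers;

-- Score new rows with the trained model.
SELECT *
FROM ML.PREDICT(
  MODEL mydataset.churn_model,
  (SELECT tenure_months, monthly_spend FROM mydataset.new_customers));
```

The training data, feature engineering, and prediction all live in queries, which is what makes this approachable for SQL practitioners.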

One thing is clear — users continue to seek more data warehouse-like features in their data lakes. Ultimately, customers want the same thing — a cheap, powerful, seamlessly scalable, and autonomously maintained storage system that takes care of the mundane “tinkering” tasks and lets users focus on their analytics, data processing, and machine learning needs.

If you are in the market for a data lake that does data warehouse things, perhaps you should be thinking of a data warehouse that does data lake things — and if that’s the case, Google BigQuery is the only sensible choice.