The following article is an abridged version of our new guide to Data Lakes and Data Lake Platforms – get the full version for free here.

If you’re working with data in any capacity, you should be familiar with Data Lakes. Even if you don’t need one today, the rapid growth of data and demand for increasingly versatile analytic use cases (such as reporting, machine learning, and predictive analytics) could result in your organization outgrowing its data infrastructure much sooner than you currently foresee.

We’ve prepared this quick guide to help get you ready for that day, and keep you from asking embarrassing questions such as “what kind of data lake do you think we should buy?” To understand what’s wrong with that line of reasoning, let’s jump right into:

What is a Data Lake?

An architecture, not a product

The first thing to understand, and one that often confuses people who come from a database background, is that the term “data lake” most commonly describes a certain type of big data architecture rather than a specific product or component. The confusion is made worse by the fact that certain products are actually called Data Lake, and by the general mess of terminology and vendor noise in the big data space.

However, as we’ve previously explained in an article on KDnuggets, the term itself is meaningful in describing a specific architectural paradigm: when working with massive amounts of data, we opt to store raw data in its original form in one centralized repository (the ‘lake’), and later use this repository to support a broad range of use cases.

Store now, analyze later

The core tenet of the data lake approach is to separate storage from analysis. Data lakes ingest streams of structured, semi-structured and unstructured data, and store that data as-is, without imposing a schema at write time.
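To make "store now, analyze later" concrete, here is a minimal sketch of schema-less ingestion. It is an illustration under assumptions, not a description of any particular product: the local directory standing in for object storage, the `source/YYYY/MM/DD/events.jsonl` layout, and the function names `ingest` and `read_raw` are all hypothetical. Payloads are assumed to fit on a single line.

```python
from datetime import datetime, timezone
from pathlib import Path
import tempfile


def ingest(lake_root: Path, source: str, raw_payload: str, received_at: datetime) -> Path:
    """Append one raw payload, untouched, to a date-partitioned file."""
    # Hypothetical layout: <lake_root>/<source>/YYYY/MM/DD/events.jsonl
    partition = lake_root / source / received_at.strftime("%Y/%m/%d")
    partition.mkdir(parents=True, exist_ok=True)
    target = partition / "events.jsonl"
    # No parsing and no validation: the payload is stored as received.
    with target.open("a", encoding="utf-8") as f:
        f.write(raw_payload.rstrip("\n") + "\n")
    return target


def read_raw(lake_root: Path, source: str) -> list[str]:
    """Return every raw line stored for a source, in path order."""
    lines: list[str] = []
    for path in sorted((lake_root / source).rglob("*.jsonl")):
        lines.extend(path.read_text(encoding="utf-8").splitlines())
    return lines


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        now = datetime.now(timezone.utc)
        ingest(root, "clickstream", '{"user": 1, "page": "/home"}', now)
        ingest(root, "clickstream", "a line that is not JSON at all", now)
        print(read_raw(root, "clickstream"))
```

Note that the second payload is accepted even though it is not valid JSON. Deciding what to do with it is deferred to whoever reads the data later, which is the trade-off this approach makes.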

This is a sharp deviation from traditional analytics, where we would first pick the use case (transactional, reporting, ad-hoc analytics, etc.), build the database in the way best suited to support it, and then structure the data accordingly.

Once the data has been stored and we’ve decided what we would like to do with it, it can be output to other systems that make it workable for various consumer applications: data warehouses, machine learning, BI tools, NoSQL databases and dozens of other platforms that manage, integrate and structure the data for analysis.

Typical Use Cases for Data Lakes

Data lakes are helpful when working with streaming data – event-based data streams generated continuously, such as by IoT devices, clickstream tracking, or product/server logs. Typically these are small records in very large quantities, in semi-structured format (often JSON).

Deployments of data lakes will usually address one or more of the following business use cases:

Business intelligence and analytics – analyzing streams of data to identify high-level trends and granular, record-level insights.

Data science – unstructured data opens up more possibilities for analysis and exploration, enabling innovative applications of machine learning, advanced statistics and predictive algorithms.

Data serving – data lakes are usually an integral part of high-performance architectures for applications that rely on fresh or real-time data, including recommender systems, predictive decision engines or fraud detection tools.

Advantages of data lakes

In the scenarios we just described, there are numerous advantages to a data lake approach:

Resource optimization: by decoupling (cheap) storage from (expensive) compute resources, data lakes can be tremendously cheaper than databases when working at high scale.

Less ongoing maintenance: the fact that data is ingested without any kind of transformation or structuring means it’s easy to add new sources or modify existing ones without having to build custom pipelines.

Broader range of use cases: data lakes give organizations more flexibility in how they eventually choose to work with the data, and can support a broader range of use cases, since you are not limited by the way you chose to structure your data upon its ingestion.

Data lake challenges

Despite these advantages, it is an unfortunate truth that most big data projects fail (85% according to some estimates). Common obstacles include:

Technical complexity: Not only are most data lake architectures not self-service for business users, they are not self-service even for experienced developers – instead requiring a team of dedicated data engineers to maintain the multitude of building blocks, pipelines and moving parts that compose a data lake architecture. As we’ve previously written, this process kind of sucks.

Slow time-to-value: Data lake projects can drag on for months or even years, creating a resource drain that is increasingly difficult to justify.

Data swamps: Storing raw data provides a high level of flexibility, but forgoing all data governance and management principles can lead organizations to hoard massive amounts of data that they will likely never use, while making it more difficult to access the data that could actually be useful.
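As a rough illustration of the governance that keeps a lake from turning into a swamp, the sketch below keeps a tiny catalog of what is in the lake, who owns it, and when it was last read. Everything here is hypothetical: the `Catalog` and `DatasetEntry` names, the rule that every dataset needs an owner and a description, and the 90-day idle threshold are assumptions chosen for the example, not an established standard.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Dict, List, Optional


@dataclass
class DatasetEntry:
    name: str
    owner: str
    description: str
    last_read: Optional[date] = None  # None means nobody has read it yet


@dataclass
class Catalog:
    entries: Dict[str, DatasetEntry] = field(default_factory=dict)

    def register(self, name: str, owner: str, description: str) -> None:
        # Refuse anonymous, undocumented data: that is how swamps start.
        if not owner or not description:
            raise ValueError("every dataset needs an owner and a description")
        self.entries[name] = DatasetEntry(name, owner, description)

    def record_read(self, name: str, when: date) -> None:
        self.entries[name].last_read = when

    def stale(self, today: date, max_idle_days: int = 90) -> List[str]:
        """Names of datasets never read, or not read within the idle window."""
        cutoff = today - timedelta(days=max_idle_days)
        return sorted(
            name
            for name, entry in self.entries.items()
            if entry.last_read is None or entry.last_read < cutoff
        )


if __name__ == "__main__":
    catalog = Catalog()
    catalog.register("clicks", "web-team", "raw clickstream events")
    catalog.register("old_logs", "ops", "legacy server logs")
    catalog.record_read("clicks", date.today())
    print(catalog.stale(date.today()))
```

A list of stale datasets does not say what to delete, but it does turn "data we will likely never use" into something a team can review.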

Data lake deployments

While data lakes are often associated with Hadoop, the terms are far from interchangeable. In fact, it would be safe to say that most new data lakes are being built in the Cloud (with Amazon Web Services leading the pack, followed by Microsoft Azure and Google Cloud Platform). For these reasons Gartner has recently predicted that Hadoop implementations are approaching obsolescence.

However, our experience indicates that organizations that work with truly massive data (in the tens to hundreds of petabytes) might be deterred by the costs of storing and processing all of that data in the Cloud. We are therefore still seeing large data lake implementations built on HDFS on-premises, so we are not so quick to proclaim the death of Hadoop.

What is a Data Lake Platform?

A Data Lake Platform (DLP) is software meant to address the data lake challenges we’ve covered in the previous section – complexity, sluggish time to value, and overall messiness. The DLP addresses these challenges by:

Unifying data lake operations and combining building blocks to create simpler data lake architectures – instead of a patchwork of glued-together systems, a DLP offers a single platform that takes care of everything from data management and storage to processing, ETL jobs and outputs.

Enforcing best practices and optimizing data flows, and thus replacing months of manual coding in Apache Spark or Cassandra with automated actions managed through a GUI.

Improving performance and resource utilization throughout storage, processing and serving layers.

Data Lake Platforms enable developers without an extensive big data background to create a complete pipeline from incoming data streams to structured data that can be queried using SQL or other analytic tools. By doing so, DLPs enable organizations to generate more business value from their data lakes, at a faster pace.
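The end state described here, raw event streams turned into structured data that answers SQL queries, can be sketched in a few lines. This is only a toy illustration of the target shape, not how any DLP works internally: the sample events, the `page_views` table and the function names are hypothetical, and an in-memory SQLite database stands in for a real query engine.

```python
import json
import sqlite3

# Hypothetical raw events as they might sit in the lake, one per line.
RAW_EVENTS = [
    '{"user_id": 1, "page": "/home", "ms": 120}',
    '{"user_id": 2, "page": "/pricing", "ms": 340, "referrer": "ad"}',
    '{"user_id": 1, "page": "/pricing", "ms": 95}',
    "corrupted line",
]


def load_structured(raw_lines, conn):
    """Parse raw JSON lines and load the fields we care about into a table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page_views "
        "(user_id INTEGER, page TEXT, ms INTEGER)"
    )
    loaded = 0
    for line in raw_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # the schema is applied on read, so bad lines are skipped here
        if not isinstance(event, dict):
            continue
        # Extra fields such as "referrer" are ignored for this use case.
        conn.execute(
            "INSERT INTO page_views VALUES (?, ?, ?)",
            (event.get("user_id"), event.get("page"), event.get("ms")),
        )
        loaded += 1
    conn.commit()
    return loaded


def views_per_page(conn):
    """Count views per page with plain SQL."""
    query = "SELECT page, COUNT(*) FROM page_views GROUP BY page ORDER BY page"
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    load_structured(RAW_EVENTS, connection)
    print(views_per_page(connection))
```

The raw lines stay untouched in the lake; a different use case could read the same events and keep the `referrer` field instead.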

Components of a Data Lake Platform

Unifying data lake operations and combining building blocks to create simpler data lake architectures – instead of a patchwork of glued-together systems, a DLP offers a single platform that takes care of everything from data management and storage to processing, ETL jobs and outputs.

Enforcing the dozens of best practices you need to master for a data lake, and thus replacing months of manual coding in Apache Spark or Cassandra with automated actions managed through a GUI.

Improving performance and resource utilization throughout storage, processing and serving layers.

Providing governance and visual data management tools.

Want to learn more?

Get the full guide now to read about:

Data lakes vs databases

Use case and architecture examples

Overview of the Upsolver Data Lake Platform
