Diving in the data lake

Pitfalls of data lakes to avoid

Rapid growth of unstructured data is a serious business challenge for organizations. Data repositories known as data lakes can play an important role in extracting valuable business information from enormous amounts of data.

Storing and processing data on such a scale is a complex and demanding task. Existing RDBMS-based systems struggle with it because they would require too much up-front effort on structuring and validating data. Interpreting unstructured data is also problematic: very often data arrive before we can say anything about them, and we can decide how to interpret them only once other ingested data provide the context.

TL;DR

Pay attention to the common pitfalls of data lakes. These are my hints:

Ingest raw data,

Use metadata and build lineage of your data,

Define and manage an access policy for data and processes,

Optimize for data consumption, not for ingestion and transformation.

What is a data lake?

According to Nick Heudecker at Gartner,

Data lakes are marketed as enterprise-wide data management platforms for analyzing disparate sources of data in its native format. The idea is simple: instead of placing data in a purpose-built data store, you move it into a data lake in its original format. This eliminates the upfront costs of data ingestion, like transformation. Once data is placed into the lake, it’s available for analysis by everyone in the organization.[1]

A lake is a great metaphor explaining big data principles: collecting all possible data, detecting patterns, exceptions and trends with analytics and machine learning. These are the core ideas of data lakes.

This is because one of the basic principles of data science is the more data you get, the better your data model will ultimately be. [2]

This principle fits nicely into the data lake concept. Exposing all stored data makes it possible to model data within the whole context rather than within a limited one. This reduces the error rate and gives a broad view of the data.

Components

Most data lake implementations are probably based on the Hadoop ecosystem (e.g. HDP, CDH), a set of tools that makes it easy to use MapReduce or other computation models. I will describe the individual components so that we understand their responsibilities and the problems that arise from them in data lakes.

Every data lake has some distributed file system. Because it is not possible to structure data at ingestion time, ingested data should be persisted in raw form; later they can be structured by transformation processes. This calls for a dedicated layer that persists unstructured data efficiently. In Hadoop, HDFS fulfills this role.
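As a minimal sketch of what "persist in raw form" can mean in practice, the snippet below writes a payload byte for byte and stores a small metadata file next to it. It uses a local temporary directory as a stand-in for a distributed file system such as HDFS; the function name `ingest_raw`, the directory layout and the metadata fields are assumptions made for illustration only.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def ingest_raw(lake_root: Path, source: str, payload: bytes) -> Path:
    """Persist the payload untouched and record minimal metadata beside it."""
    digest = hashlib.sha256(payload).hexdigest()
    target_dir = lake_root / "raw" / source
    target_dir.mkdir(parents=True, exist_ok=True)

    # No parsing, no validation, no schema: the bytes are stored as received.
    data_path = target_dir / f"{digest}.bin"
    data_path.write_bytes(payload)

    # Hypothetical sidecar file describing where the data came from.
    sidecar = {
        "source": source,
        "sha256": digest,
        "size_bytes": len(payload),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    (target_dir / f"{digest}.meta.json").write_text(json.dumps(sidecar, indent=2))
    return data_path


# A local temp directory stands in for the distributed file system.
lake = Path(tempfile.mkdtemp())
payload = b"\x00\x01 3.27V not-yet-understood"
stored = ingest_raw(lake, "sensor-42", payload)
```

Naming the file after its content hash makes repeated ingestion of the same payload idempotent, which is one possible design choice rather than a requirement.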

To build ingestion and transformation processes, we need a computation system that is fault-tolerant, easily scalable, and efficient at processing large data sets. Nowadays, streaming systems such as Spark, Storm and Flink are gaining in popularity. In the early days of big data, only MapReduce was available, which was (and still is) used as a batch-processing framework.

Scalability in a computation system requires resource management. In a data lake, we have huge amounts of data, which can require thousands of nodes. Prioritization is achieved by allocating resources and queuing tasks: some transformations require more resources, some require fewer, and major tasks get more. In Hadoop, this resource allocation role is performed by YARN.
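The idea of allocating resources and queuing tasks can be illustrated with a toy scheduler. This is not how YARN works internally; it is a simplified sketch in which the task names, the notion of "slots" and the `schedule` function are all hypothetical.

```python
import heapq


def schedule(tasks, total_slots):
    """Group tasks into waves that fit into the cluster capacity.

    tasks: list of (priority, name, slots); a lower priority number runs first.
    Returns a list of waves, each wave being the names that run together.
    """
    for _, name, slots in tasks:
        if slots > total_slots:
            raise ValueError(f"{name} can never fit into {total_slots} slots")

    queue = list(tasks)
    heapq.heapify(queue)
    waves = []
    while queue:
        free = total_slots
        wave, deferred = [], []
        # Serve tasks in priority order; tasks that do not fit wait in the queue.
        while queue:
            priority, name, slots = heapq.heappop(queue)
            if slots <= free:
                wave.append(name)
                free -= slots
            else:
                deferred.append((priority, name, slots))
        for task in deferred:
            heapq.heappush(queue, task)
        waves.append(wave)
    return waves


waves = schedule(
    [(0, "billing-report", 6), (1, "sensor-cleanup", 5), (2, "log-compaction", 3)],
    total_slots=10,
)
```

With ten slots, the highest-priority task runs first, the small low-priority task fills the remaining capacity, and the task that does not fit is queued for the next wave.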

Data swamp

So many things go on in a data lake that it is not possible to manage them manually. Without constraints and a thoughtful approach to processes, a data lake will degenerate very quickly. If ingested data carry no business information, we can’t find the right context for them. If everyone generates anonymous data without lineage, we end up with tons of useless data, and no one knows what is going on. Who is the author of the changes? Where did the data come from? Everything starts to look like a data swamp.

To avoid this scenario, pay attention to the common mistakes in data lakes. The following sections present the basic problems and some hints.

Transformation on ingestion

If we structure data on ingestion, we do not yet have the full context needed to transform them. For example, suppose we read data from a sensor as a voltage. If we map this voltage to a value using the context known at that moment, it will be very hard to revert the transformation later. Imagine that the sensor is moved to another location or recalibrated: we may not know when these changes occurred or how the values should be mapped. Even if you think that you have all the knowledge required to structure the data, you will reach a point where you cannot control everything because there is too much new data.
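The sensor example can be sketched as follows: the lake keeps the raw voltages, and the mapping to a physical value is applied at read time using whatever calibration history is known by then. The linear calibration model, the `Calibration` class and all numbers are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Calibration:
    valid_from: datetime  # when this calibration took effect
    gain: float           # hypothetical: degrees Celsius per volt
    offset: float         # hypothetical: degrees Celsius at 0 V


def to_celsius(voltage: float, measured_at: datetime, history: list[Calibration]) -> float:
    """Apply the newest calibration that was in effect at measurement time."""
    applicable = [c for c in history if c.valid_from <= measured_at]
    if not applicable:
        raise ValueError("no calibration known for this timestamp")
    current = max(applicable, key=lambda c: c.valid_from)
    return current.gain * voltage + current.offset


# Raw readings stay in the lake exactly as they arrived: (timestamp, volts).
raw_readings = [
    (datetime(2017, 1, 10), 2.0),
    (datetime(2017, 3, 10), 2.0),
]

# The recalibration on 1 March may only become known long after ingestion.
history = [
    Calibration(datetime(2017, 1, 1), gain=10.0, offset=0.0),
    Calibration(datetime(2017, 3, 1), gain=10.0, offset=1.5),
]

temperatures = [to_celsius(volts, at, history) for at, volts in raw_readings]
```

Because the voltages were never overwritten, adding a newly discovered calibration entry to `history` is enough to reinterpret old readings; nothing has to be reverted.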

Constraints on data

Instead of normalizing, you should add metadata for all entities and processes. Data redundancy and a lack of integrity are not problems in a data lake.

For example, imagine that you have thousands of tables with relations in a data lake.[3] Ingestion in such an environment would be painful as you would have to carefully check where the data match. Structuring during ingestion would probably require defining new tables. If you still want to do it in this way, maybe you should look at data warehouses?

Data without metadata is useless

Getting value out of the data remains the responsibility of business end users and data scientists, not developers. This is why we need to persist the context of the data in the form of metadata. Only the business knows which contexts fit which data; business users know the dependencies and can find patterns in the data. Most current implementations have tools for metadata storage, e.g. Apache Atlas.
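To make the idea of persisted context concrete, here is a minimal in-memory catalog sketch. It does not use or imitate the Atlas API; the `DatasetMetadata` and `Catalog` classes, their fields and the dataset paths are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetMetadata:
    path: str
    owner: str
    description: str
    tags: set[str] = field(default_factory=set)


class Catalog:
    def __init__(self) -> None:
        self._entries: dict[str, DatasetMetadata] = {}

    def register(self, entry: DatasetMetadata) -> None:
        self._entries[entry.path] = entry

    def find_by_tag(self, tag: str) -> list[str]:
        """Let business users discover data by the context they know."""
        return sorted(p for p, e in self._entries.items() if tag in e.tags)

    def undocumented(self, all_paths: list[str]) -> list[str]:
        """Paths present in the lake that nobody has described yet."""
        return sorted(p for p in all_paths if p not in self._entries)


catalog = Catalog()
catalog.register(DatasetMetadata(
    "raw/sensor-42", "plant-ops", "Raw voltage readings", {"sensor", "voltage"}))
catalog.register(DatasetMetadata(
    "curated/temperatures", "analytics", "Calibrated temperatures", {"sensor", "temperature"}))

sensor_data = catalog.find_by_tag("sensor")
swamp_candidates = catalog.undocumented(
    ["raw/sensor-42", "raw/dump-2017", "curated/temperatures"])
```

The `undocumented` check is the interesting part: data that exist in storage but not in the catalog are exactly the data nobody will be able to interpret later.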

Security

Such a large scale involves a lot of people, whose roles should be well defined and whose responsibilities should be allocated reasonably. Every action should leave traces that build audit logs, without which we can easily lose control over what is happening. Employees will find workarounds to access a data lake, which creates chaos and puts enterprise information at risk. This is not the only case where data governance plays a crucial role. European Union law regulates how data are collected and used, and ignoring these regulations can lead to considerable problems. We should protect sensitive data such as PANs (primary account numbers) or PII (personally identifiable information) against unauthorized users. Processes should be run or scheduled by people who know how they work.
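A role-based access check that always leaves an audit trace could look roughly like the sketch below. The roles, dataset prefixes and the `request_access` function are hypothetical; a real deployment would rely on the platform's own security tooling rather than hand-written checks.

```python
from datetime import datetime, timezone

# Hypothetical policy: role -> set of (action, dataset prefix) pairs it may use.
POLICY = {
    "analyst": {("read", "curated/")},
    "engineer": {("read", "raw/"), ("write", "raw/"),
                 ("read", "curated/"), ("write", "curated/")},
}

audit_log: list[dict] = []


def request_access(user: str, role: str, action: str, dataset: str) -> bool:
    """Decide whether the action is allowed and record the decision either way."""
    rules = POLICY.get(role, set())
    allowed = any(action == a and dataset.startswith(prefix) for a, prefix in rules)
    audit_log.append({
        "at": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "role": role,
        "action": action,
        "dataset": dataset,
        "allowed": allowed,
    })
    return allowed


granted = request_access("alice", "analyst", "read", "curated/sales")
denied = request_access("alice", "analyst", "read", "raw/card_numbers")
```

Denied requests are logged as well as granted ones, since attempted access to sensitive data is often the more interesting event for an audit.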

Performance

Only consumed data provide value for the company. Even the fastest ingestions and transformations are worth nothing if the business can’t consume valuable data. Pay attention to data in their final form and find out how they can be consumed. As I mentioned above, data should be well described with metadata so that the end user can start consuming them easily.

Collaboration

Collaboration is a key feature in developing and maintaining a data lake. Every process should be easily traceable to its source code. Data lineage should contain not only the source and destination, but also the process description and version. Ideally we should be able to see structural differences in data after a transformation, which allows us to revert changes and reduce clutter in the environment.
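A lineage record that carries the process description and version alongside source and destination can be sketched as below. The `LineageEdge` structure, the dataset paths and the version strings are hypothetical placeholders (the version could, for instance, be a commit identifier of the process source code).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageEdge:
    source: str
    destination: str
    process: str   # human-readable description of the transformation
    version: str   # hypothetical identifier of the process source code version


def trace_upstream(dataset: str, edges: list[LineageEdge]) -> list[LineageEdge]:
    """Walk lineage edges backwards from a dataset towards its raw origins."""
    trail: list[LineageEdge] = []
    frontier = [dataset]
    seen: set[str] = set()
    while frontier:
        current = frontier.pop()
        if current in seen:
            continue  # guards against cycles in the lineage graph
        seen.add(current)
        for edge in edges:
            if edge.destination == current:
                trail.append(edge)
                frontier.append(edge.source)
    return trail


edges = [
    LineageEdge("raw/sensors", "staged/sensors", "parse binary frames", "a1b2c3"),
    LineageEdge("staged/sensors", "curated/temps", "apply calibration", "d4e5f6"),
]

trail = trace_upstream("curated/temps", edges)
```

Given a suspicious value in `curated/temps`, the trail answers both "where did this come from?" and "which version of which process produced it?".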

Conclusion

Being aware of the problems faced when implementing a data lake helps to reduce risk for your company. You will accelerate data ingestion, and data scientists will understand the content because they will be able to examine the data thoroughly and swiftly and extract valuable information. Finally, you will deliver high-quality data to a wide audience. All this helps you to optimize your organization and reap the benefits.

In the next blog post I will try to describe how a data lake in a Hadoop stack looks today. What are the current problems and alternatives? There will be something about HDP, Kylo, NiFi and pachyderm.io. In the last blog post I will test pachyderm.io, and we will see if it really can be the future of data lakes.

Appendix

[1] “Gartner Says Beware of the Data Lake Fallacy”, July 28, 2014, http://www.gartner.com/newsroom/id/2809117

[2] “Your ‘Resolution List’ for 2017: 5 Best Practices for Unleashing the Power of Your Data Lakes”, Dec 21, 2016, https://www.talend.com/blog/2016/12/21/your-resolution-list-for-2017-5-best-practices-for-unleashing-the-power-of-your/

[3] Probably your company also uses sharepoint for versioning