This blog post defines the E of ETL, and describes the role of the Data Lake.

Extraction

An extraction, in data engineering, should be an unaltered snapshot of the state of entities at a given point in time.

Concretely, it often involves calling APIs, scraping websites, getting files from Secure File Transfer Protocol (SFTP) servers, periodically running a query on a copy of an Operational Database, or copying files from an S3 account. The result of the extraction is stored in cheap, scalable, highly available Cloud storage such as S3, to be kept forever — or for as long as compliance lets us. The data produced by these extractions is what constitutes the Data Lake.
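
As a concrete sketch, assuming a hypothetical JSON API at api.example.com and the “myorg-data-lake” bucket used later in this post, an extraction can be as simple as snapshotting the response as-is, partitioned by extraction time:

import datetime

import boto3
import requests

API_URL = "https://api.example.com/v1/clients"  # hypothetical source API
BUCKET = "myorg-data-lake"
PREFIX = "example_api_clients"  # one top-level prefix per extract

def extract() -> str:
    """Store the unaltered API response in S3, partitioned by extraction time."""
    response = requests.get(API_URL, timeout=30)
    response.raise_for_status()

    extracted_at = datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")
    key = f"{PREFIX}/extracted_at={extracted_at}/response.json"

    boto3.client("s3").put_object(Bucket=BUCKET, Key=key, Body=response.content)
    return f"s3://{BUCKET}/{key}"

Note that nothing is parsed or cleaned here; the snapshot is stored exactly as received.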

Data Lake

Different writers, bloggers, and professionals have different definitions of a “Data Lake” and a “Data Warehouse”. Sometimes their roles are not clearly stated or they overlap — to the point where it creates confusion and the two terms are used interchangeably. The following definition of a Data Lake is simple: it clearly separates the Data Lake from the Data Warehouse, and it formalizes the useful separation of raw data (part of the Data Lake) from derived datasets (part of the Data Warehouse / Data Marts).

A Data Lake is a repository that contains all of the unprocessed data produced or gathered by the business. Because no business logic is applied to the data at that point, it remains true to the source, and any analytics (tables, data science models) can be recreated from it if business requirements were to change. Extracting and keeping this data is necessary because getting it again from the source is often expensive (API calls, slow SFTP, operational database dumps) and sometimes impossible (APIs evolve, SFTPs get emptied, operational databases mutate records in-place).

Structure

With the democratization of data and the decentralization of analytics, it is important for a lake to be discoverable. Consider the following structure:

s3://myorg-data-lake
├── s3://myorg-data-lake/tweets_mentioning_myorg
└── s3://myorg-data-lake/salesforce_clients

Data Lake consumers would expect any extract to be a top-level prefix in the “myorg-data-lake” S3 bucket, making it easy to browse. However, the data does not need to be co-located in the same bucket, because we can use the Hive metastore to register any of the extracts as tables inside the same schema, providing a central interface to the Data Lake (a registration sketch follows the listing below).

data_lake
├── data_lake.tweets_mentioning_myorg → s3://myorg-twitter-extracts/tweets_mentioning_myorg
└── data_lake.salesforce_clients → s3://myorg-salesforce-extracts/salesforce_clients
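
As a sketch, assuming Spark with Hive support and hypothetical column names, registering one of these extracts as an external table could look like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("CREATE SCHEMA IF NOT EXISTS data_lake")
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS data_lake.salesforce_clients (
        client_id  STRING,
        name       STRING,
        created_at TIMESTAMP
    )
    STORED AS PARQUET
    LOCATION 's3://myorg-salesforce-extracts/salesforce_clients'
""")

Consumers can then query data_lake.salesforce_clients without needing to know which bucket the extract actually lives in.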

Format

Semi-structured and unstructured data is often cited as a characteristic of a Data Lake. However, I believe that converting the data, during the extraction process, to a file format that embeds a schema (such as Parquet or Avro) has significant advantages. A schema defines an interface to the dataset, making it easier for multiple teams to use with minimal communication. It is also a lightweight validation that the source system is still producing the expected data: when the schema is no longer valid, the extraction breaks, which reduces the risk of erroneous analytics being produced or cryptic errors deep in the transformation processes. The data extracted might already have some sort of schema (database and API extracts), and storing it as JSON / CSV would mean losing some of that valuable metadata.
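
A minimal sketch of this idea, assuming pyarrow and hypothetical column names: cast the extracted data to an explicit schema before writing Parquet, so that a change in the source fails the extraction instead of silently flowing downstream.

import pyarrow as pa
import pyarrow.csv as pa_csv
import pyarrow.parquet as pq

EXPECTED_SCHEMA = pa.schema([
    ("client_id", pa.string()),
    ("name", pa.string()),
    ("created_at", pa.timestamp("s")),
])

table = pa_csv.read_csv("clients.csv")    # hypothetical extracted file
table = table.cast(EXPECTED_SCHEMA)       # raises if the source no longer matches
pq.write_table(table, "clients.parquet")  # the schema now travels with the file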

In the cases where the source system makes previous data unavailable quickly, extract it once without any conversion, and run the conversion on the raw copy. The Hive table can then be pointed at the converted copy. For example, consider files sourced from an SFTP server that disappear after ~1 hour:

SFTP
├── file1.csv
└── file2.csv

s3://myorg-data-lake
├── s3://myorg-data-lake/sftp_clients/raw/file1.csv
├── s3://myorg-data-lake/sftp_clients/raw/file2.csv
└── s3://myorg-data-lake/sftp_clients/parquet/…

In this manner, the files are safely on S3, and the schema can be fixed and the conversion re-run without fearing the loss of any data.
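
A sketch of that two-step flow, assuming paramiko for the SFTP connection and the bucket layout above:

import boto3
import paramiko

BUCKET = "myorg-data-lake"

def copy_raw(sftp: paramiko.SFTPClient) -> None:
    """Step 1: copy the short-lived files verbatim, before they disappear."""
    s3 = boto3.client("s3")
    for filename in sftp.listdir("."):
        with sftp.open(filename, "rb") as fileobj:
            s3.upload_fileobj(fileobj, BUCKET, f"sftp_clients/raw/{filename}")

# Step 2 (a separate job, safe to re-run): read the raw CSVs back from
# s3://myorg-data-lake/sftp_clients/raw/ and write them as Parquet under
# s3://myorg-data-lake/sftp_clients/parquet/, as in the Parquet sketch above.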