Data ingestion is a key component of big data architectures. While storing all your data on unstructured object storage such as Amazon S3 might seem simple enough, there are many pitfalls you want to avoid since the way your data is ingested can dramatically impact the performance and usefulness of your data lake.

Having trouble keeping up with data lake best practices? We can automate that for you. Check out our data lake ingestion solution today, or schedule a demo to see Upsolver in action.

In this article we’ll look at 7 best practices for big data ingestion – from strategic principles down to the more tactical (and technical) issues that you should be aware of when building your ingest pipelines.

Why you should think about data ingestion early

While data lakes are typically predicated on the notion of ‘store now, analyze later’, taking this approach too literally will not end well. Dumping all of your data into the lake without having at least a general idea of what you’d like to do with the data later on (and what that will entail) could result in your data lake becoming a ‘data swamp’ – a murky bog of unfamiliar, uncategorized datasets which no one really knows how to make useful.

Data which is not ingested and stored according to best practices will be very difficult to access further down the line, and mistakes made early on could come back to haunt you for months or years because once data is in the lake, it becomes very difficult to ‘reorganize’.

In addition, proper data ingestion should address functional challenges such as:

Optimizing storage for analytic performance, which is often best done upon ingest (data partitioning, columnar file format, etc.)

Ensuring exactly-once processing of streaming event data without data loss or duplication

Visibility into the data you’re bringing in, especially in the case of frequent schema changes

7 Best Practices for Big Data Ingestion

Now that we’ve made the case for why effective ingestion matters, let’s look at the best practices you want to enforce in order to avoid the above pitfalls and build a performant, accessible data lake.

Note that enforcing some of these best practices requires a high level of technical expertise – if you feel your data engineering team could be put to better use, you might want to look into an ETL tool for Amazon S3 that can automatically ingest data to cloud storage and store it according to best practices.

For the purposes of this article we’ll assume you’re building your data lake on Amazon S3, but most of the advice applies to other types of cloud or on-premises object storage including HDFS, Azure Blob, or Google Cloud Storage; they also apply regardless of whichever framework or service you’re using to build your lake – Apache Kafka, Apache Flume, Amazon Kinesis Firehose, etc.

1. Have a plan

This is more of a general guideline but important to keep in mind throughout your big data initiative: blindly dumping your data into S3 is a bad strategy. You want to maintain flexibility and enable experimental use cases, but you also need to be aware of the general gist of what you’re going to be doing with the data, what types of tools you might want to use, and how those tools will want your data to be stored in order to work properly.

For example, if you’re going to be running ad-hoc analytic queries on terabytes of data, you’re likely going to be using Amazon Athena. Optimizing lake storage can dramatically impact the performance and costs of your queries in Athena (check out our webinar on ETL for Amazon Athena to learn more).

2. Create visibility upon ingest

You shouldn’t wait for data to actually be in your lake to know what’s in the data you’re bringing in. You should have visibility into the schema and a general idea of what your data contains even as it is being streamed into the lake, which will remove the need for ‘blind ETLing’ or reliance on partial samples for schema discovery later on.

Upsolver’s schema discovery is built to do exactly this; an alternative would be to write ETL code that pulls a sample of the data from your message broker and infer schema and statistics based on that sample.

3, Lexicographic ordering on S3

You should use a lexicographic date format (yyyy/mm/dd) when storing your data on S3. Since files are listed by S3 in lexicographic order, failing to store them in the correct format will cause problems down the line when retrieving the data.

Note that while Amazon previously recommended to randomizing prefix naming with hashed characters, this is no longer the case according to their most up-to-date documentation.

4. Compressing your data

While storage on S3 is cheap, everything in life is relative, and when you’re processing terabytes or petabytes of data on a daily basis, that’s going to have an impact on your cloud bill – which is why you want to compress your data.

When choosing the compression format, note that very strong compression might actually increase your costs when querying the data since data will need to be uncompressed before querying – which will incur compute costs that are greater than what you’re saving on storage. Using super-strong compression such as BZ2 will make your data lake completely unusable!

Instead, opt for ‘weaker’ compression that reads fast and lowers CPU costs – we like Snappy for this – to reduce your overall cost of ownership.

5. Reduce the number of files

We’ve written previously about dealing with small files on S3 when it comes to query performance – but even before you run your compaction process, you should try to reduce the number of events you’re ingesting in a single write.

Kafka producers might be sending thousands of messages per second – storing each of these as separate files will increase disk reads and degrade performance. At Upsolver we write minutes instead of seconds for this reason.

6. Self-describing File Formats

You want every file you store to contain the metadata needed in order to understand the data contained in that file, as that will make analytic querying far simpler. For optimal performance during query execution you want to use a columnar format such as Parquet or ORC. for the ETL staging layer can use row-based Avro or JSON.

7. Ensuring exactly-once processing

Duplicate events or missed events can significantly hurt the reliability of the data stored in your lake, but exactly-once processing is notoriously difficult to implement since your storage layer is only eventually (not instantly) consistent. Upsolver provides exactly-once processing from Kafka or Kinesis via idempotent operations – if coding your own solution you would need to do the same. This is a complex topic which we will try to cover elsewhere.

Learn more about data lake best practices:

Read our guide to comparing data lake ETL tools

Watch our webinar with AWS and ironSource to learn how ironSource operationalize a petabyte-scale data lake in the cloud

Schedule a friendly chat with our solution architects to learn how to improve your data architecture