More often than not, big data is made up of a lot of small files. Event-based streams from IoT devices, servers or applications will typically arrive in KB-scale JSON files, easily adding up to hundreds of thousands of new files being ingested into your data lake on a daily basis.

Writing small files to object storage (Amazon S3, Azure Blob, HDFS, etc.) is easy enough; however, trying to query the data in this state using an SQL engine such as Athena or Presto will absolutely kill both your performance and your budget.

We learned our lessons about the impact of compaction, compression, partitioning and proper file formats on query performance the hard way. To avoid repeating our mistakes, keep reading; or use Upsolver to automate all your data lake management and ensure non-stop awesome performance.

The Problem: Small Files = Big Latency

In a data lake approach, you’re ingesting data in its raw form and saving the transformations for later. Since streaming data comes in small files, we will write these small files to S3 rather than attempt to combine them on write.

(The alternative would be to buffer the data in memory or store it as a temporary file on a local disk while files are being compacted, but this adds complexity and can cause data loss. In addition, in cases where there are many devices or servers writing data, this won’t help much.)

Regardless of whether you’re working with Hadoop or Spark, cloud or on-premises, small files are going to kill your performance. Each file comes with its own overhead of milliseconds for opening the file, reading metadata and closing it. In addition, many files mean many non-contiguous disk seeks, which object storage is not optimized for.

Let’s say you’re receiving thousands of new files each hour, which is not uncommon for streaming data, and at the end of a day you would like to query the last 24 hours with SQL. All these milliseconds will easily add up to hours of waiting for results, which isn’t going to be good for anybody.
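To make the arithmetic concrete, here is a back-of-the-envelope sketch in Python. The file rate and per-file overhead are hypothetical numbers picked for illustration, not measurements, and the calculation assumes files are opened one after another; a real engine reads files in parallel, so treat this as a rough illustration of how the overhead scales rather than a prediction.

```python
def sequential_overhead_seconds(files_per_hour: int, hours: int, overhead_ms: float) -> float:
    """Total per-file overhead, in seconds, if files are handled one at a time."""
    total_files = files_per_hour * hours
    return total_files * overhead_ms / 1000.0


# Hypothetical inputs: 5,000 files per hour, a 24-hour query range,
# and 50 ms of open/read-metadata/close overhead per file.
total = sequential_overhead_seconds(files_per_hour=5_000, hours=24, overhead_ms=50)
print(f"{total:.0f} seconds, or {total / 3600:.2f} hours")
# prints "6000 seconds, or 1.67 hours"
```

The overhead grows linearly with the number of files, which is why reducing the file count pays off so directly.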

The Solution: Bigger Files!

One of the principles behind Upsolver is to provide an end-to-end, SaaS Data Lake Platform: it ingests streams and outputs workable data, in a process that’s completely automated and invisible to the end user.

However, very early on, we realized that this meant more than adding an output to Presto / Athena / Hive and calling it a day: our first beta testers were getting terrible performance or Athena would flat-out refuse to run the query, which somewhat diminished our value proposition around “workable data”.

These failed initial tests taught us that feeding small files to SQL engines is a bad idea. So what do you do about it? Well, it’s pretty obvious, really: make bigger files.

Today, after speaking with dozens of clients and writing thousands of lines of code to solve this exact problem, we realize there are quite a few nuances you need to get right. However, the basic principle is simple: merge your smaller event files into larger archives on a regular basis – AKA compaction.

It gets more complicated when you take things like workload management, event-time and optimal file sizes into account (hence the thousands of lines of code). Nevertheless, if you want one quick way to improve your query performance today, compaction is definitely the way to go.
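Stripped of all those nuances, the basic principle can be sketched in a few lines of Python. This is a minimal illustration, not Upsolver's implementation: it assumes the small files sit in a local directory, that each one holds a single JSON event, and that newline-delimited JSON is an acceptable output format. The function name `compact_json_files` is made up for this example.

```python
import json
from pathlib import Path


def compact_json_files(source_dir: Path, output_file: Path) -> int:
    """Merge every *.json file in source_dir into one newline-delimited JSON file.

    Assumes each small file contains exactly one JSON document.
    Returns the number of events written.
    """
    count = 0
    with output_file.open("w", encoding="utf-8") as out:
        for path in sorted(source_dir.glob("*.json")):
            # Skip the output file itself if it lives in the same directory.
            if path.resolve() == output_file.resolve():
                continue
            with path.open(encoding="utf-8") as src:
                event = json.load(src)
            # One event per line, so the result can be read line by line.
            out.write(json.dumps(event) + "\n")
            count += 1
    return count
```

Calling `compact_json_files(Path("raw"), Path("compacted.ndjson"))` would turn a directory of single-event files into one larger file. On object storage the same loop would list, download and upload objects instead of touching a local disk.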

Benchmarking the Difference

Here’s a really quick benchmark test using Amazon Athena to query 22 million records stored on S3.

Running this query on the uncompacted dataset took 76 seconds.

Here’s the exact same query in Athena, running on a compacted dataset:

This query returned in 10 seconds.

This is just a simple example and real-life mileage may vary based on the data and myriad other optimizations you can use to tune your queries; however, we don’t know many data analysts or DBAs who wouldn’t find the prospect of improving query performance by 660% attractive.
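For anyone who wants to check the math, the improvement can be expressed in a few ways using only the two timings from the benchmark. The variable names below are ours.

```python
uncompacted_seconds = 76
compacted_seconds = 10

# How many times faster the compacted query ran.
speedup = uncompacted_seconds / compacted_seconds

# The same result expressed as a percentage improvement in speed.
improvement_pct = (uncompacted_seconds - compacted_seconds) * 100 / compacted_seconds

# The share of waiting time that compaction removed.
time_saved_pct = (uncompacted_seconds - compacted_seconds) * 100 / uncompacted_seconds

print(f"{speedup:.1f}x faster, a {improvement_pct:.0f}% improvement, "
      f"{time_saved_pct:.1f}% less waiting")
# prints "7.6x faster, a 660% improvement, 86.8% less waiting"
```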

Tips for Dealing with Small Files on HDFS / S3

If you’re using Upsolver, compaction is something you don’t need to worry about since it’s handled under the hood. Otherwise, you’ll need to write a script that compacts small files periodically – in which case, you should take care to:

Define your compaction window wisely, depending on how Athena is set up. Compacting too often will be wasteful since files will still be pretty small and any performance improvement will be marginal; compacting too infrequently will result in long processing times and slower queries as the system waits for the compaction jobs to finish.

Delete uncompacted files to save space and storage costs (we do this every 10 minutes). Needless to say, you should always have a copy of the data in its original state for replay and event sourcing.

Remember to reconfigure your Athena table partitions once compaction is completed, so that Athena reads the compacted partition rather than the original files.

Keep your file size as big as possible but still small enough to fit in memory uncompressed. At Upsolver, we use a 500 MB file size limit to stay within comfortable boundaries.
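If you do end up writing that script yourself, the planning part of the tips above might look something like the following Python sketch. It is only an illustration under several assumptions: the `SmallFile` record, the function names, the 60-minute default window and the table, column and bucket names in the usage lines are all hypothetical; the window length is assumed to divide evenly into a day; and input file sizes are used as a rough proxy for the size of the compacted output. The `ALTER TABLE ... PARTITION ... SET LOCATION` statement is standard Athena DDL, but here it is only built as a string, not executed.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone

MAX_BATCH_BYTES = 500 * 1024 * 1024  # the 500 MB limit mentioned above


@dataclass(frozen=True)
class SmallFile:
    key: str              # object key or path (hypothetical record type)
    size_bytes: int
    event_time: datetime


def window_start(ts: datetime, window_minutes: int) -> datetime:
    """Round a timestamp down to the start of its compaction window."""
    minutes = (ts.hour * 60 + ts.minute) // window_minutes * window_minutes
    return ts.replace(hour=minutes // 60, minute=minutes % 60, second=0, microsecond=0)


def plan_batches(files, window_minutes=60, max_bytes=MAX_BATCH_BYTES):
    """Group files by compaction window, then split each window into
    batches whose combined input size stays within max_bytes."""
    by_window = defaultdict(list)
    for f in files:
        by_window[window_start(f.event_time, window_minutes)].append(f)

    plan = {}
    for start, group in sorted(by_window.items()):
        batches, current, current_size = [], [], 0
        for f in sorted(group, key=lambda item: item.event_time):
            # Start a new batch when adding this file would exceed the limit.
            if current and current_size + f.size_bytes > max_bytes:
                batches.append(current)
                current, current_size = [], 0
            current.append(f)
            current_size += f.size_bytes
        if current:
            batches.append(current)
        plan[start] = batches
    return plan


def repoint_partition_sql(table, partition_column, partition_value, location):
    """Build the DDL that points an existing partition at the compacted files."""
    return (f"ALTER TABLE {table} PARTITION ({partition_column} = '{partition_value}') "
            f"SET LOCATION '{location}'")


# Usage with made-up files and names.
mb = 1024 * 1024
files = [
    SmallFile("raw/a.json", 300 * mb, datetime(2024, 1, 1, 10, 5, tzinfo=timezone.utc)),
    SmallFile("raw/b.json", 300 * mb, datetime(2024, 1, 1, 10, 50, tzinfo=timezone.utc)),
    SmallFile("raw/c.json", 100 * mb, datetime(2024, 1, 1, 11, 10, tzinfo=timezone.utc)),
]
for start, batches in plan_batches(files).items():
    print(start.isoformat(), [[f.key for f in batch] for batch in batches])
print(repoint_partition_sql("events", "dt", "2024-01-01",
                            "s3://my-bucket/compacted/dt=2024-01-01/"))
```

Each batch would then be merged into one output file, the partition repointed, and only after that would the uncompacted copies be removed from the queried location.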

Want to learn more about reducing costs and improving performance when analyzing streaming data? Check out our previous post on using Apache Kafka with and without a Data Lake; or schedule a live demo to see the hands-free solution to compacting small files and managing your data lake on S3, GCP or HDFS.