by Mikhail Vorontsov

07 Jan 2015 update: extending LZ4 description (thanks to Mikael Grev for a hint!)

This article will give you an overview of several general compression algorithm implementations performance. As it turned out, some of them could be used even when your CPU requirements are pretty strict.

In this article we will compare:

JDK GZIP – a slow algorithm with a good compression, which could be used for long term data compression. Implemented in JDK java.util.zip.GZIPInputStream / GZIPOutputStream .

. JDK deflate – another algorithm available in JDK (it is used for zip files). Unlike GZIP, you can set compression level for this algorithm, which allows you to trade compression time for the output file size. Available levels are 0 (store, no compression), 1 (fastest compression) to 9 (slowest compression). Implemented as java.util.zip.DeflaterOutputStream / InflaterInputStream .

. Java implementation of LZ4 compression algorithm – this is the fastest algorithm in this article with a compression level a bit worse than the fastest deflate. I advice you to read the wikipedia article about this algorithm to understand its usage. It is distributed under a friendly Apache license 2.0.

Snappy – a popular compressor developed in Google, which aims to be fast and provide relatively good compression. I have tested this implementation. It is also distributed under Apache license 2.0.

Compression test

I had to think a little what file set could be useful for data compression testing and at the same time could be present on most of Java developers machines (I don’t want to ask you to download hundreds of megabytes of files just to run the tests). Finally I realised that most of you have JDK javadoc installed locally. I decided to build a single file out of javadoc directory – concatenate all files. This can be easily done with tar , but not all of us are Linux users, so I have used the following class to generate such file:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 public class InputGenerator { private static final String JAVADOC_PATH = "your_path_to_JDK/docs" ; public static final File FILE_PATH = new File ( "your_output_file_path" ) ; static { try { if ( ! FILE_PATH. exists ( ) ) makeJavadocFile ( ) ; } catch ( IOException e ) { e. printStackTrace ( ) ; } } private static void makeJavadocFile ( ) throws IOException { try ( OutputStream os = new BufferedOutputStream ( new FileOutputStream ( FILE_PATH ) , 65536 ) ) { appendDir ( os, new File ( JAVADOC_PATH ) ) ; } System . out . println ( "Javadoc file created" ) ; } private static void appendDir ( final OutputStream os, final File root ) throws IOException { for ( File f : root. listFiles ( ) ) { if ( f. isDirectory ( ) ) appendDir ( os, f ) ; else Files. copy ( f. toPath ( ) , os ) ; } } } public class InputGenerator { private static final String JAVADOC_PATH = "your_path_to_JDK/docs"; public static final File FILE_PATH = new File( "your_output_file_path" ); static { try { if ( !FILE_PATH.exists() ) makeJavadocFile(); } catch (IOException e) { e.printStackTrace(); } } private static void makeJavadocFile() throws IOException { try( OutputStream os = new BufferedOutputStream( new FileOutputStream( FILE_PATH ), 65536 ) ) { appendDir(os, new File( JAVADOC_PATH )); } System.out.println( "Javadoc file created" ); } private static void appendDir( final OutputStream os, final File root ) throws IOException { for ( File f : root.listFiles() ) { if ( f.isDirectory() ) appendDir( os, f ); else Files.copy(f.toPath(), os); } } }

The total file size on my machine is 354,509,602 bytes (338 Mb).

Testing

Initially I thought about reading the whole file into RAM and compressing it in RAM. It turned out that you may pretty easily run out of heap space on commodity 4G machines with such approach 🙁

Instead I decided to rely on the OS file cache. We will use JMH as a test framework. The file will be loaded in OS cache during warmup phase (we will run compression test twice during warmup). We will compress into ByteArrayOutputStream (I know, it is not the fastest solution, but it is consistent across all tests and it does not need to spend more time writing compressed data to disk), so you still need some RAM to keep the output in memory.

Here is a test base class. All tests differ only in the compressing output stream implementation, so they will create a stream in StreamFactory implementation and reuse the base class test:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 @OutputTimeUnit ( TimeUnit. MILLISECONDS ) @State ( Scope. Thread ) @Fork ( 1 ) @Warmup ( iterations = 2 ) @Measurement ( iterations = 3 ) @BenchmarkMode ( Mode. SingleShotTime ) public class TestParent { protected Path m_inputFile ; @Setup public void setup ( ) { m_inputFile = InputGenerator. FILE_PATH . toPath ( ) ; } interface StreamFactory { public OutputStream getStream ( final OutputStream underlyingStream ) throws IOException ; } public int baseBenchmark ( final StreamFactory factory ) throws IOException { ByteArrayOutputStream bos = new ByteArrayOutputStream ( ( int ) m_inputFile. toFile ( ) . length ( ) ) ; try ( OutputStream os = factory. getStream ( bos ) ) { Files. copy ( m_inputFile, os ) ; } return bos. size ( ) ; } } @OutputTimeUnit(TimeUnit.MILLISECONDS) @State(Scope.Thread) @Fork(1) @Warmup(iterations = 2) @Measurement(iterations = 3) @BenchmarkMode(Mode.SingleShotTime) public class TestParent { protected Path m_inputFile; @Setup public void setup() { m_inputFile = InputGenerator.FILE_PATH.toPath(); } interface StreamFactory { public OutputStream getStream( final OutputStream underlyingStream ) throws IOException; } public int baseBenchmark( final StreamFactory factory ) throws IOException { ByteArrayOutputStream bos = new ByteArrayOutputStream((int) m_inputFile.toFile().length()); try ( OutputStream os = factory.getStream( bos ) ) { Files.copy(m_inputFile, os); } return bos.size(); } }

All tests look similar (you can find them in the source code at the end of this article), but here is an example – JDK deflate test:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 public class JdkDeflateTest extends TestParent { @Param ( { "1" , "2" , "3" , "4" , "5" , "6" , "7" , "8" , "9" } ) public int m_lvl ; @Benchmark public int deflate ( ) throws IOException { return baseBenchmark ( new StreamFactory ( ) { @Override public OutputStream getStream ( OutputStream underlyingStream ) throws IOException { return new DeflaterOutputStream ( underlyingStream, new Deflater ( m_lvl, true ) , 512 ) ; } } ) ; } } public class JdkDeflateTest extends TestParent { @Param({"1", "2", "3", "4", "5", "6", "7", "8", "9"}) public int m_lvl; @Benchmark public int deflate() throws IOException { return baseBenchmark(new StreamFactory() { @Override public OutputStream getStream(OutputStream underlyingStream) throws IOException { return new DeflaterOutputStream( underlyingStream, new Deflater( m_lvl, true ), 512 ); } }); } }

Test results

Output file sizes

First of all let’s see the output file sizes:

Implementation File size (bytes) GZIP 64,214,683 Snappy (normal) 138,250,196 Snappy (framed) 101,470,113 LZ4 (fast 64K) 98,326,531 LZ4 (fast 128K) 94,403,752 LZ4 (fast double 64K) 94,478,009 LZ4 (fast 32M) 89,758,917 LZ4 (fast double 32M) 84,337,838 LZ4 (fast triple 32M) 83,426,446 LZ4 (high) 82,085,338 Deflate (lvl=1) 78,383,316 Deflate (lvl=2) 75,280,213 Deflate (lvl=3) 73,251,533 Deflate (lvl=4) 68,110,895 Deflate (lvl=5) 65,721,750 Deflate (lvl=6) 64,214,665 Deflate (lvl=7) 64,019,601 Deflate (lvl=8) 63,874,787 Deflate (lvl=9) 63,868,222

As you can see, the difference between the smallest and the biggest compressed files is pretty large (from 61 to 131 Mb). There are several LZ4 algorithm options in this table – I will cover it in more details closer to the end of this article. Let’s see how long did it take to compress for each implementation.

Compression time

Implementation Compression time (ms) Snappy.framedOutput 2264.700 Snappy.normalOutput 2201.120 Lz4.testFastNative64K 1075.138 Lz4.testFastNative128K 1068.932 Lz4.testFastNativeDouble64K 1261.138 Lz4.testFastNative32M 1076.141 Lz4.testFastNativeDouble32M 1230.563 Lz4.testFastNativeTriple32M 1433.068 Lz4.testHighNative64K 6812.911 deflate (lvl=1) 4522.644 deflate (lvl=2) 4726.477 deflate (lvl=3) 5081.934 deflate (lvl=4) 6739.450 deflate (lvl=5) 7896.572 deflate (lvl=6) 9783.701 deflate (lvl=7) 10731.761 deflate (lvl=8) 14760.361 deflate (lvl=9) 14878.364 GZIP 10351.887

Let’s merge compression time and file size on one diagram in order to calculate the throughput and make some conclusions.

Throughput and efficiency

Implementation Time (ms) Uncompressed file size (Mb) Throughput (Mb/sec) Compressed file size (Mb) Snappy.normalOutput 2201.12 338 153.5581885586 131.8456611633 Snappy.framedOutput 2264.7 338 149.2471409017 96.7694406509 Lz4.testFastNative64K 1075.138 338 314.3782472576 93.771487236 Lz4.testFastNative128K 1068.932 338 316.2034628957 90.0304336548 Lz4.testFastNativeDouble64K 1261.138 338 268.0119067065 90.1012506485 Lz4.testFastNative32M 1076.141 338 314.0852360425 85.6007738113 Lz4.testFastNativeDouble32M 1230.563 338 274.6710245636 80.4308300018 Lz4.testFastNativeTriple32M 1433.068 338 235.8576145724 79.5616588593 Lz4.testHighNative64K 6812.9 338 49.6117659147 78.2826786041 deflate (lvl=1) 4522.644 338 74.7350443679 74.752155304 deflate (lvl=2) 4726.477 338 71.5120374012 71.7928056717 deflate (lvl=3) 5081.934 338 66.5101120951 69.8581056595 deflate (lvl=4) 6739.45 338 50.1524605124 64.9556112289 deflate (lvl=5) 7896.572 338 42.8033835442 62.6771450043 deflate (lvl=6) 9783.701 338 34.5472536415 61.2398767471 deflate (lvl=7) 10731.761 338 31.4952969974 61.0538492203 deflate (lvl=8) 14760.361 338 22.8991689295 60.9157438278 deflate (lvl=9) 14878.364 338 22.7175514727 60.9094829559 GZIP 10351.887 338 32.651051929 61.2398939133

Many of these implementations are pretty slow: ~23 Mb/sec for high level deflate or even ~33 Mb/sec for GZIP is not something you should be happy with on Xeon E5-2650. At the same time fastest deflate version is running at ~75 Mb/sec, Snappy at ~150 Mb/sec and LZ4 (fast, JNI) at truly surprising ~320 Mb/sec (actually much faster, but this time includes reading file from OS cache).

This diagram clearly shows that 2 implementations are not competitive at the moment: Snappy is slower than LZ4 (fast), but produces the bigger files. LZ4 (high) is in turn slower than deflate levels 1 to 4 and produces a bigger output compared to even deflate level=1.

As a result, I would probably choose between LZ4(fast) JNI implementation and deflate level=1 when I need an “on the fly compression”. You may have to use deflate if your organization does not allow 3rd party libraries. You should consider how much spare CPU cycles you have on your box as well as where the compressed data is being sent. For example, if you are writing compressed data directly to HDD, then performance above ~100 Mb/sec would not help you (provided that your file is large enough) – HDD speed will become the bottleneck. Same output written into a modern SSD – even LZ4 would not be fast enough 🙂 If you compress your data prior to sending it over a Gigabit network, you should probably use LZ4, because 75 Mb/sec of deflate performance is considerably less than 125 Mb/sec of network throughput (yes, I know about packet headers, but the difference will be still considerable).

LZ4 compression algorithm

LZ4 is an algorithm which encodes data in frames. Each frame contains a header and compressed data. The size of the compression buffer (amount of data which will be compressed in one frame) is an LZ4BlockOutputStream constructor argument:

1 public LZ4BlockOutputStream ( OutputStream out, int blockSize, LZ4Compressor compressor ) public LZ4BlockOutputStream(OutputStream out, int blockSize, LZ4Compressor compressor)

Current implementation allows the block size to be between 64 bytes and 32 Mb. Obviously, the bigger frame you will use, the higher will be the compression ratio. You should keep in mind that LZ4BlockOutputStream will allocate identically sized uncompression buffer (this info is stored in the frame header).

As you have seen above, there is very little difference in the time required to compress the data with either 64K or 32M buffer, which means you should try using the bigger buffer in order to obtain some extra compression.

Another interesting LZ4 property (thanks to Mikael Grev for idea) is that it makes sense to use 2 LZ4BlockOutputStream -s in a row, because subsequent blocks may contain similarly encoded data. The performance penalty is pretty unnoticeable, but you can gain extra compression (in case of 32M buffer, the output was 89M for the single pass and 84 for the double pass at a tiny cost of ~200 ms for 89M of data output from the first pass). It does not make much sense to make three or more passes – you will have very little compression improvement.

At the same time, it makes more sense to double the buffer size on the single pass rather than make 2 passes on the smaller buffers (the exception are buffers over 16M, which will allow you to circumvent 32M compression buffer limitation) – as you can see, you get nearly identical file size for double 64K pass and for single 128K pass. Double pass, as you have seen, will always be slower.

See also

Benchmark suite for data compression library on the JVM – a comprehensive test suite for data compressors implemented in Java or accessible via JNI. It tests time and space efficiency using several sets of data to compress. Thanks to Sam Van Oort for a link!

Summary

If you think that data compression is painfully slow, then check LZ4 (fast) implementation, which is able to compress a text file at ~320 Mb/sec – compression at such speed should be not noticeable for most of applications. It makes sense to increase the LZ4 compression buffer size up to its 32M limit if possible (keep in mind that you will need a similarly sized buffer for uncompression). You can also try chaining 2 LZ4BlockOutputStream -s with 32M buffer size to get most out of LZ4.

-s with 32M buffer size to get most out of LZ4. If you are restricted from using 3rd party libraries or want a little bit better compression, check JDK deflate (lvl=1) codec – it was able to compress the same file at ~75 Mb/sec.

Source code

Java compression test project source code

Use standard JMH approach to run this project: