Having a wide variety of compression formats means you can pick between various trade-offs. Unfortunately, when you're presented with a data feed, those decisions have often already been made on your behalf.

Though the popular ZIP archive format supports multiple lossless data compression schemes, DEFLATE remains almost ubiquitous. DEFLATE was defined by Phil Katz in 1989 and is a derivative of the LZ77 algorithm published twelve years earlier. Beyond ZIP files, DEFLATE has found use in GZIP and PNG files as well as in HTTP.

An open source implementation of DEFLATE was included in zlib, which was first released in 1995. The compression functionality of zlib was produced by Jean-loup Gailly and the decompression components were written by Mark Adler. The zlib library is considered to be one of the most widely distributed pieces of code in the world as it's used by a huge variety of software, so any security vulnerability in the library could have a significant global impact. Thankfully, the C code making up the latest version is only about 18K lines long, and there have been C++ ports that are a little over 3K lines long.
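To make the format concrete, here's a minimal sketch (mine, not from zlib's sources) that uses Python's zlib bindings to deflate and then inflate a buffer. A negative wbits value selects a raw, headerless DEFLATE stream, which is how ZIP members store their data.

```python
import zlib

data = b"DEFLATE finds repeated byte sequences. " * 20

# wbits=-15 requests a raw DEFLATE stream with no zlib header
# or Adler-32 trailer -- the form used inside ZIP members.
compressor = zlib.compressobj(9, zlib.DEFLATED, -15)
compressed = compressor.compress(data) + compressor.flush()

decompressor = zlib.decompressobj(-15)
restored = decompressor.decompress(compressed) + decompressor.flush()

assert restored == data
print(len(data), '->', len(compressed), 'bytes')
```

The highly repetitive input compresses to a small fraction of its original size, and the round trip restores it byte for byte.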

In this post I'll present my findings for the fastest ZIP file decompression method I've been able to find.

Installing Tooling

I'll first install the Debian-patched version of Info-ZIP's unzip 6.0 and zip 3.0 utilities, the Debian-patched version of 7-Zip 9.20.1, CPython 2.7.12 and a host of other utilities on a fresh installation of Ubuntu 16.04.2 LTS.

$ sudo apt update
$ sudo apt install \
    binutils \
    build-essential \
    git \
    p7zip-full \
    python-dev \
    unzip \
    wget \
    zip

An Example Dataset

I'll produce a ZIP file containing a subset of the files that make up the Hadoop 3.0.3 binary distribution provided by the Apache Foundation.

$ wget -c -O hadoop.tar.gz \
    https://archive.apache.org/dist/hadoop/common/hadoop-3.0.3/hadoop-3.0.3.tar.gz

The above archive is 300 MB in size and contains over 20K files.

$ tar -ztvf hadoop.tar.gz | grep -cv '/$'

20424

I'll extract all of the contents, sans the documentation, into a working folder.

$ mkdir -p working
$ tar xvf hadoop.tar.gz \
    --directory=./working \
    --exclude=hadoop-3.0.3/share/doc \
    --strip 1

I'm left with 672 decompressed files adding up to 380 MB in size. The largest file is ~82 MB decompressed, four files are larger than 10 MB and then there is a long tail down to files in the tens of bytes.

$ find working -type f -print | wc -l

672

I'll use the strongest DEFLATE compression setting Info-ZIP supports to create a ZIP archive of the working folder. The resulting ZIP file is 267 MB in size.

$ zip -r -9 hadoop.zip working/
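The same member statistics can be pulled programmatically with Python's zipfile module. This sketch runs against a small in-memory stand-in archive; pointing ZipFile at hadoop.zip instead would report on the archive built above.

```python
import io
import zipfile

# Build a tiny stand-in archive in memory; substitute
# zipfile.ZipFile('hadoop.zip') to inspect the real archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w', zipfile.ZIP_DEFLATED) as zf:
    zf.writestr('working/big.txt', b'x' * 100000)
    zf.writestr('working/small.txt', b'hello')

with zipfile.ZipFile(buf) as zf:
    # Skip directory entries; tally raw and compressed sizes.
    infos = [i for i in zf.infolist() if not i.filename.endswith('/')]
    total_raw = sum(i.file_size for i in infos)
    total_packed = sum(i.compress_size for i in infos)
    print(len(infos), 'files,', total_raw, 'bytes ->', total_packed, 'bytes')
```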

Benchmarking Inflation

In these benchmarks I'll look at single-threaded, sequentially executed workloads. The machine used has an Intel Core i5 4670K clocked at 3.4 GHz, 8 GB of DDR3 RAM and a SanDisk SDSSDHII960G 960 GB SSD drive connected via a SATA interface. For the record, the Intel chip I'm using supports SSE 4.2, AVX and AVX2.

$ lscpu | tail -n1 | sed 's/ /\n/g' | grep 'avx\|sse' | sort

avx
avx2
sse
sse2
sse4_1
sse4_2
ssse3

The first tool I'll benchmark is the Debian-patched Info-ZIP DEFLATE decompression implementation.

$ vi run_1
for i in {1..25}; do
    unzip -qq -o hadoop.zip -d working/
done

$ time bash run_1

The above completed in 54.55 seconds.

For the record, the above binary doesn't appear to be built with zlib. I examined the source code and it is possible to exclude zlib while retaining all of the DEFLATE functionality.

$ objdump --dynamic-syms /usr/bin/unzip | grep -c ZLIB

0

The above doesn't show any mention of zlib. I also grep'ed the binary looking for the ZLIB_1 string that appears when zlib has been statically linked and nothing turned up.

$ strings /usr/bin/unzip | grep -ic ZLIB_

0

Next I'll try CPython's unzip functionality.

$ vi run_2
for i in {1..25}; do
    python -m zipfile -e hadoop.zip working/
done

$ time bash run_2

The above completed in 56.84 seconds.

Next I'll use Archiver version 3.2.0 which was built by Matt Holt in Golang. Golang is interesting in that it was co-designed by Ken Thompson, co-creator of UNIX and co-inventor of UTF-8, at Google where he's employed as a Distinguished Engineer. Go can take high-level code and compile it into binaries that can, among other things, take advantage of CPU vectorisation.

$ wget -c -O arc \
    https://github.com/mholt/archiver/releases/download/v3.2.0/arc_linux_amd64
$ chmod +x arc

$ vi run_3
for i in {1..25}; do
    ./arc -overwrite unarchive hadoop.zip
done

$ time bash run_3

The above completed in 100.975 seconds.

Next I'll benchmark version 9.20.1 of the Debian-patched 7-Zip utility.

$ vi run_4
for i in {1..25}; do
    7z x -bd -y -oworking/ hadoop.zip 1>/dev/null
done

$ time bash run_4

The above completed in 72.2 seconds.

Version 9.20.1 of 7-Zip above was released in 2010. I'll fetch a more recent Debian release of 7-Zip. The following will install version 16.02 which was released in mid-2016.
$ wget -c http://http.us.debian.org/debian/pool/main/p/p7zip/p7zip_16.02+dfsg-7_amd64.deb
$ sudo dpkg -i p7zip_16.02+dfsg-7_amd64.deb

I'll then run the 7-Zip benchmark again.

$ time bash run_4

The above completed in 58.05 seconds.

Next I'll compile zlib "next generation". This zlib fork is also worked on by Mark Adler and carries several performance-related patches produced by firms such as Cloudflare and Intel. The past two years have seen over 70 commits from Sebastian Pop, who has a long CV with roles at firms such as AWS, Samsung R&D, Qualcomm, AMD and IBM as well as the FSF and the LLVM project. The following was run using GCC version 5.4.0.

$ git clone https://github.com/zlib-ng/zlib-ng.git ~/zlib-ng
$ cd ~/zlib-ng
$ mkdir -p out
$ CFLAGS="-O4 -m64 -DZLIB_COMPAT -DWITH_GZFILEOP -DHAVE_HIDDEN -DHAVE_BUILTIN_CTZL -DMEDIUM_STRATEGY -DX86_64 -DX86_NOCHECK_SSE2 -DUNALIGNED_OK -DUNROLL_LESS -DX86_CPUID -DX86_SSE2_FILL_WINDOW -DX86_SSE4_2_CRC_HASH -DX86_PCLMULQDQ_CRC -DX86_QUICK_STRATEGY" \
    ./configure \
    --prefix=./out \
    --zlib-compat
$ make
$ make install

Python is dynamically linked to zlib so I can use the above as a drop-in replacement for its unzip functionality.

$ objdump --dynamic-syms /usr/bin/python2.7 | grep ZLIB

0000000000000000      DF *UND*  0000000000000000  ZLIB_1.2.0  inflateCopy

$ cd ~

$ vi run_5
for i in {1..25}; do
    LD_PRELOAD=/home/mark/zlib-ng/out/lib/libz.so.1.2.11.zlib-ng \
        /usr/bin/python2.7 \
        -m zipfile \
        -e hadoop.zip \
        working/
done

$ time bash run_5

The above completed in 48.9 seconds.

Finally, I'll compile my own version of Info-ZIP's unzip utility and see if I can find any performance improvements over the Debian-patched release.
$ git clone https://github.com/LuaDist/unzip.git
$ cd unzip
$ mkdir -p out/bin
$ export CFLAGS="-O4 -m64 -DZLIB_COMPAT -DWITH_GZFILEOP -DHAVE_HIDDEN -DHAVE_BUILTIN_CTZL -DMEDIUM_STRATEGY -DX86_64 -DX86_NOCHECK_SSE2 -DUNALIGNED_OK -DUNROLL_LESS -DX86_CPUID -DX86_SSE2_FILL_WINDOW -DX86_SSE4_2_CRC_HASH -DX86_PCLMULQDQ_CRC -DX86_QUICK_STRATEGY"
$ make -f unix/Makefile generic
$ make prefix=./out -f unix/Makefile install

The unzip binary I've produced is about 10% larger than the Debian-patched one and wasn't built with bzip2's library.

$ ls -alht ~/unzip/out/bin/unzip /usr/bin/unzip

-rwxr-xr-x 2 mark mark 175K Oct 27 05:01 /home/mark/unzip/out/bin/unzip
-rwxr-xr-x 2 root root 159K Nov 20  2015 /usr/bin/unzip

$ objdump --dynamic-syms ~/unzip/out/bin/unzip | grep -v GLIBC

/home/mark/unzip/out/bin/unzip:     file format elf64-x86-64

DYNAMIC SYMBOL TABLE:
0000000000000000  w   D  *UND*  0000000000000000  __gmon_start__

$ objdump --dynamic-syms /usr/bin/unzip | grep -v GLIBC

/usr/bin/unzip:     file format elf64-x86-64

DYNAMIC SYMBOL TABLE:
0000000000000000  w   D  *UND*  0000000000000000  _ITM_deregisterTMCloneTable
0000000000000000      DF *UND*  0000000000000000  BZ2_bzlibVersion
0000000000000000      DF *UND*  0000000000000000  BZ2_bzDecompressInit
0000000000000000      DF *UND*  0000000000000000  BZ2_bzDecompressEnd
0000000000000000  w   D  *UND*  0000000000000000  __gmon_start__
0000000000000000      DF *UND*  0000000000000000  BZ2_bzDecompress
0000000000000000  w   D  *UND*  0000000000000000  _Jv_RegisterClasses
0000000000000000  w   D  *UND*  0000000000000000  _ITM_registerTMCloneTable
0000000000627304  g   D  .data  0000000000000000  Base  _edata
00000000007198b8  g   D  .bss   0000000000000000  Base  _end
0000000000627304  g   D  .bss   0000000000000000  Base  __bss_start
00000000004019c8  g   DF .init  0000000000000000  Base  _init
000000000041acd4  g   DF .fini  0000000000000000  Base  _fini

$ cd ~

$ vi run_6
for i in {1..25}; do
    /home/mark/unzip/out/bin/unzip -qq -o hadoop.zip -d ~/working/
done

$ time bash run_6

The above completed in 52.9 seconds.
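The shell loops above can also be reproduced from Python with the standard library. This is a rough timing harness (the archive name, destination and iteration count below are placeholders for demonstration; substitute hadoop.zip, working/ and 25 to match the benchmarks).

```python
import os
import tempfile
import time
import zipfile

def time_extractions(archive_path, dest, iterations=25):
    """Total wall-clock seconds for repeatedly inflating the archive."""
    start = time.time()
    for _ in range(iterations):
        with zipfile.ZipFile(archive_path) as zf:
            zf.extractall(dest)
    return time.time() - start

# Demonstrate against a tiny throwaway archive.
tmp = tempfile.mkdtemp()
archive = os.path.join(tmp, 'sample.zip')
with zipfile.ZipFile(archive, 'w', zipfile.ZIP_DEFLATED) as zf:
    zf.writestr('working/a.txt', b'sample ' * 1000)

elapsed = time_extractions(archive, tmp, iterations=5)
print('%.3f seconds' % elapsed)
```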
Below is a summary of the benchmark times. The fastest time is given in seconds while all other times are shown as a multiple of how much slower they were.

Time    Method
48.9s   CPython 2.7.12 & zlib-ng 1.2.11
1.08x   Info-ZIP's unzip 6.0 (self-compiled w/o zlib)
1.12x   Info-ZIP's unzip 6.0 (Debian distribution w/o zlib)
1.16x   CPython & zlib 1.2.0
1.19x   7-Zip 16.02
1.48x   7-Zip 9.20.1
2.06x   Archiver 3.2.0
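The multiples in that summary come straight from dividing each time by the fastest; this quick sketch reproduces them from the raw seconds.

```python
# Wall-clock seconds recorded for each method above.
times = {
    'CPython 2.7.12 & zlib-ng 1.2.11': 48.9,
    "Info-ZIP's unzip 6.0 (self-compiled w/o zlib)": 52.9,
    "Info-ZIP's unzip 6.0 (Debian distribution w/o zlib)": 54.55,
    'CPython & zlib 1.2.0': 56.84,
    '7-Zip 16.02': 58.05,
    '7-Zip 9.20.1': 72.2,
    'Archiver 3.2.0': 100.975,
}

fastest = min(times.values())
for method, seconds in sorted(times.items(), key=lambda kv: kv[1]):
    print('%.2fx  %s' % (seconds / fastest, method))
```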

Vectorisation Advantage?

Imagine having a toolbox at one side of a room and a workbench at the other end. Instead of taking tools out of the toolbox one at a time and bringing them over to the workbench, you bring 4 or 8 tools at a time. That's vectorisation.

Intel have been including op codes in their CPUs for vectorisation since the mid-1990s. In 2011, Intel and AMD began shipping CPUs that supported AVX. AVX2 chips followed in 2013 and AVX-512 chips from Intel followed in 2016. The performance improvements require that vectorised op codes are used in the code being executed.

The following is Assembly code written by Denis Bakhvalov in a scalar (non-vectorised) fashion.

mov     edx, dword ptr [r9 + 4*rsi]   # loading rhs[i]
add     edx, 1                        # rhs[i] + 1
shr     edx                           # (rhs[i] + 1) >> 1
mov     dword ptr [rax + 4*rsi], edx  # store result to lhs
mov     esi, edi
add     edi, 1                        # incrementing i by 1
cmp     rcx, rsi
ja      <next iteration>

The following performs the same workload as above but in a vectorised fashion. Note how 256 bits of data, eight 32-bit integers, are being handled in each iteration.

vmovdqu ymm1, ymmword ptr [r9 + 4*rdi]   # loading 256 bits from rhs
vpsubd  ymm1, ymm1, ymm0                 # ymm0 has all bits set, like +1
vpsrld  ymm1, ymm1, 1                    # vector shift right
vmovdqu ymmword ptr [rax + 4*rdi], ymm1  # storing result
add     rdi, 8                           # incrementing i by 8
cmp     rsi, rdi
jne     <next iteration>

I've built the following regular expression pattern to help match any AVX or AVX2 instructions being used. I've printed the pattern out across multiple lines for easier skimming.
$ vi avx_patterns.tmp
vaddpd\|vaddps\|vaddsubpd\|vaddsubps\|vandnpd\|vandnps\|vandpd\|
vandps\|vblendpd\|vblendps\|vblendvpd\|vblendvps\|vbroadcastf128\|
vbroadcasti128\|vbroadcastsd\|vbroadcastss\|vcmppd\|vcmpps\|vcmpsd\|
vcmpss\|vcvtdq2pd\|vcvtdq2ps\|vcvtpd2dq\|vcvtpd2ps\|vcvtps2dq\|
vcvtps2pd\|vcvttpd2dq\|vcvttps2dq\|vdivpd\|vdivps\|vdpps\|
vextractf128\|vextracti128\|vgatherdpd\|vgatherdps\|vgatherqpd\|
vgatherqps\|vhaddpd\|vhaddps\|vhsubpd\|vhsubps\|vinsertf128\|
vinserti128\|vlddqu\|vmaskmovpd\|vmaskmovps\|vmaxpd\|vmaxps\|vminpd\|
vminps\|vmovapd\|vmovaps\|vmovddup\|vmovddup\|vmovdqa\|vmovdqu\|
vmovmskpd\|vmovmskps\|vmovntdq\|vmovntdqa\|vmovntpd\|vmovntps\|
vmovshdup\|vmovsldup\|vmovupd\|vmovups\|vmpsadbw\|vmulpd\|vmulps\|
vorpd\|vorps\|vpabsb\|vpabsd\|vpabsw\|vpackssdw\|vpacksswb\|
vpackusdw\|vpackuswb\|vpaddb\|vpaddd\|vpaddq\|vpaddsb\|vpaddsw\|
vpaddusb\|vpaddusw\|vpaddw\|vpalignr\|vpand\|vpandn\|vpavgb\|
vpavgw\|vpblendd\|vpblendvb\|vpblendw\|vpbroadcastb\|vpbroadcastd\|
vpbroadcastq\|vpbroadcastw\|vpcmpeqb\|vpcmpeqd\|vpcmpeqq\|vpcmpeqw\|
vpcmpgtb\|vpcmpgtd\|vpcmpgtq\|vpcmpgtw\|vperm2f128\|vperm2i128\|
vpermd\|vpermilpd\|vpermilps\|vpermpd\|vpermps\|vpermq\|vpgatherdd\|
vpgatherdq\|vpgatherqd\|vpgatherqq\|vphaddd\|vphaddsw\|vphaddw\|
vphsubd\|vphsubsw\|vphsubw\|vpmaddubsw\|vpmaddwd\|vpmaskmovd\|
vpmaskmovq\|vpmaxsb\|vpmaxsd\|vpmaxsw\|vpmaxub\|vpmaxud\|vpmaxuw\|
vpminsb\|vpminsd\|vpminsw\|vpminub\|vpminud\|vpminuw\|vpmovmskb\|
vpmovsxbd\|vpmovsxbq\|vpmovsxbw\|vpmovsxdq\|vpmovsxwd\|vpmovsxwq\|
vpmovzxbd\|vpmovzxbq\|vpmovzxbw\|vpmovzxdq\|vpmovzxwd\|vpmovzxwq\|
vpmuldq\|vpmulhrsw\|vpmulhuw\|vpmulhw\|vpmulld\|vpmullw\|vpmuludq\|
vpor\|vpsadbw\|vpshufb\|vpshufd\|vpshufhw\|vpshuflw\|vpsignb\|vpsignd\|
vpsignw\|vpslld\|vpslldq\|vpsllq\|vpsllvd\|vpsllvq\|vpsllw\|vpsrad\|
vpsravd\|vpsraw\|vpsrld\|vpsrldq\|vpsrlq\|vpsrlvd\|vpsrlvq\|vpsrlw\|
vpsubb\|vpsubd\|vpsubq\|vpsubsb\|vpsubsw\|vpsubusb\|vpsubusw\|vpsubw\|
vptest\|vpunpckhbw\|vpunpckhdq\|vpunpckhqdq\|vpunpckhwd\|vpunpcklbw\|
vpunpckldq\|vpunpcklqdq\|vpunpcklwd\|vpxor\|vpxor\|vrcpps\|vroundpd\|
vroundps\|vrsqrtps\|vshufpd\|vshufps\|vsqrtpd\|vsqrtps\|vsubpd\|
vsubps\|vtestpd\|vtestps\|vunpckhpd\|vunpckhps\|vunpcklpd\|vunpcklps\|
vxorpd\|vxorps\|vzeroall\|vzeroupper

I'll remove the new lines for grep compatibility.

$ cat avx_patterns.tmp | tr -d '\n' > avx_patterns

Only the Golang Archiver utility appears to be using any AVX op codes.

$ objdump -d ~/arc | grep -cf avx_patterns # 3539
$ objdump -d /usr/bin/unzip | grep -cf avx_patterns # 0
$ objdump -d /usr/lib/p7zip/7z | grep -cf avx_patterns # 0
$ objdump -d /bin/gzip | grep -cf avx_patterns # 0
$ objdump -d /home/mark/unzip/out/bin/unzip | grep -cf avx_patterns # 0
$ objdump -d /usr/bin/python2.7 | grep -cf avx_patterns # 0
$ objdump -d /home/mark/zlib-ng/out/lib/libz.so.1.2.11.zlib-ng \
    | grep -cf avx_patterns # 0

I'll then look for signs of SSE 4.2 op codes in use in any of the binaries benchmarked in this post.

$ vi sse42_patterns
crc32[qwlb]\|pcmpestri\|pcmpestrm\|pcmpistri\|pcmpistrm\|pcmpgtq

$ objdump -d ~/arc | grep -cf sse42_patterns # 24
$ objdump -d /usr/bin/unzip | grep -cf sse42_patterns # 0
$ objdump -d /usr/lib/p7zip/7z | grep -cf sse42_patterns # 0
$ objdump -d /bin/gzip | grep -cf sse42_patterns # 0
$ objdump -d /home/mark/unzip/out/bin/unzip | grep -cf sse42_patterns # 0
$ objdump -d /usr/bin/python2.7 | grep -cf sse42_patterns # 0
$ objdump -d /home/mark/zlib-ng/out/lib/libz.so.1.2.11.zlib-ng \
    | grep -cf sse42_patterns # 5

Both Archiver and zlib-ng contain SSE 4.2 op codes. The above isn't proof that those op codes were executed in the code paths followed during the ZIP extraction process, but an absence of them does prove there is no support at all for these optimisations.

It's also worth noting that CRC-32 checksums, which have a dedicated op code in SSE 4.2, are used with GZIP compression in zlib while Adler-32 is used for zlib-wrapped DEFLATE streams. The instances of CRC-32-related op codes above prove existence in the binary but can be presumed to go unused in the code path the unzip process took. Adler-32 was generally able to outperform CRC-32 at the expense of error-detection strength, but that changed when Intel released a CRC-32-specific op code in the SSE 4.2 instruction set additions a few years ago.
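Both checksums are exposed by Python's zlib module, so the two algorithms are easy to compare side by side. A quick illustration (the sample buffer is my own; note that SSE 4.2's crc32 op code actually implements the Castagnoli polynomial, while zlib's CRC-32 polynomial is typically accelerated with PCLMULQDQ instead):

```python
import zlib

payload = b"The quick brown fox jumps over the lazy dog"

# Adler-32: two running sums modulo 65521, very cheap in software.
adler = zlib.adler32(payload) & 0xffffffff

# CRC-32: table-driven in plain C; hardware assists exist but use
# either a different polynomial (SSE 4.2) or carry-less multiply.
crc = zlib.crc32(payload) & 0xffffffff

print('Adler-32: %08x' % adler)
print('CRC-32:   %08x' % crc)
```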
Intel published research two years ago in which they worked to improve the performance of an Adler-32 implementation using AVX2. Though not measured in the research, the firm believes AVX-512 could improve on their findings further. There have been similar efforts on ARM chips in recent years as well.

After my initial publication Klaus Post reached out and noted that the DEFLATE bit streams seen in the above ZIP file don't contain checksums. There are CRC-32 checksums contained within the ZIP file structure itself but not within the bit streams of compressed data. This means checksums are unlikely to be a major source of computational overhead.

He also noted that the inflating of these streams is done byte by byte, with each value dependent on the previous value. This means Intel's vectorisation instructions aren't going to be of great use for this workload. That being said, the Bit Manipulation Instruction Sets 1 and 2, which were first introduced in Intel CPUs with the Haswell microarchitecture in 2013, could help accelerate DEFLATE decompression. I've yet to prove these are present and contributing to zlib-ng's strong performance.
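Klaus's point can be checked from Python: the bytes stored for a ZIP member inflate cleanly with a headerless decompressor, and the CRC-32 lives in the member's metadata rather than inside the DEFLATE stream. A small sketch, assuming nothing beyond the standard library and a minimal single-member archive built in memory:

```python
import io
import struct
import zipfile
import zlib

payload = b'some repetitive payload ' * 100

# Build a one-member DEFLATE-compressed archive in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w', zipfile.ZIP_DEFLATED) as zf:
    zf.writestr('a.txt', payload)

data = buf.getvalue()
info = zipfile.ZipFile(io.BytesIO(data)).getinfo('a.txt')

# The local file header is 30 bytes; the two 16-bit fields at
# offsets 26 and 28 give the filename and extra-field lengths.
name_len, extra_len = struct.unpack(
    '<HH', data[info.header_offset + 26:info.header_offset + 30])
start = info.header_offset + 30 + name_len + extra_len
stream = data[start:start + info.compress_size]

# The raw bit stream inflates with a headerless (wbits=-15)
# decompressor: no zlib header, no checksum trailer.
inflated = zlib.decompressobj(-15).decompress(stream)
assert inflated == payload

# The CRC-32 is recorded in the ZIP metadata, not in the stream.
assert zlib.crc32(payload) & 0xffffffff == info.CRC
```

If the stream carried a zlib wrapper or trailer, the headerless decompressor would either fail or leave trailing bytes; it does neither.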