Stateless ZIP library - SLZ

            Branch name   Last version   Released     Links       Notes
Development master        1.2.0          2020-02-16   git / web   may be broken
1.2         1.2-stable    1.2.0          2020-02-16   git / web   Stable version

Introduction

SLZ is a fast and memory-less stream compressor which produces an output that can be decompressed with zlib or gzip. It does not implement decompression at all; zlib is perfectly fine for this.

How does it work?

The purpose is to use SLZ in situations where a zlib-compatible stream is needed and zlib's resource usage would be too high, while the compression ratio is not critical. The typical use case is in HTTP gateways which have to compress many streams in parallel with little CPU resources to assign to this task, and without having to throttle the compression ratio due to the memory usage. In such an environment, the server's memory usage remains low and bounded regardless of the number of concurrent streams. In addition, its high performance made it fill a gap in network backup applications.

While zlib requires about 256 kB of memory per stream in addition to a few tens of bytes for the stream descriptor itself, SLZ only requires a stream descriptor of a few tens of bytes. Thus it is particularly suited to environments having to deal with tens to hundreds of thousands of concurrent streams.
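The memory comparison above can be illustrated with zlib's own documented per-stream formula (from zconf.h: (1 << (windowBits+2)) + (1 << (memLevel+9)) bytes); the 64-byte descriptor size used for the small-footprint side below is a hypothetical round figure for illustration, not SLZ's exact descriptor size:

```c
/* Per-stream memory of a zlib deflate context, using the formula documented
 * in zconf.h: (1 << (windowBits+2)) + (1 << (memLevel+9)) bytes. */
static long zlib_stream_mem(int window_bits, int mem_level)
{
    return (1L << (window_bits + 2)) + (1L << (mem_level + 9));
}

/* Total memory in MB for n concurrent streams of 'per_stream' bytes each. */
static long long total_mem_mb(long long n, long long per_stream)
{
    return n * per_stream / (1024 * 1024);
}
```

With zlib's defaults (windowBits 15, memLevel 8) this gives 256 kB per stream, i.e. roughly 25 GB for 100,000 concurrent streams, versus a handful of MB when each stream only needs a small fixed descriptor.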

The key difference between zlib and SLZ is that SLZ is stateless, in that it doesn't consider the data previously compressed as part of its dictionary. This doesn't hurt compression performance when it is fed in large enough chunks (at least a few kB at once), but it means the data should not be fed a few bytes at a time, or the caller will have to implement the buffering needed to maximize the compression ratio. For example, an HTTP gateway may defragment all the parts from a chunked-encoded response buffer into a linear buffer that is then fed at once to SLZ. This temporary buffer would then be static and shared between all streams running on the same thread.
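The defragmentation approach described above can be sketched as follows; the chunk layout and buffer size are hypothetical, and the final compression call is only hinted at in a comment since the exact SLZ call depends on the API version in use (see slz.h):

```c
#include <stddef.h>
#include <string.h>

/* Collect the parts of a chunked-encoded response into one linear buffer so
 * the compressor sees a single large input. In a real gateway this buffer
 * would be static and shared by all streams of the same thread. */
static size_t defragment(char *linear, size_t cap,
                         const char *const chunks[], const size_t lens[],
                         size_t nchunks)
{
    size_t used = 0;

    for (size_t i = 0; i < nchunks; i++) {
        if (used + lens[i] > cap)
            break;              /* in practice: compress and flush, then resume */
        memcpy(linear + used, chunks[i], lens[i]);
        used += lens[i];
    }
    /* then compress linear[0..used] in one call, e.g. with slz_encode()
     * (check slz.h in your version for the exact signature) */
    return used;
}
```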

A side effect of its stateless nature is that it can emit a zlib-compatible stream that resists compression-based attacks such as CRIME/BREACH, by processing the sensitive and the attacker-controlled data in distinct batches. Since no dictionary state is kept between batches, there is no risk that attacker-controlled data is used to guess sensitive data using dictionary lookups.

Reference tests

Another induced benefit, which was not initially sought, is that SLZ is faster than zlib even in its fastest mode. This is the result of the simpler matching and encoding mechanisms involved, which permit even more optimisations.

Some performance tests were run using version 1.0 on various types of data, showing widely varying speeds on the test laptop depending on the difficulty of finding matching strings and on the amount of data emitted as literals (hence not compressed). In the end, comparative tests were performed using some reference data sets that are already being used for other libraries. This performance report was produced using one of these data sets to get something comparable to what is reported by the LZ4 utility here : https://github.com/Cyan4973/lz4 . New tests on more recent versions would be worthwhile since performance on x86_64 and armv8 was further improved.

Product            Input BW   Compression ratio   Notes
lz4-r124           352 MB/s   2.09
lzo-2.09           347 MB/s   2.10
slz-142            266 MB/s   2.12                deflate format
slz-142            214 MB/s   2.12                gzip format
zlib-1.28 -1 -h     88 MB/s   1.64                huffman only
zlib-1.28 -1 -r     81 MB/s   1.65                RLE only
zlib-1.28 -1 -F     70 MB/s   2.32                fixed huffman
zlib-1.28 -1        70 MB/s   2.74

The tests were run on a single core of a Core i5-3320M at 3.3 GHz (turbo mode). SLZ was included in Inikep's lzbench benchmark suite, which compares a wide range of compression tools. The project maintains an updated performance table where other solutions can be compared.

SLZ can emit a stream that a regular zlib-compatible browser can decompress (RFC1950/1951/1952, aka zlib, deflate and gzip). Here the extra speed is traded for a less compressed payload (29% larger than zlib's). Tests run on other file types (including HTML, CSS, JS) show that the ratio between the two outputs remains in the same range: the output is consistently somewhat larger for SLZ than for zlib.

Since SLZ focuses a lot on the performance savings gained by its small memory footprint, it scales very well when several cores are used in parallel (low memory bandwidth), as shown in these tests run on a quad-core ARM Cortex A17 (RK3288) running at 1.8 GHz with the same input data :

#Cores   Input size   Time(s)   BW/core(MB/s)   Total BW(MB/s)
1        211938580    3.575     59.3            59.3
2        211938580    3.669     57.8            115.5
3        211938580    3.697     57.3            172.0
4        211938580    3.748     56.5            226.2

Increasing compression ratio in HTTP gateways

Since some products adjust their compression level based on the CPU usage or the available memory, on high speed networks zlib can end up running at its lowest compression settings in order to fill an uplink. With SLZ, much less CPU is needed for compression, and the memory usage is negligible, meaning that the compression doesn't have to be throttled. The net effect is a better compression ratio compared to zlib at a given resource usage.
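The per-core figures in the multi-core scaling table above can be re-derived from the input size and the measured wall-clock time (decimal megabytes, matching the table):

```c
/* Per-core bandwidth in MB/s (decimal megabytes, as in the table),
 * derived from the input size and the measured wall-clock time. */
static double bw_per_core(long input_bytes, double seconds)
{
    return input_bytes / seconds / 1e6;
}

/* Total bandwidth for n cores, each processing the same input. */
static double bw_total(int cores, long input_bytes, double seconds)
{
    return cores * bw_per_core(input_bytes, seconds);
}
```

For the first row, 211938580 bytes in 3.575 s gives about 59.3 MB/s per core, and the 4-core row gives roughly 226 MB/s total, matching the table.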

HAProxy running on a single core of a Core i5-3320M, with 500 concurrent users, the default 16 kB buffers, and clients accessing www.haproxy.org (76 kB), shows the following measurements, with traffic leaving the machine via a gigabit Ethernet link (hence the 976 Mbps of IP traffic) after being compressed with zlib and SLZ respectively :

At 100% CPU (no CPU usage limit) :

Library   BW In       BW Out     BW Saved   Ratio   Memory VSZ/RSS   Hits/s
zlib      480 Mbps    188 Mbps   292 Mbps   2.55    130M / 101M      744
slz       1700 Mbps   810 Mbps   890 Mbps   2.10    23.7M / 9.7M     2382

At 85% CPU (limited) :

Library   BW In       BW Out     BW Saved   Ratio   Memory VSZ/RSS   Hits/s
zlib      1240 Mbps   976 Mbps   264 Mbps   1.27    130M / 100M      1738
slz       1600 Mbps   976 Mbps   624 Mbps   1.64    23.7M / 9.7M     2210

In the CPU-limited test, once the CPU usage limit was reached, compression was disabled for new streams, which were forwarded uncompressed. This explains how the gigabit link was always saturated, and how the saved bandwidth varies according to the average compression ratio. In all situations, SLZ served significantly more requests per second than zlib while using a fraction of its memory.

No magics inside!

There are 6 key points having a large impact on compression speed in any LZ-based compressor :

- dictionary lookup : looking up a sequence of bytes in a recent window is not necessarily fast. SLZ uses a hash-based technique similar to LZO or LZ4, focusing on keeping track of only the most recent occurrence of a given sequence. The hash table contains the absolute position in the stream of the last occurrence of the hashed entry. This table is initialized to impossible values prior to compressing so that unused entries are not even evaluated, as it was found to be faster than not initializing them. During tests, it was found that hashing 4 bytes to one value among 4096-65536 resulted in the best compression ratio, and that 8192 entries was the best tradeoff between speed and compression ratio. Smaller hash tables cause more collisions, hurting the compression ratio, and larger ones experience a significant slowdown on certain datasets due to the hash table having more difficulty fitting entirely in the CPU cache.

- indexing : if sequences of less than 3 bytes are considered for a lookup, since we only keep the last occurrence, there's a high risk of not keeping the most useful sequence, which has large chances of matching a similar word. For example, in HTML a lot of occurrences of "><", "</" and "/>" can be found, so the probability that the one from "</a>" is picked instead of the one from "</td>" will make it hard to find long matches. Conversely, sequences larger than 4 bytes will only match once in a while and will prevent shorter series of characters from being matched. Experimentation has shown that hashing 3 bytes provided the same compression ratio as hashing 4 bytes, with up to 10% faster input processing thanks to the reduced cost of hashing. Attempts to index the repeated data were made, and caused a significant performance drop (about 5%) for less than 1% compression ratio improvement, so they were abandoned.

- longest match : the longest match measurement is very similar to a strcmp() call except that we count matching characters. In order to avoid this slow operation most of the time, we store in the hash table a copy of the 4 bytes that were hashed. This makes it easy to perform a comparison and skip the entries which do not match, then to only start the longest match measurement for parts which already match on the first 4 bytes. With a 13-bit hash on a 32-bit character sequence, there's a theoretical 1 chance in 524288 that a matching hash implies a matching sequence. But in practice, on selected text this chance is significantly higher since not all 32 bits are used; it's more like 24 bits (6 bits per char), so it's one chance in 2048 that an entry will match. That's still 2047/2048 useless calls to the memmatch function that are saved this way.

- encoding the output : deflate is often criticized for its bit-oriented stream, which is expensive to manipulate as it requires bit shifts. Moreover, on top of this the deflate stream supports several encoding modes, the most common one being dynamic huffman trees, which require significant maintenance to keep a valid tree up to date and which SLZ doesn't use. Instead, SLZ uses static huffman trees, and when too many binary entries are found to keep this encoding efficient, it switches to the plain literal mode. Proceeding like this comes with a significant benefit : all code sequences, all distances and all lengths are stored as their respective huffman output bits. This avoids an encoding phase; sequences of bits are immediately sent to the destination buffer without any further operation, and remain reasonably cheap compared to the more common usages.

- checksumming : 3 checksum methods exist depending on the encoding format. RFC1950 (zlib format) uses an Adler-32 checksum, which theoretically involves a modulus (divide) but which can be optimized as it's done with a number very close to a power of two. RFC1951 (raw deflate) doesn't involve any checksum. RFC1952 (gzip) uses a CRC-32, which is somewhat expensive but can be optimized in various ways. Since SLZ was designed mainly around gzip, most of the optimization effort went into CRC-32. Some well-known methods involve pre-computed tables, which can be further optimized for multi-byte as well as single-byte processing. It was found that updating the CRC-32 on a per-character basis before looking up the hash came at almost no cost, probably because it compensates for some pipeline stalls when the hashed entry is not in the CPU's L1 cache. But in order to avoid multiplying the functions, the checksum computation was moved out of the compression function at a very low extra cost (less than 1%), as it can process all the input at once, 4 bytes at a time, and contributes to preloading the input data into the caches.

- implementation-specific optimizations : modern CPUs can load unaligned words from memory with minimal to no overhead. On architectures that support this (ix86, x86_64, armv7), this is used to limit the number of operations and memory accesses. Some CPUs have a smaller cache and the direct mapping between distances and huffman sequences can make it thrash a lot. Some experiments were made using a direct mapping only for the shortest distances (the most common ones), but results were not encouraging for now, as a cache miss is not completely offset by the amount of operations saved.
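The last-occurrence dictionary described above can be sketched as follows. The hash function and the 3-byte/13-bit choices below are illustrative stand-ins, not SLZ's actual code; the point is the structure: one slot per hash, holding the last position plus a copy of the hashed bytes so that collisions are rejected without touching the input window:

```c
#include <stdint.h>

#define HASH_BITS 13                 /* 8192 entries, the tradeoff cited above */
#define HASH_SIZE (1u << HASH_BITS)

struct dict_entry {
    uint32_t pos;                    /* last position of this sequence + 1; 0 = unused */
    uint32_t sample;                 /* copy of the hashed bytes for cheap rejection */
};

/* Illustrative multiplicative hash of 3 input bytes down to HASH_BITS bits. */
static uint32_t hash3(const uint8_t *p)
{
    uint32_t v = p[0] | (p[1] << 8) | ((uint32_t)p[2] << 16);
    return (v * 2654435761u) >> (32 - HASH_BITS);
}

/* Return the last position where the 3 bytes at in+pos were seen, or -1,
 * then record the current position as the most recent occurrence. */
static long dict_lookup_insert(struct dict_entry *tab,
                               const uint8_t *in, uint32_t pos)
{
    uint32_t h = hash3(in + pos);
    uint32_t sample = in[pos] | (in[pos + 1] << 8) | ((uint32_t)in[pos + 2] << 16);
    long prev = -1;

    if (tab[h].pos && tab[h].sample == sample)   /* reject hash collisions cheaply */
        prev = (long)tab[h].pos - 1;
    tab[h].pos = pos + 1;
    tab[h].sample = sample;
    return prev;
}
```

A hit only says where the sequence was last seen; a real compressor would then run the longest-match measurement from that position, as described above.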

Two other factors have a significant impact on the compression ratio :

- collisions in the dictionary : caused by suboptimal hash distribution. Here, with 8k entries, we have what looks like the best tradeoff between speed and compression ratio for most workloads tested, and it can easily be changed using a macro.

- output encoding : dynamic huffman trees would definitely shorten the output stream, but at an important performance cost. Yet even with fixed trees it is possible to waste a lot of space. First, input bytes 144 to 255 are encoded on 9 bits per byte. SLZ constantly monitors the output stream to decide when it's preferable to switch to plain literals, which cost exactly 52 bits to switch back and forth and will not inflate the output stream any further. Second, referencing a match can be longer than the data it replaces. This is true for short sequences, but it can be true even for a long sequence in the middle of a series of plain literals. By monitoring the output stream and verifying the encoding cost before switching, it is possible to avoid any overhead beyond the base 5 bytes per 32kB plus headers for non-compressible data.
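For the zlib format's checksum mentioned above, the near-power-of-two modulus (65521, the largest prime below 2^16) allows the reduction to be deferred: the running sums can accumulate for up to 5552 bytes (zlib's documented NMAX bound) before a single modulus is applied, removing almost all divisions. A minimal sketch of this standard technique:

```c
#include <stddef.h>
#include <stdint.h>

#define ADLER_MOD  65521u   /* largest prime below 2^16 */
#define ADLER_NMAX 5552     /* max bytes before s2 could overflow 32 bits (zlib's NMAX) */

/* Adler-32 with deferred modulus: instead of reducing after every byte,
 * reduce only once per ADLER_NMAX-byte run. */
static uint32_t adler32_update(uint32_t adler, const uint8_t *buf, size_t len)
{
    uint32_t s1 = adler & 0xffff;
    uint32_t s2 = (adler >> 16) & 0xffff;

    while (len) {
        size_t n = len < ADLER_NMAX ? len : ADLER_NMAX;

        len -= n;
        while (n--) {
            s1 += *buf++;
            s2 += s1;
        }
        s1 %= ADLER_MOD;
        s2 %= ADLER_MOD;
    }
    return (s2 << 16) | s1;
}
```

The initial value is 1, as in zlib; CRC-32, used by the gzip format, needs the table-based or multi-byte tricks mentioned above instead.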

Get it!


Implementations

HAProxy can use SLZ since version 1.6.0 to compress HTTP responses.

SLZ is provided as a library with a few extra tools (e.g. a compressor emitting the various formats). It is distributed under a permissive license, meaning that you can do a lot of things with it, such as merge it into other projects, as well as use it in proprietary software. Contributions are welcome, should be sent as git format-patch series, and will be made under the same license exclusively.

External links