LZ4_8088

Extremely fast LZ4 decompression for the 8088/8086 CPU

LZ4_8088 is a chunk of assembly code that implements incredibly fast LZ4 decompression for 8088 and 8086 CPUs. It is specifically optimized for the 8088 and is intended for use in hobby or retrocomputing projects. Code is provided for most generic x86 assemblers, as well as an example of how to use it in Turbo Pascal.

LZ4 itself is a compresson format that is implemented as an open-source C library, with ports to other platforms. The goal of the library is speed, and it currently holds many top speed rankings in compression benchmarks. One variant of LZ4 called LZ4_HC implements optimal parsing, which creates the very best possible set of literal and match runs for a given input and coding set. This leads to compression ratios that are competitive with PKZIP/zlib, but unlike PKZIP, LZ4 doesn't implement order-0 coding (ie. bit twiddling which 8088/8086 is very bad at) so the end result decompresses at nearly memcpy() speeds.

Here are some statistics comparing LZ4 with PKWare's Data Compression Library (DCL). The DCL was chosen as a comparison because it uses an algorithm very similar to deflate (which was considered state of the art speed-wise for many years on DOS platforms), and also because the DCL could be measured with microsecond accuracy. To eliminate implementation bias, PKWare's retail DCL compiled assembly code was used.

Data Type Filename Original size LZ4_HC compressed size LZ4_HC ratio DCL Implode size DCL Ratio memcpy() time in µs (REP MOVSW) LZ4 Decompression Speed (x slower than memcpy) DCL Decompression speed (x slower than memcpy) Sparse bitmap shuttle.pic 16390 3297 20% 2552 16% 44989 1.91 20.50 RLE bitmap rle.dmp 16384 436 3% 387 2% 44977 0.76 5.79 Small text file text.txt 4988 3037 61% 2323 47% 13716 3.17 61.21 Large text file largetxt.txt 56899 26890 47% 25642 45% 156457 3.17 54.43 Sparse compiled binary robotron.com 40704 21048 52% 18178 45% 112036 2.75 53.51 Dense compiled binary linewars.exe 61744 41500 67% 36641 59% 169924 2.16 70.08 Uncompressable data random.bin 64000 64260 100% 71822 112% 176113 1.39 116.14

Key takeaways from the above table:

For most sources, decompression is never slower than 3.2x memcpy.

For source material that contains long runs of sequences (RLE), decompression is faster than memcpy .

. Compression ratios are competitive; they're mostly within a few percentage points of PKWare ratios.

For full documentation, please see the included LZ4_8088.TXT in the download distribution. It is highly recommended you read the documentation to avoid any pitfalls using the code.

LZ4_8088.ZIP contains the assembler routine, a Turbo Pascal test harness, documentation, compression samples, and compiled binaries for Win32 and DOS 16-bit.

The LZ4 library is provided under the BSD License. However, my code was not derived from any of the original LZ4 code so I can provide it via any license I choose. So, I am providing my code under what I am calling the Demoscene License. The Demoscene License grants you the following rights:

You are free to use this code in any production, commercial or otherwise, without providing remuneration to the author.

If you use this code, you must greet "Trixter/Hornet" if used in a demoscene production, or "Jim Leonard" if used in a normal program. Also, you must send email to trixter@oldskool.org telling him you used the code so he can marvel at your result.

Q: Is this code faster than the LZ4 C source code? For 8088-80286 CPUs, yes. For 386+, no, because the 386 has 32-bit registers, additional segment registers, and additional instructions. Compiling the LZ4 C code with a suitable 32-bit compiler will produce code that definitely outperforms this 8088 assembly code.

Q: How is it possible to exceed the speed of a REP MOVSW? Because 1-byte and 2-byte runs in the compressed data are handled specially with REP STOSW. STOSW can set a fixed value in memory in half the time it takes to move a value from one memory location to another.

Q: Why did you bother doing this? Because I like impressing the teenage version of myself.

Currently this distribution only includes decompression code. It is possible to implement LZ4 "c0" compression on 8088 in as little as 16K memory and with reasonable speed, but I don't plan on developing it.

20130107: Initial release.

20130209: Now includes a size-optimized version of the code as well. The speed routine compiles to 446 bytes including the shift table; the size-optimized code is 30% slower but compiles to 79 bytes. Also, speed-optimized version of the code sped up an additional 1% by contributions from both Peter Ferrie and Terje Mathisen (thanks guys!)

20130210: Forgot an optimization; the size version now optimizes to 78 bytes.

20130213: Fixed a bug that could corrupt output if the matches overlapped.

20190615: Size version optimizes down to 79 bytes, not 78. Also unified the .PAS and .ASM files.

20190617: Added size fixups by Axel Kern. Refactored code into a single .ASM file; the .PAS is the test harness only.

20190630: Fixed a bug in _small; assembles to 78 bytes.

20200314: Added speed improvements by Pavel Zagrebin. These speed optimizations make Peter Ferrie's "reversed match offsets" optimization obsolete, so that version was removed.