Large Text Compression Benchmark

Matt Mahoney

Last update: Sept. 9, 2020. history



This competition ranks lossless data compression programs by the compressed size (including the size of the decompression program) of the first 10^9 bytes of the XML text dump of the English version of Wikipedia on Mar. 3, 2006. About the test data.

The goal of this benchmark is not to find the best overall compression program, but to encourage research in artificial intelligence and natural language processing (NLP). A fundamental problem in both NLP and text compression is modeling: the ability to distinguish between high probability strings like recognize speech and low probability strings like reckon eyes peach. Rationale.

This is an open benchmark. Anyone may contribute results. Please read the rules first.

Open source compression improvements to this benchmark with certain hardware restrictions may be eligible for the Hutter Prize.

Benchmark Results

Compressors are ranked by the compressed size of enwik9 (10^9 bytes) plus the size of a zip archive containing the decompresser. Options are selected for maximum compression at the cost of speed and memory. Other data in the table does not affect rankings. This benchmark is for informational purposes only. There is no prize money for a top ranking. Notes about the table:

Program: The version believed to give the best compression. A | denotes a combination of 2 programs.

Compression options: selected for what I believe gives the best compression.

enwik8: compressed size of the first 10^8 bytes of enwik9. This data is used for the Hutter Prize, and is also ranked here but has no effect on this ranking.

enwik9: compressed size of the first 10^9 bytes of enwiki-20060303-pages-articles.xml.

decompresser size: size of a zip archive containing the decompression program (source code or executable) and all associated files needed to run it (e.g. dictionaries). A letter following the size has the following meaning: x = executable size. s = source code size (if available and smaller). d = size of a separate decompression program (separate from compression). For self extracting archives (SFX), the size is 0 because the decompresser and compressed data are combined into one file. For testing, if no zip file is supplied I create archives using InfoZIP 2.32 -9. (Prior to July 1, 2008 I used 7zip 4.32 -tzip -mx=9).

Total size: total size of compressed enwik9 + decompresser size, ranked smallest to largest.

Comp: compression rate in nanoseconds per byte on the largest file tested (e.g. seconds for enwik9). Speed is approximate and has no effect on ranking. A ~ means "very approximate". Not all tests are done on the same computer. Times reported are the smaller of process time (summed over processors if multi-threaded) or real time as measured with timer. If there is no note then the program was tested on a Compaq Presario 5440, 2.188 GHz, Athlon-64 3500+ in 32 bit Windows XP. An underlined time means that no better compressor is faster.

Decomp: decompression time as above. If blank, decompression was not tested yet and the ranking is pending verification that the output is identical. An underlined time means that no better compressor is faster.

Mem: approximate memory used for compression in MB. Decompression uses the same or possibly less. There is some ambiguity whether a megabyte means 10^6 bytes or 2^20 bytes. The approximation is coarse enough that it doesn't matter. I use peak memory as measured with Windows Task Manager during compression (so if you really want to know, 1 MB = 1,024,000 bytes :) Memory does not include swap or temporary files. An underlined value means that no better compressor uses less memory.

Alg: compression algorithm, referring to the method of parsing the input into symbols (strings, bytes, or bits) and estimating their probabilities (modeling) for choosing code lengths. Symbols may be arithmetic coded (fractional bit length for best compression), Huffman coded (bit aligned for speed), or byte aligned as a preprocessing step. Dict (Dictionary): symbols are words, coded as 1 or 2 bytes, usually as a preprocessing step. LZ (Lempel Ziv): symbols are strings. LZ77: repeated strings are coded by offset and length of a previous occurrence. LZW (LZ Welch): repeats are coded as indexes into a dynamically built dictionary. ROLZ (Reduced Offset LZ): LZW with multiple small dictionaries selected by context. LZP (LZ Predictive): ROLZ with a dictionary size of 1. on (order n, e.g. o0, o1, o2...): symbols are bytes, modeled by frequency distribution in the context of the last n bytes (a sketch of such a model follows these notes). PPM (Prediction by Partial Match): order n, modeled in the longest context matched, but dropping to lower orders for byte counts of 0. SR (Symbol Ranking): order n, modeled by time since last seen. BWT (Burrows Wheeler Transform): bytes are sorted by context, then modeled by order 0 SR. ST (Sort Transform): BWT using a stable sort with truncated string comparison. DMC (Dynamic Markov Coding): bits modeled by PPM. CM (Context Mixing): bits, modeled by combining the predictions of independent models. LSTM (Long Short Term Memory): CM using neural network models. Some compressors combine multiple steps such as Dict+PPM or LZP+DMC. I indicate the last stage before coding.

Notes: brief notes. See the program descriptions for details. Usually a note means the result was reported by somebody else on a different computer.
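To make the order-n entry above concrete, here is a toy Python sketch (not code from any listed compressor) of an order-2 byte model: counts collected in the context of the previous two bytes give the probabilities an arithmetic coder would use, and the estimated output size is the sum of -log2(p) over the input.

from collections import defaultdict
import math

class Order2Model:
    # Toy order-2 byte model: P(next byte | previous 2 bytes) from counts.
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # context -> byte -> count
        self.context = b"\x00\x00"

    def predict(self, byte):
        ctx = self.counts[self.context]
        total = sum(ctx.values())
        return (ctx[byte] + 1) / (total + 256)  # Laplace smoothing: unseen bytes get p > 0

    def update(self, byte):
        self.counts[self.context][byte] += 1
        self.context = self.context[1:] + bytes([byte])

model = Order2Model()
data = b"the cat sat on the mat. the cat sat on the mat."
bits = 0.0
for b in data:
    bits += -math.log2(model.predict(b))   # ideal arithmetic-coded length of this byte
    model.update(b)
print(f"estimated {bits/8:.1f} bytes for {len(data)} input bytes")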

Fails on enwik9

Compression                   Compressed size          Decompresser   Total size     Time (ns/byte)
Program     Options           enwik8       enwik9      size (zip)     enwik9+prog    Comp    Decomp    Mem  Alg  Note
-------     -------           ----------   -------     -----------    -----------    -----   ------   ----  ---  ----
hipp 5819   /o8               20,555,951   (fails)      36,724 x                      5570     5670    719  CM
ppmz2                         23,557,867   (fails)      29,362 s                     92210    88070   1497  PPM  26
XMill 0.8   -w -P -9 -m800    26,579,004   (fails)     114,764 xd                      616      530    800  PPM
lzp3o2                        33,041,439   (fails)      23,427 xd                      230      270    151  LZP

Programs that properly decompress enwik8 and don't use external dictionaries are still eligible for the Hutter Prize.

Testing not yet completed

Compression             Compressed size        Decompresser   Total size     Time (ns/byte)
Program      Options    enwik8       enwik9    size (zip)     enwik9+prog    Comp    Decomp    Mem  Alg   Note
-------      -------    ----------   -------   -----------    -----------    -----   ------   ----  ---   ----
rdmc 0.06b              33,181,612                                             1394     1381        DMC   6
ESP v1.92               36,651,292                                              223                 LZ77  16

Pareto frontier: compressed size vs. compression time as of Aug. 18, 2008 from the main table (options for maximum compression).

Pareto frontier: compressed size vs. memory as of Aug. 18, 2008 (options for maximum compression).

Notes about compressors

I only test the latest supported version of a program. I attempt to find the options that select the best compression, but will not generally do an exhaustive search. If an option advertises maximum compression or memory, I don't try the alternatives. If you know of a better combination, please let me know. I will select the maximum memory setting that does not cause disk thrashing, usually about 1800 MB. If the compressor is not downloadable as a zip file then I will compress the source or executable (whichever archive is smaller) plus any other needed files (dictionaries) into a single zip archive using 7zip 4.32 -tzip -mx=9. If no executable is available I will attempt to compile in C or C++ (MinGW 3.4.2, Borland 5.5 or Digital Mars), Java 1.5.0, MASM, NASM, or gas.

1. Reported by Guillermo Gabrielli, May 16, 2006. Timed on a Celeron D325 2.53 GHz, Windows XP SP2, 256 MB RAM.

2. Decompression size and time for pkzip 2.0.4. kzip only compresses.

3. Reported by Ilia Muraviev (author of PX, TC, pimple), June 10-July 18, 2006. Timed on a P4 3.0 GHz, 1GB RAM, WinXP SP2.

4. enwik9 reported by Johan de Bock, May 19, 2006. Timed on Intel Pentium-4 2.8 GHz 512KB L2-cache, 1024MB DDR-SDRAM.

5. Compressed with paq8h (VC++ compile) and decompressed with paq8h (Intel compile of the same source code). Normally compression and decompression are the same speed.

6. ocamyd 1.65.final and LTCB 1.0 reported by Mauro Vezzosi, May 30-June 20, 2006. Timed on a 1.91 GHz AMD Athlon XP 2600+, 512 MB, WinXP Pro 2002 SP2 using timer 3.01. ocamyd 1.66.final reported Feb. 3, 2007. Times are process times.

7. Under development by Mauro Vezzosi, May 24, 2006.

8. Reported by Denis Kyznetsov (author of qazar), June 2, 2006.

9. Reported by sportman, May 24, 2006. Timed on a Intel Pentium D 830 dual core 3.0GHz, 2 x 512MB DDR2-SDRAM PC4300 533Mhz memory timing 4-4-4-12 (833.000KB free), Windows XP Home SP2. CPU was at 52% so apparently only one of 2 cores was used. Decompression verified on enwik8 only (not timed, about 2.5 hours). WinRK compression options: Model size 800MB, Audio model order: 255, Bit-stream model order: 27, Use text dictionary: Enabled, Fast analyses: Disabled, Fast executable code compression: Disabled

10. Reported by Malcolm Taylor (author of WinRK), May 24, 2006. Timed on an Athlon X2 4400+ with 2GB, running WinXP 64. Decompression not tested. decompresser size is based on SFX stub size reported by Artyom (A.A.Z.), Sept. 2, 2007, although it was not tested this way.

11. Reported by sportman, May 25, 2006. CPU as in note 9.

12. Reported by sportman, May 30, 2006. CPU as in 9 (50% utilized).

13. xwrt 3.2 options are -2 -b255 -m250 -s -f64. ppmonstr J options are -o10 -m1650.

14. Reported by Michael A Maniscalco, June 15, 2006.

15. Reported by Jeremiah Gilbert on the Hutter group, Aug. 18, 2006. Tested under Linux on a dual Xeon 1.6 GHz(lv) (overclocked to 2.13 GHz) with 2 GB memory. Time is user+sys (real = 196500 ns/byte).

16. Reported by Anthony Williams, Aug. 19-22. 2006. Timed on a 2.53 GHz Pentium 4 with 512 MB under WinXP Home SP2.

17. Tested Aug. 20, 2006 under Ubuntu Linux 2.6.15 on a 2.2 GHz Athlon-64 with 2 GB memory. Time is approximate wall time due to disk thrashing. User+sys time is 153600 ns/byte compress, 148650 decompress.

18. Reported by Dmitry Shkarin (author of durilca4linux), Aug. 22-23, 2006 for durilca4linux_1; and Oct. 16-18, 2006 for durilca4linux_2. 3 GB memory usage is RAM + swap. Tested on AMD Athlon X2 4400+, 2.22 GHz, 2 GB memory under SuSE Linux AMD64 v10.0. durilca4linux_3 reported Feb. 21, 2008 using 4 GB RAM + 1 GB swap. v2 reported Apr. 22, 2008. v3 reported May 22, 2008.

19. enwik8 confirmed by sportman, Sept. 20, 2006. Compression time 61480 ns/byte timed on a 2 x dual core (only one core active) Intel Woodcrest 2GHz with 1333MHz fsb and 4GB 667MHz CL5 memory under SiSoftware Sandra Lite 2007.SP1 (10.105). Dhrystone ALU 37,014 MIPS, Whetstone iSSE3 25,393 MFLOPS, Integer x8 iSSE4 220,008 it/s, Floating-point x4 iSSE2 119,227 it/s.

20. Reported by Giorgio Tani (author of PeaZip) on Nov. 10, 2006. Tested on a MacBook Pro, Intel T2500 Core Duo CPU (one core used), with 512 MB memory under WinXP SP2. Time is combined compression and decompression.

21. enwik9 -8 reported by sportman, Dec. 12-13, 2006. Hardware as note 19. enwik9 decompression not verified. paq8hp7 -8 enwik8 compression was reported as 16,417,650 (4 bytes longer; the size depends on the length of the input filename, which was enwik8.txt rather than enwik8). I verified enwik8 -7 and -8 decompression.

22. paq8hp8 -8 enwik9 reported by sportman, Jan. 18, 2007. paq8hp10 -8 enwik9 on Apr. 2, 2007. paq8hp11 -8 enwik9 on May 10, 2007. paq8hp12 -8 enwik8/9 on May 20, 2007. Hardware as in note 19. Decompression verified for enwik8 only.

23. 7zip 4.46a options were -m0=PPMd:mem=1630m:o=10 -sfx7xCon.sfx

24. paq8o8-intel (intel compile of paq8o8) -1, paq8o8z-jun7 (DOS port of paq8o8) -1 reported by Rugxulo on Jun 10, 2008. Timed on a AMD64x2 TK-53 Tyler 1.7 GHz laptop with Vista Home Premium SP1.

25. paq8o8z -1 enwik8 (DJGPP compile) reported by Rugxulo on Jun 17, 2008. Tested on a 2.52 Ghz P4 Northwood, no HTT, WinXP Home SP2.

26. Tested on a Gateway M-7301U laptop with 2.0 GHz dual core Pentium T3200 (1MB L2 cache), 3 GB RAM, Vista SP1, 32 bit. Run times are similar to my older computer.

27. enwik9 size reported by Eugene Shelwien, Mar. 5, 2009. enwik8 size and all speeds are tested as in note 26.

28. Reported by Eugene Shelwien on a Q6600, 3.3 GHz, WinXP SP3, ramdrive: bcm 0.06 on Mar. 15, 2009, bcm 0.08 on June 1, 2009.

29. Reported by kaitz (KZ): paq8p3 on Apr. 19, 2009, v2 on Apr. 21, 2009, paq8pxd on Jan. 21, 2012, v2 on Feb. 11, 2012, v3 on Feb. 23, 2012, v4 on Apr. 23, 2012. 2012 tests on a Core2Duo T8300 2.4 GHz, 2 GB.

30. Reported by Sami Runsas (author of bwmonstr), July 14, 2009. Tested on an Athlon XP 2200 (Win32).

31. Reported by Dmitry Shkarin, July 21, 2009, Nov. 12, 2009. Tested on a 3.8 GHz Q9650 with 16 GB memory under Windows XP 64bit Pro SP2. Requires msvcr90.dll.

32. Reported by Mike Russell, Sept. 11, 2009. Tested on an 2.93 GHz Intel Q6800 with 3.5 GB memory.

33. Reported by Con Kolivas (author of lrzip) on Nov. 27, 2009 (lrzip 0.40), Nov. 30, 2009 (lrzip 0.42), Mar. 17, 2012 (lrzip 0.612). Tested on a 3 GHz quad core Q9650, 8 GB, 64 bit debian linux.

34. Reported by sportman, Nov. 29, 2009 (durilca'kingsize), Nov. 30, 2009 (durilca'kingsize4), Apr. 8, 2010 (bsc 1.0.0). Test hardware: 2 x 2.4GHz (overclocked at 2.53 GHz) quad core Xeon Nehalem, 24GB DDR3 1066MHz, 8 x 2TB RAID5, Windows 2008 Server R2 64bit.

35. Reported by zody on Dec. 12, 2009. Tested in Windows 7, x64, 3.6 GHz e8200, 4 GB 1066 MHz RAM.

36. Reported by Ilia Muraviev on Dec. 16, 2009. Tested on a 2.40 GHz Core 2 Duo, DDR2-800 4GB RAM, Windows7 x64.

37. Reported by Sami Runsas, Mar. 3, 2010. Tested under Win64 on a Q6600 at 3.0 GHz.

38. Reported by Ilya Grebnov, Apr. 7, 2010. Tested on an Intel Core 2 Duo E8500, 8 GB memory, Windows 7.

39. Reported by Ilya Grebnov, Apr. 8, 2010. Tested on an Intel Core 2 Quad Q9400, 8 GB memory, Windows 7. bsc 2.00 on May 3, 2010. bsc 2.2.0 on June 15, 2010.

40. Reported by Sami Runsas, May 10, 2010. Tested on an overclocked Intel Core i7 860. nanozip 0.08a tested June 6, 2010. nanozip 0.09a on Nov. 5, 2011.

41. lpaq9m reported by Alexander Rhatushnyak on June 9, 2010. Tested on an Intel Core i7 CPU 930 (8 core), 2.8 GHz, 2.99 GB RAM. paq8hp12any tested June 28, 2010.

42. Reported by Michal Hajicek, June 4, 2010 on an AMD Phenom II 965, 64 bit Windows. WinRK, ppmonstr on June 14.

43. Reported by Ilia Muraviev, June 26, 2010. Tested on a Core 2 Quad Q9300, 2.50 GHz, 4 GB DDR2, Windows 7.

44. Timed on a Dell Latitude E6510 laptop Core I7 M620, 2.66 GHz, 4 GB, Windows 7 32-bit.

45. Reported by Richard Geldreich (lzham author) on Aug. 30, 2010. Tested on a 2.6 GHz Core i7 (quad core + HT), 6 GB, Win7 x64.

46. Reported by Stefan Gedo (ST author) on Oct. 14, 2010. Tested on Athlon II X4 635 2.9 GHz, 4 GB memory, Windows 7.

47. Reported by David A. Scott on Dec. 15, 2010. Tested on a I3-370 with 6 GB DDR3 1033 MHz memory.

48. Timed on a Dell Latitude E6510 laptop Core I7 M620, 2.66 GHz, 4 GB, Ubuntu Linux 64-bit.

49. Tested by the author on a Q9450, 3.52 GHz = 440x8, ramdrive.

50. Tested by the author on an Intel Core i7-2600, 3.4 GHz, Kingston 8 GB DDR3, WD VelociRaptor 10000 RPM 600 GB SATA3, Windows 7 Ultimate SP1.

51. Tested by Bulat Ziganshin on i7-2600, 4.6 GHz with 1600 MHz RAM (8-8-8-21-1T) and NVIDIA GeForce 560Ti at 900/2000 MHz.

52. Tested by Michael Maniscalco on an 8 core Intel Xeon E5620, 2.40 GHz, 12 GB memory running Windows 7 Enterprise SP1, 64 bit.

53. Tested by the author on a Core i7-2600K @ 4.6GHz, 8GB DDR3 @ 1866MHz, 240GB Corsair Force GT SSD.

54. Tested by Piotr Tarsa on a Core 2 Duo E8400, 8 GiB RAM, Ubuntu 11.10 64-bit, OpenJDK 7.

55. Tested by David Catt on a 64 bit Windows 7 laptop, 2.33 GHz, 4 GB, 4 cores.

56. Reported by the author on a Athlon II X4 635 2.9 GHz, 4GB, Windows 8 Enterprise.

57. Reported by the author on a x86_64 Athlon 64 X2 5200+ with 8 GiB of RAM running GNU/Linux 2.6.38.6-libre.

58. Reported by the author on a 4 GHz i7-930 from ramdrive.

59. Reported by the author on a I7-2600, 4.6 GHz, 16 GB RAM, Ubuntu 13.04.

60. Tested by Ilia Muravyov on an Intel Core i7-3770K, 4.8 GHz, 16 GB Corsair Vengeance LP 1800 MHz CL9, Corsair Force GS 240 GB SSD, Windows 7 SP1.

61. Tested by Matt Mahoney on a dual Xeon E-2620, 2.0 GHz, 12+12 hyperthreads, 64 GB RAM (20 GB usable), Fedora Linux.

62. Tested by Valéry Croizier on a 2.5 GHz Core i5-2520M, 4 GB memory, Windows 7 64 bit.

63. Tested by Ilia Muravyov on an Intel i7-3770, 4.7 GHz, Corsair Vengeance LP 1600 MHz CL9 16 GB RAM, Samsung 840 Pro 512 GB SSD, Windows 7 SP1.

64. Tested by Kennon Conrad on a 3.2 GHz AMD A8-5500.

65. Tested by sportman on an Intel Core i7 4960X 3.6GHz OC at 4.5GHz - 6 core (12 threads) 22nm Ivy Bridge-E, Kingston 8 x 4GB (32GB) DDR3 2400MHz 11-14-14 under clocked at 2000MHz 10-11-11. Windows 8.1 Pro 64-bit, SoftPerfect RAM Disk 3.4.5 64-bit.

66. Tested by Byron Knoll on a Intel Core i7-3770, 31.4 GB memory, Linux Mint 14.

67. Tested by Kennon Conrad on a 4.0 GHz i4790K, 16 GB at 1866 MHz, 128 GB SSD Windows 8.1.

68. Tested by Ilia Muraviev on an Intel Core i7-3770K @ 4.8GHz, 8GB 2133 MHz CL11 DDR3, 512GB Samsung 840 Pro SSD, Windows 7 Ultimate SP1.

69. Tested by Nania Francesco Antonio on an Intel Core i7 920, 2.67 GHz, 6 GB RAM.

70. Tested by Richard Geldreich on a Core i7 Gulftown 3.3 Ghz, Win64.

71. Tested by Christoph Diegelmann on a Core i7-4770K, 8 GB DDR3, Samsung 840Pro 128 GB, Fedora 21 64 bit, gcc 4.9.2.

72. Tested by Skymmer on a i7-2770K, WinXP x64 SP2.

73. Tested by Andreas M. Nilsson on a 1.7 GHz Intel Core i7, 8 GB 1600 MHz DDR3, Mac OS X 10.10.3 (14D136).

74. Tested by Michael Crogan on a Core i7-3930K, 3.20 GHz, 6+HT, 64 GB, Linux64.

75. Tested by Mauro Vezzosi on a Core i7-4710HQ 2.50-3.50 GHz, 8 GB DDR3, Windows 8.1 64 bit.

76. Tested by Yann Collet on Core i7-3930K, 4.5 GHz, Linux 64, gcc 5.2.0-5.3.1.

77. Tested by Darek on a Core i7 4900 MQ, 2.8 GHz overclocked to 3.7 GHz, 16 GB, Win7Pro 64.

78. Tested by mpais on a Core i7 5820K 4.4 GHz, Windows 10.

79. Tested by Sportman on 2 x Intel Xeon E5-2643 v3, 6 cores (12 threads), 3.4GHz, 3.7GHz turbo, 20MB L3 cache, 8 x 32GB DDR4 2133MHz CAS 15, SoftPerfect RAM Disk 3.4.7, Windows Server 2012 R2 64-bit.

80. Tested by kaitz on an Intel Celeron G1820 DDR3 8GB PC3-12800 (800 MHz).

81. Tested by Darek on a Core i7 4900MQ 2.8GHz overclocked to 3.8GHz, 32GB, Win7Pro 64.

82. Tested by Ilia Muraviev on an Intel Core i7-4790K @ 4.6GHz, 32GB @ 1866MHz DDR3 RAM, RAMDisk.

83. Tested by Byron Knoll on an Intel Core i7-7700K, 32 GB DDR4, Ubuntu 16.04-18.04.

84. Tested by Fabrice Bellard on 2 x Xeon E5-2640 v3 @ 2.6 GHz, 196 GB RAM, Linux.

85. Tested by Georgi Marinov on a Windows 10 Laptop: Lenovo Ideapad 310; i5-7200u @2.5GHz; 8GB DDR4 @1066MHz (2133MHz) CL15 CR2T; L2 cache: 2x256KB; L3 cache: 3MB; SSD: Crucial MX500 500GB

86. Tested by Byron Knoll on an Intel Xeon 2.30 GHz, 13 GB, Tesla P100 GPU.

87. Tested by Byron Knoll on an Intel Xeon 2.00 GHz, 13 GB, Tesla V100 GPU.

I have not verified results submitted by others. Timing information, when available, may vary widely depending on the test machine used.

About the Compressors

The numbers in the headings are the compression ratios on enwik9.

.1159 cmix

cmix v1 is a free, open source (GPL) file compressor by Byron Knoll, Apr. 16, 2014. It is a context mixing compressor with dictionary preprocessing based on code from paq8hp12any and paq8l but increasing the number of context models and mixer layers. It takes no compression options. cmix v2 was released May 29, 2014. cmix v3 was released June 27, 2014. cmix v4 was released July 22, 2014. It uses 28,976,428 KiB memory (29.7 GB). cmix v5 was released Aug. 13, 2014. The decompressor size is a zip archive containing the source code, makefile, and a dictionary compressed with cmix from 465211 to 90065 bytes. cmix v6 was released Sept. 3, 2014. The decompressor size includes the dictionary compressed with cmix from 465211 to 90207 bytes. cmix v7 was released Feb. 4, 2015. cmix v8 was released Nov. 10, 2015. cmix v9 was released Apr. 8, 2016. cmix v10 was released June 17, 2016. cmix v11 was released July 3, 2016. It incorporates a modification originally developed by Eugene Shelwien in which PPMd is included as a model. cmix v12 was released Nov. 7, 2016. It includes an LSTM model. cmix v13 was released Apr. 24, 2017. cmix v14 was released Nov. 22, 2017. cmix v15 was released May 19, 2018. cmix v16 was released Oct. 6, 2018. cmix v17 was released Mar. 24, 2019. cmix v18 was released Aug. 2, 2019.

Compression           Compressed size             Decompresser   Total size     Time (ns/byte)
Program    Options    enwik8       enwik9         size (zip)     enwik9+prog    Comp     Decomp     Mem   Notes
-------    -------    ----------   -----------    -----------    -----------    ------   ------   -----   -----
cmix v1               16,076,381   128,647,538    279,185 x      128,926,723    181924   179706   20785   66
cmix v2               15,863,623   126,323,656    310,068 x      126,633,724    580083   577626   28152   66
cmix v3               15,809,519   125,971,560    274,992 x      126,246,552    267978   266622   26681   66
cmix v4               15,784,946   125,621,620    278,375 x      125,899,995    284243   282390   28976   66
cmix v5               15,769,367   125,526,628    163,552 s      125,690,180    282056   282647   28865   66
cmix v6               15,738,922   124,172,611    161,908 s      124,334,519    280749   282137   30882   66
cmix v7               15,738,825   124,168,463    166,785 s      124,335,248    280416   280904   30600   66
cmix v8               15,709,216   123,930,173    164,882 s      124,095,055    344244   346641   30311   66
cmix v9               15,627,536   123,874,398    161,911 s      124,036,309    346436   345681   26929   66
cmix v10              15,587,868   123,257,156    164,263 s      123,421,419    355721   355850   29924   66
cmix v11              15,566,358   122,977,954    172,261 s      123,150,215    377529   374440   27745   66
cmix v12              15,440,186   121,718,424    175,953 s      121,894,377    571339   574522   27865   66
cmix v13              15,323,969   120,480,684    177,979 s      120,658,664    617346   615987   27803   66
cmix v14              15,210,458   119,017,492    203,717 s      119,221,209    631838   627802   28287   83
cmix v15              15,111,677   117,959,016    217,830 s      118,176,846    650055   651716   28365   83
cmix v16              14,955,482   116,912,035    226,121 s      117,138,156    613898   658679   27708   83
cmix v17              14,877,373   116,394,271    208,263 s      116,602,534    641189   645651   25258   83
cmix v18              14,838,332   115,714,367    208,961 s      115,923,328    602867   601569   25738   83

.1165 phda9

phda 1.0 (discussion) is the public version of a winning Hutter prize submission dated Dec. 15, 2017 by Alexander Rhatushnyak. There are Windows and Linux executables, no source. The original prize winning version is a 64 bit Linux decompressor (no source) and compressed enwik8 as a RAR archive, awarded Nov. 4, 2017, posted Aug. 12, 2019. Archive plus decompressor size is 15,284,944 bytes. It uses 1 GB memory and a 176 MB scratch file. There is a version that uses only RAM. phda9 1.2 (discussion) was released Mar. 13, 2018. phda9 1.3 was released Apr. 21, 2018. The decompressor size for enwik8 is different (557050 bytes) because the dictionary is loosely compressed in the decompressor instead of in the compressed file. phda9 1.4 was released May 20, 2018. This is mainly a bug fix version. phda9 1.5 was released Aug. 1, 2018. enwik8 uses a separate decompressor with a size of 557415 bytes. phda9 1.6 was released Oct. 20, 2018. enwik8 uses a separate decompressor with a size of 564616 bytes. phda9 1.7 was released Feb. 18, 2019. enwik8 uses a separate decompressor with a size of 565,352 bytes. phda9 1.8 was released July 4, 2019. enwik8 uses a separate decompressor with a size of 558,298 bytes.

Compression           Compressed size             Decompresser   Total size     Time (ns/byte)
Program    Options    enwik8       enwik9         size (zip)     enwik9+prog    Comp     Decomp     Mem   Notes
-------    -------    ----------   -----------    -----------    -----------    ------   ------   -----   -----
phda9 1.0             15,173,565   118,658,060    41,994 xd      118,700,054     56815    55201    5031   83
phda9 1.2             15,144,786   118,335,817    42,745 xd      118,378,562     60726    61586    4992   83
phda9 1.3             15,069,752   117,617,185    42,108 xd      117,659,293     86557    87375    4996   83
phda9 1.4             15,074,624   117,603,125    42,110 xd      117,645,235     87520    87909    4992   83
phda9 1.5             15,063,267   117,223,130    42,428 xd      117,265,558     85877    86365    4995   83
phda9 1.6             15,040,647   117,039,346    41,911 xd      117,081,257     84713    88401    4996   83
phda9 1.7             15,023,870   116,940,874    43,274 xd      116,984,148     83712    87596    4996   83
phda9 1.8             15,010,414   116,544,849    42,944 xd      116,587,793     86182    86305    6319   83

.1194 nncp

nncp is a free, experimental file compressor by Fabrice Bellard, released May 8, 2019. It uses a neural network model with dictionary preprocessing described in the paper Lossless Data Compression with Neural Networks. Compression of enwik9 uses the options:

./preprocess c out.words enwik9 out.pre 16384 512
./nncp -n_layer 7 -hidden_size 384 -n_embed_out 5 -n_symb 16388 -full_connect 1 -lr 6e-3 c out.pre out.bin

Version 2019-11-16 was released Nov. 16, 2019. It was run in 8 threads.

Compression                 Compressed size             Decompresser   Total size     Time (ns/byte)
Program          Options    enwik8       enwik9         size (zip)     enwik9+prog    Comp     Decomp     Mem   Alg   Notes
-------          -------    ----------   -----------    -----------    -----------    ------   -------    ----  ----  -----
nncp 2019-05-08             16,791,077   125,623,896    161,133 xd     125,785,029    420168    602409    2040  LSTM  84
nncp 2019-11-16             16,292,774   119,167,224    238,452 xd     119,405,676    826048   1156467    5360  LSTM  84

.1273 tensorflow-compress

tensorflow-compress v1 is a free, open source experimental file compressor by Byron Knoll, July 20, 2020. It uses an LSTM neural network accelerated by a GPU if available. It uses a dictionary and preprocessor from NNCP by default, or from cmix. The test results for v1 use the default settings and were tested by the author on an Intel Xeon 2.30 GHz, 13 GB RAM with a Tesla P100 GPU. It uses 10138 MiB CPU RAM and 15525 MiB GPU RAM. It is run as a Colab notebook.

v2 was released Sept. 7, 2020. It runs on a V100 GPU using 2669 MB CPU RAM and 15621 MB GPU RAM. The decompressor contains a Colab notebook, the NNCP preprocessor source code and makefile, and a dictionary created by the NNCP preprocessor.

Program                   enwik8       enwik9         Prog          Total          Comp     Deco     Mem     Note
---------                 ----------   -----------    ----------    -----------    ------   ------   -----   ----
tensorflow-compress v1    20,119,747   159,716,240     88,870 sd    159,805,110     72260    82259   25663   86
tensorflow-compress v2    16,828,585   127,146,379    175,047 sd    127,321,426    157196   142820   18290   87

.1277 durilca

durilca and durilca'light 0.5 by Dmitry Shkarin (Apr. 1, 2006) are closed source, experimental command line file compressors based on ppmd/ppmonstr with filters for text, exe, and data with fixed length records (wav, bmp, etc). durilca'light is a faster version with less compression. Unfortunately both crash on enwik9. Decompression is verified on enwik8.

The -m700 option selects 700 MB of memory. (It appears to use substantially more for enwik9 according to Windows task manager). -o12 selects PPM order 12 (optimal for enwik9 -t0). -t0 (default) turns off text modeling, which hurts compression but is necessary to compress enwik9 (although decompression still crashes). -t2(3) turns on text preprocessing (dictionary; thus the increased decompresser size). -t2 also supports 3 additive flags (4, 8, 16) which have no effect on this data, thus -t2(31) or -t2 (default is 31) give the same compression as -t2(3).

durilca 0.5(Hutter) was released 1457Z Aug. 16, 2006. It does not use external dictionaries. When run with 1 GB memory (-m700), -o13 is optimal. With 2 GB (-m1650), -o21 is optimal. The unzipped .exe file is 86,016 bytes.

durilca4linux_1 (0825Z Aug 23 2006) is a Linux version of durilca 0.5(Hutter) which successfully compresses enwik9 and decompresses with UnDur (23,375 bytes zipped, 42,065 bytes uncompressed). All versions of durilca require memory specified by -m plus memory to read the input file into memory. In Windows, this exceeds the 2 GB process limit regardless of available RAM and swap. Thus, enwik9 compresses only under Linux with 2 GB real memory and 1 GB additional swap. The -o12 option is optimal for enwik9 (tested under 64 bit SuSE 10.0 by the author), -o24 for enwik8 (verified by me under 64 bit Ubuntu 2.6.15).

durilca4linux_2 (Oct. 16, 2006) is a closed source Linux version specialized for this benchmark. It includes a warning that use on other files may cause data loss. It requires AMD64 Linux and 3 GB of memory (2 GB for enwik8). The decompresser files (EnWiki.dur and UnDur) are contained within a 241,322 byte zip file in the rar distribution. To compress:

./DURILCA d EnWiki.dur
./DURILCA e -m1800 -o10 -t2 enwik9

To decompress:

./UnDur EnWiki.dur
./UnDur enwik9.dur

durilca4linux_3 (dictionary version v1) was released Feb. 21, 2008. Like version 2, it requires extraction of EnWiki.dur before compressing or decompressing, and may not work with files other than enwik8 and enwik9. As tested, requires 64-bit Linux, 4 GB RAM, and 5 GB RAM+swap.

undur3 v2 contains an improved dictionary (version v2), released Apr. 22, 2008, for DURILCA4Linux_3. The compression and decompression programs are the same. The decompression program UnDur (Linux executable) is included. To compress, download durilca4linux_3 and replace the dictionary (EnWiki.dur) with this one. The options are -m3600 (3600 MB memory), -o14 (order 14 PPM), -t2 (text model 2).

undur3 v3, released May 22, 2008, uses an improved dictionary but the same compressor and decompresser as v1 and v2. The dictionary contains 123,995 lowercase words separated by NUL bytes. Of these, 5579 words occur more than once (wasted space?) I tested options -m1500 under Ubuntu Linux with 2 GB memory. At -m1500 top reports 2157 MB virtual memory and 1894 MB real memory. -m1600 caused disk thrashing.

durilca kingsize (July 21, 2009) runs under 64 bit Windows and requires 13 GB memory. It is designed to work only on this benchmark and not in general. The dictionary file EnWiki.fsd must be extracted first from EnWiki.dur before compression or decompression. Requires msvcr90.dll. enwik8 can be compressed with -m1200 (1.2 GB).

durilca4_decoder is a new dictionary for durilca'kingsize (above), Nov. 12, 2009. It is reported as "durilca'kingsize_4" below. Decompression time is reported to be 1411.88 sec with "durilca d" and 1796.98 sec with "UnDur". enwik8 compresses with 1200 MB (-m1200) in 157.38 sec.

Compression                                  Compressed size             Decompresser   Total size     Time (ns/byte)
Program               Options                enwik8       enwik9         size (zip)     enwik9+prog    Comp    Decomp   Notes
-------               -------                ----------   -----------    -----------    -----------    -----   ------   -----
durilca'light 0.5     -m650 -o12             21,089,993   178,562,475    1,495,422 x    180,057,897     1227   (fails)
durilca 0.5           -m700 -o12 -t0         19,227,202   162,117,578       74,292 x    162,191,870     4140   (fails)
                      -m800 -o128            19,321,003   164,298,178       74,292 x    165,372,470     7718   (fails)
                      -m700 -o12 -t2(3)      18,520,589   (fails)        1,507,312 x                    3330     3940
durilca 0.5(Hutter)   -m700 -o13 -t2         18,128,339   (fails)           77,295 x                    5905
                      -m1650 -o21 -t2        17,958,687   (fails)           77,295 x                    6140     6140
durilca4linux_1       -m700 -o13 -t2         18,128,334                     23,375 xd                   5950     5880
                      -m1750 -o12 -t2        18,027,888   146,521,559       23,375 xd   146,544,934     5500     7301   18
                      -m1750 -o24 -t2        17,949,422                     23,375 xd                   6190     6780
durilca4linux_2       -m1800 -o10 '-t2(11)'  17,002,831   136,536,189      241,322 xd   136,777,511     4249     4827   18
                      -m1800 -o10 -t2        16,998,300   136,596,818      241,322 xd   136,838,140     4405     4894   18
durilca4linux_3 v1    -m3600 -o14 -t2        16,356,063   129,933,145      345,957 xd   130,279,102     3649     3715   18
                      -m1200 -o32 -t2        16,348,796                                                 4170     4178   18
durilca4linux_3 v2    -m3600 -o14 -t2        16,323,581   129,670,441      344,525 xd   130,014,966     3628     3639   18
                      -m1200 -o32 -t2        16,316,255                                                 4148     4157   18
durilca4linux_3 v3    -m3600 -o14 -t2        16,292,414   129,469,384      339,990 xd   129,809,374     3624     3627   18
                      -m1200 -o32 -t2        16,285,285                                                 4135     4138   18
                      -m1500 -o6 -t2         16,517,051   133,674,565                                   3852
                      -m1500 -o7 -t2         16,418,799   132,239,495                                   4006
                      -m1500 -o8 -t2         16,368,632   131,722,213                                   4149
                      -m1500 -o9 -t2         16,335,259   131,549,901      339,990 xd   131,889,891     4261     4344
                      -m1500 -o10 -t2        16,316,775   131,574,739                                   4405
                      -m1500 -o11 -t2        16,306,086   131,707,901                                   4544
                      -m1500 -o12 -t2        16,299,411   131,807,298                                   4554
                      -m1500 -o14 -t2        16,292,414   132,238,662                                   4763
                      -m1500 -o16 -t2        16,289,512   132,516,825                                   4879
                      -m1500 -o32 -t2        16,285,285   134,238,759                                   5440
durilca'kingsize      -m13000 -o40 -t2       16,258,380   127,695,666      333,790 xd   128,029,456     1413     1805   31
                      -m22500 -o40 -t2                    127,695,666                                   1806     1814   34
durilca'kingsize_4    -m13000 -o40 -t2       16,209,167   127,377,411      407,477 xd   127,784,888     1398     1797   31
                                             16,209,167   127,377,411                                   1788     1802   34

.1301 cmve

cmv 00.01.00 is a free, closed source, experimental file compressor for 32 bit Windows by Mauro Vezzosi, Sept. 6, 2015. It uses context mixing. Option "2,3,+" selects max compression (2), max memory (3), and a large set of models (+). A hex bitmap for this argument turns individual models on or off. Note 48 timings are for enwik8 only.

cmv 00.01.01 was released Jan. 10, 2016. It is compatible with 00.01.00 and does not change the compression ratio.

cmve 0.2.0 was released Nov. 28, 2017.

Program        Options              enwik8       enwik9         zip size     Total          Comp      Deco     Cmem   Dmem   Alg   Note
--------       -----------          ----------   -----------    ---------    -----------    -------   ------   ----   ----   ---   ----
cmv 00.01.00   -m2,3,+              18,218,283   150,226,739    77,404 x     150,304,143     285750   293090   2817   2817   CM    48,75
                                                 150,226,739    77,404 x     150,304,143     216000            2801          CM    75
               -m2,3,0x03ededff     18,153,319                                               720000           ~3900          CM    75
cmv 00.01.01   -m2,3,0x03ed7dfb     18,122,372   149,357,765    77,404 x     149,435,169     426162   394855   3335   3335   CM    75
cmve 0.2.0     -m2,3,0x7fed7dfd     16,424,248   129,876,858    307,787 x    130,184,645    1140801           19963          CM    81

.1323 paq8hp12any

paq8hp12any was developed as a fork of the PAQ series of open source context mixing compressors by Alexander Rhatushnyak. It was forked from the paq8 series developed largely by Matt Mahoney, and uses a dictionary preprocessor (xml-wrt) originally developed by Przemyslaw Skibinski as a separate program and later integrated. All versions are optimized for the Hutter prize. Thus, they are tuned for enwik8. The 12 versions are described below in chronological order. They originally were located here (link broken) and can now be found here (as a zpaq archive) (as of Sept. 16, 2009). All programs are free, GPL open source, command line archivers. Most take a single option controlling memory usage.

Note: these programs are compressed with upack, which compresses better than upx. Some virus detectors give false alarms on all upack-compressed executables. The programs are not infected.

paq8hp1 by Alexander Rhatushnyak, 1945Z Aug. 21, 2006. It is a modification of paq8h using a custom dictionary tuned to enwik8 for the Hutter prize. Because the Hutter prize requires no external dictionaries, the dictionary is spliced into the .exe file during the build process. When run, it creates the dictionary as a temporary file. The program must be run in the current directory (not in your PATH or with an explicit path), or else it can't find this file. The unzipped paq8hp1.exe is 206,764 bytes. Decompression was verified for enwik8 (60730 ns/b for -8, 60660 ns/b for -7). enwik9 is pending.

paq8hp2 (source code) by Alexander Rhatushnyak, 0233Z Aug. 28, 2006 is an improved version of paq8hp1 submitted for the Hutter prize. paq8hp2.exe size is 205,276 bytes. It differs from paq8hp1 mainly in that the 43K word dictionary for 2-3 byte codes is sorted alphabetically. The 80 most frequent words, coded as 1 byte before compression, are grouped by syntactic type (pronoun, preposition, etc).

paq8hp3 (source code) by Alexander Rhatushnyak, released Aug. 29, 2006 is an improved version of paq8hp2 submitted for the Hutter prize on Sept. 3, 2006. The 80 dictionary words coded with 1 byte and 2560 words coded with 2 bytes are organized into semantically related groups or by common suffixes. The 40,960 words with 3 byte codes are sorted from the last character in reverse alphabetical order. paq8hp3.exe is 178,468 bytes unzipped. enwik9 decompression is not yet verified. For enwik8, decompression is verified with time 60300 ns/b compression, 60220 ns/b decompression.

paq8hp4 (source code) by Alexander Rhatushnyak, released and submitted for the Hutter prize on Sept. 10, 2006, is an improved version of paq8hp3. The dictionary is further organized into semantically related groups among 3-byte codes. The unzipped size of paq8hp4.exe is 206,336 bytes.

paq8hp5 (source code) by Alexander Rhatushnyak, released Sept. 20, 2006, is an improved version of paq8hp4, submitted for the Hutter prize on Sept. 25, 2006. The unzipped size of paq8hp5.exe is 174,616 bytes (in spite of a slightly larger dictionary). The dictionary size is optimized for enwik8; a larger dictionary would improve compression of enwik9. Decompression is verified for enwik8 only (-8 at 74640 ns/b). A Linux port of paq8hp5 is by Лъчезар Илиев Георгиев (Luchezar Georgiev), Oct 26, 2006 (mirror).

paq8hp6 (source code) by Alexander Rhatushnyak, released Oct. 29, 2006, is an improved version of paq8hp5. It was submitted as a Hutter prize candidate on Nov. 6, 2006. Unzipped paq8hp6.exe size is 170,400 bytes. The -8 option was not tested on enwik9 due to disk thrashing on my 2 GB PC. Compression was about 25% finished after 9 hours.

paq8hp7a by Alexander Rhatushnyak, Dec. 7, 2006, was intended to supersede paq8hp6 as a Hutter prize entry, then was withdrawn on Dec. 10, 2006 with the release of paq8hp7. Unzipped executable size is 151,664 bytes. -8 for enwik9 (but not enwik8) caused disk thrashing on my computer (2 GB, WinXP).

paq8hp7 (source code) by Alexander Rhatushnyak, Dec. 10, 2006, as a Hutter prize entry. Unzipped paq8hp7.exe size is 152,556 bytes.

paq8hp8 (source code) by Alexander Rhatushnyak, Jan. 18, 2007, as a Hutter prize entry (replacing an incorrect version posted 2 days earlier). Unzipped size is 152,692 bytes. The dictionary is identical to paq8hp7.

paq8hp9 (mirror) (source code) by Alexander Rhatushnyak, Feb. 20, 2007, is a Hutter prize entry. Only the -7 option works. The unzipped size of paq8hp9.exe is 112,628 bytes.

paq8hp9any (Feb. 23, 2007) by Alexander Rhatushnyak is a paq8hp9 -7 compatible version with external dictionary where all options work. However the zipped program is larger and -8 was not tested due to disk thrashing, so results are unchanged.

paq8hp10 (Mar. 26, 2007) by Alexander Rhatushnyak was derived from paq8hp9 as a Hutter prize entry. The unzipped size is 103,224 bytes. Only the -7 option works.

paq8hp10any (source code), Mar. 31, 2007, by Alexander Rhatushnyak is archive compatible with paq8hp10 -7 but works with other memory options. When run, paq8hp10.exe and both dictionary files should be in the current directory. This program is not a Hutter prize entry.

paq8hp11 (mirror) by Alexander Rhatushnyak, Apr. 30, 2007, is a Hutter prize entry. paq8hp11.exe is 99,816 bytes. Like paq8hp10, it works only with the -7 option.

To compress: paq8hp11 -7 enwik8.paq8hp11 enwik8
To decompress: paq8hp11 enwik8.paq8hp11

paq8hp11any (source code) by Alexander Rhatushnyak, May 2, 2007, is a paq8hp11 variant that accepts any memory option. It was optimized for speed rather than size. It includes two dictionary files which must be present in the current directory when run, unlike paq8hp11 where the dictionary is self extracted. -8 selects 1850 MB memory. -7 produces the same archive as paq8hp11. Run speeds for -8 enwik8 are 76770+76820 ns/B.

paq8hp12 (mirror) by Alexander Rhatushnyak, May 14, 2007, is a Hutter prize entry. paq8hp12.exe size is 99,696 bytes. It works only with the -7 option like paq8hp11.

paq8hp12any (source code) by Alexander Rhatushnyak, May 20, 2007, is a paq8hp12 variant that accepts any memory option (like paq8hp11any). The -7 option produces an archive identical to that of paq8hp12.

paq8hp12any was updated on Jan. 9, 2009 to fix a compiler issue and add a 64 bit Linux version. Compressed file format was not changed. It was not retested.

Options select memory usage as shown in the table.

Compression               Compressed size             Decompresser   Total size     Time (ns/byte)
Program       Options     enwik8       enwik9         size (zip)     enwik9+prog    Comp     Decomp    Mem   Note
-------       -------     ----------   -----------    -----------    -----------    ------   ------   ----   ----
paq8hp1       -7          17,566,769                  205,783 x                      60170    60660    748
              -8          17,397,023   142,477,977    205,783 x      142,683,760     63317            1595
paq8hp2       -7          17,390,490                  204,557 x                      62000    62330    747
              -8          17,223,661   141,145,684    204,557 x      141,350,241     65323            1584
paq8hp3       -7          17,241,280                  177,477 x                      61360    59690    742
              -8          17,085,021   139,905,045    177,477 x      140,082,522     63420            1586
paq8hp4       -7          17,039,173                  198,525 x                     ~65000    65110    755
              -8          16,889,237   138,188,695    198,525 x      138,387,220     67956    68120   1598
paq8hp5       -7          16,898,402                  161,887 x                      76300    77710    900   19
              -8          16,761,044   137,017,311    161,887 x      137,179,198    ~85153    75162   1787
paq8hp6       -7          16,731,800   138,828,889    166,715 x      138,995,604     74953    73707    941
              -8          16,568,451   135,281,289    166,715 x      135,448,004     60865            1807   21
paq8hp7a      -7          16,592,672   137,441,743    150,678 x      137,592,421     79795             940
              -8          16,431,239                  150,678 x                      76940    77600   1790
paq8hp7       -7          16,579,500                  151,633 x                      79620    79660    940
              -8          16,417,646   133,835,408    151,633 x      133,987,041     66074            1850   21
paq8hp8       -7          16,528,353                  151,711 x                      79580    79970    940
              -8          16,372,960   133,271,398    151,711 x      133,423,109     64639            1849   22
paq8hp9       -7          16,516,789   136,676,674    111,653 x      136,788,327     84529    85957    940
paq8hp10      -7          16,490,947                  102,256 x                      86720    88890    940
paq8hp10any   -8          16,335,197   132,979,531    333,925 x      133,313,456     55639            1849   22
paq8hp11      -7          16,459,515                   98,851 x                     129540   128530    947
paq8hp11any   -8          16,304,862   132,757,799    327,608 s      133,085,407     57503            1850   22
paq8hp12      -7          16,381,959                   98,745 x                     130820   131480    936
paq8hp12any   -7          16,381,959                  330,700 x                      78860    76190    941
              -8          16,230,028   132,045,026    330,700 x      132,375,726     56993            1850   22
              -8          16,230,028   132,045,026    330,700 x      132,375,726     37660    37584   1850   41

paq8hp1 through paq8hp12 can be used as a preprocessor to other compressors by compressing with option -0. In the following tests on ppmonstr, options were tuned for the best possible compression of enwik8 with 2 GB memory (1.65 GB available under WinXP). The xml-wrt 2.0 options are -l0 -w -s -c -b255 -m100 -e2300 (level 0, turn off word containers, turn off space modeling, turn off containers, 255 MB buffer for dictionary, 100 MB buffer, 2300 word dictionary). The xml-wrt 3.0 options are -l0 -b255 -m255 -3 -s -e7000 (-3 = optimize for PPM).

xml-wrt prepends the dictionary to its output. To make the comparison fair, the compressed size of the dictionary must be added. This is done in two ways, first by compressing the preprocessed text and dictionary and adding the compressed sizes, and second by prepending the dictionary to the preprocessed text before compression. The first method compresses about 1-2 KB smaller.

The uncompressed size of each dictionary for paq8hp1 through paq8hp4 is 398,210 bytes. They contain identical words, but in different order. The first two dictionaries are identical. They compress smaller because they are sorted alphabetically. The dictionary for paq8hp5 is 411,681 bytes. It contains all of the words in the first 4 dictionaries plus 1280 new words (44,880 total).

Preprocessor    Compressor                 enwik8       dict      total        dict+enwik8
------------    ----------                 ----------   -------   ----------   -----------
paq8hp1 -0    | ppmonstr J -m1650 -o64     18,322,077    81,190   18,403,267   18,403,991
paq8hp2 -0    | ppmonstr J -m1650 -o64     18,266,424    81,190   18,347,614   18,349,587
paq8hp3 -0    | ppmonstr J -m1650 -o64     18,197,797   107,583   18,305,380   18,306,690
paq8hp4 -0    | ppmonstr J -m1650 -o64     18,170,944   107,590   18,278,534   18,280,098
paq8hp5 -0    | ppmonstr J -m1650 -o64     18,154,921   111,935   18,266,856   18,267,556
xml-wrt 2.0   | ppmonstr J -m1650 -o64     18,625,624
xml-wrt 3.0   | ppmonstr J -m1650 -o64     18,494,374
(none)          ppmonstr J -m1650 -o16     19,062,555
                ppmonstr J -m1650 -o32     19,084,964
                ppmonstr J -m1650 -o64     19,098,634

The transform done by paq8hp1 through paq8hp5 is based on WRT by Przemyslaw Skibinski, which first appeared in PAsQDa and paqar, and later in paq8g and xml-wrt. The steps are as follows:

The input is parsed into sequences of all uppercase letters, all lowercase letters, or one uppercase letter followed by lowercase letters, e.g. "THE", "the", or "The".

All uppercase words are prefixed by a special symbol (0E hex in paq8hp3, paq8hp4, paq8hp5). If a lowercase letter follows with no intervening characters (e.g. "THEre"), then a special symbol (0C hex) marks the end (e.g. 0E "the" 0C "re").

Capitalized words are prefixed with 7F hex (paq8hp3) or 40 hex (paq8hp4, paq8hp5) (e.g. "The" -> 40 "the").

All letters are converted to lower case.

Words are looked up in the dictionary. The first 80 words in the dictionary are coded with 1 byte: 80, 81, ... CF (hex).

The next 2560 words (paq8hp1-4) or 3840 words (paq8hp5) are coded with 2 bytes: D080, D081, ... EFCF (paq8hp1-4), or D080, ... FFCF (paq8hp5).

The last 40960 words are coded with 3 bytes: F0D080, F0D081, ... FFEFCF.

If a word does not match, then the longest matching prefix with length at least 6 is coded and the rest of the word is spelled.

If there is no matching prefix, then the longest matching suffix with length at least 6 is coded after spelling the preceding letters.

If no matching word, prefix, or suffix is found, the word is spelled. Capitalization coding occurs regardless.

Any input bytes with special meaning are escaped by prefixing with 06: 06, 0C, 0E, 40 or 7F, 80-FF.
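The word-coding step above can be sketched in Python as follows (the code ranges follow the paq8hp1-4 layout described; the real dictionary, capitalization flags, and prefix/suffix handling are omitted):

def encode_word(word, dictionary):
    # dictionary maps a lowercase word to its rank (position in the word list).
    # Returns the 1-, 2-, or 3-byte code, or None if the caller must spell the word.
    rank = dictionary.get(word)
    if rank is None:
        return None
    if rank < 80:                        # 80 most frequent words: 1 byte, 80..CF hex
        return bytes([0x80 + rank])
    rank -= 80
    if rank < 2560:                      # next 2560 words: 2 bytes, D0 80 .. EF CF
        return bytes([0xD0 + rank // 80, 0x80 + rank % 80])
    rank -= 2560
    if rank < 40960:                     # last 40960 words: 3 bytes, F0 D0 80 .. FF EF CF
        return bytes([0xF0 + rank // 2560, 0xD0 + (rank // 80) % 32, 0x80 + rank % 80])
    return None

dictionary = {w: i for i, w in enumerate(["the", "of", "and", "in", "a"])}  # stand-in dictionary
print(encode_word("the", dictionary).hex())   # '80'
print(encode_word("zzz", dictionary))         # None: word is spelled out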

.1355 emma

emma v0.1.3 is a free, closed source file compressor for 32 bit Windows by mpais, Mar. 8, 2016. It uses context mixing. It has a GUI-only interface to select compression options. For testing, all settings were for maximum compression as follows: Memory usage 512 Mb, maximum order 9, ring buffer 32 Mb, probability refinement level 3, mixing complexity insane, adaptive learning rate on, fast mode on long matches off, ludicrous complexity mode on, match model on, 32 Mb, high complexity; text model on, 128 Mb, high; sparse model on, 16 Mb, high; indirect model on, 16 Mb, high; x86/64 model on, 64 Mb, insane; image models on, 80 Mb, high; audio models on, 32 Mb, high; record model on, 16 Mb, high; distance model on, 8 Mb; JPEG model on, 40 Mb, high; GIF model on, 32 Mb, high; executable code (x86/64) transform on; process conditional jumps on; colorspace (RGB) on; delta coding on; dictionaries: English on, Spanish off, Italian off, French off, Portuguese off.

emma v0.1.4 was released Mar. 13, 2016. For testing, the text model was increased to 256 MB. A DMC model (8 MB) was added. The non-text related models were turned off: x86, image, audio, JPEG, GIF. All transforms (x86, RGB, delta) were turned off.

emma 0.1.6 (discussion) was released Mar. 27, 2016. It was tested by splitting enwik9 into parts using hsplit to move the highly compressible middle part to the end. The reordered file was then processed using drt dictionary processing (see lpaq9m) instead of emma's built-in dictionary, and then compressed with emma with maximum compression and memory options (like below) except that dictionary processing was turned off. The decompressor size includes drt.exe, lpqdict0.dic, hsplit.exe and a BAT file to restore the original order, all compressed with emma, then those files plus emma.exe (without dictionaries) compressed into a zip archive. Specifically, enwik9 was prepared:

fsplit32 enwik9 en1 586000000
fsplit32 en1.1 en2 480000000
fsplit32 en2.1 en3 424000000
copy /b en3.1+en1.2+en3.2+en2.2 enwik9o
del en1.1
del en1.2
del en2.1
del en2.2
del en3.1
del en3.2
drt enwik9o enwik9o.drt
del enwik9o

drt enwik9o.drt enwik9o d
fsplit32 enwik9o en1o 894000000
del enwik9o
fsplit32 en1o.1 en2o 838000000
fsplit32 en2o.1 en3o 424000000
copy /b en3o.1+en2o.2+en1o.2+en3o.2 enwik9
del en1o.1
del en1o.2
del en2o.1
del en2o.2
del en3o.1
del en3o.2

hsplit input output N

emma 0.1.12 was released July 10, 2016. There are 32 and 64 bit versions. The 64 bit version can use more memory. Settings were as follows:

                          x64             x86
Memory                    2048 MB         512 MB
Max order                 10              9
Ring buffer size          128 MB          32 MB
Probability refinement    level 3         level 3
Mixing complexity         insane          insane
Adaptive learning rate    on              off
Fast mode long matches    off             off
Ludicrous complexity      on              on
Match model               128 MB, high    32 MB, high
Text model                1024 MB, high   256 MB, high
Sparse model              64 MB, high     16 MB, high
Indirect model            64 MB, high     16 MB, high
x86/x64 model             off             off
Image models              off             off
Audio models              off             off
Record model              64 MB, high     16 MB, high
Distance model            32 MB           8 MB
DMC model                 32 MB           8 MB
JPEG model                off             off
GIF model                 off             off
XML model                 16 MB           4 MB
RAW models                off             off
Transforms: exec code     off             off
Colorspace RGB            off             off
Delta coding              off             off
Dictionaries              English         English

emma 0.1.22 was released Feb. 12, 2017. Settings: all settings = MAX, except: image and audio models = off, use fast mode on long matches = off, xml = on, x86 model = off, x86 exe code = off, delta coding = off, dictionary = off, ppmd memory = 1024, ppmd order = 14.

emma 1.23 was released Aug. 29, 2017. It uses ppmd_mod v3a by Shelwein and is preprocessed with DRT. EMMA 1.23 settings: all settings = MAX, except: image and audio models = off, use fast mode on long matches = off, xml = on, x86 model = off, x86 exe code = off, delta coding = off, dictionary = off, ppmd memory = 1024, ppmd order = 14.

Program               enwik8       enwik9         program size   total          Comp     Decomp    Mem   Alg   Note
-------               ----------   -----------    ------------   -----------    ------   ------   ----   ---   ----
emma 0.1.3            17,971,713   149,864,553    1,844,505 x    151,709,068    110458   113839   1336   CM    77
emma 0.1.4            17,865,328   148,887,824    1,848,033 x    150,735,857     58141             980   CM    78
drt|emma 0.1.16 x64   16,855,079   136,393,547    1,257,839 x    137,651,386     64341    62102   3800   CM    77
emma 0.1.12 x86       17,824,974   148,403,034    1,878,971 x    150,282,005     62639             986   CM    78
emma 0.1.12 x64       17,468,937   142,416,812    2,105,286 x    144,522,098     95997            3688   CM    78
emma 0.1.22           16,679,420   135,169,967    1,302,363 xd   136,472,330     86187            3824   CM    81
drt|emma 1.23         16,523,517   134,164,521    1,358,251 xd   135,522,772     73006    67097   3800   CM    81

.1422 zpaq

A ZPAQ archive is organized into independently compressed blocks. Each block is divided into one or more segments which must be decompressed in sequence. Each segment represents a file or a part of a file. The standard supports both archivers and single file compressors. In the case of a compressor, no filenames are stored in the segment headers, and all the blocks and segments are concatenated to a single output file specified by the user.

ZPAQ uses a streaming format that can be read or written in a single pass. The arithmetic coded data is designed so that the end of a segment can be found by scanning quickly without decoding. There is no central directory information to update when blocks are added, removed, or reordered.

The ZPAQ standard requires that the decompression algorithm be described in the block headers. The header describes a collection of bitwise predictive models based loosely on PAQ components, a program to compute the bytewise contexts for each model, and a second program to perform arbitrary postprocessing on the output data. The two programs are written in an interpreted bytecode language called ZPAQL.

A ZPAQ model specifies a list of 1 to 255 components. Each component outputs a prediction or probability that the next bit will be a 1. Each component may receive as input a computed 32-bit context and the output predictions of earlier components on the list. The last component's prediction is fed to an arithmetic coder to encode or decode the next bit. The components are as follows:

CONST - specifies a fixed, constant prediction.

CM - context model. The context is mapped to a prediction by a table with a user specified size. Each table entry also has a count. The table is updated by adjusting the prediction to reduce the prediction error in proportion to 1/count. The count is incremented up to a user specified limit in the range 4 to 1020.
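As a sketch of that update rule (Python, illustrative only; the real component works on fixed-point table entries): the prediction moves toward the coded bit by a step proportional to 1/count, and the count saturates at the user-specified limit.

class CM:
    # Toy direct context model: each context holds (p(bit=1), count).
    def __init__(self, size, limit=1020):      # ZPAQ allows limits in the range 4..1020
        self.table = [[0.5, 2] for _ in range(size)]
        self.limit = limit

    def predict(self, context):
        return self.table[context][0]

    def update(self, context, bit):
        p, count = self.table[context]
        p += (bit - p) / count                  # error shrinks as 1/count
        self.table[context] = [p, min(count + 1, self.limit)]

cm = CM(size=256)
for bit in [1, 1, 0, 1, 1, 1]:
    cm.update(context=42, bit=bit)
print(round(cm.predict(42), 3))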

ICM - indirect context model. The context is mapped to a bit history (an 8 bit state) by a hash table of user specified size. The history is mapped to a prediction by a CM with a fixed, high count limit. The history represents a count of recent 0 and 1 bits and also indicates whether the last bit was a 0 or 1.

MATCH - has an output buffer and pointer table, both of user specified size. The context is mapped to a pointer into the buffer where the same context was last observed. The corresponding bit after the last match is predicted in proportion to the length of the match.

AVG - Two predictions are combined by weighted averaging. The user specifies the weight. Weighted mixing is always in the logistic or "stretched" domain: stretch(p) = log(p/(1-p)).

MIX2 - Two stretched predictions are combined by weighted averaging from a table of weights of a user specified size and selected by a context. After prediction, the selected weight is updated to favor the more accurate input prediction. The user specifies the adaptation rate.

MIX - Like a MIX2 but over a user specified array of earlier predictions and one weight per input per context.

SSE - secondary symbol estimation. A context and a stretched input prediction select an output prediction from two adjacent entries in a 2-D table by interpolation. The table is updated to reduce the prediction error of the nearer of the two entries as with a CM. The user specifies the table size in the context dimension (the probability dimension is fixed at 64), and the initial and maximum counts to determine adaptation rate.

ISSE - indirect SSE. Receives a context and an earlier prediction. The context is mapped to a bit history as with an ICM. The history is mapped to the context of a MIX2 with one prediction from input and the other CONST. It has the effect of adjusting the input prediction based on the bit history of the current context.
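The stretch/squash arithmetic shared by AVG, MIX2 and MIX can be sketched as follows (Python, illustrative only; the real components use fixed-point tables and per-context weight sets):

import math

def stretch(p):
    return math.log(p / (1 - p))      # probability -> logistic domain

def squash(x):
    return 1 / (1 + math.exp(-x))     # inverse of stretch

def mix(predictions, weights):
    # MIX: weighted average of stretched predictions, squashed back to a probability.
    return squash(sum(w * stretch(p) for w, p in zip(weights, predictions)))

def mix_update(predictions, weights, bit, rate=0.002):
    # After coding a bit, adjust each weight to favor the more accurate inputs.
    error = bit - mix(predictions, weights)
    return [w + rate * error * stretch(p) for w, p in zip(weights, predictions)]

preds = [0.9, 0.6, 0.2]               # outputs of three earlier components
weights = [0.3, 0.3, 0.3]
print(round(mix(preds, weights), 3))
print([round(w, 4) for w in mix_update(preds, weights, bit=1)])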

There are two ZPAQL virtual machines, one (HCOMP) to compute contexts, and one (PCOMP) to postprocess the decoded data. Each program is called once per decoded byte with that byte as input. A ZPAQL machine has the following state:

An array H of 32 bit unsigned values of user specified size. In HCOMP, the elements at the beginning of the array are each assigned to a component to hold its context.

An array M of 8 bit unsigned values of user specified size.

32 bit registers A, B, C, and D. A is the accumulator, the destination of most arithmetic and logical operations. It also contains the input byte when the program is executed. B and C can point into M. D can point into H.

256 registers, R0 through R255, holding 32 bit values.

A flag register F holding the result of the last comparison (true or false).

A 16 bit program counter.
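That state might be held in a structure like the following (a hypothetical Python sketch, not the actual zpaq source):

from dataclasses import dataclass, field

@dataclass
class ZPAQLState:
    # State of one ZPAQL virtual machine (HCOMP or PCOMP), per the list above.
    h: list = field(default_factory=list)               # H: 32-bit values; component contexts in HCOMP
    m: bytearray = field(default_factory=bytearray)     # M: 8-bit values
    a: int = 0        # accumulator; receives the input byte on each call
    b: int = 0        # may index into M
    c: int = 0        # may index into M
    d: int = 0        # may index into H
    r: list = field(default_factory=lambda: [0] * 256)  # R0..R255
    f: bool = False   # result of the last comparison
    pc: int = 0       # 16-bit program counter

state = ZPAQLState(h=[0] * 8, m=bytearray(1 << 16))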

zpaq 1.03 takes as input a configuration file which describes the arrangement of components, their parameters, and the ZPAQL program HCOMP written one token per byte in a C-like syntax (e.g. "A=B" to assign B to A). PCOMP is not specified because in general the preprocessing step by the compressor is different (and usually more complex) than the postprocessing step. Instead, zpaq 1.03 provides the option of two built-in preprocessors, LZP and E8E9. If selected, the preprocessing is done in C++ by the compressor, and the compressor generates ZPAQL code to perform the inverse transform and insert it into the archive block header. (PCOMP is actually appended to the beginning of the input data and compressed with it. HCOMP is not compressed).

E8E9 is used to improve compression of 32 bit x86 executable files. It replaces the 32 bit relative address after a CALL or JMP (0xE8 or 0xE9) x86 instruction by adding the offset from the beginning of the file. This improves compression because often there are several calls to the same target. PCOMP performs the inverse transform in ZPAQL by subtracting the offset.
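A minimal sketch of the forward transform (not zpaq's code; real implementations differ in details such as range checks on the operand):

#include <cstdint>
#include <cstddef>

void e8e9_forward(uint8_t* buf, size_t n) {
  for (size_t i = 0; i + 5 <= n; ++i) {
    if (buf[i] == 0xE8 || buf[i] == 0xE9) {
      // read the 32-bit little-endian relative address after the opcode
      uint32_t rel = buf[i+1] | buf[i+2] << 8 | buf[i+3] << 16
                   | (uint32_t)buf[i+4] << 24;
      uint32_t abs = rel + (uint32_t)i;     // add the offset from the start of the file
      buf[i+1] = (uint8_t)abs;        buf[i+2] = (uint8_t)(abs >> 8);
      buf[i+3] = (uint8_t)(abs >> 16); buf[i+4] = (uint8_t)(abs >> 24);
      i += 4;                               // skip the rewritten operand
    }
  }
}
// The inverse transform is identical except that it subtracts i.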

LZP encodes long string matches as an escape byte and length byte. The decompresser maintains a rolling context hash which indexes a pointer table (the H array) into the output buffer (the M array) pointing to the previous context match. If an escape is present, then the indicated number of bytes are copied from the previous context match. In zpaq 1.03, the user can specify the sizes of M and H, the hash multiplier (effectively choosing the context length), the value to use as the escape byte (preferably occurring rarely in the input), and minimum match length. Escape bytes in the input are encoded as an escaped 0 length.
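A rough sketch of the decoding loop described above (a hypothetical function; minimum match length handling and bounds checks are omitted, and a real encoder only emits a match when the context actually occurred earlier):

#include <cstdint>
#include <vector>

void lzp_decode(const std::vector<uint8_t>& in, std::vector<uint8_t>& out,
                std::vector<uint32_t>& H, uint32_t mul, uint8_t ESC) {
  uint32_t hash = 0;
  auto put = [&](uint8_t ch) {
    out.push_back(ch);
    hash = hash * mul + ch;                       // rolling context hash
    H[hash % H.size()] = (uint32_t)out.size();    // where this context was last seen
  };
  for (size_t i = 0; i < in.size(); ++i) {
    if (in[i] == ESC) {
      uint8_t len = in[++i];
      if (len == 0) { put(ESC); continue; }       // escaped literal escape byte
      size_t p = H[hash % H.size()];              // previous occurrence of the context
      for (int j = 0; j < len; ++j) put(out[p + j]);  // copy the match
    } else put(in[i]);                            // ordinary literal
  }
}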

zpaq 1.03 is distributed with three configuration files, min.cfg (for speed), mid.cfg (the default), and max.cfg (for good compression). However, the user can also write their own config files.

o0.cfg, o1.cfg, and o2.cfg are order 0, 1, and 2 models with a single CM and direct context lookup with no hashing. o0 is equivalent to fpaq0. In each of the models the asymptotic learning rate was tuned for maximum compression. Other values are given as comments in the sources. The CM uses 2KB, 512KB and 128MB respectively.

min.cfg uses LZP preprocessing with a minimum match length of 3 and an order 4 context hash, followed by compression by single CM with an order 3 context and 512K entries. The LZP has a 1 MB output buffer and 256K index. It uses 4 MB memory.

mid.cfg (the default) does no preprocessing. It has an order 0 ICM, a chain of ISSE with context orders 1 through 5, each taking the previous ISSE as input, a MATCH with an order 7 context, and a final MIX with an order 1 context taking input from all other models. It uses 111 MB memory.

max.cfg does no preprocessing. It has 21 components: an order 0 ICM, a chain of order 1, 2, 3, 4, 5, 7 ISSE, an order 8 MATCH, a wordwise order 0-1 ICM-ISSE chain (for text), sparse order 1 ICM with gaps of 1, 2, and 3, a partially masked order 2 ICM with a gap of 216 for CCITT images (calgary/pic), order 0 and 1 mixers taking a CONST and all previous components as input and averaged together with a context free MIX2, followed by a chain of order 0 and 1 SSE each partially bypassed by a context free and order 0 MIX2, and a final context free MIX of all other components. The two wordwise contexts depend on the current and previous case insensitive sequences of letters in the range a-z. It uses 278 MB memory.

max3.cfg is a variation of max.cfg by Jan Ondrus (Sept. 10, 2009) using 550 MB memory and without a CCITT model.

max4.cfg is a variation of max3.cfg (Sept. 15, 2009) using 1465 MB memory.

drt is the dictionary preprocessor from lpaq9m by Alexander Rhatushnyak. The results include the dictionary file lpqdict0.dic, compressed from 465,210 to 88,759 bytes in 8 seconds as a separate archive with max4.cfg and decompressed in 7 seconds, and drt.exe with a size of 15,548 bytes (whether uncompressed or as a zip file), with 38 seconds to encode enwik9 and 38 seconds to decode.

max_enwik9.cfg is a variation of max.cfg by Mike Russell, Sept. 11, 2009. It adds 5 more models for higher order contexts using an ISSE chain after the first order 5 mixer.

max_enwik9drt.cfg is a variation of max_enwik9.cfg, Sept. 18, 2009, modified to define word contexts for ASCII range 65-255 instead of A-Z,a-z because DRT encodes words using bytes in the range 128-255. The compressed size of lpqdict0.dic is 86810 bytes, 12+9 sec, compressed separately and added to the compressed sizes.

zpipe 1.00 is a ZPAQ compatible streaming file compressor that compresses or decompresses from standard input to standard output. It takes no options. It compresses equivalently to mid.cfg without storing a filename or comment. The decompresser outputs the contents of archives to a single file by concatenation.

bwt_j2.cfg implements an inverse BWT transform. It was written by Jan Ondrus, Oct. 6, 2009. The forward transform is implemented by an external preprocessor, bwtpre (included above) by Matt Mahoney, Oct. 6, 2009. bwtpre is based on BBB fast mode compression but does not itself compress. The argument ",18" tells bwt_j2.cfg to use a block size of 2^(10+18) - 256 bytes. Memory usage is 5x blocksize for both the preprocessor and postprocessor, plus 100 MB for the model. The ability of config files to call external preprocessors was added to zpaq v1.05 on Sept. 28, 2009. The ability to pass arguments was added to zpaq v1.07 on Oct. 2, 2009.

zpaq v1.08 (Oct. 14, 2009) adds the capability to compile ZPAQL configuration files and corresponding archive headers to C++ and link to a copy of itself to speed up compression and decompression. The program first looks for an optimized version of the program, writes and compiles it if needed, then runs it to compress or decompress. Some tests are shown for speed comparison. max.cfg was modified to use less memory. The arguments to min.cfg, mid.cfg, and max.cfg have the effect of improving compression at the cost of doubling memory for each increment.

bwt_slowmode1_1GB_block.cfg implements a slow mode BWT transform using 1.25x block size memory, based on BBB. The inverse transform was re-implemented in ZPAQL by Jan Ondrus, Oct. 15, 2009.

zpaq v1.09 is mainly a Linux port of v1.08 with some cosmetic improvements. Times for obwt_j2.cfg,18 are shown for comparison to v1.07 without optimization. Memory usage is 1838 MB for compression (includes preprocessor) and 1443 MB for decompression.

The c command followed by the name of a configuration file creates a new archive using that file. By default the archive header includes the file name (6 bytes), size (10 bytes), and SHA1 checksum (20 bytes). There are options to omit these and save 36 bytes. The "oc" command in zpaq v1.08 optimizes for speed.

zp 1.00 is a ZPAQ compatible archiver by Matt Mahoney, May 7, 2010. It is designed to have fewer options so it is easier to use. It has 3 compression levels: 1=fast, 2=mid, 3=max. It uses compiled ZPAQL code (like zpaq oc/ox) but without requiring an external C++ compiler to be installed. It automatically detects when an archive is compressed with one of these three models and decompresses with compiled code. Otherwise, it will decompress all other ZPAQ compatible archives with slower, interpreted code. Levels 2 and 3 are the same as zpaq mid.cfg and max.cfg. Only level 1 (fast) was tested because it uses a new model, fast.cfg, an ICM chain of length 2 with order 2 and 4 contexts. It is equivalent to compressing with zpaq ocfast.cfg.

Compression                                 Compressed size           Decompresser  Total size    Time (ns/byte)
Program        Options                      enwik8      enwik9        size (zip)    enwik9+prog   Comp   Decomp  Mem   Alg  Note
-------        -------                      ----------  -----------   -----------   -----------   -----  ------  ----  ---  ----
zpaq 1.03      co0.cfg                      61,217,687  620,040,242   14,317 xd     620,054,559     441     453   0.4  o0   26
               co1.cfg                      46,083,596  454,040,416   14,317 xd     454,054,733     459     480   0.6  o1   26
               co2.cfg                      36,694,483  346,551,263   14,317 xd     346,565,580     557     560   134  o2   26
               cmin.cfg                     33,460,947  294,281,789   14,317 xd     294,296,106     438     513     4  LZP  26
               cmid.cfg                     20,941,558  180,279,221   14,317 xd     180,293,538    3521    3652   111  CM   26
               cmax.cfg                     19,412,353  165,191,085   14,317 xd     165,205,402   12211   12204   278  CM   26
               cmax3.cfg                    19,179,311  161,604,379   14,317 xd     161,618,696   14108   13609   550  CM   26
               cmax4.cfg                    18,986,507  157,246,349   14,317 xd     157,260,666   14061   13077  1465  CM   26
               cmax_enwik9.cfg              18,238,435  149,376,058   14,317 xd     149,390,375   11961          2002  CM   32
drt|zpaq 1.03  cmax4.cfg                    18,400,773  149,761,125   29,865 xd     149,790,990    8663    8547  1465  CM   26
               cmax_enwik9drt.cfg           18,022,167  146,078,502   29,865 xd     146,108,367   11494   11614  1952  CM   26
zpipe 1.00                                  20,941,543  180,279,205   13,421 x      180,292,626    3540    3480   111  CM   26
zpaq 1.07      cbwt_j2.cfg,18               20,756,888  174,171,969   13,421 x      174,185,390    5593    4347  1838  BWT  26
zpaq 1.08      ocbwt_slowmode1_1GB_block.cfg
                                            20,756,996  163,565,006   29,153 x      163,594,159    7957    3875  1443  BWT  26
               oco0.cfg                     61,217,687                                              335     407   0.4  o0   26
               ocmin.cfg                    33,460,960                                              414     383     4  LZP  26
               ocmid.cfg                    20,941,558                                             2392    2456   111  CM   26
               ocmax.cfg                    19,448,650                                             6569    6641   246  CM   26
               ocmax.cfg,3                  18,977,961                                             6667    6640  1861  CM   26
zpaq 1.09      ocbwt_j2.cfg,18              20,756,883  174,171,965   31,744 x      174,203,709    4529    1847  1838  BWT  26
zp 1.00        c1                           24,837,469  222,310,430   26,815 s      222,337,245     688     776    37  CM   26
                                                                                                    587     688         CM   44

pzpaq 0.01 (a predecessor to zp 1.02) is a free, open source file compressor and archiver by Matt Mahoney, Jan. 21, 2011. It uses a ZPAQ compatible format with speed optimizations for the 3 default compression levels supported by libzpaq, zpaq, and zpipe. It supports parallel compression and decompression by dividing the input into blocks which are compressed or decompressed at the same time in separate threads, writing the result to temporary files, and then concatenating them when done. For compression with N threads, the input is divided into N blocks of equal size by default, although a different block size can be specified. Larger blocks make compression better but reduce the number of threads that can run at the same time. Using more threads also increases the memory required. pzpaq can also compress or decompress multiple files at once to separate archives or pack them into a solid archive or an archive with the packed files split across blocks within the archive.

The version 0.01 distribution includes a 32 bit Windows executable and source code to compile for Windows or Linux. For Windows, the code must be linked with Pthreads-Win32 and pthreadGC2.dll is required at run time. The program size was calculated from the source code (including libzpaq) required for Linux, which has pthreads installed by default and is not included in the size.

The test results shown below are for 2 machines, a 2.67 GHz Intel Core i7 M620 with 2 cores and 2 hyperthreads per core, running 64 bit Linux (note 48), and a 2.0 GHz Intel T3200 with 2 cores without hyperthreading running 32 bit Windows (note 26). The Linux version was compiled with g++ 4.4.4 -O3 -s -march=native -DNDEBUG. The Windows version used the distributed pzpaq.exe and pthreadGC2.dll. It was compiled with g++ 4.5.0 -O2 -s -march=pentiumpro -fomit-frame-pointer. Times shown are wall (real) times, not process times, in nanoseconds per byte.

We observe the normal 3 way tradeoff between speed, memory, and compression. Compression levels -1, -2, and -3 require 38 MB, 112 MB, and 247 MB per thread respectively. The default is -2. -t selects the number of threads. The default is -t2. -b selects the block size. The default is the input size divided by the number of threads. The -m option limits memory usage in MB by reducing -t. The default is -m500. Selecting larger -m than required has no effect on compression, speed, or actual memory used. -m is only required with -3 -t3 or higher.

                                         C/D time     C/D time
Lev Thr Block       Memory  enwik8       Note 48      Note 26
--- --- ----------- ------  ----------   -----------  -----------
-1  -t2 -b1000000   -m76    28,176,221    471
-1  -t2 -b2500000   -m76    26,915,416    443
-1  -t2 -b5000000   -m76    26,236,689    436
-1  -t2 -b10000000  -m76    25,728,498    429
-1  -t4 -b25000000  -m152   25,253,629    210  220
-1  -t3 -b33333334  -m114   25,144,587    220  240
-1  -t2 -b50000000  -m76    25,009,236    240  290      410  430
-1  -t1 -b100000000 -m38    24,837,482    420  470      750  800
-2  -t2 -b1000000   -m224   24,582,373   1440
-2  -t2 -b2500000   -m224   23,374,191   1396
-2  -t2 -b5000000   -m224   22,644,738   1417
-2  -t2 -b10000000  -m224   22,044,838   1430
-2  -t2 -b25000000  -m224   21,438,679   1382
-2  -t4 -b25000000  -m448   21,438,679    720  730
-2  -t3 -b33333334  -m336   21,303,705    790  820
-2  -t2 -b50000000  -m224   21,138,877    950  980     1300 1310
-2  -t1 -b100000000 -m112   20,941,571   1510 1560     2350 2330
-3  -t2 -b1000000   -m494   23,281,943   4142
-3  -t2 -b2500000   -m494   22,105,128   3896
-3  -t2 -b5000000   -m494   21,371,902   3866
-3  -t2 -b10000000  -m494   20,745,064   3854
-3  -t2 -b25000000  -m494   20,073,978   3816
-3  -t4 -b25000000  -m988   20,073,978   1900 1950
-3  -t3 -b33333334  -m741   19,914,412   2070 2120
-3  -t2 -b50000000  -m494   19,710,450   2180 2250     3670 3990
-3  -t1 -b100000000 -m247   19,448,663   3780 3910     6080 6200

                                          C/D time     C/D time
Lev Thr Block        Memory  enwik9       Note 48      Note 26
--- --- ------------ ------  -----------  -----------  -----------
-1  -t2 -b1000000    -m76    254,931,717   582
-1  -t2 -b10000000   -m76    232,278,737   425
-1  -t2 -b100000000  -m76    224,233,690   392
-1  -t2 -b250000000  -m76    223,043,964   393
-1  -t4 -b250000000  -m152   223,043,964   198  223
-1  -t3 -b333333334  -m114   222,789,971   224  254
-1  -t2 -b500000000  -m76    222,544,698   236  276      408  556
-1  -t1 -b1000000000 -m38    222,310,443   410  470      758  800
-2  -t2 -b1000000    -m224   216,322,292  1377
-2  -t2 -b10000000   -m224   192,436,071  1286
-2  -t2 -b100000000  -m224   182,293,069  1275
-2  -t2 -b250000000  -m224   180,995,559  1278
-2  -t4 -b250000000  -m448   180,995,559   710  742
-2  -t3 -b333333334  -m336   180,716,954   768  811
-2  -t2 -b500000000  -m224   180,516,414   854  881     1275
-2  -t1 -b1000000000 -m112   180,279,234  1487 1532     2231
-3  -t2 -b1000000    -m494   203,976,295  3824
-3  -t2 -b10000000   -m494   180,499,077  3657
-3  -t2 -b100000000  -m494   168,839,648  3611
-3  -t2 -b250000000  -m494   167,036,071  3635
-3  -t4 -b250000000  -m988   167,036,071  1881 1926
-3  -t3 -b333333334  -m741   166,567,322  2025 2158
-3  -t2 -b500000000  -m494   166,324,415  2172 2236     3599
-3  -t1 -b1000000000 -m247   165,887,518  3708 3846     5989

Option -m2 selects the better BWT mode (bwt2), which drops the RLE step and uses an order 0-1 ISSE chain. The order-1 ISSE adjusts the order-0 ICM prediction by mixing it in the logistic domain with a constant, where the pair of mixing weights is selected by an 8-bit bit history, which in turn is selected by an order 1 context of the BWT output. After coding, the mixing weights are adjusted to reduce the prediction error.

Options -m3 and -m4 select the "mid" and "max" modes, the same as -4 and -5 respectively in pzpaq. The option -bN selects a block size of N*2^20 - 256 bytes. Memory usage per thread for the two BWT modes is 5 times the block size after rounding up to a power of 2. The default is -b32 which uses 160 MB per thread for -m1 and -m2. Memory usage for -m3 and -m4 is not affected by block size. Usage is 111 MB and 246 MB per thread for -m3 and -m4 respectively.

Other changes: there is no longer an option to limit memory. The default number of threads (-t option) is the number of cores. There is no solid mode compression because BWT requires that each block contain only one whole or part of a file. There is a separate decompresser, unzp, which is optimized for fast, mid, max, bwtrle1, and bwt2 modes, and can be configured to optimize for other models by generating, compiling, linking, and running C++ code for an optimized version of itself. Compressed sizes are based on the unzp source code (37,967 bytes).

zpaq 4.00 was released Nov. 13, 2011. It uses libzpaq v4.00, which internally translates ZPAQL into just-in-time (JIT) x86-32 or x86-64, which runs about as fast as the previous version that translated ZPAQL to C++ and compiled it. Unlike the earlier version, it correctly handles all legal ZPAQL, such as jumps into the middle of a 2 byte instruction, as occurs in max_enwik9.cfg. Like zp 1.02, it uses multi-threading and the same built-in compression levels -m1 through -m4.

Results are shown below for a 4 GB 2.66 GHz Core i7 M620 (note 40), which has 2 cores with 2 hyperthreads each, running Ubuntu 64 bit Linux. Compression and decompression times (wall times, ns/byte) are shown for 1 through 4 threads (-t1 through -t4) as the compression method (-m) and block size (-b) are varied. max_enwik9 runs in one thread in a single block.

Compressor Options     enwik8      enwik9        -t1        -t2        -t3        -t4        MB/thread
---------- --------    ----------  -----------   ---------  ---------  ---------  ---------  ---------
zp 1.02    -m1 -b32    24,091,153  210,224,876    264  313   144  184   131  170   120  165    160
           -m1 -b128   22,823,452  197,571,474    264  335   163  208   137  187   136  179    640
           -m1 -b256   22,823,452  191,741,553                                     167  218   1280
           -m2 -b32    22,440,353  195,887,789    446  514   259  304   237  274   231  267    160
           -m2 -b128   21,246,043  184,023,690    467  543   291  343   250  295   248  294    640
           -m2 -b256   21,246,043  178,551,919                                     304  351   1280
           -m3 -b32    21,301,940  185,584,854   1420 1478   805  856   760  790   713  745    111
           -m3 -b128   20,941,571  181,908,375   1430 1491   851  897   772  823   723  758    111
           -m3 -b1024  20,941,571  180,279,234   1446 1503                                     111
           -m4 -b32    19,912,920  172,989,918   3567 3695  2075 2145  1966 2011  1868 1906    246
           -m4 -b128   19,448,663  168,312,889   3578 3706  2156 2234  1984 2043  1875 1925    246
           -m4 -b1024  19,448,663  165,887,518   3597 3732                                     246

Compression               Compressed size           Decompresser  Total size    Time (ns/byte)
Program   Options         enwik8      enwik9        size (zip)    enwik9+prog   Comp   Decomp  Mem   Alg  Note
-------   ------------    ----------  -----------   -----------   -----------   -----  ------  ----  ---  ----
zpaq 4.00 -mmax_enwik9    18,238,435  149,376,058   66,958 s      149,440,016    6327    6528  2002  CM   48

zpaq v6.12, Oct. 19, 2012, is a journaling, deduplicating, incremental archiver. These features were added in zpaq v6.00 on Sept. 26, 2012. It implements the level 2 ZPAQ standard introduced with libzpaq v5.00 on Feb. 1, 2012. The level 2 standard allows for uncompressed (but possibly pre/post-processed) data. The format is described in the ZPAQ specification v2.01.

zpaq v6.12 is designed for large backups. It will compress 100 GB to an external drive in a few hours, then perform daily incremental backups of files whose dates have changed in a few minutes. It recursively traverses directories, storing last-modified dates and attributes of added files.

A journaling archive is append-only. When a journaling archive is updated, it keeps both the old and new versions of each file or directory. The old version can be extracted by specifying a dated version, and any later updates are ignored.

Input is deduplicated before compression by dividing input files into fragments averaging 64 KB on content-dependent boundaries that move when data is inserted or removed. The archive stores fragment SHA-1 hashes and stores any fragment with a matching hash as a pointer to an existing fragment. Any remaining fragments are packed into 16 MB blocks in memory and compressed by multiple threads in parallel to memory buffers before being appended to the archive. After compression is completed, the fragment sizes and hashes are appended, and then a list of index updates in separately compressed blocks. Each update is either a deletion (filename only) or an update (filename, date, attributes, and list of fragment pointers).

An update is performed as a transaction by first appending a temporary header, then the compressed data and index, and then finally going back and updating the header to store the compressed data size so that it can be skipped over when listing the archive contents or preparing a list of files to add or extract. If compression is interrupted or an error occurs, then the temporary header is not updated. If zpaq encounters a temporary header then it assumes that any data following it is corrupted and ignores it during extraction or listing, and overwrites it during the next update.

zpaq also has features to summarize the contents of archives containing millions of files, show update history and version dates, and compare and extract individual files and directories and rename them. Archives can be encrypted.

The deduplication algorithm uses a rolling hash of the input that depends on the last 32 bytes that are not predicted in an order-1 context. Missed predictions (from a 256 byte table) are counted as a heuristic to guess whether a block can be compressed. If not, then it is stored without compression as a speed optimization. There are 4 compression levels (-method 1 through 4). The threshold for compressing a block is 1/16, 1/32, 1/64, and 1/128 of bytes predicted by the order 1 model, respectively. Like earlier versions of zpaq, it also accepts configuration files and external preprocessors. These are always compressed.
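An illustrative sketch of a fragment boundary computation in this style (the multipliers and threshold are assumptions; zpaq's actual function also enforces minimum and maximum fragment sizes and counts mispredictions for the compressibility heuristic):

#include <cstdint>
#include <cstddef>

size_t next_boundary(const uint8_t* buf, size_t n) {
  uint8_t o1[256] = {0};   // last byte seen in each order-1 context
  uint32_t h = 0;          // rolling hash
  uint8_t c1 = 0;          // previous byte = order-1 context
  for (size_t i = 0; i < n; ++i) {
    uint8_t c = buf[i];
    if (c == o1[c1])
      h = (h + c + 1) * 314159265u;  // predicted: odd multiplier preserves state
    else
      h = (h + c + 1) * 271828182u;  // miss: even multiplier, so bytes older than
                                     // about 32 misses no longer affect h
    o1[c1] = c;
    c1 = c;
    if (h < 65536) return i + 1;     // P = 2^-16 per byte, so fragments average 64 KB
  }
  return n;
}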

The journaling format is not compatible with zpaq versions prior to 6.00. Older versions would decompress a journaling archive to a set of jDC* files that could in theory reconstruct the data. To support older versions, there are three additional modes: streaming, solid, and tiny. In streaming mode, each file is compressed in parallel in a separate block, and large files are split into 16 MB blocks. In solid mode, all files are compressed to a single block in a single thread. Tiny mode is like solid mode except that comments (uncompressed sizes), checksums, and header locator tags (for error recovery) are not stored, saving a few bytes each. None of these modes support journaling, incremental backup, or deduplication, and do not save file attributes or empty directories. An update appends to an archive without checking whether the files have been added before.

There are 4 built in methods. Method 1 is equivalent to "lazy" level 3. It is LZ77 using variable length codes to represent the lengths of literal byte strings or the length and offset of matches to earlier occurrences of the same string in a 16 MB output block. Matches are found by indexing a hash of the next 4 bytes in the input buffer into a table of size 4M which is grouped into 512K buckets of 8 pointers each. The longest match is coded, provided the length is at least 4, or 5 if the offset is greater than 64K and the last output was a literal. Ties are broken by favoring the smaller offset. Bucket elements are selected for replacement using the low 3 bits of the output count.
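A condensed C++ sketch of that match search (hypothetical; the hash multiplier is an assumption, the caller must ensure i+4 <= n before hashing, and the offset-dependent minimum length rule is omitted):

#include <cstdint>
#include <cstring>
#include <vector>

struct Lz77Index {
  std::vector<uint32_t> table;                // 512K buckets of 8 pointers = 4M entries
  Lz77Index() : table(4u << 20, 0) {}

  static uint32_t bucket_of(const uint8_t* p) {
    uint32_t h;
    std::memcpy(&h, p, 4);                    // hash of the next 4 bytes
    return (h * 2654435761u) >> 13 & (512u * 1024 - 1);
  }
  // Returns the position of the longest match for buf+i of length >= 4,
  // or -1 if none; then stores i in the bucket.
  long find_and_insert(const uint8_t* buf, size_t i, size_t n, uint32_t count) {
    uint32_t* b = &table[bucket_of(buf + i) * 8];
    long best = -1;
    size_t bestlen = 0;
    for (int j = 0; j < 8; ++j) {
      size_t p = b[j];
      if (p == 0 || p >= i) continue;
      size_t len = 0;
      while (i + len < n && buf[p + len] == buf[i + len]) ++len;
      if (len > bestlen) { bestlen = len; best = (long)p; }
      else if (len == bestlen && (long)p > best) best = (long)p;  // tie: smaller offset
    }
    b[count & 7] = (uint32_t)i;               // replace slot by low 3 bits of output count
    return bestlen >= 4 ? best : -1;
  }
};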

Literal lengths are coded using "marked binary" Elias gamma codes: the leading 1 bit of the number is dropped, a 1 bit is inserted in front of each remaining bit, and a 0 marks the end. For example, binary 1100 (decimal 12) is coded as 1,1,1,0,1,0,0. Matches are coded as a length and an offset. The length is at least 4. All but the last 2 bits of the length are coded as a marked binary number. The number of offset bits is given in the first 5 bits of the code. If the code starts with 00, then a literal length and a string of literals follow. Otherwise the 5 bits code a number from 0 to 23, and that many bits, with an implied leading 1, give the offset.
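A small encoder sketch for this code (hypothetical helper; it reproduces the example above):

#include <cstdint>
#include <vector>

void marked_binary(uint32_t n, std::vector<int>& bits) {  // n >= 1
  int k = 31;
  while (k > 0 && !(n >> k & 1)) --k;   // k = position of the leading 1 bit
  for (--k; k >= 0; --k) {              // the leading 1 itself is dropped
    bits.push_back(1);                  // mark: another bit follows
    bits.push_back(n >> k & 1);
  }
  bits.push_back(0);                    // end of number
}

For n = 12 (binary 1100) this emits 1,1,1,0,1,0,0. Decoding reverses it: read (mark, bit) pairs until a 0 mark, then prepend the implied leading 1.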

The codes are not compressed further. They are stored in the ZPAQ level 2 format, consisting of a sequence of sub-blocks each preceded by a 4 byte header giving the sub-block size.

Method 2 is also LZ77, but the codes are byte aligned and context modeled rather than coded directly. It also searches 4 order-7 context hashes and 4 order-4 hashes, rather than 8 order-4 hashes like method 1. Method 2 first codes as follows, according to the high 2 bits of the first byte:

00 = literal of length 1..64, followed by uncompressed bytes.
01 = match of length 4..11 and offset 1..2048.
10 = match of length 1..64 and offset 1..65536.
11 = match of length 1..64 and offset 1..16777216.

Method 3 uses a Burrows-Wheeler transform (BWT) using libdivsufsort-lite v2.0. This is equivalent to -m2 in older zpaq versions. The input bytes are sorted by their right contexts and compressed using an order 0-1 ICM-ISSE chain. The order 0 ICM (indirect context model) works as in method 2, taking only the previous bits of the current byte (MSB first) as context. The prediction is adjusted by an order-1 indirect secondary symbol estimator (ISSE). An ISSE maps its context (the previous byte and the leading bits of the current byte) to a bit history, and the history selects a pair of mixing weights to compute the weighted average of the constant 1 and the ICM output in the logistic domain, log(p/(1-p)). The output is converted back to linear, and the two weights are updated to reduce the prediction error in favor of the better model. In other words, the output is:

p' := 1/(1 + exp(-w1*1 - w2*log(p/(1-p))))

w1 := w1 + 1 * 0.001 * (bit - p')
w2 := w2 + log(p/(1-p)) * 0.001 * (bit - p')

Method 4 is equivalent to mid.cfg or -m3 in older zpaq versions. It directly models the data using an order 0-5 ICM-ISSE chain, an order 7 match model, and an order 1 mixer which produces the bit prediction by mixing the predictions of all other components. The 6 components in the chain each mix the next lower order prediction using a hash of the next higher order context to select a bit history for that context, which selects the mixing weights. A match model has a 16 MB history buffer and a 4M hash table of the previous occurrence of the current context. If a match is found, it predicts the bit that followed the match with probability 1 - 1/(length in bits). The outputs of all 7 models are then mixed as with an ISSE except with a vector of 7 weights selected by an order 1 (16 bit) context, and with a faster weight update rate of about 0.01.

With method 4 you can give an argument like "-method 4 1" to double the memory allocated to the components to improve compression. The same extra memory is needed to decompress. The default is 111 MB per thread. An argument n multiplies memory usage by 2^n. n can be negative.

Methods 1, 2, and 3 only work in journaling and streaming mode, since they have a 16 MB block size limit. Method 4 and configuration files work in all modes.

The following tests are on a 2.0 GHz T3200 with 2 cores. zpaq will automatically detect the number of cores and use the same number of compression or decompression threads, although this can be overridden.

Compression                        Compressed size           Decompresser  Total size    Time (ns/byte)
Program   Options                  enwik8      enwik9        size (zip)    enwik9+prog   Comp  Decomp  Mem   Alg   Note
-------   ------------             ----------  -----------   -----------   -----------   ----  ------  ----  ----  ----
zpaq 6.12 -method 1                37,397,857  328,974,375   104,067 s     329,078,442     93      53   152  LZ77  26
          -method 1 -streaming     37,359,931  328,618,875   104,067 s     328,722,942     85      28   151  LZ77  26
          -method 2                31,765,035  281,184,939   104,067 s     281,289,006    196     108   153  LZ77  26
          -method 2 -streaming     31,730,884                                             218     126   151  LZ77  26
          -method 3                23,341,562  203,365,453   104,067 s     203,469,520    429     369   238  BWT   26
          -method 3 -streaming     23,328,888                                             425     375   238  BWT   26
          -method 4                21,768,810                                            1403    1371   299  CM    26
          -method 4 -streaming     21,744,770                                            1403    1356   299  CM    26
          -method 4 -solid         20,941,591                                            2036    2056   109  CM    26
          -method 4 1 -solid       20,740,920                                            2338    2197   216  CM    26
          -method 4 4 -solid       20,581,270                                            2356    2289  1482  CM    26
          -method 4 4 -tiny        20,581,208  173,028,477   104,067 s     173,132,544   2107    2230  1654  CM    26

zpaq v6.19, Jan. 23, 2013, moves the -solid and -tiny modes into a separate program, zpaqd, and eliminates -streaming. It adds 5 more compression levels (0 through 9). -method 5 is max.cfg, a 22 component CM with some of the component sizes reduced to use about 225 MB per thread. -methods 6 through 9 each double the memory size (450 MB to 1.8 GB) and block size (32 MB to 256 MB). All levels except 0 (store uncompressed) have an E8E9 pre/post-processor. -methods 0 through 4 are unchanged.

Compression                        Compressed size            Decompresser  Total size    Time (ns/byte)
Program   Options                  enwik8       enwik9        size (zip)    enwik9+prog   Comp  Decomp  Mem   Alg   Note
-------   ------------             -----------  -----------   -----------   -----------   ----  ------  ----  ----  ----
zpaq 6.19 -method 0 -threads 2     100,050,464                                              37      42   169  copy  26
          -method 1 -threads 2      37,398,697                                             143      61   225  LZ77  26
          -method 2 -threads 2      31,766,023                                             294     185   225  LZ77  26
          -method 3 -threads 2      23,342,327                                             635     548   322  BWT   26
          -method 4 -threads 2      21,770,084                                            1319    1331   378  CM    26
          -method 5 -threads 2      20,491,832                                            3778    3773   563  CM    26
          -method 6 -threads 2      19,901,321                                            4446    4615   991  CM    26
          -method 7 -threads 2      19,497,869                                            4625    4711  1845  CM    26
          -method 8 -threads 1      19,038,853  164,475,887   95,914 s      164,571,801   6153    6296  1911  CM    26
          -method 8 -threads 2      19,038,853                                            3553    3551  3800  CM    48
          -method 9 -threads 1      19,004,217  161,001,056   95,914 s      161,096,970   3468    3521  3800  CM    48

zpaq v6.34 has 7 compression methods as follows:

0 = deduplicate only, store uncompressed.

1 = LZ77 with variable length codes in 16 MB blocks (default).

2 = like 1 with longer search for matches and 64 MB blocks.

3 = byte aligned LZ77 with context modeling of literals and parse state.

4 = 3 or BWT, whichever is smaller.

5 = 3, 4, or 8-9 component CM, whichever is smaller.

6 = CM with about 20 components.

Methods 0 and 1 use 16 MB blocks by default. Methods 2..6 use 64 MB blocks. The size can be specified by a second digit N which selects 2^N MB blocks. Thus, the defaults are 04, 14, 26, 36, 46, 56, 66. Larger blocks compress better but require more memory per thread.

Methods 1..6 use heuristics to detect already compressed data and either store it or compress it with a fast method like 1 depending on the degree of compressibility. The heuristic depends on the 256 byte order-1 prediction table that is used to compute the rolling hash used in the fragmentation algorithm. The table is initialized to all zeros at each fragment boundary, and contains the last byte seen in each of 256 possible 1 byte contexts. If the data is random, then at each fragment boundary (average size 64K), the following properties are expected:

The fraction of correct predictions is 1/256.

The number of nonzero entries in the table (if at least 4K) is 1/256.

The frequency distribution, weighting successive occurrences of the same value by 1, 1/2, 1/3... is about 205.

The probability of each value matching any of the previous 4 tables is 1/256.

In addition, the order 1 tables are used to detect text and x86 (.exe) data types. Text is detected if at least 5 letter, digit, period, or comma contexts predict a space, minus any predicted characters in the range 1..8, 11, 12, 14..31, which normally do not appear in text files. If at least 1/4 of the fragments are detected as text, then methods 5 and 6 add extra models for it. x86 is detected if at least 5 contexts predict a 139 (an x86 MOV reg, r/m instruction). If at least 1/8 of the fragments are detected as x86, then an E8E9 pre/post processor is used in methods 1..6.

LZ77 and BWT removed the 16 MB block size limitation of the previous version. Variable length LZ77 adds an extra field of rb = 1..8 bits to represent the low bits of an offset up to 32 bits, where rb increases by 1 for each doubling of the block size over 16 MB. 2^rb - 1 is added to the offset, so that it requires an rb..rb+23 bit code.

Byte aligned LZ77 removed the limitation by eliminating the short code (3 bit length and 11 bit offset) and adding a code with 4 offset bytes. Lengths range from m..m+63 where m is the minimum match length, normally 8 when used with an order-1 context model.

BWT removes the block size limitation by removing the IBWT optimization of packing pointers and the byte pointed to into a single 32 bit linked list element when the block size is over 16 MB. No changes were required for higher compression levels.

zpaq versions since v6.22 support custom context models through the command line. When compressing enwik8 and enwik9 the following models are automatically generated:

Option  Equivalent
------  ----------
-m 0    -m x4,0
-m 1    -m x4,1,4,0,3,24,16,18
-m 18   -m x8,1,4,0,3,27,16,18
-m 2    -m x6,1,4,8,4,26,16,18
-m 28   -m x8,1,4,8,4,27,16,18
-m 3    -m x6,2,8,0,4,26,16,24c0,0,511
-m 38   -m x8,2,8,0,4,26,16,24c0,0,511
-m 4    -m x6,3ci1
-m 48   -m x8,3ci1
-m 5    -m x6,0ci1,1,1,1,2awm
-m 58   -m x8,0ci1,1,1,1,2awm
-m 6    -m x6,0w2c0,1010,255i1c256ci1,1,1,1,1,1,2ac0,2,0,255i1c0,3,0,0,255i1c0,4,0,0,0,255i1mm16ts19t0
-m 68   -m x8,0w2c0,1010,255i1c256ci1,1,1,1,1,1,2ac0,2,0,255i1c0,3,0,0,255i1c0,4,0,0,0,255i1mm16ts19t0

The meaning is as follows.

x (experimental) rather than a digit selects a specific method which is the same for every block. It can also be s to add in streaming mode with each file in a separate block and large files split into blocks with no deduplication.

The first digit N1 after x selects a maximum block size of 2^(N1+20) - 4096 bytes. This is selected by the second digit of the method, if present, or else it defaults to 6 for methods 2..6 or 4 otherwise.

The second digit N2 selects the pre/post processing step. 0 means none. 1 means LZ77 with variable length codes. 2 means LZ77 with byte aligned codes. 3 means BWT. 4..7 means 0..3 with E8E9 filtering.

N3..N8 apply to the LZ77 modes only. N3 (4 or 8) is the minimum match length. N4 (8 or 0), if not 0, specifies a context order to search first. N5 (3 or 4) says to search 2^N5 contexts of each order to look for matches. N6 (24..27) specifies 2^N6 elements in the hash table for lookups. Each entry requires 4 bytes of memory. It defaults to the block size up to N1=26, then N1-1. N7 and N8 specify that the minimum match (N3) should be increased by 1 after a literal or match, respectively, when the match offset is greater than 2^N7 or 2^N8 respectively.

The sequence of strings starting with letters followed by a comma-separated list of numbers specifies various context models used by methods 3 and higher. c0 specifies an ICM (indirect context model: context to bit history to prediction). c1...c256 (used in -m 6) specifies a CM (context to prediction) with an update rate of 1/count and maximum count of N1*4-4, e.g. c256 specifies 1020. The remaining arguments to c default to 0. N2 describes any special contexts. N2 in 1..255 (e.g. c0,2) means offset mod N2. N2 in 1000..1255 means the distance to the last occurrence of N2-1000 (e.g. c0,1010 means how far from the last linefeed). N3 and up specifies byte masks starting with the most recent context byte (e.g. c0,2,0,255 means offset mod 2 combined with the second context byte (sparse model)). A value of 256..511 includes the byte aligned LZ77 parse state if applicable (e.g. c0,0,511 means the order 1 context plus parse state hashed together).

i followed by a list specifies a chain of ISSE components, with each context order increasing by the specified amount and hashed with the previous component's context (e.g. ci1,1,1,1,2 specifies an order 0 ICM chained with order 1, 2, 3, 4, 6 ISSE). Each ISSE (indirect secondary symbol estimator) adjusts the prediction of the previous component according to the bit history of the current context (hashed together with the previous component's context).

a specifies a match model, which predicts the bit which followed the most recent occurrence of the current (normally high order) context. It can take parameters specifying buffer size, hash table index size and context order.

wN1 specifies a word model, an ICM-ISSE chain of increasing order from 0 to N1-1 in words rather than bytes. A word is defined as a sequence of letters converted to upper case, ignoring all other characters (e.g. w2 specifies an order 0 ICM and order 1 ISSE). It can take additional parameters specifying an alphabet range and a mask to convert case.

m specifies a mixer, which adaptively averages the predictions of all prior components. It can take a parameter (default 8) which is the number of bits of context to select the mixing weights (e.g. m16 is a byte-wise order 1 context). It takes additional parameters specifying update rate.

t is a MIX2 2-input mixer which averages just the last 2 components.

s is a SSE which adjusts the previous prediction like an ISSE but using a direct context instead of a bit history. It takes parameters specifying the number of context bits (e.g. s19 selects the current and previous bytes and the 3 high bits of the second byte), and additional parameters specifying initial and final update rates.

-m is short for -method. -th 1 (-threads 1) selects 1 thread. The default on the test machine is 4 (2 cores + 2 hyperthreads). It is also used in decompression to reduce memory.

Compression             Compressed size           Decompresser  Total size    Time (ns/byte)
Program   Options       enwik8      enwik9        size (zip)    enwik9+prog   Comp  Decomp  Mem   Alg   Note
-------   ------------  ----------  -----------   -----------   -----------   ----  ------  ----  ----  ----
zpaq 6.34 -m 1          36,720,879  322,717,507                                  38      15   456  LZ77  48
          -m 18 -th 1   36,174,283  316,439,766                                  85      25  1200  LZ77  48
          -m 2          32,785,291  287,047,166                                  76      17  1500  LZ77  48
          -m 28 -th 1   32,123,217  279,231,899                                 159      25  1200  LZ77  48
          -m 3          30,759,444  270,317,562                                  89      56  1500  LZ77  48
          -m 38 -th 1   30,216,795  264,333,006                                 198     106  1200  LZ77  48
          -m 4          21,982,505  189,860,169                                 285     224  1800  BWT   48
          -m 48 -th 1   21,293,686  179,016,475                                 596     512  1400  BWT   48
          -m 5          20,742,462  179,365,293                                 937     658  2100  CM    48
          -m 58 -th 1   20,214,879  172,645,399                                1931    1430  2400  CM    48
          -m 6          19,627,225  168,583,236                                2348    2356  3300  CM    48
          -m 68 -th 1   18,998,601  160,541,121   118,086 s     160,659,207   4300    4408  3200  CM    48

The following table shows compression with the config file max5.cfg (Oct. 14, 2013). This is the same model as max_enwik9.cfg except that it was modified to take an argument to double memory usage for most of the components for each increment. With argument 0, it is the same as max_enwik9. Compression was with zpaqd 6.33 (June 20, 2013), which is the development tool that accompanies zpaq and produces streaming mode archives from a config file. Thus, the command "zpaqd c max5 3 archive enwik9" compresses to archive.zpaq with 3 passed to $1 in max5.cfg. This has the effect of using almost 8 times as much memory for both compression and decompression as max_enwik9. The archive was decompressed with both zpaq 6.42 (Sept. 26, 2013) and with tiny_unzpaq (Mar. 21, 2012, public domain) compiled with g++ 4.1.2 -O3 under Linux on the test machine, which has 20 GB of available memory. zpaq 6.42 is an archiver like zpaq 6.33 with a number of added features and bug fixes unrelated to compression. tiny_unzpaq is a stand-alone program that extracts only streaming mode archives and is designed so that the source code is as small as possible. It does not support JIT compilation of the ZPAQL code, or multithreading, and has no error checking or help message. It takes an archive as an argument with no options and extracts to the saved names.

max6.cfg (Oct. 15, 2013) modifies max5 by rewriting the word model and adding models that count brackets ("[" minus "]" in range 0..2) and a column model (counts bytes after the last linefeed in range 0..64). It also changes the memory parameter from $1 to $3 so it can be passed to zpaq like "-m s10.0.5fmax6". This means to choose streaming mode (s), a block size of 2^10 MB (10), no preprocessing (0), pass 5 as $3 selecting 14 GB (or 1 selecting 1.4 GB) using max6.cfg. For this test, tiny_unzpaq is used to extract when the decompresser is given as "sd" although either program could be used.

Compression                      Compressed size           Decompresser  Total size    Time (ns/byte)
Program    Options               enwik8      enwik9        size (zip)    enwik9+prog   Comp  Decomp  Mem    Alg  Note
-------    ------------          ----------  -----------   -----------   -----------   ----  ------  -----  ---  ----
zpaqd 6.33 max5 0                18,238,448                                             5960           2000  CM   61
           max5 1                18,135,013  146,750,019                               6309           3400  CM   61
           max5 2                18,095,676  144,918,290                               6521           6600  CM   61
           max5 3                18,084,027  143,757,714   4,760 sd      143,762,474   5894   13173  13100  CM   61
zpaq 6.42                                    143,757,714   125,670 s     143,883,384           5985  13500  CM   61
zpaq 6.42  -m s10.0.1fmax6       18,167,158  150,622,666   125,670 s     150,748,336   6368    6475   1400  CM   61
           -m s10.0.5fmax6       17,855,729  142,252,605   4,760 sd      142,257,365   6699   14739  14000  CM   61

zpaq 6.50, Mar. 21, 2014, uses 5 compression levels instead of 6. LZ77 when used in methods 2 and higher uses a suffix array to find matches. There are also other improvements in sorting files, grouping into blocks, detecting file type, detecting random data, and selecting compression algorithm based on type. Tests below used 4 threads.

Compression             Compressed size           Decompresser  Total size    Time (ns/byte)
Program   Options       enwik8      enwik9        size (zip)    enwik9+prog   Comp  Decomp  Mem   Alg   Note
-------   ------------  ----------  -----------   -----------   -----------   ----  ------  ----  ----  ----
zpaq 6.50 -method 1     35,691,734  314,117,968   137,993 s     314,255,964     35      23   512  LZ77  48
          -method 2     31,184,422  271,626,606   137,993 s     271,764,602    150      24  1800  LZ77  48
          -method 3     21,980,366  189,875,990   137,993 s     190,013,986    222     220  1600  BWT   48
          -method 4     20,740,505  179,455,249   137,993 s     179,593,245    665     670  2200  CM    48
          -method 5     19,625,015  168,590,741   137,993 s     168,728,730   2410    2419  3400  CM    48

drt|lpaq9m

lpaq versions 1 through 8 may be downloaded here. lpaq9* can be downloaded here or as a zpaq archive. The decompr8 series of Hutter prize entries (decompresser and enwik8 archive) are also listed here because they followed a period of development of the lpaq series.

Note: some of these programs are compressed with upack, which compresses better than upx. Some virus detectors give false alarms on all upack-compressed executables. The programs are not infected.

lpaq1 is a free, open source (GPL) file compressor by Matt Mahoney, July 24, 2007. It uses context mixing. It is a "lite" version of paq8l, about 35 times faster at the cost of about 10% in compression. The "9" option selects maximum memory. The options range from 0 (6 MB) to 9 (1.5 GB). Memory usage is 3 + 3*2^N MB, N = 0..9.

The compressor mixes 7 contexts: orders 1, 2, 3, 4, 6, a unigram word context (consecutive letters, case insensitive), and a matched bit context. The contexts (except the matched bit) are mapped to nonstationary bit histories using nibble-aligned hash tables, then mapped to bit prediction probabilities using stationary adaptive tables with bit counts to control adaptation rate. The matched bit context maps the predicted bit (based on a context match), match length, and order-1 context (or order 0 if no match) to a bit prediction. The probabilities are combined in the logistic domain (log(p/(1-p))) using a single layer neural network selected by a small context (3 high bits of last byte + context order), then passed through 2 SSE stages (orders 0 and 1) and arithmetic coded. Except for one model for ASCII text, there are no specialized models for binary data, .exe, .bmp, .jpeg, etc.

lpaq2 by Alexander Rhatushnyak, Sept. 20, 2007, contains some speed optimizations.

lprepaq 1.2 by Christian Schnaader, Sept. 29, 2007, is lpaq1 combined with precomp as a preprocessor. precomp compresses JPEG files and also expands data segments compressed with zlib, often making them more compressible. This preprocessing has no effect on text files.

lpaq3 and elpaq3 by Alexander Rhatushnyak, Sept. 29, 2007, are two versions built from the same source code. When compiled with -DWIKI, the result is elpaq3, which is tuned for large text files. The normal compile produces lpaq3.

lpaq3a by Alexander Rhatushnyak, Sept. 30, 2007, improves compression on some files over lpaq3 (but not enwik8/9). The archive also contains lpaq3e.ex