Xz format inadequate for long-term archiving

Abstract

One of the challenges of digital preservation is the evaluation of data formats. It is important to choose well-designed data formats for long-term archiving. This article describes the reasons why the xz compressed data format is inadequate for long-term archiving and inadvisable for data sharing and for free software distribution. The relevant weaknesses and design errors in the xz format are analyzed and, where applicable, compared with the corresponding behavior of the bzip2, gzip and lzip formats. Key findings include: (1) safe interoperability among xz implementations is not guaranteed; (2) xz's extensibility is unreasonable and problematic; (3) xz is vulnerable to unprotected flags and length fields; (4) LZMA2 is unsafe and less efficient than the original LZMA; (5) xz includes useless features that increase the number of false positives for corruption; (6) xz shows inconsistent behavior with respect to trailing data; (7) error detection in xz is several times less accurate than in bzip2, gzip and lzip.

Disclosure statement: The author is also author of the lzip format.

Acknowledgements: The author would like to thank Lasse Collin for his detailed and useful comments that helped improve the first version of this article. The author would also like to thank the fellow GNU developers who reviewed this article before publication and the people whose comments have helped to fill in the gaps.

This article tries to be as objective and complete as possible. Please report any errors or inaccuracies to the author at the lzip mailing list lzip-bug@nongnu.org or in private at antonio@gnu.org.

1 Introduction

There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.

-- C.A.R. Hoare

Perfection is reached, not when there is no longer anything to add, but when there is no longer anything to take away.

-- Antoine de Saint-Exupery

Both the xz compressed data format and its predecessor lzma-alone have serious design flaws. But while lzma-alone is a toy format lacking fundamental features, xz is a complex container format full of contradictions. For example, xz tries to appear as a very safe format by offering overkill check sequences like SHA-256 but, at the same time, it fails to protect the length fields needed to decompress the data in the first place. These defects make xz inadequate for long-term archiving and reduce its value as a general-purpose compressed data format.

This article analyzes the xz compressed data format, that is, the way the bits are arranged in xz compressed files and the consequences of such an arrangement. This article is about formats, not programs. In particular, this article is not about bugs in any compression tool. The fact that the xz reference tool (xz-utils) has had more bugs than bzip2 and lzip combined is mainly a consequence of the complexity and bad design of the xz format. Likewise, the uninformative error messages provided by the xz tool reflect the extreme difficulty of finding out what failed in case of corruption in a xz file.

This article started with a series of posts to the debian-devel mailing list [Debian], where it became clear that nobody had analyzed xz in any depth before adopting it in the Debian package format. The same unthinking adoption of xz seems to have happened in major free software projects, like GNU Coreutils and Linux. In my opinion, it is a mistake for any widely used project to become an early adopter of a new data format; it may cause a lot of trouble if any serious defect is later discovered in the format.

One thing that developers usually fail to consider is that the format used for distribution of source code should be chosen carefully because source tarballs are meant to be archived forever as a way to compile and run a given version of some software in the future, for example to decode old data.

Compression may be good for long-term archiving. For compressible data, multiple compressed copies may provide redundancy in a more useful form and may have a better chance of surviving intact than one uncompressed copy using the same amount of storage space. This is especially true if the format provides recovery capabilities like those of lziprecover, which is able to find and combine the good parts of several damaged copies.

2 The reasons why the xz format is inadequate for long-term archiving

2.1 Xz is a container format

On Unix-like systems, where a tool is supposed to do one thing and do it well, compressed file formats are usually formed by the compressed data, preceded by a header containing the parameters needed for decompression, and followed by a trailer containing integrity information. Bzip2, gzip and lzip formats are designed this way, minimizing both overhead and false positives.

On the contrary, xz is a container format which currently contains another container format (LZMA2), which in turn contains a mix of LZMA data and uncompressed data. In spite of implementing just one compression algorithm, xz already manages 3 levels of headers, which increases its fragility. The xz format is not even fully documented: section 3.2 of the xz format specification states that "the format of the filter-specific encoded data is out of scope of this document". The details about the [LZMA2 format] described in Wikipedia were deduced from the source code of the xz-embedded decompressor included in the Linux kernel.

The xz format has more overhead than bzip2, gzip or lzip, most of it either not properly designed (e.g., unprotected headers) or plain useless (padding). In fact, a xz stream can contain such a large amount of overhead that the format designers deemed it necessary to compress the overhead using unsafe methods.

There is no reason to use a container format for a general-purpose compressor. The right way of implementing a new compression algorithm is to provide a version number in the header, and the right way of implementing binary filters is to write a preprocessor that applies the filter to the data before feeding them to the compressor. (See for example mince).

2.2 Xz is fragmented by design

Xz was designed as a fragmented format. Xz implementations may choose what subset of the format they support. In particular, integrity checking in xz offers multiple choices of check types, all of them optional except CRC32, which is recommended. (See [Xz format], section 2.1.1.2 'Stream Flags'. See also [RFC 2119] for the definitions of 'optional' and 'recommended'). Safe interoperability among xz implementations is not guaranteed. For example, the xz-embedded decompressor does not support the optional check types. Other xz implementations may choose not to support integrity checking at all.

The xz reference tool (xz-utils) ignores the recommendation of the xz format specification and uses by default an optional check type (CRC64) in the files it produces. This prevents decompressors that do not support the optional check types from verifying the integrity of the data. Using --check=crc32 when creating the file makes integrity checking work on the xz-embedded decompressor, but as CRC32 is just recommended, it does not guarantee that integrity checking will work on all xz compliant decompressors. Distributing software in xz format can only be guaranteed to be safe if the distributor controls the decompressor run by the user (or can force the use of external means of integrity checking). Error detection in the xz format is broken; depending on how the file was created and on what decompressor is available, the integrity check in xz is sometimes performed and sometimes not. The latter is usually the case for the tarballs released in xz format by GNU and Linux when they are decompressed with the xz-embedded decompressor (see the third xz test in [benchmark]).

Fragmentation (subformat proliferation) hinders interoperability and complicates the management of large archives. The lack of guaranteed integrity checking increases the probability of undetected corruption. Bzip2, gzip and lzip are free from these defects; any decompressor can decompress and verify the integrity of any file in the corresponding format.

2.3 Xz is unreasonably extensible

The design of the xz format is based on two false ideas: that better compression algorithms can be mass-produced like cars in a factory, and that it is practical to embed all these algorithms in one format. Note, for example, that some xz implementations already do not even fully support integrity checking.

Xz has room for 2^63 filters, which can then be combined to make an even larger number of algorithms. Xz reserves less than 0.8% of filter IDs for custom filters, but even this small range provides about 8 million custom filter IDs for each human inhabitant on earth. There is not the slightest justification for such an egregious level of extensibility. Every useless choice allowed by a format takes space and makes corruption both more probable and more difficult to recover from.

The basic ideas of compression algorithms were discovered early in the history of computer science. LZMA is based on ideas discovered in the 1970s. Don't expect an algorithm much better than LZMA to appear anytime soon, much less several of them in a row.

In 2008 one of the designers of xz (Lasse Collin) warned me that lzip would become stuck with LZMA while others moved to LZMA2, LZMA3, LZMH, and other algorithms. Now xz-utils is usually unable to match the compression ratio of lzip because LZMA2 has more overhead than LZMA and, as expected, no new algorithms have been added to xz-utils.

2.4 Xz's extensibility is poorly designed

The xz format lacks a version number field. The only reliable way of knowing if a given version of a xz decompressor can decompress a given file is by trial and error. The 'file' utility does not provide any help (note that no version information is shown for xz):

$ file COPYING.*
COPYING.lz: lzip compressed data, version: 1
COPYING.xz: XZ compressed data

Xz-utils can report the minimum version of xz-utils required to decompress a given file, but it must decode each block header in the file to find it out, and it can only report older versions of xz-utils by using its knowledge of the features implemented in each past version of itself. If a newer version of xz-utils is required, the file says nothing about which one it may be. The report from xz-utils is also useless for finding out what version of other decompressors (for example 7-zip) could decompress the file. Note that the version of xz-utils reported may be unable to decompress the file if it was built without support for some feature present in the file.

The extensibility of bzip2 and lzip is better. Both formats provide a version field. Therefore it is trivial for them to seamlessly and reliably incorporate a new compression algorithm because any tool knows the format version of any file. Newer tools know what algorithm to use and old tools can report to the user the format version required to decompress a newer file, from which it is easy to find out the version of a tool able to decompress the file. If an algorithm much better than LZMA is found, a version 2 lzip format (perfectly fit to the new algorithm) can be designed, along with a version 2 lzip tool able to decompress the old and new formats transparently. Bzip2 is already a "version 2" format. The reason why bzip2 does not decompress bzip files is that the original bzip format was abandoned because of problems with software patents.
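
The mechanism is simple enough to sketch in a few lines. The fragment below assumes the lzip header layout (a 4-byte magic string followed by a version byte); the function name and messages are hypothetical:

```python
# Version-based extensibility: an old tool reading a file with a newer
# version number can tell the user exactly which format version is
# required, instead of failing by trial and error.
MAGIC = b"LZIP"           # 4-byte magic string (lzip header layout)
SUPPORTED_VERSIONS = {1}  # versions this (hypothetical) tool can decode

def check_header(header: bytes) -> str:
    if header[:4] != MAGIC:
        return "error: not a lzip file"
    version = header[4]
    if version not in SUPPORTED_VERSIONS:
        return f"error: file needs lzip format version {version}"
    return "ok"

print(check_header(b"LZIP\x01\x14"))  # ok
print(check_header(b"LZIP\x02\x14"))  # error: file needs lzip format version 2
```

A single byte thus tells any tool, old or new, everything it needs to know about which decoder the file requires.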

The extensibility of gzip is obsolete mainly because of the 32-bit uncompressed size (ISIZE) field.

2.5 Xz fails to protect the length of variable size fields

According to [Koopman] (p. 50), one of the "Seven Deadly Sins" (i.e., bad ideas) of CRC and checksum use is failing to protect a message length field. This causes vulnerabilities due to framing errors. Note that the effects of a framing error in a data stream are more serious than Figure 1 suggests. Not only are data at a random position interpreted as the CRC; whatever data follow the bogus CRC will also be interpreted as the beginning of the following field, preventing the successful decoding of any remaining data in the stream.



Figure 1. Corruption of message length field. Source: [Koopman], p. 30.

Except for the 'Backward Size' field in the stream footer, none of the many length fields in the xz format is protected by a check sequence of any kind; not even a parity bit. All of them suffer from the framing vulnerability illustrated in the picture above. In particular, every LZMA2 header contains one 16-bit unprotected length field. Some length fields in the xz format are of variable size themselves, adding a new failure mode to xz not found in the other three formats: the double framing error.
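
The framing vulnerability can be demonstrated with a toy parser for a stream of length-prefixed records (hypothetical 1-byte length fields; the real xz fields are larger, but the failure mode is the same):

```python
# Toy parser for a stream of length-prefixed records where the length
# fields carry no check sequence of their own.
def parse(stream: bytes) -> list:
    records, pos = [], 0
    while pos < len(stream):
        n = stream[pos]                      # unprotected length field
        records.append(stream[pos + 1:pos + 1 + n])
        pos += 1 + n
    return records

good = bytes([3]) + b"abc" + bytes([2]) + b"de" + bytes([1]) + b"f"
print(parse(good))   # [b'abc', b'de', b'f']

bad = bytearray(good)
bad[0] ^= 0x04       # a single bit flip in the first length field
bad = bytes(bad)
print(parse(bad))    # the first record swallows the following fields;
                     # everything after the flipped field is misframed
```

One flipped bit corrupts not just one record but the framing of every record that follows it.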

Bzip2 is affected by this defect to a lesser extent; it contains two unprotected length fields in each block header. Gzip may be considered free from this defect because its only top-level unprotected length field (XLEN) can be validated using the LEN fields in the extra subfields. Lzip is free from this defect.

Optional fields are just as unsafe as unprotected length fields if the flag that indicates the presence of the optional field is itself unprotected. The result is the same: framing errors. Again, except for the 'Stream Flags' field, none of those flags in the xz format is protected by a check sequence. In particular, the critically important 'Block Flags' field in block headers and bit 6 in the control byte of the numerous LZMA2 headers are not protected.

Bzip2 contains 16 unprotected flags for optional Huffman bitmaps in each block header. Gzip just contains one byte with four unprotected flags for optional fields in its header. Lzip is free from optional fields.

2.6 Xz uses variable-length integers unsafely

Xz stores many (potentially large) numbers using a variable-length representation terminated by a byte with the most significant bit (msb) cleared. In case of corruption, not only may the value of the field become incorrect, but the size of the field may also change, causing a framing error in the following fields. Xz uses such variable-length integers to store the size of other fields; in case of corruption in the size field, both the position and the size of the target field may become incorrect, causing a double framing error. See for example [Xz format], section 3.1.5 'Size of Properties' in 'List of Filter Flags'. Bzip2, gzip and lzip store all fields representing numbers in a safe fixed-length representation.
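
The sketch below decodes a variable-length integer of the kind described above (7 bits per byte, least significant group first, a set msb meaning another byte follows) and shows how a single bit flip changes both the value and the size of the field:

```python
def decode_varint(buf: bytes, pos: int = 0):
    """Decode a variable-length integer: 7 bits per byte, least
    significant group first; a set msb means another byte follows."""
    value = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        if b & 0x80 == 0:               # msb clear: last byte of the field
            return value, pos
        shift += 7

good = bytes([0x80 | 0x25, 0x40])       # 2-byte encoding of 0x2025
print(decode_varint(good))              # value 0x2025, field occupies 2 bytes

bad = bytes([good[0] & 0x7F]) + good[1:]  # flip the msb of the first byte
print(decode_varint(bad))               # value 0x25, field now occupies 1 byte
# Every field after this one is now read from the wrong position (a
# framing error); if this integer was itself a size field, the target
# field is also read with the wrong size (the double framing error).
```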

Xz features a monolithic index that is especially vulnerable to cascading framing errors. Some design errors of the xz index are:

- The number of records is coded as an unprotected variable-length integer, vulnerable to double framing error.
- The size of the index is not stored anywhere; it must be calculated by decoding the whole index and can't be verified. ('Backward Size' stores the size of the index rounded up to the next multiple of four bytes, not the real size).
- When reading from unseekable sources, the index delays the verification of the block sizes until the end of the stream and requires a potentially huge amount of RAM (up to 16 GiB), unless such verification is made by hashing, in which case it can't be known which blocks failed the test. The safe and efficient way is to verify the size of each block as soon as it is processed, as gzip and lzip do.
- The list of records is made of variable-length integers concatenated together. With respect to corruption, it acts as one potentially very long unprotected variable-length integer; just one bit flip in the msb of any byte causes the remaining records to be read incorrectly. It also causes the size of the index to be calculated incorrectly, losing the positions of the CRC32 and the stream footer.
- Each record stores the size (not the position) of the corresponding block, but xz's block headers do not provide an identification string that could validate the block size. Therefore, just one bit flip in any 'Unpadded Size' field causes the positions of the remaining blocks to be calculated incorrectly.

By contrast, lzip provides a distributed index where each member size is validated by the presence of the ID string in the corresponding member header. Neither the bzip2 format nor the gzip format provides an index.

2.7 LZMA2 is unsafe and less efficient than the original LZMA

The xz-utils manual says that LZMA2 is an updated version of LZMA to fix some practical issues of LZMA. This wording suggests that LZMA2 is some sort of improved LZMA algorithm. (After all, the 'A' in LZMA stands for 'algorithm'). But LZMA2 is a container format that divides LZMA data into chunks in an unsafe way.

The [LZMA2 format] contains an unrestricted mix of LZMA packets and uncompressed data packets. Each packet starts with a header that is not protected by any check sequence in spite of containing the type and size of the following data. Therefore, every bit flip in a LZMA2 header causes either a framing error or a desynchronization of the decoder. In any case it is usually not possible to decode the remaining data in the block or even to know what failed. Compare this with [Deflate] which at least does protect the length field of its non-compressed blocks. (Deflate's compressed blocks do not have a length field).

Note that of the 3 levels of headers in a xz file (stream, block, LZMA2), the LZMA2 headers, by far the most numerous, are the ones not protected by a check sequence. There is usually one stream header and one block header in a xz file, but there is at least one LZMA2 header for every 64 KiB of LZMA2 data in the file. In extreme cases the LZMA2 headers can make up to 3% of the size of the file:

-rw-r--r-- 1 14208 Oct 21 17:26 100MBzeros.lz
-rw-r--r-- 1 14195 Oct 21 17:26 100MBzeros.lzma
-rw-r--r-- 1 14676 Oct 21 17:26 100MBzeros.xz

The files above were produced by lzip (.lz) and xz-utils (.lzma, .xz). The LZMA stream is identical in the .lz and .lzma files above; they just differ in the header and trailer. The .xz file is larger than the other two mainly because of the 50 LZMA2 headers it contains. LZMA2 headers make xz both more fragile and less efficient.

In practice, for compressible data, LZMA2 is just LZMA with 0.015%-3% more overhead. The maximum compression ratio of LZMA is about 7089:1, but LZMA2 is limited to approximately 6875:1 (measured with 1 TB of data). For uncompressible data, xz expands the data less than lzip (0.01% vs 1.4%), but at the cost of making it more vulnerable to undetected corruption, because xz stores uncompressible data uncompressed, and corruption in the uncompressed packets of a LZMA2 stream can't be detected by the decoder, leaving the (optional) check sequence as the only way of detecting errors there. (See the xz tests in [benchmark]).

On the other hand, the original LZMA data stream provides embedded error detection. Any distance larger than the dictionary size acts as a forbidden symbol, allowing the decoder to detect the approximate position of errors, and leaving very little work for the check sequence in the detection of errors.

LZMA2 could have been safer and more efficient if only its designers had copied the structure of Deflate: terminate compressed blocks with a marker, and protect the length of uncompressed blocks. This would have reduced the overhead, and therefore the number of false positives, in the files above by a factor of 25. For compressible files, which only need a header and a marker, the improvement is usually 8 times less overhead per mebibyte of compressed size (about 500 times less overhead for a file of 64 MiB).

Section 5.3.1 of the xz format specification states that LZMA2 "improves support for multithreading", but in practice LZMA2 is not suitable for parallel decompression mainly for two reasons:

- Only LZMA2 streams containing dictionary resets can be decompressed in parallel, but the positions of the dictionary resets are not stored anywhere nor marked with any recognizable identification string. Therefore the LZMA2 headers must be decoded sequentially to find the dictionary resets (if any exist).
- All the packets in a LZMA2 stream share a common check sequence. This means that the partial check sequences calculated for each substream must be combined to obtain the check sequence of the whole stream. But one of the currently supported check sequences (SHA-256) can't be calculated by parts and must be calculated sequentially for the whole stream.

2.8 The 4 byte alignment is unjustified

Xz is the only format of the four considered here whose parts are (arbitrarily) aligned to a multiple of four bytes. The size of a xz file must also be a multiple of four bytes for no reason. To achieve this, xz includes padding everywhere; after headers, blocks, the index, and the whole stream. The bad news is that if the (useless) padding is altered in any way, "the decoder MUST indicate an error" according to the xz format specification.

Neither gzip nor lzip include any padding. Bzip2 includes a minimal amount of padding (at most 7 bits) at the end of the whole stream, but it ignores any corruption in the padding.

Xz justifies alignment as being perhaps able to increase speed and compression ratio (see [Xz format], section 5.1 'Alignment'), but such increases can't happen because:

- The only last filter in xz is LZMA2, whose output does not need any alignment.
- The output of the non-last filters in the chain is not stored in the file. Therefore it can't be "later compressed with an external compression tool" as stated in the xz format specification.

One additional problem of the xz alignment is that four bytes are not enough; the IA64 filter has an alignment of 16 bytes. Alignment is a property of each filter that can only be managed by the archiver, not a property of the whole compressed stream. Even the xz format specification acknowledges that alignment of input data is the job of the archiver, not of the compressor.

The conclusion is that the 4 byte alignment is a misfeature that wastes space, increases the number of false positives for corruption, and worsens the burst error detection in the stream footer without producing any benefit at all.

2.9 Trailing data

If you want to create a compressed file and then append some data to it, for example a cryptographically secure hash, xz won't allow you to do so. The xz format specification forbids the appending of data to a file, except what it defines as 'stream padding'. In addition to telling you what you can't do with your files, defining stream padding makes xz show inconsistent behavior with respect to trailing data. Xz accepts the addition of any multiple of 4 null bytes to a file. But if the number of null bytes appended is not a multiple of 4, or if any of the bytes is non-null, then the decoder must indicate an error.

A format that reports as corrupt the only surviving copy of an important file just because cp had a glitch and appended some garbage at the end of the file is not well suited for long-term archiving. The worst thing is that the xz format specification does not offer any compliant way of ignoring such trailing data. Once a xz file gets any trailing data appended, it must be manually removed to make the file compliant again.

In a vain attempt to avoid such inconsistent behavior, xz-utils provides the option '--single-stream', which is just plain wrong for multi-stream files because it makes the decompressor ignore everything beyond the first stream, discarding any remaining valid streams and silently truncating the decompressed data:

cat file1.xz file2.xz file3.sig > file.xz
xz -d file.xz                   # indicates an error
xz -d --single-stream file.xz   # causes silent data loss
xz -kd --single-stream file.xz  # causes silent truncation

The '--single-stream' option violates the xz format specification which requires the decoder to indicate an error if the stream padding does not meet its requirements. The xz format should provide a compliant way to ignore any trailing data after the last stream, just like bzip2, gzip and lzip do by default.

2.10 Xz's error detection has low accuracy

"There can be safety tradeoffs with the addition of an error-detection scheme. As with almost all fault tolerance mechanisms, there is a tradeoff between availability and integrity. That is, techniques that increase integrity tend to reduce availability and vice versa. Employing error detection by adding a check sequence to a dataword increases integrity, but decreases availability. The decrease in availability happens through false-positive detections. These failures preclude the use of some data that otherwise would not have been rejected had it not been for the addition of error-detection coding". ([Koopman], p. 33).

But the tradeoff between availability and integrity is different for data transmission than for data archiving. When transmitting data, usually the most important consideration is to avoid undetected errors (false negatives for corruption), because a retransmission can be requested if an error is detected. Archiving, on the other hand, usually implies that if a file is reported as corrupt, "retransmission" is not possible. Obtaining another copy of the file may be difficult or impossible. Therefore accuracy (freedom from mistakes) in the detection of errors becomes the most important consideration.

Two error models have been used to measure the accuracy in the detection of errors. The first model consists of one or more random bit flips affecting just one byte in the compressed file. The second model consists of zeroed 512-byte blocks aligned to a 512-byte boundary, simulating a whole sector I/O error. Just one zeroed block per trial. The first model is considered the most important because bit flips happen even in the most expensive hardware [MSL].

Verification of data integrity in compressed files is different from other cases (like Ethernet packets) because the data that can become corrupted are the compressed data, but the data that are verified (the dataword) are the decompressed data. Decompression can cause error multiplication; even a single-bit error in the compressed data may produce any random number of errors in the decompressed data, or even modify the size of the decompressed data.

Because of the error multiplication caused by decompression, the error model seen by the check sequence is one of unconstrained random data corruption. (Remember that the check sequence verifies the integrity of the decompressed data). This means that the choice of error-detection code (CRC or hash) is largely irrelevant, and that the probability of an error being undetected by the check sequence (Pudc) is 1 / (2^n) for a check sequence of n bits. (See [Koopman], p. 5). Note that if some errors do not produce error multiplication, a CRC is then preferable to a hash of the same size because of the burst error detection capabilities of the CRC.

Decompression algorithms are usually able to detect some errors in the compressed data (for example a backreference to a point before the beginning of the data). Therefore, the total probability of an undetected error (Pud = false negative) is the product of the probability of the error being undetected by the decoder (Pudd) and the probability of the error being undetected by the check sequence (Pudc): Pud = Pudd * Pudc.

It is also possible that a small error in the compressed data does not alter at all the decompressed data. Therefore, for maximum availability, only the decompressed data should be tested for errors. Testing the compressed data beyond what is needed to perform the decompression increases the number of false positives much more than it can reduce the number of undetected errors.

Of course, error multiplication was not applied in the analysis of fields that are not compressed, for example 'Block Header'. Burst error detection was also considered for the 'Stream Flags' and 'Stream Footer' fields.

Trial decompressions were performed using the 'unzcrash' tool included in the lziprecover package.

The following sections describe the places in the xz format where error detection suffers from low accuracy and explain the cause of the inaccuracy in each case.

2.10.1 The 'Stream Flags' field

A well-known property of CRCs is their ability to detect burst errors up to the size of the CRC itself. Using a CRC larger than the dataword is an error because a CRC just as large as the dataword already detects all errors in it, while producing a lower number of false positives.

In spite of the mathematical property described above, the 16-bit 'Stream Flags' field in the xz stream header is protected by a CRC32 twice as large as the field itself, providing an unreliable error detection where 2 of every 3 reported errors are false positives. The inaccuracy reaches 67%. CRC16 would be a better choice from any point of view: it still detects all errors in 'Stream Flags', but produces half as many false positives as CRC32.
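
The 67% figure follows directly from the single-byte error model: any corruption confined to the check-sequence bytes is a false positive, because the protected data are intact. A back-of-the-envelope check:

```python
# Single-byte error model applied to the 2-byte 'Stream Flags' field:
# an error hitting a check-sequence byte reports corruption although
# the protected data are intact (a false positive).
field_size = 2                        # 'Stream Flags' is 16 bits

def false_positive_fraction(cs_size: int) -> float:
    return cs_size / (field_size + cs_size)

print(false_positive_fraction(4))     # CRC32: 4/6, the 67% quoted above
print(false_positive_fraction(2))     # CRC16: 2/4, only 50%
```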

Note that a copy of the 'Stream Flags', also protected by a CRC32, is stored in the stream footer. With such an amount of redundancy, xz should be able to repair a fully corrupted 'Stream Flags' field. Instead, the format specifies that if one of the copies, or one of the CRCs, or the backward size in the stream footer gets any damage, the decoder must indicate an error. The result is that getting a false positive for corruption related to the 'Stream Flags' is 7 times more probable than getting real corruption in the 'Stream Flags' themselves.

2.10.2 The 'Stream Footer' field

The 'Stream Footer' field contains the rounded-up size of the index field and a copy of the 'Stream Flags' field from the stream header, both protected by a CRC32. The inaccuracy of the error detection for this field reaches 40%; 2 of every 5 reported errors are false positives.

The CRC32 in 'Stream Footer' provides reduced burst error detection because it is stored at the front instead of the back of the codeword. (See [Koopman], p. C-20). Testing has found several undetected burst errors of 31 bits in this field, while a correctly placed CRC32 would have detected all burst errors up to 32 bits. The reason adduced by the xz format specification for this misplacement is to keep the four-byte fields aligned to a multiple of four bytes, but the 4 byte alignment is unjustified.

2.10.3 The 'Block Header' field

The 'Block Header' is of variable size. Therefore the inaccuracy of the error detection varies between 0.4% and 58%, usually being 58% (7 of every 12 reported errors are false positives). As shown in the graph below, CRC16 would have been a more accurate choice for any size of 'Block Header'. But inaccuracy is a minor problem compared with the lack of protection of the 'Block Header Size' and 'Block Flags' fields.



Figure 2. Inaccuracy of block header CRC for all possible header sizes.

2.10.4 The 'Block Check' field

Xz supports several types of check sequences (CS) for the decompressed data; none, CRC32, CRC64 and SHA-256. Each check sequence provides better accuracy than the next larger one up to a certain compressed size. For the single-byte error model, the inaccuracy for each compressed size and CS size is calculated by the following formula (all sizes in bytes):

Inaccuracy = ( compressed_size * Pudc + CS_size ) / ( compressed_size + CS_size )

Applying the formula above, it turns out that CRC32 provides more accurate error detection than CRC64 up to a compressed size of about 16 GiB, and more accurate than SHA-256 up to 112 GiB. It should be noted that SHA-256 provides worse accuracy than CRC64 for all possible block sizes.
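
These crossover points can be verified numerically. The sketch below applies the formula above, taking Pudc = 1 / 2^(8 * CS_size) for a check sequence of CS_size bytes:

```python
# Inaccuracy of error detection for a block of 'size' compressed bytes
# protected by a check sequence (CS) of 'cs_size' bytes, using the
# formula from the text with Pudc = 2 ** -(8 * cs_size) (unconstrained
# random corruption, as produced by error multiplication).
def inaccuracy(size: int, cs_size: int) -> float:
    pudc = 2.0 ** -(8 * cs_size)
    return (size * pudc + cs_size) / (size + cs_size)

def crossover(cs_a: int, cs_b: int) -> int:
    """Smallest compressed size (bytes) at which a cs_b-byte check
    sequence becomes more accurate than a cs_a-byte one."""
    lo, hi = 1, 2 ** 50
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if inaccuracy(mid, cs_a) <= inaccuracy(mid, cs_b):
            lo = mid        # smaller CS still at least as accurate
        else:
            hi = mid
    return hi

GiB = 1024 ** 3
print(crossover(4, 8) / GiB)    # CRC32 vs CRC64: about 16 GiB
print(crossover(4, 32) / GiB)   # CRC32 vs SHA-256: about 112 GiB
```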



Figure 3. Inaccuracy of block check up to 1 GB of compressed size.

For the zeroed-block error model, the inaccuracy curves are similar to the ones in figure 3, except that they have discontinuities because a false positive can be produced only if the last block is suitably aligned.

The results above assume that the decoder does not detect any errors, but testing shows that, on large enough files, the Pudd of a pure LZMA decoder like the one in lzip is about 2.52e-7 for the single-byte error model. More precisely, 277.24 million trial decompressions on files ranging from 1 kB to 217 MB of compressed size resulted in 70 errors undetected by the decoder (all of them detected by the CRC). This additional detection capability reduces the Pud by the same factor. (In fact the reduction of Pud is larger, because 9 of the 70 errors didn't cause error multiplication; they produced just one wrong byte in the decompressed data, which is guaranteed to be detected by the CRC.) The estimated Pud for lzip, based on these data, is about 2.52e-7 * 2.33e-10 = 5.88e-17.
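The arithmetic behind that estimate can be written out explicitly, using only the figures given above:

```python
# Reproducing the Pud estimate for lzip from the figures in the text.
trials = 277.24e6                      # trial decompressions
undetected_by_decoder = 70             # errors the LZMA decoder missed
pudd = undetected_by_decoder / trials  # probability undetected by decoder
pudc = 2.0 ** -32                      # Pudc of a CRC32, ~2.33e-10
pud = pudd * pudc                      # combined probability, ~5.88e-17
print(pudd, pud)
```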

For the zeroed-block error model, the additional detection capability of a pure LZMA decoder is probably much larger. An LZMA stream is a check sequence in itself, and large errors seem less likely than small ones to escape detection. In fact, the lzip decoder detected the error in all of the 2 million trial decompressions run with a zeroed block. The xz decoder can't achieve such performance because LZMA2 includes uncompressed packets, in which the decoder can't detect any errors.

There is a good reason why bzip2, gzip, lzip and most other compressed formats use a 32-bit check sequence: it provides optimal detection of errors. Larger check sequences may (or may not) reduce the number of false negatives, at the cost of always increasing the number of false positives. But significantly reducing the number of false negatives may be impossible if their number is already insignificant, as is the case in bzip2, gzip and lzip files. On the other hand, the number of false positives increases linearly with the size of the check sequence: CRC64 doubles the number of false positives of CRC32, and SHA-256 produces 8 times more false positives than CRC32, decreasing the accuracy of the error detection instead of increasing it.

Increasing the probability of a false positive for corruption in the long-term storage of valuable data is a bad idea. This is why the lzip format, designed for long-term archiving, provides 3-factor integrity checking and the decompressor reports mismatches in each factor separately. This way, if just one factor fails while the other two match the data, it probably means that the data are intact and the corruption affects only the mismatching check sequence. GNU gzip also reports mismatches in its 2 factors separately, but does not report the exact values, making it more difficult to tell real corruption from a false positive. Bzip2 reports its 2 levels of CRCs separately, allowing the detection of some false positives.
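The idea of separate factor reporting can be sketched as follows. This is illustrative Python only; the factor names match lzip's trailer fields (CRC32, data size, member size), but the layout and reporting here are simplified, not lzip's actual behavior:

```python
import zlib

def report(data, stored_crc, stored_data_size, member_size, stored_member_size):
    """Check each integrity factor separately and report mismatches."""
    factors = {
        "CRC32": zlib.crc32(data) == stored_crc,
        "data size": len(data) == stored_data_size,
        "member size": member_size == stored_member_size,
    }
    mismatches = [name for name, ok in factors.items() if not ok]
    if not mismatches:
        return "integrity verified"
    if len(mismatches) == 1:
        # One factor failing while the others match the data suggests the
        # corruption hit the stored check itself, not the data.
        return "possible false positive: only %s mismatches" % mismatches[0]
    return "data corrupt: " + ", ".join(mismatches) + " mismatch"

data = b"decompressed data"
crc = zlib.crc32(data)
print(report(data, crc, len(data), 100, 100))      # integrity verified
print(report(data, crc ^ 1, len(data), 100, 100))  # possible false positive
```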

Being able to produce files without a check sequence for the decompressed data may help xz rank higher in decompression benchmarks, but it is a very bad idea for long-term archiving. The whole idea of supporting several check types is wrong; it fragments the format and introduces a point of failure in the xz stream: if corruption affects the stream flags, xz won't be able to verify the integrity of the data because the type and size of the check sequence are lost.

2.11 Xz's error detection is misguided

Xz tries to detect errors in parts of the compressed file that do not affect decompression (for example in padding bytes), ignoring the fact that nobody is interested in the integrity of the compressed file; it is the integrity of the decompressed data that matters. Note that the xz format specification sets stricter requirements for the integrity of the padding than for the integrity of the payload: it does not guarantee that the integrity of the decompressed data will be verified, but it mandates that decompression be aborted as soon as a damaged padding byte is found. (See sections 2.2, 3.1.6, 3.3 and 4.4 of [Xz format]). Xz even goes so far as to "protect" padding bytes with a CRC32. This behavior just causes unnecessary data loss.

Checking the integrity of the decompressed data is important because it not only guards against corruption in the compressed file, but also against memory errors, undetected bugs in the decompressor, etc.

The only reason to be concerned about the integrity of the compressed file itself is to be sure that it has not been modified or replaced with another file. But no amount of strictness in the decompressor can guarantee that a file has not been modified or replaced; some other means must be used for this purpose, for example an external cryptographically secure hash of the file.

2.12 Xz does not provide any data recovery means

File corruption is an unlikely event. Being unable to restore a file because the backup copy is also damaged is even less likely. But unlikely events happen constantly to somebody somewhere. This is why tools like ddrescue, bzip2recover and lziprecover exist in the first place. Lziprecover defines itself as "a last line of defense for the case where the backups are also damaged".

The safer a format is, the easier it is to develop a capable recovery tool for it. Neither xz nor gzip provides any recovery tool. Bzip2 provides bzip2recover, which can help to manually assemble a correct file from the undamaged blocks of two or more copies. Lzip provides lziprecover, which can produce a correct file by merging the good parts of two or more damaged copies, and can additionally repair slightly damaged files without the need of a backup copy.
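The merging approach can be illustrated with a toy sketch. This is hypothetical Python at single-byte granularity, merely to show why two independently damaged copies plus a check sequence often suffice; the real lziprecover works on whole differing areas of actual lzip members, not loose byte strings:

```python
import itertools
import zlib

def merge(copy_a, copy_b, good_crc):
    """Try every combination of the differing bytes of two equally sized
    damaged copies; return the combination whose CRC32 matches."""
    diffs = [i for i, (x, y) in enumerate(zip(copy_a, copy_b)) if x != y]
    for choice in itertools.product((0, 1), repeat=len(diffs)):
        candidate = bytearray(copy_a)
        for pos, pick in zip(diffs, choice):
            if pick:
                candidate[pos] = copy_b[pos]
        if zlib.crc32(bytes(candidate)) == good_crc:
            return bytes(candidate)
    return None  # both copies damaged at the same position; merge fails

original = b"some valuable archived data"
good_crc = zlib.crc32(original)
a = bytearray(original); a[3] ^= 0xFF   # damage copy A at one position
b = bytearray(original); b[10] ^= 0xFF  # damage copy B at another position
assert merge(bytes(a), bytes(b), good_crc) == original
```

The sketch also shows why format safety matters for recovery: the merge only works because the check sequence survives and the differing areas can be located, which a fragile container makes harder.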

3 Then, why do some free software projects use xz?

Most free software projects using xz (for example, to compress their source tarballs) probably chose it because LZMA compresses better than gzip and bzip2, and because the developers of those projects didn't know about the defects of xz, or about lzip, when they adopted xz. Evaluating formats is difficult, and most free software projects are not concerned about long-term archiving, or even about format quality; therefore they tend to use the most advertised formats. Both lzma-alone and xz have gained some popularity in spite of their defects, mainly because they are associated with popular projects like GNU Coreutils or the 7-zip archiver. (As far as I know, the main cause of the popularity of xz among GNU/Linux distributions was the very early adoption of lzma-alone and xz by GNU Coreutils, when xz was still in alpha status.)

This of course is sad because we software developers are among the few people who are able to understand the strengths and weaknesses of formats. We have a moral duty to choose wisely the formats we use because everybody else will blindly use whatever formats we choose.

4 Conclusions

There are several reasons why the xz compressed data format should not be used for long-term archiving, especially of valuable data. To begin with, xz is a complex container format that is not even fully documented. Using a complex format for long-term archiving would be a bad idea even if the format were well designed, which xz is not. In general, the more complex the format, the less probable it is that it can be decoded in the future by a digital archaeologist. For long-term archiving, simple is robust.

Xz is fragmented by design. Xz implementations may choose what subset of the format they support; they may even choose not to support integrity checking at all. Safe interoperability among xz implementations is not guaranteed, which makes the use of xz inadvisable not only for long-term archiving, but also for data sharing and for free software distribution. Xz is also unreasonably extensible: it has room for trillions of compression algorithms, but currently supports only one, LZMA2, which in spite of its name is not an improved version of LZMA but an unsafe container for LZMA data. Such an egregious level of extensibility makes corruption both more probable and more difficult to recover from. Additionally, the xz format lacks a version number field, which makes its extensibility problematic.

Xz fails to protect critical fields like length fields and the flags signalling the presence of optional fields. Xz uses variable-length integers unsafely, especially when they are used to store the size of other fields or when they are concatenated together. These defects make xz fragile, meaning that most of the time when it reports a false positive, the decoder state is so mangled that the decoder is unable to recover the decompressed data.

Error detection in the xz format is less accurate than in the bzip2, gzip and lzip formats, mainly because of false positives, and especially if an overkill check sequence like SHA-256 is used. Another cause of false positives is that xz tries to detect errors in parts of the compressed file that do not affect decompression, like the padding added to keep the useless 4-byte alignment. In total, xz reports several times more false positives than bzip2, gzip or lzip, and every false positive may result in unnecessary loss of data.

All these defects and design errors reduce the value of xz as a general-purpose format because anybody wanting to archive a file already compressed in xz format will have to either leave it as-is and face a larger risk of losing the data, or waste time recompressing the data into a format more suitable for long-term archiving.

The weird combination of unprotected critical fields, overkill check sequences, and padding bytes "protected" by a CRC32 can only be explained by the inexperience of the designers of xz. It is said that given enough eyeballs, all bugs are shallow. But the adoption of xz by several GNU/Linux distributions shows that if those eyeballs lack the required experience, it may take too long for them to find the bugs. It would be an improvement for data safety if compressed data formats intended for broad use were designed by experts and peer reviewed before publication. This would help to avoid design errors like those of xz, which are very difficult to fix once a format is in use.

5 References

6 Glossary

^: Exponentiation symbol.

Accuracy: Freedom from mistakes. It is defined as 1 - inaccuracy.

Bit flip: Error that inverts a single bit value.

Burst Error: A set of bit errors contained within a span shorter than the length of the codeword. The burst length is the distance between the first and last bit errors. Zero or more of the intervening bits may be erroneous.

Check sequence: An error-detecting code value stored with associated data. A dataword plus a check sequence makes up a codeword.

Codeword: A dataword combined with a check sequence.

Cyclic Redundancy Check (CRC): A type of check sequence.

Dataword: The data being protected by a check sequence. A dataword can be any number of bits and is independent of the machine word size (e.g., a dataword could be 64 Megabytes).

False negative: Undetected error producing incorrect decompressed data.

False positive: Inability or refusal to decompress the data in a damaged compressed file even if the compressed data proper remain decompressible. (i.e., either the compressed data have not suffered any damage or the damage does not hinder decompression).

Fragility: Inability to recover the decompressed data (for example decompressing to standard output) in case of a false positive.

Inaccuracy: Ratio of error detection mistakes. It is defined as ( false_negatives + false_positives ) / total_cases.

Overhead: Everything in the file except the compressed data proper. (Headers, check sequences, padding, etc).

Overkill: An unnecessary excess of whatever is needed to achieve a goal.

Pud: Probability of an undetected error (false negative). Pud is equal to Pudd * Pudc.

Pudc: Probability of an error undetected by the check sequence.

Pudd: Probability of an error undetected by the decoder.

Unsafe: 1 Likely to cause severe data loss even in case of the smallest corruption (a single bit flip). 2 Likely to produce false negatives.

Copyright © 2016-2019 Antonio Diaz Diaz.

You are free to copy and distribute this article without limitation, but you are not allowed to modify it.

First published: 2016-06-11

Updated: 2019-06-13