Incremental Journaling Backup Utility and Archiver

zpaq is a free and open source incremental, journaling command-line archiver for Windows, Linux, and Mac OS/X. Incremental means that when you back up your hard drive, for example:

zpaq add e:\backup.zpaq c:\*

only files whose last-modified date has changed since the previous backup are added. Journaling means that the archive keeps a history of every backup, so you can restore files as they existed at an earlier date, for example:

zpaq extract e:\backup.zpaq c:\Users\Bob -to tmp -until 2013-10-30

zpaq is faster and compresses better than most other popular archivers and backup programs, especially for realistic backups that have a lot of duplicate files and a lot of already compressed files.

Archive size vs. time to compress and extract 10 GB (79,431 files) to an external USB hard drive at default and maximum settings on a Dell Latitude E6510 laptop (Core i7 M620, 2+2 hyperthreads, 2.66 GHz, 4 GB, Ubuntu Linux, Wine 1.6). Data from 10 GB Benchmark (system 4).

Feature comparison

Feature support for zpaq, pcompress, exdupe, freearc, obnam, rar, 7zip, and zip (one letter per supporting program):

Windows        W W W W W W
Linux          L L L L L L L L
Update         U U U U U U U
Incremental    I I I I I
Rollback       R R
Dedupe         D D D D
Encryption     E E E E E E
GUI            G G G G
Free           F F F F F F F
Open source    O O O O O O O
Specification  S S

Download

zpaq.exe for Windows.

The latest version is zpaq v7.15, released Aug. 17, 2016. The download contains source code (zpaq.cpp, libzpaq.cpp, libzpaq.h), Windows executables (32 and 64 bit, XP or later), documentation (zpaq.pod), and a Makefile for compiling on Linux, BSD, or Mac OS/X. You may need unzip.exe to unzip from the Windows command line.



zpaq man page (HTML, latest version).

The ZPAQ archive format is described by a specification and reference decoder. A test case exercising all of the specification features should decompress to the Calgary corpus. The compression algorithm is described here.

The source code includes the libzpaq API providing compression and decompression services for applications in C++. Developers may be interested in the zpaqd development tool and sample configuration files found on the utilities page.

zpaq is written by Matt Mahoney and released to the public domain. It includes code from libdivsufsort 2.0, (C) 2003-2008 Yuta Mori (MIT license), public domain AES code from libtomcrypt by Tom St Denis, and public domain Salsa20 code by D. J. Bernstein.

Features

A zpaq archive can contain at most 4 billion files and at most 250 terabytes of data after deduplication and before compression.

zpaq is for user-level backups. Do not use it to back up the operating system or any software that requires a password to install. zpaq saves regular files and directories, last-modified dates (to the nearest second), and (optionally) Windows attributes or Linux permissions. It does not follow or save symbolic links or junctions. It unknowingly follows hard links. It does not save owner or group IDs, ACLs, extended attributes, the registry, or special file types like devices, sockets, or named pipes.

Open standard specification

Backward and forward compatibility

All versions of zpaq can read archives produced by older versions back to version 1.00 (March 2009). To some extent, older versions can read archives produced by newer versions (forward compatibility) provided they don't use any unsupported features. These are as follows:

v1.00 (Mar. 2009). Level 1 format. Streaming archives with at least one context model. Does not support deduplication or rollback.

v5.00 (Aug. 2012). Level 2 format. Adds support for compression with pre/post processing with no context modeling (e.g. uncompressed or LZ77).

v6.00 (Sept. 2012). Journaling format (dedupe and rollback).

v6.44 (Jan. 2014). Encrypted archives.

v6.47 (Jan. 2014). Multi-part archives. Older versions can read them if concatenated.

Many intermediate versions include compression improvements. This does not break forward compatibility because the decompression code is stored in the archive. The code is written in a sandboxed, virtual machine language called ZPAQL. On x86-32 and x86-64 processors, the ZPAQL code is translated to machine code and executed, so it is as fast as compression algorithms written in compiled languages like C or C++. On other hardware, the ZPAQL code is interpreted, which takes about twice as long.

For example, the following will create a streaming archive using BWT compression that can be extracted by all versions back to v1.00, even though most of these versions could not compress using BWT.

zpaq add archive.zpaq files -method s4.3ci1

Rollback

An archive is updated only by appending changes to it. You can roll back the archive to an earlier state by using the -until option to specify the date and time or version number where to stop reading.

When updating, -until will truncate the archive at that point before appending. So if you backed up some files you didn't mean to, then you can truncate the last update and repeat:

zpaq add backup c:\ -not c:\tmp -until -1

Transacted updates

Updates are committed by first appending a temporary header and then updating it when all of the compressed data and index changes are appended. If you interrupt zpaq (by typing Ctrl-C), then the partially appended data will be ignored and overwritten on the next update.
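The same append-then-commit idea can be sketched as follows (a hypothetical layout in Python, not zpaq's actual on-disk format): a provisional header is appended first, the data follows, and only then is the header flagged as committed, so a reader treats anything after the last committed header as garbage to be overwritten.

```python
import os
import struct

COMMITTED, PENDING = 1, 0

def append_update(path, payload: bytes):
    """Append `payload` transactionally: the header is marked committed last."""
    header_pos = os.path.getsize(path) if os.path.exists(path) else 0
    with open(path, "ab") as f:
        f.write(struct.pack("<BQ", PENDING, len(payload)))  # provisional header
        f.write(payload)
        f.flush()
    with open(path, "r+b") as f:        # commit: flip the flag in place
        f.seek(header_pos)
        f.write(struct.pack("<B", COMMITTED))

def read_updates(path):
    """Return committed payloads, stopping at the first pending or partial one."""
    updates = []
    with open(path, "rb") as f:
        while True:
            header = f.read(9)
            if len(header) < 9:
                break
            flag, size = struct.unpack("<BQ", header)
            payload = f.read(size)
            if flag != COMMITTED or len(payload) < size:
                break                   # interrupted update: ignore it
            updates.append(payload)
    return updates
```

An interrupted update leaves a pending header behind, which readers skip and the next append_update overwrites.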

Deduplication

When adding files, zpaq uses a rolling hash function to split them into fragments with an average size of 64 KB along content-dependent boundaries. It then computes the SHA-1 hash of each fragment and compares it with the saved hashes from the current and previous versions. If it finds a match, the fragment is not stored again.
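Content-dependent splitting and hash lookup can be illustrated with a minimal sketch (in Python, not zpaq's C++ code; the rolling-hash constants and cut rule here are arbitrary stand-ins, not zpaq's):

```python
import hashlib

def split_fragments(data: bytes, min_size=4096, avg_bits=16, max_size=1 << 20):
    """Split data at content-dependent boundaries chosen by a rolling hash.

    A boundary is declared when the low `avg_bits` bits of a toy hash are
    zero, giving an average fragment size of about 2**avg_bits bytes
    (65536 = 64 KB matches zpaq's average). Constants are illustrative.
    """
    fragments = []
    h, start = 0, 0
    for i, b in enumerate(data):
        h = (h * 31 + b) & 0xFFFFFFFF                  # toy rolling hash
        size = i + 1 - start
        if (size >= min_size and (h & ((1 << avg_bits) - 1)) == 0) or size >= max_size:
            fragments.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        fragments.append(data[start:])
    return fragments

def dedupe(fragments, seen=None):
    """Keep only fragments whose SHA-1 has not been seen before."""
    seen = set() if seen is None else seen
    new = []
    for frag in fragments:
        digest = hashlib.sha1(frag).digest()
        if digest not in seen:
            seen.add(digest)
            new.append(frag)
    return new
```

Because boundaries depend on content rather than position, inserting bytes near the start of a file shifts only nearby fragments; most later fragments keep their old hashes and deduplicate against the previous version.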

Deduplication requires 1 MB of memory per GB of deduplicated but uncompressed archive data to update, and 0.5 MB per GB to list or extract.

Incremental update and restore

Files are added only if their last-modified date has changed since the last update. You can use the -force option to override, but in that case the file is deduplicated and not saved unless its contents have really changed. This is slower than comparing dates but faster than compressing the file again.

Extraction will not clobber existing files unless you give the -force option to allow overwriting. In that case, the file to be overwritten is compared with the stored hashes and not decompressed unless its size or contents differ.
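The overwrite check can be sketched like this (a hypothetical helper, not zpaq's code): compare the on-disk file against the stored size and SHA-1 hash, and skip decompression on a match, since reading is cheaper than decompressing.

```python
import hashlib
import os

def needs_restore(path, stored_size, stored_sha1):
    """Return True if `path` must be rewritten from the archive."""
    if not os.path.exists(path):
        return True
    if os.path.getsize(path) != stored_size:     # cheap size check first
        return True
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.digest() != stored_sha1             # identical content: skip restore
```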

Remote archive support

zpaq updates an archive by appending changes to it. To support remote backups without having to move huge files, zpaq can put the appended changes into a separate, numbered file that you would copy or move to remote storage. You can concatenate the parts to form a complete archive, or simply read them all at once by specifying a pattern in the archive name like "part???.zpaq". zpaq will then search for part001.zpaq, part002.zpaq, etc. and regard the concatenated sequence as a single archive.
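The pattern expansion and concatenation can be sketched as follows (illustrative Python, not zpaq's implementation; the part sizes and names are examples):

```python
import os
import re

def expand_parts(pattern):
    """Replace a run of '?' with 001, 002, ... and collect existing parts.

    E.g. "part???.zpaq" -> ["part001.zpaq", "part002.zpaq", ...], stopping
    at the first missing number, so the sequence must be contiguous.
    """
    m = re.search(r"\?+", pattern)
    if not m:
        return [pattern]
    width = m.end() - m.start()
    parts, n = [], 1
    while True:
        name = pattern[:m.start()] + str(n).zfill(width) + pattern[m.end():]
        if not os.path.exists(name):
            break
        parts.append(name)
        n += 1
    return parts

def read_concatenated(pattern):
    """Read all parts in order as if they were one archive file."""
    data = b""
    for name in expand_parts(pattern):
        with open(name, "rb") as f:
            data += f.read()
    return data
```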

To make incremental backups with a local copy:

zpaq add "arc???" files      (copy arc001.zpaq)
zpaq add "arc???" files      (copy arc002.zpaq)
zpaq list "arc???"           (show contents)
zpaq extract "arc???"        (restore)

To make incremental backups with a local index instead of a local copy:

zpaq add "arc???" files -index arc000.zpaq   (move arc001.zpaq)
zpaq add "arc???" files -index arc000.zpaq   (move arc002.zpaq)
zpaq list arc000                             (show arc???.zpaq contents)

Encryption

Archives can be encrypted using AES-256 in CTR mode. A password must be given every time an encrypted archive is used. Keys are strengthened with Scrypt(N=16384, r=8, p=1) (requiring 208M operations and 16 MB memory) to slow down brute force search for weak keys. Encrypted archives are prefixed with a 32 byte random salt, which also provides an 8 byte IV for the first half of the 16 byte AES counter. If a remote archive has a local index, then both are encrypted with the same key but different salts to generate independent keystreams. Encryption provides privacy but not authentication against tampering.
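The key-strengthening step can be reproduced with Python's standard hashlib.scrypt (a sketch: the password and salt below are examples; zpaq derives its key from the password and the archive's 32-byte random salt):

```python
import hashlib

def derive_key(password: bytes, salt: bytes) -> bytes:
    """Strengthen a password with Scrypt(N=16384, r=8, p=1), as zpaq does.

    These parameters need a 128*r*N = 16 MB working set, which is what
    makes large-scale brute-force search of weak passwords expensive.
    """
    return hashlib.scrypt(password, salt=salt,
                          n=16384, r=8, p=1,
                          maxmem=64 * 1024 * 1024,  # allow the 16 MB working set
                          dklen=32)                 # 256-bit AES key
```

The derived 256-bit key is then used for AES-256 in CTR mode; different salts yield independent keys and therefore independent keystreams.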

All of the encryption code (AES, Scrypt, SHA-1, SHA-256) is public domain and tested against published test vectors. The AES code is derived from libtomcrypt 1.17.

Multithreaded compression

zpaq has 5 compression levels. The default, -method 1, is the fastest. It is best for backups where you compress often and extract rarely. -method 2 compresses slower but decompresses as fast as -method 1. It is best for distributing files where you compress once and extract often. Methods 3, 4, and 5 are slower with better compression.

Fragments not removed by deduplication are packed into blocks for compression. Files are sorted by filename extension and then by decreasing size in order to group similar files together. The block size is 16 MB for method 1 and 64 MB for higher methods. You can change the block size to trade compression for memory usage.
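The grouping rule is simple to state in code (an illustrative sketch; the file names and sizes are made up):

```python
import os

def pack_order(files):
    """Sort (name, size) pairs by filename extension, then by decreasing
    size, so similar files end up adjacent in the same compression block."""
    return sorted(files, key=lambda f: (os.path.splitext(f[0])[1].lower(), -f[1]))
```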

Blocks are compressed or decompressed in parallel in separate threads. zpaq automatically detects the number of processor cores and uses all of them in the 64 bit version or at most 2 in the 32 bit version (which is limited to 2 GB memory). You can use the -threads option to change the number of threads. Resident memory per thread required to compress or decompress is approximately as follows. Virtual memory usage may be higher.

Method  Compress  Decompress  Algorithm
------  --------  ----------  ---------
1       128 MB     32 MB      LZ77
2       450 MB    128 MB      LZ77
3       450 MB    400 MB      LZ77+CM or BWT
4       550 MB    550 MB      LZ77+CM, BWT or CM
5       850 MB    850 MB      CM

Method 1 uses LZ77, compressing by replacing duplicate strings with pointers to previous occurrences. Method 2 is the same but spends more time looking for better matches (using a suffix array instead of a hash table). Method 3, depending on the file type, uses either BWT (context sorting) or LZ77 for long matches with an order 1 context model and arithmetic coding for literals. Method 4 uses either LZ77, BWT, or a high order context model. Method 5 uses a complex, high order context mixing model with over 20 bit prediction components.

All methods except 5 test whether the data appears to be compressible or already compressed (random). Incompressible data is simply stored.

An E8E9 filter is applied if x86 data (normally found in .exe and .dll files) is detected. The filter replaces x86 CALL and JMP relative addresses with absolute addresses to make the data more compressible.
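A simplified version of the transform (an illustrative sketch; real E8E9 filters add details such as address-range checks, which are omitted here):

```python
import struct

def e8e9_forward(data: bytes) -> bytes:
    """After each 0xE8 (CALL) or 0xE9 (JMP) opcode, rewrite the 4-byte
    little-endian relative displacement as an absolute target address, so
    that repeated calls to one target compress as identical byte strings."""
    out = bytearray(data)
    i = 0
    while i + 5 <= len(out):
        if out[i] in (0xE8, 0xE9):
            rel = struct.unpack_from("<i", out, i + 1)[0]
            absolute = (rel + i + 5) & 0xFFFFFFFF   # target = next-insn offset + rel
            struct.pack_into("<I", out, i + 1, absolute)
            i += 5
        else:
            i += 1
    return bytes(out)

def e8e9_inverse(data: bytes) -> bytes:
    """Undo the transform: absolute target back to relative displacement."""
    out = bytearray(data)
    i = 0
    while i + 5 <= len(out):
        if out[i] in (0xE8, 0xE9):
            absolute = struct.unpack_from("<I", out, i + 1)[0]
            rel = (absolute - (i + 5)) & 0xFFFFFFFF
            struct.pack_into("<I", out, i + 1, rel)
            i += 5
        else:
            i += 1
    return bytes(out)
```

The opcode bytes themselves are unchanged, so the inverse scan visits the same positions and the transform is exactly reversible.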

Data analysis

zpaq has list options to make it easier to examine the contents of archives containing millions of files. For example, the following compares external dir1 to internal dir2 and lists only the differences. Files are compared quickly by size and last-modified date, or thoroughly by reading each file, computing its SHA-1 fragment hashes, and comparing them with the hashes stored in the archive.

zpaq list backup dir1 -to dir2 -not =          (compare dates)
zpaq list backup dir1 -to dir2 -not = -force   (compare contents)

-only *.exe    List only files ending with .exe.
-not *.exe     Don't list files matching the pattern.
-summary 20    List the 20 largest files and identify duplicates.
-all           Show all file versions.
-until 20      List contents as of the 20th update.

Error detection and recovery

zpaq archives are designed to minimize data loss if damaged. An archive is divided into blocks that can be decompressed independently. Each block begins with a 13 byte tag that can be found by scanning if the previous block is damaged. Each block ends with the SHA-1 hash of the uncompressed data, which is verified to detect errors. Blocks with hash mismatches or other errors are ignored with a warning without killing zpaq.
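Resynchronization by scanning can be sketched as follows (the 13-byte tag value below is a made-up placeholder, not ZPAQ's real tag):

```python
TAG = b"\xfcZPAQ-BLOCK13"  # 13-byte placeholder, NOT the real ZPAQ tag value
assert len(TAG) == 13

def find_blocks(stream: bytes):
    """Return the offset of every tag occurrence. After a damaged block,
    a decoder can resynchronize at the next offset in this list."""
    offsets = []
    pos = stream.find(TAG)
    while pos != -1:
        offsets.append(pos)
        pos = stream.find(TAG, pos + 1)
    return offsets
```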

Each update contains 4 types of blocks.

C - Update header: date, size of compressed data.

D - Compressed data fragments, list of fragment sizes.

H - List of fragment hashes and sizes, one per D block.

I - Index updates: list of files updated or deleted. Each update includes the date, attributes, and list of fragments.

C blocks are used to skip over D blocks to read the index quickly. They are not needed to extract. If a D or H block is lost then so are any files that point to it. If an I block is lost, then so are any files in it. I blocks are small (16 KB) to minimize damage.

When extracting files, the D block is decompressed up to the last used fragment and those fragments are hashed and compared to the stored hashes in the H block.

Extracting with the -test and -all options will decompress internally and verify all of the fragment hashes without writing the output files.

Public Domain API

The source download includes libzpaq, a public domain application programming interface (API) in C++ that provides streaming compression and decompression services to and from files, strings, or arrays using built-in and custom compression algorithms. To use the code, you include libzpaq.h in your program and link to libzpaq.cpp. The API documentation is in libzpaq.h. The precise semantics are described in the ZPAQ specification.

In the simplest case, the application provides an error handling function and derived implementations of two abstract classes, Reader and Writer, specifying the input and output byte streams. For example, to compress from stdin to stdout (assuming binary I/O as in Linux):

#include "libzpaq.h"
#include <stdio.h>
#include <stdlib.h>

void libzpaq::error(const char* msg) {  // print message and exit
  fprintf(stderr, "Oops: %s\n", msg);
  exit(1);
}

class In: public libzpaq::Reader {
public:
  int get() {return getchar();}  // returns byte 0..255 or -1 at EOF
} in;

class Out: public libzpaq::Writer {
public:
  void put(int c) {putchar(c);}  // writes 1 byte 0..255
} out;

int main() {
  libzpaq::compress(&in, &out, "1");  // -method 1
}

To decompress:

libzpaq::decompress(&in, &out);

There are also functions for reading and writing block and segment headers and for passing specialized methods or ZPAQL code to the compressor, as documented in libzpaq.h. The ZPAQ utilities page contains sample compression algorithms written in ZPAQL and a tool zpaqd for running, testing, and debugging ZPAQL.

Contact

zpaq was written by Matt Mahoney, mattmahoneyfl (at) gmail (dot) com