This deduplication thing was built into Restic right from the beginning, because at first I thought “Oh, I’d like to have a backup program”, and then I started thinking about it and discussing with my colleagues what a backup program should do, and one of the things you have in a backup program is duplicate data. Either you have the same file at different times - sometimes files haven’t been modified, so you have exactly the same data and that’s really easy to handle - but sometimes you have virtual machine images of like 100 gigabytes and you’ve just changed one or two bytes within the whole image, and it would be a real shame to store those 100 gigabytes twice, because most of the data is exactly the same.

So I started looking into algorithms that try to detect changes or similarities in data, and one of the ideas that went into this comes from this really old tool called rsync. There’s a PhD thesis behind it by Andrew - Andrew Tridgell, I believe.

rsync does really interesting things. For example, when you try to copy a file to a remote server, on the local side there is a process that opens the file and starts reading and sending it to the remote side. If this is cancelled for some reason - for example, you cancel the program locally or your internet connection breaks down - and you restart the process afterwards, rsync will detect that part of the file is already on the remote side; it will open it and find where it left off on the previous run.

This is the easy part, but what happens when the file on the remote side was modified and you would like to make it pristine again, copying the original file over once more? You could just delete the remote file and start transferring from scratch, but that’s not very efficient. Instead, rsync cuts the file into blobs and detects which blobs need to be transferred. Say only one byte changed because of a hardware error on the HDD or something like that - then you only need to detect which of the blobs changed, for example the first blob, the first 1,000 bytes or so. rsync will detect that, transfer just that small amount of data, and reconstruct the file on the remote side.
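To make that concrete, here’s a small Go sketch of the idea - not rsync’s actual protocol or checksums, just fixed-size blocks compared by a strong hash; the block size, the generated file contents and the single flipped byte are all made up for the example:

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// blockHashes computes a strong hash for every fixed-size block of data.
func blockHashes(data []byte, blockSize int) [][32]byte {
	var hashes [][32]byte
	for i := 0; i < len(data); i += blockSize {
		end := i + blockSize
		if end > len(data) {
			end = len(data)
		}
		hashes = append(hashes, sha256.Sum256(data[i:end]))
	}
	return hashes
}

func main() {
	const blockSize = 1000

	// The pristine local file and a remote copy in which a single byte
	// was flipped, e.g. by a hardware error on the remote disk.
	local := bytes.Repeat([]byte("0123456789"), 10000) // 100 KB
	remote := append([]byte(nil), local...)
	remote[42] ^= 0xFF

	localHashes := blockHashes(local, blockSize)
	remoteHashes := blockHashes(remote, blockSize)

	// Only the blocks whose hashes differ would have to be sent again.
	for i := range localHashes {
		if localHashes[i] != remoteHashes[i] {
			fmt.Printf("block %d (bytes %d-%d) needs to be re-sent\n",
				i, i*blockSize, (i+1)*blockSize-1)
		}
	}
}
```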

The algorithm it uses is called a rolling hash. It starts by reading the file and taking every 32-byte window of the file, and for each of these windows it computes a hash. When this hash has a certain property - for example, the lower bits are all zero - then it says “Oh, I’ve found a block boundary!”, and afterwards it uses a real hash function, a cryptographic hash function - I think it uses MD5, or something like that - to detect whether the content of the blob has changed.
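Here’s a minimal Go sketch of that kind of boundary detection. The 32-byte window matches what’s described above, but the polynomial hash, the bit mask and the data are invented for the illustration, and the hash is recomputed per window instead of being rolled incrementally - restic’s real chunker uses a Rabin fingerprint, and rsync’s rolling checksum differs in its details:

```go
package main

import (
	"fmt"
	"math/rand"
)

// chunk splits data into variable-sized blobs, declaring a boundary
// wherever the hash over the trailing 32-byte window has its low bits
// set to zero. The window hash is recomputed from scratch here for
// clarity; a real rolling hash is updated in O(1) as the window slides.
func chunk(data []byte, mask uint32) [][]byte {
	const window = 32
	var blobs [][]byte
	start := 0
	for i := window; i <= len(data); i++ {
		var h uint32
		for _, b := range data[i-window : i] {
			h = h*31 + uint32(b) // toy polynomial hash, illustrative only
		}
		if h&mask == 0 { // low bits are zero -> blob boundary
			blobs = append(blobs, data[start:i])
			start = i
		}
	}
	if start < len(data) {
		blobs = append(blobs, data[start:]) // whatever is left at the end
	}
	return blobs
}

func main() {
	data := make([]byte, 1<<20)
	rand.New(rand.NewSource(1)).Read(data) // 1 MiB of deterministic pseudo-content

	blobs := chunk(data, 0x3FFF) // 14 mask bits -> roughly 16 KiB average blobs
	fmt.Println("number of blobs:", len(blobs))
	for i, b := range blobs {
		if i == 3 {
			break
		}
		fmt.Println("blob size:", len(b))
	}
}
```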

So you could also have cut the file into fixed 1-kilobyte pieces, but then the problem is you cannot detect when one byte has been inserted at the beginning, because all your blob boundaries shift and are wrong. With these dynamically-sized blobs you can detect “Oh, a byte has been inserted; the first blob is different, but all the other blob boundaries towards the end of the file are exactly the same again.” Dynamically slicing the file into blobs like this is really efficient, and this is what Restic does.
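Reusing the chunk function from the sketch above (and adding crypto/sha256 to its imports), a drop-in replacement for its main can show this: insert one byte at the front and count how many blob hashes survive. With content-defined boundaries, everything except the first blob should come out unchanged, whereas fixed 1-kilobyte slices would all shift:

```go
// Drop-in replacement for the previous main; chunk is reused from above.
func main() {
	data := make([]byte, 1<<20)
	rand.New(rand.NewSource(1)).Read(data)

	// Insert a single byte at the very beginning of the "file".
	modified := append([]byte{'x'}, data...)

	hashSet := func(blobs [][]byte) map[[32]byte]bool {
		set := map[[32]byte]bool{}
		for _, b := range blobs {
			set[sha256.Sum256(b)] = true
		}
		return set
	}

	before := hashSet(chunk(data, 0x3FFF))
	after := hashSet(chunk(modified, 0x3FFF))

	common := 0
	for h := range after {
		if before[h] {
			common++
		}
	}
	fmt.Printf("%d of %d blobs unchanged after inserting one byte\n", common, len(after))
	// With fixed 1 KiB slices, every block after the insertion point would
	// have a different hash instead.
}
```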

The problem with the algorithm that rsync uses is that it is targeted at really small blobs, for example 100 bytes or 5,000 bytes, and in a backup program we don’t deal so much with duplication within a single file transfer as with duplication across many files and many snapshots. We have files that are not exactly the same, but mostly the same. So it’s a good idea to have larger blobs: when I take a snapshot of my directory now and another one two days from now, some files will have been modified, but most of the data will probably be exactly the same, so it makes sense to reduce the number of blobs you need to handle and have larger blobs.
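And once files are cut into such blobs, deduplication falls out of storing each blob under its content hash. Here’s a toy content-addressable store in Go - the types and the tiny example blobs are invented, and this is not restic’s actual repository format (which also encrypts and packs the data), but it shows why a blob shared between two snapshots is stored only once:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// ID is the content hash of a blob: identical content always yields the
// same ID, so each blob ends up in the store at most once.
type ID [32]byte

// Repo is a toy content-addressable blob store.
type Repo struct {
	blobs map[ID][]byte
}

func NewRepo() *Repo { return &Repo{blobs: map[ID][]byte{}} }

// Save stores a blob under its content hash and reports whether it was new.
func (r *Repo) Save(blob []byte) (ID, bool) {
	id := ID(sha256.Sum256(blob))
	if _, ok := r.blobs[id]; ok {
		return id, false // already stored: deduplicated
	}
	r.blobs[id] = append([]byte(nil), blob...)
	return id, true
}

// Snapshot is simply the ordered list of blob IDs a file was cut into.
type Snapshot []ID

func main() {
	repo := NewRepo()

	// Pretend these are the blobs of the same file on two different days;
	// only the middle blob changed between the two snapshots.
	day1 := [][]byte{[]byte("blob A"), []byte("blob B"), []byte("blob C")}
	day2 := [][]byte{[]byte("blob A"), []byte("blob B'"), []byte("blob C")}

	for _, blobs := range [][][]byte{day1, day2} {
		var snap Snapshot
		stored := 0
		for _, b := range blobs {
			id, isNew := repo.Save(b)
			snap = append(snap, id)
			if isNew {
				stored++
			}
		}
		fmt.Printf("snapshot has %d blobs, %d newly stored\n", len(snap), stored)
	}
}
```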