Synchronizing UNIX files

Using cp, tar, and rsync

File synchronization is the process of adding, changing, or deleting a file in one location so that the same file is added, changed, or deleted at another location. This article covers three utilities, cp, tar, and rsync, that can aid with synchronization of UNIX files. While the cp and tar commands have limited synchronization abilities, rsync provides the full range of options; however, all three have their place.

Bare copy with the cp command

Although the cp command is not a true synchronization tool, it is probably the easiest method of copying files from one location to another. For single-file copies, cp is very efficient: $ cp source destination.

To copy an entire directory structure from one place to another, use the -r option to copy recursively: $ cp -r source destination. This type of copy transfers only the files and directories themselves; the permissions, ownership, and other metadata are not carried over to the destination. You can use the -p option to preserve the ownership, permissions, and times for each file and directory you copy: $ cp -pr source destination.
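The difference that -p makes is easy to see with timestamps. The following is a minimal sketch using throwaway directories (all the names here are illustrative): one copy is made with -r alone and one with -pr, and only the latter keeps the file's original modification time.

```shell
# Build a small sample tree (names here are only for illustration).
src=$(mktemp -d)
mkdir -p "$src/docs"
echo "hello" > "$src/docs/readme.txt"
touch -t 200901010000 "$src/docs/readme.txt"   # back-date the file

# Plain recursive copy: contents only, timestamps reset to "now".
cp -r "$src" "$src.plain"

# Recursive copy preserving mode, ownership, and timestamps.
cp -pr "$src" "$src.preserved"

# The preserved copy keeps the 2009 modification time; the plain one does not.
ls -l "$src.preserved/docs/readme.txt"
```

Running ls -l on both copies shows the preserved copy dated 2009 and the plain copy dated today.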

Using cp is the easiest and most recognized method of copying files, but cp can be inefficient and, without a remote filesystem solution such as NFS, it cannot copy directories to a remote system.

Using tar

The tar utility, short for tape archive, was originally created as an efficient method for turning a directory structure (including the files and file metadata) into a binary stream that could be written to a tape during the course of a backup.

Typically you use tar to create a .tar file that contains the directories that you want: $ tar cf mydir.tar ./mydir. The c option tells tar to create a new archive, and the f option tells it to use the next argument as the name of the archive file to be created (mydir.tar). Any remaining arguments are treated as the files or directories to include in the archive. The tar command automatically traverses a directory structure, so if you specify a directory as one of the files to be included, tar includes the directory and everything it contains.
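A minimal end-to-end sketch of the create step, using a temporary directory (the file names are only for illustration), with the t option added to list the archive's table of contents without extracting it:

```shell
# Create a small directory tree to archive (names are illustrative).
work=$(mktemp -d)
cd "$work"
mkdir -p mydir/sub
echo "one" > mydir/file1
echo "two" > mydir/sub/file2

# c = create a new archive, f = archive file name; tar recurses into
# mydir automatically, picking up mydir/sub and its contents.
tar cf mydir.tar ./mydir

# t = list the table of contents without extracting anything.
tar tf mydir.tar
```

The listing shows every file and directory that was swept into the archive by the recursive traversal.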

One of the important aspects of tar to be aware of is how it treats the pathname that you supply. If you specify the full directory location to tar, for example the /etc directory: $ tar cf etc.tar /etc. Then, with many traditional tar implementations, the absolute path is recorded in the archive, and by default extracting it: $ tar xf etc.tar. Recreates the files and directory structure within the /etc directory. This could be destructive (in that you may be overwriting files in /etc that you wanted to keep). There are two ways around this. The first is to use GNU tar, which strips the leading / from member names by default and also supports the --strip-components option to remove leading elements from the path on extraction.

A simple alternative is to change to the parent directory and then use a relative path (see Listing 1).

Listing 1. Changing the parent directory and using a relative path

$ cd /
$ tar cf etc.tar ./etc

When the archive file is extracted, the files are recreated in their relative location. You can use this trick to help with synchronizing directories. Because tar creates a byte stream of the directory structure, you can combine tar with pipes to copy files from one location to another: $ tar cf - ./etc | ( cd /backup; tar xf - ). The "-" in each case specifies that tar should use the standard output (when writing) or standard input (when reading). The parentheses execute the statements in a subshell. Before the pipe, a byte stream of the files is created on the standard output; after the pipe, the subshell changes to a different directory and then extracts the byte stream from the standard input.
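A self-contained version of the pipe copy, with temporary directories standing in for /etc and /backup (the paths and file names are hypothetical):

```shell
# Source tree and an empty destination (temporary stand-ins for /etc
# and /backup).
src=$(mktemp -d)
dst=$(mktemp -d)
mkdir -p "$src/etc/conf.d"
echo "setting=1" > "$src/etc/conf.d/app.conf"

# The writer tar streams the relative tree to stdout; the subshell
# changes directory, and the reader tar extracts the stream from stdin.
( cd "$src" && tar cf - ./etc ) | ( cd "$dst" && tar xf - )

ls "$dst/etc/conf.d"
```

Because the source path is relative (./etc), the tree lands under the destination directory rather than back at its absolute location.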

To ensure that the ownership and permissions of a file are retained, use the p option, which preserves the metadata for each file and directory when extracting: $ tar cf - ./etc | ( cd /backup; tar xfp - ).

Once you have this basic structure in place, you can perform more complex operations. For example, with GNU tar you can copy only the files that have changed since a specific time: $ tar cf - --newer 20090101 ./etc | ( cd /backup; tar xf - ). This creates a copy of only those files that have changed since January 1, 2009.

Files can also be synchronized to a remote host by combining the operation with rsh or ssh: $ tar cf - ./etc | ssh user@host tar xfp -. Using ssh and tar in this way is a good way of creating a backup of your local machine on a remote host, but there are more efficient methods of synchronizing the information.

Intelligent synchronization with rsync

The main problem with the previous alternatives for synchronizing files is that they copy every single file (and the associated directory structure). While this is fine if you are creating a new copy of the information, if you are synchronizing the information between one directory and another, then this is inefficient.

Consider a directory with 10,000 files occupying 100GB. If you changed one file of about 10MB, then with cp or tar you would have to recopy the whole 100GB all over again. In a backup situation, copying that quantity of information is excessive; you want the backup to complete as quickly and efficiently as possible. Obviously, if you knew which file had changed you could copy only that file, but you won't always know this information.

Even using the --newer option with tar is limited because you must know exactly when the last modification was performed. The rsync tool addresses this issue by comparing the directory structure and the individual files and determining where the differences between the source and destination directories are located. Once it is clear which files and directories have changed, it then copies only those items to the destination. Even further, rsync will use a similar algorithm on individual files and only copy the portions of the file that have changed.

In the simplest form, you can use rsync to synchronize from one directory to a new directory like this: $ rsync -r a b. This creates a new directory, b, containing a copy of the directory structure in directory a. The -r option tells rsync to recurse into the directories and copy the entire structure. However, if the destination directory already exists, then a new directory, a, is created within the destination directory, b, containing a copy of the files. This can have some unfortunate side effects. For example, if you are copying multiple directories to a backup directory, Listing 2 will do what you want.

Listing 2. Copying multiple directories to a backup directory

$ mkdir backup
$ rsync -r dira backup
$ rsync -r dirb backup

Listing 2 creates a directory, backup/dira, containing a copy of the original dira, and a directory, backup/dirb, containing a copy of the original dirb. The following does something different: $ rsync -r dira backup/dira. The first time you run it, the command does what you expect. But the second time, rsync creates the source directory within the specified destination directory, producing backup/dira/dira. Not only does this not create the structure you want, it also doubles up the contents (one copy of which will never be synchronized).
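The nesting behavior is easy to reproduce with throwaway directories; as a sketch of one way to avoid it (assuming standard rsync semantics), a trailing slash on the source tells rsync to copy the directory's contents rather than the directory itself, which keeps repeated runs idempotent:

```shell
# Reproduce the nesting problem with throwaway directories.
top=$(mktemp -d)
cd "$top"
mkdir dira backup
echo "data" > dira/file1

rsync -r dira backup/dira     # first run: creates backup/dira as a copy
rsync -r dira backup/dira     # second run: nests a copy at backup/dira/dira

# A trailing slash on the source copies the *contents* of dira, so
# repeated runs update the same tree instead of nesting:
mkdir backup2
rsync -r dira/ backup2/dira
rsync -r dira/ backup2/dira
```

After the first pair of runs, backup/dira/dira/file1 exists; after the second pair, backup2/dira contains only file1 no matter how often the command is repeated.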

There are some additional options you may want to specify when using rsync. The default synchronization does not copy the file metadata, and certain special files (such as symbolic links) are not copied at all. The main options you want to use are:

--delete -- Delete files in the destination directory that no longer exist in the source. The default mode is simply to synchronize file changes and create new files; if a file has been deleted in the source, it is ignored. With this option, you create an identical synchronization.

--recursive -- Recursively copy directories and files.

--times -- Synchronize the modification and creation times for each file and directory.

--owner -- Preserve the file ownership, if possible.

--group -- Preserve the group ownership, if possible.

--links -- Copy symbolic links as symbolic links, instead of copying the file data and interpreting the source links.

--perms -- Preserve the file permissions.

--hard-links -- Preserve hard links (by creating a hard link on the destination) instead of copying the file content.

Some of these options are only valid and achievable if the two systems have identical configurations. For example, the file ownership and group ownership settings can only be preserved if the source and destination machines use the same IDs for the same user.
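Putting those options together gives a mirror command like the sketch below (the directory and file names are hypothetical, and the ownership options take effect only when run with sufficient privileges):

```shell
src=$(mktemp -d)
dst=$(mktemp -d)
mkdir -p "$src/project"
echo "alpha" > "$src/project/a.txt"
ln -s a.txt "$src/project/link-to-a"

# Mirror src into dst, preserving metadata, keeping symlinks as
# symlinks, and removing anything that has vanished from the source.
rsync --delete --recursive --times --owner --group --links --perms \
      --hard-links "$src/" "$dst/"

# Delete a source file and synchronize again; --delete removes the
# stale copy from the destination.
rm "$src/project/a.txt"
rsync --delete --recursive --times --owner --group --links --perms \
      --hard-links "$src/" "$dst/"
```

After the second run the destination matches the source exactly: a.txt is gone and link-to-a is still a symbolic link rather than a copied file.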

In addition to local copies, rsync also performs remote copies using ssh. To use it, specify the username and remote host before the source or destination directory. For example, to synchronize a directory to a remote system as a given user, do the following: $ rsync --recursive dira user@remote:/backup/dirb. If you have not set up passwordless ssh connections, you will be prompted for the remote password. If you have, then this can be an effective way of performing an unattended overnight backup.

The same user/host combination can also be used for the source, allowing you to copy from a remote source to a local directory: $ rsync --recursive user@remote:dira dirb. When copying to a remote system over the Internet, the --compress option compresses the information before it is transferred over the network, which is much more efficient than a raw byte copy. Of course, when copying to a remote system, you probably don't want to copy the bare files if they contain sensitive information. For this, you need encryption.

Encrypting files for synchronization

One of the more common reasons for using a file synchronization solution is to create an exact backup of the files so that you can copy or recreate elements of the directory structure in the event of a problem.

The rsync tool is ideally suited for this, since it can efficiently copy only the files that have changed between two directories. More usefully, because rsync can synchronize to a remote system, you can use it to create an automatic and remote backup without having to separately copy the files to the remote system.

One limitation of this process is that the copy you create is not encrypted. If you are copying the files to a remote system, that system may be accessible to others (for example, on a hosting service), and you want to be sure that the files, even if they were reached, could not be read.

Using rsync only, it is not possible to encrypt the files. Nor is it possible to take advantage of the algorithm used by rsync to encrypt only the files that have changed since the last synchronization.

However, by wrapping the execution of rsync into a script, it is possible to take advantage of the output from rsync to create a secondary copy of the files that are then also encrypted.

The basis of the script is to create two copies of the original directory structure. The first copy is the reference copy, and contains an exact duplicate of the directory structure. This is needed so that when the directories are synchronized again, the source and destination files can be compared and the list of differences determined as usual. With the --itemize-changes option, rsync creates a reference list of what happens to each file during synchronization. The output details whether the file has changed (or is new) or whether the file has been deleted. You can see an example of this in Listing 3.

Listing 3. Itemized changes from rsync

.d..t...... t1/a/
*deleting   t1/a/3
.d..t...... t1/b/
>f.st...... t1/b/1
>f+++++++++ t1/b/6

The lines starting with .d indicate a new directory or a directory change. The *deleting line indicates that a file has been deleted in the source. The >f lines indicate that a file has changed or that a new file has been created (new files are marked >f+++++++++).

By parsing this output, you can determine the changes between the source directory and the destination reference directory. Once the changes have been determined, an identical encrypted version of each original file can be created in a third directory. The itemized changes are used to encrypt (or delete) only the files that have changed since the last synchronization. You cannot use the encrypted version of the directory to perform the synchronization directly, as the encrypted version of a file will always differ from the source.
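The parse the script performs can be sketched in shell as well. The snippet below runs awk over a saved log (the sample lines mirror Listing 3 rather than coming from a live rsync run): the first field carries the itemize flags, the second the filename.

```shell
# A saved --itemize-changes log (sample lines, mirroring Listing 3).
cat > /tmp/sample.rsynclog <<'EOF'
.d..t...... t1/a/
*deleting   t1/a/3
.d..t...... t1/b/
>f.st...... t1/b/1
>f+++++++++ t1/b/6
EOF

# Files changed or created: the itemize field starts with ">f".
changed=$(awk '$1 ~ /^>f/ {print $2}' /tmp/sample.rsynclog)

# Files removed from the source: marked "*deleting".
deleted=$(awk '$1 ~ /^\*del/ {print $2}' /tmp/sample.rsynclog)

echo "changed: $changed"
echo "deleted: $deleted"
```

Directory-only lines (.d) fall through both patterns, which is exactly the filtering the Perl script applies with its two regular expressions.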

The full script is shown in Listing 4.

Listing 4. Full script

#!/usr/bin/perl

use warnings;
use strict;

use File::Basename;
use File::Path;

my $source  = shift;
my $dest    = shift;
my $encdest = shift;

if (!defined($source) || !defined($dest) || !defined($encdest))
{
    print "Error: Not enough arguments!\n";
    print "Usage: $0 source destination encrypteddest\n";
    exit(1);
}

print STDERR "Running rsync between $source and $dest ($encdest)\n";

# Synchronize the source to the reference copy, logging the itemized changes
system("rsync --delete --recursive --times -og --links --perms " .
       "--hard-links --itemize-changes $source $dest " .
       ">/tmp/$$.rsynclog 2>&1");

open(DATA,"/tmp/$$.rsynclog") or die "Couldn't open the rsynclog\n";

my @changedfiles;
my @delfiles;

# Parse the itemized log into lists of changed and deleted files
while(<DATA>)
{
    next if (m/sending incremental file list/);
    chomp;
    last if (length($_) == 0);
    my ($changes,$filename) = split;
    push @changedfiles,$filename if ($changes =~ m/^>f/);
    push @delfiles,$filename if ($changes =~ m/^\*del/);
}

close(DATA);

my $counter = 0;

# Re-encrypt each changed file into the encrypted destination
foreach my $file (@changedfiles)
{
    if (-f "$dest/$file")
    {
        my $sourcename = encode_filename("$dest/$file");
        my $destname   = encode_filename("$encdest/$file");
        my $dirname    = dirname("$encdest/$file");
        mkpath($dirname);
        system(sprintf('cat "%s" |openssl enc -des3 ' .
                       '-pass file:/var/lib/passphrase -a >"%s"',
                       $sourcename,$destname));
        $counter++;
    }
}

my $delcounter = 0;

# Remove files from the encrypted copy that were deleted in the source
foreach my $file (@delfiles)
{
    unlink("$encdest/$file");
    $delcounter++;
}

print STDERR "Finished (changed: $counter, deleted: $delcounter)\n";

unlink("/tmp/$$.rsynclog");

# Escape shell metacharacters in filenames before interpolating them
# into the openssl command line
sub encode_filename
{
    my ($filename) = @_;
    $filename =~ s/ /\\ /g;
    $filename =~ s/'/\\'/g;
    $filename =~ s/"/\\"/g;
    $filename =~ s/\(/\\(/g;
    $filename =~ s/\)/\\)/g;
    $filename =~ s/&/\\&/g;
    $filename =~ s/#/\\#/g;
    return($filename);
}

The script is relatively simple and straightforward to use. To run it, specify the source directory, the destination directory for the reference files, and the destination directory for the encrypted version of the files: $ rsyncrypt source destination destination.enc.

The first part of the script performs a basic rsync between the source and destination directories to determine the changes (see Listing 5). That operation generates the itemized file (in /tmp) to document the changes.

Listing 5. Performing a basic rsync between the source and destination directories

system("rsync --delete --recursive --times -og --links --perms " .
       "--hard-links --itemize-changes $source $dest " .
       ">/tmp/$$.rsynclog 2>&1");

Next, the list of changes is parsed and a list of the files that have been changed and deleted is generated (see Listing 6).

Listing 6. Parsing the list of changes

while(<DATA>)
{
    next if (m/sending incremental file list/);
    chomp;
    last if (length($_) == 0);
    my ($changes,$filename) = split;
    push @changedfiles,$filename if ($changes =~ m/^>f/);
    push @delfiles,$filename if ($changes =~ m/^\*del/);
}

For each file that has changed, the encrypted version is created by reading the reference version and creating the encrypted version within the encrypted destination directory (see Listing 7).

Listing 7. Creating the encrypted version of each file that has changed

foreach my $file (@changedfiles)
{
    if (-f "$dest/$file")
    {
        my $sourcename = encode_filename("$dest/$file");
        my $destname   = encode_filename("$encdest/$file");
        my $dirname    = dirname("$encdest/$file");
        mkpath($dirname);
        system(sprintf('cat "%s" |openssl enc -des3 ' .
                       '-pass file:/var/lib/passphrase -a >"%s"',
                       $sourcename,$destname));
        $counter++;
    }
}

The filename has to be encoded, because we are using a shell to perform the actual encryption. Some special characters, which would otherwise be interpreted by the shell, need to be escaped.

For the actual encryption, openssl is used with a simple text file (in /var/lib/passphrase) containing the passphrase that encodes the information. You could also create or use a specially generated key to perform the operation, or substitute any other encryption command that you want to use.
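To recover files from the encrypted mirror you need the matching decryption step, which the script does not include. The sketch below round-trips a file using the same openssl options as the script plus -d for decryption (the passphrase file here is a temporary stand-in for /var/lib/passphrase):

```shell
# Round-trip sketch: encrypt a file the way the script does, then
# decrypt it. The passphrase file location is arbitrary here; the
# script reads /var/lib/passphrase.
pass=$(mktemp)
echo "secret-passphrase" > "$pass"

echo "confidential data" > /tmp/plain.txt

# Encrypt: triple-DES, base64-armored (-a), passphrase from a file.
openssl enc -des3 -pass "file:$pass" -a \
    < /tmp/plain.txt > /tmp/plain.txt.enc

# Decrypt with the same options plus -d.
openssl enc -des3 -d -pass "file:$pass" -a \
    < /tmp/plain.txt.enc > /tmp/recovered.txt
```

As long as the options and passphrase file match on both sides, the recovered file is byte-for-byte identical to the original.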

Finally, because the source directory may have files that have been deleted from the original, any deleted files are also removed from the encrypted directory contents (see Listing 8).

Listing 8. Removing deleted files from the encrypted directory contents

foreach my $file (@delfiles) { unlink("$encdest/$file"); $delcounter++; }

The script is very effective; the only downside is that it requires two copies of the information (the reference directory and the encrypted version) instead of just one. Also, to simplify the process, the permissions, ownership, and timestamp information is not synchronized to the encrypted version, although this would be comparatively easy to add. The benefit is that, because rsync is used to generate the list of changes, the number of files that need to be encrypted is dramatically reduced, and the new encrypted version of the files can be synchronized to your remote host using the same optimized algorithm, transferring only the encrypted files that have changed since the last synchronization.

Summary

This article looked at a number of different methods for synchronizing files. The basic cp is not really synchronization, but it is useful for a direct copy; it is too time-consuming and inefficient for a true sync operation. With tar, you can take advantage of a time reference point to copy only files changed after a specified time, but this too has limitations if the changes are less obvious or not picked up by such a blanket comparison.

The rsync tool is a much better solution for proper synchronization. It performs extensive checks that truly compare the source and destination directories and allow for efficient synchronization, even over a network or public connection. For security, you can combine this functionality with an encryption phase to ensure that the remote copy is unreadable without the correct passphrase or cryptographic key.
