Understanding the Python `zipfile` API

A zip file is a binary file: its contents are compressed with an algorithm, and the original paths are preserved. Every open-source zip tool does essentially the same thing — it understands that binary representation and processes it. Because an archive is just bytes, it is natural to use BytesIO while working with zip files in memory. Python ships a standard-library package for working with zip archives called `zipfile`.

The zipfile package has two main classes:

ZipFile: represents a zip archive in memory.
ZipInfo: represents a member of a zip archive.

A ZipFile is an exact representation of a zip archive. It means you can load a .zip file directly into that class object or dump a ZipFile object to a new archive. Every ZipFile has a list of members. Those members are ZipInfo objects.

A ZipInfo object represents a single path in the zip file: the member's directory components plus its file name. For example, say we have a directory called config that stores configuration for the application, for containers, and some root-level configuration. Assume the content looks like this:

    config
    ├── app
    │   └── app-config.json
    ├── docker
    │   └── docker-compose.yaml
    └── root-config.json

    2 directories, 3 files

If you zip the config directory using your favourite zip tool (I pick this Python command),

python -m zipfile -c config.zip config

and then try to list the contents of config.zip using Python command,

python -m zipfile -l config.zip

It displays all paths in the zip file.

What are those paths? Each path listed in the output is a ZipInfo object for Python. To prove that, let us write a small script that creates a zip archive in memory with config.zip,
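The article's script isn't reproduced here, so below is a minimal reconstruction: it builds the same config tree (with empty files, matching the `file_size=0` in the output) inside an in-memory archive and prints every ZipInfo member.

```python
import io
from zipfile import ZipFile

# Build the config archive entirely in memory.
buffer = io.BytesIO()
with ZipFile(buffer, "w") as zip_archive:
    # Paths ending with "/" become directory entries; files are empty here.
    paths = [
        "config/",
        "config/docker/",
        "config/docker/docker-compose.yaml",
        "config/app/",
        "config/app/app-config.json",
        "config/root-config.json",
    ]
    for path in paths:
        zip_archive.writestr(path, "")

with ZipFile(buffer) as zip_archive:
    for info in zip_archive.infolist():
        print(info)
    print(f"There are {len(zip_archive.infolist())} ZipInfo objects present in archive")
```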

When you run this script, you see the following output:

    <ZipInfo filename='config/' filemode='drwxr-xr-x' external_attr=0x10>
    <ZipInfo filename='config/docker/' filemode='drwxr-xr-x' external_attr=0x10>
    <ZipInfo filename='config/docker/docker-compose.yaml' compress_type=deflate filemode='-rw-r--r--' file_size=0 compress_size=2>
    <ZipInfo filename='config/app/' filemode='drwxr-xr-x' external_attr=0x10>
    <ZipInfo filename='config/app/app-config.json' compress_type=deflate filemode='-rw-r--r--' file_size=0 compress_size=2>
    <ZipInfo filename='config/root-config.json' compress_type=deflate filemode='-rw-r--r--' file_size=0 compress_size=2>
    There are 6 ZipInfo objects present in archive

This ZipInfo object is critical for modifying a file or path in the archive. It is a high-level wrapper around a file stream: through it, one can read or modify a member's data, and one can also create new ZipInfo objects and add them to the archive. In the next sections, let us walk through all the variations with simple Python programs that create and update zip archives, favouring in-memory streams over temporary files wherever possible.

Use case #1: Create zip archive with files

We can create a zip file with a given name by opening a new ZipFile object in write mode ('w') or exclusive-create mode ('x').

Next, we can add files/paths to the zip archive. There are two approaches to do that:

v1: Add a file as a file-like object

This approach writes independent files as file-like objects.
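A sketch of v1, with sample file contents standing in for the real config files (the original script, createzipv1.py, isn't shown here). Each member is opened as a writable file-like stream via ZipFile.open:

```python
from zipfile import ZipFile

# Sample contents standing in for the real config files.
files = {
    "docker/docker-compose.yaml": b"version: '3'\n",
    "app/app-config.json": b"{}\n",
    "root-config.json": b"{}\n",
}

with ZipFile("config.zip", "w") as zip_archive:
    for path, data in files.items():
        # open(..., "w") returns a writable file-like object for the member.
        with zip_archive.open(path, "w") as member:
            member.write(data)
```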

v2: Add a file as a ZipInfo object

This approach composes files as objects and gives more flexibility to add meta information on file.
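A sketch of v2 under the same assumptions: members are composed as ZipInfo objects first, so metadata such as timestamps, compression, and permissions can be set before writing.

```python
from datetime import datetime
from zipfile import ZIP_DEFLATED, ZipFile, ZipInfo

# Sample contents standing in for the real config files.
files = {
    "docker/docker-compose.yaml": b"version: '3'\n",
    "app/app-config.json": b"{}\n",
    "root-config.json": b"{}\n",
}

with ZipFile("config.zip", "w") as zip_archive:
    for path, data in files.items():
        # Compose the member as a ZipInfo object with explicit metadata.
        info = ZipInfo(path, date_time=datetime.now().timetuple()[:6])
        info.compress_type = ZIP_DEFLATED
        info.external_attr = 0o644 << 16  # -rw-r--r-- permissions
        zip_archive.writestr(info, data)
```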

Both versions create a config.zip file on disk. While creating a file in the archive, they use relative paths like this:

docker/docker-compose.yaml

v2 is slightly more flexible, as it gives the freedom to modify ZipInfo object properties at any point in time.

Use case #2: Read a file from zip archive

Another possible use case is to read a file from an existing zip archive. Let us use the config.zip file created from Use case #1.

For example: read the content of docker-compose.yaml from the zip and print it.
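A sketch of that read, with a small setup block standing in for the config.zip produced in use case #1 (the file content is an assumption):

```python
from zipfile import ZipFile

# Setup: stand-in for the config.zip created in use case #1.
with ZipFile("config.zip", "w") as zip_archive:
    zip_archive.writestr("docker/docker-compose.yaml", "version: '3'\n")

# Read a single member back and print it.
with ZipFile("config.zip") as zip_archive:
    content = zip_archive.read("docker/docker-compose.yaml").decode()
    print(content)
```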

Use case #3: Update or Insert a file in zip archive

This use case is the trickiest part of the zipping business in Python. At first glance, it might look simple. Let us attempt a few solutions.

Attempt #1

The obvious approach that comes to mind is to update the specific file in the archive with the latest data, like this:
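A reconstruction of the flawed attempt (the setup block is an assumption standing in for a full config.zip from use case #1):

```python
from zipfile import ZipFile

# Setup: a config.zip with three members, as in use case #1.
with ZipFile("config.zip", "w") as zip_archive:
    for path in ("docker/docker-compose.yaml", "app/app-config.json", "root-config.json"):
        zip_archive.writestr(path, "{}\n")

# The naive update: open in "w" mode and write the new content.
# "w" truncates the archive first, so every other member is lost.
with ZipFile("config.zip", "w") as zip_archive:
    zip_archive.writestr("docker/docker-compose.yaml", "version: '3.7'\n")
```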

If we run the preceding script, it replaces the file in config.zip, but because the archive is opened in write mode ('w'), all the other files/paths in the archive vanish. You can check it with this command:

    ❯ python -m zipfile -l config.zip
    File Name                    Modified             Size
    docker/docker-compose.yaml   2020-03-08 20:05:48    27

Woah, the root config and app config have vanished from config.zip. That is a nasty side effect.

Don't use 'w' mode when you update or replace a single file in a zip archive, or your data is gone for good.

Attempt #2

Can't we append a file to the existing zip? Will it magically overwrite the file? Yes, it does. Just change the mode in the previous code snippet from 'w' to 'a':

    with ZipFile('config.zip', 'a') as zip_archive:
        ...

Then rerun the script on a fresh config.zip (one that has the root, docker, and app configs). You see this warning:

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/zipfile.py:1506: UserWarning: Duplicate name: 'docker/docker-compose.yaml'

return self._open_to_write(zinfo, force_zip64=force_zip64)

It is just a warning, so go ahead and extract the contents to see what is inside:

python -m zipfile -e config.zip config

and you find there is only one docker-compose.yaml file in the docker directory, with all other files/paths preserved. Wonderful!

    ❯ tree config
    config
    ├── app
    │   └── app-config.json
    ├── docker
    │   └── docker-compose.yaml
    └── root-config.json

    2 directories, 3 files

Even though this seems like an obvious solution, there is a serious bug here. The extracted directory may not show duplicate files, but the underlying archive now carries duplicated entries.

Run the zipfile list command to see those hidden duplicates:

    ❯ python -m zipfile -l config.zip
    File Name                    Modified             Size
    docker/docker-compose.yaml   1980-01-01 00:00:00    23
    app/app-config.json          1980-01-01 00:00:00    21
    root-config.json             1980-01-01 00:00:00    22
    docker/docker-compose.yaml   2020-03-08 20:34:48    31

So docker/docker-compose.yaml appears twice in the ZipInfo list but only once in the extraction. With every update, the archive grows by roughly the size of the updated file. If you ignore the Python warning, at some point the junk in the archive may occupy more space than the actual files.
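The duplication is easy to demonstrate with a small experiment (assumed setup, not the article's script; the warning is suppressed only to keep the output clean). Note that read() resolves a duplicated name to the entry appended last:

```python
import warnings
from zipfile import ZipFile

# Setup: an archive with one member.
with ZipFile("config.zip", "w") as zip_archive:
    zip_archive.writestr("docker/docker-compose.yaml", "old\n")

# Append the same path again in "a" mode.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)  # the "Duplicate name" warning
    with ZipFile("config.zip", "a") as zip_archive:
        zip_archive.writestr("docker/docker-compose.yaml", "new and longer\n")

with ZipFile("config.zip") as zip_archive:
    names = zip_archive.namelist()
    print(names)  # the path appears twice
    print(zip_archive.read("docker/docker-compose.yaml"))  # the latest content wins
```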

The two attempts so far couldn't achieve an acceptable solution. Now comes the third, which is a clean and elegant way.

Attempt #3

There is no easy way to update the contents of a zip archive in place. A clean way is to create a new zip archive in memory and copy the ZipInfo objects from the old archive into it. For the path whose data should be inserted or replaced, instead of copying from the old archive, create a custom ZipInfo object with the new data and add it to the new archive.

The algorithm looks like this:
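A sketch of that algorithm; the function name update_zip and the surrounding file handling are assumptions, not the article's exact script:

```python
import io
from zipfile import ZIP_DEFLATED, ZipFile, ZipInfo

def update_zip(zip_path, path, data):
    """Clone zip_path into a new in-memory archive, replacing
    (or inserting) the member at `path` with `data`."""
    buffer = io.BytesIO()
    with ZipFile(zip_path) as old_archive, ZipFile(buffer, "w") as new_archive:
        # Copy every existing member except the one being replaced.
        for item in old_archive.filelist:
            if item.filename != path:
                new_archive.writestr(item, old_archive.read(item.filename))
        # Add the new/updated member exactly once.
        info = ZipInfo(path)
        info.compress_type = ZIP_DEFLATED
        new_archive.writestr(info, data)
    # Overwrite the old archive with the cloned one.
    with open(zip_path, "wb") as f:
        f.write(buffer.getvalue())
```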

Here, we define a function that takes the path in the archive and the data to write. It iterates over the old archive and copies the existing members into the new archive. When it spots the target path, it creates a fresh ZipInfo object with the new data and puts that into the new archive instead. Run the script on a fresh config.zip (created by createzipv1.py), and you see there is no duplication of file objects and docker/docker-compose.yaml is updated as expected.

This solution has a minor drawback: it deals with two streams at a time and, in the worst case, can consume double the run-time memory. That only matters when you are talking about gigabyte-sized zip files.

Use the technique of cloning for updating/inserting paths in a zip archive.

Use case #4: Remove an existing file from zip archive

By now, after looking at several use cases, one can guess how to remove a file from the archive. The cleanest way, again, is to copy contents from the old archive to a new archive while skipping the ZipInfo objects that match the given path, and finally overwrite the old zip file with the new one. The algorithm needs only one condition, like this:

    ...
    for item in old_archive.filelist:
        if item.filename != path:
            new_archive.writestr(item, old_archive.read(item.filename))
    ...

The delete script has a function that takes only the path argument and skips the matching ZipInfo object while copying.
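A full sketch of that delete function under the same cloning approach (delete_from_zip is an assumed name, not the article's):

```python
import io
from zipfile import ZipFile

def delete_from_zip(zip_path, path):
    """Clone zip_path into a new in-memory archive, skipping `path`."""
    buffer = io.BytesIO()
    with ZipFile(zip_path) as old_archive, ZipFile(buffer, "w") as new_archive:
        for item in old_archive.filelist:
            # Copy everything except the member being removed.
            if item.filename != path:
                new_archive.writestr(item, old_archive.read(item.filename))
    # Overwrite the old archive with the cloned one.
    with open(zip_path, "wb") as f:
        f.write(buffer.getvalue())
```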

That covers the common use cases that pop up while working with zip files in Python.

Note: the in-memory stream objects created (using BytesIO) in the above scripts can also be uploaded to AWS S3 instead of being flushed to disk.

Final words

As we already discussed, one should watch the size of the zip file and the program's memory during a copy operation.

A proper implementation uses a combination of techniques instead of a brute-force approach. For example, when a stream holds a considerable buffer, Python provides shutil.copyfileobj() to copy file-like objects from source to destination efficiently, chunking the buffer while copying. To keep memory in check while updating, inserting, or deleting paths in a big archive, use it for copying members.
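Copying between two file-like objects in fixed-size chunks looks like this (the sizes are arbitrary, for illustration only):

```python
import io
import shutil

# Pretend this is a large member stream read from an archive.
source = io.BytesIO(b"x" * 10_000_000)
destination = io.BytesIO()

# copyfileobj copies in chunks (64 KiB here) instead of
# loading the whole buffer into memory at once.
shutil.copyfileobj(source, destination, length=64 * 1024)
```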


I hope you enjoyed this article! You can find all the code samples here.

https://github.com/narenaryan/python-zip-howto
