Image families are a powerful Google Compute Engine concept that can help a great deal in automating infrastructure. However, as usual with powerful concepts, it doesn’t take much to shoot yourself in the foot if you use them the wrong way. Here we will see how they should be used and how they should not.

What is an Image Family?

First of all, if you are not familiar with image families, let me quote the official documentation:

Image families simplify the process of managing images in your project by grouping related images together and making it easy to roll forward and roll back between specific image versions. An image family always points to the latest version of an image that is not deprecated. Most public images are grouped into image families. For example, the debian-9 image family in the debian-cloud project always points to the most recent Debian 9 image.
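To make the quoted behaviour concrete, here is a toy Python model of a family. It is only a sketch and not the Compute Engine API: the ImageFamily class, its methods, and the image names my-application-v1 and my-application-v2 are made up for illustration. It shows the two properties the documentation describes: the family resolves to the newest image that is not deprecated, and deprecating that image rolls the family back.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Image:
    name: str
    deprecated: bool = False


@dataclass
class ImageFamily:
    """Toy model of an image family: images in creation order, oldest first."""
    name: str
    images: List[Image] = field(default_factory=list)

    def push(self, image_name: str) -> None:
        """Roll forward: a newly created image becomes the family's latest."""
        self.images.append(Image(image_name))

    def deprecate(self, image_name: str) -> None:
        """Roll back: a deprecated image is skipped when resolving the family."""
        for image in self.images:
            if image.name == image_name:
                image.deprecated = True

    def latest(self) -> Optional[str]:
        """Return the newest image that is not deprecated, or None."""
        for image in reversed(self.images):
            if not image.deprecated:
                return image.name
        return None


family = ImageFamily("my-application")
family.push("my-application-v1")
family.push("my-application-v2")
print(family.latest())  # my-application-v2

family.deprecate("my-application-v2")
print(family.latest())  # my-application-v1
```

Note that the caller of latest() never has to know the individual image names. That is the convenience, and, as we will see, also the trap.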

We are also heavily utilizing image families in our Deep Learning images. Here is the latest graph:

In other words, an image family is a stack of images, and you always know which image is the latest. The official Google Cloud page on image management best practices describes this really well, from the point of view of the developer who builds and distributes the images, in the following picture:

Unfortunately, the same page does not say much about how families should be consumed on the customer side, except maybe for one sentence that, if misunderstood, can lead to catastrophic consequences. Here is the sentence:

Rather than changing automated tools to direct consumers to the latest image by propagating the specific name to other systems, you can simply reference the image family name, which will always return the latest image in the family, for example, my-application.

What is the problem here? Let’s look at a particular example.

Unfortunate Chain Of Events With Images

Let’s say I have a web server with my Java app installed on it. I have an image with everything pre-installed. I use the image family my-app for my images and produce regular builds with system-level updates. My image names have the build date appended, like my-app-20180412, my-app-20180511, etc.

Two questions arise here:

Should customers of my images use the image family or specific images?

If I give the family name to my customers, what should I do with older images? Should I just remove them?

For the sake of the example, let’s say that these are my answers:

I advise my customers to use the image family and not specific images;

I remove old images, since I do not want to store a lot of binaries.

Now let’s discuss a possible timeline of what we can call a series of unfortunate events with images:

12:30pm new image my-app-20170411 has been pushed to the my-app image family

1:00pm customer tests a new job process by creating a new instance and verifying that the job works correctly (with image my-app-20170411)

1:30pm I push an emergency security patch released by the OS vendor; the new image is my-app-20170411b2

2:00pm customer starts instances and starts more jobs (at this point the instances will be using my-app-20170411b2)

2:30pm jobs are failing due to an updated version of a system library on the image

3:00pm user rolls back the update of the job

3:30pm old jobs are also failing (due to the updated version of the system library on the image)

11:00pm user finally finds out that it has nothing to do with the job, that the problem is in the image, and starts investigating possible ways to roll back

11:30pm user has finally familiarized themselves with the image family APIs and is ready to query images from the family

11:45pm user reaches the horrible realization that there is no way to roll back, since the older images have been removed from the family (and they do not have the id of the older image)
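The core of this timeline can be replayed in a few lines of Python. This is a toy model under stated assumptions, not real Compute Engine calls: replay_timeline, publish, and boot_from_family are hypothetical helpers, and "remove old images on every release" is the producer policy chosen in the example above.

```python
from typing import Dict, List


def replay_timeline() -> Dict[str, object]:
    """Replay the key events with a toy model of the my-app family."""
    stored: List[str] = []  # images that exist in the project, oldest first

    def publish(image: str) -> None:
        # Producer policy from the example: old images are removed on release.
        stored.clear()
        stored.append(image)

    def boot_from_family() -> str:
        # The customer's configuration names only the family, never an image.
        return stored[-1]

    publish("my-app-20170411")    # new image pushed to the family
    tested = boot_from_family()   # 1:00pm, customer verifies the job
    publish("my-app-20170411b2")  # 1:30pm, emergency security patch
    running = boot_from_family()  # 2:00pm, more instances are started

    return {
        "tested": tested,
        "running": running,
        "can_roll_back": tested in stored,
    }


result = replay_timeline()
print(result["tested"])         # my-app-20170411
print(result["running"])        # my-app-20170411b2
print(result["can_roll_back"])  # False
```

Nothing in the customer's configuration changed between 1:00pm and 2:00pm, yet the image that was tested is not the image that is running, and the tested one no longer exists.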

Even if the old images are not deleted but only deprecated, the user will have no obvious way to get the ids of the old images. Just go and see for yourself:

All the images are there, but since they have been deprecated they are not shown to you in the default listing, so you do not see the image id, and without the id you cannot roll back!

However, if you do know the image id, you can use it:

Now, if you think that this “unfortunate” scenario is not possible, you are probably under the influence of two (wrong) assumptions:

Assumption one: it is possible to update an OS with all the latest security patches without breaking API compatibility.

Assumption two: the latest image in the family is stable.

Let’s address the first one. Modern OSes with long-term support usually ship slightly outdated versions of packages. Their key goal is stability. This basically means that any security patch to the latest version of the software needs to be re-evaluated and backported to the old version. From time to time that might not even be possible. This is why, when you see glibc 2.17 on an OS, it might actually contain tons of custom source code from the OS vendor as well as many APIs that are backported from newer versions! In many cases such backports are not possible and the only way forward is to upgrade to a new major version. Here are just a few random examples of how upgraded packages complicate things:

https://bugs.launchpad.net/ubuntu/+source/gcc-4.8/+bug/1750937 Kernel update on 2/21 breaks Nvidia drivers

https://twitter.com/ThePSF/status/1024095049645273091 NumPy upgrade that actually caused different “latest” Docker containers and (probably) VM images to be non-backward-compatible.

https://github.com/ContinuumIO/anaconda-issues/issues/6340 latest miniconda breaks gcc

The second assumption probably indicates that you are relatively new to the field. I will point you to an article about Docker tags (the analog of image families in the Docker world), in which one broken deployment had a massive impact on consumers: https://renovatebot.com/blog/docker-mutable-tags . No matter how sure you are of your latest build, it WILL happen: you WILL deploy a broken build to latest in one way or another.

The worst part of the scenario described above is that even if all the images are still there and just deprecated, the user has no easy way to roll back, since by using an image family you are hiding the notion of image ids from the user. So, as in the Docker example, the user most likely will not even know how to roll back.

How Can We Prevent Such Problems?

Okay, so what do we do next? How should we use image families, or should we even be using them at all? The short answer is yes, we should be using image families, however… The ideal workflow that you should advise your users to follow looks like this:

get the latest image id from the image family; // this should be the one and only place to interact with the image family

test new image

update your prod to the latest image

In none of these steps, except the very first one, are you using the image family. It does not matter what you are doing, whether you are developing an online course or a new Twitter killer: you should never use the image family directly. First you get the image id, then you deploy it to prod (after testing).
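The three steps can be sketched as a single Python function. This is a minimal sketch under stated assumptions: promote_latest and the two callables it receives are hypothetical names, standing in for whatever family lookup and test suite you actually have.

```python
from typing import Callable, Optional


def promote_latest(
    resolve_family: Callable[[str], str],
    passes_tests: Callable[[str], bool],
    family: str,
    current_prod_image: Optional[str],
) -> Optional[str]:
    """Return the image id production should pin after one update cycle."""
    candidate = resolve_family(family)  # step 1: the only family lookup
    if candidate == current_prod_image:
        return current_prod_image       # nothing new to test
    if passes_tests(candidate):         # step 2: test that exact image
        return candidate                # step 3: prod now pins this id
    return current_prod_image           # a broken image never reaches prod


# Hypothetical stand-ins for a real family lookup and a real test suite.
def fake_resolve(family: str) -> str:
    return "my-app-20170411b2"


def fake_tests(image: str) -> bool:
    # Pretend the patched image breaks the job, as in the timeline above.
    return image != "my-app-20170411b2"


prod = promote_latest(fake_resolve, fake_tests, "my-app", "my-app-20170411")
print(prod)  # my-app-20170411
```

The design point is that production only ever sees a concrete image id that was returned by this function, so a rollback is just a matter of pinning the previous return value.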

Here is how you can, for instance, get the latest image of our Deep Learning Image With TensorFlow:
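A minimal sketch of such a lookup in Python, assuming the gcloud CLI is installed and authenticated. It wraps the gcloud compute images describe-from-family command; the helper names describe_from_family_cmd and latest_image_name are made up for illustration, and the family and project names in the usage note below are assumptions, so substitute the ones you actually use.

```python
import json
import subprocess
from typing import List


def describe_from_family_cmd(family: str, project: str) -> List[str]:
    """Build the gcloud command that resolves a family to a concrete image."""
    return [
        "gcloud", "compute", "images", "describe-from-family", family,
        f"--project={project}",
        "--format=json",
    ]


def latest_image_name(family: str, project: str) -> str:
    """Run gcloud and return the name of the image the family points to now."""
    completed = subprocess.run(
        describe_from_family_cmd(family, project),
        check=True,           # raise if gcloud exits with an error
        capture_output=True,  # collect stdout instead of printing it
        text=True,
    )
    return json.loads(completed.stdout)["name"]
```

For example, latest_image_name("tf-latest-cpu", "deeplearning-platform-release") would return a concrete image name, assuming that family exists in that project. Store the returned name and use it, not the family, everywhere else.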

If you want, you can go through these steps (get latest -> test -> deploy to prod) each time a new image is released. However, this is THE only way to avoid surprises in production. If a new image is broken, your users should never even start using it in production.

But What About Bug Fixes?

Repeat after me: “known bugs with existing workarounds in the current image are far more valuable than unknown bugs with no workarounds in a new image”. If you think that a new image will contain only bug fixes and will maintain backward compatibility all the time, go back to the previous section of the article; it looks like you have missed the point.

Wait, But What if There Is a Security Patch?

You would be surprised, but in the majority of cases your users might not care. Not because they do not want to be secure, but because they have an internal instance that does not have any ports exposed to the Internet. For them, the most important thing is for the image to work. I have seen people who have not upgraded their server for years. Yes, it had many, many security vulnerabilities, but it was a server on an internal network, and the intention was not to touch it as long as it was working.

Also, keep in mind that on many OSes the most important security patches actually get installed even if the user has not done anything.

All this does NOT mean that you should not care about having all the patches and updates in place, especially if your customers run public-facing services on your images. But once again, an image that your users get from you should go through the same testing process as any other change: get new image id => test => deploy. If you are producing the images, do not ask your users to use image families in prod directly!

Conclusion

Image families are a powerful concept that simplifies the process of getting the latest image. It is a very useful concept, especially if you are building and distributing images to your customers. However, as usual with powerful things, it comes with responsibility that you are now aware of.