Applying updates to hosts in a cluster environment can be tricky. For the individual containers, updates are simple; most of the time running apt update && apt upgrade is sufficient. The hypervisors, on the other hand, require more attention.

Our compute infrastructure is set up as a hyper-converged cluster: software-defined compute, storage, and networking, provided in our case by Proxmox (compute and networking) and Ceph (storage). When applying updates to the hosts I need to keep this in mind, because it may affect the containers and virtual machines running on them. After a bit of searching for best practices on updating Proxmox, I came across a post from a user describing how they typically do it. The process is basically to, on each host, migrate off the CTs and VMs, update the host through APT, then migrate them back. By the time I was done, I ended up with a slightly different set of steps.

Migrating Guests off of the Host

Starting with Ceres, I enabled nofailback on the HA group for the host. This prevents the HA manager from immediately migrating the guests back.
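If you prefer the CLI to the web UI, the equivalent is a one-liner; the group name below is just a placeholder for whatever your HA group is actually called.

# stop the HA manager from failing guests back to their preferred node
ha-manager groupset my-ha-group --nofailback 1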

Then I migrated the guests off of the host to Europa… This did not work 100% the first try.
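For reference, the CLI equivalents look roughly like this; the VMIDs are placeholders and the target node is Europa.

# live-migrate a running VM to another node
qm migrate 101 europa --online
# migrate a running container using restart mode
pct migrate 102 europa --restart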

Some containers have a mount point added for the production mass (spinning rust) storage, which prevents them from migrating.

To overcome this, I disabled the mount point by commenting out the relevant line in the LXC config files (located in the /etc/pve/lxc directory on the host) and powered off the containers.
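For illustration, the disabled mount point in /etc/pve/lxc/<vmid>.conf ends up looking something like this; the VMID, storage name, and paths here are made up.

#mp0: tank-hdd:subvol-201-disk-1,mp=/mnt/bulk,size=500G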

This resolved the issue for two of the containers that failed to migrate. After the fact, I looked into the mount point settings for containers and found the shared option. When it is enabled, Proxmox assumes that the same mount point is available on all hosts, making it safe to migrate.
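A hedged sketch of what that looks like, assuming a bind mount of a path that genuinely exists on every host (the names are made up):

# mark this bind mount as present on all nodes so migration is allowed
mp0: /mnt/pve/bulk,mp=/mnt/bulk,shared=1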

Next, our 3CX container failed to migrate with this error:

The important part of the output is the “can’t unmap…” line. I figured I’d try unmapping the rbd image manually. Shockingly… this also failed with the same rbd: sysfs write failed error. But what if I force it?
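The commands looked roughly like this; the device path is a placeholder, since the actual rbd device number will differ.

# a normal unmap, which failed with "rbd: sysfs write failed"
rbd unmap /dev/rbd0
# the same unmap, but forced
rbd unmap -o force /dev/rbd0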

No output from the command, but also no error output, so that's promising. The true test is whether the guest will now migrate, which it does!

Updating the Host

With the host no longer responsible for any guests, I can safely upgrade its packages.
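Proxmox recommends a dist-upgrade (rather than a plain upgrade) for the hosts, so the update itself is just:

# refresh package lists and apply all pending host upgrades
apt update && apt dist-upgrade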

The host started at PVE manager version 6.0-11; after the upgrade it is at 6.1-7.
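Checking the version before and after is a single command:

# print the running pve-manager version
pveversion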

Among the packages upgraded was Ceph. The version change produced some warnings, and the Ceph services needed to be restarted for the new version to become active.
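Exactly which services need restarting depends on what the host runs (monitor, manager, OSDs); something along these lines should cover it, assuming the standard systemd targets that Ceph ships.

# restart the Ceph daemons on this host so they run the new binaries
systemctl restart ceph-mon.target ceph-mgr.target ceph-osd.target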

This left the other hosts running an outdated Ceph version until they are upgraded in turn.
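ceph versions makes the mismatch easy to spot, listing how many daemons are running each release:

# show which Ceph version each daemon type is running
ceph versions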

When Ceph is restarted on a host, it will briefly drop out of the pool and the Ceph monitor in Proxmox will display some concerning warnings/errors. These promptly go away once Ceph comes back online.
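Keeping an eye on the cluster status until it settles is enough to confirm it recovers:

# watch overall cluster health until it returns to HEALTH_OK
watch ceph -s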

Once Ceph is happy again, the nofailback option on the HA group can be disabled, which migrates all of the guests back. Then the LXC config for containers with mount points needs to be reverted and the containers powered back on. Repeat the process for each host.
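The revert is the mirror image of the preparation; again, the group name and VMID are placeholders.

# allow the HA manager to move guests back to their preferred nodes
ha-manager groupset my-ha-group --nofailback 0
# after uncommenting its mount point line, start the container again
pct start 201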

Post-Upgrade Fixes

The whole process went relatively smoothly; I only had to fix two things after the updates. The first was our VPN gateway, a pfSense VM, which went into an error state after migrating back. Simply stopping and starting the VM resolved it.
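From the CLI that would be roughly the following, with the VMID being a placeholder.

# fully stop and then start the stuck VM
qm stop 105 && qm start 105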

The second was the Proxmox warning about a missing subscription. The fix is quick and the same as the last time I did it: just run the following one-liner on each host.

sed -i.bak "s/data.status !== 'Active'/false/g" /usr/share/javascript/proxmox-widget-toolkit/proxmoxlib.js && systemctl restart pveproxy.service

Conclusion

The process for upgrading Proxmox hosts in a cluster is simple. The only challenge is figuring out the correct process for your particular setup. In our case, we need to enable nofailback on the HA group, migrate the guests, run updates, restart Ceph (if it was updated), revert the changes to the HA group, and repeat.

Someday I hope to automate the process using Ansible and the Proxmox API. My main concern is error handling: this experience shows that there will be occasional issues to account for, either from the migrations or from changes caused by the upgrades.