Upgrading an OpenStack cloud has become a challenging task, which requires choosing the right approach, careful planning and precise execution to minimize the downtime of the cloud environment. Because of such complexity, cloud operators prefer to skip one or more releases before doing an upgrade.

In this OpenStack tutorial, we discuss different aspects of OpenStack upgrades, identify the major pitfalls when upgrading OpenStack and provide solutions and best practices to avoid these pitfalls.

Update vs. upgrade

First of all, we should draw a strict distinction between updating and upgrading OpenStack. In this OpenStack tutorial, updating means applying bug fixes and security fixes to the OpenStack components and the underlying operating system. Such fixes are usually considered safe for in-place updates, because they do not introduce new functionality and are therefore unlikely to cause regressions.

In contrast, upgrading means moving to a new stable OpenStack release. An OpenStack cloud consists of a number of distributed software components that collaborate with each other to deliver the required cloud services. At first glance, all of these components, including operating system dependencies, must be upgraded at the same time, which makes the upgrade even more complex. The good news is that the OpenStack community aims to keep the component APIs compatible, so an old API version is usually kept and supported for some time. However, an old API can be marked as deprecated and later removed from newer releases.

Planning an OpenStack upgrade

There are several important steps we recommend for planning any OpenStack upgrade:

1. Read the OpenStack release notes thoroughly to identify potential incompatibilities between releases.
2. Choose the proper method for OpenStack upgrade (see below).
3. Prepare a plan to roll back a failed upgrade.
4. Prepare a plan for data backups, at minimum, with backups of configuration files and databases.
5. Determine the acceptable downtime for the cloud, as defined by the SLAs for specific services. If any data loss is projected, notify your users about the service interruption.
6. Test the upgrade method using a test environment similar to the production one.
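The configuration part of the backup plan can be as small as a timestamped archive. The following is a minimal sketch, not a complete backup tool: the function name `backup_config_dirs` and the archive naming scheme are assumptions made for illustration, and the directories to archive depend on which services you run.

```python
import tarfile
import time
from pathlib import Path


def backup_config_dirs(config_dirs, dest_dir, label="pre-upgrade"):
    """Archive existing configuration directories into one timestamped tarball.

    Directories that do not exist are skipped. Returns the archive path.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = dest / f"{label}-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for directory in map(Path, config_dirs):
            if directory.is_dir():
                # Store each tree under its directory name, e.g. "nova/nova.conf".
                tar.add(directory, arcname=directory.name)
    return archive
```

A call such as `backup_config_dirs(["/etc/keystone", "/etc/nova"], "/var/backups/openstack")` would archive two typical configuration directories; both paths are examples only.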

Methods for OpenStack upgrade

Parallel cloud: Deploy a separate OpenStack cloud and migrate all the resources from the old cloud to the upgraded one. This is the simplest and least intrusive method. It also has the simplest rollback procedure. However, it requires extensive hardware resources and leads to lengthy downtime.

Rolling upgrade: The following two methods upgrade each component on each server one by one, finally giving you an upgraded OpenStack cloud:

In-place upgrade: This method requires shutting down each service for the upgrade, which gives you some downtime, though less than the parallel cloud method.

Side-by-side upgrade: Since OpenStack Icehouse, the controllers are decoupled from the compute nodes, so you can upgrade them independently. With this method, you can deploy an upgraded controller, transfer all the data from the old controller to the new one and seamlessly replace the old controller with the new one. The old controller is left untouched, so a rollback should be simple. To achieve zero downtime, you should have more than one controller in HA mode.

Upgrade pitfalls and solutions

Manual upgrades are prone to failure

Upgrades commonly fail when a large number of repetitive tasks must be completed manually. Your cloud consists of many nodes and each node runs a number of services. The services on each node collaborate with other services, and due to this complexity manual upgrades are not a realistic option.

Solution: Use automation for the upgrade. There are many configuration management tools you can use, such as Ansible, Chef and Puppet.

Upgrade of the production cloud can fail

By nature, an OpenStack cloud contains custom settings, and the standard upgrade procedure usually does not honor the custom settings in the configuration files. You should assume that the upgrade of the cloud will fail, so you need to verify the upgrade on a test cloud that is similar to the production one. The test cloud can be smaller than the production one, but it should have the same architecture and configuration.

It is very important to have proper automation implemented in your organization. Both the deployment (for the old release) and the upgrade procedures should be automated, and both should be under configuration management control. You should be able to trace each custom setting back to the original requirement. Before upgrading the production cloud, the upgrade procedure and the corresponding automation should be properly verified with the following standard approach:

1. Deploy a test cloud using the same automation scripts that you used to deploy the production cloud.
2. Apply the upgrade scripts to the test cloud.
3. If the upgrade failed, make the necessary fixes to the upgrade scripts and repeat the procedure from Step 1.
4. If the upgrade completed successfully, verify the test cloud.
5. If the verification failed, make the necessary fixes to the upgrade scripts and repeat the procedure from Step 1.

You can use OpenStack Rally for automated cloud verification. Rally verification scenarios may include the standard ones and custom scenarios that are specific to the cloud under test.
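The verification procedure above is a retry loop, which can be sketched as follows. This is an illustration under stated assumptions rather than a real driver: `deploy`, `upgrade`, `verify` and `fix` are hypothetical hooks that you would wire to your own automation (and, for `verify`, to whatever verification tool you use), and `max_attempts` is an arbitrary safety limit.

```python
def rehearse_upgrade(deploy, upgrade, verify, fix, max_attempts=5):
    """Repeat deploy -> upgrade -> verify on a test cloud until it passes.

    All four arguments are caller-supplied callables (hypothetical hooks).
    `upgrade` and `verify` return True on success; `fix` receives the name
    of the stage that failed. Returns the number of attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        deploy()                 # fresh test cloud from the production scripts
        if not upgrade():        # apply the upgrade scripts
            fix("upgrade")       # fix the scripts, then start over
            continue
        if verify():             # upgrade succeeded: verify the test cloud
            return attempt
        fix("verify")            # verification failed: fix and start over
    raise RuntimeError(f"upgrade still failing after {max_attempts} attempts")
```

Redeploying on every attempt is deliberate: it keeps each rehearsal independent of the leftovers of a previous failed run.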

The cloud’s performance will degrade

Each OpenStack release introduces new features and brings new bugs, but more importantly, it may require a new hardware configuration. A new OpenStack release might require additional or faster CPUs, more memory and more disk space. This is true for several OpenStack releases, including Liberty. Community optimization efforts may eventually lead to lower requirements, but at the moment you should expect the performance of your cloud to degrade after an upgrade.

To proactively identify and solve such performance issues, you need to benchmark and profile both of your clouds: the old one and the new one. You should be able to identify any performance degradation and add resources for the OpenStack services under high load. You can use OpenStack Rally for automated cloud benchmarking and profiling.
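Once both clouds have been benchmarked, comparing the results is simple arithmetic. The sketch below assumes you have already reduced each benchmark run to a mapping from scenario name to a duration in seconds; it does not use or imitate Rally's own report format, and the function name and the 10% tolerance are illustrative choices.

```python
def find_regressions(before, after, tolerance=0.10):
    """Return {scenario: relative_slowdown} for scenarios slower than tolerance.

    `before` and `after` map a scenario name to a duration in seconds,
    measured on the old and on the upgraded cloud respectively.
    """
    regressions = {}
    for scenario, old in before.items():
        new = after.get(scenario)
        if new is None or old <= 0:
            continue  # nothing comparable was measured for this scenario
        slowdown = (new - old) / old
        if slowdown > tolerance:
            regressions[scenario] = round(slowdown, 3)
    return regressions
```

Scenarios that show up in the result are the candidates for extra resources or deeper profiling.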

Unclean shutdown of the services may lead to an inconsistent state of the cloud

Before it stops, a service should complete all the requests it has already received from the message queue and notify the message queue to stop sending it new requests. You should shut down OpenStack services gracefully and give them enough time to complete all the active requests and report their unavailability to the message queue. Shut down one service at a time, upgrade it and start it again, then do the same for the next one.
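The idea of a graceful shutdown (stop accepting work, finish what is already queued, then stop) can be shown with a toy model. The `Worker` class below is purely illustrative: it uses an in-process queue and a thread as stand-ins for a message queue and a service, and it is not how any OpenStack service is implemented.

```python
import queue
import threading


class Worker:
    """Toy stand-in for a service that consumes requests from a queue."""

    _STOP = object()  # sentinel placed behind all pending requests

    def __init__(self):
        self._inbox = queue.Queue()
        self._accepting = True
        self.completed = []
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, request):
        """Queue a request; refuse it once shutdown has begun."""
        if not self._accepting:
            return False
        self._inbox.put(request)
        return True

    def _run(self):
        while True:
            request = self._inbox.get()
            if request is self._STOP:
                break
            self.completed.append(request)  # "handle" the request

    def shutdown(self, timeout=10.0):
        """Stop accepting work, drain the queue, then stop the thread.

        Returns True if the worker finished within `timeout` seconds.
        """
        self._accepting = False
        self._inbox.put(self._STOP)
        self._thread.join(timeout)
        return not self._thread.is_alive()
```

Because the sentinel is queued behind every pending request, the worker only stops after the requests accepted before shutdown have been handled. The timeout mirrors the advice to give a service enough time, but not unlimited time, to finish.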

Upgrading the services in the wrong order may break the cloud

You can easily break the cloud by upgrading the services in the wrong order. The following order is recommended:

1. Upgrade OpenStack Identity (Keystone)
2. Upgrade the OpenStack Image service (Glance)
3. Upgrade OpenStack Compute (Nova)
4. Upgrade OpenStack Networking (Neutron)
5. Upgrade OpenStack Block Storage (Cinder)
6. Upgrade the OpenStack dashboard (Horizon)
7. Upgrade OpenStack Orchestration (Heat)
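If your automation takes a list of installed services as input, the recommended order above can be encoded once and reused. This is a small sketch; the lowercase service names and the function name `upgrade_sequence` are conventions chosen for this example.

```python
# Recommended order from the list above, using lowercase project names.
RECOMMENDED_ORDER = [
    "keystone", "glance", "nova", "neutron", "cinder", "horizon", "heat",
]


def upgrade_sequence(installed):
    """Return the installed services sorted into the recommended order.

    Raises ValueError for a service with no known position, so that an
    unplanned component is noticed instead of silently upgraded last.
    """
    rank = {name: position for position, name in enumerate(RECOMMENDED_ORDER)}
    unknown = sorted(set(installed) - set(rank))
    if unknown:
        raise ValueError(f"no recommended position for: {', '.join(unknown)}")
    return sorted(set(installed), key=rank.__getitem__)
```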

Upgrade will fail due to old or missing system dependencies

A new OpenStack release introduces new system dependencies and requires upgraded versions of the existing ones. An upgraded OpenStack service will fail to start, or will terminate with a runtime failure, if some of its system dependencies are not installed or upgraded.

When upgrading the OpenStack services, make sure that all the dependencies are also upgraded properly. Usually this means that all of the OpenStack components are installed from packages (deb or rpm) with correctly defined and tested dependencies. Even in this case, depending on the specific configuration, upgrading the packages can break some services. If the package manager (yum or apt-get) asks you to update configuration files, we recommend rejecting the changes. Instead, review and change the configuration files yourself and restart the services manually.

Database downgrades are not supported

Most of the OpenStack services support database migrations. That means that each service will try to upgrade its database during startup. Usually the automated upgrade is well tested for the stable OpenStack release and can be used safely (it can be disabled in favor of a manual upgrade, if necessary). However, starting from Kilo, database downgrades are not supported. Thus, the only reliable way to roll back a database is to restore it from a backup.
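Since restoring from a backup is the only rollback path, the dump should be taken before any migration runs. The sketch below only builds a `mysqldump` command line. It assumes the service databases live in MySQL or MariaDB and that connection credentials come from a client option file; the function name, the output file name and the database names in the usage note are illustrative.

```python
import time


def mysqldump_command(databases, dest_dir, stamp=None):
    """Build a mysqldump invocation that dumps the given databases to one file.

    Credentials are deliberately not passed on the command line; this sketch
    assumes they are supplied by a MySQL client option file.
    """
    stamp = stamp or time.strftime("%Y%m%d-%H%M%S")
    outfile = f"{dest_dir}/openstack-db-{stamp}.sql"
    return [
        "mysqldump",
        "--single-transaction",        # consistent snapshot for InnoDB tables
        f"--result-file={outfile}",
        "--databases", *databases,
    ]
```

The returned list can be passed to `subprocess.run(command, check=True)`, for example with `mysqldump_command(["keystone", "nova"], "/var/backups/openstack")`. Restoring the dump on a test cloud is the only way to be sure the backup is usable.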

Configuration files will not be upgraded automatically

Each OpenStack release introduces changes to the configuration files. Options can be removed, renamed or moved to other sections. New options can be added with default values that can break your cloud. Read the release notes thoroughly to identify such changes and apply them to your configuration files. For example:

In Juno, the ‘identity_uri’ option should be used in the ‘[keystone_authtoken]’ section instead of ‘auth_host’, ‘auth_port’ and ‘auth_protocol’ for all of the services.

In Kilo, when using libvirt 1.2.2, live snapshots are disabled by default. Deployers can set ‘workarounds.disable_libvirt_livesnapshot=False’ in nova.conf to enable live snapshot support.

In Liberty, setting ‘force_config_drive=always’ in nova.conf is deprecated; use True/False boolean values instead.
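Checks like these are easy to automate once you have extracted them from the release notes. The following sketch covers only the Juno and Liberty examples mentioned above and assumes the configuration file can be read by Python's standard `configparser`, which holds for simple INI-style files but is not guaranteed for every OpenStack configuration file. The function name `audit_config` and the message format are made up for this example.

```python
import configparser

# Checks taken from the examples above: (section, option, advice).
DEPRECATED_OPTIONS = [
    ("keystone_authtoken", "auth_host", "use identity_uri (Juno)"),
    ("keystone_authtoken", "auth_port", "use identity_uri (Juno)"),
    ("keystone_authtoken", "auth_protocol", "use identity_uri (Juno)"),
]


def audit_config(text):
    """Return human-readable findings for known-deprecated settings."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(text)
    findings = []
    for section, option, advice in DEPRECATED_OPTIONS:
        if parser.has_option(section, option):
            findings.append(f"[{section}] {option}: {advice}")
    # force_config_drive lives in [DEFAULT]; only the value "always" is flagged.
    value = parser.defaults().get("force_config_drive", "")
    if value.strip().lower() == "always":
        findings.append("[DEFAULT] force_config_drive=always: use True/False (Liberty)")
    return findings
```

Running such an audit against the production configuration files before the upgrade turns the release notes into a repeatable check.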

Upgrade will fail due to new, deprecated or removed API

If you have custom scripts or other software that uses the OpenStack API, be prepared for them to break after the upgrade, because a new OpenStack release can introduce a new API version and mark the old version as deprecated. In the worst-case scenario, the old API is removed from the release. Read the release notes thoroughly to identify such changes and apply them to your cloud. For example:

In Kilo, EC2 API support in Nova has been deprecated and is planned to be removed in a later release.

In Liberty, the Load Balancer as a Service (LBaaS) V1 API is marked as deprecated and is planned to be removed in a future release. Going forward, the LBaaS V2 API should be used.

Upgrade will fail due to deprecated or removed features, plugins and drivers

If you are using a vendor-specific plugin, be prepared for a failed upgrade, because such a feature or plugin may be deprecated or even removed in a new OpenStack release. Read the release notes thoroughly to identify such changes and apply them to your cloud. For example:

In Kilo, XML support in Keystone has been removed

In Kilo and Liberty, many monolithic vendor-specific plugins have been removed from Neutron

Upgrade will fail due to architectural changes

In some cases, your cloud may depend on a specific architectural feature of the old OpenStack release that is changed or deprecated in a new release. Read the release notes thoroughly to identify such changes and apply them to your cloud. For example:

Use Python 3 instead of Python 2.6

Use the PyMySQL database driver instead of MySQL-Python

Use the unified ‘openstack’ client instead of ‘keystone’, ‘glance’, etc.

In Liberty, Ceilometer Alarms is deprecated in favor of Aodh

In Kilo and Liberty releases, the Keystone project deprecates eventlet in favor of a separate web server with WSGI extensions
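Some of these changes can be found in your own tooling before the upgrade. As one example, the sketch below scans the text of a shell script for calls to per-project command-line clients so that they can be reviewed for a move to the unified ‘openstack’ client. The names ‘keystone’ and ‘glance’ come from the list above; the other client names, the function name and the simple line-based matching are assumptions, and the scan will miss calls hidden in pipes or subshells.

```python
import re

# 'keystone' and 'glance' are named in the text; the remaining entries are
# assumed examples of other per-project clients.
LEGACY_CLIENTS = ("keystone", "glance", "nova", "neutron", "cinder")

# Match a legacy client at the start of a command, optionally after sudo.
_PATTERN = re.compile(
    r"^\s*(?:sudo\s+)?(" + "|".join(LEGACY_CLIENTS) + r")\s+\S"
)


def find_legacy_cli_calls(script_text):
    """Return (line_number, client) pairs for lines that start a legacy call."""
    hits = []
    for number, line in enumerate(script_text.splitlines(), start=1):
        match = _PATTERN.match(line)
        if match:
            hits.append((number, match.group(1)))
    return hits
```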

The future of OpenStack upgrades

To help solve challenges related to upgrading OpenStack, the OpenStack community has adopted a Big Tent approach for new releases. With the Big Tent model, operators will be able to select the preferred components and their versions, and then add or upgrade modules incrementally with little or no downtime.

This post first appeared on Stratoscale’s blog.
