As posted on BetaNews

Anyone who has worked with cloud platforms (and especially with OpenStack) probably knows the distributed and decoupled nature of those systems. A decoupled, distributed system is built from microservices, each doing a specific task and exposing its own REST API. Those microservices usually talk to each other through a lightweight messaging layer, normally a message broker such as RabbitMQ, Qpid or a similar solution.

That is exactly how OpenStack works. Each major OpenStack component (Keystone, Glance, Cinder, Neutron, Nova, etc.) exposes a REST endpoint, and the components (and their sub-components) communicate with each other through a message broker layer, normally RabbitMQ.

That approach ensures that the cloud infrastructure can isolate failures to specific components without spreading the problem across the whole system, and it allows cloud infrastructure operators to scale all services horizontally and distribute the load in a smart way. However, it brings a problem inherent in the way the complete system is distributed: how do we properly monitor the services and, especially, how do we identify possible show-stoppers (also known as “single points of failure”) that can render the service partially (or completely) offline?

In the following sections we will try to pinpoint the real-world challenges of proper OpenStack monitoring, and what possible solutions can be implemented for each of those challenges:

1) The system is not monolithic.
2) OpenStack is “NOT ONLY” OpenStack.
3) Do not rely on monitoring just the default metrics.
4) Proper procedures avoid more failures.



1) The system is not monolithic.



How can we know the real impact on our service when a specific component fails?





The distributed, decoupled nature of OpenStack works directly against any easy way to monitor the state of the service as a whole. You will surely say: “Hey... but not being monolithic is one of the biggest advantages of OpenStack”. Surely yes, but in a distributed system where each component does a specific task, how do you relate failures in a specific component to the status of the whole service? Also, because every major OpenStack component (Nova, Cinder, Neutron, etc.) is itself distributed (with multiple sub-components that accomplish specific tasks inside the service), how can we possibly know the impact on the service when a specific piece of software fails?

The first step in overcoming this challenge is to become intimately familiar with your cloud environment: identify the relations between all major components and the specific function each one accomplishes in the cloud.

Also, for each specific major component, identify specific services whose failures can impact your services.

Simply put: Know the relations between all components in the cloud.

With that in mind, you need to implement ways not only to monitor the state (up-and-running or stopped-and-failed) of specific services, but also to identify which other services can be affected by the failure of a given component “X”.

For example: if Keystone dies, nobody will be able to obtain the service catalog or log into any service. That normally does not affect the currently running instances (virtual machines) or any other already-created cloud services (object storage, block storage, load balancers, etc.) unless those services are restarted while Keystone is still down. Moreover, if Apache fails, that can also affect Keystone (which runs through Apache WSGI) and other API services in the OpenStack cloud that also run through Apache.
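One way to make those relations explicit is a small dependency map that the monitoring logic can walk whenever a component fails. The following is only a minimal sketch: the map contents and the `impacted_by` helper are illustrative assumptions, not a description of any particular deployment, and should be replaced with the relations found in your own cloud.

```python
from collections import deque

# Hypothetical dependency map: each key depends directly on the listed components.
# The entries are illustrative only; replace them with your own cloud's relations.
DEPENDS_ON = {
    "keystone": ["apache", "database"],
    "horizon": ["apache", "keystone"],
    "glance-api": ["keystone", "database"],
    "nova-api": ["keystone", "database", "rabbitmq"],
    "nova-compute": ["rabbitmq", "libvirt"],
    "cinder-api": ["keystone", "database", "rabbitmq"],
    "neutron-server": ["keystone", "database", "rabbitmq"],
}


def impacted_by(failed, depends_on=DEPENDS_ON):
    """Return every service that depends, directly or transitively, on `failed`."""
    # Invert the map: component -> services that depend on it directly.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk over the inverted map.
    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in impacted:
                impacted.add(service)
                queue.append(service)
    return sorted(impacted)


if __name__ == "__main__":
    # With the map above, an Apache failure reaches Keystone and, through it,
    # every API service that needs Keystone.
    print(impacted_by("apache"))
```

An alarm for a failed component can then carry the list of services it puts at risk, instead of reporting only that one process is down.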

In conclusion, the OpenStack monitoring platform or solution, whatever it is, needs not only to be capable of monitoring the status of individual services, but also to be capable of correlating service failures in order to pinpoint the real impact when Murphy attacks us, and of sending the proper alarms or notifications accordingly.





2) OpenStack is “NOT ONLY” OpenStack





Before you go into a kernel panic trying to understand why “OpenStack” is “NOT ONLY” OpenStack, remember the first challenge: the OpenStack-based cloud is a distributed and decoupled system. And not only that: OpenStack is really an orchestration solution that creates resources in the operating system and in other devices inside or related to the cloud infrastructure. What are those resources? Virtual machines (in Xen, KVM or other hypervisor software), persistent volumes (in NFS storage servers, Ceph clusters, SAN-based LVM volumes or other storage backends), network entities (ports, bridges, networks, routers, load balancers, firewalls, VPNs, etc., running on specific components like iptables, kernel namespaces, HAProxy, Open vSwitch and many other sub-components), ephemeral disks (qcow2 files residing in an operating system directory), and many other small systems.

You can probably see it now: the OpenStack monitoring solution needs to take into account the underlying components that are also, to some extent, part of the complete cloud infrastructure. While it is true that those components can be less complex than the main OpenStack modules and are less prone to fail, when they do fail (and they do), the logs of the major OpenStack services can obscure the actual cause. They show only the consequence in the affected OpenStack service, not the root cause in the device or operating system software that actually failed.

Another example: a libvirt failure will leave nova-compute unable to deploy a virtual instance. Nova-compute “as a service” will be up and running, but the instances will fail (instance state: error) in the deployment stage. What would be the proper monitoring solution to detect this? Monitor libvirt (the service state, its metrics and its logs) along with the nova-compute logs. Relate events between the underlying software and the major components, and you will easily identify what is really wrong!
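Relating events between the two layers can be as simple as pairing each symptom with the underlying events that happened shortly before it. The sketch below assumes that log lines have already been parsed into `(timestamp, source, message)` tuples; that tuple shape, the `correlate` helper and the 30-second window are assumptions made for illustration, not a format produced by libvirt or nova-compute.

```python
from datetime import datetime, timedelta


def correlate(symptoms, causes, window=timedelta(seconds=30)):
    """Pair each symptom event with the cause events that happened at the
    same time or at most `window` before it.

    Events are (timestamp, source, message) tuples. This shape is an
    assumption of the sketch, not a real log format.
    """
    pairs = []
    for s_time, s_source, s_msg in symptoms:
        related = [
            (c_time, c_source, c_msg)
            for c_time, c_source, c_msg in causes
            if timedelta(0) <= s_time - c_time <= window
        ]
        pairs.append(((s_time, s_source, s_msg), related))
    return pairs


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 12, 0, 0)
    # Made-up events standing in for parsed log lines.
    libvirt_events = [(t0, "libvirt", "daemon connection lost")]
    nova_events = [
        (t0 + timedelta(seconds=5), "nova-compute", "instance went to error state"),
        (t0 + timedelta(minutes=10), "nova-compute", "instance went to error state"),
    ]
    for symptom, related in correlate(nova_events, libvirt_events):
        cause = related[0][2] if related else "no underlying event in window"
        print(symptom[0].isoformat(), symptom[2], "<-", cause)
```

A symptom with no related underlying event is still worth an alarm; it simply tells the operator that the root cause has to be looked for elsewhere.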



In conclusion, we need to monitor the end of the chain and run consistency tests across all final services: monitor storage, monitor networking, monitor the hypervisor layer, monitor every component individually, and be prepared to analyze logs and metrics when those failures do happen. And once again: be able to relate things!



3) Do not rely on monitoring just the default metrics





Many monitoring solutions in the open source world (Cacti, Nagios and Zabbix being good examples) define a very specific set of metrics that are good for identifying possible problems in the operating system. They do not offer specialized metrics that reveal more complex failure situations, or even the “real” state of a service. That is where you also need to think outside the box and implement specialized metrics and tests that tell you whether your services are OK, degraded, or completely failed.

A distributed system like OpenStack, where every core service exposes a REST API and also connects to a TCP-based message service, is susceptible to networking bottlenecks and connection-pool exhaustion. Also, many OpenStack services connect to SQL databases, which can exhaust their maximum number of connections. That means proper connection-state metrics (established, fin-wait, closing, etc.) need to be implemented in the monitoring solution in order to detect connection-related problems that affect the API. Moreover, proper CLI tests (which can be based on scripts using the “openstack” command line tool) can be constructed to check the endpoint state and measure its response time. That response time can be converted into a metric that shows the real state of our service. Translation: service “X” endpoint is slow.
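As a rough illustration of a connection-state metric, the sketch below counts TCP states for one local port from the text printed by `ss -tan`. The `count_states` and `current_states` helpers are hypothetical names, and the port you pass in depends on where your API endpoint, broker or database actually listens.

```python
import subprocess
from collections import Counter


def count_states(ss_output, port):
    """Count TCP connection states for sockets whose local port is `port`.

    Expects the text printed by `ss -tan`: a header line followed by rows of
    State, Recv-Q, Send-Q, Local Address:Port, Peer Address:Port.
    """
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip blank or short lines and the header row.
        if len(fields) < 5 or fields[0] == "State":
            continue
        state, local = fields[0], fields[3]
        # rsplit keeps IPv6 addresses such as [::1]:5000 intact.
        if local.rsplit(":", 1)[-1] == str(port):
            counts[state] += 1
    return counts


def current_states(port):
    """Run `ss -tan` on this host and count the states for `port`."""
    result = subprocess.run(
        ["ss", "-tan"], capture_output=True, text=True, check=True
    )
    return count_states(result.stdout, port)
```

Each counter (for example the number of ESTAB or TIME-WAIT sockets) can be exported as its own metric, so that a steady climb toward the connection limit becomes visible before the API stops answering.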

We are really lucky here, because each of the aforementioned monitoring solutions (and most real-world solutions on the street, either commercial or open source) can be “extended” with specialized metrics that you design yourself.

Another example: you can use the command “time openstack catalog list” to measure the Keystone API response time, and also evaluate whether the answer is correct, so that you can raise an artificial failure state when the answer is not what you expected. You can use simple operating system tools like “netstat” or “ss” to monitor the different connection states on your API endpoints and spot possible problems in your service. The same can be done for critical dependencies of the OpenStack cloud, such as the message broker and the database services. Note that a message broker failure will basically kill your OpenStack cloud. Fail to monitor that properly, and expect the eventual consequences!
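The same idea can be wrapped in a small script that a monitoring agent calls as a custom check. This is a sketch under stated assumptions: the `classify` and `timed_check` helpers, the 2-second and 10-second thresholds, and the expected string “keystone” are all illustrative choices, and the command only works where the “openstack” client is installed and credentials are loaded in the environment.

```python
import subprocess
import time


def classify(returncode, output, elapsed, expected, warn_after, fail_after):
    """Turn one check result into OK, DEGRADED or FAILED."""
    if returncode != 0 or expected not in output:
        return "FAILED"      # command failed or the answer is not what we expect
    if elapsed >= fail_after:
        return "FAILED"
    if elapsed >= warn_after:
        return "DEGRADED"    # correct answer, but the endpoint is slow
    return "OK"


def timed_check(command, expected, warn_after=2.0, fail_after=10.0):
    """Run `command`, measure its wall-clock time and classify the result.

    The default thresholds are arbitrary placeholders; tune them to your cloud.
    """
    start = time.perf_counter()
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=fail_after
        )
    except (subprocess.TimeoutExpired, OSError):
        # The command timed out or could not be started at all.
        return "FAILED", time.perf_counter() - start
    elapsed = time.perf_counter() - start
    state = classify(
        result.returncode, result.stdout, elapsed, expected, warn_after, fail_after
    )
    return state, elapsed


if __name__ == "__main__":
    # Assumes the catalog output mentions "keystone"; adjust to your catalog.
    state, elapsed = timed_check(["openstack", "catalog", "list"], expected="keystone")
    print(f"keystone_api state={state} response_time={elapsed:.2f}s")
```

Keeping the classification separate from the command execution makes it easy to reuse the same thresholds logic for other endpoints, or to feed the raw response time to the monitoring system as a metric of its own.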

In conclusion, don’t be lazy and do your homework. Implement service-related metrics and don’t satisfy yourself with just the defaults.

4) Proper procedures avoid more failures.

If you have seen TV programs about airplane-related accidents, you have probably noticed how many of those accidents happen due to human error, and/or by not following proper procedures (or badly designed procedures).

The human factor is in everything. Murphy is there, but it is ultimately up to us to open our door to him. If you fail to write (and test) a scenario response procedure, you will not only fail to correct a problem, but also probably turn a single failure into “many” failures, which can cause loss of revenue, and loss of your job.

Any possible incident in your cloud infrastructure, and its related alarms in your monitoring solution, should be very well documented, with proper and very clear steps to contain the problem and then solve it. The first step in problem solving is proper detection. If the alarms in your OpenStack monitoring solution are well conceived, you will be able to detect the “real” failure cause. After that, you need to apply containment. When an incident is properly detected, you must apply the right containment steps so that the problem is isolated and the failure does not spread or get worse. When the containment stage is done, you should apply definitive correction steps, either through a scheduled maintenance operation or, if your service is already compromised, through an emergency maintenance operation to reestablish the affected services. This is where ITIL guidelines and the like are good to follow.
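One cheap safeguard is to check that every alarm in the monitoring solution has a written step for each stage (detect, contain, correct) before an incident happens. The sketch below is purely illustrative: the alarm names, the step texts and the `undocumented_stages` helper are hypothetical placeholders, not recommended procedures.

```python
REQUIRED_STAGES = ("detection", "containment", "correction")

# Hypothetical runbook entries keyed by alarm name; the texts are placeholders.
RUNBOOK = {
    "rabbitmq_down": {
        "detection": "Confirm the broker state on every node.",
        "containment": "Isolate the affected services so the failure does not spread.",
        "correction": "Restore the broker and verify that services reconnect.",
    },
    "keystone_api_slow": {
        "detection": "Compare the response time with the connection-state metrics.",
    },
}


def undocumented_stages(runbook, required=REQUIRED_STAGES):
    """Return {alarm: [missing stages]} for alarms that lack a written step."""
    gaps = {}
    for alarm, steps in runbook.items():
        missing = [stage for stage in required if not steps.get(stage, "").strip()]
        if missing:
            gaps[alarm] = missing
    return gaps


if __name__ == "__main__":
    for alarm, missing in undocumented_stages(RUNBOOK).items():
        print(f"{alarm}: no documented step for {', '.join(missing)}")
```

Running such a check whenever a new alarm is defined keeps the procedures from falling behind the monitoring configuration.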

If you have a smart OpenStack monitoring system (with some degree of artificial intelligence) that can relate events and recommend proper solutions to detected incidents, take this into account: if you feed your system with inaccurate, incomplete, or generally wrong information, the output will likely be inaccurate, incomplete, or generally wrong too. The “smart” system is not the culprit, it’s you!

In conclusion, the systems are not guilty if they do their job wrongly. It is the “human factor” striking us again and again. Invite Murphy, and he will happily sit at your side every day. Reduce the impact of the human factor on your monitoring solution, and you will almost forget about Mr. Murphy.
