The Best Way to Prevent Incidents

Organizations that put time and effort into problem management get a huge return on their investment. Although fixing incidents when they happen is important, it’s much better to stop them happening in the first place; and if you can’t do that, then at least make sure you know what you can do to minimize the impact of future incidents.

ITIL (the world’s leading best practice for IT service management) says that the purpose of problem management is “to reduce the likelihood and impact of incidents by identifying actual and potential causes of incidents, and managing workarounds and known errors.”

What are the phases of problem management?

According to ITIL 4 (the latest release of ITIL, published in February 2019), problem management has three phases:

Problem identification – which identifies and logs problems

Problem control – which analyzes problems and develops workarounds

Error control – which monitors and improves workarounds, and resolves problems if this looks cost effective

How do most organizations identify problems?

Most organizations that I’ve worked with use two methods to identify problems:

There’s been a major incident, and the organization needs to understand the underlying causes to ensure the same thing doesn’t happen again. The major incident management process focuses on resolving the incident and restoring normal operations, and then problem management kicks in to analyze what happened and what needs to be done next.

There have been lots of similar incidents. Each of them has been investigated and closed, but they may recur and are causing significant cumulative impact on customers, or on the service provider organization. This cluster of similar incidents is usually identified by trend analysis of incident records, or by good service desk staff recognizing that something similar has happened before. Problem management activity is needed to identify the underlying cause of the incidents and decide how to prevent them in future, or at least reduce their impact.
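As a rough illustration of what trend analysis of incident records can look like, here is a minimal Python sketch. It assumes incidents can be exported as simple (date closed, service, category) records. The field names, the sample data, the 30-day window, and the threshold of three are all hypothetical choices for the example, not something prescribed by ITIL or taken from any particular ITSM tool.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical incident export: (date closed, affected service, short category).
INCIDENTS = [
    (date(2024, 3, 1), "email", "mailbox full"),
    (date(2024, 3, 5), "vpn", "login timeout"),
    (date(2024, 3, 9), "vpn", "login timeout"),
    (date(2024, 3, 14), "vpn", "login timeout"),
    (date(2024, 3, 20), "printing", "driver crash"),
    (date(2024, 1, 2), "printing", "driver crash"),
]


def recurring_clusters(incidents, today, window_days=30, threshold=3):
    """Return (service, category) pairs seen at least `threshold` times
    among incidents closed within the last `window_days` days."""
    cutoff = today - timedelta(days=window_days)
    counts = Counter(
        (service, category)
        for closed, service, category in incidents
        if closed >= cutoff
    )
    return {key: n for key, n in counts.items() if n >= threshold}


# Each cluster returned here is a candidate for logging as a problem.
print(recurring_clusters(INCIDENTS, today=date(2024, 3, 31)))
# prints {('vpn', 'login timeout'): 3}
```

A simple count like this only finds clusters that share exactly the same service and category, so it complements, rather than replaces, service desk staff who recognize that something similar has happened before.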

The trouble with these approaches is that identification comes too late. Problem management activity after incidents have happened is important, as it can help to reduce the impact of future incidents. But it’s much better for everyone if the problem can be identified before it causes any incidents instead of after it’s had a significant impact on the organization.

When’s the best time to identify a problem?

Every incident causes a loss of productivity for one or more users, and requires effort from the service provider organization. If you can identify problems before they cause incidents, then you can provide much better service to your users, and you might even reduce your own costs! This is clearly good for everyone, but it requires some planning and effort.

How to identify problems that haven’t yet caused incidents

So, how can you identify problems without waiting for them to cause incidents first? What activities, processes, or practices can result in problems being logged, analyzed, and resolved before they cause lost productivity and increased costs? Here are some practical steps you can take.

Review vendor websites and announcements

Every organization uses some third-party products as part of their IT solution. This can include:

User devices such as desktop and laptop computers, and phones

Operating system software, running on user devices and on servers

Applications, running on user devices

Commercial software, running as cloud-based services, or on your local servers

Network infrastructure, such as switches, routers, firewalls, etc.

And many more…

All of these products are likely to include defects, and you can often find out about these defects before they have any impact on your users if you monitor the announcements the vendor makes on its website, in newsletters, or through other communications. If you already speak regularly to an account manager at the vendor, they’ll often be able to notify you of significant problems.

Every time you learn about a defect in a third-party product you use, this is an opportunity to address the problem before it’s caused an incident in your environment. Things you might do include:

Develop a plan for how you’ll respond when unavoidable incidents occur, so that you can reduce the impact on your users, and on your IT organization

Understand the exact circumstances that could trigger incidents, and modify how you configure or use the product to avoid triggering them

Monitor future announcements to ensure you can apply any patches or other solutions as soon as they become available

In extreme situations you may want to consider replacing the faulty product with one that does not have the defect. Bear in mind that this is only likely to make sense if the issue is severe, is unlikely to be resolved quickly, and there is a viable alternative product.
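Part of this monitoring can be supported with a small amount of automation. The Python sketch below is one possible approach, assuming you keep an inventory of the products and versions you run and can record vendor announcements in a structured form. The product names, version numbers, and the `Advisory` structure are invented for illustration and do not come from any real vendor feed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Advisory:
    """A vendor-announced defect (hypothetical structure)."""
    product: str
    affected_versions: frozenset
    summary: str


# Hypothetical inventory: product name -> version you have deployed.
INVENTORY = {"ExampleOS": "10.2", "ExampleMail": "4.1", "ExampleRouterFW": "7.0"}

# Hypothetical announcements collected from vendor websites and newsletters.
ADVISORIES = [
    Advisory("ExampleOS", frozenset({"10.1", "10.2"}), "Memory leak under heavy load"),
    Advisory("ExampleMail", frozenset({"3.9"}), "Calendar sync failure"),
    Advisory("ExampleDB", frozenset({"12.0"}), "Index corruption on restore"),
]


def relevant_advisories(inventory, advisories):
    """Return advisories for products you run at an affected version."""
    return [
        a for a in advisories
        if inventory.get(a.product) in a.affected_versions
    ]


# Each match is a candidate problem record, logged before any incident occurs.
for advisory in relevant_advisories(INVENTORY, ADVISORIES):
    print(f"{advisory.product}: {advisory.summary}")
# prints ExampleOS: Memory leak under heavy load
```

Matching on exact version strings keeps the example short; a real inventory would probably need version ranges, and someone still has to decide what to do about each match.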

Work closely with internal development teams

Many organizations have software development teams that develop and maintain applications they use. You need to ensure that you have a good working relationship between your operations staff and your development staff, so that you learn about issues and errors as they arise, and you can work together to plan how to manage any incidents they may cause. You should also work together to prioritize resolution of any issues and errors, to ensure that the ones with most impact are addressed in a timely manner.

Monitor user communities and social media

If you have a very large number of users, and especially if some or all of the users are outside your own company, then it’s important to monitor user communities and social media to find out about issues the users are seeing that they’ve not logged as incidents. Sometimes you’ll discover that users have developed perfectly good workarounds for themselves, and you can adopt these to help address the underlying problem – with, of course, suitable recognition of the people who contributed to the solution where that’s practical.

You can also join user communities that support third-party products that you use, and this may enable you to identify problems that are affecting other organizations before they become visible in your own environment.

Use third-party threat assessment and penetration testing services

These types of service can help you prevent security incidents, by identifying how you might be attacked, and where you might be vulnerable.

Threat assessment services are provided by organizations that monitor a wide variety of other organizations to find out what kinds of threats exist, and the extent to which they’re being exploited. They can provide you with information that may help you to avoid security incidents by proactively taking defensive action, before your own organization comes under attack. Similarly, penetration testing services may identify a vulnerability in your defences that you can address before any incidents occur.

Conclusion

If you only use problem management to analyze incidents that have already happened, then you’ll always be reacting after your users have suffered. Try thinking about what might happen in the future and you can get ahead of problems, and deliver much higher value to your users and your customers, often with a reduction in your own overall costs.

If you’ve other ideas for how to identify problems before they cause incidents then please share them here – and if I ever update this blog I’ll be happy to include them – with suitable recognition for whoever contributed.


About Stuart Rance

Stuart is an ITSM and security consultant, working with clients all round the world. He is one of the authors of ITIL 4, as well as an author of ITIL Practitioner, ITIL Service Transition, and Resilia: Cyber Resilience Best Practice. He is also a trainer, teaching standard and custom courses in ITSM and information security management, and an examiner helping to create ITIL and other exams. Now that his children have all left home, he has plenty of time on his hands for contributing to our blog - lucky us!