How to investigate and fix production issues

If you see a recurring issue in your production environment, what do you do?

Well, surprisingly, a restart works most of the time. But if you are managing tens of services on hundreds of machines, with the code deployed in multiple data centers, even restarting can become a full-time maintenance job. I have seen teams managing microservices that had to keep 2–3 software engineers just to look after the services and restart servers from time to time.

Production issues are by no means rare, and maintaining services that have a lot of unexplained production issues can be really, really painful. It can lead to all sorts of problems:

1. People start to search for new jobs, which can create a feedback loop and make the problem even worse.

2. Developers are constantly distracted by having to resolve issues.

3. Working on the actual new task becomes hard.

4. Over time, people and the team can become afraid to release new features.

5. Managers don’t trust their team members anymore and start to come up with a lot of processes to stop people from releasing things.

Okay, so now we have established that it’s a serious problem. What’s the solution?

Basically, I think there are two problems with fixing production issues:

1. Prioritization

2. Technical challenges / debugging

I have seen that sometimes picking up the task to fix the issue can be a bigger problem than actually solving it, and I think there are multiple reasons for this.

1. Production issues come out of nowhere and screw up planned tasks, so we tend to ignore them.

2. It is very hard to predict how much time it will take to fix the issue.

3. “It just happened a few times.” We somehow hope that it will go away by itself.

4. Restarting again and again seems like the easier option to many people.

Solving the above problems can be tricky. You have to convince your manager and team members to stop working on some task and pick up the investigation. But let’s say you somehow got it prioritized. How do you solve it?

Over time I have learned that debugging or any production investigation is like a scientific process of problem-solving. There are a few things you have to keep in mind.

1. Before making any hypothesis about the issue, try to learn about the system around it. Do a lot of experimentation. Plot graphs if required, make notes, etc. This is critical because if you form a hypothesis or assumptions about the issue without knowing the system, it’s very hard to get out of those assumptions and see the full picture.

2. Try to reproduce the issue in a minimal setting. This makes it faster to experiment and try things out. Eventually, it will help you validate the fix as well.

3. Validate simple hypotheses first. A lot of the time we come up with a complicated hypothesis that is also hard to validate, but the problem ends up being pretty simple.

4. A lot of the time the issue is a known issue in some library. Search on Stack Overflow, or post a new question there with all the details. You will be surprised by the insights you can get just from posting the question, because it forces you to reduce the problem to a minimal reproduction scenario. Of course, the Stack Overflow community is also great, so they can help you with answers.

5. If you still can’t find the root cause, give the issue to a coworker for a fresh perspective. Sometimes it’s just very hard to get out of your own assumptions.

6. If you still can’t find the issue, document your main findings, find a workaround (it should be better than restarting every time the issue happens :P), and move on.

7. Learn from your mistakes and be better next time.
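To make the idea of reproducing the issue in a minimal setting a bit more concrete, here is a rough Python sketch. It is only an illustration, not a recipe: `failure_rate`, `buggy_lookup`, and `fixed_lookup` are made-up names, and the "cold cache" bug is a hypothetical stand-in for whatever your real issue is. The point is that once the issue is reduced to a small function you can run many times, you can measure how often it fails, and later run the same harness against the candidate fix.

```python
import random
from collections import Counter


def failure_rate(scenario, runs=1000, seed=0):
    """Run `scenario` repeatedly and summarize how often, and how, it fails."""
    rng = random.Random(seed)  # fixed seed so the same run can be repeated
    errors = Counter()
    for _ in range(runs):
        try:
            scenario(rng)
        except Exception as exc:
            # Group failures by exception type to see what actually goes wrong.
            errors[type(exc).__name__] += 1
    return sum(errors.values()) / runs, errors


# Hypothetical stand-in for the real bug: a lookup that assumes the key is
# always cached, so it fails whenever the cache happens to be cold.
def buggy_lookup(rng):
    cache = {"user": 42} if rng.random() > 0.1 else {}
    return cache["user"]


# Candidate fix for the hypothetical bug: tolerate a cold cache.
def fixed_lookup(rng):
    cache = {"user": 42} if rng.random() > 0.1 else {}
    return cache.get("user")


if __name__ == "__main__":
    print("before fix:", failure_rate(buggy_lookup))
    print("after fix: ", failure_rate(fixed_lookup))
```

Running the same harness before and after the change gives you a simple hypothesis to check first ("the failure is always the same exception type") and a cheap way to confirm that the fix really removes it.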

I see lots and lots of hours getting wasted on solving issues, so if this helps even a few people fix their issues faster, then writing it was worth it. If you liked the content, please clap (👏) and let me know what you think.