How We Fixed a Running Python App in Production Using GDB

Don’t try this in production. Unless you absolutely need to.

GNU Debugger, or gdb, is a well-known debugging tool, commonly used to debug programs written in low-level languages such as C. In some situations, however, it can serve other purposes, even (with great caution!) fixing problems in apps running in production.

The App

The situation I’m about to describe happened to us at work. At the time I was working on a team tasked with supporting a complex legacy monolithic app (written in Python) while incorporating it into a new system that would eventually replace the old monolith.

The architecture of the new system relied heavily on Apache Kafka, which we used as an event bus between all components of the system. We ran Kafka as part of the Confluent Platform, which bundles a set of services and tools that make Kafka easier to use. One of these services is the Schema Registry: Confluent encourages using Avro, a schema-based binary format, as the primary format for data interchange through Kafka, and the Schema Registry is meant to ensure that message producers won’t make backward-incompatible changes to a message schema that could break the consumers.

The Problem

The Schema Registry is basically a RESTful HTTP service, so Confluent’s client libraries for the platform communicate with the Schema Registry through its HTTP API. The Python client library, for example, uses the requests library for this purpose.

requests is a great library (just see the list of acknowledgements on the documentation page) that makes working with HTTP in Python very simple. In fact, if you develop in Python, it’s very likely that you’ve used it at some point. And if you have, you probably know about one serious issue with the library: by default, HTTP requests are executed without any timeout and can hang indefinitely, both while the connection is being established and while waiting for the remote server to send a response. This problem has been brought up in requests’ issue tracker many times over the years and will be addressed in the next major version of requests, which promises “better defaults and required timeouts”, but for now it is the responsibility of every developer to specify a timeout for every request. While this behaviour is fine for “Hello World”-level scripts, it poses a risk for real-world apps, because a simple coding error (omitting an optional parameter) can cause the application to hang, eventually making it unresponsive to clients’ requests or interrupting the processing of a data file or stream.
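The failure mode is easy to reproduce with a plain socket, which is what requests ultimately blocks on: the TCP connection succeeds, but the peer never sends a byte. A minimal, self-contained sketch (the 0.2-second timeout is an arbitrary value for the demo; a blocking socket with no timeout would sit in recv() forever):

```python
import socket

# A listener that accepts connections at the kernel level but never
# sends anything -- the same situation as a server that takes a
# request and then goes silent.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)

# With a timeout set, the hang turns into a catchable exception.
client = socket.create_connection(listener.getsockname(), timeout=0.2)
timed_out = False
try:
    client.recv(1)  # nothing will ever arrive
except socket.timeout:
    timed_out = True
finally:
    client.close()
    listener.close()

print("timed out:", timed_out)  # → timed out: True
```

The same principle applies one layer up: passing `timeout=...` to a requests call turns a silent hang into a `requests.exceptions.Timeout` that the application can handle.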

As you’ve probably guessed by this point, Confluent’s client library contains the same mistake: it makes HTTP requests to the Schema Registry without timeouts.

The Hang

One day our operations team performed maintenance on the Kafka cluster. Shortly after the maintenance we received an alert from our monitoring system, telling us that the Kafka consumer inside the app was not processing any new messages. All the other functionality of the app was intact, so we started investigating what could have gone wrong with the consumer. Thanks to the extensive logging we had added to the application, we were able to quickly deduce that the consumer thread was stuck making an HTTP request to the Schema Registry. We didn’t know exactly why this request hung (though it was probably caused by the maintenance), but we did know that we needed to make the consumer interrupt the request and continue processing messages from Kafka.

Thankfully, at that time the new Kafka-based architecture had been rolled out to only a very small fraction of users, so we had time to work out a solution with the least possible disruption of service for everyone.

The Solution

The technically simplest way to interrupt the request was obvious: stop and restart the consumer. However, there was no way to do this without restarting the whole application, and that would mean downtime for all users of the service in the middle of the day, a disproportionate measure considering the number of affected users. We had to find another way.

Unable to restart the app, we kept thinking about how we could terminate only the hanging request. We knew that our consumer would re-process the message and continue running if we managed to interrupt the request; we just didn’t know how to do that.

After some googling we found the magic combination of lsof, netstat and other Linux commands that gave us the file descriptor of the socket underneath the hanging request, and the only remaining challenge was to find a way to close that socket. Unfortunately, we quickly realised that there is no simple way to close a file descriptor owned by another process, short of attaching to the process with a debugger.
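On Linux, the first half of that hunt (finding which file descriptors of a process are sockets) can also be done without lsof by reading /proc. This is a sketch, not the exact commands we ran, and the helper name is made up; it is Linux-only because it relies on the /proc filesystem:

```python
import os
import socket

def socket_fds(pid):
    """Return the file descriptors of `pid` that refer to sockets,
    by reading the symlinks in /proc/<pid>/fd (Linux-only)."""
    fd_dir = "/proc/%d/fd" % pid
    sockets = {}
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # fd closed between listdir() and readlink()
        if target.startswith("socket:"):
            sockets[int(name)] = target  # e.g. "socket:[123456]"
    return sockets

# Demo against our own process: open a socket and find its fd.
s = socket.socket()
found = s.fileno() in socket_fds(os.getpid())
s.close()
print(found)  # → True
```

The `socket:[inode]` targets can then be matched against /proc/net/tcp (or the output of `ss`) to identify which fd belongs to the hanging connection.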

This is where gdb came into play. However, we could not simply start gdb and type commands in interactive mode, as is typically done when debugging, because attaching would suspend the event loop in the application’s main thread, effectively preventing the application from serving users while we typed. Instead, we had to figure out all the commands in advance and supply them to gdb as command-line arguments.

The full command was like this:

gdb -p <pid> -ex "call close(<fd>)" -ex "set confirm off" -ex "quit"

The command does the following:

- launches gdb, telling it to attach to the application process with pid <pid>;
- executes the system call close(<fd>), where <fd> is the file descriptor we want to close;
- disables gdb’s internal option confirm, allowing us to quit gdb without a confirmation prompt;
- and finally quits gdb.
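If you ever need to do this more than once, the invocation is easy to wrap in a small helper. This is a sketch, not something from our codebase: the function names are made up, and running it requires gdb on the PATH plus ptrace permission over the target process (typically root, or CAP_SYS_PTRACE):

```python
import subprocess

def gdb_close_fd_cmd(pid, fd):
    """Build the gdb command line that closes file descriptor `fd`
    inside the running process `pid`."""
    return [
        "gdb", "-p", str(pid),
        "-ex", "call close(%d)" % fd,  # run close(fd) inside the target
        "-ex", "set confirm off",      # skip the quit confirmation prompt
        "-ex", "quit",
    ]

def close_remote_fd(pid, fd):
    # Attaches, closes the fd, and detaches; the target is only
    # paused for the fraction of a second gdb needs to run.
    subprocess.run(gdb_close_fd_cmd(pid, fd), check=True)
```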

As expected, executing this command caused an exception to be raised inside the requests call in the application, forcing it to retry processing the current message and then continue consuming messages from Kafka normally. The command also completed within a fraction of a second, so it had minimal impact on serving other requests in the event loop.

The Lesson

After fixing the urgent problem with the production app, we re-examined the source code of Confluent’s client library and subclassed it in our application so that all HTTP requests would be performed with a timeout set. We also opened an issue about this problem in Confluent’s issue tracker on GitHub.
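The fix itself fits in a few lines. Ours wrapped Confluent’s client, but the sketch below shows the same idea applied directly to requests.Session; the class name and the 10-second default are illustrative, not from our codebase:

```python
import requests

class TimeoutSession(requests.Session):
    """A Session that injects a default timeout into every request,
    unless the caller explicitly provides one."""

    def __init__(self, timeout=10.0):
        super().__init__()
        self.default_timeout = timeout

    def request(self, method, url, **kwargs):
        # setdefault keeps an explicit per-call timeout untouched.
        kwargs.setdefault("timeout", self.default_timeout)
        return super().request(method, url, **kwargs)
```

Handing an instance of such a session to the library (instead of letting it create its own) means no single forgotten parameter can hang the consumer again.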

Besides this, I also wrote a linter plugin to make sure we would never make the same mistake in our own code again. And, of course, we became more careful with third-party libraries, especially the ones that communicate with other resources over a network.