Artist: Eric Joyner

How many times have you gone nuts debugging a process that gives you absolutely zero logs, yet is running and active according to Linux, and you have no clue what it is up to? I feel ya. Fret not. This post is about what I learned while troubleshooting exactly that scenario.

We had a production case, going on for months, where a Python process would get stuck for an indefinitely long time (even days) with absolutely zero activity, while Linux still listed it as active and running. A restart would fix the problem (as always), and the job would be alive and kicking again.

After some time I finally found the root cause, so I thought I would share it. For the purpose of this post, I’m going to simulate the behaviour of my application in a sample Python script.

import urllib2

url = 'https://google.com'

response = urllib2.urlopen(url)

print response.read()

The snippet above (Python 2, since it uses urllib2 and the print statement) makes a GET request to the given URL and prints its response. Save those few lines in a file and execute it. In this case, the printed output is the HTML of Google’s home page.

In another terminal, type the following command:

$ nc -l 9091

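If you’re not aware of the nc command, you can read about it here: https://en.wikipedia.org/wiki/Netcat. Simply put, we are starting a bare TCP listener on port 9091: it accepts a connection, but it does not speak HTTP or TLS and sends nothing back unless you type into it. (Some netcat variants want nc -l -p 9091 instead.)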
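If you don’t have nc handy, here is a minimal sketch of roughly the same thing in Python 3. The name start_silent_listener is made up for this post, and unlike plain nc -l it keeps accepting connections instead of handling just one. It only shows the idea: accept the TCP connection, then say nothing.

```python
import socket
import threading


def start_silent_listener(host="127.0.0.1", port=0):
    """Accept TCP connections on host:port and never send a byte back.

    Hypothetical helper for illustration. port=0 lets the OS pick a free
    port. Returns the listening socket and the port that was bound.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(5)
    accepted = []  # keep references so accepted sockets stay open

    def accept_forever():
        while True:
            try:
                conn, _addr = server.accept()
            except OSError:
                return  # accept failed, e.g. the listening socket was closed
            accepted.append(conn)  # hold the connection open and stay silent

    # Daemon thread: it will not keep the interpreter alive on exit.
    threading.Thread(target=accept_forever, daemon=True).start()
    return server, server.getsockname()[1]
```

Calling start_silent_listener(port=9091) should give a client the same experience as the nc command above: the connection is accepted, and then nothing ever comes back.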

Now we will change the Python code we wrote earlier to make it problematic. Instead of Google, it will talk to the nc listener we just started.

import urllib2

url = 'https://localhost:9091'

response = urllib2.urlopen(url)

print response.read()

Now if we run the modified script, it won’t print a thing, but will keep on running… If you started it in the foreground, sure, you can kill it. But imagine this running on a production instance, where the job runs in the background, nothing is printed in the logs, and yet the process is running according to Linux. Let’s start debugging. For that we need gdb together with its Python extensions, which provide the py-bt and py-list commands. Please install it for the Linux distribution you are using: https://wiki.python.org/moin/DebuggingWithGdb.

1. First, find the process id (pid) of the running Python process. In my case it was 2167.

2. Attach gdb to this process. This command lets you connect to the running Python process:

$ gdb python 2167

3. In the gdb console, run py-bt. This command prints the Python-level backtrace, that is, the Python frames the process is executing right now (where it is stuck, in our case). The bottom of the backtrace should show our own program’s statement.

4. It’s stuck on the line where it is trying to open an HTTPS connection to the given URL. Basically, our server doesn’t know how to respond to an HTTPS connection (nc just listens on the port and does nothing else), and our awesome Python client waits almost indefinitely for the server to complete the SSL handshake, because urlopen was called without a timeout. If you run py-list in the gdb shell, it shows the source around the exact statement that is executing.

Did you see the location? It is still trying to do the handshake. Duh!
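One way to keep a client like this from waiting forever is to give urlopen a timeout. Below is a minimal sketch, not what my production job actually runs. It uses Python 3’s urllib.request rather than the urllib2 from the script above (urllib2.urlopen accepts a timeout argument too), and the fetch name and the 10-second default are assumptions made up for this example.

```python
import urllib.request


def fetch(url, timeout_seconds=10):
    """Return the response body, or raise instead of blocking forever.

    timeout_seconds applies to each blocking socket operation (connect,
    TLS handshake, each read), not to the request as a whole.
    The 10-second default is an arbitrary choice for this sketch.
    """
    # If the server never answers the handshake, urlopen raises an
    # OSError subclass (typically urllib.error.URLError) once the
    # timeout expires, instead of hanging.
    with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
        return response.read()
```

Pointed at the silent listener, for example fetch('https://localhost:9091', timeout_seconds=5), this should fail with an error after about five seconds rather than sit there for days, which at least gives you something to log.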