It was almost 11PM. My distributed web crawler had been running for a few hours when I discovered a very weird thing. One of its log files had stopped being updated for about 2 hours. This was not a good sign, to say the least.

Was the process hung somewhere? Or had it crashed? I was determined to find the root cause of the problem.

In this post, you'll find out how I was able to debug a Python crash in my Linux environment.

The symptoms

So I found that some log file - generated by one of my crawler's child processes - had stopped being updated.

[acme@supervisor-1 ~]$ cd /home/acme/appdata/logs
[acme@supervisor-1 logs]$ ls -l
total 98444
-rw-r--r-- 1 acme acme  6476631 Sep 23 22:50 static_supervisor.log3374
-rw-r--r-- 1 acme acme  1403836 Sep 23 22:50 static_supervisor.log3388
-rw-r--r-- 1 acme acme 10481644 Sep 23 21:46 static_supervisor.log3388.1
-rw-r--r-- 1 acme acme  1654650 Sep 23 22:50 static_supervisor.log3389
-rw-r--r-- 1 acme acme 10484998 Sep 23 21:34 static_supervisor.log3389.1
-rw-r--r-- 1 acme acme   953646 Sep 23 22:50 static_supervisor.log3390
-rw-r--r-- 1 acme acme 10485618 Sep 23 22:07 static_supervisor.log3390.1
-rw-r--r-- 1 acme acme  1263996 Sep 23 22:50 static_supervisor.log3391
-rw-r--r-- 1 acme acme 10485654 Sep 23 21:51 static_supervisor.log3391.1
-rw-r--r-- 1 acme acme  1269715 Sep 23 22:50 static_supervisor.log3392
-rw-r--r-- 1 acme acme 10485698 Sep 23 21:53 static_supervisor.log3392.1
-rw-r--r-- 1 acme acme  1275344 Sep 23 22:50 static_supervisor.log3393
-rw-r--r-- 1 acme acme 10485631 Sep 23 21:50 static_supervisor.log3393.1
-rw-r--r-- 1 acme acme  8887556 Sep 23 20:47 static_supervisor.log3395    <==

As you can see, each child process's PID was appended to the end of its log filename. And process 3395 had stopped updating its log file at 20:47. So I peeked into it using vim and scrolled to the end.

...
2016-09-23 20:47:12,452 - requests.packages.urllib3.connectionpool - INFO - Starting new HTTP connection (1): www.careertechz.com
2016-09-23 20:47:14,528 - requests.packages.urllib3.connectionpool - DEBUG - "GET /robots.txt HTTP/1.1" 404 1245
2016-09-23 20:47:15,027 - requests.packages.urllib3.connectionpool - DEBUG - "GET /robots.txt HTTP/1.1" 404 1245
2016-09-23 20:47:15,127 - crawlers.web.multi_worker_web_crawler - DEBUG - The URL 'http://www.careertechz.com/' will be crawled.
2016-09-23 20:47:15,128 - crawlers.web.standalone_web_crawler - DEBUG - Crawling URL 'http://www.careertechz.com/'.
2016-09-23 20:47:18,585 - requests.packages.urllib3.connectionpool - INFO - Starting new HTTP connection (1): www.careertechz.com
2016-09-23 20:47:22,053 - requests.packages.urllib3.connectionpool - DEBUG - "GET / HTTP/1.1" 200 None

The log file had stopped being updated while performing an HTTP GET request against www.careertechz.com.

Was the process simply hung, or had it crashed? Let's see if the process was still running:

[acme@supervisor-1 ~]$ ps aux | grep 3395
acme     14787  0.0  0.0 112644   956 pts/0    R+   16:22   0:00 grep --color=auto 3395

Indeed, there was no Python process with PID 3395 running on my system! So clearly, it had died. But why, exactly?
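As an aside, grepping the output of ps can match unrelated processes - including the grep command itself, as seen above. If you're already in Python, a more direct check is to send signal 0 to the PID. This is just a convenience sketch (not something my crawler used): signal 0 performs error checking only, so no signal is actually delivered.

```python
import os

def pid_alive(pid):
    """Return True if a process with the given PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0: error checking only, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists, but it belongs to another user.
        return True
    return True
```

Calling `pid_alive(3395)` on my system would have returned False, confirming that the process was gone.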

Analyzing the global system messages

This is when I thought of looking at the global system message log. This would probably give me a better idea of the problem, especially since my crawler was running as a systemd service. I could thus expect systemd to log relevant information to /var/log/messages.

So I opened it using vim and I searched for the PID in question. And this is what I saw.

[acme@supervisor-1 ~]$ sudo vim /var/log/messages

Sep 23 20:49:10 supervisor-1 kernel: python invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Sep 23 20:49:10 supervisor-1 kernel: python cpuset=/ mems_allowed=0
Sep 23 20:49:10 supervisor-1 kernel: CPU: 0 PID: 3374 Comm: python Not tainted 3.10.0-327.18.2.el7.x86_64 #1
...
Sep 23 20:49:10 supervisor-1 kernel: 5762 pages reserved
Sep 23 20:49:10 supervisor-1 kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Sep 23 20:49:10 supervisor-1 kernel: [  341]     0   341    10327      262      23      574             0 systemd-journal
Sep 23 20:49:10 supervisor-1 kernel: [  368]     0   368    10782        1      22      132         -1000 systemd-udevd
Sep 23 20:49:10 supervisor-1 kernel: [  387]     0   387    12797        0      25      113         -1000 auditd
Sep 23 20:49:10 supervisor-1 kernel: [  462]     0   462    60591      100      51      168             0 rsyslogd
Sep 23 20:49:10 supervisor-1 kernel: [  467]    81   467     6650        0      19      104          -900 dbus-daemon
...
Sep 23 20:49:10 supervisor-1 kernel: [ 3374]  1000  3374   230620     1720     115     5284             0 python
Sep 23 20:49:10 supervisor-1 kernel: [ 3388]  1000  3388   197430     5234     146    15178             0 python
Sep 23 20:49:10 supervisor-1 kernel: [ 3389]  1000  3389   198011     3913     140    13479             0 python
Sep 23 20:49:10 supervisor-1 kernel: [ 3390]  1000  3390   197082     5154     138    12078             0 python
Sep 23 20:49:10 supervisor-1 kernel: [ 3391]  1000  3391   197466     5486     139    12723             0 python
Sep 23 20:49:10 supervisor-1 kernel: [ 3392]  1000  3392   196202     4952     137    12503             0 python
Sep 23 20:49:10 supervisor-1 kernel: [ 3393]  1000  3393   196074      726     146    18891             0 python
Sep 23 20:49:10 supervisor-1 kernel: [ 3395]  1000  3395   319923    71803     307    30442             0 python
Sep 23 20:49:10 supervisor-1 kernel: [ 3498]  1000  3498   234541     3983     129     7530             0 python
Sep 23 20:49:10 supervisor-1 kernel: [ 3501]    89  3501    22811        0      43      260             0 pickup
Sep 23 20:49:10 supervisor-1 kernel: Out of memory: Kill process 3395 (python) score 405 or sacrifice child
Sep 23 20:49:11 supervisor-1 kernel: Killed process 3395 (python) total-vm:1279692kB, anon-rss:287212kB, file-rss:0kB

So process 3395 had been terminated by oom-killer, as evidenced by the opening "python invoked oom-killer" message and the closing "Out of memory: Kill process 3395" messages. oom-killer is the kernel's out-of-memory killer, a mechanism responsible for automatically killing processes whenever the system runs out of memory. Process 3395 had apparently been taking far too much memory - its total_vm of 319923 pages works out to roughly 1.2 GB of virtual memory, matching the "total-vm:1279692kB" in the kill message - so it had been killed to free up some memory.
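Note that the total_vm and rss columns in the oom-killer process table are expressed in pages, not kilobytes - an easy detail to misread. A quick conversion (assuming the standard 4 KiB page size on x86-64) lines up exactly with the figures in the kill message:

```python
# oom-killer's process table reports total_vm and rss in pages.
# On x86-64, a page is 4 KiB.
PAGE_SIZE_KB = 4

total_vm_pages = 319923   # total_vm column for PID 3395
rss_pages = 71803         # rss column for PID 3395

total_vm_kb = total_vm_pages * PAGE_SIZE_KB
rss_kb = rss_pages * PAGE_SIZE_KB

print(total_vm_kb)  # 1279692 kB, i.e. ~1.2 GB of virtual memory ("total-vm:1279692kB")
print(rss_kb)       # 287212 kB, i.e. ~280 MB resident ("anon-rss:287212kB")
```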

But... why had it used so much memory?

Testing out the last crawled URL

I figured that maybe the last URL crawled by process 3395 was the culprit, namely http://www.careertechz.com/. So I loaded it in Google Chrome, and it confirmed my suspicion.

The web page was actually a never-ending PHP error page, displaying a super huge stack trace. Thus, it had probably caused my crawler to consume an ever-growing amount of memory. Unless some timeout had occurred, right?

Here's the class that my crawler used to download web pages:

# Copyright 2017 Benoit Bernard All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import requests
from core.contracts import contract
from core.http.exceptions import HttpError
from requests.exceptions import ConnectionError, RequestException


class Request:
    """
    Represents an HTTP request.
    """
    def __init__(self, requests=requests):
        """
        Initializes the Request object.
        """
        contract.is_not_none(requests)
        self._requests = requests

    def get(self, url, **kwargs):
        """
        Performs a GET request for the specified URL.
        """
        contract.is_not_empty(url)
        return self._perform_request(self._requests.get, url, **kwargs)

    def _perform_request(self, call, url, **kwargs):
        try:
            response = call(url, **kwargs)
        except (ConnectionError, RequestException) as e:
            raise HttpError("Unable to get content at URL '{0}'.".format(url)) from e
        return response

This class uses the requests library to perform HTTP requests. So just to validate its behavior, I created a small unit test that downloads the problematic URL.

import unittest

from core.http.exceptions import HttpError
from core.http.request import Request


class RequestTests(unittest.TestCase):
    _REQUEST = Request()

    def test_get_with_url_that_points_to_a_very_huge_page(self):
        """
        Validates that downloading a URL that points to a very large web page will eventually be aborted.
        """
        self.assertRaises(HttpError, self._REQUEST.get, "http://www.careertechz.com/", timeout=10)

In the above test, a timeout of 10 seconds is used. I was expecting requests to time out after that, causing an exception of type HttpError to be raised.

But surprisingly, it never timed out when I ran it. And the memory use of the Python process was growing without bounds. Aha! So I had clearly identified the source of the problem!

But now, why wasn't it timing out?

Solution attempt #1: Using timeouts

As discussed in one of my previous posts, HTTP libraries such as requests typically use two forms of timeout:

Connect timeout: maximum number of seconds to wait for the socket to initially connect.

Read timeout: maximum number of seconds to wait for the socket to receive data. Any interruption longer than that - either at the beginning or during the download per se - leads to a timeout.

But neither form of timeout had occurred; on the contrary, requests was receiving data from the server, nonstop.
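This makes sense once you see what a read timeout actually bounds: each individual socket read, not the transfer as a whole. Here's a stand-alone demonstration with plain sockets (no requests involved) - a local server sends a small chunk every 0.2 seconds, so the whole transfer takes about a second, yet a 0.5-second read timeout never fires:

```python
import socket
import threading
import time

def slow_but_steady_server(listener):
    """Send one small chunk every 0.2 s; the full transfer takes ~1 s."""
    conn, _ = listener.accept()
    for _ in range(5):
        conn.sendall(b"chunk")
        time.sleep(0.2)
    conn.close()

listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
threading.Thread(target=slow_but_steady_server, args=(listener,), daemon=True).start()

client = socket.socket()
client.settimeout(0.5)  # read timeout: applies to EACH recv(), not to the total
client.connect(listener.getsockname())

received = b""
while True:
    data = client.recv(1024)
    if not data:
        break
    received += data
client.close()

# The transfer took ~1 s - longer than the 0.5 s timeout - yet no
# socket.timeout was raised: every individual recv() returned quickly.
print(received)  # b"chunkchunkchunkchunkchunk"
```

A server that drip-feeds data forever would therefore never trip the read timeout, which is precisely what the never-ending error page was doing to my crawler.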

Thus, if I wanted to force requests to time out during long downloads, I needed some kind of "global timeout". Unfortunately, I found that requests doesn't currently have that feature. So I needed an alternative.

Solution attempt #2: Using eventlet

After googling around a little bit, I found an interesting discussion about global timeouts.

People first suggested that one could use eventlet, a networking library that allows non-blocking I/O operations in Python. In theory, I could use it to force the HTTP GET request to time out whenever a download is too long.

So I changed my class as follows:

import eventlet
import requests
from core.contracts import contract
from core.http.exceptions import HttpError
from eventlet.timeout import Timeout
from requests.exceptions import ConnectionError, RequestException


class Request:
    """
    Represents an HTTP request.
    """
    # The 'timeout' parameter.
    _TIMEOUT_PARAMETER = "timeout"

    # The default timeout value (30 seconds).
    _DEFAULT_TIMEOUT_IN_SECONDS = 30

    def __init__(self, requests=requests):
        """
        Initializes the Request object.
        """
        contract.is_not_none(requests)
        self._requests = requests
        # Handling timeouts is not trivial:
        # -Requests only handles two types of timeouts: connect and read timeouts. The former is when we initially
        #  connect the socket to a remote address and port. The latter is when we download data with the socket.
        #  This means that an infinitely large document would be downloaded and loaded in memory. This is bad.
        # -We need to use eventlet, which monkey-patches the required socket calls to support global download
        #  timeouts.
        eventlet.monkey_patch(socket=True)

    def get(self, url, **kwargs):
        """
        Performs a GET request for the specified URL.
        """
        contract.is_not_empty(url)
        return self._perform_request(self._requests.get, url, **kwargs)

    def _perform_request(self, call, url, **kwargs):
        try:
            with eventlet.Timeout(kwargs.get(self._TIMEOUT_PARAMETER, self._DEFAULT_TIMEOUT_IN_SECONDS)):
                return call(url, **kwargs)
        except (ConnectionError, RequestException, Timeout) as e:
            raise HttpError("Unable to get content at URL '{0}'.".format(url)) from e

There's no magic in the above code, really:

By calling eventlet.monkey_patch(socket=True), eventlet replaces all socket-related methods of the Python standard library - including those called by requests - with asynchronous-ready ones. By wrapping the request in eventlet.Timeout() (as done in _perform_request), any socket-related code inside the block can be forced to time out.

This was cool on paper, but not in practice. After carefully testing my class, I found that there were side effects. For example, it would hang my crawler's unit test suite, and the exact reasons weren't clear. "What else could it break?", I wondered. So, out of desperation, I simply dropped eventlet.

Solution attempt #3: Using signals

People also suggested that Python signals could be used. The problem was that I needed the SIGALRM signal, as explained in this discussion. And that signal cannot be used under Windows - my development platform - as explained in the documentation:

On Windows, signal() can only be called with SIGABRT, SIGFPE, SIGILL, SIGINT, SIGSEGV, SIGTERM, or SIGBREAK. A ValueError will be raised in any other case.
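For the record, on a Unix system the SIGALRM approach would have looked roughly like the sketch below. It's hypothetical - I never shipped it, precisely because it can't work on Windows (and signals can only be handled in the main thread):

```python
import signal
import time

class GlobalTimeout(Exception):
    pass

def _on_alarm(signum, frame):
    raise GlobalTimeout()

signal.signal(signal.SIGALRM, _on_alarm)

signal.alarm(1)  # ask the kernel to deliver SIGALRM in 1 second
try:
    time.sleep(5)  # stand-in for a long-running download
    timed_out = False
except GlobalTimeout:
    timed_out = True
finally:
    signal.alarm(0)  # cancel any pending alarm

print(timed_out)  # True
```

Because the alarm fires regardless of what the wrapped code is doing, it behaves like a true global timeout - but only on Unix.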

So I realized that implementing a global timeout in requests was much harder than I thought.

Solution attempt #4: Setting a download size limit

This is when I had an epiphany - or sort of. Why was I trying to implement a global timeout mechanism, exactly?

Because I wasn't trying to abort long downloads; instead, I was trying to abort large ones. So why not download each web page by chunks, and then re-calculate the total download size after each new chunk?

Then I could simply use some threshold to limit downloads up to a certain size. This was a super promising - and much cleaner - avenue to explore.
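Stripped of any HTTP specifics, the idea is just a capped accumulator over an iterator of chunks. A minimal sketch - the generators below stand in for a streamed HTTP response, and the exception name is illustrative:

```python
class DownloadTooLargeError(Exception):
    pass

def read_capped(chunks, max_bytes):
    """Accumulate chunks, aborting as soon as the running total exceeds max_bytes."""
    total = 0
    parts = []
    for chunk in chunks:
        total += len(chunk)
        if total > max_bytes:
            raise DownloadTooLargeError(
                "The response was too large (greater than {0} bytes).".format(max_bytes))
        parts.append(chunk)
    return b"".join(parts)

# A well-behaved response fits under the cap...
small = read_capped((b"x" * 100 for _ in range(5)), max_bytes=1024)

# ...while a never-ending one is cut off almost immediately.
def endless():
    while True:
        yield b"x" * 1024

aborted = False
try:
    read_capped(endless(), max_bytes=2**20)
except DownloadTooLargeError:
    aborted = True
print(aborted)  # True
```

The never-ending generator is aborted after roughly 1 MB, no matter how long the server keeps sending data.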

Fortunately, I found that:

requests.get() has a stream parameter, which allows data to be downloaded chunk by chunk.

The returned Response object has an iter_content() method, which allows iterating over the content and downloading it progressively.

There's a contextlib.closing() function, which releases the connection back to requests' connection pool, even when the response data has only been partially read.

So here's the modified class that I came up with.

import core.http.exceptions
import requests
from contextlib import closing
from core.contracts import contract
from requests.exceptions import RequestException
from requests.packages.urllib3.exceptions import HTTPError

...


class Request:
    """
    Represents an HTTP request.
    """
    # The number of bytes read all at once during a GET request.
    _GET_CHUNK_SIZE = 2**10

    # The maximum length allowed for a response to a GET request.
    _MAX_RESPONSE_LENGTH = 2**20

    def __init__(self, requests=requests):
        """
        Initializes the Request object.
        """
        contract.is_not_none(requests)
        self._requests = requests

    def get(self, url, **kwargs):
        """
        Performs a GET request for the specified URL.
        """
        contract.is_not_empty(url)
        raw_content = ""
        try:
            # We absolutely need to download the response by chunks for a GET request, because some web pages are
            # very large and downloading them would:
            # -take forever.
            # -potentially cause the machine to run out of memory.
            # This is much more reliable than using some kind of global download timeout mechanism. Additionally, this
            # allows us to continue to support connect and read timeouts independently of it.
            #
            # We could also have instantiated a single Session object and re-used it throughout the whole life of
            # the Request object. Indeed, this would have increased the performance of GET requests. But based on my
            # tests, this also radically increases the memory footprint, and it might even manifest itself as a
            # potential leak in debuggers. Obviously, this is not good at all. Anyway, internally, requests.get() and
            # requests.head() do exactly the same thing in that they both create a new Session for each new request.
            with self._requests.Session() as session:
                with closing(session.get(url, stream=True, **kwargs)) as response:
                    for chunk in response.iter_content(chunk_size=self._GET_CHUNK_SIZE, decode_unicode=True):
                        # If the returned chunk is of type 'bytes', it means that iter_content() was unable to detect
                        # the encoding of the response (in order to decode it). This can happen in several situations:
                        # -the URL points to a binary file (not to an HTML page).
                        # -the URL points to a page whose encoding simply can't be detected (for some reason).
                        # -the URL points to an empty page.
                        # -the URL points to an error page (e.g. 404, 503, etc.)
                        # We must then decode the chunk ourselves, assuming that the encoding is UTF-8.
                        # It's also important to note that response.encoding may be equal to None. In most cases, this
                        # occurs when the URL points to a binary document (e.g. PDF).
                        raw_content += chunk if response.encoding else chunk.decode()
                        if len(raw_content) >= self._MAX_RESPONSE_LENGTH:
                            raise core.http.exceptions.HttpError(
                                "The response was too large (greater than {0} bytes).".format(self._MAX_RESPONSE_LENGTH))
                    response_url = response.url
                    status_code = response.status_code
                    headers = response.headers
        except (RequestException, HTTPError) as e:
            # We catch exceptions coming from:
            # -requests (RequestException)
            # -urllib3 (HTTPError)
            raise core.http.exceptions.HttpError("Unable to get content at URL '{0}'.".format(url)) from e
        except (UnicodeDecodeError, LookupError) as e:
            # We catch exceptions coming from:
            # -str.decode(): can raise errors of type UnicodeDecodeError. For example, this may happen if the URL
            #  points to a binary document (e.g. PDF).
            # -requests: can raise errors of type LookupError when it can't get an incremental decoder for
            #  iter_content because the encoding is invalid (e.g. encoding is 'utf-8, text/html').
            raise core.http.exceptions.HttpError("Unable to decode content at URL '{0}'.".format(url)) from e
        return RawResponse(response_url, status_code, headers, raw_content)

Whenever a new HTTP GET request is sent, the above code downloads the response in chunks of 1 KB (2**10 bytes), and after each chunk it checks whether the total download size has exceeded a predetermined threshold of 1 MB (2**20 bytes). If it has, it raises an exception of type core.http.exceptions.HttpError.

I tested the code out and guess what? Once the accumulated chunks reached a total size of 1 MB, the code successfully aborted the download by raising an exception of type core.http.exceptions.HttpError, as expected.

Awesome! Problem solved!

Final thoughts

In this post, we've seen how to resolve a specific type of Python crash. I determined that one of my crawler's processes was taking way too much memory, thus causing it to be killed.

The diagnosis was rather straightforward thanks to the availability of critical information in logs; I was quite lucky in this sense. However, with other types of crashes, I'd probably have to generate a Linux core dump and analyze it using the gdb debugger. I've never had to do it, so it would certainly make a super interesting blog post.

Now, I'd be curious to know about you! Have you ever had to debug a Python crash? If so, what tools did you use? And were you successful? Please leave a comment in the comment section below!

Update (17/03/2017): this post was featured in Lobsters. It was also featured in the Python Weekly and Pycoders Weekly newsletters. If you get a chance to subscribe to any one of them, you won't be disappointed! Thanks to everyone for your support and your great feedback!