This is offered to the world in the spirit of “If other programmers can learn from my mistakes, maybe my threads won’t have died in vain.” The trouble is, I don’t think anyone will ever find this by Googling. Maybe I am wrong.

Here is a heavily-sanitised snippet of Python code, sitting inside a loop, being run by a pool of threads.

start_time = datetime.now()
if start_time >= next_run_due:
    logging.debug("Starting the performance-critical section")
    # Ensure that the results we send are based on the freshest data.
    fetch_data()  # Have noticed this runs slow sometimes. Keep an eye on it.
    fetch_completed_time = datetime.now()  # Don't examine now. Wait until outside performance-critical section!
    process_data()
    send_results()
    # Relax. Now outside the performance-critical section.
    # This performance check has been placed outside the performance-critical
    # section, because logging.critical() may send a (slow) email.
    if fetch_completed_time - start_time > threshold:
        logging.critical("Fetch time is running slow again.")  # Implicitly sends an email.
        # If you see this message, check the network and CPU aren't congested.
    next_run_due = next_due_date()

So this code checks whether some processing is due and, if so, fetches the data, processes it and sends out the results. It tries to make sure that the results it sends are based on the freshest possible data, so the key section does as little extraneous work as possible.

Sometimes, the data-fetch takes longer than expected – historically, that has been because something else on the server is chewing up more resources than it should. I want to know when that happens, so I can go and fix it, so I log a critical message. The critical log is wired up to send an email, using the SMTPHandler from Python’s logging library. I do this check outside the key section, so that processing the email doesn’t (further) slow the sending of the results.
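For reference, the handler wiring behind this looks something like the following sketch – the hostname, addresses, subject and log-file path are placeholders for illustration, not the real configuration:

```python
import logging
import logging.handlers

logger = logging.getLogger("app")
logger.setLevel(logging.DEBUG)

# Routine messages go to a local log file (delay=True defers opening it).
file_handler = logging.FileHandler("app.log", delay=True)
file_handler.setLevel(logging.DEBUG)

# CRITICAL records are emailed via the local MTA.
smtp_handler = logging.handlers.SMTPHandler(
    mailhost="localhost",
    fromaddr="monitor@example.com",
    toaddrs=["admin@example.com"],
    subject="Fetch time is running slow",
)
smtp_handler.setLevel(logging.CRITICAL)

logger.addHandler(file_handler)
logger.addHandler(smtp_handler)
```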

Can you see the problem with this code? I couldn’t. (I love asking this question, because people tend to find dozens of other problems I never thought about!)

Sure, there is no throttling on the critical messages, but that is because this doesn’t occur frequently enough to cause email flooding.
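(For completeness: if throttling ever did become necessary, the logging module’s Filter hook could provide it. A hypothetical sketch – Throttle is my own name here, not a standard class:)

```python
import logging
import time

class Throttle(logging.Filter):
    """Hypothetical filter: drop records that arrive within `interval`
    seconds of the last record that was allowed through."""

    def __init__(self, interval):
        super().__init__()
        self.interval = interval
        self.last_kept = float("-inf")

    def filter(self, record):
        now = time.time()
        if now - self.last_kept >= self.interval:
            self.last_kept = now
            return True   # keep this record
        return False      # drop it

# Attached to the email handler only, e.g.:
# smtp_handler.addFilter(Throttle(interval=300))  # at most one email per 5 minutes
```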

Here’s a hint. Multi-threading and locking.

Some code, in some thread, somewhere, called logging.critical(), which tried to send an email. I expected logging.critical() to write the email to a local buffer, and for sendmail (or whatever) to pick it up and deliver it later. I knew it would be slow, but I was expecting sub-second response times. I haven’t worked out why that isn’t always true, but it isn’t: it can take many seconds to send this email.

But that’s okay, because there are no logging.critical() calls during any performance-sensitive areas, so it won’t affect the “freshness” of the data, which is really the critical performance characteristic.

Can you see the call to logging.debug()? It does not send an email; it writes to a log file. Its execution time is measured in hundreds of microseconds, so it isn’t a concern.

Or so I thought.

It turns out that logging.debug() blocks, waiting for the logging.critical() call to complete. Now the debug message is taking 10,000 times longer than it normally does! It still isn’t inside the critical part, but note that it is inside the part that is timed. So when fetching + logging takes more time than the threshold, out goes another logging.critical() call, which is another email message, which is another blocked logging system, which holds up the next thread, and so it goes.

Meanwhile, the time taken to do a logging.critical() call keeps increasing as they all wait in the queue to send emails. The first one goes out, as expected, in less than a second. The second one took twice as long. The third, three times as long. It got to the stage where my threads were freezing up for 90 seconds. I said before that there weren’t enough of these calls to cause email flooding. I stand by that – there was a burst of a few hundred over a couple of hours, which should be easily handled by a server’s email system. But that assumes the threads in the system are not waiting for every email to actually reach its destination before they continue.
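The pile-up is easy to reproduce with a stand-in handler. This sketch simplifies my setup by sharing one slow handler between both calls (in my system the debug went to a file handler and the critical to SMTP), but it shows the same effect: the “cheap” debug call stalls on the handler’s lock while a critical record is busy “sending its email”:

```python
import logging
import threading
import time

class SlowEmailHandler(logging.Handler):
    """Stand-in for SMTPHandler: CRITICAL records take ~1s to 'send'."""
    def emit(self, record):
        if record.levelno >= logging.CRITICAL:
            time.sleep(1.0)  # simulate a slow SMTP conversation

logger = logging.getLogger("demo")
logger.setLevel(logging.DEBUG)
logger.addHandler(SlowEmailHandler())

durations = {}

def send_critical():
    logger.critical("slow email on its way")   # holds the handler lock for ~1s

def send_debug():
    time.sleep(0.1)                  # let the critical call grab the lock first
    start = time.time()
    logger.debug("cheap message?")   # blocks until the 'email' has been sent
    durations["debug"] = time.time() - start

threads = [threading.Thread(target=send_critical),
           threading.Thread(target=send_debug)]
for t in threads: t.start()
for t in threads: t.join()
# durations["debug"] comes out at most of a second, not hundreds of microseconds.
```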

Short Term Solution

I’ve turned off emailing of critical logs for the time being. In any case, I have also moved the debug log outside the timed section, to stop it propagating the problem.
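Concretely, the rearrangement amounts to taking the timestamp after the debug call, so a stalled debug no longer inflates the measured fetch time. A sketch, with the work functions stubbed out:

```python
from datetime import datetime, timedelta
import logging

threshold = timedelta(seconds=5)

def fetch_data(): pass       # stand-ins for the real work
def process_data(): pass
def send_results(): pass
def next_due_date(): return datetime.now() + timedelta(minutes=1)

next_run_due = datetime.now()

now = datetime.now()
if now >= next_run_due:
    logging.debug("Starting the performance-critical section")
    start_time = datetime.now()   # taken *after* the debug call, so a blocked
                                  # debug no longer counts towards the fetch time
    fetch_data()
    fetch_completed_time = datetime.now()
    process_data()
    send_results()
    if fetch_completed_time - start_time > threshold:
        logging.critical("Fetch time is running slow again.")  # emailing currently disabled
    next_run_due = next_due_date()
```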

Outstanding Questions

Before I can implement a long-term solution, I have to understand some issues.

Is my MTA running too slowly, even though it is on localhost? Could it be making me wait while it sends the email, rather than buffering it and sending it later? Could it be protecting itself from elephant-interferers, err… spammers, by deliberately running slowly? Or was it inappropriate for me (and SMTPHandler) to have ever thought it should be fast enough?

Does SMTPHandler buffer emails before sending them out in a different thread? Apparently not. Could it be configured or rewritten to do so? Do I really need to write such a beast myself?
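(As it turns out, later Pythons ship such a beast: logging.handlers has QueueHandler and QueueListener from 3.2 on. Each record is handed to a background thread, so the logging caller never waits on SMTP. A sketch, again with placeholder addresses:)

```python
import logging
import logging.handlers
import queue

log_queue = queue.Queue(-1)   # unbounded

# The slow handler lives behind the queue (placeholder addresses again).
smtp_handler = logging.handlers.SMTPHandler(
    mailhost="localhost",
    fromaddr="monitor@example.com",
    toaddrs=["admin@example.com"],
    subject="Fetch time is running slow",
)
smtp_handler.setLevel(logging.CRITICAL)

# The listener's background thread pulls records off the queue and runs
# the SMTP handler; respect_handler_level keeps sub-CRITICAL records
# from reaching it at all.
listener = logging.handlers.QueueListener(
    log_queue, smtp_handler, respect_handler_level=True)
listener.start()

# The application logger only ever touches the (fast) queue.
logger = logging.getLogger("worker")
logger.addHandler(logging.handlers.QueueHandler(log_queue))
# logger.critical(...) now returns as soon as the record is enqueued;
# listener.stop() drains the queue at shutdown.
```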

Could BufferingSMTPHandler help? It seems to minimise the number of emails sent by batching them up and only sending when the buffer is full. Presumably it still blocks other logging during these occasional sends, so the problem remains.

Could SysLogHandler help by taking it out of the domain of Python? Or given my limited experience with syslog, will I just have a new set of problems?
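For the record, handing off to syslog is a one-handler change; whether it behaves any better depends entirely on the syslog daemon’s configuration. A sketch (the default address sends UDP datagrams to localhost:514; on most Linux systems you would point address at "/dev/log" instead):

```python
import logging
import logging.handlers

# Default address is ('localhost', 514) over UDP; emit() just fires off
# a datagram, so the logging thread does not wait for delivery.
syslog_handler = logging.handlers.SysLogHandler(
    facility=logging.handlers.SysLogHandler.LOG_DAEMON)
syslog_handler.setLevel(logging.CRITICAL)
logging.getLogger().addHandler(syslog_handler)
```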

In the meantime, if you use SMTPHandler, and you get random periods of sustained bad performance, maybe this will help you?