About a month ago, I found a bad-behavior bug in the tokenize standard library module, and with help from Aron Griffis, submitted a patch to fix it. Yesterday was a Python bug day, and Ezio Melotti merged my change, so I have officially contributed to CPython!

The bug in tokenize was an obscure case: if the code ends with a line that starts with non-space, then ends with many spaces, and no newline, then the tokenizer gets into an N² run-time behavior, where N is the number of spaces. The problem is that each space is tokenized as an error token (because it precedes no good token), so N tokens are produced, but each token takes linear time for the regex to see that there’s no good token following it, leading to N² behavior.

I discovered this working on code that grades student submissions at edX. For some reason there was a submission ending with 40,000 spaces and no newline, and it was taking 20 minutes to tokenize!

Simple demonstration:

import tokenize

import time

from cStringIO import StringIO



def time_to_tokenize_trailing ( spaces ):

source = StringIO ( "@" + " " * spaces )

start = time . time ()

list ( tokenize . generate_tokens ( source . readline ))

end = time . time ()

return end - start



for spaces in xrange ( 1000 , 15000 + 1 , 1000 ):

print " %5d : %.2f s" % ( spaces , time_to_tokenize_trailing ( spaces ))



Ouch:

1000: 0.71s

2000: 2.83s

3000: 6.47s

4000: 11.52s

5000: 17.68s

6000: 26.16s

7000: 35.35s

8000: 46.65s

9000: 58.35s

10000: 72.80s

11000: 89.53s

12000: 107.27s

13000: 126.44s

14000: 147.60s

15000: 166.81s



If you are running a server that tokenizes untrusted Python code, you might want to throw an .rstrip() into it to prevent this case...