Created on 2008-03-22 22:25 by pitrou, last changed 2014-06-22 12:04 by BreamoreBoy.

Messages (31)

msg64343 - (view) Author: Antoine Pitrou (pitrou) * Date: 2008-03-22 22:28

This is a preliminary patch to speedup for and while loops (it will also be able to speedup list comps and gen comps, if I get to do it). The patch works by putting the loop condition test at the end of loop, which allows removing one JUMP_ABSOLUTE byte code out of the critical path. For this two new opcodes are introduced: - FOR_ITER2 is the same as FOR_ITER except that it does an /absolute/ jump if the iterator is /not/ exhausted (when other uses of FOR_ITER are replaced with FOR_ITER2, we can of course restore the original naming) - JUMP_ABS_IF_TRUE /pops/ the top of the stack and does an absolute jump if its value is true Some micro-benchmarks: ./python -m timeit "for x in xrange(10000): pass" Before: 1000 loops, best of 3: 782 usec per loop After: 1000 loops, best of 3: 412 usec per loop ./python -m timeit "x=100" "while x: x -= 1" Before: 10000 loops, best of 3: 22.1 usec per loop After: 100000 loops, best of 3: 16.6 usec per loop ./python Tools/pybench/pybench.py -t ForLoops Before: 365ms per round After: 234ms per round Also, pystone gets 5% faster (from 43300 to 45800). Now for the less shiny things: 1. I'm having problems modifying the pure Python compiler module. For some reasons it seems to have stricter requirements than compile.c itself does (!); I get some assertion error in compiler.pyassem.FlowGraph.fixupOrderHonorNext as soon as a loop gets non-trivial. 2. Line numbers probably need to be fixed. The lnotab format may even have to be adapted in order to accomodate non-monotonically increasing line numbers. Is there some interest in this patch? If yes, I'd like to have your input on the two bullet points above :)

msg64359 - (view) Author: Antoine Pitrou (pitrou) * Date: 2008-03-23 14:18

This new patch includes surgery to the compiler package (especially flowgraph trickery) in order to make it work with the new opcodes. I think my changes are sane but since the package seems basically untested, unmaintained and almost unused, there may be some glitches. However, test_compiler passes. (test_dis will need to be updated for the new opcodes, not a big deal)

msg64364 - (view) Author: Antoine Pitrou (pitrou) * Date: 2008-03-23 14:44

Removed latest patch, it was half-baked.

msg64367 - (view) Author: Antoine Pitrou (pitrou) * Date: 2008-03-23 15:51

This new patch should be ok. The block ordering algorithm in compiler.pyassem looks entirely clean now, to the extent that the previous "fixup" hacks have been disabled. Attaching loops3.py.

msg64383 - (view) Author: Antoine Pitrou (pitrou) * Date: 2008-03-23 23:02

loops4.patch adds a mechanism to avoid blocking signal catching in empty loops (such as "for x in it: pass" or "while x: pass"). Much of the speedup is still retained. ./python -m timeit "for x in xrange(10000): pass" Before: 1000 loops, best of 3: 737 usec per loop After: 1000 loops, best of 3: 438 usec per loop ./python -m timeit "x=100" "while x: x -= 1" Before: 10000 loops, best of 3: 21.7 usec per loop After: 100000 loops, best of 3: 16.6 usec per loop ./python Tools/pybench/pybench.py -t ForLoops Before: 364ms per round After: 242ms per round

msg64506 - (view) Author: Antoine Pitrou (pitrou) * Date: 2008-03-25 19:19

By the way, the compiler package fix has been isolated and cleaned up as part of #2472. Maybe it would be nice to start with that one.

msg64537 - (view) Author: Antoine Pitrou (pitrou) * Date: 2008-03-26 02:09

This new patch also updates the code generation for list comprehensions. Here are some micro-benchmarks: ./python -m timeit -s "l = range(100)" "[x for x in l]" Before: 100000 loops, best of 3: 14.9 usec per loop After: 100000 loops, best of 3: 12.5 usec per loop ./python -m timeit -s "l = range(100)" "[x for x in l if x]" Before: 10000 loops, best of 3: 24.1 usec per loop After: 100000 loops, best of 3: 18.8 usec per loop ./python -m timeit -s "l = range(100)" "[x for x in l if not x]" Before: 100000 loops, best of 3: 15.5 usec per loop After: 100000 loops, best of 3: 11.9 usec per loop Please note that this patch is orthogonal with Neal's patch in #2183, so the two combined should be quite interesting for the list comprehensions bytecode.

msg64572 - (view) Author: Antoine Pitrou (pitrou) * Date: 2008-03-26 23:09

This new patch completes the bytecode modifications. For/while loops as well as list comprehensions and generator expressions are a bit faster now. Also, as a side effect I've introduced a speed improvement for "if" statements and expressions... Some micro-benchmarks (completing the ones already given above): ./python Tools/pybench/pybench.py -t IfThenElse Before: 167ms per round After: 136ms per round ./python -m timeit -s "y=range(100)" "sum(x for x in y)" Before: 10000 loops, best of 3: 20.4 usec per loop After: 100000 loops, best of 3: 17.9 usec per loop ./python -m timeit -s "y=range(100)" "sum(x for x in y if x)" Before: 10000 loops, best of 3: 28.5 usec per loop After: 10000 loops, best of 3: 23.3 usec per loop ./python -m timeit -s "y=range(100)" "sum(x for x in y if not x)" Before: 100000 loops, best of 3: 16.4 usec per loop After: 100000 loops, best of 3: 12.1 usec per loop ./python -m timeit -s "x,y,z=1,2,3" "x if y else z" Before: 1000000 loops, best of 3: 0.218 usec per loop After: 10000000 loops, best of 3: 0.159 usec per loop A couple of tests seem to be failing in obscure ways in the test suite, I'll try to examine them. Most of the test suite runs fine though.

msg64573 - (view) Author: Antoine Pitrou (pitrou) * Date: 2008-03-26 23:43

Ok, the fix for the bizarre failures was really simple. Now the only failing tests are in test_trace (because it makes assumptions regarding the bytecode that aren't true anymore, I'll have to adapt the tests).

msg64577 - (view) Author: Neal Norwitz (nnorwitz) * Date: 2008-03-27 05:04

Antoine, I hope to look at this patch eventually. Unfortunately, there are too many build/test problems that need to be resolved before the next release. If you can help out with those, I will be able to review this patch sooner.

msg64599 - (view) Author: Armin Rigo (arigo) * Date: 2008-03-27 16:21

Can you see if this simpler patch also gives speed-ups? (predict_loop.diff)

msg64603 - (view) Author: Antoine Pitrou (pitrou) * Date: 2008-03-27 16:56

Armin, your patch gives a speed-up for "for" loops and comprehensions, although a bit less. Also, it doesn't speed up "while" loops and "if" statements at all. For some reasons it also appears to make pystone a bit slower. Here are some micro-benchmarks: ./python -m timeit "for x in xrange(10000): pass" Before: 1000 loops, best of 3: 758 usec per loop After: 1000 loops, best of 3: 483 usec per loop ./python -m timeit "x=100" "while x: x -= 1" Before: 10000 loops, best of 3: 21.8 usec per loop After: 10000 loops, best of 3: 21.6 usec per loop ./python -m timeit -s "l = range(100)" "[x for x in l]" Before: 100000 loops, best of 3: 14.9 usec per loop After: 100000 loops, best of 3: 13.3 usec per loop ./python -m timeit -s "l = range(100)" "[x for x in l if x]" Before: 10000 loops, best of 3: 23.9 usec per loop After: 10000 loops, best of 3: 22.3 usec per loop ./python -m timeit -s "l = range(100)" "[x for x in l if not x]" Before: 100000 loops, best of 3: 15.8 usec per loop After: 100000 loops, best of 3: 13.9 usec per loop ./python Tools/pybench/pybench.py -t IfThenElse Before: 164ms per round After: 166ms per round

msg66027 - (view) Author: Antoine Pitrou (pitrou) * Date: 2008-04-30 21:46

Finally I had to slightly change the lnotab format to have the right tracing semantics: the change is that line number increments are now signed bytes rather than unsigned. Still, there is a small change in tracing behaviour (see test_trace.py): the for statement is traced twice in the first loop iteration, that's because the iterator object is retrieved (GET_ITER) at the beginning of the loop while the next iterator value is fetched (FOR_ITER) at the end of the loop. I don't believe this is very painful. All in all, the whole test suite now passes fine. The performance numbers are the same as before.

msg68687 - (view) Author: Χρήστος Γεωργίου (Christos Georgiou) (tzot) * Date: 2008-06-24 13:21

A pointer to previous (minor) research: http://groups.google.com/group/comp.lang.python/browse_frm/thread/72505e3cb6d9cb1a/e486759f06ec4ee5 esp. after Terry Reedy's post

msg80591 - (view) Author: Raymond Hettinger (rhettinger) * Date: 2009-01-26 21:09

Reminder, make sure we can still break out of a "while 1: pass".

msg80593 - (view) Author: Antoine Pitrou (pitrou) * Date: 2009-01-26 21:18

> Reminder, make sure we can still break out of a "while 1: pass". Yes, the patch takes care of that.

msg80598 - (view) Author: Antoine Pitrou (pitrou) * Date: 2009-01-26 23:08

The patches don't apply cleanly anymore, I'll regenerate a new one.

msg80995 - (view) Author: Antoine Pitrou (pitrou) * Date: 2009-02-02 20:36

Here is an updated patch against trunk. To be honest I don't think it has a lot of future: - it changes the lnotab format (to allow negative offsets) - it rewrites the buggy block reordering code in the pure Python compiler package, but nobody seems interested in maintaining that package (see #2472). Perhaps we should just declare the experiment failed and switch to something else.

msg81029 - (view) Author: Raymond Hettinger (rhettinger) * Date: 2009-02-03 07:09

I would like to see this go forward. It looks promising.

msg81958 - (view) Author: Collin Winter (collinwinter) * Date: 2009-02-13 18:13

I don't see the changes to the lnotab format being a roadblock; just mention it in NEWS. Likewise, the pure-Python compiler package shouldn't be a high priority; your changes to that package look good enough. I'm seeing encouraging speed-ups out of this (with gcc 4.3.1 x86_64, compiling Python as 64-bit): Django templates (render a 150x150 table 100 times): Min: 0.595 -> 0.589: 0.94% faster Avg: 0.599 -> 0.591: 1.30% faster Spitfire templates (render a 1000x1000 table 100 times): Min: 0.751 -> 0.729: 2.98% faster Avg: 0.753 -> 0.730: 3.09% faster None of the apps I've benchmarked are negatively impacted. I only have two minor comments. Please commit this. Review comments: - The changes to Python/compile.c:compiler_if(), compiler_for(), compiler_while() have some indentation issues (tabs and such). - Functions like def foo(): while True: pass have a useless JUMP_FORWARD 0 instruction, but I don't think it's worth teaching the peepholer to remove them since it doesn't happen in other circumstances (that I can tell).

msg81962 - (view) Author: Antoine Pitrou (pitrou) * Date: 2009-02-13 18:37

Hello Collin, Thanks for taking a look. > I don't see the changes to the lnotab format being a roadblock; just > mention it in NEWS. Likewise, the pure-Python compiler package shouldn't > be a high priority; your changes to that package look good enough. Well, I have good news: the fixes to the pure Python compiler have been accepted and committed by Neil Schemenauer in r69373. > I'm seeing encouraging speed-ups out of this (with gcc 4.3.1 x86_64, > compiling Python as 64-bit): > Django templates (render a 150x150 table 100 times): > Min: 0.595 -> 0.589: 0.94% faster > Avg: 0.599 -> 0.591: 1.30% faster > > Spitfire templates (render a 1000x1000 table 100 times): > Min: 0.751 -> 0.729: 2.98% faster > Avg: 0.753 -> 0.730: 3.09% faster Not a tremendous speedup but not totally insignificant either. (I see you like Spitfire :-)) > None of the apps I've benchmarked are negatively impacted. I only have > two minor comments. Please commit this. Before committing I want to know what to do with the new jump opcodes, with respect to the alternative proposal I've made in #4715. Ideally, I should fold the #4715 patch back into the present patch, since I think the #4715 approach is more thought out.

msg81976 - (view) Author: Collin Winter (collinwinter) * Date: 2009-02-13 23:23

On Fri, Feb 13, 2009 at 10:37 AM, Antoine Pitrou <report@bugs.python.org> wrote: > > Antoine Pitrou <pitrou@free.fr> added the comment: > > Hello Collin, > > Thanks for taking a look. > >> I don't see the changes to the lnotab format being a roadblock; just >> mention it in NEWS. Likewise, the pure-Python compiler package shouldn't >> be a high priority; your changes to that package look good enough. > > Well, I have good news: the fixes to the pure Python compiler have been > accepted and committed by Neil Schemenauer in r69373. Yeah, I saw that. Fantastic. >> I'm seeing encouraging speed-ups out of this (with gcc 4.3.1 x86_64, >> compiling Python as 64-bit): >> Django templates (render a 150x150 table 100 times): >> Min: 0.595 -> 0.589: 0.94% faster >> Avg: 0.599 -> 0.591: 1.30% faster >> >> Spitfire templates (render a 1000x1000 table 100 times): >> Min: 0.751 -> 0.729: 2.98% faster >> Avg: 0.753 -> 0.730: 3.09% faster > > Not a tremendous speedup but not totally insignificant either. > (I see you like Spitfire :-)) Well, Spitfire and Django represent very different ways of implementing a template system, so I like to measure both when doing application benchmarks. Template systems tend to be heavy CPU consumers for webapps, which is why I include them. >> None of the apps I've benchmarked are negatively impacted. I only have >> two minor comments. Please commit this. > > Before committing I want to know what to do with the new jump opcodes, > with respect to the alternative proposal I've made in #4715. > Ideally, I should fold the #4715 patch back into the present patch, > since I think the #4715 approach is more thought out. That sounds good, especially since Jeffrey and I have already reviewed #4715.

msg82556 - (view) Author: Collin Winter (collinwinter) * Date: 2009-02-21 00:30

On Fri, Feb 13, 2009 at 3:23 PM, Collin Winter <collinw@gmail.com> wrote: > On Fri, Feb 13, 2009 at 10:37 AM, Antoine Pitrou <report@bugs.python.org> wrote: >> Before committing I want to know what to do with the new jump opcodes, >> with respect to the alternative proposal I've made in #4715. >> Ideally, I should fold the #4715 patch back into the present patch, >> since I think the #4715 approach is more thought out. > > That sounds good, especially since Jeffrey and I have already reviewed #4715. If you don't have the bandwidth to integrate 4715 into this patch, I can do that, if you want. Collin

msg82557 - (view) Author: Antoine Pitrou (pitrou) * Date: 2009-02-21 01:37

> If you don't have the bandwidth to integrate 4715 into this patch, I > can do that, if you want. Collin, that would be very nice from you. You could also apply Jeffrey's naming suggestions in #4715. Thanks! Antoine.

msg83003 - (view) Author: Jeffrey Yasskin (jyasskin) * Date: 2009-03-02 06:11

I've updated for_iter.patch to the latest trunk, merging in issue 4715. I also changed tracing a bit so that the first line of a loop doesn't get traced twice in the first iteration, and added to test_dis to check that decreasing line numbers work there. Review at http://codereview.appspot.com/20103 if you like. Performance: 32-bit gcc-4.3 Intel Core2: 2to3: 25.09 -> 24.63: 1.87% faster Django: Min: 0.614 -> 0.609: 0.86% faster Avg: 0.616 -> 0.635: 3.09% slower Pickle: (cPickle) Min: 1.647 -> 1.651: 0.21% slower Avg: 1.650 -> 1.656: 0.39% slower PyBench: Min: 5341 -> 5273: 1.29% faster Avg: 5450 -> 5397: 0.98% faster SlowPickle: (pickle) Min: 1.384 -> 1.341: 3.13% faster Avg: 1.385 -> 1.343: 3.08% faster Spitfire: Min: 0.773 -> 0.690: 11.97% faster Avg: 0.776 -> 0.695: 11.62% faster Spitfire re-run: Min: 0.740 -> 0.693: 6.81% faster Avg: 0.744 -> 0.695: 6.93% faster SlowUnpickle: (pickle) Min: 0.645 -> 0.668: 3.37% slower Avg: 0.646 -> 0.670: 3.59% slower SlowUnpickle re-run: Min: 0.645 -> 0.660: 2.31% slower Avg: 0.645 -> 0.661: 2.32% slower Unpickle: (cPickle) Min: 1.015 -> 1.006: 0.89% faster Avg: 1.021 -> 1.009: 1.16% faster 64-bit gcc-4.3 Intel Core2 2to3: 22.31 -> 21.97: 1.57% faster Django: Min: 0.577 -> 0.564: 2.29% faster Avg: 0.579 -> 0.566: 2.20% faster Pickle: Min: 1.162 -> 1.178: 1.35% slower Avg: 1.166 -> 1.183: 1.37% slower PyBench: Min: 4498 -> 4193: 7.27% faster Avg: 4586 -> 4276: 7.25% faster SlowPickle: Min: 1.212 -> 1.133: 6.92% faster Avg: 1.213 -> 1.135: 6.92% faster Spitfire: Min: 0.631 -> 0.617: 2.32% faster Avg: 0.632 -> 0.621: 1.75% faster SlowUnpickle: Min: 0.575 -> 0.551: 4.31% faster Avg: 0.576 -> 0.553: 4.14% faster Unpickle: Min: 0.708 -> 0.722: 1.88% slower Avg: 0.745 -> 0.736: 1.20% faster

msg83012 - (view) Author: Antoine Pitrou (pitrou) * Date: 2009-03-02 11:52

> I've updated for_iter.patch to the latest trunk, merging in issue 4715. > I also changed tracing a bit so that the first line of a loop doesn't > get traced twice in the first iteration, and added to test_dis to check > that decreasing line numbers work there. Thanks a lot! By the way, why do you bench cPickle? Does your test call Python code significantly? Overall, the results look positive although not overwhelming.

msg83022 - (view) Author: Jeffrey Yasskin (jyasskin) * Date: 2009-03-02 16:24

No particular reason for cPickle. It sometimes shows when we've caused problems by increasing the code size, and shows the size of any random effects that the compiler causes by moving code around. Could you double-check the patch to see if I did anything dumb?

msg83088 - (view) Author: Jeffrey Yasskin (jyasskin) * Date: 2009-03-03 19:45

Hold off on reviewing this. There's one bug around the peepholer not turning itself off when line numbers skip by more than 127, and another around the traceback generator still assuming line numbers are unsigned. I'll post another patch when those are fixed and tested. I may not get to it for a day or two, so someone else is welcome to update the patch instead. :)

msg106883 - (view) Author: Dirkjan Ochtman (djc) * Date: 2010-06-02 14:17

Is this still relevant / will it get some love in the future?

msg192575 - (view) Author: Christian Heimes (christian.heimes) * Date: 2013-07-07 16:13

Is this enhancement still relevant?

msg221249 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2014-06-22 12:04