Created on 2008-12-26 21:09 by pitrou, last changed 2015-06-02 21:38 by db3l. This issue is now closed.

Messages (117)

msg78306 - (view) Author: Antoine Pitrou (pitrou) * Date: 2008-12-26 21:09

This patch implements what is usually called "threaded code" for the ceval loop on compilers which support it (only gcc). The idea is that there is a separate opcode dispatch epilog at the end of each opcode, which allows the CPU to make much better use of its branch prediction capabilities. The net result is a 15-20% average speedup on pybench and pystone, with higher speedups on very tight loops (see below for the full pybench result chart). The opcode jump table is generated by a separate script which is called as part of the Makefile (just as the asdl building script already is). On compilers other than gcc, performance will of course be unchanged. Test minimum run-time average run-time this other diff this other diff ------------------------------------------------------------------------------- BuiltinFunctionCalls: 100ms 107ms -7.1% 101ms 110ms -8.2% BuiltinMethodLookup: 76ms 106ms -28.1% 78ms 106ms -26.5% CompareFloats: 108ms 141ms -23.2% 108ms 141ms -23.2% CompareFloatsIntegers: 171ms 188ms -9.4% 173ms 204ms -15.3% CompareIntegers: 165ms 213ms -22.6% 168ms 224ms -25.1% CompareInternedStrings: 127ms 169ms -24.6% 127ms 169ms -24.8% CompareLongs: 95ms 124ms -23.1% 95ms 126ms -24.5% CompareStrings: 109ms 136ms -20.2% 111ms 139ms -19.9% ComplexPythonFunctionCalls: 131ms 150ms -12.4% 136ms 151ms -10.2% ConcatStrings: 159ms 171ms -6.9% 160ms 173ms -7.4% CreateInstances: 148ms 157ms -5.6% 150ms 158ms -4.9% CreateNewInstances: 112ms 117ms -4.3% 112ms 118ms -4.6% CreateStringsWithConcat: 144ms 198ms -27.3% 148ms 199ms -25.7% DictCreation: 90ms 104ms -13.3% 90ms 104ms -13.1% DictWithFloatKeys: 117ms 153ms -23.7% 117ms 154ms -24.0% DictWithIntegerKeys: 104ms 153ms -32.3% 104ms 154ms -32.5% DictWithStringKeys: 90ms 140ms -35.7% 90ms 141ms -36.3% ForLoops: 100ms 161ms -38.1% 100ms 161ms -38.1% IfThenElse: 123ms 170ms -28.0% 125ms 171ms -27.1% ListSlicing: 142ms 141ms +0.3% 142ms 142ms +0.2% NestedForLoops: 135ms 190ms -29.0% 135ms 190ms -29.0% NormalClassAttribute: 249ms 281ms -11.5% 249ms 281ms -11.3% NormalInstanceAttribute: 110ms 153ms -28.2% 111ms 154ms -28.1% PythonFunctionCalls: 106ms 130ms -18.7% 108ms 131ms -17.2% PythonMethodCalls: 151ms 169ms -10.1% 152ms 169ms -9.8% Recursion: 183ms 242ms -24.7% 191ms 243ms -21.4% SecondImport: 142ms 138ms +2.7% 144ms 139ms +3.4% SecondPackageImport: 146ms 149ms -2.3% 148ms 150ms -1.5% SecondSubmoduleImport: 201ms 193ms +3.9% 201ms 195ms +3.4% SimpleComplexArithmetic: 90ms 112ms -19.6% 90ms 112ms -19.8% SimpleDictManipulation: 172ms 230ms -25.2% 173ms 231ms -25.0% SimpleFloatArithmetic: 98ms 133ms -26.3% 99ms 137ms -27.9% SimpleIntFloatArithmetic: 134ms 175ms -23.6% 138ms 176ms -21.6% SimpleIntegerArithmetic: 134ms 183ms -26.8% 141ms 183ms -23.1% SimpleListManipulation: 91ms 143ms -36.3% 93ms 143ms -35.1% SimpleLongArithmetic: 88ms 108ms -17.9% 91ms 109ms -16.2% SmallLists: 127ms 162ms -21.6% 129ms 164ms -21.2% SmallTuples: 149ms 177ms -15.6% 151ms 178ms -15.1% SpecialClassAttribute: 423ms 426ms -0.7% 426ms 430ms -0.9% SpecialInstanceAttribute: 110ms 154ms -28.2% 111ms 154ms -28.3% StringMappings: 428ms 443ms -3.4% 432ms 449ms -3.7% StringPredicates: 124ms 161ms -23.1% 125ms 162ms -22.7% StringSlicing: 207ms 223ms -7.1% 208ms 228ms -8.7% TryExcept: 72ms 166ms -56.3% 73ms 166ms -56.2% TryFinally: 93ms 120ms -22.9% 93ms 124ms -25.2% TryRaiseExcept: 52ms 64ms -19.2% 52ms 65ms -19.2% TupleSlicing: 177ms 195ms -9.1% 178ms 198ms -10.2% WithFinally: 147ms 163ms -10.2% 147ms 164ms -10.1% WithRaiseExcept: 156ms 173ms -10.1% 157ms 174ms -9.7% ------------------------------------------------------------------------------- Totals: 6903ms 8356ms -17.4% 6982ms 8443ms -17.3%

msg78307 - (view) Author: Antoine Pitrou (pitrou) * Date: 2008-12-26 21:22

Armin, by reading the pypy-dev mailing-list it looks like you were interested in this.

On 2008-12-26 22:09, Antoine Pitrou wrote: > This patch implements what is usually called "threaded code" for the > ceval loop on compilers which support it (only gcc). The idea is that > there is a separate opcode dispatch epilog at the end of each opcode, > which allows the CPU to make much better use of its branch prediction > capabilities. The net result is a 15-20% average speedup on pybench and > pystone, with higher speedups on very tight loops (see below for the > full pybench result chart). Now I know why you want opcode stats in pybench :-) This looks like a promising approach. Is is possible to backport this change to Python 2.x as well ?

msg78363 - (view) Author: Antoine Pitrou (pitrou) * Date: 2008-12-27 13:27

> This looks like a promising approach. Is is possible to backport > this change to Python 2.x as well ? Certainly.

msg78364 - (view) Author: Antoine Pitrou (pitrou) * Date: 2008-12-27 13:29

This new patch uses a statically initialized jump table rather than specific initialization code.

msg78616 - (view) Author: Christian Heimes (christian.heimes) * Date: 2008-12-31 16:02

I'm having trouble understanding the technique of the jump table. Can you provide some links to papers that explain the threaded code? I'm interested in learning more. How does your implementation compare to the GForth based threaded code speedwise?

msg78620 - (view) Author: Antoine Pitrou (pitrou) * Date: 2008-12-31 16:15

> I'm having trouble understanding the technique of the jump table. Can > you provide some links to papers that explain the threaded code? I'm > interested in learning more. I haven't read any papers. Having a jump table in itself isn't special (the compiler does exactly that when compiling the switch() statement). What's special is that a dedicated indirect jump instruction at the end of each opcode helps the CPU make a separate prediction for which opcode follows the other one, which is not possible with a switch statement where the jump instruction is shared by all opcodes. I believe that's where most of the speedup comes from. If you read the patch it will probably be easy to understand. I had the idea to try this after a thread on pypy-dev, there are more references there: http://codespeak.net/pipermail/pypy-dev/2008q4/004916.html > How does your implementation compare to the GForth based threaded code > speedwise? Don't know! Your experiments are welcome. My patch is far simpler to integrate though (it's small, introduces very few changes and does not break any existing tests).

msg78628 - (view) Author: Christian Heimes (christian.heimes) * Date: 2008-12-31 16:37

> I haven't read any papers. Having a jump table in itself isn't special > (the compiler does exactly that when compiling the switch() statement). > What's special is that a dedicated indirect jump instruction at the end > of each opcode helps the CPU make a separate prediction for which opcode > follows the other one, which is not possible with a switch statement > where the jump instruction is shared by all opcodes. I believe that's > where most of the speedup comes from. > > If you read the patch it will probably be easy to understand. You are right. It's easier to understand after I've learned how the opcode_targets table is working. Previously I didn't know that one can store the address of a label in an array. Before I got it I wondered where the pointers were defined. Is this a special GCC feature? I haven't seen it before. > Don't know! Your experiments are welcome. My patch is far simpler to > integrate though (it's small, introduces very few changes and does not > break any existing tests). Yes, your patch is much smaller, less intrusive and easier to understand with a little background in CS.

msg78629 - (view) Author: Antoine Pitrou (pitrou) * Date: 2008-12-31 16:40

> Is this a special GCC feature? Yes, it is. http://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html

msg78651 - (view) Author: Antoine Pitrou (pitrou) * Date: 2008-12-31 20:24

This new patch adds some detailed comments, at Jason Orendorff's request.

msg78653 - (view) Author: Alexandre Vassalotti (alexandre.vassalotti) * Date: 2008-12-31 21:20

You may want to check out issue1408710 in which a similar patch was provided, but failed to deliver the desired results. I didn't get the advertised ~15% speed-up, but only 4% on my Intel Core2 laptop and 8% on my AMD Athlon64 X2 desktop. I attached the benchmark results. The patch looks pretty clean. Here is a few things that caught my attention while reading your patch. First, you should rename opcode_targets.c to opcode_targets.h. This will make it explicit that the file is not compiled, but just included. Also, the macro USE_THREADED_CODE should be renamed to something else; the word "thread" is too tightly associated with multi-threading. Furthermore, threaded code simply refers to code consisting only of function calls. Maybe, USE_COMPUTED_GOTO or USE_DIRECT_DISPATCH would be better. Finally, why do you disable your patch when DYNAMIC_EXECUTION_PROFILE or LLTRACE is enabled? I tested your patch with both enabled and I didn't see any test failures. By the way, SUNCC also supports GCC's syntax for labels as values (http://docs.sun.com/app/docs/doc/819-5265/bjabt?l=en&a=view).

msg78656 - (view) Author: Skip Montanaro (skip.montanaro) * Date: 2008-12-31 22:12

Works pretty well for me on my MacBook Pro, but on my G5 it performed abysmally. In fact, it ran so much worse that I cleaned up my sandbox and did both checks all over again to make sure I didn't mess something up. It looks like my MacBook Pro saw about a 7% improvement while my G5 saw a 14% degradation. Both computers are running Mac OS X 10.5.6 with the latest Xcode - 3.1.2. On both computers gcc -v reports 4.0.1, Apple build 5490. If this is applied to the core I think it will have to select for more than just gcc. It will also have to select based on the instruction set architecture.

msg78657 - (view) Author: Antoine Pitrou (pitrou) * Date: 2008-12-31 22:17

> Works pretty well for me on my MacBook Pro, but on my G5 it performed > abysmally. In fact, it ran so much worse that I cleaned up my sandbox > and did both checks all over again to make sure I didn't mess something > up. It looks like my MacBook Pro saw about a 14% degradation. Both > computers are running Mac OS X 10.5.6 with the latest Xcode - 3.1.2. > On both computers gcc -v reports 4.0.1, Apple build 5490. You're sure you didn't compile in debug mode or something? Just checking.

msg78658 - (view) Author: Skip Montanaro (skip.montanaro) * Date: 2008-12-31 22:24

Antoine> You're sure you didn't compile in debug mode or something? Just Antoine> checking. There was a cut-n-paste error in that one which I noticed right after submitting (man, do I hate the crappy editing capability of <textarea> widgets). I removed it within a minute or two and replaced it with a correct version. Short answer: Intel thumbs up, PowerPC thumbs down. Skip

msg78660 - (view) Author: Antoine Pitrou (pitrou) * Date: 2008-12-31 22:41

Hello, > You may want to check out issue1408710 in which a similar patch was > provided, but failed to deliver the desired results. > > I didn't get the advertised ~15% speed-up, but only 4% on my Intel Core2 > laptop and 8% on my AMD Athlon64 X2 desktop. I attached the benchmark > results. Thanks. The machine I got the 15% speedup on is in 64-bit mode with gcc 4.3.2. If you want to investigate, you can output the assembler code for ceval.c; the command-line should be something like: gcc -pthread -c -fno-strict-aliasing -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I. -IInclude -I./Include -DPy_BUILD_CORE -S -dA Python/ceval.c and then count the number of indirect jump instructions in ceval.c: grep -E "jmp[[:space:]]\*%" ceval.s There should be 85 to 90 of them, roughly. If there are many less, then the compiler has tried to optimize them by "sharing" them. > First, you should rename opcode_targets.c to opcode_targets.h. This will > make it explicit that the file is not compiled, but just included. Ok. > Also, the macro USE_THREADED_CODE should be renamed to something else; > the word "thread" is too tightly associated with multi-threading. > Furthermore, threaded code simply refers to code consisting only of > function calls. Maybe, USE_COMPUTED_GOTO or USE_DIRECT_DISPATCH would be > better. Ok. > Finally, why do you disable your patch when DYNAMIC_EXECUTION_PROFILE or > LLTRACE is enabled? I tested your patch with both enabled and I didn't > see any test failures. Because otherwise the measurements these options are meant to do would be meaningless. > By the way, SUNCC also supports GCC's syntax for labels as values I don't have a Sun machine to test, so I'll leave to someone else to check and enable if they want to.

msg78663 - (view) Author: Antoine Pitrou (pitrou) * Date: 2008-12-31 22:53

Attached new patch for fixes suggested by Alexandre (rename opcode_targets.c to opcode_targets.h, replace USE_THREADED_CODE with USE_COMPUTED_GOTOS).

msg78685 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-01 03:44

Mentioning other versions as well. The patch is so easy that it can be backported to all supported versions, so I'm adding all of them (2.5 is in bugfix-only mode, and as far as I can see this patch cannot be accepted there, sadly).

msg78686 - (view) Author: Skip Montanaro (skip.montanaro) * Date: 2009-01-01 04:10

Paolo> (2.5 is in bugfix-only mode, and as far as I can see this patch Paolo> cannot be accepted there, sadly). You could backport it to 2.4 & 2.5 and just put it up on PyPI...

msg78687 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-01 04:28

Some other comments. The time saving of indirect threading are also associated with the removal of the range check, but better branch prediction is the main advantage. > Also, the macro USE_THREADED_CODE should be renamed to something else; > the word "thread" is too tightly associated with multi-threading. That's true. > Furthermore, threaded code simply refers to code consisting only of > function calls. Maybe, USE_COMPUTED_GOTO or USE_DIRECT_DISPATCH would be > better. I'd prefer USE_DIRECT_DISPATCH (or better, USE_THREADED_DISPATCH) rather than USE_COMPUTED_GOTO, since the latter is just the used language construct. "indirect threading" is the standard name in CS literature to define this technique. "Direct threading" is a variation where in the bytecode array, opcode is replaced by the pointer opcode_handler[opcode], so that dispatch is slightly faster. Most interpreters use indirect threading to save memory, and because it enables to switch the opcode handlers table to activate for instance debugging. The best paper about this is: "The Structure and Performance of Efficient Interpreters, M. Anton Ertl and David Gregg, 2003". The original paper about (direct) threaded code is this: "Threaded Code, James R. Bell, Comm. of ACM, 1973", but I don't feel it relevant at all today. Indirect threading was introduced in "Indirect Threaded Code, Robert B. K. Dewar, Communications of the ACM, 1975" (that's just a bit more relevant, but still).

msg78688 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-01 04:36

> You may want to check out issue1408710 in which a similar patch was > provided, but failed to deliver the desired results. It's not really similar, because you don't duplicate the dispatch code. It took me some time to understand why you didn't change the "goto fast_next_opcode", but that's where you miss the speedup. The only difference with your change is that you save the range check for the switch, so the slowdown probably comes from some minor output change from GCC I guess. Anyway, this suggests that the speedup really comes from better branch prediction and not from saving the range check. The 1st paper I mentioned simply states that saving the range check might make a small differences. The point is that sometimes, when you are going to flush the pipeline, it's like adding a few instructions, even conditional jumps, does not make a difference. I've observed this behaviour quite a few times while building from scratch a small Python interpreter. I guess (but this might be wrong) that's because the execution units were not used at their fullest, and adding conditional jumps doesn't make a differeence because flushing a pipeline once or twice is almost the same (the second flush removes just few instructions). Or something like that, I'm not expert enough of CPU architecture to be sure of such guesses.

msg78689 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-01 05:03

Topics 1) About different speedups on 32bits vs 64 bits 2) About PPC slowdown 3) PyPI ======= About different speedups on 32bits vs 64 bits ======= An interpreter is very register-hungry, so on x86_64 it spends much less time on register spill (i.e. moving locals from/to memory), so instruction dispatch takes a bigger share of execution time. If the rest of the interpreter is too slow, indirect threading gives no advantage. Look at the amount of register variables in PyEval_EvalFrameEx() (btw, do you support any compiler where nowadays 'register' still makes a difference? That's quite weird). Lots of them are simply cached copies of fields of the current frame and of the current function; without copying them to locals, the compiler should assume they could change at any function call. In fact, adding locals this way gave huge speedups on tight loops on the Python interpreter I built with a colleague for our student project, to experiment with speeding up Python. And adding a write to memory in the dispatch code (to f->last_i) gave a 20% slowdown. Since my interpreter uses a copying garbage collector and CPython uses reference counting, which is much slower (if you don't know this, show me a fast JVM with reference counting), I'm even surprised you can get such a big speedup from threading. ======= About PPC slowdown ======= Somebody should try the patch on Pentium4 as well. During our VM course, threading slowed down a toy interpreter with 4 toy opcodes. Our teachers suggested that with such a small interpreter, since threaded code takes more space (in that case, ~64 vs ~100 bytes), this could give problems with code caches, but suggested checking that idea using performance counters. I'm not sure about why. I don't have right now neither a Pentium4 nor a PowerPC available, so I can't check myself. But this is the best way to analyze the performance unexpected behaviour. ======= PyPI ======= Paolo> (2.5 is in bugfix-only mode, and as far as I can see this patch Paolo> cannot be accepted there, sadly). Skip> You could backport it to 2.4 & 2.5 and just put it up on PyPI... I was thinking to a private backport as well. I didn't know about PyPI, it looks like PyPI is more for contributed modules than for this, would that work?

msg78691 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-01 05:15

== On the patch itself == Why don't you use the C preprocessor instead of that Python code? Sample code: #define OPCODE_LIST(DEFINE_OPCODE) \ DEFINE_OPCODE(STOP_CODE, 0) \ DEFINE_OPCODE(POP_TOP, 1) \ DEFINE_OPCODE(ROT_TWO, 2) \ DEFINE_OPCODE(ROT_THREE, 3) \ DEFINE_OPCODE(DUP_TOP, 4) \ DEFINE_OPCODE(ROT_FOUR, 5) \ DEFINE_OPCODE(NOP, 9) \ ... # define DECL_OPCODE(opcode) \ [opcode] = && label_ ## opcode, void *opcodes[] = { OPCODE_LIST(DECL_OPCODE) }; # undef DECL_OPCODE There are also other ways to do it, but using higher-order functions within the preprocessor in this way is something that I learned from the V8 source code. It has the advantage that OPCODE_LIST can be used in a lot of other places (maybe when implementing the 'opcode' module, if it's written in C).

msg78708 - (view) Author: Antoine Pitrou (pitrou) * Date: 2009-01-01 13:06

Hi, > Why don't you use the C preprocessor instead of that Python code? > Sample code: We would have to change opcode.h for this to be truely useful (in order to re-use OPCODE_LIST()). I think that should be the subject of a separate bug entry for code reorganization. Thanks for all the explanation and pointers! About register allocation, I wonder if the "temporary variables" u,v,w could be declared separately in each opcode rather than at the top of the eval function, so that the compiler doesn't try to store their values between two opcodes. As for the "register" declarations, I think they're just remains of the past. Antoine.

msg78711 - (view) Author: Skip Montanaro (skip.montanaro) * Date: 2009-01-01 14:00

Skip> You could backport it to 2.4 & 2.5 and just put it up on PyPI... Paolo> I was thinking to a private backport as well. I didn't know Paolo> about PyPI, it looks like PyPI is more for contributed modules Paolo> than for this, would that work? I don't see why it wouldn't. Skip

msg78722 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-01 19:43

> We would have to change opcode.h for this to be truely useful (in order to re-use OPCODE_LIST()). Yep. > I think that should be the subject of a separate bug entry for code reorganization. Agreed, I'll maybe try to find time for it. > Thanks for all the explanation and pointers! You're welcome, thanks to you for writing the patch! And > About register allocation, I wonder if the "temporary variables" u,v,w could be declared separately in each opcode rather than at the top of the eval function, so that the compiler doesn't try to store their values between two opcodes. I didn't look at the output, but that shouldn't be needed with decent optimizers, since they are local variables, so the compiler has a full picture of their usage (this does not happen with the content of the heap, where the frame object may lie). I think that change would just save some compilation time for dataflow analysis, maybe :-). Or could make clearer which variables is used where, but that is a matter of taste (what's there is fine for me). I just studied liveness analysis in compilers, and it computes whether a variable is live before and after each statement; if the value of a variable is not used in some piece of code until the next write to the variable, it is considered dead in that piece of code, and that variable does not take space; since u, v, w are either unused or are written to before usage in all opcodes, they're dead at the beginning of each opcode, so they're also dead just before dispatch. The only important thing is that the content of the jump table are known to the compiler and that the compiler makes use of that. Simply passing a non-const jump table to some function defined elsewhere (which could in theory modify it) would make the output much worse.

msg78730 - (view) Author: Alexandre Vassalotti (alexandre.vassalotti) * Date: 2009-01-01 21:58

Antoine Pitrou wrote: > [...] count the number of indirect jump instructions in ceval.c: > > grep -E "jmp[[:space:]]\*%" ceval.s > > There should be 85 to 90 of them, roughly. If there are many less, then > the compiler has tried to optimize them by "sharing" them. I get 86 with GCC 4.x and SUNCC. However, with GCC 3.4 I only get a single computed goto. Is there some hidden option to make GCC avoid sharing jumps? > Because otherwise the measurements these options are meant to do would > be meaningless. Ah, I see now. Maybe you should add a quick comment that mentions this. > I don't have a Sun machine to test, so I'll leave to someone else to > check and enable if they want to. I tested it and it worked, no test failures to report. Just change the macro test: #ifdef __GNUC__ && \ ... to #ifdef (__GNUC__ || __SUNPRO_C) && \ ... I attached some additional benchmarks on SunOS. So far, it seems the benefits of the proposed optimization are highly compiler-dependent.

msg78731 - (view) Author: Alexandre Vassalotti (alexandre.vassalotti) * Date: 2009-01-01 22:00

> Attached new patch for fixes suggested by Alexandre (rename > opcode_targets.c to opcode_targets.h, replace USE_THREADED_CODE with > USE_COMPUTED_GOTOS). You forgot to update your script to use the new name.

msg78735 - (view) Author: Antoine Pitrou (pitrou) * Date: 2009-01-01 22:48

> I get 86 with GCC 4.x and SUNCC. However, with GCC 3.4 I only get a > single computed goto. Is there some hidden option to make GCC avoid > sharing jumps? Try -fno-crossjumping. > I tested it and it worked, no test failures to report. Just change the > macro test: > > #ifdef __GNUC__ && \ > ... > > to > > #ifdef (__GNUC__ || __SUNPRO_C) && \ > ... Thanks. > You forgot to update your script to use the new name. Ah, that's rather dumb :)

msg78736 - (view) Author: Antoine Pitrou (pitrou) * Date: 2009-01-01 22:59

I've updated the comments as per Alexandre's request, added support for SUN CC, and fixed the generation script to use the new filename.

msg78748 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-02 02:53

> I attached some additional benchmarks on SunOS. So far, it seems the benefits of the proposed optimization are highly compiler-dependent. Well, it would be more correct to say that as you verified for GCC 3.4, "miscompilation" of the code happens easily. Any literature research shows that threading in a fast interpreter does help. My experience shows two exceptions to this rule: a) bad compiler output b) interpreters which are not efficient enough - when other operations are even slower than instruction dispatch (which is really slow due to costly mispredictions), threading can't help. This is shown by the number of interpreters using threading. Wikipedia has more pointers on this: http://en.wikipedia.org/wiki/Threaded_code Note that what I called "indirect threading" is called there instead "token threading". Another example of the importance of threading is also shown in this article: http://webkit.org/blog/189/announcing-squirrelfish/ Some clues about why Python does not use threading: http://mail.python.org/pipermail/python-list/1999-May/002003.html It is important to note that people in that mail are not aware of why threading gives a speedup. == For SunCC, I can't say anything without looking at: a) the generated code; if jump targets were aligned only for switch but not for computed gotos, for instance, that could maybe explain such a slowdown. Lots of other details might be relevant. b) performance counters results, especially regarding mispredictions of indirect branches.

On 2009-01-01 23:59, Antoine Pitrou wrote: > Antoine Pitrou <pitrou@free.fr> added the comment: > > I've updated the comments as per Alexandre's request, added support for > SUN CC, and fixed the generation script to use the new filename. Since the patch heavily relies on the compiler doing the right thing (which esp. GCC often doesn't, unfortunately), I think that the opcode dispatch code should only be enabled via a configure option. This is safer than enabling the support unconditionally for GCC and the SUN Pro C compiler, since it is rather likely that some GCC versions have bugs which could render Python unusable if compiled with the dispatching support enabled. A configure option also has the additional benefit that you can enable the support for compilers which support the syntax, but are not included in the hard-coded list of compilers included in the patch.

msg78828 - (view) Author: Antoine Pitrou (pitrou) * Date: 2009-01-02 16:10

> This is safer than enabling the support unconditionally for GCC and > the SUN Pro C compiler, since it is rather likely that some GCC versions > have bugs which could render Python unusable if compiled with the > dispatching support enabled. What do you mean, "unusable"? 10% slower? Well, Python 3.x is already unusable (compared to 2.x) by that metric... Until now, only Skip has reported a slowdown on his PPC machine, while x86 systems have all seen a positive (if tiny, sometimes) improvement. I fear that with a configure option, disabled by default, the code will get very poor testing and perhaps get broken in some subtle way without anyone noticing.

msg78831 - (view) Author: Skip Montanaro (skip.montanaro) * Date: 2009-01-02 16:17

Antoine> I fear that with a configure option, disabled by default, the Antoine> code will get very poor testing and perhaps get broken in some Antoine> subtle way without anyone noticing. That can be fixed by enabling that option on the buildbots where it is presumed to help. I see more slowdown on PPC than people are reporting as speedups on Intel. Is there something I can do to help debug the problem? It doesn't appear the Apple version of gcc supports the -fno-crossjumping flag. If I dump the assembler code for ceval.c will that help others debug the problem? Skip

msg78834 - (view) Author: Antoine Pitrou (pitrou) * Date: 2009-01-02 16:28

> If I dump the assembler code for ceval.c will that help others debug > the problem? Well, I'm no PPC expert but it can be useful. Can you dump it with "-S -dA"? (also, can you try the latest patch? I've made some tiny adjustement in the opcode epilogues, I don't think it will make a difference but who knows)

On 2009-01-02 17:10, Antoine Pitrou wrote: > Antoine Pitrou <pitrou@free.fr> added the comment: > >> This is safer than enabling the support unconditionally for GCC and >> the SUN Pro C compiler, since it is rather likely that some GCC versions >> have bugs which could render Python unusable if compiled with the >> dispatching support enabled. > > What do you mean, "unusable"? Well, not working. GCC versions often have optimizer bugs (esp. the 3.x series and early 4.x versions) and I wouldn't bet on them always getting the dispatch optimizations right. Trying to compile Python with an unconditionally enabled dispatch patch on such a system would render Python unusable. > 10% slower? Well, Python 3.x is already > unusable (compared to 2.x) by that metric... Until now, only Skip has > reported a slowdown on his PPC machine, while x86 systems have all seen > a positive (if tiny, sometimes) improvement. > > I fear that with a configure option, disabled by default, the code will > get very poor testing and perhaps get broken in some subtle way without > anyone noticing. Like Skip said: the buildbots could take care of identifying such problems. People using the option would certainly report problems as well and I'm sure that Linux distributions would compile Python with the switch after verifying that their GCC version works correctly.

msg78871 - (view) Author: Skip Montanaro (skip.montanaro) * Date: 2009-01-02 20:26

OK, I think I'm misreading the output of pybench. Let me reset. Ignore anything I've written previously on this topic. Instead, I will just post the output of my pybench comparison runs and let more expert people interpret as appropriate. The first file is the result of the run on PowerPC (Mac G5).

msg78872 - (view) Author: Skip Montanaro (skip.montanaro) * Date: 2009-01-02 20:27

The next is the result of running on my MacBook Pro (Intel Core 2 Duo).

msg78898 - (view) Author: Antoine Pitrou (pitrou) * Date: 2009-01-02 22:57

> OK, I think I'm misreading the output of pybench. Let me reset. Ignore > anything I've written previously on this topic. Instead, I will just > post the output of my pybench comparison runs and let more expert people > interpret as appropriate. The first file is the result of the run on > PowerPC (Mac G5). Ok, so the threaded version is actually faster by 20% on your PPC, and slower by 5% on your Core 2 Duo. Thanks for doing the measurements!

msg78899 - (view) Author: Skip Montanaro (skip.montanaro) * Date: 2009-01-02 23:03

Antoine> Ok, so the threaded version is actually faster by 20% on your Antoine> PPC, and slower by 5% on your Core 2 Duo. Thanks for doing the Antoine> measurements! Confirmed by pystone runs as well. Sorry for the earlier misdirection. Skip

msg78910 - (view) Author: Alexandre Vassalotti (alexandre.vassalotti) * Date: 2009-01-03 00:45

The patch make a huge difference on 64-bit Linux. I get a 20% speed-up and the lowest run time so far. That is quite impressive! At first glance, it seems the extra registers of the x86-64 architecture permit GCC to avoid spilling registers onto the stack (see assembly just below). However, I don't know why the speed up due to the patch is much more significant on x86-64 than on x86. This is the x86 assembly generated by GCC 4.3 (annotated and slightly edited for readability): movl -440(%ebp), %eax # tmp = next_instr movl $145, %esi # opcode = LIST_APPEND movl 8(%ebp), %ecx # f subl -408(%ebp), %eax # tmp -= first_instr movl %eax, 60(%ecx) # f->f_lasti = tmp movl -440(%ebp), %ebx # next_instr movzbl (%ebx), %eax # tmp = *next_instr addl $1, %ebx # next_instr++ movl %ebx, -440(%ebp) # next_instr movl opcode_targets(,%eax,4), %eax # tmp = opcode_targets[tmp] jmp *%eax # goto *tmp And this is the x86-64 assembly generated also by GCC 4.3: movl %r15d, %eax # tmp = next_instr subl 76(%rsp), %eax # tmp -= first_instr movl $145, %ebp # opcode = LIST_APPEND movl %eax, 120(%r14) # f->f_lasti = tmp movzbl (%r15), %eax # tmp = *next_instr addq $1, %r15 # next_instr++ movq opcode_targets(,%rax,8), %rax # tmp = opcode_targets[tmp] jmp *%rax # goto *tmp The above assemblies are equivalent to the following C code: opcode = LIST_APPEND; f->f_lasti = ((int)(next_instr - first_instr)); goto *opcode_targets[*next_instr++]; On the register-starved x86 architecture, the assembly has 4 stack load and 1 store operations. While on the x86-64 architecture, most variables are kept in registers thus it only uses 1 stack store operation. And from what I saw from the assemblies, the extra registers with the traditional switch dispatch aren't much used, especially with the opcode prediction macros which avoid manipulations of f->f_lasti. That said, I am glad to hear the patch makes Python on PowerPC faster, because this supports the hypothesis that extra registers are better used with indirect threading (PowerPC has 32 general-purpose registers).

msg78911 - (view) Author: Alexandre Vassalotti (alexandre.vassalotti) * Date: 2009-01-03 00:48

One more thing, the patch causes the following warnings to be emited by GCC when USE_COMPUTED_GOTOS is undefined. Python/ceval.c: In function ‘PyEval_EvalFrameEx’: Python/ceval.c:2420: warning: label ‘_make_function’ defined but not used Python/ceval.c:2374: warning: label ‘_call_function_var_kw’ defined but not used Python/ceval.c:2280: warning: label ‘_setup_finally’ defined but not used

msg78915 - (view) Author: Skip Montanaro (skip.montanaro) * Date: 2009-01-03 00:58

Alexandre's last comment reminded me I forgot to post the PPC assembler code. Next two files are the output as requested by Antoine.

msg78921 - (view) Author: Christian Heimes (christian.heimes) * Date: 2009-01-03 01:37

> Alexandre Vassalotti <alexandre@peadrop.com> added the comment: > The patch make a huge difference on 64-bit Linux. I get a 20% speed-up > and the lowest run time so far. That is quite impressive! I'm really, REALLY impressed by the speed up. Good work! I'm not an expert in this kind of optimizations. Could we gain more speed by making the dispatcher table more dense? Python has less than 128 opcodes (len(opcode.opmap) == 113) so they can be squeezed in a smaller table. I naively assume a smaller table increases the amount of cache hits. Christian

msg78923 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-03 01:51

About miscompilations: the current patch is a bit weird for GCC, because you keep both the switch and the computed goto. But actually, there is no case in which the switch is needed, and computed goto give less room to GCC's choices. So, can you try dropping the switch altogether, using always computed goto and seeing how does the resulting code get compiled? I see you'll need two labels (before and after argument fetch) per opcode and two dispatch tabels, but that's no big deal (except for code alignment - align just the common branch target). An important warning is that by default, on my system, GCC 4.2 aligns branch targets for switch to a 16-byte boundary (as recommended by the Intel optimization guide), by adding a ".p2align 4,,7" GAS directive, and it does not do that for computed goto. Adding the directive by hand gave a small speedup, 2% I think; I should try -falign-jumps=16 if it's not enabled (some -falign-jumps is enabled by -O2), since that is supposed to give the same result. Please use that yourself as well, and verify it works for labels, even if I fear it doesn't. > However, I don't know why the speed up due to the patch is much more significant on x86-64 than on x86. It's Amdahl's law, even if this is not about parallel code. When the rest is faster (x86_64), the same speedup on dispatch gives a bigger overall speedup. To be absolutely clear: x86_64 has more registers, so the rest of the interpreter is faster than x86, but dispatch still takes the same absolute time, which is 70% on x86_64, but only 50% on x86 (those are realistic figures); if this patch halved dispatch time on both (we're not so lucky), we would save 35% on x86_64 but only 25% on x86. In fact, on inefficient interpreters, indirect threading is useless altogether. So, do those extra register help _so_ much? Yes. In my toy interpreter, computing last_i for each dispatch doesn't give any big slowdown, but storing it in f->last_i gives a ~20% slowdown - I cross-checked multiple times because I was astonished. Conversely, when the program counter had to be stored in memory, I think it was like 2x slower.

msg78925 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-03 02:20

> I'm not an expert in this kind of optimizations. Could we gain more speed by making the dispatcher table more dense? Python has less than 128 opcodes (len(opcode.opmap) == 113) so they can be squeezed in a smaller table. I naively assume a smaller table increases the amount of cache hits. Well, you have no binary compatibility constraint with a new release, so it can be tried and benchmarked, or it can be done anyway! On x86_64 the impact of the jump table is 8 bytes per pointer * 256 pointers = 2KiB, and the L1 data cache of Pentium4 can be 8KiB or 16KiB wide. But I don't expect this to be noticeable in most synthetic microbenchmarks. Matrix multiplication would be the perfect one I guess; the repeated column access would kill the L1 data cache, if the whole matrixes don't fit.

msg78956 - (view) Author: Antoine Pitrou (pitrou) * Date: 2009-01-03 13:22

> I'm not an expert in this kind of optimizations. Could we gain more > speed by making the dispatcher table more dense? Python has less than > 128 opcodes (len(opcode.opmap) == 113) so they can be squeezed in a > smaller table. I naively assume a smaller table increases the amount of > cache hits. I don't think so. The end of the current table, which doesn't correspond to any valid opcodes, will not be cached anyway. The upper limit to be considered is the maximum value for a valid opcode, which is 147. Reducing that to 113 may reduce cache pressure, but only by a tiny bit. Of course, only experimentation could tell us for sure :)

msg79032 - (view) Author: Daniel Diniz (ajaksu2) Date: 2009-01-04 02:00

IIUC, this is what gcc 4.2.4 generates on a Celeron M for the code Alexandre posted: movl -272(%ebp), %eax movl 8(%ebp), %edx subl -228(%ebp), %eax movl %eax, 60(%edx) movl -272(%ebp), %ecx movzbl (%ecx), %eax - addl $1, %ecx movl %ecx, -272(%ebp) movl opcode_targets.9311(,%eax,4), %ecx movl %eax, %ebx - jmp *%ecx And this is what ICC 11.0 generates for (what I think is) the same bit: movl 360(%esp), %ecx movl %esi, %edi subl 304(%esp), %edi movl %edi, 60(%ecx) movzbl (%esi), %ecx - movl opcode_targets.2239.0.34(,%ecx,4), %eax incl %esi - jmp ..B12.137 # Prob 100% # ..B12.137: jmp *%eax Does this mean that icc handles register starvation better here? FWIW, on this CPU, compiling with icc gives a 20% speed boost in pybench regardless of this patch.

msg79033 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-04 02:28

1st note: is that code from the threaded version? Note that you need to modify the source to make it accept also ICC to try that. In case you already did that, I guess the patch is not useful at all with ICC since, as far as I can see, the jump is shared. It is vital to this patch that the jump is not shared, something similar to -fno-crossjumping should be found. 2nd note: the answer to your questions seems yes, ICC has less register spills. Look for instance at: movl -272(%ebp), %ecx movzbl (%ecx), %eax addl $1, %ecx and movzbl (%esi), %ecx incl %esi This represents the increment of the program counter after loading the next opcode. In the code you posted, one can see that the program counter is spilled to memory by GCC, but isn't by ICC. Either the spill is elsewhere, or ICC is better here. And it's widely known that ICC has a much better optimizer in many cases, and I remember that GCC register allocator really needs improvement. Finally, I'm a bit surprised by "addl $1, %ecx", since any peephole optimizer should remove that; I'm not shocked just because I've never seen perfect GCC output.

msg79034 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-04 02:29

Daniel, I forgot to ask for the compilation command line you used, since they make a lot of difference. Can you post them? Thanks

msg79035 - (view) Author: Daniel Diniz (ajaksu2) Date: 2009-01-04 03:07

Paolo 'Blaisorblade' Giarrusso wrote: > > 1st note: is that code from the threaded version? [...] It is vital to > this patch that the jump is not shared, something similar to > -fno-crossjumping should be found. Yes, threaded version by unconditionally defining USE_THREADED_CODE (old patch version :). Ok, I'll try to find a way to get at -fno-crossjumping behavior. BTW, man gcc suggests using -fno-gcse for programs that use computed gotos (couldn't fin [...] > In the code you posted, one can see that the program > counter is spilled to memory by GCC, but isn't by ICC. Either the spill > is elsewhere, or ICC is better here. I can [mail you|attach here] icc's output if you want to check the overall code, it's about 1.9M with the code annotations. > Finally, I'm a bit surprised by "addl $1, %ecx", since any peephole > optimizer should remove that; I'm not shocked just because I've never > seen perfect GCC output. I'm glad to see the same issue in Alexandre's output, not my fault then :D The command line I used (after a clean build with gcc) was: icc -pthread -c -fno-strict-aliasing -DNDEBUG -g -fwrapv -O3 -w -I. -IInclude -I./Include -DPy_BUILD_CORE -S -fcode-asm -fsource-asm -dA Python/ceval.c (same as with gcc, except for warnings and -fcode-asm -fsource-asm).

msg79037 - (view) Author: Alexandre Vassalotti (alexandre.vassalotti) * Date: 2009-01-04 04:51

Paolo wrote: > So, can you try dropping the switch altogether, using always computed > goto and seeing how does the resulting code get compiled? Removing the switch won't be possible unless we change the semantic EXTENDED_ARG. In addition, I doubt the improvement, if any, would worth the increased complexity. > To be absolutely clear: x86_64 has more registers, so the rest of the > interpreter is faster than x86, but dispatch still takes the same > absolute time, which is 70% on x86_64, but only 50% on x86 (those are > realistic figures); I don't understand what you mean by "absolute time" here. Do you actually mean the time spent interpreting bytecodes compared to the time spent in the other parts of Python? If so, your figures are wrong for CPython on x86-64. It is about 50% just like on x86 (when running pybench). With the patch, this drops to 35% on x86-64 and to 45% on x86. > In my toy interpreter, computing last_i for each dispatch doesn't give > any big slowdown, but storing it in f->last_i gives a ~20% slowdown. I patched ceval.c to minimize f->last_i manipulations in the dispatch code. On x86, I got an extra 9% speed up on pybench. However, the patch is a bit clumsy and a few unit tests are failing. I will see if I can improve it and open a new issue if worthwhile.

msg79051 - (view) Author: Skip Montanaro (skip.montanaro) * Date: 2009-01-04 13:14

I'm sure this is the wrong place to bring this up, but I had a thought about simple JIT compilation coupled with the opcode dispatch changes in this issue. Consider this silly function: >>> def f(a, b): ... result = 0 ... while b: ... result += a ... b -= 1 ... return result ... which compiles to 2 0 LOAD_CONST 1 (0) 3 STORE_FAST 2 (result) 3 6 SETUP_LOOP 32 (to 41) >> 9 LOAD_FAST 1 (b) 12 JUMP_IF_FALSE 24 (to 39) 15 POP_TOP 4 16 LOAD_FAST 2 (result) 19 LOAD_FAST 0 (a) 22 INPLACE_ADD 23 STORE_FAST 2 (result) 5 26 LOAD_FAST 1 (b) 29 LOAD_CONST 2 (1) 32 INPLACE_SUBTRACT 33 STORE_FAST 1 (b) 36 JUMP_ABSOLUTE 9 >> 39 POP_TOP 40 POP_BLOCK 6 >> 41 LOAD_FAST 2 (result) 44 RETURN_VALUE What if you built and compiled a "Mini Me" version of PyEval_EvalFrameEx on-the-fly which only contained the prologue and epilog of the real function and a small switch statement which only knew about the the byte-code instructions used by f()? Would the compiler be better able to optimize the code? Would the instructions' placement nearer to each other provide better cache behavior? Would branch prediction by CPU be improved? Another possibility would be to eliminate the for(;;) ... switch altogether and just inline the code for the individual instructions. It would help if the body of each bytecode instruction was implemented as a macro, e.g.: #define _POP_TOP() \ PREDICTED(POP_TOP); \ TARGET(POP_TOP) \ v = POP(); \ Py_DECREF(v); \ FAST_DISPATCH(); The above function could (lots of hand-waving here) be "compiled" to something like PyObject * _MiniMe(PyFrameObject *f, int throwflag) { _PyEVAL_FRAMEEX_PROLOG _LOAD_CONST(1) _STORE_FAST(2) _SETUP_LOOP(_41) _9: _LOAD_FAST(1) _JUMP_IF_FALSE(_39) _POP_TOP() _LOAD_FAST(2) _LOAD_FAST(0) _INPLACE_ADD() _STORE_FAST(2) _26: _LOAD_FAST(1) _LOAD_CONST(2) _INPLACE_SUBTRACT() _STORE_FAST(1) _JUMP_ABSOLUTE(_9) _39: _POP_TOP() _POP_BLOCK() _LOAD_FAST(2) _RETURN_VALUE() _PyEVAL_FRAMEEX_EPILOG } and the resulting binary code saved as an attribute of the code object. Presumably there would be some decision made about whether to compile a function into this form (maybe only after it's been called N times?).

msg79053 - (view) Author: Ralph Corderoy (ralph.corderoy) Date: 2009-01-04 13:21

Regarding compressing the opcode table to make better use of cache; what if the most frequently occurring opcodes where placed together, e.g. the opcodes were ordered by frequency, most frequent first. Just based on a one-off static analysis of a body of code. A level one cache line can be, what, 64 bytes == 16 32-bit pointers.

msg79056 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-04 13:40

@Alexandre: > > So, can you try dropping the switch altogether, using always computed > > goto and seeing how does the resulting code get compiled? > Removing the switch won't be possible unless we change the semantic > EXTENDED_ARG. In addition, I doubt the improvement, if any, would worth > the increased complexity. OK, it's time that I post code to experiment with that - there is no need to break EXTENDED_ARG. And the point is to fight miscompilations. > Do you actually mean the time spent interpreting bytecodes compared to the time spent in the other parts of Python? If so, your figures are wrong for CPython on x86-64. It is about 50% just like on x86 (when running pybench). With the patch, this drops to 35% on x86-64 and to 45% on x86. More or less, I mean that, but I was making an example, and I made up reasonable figures. 70%, or even more, just for _dispatch_ (i.e. just for the mispredicted indirect jump), is valid for real-world Smalltalk interpreters for instance, or for the ones in "The Structure and Performance of Efficient Interpreters". But, when you say "intepreting opcodes", I do not know which part you refer to, if just the computed goto or for the whole code in the interpreter function.

msg79058 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-04 14:38

@Skip: if one decides to generate binary code, there is no need to use switches. Inline threading (also known as "code copying" in some research papers) is what you are probably looking for: http://blog.mozilla.com/dmandelin/2008/08/27/inline-threading-tracemonkey-etc/ For references and background on threading techniques mentioned there, see: http://en.wikipedia.org/wiki/Threaded_code http://www.complang.tuwien.ac.at/forth/threaded-code.html

msg79077 - (view) Author: Alexandre Vassalotti (alexandre.vassalotti) * Date: 2009-01-04 18:52

> Removing the switch won't be possible unless we change the semantic > EXTENDED_ARG. In addition, I doubt the improvement, if any, would worth > the increased complexity. Nevermind what I have said. I managed to remove switch pretty easily by moving opcode fetching in the FAST_DISPATCH macro and abstracting the control flow of the switch. There is no speed difference on pybench on x86; on x86-64, the code is slower due to the opcode fetching change. > I patched ceval.c to minimize f->last_i manipulations in the dispatch > code. On x86, I got an extra 9% speed up on pybench. However, the > patch is a bit clumsy and a few unit tests are failing. I will see > if I can improve it and open a new issue if worthwhile. Nevermind that too. I found out f->last_i can be accessed anytime via frame.getlineno(). So, you cannot really change how f->last_i is used like I did.

msg79080 - (view) Author: Alexandre Vassalotti (alexandre.vassalotti) * Date: 2009-01-04 19:29

> I managed to remove switch pretty easily by moving opcode fetching > in the FAST_DISPATCH macro and abstracting the control flow of the > switch. Here is the diff against threadceval5.patch.

msg79105 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-05 02:39

@alexandre: if you add two labels per opcode and two dispatch tables, one before (like now) and one after the parameter fetch (where we have the 'case'), you can keep the same speed. And under the hood we also had two dispatch tables before, with the switch, so it's not a big deal; finally, the second table is only used in the slow path (i.e. EXTENDED_ARG, or when additional checks are needed). About f->last_i, when I have time I want to try optimizing it. Somewhere you can be sure it's not going to be used. But you have some changes about that in the abstract-switch patch, I think that was not intended?

msg79123 - (view) Author: Antoine Pitrou (pitrou) * Date: 2009-01-05 10:37

Le lundi 05 janvier 2009 à 02:39 +0000, Paolo 'Blaisorblade' Giarrusso a écrit : > About f->last_i, when I have time I want to try optimizing it. Somewhere > you can be sure it's not going to be used. There are lots of places which can call into arbitrary Python code. A few opcodes could be optimized for sure.

msg79264 - (view) Author: Antoine Pitrou (pitrou) * Date: 2009-01-06 14:30

FWIW, I have made a quick attempt at removing the f->f_lasti assignment in the few places where it could be removed, but it didn't make a difference on my machine. The problem being that there are very few places where it is legitimate to remove the assignment (even a call to Py_DECREF can invoke arbitrary external code through destructors, so even a very simple opcode such as POP_TOP needs the f->f_lasti assignment). Another approach would be to create agregate opcodes, e.g. BINARY_ADD_FAST would combine LOAD_FAST and BINARY_ADD. That's orthogonal to the current issue though.

msg79302 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-07 07:35

@pitrou: <ranting mode> Argh, reference counting hinders even that? </ranting mode> I just discovered another problem caused by refcounting. Various techniques allow to create binary code from the interpreter binary, by just pasting together the code for the common interpreters cases and producing calls to the other. But, guess what, on most platforms (except plain x86, but including x86_64 and maybe x86 for the shared library case) this does not work if the copied code includes function calls (on x86_64 that's due to RIP-relative addressing, and on similar issues on other platforms). So, Python could not even benefit from that! That's a real pity... I'll have to try with subroutine threading, to see if that's faster than indirect threading on current platforms or if it is still slower.

msg79321 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-07 11:09

I finally implemented my suggestion for the switch elimination. On top of threadedceval5.patch, apply abstract-switch-reduced.diff and then restore-old-oparg-load.diff to test it. This way, only computed goto's are used. I would like who had miscompilation problems, or didn't get advantage from the patch, to try compiling and benchmarking this version. I've also been able to reenable static prediction (PREDICT_*) on top of computed gotos, and that may help CPU prediction even more (the BTB for the computed goto will be used to predict the 2nd most frequent target); obviously it may instead cause a slowdown, I'd need stats on opcode frequency to try guessing in advance (I'll try gathering them later through DYNAMIC_EXECUTION_PROFILE). Apply reenable-static-prediction.diff on top of the rest to get this. I'll have to finish other stuff before closing everything to run pybench, I can't get stable timings otherwise, so it'll take some time (too busy, sorry). However I ran the check for regressions and they show none. ==== abstract-switch-reduced.diff is the fixed abstract-switch.diff - actually there was just one hunk which changed the handling of f_lasti, and that looked extraneous. See the end of the message. --- a/Python/ceval.c Thu Jan 01 23:54:01 2009 +0100 +++ b/Python/ceval.c Sun Jan 04 14:21:16 2009 -0500 @@ -1063,12 +1072,12 @@ } fast_next_opcode: - f->f_lasti = INSTR_OFFSET(); /* line-by-line tracing support */ if (_Py_TracingPossible && tstate->c_tracefunc != NULL && !tstate->tracing) { + f->f_lasti = INSTR_OFFSET(); /* see maybe_call_line_trace for expository comments */ f->f_stacktop = stack_pointer;

msg79335 - (view) Author: Skip Montanaro (skip.montanaro) * Date: 2009-01-07 15:09

Paolo> Various techniques allow to create binary code from the Paolo> interpreter binary, by just pasting together the code for the Paolo> common interpreters cases and producing calls to the other. But, Paolo> guess what, on most platforms (except plain x86, but including Paolo> x86_64 and maybe x86 for the shared library case) this does not Paolo> work if the copied code includes function calls (on x86_64 that's Paolo> due to RIP-relative addressing, and on similar issues on other Paolo> platforms). I don't understand. I know little or nothing about the details of various instruction set architectures or linkage methods. Can you break it down into a simple example? Skip

msg79365 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-07 19:26

@skip: In simple words, the x86 call: call 0x2000 placed at address 0x1000 becomes: call %rip + 0x1000 RIP holds the instruction pointer, which will be 0x1000 in this case (actually, I'm ignoring the detail that when executing the call, RIP points to the first byte of the next instruction). If I execute the same instruction from a different location (i.e. different RIP), things will break. So, only code for opcodes without real calls, nor access to globals can be copied like this (inlines are OK). With refcounting, not even POP_TOP is safe since it can call destructors. DUP_TOP is still safe, I guess.

msg79505 - (view) Author: Daniel Diniz (ajaksu2) Date: 2009-01-09 21:02

Paolo, Applying your patches makes no difference with gcc 4.2 and gives a barely noticeable (~2%) slowdown with icc. These results are from a Celeron M 410 (Core Solo Yonah-based), so it's a rather old platform to run benchmarks on.

msg79530 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-10 04:52

@ ajaksu2 > Applying your patches makes no difference with gcc 4.2 and gives a > barely noticeable (~2%) slowdown with icc. "Your patches" is something quite unclear :-) Which are the patch sets you are comparing? And on 32 or 64 bits? But does Yonah supports 64bits? IIRC no, but I'm not sure. I would be surprised from slowdowns for restore-old-oparg-load.diff, really surprised. And I would be just surprised by slowdowns on reenable-static-prediction.diff. Also, about ICC output, we still need to ensure that it's properly compiled (see above the instructions for counting "jmp *" or similar). In the measurements above, ICC did miscompile the patch with the switch. By "properly compiled" I mean that separate indirect branches are generated, instead of just one. > These results are from a > Celeron M 410 (Core Solo Yonah-based), so it's a rather old platform to > run benchmarks on. Not really - at the very least we should listen to results on Pentium 4, Core (i.e. Yonah) and Core 2, and I would also add Pentium3/Pentium M to represent the P6 family. Anyway, I have to do my benchmarks on this, I hope this weekend I'll have time.

msg79532 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-10 08:52

The standing question is still: can we get ICC to produce the expected output? It looks like we still didn't manage, and since ICC is the best compiler out there, this matters. Some problems with SunCC, even if it doesn't do jump sharing, it seems that one doesn't get the speedups - I guess that on most platforms we should select the most common alternative for interpreters (i.e. no switch, one jump table, given by threadedceval5.patch + abstract-switch-reduced.diff). On core platforms we can spend time on fine-tuning - and the definition of "core platforms" is given by "do developers want to test for that?". When that's fixed, I think that we just have to choose the simpler form and merge that. @alexandre: [about removing the switch] > There is no speed difference on pybench on x86; on x86-64, the code is slower due to the opcode fetching change. Actually, on my machine it looks like the difference is caused by the different layout caused by switch removal, or something like that, because fixing the opcode fetching doesn't make a difference here (see below). Indeed, I did my benchmarking duties. Results are that abstract-switch-reduced.diff (the one removing the switch) gives a 1-3% slowdown, and that all the others don't make a significant difference. The differences in the assembly output seem to be due to a different code layout for some branches, I didn't have a closer look. However, experimenting with -falign-labels=16 can give a small speedup, I'm trying to improve the results (what I actually want is to align just the opcode handlers, I'll probably do that by hand). reenable-static-prediction can give either a slowdown or a speedup by around 1%, i.e. around the statistical noise. Note that on my machine, I get only a 10% speedup with the base patch, and that is more reasonable here. In the original thread on PyPy-dev, I got a 20% one with the Python interpreter I built for my student project, since that one is faster* (by a 2-3x factor, like PyVM), so the dispatch cost is more significant, and reducing it has a bigger impact. In fact, I couldn't believe that Python got the same speedup. This is a Core 2 Duo T7200 (Merom) in 64bit mode with 4MB of L2 cache, and since it's a laptop I expect it to have slower RAM than a desktop. @alexandre: > The patch make a huge difference on 64-bit Linux. I get a 20% speed-up and the lowest run time so far. That is quite impressive! Which processor is that? @pitrou: > The machine I got the 15% speedup on is in 64-bit mode with gcc 4.3.2. Which is the processor? I guess the bigger speedups should be on Pentium4, since it has the bigger mispredict penalties. ==== *DISCLAIMER: the interpreter of our group (me and Sigurd Meldgaard) is not complete, has some bugs, and the source code has not yet been published, so discussion about why it is faster shall not happen here - I want to avoid any flame. I believe it's not because of skipped runtime checks or such stuff, but because we used garbage collection instead of refcounting, indirect threading and tagged integers, but I don't have time to discuss that yet. The original thread on pypy-dev has some insights if you are interested on this.

msg79536 - (view) Author: Antoine Pitrou (pitrou) * Date: 2009-01-10 09:44

> @pitrou: > > The machine I got the 15% speedup on is in 64-bit mode with gcc > 4.3.2. > > Which is the processor? I guess the bigger speedups should be on > Pentium4, since it has the bigger mispredict penalties. Athlon X2 3600+.

msg79537 - (view) Author: Antoine Pitrou (pitrou) * Date: 2009-01-10 09:54

> It looks like we still didn't manage, and since ICC is the best > compiler out there, this matters. Well, from the perspective of Python, what matters mostly is the commonly used compilers (that is, gcc and MSVC). I doubt many people compile Python with icc, honestly. Same for CPU-specific tuning: I don't think we want to ship Python with compiler flags which depend on the particular CPU being used.

On 2009-01-10 10:55, Antoine Pitrou wrote: > Antoine Pitrou <pitrou@free.fr> added the comment: > >> It looks like we still didn't manage, and since ICC is the best >> compiler out there, this matters. > > Well, from the perspective of Python, what matters mostly is the > commonly used compilers (that is, gcc and MSVC). I doubt many people > compile Python with icc, honestly. Agreed. Our main targets should be GCC for Linux and MSVC for Windows. On other platforms such as Solaris and AIX, the native vendor compilers are commonly used for compiling Python. That said, with a configure option to turn the optimization on and off, there shouldn't be any problem with slowdowns. > Same for CPU-specific tuning: I don't think we want to ship Python with > compiler flags which depend on the particular CPU being used. Certainly not in the binaries we release on python.org. Of course, people are still free to setup OPT to have the compiler generate CPU specific code.

msg79541 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-10 12:08

> Same for CPU-specific tuning: I don't think we want to ship Python with compiler flags which depend on the particular CPU being used. I wasn't suggesting this - but since different CPUs have different optimization rules, something like "oh, 20% performance slowdown on PowerPC" or "on P4" is important to know (and yeah, configure options are a good solution). Which is the barrier for platform-specific tricks, as long as the code is still portable? I'd like to experiment with __builtin_expect and with manual alignment (through 'asm volatile(".p2align 4")' on x86/x86_64 with GAS - PPC might need a different alignment probably). All hidden through macros to make it disappear on unsupported platforms, without any configure option for them (there shouldn't be the need for that). > I doubt many people compile Python with icc, honestly. Yep :-(. <rant>Why don't distributors do it?</rant> (First culprit might be license/compatibility problems I guess, but the speedup would be worth the time to fix the troubles IMHO).

msg79545 - (view) Author: Antoine Pitrou (pitrou) * Date: 2009-01-10 15:03

> (First culprit might > be license/compatibility problems I guess, but the speedup would be > worth the time to fix the troubles IMHO). That would be the obvious reason IMO. And Intel is the only one who can "fix the troubles".

msg79605 - (view) Author: Jeffrey Yasskin (jyasskin) * Date: 2009-01-11 17:37

Here's a port of threadedceval5.patch to trunk. It passes the tests. I haven't benchmarked this exact patch, but on one Intel Core2, a similar patch got an 11%-14% speedup (on 2to3 and pybench). I've also cleaned up Jakob Sievers' vmgen patch (representing forth-style dispatch) a bit so that it passes all the tests, and on the same machine it got a 13%-17% speedup. The vmgen patch is not quite at feature parity (it throws out support for LLTRACE and a couple other #defines), and there are fairly good arguments against committing it to python at all (it requires installing and modifying vmgen to build), but I'll post it after I've ported it to trunk. Re skip and paolo: JITting and machine-specific assembly will probably be important to speeding up Python in the long run, but they'll also take a long while to get right, so we shouldn't let them distract us from committing the dispatch optimization.

msg79627 - (view) Author: Gregory P. Smith (gregory.p.smith) * Date: 2009-01-11 20:46

Benchmarking pitrou_dispatch_2.7.patch applied to trunk r68522 on a 32- bit Efficeon (x86) using gcc 4.2.4-1ubuntu3 yields a 10% pybench speedup.

msg79688 - (view) Author: Jeffrey Yasskin (jyasskin) * Date: 2009-01-12 17:58

Here's the vmgen-based patch for comparison. Again, it passes all the tests, but isn't complete outside of that and (unless consensus develops that a couple percent is worth requiring vmgen) shouldn't distract from reviewing Antoine's patch. I'll look over threadedceval5.patch in detail next.

msg79695 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-12 19:03

A couple percent maybe is not worth vmgen-ing. But even if I'm not a vmgen expert, I read many papers from Ertl about superinstructions and replication, so the expected speedup from vmgen'ing is much bigger. Is there some more advanced feature we are not using and could use? Have the previous static predictions be converted to superinstructions? Have other common sequences be treated like that? Is there an existing discussion on this?

msg79699 - (view) Author: Jeffrey Yasskin (jyasskin) * Date: 2009-01-12 20:19

I've left some line-by-line comments at http://codereview.appspot.com/11905. Sorry if there was already a Rietveld thread; I didn't see one.

msg79720 - (view) Author: Jeffrey Yasskin (jyasskin) * Date: 2009-01-13 05:56

@Paolo: I'm going to be looking into converting more common sequences into superinstructions. We only have LOAD_CONST+XXX so far. The others are difficult because vmgen doesn't provide easy ways to deal with error handling, but Jakob and I have come up with a couple ideas to get around that. Replication would be trickier since we want the bytecode generation to be deterministic, but it's probably doable too. I'll post any significant results I get.

msg79722 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-13 06:45

Ok, then vmgen adds almost just direct threading instead of indirect threading. Since the purpose of superinstructions is to eliminate dispatch overhead, and that's more important when little actual work is done, what about all ones which unconditionally end with FAST_DISPATCH and are common? DUP_TOP, POP_TOP, DUP_TOPX(2,3) and other stack handling staff which can't fail? To have any of them + XXX without error handling problems? Removing handling of DUP_TOPX{4,5} is implied, you shouldn't check functionality of the compiler during interpretation - indeed, even the idea using a parameter for that is a waste. Have DUP_TOPX2 and DUP_TOPX3, like JVM, is just simpler. > Replication would be trickier since we want the bytecode generation to be deterministic, but it's probably doable too. Bytecode conversion during I/O is perfectly fine, to convert from the actual bytecode to one of the chosen replicas. Conversion in a rescan pass can be also OK (less cache friendly though, so if it's easy to avoid, please do).

msg79734 - (view) Author: Antoine Pitrou (pitrou) * Date: 2009-01-13 11:37

As for superinstructions, you can find an example here: #4715.

msg79835 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-14 06:31

#4715 is interesting, but is not really about superinstructions. Superinstructions are not created because they make sense; any common sequence of opcodes can become a superinstruction, just for the point of saving dispatches. And the creation can even be dynamic! However, when I'll have substantial time for coding, I'd like to spend it experimenting with subroutine threading. vmgen author despises it, but nowadays it probably became even faster, as discussed by the article "Context threading: A flexible and efficient dispatch technique for virtual machine interpreters".

msg79980 - (view) Author: Antoine Pitrou (pitrou) * Date: 2009-01-16 21:58

Here is an updated patch with a dedicated configure option (--with-computed-gotos, disabled by default), rather than a compiler detection switch. (sorry, the patch is very long because it seems running autoconf changes a lot of things in the configure script)

msg79984 - (view) Author: Skip Montanaro (skip.montanaro) * Date: 2009-01-16 23:02

Antoine> (sorry, the patch is very long because it seems running Antoine> autoconf changes a lot of things in the configure script) Normal practice is to not include the configure script in such patches and indicate to people that they will need to run autoconf. Skip

msg79985 - (view) Author: Antoine Pitrou (pitrou) * Date: 2009-01-16 23:08

Thanks Skip, it makes sense... so here is a patch without the configure script. (I wonder however if those huge configure changes, when checked into the SVN, could break something silently somewhere)

msg80324 - (view) Author: Stefan Ring (Ringding) Date: 2009-01-21 14:02

Hi, I ported threadedceval6.patch to Python 2.5.4, in case anyone is interested... Note that you need to run autoconf and autoheader.

msg80489 - (view) Author: Jeffrey Yasskin (jyasskin) * Date: 2009-01-25 00:15

In the comment, you might mention both -fno-crossjumping and -fno-gcse. -fno-crossjumping's description looks like it ought to prevent combining computed gotos, but http://gcc.gnu.org/onlinedocs/gcc-4.3.3/gcc/Optimize-Options.html says -fno-gcse actually does it, and in my brief tests, the manual is actually correct (compiling with just -fno-crossjumping combined gotos anyway). Otherwise, threadedceval6.patch looks good to submit to me.

msg80515 - (view) Author: Antoine Pitrou (pitrou) * Date: 2009-01-25 16:42

Committed in py3k in r68924. I won't backport it to trunk myself but it should be easy enough, provided people are interested.

msg80571 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-26 15:28

-fno-gcse is controversial. Even if it might avoid jumps sharing, the impact of that option has to be measured, since common subexpression elimination allows omitting some recalculations, so disabling global CSE might have a negative impact on other code. It would be maybe better to disable GCSE only for the interpreter loop, but that would make some intra-file inlining impossible.

msg80669 - (view) Author: Gregory P. Smith (gregory.p.smith) * Date: 2009-01-27 21:40

I'll take on the two remaining tasks for this: * add configure magic to detect when the compiler supports this so that it can default to --with-computed-gotos on modern systems. * commit the back port to 2.7 trunk.

msg80717 - (view) Author: Antoine Pitrou (pitrou) * Date: 2009-01-28 16:42

For the record, I've compiled py3k on an embarassingly fast Core2-based server (Xeon E5410), and the computed gotos option gives a 16% speedup on pybench and pystone. (with gcc 4.3.2 in 64-bit mode)

msg80828 - (view) Author: Kevin Watters (kevinwatters) Date: 2009-01-30 17:00

Does anyone know the equivalent ICC command line option for GCC's -fno- gcse? (Or if it has one?) I can't find a related option in the docs. It looks like ICC hits the same combining goto problem, as was mentioned: without changing any options, I applied pitrou_dispatch_2.7.patch to release-26maint and pybench reported literally no difference, +0.0%. Even if stock Python is built with MSVC, devs like myself who ship Python would like to see the benefit.

msg80832 - (view) Author: Mark Dickinson (mark.dickinson) * Date: 2009-01-30 19:04

The x86 gentoo buildbot is failing to compile, with error: /Python/makeopcodetargets.py ./Python/opcode_targets.h File "./Python/makeopcodetargets.py", line 28 f.write(",

".join("\t&&%s" % s for s in targets)) ^ SyntaxError: invalid syntax make: *** [Python/opcode_targets.h] Error 1 I suspect that it's because the system Python on this buildbot is Python 2.3, which doesn't understand generator expressions. Are there any objections to me adding a couple of square brackets to this line to turn the argument of join into a list comprehension?

msg80833 - (view) Author: Mark Dickinson (mark.dickinson) * Date: 2009-01-30 19:21

One other thought: it seems that as a result of this change, the py3k build process now depends on having some version of Python already installed; before this, it didn't. Is this true, or am I misinterpreting something? Might it be worth adding the file Python/opcode_targets.h to the distribution to avoid this problem?

msg80834 - (view) Author: Mark Dickinson (mark.dickinson) * Date: 2009-01-30 19:23

Sorry: ignore that last. Python/opcode_targets.h is already part of the distribution. I don't know what I was doing wrong.

msg80838 - (view) Author: Antoine Pitrou (pitrou) * Date: 2009-01-30 21:38

Mark: """Are there any objections to me adding a couple of square brackets to this line to turn the argument of join into a list comprehension?""" No problems for me. You might also add to the top comments of the file that it is 2.3-compatible.

msg80867 - (view) Author: Mark Dickinson (mark.dickinson) * Date: 2009-01-31 12:27

Square brackets added in r69133. The gentoo x86 3.x buildbot seems to be passing the compile stage now. (Though not the test stage, of course: one can't have everything!)

msg80881 - (view) Author: Antoine Pitrou (pitrou) * Date: 2009-01-31 17:17

> Square brackets added in r69133. The gentoo x86 3.x buildbot seems to be > passing the compile stage now. (Though not the test stage, of course: > one can't have everything!) The test failure also happens on trunk, it may be related to the recent tk changes.

msg80882 - (view) Author: Mark Dickinson (mark.dickinson) * Date: 2009-01-31 19:36

> The test failure also happens on trunk, it may be related to the recent > tk changes. Yes; sorry---I didn't mean to suggest that the test failures were in any way related to the opcode dispatch stuff. Apart from the ttk teething difficulties, there's a weird 'Unknown signal 32' error that's been going on on the gentoo buildbot for many months now. But that's a separate issue (issue #4970, to be precise).

msg81120 - (view) Author: Skip Montanaro (skip.montanaro) * Date: 2009-02-04 01:05

This has been checked in, right? Might I suggest that the TARGET and TARGET_WITH_IMPL macros not include the trailing colon? I think that will make it more friendly toward "smart" editors such as Emacs' C mode. I definitely get better indentation with TARGET(NOP): than with TARGET(NOP) S

msg81166 - (view) Author: Gabriel Genellina (ggenellina) Date: 2009-02-04 21:42

> Might I suggest that the TARGET and TARGET_WITH_IMPL macros not > include the trailing colon? Yes, please!

msg81341 - (view) Author: Antoine Pitrou (pitrou) * Date: 2009-02-07 17:43

Skip, removing the colon doesn't work if the macro adds code after the colon :)

msg81361 - (view) Author: Skip Montanaro (skip.montanaro) * Date: 2009-02-08 01:31

Antoine> Skip, removing the colon doesn't work if the macro adds code Antoine> after the colon :) When I looked I thought both TARGET and TARGET_WITH_IMPL ended with a colon, but I see that's not the case. How about removing TARGET_WITH_IMPL and just include the goto explicitly? There are only a few instances of the TARGET_WITH_IMPL used. Skip

msg83959 - (view) Author: Andrew I MacIntyre (aimacintyre) * Date: 2009-03-22 07:37

Out of interest, the attached patch against the py3k branch at r70516 cleans up the threaded code changes a little: - gets rid of TARGET_WITH_IMPL macro; - TARGET(op) is followed by a colon, so that it looks like a label (for editors that make use of that). On my systems (all older AMD with old versions of gcc), this patch has performance slightly better than SVN r70516, and performance is usually very close to the NO_COMPUTED_GOTOS build.

Is a backport to 2.7 still planned?

On 2009-03-31 03:19, A.M. Kuchling wrote: > A.M. Kuchling <lists@amk.ca> added the comment: > > Is a backport to 2.7 still planned? I hope it is.

msg84770 - (view) Author: Antoine Pitrou (pitrou) * Date: 2009-03-31 12:58

Andrew, your patch disables the optimization that HAS_ARG(op) is a constant when op is known by the compiler (that is, inside a "TARGET_##op" label), so I'd rather keep the version which is currently in SVN.

msg85838 - (view) Author: Andrew I MacIntyre (aimacintyre) * Date: 2009-04-10 02:15

Antoine, in my testing the "loss" of the HAS_ARG() optimisation in my patch appears to have negligible cost on i386, but starts to look significant on amd64. On an Intel E8200 cpu running FreeBSD 7.1 amd64, with gcc 7.2.1 and the 3.1a2 sources, the computed goto version is ~8% faster (average time of all rounds) for pybench (with warp factor set to 3 rather than the default 10, to get the round time up over 10s) than without computed gotos. With my patch applied, the computed goto version is ~5.5% faster than without computed gotos by the same measure. On this platform, Pystone rates at ~86k (no computed gotos), ~85k (computed gotos) and ~82k (computed gotos + my patch). For comparison, this machine running Windows XP (32 bit) with the python.org builds rates ~92k pystones for 2.6.1 and ~81k for 3.1a2. Pybench isn't distributed in the MSI installers :-(

msg90688 - (view) Author: Michele Dionisio (mdionisio) Date: 2009-07-18 19:58

I have patch the code of python3.1 to use computed goto tecnique also with Visual Studio. The performance result is not good (I really don't know why). But it is a good work-araound for use the computed goto also on windows. The only diffentes is that the opcode_targets vector is filled at run-time.

msg110842 - (view) Author: Antoine Pitrou (pitrou) * Date: 2010-07-19 23:55

This is too late for 2.x now, closing.

msg244229 - (view) Author: Srinivas Vamsi Parasa (parasa) * Date: 2015-05-28 00:23

Hi All, This is Vamsi from Server Scripting Languages Optimization team at Intel Corporation. Would like to submit a request to enable the computed goto based dispatch in Python 2.x (which happens to be enabled by default in Python 3 given its performance benefits on a wide range of workloads). We talked about this patch with Guido and he encouraged us to submit a request on Python-dev (email conversation with Guido shown at the bottom of this email). Attached is the computed goto patch (along with instructions to run) for Python 2.7.10 (based on the patch submitted by Jeffrey Yasskin at http://bugs.python.org/issue4753). We built and tested this patch for Python 2.7.10 on a Linux machine (Ubuntu 14.04 LTS server, Intel Xeon – Haswell EP CPU with 18 cores, hyper-threading off, turbo off). Below is a summary of the performance we saw on the “grand unified python benchmarks” suite (available at https://hg.python.org/benchmarks/). We made 3 rigorous runs of the following benchmarks. In each rigorous run, a benchmark is run 100 times with and without the computed goto patch. Below we show the average performance boost for the 3 rigorous runs. ----- Instructions to run the computed goto patch 1) Apply the patch and then generate the new configure script (autoconf configure.ac > configure) 2) Enable execute permissions for Python/makeopcodetargets.py ( sudo chmod +x Python/makeopcodetargets.py) 3) To enable computed gotos, do: ./configure --with-computed-gotos 4) Build the new python binary using make && sudo make install ------ Python 2.7.10 (original) vs Computed Goto performance Benchmark Delta (rigorous run #1) % Delta (rigorous run 2) % Delta (rigorous run #3) % Avg. Delta % iterative_count 24.48 24.36 23.78 24.2 unpack_sequence 19.06 18.47 19.48 19.0 slowspitfire 14.36 13.41 16.65 14.8 threaded_count 15.85 13.43 13.93 14.4 pystone 10.68 11.67 11.08 11.1 nbody 10.25 8.93 9.28 9.5 go 7.96 8.76 7.69 8.1 pybench 6.3 6.8 7.2 6.8 spectral_norm 5.49 9.37 4.62 6.5 float 6.09 6.2 6.96 6.4 richards 6.19 6.41 6.42 6.3 slowunpickle 6.37 8.78 3.55 6.2 json_dump_v2 1.96 12.53 3.57 6.0 call_simple 6.37 5.91 3.92 5.4 chaos 4.57 5.34 3.85 4.6 call_method_slots 2.63 3.27 7.71 4.5 telco 5.18 1.83 6.47 4.5 simple_logging 3.48 1.57 7.4 4.2 call_method 2.61 5.4 3.88 4.0 chameleon 2.03 6.26 3.2 3.8 fannkuch 3.89 3.19 4.39 3.8 silent_logging 4.33 3.07 3.39 3.6 slowpickle 5.72 -1.12 6.06 3.6 2to3 2.99 3.6 3.45 3.3 etree_iterparse 3.41 2.51 3 3.0 regex_compile 3.44 2.48 2.84 2.9 mako_v2 2.14 1.29 5.22 2.9 meteor_contest 2.01 2.2 3.88 2.7 django 6.68 -1.23 2.56 2.7 formatted_logging 1.97 5.82 -0.11 2.6 hexiom2 2.83 2.1 2.55 2.5 django_v2 1.93 2.53 2.92 2.5 etree_generate 2.38 2.13 2.51 2.3 mako -0.3 9.66 -3.11 2.1 bzr_startup 0.35 1.97 3 1.8 etree_process 1.84 1.01 1.9 1.6 spambayes 1.76 0.76 0.48 1.0 regex_v8 1.96 -0.66 1.63 1.0 html5lib 0.83 0.72 0.97 0.8 normal_startup 1.41 0.39 0.24 0.7 startup_nosite 1.2 0.41 0.42 0.7 etree_parse 0.24 0.9 0.79 0.6 json_load 1.38 0.56 -0.25 0.6 pidigits 0.45 0.33 0.28 0.4 hg_startup 0.32 2.07 -1.41 0.3 rietveld 0.05 0.91 -0.43 0.2 tornado_http 2.34 -0.92 -1.27 0.1 call_method_unknown 0.72 1.26 -1.85 0.0 raytrace -0.35 -0.75 0.94 -0.1 regex_effbot 1.97 -1.18 -2.57 -0.6 fastunpickle -1.65 0.5 -0.88 -0.7 nqueens -2.24 -1.53 -0.81 -1.5 fastpickle -0.74 1.98 -6.26 -1.7 Thanks, Vamsi ------------------------------------------------------------------------------------------------------------------------------------------------------------ From: gvanrossum@gmail.com [mailto:gvanrossum@gmail.com] On Behalf Of Guido van Rossum Sent: Tuesday, May 19, 2015 1:59 PM To: Cohn, Robert S Cc: R. David Murray (r.david.murray@murrayandwalker.com) Subject: Re: meeting at PyCon Hi Robert and David, I just skimmed that thread. There were a lot of noises about backporting it to 2.7 but the final message on the topic, by Antoine, claimed it was too late for 2.7. However, that was before we had announced the EOL extension of 2.7 till 2020, and perhaps we were also in denial about 3.x uptake vs. 2.x. So I think it's definitively worth bringing this up. I would start with a post on python-dev linking to the source code for your patch, and adding a message to the original tracker issue too (without reopening it though -- just so the people who were on the bug will be pinged about it). Because of backwards compatibility with previous 2.7.x releases, it's very important that the patch not break anything -- in particular this means you can't add opcodes or change their specification. You will also undoubtedly be asked to test this on a variety of platforms 32-bit and 64-bit that people care about. But I'm sure you're expecting all that. :-) You might also check with Benjamin Peterson, who is the 2.7 release manager. I think he just announced 2.7.10, so it's too late for that, but I assume we'll keep doing 2.7.x releases until 2020. Good luck, --Guido PS. I am assuming you are contributing this under a PSF-accepted license, e.g. Apache 2.0, otherwise it's an automatic nogo. On Tue, May 19, 2015 at 9:33 AM, Cohn, Robert S <robert.s.cohn@intel.com> wrote: Hi Guido, When we met for lunch at pycon, I asked if performance related patches would be ok for python 2.x. My understanding is that you thought it was possible if it did not create a maintainability problem. We have an example now, a 2.7 patch for computed goto based on the implementation in python 3 http://bugs.python.org/issue4753 It increases performance by up to 10% across a wide range of workloads. As I mentioned at lunch, we hired David Murray’s company, and he is guiding intel through the development process for cpython. David and I thought it would be good to run this by you before raising the issue on python-dev. Do you have a specific concern about this patch or a more general concern about performance patches to 2.7? Thanks. Robert --------

msg244231 - (view) Author: Robert Collins (rbcollins) * Date: 2015-05-28 00:34

FWIW I'm interested and willing to poke at this if more testers/reviewers are needed.

msg244256 - (view) Author: Ned Deily (ned.deily) * Date: 2015-05-28 03:52

@Vamsi, could you please open a new issue and attach your patch there so it can be properly tracked for 2.7? This issue has been closed for five years and the code has been out in the field for a long time in Python 3. Thanks!

msg244329 - (view) Author: Roundup Robot (python-dev) Date: 2015-05-28 17:45

New changeset 17d3bbde60d2 by Benjamin Peterson in branch '2.7': backport computed gotos (#4753) https://hg.python.org/cpython/rev/17d3bbde60d2

msg244617 - (view) Author: David Bolen (db3l) Date: 2015-06-01 22:02

The 2.7 back-ported version of this patch appears to have broken compilation on the Windows XP buildbot, during the OpenSSL build process, when the newly built Python is used to execute the build_ssl.py script. After this patch, when that stage executes, and prior to any output from the build script, the python_d process goes to 100% CPU and sticks there until the build process times out 1200s later and kills it. I don't think it's really ssl related though, as after doing some debugging the exact same thing happens if I simply run python_d (I never see a prompt - it just sits there burning CPU). So I think build_ssl.py is just the first use of the generated python_d during the build process. I did try attaching to the CPU-stuck version of python_d from VS, and so far from what I can see, the process never gets past the Py_Initialize() call in Py_Main(). It's all over the place in terms of locations if I try interrupting it, but it's always stuck inside that first Py_Initialize call. I'm not sure if it's something environmental on my slave, or a difference with a debug vs. production build (I haven't had a chance to try building a release version yet). -- David

msg244620 - (view) Author: David Bolen (db3l) Date: 2015-06-01 22:26

I ran a few more tests, and the generated executable hangs in both release and debug builds. The closest I can get at the moment is that it's stuck importing errno from the "import sys, errno" line in os.py - at least no matter how long I wait after starting a process before breaking out, output with -v looks like: > python_d -v # installing zipimport hook import zipimport # builtin # installed zipimport hook # D:\cygwin\home\db3l\test\python2.7\lib\site.pyc matches D:\cygwin\home\db3l\test\python2.7\lib\site.py import site # precompiled from D:\cygwin\home\db3l\test\python2.7\lib\site.pyc # D:\cygwin\home\db3l\test\python2.7\lib\os.pyc matches D:\cygwin\home\db3l\test\python2.7\lib\os.py import os # precompiled from D:\cygwin\home\db3l\test\python2.7\lib\os.pyc import errno # builtin Traceback (most recent call last): File "D:\cygwin\home\db3l\test\python2.7\lib\site.py", line 62, in <module> import os File "D:\cygwin\home\db3l\test\python2.7\lib\os.py", line 26, in <module> import sys, errno KeyboardInterrupt # clear __builtin__._ # clear sys.path # clear sys.argv # clear sys.ps1 # clear sys.ps2 # clear sys.exitfunc # clear sys.exc_type # clear sys.exc_value # clear sys.exc_traceback # clear sys.last_type # clear sys.last_value # clear sys.last_traceback # clear sys.path_hooks # clear sys.path_importer_cache # clear sys.meta_path # clear sys.flags # clear sys.float_info # restore sys.stdin # restore sys.stdout # restore sys.stderr # cleanup __main__ # cleanup[1] zipimport # cleanup[1] errno # cleanup[1] signal # cleanup[1] exceptions # cleanup[1] _warnings # cleanup sys # cleanup __builtin__ [8991 refs] # cleanup ints: 6 unfreed ints # cleanup floats I never have a problem interrupting the process, so KeyboardInterrupt is processed normally - it just looks like it gets stuck in an infinite loop during startup. -- David

msg244628 - (view) Author: R. David Murray (r.david.murray) * Date: 2015-06-02 01:11

Please open a new issue with the details about your problem.

msg244693 - (view) Author: David Bolen (db3l) Date: 2015-06-02 21:38