[Python-Dev] Fixing the GIL (with a BFS scheduler)

Hi all, Here is a second (last?) attempt at getting traction on fixing the GIL (is it broken?) with BFS (WTF?). So don't be shy (don't be too rude either) since ignoring counts as down voting. Relevant Python issue: http://bugs.python.org/issue7946 *Bottom line first* I submitted an implementation of BFS ( http://ck.kolivas.org/patches/bfs/sched-BFS.txt) as a patch to the GIL, which to the extent I have tested it, behaves nicely on Windows XP, Windows 7, GNU/Linux with either CFS or O(1) schedulers, 1/2/4 cores, laptop, desktop and VirtualBox VM guest (some data below). The patch is still work in progress and requires work in terms of style, moving code where it belongs, test code, etc... nevertheless, Python core developers recommended I already (re)post to python-dev for discussion. *So is the GIL broken?* There seems to be some disagreement on that question among Python core developers (unless you all agree it is not broken :) ). Some developers maintain the effects described by David Beazley do not affect real world systems. Even I took the role of a devil's advocate in a previous discussion, but in fact I think that Python, being a general purpose language, is similar to the OS in that regard. It is used across many application domains, platforms, and development paradigms, just as OS schedulers are, and therefore accepting thread scheduling with such properties as a fact of life is not a good idea. I was first bitten by the original GIL last year while testing a system, and found David's research while looking for answers, and later had to work around that problem in another system. Here are other real world cases: 1) Zope people hit this back in 2002 and documented the problem with interesting insight: http://www.zope.org/Members/glpb/solaris/multiproc "I have directly observed a 30% penalty under MP constraints when the sys.setcheckinterval value was too low (and there was too much GIL thrashing)." http://www.zope.org/Members/glpb/solaris/report_ps "A machine that's going full-throttle isn't as bad, curiously enough -- because the other CPU's are busy doing real work, the GIL doesn't have as much opportunity to get shuffled between CPUs. On a MP box it's very important to set sys.setcheckinterval() up to a fairly large number, I recommend pystones / 50 or so." 2) Python mailing list - 2005 http://mail.python.org/pipermail/python-list/2005-August/336286.html "The app suffers from serious performance degradation (compared to pure c/C++) and high context switches that I suspect the GIL unlocking may be aggravating ?" 3) Python mailing list - 2008 http://mail.python.org/pipermail/python-list/2008-June/1143217.html "When I start the server, it sometimes eats up 100% of the CPU for a good minute or so... though none of the threads are CPU-intensive" 4) Twisted http://twistedmatrix.com/pipermail/twisted-python/2005-July/011048.html "When I run a CPU intensive method via threads.deferToThread it takes all the CPU away and renders the twisted process unresponsive." Admittedly, it is not easy to dig reports up in Google. Finally, I think David explained the relevance of this problem quite nicely: http://mail.python.org/pipermail/python-dev/2010-March/098416.html *What about the new GIL?* There is no real world experience with the new GIL since it is under development. What we have is David's analysis and a few benchmarks from the bug report. *Evolving the GIL into a scheduler* The problem addressed by the GIL has always been *scheduling* threads to the interpreter, not just controlling access to it. The patches by Antoine and David essentially evolve the GIL into a scheduler, however both cause thread starvation or high rate of context switching in some scenarios (see data below). *BFS* Enter BFS, a new scheduler designed by Con Kolivas, a Linux kernel hacker who is an expert in this field: http://ck.kolivas.org/patches/bfs/sched-BFS.txt "The goal of the Brain Fuck Scheduler, referred to as BFS from here on, is to completely do away with the complex designs of the past for the cpu process scheduler and instead implement one that is very simple in basic design. The main focus of BFS is to achieve excellent desktop interactivity and responsiveness without heuristics and tuning knobs that are difficult to understand, impossible to model and predict the effect of, and when tuned to one workload cause massive detriment to another." I submitted an implementation of BFS (bfs.patch) which on my machines gives comparable performance to gilinter2.patch (Antoine's) and seems to schedule threads more fairly, predictably, and with lower rate of context switching (see data below). There are however, some issues in bfs.patch: 1) It works on top of the OS scheduler, which means (for all GIL patches!): a) It does not control and is not aware of information such as OS thread preemption, CPU core to run on, etc... b) There may be hard to predict interaction between BFS and the underlying OS scheduler, which needs to be tested on each target platform. 2) It works best when TSC (http://en.wikipedia.org/wiki/Time_Stamp_Counter) is available and otherwise falls back to gettimeofday(). I expect the scheduler to misbehave to some degree or affect performance when TSC is not available and either of the following is true: a) if gettimeofday() is very expensive to read (impacts release/acquire overhead). b) if gettimeofday() has very low precision ~10ms. By design of BFS, once CPU load crosses a given threshold (about 8 CPU bound tasks which need the CPU at once), the scheduler falls back to FIFO behavior and latency goes up sharply. I have no data on how bfs.patch behaves on ARM, AMD, old CPU models, OSX, FreeBSD, Solaris, or mobile. The patch may require some tuning to work properly on those systems, so data is welcome (make sure TSC code in Include/cycle.h works on those systems before benching). All that said, to the extent I have tested it, bfs.patch behaves nicely on Windows XP, Windows 7, GNU/Linux with either CFS or O(1) schedulers, 1/2/4 cores, laptop, desktop and VirtualBox VM guest. *Data* Comparison of proposed patches running ccbench on Windows XP: http://bugs.python.org/issue7946#msg104899 Comparison of proposed patches running Florent's write.py test on Ubuntu Karmic: http://bugs.python.org/issue7946#msg105687 Comparison of old GIL, new GIL and BFS running ccbench on Ubuntu Karmic: http://bugs.python.org/issue7946#msg105874 Last comparison includes a run of old GIL with sys.setcheckinterval(2500) as Zope people do. IO latency shoots up to ~1000ms as result. *What can be done with it?* Here are some options: 1) Abandon it - no one is interested, yawn. 2) Take ideas and workarounds from its code and apply to other patches. 3) Include it in the interpreter as an auxiliary (turn on with a runtime switch) scheduler. 4) Adopt it as the Python scheduler. *Opinion?* Your opinion is needed (however, please submit code review comments which are not likely to interest other people, e.g. "why did you use volatile for X?", at the issue page: http://bugs.python.org/issue7946). Thanks, Nir -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.python.org/pipermail/python-dev/attachments/20100516/845a35b2/attachment.html>