Threading and processor affinity tool for Linux

Download



The latest version is here: https://github.com/hazelybell/scripts/blob/master/primegrid.py



Requirements





Python 3.7



numpy



scipy



util-linux (with the taskset command)





Benchmarking



To run the benchmark, stop primegrid and call:

python3.7 primegrid.py --benchmark-llr 'k*2^n+1'

For example:

python3.7 primegrid.py --benchmark-llr '25*2^3962242+1'



You can get a relevant prime for the project you're interested in from the subproject status page.



This will then run the benchmark with various thread counts, with and without HT, etc. The benchmark will keep running until it's relatively confident that it has identified the best strategy (thread count, HT, affinity).



Example output:

INFO:__main__:---------- Current results:
INFO:__main__:Processors: 88% Threads: 7 Layout: free Tasks/day 48.68±0.60
INFO:__main__:Processors: 50% Threads: 4 Layout: free Tasks/day 50.39±0.48
INFO:__main__:Processors: 100% Threads: 1 Layout: free Tasks/day 51.79±0.48
INFO:__main__:Processors: 100% Threads: 1 Layout: spread Tasks/day 51.98±0.20
INFO:__main__:Processors: 100% Threads: 8 Layout: free Tasks/day 52.24±0.28
INFO:__main__:Processors: 50% Threads: 4 Layout: spread Tasks/day 52.90±0.30
INFO:__main__:Processors: 50% Threads: 2 Layout: free Tasks/day 57.94±0.69
INFO:__main__:Processors: 50% Threads: 2 Layout: spread Tasks/day 58.86±0.16
INFO:__main__:Processors: 100% Threads: 4 Layout: spread Tasks/day 61.50±0.59
INFO:__main__:Processors: 100% Threads: 4 Layout: clump Tasks/day 61.80±0.16
INFO:__main__:Processors: 100% Threads: 4 Layout: free Tasks/day 61.83±0.21
INFO:__main__:Processors: 100% Threads: 2 Layout: spread Tasks/day 62.34±1.40
INFO:__main__:Processors: 100% Threads: 2 Layout: free Tasks/day 64.24±1.65
INFO:__main__:Processors: 100% Threads: 2 Layout: clump Tasks/day 65.65±1.27
INFO:__main__:Processors: 50% Threads: 1 Layout: spread Tasks/day 72.00±3.00
INFO:__main__:Processors: 50% Threads: 1 Layout: free Tasks/day 72.51±2.46
INFO:__main__:----------



This indicates that the benchmark has completed a round, and that using 50% of the processors with tasks that have one thread each and letting Linux manage thread affinity was the fastest. It will continue running until it can be confident in its choice, but you can stop whenever the error bars (the number after the ±) get small enough for you, or if you're just sick of waiting.
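To give a feel for where a number like "72.51±2.46" could come from, here is a minimal sketch that turns repeated tasks/day measurements into a mean and an error bar. It uses a normal approximation from the Python standard library; the function name error_bar and the sample values are made up for illustration, and primegrid.py may well compute its intervals differently (it depends on scipy).

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def error_bar(samples, ci=0.90):
    """Return (mean, half_width) of a two-sided interval at level `ci`.

    Normal approximation, for illustration only; not necessarily the
    method used by primegrid.py.
    """
    if len(samples) < 2:
        raise ValueError("need at least two benchmark runs")
    # z-score that leaves (1 - ci) / 2 in each tail.
    z = NormalDist().inv_cdf(0.5 + ci / 2.0)
    # Standard error of the mean.
    sem = stdev(samples) / sqrt(len(samples))
    return mean(samples), z * sem


# Hypothetical tasks/day measurements from repeated runs of one plan.
runs = [71.2, 73.0, 72.4, 72.9]
m, h = error_bar(runs, ci=0.90)
print(f"Tasks/day {m:.2f}±{h:.2f}")
```

More runs shrink the half-width, which is why waiting longer tightens the error bars; asking for a higher confidence level widens them.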



Affinity



The tool currently supports 3 different affinity types:



'free': Let the OS manage affinity itself

'spread': Spread out the threads of a single task across different cores

'clump': Put the threads of a single task on the same cores



Example: Consider a 4-core CPU with hyperthreading. For a task with 2 threads, 'spread' will put the two threads on two different cores, and 'clump' will put the two threads on the same core. For a task with 4 threads, 'spread' will put one thread on each core, while 'clump' will put all the threads on 2 cores.
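The difference between the layouts can be sketched in a few lines of Python. This is only an illustration, not the script's actual code: the function name layout_cpus is hypothetical, and it assumes that logical CPUs c and c + cores are the two hyperthreads of physical core c, which is a common numbering on Linux but not guaranteed (check lscpu on your system).

```python
def layout_cpus(layout, threads, cores):
    """Return the logical CPUs for one task's threads, or None for 'free'.

    Assumption: logical CPUs c and c + cores share physical core c.
    """
    if layout == "free":
        # No pinning: the OS scheduler decides where threads run.
        return None
    if threads > 2 * cores:
        raise ValueError("more threads than logical CPUs")
    if layout == "spread":
        # Use every physical core once before touching any sibling.
        order = list(range(cores)) + [c + cores for c in range(cores)]
    elif layout == "clump":
        # Fill both hyperthreads of a core before moving to the next.
        order = [cpu for c in range(cores) for cpu in (c, c + cores)]
    else:
        raise ValueError("unknown layout: " + layout)
    return order[:threads]


# 4-core CPU with hyperthreading, one task with 2 or 4 threads.
for layout in ("spread", "clump"):
    for threads in (2, 4):
        print(layout, threads, layout_cpus(layout, threads, cores=4))
```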



Managing Affinity



If you decide that you want to run primegrid with an affinity layout other than 'free' you can run the script with '--layout clump' or '--layout spread' as root. The script will watch for BOINC to start primegrid LLR tasks and manage their affinity using 'taskset'.
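For anyone curious what pinning with 'taskset' looks like, here is a rough sketch that only builds the command line for an already-running process. The helper name taskset_command and the PID are made up for illustration; how the script itself finds LLR tasks and calls taskset is not shown here.

```python
def taskset_command(pid, cpus):
    """Build a taskset invocation that pins every thread of `pid` to `cpus`."""
    cpu_list = ",".join(str(c) for c in sorted(cpus))
    # -a: apply to all threads of the process
    # -c: interpret the argument as a CPU list rather than a bitmask
    # -p: operate on an existing PID instead of launching a new command
    return ["taskset", "-a", "-c", "-p", cpu_list, str(pid)]


# Hypothetical PID of an LLR task, pinned to logical CPUs 0 and 4.
cmd = taskset_command(12345, [0, 4])
print(" ".join(cmd))
```

The resulting list could be passed to subprocess.run(cmd, check=True). Changing the affinity of another user's process (such as one started by BOINC) generally needs root, which is why the script has to be run as root for this.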



Advanced Options



'--processors N' Benchmark using only N processors. This counts logical processors. For example, on a 4-core processor with hyperthreading, '--processors 4' benchmarks only the configurations equivalent to setting 50% of CPUs in BOINC. This reduces the number of different benchmarks run.



'--threads N' Benchmark only using tasks with N threads. This will reduce the number of different benchmarks to run.



'--layout free|spread|clump' Benchmark only tasks using the specified affinity layout. This will reduce the number of benchmarks to run.



Example for a 6-core hyperthreading processor: '--processors 6 --layout free' will only benchmark 1x6, 2x3, 3x2 and 6x1 tasks x threads without worrying about CPU affinity.



'--ci .90' Change the confidence interval used to compute the error bars. This will change the number of times the script re-runs benchmarks before it's "confident" in the results.



'--llr-executable path/to/llr' Specify a specific LLR executable.



Even more options: See '--help' output, but be wary, these can have unfortunate effects on your system.
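The '--processors 6 --layout free' example above boils down to listing the ways 6 logical processors can be divided evenly between tasks. A minimal sketch of that enumeration follows (the function name plans is hypothetical, and this is not taken from the script; the unrestricted benchmark also tries combinations such as 7 threads at 88%, which this does not cover):

```python
def plans(processors):
    """Return (tasks, threads) pairs that use exactly `processors` logical CPUs."""
    return [
        (processors // threads, threads)
        for threads in range(processors, 0, -1)
        if processors % threads == 0
    ]


# Tasks x threads combinations for '--processors 6'.
for tasks, threads in plans(6):
    print(f"{tasks}x{threads}")
```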



Known Bugs



Systems with complicated topologies are not modelled by the script. That includes systems with multiple CPU sockets, NUMA, and Ryzen 3000-series CPUs. On such systems, CPU affinity may be handled poorly for thread counts greater than 2 with hyperthreading, or greater than 1 without hyperthreading. It is better to let the OS manage CPU affinity in these situations. For example, consider the Ryzen 3600X, a 6-core processor with hyperthreading. On this CPU, cores are organized into groups called CCXs, and communication inside a single CCX is much faster than communication between cores in different CCXs. Thus plans like 4 threads x 3 tasks (100%) may have very poor performance if 'clump' or 'spread' is chosen. I plan to support these systems better, eventually, if I can get my hands on one.



Future features





Manage CPU temperature (for systems which thermal throttle :x)



Model NUMA/multisocket/CCX CPU topologies



Manage C-states



Manage GPU driver affinity



Collect power usage and temperature stats





Example Results



PPS-DIV on an i7-7700K @ 4.7GHz: use 1 thread, 50% CPUs, 'free' affinity

PPS-DIV on an i7-3770K @ 4.3GHz: use 1 thread, 50% CPUs, 'free' affinity

PPS-DIV on an i7-8750H @ 45W: use 1 thread, 50% CPUs, 'spread' affinity

PPS-DIV on an i7-4700MQ @ 27W: use 1 thread, 50% CPUs, 'spread' affinity

PPS-DIV on a Xeon E3-1225 V2 @ stock: use 1 thread, 100% CPUs, 'spread' affinity

PPS-DIV on an i7-9700K @ 4.8GHz: use 1 thread, 100% CPUs, 'free' affinity

SoB on an i7-7700K @ 4.7GHz: use 4 threads, 50% CPUs, 'free' affinity



Conclusions



From experimenting with this script I've come to the following conclusions. Note that these apply only to Linux. Other operating systems handle threads very differently.



Using 1 thread per task on half the logical CPUs (with hyperthreading) or on all the CPUs (without hyperthreading) is generally an okay choice. Even for SoB, using 1 thread isn't much slower than 4 threads.



Choosing whether or not to use all logical processors (threads) on a hyperthreading CPU matters.



Setting thread affinity can improve performance on some systems.



Best thread count changes depending on project. Both k and n can have an effect here and they can have different effects.





Feel free to post your results below!







____________

Crunched with love,

Hazel

Wow! thanks for all your work that went into this Hazel, very cool! Appreciated 🙂

____________

My lucky numbers 1059094^1048576+1 and 224584605939537911+81292139*23#*n for n=0..26

What version of numpy is required? Ubuntu 18.04 LTS comes with 1:1.13.3-2ubuntu1, and it doesn't seem to have the multiarray numpy extension module. Whatever that is.

Is it giving you an error? It shouldn't require any particular numpy feature... Numpy is just required for scipy.

____________

Crunched with love,

Hazel

I figured it out. python3.7 doesn't work with the 3.6 versions of numpy and scipy for some reason. So I had to make sure the apt versions of those were uninstalled. Then these commands set it up correctly:



sudo apt install python3.7
python3.7 -m pip install pip
sudo -H python3.7 -m pip install scipy



(Installing scipy installs numpy too.)



Thanks! Your script looks useful, but I think it will take a while to learn to use it well.

The script now supports managing the CPU temperature.



Managing CPU Temperature



For some systems which thermal throttle, or are simply too loud, the script can manage the CPU temperature by specifying '--target-temp 95' or some other temperature in °C. The script will then change the maximum allowed CPU frequency until that temperature is met but not exceeded. Do not use this feature on overclocked systems.



Thermal throttling can degrade performance when the CPU runs too fast, overheats, and then runs very slowly until it cools down, then repeats this process over and over again. It is more efficient to run the CPU at a more consistent, intermediate speed.
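The idea of nudging the maximum CPU frequency toward a target temperature can be sketched as a simple step controller. This is an illustrative guess at the general approach, not the script's implementation: the function name next_max_freq, the step size, and the example frequencies are all assumptions, and reading the temperature or applying the new limit is left out.

```python
def next_max_freq(current_khz, temp_c, target_c, lo_khz, hi_khz, step_khz=100_000):
    """Return the next maximum frequency (kHz) given the current temperature.

    Lower the cap when the CPU is hotter than the target, raise it when
    cooler, and always stay within the hardware limits [lo_khz, hi_khz].
    """
    if temp_c > target_c:
        proposed = current_khz - step_khz
    elif temp_c < target_c:
        proposed = current_khz + step_khz
    else:
        proposed = current_khz
    return max(lo_khz, min(hi_khz, proposed))


# Hypothetical CPU: 800 MHz to 4.2 GHz, target 95 °C, currently at 97 °C.
print(next_max_freq(4_200_000, temp_c=97, target_c=95,
                    lo_khz=800_000, hi_khz=4_200_000))
```

Small, repeated steps like this settle on a steady intermediate speed instead of the run-fast, overheat, crawl cycle described above.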



____________

Crunched with love,

Hazel

I did some benchmarks on my machines for the upcoming 321 challenge, using python3 primegrid.py --benchmark-llr '3*2^15083930+1'



i7-7700K @ 4.7GHz

Processors: 100% Threads: 1 Layout: spread Tasks/day 7.22±0.06
Processors: 100% Threads: 1 Layout: free Tasks/day 7.23±0.12
Processors: 100% Threads: 2 Layout: clump Tasks/day 7.54±0.16
Processors: 100% Threads: 2 Layout: free Tasks/day 7.66±0.24
Processors: 100% Threads: 2 Layout: spread Tasks/day 7.70±0.07
Processors: 50% Threads: 1 Layout: spread Tasks/day 7.88±0.32
Processors: 50% Threads: 1 Layout: free Tasks/day 7.88±0.24
Processors: 50% Threads: 4 Layout: free Tasks/day 8.32±0.49
Processors: 100% Threads: 4 Layout: clump Tasks/day 8.35±0.29
Processors: 50% Threads: 2 Layout: free Tasks/day 8.40±0.06
Processors: 50% Threads: 2 Layout: spread Tasks/day 8.44±0.22
Processors: 100% Threads: 4 Layout: free Tasks/day 8.46±0.13
Processors: 100% Threads: 4 Layout: spread Tasks/day 8.55±0.07
Processors: 50% Threads: 4 Layout: spread Tasks/day 8.67±0.08
Processors: 88% Threads: 7 Layout: free Tasks/day 9.04±0.04
Processors: 100% Threads: 8 Layout: free Tasks/day 9.13±0.03

Hyper-threading is really doing work on this CPU...



Xeon E3-1225 V2 @ stock

Processors: 100% Threads: 1 Layout: free Tasks/day 2.94±0.03
Processors: 100% Threads: 1 Layout: spread Tasks/day 2.95±0.22
Processors: 75% Threads: 3 Layout: free Tasks/day 2.98±0.12
Processors: 100% Threads: 2 Layout: free Tasks/day 3.18±0.29
Processors: 100% Threads: 2 Layout: spread Tasks/day 3.26±0.06
Processors: 100% Threads: 4 Layout: spread Tasks/day 3.79±0.01
Processors: 100% Threads: 4 Layout: free Tasks/day 3.81±0.01

Basically as expected, multithreading is faster on this 4-core CPU without HT.



i7-3770K @ 4.3GHz couldn't decide between these three:

Processors: 100% Threads: 8 Layout: free Tasks/day 4.59±0.03
Processors: 88% Threads: 7 Layout: free Tasks/day 4.59±0.04
Processors: 50% Threads: 4 Layout: spread Tasks/day 4.63±0.02

Which is, honestly, pretty fascinating. This CPU is right on the line between HT helping and hurting.



i7-8750H @ 45W couldn't decide between these:

Processors: 100% Threads: 6 Layout: free Tasks/day 5.86±0.13
Processors: 92% Threads: 11 Layout: free Tasks/day 5.88±0.08
Processors: 50% Threads: 1 Layout: spread Tasks/day 5.95±0.06
Processors: 50% Threads: 3 Layout: free Tasks/day 5.96±0.11
Processors: 100% Threads: 6 Layout: spread Tasks/day 5.97±0.07
Processors: 50% Threads: 1 Layout: free Tasks/day 5.97±0.03
Processors: 50% Threads: 6 Layout: spread Tasks/day 6.02±0.07
Processors: 50% Threads: 2 Layout: spread Tasks/day 6.04±0.04
Processors: 50% Threads: 3 Layout: spread Tasks/day 6.04±0.07

This is a laptop and it's probably just trading off power budget between the RAM and the cores.







____________

Crunched with love,

Hazel

It's not clear to me, how many tasks you were running?

It's not clear to me, how many tasks you were running?



It depends on the CPU... it's listed in % like how BOINC is configured. That is, i7-7700K is a 4C/8T CPU so "Processors: 50% Threads: 2" means to run 2 tasks because 50% of 8 is 4, and 4/2 is 2.
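Spelled out as code, the arithmetic is roughly the following. This is a sketch only: the function name task_count is made up, and rounding the usable processor count down is an assumption about how the percentage is applied.

```python
def task_count(logical_cpus, percent, threads):
    """Number of simultaneous tasks for a 'Processors: P% Threads: T' line."""
    # Logical processors made available by the percentage (rounded down).
    usable = logical_cpus * percent // 100
    # Each task occupies `threads` of those processors.
    return usable // threads


# i7-7700K (4C/8T): express a few result lines as tasks x threads.
for percent, threads in [(50, 2), (100, 4), (88, 7), (50, 1)]:
    tasks = task_count(8, percent, threads)
    print(f"Processors: {percent}% Threads: {threads} -> {tasks}x{threads}")
```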

____________

Crunched with love,

Hazel

It's not clear to me, how many tasks you were running?



It depends on the CPU... it's listed in % like how BOINC is configured. That is, i7-7700K is a 4C/8T CPU so "Processors: 50% Threads: 2" means to run 2 tasks because 50% of 8 is 4, and 4/2 is 2.



For me, it would be so much easier to understand your tests if you explicitly say tasks x threads for each test. Instead of having to guess, and getting it wrong.

Has this been updated to be able to be used on Ryzen 3000 series? I have a 3970 and having issues with affinity.




