Message boards : Number crunching : LLR 3.8.23 Testing (AVX-512)


Hi all,



LLR 3.8.23 has recently been released. This version introduces support for AVX-512. On new CPUs that support it, AVX-512 provides a substantial boost in speed and also seems to significantly reduce power usage and heat generation. (Yeah, I know. I don't understand that either.)



LLR builds are available from: http://jpenne.free.fr

LLR wrapper builds are available from: https://github.com/ibethune/llr_wrapper/tree/master/bin



Before we can roll out the new apps to BOINC, we need to do some thorough testing. A testing spreadsheet is available here:



https://docs.google.com/spreadsheets/d/1OEClMY3-QpxBCksayLju-4jg-PHrdPXDTKJZ3PGEqog/edit?usp=sharing



As with previous test efforts, there is a range of manual tests, where you run the LLR program directly from the command line (the commands are given at the far right of the document), in both single-threaded and multi-threaded (you choose how many threads) configurations. For the manual tests, please post here to reserve a test, and then post your test results along with details of your host OS and CPU (or a link to your BOINC host page). Manual credit will be applied to your account for any tests you complete.



There are also a range of tests to be run under BOINC using app_info.xml. If you need any help setting this up please ask! You can use the new LLR code with the existing LLR wrapper - it has not changed.



Please post all of your results, questions, comments in this thread. I will periodically hide posts and update the google sheet to keep the thread tidy.



When looking for an available test to perform, you MUST check both the spreadsheet for an available box (either yellow or brown) AND also check in this thread to see if anyone has recently reserved that test. It can be up to a day until I update the spreadsheet, and we don't want you wasting your time running a test that someone else is already running.



PLEASE read the instructions at the top of the spreadsheet. Also, the spreadsheet is quite large and you will need to scroll both horizontally and vertically to see the whole thing.



Thanks in advance for all the testing!



EDIT: There are two different manual PRP numbers you can test, a long number and a short number. The long number is a full-sized SOB test, so the short number is more appropriate for slow computers and/or single threading. Run one or the other; do NOT run both!

____________

My lucky number is 75898^524288+1

Can't edit Google docs, but will do Windows AVX-512 manual tests, both one thread and MT.







That's intentional. Reserve it here, and I'll update the spreadsheet.





____________

My lucky number is 75898^524288+1

And yeah, this machine already finished 5 GCW and 114 CUL tasks on BOINC.

https://www.primegrid.com/results.php?hostid=530837



Great, we can use those instead of the +1 SR5 tests. Thanks!



____________

My lucky number is 75898^524288+1

Judging by the lack of volunteers for the BOINC tests, as well as the length of time since we've needed to use app_info, I'm guessing some of you might benefit from a refresher course on how to set up anonymous platform. Do you think that would be helpful?

____________

My lucky number is 75898^524288+1

Judging by the lack of volunteers for the BOINC tests, as well as the length of time since we've needed to use app_info, I'm guessing some of you might benefit from a refresher course on how to set up anonymous platform. Do you think that would be helpful?



I've never done app_info.xml before.

With a little help I'll happily do the Windows FMA3 tests for both single and multi-thread.

____________

My Primes

Badge Score: 2*1 + 4*2 + 6*6 + 7*7 + 9*2 + 10*1 = 123



Windows AVX-512 Manual MT

Test 12

21181*2^30490964+1 is not prime. RES64: C8B30A86DC718154. OLD64: 5A191F94955483F8 Time : 133744.234 sec .



Honza,



Was the computer simultaneously doing other tests while this was running? That's slower than I would expect. A Kaby Lake (FMA3) can do that test on 4 cores in about 30 hours, whereas your run took about 37 hours.



If the CPU was doing other tests (and therefore using memory bandwidth) I'm not concerned. If it wasn't, we may need to look into why that wasn't faster.
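
For reference, here is the arithmetic behind the "37 hours" figure. This is only a minimal sketch using the two numbers quoted in this thread (the 133744.234 seconds from the Test 12 result line and the rough 30 hour Kaby Lake figure); the variable names are made up for illustration.

```python
# Convert the run time reported by LLR to hours and compare it with the
# approximate Kaby Lake figure quoted in this post.
reported_seconds = 133744.234   # "Time :" value from the Test 12 result line
kaby_lake_hours = 30.0          # approximate figure, as quoted above

reported_hours = reported_seconds / 3600.0
slowdown = reported_hours / kaby_lake_hours

print(f"Reported run: {reported_hours:.1f} hours")
print(f"Relative to the Kaby Lake figure: {slowdown:.2f}x as long")
```

Since the Kaby Lake number is only approximate, the ratio is a rough indication and not a benchmark.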

____________

My lucky number is 75898^524288+1

Yes, tests were running while most of the 2x16 cores were doing other LLR tasks.

Also, as usual with many-core systems, the CPUs are clocked at only ~2 GHz.

But it could do 7 such tasks at once.



A higher clocked CPU (and RAM) running a single LLR MT task could easily be faster.

____________

My stats

Badge score: 1*1 + 5*1 + 8*3 + 9*11 + 10*1 + 11*1 + 12*3 = 186

Judging by the lack of volunteers for the BOINC tests, as well as the length of time since we've needed to use app_info, I'm guessing some of you might benefit from a refresher course on how to set up anonymous platform. Do you think that would be helpful?



I've never done app_info.xml before.

With a little help I'll happily do the Windows FMA3 tests for both single and multi-thread.



Alright, I'll put you down for those tests, and if nobody comes by and explains how to do it, I'll post an app_info tutorial later today if I get time.



EDIT: start by letting all of your PrimeGrid tasks finish. You'll have to let your queue run dry before turning on app_info. When you turn it on, all of your PrimeGrid tasks currently on that computer will be erased. That's the first step.

____________

My lucky number is 75898^524288+1

Keith wrote: I've never done app_info.xml before.

With a little help I'll happily do the Windows FMA3 tests for both single and multi-thread.

Keith, here's my Windows app_info.xml that I'm using for this testing, at least for single-threaded. It's very minimalist and covers only the five subprojects we're testing here. Anyone criticizing it, especially if they're not doing this testing themselves, will be ignored and have their posts deleted.



You'll also need llr_wrapper_8.00_windows_x86_64.exe, llr.ini.6.07, and cllr64.3.8.23.exe. Right click and use "Save link as" if you have to. In my browser, the .xml and .ini files are displayed by default rather than saved.



Let any PrimeGrid work finish. Stop BOINC using File/Shut down connected client, then File/Exit BOINC. Copy those four files into C:\ProgramData\BOINC\projects\www.primegrid.com.



Change the preferences for the venue your host is in, for example to PPSE. Tell BOINC to only use 25% of the processors (assuming 4 cores with HT off). Restart BOINC and let it grab a PPSE job. Change preferences to SGS, tell BOINC to use 50% of cores, grab an SGS job. Change to the next type of job, set to 75% of cores, etc. Keep doing that for all five apps.

You need one +1 SR5 and one -1 SR5. To tell which you have, look in C:\ProgramData\BOINC\slots; each of the subdirectories holds a running job. Look at the llr.in file and you can tell which c value it uses. If the first line is 1000000000:P:1:5:257 then it's SR5 +1, as opposed to 1000000000:M:0:5:258 if it's SR5 -1. The -1 jobs outnumber the +1 jobs by more than 2:1, so you may have to abort some jobs to get a +1.
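
If you would rather not open each llr.in by hand, the check can be scripted. The following is only a rough sketch: it assumes that the second colon-separated field of the first line ("P" or "M") is what distinguishes +1 from -1, which is inferred purely from the two example headers quoted above, and the function names are made up. It does not verify that a slot actually holds an SR5 job, so jobs from the other subprojects will be listed too.

```python
from pathlib import Path


def classify_sr5_header(first_line: str) -> str:
    """Return '+1', '-1' or 'unknown' for the first line of an llr.in file."""
    fields = first_line.strip().split(":")
    if len(fields) < 2:
        return "unknown"
    # Assumption: 'P' marks a +1 job and 'M' a -1 job, as in the two
    # example headers 1000000000:P:1:5:257 and 1000000000:M:0:5:258.
    return {"P": "+1", "M": "-1"}.get(fields[1], "unknown")


def scan_slots(slots_dir: str) -> dict:
    """Map each slot subdirectory name to the sign found in its llr.in."""
    results = {}
    root = Path(slots_dir)
    if not root.is_dir():
        return results
    for slot in sorted(root.iterdir()):
        llr_in = slot / "llr.in"
        if llr_in.is_file():
            with llr_in.open() as handle:
                results[slot.name] = classify_sr5_header(handle.readline())
    return results


if __name__ == "__main__":
    # Default BOINC data directory on Windows, as used in this post.
    for slot, sign in scan_slots(r"C:\ProgramData\BOINC\slots").items():
        print(f"slot {slot}: {sign}")
```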



If you get too many of any kind of job, feel free to abort it. Remember to set No New Tasks before doing so or you'll just get more of them. Feel free to abort SR5 until you have one each of +1 and -1. I aborted more than 30 looking for a +1 where the wingman had already come in. Normal users can't do that as they can't see details of work still in progress, so don't worry about the wingman.



For MT testing, app_config.xml works like it always does. Another way would be to set 25% of cores in BOINC and add the line ThreadsPerTest=4 to the llr.ini.6.07 file. Haven't yet tried it but I think it would work. There may be other ways in the app_info.xml file itself but I'm not trying to be elegant, just want it to work.



When finished with testing, the easiest way to clean everything up is to Detach/Remove PrimeGrid from your BOINC client, then reattach. That removes the www.primegrid.com directory entirely and forces your computer to download everything it needs.

A month ago, I posted app_info.xml and app_config.xml for all of PG's LLR apps; they're in "LLR 3.8.23 released".

Mike's warning about letting all of your PrimeGrid tasks finish before starting anonymous platform is a valid concern.





____________

My stats

Badge score: 1*1 + 5*1 + 8*3 + 9*11 + 10*1 + 11*1 + 12*3 = 186

** TESTING ANNOUNCEMENT **



I've created a second, alternate, number for testing the PRP test. The original number was full-sized SoB length. It doesn't need to be that long. The second, alternate, PRP number is much shorter, and easier to run on slow computers and/or in single thread mode. Run one or the other; do not run both.





____________

My lucky number is 75898^524288+1

hi

H43

./sllr64.3.8.23 -d -oForcePRP=1 -oFermatPRPtest=1 -q"55459*2^1000054+1"

Starting probable prime test of 55459*2^1000054+1

Using all-complex AVX-512 FFT length 80K, Pass1=128, Pass2=640, clm=1, a = 3

Iter: 910343/1000070, ERROR: ROUND OFF (0.4168440416) > 0.4

Continuing from last save file.

Resuming probable prime test of 55459*2^1000054+1 at bit 2 [0.00%]

Using all-complex AVX-512 FFT length 80K, Pass1=128, Pass2=640, clm=1, a = 3

Disregard last error. Result is reproducible and thus not a hardware problem.

For added safety, redoing iteration using a slower, more reliable method.

Continuing from last save file.

Resuming probable prime test of 55459*2^1000054+1 at bit 910343 [91.02%]

Using all-complex AVX-512 FFT length 80K, Pass1=128, Pass2=640, clm=1, a = 3

55459*2^1000054+1 is not prime. RES64: 8A4ACA77151DE2CD. OLD64: 9EE05F653F59A862 Time : 132.848 sec.



That counts as a success. That's normal; there's nothing wrong.
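
What gets compared between testers is the final result line, not the ROUND OFF messages in between. As a rough sketch, the residue and time can be pulled out of such a line like this. The pattern is inferred only from the two "is not prime" lines posted in this thread, so treat it as an assumption and not as a description of everything LLR can print; the function name is hypothetical.

```python
import re

# Matches lines such as:
# 55459*2^1000054+1 is not prime. RES64: 8A4ACA77151DE2CD. OLD64: 9EE05F653F59A862 Time : 132.848 sec.
RESULT_RE = re.compile(
    r"^(?P<number>\S+) is not prime\.\s+"
    r"RES64:\s*(?P<res64>[0-9A-Fa-f]{16})\.\s+"
    r"OLD64:\s*(?P<old64>[0-9A-Fa-f]{16})\s+"
    r"Time\s*:\s*(?P<seconds>\d+(?:\.\d+)?)\s*sec"
)


def parse_result_line(line: str):
    """Return a dict with number, res64, old64 and seconds, or None if no match."""
    match = RESULT_RE.match(line.strip())
    if match is None:
        return None
    info = match.groupdict()
    info["res64"] = info["res64"].upper()
    info["old64"] = info["old64"].upper()
    info["seconds"] = float(info["seconds"])
    return info


if __name__ == "__main__":
    line = ("55459*2^1000054+1 is not prime. RES64: 8A4ACA77151DE2CD. "
            "OLD64: 9EE05F653F59A862 Time : 132.848 sec.")
    print(parse_result_line(line))
```

Two runs of the same number agree if their RES64 values are identical, regardless of how long each run took.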

____________

My lucky number is 75898^524288+1

Hi,

I can take the remaining manual Linux MT tasks for SSE3. Just so I'm sure, I need to test using the statically linked version of the program, right?

Yes, the statically linked one will always work. The dynamically linked one is more unpredictable. The server always uses the statically linked executable.

ANNOUNCEMENT



The 64 bit AVX-512 LLR apps are live. You do not need to use app_info any longer to run the 64 bit apps.



We're still doing more testing on the 32 bit apps.



Even though the 64 bit apps are released, please continue any tests you've reserved.





____________

My lucky number is 75898^524288+1

The 32 bit Linux app is also live.



The 32 bit Windows app is waiting for some tests to finish.

____________

My lucky number is 75898^524288+1

ANNOUNCEMENT



The 64 bit AVX-512 LLR apps are live. You do not need to use app_info any longer to run the 64 bit apps.







Hi Michael,



Because we are getting close to the next challenge and I don't think I will have time to try it before then, I need to ask a really stupid question: Do I need to do anything to get AVX512 working on my X CPU? I am hoping all I need to do is download the workunits and it will just work?

I am running ESP right now and it is slow on the X CPU compared to the 9900K - approx 2 days (yet to finish) versus 30.5 hours.



Cheers,

Nick

I am running ESP right now and it is slow on the X CPU compared to the 9900K - approx 2 days (yet to finish) versus 30.5 hours.

How many tasks are you running at once?

Are you using multi-threading?

What ram (speed, channels) is installed in each system?

What is the clock while everything is running?



I'll try throwing a single ESP unit on my 7800X system for comparison.



Edit: Estimating 2h15m for a single ESP unit, running 6 threads at 4.0 GHz, quad channel 3000 ram.

Because we are getting close to the next challenge and I don't think I will have time to try it before then, I need to ask a really stupid question: Do I need to do anything to get AVX512 working on my X CPU? I am hoping all I need to do is download the workunits and it will just work?



It's completely automatic.



The next LLR task you get will run AVX-512 if your CPU supports it. Note that tasks downloaded BEFORE the new LLR was installed will NOT be able to use AVX-512. V8.02 tasks do not support AVX-512. 8.03 tasks do.
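
If you want to check whether the operating system reports AVX-512 on your CPU, here is a minimal sketch for Linux hosts. It only looks for the avx512f flag in /proc/cpuinfo; how LLR itself decides which transform to use is not covered here, and on Windows you would need a different tool. The surest confirmation is the "Using all-complex AVX-512 FFT length ..." line in the LLR output, as seen in the logs in this thread.

```python
def cpu_flags(cpuinfo_text: str) -> set:
    """Collect the CPU feature flags listed in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags.update(value.split())
    return flags


def has_avx512f(cpuinfo_text: str) -> bool:
    """True if the AVX-512 Foundation flag is present."""
    return "avx512f" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            text = handle.read()
    except OSError:
        print("No /proc/cpuinfo found (probably not a Linux host)")
    else:
        print("AVX-512F reported by the OS:", has_avx512f(text))
```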



I am running ESP right now and it is slow on the X CPU compared to the 9900K - approx 2 days (yet to finish) versus 30.5 hours.



All of your ESP tasks are older v8.02 tasks, which your computer downloaded before the new LLR was available.

____________

My lucky number is 75898^524288+1

I am running ESP right now and it is slow on the X CPU compared to the 9900K - approx 2 days (yet to finish) versus 30.5 hours.

How many tasks are you running at once?

I am running 14 tasks on the 16 core CPU and 7 tasks on the 8 core CPU. Neither PC is using a GPU. The 8 core PC has a 2080 Ti which seems to need 2 or 3 cores dedicated to the GPU for it to run well.



Are you using multi-threading?

I am using no multi-threading whatsoever on either PC



What ram (speed, channels) is installed in each system?

The 9960X has quad channel 16 GB total (4 x 4 GB) 4000 ram - I did have XMP running but it made the system unstable. The 9900K has dual channel 32 GB total 3000 ram and I can't tell you right now if XMP is running.



What is the clock while everything is running?

The clock on the 9960X is 3495 MHz (occasionally 3995 MHz) and the 9900K is very keen to run at 4700 MHz.



I'll try throwing a single ESP unit on my 7800X system for comparison.



Edit: Estimating 2h15m for a single ESP unit, running 6 threads at 4.0 GHz, quad channel 3000 ram.

I am going to need to get multi-threading running, aren't I? (Edit: I mean multi-threading for a single task) It seems that there is a considerable throughput improvement, not just an improvement in individual task times.

Thank you for taking the time to run a task and to reply



All of your ESP tasks are older v8.02 tasks, which your computer downloaded before the new LLR was available.



Awesome - I thought/hoped that was the case. Thank you Michael



Could someone with an AVX-512 capable CPU please try running the first couple of iterations of this test on llr 3.8.21 vs 3.8.23:



C:\ProgramData\BOINC\projects\www.primegrid.com>del llr.ini

C:\ProgramData\BOINC\projects\www.primegrid.com>del z0336833

C:\ProgramData\BOINC\projects\www.primegrid.com>cllr64.3.8.23 -d -t4 -q"19249*2^13018586+1"

Starting Proth prime test of 19249*2^13018586+1

Using all-complex AVX-512 FFT length 1152K, Pass1=1152, Pass2=1K, clm=2, 4 threads, a = 3

19249*2^13018586+1, bit: 50000 / 13018600 [0.38%]. Time per bit: 2.427 ms.

Caught signal. Terminating.



C:\ProgramData\BOINC\projects\www.primegrid.com>del llr.ini

C:\ProgramData\BOINC\projects\www.primegrid.com>del z0336833

C:\ProgramData\BOINC\projects\www.primegrid.com>cllr64.3.8.21 -d -t4 -q"19249*2^13018586+1"

Starting Proth prime test of 19249*2^13018586+1

Using all-complex FMA3 FFT length 1152K, Pass1=384, Pass2=3K, 4 threads, a = 3

19249*2^13018586+1, bit: 50000 / 13018600 [0.38%]. Time per bit: 2.149 ms.



That was on a Xeon Silver 4110.



The AVX-512 transform was significantly slower than the FMA3 transform, at least on this particular test on this particular CPU.



Can anyone reproduce these results? There's no obvious case of PEBKAC here.
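
To put the two "Time per bit" figures in perspective, here is a rough sketch that projects them over the whole test. It assumes the per-bit time measured over the first 50000 bits stays constant for all 13018600 bits, which is only an approximation; the function name is made up.

```python
TOTAL_BITS = 13018600   # total bit count shown in the LLR progress line


def projected_hours(ms_per_bit: float, total_bits: int = TOTAL_BITS) -> float:
    """Project a full run time in hours from a per-bit time in milliseconds."""
    return ms_per_bit * total_bits / 1000.0 / 3600.0


avx512_ms = 2.427   # 3.8.23, AVX-512 transform, 4 threads
fma3_ms = 2.149     # 3.8.21, FMA3 transform, 4 threads

ratio = avx512_ms / fma3_ms

print(f"AVX-512: about {projected_hours(avx512_ms):.2f} hours")
print(f"FMA3:    about {projected_hours(fma3_ms):.2f} hours")
print(f"AVX-512 takes {ratio:.3f}x as long per bit as FMA3 on this CPU")
```

The same function can be fed the per-bit times from any other CPU to compare results on an equal footing.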



____________

My lucky number is 75898^524288+1

C:\test>cllr64.3.8.23 -d -t4 -q"19249*2^13018586+1"

Starting Proth prime test of 19249*2^13018586+1

Using all-complex AVX-512 FFT length 1152K, Pass1=1152, Pass2=1K, clm=2, 4 threads, a = 3

19249*2^13018586+1, bit: 50000 / 13018600 [0.38%]. Time per bit: 0.739 ms.

Caught signal. Terminating.



C:\test>del llr.ini

C:\test>del z0336833

C:\test>cllr64 -d -t4 -q"19249*2^13018586+1"

Starting Proth prime test of 19249*2^13018586+1

Using all-complex FMA3 FFT length 1152K, Pass1=384, Pass2=3K, 4 threads, a = 3

19249*2^13018586+1, bit: 50000 / 13018600 [0.38%]. Time per bit: 1.148 ms.

Caught signal. Terminating.



CPU: 7800X stock



I had to download the 3.8.21 64-bit console build from Jean's site as I didn't have a local copy, but I'm seeing a speedup from the AVX-512 version here.





I think I know why the Xeon Silver is not seeing a speed increase. According to the following link, it only has one AVX-512 unit per core. Skylake-X and higher series Xeon CPUs have two AVX-512 units.

https://ark.intel.com/content/www/us/en/ark/products/123547/intel-xeon-silver-4110-processor-11m-cache-2-10-ghz.html



To my understanding, a single unit AVX-512 in this kind of application is no better than AVX2/FMA. It might have other advantages not used here. You need the 2nd unit to have a performance increase.



That a reduction in performance was seen might be down to the code being optimised for two units, and not working well when the second unit is not present.

Not in my case on 2xIntel Xeon Gold 6130.

To add - with 7 threads.

Checked with 14 threads - and there .21 is faster (0.924 vs 1.363 ms).

Playing with affinity had no effect and CPU usage shows the same.





c:\_Honza>cllr64.3.8.21 -d -t7 -q"19249*2^13018586+1"

Starting Proth prime test of 19249*2^13018586+1

Using all-complex FMA3 FFT length 1152K, Pass1=384, Pass2=3K, 7 threads, a = 3

19249*2^13018586+1, bit: 50000 / 13018600 [0.38%]. Time per bit: 1.268 ms.

Caught signal. Terminating.



c:\_Honza>cllr64.3.8.23 -d -t7 -q"19249*2^13018586+1"

Starting Proth prime test of 19249*2^13018586+1

Using all-complex AVX-512 FFT length 1152K, Pass1=1152, Pass2=1K, clm=2, 7 threads, a = 3

19249*2^13018586+1, bit: 50000 / 13018600 [0.38%]. Time per bit: 0.834 ms.

Caught signal. Terminating.

____________

My stats

Badge score: 1*1 + 5*1 + 8*3 + 9*11 + 10*1 + 11*1 + 12*3 = 186

Could someone with an AVX-512 capable CPU please try running the first couple of iterations of this test on llr 3.8.21 vs 3.8.23:







Xeon Platinum 8124M running 9 threads per test



3.8.21

Starting Proth prime test of 19249*2^13018586+1

Using all-complex FMA3 FFT length 1152K, Pass1=384, Pass2=3K, 9 threads, a = 3

^C249*2^13018586+1, bit: 160000 / 13018600 [1.22%]. Time per bit: 0.799 ms.



3.8.23

Starting Proth prime test of 19249*2^13018586+1

Using all-complex AVX-512 FFT length 1152K, Pass1=1152, Pass2=1K, clm=2, 9 threads, a = 3

^C249*2^13018586+1, bit: 170000 / 13018600 [1.30%]. Time per bit: 0.594 ms.



There is another factor to consider. The processor's maximum clock speed is reduced when running AVX-512 work.



For example, with all cores loaded the Xeon 8124M will only reach 2900 MHz with AVX-512, compared to 3300 MHz with AVX2 and 3400 MHz with all other workloads.



So there are many things at play here: the number of AVX-512 units per processor and the reduced clock speeds.
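
To show how the clock reduction interacts with the measured times, here is a rough worked example using the Xeon Platinum 8124M figures posted above (0.799 ms per bit with FMA3, 0.594 ms per bit with AVX-512). It assumes, and this is purely an assumption since the actual clocks during those runs were not reported, that the FMA3 run sat at the 3300 MHz AVX2 limit and the AVX-512 run at the 2900 MHz limit.

```python
# Measured per-bit times for 19249*2^13018586+1 on the Xeon Platinum 8124M.
fma3_ms_per_bit = 0.799
avx512_ms_per_bit = 0.594

# Assumed clocks during each run (all-core limits quoted in this post;
# the real clocks were not measured).
fma3_mhz = 3300.0
avx512_mhz = 2900.0

# Wall-clock speedup, as a user would see it.
wall_speedup = fma3_ms_per_bit / avx512_ms_per_bit

# Clock cycles spent per bit: milliseconds * (cycles per millisecond).
fma3_cycles = fma3_ms_per_bit * fma3_mhz * 1000.0
avx512_cycles = avx512_ms_per_bit * avx512_mhz * 1000.0
per_clock_speedup = fma3_cycles / avx512_cycles

print(f"Wall-clock speedup: {wall_speedup:.3f}x")
print(f"Per-clock speedup:  {per_clock_speedup:.3f}x")
```

Under those assumptions the per-clock gain is larger than the wall-clock gain, because part of the AVX-512 advantage is given back to the lower clock speed.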

Hi.

Before we can roll out the new apps to BOINC, we need to do some thorough testing.

Judging by the lack of volunteers for the BOINC tests, [...]

I'm usually only frequenting the forums if I need something or if someone points me to a certain posting or thread. I have just now found this one by chance.

Also, I don't like fiddling with config files too much. In this regard the BOINC client's GUI is very limited. It could be much nicer, but instead of implementing that we have a BOINC wiki with a lot of complicated stuff. Well, that's not a PrimeGrid issue but a BOINC one.

So, for both reasons I'm not the ideal guy to do manual testing. ;-)



But, I've told Boinc to get test work from about any project that might want to send some to me. I'm certain other people did this as well. Afair it is possible to send individual test work to individual users or machines.

Can't this feature be used in the future?



Else:

Might it be useful to set up a beta project just like Einstein, Rosetta and Seti did?



Btw.:

Can you deduce the app_config.xml settings from the work unit's output?

If so, it should be possible to run test work accordingly.

____________

Greetings, Jens



92914140^65536+1