Message boards : Generalized Fermat Prime Search : Genefer 3.3.0 testing

Author Message

Hi all,



After some delay (sorry!) version 3.3.0 of Genefer is now ready for comprehensive testing, leading up to deploying the apps on BOINC. Quite a lot has gone into the new release, including the new OCL4/5 transforms as well as some changes to the transform switching and b-limit logic, so the test process will be quite extensive. Binaries are available from SVN:



Windows: https://www.assembla.com/spaces/genefer/subversion/source/HEAD/trunk/bin/windows



Linux: https://www.assembla.com/spaces/genefer/subversion/source/HEAD/trunk/bin/linux



Mac: https://www.assembla.com/spaces/genefer/subversion/source/HEAD/trunk/bin/mac



There is a full list of tests, with some instructions for how and what to run, on a Google sheet:



https://docs.google.com/spreadsheets/d/1dsyr9e6iSlvVZZa4qQpuJ6QxBXuSrxL1qQIBL4OpI0Q/



There is a large set of manual tests, for n=16, 17 and n=22 (which will take a long time, unfortunately!), and also some BOINC tests to run under app_info.xml. Please check carefully, as some tests require specifying the transform to use (via the -x option), and some leave genefer free to select the appropriate transform.



Please post in this thread if you want to reserve tests, or if you have results and/or questions. I will try to keep the thread clean by hiding posts once I have updated the Google sheet.



Thanks in advance for the help - the more people contribute, the sooner we can start using the new apps.



Cheers



- Iain





____________

Twitter: IainBethune

Proud member of team "Aggie The Pew". Go Aggie!

3073428256125*2^1290000-1 is Prime!

uh oh.

...

Using OCL4 transform

...

Running on platform 'Apple', device 'ATI Radeon HD Tahiti XT Prototype Compute Engine', version 'OpenCL 1.2 ' and driver '1.2 (Dec 17 2015 21:00:16)'.

...

Error: build program failed.

Error returned by cvms_element_build_from_source

Error: OpenCL error detected: CL_BUILD_PROGRAM_FAILURE.

Errors occurred for all available transform implementations





Known bug. I have had a ticket open with Apple for over 6 months with no response…



- Iain

____________

Twitter: IainBethune

Proud member of team "Aggie The Pew". Go Aggie!

3073428256125*2^1290000-1 is Prime!

I'll help out if I can :D

First few tests on a titan X, Windows 7





Hi, these ones are duplicates of work ich_eben* has already done. I'm trying to keep the spreadsheet up to date, but please check back in the thread too.



Useful to see results from different hardware and OS though - thanks!

____________

Twitter: IainBethune

Proud member of team "Aggie The Pew". Go Aggie!

3073428256125*2^1290000-1 is Prime!



Problem renaming genefer.ckpt.new to genefer.ckpt at i=1150976. Continuing processing anyway...





I'm looking into this...

____________

Twitter: IainBethune

Proud member of team "Aggie The Pew". Go Aggie!

3073428256125*2^1290000-1 is Prime!



Problem renaming genefer.ckpt.new to genefer.ckpt at i=1150976. Continuing processing anyway...





I'm looking into this...



Not quite sure what is causing the problem here, I have two suggestions for how to proceed:



1) Andrew, please download a debug binary from http://www.epcc.ed.ac.uk/~ibethune/files/genefer330debug.zip, and run one of the tests where you saw the message about failing to rename the checkpoint file. I added some debug statements, so every time genefer checkpoints you should see (if all is well):



Closing genefer.ckpt.new after write: No error
Stat genefer.ckpt: No error
Rename genefer.ckpt to genefer.ckpt.old: No error
Rename genefer.ckpt.new to genefer.ckpt: No error
Stat genefer.ckpt: No error
Remove genefer.ckpt.old: No error



In your case, I expect to see some message which will give us a clue as to what the problem is.



2) If anyone else can run the same tests as Andrew did (using the stock binary from SVN), please either confirm that you see the same error message, or that you don't.
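For reference, the rotation sequence those debug statements trace can be sketched as follows. This is a sketch only: the file names follow the log messages, but the actual genefer source may differ.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Sketch of the checkpoint rotation traced by the debug build: keep the
 * previous checkpoint as .old, promote the freshly written .new file,
 * then drop the backup. Illustrative only; not genefer's actual code. */
static int rotate_checkpoint(const char *tmp, const char *ckpt, const char *old)
{
    struct stat st;

    if (stat(ckpt, &st) == 0 && rename(ckpt, old) != 0) {
        /* a previous checkpoint exists but could not be backed up */
        fprintf(stderr, "Rename %s to %s: %s\n", ckpt, old, strerror(errno));
        return -1;
    }
    if (rename(tmp, ckpt) != 0) {
        /* this is the step that fails in the reports above */
        fprintf(stderr, "Rename %s to %s: %s\n", tmp, ckpt, strerror(errno));
        return -1;
    }
    remove(old);   /* best effort: losing the backup is harmless */
    return 0;
}
```

The point of the .old backup is that a crash between the two renames still leaves one complete checkpoint on disk.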



Thanks



- Iain

____________

Twitter: IainBethune

Proud member of team "Aggie The Pew". Go Aggie!

3073428256125*2^1290000-1 is Prime!

I'll help out if I can :D

First few tests on a titan X, Windows 7





Hi, these ones are duplicates of work ich_eben* has already done. I'm trying to keep the spreadsheet up to date, but please check back in the thread too.



Useful to see results from different hardware and OS though - thanks!

Sorry, the spreadsheet hadn't been updated so I wasn't sure at the time. It didn't take too much time, though I shouldn't get credit for anything I duplicated :)

No ocl3 testing for gfn-22?



Error! Fail to load fglrx kernel module! Maybe you can switch to root user to load kernel module directly

Error! Fail to load fglrx kernel module! Maybe you can switch to root user to load kernel module directly

Error! Fail to load fglrx kernel module! Maybe you can switch to root user to load kernel module directly





Nothing to worry about - this is just a warning, since you have some ATI drivers on your machine that don't work correctly.





Error: No root privilege. Please check with the system-admin.

Error: No root privilege. Please check with the system-admin.

Error: No root privilege. Please check with the system-admin.





This is not due to genefer, but rather to some OpenCL problem on your system. See http://boinc.berkeley.edu/dev/forum_thread.php?id=10315&postid=62752 for another example





Cannot open 'ocl5'

Fatal error (3). Genefer is terminating.





Looks like a typo here (maybe the -x was missed out, so genefer thinks ocl5 is an input file)?



Also, one other point: for the 'large b' tests (like 1927034^65536+1) you should not specify the transform (-x), but rather let genefer select it for you. I have left those boxes blank for now, so if you can run them again without -x, that would be very helpful!



Cheers



- Iain

____________

Twitter: IainBethune

Proud member of team "Aggie The Pew". Go Aggie!

3073428256125*2^1290000-1 is Prime!

This is on Windows with the modified file; it loops the error.



> genefer_windows64.exe -q "1927034^65536+1"

genefer 3.3.0-debug (Windows/CPU/64-bit)

Supported transform implementations: fma3 avx-intel sse4 sse2 default x87

Copyright 2001-2016, Yves Gallot

Copyright 2009, Mark Rodenkirch, David Underbakke

Copyright 2010-2012, Shoichiro Yamada, Ken Brazier

Copyright 2011-2014, Michael Goetz, Ronald Schneider

Copyright 2011-2016, Iain Bethune

Genefer is free source code, under the MIT license.



Command line: genefer_windows64.exe -q 1927034^65536+1



Priority change succeeded.

Default transform is past its b limit.



Testing 1927034^65536+1...

Using FMA3 transform

Resuming 1927034^65536+1 from a checkpoint (1339558 iterations left)

Closing genefer.ckpt.new after write: Bad file descriptor

Stat genefer.ckpt: Bad file descriptor

Rename genefer.ckpt to genefer.ckpt.old: Bad file descriptor

Rename genefer.ckpt.new to genefer.ckpt: Bad file descriptor

Stat genefer.ckpt: Bad file descriptor

Remove genefer.ckpt.old: Bad file descriptor

Estimated time remaining for 1927034^65536+1 is 0:09:38

Closing genefer.ckpt.new after write: Bad file descriptor

Stat genefer.ckpt: Bad file descriptor

Rename genefer.ckpt to genefer.ckpt.old: Bad file descriptor

Rename genefer.ckpt.new to genefer.ckpt: Bad file descriptor

Stat genefer.ckpt: Bad file descriptor

Remove genefer.ckpt.old: Bad file descriptor



^C caught. Writing checkpoint.

Closing genefer.ckpt.new after write: Bad file descriptor

Stat genefer.ckpt: Bad file descriptor

Rename genefer.ckpt to genefer.ckpt.old: Bad file descriptor

Rename genefer.ckpt.new to genefer.ckpt: Bad file descriptor

Stat genefer.ckpt: Bad file descriptor

Remove genefer.ckpt.old: Bad file descriptor



No ocl3 testing for gfn-22?



We don't expect to actually use Genefer v 3.3.0 on either n=21 or n=22 at a b high enough to require anything other than OCL (or OCL4 on Maxwell GPUs). So we only want to test OCL and OCL4. No need to put anyone through the pain of running such large tests on the slower transforms.



By the time we get to high enough b levels, it's unlikely we'll still be using this version of the apps.



I'm not sure why OCL5 is on there; that might be a mistake. Iain will give you a definitive answer later on (it's the middle of the night in his part of the world.)



On another note -- everyone might want to hold off STARTING any new tests before I get a chance to confer with Iain. I've been testing other parts of the apps and have discovered some problems that will need fixing, so we might need to fix, build, and test everything again.

____________

My lucky number is 75898^524288+1

This is on Windows with the modified file; it loops the error.



> genefer_windows64.exe -q "1927034^65536+1"

genefer 3.3.0-debug (Windows/CPU/64-bit)



Closing genefer.ckpt.new after write: Bad file descriptor

Stat genefer.ckpt: Bad file descriptor

Rename genefer.ckpt to genefer.ckpt.old: Bad file descriptor

Rename genefer.ckpt.new to genefer.ckpt: Bad file descriptor

Stat genefer.ckpt: Bad file descriptor

Remove genefer.ckpt.old: Bad file descriptor





Thanks, I'm looking into it.



- Iain

____________

Twitter: IainBethune

Proud member of team "Aggie The Pew". Go Aggie!

3073428256125*2^1290000-1 is Prime!



I'm not sure why OCL5 is on there; that might be a mistake. Iain will give you a definitive answer later on (it's the middle of the night in his part of the world.)





OCL4 doesn't work on Mac/ATI, and I believe OCL5 might be faster than OCL4 on some cards (pre-Kepler nvidia cards?).





On another note -- everyone might want to hold off STARTING any new tests before I get a chance to confer with Iain. I've been testing other parts of the apps and have discovered some problems that will need fixing, so we might need to fix, build, and test everything again.



Yes - Please don't start any new tests. Continue to post results for anything that you currently have running. I will update when I have more info re: the problems that Mike found.



- Iain



____________

Twitter: IainBethune

Proud member of team "Aggie The Pew". Go Aggie!

3073428256125*2^1290000-1 is Prime!



Problem renaming genefer.ckpt.new to genefer.ckpt at i=1150976. Continuing processing anyway...



I added some debug statements so every time genefer checkpoints, you should see (if all is well):

Closing genefer.ckpt.new after write: No error
Stat genefer.ckpt: No error
Rename genefer.ckpt to genefer.ckpt.old: No error
Rename genefer.ckpt.new to genefer.ckpt: No error
Stat genefer.ckpt: No error
Remove genefer.ckpt.old: No error



Iain, successful execution of a function will NOT reset errno to zero/"No error". You must always check the return code of each function first; otherwise you'll get a mess referencing an error which happened a long time ago, as we've already seen a few posts back.
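A minimal C illustration of this point (the function name is illustrative, not genefer's actual code): errno is only meaningful immediately after a call has signalled failure through its return value, so the logging must branch on the return code first.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* errno is only valid right after a call reports failure via its return
 * value; a successful call is not required to reset it to zero. So the
 * code must branch on the return value first and read errno only on
 * failure. Illustrative sketch, not genefer's actual code. */
static int open_checkpoint(const char *path, FILE **out)
{
    *out = fopen(path, "rb");
    if (*out == NULL) {
        /* failure path: errno describes THIS call */
        fprintf(stderr, "Open %s: %s\n", path, strerror(errno));
        return -1;
    }
    /* success path: do not report errno, it may hold a stale value */
    return 0;
}
```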



I had the same rename problem, but only once.

Command line: projects/www.primegrid.com/geneferocl4_windows.exe -boinc -q 1471930^262144+1 --device 0
Priority change succeeded.
Using OCL4 transform
Running on platform 'NVIDIA CUDA', device 'GeForce GTX 750 Ti', version 'OpenCL 1.2 CUDA' and driver '359.06'.
Resuming 1471930^262144+1 from a checkpoint (4980735 iterations left)
Estimated time remaining for 1471930^262144+1 is 0:47:54
Problem renaming genefer.ckpt to genefer.ckpt.old at i=4849664. Continuing processing anyway...
Problem renaming genefer.ckpt.new to genefer.ckpt at i=4849664. Continuing processing anyway...
1471930^262144+1 is complete. (1616875 digits) (err = 0.0000) (time = 0:48:04) 20:22:51
20:22:52 (2820): called boinc_finish

Note that it happened when the task was restarted after a system crash, so the checkpoint file may have been corrupted. After these two messages, it checkpointed normally, and the task finally validated. Full log: http://www.primegrid.com/result.php?resultid=693668117



Roman

Is there any benefit to continuing to run the longer tests, given that they'll need to be re-run? I've got 5 running now which have between 50 and 180 hours left.





I'm not sure why OCL5 is on there; that might be a mistake. Iain will give you a definitive answer later on (it's the middle of the night in his part of the world.)





OCL4 doesn't work on Mac/ATI, and I believe OCL5 might be faster than OCL4 on some cards (pre-Kepler nvidia cards?).





On another note -- everyone might want to hold off STARTING any new tests before I get a chance to confer with Iain. I've been testing other parts of the apps and have discovered some problems that will need fixing, so we might need to fix, build, and test everything again.



Yes - Please don't start any new tests. Continue to post results for anything that you currently have running. I will update when I have more info re: the problems that Mike found.



- Iain







Iain, successful execution of a function will NOT reset errno to zero/"No error". You must always check the return code of each function first; otherwise you'll get a mess referencing an error which happened a long time ago, as we've already seen a few posts back.



...

Note that it happened when the task was restarted after a system crash, so the checkpoint file may have been corrupted. After these two messages, it checkpointed normally, and the task finally validated. Full log: http://www.primegrid.com/result.php?resultid=693668117



Roman



Yes, in the normal version of the code, the messages are printed based on the return code, and if subsequent checkpoints are completed OK, everything is fine.



In the debug version, while it's not the cleanest code, at least we learn something useful about the first cause of the failure. And as the repeated messages indicate, it's not a 'one-off', at least in Andrew's case.



- Iain



____________

Twitter: IainBethune

Proud member of team "Aggie The Pew". Go Aggie!

3073428256125*2^1290000-1 is Prime!

This is on Windows with the modified file; it loops the error.



> genefer_windows64.exe -q "1927034^65536+1"

genefer 3.3.0-debug (Windows/CPU/64-bit)



Closing genefer.ckpt.new after write: Bad file descriptor

Stat genefer.ckpt: Bad file descriptor

Rename genefer.ckpt to genefer.ckpt.old: Bad file descriptor

Rename genefer.ckpt.new to genefer.ckpt: Bad file descriptor

Stat genefer.ckpt: Bad file descriptor

Remove genefer.ckpt.old: Bad file descriptor





Thanks, I'm looking into it.



- Iain



Andrew, there is a new debug binary posted at http://www.epcc.ed.ac.uk/~ibethune/files/genefer330debug.zip

It doesn't solve the problem but might help to clarify the error via some better debugging (thanks to stream). Please retry the same test again.



- Iain



____________

Twitter: IainBethune

Proud member of team "Aggie The Pew". Go Aggie!

3073428256125*2^1290000-1 is Prime!

Is there any benefit to continuing to run the longer tests, given that they'll need to be re-run? I've got 5 running now which have between 50 and 180 hours left.







Yes, the changes I'm making are quite localised and won't require re-running everything. When I get new binaries out, you can ctrl-C those tests and restart with the new binary.



- Iain



____________

Twitter: IainBethune

Proud member of team "Aggie The Pew". Go Aggie!

3073428256125*2^1290000-1 is Prime!

How big should the genefer.ckpt file be getting on gfn22?



I haven't noticed anything this big before, but I wasn't necessarily looking, either.



03/03/2016 02:07 PM 40,888,256 genefer.ckpt







How big should the genefer.ckpt file be getting on gfn22?



I haven't noticed anything this big before, but I wasn't necessarily looking, either.



03/03/2016 02:07 PM 40,888,256 genefer.ckpt



That big. :)



The checkpoint file's size is directly proportional to the size of the number being checked.





____________

My lucky number is 75898^524288+1

How big should the genefer.ckpt file be getting on gfn22?



I haven't noticed anything this big before, but I wasn't necessarily looking, either.



03/03/2016 02:07 PM 40,888,256 genefer.ckpt



That big. :)



The checkpoint file's size is directly proportional to the size of the number being checked.







Yes, that's about right. Besides some metadata, the checkpoint stores both the GFN being tested and the current state of the Fermat test sequence, i.e. it contains all of the current state of the calculation.
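A quick back-of-envelope check of the size reported above. This assumes the file is dominated by per-element state, which is an assumption about the layout, not a statement of genefer's actual on-disk format.

```c
/* Divide the observed checkpoint size by the transform length: for the
 * GFN-22 figures quoted above (40,888,256 bytes, n = 2^22 = 4194304)
 * this comes out at roughly 9.75 bytes per element, i.e. on the order
 * of one double per element plus metadata, scaling linearly with n. */
static double bytes_per_element(long observed_bytes, long transform_len)
{
    return (double)observed_bytes / (double)transform_len;
}
```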



- Iain

____________

Twitter: IainBethune

Proud member of team "Aggie The Pew". Go Aggie!

3073428256125*2^1290000-1 is Prime!

The windows/amd ocl5 test for 64996^4194304+1 hit maxerr and aborted. This may have been my fault as I was running this at the same time as the same test using plain ocl, each on one of the cores of a 7990, but I neglected to launch each from a separate directory.



I have started each from scratch, each in their own directory so there isn't any file contention.

The windows/amd ocl5 test for 64996^4194304+1 hit maxerr and aborted. This may have been my fault as I was running this at the same time as the same test using plain ocl, each on one of the cores of a 7990, but I neglected to launch each from a separate directory.



I have started each from scratch, each in their own directory so there isn't any file contention.



That is odd; it should be well within the b limit for OCL5… I've started running it on my Mac/AMD, but it will take about a week.



Let me know how the re-run goes.



- Iain

____________

Twitter: IainBethune

Proud member of team "Aggie The Pew". Go Aggie!

3073428256125*2^1290000-1 is Prime!

So far, so good, and it's about halfway through the unit on OCL5, which is farther than it got before it errored out. I am still convinced it was my fault: running 2 manual instances in the same directory, with contention over the checkpoint file.





However, the run time seems a bit long. This machine has completed (and validated) 30 GFN-22 units since being brought online on December 15. The longest run time for any of those was 59 1/3 hours. The estimated time seems spot on, and the workload on the box is identical to its standard workload, except that the n=22 units are launched from the command line rather than by BOINC.



...

Estimated time remaining for 64996^4194304+1 is 58:07:27

64996^4194304+1 is composite. (RES=80849c1616f3bdf9) (20186710 digits) (err = 0.0076) (time = 67:14:50) 03:42:14



Not sure what to say here. If the estimate is about as expected, this indicates that the code was running at the expected speed near the start of the calculation. If the final time was longer, then either the time reported is wrong (or does it appear to be OK?), or something happened during the run that was slowing the code down: either another process was trying to use the GPU, or enough CPU load that it starved the thread feeding the GPU. What timings do you get if you run an OCL benchmark (-b -x ocl) at the moment? If you restart that calculation again, what estimated time does it give?



- Iain

____________

Twitter: IainBethune

Proud member of team "Aggie The Pew". Go Aggie!

3073428256125*2^1290000-1 is Prime!

Benchmark gives



Command line: ..\geneferocl_windows.exe -b -x ocl



Priority change succeeded.

Priority change succeeded.

Generalized Fermat Number Bench

Running benchmarks for transform implementation "OCL"



Running on platform 'AMD Accelerated Parallel Processing', device 'Tahiti', version 'OpenCL 1.2 AMD-APP (1800.11)' and driver '1800.11 (VM)'.



2199064^8192+1 Time: 70.3 us/mul. Err: 0.2031 51956 digits

1798620^16384+1 Time: 66.3 us/mul. Err: 0.1875 102481 digits

1471094^32768+1 Time: 81 us/mul. Err: 0.1875 202102 digits

1203210^65536+1 Time: 89.5 us/mul. Err: 0.1719 398482 digits

984108^131072+1 Time: 114 us/mul. Err: 0.1787 785521 digits

804904^262144+1 Time: 384 us/mul. Err: 0.1719 1548156 digits

658332^524288+1 Time: 416 us/mul. Err: 0.1875 3050541 digits

538452^1048576+1 Time: 828 us/mul. Err: 0.1641 6009544 digits

440400^2097152+1 Time: 1.57 ms/mul. Err: 0.1738 11836006 digits

360204^4194304+1 Time: 3.05 ms/mul. Err: 0.1562 23305854 digits

294612^8388608+1 Time: 6.1 ms/mul. Err: 0.1592 45879398 digits

Genefer Mark = 131.

Priority change succeeded.



What is somewhat interesting is that the initial estimate is what it "should" be, but the immediately reported time remaining is closer to what winds up being the run time.



Starting initialization...

Initialization complete (105.942 seconds).

Estimated time remaining for 64996^4194304+1 is 57:33:33

Testing 64996^4194304+1... 67043328 steps to go (70:08:39 remaining)



This



http://www.primegrid.com/show_host_detail.php?hostid=504933



is the machine in question.



What is somewhat interesting is that the initial estimate is what it "should" be, but the immediately reported time remaining is closer to what winds up being the run time.



Starting initialization...

Initialization complete (105.942 seconds).

Estimated time remaining for 64996^4194304+1 is 57:33:33

Testing 64996^4194304+1... 67043328 steps to go (70:08:39 remaining)





In the next build, I will increase the number of iterations run before the estimate is made, which should make it more accurate. I previously boosted the number of iterations for smaller n, with the side effect of reducing it for larger n; now there is a minimum (even for large n). I put a beta build up at http://www.epcc.ed.ac.uk/~ibethune/files/geneferocl_macintel.zip - let me know if it helps any?
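The estimate described here amounts to a linear extrapolation after a minimum number of timed iterations. A sketch (MIN_EST_ITERS is a made-up name; the real constant and logic live in the genefer source):

```c
/* Linear run-time extrapolation with a minimum sample size, as described
 * above: time at least MIN_EST_ITERS iterations, then scale the elapsed
 * time by the remaining fraction. MIN_EST_ITERS is an illustrative name,
 * not the constant genefer actually uses. */
#define MIN_EST_ITERS 1000L

static double estimate_remaining_s(double elapsed_s, long done, long total)
{
    if (done < MIN_EST_ITERS)
        return -1.0;   /* too few samples: no estimate yet */
    return elapsed_s / (double)done * (double)(total - done);
}
```

A too-small sample makes the estimate hostage to startup noise, which is why raising the minimum should help the large-n cases reported above.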



- Iain



____________

Twitter: IainBethune

Proud member of team "Aggie The Pew". Go Aggie!

3073428256125*2^1290000-1 is Prime!

Well, that certainly looks more accurate



Testing 64996^4194304+1...

Using OCL transform



Running on platform 'Apple', device 'ATI Radeon HD Tahiti XT Prototype Compute Engine', version 'OpenCL 1.2 ' and driver '1.2 (Dec 17 2015 21:00:16)'.



Starting initialization...

Initialization complete (55.661 seconds).

Estimated time remaining for 64996^4194304+1 is 74:41:37

Testing 64996^4194304+1... 67043328 steps to go (74:22:56 remaining)



Any chance you could throw a Windows version out there? The system which was producing the slower-than-expected results was one of those.



The mac producing the above output has historically completed 22 units in no greater than 74.5 hours, so that looks pretty spot on.



Would there be any objection to testing "live" via boinc/app_info on the suspect machine?



Any chance you could throw a Windows version out there? The system which was producing the slower-than-expected results was one of those.





Here you go: http://www.epcc.ed.ac.uk/~ibethune/files/geneferocl_windows.zip





Would there be any objection to testing "live" via boinc/app_info on the suspect machine?



Go for it!

____________

Twitter: IainBethune

Proud member of team "Aggie The Pew". Go Aggie!

3073428256125*2^1290000-1 is Prime!

I've been testing the combined apps on GFN-20 for some time, both the old 3.2.10-dev and the new 3.3.0. With 3.2.10-dev, many recent tasks (e.g., 691921583) can still be successfully completed using the OCL transform, which is significantly faster than OCL4 on my AMD R9-280X. Genefer 3.3.0, however, just notes that "OCL transform is past its b limit." (e.g., 694560388) and goes ahead using OCL4. So there appears to be some more room for improvement in the b-limit logic.
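The selection rule being discussed amounts to: of the transforms whose b limit covers the candidate, pick the fastest. A sketch of that logic (the struct and all numbers in the test values are invented for illustration; genefer's real limits and tables differ):

```c
#include <stddef.h>

/* Pick the fastest transform whose b limit still covers the candidate's b.
 * Illustrative sketch of the selection rule under discussion; genefer's
 * real b-limit tables and logic differ. */
typedef struct {
    const char *name;
    long b_limit;        /* largest b this transform can test safely */
    double us_per_mul;   /* benchmark speed; lower is faster */
} transform_t;

static const char *pick_transform(const transform_t *t, int count, long b)
{
    const char *best = NULL;
    double best_speed = 0.0;

    for (int i = 0; i < count; i++) {
        if (b > t[i].b_limit)
            continue;                     /* past this transform's limit */
        if (best == NULL || t[i].us_per_mul < best_speed) {
            best = t[i].name;
            best_speed = t[i].us_per_mul;
        }
    }
    return best;   /* NULL: b is past every transform's limit */
}
```

If the b limit recorded for the fast transform is too conservative, candidates it could still handle fall through to a slower one, which is the behaviour reported here.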



Any chance you could throw a Windows version out there? The system which was producing the slower-than-expected results was one of those.





Here you go: http://www.epcc.ed.ac.uk/~ibethune/files/geneferocl_windows.zip







So the Windows box still exhibits the issue(s), although to a lesser degree. I'm also still not thrilled with the time estimate, so I want to do some "real world" runs to compare performance against historical results.



Testing 64996^4194304+1...

Using OCL transform



Running on platform 'AMD Accelerated Parallel Processing', device 'Tahiti', version 'OpenCL 1.2 AMD-APP (1800.11)' and driver '1800.11 (VM)'.



Starting initialization...

Initialization complete (105.864 seconds).

Estimated time remaining for 64996^4194304+1 is 60:26:14

Testing 64996^4194304+1... 66977792 steps to go (65:51:51 remaining)



Any chance you could throw a Windows version out there? The system which was producing the slower-than-expected results was one of those.





Here you go: http://www.epcc.ed.ac.uk/~ibethune/files/geneferocl_windows.zip





Would there be any objection to testing "live" via boinc/app_info on the suspect machine?



Go for it!

This makes no sense... on my A6-3500's iGPU, I threw 10000^8192+1 at the different transforms. OCL3 says it'll take 21 s, OCL4 says 18 and OCL5 says 20, yet actually running the tests ended up with runtimes of 27/26/26.



Let's try a higher n, 16384. Maybe the gap will be more noticeable. Estimates: 1:15 / 0:58 / 1:04. Actual runtimes: 1:26 / 1:10 / 1:14. I guess it's good that the best transform is still obvious, but the difference is still quite big.





Those 2 were done with only the iGPU running; no CPU tasks were stealing the precious RAM bandwidth. So how does it change when it has to compete for resources, say, with some ESP? Keeping in mind that at such low n the app DOES use almost a full CPU core, I ran it alongside 2 ESP tasks. Results were as follows:

-N=8192, OCL3/4/5 estimates 0:24/0:20/0:22, which translates to 29/29/29 (surprise!)

-N=16384, OCL3/4/5 estimates 1:33/1:06/1:12, which translates to 1:42/1:20/1:23



Let's turn it up a notch and see what happens with 3 ESPs alongside, so OCL has to fight for both CPU and RAM:

-N=8192, OCL3/4/5 estimates 0:30/0:26/0:28, which translates to 0:59/1:01/0:57 (wut!?!?)

-N=16384, OCL3 estimates 1:30. Funnily enough, the very first "testing, (X remaining)" said 2:01 remaining. It turns out 2:11 was the final runtime, so that was much more accurate. For OCL4, the estimate was 1:21, the first "testing, (X remaining)" was 1:47, and the final time was 2:09 - not accurate at all, though still better than the first impression. But OCL5 is what really baffles me: 1:24 as the first estimate, 1:28 as the "testing" estimate, and 2:02 as the final time.



Am I the only one who finds it interesting that, even though OCL4 is the best transform, OCL5 ends up being much faster when competing for CPU time?

This is on Windows with the modified file; it loops the error.



> genefer_windows64.exe -q "1927034^65536+1"

genefer 3.3.0-debug (Windows/CPU/64-bit)



Closing genefer.ckpt.new after write: Bad file descriptor

Stat genefer.ckpt: Bad file descriptor

Rename genefer.ckpt to genefer.ckpt.old: Bad file descriptor

Rename genefer.ckpt.new to genefer.ckpt: Bad file descriptor

Stat genefer.ckpt: Bad file descriptor

Remove genefer.ckpt.old: Bad file descriptor





Thanks, I'm looking into it.



- Iain



Andrew, there is a new debug binary posted at http://www.epcc.ed.ac.uk/~ibethune/files/genefer330debug.zip

It doesn't solve the problem but might help to clarify the error via some better debugging (thanks to stream). Please retry the same test again.



- Iain







I believe it may have been solved



genefer_windows64.exe -q "157476^65536+1"

genefer 3.3.0-debug2 (Windows/CPU/64-bit)

Supported transform implementations: fma3 avx-intel sse4 sse2 default x87

Copyright 2001-2016, Yves Gallot

Copyright 2009, Mark Rodenkirch, David Underbakke

Copyright 2010-2012, Shoichiro Yamada, Ken Brazier

Copyright 2011-2014, Michael Goetz, Ronald Schneider

Copyright 2011-2016, Iain Bethune

Genefer is free source code, under the MIT license.



Command line: genefer_windows64.exe -q 157476^65536+1



Priority change succeeded.



Testing 157476^65536+1...

Using FMA3 transform



The checkpoint doesn't match current test: 157476^65536+1 != 1927034^65536+1. Current test will be restarted

Starting initialization...

Initialization complete (0.040 seconds).

Estimated time remaining for 157476^65536+1 is 0:05:08

157476^65536+1 is composite. (RES=9f64b3f0d545615c) (340605 digits) (err = 0.0032) (time = 0:05:26) 12:20:33

It will take a few hours (days) before it finishes completely, but I started an n=22 unit in BOINC on my 290X machine. Like the 7990 machine, performance seems to be a bit off.



With 55 min, 25 sec complete, the 290X is 1.184% done, which I believe equates to a 78-hour or so run time. Out of a sample of 11, the longest previous n=22 run time on this machine was 71.4 hours.



Once the last (manual) test unit on the 7990 finishes, I'll fire up an app_info test on it as well. I have two boxes with identical hardware, so I can run one with the test ocl and one with production at the same time, with identical workloads.



Interesting that the performance hit doesn't seem to affect the Mac.



What do I need to do to log the executable's transform selection whilst running from Boinc?



Closing genefer.ckpt.new after write: Bad file descriptor

Stat genefer.ckpt: Bad file descriptor

Rename genefer.ckpt to genefer.ckpt.old: Bad file descriptor

Rename genefer.ckpt.new to genefer.ckpt: Bad file descriptor

Stat genefer.ckpt: Bad file descriptor

Remove genefer.ckpt.old: Bad file descriptor





I believe it may have been solved







Thanks - we'll keep an eye out for it in future though!

____________

Twitter: IainBethune

Proud member of team "Aggie The Pew". Go Aggie!

3073428256125*2^1290000-1 is Prime!



This makes no sense....



Thanks for investigating... Assuming you did your latest tests using the 3.3.0-debug2 version, the 'Estimated Time Remaining' code is essentially the same as in the previous release, so these should be as accurate as they have been in the previous standalone apps - can you check if you see the same behaviour with those as with the current app, manually selecting the transform?



- Iain







____________

Twitter: IainBethune

Proud member of team "Aggie The Pew". Go Aggie!

3073428256125*2^1290000-1 is Prime!



What do I need to do to log the executable's transform selection whilst running from Boinc?



If I understand the question: to see which transform was selected, you can just look at stderr.txt in the BOINC dir/slots/*/, or, once the task is completed, on the task page on the website.



Cheers



- Iain

____________

Twitter: IainBethune

Proud member of team "Aggie The Pew". Go Aggie!

3073428256125*2^1290000-1 is Prime!

As of this morning, 20.166% done after 14:30 hours, so I should be looking at around a 71.9-hour run time, which is pretty close to what I would have expected from this machine.



And from stderr.txt, the initial time estimate



Testing 65946^4194304+1...

Estimated time remaining for 65946^4194304+1 is 71:51:18



For that size task, that is quite accurate.



[edit] The initial startup time before the task actually hit the GPU was nearly 6 minutes; I suspect that will have an impact on any "early" estimates. I actually thought I had something misconfigured until I just let it run. That startup time was 1.5 minutes on an n=21 task.[/edit]



[edit2] And by early estimates, I mean manual estimates based on percent completed and time run. [/edit2]



It will take a few hours (days) before it finishes completely, but I started an n=22 unit in BOINC on my 290X machine. Like the 7990 machine, performance seems to be a bit off.



With 55 min, 25 sec complete, the 290X is 1.184% done, which I believe equates to a 78-hour or so run time. Out of a sample of 11, the longest previous n=22 run time on this machine was 71.4 hours.



Once the last (manual) test unit on the 7990 finishes, I'll fire up an app_info test on it as well. I have two boxes with identical hardware, so I can run one with the test ocl and one with production at the same time, with identical workloads.



Interesting that the performance hit doesn't seem to affect the Mac.



What do I need to do to log the executable's transform selection whilst running from Boinc?



This makes no sense....



Thanks for investigating... Assuming you did your latest tests using the 3.3.0-debug2 version, the 'Estimated Time Remaining' code is essentially the same as in the previous release, so these should be as accurate as they have been in the previous standalone apps - can you check if you see the same behaviour with those as with the current app, manually selecting the transform?



- Iain







Yeah, it was done with debug2. Running the regular 1076 build available for download, we get the following results:



Format: OCL3/4/5, estimate -> actual, (1076) VS [debug2]



No CPU running:

-n=13: (21/17/19 -> 27/25/25) VS [21/18/20 -> 27/26/26].

-n=14: (1:15/0:59/1:04 -> 1:26/1:10/1:15) VS [1:15 / 0:58 / 1:04 -> 1:26 / 1:10 / 1:14]

Virtually the same results with both apps, both in speed and estimates, +/-1s margin of error.



With 2 CPU instances running. Here, 1076 was fighting PPS MEGA for CPU time, while debug2 was fighting ESP, so results could differ significantly. Let's see if they do:

-n=13: (25/22/21 -> 33/31/27) VS [24/20/22 -> 29/29/29]

-n=14: (1:21/1:10/1:14 -> 1:33/1:19/1:26) VS [1:33/1:06/1:12 -> 1:42/1:20/1:23]

Okay, what's going on... We know from the no-CPU test that the estimates and runtimes should be consistent. But here, the numbers are all over the place. Using 1s as the margin of error: at n=13 the estimates are about the same, but 1076 is slower; at n=14 the estimates are sometimes longer, sometimes shorter, and the same goes for the actual compute times.



With 3 CPUs at once. Again, 1076 with MEGA and debug2 with ESP:

-n=13: (28/25/24 -> 48/46/44) VS [30/26/28 -> 59/1:01/57]

-n=14: (1:25/1:13/1:17 -> 1:51/1:42/1:48) VS [1:30/1:21/1:24 -> 2:11/2:09/2:02]

I give up. It seems OCL5 is the better actual transform... until I got to 3 CPU + MEGA, where OCL4 won (as it should). The only takeaway from this is the good old "you can't see which one is the best unless you try them... but then you already did the calculation anyway, so what's the point of re-doing it to see which one will be the fastest?"

For a phalanx of pending and validated units on windows/amd

n=19

http://www.primegrid.com/results.php?hostid=502810&offset=0&show_names=0&state=0&appid=27

n=18

http://www.primegrid.com/results.php?hostid=502810&offset=0&show_names=0&state=0&appid=26

17low

http://www.primegrid.com/results.php?hostid=502810&offset=0&show_names=0&state=0&appid=24

n=16

http://www.primegrid.com/results.php?hostid=502810&offset=0&show_names=0&state=0&appid=23



No errors, no invalid tasks.



This:

http://www.primegrid.com/result.php?resultid=696057063

is the in progress n=22 unit.



These were running the original test executable as of 2 March, and the update Iain provided as of the early morning (EST) of 8 March, when I switched to the n=22 unit, so these are likely all on the original test app.



The only observation is that run times are all over the place, with maxerr exceeded on a number of units. http://www.primegrid.com/result.php?resultid=693523497 is a good example.

This will teach me to type before coffee. Apologies Iain.




This is odd. This is using the binary from the 7 March post. Retrying on a 980.



emil:Downloads vzimmerman$ ./geneferocl_macintel -d 1 -q "64996^4194304+1" -x ocl4

geneferocl 3.3.0-1 (Apple-x86/OpenCL/64-bit)

Supported transform implementations: ocl ocl3 ocl4 ocl5

Copyright 2001-2016, Yves Gallot

Copyright 2009, Mark Rodenkirch, David Underbakke

Copyright 2010-2012, Shoichiro Yamada, Ken Brazier

Copyright 2011-2014, Michael Goetz, Ronald Schneider

Copyright 2011-2016, Iain Bethune

Genefer is free source code, under the MIT license.



Command line: ./geneferocl_macintel -d 1 -q 64996^4194304+1 -x ocl4



Priority change succeeded.



Testing 64996^4194304+1...

Using OCL4 transform



Running on platform 'Apple', device 'GeForce GTX 770', version 'OpenCL 1.2 ' and driver '10.9.14 346.03.05f02'.



Starting initialization...

Initialization complete (65.241 seconds).

Estimated time remaining for 64996^4194304+1 is 117:47:21

Testing 64996^4194304+1... 65536000 steps to go (113:20:05 remaining)

maxErr exceeded for 64996^4194304+1, 1.0000 > 0.4500

Errors occurred for all available transform implementations

BOINC tests on the machine which was having the curiously long OCL run times have commenced: a Q9550 with 3 SGS units and 2 n=22 units running.

http://www.primegrid.com/result.php?resultid=696455063

http://www.primegrid.com/result.php?resultid=696445078



The benchmark phase of the startup really clobbers the CPU, and for a long time, particularly when it is trying to initialize 2 units:



Command line: projects/www.primegrid.com/geneferocl_windows.exe -boinc -q 65990^4194304+1 --device 1



Priority change succeeded.

A benchmark is needed to determine best transform, testing available transform implementations...

Testing OCL transform...



Running on platform 'AMD Accelerated Parallel Processing', device 'Tahiti', version 'OpenCL 1.2 AMD-APP (1800.11)' and driver '1800.11 (VM)'.



Testing OCL3 transform...

Testing OCL4 transform...

Testing OCL5 transform...

Benchmarks completed (675.544 seconds).

Using OCL transform

Starting initialization...

Initialization complete (127.167 seconds).

Testing 65990^4194304+1...

Estimated time remaining for 65990^4194304+1 is 58:17:01

This 770 is suspect. I'm not sure this outcome means anything.








Thanks, I'll ignore the previous result then :)


Yeah, I beat on it and it produced some strange results. I'm not sure if the card itself has a problem, or if the total power draw of it and the GT640 in the Mac Pro it's running in is causing it to flake out. I don't have the telemetry on a Mac that I would have with Afterburner or something like that.



I have relegated it to sieving. Genefer units on it wind up drawing 30-40 watts more power.






Hi all,



Sorry I've been hiding under a rock for the last few days due to some work, home and illness pressures! Normal service to be resumed shortly. Of the couple of issues on my todo list, the benchmark and estimate times are first on the list. Excuse the stream-of-consciousness post here, but I hope it will help to explain the previous results (thanks to Rafael and VZ for the detailed testing).



The digit length of a GFN b^2^n+1 is roughly l = log(b)*2^n .
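A quick worked example of that formula (base-10 log, since l is a decimal digit count), using the n=22 candidate from the logs earlier in the thread:

```python
import math

# Approximate decimal digit count of a GFN b^(2^n) + 1: l ~ log10(b) * 2^n
b, n = 65946, 22
l = math.log10(b) * 2**n
print(f"{b}^{2**n}+1 has roughly {l:,.0f} digits")  # ~20.2 million digits
```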



The component parts of a test:



* The initialisation, which computes the entire GFN as a large integer array, using a long-multiplication which is O(l^2) for small l and roughly O(l^1.585) for large l.



* The main test, which performs a certain number of squarings. Each squaring is O(l*log(l)) and the number of iterations is O(l).



This leads to three observations:



* The initialisation is O(l^2) or less and the calculation time is O(l^2*log(l)), so the initialisation will become less and less significant the larger the number.



* The time per iteration should approximately double when n is increased by 1, since l is proportional to 2^n and 2^(n+1)/2^n = 2 (ignoring the log term).



* The time for the main test increases by a factor of 4 when n is increased by 1 since each iteration costs twice as much and there are twice as many iterations.
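A numeric sanity check on those observations (a sketch; the choice of b is arbitrary and the constant factors are ignored):

```python
import math

# Main-test cost scales as (per-iteration cost) * (iteration count)
# = O(l * log l) * O(l) = O(l^2 * log l), with l proportional to 2^n.
def main_test_cost(n, b=65536):
    l = math.log10(b) * 2**n   # digit length
    return l**2 * math.log(l)

# Going from n=21 to n=22 should cost roughly 4x (the log factor drifts slowly):
print(main_test_cost(22) / main_test_cost(21))  # a bit over 4
```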



How the number of iterations for a benchmark is determined:



* n <= 14 - i = 2^n

* n >= 15 - i = 2^(28-n) i.e. 64 iterations for n=22, 128 for n=21 etc...

* So the total amount of time for the benchmark should be roughly constant for large n.

* There is also a maximum time limit for a single benchmark, set at 10 minutes, to avoid e.g. benchmarking at large n with x87 taking excessively long
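The iteration-count rule above, as a sketch (the function name is mine, not from the actual Genefer source):

```python
def bench_iterations(n: int) -> int:
    # n <= 14: i = 2^n; n >= 15: i = 2^(28-n), per the rule above
    return 2**n if n <= 14 else 2**(28 - n)

print(bench_iterations(22))  # 64
print(bench_iterations(21))  # 128
```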



What benchmarks are available and what they measure:



* -b benchmark runs at high priority. Runs a few (24) iterations to prepare CPU/GPU cache. Does not compute the initialisation, and reports the time per iteration. i.e. it measures the maximum speed achievable by a particular transform iteration



* -b3 benchmark runs at high priority. Computes the initialisation and some iterations. Returns the time taken for the iterations (ignoring the intialisation) and projects the time to completion according to: totalNumberOfIterations * measuredTime / benchmarkIterations - i.e. a linear extrapolation of the benchmark



* Transform selection benchmark runs at normal priority (i.e. representative of the runtime environment). Uses the same code as -b3. Really, this should skip the initialisation since it is common to all transforms, and there is no need to run it several times. I will fix this, as it wastes a lot of time.
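The -b3 projection formula, sketched with hypothetical timings (the 65,536,000 step count is taken from the n=22 log earlier in the thread; the 0.25 s benchmark time is invented, and the function name is mine):

```python
def project_total_seconds(total_iterations, bench_seconds, bench_iterations):
    # Linear extrapolation: totalNumberOfIterations * measuredTime / benchmarkIterations
    return total_iterations * bench_seconds / bench_iterations

# 64 benchmark iterations (n=22) hypothetically taking 0.25 s in total:
hours = project_total_seconds(65_536_000, 0.25, 64) / 3600
print(f"{hours:.1f} h")  # ~71.1 h
```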



What timings are printed and what they count:



* "Estimated time remaining for b^N+1 is ...". Printed after same number of iterations as normal benchmark (or at latest 10 minutes after initialisation completed). Based on elapsedTime*iterationsRemaining / iterationsSoFar i.e. linear extrapolation. I discovered a bug here, if the timer limit is hit rather than estimation based on the set number of iterations, but this wouldn't have been the case for any of the large tests you ran...



* "(XXXX remaining)". Printed every time a checkpoint is written. Same method as for the initial estimate.



* Final time display "(time = )". Simply based on wall clock time elapsed during the entire test.



So in summary, I've fixed a few things, but still have no real idea why the initial estimate should be off…



New builds to follow in the next day or so, hopefully!



- Iain


* The initialisation, which computes the entire GFN as a large integer array, using a long-multiplication which is O(l^2) for small l and roughly O(l^1.585) for large l.

Why not use the same multiplication as the main one (l.log(l)) for the initialization as well?


The initialisation computes a base-2 representation. The transform radix is b.



The initialisation time is negligible with respect to the primality test.






Will the current method be faster than computing b^2^N using the main FFT, and then converting the radix-b balanced representation to base-2?



Side question: Does the init logic take advantage of the fact that b=c.2^s, so we can compute c^2^N and left shift it to get the full number?

Yves Gallot

Volunteer developer / Project scientist

Joined: 19 Aug 12, Posts: 577, ID: 164101, Credit: 304,715,793, RAC: 0

Message 93094 - Posted: 14 Mar 2016 | 11:26:09 UTC - in response to Message 93090.

Last modified: 14 Mar 2016 | 11:26:57 UTC

Will the current method be faster than computing b^2^N using the main FFT, and then converting the radix-b balanced representation to base-2?

I don't think that it can be faster than N squaring.



Side question: Does the init logic take advantage of the fact that b=c.2^s, so we can compute c^2^N and left shift it to get the full number?

Yes, the representation is a "floating integer" m*2^s (m is odd). Just m is computed and stored in memory.
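A sketch of that "floating integer" factoring (the function name and the small example are mine, purely illustrative):

```python
def floating_integer(x: int):
    # Split x into (m, s) with x == m * 2**s and m odd; only m need be stored.
    s = 0
    while x % 2 == 0:
        x //= 2
        s += 1
    return x, s

m, s = floating_integer(65946**4)  # b = 65946 = 2 * 32973, so s = 4 here
print(s, m % 2)
```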



The only observation is that run times are all over the place, with maxerr exceeded on a number of units. http://www.primegrid.com/result.php?resultid=693523497 is a good example.



This is a bit odd. On the plus side, the error detection and handling is working nicely, and all your results are valid. However, OCL4 should not be giving errors at all, since the transform is deterministic. Can you run the tests ( -t -x ocl4 and -r -x ocl4) on that machine?



- Iain


Running these individually resulted in err=0 for everything through n=19.



However, and a big however



The units I ran with valid results were run 2 at a time on the GPU, as historically one unit alone wouldn't peg the GPU (290x) for OCL3/OCL4.



If I run 2 instances of -t (in separate directories) I hit maxerr on the third or fourth test unit.



I ran one instance of -t and one instance of -r at the same time, and my video driver crashed.



The card has otherwise been rock solid, so I don't think it is suspect.



I've got a second WR unit finishing up in an hour or so, after that I'll put it on 17mega, 18, and 19, running one unit at a time, to see if I get the same results.



[edit]The same does _not_ occur on the 7990 system I was also using to test. Both test runs coexisted peacefully. Time to go check drivers. Sigh.[/edit]






Tasks starting to come in. Clean runs so far. March 14th and newer are the "single" tasks.



18 here:

http://www.primegrid.com/results.php?hostid=502810&offset=0&show_names=0&state=0&appid=26



19 here:

http://www.primegrid.com/results.php?hostid=502810&offset=0&show_names=0&state=0&appid=27



2nd wr unit:

http://www.primegrid.com/result.php?resultid=696700183 (note the error handling. Fallout 4 contention?)






Thanks for testing. Interesting that running multiple tests per card appears to break things. This would appear to point to a driver issue (esp since it works on your 7990 system).



- Iain









2nd wr unit:

http://www.primegrid.com/result.php?resultid=696700183 (note the error handling. Fallout 4 contention?)





Was this running with another task on the same card?







No.



I left it running when I fired up fallout, and later civ5.

So far it has finished 13 n=18s and 2 n=19s, all clean, no error fallback. 5-second variance on completion for the n=18s, which isn't bad :)






OK, odd that it gets such a huge roundoff error, and that there are later roundoff errors with OCL. I will keep an eye on this and see if it validates eventually...


New builds (labelled 3.3.0-1) are available:



Windows: https://www.assembla.com/spaces/genefer/subversion/source/HEAD/trunk/bin/windows



Linux: https://www.assembla.com/spaces/genefer/subversion/source/HEAD/trunk/bin/linux



Mac: https://www.assembla.com/spaces/genefer/subversion/source/HEAD/trunk/bin/mac



Please grab the new binaries and continue with the testing matrix:



https://docs.google.com/spreadsheets/d/1dsyr9e6iSlvVZZa4qQpuJ6QxBXuSrxL1qQIBL4OpI0Q/



I'm particularly interested to see:



* Re-runs of the windows CPU tests that showed problems related to checkpointing.



* Tests with AMD/ATI cards running on Linux.



* Tests with Linux and Mac on BOINC using app_info.xml



but all help is appreciated - if you have any questions about how to help, please post here!



- Iain


How are those linux binaries looking? I have a linux64/amd install on deck and ready to go.

And another hiccup in logic when running files!



Let's say you have 2 candidate files, A and B. And let's say you make a .bat with "genefer A" on one line and "genefer B" on the next, so that once it finishes the candidates in A, it'll immediately start the candidates in B.



Now, everything works fine until you get to B. Once it finishes candidate list A, the program doesn't delete the genefer file which stores the current line number. As a result, if A had 100 candidates, it'll start reading B from the 100th line instead of the first.



It's also a shame that it can only store 1 file name at a time, so if you have multiple files, it can't track the individual progress of each, meaning you risk skipping lines on either if you mix them up.





Not so much a problem as it is a complaint. I wasn't expecting the program to keep track of each individual file's progress, but I'd expect it to at least delete the previous line counter once it's finished, much like it deletes the checkpoint file once it finishes a candidate.



I get that the current way is better for single files (you can rename the file and it won't lose track of the progress), but I think it would be better if we had a <Filename> <line> format instead of the current one.

This hit me early on when I was trying to test on 2 gpus at the same time.



My quick and dirty workaround was to run from multiple (sub)directories.






That's what I did as well.



I get that the current way is better for single files (you can rename the file and it won't lose track of the progress), but I think it would be better if we had a <Filename> <line> format instead of the current one.



Yes, this is something I will add to my todo list for a future release. The functionality is not required for BOINC (which is the priority right now), and for those testing lists of candidates from files there are easy workarounds. In addition to running in two separate dirs, you can also delete the genefer.dat file when you start testing a new file.



Cheers



- Iain


How are those linux binaries looking? I have a linux64/amd install on deck and ready to go.



Waiting to hear from Ron about them... but he may be away as I haven't had a reply since I emailed a few days ago.




Linux binaries now available thanks to Ron: https://www.assembla.com/spaces/genefer/subversion/source/HEAD/trunk/bin/linux



- Iain


Thanks folks! Does anyone have a genuine 32 bit linux system to test on?



- Iain




2nd wr unit:

http://www.primegrid.com/result.php?resultid=696700183 (note the error handling. Fallout 4 contention?)





OK, odd that it gets such a huge roundoff error, and that there are later roundoff errors with OCL. I will keep an eye on this and see if it validates eventually...



I think whatever happened to the driver when you were playing Fallout killed this one. The wingman (who has a history of good results) came in and mismatches your residue. Can you run another one, ideally with a result already returned so it will validate quickly?



- Iain











I can, but it'll be Sunday. There was another unit that machine already ran which completed cleanly, but is still awaiting validation: http://www.primegrid.com/result.php?resultid=696057063



Thanks folks! Does anyone have a genuine 32 bit linux system to test on?



- Iain



If no-one else does, I can reinstall with 32-bit (albeit on 64-bit hardware) and re-run everything.








We're looking specifically for a 32 bit CPU.



32-bit linux installs on 64 bit hardware are easy to find. :)

____________

My lucky number is 75898^524288+1




I'm guessing 32-bit linux installs on 32-bit hardware with an AMD GPU that can run double precision/OCL are not exactly run-of-the-mill these days.








ANY GPU on a 32 bit CPU is REALLY rare. That's not what we're concerned about.



We want to make sure that the program was built with compiler options that permit it to run on a real 32 bit CPU that lacks even the SSE2 instruction set.




On this computer the genefer_linux32 version does not run properly.



./genefer_linux32 -b -x default

genefer 3.3.0-1 (Linux/CPU/32-bit)

Supported transform implementations: default x87

Copyright 2001-2016, Yves Gallot

Copyright 2009, Mark Rodenkirch, David Underbakke

Copyright 2010-2012, Shoichiro Yamada, Ken Brazier

Copyright 2011-2014, Michael Goetz, Ronald Schneider

Copyright 2011-2016, Iain Bethune

Genefer is free source code, under the MIT license.



Command line: ./genefer_linux32 -b -x default



Priority change succeeded.

Priority change failed (needs superuser privileges).

Generalized Fermat Number Bench

Running benchmarks for transform implementation "Default"

Illegal instruction





./genefer_linux32 -b -x x87

genefer 3.3.0-1 (Linux/CPU/32-bit)

Supported transform implementations: default x87

Copyright 2001-2016, Yves Gallot

Copyright 2009, Mark Rodenkirch, David Underbakke

Copyright 2010-2012, Shoichiro Yamada, Ken Brazier

Copyright 2011-2014, Michael Goetz, Ronald Schneider

Copyright 2011-2016, Iain Bethune

Genefer is free source code, under the MIT license.



Command line: ./genefer_linux32 -b -x x87



Priority change succeeded.

Priority change failed (needs superuser privileges).

Generalized Fermat Number Bench

Running benchmarks for transform implementation "x87 (80-bit)"

Illegal instruction



We've seen this before on this computer. I have notes from late September of 2015. I don't have notes on the outcome, though.

Yves Gallot

Volunteer developer / Project scientist

Message 93305 - Posted: 19 Mar 2016 | 9:37:55 UTC - in response to Message 93302.

Last modified: 19 Mar 2016 | 10:07:13 UTC

On this computer the genefer_linux32 version does not run properly.

The AMD Athlon XP is a Pentium III (SSE extensions but no SSE2).



Using Intel Software Development Emulator https://software.intel.com/en-us/articles/intel-software-development-emulator:



sde.exe -p4p -- genefer_windows32.exe -b -x default

Running benchmarks for transform implementation "Default"

6008024^256+1 Time: 68.2 us/mul. Err: 0.1562 1736 digits

OK on Pentium4 Prescott



sde.exe -p4 -- genefer_windows32.exe -b -x default

6008024^256+1 Time: TID 0 SDE-ERROR: Executed instruction not valid for specified chip (PENTIUM4): 0x45ee3a: fisttp dword ptr [esp+0x10], st0

NOK, doesn't run on P4 (and therefore not on Pentium III).



The FISTTP instruction (SSE3) is generated by the compiler even though we set -march=i686, the "Pentium Pro instruction set"...





We want to make sure that the program was built with compiler options that permit it to run on a real 32 bit CPU that lacks even the SSE2 instruction set.





Alas, the only 32-bit CPU I have is a T2400, which has SSE2.

On this computer the genefer_linux32 version does not run properly.

...

We've seen this before on this computer. I have notes from late September of 2015. I don't have notes on the outcome, though.



A (slightly) different problem to before, but I have now posted an updated binary to SVN: https://www.assembla.com/spaces/genefer/subversion/source/HEAD/trunk/bin/linux/genefer_linux32?_format=raw



The version number stays at 3.3.0-1 since there are no code changes, just an update to the build environment.



Cheers



- Iain


Kind of a silly question, but during all this testing I was wondering if any of the tests had been done on an overclocked system, just to see if the overclock/heating issue might have magically disappeared on both CPU and GPU platforms.



Cheers


I don't really think it'll ever be solved. It's just like LLR: faster transforms generate more heat. Up to 1076 (which is what I've tested), the heat and speed are both very high.

Kind of a silly question, but during all this testing I was wondering if any of the tests had been done on an overclocked system just to see if the overclock/heating issue might have magically disappeared on both cpu and gpu platforms.



Cheers



That's not the purpose of the testing. This is what's called "regression testing", checking all existing features to make sure nothing was inadvertently broken.



The "overclocking problem" as you put it isn't actually a problem. The reason anyone can overclock at all is because parts are built to slightly better than advertised specifications to assure that, due to manufacturing variability, even the "bad" parts are good enough to meet advertised specifications. That implies that the "good" parts may be somewhat better.



How good is "better"? It depends, in part, on what you're trying to do with the part. What we do is extremely rough on the electronics, consuming lots of power and generating lots of heat, so it's expected that it would be harder to overclock when running this kind of software. It's not a problem -- it's normal.



The difficulty with overclocking is directly tied to how efficient the programs are. The faster we make the software, the rougher it is on the electronics, and the harder it is to overclock.



Given the implications of the last sentence, I would expect overclocking to be LESS viable with new releases, not more. Optimizing software means you try to make it utilize as much of the available processing power as is possible. The more you do that, the less leeway you have to push the hardware beyond its design limits (i.e., "overclocking").

____________

My lucky number is 75898524288+1

@Michael - thanks man for the great answer.



Cheers

[edit] B68 started [/edit]



For fun, I tried looking at the initial time estimates for the test n=22 unit based on a Titan Black in double vs. single precision mode. Specifically, I was curious whether OCL4 in SP mode was on par with OCL in DP mode. Alas no, but it's not a Maxwell, either.



Double precision:

OCL 48:35

OCL3 104:30

OCL4 64:37

OCL5 110:29



Single precision:

OCL 59:45

OCL3 95:17

OCL4 59:56

OCL5 101:50

Although we have not completed every last test, we have reached the point where enough combinations of hardware, software and algorithms have been tested that I have given the go ahead to roll out the 3.3.0-1 binaries to BOINC. Mike will be setting these up sometime in the next few days I think.



Please complete any outstanding tests, and if you spot any problems please let me know!



Thanks for all the testing effort - I will be applying manual credit for the tests soon.



- Iain

____________

Twitter: IainBethune

Proud member of team "Aggie The Pew". Go Aggie!

3073428256125*2^1290000-1 is Prime!

Although we have not completed every last test, we have reached the point where enough combinations of hardware, software and algorithms have been tested that I have given the go ahead to roll out the 3.3.0-1 binaries to BOINC. Mike will be setting these up sometime in the next few days I think.



I expect to install the new 3.3.0-1 app later today on GFN16. GFN16 will soon be hitting the OCL3 limit and must use the new app to continue processing beyond that point. It's therefore the natural choice to go first. (The leading edge is 11,397,334 and OCL3 will stop working at 11,863,283.)



If there are no problems with GFN16, I'll install the new apps on all the other GFN ranges afterwards.



Please note that while we are updating both the CPU and OCL apps (for both ATI/AMD and Nvidia GPUs), the CUDA apps are not being updated since we plan on removing the CUDA apps at some point.



Also note that the old OCL2 and OCL3 apps (as well as all the OCL4 and OCL5 test apps that are floating around) will become obsolete when replaced by the new OCL app. If you're running geneferOCL with app_info, once all the new apps are installed you'll probably want to either switch to the new OCL app, or run without app_info, as appropriate.

____________

My lucky number is 75898524288+1

Genefer 3.3.0-1 is now live on GFN-16



Let us know if there's any problems.

____________

My lucky number is 75898524288+1

Genefer 3.3.0-1 is now live on GFN-16



Let us know if there's any problems.

Okay... the app is being whiny about selecting the best transform. Again.



Let me start with a little background. I'm using the GTX 970; the best transforms are OCL4 (2-prime) > OCL3 > OCL5 > OCL4 (3-prime). So the app should always try OCL4 2p first, then move to OCL3 and then back to OCL4 3p. OCL5 should never be used, as both the b limit and the speed of the app are inferior to OCL3.



But that's not what was happening. I did this WU and it chose to use OCL5 instead of OCL3. I decided to download the app and run it manually. So I ran it, and it chose OCL5..... even though the benchmark file clearly said OCL3 was the better transform. Which it is, as the run time for OCL3 really is smaller than OCL5 (but I already knew that from previous testing).



And that's not all. Check this other WU. It's a little higher b... but what's that: it's using OCL3, as it was supposed to! Funnily enough, manually running the app, it chooses OCL5 (I tried multiple times), but when BOINC ran it, it chose OCL3!



What's going on here.....



Okay... the app is being whiny about selecting the best transform. Again.



What's going on here.....



Hi Rafael,



A couple of things to mention:



* The combined OCL app will never try both OCL4-2-prime (a.k.a. the fast, low-b version of OCL4) and OCL4-3-prime (a.k.a. the slower, high-b version of OCL4). The selection of which of those algorithms is used is done internally by the transform code, depending on b. In the example you mention, the OCL3, OCL4 (high-b) and OCL5 transforms are all possible.



* For n=16, the respective b limits of the transforms are:



OCL3: bMax=11863283

OCL4: bMax=5720809/400000000 (i.e. genefer uses the high-b, slower version of the transform)

OCL5: bMax=11421893



* Without seeing the 'genefer.dat' file from your slots directory I can't say for sure, but I expect that for such a short test the three transforms all take approximately the same amount of time, i.e. depending on exactly what else is running in the environment at the time of the benchmark, genefer might end up choosing one over the other. If you can grab the genefer.dat file that would be useful info - if it really is the case that the benchmarks showed OCL3 faster than OCL5 but genefer chose OCL5 first, that's a bug...



* I note from the two tasks that you showed, the OCL5 one took 02:24, and the OCL3 one took 02:13. I realise that a 10% variation seems like quite a lot, but for such a short task, getting a benchmark accurate enough to be sure of choosing the 'best' transform might take longer than the time saved by choosing it.
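A rough break-even check of this point, using numbers from the thread: the two task times were 02:24 (OCL5) and 02:13 (OCL3), and the benchmark pass in the HD 4000 log later in the thread took ~57.6 s. (Assumption: a benchmark of broadly similar length on the GTX 970; the machines differ, so this is only indicative.)

```python
# Time saved per task by picking the faster transform vs. the cost of the
# benchmark needed to be sure of picking it.
ocl5_s = 2 * 60 + 24      # 144 s
ocl3_s = 2 * 60 + 13      # 133 s
benchmark_s = 57.6        # from a "Benchmarks completed" line on another GPU

saving = ocl5_s - ocl3_s  # 11 s
print(saving, benchmark_s > saving)  # 11 True
```

So for a ~2-minute task, a benchmark long enough to be reliable can easily cost more than the 11 seconds it might save.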



If you want to be sure that you are getting the transform you want, you can use app_info.xml and specify the '-x ocl4' or whatever, which will also save time as no benchmarks would be run.



- Iain

____________

Twitter: IainBethune

Proud member of team "Aggie The Pew". Go Aggie!

3073428256125*2^1290000-1 is Prime!



Hi Rafael,



A couple of things to mention:



* The combined OCL app will never try both OCL4-2-prime (a.k.a. the fast, low-b version of OCL4) and OCL4-3-prime (a.k.a. the slower, high-b version of OCL4). The selection of which of those algorithms is used is done internally by the transform code, depending on b. In the example you mention, the OCL3, OCL4 (high-b) and OCL5 transforms are all possible.



Yeah, I know. My wording might not have been the best.





* For n=16, the respective b limits of the transforms are:



OCL3: bMax=11863283

OCL4: bMax=5720809/400000000 (i.e. genefer uses the high-b, slower version of the transform)

OCL5: bMax=11421893



* Without seeing the 'genefer.dat' file from your slots directory I can't say for sure, but I expect that for such a short test the three transforms all take approximately the same amount of time, i.e. depending on exactly what else is running in the environment at the time of the benchmark, genefer might end up choosing one over the other. If you can grab the genefer.dat file that would be useful info.



(This is the first WU I linked)

Windows/OpenCL/32-bit : 11404838 65536 ocl5 142 ocl3 162 ocl4 208



As you can see, the OCL3 benchmark is clearly higher, so I have no idea as to why OCL 5 is being used instead.



* I note from the two tasks that you showed, the OCL5 one took 02:24, and the OCL3 one took 02:13. I realise that a 10% variation seems like quite a lot, but for such a short task, getting a benchmark accurate enough to be sure of choosing the 'best' transform might take longer than the time saved by choosing it.



If you want to be sure that you are getting the transform you want, you can use app_info.xml and specify the '-x ocl4' or whatever, which will also save time as no benchmarks would be run.





With one other task I tried, I got 142 / 145 as the OCL5/3 scores. In that case, I totally agree with you: the times are too short to accurately pinpoint the best transform. And given the fact that we are so close to the OCL5 limit (around 11.4M), you could argue that I'm making a fuss about some nitpick....



But still, 10% free performance does seem like a lot to me. Not so much when the benchmark scores are so similar (142/145), but when they are considerably different (142/162), I can only see this as a problem somewhere within the logic.



And I know we're still very far away from this being relevant to other n ranges, but if it is happening with n=16, it might just happen to other ranges. For example, if it were to happen with n=22 due to OCL only being 10% slower than OCL4 on a newer architecture (Polaris or Pascal): with how big those tasks are, 10% is now a few hours, which is a lot!





It's not like I can do much about it myself, though, in terms of the app. I don't know how to code, the decision to fix it or not is totally outside my hands. If you say it's not worth the effort to fix it (example: Yves decided it was fine to have b=2 broken for OCL4), I'll be cool with that. Let's just keep in mind that there's a minor issue still lingering, even if it doesn't bother us.

For some reason boinc is having issues downloading the new version. It keeps backing off and retrying. The particular system had no problems downloading more ESP sieve tasks, just the new GFN-16 app.



Event log: Internet access ok- project servers may be temporarily down.



EDIT:Checking on one of my wingman shows a lot of errors



http://www.primegrid.com/results.php?hostid=400679&offset=0&show_names=0&state=7&appid=23



and this one



http://www.primegrid.com/results.php?hostid=510579



guess it's not ready for prime time yet.



I'm switching over to 17_low for a bit.

Genefer 3.3.0-1 is now live on GFN-16



Let us know if there's any problems.

Live means no app_info needed, right?

____________



Genefer 3.3.0-1 is now live on GFN-16



Let us know if there's any problems.

Live means no app_info needed, right?



Correct.

____________

My lucky number is 75898524288+1

EDIT:Checking on one of my wingman shows a lot of errors



http://www.primegrid.com/results.php?hostid=400679&offset=0&show_names=0&state=7&appid=23

OCL3 transform runs on this computer but AMD OpenCL 1124.2 driver crashes when it tries to compile OCL4 OpenCL code (clBuildProgram).

Or this driver is not able to compile two OpenCL programs.

No error with AMD OpenCL 1912.5 driver...



http://www.primegrid.com/results.php?hostid=510579

Neither Genefer 65536 v3.10 nor Genefer 65536 v3.12 run on this computer.



But Genefer v3.10 message is:

Using OCL3 transform No NVIDIA compatible OpenCL device found. An error (7) occurred. Waiting 10 minutes before attempting to continue from last checkpoint...

and Genefer v3.12

geneferocl 3.3.0-1 (Windows/OpenCL/32-bit) Unhandled Exception Detected... - Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x6eaa2986 read attempt to address 0xC483F897 Engaging BOINC Windows Runtime Debugger...

... error handling should be improved (regression?).



In case anyone is running OCL5 (using app_info and the old v955 OCL4 app or the combined OCL app with the "-x ocl5" switch) on GFN16, it's time to switch over to the new combined stock app. The leading edge of the GFN16 tasks has just surpassed the OCL5 b limit. GFN16 now is in OCL3 territory, but only very briefly. That b limit will be exceeded soon, and GFN16 will only be able to be run by the OCL4 (high) transform or the now-obsolete OCL2 transform.

____________

My lucky number is 75898524288+1

EDIT:Checking on one of my wingman shows a lot of errors



http://www.primegrid.com/results.php?hostid=400679&offset=0&show_names=0&state=7&appid=23

OCL3 transform runs on this computer but AMD OpenCL 1124.2 driver crashes when it tries to compile OCL4 OpenCL code (clBuildProgram).

Or this driver is not able to compile two OpenCL programs.

No error with AMD OpenCL 1912.5 driver...

I tested Genefer 3.3.0-1 on a HD 5000 series (Redwood) and there is no error with driver 1800.11 (Catalyst 15.2).



The hardware is able to run the program but Catalyst 13.x or older drivers may need to be updated.





(This is the first WU I linked)

Windows/OpenCL/32-bit : 11404838 65536 ocl5 142 ocl3 162 ocl4 208



As you can see, the OCL3 benchmark is clearly higher, so I have no idea as to why OCL 5 is being used instead.







With one other task I tried, I got 142 / 145 as the OCL5/3 scores. In that case, I totally agree with you: the times are too short to accurately pinpoint the best transform. And given the fact that we are so close to the OCL5 limit (around 11.4M), you could argue that I'm making a fuss about some nitpick....



But still, 10% free performance does seem like a lot to me. Not so much when the benchmark scores are so similar (142/145), but when they are considerably different (142/162), I can only see this as a problem somewhere within the logic.



And I know we're still very far away from this being relevant to other n ranges, but if it is happening with n=16, it might just happen to other ranges. For example, if it were to happen with n=22 due to OCL only being 10% slower than OCL4 on a newer architecture (Polaris or Pascal): with how big those tasks are, 10% is now a few hours, which is a lot!



It's not like I can do much about it myself, though, in terms of the app. I don't know how to code, the decision to fix it or not is totally outside my hands. If you say it's not worth the effort to fix it (example: Yves decided it was fine to have b=2 broken for OCL4), I'll be cool with that. Let's just keep in mind that there's a minor issue still lingering, even if it doesn't bother us.



You're misunderstanding the data here. These are not benchmark scores, but estimated runtimes i.e. lower is better. OCL5 is predicted to take 142s = 2m22s, OCL3 is 162s = 2m42s. So at least according to the benchmark data, the transform selection logic is working. There is maybe an issue with the accuracy of the benchmarks, and I may be able to improve that in a future release.
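The selection rule described here can be sketched as follows (a minimal illustration, not the actual genefer code): each transform has a b limit for the given n, the benchmark produces an estimated runtime in seconds for each, and the admissible transform with the lowest estimate wins. The b limits are the n=16 values quoted earlier in the thread; the estimates are the genefer.dat line from the first WU.

```python
# n=16 b limits from the thread (OCL4 high-b limit shown for the combined app).
B_MAX_N16 = {"ocl3": 11863283, "ocl4": 400000000, "ocl5": 11421893}

def pick_transform(b, estimates, b_max):
    """Return the transform with the lowest estimated runtime whose b limit admits b."""
    admissible = {t: s for t, s in estimates.items() if b <= b_max[t]}
    return min(admissible, key=admissible.get)

# genefer.dat line: b=11404838, estimates ocl5=142s, ocl3=162s, ocl4=208s.
estimates = {"ocl5": 142, "ocl3": 162, "ocl4": 208}
print(pick_transform(11404838, estimates, B_MAX_N16))  # ocl5 (lower is better)
print(pick_transform(11500000, estimates, B_MAX_N16))  # ocl3 (past the OCL5 limit)
```

With "lower is better", OCL5 at 142 s is indeed the correct pick for that WU; once b exceeds the OCL5 limit, OCL3 takes over.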



I believe that the larger N benchmarks should be more accurate, but I will keep an eye on this for sure.



- Iain







____________

Twitter: IainBethune

Proud member of team "Aggie The Pew". Go Aggie!

3073428256125*2^1290000-1 is Prime!

EDIT:Checking on one of my wingman shows a lot of errors



http://www.primegrid.com/results.php?hostid=400679&offset=0&show_names=0&state=7&appid=23

OCL3 transform runs on this computer but AMD OpenCL 1124.2 driver crashes when it tries to compile OCL4 OpenCL code (clBuildProgram).

Or this driver is not able to compile two OpenCL programs.

No error with AMD OpenCL 1912.5 driver...



http://www.primegrid.com/results.php?hostid=510579

Neither Genefer 65536 v3.10 nor Genefer 65536 v3.12 run on this computer.



But Genefer v3.10 message is:

Using OCL3 transform No NVIDIA compatible OpenCL device found. An error (7) occurred. Waiting 10 minutes before attempting to continue from last checkpoint...

and Genefer v3.12

geneferocl 3.3.0-1 (Windows/OpenCL/32-bit) Unhandled Exception Detected... - Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x6eaa2986 read attempt to address 0xC483F897 Engaging BOINC Windows Runtime Debugger...

... error handling should be improved (regression?).





Thanks, I'll have a look and see where the code might segfault rather than error gracefully so we can catch it using the backoff mechanism.

____________

Twitter: IainBethune

Proud member of team "Aggie The Pew". Go Aggie!

3073428256125*2^1290000-1 is Prime!

For some reason boinc is having issues downloading the new version. It keeps backing off and retrying. The particular system had no problems downloading more ESP sieve tasks, just the new GFN-16 app.



Event log: Internet access ok- project servers may be temporarily down.







If you're talking about the machine with the Radeon HD 6900, it seems to have got the app and returned valid tasks now.





EDIT:Checking on one of my wingman shows a lot of errors



http://www.primegrid.com/results.php?hostid=400679&offset=0&show_names=0&state=7&appid=23



and this one



http://www.primegrid.com/results.php?hostid=510579



guess its not ready for prime time yet.



I'm switching over to 17_low for a bit.



Thanks, I have some ideas what's going on there - it appears to be a driver bug which manifests itself in a less graceful manner in the new app.



- Iain



____________

Twitter: IainBethune

Proud member of team "Aggie The Pew". Go Aggie!

3073428256125*2^1290000-1 is Prime!

You're misunderstanding the data here. These are not benchmark scores, but estimated runtimes i.e. lower is better. OCL5 is predicted to take 142s = 2m22s, OCL3 is 162s = 2m42s. So at least according to the benchmark data, the transform selection logic is working. There is maybe an issue with the accuracy of the benchmarks, and I may be able to improve that in a future release.



I believe that the larger N benchmarks should be more accurate, but I will keep an eye on this for sure.



- Iain







Oh... I suck. I always thought those were the scores.



Well, it makes a lot more sense now. We're back at the "how to accurately determine runtimes" scenario.

Manual credit has now been applied for all of the non-BOINC tests. You should find it under the PSA sub-project.



Thanks again for all the testing!



- Iain

____________

Twitter: IainBethune

Proud member of team "Aggie The Pew". Go Aggie!

3073428256125*2^1290000-1 is Prime!

If you're talking about the machine with the Radeon HD 6900, it seems to have got the app and returned valid tasks now.







I figured out the problem. Avast was stopping the download as suspicious. I turned off Avast for a bit to complete the download. Then once the new app started running, Avast quarantined it thinking it was malicious. I had to set an exception and now everything is smooth sailing.





I dug up the most decrepit GPUs I had laying around and decided to give them a go.



One is a quadro fx 5600, which is CC 1.0, and the other is a quadro fx 1800, CC 1.1.



http://www.primegrid.com/show_host_detail.php?hostid=504933



Neither of them would pull any n=16 to n=18 tasks through boinc. Both will run pps-sieve.



Command line execution works fine, albeit painfully slowly.



Genefer 3.3.0-1 is now live on ALL GFN ranges



Let us know if there's any problems.



If you're using app_info in order to run one of the newer transforms, that should no longer be necessary. The stock app should (usually) make the best selection possible.

____________

My lucky number is 75898524288+1

So now that we have a combined app, what are the odds that it will be made available for intel GPUs (without jumping through app_info hoops)?



Sure they are not particularly powerful, but they are probably rather numerous. And for those with macs, they appear to work out-of-the-box with the new combined app.



Obviously, some testing would be in order, particularly on linux boxes, as there are several flavours of ocl that could get installed there. Also, I'm not privy to the effort it would take to make it happen from the back-end. But it seems to me there is a bunch of untapped processing power out there.



If I can crunch out a n=16 unit every half hour on an hd4000, and I have 4 machines with those, that's (almost) 200 more units done per day. And that's just me.





ida:Downloads vzimmerman$ ./geneferocl_macintel -q "157476^65536+1"

geneferocl 3.3.0-1 (Apple-x86/OpenCL/64-bit)

Supported transform implementations: ocl3 ocl4 ocl5

Copyright 2001-2016, Yves Gallot

Copyright 2009, Mark Rodenkirch, David Underbakke

Copyright 2010-2012, Shoichiro Yamada, Ken Brazier

Copyright 2011-2014, Michael Goetz, Ronald Schneider

Copyright 2011-2016, Iain Bethune

Genefer is free source code, under the MIT license.



Command line: ./geneferocl_macintel -q 157476^65536+1



Priority change succeeded.

A benchmark is needed to determine best transform, testing available transform implementations...

Testing OCL3 transform...



Running on platform 'Apple', device 'HD Graphics 4000', version 'OpenCL 1.2 ' and driver '1.2(Dec 17 2015 21:11:48)'.



Testing OCL4 transform...

Testing OCL5 transform...

Benchmarks completed (57.613 seconds).



Testing 157476^65536+1...

Using OCL4 transform

Starting initialization...

Initialization complete (0.827 seconds).

Estimated time remaining for 157476^65536+1 is 0:31:23

So now that we have a combined app, what are the odds that it will be made available for intel GPUs (without jumping through app_info hoops)?



Sure they are not particularly powerful, but they are probably rather numerous. And for those with macs, they appear to work out-of-the-box with the new combined app.



Obviously, some testing would be in order, particularly on linux boxes, as there are several flavours of ocl that could get installed there. Also, I'm not privy to the effort it would take to make it happen from the back-end. But it seems to me there is a bunch of untapped processing power out there.



If I can crunch out a n=16 unit every half hour on an hd4000, and I have 4 machines with those, that's (almost) 200 more units done per day. And that's just me.





ida:Downloads vzimmerman$ ./geneferocl_macintel -q "157476^65536+1"

geneferocl 3.3.0-1 (Apple-x86/OpenCL/64-bit)

Supported transform implementations: ocl3 ocl4 ocl5

Copyright 2001-2016, Yves Gallot

Copyright 2009, Mark Rodenkirch, David Underbakke

Copyright 2010-2012, Shoichiro Yamada, Ken Brazier

Copyright 2011-2014, Michael Goetz, Ronald Schneider

Copyright 2011-2016, Iain Bethune

Genefer is free source code, under the MIT license.



Command line: ./geneferocl_macintel -q 157476^65536+1



Priority change succeeded.

A benchmark is needed to determine best transform, testing available transform implementations...

Testing OCL3 transform...



Running on platform 'Apple', device 'HD Graphics 4000', version 'OpenCL 1.2 ' and driver '1.2(Dec 17 2015 21:11:48)'.

Problem is, RAM bandwidth. Using my 6600k with 3200mhz Dual Channel Dual Rank RAM (so pretty much the best of the best when it comes to RAM bandwidth, as LGA 2011 doesn't have iGPU), I can run up to 2 SGS tasks no problem. But 3 and above, massive slowdowns on both apps. And with pretty much any other LLR task, a single WU is already enough to tank performance.



One has to wonder, what's more important: the ability to crunch for SoB, SR5 or TRP, or crunch for some GFN16? Keep in mind that the Intel HD series all come with AVX / AVX2 CPUs; in my opinion, LLR takes priority here.



The only real use to this would be if you intend to use the CPU for something other than PG. Say, WCG: I can run their apps + iGPU GFN just fine, without any loss in performance. But that's a niche situation, and I bet implementing it would cause more harm than good.



Maybe it could be added as a hidden option, only available to people with special titles (say, Project Scientist, Volunteer Tester / Developer, etc.)? Those are more likely to know what they are doing, rather than random people joining in....



Problem is, RAM bandwidth. Using my 6600k with 3200mhz Dual Channel Dual Rank RAM (so pretty much the best of the best when it comes to RAM bandwidth, as LGA 2011 doesn't have iGPU), I can run up to 2 SGS tasks no problem. But 3 and above, massive slowdowns on both apps. And with pretty much any other LLR task, a single WU is already enough to tank performance.



One has to wonder, what's more important: the ability to crunch for SoB, SR5 or TRP, or crunch for some GFN16? Keep in mind that the Intel HD series all come with AVX / AVX2 CPUs; in my opinion, LLR takes priority here.



The only real use to this would be if you intend to use the CPU for something other than PG. Say, WCG: I can run their apps + iGPU GFN just fine, without any loss in performance. But that's a niche situation, and I bet implementing it would cause more harm than good.



Maybe it could be added as a hidden option, only available to people with special titles (say, Project Scientist, Volunteer Tester / Developer, etc.)? Those are more likely to know what they are doing, rather than random people joining in....



I haven't done any structured look at the hit existing tasks will take with a task running on the iGPU (yet), but I would be curious as to whether it is better in all cases, from a total throughput perspective, to leave the existing LLR (or CPU GFN) tasks be and not run an iGPU task. If they are always memory bound, I would tend to agree.



I think I may have to give that a try. I have an i3-3217u, an i5-4250u, and an i7-3615qm that have been running exclusively ESP for the recent past. One per core on the i3 and i5, and 3 units on the i7. I'll kick off a test WR unit on each of them and see what happens to my ESP times.

So now that we have a combined app, what are the odds that it will be made available for intel GPUs (without jumping through app_info hoops)?



I'm going to have to agree with Rafael here. The CPUs are currently faster than the RAM, so turning on more CPU computing power (in the iGPU) that's going to go through the same RAM bottleneck as all the regular CPU tasks isn't going to make more computing happen. Essentially, all you're doing is turning on an asymmetrical analog of HyperThreading, and I would expect the results to be very similar. Overall you won't see more throughput. All of your "gains" in GFN crunching will come at the expense of whatever the CPU is doing (LLR or GFN). And, like HyperThreading, I'd expect the overall throughput to actually go down somewhat because there would be more cache misses.



Since it would seem that this isn't a good strategy, I don't expect we'll put any development effort into making it happen.



EDIT: I'm looking forward to seeing the results of your testing, but PLEASE, PLEASE, PLEASE use identical command line tests and not live workunits. If you use live BOINC tasks, you never will know if your results are real or merely an artifact of the variations in individual tasks. And you should NEVER use live conjecture tasks for benchmarks since they can vary quite significantly.

____________

My lucky number is 75898524288+1



I think I may have to give that a try. I have an i3-3217u, an i5-4250u, and a

i7-3615qm that have been running exclusively ESP for the recent past. One per core on the i3 and i5, and 3 units on the i7. I'll kick of a test WR unit on each of them and see what happens to my ESP times.

Yeah, might be worth looking into other architectures. They might not be as efficient, and thus maybe not require as much bandwidth as the HD 530.



3rd Gen HD graphics is pretty terrible, performance is garbage on almost any task conceivable. In fact, I couldn't even get the OCL 4 app to work on my laptop, it would just open and do absolutely nothing. I wasn't even able to close the damn thing, the CMD window would just get stuck on the screen. At least OCL3 worked.



4th gen might be better.

So now that we have a combined app, what are the odds that it will be made available for intel GPUs (without jumping through app_info hoops)?



I would not be surprised if overall throughput actually went down when all CPU cores AND the iGPU were used.



Still, I would like to see some benchmarks for, let's say, 1 iGPU task and 1 CPU task on Skylake.

Consider office PCs where running all cores is not convenient (it slows down the whole system and puts too much stress on cooling, hence noise).

But single iGPU task might be just fine and even power efficient...



Also, I expect iGPUs to gain speed and power efficiency faster than CPUs (ie. think of regular GPUs vs CPUs).

____________

My stats

Badge score: 1*1 + 5*1 + 8*3 + 9*11 + 10*1 + 11*1 + 12*3 = 186

Well, those 3 are each running a test n=22 unit on the GPU now. Let's see what happens to the LLR times.



[edit] All of them chose ocl4. Estimated time for the i7 is 1867 hours. For the i5 1420 hours (that hd5000 apparently makes a difference). For the i3 3051 hours. [/edit]

So now that we have a combined app, what are the odds that it will be made available for intel GPUs (without jumping through app_info hoops)?



I would not be surprised if overall throughput actually went down when all CPU cores AND the iGPU were used.



Still, I would like to see some benchmarks for, let's say, 1 iGPU task and 1 CPU task on Skylake.

Consider office PCs where running all cores is not convenient (it slows down the whole system and puts too much stress on cooling, hence noise).

But single iGPU task might be just fine and even power efficient...



Also, I expect iGPUs to gain speed and power efficiency faster than CPUs (ie. think of regular GPUs vs CPUs).

While I don't have it documented, I tried the HD 530 with 3200mhz (17-18-18-38, I think), Dual Channel, Dual Rank RAM. For the GFN, I think I was doing n=13 as a side project. If you really want, I can re-do and test things more extensively on the weekend, once I get home. But for now:

-1 SGS: fine. No performance loss.

-2 SGS: still good. Again, no performance difference. I think I lost 1s, if anything.

-3 SGS: party pooper. Huge performance loss here. Again, I lack exact numbers, but it went from the realm of less than a minute to over a minute. Huge hit there.

-I didn't test PPSE or sieve tasks, but even a single PPS Mega WU was already enough to cause mayhem and throw the viability out of the window. So nope, ain't doing that 1+1 tactic.

-On the other hand, I could run 3x WCG apps no problem. Makes sense, these run about 15ºC cooler than the LLR apps, so I'd figure they wouldn't be as intensive. And that's why I'm using my iGPU: I intend to get medals there first, before moving back to PG's LLR.



EDIT: I'm looking forward to seeing the results of your testing, but PLEASE, PLEASE, PLEASE use identical command line tests and not live workunits. If you use live BOINC tasks, you never will know if your results are real or merely an artifact of the variations in individual tasks. And you should NEVER use live conjecture tasks for benchmarks since they can vary quite significantly.



Unless someone has other recommendations, I'll use 3752948*2^3752948-1 for the llr tests and will use 64996^4194304+1 to load the gpu.



I'll kill the gpu processes, and do a control with the llr units in a "standard" workload for those boxes, i.e. one unit per core for the dual cores (they have dual channel ram), and n-1 for the quad core. Once the clean run finishes, I'll re-run with the gpu loaded.



[edit]For a third run, I'll use 157476^65536+1 for the gfn test, since it probably represents a more likely workload for an anemic gpu, as opposed to just trying to peg/crush the gpu, and can get some benchmarks from it.[/edit]

I'm getting this error on two separate Ubuntu 14.04/nVidia PCs when I started them up after the long weekend:



../../projects/www.primegrid.com/primegrid_genefer_3_3_0_3.12_x86_64-pc-linux-gnu__OCLcudaGFN17MEGA: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by ../../projects/www.primegrid.com/primegrid_genefer_3_3_0_3.12_x86_64-pc-linux-gnu__OCLcudaGFN17MEGA)



I did reset the project, restarted, etc., but the problem persists.



Anyone else seeing this with Ubuntu?
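Not a fix, but a way to confirm the diagnosis: CXXABI_1.3.8 was introduced with the GCC 4.9 runtime, so a libstdc++ from an older GCC won't export it. A sketch for checking what your installed library actually provides, assuming the stock Ubuntu multiarch path from the error message:

```shell
# List the CXXABI version tags the installed libstdc++ actually provides.
# If CXXABI_1.3.8 is missing from this output, the runtime predates GCC 4.9.
strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 2>/dev/null \
  | grep '^CXXABI' | sort -u \
  || echo "no CXXABI tags found (or libstdc++ not at that path)"
```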




I looked in the database for tasks run with that particular app (Linux, 64-bit, OCL on an Nvidia GPU).



So far you're the only one who has run that app on GFN-17-MEGA, but fortunately the same program is used on all of the GFN ranges. Looking across them all, there are a few computers that are generating lots of errors, and there are also other computers that are completing the tasks successfully.



Looking at the previous (3.09) version of the software, I see a similar pattern.



Your Ubuntu system was running the 3.09 GFN17MEGA tasks successfully about a week ago. Is there anything that changed on that system in the last week that you're aware of?



Yours isn't the only computer, however, that seems to have been able to run 3.09 but is having trouble with 3.12. Host #484282 shows a similar pattern.



____________

My lucky number is 75898^524288+1




Hi Michael,



Both of the machines in question are "headless crunchers" that were turned off over the Easter holiday so I doubt much has changed. The machine running the GFN17-Mega tasks (Host 499554) does do routine automatic updates which is why I also tried a second machine that has OS updates disabled (Host 498992).

The second machine ran into the same problem when I tried a GFN16 task...














It's the EXACT same software on all the GFN ranges. If it fails to run on one, it will fail on all of them.

____________

My lucky number is 75898^524288+1


Ah ... right! Understood now! I haven’t been keeping up with the Genefer testing threads and misinterpreted the downloaded application names when I was looking th