Message boards : Number crunching : Major SNAFU in Effect

http://www.gpugrid.net/results.php?hostid=490728



Above is my host with erroring WUs


Did someone forget to renew a license?













I'm getting nothing but comp errors on these new tasks also.

Same here, of course. But I haven't seen anyone from the project around here for a while. Is anyone at home?

Same here as well. Error 212 on WUs that were running fine up to 4-5 hours ago. Sounds like a license thing to me as well. Suspended project until the issue is resolved.

Have the same issues on two Linux machines, so not sure if this is a license thing.

For the last 2 years, the license error has usually come after July 1st. A 12-month license, I am assuming.



Every task I had in my cache on 4 hosts errored out today. Since I don't run very high resource allotment, some tasks had been running a couple of hours a day with no issues until today. The hosts are processing other projects without any errors during this time. I'd have to guess a license expired today.

Azmodes
Message 51799 - Posted: 14 May 2019 | 7:56:34 UTC

Last modified: 14 May 2019 | 8:02:17 UTC

Same. I have two Ubuntu machines that throw up nothing but immediate errors now. My two Windows crunchers are fine, though.

The Linux app is broken (most probably its license expired).

All of my Linux hosts run immediately into this error with every single workunit:

<core_client_version>7.9.3</core_client_version>
<![CDATA[
<message>
process exited with code 212 (0xd4, -44)</message>
<stderr_txt>
</stderr_txt>
]]>
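The three numbers in that error message are the same exit status written three ways. A minimal Python sketch, assuming the negative value is just the low byte of the status read as a signed 8-bit integer (the function name is made up for illustration and is not part of BOINC):

```python
def describe_exit_code(code: int) -> str:
    """Render an exit status as decimal, hex, and signed 8-bit values."""
    low_byte = code & 0xFF
    # Values of 128 and above wrap to negative numbers in a signed byte.
    signed = low_byte - 256 if low_byte >= 128 else low_byte
    return f"{code} (0x{low_byte:x}, {signed})"


print(describe_exit_code(212))  # prints "212 (0xd4, -44)"
```

This only decodes the number. It says nothing about why the app exited, which is why the empty stderr output left everyone guessing at a license expiry.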

However, my Windows hosts are crunching happily, so I switched my Linux hosts back to Windows.



The GPUGrid staff need to act on this without delay.

Same over here:

http://www.gpugrid.net/forum_thread.php?id=4909&nowrap=true#51794



Michael.

____________

President of Rechenkraft.net - Germany's first and largest distributed computing organization.

The Linux ACEMD v9.19 apps were deployed on 13/14 February 2018 - so it possibly looks like a 15 month licence expiry.



The Windows v9.22 apps were deployed on 26 July 2018, so with luck we have until late October for those...
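The date arithmetic behind that guess can be checked with a short Python sketch. The 15-month term is the poster's assumption, not a confirmed license length, and `add_months` is a hypothetical helper written for this example:

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Return the date `months` calendar months after `start`, clamping the day."""
    index = start.year * 12 + (start.month - 1) + months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


linux_deployed = date(2018, 2, 14)    # Linux ACEMD v9.19
windows_deployed = date(2018, 7, 26)  # Windows v9.22

print(add_months(linux_deployed, 15))    # 2019-05-14
print(add_months(windows_deployed, 15))  # 2019-10-26
```

The first result lines up with the day the Linux tasks started failing, and the second is the "late October" estimate for the Windows apps.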



Applications

A temporary fix for Linux users is to set your system date back 1 year.



EDIT: Setting time back 1 year caused certificate errors with other projects. So I have now set time back 1 month. This seems to work better.



This has allowed me to start GPUgrid jobs successfully.



You may need to stop time sync services so the system does not reset the time back to current time.



For systemd-based distros (e.g. Ubuntu), sudo timedatectl set-ntp 0 will turn time sync off.



EDIT: you will need to reissue this command and reset time after each reboot. If this licensing issue persists, I will post a more permanent time sync fix



This was the temporary fix last year when license issues occurred.

Is project leadership aware of the licensing expiration? Seems like someone should be keeping a tickler file for this so that renewals could happen before WU's start erroring out.




"Is project leadership aware of the licensing expiration?"
Apparently not. That's why this SNAFU.

"Seems like someone should be keeping a tickler file for this so that renewals could happen before WU's start erroring out."
True.

There wasn't any notification of the pending shutdown of the Quantum Chemistry (CPU) work units either, or when they might be restarted.

I am not sure that there is any project leadership at the moment.

I'm going to just suspend the project on all my hosts. The fact I have to exclude my Turing cards makes it difficult to work with the project anyway.



I'll just check back in occasionally and see if a new Linux app is available with current licensing.


Also, in the past, license renewals were not done in time and tasks failed. Too bad, but it really seems that the people at GPUGRID simply forget about these things.



Just in case anyone is still wondering, I've been sent WU 16485663.



Failed three times on Linux v9.19 hosts, now running normally under Windows v9.22



Confirms that it's an application problem, not a data problem.

Got a PM reply from Toni:



Oh gosh, thanks ...

:-)

Got a PM reply from Toni:

Hey Richard, thanks for raising this with admins.

Much appreciated!

So hopefully we will be back up and running shortly :). Thanks for bringing it to Toni's attention.

Will someone tell us when the FUBAR has finished???




The problem is still not resolved...



Michael.



What surprises me though is that no one from GPUGRID found out by themselves :-(

I aborted all my GPU WUs to let someone with Windows run them. I was hoping the certificate would be renewed by now so I could finish the ones I had time invested in, which I suspended before they failed. No such luck :-(. Barely enough calendar time left to finish them anyway.

Toni responded in this thread: http://www.gpugrid.net/forum_thread.php?id=4925&nowrap=true#51834



We are aware of the problem. We'd like to do a major version upgrade rather than continue fixing the old one. For the time being, I'm deprecating the app for linux so crunching goes on on Windows rather than erroring out.

So it looks like time to find a new project for the majority of my machines. Only have 1 that still runs M$




I came back from the Pent to this. :( Thought my computers borked.

So does anyone want to explain how a BOINC wrapper works? The docs don't really say anything about the mechanics involved.



What pre-requisites are there?



Anyone running a BOINC wrapper on other projects and care to elaborate?

tullio
Message 51846 - Posted: 17 May 2019 | 2:46:52 UTC

LHC@home uses a boincwrapper. Windows, Mac OS X and other Linux distros can all run their programs, which are written for Scientific Linux. You must have VirtualBox installed.

Tullio




Nope, that's even more separation from the client, including the OS and environment details like specific libc versions. In the case of LHC they give the choice of VBox or setting up CVMFS and Singularity on your own, both of which are included in the vbox.vdi file.



https://boinc.berkeley.edu/trac/wiki/WrapperApp



If you DON'T want to include progress % complete, checkpointing and GPU device # handling within your app, then the wrapper can do that.

Don't expect it to be as efficient, as there is now another layer between the exe doing the calculations and the hardware.

So we can no longer run this BOINC GPU Project under BOINC version 7.9.3 on Ubuntu 18.04.2 LTS [4.15.0-51-generic|libc 2.27 (Ubuntu GLIBC 2.27-3ubuntu1)] Running NVIDIA GeForce GTX 1080 Ti (4095MB) driver: 390.11?

Correction:

We can not run this BOINC GPU Project (GPUGrid) on any Linux distro for a who-knows-how-long time period.



I bet it won't be long before we get Linux WUs again. In the meantime there's Asteroids, Einstein, Milkyway & SETI to keep one busy.




So does anyone want to explain how a BOINC wrapper works? The docs don't really say anything about the mechanics involved.

From what I understand, it's a wrapper program they put around their normal (non-BOINC) science app and use to invoke it. No pre-reqs. No need for VBox. That way the wrapper handles the BOINC interaction and allows the use of a non-BOINC app.



See https://boinc.berkeley.edu/trac/wiki/WrapperApp for docs.
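To make the idea concrete, here is a rough Python sketch of the general pattern: a small parent program starts an unmodified worker as a child process, watches it, and passes its exit status along. This is not the BOINC wrapper itself; `run_wrapped` and the fake worker are made up for illustration.

```python
import subprocess
import sys
import time
from typing import Sequence


def run_wrapped(command: Sequence[str], poll_seconds: float = 0.1) -> int:
    """Launch the worker as a child process, wait for it, return its exit status."""
    worker = subprocess.Popen(list(command))
    while worker.poll() is None:
        # A real wrapper would use this loop to talk to the client:
        # report progress, react to suspend/resume/abort requests, etc.
        time.sleep(poll_seconds)
    return worker.returncode


if __name__ == "__main__":
    # Hypothetical stand-in for the science app: a child that exits with 212.
    fake_worker = [sys.executable, "-c", "import sys; sys.exit(212)"]
    print("worker exited with code", run_wrapped(fake_worker))
```

The worker runs as an ordinary native process on the host, with no virtual machine involved. The client only ever sees the parent program and the status it hands back.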

____________

BOINC blog

Thanks, I had already read that document and was and still am confused. I gather it is not a VM. So assume you don't need virtualization on the cpu?



Why does BOINC offer versions of BOINC+Virtual Box if this mechanism does not require VBox?



Does VBox do more or less than a wrapper? What are the limitations of a wrapper compared to VBox?



Does the application wrapped in a wrapper have to be native code for the platform? With a VM you could run an app not native to the platform.




The wrapper does not need VBox. It's just another interface that performs the BOINC-related functions, while the project's 'math.exe' (or whatever) does the crunching and ONLY performs calculations.

VBox can set up the entire OS environment to satisfy all the specifics needed to crunch. If a project needs extra programs that do not typically come with an OS, or that people don't normally install, they can be included in the VBox image. Again with LHC as the example, Singularity and CVMFS are included in the image. They can also make one VBox image that serves both Windows and Linux host OSs.



Is the BOINC wrapper a memory hog like virtualbox???




I'm trying to think of projects that use it. Going through project folders, it looks like DrugDiscovery CPU Goofy, MindModeling and CAS used it. DHEP, Gerasium, Moo, SRBase, Enigma, YoYo and Yafu are active projects that have a wrapper in the exe name. Some YoYo ECM tasks can use around 8 GB, but I think that's the data, as it's limited to certain types. The other projects use nothing like the 10 GB that LHC ATLAS does. VBox apps are huge because it's an entire image.



It seems like most GPUGrid crunching is done in Windows as the stats have only gone down from about 600m to 400m per day.

That still shows the Linux hosts responsible for 1/3 of the total credit. And since the percentage of Linux hosts is 37% compared to 54% for Windows hosts, the Linux hosts are showing a greater percentage of higher production hosts compared to Windows hosts.



It would benefit the project to return the Linux hosts to participation.


Which is why the PM which got Toni's attention had the subject line



Research being delayed - Linux apps broken

:-)

Been a while, any news?

Now the license of the Windows app has expired.

I have the feeling that this project is more important for us than for the GPUGrid team, if there's such an entity at all.




August is the vacation month in Italy. Looking at the "about" I don't see a lot of diversity. Probably took off a week to get their heads out of the quantum clouds and socialize with opposite sex.

August is vacation month in Italy . . .

Most likely most take off the whole month . . . not just a week.




They are in Spain, so I always figured they would head to Majorca. No one ever denied it at any rate.

Same here, of course. But I haven't seen anyone from the project around here for a while. Is anyone at home?

It looks to me like the two main researchers are about to get a flood of workunits that failed due to all of the tasks giving an error. If so, they will have to notify the programmer or programmers, and start an effort to fix the problem. If they're able to read and write in English, they'll then have little worthwhile to do other than tell us what happened, and when they expect a fix.



wolfman1360
Message 52705 - Posted: 23 Sep 2019 | 20:54:15 UTC - in response to Message 51786.

Am I to assume this has been fixed and I can add my Linux machine here? Or are there no WUs for Linux as of yet?

I know I'm crunching okay under Windows...

"Am I to assume this has been fixed and I can add my Linux machine here?"
It's been fixed, though only the Windows app is released to the production line.
You can add your Linux machine, but it will receive only beta test tasks for a while.

"Or are there no WUs for Linux as of yet?"
The workunits are common, but the new Linux app will be put into the production line only when the new Windows app is working as it should be.

"I know I'm crunching okay under Windows..."
Me too.


I am receiving non-Toni test tasks today for my Linux host. Looks like normal project work.

https://www.gpugrid.net/result.php?resultid=21405079

https://www.gpugrid.net/result.php?resultid=21405557

https://www.gpugrid.net/result.php?resultid=21405187

https://www.gpugrid.net/result.php?resultid=21405090


Yes, 'Application version: New version of ACEMD v2.06 (cuda100)' is the new normal.



Being in check-in mode for months has got me so confused. I thought Toni asked not to run acemd3 on Linux as that's not what she needs to test. Or are we now good to go on Linux WUs???




I thought Toni asked not to run acemd3 on Linux as that's not what she needs to test.

Yes, that is what he said. I am just surprised that they send them to Linux machines at all. Can't they block them?



"I am receiving non-Toni test tasks today for my Linux host. Looks like normal project work."
"Yes, 'Application version: New version of ACEMD v2.06 (cuda100)' is the new normal."
I received such tasks too. These are from the short queue (which is empty now, though).

I think Toni puts some workunits from the short queue into the "New version of ACEMD" queue from time to time to serve as a slightly longer test.



I've received only 1 since he said that. If admins only want Windows hosts to receive the tasks, then they could always deprecate the Linux app.