Message boards : News : New multicore app and WUs

Author Message

Toni (Volunteer moderator, Project administrator, Project developer, Project scientist)
Joined: 9 Dec 08 | Posts: 958 | Credit: 4,353,973 | RAC: 0

Message 48127 - Posted: 10 Nov 2017 | 14:58:07 UTC
Last modified: 10 Nov 2017 | 15:10:10 UTC

Dears,



we would like to test our new CPU multicore application for quantum chemistry tasks ("QC"). Since it’s the first time we have a CPU app out, I’ll test the behavior of GPUGRID with a relatively large batch that you will see soon. Workunits are named "*QC309big*".



Here are some of the app's features, in brief (subject to change):



* Platform: Linux only for now, generic x64.

* Threads: as many as BOINC decides. I guess it depends on your machine, your preferences, and other running tasks in ways which are obscure to me…

* Run time: about 1 CPU hour per WU (so, shorter if multithreading)

* Credit: computed with the default algorithm (tasks are short, don’t expect much). Bonus mechanism for fast turnaround is still on.

* Known bugs: restarts and checkpoints. This should be mitigated with the “keep in memory when suspended” option. Sorry about that, it’s outside of our control.

* Network behavior: the first time you get a WU of this kind it downloads a Python interpreter (miniconda) and then some open-source packages, and installs them in the project directory. The installation is reused whenever possible.

* Disk usage: could go around 1 GB, perhaps more when tasks are running. Resetting the project should remove everything.

* Memory usage: should be around 1 GB when running.



Depending on the results of this test, we’ll start thinking about other platforms.



Thanks and nice crunching!



Toni

Sergey Kovalchuk
Joined: 18 Feb 16 | Posts: 5 | Credit: 1,094,331 | RAC: 2

Message 48130 - Posted: 10 Nov 2017 | 15:37:26 UTC - in response to Message 48127.

The client does not receive WUs, although there are almost a thousand of them and the client meets the requirements (Linux x64). Earlier this host was able to receive test tasks for QC and python.



Please post the exact requirements (memory, disk, OS) specified when generating the tasks.

Toni

Message 48131 - Posted: 10 Nov 2017 | 15:41:06 UTC - in response to Message 48130.
Last modified: 10 Nov 2017 | 15:44:30 UTC

Can you check which applications you are accepting in your preferences?



By the way, the requests are currently as follows:





<rsc_fpops_est>3e12</rsc_fpops_est>

<rsc_fpops_bound>250e15</rsc_fpops_bound>

<rsc_disk_bound>4e9</rsc_disk_bound>

<rsc_memory_bound>1e9</rsc_memory_bound>
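As a rough illustration of what these bounds mean for a host: the client derives its duration estimate from `rsc_fpops_est` and the machine's benchmarked speed, and aborts a task that exceeds `rsc_fpops_bound`. The sketch below assumes a hypothetical 3 GFLOPS per-core benchmark and is not BOINC's exact scheduler formula.

```python
# Hypothetical sketch: turning the request bounds above into time figures.
# The 3 GFLOPS benchmark value is an assumption, not a measured host.
fpops_est = 3e12          # <rsc_fpops_est>
fpops_bound = 250e15      # <rsc_fpops_bound>
whetstone_flops = 3.0e9   # assumed per-core benchmark result

est_runtime_s = fpops_est / whetstone_flops    # initial duration estimate
abort_after_s = fpops_bound / whetstone_flops  # task is killed past this
print(est_runtime_s, abort_after_s)
```

With those assumptions the estimate comes out around a thousand seconds per core, broadly consistent with the "about 1 CPU hour per WU" figure once multithreading overhead is factored in.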



Sergey Kovalchuk

Message 48132 - Posted: 10 Nov 2017 | 16:05:30 UTC - in response to Message 48131.

All apps selected & "accept work from other"





Preferences:

max memory usage when active: 1900.76MB

max memory usage when idle: 1980.80MB

max disk usage: 6.71GB (4.47 free)



Another boinc mystery...

Jobs only seem to go to a subset of eligible machines. If anybody out there has a clue of the reason, I'll be glad to hear.





All error out with this:

Stderr output



<core_client_version>7.6.33</core_client_version>

<![CDATA[

<message>

process exited with code 195 (0xc3, -61)

</message>

<stderr_txt>

12:19:41 (31019): wrapper (7.7.26016): starting

12:19:41 (31019): wrapper (7.7.26016): starting

12:19:41 (31019): wrapper: running ../../projects/www.gpugrid.net/Miniconda3-4.3.30-Linux-x86_64.sh (-b -f -p /var/lib/boinc-client/projects/www.gpugrid.net/miniconda)

Python 3.6.3 :: Anaconda, Inc.

12:19:49 (31019): miniconda-installer exited; CPU time 6.649529

12:19:49 (31019): wrapper: running /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/python (pre_script.py)

12:19:59 (31019): $PROJECT_DIR/miniconda/bin/python exited; CPU time 7.101246

12:19:59 (31019): wrapper: running /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/psi4 (-n 14 -i psi4.in -o psi4.out)

/var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/psi4: 3: /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/psi4: readlink: not found

/var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/psi4: 9: /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/psi4: /bin/psi4.bin: not found

12:20:00 (31019): $PROJECT_DIR/miniconda/bin/psi4 exited; CPU time 0.001541

12:20:00 (31019): app exit status: 0x7f

12:20:00 (31019): called boinc_finish(195)



</stderr_txt>

]]>

It is this computer:

http://www.gpugrid.net/show_host_detail.php?hostid=420971

All error out after a few seconds on AMD and Intel machines

<core_client_version>7.6.33</core_client_version>

<![CDATA[

<message>

process exited with code 195 (0xc3, -61)

</message>

<stderr_txt>

17:27:46 (14006): wrapper (7.7.26016): starting

17:27:46 (14006): wrapper (7.7.26016): starting

17:27:46 (14006): wrapper: running ../../projects/www.gpugrid.net/Miniconda3-4.3.30-Linux-x86_64.sh (-b -f -p /var/lib/boinc-client/projects/www.gpugrid.net/miniconda)

Python 3.6.3 :: Anaconda, Inc.

17:27:54 (14006): miniconda-installer exited; CPU time 6.648000

17:27:54 (14006): wrapper: running /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/python (pre_script.py)

17:28:05 (14006): $PROJECT_DIR/miniconda/bin/python exited; CPU time 7.584000

17:28:05 (14006): wrapper: running /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/psi4 (-n 15 -i psi4.in -o psi4.out)

/var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/psi4: 3: /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/psi4: readlink: not found

/var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/psi4: 9: /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/psi4: /bin/psi4.bin: not found

17:28:06 (14006): $PROJECT_DIR/miniconda/bin/psi4 exited; CPU time 0.000000

17:28:06 (14006): app exit status: 0x7f

17:28:06 (14006): called boinc_finish(195)

Hello,



Error on my computer: Ubuntu MATE 16.04 / kernel 4.13.11 / Ryzen 5 1400



Stderr output



<core_client_version>7.6.31</core_client_version>

<![CDATA[

<message>

process exited with code 195 (0xc3, -61)

</message>

<stderr_txt>

19:00:23 (31619): wrapper (7.7.26016): starting

19:00:23 (31619): wrapper (7.7.26016): starting

19:00:23 (31619): wrapper: running ../../projects/www.gpugrid.net/Miniconda3-4.3.30-Linux-x86_64.sh (-b -f -p /var/lib/boinc-client/projects/www.gpugrid.net/miniconda)

Python 3.6.3 :: Anaconda, Inc.

19:00:33 (31619): miniconda-installer exited; CPU time 8.382948

19:00:33 (31619): wrapper: running /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/python (pre_script.py)

19:03:37 (31619): $PROJECT_DIR/miniconda/bin/python exited; CPU time 63.497739

19:03:37 (31619): wrapper: running /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/psi4 (-n 7 -i psi4.in -o psi4.out)

/var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/psi4: 3: /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/psi4: readlink: not found

/var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/psi4: 9: /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/psi4: /bin/psi4.bin: not found

19:03:38 (31619): $PROJECT_DIR/miniconda/bin/psi4 exited; CPU time 0.002335

19:03:38 (31619): app exit status: 0x7f

19:03:38 (31619): called boinc_finish(195)



</stderr_txt>

]]>





Good luck with the debugging

Dears, all three errors mention a missing "readlink" executable. It is surprising, because it's a fairly basic command, but please check if you can run "readlink" in a terminal. If it is not installed, it should be in the "coreutils" package.

It is installed:

$ readlink --version

readlink (GNU coreutils) 8.26

Copyright (C) 2016 Free Software Foundation, Inc.

License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.



Same here. It is installed readlink version 8.26.

I also have problems getting new WUs on some of my machines. It looks like the ones with an Nvidia card get work, and the ones without do not get anything.

____________



Is there a particular reason this is a CPU application and not a GPU one?

Same here. It is installed readlink version 8.26.





Same here.



NNW until there's a fix.


On Linux CentOS 7.4 it works fine. I suspect that boinc is not able to find or execute the readlink command. Please try executing the following commands:



which readlink

ls -l `which readlink`

sudo -iu boinc bash -c 'which readlink'

sudo -iu boinc bash -c 'ls -l `which readlink`'

sudo -iu boinc readlink /lib/libz.so.1





On my CentOS they return the following results:



# which readlink

/usr/bin/readlink

# ls -l `which readlink`

-rwxr-xr-x. 1 root root 41800 2016-11-05 /usr/bin/readlink

# sudo -iu boinc bash -c 'which readlink'

/bin/readlink

# sudo -iu boinc bash -c 'ls -l `which readlink`'

-rwxr-xr-x. 1 root root 41800 2016-11-05 /bin/readlink

# sudo -iu boinc readlink /lib/libz.so.1

libz.so.1.2.7



____________



# which readlink

/bin/readlink





# ls -l `which readlink`

-rwxr-xr-x 1 root root 43192 Oct 4 20:56 /bin/readlink



The following return nothing

# sudo -iu boinc bash -c 'which readlink'

# sudo -iu boinc bash -c 'ls -l `which readlink`'

# sudo -iu boinc readlink /lib/libz.so.1

Commands do not work for me either.

So, I copied the readlink program to /usr/bin and now it is working on my Ubuntu hosts.

The readlink path is usually /usr/bin, but it depends on the packaging and configuration provided by the distro.



Don't copy the file from /bin to /usr/bin (or wherever):



just create a symlink. If for some reason readlink gets updated, the file you've copied will not be.



$ sudo ln -sf /bin/readlink /usr/bin/readlink



PS: my readlink path is

$ which readlink

/usr/bin/readlink

Toni

Message 48150 - Posted: 11 Nov 2017 | 10:51:23 UTC - in response to Message 48149.
Last modified: 11 Nov 2017 | 10:55:02 UTC

I'll add /bin to the path in the next app update. That may work, unless there is some weird sandboxing thing going on. You shouldn't need to tweak your system: just let them fail (they should fail fast, so no CPU loss).



Concerning why some hosts are not receiving WUs, it's baffling me. It's not a matter of hosts already having GPUs because my own machine does and it did not get tasks. It may be related to the "reliable hosts" classification.

@Daniel: can you list one of your hosts which gets QC tasks and one which doesn't?



Thanks


Hosts which get tasks: 449991, 449992, 391907

Hosts which did not get any: 444456, 452231

____________



Many thanks for this: I look forward to the Windows version!




____________

John

Conan
Joined: 25 Mar 09 | Posts: 25 | Credit: 582,385 | RAC: 0

Message 48154 - Posted: 11 Nov 2017 | 22:48:37 UTC
Last modified: 11 Nov 2017 | 22:51:20 UTC

Two of my computers have received tasks and processed them with no trouble.

Both run Fedora (16 and 21), host ids are 192138 and 189186.

My 8 core (16 thread) computer (running Fedora 25) has yet to receive a task.



Host 192138 is a 6 core computer and Host 189186 is a four core computer.



The 6-core has shorter run times per task and more CPU time than the 4-core.



This is as expected due to core count. However, the 4-core computer gets higher credit per task than the 6-core, which does not make sense.



6-core: around 1,500 sec run time, 8,600 sec CPU time and about 66 credits.



4-core: around 3,200 sec run time, 6,900 sec CPU time and about 85+ credits.



A bit odd perhaps?



Conan

Toni

Message 48155 - Posted: 11 Nov 2017 | 23:04:48 UTC - in response to Message 48154.
Last modified: 11 Nov 2017 | 23:29:30 UTC

Credit assignment logic has historically been problematic (see here) to the point that I am inclined to think it has no best solution. For the time being the credit algorithm is the old default one from BOINC. I think it relies heavily on the self-computed FLOPS, and yes, that seems paradoxical.

I haven't been able to successfully process a WU on my computer. I've received many, but they've all resulted in "Computation error".



See screenshot: https://imgur.com/z0vLkoh




You'll have to try one of the suggestions posted by Daniel or [VENETO] sabayonino above. I'm waiting for more WUs to try myself.

We are not aware of fast and free GPU QM applications. If you know one, let us know.

Toni

Message 48159 - Posted: 12 Nov 2017 | 8:40:07 UTC - in response to Message 48157.
Last modified: 12 Nov 2017 | 9:30:30 UTC

Please do not tweak your system. The current application (QC 3.10) should solve the problem.

We are not aware of fast and free GPU QM applications. If you know one, let us know.



@UF & @UNC developed ANAKIN-ME to create fast, accurate quantum mechanical simulations. See the demo at #SC17 http://nvda.ws/2zyBhKj



https://twitter.com/NVIDIADC



Yes, we have that and it is nice, but limited and not a QM code.

I completed one this morning in Ubuntu.

The new app has 0% failure rate. However, only a handful of hosts are receiving it, for reasons utterly obscure.



This is the only indication I found in the logs:



2017-11-10 20:06:33.9454 [PID=182743] [quota] Overall limits on jobs in progress:

2017-11-10 20:06:33.9454 [PID=182743] [quota] CPU: base 2 scaled 112 njobs 0

2017-11-10 20:06:33.9454 [PID=182743] [quota] GPU: base 2 scaled 0 njobs 0





That "njobs 0" seems to prevent result sending. Any clue hugely appreciated...






The only reading material I can suggest is http://boinc.berkeley.edu/trac/wiki/ProjectOptions#Joblimits, but I imagine you know that already. Remember to read the following 'Job limits (advanced)' section too.

For those interested in controlling the number of threads used by the multicore app, the following app_config.xml entries seem to work.



<app>

<name>QC</name>

<max_concurrent>1</max_concurrent>

</app>

<app_version>

<app_name>QC</app_name>

<plan_class>mt</plan_class>

<avg_ncpus>9</avg_ncpus>

<cmdline>--nthreads 9</cmdline>

</app_version>

The <avg_ncpus> entry tells BOINC the number of threads to reserve for the app.



The <cmdline> entry tells the app the number of threads available for processing.
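A minimal sketch of the app side of that contract (hypothetical; the real QC wrapper may parse its options differently): the app sizes its worker pool only from what arrives on its command line, regardless of how many CPUs BOINC reserved via <avg_ncpus>.

```python
# Hypothetical sketch of how an mt app might honour the <cmdline> flag.
# Not the project's actual wrapper code.
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--nthreads", type=int, default=os.cpu_count())
args = parser.parse_args(["--nthreads", "9"])  # what BOINC passes via <cmdline>
print(args.nthreads)  # the app would size its worker pool from this value
```

This is why both entries are needed: <avg_ncpus> only affects BOINC's scheduling bookkeeping, while the flag actually throttles the app.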

Can anybody comment on the suspend/resume behavior under a variety of conditions (i.e. with and without "keep in memory")? I expect the calculation to restart from scratch, but not crash.

Like many others I don't get any WUs on my linux machines.

____________



Can anybody comment on the suspend/resume behavior under a variety of conditions (i.e. with and without "keep in memory")? I expect the calculation to restart from scratch, but not crash.



When I suspended a task with LAIM (leave applications in memory) on, BOINC manager showed that it was suspended, but the system monitor showed that the task was still busy, using all the threads that were allocated to it.



When I suspended a task with LAIM off, BOINC manager showed that the task was suspended and the task disappeared from the system monitor. When the task was resumed, it restarted from 0 and appears to be running normally.

@captainjack - thanks, appreciated.

I just wanted to report back:

My host ID 420971 gets work and finishes the latest version with success!

My host ID 452211 does not get any work. The message is: "There is no work available." This host does not have a GPU and runs from a USB stick.



Toni

Message 48177 - Posted: 13 Nov 2017 | 16:15:25 UTC - in response to Message 48176.
Last modified: 13 Nov 2017 | 16:21:37 UTC

Working/not working pairs are useful for debugging indeed (if they have the same preferences, that is). It was suggested that it was the presence of a GPU, but there are GPU-less counter-examples, like this. The scheduler is a software nightmare...



I'll resume tests later this week. In the meantime, there are 1000 more CPU WUs (QC310big).

Today is my lucky day. I just enabled the multicore app, and immediately picked up two of them on my i7-3770 machine running Ubuntu 16.04.3 (Linux 4.10.0.38), and BOINC 7.8.3. They run on 7 cores, with one core reserved for GPU support as set by BOINC preferences, not in the app_config (though I use one for other purposes).



However, suspending them does not shut them down with LAIM enabled, as noted before. I have not tried the non-LAIM case.



If it matters, this machine was attached to GPUGrid earlier, and I had run a few GPU work units on the GTX 980, though I am requesting only the CPU work now. But maybe that has something to do with why I am getting them.



EDIT: Also, I have "Run test applications?" enabled, though I don't know if that is necessary in this case.

Conan

Message 48183 - Posted: 13 Nov 2017 | 22:42:44 UTC

My two computers that are getting or have gotten cpu work, have both been connected before.

The new computer I attached does not get work but says "No work available" even when there is plenty.



Conan

OK, thanks @mmonnin.



I've just run

which readlink

followed by

sudo ln -sf /bin/readlink /usr/bin/readlink

and am now waiting for some more WUs.

Do not make symlinks. The problem is already solved.

Since it’s the first time we have a CPU app out, I’ll test the behavior of GPUGRID with a relatively large batch that you will see soon.



I just started reading this thread. I thought I would point out that there was a multi-threaded CPU application back in 2014. It just wasn't necessarily for Quantum Chemistry.

____________



Conan

Message 48198 - Posted: 16 Nov 2017 | 7:21:52 UTC - in response to Message 48192.

Since it’s the first time we have a CPU app out, I’ll test the behavior of GPUGRID with a relatively large batch that you will see soon.



I just started reading this thread. I thought I would point out that there was a multi-threaded CPU application back in 2014. It just wasn't necessarily for Quantum Chemistry.



Yes I ran that one on both Windows 32 bit and Linux 64 bit, which is where nearly all my points came from, as I had to stop GPU use a few years ago so I ran the CPU app instead.



Conan

On a 1950x it's reserving all 32 threads but not running them near the maximum.

It seems to be switching which cores are active - my System Monitor CPU usage chart looks like a long line of infinity symbols.



If you divide the CPU time by the runtime, you'll see an average usage of about seventeen cores. Everything else is going to waste.



Task      WU        Host    Reported (23 Nov 2017)  Run time (s)  CPU time (s)  Credit  Application
16713948  12878079  453935  16:09:15 UTC            680.18        11,586.25      67.70  Quantum Chemistry v3.10 (mt)
16713947  12878078  453935  14:12:17 UTC            761.12        12,984.46     267.57  Quantum Chemistry v3.10 (mt)
16713946  12878077  453935  15:11:46 UTC            702.76        11,639.75

All three were sent 23 Nov 2017 | 12:59:03 UTC and reported "Completed and validated".



PS. It's running at top priority over World Community Grid, but they've got similar deadlines. Is this intentional?

Getting a ton of quantum chemistry tasks on my AWS EC2 p2.xlarge instance.

a47-toni_qc310k-0-1-* are the names of the tasks. Are these the new multicore tasks you talked about? The machine takes a task to 66% in 2 seconds and then sits at that percentage for ~10 minutes.



I think the task stops reporting progress at 66%. A bug? I compiled the BOINC client on the EC2 instance, so it could definitely be user error as well.



Same here stuck at 66%. Will go to lunch and see if it finished in the meanwhile.

They finish about 10-15 minutes after they 'hang' on my EC2 instance.

Here as well! The times vary with thread count and the higher clock frequency on my other computer.

I'm using Ubuntu's bundled system monitor to display CPU usage graphs. That 66% thing is just a bug with the work unit time estimation, but my cores really were gradually rising and falling from 0 to 100%. Like a helix on its side, but with 32 lines.



(It's not thermal throttling.)



If at all possible, consider limiting each multicore app to four cores: almost every modern CPU's thread count divides evenly by four, so we can ensure the highest throughput, as no thread would go to waste.

Toni

Message 48234 - Posted: 23 Nov 2017 | 21:57:07 UTC - in response to Message 48233.
Last modified: 23 Nov 2017 | 22:00:00 UTC

The 66% is due to our using the boinc wrapper for an app which doesn't report its progress. There are three steps in the WU (install, update, compute) and the third is the long one, hence the 2/3.
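In other words, the displayed fraction is just completed wrapper steps over total steps, so it parks at 2/3 for the whole compute phase. A sketch of that reading (not the wrapper's actual code):

```python
# Wrapper-style progress: completed sub-tasks / total sub-tasks. With
# install and update done but compute still running, the client shows ~66%.
steps = ["install", "update", "compute"]
completed = 2  # install and update finished; compute still in progress
progress = completed / len(steps)
print(f"{progress:.0%}")
```

The display stays there until the third, long-running step finishes, at which point it jumps straight to 100%.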



If I figure out how, I'll try to limit the number of CPUs requested. I think the client has some control over it as well.

Petr Kriz
Joined: 22 Feb 09 | Posts: 3 | Credit: 114,900 | RAC: 0

Message 48235 - Posted: 23 Nov 2017 | 22:46:53 UTC

Just tried to run a few tasks and I am still getting the same error:



<core_client_version>7.6.22</core_client_version>

<![CDATA[

<message>

process exited with code 195 (0xc3, -61)

</message>

<stderr_txt>

23:27:04 (6871): wrapper (7.7.26016): starting

23:27:04 (6871): wrapper (7.7.26016): starting

23:27:04 (6871): wrapper: running ../../projects/www.gpugrid.net/Miniconda3-4.3.30-Linux-x86_64.sh (-b -f -p /var/lib/boinc/projects/www.gpugrid.net/miniconda)

Python 3.6.3 :: Anaconda, Inc.

23:33:01 (6871): task miniconda-installer reached time limit 360

23:33:01 (6871): wrapper: running /var/lib/boinc/projects/www.gpugrid.net/miniconda/bin/python (pre_script.py)

Traceback (most recent call last):

File "pre_script.py", line 1, in <module>

import conda.cli

ModuleNotFoundError: No module named 'conda'

23:33:02 (6871): $PROJECT_DIR/miniconda/bin/python exited; CPU time 0.025285

23:33:02 (6871): app exit status: 0x1

23:33:02 (6871): called boinc_finish(195)



</stderr_txt>

]]>



Any idea how to solve it?

This one hang for about 6 hours:

http://www.gpugrid.net/result.php?resultid=16717461

Since I had 100% errors (Message 48156 - Posted: 12 Nov 2017 | 2:36:31 UTC) on my first batch of these CPU tasks, I created a symlink as instructed, then deleted the symlink as subsequently instructed, but I have never received a single task since my 12 Nov 2017 post.

OK, we will start production mode next week. Unfortunately we will need more than 50x the current number of CPUs, but it is just the start now, so it is ok.



gdf




It is pretty typical of multithreaded apps (of any BOINC project) that they do not scale well past 4-8 cores. I typically use an app_config to limit mt apps like LHC, Cosmology, yafu, etc. to 4 cores.








You will need a windows app for this.







Since I had 100% errors (Message 48156 - Posted: 12 Nov 2017 | 2:36:31 UTC) on my first batch of these CPU tasks, I created a symlink as instructed, then deleted the symlink as subsequently instructed, but I have never received a single task since my 12 Nov 2017 post.





Same here ...

I received some yesterday on a new install of Ubuntu 17.10. No symlink or anything and they completed.

If you need that many CPUs, you will definitely need a Windows app.

Will the app name stay "*QC309big*", or will it change for the real stuff? Then we might make an app_config file, or better still, might you propose an app_config file to limit CPU cores per workunit to X cores?



@PappaLitto: I am quite happy with a Linux-only app!



It is time to make some Linux USB sticks (16 GB USB 3.0, 10 USD): I work with Lubuntu 17.10 on various computers I do not use, and it works great! Or try BOINCOS v2.0 Beta Release, where everything is pre-configured.

Making a windows app will probably need one of the following two solutions. Neither is perfect (by far).



* The "Windows Subsystem for Linux" from Microsoft. It's unfortunately W10 only (as far as I can tell), and probably we'd be the first BOINC project to use it (=headaches).

* A VirtualBox app. Its downsides are known I think.



By the way, question for the gurus: when you run a vbox app, is virtualbox automatically installed on your system?

By the way, question for the gurus: when you run a vbox app, is virtualbox automatically installed on your system?

No. The user has to install it themselves, and usually some VBox extensions are recommended as well.



There are two ways of installing VBox for Windows:



1) Via a combined single-click installer for both VBox and BOINC, available from BOINC. The simplicity is attractive, but there are downsides - there is no control over e.g. installation location, and the version of VBox included is usually several steps behind the current release.



2) Direct from the Oracle VBox site. BOINC will still recognise this - there's no special BOINC code in the combined VBox installer.



Any VBox extensions desired will always have to be downloaded from Oracle. There may be other adjustments required to the host computer, such as enabling virtualisation in the BIOS, which might be unfamiliar to the casual user.

klepel asked:



So we might make a app_config file or better still, might you propose an app_config file to limit cpu cores per work-unit to X cores.



<app_config>

<app>

<name>acemdlong</name>

<max_concurrent>2</max_concurrent>

<gpu_versions>

<gpu_usage>1</gpu_usage>

<cpu_usage>2</cpu_usage>

</gpu_versions>

</app>

<app>

<name>acemdshort</name>

<max_concurrent>2</max_concurrent>

<gpu_versions>

<gpu_usage>1.0</gpu_usage>

<cpu_usage>2</cpu_usage>

</gpu_versions>

</app>

<app>

<name>QC</name>

<max_concurrent>1</max_concurrent>

</app>

<app_version>

<app_name>QC</app_name>

<plan_class>mt</plan_class>

<avg_ncpus>4</avg_ncpus>

<cmdline>--nthreads 4</cmdline>

</app_version>

</app_config>



This will limit the QC (quantum chemistry) app to 4 threads per task and a maximum of 1 task at a time. You can adjust to your preferences.



Hope that helps.

Thanks to both!

I think if the CPU app goes to VBox for Windows/Linux, there will be less user uptake than with the native Linux app.



More than one concurrent task will need to be allowed for efficient CPU usage.

Even when you get the windows app going, it looks like you're still going short on the number of crunchers by more than half, based on the server status page which shows currently 821 users crunching long units in the last 24 hours, while 34 are crunching quantum chemistry. (821/34 = 24.15)



So, in order to meet 50x, you will eventually have to create a multi CPU-GPU app.



Quantum chemistry has a long way to go.



In the meantime, you can't make the Windows app too difficult for the crunchers to set up - most of us are not computer gurus, and you would end up with only a few more crunchers.



This is a big undertaking. Good luck guys!!


Even when you get the Windows app going, it looks like you'll still be well short on crunchers: the server status page currently shows 821 users crunching long units in the last 24 hours, while only 34 are crunching quantum chemistry (821/34 ≈ 24).



So, in order to meet 50x, you will eventually have to create a multi CPU-GPU app.



I am all in favor of GPU, but as noted on many project forums, it doesn't work for most problems. But don't write off Linux on the CPU yet. It is just in the startup phase. I have even taken my machines off until the production version is released. Once the word gets around (be sure to post a note on the BOINC forum), you will get lots of help. And CPUs are getting more cores all the time.

Even when you get the Windows app going, it looks like you'll still be well short on crunchers: the server status page currently shows 821 users crunching long units in the last 24 hours, while only 34 are crunching quantum chemistry (821/34 ≈ 24).



So, in order to meet 50x, you will eventually have to create a multi CPU-GPU app.



Quantum chemistry has a long way to go.



In the meantime, you can't make the Windows app too difficult for the crunchers to set up - most of us are not computer gurus, and you would end up with only a few more crunchers.



This is a big undertaking. Good luck guys!!



Oh, there will be many more if there is consistent work; the inconsistency of GPU work pushes many away. Compare the support for POGS vs. Duchamp: similar projects, but Duchamp requires VBox.

I would say don't mess with a VirtualBox application if it would replace the Linux application - too many headaches. If someone is running Windows, they could easily set up their own VirtualBox VM and run the standard app under Linux. Win-win for everyone that way, and it gives the user more control over the VM. Just my thoughts on it. Also, more and more people are migrating to Windows 10, and it is the direction all new machines are following, so you might as well prepare for the future.

____________



I would say don't mess with a VirtualBox application if it would replace the Linux application - too many headaches. If someone is running Windows, they could easily set up their own VirtualBox VM and run the standard app under Linux. Win-win for everyone that way, and it gives the user more control over the VM. Just my thoughts on it. Also, more and more people are migrating to Windows 10, and it is the direction all new machines are following, so you might as well prepare for the future.

My guess is that they could leave the Linux application as it is and just add a VirtualBox application for Windows. I have had no particular problems with VBox on either Windows or Linux machines recently; I run LHC, Cosmology and sometimes others on it. I would prefer that they set it up so that I don't have to configure my own machine. All you really have to do first is ensure that running a virtual machine is enabled in your BIOS. A good primer is on the Cosmology site:

http://www.cosmologyathome.org/faq.php#vtx



There is a much more elaborate checklist (if you need it) by Yeti on LHC:

https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4161&postid=29359#29359



After that, you just install VirtualBox and attach to the project. It is all set up from there.



There is actually more to it than that. I have run every VM project as well, on a whole slew of different hardware and software setups. There is a big reason why those projects do not get much support. LHC only has a large user base now because it merged with the original SixTrack project; even still, the VirtualBox applications remain less popular. Keep in mind also that these projects have problems with new releases of VirtualBox, as they do right now; if I make my own VM using the latest release, it does not suffer the same. The only advantage of a VBox application is that it lets the scientists compile a single application more easily. That may sound great to them, but the time lost by end users far exceeds the time they save.



Also, for reference if it helps: GPUGrid attempted VBox applications back in 2014; discussion started in 2013: http://www.gpugrid.net/forum_thread.php?id=3542#33874

____________



If you have big problems with it, don't run it - but that is not a reason they should not offer it.



LHC would not exist in its present form without VirtualBox; neither would Cosmology and others.



And if you prefer to set up your own VirtualBox machine, you can still do that in order to run the Linux version from a Windows machine in any case. I think you are arguing the wrong point.

I'm actually arguing for keeping the Linux version rather than replacing it. Telling me not to bother because you like VBox isn't acceptable to me. You play it off like those apps run great because you have had little trouble with them; scour their forums and you'll find the average user does not agree. You are right that LHC would not be in its present form - it would still be SixTrack running traditional work, and the others would do it in-house or eventually adapt differently; Cosmology would just be down one application. I don't see how that is relevant. Either way, my vote is not to embrace VirtualBox if it means pulling non-VirtualBox work.

____________



I'm actually arguing for keeping the Linux version rather than replacing it. Telling me not to bother because you like VBox isn't acceptable to me. You play it off like those apps run great because you have had little trouble with them; scour their forums and you'll find the average user does not agree. You are right that LHC would not be in its present form - it would still be SixTrack running traditional work, and the others would do it in-house or eventually adapt differently; Cosmology would just be down one application. I don't see how that is relevant. Either way, my vote is not to embrace VirtualBox if it means pulling non-VirtualBox work.

I won't bother responding to fiction.

Good. Just don't be delusional in the process...

____________



Also, if this is being seriously considered, keep in mind that the latest 5.2 releases of VirtualBox need updated vboxwrapper builds to be provided.



https://www.rechenkraft.net/forum/viewtopic.php?f=75&t=16780&start=12



http://www.cosmologyathome.org/forum_thread.php?id=7517#21579



https://sourcefinder.theskynet.org/duchamp/forum_thread.php?id=229#864



I can confirm that the one box I upgraded currently fails every work unit in my testing at Source Finder.

____________



I'm actually arguing for keeping the Linux version rather than replacing it. Telling me not to bother because you like VBox isn't acceptable to me. You play it off like those apps run great because you have had little trouble with them; scour their forums and you'll find the average user does not agree. You are right that LHC would not be in its present form - it would still be SixTrack running traditional work, and the others would do it in-house or eventually adapt differently; Cosmology would just be down one application. I don't see how that is relevant. Either way, my vote is not to embrace VirtualBox if it means pulling non-VirtualBox work.



I agree that VBox projects/apps get much less support; the numbers bear that out. It's even more evident during competitions, when people who don't already have VBox set up to run with BOINC just end up running the non-VBox apps.



LHC may not exist without VBox - maybe they wanted to keep their stuff secret, or whatever. They could definitely get more support if the rest of the apps were not VBox.



I received two tasks today and they both worked. Two at once as well.

Multiple tasks at once are the way we intend to go for QC (consistent with your preferences of course). The idea is to limit the number of cores to 4, and the BOINC client should manage the available capacity.

Multiple tasks at once are the way we intend to go for QC (consistent with your preferences of course). The idea is to limit the number of cores to 4, and the BOINC client should manage the available capacity.



I wasn't at home to see them run; I just happened to notice I had two tasks on my account page. They definitely used more than 4 cores - looks like a little more than 8 threads.

Run time    CPU time     Credit
3,745.85    30,947.64    138.80
3,501.03    30,948.79    129.71



They completed an hour apart. Another client was running some other CPU work, so the run time could have been better.
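For what it's worth, the ratio of CPU time to wall-clock run time gives a rough estimate of how many threads each task actually kept busy on average. A back-of-the-envelope check using the two tasks reported above:

```python
# (run_time_s, cpu_time_s) pairs for the two completed tasks above.
tasks = [(3745.85, 30947.64), (3501.03, 30948.79)]

for run, cpu in tasks:
    # CPU time / wall time ~= average number of busy threads,
    # assuming the task wasn't starved by other CPU work.
    print(round(cpu / run, 2))
# Prints 8.26 and 8.84 - a little more than 8 threads each,
# consistent with the observation above.
```

This is only an average over the whole run; if the other client's CPU work slowed the wall-clock time, the true thread count could be somewhat lower.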

[VENETO] boboviz

Send message

Joined: 10 Sep 10

Posts: 142

Credit: 388,132

RAC: 0

Level



Scientific publications

Message 48333 - Posted: 10 Dec 2017 | 19:25:18 UTC

No news about the CPU WUs?

No news about the CPU WUs?



Looks like I received some tasks today - I can see them in my task list. None have completed yet.