mackerel's Ryzen testing


Might as well start this thread, as my CPU and mobo just arrived. The next 5.5 hours or so might be the longest at work before I can go home and assemble it.



CPU: Ryzen 7 1700

Mobo: Asus Prime B350M-A (I also have the X370 on order, expected Monday)

Ram: 2x4GB Corsair Vengeance LPX 3000

PSU: Corsair CX550M

GPU: R9 280X

SSD: Kingston V300 240GB



As preparation, I have been running real PrimeGrid tasks on my various systems, in SGS and GCW sieve.



SGS will be used as an indicator of single thread LLR IPC, and I'll back that up with Prime95 benchmarks later to include ram limited scenarios.



GCW sieve will be used as an indicator of HT/SMT benefit. Before Michael reminds me again, I know the units can be different in runtime but I hope an average of many will be "good enough".



Prime95 will be used to investigate the cache/RAM bandwidth influence. Early reports are that the memory system is not well tuned at all yet, and running high speed memory may be a challenge for all but select highest end boards. It is hoped that a future BIOS update will let lesser boards improve too, as memory has not been a focus of optimisation up to now. Also of concern are reports that the two parts of the L3 cache are only joined by a relatively low bandwidth connection, so I'll also see if running it as "two quads" might give better performance than a single oct.



I'll probably throw in a Cinebench R15 too since everyone does that...



First step would be to look at how clock varies under normal use, and as appropriate, set a fixed clock to perform testing. Overclocking will be investigated separately.



It is not my intention for this to be a general Ryzen bench thread, so I'd like to keep it to my own results.

Like so many things, it appears simple, yet proves more difficult once you try to apply it in practice.



The system is now set up and running at stock settings, in particular with the RAM at 2133. The all-core turbo clock seems to be within tolerance of 3.2 GHz. I can't find a way to turn off SMT in the BIOS.



With the above in mind, I've done a custom Prime95 benchmark which I need to process before showing.



While I do that, I thought I'd run BOINC and put it on SGS. Oh dear... I did what I normally do for Intel: set it to 50% of CPU, and manually set affinity for the BOINC client to even cores. I let it grab 8 units and they were all trying to run on two cores. OK, let's undo affinity. Now it was using the first 10 threads. I don't know if it is like Intel or not, where the SMT pairings are adjacent to each other, in which case running the first 10 threads would only really be stressing 5 cores. Right, I've gone for the worst-case situation: one task per CPU thread. Now it is 100% loaded. Estimates are fluctuating a bit, but are around the 45 minute mark. Assuming no benefit to LLR from SMT, that puts the per-core equivalent time at about 22 minutes. I have a Haswell at 3.2 GHz, and that takes about 11 minutes. So my prediction of Ryzen having about half the LLR IPC of Intel seems to hold up.
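The per-core estimate above can be written out as a short calculation. This is only a sketch of the arithmetic using the figures quoted in the post; the variable names are made up for illustration, and it assumes the Haswell figure was taken with one task per core.

```python
# Figures quoted in the post. "No benefit to LLR from SMT" is the post's own
# assumption; one task per core on the Haswell is assumed here.
ryzen_task_minutes = 45.0     # estimated time per SGS task, one task per thread
threads_per_core = 2          # SMT on: two tasks share each core
haswell_task_minutes = 11.0   # Haswell at 3.2 GHz

# If SMT gives LLR nothing, a core finishing two tasks every 45 minutes is
# equivalent to it finishing one task every 22.5 minutes.
ryzen_per_core_minutes = ryzen_task_minutes / threads_per_core

# Both chips run at about 3.2 GHz, so the time ratio approximates the IPC ratio.
relative_ipc = haswell_task_minutes / ryzen_per_core_minutes

print(f"Ryzen per-core equivalent: {ryzen_per_core_minutes:.1f} min")
print(f"Ryzen LLR IPC relative to Haswell: {relative_ipc:.2f}")
```

Since the task estimates were still fluctuating, the ratio is a rough indication rather than a measurement.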



As a comment on the hardware, the 1700 comes with the Wraith Spire cooler, which reminds me of the taller round heatsinks Intel used to do. It does have RGB lighting on it, but the fan on mine doesn't seem to be a great sample and ticks when it rotates. Its cooling was unremarkable. Running the Prime95 small FFT stress test, it hit around 70C. After replacing it with a Noctua U14S it is 45C. Also, I had the old-school fun of the CPU sticking to the HSF when I pulled it off. One point to Intel's socket design.



After the LLR run is complete I'll put it on sieve where I speculate it might be more competitive.





First project-based results are in... for GCW sieve, it seems Ryzen has higher IPC than the Intel CPUs tested.



As a comment on the results presentation, the no HT results are just that, HT was turned off so this is the performance of the core directly. With HT/SMT on, there are two threads per core.
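As a rough illustration of how HT/SMT on and off results can be put on the same per-core footing, here is a sketch that turns average task runtimes into tasks per core per day. The function name and the runtimes are hypothetical, made up for the example; they are not the measured results, and the exact method behind the charts isn't stated here.

```python
from statistics import mean

def tasks_per_core_per_day(runtimes_s, threads_per_core):
    """Throughput of one physical core, given per-task runtimes in seconds."""
    seconds_per_day = 24 * 60 * 60
    # Each thread finishes one task per mean runtime, and a core runs
    # threads_per_core of them at once.
    return threads_per_core * seconds_per_day / mean(runtimes_s)

# Hypothetical runtimes, for illustration only (NOT measured results).
no_smt_runtimes = [1000.0, 1020.0, 980.0]   # one task per core
smt_runtimes = [1500.0, 1530.0, 1470.0]     # two tasks per core, each slower

no_smt = tasks_per_core_per_day(no_smt_runtimes, threads_per_core=1)
smt = tasks_per_core_per_day(smt_runtimes, threads_per_core=2)

print(f"no SMT: {no_smt:.1f} tasks/core/day")
print(f"SMT:    {smt:.1f} tasks/core/day")
print(f"SMT benefit: {smt / no_smt - 1:.0%}")
```

Averaging over many units is what smooths out the small runtime differences between individual tasks.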



I still need to find a setting to allow me to turn off SMT.

Looking at this SGS task 783697707, it seems BOINC does not use FMA3.

Can you post the BOINC Manager's "Processor features:" line?



Edit:



Start "llr -m" -> 4. Options/CPU

and have a look at what the LLR program thinks about Ryzen

____________

144052 *5^2018290+1 is Prime!




The app doesn't care what BOINC sees or thinks the processor is capable of. It decides on its own whether or not to use AVX.



I'm pretty sure with gwnum (and therefore both LLR and PFGW), existing versions never use AVX or FMA4 on AMD CPUs. Although perhaps George may have changed this behavior recently. If that hasn't changed, then the current LLR simply won't use AVX on AMD processors.



Since neither gwnum nor LLR is developed at PrimeGrid, further questions about this should be asked on the Mersenne forums, since that's where the answers will be.



On the other hand, Genefer (the CPU version) WILL use both AVX and FMA4 on AMD chips, although up until now it hasn't really helped much. If you want to see what Ryzen can do, try running Genefer at a B range suitable for using the 64 bit transforms.

____________

My lucky number is 75898524288+1





Cinebench R15 IPC. Past testing showed this was only minimally affected by RAM configuration, so consider these results to have a tolerance of a few % or so. An aggressively tuned system might do better than indicated. I also noted that CB has been updated since I ran the earlier tests on Intel systems. On Ryzen, the latest version ran about 0.6% faster in single core, and 0.3% faster in multi-core than the older version, and I might have mixed the results above but it makes little difference. In this task, the non-SMT IPC performance is in the ballpark of Haswell, but multi-thread is comparable to Skylake. This agrees with observations made by many reviewers that SMT scaling seems better than HT.



I will get back on LLR/genefer performance later. I did note Prime95 was picking K10, so it isn't surprising for LLR to pick the same.



I've edited it down a bit, but following is part of a bench I tried with the try all CPU settings or whatever it is called on P95.



FFTlen=1024K, Type=3, Arch=6, Pass1=256, Pass2=4096, clm=2 (16 cpus, 16 workers): Throughput: 644.48 iter/sec.

FFTlen=1024K, Type=3, Arch=5, Pass1=256, Pass2=4096, clm=1 (16 cpus, 16 workers): Throughput: 630.11 iter/sec.

FFTlen=1024K, Type=2, Arch=6, Pass1=256, Pass2=4096, clm=2 (16 cpus, 16 workers): Throughput: 615.92 iter/sec.

FFTlen=1024K, Type=2, Arch=5, Pass1=256, Pass2=4096, clm=0 (16 cpus, 16 workers): Throughput: 616.99 iter/sec.



I don't yet know what's what above, but with about 5% difference between them it isn't going to radically change the performance if it didn't pick the best one. Edit: then again, at that FFT size it should be ram limited, let me retry quickly at smaller.

FFTlen=64K, Type=3, Arch=6, Pass1=256, Pass2=256, clm=1 (16 cpus, 16 workers): Throughput: 19204.32 iter/sec.

FFTlen=64K, Type=3, Arch=5, Pass1=256, Pass2=256, clm=0 (16 cpus, 16 workers): Throughput: 16976.83 iter/sec.

FFTlen=64K, Type=2, Arch=6, Pass1=256, Pass2=256, clm=1 (16 cpus, 16 workers): Throughput: 18210.76 iter/sec.

FFTlen=64K, Type=2, Arch=5, Pass1=256, Pass2=256, clm=0 (16 cpus, 16 workers): Throughput: 16139.61 iter/sec.



Here's similar for 64K FFT. By eyeball that's under 20% difference between best and worst; I still need to work out which is which.
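To put numbers on the eyeballed differences, here is a small sketch that parses the Prime95 lines quoted above and reports the gap between the best and worst setting at each FFT size. The regex and the function name are my own for illustration, not anything Prime95 provides.

```python
import re

# Benchmark lines copied from the Prime95 output quoted in this thread.
BENCH_LINES = """
FFTlen=1024K, Type=3, Arch=6, Pass1=256, Pass2=4096, clm=2 (16 cpus, 16 workers): Throughput: 644.48 iter/sec.
FFTlen=1024K, Type=3, Arch=5, Pass1=256, Pass2=4096, clm=1 (16 cpus, 16 workers): Throughput: 630.11 iter/sec.
FFTlen=1024K, Type=2, Arch=6, Pass1=256, Pass2=4096, clm=2 (16 cpus, 16 workers): Throughput: 615.92 iter/sec.
FFTlen=1024K, Type=2, Arch=5, Pass1=256, Pass2=4096, clm=0 (16 cpus, 16 workers): Throughput: 616.99 iter/sec.
FFTlen=64K, Type=3, Arch=6, Pass1=256, Pass2=256, clm=1 (16 cpus, 16 workers): Throughput: 19204.32 iter/sec.
FFTlen=64K, Type=3, Arch=5, Pass1=256, Pass2=256, clm=0 (16 cpus, 16 workers): Throughput: 16976.83 iter/sec.
FFTlen=64K, Type=2, Arch=6, Pass1=256, Pass2=256, clm=1 (16 cpus, 16 workers): Throughput: 18210.76 iter/sec.
FFTlen=64K, Type=2, Arch=5, Pass1=256, Pass2=256, clm=0 (16 cpus, 16 workers): Throughput: 16139.61 iter/sec.
"""

PATTERN = re.compile(
    r"FFTlen=(\d+K), Type=(\d+), Arch=(\d+), .*Throughput: ([\d.]+) iter/sec"
)

def spread_by_fft(text):
    """Map FFT length -> (best setting, worst setting, fractional gap)."""
    rows_by_fft = {}
    for fft, typ, arch, throughput in PATTERN.findall(text):
        label = f"Type={typ}, Arch={arch}"
        rows_by_fft.setdefault(fft, []).append((float(throughput), label))
    summary = {}
    for fft, rows in rows_by_fft.items():
        best, worst = max(rows), min(rows)
        summary[fft] = (best[1], worst[1], best[0] / worst[0] - 1)
    return summary

summary = spread_by_fft(BENCH_LINES)
for fft, (best, worst, gap) in summary.items():
    print(f"{fft}: best {best}, worst {worst}, gap {gap:.1%}")
```

On the figures above this gives a gap of a little under 5% at 1024K and about 19% at 64K, with Type=3, Arch=6 fastest in both cases.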





OK, I was surprised by this result. I should have checked HT earlier... GFN13 tasks gain a nice benefit from HT. Ryzen comes out at about 55% of Skylake IPC.



Due to an oversight I forgot to run on Broadwell but I wouldn't expect much difference between it and other Intel CPUs.



With that, I've got other things to do on the weekend so don't expect any more updates until late Sunday at the earliest.





TRP sieve results, similar story to GCW sieve although not identical.







The testing with Prime95 needs a little caution, as there is no specific Ryzen code optimisation for it, and it runs the old AMD code. There may be future benefit from optimisation for Ryzen.



Main takeaway points are:

There is a benefit from having faster RAM.

There wasn't an obvious benefit from going 2R over 1R; even with Intel the gap was smaller than previously seen.

Peak performance holds up to about 2MB per core, so it doesn't look like you can add the L2 and L3 together, as AMD does when quoting the effective cache.





A BIOS update for the mobo has added an option to turn SMT on and off, so I've added the result for Ryzen SMT off to GCW sieve. I'm making the assumption that units haven't changed since I ran the earlier tests, and haven't verified that.



Results are not surprising as we see a similar pattern to elsewhere. Non-SMT IPC is below Skylake, but overtakes it with SMT.

I'm making the assumption that units haven't changed since I ran the earlier tests, and haven't verified that.



Task sizes have not changed YET. They will get larger (probably 4x larger) sometime in the next 2 or 3 weeks.



Note that, unlike most sieves, GCW task sizes actually vary slightly because each base's tasks are slightly different. However, this difference is, by design, very small, so it's not going to significantly affect your statistics.



Also note that with any sieve, the tasks do speed up slightly over time. This might cause a slight bias in the statistics if they're taken at significantly different points in the sieve. Along the same lines, tasks with newer sieve files may run slightly faster than older tasks, and this may also cause a bias in the statistics. Jim and I were talking in the last few days about releasing new GCW sieve files, but I don't think that's been done just yet. The current sieve files appear to be from December 31st.

____________

My lucky number is 75898524288+1

Results are not surprising as we see a similar pattern to elsewhere. Non-SMT IPC is below Skylake, but overtakes it with SMT.

This is kinda odd, though. Single thread IPC being lower is expected, but with SMT/HT on, you'd expect the same % of boost for both, still having Ryzen below Skylake. Having it on equal footing (or even better, in this case) is extremely surprising in a Task per core per day chart... smells like OS overhead or something.



Not saying your results are wrong, it just sounds illogical to me.




It may be that AMD simply have more (non-FP) execution potential in Ryzen, but it isn't utilised with a single thread per core. Through SMT, it is able to make use of that available power.



To check whether running at different times has affected the results, I'll run a few units through a 6600K and see if they match the earlier numbers.