The parallel GC synchronises using pure spinlocks, which leads to a severe drop in performance when one thread is descheduled, because the remaining GC threads keep spinning until it runs again. This is the main cause of the "last core parallel slowdown": using a -N value that matches the number of cores in the machine can be slower than using one fewer. The effect seems to be quite bad on Linux; reports suggest it is less of an issue on OS X.
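
To make the cost concrete, here is a minimal sketch of a pure spinlock using C11 atomics. It is not the RTS code; the SpinLock type and the spin_* function names are hypothetical and chosen only for illustration. The point is that a waiter in spin_acquire never yields to the OS, so if the holder is descheduled the waiter burns its whole timeslice making no progress.

```c
#include <stdatomic.h>

/* Hypothetical illustration only; not the lock type used by the RTS. */
typedef struct {
    atomic_flag locked;
} SpinLock;

static void spin_init(SpinLock *l) {
    atomic_flag_clear(&l->locked);
}

/* Returns 1 if the lock was taken, 0 if it was already held. */
static int spin_try_acquire(SpinLock *l) {
    return !atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire);
}

/* Busy-waits until the lock is free. The waiting thread never blocks in
 * the kernel, so it keeps occupying a core while the holder is not running. */
static void spin_acquire(SpinLock *l) {
    while (!spin_try_acquire(l)) {
        /* spin */
    }
}

/* Any thread may clear the flag; there is no notion of an owning thread. */
static void spin_release(SpinLock *l) {
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

Note that release carries no ownership check, which is what makes unlocking from a different thread possible with this kind of lock.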

Switching to mutexes would help, but it isn't easy: we sometimes unlock these locks from a different thread than the one that locked them, and standard mutexes don't allow that. The locks in question are mut_spin and gc_spin in the GcThread structure.
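
One possible direction, sketched here purely as an assumption and not as what the RTS does, is a blocking lock built from a pthread mutex and condition variable. The inner mutex only guards a flag and is always locked and unlocked by the same thread, so the logical lock can be released by any thread. The HandoffLock type and the handoff_* names are hypothetical.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical illustration only; not part of the RTS. */
typedef struct {
    pthread_mutex_t guard;     /* protects 'held'; never held across calls */
    pthread_cond_t  released;  /* signalled when 'held' becomes false */
    bool            held;
} HandoffLock;

static void handoff_init(HandoffLock *l) {
    pthread_mutex_init(&l->guard, NULL);
    pthread_cond_init(&l->released, NULL);
    l->held = false;
}

static void handoff_destroy(HandoffLock *l) {
    pthread_cond_destroy(&l->released);
    pthread_mutex_destroy(&l->guard);
}

/* Blocks in the kernel (rather than spinning) while the lock is held. */
static void handoff_acquire(HandoffLock *l) {
    pthread_mutex_lock(&l->guard);
    while (l->held)
        pthread_cond_wait(&l->released, &l->guard);
    l->held = true;
    pthread_mutex_unlock(&l->guard);
}

/* May be called from any thread, not only the one that acquired. */
static void handoff_release(HandoffLock *l) {
    pthread_mutex_lock(&l->guard);
    l->held = false;
    pthread_cond_signal(&l->released);
    pthread_mutex_unlock(&l->guard);
}

static void *releaser(void *arg) {
    handoff_release((HandoffLock *)arg);
    return NULL;
}

/* Acquire on the calling thread, release from another thread, then
 * re-acquire. Returns 0 on success, -1 if the thread could not be created. */
static int demo_cross_thread_release(void) {
    HandoffLock l;
    pthread_t t;
    handoff_init(&l);
    handoff_acquire(&l);
    if (pthread_create(&t, NULL, releaser, &l) != 0) {
        handoff_release(&l);
        handoff_destroy(&l);
        return -1;
    }
    handoff_acquire(&l);   /* blocks until the releaser thread has run */
    pthread_join(t, NULL);
    handoff_release(&l);
    handoff_destroy(&l);
    return 0;
}
```

The trade-off is that every acquire and release now goes through a mutex, which is likely slower than a spinlock when all GC threads are actually running; whether that cost is acceptable would need measuring.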