Analysis IBM boasts that machine learning is not just quicker on its POWER servers than on TensorFlow in the Google Cloud, it's 46 times quicker.

Back in February Google software engineer Andreas Sterbenz wrote about using Google Cloud Machine Learning and TensorFlow on click prediction for large-scale advertising and recommendation scenarios.

He trained a model to predict display ad clicks on the Criteo Labs click logs, which are over 1TB in size and contain feature values and click feedback from millions of display ads.

Data pre-processing (60 minutes) was followed by the actual learning, using 60 worker machines and 29 parameter machines for training. The model took 70 minutes to train, with an evaluation loss of 0.1293. We understand this is a rough indicator of result accuracy.

Sterbenz then tried different modelling techniques to reduce the evaluation loss, all of which took longer. He ended up with a deep neural network trained for three epochs (an epoch being one pass in which every training vector is used once to update the weights), which took 78 hours.

But IBM wasn't interested in that, wanting to show that its own training framework, running on POWER9 servers plus GPUs, could outperform the Google Cloud Platform's 89 machines on the basic initial training.

Thomas Parnell and Celestine Dünner at IBM Research in Zurich used the same source data – Criteo Terabyte Click Logs, with 4.2 billion training examples and 1 million features – and the same ML model, logistic regression, but a different ML library. It's called Snap Machine Learning.
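For readers who want a feel for the model both camps trained, here is a minimal sketch of logistic regression on sparse features, written in plain Python. It is not Snap ML or TensorFlow code: the function names, the toy data, the learning rate, and the use of plain stochastic gradient descent are all assumptions made for illustration, as is treating the "evaluation loss" as the usual logarithmic loss.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, features):
    # features is a sparse example: {feature_index: value}
    z = sum(weights.get(i, 0.0) * v for i, v in features.items())
    return sigmoid(z)

def train(data, epochs=50, lr=0.5):
    # Hypothetical trainer: plain SGD, one weight update per example.
    # One epoch is one full pass over the training data.
    weights = {}
    for _ in range(epochs):
        for features, label in data:
            error = predict(weights, features) - label
            for i, v in features.items():
                weights[i] = weights.get(i, 0.0) - lr * error * v
    return weights

def log_loss(weights, data):
    # Average logarithmic loss; lower is better.
    eps = 1e-12
    total = 0.0
    for features, label in data:
        p = min(max(predict(weights, features), eps), 1 - eps)
        total -= label * math.log(p) + (1 - label) * math.log(1 - p)
    return total / len(data)

# Toy stand-in for click logs: feature 0 tends to mean "clicked",
# feature 1 "not clicked"; features 2 and 3 carry no signal.
toy_data = [
    ({0: 1.0, 2: 1.0}, 1),
    ({1: 1.0, 2: 1.0}, 0),
    ({0: 1.0, 3: 1.0}, 1),
    ({1: 1.0, 3: 1.0}, 0),
]

weights = train(toy_data)
loss = log_loss(weights, toy_data)
print(f"log loss after training: {loss:.4f}")
```

An untrained model that predicts 0.5 for everything scores a log loss of about 0.693 on this toy set, so anything below that means the weights have learned something.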

They ran their session using Snap ML running on four Power System AC922 servers, meaning eight POWER9 CPUs and 16 Nvidia Tesla V100 GPUs. Instead of taking 70 minutes, it completed in 91.5 seconds, 46 times faster.
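The 46x figure is easy to check from the two run times quoted above. A quick sketch, using only the numbers in this article (the variable names are ours):

```python
# Run times quoted in the article.
tensorflow_seconds = 70 * 60   # 70 minutes on Google Cloud Platform
snap_ml_seconds = 91.5         # Snap ML on four AC922 servers

speedup = tensorflow_seconds / snap_ml_seconds
print(f"speedup: {speedup:.1f}x")  # prints "speedup: 45.9x"
```

That rounds to the 46x IBM quotes. Note that it compares training time only and leaves out the 60 minutes of data pre-processing in the Google run.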

They prepared a chart showing their Snap ML, the Google TensorFlow and three other results:

A 46x speed improvement over TensorFlow is not to be sneezed at. What did they attribute it to?

They say Snap ML features several hierarchical levels of parallelism to partition the workload among different nodes in a cluster, takes advantage of accelerator units, and exploits multi-core parallelism on the individual compute units.

First, data is distributed across the individual worker nodes in the cluster. Second, on a node, data is split between the host CPU and the accelerating GPUs, with CPUs and GPUs operating in parallel. Third, data is sent to the multiple cores in a GPU, and the CPU workload is multi-threaded.

Snap ML has nested hierarchical algorithmic features to take advantage of these three levels of parallelism.
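The three levels can be pictured as nested data partitioning. The sketch below is an illustration only and is not drawn from Snap ML: the function names, the equal-sized shares for the CPU and each GPU, and the thread count are all assumptions. The defaults of four nodes with four GPUs each mirror the 16-GPU setup described above.

```python
def split(items, parts):
    """Split a list into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for p in range(parts):
        end = start + size + (1 if p < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

def partition(examples, nodes=4, gpus_per_node=4, cpu_threads=2):
    """Hypothetical three-level plan: nodes, then CPU/GPUs, then CPU threads."""
    plan = []
    # Level 1: spread the training examples across the cluster nodes.
    for node_chunk in split(examples, nodes):
        # Level 2: within a node, one share for the host CPU and one per GPU
        # (equal shares are an assumption made for simplicity).
        shares = split(node_chunk, gpus_per_node + 1)
        cpu_share, gpu_shares = shares[0], shares[1:]
        # Level 3: the CPU share is divided among threads; each GPU would
        # spread its own share across its cores on the device.
        plan.append({
            "cpu_threads": split(cpu_share, cpu_threads),
            "gpus": gpu_shares,
        })
    return plan

plan = partition(list(range(1000)))
for n, node in enumerate(plan):
    cpu_total = sum(len(t) for t in node["cpu_threads"])
    gpu_total = sum(len(g) for g in node["gpus"])
    print(f"node {n}: {cpu_total} examples on CPU, {gpu_total} on GPUs")
```

A real system would also have to combine the partial results from every level after each pass, which is where the nested algorithmic features come in; this sketch covers only the data split.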

The IBM researchers don't claim that TensorFlow doesn't take advantage of parallelism, and don't offer any comparison between Snap ML and TensorFlow on that score.

But they do say: "We implement specialised solvers designed to leverage the massively parallel architecture of GPUs while respecting the data locality in GPU memory to avoid large data transfer overheads."

The paper says that the AC922 server with its NVLink 2.0 interface is faster than a Xeon server (Xeon Gold 6150 CPU @ 2.70GHz) with a PCIe interface to its Tesla GPU. "For the PCIe-based setup we measure an effective bandwidth of 11.8GB/sec and for the NVLink-based setup we measure an effective bandwidth of 68.1GB/sec."

Training data is sent to the GPUs to be processed there. The NVLink systems send chunks to the GPU much faster than the PCIe system, taking 55ms instead of 318ms.
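The bandwidth and transfer-time figures are consistent with each other, as a little arithmetic shows. The chunk size is not stated in this article; the sketch below infers it by multiplying each bandwidth by its transfer time, which assumes the quoted times are pure transfer at the quoted effective bandwidth.

```python
# Figures quoted in the article.
pcie_gb_per_s, nvlink_gb_per_s = 11.8, 68.1   # effective bandwidth
pcie_ms, nvlink_ms = 318, 55                  # time to send one chunk

# Implied chunk size = bandwidth x time (an inference, not a quoted number).
pcie_chunk_gb = pcie_gb_per_s * pcie_ms / 1000
nvlink_chunk_gb = nvlink_gb_per_s * nvlink_ms / 1000

bandwidth_ratio = nvlink_gb_per_s / pcie_gb_per_s
time_ratio = pcie_ms / nvlink_ms

print(f"implied chunk size: PCIe {pcie_chunk_gb:.2f}GB, NVLink {nvlink_chunk_gb:.2f}GB")
print(f"bandwidth ratio {bandwidth_ratio:.2f}x, transfer-time ratio {time_ratio:.2f}x")
```

Both setups work out to a chunk of roughly 3.75GB, and the NVLink advantage is a little under 6x on either measure.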

The IBM team also says: "We employ some new optimisations for the algorithms used in our system when applied to sparse data structures."

Overall then, it appears that Snap ML can take more advantage of Nvidia GPUs, transferring data to them faster over NVLink than across a commodity x86 server's PCIe link. We don't know how POWER9 CPUs compare to Xeons on speed; IBM has not yet publicly released any direct POWER9 to Xeon SP comparisons as far as we know.

We also can't say how much better Snap ML is than TensorFlow until we run the two suckers on identical hardware configurations.

Whatever the reasons for it, the 46x reduction is impressive, and gives IBM lots of room to push its POWER9 servers as a place to plug in Nvidia GPUs, run its Snap ML library, and do machine learning. ®