For Google, Baidu, and a handful of other hyperscale companies that have been working with deep neural networks and advanced machine learning applications well ahead of the rest of the world, how they build clusters for the training and inference portions of such workloads is, for the most part, a well-guarded secret. What is clear from what we have been able to gather, however, is that GPUs are the key to efficiently accelerating the training of neural networks, a process that is time-consuming and computationally intensive.

GPU maker Nvidia has been at the front of the push to continue this trend. In addition to the new cards released this year for the training (Tesla M40) and inference (Tesla M4) portions of deep learning and machine learning workloads, the company also has the lower-cost (at least presumably, since pricing has not been released for the new Tesla GPUs) and still relatively high-performance Titan X GPUs, which pack 3,072 CUDA cores into a $1,000 package.

While the Titan X cards lack some of the more sophisticated features of the forthcoming M40 and M4 GPUs, including RDMA support, a racecar deep learning training cluster built on the high-end Nvidia Tesla parts might not be within reach for companies that lack the internal resources. Further still, taking such an approach is difficult even for companies that have both the hardware and software expertise to build a full deep learning stack. It might be relatively easy to find CUDA experts, high performance computing systems gurus, and those skilled in building and training neural networks, but having all three, each at the depth of knowledge needed, is an expensive proposition. Even with such know-how in-house, standing up a training cluster can be difficult, and the details of such systems are kept under lock and key by the few big companies that have done it at scale to date, namely Google, Baidu, and the expected host of other select hyperscalers.

As one might imagine, the race will be on for integrators to create vertically integrated boxes aimed at the deep learning training segment that bring as much pre-configured hardware and software knowledge to the table as possible. Interestingly, though, systems tuned with a large GPU count and an integrated software stack with the necessary CUDA hooks have already been perfected to meet the needs of scientific and technical computing. At this year’s annual Supercomputing Conference (SC15) in Austin last month, Penguin Computing, Supermicro, Cirrascale, OneStop Systems, and a number of others were showing off similar 4U servers that could fit 16 Titan X GPUs in each, effectively packing between 5 and 7 teraflops (single precision) per unit and between 70 and 100 teraflops in each node, with either InfiniBand or Ethernet connections. Some of these can be configured with the Nvidia Tesla K40 or now K80 GPUs, which are aimed at high performance computing, but of course, for the more price conscious, the Titan X cards are equally appealing, particularly for the deep learning training market.
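For a rough sense of how the per-card and per-node figures relate, here is a minimal back-of-the-envelope sketch. It assumes the quoted 5 to 7 teraflops applies to each individual Titan X card and simply multiplies by the 16 cards in a node; the function name `node_peak_tflops` is hypothetical and exists only for illustration.

```python
# Back-of-the-envelope peak throughput for one multi-GPU node.
# Assumption: the quoted 5-7 teraflops (single precision) is per Titan X card.
# This is a naive product that ignores interconnect, memory, and software overheads.

def node_peak_tflops(gpus_per_node, per_gpu_tflops_range):
    """Return the (low, high) naive peak for a node in teraflops."""
    low, high = per_gpu_tflops_range
    return gpus_per_node * low, gpus_per_node * high

low, high = node_peak_tflops(16, (5.0, 7.0))
print(f"Naive single-precision peak per node: {low:.0f}-{high:.0f} teraflops")
```

The naive product lands somewhat above the 70 to 100 teraflops quoted for a full node. How the vendors arrived at the node-level range was not stated, so the gap should not be read as a measured efficiency figure.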

It’s hard to say how interested most of the supercomputing crowd might have been in this particular use case, but for one company that re-emerged (rebranded) out of stealth today, there is indeed a growing convergence, particularly at the hardware level, between supercomputing and machine learning. Although the vendors above are rolling out a number of integrated solutions on the hardware side, as always, it’s the vertically integrated stack that matters, at least according to Sumit Sanyal, CEO and Chief Architect at Nomizo, which is now known as Minds.AI.

The company has been working extensively with Nvidia to build on top of its deep learning libraries. The stack it has put together will run on tailored appliances that the angel-funded company expects to deliver to key reference customers, with the bulk of future business being delivered as a service. Sanyal says that while it is already possible to cobble together such a stack to run on Amazon EC2, for instance, the performance suffers, in part because it is difficult to get reasonable performance without InfiniBand networking and a fine-tuned GPU-based stack. The stack, shown on the right, is not unexpected, but Sanyal explains that smaller companies (i.e., those not at Google scale) are finding it difficult to put together systems with the tight integration needed.

Since products and mid-scale users for such offerings are still limited, it is hard to get a feel for the potential market, but Sanyal says that it will go beyond image and facial recognition. Where there are patterns to be found, there is a market for deep learning. This includes new use cases in oil and gas, medicine, financial services, security, and elsewhere. The differentiation for integrators boils down to combined hardware and software expertise, Sanyal notes, pointing to his team, which includes CTO Tijmen Tieleman, whose PhD work was done in Geoffrey Hinton’s lab at the University of Toronto, where a great deal of pioneering work on neural networks was performed. Others, including experts from supercomputing centers (among them Ross Walker, a research professor in molecular dynamics from the UCSD Supercomputing Center) and HPC pros from astrophysics and other scientific computing domains, are leading R&D efforts as well.

The company’s Nvidia Titan X-based prototype system is being evaluated by two early potential users. While Sanyal could not give company names, he did say they are recognizable organizations, one in telecommunications and another in social media. This prototype system, developed with the company’s hardware partners at Cirrascale, showed a reported 400% increase in training speed using 12 of the Titan X cards as the acceleration basis. The company plans to grow the system to over 200 GPUs, although as one might imagine, communication and bandwidth become real concerns at that scale. Overcoming such challenges is the reason why an integrated approach, including the Caffe-based framework, will be the fastest way for the “not-Google” companies to realistically consider training and implementing neural networks, Sanyal says.

Minds.AI is focused on accelerating the training portion of these workloads, but says it is also exploring how the same systems can be used for inference. While there are other architectures more suited to executing trained models (lower power, in particular), as seen above, the costs for training represent one of the primary problem areas to solve. Phase 0.5 refers to the current configuration with 12 GPUs, while the next phase is that lofty 200-GPU system the company aims for, presumably following actual customer wins. The company will also explore the M40 and M4 processors, but for now says that based on sheer economics, the Titan X, with its relatively low price point, has proven itself in development. Having the more “enterprise” features inside the two new Tesla GPUs for training will be key once the company secures users, but for now, as a startup (it is seeking Series A funding early next year), it is leaving RDMA and other enhancements on the table.

If and when neural networks and increasingly sophisticated machine learning algorithms make the leap from hyperscale to general enterprise, a slowing Moore’s Law will continue to push for greater density and efficiency. Sanyal says that while GPUs have won the first round, the fight is not over. In addition to keeping an eye on FPGAs, particularly for the inference portion of deep learning workloads, he notes that other architectures are also on the radar, including forthcoming low-power ARM processors as well as double-precision capable (albeit comparatively more power-hungry) Knights Landing processors, the latter in particular because of their much higher bandwidth capabilities.

Ultimately, the question is whether it is too early for a company with an accelerated hardware story to find a deep enough market to make it through the next few years, when, arguably, the broader adoption of large-scale deep neural networks and other machine learning algorithms will sink in. There are markets on the horizon beyond image and video recognition and analysis. Oil and gas, security analytics, medical analytics: where there are patterns to be found, there could be a fit. The problems are complexity and expertise. Speeding up training and inference almost seems like a luxury next to those two major barriers, but Minds.AI intends to be patient as it watches the “commoditization” of the hardware and software stacks continue.