Is it possible to leverage your current Spark cluster to build Deep Learning models?

Can terabytes of data stored in HDFS, Hive and HBase be analyzed?

Muhammad Rafay Aleem, Nandita Dwivedi, Kiran

Photo by Franki Chamaki on Unsplash

Diving into Intel’s BigDL

Apache Spark has rapidly grown in popularity over the past couple of years. This comes from its simplicity, speed and support, also referred to as the 3 S’s of Spark. Many companies have leveraged Hadoop and Spark environments to build powerful data processing pipelines. These pipelines were built to pre-process huge volumes of data on distributed clusters and to draw insights from that data for business growth. As Deep Learning gained momentum for its high-accuracy predictions and analysis potential, many companies wanted to take advantage of models that could help improve their businesses further. Intel had such corporate customers, with huge data pipelines already built and deployed in their Spark/Hadoop clusters. These customers raised a few concerns about how deep learning models could be applied to these datasets.

1) Can the same pipeline be leveraged for Deep Learning?
2) Can Deep Learning be done efficiently for these large datasets, i.e. Deep Learning at scale?
3) Can the existing Spark/Hadoop cluster be used?

These questions arose from the obvious need for a consistent environment in which moving these large datasets across different clusters can be avoided. Moreover, moving proprietary datasets out of an already existing infrastructure can be a security risk. Early experiments to answer these questions involved layering existing Deep Learning frameworks on top of Spark, but the results weren’t coherent. This led to the development of the BigDL platform on top of Spark, which naturally inherits the 3 S’s (simplicity, speed and support) and adds Deep Learning features.

BigDL Library

BigDL is a distributed deep learning library that is growing rapidly in the field of big data analysis. It is natively integrated with the Spark and Hadoop ecosystems, which gives it significant features such as incremental scalability: models can be trained on clusters of whatever size the data requires. It supports building an end-to-end data analytics, deep learning and AI pipeline, and it performs data-parallel distributed training to achieve high scalability. Since its open source release, BigDL users have built numerous analytics and deep learning applications on Spark and Big Data platforms, covering areas such as visual similarity, parameter synchronization, scaling and convergence.
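To make the idea of data-parallel training concrete, here is a minimal, framework-free sketch in plain Python. It is not BigDL's API or its actual synchronization mechanism; the names local_gradient, train and partitions are hypothetical and exist only for illustration. Each partition stands in for a worker that computes a gradient on its own slice of the data, and the averaged gradient is applied to a single shared weight.

```python
# Minimal sketch of synchronous data-parallel training (illustration only).
# Assumption: this mimics the general technique, not BigDL's implementation.
# local_gradient, train and partitions are hypothetical names.

def local_gradient(weight, partition):
    """Gradient of the mean squared error for y ~ weight * x on one partition."""
    total = 0.0
    for x, y in partition:
        total += 2.0 * (weight * x - y) * x
    return total / len(partition)


def train(partitions, weight=0.0, learning_rate=0.01, iterations=200):
    """Run gradient descent where every 'worker' owns one data partition."""
    for _ in range(iterations):
        # Each worker computes a gradient on its local data only.
        gradients = [local_gradient(weight, p) for p in partitions]
        # Synchronization step: average the gradients across workers.
        averaged = sum(gradients) / len(gradients)
        # Every worker would then apply the same update to its model copy.
        weight -= learning_rate * averaged
    return weight


# Toy data generated from y = 3x, split into three equal partitions.
data = [(float(x), 3.0 * x) for x in range(1, 10)]
partitions = [data[0:3], data[3:6], data[6:9]]
learned_weight = train(partitions)
```

Because the partitions are the same size, averaging the per-partition gradients gives the same update as computing the gradient over the whole dataset at once, which is why the data can stay where it is while the model still learns from all of it. In a real cluster the list comprehension would be replaced by tasks running in parallel on different nodes.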

Deep Learning applications can be written as standard Spark programs. Because these libraries are unified in the Spark framework, they can read high volumes of data. BigDL also supports Python libraries such as NumPy, SciPy, NLTK and Pandas, integrates with TensorBoard for visualizations, and supports loading existing Torch models. One of the most important reasons enterprise customers want to use BigDL and Spark is that, in addition to BigDL being faster than TensorFlow, parallel computation lets them retrain their models more quickly.

Analytics Zoo Library

Machine Learning in Spark is still in its infancy compared with the numerous standalone libraries available in the Python ecosystem. Most of these libraries, such as Keras, TensorFlow and PyTorch, do not fit naturally into Spark because they do not support Spark’s underlying core framework that enables distributed computing. Analytics Zoo is a library developed by Intel that tries to bridge this gap. It provides a rich set of high-level APIs to seamlessly integrate BigDL, Keras and TensorFlow programs into Spark pipelines. It has several built-in deep learning models for object detection, image classification, text classification, recommendations, etc. The library also provides end-to-end reference use cases, such as anomaly detection, fraud detection and image augmentation, for applying machine learning to real-world problems.

To put things into perspective, the following section provides a brief tutorial on BigDL and Analytics Zoo, showing how easily they can be used to implement transfer learning with pre-trained models and train them on a Spark cluster.