Big data burst onto the scene a little over a decade ago. Today it is no longer an obscure term confined to a handful of bleeding-edge companies. It is a mainstream trend that every enterprise undergoing a digital transformation journey has adopted. The technology landscape around big data has broadened dramatically: in the early days it meant Apache Hadoop; today it includes Apache Spark and NoSQL databases like MongoDB and Apache Cassandra, among many other new technologies.

As enterprises realized the untapped potential of data that is captured and analyzed, development teams across the world began to build applications to harness the power of big data. Development teams live and die by the speed at which they can get applications to market, so they needed to provision or change these new environments quickly, and they used virtual infrastructure to accelerate their efforts. Yet even with the rise of the cloud and big data applications going mainstream, some stubborn myths about virtualizing big data remain. In this article, we will look at the five most common remaining myths and dispel them one by one.

Myth 1: Virtualizing big data applications is fine for development but not for production

It is true that software engineers have used virtual infrastructure to develop big data applications over the last several years. However, these big data applications have now also made their way into production. Virtualized applications make it possible for various users, including business analysts and data scientists, to work on different data analysis tasks simultaneously, which significantly increases the productivity of these teams.

Myth 2: There’s a performance penalty when virtualizing Hadoop

Misconceptions about the performance of virtualized Hadoop persist, but the question should be moot by now. Since 2011, performance benchmarks have consistently shown that Hadoop on virtual machines performs as well as, or better than, Hadoop on physical machines, with results showing that MapReduce jobs completed up to 12 percent faster and Spark/Machine Learning jobs up to 10 percent faster. The latest performance benchmarks, performed by VMware in 2016, show that Hadoop scales amply on virtual machines, with overall performance similar to bare metal and distinct advantages in the utilization of cluster resources.
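To make a figure like "up to 12 percent faster" concrete, here is a minimal sketch of one common way to compute it: the time saved as a share of the baseline (bare-metal) run time. The function name `percent_faster` and the timings are hypothetical and are not taken from any benchmark; the benchmark papers may define the comparison differently.

```python
def percent_faster(baseline_seconds: float, candidate_seconds: float) -> float:
    """Return the time saved by the candidate run as a percentage of the baseline run time."""
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return (baseline_seconds - candidate_seconds) / baseline_seconds * 100.0


# Hypothetical job completion times, chosen only to illustrate the arithmetic.
bare_metal_seconds = 1000.0
virtual_seconds = 880.0

print(f"{percent_faster(bare_metal_seconds, virtual_seconds):.1f} percent faster")
```

A negative result from this function would mean the virtualized run was slower than the baseline.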

Myth 3: You need a SAN for virtualizing Hadoop, but can Hadoop even use a SAN?

These myths are related, so let’s tackle them both. First of all, it is a misconception that the basic features of virtual machines require a SAN. It is common for enterprises to use non-shared, direct-attached storage to hold Hadoop data for the virtual machines attached to that storage. Vendors in the space both support and often recommend direct-attached storage for its performance benefits and cost savings.

Second, Hadoop works with shared storage, whether a physical SAN or a virtual SAN such as VSAN. In fact, many users prefer to run their first Hadoop experiments on a SAN, often because it is already a core part of their infrastructure and was in place when the enterprise first started to adopt Hadoop.

Myth 4: You can run only the traditional Hadoop stack, not the latest and greatest tech

In many ways Hadoop has become a catchall for big data, but it is a misleading one. At its outset just over a decade ago, Hadoop meant the Hadoop Distributed File System and several tools that consume data from it, such as MapReduce, Hive and Pig. Today, it encompasses many projects, with other big data projects often dragged into the net. However, Apache Spark is distinct from Hadoop (although it integrates with it), and offers a faster and more efficient means to analyze ever-growing volumes of data. The performance benchmark paper cited earlier shows comparable performance for Apache Spark running on virtual machines and on bare metal.

It is also common to find enterprise users running different versions of Hadoop and Spark from multiple big data vendors in separate clusters, all on virtual machines that share the same group of hardware.

Myth 5: The hot tech is containers so you should use that instead of VMs

Container technology like Docker is white hot at the moment, and for good reason. Containers are becoming a popular choice with cutting-edge developers because they are easy to use and lightweight, and they have quickly become standard operating procedure in many development houses. However, it is important to understand the right use cases for containers in a big data strategy. Containers are best suited to hold the compute side of Hadoop, the part that executes your algorithms, such as the NodeManagers of YARN and the Executors of Spark. Containers require you to keep your data storage somewhere else; holding terabytes of data in a container is not accepted practice today. When virtualization is applied to this arrangement, the containers execute inside a virtual machine, either one container per virtual machine or many containers per virtual machine, and the VM retrieves the data. If a high level of security is an enterprise focus, then isolation of concerns and users is better achieved through virtual machines. The combination of virtual machines and containers brings mature operations management to the challenges of handling containers in production.

We’ve stared down the most stubborn myths about virtualizing big data and dealt with each one in turn. The simple fact is that running big data applications on virtualized infrastructure is now commonplace and has become a de facto standard in the enterprise.

About the Author

Justin Murray works as a Technical Marketing Manager at VMware and has been at the company for over six years.
