We are in the age of data, Big Data! Machines and applications all around the world are generating tons of logs and data, changing the way we do business, treat patients, navigate streets, make decisions, and, in one sentence, changing the way we live.

Nowadays, many concepts and technologies have emerged to harness this fast-paced flood of data and to store and manage data and computation over clusters of servers (e.g., distributed file systems, MapReduce, NoSQL). These have changed, and are still changing, how we develop software projects and how we deploy them to production environments (e.g., containers, cloud computing, serverless).

One of the first steps in developing with such technologies is to simulate the production environment on a local machine. In this article, I will discuss one way of preparing such an infrastructure using docker-compose and tackle the problems and challenges that may come up along the way. Specifically, I will spin up HDFS, Hive, Spark, Hue, Zeppelin, Kafka, ZooKeeper, and StreamSets in Docker containers, using docker-compose, as sketched below. There are other ways to build such a platform. For example, you could spin up VMs on a virtualization tool like VirtualBox or KVM using Vagrant, or use Kubernetes as your container orchestrator instead. The right choice depends mostly on your production environment, and each option has its own pros and cons, which are outside the scope of this article.
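To make this concrete, here is a minimal sketch of what such a docker-compose.yml could look like for just the ZooKeeper and Kafka pieces. The image tags and settings below are illustrative assumptions, not the exact configuration from the repository; the remaining services (HDFS, Hive, Spark, and so on) would be defined in the same way.

```yaml
# Minimal illustrative excerpt of a docker-compose.yml; image tags and
# settings are assumptions. Check Docker Hub for current versions.
version: "3"

services:
  zookeeper:
    image: zookeeper            # official ZooKeeper image; pin a tag in practice
    ports:
      - "2181:2181"             # default ZooKeeper client port

  kafka:
    image: confluentinc/cp-kafka:5.0.0   # Confluent Kafka image; tag is an assumption
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      # Advertised listener reachable by other containers on the compose network
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1   # fine for a single-broker dev setup
```

With a file like this in place, `docker-compose up -d` starts the containers in the background and `docker-compose down` tears them down; each additional service is simply another entry under `services`.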

This article assumes that you are already familiar with Docker and the technologies mentioned above. It is worth mentioning that the source code is available on GitHub.