Apache Spark is a powerful open-source processing engine built around speed, ease of use, and sophisticated analytics. It was originally developed at UC Berkeley in 2009, and Databricks was later founded by the creators of Spark in 2013.

The Spark engine runs in a variety of environments, from cloud services to Hadoop or Mesos clusters. It is used to perform ETL, interactive queries (SQL), advanced analytics (e.g., machine learning) and streaming over large datasets in a wide range of data stores (e.g., HDFS, Cassandra, HBase, S3). Spark supports a variety of popular development languages including Java, Python and Scala.

In this article, we will walk you through the installation of Spark, as well as Hadoop, which we will need later on. Follow the instructions below to start working with Spark.

Installing Oracle VM Virtualbox

You can find detailed instructions on how to install VM Virtualbox in this article.

But first of all, check whether Java is installed on your OS. Run the following short command in your terminal/console to see which version of Java is installed and to avoid a version mismatch:

java -version

You can simply move on if your Java version is 8. If Java is not installed yet, run the following commands in the terminal:

sudo apt-get update

sudo apt-get install openjdk-8-jdk

Now you can recheck your Java version:

java -version
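Hadoop and Spark both locate the JDK through the JAVA_HOME environment variable, so it is worth setting it now. As a minimal sketch, the path below is the usual install location of the Ubuntu openjdk-8-jdk package on amd64; check the actual directory name under /usr/lib/jvm on your machine before using it. You can add these lines to your ~/.profile:

```shell
# Assumed default install path of openjdk-8-jdk on Ubuntu (amd64);
# verify under /usr/lib/jvm if your directory name differs.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH="$PATH:$JAVA_HOME/bin"
```

After editing ~/.profile, run source ~/.profile so the change takes effect in the current shell.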

Hadoop installation

After installing Virtualbox, our next step is to install Hadoop for future use. In this article, we cover the installation process only, so follow our upcoming articles for a closer look at Hadoop and its integration with Spark.

Hadoop is an open-source software framework for storage and large-scale processing of datasets on clusters of commodity hardware. Follow the instructions and commands below to get it installed:

1. Download the Spark_Hadoop_new.zip file here.

2. Extract the archive to an appropriate folder.

3. Open the Ubuntu terminal and move to the newly created folder:

cd /path/to/Spark_Hadoop

4. To change the access permissions, type the following command:

sudo chmod 755 setup.sh
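Mode 755 grants the owner read, write, and execute permission, and everyone else read and execute, which is what a shell script needs in order to run. A quick way to see the effect of this mode, sketched here on a throwaway temporary file rather than setup.sh itself:

```shell
# Demonstrate what mode 755 looks like on a file: 755 = rwxr-xr-x.
tmp="$(mktemp)"
chmod 755 "$tmp"
stat -c '%a %A' "$tmp"   # prints: 755 -rwxr-xr-x
rm -f "$tmp"
```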

5. As a next step, install curl:

sudo apt-get install curl

6. Run the installation:

sudo ./setup.sh

7. Hadoop will be installed in your HOME directory. After that, you will find a cloudera folder there.
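A quick way to confirm the installer produced the expected layout (the cloudera folder name comes from the setup script used in this article):

```shell
# Report whether the installer created the cloudera folder in HOME.
if [ -d "$HOME/cloudera" ]; then
  echo "cloudera folder found in $HOME"
else
  echo "cloudera folder not found; consider re-running setup.sh"
fi
```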

8. Check if HADOOP_HOME variable was set:

echo $HADOOP_HOME

If it was not set, the response will be blank; in that case, enter the following command:

source ~/.profile

And check again:

echo $HADOOP_HOME
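The checks above can be folded into one small diagnostic that also verifies HADOOP_HOME points at a real installation; the bin/hadoop path used here is the standard Hadoop directory layout:

```shell
# Check that HADOOP_HOME is set and contains the hadoop executable.
if [ -n "${HADOOP_HOME:-}" ] && [ -x "${HADOOP_HOME}/bin/hadoop" ]; then
  echo "HADOOP_HOME looks good: ${HADOOP_HOME}"
else
  echo "HADOOP_HOME is unset or incomplete; try 'source ~/.profile' and retry"
fi
```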

9. To start the HDFS and YARN services, type:

sudo ./start.sh
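A quick command-line check at this point is jps, which ships with the JDK and lists running JVM processes; after start.sh you would typically expect the HDFS and YARN daemons (e.g., NameNode, DataNode, ResourceManager, NodeManager) to appear in its output:

```shell
# List running JVM processes; jps is part of the JDK.
if command -v jps >/dev/null 2>&1; then
  jps
else
  echo "jps not found; make sure the JDK's bin directory is on your PATH"
fi
```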

10. To verify that all services are up and running, open the following URLs:

1. HDFS Service:

NameNode: Link

DataNode: Link

2. YARN Service:
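As a rough sketch, you can also probe the web UIs from the command line with curl (installed in step 5). The ports used below are the common Hadoop 2.x defaults (NameNode UI on 50070, ResourceManager UI on 8088); your installation may expose different ones, so adjust the URLs to match the links above:

```shell
# Probe the typical Hadoop 2.x web UI ports; adjust if your setup differs.
for url in http://localhost:50070 http://localhost:8088; do
  if curl -fs -o /dev/null "$url"; then
    echo "UP:   $url"
  else
    echo "DOWN: $url"
  fi
done
```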