Reading Time: 3 minutes

Hello associate! Hope you are doing well . Today I am going to share some of my programming experience with Apache Spark.

So if you are getting started with Apache Spark then this blog may helpfull for you.

Prerequisite to start with Apache Spark –

MVN / SBT

Scala

To start with Apache Spark at first you need to either

download pre-built Apache Spark or,

download source code and build on your local machine.

Now, If you downloaded pre-built spark then you only need to extract the tar file at the location where you have the permission to read and write.

Else you ned to extract the source code and run the following command at SPARK_HOME directory to build the spark –

Building with Maven and Scala 2.11

./dev/change-scala-version.sh 2.11 mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package

Building with SBT

build/sbt -Pyarn -Phadoop-2.3 assembly

Now to start spark

goto the SPARK_HOME/bin Execute ./spark-shell

You will get following prompt :

hence Apache spark provedes you following two object by default on spark-shell :

sc : SparkContext spark : SparkSession

Although you can also create your own SparkContext (if creating project apart with Spark-Shell ) :

val conf = new SparkConf().setAppName("Demo").setMaster("local[2]") val sc = new SparkContext(conf)

Now you can load data with two type of Dataset :

Now You know that :

A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case.

DataFrame has additional metadata due to its tabular format, which allows Spark to run certain optimizations on the finalized query.

An RDD, on the other hand, is merely a Resilient Distributed Dataset that is more of a blackbox of data that cannot be optimized as the operations that can be performed against it are not as constrained.

However, you can go from a DataFrame to an RDD via its rdd method, and you can go from an RDD to a DataFrame (if the RDD is in a tabular format) via the toDF method

Creating an object of RDD and load data to RDD dataset

val data = Array(1, 2, 3, 4, 5) val distData = sc.parallelize(data) distData: org.apache.spark.rdd.RDD[Int]

Either you can load data from a file

val distFile = sc.textFile("data.txt") distFile: RDD[String]

Here is a complete example of WordCount to understand RDD :

val textFile = sc.textFile("words.txt") val counts = textFile.flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _) counts.saveAsTextFile("count.txt")

Simillarly you can create DataFrame object :

val sqlContext = new SQLContext(sc) val df = sqlContext.read.json("emp.json")

Now you can Querry with DataFrame Object .

Example of DataFrame :

val sqlContext = new SQLContext(sc) val df = sqlContext.read.json("emp.json") df.printSchema() df.show() df.select("firstName").show() df.select(df("firstName"), df("age") + 1).show() df.filter(df("age") > 25).show() df.groupBy("age").count().show() println("





Using Collect Method") df.collect.toList.map(aRow=>println(aRow))

Here is Slide for the Same

Here is Youtube Video

Reference :

http://spark.apache.org/

Stay tuned for Spark with Hive

Thanks

