In the previous tutorial, we explained Spark SQL and DataFrame operations using Spark 1.6. This tutorial covers the functionality of the DataFrame API and gives a running example for each function. Let's begin and look at the DataFrame API operations in Spark 1.6.

DataFrame API Example Using Different Types of Functionality

The DataFrame operations covered here fall into three groups:

1. Actions
2. Basic DataFrame functions
3. DataFrame operations

We use a JSON document named cars.json with the following content and generate a table based on the schema in the JSON document.

[{"itemNo" : 1, "name" : "Ferrari", "speed" : 259 , "weight": 800}, {"itemNo" : 2, "name" : "Jaguar", "speed" : 274 , "weight":998}, {"itemNo" : 3, "name" : "Mercedes", "speed" : 340 , "weight": 1800}, {"itemNo" : 4, "name" : "Audi", "speed" : 345 , "weight": 875}, {"itemNo" : 5, "name" : "Lamborghini", "speed" : 355 , "weight": 1490}]

Actions

Actions are operations (such as take, count, first, and so on) that return a value after running a computation on a DataFrame.

import org.apache.spark.sql.DataFrame

object DataFrameAPI {
  val sc = SparkCommon.sparkContext
  val sqlContext = SparkCommon.sparkSQLContext
  // Use the following command to create SQLContext.
  val ssc = SparkCommon.sparkSQLContext
  val schemaOptions = Map("header" -> "true", "inferSchema" -> "true")
  import sqlContext.implicits._
  import org.apache.spark.sql.functions._

  def main(args: Array[String]): Unit = {
    val cars = "src/main/resources/cars.json"
    val carsPrice = "src/main/resources/cars_price.json"
    val carDataFrame: DataFrame = ssc.read.format("json").options(schemaOptions).load(cars)
    // Renamed from the original, which declared the same val twice.
    val carsPriceDataFrame: DataFrame = ssc.read.format("json").options(schemaOptions).load(carsPrice)
    // Show the top 20 rows of the DataFrame in tabular form.
    carDataFrame.show()
  }
}

show()

If you want to see the top 20 rows of the DataFrame in tabular form, use the following command.

carDataFrame.show()

show(n)

If you want to see the first n rows of the DataFrame in tabular form, use the following command.

carDataFrame.show(2)

take(n)

Returns the first n rows in the DataFrame.

carDataFrame.take(2).foreach(println)

count()

Returns the number of rows in the DataFrame.

println(carDataFrame.count())

head()

Returns the first row.

val resultHead = carDataFrame.head()
println(resultHead.mkString(","))

head(n)

Returns the first n rows.

val resultHeadNo = carDataFrame.head(3)
println(resultHeadNo.mkString(","))

first()

Returns the first row.

val resultFirst = carDataFrame.first()
println("first:" + resultFirst.mkString(","))

collect()

Returns an array that contains all of the Rows in this DataFrame.

val resultCollect = carDataFrame.collect()
println(resultCollect.mkString(","))
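To try the actions without the JSON file, the following is a minimal sketch that builds the same five cars in memory. The object name ActionsSketch, the application name, and the local[*] master are assumptions made for illustration, not part of the original project. It also shows that groupBy("speed").count() is a different method from the count() action: it returns a DataFrame rather than a number.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ActionsSketch {
  // Runs a few actions and returns
  // (row count, names from take(2), name from first()).
  def run(): (Long, Seq[String], String) = {
    // Hypothetical local setup; adjust the master for a real cluster.
    val conf = new SparkConf().setAppName("actions-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._

      // Same rows as cars.json, built in memory.
      val cars = Seq(
        (1, "Ferrari", 259, 800),
        (2, "Jaguar", 274, 998),
        (3, "Mercedes", 340, 1800),
        (4, "Audi", 345, 875),
        (5, "Lamborghini", 355, 1490)
      ).toDF("itemNo", "name", "speed", "weight")

      val total = cars.count()                              // action: Long
      val firstTwo = cars.take(2).map(_.getString(1)).toSeq // action: Array[Row]
      val firstName = cars.first().getString(1)             // action: Row

      // Not an action on its own: this builds a new DataFrame with one
      // row per distinct speed, and show() is what triggers the job.
      cars.groupBy("speed").count().show()

      (total, firstTwo, firstName)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Keep in mind that collect() brings every row to the driver, so take(n) or show(n) is usually the safer choice on large tables.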

Basic DataFrame functions

printSchema()

To see the structure (schema) of the DataFrame, use the following command.

carDataFrame.printSchema()

toDF()

toDF() returns a new DataFrame with columns renamed. It is convenient when converting an RDD of tuples into a DataFrame with meaningful column names.

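// Assumes a case class Fruit with three fields (Int, String, Int) is in scope.
// show() returns Unit, so the result is not assigned to a val.
sc.textFile("src/main/resources/fruits.txt")
  .map(_.split(","))
  .map(f => Fruit(f(0).trim.toInt, f(1), f(2).trim.toInt))
  .toDF()
  .show()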
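The tuple case mentioned above can be sketched as follows. This is only an illustration: the object name ToDFSketch, the sample lines, and the column names id, name, and quantity are all hypothetical and are not taken from fruits.txt.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ToDFSketch {
  // Converts an RDD of tuples into a DataFrame with explicit column names
  // and returns (column names, row count).
  def run(): (Seq[String], Long) = {
    // Hypothetical local setup.
    val conf = new SparkConf().setAppName("todf-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._

      // Hypothetical sample lines in "id,name,quantity" form.
      val lines = Seq("1, apple, 10", "2, banana, 25")

      val df = sc.parallelize(lines)
        .map(_.split(","))
        .map(f => (f(0).trim.toInt, f(1).trim, f(2).trim.toInt))
        // Without arguments the columns would be named _1, _2, _3.
        .toDF("id", "name", "quantity")

      df.printSchema()
      (df.columns.toSeq, df.count())
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

With a case class, as in the Fruit example, toDF() takes the column names from the field names, so the explicit names are only needed for tuples.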

dtypes

Returns all column names and their data types as an array.

carDataFrame.dtypes.foreach(println)

columns

Returns all column names as an array.

carDataFrame.columns.foreach(println)

DataFrame operations

sort()

Returns a new DataFrame sorted by the given expressions.

carDataFrame.sort($"itemNo".desc).show()

orderBy()

Returns a new DataFrame sorted by the specified column(s).

carDataFrame.orderBy(desc("speed")).show()
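As a small sketch of how the two methods relate: orderBy is an alias for sort, and both sort in ascending order unless desc is requested. The object name SortSketch and the local setup are assumptions for illustration; the rows are the ones from cars.json.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SortSketch {
  // Returns (names sorted by speed ascending, names sorted by weight descending).
  def run(): (Seq[String], Seq[String]) = {
    // Hypothetical local setup.
    val conf = new SparkConf().setAppName("sort-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._

      val cars = Seq(
        (1, "Ferrari", 259, 800),
        (2, "Jaguar", 274, 998),
        (3, "Mercedes", 340, 1800),
        (4, "Audi", 345, 875),
        (5, "Lamborghini", 355, 1490)
      ).toDF("itemNo", "name", "speed", "weight")

      // sort with a column name: ascending by default.
      val bySpeedAsc = cars.sort("speed")
        .select("name").collect().map(_.getString(0)).toSeq

      // orderBy with a Column expression: explicit descending order.
      val byWeightDesc = cars.orderBy($"weight".desc)
        .select("name").collect().map(_.getString(0)).toSeq

      (bySpeedAsc, byWeightDesc)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```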

groupBy()

Groups the DataFrame by the given column(s) so that aggregations can be run on each group. The example counts the number of cars that have the same speed.

carDataFrame.groupBy("speed").count().show()
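Every car in cars.json has a different speed, so grouping by speed gives five groups of one. The sketch below groups on a derived column instead and runs two aggregations with agg. The object name GroupBySketch, the column name fast, and the 300 threshold used for grouping are assumptions chosen for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{avg, count}

object GroupBySketch {
  // Returns, per group (fast = speed > 300), the number of cars
  // and their average weight.
  def run(): Map[Boolean, (Long, Double)] = {
    // Hypothetical local setup.
    val conf = new SparkConf().setAppName("groupby-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._

      val cars = Seq(
        (1, "Ferrari", 259, 800),
        (2, "Jaguar", 274, 998),
        (3, "Mercedes", 340, 1800),
        (4, "Audi", 345, 875),
        (5, "Lamborghini", 355, 1490)
      ).toDF("itemNo", "name", "speed", "weight")

      val grouped = cars
        .withColumn("fast", $"speed" > 300)   // derived boolean column
        .groupBy("fast")
        .agg(count("name").as("cars"), avg("weight").as("avg_weight"))

      grouped.show()

      // Result columns are: fast, cars, avg_weight.
      grouped.collect()
        .map(r => r.getBoolean(0) -> ((r.getLong(1), r.getDouble(2))))
        .toMap
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```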

na

Returns a DataFrameNaFunctions object for working with missing data. The example drops every row that contains a null value.

carDataFrame.na.drop().show()
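The cars.json data has no missing values, so na.drop() leaves it unchanged. The following sketch uses a small hypothetical table with one missing speed to contrast dropping rows with filling in a default. The object name NaSketch, the sample rows, and the default of 0 are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object NaSketch {
  // Returns (rows left after na.drop(), Jaguar's speed after na.fill).
  def run(): (Long, Int) = {
    // Hypothetical local setup.
    val conf = new SparkConf().setAppName("na-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._

      // Hypothetical rows: None becomes a null in the speed column.
      val cars = Seq[(Int, String, Option[Int])](
        (1, "Ferrari", Some(259)),
        (2, "Jaguar", None),
        (3, "Mercedes", Some(340))
      ).toDF("itemNo", "name", "speed")

      // Drop every row that has a null in any column.
      val dropped = cars.na.drop()

      // Keep all rows and replace nulls in the speed column with 0.
      val filled = cars.na.fill(0.0, Seq("speed"))
      filled.show()

      val jaguarSpeed = filled.filter($"name" === "Jaguar").first().getInt(2)
      (dropped.count(), jaguarSpeed)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```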

as()

On a DataFrame, as() returns a new DataFrame with an alias set. The example below uses as() on a Column to name the aggregated column.

carDataFrame.select(avg($"speed").as("avg_speed")).show()

alias()

alias() is a synonym for `as`. The example uses it on a Column to name the aggregated column.

carDataFrame.select(avg($"weight").alias("avg_weight")).show()

select()

Selects a set of columns. The example fetches only the speed column from the DataFrame.

carDataFrame.select("speed").show()

filter()

Filters rows using the given condition. The example keeps the cars whose speed is greater than 300 (speed > 300).

carDataFrame.filter(carDataFrame("speed") > 300).show()
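A filter condition can also combine several tests, and it can be written either as a Column expression or as a SQL string. The sketch below shows both forms. The object name FilterSketch and the extra weight < 1000 condition are assumptions added for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object FilterSketch {
  // Returns the matching car names for the Column form and the SQL-string form.
  def run(): (Seq[String], Seq[String]) = {
    // Hypothetical local setup.
    val conf = new SparkConf().setAppName("filter-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._

      val cars = Seq(
        (1, "Ferrari", 259, 800),
        (2, "Jaguar", 274, 998),
        (3, "Mercedes", 340, 1800),
        (4, "Audi", 345, 875),
        (5, "Lamborghini", 355, 1490)
      ).toDF("itemNo", "name", "speed", "weight")

      // Column expression: conditions are combined with &&.
      val byColumn = cars.filter($"speed" > 300 && $"weight" < 1000)

      // The same condition written as a SQL expression string.
      val byString = cars.filter("speed > 300 AND weight < 1000")

      (byColumn.select("name").collect().map(_.getString(0)).toSeq,
       byString.select("name").collect().map(_.getString(0)).toSeq)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The Column form is checked by the compiler for syntax, while the string form is only parsed at run time, so a typo in the string shows up later.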


We will keep looking at how to make these tutorials more useful and will add more content over time. If you have any suggestions, feel free to share them with us 🙂 Stay tuned.