Tushar Kapoor: (https://www.tusharck.com/)

Apache continues to hold a strong position in Big Data Science with its preview release of Spark 3.0. According to the preview, Spark 3.0 is coming with several big and important features.

You can download the preview release from this link: https://archive.apache.org/dist/spark/spark-3.0.0-preview/

Let’s look at some of the major features that reinforce its goal of Unified Analytics for Big Data.

Spark Graph: Cypher Script & Property Graph

The popular graph query language Cypher has been added to Spark 3.0, coupled with the Property Graph model, a directed multigraph. Graph queries will follow similar principles to Spark SQL, with their own Catalyst-style planner, providing full support for DataFrames.
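To make the idea concrete, here is a minimal, hypothetical sketch of how a Cypher query over a property graph could look. The createGraph and cypher entry points below are assumptions based on the Spark Graph proposal, not a confirmed preview API:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("spark-graph-sketch").getOrCreate()
import spark.implicits._

// In the property graph model, nodes and relationships are plain DataFrames
val nodes = Seq((0L, "Alice"), (1L, "Bob")).toDF("id", "name")
val rels = Seq((0L, 1L, "KNOWS")).toDF("src", "dst", "relType")

// Hypothetical entry points from the Spark Graph proposal:
// val graph = spark.createGraph(nodes, rels)
// val result = graph.cypher("MATCH (a)-[:KNOWS]->(b) RETURN a.name, b.name")
// result.df.show()  // Cypher results would come back as a DataFrame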

Click to check out Databricks’s session on Graph API for Spark 3.0.

Python 3, Scala 2.12 and JDK 11

Spark 3.0 is expected to move completely to Python 3.

The Scala version is upgraded to 2.12.

It will fully support JDK 11.

Deep Learning: Adds GPU Support

This is something every data engineer and data scientist looks for, and Spark 3.0 meets the expectation. Spark 3.0, together with NVIDIA, offers GPU acceleration that can run across multiple GPUs. It supports heterogeneous GPUs (AMD, Intel and NVIDIA). For Kubernetes, it offers GPU isolation at the executor pod level. In addition to this we get:

GPU acceleration for Pandas UDF.

You can specify the number of GPUs for your tasks in RDD operations (see the sketch after this list).

To easily specify a deep learning environment, there is YARN and Docker support for launching Spark 3.0 with GPUs.
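A minimal sketch of the GPU-aware scheduling configuration in Spark 3.0; the GPU counts and the discovery-script path are assumptions for illustration:

import org.apache.spark.TaskContext
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("gpu-demo")
  .config("spark.executor.resource.gpu.amount", "2") // GPUs per executor
  .config("spark.task.resource.gpu.amount", "1") // GPUs per task
  .config("spark.executor.resource.gpu.discoveryScript", "/opt/spark/getGpus.sh")
  .getOrCreate()

// Each task can look up the GPU addresses assigned to it (output goes to executor logs)
spark.sparkContext.parallelize(1 to 4, 4).foreach { _ =>
  val gpus = TaskContext.get().resources()("gpu").addresses
  println(s"Task runs on GPU(s): ${gpus.mkString(",")}")
}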

Log Loss: Support

You can also set logistic loss (log loss) as your evaluation metric.

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

val evaluator = new MulticlassClassificationEvaluator()
  .setMetricName("logLoss")
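As a usage sketch, you can then evaluate a DataFrame of predictions; the predictions variable here is an assumption, standing in for the output of a probabilistic classifier such as LogisticRegression:

// `predictions` is assumed to come from model.transform(test),
// carrying label and probability columns
val logLoss = evaluator.evaluate(predictions)
println(s"Log loss: $logLoss")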

Binary Files

You can use binary files as the data source for your Spark DataFrame; however, as of now, write operations on the binary data source are not allowed. We can expect that in future releases.

val df = spark.read.format("binaryFile").load("/path/to/dir")
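For example, here is a minimal sketch that reads only PNG files using the pathGlobFilter option (the directory path is an assumption):

// Read only PNG files; the directory path and glob are assumptions
val images = spark.read.format("binaryFile")
  .option("pathGlobFilter", "*.png")
  .load("/data/images")

// The source yields path, modificationTime, length and content columns
images.select("path", "length").show()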

Kubernetes

Now you will be able to host clusters via Kubernetes, with support for the latest Kubernetes version. With spark-submit, webhook configurations can be used to modify pods at runtime. It also improves dynamic allocation on Kubernetes. Furthermore, we get:

Support for GPU scheduling.

spark.authenticate secret support in the Kubernetes backend.

The Kerberos authentication protocol is now supported in the Kubernetes resource manager.

Koalas: Scaling pandas to Spark

Koalas is a pandas API on Apache Spark that makes data engineers and scientists more productive when interacting with big data. With the 3.0 feature release, Koalas can now scale to a distributed environment, in contrast to the single-node environment previously, without having to read the data into a Spark DataFrame separately.

import databricks.koalas as ks
import pandas as pd

pdf = pd.DataFrame({'x': range(3), 'y': ['a', 'b', 'b'], 'z': ['a', 'b', 'b']})

# Create a Koalas DataFrame from pandas DataFrame
df = ks.from_pandas(pdf)

# Rename the columns
df.columns = ['x', 'y', 'z1']

# Do some operations in place:
df['x2'] = df.x * df.x

The above example is taken from the Koalas Git repo.

Kafka Streaming: includeHeaders

Now you can read Kafka record headers in Structured Streaming by setting the includeHeaders option (git). The headers column is an array of (key, value) pairs, where the value is binary.

import spark.implicits._

val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .option("includeHeaders", "true")
  .load()

df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "headers")
  .as[(String, String, Array[(String, Array[Byte])])]

YARN Features

YARN also gets a new set of features, mainly:

The Spark 3.0 framework can now auto-discover GPUs on a cluster or a single machine.

You can schedule GPUs.

You can check here to see the GPU configuration for YARN: https://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/UsingGpus.html

Analyze Cached Data

Now you can analyze cached data inside Spark 3.0, one of the most-requested Spark features (git).

withTempView("cachedQuery") {
  sql(
    """CACHE TABLE cachedQuery AS
      |  SELECT c0, avg(c1) AS v1, avg(c2) AS v2
      |    FROM (SELECT id % 3 AS c0, id % 5 AS c1, 2 AS c2 FROM range(1, 30))
      |    GROUP BY c0
    """.stripMargin)

  // Analyzes one column in the cached logical plan
  sql("ANALYZE TABLE cachedQuery COMPUTE STATISTICS FOR COLUMNS v1")

  // Analyzes two more columns
  sql("ANALYZE TABLE cachedQuery COMPUTE STATISTICS FOR COLUMNS c0, v2")
}

Dynamic Partition Pruning

It offers optimized execution at runtime by reusing the dimension-table broadcast results in hash joins. This helps Spark 3.0 work more efficiently, specifically with star-schema queries, removing the need to ETL denormalized tables.
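As a rough sketch, dynamic partition pruning can kick in on a star-schema join like the one below, where sales is a fact table partitioned by day and dates is a dimension table (the table names and columns are assumptions):

// Dynamic partition pruning is enabled by default in Spark 3.0
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

val result = spark.sql("""
  SELECT s.*
  FROM sales s
  JOIN dates d ON s.day = d.day
  WHERE d.year = 2019
""")
// The filter on the `dates` dimension prunes `sales` partitions at runtime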

Click to check out Databricks’s session on Dynamic Partition Pruning in Apache Spark.

Delta Lake

Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark. With its ease of adoption and upgrade in any existing Spark application, it brings reliability to data lakes.

See Delta Lake Quick Start: https://docs.databricks.com/delta/quick-start.html
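A minimal sketch of writing and reading a Delta table, assuming the delta-core package is on the classpath (for example via --packages io.delta:delta-core_2.12:<version>); the path is an assumption:

// Write a small Delta table and read it back
spark.range(0, 5).write.format("delta").save("/tmp/delta-table")

val df = spark.read.format("delta").load("/tmp/delta-table")
df.show()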

More Features:

Decision trees in SparkML.

An improved optimizer during query execution.

Pluggable catalog integration.

Metrics for executor memory.

Dynamic allocation.

The features mentioned above are not the only ones coming to Spark 3.0. No doubt Spark 3.0 will help data scientists do even more with these add-on features; now we wait for the final release of Spark 3.0.