In our previous posts in the Zeppelin series we went into detail about running Zeppelin Spark notebooks on K8s. If you've read those, you should already be familiar with the Spark and Zeppelin components, and be aware that setting up all the pieces of a running environment is no easy task. Today we want to show you the quickest and easiest way to set up a Zeppelin Server and Spark on K8s, with all the necessary components, using Helm charts.

The Zeppelin Spark chart is an aggregated chart containing the following sub charts:

The flow diagram below shows how these components are deployed via this aggregated Helm chart.

The Zeppelin Server chart also contains all the config files Zeppelin needs (notably all files from the conf directory), wrapped in config maps as you can see below:

configMap.yaml: log4j.properties, log4j_k8s_cluster.properties, shiro.ini.template

interpreter-settings-config.yaml: interpreter.json

zeppelin-site-config.yaml: zeppelin-site.xml

Some configuration properties are replaced with template variables and exposed as parameters in values.yaml, so that they are configurable via a Helm client. Should you need to change another property at deployment time, replace its value in the corresponding configMap with {{ .Values.configurableProperty }}; you can then set it as zeppelin.configurableProperty in the Helm client.
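As a rough illustration of that wiring, the sketch below writes an override file and shows the matching install commands. `configurableProperty` is the placeholder name from the paragraph above, not a real chart parameter, and `my-value` and `override.yaml` are made up for the example. The `helm install` lines are left commented out because they need a cluster and the chart repository.

```shell
#!/bin/sh
# "configurableProperty" is the placeholder from the text, not a real parameter.
# Inside the Zeppelin sub chart a configMap would reference it as:
#   {{ .Values.configurableProperty }}
# From the aggregated chart the same value is nested under the sub chart name.
cat > override.yaml <<'EOF'
zeppelin:
  configurableProperty: "my-value"
EOF

# Deploy with the override file (needs a cluster and the chart repo):
#   helm install -f override.yaml banzaicloud-stable/zeppelin-spark
# or set the single value inline:
#   helm install --set zeppelin.configurableProperty=my-value banzaicloud-stable/zeppelin-spark

# Print the generated file for inspection.
cat override.yaml
```

The nesting under `zeppelin:` is what turns `.Values.configurableProperty` inside the sub chart into `zeppelin.configurableProperty` on the command line.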

In short, Zeppelin Spark offers configurations of the following:

basic security with Shiro

centralised logging to Syslog

the storing of notebooks on different Cloud providers: Google Storage, Amazon S3, Azure Blob Storage

the deployment of Spark History Server and the logging of Spark events to Google Storage, Amazon S3, Azure Blob Storage

an ingress to reach Zeppelin UI from outside

preconfigured spark-submit options

Let’s walk through each of these, highlighting related Helm chart parameters.

Basic security with Shiro

Shiro is the default authentication and authorization framework for Zeppelin, and it lets you set up several different Realms for LDAP, Active Directory etc. As you've already seen above, we have a configMap containing shiro.ini.template, with a default configuration that uses IniRealm, meaning that users and roles are configurable in conf/shiro.ini under the [users] and [roles] sections. As of now, you can configure an admin username and password.

    zeppelin:
      username: "admin"
      password: "zeppelin"

These are the default values for username and password. The password is stored as a secret (zeppelin-secret.yaml). At deployment time an initContainer picks it up, runs a Shiro hasher tool, and replaces ADMIN_PASSWORD in the shiro.ini.template file with the encoded password. Pipeline is already integrated with Bank-Vaults, so you'll be able to use Vault to store your Zeppelin username and password.
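The substitution step can be pictured with a small shell sketch. This is not the chart's actual initContainer script: the template content, the `render_shiro_ini` helper and the encoded value are all stand-ins, and the real encoded password would come from the Shiro hasher tool.

```shell
#!/bin/sh
# Stand-in template; the real shiro.ini.template ships in the chart's configMap.
cat > shiro.ini.template <<'EOF'
[users]
admin = ADMIN_PASSWORD, admin
EOF

# Hypothetical helper: replace the ADMIN_PASSWORD placeholder with an
# already-encoded password and print the result on stdout.
# '|' is the sed delimiter because encoded values may contain '/'.
# The value must not contain '|', '&' or '\', which sed treats specially here.
render_shiro_ini() {
  template="$1"
  encoded="$2"
  sed "s|ADMIN_PASSWORD|${encoded}|g" "$template"
}

# Placeholder value, NOT a real hash; the initContainer would use the
# output of the Shiro hasher tool here.
ENCODED='encoded-password-from-hasher'
render_shiro_ini shiro.ini.template "$ENCODED" > shiro.ini
cat shiro.ini
```

Keeping the placeholder in a template and rendering it at start-up means the plain-text password never has to be baked into the configMap.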

Centralised logging to Syslog

There are two log4j files configured: one for Zeppelin Server (log4j.properties) and one to be used by the Spark Driver and Executors (log4j_k8s_cluster.properties). Both are embedded in a config map and configured by default for INFO-level logging to the console. If you pass the required parameters below, a SyslogAppender will be created for Zeppelin Server and for the Spark Driver and Executors.

    zeppelin:
      logService:
        host: 10.44.0.12
        zeppelinLogPort: 512
        sparkLogPort: 512
        applicationLogPort: 512

There is also a separate logger for your application-level logs, which is named application by default. You can change its name, as well as all the other optional parameters:

    zeppelin:
      logService:
        zeppelinLogLevel: INFO
        zeppelinFacility: LOCAL4
        zeppelinLogPattern: "%5p [%d] ({%t} %F[%M]:%L) - %m%n"
        sparkLogLevel: INFO
        sparkFacility: LOCAL4
        sparkLogPattern: "[%p] XXX %c:%L - %m%n"
        applicationLoggerName: application
        applicationLogLevel: INFO
        applicationFacility: LOCAL4
        applicationLogPattern: "[%p] XXX %c:%L - %m%n"
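For orientation, this is roughly what a log4j 1.x SyslogAppender definition built from those parameters could look like. The chart's own templates are not shown in this post, so the `syslog_appender` helper and the appender name `syslog` are assumptions; only the log4j property names themselves are standard.

```shell
#!/bin/sh
# Hypothetical helper: print a log4j 1.x SyslogAppender fragment.
# Arguments: host, port, facility, log level, conversion pattern.
# In a real log4j file the appender would be added next to the existing ones
# (for example the console appender) rather than replacing them.
syslog_appender() {
  host="$1"; port="$2"; facility="$3"; level="$4"; pattern="$5"
  cat <<EOF
log4j.rootLogger=${level}, syslog
log4j.appender.syslog=org.apache.log4j.net.SyslogAppender
log4j.appender.syslog.SyslogHost=${host}:${port}
log4j.appender.syslog.Facility=${facility}
log4j.appender.syslog.layout=org.apache.log4j.PatternLayout
log4j.appender.syslog.layout.ConversionPattern=${pattern}
EOF
}

# Values taken from the parameter examples above.
syslog_appender 10.44.0.12 512 LOCAL4 INFO '[%p] XXX %c:%L - %m%n'
```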

Set up notebook storage on different Cloud providers

Zeppelin can store notebooks on several Cloud storage providers. You can easily configure this by setting storage type and path as illustrated below:

zeppelin.notebookStorage.type - ‘s3’ | ‘azure’ | ‘gs’

zeppelin.notebookStorage.path - bucket name in case of S3 / GS, file share name for Azure.

On Google and Amazon we’re using IAM roles and policies so that you can reach buckets belonging to the same account / project automatically, while on Azure you also have to specify zeppelin.azureStorageAccountName & zeppelin.azureStorageAccessKey.
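A values file for the Azure case might be generated as sketched below. The four parameter names come from the text above; the file name `zeppelin-azure.yaml`, the `zeppelin-nb` share name and the two environment variables are placeholders chosen for this example. The `helm install` line is commented out because it needs a cluster and the chart repository.

```shell
#!/bin/sh
# Placeholder credentials; export real values instead of hardcoding them.
AZURE_ACCOUNT="${AZURE_ACCOUNT:-your_storage_account_name}"
AZURE_KEY="${AZURE_KEY:-your_storage_access_key}"

# Unquoted EOF so the shell expands the two variables into the file.
cat > zeppelin-azure.yaml <<EOF
zeppelin:
  notebookStorage:
    type: "azure"
    path: "zeppelin-nb"   # file share name on Azure
  azureStorageAccountName: "${AZURE_ACCOUNT}"
  azureStorageAccessKey: "${AZURE_KEY}"
EOF

# Deploy (needs a cluster and the chart repo):
#   helm install -f zeppelin-azure.yaml banzaicloud-stable/zeppelin-spark

# Print the generated file for inspection.
cat zeppelin-azure.yaml
```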

Deploy Spark History Server

By default Spark History Server is not enabled. You can enable it by setting historyServer.enabled to true. You must also specify where to log Spark events, both for Spark History Server and for the Spark Driver; the latter is passed via sparkSubmitOptions:

    historyServer:
      enabled: true
    spark:
      spark-hs:
        app:
          logDirectory: "gs://spark-k8-logs/"
    zeppelin:
      sparkSubmitOptions:
        eventLogDirectory: "gs://spark-k8-logs/"

The log directory (both logDirectory and eventLogDirectory) has to reference an existing bucket, in the format of the given Cloud provider:

s3a://yourBucketName/

wasb://your_blob_container_name@your_storage_account_name.blob.core.windows.net/

gs://yourBucketName/
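The three formats above can be captured in a small helper, sketched here. `log_dir_uri` is a hypothetical function, and reusing the notebook storage type names (s3, gs, azure) as its first argument is an assumption made for this example.

```shell
#!/bin/sh
# Hypothetical helper: build the log directory URI for a storage type.
# Usage: log_dir_uri s3|gs <bucket>
#        log_dir_uri azure <blob_container> <storage_account>
log_dir_uri() {
  case "$1" in
    s3)    printf 's3a://%s/\n' "$2" ;;
    gs)    printf 'gs://%s/\n' "$2" ;;
    azure) printf 'wasb://%s@%s.blob.core.windows.net/\n' "$2" "$3" ;;
    *)     echo "unknown storage type: $1" >&2; return 1 ;;
  esac
}

# Prints gs://spark-k8-logs/
log_dir_uri gs spark-k8-logs
```

The same value would be used for both logDirectory and eventLogDirectory, so that the History Server reads the events from where the driver writes them.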

For a more in-depth examination of this subject, read Spark History Server

Ingress to reach Zeppelin UI from outside

By default a Traefik-based Ingress will be created for Zeppelin, so that the Zeppelin UI is available on an external address under the specified baseURL.

Related chart parameters:

    zeppelin:
      pipelineIngress:
        enabled: true
      ingressURL:
        baseURL: /zeppelin

Preconfigured spark-submit options

You can configure the properties below in order to change default values:

    zeppelin:
      sparkSubmitOptions:
        k8sNameSpace: default
        sparkDriverCores: 1
        sparkDriverLimitCores: 2
        sparkExecutorCores: 1
        sparkDriverMemory: 4G
        sparkExcutorMemory: 2G
        sparkMetricsConf: /opt/spark/conf/metrics.properties
        dynamicAllocation: true
        shuffleService: true
        shuffleNameSpace: default
        shuffleLabels: app=spark-shuffle-service,spark-version=2.2.0
        DriverImage: banzaicloud/spark-driver-py:v2.2.1-k8s-1.0.30
        ExecutorImage: banzaicloud/spark-executor-py:v2.2.1-k8s-1.0.30
        initContainerImage: banzaicloud/spark-init:v2.2.1-k8s-1.0.30
        resourceStagingServerInt: http://spark-rss:10000
        resourceStagingServerExt: http://spark-rss:10000
        sparkLocalDir: /tmp/spark-local
        eventLogDirectory: ""
        driverServiceAccountName: "spark"
        log4jConfigPath: "file:///var/spark-data/spark-files/log4j_k8s_cluster.properties"

Minimal config example necessary for launch

If you're fine with most of the defaults, here's a minimal config example for an up and running Zeppelin Server with History Server and everything else you might need on Google Cloud:

    cat > zeppelin-params.yaml <<EOF
    historyServer:
      enabled: true
    spark:
      spark-hs:
        app:
          logDirectory: "gs://spark-k8-logs/"
    zeppelin:
      sparkSubmitOptions:
        eventLogDirectory: "gs://spark-k8-logs/"
      notebookStorage:
        type: "gs"
        path: "zeppelin-nb"
    EOF

    helm install -f zeppelin-params.yaml banzaicloud-stable/zeppelin-spark

Note: the spark-k8-logs and zeppelin-nb buckets have to be created beforehand and must be accessible by project owners.