Spark NLP by John Snow Labs

What is Spark NLP?

Spark NLP is a text processing library built on top of Apache Spark and its Spark ML library. It provides simple, performant and accurate NLP annotations for machine learning pipelines that scale easily in a distributed environment.

Some eye-catching phrases caught my attention the first time I read an article on Databricks introducing Spark NLP, about a year ago. I love Apache Spark, and I learned Scala (and am still learning it) just for that purpose. Back then I had written my own Stanford CoreNLP wrapper for Apache Spark. I wanted to stay in the Scala ecosystem, so I avoided Python libraries such as spaCy, NLTK, etc.

However, I faced many issues since I was dealing with large-scale datasets, and I couldn’t seamlessly integrate my NLP code into Spark ML pipelines. I can sum up my problems by quoting some parts of the same blog post:

Any integration between the two frameworks (Spark and another library) means that every object has to be serialized, go through inter-process communication in both ways, and copied at least twice in memory. We see the same issue when using spaCy with Spark: Spark is highly optimized for loading & transforming data, but running an NLP pipeline requires copying all the data outside the Tungsten optimized format, serializing it, pushing it to a Python process, running the NLP pipeline (this bit is lightning fast), and then re-serializing the results back to the JVM process. This naturally kills any performance benefits you would get from Spark’s caching or execution planner, requires at least twice the memory, and doesn’t improve with scaling. Using CoreNLP eliminates the copying to another process, but still requires copying all text from the data frames and copying the results back in.

So I was really excited when I saw there was an NLP library built on top of Apache Spark and it natively extends the Spark ML Pipeline. I could finally build NLP pipelines in Apache Spark!

Spark NLP is open source and has been released under the Apache 2.0 license. It is written in Scala but it supports Java and Python as well. It has no dependencies on other NLP or ML libraries. Spark NLP’s annotators provide rule-based algorithms, machine learning, and deep learning by using TensorFlow. For a more detailed comparison between Spark NLP and other open-source NLP libraries, you can read this blog post.

As a native extension of the Spark ML API, the library offers the capability to train, customize and save models so they can run on a cluster or on other machines, or be saved for later use. It is also easy to extend and customize models and pipelines, as we’ll do here.
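Since Spark NLP pipelines are ordinary Spark ML pipelines, saving and reloading goes through the standard Spark ML API. Here is a minimal sketch, assuming you already have a `pipeline` built from some annotators and two DataFrames, `trainingData` and `newData`, with the columns the pipeline expects (all three names are placeholders for illustration):

import org.apache.spark.ml.{Pipeline, PipelineModel}

// Fit the pipeline on training data to get a reusable PipelineModel.
val model: PipelineModel = pipeline.fit(trainingData)

// Persist the fitted model so it can be reloaded later, on another machine or cluster.
model.write.overwrite().save("/tmp/my_nlp_model")

// Load it back and apply it to new data.
val reloaded = PipelineModel.load("/tmp/my_nlp_model")
val result = reloaded.transform(newData)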

The library covers many NLP tasks, such as tokenization, normalization, stemming, lemmatization, part-of-speech tagging, named entity recognition, spell checking, and sentiment analysis.

For the full list of annotators, models, and pipelines you can read their online documentation.
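As a quick taste of what the pretrained pipelines look like, here is a minimal sketch. The pipeline name “explain_document_dl” is one of the published pipelines around this release; check the documentation for the current list, and note that the exact signature may differ between versions:

import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

// Download a published pretrained pipeline and annotate a single string.
val pipeline = PretrainedPipeline("explain_document_dl", lang = "en")
val annotations = pipeline.annotate("The Mueller Report was released in April 2019.")

// Each annotator's output is keyed by its name, e.g. "token", "lemma", "pos", "ner".
annotations("token").foreach(println)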

Full disclosure: I am one of the contributors!

Installing Spark NLP

My Environments:

Spark NLP 2.0.3 release

Apache Spark 2.4.1

Apache Zeppelin release 0.8.2

Local setup with MacBook Pro/macOS

Cluster setup by Cloudera/CDH 6.2 with 40 servers

Programming language: Scala (but no worries, the Python APIs of Spark and Spark NLP are very similar to their Scala counterparts)

I will explain how to set up Spark NLP for my environment. Nevertheless, if you wish to try something different, you can always find out more about how to use Spark NLP by visiting the main public repository or by having a look at the showcase repository with lots of examples:

Main public repository: https://github.com/JohnSnowLabs/spark-nlp

Showcase repository: https://github.com/JohnSnowLabs/spark-nlp-workshop

Let’s get started! To use Spark NLP in Apache Zeppelin you have two options: either use Spark Packages, or build a fat JAR yourself and load it as an external JAR inside the Spark session. Why don’t I show you both?

First, with Spark Packages:

1. Add this to your conf/zeppelin-env.sh:

# set options to pass spark-submit command

export SPARK_SUBMIT_OPTIONS="--packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.0.3 "

2. Or, add it to Generic Inline ConfInterpreter (at the beginning of your notebook before starting your Spark Session):

%spark.conf

# spark.jars.packages can be used for adding packages into the Spark interpreter

spark.jars.packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.0.3

Second, loading an external JAR:

To build a fat JAR, all you need to do is:

$ git clone https://github.com/JohnSnowLabs/spark-nlp

$ cd spark-nlp

$ sbt assembly

Then you can follow either of the two ways I mentioned to add this external JAR. You just need to change “--packages” to “--jars” in the first option, or use “spark.jars” instead of “spark.jars.packages” in the second.
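For example, a sketch of both variants (the exact path and file name of the assembly JAR depend on where you cloned the repository and on the sbt output, so adjust them for your setup):

# Option 1: conf/zeppelin-env.sh
export SPARK_SUBMIT_OPTIONS="--jars /path/to/spark-nlp-assembly.jar"

# Option 2: a %spark.conf paragraph at the top of your notebook
spark.jars /path/to/spark-nlp-assembly.jar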

Start Spark with Spark NLP

Now we can start using Spark NLP 2.0.3 with Zeppelin 0.8.2 and Spark 2.4.1 by importing Spark NLP annotators:

import com.johnsnowlabs.nlp.base._

import com.johnsnowlabs.nlp.annotator._

import org.apache.spark.ml.Pipeline

Apache Zeppelin will start a new Spark session that comes with Spark NLP, regardless of whether you used the Spark Package or an external JAR.
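As a quick sanity check that everything is wired up, here is a minimal sketch of a two-stage pipeline using the imports above (the sample sentence and column names are just illustrative):

import spark.implicits._

// The DocumentAssembler turns raw text into Spark NLP's document annotation,
// and the Tokenizer splits each document into tokens.
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer))

val testData = Seq("Spark NLP is now available in Zeppelin.").toDF("text")
pipeline.fit(testData).transform(testData).select("token.result").show(false)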

Read the Mueller Report PDF file

Remember the issue about the PDF file not being a real PDF? Well, we have 3 options here:

1. Use any OCR tools/libraries you prefer to generate a PDF or a text file.

2. Use the already searchable and selectable PDF files created by the community.

3. Just use Spark NLP!

Spark NLP comes with an OCR package that can read both PDF files and scanned images. However, I mixed option 2 with option 3 (I would have needed to install Tesseract 4.x+ for image-based OCR on my entire cluster, so I got a bit lazy).

You can download these two PDF files from Scribd:

Of course, you could just download the text version and read it with Spark. However, I would like to show you how to use the OCR capabilities that come with Spark NLP.

Spark NLP OCR:

Let’s create a helper function for everything related to OCR:

import com.johnsnowlabs.nlp.util.io.OcrHelper

val ocrHelper = new OcrHelper()

Now we need to read the PDF and create a Dataset from its content. The OCR in Spark NLP creates one row per page:
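A minimal sketch of what that might look like (the input path is only a placeholder, and I’m assuming the OcrHelper API of this release; check the documentation for your version):

// Read every page of the PDF(s) under this path into a Dataset, one row per page.
val data = ocrHelper.createDataset(spark, "/tmp/mueller-report/")

// The extracted text lands in the text column, one page per row.
data.select("text").show(2)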