In a modern data pipeline, you will invariably have data sitting in S3, whether as CSV or some other format. So it is almost inevitable that you will have to pull files from S3, do some manipulation in a Spark dataframe, and then push the results to a database.

In this post, we will pull jsonlines files from S3, create a dataframe out of them, and then push the data to Neo4j, our graph database. In case you are not using graph databases for your data modeling, I highly recommend that you try them out. You can take a look at the advantages of using a graph database in this nice article by DZone.

For this purpose, we are going to use jsonlines files. A jsonlines file is simply one JSON record per line, as shown further below. I am choosing jsonlines rather than CSV files for this post because of the added challenge of parsing the JSON string. In case you are interested in CSV files, have a look at a YouTube video that I uploaded on a similar topic.

Another thing we will try is to keep the operations as lazy as possible and not do any actual computation until the last moment.
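Spark's transformations follow the same model as Scala's lazy collection views: operations are recorded, not executed, until something forces the result. Here is a minimal plain-Scala sketch of that behavior (the `evaluations` counter exists only so we can observe when work actually happens):

```scala
// A plain-Scala illustration of lazy evaluation, analogous to how
// Spark defers transformations until an action is called.
var evaluations = 0

// .view makes the map lazy: nothing runs yet.
val numbers = (1 to 5).view.map { n =>
  evaluations += 1 // side effect only to observe when work happens
  n * 2
}

val before = evaluations    // still 0: no element has been computed
val result = numbers.toList // forcing the view runs the map
val after  = evaluations    // now 5: one evaluation per element
```

Spark behaves the same way: `map` on an RDD records the transformation, and only an action (like writing to a database at the end of this post) triggers the computation.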

Configuration

I have Neo4j 3.4.0 installed on my machine, and I can start it from the installation directory as shown below.

➜ ./bin/neo4j console
Active database: graph.db
Directories in use:
home: /Users/joydeep/Documents/neo4j-community-3.4.0
config: /Users/joydeep/Documents/neo4j-community-3.4.0/conf
logs: /Users/joydeep/Documents/neo4j-community-3.4.0/logs
plugins: /Users/joydeep/Documents/neo4j-community-3.4.0/plugins
import: /Users/joydeep/Documents/neo4j-community-3.4.0/import
data: /Users/joydeep/Documents/neo4j-community-3.4.0/data
certificates: /Users/joydeep/Documents/neo4j-community-3.4.0/certificates
run: /Users/joydeep/Documents/neo4j-community-3.4.0/run
Starting Neo4j.
2018-08-26 04:00:47.627+0000 INFO ======== Neo4j 3.4.0 ========
2018-08-26 04:00:47.684+0000 INFO Starting...
2018-08-26 04:00:50.260+0000 INFO Bolt enabled on 127.0.0.1:7687.
2018-08-26 04:00:54.985+0000 INFO Started.
2018-08-26 04:00:56.122+0000 WARN Low configured threads: (max={} - required={})={} < warnAt={} for {}
2018-08-26 04:00:56.134+0000 INFO Remote interface available at http://localhost:7474/

An interesting thing to notice is that apart from the usual HTTP port at 7474, you also have the Bolt port (7687), which is used to pass Cypher queries to Neo4j. Neo4j is implemented in Java, but it is accessible from software written in other languages, in our case Scala, via the Cypher query language.

Now let's take a jsonlines file in S3. The records look like this:

{"FieldA": "12", "FieldB": "376"}

{"FieldA": "18", "FieldB": "35"}

{"FieldA": "50", "FieldB": "190"}

So we will take FieldA and FieldB, map them to a Spark dataframe, and then push it to Neo4j.

Since you are pulling the data from S3, you will need to provide the AWS keys as environment variables.

export AWS_SECRET_ACCESS_KEY=<mysecret>

export AWS_ACCESS_KEY_ID=<mykey>
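Alternatively, the same credentials can be supplied as Hadoop properties when launching the shell. This is only a sketch: `spark.hadoop.*` properties are forwarded to the underlying Hadoop configuration, the `fs.s3a.*` keys are the standard S3A property names, and the placeholder values are yours to fill in.

```
$ spark-shell \
    --conf spark.hadoop.fs.s3a.access.key=<mykey> \
    --conf spark.hadoop.fs.s3a.secret.key=<mysecret>
```

Environment variables are usually the more convenient option interactively, but explicit configuration is handy when different jobs need different credentials.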

Now, to parse the JSON lines in Spark, we open the spark-shell. You will need to add these dependencies:

aws-java-sdk:1.7.4

hadoop-aws:2.7.5.1

neo4j-spark-connector:2.1.0-M4

So the spark-shell command will look like this. The Scala version that I am using is 2.11.8.

$ spark-shell --packages com.amazonaws:aws-java-sdk:1.7.4,ch.cern.hadoop:hadoop-aws:2.7.5.1,com.fasterxml.jackson.core:jackson-annotations:2.7.9,com.fasterxml.jackson.core:jackson-core:2.7.9,com.fasterxml.jackson.core:jackson-databind:2.7.9.4,org.wso2.orbit.joda-time:joda-time:2.9.4.wso2v1,neo4j-contrib:neo4j-spark-connector:2.1.0-M4

The Spark Pipeline

Now let's get to the code. Create an RDD out of the file using the textFile method.

val lines = sc.textFile("s3a://bucketname/pathtofile.jsonl")

Notice that we are using the s3a scheme rather than s3 or s3n. Below is the rationale.

The difference between s3 and s3n/s3a is that s3 is a block-based overlay on top of Amazon S3, while s3n/s3a are not (they are object-based). The difference between s3n and s3a is that s3n supports objects up to 5GB in size, while s3a supports objects up to 5TB and has higher performance (both are because it uses multi-part upload). s3a is the successor to s3n.

source: https://stackoverflow.com/a/33356421/5417164

Now that you have the RDD, we need to parse the JSON on each line. The handy function below does exactly that: getJsonContent parses the line with json4s, pulls FieldA and FieldB out of the parsedJson, extracts each as a string with the extract[String] method, converts them to integers, and returns them as a tuple.

import org.json4s.{DefaultFormats, MappingException}
import org.json4s.jackson.JsonMethods._
import org.apache.spark.sql.functions._

def getJsonContent(jsonstring: String): (Int, Int) = {
  implicit val formats = DefaultFormats
  val parsedJson = parse(jsonstring)
  val value1 = (parsedJson \ "FieldA").extract[String].toInt
  val value2 = (parsedJson \ "FieldB").extract[String].toInt
  (value1, value2)
}
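As an aside, for flat records as simple as the ones above, the same extraction can be sketched without json4s, using a plain regex. This is purely illustrative (the function name, with Sketch appended, is mine): a regex does not handle string escaping or nested objects, which is exactly why json4s is the right tool in the actual pipeline.

```scala
// Dependency-free sketch of what getJsonContent does for flat records
// like {"FieldA": "12", "FieldB": "376"}. Illustration only.
def getJsonContentSketch(jsonstring: String): (Int, Int) = {
  // Build a regex matching "name": "<digits>" and pull out the digits.
  def field(name: String): Int =
    ("\"" + name + "\"\\s*:\\s*\"(\\d+)\"").r
      .findFirstMatchIn(jsonstring)
      .map(_.group(1).toInt)
      .getOrElse(sys.error("missing field: " + name))

  (field("FieldA"), field("FieldB"))
}

val sample = """{"FieldA": "12", "FieldB": "376"}"""
val parsed = getJsonContentSketch(sample) // (12, 376)
```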

Now that we have the function, we can map it over lines and convert the result to a dataframe.

val newNames = Seq("id", "count")

val df = lines.map(getJsonContent).toDF(newNames: _*)

Now that you have the dataframe, you can create the nodes and relationships in Neo4j.

import org.neo4j.spark._
import org.graphframes._

Neo4jDataFrame.mergeEdgeList(sc, df, ("Event", Seq("id")), ("HAS", Seq.empty), ("Count", Seq("count")))

This means we use the Neo4jDataFrame.mergeEdgeList method, with the Spark context sc and the dataframe df, to create Event and Count nodes connected by HAS relationships, with the property values taken from the id and count columns of the dataframe respectively. Notice that the processing also happens only now: this single action triggers the whole pipeline, so we are able to leverage the full power of Spark's lazy evaluation.

The Result

Once you are done, check the nodes and relationships in Neo4j with a Cypher query:

MATCH p=()-[r:HAS]->() RETURN p LIMIT 25
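Beyond eyeballing the rendered graph, a quick aggregate query confirms how many HAS relationships were actually created:

```
MATCH ()-[r:HAS]->() RETURN count(r) AS relationships
```

If the count matches the number of lines in your source file, the load worked end to end.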

Graph rendering in Neo4j

Scala offers a convenient and easy way to process S3 files. This post is aimed at helping beginners work with S3 files and Scala with ease. If you found it useful, do leave a comment (we would love to hear from you) and share the post with your friends and colleagues.

I have recently completed a book on fastText, a cool library open-sourced by Facebook for efficient text classification and creating word embeddings. Do check out the book.