The Neo4j Spark Connector uses the binary Bolt protocol to transfer data between a Neo4j server and Apache Spark.

It offers Spark-2.0 APIs for RDD, DataFrame, GraphX, and GraphFrames, so you're free to choose how you want to use and process your Neo4j graph data in Apache Spark.

Configure the Neo4j URL, user, and password via the `spark.neo4j.bolt.*` Spark configuration options.
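For example, the connection settings can be supplied when the SparkContext is created. A minimal sketch, assuming the option names follow the `spark.neo4j.bolt.*` pattern described above; the URL and credentials are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Placeholder URL and credentials -- replace with those of your Neo4j instance.
val conf = new SparkConf()
  .setAppName("neo4j-spark-example")
  .set("spark.neo4j.bolt.url", "bolt://localhost:7687")
  .set("spark.neo4j.bolt.user", "neo4j")
  .set("spark.neo4j.bolt.password", "<password>")

val sc = new SparkContext(conf)
```

Alternatively, the same options can be passed on the command line with `--conf`, as shown in the spark-shell invocation below.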

The general usage is:

- create `org.neo4j.spark.Neo4j(sc)`
- set `cypher(query,[params])`, `nodes(query,[params])`, `rels(query,[params])` as a direct query, or
  `pattern("Label1",Seq("REL"),"Label2")` or `pattern(("Label1","prop1"),("REL","prop"),("Label2","prop2"))`
- optionally define `partitions(n)`, `batch(size)`, `rows(count)` for parallelism
- choose which datatype to return:
  - `loadRowRdd`, `loadNodeRdds`, `loadRelRdd`, `loadRdd[T]`
  - `loadDataFrame`, `loadDataFrame(schema)`
  - `loadGraph[VD,ED]`
  - `loadGraphFrame[VD,ED]`

Here is a basic example for loading an `RDD[Row]`:

```scala
org.neo4j.spark.Neo4j(sc).cypher("MATCH (n:Person) RETURN n.name").partitions(5).batch(10000).loadRowRdd
```

```shell
$SPARK_HOME/bin/spark-shell --conf spark.neo4j.bolt.password=<password> \
  --packages neo4j-contrib:neo4j-spark-connector:2.0.0-M2,graphframes:graphframes:0.2.0-spark2.0-s_2.11
```

```scala
import org.neo4j.spark._

val neo = Neo4j(sc)

val rdd = neo.cypher("MATCH (n:Person) RETURN id(n) as id").loadRowRdd
rdd.count

// inferred schema
rdd.first.schema.fieldNames // => ["id"]
rdd.first.schema("id")      // => StructField(id,LongType,true)

neo.cypher("MATCH (n:Person) RETURN id(n)").loadRdd[Long].mean
// => res30: Double = 236696.5

neo.cypher("MATCH (n:Person) WHERE n.id <= {maxId} RETURN n.id").param("maxId", 10).loadRowRdd.count
// => res34: Long = 10
```

Similar operations are available for DataFrames and GraphX. The GraphX integration also allows writing data back to Neo4j with a save operation.
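The GraphX round trip could be sketched as follows. This is a hedged sketch, not a definitive implementation: it assumes a running Neo4j server reachable via the configured Bolt options, and it assumes `Neo4jGraph.saveGraph(sc, graph, propName)` writes the given vertex attribute back as a node property (check this connector version's API before relying on it):

```scala
import org.neo4j.spark._
import org.apache.spark.graphx.lib.PageRank

// Load a GraphX graph from a node/relationship pattern (requires a running Neo4j server).
val g = Neo4j(sc)
  .pattern("Person", Seq("KNOWS"), "Person")
  .partitions(3).rows(1000)
  .loadGraph

// Run PageRank in Spark, then write the computed rank back to the matched nodes.
val g2 = PageRank.run(g, numIter = 5)
Neo4jGraph.saveGraph(sc, g2, "rank")
```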

To use GraphFrames you have to declare it as a package. Then you can load a GraphFrame with graph data from Neo4j and run graph algorithms or pattern matching on it (the latter will be slower than in Neo4j).

```scala
import org.neo4j.spark._
import org.graphframes._

val neo = Neo4j(sc)

val graphFrame = neo.pattern(("Person","id"),("KNOWS",null),("Person","id"))
  .partitions(3).rows(1000).loadGraphFrame

graphFrame.vertices.count // => 100
graphFrame.edges.count    // => 1000

val pageRankFrame = graphFrame.pageRank.maxIter(5).run()
val ranked = pageRankFrame.vertices
ranked.printSchema()

val top3 = ranked.orderBy(ranked.col("pagerank").desc).take(3)
// => top3: Array[org.apache.spark.sql.Row]
// => Array([236716,70,0.62285...], [236653,7,0.62285...], [236658,12,0.62285])
```