
To get a feel for the problems that “Neighborhood Aggregation” addresses, consider a query like: “Which Twitter user has the most followers under the age of 20?”

In this blog, we will learn how to aggregate properties of neighboring vertices in a graph with Apache Spark’s GraphX library. The Spark shell is enough to follow the code example.

So, let us get back to the problem statement. Let us assume the following graph as an example dataset of Twitter users and followers. The property age is stored as a vertex attribute, and an arrow from any vertex x to y means: x follows y.

Defining Graph Components

Let us create the list of vertices for the above graph. The following lines create an RDD of vertices representing Twitter users, each a pair of VertexId and age. The vertices in the graph are labelled with letters, while the code uses consecutive numbers as vertex IDs (A is 1L, B is 2L, and so on).

import org.apache.spark._
import org.apache.spark.graphx._
val users = sc.parallelize(Array((1L, 17), (2L, 19), (3L, 27), (4L, 13), (5L, 25), (6L, 32), (7L, 35), (8L, 29), (9L, 13)))

Edge is a case class that holds the IDs of the source and destination vertices, followed by an edge attribute, here the relationship as a String.

val follows = sc.parallelize(Array(Edge(1L, 3L,"follow"), Edge(2L, 3L,"follow"), Edge(2L, 4L,"follow"), Edge(4L, 5L,"follow"), Edge(4L, 7L,"follow"), Edge(5L, 2L,"follow"), Edge(6L, 7L,"follow"), Edge(6L, 4L,"follow"), Edge(9L, 8L,"follow"), Edge(9L, 1L,"follow"), Edge(9L, 3L,"follow")))

With the vertices and edges defined, the following line creates the graph:

val twitterGraph = Graph(users, follows)
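Before aggregating anything, it can help to check that the graph was built as intended. The snippet below is a minimal sketch that assumes the Spark shell's SparkContext `sc` and rebuilds the same data so that it can be pasted on its own; the names vertexCount, edgeCount and described are just illustrative.

```scala
import org.apache.spark.graphx._

// Same example data as in this post; `sc` is the SparkContext provided by spark-shell.
val users = sc.parallelize(Seq((1L, 17), (2L, 19), (3L, 27), (4L, 13), (5L, 25),
  (6L, 32), (7L, 35), (8L, 29), (9L, 13)))
val follows = sc.parallelize(Seq(Edge(1L, 3L, "follow"), Edge(2L, 3L, "follow"),
  Edge(2L, 4L, "follow"), Edge(4L, 5L, "follow"), Edge(4L, 7L, "follow"),
  Edge(5L, 2L, "follow"), Edge(6L, 7L, "follow"), Edge(6L, 4L, "follow"),
  Edge(9L, 8L, "follow"), Edge(9L, 1L, "follow"), Edge(9L, 3L, "follow")))
val twitterGraph = Graph(users, follows)

// Basic size checks.
val vertexCount = twitterGraph.numVertices
val edgeCount = twitterGraph.numEdges

// A triplet joins each edge with the attributes of both of its endpoints.
val described = twitterGraph.triplets
  .map(t => s"${t.srcId} (age ${t.srcAttr}) follows ${t.dstId} (age ${t.dstAttr})")
  .collect()
described.foreach(println)
```

The triplet view is the same per-edge information that the aggregation step works with.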

Aggregating Attributes

The following code aggregates values from the neighboring edges and vertices to count, for each vertex, how many of its followers are under the age of twenty:

val followersUnderTwenty = twitterGraph.aggregateMessages[Int](tripletFields => { if (tripletFields.srcAttr < 20) tripletFields.sendToDst(1) }, (a, b) => (a + b))

Let us discuss how we aggregated the values around each vertex. GraphX uses aggregateMessages as its core aggregation operation.

The value tripletFields passed to the first function is an EdgeContext, which exposes everything about an edge, i.e. the IDs of the source and destination vertices, the attributes of the source and destination vertices, and the attribute of the edge. (The name is our own choice for the parameter; it is not the GraphX class TripletFields.) The fragment tripletFields.srcAttr yields the age of the source vertex. If that age is less than 20, tripletFields.sendToDst(1) sends a message of type Int with follower count 1 to the destination vertex. This first function plays the role of the map function in map-reduce: it returns Unit, and its only output is the messages it sends through sendToDst() or sendToSrc().

The function (a, b) => (a + b) passed as the second argument of aggregateMessages is the reducer. It takes two messages of type Int addressed to the same vertex and sums them, so each vertex ends up with the number of its followers under 20.
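Since the send function only reads the source attribute, one possible refinement is to say so through the optional third argument of aggregateMessages, which takes a value of the GraphX class TripletFields. The sketch below assumes the Spark shell's `sc`, rebuilds the example graph so it can run on its own, and names the context parameter ctx to avoid confusion with that class.

```scala
import org.apache.spark.graphx._

// Same example data as in this post; `sc` is the SparkContext provided by spark-shell.
val users = sc.parallelize(Seq((1L, 17), (2L, 19), (3L, 27), (4L, 13), (5L, 25),
  (6L, 32), (7L, 35), (8L, 29), (9L, 13)))
val follows = sc.parallelize(Seq(Edge(1L, 3L, "follow"), Edge(2L, 3L, "follow"),
  Edge(2L, 4L, "follow"), Edge(4L, 5L, "follow"), Edge(4L, 7L, "follow"),
  Edge(5L, 2L, "follow"), Edge(6L, 7L, "follow"), Edge(6L, 4L, "follow"),
  Edge(9L, 8L, "follow"), Edge(9L, 1L, "follow"), Edge(9L, 3L, "follow")))
val twitterGraph = Graph(users, follows)

val followersUnderTwenty: VertexRDD[Int] = twitterGraph.aggregateMessages[Int](
  ctx => if (ctx.srcAttr < 20) ctx.sendToDst(1), // send: one message per young follower
  _ + _,                                         // merge: sum messages per destination
  TripletFields.Src                              // declare that only srcAttr is accessed
)
```

The result is the same as without the third argument; the hint only tells GraphX which parts of the triplet it needs to make available to the send function.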
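To see the send and merge steps in isolation, the same computation can be imitated with plain Scala collections. This is only an illustration of the semantics on the example data, not a description of how GraphX executes the operation internally; the names ages, edges, messages and counts are made up for this sketch.

```scala
// Plain Scala, no Spark required.
val ages = Map(1L -> 17, 2L -> 19, 3L -> 27, 4L -> 13, 5L -> 25,
  6L -> 32, 7L -> 35, 8L -> 29, 9L -> 13)
val edges = Seq((1L, 3L), (2L, 3L), (2L, 4L), (4L, 5L), (4L, 7L), (5L, 2L),
  (6L, 7L), (6L, 4L), (9L, 8L), (9L, 1L), (9L, 3L))

// "Send" step: every edge whose source is under 20 emits the message 1 to its destination.
val messages = edges.collect { case (src, dst) if ages(src) < 20 => (dst, 1) }

// "Merge" step: messages addressed to the same vertex are combined with (a, b) => a + b.
val counts = messages.groupBy(_._1).map { case (dst, msgs) =>
  dst -> msgs.map(_._2).reduce((a, b) => a + b)
}
```

Vertices that receive no message, such as 2, 6 and 9 here, simply do not appear in counts.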

Result

Collecting the vertex RDD followersUnderTwenty gives the following result (the order of the pairs may differ from run to run):

Array((4,1), (8,1), (1,1), (5,1), (3,3), (7,1))

The following line yields the result (3,3), i.e. user “C”, who has 3 followers under 20.

followersUnderTwenty.reduce((a,b)=>if(a._2>b._2) a else b)
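The collected result lists only six of the nine users, because a vertex that receives no message is absent from the output of aggregateMessages. If a count for every user is wanted, one option is to join the counts back onto the graph with outerJoinVertices and default the missing ones to 0. The sketch below assumes the Spark shell's `sc` and rebuilds the example so that it runs on its own; withCounts and allCounts are illustrative names.

```scala
import org.apache.spark.graphx._

// Same example data as in this post; `sc` is the SparkContext provided by spark-shell.
val users = sc.parallelize(Seq((1L, 17), (2L, 19), (3L, 27), (4L, 13), (5L, 25),
  (6L, 32), (7L, 35), (8L, 29), (9L, 13)))
val follows = sc.parallelize(Seq(Edge(1L, 3L, "follow"), Edge(2L, 3L, "follow"),
  Edge(2L, 4L, "follow"), Edge(4L, 5L, "follow"), Edge(4L, 7L, "follow"),
  Edge(5L, 2L, "follow"), Edge(6L, 7L, "follow"), Edge(6L, 4L, "follow"),
  Edge(9L, 8L, "follow"), Edge(9L, 1L, "follow"), Edge(9L, 3L, "follow")))
val twitterGraph = Graph(users, follows)

val followersUnderTwenty = twitterGraph.aggregateMessages[Int](
  ctx => if (ctx.srcAttr < 20) ctx.sendToDst(1),
  (a, b) => a + b
)

// Replace each vertex attribute (age) by its follower count, using 0 when no message arrived.
val withCounts = twitterGraph.outerJoinVertices(followersUnderTwenty) {
  (id, age, count) => count.getOrElse(0)
}
val allCounts = withCounts.vertices.collect().toMap
```

With this, users 2, 6 and 9 show up with a count of 0 instead of being missing.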

This was a very simple example of an aggregation operation on a tiny but clear graph dataset. I hope it was helpful 🙂

References:

http://spark.apache.org/docs/latest/graphx-programming-guide.html