I am trying to get back into writing mode and post a couple of fresh blog posts every month. There are a lot of cool new features in Neo4j graph algorithms that I haven’t yet written about, so I will try my best to introduce as many of them as possible.

One of the new features is the upgrade of the node and relationship type projection to support loading multiple relationship types. Let’s take a look at how the algorithms engine works and how this feature may come in handy.

Image from https://neo4j.com/docs/graph-algorithms/current/projected-graph-model/

Whenever we run any of the algorithms, the graph loader reads the graph from Neo4j and loads it as the projected graph in memory. Once the projected graph is stored in memory, we can run any of the algorithms on it. Let’s say we want to run the PageRank algorithm on the network described by relationship type REL_TYPE1 and the connected components algorithm on the network described by REL_TYPE2. If you have read any of my previous blog posts, you might have seen something like

CALL algo.pageRank('Node','REL_TYPE1');

CALL algo.unionFind('Node','REL_TYPE2');

where I run the two algorithms in sequence. This approach is not efficient, as we first project the graph into memory for the PageRank algorithm, store the results, and unload the graph from memory. We then repeat the same process for the connected components algorithm. To avoid loading the same graph into memory multiple times, we can use named graph loading. It allows us to keep the projected graph in memory and run many algorithms on it without having to load it again on every run.

Example of named graph loading:

CALL algo.graph.load('my-graph','Node',
  'REL_TYPE1 | REL_TYPE2',
  {
    duplicateRelationships:'min',
    relationshipProperties:{
      distance:{
        property:'distance'
      },
      duration:{
        property:'duration_avg',
        default_value:5
      }
    },
    nodeProperties:{
      seed_value:{
        property:'seed_value'
      }
    }
  })

As you can see, we have loaded two different types of relationships into the projected graph. We have also loaded the distance and duration attributes of relationships and seed_value attributes of nodes. Those attributes can be used as weights or as seed properties when running graph algorithms.

Now we can run the two algorithms on the same projected graph.

CALL algo.pageRank('Node','REL_TYPE1',{graph:'my-graph'});

CALL algo.unionFind('Node','REL_TYPE2',{graph:'my-graph'});

When we are done with the analysis, remember to remove the graph from memory with:

CALL algo.graph.remove('my-graph')
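To make the saving concrete, here is a toy Python model of the two workflows. It is only a sketch: the `GraphCatalog` class and its methods are hypothetical names invented for this illustration and are not part of the Neo4j library. All it does is count how many times a projection gets built.

```python
class GraphCatalog:
    """Toy stand-in for the named graph store; counts how often a projection is built."""

    def __init__(self):
        self.graphs = {}
        self.projections = 0

    def project(self, rel_types):
        # Stands in for the expensive step of reading the graph from the database.
        self.projections += 1
        return {"rel_types": tuple(rel_types)}

    def load(self, name, rel_types):
        self.graphs[name] = self.project(rel_types)

    def run(self, algorithm, rel_type, graph=None):
        # Without a named graph, every call builds and discards its own projection.
        projected = self.graphs[graph] if graph else self.project([rel_type])
        return f"{algorithm} on {rel_type} using {len(projected['rel_types'])} loaded type(s)"

    def remove(self, name):
        del self.graphs[name]


# Workflow 1: each algorithm call projects the graph on its own.
anonymous = GraphCatalog()
anonymous.run("pageRank", "REL_TYPE1")
anonymous.run("unionFind", "REL_TYPE2")

# Workflow 2: project once under a name, run both algorithms, then clean up.
named = GraphCatalog()
named.load("my-graph", ["REL_TYPE1", "REL_TYPE2"])
named.run("pageRank", "REL_TYPE1", graph="my-graph")
named.run("unionFind", "REL_TYPE2", graph="my-graph")
named.remove("my-graph")

print(anonymous.projections, named.projections)
```

With only two algorithms the difference is one projection, but it grows with every additional algorithm run on the same graph.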

Relationship de-duplication strategy

To better understand the de-duplication strategy, let’s look at the following example.

CREATE (a:Loc)-[:ROAD{cost:4}]->(b:Loc),
       (a)-[:ROAD{cost:7}]->(b),
       (a)-[:RAIL{cost:5}]->(b),
       (a)-[:RAIL{cost:8}]->(b)

In this example, we have two nodes connected by four relationships, all pointing in the same direction. Let’s say we want to search for the shortest weighted path. To understand what happens when we project the graph stored in Neo4j into the in-memory graph used by the algorithms, it is best to look at this quote from the documentation.

The projected graph model does not support multiple relationships between a single pair of nodes.

The relationship de-duplication strategy bridges the gap between multiple stored relationships in Neo4j and a single projected relationship in the algorithms engine. If there are no weights present on the relationships, the choice of strategy does not matter, as all the stored relationships between a pair of nodes will be reduced to a single projected relationship. But if there are weights present on the relationships, we can choose one of the following four strategies to handle weight de-duplication:

skip - keeps the first encountered relationship (and associated weight).

sum - sums the associated weights of all encountered relationships.

min - keeps the minimum weight of all encountered relationships.

max - keeps the maximum weight of all encountered relationships.
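The four strategies can be modelled in a few lines of Python. This is only an illustration of their semantics, not the library's code. It also assumes that relationships are encountered in the order they were created, which matters for skip.

```python
# Illustrative model of the four de-duplication strategies; not the library's code.
STRATEGIES = {
    "skip": lambda weights: weights[0],  # keep the first encountered weight
    "sum": sum,
    "min": min,
    "max": max,
}


def deduplicate(weights, strategy):
    """Collapse the weights of parallel relationships into one projected weight."""
    return STRATEGIES[strategy](weights)


# Costs of the parallel relationships from the example above.
# Assumption: they are encountered in creation order.
road = [4, 7]
rail = [5, 8]

for name in STRATEGIES:
    print(name, deduplicate(road, name), deduplicate(rail, name))
```

For the ROAD pair, min and skip both give 4 here, but only because the cheaper relationship happens to come first in the assumed order.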

When searching for the shortest paths in our graph, we want to use the min de-duplication strategy. We load both the RAIL and the ROAD relationship types in a single call, and they remain available as separate relationship types in the projected graph.

CALL algo.graph.load('my-graph', 'Loc', 'RAIL | ROAD', {relationshipWeight: 'cost', duplicateRelationships: 'min' })

With the graph in memory, we can start searching for the shortest path.

MATCH (start:Loc)-->(end:Loc)
WITH distinct start,end
CALL algo.shortestPath.stream(start,end,'cost',
  {graph:'my-graph',relationshipQuery:'RAIL'})
YIELD nodeId,cost
RETURN nodeId,cost

Results

As expected, the path uses the minimum cost of the two possible RAIL relationships.

Try the same thing with the relationship type ROAD.

MATCH (start:Loc)-->(end:Loc)
WITH distinct start,end
CALL algo.shortestPath.stream(start,end,'cost',
  {graph:'my-graph',relationshipQuery:'ROAD'})
YIELD nodeId,cost
RETURN nodeId,cost

Results

Now run the algorithm and don’t specify any relationship type. If we don’t specify the relationship type, the algorithm will traverse all available relationship types.

MATCH (start:Loc)-->(end:Loc)
WITH distinct start,end
CALL algo.shortestPath.stream(start,end,'cost',
  {graph:'my-graph'})
YIELD nodeId,cost
RETURN nodeId,cost

Results

Again, we get back the shortest available path in the network of all relationship types (ROAD and RAIL).

Remember that it is the graph loader, not the shortest path algorithm, that decides how to de-duplicate relationship weights. If, for example, we used the sum de-duplication strategy, the shortest path in the ROAD and RAIL example would cost 24, as that is the sum of all four relationship weights (4 + 7 + 5 + 8).
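The split between loader and algorithm can be sketched in plain Python. The `load_graph` and `path_cost` functions below are hypothetical helpers written for this illustration. The sketch assumes the unfiltered case from the last query, where all four relationships between the pair collapse into one projected relationship.

```python
from collections import defaultdict

# Stored relationships from the example: (source, target, type, cost).
STORED = [
    ("a", "b", "ROAD", 4),
    ("a", "b", "ROAD", 7),
    ("a", "b", "RAIL", 5),
    ("a", "b", "RAIL", 8),
]


def load_graph(stored, aggregate):
    """Loader step: collapse parallel relationships into one weight per node pair."""
    grouped = defaultdict(list)
    for source, target, _rel_type, cost in stored:
        grouped[(source, target)].append(cost)
    return {pair: aggregate(costs) for pair, costs in grouped.items()}


def path_cost(projected, source, target):
    """Algorithm step: for a single hop it just reads the one projected weight."""
    return projected[(source, target)]


cost_with_min = path_cost(load_graph(STORED, min), "a", "b")
cost_with_sum = path_cost(load_graph(STORED, sum), "a", "b")
print(cost_with_min, cost_with_sum)
```

The algorithm step is identical in both runs. Only the aggregate passed to the loader changes the answer.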

Analysis of Rome transportation system

With this new understanding of the projected graph, let’s move to a more practical example. I found this excellent dataset of the Rome transportation network. It is quite rich and covers five different transportation modes, such as subway, bus, and plain walking.

Graph model

As before, we have nodes with only one label. The only difference is that here we have five different modes of transportation, each stored as a relationship type.

Create constraint

CREATE CONSTRAINT ON (s:Stop) ASSERT s.id IS UNIQUE;

Import

We will first import the nodes of the network and then import the relationships. You need to copy the data to the $Neo4j/import folder before importing it.

Import nodes

LOAD CSV WITH HEADERS FROM "file:///network_nodes.csv" as row FIELDTERMINATOR ";"
MERGE (s:Stop{id:row.stop_I})
SET s+=apoc.map.clean(row,['stop_I'],[])

Import relationships

UNWIND ['walk','bus','tram','rail','subway'] as mode
LOAD CSV WITH HEADERS FROM "file:///network_" + mode + ".csv" as row FIELDTERMINATOR ";"
MATCH (from:Stop{id:row.from_stop_I}),(to:Stop{id:row.to_stop_I})
CALL apoc.create.relationship(
  from, toUpper(mode),
  {distance:toInteger(row.d),
   duration_avg:toFloat(row.duration_avg)}, to) YIELD rel
RETURN distinct 'done'

Walking is the only transportation mode that lacks the average duration attribute. Luckily for us, we can easily calculate it if we assume that a person walks at 5 kilometers per hour on average, or around 1.4 meters per second.

WITH 5 / 3.6 as walking_speed
MATCH (:Stop)-[w:WALK]->()
SET w.duration_avg = toFloat(w.distance) / walking_speed
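The unit conversion is easy to check by hand. Here is the same arithmetic in Python, assuming the distance is stored in meters and the resulting duration is in seconds. The 500 meter walk is just a made-up example value.

```python
WALKING_SPEED_KMH = 5

# 1 km/h = 1000 m / 3600 s, so dividing by 3.6 converts km/h to m/s.
walking_speed_ms = WALKING_SPEED_KMH / 3.6


def walk_duration_seconds(distance_m):
    """Time needed to walk distance_m meters at the assumed average speed."""
    return distance_m / walking_speed_ms


print(round(walking_speed_ms, 2))          # speed in meters per second
print(round(walk_duration_seconds(500)))   # seconds for a 500 m walk
```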

Graph analytics pipeline

Now that the graph is prepared, we can start the graph algorithms pipeline by loading the Neo4j stored graph into the projected in-memory graph. We load the graph with five relationship types and two attributes of relationships. These two attributes can be used as the relationship weights by the algorithms.

CALL algo.graph.load('rome','Stop',
  'BUS | RAIL | SUBWAY | TRAM | WALK',
  {
    duplicateRelationships:'min',
    relationshipProperties:{
      distance:{
        property:'distance'
      },
      duration:{
        property:'duration_avg'
      }
    }
  })

PageRank algorithm

To start the analysis, let’s find the most graphfamous™ stops in the tram transportation network using the PageRank algorithm.

CALL algo.pageRank.stream('Stop','TRAM',{graph:'rome'})
YIELD nodeId, score
WITH nodeId, score
ORDER BY score DESC LIMIT 5
RETURN algo.asNode(nodeId).name as name, score

Results

The graph loader supports loading many relationship types, and so do the algorithms. In this example, we search for the most graphfamous™ stops in the combined network of buses, trams, and rails.

CALL algo.pageRank.stream('Stop','TRAM | RAIL | BUS',{graph:'rome'})
YIELD nodeId, score
WITH nodeId, score
ORDER BY score DESC LIMIT 5
RETURN algo.asNode(nodeId).name as name, score

Results
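To see why the set of relationship types changes the ranking, here is a small power-iteration PageRank in Python that filters relationships by type before ranking. It is a textbook sketch on a made-up four-node network, not the library's implementation. The scores are normalized to sum to 1, which may differ from how the library scales its scores.

```python
def page_rank(nodes, relationships, rel_types, damping=0.85, iterations=20):
    """Power-iteration PageRank over the relationships whose type is in rel_types."""
    out_links = {node: [] for node in nodes}
    for source, target, rel_type in relationships:
        if rel_type in rel_types:
            out_links[source].append(target)

    n = len(nodes)
    scores = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Nodes without outgoing links spread their score evenly over all nodes.
        dangling = sum(scores[node] for node in nodes if not out_links[node])
        new_scores = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for node in nodes:
            for target in out_links[node]:
                new_scores[target] += damping * scores[node] / len(out_links[node])
        scores = new_scores
    return scores


# Hypothetical toy network; the stops and connections are made up.
nodes = ["A", "B", "C", "D"]
relationships = [
    ("A", "B", "TRAM"),
    ("B", "A", "TRAM"),
    ("C", "B", "BUS"),
    ("D", "B", "BUS"),
    ("B", "C", "RAIL"),
]

tram_only = page_rank(nodes, relationships, {"TRAM"})
combined = page_rank(nodes, relationships, {"TRAM", "RAIL", "BUS"})
print(max(combined, key=combined.get))
```

In the tram-only run, stop C is isolated. Once RAIL and BUS relationships are allowed, it gains an incoming link and its score rises.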

Connected components algorithm

A graph algorithms pipeline can also be part of a batch processing job, where you load the graph in memory, run a couple of algorithms, write back results to Neo4j, and unload the in-memory graph. Let’s run the connected components algorithm on each transportation mode network separately and write back the results.

UNWIND ["BUS","RAIL","SUBWAY","TRAM","WALK"] as mode
CALL algo.unionFind('Stop',mode,{writeProperty:toLower(mode) + "_component"})
YIELD computeMillis
RETURN distinct 'done'

Explore the connected components in the TRAM network.

MATCH (s:Stop)
RETURN s.subway_component as component,
       collect(s.name)[..3] as example_members,
       count(*) as size
ORDER BY size DESC
LIMIT 10

Results

These results are weird. I have never been to Rome, but I highly doubt there are six disconnected TRAM components. Even looking at results, you might wonder why the components 7848 and 7827 have the same members.

Your component ids will likely be different, so make sure to use the right ones.

MATCH p = (s:Stop)-[:SUBWAY]-()
WHERE s.subway_component in [7848,7827]
RETURN p

Results

I know it is hard to see, but there are stops in the network with the same name. While the names of the stops might be the same, the stop ids are not and, as such, the stops are treated as separate nodes. We can guess that this is a single tram line driving in both directions, one on each side of the road. As the stations for each direction are a short walk apart, this dataset differentiates between them.
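The effect is easy to reproduce with a small union-find in Python. The stop ids and names below are made up for the illustration. The point is only that components are computed over node ids, so two stops that share a name but not an id can end up in different components.

```python
def connected_components(node_ids, edges):
    """Union-find: map every node id to the representative id of its component."""
    parent = {node: node for node in node_ids}

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving keeps the trees shallow
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    return {node: find(node) for node in node_ids}


# Hypothetical stops: the same two names appear once per direction of travel.
stops = {1: "Stop A", 2: "Stop B", 3: "Stop A", 4: "Stop B"}
edges = [(1, 2), (4, 3)]  # one connection per direction, never touching each other

components = connected_components(stops, edges)

members = {}
for stop_id, component in components.items():
    members.setdefault(component, []).append(stops[stop_id])
print(members)
```

Both components list the same stop names, which mirrors what the query results show.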

Shortest paths algorithms

I found a use-case where you would want to keep the projected graph in memory all the time. Imagine we are building an application that will help us find the shortest or fastest path between two points in Rome. We don’t want to project the graph in memory for every query, but rather have the projected graph in memory all the time.

We can search for the shortest path traversing only a specific relationship type, or in our case transportation mode.

MATCH (start:Stop{name:'Parco Leonardo'}),(end:Stop{name:'Roma Trastevere'})
CALL algo.shortestPath.stream(start,end,'distance',{graph:'rome',relationshipQuery:'RAIL'})
YIELD nodeId,cost
RETURN algo.asNode(nodeId).name as name, cost as meters

Results

The problem with using only the RAIL network is that most of the stops are not in the RAIL network. To be able to find the shortest path between any pair of stops in our network, we have to allow the algorithm to traverse the WALK relationships as well.

MATCH (start:Stop{name:'LABICANO/PORTA MAGGIORE'}),(end:Stop{name:'TARDINI'})
CALL algo.shortestPath.stream(start,end,'distance',{graph:'rome',relationshipQuery:'WALK | RAIL'})
YIELD nodeId, cost
RETURN algo.asNode(nodeId).name as name, cost as meters

Results

And if you remember, we stored two relationship attributes in the projected graph. Let’s now use the duration attribute as the weight.

MATCH (start:Stop{name:'LABICANO/PORTA MAGGIORE'}),(end:Stop{name:'TARDINI'})
CALL algo.shortestPath.stream(start,end,'duration',{graph:'rome',relationshipQuery:'WALK | RAIL'})
YIELD nodeId, cost
RETURN algo.asNode(nodeId).name as name, cost / 60 as minutes

Results
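The idea of one projected graph serving both questions can be sketched with a small Dijkstra in Python that takes the weight property and the allowed relationship types as arguments. The stops, distances, and durations below are invented for the illustration. The walking durations assume the same 5 km/h as before.

```python
import heapq


def shortest_path(relationships, start, end, weight, rel_types):
    """Dijkstra over relationships of the allowed types, using the chosen weight property."""
    graph = {}
    for source, target, rel_type, props in relationships:
        if rel_type in rel_types:
            graph.setdefault(source, []).append((target, props[weight]))

    queue = [(0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == end:
            return path, cost
        if node in visited:
            continue
        visited.add(node)
        for neighbour, step in graph.get(node, []):
            if neighbour not in visited:
                heapq.heappush(queue, (cost + step, neighbour, path + [neighbour]))
    return None, float("inf")


# Hypothetical stops with distance in meters and duration in seconds.
relationships = [
    ("A", "B", "WALK", {"distance": 300, "duration": 216.0}),
    ("B", "D", "WALK", {"distance": 400, "duration": 288.0}),
    ("A", "C", "RAIL", {"distance": 900, "duration": 90.0}),
    ("C", "D", "RAIL", {"distance": 800, "duration": 120.0}),
]

shortest, meters = shortest_path(relationships, "A", "D", "distance", {"WALK", "RAIL"})
fastest, seconds = shortest_path(relationships, "A", "D", "duration", {"WALK", "RAIL"})
print(shortest, meters)
print(fastest, seconds / 60)
```

In this made-up network, walking is the shortest route in meters while the train is the fastest in minutes, so the choice of weight changes the answer even though the graph stays the same.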

Conclusion

I hope I have given you some ideas on how to design your graph algorithms pipeline and reap the benefits.

As always, the code is available on GitHub.