In this post I’d like to demonstrate how to work with the Relato Business Graph. We’ll be using the Titan graph database, but because Titan implements the generic graph framework TinkerPop, the same code should work with other TinkerPop-compatible databases.

You can grab the source code for this post here.

Getting Started

Start by grabbing the project from GitHub.

Now download the data from its project on data.world. Copy the files into the data directory of the Github repository.

unzip /tmp/datasyndrome-relato-business-graph-database.zip

cp \
/tmp/datasyndrome-relato-business-graph-database/original/*.json \
~/Software/open_business_graph/data/

Setting up Titan

Now let’s download and set up Titan. We’ll need to add the Joda-Time jar to our CLASSPATH to load our data.



curl -Lko ~/Software/titan-1.0.0-hadoop1.zip \

http://s3.thinkaurelius.com/downloads/titan/titan-1.0.0-hadoop1.zip

unzip titan-1.0.0-hadoop1.zip # Could not make the hadoop2 binary work
cd titan-1.0.0-hadoop1

CLASSPATH=~/Software/open_business_graph/lib/joda-time-2.9.9.jar \

bin/gremlin.sh

Gremlin is a graph query language that provides a REPL. Titan exposes various classes for working with the database through Gremlin, and it can use several systems as its underlying storage backend, including horizontally scalable ones like Cassandra and HBase.

We start by creating a BerkeleyDB-backed graph database. Check out setup.groovy:

// Setup the database

conf = new BaseConfiguration()

conf.setProperty("storage.directory", "/Users/rjurney/Software/open_business_graph/data")

conf.setProperty("storage.backend", "berkeleyje")

graph = TitanFactory.open(conf)

// Setup our graph schema
mgmt = graph.openManagement()

Next we create a unique index for the company’s web domain property, which serves as our unique key.

// Vertex labels

company = mgmt.makeVertexLabel('company').make()



// Node properties

domain = mgmt.makePropertyKey('domain').dataType(String.class)\

.cardinality(Cardinality.SINGLE).make()



// Indexes

mgmt.buildIndex('byDomainUnique', Vertex.class).addKey(domain).unique().buildCompositeIndex()

Then we create the edge types that define our graph: partnership, customer, competitor, investor. Finally, we commit the changes.

// Relationships

partner = mgmt.makeEdgeLabel('partnership').multiplicity(MULTI).make()

customer = mgmt.makeEdgeLabel('customer').multiplicity(MULTI).make()

competitor = mgmt.makeEdgeLabel('competitor').multiplicity(MULTI).make()

investor = mgmt.makeEdgeLabel('investor').multiplicity(MULTI).make()



// Commit changes

mgmt.commit()
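To see what the unique index buys us, here is a tiny Python sketch (not Titan’s API, just the invariant it enforces): at most one company vertex per domain, with duplicate keys rejected.

```python
# Sketch of the invariant the 'byDomainUnique' composite index enforces:
# at most one company vertex per domain key.
index = {}

def add_company(domain, name):
    """Insert a company vertex, rejecting duplicate domain keys."""
    if domain in index:
        raise ValueError("domain already indexed: " + domain)
    index[domain] = {"name": name}

add_company("example.com", "Example Inc.")
try:
    add_company("example.com", "Impostor Co.")
except ValueError:
    print("duplicate rejected")
```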

Loading the Business Graph

The following code initializes Gremlin to work with our database each time you load gremlin.sh. Check out startup.groovy:

import org.joda.time.DateTime

import org.joda.time.format.ISODateTimeFormat



// Setup our database on top of berkeleydb (for now)

conf = new BaseConfiguration()

conf.setProperty("storage.directory", "/Users/rjurney/Software/open_business_graph/data")

conf.setProperty("storage.backend", "berkeleyje")

graph = TitanFactory.open(conf)



// Get a graph traverser

g = graph.traversal()

Check out the file read_json_load_titan.groovy. First we set up our nodes, the companies themselves.

// Setup JSON reading of MongoDB mongodump data
import groovy.json.JsonSlurper

jsonSlurper = new JsonSlurper()



companies_filename = "/Users/rjurney/Software/marketing/jsondump/companies.json"

company_reader = new BufferedReader(new FileReader(companies_filename));



while((json = company_reader.readLine()) != null)

{

document = jsonSlurper.parseText(json)



println(document.domain)

v = graph.addVertex('company')

v.property("_id", document._id.$oid)

v.property("update_time", document.update_time.$date)

v.property("domain", document.domain)

v.property("name", document.name)

}
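The company file is newline-delimited JSON in mongodump’s extended format, so each line parses to one document. A minimal Python sketch of the same loop, using made-up sample records with the field shapes shown above:

```python
import json

# Two sample lines in the mongodump extended-JSON shape parsed above
lines = [
    '{"_id": {"$oid": "a1"}, "update_time": {"$date": "2017-01-01T00:00:00Z"},'
    ' "domain": "example.com", "name": "Example"}',
    '{"_id": {"$oid": "a2"}, "update_time": {"$date": "2017-01-02T00:00:00Z"},'
    ' "domain": "acme.com", "name": "Acme"}',
]

vertices = {}
for line in lines:
    document = json.loads(line)
    # Mirror the Groovy loop: one 'company' vertex per line, keyed by domain
    vertices[document["domain"]] = {
        "_id": document["_id"]["$oid"],
        "update_time": document["update_time"]["$date"],
        "name": document["name"],
    }

print(sorted(vertices))
```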

Next we load the edges.

// Get a graph traverser

g = graph.traversal()



// Create edges between companies

links_filename = "/Users/rjurney/Software/open_business_graph/data/links.json"

links_reader = new BufferedReader(new FileReader(links_filename));



update_time = new Date().format("yyyy-MM-dd'T'HH:mm:ss'Z'", TimeZone.getTimeZone("UTC"))



while((json = links_reader.readLine()) != null)

{

document = jsonSlurper.parseText(json)



try {

// Add edges to graph

v1 = g.V().has('domain', document.home_domain).next()

v2 = g.V().has('domain', document.link_domain).next()



v1.addEdge(document.type, v2, 'update_time', update_time)

}

catch(Exception ex) {
println("Error: " + ex)
println(document)
println(v1.values())
println(v2.values())
}

}
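Conceptually, each link record names its two endpoints by domain; the traversal resolves those keys to vertices, and the try/catch skips links whose endpoints were never loaded. A dictionary-based Python sketch of that lookup-then-connect pattern (sample data invented):

```python
# Vertices keyed by domain, as the unique index allows
vertices = {"a.com": {}, "b.com": {}}

# Sample link records in the shape of links.json
links = [
    {"home_domain": "a.com", "link_domain": "b.com", "type": "partnership"},
    {"home_domain": "a.com", "link_domain": "missing.com", "type": "customer"},
]

edges = []
for document in links:
    try:
        v1 = vertices[document["home_domain"]]  # like g.V().has('domain', ...).next()
        v2 = vertices[document["link_domain"]]
        edges.append((document["home_domain"], document["type"], document["link_domain"]))
    except KeyError as ex:
        # As in the Groovy catch block: report the bad record and keep loading
        print("Error:", ex, document)

print(edges)
```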

Verify the data loaded by counting the nodes and edges:

gremlin> g.V().count()

==>51222

gremlin> g.E().count()

==>376092

Querying the Business Graph

Network centralities are powerful metrics of a node’s or edge’s importance. Gremlin is a powerful language, though a hard one to learn, so next we demonstrate several types of centrality in Gremlin, starting with degree centrality. In-degree centrality is a measure of popularity: how many edges point at a node.

// Calculate partnership in-degree centrality

import groovy.json.JsonBuilder

inDegreePartnership = g.V().group().by('domain').by(inE('partnership').count()).next()

// JSONize and write to disk
inDegreePartnershipJson = new JsonBuilder(inDegreePartnership).toString()

new File("/Users/rjurney/Software/marketing/titan/data/in_degree_partnership.json").write(inDegreePartnershipJson)
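In plain terms, the query above counts each company’s incoming partnership edges. The same computation on a toy edge list in Python:

```python
from collections import Counter

# Toy (source, target) partnership edges between company domains
partnerships = [("a.com", "b.com"), ("c.com", "b.com"), ("b.com", "a.com")]

# In-degree centrality: how many partnerships point at each domain
in_degree = Counter(dst for _, dst in partnerships)

print(in_degree.most_common())
```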

Another type of centrality is eigenvector centrality. As Wikipedia puts it, “connections to high-scoring nodes contribute more to the score of the node in question than equal connections to low-scoring nodes.”

// Calculate partnership eigenvector centrality

partnershipLoopCentrality = g.V().repeat(out('partnership').groupCount('m').by('domain')).times(5).cap('m').next()

// JSONize and write to disk

partnershipLoopCentralityJson = new JsonBuilder(partnershipLoopCentrality).toString()

new File("/Users/rjurney/Software/marketing/titan/data/partner_loop_centrality.json").write(partnershipLoopCentralityJson)
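The repeat(...).times(5) traversal approximates eigenvector centrality by counting how often each vertex is reached by outgoing walks of up to five hops, so vertices reachable from well-connected vertices accumulate higher counts. A small Python sketch of that walk-counting idea (toy graph, not the Titan API):

```python
from collections import Counter, defaultdict

# Toy partnership graph: a -> b, b -> c, c -> b
out_edges = defaultdict(list)
for src, dst in [("a.com", "b.com"), ("b.com", "c.com"), ("c.com", "b.com")]:
    out_edges[src].append(dst)

# Start one walker at every vertex, step 5 times, and count visits;
# this mirrors repeat(out('partnership').groupCount('m')).times(5)
visits = Counter()
frontier = Counter({v: 1 for v in ["a.com", "b.com", "c.com"]})
for _ in range(5):
    step = Counter()
    for node, walkers in frontier.items():
        for neighbor in out_edges.get(node, []):
            step[neighbor] += walkers
            visits[neighbor] += walkers
    frontier = step

print(visits.most_common(1)[0][0])
```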

Conclusion

In this post we’ve loaded the Relato Business Graph into Titan and demonstrated some basic queries. In the next post, we’ll be diving into more complex use cases. Gremlin provides incredibly powerful query capabilities, if you can wrap your head around them!

Need help working with the Relato Business Graph? Data Syndrome is available to help. We help clients build analytics products from conception through implementation and operation.