Building a Graph Database

The model we’re going to use is shown above. The only parts absolutely critical to making this work are the user’s name and the post’s ID. Everything else is just flavour and detail, such as Karma, which is Reddit’s way of expressing how well received someone’s contributions to the site are. See the addendums for comments on this.

In building a graph database, there’s an element of working backwards: we need to know what we want to achieve before we go and get the data. In the interest of giving you a flow of development, I haven’t really mentioned the data model, but the image above gives you an indication. We want to go from users to any posts they’ve made or commented on (or both). The easiest way to do this is to create a single CSV for each of the above parts: a user table, a posts table, and a connections table (which also includes the username and post ID so we can link them together).
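As a rough sketch of what those three files look like (the file names, column names, and sample values here are my own illustrative assumptions, not the exact fields from the project):

```python
import csv

# Three flat files: one per vertex type, plus one for the edges.
# All names and values below are made up for illustration.
users = [
    {"username": "alice", "karma": 1520},
    {"username": "bob", "karma": 87},
]
posts = [
    {"puid": "t3_abc123", "title": "Example post"},
]
# One row per user -> post connection.
connections = [
    {"username": "alice", "puid": "t3_abc123"},
    {"username": "bob", "puid": "t3_abc123"},
]

def write_csv(path, rows, fieldnames):
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

write_csv("users.csv", users, ["username", "karma"])
write_csv("posts.csv", posts, ["puid", "title"])
write_csv("connections.csv", connections, ["username", "puid"])

# Read the edge file back to confirm the shape.
with open("connections.csv", newline="") as f:
    loaded = list(csv.DictReader(f))
```

The key design point is that the connections file is the only one carrying both identifiers, which is what lets the graph database stitch the two vertex types together.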

Each individual connection between a user and a post gets its own line. The only really quirky part here is the Comments JSON field, which holds a JSON array of any comments made. This matters because we only want one row per user -> post connection, but a user can make many comments on a single post. It isn’t critical for our graph database to work; it just means we can extract some extra detail on their types of comments more quickly and easily.
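Collapsing many comment events into one row per user -> post pair might look like this (a minimal sketch; the field names are assumptions):

```python
import json
from collections import defaultdict

# Raw comment events: a user can comment many times on one post.
comments = [
    {"username": "alice", "puid": "t3_abc123", "body": "First!"},
    {"username": "alice", "puid": "t3_abc123", "body": "Also, great post."},
    {"username": "bob", "puid": "t3_abc123", "body": "Agreed."},
]

# Group every comment body under its (user, post) pair.
grouped = defaultdict(list)
for c in comments:
    grouped[(c["username"], c["puid"])].append(c["body"])

# One output row per connection, with all comments packed into
# a single JSON field.
rows = [
    {"username": u, "puid": p, "comments_json": json.dumps(bodies)}
    for (u, p), bodies in grouped.items()
]
```

Three comment events become two connection rows, with alice’s two comments stored together in her row’s JSON field.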

Our graph database software of choice here is going to be TigerGraph. It offers a handy developer edition, though this is admittedly limited to 500MB of data and struggles with large graph visualisations. Plus, I have a particular fondness for GSQL as a graph query language, especially over Gremlin. So, step 1, we define our graph schema.

Users, whether they are commenting or submitting a post in the first place, always act on a particular piece of content.

Voila! It really is as simple as that for us. At the absolute lowest level, this is what our graph is trying to do: who posted on what? We’re not actually treating comments and posts separately, because the connection would look the same, just labelled “commented_on” instead.
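Stripped of the database machinery, the whole schema is just a bipartite mapping between users and content. A plain-Python sketch (not how TigerGraph stores anything, and with made-up identifiers):

```python
# One edge type, "posted_on", between two vertex types. Whether
# the user submitted the post or commented on it doesn't change
# the shape of the connection.
posted_on = {
    ("alice", "t3_abc123"),
    ("bob", "t3_abc123"),
    ("alice", "t3_def456"),
}

def posts_for(user):
    """All content a user is connected to."""
    return {p for (u, p) in posted_on if u == user}

def users_for(post):
    """All users connected to a piece of content."""
    return {u for (u, p) in posted_on if p == post}
```

Those two lookups are exactly the “double-click on a bubble” operation we use later in the visual explorer.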

Next, we need to attach our data. We upload our data file and start telling TigerGraph which fields of our data relate to which features in our graph. For instance, a username in the data relates to the User bubble. A post ID (puid) in the data relates to our Post bubble. The edge connecting the two bubbles is a special case, and takes both of those fields. This special connection is where the magic begins in a graph database.

If you’re curious why the fields don’t match the image table above, see the addendums at the bottom of this post.

On the left of the screen, we tell TigerGraph where we want to make a connection between our data and our graph schema. On the right, we give it the specifics. We repeat this process for the Content vertex (on the Post ID), and again for the posted_on line (which takes two entries, one being the username and the other the Post ID).

On to step 3: This is really easy. We just say that everything is good and, yes, we want to build a graph database on what we’ve set up. Hit run, let it do its thing (it’s surprisingly fast, even when running just on my 8GB MacBook), and check the stats.

You can see the rate of load in the bottom right graph, showing it took about 30 seconds overall.

We’ll linger here for just a moment to look at those stats: 2,033,637 vertices, of which 17,642 are users and the rest posts. This was from only two hops’ worth of data! We also have over twice as many edges as vertices, which is a good sign for finding internal connections.

So, let’s get to the good bit.

Exploring the Graph

We move on to stage 4, visually exploring the graph. This is the final stage we’ll cover here (there is a stage 5, writing GSQL queries, but that’s beyond what we need for now).

To begin with, I’m going to ask TigerGraph to return the user we started with, the cause of this entire project.

One lonely bubble. Normally, TigerGraph shows us relevant information, which would include the username here, but I’ve told it not to, to save myself a big job blurring everything out.

And now, we double-click on this bubble. And we get…

Pop!

All their posts, and posts they’ve commented on! So, for a final check, let’s just quickly double-click a post and see what happens.

When we double-click on a bubble, TigerGraph adds all connected data and highlights the new additions. So above, the highlighted parts to the right are all new, and we can see the post I double-clicked on is connected to our original user!

Amazing, looks like it all works. So, we’re going to go one bigger. We can also tell TigerGraph, starting from our origin user, to expand out to any relevant posts; then from there, to any relevant users; then again to their posts, and so on. We actually only need to repeat this process a small number of times before we’ve covered every hop in our data. There is one catch, though: this version of TigerGraph can only show a certain amount of data, so we’re going to tell it to sample and bring back most, but not all, of it. We could miss useful things this way, so in a production environment this wouldn’t be appropriate. Alas, let’s go ahead and see what we get.
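Conceptually, that expand-out is a capped breadth-first search alternating between the two sides of the graph. A sketch (the cap of 50 mirrors what we ask for below; the edge data and function are my own illustration, not TigerGraph’s API):

```python
from collections import deque

def expand(origin_user, edges, hops=2, cap=50):
    """Breadth-first expansion over (username, puid) edges:
    user -> posts -> users -> ..., keeping at most `cap`
    neighbours per node to mimic the sampling limit."""
    by_user, by_post = {}, {}
    for u, p in edges:
        by_user.setdefault(u, []).append(p)
        by_post.setdefault(p, []).append(u)

    seen_users, seen_posts = {origin_user}, set()
    frontier = deque([("user", origin_user)])
    for _ in range(hops):
        nxt = deque()
        while frontier:
            kind, node = frontier.popleft()
            if kind == "user":
                for p in by_user.get(node, [])[:cap]:
                    if p not in seen_posts:
                        seen_posts.add(p)
                        nxt.append(("post", p))
            else:
                for u in by_post.get(node, [])[:cap]:
                    if u not in seen_users:
                        seen_users.add(u)
                        nxt.append(("user", u))
        frontier = nxt
    return seen_users, seen_posts

# Tiny made-up graph: two hops out from "origin" reaches u2 via
# their shared post, but never touches the disconnected u3/p2.
edges = [("origin", "p1"), ("u2", "p1"), ("u3", "p2")]
users, posts = expand("origin", edges, hops=2)
```

The `cap` truncation is exactly why a sampled expansion can miss things: any neighbour past the cut-off simply never enters the frontier.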

Bit messy though.

And this is what we get. We’ve asked TigerGraph to bring back the first 50 pieces of relevant content for our origin user (of which there are only 15 anyway, as we know from when we double-clicked earlier), and then bring back 50 users attached to each of those posts. Luckily for us, the actual volumes here are small. But it’s not particularly nice to look at. TigerGraph has a handful of options for changing how the data is shown; if we choose the circle mode, we get something nicer.

Pretty! It would be useful if there were a way to filter out vertices that only have a single edge attached, but we’d need the GSQL query mode for that, which is beyond the scope of this article.

Amazingly, the circle mode almost perfectly suits what we’re trying to do. We’re now seeing something really useful. Each of those lines indicates where a user has commented on something. What we’re looking for is users (blue bubbles) with lots of lines. In reality, we’re simply looking for anywhere there’s more than one line.
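Counting “lines per blue bubble” is a one-liner once you have the edges back out of the graph. A sketch with made-up usernames and post IDs:

```python
from collections import Counter

# (username, puid) edges, as pulled back from the explorer.
# Names are invented for illustration.
edges = [
    ("user_a", "t3_one"), ("user_a", "t3_two"),
    ("user_b", "t3_one"), ("user_b", "t3_two"),
    ("user_c", "t3_three"),
]

# Count the lines attached to each user; more than one line is
# what stands out in the circle view.
lines_per_user = Counter(u for u, _ in edges)
multi_edge_users = {u for u, n in lines_per_user.items() if n > 1}
```

Here `user_a` and `user_b` each touch two posts while `user_c` touches one, so only the first two would draw the eye.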

Zooming in on the centre and bringing in posts (red bubbles) with multiple connections to our central trio makes it easier to see the network.

We have three users here that stand out. They have a number of connections between them, which, on closer inspection, turn out to always be a mix of one of them posting and another of the accounts commenting. We also have a single post which all three accounts commented on (the bottom-right bubble).

But why is this suspicious? Well, look at the image posted before we zoomed in. Of all those blue bubbles, only three have more than a single connection. Reddit is a huge website; it’s deeply uncommon for people to comment on the same things, even on highly popular posts, and we can see from the small number of users (when we asked for 50 per post) that this is not particularly popular content.

So what now?

Findings and Conclusions

This alone isn’t enough to determine astroturfing. Far from it. But I can assure you that a closer look into these accounts confirms it (and, as I showed right back at the start with the Soundcloud comments image, we already know it’s happening). It does suggest that graph databases could be used to find astroturfing activity with relative ease.

There are a few counter-arguments. What if friends comment on each other’s content? What about really, wildly popular posts? What if there are ‘Reddit celebrity’ users that always attract the same audience to their content? This approach isn’t perfect, but we can use more traditional methods to filter these cases out. Rules like “only show me users with connections to more than a single post” and “whitelist these users to ignore them and their content, as we know they’re okay” would be a massive help.
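Both of those rules are straightforward to express once the edges are in hand. A sketch, with the function name, thresholds, and sample accounts all being my own assumptions:

```python
from collections import defaultdict

def filter_candidates(edges, whitelist=frozenset(), min_posts=2):
    """Apply the two rules from the text:
    1. drop whitelisted (known-good) users entirely;
    2. keep only users connected to more than a single post."""
    posts_by_user = defaultdict(set)
    for user, post in edges:
        if user not in whitelist:
            posts_by_user[user].add(post)
    return {u for u, ps in posts_by_user.items() if len(ps) >= min_posts}

# Made-up example: a suspect pair of posts, a whitelisted
# celebrity, and a one-off commenter.
edges = [
    ("suspect_1", "p1"), ("suspect_1", "p2"),
    ("celebrity", "p1"), ("celebrity", "p2"), ("celebrity", "p3"),
    ("passerby", "p1"),
]
candidates = filter_candidates(edges, whitelist={"celebrity"})
```

The celebrity would otherwise dominate the results, and the passerby is noise; the whitelist removes the first and the minimum-posts rule removes the second, leaving just the account worth a human look.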

The biggest issue I’ve had with this, by far, is getting the data. There’s just so much of it, and access through the API is so slow, that I can’t possibly get enough. However, if you already had all this information and could dive straight into the graph database side of things… (cough cough, Reddit admins, cough).

This approach does work, and it’s surprisingly simple to do. In fact, for me, the hardest part was getting the data in the first place. It isn’t a be-all and end-all: you still need to take the output from your graph and investigate it.

This project took a lot of twists and turns along the way, and I encourage you to read the addendums if you have questions. If you have questions I haven’t covered there, do put them in the comments section.

Thanks for reading. It’s been a surprisingly long road putting this together. Let me know in the comments if you’ve got ideas on how you could use graph databases, or where I missed building in functionality for this Reddit tool!