





[As community content, this post reflects the views and opinions of the particular author and does not necessarily reflect the official stance of Neo4j.]

The Challenge with Machine Learning and Art

Collaborative Filtering and Neo4j

What We Learned at Artfinder using Neo4j

EXPLAIN

PROFILE

PROFILE

FILTER

MATCH (user:User {id: 1})-[:LIKES_ARTWORK]->(artwork:Artwork) RETURN artwork

Artwork

MATCH (user:User {id: 1})-[:LIKES_ARTWORK]->(artwork) RETURN artwork

(User:User)-[:LIKES]->(Artist)
(User:User)-[:LIKES]->(Artwork)

LIKES

:LIKES

(User)-[:LIKES_ARTIST]->(Artist)
(User)-[:LIKES_ARTWORK]->(Artwork)

MATCH

User

LIKE

MATCH (:User {id: 1})-[:LIKES]->(something)

In the first portion of this query, you may find that two of the users like multiple things you also like. This would mean that we run two extra traversals in the next portion of the query.



To avoid this, we use DISTINCT to filter out duplicates and reduce the number of further lookups we need to do.



MATCH (:User {id: 1})-[:LIKES]-(something)

By removing duplicates, we cut the number of traversals and, with it, the query time. The impact grows as the graph gets bigger over time. Special thanks go to Mark Needham.
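As a rough sketch of the deduplication step, the query below chains two MATCH clauses with WITH DISTINCT in between. The LISTENS_TO relationship, the Genre label and the name property are hypothetical names used only for illustration; they stand in for the "music" part of the example.

```cypher
// Sketch only: LISTENS_TO, Genre and genre.name are assumed names.
MATCH (:User {id: 1})-[:LIKES]->(something)<-[:LIKES]-(other:User)
// Collapse users who matched through several shared likes into one row each,
// so the next traversal runs once per user rather than once per shared like.
WITH DISTINCT other
MATCH (other)-[:LISTENS_TO]->(genre:Genre)
RETURN genre.name AS genre, count(*) AS listeners
ORDER BY listeners DESC
```

Prefixing the query with PROFILE, once with and once without the WITH DISTINCT line, is one way to compare how much work each version does.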



Conclusion

So, what have we achieved?



I’d go as far as to say that as the first site in the art space to deliver a completely personalised home page, ‘My Artfinder’ is a new way of shopping for art. In a traditionally curator-led, advisory market, personalised recommendations based on individual users’ tastes are a huge leap forwards.



While we're definitely not done tweaking, refactoring and optimizing our implementation, the speed with which we went from concept to production still amazes me.



An invaluable tool in our quest to lead the way in personalisation within the art space, Neo4j has helped our team to develop, deploy and maintain an in-production graph database system that provides thousands of users with relevant, real-time recommendations on a daily basis.










They say that good artists copy, but great artists steal, right?

At Artfinder, the global online marketplace for original art, we’ve just launched ‘My Artfinder’, a mix tape of personal recommendations for users, just like Spotify’s Discover Weekly. But not to be outdone, our recommendations are updated daily, thanks to the speed of Neo4j!

When we first started thinking about implementing artwork recommendations for our users, two methodologies came to the forefront quite early on: machine learning and collaborative filtering.

We currently have ~180,000 artworks listed on the platform, with hundreds of new works added every day. Our artists sign on to the platform and, once there, can upload and classify their artwork freely.

One thing that quickly became apparent with the machine learning option was the need for reliable, concrete artwork classifications to feed any recommendation engine we would come to develop.

Now, I’m not an art connoisseur by any standard, but one thing that’s clear is that everything about art is highly subjective, so one of the most surefire ways to get correct classifications would be to classify a few hundred thousand artworks manually. This would be the only way to ensure all classifications were consistent.

Further to this, different people may choose to classify an artwork differently, raising doubts over the validity of the classifications unless we had a single person do all 200,000. Not really an ideal use of time.

So, with this in mind, we began looking at the other standout option for real-time recommendations: collaborative filtering. We were looking for something exceptionally powerful, yet flexible enough to meet our particular demands, and were delighted by Neo4j’s ease of use as well as its amazing online resources. Getting an initial proof-of-concept up and running was a breeze. We already had a fairly large dataset of what our users personally like and dislike, so implementation was straightforward.

I’ve spent a lot of time working with emerging technologies, and one thing that immediately struck me about Neo4j was the simplicity of the Cypher query language and how easy it is to construct and read queries. I found Cypher queries far easier to understand than anything I’ve come across in the RDBMS/SQL world.

Cypher’s simple yet powerful syntax made working with the graph much easier, and the built-in web front-end (i.e., the Neo4j Browser) made iterating on, testing and profiling queries later in the development cycle easier still. We could analyse query performance and get visual results quickly to see where any bottlenecks were, then try out different versions, all of which helped us get a usable system up and running quickly.

Here are a few things we learned:

We used the EXPLAIN and PROFILE commands to get a good understanding of how Neo4j was interpreting our queries. In a lot of cases, PROFILE was able to show us exactly where our queries were falling down (e.g., a forgotten index resulting in a full DB scan). Having immediate visibility of these sticking points meant we could solve them that much faster and saved us plenty of time and head-scratching.

We were sparing in our use of node labels when performing lookups. At first, it can seem counter-intuitive to be less specific in your queries, but in a lot of cases, specifying a node label on a related node will result in an unneeded FILTER operation against the returned set.

Here is an example of finding all of the artwork that one of our users likes:

MATCH (user:User {id: 1})-[:LIKES_ARTWORK]->(artwork:Artwork) RETURN artwork

Whereas leaving off the Artwork label from the related node results in:

MATCH (user:User {id: 1})-[:LIKES_ARTWORK]->(artwork) RETURN artwork

Needless to say, this isn’t always the case, but understanding this concept went a long way towards improving the speed of our queries.

When we modelled our data, we were specific with relationships between our nodes. This allowed us to be much more targeted with our queries. For example:

(User:User)-[:LIKES]->(Artist)
(User:User)-[:LIKES]->(Artwork)

The above re-uses the LIKES relationship type for both the relationship between a user and an artist and the one between a user and an artwork. If you have 10 artists and 20 artworks, doing a scan for :LIKES may touch all 30 nodes (or more). So it’s much better to be specific with relationships:

(User)-[:LIKES_ARTIST]->(Artist)
(User)-[:LIKES_ARTWORK]->(Artwork)

This relationship specificity limited the number of nodes returned when querying for a very specific type of relationship.

Reduce Cardinality of Results

Often you will chain results from one MATCH statement into another, using the results from the first as the starting point for the next traversal. In instances where the first statement may return the same nodes multiple times, you can shorten your query time by reducing duplicates.

An example of this would be finding all User nodes that LIKE the same items as you and then going on to find out what kind of music those users listen to.
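To see where those duplicates come from, here is a minimal sketch on made-up data. The ids and the two artworks are purely illustrative, and the semicolon-separated statements assume a client such as cypher-shell, or Neo4j Browser with multi-statement support enabled. User 2 shares two likes with user 1, so the first MATCH returns that user twice:

```cypher
// Hypothetical sample data: user 2 likes the same two artworks as user 1.
CREATE (u1:User {id: 1}), (u2:User {id: 2}),
       (a:Artwork {id: 10}), (b:Artwork {id: 11}),
       (u1)-[:LIKES]->(a), (u1)-[:LIKES]->(b),
       (u2)-[:LIKES]->(a), (u2)-[:LIKES]->(b);

// One row comes back per (shared item, other user) pair,
// so timesReturned counts how often each user would be handed
// to the next MATCH in a chained query.
MATCH (:User {id: 1})-[:LIKES]->(something)<-[:LIKES]-(other:User)
RETURN other.id AS user, count(*) AS timesReturned;
```

Every extra row here becomes an extra starting point for whatever traversal comes next.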