Poincaré Embeddings for Learning Hierarchical Representations

Arxiv link: https://arxiv.org/abs/1705.08039

After an initial read, one of my professors remarked:

Idea is quite interesting and the results seem too good to be true.

Well, this was the reaction of quite a few people to whom I showed the results.

First, let me show you the results I got on a part of WordNet, so that the impression of implausibility fades a little:

WordNet embeddings for Mammal (only three levels)

I have only used three levels:

Black one is at level 0

Red ones are at level 1

Green ones are at level 2.

The results seem good to me.

In one of my previous posts, Cross-lingual word embeddings- What they are?, I explained word embeddings. They can be used in different tasks like information retrieval, sentiment analysis, and myriad others.

Similarly, we can embed graphs. Methods like node2vec and latent space embeddings help us represent graphs, and subsequently help with community detection and link prediction.

Let us look at how the original node2vec paper puts it:

In node2vec, we learn a mapping of nodes to a low-dimensional space of features that maximizes the likelihood of preserving network neighborhoods of nodes.

Like word2vec embeddings, node2vec embeddings can be used to predict new links. Suppose we have data from a social network. After embedding the nodes, if some distance metric tells us that two nodes (users) are close, we can suggest that those users become friends. One case where this helps: user2 is new to the network and makes friends similar to user1's. Given the overlapping friends (edges in the graph), the chance that user1 and user2 know each other increases.
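The friend-suggestion idea above can be sketched in a few lines. This is a toy illustration, not the paper's code: the embeddings and user names are made up, and I use plain Euclidean distance as the "some distance metric":

```python
import numpy as np

def suggest_friends(embeddings, user, k=3):
    """Suggest the k users nearest to `user` by Euclidean distance
    between their (hypothetical) node2vec embeddings."""
    dists = {
        other: np.linalg.norm(vec - embeddings[user])
        for other, vec in embeddings.items()
        if other != user
    }
    return sorted(dists, key=dists.get)[:k]

# Toy embeddings: user2 is new and lands close to user1.
embeddings = {
    "user1": np.array([0.9, 0.1]),
    "user2": np.array([0.88, 0.12]),
    "user3": np.array([-0.7, 0.6]),
}
print(suggest_friends(embeddings, "user2", k=1))  # ['user1']
```

With real node2vec vectors you would do the same thing, just with embeddings learned from random walks instead of hand-picked ones.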

There are two main aspects of node2vec embeddings:

Ability to learn representations that embed nodes from the same network community closely together, as well as to learn representations where nodes that share similar roles have similar embeddings

Source: Google Images

Now consider the case of representing hierarchical data. Hierarchical data is structured like a tree.

And intuitively, a tree can be compared to hyperbolic space,

with the root at the center and the leaves spanning outwards.

Source: Google Images

The paper exploits this fact:

To exploit this structural property for learning more efficient representations, we propose to compute embeddings not in Euclidean but in hyperbolic space, i.e., space with constant negative curvature. Informally, hyperbolic space can be thought of as a continuous version of trees and as such it is naturally equipped to model hierarchical structures.

And they used the Poincaré ball model of hyperbolic space because it is well suited to gradient-based optimization, and hence to backpropagation.

Of course, when you move to a different space, the distance metric changes too. In the Poincaré ball model, the hyperbolic distance between two points u and v is given by:

Source: Original paper

where ||x|| is the Euclidean norm. If you compare this with the formula given on Wikipedia, you may notice that 1 appears in place of |r|, because the authors use a ball of radius 1.
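Here is a minimal NumPy sketch of that distance, d(u, v) = arcosh(1 + 2·||u − v||² / ((1 − ||u||²)(1 − ||v||²))); the example points are arbitrary, and the small `eps` guarding the denominator is my own addition, not part of the paper:

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Hyperbolic distance between two points inside the unit Poincaré ball."""
    sq_diff = np.sum((u - v) ** 2)
    denom = (1 - np.sum(u ** 2)) * (1 - np.sum(v ** 2))
    return np.arccosh(1 + 2 * sq_diff / (denom + eps))

u = np.array([0.1, 0.2])
v = np.array([0.5, -0.3])
print(poincare_distance(u, v))

# Near the boundary, small Euclidean gaps become large hyperbolic ones:
a = np.array([0.95, 0.0])
b = np.array([0.99, 0.0])
print(poincare_distance(a, b))  # far larger than the Euclidean gap of 0.04
```

That blow-up near the boundary is exactly what lets leaves of a deep tree spread out there while the root sits comfortably near the center.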

Now, for training the model, we build an optimization objective based on this distance, like the one used by Mikolov et al. with negative sampling.

Source: Original paper
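A minimal sketch of such an objective for a single positive pair, assuming the negative examples have already been sampled (the point values below are arbitrary toy inputs, not from the paper):

```python
import numpy as np

def poincare_distance(u, v):
    sq_diff = np.sum((u - v) ** 2)
    denom = (1 - np.sum(u ** 2)) * (1 - np.sum(v ** 2))
    return np.arccosh(1 + 2 * sq_diff / denom)

def pair_loss(u, v, negatives):
    """Negative log-softmax over distances: pull the related pair (u, v)
    together and push the sampled negatives away."""
    scores = np.array([-poincare_distance(u, n) for n in [v] + list(negatives)])
    return np.log(np.sum(np.exp(scores))) - scores[0]

u = np.array([0.10, 0.00])
v = np.array([0.15, 0.05])
negs = [np.array([-0.6, 0.4]), np.array([0.2, -0.7])]
print(pair_loss(u, v, negs))
```

The only change from the familiar word2vec-style objective is that the score inside the softmax is a hyperbolic distance instead of a dot product.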

And then backpropagate our way to update the embeddings.

∇E = ( ∂L(θ)/∂d(θ,x) ) * ( ∂d(θ,x)/∂θ )

The first factor is already known, and they give the formula for the second:

Source: Original Paper

They use a small hack so that the embeddings do not leave the Poincaré ball after an update. For this, they apply the following projection:

Source: Original Paper

And then they update the embeddings by the following rule:

Source: Original Paper

Results

This is the most interesting part.

Source: Original Paper

We can see that what this model achieves in a 5-dimensional space is better than what the Euclidean model achieves in a 200-dimensional space.

I hope you are in awe of the results.

TL;DR

They embed the nodes of hierarchical data in hyperbolic space and achieve some super impressive results.

You can follow me on Twitter at @nishantiam, and I am on GitHub at @nishnik.