Facebook AI Research has announced it is open-sourcing PyTorch-BigGraph (PBG), a tool that can easily process and produce embeddings for extremely large graphs. PBG can also handle multi-relation graph embeddings where the model is too large to fit in memory. Facebook boasts that PBG not only trains faster than commonly used embedding software, but also delivers higher-quality results than state-of-the-art benchmarks.

The company says PBG will allow users to quickly produce high-quality embeddings from a large graph using either a single machine or multiple machines in parallel.

Graphs are widely used to represent data across almost every area of computing. A graph can be seen as an alternative way of labeling a training dataset, one that captures the connections between data points and describes their relationships. Existing unsupervised graph embedding methods learn a vector representation for each node in the graph, trained to predict the occurrence of edges: embeddings of node pairs that share an edge are pulled closer together than embeddings of pairs that do not.
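The idea above can be sketched in a few lines. The following toy example (an illustration of the general approach, not PBG's actual implementation) trains dot-product node embeddings on a small path graph with a margin ranking loss, so that true edges outscore corrupted ones; the graph, dimensions, and hyperparameters are all made up for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

edges = [(0, 1), (1, 2), (2, 3)]  # toy graph: a 4-node path
num_nodes, dim = 4, 8

# One embedding vector per node, randomly initialized.
emb = rng.normal(scale=0.1, size=(num_nodes, dim))

def score(u, v):
    """Similarity of two nodes: dot product of their embeddings."""
    return emb[u] @ emb[v]

def total_loss(margin=1.0):
    """Hinge loss: every true edge should outscore every corrupted edge."""
    loss = 0.0
    for u, v in edges:
        for neg in range(num_nodes):
            if neg in (u, v):
                continue
            loss += max(0.0, margin - score(u, v) + score(u, neg))
    return loss

loss_before = total_loss()

lr, margin = 0.1, 1.0
for _ in range(200):
    for u, v in edges:
        for neg in range(num_nodes):
            if neg in (u, v):
                continue
            # Only update when the margin constraint is violated.
            if margin - score(u, v) + score(u, neg) > 0:
                # Subgradient step on the hinge loss: pull u and v
                # together, push u and the corrupted node neg apart.
                gu, gv, gn = emb[v] - emb[neg], emb[u].copy(), -emb[u]
                emb[u] += lr * gu
                emb[v] += lr * gv
                emb[neg] += lr * gn
```

After training, connected pairs score higher than the corrupted pairs, which is exactly the ranking signal that embedding methods in this family optimize.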

However, training extremely large graphs with billions of nodes and trillions of edges with this method can take weeks or even years. The prohibitively long training time is caused by a lack of embedding systems with sufficiently fast processing speed and sufficient memory capacity.

Facebook’s PBG overcomes these challenges for large-scale graph embedding with three fundamental building blocks:

Partitioning of the graph — Nodes are divided randomly into multiple partitions, so the full embedding table never has to fit in memory at once.

PyTorch parallelization primitives — Distributed training runs on multiple machines by leveraging a shared filesystem.

Negative sampling — A single batch of randomly sampled nodes is reused to produce corrupted negative samples for many training edges, allowing the tool to train on many negative examples per true edge at low computational cost.
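The negative-sampling trick in the last item can be sketched as follows. The batch size of 64 and pool size of 50 are illustrative assumptions, not PBG's actual parameters; the point is that one small pool of random nodes is shared across the whole batch of positive edges.

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes = 1_000

# A batch of positive training edges, as (source, destination) pairs.
pos_edges = rng.integers(num_nodes, size=(64, 2))

# Sample ONE shared pool of random nodes for the whole batch ...
neg_pool = rng.integers(num_nodes, size=50)

# ... and corrupt every positive edge against that same pool: each
# source is paired with every pooled node, yielding 64 * 50 negative
# edges while touching only 50 candidate node embeddings.
neg_edges = np.stack(
    (np.repeat(pos_edges[:, 0], len(neg_pool)),   # each source, 50 times
     np.tile(neg_pool, len(pos_edges))),          # the pool, 64 times
    axis=1,
)
```

Reusing one pool this way is what keeps the per-edge cost of many negatives low: the embeddings for the pooled nodes are fetched once and scored against every positive in the batch.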

The Facebook AI team evaluated PBG performance on both the publicly available full Freebase knowledge graph (120 million nodes and 2.7 billion edges) and the smaller FB15k Freebase subset (15,000 nodes and 600,000 edges). On the FB15k dataset, PBG showed performance comparable to state-of-the-art embedding methods.

On the full-size public dataset, PBG showed an 88 percent reduction in memory usage.

The PBG paper, PyTorch-BigGraph: A Large-scale Graph Embedding System, is on arXiv, and the PBG code is available on GitHub.