In the six months since its release, Google’s pretrained model BERT (Bidirectional Encoder Representations from Transformers) has become one of the hottest AI tools for Natural Language Processing (NLP). Although BERT has an unprecedented ability to capture rich semantic meaning from plain text, it is not perfect. For example, if you ask the question “Is Bob Dylan a songwriter or a book author?” BERT’s pursuit of a response becomes as tangled as Dylan’s hair.

As shown in the Figure above, BERT has a tough time accurately classifying Bob Dylan’s different creative pursuits into occupations without knowing that “Blowin’ in the Wind” is a song and “Chronicles: Volume One” is a book. A pretrained model like BERT neither incorporates external knowledge for language understanding nor extracts fine-grained relations such as “composer” or “author.” BERT will likely recognize the two sample sentences as syntactically ambiguous (“UNK wrote UNK in UNK”, where “UNK” = “Unknown”).
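The ambiguity can be made concrete with a small sketch. The two sentences below are illustrative assumptions (they are not quoted from the article), and `mask_entities` is a hypothetical helper, not part of BERT or ERNIE. It replaces every entity mention with “UNK” to show that, without entity knowledge, both sentences reduce to the same pattern.

```python
def mask_entities(sentence, mentions):
    """Replace each entity mention with the placeholder "UNK"."""
    # Replace longer mentions first so a short mention cannot
    # break up a longer one that contains it.
    for mention in sorted(mentions, key=len, reverse=True):
        sentence = sentence.replace(mention, "UNK")
    return sentence


# Illustrative sentences (assumed for this sketch).
song_sentence = "Bob Dylan wrote Blowin' in the Wind in 1962"
book_sentence = "Bob Dylan wrote Chronicles: Volume One in 2004"

song_pattern = mask_entities(
    song_sentence, ["Bob Dylan", "Blowin' in the Wind", "1962"]
)
book_pattern = mask_entities(
    book_sentence, ["Bob Dylan", "Chronicles: Volume One", "2004"]
)

# Both sentences collapse to the same surface pattern.
print(song_pattern)
print(book_pattern)
```

Once the entities are hidden, nothing in the surface form separates “composer” from “author”; that distinction has to come from knowledge about the entities themselves.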

To address this kind of problem, researchers from Tsinghua University and Huawei Noah’s Ark Lab recently proposed a new model that incorporates knowledge graphs (KGs) into training on large-scale corpora for language representation. In a nod to the Sesame Street characters, the new model was named “ERNIE.”

ERNIE tackles two main challenges in incorporating external knowledge into language representation: Structured Knowledge Encoding and Heterogeneous Information Fusion.

For Structured Knowledge Encoding, named entities mentioned in the text are first identified and then aligned to the corresponding entities in KGs. The structure of the KGs is encoded with knowledge embedding algorithms such as TransE, and the resulting informative entity embeddings are passed to the ERNIE model for training.
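To give a feel for what a knowledge embedding algorithm like TransE does, here is a minimal sketch of its scoring idea: a triple (head, relation, tail) is plausible when head + relation lands close to tail in embedding space. The two-dimensional vectors and the names `transe_score` and `margin_loss` are made up for illustration; they are not ERNIE's actual embeddings or code.

```python
import math


def transe_score(head, relation, tail):
    """Negative L2 distance between (head + relation) and tail.

    A higher (less negative) score means a more plausible triple.
    """
    squared = sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail))
    return -math.sqrt(squared)


def margin_loss(positive_score, negative_score, margin=1.0):
    """Margin ranking loss: push true triples above corrupted ones."""
    return max(0.0, margin - positive_score + negative_score)


# Toy 2-d embeddings (hypothetical values chosen for the example).
bob_dylan = [0.9, 0.1]
composer_of = [-0.4, 0.7]
blowin_in_the_wind = [0.5, 0.8]
chronicles_volume_one = [1.0, -0.6]

score_true = transe_score(bob_dylan, composer_of, blowin_in_the_wind)
score_false = transe_score(bob_dylan, composer_of, chronicles_volume_one)

# The true triple scores higher, so the loss for this pair is zero.
loss = margin_loss(score_true, score_false)
print(score_true > score_false, loss)
```

In training, embeddings would be adjusted to reduce this loss over many true and corrupted triples; the sketch only evaluates fixed vectors.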

For Heterogeneous Information Fusion, ERNIE uses a BERT-like architecture (adopting masked language modeling and next sentence prediction as pretraining objectives) and adds a new pretraining objective for better alignment of named entities.
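The article does not spell out the new objective, so the following is only a sketch under an assumption: that it amounts to predicting which KG entity a token aligns to, using a softmax over the similarity between the token's representation and each candidate entity embedding. The vectors, candidate list, and the name `entity_distribution` are hypothetical, and any learned projection of the token vector is omitted for brevity.

```python
import math


def entity_distribution(token_vector, entity_embeddings):
    """Softmax over dot products between a token and candidate entities."""
    scores = {
        name: sum(w * e for w, e in zip(token_vector, embedding))
        for name, embedding in entity_embeddings.items()
    }
    # Subtract the maximum score for numerical stability.
    max_score = max(scores.values())
    exp_scores = {name: math.exp(s - max_score) for name, s in scores.items()}
    total = sum(exp_scores.values())
    return {name: value / total for name, value in exp_scores.items()}


# Hypothetical contextual vector for the token "Dylan".
token_vector = [1.0, 0.2]

# Hypothetical entity embeddings, e.g. as produced by TransE.
candidates = {
    "Bob Dylan": [0.9, 0.1],
    "Blowin' in the Wind": [0.1, 0.8],
    "Chronicles: Volume One": [-0.5, 0.3],
}

probabilities = entity_distribution(token_vector, candidates)
predicted = max(probabilities, key=probabilities.get)

# Cross-entropy against the correct alignment would be the training signal.
alignment_loss = -math.log(probabilities["Bob Dylan"])
print(predicted)
```

Under this assumption, hiding some token-entity alignments during pretraining and asking the model to recover them would encourage token representations to carry entity information.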

To evaluate ERNIE’s performance, experiments were conducted primarily on two knowledge-driven NLP tasks: entity typing and relation classification. English Wikipedia was used as the pretraining corpus, and the text was aligned to Wikidata (knowledge embeddings were trained on Wikidata with TransE).

Detailed results of the evaluations are shown in the following tables, where FIGER and Open Entity are the datasets used for entity typing, and FewRel and TACRED are the datasets used for relation classification. ERNIE significantly outperforms other state-of-the-art models on the two knowledge-driven NLP tasks.

ERNIE also achieved results comparable to the basic version of BERT on eight datasets of the GLUE (General Language Understanding Evaluation) benchmark, which indicates that injecting knowledge does not degrade ERNIE’s performance on other common NLP tasks.

In their paper, the researchers identify three possible avenues for future research: injecting knowledge into feature-based pretraining models such as ELMo, introducing diverse structured knowledge such as ConceptNet into language representation models, and heuristically annotating additional real-world corpora to build larger pretraining data.

The paper ERNIE: Enhanced Language Representation with Informative Entities is on arXiv. All the relevant code is open-sourced on GitHub.