In this post I will present one of the first proposed Open Information Extraction systems, which is very simple and effective, relying only on part-of-speech tags. I also implement it and apply it to Portuguese news articles.

But first, a small introduction. Information Extraction, in an NLP context, deals with the problem of extracting structured information from text. For a computer, text is just a sequence of bytes with no semantic meaning whatsoever.

One possible approach to extracting structured information from text is to identify named-entities (e.g., persons, locations, organisations) and then extract the semantic relationships between them, also known as semantic triples.

Semantic relationships are typically represented by triples in the form <e1, rel, e2>, where e1 and e2 are noun-phrases/entities of a relationship, and rel is the type of relationship relating the two noun phrases.

For instance, given the following sentence:

"The linguist Noam Chomsky was born in East Oak Lane neighbourhood of Philadelphia."

we first identify the named-entities; in this case, three different named-entities are recognized: Noam Chomsky, East Oak Lane, and Philadelphia.

Then, two different relationships can be extracted between the recognized named-entities:

<Noam Chomsky, place-of-birth, East Oak Lane>

<East Oak Lane, part-of, Philadelphia>
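To make the triple representation concrete, here is a minimal sketch of how the two relationships above could be held in code. The `Triple` class and the `format_triple` helper are names I am introducing for illustration only; they are not part of any library used later in this post.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """A semantic triple <e1, rel, e2> (illustrative container)."""
    e1: str   # first entity / noun phrase
    rel: str  # relationship type or relational phrase
    e2: str   # second entity / noun phrase


def format_triple(triple):
    """Render a triple in the <e1, rel, e2> notation used in the text."""
    return "<%s, %s, %s>" % (triple.e1, triple.rel, triple.e2)


# The two relationships from the Noam Chomsky sentence
triples = [
    Triple("Noam Chomsky", "place-of-birth", "East Oak Lane"),
    Triple("East Oak Lane", "part-of", "Philadelphia"),
]

for t in triples:
    print(format_triple(t))
```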

Typically, or at least until 2007, most systems aimed at extracting relationships whose types were known a priori (e.g., place-of-birth, part-of), with each target relationship either learned from training data or defined by hand-made rules.

Another approach is Open Information Extraction (OIE) (Etzioni et al., 2008; Banko et al., 2007), which is suited to cases where the target relations are unknown and the textual data is heterogeneous.

OIE techniques extract all possible relationship types from a given collection of documents. They can be divided into two main categories: data-based and rule-based.

Rule-based OIE relies on hand-crafted patterns over PoS-tagged text or on rules operating on dependency parse trees. Data-based OIE generates patterns from training data represented by means of dependency trees or PoS-tagged text.

The figure below shows a taxonomy of different methods to perform relationship extraction. If you are interested you can also check Chapter 2 of my PhD thesis for a review of work in relationship extraction.





ReVerb: Identifying Relations for Open Information Extraction

ReVerb (Fader et al., 2011) extracts relationships based on a simple constraint: every relational phrase, i.e., the sequence of words connecting two entities, must be either:

a verb (e.g., invented),

a verb followed immediately by a preposition (e.g., located in),

a verb followed by nouns, adjectives, or adverbs ending in a preposition (e.g., has atomic weight of)

This corresponds to the PoS-tag pattern shown in the figure below.

If there are multiple possible matches for a single verb, the longest possible match is chosen. If the pattern matches multiple adjacent sequences, ReVerb merges them into a single relation phrase.
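The constraint, the longest-match rule, and the merging of adjacent matches can be sketched with nothing but Python's `re` module. This is a simplified illustration, not ReVerb's implementation: I assume Universal PoS tags, map each tag to one character so that character offsets equal token indices, and assign the tags in the examples by hand. The names `TAG_CODES`, `REVERB` and `relation_phrases` are mine.

```python
import re

# One character per tag, so the sentence becomes a string whose
# character offsets are token indices. Unknown tags become "X".
TAG_CODES = {
    "VERB": "V", "AUX": "V",                        # verbs (AUX treated as a verb: assumption)
    "NOUN": "N", "ADJ": "J", "ADV": "R",
    "PRON": "O", "DET": "D",
    "ADP": "I", "PART": "T",                        # prepositions and particles
}

# V | V P | V W* P, where
#   V = verb particle? adverb?      -> "VT?R?"
#   W = noun|adj|adv|pron|det       -> "[NJRDO]"
#   P = preposition|particle        -> "[IT]"
# The quantifiers are greedy, so a match extends as far as the pattern
# allows from its verb; the outer "+" merges adjacent matches.
REVERB = re.compile(r"(?:VT?R?(?:[NJRDO]*[IT])?)+")


def relation_phrases(tagged):
    """Return (start, end, phrase) for each relational phrase in a
    list of (token, tag) pairs; end is exclusive."""
    codes = "".join(TAG_CODES.get(tag, "X") for _, tag in tagged)
    return [
        (m.start(), m.end(),
         " ".join(token for token, _ in tagged[m.start():m.end()]))
        for m in REVERB.finditer(codes)
    ]


# Hand-tagged examples (the tags are assumptions, not tagger output)
born = [("Noam", "PROPN"), ("Chomsky", "PROPN"), ("was", "AUX"),
        ("born", "VERB"), ("in", "ADP"), ("Philadelphia", "PROPN")]
weight = [("Hydrogen", "NOUN"), ("has", "VERB"), ("atomic", "ADJ"),
          ("weight", "NOUN"), ("of", "ADP"), ("1.008", "NUM")]

print(relation_phrases(born))
print(relation_phrases(weight))
```

In the first example "was" and "born in" are two adjacent matches that end up merged into the single phrase "was born in"; in the second, the phrase runs from the verb through to the closing preposition.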

During extraction, the system first looks for a matching relational phrase and then for the arguments (e1, e2) of the relationship, thus avoiding confusing a noun in the relational phrase for an argument.
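A rough sketch of this relation-first ordering, assuming the relational phrase and the candidate noun phrases are already available as token spans (start inclusive, end exclusive). The function name `pick_arguments` and the span representation are my own choices for illustration.

```python
def pick_arguments(candidate_spans, rel_span):
    """Pick the nearest candidate ending before the relational phrase
    and the nearest candidate starting after it.

    Candidates overlapping the relational phrase are never considered,
    so a noun inside the phrase cannot be mistaken for an argument.
    Returns (left_span, right_span) or None if either side is missing.
    """
    rel_start, rel_end = rel_span
    left = [span for span in candidate_spans if span[1] <= rel_start]
    right = [span for span in candidate_spans if span[0] >= rel_end]
    if not left or not right:
        return None
    nearest_left = max(left, key=lambda span: span[1])
    nearest_right = min(right, key=lambda span: span[0])
    return nearest_left, nearest_right


tokens = ["Hydrogen", "has", "atomic", "weight", "of", "1.008"]
rel_span = (1, 5)                       # "has atomic weight of"
candidates = [(0, 1), (2, 4), (5, 6)]   # "Hydrogen", "atomic weight", "1.008"

args = pick_arguments(candidates, rel_span)
if args is not None:
    (s1, e1), (s2, e2) = args
    print(tokens[s1:e1], tokens[rel_span[0]:rel_span[1]], tokens[s2:e2])
```

Here the span for "atomic weight" lies inside the relational phrase, so it is skipped and the arguments are "Hydrogen" and "1.008".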

Extracting semantic relationships from Portuguese

For English, there are several software packages ready to use out-of-the-box for simple NLP tasks such as part-of-speech tagging, dependency parsing, and named-entity recognition.

But for Portuguese, and many other languages, that’s not the case. I was therefore curious whether I could quickly and easily adapt ReVerb to Portuguese and extract triples from a collection of documents.

I also made a small change to the original ReVerb algorithm: instead of looking for noun phrases, I first tagged all the named-entities (i.e., PER, LOC, ORG) in the document collection, and then looked for relational phrases that match the PoS-based pattern and connect two named-entities in a relationship.

I used three main resources for running a quick experiment:

I) Polyglot is an NLP library supporting several languages, including Portuguese, which performs:

sentence boundary detection

part-of-speech tagging

named-entity recognition (NER)

I had some trouble installing PyICU, which is needed for Polyglot; this post on Stack Overflow helped me.

Polyglot tries to automatically detect the language in which a text is written and apply the correct model (i.e., for PoS-tagging, NER, etc.). However, due to foreign names, it sometimes detects the language as English or some other language rather than Portuguese. You can force the language of a text and override the language detection mechanism.





II) I used RegexpParser from NLTK to encode ReVerb’s regular expression adapted to Portuguese. The PoS-tags given by Polyglot follow the Universal Part of Speech tag set, which has 17 unique tags.

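# verb group: optional adverbs and auxiliaries around the main verb
verb = "<ADV>*<AUX>*<VERB><PART>*<ADV>*"
# tags allowed between the verb and the closing preposition
word = "<NOUN|ADJ|ADV|DET|ADP>"
preposition = "<ADP|ADJ>"
# the outer "+" lets adjacent matches merge into one relational phrase
rel_pattern = "( %s (%s* (%s)+ )? )+ " % (verb, word, preposition)
grammar_long = '''REL_PHRASE: {%s}''' % rel_pattern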
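To see what this pattern matches without installing NLTK, the sketch below translates the tag pattern into a plain `re` expression and runs it over a string of `<TAG>` items. It is only a stand-in for what RegexpParser does, not NLTK's own code: the helpers `tag_pattern_to_regex` and `relation_phrases` are hypothetical names, and the PoS tags in the examples are assigned by hand rather than produced by Polyglot.

```python
import re

verb = "<ADV>*<AUX>*<VERB><PART>*<ADV>*"
word = "<NOUN|ADJ|ADV|DET|ADP>"
preposition = "<ADP|ADJ>"
rel_pattern = "( %s (%s* (%s)+ )? )+ " % (verb, word, preposition)


def tag_pattern_to_regex(tag_pattern):
    """Turn a tag pattern such as '<ADP|ADJ>+' into a regex that
    matches over a string like '<VERB><NOUN><ADP>'."""
    compact = re.sub(r"\s+", "", tag_pattern)
    # Wrap each <...> item in a non-capturing group so that a
    # following quantifier applies to the whole item.
    return re.sub(r"<([^>]+)>", r"(?:<(?:\1)>)", compact)


REL_REGEX = re.compile(tag_pattern_to_regex(rel_pattern))


def relation_phrases(tagged):
    """Return the relational phrases found in (token, tag) pairs."""
    tag_string = "".join("<%s>" % tag for _, tag in tagged)
    phrases = []
    for match in REL_REGEX.finditer(tag_string):
        # Each "<" before an offset corresponds to one token.
        start = tag_string.count("<", 0, match.start())
        end = tag_string.count("<", 0, match.end())
        phrases.append(" ".join(token for token, _ in tagged[start:end]))
    return phrases


# Hand-tagged sentences (assumed tags)
sentence = [("Madredeus", "PROPN"), ("esgota", "VERB"), ("salas", "NOUN"),
            ("em", "ADP"), ("Gent", "PROPN")]
print(relation_phrases(sentence))
```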





III) CHAVE is a collection of Portuguese news articles, distributed by Linguateca and available free of charge for research purposes.

You just need to give an email address to receive the credentials that allow you to download the collection. It contains the complete published editions for 1994 and 1995 of two popular newspapers, PUBLICO from Portugal and Folha de São Paulo from Brazil. The figure below is a statistical description of the dataset.

I used Polyglot to identify named-entities and perform part-of-speech tagging, then I looked for ReVerb patterns between every pair of named-entities that were no more than 8 tokens apart. I extracted triples from PUBLICO news articles from the following categories:
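The pairing step can be sketched as follows. This is an illustration under assumptions, not the code from my repository: entities are given as token spans (start inclusive, end exclusive), every pair in a sentence is considered rather than only consecutive entities, and "8 tokens apart" is read as at most 8 tokens between the two entities. The name `candidate_pairs` is hypothetical.

```python
from itertools import combinations

MAX_TOKENS_BETWEEN = 8


def candidate_pairs(tokens, entity_spans, max_between=MAX_TOKENS_BETWEEN):
    """Yield (entity1, between, entity2) token lists for every pair of
    entities separated by 1 to max_between tokens."""
    for (s1, e1), (s2, e2) in combinations(sorted(entity_spans), 2):
        gap = s2 - e1  # number of tokens between the two entities
        if 0 < gap <= max_between:
            yield tokens[s1:e1], tokens[e1:s2], tokens[s2:e2]


tokens = "Madredeus esgota salas em Gent".split()
entity_spans = [(0, 1), (4, 5)]  # "Madredeus", "Gent"

for e1, between, e2 in candidate_pairs(tokens, entity_spans):
    # "between" is the window in which the PoS pattern is searched
    print(e1, between, e2)
```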

['Nacional', 'Mundo', 'Economia', 'Sociedade', 'Cultura']

This took around 2 hours to run on a MacBook Pro and extracted around 70 000 unique relational triples. I then started to explore the extracted triples simply by using grep to find triples referring to specific entities.

There are interesting facts in the semantic triples, mostly, of course, related to events in 1994 and 1995. For instance:

When Madredeus started to gain international fame and tour outside of Portugal:

Teresa Salgueiro cantar composições do Pedro Ayres Magalhães

Madredeus esgota salas em Gent

Madredeus vendem tanto como Paul McCartney

Madredeus gravar em Inglaterra

Madredeus recusam convite do PSD

Teresa Salgueiro frequentou aulas de canto com Cortês Medina

When the East Timor conflict took on larger, political proportions:

Bill Clinton falou no problema dos direitos humanos em Timor-Leste

Bill Clinton levante a questão de Timor-Leste

Bill Clinton também não deixará de mencionar a questão de Timor

Governo da Indonésia não poupa ataques Portugal

Coliseu do Porto haverá um espectáculo de solidariedade com Timor

Danielle Mitterrand participar numa conferência sobre Timor-Leste

Durão Barroso lança apelo por Timor

Galvão de Melo não poupou elogios à presença indonésia em Timor

Governo russo reconheceu Timor-Leste

Indonésia explora petróleo do Timor

Indonésia ocupou o território de Timor

Indonésia acusa Portugal

The big rock concerts at the Alvalade stadium:

Alvalade montar o espectáculo da banda de David Gilmour

Alvalade ouvir a música dos Pink Floyd

Jagger tocou em Alvalade

Bryan Adams arrasa Alvalade

Damásio prometeu roubar os concertos ao seu rival de Alvalade

Pink Floyd esgotam Alvalade

Pink Floyd tocam Alvalade

Rod Stewart dá um concerto em Lisboa

Van Halen fazerem a primeira parte do concerto dos Bon Jovi

The typical messages of accusation and support in politics:

António Guterres responde às críticas de Luís Filipe Menezes

Manuel Monteiro acusou ontem António Guterres

Narciso Miranda dedicou a vitória a António Guterres

PCP não poupou críticas a Guterres

PS teceu fortes críticas ao PSD

PSD criticou o governo de Guterres

Sampaio elogiando a capacidade de Guterres

Sampaio entra para o longo abraço a Guterres

And the investments and the buying and selling of shares in the economy:

António Champalimaud aumentar o capital da Mundial Confiança

Sonae Investimentos lançou sobre as acções da Interlog SGPS

British Aerospace queria vender a totalidade das acções da Rover

Carlsberg entrou no capital da Unicer

Continente lançou sobre a totalidade do capital da Modelo

Microsoft adquiriu todas as acções da Altamira

Portucel não acompanhará o aumento de capital da Inapa

Renault conduziu a uma forte subida das acções da Volvo

Santander entra efectivamente no capital do BCI

But there are also uninformative extractions, which occur mainly due to two problems:

NER component of Polyglot

ReVerb limitations - which are also pointed out in the original paper

Named-Entity Recognition Errors

One of the problems with the extraction has to do with Polyglot failing to correctly identify the named-entities. Polyglot often detects wrong named-entities, for instance:

! I-ORG

' I-PER

) I-ORG

. . ) ? I-PER

. . . ' I-ORG

. . . . I-PER

. . . . . . . . . . . I-PER

17.3.95 I-PER

22.06.94 I-LOC

2C I-ORG

4.10.1952 I-LOC

4AD I-ORG

52n Street I-ORG

72nd West I-ORG

? . . . I-PER

Actualidade I-ORG

or sometimes the named-entities are incomplete, for instance, given the phrase:

“O Fundo BPI América, gerido pela Douro Fundos, grupo Banco Português de Investimento, …“

The following triple is extracted, with the incomplete entity “Douro” tagged as LOCATION:

BPI América gerido pela Douro

Shallow Analysis of the Sentence

ReVerb operates only at a shallow sentence level: it does not take into consideration syntactic dependencies among words or groups of words. As a consequence, it only extracts binary relationships. Given this sentence:

“Fontes diplomáticas citadas pela AFP referiram que Washington sempre encarou com desagrado qualquer anúncio de redução de efectivos da ONU na Bósnia antes da formação de uma nova força multinacional”

the following triple is extracted:

AFP referiram Washington

and “Washington” is wrongly tagged as a PERSON.

This is a complex sentence, and there is not a simple binary relationship that expresses all the information in the sentence.

Future Work

This was a quick experiment on how to perform open information extraction in Portuguese using only open and out-of-the-box tools. Two things that need to be improved are:

Go beyond part-of-speech tags, i.e., use syntactic dependencies

Improve the named-entity recognition

It’s easy to solve this for English, but for Portuguese it’s more complicated. For morphological and syntactic information for Portuguese, there is SyntaxNet, but I still have to find a way to make it work in batch or inside my own Python code.

For Named-Entity Recognition, as far as I know, there is no other library apart from Polyglot that is ready to use out-of-the-box. There are datasets available, which can be transformed and used as training data. I will try to address this issue in the future, and hopefully make a post about it.

The full code and the extracted triples are available on my github:

https://github.com/davidsbatista/information-extraction-PT
