Introduction

Hello, everyone!

Today, I’d like to show how to mine Wikipedia infoboxes with Perl 6.

Wikipedia infoboxes play an important role in Natural Language Processing, and many applications leverage them:

Building a Knowledge Base (e.g. DBpedia [0])

Ranking the importance of attributes [1]

Question Answering [2]

Among them, I’ll focus on the infobox extraction issues and demonstrate how to parse the sophisticated structures of the infoboxes with Grammar and Actions.

Are Grammar and Actions difficult to learn?

No, they aren’t!

You only need to know five things:

For Grammar:

token is the most basic one. You will normally use it.

rule makes whitespace significant.

regex lets the match engine backtrack.

For Actions:

make prepares an object to be returned when made is called on the match.

made is called on its invocant and returns the prepared object.
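
To make these five things concrete, here is a minimal sketch. The key=value mini-language and all of the names in it (KeyValue, SpacedKeyValue, Backtrack, KeyValueActions) are invented for this illustration and have nothing to do with the infobox parser itself.

```raku
# A hypothetical key=value mini-language, invented for this illustration.
grammar KeyValue {
    token TOP   { <key> '=' <value> }   # token: no backtracking, no implied whitespace
    token key   { \w+ }
    token value { \d+ }
}

# Same tokens, but TOP is a rule, so the spaces in its body also match whitespace.
grammar SpacedKeyValue is KeyValue {
    rule TOP { <key> '=' <value> }
}

# token vs. regex: only regex may give back characters that .+ already consumed.
grammar Backtrack {
    token no-backtrack { .+ 'b' }
    regex backtrack    { .+ 'b' }
}

class KeyValueActions {
    method key($/)   { make ~$/ }   # prepare the key as a Str
    method value($/) { make +$/ }   # prepare the value as a number
    method TOP($/)   { make ($<key>.made => $<value>.made) }
}

my $strict = KeyValue.parse('answer = 42');
my $spaced = SpacedKeyValue.parse('answer = 42', actions => KeyValueActions);

say $strict.defined;   # False: token TOP does not skip the spaces
say $spaced.made;      # answer => 42
say so Backtrack.parse('aab', rule => 'no-backtrack');   # False
say so Backtrack.parse('aab', rule => 'backtrack');      # True
```

In the last two lines, .+ first swallows the whole string. The token gives up at that point, while the regex backtracks one character so that 'b' can match.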



For more info, see: https://docs.perl6.org/language/grammars

What is Infobox?

Have you ever heard the word “Infobox”?

For those who haven’t heard it, I’ll explain it briefly.

An easy way to understand Infobox is by using a real example:

As you can see, the infobox displays the attribute-value pairs of the page’s subject at the top-right side of the page. For example, in this one, it says the designer (ja: 設計者) of Perl 6 is Larry Wall (ja: ラリー・ウォール).

For more info, see: https://en.wikipedia.org/wiki/Help:Infobox

First Example: Perl 6

First of all, I’ll demonstrate the parsing techniques using Japanese Wikipedia rather than English Wikipedia.

The main reason is that parsing Japanese Wikipedia is my $dayjob :)

The second reason is that I want to show how easily Perl 6 can handle Unicode strings.

Then, let’s start parsing the infobox in the Perl 6 article!

The code of the article written in wiki markup is:

There are three problematic portions of the code:

1. There are superfluous elements after the infobox block, such as the template {{プログラミング言語}} and the lead sentence starting with '''Perl 6''' .

2. We have to discriminate three types of tokens: anchor text (e.g. [[Rakudo]] ), raw text (e.g. Rakudo Star 2016.04 ), weblink (e.g. [https://perl6.org/ Perl6.org] ).

3. The infobox doesn’t start at the top position of the article. In this example, {{Comb-stub}} is at the top of the article.

OK, then I’ll show how to solve the above problems in the order of Grammar, Actions, Caller (i.e. the portion of the code that calls Grammar and Actions).

Grammar

The code for Grammar is:

Solution to problem 1: Use .+ to match the superfluous portions. (#1)

Solution to problem 2:

Prepare three types of tokens: anchortext (#2), weblink (#3), and rawtext (#4).

The tokens may be separated by a delimiter (e.g. , ), so prepare the token delimiter. (#5)

Represent the token value-content as an arbitrary-length sequence of the four tokens (i.e. anchortext, weblink, rawtext, delimiter). (#6)

Solution to problem 3: Nothing in particular to mention.
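
As a rough illustration of these points, the following is a minimal sketch and not the article's original listing. The token names follow the text, but every token body, the grammar name InfoboxSketch, and the shortened sample markup are assumptions made for this example.

```raku
# Sketch only: token names follow the text, the bodies are assumptions.
grammar InfoboxSketch {
    token TOP           { <infobox> .+ }   # .+ swallows everything after the infobox
    token infobox       { '{{Infobox' \h+ <name> \n <property>+ '}}' }
    token name          { \N+ }
    token property      { '|' \h* <key> \h* '=' \h* <value-content> \n }
    token key           { \w+ }
    # An arbitrary-length sequence of the four token types.
    token value-content { [ <anchortext> || <weblink> || <rawtext> || <delimiter> ]+ }
    token anchortext    { '[[' <-[ \] \n ]>+ ']]' }
    token weblink       { '[' 'http' \S+ \h+ <-[ \] \n ]>+ ']' }
    token rawtext       { <-[ \[ \] \{ \} \| \, \n ]>+ }
    token delimiter     { ',' \h* }
}

# Shortened, made-up sample in the spirit of the Perl 6 article.
my $article = q:to/END/;
{{Infobox プログラミング言語
| name = Perl 6
| 設計者 = [[ラリー・ウォール]]
| implementations = [[Rakudo]], Rakudo Star 2016.04
| website = [https://perl6.org/ Perl6.org]
}}
'''Perl 6''' is a programming language.
END

my $match = InfoboxSketch.parse($article);
for @($match<infobox><property>) -> $p {
    say ~$p<key>, ' => ', ~$p<value-content>;
}
```

The sequential alternation || tries anchortext before weblink, so [[...]] is never mistaken for a weblink that happens to start with [.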



Actions

The code for Actions is:

Solution to problem 2: Make the object prepared for the token value-content consist of three keys: anchortext, weblink, and rawtext.

Solutions to problems 1 and 3: Nothing in particular to mention.
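
One way to build such a three-key structure is sketched below. This is an assumption-laden illustration rather than the article's listing: ValueSketch, ValueActions, the url and label helper tokens, and the sample value are all hypothetical, and the grammar only covers a single value so that the block stays self-contained.

```raku
# Hypothetical grammar for one value-side only.
grammar ValueSketch {
    token TOP        { [ <anchortext> || <weblink> || <rawtext> || <delimiter> ]+ }
    token anchortext { '[[' <label> ']]' }
    token weblink    { '[' <url> \h+ <label> ']' }
    token url        { 'http' \S+ }
    token label      { <-[ \] \n ]>+ }
    token rawtext    { <-[ \[ \] \, \n ]>+ }
    token delimiter  { ',' \h* }
}

class ValueActions {
    method anchortext($/) { make ~$<label> }
    method weblink($/)    { make %( url => ~$<url>, label => ~$<label> ) }
    method rawtext($/)    { make ~$/ }
    method TOP($/) {
        # One key per token type; "// ()" covers a type that never matched.
        my %value =
            anchortext => @( $<anchortext> // () ).map(*.made).List,
            weblink    => @( $<weblink>    // () ).map(*.made).List,
            rawtext    => @( $<rawtext>    // () ).map(*.made).List;
        make %value;
    }
}

my $text   = '[[Rakudo]], Rakudo Star 2016.04, [https://perl6.org/ Perl6.org]';
my $result = ValueSketch.parse($text, actions => ValueActions).made;

say $result<anchortext>;
say $result<rawtext>;
say $result<weblink>[0]<url>;
```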



Caller

The code for Caller is:

Solution to problem 3:

Read the article line by line and make a chunk that contains the lines from the current line to the last line. (#1)

If the parser determines that the chunk doesn’t contain the infobox, it returns an undefined value. A good way to receive an undefined value is to use a variable with the $ sigil. (#2)

If the parser determines that the chunk contains the infobox, it returns a defined value. Use the @() contextualizer and iterate over the result. (#3)

Solutions to problems 1 and 2: Nothing in particular to mention.
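
The chunking loop described above could look roughly like this. It is a sketch under assumptions: TinyInfobox is a deliberately simplified stand-in grammar, and the sample article and the variable names ($start, $infobox) are made up for this example.

```raku
# Simplified stand-in grammar; values are kept as plain text here.
grammar TinyInfobox {
    token TOP      { <infobox> .* }
    token infobox  { '{{Infobox' \N* \n <property>+ '}}' }
    token property { '|' \h* <key> \h* '=' \h* <value> \n }
    token key      { \w+ }
    token value    { \N* }
}

# Made-up article: the infobox is not on the first line.
my $article = q:to/END/;
{{Comb-stub}}
{{Infobox プログラミング言語
| name = Perl 6
| designer = Larry Wall
}}
'''Perl 6''' is a programming language.
END

my @lines = $article.lines;
my $start;      # index of the line where the infobox begins
my $infobox;    # stays undefined until some chunk parses

for ^@lines.elems -> $i {
    # Chunk = current line through the last line.
    my $chunk  = @lines[$i .. *].join("\n");
    # A $-sigiled variable can simply hold the undefined result of a failed parse.
    my $parsed = TinyInfobox.parse($chunk);
    next unless $parsed.defined;
    $start   = $i;
    $infobox = $parsed<infobox>;
    last;
}

if $infobox.defined {
    # @() turns the captured list into something "for" iterates element by element.
    for @($infobox<property>) -> $p {
        say ~$p<key>, ' => ', ~$p<value>;
    }
}
```

Re-parsing from every line is simple rather than fast. For a single article this hardly matters, but other strategies may suit a full dump better.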



Running the Parser

Are you ready?

It’s time to run the 1st example!

The example we have seen may be too easy for you. Let’s take on a harder one!

Second Example: Albert Einstein

As the second example, let’s parse the infobox of Albert Einstein.

The code of the article written in wiki markup is:

As you can see, there are five new problems here:

1. Some of the templates

1.1 contain newlines; and

1.2 are nested (e.g. {{nowrap|{{仮リンク|...}}...}} ).

2. Some of the attribute-value pairs are empty.

3. Some of the value-sides of the attribute-value pairs

3.1 contain a break tag; and

3.2 consist of different types of tokens (e.g. anchortext and rawtext), so you need to add positional information to represent the dependency between tokens.

I’ll show how to solve the above problems in the order of Grammar, then Actions.

The code of the Caller is the same as the previous one.

Grammar

The code for Grammar is:

Solution to problem 1.1:

Create the token value-content-list-nl, which is the newline-separated version of the token value-content-list. The modified quantifier % is useful for representing this kind of sequence. (#1)

Create the token template. In this token, define a sequence that represents the Plainlist template. (#2)
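
For readers who have not met the modified quantifier yet, here is a minimal sketch of it on a Plainlist-like template. The grammar name PlainlistSketch, the item and text tokens, and the list contents are assumptions made for this example.

```raku
# Hypothetical grammar: items separated (not terminated) by newlines.
grammar PlainlistSketch {
    # "<item>+ % \n" reads: one or more items, with \n between neighbours.
    token TOP  { '{{Plainlist|' \n <item>+ % \n \n '}}' }
    token item { '*' \h* <text> }
    token text { \N+ }
}

# Made-up template body spanning several lines.
my $template = ('{{Plainlist|', '* first item', '* second item', '}}').join("\n");

my $match = PlainlistSketch.parse($template);
for @($match<item>) -> $item {
    say ~$item<text>;
}
```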

Solution to problem 1.2: Enable the token template to call the token value-content-list. This modification triggers a recursive call and captures the nesting structure, because the token value-content-list itself contains the token template. (#3)
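
The recursion can be seen in isolation in the following sketch. It is not the article's grammar: NestedSketch and its name, content, and text tokens are hypothetical, and the arguments inside the sample templates are placeholders.

```raku
# Hypothetical grammar: template -> content -> template -> ...
grammar NestedSketch {
    token TOP      { <template> }
    token template { '{{' <name> [ '|' <content> ]* '}}' }
    token name     { <-[ \{ \} \| ]>+ }
    token content  { [ <template> || <text> ]+ }   # content may contain a template again
    token text     { <-[ \{ \} \| ]>+ }
}

# Placeholder arguments; only the nesting matters here.
my $match = NestedSketch.parse('{{nowrap|{{仮リンク|A|en|B}} C}}');

my $outer = $match<template>;
my $inner = $outer<content>[0]<template>[0];
say ~$outer<name>;            # nowrap
say ~$inner<name>;            # 仮リンク
say $inner<content>.elems;    # 3
```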

Solution to problem 2: In the token property, define a sequence whose value-side is empty (i.e. a sequence that ends with ‘=’). (#4)
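
A sketch of a property token with such an extra alternative follows. PropertySketch, its token bodies, and the sample attribute names are assumptions made for this example.

```raku
# Hypothetical grammar: the second alternative accepts a line that ends with '='.
grammar PropertySketch {
    token TOP      { <property>+ % \n }
    token property {
        [ '|' \h* <key> \h* '=' \h* <value> ]
        ||
        [ '|' \h* <key> \h* '=' \h* ]
    }
    token key      { \w+ }
    token value    { \N+ }
}

# Made-up attribute-value pairs; the middle one has an empty value-side.
my $pairs = ('| name = Albert Einstein', '| signature =', '| field = Physics').join("\n");

my $match = PropertySketch.parse($pairs);
for @($match<property>) -> $p {
    say ~$p<key>, ' => ', ($p<value>.defined ?? ~$p<value> !! '(empty)');
}
```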

Solution to problem 3.1:

Create the token br. (#5)

Let the token br follow the token value-content in two tokens: the token value-content-list (#6) and the token value-content-list-nl (#7).
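
As a small sketch of this idea, the grammar below lets an optional break tag follow each value. BrSketch, the token bodies, and the sample value are assumptions, and the value-content token is reduced to plain text to keep the example short.

```raku
# Hypothetical grammar: an optional <br> may follow each value-content.
grammar BrSketch {
    token TOP           { [ <value-content> <.br>? ]+ }
    token value-content { <-[ \< \n ]>+ }
    token br            { '<br' \h* '/'? '>' }   # accepts <br>, <br/> and <br />
}

# Made-up value-side with two spellings of the break tag.
my $match = BrSketch.parse('Theoretical physics<br />Philosophy of science<br>Relativity');
for @($match<value-content>) -> $v {
    say ~$v;
}
```

Calling the token as <.br> keeps the tag out of the captures, so only the three values remain in the match.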



Actions

The code for Actions is:

Solution to problem 3.2: Use Match.from and Match.to to get the starting and ending positions of each match, respectively, when calling make. (#1 to #4)
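
The following sketch shows one possible way to attach positions to tokens. It is an illustration under assumptions, not the article's listing: PositionSketch, PositionActions, the hash layout (type, text, from, to), and the sample value are all made up.

```raku
# Hypothetical grammar with only two token types.
grammar PositionSketch {
    token TOP        { [ <anchortext> || <rawtext> ]+ }
    token anchortext { '[[' <-[ \] \n ]>+ ']]' }
    token rawtext    { <-[ \[ \] \n ]>+ }
}

class PositionActions {
    method anchortext($/) {
        make %( type => 'anchortext', text => ~$/, from => $/.from, to => $/.to );
    }
    method rawtext($/) {
        make %( type => 'rawtext', text => ~$/, from => $/.from, to => $/.to );
    }
    method TOP($/) {
        my @tokens;
        @tokens.push: .made for @( $<anchortext> // () );
        @tokens.push: .made for @( $<rawtext>    // () );
        # Sorting by start position restores the original token order.
        make @tokens.sort({ .<from> }).List;
    }
}

my $parsed = PositionSketch.parse('born in [[Ulm]], [[German Empire]]',
                                  actions => PositionActions);
my @tokens = $parsed.made.list;
for @tokens -> $t {
    say $t<type>, ' ', $t<from>, '..', $t<to>, ' ', $t<text>;
}
```

Note that to is the position just past the last matched character, so a token's to equals the from of the token that directly follows it.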



Running the Parser

It’s time to run!

Conclusion

I demonstrated techniques for parsing infoboxes. I highly recommend creating your own parser if you have a chance to use Wikipedia as a resource for NLP. It will deepen your knowledge of not only Perl 6 but also Wikipedia.

See you again!

Citations

[0] Lehmann, Jens, et al. “DBpedia–a large-scale, multilingual knowledge base extracted from Wikipedia.” Semantic Web 6.2 (2015): 167-195.

[1] Ali, Esraa, Annalina Caputo, and Séamus Lawless. “Entity Attribute Ranking Using Learning to Rank.”

[2] Morales, Alvaro, et al. “Learning to answer questions from wikipedia infoboxes.” Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 2016.

License

All of the materials from Wikipedia are licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License.

—

Itsuki Toyota

A web developer in Japan.