



Project Idea

StrepHit (pronounced "strep hit", means "Statement? repherence it!")[1] is a Natural Language Processing pipeline that harvests structured data from raw text and produces Wikidata statements with reference URLs. Its datasets will feed the primary sources tool.[2]

In this way, we believe StrepHit will dramatically improve the data quality of Wikidata through a reference suggestion mechanism for statement validation, and will help Wikidata become the gold-standard hub of the Open Data landscape.

The Problem

The vision of a Web as a freely available repository of machine-readable structured data has not only engaged a long strand of research, but has also been absorbed by the biggest web industry players. Crowdsourced efforts following the Wiki paradigm have enabled the creation of several knowledge bases - most notably DBpedia,[3] Freebase,[4] and Wikidata[5] - which have proven useful for a variety of applications, such as question answering, entity summarization, and entity linking.

However, the trustworthiness of Wikidata assertions plays the most crucial role in delivering a high-quality, reliable knowledge base: in order to assess their truth, assertions should be validated against third-party resources, and few efforts have been made in this direction. One form of validation can be achieved via references to external (i.e., non-wiki), authoritative sources. This has motivated the development of the primary sources tool: it will serve as a platform for users to either accept or reject new references and/or assertions coming from third-party datasets.

We argue that there is a need for datasets which guarantee at least one reference for each assertion, and StrepHit is conceived to do so.

The Solution

StrepHit applies Natural Language Processing techniques to a selected corpus of authoritative Web sources in order to harvest structured facts. These will serve two purposes: to authenticate existing Wikidata statements, and ultimately to enrich them with references to such sources. More specifically, the solution is based on the following main steps:

Corpus-based relation discovery, as a completely data-driven approach to knowledge harvesting;

Linguistically-oriented fact extraction from reliable third-party Web sources.

The solution details are best explained through the use case shown below. The technical implementation is provided in the implementation details section.

Use Case

Soccer is a widely attested domain in Wikidata: it counts a total of 188,085 items describing soccer-related entities,[6] which is a significant portion (around 1.27%) of the whole knowledge base. Moreover, those Items are generally very rich in terms of statements (cf. for instance the Germany national football team).

On account of such observations, the soccer domain properly fits the main challenge of this proposal, namely to automatically validate Wikidata statements against a knowledge base built upon the text of third-party Web sources (from now on, the Web Sources Knowledge Base).

The following table displays four example statements with no reference from the Germany national football team Item, which can be validated by candidate statements extracted from the given references.

| Wikidata Statement | Sentence | Extracted Statement | Reference |
|---|---|---|---|
| <Germany, participant of, Miracle of Cordoba> | "(...) The Miracle of Cordoba, when they eliminated Germany from the 1978 World Cup" | <Germany, eliminated in, Miracle of Cordoba> | The Telegraph |
| <Germany, team manager, Franz Beckenbauer> | "In 1984 Beckenbauer was appointed manager of the West German team" | <West German team, manager, Beckenbauer> | Encyclopædia Britannica |
| <Germany, inception, 1908> | "The story of the DFB’s national team began (...) on April 5th 1908" | <DFB’s national team, start, 1908> | DFB |
| <Germany, captain, Michael Ballack> | "Michael Ballack, the captain of the German national football team" | <German national football team, captain, Michael Ballack> | Spiegel |

Proof of Work

The soccer use case has already been partially implemented: the prototype has yielded a small demonstrative dataset, namely FBK-strephit-soccer, which has been uploaded to the primary sources tool.

We invite reviewers to play with it, by following the instructions in the project page.

The dataset will serve as a proof of work to demonstrate the technical feasibility of the project idea.

Google Summer of Code 2015

As part of the Google Summer of Code 2015 program,[7] we proposed a project under the umbrella of the DBpedia Association. The goal was to enrich the DBpedia knowledge base via fact extraction techniques, leveraging Wikipedia as the input source. The project was accepted,[8] and yielded a dataset similar to FBK-strephit-soccer, which is currently integrated into the Italian DBpedia chapter. An informal overview can be found at the Italian DBpedia chapter Web site.[9] We successfully completed the implementation,[10] and attracted interest from different communities.[11][12][13] We believe the fact extractor is complementary to StrepHit, and plan to reuse its codebase as a starting point for the full implementation.

Project Goals

The technical goals of this project are as follows:

to identify a set of authoritative third-party Web sources and to harvest the Web Sources Corpus;

to recognize important relations between entities in the corpus via lexicographical and statistical analysis;

to implement the StrepHit Natural Language Processing pipeline, serving in all respects as an open source framework that maximizes reusability;

to build the Web Sources Knowledge Base for the validation and enrichment of Wikidata statements;

to deploy a stable system that automatically suggests references given a Wikidata statement.

The above goals have been formulated keeping in mind that they should be as realistic, pragmatic, precise and measurable as possible. On account of the outreach objective (cf. below), additional emphasis will be given to the StrepHit codebase maintainability and architecture extensibility.

Community Outreach

The target audience is represented by several communities: each one will play a key role at different phases of the project (detailed in the community engagement), and will be attracted accordingly. We list them below, in descending order of specificity:

Wikidata users, involved as data curators;

Wikipedia users and librarians, involved as consultants for the identification of reliable Web sources;

technical contributors (i.e., Natural Language Processing developers and researchers), involved through standard open source and social coding practices;

data donors, encouraged by the availability of a unified platform to push their datasets into Wikidata.

We intend to achieve this goal via constant dissemination activities (cf. timeline of task T10 and its subtasks in the work package), which will also cater for post-mortem sustainability. Special attention will be paid to stimulating multilingual implementations of the StrepHit pipeline.

In Scope

At the end of the project's minimal time frame (6 months), we roughly estimate the following outcomes:

the Web Sources Corpus is composed of 250,000 documents (where 1 document yields 1 reference URL), harvested from 50 different sources, in the English language;

the corpus analysis yields a set of top 50 relations;

the StrepHit pipeline is released as a beta version with an open source compliant license;

the Web Sources Knowledge Base contains 2.25 million Wikidata statements;

the primary sources tool has a stable release.

The above numbers are computed from the Google Summer of Code 2015 project output: the input corpus contained approximately 55,000 documents from a single source and returned 50,000 facts expressing 5 relations. Each fact can be translated into 1 Wikidata statement.
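The proposal does not spell out how the Google Summer of Code figures were scaled up. One plausible reading, sketched below purely as an assumption, is a linear extrapolation: the number of facts grows in proportion to both the corpus size and the number of relations. The function name and its parameters are illustrative only.

```python
def estimate_statements(target_docs, target_relations,
                        base_docs=55_000, base_facts=50_000, base_relations=5):
    """Linearly extrapolate the number of extracted facts.

    Assumption (not stated in the proposal): the yield per document and
    per relation observed in the baseline carries over unchanged.
    """
    facts_per_doc_per_relation = base_facts / base_docs / base_relations
    return target_docs * target_relations * facts_per_doc_per_relation


# Targets taken from the outcomes listed above: 250,000 documents, 50 relations.
estimate = estimate_statements(250_000, 50)
print(f"{estimate:,.0f} statements")
```

Under this assumption the estimate lands in the region of 2.27 million statements, which is of the same order as the 2.25 million figure quoted in the outcomes.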

Project Plan

Implementation Details

Figure 1: Implementation workflow

The main linguistic theory we aim at implementing is Frame Semantics.[14] A frame can be informally defined as an event triggered by some term in natural language text and embedding a set of participants, called frame elements. For instance, the sentence “Germany played Argentina at the 2014 World Cup Final” evokes the Match frame (triggered by the verb “played”) together with the Team and Opponent participants (respectively Germany and Argentina). This theory has led to the creation of FrameNet,[15] a general-purpose lexical database for English containing manually annotated textual examples of frame usage. Specialized versions include Kicktionary[16] for the soccer domain. Frame Semantics will enable the discovery of relations that hold between entities in raw text. Its implementation takes as input a collection of documents from a set of Web sources (i.e., the corpus) and outputs a structured knowledge base composed of machine-readable statements (according to the Wikibase data model terminology). The workflow is depicted in Figure 1 and proceeds as follows:

Extraction of verbs via text tokenization, lemmatization, and part of speech tagging. Verbs serve as the frame triggers (also known as Lexical Units);

Selection of top-N meaningful verbs through lexicographical and statistical analysis of the input corpus. The ranking is produced via a combination of term weighting measures such as TF/IDF and purely statistical ones such as standard deviation;

Each selected verb will trigger one or more frames, depending on its ambiguity. The set of frames, together with their participants, represents the input labels for an automatic frame classifier, based on supervised machine learning,[17] namely Support Vector Machines (SVM);[18]

Construction of a fully annotated training set, leveraging a novel crowdsourcing methodology[19][20] (implemented and published in our previous top-conference publications);

Massive frame extraction on the input corpus via the classifier trained in the previous step;

Structuring the extraction results to fit the Wikibase Data Model. A frame would map to a property, while participants would either map to Items or to values, depending on their role.
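The verb selection step can be illustrated with a minimal sketch. It assumes that tokenization, lemmatization, and part of speech tagging have already been done, so each document arrives as a plain list of verb lemmas. The scoring rule (mean TF-IDF plus its standard deviation across documents) is only one possible combination of the two kinds of measures mentioned above, chosen here for illustration; the function name `rank_verbs` is hypothetical and not part of the StrepHit codebase.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def rank_verbs(docs, top_n=50):
    """Rank verb lemmas by mean TF-IDF plus its standard deviation.

    docs: one list of verb lemmas per document (already POS-tagged
    and lemmatized upstream). Empty documents are ignored.
    """
    per_doc_counts = [Counter(verbs) for verbs in docs if verbs]
    n_docs = len(per_doc_counts)

    # Document frequency: in how many documents each verb occurs.
    doc_freq = Counter()
    for counts in per_doc_counts:
        doc_freq.update(counts.keys())

    scores = {}
    for verb, df in doc_freq.items():
        # Smoothed IDF, so verbs found in every document keep a non-zero weight.
        idf = math.log(n_docs / df) + 1.0
        tfidf = [counts[verb] / sum(counts.values()) * idf
                 for counts in per_doc_counts]
        # Illustrative combination of a term weighting measure and a
        # purely statistical one (assumption, not the project's formula).
        scores[verb] = mean(tfidf) + pstdev(tfidf)

    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda v: (-scores[v], v))[:top_n]


toy_corpus = [
    ["play", "win", "be"],
    ["be", "score", "play", "play"],
    ["be", "lose"],
]
print(rank_verbs(toy_corpus, top_n=3))
```

On such a tiny corpus the ranking is dominated by document length; a realistic run would need the full corpus and probably some frequency thresholds.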

Contributions to the Wikidata Development Plan

In general, this project is intended to play a central role in the primary sources tool. A list of specific open issues follows.

| Open issue | Phabricator ID | Reason |
|---|---|---|
| Framework for source checking | T90881 | StrepHit seems like a perfect match for this issue |
| Nudge editors to add a reference when adding a new claim | T76231 | Automatically suggesting references would encourage editors to fulfill these duties |
| Nudge when editing a statement to check reference | T76232 | Same as above |

Work Package

The work package consists of the following tasks:

| ID | Title | Objective | Month | Effort |
|---|---|---|---|---|
| T1 | Development corpus | Gather 200,000 documents from 40 authoritative Web sources | M1-M3 | 15% |
| T2 | State of the art review | Investigate reusable implementations for the StrepHit pipeline | M1 | 5% |
| T3 | Corpus analysis | Select the top 50 verbal lexical units that emerge from the corpus | M2-M3 | 5% |
| T4 | Production corpus | Regularly harvest 50,000 new documents from the selected sources | M2-M6 | 5% |
| T5 | Training set | Construct the training data via crowdsourcing | M3-M4 | 15% |
| T6 | Classifier testing | Train and evaluate the supervised classifier to achieve reasonable performance | M3-M4 | 20% |
| T7 | Frame extraction | Transform candidate sentences of the input corpus into structured data via frame classification | M5 | 5% |
| T8 | Web Sources Knowledge Base | Produce the final 2.5 million statements dataset and upload it to the primary sources tool | M5-M6 | 15% |
| T9 | Stable primary sources tool | Fix critical issues in the codebase | M5-M6 | 5% |
| T10 | Community dissemination | Promote the project and engage its key stakeholders | M1-M6 | 10% |

Overlaps between certain tasks (in terms of timing) are needed for iterative planning.

Tasks Breakdown

The above tasks may be further split into the following subtasks, depending on the stated effort:

| ID | Title | Description |
|---|---|---|
| T1.1 | Sources identification | Select the set of Web sources that meet minimal requirements |
| T1.2 | Sources scraping | Build scrapers to harvest documents from the set of Web sources |
| T3.1 | Verb extraction | Extract verbal lexical units via part of speech tagging |
| T3.2 | Verb ranking | Produce a ranked list of the most meaningful verbal lexical units via lexicography and statistics |
| T5.1 | Lexical database selection | Investigate the most suitable resource containing frame definitions |
| T5.2 | Crowdsourcing job | Post the dataset to be annotated to a crowdsourcing platform |
| T5.3 | Training set creation | Translate the annotation results into the training format |
| T6.1 | Evaluation set creation | Gold-standard dataset to assess the classifier performance |
| T6.2 | Frame evaluation | Reach an F1 measure value of 0.75 in the frame classification |
| T6.3 | Frame elements evaluation | Reach an F1 measure value of 0.70 in the frame elements classification |
| T8.1 | Data model mapping | Research a sustainable way to map the frame extraction output into the Wikibase data model |
| T8.2 | Dataset serialization | Serialize the frame extraction output into the QuickStatements syntax,[21] based on T8.1 |
| T10.1 | Wikipedians + librarians engagement | These communities represent a precious support for T1.1 |
| T10.2 | Wikidatans engagement | Data curation and feedback loop for the Web Sources Knowledge Base |
| T10.3 | NLP developers engagement | Find collaborators to make StrepHit go multilingual |
| T10.4 | Open Data organizations engagement | Encourage them to donate data to Wikidata via the primary sources tool |
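To give a feel for the serialization subtask (T8.2), here is a minimal sketch of how one extracted fact with its reference URL could be written as a tab-separated QuickStatements line. The data model mapping itself is still to be researched in T8.1, so the `ExtractedFact` class and the `to_quickstatements` function are hypothetical, and the Item and property IDs in the example are placeholders rather than verified Wikidata identifiers. The only identifier taken as given is `S854`, the source form of the reference URL property (P854).

```python
from dataclasses import dataclass

# QuickStatements marks reference properties with an "S" prefix;
# S854 is the source form of P854 (reference URL).
REFERENCE_URL = "S854"


@dataclass(frozen=True)
class ExtractedFact:
    """One fact produced by frame extraction (hypothetical structure)."""
    subject: str        # Item ID of the statement subject
    prop: str           # property ID the frame was mapped to
    value: str          # Item ID of the participant used as value
    reference_url: str  # URL of the document the sentence came from


def to_quickstatements(fact: ExtractedFact) -> str:
    """Serialize a fact as one tab-separated QuickStatements line."""
    return "\t".join([
        fact.subject,
        fact.prop,
        fact.value,
        REFERENCE_URL,
        f'"{fact.reference_url}"',  # string values are double-quoted
    ])


# Placeholder IDs and URL, for illustration only.
example = ExtractedFact("Q1000001", "P2000002", "Q3000003",
                        "https://example.org/article")
print(to_quickstatements(example))
```

This sketch only covers Item-valued statements; dates and other literal values would need their own formatting rules.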

Budget

The total amount requested is 30,000 USD.

Budget Breakdown

| Item | Description | Commitment | PM(1) | Cost |
|---|---|---|---|---|
| Project Leader | Responsible for the whole work package | Full time (40 hrs/week) | 6 | 16,232 € |
| NLP Developer | Assistant for the StrepHit pipeline implementation (English language) | Part time (20 hrs/week) | 3 | 7,095 € |
| Training Set | Crowdsourced job payment for the annotation of training sentences | One-off | N.A. | 1,090 € |
| Dissemination | Participation (travel, board & lodging) in relevant community conferences, e.g., Wikimania 2016 | One-off | N.A. | 1,500 € |
| Total | | | | 25,917 € |

(1) Person Months

The item costs are computed as follows:

the project leader's and the NLP developer's gross salaries are estimated from the standard salaries of the hosting research center (i.e., Fondazione Bruno Kessler),[22] namely "Ricercatore di terza fascia" (grade 3 researcher) and "Tecnologo/sperimentatore di quarto livello" (level 4 technologist). The salaries comply both with (a) the provincial collective agreement as per the provincial law n. 14,[23] and with (b) the national collective agreement as per the national law n. 240.[24] These laws respectively regulate research and innovation activities in the area where the research center is located (i.e., Trentino, Italy), and at a national level. More specifically, the former position is set to a gross labor rate of 16.91 € per hour, and the latter to 14.78 € per hour. The rates are in line with other national research institutions, such as the universities of Trieste,[25] Firenze,[26] and Roma;[27]

the training set construction job has an average cost of 4.35 ¢ per annotated sentence, for a total of 500 sentences for each of the 50 target relations;

the dissemination boils down to attending 2 relevant conferences.

The total budget expressed in Euros is approximately equivalent to the requested amount in U.S. Dollars, given the current exchange rate of 1.14 USD = 1 €.
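The item costs can be cross-checked with a few lines of arithmetic. The sketch below assumes that one person month corresponds to 160 working hours (40 hours per week over 4 weeks), which is not stated in the proposal but reproduces the listed labor costs to within a few euros; the remaining differences are presumably due to rounding.

```python
HOURS_PER_PERSON_MONTH = 160  # assumption: 40 hours/week x 4 weeks/month
USD_PER_EUR = 1.14            # exchange rate quoted in the proposal


def labour_cost(rate_eur_per_hour, person_months):
    """Gross labor cost in euros for a given hourly rate and effort."""
    return rate_eur_per_hour * person_months * HOURS_PER_PERSON_MONTH


project_leader = labour_cost(16.91, 6)  # listed as 16,232 EUR
nlp_developer = labour_cost(14.78, 3)   # listed as 7,095 EUR
training_set = 0.0435 * 500 * 50        # 4.35 cents x 500 sentences x 50 relations
dissemination = 1500.0                  # lump sum taken from the table

total_eur = project_leader + nlp_developer + training_set + dissemination
total_usd = total_eur * USD_PER_EUR
print(f"{total_eur:,.2f} EUR is about {total_usd:,.2f} USD")
```

The recomputed total stays close to the 25,917 € listed in the table and, once converted, slightly below the 30,000 USD requested.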

N.B.: Fondazione Bruno Kessler will be physically hosting the grantees, but it will not be directly involved in this proposal: the project leader will serve as the main grantee and will appropriately allocate the funding.

Community Engagement

All the following target communities have been notified before the start of the project (cf. the community notification) and will be involved according to the different phases:

Wikidatans;

Wikipedians;

Librarians (and GLAM-related communities);

Natural Language Processing developers and researchers;

Open Data organizations.

The engagement process will mainly be based on a constant presence on community endpoints and social media, as well as on the physical presence of the project leader at key events.

Phase 0: Testing the Prototype

The FBK-strephit-soccer demonstrative dataset contains references extracted from sources in Italian. Hence, we have invited the relevant Italian communities to test it. This effort has a double impact:

it may catch early signals to assess the potential of the project idea;

it spreads the word about the primary sources tool.

Phase 1: Corpus Collection

The Wikipedia community has defined comprehensive guidelines for sources verifiability.[28] Therefore, it will be crucial to the early stage of the project, as it can discover and/or review the set of authoritative Web sources that will form the input corpus. Librarians are also naturally vital to this phase, due to the relatedness of their work activity.

Phase 2: Multilingual StrepHit

Besides the Italian demo dataset, the first StrepHit release will support the English language. We aim at attracting Natural Language Processing experts to implement further language modules, since Wikidata publishes multilingual content and benefits from a multilingual community. We believe that references from sources in multiple languages will have a huge impact in improving the overall data quality.

Phase 3: Further Data Donation

The project outcomes will serve as an encouragement for third-party Open Data organizations to donate their data to Wikidata through a standard workflow, leveraging the primary sources tool.

Sustainability

Once the project gets integrated into the Wikidata workflow and the target audience gets involved, we can ensure further self-sustainability by fulfilling the following requirements:

to enable a shared vision with strategic partners;

to foster multilingual implementations of the StrepHit pipeline.

Out of Scope: the Vision

The project builds upon the findings of our previous research efforts, which aim at constructing a knowledge base with large amounts of real-world entities of international and local interest (cf. Figure 2). The different Wikipedia chapters constitute its core. Governmental and research Open Data are interlinked to the knowledge base. This will allow the deployment of a central data hub acting as a reference access point for the user community. Hence, data consumers such as journalists, digital libraries, software developers or Web users in general will be able to leverage it as input for writing articles, enriching a catalogue, building applications or simply satisfying their information retrieval needs.

Figure 2: High-level project vision

Strategic Partners

We aim at sharing the aforementioned vision with the following partners (besides Wikidata):

| Partner | Reason | Supporting references |
|---|---|---|
| Wikimedia Engineering community | Actively working on a similar vision | Wiki Loves Open Data[29][30] initiative, part of the quarterly goals[31] |
| Google | Responsible for the primary sources tool development | Primary sources tool codebase[32] |
| Freebase (now Google Knowledge Graph team) | Eventual migration of Freebase data to Wikidata | Freebase shutdown announcement,[33] migration project page,[34] migration FAQ[35] |
| Ontotext | Interested in collaborating under the umbrella of the Multisensor FP7 project | Multisensor project homepage,[36] Ontotext involvement[37] |

Measures of Success

All the quantitative local metrics to measure success are related to the primary sources tool and can be verified at the Wikidata primary sources status page. They are presented below in descending order of specificity.(1)

50,000 new curated statements (namely the sum of approvals and rejections), currently 19,201;

100 new primary sources tool active users,(2) given that (a) the top 10 users have performed 22,971 actions and (b) the number of currently active Wikidata users amounts to 15,603;[38]

involvement of 5 data donors from Open Data organizations.

The following global metrics naturally map to the local ones:

From a qualitative perspective, success signals will be collected through:

a dedicated Wikidata project page (similar to e.g., WikiProject Freebase);

a Wikidata request for comment process;

a survey.

(1) the displayed numbers were looked up on September 11th 2015;

(2) in order to count the users specifically engaged by StrepHit, a distinction among datasets should be clearly visible. However, the primary sources status API endpoint does not seem to handle dataset grouping yet. An issue has been filed in the code repository and will be included in task T9 of the work package.

References

Get Involved

Participants

Marco Fossati is a researcher with a double background in Natural Languages and Information Technologies. He works at the Data and Knowledge Management (DKM) research unit at Fondazione Bruno Kessler (FBK), Trento, Italy. He is a member of the DBpedia Association board of trustees, and the founder and representative of its Italian chapter. He has interdisciplinary skills both in linguistics and in programming. His research focuses on bridging the gap between Natural Language Processing techniques and Large Scale Structured Knowledge Bases in order to drive the Web of Data towards its full potential. His current interests involve Structured Data Quality, Crowdsourcing for Lexical Semantics annotation, and Content-based Recommendation Strategies.

Advisor Claudio Giuliano is a researcher with more than 16 years of experience in Natural Language Processing and Machine Learning. He is currently head of the Future Media High Impact Initiative unit at FBK, focusing on applied research to meet industry needs. He founded and led Machine Linking, a spin-off company incubated at the Human Language Technologies research unit: the main outcome is The Wiki Machine, an open source framework that performs word sense disambiguation in more than 30 languages by finding links to Wikipedia articles in raw text. Among The Wiki Machine applications, Pokedem is a socially-aware intelligent agent that analyses Italian politicians' profiles, integrating data from social media and news sources. Claudio will serve as the scientific advisor of this project.

Volunteer We use FrameNet in FP7 MultiSensor and devised embedding in NIF. Ontotext would be interested to help. Mainly with large-scale data/NLP wrangling. Vladimir Alexiev (talk) 13:43, 9 September 2015 (UTC)

Volunteer with contributing embodied cognition concepts into StrepHit, by expanding the concept of lexical units, frames and scenarios, starting from Perception_active, Perception_body, Perception_experience.

Furthermore I would like to contribute to the discussion of the project, and to test the use case and how it could be applied in the future to a medical domain. Projekt ANA (talk) 22:17, 19 September 2015 (UTC)

Volunteer Use, test and give feedback. Danrok (talk) 17:13, 22 September 2015 (UTC)

Volunteer I am a student with background in computer science and computational linguistics. I have close to two years of experience in NLP. I can join this project as a freelance NLP developer Nisprateek (talk) 05:03, 25 September 2015 (UTC)

Volunteer Use case and validation (Andrea Bolioli, CELI) BolioliAndrea (talk) 14:36, 1 October 2015 (UTC)

Volunteer I'm a PhD student in NLP and I'd like to work as a freelancer programmer. So far I worked mainly on QA, and, more precisely, in searching supporting passages for automatically answering open-domain questions.

I'd like to give a hand by building software for validating the statements already present on Wikidata by using the candidate statements extracted from text passages present on authoritative sources. Auva87 (talk) 14:05, 24 December 2015 (UTC)

Volunteer Hi Marco, I'm Pasquale Signore, from Università degli Studi di Bergamo and an intern at Wikimedia Italia.

I look forward to your feedback ;) PasqualeSignore (talk) 07:28, 5 May 2016 (UTC)

Community Notification

The following list displays all the links where relevant communities have been notified of this proposal, along with links to any other relevant community discussions. The list is sorted in descending order of community specificity.

Wikidata

N.B.: As per the phase 0 of the community engagement plan, we have invited the relevant Italian-speaking communities to test the soccer demo dataset, since the references are extracted from Italian Web sources.

Wikipedia

Librarians

Natural Language Processing practitioners

Open Data organizations

Endorsements

Do you think this project should be selected for an Individual Engagement Grant? Please add your name and rationale for endorsing this project in the list below. (Other constructive feedback is welcome on the talk page of this proposal).


I support StrepHit because humans play an important role in improving the data quality on sites such as Wikidata. The reality of the Web of Data will come about through human and machine cooperation. BernHyland (talk) 15:15, 21 September 2015 (UTC)

Because it is an important issue to have reliable References. Crazy1880 (talk) 11:20, 22 September 2015 (UTC)

This is a promising direction for technological development, and an IEG project is a good umbrella for kick-starting this. I believe that the project can yield valuable results for Wikidata, in a way that is complementary to other ongoing efforts. I also see some critical points:

The project plan is very fine-grained, maybe too fine-grained for a 6 month project (speaking from experience here).

I would like a clearer commitment to creating workable technical infrastructure here. Content (extracted facts) should not be the main outcome of an IEG; rather there should be a fully open sourced processing pipeline that can be used after the project.

How does the interaction with OntoText fit into the open source strategy of WMF? (As far as I recall, OntoText does not have open source options for its products.)

I'm offering help with embedding FrameNet to NIF and data wrangling. Ontotext tools don't need to be used in the project at all. --Vladimir Alexiev (talk) 12:54, 8 December 2015 (UTC)

One of the main goals is 100 new active users of the primary sources tool. But how would this be measured? Since Primary Sources is still under development, it is to be expected that the user numbers will grow naturally over the next year. How can this be distinguished from the users attracted by this IEG project?



It would be good if these could be clarified. Yet, overall, I support the project. --Markus Krötzsch (talk) 12:34, 22 September 2015 (UTC)