Pluto Open Project (2)

Author Name Disambiguation using Self-citation

Hi, it’s Pluto Network’s data mining team.

In a previous post, we introduced two ideas for Author Name Disambiguation — self-citation and coauthor similarity — and some problems such as malformed data.

This post will describe the preprocessing of data and implementation of those ideas from the previous post. Before we start, let’s define the terms used in this post. Some are cited from articles, and the others are my own.

We focus on merging, not splitting, because we have encountered few authors that need to be split, in contrast to the large number that need to be merged.

Besides these, we use the term publications for objects that are regarded as papers in the Scinapse database. These may be patents, newsletters, or even pieces of music (example below). Among them, articles are academic research papers, and publications that are not articles will be called non-article publications in this post.

Citation is needed for disambiguation, source: Dan4th Nicholas, Flickr (CC BY 2.0)

Data Preprocessing

In the preprocessing step, we focused on removing publications that are not articles, i.e. non-article publications. We don’t know why such a large number of non-article publications ended up in the database, but it is obvious that they should be removed.

Bach is one of the greatest musicians, but his musical works are not articles

Since they are labeled neither as articles nor as non-articles, we could not remove them easily. First, we aggregated the non-article publications we had found so far. Then we inspected their attributes, such as citations, authorship patterns, and abstracts.

The problem was that even if non-article publications shared a specific pattern, say X, not all publications with pattern X were non-article publications. For instance, non-article publications tend to have short abstracts, but some articles have very short abstracts too (fewer than 15 words). What’s worse, some abstracts were even malformed.

Thus, we had to repeat a cycle: find a pattern, then investigate the publications with that pattern to verify whether it was a sufficient condition for being a non-article publication. It took a great amount of time and the contributions of open-source contributors.

* Note that we couldn’t use machine learning to find such rules, since we didn’t have enough labeled examples.

Finally, we found two distinctive patterns for identifying non-article publications. The first is that publications from certain source domains are not articles: obviously, documents from “google.patent.com” must be patents, not articles. The second is that some non-article publications tend to be written by exactly the same author group, repeatedly. The largest number of publications sharing a single author group exceeded 20,000.
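The two rules above can be sketched in code. This is a minimal illustration with an assumed schema — the `id`, `domain`, and `authors` fields, the domain set, and the threshold parameter are hypothetical, not Scinapse’s actual data model:

```python
from collections import Counter

NON_ARTICLE_DOMAINS = {"google.patent.com"}  # domains known to host non-articles


def find_non_articles(publications, group_threshold=20_000):
    """Return ids of publications matching either non-article pattern."""
    flagged = set()

    # Rule 1: publication hosted on a known non-article domain.
    for pub in publications:
        if pub["domain"] in NON_ARTICLE_DOMAINS:
            flagged.add(pub["id"])

    # Rule 2: the same author group appears an implausible number of times.
    group_counts = Counter(frozenset(pub["authors"]) for pub in publications)
    for pub in publications:
        if group_counts[frozenset(pub["authors"])] > group_threshold:
            flagged.add(pub["id"])

    return flagged
```

In practice each rule would be tuned separately; the threshold here only marks author groups far beyond any plausible publication count.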

Is it possible to publish this many articles with the same author set?

Author Name Disambiguation

We tested the two ideas — coauthor similarity and self-citation — within small author blocks grouped by surname, or surname blocks. Surprisingly, even though coauthor similarity is one of the most frequently mentioned features in the literature on Author Name Disambiguation, there were few authors with both similar coauthors and similar names. Instead, there were many more authors who cite other authors with similar or identical names, which suggests that they may be the same individuals.

We found the reason for this in a Microsoft Academic’s post.

Thus, we decided to focus only on self-citation, since Author Name Disambiguation using coauthor similarity is already performed by Microsoft Academic.

(* The majority of our database comes from Microsoft Academic)

Although there were many authors with similar or identical names within each subgraph, we still could not be sure they were the same individuals, even after looking them over. For instance, even if one J. Kim cited an article by another J. Kim, we cannot conclude that they are the same individual, since there may be many J. Kims in academia. (This can happen even within a single laboratory in South Korea.)

Since we are sensitive to false positives, we prefer strict rules even if they yield fewer results. After inspecting many subgraphs, we adopted the following rules.

1. Citation subgraph with identical surnames

We assumed that all author records belonging to a single individual would have exactly the same surname, since researchers tend not to abbreviate their surnames. We also filtered out non-English surnames, since handling every single language was difficult and ultimately ineffective.

Thus, we blocked the data by surname and built subgraphs with authors as nodes and citations as edges. Afterwards, we performed disambiguation within each subgraph.
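The blocking and subgraph construction described above can be sketched roughly as follows. The structures are illustrative assumptions: authors as an id-to-name mapping, citations as id pairs, and the surname taken as the last name token:

```python
from collections import defaultdict


def surname_blocks(authors):
    """Group author ids by surname (English surnames only)."""
    blocks = defaultdict(set)
    for author_id, name in authors.items():
        surname = name.split()[-1]
        if surname.isascii() and surname.isalpha():  # skip non-English surnames
            blocks[surname.lower()].add(author_id)
    return blocks


def citation_subgraphs(block, citations):
    """Connected components of the citation graph restricted to one block."""
    adj = defaultdict(set)
    for citing, cited in citations:
        if citing in block and cited in block:
            adj[citing].add(cited)
            adj[cited].add(citing)
    seen, components = set(), []
    for node in adj:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:  # iterative depth-first traversal
            cur = stack.pop()
            if cur in comp:
                continue
            comp.add(cur)
            stack.extend(adj[cur] - comp)
        seen |= comp
        components.append(comp)
    return components
```

A graph library would work just as well here; the point is only that disambiguation then proceeds per connected component, never across blocks.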

2. Exact name match

In the previous step, we found many authors who cited other authors with the same surname. Obviously, however, such pairs do not necessarily represent the same individual. To find reliable cases, we inspected a number of subgraphs.

In doing so, we found many authors with exactly the same name within a subgraph. They mostly represented the same individual, except when the first name or the whole name was too common (especially when the first name was written as initials).
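This exact-match step is simple to express. In the sketch below, `names` maps author ids to already-normalized full names — an assumed structure, not the production schema:

```python
from collections import defaultdict
from itertools import combinations


def exact_name_pairs(subgraph, names):
    """Yield candidate merge pairs: authors in one subgraph sharing a full name."""
    by_name = defaultdict(list)
    for author_id in subgraph:
        by_name[names[author_id]].append(author_id)
    for same_name in by_name.values():
        # every pair of ids that carry the identical name string
        yield from combinations(sorted(same_name), 2)
```

The common-name exceptions mentioned above are handled by a later rule, not here.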

3. Unique existence

In summary, the authors within each subgraph at this stage have citation relationships and exactly the same names. We considered other attributes, such as fields of study, journals, and affiliations, but citation was the most powerful indicator among them. Since we did not want to dwell on this problem, we simply decided to exclude authors with common names.

To quantify how common a name is, we used our database itself. We picked the subgraphs whose names do not appear anywhere outside them (i.e. no disconnected authors with the same names). For example, if there are only two authors named C. Gram in our database and one of them cited the other, that is such a case. Of course, this relationship extends to larger subgraphs with more authors.
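The unique-existence rule reduces to a disjointness check. Continuing with the same assumed id-to-name mapping:

```python
def uniquely_exists(subgraph, names):
    """True if no author outside the subgraph shares a name with one inside it."""
    inside = {names[a] for a in subgraph}
    outside = {name for a, name in names.items() if a not in subgraph}
    return inside.isdisjoint(outside)
```

Only subgraphs passing this check were merged; any name that also occurs on a disconnected author disqualifies the whole subgraph.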

We considered these criteria quite strict, and verification with random sampling showed that the results were credible.

Conclusions

To sum up, in the preprocessing step we removed 58,796,366 (28.05%) non-article publications, and through Author Name Disambiguation using self-citations we merged 1,608,289 authors into 649,519.

The following are issues we were concerned about while carrying out the project.

1. Duplicates

Several publication records can represent a single real article, and resolving this might be called ‘publication name disambiguation’. However, even when several records have exactly the same title and share some authors, we may still be uncertain whether they represent the same article, since some of them have different DOIs or publication dates.

In any case, we suspect there are tens of thousands of potentially duplicated publications, and we should merge them to improve the quality of our data.
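One plausible first pass at finding such candidates — not our implemented method, just a hedged sketch — is to group records by normalized title and require some author overlap, leaving the DOI and date conflicts for manual or rule-based resolution:

```python
from collections import defaultdict


def duplicate_candidates(publications):
    """Yield pairs of publication ids sharing a title and at least one author."""
    by_title = defaultdict(list)
    for pub in publications:
        # hypothetical fields: "id", "title", "authors"
        by_title[pub["title"].strip().lower()].append(pub)
    for group in by_title.values():
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                a, b = group[i], group[j]
                if set(a["authors"]) & set(b["authors"]):
                    yield (a["id"], b["id"])
```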

(* This may also involve version control)

2. Impact

We respect researchers and articles. But since their impact varies, metrics are often used for evaluation within academia.

From this perspective, although it is difficult to measure, disambiguating notable individuals such as Nobel laureates may be more meaningful than disambiguating an average author. This is not to say that such a difference in value exists between their research achievements; we are speaking only of the impact on the information system of disambiguating such individuals.

We can never know exactly how many individuals, as opposed to author records, there are in academia, but the number is certainly estimated to be under 100 million. We merged about 1.6 million authors out of approximately 150 million in total, which is indeed a tiny fraction. Still, a well-organized database is important to Team Pluto, and we have successfully taken an important first step.

In the next post, we will disambiguate the duplicated papers and try to merge notable authors in a new way.