The database is built in accordance with the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles, first set out by Wilkinson et al.19.

Overview

The data collection process, summarized in Fig. 1, comprises three phases: searching, creating science profiles, and constructing master datasets. The goals are to:

(i) collect data on every Vietnamese social sciences and humanities (SS&H) researcher who has published in Scopus-indexed journals from 2008 to 2018;

(ii) ensure the reliability and accuracy of the data.

Figure 1: Project conceptualization. The project, from data collection to dataset construction, consists of three phases. Phase 1 identifies Vietnamese social scientists who have international publications and cross-checks them against various data sources. Phase 2 creates a personal science profile for each author, which includes information on the author’s scientific output and demographic characteristics. Phase 3 constructs cross-section and network datasets from all the science profiles.

To achieve these goals, the search covers only SS&H researchers of Vietnamese nationality who meet at least one of the following criteria:

They are affiliated with an organization in Vietnam; OR

They have published at least one paper on an SS&H topic that is about Vietnam or uses data collected in Vietnam.

The search is further confined to Vietnamese authors who have published in Scopus-indexed scientific journals. It is important to note that the method could in principle cover publications indexed in the WoS, MathSciNet, PubMed and other reliable scientific databases. For comparison, Scopus indexed about 22,600 titles20, almost twice as many as its counterpart WoS21. Because the project aims to serve Vietnamese science policymakers, we take into account that Scopus is one of many scholarly databases used by the Vietnamese government to judge academic credentials22. Specifically, in a governmental decision, the Vietnam National Foundation for Science & Technology Development (henceforth NAFOSTED), Vietnam’s leading funder for science and technology research, provided a list of prestigious international and national journals in SS&H, for which being indexed in Scopus is one criterion23. This is also common practice in various countries, including the United States, Spain, and Russia24–26, as well as for highly influential rankings such as the Times Higher Education27,28.

Building on these principles, we next describe the manual data collection system, its procedure, and the shortcomings that prompted the need for a semi-automatic system.

NVSS Manual System

The manual process of data collection and verification was carried out from 1st February 2017 to 15th July 2017 and resulted in the Network of Vietnamese Social Scientists (NVSS) dataset. NVSS contains 412 science profiles for 412 distinct Vietnamese researchers in social sciences and humanities who have published in Scopus-indexed journals. An example of these first science profiles can be found in Data Citation 1.

Procedure

The first step of the data collection process was to access the websites of research institutions in Vietnam to identify researchers who fit the above criteria. Then, based on their public CVs, we recorded the number of publications they had authored and their demographic information. Next, we cross-checked these newly gathered data against journal websites, Google Scholar, Scimagojr, and Scopus to make sure the information claimed on the CVs was in fact accurate. Scopus therefore served only as a cross-check: we examined whether a randomly chosen research item was present in its index.

To ensure that the manual process covered as many eligible Vietnamese researchers as possible, we also consulted the reference lists of the articles and experts’ opinions, searched with varied keywords (‘Vietnamese economic development’, ‘Vietnamese history’, ‘Vietnamese culture’, etc.), and drew on other resources such as social media and online news outlets. The experts came from organizations such as the State Council for Professor title of Vietnam, the Scientific Committees of NAFOSTED, the scientific boards of leading research institutions such as national universities, and the Vietnam Academy of Social Sciences, or were researchers with long experience or high productivity in their respective disciplines. In the data collection stage, our team members reached out to the experts for suggestions or confirmation of eligible researchers, then subjected these suggestions to the rigorous cross-validation process.

The second step was to create a personal science profile for each Vietnamese author. Each science profile corresponded to 13 lines of data (see Table 1). This process resulted in a clean, concise dataset of the most up-to-date and complete profiles. We then contacted the researchers and invited them to corroborate the profiles made by our team; examples of corroborated profiles can be found in Data Citation 1’s Scientific Profiles (Examples) folder. A list of input names and explanations appears in Table 1, while their relationships are illustrated in Fig. 2.

Table 1 Input names and explanation.

Figure 2: The relationship among variables recorded in this study. A personal science profile consists of five groups of factors: scientific output, demographic factors, collaboration factors, fields of study and affiliation. Scientific output factors concern the total number of publications, solo publications, publications in leading (key) position, and contribution-adjusted productivity. Demographic factors include age, gender, regions, and career age. Collaboration factors concern the total number of collaborators, of domestic collaborators, and of foreign collaborators. Two other factors are fields of study and affiliations.

The third step involved summarizing all the profiles into a master file. An example of the master file resulting from the manual system can be found in Ho et al.2,9.

Shortcomings

This manual method, albeit rigorous, has two major shortcomings. First, the manual input of data is time-consuming, rigid, and prone to human error. The resulting dataset lets us count how many publications each author has, but not how many unique publications and journals exist in the entire database. This limitation prevents us from answering important questions, such as how many new articles Vietnamese social scientists produce each year, and from generating data on the international co-authorship network. Second, because the contribution-adjusted productivity (‘cp’) was computed manually, it would be immensely costly to switch to a different counting method, such as the norm of all authors getting an equal share or the norm of first-last emphasis29,30.
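To illustrate why storing each article once with its ordered author list removes both shortcomings, the following minimal Python sketch counts unique articles and recomputes productivity under interchangeable counting schemes. The sample records, the function names, and in particular the weights in first_last_emphasis are assumptions made for illustration; they are not the definitions used in the cited norms or in SSHPA.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical records: each article appears once, with its ordered author IDs.
articles: Dict[str, List[str]] = {
    "a1": ["vm.1", "vf.2", "fm.1001"],
    "a2": ["vf.2"],
    "a3": ["vm.1", "vf.2"],
}


def equal_share(n_authors: int) -> List[float]:
    """Every author of a paper receives 1/n of the credit."""
    return [1.0 / n_authors] * n_authors


def first_last_emphasis(n_authors: int) -> List[float]:
    """Illustrative weights only (an assumption, not the cited norm):
    first and last authors count double, then weights are normalized."""
    raw = [1.0] * n_authors
    raw[0] = 2.0
    if n_authors > 1:
        raw[-1] = 2.0
    total = sum(raw)
    return [weight / total for weight in raw]


def productivity(scheme: Callable[[int], List[float]]) -> Dict[str, float]:
    """Sum each author's credit over all articles under the given scheme."""
    scores: Dict[str, float] = defaultdict(float)
    for author_ids in articles.values():
        weights = scheme(len(author_ids))
        for author_id, weight in zip(author_ids, weights):
            scores[author_id] += weight
    return dict(scores)


unique_article_count = len(articles)
equal_scores = productivity(equal_share)
emphasis_scores = productivity(first_last_emphasis)
```

Switching the counting method then amounts to passing a different function to productivity, rather than recomputing every profile by hand.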

SSHPA Semi-automatic system

The semi-automatic system, called Social Sciences & Humanities Peer Awards (SSHPA), was developed from 1st December 2017 to 2nd February 2018 to resolve the problems posed by the manual process. The purpose was to have a system capable of: (i) validating the quality of the data previously collected, and (ii) making our database more flexible, less time-consuming to construct, and less prone to human error. The semi-automated process also enables us to come as close as possible to covering the actual number of eligible Vietnamese social scientists. As a brief overview of the distribution by sex, there were 262 female (39.88%) and 391 male researchers (59.51%), with four of unknown sex. Table 2 shows the descriptive statistics for the continuous variables used in the SSHPA system. Other datasets related to these statistics can also be viewed in Data Citation 1’s Extracted and Computed Data’s table.

Table 2 SSHPA’s descriptive statistics on the productivity of Vietnamese researchers in SS&H from 2008 to 2018.

System architecture

The SSHPA system, accessible online at https://sshpa.com/, is built on MS SQL Server 2012 and uses full-text search indexing to centralize the management process. It follows a client-server architecture. The server software is built using .NET Core, which provides the API connections and functional modules such as Data Search & Filters, Data Validation, Network Builder and Reports. In addition, the SSHPA client software is built with C# and connects to the database server through a REST API interface; it is intended to provide users with complete data-input and data-check functions.

As in the manual data collection process, the first step was to search for profiles of Vietnamese social scientists fitting our criteria. As shown in Fig. 3, we collected the profiles provided by researchers and organizations, then verified them against other sources such as government websites, NAFOSTED’s designated publications, journal websites, Scimagojr, Google Scholar, Scopus’ freely accessible data, etc.

Figure 3: The system architecture of SSHPA. The system consists essentially of three major steps: (i) collecting the profiles of social scientists and cross-verifying them with five other sources; (ii) entering the verified data into the SSHPA database, where they are checked by the automated quality assurance and, once the data are in the system, screened again by the quality control auto-checkers for consistency and accuracy; and (iii) authenticating and authorizing (through three levels of admins, supervisors, and collectors) the final science profiles in the SSHPA database.

The verified data were then entered into the SSHPA database and put through automated quality assurance and quality control steps. SSHPA was also designed with an authorization system with three levels: admins, supervisors, and collectors. Collectors could only input and edit unapproved data. Supervisors could approve a data entry; however, once the supervisors judged an entry complete and accurate and approved it, it could no longer be changed or removed by either the supervisors or the collectors. Only the admins could remove a data entry or unlock approved data for changes. Hence, at each level of authorization, each person is accountable for the accuracy and reliability of the data entered into the system. Being semi-automated, SSHPA was still prone to human error; this authorization mechanism was a way to uncover mistakes in a timely manner and thus minimize their consequences.
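The three-level authorization rules can be sketched as a single permission check. This is a minimal illustration in Python, not the SSHPA implementation (which the text describes as .NET Core and C#); the names Role, Action, Entry and is_allowed are hypothetical, and the sketch assumes that only supervisors approve entries and that admins, too, must unlock an approved entry before it can be edited.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    COLLECTOR = "collector"
    SUPERVISOR = "supervisor"
    ADMIN = "admin"


class Action(Enum):
    EDIT = "edit"
    APPROVE = "approve"
    UNLOCK = "unlock"
    REMOVE = "remove"


@dataclass
class Entry:
    title: str
    approved: bool = False


def is_allowed(role: Role, action: Action, entry: Entry) -> bool:
    """Return True if the role may perform the action on the entry."""
    if action is Action.REMOVE:
        # Only admins can remove a data entry.
        return role is Role.ADMIN
    if action is Action.UNLOCK:
        # Only admins can unlock an approved entry for changes.
        return role is Role.ADMIN and entry.approved
    if action is Action.APPROVE:
        # Assumption: approval is the supervisors' task only.
        return role is Role.SUPERVISOR and not entry.approved
    # Action.EDIT: approved entries are locked for every role until unlocked.
    return not entry.approved
```

Keeping all rules in one function makes it easy to audit who can change what, which is the accountability property the authorization levels are meant to provide.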

Data structure

The data, once entered into our system, were organized as tables in a relational database management system (RDBMS).

We designate Article as the fundamental unit of SSHPA’s data structure (center of Fig. 4), because: (i) an article’s name is often long enough to reduce the odds of data duplication, and (ii) an article published on a journal’s website provides the other information, such as authors, authors’ affiliations, publication year, and so on. This means all the other kinds of data (Author, Affiliation, Source, Publisher, Network, etc.) are connected through Article.

Figure 4: SSHPA’s data structure diagram: relationships among authors, articles, affiliations, fields, sources, and publishers. There are four kinds of data in the SSHPA system, and they are related to each other through one fundamental unit, datArticle. The pink block contains boxes pertaining to the authors and their network information. The green block contains boxes pertaining to the sources, publishers and articles. The yellow block contains boxes pertaining to the authors’ affiliations.

For example, the datArticle box and the datAuthor box are connected to each other through an intermediary, datArticleAuthor, which holds the information that links authors to their publications, such as author IDs, article ID, order of the author(s), affiliations of the authors, etc. The datArticle box contains the relevant data on the articles or publications in the database: title, document type (proceedings or journal articles, for example), publisher ID, journal ID, etc. These data are fed from other boxes, which contain information on the publishers (lstPulisher), the sources of the articles, that is, journals, conference proceedings or books (lstSource), the citation information (lstCitation), and the document type (lstDocumentType). Similar principles apply to Network data (datNetwork, datNeworkviz) and Affiliation data (datAffiliation, datAffiliationAuthor).
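The article-centred design can be sketched with an in-memory SQLite database in Python. The table names datArticle, datAuthor and datArticleAuthor come from the text; the column names, the sample rows and the use of SQLite instead of MS SQL Server are assumptions made only to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE datAuthor (
        author_id TEXT PRIMARY KEY,          -- e.g. 'vm.1'
        full_name TEXT NOT NULL
    );
    CREATE TABLE datArticle (
        article_id INTEGER PRIMARY KEY,
        title      TEXT NOT NULL UNIQUE,     -- each article is stored once
        pub_year   INTEGER
    );
    -- Junction table linking authors to articles, with the author order.
    CREATE TABLE datArticleAuthor (
        article_id   INTEGER REFERENCES datArticle(article_id),
        author_id    TEXT REFERENCES datAuthor(author_id),
        author_order INTEGER NOT NULL,
        PRIMARY KEY (article_id, author_id)
    );
    """
)
conn.executemany(
    "INSERT INTO datAuthor VALUES (?, ?)",
    [("vm.1", "Author One"), ("vf.2", "Author Two")],
)
conn.executemany(
    "INSERT INTO datArticle VALUES (?, ?, ?)",
    [(1, "Title A", 2016), (2, "Title B", 2017)],
)
conn.executemany(
    "INSERT INTO datArticleAuthor VALUES (?, ?, ?)",
    [(1, "vm.1", 1), (1, "vf.2", 2), (2, "vf.2", 1)],
)

# Publications per author come from the junction table ...
per_author = dict(
    conn.execute(
        "SELECT author_id, COUNT(*) FROM datArticleAuthor GROUP BY author_id"
    )
)
# ... while unique articles per year come from datArticle alone.
per_year = dict(
    conn.execute("SELECT pub_year, COUNT(*) FROM datArticle GROUP BY pub_year")
)
unique_articles = conn.execute("SELECT COUNT(*) FROM datArticle").fetchone()[0]
```

Because a co-authored article is one row in datArticle and several rows in datArticleAuthor, per-author counts and database-wide counts of unique articles can both be computed without double counting.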

The structure of the database may seem redundant. For example, the author’s biographical information (datAuthor and datAuthorName) could have been merged into one file, but the separation serves a function. This splitting enables the SSHPA system to filter out overlapping author names faster because: (i) a Vietnamese author might have his or her name written differently in different publications, and (ii) the names recorded in our database are in Vietnamese spelling, which has some digraphs and nine additional accent marks or diacritics.
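One way to match name variants across publications is to reduce each name to a comparison key. The Python sketch below is an assumption about how such a key could be built, not a description of SSHPA’s code: it strips diacritics with the standard unicodedata module, handles the letter đ separately because it has no canonical decomposition, and sorts the name parts so that family-name-first and given-name-first orders compare equal. The sample names are hypothetical.

```python
import unicodedata


def name_key(name: str) -> str:
    """Return an accent-free, order-independent key for comparing names."""
    # 'đ' and 'Đ' do not decompose under NFD, so map them explicitly.
    replaced = name.replace("đ", "d").replace("Đ", "D")
    decomposed = unicodedata.normalize("NFD", replaced)
    # Drop the combining marks that carried the accents and tones.
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Lowercase and sort the parts so that word order does not matter.
    return " ".join(sorted(stripped.lower().split()))


same_person_candidate = name_key("Nguyễn Văn An") == name_key("An Nguyen Van")
```

A shared key only flags a candidate match; two different people can have the same name, so the flagged pairs still need the verification steps described in the quality control section.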

Having described how the data are structured in the SSHPA database, we next examine how SSHPA helps improve control over data quality.

Data quality assurance and control

The basic principle for building a good data verification process here is to ensure that four intertwined layers of checks are always carried out: (i) inter-data-sources check: different publicly accessible sources were used to cross-validate the accuracy of collected data; (ii) inter-data-types check: the different types of data collected were checked for coherence with one another; (iii) inter-data-collectors check: the data collectors involved in this study cross-checked the information collected by each other, especially content whose accuracy was in doubt; (iv) random and periodic check. In each step, every mistake was classified as either one-off or systematic and corrected accordingly.

In the SSHPA system, based on the above principle, the process is divided into quality assurance, which refers to the techniques applied before data are entered, and quality control, which refers to the techniques applied after data are entered to check for errors. Data quality can also be improved by spotting strange patterns in the data through network visualizations of the connections among authors or articles. The code relevant to these processes can be found in Data Citation 1’s Codes for SSHPA.pdf.

Quality assurance

The purpose of this step is to prevent bad data from being entered into the database in the first place. Several logic tests have been built into our semi-automatic system to help recognize suspicious author or article data. For the authors’ data, the tests check that:

- the name of the author does not already exist in the database

- the name of the author is not blank

- the SSHPA ID starts with ‘v’ if the author is Vietnamese, or ‘f’ if the author is foreign

- the SSHPA ID has, after the initial ‘v’ or ‘f’, an ‘f’ if the author is female, an ‘m’ if male, or a ‘?’ if the sex is unknown

- the SSHPA ID follows the format ‘geography specifier + sex specifier + number’; for example, vm.1 is a Vietnamese male researcher numbered 1 and ff.1001 is a foreign female researcher numbered 1001.
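The author tests above can be sketched in Python as a function that returns the list of problems found. The function name, its parameters and the messages are hypothetical, and the dot between the specifiers and the number is inferred from the examples vm.1 and ff.1001 rather than stated as a rule in the text.

```python
import re
from typing import List, Optional, Set

# geography specifier (v/f) + sex specifier (f/m/?) + '.' + number
SSHPA_ID = re.compile(r"^(?P<geo>[vf])(?P<sex>[fm?])\.(?P<num>\d+)$")
SEX_CODES = {"female": "f", "male": "m", None: "?"}


def check_author(
    name: str,
    sshpa_id: str,
    is_vietnamese: bool,
    sex: Optional[str],
    existing_names: Set[str],
) -> List[str]:
    """Return the problems found for one author record (empty if none)."""
    problems: List[str] = []
    if not name.strip():
        problems.append("name is blank")
    elif name.strip() in existing_names:
        problems.append("name already exists")

    match = SSHPA_ID.match(sshpa_id)
    if match is None:
        problems.append("ID format is invalid")
        return problems
    if match["geo"] != ("v" if is_vietnamese else "f"):
        problems.append("geography specifier does not match")
    if match["sex"] != SEX_CODES.get(sex):
        problems.append("sex specifier does not match")
    return problems
```

For instance, check_author("Author D", "vm.3", False, "male", set()) reports a geography mismatch, because a foreign author's ID should start with ‘f’.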

For the articles’ data, the tests check that:

- no article with the same title already exists in the database

- the title of the article is not blank

- the publisher and journal of the article are not blank

- the year of publication falls in the range from 2008 to the present

- no existing article title is at least 90% similar, according to a fuzzy search

If these requirements are not met, the system notifies the data collector or, in some cases, blocks him or her from moving on to the next data points. The data, when being entered, are also changed to match the format designated by the system. For example, in article titles, paragraph breaks and the curly quotation mark (“, character code 147) are changed, the latter to the straight quotation mark (", ASCII 34).
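The article tests and the title normalization can be sketched as follows. This Python example uses difflib.SequenceMatcher from the standard library as a stand-in similarity measure; the text does not say which fuzzy-matching method SSHPA uses, so the measure, the function names, the handling of the closing curly quote and the replacement of line breaks by a single space are all assumptions.

```python
from datetime import date
from difflib import SequenceMatcher
from typing import Iterable, List, Optional


def normalize_title(title: str) -> str:
    """Straighten curly double quotes and collapse line breaks and spaces."""
    straight = title.replace("\u201c", '"').replace("\u201d", '"')
    return " ".join(straight.split())


def similar_titles(
    title: str, existing: Iterable[str], threshold: float = 0.9
) -> List[str]:
    """Return existing titles whose similarity ratio reaches the threshold."""
    key = normalize_title(title).lower()
    return [
        other
        for other in existing
        if SequenceMatcher(None, key, normalize_title(other).lower()).ratio()
        >= threshold
    ]


def check_article(
    title: str,
    publisher: str,
    journal: str,
    year: int,
    existing: Iterable[str],
    this_year: Optional[int] = None,
) -> List[str]:
    """Return the problems found for one article record (empty if none)."""
    this_year = this_year or date.today().year
    problems: List[str] = []
    if not title.strip():
        problems.append("title is blank")
    if not publisher.strip() or not journal.strip():
        problems.append("publisher or journal is blank")
    if not 2008 <= year <= this_year:
        problems.append("year is out of range")
    # An exact duplicate has ratio 1.0, so this also covers identical titles.
    if title.strip() and similar_titles(title, existing):
        problems.append("a similar title already exists")
    return problems
```

A threshold of 0.9 catches titles that differ by a typo or a single changed word, at the cost of occasionally flagging genuinely distinct titles in a series, which is why a flagged entry would prompt a notification for human review rather than silent rejection.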

Quality control

This step applies data validation tools to control the quality of the data. The tools include the data filter, the search function (for relative and unique subjects), and the automatic data check functions. Some examples of these tools follow.

Two authors with different SSHPA-IDs but the same full names or middle names can easily be compared. If they are suspected of being one person, the software can perform a three-step verification:

- Through name: Check the author’s name with all other authors with the same name in the system

- Through affiliations: Check the author with all other authors with the same affiliation

- Through publication: Check the author with all others with the same publication
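The three-step verification can be sketched as a lookup that, for a suspect author, lists the other records sharing a name, an affiliation or a publication. The AuthorRecord structure, the candidates function and the sample data in this Python sketch are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass
class AuthorRecord:
    sshpa_id: str
    name: str
    affiliations: Set[str] = field(default_factory=set)
    articles: Set[str] = field(default_factory=set)


def candidates(
    target: AuthorRecord, authors: Iterable[AuthorRecord]
) -> Dict[str, List[str]]:
    """List the IDs of other authors that match the target on each route."""
    others = [a for a in authors if a.sshpa_id != target.sshpa_id]
    return {
        # Step 1: same name (case-insensitive comparison).
        "name": sorted(
            a.sshpa_id
            for a in others
            if a.name.casefold() == target.name.casefold()
        ),
        # Step 2: at least one shared affiliation.
        "affiliation": sorted(
            a.sshpa_id for a in others if a.affiliations & target.affiliations
        ),
        # Step 3: at least one shared publication.
        "publication": sorted(
            a.sshpa_id for a in others if a.articles & target.articles
        ),
    }


first = AuthorRecord("vm.10", "Author X", {"Univ A"}, {"art1"})
second = AuthorRecord("vm.11", "author x", {"Univ A"}, {"art1", "art2"})
third = AuthorRecord("vf.12", "Author Y", {"Univ A"}, {"art3"})
matches = candidates(first, [first, second, third])
```

An ID that appears on all three routes, as vm.11 does here, is a strong candidate for being the same person under two IDs; an ID that appears only under affiliation is more likely a colleague.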

Furthermore, the software can filter out low-quality data such as:

- Authors with missing or invalid information: year of birth, sex, affiliation, article.

- Articles with no authors
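These two filters can be sketched as simple queries over the records. In this Python sketch the field names birth_year, sex, affiliation and articles, and the sample records, are assumptions; a field counts as missing when it is absent, None or empty.

```python
from typing import Dict, List

# Fields assumed to be required for a complete author record.
REQUIRED_FIELDS = ("birth_year", "sex", "affiliation", "articles")


def authors_with_gaps(authors: Dict[str, dict]) -> List[str]:
    """Return the IDs of authors with any required field missing or empty."""
    return sorted(
        author_id
        for author_id, record in authors.items()
        if any(not record.get(name) for name in REQUIRED_FIELDS)
    )


def articles_without_authors(article_authors: Dict[str, List[str]]) -> List[str]:
    """Return the IDs of articles that have no author attached."""
    return sorted(
        article_id
        for article_id, author_ids in article_authors.items()
        if not author_ids
    )


sample_authors = {
    "vm.1": {"birth_year": 1975, "sex": "male",
             "affiliation": "Univ A", "articles": ["a1"]},
    "vf.2": {"birth_year": None, "sex": "female",
             "affiliation": "Univ B", "articles": ["a1"]},
    "v?.3": {"birth_year": 1980, "sex": None,
             "affiliation": "Univ A", "articles": []},
}
sample_articles = {"a1": ["vm.1", "vf.2"], "a2": []}

flagged_authors = authors_with_gaps(sample_authors)
flagged_articles = articles_without_authors(sample_articles)
```

The flagged IDs form a work list for the data collectors, who can then complete or correct the records.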

A notable feature of our quality control is that our data team members have invited the Vietnamese researchers to cooperate by directly verifying their information in our database. Though we have yet to hear from all of them, the responses received to date raise the credibility of the open database.

Automated construction of network data

Several kinds of network data are automatically recorded by SSHPA: co-authorship among authors (undirected network data), leading-author to non-leading-author connections (directed network data), co-authorship among affiliations, co-authorship among geographical locations, etc. The network data allow for different ways of representing the data visually, as shown in Figs. 5, 6, 7, 8. This function enables the data collector to visualize the connections among the articles and authors in the database, thus providing him or her with a new way to spot strange patterns in the data.
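The two author-level networks can be derived directly from the ordered author list of each article. The following Python sketch is an illustration under assumptions, not SSHPA’s Network Builder: the function name and sample data are hypothetical, and the leading author is taken to be the first author, as in the captions of Figs. 7 and 8.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def build_networks(
    articles: Dict[str, List[str]]
) -> Tuple[Counter, Counter]:
    """Return (undirected co-authorship, directed lead-to-other) edge counts."""
    undirected: Counter = Counter()
    directed: Counter = Counter()
    for author_ids in articles.values():
        # Undirected: every unordered pair of co-authors, stored sorted
        # so that (a, b) and (b, a) count as the same edge.
        for pair in combinations(sorted(set(author_ids)), 2):
            undirected[pair] += 1
        # Directed: from the leading (first) author to each other author.
        lead = author_ids[0]
        for other in author_ids[1:]:
            directed[(lead, other)] += 1
    return undirected, directed


sample = {
    "a1": ["vm.1", "vf.2", "fm.1001"],
    "a2": ["vm.1", "vf.2"],
}
coauthor_edges, lead_edges = build_networks(sample)
```

The edge weights count joint articles, so an unexpectedly heavy edge or an isolated node in the resulting graph is the kind of strange pattern a data collector could then investigate.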

Figure 5: Visualizing the networks: examples. An example of both incorrect (a) and correct (b) network visualization of the data extracted from the 2017 article by Phan et al.32. Here, each dot represents a researcher who has a connection with Phan Van Phuc, a researcher with SSHPA-ID vm.780. Purple is coded for male, blue is coded for female; the square shape represents foreign researchers while the round shape is for the Vietnamese.

Figure 6: Maps of Vietnamese international and domestic scientific collaborations. (a) A world map of research collaborations between Hanoi, Vietnam and other places in the world. The link represents the co-authoring collaboration between Hanoi and international scholars. (b) A map of Vietnam showing the distribution of scientific publications of Vietnamese social scientists in the NVSS database. The circle’s size represents the count of publications in each province; the bigger the circle, the more publications. The link represents the co-authoring collaboration among scholars of each province.

Figure 7: Evolution of a research group: examples of real data. The temporal evolution of a scientific group through three periods: (a) 2008–2010; (b) 2008–2014; (c) 2008–2018. Here, each dot represents a researcher who has a connection with vm.4. Purple is coded for male, blue is coded for female; the square shape represents foreign researchers while the round shape is for the Vietnamese. The size of the dot is the number of publications an author has within the designated period. The arrow shows the direction from the key author (first author) to the other authors of a paper.

Figure 8: An overview of the Vietnamese scientific collaboration network. A growing network of scientific collaborations in Vietnam’s social sciences in two periods: (a) 2008–2011; and (b) 2008–2018. Here, each dot represents a researcher. Purple is coded for male, blue for female, and orange for foreign authors. The size of the dot is the number of publications an author has within the designated period. The arrow shows the direction from the key author (first author) to the other authors of a paper.

Code availability

The code relevant to the data quality assurance, quality control and automated construction of network data of the SSHPA system can be found in Data Citation 1 (Computer Codes).