We have discussed that, once the basic definition and roles of citations are understood, citations can be created in ways that conform to those basics. Now that they are created well, the remaining quest is to collect, analyze, and use them in better ways.

Ahituv & Neumann (1982) argued that the value of information can never be absolute or universal. Rather, its value is determined by those who use it, with regard to the specific decision being made with that information. Likewise, the value of citations depends on how they are actually used and how well they serve the decision of interest. The essence of using citations better is to be scientific about it: understand what they are, and use them in a way that is appropriate to the decision being made.

Eugene Garfield, a pioneer of the citation index. Source: Science History Institute, under CC-BY-SA

Tracking comes with Database

Citation databases (or indexes), specifically in science, are databases where scientific resources (usually journal articles) and the links between them (citations) are recorded. These databases are sometimes referred to as “Indexing & Abstracting services”, since they index the metadata of scientific resources, including their abstracts. The indexing role is especially important for information retrieval (as in search engines). The types of metadata usually include:

titles; abstracts; authorship (names, affiliations, etc.); publication details (journal title, publishing year, issue, volume, page, etc.); DOIs or relevant URLs; list of cited references; etc.

and often there is other miscellaneous information that the information system needs. See for example this from Crossref.
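As a rough illustration, the metadata types listed above could be modeled as a simple record. This is a minimal sketch and not the schema of any actual database: the class names, field names, and DOIs below are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Author:
    name: str
    affiliation: str = ""


@dataclass
class Record:
    """One indexed resource (hypothetical schema, for illustration only)."""
    doi: str
    title: str
    abstract: str = ""
    authors: list = field(default_factory=list)     # list of Author
    journal: str = ""
    year: int = 0
    volume: str = ""
    issue: str = ""
    pages: str = ""
    references: list = field(default_factory=list)  # DOIs of cited works


def citation_links(records):
    """Return the citations as (citing_doi, cited_doi) pairs."""
    return [(r.doi, cited) for r in records for cited in r.references]


# Made-up records; the DOIs do not point to real works.
records = [
    Record(doi="10.0000/example.a", title="Paper A", year=2018,
           references=["10.0000/example.b", "10.0000/example.c"]),
    Record(doi="10.0000/example.b", title="Paper B", year=2016,
           references=["10.0000/example.c"]),
    Record(doi="10.0000/example.c", title="Paper C", year=2010),
]
links = citation_links(records)
print(links)
```

The point of the sketch is that the "list of cited references" field is what turns a plain bibliographic index into a citation database: the links are derived from it.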

The origins of the citation index (digital or not) are said to lie in religion and law, but let’s not touch on them as they’re not relevant here. The modern, digital origin dates back to 1960, when Eugene Garfield founded the Institute for Scientific Information (ISI). ISI produced the famous Science Citation Index and the JCR, the analytics report built on the index. When Garfield devised the citation index, it wasn’t meant to measure the impact of individual articles by counting the citations they received. Rather, it was intended to help librarians select which journals to subscribe to by providing more aggregate-level analyses.

Including the one by Garfield, some notable examples of citation databases are Web of Science Core Collection (WOSCC), SCOPUS, Google Scholar, and Crossref, though the list is of course not limited to these. While these citation databases enable analytical metrics that can be used to assess research and its related objects, many scientists, including Garfield himself, have warned of their caveats.

Never Use them Alone

One of the major criticisms is that citation metrics should not be used as a single measure. Citations are usually used as a proxy for the impact of publications. This notion of impact is often criticized for being ambiguous, or for neglecting other values such as social value, originality, or scientific significance. Above all, whatever value citation metrics are a proxy for, the diverse and complex aspects of scientific work cannot, and should not, be measured with a single, simplistic indicator.

Thus, when using citation metrics with these caveats in mind, it is desirable to combine them with other evaluative methods. For aggregate-level evaluations of journals or authors (as opposed to individual articles), the h-index emerged in this context, combining citation counts with the number of publications. More recently, the limitations of simple citation counts have been addressed with techniques such as source normalization, field normalization, or graph-based analysis like eigenvector centrality (Eigenfactor, SJR, etc.).
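To make the h-index concrete, here is a minimal sketch of its standard definition: the largest h such that h publications each have at least h citations. The function name and the citation counts are made up for illustration.

```python
def h_index(citation_counts):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank   # the top `rank` papers all have >= rank citations
        else:
            break
    return h


# Hypothetical citation counts for one author's five papers.
example_counts = [10, 8, 5, 4, 3]
print(h_index(example_counts))  # 4
```

Note how the indicator depends on both inputs: one paper with 1,000 citations still yields an h-index of 1, and many uncited papers yield 0.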

But when it comes to individual articles, more often than not these techniques cannot be applied, or are practically impossible to apply. For individual articles, alternative metrics (i.e. altmetrics) have been proposed, such as usage statistics (e.g. views, downloads) or web metrics (e.g. mentions on Twitter, links in Wikipedia, etc.). These metrics are criticized for being vulnerable to manipulation (easier than self-citation!). Therefore, when dealing with individual articles, it is best to adopt peer (group) evaluation in combination with other metrics. After all, reading the articles is the best way to evaluate them.

Understand what they are

Used in combination with other methods, citation metrics can be put to even better use: use them scientifically. In science, it is without doubt important to read the references and understand them before you cite. Using a study design, experimental protocol, methodology, or analytical method comes with exactly the same requirement. That’s how we do it scientifically. The very same holds for citation metrics. One should clearly understand what the citation metric is, which underlying database is used, and how they relate to, or fit, the decision being made.

In many cases, the impact metric used for journals, the Impact Factor, is cited without the specific details of the metric (or possibly that’s why they named it so instead of calling it “journal average citation”). For example, when we speak of JIFs, say in 2019, this generally implies that we are talking about JIFs reported for 2018 (i.e. the census period). Another omitted detail is the target window, the period in which the publications used for calculating the JIF were published, which is the two preceding years (2017 and 2016) in this case. Regarding the temporal aspect of citation counts, it is important to note that they are dynamic. They accumulate over time, and some publications may even attract citations after substantial time has passed since their publication.
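The census period and target window can be spelled out in a few lines. The following is a simplified sketch of a JIF-like ratio, not the actual procedure used for the JCR (which involves further rules, for example about which items count as citable). The function name and all numbers are hypothetical.

```python
def two_year_impact(citations, citable_items, census_year):
    """JIF-like ratio for one journal.

    citations: list of (cited_pub_year, citing_year) pairs, one per citation
               received by the journal.
    citable_items: dict mapping publication year -> number of citable items.
    """
    target_years = (census_year - 1, census_year - 2)   # the target window
    cites = sum(
        1
        for cited_pub_year, citing_year in citations
        if citing_year == census_year and cited_pub_year in target_years
    )
    items = sum(citable_items.get(year, 0) for year in target_years)
    return cites / items if items else None


# Made-up data for a single journal.
citable_items = {2016: 40, 2017: 60}
citations = (
    [(2017, 2018)] * 90    # counted: made in 2018 to 2017 items
    + [(2016, 2018)] * 60  # counted: made in 2018 to 2016 items
    + [(2015, 2018)] * 30  # ignored: cited item is outside the target window
    + [(2017, 2017)] * 10  # ignored: citation made outside the census period
)
print(two_year_impact(citations, citable_items, census_year=2018))  # 1.5
```

In this toy data, 40 of the 190 citations are simply invisible to the metric, which is exactly the kind of detail that gets lost when only the final number is quoted.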

The most significant information often omitted is the underlying citation database used for the calculation. Of course, we have the names of the databases: for JIF we know it’s WOSCC, and for CiteScore it’s SCOPUS. What is missing is not their names but how they are constructed, why specific journals are included in them, and so forth. Understanding how the underlying database is constructed is critical to using citation metrics scientifically, but this in turn requires those databases to be clear and transparent about such information.

Metrics that Fit to your Decision

All the metrics mentioned so far are “preset”. That is, they are widely known by name, and the way they are calculated is quite well documented. They are also readily calculated by analytical services before you even ask. And that’s a good thing. Still, that never means they need to be used in every decision by everyone. Unless a preset metric is the best fit for the decision being made, one should think about how to use citation metrics in a more customized way. For example, if the decision of interest isn’t particularly concerned with the most recent two years and is instead about long-term historical records, one should not use a preset metric based on the most recent two years.
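One simple way to check whether a short window fits a decision is to look at when a publication actually receives its citations. The sketch below is only an illustration under made-up data: the function names are hypothetical, and the citing years describe an imaginary paper published in 2000.

```python
from collections import Counter


def citations_by_age(pub_year, citing_years):
    """Count citations by the paper's age (citing year minus publication year)."""
    return Counter(year - pub_year for year in citing_years if year >= pub_year)


def share_within(pub_year, citing_years, max_age):
    """Fraction of all citations received within `max_age` years of publication."""
    ages = citations_by_age(pub_year, citing_years)
    total = sum(ages.values())
    if total == 0:
        return None
    early = sum(count for age, count in ages.items() if age <= max_age)
    return early / total


# Imaginary paper published in 2000 that is mostly cited a decade later.
citing_years = [2001, 2002, 2010, 2011, 2012, 2012, 2015, 2018]
print(share_within(2000, citing_years, max_age=2))   # 0.25
print(share_within(2000, citing_years, max_age=15))  # 0.875
```

For a paper like this one, a two-year window captures only a quarter of the citations, so a decision about long-term influence would call for a longer, custom window.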

Beyond staying away from presets, citations can be tracked even better by knowing where the edges (the citation links) are coming from. Given a specific publication of interest, looking at the details of the citing works (i.e. those which cite this publication) allows for more diverse analysis. A publication often cited by textbooks may imply that the work is widely known and lays the foundations of the field. Useful details of citing works may include their types, the lengths of their reference lists, and how influential they are. These details enable more contextual analysis with citations.
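As a small sketch of such contextual analysis, the citing works of one publication can be grouped by type and weighted by the length of their reference lists. Weighting each citation by 1 / (reference list length) is just one possible choice, assumed here for illustration; the class, the field names, and the data are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class CitingWork:
    title: str
    work_type: str        # e.g. "book", "journal-article", "review"
    reference_count: int  # length of the citing work's reference list


def profile_citations(citing_works):
    """Return (citations per work type, fractionally weighted citation count)."""
    by_type = Counter(work.work_type for work in citing_works)
    # A citation from a work with a long reference list gets a smaller weight.
    fractional = sum(
        1 / work.reference_count
        for work in citing_works
        if work.reference_count > 0
    )
    return by_type, fractional


# Made-up citing works for a single publication of interest.
works = [
    CitingWork("Intro textbook", "book", 200),
    CitingWork("Research article", "journal-article", 40),
    CitingWork("Field review", "review", 100),
]
by_type, fractional = profile_citations(works)
print(dict(by_type))
print(round(fractional, 3))
```

A raw count would report three citations here; the breakdown additionally shows that one of them comes from a textbook, and that all three come from works with fairly long reference lists.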

In short, to use citation metrics better, i) they shouldn’t be used on their own, ii) they should be understood meticulously, and iii) customized analysis should be considered before using preset metrics, to better fit the decision. For these things to be possible, however, citation databases play a significant role, and thus tremendous changes are required of them. What should citation databases be like to enable better use of citations?

Scientific Citation Databases

Just as the use of citation metrics should be scientific, citation databases also need to be scientific, or to enable such scientific use. For citation databases, being scientific means being clearer and more transparent about their evidence. And the two most important pieces of evidence for a citation database are its raw (citation) data and its inclusion process.

Supporting the use of raw data from citation databases is a necessary requirement for the diverse and contextual analysis discussed above. Indeed, some citation databases do support it, but usually only for partial datasets (for example, WOSCC gives raw data for a specific field). Due to this limitation, much of the literature on citation analysis has been confined to individual fields, failing to capture implications on more comprehensive, macro scales.

Provision of raw data shouldn’t be limited to data on publications and the links between them. For more contextual analysis of citations, metadata on the publications should also be provided. As already discussed, knowing where the citations are coming from opens up more potential for diverse and contextual analysis. And as we discussed in the previous post, providing raw data is how they become “separable, structured, and open” citations.

Citation databases can also be scientific by being clear and transparent about how they include certain sources. Major citation databases, as they are used in decision making by many institutes and universities worldwide, practically set the norms around “where academics should publish their outputs to be regarded as (best) scientific works”. Despite this enormous influence on academia as a whole, there is not much one can find about how these databases are constructed.

For both WOSCC and SCOPUS, the most I could find was some requirements and evaluation criteria for inclusion, and (for SCOPUS) the names of a very small number of representatives involved in the evaluation process. WOSCC does not even have those names, and a page describing the database says that it includes journals that “researchers themselves have judged to be the most important and useful”. Who are these researchers? How did they judge?

For citation databases to be operated more scientifically, they need to be more transparent about how certain sources are included (and, of course, excluded*), who was involved in the process, and how exactly the decisions were made. In short, citation databases should be clear and transparent about what they index, why, and how.

(*suppression from JCR has been reported for self-citation and citation stacking)

Detaching Databases from Services

Another thing that can be done with citation databases is to detach them from other services such as publishing, analytics, or retrieval services (like search engines). Among these, a publishing service in particular may pose a significant conflict of interest if it is provided by the same organization that provides the citation database. Simply put, you publish journals and then you provide the database and metrics by which those journals are assessed (power overwhelming!). Moreover, when these services are provided by a single organization, that organization may acquire serious control over the overall research workflow. This kind of concern arose with the attempt to acquire Thomson Reuters IP&S by BC Partners, the major shareholder of Springer Nature (the business was instead acquired by Baring and ONEX to form what is now known as Clarivate Analytics).

Furthermore, a citation database can also be detached from something that seems identical to it but actually is not. Just as Clarivate describes its WOSCC as a “database of journals”, a citation database can be detached from a “curated list of journals”.* If we have a citation database with relatively broad coverage, built the way Google Scholar crawls the whole web to construct its DB, a separate, detached “curated list of journals” can be applied on top of it to constitute, for example, WOSCC or SCOPUS.** To give an analogy, this is just like how stock exchanges list all tradable assets, and a curated list of assets can be applied to them to form the “indices”.

(*maybe we could call this a segment of the database)

(**assuming the broader-coverage means “exhaustively” broader)
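The idea of applying a detached, curated list on top of a broad-coverage database can be sketched as a simple filter over citation links. This is only an illustration of the concept, assuming that a citation counts when the citing work appears in a curated journal. The identifiers, journal names, and curated lists are all made up, and this is not how any actual index is built.

```python
def citation_counts(edges, journal_of, curated_journals=None):
    """Count citations per cited work.

    edges: list of (citing_work, cited_work) pairs from the broad database.
    journal_of: dict mapping each work to the journal it was published in.
    curated_journals: set of journals to accept citations from;
                      None means no curation (use the whole database).
    """
    counts = {}
    for citing, cited in edges:
        if curated_journals is not None and journal_of.get(citing) not in curated_journals:
            continue  # the citing work falls outside the curated list
        counts[cited] = counts.get(cited, 0) + 1
    return counts


# A tiny, made-up broad-coverage database.
journal_of = {"w1": "J-A", "w2": "J-B", "w3": "J-C", "w4": "J-A"}
edges = [("w1", "w4"), ("w2", "w4"), ("w3", "w4"), ("w3", "w1")]

print(citation_counts(edges, journal_of))                  # {'w4': 3, 'w1': 1}
print(citation_counts(edges, journal_of, {"J-A", "J-B"}))  # {'w4': 2}
```

The same underlying data yields different counts depending on which list is applied, and because the list is separate from the data, anyone can inspect, swap, or compare lists.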

Coordination Required

After all, citation databases are collections of information from multiple sources. Thus, things cannot be done without coordination between the stakeholders (this will be even more true once the databases are completely detached). Specifically, there are already some issues that citation databases have been suffering from (e.g. name ambiguity, inconsistent identifier systems, or version control). Many players in the academic scene, including journals, publishers, databases, service providers, and funding agencies, therefore need to collaborate to solve these problems, especially the ones that necessarily require coordination.

To briefly sum up, citation databases can be made better by providing more evidence about themselves: specifically, their raw data and transparent records of their operation. They are even better when separable, structured, and open, and detached from other services so that no conflict of interest arises. Other players in the academic world, such as journals, publishers, and funding agencies, can coordinate to make citation databases even more powerful.

Even once citation databases are improved in these ways, those who use them should still be cautious with citation metrics. To make the use of citation analysis more scientific, one should understand what those metrics are and which citation database is used to calculate them. Using customized analysis instead of preset metrics is another great way to utilize citations. And yet, they shouldn’t be used alone, since even their original inventor has warned that they cannot capture the complex and diverse aspects of scientific value.