As knowledge builds on knowledge, it is vital to preserve the connection between units of information in scholarly communication. There are a number of challenges in this respect. In this section I focus on the persistence of identifiers for units of information, as well as on content integrity and the preservation of context. In the most general sense, trust and accessibility are integral to the preservation of content, its integrity, and the context in which it is used, so that knowledge can be scrutinised with the help of a transparent trail supporting the reproducibility and replicability of research results. I look at existing research and approaches in this space.

Domain name registration and maintenance are among the key factors for long-term reliable persistence, as per URI ownership. Berners-Lee outlines two issues for the persistence of HTTP URIs:

The persistence of the opaque string which follows the domain name, and the persistence of the domain name itself. Persistent Domains, Tim Berners-Lee, 2000

In Cool URIs don't change, Berners-Lee, 1998, discusses some of the approaches that can be taken towards the usefulness and longevity of URIs. The article focuses on practices by which a publisher makes a commitment to persistence by designing and managing the URI path and the content it resolves to, as well as the domain name it uses. The owner of a domain name has the obligation to define what its URIs mean. This is also a form of social contract made by the authority that names and defines a URI to anyone that uses it – see also Philosophical Engineering and Ownership of URIs.

Persistence policies can come in different forms. For example, W3C’s URI Persistence Policy is a document making a pledge about how some of the resources under its domain will persist throughout the lifetime of the Consortium; any changes to persistent resources will be archived; and in case the organisation is disbanded, its resources can be made available under the same rights and license. These human-readable statements are a useful institutional commitment to persistence. The ODRL vocabulary can be used in a similar way to provide a machine-readable policy about resources.

From the archiving perspective, Van de Sompel came to the conclusion that on a long enough timeline, HTTP URIs are not inherently persistent but persistable. The units of information that are registered using URIs are more of a promise made by their original or current authority. Hence, along with the examples from earlier, URI registration is ultimately a social agreement. URI owners declare a policy, e.g., implicit, written, or verbal. If a policy is announced for a collection of URIs, e.g., what happens in 1000 years, then that says something about the owner’s intentions and the expected level of availability. From this perspective, as discussed earlier in the registration of identifiers with social contracts, PIDs such as DOI, PURL, w3id, and ORCID can help to prolong such promises and to extend the lifetime of accessibility of units of scholarly information.

Decentralized Identifiers (DIDs) are identifiers for verifiable, self-sovereign digital identity: they are under the control of the subject, independent of any centralized registry, identity provider, or certificate authority. In a way, DIDs work around the shortcomings of the domain name system, in that they can be created and managed without the authority of a registrar.
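As a rough illustration of why no central registry is needed to interpret such an identifier, a DID is a URI of the general form did:&lt;method&gt;:&lt;method-specific-id&gt;, which any party can parse locally; the sketch below uses a hypothetical did:example identifier:

```python
def parse_did(did: str) -> dict:
    """Split a DID into its method and method-specific identifier.

    A DID has the general form did:<method>:<method-specific-id>;
    no centralized registry is consulted to interpret it.
    """
    scheme, method, method_specific_id = did.split(":", 2)
    if scheme != "did":
        raise ValueError("not a DID: " + did)
    return {"method": method, "id": method_specific_id}

# A hypothetical identifier using the illustrative "example" method:
print(parse_did("did:example:123456789abcdefghi"))
# {'method': 'example', 'id': '123456789abcdefghi'}
```

The method part names the scheme by which the identifier is created and resolved, which is what replaces the registrar's role in the domain name system.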

In Analyzing the Persistence of Referenced Web Resources with Memento, Sanderson, 2011, presents the results of a study on the persistence and availability of Web resources cited from research articles in two scholarly repositories. The results show that within a few years of being cited, 45% of the URLs referenced from arXiv still exist but are not preserved, and 28% of the resources referenced by articles in the UNT digital library have been lost. To address URIs ceasing to exist (commonly known as link rot), the authors suggest that repositories expose the links in articles through an API so that Web crawlers can be used to archive them. With the help of archives supporting the Memento protocol, the original context of the citation can then be reconstructed.
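In the Memento protocol, a client sends an Accept-Datetime header to a TimeGate, which answers with a Link header pointing at the archived snapshot (the memento). A simplified sketch of extracting that snapshot URI from such a header; the URIs and datetime below are hypothetical, and the parsing is deliberately naive:

```python
import re

# Accept-Datetime is the request header a Memento client sends to a
# TimeGate to ask for the snapshot closest to a given moment.
request_headers = {"Accept-Datetime": "Thu, 01 Jan 2015 00:00:00 GMT"}

def find_memento(link_header: str) -> str:
    """Return the target URI of the first link whose rel includes
    "memento" in an HTTP Link header (simplified parsing)."""
    for uri, params in re.findall(r'<([^>]+)>([^,]*)', link_header):
        rel = re.search(r'rel="([^"]*)"', params)
        if rel and "memento" in rel.group(1).split():
            return uri
    return ""

# A shortened, hypothetical TimeGate response header:
link = ('<http://example.org/page>; rel="original", '
        '<https://archive.example/20150101/http://example.org/page>; '
        'rel="memento"; datetime="Thu, 01 Jan 2015 00:00:00 GMT"')
print(find_memento(link))
# https://archive.example/20150101/http://example.org/page
```

Given such an archived URI and its datetime, the context in which a resource was originally cited can be revisited even after the live resource has changed or disappeared.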

The dynamic and ephemeral nature of the Web, and in particular the management of URIs and their corresponding representations at URLs, poses a threat to the integrity of Web-based scholarly content and the consistency of scholarly records, as well as to Web content at large. One area of particular concern is the formal citation of scholarly resources, e.g., DOI, HTTP-DOI-URI, and the informal referencing of other resources on the Web, i.e., any HTTP URI. The Hiberlink project investigates reference rot in Web-based scholarly communication, and introduces the term to denote two problems in using URI references:

Link rot: The resource identified by a URI may cease to exist and hence a URI reference to that resource will no longer provide access to referenced content.

Content drift: The resource identified by a URI may change over time and hence, the content at the end of the URI may evolve, even to such an extent that it ceases to be representative of the content that was originally referenced.

Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot, Martin Klein, 2014

In Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot, Klein, 2014, acknowledge the extensively studied phenomena of link rot and content drift, and investigate the extent to which reference rot impacts the ability to revisit the web context that surrounds STM articles some time after their publication. The results show that a significant number of HTTP URIs cited in STM articles are no longer responsive or lack adequate archived snapshots. The authors state that it is impossible to adequately recreate the temporal context of the scholarly discourse, and hence suggest that robust solutions are needed to combat the problem of reference rot. Authors can take practical steps to remedy some of these issues, e.g., using archives that support on-demand snapshots, and embedding the archived URI with datetime information alongside the reference to the original resource – I discuss this further in Robust Links.
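Embedding an archived URI with datetime information alongside the original reference can be done with the Robust Links convention of data-versionurl and data-versiondate attributes on an HTML anchor. A minimal sketch, with hypothetical URIs and dates:

```python
def robust_link(href: str, version_url: str, version_date: str, text: str) -> str:
    """Build an HTML anchor following the Robust Links convention:
    the original URI in href, an archived snapshot in data-versionurl,
    and the datetime of linking in data-versiondate."""
    return ('<a href="{}" data-versionurl="{}" data-versiondate="{}">{}</a>'
            .format(href, version_url, version_date, text))

anchor = robust_link(
    "http://example.org/page",
    "https://archive.example/20150101/http://example.org/page",
    "2015-01-01",
    "example page")
print(anchor)
```

If the original resource rots or drifts, a consumer of the article can fall back on the snapshot closest to the stated linking date.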

In Scholarly Context Adrift: Three out of Four URI References Lead to Changed Content, Jones, 2016, reuse the dataset from the Klein study to investigate to what extent the textual content has remained stable since the publication of the referencing article, using various well-established similarity measures to compare the representative Memento with the live resource. They find that for over 75% of references the content has drifted away from what it was when referenced. The authors support the idea that in order to partly work around this issue, authors should proactively create snapshots of the referenced resources at Web archives, and reference them in their scholarly literature. However, the authors also state that robustly embedding this in the infrastructure of the existing authoring, reviewing, and publishing workflow is still an open challenge. To that end, applying the Robust Links approach can help. While the DOI paradigm for scholarly units helps to improve the link rot scenario when the custodians of the domains or the URLs of the scholarly resources relocate, the resources on the Web at large remain a problem, given that their incentives towards longevity and access differ.

From the point of persistence of the domain name (losing ownership) as Berners-Lee describes, one kind of content drift would be if the content published at https://csarven.ca/ today were different tomorrow because another authority came to own csarven.ca and to define what goes there. Alternatively, it may be that the content at that location is dynamic, and could differ from one request to another. In both cases, content drift creates uncertainty as to whether the originally referenced resource is still the same. From the perspective of Web-based scholarly publications, changes to content – accidental or intentional – can impact the degree of reproducibility, replicability, and comparability of research results, and the maintenance of a reliable scholarly record.

In Persistent Identifiers for Scholarly Assets and the Web: The Need for an Unambiguous Mapping, Van de Sompel, 2014, posit that while PIDs are assigned to resources outside of information access protocols (like HTTP), there is a need to unambiguously bridge to the discovery of the Web-oriented resource, e.g., from PID to HTTP URI, in a way that is machine-actionable. For example, the PID paradigm has the following discovery path:

PID is the resource identifier, e.g., 10.2218/ijdc.v9i1.320

HTTP-URI-PID is the resolving URI, e.g., https://doi.org/10.2218/ijdc.v9i1.320

HTTP-URI-LAND is the redirect URI (landing page), e.g., http://www.ijdc.net/article/view/9.1.331

HTTP-URI-LOC is the location URI of the content, e.g., http://www.ijdc.net/article/download/9.1.331/362/

The authors propose that, using existing standards and practice, the essential ingredients for such a mapping are as follows. A PID has a Web-equivalent HTTP-URI-PID, which is a requisite – minted by the naming authority. The HTTP-URI-PID can be content-negotiated to result in a) a human-readable representation at HTTP-URI-LAND or b) machine-readable representations with distinct HTTP-URI-MACH. The HTTP-URI-LAND remains the same for discovery; HTTP-URI-MACH, however, uses an RDF-based approach to describe the aggregations of scholarly assets based on the OAI-ORE specification.
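The mapping steps above can be sketched as follows; the DOI reuses the example from the discovery path, while the landing and machine URIs passed to the negotiation function are placeholders:

```python
DOI_RESOLVER = "https://doi.org/"

def http_uri_pid(pid: str) -> str:
    """Map a PID (here a DOI, minted by a naming authority) to its
    Web-equivalent HTTP-URI-PID by prefixing the resolver."""
    return DOI_RESOLVER + pid

def negotiate(accept: str, land_uri: str, mach_uri: str) -> str:
    """Content negotiation at the HTTP-URI-PID: machine-readable
    (RDF) requests are directed to HTTP-URI-MACH, everything else
    to the human-readable landing page HTTP-URI-LAND."""
    rdf_types = ("text/turtle", "application/ld+json", "application/rdf+xml")
    if any(t in accept for t in rdf_types):
        return mach_uri
    return land_uri

print(http_uri_pid("10.2218/ijdc.v9i1.320"))
# https://doi.org/10.2218/ijdc.v9i1.320
print(negotiate("text/turtle", "https://example.org/landing",
                "https://example.org/machine"))
# https://example.org/machine
```

The point of the proposal is that this entire path is machine-actionable: software can go from the PID to the appropriate Web representation without human interpretation.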

It is worth briefly revisiting the notion of social agreements around Web resources. Conceptually, the agreement that I make with you about the persistence of my website’s resources is in essence the same as that of a naming authority controlling a PID, as well as all of the nodes in between down to the HTTP-URI-LOC. The kind of mapping between a domain name and the IP address it points to is similar to a PID being mapped to an HTTP-URI-PID. Hence, URI ownership essentially involves two possibilities: either I “own” and control a URI space or someone else does.

In Persistent URIs Must Be Used To Be Persistent, Van de Sompel, 2016, reveal the results of a study showing that authors do not use persistent URIs like DOIs even when available, and instead use the location URIs. In order to alleviate this issue, the authors propose that an HTTP Link header be used at the location URI to announce the identifying HTTP-URI-PID. The current proposal is to use cite-as: A Link Relation to Convey a Preferred URI for Referencing, Van de Sompel, 2019, and a number of related patterns are outlined at Signposting the Scholarly Web.
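A client at a location URI can discover the preferred reference URI by reading the rel="cite-as" link. A simplified sketch of that lookup; the header value reuses the DOI example from the discovery path, and the parsing is deliberately naive:

```python
import re

def preferred_uri(link_header: str) -> str:
    """Return the target of the link with rel="cite-as" from an HTTP
    Link header, i.e., the URI the server prefers to be referenced by
    (simplified parsing)."""
    match = re.search(r'<([^>]+)>\s*;[^,]*rel="cite-as"', link_header)
    return match.group(1) if match else ""

# A hypothetical Link header served at a location URI (HTTP-URI-LOC):
header = '<https://doi.org/10.2218/ijdc.v9i1.320>; rel="cite-as"'
print(preferred_uri(header))
# https://doi.org/10.2218/ijdc.v9i1.320
```

With such a header in place, referencing tools can substitute the persistent HTTP-URI-PID for the location URI the author happened to visit.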

Trusty URI is a technique to include cryptographic hash values in URIs to uniquely associate them with an artifact. In Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linked Data, Kuhn, 2014, outline how specific resources and their entire reference trees can be made verifiable. If the trusty URI of an artifact is known, it can be used to verify that the content of the artifact corresponds to what it is supposed to represent. This is useful to determine whether the content has been corrupted or manipulated. It follows that trusty URI artifacts are immutable, as each version of the content generates a unique trusty URI. Trusty URI artifacts are considered permanent in that, once archived or cached, an artifact can still be verified even if the original location is no longer available. Applications can use trusty URIs to encode their own references, as well as compute an artifact’s trusty URI and verify it before use. Trusty URIs can represent byte-level file content as well as RDF graphs, and are compatible with named information URIs – Naming Things with Hashes (RFC 6920). An example from the Trusty URI Specification – Version 1: given resource http://example.org/r1, its trusty URI would be http://example.org/r1.RAcbjcRIQozo2wBMq4WcCYkFAjRz0AX-Ux3PquZZrC68s, where RA identifies the module as an RDF graph (independent of its serialization) of the resource, and the remaining characters signify the computed hash of the content.
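The hash computation is easiest to sketch for byte-level content (the FA module); the RA module for RDF graphs additionally requires graph normalisation before hashing, which is omitted here. A simplified illustration, not a spec-complete implementation, with a hypothetical resource URI and content:

```python
import base64
import hashlib

def trusty_suffix(content: bytes, module: str = "FA") -> str:
    """Sketch of a trusty URI suffix for byte content: a two-character
    module identifier followed by the URL-safe base64 encoding of the
    SHA-256 hash of the content, with padding stripped."""
    digest = hashlib.sha256(content).digest()
    return module + base64.urlsafe_b64encode(digest).decode().rstrip("=")

suffix = trusty_suffix(b"example artifact content")
print("http://example.org/r1." + suffix)

# Verification: recompute the hash of the retrieved content and
# compare it with the suffix carried in the URI.
assert trusty_suffix(b"example artifact content") == suffix
```

Because any change to the content changes the suffix, the URI itself is enough to detect corruption or manipulation, regardless of where a copy of the artifact was retrieved from.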

The Anatomy of a Nanopublication, Groth, 2010, proposes to improve the efficiency of finding, connecting, and curating core scientific statements with associated context, using an annotation model and a format based on the RDF language realized with Named Graphs. The Nanopublication Guidelines, 2015, specify how to denote unique RDF graphs for assertions, provenance, and publication information, which make up the body of a nanopublication that can be used as a single publishable and citable entity. Nanopublications can be used independently to expose and disseminate individual quantitative and qualitative structured scientific data, without being accompanied by narrative research articles. For example, hypotheses, claims, and negative results can exist on their own, be identifiable, and be reused in different places, as well as embedded in articles.
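The named-graph structure can be illustrated with a hypothetical nanopublication in TriG: a head graph ties the assertion, provenance, and publication info graphs together into one citable unit. All names and URIs below are invented, and the graph listing uses naive string matching for illustration only:

```python
import re

NANOPUB_TRIG = """
@prefix np: <http://www.nanopub.org/nschema#> .
@prefix ex: <http://example.org/np1/> .

ex:head {
    ex:np1 a np:Nanopublication ;
        np:hasAssertion ex:assertion ;
        np:hasProvenance ex:provenance ;
        np:hasPublicationInfo ex:pubinfo .
}
ex:assertion { ex:drug ex:treats ex:disease . }
ex:provenance { ex:assertion ex:derivedFrom ex:experiment1 . }
ex:pubinfo { ex:np1 ex:createdBy ex:researcher1 . }
"""

# List the named graphs making up the nanopublication:
graphs = re.findall(r'(ex:\w+)\s*\{', NANOPUB_TRIG)
print(graphs)
# ['ex:head', 'ex:assertion', 'ex:provenance', 'ex:pubinfo']
```

The assertion graph carries the scientific statement itself, while the provenance and publication info graphs supply the context that lets the statement stand on its own outside a narrative article.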

In Decentralized provenance-aware publishing with nanopublications, Kuhn, 2016, argue that, because publication and archival of scientific results are still based on print-centric workflows and commonly considered a responsibility of third-party publishers, there are currently no efficient, reliable, and agreed-upon Web-centric methods for publishing scientific datasets, and therefore a bottom-up process is necessary. To this end, the authors present a decentralised server network with a REST API to store, archive, find, and serve data in the form of nanopublications, where the identifiers for the units of information are based on the trusty URI method. The authors argue that the underlying architecture can serve as a reliable and trustworthy low-level layer for semantic publishing, archiving, and data sharing that can be used by different knowledge domains.

Signing HTTP Messages, 2019, describes a way for servers and clients to add authentication and message integrity to HTTP messages by using a digital signature. The signature is generated over the HTTP message body and selected HTTP headers at the time the HTTP message is made. As the signature is carried in the HTTP headers, the HTTP message itself is not altered, and thereby no normalisation algorithm needs to be applied to the message structure.
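The mechanics can be sketched as follows: the covered header names and values are concatenated into a signing string, signed, and the result is carried in a Signature header. This sketch uses HMAC-SHA256 with a shared key for brevity; the draft also supports asymmetric algorithms, and all key names and header values below are hypothetical:

```python
import base64
import hashlib
import hmac

def signing_string(method: str, path: str, headers: dict, covered: list) -> str:
    """Concatenate the covered components into the string to be
    signed, one `name: value` pair per line (simplified form of the
    draft's construction)."""
    lines = []
    for name in covered:
        if name == "(request-target)":
            lines.append("(request-target): {} {}".format(method.lower(), path))
        else:
            lines.append("{}: {}".format(name.lower(), headers[name]))
    return "\n".join(lines)

key = b"shared-secret"
headers = {"Host": "example.org", "Date": "Tue, 07 Jun 2014 20:51:35 GMT"}
to_sign = signing_string("GET", "/resource", headers,
                         ["(request-target)", "Host", "Date"])
signature = base64.b64encode(
    hmac.new(key, to_sign.encode(), hashlib.sha256).digest()).decode()

# The signature travels in a header; the message itself is unchanged.
headers["Signature"] = ('keyId="key-1",algorithm="hmac-sha256",'
                        'headers="(request-target) host date",'
                        'signature="' + signature + '"')
print(headers["Signature"])
```

The recipient rebuilds the same signing string from the received message and verifies the signature, which detects any tampering with the covered headers in transit.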

Linked Data Proofs is a mechanism to ensure the authenticity and integrity of Linked Data documents through the use of public/private key cryptography. The digital signature comprises information about i) the signature suite that was used to create the signature, ii) the parameters required to verify it, and iii) the signature value generated by the signature algorithm. The signature typically accompanies the Linked Data document so that the receiver can verify its authenticity using the available information. In contrast to the Signing HTTP Messages method, Linked Data Proofs requires a normalisation algorithm.
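The normalise-then-sign pattern can be sketched as follows. As stand-ins for the real machinery, sorted-key JSON serialisation replaces an RDF dataset normalisation algorithm such as URDNA2015, HMAC replaces public/private key signing, and the suite name and verification method are invented:

```python
import hashlib
import hmac
import json

def canonicalize(doc: dict) -> bytes:
    """Stand-in for RDF dataset normalisation (e.g. URDNA2015):
    serialise with sorted keys so equivalent documents yield
    identical bytes before signing."""
    return json.dumps(doc, sort_keys=True, separators=(",", ":")).encode()

def sign(doc: dict, key: bytes) -> dict:
    """Attach a proof carrying the suite name, the verification
    parameters, and the signature value over the normalised document."""
    sig = hmac.new(key, canonicalize(doc), hashlib.sha256).hexdigest()
    return {**doc, "proof": {"type": "HmacSha256Sketch",
                             "verificationMethod": "did:example:123#key-1",
                             "signatureValue": sig}}

def verify(signed: dict, key: bytes) -> bool:
    """Re-normalise the document without its proof and compare
    signature values."""
    doc = {k: v for k, v in signed.items() if k != "proof"}
    expected = hmac.new(key, canonicalize(doc), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signed["proof"]["signatureValue"])

signed = sign({"title": "An article"}, b"secret")
print(verify(signed, b"secret"))
# True
```

The normalisation step is what distinguishes this from HTTP message signing: because the same Linked Data document can be serialised in many ways, signer and verifier must first agree on one canonical form of the data itself.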