In order for open access (OA) to be sustainable as the standard for academic publishing, the associated costs require monitoring. Relevant science policy strategy papers on open access and open science therefore address cost transparency and monitoring as essential success factors of a targeted OA transformation, for example the ‘Amsterdam Call on Open Science’1 or the ‘Open Access Strategy for Germany of the Federal Ministry of Education and Research’.2 In particular, the challenges lie in creating standardized, inter-institutional data collection and reporting routines, as well as in continuous and, as far as possible, automated quality control of the data.

The externalization of the costs of the academic publishing system to libraries, disproportionate price increases by publishers, and complex purchasing models and confidentiality agreements have led to market inefficiency and a dysfunctional subscription system for academic journals.3 Transparent presentation of the costs of OA publication fees, or article processing charges (APCs), is therefore an important contribution to reintroducing price-limiting market mechanisms into the academic publication system. In turn, this benefits libraries, funders and authors.

The documentation of APC expenditure was significantly boosted in 2013 by the publication of corresponding data from the Austrian Science Fund (FWF)4 and, in the UK, by the Wellcome Trust5 and Jisc Collections6 on the data repository figshare in 2014. Also in 2014, APC data and analyses were published on the Dataverse research platform,7 and Bielefeld University Library began to publish APC data on GitHub, laying the foundation of the OpenAPC project.8

These data collections have been referenced by authoritative OA transformation studies.9,10,11 The various developments in OA monitoring were discussed in the context of Knowledge Exchange workshops in 2015 and 2016 and summarized in a report in early 2017.12 Since 2015, OpenAPC has been funded by the German Research Foundation (DFG) within the project ‘Transparent infrastructure for open access publication fees’ (INTACT)13 and supported by the German DINI working group ‘Electronic Publishing’.

This context assured the willingness of the first German academic institutions to deliver data to OpenAPC and soon afterwards motivated international funders and academic institutions to contribute as well. In addition, the INTACT framework enabled synergy between the three partnering projects, with the result that:

An early paper on OpenAPC compared self-reported spending on OA journal articles by German universities and research organizations to other initiatives.16 In 2018, the formerly separate data sources from the FWF, Jisc and the Wellcome Trust were unified within the OpenAPC data set.

Methods, data and tools

OpenAPC: general approach

OpenAPC follows an open science approach, which by our own standards means that everything should be visible to anyone at any time. This includes the data as well as the scripts for enrichment steps, normalization and quality checks. Using the version control system git, all files relevant to the project are kept and edited in a repository on the platform GitHub, meaning that not only their current status but also their complete version history is always available to the public. In addition, everything is automatically synchronized to a local GitLab installation at Bielefeld University Library.

The data: structure and origins

All data accumulated by OpenAPC are willingly provided by external participants based on the principle of open data. These participants are usually called institutions in the project’s parlance, although their actual nature may vary: there are data reports by individual universities or institutes, scientific organizations and research funders. Additionally, cost data may be reported at a higher level by aggregating services (like Jisc in the UK). The distribution of OpenAPC institutions across different countries is described below.

Germany

OpenAPC started as a German national project, aiming to collect cost data from participants in the Open Access Publishing Programme set up by the DFG.17 In consequence, APC data in Germany are usually reported individually by universities (40 in total), with the vast majority of them reporting only articles funded by that programme. German research organizations are another source of data, but again there are differences in workflows: the Max Planck Society employs central billing, where the Max Planck Digital Library (MPDL) is responsible for accounting and for data reports to OpenAPC. In the case of the Helmholtz Association and the Leibniz Association, research centres operate autonomously in terms of APC payments, so they decide independently whether to participate in the initiative. Altogether, 51 institutions from Germany take part in OpenAPC.

Austria

In May 2016 Austria became the first country outside Germany to provide data to OpenAPC, thus extending the project’s scope to an international level. Most data are reported by the Austrian Science Fund (FWF), with two participating universities completing the picture.

UK

Comparable to Germany, many higher education institutions in the UK are data contributors.
However, those institutions are not in direct contact with OpenAPC but report their data only once to Jisc, which acts as a national aggregator18 and compiles yearly collections of cost data. The Wellcome Trust represents another important source of APC data for the UK, also publishing annual reports of all their funded articles.19 It is noteworthy that there is a significant overlap between the Jisc and Wellcome data, since many institutions also report their Trust-funded articles to Jisc. This requires a deduplication step in the OpenAPC workflows, where the Jisc data are given precedence for being more detailed on the participating institutions. In total, 51 institutions from the UK participate.

Sweden

Again, many individual higher education institutions participate, and their data reach OpenAPC in aggregated form. Notably, the aggregation service is itself an OpenAPC project: in May 2016 the Swedish National Library (Kungbib) launched its own survey of cost data (OpenAPC Sweden),20 as at that time no comparable collection existed at the national level. The project was built in close co-operation with OpenAPC, with intensive reuse of tools and infrastructure. Data from 13 Swedish institutions are currently being incorporated into OpenAPC.

Norway

In January 2018 the National Centre for Systems and Services for Research and Studies (CERES) provided the first APC data for 15 universities and research institutions in Norway, covering the years 2015 and 2016, in aggregated form.

Switzerland

The Swiss National Science Foundation (SNSF) operates a fund to support the OA transition of all publications emanating from SNSF-funded research until 2020. The corresponding APC data were made available to OpenAPC in February 2018.
Italy, Spain, Canada and the USA

There are examples of isolated participation by universities – two institutions from the US and one each from the other three countries – which often play a pioneering role within their countries with regard to open data and open access.

Altogether, as of May 2018, OpenAPC has compiled a database of 50,863 articles, with total reported costs amounting to more than 96 million euros. Figure 1 shows the evolution over time.

Cost data

As the project name implies, OpenAPC is intended to collect and publish data on costs incurred by institutions for publishing articles in OA journals (both hybrid and fully gold OA). It is therefore very important to define what ‘costs’ means in the scope of OpenAPC, and this is less trivial than it might seem. The first insight is that costs are not equivalent to prices. Many publishers and journals explicitly state the APCs to be paid on their web pages (so-called list prices); this information is also collected, for example, in the Directory of Open Access Journals (DOAJ).21 However, experience shows that list prices usually differ from the amounts actually paid, meaning they can only be considered a rough starting point. First, it has to be taken into account that most publishers employ a dynamic price model: institutions in the Global South usually receive discounts, and a number of other factors may influence actual pricing. Aside from the results of individual negotiations, there may be other forms of benefit, for example due to frequency of publication, prepayment deals, society memberships or editor/reviewer activities. The latter may even lead to a publisher granting a number of ‘free’ articles, which are published open access without any further costs. Furthermore, APCs may need to be paid in a currency other than the institution’s accounting currency, raising the question of how precisely the required conversions have been calculated. This is particularly problematic for participants from outside the eurozone who pay APCs in their domestic currency: since no conversion takes place in those cases, exact information on the date of payment (which is necessary for precise conversion to euro amounts) is often missing. And, finally, there is the very elementary question of whether value added tax should be included in the reported amount.
Another aspect is that there are other settlement models for paying for the OA status of a journal article. An example is the Royal Society of Chemistry’s (now discontinued) ‘Gold for Gold’ programme, which offered the purchase of a number of vouchers for a fixed amount, each one entitling the holder to publish a single OA article in a hybrid RSC journal.22 On the other hand, there are offers that are in line with the APC approach but do not relate to journal articles: for example, IntechOpen23 publishes OA books and charges comparable fees for submitted book chapters. Confusingly, for some time the publisher explicitly referred to these fees as APCs, although this type of publication is clearly not an article according to bibliographic standards. A similar case exists with the Association for Computing Machinery, which also charges OA fees for publishing in conference proceedings.24 As a final point, it should be mentioned that APCs are not the only costs that may arise during OA publishing: some journals levy additional fees for manuscript submission, while elsewhere page and colour charges are not a thing of the past even in the age of electronic publishing. All these questions had to be answered in order to derive guidelines for the participating institutions, under the premise that cost data should be as uniform and comparable as possible but at the same time easy to collect and report. OpenAPC has developed the following policy:

For consistency, OpenAPC only collects data on fees paid for journal articles (APCs). Other publication types such as conference papers or book chapters are not included.

All reported APC costs are considered ‘final sums’. All modifying factors such as taxes or discounts should already be included. In other words, OpenAPC is only interested in the amount that was ultimately deducted from an institution’s budget. To limit complexity, those modifiers are not included in the data set directly (see also the following section), but participants are encouraged to give more details on them as free text in an optional README file.

The final sum principle only applies to APC costs. Additional costs such as submission fees or page/colour charges should not be included in the reported amount.

Only articles that conform to a ‘standard’ APC model will be included, i.e. OA publication against direct payment. Alternative models where costs can only be calculated in hindsight (such as the aforementioned voucher system or offsetting contracts) should not be considered.

From the previous point, it also follows that only articles with a positive APC amount should be reported. Entries with costs of zero are not included.
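Taken together, the policy amounts to a simple filter on incoming records. A minimal sketch in Python (the record keys publication_type and euro are hypothetical names chosen for illustration; they are not prescribed by OpenAPC):

```python
def accept_record(record):
    """Sketch of the OpenAPC reporting policy for a single cost record.

    `record` is assumed to be a dict with the hypothetical keys
    'publication_type' and 'euro' (final sum with taxes and
    discounts already applied).
    """
    # Only journal articles are collected; conference papers,
    # book chapters etc. are excluded.
    if record.get("publication_type") != "journal-article":
        return False
    # Only entries with a positive final APC amount are reported;
    # fee-waived (zero-cost) articles are left out.
    if record.get("euro", 0) <= 0:
        return False
    return True

records = [
    {"publication_type": "journal-article", "euro": 1425.0},
    {"publication_type": "book-chapter", "euro": 700.0},
    {"publication_type": "journal-article", "euro": 0},  # fee waived
]
accepted = [r for r in records if accept_record(r)]
```

Of the three sample records above, only the first passes the filter: the book chapter fails the publication-type rule and the waived article fails the positive-amount rule.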

Data format and enrichment

With the first delivery of cost data from an external participant (publication fund data of Regensburg University Library on 30 July 2014),25 a fundamental question was the extent and scope of the additional metadata to be collected. At that time only the published APC data from the Wellcome Trust and Jisc in the UK were available as examples. The latter was of particular interest, as Jisc acted as a national aggregator, collecting and processing cost data from external institutions. Jisc decided on a very comprehensive approach: following the recommendations of a pilot study conducted by the service provider Information Power,26 the first version of the template to be filled out by participants in 2014 consisted of 34 metadata fields, with extensive bibliographic data (author, title, journal, publisher) as well as typical identifiers (DOI, PubMed ID).27 However, this approach proved not to be without problems, as an analysis of the aggregated data shows: the resulting table columns are filled to varying degrees depending on the reporting institution, there are different formatting standards (dates, monetary amounts) and inconsistent designations for publisher and journal names. As a result, the OpenAPC project adopted a diametrically opposed approach: while at the beginning some bibliographic data were still required, in the end the number of mandatory data points was reduced to only five out of 18 total fields:28

top-level organization which covered the fee (institution)

year of payment (period)

APC amount (euro)

article DOI (doi)

a Boolean indicator if the journal is hybrid or gold OA (is_hybrid).

For articles without a DOI, four more fields are mandatory:

publisher (publisher)

journal title (journal_full_title)

International Standard Serial Number (issn)

a link to the article full text or landing page (url).

The nine remaining fields are not required:

ISSN for print version (issn_print)

ISSN for electronic version (issn_electronic)

linking ISSN (issn_l)

a Boolean indicator if the DOI is indexed in Crossref (indexed_in_crossref)

the licence under which the paper has been published (license_ref)

PubMed ID (pmid)

PubMed Central ID (pmcid)

Web of Science unique item ID (ut)

a Boolean indicator if the journal is listed in the DOAJ (doaj).

All non-mandatory fields of the OpenAPC data set are automatically enriched from external sources via scripts, specifically Crossref, Europe PubMed Central, DOAJ, Web of Science and the ISSN organization. The first three offer public APIs, while requests to Web of Science are restricted to members. The ISSN organization does not provide a distinct API; instead, an updated mapping table has to be downloaded manually for every enrichment. Figure 2 shows all steps of the enrichment process.

This approach has a number of advantages. The workload for data-supplying institutions remains manageable, since only three data points – costs, DOI and journal type – have to be determined for each article; at the same time, a simple format lowers the entry threshold for new participants. The automatic enrichment ensures consistent assignment of publisher names and journal titles, which is very important for later evaluations and visualizations. Input data are normalized and reformatted during the enrichment process so that the results always conform to the OpenAPC data schema. Corrections to secondary identifiers (ISSN-L, PubMed IDs, Web of Science identifiers) or licence information can be applied automatically to the entire data set at regular intervals.

The enrichment process itself is also subject to the open data principle. Every submitted file is stored as an unmodified original in the institution’s data directory on GitHub; the enriched result is then added as a second file (usually marked by the ‘_enriched’ suffix), making input and output comparable. The enrichment scripts are placed under an open source licence (MIT License)29 and are also made public on GitHub. Finally, the content of all enriched files is aggregated into a main CSV file (the core data file), which represents the OpenAPC data set.30
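As an illustration, the Crossref part of such an enrichment step could look roughly like this in Python. This is a simplified sketch, not the project’s actual scripts: the helpers crossref_metadata and enrich are hypothetical, while the message keys publisher, container-title and ISSN are the field names returned by the Crossref REST API.

```python
import json
from urllib.request import urlopen

def crossref_metadata(doi):
    """Fetch the Crossref record for a DOI (sketch; no error handling)."""
    with urlopen("https://api.crossref.org/works/" + doi) as response:
        return json.load(response)["message"]

def enrich(row, message):
    """Fill OpenAPC columns from a Crossref 'works' message.

    `row` is a dict keyed by OpenAPC column names; the mapping below
    is a simplified sketch of what an enrichment step does.
    """
    row["publisher"] = message.get("publisher", "NA")
    titles = message.get("container-title", [])
    row["journal_full_title"] = titles[0] if titles else "NA"
    issns = message.get("ISSN", [])
    row["issn"] = issns[0] if issns else "NA"
    row["indexed_in_crossref"] = "TRUE"
    return row

# Offline example with a hand-made message of the shape Crossref returns
# (publisher, journal and DOI are invented for illustration):
sample = {
    "publisher": "Example Press",
    "container-title": ["Journal of Examples"],
    "ISSN": ["1234-5679"],
}
row = enrich({"doi": "10.1234/example", "euro": "1200.00", "is_hybrid": "FALSE"}, sample)
```

Running the enrichment over a reported row thus turns the three delivered data points into a full schema-conformant record, with the bibliographic columns filled consistently from the external source rather than typed in by the reporting institution.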

Automated data verification

All data reported to OpenAPC have been manually created and combined at some point in their life cycle. It is thus inevitable that the reports contain errors: typing and copying mistakes (especially problematic in connection with DOIs), flawed formatting of monetary values or erroneous assignment of journal hybrid status are some examples. Some of these issues are already fixed during enrichment, where, for example, non-resolving DOIs are logged for review or incorrect journal titles are overwritten by imports from Crossref. This, however, is not enough: errors at the semantic level, such as duplicate entries or inconsistencies in journal designations, cannot be resolved this way, and it also cannot be guaranteed that the external metadata themselves are correct in all cases. For this reason a small programme was written to check the whole OpenAPC data set for errors automatically.31 From a formal point of view, this is a test suite as usually employed in software development, although the principle has been turned upside down: while predefined data are commonly used in such setups to test variable functions, here predefined functions are used to test variable data (namely, the articles in the OpenAPC data set). The general principle is that every entry must pass a set of tests, both individually and interdependently (i.e. tested against every other article). The following properties are checked:

each row has to be composed of exactly 18 columns

publisher and journal names may not be empty or unknown (NA)

all Boolean variables (is_hybrid, indexed_in_crossref, doaj) must either be TRUE or FALSE

all values in the doi column must represent a formally valid DOI (tested using a regular expression). If the DOI is unknown (NA), the url column may not be empty. No DOI may occur more than once

the issn column may not be empty or NA. Its value is checked both syntactically (regular expression) and semantically (ISSN check digit calculation) if it represents a formally valid ISSN. The other ISSN columns may be empty, but if they are not, they must pass the same checks

the value in the euro column must represent a numerical value larger than zero (no thousands separator; dot as decimal mark)

if the doaj column is TRUE, the is_hybrid column must be FALSE. (The DOAJ only lists fully open access journals)

articles with identical values in at least one of the issn, issn_print or issn_electronic columns must also be identical in the is_hybrid, journal_full_title and publisher columns. This test is not always reliable, since title, publisher or hybrid status may change over the course of time; in those cases, ISSNs can be whitelisted to skip this set of tests.

In its primary work mode the test script can be executed on a local machine to verify any changes made to the central APC file before pushing them to GitHub. In addition, the code is also bound to a continuous integration service (in our case, Travis CI).32 This web-based service monitors the OpenAPC repository, calls the test routines whenever a change occurs (a so-called build) and makes the results publicly accessible. While this may seem redundant, as it just repeats the local tests, it has two distinct advantages. Firstly, it puts the open data principle into practice once more: a user can see the integrity of the OpenAPC data set at first glance, since a small widget on the main OpenAPC page displays and links to the latest test results. Secondly, it creates historical context, since test results of previous builds are also kept accessible. As an example, one may look at an early build created on 23 June 2016.33 At this stage the data set contained several errors because some articles included neither a DOI nor a URL (the corresponding rule was not in place at the beginning of OpenAPC).
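A few of the per-row checks described above can be sketched in Python. This is a simplified re-implementation for illustration (the column names match the OpenAPC schema, but the actual test suite differs in detail); the ISSN check digit follows the standard ISO 3297 algorithm, weighting the first seven digits from 8 down to 2 modulo 11.

```python
import re

DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")  # simplified formal DOI check

def issn_valid(issn):
    """Syntactic (regex) plus semantic (check digit) test for an ISSN."""
    if not re.match(r"^\d{4}-\d{3}[\dxX]$", issn):
        return False
    digits = issn.replace("-", "")
    # Weighted sum of the first seven digits, weights 8..2
    total = sum(int(d) * w for d, w in zip(digits[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return digits[7].upper() == expected

def check_row(row):
    """Run a few of the per-row tests on one record; return error messages."""
    errors = []
    # A row needs a formally valid DOI or, failing that, a URL
    if not DOI_PATTERN.match(row["doi"]) and row.get("url", "NA") == "NA":
        errors.append("neither a valid DOI nor a URL")
    if not issn_valid(row["issn"]):
        errors.append("invalid ISSN")
    # Monetary value: positive number, dot as decimal mark
    if float(row["euro"]) <= 0:
        errors.append("euro amount must be positive")
    # The DOAJ only lists fully open access journals
    if row["doaj"] == "TRUE" and row["is_hybrid"] == "TRUE":
        errors.append("DOAJ journals cannot be hybrid")
    return errors
```

The cross-row tests (duplicate DOIs, consistent journal/publisher per ISSN) additionally require comparing every entry against the rest of the data set and are omitted here.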

Dynamic documentation

In the previous sections it has been shown how automated scripts and routines support OpenAPC during data ingestion and verification. In the following sections we shift the focus to a third aspect, which is usually more important to data reusers: dissemination and representation. The OpenAPC data are collected in a CSV file, which on the one hand is highly compatible and easily processable by a wide range of tools and programmes, but on the other is not really suited to human readers. To tackle this problem, one of our first steps was the creation of a descriptive page providing information on the current state of the OpenAPC data set: basic statistics like the total number of articles, the total sum of costs or the number of participating institutions, but also advanced figures like a graphical plot showing the development of average costs over time. This representation is realized as a Markdown file34 and displayed on the main page of the OpenAPC GitHub repository (if a file called README.md is present in a directory, GitHub renders it below the file tree by convention). While this solution is a good way to disseminate some basic numbers about the project, it comes with its own problems: since the OpenAPC data set changes frequently, the information on the page would become outdated very quickly, requiring time-consuming recalculations and edits to the Markdown file. Fortunately, there is an elegant solution to this problem: dynamic reporting. With this concept, a document is not maintained as a static entity which can only be edited by a human user; instead, it is generated from a template file, where small, interwoven chunks of programming code generate all the dynamic parts directly from the underlying data.
In our case, the generating template35 is another Markdown file, with the code parts written in the statistical programming language R (thus the template’s .Rmd file ending, meaning ‘R Markdown’). The template closely resembles the README file, but wherever a number, table or plot is meant to appear in the result, a code snippet can be found instead, which produces the corresponding element directly from the current version of the OpenAPC data set. The generation process itself is realized by an R package called knitr.36 This concept hails from the paradigm of reproducible research,37 which can be seen as a subtopic of open science.

Dynamic reporting also comes into play in OpenAPC’s project blog,38 the main channel for disseminating information about new data contributions. Technically, the blog is another git project,39 with posts written in Markdown and then transformed into regular HTML with Jekyll (done automatically by the underlying hosting platform, GitHub Pages). Since most blog posts contain several elements that depend directly on OpenAPC data (both the main data file and the latest enriched file contributed by an institution) and at the same time are quite uniformly structured, it was an obvious choice to employ the same dynamic reporting techniques for them. In practice, for every new blog post an individual R Markdown file is derived from a generalized template by filling in the necessary information (institution, URLs, date, contact person, data file links), and then knitr is again used to generate a Markdown file with all numbers and plots from it. (In the project directory, all R Markdown templates are stored in the Rmd folder; the posts folder holds the generated results.) This workflow makes it possible to create many standardized yet individual and informative blog posts for every data contribution in a short amount of time.
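To illustrate, a minimal fragment of such an R Markdown template might look as follows. This is a sketch, not the project’s actual template: the data frame name apc and the chosen statistics are assumptions; the inline `` `r …` `` syntax is standard knitr.

```markdown
## OpenAPC data set

The data set currently covers `r nrow(apc)` articles from
`r length(unique(apc$institution))` institutions, with total costs of
`r round(sum(apc$euro) / 1e6, 1)` million euros.
```

When knitr processes such a file, each inline `r` chunk is replaced by the value it evaluates to, so the rendered README always reflects the current state of the underlying CSV file without any manual recalculation.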