Methods

A model for impact

We assume a community of researchers who publish papers. We consider two types of researchers: those who share the research data associated with those papers and those who do not. We make the simplifying assumption that the goal for both types of researchers is to perform well by making a significant contribution to science, i.e., to have a large impact on science. We assume that produced papers, $P_s$ for sharers and $P_{ns}$ for non-sharers, create impact by being cited a number of times $c$. We assume $c$ is constant, which means we do not distinguish between low and highly cited papers. To increase their performance, researchers need to be efficient, i.e., they should try to minimize the time spent on producing a paper, so that more papers can be produced within the same timeframe. Papers for which the dataset is shared gain an extra citation advantage, increasing the impact of that paper by a fraction $b$. In our model we consider only papers with a dataset as a basis, i.e., no review or opinion papers. The performance of researchers is thus expressed as an impact rate, in citations per year, i.e., the impact for sharing and non-sharing researchers is defined as

$$E_s = P_s \cdot c \cdot (1 + b), \qquad E_{ns} = P_{ns} \cdot c. \tag{1}$$

From the above expressions it is clear that the difference in impact between sharing and not sharing researchers depends to a large extent on the number of publications $P$ per year. These publications can be expressed in terms of an average time to write a paper, $T_s$ for sharers and $T_{ns}$ for non-sharers:

$$P_s = \frac{1}{T_s}, \qquad P_{ns} = \frac{1}{T_{ns}}. \tag{2}$$

The time $T$ consists of several elements that we make explicit here. Each paper costs time $t_a$ to produce. Producing the associated dataset costs a certain time $t_d$. Sharing a dataset implies a time cost $t_c$. We do not distinguish between large and small efforts to prepare a dataset for sharing; all datasets take the same amount of time.
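As a rough worked example of Formulae (1) and (2), consider the case in which no shared dataset is available for reuse, so that every paper needs its own dataset. Under that assumption (ours, for illustration only) the time per paper is simply the sum of the time costs named above, and with the default values from Table 1 ($t_a = 0.13$, $t_d = 0.2$, $t_c = 0.1$, $c = 3.4$, $b = 0$) the impact rates would be approximately:

$$
\begin{aligned}
T_{ns} &= t_a + t_d = 0.33, & E_{ns} &= \frac{c}{T_{ns}} = \frac{3.4}{0.33} \approx 10.3 \ \text{citations/year},\\
T_s &= t_a + t_d + t_c = 0.43, & E_s &= \frac{c\,(1 + b)}{T_s} = \frac{3.4}{0.43} \approx 7.9 \ \text{citations/year}.
\end{aligned}
$$

In this limiting case the sharer pays the cost $t_c$ without any compensating benefit, which is the starting point for the trade-offs explored in the rest of the model.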
We assume there is a certain probability $f$ to find an appropriate dataset for a paper from the pool of shared datasets $X$, in which case the time $t_d$ needed to produce a dataset is avoided. We do acknowledge that some time is needed to get to know the external dataset well and to process it, represented by the time cost $t_r$. We calculate the time to produce a paper by

$$T_s = t_a + \frac{t_d}{1 + f \cdot X} + t_r - \frac{t_r}{1 + f \cdot X} + t_c, \qquad T_{ns} = t_a + \frac{t_d}{1 + f \cdot X} + t_r - \frac{t_r}{1 + f \cdot X}. \tag{3}$$

In these formulae, the pool of available datasets $X$ determines the value of the terms with $t_d$ and $t_r$. When $X$ is close to zero, the term with $t_d$ approaches $t_d$. This implies that everybody has to produce their own dataset with time cost $t_d$. In contrast, when $X$ is very large the term approaches zero, implying that almost everyone can reuse a dataset and almost no time is spent in the community to produce datasets. Between these two extremes, the term first declines rapidly with increasing $X$ and then approaches zero ever more slowly (see the plots in the last column in the figure in Appendix S2). This rests on the assumption that when few datasets are available, adding datasets has a profound influence on the reuse possibilities, whereas when datasets are already abundant, adding extra datasets has less influence on the reuse rate.

The term representing the effort $t_r$ to reuse a dataset works opposite to the term representing $t_d$. When $X$ is close to zero, the term approaches zero, implying that nobody spends time to prepare a set for reuse. When $X$ is very large the term approaches $t_r$; everyone spends this time because everyone has found a set for reuse.

The pool of datasets $X$ determines the values of the terms with $t_d$ and $t_r$ and, with that, the number of shared datasets; at the same time, the shared datasets accumulate in the pool $X$. To specify this pool size $X$ we formulate a differential equation for the pool size.
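The relationship between Formulae (1) to (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation (their scripts are in R): the function and constant names are hypothetical, and the numeric defaults are the values listed in Table 1.

```python
# Default parameter values as listed in Table 1; names are hypothetical.
T_A = 0.13   # time to produce a paper (year/paper)
T_D = 0.2    # time to produce a dataset (year/paper)
T_C = 0.1    # time to prepare a dataset for sharing (year/paper)
T_R = 0.05   # time to prepare an external dataset for reuse (year/paper)
F = 0.00001  # probability to find an appropriate dataset (1/dataset)
C = 3.4      # citations per paper
B = 0.0      # citation benefit for sharing researchers


def time_per_paper(pool, sharing, t_a=T_A, t_d=T_D, t_c=T_C, t_r=T_R, f=F):
    """Formula (3): average time (years) needed to produce one paper."""
    own_data_share = 1.0 / (1.0 + f * pool)  # 1 when pool == 0, tends to 0 for a large pool
    t = t_a + t_d * own_data_share + t_r * (1.0 - own_data_share)
    return t + t_c if sharing else t


def impact(pool, sharing, c=C, b=B):
    """Formulae (1) and (2): citations per year for one researcher type."""
    papers_per_year = 1.0 / time_per_paper(pool, sharing)
    bonus = 1.0 + b if sharing else 1.0
    return papers_per_year * c * bonus
```

Calling `time_per_paper(0, False)` reproduces the empty-pool limit $t_a + t_d$, while a very large `pool` drives the result towards $t_a + t_r$, matching the two extremes described above.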
A change in the pool of available, shared datasets $X$ depends on the addition of datasets belonging to papers $P_s$ from sharing researchers $Y_s$, minus the decay $q_x \cdot X$ of the datasets. Such a decay rate could result from a fixed storage time after which datasets are disposed of, or from a loss of data value, for instance through outdated techniques.

$$\frac{dX}{dt} = Y_s \cdot P_s - q_x \cdot X. \tag{4}$$

Using Formulae (2) and (3) with the system at steady state, i.e., $dX/dt = 0$, the pool size $X$ as a function of the publication parameters and the size of the group of sharing researchers is given by

$$X = \frac{-\left(q_x (t_a + t_c + t_d) - Y_s f\right) + \sqrt{\left(q_x (t_a + t_c + t_d) - Y_s f\right)^2 - 4\, q_x \cdot f\, (t_a + t_c + t_r) \cdot (-Y_s)}}{2\, q_x f\, (t_a + t_c + t_r)} \tag{5}$$

(Formula (5) is derived in Appendix S1). So, for each parameter setting, we calculate $X$, and consequently we calculate the impact in terms of citation rates $E_s$ and $E_{ns}$ with Formulae (1)–(3). Table 1 gives the default parameter settings that we use for our simulations.

| Parameter | Meaning | Value | Source | Unit |
|---|---|---|---|---|
| $t_a$ | Time-cost to produce a paper | 0.13 | Derived: $t_a + t_d$ amount to 121 days, leading to ~3 papers a year (similar to the average in Fig. 1) | Year/paper |
| $t_d$ | Time-cost to produce a dataset | 0.2 | Derived: $t_a + t_d$ amount to 121 days, leading to ~3 papers a year (similar to the average in Fig. 1) | Year/paper |
| $t_c$ | Time-cost to prepare a dataset for sharing | 0.1 | Estimated: 36.5 days | Year/paper |
| $t_r$ | Time-cost to prepare a dataset to reuse | 0.05 | Estimated: 18.25 days | Year/paper |
| $q_x$ | Decay rate of shared datasets | 0.1 | Derived: based on a storage time of 10 years | 1/year |
| $b$ | Citation benefit (sharing researcher) | 0 | Estimated: percent extra citations | Percent |
| $f$ | Probability to find an appropriate dataset | 0.00001 | Fitted | 1/dataset |
| $c$ | Citations per paper produced | 3.4 | Derived: approximate from 'baselines'; average citation rate by year three, Thomson Reuters | Citation/paper |

| State variable | Meaning | Value | Source | Unit |
|---|---|---|---|---|
| $E$ | Impact | See formula (1) | Calculated | Citation/year |
| $P$ | Number of papers | See formula (2) | Calculated | Paper/year |
| $T$ | Time for a publication | See formula (3) | Calculated | Year/paper |
| $X$ | Pool of shared datasets | See formula (5) | Calculated | Dataset |
| $Y$ | Number of researchers | 10000 | Defined | n.a. |

DOI: 10.7717/peerj.1242/table-1

An individual based model

In addition to the model for impact, we set up an individual based model to assess the impact for individual researchers depending on their personal publication rate and their sharing and reuse habits, rather than working with averages. We use the 'model for impact' as a basis for the calculations and then assign characteristics to individuals. First, a publication rate $P_r$ is assigned at random to each individual researcher. $P_r$ is based on the distribution seen in Fig. 1, fitted with the function

$$P_r = Y \cdot e^{-(t_a + t_d)}. \tag{6}$$

As a next step we introduce parameters that have to do with sharing. The percentage of sharing researchers is a fixed parameter in this model. The researcher's sharing type is assigned at random to individuals. The actual reuse of a dataset, based on the probability to find an appropriate dataset for a paper, is assigned at random to publications. The portion of papers $R$ for which an appropriate dataset for reuse is found is calculated as

$$R = 1 - \frac{1}{1 + f \cdot X}. \tag{7}$$
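Formula (5) follows from setting $dX/dt = 0$ in Formula (4) and substituting $P_s = 1/T_s$ from Formulae (2) and (3); multiplying through by $(1 + f \cdot X)$ gives the quadratic $q_x f (t_a + t_c + t_r) X^2 + \left(q_x (t_a + t_c + t_d) - Y_s f\right) X - Y_s = 0$, of which Formula (5) is the positive root. The sketch below, with hypothetical function names and the Table 1 defaults, computes that root and the reuse fraction of Formula (7); it is meant as an illustration rather than a reproduction of the original scripts.

```python
import math


def sharer_time(pool, t_a=0.13, t_d=0.2, t_c=0.1, t_r=0.05, f=1e-5):
    """Formula (3) for sharing researchers: years needed per paper."""
    own_data_share = 1.0 / (1.0 + f * pool)
    return t_a + t_d * own_data_share + t_r * (1.0 - own_data_share) + t_c


def steady_state_pool(n_sharers, t_a=0.13, t_d=0.2, t_c=0.1, t_r=0.05,
                      f=1e-5, q_x=0.1):
    """Formula (5): positive root of the steady-state quadratic in X."""
    qa = q_x * f * (t_a + t_c + t_r)
    qb = q_x * (t_a + t_c + t_d) - n_sharers * f
    qc = -float(n_sharers)
    return (-qb + math.sqrt(qb * qb - 4.0 * qa * qc)) / (2.0 * qa)


def reuse_fraction(pool, f=1e-5):
    """Formula (7): portion of papers for which a reusable dataset is found."""
    return 1.0 - 1.0 / (1.0 + f * pool)


# Example: half of the 10000 researchers in Table 1 share their data.
pool = steady_state_pool(5000)
inflow = 5000 / sharer_time(pool)   # datasets added per year, Y_s * P_s
outflow = 0.1 * pool                # datasets lost per year, q_x * X
```

At the returned pool size, `inflow` and `outflow` should agree up to rounding, which is a quick check that the root satisfies Formula (4) at steady state.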
We now have a mix of individual researchers who share or do not share, who find a dataset for reuse or not for any of their papers, and who publish different numbers of papers in a year. Based on the parameters in Table 1 we assign costs and benefits to these traits. These factors determine the performance of researchers in terms of impact by citations.

Figure 1: Publication distribution. The sampled (bars) and fitted (line) distribution of published papers per researcher in a given year, in this case 2013. For reasons of visualisation the distribution is shown up to thirty publications, whereas the sampling sporadically included more publications per researcher. The fitted line is used as the published papers' distribution for the simulated community.

To determine the publication rate distribution in Fig. 1, we sampled the bibliographic database Scopus. We selected the first four papers for each of the 26 subject areas in Scopus-indexed papers published in 2013. If a paper appeared within the first four in more than one subject area, it was replaced by the next paper in that subject area. For each of the selected papers, we noted all authors and checked how many papers each author (co-)authored in total in 2013. This yielded 366 unique authors in our selected papers. Authors that were ambiguous, because they seemingly published many papers, were checked individually and excluded if the name turned out to cover a group of authors publishing under the same name with different affiliations between the papers. For the data, see Pronk, Wiersma & van Weerden (2015). This distribution, based on our sampling, implies that most researchers publish one paper and a few researchers publish many papers in a given year. We fitted an exponential distribution through the sampled population (Formula (6)). The average for the distribution is close to three papers per researcher in a given year.

Simulations

For the R-scripts to generate the plots for all simulations, see Pronk, Wiersma & van Weerden (2015).
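The referenced scripts are written in R. As a language-neutral illustration of how the traits of the individual based model could be assigned, the following Python sketch draws a publication count, a sharing type and a number of reused datasets for each researcher. Several details are our assumptions and are not taken from the source: the function name, the rounding of the exponential draw to a whole number of at least one paper, and the fixed random seed.

```python
import random


def assign_traits(n_researchers, pool, share_fraction=0.5,
                  t_a=0.13, t_d=0.2, f=1e-5, seed=1):
    """Assign publication count, sharing type and reuse to each researcher."""
    rng = random.Random(seed)                     # fixed seed: assumption, for repeatability
    reuse_prob = 1.0 - 1.0 / (1.0 + f * pool)     # Formula (7)
    researchers = []
    for _ in range(n_researchers):
        # Exponential draw with rate t_a + t_d, i.e. a mean of about three
        # papers per year. Rounding to an integer of at least one paper is
        # an assumption made for this sketch.
        papers = max(1, round(rng.expovariate(t_a + t_d)))
        shares = rng.random() < share_fraction    # sharing type, assigned at random
        # Reuse is assigned at random per publication.
        reused = sum(rng.random() < reuse_prob for _ in range(papers))
        researchers.append(
            {"papers": papers, "shares": shares, "reused_papers": reused}
        )
    return researchers


community = assign_traits(10000, pool=100000)
```

Costs and benefits per researcher could then be attached to these traits using the time costs in Table 1; how exactly that is done in the original simulations is not spelled out here, so it is left out of the sketch.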
We start with a set of simulations regarding performance per sharing type, using the model for impact. We calculate the impact for the two types of researchers over a range of sharing from zero to one hundred percent of all researchers. In addition to the default values (see Table 1), we change parameters to assess their influence on the publication rate and the associated impact by citations for sharing and not sharing researchers. In Table 2 we list the parameters changed in the simulations and a number of measures that would have these effects in a 'real world' scientific community (Chan et al., 2014).

Table 2 lists each parameter investigated in the model, followed by possible associated measures to improve it.

Time $t_r$ spent to assess and include an external dataset
• Improve data quality, for instance by the use of data journals (Costello et al., 2013; Atici et al., 2013; Gorgolewski, Margulies & Milham, 2013), or peer review of datasets (e.g., a 'comment' field in data repositories).
• Offer techniques or tools for easy assessment of dataset quality (e.g., Eijssen et al., 2013), faster pre-processing or data cleaning (e.g., 'OpenRefine' or 'R statistical language').

Chance $f$ to find an external dataset
• Harvest databases through data portals to reduce 'scattering' of datasets.
• Standardization of metadata and documentation.
• Advanced community and project-specific databases.
• Library assistance in finding and using appropriate datasets.

Time $t_c$ associated with sharing of research data
• Offer a good storing & sharing IT infrastructure.
• Assistance with good data management planning at the early stages of a research project.

Benefit in citations per paper $b$ associated with sharing of research data
• Provide a permanent link between paper and dataset.
• Increase attribution to datasets by citation rules.
• Establish impact metrics for datasets.

Percentage of scientists sharing their research data
• Promote sharing by a top down policy from an institute, funder, or journal.
• Promote sharing bottom up by offering education on the benefits of sharing, to change researchers' mindset.

DOI: 10.7717/peerj.1242/table-2

To take a closer look at individual performance, we perform the same set of simulations with the individual based model. For each setting, we calculate the difference between the publication rate assigned in Formula (6), at no costs or benefits from sharing or reuse, and a new, calculated publication rate based on the sharing and reuse traits per researcher, under the assumption that half of the researchers share. So, again we change the parameters in Table 2 and assess their influence, as in the first simulation.

We end by zooming out to community performance with the model for impact. We calculate the average impact over all researchers in the community, now at more extreme settings of the citation benefit $b$ and, in a second simulation, at a higher cost $t_c$ for preparing a dataset for sharing. This is to provide a broader range of results. Citation benefit $b$ and the sharing rate are changed within their range in one hundred equal steps.
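The first and last simulation sets described above amount to a sweep over the percentage of sharing researchers. A minimal Python sketch of such a sweep is given below, assuming the Table 1 defaults and taking the community average as the mean of $E_s$ and $E_{ns}$ weighted by the share of each type; the weighting, the function names and the dictionary of defaults are our assumptions for illustration.

```python
import math

# Table 1 defaults; the dictionary and its keys are hypothetical names.
DEFAULTS = dict(t_a=0.13, t_d=0.2, t_c=0.1, t_r=0.05, q_x=0.1,
                f=1e-5, c=3.4, b=0.0, n_researchers=10000)


def pool_size(n_sharers, p):
    """Formula (5): steady-state pool of shared datasets."""
    qa = p["q_x"] * p["f"] * (p["t_a"] + p["t_c"] + p["t_r"])
    qb = p["q_x"] * (p["t_a"] + p["t_c"] + p["t_d"]) - n_sharers * p["f"]
    return (-qb + math.sqrt(qb * qb + 4.0 * qa * n_sharers)) / (2.0 * qa)


def sweep(**overrides):
    """Impact per sharing type for 0..100 percent sharing researchers."""
    p = {**DEFAULTS, **overrides}
    rows = []
    for percent in range(101):
        share = percent / 100.0
        x = pool_size(share * p["n_researchers"], p)
        own_data_share = 1.0 / (1.0 + p["f"] * x)
        # Formula (3); sharers additionally pay t_c per paper.
        t_ns = p["t_a"] + p["t_d"] * own_data_share + p["t_r"] * (1.0 - own_data_share)
        # Formulae (1) and (2).
        e_ns = p["c"] / t_ns
        e_s = p["c"] * (1.0 + p["b"]) / (t_ns + p["t_c"])
        # Community average, weighted by group size (assumption).
        rows.append((percent, e_s, e_ns, share * e_s + (1.0 - share) * e_ns))
    return rows


def best_percentage(rows):
    """Percentage of sharers at which the community average is highest."""
    return max(rows, key=lambda row: row[3])[0]


default_rows = sweep()          # default parameter values
high_benefit_rows = sweep(b=0.4)  # citation benefit raised to 0.4
```

Parameter variations such as a threefold change in `f`, `t_r` or `t_c` can be passed as keyword overrides in the same way, e.g. `sweep(f=3e-5)`.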

Results

Figure 2 shows the simulations with the model for impact (Formulae (1)–(5)). The simulation in (A) is at default parameter values (Table 1). In (B–F) we simulated measures to improve impact. There are two important observations.

First, in all but the last subfigure (Figs. 2A–2E) the average impact of not sharing researchers exceeds that of sharing researchers, irrespective of how many sharing researchers there are. This means that not sharing is the best option at all percentages of sharing researchers. In this scenario, it would be logical for all individual researchers to choose not to share and eventually end up with the average impact by citations depicted at zero percent sharing. So we see here a classical example of the tragedy of the commons or prisoner's dilemma phenomenon. What is important to note, though, is that the measures in (B), (C), (D) and (E) have a key effect when compared to the default in (A). The average impact of sharing researchers at the highest percentage of sharing researchers (straight horizontal light-grey line; stripes) is increasingly higher with the measures than the average impact for not sharing researchers at zero percent sharing researchers (straight horizontal dark-grey line; dots-stripes). Should a policy enforce sharing, or should all agree to cooperate and share, a higher gain is achieved than in the case where all researchers choose not to share. This illustrates the conflicting interest for individual researchers, who are better off not sharing, while they would do better if all of them did share. Subfigure (F) of Fig. 2 shows the potential of the citation benefit with sharing. In this subfigure it is profitable to share at low sharing rates, and profitable not to share at high sharing rates, leading to a stable coexistence of sharing and not sharing researchers. This means that the community would consist of researchers following both strategies.
Hypothetically, should the citation benefit be even higher, the sharing strategy would outperform the not sharing strategy at all sharing percentages. Researchers would in this case choose to share even without measures to promote sharing, simply because it directly increases their impact.

Figure 2: Impact per sharing type. Citations ('impact') per year for researchers sharing and not sharing, at different percentages of sharing researchers. The simulations are done at parameter settings (A) default (see Table 1), (B) default but with $f$ increased threefold, (C) default but with $t_r$ decreased threefold, (D) default but with $t_c$ decreased threefold, (E) default but with $b$ set to 0.1, (F) default but with $b$ set to 0.4. The curved light-grey line depicts the impact of the sharing researchers. The curved dark-grey line depicts the impact of the not sharing researchers. The thin dotted curved black line is the averaged community impact. The straight black vertical dotted line depicts the percentage of sharing researchers at which community impact is maximized. The straight horizontal lines respectively depict the impact at zero percent researchers sharing (dark-grey line; dots-stripes) and hundred percent sharing researchers (light-grey line; stripes).
Second, in some subfigures (Figs. 2A–2C and 2E) the average citations are highest at intermediate sharing. This means that if sharing increases further, it has a detrimental effect on average community impact. This is because the model is formulated in Formula (3) in such a way that total costs for sharing increase for the community as more researchers share, whereas total benefits cease to increase at a high sharing rate. The extra datasets do not contribute much to the benefits; in other words, the research community has become saturated with datasets. In contrast to the average community citations, which are highest at intermediate sharing, the highest impact by citations for both sharing and not sharing researchers is at the point at which everyone is sharing.

Results from the individual based model in Fig. 3 show that individual researchers have various gains depending on their publication rate, reuse, and dataset sharing habits. In (A) are the gains and losses in impact at default parameter values (Table 1). In (B–F) we simulated measures to improve gains or limit losses. A possible desired effect of sharing datasets would be that every individual researcher can benefit, sharing or not sharing. It can be observed in Figs. 3A–3E that most of the sharing researchers have lower benefits, or even costs, compared to not sharing researchers. This is in line with the lower averages for sharing researchers in Fig. 2. Also, in all subfigures of Fig. 3 there are always sharing researchers who do not benefit from the availability of datasets through reuse. These researchers were not (fully) able to compensate for the cost of sharing their data. It is notable that in (B) individual researchers are left with lower costs than in (C). This is because in (B) the probability $f$ of finding an appropriate dataset for reuse is set higher, compensating the sharing costs for many of the researchers.
In (C) the time cost $t_r$ of reuse per paper is lower, benefitting only those few researchers who do find a reusable set. In (D) the lowering of the time cost $t_c$ for preparing a dataset for sharing improves the situation for all researchers compared to the default in (A), but still some researchers are not fully compensated. In (E) the introduction of the citation benefit $b$ does not help much to improve the benefits for sharing researchers. Only when a substantial citation benefit $b$ is introduced in (F) are the costs associated with sharing (more than) compensated for all sharing researchers.

Figure 3: Individual gains with sharing. Gains from sharing in number of citations per individual researcher. These gains are calculated for the situation with fifty percent sharing researchers compared to the same situation without sharing researchers. For visualization purposes, the researchers are sorted according to sharing habit: not sharing researchers (dark grey circles) to the left, sharing researchers (light grey circles) to the right. See the legend of Fig. 2 for parameter settings in all subfigures.

When simulating community impact in Figs. 4A and 4B it can be seen that, as the benefits $b$ for sharing increase towards the right of the plot, the average community impact increasingly starts to rise with more sharing in both plots. Even the drop after the initial increase at increased sharing, caused by the saturation with datasets, is eventually compensated for by the increase of the citation benefit with sharing. In subfigure (B), at the left side of the plot, without a citation benefit and with the very high cost $t_c$ for sharing, an alarming effect appears. At these parameter values the average impact becomes lower at high sharing than at no sharing at all. Policies increasing sharing would, if successful, in this case backfire and reduce scientific community impact.

Figure 4: Community impact. Average community impact with varying percentage of sharing researchers and varying sharing benefit $b$. Figures are calculated at default parameter values (see Table 1) with the exception of $b$, which is varied, and for subplot (B) $t_c$, of which the value was set from 0.1 to 0.2. On the z-axis is the average community impact. On the x and y axes, respectively, are increasing benefits $b$ for sharing from 0 to 0.8 (0 to 80% citation benefit with sharing) and increasing percentage of sharing researchers from 0 to 100%.