Modeling cumulative measures

Here we consider linear regression models of the h-index, but the analysis can be extended straightforwardly to any cumulative measure of impact. A recent publication proposes a model for predicting an individual's future h-index based on linear regression of five other metrics [10]. As a group, these five metrics were found to be the best predictors of future h-index. In this linear regression model the h-index h(t + Δt) of an individual at time t + Δt is given by

The variables on the right-hand side of Eq. 1 are calculated at a given t, the number of years since the researcher's first publication; we also refer to t as “career age”. For a given researcher at career age t, the other variables are as follows: h(t) is the h-index; n_p(t) is the number of publications authored or co-authored; j(t) is the number of distinct journals in which those publications appeared; q(t) is the number of papers published in high impact journals. The parameter associated with each independent variable is obtained using linear regression with elastic net regularization (see Methods).

We apply the above model to predict the future h-index of both prominent physicists and prominent biologists, measuring its performance by the percentage of variance explained, given by the squared correlation coefficient R². For both datasets the model shows high R² when all career ages are lumped together (red curves in Fig. 2). Even 15 years into the future the model yields R² values of 0.75 and 0.76, respectively. These results are consistent with previous analyses and give the impression that the model is quite good at predicting a scientist's future h-index. For both datasets, the variation of the standardized coefficients is shown in Supplementary Fig. S1. The coefficient of the h-index at the time of prediction (career age t) is the largest; the coefficient for the number of articles published is also quite high, especially for predictions far into the future. In contrast, the coefficients for publishing in many distinct journals and in top journals are relatively small.

Figure 2: The “predictive power” of the regression model of the h-index for different disciplinary datasets and for different career age cohorts (years since first publication t = 3, 5, 7). The curve for t = All shows the model of Eq. 1, where all career ages were lumped together. In all cases the overall regression model is significant (p < 10⁻⁶, calculated from the F-statistic).

Age-dependent cumulative model

To assess the suitability of prediction models for applications in the real world, we analyze the t-dependence of the above model. We use the same regression variables as in Eq. 1 but disaggregate the prediction problem into sets of fixed career age (t). By modeling each career age separately we analyze the robustness of the above model with respect to varying career age. In this case the predicted h-index Δt years in the future, of a scientist who is at a career age t, is given by:

Note that because the data are already segregated by career age, t is not an independent variable in this version of the model. Figure 2 also shows the model's predictive power for different career ages, for prominent physicists and biologists. The predictive power for early career researchers is far lower than that of the previous model, where all career ages were lumped together (t = All). Although these results indicate that the future of scientists at early stages of their career is less predictable, the R² values are still quite high, particularly for biologists. Those in the 3rd and 5th year of their career have R² = 0.63 and R² = 0.73, respectively, 10 years into the future. These values are notably high and may give the impression that an individual researcher's career trajectory is easily predicted even from a very early point. However, in the following section we show that cumulative measures like the h-index contain an intrinsic auto-correlation that not only produces this career age difference in predictive power but, more importantly, leads to a dangerous overestimation of the model's overall predictive power.

Further, the variations of the standardized coefficients for t = 3 and t = 7, shown in Supplementary Fig. S2, differ from those of the t = All case. Although the coefficient related to the h-index is still the largest, the coefficient for the number of papers in high impact journals is comparable, especially for biologists. The variation of the coefficient related to the h-index also increases with time, in contrast to what is observed when all career ages are lumped together (t = All). Moreover, because the coefficients differ across career ages, the career ages cannot be aggregated for regression analysis. Finally, when a given dataset is split into two groups, both the R² values and the coefficients of the regression models differ (Figs. S3–S4), which points to another weakness of this analysis.

Non-stationary time series

An academic career is an endeavor influenced by many factors, and in that light the Acuna model takes a step in the right direction by integrating several different variables into a prediction. However, the h-index is a cumulative measure and hence is non-stationary. This makes the h-index the wrong dependent variable to target for prediction. Here we use the weak definition of stationarity, which requires the mean and variance of a generic stochastic process to be time independent and the auto-covariance between the variable at t and at t + Δt to be a function of Δt only. As we show below, its non-stationary nature makes the h-index a poor target for prediction because it implies an intrinsic correlation that (i) explains, in part, the career age dependence noted above and (ii) results in an overestimation of the predictive power of models that predict the future h-index or any other cumulative measure.

First, we consider a simple model for the evolution of an individual researcher's h-index, in which the h-index in a given year is a sum of independent, random yearly increments Δh. Hence, for a given researcher s, the h-index after t years is given by

$$h_s(t) = \sum_{i=1}^{t} \Delta h_i, \tag{3}$$

where the Δh_i are independent displacements with E[Δh_i] = μ and Var[Δh_i] = σ², for all i.

Next we consider the statistical properties of the above model. The expected value of the h-index at career age t is

$$\mathrm{E}[h_s(t)] = \mu t, \tag{4}$$

and the variance

$$\mathrm{Var}[h_s(t)] = \sigma^2 t. \tag{5}$$

The auto-covariance is

$$\mathrm{Cov}[h_s(t), h_s(t + \Delta t)] = \sigma^2 t. \tag{6}$$

Thus, the correlation between h(t + Δt) and h(t) equals

$$\mathrm{Corr}[h_s(t), h_s(t + \Delta t)] = \frac{\sigma^2 t}{\sqrt{\sigma^2 t \cdot \sigma^2 (t + \Delta t)}} = \frac{1}{\sqrt{1 + \Delta t / t}}. \tag{7}$$

The mean, variance and auto-covariance all depend on t. Further, h_s(t + Δt) and h_s(t) are completely correlated when Δt/t ≈ 0, that is, when the researcher's career age is much greater than the number of years into the future over which the h-index is being predicted. Likewise, h_s(t + Δt) and h_s(t) become completely uncorrelated as Δt/t → ∞, i.e., when attempting to predict an individual's h-index many more years into the future than the current career age.

Even disregarding the limiting behavior, Eq. 7 shows why regression models that attempt to predict the future h-index cannot perform as well for ‘young’ careers as for ‘old’ ones. Further, the fact that the correlation between current and future h-index intrinsically increases with the researcher's career age (for fixed Δt) indicates that the observed predictive power of models of h(t + Δt) may be only an outcome of general properties of the evolution of cumulative measures, rather than a true ability to predict the future impact of a researcher.
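The age dependence described above can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes yearly increments drawn independently and uniformly from {0, 1, 2} (an arbitrary illustrative choice) and compares the empirical correlation between h(t) and h(t + Δt) with the value √(t/(t + Δt)) that independent increments imply. The function names are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_correlation(t, dt, n_careers=20000, seed=0):
    """Empirical Corr[h(t), h(t + dt)] across synthetic careers.

    Assumption: yearly increments are i.i.d. draws from {0, 1, 2};
    this distribution is illustrative and not fitted to any dataset.
    """
    rng = random.Random(seed)
    h_now, h_future = [], []
    for _ in range(n_careers):
        h = 0
        for year in range(1, t + dt + 1):
            h += rng.choice((0, 1, 2))
            if year == t:
                h_now.append(h)  # h-index at career age t
        h_future.append(h)       # h-index at career age t + dt
    return pearson(h_now, h_future)


def theoretical_correlation(t, dt):
    """sqrt(t / (t + dt)), the value implied by independent increments."""
    return math.sqrt(t / (t + dt))


if __name__ == "__main__":
    for t in (3, 10, 30):
        print(t, round(simulate_correlation(t, 5), 3),
              round(theoretical_correlation(t, 5), 3))
```

For a fixed horizon Δt = 5, the simulated correlation should rise with career age even though no synthetic career is, by construction, any more "promising" than another.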

Empirical evidence of overestimation

In this section we provide additional evidence that a trivial correlation is indeed present in the h-index and that it leads to a significant overestimation of the predictive power of linear models. To do this we resort to null models. That is, we explore a number of methods for constructing synthetic careers from the real career data, and show that when linear models for the h-index are applied to these careers, high R² values result. However, within these models all information that a linear regression model should be using to predict an individual's future h-index has been ‘scrambled’. The resulting R² values should therefore be (essentially) nil in the absence of the correlation arising from the fact that the h-index is a cumulative measure.

We refer to our first null model as the Δh null model. Here we construct synthetic careers of physicists with the following procedure. First, we generate the distribution of single-year h-index increases for all careers in a given dataset. Figure 3(A) shows that this distribution is narrow, with 98% of the yearly increments less than 5. Second, we generate a career by constructing a sequence of yearly h-index increases, drawn randomly from the distribution generated in the previous step. Two such career trajectories can be found in Figure 3(B). Finally, we apply a simple linear model, h(t + Δt) = β_0 + β_h h(t). The R² values produced by this approach can be found in Figure 3(C). The R² values are quite high, far higher than those of the cumulative model of Eq. 1 applied to real careers. But what do these high R² values mean? Are they an indicator of predictive power and of an ability to discriminate between promising and not so promising careers? They are not. Because of the manner in which the careers are generated, over any interval the h-index of a researcher will increase by the same (average) amount at each step, regardless of whether the researcher has a high or a low h-index at that point. We conclude that such high R² values do not indicate predictive power; rather, they are evidence of intrinsic autocorrelation.
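The three steps of the Δh null model can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the toy trajectories, the function names and the number of synthetic careers are all hypothetical, and careers are assumed to be given as yearly h-index values starting from career year 1.

```python
import random


def increment_pool(careers):
    """Step 1: pool all single-year h-index increases across careers.

    Each career is a list of yearly h-index values h(1), h(2), ...;
    the h-index before the first year is taken to be 0.
    """
    pool = []
    for h_series in careers:
        pool.extend(b - a for a, b in zip([0] + h_series, h_series))
    return pool


def synthetic_career(pool, n_years, rng):
    """Step 2: build one career from increments drawn at random from the pool."""
    h, series = 0, []
    for _ in range(n_years):
        h += rng.choice(pool)
        series.append(h)
    return series


def r_squared(xs, ys):
    """R^2 (squared correlation) of the one-variable fit y = b0 + bh * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


def null_model_r2(pool, t, dt, n_careers=5000, seed=1):
    """Step 3: R^2 of h(t + dt) regressed on h(t) over synthetic careers."""
    rng = random.Random(seed)
    careers = [synthetic_career(pool, t + dt, rng) for _ in range(n_careers)]
    h_now = [c[t - 1] for c in careers]
    h_future = [c[t + dt - 1] for c in careers]
    return r_squared(h_now, h_future)


# Hypothetical toy input: three short yearly h-index trajectories.
toy_careers = [[1, 1, 2, 3, 3, 4], [0, 1, 1, 1, 2, 4], [1, 2, 4, 5, 5, 6]]
pool = increment_pool(toy_careers)
```

Because the increments are drawn independently, the R² returned by `null_model_r2(pool, t, dt)` should be close to t/(t + Δt) whatever the increment distribution, which is why it can be large even though the synthetic careers carry no information that distinguishes one researcher from another.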

Figure 3: Correlation in non-stationary time series. (A) Distribution of Δh, i.e., the increment in a scientist's h-index in consecutive years. (B) The evolution of the h-index of two scientists in our dataset and their randomized versions. (C) Variation of “predictability” R² with time for the two null models considered in the paper. (D) The auto-correlation of the actual value of the stock market index (not the price return) of 5 different countries. In (C) and (D), the overall regression model is significant (p < 10⁻⁶).

We refer to our second null model as the paper shuffle null model. In this case all papers published in year t are shuffled and distributed randomly across all researchers (see Supplementary Text for details). Hence, in this model the number of papers a researcher published in each year of his/her career is conserved. However, since papers are randomly assigned to each researcher, all careers are statistically indistinguishable from one another, in that every researcher has the same probability of ‘writing’ a high impact paper. Figure 3(C) shows that, as with the Δh null model, this null model produces high R² values, again indicating not predictive power but the presence of inherent correlation.
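The shuffling step can be sketched as below. This is a minimal sketch, not the procedure of the Supplementary Text: it assumes each record is a hypothetical (researcher, year, paper) tuple with a single author slot per record, and it ignores how co-authored papers are handled.

```python
import random
from collections import defaultdict


def shuffle_papers(records, seed=0):
    """Reassign papers to researchers at random within each publication year.

    `records` is a list of (researcher, year, paper) tuples. Every
    (researcher, year) slot is kept, so each researcher's yearly paper
    count is conserved, but the paper filling each slot is drawn at
    random from all papers published in that year.
    """
    rng = random.Random(seed)
    slots_by_year = defaultdict(list)   # year -> researchers, one per slot
    papers_by_year = defaultdict(list)  # year -> papers published that year
    for researcher, year, paper in records:
        slots_by_year[year].append(researcher)
        papers_by_year[year].append(paper)

    shuffled = []
    for year in sorted(slots_by_year):
        papers = papers_by_year[year][:]  # copy so the input is not mutated
        rng.shuffle(papers)
        for researcher, paper in zip(slots_by_year[year], papers):
            shuffled.append((researcher, year, paper))
    return shuffled
```

Shuffling within a year rather than across the whole dataset is what keeps each researcher's publication counts per year intact while removing any link between a researcher and the impact of "their" papers.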

Finally, as an example of a system where simple models are known to have little predictive power, yet produce significant R² values, we turn to financial time series. We considered the stock market index of 5 different markets for the 15-year period from October 1997 to September 2012. In Figure 3(D) we plot the correlation (regression) of the index value p(t + Δt) against p(t) as a function of Δt. This quantity exhibits a high degree of correlation even after 100 days. However, the autocorrelation of the index return (the actual predictability) decays quickly, thus supporting the efficient market hypothesis [14,15].
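The contrast between index levels and index returns can be reproduced on synthetic data. The sketch below is purely illustrative and uses no market data: it assumes a hypothetical index whose daily log-returns are independent Gaussian draws, so the returns are unpredictable by construction, and then compares the lagged correlation of the levels with that of the returns.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def lagged_correlation(series, lag):
    """Correlation between series[i] and series[i + lag]."""
    return pearson(series[:-lag], series[lag:])


def synthetic_index(n_days=4000, daily_vol=0.01, seed=2):
    """Index levels with i.i.d. Gaussian daily log-returns (an assumption)."""
    rng = random.Random(seed)
    log_level, levels = math.log(1000.0), []
    for _ in range(n_days):
        log_level += rng.gauss(0.0, daily_vol)
        levels.append(math.exp(log_level))
    return levels


def log_returns(levels):
    """Daily log-returns, i.e. the non-cumulative counterpart of the levels."""
    return [math.log(b / a) for a, b in zip(levels, levels[1:])]


if __name__ == "__main__":
    levels = synthetic_index()
    print("levels :", round(lagged_correlation(levels, 1), 3))
    print("returns:", round(lagged_correlation(log_returns(levels), 1), 3))
```

The level series is cumulative, like the h-index, so its lagged correlation is high; the return series plays the role of the yearly increment and shows essentially none.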

Modeling non-cumulative measures

The results presented above provide significant evidence that linear regression models are not so much predicting future impact as picking up on a correlation intrinsic to cumulative measures. The auto-correlation of Eq. 7 is present only in cumulative measures such as the total number of publications, the total number of citations, the total number of publications in distinct journals, etc. It is not present in non-cumulative measures, e.g., the incremental h-index, Δh(t, Δt) = h(t + Δt) − h(t). Following the derivation above, the mean E[Δh(t, Δt)] = μΔt (with μ the mean yearly increment) and the variance Var[Δh(t, Δt)] = σ²Δt are independent of t, and the auto-covariance Cov[Δh(t + τ, Δt), Δh(t, Δt)] = 0 whenever the two windows do not overlap (τ ≥ Δt). Hence, it is important to examine the R² for non-cumulative measures. Here we focus on a regression model for the incremental h-index Δh(t, Δt) of a scientist at career age t, which by analogy with Eq. 2 reads
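As a worked check of the stationarity claim for the incremental h-index, the moments can be written out under the random-increment model. Here μ and σ² denote the mean and variance of a single yearly increment Δh_i; the symbol μ is an assumption of this sketch. Strictly, two windows of length Δt that start τ years apart share Δt − τ yearly increments when τ < Δt, so the covariance vanishes once the windows no longer overlap.

```latex
\begin{align*}
\Delta h(t,\Delta t) &= h(t+\Delta t) - h(t) = \sum_{i=t+1}^{t+\Delta t} \Delta h_i, \\
\mathrm{E}[\Delta h(t,\Delta t)] &= \mu\,\Delta t, \qquad
\mathrm{Var}[\Delta h(t,\Delta t)] = \sigma^{2}\,\Delta t, \\
\mathrm{Cov}[\Delta h(t+\tau,\Delta t),\, \Delta h(t,\Delta t)] &= \sigma^{2}\,\max(0,\ \Delta t - \tau).
\end{align*}
```

None of these expressions depends on t, in contrast to the mean, variance and auto-covariance of the cumulative h-index.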

In Supplementary Fig. S5 we show this model's “predictive power”, as measured by R², for different career ages t and varying horizons Δt. All the curves except those for the early career years t = 1 and t = 2 behave similarly, and there is no consistent trend of decreasing R² with decreasing t. The careers at t = 1 show lower correlation, indicating that the state of an individual's CV after his/her first year of publishing is a poor predictor of his/her future trajectory. In Figure 4 we show the average predictive power of the model when applied to established physicists, biologists and mathematicians from different age cohorts. It is immediately clear that for the non-cumulative measure Δh(t, Δt) the model has significantly less predictive power.

Figure 4: The “predictive power” of h-index increments (Δh(t, Δt)) for different disciplines. (A,B,C) Variation of the mean R² as a function of the time period Δt over which the increment is calculated, for established physicists, biologists and mathematicians. The mean is calculated by averaging over different career age cohorts t = 2, …, 15. (D,E,F) Variation of the mean standardized coefficient as a function of Δt. The shaded region indicates the 95% confidence error bars. Similar plots are shown for relatively young researchers in (G,H,I): physics assistant professors, young biologists and graphene researchers. As the careers of young scientists are short, in this case the mean is calculated by averaging over career age cohorts t = 2, …, 8. In all cases the overall regression model is significant (p < 10⁻²).

Figure 4 also shows that the incremental variation in the h-index of a prominent biologist is more tightly connected to his/her past metrics. We speculate this may be due to other factors, such as leading a large laboratory. We note similar behavior for prominent mathematicians. As these three datasets represent only prominent scientists, selected for their high success, the R² values give an upper bound on the predictability of scientists in each field. In contrast, the datasets of physics assistant professors, young biologists and graphene researchers, all relatively young scientists, exhibit much lower R². Finally, we show the variation of the mean standardized coefficients of the model. The coefficient related to the h-index is not as important as we found for Eq. 1, and other factors such as the number of publications, the number of distinct journals, and the number of publications in top journals are more important. For prominent biologists the coefficients for publications in top journals and for the number of publications are higher than for physicists. For mathematicians the coefficient related to the number of distinct journals is the largest. In relative terms, the coefficient of the h-index is more important for physicists.

Although this figure shows the average trend, one ought to exercise caution in interpreting the results because the coefficients also differ for scientists at different stages of their careers. For example, Supplementary Fig. S6 shows the coefficients for career ages t = 3, t = 5 and t = 10 for both prominent physicists and biologists. It is easy to see that the coefficient related to the number of papers decreases as Δh is measured over larger Δt. Further, for biologists, the coefficient for the number of publications in top journals is larger in the late part of the career than in the early stages. Thus the coefficients of the regression analysis differ even for the same set of scientists at different ages of their careers. This variation in the coefficients across fields, as well as across career stages, indicates that it is unlikely that a unique set of parameters can be used to predict future impact in all cases.

Correlating past and true future

Although in the previous section we considered non-cumulative measures of scientific productivity and impact, the correlation between an individual's past accomplishments and future achievements deserves a more fine-grained examination. For example, the number of citations received by a scientist at career age t during the following Δt years depends both upon the papers he/she has written up to year t and upon the papers published between years t and t + Δt. Similarly, the increase in h-index during any given period is due to citations to papers he/she has already written in past years as well as citations to papers published during the period in question. To investigate career uncertainty across academic transition points, we analyze each scientist's citation impact over 3 consecutive non-overlapping periods. The first period, {T_early}, starts at the beginning of his/her career, t = 1, and extends to t = 5. The second period, {T_mid}, starts at year t = 6 and extends to t = 10, while the third period, {T_late}, starts at year t = 11 and extends to t = 15. For each period and each scientist, we collect only the publications that he/she published within that period, and consider only the citations received by these publications within the same period.

We calculate three non-cumulative impact measures for each scientist: (a) the total number of publications n_p(t|{T_j}); (b) the square root of the total number of citations; (c) the h-index h(t|{T_j}). These measures account only for citations within the period to papers also published within that period. In this way, we test the predictability of the citation impact of a scientist's future work using publication information that measures his/her earlier research. Figure 5 shows scatter plots for physicists for all three measures. The left panels show the correlation between the ‘early’ and the ‘mid’ career, and the right panels show the correlation between the ‘mid’ and the ‘late’ career. The correlation coefficient R is also shown for each measure. These values are lower than, but qualitatively similar to, those observed in Fig. 4, indicating that future measures are indeed somewhat correlated with the past. We found that the correlation between past and future is similar for all the measures. Thus our analysis suggests that all these measures are equally good (or equally bad) at predicting future impact. Further, the correlation between mid and late career is slightly higher than the correlation between early and mid career. This is reasonable insofar as there is greater fluctuation in the early career stage than in the later stages, when scientists are more established. Additionally, our results diverge from recent work showing that future citations to future work are hardly predictable [13]. Instead, we found a low but significant correlation between past and future measures. It is possible that this difference arises because this portion of our analysis focuses on scientists who are all relatively well established, thus missing scientists who produce low impact work and ultimately exit academia. This result does nevertheless suggest that the predictability of top scientists can be used as an extreme upper bound for the predictability of all scientific careers. The results for prominent biologists and mathematicians are qualitatively similar, whereas for young researchers (physics assistant professors, young biologists and graphene researchers) the correlation is much smaller (Figs. S7–S11).
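The within-period measures described above can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the input layout (one pair of publication year and citation years per paper, both expressed as career ages), the function names and the toy career are all hypothetical.

```python
import math


def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def period_measures(papers, start, end):
    """Non-cumulative measures for career years start..end (inclusive).

    `papers` is a list of (publication_year, citation_years) pairs, where
    `citation_years` holds the career year of each citation received.
    Only papers published inside the period, and only citations received
    inside the same period, are counted.
    """
    counts = []
    for pub_year, citation_years in papers:
        if start <= pub_year <= end:
            counts.append(sum(1 for y in citation_years if start <= y <= end))
    return {
        "n_papers": len(counts),
        "sqrt_citations": math.sqrt(sum(counts)),
        "h_index": h_index(counts),
    }


# The three periods used in the text: early, mid and late career.
PERIODS = {"early": (1, 5), "mid": (6, 10), "late": (11, 15)}

# Hypothetical career: (publication year, years in which citations arrived).
toy_papers = [
    (2, [2, 3, 3, 4, 7]),
    (4, [5, 5, 6, 9]),
    (7, [7, 8, 8, 9, 12]),
    (9, [10, 11]),
]
measures = {name: period_measures(toy_papers, *span)
            for name, span in PERIODS.items()}
```

Restricting both the papers and the citations to the same window is what makes each measure non-cumulative: a citation arriving in the mid period to an early-period paper contributes to neither period's measures.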