Empirical analysis

To measure the acceleration in our diverse ensembles of trajectories, we use statistical methods (e.g., ensemble averages) as a rough estimate, and the distributions of various observables for a more detailed picture. Generally, we analyze groups of the most popular items, because we aim to observe topics that reach societal-scale awareness and to avoid small and heterogeneous subgroups. This choice allows us to use a well-mixed modeling approach, without the need to account for possibly complex network structures in the onset of lesser-known topics. It also ensures a constant number of parallel events in each measurement. We choose group sizes specific to each data source, according to its size, so as to achieve reasonable data densities. We tested for possible artifacts from the choice of absolute top groups by considering relative top groups and thresholds, and found the developments reported in the main text to be robust against these choices (see Supplementary Figure 1).

The temporally discrete nature of the empirical data requires us to choose a reasonable bin size for aggregating the timestamped occurrences. The minimal size we use is 24 h, from 00:00 to 23:59, to avoid confusing popularity dynamics with diurnal rhythms. If the data are only available at a coarser granularity, we use that granularity as the bin size.
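A minimal sketch of this binning step, assuming the raw occurrences are available as Unix epoch seconds in UTC (the function name and data layout are illustrative, not taken from our pipeline):

```python
import numpy as np
import pandas as pd

def bin_daily(timestamps):
    """Aggregate raw occurrence timestamps (Unix seconds, UTC) into daily
    counts; each bin runs from 00:00 to 23:59 of one calendar day."""
    ts = pd.to_datetime(np.asarray(timestamps), unit="s", utc=True)
    days = ts.floor("D")                        # map every event to its day
    return pd.Series(days).value_counts().sort_index()

# Illustrative usage: three fake occurrences spread over two days.
print(bin_daily([1483228800, 1483239600, 1483315200]))
```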

The aggregation windows of values, used to obtain the distributions for comparison, are chosen in a similar way. We generally compare distributions year by year, and we choose the aggregation windows for the losses, gains, etc. as follows: if the data are available at daily granularity, we keep the yearly aggregation (e.g., for Twitter, Google, or Reddit). When the granularity of the data points is coarser, e.g., weekly or yearly, we use wider aggregation windows, such as 5 or 20 years (e.g., for Movies or Publications). This choice results in sample sizes that allow us to extract smooth distributions and to cover the available time frame evenly.

Some of the observables depend on market size or infrastructure and cover long time scales. Whenever possible, we choose observables that are independent of these influences; for example, we consider normalized proxies for popularity for box-office sales and Google Books.

Sampling techniques

To measure collective attention we focus on popular and globally visible cultural items. On Twitter the acceleration appears weaker the lower the items are ranked (Supplementary Figure 1). A possible explanation is that additional mechanisms drive smaller popularity peaks, which are influenced more strongly by, e.g., social network structures or daily rhythms and less by the global news-spreading velocity. To focus on what we call collective attention we choose the following popularity-related samples. Twitter: From the set of hourly aggregated hashtag occurrences, drawn from altogether 43 billion individual tweets, we sample the 50 most used hashtags of every hour. The resulting sample sizes are listed in Supplementary Table 2. The raw sampling counts already hint at the observed development: the total number of different hashtags ranked in the hourly top 50 within a year increases from 2013 to 2016. In 2013, for example, a hashtag stayed in the set of top-ranked hashtags on average for (365 × 24 × 50)/25031 = 17.5 hours, while in 2016 it stayed only (365 × 24 × 50)/36703 = 11.9 hours. Hence, we observe a development towards higher turnover rates in the popular discussion. This observation, as well as the other results (Figs. 1 and 2), is stable with respect to changes in the size of the top group and becomes even more pronounced within the highest-ranked hashtags (e.g., in the top ten of each hour, see Supplementary Figure 1).
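The back-of-the-envelope dwell-time estimate above can be reproduced directly from the counts quoted in the text (a minimal sketch; the function name is ours):

```python
# Average number of hours a hashtag spends in the hourly top 50 within a year:
# total top-list slots divided by the number of distinct hashtags filling them.
def mean_dwell_hours(n_distinct_hashtags, top_size=50, hours_per_year=365 * 24):
    return hours_per_year * top_size / n_distinct_hashtags

print(round(mean_dwell_hours(25031), 1))  # 2013: 17.5 h
print(round(mean_dwell_hours(36703), 1))  # 2016: 11.9 h
```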

For books, we extract the 1000 most frequently used n-grams of each year (n = 1, ..., 5), normalized by the number of books in which they appear. An n-gram is a specific sequence of n character strings. Especially for higher n, such n-grams are likely to correspond to a specific turn of phrase, related to a specific topic, if they are used several times per book. In this case the occurrences of phrases in books are taken as a proxy for the attention that their authors spend on certain topics. Similarly to the usage of specific hashtags, we take such phrases to be representations of topics of collective attention. The per-book normalization yields a measure of the relative volume of a phrase's usage that is unaffected by the overall growth of the book market. Within the 20-year bins used in our investigation, we observe a similar effect in the raw counts as described above for Twitter: the number of different n-grams within the yearly top group over 20 years grows as listed in Supplementary Table 2 (the last number is smaller due to the shorter observation window of just 14 years).

As another example from the offline world, we collected box-office sales (i.e., sold ticket prices) of popular Hollywood movies from the past four decades. This measurement is another quantification of the collective attention towards cultural items, in this case movies: the more people view a given movie, the more collective attention it receives. We normalize the values by the number of theaters in which each movie is screened, which avoids a bias due to the growing number of cinemas. We use 4000 individual movies from the weekly top list of Box Office Mojo29. Similarly to the previous datasets, the number of individual movies within the same time interval increases (Supplementary Table 2; the last observation window covers only 3 years).

For Google, we chose the top 20 search terms of every month in the period 2005 to 2015, as listed in the Google Trends top charts. We use several categories, namely: people, songs, actors, cars, brands, TV shows, and sports teams. We interpret the number of web queries concerning these items as collective attention spent on them. In each category, five queries can be compared at once and their volume is normalized to the maximum among them. On the one hand this provides a normalization, but on the other hand this limitation of Google Trends implies that the values are not perfectly comparable. We overcome this by defining relative gains and losses, which are independent of system size, but note that, in this case, the distribution of maximum values in Supplementary Figure 5c remains non-trivial to interpret.

On Reddit we use the volume of discussion on submissions as a popularity proxy. The more a submission is discussed, the more people are actively engaged, spending their attention on the particular topic. We extract the number of comments on each submission from the corpus of a 10% random sample of all comments on Reddit and focus on the 1000 most commented submissions of each month. The growing sample sizes again confirm the increasing turnover rates.

From the body of papers in the APS corpus we analyzed the citation counts of papers with more than 15 citations within a month. Whenever researchers cite another work they are discussing a related idea, thereby focusing their attention on that topic. In this dataset, any acceleration of citation dynamics is visible only for highly cited papers; we speculate that this is because only the mechanisms of, e.g., imitation apply in this case (for a detailed discussion see Supplementary Notes).

We also include the 100 most visited articles of every hour on the English Wikipedia in our analysis. Again, the active visit of a Wikipedia article is an act of focusing one's attention on the topic described by the article. In the case of Wikipedia, we do not observe a pronounced development towards larger turnover rates within this definition of top articles. This is corroborated by the lack of strong changes in the other observables for this dataset (see Supplementary Notes for a discussion).

Average trajectories

From the dynamics L_i(t) of an entity i, we compute as a first step the ensemble average of the trajectories around local maxima L_i(t_peak). The grey lines in Fig. 1b show 1% of the sub-trajectories L_i(t) around a local maximum, contributing to an intuitive visualization of the dynamics in the Twitter dataset. The colored lines are the ensemble average values of all sub-trajectories L_i(t_m ± t) from each trajectory L_i(t) around all its M_i local maxima, at respective times \(t_1 \ldots t_{M_i}\), for each year:

$$\langle L(t_m \pm t)\rangle = \frac{1}{N}\mathop{\sum}\limits_{i = 1}^{N} \frac{1}{M_i}\mathop{\sum}\limits_{m = 1}^{M_i} L_i(t_m \pm t).$$

Evaluated three time steps before and after the maxima, we observe long-term changes in the average values 〈ΔL〉 and a relatively stable average height 〈L(t_peak)〉. Following this insight, we analyze and compare the statistical distributions of all the relative changes ΔL_i/L_i at each time step and the maxima L_i(t_peak) with respect to their long-term developments.
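The averaging around local maxima can be sketched as follows, assuming each trajectory is available as a one-dimensional array of binned counts; the peak detection via scipy and all names are our own illustrative choices:

```python
import numpy as np
from scipy.signal import argrelmax

def average_around_peaks(trajectories, window=3):
    """Ensemble average <L(t_m +/- t)> over all local maxima of all trajectories.

    trajectories : list of 1-D numpy arrays L_i(t)
    window       : number of time steps evaluated before and after each peak
    Returns an array of length 2*window + 1 centred on the peaks.
    """
    per_entity = []
    for L in trajectories:
        peaks = argrelmax(L)[0]
        # Keep only peaks with a full window on both sides.
        peaks = peaks[(peaks >= window) & (peaks < len(L) - window)]
        if len(peaks) == 0:
            continue
        subs = np.stack([L[p - window: p + window + 1] for p in peaks])
        per_entity.append(subs.mean(axis=0))    # average over the M_i peaks
    return np.mean(per_entity, axis=0)          # then average over entities

# Illustrative usage with toy sawtooth-like trajectories.
rng = np.random.default_rng(0)
toy = [np.abs(rng.normal(size=200)).cumsum() % 50 for _ in range(10)]
print(average_around_peaks(toy))
```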

Distribution of relative changes P(ΔL_i/L_i)

For a more refined picture at the system level than the average of L_i(t) around a local maximum, we examine the changes in popularity at every given point in time. To avoid trivial effects such as a dependence on system size, we do not consider the total changes ΔL_i for measuring the velocity of attention allocation, but the relative gains ΔL_i/L_i in the positive direction: \([\Delta L_i^{{\mathrm{(g)}}}/L_i](t) = (L_i(t) - L_i(t - 1))/L_i(t - 1) > 0\) at every instance t. Additionally, we define the losses symmetrically, by reversing the order of events: \([\Delta L_i^{{\mathrm{(l)}}}/L_i](t) = (L_i(t) - L_i(t + 1))/L_i(t + 1) > 0\). The normalized empirical distribution of relative losses \(P(\Delta L_i^{{\mathrm{(l)}}}/L_i)\) is plotted in Figs. 1d and 3g with an inverted x-axis to visually clarify their negative character (and in Supplementary Figure 4 for the other datasets, without that visualization). The distributions of gains \(P(\Delta L_i^{{\mathrm{(g)}}}/L_i)\) are shown in Figs. 1e, 2, and 3h and in Supplementary Figure 3. All plots use double logarithmic scales with logarithmically spaced bins.
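A minimal sketch of how the relative gains and losses, and their log-binned distributions, can be computed from a single trajectory (function names and binning details are illustrative assumptions):

```python
import numpy as np

def relative_changes(L):
    """Relative gains dL(g)/L and losses dL(l)/L of one trajectory L(t).
    Assumes strictly positive counts."""
    L = np.asarray(L, dtype=float)
    gains = (L[1:] - L[:-1]) / L[:-1]           # (L(t) - L(t-1)) / L(t-1)
    losses = (L[:-1] - L[1:]) / L[1:]           # (L(t) - L(t+1)) / L(t+1)
    return gains[gains > 0], losses[losses > 0]

def log_binned_pdf(values, n_bins=30):
    """Normalized histogram on logarithmically spaced bins."""
    values = np.asarray(values, dtype=float)
    values = values[values > 0]
    bins = np.logspace(np.log10(values.min()), np.log10(values.max()), n_bins)
    pdf, edges = np.histogram(values, bins=bins, density=True)
    centers = np.sqrt(edges[1:] * edges[:-1])   # geometric bin centres
    return centers, pdf
```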

Fitting continuous distributions

Finding a good fit for the distributions described above is not the central point of this work, but the correct fit contributes to a better understanding of the data. In addition, the fitted parameters quantify the visually observed developments, as listed in Supplementary Tables 4–10. To find the best-suited probability density function we use maximum likelihood fitting44, considering an array of well-known continuous distribution functions, namely: exponential, power-law, log-logistic (Fisk), Cauchy, normal, gamma, Pareto, logistic, uniform, Weibull, power-law with exponential cutoff, and log-normal. Supplementary Figure 2 shows the empirical data and the fitted candidate functions. For the optimal choice of distribution we consider the KS statistic to quantify the goodness of fit. As listed in Supplementary Table 3, the log-normal distribution is the best choice in most cases and is plotted as a thick red line. Other good candidates are the Pareto, the Weibull, and the log-logistic distribution; for some datasets these provide an even better fit (e.g., Pareto for Wikipedia). For simplicity we use the log-normal distribution, \(P(x) = 1/(x\sigma \sqrt {2\pi } )\exp\left[ { - (\ln(x) - \mu )^2/(2\sigma ^2)} \right]\), in all cases. The corresponding KS statistics are generally very small; they are listed alongside the detailed fitted values in Supplementary Tables 4–10.
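A sketch of such a fit comparison using scipy's built-in maximum-likelihood fits and the one-sample KS statistic (not necessarily the tooling of ref. 44; the candidate list is a subset of the distributions named above, and the synthetic sample only stands in for an empirical gain distribution):

```python
import numpy as np
from scipy import stats

CANDIDATES = {
    "lognorm": stats.lognorm,        # log-normal
    "pareto": stats.pareto,
    "weibull_min": stats.weibull_min,
    "fisk": stats.fisk,              # log-logistic
    "expon": stats.expon,
}

def best_fit(values):
    """Fit each candidate by maximum likelihood and rank by KS statistic."""
    results = {}
    for name, dist in CANDIDATES.items():
        params = dist.fit(values)                      # MLE fit
        ks = stats.kstest(values, name, args=params)   # goodness of fit
        results[name] = (ks.statistic, params)
    return sorted(results.items(), key=lambda kv: kv[1][0])

# Example: synthetic, log-normally distributed "gains".
rng = np.random.default_rng(1)
gains = rng.lognormal(mean=-1.0, sigma=1.2, size=5000)
for name, (ks_stat, params) in best_fit(gains)[:3]:
    print(f"{name:12s}  KS = {ks_stat:.3f}")
```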

Since it is not the main point of this work to identify the exact shape of the distributions, we use box-and-whisker plots for the representation in Fig. 2. Specifically, we use notched box plots, with median values shown as black bars and mean values as black diamonds. Whiskers extend to 1.5 times the interquartile range. The position of the median relative to the mean and the size of the upper box and whisker are good indicators of the right-skewness of the data.
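A minimal matplotlib sketch of this box-plot convention (notches, means shown as markers, whiskers at 1.5 times the interquartile range), applied to illustrative synthetic samples:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
# Three synthetic, increasingly broad log-normal samples standing in for
# yearly gain distributions.
samples = [rng.lognormal(mean=-1.0, sigma=s, size=2000) for s in (0.8, 1.0, 1.2)]

fig, ax = plt.subplots()
ax.boxplot(samples, notch=True, whis=1.5, showmeans=True, showfliers=False)
ax.set_xticklabels(["year 1", "year 2", "year 3"])
ax.set_yscale("log")
ax.set_ylabel("relative gains")
plt.show()
```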

Distribution of maximal popularity P(L_i(t_peak))

To quantify the observation of the average peak heights, the insets of Fig. 1e and Supplementary Figure 5 show the distribution P(L_i(t_peak)) of the values of the local maxima L_i(t_peak) of each trajectory. These plots are in double logarithmic scale with log-scaled bins. Corresponding box plots, as described above, are shown in the insets of Fig. 2. Their shapes vary, but generally the maxima are broadly distributed in all datasets.

Statistical testing

For the distributions of gains and losses we fitted a log-normal distribution to the data. This choice is based on the comparison of many candidate functions, as described above and shown in Supplementary Figure 2. To check the goodness of fit we use the two-sided Kolmogorov–Smirnov (KS) test45 for continuous functions, which returns low statistic values (Supplementary Tables 4–10) for all fits. The p values (for the hypothesis that the data are drawn from a log-normal distribution) vary widely and are generally small. In some datasets, small sample sizes can contribute to the low p values. As mentioned above, the aim here is not to find the optimal probability density function to describe our data but to obtain an estimate of their breadth.

We also use the two-sided KS test for two samples to compare the distributions obtained from the simulation to the empirically observed distributions. The distribution we aim to fit with the chosen parameters (α = 0.005, c = 2.4, N = 300, and r = 12) is the gain distribution from Twitter in 2016. We reach a KS distance of 0.01 and a p value of 0.85 (for the hypothesis that the two sets are drawn from the same distribution, Supplementary Table 11). Additionally, we find that the distributions from the other datasets agree very well with our simulations (see Supplementary Figs. 10 and 11; KS statistics are given in the captions).
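This model–data comparison reduces to a two-sample KS test; a minimal sketch with synthetic stand-ins for the simulated and empirical gain samples:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(3)
simulated_gains = rng.lognormal(-0.5, 1.0, 10000)   # stand-in for model output
empirical_gains = rng.lognormal(-0.5, 1.0, 8000)    # stand-in for Twitter 2016

res = ks_2samp(simulated_gains, empirical_gains)
print(f"KS distance = {res.statistic:.3f}, p = {res.pvalue:.2f}")
```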

Burst events

Another way to demonstrate the development of accelerating media is to visualize the density of extreme events in collective attention dynamics. This can also be understood as a visualization of the broadening distributions of relative changes in Supplementary Figs. 3 and 4. We define such an event as a relative increase above a threshold, directly followed by a steep decline; these events can be understood as bursts or spikes. To define such an event in Twitter we detect a relative gain \([\Delta L_i^{{\mathrm{(g)}}}/L_i](t_{{\mathrm{burst}}}) > 5\) followed by a loss \([\Delta L_i^{{\mathrm{(l)}}}/L_i](t_{{\mathrm{burst}}}) > 5\). For the other datasets we adjust these values to obtain a reasonable density of events for visualization (Google Books: 12, 12; Movies: 1.5, 1.5; Google Trends: 2, 2; Reddit: 25, 25; Citations: 1, 1; Wikipedia: 35, 35). These values depend on the breadth of the gain and loss distributions and are chosen to track only the most extreme events (the furthest points in the tail). The resulting bursts are plotted in Supplementary Figure 6 as scatter plots over time, with more recent times towards the top. The resulting average heights at the burst events 〈L(t_burst)〉 and the average time intervals 〈τ〉 between them are shown in Supplementary Figure 7. The developments are robust to the choice of slope thresholds.
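A sketch of this burst criterion applied to a single trajectory of binned counts, using the Twitter thresholds quoted above (the toy data only illustrate the detection; names are ours):

```python
import numpy as np

def burst_times(L, gain_threshold=5.0, loss_threshold=5.0):
    """Indices t where a relative gain above gain_threshold is directly
    followed by a relative loss above loss_threshold."""
    L = np.asarray(L, dtype=float)
    t = np.arange(1, len(L) - 1)                 # need both t-1 and t+1
    gains = (L[t] - L[t - 1]) / L[t - 1]
    losses = (L[t] - L[t + 1]) / L[t + 1]
    return t[(gains > gain_threshold) & (losses > loss_threshold)]

# Toy trajectory with a single spike at t = 5.
toy = np.array([1, 1, 1, 1, 1, 10, 1, 1, 1], dtype=float)
print(burst_times(toy))   # -> [5]
```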

Reducing dimensionality

The proposed model consists of N coupled distributed-delay differential equations that can be transformed into a system of 2N ordinary differential equations. Introducing the auxiliary variable \(Y_i(t) \equiv \int_{-\infty}^{t} \mathrm{e}^{-\alpha(t - t')} L_i(t')\,dt'\) in Eq. (1) of the main text and applying the Leibniz rule, we obtain the transformed system:

$$\frac{dL_i(t)}{dt} = r_{\mathrm{p}}L_i(t)\left( 1 - \frac{r_{\mathrm{c}}}{K}Y_i(t) - c\mathop{\sum}\limits_{j = 1, j \neq i}^{N} L_j(t) \right),$$ (2)

$$\frac{{dY_i(t)}}{{dt}} = L_i(t) - \alpha Y_i(t).$$ (3)

This system can be solved numerically using standard solvers such as the Runge–Kutta method. The parameters of our simulations for the Twitter dataset are N = 300, α = 0.005, c = 2.4, and K = 1.0; using the condition for stable peak heights, r_p ∼ r_c, we set r_c = r_p ≡ r, which we vary within the range r ∈ {9.0, 10.0, 11.0, 12.0}.
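A sketch of this numerical integration with scipy's Runge–Kutta solver, using the Twitter parameter set quoted above; the initial conditions and transient follow the description below, and all variable names are our own:

```python
import numpy as np
from scipy.integrate import solve_ivp

N, alpha, c, K, r = 300, 0.005, 2.4, 1.0, 12.0
r_p = r_c = r

def rhs(t, state):
    """Right-hand side of the 2N-dimensional system, Eqs. (2) and (3)."""
    L, Y = state[:N], state[N:]
    competition = c * (L.sum() - L)            # c * sum over j != i of L_j
    dL = r_p * L * (1.0 - (r_c / K) * Y - competition)
    dY = L - alpha * Y
    return np.concatenate([dL, dY])

rng = np.random.default_rng(0)
L0 = rng.uniform(0.0, 1.0, N)                  # random initial popularity
Y0 = L0.copy()                                 # initial memory, as described below
sol = solve_ivp(rhs, (0.0, 1000.0), np.concatenate([L0, Y0]),
                method="RK45", dense_output=True)

t = np.linspace(500.0, 1000.0, 5001)           # discard a transient of 500 units
trajectories = sol.sol(t)[:N]                  # N trajectories L_i(t) to analyze
```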

The interpretation of the numerical parameter values is not the focus of this work, but they were optimized to fit the empirical distributions. The system size N = 300 produces richer dynamics than, e.g., N = 50; the memory decay α and the global competition coupling c are difficult to measure in empirical data, but their estimation is a possible direction for future research. The growth rates of Fig. 3d cannot be mapped directly to the rate r_p, because we measure global production rates, while r_p describes a per capita growth per piece of content for each topic; this mapping could also be addressed in the future. Generally, the qualitative results are very robust within wide ranges of the parameter space, and the fitting is possible without extensive fine-tuning of the parameters.

After starting with random initial conditions (L_i(0) ∈ [0,1]) and initial memory Y_i(0) = L_i(0), we allow for a transient of 500 time units in our simulations before starting the measurement. The finite memory makes self-sustaining dynamics possible without the constant introduction of new topics. Introducing additional topics could be another variant of the simulation, to account explicitly for exogenous driving by ever new events, but is beyond the scope of this work. Solving the equations above results in N trajectories that can be analyzed for the same observables as those we measure in the datasets. We average over 100 realizations to obtain the statistical distributions, because ergodicity is not necessarily given. In Fig. 3a–c such trajectories are shown as examples, and in Fig. 3e the average values around local maxima are plotted. This plot is much smoother than the empirical one in Fig. 1b, because we can refine the granularity of the time discretization in the simulation. From these data it is straightforward to extract the extreme events (Fig. 3f), the distributions of gains and losses (Fig. 3g, h), or the values of the maxima (inset of Fig. 3h; values rescaled to match the empirical distribution). The simulation parameters were chosen by a parameter scan to minimize the KS distance (0.01) between the resulting relative gain distribution and the one observed in 2016 on Twitter. The other distributions then follow from that simulation and overlap well with the data. Only the parameter r is varied to fit the distributions of the earlier years. For the other datasets the same procedure is applied to achieve the fits in Supplementary Figs. 10 and 11.

Analytic solutions for a single topic

To gain deeper insight into the interplay of the two rates r_p and r_c, we consider the minimal scenario of the model, an isolated topic (c = 0). This reduces Eq. (1) to:

$$\frac{{dL(t)}}{{dt}} = r_{p}L(t)\left( {1 - \frac{{r_{c}}}{K}{\int}_{ - \infty }^t {{\mathrm{e}}^{ - \alpha (t - t\prime )}} L(t\prime )dt\prime } \right),$$ (4)

choosing the limiting case of infinite memory α → 0 leads to

$$\frac{{dL(t)}}{{dt}} = r_{p}L(t)\left( {1 - \frac{{r_{c}}}{K}{\int}_{ - \infty }^t L (t{\prime})dt{\prime}} \right),$$ (5)

which can be written as:

$$\frac{{dL(t)}}{{dt}} = r_{p}L(t)\left( {1 - \frac{{r_{c}}}{K}Y(t)} \right)$$ (6)

$$\frac{{dY(t)}}{{dt}} = L(t).$$ (7)

The chain rule yields

$$\frac{{dL(t)}}{{dY(t)}} = r_{p}\left( {1 - \frac{{r_{c}}}{K}Y(t)} \right),$$ (8)

which leads to (with the necessary condition L(Y = 0) = 0)

$$L(t) = r_{p}Y(t) - \frac{{r_{p}r_{\mathrm{c}}}}{{2K}}Y(t)^2 = \frac{{dY(t)}}{{dt}}.$$ (9)

This first-order equation is of Bernoulli type and can be solved with the substitution v(t) = Y(t)^{-1}, which turns Eq. (9) into the linear equation \(dv/dt = -r_{\mathrm{p}}v + r_{\mathrm{p}}r_{\mathrm{c}}/(2K)\); its solution is:

$$Y(t) = \left( {\frac{{r_{c}}}{{2K}} + C{\mathrm{e}}^{ - r_{p}t}} \right)^{ - 1}.$$ (10)

With the condition \(Y(t_{\mathrm{peak}}) = \frac{1}{2}\mathop{\lim}\limits_{t \to \infty} Y(t) = K/r_{\mathrm{c}}\) for logistic growth processes, where t_peak is the time when L(t) is maximal, as used for the empirical data in the main text, this yields the solution:

$$Y(t) = \left( {\frac{{r_{c}}}{{2K}} + \frac{{r_{c}}}{{2K}}{\mathrm{e}}^{ - r_{p}(t - t_{{\mathrm{peak}}})}} \right)^{ - 1} = \frac{{2K}}{{r_{c}}}\left( {1 + {\mathrm{e}}^{ - r_{p}(t - t_{{\mathrm{peak}}})}} \right)^{ - 1},$$ (11)

which directly leads to the solution for L(t):

$$L(t) = \frac{{2K}}{{r_{c}}}\frac{{r_{p}{\mathrm{e}}^{ - r_{p}(t - t_{{\mathrm{peak}}})}}}{{\left( {1 + {\mathrm{e}}^{ - r_{p}(t - t_{{\mathrm{peak}}})}} \right)^2}}.$$ (12)

The observation of stable peak heights becomes clear by considering the maximum:

$$L(t)|_{\frac{dL(t)}{dt} = 0} = L\left( t_{\mathrm{peak}} \right) = \frac{K}{2}\frac{r_{\mathrm{p}}}{r_{\mathrm{c}}},$$ (13)

which is independent of r under the assumption of direct proportionality of the two rates, r_p ∼ r_c ≡ r. Eq. (9) is shown in Supplementary Figure 8a under this condition. The independence of the peak height from r appears to hold approximately for the full system (Eqs. (2) and (3)) as well, as the numerical results in Fig. 3e show.
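For completeness, the peak value in Eq. (13) follows directly from evaluating Eq. (12) at t = t_peak, where the exponential equals one:

$$L(t_{\mathrm{peak}}) = \frac{2K}{r_{\mathrm{c}}}\,\frac{r_{\mathrm{p}}}{(1 + 1)^2} = \frac{K}{2}\frac{r_{\mathrm{p}}}{r_{\mathrm{c}}},$$

which equals K/2 for r_p = r_c ≡ r, independently of r.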

In the opposite limit α → ∞, corresponding to an audience without any memory, Eq. (4) becomes

$$\mathop{\lim}\limits_{\alpha \to \infty} \frac{dL(t)}{dt} = r_{\mathrm{p}}L(t)\left( 1 - \mathop{\lim}\limits_{\alpha \to \infty} \frac{r_{\mathrm{c}}}{K}\int_{-\infty}^{t} \mathrm{e}^{-\alpha(t - t')} L(t')\,dt' \right) = r_{\mathrm{p}}L(t),$$ (14)

corresponding to an exponential growth process. In this scenario, even the present state L(t) is immediately forgotten and does not contribute to the intra-specific competition, causing unhindered growth.