Abstract

This article presents evidence of performance deterioration in online user sessions, quantified by studying a massive dataset of over 55 million comments posted on Reddit in April 2015. After segmenting user sessions (i.e., periods of activity without a prolonged break) by their intensity (i.e., how many comments users produced during a session), we observe a general decrease in the quality of comments produced by users over the course of their sessions. We propose mixed-effects models that capture the impact of session intensity on comments, including their length, quality, and the responses they generate from the community. Our findings suggest performance deterioration: sessions of increasing intensity are associated with the production of shorter, progressively less complex comments, which receive declining quality scores (as rated by other users) and become less and less engaging (i.e., they attract fewer responses). Our contribution points to a connection between cognitive and attention dynamics and the usage of online social peer production platforms, specifically the deterioration of user performance over the course of a session.

Citation: Singer P, Ferrara E, Kooti F, Strohmaier M, Lerman K (2016) Evidence of Online Performance Deterioration in User Sessions on Reddit. PLoS ONE 11(8): e0161636. https://doi.org/10.1371/journal.pone.0161636

Editor: Tobias Preis, University of Warwick, UNITED KINGDOM

Received: April 23, 2016; Accepted: August 9, 2016; Published: August 25, 2016

Copyright: © 2016 Singer et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The data is publicly available online. Information can be found at https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment and direct access is possible, e.g., in Google's BigQuery engine at https://bigquery.cloud.google.com/table/fh-bigquery:reddit_comments.2015_04. Additionally, for reproducibility and independent inference, we make our code and experimental steps available online in the supplementary material, in the form of jupyter notebooks using R kernels. The Reddit data was not crawled by ourselves, but is publicly available and promoted on Reddit. In this work, we only use aggregated data of the whole dataset; thus, the data used adheres to the terms and conditions of Reddit.

Funding: The authors received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Performance deterioration following a period of sustained mental effort has been documented in settings that include student performance [1], driving [2], data entry [3], and exerting self-control [4]. Although the mechanisms of deteriorating performance are still debated [5–7], deterioration has been shown to be accompanied by physiological brain changes [8–10], suggesting a cognitive origin, whether due to mental fatigue, boredom, or strategic choices to limit attention. Outside of vigilance tasks, however, relatively little is known about whether and how this phenomenon affects online behavior. As our society becomes increasingly interconnected and people spend more time interacting through various online platforms, analyzing online performance is important for understanding how content is produced and consumed [11], how information spreads [12–14], and how people decide what and whom to pay attention to [15, 16]. In this work, situated under the broad umbrella of user behavior modeling [17], we study online performance on Reddit, a popular peer production and social news platform. We measure online peer production performance as the quality of comments produced by Reddit users over the course of a session, defined as a period of activity without a prolonged break. The dataset we study contains over 55 million comments posted on Reddit in April 2015 and includes a variety of related meta-data, such as time stamps, information about the users, and the score attributed by others to each comment. We segment user activity into sessions, defined as periods of commenting without a break longer than 60 minutes, as suggested in [18] (cf. Fig 1). We link an individual's commenting performance over the course of a session to different proxy measures of a comment's quality, such as its length, readability, the score it receives from others, and the number of responses it triggers.


Fig 1. Sessions and randomization. Circles represent comments C_i and arrows depict the time difference Δt_i,j between subsequent comments C_i and C_j. Sessions are derived by breaking at time differences exceeding 60 min. Original data sessions are shown in the first row. The middle row shows randomized sessions, where time differences between comments are swapped to derive new sessions while retaining the original order of comments. The bottom row depicts the randomized index data, where sessions are retained but the order of comments within sessions is swapped. https://doi.org/10.1371/journal.pone.0161636.g001

Our analyses uncover deteriorating online performance over the course of user sessions, with a decline in the quality of subsequent comments across different proxy measures. Fig 2 illustrates the decline in the average score received by comments posted during sessions with ten comments: the data show that each subsequent comment receives a rating that is, on average, 0.3 points lower than the preceding one. The size of this effect is quite large: since each downvote lowers a comment's score by one point, an average drop of 0.3 points is equivalent to a 30% increase in the probability of receiving a downvote, for each extra comment posted after the first one in the session. Additionally, we observe that users tend to start with higher quality comments the longer the sessions are. To statistically study these effects, we design and implement mixed-effects models—allowing the incorporation of heterogeneous behavioral differences—that model the effect of session duration on the deterioration of online performance.


Fig 2. Performance of comments within sessions. We show the average Reddit score for comments in sessions of length 10 (original session data, blue solid line). The average rating of each comment decreases starkly, by about 0.3 points for each comment after the first one in the session. This suggests the presence of (super-linear) performance deterioration throughout user sessions. The effect disappears in randomized data in which comments are shuffled within sessions (red dashed line). https://doi.org/10.1371/journal.pone.0161636.g002

Our findings may be linked to effects of cognitive depletion: exerting mental effort to compose a comment may diminish an individual's capacity to continue producing quality comments, whether through the loss of attention, mental fatigue, or simply the onset of boredom. Evidence also suggests that people, and other primates, have finite cognitive capacity for managing interpersonal relationships [19], limiting their amount of social interaction [20, 21]. Only recently has our research community started investigating the possible relationship between cognitive limits and online interactions, showing the impact of information overload on user behavior [20, 22–24]. Within-session deterioration of performance could possibly explain the difficulty users have in continuing to exert effort to discover information deeper in their social stream [15, 16, 25]. Deterioration might also be influenced by passive content consumption within a session; e.g., replies by other users to one's own comments may be toxic or hateful, leading to flame wars [26]. The relation between session length (i.e., number of comments) and the quality of a session's first comment might also be explained by different starting capacities to make quality contributions, or by the perceived quality of the first comment encouraging users to produce more follow-up comments. Although unveiling the mechanism(s) behind the observed phenomena goes well beyond the scope of the current study, performance deterioration occurs throughout various critical daily activities, including learning (e.g., prolonged study sessions) and self-regulation (e.g., coping with stress, inhibition, refraining from undesired behaviors, or sticking to dietary restrictions). We believe that shedding light on the complex interplay between cognitive limits and individual performance can further our understanding of human behavior in many contexts. Showing initial evidence of online performance deterioration is thus important, and we expect this work to have implications for both the computer and cognitive sciences communities.

Discussion

Our work presents novel evidence of performance deterioration during prolonged online activity. By analyzing Reddit, a popular online social network that attracts millions of users, we showed that sessions with more activity are significantly associated with the production of lower quality content, as measured by the length of the comment posted, its readability score, its average score, and the number of responses it receives. In light of these findings, we developed a mixed-effects model that captures online performance deterioration. The code and results for all model analytics are available online [27] and in the supplementary information (S1–S8 Notebooks and S1–S8 Tables). Our analysis can be expanded in several directions. For example, we have only accounted for basic differences between distinct Reddit users in the mixed-effects models; a much more nuanced analysis of heterogeneous effects of online performance deterioration would be warranted. One interesting direction involves understanding whether all individuals exhibit the same levels of performance deterioration, or whether these effects vary from user to user. For example, we might find that all users consistently exhibit deterioration or that different subgroups of users exist, where some users might even show improvements in performance over time. Neuroscience studies have found individual differences in working memory and other cognitive activities in the human brain [28]. However, it remains unclear from a physiological standpoint whether the capacity to process or produce information varies from person to person [29]. Online performance deterioration may also depend on acquired experience (as a form of cognitive dexterity) with a system. A new, and thus unfamiliar, user of a system may experience faster performance deterioration than an experienced user, because, e.g., the cognitive or attention cost associated with the same operations may be experience-dependent (this is particularly true for information discovery and content production activities). A computational study of online performance in this direction could be very valuable. Additionally, other hypotheses can be studied, such as whether performance deterioration depends on the topic (politics vs. funny images), the time of day, or the intensity of sessions (shorter average time differences between comments). A further aspect to consider is that we treated all comments posted to Reddit as equal: we did not distinguish between comments posted at the root of a comment hierarchy and those posted further down the hierarchy. Future research in that direction is necessary to better understand the observed deterioration effects. For example, top-level comments might generally be of higher quality than low-level comments, or performance deterioration might be stronger for successive posts in the same submission thread than for comments across submissions. The position of a comment in the hierarchy also influences its visibility to others, which might have an impact on its perceived quality. These and similar questions can be studied with our proposed models. They are highly adaptable: fixed and random effects can be used to model such heterogeneous effects; for example, including a random effect that allows the deterioration to vary between users (see the sketch below) would already permit further inference about individual differences.
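As an illustration, such a random-slope extension can be expressed directly in the lme4 syntax used in our analyses. The following is a minimal sketch with hypothetical column names (score, comment_index, user), not the exact specification fitted in our experiments:

```r
# Minimal sketch of a random-slope extension (hypothetical column names:
# score = comment score, comment_index = position within the session,
# user = author). Not the exact specification fitted in our experiments.
library(lme4)

# Baseline: deterioration as a fixed effect, random intercept per user.
m_fixed <- lmer(score ~ comment_index + (1 | user),
                data = comments, REML = FALSE)

# Extension: the deterioration slope itself varies between users.
m_slope <- lmer(score ~ comment_index + (1 + comment_index | user),
                data = comments, REML = FALSE)

# A lower BIC for m_slope would indicate user-specific deterioration.
BIC(m_fixed, m_slope)
```

Inspecting the fitted random slopes (e.g., with ranef(m_slope)) would then reveal whether subgroups of users deteriorate faster, slower, or even improve over the course of a session.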
Furthermore, the set of quality features can be extended in several ways and investigated more closely. In this work, we focused on two features that are static (text length and readability) and two that express the perception of the content by others (score and number of responses). Especially the latter category warrants future study, e.g., in light of potential social influence bias (herding) effects [30]. Other categories of quality features might also be of interest, such as the sentiment of a comment. Although our study was confined to Reddit, performance deterioration may generalize to other online activities. Future studies are needed to identify the mechanisms leading to the observed deterioration, whether the loss of attention, mental fatigue, or simply the onset of boredom. Regardless of the causes, understanding the complex interplay between individuals' cognitive limits and dynamic behavior is key to optimizing individual—and collective—performance in peer production and other online systems.

Materials & Methods

Here, we describe the utilized data, the corresponding pre-processing steps, and our statistical mixed-effects modeling approach.

Data

To study performance deterioration, we used a publicly available dataset containing all comments ever written on Reddit (nearly 1.7 billion), from the first one on October 17, 2007 to the last one at the end of May 2015 [31]. For our experiments, we extracted a smaller sample limited to all comments posted in April 2015. An advantage of this restriction is that we do not need to additionally account for changes in Reddit's platform, not only in its interface but also in its voting mechanisms and in the general usage patterns on the site [32]. Our results are robust: samples from other months yield similar observations.

Quality features

To measure online performance, we studied the following comment quality features.

Text length. This feature counts the number of characters in a comment and is an indicator of its textual length. Each URL in a comment accounts for one additional character. The overall mean of text lengths is μ = 168.08, the median is m = 86.00, and the standard deviation is σ = 281.88.

Score. The score measures a comment's perception by other users and is the difference between their up- and downvotes (the starting score is 1). The ratings can be summarized by the mean μ = 6.05, the median m = 1.00, and the standard deviation σ = 51.57.

Number of responses. We regard the number of replies a comment triggers as a proxy for engagement and a comment's success. We only count direct replies in the comment hierarchy. The mean number of responses is μ = 0.61, the median is m = 0.00, and the standard deviation is σ = 1.44.

Readability. George Klare provided the original definition of readability [33] as "the ease of understanding or comprehension due to the style of writing". To measure the readability of Reddit comments, we use the so-called Flesch-Kincaid grade level [34], which represents the readability of a piece of text by the number of years of education needed to understand the text upon first reading; it combines the numbers of words, sentences, and syllables. It is defined as follows:

grade = 0.39 × (total words / total sentences) + 11.8 × (total syllables / total words) − 15.59

The lowest possible grade is −3.4, which emerges, e.g., for comments that only contain a single one-syllable word such as "OK", only a single URL, or only emoticons. We set the maximum Flesch-Kincaid grade to 22. Simply put, a higher Flesch-Kincaid grade indicates a more complex, harder-to-read comment. The overall mean of the Flesch-Kincaid grade is μ = 5.12, the median is m = 4.91, and the standard deviation is σ = 4.61.

Correlation of features. As shown in Table 2, most of the features are not strongly correlated (Pearson's ρ) with each other; however, two cases stand out. First, readability and text length have a correlation of ρ = 0.296, which is not surprising given that shorter texts are easier to read, which the Flesch-Kincaid grade level formula accounts for. Second, the two success features, score and number of responses, have a correlation of ρ = 0.558, meaning that comments that receive a high score also tend to receive more replies. Overall, these correlation results indicate that each feature captures interesting aspects of its own. All correlation coefficients are strongly significant (p-values close to zero) under a significance test with the null hypothesis of no correlation (also accounting for multiple comparisons via, e.g., Bonferroni adjustment).
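To make the readability feature concrete, the following is a minimal sketch of the Flesch-Kincaid grade level computation. The syllable counter is a rough vowel-group heuristic and the tokenization is simplified, so it only approximates the values used in our pipeline; the cap at 22 follows the description above.

```r
# Minimal sketch of the Flesch-Kincaid grade level used as a readability
# feature. The syllable counter is a rough vowel-group heuristic, not the
# exact tokenizer used in our pipeline.
count_syllables <- function(word) {
  # Count groups of consecutive vowels as syllables (at least one per word).
  max(1, length(gregexpr("[aeiouy]+", tolower(word))[[1]]))
}

flesch_kincaid <- function(text) {
  sentences <- max(1, length(unlist(strsplit(text, "[.!?]+"))))
  words_vec <- unlist(strsplit(text, "\\s+"))
  words_vec <- words_vec[nchar(words_vec) > 0]
  words     <- max(1, length(words_vec))
  syllables <- sum(sapply(words_vec, count_syllables))
  grade <- 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
  min(grade, 22)  # cap at the maximum grade of 22
}

flesch_kincaid("OK.")  # single one-syllable sentence: -3.4
```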


Table 2. Pearson correlation between features. https://doi.org/10.1371/journal.pone.0161636.t002

Sessions

We use the time differences between consecutive comments as session indicators. To that end, we followed the approach advised in [18], which identified a strong regularity in how social media users initiate events across several different platforms. The authors argue that a good rule of thumb is an inactivity threshold of 60 minutes to separate sessions. As recommended there, we first visually and analytically inspected the log-scaled histogram of time differences between consecutive comments (after cleaning comments, before filtering sessions), depicted in Fig 4. Similar to the results reported for other platforms [18, 35], there is a peak at very short time scales (minutes) and a peak at time differences of one day, suggesting daily routines. Fitting a Gaussian mixture model with two components (log-normal mixture, fit via the EM algorithm) to the log-transformed data yields the two means μ1 = 6.85 min and μ2 = 794 min. A natural valley is visible between the two peaks; thus, combined with the results of the mixture fitting, we follow the rule of thumb of [18] and pick a time difference Δt_i,j of one hour between consecutive comments C_i and C_j to separate sessions (a minimal code sketch follows below). Note that other (similar) choices of break time (e.g., 30 or 90 minutes) produce similar inference.
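A minimal sketch of this segmentation, assuming a data frame of comments with hypothetical columns user and created_utc (UNIX timestamp in seconds):

```r
# Minimal sketch of session segmentation: a new session starts whenever the
# gap between a user's consecutive comments exceeds 60 minutes.
# Column names ("user", "created_utc", in seconds) are hypothetical.
segment_sessions <- function(df, break_min = 60) {
  df <- df[order(df$user, df$created_utc), ]
  # Gap (in seconds) to the user's previous comment; Inf for the first one.
  gap <- ave(df$created_utc, df$user, FUN = function(t) c(Inf, diff(t)))
  # Counting the breaks cumulatively per user yields a session id.
  df$session <- ave(as.numeric(gap > break_min * 60), df$user, FUN = cumsum)
  df
}
```

Re-running the segmentation with break_min = 30 or 90 corresponds to the robustness check mentioned above.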


Fig 4. Time differences between consecutive comments of users on Reddit. The x-axis depicts the time differences between consecutive comments (tick labels refer to major ticks) and the y-axis the respective frequencies. The log-scaled histogram shows a peak at very short time scales (minutes) and at very long ones (one day), suggesting daily routines. A natural valley emerges between both peaks, arguing for the choice of a one-hour break between comments to separate sessions. https://doi.org/10.1371/journal.pone.0161636.g004

Data pre-processing

We took several steps to pre-process and clean the data. First, we removed users from our data based on these rules: (i) they have posted the exact same comment more than 100 times, (ii) their username is part of an unofficial Reddit bot list [36], or (iii) their account has been deleted; this accounts for around 4.5M comments. Second, we deleted all sessions containing at least one comment (i) that has been deleted, (ii) that is completely empty, or (iii) that contains characters outside the ASCII character set (e.g., Chinese characters)—accounting for an additional 3M comments. Finally, we removed all sessions containing more than 10 comments, accounting for around 7.25M comments; this allows for easier experimental tractability and removes further bot accounts. Note, though, that including these sessions in the experiments does not change the main observations of this paper. Our final dataset contains 40,064,930 comments produced by 2,669,969 different users and posted in 47,462 different subreddits.

Randomizing sessions

For comparison, we created two randomized datasets to which we applied our analysis. The first baseline—which we call the randomized session dataset—attempts to preserve as much information as possible while randomizing the process of deriving user commenting sessions. To do so, we shuffled the time differences Δt_i,j between consecutive comments made by each user, but preserved all other features, including the temporal order of comments. Then, we derived user activity sessions based on the shuffled times. An example is provided in Fig 1 (middle row). This baseline is very conservative in terms of randomization and retains many original sessions; for example, many parts of a session stay intact because only the short time differences are potentially swapped, which does not alter the sessions. The second baseline—which we call the randomized index dataset—keeps the sessions intact, but randomizes the order of comments inside each session (e.g., exchanging C_1 and C_3). Thus, it does not preserve the original order of comments; see Fig 1 (bottom row). A minimal sketch of both baselines is given below. Multiple randomization iterations did not alter the results.
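Both baselines can be sketched as follows, under the same hypothetical data frame layout as in the session segmentation sketch above:

```r
# Minimal sketches of the two randomized baselines (same hypothetical data
# frame layout as above: one row per comment, sorted by user and time).

# Baseline 1, "randomized session dataset": shuffle the time differences
# between a user's consecutive comments while keeping the comment order,
# then re-derive sessions from the new timestamps with segment_sessions().
randomize_gaps <- function(timestamps) {
  gaps <- diff(timestamps)
  if (length(gaps) > 1) gaps <- sample(gaps)  # sample() misbehaves on length 1
  cumsum(c(timestamps[1], gaps))
}

# Baseline 2, "randomized index dataset": keep the sessions intact but
# shuffle the order of comments inside each session.
randomize_index <- function(df) {
  shuffle <- function(i) if (length(i) > 1) sample(i) else i
  new_order <- ave(seq_len(nrow(df)), df$user, df$session, FUN = shuffle)
  df[new_order, ]
}
```

Applying randomize_gaps per user and then segment_sessions reproduces the middle row of Fig 1; randomize_index reproduces the bottom row.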
Mixed-effects models

To statistically model performance deterioration, we used mixed-effects models, which allow us to incorporate heterogeneous effects and behavioral differences and to account for the non-independent nature of the longitudinal data at hand. Mixed-effects models include both fixed and random effects; following [37], we refer to fixed effects as effects that are constant across levels (e.g., individuals) and to random effects as those that vary between levels. An overview of mixed-effects models can be found in [38].

In our setting, the introduction of random effects enabled us to consider variation between different levels, the most important level being the user, accounting for inherent differences between individual Reddit users (e.g., the average quality of their comments). As highlighted in [39], mixed-effects models have further advantages, such as flexibility in handling (i) missing data and (ii) continuous and categorical responses, as well as (iii) the capability of modeling heteroscedasticity. For simplicity, we specify mixed-effects models using the following syntax [40]:

outcome ~ 1 + fixed effect(s) + (random effect(s) | level)    (1)

This specification describes a model where an outcome (dependent variable) is explained by an intercept (the 1), one or more fixed effects, and one or more random effects allowing for variation between levels. For all our experiments, we use the lme4 R package [40] and fit the models with maximum likelihood. Examples of model specifications can be found online [41]. As each of our experiments targets one of our four features, which exhibit different properties—e.g., count (text length) vs. continuous (readability) data—we performed extensive model analytics to find the most suitable model for each problem setting. Overall, we aimed at finding the most appropriate model for each feature by considering not only simple linear mixed-effects models, but also generalized mixed-effects models, such as Poisson or negative binomial regression, which are suitable for count data. When fitting regression models, several assumptions need to be checked; for linear models, for example, we need to check for normally distributed residuals and for heteroscedasticity. We therefore performed model diagnostics on the individual models and successively tried to improve them, for example moving from a linear model to a Poisson model. Additionally, we checked for, and accounted for, overdispersion and zero-inflation in our count data models (Poisson and negative binomial). We also tackled problems such as multicollinearity, outlier bias, and convergence issues. The models reported in this article are the ones we judged most useful for each setting after the extensive model diagnostics outlined above. To judge the significance of fixed and random effects, we followed an incremental modeling approach, starting with the simplest model, which explains the outcome by the intercept alone, and subsequently adding effects. To compare the relative fits of these models we used the Bayesian Information Criterion (BIC) [42], which balances the likelihood of a model against its complexity. The interpretation table presented by Kass and Raftery [43] can be consulted to determine the strength of differences between BIC scores; this allows us to gain confidence in the significance of observed effects and to draw inferences from them. All fixed effects reported in this work are highly significant—except where mentioned (randomized baseline data)—meaning that the differences in BIC between the models including and excluding an effect are far larger than the threshold of 10 that indicates very strong evidence [43]. For completeness, we also conducted additional significance tests for the fixed effects, such as t-tests and F-tests, confirming our BIC diagnostics.
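A minimal sketch of this incremental approach for one count outcome (number of responses), again with hypothetical column names; the models actually reported were selected only after the full diagnostics described above:

```r
# Minimal sketch of the incremental modeling approach for one outcome
# (here: number of responses, a count variable). Column names are
# hypothetical; the reported models were selected after full diagnostics.
library(lme4)

# Intercept-only baseline with a random intercept per user.
m0 <- glmer(n_responses ~ 1 + (1 | user),
            data = comments, family = poisson)

# Add the position of the comment within the session as a fixed effect.
m1 <- glmer(n_responses ~ comment_index + (1 | user),
            data = comments, family = poisson)

# A BIC difference above 10 indicates very strong evidence [43].
BIC(m0, m1)

# If the Poisson model is overdispersed, a negative binomial variant:
m1_nb <- glmer.nb(n_responses ~ comment_index + (1 | user), data = comments)
```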
To enable the reader to follow our individual steps and to allow independent inference, we provide detailed reports for each experiment—based on a sample of 1 million data points—in the form of jupyter notebooks using R kernels, both online [27] and in the supplementary material (S1–S8 Notebooks). In the main article, we only report the fixed effects and corresponding inference, as those are the main effects we are interested in; the full regression outputs are available in the supplementary information (S1–S8 Tables). Making our code and all experiments publicly available allows us to carefully document the results and encourages other researchers to draw their own inferences and further refine our models. The underlying Reddit data is freely available [31].

Author Contributions Conceived and designed the experiments: PS EF FK KL MS. Performed the experiments: PS FK. Analyzed the data: PS EF FK KL. Contributed reagents/materials/analysis tools: PS FK. Wrote the paper: PS EF FK KL MS.