Abstract Metrics derived from Twitter and other social media, often referred to as altmetrics, are increasingly used to estimate the broader social impacts of scholarship. Such efforts, however, may produce highly misleading results, as the entities that participate in conversations about science on these platforms are largely unknown. For instance, if altmetric activities are generated mainly by scientists, do they really capture the broader social impacts of science? Here we present a systematic approach to identifying and analyzing scientists on Twitter. Our method can identify scientists across many disciplines without relying on external bibliographic data, and it can be easily adapted to identify other stakeholder groups in science. We investigate the demographics, sharing behaviors, and interconnectivity of the identified scientists. We find that Twitter has been employed by scholars across the disciplinary spectrum, with an over-representation of social scientists and computer and information scientists; an under-representation of mathematical, physical, and life scientists; and a better representation of women compared to scholarly publishing. Analysis of the sharing of URLs reveals a distinct imprint of scholarly sites, yet only a small fraction of shared URLs are science-related. We find assortative mixing with respect to disciplines in the networks between scientists, suggesting the maintenance of disciplinary walls in social media. Our work contributes to the literature both methodologically and conceptually: we provide new methods for disambiguating and identifying particular actors on social media and for describing the behaviors of scientists, thus providing foundational information for the construction and use of indicators on the basis of social media metrics.

Citation: Ke Q, Ahn Y-Y, Sugimoto CR (2017) A systematic identification and analysis of scientists on Twitter. PLoS ONE 12(4): e0175368. https://doi.org/10.1371/journal.pone.0175368
Editor: Lutz Bornmann, Administrative Headquarter, GERMANY
Received: August 19, 2016; Accepted: March 26, 2017; Published: April 11, 2017
Copyright: © 2017 Ke et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Raw data were collected via Twitter REST APIs (https://dev.twitter.com/rest/public). They cannot be shared to comply with the Twitter terms of service.
Funding: YYA acknowledges support from Microsoft Research. CRS is supported by the Alfred P. Sloan Foundation Grant #G-2014-3-25. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: YYA acknowledges support from Microsoft Research. This does not alter our adherence to PLOS ONE policies on sharing data and materials.

Introduction Twitter and other social media have become important communication channels for the general public. It is thus not surprising that various stakeholder groups in science also participate on these platforms. Scientists, for instance, use Twitter for generating research ideas and for disseminating and discussing scientific results [1–3]. Many biomedical practitioners use Twitter for continuing education (e.g., journal clubs on Twitter) and other community-based purposes [4]. Policy makers are active on Twitter, opening lines of discourse between scientists and those making policy on science [5]. Quantitative investigations of scholarly activities on social media, often called altmetrics, can now be done at scale, given the availability of APIs on several platforms, most notably Twitter [6]. Much of the extant literature has focused on comparing the amount of online attention with the traditional citations collected by publications, showing low levels of correlation. Such low correlation has been used to argue that altmetrics provide alternative measures of impact, particularly the broader impact on society [7], given that social media provide open platforms where people with diverse backgrounds can engage in direct conversations without any barriers. However, this argument has not been empirically grounded, impeding further understanding of the validity of altmetrics and the broader impact of articles. A crucial step towards empirical validation of the broader impact claim of altmetrics is to identify scientists on Twitter, because altmetric activities are often assumed to be generated by “the public” rather than scientists, although this is not necessarily the case. To verify this assumption, we need to be able to distinguish scientists from non-scientists. Although there have been some attempts to do so, they suffer from a narrow disciplinary focus [8–10] and/or small scale [8, 10, 11].
Moreover, most studies use purposive sampling techniques, pre-selecting candidate scientists based on their success in other sources (e.g., highly cited in Web of Science), instead of organically finding scientists on the Twitter platform itself. Such reliance on bibliographic databases binds these studies to traditional citation indicators and thus introduces bias. For instance, this approach overlooks early-career scientists and favors certain disciplines. Here we present the first large-scale and systematic study of scientists across many disciplines on Twitter. Because our method does not rely on external bibliographic databases and can identify any user type that is captured in Twitter lists, it can be adapted to identify other types of stakeholders, occupations, and entities. Our study serves as a basic building block for studying scholarly communication on Twitter and the broader impact of altmetrics.

Background We classify the current literature into two main categories, namely product-centric and producer-centric perspectives. The former examines the sharing of scholarly papers in social media and its impact; the latter focuses on who generates the attention. Product-centric perspective. Priem and Costello formally defined Twitter citations as “direct or indirect links from a tweet to a peer-reviewed scholarly article online” and distinguished between first- and second-order citations based on whether there is an intermediate web page mentioning the article [12]. The accumulation of these links, they argued, would provide a new type of metric, coined “altmetrics,” which could measure the broader impact beyond academia of diverse scholarly products [13]. Many studies found that only a small portion of research papers are mentioned on Twitter [6, 14–19]. For instance, a systematic study covering 1.4 million papers indexed by both PubMed and Web of Science found that only 9.4% of them have mentions on Twitter [17], yet this is much higher than the coverage of other social media metrics except Mendeley. Coverage varies across disciplines: medical and social sciences papers, which may be more likely to appeal to a wider public, are more likely to be covered on Twitter [19, 20]. Mixed results have been reported regarding the correlation between altmetrics and citations [17, 21–24]. A recent meta-analysis showed that the correlation is negligible (r = 0.003) [25]; however, there are dramatic differences across studies depending on disciplines, journals, and time window. Producer-centric perspective. Survey-based studies examined how scholars present themselves on social media [26–30]. A large-scale survey with more than 3,500 responses conducted by Nature in 2014 revealed that more than 80% of respondents were aware of Twitter, yet only 13% were regular users [29]. A handful of studies analyzed how Twitter is used by scientists.
Priem and Costello examined 28 scholars to study how and why they share scholarly papers on Twitter [12]. An analysis of 672 emergency physicians concluded that many users do not connect to their colleagues, while a small number of users are tightly interconnected [4]. Holmberg and Thelwall selected researchers in 10 disciplines and found clear disciplinary differences in Twitter usage, such as more retweets by biochemists and more sharing of links by economists [11]. Note that these studies first selected scientists outside of Twitter and then manually searched for their Twitter profiles. These studies thus have two limitations. First, the sample size is small because of the manual searching [4, 8, 11, 12, 31]. Second, the samples are biased towards more well-known scientists. One notable exception is a study by Hadgu and Jäschke, who presented a supervised learning approach to identifying researchers on Twitter, in which the training set contains users who were related to some computer science conference handles [9, 32]. Although this study used a more systematic method, it still relied on DBLP, an external bibliographic dataset for computer science, and is confined to a single discipline.

Identifying scientists Scientist occupations Defining science and scientists is a Herculean task and beyond the scope of this paper. We thus adopt a practical definition, turning to the 2010 Standard Occupational Classification (SOC) system (http://www.bls.gov/soc/) released by the Bureau of Labor Statistics, United States Department of Labor. We use SOC not only because it provides practical and authoritative guidance for the definition of scientists but also because many official statistics (e.g., total employment of social scientists) are released according to this classification system. SOC is a hierarchical system that classifies workers into 23 major occupational groups, among which we are interested in two, namely (1) Computer and Mathematical Occupations (code 15-0000) and (2) Life, Physical, and Social Science Occupations (code 19-0000). Other groups, such as Management Occupations (code 11-0000) and Community and Social Service Occupations (code 21-0000), are not related to science occupations. From the two groups, we compile 28 scientist occupations (S1 Table). Although authoritative, the SOC does not always match our intuitive classifications of scientists. For instance, “biologists” does not appear in the classification. We therefore consider another source, Wikipedia, to augment the set of scientist occupations. In particular, we add the occupations listed at http://en.wikipedia.org/wiki/Scientist#By_field. We then compile a list of scientist titles from the two sources by combining titles from SOC, titles from Wikipedia, and the illustrative examples given under each SOC occupation. We also add two general titles: “scientists” and “researchers.” For each title, we consider its singular form and the core disciplinary term. For instance, for the title “clinical psychologists,” we also consider “clinical psychologist,” “psychologists,” and “psychologist.” We assemble a set of 322 scientist titles using this method (S1 Data).
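The title expansion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used to build S1 Data: the function name `title_variants` is hypothetical, the singular form is obtained by naively dropping a trailing "s", and the core disciplinary term is assumed to be the last word of the title.

```python
def title_variants(title: str) -> set[str]:
    """Expand one plural title into the variants described in the text.

    Assumptions (not from the paper): the singular form is obtained by
    dropping a trailing "s", and the core disciplinary term is the last word.
    """
    def singular(phrase: str) -> str:
        # Naive singularization; irregular plurals are not handled.
        return phrase[:-1] if phrase.endswith("s") else phrase

    title = title.lower().strip()
    core = title.split()[-1]  # e.g. "psychologists" in "clinical psychologists"
    return {title, singular(title), core, singular(core)}


def build_lexicon(titles: list[str]) -> set[str]:
    """Union of the variants of every title."""
    lexicon: set[str] = set()
    for title in titles:
        lexicon |= title_variants(title)
    return lexicon


variants = title_variants("clinical psychologists")
lexicon = build_lexicon(["clinical psychologists", "economists"])
```

For “clinical psychologists” the sketch yields the four forms listed in the text; a real lexicon would need manual review for irregular plurals and for titles whose core term is not the final word.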
List-based identification of scientists Our method of identifying scientists is inspired by a previous study that used Twitter lists to identify user expertise [33]. A Twitter list is a set of Twitter users that can be created by any Twitter user. The creator of a list needs to provide a name and, optionally, a description. Although the purpose of lists is to help users organize their subscriptions, the names and descriptions of lists can be leveraged to infer attributes of the users in the lists. Imagine a user creating a list called “economist” and putting @BetseyStevenson in it; this signals that @BetseyStevenson may be an economist. If @BetseyStevenson is included in numerous lists all named “economist,” which means that many independent Twitter users classify her as an economist, it is highly likely that @BetseyStevenson is indeed an economist. This is illustrated in Fig 1, which shows the word cloud of the names of Twitter lists containing @BetseyStevenson. We can see that “economist” is one of the words that appear most frequently in the list names, signaling the occupation of this user. In other words, we “crowdsource” the identity of each Twitter user.
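The crowdsourcing idea can be made concrete by counting how many of a user's list names contain each scientist title. The sketch below is an assumption-laden illustration: the list names are invented, the helper names are hypothetical, and whole-word matching on lowercase ASCII tokens is a simplification, since the text does not specify how titles are matched against list names.

```python
import re
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, keep ASCII letter runs, and pad with spaces so that
    titles can be matched as whole words or whole phrases."""
    return " " + " ".join(re.findall(r"[a-z]+", text.lower())) + " "


def count_title_votes(list_names, titles):
    """Count, for each title, the number of list names that contain it."""
    votes = Counter()
    for name in list_names:
        padded = normalize(name)
        for title in titles:
            if f" {title} " in padded:
                votes[title] += 1
    return votes


# Hypothetical names of lists that contain one user.
list_names = ["economist", "Economists I follow", "policy wonks", "econ-bloggers"]
titles = {"economist", "economists"}
votes = count_title_votes(list_names, titles)
```

A user whose lists accumulate many votes for the same title is, under this reading, likely to hold that occupation.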

Fig 1. User identity inferred from Twitter list names. We show the word cloud of the names of Twitter lists containing @BetseyStevenson. https://doi.org/10.1371/journal.pone.0175368.g001

In principle, we could use Twitter’s memberships API (https://dev.twitter.com/rest/reference/get/lists/memberships) to get, for each user, all the lists containing this user, and then infer whether this user is a scientist by analyzing the names and descriptions of these lists. However, this method is infeasible, because (1) most users are not scientists, (2) the distribution of listed counts is right-skewed: Lady Gaga, for example, is listed more than 237K times (https://www.electoralhq.com/twitter-users/most-listed), and (3) the Twitter API has rate limits. We instead employ a previously introduced list-based snowball sampling method [34] that starts from a given initial set of users and expands to discover more. We improve this approach by obtaining the job title lexicon more systematically, as described in the previous section. Moreover, instead of choosing a few preselected users, we obtain a total of 8,545 seed users by leveraging the results of a previous work that identified user attributes using Twitter lists [33] (S1 Text). We use snowball sampling (breadth-first search) on Twitter lists. We first identify seed users (S1 Text) and put them into a queue. For each public user in the queue, we get all the lists in which the user appears, using the Twitter memberships API. Then, for each public list among the resulting lists whose name contains at least one scientist title, we get its members using the Twitter members API (https://dev.twitter.com/rest/reference/get/lists/members) and put those who have not been visited into the queue. The two steps are repeated until the queue is empty, which completes the sampling process.
Note that, to remove many organizations and anonymous users as well as to speed up the sampling, we only consider users whose names contain spaces. We acknowledge that this may drop many users with non-English names or those who do not disclose their names in a standard way. Also note that this procedure is inherently blind towards scientists who are not listed. From the sampling procedure, we get 110,708 users appearing in 4,920 lists whose names contain scientist titles. To increase the precision of our method, the final dataset retains only those users whose profile descriptions also contain scientist titles. A total of 45,867 users remain.
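The sampling and filtering steps above can be sketched as a breadth-first search over in-memory data. This is a minimal illustration, not the authors' implementation: the dictionaries `memberships`, `lists`, `display_names`, and `bios` are hypothetical stand-ins for the responses of the Twitter memberships, members, and user-profile endpoints, the toy data are invented, and rate limits and the public/private distinction are ignored. Applying the name-contains-space filter when a member is enqueued is also an assumption.

```python
import re
from collections import deque


def contains_title(text, titles):
    """True if any title occurs in the text as a whole word or phrase."""
    padded = " " + " ".join(re.findall(r"[a-z]+", text.lower())) + " "
    return any(f" {title} " in padded for title in titles)


def snowball(seeds, memberships, lists, display_names, titles):
    """Breadth-first expansion from seed users through title-named lists."""
    queue = deque(seeds)
    visited = set(seeds)
    expanded = set()  # lists whose members were already fetched
    while queue:
        user = queue.popleft()
        for list_id in memberships.get(user, []):
            name, members = lists[list_id]
            if list_id in expanded or not contains_title(name, titles):
                continue
            expanded.add(list_id)
            for member in members:
                # Keep only unvisited users whose display name has a space.
                if member not in visited and " " in display_names.get(member, ""):
                    visited.add(member)
                    queue.append(member)
    return visited, expanded


# Hypothetical toy data standing in for API responses.
titles = {"economist", "economists"}
display_names = {"a": "Ann Lee", "b": "Bo Chen", "c": "EconBot", "d": "Dee Roy"}
lists = {"L1": ("Economists", ["a", "b", "c"]), "L2": ("friends", ["b", "d"])}
memberships = {"a": ["L1"], "b": ["L1", "L2"], "c": ["L1"], "d": ["L2"]}
bios = {"a": "Labor economist and teacher", "b": "Coffee lover"}

users, matched_lists = snowball(["a"], memberships, lists, display_names, titles)
# Post-processing: keep users whose profile description also has a title.
final_users = {u for u in users if contains_title(bios.get(u, ""), titles)}
```

In the toy run, user `c` is dropped by the name filter, list `L2` is never expanded because its name holds no title, and only user `a` survives the profile filter.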

Discussion Our work presents an improvement over earlier methods of identifying scientists on Twitter by selecting a wider array of disciplines and extending the sampling method beyond the paper-centric approach. Our method may serve as a useful step towards more extensive and sophisticated analyses of scientists on Twitter: it cannot be assumed that the population of scientists on Twitter is similar in composition and behavior to the population of scientists represented in traditional bibliometric databases. Therefore, sampling should be independent of these external data and metrics. Furthermore, in seeding with terms from the Standard Occupational Classification provided by the Bureau of Labor Statistics, we are able to classify both scholarly and practitioner scientific groups, thus widening the conceptualization of scientists on Twitter. The triangulation of list- and bio-based classifications of scholars allows us to integrate two perspectives on identity: how scientists self-identified and how they were identified by the community. Our approach favors precision over recall; that is, we feel confident that those identified were scientists, but there is a much larger population of scientists who were not identified in this way. Our disciplinary analyses suggest that Twitter is employed by scholars across the disciplinary spectrum: historians were widely represented, as were physicists, political scientists, computer scientists, biologists, economists, and sociologists. Practitioners were also highly represented: psychologists and nutritionists were among the five disciplines with the highest number of identified members. However, a large percentage were also explicitly academic scholars: self-identified students and faculty members comprised 21.9% of the total population (S1 Text).
Our analysis suggests that social scientists are overrepresented on Twitter relative to their share of the scientific workforce, and that mathematicians are particularly underrepresented. Our findings resonate with previous results [19] on the social media metric coverage of publications by field, which showed higher Twitter density in the social and life sciences and lower density for mathematics and computer science. This provides some intuitive alignment: if a group is systematically underrepresented on the platform, we might expect a lower degree of activity around papers within that discipline. Of those whose gender could be identified, 38.6% were female and 61.4% were male. This is a more equal representation of women than is seen in other statistics on the scientific workforce, such as number of publications [35], suggesting that Twitter scientists may be more gender-balanced than the population of publishing scientists. As might be expected, scientists tweet in much the same way as the general population: Instagram, Facebook, and YouTube are among the most tweeted domains, along with general news sites such as The Guardian, the New York Times, and the BBC. However, scientists’ tweets also show a distinct imprint of scholarly sites, such as generalist publications (i.e., Nature and Science), and reinforce the academic oligarchy of journal publishers [38]. The popular pre-print server, arXiv, also occupies a prominent spot among the top 20 cited domains. Overall, however, tweets containing URLs identified as scientific represented only a small fraction of all tweets, suggesting that the content of scientists’ tweets is highly heterogeneous. This is consistent with previous studies, which showed a strong blurring of boundaries between the personal and the professional on Twitter under a single Twitter handle [30]. We operationalized centrality in three ways: by followers, retweets, and mentions.
Social and life scientists dominate these networks, while mathematicians and computer scientists are relatively isolated. However, once these centralities are normalized by the size of the group, social scientists actually underperform. This is essential information for the construction of indicators on the basis of these metrics. Just as it is standard bibliometric practice to normalize by field, so too should altmetric practices integrate normalization, given the uneven distribution of disciplines represented on these platforms. Analysis of assortativity suggests that disciplinary communities prevail in the unfiltered realm of social media: scholars from the same discipline tended to follow each other. This could suggest a negative result in terms of the broader impact of social media metrics: if disciplinary walls are maintained in this space, it may not provide the unfettered access to scholarship that was promised. Furthermore, networks of communities reveal some isolation: e.g., although they represent a large proportion of the total users identified, historians are largely isolated in the Twitter network. Our work has the following limitations. First, the reliance on Twitter lists makes our method inherently blind towards those scientists who are not listed. Furthermore, the use of lists may skew the sample towards elite and high-profile science communicators (e.g., Neil deGrasse Tyson). Second, in the sampling process, the exclusion of users whose names do not contain spaces biases the sample towards English-speaking users and leaves many scientists undiscovered. Third, the existence of private lists prevents us from retrieving their members and thus hinders the discovery of new users. Fourth, how list members were curated is largely unknown; curation might be done automatically, which would decrease the precision of the identified scientists.
Fifth, in the post-processing, filtering out users whose profile descriptions do not contain scientist titles biases the sample towards self-disclosed scientists.
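To make the notion of assortative mixing by discipline concrete, the sketch below computes Newman's assortativity coefficient for a categorical attribute, r = (Σ_i e_ii − Σ_i a_i b_i) / (1 − Σ_i a_i b_i), where e_ij is the fraction of edges going from discipline i to discipline j, and a_i and b_i are the row and column sums of e. This is a generic textbook formulation; whether it matches the exact computation used in the analysis is an assumption, and the function name, edge list, and labels are hypothetical.

```python
from collections import Counter


def attribute_assortativity(edges, label):
    """Newman's assortativity coefficient for a categorical node attribute.

    edges: iterable of (source, target) pairs, each counted once.
    label: dict mapping node -> category (here, a discipline).
    Undefined (division by zero) if every edge joins the same category.
    """
    pair_counts = Counter((label[u], label[v]) for u, v in edges)
    total = sum(pair_counts.values())
    out_frac = Counter()  # a_i: fraction of edges leaving category i
    in_frac = Counter()   # b_i: fraction of edges entering category i
    same = 0.0            # sum of e_ii: fraction of within-category edges
    for (src, dst), count in pair_counts.items():
        frac = count / total
        out_frac[src] += frac
        in_frac[dst] += frac
        if src == dst:
            same += frac
    expected = sum(out_frac[k] * in_frac[k] for k in out_frac)
    return (same - expected) / (1 - expected)


# Hypothetical follower edges among four users in two disciplines.
label = {1: "economics", 2: "economics", 3: "biology", 4: "biology"}
within = [(1, 2), (2, 1), (3, 4), (4, 3)]   # only same-discipline follows
across = [(1, 3), (3, 1), (2, 4), (4, 2)]   # only cross-discipline follows
r_within = attribute_assortativity(within, label)
r_across = attribute_assortativity(across, label)
```

A value near 1 means users follow almost exclusively within their own discipline, a value near 0 means discipline plays no role, and negative values indicate a preference for other disciplines.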

Conclusion In this work, we have developed a systematic method for discovering scientists who are recognized as scientists by other Twitter users through Twitter lists and who self-identify as scientists through their profiles. We have studied the demographics of the identified scientists in terms of discipline and gender, finding an over-representation of social scientists, an under-representation of mathematical and physical scientists, and a better representation of women compared to the statistics from scholarly publishing. We have analyzed the sharing behaviors of scientists, reporting that only a small portion of shared URLs are science-related. Finally, we find assortative mixing with respect to disciplines in the follower, retweet, and mention networks between scientists. Future work is needed to examine the use of machine learning methods [9] that leverage information from retweet and mention networks to improve our identification method, to investigate the degree to which a more equal representation of women is due to age, status, or the representation of practitioners in our dataset, and to ascertain to what extent altmetric communities (i.e., follow, retweet, and mention networks) align with or differ from bibliometrically derived communities (i.e., citation and collaboration networks).

Acknowledgments We thank Onur Varol for early discussions and Filippo Radicchi for providing computing resources. YYA acknowledges support from Microsoft Research. CRS is supported by the Alfred P. Sloan Foundation Grant #G-2014-3-25.

Author Contributions Conceptualization: QK. Data curation: QK. Formal analysis: QK YYA CRS. Funding acquisition: YYA CRS. Investigation: QK YYA CRS. Methodology: QK YYA CRS. Project administration: QK. Resources: QK YYA CRS. Software: QK. Supervision: QK YYA CRS. Validation: QK YYA CRS. Visualization: QK YYA CRS. Writing – original draft: QK CRS. Writing – review & editing: QK YYA CRS.