Abstract We perform a large-scale analysis of language diatopic variation using geotagged microblogging datasets. By collecting all Twitter messages written in Spanish over more than two years, we build a corpus from which a carefully selected list of concepts allows us to characterize Spanish varieties on a global scale. A cluster analysis proves the existence of well-defined macroregions sharing common lexical properties. Remarkably enough, we find that the Spanish language is split into two superdialects, namely, an urban speech used across major American and Spanish cities and a diverse form that encompasses rural areas and small towns. The latter can be further clustered into smaller varieties with a stronger regional character.

Citation: Gonçalves B, Sánchez D (2014) Crowdsourcing Dialect Characterization through Twitter. PLoS ONE 9(11): e112074. https://doi.org/10.1371/journal.pone.0112074
Editor: Tobias Preis, University of Warwick, United Kingdom
Received: July 28, 2014; Accepted: September 27, 2014; Published: November 19, 2014
Copyright: © 2014 Gonçalves, Sánchez. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The authors confirm that all data underlying the findings are fully available without restriction. All relevant data are publicly available. Documentation on how to query the Twitter API can be found here: https://dev.twitter.com/overview/documentation.
Funding: The authors have no support or funding to report.
Competing interests: The authors have declared that no competing interests exist.

Introduction Language is the most characteristic trait of human communication but takes on many heterogeneous forms. Dialects, in particular, are linguistic varieties which differ phonologically, grammatically or lexically in geographically separated regions [1]. However, despite its fundamental importance and many recent developments, the way language varies spatially is still poorly understood. Traditional methodological approaches in the study of regional dialects are based on interviews and questionnaires administered by a researcher to a small number (typically, a few hundred) of selected speakers known as informants [2]. Based on the answers provided, linguistic atlases are generated that are naturally limited in scope, subject to the particular choice of locations and informants, and perhaps not completely free of unwanted influences from the dialectologist. Another approach is the use of mass media corpora, which provide a wealth of information on language usage but suffer from the tendency of media and newspapers to use standard norms (the “BBC English” for example) [3], which limits their usefulness for the study of informal local variations.

On the other hand, the recent rise of online social tools has resulted in an unprecedented avalanche of content that is naturally and organically generated by millions or tens of millions of geographically distributed individuals who are likely to speak in vernacular and do not feel constrained to use standard linguistic norms. This, combined with the widespread usage of GPS-enabled smartphones to access social media tools, provides a unique opportunity to observe how languages are used in everyday life and across vast regions of space.

In this work, we use a large dataset of geolocated tweets to study local language variations across the world. Similar datasets have recently been used to map public opinion and social behavior [4]–[11] and to analyze planetary language diversity [12].
Preliminary results demonstrating the feasibility of this approach have thus far been limited to considering only a few words or just a few geographical areas [13], [14]. Here, we move beyond the mere proof of concept and provide a detailed global picture of spatial variants for a specific language. For definiteness, we choose Spanish, as it is not only one of the most spoken languages in the world but also has the added advantage of being spatially distributed across several continents [15], [16]. Several other languages such as Mandarin or English have more native speakers or higher supra-regional status, but their use is hindered by the limited local availability of Twitter (Mandarin) or a high abundance of homographs that precludes a detailed lexicographic analysis (English).

Methods We used the Twitter gardenhose to gather an unbiased sample of all tweets written in Spanish that contained GPS information over the course of more than two years. Language detection was performed using the state-of-the-art Chromium Compact Language Detector [17] software library. The resulting dataset contained over geolocated tweets written in Spanish distributed across the world (see Fig. 1). As expected, most tweets are localized in Spain, Spanish America and extensive areas of the United States. These results are consistent with recent sociolinguistic data [18], [19], providing an initial level of validation to our approach. Interestingly, we also find significant contributions from major non-Spanish-speaking cities in Latin America and Western Europe, likely due to a considerable population of temporary settlers and tourists. See Ref. [12] for further details and results on this dataset.
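As a rough illustration of the selection step described above, the following Python sketch keeps only records that carry coordinates and are detected as Spanish. It is not the authors' pipeline: the record layout (a dict with `text` and `coordinates` keys) and the `detect_language` callable, which is assumed to return an ISO 639-1 code, are hypothetical stand-ins for the actual Twitter payload and for whatever binding to the language detector is used.

```python
from typing import Callable, Dict, Iterable, Iterator


def filter_spanish_geotagged(
    tweets: Iterable[Dict],
    detect_language: Callable[[str], str],
) -> Iterator[Dict]:
    """Yield tweets that have coordinates and are detected as Spanish.

    Assumptions (not from the paper): each tweet is a dict with a "text"
    string and a "coordinates" entry that is either None or a (lat, lon)
    pair; detect_language returns a code such as "es" or "en".
    """
    for tweet in tweets:
        # Skip records without GPS information.
        if tweet.get("coordinates") is None:
            continue
        # Keep only text that the detector labels as Spanish.
        if detect_language(tweet.get("text", "")) == "es":
            yield tweet
```

Passing the detector in as a function keeps the sketch independent of any particular language identification library.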

Figure 1. Spanish tweet locations. The overwhelming majority of Spanish tweets are located in Spain and Spanish America but significant contributions arise in certain US states and major Western European and Brazilian cities. https://doi.org/10.1371/journal.pone.0112074.g001

Traditional approaches in dialectology have preferred rural, male informants while modern analyses include interactions with urban speakers regardless of age and gender. On average, Twitter users are young, urban [20] and more likely to be technologically savvy, thus providing a more modern perspective on the use of language.

To determine exactly what the major local varieties of Spanish are, we use a list of concepts and utterances selected from an exhaustive study of lexical variants in major Spanish-speaking cities. The Varilex database [21] provides a comprehensive list of possible words representing several concepts, such as ‘popcorn’, ‘car’, ‘bus’, etc. We selected a subset of concepts that minimized possible semantic ambiguities by ensuring that they contained no common words. The complete list of words and maps for each concept studied can be accessed at http://www.bgoncalves.com/languages/spanish.html.

In our initial set of tweets we observed geolocated instances where words from our catalogue were used. Individual instances were then aggregated geographically into cells of , which corresponds to an approximate area of km² at the equator. Finally, we define the dominant word for each concept in each geographical cell by a simple majority rule and generate a matrix in which an element is 1 when the corresponding word is the dominant one for a given concept in the corresponding cell and 0 otherwise. The resulting matrix has rows and columns and constitutes the dataset used for the analysis presented in the remainder of this paper.
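The aggregation and majority-rule steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the cell size is left as a parameter (`cell_deg`, in degrees), the matrix is laid out with one row per cell and one column per word, "majority" is read as the most frequent word for a concept within a cell, and all function and variable names are hypothetical. The words in any example input are illustrative only.

```python
import math
from collections import Counter, defaultdict


def cell_of(lat, lon, cell_deg):
    """Return the integer (row, column) index of the grid cell containing a point."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def dominant_word_matrix(observations, concept_words, cell_deg):
    """Build a binary cell-by-word matrix of dominant words.

    observations: iterable of (lat, lon, word) tuples.
    concept_words: dict mapping each concept to its list of candidate words.
    Returns (cells, words, matrix), where matrix[i][j] is 1 when words[j] is
    the most frequent word for its concept in cells[i], and 0 otherwise.
    """
    word_to_concept = {w: c for c, ws in concept_words.items() for w in ws}

    # Count word occurrences separately for every (cell, concept) pair.
    counts = defaultdict(Counter)
    for lat, lon, word in observations:
        concept = word_to_concept.get(word)
        if concept is None:
            continue  # word is not in the catalogue
        counts[(cell_of(lat, lon, cell_deg), concept)][word] += 1

    cells = sorted({cell for cell, _ in counts})
    words = [w for c in sorted(concept_words) for w in concept_words[c]]
    row = {cell: i for i, cell in enumerate(cells)}
    col = {w: j for j, w in enumerate(words)}

    matrix = [[0] * len(words) for _ in cells]
    for (cell, _concept), counter in counts.items():
        # Ties are broken in favour of the word seen first (Counter behaviour).
        top_word, _ = counter.most_common(1)[0]
        matrix[row[cell]][col[top_word]] = 1
    return cells, words, matrix
```

Cells in which no catalogue word was observed do not appear as rows in this sketch; whether such cells should be kept as all-zero rows is a design choice the text does not settle.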

Conclusions Using a large dataset of user-generated content in vernacular Spanish, we analyse the diatopic structure of the modern-day Spanish language at the lexical level. By applying standard machine learning techniques, we find, for the first time, two large Spanish varieties which are related to, respectively, international and local speeches. We can also identify regional dialects and their approximate isoglosses. Our results are relevant for understanding empirically how languages are used in real life across vastly different geographical regions. We believe that our work has considerable latitude for further applications in the computational study of linguistics, a field full of rewarding opportunities. One can envisage much deeper analyses pointing the way towards new developments in sociolinguistic studies (bilingualism, creole varieties). Our work is based on a synchronous approach to language. However, the possibilities presented by the combination of large-scale online social networks with easily affordable GPS-enabled devices are so remarkable that they might permit us to observe, for the first time, how diatopic differences arise and develop in time.

Acknowledgments We thank I. Fernández-Ordóñez and F. Moreno Fernández for useful discussions. Disclaimer: This product was made utilizing the LandScan 2007 High Resolution global Population Data Set copyrighted by UT-Battelle, LLC, operator of Oak Ridge National Laboratory under Contract No. DE-AC05-00OR22725 with the United States Department of Energy. The United States Government has certain rights in this Data Set. Neither UT-BATTELLE, LLC NOR THE UNITED STATES DEPARTMENT OF ENERGY, NOR ANY OF THEIR EMPLOYEES, MAKES ANY WARRANTY, EXPRESS OR IMPLIED, OR ASSUMES ANY LEGAL LIABILITY OR RESPONSIBILITY FOR THE ACCURACY, COMPLETENESS, OR USEFULNESS OF THE DATA SET.

Author Contributions Conceived and designed the experiments: BG DS. Performed the experiments: BG. Analyzed the data: BG. Wrote the paper: BG DS.