An Open Resource for the Global Research Community

Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets related to COVID-19 chatter, acquired from the Twitter Stream. Since our first release we have received additional data from our new collaborators, allowing this resource to grow to its current size. Dedicated data gathering started on March 11th, yielding over 3.3 million tweets a day.

The data collected from the stream covers all languages, with English, Spanish, and French being the most prevalent. We release all tweets and retweets in the full_dataset.tsv file (681,410,294 unique tweets), and a cleaned version with no retweets in the full_dataset-clean.tsv file (162,073,837 unique tweets). There are several practical reasons for us to keep the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 most frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general per-day statistics are included for both datasets in the statistics-full_dataset.tsv and statistics-full_dataset-clean.tsv files.
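Because the TSV files hold hundreds of millions of rows, it is usually more practical to stream them than to load them into memory at once. The following is a minimal sketch, not the curators' own tooling: the helper name count_by_column is hypothetical, and it assumes only that the file is tab-separated with a header row. The column name passed in (for example "lang") is an assumption and should be checked against the actual header of the file.

```python
import csv
from collections import Counter


def count_by_column(path, column, delimiter="\t"):
    """Stream a delimited file and count rows per distinct value of `column`.

    Hypothetical helper: `column` must match a name in the file's header row,
    which is not documented here and should be verified first.
    """
    counts = Counter()
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        # Fail early with the real header if the assumed column is missing.
        if reader.fieldnames is None or column not in reader.fieldnames:
            raise KeyError(
                f"column {column!r} not found; header is {reader.fieldnames}"
            )
        # Rows are read one at a time, so memory use stays small.
        for row in reader:
            counts[row[column]] += 1
    return counts
```

For instance, count_by_column("full_dataset-clean.tsv", "lang").most_common(3) would list the three most frequent values of that column, assuming a column with that name exists.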

The latest version of the dataset and usage instructions can be found on our GitHub page: https://github.com/thepanacealab/covid19_twitter

If you are going to cite or reuse this dataset, please use:

This dataset is released only for non-commercial research purposes.

This dataset is being maintained by Georgia State University's Panacea Lab. Curators: Juan M. Banda, Ramya Tekumalla and Gerardo Chowell-Puente.

Additional data provided by Guanyu Wang (Missouri School of Journalism, University of Missouri), Jingyuan Yu (Department of Social Psychology, Universitat Autònoma de Barcelona), Tuo Liu (Department of Psychology, Carl von Ossietzky Universität Oldenburg), Yuning Ding (Language Technology Lab, Universität Duisburg-Essen), Katya Artemova (NRU HSE) and Elena Tutubalina (KFU).

Feel free to share in any medium you want: