Preprint Article, Version 1. Preserved in Portico. This version is not peer-reviewed.

Redundancy Reduction in Twitter Event Streams

Version 1 : Received: 12 February 2020 / Approved: 13 February 2020 / Online: 13 February 2020 (12:45:44 CET)



How to cite: Kratzke, N. Redundancy Reduction in Twitter Event Streams. Preprints 2020, 2020020170 (doi: 10.20944/preprints202002.0170.v1).


Abstract

Data from social networks like Twitter is a valuable source for research but is full of redundancy, which makes it hard to provide datasets that are large-scale, self-contained, and small at the same time. Data recording is a common problem in social media-based studies and could be standardized, but this is rarely done. This paper reports on lessons learned from a long-term evaluation study that recorded the complete public sample of the German and English Twitter stream. It proposes a recording solution that simply chunks a linear stream of events to reduce redundancy: if an event is observed multiple times within the time span of a chunk, only the latest observation is written to the chunk. A 10-gigabyte raw Twitter dataset covering 1.2 million tweets from 120,000 users, recorded between June and September 2017, was used to analyze the compression rates that can be expected. The resulting datasets need only between 10% and 20% of the original data size without losing any event, metadata, or the relationships between single events. This kind of redundancy-reducing recording makes it possible to curate large-scale (even nationwide), self-contained, and small datasets of social networks for research in a standardized and reproducible manner.
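
The chunking idea described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the event shape `(timestamp, event_id, payload)`, the function name `chunk_latest`, the `span` parameter, and the choice to start each chunk at the first event seen after the previous chunk closed are all assumptions made for illustration. The sketch also assumes that events arrive in time order.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical event shape: (observation time in seconds, event id, payload).
Event = Tuple[float, str, dict]


def chunk_latest(events: Iterable[Event], span: float) -> Iterator[List[Event]]:
    """Split a time-ordered event stream into chunks of `span` seconds.

    Within one chunk, only the latest observation of each event id is kept.
    """
    latest: Dict[str, Event] = {}
    window_start = None
    for event in events:
        ts, event_id, _payload = event
        if window_start is None:
            window_start = ts
        elif ts - window_start >= span:
            # The current chunk is complete: emit it and start a new one.
            yield list(latest.values())
            latest = {}
            window_start = ts
        # A later observation overwrites an earlier one with the same id.
        latest[event_id] = event
    if latest:
        yield list(latest.values())


# Illustrative stream: tweet "t1" is observed three times, user "u1" once.
stream = [
    (0, "t1", {"retweets": 0}),
    (1, "u1", {"followers": 10}),
    (2, "t1", {"retweets": 5}),
    (12, "t1", {"retweets": 7}),
]
chunks = list(chunk_latest(stream, span=10))
```

With a 10-second span, the four observations collapse into two chunks: the first holds the latest state of `t1` and `u1` from the first window, the second holds the later observation of `t1`. No event id is lost, but repeated observations inside a chunk are stored only once.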

Subject Areas

Twitter; dataset; redundancy; reduction; archive

Copyright: This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
