This page catalogues datasets annotated for hate speech, online abuse, and offensive language. They may be useful, for example, for training a natural language processing system to detect such language.

The list is maintained by Leon Derczynski and Bertie Vidgen.

Please make contributions via pull request or email. Accompanying data statements are preferred for all corpora.

If you use these resources, please cite (and read!) our paper: Directions in Abusive Language Training Data: Garbage In, Garbage Out. And if you would like to find other resources for researching online hate, visit The Alan Turing Institute’s Online Hate Research Hub or read The Alan Turing Institute’s Reading List on Online Hate and Abuse Research.

List of datasets

Arabic

1. Are They our Brothers? Analysis and Detection of Religious Hate Speech in the Arabic Twittersphere

Link to publication: https://ieeexplore.ieee.org/document/8508247

Link to data: https://github.com/nuhaalbadi/Arabic_hatespeech

Task description: Binary (Hate, Not)

Details of task: Religious subcategories

Size of dataset: 6,136

Percentage abusive: 0.45

Language: Arabic

Level of annotation: Posts

Platform: Twitter

Medium: Text

Reference: Albadi, N., Kurdi, M. and Mishra, S., 2018. Are they Our Brothers? Analysis and Detection of Religious Hate Speech in the Arabic Twittersphere. In: International Conference on Advances in Social Networks Analysis and Mining. Barcelona, Spain: IEEE, pp.69-76.
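The "Percentage abusive" values in this list appear to be proportions between 0 and 1 rather than percentages, so a rough count of abusive items is the dataset size multiplied by that value. A minimal Python sketch under that reading follows; the function name is hypothetical, and the result is only an estimate because the listed proportions are rounded.

```python
def approx_abusive_count(size: str, proportion: float) -> int:
    """Estimate the number of abusive items in a catalogue entry.

    `size` is the "Size of dataset" string as listed (commas allowed);
    `proportion` is the "Percentage abusive" value, assumed to lie in [0, 1].
    """
    n_items = int(size.replace(",", ""))
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must be between 0 and 1")
    return round(n_items * proportion)


# Entry 1 above: 6,136 posts with a listed proportion of 0.45.
print(approx_abusive_count("6,136", 0.45))
```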

2. Multilingual and Multi-Aspect Hate Speech Analysis (Arabic)

Link to publication: https://arxiv.org/abs/1908.11049

Link to data: https://github.com/HKUST-KnowComp/MLMA_hate_speech

Task description: Detailed taxonomy with cross-cutting attributes: Hostility, Directness, Target Attribute, Target Group, How annotators felt on seeing the tweet.

Details of task: Gender, Sexual orientation, Religion, Disability

Size of dataset: 3,353

Percentage abusive: 0.64

Language: Arabic

Level of annotation: Posts

Platform: Twitter

Medium: Text

Reference: Ousidhoum, N., Lin, Z., Zhang, H., Song, Y. and Yeung, D., 2019. Multilingual and Multi-Aspect Hate Speech Analysis. ArXiv.

3. L-HSAB: A Levantine Twitter Dataset for Hate Speech and Abusive Language

Link to publication: https://www.aclweb.org/anthology/W19-3512

Link to data: https://github.com/Hala-Mulki/L-HSAB-First-Arabic-Levantine-HateSpeech-Dataset

Task description: Ternary (Hate, Abusive, Normal)

Details of task: Group-directed + Person-directed

Size of dataset: 5,846

Percentage abusive: 0.38

Language: Arabic

Level of annotation: Posts

Platform: Twitter

Medium: Text

Reference: Mulki, H., Haddad, H., Bechikh, C. and Alshabani, H., 2019. L-HSAB: A Levantine Twitter Dataset for Hate Speech and Abusive Language. In: Proceedings of the Third Workshop on Abusive Language Online. Florence, Italy: Association for Computational Linguistics, pp.111-118.

4. Abusive Language Detection on Arabic Social Media (Twitter)

Link to publication: https://www.aclweb.org/anthology/W17-3008

Link to data: http://alt.qcri.org/~hmubarak/offensive/TweetClassification-Summary.xlsx

Task description: Ternary (Obscene, Offensive but not obscene, Clean)

Details of task: Incivility

Size of dataset: 1,100

Percentage abusive: 0.59

Language: Arabic

Level of annotation: Posts

Platform: Twitter

Medium: Text

Reference: Mubarak, H., Darwish, K. and Magdy, W., 2017. Abusive Language Detection on Arabic Social Media. In: Proceedings of the First Workshop on Abusive Language Online. Vancouver, Canada: Association for Computational Linguistics, pp.52-56.

5. Abusive Language Detection on Arabic Social Media (Al Jazeera)

Link to publication: https://www.aclweb.org/anthology/W17-3008

Link to data: http://alt.qcri.org/~hmubarak/offensive/AJCommentsClassification-CF.xlsx

Task description: Ternary (Obscene, Offensive but not obscene, Clean)

Details of task: Incivility

Size of dataset: 32,000

Percentage abusive: 0.81

Language: Arabic

Level of annotation: Posts

Platform: AlJazeera

Medium: Text

Reference: Mubarak, H., Darwish, K. and Magdy, W., 2017. Abusive Language Detection on Arabic Social Media. In: Proceedings of the First Workshop on Abusive Language Online. Vancouver, Canada: Association for Computational Linguistics, pp.52-56.

6. Dataset Construction for the Detection of Anti-Social Behaviour in Online Communication in Arabic

Link to publication: https://www.sciencedirect.com/science/article/pii/S1877050918321756

Link to data: https://onedrive.live.com/?authkey=!ACDXj_ZNcZPqzy0&id=6EF6951FBF8217F9!105&cid=6EF6951FBF8217F9

Task description: Binary (Offensive, Not)

Details of task: Incivility

Size of dataset: 15,050

Percentage abusive: 0.39

Language: Arabic

Level of annotation: Posts

Platform: YouTube

Medium: Text

Reference: Alakrot, A., Murray, L. and Nikolov, N., 2018. Dataset Construction for the Detection of Anti-Social Behaviour in Online Communication in Arabic. Procedia Computer Science, 142, pp.174-181.

Croatian

7. Datasets of Slovene and Croatian Moderated News Comments

Link to publication: https://www.aclweb.org/anthology/W18-5116

Link to data: http://hdl.handle.net/11356/1202

Task description: Binary (Deleted, Not)

Details of task: Flagged content

Size of dataset: 17,000,000

Percentage abusive: 0.02

Language: Croatian

Level of annotation: Posts

Platform: 24sata website

Medium: Text

Reference: Ljubešić, N., Erjavec, T. and Fišer, D., 2018. Datasets of Slovene and Croatian Moderated News Comments. In: Proceedings of the 2nd Workshop on Abusive Language Online (ALW2). Brussels, Belgium: Association for Computational Linguistics, pp.124-131.

Danish

8. Offensive Language and Hate Speech Detection for Danish

Link to publication: http://www.derczynski.com/papers/danish_hsd.pdf

Link to data: https://figshare.com/articles/Danish_Hate_Speech_Abusive_Language_data/12220805

Task description: Branching structure of tasks: Binary (Offensive, Not), Within Offensive (Target, Not), Within Target (Individual, Group, Other)

Details of task: Group-directed + Person-directed

Size of dataset: 3,600

Percentage abusive: 0.12

Language: Danish

Level of annotation: Posts

Platform: Twitter, Reddit, newspaper comments

Medium: Text

Reference: Sigurbergsson, G. and Derczynski, L., 2019. Offensive Language and Hate Speech Detection for Danish. ArXiv.
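The branching task structure described for this dataset can be read as a chain of conditional labels: the target question applies only to offensive posts, and the target type applies only to targeted ones. A minimal Python sketch of that constraint follows. The class, field names, and label strings are hypothetical and chosen for illustration; they are not the dataset's actual column names or encoding.

```python
from dataclasses import dataclass
from typing import Optional

# Target types taken from the task description; the string encoding is assumed.
TARGET_TYPES = {"Individual", "Group", "Other"}


@dataclass(frozen=True)
class BranchingLabel:
    """Hypothetical container for one post's three-level annotation."""
    offensive: bool
    targeted: Optional[bool] = None    # only meaningful when offensive
    target_type: Optional[str] = None  # only meaningful when targeted

    def is_valid(self) -> bool:
        # Level 1: non-offensive posts carry no deeper labels.
        if not self.offensive:
            return self.targeted is None and self.target_type is None
        # Level 2: offensive posts must state whether they are targeted.
        if self.targeted is None:
            return False
        # Level 3: only targeted posts carry a target type.
        if not self.targeted:
            return self.target_type is None
        return self.target_type in TARGET_TYPES
```

For example, `BranchingLabel(True, True, "Group").is_valid()` holds, while a non-offensive post that still carries a target type is rejected.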

English

9. Automated Hate Speech Detection and the Problem of Offensive Language

Link to publication: https://arxiv.org/pdf/1703.04009.pdf

Link to data: https://github.com/t-davidson/hate-speech-and-offensive-language

Task description: Hierarchy (Hate, Offensive, Neither)

Details of task: Hate per se

Size of dataset: 24,802

Percentage abusive: 0.06

Language: English

Level of annotation: Posts

Platform: Twitter

Medium: Text

Reference: Davidson, T., Warmsley, D., Macy, M. and Weber, I., 2017. Automated Hate Speech Detection and the Problem of Offensive Language. ArXiv.

10. Hate Speech Dataset from a White Supremacy Forum

Link to publication: https://www.aclweb.org/anthology/W18-5102.pdf

Link to data: https://github.com/Vicomtech/hate-speech-dataset

Task description: Ternary (Hate, Relation, Not)

Details of task: Hate per se

Size of dataset: 9,916

Percentage abusive: 0.11

Language: English

Level of annotation: Sentence - with context of the conversational thread taken into account

Platform: Stormfront

Medium: Text

Reference: de Gibert, O., Perez, N., García-Pablos, A., and Cuadros, M., 2018. Hate Speech Dataset from a White Supremacy Forum. In: Proceedings of the 2nd Workshop on Abusive Language Online (ALW2). Brussels, Belgium: Association for Computational Linguistics, pp.11-20.

11. Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter

Link to publication: https://www.aclweb.org/anthology/N16-2013

Link to data: https://github.com/ZeerakW/hatespeech

Task description: 3-topic (Sexist, Racist, Not)

Details of task: Racism, Sexism

Size of dataset: 16,914

Percentage abusive: 0.32

Language: English

Level of annotation: Posts

Platform: Twitter

Medium: Text

Reference: Waseem, Z. and Hovy, D., 2016. Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter. In: Proceedings of the NAACL Student Research Workshop. San Diego, California: Association for Computational Linguistics, pp.88-93.

12. Detecting Online Hate Speech Using Context Aware Models

Link to publication: https://arxiv.org/pdf/1710.07395.pdf

Link to data: https://github.com/sjtuprog/fox-news-comments

Task description: Binary (Hate / not)

Details of task: Hate per se

Size of dataset: 1,528

Percentage abusive: 0.28

Language: English

Level of annotation: Posts

Platform: Fox News

Medium: Text

Reference: Gao, L. and Huang, R., 2018. Detecting Online Hate Speech Using Context Aware Models. ArXiv.

13. Are You a Racist or Am I Seeing Things? Annotator Influence on Hate Speech Detection on Twitter

Link to publication: https://pdfs.semanticscholar.org/3eeb/b7907a9b94f8d65f969f63b76ff5f643f6d3.pdf

Link to data: https://github.com/ZeerakW/hatespeech

Task description: Multi-topic (Sexist, Racist, Neither, Both)

Details of task: Racism, Sexism

Size of dataset: 4,033

Percentage abusive: 0.16

Language: English

Level of annotation: Posts

Platform: Twitter

Medium: Text

Reference: Waseem, Z., 2016. Are You a Racist or Am I Seeing Things? Annotator Influence on Hate Speech Detection on Twitter. In: Proceedings of 2016 EMNLP Workshop on Natural Language Processing and Computational Social Science. Austin, Texas: Association for Computational Linguistics, pp.138-142.

14. When Does a Compliment Become Sexist? Analysis and Classification of Ambivalent Sexism Using Twitter Data

Link to publication: https://pdfs.semanticscholar.org/225f/f8a6a562bbb64b22cebfcd3288c6b930d1ef.pdf

Link to data: https://github.com/AkshitaJha/NLP_CSS_2017

Task description: Hierarchy of Sexism (Benevolent sexism, Hostile sexism, None)

Details of task: Sexism

Size of dataset: 712

Percentage abusive: 1

Language: English

Level of annotation: Posts

Platform: Twitter

Medium: Text

Reference: Jha, A. and Mamidi, R., 2017. When does a Compliment become Sexist? Analysis and Classification of Ambivalent Sexism using Twitter Data. In: Proceedings of the Second Workshop on Natural Language Processing and Computational Social Science. Vancouver, Canada: Association for Computational Linguistics, pp.7-16.

15. Overview of the Task on Automatic Misogyny Identification at IberEval 2018 (English)

Link to publication: http://ceur-ws.org/Vol-2150/overview-AMI.pdf

Link to data: https://amiibereval2018.wordpress.com/im nt-dates/data/

Task description: Binary (misogyny / not), 5 categories (stereotype, dominance, derailing, sexual harassment, discredit), target of misogyny (active or passive)

Details of task: Sexism

Size of dataset: 3,977

Percentage abusive: 0.47

Language: English

Level of annotation: Posts

Platform: Twitter

Medium: Text

Reference: Fersini, E., Rosso, P. and Anzovino, M., 2018. Overview of the Task on Automatic Misogyny Identification at IberEval 2018. In: Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018).

16. CONAN - COunter NArratives through Nichesourcing: a Multilingual Dataset of Responses to Fight Online Hate Speech (English)

Link to publication: https://www.aclweb.org/anthology/P19-1271.pdf

Link to data: https://github.com/marcoguerini/CONAN

Task description: Binary (Islamophobic / not), multi-topic (Culture, Economics, Crimes, Rapism, Terrorism, Women Oppression, History, Other/generic)

Details of task: Islamophobia

Size of dataset: 1,288

Percentage abusive: 1

Language: English

Level of annotation: Posts

Platform: Synthetic / Facebook

Medium: Text

Reference: Chung, Y., Kuzmenko, E., Tekiroglu, S. and Guerini, M., 2019. CONAN - COunter NArratives through Nichesourcing: a Multilingual Dataset of Responses to Fight Online Hate Speech. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, pp.2819-2829.

17. Characterizing and Detecting Hateful Users on Twitter

Link to publication: https://arxiv.org/pdf/1803.08977.pdf

Link to data: https://github.com/manoelhortaribeiro/HatefulUsersTwitter

Task description: Binary (hateful/not)

Details of task: Hate per se

Size of dataset: 4,972

Percentage abusive: 0.11

Language: English

Level of annotation: Users

Platform: Twitter

Medium: Text

Reference: Ribeiro, M., Calais, P., Santos, Y., Almeida, V. and Meira, W., 2018. Characterizing and Detecting Hateful Users on Twitter. ArXiv.

18. A Benchmark Dataset for Learning to Intervene in Online Hate Speech (Gab)

Link to publication: https://arxiv.org/abs/1909.04251

Link to data: https://github.com/jing-qian/A-Benchmark-Dataset-for-Learning-to-Intervene-in-Online-Hate-Speech

Task description: Binary (hateful/not)

Details of task: Hate per se

Size of dataset: 33,776

Percentage abusive: 0.43

Language: English

Level of annotation: Posts (in the context of a conversation)

Platform: Gab

Medium: Text

Reference: Qian, J., Bethke, A., Belding, E. and Yang Wang, W., 2019. A Benchmark Dataset for Learning to Intervene in Online Hate Speech. ArXiv.

19. A Benchmark Dataset for Learning to Intervene in Online Hate Speech (Reddit)

Link to publication: https://arxiv.org/abs/1909.04251

Link to data: https://github.com/jing-qian/A-Benchmark-Dataset-for-Learning-to-Intervene-in-Online-Hate-Speech

Task description: Binary (hateful/not)

Details of task: Hate per se

Size of dataset: 22,324

Percentage abusive: 0.24

Language: English

Level of annotation: Posts (with context of the conversational thread taken into account)

Platform: Reddit

Medium: Text

Reference: Qian, J., Bethke, A., Belding, E. and Yang Wang, W., 2019. A Benchmark Dataset for Learning to Intervene in Online Hate Speech. ArXiv.

20. Multilingual and Multi-Aspect Hate Speech Analysis (English)

Link to publication: https://arxiv.org/abs/1908.11049

Link to data: https://github.com/HKUST-KnowComp/MLMA_hate_speech

Task description: Detailed taxonomy with cross-cutting attributes: Hostility, Directness, Target attribute and Target group.

Details of task: Gender, Sexual orientation, Religion, Disability

Size of dataset: 5,647

Percentage abusive: 0.76

Language: English

Level of annotation: Posts

Platform: Twitter

Medium: Text

Reference: Ousidhoum, N., Lin, Z., Zhang, H., Song, Y. and Yeung, D., 2019. Multilingual and Multi-Aspect Hate Speech Analysis. ArXiv.

21. Exploring Hate Speech Detection in Multimodal Publications

Link to publication: https://arxiv.org/pdf/1910.03814.pdf

Link to data: https://gombru.github.io/2019/10/09/MMHS/

Task description: Six primary categories (No attacks to any community, Racist, Sexist, Homophobic, Religion based attack, Attack to other community)

Details of task: Racism, Sexism, Homophobia, Religion-based attack

Size of dataset: 149,823

Percentage abusive: 0.25

Language: English

Level of annotation: Posts

Platform: Twitter

Medium: Text and Images/Memes

Reference: Gomez, R., Gibert, J., Gomez, L. and Karatzas, D., 2019. Exploring Hate Speech Detection in Multimodal Publications. ArXiv.

22. Predicting the Type and Target of Offensive Posts in Social Media

Link to publication: https://arxiv.org/pdf/1902.09666.pdf

Link to data: http://competitions.codalab.org/competitions/20011

Task description: Branching structure of tasks: Binary (Offensive, Not), Within Offensive (Target, Not), Within Target (Individual, Group, Other)

Details of task: Group-directed + Person-directed

Size of dataset: 14,100

Percentage abusive: 0.33

Language: English

Level of annotation: Posts

Platform: Twitter

Medium: Text

Reference: Zampieri, M., Malmasi, S., Nakov, P., Rosenthal, S., Farra, N. and Kumar, R., 2019. SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval). ArXiv.

23. hatEval, SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter (English)

Link to publication: https://www.aclweb.org/anthology/S19-2007

Link to data: https://competitions.codalab.org/competitions/19935

Task description: Branching structure of tasks: Binary (Hate, Not), Within Hate (Group, Individual), Within Hate (Aggressive, Not)

Details of task: Group-directed + Person-directed

Size of dataset: 13,000

Percentage abusive: 0.4

Language: English

Level of annotation: Posts

Platform: Twitter

Medium: Text

Reference: Basile, V., Bosco, C., Fersini, E., Nozza, D., Patti, V., Pardo, F., Rosso, P. and Sanguinetti, M., 2019. SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter. In: Proceedings of the 13th International Workshop on Semantic Evaluation. Minneapolis, Minnesota: Association for Computational Linguistics, pp.54-63.

24. Peer to Peer Hate: Hate Speech Instigators and Their Targets

Link to publication: https://aaai.org/ocs/index.php/ICWSM/ICWSM18/paper/view/17905/16996

Link to data: https://github.com/mayelsherif/hate_speech_icwsm18

Task description: Binary (Hate/Not), only for tweets which have both a Hate Instigator and Hate Target

Details of task: Hate per se

Size of dataset: 27,330

Percentage abusive: 0.98

Language: English

Level of annotation: Posts

Platform: Twitter

Medium: Text

Reference: ElSherief, M., Nilizadeh, S., Nguyen, D., Vigna, G. and Belding, E., 2018. Peer to Peer Hate: Hate Speech Instigators and Their Targets. In: Proceedings of the Twelfth International AAAI Conference on Web and Social Media (ICWSM 2018). Santa Barbara, California: University of California, pp.52-61.

25. Overview of the HASOC track at FIRE 2019: Hate Speech and Offensive Content Identification in Indo-European Languages

Link to publication: https://dl.acm.org/doi/pdf/10.1145/3368567.3368584?download=true

Link to data: https://hasocfire.github.io/hasoc/2019/dataset.html

Task description: Branching structure of tasks. A: Hate / Offensive or Neither, B: Hatespeech, Offensive, or Profane, C: Targeted or Untargeted

Details of task: Group-directed + Person-directed

Size of dataset: 7,005

Percentage abusive: 0.36

Language: English

Level of annotation: Posts

Platform: Twitter and Facebook

Medium: Text

Reference: Mandl, T., Modha, S., Majumder, P., Patel, D., Dave, M., Mandlia, C. and Patel, A., 2019. Overview of the HASOC track at FIRE 2019. In: Proceedings of the 11th Forum for Information Retrieval Evaluation.

26. Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior

Link to publication: https://arxiv.org/pdf/1802.00393.pdf

Link to data: https://dataverse.mpi-sws.org/dataset.xhtml?persistentId=doi:10.5072/FK2/ZDTEMN

Task description: Multi-thematic (Abusive, Hateful, Normal, Spam)

Details of task: Hate per se

Size of dataset: 80,000

Percentage abusive: 0.18

Language: English

Level of annotation: Posts

Platform: Twitter

Medium: Text

Annotation process: Described in detail in the paper. The annotation scheme was tested over multiple rounds on a smaller 300-tweet dataset. Each of the final 80k tweets received 5 judgements, collected via CrowdFlower.

Annotation agreement: 55.9% = 4/5, 36.6% = 3/5, 7.5% = 2/5

Reference: Founta, A., Djouvas, C., Chatzakou, D., Leontiadis, I., Blackburn, J., Stringhini, G., Vakali, A., Sirivianos, M. and Kourtellis, N., 2018. Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior. ArXiv.
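The agreement figures listed for this dataset can be read as grouping tweets by how many of the 5 judgements match the most common label. A minimal Python sketch of that computation follows, assuming each tweet's judgements arrive as a list of label strings. The function names and any sample data are hypothetical; only the label set (Abusive, Hateful, Normal, Spam) comes from the task description.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def agreement_level(judgements: Sequence[str]) -> Tuple[str, int]:
    """Return the most common label and the number of judgements backing it."""
    if not judgements:
        raise ValueError("at least one judgement is required")
    label, votes = Counter(judgements).most_common(1)[0]
    return label, votes


def agreement_distribution(items: List[Sequence[str]]) -> Dict[int, float]:
    """Share of items at each agreement level (votes for the top label)."""
    levels = Counter(agreement_level(judgements)[1] for judgements in items)
    total = sum(levels.values())
    return {votes: count / total for votes, count in sorted(levels.items())}
```

When two labels tie for the top count, `most_common` returns whichever was seen first, so a real pipeline would need an explicit tie-breaking rule. The sketch does not attempt to reproduce how the authors resolved such cases.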

27. A Large Labeled Corpus for Online Harassment Research

Link to publication: http://www.cs.umd.edu/~golbeck/papers/trolling.pdf

Link to data: jgolbeck@umd.edu

Task description: Binary (Harassment, Not)

Details of task: Person-directed

Size of dataset: 35,000

Percentage abusive: 0.16

Language: English

Level of annotation: Posts

Platform: Twitter

Medium: Text

Reference: Golbeck, J., Ashktorab, Z., Banjo, R., Berlinger, A., Bhagwan, S., Buntain, C., Cheakalos, P., Geller, A., Gergory, Q., Gnanasekaran, R., Gnanasekaran, R., Hoffman, K., Hottle, J., Jienjitlert, V., Khare, S., Lau, R., Martindale, M., Naik, S., Nixon, H., Ramachandran, P., Rogers, K., Rogers, L., Sarin, M., Shahane, G., Thanki, J., Vengataraman, P., Wan, Z. and Wu, D., 2017. A Large Labeled Corpus for Online Harassment Research. In: Proceedings of the 2017 ACM on Web Science Conference. New York: Association for Computing Machinery, pp.229-233.

28. Ex Machina: Personal Attacks Seen at Scale, Personal attacks

Link to publication: https://arxiv.org/pdf/1610.08914

Link to data: https://github.com/ewulczyn/wiki-detox

Task description: Binary (Personal attack, Not)

Details of task: Person-directed

Size of dataset: 115,737

Percentage abusive: 0.12

Language: English

Level of annotation: Posts

Platform: Wikipedia

Medium: Text

Reference: Wulczyn, E., Thain, N. and Dixon, L., 2017. Ex Machina: Personal Attacks Seen at Scale. ArXiv.

29. Ex Machina: Personal Attacks Seen at Scale, Toxicity

Link to publication: https://arxiv.org/pdf/1610.08914

Link to data: https://github.com/ewulczyn/wiki-detox

Task description: Toxicity/healthiness judgement (-2 == very toxic, 0 == neutral, 2 == very healthy)

Details of task: Person-directed

Size of dataset: 100,000

Percentage abusive: NA

Language: English

Level of annotation: Posts

Platform: Wikipedia

Medium: Text

Reference: Wulczyn, E., Thain, N. and Dixon, L., 2017. Ex Machina: Personal Attacks Seen at Scale. ArXiv.

30. Detecting cyberbullying in online communities (World of Warcraft)

Link to publication: http://aisel.aisnet.org/ecis2016_rp/61/

Link to data: http://ub-web.de/research/

Task description: Binary (Harassment, Not)

Details of task: Person-directed

Size of dataset: 16,975

Percentage abusive: 0.01

Language: English

Level of annotation: Posts

Platform: World of Warcraft

Medium: Text

Reference: Bretschneider, U. and Peters, R., 2016. Detecting Cyberbullying in Online Communities. Research Papers, 61.

31. Detecting cyberbullying in online communities (League of Legends)

Link to publication: http://aisel.aisnet.org/ecis2016_rp/61/

Link to data: http://ub-web.de/research/

Task description: Binary (Harassment, Not)

Details of task: Person-directed

Size of dataset: 17,354

Percentage abusive: 0.01

Language: English

Level of annotation: Posts

Platform: League of Legends

Medium: Text

Reference: Bretschneider, U. and Peters, R., 2016. Detecting Cyberbullying in Online Communities. Research Papers, 61.

32. A Quality Type-aware Annotated Corpus and Lexicon for Harassment Research

Link to publication: https://arxiv.org/pdf/1802.09416.pdf

Link to data: https://github.com/Mrezvan94/Harassment-Corpus

Task description: Multi-topic harassment detection

Details of task: Racism, Sexism, Appearance-related, Intellectual, Political

Size of dataset: 24,189

Percentage abusive: 0.13

Language: English

Level of annotation: Posts

Platform: Twitter

Medium: Text

Reference: Rezvan, M., Shekarpour, S., Balasuriya, L., Thirunarayan, K., Shalin, V. and Sheth, A., 2018. A Quality Type-aware Annotated Corpus and Lexicon for Harassment Research. ArXiv.

33. Ex Machina: Personal Attacks Seen at Scale, Aggression and Friendliness

Link to publication: https://arxiv.org/pdf/1610.08914

Link to data: https://github.com/ewulczyn/wiki-detox

Task description: Aggression/friendliness judgement on a 5 point scale. (-2 == very aggressive, 0 == neutral, 3 == very friendly).

Details of task: Person-Directed + Group-Directed

Size of dataset: 160,000

Percentage abusive: NA

Language: English

Level of annotation: Posts

Platform: Wikipedia

Medium: Text

Reference: Wulczyn, E., Thain, N. and Dixon, L., 2017. Ex Machina: Personal Attacks Seen at Scale. ArXiv.

French

34. CONAN - COunter NArratives through Nichesourcing: a Multilingual Dataset of Responses to Fight Online Hate Speech (French)

Link to publication: https://www.aclweb.org/anthology/P19-1271.pdf

Link to data: https://github.com/marcoguerini/CONAN

Task description: Binary (Islamophobic / not), Multi-topic (Culture, Economics, Crimes, Rapism, Terrorism, Women Oppression, History, Other/generic)

Details of task: Islamophobia

Size of dataset: 1,719

Percentage abusive: 1

Language: French

Level of annotation: Posts

Platform: Synthetic / Facebook

Medium: Text

Reference: Chung, Y., Kuzmenko, E., Tekiroglu, S. and Guerini, M., 2019. CONAN - COunter NArratives through Nichesourcing: a Multilingual Dataset of Responses to Fight Online Hate Speech. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, pp.2819-2829.

35. Multilingual and Multi-Aspect Hate Speech Analysis (French)

Link to publication: https://arxiv.org/abs/1908.11049

Link to data: https://github.com/HKUST-KnowComp/MLMA_hate_speech

Task description: Detailed taxonomy with cross-cutting attributes: Hostility, Directness, Target Attribute, Target Group, How annotators felt on seeing the tweet.

Details of task: Gender, Sexual orientation, Religion, Disability

Size of dataset: 4,014

Percentage abusive: 0.72

Language: French

Level of annotation: Posts

Platform: Twitter

Medium: Text

Reference: Ousidhoum, N., Lin, Z., Zhang, H., Song, Y. and Yeung, D., 2019. Multilingual and Multi-Aspect Hate Speech Analysis. ArXiv.

German

36. Measuring the Reliability of Hate Speech Annotations: The Case of the European Refugee Crisis

Link to publication: https://arxiv.org/pdf/1701.08118.pdf

Link to data: https://github.com/UCSM-DUE/IWG_hatespeech_public

Task description: Binary (Anti-refugee hate, None)

Details of task: Refugees

Size of dataset: 469

Percentage abusive: NA

Language: German

Level of annotation: Posts

Platform: Twitter

Medium: Text

Reference: Ross, B., Rist, M., Carbonell, G., Cabrera, B., Kurowsky, N. and Wojatzki, M., 2017. Measuring the Reliability of Hate Speech Annotations: The Case of the European Refugee Crisis. ArXiv.

37. Detecting Offensive Statements Towards Foreigners in Social Media

Link to publication: https://pdfs.semanticscholar.org/23dc/df7c7e82807445afd9f19474fc0a3d8169fe.pdf

Link to data: http://ub-web.de/research/

Task description: Hierarchical (Anti-foreigner prejudice, split into (1) slightly offensive/offensive and (2) explicitly/substantially offensive). 6 targets (Foreigner, Government, Press, Community, Other, Unknown)

Details of task: Anti-foreigner prejudice

Size of dataset: 5,836

Percentage abusive: 0.11

Language: German

Level of annotation: Posts

Platform: Facebook

Medium: Text

Reference: Bretschneider, U. and Peters, R., 2017. Detecting Offensive Statements towards Foreigners in Social Media. In: Proceedings of the 50th Hawaii International Conference on System Sciences.

38. GermEval 2018

Link to publication: https://www.researchgate.net/publication/327914386_Overview_of_the_GermEval_2018_Shared_Task_on_the_Identification_of_Offensive_Language

Link to data: https://github.com/uds-lsv/GermEval-2018-Data

Task description: Branching structure: Binary (Offense, Other), 3 levels within Offense (Abuse, Insult, Profanity)

Details of task: Group-directed + Incivility

Size of dataset: 8,541

Percentage abusive: 0.34

Language: German

Level of annotation: Posts

Platform: Twitter

Medium: Text

Reference: Wiegand, M., Siegel, M. and Ruppenhofer, J., 2018. Overview of the GermEval 2018 Shared Task on the Identification of Offensive Language. In: Proceedings of GermEval 2018, 14th Conference on Natural Language Processing (KONVENS 2018). Vienna, Austria: Research Gate.

39. Overview of the HASOC track at FIRE 2019: Hate Speech and Offensive Content Identification in Indo-European Languages

Link to publication: https://dl.acm.org/doi/pdf/10.1145/3368567.3368584?download=true

Link to data: https://hasocfire.github.io/hasoc/2019/dataset.html

Task description: A: Hate / Offensive or neither, B: Hatespeech, Offensive, or Profane

Details of task: Group-directed + Person-directed

Size of dataset: 4,669

Percentage abusive: 0.24

Language: German

Level of annotation: Posts

Platform: Twitter and Facebook

Medium: Text

Reference: Mandl, T., Modha, S., Majumder, P., Patel, D., Dave, M., Mandlia, C. and Patel, A., 2019. Overview of the HASOC track at FIRE 2019. In: Proceedings of the 11th Forum for Information Retrieval Evaluation.

Greek

40. Deep Learning for User Comment Moderation, Flagged Comments

Link to publication: https://www.aclweb.org/anthology/W17-3004

Link to data: http://www.straintek.com/data/

Task description: Binary (Flagged, Not)

Details of task: Flagged content

Size of dataset: 1,450,000

Percentage abusive: 0.34

Language: Greek

Level of annotation: Posts

Platform: Gazzetta

Medium: Text

Reference: Pavlopoulos, J., Malakasiotis, P. and Androutsopoulos, I., 2017. Deep Learning for User Comment Moderation. In: Proceedings of the First Workshop on Abusive Language Online. Vancouver, Canada: Association for Computational Linguistics, pp.25-35.

41. Deep Learning for User Comment Moderation, Moderated Comments

Link to publication: https://www.aclweb.org/anthology/W17-3004

Link to data: http://www.straintek.com/data/

Task description: Binary (Flagged, Not)

Details of task: Flagged content

Size of dataset: 1,500

Percentage abusive: 0.22

Language: Greek

Level of annotation: Posts

Platform: Gazzetta

Medium: Text

Reference: Pavlopoulos, J., Malakasiotis, P. and Androutsopoulos, I., 2017. Deep Learning for User Comment Moderation. In: Proceedings of the First Workshop on Abusive Language Online. Vancouver, Canada: Association for Computational Linguistics, pp.25-35.

42. Offensive Language Identification in Greek

Link to publication: https://arxiv.org/pdf/2003.07459v1.pdf

Link to data: https://sites.google.com/site/offensevalsharedtask/home

Task description: Branching structure of tasks: Binary (Offensive, Not), Within Offensive (Target, Not), Within Target (Individual, Group, Other)

Details of task: Group-directed + Person-directed

Size of dataset: 4,779

Percentage abusive: 0.29

Language: Greek

Level of annotation: Posts

Platform: Twitter

Medium: Text

Reference: Pitenis, Z., Zampieri, M. and Ranasinghe, T., 2020. Offensive Language Identification in Greek. ArXiv.

Hindi-English

43. Aggression-annotated Corpus of Hindi-English Code-mixed Data

Link to publication: https://arxiv.org/pdf/1803.09402

Link to data: https://github.com/kraiyani/Facebook-Post-Aggression-Identification

Task description: 3-part hierarchy for hate (None, Covert Aggression, Overt Aggression), 4-part target categorisation (Physical threat, Sexual threat, Identity threat, Non-threatening aggression), 3-part discursive role categorisation (Attack, Defend, Abet)

Details of task: Numerous sub-categorizations

Size of dataset: 18,000

Percentage abusive: 0.06

Language: Hindi-English

Level of annotation: Posts

Platform: Facebook

Medium: Text

Reference: Kumar, R., Reganti, A., Bhatia, A. and Maheshwari, T., 2018. Aggression-annotated Corpus of Hindi-English Code-mixed Data. ArXiv.

44. Aggression-annotated Corpus of Hindi-English Code-mixed Data

Link to publication: https://arxiv.org/pdf/1803.09402

Link to data: https://github.com/kraiyani/Facebook-Post-Aggression-Identification

Task description: 3-part hierarchy for hate (None, Covert Aggression, Overt Aggression), 4-part target categorisation (Physical threat, Sexual threat, Identity threat, Non-threatening aggression), 3-part discursive role categorisation (Attack, Defend, Abet)

Details of task: Numerous sub-categorizations

Size of dataset: 21,000

Percentage abusive: 0.27

Language: Hindi-English

Level of annotation: Posts

Platform: Twitter

Medium: Text

Reference: Kumar, R., Reganti, A., Bhatia, A. and Maheshwari, T., 2018. Aggression-annotated Corpus of Hindi-English Code-mixed Data. ArXiv.

45. Did You Offend Me? Classification of Offensive Tweets in Hinglish Language

Link to publication: https://www.aclweb.org/anthology/W18-5118

Link to data: https://github.com/pmathur5k10/Hinglish-Offensive-Text-Classification

Task description: Hierarchy (Not Offensive, Abusive, Hate)

Details of task: Sexism

Size of dataset: 3,189

Percentage abusive: 0.65

Language: Hindi-English

Level of annotation: Posts

Platform: Twitter

Medium: Text

Reference: Mathur, P., Sawhney, R., Ayyar, M. and Shah, R., 2018. Did you offend me? Classification of Offensive Tweets in Hinglish Language. In: Proceedings of the 2nd Workshop on Abusive Language Online (ALW2). Brussels, Belgium: Association for Computational Linguistics, pp.138-148.

46. A Dataset of Hindi-English Code-Mixed Social Media Text for Hate Speech Detection

Link to publication: https://www.aclweb.org/anthology/W18-1105

Link to data: https://github.com/deepanshu1995/HateSpeech-Hindi-English-Code-Mixed-Social-Media-Text

Task description: Binary (Hate, Not)

Details of task: Hate per se

Size of dataset: 4,575

Percentage abusive: 0.36

Language: Hindi-English

Level of annotation: Posts

Platform: Twitter

Medium: Text

Reference: Bohra, A., Vijay, D., Singh, V., Sarfaraz Akhtar, S. and Shrivastava, M., 2018. A Dataset of Hindi-English Code-Mixed Social Media Text for Hate Speech Detection. In: Proceedings of the Second Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media. New Orleans, Louisiana: Association for Computational Linguistics, pp.36-41.

47. Overview of the HASOC track at FIRE 2019: Hate Speech and Offensive Content Identification in Indo-European Languages

Link to publication: https://dl.acm.org/doi/pdf/10.1145/3368567.3368584?download=true

Link to data: https://hasocfire.github.io/hasoc/2019/dataset.htm

Task description: A: Hate, Offensive or Neither, B: Hate speech, Offensive, or Profane, C: Targeted or Untargeted

Details of task: Group-directed + Person-directed

Size of dataset: 5,983

Percentage abusive: 0.51

Language: Hindi

Level of annotation: Posts

Platform: Twitter and Facebook

Medium: Text

Reference: Mandl, T., Modha, S., Majumder, P., Patel, D., Dave, M., Mandlia, C. and Patel, A., 2019. Overview of the HASOC track at FIRE 2019. In: Proceedings of the 11th Forum for Information Retrieval Evaluation.

Indonesian

48. Hate Speech Detection in the Indonesian Language: A Dataset and Preliminary Study

Link to publication: https://ieeexplore.ieee.org/document/8355039

Link to data: https://github.com/ialfina/id-hatespeech-detection

Task description: Binary (Hate, Not)

Details of task: Hate per se

Size of dataset: 713

Percentage abusive: 0.36

Language: Indonesian

Level of annotation: Posts

Platform: Twitter

Medium: Text

Reference: Alfina, I., Mulia, R., Fanany, M. and Ekanata, Y., 2017. Hate Speech Detection in the Indonesian Language: A Dataset and Preliminary Study. In: International Conference on Advanced Computer Science and Information Systems. pp.233-238.

49. Multi-Label Hate Speech and Abusive Language Detection in Indonesian Twitter

Link to publication: https://www.aclweb.org/anthology/W19-3506

Link to data: https://github.com/okkyibrohim/id-multi-label-hate-speech-and-abusive-language-detection

Task description: (No hate speech, No hate speech but abusive, Hate speech but no abuse, Hate speech and abuse), within hate, category (Religion/creed, Race/ethnicity, Physical/disability, Gender/sexual orientation, Other invective/slander), within hate, strength (Weak, Moderate and Strong)

Details of task: Religion, Race, Disability, Gender

Size of dataset: 13,169

Percentage abusive: 0.42

Language: Indonesian

Level of annotation: Posts

Platform: Twitter

Medium: Text

Reference: Okky Ibrohim, M. and Budi, I., 2019. Multi-label Hate Speech and Abusive Language Detection in Indonesian Twitter. In: Proceedings of the Third Workshop on Abusive Language Online. Florence, Italy: Association for Computational Linguistics, pp.46-57.

50. A Dataset and Preliminaries Study for Abusive Language Detection in Indonesian Social Media

Link to publication: https://www.sciencedirect.com/science/article/pii/S1877050918314583

Link to data: https://github.com/okkyibrohim/id-abusive-language-detection

Task description: Hierarchical (Not abusive, Abusive but not offensive, Offensive)

Details of task: Incivility

Size of dataset: 2,016

Percentage abusive: 0.54

Language: Indonesian

Level of annotation: Posts

Platform: Twitter

Medium: Text

Reference: Ibrohim, M. and Budi, I., 2018. A Dataset and Preliminaries Study for Abusive Language Detection in Indonesian Social Media. Procedia Computer Science, 135, pp.222-229.

Italian

51. An Italian Twitter Corpus of Hate Speech against Immigrants

Link to publication: https://www.aclweb.org/anthology/L18-1443

Link to data: https://github.com/msang/hate-speech-corpus

Task description: Binary (Immigrants/Roma/Muslims, Not), additional categories. Within Hate, Intensity measurement (Aggressiveness: No, Weak, Strong, Offensiveness: No, Weak, Strong, Irony: No, Yes, Stereotype: No, Yes, Incitement degree: 0-4)

Details of task: Immigrants, Roma and Muslims + numerous sub-categorizations

Size of dataset: 1,827

Percentage abusive: 0.13

Language: Italian

Level of annotation: Posts

Platform: Twitter

Medium: Text

Reference: Sanguinetti, M., Poletto, F., Bosco, C., Patti, V. and Stranisci, M., 2018. An Italian Twitter Corpus of Hate Speech against Immigrants. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Miyazaki, Japan: European Language Resources Association (ELRA).

52. Overview of the EVALITA 2018 Hate Speech Detection Task (Facebook)

Link to publication: http://ceur-ws.org/Vol-2263/paper010.pdf

Link to data: http://www.di.unito.it/~tutreeb/haspeede-evalita18/data.html

Task description: Binary (Hate, Not), Within hate for Facebook only, strength (No hate, Weak hate, Strong hate) and theme ((1) religion, (2) physical and/or mental handicap, (3) socio-economic status, (4) politics, (5) race, (6) sex and gender, (7) Other)

Details of task: Religion, physical and/or mental handicap, socio-economic status, politics, race, sex and gender

Size of dataset: 4,000

Percentage abusive: 0.51

Language: Italian

Level of annotation: Posts

Platform: Facebook

Medium: Text

Reference: Bosco, C., Dell’Orletta, F. and Poletto, F., 2018. Overview of the EVALITA 2018 Hate Speech Detection Task. In: EVALITA 2018-Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. CEUR, pp.1-9.

53. Overview of the EVALITA 2018 Hate Speech Detection Task (Twitter)

Link to publication: http://ceur-ws.org/Vol-2263/paper010.pdf

Link to data: http://www.di.unito.it/~tutreeb/haspeede-evalita18/data.html

Task description: Binary (Hate, Not), Within hate for Twitter only, Intensity (1-4 rating), Aggressiveness (No, Weak, Strong), Offensiveness (No, Weak, Strong), Irony (Yes, No)

Details of task: Group-directed

Size of dataset: 4,000

Percentage abusive: 0.32

Language: Italian

Level of annotation: Posts

Platform: Twitter

Medium: Text

Reference: Bosco, C., Dell’Orletta, F. and Poletto, F., 2018. Overview of the EVALITA 2018 Hate Speech Detection Task. In: EVALITA 2018-Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. CEUR, pp.1-9.

54. CONAN - COunter NArratives through Nichesourcing: a Multilingual Dataset of Responses to Fight Online Hate Speech (Italian)

Link to publication: https://www.aclweb.org/anthology/P19-1271.pdf

Link to data: https://github.com/marcoguerini/CONAN

Task description: Binary (Islamophobic, Not), Multi-topic (Culture, Economics, Crimes, Rapism, Terrorism, Women Oppression, History, Other/generic)

Details of task: Islamophobia

Size of dataset: 1,071

Percentage abusive: 1

Language: Italian

Level of annotation: Posts

Platform: Synthetic / Facebook

Medium: Text

Reference: Chung, Y., Kuzmenko, E., Tekiroglu, S. and Guerini, M., 2019. CONAN - COunter NArratives through Nichesourcing: a Multilingual Dataset of Responses to Fight Online Hate Speech. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, pp.2819-2829.

55. Creating a WhatsApp Dataset to Study Pre-teen Cyberbullying

Link to publication: https://www.aclweb.org/anthology/W18-5107

Link to data: https://github.com/dhfbk/WhatsApp-Dataset

Task description: Binary (Cyberbullying, Not)

Details of task: Person-directed

Size of dataset: 14,600

Percentage abusive: 0.08

Language: Italian

Level of annotation: Posts, structured into 10 chats, with token-level information

Platform: Synthetic / WhatsApp

Medium: Text

Reference: Sprugnoli, R., Menini, S., Tonelli, S., Oncini, F. and Piras, E., 2018. Creating a WhatsApp Dataset to Study Pre-teen Cyberbullying. In: Proceedings of the 2nd Workshop on Abusive Language Online (ALW2). Brussels, Belgium: Association for Computational Linguistics, pp.51-59.

Polish

56. Results of the PolEval 2019 Shared Task 6: First Dataset and Open Shared Task for Automatic Cyberbullying Detection in Polish Twitter

Link to publication: http://poleval.pl/files/poleval2019.pdf

Link to data: http://poleval.pl/tasks/task6

Task description: Harmfulness score (three values), Multilabel from seven phenomena

Details of task: Person-directed

Size of dataset: 10,041

Percentage abusive: 0.09

Language: Polish

Level of annotation: Posts

Platform: Twitter

Medium: Text

Reference: Ogrodniczuk, M. and Kobyliński, L., 2019. Results of the PolEval 2019 Shared Task 6: First Dataset and Open Shared Task for Automatic Cyberbullying Detection in Polish Twitter. In: Proceedings of the PolEval 2019 Workshop. Warszawa: Institute of Computer Science, Polish Academy of Sciences.

Portuguese

57. A Hierarchically-Labeled Portuguese Hate Speech Dataset

Link to publication: https://www.aclweb.org/anthology/W19-3510

Link to data: https://b2share.eudat.eu/records/9005efe2d6be4293b63c3cffd4cf193e

Task description: Binary (Hate, Not), Multi-level (81 categories, identified inductively; categories have different granularities and content can be assigned to multiple categories at once)

Details of task: Multiple identities inductively categorized

Size of dataset: 3,059

Percentage abusive: 0.32

Language: Portuguese

Level of annotation: Posts

Platform: Twitter

Medium: Text

Reference: Fortuna, P., Rocha da Silva, J., Soler-Company, J., Warner, L. and Nunes, S., 2019. A Hierarchically-Labeled Portuguese Hate Speech Dataset. In: Proceedings of the Third Workshop on Abusive Language Online. Florence, Italy: Association for Computational Linguistics, pp.94-104.

58. Offensive Comments in the Brazilian Web: A Dataset and Baseline Results

Link to publication: http://www.each.usp.br/digiampietri/BraSNAM/2017/p04.pdf

Link to data: https://github.com/rogersdepelle/OffComBR

Task description: Binary (Offensive, Not), Target (Xenophobia, homophobia, sexism, racism, cursing, religious intolerance)

Details of task: Religion/creed, Race/ethnicity, Physical/disability, Gender/sexual orientation

Size of dataset: 1,250

Percentage abusive: 0.33

Language: Portuguese

Level of annotation: Posts

Platform: g1.globo.com

Medium: Text

Reference: de Pelle, R. and Moreira, V., 2017. Offensive Comments in the Brazilian Web: A Dataset and Baseline Results. In: VI Brazilian Workshop on Social Network Analysis and Mining. SBC.

Slovene

59. Datasets of Slovene and Croatian Moderated News Comments

Link to publication: https://www.aclweb.org/anthology/W18-5116

Link to data: http://hdl.handle.net/11356/1201

Task description: Binary (Deleted, Not)

Details of task: Flagged content

Size of dataset: 7,600,000

Percentage abusive: 0.08

Language: Slovene

Level of annotation: Posts

Platform: MMC RTV website

Medium: Text

Reference: Ljubešić, N., Erjavec, T. and Fišer, D., 2018. Datasets of Slovene and Croatian Moderated News Comments. In: Proceedings of the 2nd Workshop on Abusive Language Online (ALW2). Brussels, Belgium: Association for Computational Linguistics, pp.124-131.

Spanish

60. Overview of MEX-A3T at IberEval 2018: Authorship and Aggressiveness Analysis in Mexican Spanish Tweets

Link to publication: http://ceur-ws.org/Vol-2150/overview-mex-a3t.pdf

Link to data: https://mexa3t.wixsite.com/home/aggressive-detection-track

Task description: Binary (Aggressive, Not)

Details of task: Group-directed

Size of dataset: 11,000

Percentage abusive: 0.32

Language: Spanish

Level of annotation: Posts

Platform: Twitter

Medium: Text

Reference: Alvarez-Carmona, M., Guzman-Falcon, E., Montes-y-Gomez, M., Escalante, H., Villasenor-Pineda, L., Reyes-Meza, V. and Rico-Sulayes, A., 2018. Overview of MEX-A3T at IberEval 2018: Authorship and aggressiveness analysis in Mexican Spanish tweets. In: Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018).

61. Overview of the Task on Automatic Misogyny Identification at IberEval 2018 (Spanish)

Link to publication: http://ceur-ws.org/Vol-2150/overview-AMI.pdf

Link to data: https://amiibereval2018.wordpress.com/important-dates/data/

Task description: Binary (Misogyny, Not), 5 categories (Stereotype, Dominance, Derailing, Sexual harassment, Discredit), Target of misogyny (Active or Passive)

Details of task: Sexism

Size of dataset: 4,138

Percentage abusive: 0.5

Language: Spanish

Level of annotation: Posts

Platform: Twitter

Medium: Text

Reference: Fersini, E., Rosso, P. and Anzovino, M., 2018. Overview of the Task on Automatic Misogyny Identification at IberEval 2018. In: Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018).

62. hatEval, SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter (Spanish)

Link to publication: https://www.aclweb.org/anthology/S19-2007

Link to data: https://competitions.codalab.org/competitions/19935

Task description: Branching structure of tasks: Binary (Hate, Not), Within Hate (Group, Individual), Within Hate (Aggressive, Not)

Details of task: Group-directed + Person-directed

Size of dataset: 6,600

Percentage abusive: 0.4

Language: Spanish

Level of annotation: Posts

Platform: Twitter

Medium: Text

Reference: Basile, V., Bosco, C., Fersini, E., Nozza, D., Patti, V., Pardo, F., Rosso, P. and Sanguinetti, M., 2019. SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter. In: Proceedings of the 13th International Workshop on Semantic Evaluation. Minneapolis, Minnesota: Association for Computational Linguistics, pp.54-63.

Turkish

63. A Corpus of Turkish Offensive Language on Social Media

Link to publication: https://coltekin.github.io/offensive-turkish/troff.pdf

Link to data: https://sites.google.com/site/offensevalsharedtask/home

Task description: Branching structure of tasks: Binary (Hate, Not), Within Hate (Group, Individual), Within Hate (Aggressive, Not)

Details of task: Group-directed + Person-directed

Size of dataset: 36,232

Percentage abusive: 0.19

Language: Turkish

Level of annotation: Posts

Platform: Twitter

Medium: Text

Reference: Çöltekin, C., 2020. A Corpus of Turkish Offensive Language on Social Media. In: Proceedings of the 12th International Conference on Language Resources and Evaluation.
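
Each entry in this list follows the same "Field: value" layout, so it can be read programmatically. The following is a minimal sketch, not an official tool for this catalogue: the function names `parse_entry` and `estimated_abusive_count` are hypothetical, and it assumes that "Percentage abusive" is recorded as a proportion between 0 and 1, as the listed values suggest. The sample record reuses the size and proportion given for entry 62.

```python
def parse_entry(text: str) -> dict[str, str]:
    """Split 'Field: value' lines into a dict, using the first ': ' as separator."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


def estimated_abusive_count(fields: dict[str, str]) -> int:
    """Approximate number of abusive items: dataset size times abusive proportion.

    Assumption: 'Percentage abusive' is a fraction in [0, 1], not a percentage.
    """
    size = int(fields["Size of dataset"].replace(",", ""))  # tolerate '6,600' and '6600'
    fraction = float(fields["Percentage abusive"])
    return round(size * fraction)


# Sample record copied from the fields of entry 62 (hatEval, Spanish).
ENTRY = """\
Size of dataset: 6,600
Percentage abusive: 0.4
Language: Spanish
Platform: Twitter
"""

fields = parse_entry(ENTRY)
print(fields["Language"], estimated_abusive_count(fields))
```

The resulting count is only a rough estimate, because the proportions in this list are rounded to two decimal places.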

Lists of abusive keywords

This page is http://hatespeechdata.com/.