A Quality Type-aware Annotated Corpus and Lexicon for Harassment Research

Title: A Quality Type-aware Annotated Corpus and Lexicon for Harassment Research
Publication Type: Conference Paper
Year of Publication: 2018
Authors: Mohammadreza Rezvan, Saeedeh Shekarpour, Lakshika Balasuriya, Krishnaprasad Thirunarayan, Valerie Shalin, Amit Sheth
Conference Name: Proceedings of the 10th ACM Conference on Web Science, WebSci 2018, Amsterdam, The Netherlands, May 27-30, 2018
Pagination: 33–36
Date Published: 05/2018
Publisher: ACM
Conference Location: Amsterdam, The Netherlands
Keywords: Annotated corpus, appearance-related, context, cyberbullying, harassment, intellectual, offensive lexicon, political, profane word, racial, sexual
Abstract

A quality annotated corpus is essential to research. Despite the recent focus of the Web science community on cyberbullying research, the community lacks standard benchmarks. This paper provides both a quality annotated corpus and an offensive words lexicon capturing different types of harassment content: (i) sexual, (ii) racial, (iii) appearance-related, (iv) intellectual, and (v) political. We first crawled data from Twitter using this content-tailored offensive lexicon. Because the mere presence of an offensive word is not a reliable indicator of harassment, human judges then annotated the tweets for the presence of harassment. Our corpus consists of 25,000 annotated tweets for the five types of harassment content and is available on the Git repository.
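The two-stage process described in the abstract (lexicon-based selection of candidate tweets, followed by human annotation) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' pipeline: the lexicon entries are neutral placeholders, and the function names, the whole-word matching rule, and the record layout are all assumptions introduced here. The `label` field is left empty because the abstract states that a lexicon hit alone does not establish harassment.

```python
import re
from collections import defaultdict

# Hypothetical placeholder lexicon keyed by the five harassment types named in
# the abstract. The real lexicon entries are not reproduced here.
LEXICON = {
    "sexual": ["placeholder_s1"],
    "racial": ["placeholder_r1"],
    "appearance-related": ["placeholder_a1", "placeholder a2"],
    "intellectual": ["placeholder_i1"],
    "political": ["placeholder_p1"],
}


def compile_patterns(lexicon):
    """Build one case-insensitive, whole-term regex per harassment type."""
    patterns = {}
    for htype, terms in lexicon.items():
        if not terms:
            continue  # an empty alternation would match everywhere
        # Longest terms first so multi-word entries win over their prefixes.
        ordered = sorted(terms, key=len, reverse=True)
        alternation = "|".join(re.escape(t) for t in ordered)
        patterns[htype] = re.compile(
            r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE
        )
    return patterns


def candidate_types(tweet, patterns):
    """Return the harassment types whose lexicon matches the tweet text."""
    return [htype for htype, pat in patterns.items() if pat.search(tweet)]


def build_annotation_queue(tweets, lexicon):
    """Group lexicon-matched tweets by type for later human annotation.

    A lexicon hit only makes a tweet a candidate; 'label' stays None until
    a human judge decides whether the tweet is actually harassing.
    """
    patterns = compile_patterns(lexicon)
    queue = defaultdict(list)
    for tweet in tweets:
        for htype in candidate_types(tweet, patterns):
            queue[htype].append({"text": tweet, "type": htype, "label": None})
    return dict(queue)
```

Calling `build_annotation_queue(tweets, LEXICON)` on a list of tweet strings returns a dictionary from harassment type to unlabeled candidate records. A tweet that matches terms from several types appears once under each type, which is one possible design choice among several.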

DOI: 10.1145/3201064.3201103
Full Text

Mohammadreza Rezvan, Saeedeh Shekarpour, Lakshika Balasuriya, Krishnaprasad Thirunarayan, Valerie L. Shalin, Amit Sheth.
A Quality Type-aware Annotated Corpus and Lexicon for Harassment Research. Web Science, WebSci 2018, Amsterdam, The Netherlands, May 27-30, 2018

Projects: 
Harassment