Automatic Labelling of Malay Cyberbullying Twitter Corpus using Combinations of Sentiment, Emotion and Toxicity Polarities

Bibliographic Details
Published in: ACM International Conference Proceeding Series
Main Authors: Maskat R.; Faizzuddin Zainal M.; Ismail N.; Ardi N.; Ahmad A.; Daud N.
Format: Conference paper
Language: English
Published: Association for Computing Machinery 2020
Online Access: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85102973769&doi=10.1145%2f3446132.3446412&partnerID=40&md5=5c668651d4c77ca8be2f2761336f508e
Description
Summary: Automatic labelling is essential for large corpora. Engaging human experts to label data can be challenging: semantic understanding can differ from one labeler to another depending on each individual's language ability, and platforms such as Amazon Mechanical Turk cannot ensure annotation quality in every domain. Extensive steps such as qualification and counter-checking of labels may be implemented, but these increase the cost of data annotation. Thus, the higher the expected quality of the labelled data, the greater the cost. The situation is worse when the language is low-resource, which in this work is Malay. Malay is used mostly in Malaysia, Indonesia, Singapore and Brunei. Unlike English, which has large resources for capturing the semantics of sentences and so allows automatic labelling to mature faster, resources for Malay are still limited. The problem is further compounded by the use of social media data, where the text is short and unnormalized and code-switching is inherently present. Qualified native Malay labelers are also scarce. To overcome this, we devised a method to automatically label a total of 219,444 Malay tweets using a combination of sentiment, emotion and toxicity polarities. We extend the work of Arslan et al., who proposed using sentiment and emotion to identify cyberbullying text. Our work adds toxicity polarity in the context of automatic labelling of cyberbullying tweets in Malay. We were able to employ 5 experts with formal degrees in Malay language to label our training set. We applied this method to a Malay cyberbullying corpus to assign "bully" and "not bully" labels. We tested our method on 54,867 manually labelled examples and achieved high accuracy. © 2020 ACM.
ISSN:
DOI:10.1145/3446132.3446412