Automatic Labelling of Malay Cyberbullying Twitter Corpus using Combinations of Sentiment, Emotion and Toxicity Polarities

Bibliographic Details
Published in: ACM International Conference Proceeding Series
Main Author: Maskat R.; Faizzuddin Zainal M.; Ismail N.; Ardi N.; Ahmad A.; Daud N.
Format: Conference paper
Language: English
Published: Association for Computing Machinery 2020
Online Access: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85102973769&doi=10.1145%2f3446132.3446412&partnerID=40&md5=5c668651d4c77ca8be2f2761336f508e
DOI: 10.1145/3446132.3446412
Automatic labelling is essential for large corpora. Engaging human experts to label data can be challenging: semantic understanding can differ from one labeller to another depending on each individual's language ability. Platforms such as Amazon Mechanical Turk cannot ensure annotation quality in every domain. Extensive steps such as qualification and cross-checking of labels may be implemented, but these increase the cost of data annotation. Thus, the higher the expected quality of the labelled data, the greater the cost. The problem is worse when the language is low-resource, which in this work is Malay. Malay is used mostly in Malaysia, Indonesia, Singapore and Brunei. Unlike English, which has large resources for capturing the semantics of sentences and so allows automatic labelling to mature faster, resources for Malay are still limited. The difficulty is compounded by the use of social media data, where text is short, unnormalized and inherently code-switched. Qualified native Malay labellers are also scarce. To overcome this, we devised a method to automatically label a total of 219,444 Malay tweets using a combination of sentiment, emotion and toxicity polarities. We extend the work of Arslan et al., who proposed using sentiment and emotion to identify cyberbullying text, by adding toxicity polarity in the context of automatically labelling cyberbullying tweets in Malay. We employed 5 experts with formal degrees in the Malay language to label our training set. We applied this method to a Malay cyberbullying corpus to assign "bully" and "not bully" labels. We tested our method on 54,867 manually labelled samples and achieved high accuracy. © 2020 ACM.
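The abstract does not state how the three polarities are combined, so the following Python sketch is only an illustration of the general idea, not the authors' method. It assumes each tweet has already been assigned a sentiment, an emotion and a toxicity polarity by upstream classifiers, and it uses a hypothetical majority-vote rule: a tweet is labelled "bully" when at least two of the three signals are negative. The class name, the label vocabularies and the set of negative emotions are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Polarities:
    """Polarities for one tweet, assumed to come from upstream classifiers."""
    sentiment: str  # assumed vocabulary: "positive", "neutral", "negative"
    emotion: str    # assumed vocabulary: e.g. "anger", "joy", "sadness"
    toxicity: str   # assumed vocabulary: "toxic", "non-toxic"


# Hypothetical choice of which emotions count as a negative signal.
NEGATIVE_EMOTIONS = frozenset({"anger", "disgust", "fear", "sadness"})


def label_tweet(p: Polarities) -> str:
    """Assign "bully" when at least two of the three signals are negative.

    The majority-vote threshold is an assumption for illustration only.
    """
    signals = [
        p.sentiment == "negative",
        p.emotion in NEGATIVE_EMOTIONS,
        p.toxicity == "toxic",
    ]
    return "bully" if sum(signals) >= 2 else "not bully"


def accuracy(predicted: list[str], gold: list[str]) -> float:
    """Fraction of automatic labels that match the manual labels."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and equal in length")
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold)
```

Under these assumptions, the automatic labels for a manually labelled subset would be produced with `label_tweet` and compared against the expert labels with `accuracy`, mirroring the evaluation on manually labelled data that the abstract describes.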