A bi-annotated Malay-English code-switching (Manglish) dataset of X posts for biological gender identification and authorship attribution

Low-resource languages, like Malay, face the threat of extinction when linguistic resources become scarce. This paper addresses the scarcity issue by contributing to the inventory of low-resource languages, specifically focusing on Malay-English, known as Manglish. Manglish speakers are primarily...

Bibliographic Details
Published in: DATA IN BRIEF
Main Authors: Maskat, Ruhaila; Azman, Norazmiera Ayunie; Nulizairos, Nur Shaheera Shastera; Zahidin, Nurul Athirah; Mahadi, Adibah Humairah; Norshamsul, Siti Rubaya; Sharif, Mohd Mukhlis Mohd; Mahdin, Hairulnizam
Format: Article; Data Paper; Early Access
Language: English
Published: ELSEVIER 2024
Subjects: Science & Technology - Other Topics
Online Access: https://www-webofscience-com.uitm.idm.oclc.org/wos/woscc/full-record/WOS:001157110000001
author Maskat, Ruhaila; Azman, Norazmiera Ayunie; Nulizairos, Nur Shaheera Shastera; Zahidin, Nurul Athirah; Mahadi, Adibah Humairah; Norshamsul, Siti Rubaya; Sharif, Mohd Mukhlis Mohd; Mahdin, Hairulnizam
spellingShingle Maskat, Ruhaila; Azman, Norazmiera Ayunie; Nulizairos, Nur Shaheera Shastera; Zahidin, Nurul Athirah; Mahadi, Adibah Humairah; Norshamsul, Siti Rubaya; Sharif, Mohd Mukhlis Mohd; Mahdin, Hairulnizam
A bi-annotated Malay-English code-switching (Manglish) dataset of X posts for biological gender identification and authorship attribution
Science & Technology - Other Topics
author_facet Maskat, Ruhaila; Azman, Norazmiera Ayunie; Nulizairos, Nur Shaheera Shastera; Zahidin, Nurul Athirah; Mahadi, Adibah Humairah; Norshamsul, Siti Rubaya; Sharif, Mohd Mukhlis Mohd; Mahdin, Hairulnizam
author_sort Maskat
spelling Maskat, Ruhaila; Azman, Norazmiera Ayunie; Nulizairos, Nur Shaheera Shastera; Zahidin, Nurul Athirah; Mahadi, Adibah Humairah; Norshamsul, Siti Rubaya; Sharif, Mohd Mukhlis Mohd; Mahdin, Hairulnizam
A bi-annotated Malay-English code-switching (Manglish) dataset of X posts for biological gender identification and authorship attribution
DATA IN BRIEF
English
Article; Data Paper; Early Access
Low-resource languages, like Malay, face the threat of extinction when linguistic resources become scarce. This paper addresses the scarcity issue by contributing to the inventory of low-resource languages, specifically focusing on Malay-English, known as Manglish. Manglish speakers are primarily located in Malaysia, Indonesia, Brunei, and Singapore. As global adoption of second languages and social media usage increases, language code-switching, such as Spanglish and Chinglish, becomes more prevalent. In the case of Malay-English, this phenomenon is termed Manglish. To enhance the status of the Malay language and its transition out of the low-resource category, this unique text corpus, with binary annotations for biological gender and anonymized author identities, is presented. This bi-annotated dataset offers valuable applications for various fields, including the investigation of cyberbullying, combating gender bias, and providing targeted recommendations for gender-specific products. This corpus can be used with either of the annotations or their composite. The dataset comprises posts from 50 Malaysian public figures, equally split between biological males and females. The dataset contains a total of 709,012 raw X posts (formerly Twitter), with a relatively balanced distribution of 53.72% from biological female authors and 46.28% from biological male authors. The Twitter API was used to scrape the posts. After pre-processing, the total was reduced to 650,409 posts, widening the gap between the genders to 56.88% for biological female authors and 43.12% for biological male authors. This dataset is a valuable resource for researchers in the field of Malay-English code-switching Natural Language Processing (NLP) and can be used to train or enhance existing and future Manglish language transformers. (c) 2024 Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/)
ELSEVIER
2352-3409

2024
52

10.1016/j.dib.2024.110034
Science & Technology - Other Topics
gold
WOS:001157110000001
https://www-webofscience-com.uitm.idm.oclc.org/wos/woscc/full-record/WOS:001157110000001
title A bi-annotated Malay-English code-switching (Manglish) dataset of X posts for biological gender identification and authorship attribution
title_short A bi-annotated Malay-English code-switching (Manglish) dataset of X posts for biological gender identification and authorship attribution
title_full A bi-annotated Malay-English code-switching (Manglish) dataset of X posts for biological gender identification and authorship attribution
title_fullStr A bi-annotated Malay-English code-switching (Manglish) dataset of X posts for biological gender identification and authorship attribution
title_full_unstemmed A bi-annotated Malay-English code-switching (Manglish) dataset of X posts for biological gender identification and authorship attribution
title_sort A bi-annotated Malay-English code-switching (Manglish) dataset of X posts for biological gender identification and authorship attribution
container_title DATA IN BRIEF
language English
format Article; Data Paper; Early Access
description Low-resource languages, like Malay, face the threat of extinction when linguistic resources become scarce. This paper addresses the scarcity issue by contributing to the inventory of low-resource languages, specifically focusing on Malay-English, known as Manglish. Manglish speakers are primarily located in Malaysia, Indonesia, Brunei, and Singapore. As global adoption of second languages and social media usage increases, language code-switching, such as Spanglish and Chinglish, becomes more prevalent. In the case of Malay-English, this phenomenon is termed Manglish. To enhance the status of the Malay language and its transition out of the low-resource category, this unique text corpus, with binary annotations for biological gender and anonymized author identities, is presented. This bi-annotated dataset offers valuable applications for various fields, including the investigation of cyberbullying, combating gender bias, and providing targeted recommendations for gender-specific products. This corpus can be used with either of the annotations or their composite. The dataset comprises posts from 50 Malaysian public figures, equally split between biological males and females. The dataset contains a total of 709,012 raw X posts (formerly Twitter), with a relatively balanced distribution of 53.72% from biological female authors and 46.28% from biological male authors. The Twitter API was used to scrape the posts. After pre-processing, the total was reduced to 650,409 posts, widening the gap between the genders to 56.88% for biological female authors and 43.12% for biological male authors. This dataset is a valuable resource for researchers in the field of Malay-English code-switching Natural Language Processing (NLP) and can be used to train or enhance existing and future Manglish language transformers. (c) 2024 Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/)
publisher ELSEVIER
issn 2352-3409

publishDate 2024
container_volume 52
container_issue
doi_str_mv 10.1016/j.dib.2024.110034
topic Science & Technology - Other Topics
topic_facet Science & Technology - Other Topics
accesstype gold
id WOS:001157110000001
url https://www-webofscience-com.uitm.idm.oclc.org/wos/woscc/full-record/WOS:001157110000001
record_format wos
collection Web of Science (WoS)
_version_ 1809678632459698176