Construction of Malay abbreviation corpus based on social media data

This study describes the construction of a Malay abbreviation corpus by extracting and normalizing selected social media data using a multilayer filtration pattern-matching technique together with a statistical machine translation approach. In this study, one million Malay-lingo user-generated posts via Twitter...

Bibliographic Details
Published in: Journal of Engineering and Applied Sciences
Main Authors: Omar N.; Hamsani A.F.; Abdullah N.A.S.; Abidin S.Z.Z.
Format: Article
Language: English
Published: Medwell Journals 2017
Online Access: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85017468132&doi=10.3923%2fjeasci.2017.468.474&partnerID=40&md5=5840d75936698a956bbcf8ca6fc858bb
This study describes the construction of a Malay abbreviation corpus by extracting and normalizing selected social media data using a multilayer filtration pattern-matching technique together with a statistical machine translation approach. One million Malay-lingo user-generated posts from Twitter and Facebook are extracted for sampling. Each word undergoes a pre-processing stage, which involves filtration and association, and is stored in a MySQL database table. Each word in the corpus is then linked with its respective word in the existing vocabulary; otherwise, it is considered an abbreviation, which is further processed using an N-gram approach and added to the existing corpus. The results show that the longer the text, the lower the translation probability. The style of writing is also very important: when spaces are not used to separate words, two or more words merge into a single out-of-vocabulary word. In the worst case, the merged word has no link to any recognizable root word in the dictionary. In the first attempt at processing 1000 selected social media posts, many uncommon abbreviations were found, and as a result a lower translation percentage was achieved. However, when a post uses common abbreviations that exist in the Malay Social Media Corpus, the translation is able to achieve 100% accuracy. Nevertheless, the source of user-generated words is infinite, and there are still many ways to improve the combination of NLP techniques to construct a better and more reliable corpus, given the dynamic nature of users' behaviour and their informal ways of writing. Such a corpus is much needed for analysing public sentiment in various dimensions, such as product-related evaluations and service-oriented feedback, which are propagated across various social media platforms. © Medwell Journals, 2017.
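
The lookup step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the vocabulary, the abbreviation table, and the function names `normalize` and `coverage` are hypothetical and chosen only for illustration, and neither the N-gram processing nor the MySQL storage is reproduced. Each token is first matched against a known vocabulary, then against a table of known abbreviations, and anything left over is reported as out-of-vocabulary.

```python
import re

# Hypothetical vocabulary and abbreviation table (illustrative entries only,
# not taken from the paper's corpus).
VOCABULARY = {"saya", "tidak", "yang", "dengan", "makan", "pergi"}
ABBREVIATIONS = {"sy": "saya", "x": "tidak", "yg": "yang", "dgn": "dengan"}

# Filtration step (assumed): keep lowercase alphanumeric runs, drop the rest.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def normalize(post):
    """Return (normalized tokens, out-of-vocabulary tokens) for one post."""
    normalized, unknown = [], []
    for token in TOKEN_RE.findall(post.lower()):
        if token in VOCABULARY:
            # Word already exists in the vocabulary: keep it unchanged.
            normalized.append(token)
        elif token in ABBREVIATIONS:
            # Known abbreviation: replace it with its full form.
            normalized.append(ABBREVIATIONS[token])
        else:
            # Neither a known word nor a known abbreviation.
            normalized.append(token)
            unknown.append(token)
    return normalized, unknown


def coverage(post):
    """Fraction of tokens resolved by the vocabulary or abbreviation table."""
    tokens, unknown = normalize(post)
    if not tokens:
        return 1.0
    return 1 - len(unknown) / len(tokens)
```

Under these assumptions, a post made up only of known words and common abbreviations is fully resolved, while two words written without a space (for example `sypergi`) form a single token that matches nothing and lowers the coverage, which mirrors the merged-word problem noted in the abstract.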
Volume: 12
Issue: 3
DOI: 10.3923/jeasci.2017.468.474
ISSN: 1816-949X