Construction of Malay abbreviation corpus based on social media data

Bibliographic Details
Published in: Journal of Engineering and Applied Sciences
Main Author: Omar N.; Hamsani A.F.; Abdullah N.A.S.; Abidin S.Z.Z.
Format: Article
Language: English
Published: Medwell Journals 2017
Online Access: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85017468132&doi=10.3923%2fjeasci.2017.468.474&partnerID=40&md5=5840d75936698a956bbcf8ca6fc858bb
Description
Summary: This study describes the construction of a Malay abbreviation corpus by extracting and normalizing selected social media data using a multilayer filtration pattern-matching technique together with a statistical machine translation approach. One million Malay Lingo user-generated posts from Twitter and Facebook were extracted for sampling. Each word undergoes a pre-processing stage, which involves filtration and association, and is stored in a MySQL database table. Each word in the corpus is then linked with its corresponding word in the existing vocabulary; a word with no match is treated as an abbreviation, processed further using an N-gram approach, and added to the existing corpus. The results show that the longer the text, the lower the translation probability. Writing style also matters: when spaces are not used to separate words, two or more words merge into a single out-of-vocabulary word. In the worst case, the merged word has no link to any recognizable root word in the dictionary. In the first attempt, which processed 1000 selected social media posts, many uncommon abbreviations were found, and a lower translation percentage was achieved as a result. However, when a post uses common abbreviations that exist in the Malay Social Media Corpus, the translation is able to achieve 100% accuracy. Even so, the supply of user-generated words is unbounded, and there are still many ways to improve the combination of NLP techniques to construct a better and more reliable corpus, given the dynamic nature of users' behaviour and their informal ways of writing. Such a corpus is much needed for analysing public sentiment along various dimensions, such as product-related evaluations and service-oriented feedback, which are propagated across various social media platforms. © Medwell Journals, 2017.
ISSN: 1816949X
DOI: 10.3923/jeasci.2017.468.474
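
Illustrative sketch: the summary describes linking each word to an existing vocabulary and passing unmatched words to an N-gram step. The record does not give the authors' implementation, so the following is only a minimal sketch under stated assumptions: the toy vocabulary, the function names, the use of character bigrams, and the Dice similarity score are all hypothetical choices made for illustration, not the method reported in the article.

```python
from collections import Counter

# Hypothetical toy vocabulary; the study's actual word list is not part of this record.
VOCABULARY = {"saya", "tidak", "yang", "makan", "dengan", "sudah"}


def char_ngrams(word, n=2):
    """Count character n-grams of a word, padded with '#' at both ends."""
    padded = f"#{word}#"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def dice_similarity(a, b, n=2):
    """Dice coefficient over character n-gram multisets (assumed scoring choice)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    overlap = sum((grams_a & grams_b).values())
    total = sum(grams_a.values()) + sum(grams_b.values())
    return 2 * overlap / total if total else 0.0


def split_known_and_oov(post, vocabulary):
    """Split a post on whitespace into in-vocabulary and out-of-vocabulary tokens."""
    tokens = post.lower().split()
    known = [t for t in tokens if t in vocabulary]
    oov = [t for t in tokens if t not in vocabulary]
    return known, oov


def rank_candidates(token, vocabulary, n=2):
    """Rank vocabulary words by n-gram similarity to an unmatched token."""
    return sorted(vocabulary, key=lambda w: (-dice_similarity(token, w, n), w))


if __name__ == "__main__":
    known, oov = split_known_and_oov("Saya tdk makan", VOCABULARY)
    print("known:", known)
    print("out-of-vocabulary:", oov)
    for token in oov:
        print(token, "->", rank_candidates(token, VOCABULARY)[:3])
```

Because the sketch splits posts on whitespace only, words written without separating spaces arrive as a single unmatched token, which is the merged-word problem the summary reports.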