Construction of Malay abbreviation corpus based on social media data
This study describes the construction of a Malay abbreviation corpus by extracting and normalizing selected social media data with a multilayer filtration pattern-matching technique combined with a statistical machine translation approach. In this study, one million Malay Lingo user-generated posts via Twitter...
Published in: | Journal of Engineering and Applied Sciences |
---|---|
Main Author: | Omar N. |
Format: | Article |
Language: | English |
Published: | Medwell Journals, 2017 |
Online Access: | https://www.scopus.com/inward/record.uri?eid=2-s2.0-85017468132&doi=10.3923%2fjeasci.2017.468.474&partnerID=40&md5=5840d75936698a956bbcf8ca6fc858bb |
id |
2-s2.0-85017468132 |
---|---|
author |
Omar N.; Hamsani A.F.; Abdullah N.A.S.; Abidin S.Z.Z. |
title |
Construction of Malay abbreviation corpus based on social media data |
publishDate |
2017 |
container_title |
Journal of Engineering and Applied Sciences |
container_volume |
12 |
container_issue |
3 |
doi_str_mv |
10.3923/jeasci.2017.468.474 |
url |
https://www.scopus.com/inward/record.uri?eid=2-s2.0-85017468132&doi=10.3923%2fjeasci.2017.468.474&partnerID=40&md5=5840d75936698a956bbcf8ca6fc858bb |
description |
This study describes the construction of a Malay abbreviation corpus by extracting and normalizing selected social media data with a multilayer filtration pattern-matching technique combined with a statistical machine translation approach. One million Malay Lingo user-generated posts from Twitter and Facebook are extracted for sampling. Each word undergoes a pre-processing stage that involves filtration and association, and is stored in a MySQL database table. Each word in the corpus is then linked with its corresponding word in the existing vocabulary; a word with no such link is considered an abbreviation, processed further using an N-gram approach, and added to the existing corpus. The results show that the longer the text, the lower the translation probability. Writing style also matters: when spaces are omitted between words, two or more words merge into a single out-of-vocabulary word. In the worst case, the merged word has no link to any recognizable root word in the dictionary. In the first attempt, which processed 1000 selected social media posts, many uncommon abbreviations were found, and the translation percentage was correspondingly lower. However, when a post uses common abbreviations that exist in the Malay Social Media Corpus, the translation achieves 100% accuracy. Because the supply of user-generated words is effectively unbounded, and because users' behaviour and informal writing styles keep changing, there remain many ways to improve the combination of NLP techniques used to construct a better and more reliable corpus. Such a corpus is much needed for analysing public sentiment along various dimensions, such as product-related evaluations and service-oriented feedback propagated across social media platforms. © Medwell Journals, 2017. |
publisher |
Medwell Journals |
issn |
1816949X |
language |
English |
format |
Article |
record_format |
scopus |
collection |
Scopus |