Name entity recognition for Malay texts using cross-lingual annotation projection approach

Cross-lingual annotation projection methods can draw on rich-resourced languages to improve the performance of Natural Language Processing (NLP) tasks in less-resourced languages. In this research, Malay serves as the less-resourced language and English as the rich-resou...

Bibliographic Details
Published in: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Main Author: Zamin N.; Bakar Z.A.
Format: Conference paper
Language: English
Published: Springer Verlag 2015
Online Access: https://www.scopus.com/inward/record.uri?eid=2-s2.0-84948970178&doi=10.1007%2f978-3-319-21404-7_18&partnerID=40&md5=f93e12612885f286f1f3b8b50c3400ca
id 2-s2.0-84948970178
author Zamin N.; Bakar Z.A.
title Name entity recognition for Malay texts using cross-lingual annotation projection approach
publishDate 2015
container_title Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
container_volume 9155
container_issue
doi_str_mv 10.1007/978-3-319-21404-7_18
url https://www.scopus.com/inward/record.uri?eid=2-s2.0-84948970178&doi=10.1007%2f978-3-319-21404-7_18&partnerID=40&md5=f93e12612885f286f1f3b8b50c3400ca
description Cross-lingual annotation projection methods can draw on rich-resourced languages to improve the performance of Natural Language Processing (NLP) tasks in less-resourced languages. In this research, Malay serves as the less-resourced language and English as the rich-resourced language. The research aims to ease the deadlock in Malay computational linguistics research, caused by the shortage of Malay tools and annotated corpora, by exploiting state-of-the-art English tools. This paper proposes an alignment method known as MEWA (Malay-English Word Aligner) that integrates a Dice coefficient and a bigram string similarity measure with little supervision to automatically recognize three common named entities: person (PER), organization (ORG) and location (LOC). First, a test collection of Malay journalistic articles on Indonesian terrorism is established in three volumes of 646, 5413 and 10002 words. Second, a comparative study of selected state-of-the-art tools is conducted to evaluate their performance against the test collection. Third, MEWA is used to automatically induce annotations from the test collection and the identified English tool. An accuracy rate of 93% is achieved in a series of NE annotation projection experiments. © Springer International Publishing Switzerland 2015.
publisher Springer Verlag
issn 0302-9743
language English
format Conference paper
accesstype
record_format scopus
collection Scopus
_version_ 1809677687961157632
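
The description above names two ingredients of MEWA, a Dice coefficient and a bigram string similarity measure, and uses the resulting alignment to project PER, ORG and LOC labels from English onto Malay. The record does not say how the two scores are combined or how projection is carried out, so the following Python sketch is only an illustration under stated assumptions: the function names, the equal-weight average of the two scores, the 0.5 threshold and the toy sentences are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def char_bigrams(word):
    """Return the multiset of character bigrams of a lowercased word."""
    w = word.lower()
    return Counter(w[i:i + 2] for i in range(len(w) - 1))


def bigram_similarity(a, b):
    """Dice coefficient over character bigrams: 2*|A & B| / (|A| + |B|)."""
    ba, bb = char_bigrams(a), char_bigrams(b)
    total = sum(ba.values()) + sum(bb.values())
    if total == 0:  # both words are shorter than two characters
        return 1.0 if a.lower() == b.lower() else 0.0
    overlap = sum((ba & bb).values())
    return 2.0 * overlap / total


def cooccurrence_dice(en_word, ms_word, sentence_pairs):
    """Dice coefficient over sentence-level co-occurrence counts."""
    en_count = ms_count = both = 0
    for en_tokens, ms_tokens in sentence_pairs:
        in_en = en_word in en_tokens
        in_ms = ms_word in ms_tokens
        en_count += in_en
        ms_count += in_ms
        both += in_en and in_ms
    if en_count + ms_count == 0:
        return 0.0
    return 2.0 * both / (en_count + ms_count)


def project_entities(en_tokens, en_tags, ms_tokens, sentence_pairs, threshold=0.5):
    """Copy each English entity tag to the best-scoring Malay token.

    Assumption: the two scores are averaged with equal weight; the
    record does not describe how MEWA actually combines them.
    """
    ms_tags = ["O"] * len(ms_tokens)
    for en_tok, tag in zip(en_tokens, en_tags):
        if tag == "O":
            continue
        scores = [
            0.5 * bigram_similarity(en_tok, ms_tok)
            + 0.5 * cooccurrence_dice(en_tok, ms_tok, sentence_pairs)
            for ms_tok in ms_tokens
        ]
        best = max(range(len(ms_tokens)), key=scores.__getitem__)
        if scores[best] >= threshold:
            ms_tags[best] = tag
    return ms_tags


# Hypothetical toy parallel corpus (English tokens, Malay tokens).
corpus = [
    (["Police", "arrested", "Abu", "Bakar", "in", "Jakarta"],
     ["Polis", "menahan", "Abu", "Bakar", "di", "Jakarta"]),
    (["Police", "in", "Indonesia"],
     ["Polis", "di", "Indonesia"]),
]
en_tokens, ms_tokens = corpus[0]
en_tags = ["O", "O", "PER", "PER", "O", "LOC"]
print(project_entities(en_tokens, en_tags, ms_tokens, corpus))
```

Named entities are often spelled alike in both languages, which is why a string similarity score is a natural complement to co-occurrence statistics here; a real system would also need to handle multi-word entities and translated names.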