Projecting named entity tags from a resource rich language to a resource poor language

Named Entities (NE) are the prominent entities appearing in textual documents. Automatic classification of NE in a textual corpus is a vital process in Information Extraction and Information Retrieval research. Named Entity Recognition (NER) is the identification of words in text that correspond to...

Bibliographic Details
Published in:	Journal of Information and Communication Technology
Main Author:	Zamin N.; Oxley A.; Bakar Z.A.
Format:	Article
Language:	English
Published:	Universiti Utara Malaysia Press 2013
Online Access:	https://www.scopus.com/inward/record.uri?eid=2-s2.0-84893004882&partnerID=40&md5=f8a6bb4af5b4d7f06f9e5e8666f91c47

id	2-s2.0-84893004882
author	Zamin N.; Oxley A.; Bakar Z.A.
title	Projecting named entity tags from a resource rich language to a resource poor language
publishDate	2013
container_title	Journal of Information and Communication Technology
container_volume	12
container_issue	1
doi_str_mv
url	https://www.scopus.com/inward/record.uri?eid=2-s2.0-84893004882&partnerID=40&md5=f8a6bb4af5b4d7f06f9e5e8666f91c47
description	Named Entities (NE) are the prominent entities appearing in textual documents. Automatic classification of NE in a textual corpus is a vital process in Information Extraction and Information Retrieval research. Named Entity Recognition (NER) is the identification of words in text that correspond to a pre-defined taxonomy such as person, organization, location, date, time, etc. This article focuses on the person (PER), organization (ORG) and location (LOC) entities for a Malay journalistic corpus of terrorism. A projection algorithm, using the Dice Coefficient function and bigram scoring method with domain-specific rules, is suggested to map the NE information from the English corpus to the Malay corpus of terrorism. The English corpus is the translated version of the Malay corpus. Hence, these two corpora are treated as parallel corpora. The method computes the string similarity between the English words and the list of available lexemes in a pre-built lexicon that approximates the best NE mapping. The algorithm has been effectively evaluated using our own terrorism tagged corpus; it achieved satisfactory results in terms of precision, recall, and F-measure. An evaluation of the selected open source NER tool for English is also presented.
publisher	Universiti Utara Malaysia Press
issn	1675414X
language	English
format	Article
accesstype
record_format	scopus
collection	Scopus
_version_	1809677788026765312
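
The description mentions a Dice Coefficient function with bigram scoring for matching English words against lexemes in a pre-built lexicon. The record does not give the exact scoring formula, thresholds, or domain-specific rules, so the following is only a minimal sketch of a generic character-bigram Dice similarity, not the authors' implementation. The function names, the `threshold` value, and the sample lexicon entries are hypothetical and chosen for illustration.

```python
from collections import Counter


def bigrams(text):
    """Return the character bigrams of a lowercased string."""
    t = text.lower()
    return [t[i:i + 2] for i in range(len(t) - 1)]


def dice(a, b):
    """Dice coefficient over character bigrams: 2 * |X & Y| / (|X| + |Y|)."""
    x, y = Counter(bigrams(a)), Counter(bigrams(b))
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        # Both strings are shorter than two characters; fall back to equality.
        return 1.0 if a.lower() == b.lower() else 0.0
    overlap = sum((x & y).values())  # multiset intersection of bigrams
    return 2 * overlap / total


def best_match(word, lexicon, threshold=0.6):
    """Return (lexeme, tag, score) for the most similar lexeme, or None.

    `lexicon` maps lexemes to NE tags; `threshold` is an assumed cut-off.
    """
    best = None
    for lexeme, tag in lexicon.items():
        score = dice(word, lexeme)
        if score >= threshold and (best is None or score > best[2]):
            best = (lexeme, tag, score)
    return best


# Hypothetical lexicon entries, for illustration only.
sample_lexicon = {
    "Jakarta": "LOC",
    "Malaysia": "LOC",
    "Polis Diraja Malaysia": "ORG",
}

match = best_match("Malaysian", sample_lexicon)
```

In this sketch, a tagged English word whose best lexicon match clears the threshold would pass its tag to the aligned Malay token; how the article actually combines the score with its domain-specific rules is not stated in this record.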

Similar Items