Feature Substitution Using Latent Dirichlet Allocation for Text Classification

Text classification plays a pivotal role in natural language processing, enabling applications such as product categorization, sentiment analysis, spam detection, and document organization. Traditional methods, including bag-of-words and TF-IDF, often lead to high-dimensional feature spaces, increas...

Bibliographic Details
Published in:	INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS
Main Authors:	Mathivanan, Norsyela Muhammad Noor; Janor, Roziah Mohd; Abd Razak, Shukor; Ghani, Nor Azura Md.
Format:	Article
Language:	English
Published:	SCIENCE & INFORMATION SAI ORGANIZATION LTD 2025
Subjects:	Computer Science
Online Access:	https://www-webofscience-com.uitm.idm.oclc.org/wos/woscc/full-record/WOS:001437132300001

Description:	Text classification plays a pivotal role in natural language processing, enabling applications such as product categorization, sentiment analysis, spam detection, and document organization. Traditional methods, including bag-of-words and TF-IDF, often lead to high-dimensional feature spaces, increasing computational complexity and susceptibility to overfitting. This study introduces a novel Feature Substitution technique using Latent Dirichlet Allocation (FS-LDA), which enhances text representation by replacing non-overlapping high-probability topic words. FS-LDA effectively reduces dimensionality while retaining essential semantic features, optimizing classification accuracy and efficiency. Experimental evaluations on five ecommerce datasets and an SMS spam dataset demonstrated that FS-LDA, combined with Hidden Markov Models (HMMs), achieved up to 95% classification accuracy in binary tasks and significant improvements in macro and weighted F1-scores for multiclass tasks. The innovative approach lies in FS-LDA's ability to seamlessly integrate dimensionality reduction with feature substitution, while its predictive advantage is demonstrated through consistent performance enhancement across diverse datasets. Future work will explore its application to other classification models and domains, such as social media analysis and medical document categorization, to further validate its scalability and robustness.
ISSN:	2158-107X; 2156-5570
Volume:	16
Issue:	1
Record ID:	WOS:001437132300001
Collection:	Web of Science (WoS)

Similar Items