Summary: | Predicting workers' annual income requires a close examination of several factors. The factors most often discussed are age, gender, education, and occupation, but other factors that may affect annual income have yet to be examined. The traditional method for predicting workers' annual income is multiple linear regression. This parametric approach requires that its assumptions be satisfied, and verifying them is time-consuming. A data mining approach to predicting workers' income is important for understanding how the economy and compensation work in the United States. Machine learning can accommodate these factors without requiring such assumptions to be fulfilled, unlike the traditional method. Hence, machine learning is a more suitable way to predict workers' income in the United States, and doing so also supports SDG 8: Decent Work and Economic Growth. The dataset used in this study was obtained from Kaggle. First, feature weights were computed using filter methods (Weight by Information Gain, Weight by Information Gain Ratio, and Weight by Chi-Squared Statistic) to identify the factors that most influence workers' annual income. Three methods were used to build the prediction models: logistic regression, decision trees, and artificial neural networks. The second goal is to compare the effectiveness of worker income prediction under undersampling and oversampling techniques. The results show that, with the exception of the decision tree, oversampling yields better prediction performance than undersampling. Oversampling performs better because undersampling randomly deletes observations, some of which may be significant to the data and affect the prediction model.
The third goal is to identify the most effective classification model for predicting workers' income. For logistic regression, the best model combines oversampling with backward selection. For decision trees, the best model combines undersampling with backward selection. For artificial neural networks, the best model combines oversampling with backward selection. © 2024 Author(s).
|