Enhancing Pretrained Multilingual Machine Translation Model with Code-Switching: A Study on Chinese, English and Malay Language


Bibliographic Details
Published in: ICCPR 2024 - Proceedings of the 2024 13th International Conference on Computing and Pattern Recognition
Main Authors: Liu H.; Seman N.
Format: Conference paper
Language: English
Published: Association for Computing Machinery, Inc., 2025
Online Access:https://www.scopus.com/inward/record.uri?eid=2-s2.0-85218347057&doi=10.1145%2f3704323.3704346&partnerID=40&md5=da2c58dd104d36641e0865c691fbe6df
Description
Summary: In the field of multilingual machine translation, many pretrained language models have achieved inspiring results. However, results based on pretrained models are not yet satisfactory for low-resource languages. This paper investigates how to leverage code-switching data to fine-tune a pretrained multilingual machine translation model, in order to boost the performance of few-shot low-resource machine translation. By utilizing a multilingual mixed corpus, the code-switching method can enhance the model's cross-linguistic generalization ability and improve its overall language understanding. Using the smaller model of the FLORES-101 benchmark, we apply code-switching data augmentation to match the results of the benchmark's larger model across six translation directions among three languages: Chinese, English, and Malay. The paper studies various corpus-mixing mechanisms for constructing code-switching data, and the experimental findings show that the fine-tuned model with code-switching improves the spBLEU score by an average of 2 to 3 points over results without code-switching. © 2024 Copyright held by the owner/author(s).
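As a rough illustration of the general technique the summary describes, the sketch below shows one simple corpus-mixing mechanism: substituting a fraction of source-language tokens with bilingual-dictionary translations to produce code-switched training sentences. This is a minimal sketch under stated assumptions; the LEXICON entries, the code_switch function, and the ratio parameter are hypothetical and are not the paper's actual implementation, which the abstract does not specify.

```python
import random

# Hypothetical bilingual lexicon (Chinese -> Malay) for illustration only;
# the paper's actual corpus-mixing mechanisms are not given in this record.
LEXICON = {
    "我": "saya",
    "喜欢": "suka",
    "吃": "makan",
    "米饭": "nasi",
}

def code_switch(tokens, lexicon, ratio=0.3, seed=None):
    """Replace a fraction of source tokens with dictionary translations,
    yielding a mixed-language sentence for data augmentation."""
    rng = random.Random(seed)
    switched = []
    for tok in tokens:
        # Substitute a token only if it has a translation and the coin flip
        # falls below the mixing ratio; otherwise keep the original token.
        if tok in lexicon and rng.random() < ratio:
            switched.append(lexicon[tok])
        else:
            switched.append(tok)
    return switched

# Example: augment a tokenized Chinese sentence with Malay substitutions.
print(code_switch(["我", "喜欢", "吃", "米饭"], LEXICON, ratio=0.5, seed=0))
```

Sentences augmented this way would typically be mixed into the fine-tuning corpus alongside the original parallel data, exposing the model to cross-lingual token alignments during training.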
DOI:10.1145/3704323.3704346