Enhancing Pretrained Multilingual Machine Translation Model with Code-Switching: A Study on Chinese, English and Malay Language

In the field of multilingual machine translation, many pretrained language models have achieved inspiring results. However, results based on pretrained models are still unsatisfactory for low-resource languages. This paper investigates how to leverage code-switching data to fine-tune a pretrained multilingual machine translation model in order to boost few-shot, low-resource machine translation. By using a multilingual mixed corpus, the code-switching method strengthens the model's cross-linguistic generalization and improves its overall understanding of the languages involved. Using the smaller model of the FLORES-101 benchmark, we apply code-switching data augmentation to match the results of the benchmark's larger model on the six translation directions among three languages: Chinese, English, and Malay. The paper studies several corpus mixing mechanisms for constructing the code-switching data, and the experimental findings show that the code-switching fine-tuned model improves the spBLEU score by an average of 2 to 3 points over fine-tuning without code-switching. © 2024 Copyright held by the owner/author(s).
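The abstract does not spell out how the code-switched training data is constructed, so the following is only a minimal sketch of one common approach, dictionary-based token substitution: random source-side tokens are swapped for their counterparts in another language to form a mixed corpus. The toy lexicon, the code_switch helper, and the replacement ratio below are hypothetical illustrations, not the paper's actual mixing mechanisms.

import random

# Hypothetical toy Chinese-English lexicon, used only for illustration;
# the paper's real corpus mixing mechanisms are not described in this record.
ZH_EN_LEXICON = {
    "我": "I",
    "喜欢": "like",
    "学习": "study",
    "语言": "language",
}

def code_switch(tokens, lexicon, ratio=0.3, seed=None):
    """Replace a fraction of tokens with their dictionary translations."""
    rng = random.Random(seed)
    mixed = []
    for tok in tokens:
        if tok in lexicon and rng.random() < ratio:
            mixed.append(lexicon[tok])  # swap in the other language
        else:
            mixed.append(tok)           # keep the original token
    return mixed

if __name__ == "__main__":
    src = ["我", "喜欢", "学习", "语言"]
    print(" ".join(code_switch(src, ZH_EN_LEXICON, ratio=0.5, seed=0)))
    # with seed=0 this prints: 我 喜欢 study language

Sentences mixed this way would be added to the fine-tuning corpus alongside the original parallel data; the ratio of switched tokens and the choice of language pairs are the kind of mixing choices the paper compares.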

Bibliographic Details
Published in: ICCPR 2024 - Proceedings of the 2024 13th International Conference on Computing and Pattern Recognition
Main Authors: Liu H.; Seman N.
Format: Conference paper
Language: English
Published: Association for Computing Machinery, Inc, 2025
DOI: 10.1145/3704323.3704346
Online Access: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85218347057&doi=10.1145%2f3704323.3704346&partnerID=40&md5=da2c58dd104d36641e0865c691fbe6df
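The abstract reports gains of 2 to 3 spBLEU points. spBLEU is BLEU computed over SentencePiece subwords, as used by the FLORES-101 benchmark. Below is a small sketch of how such a score could be computed with the sacrebleu library; the sentences are made up, and the tokenizer name ("flores101") is how recent sacrebleu releases expose the SentencePiece tokenizer, which may be named differently in other versions.

import sacrebleu

# Hypothetical system output and reference; a real evaluation would use the
# FLORES-101 devtest set for each of the six Chinese/English/Malay directions.
hypotheses = ["Saya suka belajar bahasa baharu."]
references = [["Saya suka mempelajari bahasa baharu."]]

# spBLEU = BLEU over SentencePiece pieces; recent sacrebleu versions expose
# this via the "flores101" tokenizer (assumed here).
score = sacrebleu.corpus_bleu(hypotheses, references, tokenize="flores101")
print(f"spBLEU: {score.score:.2f}")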