Enhancing Pretrained Multilingual Machine Translation Model with Code-Switching: A Study on Chinese, English and Malay Language
In the field of multilingual machine translation, many pretrained language models have achieved inspiring results. However, the results based on pretrained models are not yet satisfactory for low-resource languages. This paper investigates how to leverage code-switching data to fine-tune a pretrained multilingual machine translation model in order to boost the performance of few-shot low-resource machine translation. By utilizing a multilingual mixed corpus, the code-switching method can enhance the cross-linguistic generalization ability of the model and improve its overall understanding of the languages. Using the smaller model of the FLORES-101 benchmark, we apply code-switching data augmentation to match the results of the benchmark's larger model on six translation directions among three languages: Chinese, English and Malay. The paper studies various corpus mixture mechanisms for constructing the code-switching data, and the experimental findings show that the code-switching fine-tuned model improves the spBLEU score by an average of 2 to 3 points over the results without code-switching.
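The abstract does not specify how the code-switched training sentences are constructed. As a rough illustration only, and not the authors' actual procedure, a common recipe replaces a random fraction of source-side tokens with translations drawn from a bilingual lexicon; the `code_switch` helper and the toy `lexicon` below are hypothetical.

```python
import random

def code_switch(src_tokens, bilingual_dict, swap_ratio=0.3, seed=None):
    """Create a code-switched copy of a source sentence by replacing a
    random subset of its tokens with dictionary translations.

    src_tokens     : list of tokens in the source language
    bilingual_dict : toy lexicon mapping a source token to a translation
                     in the other language (hypothetical)
    swap_ratio     : fraction of swappable tokens to replace
    """
    rng = random.Random(seed)
    switched = list(src_tokens)
    # Only tokens that have a dictionary entry can be swapped.
    candidates = [i for i, tok in enumerate(src_tokens) if tok in bilingual_dict]
    n_swaps = max(1, int(len(candidates) * swap_ratio)) if candidates else 0
    for i in rng.sample(candidates, n_swaps):
        switched[i] = bilingual_dict[src_tokens[i]]
    return switched

# Toy example: mix Malay words into an English source sentence.
lexicon = {"language": "bahasa", "model": "model", "translation": "terjemahan"}
print(code_switch("the translation model learns a new language".split(), lexicon, 0.5, seed=0))
```

Swapping at the token level keeps the mixed sentence aligned with its original target-side translation, so the augmented pair can be fed to the same fine-tuning objective as the clean data.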
Published in: | ICCPR 2024 - Proceedings of the 2024 13th International Conference on Computing and Pattern Recognition |
---|---|
Main Author: | Liu H. |
Format: | Conference paper |
Language: | English |
Published: | Association for Computing Machinery, Inc, 2025 |
Online Access: | https://www.scopus.com/inward/record.uri?eid=2-s2.0-85218347057&doi=10.1145%2f3704323.3704346&partnerID=40&md5=da2c58dd104d36641e0865c691fbe6df |
id | 2-s2.0-85218347057
---|---
spelling | 2-s2.0-85218347057 Liu H.; Seman N. Enhancing Pretrained Multilingual Machine Translation Model with Code-Switching: A Study on Chinese, English and Malay Language 2025 ICCPR 2024 - Proceedings of the 2024 13th International Conference on Computing and Pattern Recognition 10.1145/3704323.3704346 https://www.scopus.com/inward/record.uri?eid=2-s2.0-85218347057&doi=10.1145%2f3704323.3704346&partnerID=40&md5=da2c58dd104d36641e0865c691fbe6df In the field of multilingual machine translation, many pretrained language models have achieved inspiring results. However, the results based on pretrained models are not yet satisfactory for low-resource languages. This paper investigates how to leverage code-switching data to fine-tune a pretrained multilingual machine translation model in order to boost the performance of few-shot low-resource machine translation. By utilizing a multilingual mixed corpus, the code-switching method can enhance the cross-linguistic generalization ability of the model and improve its overall understanding of the languages. Using the smaller model of the FLORES-101 benchmark, we apply code-switching data augmentation to match the results of the benchmark's larger model on six translation directions among three languages: Chinese, English and Malay. The paper studies various corpus mixture mechanisms for constructing the code-switching data, and the experimental findings show that the code-switching fine-tuned model improves the spBLEU score by an average of 2 to 3 points over the results without code-switching. © 2024 Copyright held by the owner/author(s). Association for Computing Machinery, Inc English Conference paper
author | Liu H.; Seman N.
spellingShingle | Liu H.; Seman N. Enhancing Pretrained Multilingual Machine Translation Model with Code-Switching: A Study on Chinese, English and Malay Language
author_facet | Liu H.; Seman N.
author_sort | Liu H.; Seman N.
title | Enhancing Pretrained Multilingual Machine Translation Model with Code-Switching: A Study on Chinese, English and Malay Language
title_short | Enhancing Pretrained Multilingual Machine Translation Model with Code-Switching: A Study on Chinese, English and Malay Language
title_full | Enhancing Pretrained Multilingual Machine Translation Model with Code-Switching: A Study on Chinese, English and Malay Language
title_fullStr | Enhancing Pretrained Multilingual Machine Translation Model with Code-Switching: A Study on Chinese, English and Malay Language
title_full_unstemmed | Enhancing Pretrained Multilingual Machine Translation Model with Code-Switching: A Study on Chinese, English and Malay Language
title_sort | Enhancing Pretrained Multilingual Machine Translation Model with Code-Switching: A Study on Chinese, English and Malay Language
publishDate | 2025
container_title | ICCPR 2024 - Proceedings of the 2024 13th International Conference on Computing and Pattern Recognition
container_volume |
container_issue |
doi_str_mv | 10.1145/3704323.3704346
url | https://www.scopus.com/inward/record.uri?eid=2-s2.0-85218347057&doi=10.1145%2f3704323.3704346&partnerID=40&md5=da2c58dd104d36641e0865c691fbe6df
description | In the field of multilingual machine translation, many pretrained language models have achieved inspiring results. However, the results based on pretrained models are not yet satisfactory for low-resource languages. This paper investigates how to leverage code-switching data to fine-tune a pretrained multilingual machine translation model in order to boost the performance of few-shot low-resource machine translation. By utilizing a multilingual mixed corpus, the code-switching method can enhance the cross-linguistic generalization ability of the model and improve its overall understanding of the languages. Using the smaller model of the FLORES-101 benchmark, we apply code-switching data augmentation to match the results of the benchmark's larger model on six translation directions among three languages: Chinese, English and Malay. The paper studies various corpus mixture mechanisms for constructing the code-switching data, and the experimental findings show that the code-switching fine-tuned model improves the spBLEU score by an average of 2 to 3 points over the results without code-switching. © 2024 Copyright held by the owner/author(s).
publisher | Association for Computing Machinery, Inc
issn |
language | English
format | Conference paper
accesstype |
record_format | scopus
collection | Scopus
_version_ | 1825722573662453760
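The record above reports gains of 2 to 3 spBLEU points. spBLEU is BLEU computed over SentencePiece-tokenized text rather than word-tokenized text, which keeps scores comparable across languages with very different segmentation. Below is a minimal scoring sketch with sacreBLEU, assuming an installed release that exposes the FLORES-101 SPM tokenizer as `tokenize="flores101"` (older releases name the same tokenizer `"spm"`); the Malay hypothesis and reference sentences are invented placeholders, not data from the paper.

```python
# Minimal sketch of spBLEU scoring with sacreBLEU's SentencePiece tokenizer.
# Assumption: the installed sacrebleu version accepts tokenize="flores101"
# (older versions expose the same SPM tokenizer under the name "spm").
import sacrebleu

# Invented placeholder sentences, not data from the paper.
hypotheses = ["Model terjemahan itu belajar bahasa baharu."]
references = [["Model penterjemahan itu mempelajari bahasa baharu."]]

score = sacrebleu.corpus_bleu(hypotheses, references, tokenize="flores101")
print(f"spBLEU: {score.score:.2f}")
```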