Multi-Label Classification of Pure Code
A large amount of public code is available in IT communities, programming forums and code repositories. Much of this code lacks classification labels or carries imprecise ones, which hampers code management and retrieval. Some classification methods have been propos...
Published in: | International Journal of Software Engineering and Knowledge Engineering |
---|---|
Main Author: | Gao B. |
Format: | Article |
Language: | English |
Published: | World Scientific, 2024 |
Online Access: | https://www.scopus.com/inward/record.uri?eid=2-s2.0-85200125805&doi=10.1142%2fS0218194024500311&partnerID=40&md5=9dd177c74649f4195ecc855f3db22aff |
id |
2-s2.0-85200125805 |
---|---|
author |
Gao B.; Qin H.; Ma X. |
title |
Multi-Label Classification of Pure Code |
publishDate |
2024 |
container_title |
International Journal of Software Engineering and Knowledge Engineering |
container_volume |
34 |
container_issue |
10 |
doi_str_mv |
10.1142/S0218194024500311 |
url |
https://www.scopus.com/inward/record.uri?eid=2-s2.0-85200125805&doi=10.1142%2fS0218194024500311&partnerID=40&md5=9dd177c74649f4195ecc855f3db22aff |
description |
A large amount of public code is available in IT communities, programming forums and code repositories. Much of this code lacks classification labels or carries imprecise ones, which hampers code management and retrieval. Some classification methods have been proposed to assign labels to code automatically. However, these methods rely mainly on code comments or surrounding text, so their classification performance is limited by the quality of that text. So far, only a few methods rely solely on the code itself to assign labels. In this paper, an encoder-only method is proposed to assign multiple labels to the code of an algorithmic problem: UniXcoder encodes the input code, and classification heads map the encoding to the output labels. The proposed method relies only on the code itself. To evaluate the proposed method, we construct a dataset of source code in three programming languages (C++, Java, Python) with a total size of approximately 120K. The results of the comparative experiment show that the proposed method performs better than encoder-decoder methods on the multi-label classification task for pure code. © 2024 World Scientific Publishing Company. |
publisher |
World Scientific |
issn |
0218-1940 |
language |
English |
format |
Article |
record_format |
scopus |
collection |
Scopus |
_version_ |
1818940551082278912 |
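
The description above outlines an encoder-only pipeline: an encoder turns source code into a vector, and classification heads map that vector to several labels at once. The record does not give the head design, label set, or threshold, so the following is only a minimal sketch under stated assumptions: `toy_encode` is a hypothetical stand-in for a pretrained encoder such as UniXcoder, `LABELS` is an invented label set, and the head is assumed to be one linear layer with an independent sigmoid per label.

```python
import math

# Hypothetical label set; the record does not list the paper's labels.
LABELS = ["dp", "graph", "greedy"]


def toy_encode(code, dim=4):
    """Stand-in for a pretrained code encoder (assumption, not UniXcoder).

    Builds a fixed-size vector from character statistics only, so the
    sketch runs without downloading any model.
    """
    vec = [0.0] * dim
    for i, ch in enumerate(code):
        vec[i % dim] += ord(ch) / 128.0
    n = max(len(code), 1)
    return [v / n for v in vec]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def classify(embedding, weights, biases, threshold=0.5):
    """Assumed head: one linear unit plus sigmoid per label.

    Each label is decided independently, so a snippet can receive zero,
    one, or several labels (unlike a softmax, which picks exactly one).
    """
    probs = []
    for w_row, b in zip(weights, biases):
        logit = sum(w * e for w, e in zip(w_row, embedding)) + b
        probs.append(sigmoid(logit))
    chosen = [label for label, p in zip(LABELS, probs) if p >= threshold]
    return chosen, probs


if __name__ == "__main__":
    emb = toy_encode("for i in range(n): dp[i] = dp[i - 1] + a[i]")
    # Illustrative, untrained weights: one row per label.
    weights = [[1.0, 0.0, 0.0, 0.0],
               [-1.0, 0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0, 0.0]]
    biases = [0.0, 0.0, 0.1]
    print(classify(emb, weights, biases))
```

In a real system the weights would be trained, typically with a per-label binary cross-entropy loss, and the threshold tuned on validation data; both choices are assumptions here rather than details taken from the record.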