Multi-Label Classification of Pure Code

A large amount of public code is now available in IT communities, programming forums and code repositories. Much of this code lacks classification labels or has imprecise ones, which hinders code management and retrieval. Some classification methods have been propos...


Bibliographic Details
Published in: International Journal of Software Engineering and Knowledge Engineering
Main Author: Gao B.; Qin H.; Ma X.
Format: Article
Language: English
Published: World Scientific 2024
Online Access: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85200125805&doi=10.1142%2fS0218194024500311&partnerID=40&md5=9dd177c74649f4195ecc855f3db22aff
id 2-s2.0-85200125805
spelling 2-s2.0-85200125805
Gao B.; Qin H.; Ma X.
Multi-Label Classification of Pure Code
2024
International Journal of Software Engineering and Knowledge Engineering


10.1142/S0218194024500311
https://www.scopus.com/inward/record.uri?eid=2-s2.0-85200125805&doi=10.1142%2fS0218194024500311&partnerID=40&md5=9dd177c74649f4195ecc855f3db22aff
A large amount of public code is now available in IT communities, programming forums and code repositories. Much of this code lacks classification labels or has imprecise ones, which hinders code management and retrieval. Some classification methods have been proposed to assign labels to code automatically. However, these methods rely mainly on code comments or surrounding text, so their classification performance is limited by the quality of that text. So far, only a few methods rely solely on the code itself to assign labels. In this paper, an encoder-only method is proposed to assign multiple labels to the code of an algorithmic problem: UniXcoder encodes the input code, and classification heads map the encoding results to the output labels. The proposed method relies only on the code itself. To evaluate it, we construct a dataset of source code in three programming languages (C++, Java, Python) with a total size of approximately 120 K. The results of the comparative experiment show that the proposed method outperforms encoder–decoder methods on the multi-label classification task of pure code. © World Scientific Publishing Company.
World Scientific
02181940
English
Article

author Gao B.; Qin H.; Ma X.
spellingShingle Gao B.; Qin H.; Ma X.
Multi-Label Classification of Pure Code
author_facet Gao B.; Qin H.; Ma X.
author_sort Gao B.; Qin H.; Ma X.
title Multi-Label Classification of Pure Code
title_short Multi-Label Classification of Pure Code
title_full Multi-Label Classification of Pure Code
title_fullStr Multi-Label Classification of Pure Code
title_full_unstemmed Multi-Label Classification of Pure Code
title_sort Multi-Label Classification of Pure Code
publishDate 2024
container_title International Journal of Software Engineering and Knowledge Engineering
container_volume
container_issue
doi_str_mv 10.1142/S0218194024500311
url https://www.scopus.com/inward/record.uri?eid=2-s2.0-85200125805&doi=10.1142%2fS0218194024500311&partnerID=40&md5=9dd177c74649f4195ecc855f3db22aff
description A large amount of public code is now available in IT communities, programming forums and code repositories. Much of this code lacks classification labels or has imprecise ones, which hinders code management and retrieval. Some classification methods have been proposed to assign labels to code automatically. However, these methods rely mainly on code comments or surrounding text, so their classification performance is limited by the quality of that text. So far, only a few methods rely solely on the code itself to assign labels. In this paper, an encoder-only method is proposed to assign multiple labels to the code of an algorithmic problem: UniXcoder encodes the input code, and classification heads map the encoding results to the output labels. The proposed method relies only on the code itself. To evaluate it, we construct a dataset of source code in three programming languages (C++, Java, Python) with a total size of approximately 120 K. The results of the comparative experiment show that the proposed method outperforms encoder–decoder methods on the multi-label classification task of pure code. © World Scientific Publishing Company.
publisher World Scientific
issn 02181940
language English
format Article
accesstype
record_format scopus
collection Scopus
_version_ 1814778502867582976