Multi-Label Classification of Pure Code

A large amount of public code is now available in IT communities, programming forums and code repositories. Much of this code lacks classification labels or carries imprecise ones, which hampers code management and retrieval. Some classification methods have been propos...

Bibliographic Details
Published in: INTERNATIONAL JOURNAL OF SOFTWARE ENGINEERING AND KNOWLEDGE ENGINEERING
Main Authors: Gao, Bin; Qin, Hongwu; Ma, Xiuqin
Format: Article; Early Access
Language: English
Published: WORLD SCIENTIFIC PUBL CO PTE LTD 2024
Subjects: Computer Science; Engineering
Online Access: https://www-webofscience-com.uitm.idm.oclc.org/wos/woscc/full-record/WOS:001278652300001
author Gao, Bin; Qin, Hongwu; Ma, Xiuqin
author_facet Gao, Bin; Qin, Hongwu; Ma, Xiuqin
author_sort Gao
title Multi-Label Classification of Pure Code
container_title INTERNATIONAL JOURNAL OF SOFTWARE ENGINEERING AND KNOWLEDGE ENGINEERING
language English
format Article; Early Access
description A large amount of public code is now available in IT communities, programming forums and code repositories. Much of this code lacks classification labels or carries imprecise ones, which hampers code management and retrieval. Some classification methods have been proposed to assign labels to code automatically. However, these methods rely mainly on code comments or surrounding text, so their classification performance is limited by the quality of that text. So far, only a few methods rely solely on the code itself to assign labels. This paper proposes an encoder-only method that assigns multiple labels to the code of an algorithmic problem: UniXcoder encodes the input code, and classification heads map the encoding results to the output labels. The proposed method relies only on the code itself. To evaluate it, we construct a dataset of source code in three programming languages (C++, Java, Python) with a total size of approximately 120K. Comparative experiments show that the proposed method outperforms encoder-decoder methods on the multi-label classification of pure code.
publisher WORLD SCIENTIFIC PUBL CO PTE LTD
issn 0218-1940; 1793-6403
publishDate 2024
container_volume
container_issue
doi_str_mv 10.1142/S0218194024500311
topic Computer Science; Engineering
topic_facet Computer Science; Engineering
accesstype
id WOS:001278652300001
url https://www-webofscience-com.uitm.idm.oclc.org/wos/woscc/full-record/WOS:001278652300001
record_format wos
collection Web of Science (WoS)
_version_ 1809679296836403200
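
The description above says that an encoder produces an encoding of the code and that classification heads map it to several output labels. The record does not give the head design, so the following is only a minimal sketch under stated assumptions: each label gets its own linear score passed through a sigmoid, and a label is assigned when its probability reaches a threshold of 0.5. The function name `predict_labels`, the label names, the weights and the three-dimensional encoding are all hypothetical; a real encoding from UniXcoder would be far larger and the weights would be learned.

```python
import math


def sigmoid(x):
    """Map a real-valued logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(encoding, weights, biases, label_names, threshold=0.5):
    """Assign every label whose independent sigmoid score meets the threshold.

    Assumption: one linear unit per label. Unlike softmax, the scores do not
    compete, so zero, one, or several labels can be assigned to one input.
    """
    predicted = []
    for name, w, b in zip(label_names, weights, biases):
        logit = sum(wi * xi for wi, xi in zip(w, encoding)) + b
        if sigmoid(logit) >= threshold:
            predicted.append(name)
    return predicted


# Hypothetical stand-in for an encoder output (not a real UniXcoder vector).
encoding = [1.0, -2.0, 0.5]
# Hypothetical label set and hand-picked weights, for illustration only.
label_names = ["greedy", "graphs", "dp"]
weights = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 4.0]]
biases = [0.0, 0.0, -1.0]

print(predict_labels(encoding, weights, biases, label_names))
```

With these hand-picked values the logits are 2.0, -2.0 and 1.0, so "greedy" and "dp" are assigned and "graphs" is not, which illustrates how a single piece of code can receive more than one label.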