Multi-Label Classification of Pure Code

A large amount of public code now exists in IT communities, programming forums, and code repositories. Much of this code lacks classification labels or has imprecise ones, which hinders code management and retrieval. Some classification methods have been propos...

Bibliographic Details
Published in:	INTERNATIONAL JOURNAL OF SOFTWARE ENGINEERING AND KNOWLEDGE ENGINEERING
Main Authors:	Gao, Bin; Qin, Hongwu; Ma, Xiuqin
Format:	Article; Early Access
Language:	English
Published:	WORLD SCIENTIFIC PUBL CO PTE LTD 2024
Subjects:	Computer Science; Engineering
Online Access:	https://www-webofscience-com.uitm.idm.oclc.org/wos/woscc/full-record/WOS:001278652300001

author	Gao Bin; Qin Hongwu; Ma Xiuqin
title	Multi-Label Classification of Pure Code
container_title	INTERNATIONAL JOURNAL OF SOFTWARE ENGINEERING AND KNOWLEDGE ENGINEERING
language	English
format	Article; Early Access
description	A large amount of public code now exists in IT communities, programming forums, and code repositories. Much of this code lacks classification labels or has imprecise ones, which hinders code management and retrieval. Some classification methods have been proposed to assign labels to code automatically. However, these methods rely mainly on code comments or surrounding text, so their classification performance is limited by the quality of that text. So far, only a few methods rely solely on the code itself to assign labels. In this paper, an encoder-only method is proposed to assign multiple labels to the code of an algorithmic problem: UniXcoder encodes the input code, and classification heads map the encoding to the output labels. The proposed method relies only on the code itself. We construct a dataset to evaluate the proposed method, consisting of source code in three programming languages (C++, Java, Python) with a total size of approximately 120K. The comparative experiments show that the proposed method outperforms encoder-decoder methods on the multi-label classification of pure code.
publisher	WORLD SCIENTIFIC PUBL CO PTE LTD
issn	0218-1940 1793-6403
publishDate	2024
doi_str_mv	10.1142/S0218194024500311
topic	Computer Science; Engineering
id	WOS:001278652300001
url	https://www-webofscience-com.uitm.idm.oclc.org/wos/woscc/full-record/WOS:001278652300001
record_format	wos
collection	Web of Science (WoS)
_version_	1809679296836403200

Similar Items