Author Affiliations: Shanghai Jiao Tong Univ, Sch of Software, Shanghai 200240, Peoples R China; East China Normal Univ, Software Engn Inst, Shanghai 200062, Peoples R China; East China Univ Sci & Technol, Dept of Comp Sci, Shanghai 200237, Peoples R China
Publication: IEEE TRANSACTIONS ON SOFTWARE ENGINEERING (IEEE Trans Software Eng)
Year/Volume/Issue: 2024, Vol. 50, No. 6
Pages: 1505-1517
Funding: National Natural Science Foundation of China
Keywords: Codes; Vectors; Generators; Training; Semantics; Task analysis; Generative adversarial networks; Pre-trained language models; variable name representation; identifier representation; generative adversarial networks
Abstract: Variable names are of critical importance in code representation learning. However, due to diverse naming conventions, variables often receive arbitrary names, leading to long-tail, out-of-vocabulary (OOV), and other well-known problems. While Byte-Pair Encoding (BPE) tokenization addresses the surface-level recognition of low-frequency tokens, it does not remedy the inadequate training of low-frequency identifiers in code representation models, which results in an imbalanced distribution of rare and common identifiers. Consequently, code representation models struggle to effectively capture the semantics of low-frequency variable names. In this paper, we propose VarGAN, a novel method for variable name representation. VarGAN strengthens the training of low-frequency variables through adversarial training. Specifically, we regard the code representation model as a generator responsible for producing vectors from source code. Additionally, we employ a discriminator that detects whether the code input to the generator contains low-frequency variables. This adversarial setup regularizes the distribution of rare variables, making them overlap with their corresponding high-frequency counterparts in the vector space. Experimental results demonstrate that VarGAN empowers CodeBERT to generate code vectors that exhibit a more uniform distribution for both low- and high-frequency identifiers, improving similarity and relatedness scores by 8% over VarCLR on the IdBench benchmark. VarGAN is also validated in downstream tasks, where it exhibits enhanced capabilities in capturing token- and code-level semantics.
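The abstract describes an adversarial setup in which a pre-trained code encoder serves as the generator and a discriminator predicts whether the encoded code contains low-frequency variables. The sketch below is only a minimal illustration of that idea under assumed choices (CodeBERT via HuggingFace, [CLS] pooling, an MLP discriminator, flipped-label generator loss, illustrative hyperparameters); it is not the authors' implementation of VarGAN.

```python
# Illustrative sketch: adversarial training of a code encoder ("generator")
# against a discriminator that detects low-frequency variables.
# Model choice, pooling, losses, and hyperparameters are assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
generator = AutoModel.from_pretrained("microsoft/codebert-base")

# Discriminator: predicts whether a code vector comes from a snippet
# that contains a low-frequency (rare) variable name.
discriminator = nn.Sequential(
    nn.Linear(768, 256),
    nn.ReLU(),
    nn.Linear(256, 1),
)

bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.AdamW(generator.parameters(), lr=2e-5)
opt_d = torch.optim.AdamW(discriminator.parameters(), lr=1e-4)

def encode(code_batch):
    """Encode source code into one vector per snippet ([CLS] pooling)."""
    inputs = tokenizer(code_batch, return_tensors="pt",
                       padding=True, truncation=True, max_length=256)
    return generator(**inputs).last_hidden_state[:, 0]  # (batch, 768)

def train_step(code_batch, has_rare_var):
    """One adversarial step; has_rare_var holds 0/1 float labels."""
    # 1) Train the discriminator to detect low-frequency variables.
    vecs = encode(code_batch).detach()
    d_loss = bce(discriminator(vecs).squeeze(-1), has_rare_var)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Train the generator to make rare- and common-variable code
    #    indistinguishable to the discriminator (flipped labels).
    vecs = encode(code_batch)
    g_loss = bce(discriminator(vecs).squeeze(-1), 1.0 - has_rare_var)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

# Example usage with hypothetical labels (1.0 = contains a rare variable).
d_loss, g_loss = train_step(
    ["def add(a, b): return a + b", "def scale(xqzt): return xqzt * 2"],
    torch.tensor([0.0, 1.0]),
)
```

The flipped-label generator objective is one common way to realize the regularization the abstract describes, pushing vectors of rare-variable code toward the region occupied by common-variable code; the paper itself may use a different adversarial loss.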