VarGAN: Adversarial Learning of Variable Semantic Representations

Authors: Lin, Yalan; Wan, Chengcheng; Bai, Shuwen; Gu, Xiaodong

Affiliations: Shanghai Jiao Tong Univ, Sch Software, Shanghai 200240, Peoples R China; East China Normal Univ, Software Engn Inst, Shanghai 200062, Peoples R China; East China Univ Sci & Technol, Dept Comp Sci, Shanghai 200237, Peoples R China

Published in: IEEE TRANSACTIONS ON SOFTWARE ENGINEERING (IEEE Trans Software Eng)

Year/Volume/Issue: 2024, Vol. 50, No. 6

Pages: 1505-1517

Funding: National Natural Science Foundation of China

Keywords: Codes; Vectors; Generators; Training; Semantics; Task analysis; Generative adversarial networks; Pre-trained language models; variable name representation; identifier representation; generative adversarial networks

Abstract: Variable names are of critical importance in code representation learning. However, due to diverse naming conventions, variables often receive arbitrary names, leading to long-tail, out-of-vocabulary (OOV), and other well-known problems. While the Byte-Pair Encoding (BPE) tokenizer addresses the surface-level recognition of low-frequency tokens, it does not address the inadequate training that low-frequency identifiers receive in code representation models, which results in an imbalanced distribution of rare and common identifiers. Consequently, code representation models struggle to effectively capture the semantics of low-frequency variable names. In this paper, we propose VarGAN, a novel method for learning variable name representations. VarGAN strengthens the training of low-frequency variables through adversarial training. Specifically, we regard the code representation model as a generator responsible for producing vectors from source code, and we employ a discriminator that detects whether the code fed to the generator contains low-frequency variables. This adversarial setup regularizes the distribution of rare variables, making them overlap with their high-frequency counterparts in the vector space. Experimental results demonstrate that VarGAN enables CodeBERT to generate code vectors with a more uniform distribution for both low- and high-frequency identifiers, improving similarity and relatedness scores by 8% over VarCLR on the IdBench benchmark. VarGAN is also validated on downstream tasks, where it exhibits enhanced capabilities in capturing token- and code-level semantics.
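The abstract describes an adversarial setup in which the code representation model acts as a generator producing code vectors, while a discriminator tries to detect whether the input code contains low-frequency variables. The sketch below illustrates that idea in PyTorch under stated assumptions: DummyEncoder (standing in for a pre-trained encoder such as CodeBERT), the Discriminator architecture, the loss weight lam, and the field names input_ids / has_rare_variable are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of the adversarial setup described in the abstract.
# Assumptions (not from the paper): PyTorch, a toy DummyEncoder standing in for
# CodeBERT, a small MLP discriminator, and illustrative hyperparameters.
import torch
import torch.nn as nn


class DummyEncoder(nn.Module):
    """Stand-in for a pre-trained code encoder (e.g., CodeBERT) acting as the generator."""

    def __init__(self, vocab_size=1000, hidden_dim=768):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, hidden_dim)

    def forward(self, input_ids):
        # Mean-pool token embeddings into a single code vector per snippet.
        return self.emb(input_ids).mean(dim=1)


class Discriminator(nn.Module):
    """Predicts from a code vector whether the snippet contains low-frequency variables."""

    def __init__(self, hidden_dim=768):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hidden_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, code_vec):
        return self.net(code_vec)  # logit: positive = "contains rare variables"


def adversarial_step(encoder, discriminator, batch, opt_g, opt_d, lam=0.1):
    """One adversarial step: the discriminator learns to spot rare-variable code,
    while the encoder is pushed to make rare- and common-variable code vectors
    indistinguishable, regularizing the distribution of rare identifiers."""
    bce = nn.BCEWithLogitsLoss()
    code_vecs = encoder(batch["input_ids"])            # (B, hidden_dim) code vectors
    rare_labels = batch["has_rare_variable"].float()   # (B,) 1 = snippet has rare vars

    # 1) Discriminator update on detached vectors (encoder is not updated here).
    d_loss = bce(discriminator(code_vecs.detach()).squeeze(-1), rare_labels)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2) Encoder update: fool the discriminator by flipping the labels. In a full
    #    system this term would be added to the model's usual representation-learning
    #    objective; it stands alone here for brevity.
    g_loss = lam * bce(discriminator(code_vecs).squeeze(-1), 1.0 - rare_labels)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()


if __name__ == "__main__":
    enc, disc = DummyEncoder(), Discriminator()
    opt_g = torch.optim.Adam(enc.parameters(), lr=1e-4)
    opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
    batch = {
        "input_ids": torch.randint(0, 1000, (8, 32)),    # 8 toy snippets, 32 tokens each
        "has_rare_variable": torch.randint(0, 2, (8,)),  # toy rarity labels
    }
    print(adversarial_step(enc, disc, batch, opt_g, opt_d))
```

The intended effect, as the abstract states, is that vectors of code containing rare variables become indistinguishable from those of common-variable code, so rare identifiers overlap with their high-frequency counterparts in the vector space.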
