
Transformer-based code model with compressed hierarchy representation

Authors: Zhang, Kechi; Li, Jia; Li, Zhuo; Jin, Zhi; Li, Ge

Affiliation: Key Lab of High Confidence Software Technology (MoE), Peking University, Beijing, People's Republic of China

Published in: Empirical Software Engineering (Empir Software Eng)

Year/Volume/Issue: 2025, Vol. 30, Issue 2

Pages: 1-43


Subject classification: 08 [Engineering]; 0835 [Engineering - Software Engineering]; 0812 [Engineering - Computer Science and Technology (Engineering or Science degrees may be conferred)]

Funding: National Natural Science Foundation of China [62072007, 62192733, 61832009, 62192730]

Keywords: Code representation; Code classification; Clone detection; Code summarization

Abstract: Source code representation with deep learning techniques is an important research field. Many studies have learned sequential or structural information for code representation, but existing sequence-based models and non-sequence models both have limitations. Although researchers have attempted to incorporate structural information into sequence-based models, they mine only part of the token-level hierarchical structure information. In this paper, we analyze how the complete hierarchical structure influences the tokens in code sequences and abstract this influence as a property of code tokens called hierarchical embedding. The hierarchical structure contains frequent combinations, which carry strong semantics and help identify unique code structures. We further analyze these hierarchy combinations and propose a novel compression algorithm, Hierarchy BPE, which extracts frequent hierarchy combinations and reduces the total length of the hierarchical embedding. Based on this compression algorithm, we propose the Byte-Pair Encoded Hierarchy Transformer (BPE-HiT), a simple but effective sequence model that incorporates the compressed hierarchical embeddings of source code into a Transformer model. Because BPE-HiT significantly reduces computational overhead, we scale up the model training phase and implement a hierarchy-aware pre-training framework. We conduct extensive experiments on 10 datasets covering code classification, clone detection, method name prediction, and code completion tasks. Results show that our non-pre-trained BPE-HiT outperforms state-of-the-art baselines by at least 0.94% in average accuracy on code classification tasks across three programming languages. On the method name prediction task, BPE-HiT outperforms baselines by at least 2.04 and 1.34 in F1-score on two real-world datasets. Moreover, our pre-trained BPE-HiT outperforms other pre-trained baseline models with the same number of parameters.
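The abstract describes Hierarchy BPE as extracting frequent hierarchy combinations to shorten per-token hierarchical embeddings. The following is a minimal illustrative sketch, assuming each token's hierarchy is represented as its root-to-token chain of AST node types and that frequent adjacent pairs are merged greedily in classic BPE fashion; the function names (hierarchy_bpe, count_pairs, merge_pair) and the path format are hypothetical and not taken from the paper itself.

# Illustrative sketch only: a BPE-style merge over per-token hierarchy paths
# (root-to-token chains of AST node types). Names and path format are
# hypothetical; the paper's actual Hierarchy BPE may differ in detail.
from collections import Counter

def count_pairs(paths):
    """Count adjacent node-type pairs across all hierarchy paths."""
    pairs = Counter()
    for path in paths:
        for a, b in zip(path, path[1:]):
            pairs[(a, b)] += 1
    return pairs

def merge_pair(paths, pair):
    """Replace every occurrence of `pair` with a single fused symbol."""
    fused = pair[0] + "+" + pair[1]
    merged = []
    for path in paths:
        out, i = [], 0
        while i < len(path):
            if i + 1 < len(path) and (path[i], path[i + 1]) == pair:
                out.append(fused)
                i += 2
            else:
                out.append(path[i])
                i += 1
        merged.append(out)
    return merged

def hierarchy_bpe(paths, num_merges):
    """Learn `num_merges` frequent hierarchy combinations, shortening each path."""
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(paths)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        paths = merge_pair(paths, best)
        merges.append(best)
    return paths, merges

if __name__ == "__main__":
    # Each token's hierarchy: root-to-token chain of AST node types.
    paths = [
        ["Module", "FunctionDef", "If", "Return"],
        ["Module", "FunctionDef", "If", "Call"],
        ["Module", "FunctionDef", "Assign"],
    ]
    compressed, merges = hierarchy_bpe(paths, num_merges=2)
    print(merges)      # e.g. [('Module', 'FunctionDef'), ('Module+FunctionDef', 'If')]
    print(compressed)  # shorter hierarchy sequences per token

Repeatedly merging the most frequent pair shortens every path, which is the property the abstract attributes to Hierarchy BPE: fewer hierarchy symbols per token means a shorter hierarchical embedding to feed into the Transformer.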
