Automatic code summarization refers to generating concise natural language descriptions for code snippets. It is vital for improving the efficiency of program understanding among software developers and maintainers. Despite the impressive strides made by deep learning-based methods, limitations remain in their ability to understand and model semantic information due to the unique nature of programming languages. We propose two methods to boost code summarization models: context-based abbreviation expansion and unigram language model-based subword segmentation. We use heuristics to expand abbreviations within identifiers, reducing semantic ambiguity and improving the language alignment of code summarization models. Furthermore, we leverage subword segmentation to tokenize code into finer subword sequences, providing more semantic information during training and inference and thereby enhancing program understanding. These methods are model-agnostic and can be readily integrated into existing automatic code summarization approaches. Experiments conducted on two widely used Java code summarization datasets demonstrate the effectiveness of our approach. Specifically, by fusing original and modified code representations into the Transformer model, our Semantic Enhanced Transformer for Code Summarization (SETCS) serves as a robust semantic-level baseline. By simply modifying the datasets, our methods achieved performance improvements of up to 7.3%, 10.0%, 6.7%, and 3.2% for representative code summarization models in terms of BLEU-4, METEOR, ROUGE-L, and SIDE, respectively.
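As a rough illustration of the identifier abbreviation expansion described in this abstract, the sketch below splits camelCase/snake_case identifiers and expands abbreviations via a lookup table. The table, function names, and splitting rules are hypothetical simplifications, not the paper's actual heuristics.

```python
import re

# Illustrative abbreviation table (hypothetical; not from the paper).
ABBREVIATIONS = {
    "cnt": "count", "idx": "index", "msg": "message",
    "num": "number", "str": "string", "val": "value",
}

def split_identifier(name: str) -> list[str]:
    """Split a camelCase or snake_case identifier into lowercase parts."""
    parts = re.split(r"_|(?<=[a-z0-9])(?=[A-Z])", name)
    return [p.lower() for p in parts if p]

def expand_identifier(name: str) -> str:
    """Rewrite an identifier with known abbreviations expanded."""
    return " ".join(ABBREVIATIONS.get(p, p) for p in split_identifier(name))

print(expand_identifier("msgCnt"))    # message count
print(expand_identifier("user_idx"))  # user index
```

The paper's context-based approach is presumably more sophisticated than a static table; this only shows why expansion helps, since "message count" aligns far better with natural language summaries than "msgCnt".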
ISBN:
(print) 9781665452786
Accurate and up-to-date software documentation is an important factor in the maintenance and evolution of software systems. Especially with legacy software, documentation is often outdated or missing entirely, and manual redocumentation is not feasible. In recent years, automatic code summaries based on artificial neural network (ANN) models have been proposed to address this problem, and metric-based evaluations suggest promising quality of the generated summaries. To evaluate the applicability of state-of-the-art code summarization in an industry context, we conduct an expert evaluation to assess the quality of the generated summaries for JPA program comprehension. We then compare the level of quality perceived by human experts for both predicted and reference summaries, and we discuss how these results are influenced by industry-specific requirements and how they correlate with automatically computed source code summary metrics. The results show that the quality of predicted summaries is predominantly (about 80%) poor in terms of accuracy and completeness. Moreover, the results support the growing consensus that the widely used BLEU and ROUGE-L scores are not a suitable means of evaluating the quality of code summarization. While these metrics are an adequate means of comparison with existing related work, they cannot reflect the human-perceived level of quality in practice.
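To make concrete what the ROUGE-L metric criticized in this abstract actually measures, the sketch below computes it as the F1 score over the longest common subsequence (LCS) of candidate and reference tokens. Tokenization by whitespace and equal precision/recall weighting are simplifying assumptions.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l(candidate: str, reference: str) -> float:
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

print(round(rouge_l("returns the item count", "returns the number of items"), 3))  # 0.444
```

The example pair shares only the subsequence "returns the", so the score is low even though the two summaries mean nearly the same thing, which illustrates the abstract's point that token-overlap metrics can diverge from human-perceived quality.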
ISBN:
(print) 9781665458139
Automatic code summarization is an important topic in the software engineering field, which aims to automatically generate descriptions for source code. Most existing methods apply Graph Neural Networks (GNNs) to the Abstract Syntax Tree (AST) to achieve code summarization. However, these methods face two major challenges: 1) they can only capture limited structural information of the source code; 2) they do not effectively mitigate the Out-Of-Vocabulary (OOV) problem by reducing vocabulary size. To resolve these problems, in this paper we propose a novel code summarization model named Dynamic Graph attention-based Transformer (DG-Trans for short), which effectively captures the abundant information in the code subword sequence and fuses a dynamic graph attention mechanism with the Transformer. Extensive experiments show that DG-Trans outperforms state-of-the-art models (such as Ast-Attendgru, Transformer, and codeGNN), improving BLEU and ROUGE-L scores by 8.39% and 8.86% on average, respectively.
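A minimal sketch of the kind of subword splitting that shrinks a code vocabulary and mitigates OOV, in the spirit of the subword sequences DG-Trans consumes. The splitting rules below (underscores, digits, camelCase humps, acronym runs) are assumptions for illustration, not the paper's exact preprocessing.

```python
import re

def to_subwords(token: str) -> list[str]:
    """Split a code token into lowercase subwords.

    Handles snake_case, camelCase, acronym runs (HTTP), and digits,
    so rare compound identifiers decompose into common vocabulary items.
    """
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+", token)
    return [p.lower() for p in parts]

print(to_subwords("parseHTTPResponse2"))  # ['parse', 'http', 'response', '2']
```

An unseen identifier like `parseHTTPResponse2` would be OOV as a whole token, but each of its subwords is likely already in a modest vocabulary, which is the effect the abstract's second challenge targets.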