ISBN: (Print) 9781665437868
Source code summarization aims at generating concise and clear natural language descriptions for programming languages. Well-written code summaries help programmers participate in the software development and maintenance process. To learn the semantic representations of source code, recent efforts focus on incorporating the syntax structure of code into neural networks such as the Transformer. Such Transformer-based approaches can capture long-range dependencies better than other neural networks, including Recurrent Neural Networks (RNNs); however, most of them do not consider the structural relative correlations between tokens, e.g., relative positions in Abstract Syntax Trees (ASTs), which are beneficial for learning code semantics. To model this structural dependency, we propose a StruCtural RelatIve Position guided Transformer, named SCRIPT. SCRIPT first obtains the structural relative positions between tokens by parsing the ASTs of source code, and then passes them into two types of Transformer encoders. One Transformer directly adjusts the input according to the structural relative distance; the other encodes the structural relative positions while computing the self-attention scores. Finally, we stack these two types of Transformer encoders to learn representations of source code. Experimental results show that the proposed SCRIPT outperforms the state-of-the-art methods by at least 1.6%, 1.4% and 2.8% with respect to BLEU, ROUGE-L and METEOR on benchmark datasets, respectively. We further show how the proposed SCRIPT captures the structural relative dependencies.
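The "structural relative position" the abstract describes can be illustrated with a toy computation: the distance between two tokens measured as the path length between their nodes in the AST, rather than their linear offset in the text. The sketch below uses Python's own `ast` module purely as an illustration; SCRIPT's actual encoders and position encodings are not reproduced here.

```python
import ast

def node_paths(tree):
    """Map each AST node to its root-to-node path (list of child indices)."""
    paths = {}
    def walk(node, path):
        paths[node] = path
        for i, child in enumerate(ast.iter_child_nodes(node)):
            walk(child, path + [i])
    walk(tree, [])
    return paths

def structural_distance(tree, a, b):
    """Tree distance between two nodes: steps up to the lowest common
    ancestor plus steps down - a simple stand-in for the structural
    relative positions the abstract describes."""
    paths = node_paths(tree)
    pa, pb = paths[a], paths[b]
    # Length of the common prefix = depth of the lowest common ancestor.
    lca = 0
    while lca < min(len(pa), len(pb)) and pa[lca] == pb[lca]:
        lca += 1
    return (len(pa) - lca) + (len(pb) - lca)

tree = ast.parse("x = f(y) + 1")
names = [n for n in ast.walk(tree) if isinstance(n, ast.Name)]
x_node, f_node = names[0], names[1]  # Name nodes for `x` and `f`
d = structural_distance(tree, x_node, f_node)
```

Tokens that are adjacent in the source can be far apart in the tree (and vice versa), which is exactly the signal a sequence-only position encoding misses.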
ISBN: (Print) 9781728160344
Recently, deep learning techniques have been developed for source code summarization. Most existing studies have simply adopted natural language processing techniques, because source code summarization can be considered a machine translation task from source code into descriptions. However, source code and its description are very different, not only in the languages of writing but also in the purpose of writing. There is a large semantic gap between source code in programming languages and its descriptions in natural languages. To address this semantic gap, we propose a two-phase model that consists of a keyword predictor and a description generator. The keyword predictor captures the natural language keywords semantically associated with the source code, and the generator produces a description by referring to the keywords provided by the predictor. Using such keywords as scaffolding, we can effectively reduce the semantic gap and generate more accurate descriptions of source code. To evaluate the proposed method, we use datasets collected from GitHub and StackOverflow and perform various experiments with them. Our method shows outstanding performance compared with baselines that include state-of-the-art methods, indicating that keyword prediction is very helpful for generating accurate descriptions.
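The two-phase structure (keywords first, description second) can be sketched with a deliberately crude stand-in: rank identifier sub-tokens by frequency as "predicted keywords", then stitch them into a description. Both functions below are hypothetical toys for illustration; the paper's predictor and generator are learned neural models.

```python
import re
from collections import Counter

def predict_keywords(code, k=3):
    """Toy keyword predictor: split identifiers on underscores and
    camelCase, then rank sub-tokens by frequency (a stand-in for the
    learned keyword predictor in the abstract)."""
    idents = re.findall(r"[A-Za-z_]\w*", code)
    subtokens = []
    for ident in idents:
        parts = re.split(r"_|(?<=[a-z])(?=[A-Z])", ident)
        subtokens += [p.lower() for p in parts if p]
    return [w for w, _ in Counter(subtokens).most_common(k)]

def generate_description(keywords):
    """Toy generator: stitches the predicted keywords into a template."""
    return "Method related to: " + ", ".join(keywords)

code = "def sortUserList(user_list): return sorted(user_list)"
kws = predict_keywords(code)
print(generate_description(kws))
```

The point of the scaffolding is that the second phase conditions on natural language tokens that already bridge the gap to the target description.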
(Source) code summarization aims to automatically generate succinct natural language summaries for given code snippets. Such summaries play a significant role in helping developers understand and maintain code. Inspired by neural machine translation, deep learning-based code summarization techniques widely adopt an encoder-decoder framework, where the encoder transforms given code snippets into context vectors and the decoder decodes context vectors into summaries. Recently, large-scale pre-trained models for source code (e.g., CodeBERT and UniXcoder) are equipped with encoders capable of producing general context vectors and have achieved substantial improvements on the code summarization task. However, because they are usually trained mainly on code-focused tasks, they can capture general code features but still fall short in capturing the specific features that need to be summarized. In a nutshell, they fail to learn the alignment between code snippets and summaries (code-summary alignment for short). In this paper, we propose a novel approach to improve code summarization based on summary-focused tasks. Specifically, we exploit a multi-task learning paradigm to train the encoder on three summary-focused tasks to enhance its ability to learn code-summary alignment: unidirectional language modeling (ULM), masked language modeling (MLM), and action word prediction (AWP). Unlike pre-trained models that mainly predict masked tokens in code snippets, we design ULM and MLM to predict masked words in summaries. Intuitively, predicting words based on given code snippets helps learn the code-summary alignment. In addition, existing work shows that AWP affects the prediction of the entire summary. Therefore, we further introduce the domain-specific task AWP to enhance the encoder's ability to learn the alignment between action words and code snippets. We evaluate the effectiveness of our approach, called Esale, by conducting extensive experiments on
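The summary-focused MLM objective differs from ordinary code MLM only in what gets masked: words of the summary, to be predicted from the code. A minimal sketch of the data preparation (the model itself is omitted, and the `[MASK]` token and rates are illustrative assumptions):

```python
import random

def mask_summary(summary_tokens, mask_rate=0.15, seed=0):
    """MLM-style masking over *summary* words (not code tokens),
    mirroring the summary-focused MLM objective: masked slots become
    prediction targets conditioned on the code. Labels are None where
    no word was masked."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in summary_tokens:
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

summary = "sorts the user list in ascending order".split()
masked, labels = mask_summary(summary, mask_rate=0.3)
# AWP target: the action word, conventionally the leading verb.
action_word = summary[0]
```

AWP then reduces to predicting that single leading verb, which prior work found to strongly constrain the rest of the generated summary.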
Source code summarization is the task of creating readable summaries that describe the functionality of software. It is a critical component of documentation generation, for example Javadocs formed from short paragraphs attached to each method in a Java program. At present, most source code summarization is manual, in that the paragraphs are written by human experts. However, new automated technologies are becoming feasible. These automated techniques have been shown to be effective in select situations, though a key weakness is that they do not explain the source code's context. That is, they can describe the behavior of a Java method, but not why the method exists or what role it plays in the software. In this paper, we propose a source code summarization technique that writes English descriptions of Java methods by analyzing how those methods are invoked. We then performed two user studies to evaluate our approach. First, we compared our generated summaries to summaries written manually by experts. Then, we compared our summaries to summaries written by a state-of-the-art automatic summarization tool. We found that while our approach does not reach the quality of human-written summaries, it does improve over the state-of-the-art summarization tool in several dimensions by a statistically significant margin.
Programs are, in essence, a collection of implemented features. Feature discovery in software engineering is the task of identifying key functionalities that a program implements. Manual feature discovery can be time-consuming and expensive, leading to the development of automatic feature discovery tools. However, these approaches typically only describe features using lists of keywords, which can be difficult for readers who are not already familiar with the source code. An alternative to keyword lists is sentence selection, in which one sentence is chosen from among the sentences in a text document to describe that document. Sentence selection has been widely studied in the context of natural language summarization but is only beginning to be explored as a solution to feature discovery. In this paper, we compare four sentence selection strategies for the purpose of feature discovery. Two are off-the-shelf approaches, while two are adaptations we propose. We present our findings as guidelines and recommendations to designers of feature discovery tools. Copyright (c) 2016 John Wiley & Sons, Ltd.
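The idea of sentence selection can be made concrete with one common off-the-shelf heuristic: pick the sentence whose word distribution best matches the document as a whole (a centroid-style score). This is an illustrative assumption, not one of the four strategies the paper actually compares.

```python
import re
from collections import Counter

def select_sentence(sentences):
    """Centroid-style sentence selection: score each sentence by the
    average document-wide frequency of its words and return the best.
    A simple stand-in for the selection strategies discussed above."""
    doc = Counter(w for s in sentences for w in re.findall(r"\w+", s.lower()))
    def score(s):
        ws = re.findall(r"\w+", s.lower())
        return sum(doc[w] for w in ws) / len(ws) if ws else 0.0
    return max(sentences, key=score)

sentences = [
    "This project is hosted on our website.",
    "The tool parses source files and reports style violations.",
    "The parser reads each source file and checks every style rule.",
]
best = select_sentence(sentences)
```

Sentences that reuse the document's dominant vocabulary win, which tends to favor feature-describing sentences over boilerplate.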
An automatic source code summarization system aims to generate a valuable natural language description for a program, which can facilitate software development and maintenance, code categorization, and retrieval. However, previous sequence-based research did not consider the long-distance dependencies and highly structured characteristics of source code simultaneously. In this article, we present a Transformer-based Graph-Augmented Source Code Summarization model (GA-SCS), which can effectively incorporate the inherent structural and textual features of source code to generate an effective code description. Specifically, we develop a graph-based structure feature extraction scheme leveraging the abstract syntax tree and graph attention networks to mine global syntactic information. Then, to take full advantage of the lexical and syntactic information of code snippets, we extend the original attention to a syntax-informed self-attention mechanism in our encoder. In the training process, we also adopt a reinforcement learning strategy to enhance the readability and informativeness of generated code summaries. We use a Java dataset and a Python dataset to evaluate the performance of different models. Experimental results demonstrate that our GA-SCS model outperforms all competitive methods on BLEU, METEOR, ROUGE, and human evaluations.
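Before any graph attention layer can run, the AST has to be turned into a graph. The sketch below builds the node list and adjacency matrix a GAT-style encoder would attend over, using Python's `ast` module as an illustrative parser; the neural layers described in the abstract are omitted.

```python
import ast

def ast_adjacency(code):
    """Parse code into an AST and return (nodes, adjacency matrix) with
    symmetric parent-child edges - the graph a graph attention network
    would consume in the scheme described above."""
    tree = ast.parse(code)
    nodes = list(ast.walk(tree))
    index = {n: i for i, n in enumerate(nodes)}
    n = len(nodes)
    adj = [[0] * n for _ in range(n)]
    for node in nodes:
        for child in ast.iter_child_nodes(node):
            i, j = index[node], index[child]
            adj[i][j] = adj[j][i] = 1  # undirected parent-child edge
    return nodes, adj

nodes, adj = ast_adjacency("x = 1")
```

Since the AST is a tree, the adjacency always contains exactly one edge per non-root node; richer variants add data-flow or next-sibling edges on top.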
Source code summarization is the task of writing natural language descriptions of source code. The primary use of these descriptions is in documentation for programmers. Automatic generation of these descriptions is a high-value research target due to the time cost to programmers of writing them themselves. In recent years, a confluence of software engineering and artificial intelligence research has made inroads into automatic source code summarization through applications of neural models of that source code. However, an Achilles' heel of the vast majority of approaches is that they tend to rely solely on the context provided by the source code being summarized. Yet empirical studies in program comprehension are quite clear that the information needed to describe code much more often resides in the context, in the form of the function call graph surrounding that code. In this paper, we present a technique for encoding this call graph context for neural models of code summarization. We implement our approach as a supplement to existing approaches and show statistically significant improvement over them. In a human study with 20 programmers, we show that programmers perceive generated summaries to generally be as accurate, readable, and concise as human-written summaries.
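Extracting the call graph context is straightforward to illustrate for Python code: walk each function definition and record which names it calls. This lightweight static sketch (direct name calls only, no attribute or dynamic calls) stands in for the call-graph extraction; how the paper encodes the graph for the neural model is not reproduced.

```python
import ast
from collections import defaultdict

def call_graph(source):
    """Extract a function-level call graph: caller name -> set of
    directly called names. A toy stand-in for the call-graph context
    the paper supplies to the summarization model."""
    tree = ast.parse(source)
    graph = defaultdict(set)
    for fn in ast.walk(tree):
        if isinstance(fn, ast.FunctionDef):
            for node in ast.walk(fn):
                # Only simple `name(...)` calls; method calls are skipped.
                if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                    graph[fn.name].add(node.func.id)
    return dict(graph)

src = """
def load(path): return open(path).read()
def main(): data = load("f.txt"); print(data)
"""
g = call_graph(src)
```

A summarizer with access to `g` knows, for instance, that `load` exists to serve `main`, which is exactly the "why does this method exist" context the abstract argues is missing.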
Code summarization is the process of automatically generating brief and informative summaries of source code to aid software comprehension and maintenance. In this paper, we propose a novel model called READSUM, a REtrieval-augmented ADaptive transformer for source code summarization, that combines abstractive and extractive approaches. Our proposed model generates code summaries in an abstractive manner, taking into account both the structural and sequential information of the input code, while also using an extractive approach that leverages a retrieved summary of similar code to increase the frequency of important keywords. To effectively blend the original code and the retrieved similar code at the embedding-layer stage, we obtain an augmented representation of the original code and the retrieved code through multi-head self-attention. In addition, we develop a self-attention network that adaptively learns the structural and sequential information of the representations in the encoder stage. Furthermore, we design a fusion network to capture the relation between the original code and the retrieved summary at the decoder stage. The fusion network effectively guides summary generation based on the retrieved summary. Finally, READSUM extracts important keywords using an extractive approach and generates high-quality summaries using an abstractive approach that considers both the structural and sequential information of the source code. We demonstrate the superiority of READSUM through various experiments and an ablation study. Additionally, we perform a human evaluation to assess the quality of the generated summaries.
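The retrieval step can be illustrated with a toy retriever: find the most similar snippet in a code-summary corpus by token Jaccard similarity and hand its summary to the generator. This is an assumed stand-in for READSUM's retriever; the fusion network that blends the retrieved summary into decoding is not modeled here.

```python
import re

def tokens(code):
    """Lowercased word-token set of a code snippet."""
    return set(re.findall(r"\w+", code.lower()))

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def retrieve_summary(query_code, corpus):
    """Return the stored summary of the corpus snippet most similar to
    the query by token Jaccard similarity - a toy retrieval step."""
    code, summary = max(
        corpus, key=lambda pair: jaccard(tokens(query_code), tokens(pair[0]))
    )
    return summary

corpus = [
    ("def add(a, b): return a + b", "adds two numbers"),
    ("def read_file(path): return open(path).read()", "reads a file"),
]
hint = retrieve_summary("def plus(x, y): return x + y", corpus)
```

The retrieved summary then acts as the extractive signal: words it contains (here, "adds", "numbers") get a frequency boost during abstractive generation.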
Developers spend much of their time reading and browsing source code, raising new opportunities for summarization methods. Indeed, modern code editors provide code folding, which allows one to selectively hide blocks of code. However, this is impractical to use, as folding decisions must be made manually or based on simple rules. We introduce the autofolding problem: automatically creating a code summary by folding less informative code regions. We present a novel solution by formulating the problem as a sequence of AST folding decisions, leveraging a scoped topic model for code tokens. On an annotated set of popular open source projects, we show that our summarizer outperforms simpler baselines, yielding a 28 percent error reduction. Furthermore, we find through a case study that our summarizer is strongly preferred by experienced developers. More broadly, we hope this work will aid program comprehension by turning code folding into a usable and valuable tool.
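The folding decision itself can be sketched with a crude informativeness score: keep the regions whose tokens are rare across the file and fold the rest. The rarity heuristic below is an illustrative assumption standing in for the paper's scoped topic model.

```python
import re
from collections import Counter

def autofold(regions, budget=1):
    """Fold all but the `budget` most informative regions, scoring a
    region by the rarity of its tokens across the whole file - a crude
    proxy for topic-model-based folding decisions."""
    counts = Counter(t for r in regions for t in re.findall(r"\w+", r))
    def score(region):
        return sum(1.0 / counts[t] for t in re.findall(r"\w+", region))
    keep = sorted(regions, key=score, reverse=True)[:budget]
    return [r if r in keep else "..." for r in regions]

regions = [
    "import os, sys, re",
    "def rank_results(hits): return sorted(hits, key=score, reverse=True)",
    "if __name__ == '__main__': main()",
]
folded = autofold(regions, budget=1)
```

Boilerplate regions (imports, main guards) score low and collapse to `...`, while the distinctive logic survives, mirroring how the summarizer turns folding into a summary.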
ISBN: (Print) 9781450392983
Source code summarization involves creating brief descriptions of source code in natural language. These descriptions are a key component of software documentation such as JavaDocs. Automatic code summarization is a prized target of software engineering research, due to the high value summaries have to programmers and the simultaneously high cost of writing and maintaining documentation by hand. Current work is almost all based on machine models trained on big data: large datasets of examples of code and summaries of that code are used to train, e.g., an encoder-decoder neural model. The output predictions of the model are then evaluated against a set of reference summaries; the input is code not seen by the model, and the prediction is compared to a reference. The means by which a prediction is compared to a reference is essentially word overlap, calculated via a metric such as BLEU or ROUGE. The problem with word overlap is that not all words in a sentence have the same importance, and many words have synonyms. The result is that the calculated similarity may not match the similarity perceived by human readers. In this paper, we conduct an experiment to measure the degree to which various word overlap metrics correlate with human-rated similarity of predicted and reference summaries. We evaluate alternatives based on current work in semantic similarity metrics and propose recommendations for the evaluation of source code summarization.
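The weakness the abstract points out is easy to demonstrate with the simplest overlap metric, clipped unigram precision (the core of BLEU-1): a prediction that replaces words with synonyms loses credit even though a human would rate it as equivalent. A minimal sketch:

```python
from collections import Counter

def unigram_precision(prediction, reference):
    """Clipped unigram precision: the fraction of predicted words that
    also appear in the reference, with per-word counts clipped to the
    reference counts - the building block of BLEU-style word overlap."""
    pred = prediction.lower().split()
    ref = Counter(reference.lower().split())
    matched = sum(min(c, ref[w]) for w, c in Counter(pred).items())
    return matched / len(pred) if pred else 0.0

# Synonyms get no credit, illustrating the mismatch with human judgment:
ref = "returns the sum of two numbers"
p = unigram_precision("returns the total of two values", ref)
```

Here "total" and "values" are semantically fine substitutes for "sum" and "numbers", yet the metric scores only 4 of 6 words, which is precisely why semantic similarity metrics are evaluated as alternatives.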