Code summarization is the process of creating readable natural language from program source code. It has become a popular research topic for software maintenance, code generation, and code recovery. Existing code summarization methods follow an encoder-decoder approach and use various machine learning techniques to generate natural language from source code. Although many of these methods are state of the art, the complex encoding and decoding process that maps code tokens to natural language words is difficult to understand, so these approaches are treated as opaque (black-box) models. This research proposes explainable AI methods that overcome the black-box nature of token mapping in the code summarization process. We first create an abstract syntax tree (AST) from the tokens of the source code. We then align the AST with natural language words using a bilingual statistical probability approach to generate candidate statistical parse trees, and apply the PageRank algorithm to rank these trees. From the best-ranked tree, we generate the comment for the corresponding code snippet. To explain our generation method, we use a Takagi-Sugeno fuzzy approach, layer-wise relevance propagation, and a hidden Markov model. These techniques make our method trustworthy and allow humans to understand how source code tokens are mapped to natural language words.
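As a rough illustration of the ranking step described above (not the authors' implementation), the sketch below builds a similarity graph over hypothetical candidate comments standing in for parse trees and ranks them with a hand-rolled PageRank; the similarity function and the candidates are placeholders.

```python
def pagerank(adj, damping=0.85, iters=50):
    """Plain power-iteration PageRank over a dense adjacency matrix."""
    n = len(adj)
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = []
        for j in range(n):
            inflow = sum(
                rank[i] * adj[i][j] / max(sum(adj[i]), 1e-9)
                for i in range(n)
            )
            new.append((1 - damping) / n + damping * inflow)
        rank = new
    return rank

def tree_similarity(a, b):
    """Toy similarity: word overlap between two candidate comments."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / max(len(sa | sb), 1)

# Hypothetical candidates produced by the statistical alignment step.
candidates = [
    "return the sum of two numbers",
    "compute sum of the inputs",
    "print a greeting message",
]
adj = [[tree_similarity(a, b) if a != b else 0.0 for b in candidates]
       for a in candidates]
scores = pagerank(adj)
best = candidates[max(range(len(scores)), key=scores.__getitem__)]
print(best)  # the highest-ranked candidate becomes the comment
```

Candidates that agree with many other candidates accumulate rank, so the consensus phrasing wins; the paper's trees would carry richer structure than these flat strings.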
Context: Code summarization is the task of generating a concise natural language description of a code snippet. Recent efforts have boosted code summarization performance from various perspectives, e.g., retrieving external information or introducing large Transformer-based models, and have achieved promising performance for individual programming languages. However, when dealing with rapidly expanding cross-language source code datasets, existing approaches suffer from two issues: (1) the difficulty of building a universal code representation for multiple languages; (2) poor performance on low-resource languages. Objective: To cope with these issues, we propose a novel code summarization approach named RaxCS, which performs code summarization across multiple languages and improves accuracy for low-resource languages by leveraging cross-language knowledge. Methods: We exploit pre-trained models with a contrastive learning objective to build a unified code representation across multiple languages. To fully mine external knowledge across programming languages, we design a hybrid retrieval module that searches for functionally equivalent code and its corresponding comment to serve as preliminary information. Finally, we employ a decoder-only Transformer model to fuse the contextual information, which guides the process of generating summaries. Results: Extensive experiments demonstrate that (1) RaxCS outperforms the state of the art on cross-language code summarization (scoring 4.39% higher in BLEU and 8.65% higher in BERTScore), and (2) for low-resource languages, RaxCS boosts code summarization performance by a significant margin (e.g., 6.93% in BLEU for Ruby) with cross-language retrieval. Conclusion: This paper introduces a cross-language code summarization model that utilizes contrastive pre-training and cross-language retrieval, both of which are beneficial for incorporating cross-language knowledge.
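Contrastive pre-training objectives of this kind are typically an InfoNCE-style loss over paired embeddings; the sketch below is a generic version under that assumption, not RaxCS's actual code. The encoders and the batch are placeholders.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchor, positive, temperature=0.07):
    """Contrastive loss: pull paired code embeddings together and
    push apart all other samples in the batch (in-batch negatives)."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.T / temperature  # (B, B) similarity matrix
    labels = torch.arange(anchor.size(0))       # diagonal entries are true pairs
    return F.cross_entropy(logits, labels)

# Toy usage: embeddings of the same function in two languages
# (e.g., a Java and a Python version) act as a positive pair.
java_emb = torch.randn(8, 256, requires_grad=True)    # stand-in for encoder(java_code)
python_emb = torch.randn(8, 256, requires_grad=True)  # stand-in for encoder(python_code)
loss = info_nce_loss(java_emb, python_emb)
loss.backward()
```

Training on such cross-language pairs is one plausible way to pull semantically equivalent code from different languages into a shared representation space, which is what the unified representation requires.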
Code summarization aims to convert structured program code into comprehensible natural language descriptions, significantly benefiting software development. Existing approaches mainly employ structure-to-sequence frameworks designed for the Abstract Syntax Tree (AST) form of source code, extensively using architectures such as tree-based LSTMs and graph neural networks. From the modeling process to the encoding architecture, these frameworks cannot effectively learn some of the complex dependencies within code snippets. In this paper, we propose a Structure-aware Dual Graph Neural Network (SDGNN) for code summarization. Specifically, SDGNN employs both a grammatical dependency graph and a semantic dependency graph to capture the complex dependencies of program code. To learn the dual graph effectively, we further devise hierarchical propagation and graphical propagation to generate the code encoding, as well as a graph-alignment-based dual-graph decoder to generate summaries from the encoding. Extensive experiments on three programming language datasets show that our framework outperforms state-of-the-art solutions.
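To make the dual-graph idea concrete, here is a minimal sketch (my illustration, not SDGNN's code) that extracts two edge sets from a Python function's AST: parent-child edges as a stand-in for the grammatical dependency graph, and variable def-use edges as a stand-in for the semantic dependency graph.

```python
import ast

source = """
def area(w, h):
    s = w * h
    return s
"""
tree = ast.parse(source)

# Grammatical edges: parent -> child relations in the AST.
grammatical_edges = [
    (type(parent).__name__, type(child).__name__)
    for parent in ast.walk(tree)
    for child in ast.iter_child_nodes(parent)
]

# Semantic edges: link each variable definition (or parameter)
# to its later uses, a simple def-use approximation.
defs, semantic_edges = {}, []
for node in ast.walk(tree):
    if isinstance(node, ast.arg):
        defs[node.arg] = node
    elif isinstance(node, ast.Name):
        if isinstance(node.ctx, ast.Store):
            defs[node.id] = node
        elif isinstance(node.ctx, ast.Load) and node.id in defs:
            semantic_edges.append((node.id + ":def", node.id + ":use"))

print(len(grammatical_edges), "grammatical edges")
print(semantic_edges)  # def-use links for s, w, and h
```

A dual GNN would then propagate messages over each edge set separately and align the two resulting node encodings, which is roughly what the hierarchical and graphical propagation described above do.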
ISBN (digital): 9783030368029
ISBN (print): 9783030368029; 9783030368012
Code summarization, which provides a high-level description of the function implemented by code, plays a vital role in software maintenance and code retrieval. Traditional approaches focus on retrieving similar code snippets to generate summaries; more recently, researchers have paid increasing attention to deep learning approaches, especially the encoder-decoder framework. Encoder-decoder approaches suffer from two drawbacks: (a) they lack summarization at the functionality level; (b) code snippets are often long (more than ten words), and regular encoders perform poorly on such inputs. In this paper, we propose a novel code representation built with the help of Abstract Syntax Trees, which can describe the functionality of code snippets and shortens the length of inputs. Based on our proposed code representation, we develop a Generative Task, which aims to generate summary sentences for code snippets. Experiments on large-scale real-world industrial Java projects indicate that our approaches are effective and outperform state-of-the-art approaches to code summarization.
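One plausible way to realize such a shortened, functionality-level representation (a sketch under my own assumptions, since the abstract gives no details) is to keep only the function name, parameters, and called API names from the AST, discarding the rest of the body:

```python
import ast

def functional_signature(source):
    """Compress a function into a short, functionality-level token
    sequence: its name, its parameters, and the APIs it calls."""
    tree = ast.parse(source)
    fn = next(n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef))
    tokens = [fn.name] + [a.arg for a in fn.args.args]
    for node in ast.walk(fn):
        if isinstance(node, ast.Call):
            callee = node.func
            if isinstance(callee, ast.Name):
                tokens.append(callee.id)
            elif isinstance(callee, ast.Attribute):
                tokens.append(callee.attr)
    return tokens

code = """
def load_users(path):
    with open(path) as f:
        return json.load(f)
"""
print(functional_signature(code))  # ['load_users', 'path', 'open', 'load']
```

The four-token output is far shorter than the raw token stream, which is the property the paper exploits to keep inputs within what a regular encoder handles well.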
ISBN (print): 9798400717017
Software Engineering (SE) researchers are extensively applying Large Language Models (LLMs) to address challenges in SE tasks such as code clone detection, code summarization, and program comprehension. Despite promising results, LLMs have to be fine-tuned and customized with specific datasets for optimal performance. However, the proprietary nature of SE data and the lack of LLMs trained on non-open-source data remain open problems. While there is existing work on applying Federated Learning (FL) to SE, the integration of FL with LLMs for SE is unexplored. Hence, we propose a FedLLM for code summarization, since developers spend much of their time comprehending code. We set up a federated learning architecture and fine-tune an LLM (Llama2 with 6.7B parameters) using Parameter-Efficient Fine-Tuning (PEFT) for code summarization. We conducted our experiments on an A100 GPU with 40GB of memory. Results show that an FL-trained LLM is as effective as a centrally trained one. We envision that leveraging non-open-source data via FedLLM for SE could be an interesting research direction.
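The abstract does not detail the aggregation scheme; a common choice, sketched below purely as an assumption, is FedAvg over the PEFT (LoRA) adapter weights only, so each client shares its small adapter rather than the full model:

```python
import torch

def fedavg(adapter_states):
    """Average LoRA adapter state dicts from all clients (FedAvg).
    Only the small adapter tensors travel over the network."""
    avg = {}
    for key in adapter_states[0]:
        avg[key] = torch.stack([s[key] for s in adapter_states]).mean(dim=0)
    return avg

# Toy round: three clients hold locally fine-tuned adapter weights
# (shapes are illustrative, not the paper's configuration).
clients = [
    {"lora_A": torch.randn(8, 256), "lora_B": torch.randn(256, 8)}
    for _ in range(3)
]
global_adapter = fedavg(clients)
print({k: tuple(v.shape) for k, v in global_adapter.items()})
```

Aggregating only adapters keeps each round's communication to a few megabytes, which is what makes federating a multi-billion-parameter LLM practical in the first place.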
(Source) code summarization aims to automatically generate summaries/comments for given code snippets in natural language. Such summaries play a key role in helping developers understand and maintain source code. Existing code summarization techniques can be categorized into extractive methods and abstractive methods. Extractive methods use retrieval techniques to extract a subset of important statements and keywords from the code snippet and generate a summary that preserves the factual details in those statements and keywords. However, such a subset may miss identifier or entity naming, so the naturalness of the generated summary is usually poor. Abstractive methods can generate human-written-like summaries by leveraging encoder-decoder models, but the generated summaries often miss important factual details. To generate human-written-like summaries that preserve factual details, we propose a novel extractive-and-abstractive framework. The extractive module performs extractive code summarization: it takes in the code snippet and predicts the important statements containing key factual details. The abstractive module performs abstractive code summarization: it takes in the code snippet and the important statements in parallel and generates a succinct, human-written-like natural language summary. We evaluate the effectiveness of our technique, called EACS, by conducting extensive experiments on three datasets involving six programming languages. Experimental results show that EACS significantly outperforms state-of-the-art techniques on all three widely used metrics: BLEU, METEOR, and ROUGE-L. In addition, a human evaluation demonstrates that the summaries generated by EACS have higher naturalness and informativeness and are more relevant to the given code snippets.
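As an illustration of the extractive half of such a pipeline (a heuristic stand-in, since EACS's statement predictor is a learned model), the sketch below scores body statements by the identifiers they share with the function signature and keeps the top-k:

```python
import re

def extract_key_statements(code, k=2):
    """Heuristic extractive step: rank body statements by how many
    identifiers they share with the function's signature line."""
    lines = [ln.strip() for ln in code.strip().splitlines()]
    header, body = lines[0], lines[1:]
    sig_ids = set(re.findall(r"[A-Za-z_]\w*", header))
    scored = sorted(
        body,
        key=lambda ln: len(sig_ids & set(re.findall(r"[A-Za-z_]\w*", ln))),
        reverse=True,
    )
    return scored[:k]

code = """
def save_report(report, path):
    data = report.to_json()
    log.debug("saving")
    open(path, "w").write(data)
"""
print(extract_key_statements(code))  # drops the logging statement
```

The abstractive module would then consume both the full snippet and these extracted statements, so the generator keeps the factual anchors (names, entities) the heuristic surfaced.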
With the fast development of large software projects, automatic code summarization techniques, which summarize the main functionality of a piece of code in natural language comments, play an essential role in helping developers understand and maintain large software projects. Many research efforts have been devoted to building automatic code summarization approaches. Typical code summarization approaches are based on deep learning models: they cast the task as a sequence-to-sequence problem that takes source code as input and outputs a summary in natural language. Code summarization models impose varying input size limits, from 50 to 10,000 tokens, on the input source code. However, how the input size limit affects the performance of code summarization models remains under-explored. In this article, we first conduct an empirical study to investigate the impact of different input size limits on the quality of generated code comments. To our surprise, experiments on multiple models and datasets reveal that setting a low input size limit, such as 20, does not necessarily reduce the quality of generated comments. Based on this finding, we further propose to use function signatures instead of full source code to summarize the main functionality, inputting only the function signatures into code summarization models. Experiments and statistical results show that inputs with signatures are, on average, more than 2 percentage points better than inputs without signatures, demonstrating the effectiveness of involving function signatures in code summarization. We also invited programmers to complete a questionnaire evaluating the quality of code summaries generated from the two inputs at different truncation levels. The results show that function signatures generate, on average, 9.2% more high-quality comments than full code.
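A minimal version of the signature-only preprocessing described above (my sketch; the paper's exact truncation rules may differ) can be written with Python's ast module:

```python
import ast

def signature_only(source):
    """Replace each function body with a stub so the model sees only
    the signature (name, parameters, and any return annotation)."""
    tree = ast.parse(source)
    sigs = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            params = ", ".join(a.arg for a in node.args.args)
            ret = f" -> {ast.unparse(node.returns)}" if node.returns else ""
            sigs.append(f"def {node.name}({params}){ret}: ...")
    return "\n".join(sigs)

code = """
def rolling_mean(series, window) -> list:
    out = []
    for i in range(window, len(series) + 1):
        out.append(sum(series[i - window:i]) / window)
    return out
"""
print(signature_only(code))  # def rolling_mean(series, window) -> list: ...
```

This keeps exactly the tokens the study found most informative (the name and parameter list) while cutting the input far below even a 20-token limit.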
Code summarization aims to generate concise natural language descriptions for a piece of code, which can help developers comprehend source code. Analysis of current work shows that extracting the syntactic and semantic features of source code is crucial for generating high-quality summaries. To provide a more comprehensive feature representation of source code from different perspectives, we propose an approach named EnCoSum, which enhances the semantic features of our multi-scale multi-modal code summarization method. This method extends our previously proposed M2TS approach (a multi-scale multi-modal approach based on the Transformer for source code summarization), which uses a multi-scale method to capture the structural information of Abstract Syntax Trees (ASTs) more completely and accurately at multiple local and global levels. In addition, we devise a new cross-modal fusion method to fuse source code and AST features, which can highlight the key features in each modality that help generate summaries. To obtain richer semantic information, we improve M2TS in two ways. First, we add data-flow and control-flow edges to the ASTs, yielding edge-augmented ASTs called Enhanced-ASTs (E-ASTs). Second, we introduce method name sequences extracted from the source code, which carry more knowledge about the critical tokens in the corresponding summaries and can help the model generate higher-quality summaries. We conduct extensive experiments on processed Java and Python datasets and evaluate our approach with the four most commonly used machine translation metrics. The experimental results demonstrate that EnCoSum is effective and outperforms current state-of-the-art methods. Furthermore, we perform ablation experiments on each of the model's key components, and the results show that they all contribute to the performance of EnCoSum.
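Method name sequences of the kind mentioned above are typically obtained by splitting identifiers into subtokens; the sketch below shows one common way to do this (my assumption about the preprocessing, not EnCoSum's released code):

```python
import re

def method_name_sequence(name):
    """Split a method identifier into its subtoken sequence,
    handling both camelCase and snake_case conventions."""
    tokens = []
    for part in re.split(r"_+", name):
        # Break camelCase / PascalCase runs: "parseHTTPResponse"
        # -> ["parse", "HTTP", "Response"]
        tokens += re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]*|[a-z]+|\d+", part)
    return [t.lower() for t in tokens if t]

print(method_name_sequence("parseHTTPResponse"))  # ['parse', 'http', 'response']
print(method_name_sequence("get_user_id"))        # ['get', 'user', 'id']
```

Subtokens such as "parse" and "response" frequently reappear verbatim in reference summaries, which is why feeding this sequence alongside the code and E-AST modalities can guide the decoder toward the critical tokens.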
Code summarization refers to automatically generating a concise natural language description of a code snippet. Good code summaries can effectively facilitate program comprehension and software maintenance. In recent years, various learning-based code summarization techniques have achieved impressive performance. Most of these models treat code summarization as an end-to-end task and directly generate the summaries, ignoring the fact that action words are crucial to code summaries. An essential characteristic of code summaries is the concentration of the action word distribution: in the Funcom dataset, for instance, the forty most common action words account for 72% of all samples. To incorporate this valuable prior domain knowledge into code summarization models, we develop a method for assisting code summarization through an additional action word prediction module: an action predictor predicts the primary action in the code summary, which is then used as a prompt to enhance the summary generation model. Our approach can be conveniently integrated into existing models. We evaluate it on two Java datasets and a C/C++ dataset. The results show that our approach efficiently improves the performance of code summarization models. Furthermore, our action word prediction module can enhance the performance of a large pre-trained language model by prompting it with the predicted action words. This work suggests that a precise action word prediction model can significantly improve code summarization through the proposed action word guidance mechanism.
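To illustrate the prior that motivates this design (a toy reconstruction, not the paper's pipeline), the sketch below takes the leading token of each reference summary as its action word, measures how concentrated the distribution is, and shows how a predicted action word could be prepended as a prompt:

```python
from collections import Counter

summaries = [
    "returns the index of the given element",
    "returns a copy of the internal buffer",
    "checks whether the queue is empty",
    "removes the first occurrence of the value",
]

# Action word = leading token of the summary (a common convention).
actions = Counter(s.split()[0] for s in summaries)
top = actions.most_common(2)
coverage = sum(c for _, c in top) / len(summaries)
print(top, f"top-2 coverage: {coverage:.0%}")  # 75% from just two words

# A predicted action word can then steer generation as a prompt prefix.
predicted_action = "returns"
code_input = "int indexOf(Object o) { ... }"
prompt = f"action: {predicted_action} | code: {code_input}"
print(prompt)
```

Because so few action words cover most summaries, even a simple classifier over them can supply a reliable prompt, which is the guidance mechanism the paper exploits.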