In recent years, deep learning models have shown great potential in source code modeling and analysis. Generally, deep learning-based approaches are problem-specific and data-hungry; a challenging issue is that they require training from scratch for each new related problem. In this work, we propose a transfer learning-based approach that significantly improves the performance of deep learning-based source code models. In contrast to traditional learning paradigms, transfer learning transfers the knowledge learned while solving one problem to another related problem. First, we present two recurrent neural network-based models, an RNN and a GRU, for the purpose of transfer learning in the domain of source code modeling. Next, via transfer learning, these pre-trained RNN and GRU models are used as feature extractors. The extracted features are then fed into an attention learner for different downstream tasks. The attention learner leverages the knowledge of the pre-trained models and fine-tunes it for a specific downstream task. We evaluate the proposed approach with extensive experiments on the source code suggestion task. The results indicate that it outperforms state-of-the-art models in terms of accuracy, precision, recall, and F-measure without training the models from scratch.
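To make that pipeline concrete, the following minimal PyTorch sketch shows a frozen pre-trained GRU reused as a feature extractor, with a small attention learner fine-tuned on top for a downstream classification task. All names, dimensions, and the single-layer attention are illustrative assumptions, not the authors' implementation:

import torch
import torch.nn as nn

class AttentionLearner(nn.Module):
    # Attention head fine-tuned on top of a frozen, pre-trained recurrent encoder.
    def __init__(self, pretrained_gru: nn.GRU, hidden_dim: int, num_classes: int):
        super().__init__()
        self.encoder = pretrained_gru              # assumed created with batch_first=True
        for p in self.encoder.parameters():        # freeze: use as feature extractor only
            p.requires_grad = False
        self.attn = nn.Linear(hidden_dim, 1)       # scores each time step
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, embedded_tokens):            # (batch, seq_len, emb_dim)
        features, _ = self.encoder(embedded_tokens)           # (batch, seq, hidden)
        weights = torch.softmax(self.attn(features), dim=1)   # attention over steps
        context = (weights * features).sum(dim=1)             # weighted summary vector
        return self.classifier(context)                       # downstream task logits

Only the attention and classifier parameters receive gradients, which is what allows the approach to skip training from scratch.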
Deep Learning (DL) techniques for Natural Language Processing have been evolving remarkably fast. Recently, DL advances in language modeling, machine translation, and paragraph understanding have become so prominent that the potential of DL in Software Engineering cannot be overlooked, especially in the field of program learning. To facilitate further research and applications of DL in this field, we provide a comprehensive review that categorizes and investigates existing DL methods for source code modeling and generation. To address the limitations of traditional source code models, we formulate common program learning tasks under an encoder-decoder framework. After that, we introduce recent DL mechanisms suited to solving such problems. Finally, we present state-of-the-art practices and discuss their challenges, along with recommendations for practitioners and researchers.
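As an illustration of that encoder-decoder framing (a generic sketch, not tied to any one surveyed method), a program learning task such as code summarization reduces to mapping one token sequence to another; the vocabulary sizes and LSTM choice here are assumptions:

import torch.nn as nn

class Seq2Seq(nn.Module):
    # Generic encoder-decoder: maps a source token sequence to a target sequence.
    def __init__(self, src_vocab: int, tgt_vocab: int, dim: int = 256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src_tokens, tgt_tokens):
        _, state = self.encoder(self.src_emb(src_tokens))        # encode source code
        dec_out, _ = self.decoder(self.tgt_emb(tgt_tokens), state)
        return self.out(dec_out)   # next-token logits over the target vocabulary

Swapping the recurrent encoder and decoder for attention-based ones yields the Transformer-style variants discussed in the review.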
Studies have confirmed the robust performance of machine learning classifiers on various source code modeling tasks. In general, machine learning approaches handle imbalanced datasets poorly, since they are sensitive to the distribution of classes and may lean towards the classes with a large share of observations. In this work, we investigate the impact of balanced and imbalanced learning on the source code suggestion task, otherwise known as code completion, which covers a large number of imbalanced classes. We further explore the impact of vocabulary size on modeling performance. First, we provide the essentials to formulate source code suggestion as a classification task and measure the degree of class imbalance. Second, we train the four most widely adopted neural language models as baselines to assess modeling performance. Third, we apply two class balancing techniques, TomekLinks and AllKNN, to balance the datasets and evaluate their impact on modeling performance. Finally, we train these models with a weighted imbalanced learning approach and compare the results with the balanced learning approaches. Additionally, we train models with varying vocabulary sizes to study their impact. In total, we trained 230 models on 10 real-world software projects and evaluated them extensively with widely used performance metrics such as precision, recall, F-score, mean reciprocal rank (MRR), and receiver operating characteristic (ROC). We also employed ANOVA to study the statistical significance of the differences between these approaches. This study demonstrates that modeling performance decreases under balanced training, whereas weighted imbalanced training produces comparable results and is more efficient in terms of time cost. It also shows that a large vocabulary does not necessarily improve modeling performance.
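The two balancing steps and the weighted alternative map directly onto standard library calls. A minimal sketch using imbalanced-learn and scikit-learn, with synthetic data standing in for the real token-level features:

import numpy as np
from imblearn.under_sampling import TomekLinks, AllKNN
from sklearn.utils.class_weight import compute_class_weight

X = np.random.rand(1000, 32)                          # placeholder feature vectors
y = np.random.choice(3, 1000, p=[0.8, 0.15, 0.05])    # deliberately imbalanced classes

# Option 1: under-sample the majority classes before training.
X_tl, y_tl = TomekLinks().fit_resample(X, y)
X_knn, y_knn = AllKNN().fit_resample(X, y)

# Option 2: keep the data as-is and weight the loss per class instead.
weights = compute_class_weight("balanced", classes=np.unique(y), y=y)
# e.g. pass `weights` to torch.nn.CrossEntropyLoss(weight=...) during training

Option 2 avoids discarding any samples, which is consistent with the paper's finding that weighted training is cheaper in time cost.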
Attention-based transformer language models have shown significant performance gains on various natural language tasks. In this work, we explore the impact of transformer language models on the task of source code suggestion. The core intention of this work is to boost modeling performance for the source code suggestion task and to explore how training procedures and model architectures affect it. Additionally, we propose a transformer-based self-supervised learning technique called Transformer Gated Highway that outperforms recurrent and transformer language models of comparable size. The proposed approach combines the Transformer language model with a Gated Highway, introducing a notion of recurrence. We compare the performance of the proposed approach with the transformer-based BERT (codeTran), RoBERTa (RoBERTacode), GPT2 (TravTrans), and codeGen models, and with the recurrent LSTM-based codeLSTM model. Moreover, we experiment with various architectural settings for the transformer models to evaluate their impact on modeling performance. An extensive evaluation shows that the presented approach performs better on two programming language datasets: Java and C#. We have also adapted the presented approach to the syntax error correction task, predicting the correct syntax token to demonstrate its possible implications for other source code modeling tasks.
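The abstract does not spell out the architecture, but one plausible reading of combining a Transformer with a gated highway is a standard highway gate wrapped around a transformer encoder layer. The sketch below is that generic construction only, explicitly not the authors' Transformer Gated Highway:

import torch
import torch.nn as nn

class GatedHighwayBlock(nn.Module):
    # Highway gating around a transformer layer: blend transformed and carried input.
    # Illustrative assumption only; the paper's architecture may differ.
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                batch_first=True)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (batch, seq, dim)
        h = self.layer(x)                       # transformed representation
        g = torch.sigmoid(self.gate(x))         # per-dimension carry gate
        return g * h + (1.0 - g) * x            # gated mix of transform and carry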
The Transformer architecture and transfer learning have marked a quantum leap in natural language processing, improving the state of the art across a range of text-based tasks. This paper examines how these advancements can be applied to, and improve, code search. To this end, we pre-train a BERT-based model on combinations of natural language and source code data and fine-tune it on pairs of StackOverflow question titles and code answers. Our results show that the pre-trained models consistently outperform the models that were not pre-trained. When the model is pre-trained on both natural language and source code data, it also outperforms an information retrieval baseline based on Lucene. We further demonstrate that an information retrieval-based approach followed by a Transformer yields the best results overall, especially when searching in a large search pool. Transfer learning is particularly effective when much pre-training data is available and fine-tuning data is limited. We demonstrate that natural language processing models based on the Transformer architecture can be directly applied to source code analysis tasks such as code search. With the development of Transformer models designed more specifically for source code data, we believe the results of source code analysis tasks can be further improved.
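A minimal sketch of scoring code snippets against a natural language query with a BERT-style model, using the Hugging Face transformers library. The checkpoint name, mean pooling, and cosine ranking are all assumptions for illustration; the paper's fine-tuned model and scoring head may differ:

import torch
from transformers import AutoModel, AutoTokenizer

# Any BERT-style checkpoint can stand in here; this one is just an assumption.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(text: str) -> torch.Tensor:
    batch = tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**batch).last_hidden_state
    return out.mean(dim=1).squeeze(0)          # mean-pooled sentence vector

query = embed("how to reverse a list in python")
snippet = embed("def rev(xs): return xs[::-1]")
score = torch.cosine_similarity(query, snippet, dim=0)  # rank candidates by this

The paper's strongest configuration would correspond to first narrowing the pool with Lucene-style retrieval and only then re-ranking the survivors with the Transformer.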
Integrated Development Environments (IDEs) are pivotal in enhancing productivity in modern software development, with features such as code completion. Recent advancements in Natural Language Processing (NLP) have empowered neural language models for code completion. In this study, we present an extensive investigation of the impact of open and closed vocabulary systems on the code completion task. Specifically, we compare open and closed vocabulary systems at various vocabulary sizes to observe their effect on code completion performance. We experiment with three open vocabulary systems, byte pair encoding (BPE), WordPiece, and Unigram, and compare them against closed-vocabulary systems to analyze their modeling performance. We also conduct experiments with different context sizes to study their impact on code completion performance. Our experiments cover several prominent language models: one recurrent neural network and five transformers. Our results indicate that vocabulary size significantly impacts modeling performance and can artificially boost the accuracy of code completion models, especially with a closed-vocabulary system. Moreover, we find that different vocabulary systems have varying impacts on token coverage, with open-vocabulary systems exhibiting better coverage. Our findings offer valuable insights for building effective code completion models, aiding researchers and practitioners in this field.
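Training an open-vocabulary tokenizer of the kind compared here takes only a few lines with the Hugging Face tokenizers library. The corpus path and vocabulary size below are placeholder assumptions:

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Train an open-vocabulary BPE tokenizer on a source code corpus.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=10_000,          # the key knob under study
                              special_tokens=["<unk>"])
tokenizer.train(files=["corpus.java"], trainer=trainer)   # placeholder corpus path

print(tokenizer.encode("public static void main").tokens)

The same library provides Unigram and WordPiece model/trainer pairs, so the three open-vocabulary systems compared in the study can be swapped in with minimal changes.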
Pre-trained code language models have achieved promising performance in code generation and improved the programming efficiency of human developers. However, their self-refinement capability is typically overlooked by existing evaluations of code LMs, which focus only on the accuracy of one-time prediction. When code LMs fail to implement the correct program, developers find the faulty prediction hard to debug and fix, since it was not written by the developers themselves. Unfortunately, our study reveals that code LMs cannot efficiently self-refine their faulty generations either. In this paper, we propose the Cycle framework, which learns to self-refine a faulty generation according to the available feedback, such as the execution results reported by test suites. We evaluate Cycle on three popular code generation benchmarks: HumanEval, MBPP, and APPS. The results reveal that Cycle maintains, and sometimes improves, the quality of one-time code generation, while significantly improving the self-refinement capability of code LMs. We implement four variants of Cycle with 350M, 1B, 2B, and 3B parameters, and the experiments show that Cycle consistently boosts code generation performance, by up to 63.5%, across benchmarks and model sizes. We also observe that Cycle outperforms code LMs with 3x more parameters in self-refinement.
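The feedback loop the abstract describes reduces to a generate-test-refine cycle. The schematic below uses a hypothetical generate(prompt) callable standing in for the code LM and a generic pytest-style test command; none of these names are Cycle's actual API:

import os
import subprocess
import tempfile

def run_tests(code: str, test_cmd: list) -> str:
    # Run the test suite against a candidate program; return failure output, "" on pass.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(test_cmd + [path], capture_output=True, text=True)
    os.unlink(path)
    return "" if result.returncode == 0 else result.stdout + result.stderr

def self_refine(generate, problem: str, test_cmd: list, max_rounds: int = 3) -> str:
    # generate(prompt) is a stand-in for the code LM; hypothetical, not Cycle's API.
    code = generate(problem)
    for _ in range(max_rounds):
        feedback = run_tests(code, test_cmd)
        if not feedback:                      # all tests passed
            return code
        # Feed the faulty program and execution feedback back into the model.
        code = generate(f"{problem}\n# previous attempt:\n{code}"
                        f"\n# test failures:\n{feedback}")
    return code

Cycle's contribution is training the model so that the refinement step actually uses such feedback, rather than merely resampling.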
ISBN (print): 9798400700446
Studies have substantiated the efficacy of deep learning-based models in various source code modeling tasks. These models are usually trained on large datasets that are divided into smaller units, known as tokens, using either an open or a closed vocabulary system. The choice of tokenization method can have a profound impact on the number of tokens generated, which in turn can significantly influence model performance. This study investigates the effect of different tokenization methods on source code modeling and proposes an optimized tokenizer to enhance tokenization performance. The proposed tokenizer employs a hybrid approach that initializes a global vocabulary from the most frequent unigrams and incrementally builds an open-vocabulary system. It is evaluated against popular tokenization methods such as Closed, Unigram, WordPiece, and BPE tokenizers, as well as the tokenizers shipped with large pre-trained models such as Polycoder and codeGen. The results indicate that the choice of tokenization method can significantly affect the number of sub-tokens generated, which can ultimately influence modeling performance. Furthermore, our empirical evaluation demonstrates that the proposed tokenizer outperforms the other baselines, achieving better tokenization performance both in terms of a reduced number of sub-tokens and time cost. In conclusion, this study highlights the significance of the choice of tokenization method in source code modeling and the potential for improvement through optimized tokenization techniques.
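The hybrid idea, a frequent-unigram seed vocabulary with an open-vocabulary fallback, can be sketched as below. The character-level fallback and the "@@" continuation marker are illustrative assumptions standing in for the paper's incremental subword construction:

from collections import Counter

def build_hybrid_vocab(corpus_tokens, top_k=5000):
    # Seed a global vocabulary with the most frequent unigrams in the corpus.
    counts = Counter(corpus_tokens)
    return {tok for tok, _ in counts.most_common(top_k)}

def tokenize(token, vocab):
    # Emit known tokens whole; fall back to open-vocabulary pieces otherwise.
    if token in vocab:
        return [token]
    return [f"{ch}@@" for ch in token[:-1]] + [token[-1]]

vocab = build_hybrid_vocab(["public", "static", "void", "main", "public"], top_k=3)
print(tokenize("public", vocab))   # known unigram -> kept as one token
print(tokenize("fooBar", vocab))   # out-of-vocabulary -> split into sub-tokens

Keeping frequent tokens whole is what reduces the sub-token count relative to pure BPE-style splitting, which is the efficiency gain the evaluation measures.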