In recent years, deep learning models have shown great potential in source code modeling and analysis. Generally, deep learning-based approaches are problem-specific and data-hungry; a challenging issue is that they require training from scratch for each new related problem. In this work, we propose a transfer learning-based approach that significantly improves the performance of deep learning-based source code models. In contrast to traditional learning paradigms, transfer learning transfers the knowledge learned while solving one problem to another related problem. First, we present two recurrent neural network-based models, an RNN and a GRU, for the purpose of transfer learning in the domain of source code modeling. Next, via transfer learning, these pre-trained RNN and GRU models are used as feature extractors. The extracted features are then fed into an attention learner for different downstream tasks. The attention learner leverages the knowledge of the pre-trained models and fine-tunes it for a specific downstream task. We evaluate the proposed approach with extensive experiments on the source code suggestion task. The results indicate that it outperforms state-of-the-art models in terms of accuracy, precision, recall, and F-measure without training the models from scratch.
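To make that pipeline concrete, the following minimal PyTorch sketch shows a frozen pre-trained GRU reused as a feature extractor, with a small attention learner fine-tuned on top for a downstream classification task. All names, dimensions, and the single-layer attention are illustrative assumptions, not the authors' implementation:

import torch
import torch.nn as nn

class AttentionLearner(nn.Module):
    # Attention head fine-tuned on top of a frozen, pre-trained recurrent encoder.
    def __init__(self, pretrained_gru: nn.GRU, hidden_dim: int, num_classes: int):
        super().__init__()
        self.encoder = pretrained_gru              # assumed created with batch_first=True
        for p in self.encoder.parameters():        # freeze: use as feature extractor only
            p.requires_grad = False
        self.attn = nn.Linear(hidden_dim, 1)       # scores each time step
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, embedded_tokens):            # (batch, seq_len, emb_dim)
        features, _ = self.encoder(embedded_tokens)           # (batch, seq, hidden)
        weights = torch.softmax(self.attn(features), dim=1)   # attention over steps
        context = (weights * features).sum(dim=1)             # weighted summary vector
        return self.classifier(context)                       # downstream task logits

Only the attention and classifier parameters receive gradients, which is what allows the approach to skip training from scratch.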
Deep Learning (DL) techniques for Natural Language Processing have been evolving remarkably fast. Recently, DL advances in language modeling, machine translation, and paragraph understanding have become so prominent that the potential of DL in Software Engineering cannot be overlooked, especially in the field of program learning. To facilitate further research and applications of DL in this field, we provide a comprehensive review that categorizes and investigates existing DL methods for source code modeling and generation. To address the limitations of traditional source code models, we formulate common program learning tasks under an encoder-decoder framework. After that, we introduce recent DL mechanisms suited to solving such problems. Finally, we present state-of-the-art practices and discuss their challenges, along with recommendations for practitioners and researchers.
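As an illustration of that encoder-decoder framing (a generic sketch, not tied to any one surveyed method), a program learning task such as code summarization reduces to mapping one token sequence to another; the vocabulary sizes and LSTM choice here are assumptions:

import torch.nn as nn

class Seq2Seq(nn.Module):
    # Generic encoder-decoder: maps a source token sequence to a target sequence.
    def __init__(self, src_vocab: int, tgt_vocab: int, dim: int = 256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src_tokens, tgt_tokens):
        _, state = self.encoder(self.src_emb(src_tokens))        # encode source code
        dec_out, _ = self.decoder(self.tgt_emb(tgt_tokens), state)
        return self.out(dec_out)   # next-token logits over the target vocabulary

Swapping the recurrent encoder and decoder for attention-based ones yields the Transformer-style variants discussed in the review.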
Studies have confirmed the robust performance of machine learning classifiers on various source code modeling tasks. In general, machine learning approaches handle imbalanced datasets poorly, since they are sensitive to the distribution of classes and may lean towards the classes with a large share of observations. In this work, we investigate the impact of balanced and imbalanced learning on the source code suggestion task, otherwise known as code completion, which covers a large number of imbalanced classes. We further explore the impact of vocabulary size on modeling performance. First, we provide the essentials to formulate source code suggestion as a classification task and measure the degree of class imbalance. Second, we train the four most widely adopted neural language models as baselines to assess modeling performance. Third, we apply two class balancing techniques, TomekLinks and AllKNN, to balance the datasets and evaluate their impact on modeling performance. Finally, we train these models with a weighted imbalanced learning approach and compare the results with the balanced learning approaches. Additionally, we train models with varying vocabulary sizes to study their impact. In total, we trained 230 models on 10 real-world software projects and evaluated them extensively with widely used performance metrics such as precision, recall, F-score, mean reciprocal rank (MRR), and receiver operating characteristic (ROC). We also employed ANOVA to study the statistical significance of the differences between these approaches. This study demonstrates that modeling performance decreases under balanced training, whereas weighted imbalanced training produces comparable results and is more efficient in terms of time cost. It also shows that a large vocabulary does not necessarily improve modeling performance.
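The two balancing steps and the weighted alternative map directly onto standard library calls. A minimal sketch using imbalanced-learn and scikit-learn, with synthetic data standing in for the real token-level features:

import numpy as np
from imblearn.under_sampling import TomekLinks, AllKNN
from sklearn.utils.class_weight import compute_class_weight

X = np.random.rand(1000, 32)                          # placeholder feature vectors
y = np.random.choice(3, 1000, p=[0.8, 0.15, 0.05])    # deliberately imbalanced classes

# Option 1: under-sample the majority classes before training.
X_tl, y_tl = TomekLinks().fit_resample(X, y)
X_knn, y_knn = AllKNN().fit_resample(X, y)

# Option 2: keep the data as-is and weight the loss per class instead.
weights = compute_class_weight("balanced", classes=np.unique(y), y=y)
# e.g. pass `weights` to torch.nn.CrossEntropyLoss(weight=...) during training

Option 2 avoids discarding any samples, which is consistent with the paper's finding that weighted training is cheaper in time cost.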
Attention-based transformer language models have shown significant performance gains on various natural language tasks. In this work, we explore the impact of transformer language models on the task of source code suggestion. The core intention of this work is to boost modeling performance for the source code suggestion task and to explore how training procedures and model architectures affect it. Additionally, we propose a transformer-based self-supervised learning technique called Transformer Gated Highway that outperforms recurrent and transformer language models of comparable size. The proposed approach combines the Transformer language model with a Gated Highway, introducing a notion of recurrence. We compare the performance of the proposed approach with the transformer-based BERT (codeTran), RoBERTa (RoBERTacode), GPT2 (TravTrans), and codeGen models, and with the recurrent LSTM-based codeLSTM model. Moreover, we experiment with various architectural settings for the transformer models to evaluate their impact on modeling performance. An extensive evaluation shows that the presented approach performs better on two programming language datasets: Java and C#. We have also adapted the presented approach to the syntax error correction task, predicting the correct syntax token to demonstrate its possible implications for other source code modeling tasks.
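The abstract does not spell out the architecture, but one plausible reading of combining a Transformer with a gated highway is a standard highway gate wrapped around a transformer encoder layer. The sketch below is that generic construction only, explicitly not the authors' Transformer Gated Highway:

import torch
import torch.nn as nn

class GatedHighwayBlock(nn.Module):
    # Highway gating around a transformer layer: blend transformed and carried input.
    # Illustrative assumption only; the paper's architecture may differ.
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                batch_first=True)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (batch, seq, dim)
        h = self.layer(x)                       # transformed representation
        g = torch.sigmoid(self.gate(x))         # per-dimension carry gate
        return g * h + (1.0 - g) * x            # gated mix of transform and carry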
The Transformer architecture and transfer learning have marked a quantum leap in natural language processing, improving the state of the art across a range of text-based tasks. This paper examines how these advancements can be applied to, and improve, code search. To this end, we pre-train a BERT-based model on combinations of natural language and source code data and fine-tune it on pairs of StackOverflow question titles and code answers. Our results show that the pre-trained models consistently outperform the models that were not pre-trained. When the model is pre-trained on both natural language and source code data, it also outperforms an information retrieval baseline based on Lucene. We further demonstrate that an information retrieval-based approach followed by a Transformer yields the best results overall, especially when searching in a large search pool. Transfer learning is particularly effective when much pre-training data is available and fine-tuning data is limited. We demonstrate that natural language processing models based on the Transformer architecture can be directly applied to source code analysis tasks such as code search. With the development of Transformer models designed more specifically for source code data, we believe the results of source code analysis tasks can be further improved.
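A minimal sketch of scoring code snippets against a natural language query with a BERT-style model, using the Hugging Face transformers library. The checkpoint name, mean pooling, and cosine ranking are all assumptions for illustration; the paper's fine-tuned model and scoring head may differ:

import torch
from transformers import AutoModel, AutoTokenizer

# Any BERT-style checkpoint can stand in here; this one is just an assumption.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(text: str) -> torch.Tensor:
    batch = tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**batch).last_hidden_state
    return out.mean(dim=1).squeeze(0)          # mean-pooled sentence vector

query = embed("how to reverse a list in python")
snippet = embed("def rev(xs): return xs[::-1]")
score = torch.cosine_similarity(query, snippet, dim=0)  # rank candidates by this

The paper's strongest configuration would correspond to first narrowing the pool with Lucene-style retrieval and only then re-ranking the survivors with the Transformer.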
Integrated Development Environments (IDEs) are pivotal in enhancing productivity in modern software development, with features such as code completion. Recent advancements in Natural Language Processing (NLP) have empowered neural language models for code completion. In this study, we present an extensive investigation of the impact of open and closed vocabulary systems on the code completion task. Specifically, we compare open and closed vocabulary systems at various vocabulary sizes to observe their effect on code completion performance. We experiment with three open vocabulary systems, byte pair encoding (BPE), WordPiece, and Unigram, and compare them against closed-vocabulary systems to analyze their modeling performance. We also conduct experiments with different context sizes to study their impact on code completion performance. Our experiments cover several prominent language models: one recurrent neural network and five transformers. Our results indicate that vocabulary size significantly impacts modeling performance and can artificially boost the accuracy of code completion models, especially with a closed-vocabulary system. Moreover, we find that different vocabulary systems have varying impacts on token coverage, with open-vocabulary systems exhibiting better coverage. Our findings offer valuable insights for building effective code completion models, aiding researchers and practitioners in this field.
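Training an open-vocabulary tokenizer of the kind compared here takes only a few lines with the Hugging Face tokenizers library. The corpus path and vocabulary size below are placeholder assumptions:

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Train an open-vocabulary BPE tokenizer on a source code corpus.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=10_000,          # the key knob under study
                              special_tokens=["<unk>"])
tokenizer.train(files=["corpus.java"], trainer=trainer)   # placeholder corpus path

print(tokenizer.encode("public static void main").tokens)

The same library provides Unigram and WordPiece model/trainer pairs, so the three open-vocabulary systems compared in the study can be swapped in with minimal changes.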
Pre-trained code language models have achieved promising performance in code generation and improved the programming efficiency of human developers. However, their self-refinement capability is typically overlooked by existing evaluations of code LMs, which focus only on the accuracy of one-time prediction. When code LMs fail to implement the correct program, developers find the faulty prediction hard to debug and fix, since it was not written by the developers themselves. Unfortunately, our study reveals that code LMs cannot efficiently self-refine their faulty generations either. In this paper, we propose the Cycle framework, which learns to self-refine a faulty generation according to the available feedback, such as the execution results reported by test suites. We evaluate Cycle on three popular code generation benchmarks: HumanEval, MBPP, and APPS. The results reveal that Cycle maintains, and sometimes improves, the quality of one-time code generation, while significantly improving the self-refinement capability of code LMs. We implement four variants of Cycle with 350M, 1B, 2B, and 3B parameters, and the experiments show that Cycle consistently boosts code generation performance, by up to 63.5%, across benchmarks and model sizes. We also observe that Cycle outperforms code LMs with 3x more parameters in self-refinement.
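The feedback loop the abstract describes reduces to a generate-test-refine cycle. The schematic below uses a hypothetical generate(prompt) callable standing in for the code LM and a generic pytest-style test command; none of these names are Cycle's actual API:

import os
import subprocess
import tempfile

def run_tests(code: str, test_cmd: list) -> str:
    # Run the test suite against a candidate program; return failure output, "" on pass.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(test_cmd + [path], capture_output=True, text=True)
    os.unlink(path)
    return "" if result.returncode == 0 else result.stdout + result.stderr

def self_refine(generate, problem: str, test_cmd: list, max_rounds: int = 3) -> str:
    # generate(prompt) is a stand-in for the code LM; hypothetical, not Cycle's API.
    code = generate(problem)
    for _ in range(max_rounds):
        feedback = run_tests(code, test_cmd)
        if not feedback:                      # all tests passed
            return code
        # Feed the faulty program and execution feedback back into the model.
        code = generate(f"{problem}\n# previous attempt:\n{code}"
                        f"\n# test failures:\n{feedback}")
    return code

Cycle's contribution is training the model so that the refinement step actually uses such feedback, rather than merely resampling.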
ISBN (print): 9798400700446
Studies have substantiated the efficacy of deep learning-based models in various source code modeling tasks. These models are usually trained on large datasets that are divided into smaller units, known as tokens, using either an open or a closed vocabulary system. The choice of tokenization method can have a profound impact on the number of tokens generated, which in turn can significantly influence model performance. This study investigates the effect of different tokenization methods on source code modeling and proposes an optimized tokenizer to enhance tokenization performance. The proposed tokenizer employs a hybrid approach that initializes a global vocabulary from the most frequent unigrams and incrementally builds an open-vocabulary system. It is evaluated against popular tokenization methods such as Closed, Unigram, WordPiece, and BPE tokenizers, as well as the tokenizers shipped with large pre-trained models such as Polycoder and codeGen. The results indicate that the choice of tokenization method can significantly affect the number of sub-tokens generated, which can ultimately influence modeling performance. Furthermore, our empirical evaluation demonstrates that the proposed tokenizer outperforms the other baselines, achieving better tokenization performance both in terms of a reduced number of sub-tokens and time cost. In conclusion, this study highlights the significance of the choice of tokenization method in source code modeling and the potential for improvement through optimized tokenization techniques.
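The hybrid idea, a frequent-unigram seed vocabulary with an open-vocabulary fallback, can be sketched as below. The character-level fallback and the "@@" continuation marker are illustrative assumptions standing in for the paper's incremental subword construction:

from collections import Counter

def build_hybrid_vocab(corpus_tokens, top_k=5000):
    # Seed a global vocabulary with the most frequent unigrams in the corpus.
    counts = Counter(corpus_tokens)
    return {tok for tok, _ in counts.most_common(top_k)}

def tokenize(token, vocab):
    # Emit known tokens whole; fall back to open-vocabulary pieces otherwise.
    if token in vocab:
        return [token]
    return [f"{ch}@@" for ch in token[:-1]] + [token[-1]]

vocab = build_hybrid_vocab(["public", "static", "void", "main", "public"], top_k=3)
print(tokenize("public", vocab))   # known unigram -> kept as one token
print(tokenize("fooBar", vocab))   # out-of-vocabulary -> split into sub-tokens

Keeping frequent tokens whole is what reduces the sub-token count relative to pure BPE-style splitting, which is the efficiency gain the evaluation measures.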