ISBN (Print): 9783031368882; 9783031368899
Programming language processing (PLP) using machine learning has made vast improvements in the past few years, and a growing number of people are interested in exploring this promising field. However, it is challenging for new researchers and developers to find the right components to construct their own machine learning pipelines, given the diverse PLP tasks to be solved, the large number of datasets and models being released, and the set of complex compilers and tools involved. To improve the findability, accessibility, interoperability, and reusability (FAIRness) of machine learning components, we collect and analyze a set of representative papers in the domain of machine learning-based PLP. We then identify and characterize key concepts, including PLP tasks, model architectures, and supportive tools. Finally, we show example use cases of leveraging the reusable components to construct machine learning pipelines that solve a set of PLP tasks.
ISBN (Print): 9783031407437; 9783031407444
In recent years, language models (LMs) such as GPT-4 have been widely used in multiple domains, including natural language processing, visualization, and so on. However, applying them to analyzing and optimizing high-performance computing (HPC) software is still challenging due to the lack of HPC-specific support. In this paper, we design the LM4HPC framework to facilitate the research and development of HPC software analyses and optimizations using LMs. Tailored for supporting HPC datasets, AI models, and pipelines, our framework is built on top of a range of components from different levels of the machine learning software stack, with Hugging Face-compatible APIs. Using three representative tasks, we evaluated the prototype of our framework. The results show that LM4HPC can help users quickly evaluate a set of state-of-the-art models and generate insightful leaderboards.
ISBN (Print): 9798400703270
Developers spend a significant amount of time editing code for a variety of reasons, such as fixing bugs or adding new features. Designing effective methods to predict code edits has been an active yet challenging area of research due to the diversity of code edits and the difficulty of capturing developer intent. In this work, we address these challenges by endowing pre-trained large language models (LLMs) with knowledge of relevant prior associated edits, a method we call GRACE (Generation conditioned on Associated Code Edits). The generative capability of the LLMs helps address the diversity of code changes, and conditioning code generation on prior edits helps capture the latent developer intent. We evaluate two well-known LLMs, CODEX and CODET5, in zero-shot and fine-tuning settings, respectively. In our experiments with two datasets, GRACE boosts the performance of the LLMs significantly, enabling them to generate 29% and 54% more correctly edited code in top-1 suggestions relative to the current state-of-the-art symbolic and neural approaches, respectively.
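The conditioning step described above can be illustrated with a simple prompt-construction sketch; the function name and the diff-style format below are illustrative assumptions, not the paper's exact implementation.

```python
def build_edit_prompt(prior_edits, code_before):
    """Prepend associated prior edits (before/after pairs) to the code
    to be edited, so a generative LLM can pick up the latent intent."""
    parts = []
    for i, (before, after) in enumerate(prior_edits, 1):
        parts.append(f"# Associated edit {i}\n- {before}\n+ {after}")
    parts.append("# Code to edit\n" + code_before)
    return "\n\n".join(parts)

# Hypothetical example: a prior edit that added an encoding argument
prompt = build_edit_prompt(
    [("open(path)", "open(path, encoding='utf-8')")],
    "read_config(cfg_path)",
)
```

The resulting string would then be fed to the LLM as the generation context, with the model asked to produce the edited version of the final snippet.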
ISBN (Print): 9789869721493
Task constraint feedback is the collective name for any kind of feedback system that checks whether problem-defined constraints were fulfilled by students upon submission of work. This can be as simple as checking whether certain programming constructs exist, or whether a specific algorithm or data structure required by the problem is used. Most of these systems use static analysis (Fischer, 2006; Gotel, 2008) or natural language processing techniques (Lane, 2005) to generate feedback. A transformer is a neural network for processing sequences, such as natural language. Previous work has shown that transformers can be generalized to programming language tasks such as code summarization. In this study, we used the CodeBERT transformer to classify, or tag, algorithms implemented in code snippets to check constraint satisfaction. Using a custom dataset containing source code that implements algorithms, we show that CodeBERT is capable of learning the structure of how code is implemented regardless of how a programmer names identifiers. Averaging each label's F1-score, the model obtained an average of 0.85, a promising result on this dataset.
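The reported score is an unweighted (macro) average of per-label F1; a minimal sketch of that computation follows, where the label names and precision/recall values are invented for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall for a single label."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f1(per_label):
    """Unweighted mean of per-label F1 scores (macro-averaging)."""
    scores = [f1_score(p, r) for p, r in per_label.values()]
    return sum(scores) / len(scores)

# Invented example: two algorithm labels with different precision/recall
example = {"bubble_sort": (0.9, 0.8), "binary_search": (0.85, 0.9)}
avg = macro_f1(example)
```

Macro-averaging weights every label equally, so rare algorithm labels count as much as common ones in the reported 0.85.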
ISBN (Print): 9781665450997
Pre-trained models for natural language (NL) have recently been shown to transfer well to programming languages (PL) and to largely benefit different intelligent code-related tasks, such as code search, clone detection, program translation, and code document generation. However, existing pre-training methods for programming languages mainly rely on masked language modeling and next sentence prediction at the token or graph level. This restricted form limits their performance and transferability, since PL and NL have different syntax rules and the downstream tasks require a multimodal representation. Here we introduce C3P, a Contrastive Code-Comment Pre-training approach, to solve various downstream tasks by pre-training multi-representation features on both programming and natural syntax. The model encodes the code syntax and the natural language description (comment) with two encoders, and the encoded embeddings are projected into a multi-modal space for learning the latent representation. In the latent space, C3P jointly trains the code and comment encoders with a symmetric loss function, which maximizes the cosine similarity of correct code-comment pairs while minimizing the similarity of unrelated pairs. We verify the empirical performance of the proposed pre-trained models on multiple downstream code-related tasks. Comprehensive experiments demonstrate that C3P outperforms previous work on the understanding tasks of code search and code clone detection, as well as the generation tasks of program translation and document generation. Furthermore, we validate the transferability of C3P to new programming languages not seen in the pre-training stage. The results show our model surpasses all supervised methods and, for some programming languages, even outperforms prior pre-trained approaches. Code is available at https://***/TerryPei/C3P.
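The symmetric loss described here is in the spirit of CLIP-style contrastive training; a minimal NumPy sketch is shown below, where the temperature value and batch shapes are assumptions rather than the paper's hyperparameters.

```python
import numpy as np

def symmetric_contrastive_loss(code_emb, comment_emb, temperature=0.07):
    """CLIP-style symmetric loss: maximize cosine similarity of matched
    code-comment pairs (the diagonal) and minimize it for mismatches."""
    c = code_emb / np.linalg.norm(code_emb, axis=1, keepdims=True)
    t = comment_emb / np.linalg.norm(comment_emb, axis=1, keepdims=True)
    logits = c @ t.T / temperature            # (N, N) cosine similarities
    n = logits.shape[0]

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # average the code->comment and comment->code directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Because the loss is computed in both directions, each encoder is pushed toward a shared multi-modal space in which the matching pair is the nearest neighbor of both the code and its comment.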
The way software developers edit code day-to-day tends to be repetitive, often reusing existing code elements. Many researchers have tried to automate the repetitive code-editing process by mining specific change templates. However, such templates are often manually implemented for automated application; consequently, template-based automated code editing is very tedious to implement. In addition, template-based code editing is often narrowly scoped and poorly noise-tolerant. Machine learning, especially deep learning-based techniques, could help us solve these problems because of their generalization and noise-tolerance capacities. The advancement of deep neural networks and the availability of vast open-source evolutionary data open up the possibility of automatically learning those templates from the wild and applying them in the appropriate context. However, deep neural network-based modeling of code changes, and of code in general, introduces specific problems that need attention from the research community. For instance, source code exhibits strictly defined syntax and semantics inherited from the properties of the programming language (PL). In addition, the source code vocabulary (the possible number of tokens) can be arbitrarily large. This dissertation formulates the problem of automated code editing as a multi-modal translation problem where, given a piece of code, the context, and some guidance, the objective is to generate the edited code. In particular, we divide the problem into two sub-problems: source code understanding and generation. We empirically show that deep neural networks (models in general) for these problems should be aware of PL properties (i.e., syntax and semantics). This dissertation investigates two primary directions for endowing models with knowledge of PL properties: (i) explicit encoding, where we design models catering to a specific property, and (ii) implicit encoding, where we train a very large model to learn these properties implicitly.
ISBN (Print): 9781728165790
Release notes (RNs) are among the important artifacts in software development and maintenance, as they are required whenever a new release of a software product is deployed. They contain all the changes made in the new release of a project, i.e., descriptions of new features, improvements, bug fixes, deprecated features, etc. Generating these notes manually is a complex and time-consuming task. In this paper, we present an approach for generating RNs automatically. We implemented the approach in Python and generated these notes for *** projects. Our system extracts changes from the Git repository, summarizes changes, identifies deprecated features and library changes, fetches issues from the issue tracker, and links those issues to code. It then organizes these changes hierarchically and produces a document as output. We evaluated our results manually with 14 industry developers; the results show that the generated RNs are of good quality and more accurate than those produced manually.
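The hierarchical grouping step can be sketched as bucketing commit messages into the release-note sections named above; the keyword rules and function name here are invented for illustration, not the paper's implementation.

```python
def categorize_commits(messages):
    """Group commit messages into release-note sections by keyword."""
    sections = {"New Features": [], "Improvements": [],
                "Bug Fixes": [], "Deprecated Features": [], "Other": []}
    for msg in messages:
        low = msg.lower()
        if "deprecat" in low:
            sections["Deprecated Features"].append(msg)
        elif "fix" in low or "bug" in low:
            sections["Bug Fixes"].append(msg)
        elif low.startswith(("add", "feat", "introduce")):
            sections["New Features"].append(msg)
        elif "improve" in low or "optimize" in low:
            sections["Improvements"].append(msg)
        else:
            sections["Other"].append(msg)
    return sections

# Invented commit messages illustrating the bucketing
notes = categorize_commits([
    "Add OAuth2 login support",
    "Fix null pointer in parser",
    "Deprecate legacy config format",
])
```

A real pipeline would draw the messages from `git log` and additionally link each entry to issues from the tracker before rendering the sectioned document.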
ISBN (Print): 9781538682463
In this paper, a generic static end-to-end detection framework with a deep neural network for WebShell is designed, which is free from human labor and domain knowledge. We simultaneously introduce word embedding from natural language processing (NLP) and lexical analysis from programming language processing (PLP) to obtain an accurate, structured, semantically rich vector representation of the script code. To this end, a series of effective tricks are designed to dig out the high-value information in the script while filtering noise. Then, we provide a down-sampling algorithm that drastically reduces the computational cost at a relatively small information loss. Finally, we achieve high detection accuracy by employing a deep neural network (DNN) composed of LSTM and pooling layers. The framework shows a significant advantage, at least on the dataset of our experiment.
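The down-sampling idea can be illustrated with a generic max-pooling sketch over a token-embedding sequence; this is a standard stand-in technique, not the paper's specific algorithm.

```python
import numpy as np

def max_pool_downsample(seq, window=4):
    """Shrink a (T, D) embedding sequence by taking the element-wise
    max over non-overlapping windows of `window` time steps, cutting
    downstream cost roughly window-fold while keeping the strongest
    activations."""
    t, d = seq.shape
    pad = (-t) % window
    if pad:  # pad with -inf so padding rows never win the max
        seq = np.vstack([seq, np.full((pad, d), -np.inf)])
    return seq.reshape(-1, window, d).max(axis=1)
```

Pooling like this trades positional precision for a much shorter sequence, which is what makes long script files tractable for an LSTM-based classifier.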