Details
ISBN: (digital) 9781665402620
ISBN: (print) 9781665402620
Repairing software bugs automatically is a long-standing goal of researchers. Some of the latest automated program repair (APR) tools leverage natural language processing (NLP) techniques to repair software bugs. However, natural languages (NL) and programming languages (PL) differ significantly, so these tools may not handle PL tasks well. Moreover, because the vulnerability repair task differs from the bug repair task, the performance of these tools on vulnerability repair is not yet known. To address these issues, we apply large-scale pre-trained PL models (CodeBERT and GraphCodeBERT) to the vulnerability repair task, building on the characteristics of PL, and explore the real-world performance of state-of-the-art data-driven approaches to vulnerability repair. The results show that pre-trained PL models can better capture and process PL features and can accomplish multi-line vulnerability repair. Specifically, our solution achieves strong results (single-line repair accuracy 95.47%, multi-line repair accuracy 90.06%), outperforming the state-of-the-art data-driven approaches and demonstrating that adding rich data-dependent features can help solve more complex code repair problems. Finally, we discuss prior work and our approach, pointing out shortcomings and directions for future work.
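The single-line and multi-line accuracy figures quoted above are typically computed with exact-match scoring, with predictions bucketed by how many lines the reference fix spans. A minimal sketch of that metric, using hypothetical toy data rather than the paper's dataset:

```python
# Illustrative only: exact-match repair accuracy split into
# single-line and multi-line buckets. The sample fixes below are
# invented toy data, not taken from the evaluated benchmark.

def repair_accuracy(predictions, references):
    """Return exact-match accuracy per bucket ('single' vs. 'multi'),
    where the bucket is decided by the reference fix's line count."""
    buckets = {"single": [0, 0], "multi": [0, 0]}  # [correct, total]
    for pred, ref in zip(predictions, references):
        kind = "single" if len(ref.strip().splitlines()) == 1 else "multi"
        buckets[kind][1] += 1
        if pred.strip() == ref.strip():
            buckets[kind][0] += 1
    return {k: (c / t if t else 0.0) for k, (c, t) in buckets.items()}

refs = ["x = len(buf) - 1;",
        "if (p) {\n    free(p);\n    p = NULL;\n}"]
preds = ["x = len(buf) - 1;",           # correct single-line fix
         "if (p) { free(p); }"]         # wrong multi-line fix
print(repair_accuracy(preds, refs))     # {'single': 1.0, 'multi': 0.0}
```

Exact match is a strict criterion; semantically equivalent but textually different patches count as failures, which is one reason multi-line accuracy tends to trail single-line accuracy.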
Details
ISBN: (print) 9798350329964
Decompilation is a process widely used by reverse engineers to enhance code readability by lifting assembly code to a higher-level, C-like language: pseudo-code. Nevertheless, compilation and stripping irreversibly discard high-level semantic information that is crucial to code comprehension, such as comments, identifier names, and types. Existing approaches typically recover only one type of information, making them suboptimal for semantic inference. In this paper, we treat pseudo-code as a special programming language and present a unified pre-trained model, HexT5, trained on vast amounts of natural language comments, source identifiers, and pseudo-code using novel pseudo-code-based pre-training objectives. We fine-tune HexT5 on various downstream tasks, including code summarization, variable name recovery, function name recovery, and similarity detection. Comprehensive experiments show that HexT5 achieves state-of-the-art performance on all four downstream tasks, demonstrating its robust effectiveness and generalizability for binary-related tasks.
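One of the downstream tasks above, variable name recovery, starts from pseudo-code in which the decompiler has already replaced original identifiers with generic names (e.g. `v1`, `a2`). A minimal sketch of the usual preprocessing step for such a task (our own illustration, not HexT5's actual pipeline) is to normalize those generic names into numbered placeholders that a model can learn to rename from context:

```python
# Sketch: mask decompiler-generated identifiers (v1, a2, ...) with
# numbered placeholders, the typical input format for a
# variable-name-recovery model. Illustration only; the placeholder
# scheme and regex are our assumptions, not HexT5's.
import re

def mask_decompiler_names(pseudo_code):
    """Replace decompiler-style identifiers with <VARn> placeholders;
    return the masked code and the name -> placeholder mapping."""
    mapping = {}
    def repl(match):
        name = match.group(0)
        if name not in mapping:
            mapping[name] = "<VAR{}>".format(len(mapping))
        return mapping[name]
    # v1, v2, ... are locals; a1, a2, ... are arguments in common
    # decompiler output (e.g. Hex-Rays naming conventions).
    masked = re.sub(r"\b[va]\d+\b", repl, pseudo_code)
    return masked, mapping

masked, mapping = mask_decompiler_names("v1 = a1 + a2; return v1;")
print(masked)   # <VAR0> = <VAR1> + <VAR2>; return <VAR0>;
```

The model then predicts a meaningful name (e.g. `sum`, `count`) for each placeholder, and the mapping lets the prediction be written back into the pseudo-code.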
Details
ISBN: (print) 9798400710117
As deep learning progresses, programming language generation models such as CodeLlama, GitHub Copilot, and ChatGPT have been widely applied to intelligent code development. However, this also reduces the cost of code plagiarism, posing challenges to copyright and academic integrity. In response to the specific needs of human-machine code detection, this paper introduces CodeWMBench, a comprehensive automated benchmark for actively detecting machine-generated code through watermarking. Through a meticulous evaluation of eight code watermarking methods, we demonstrate their performance in terms of harmlessness, robustness, and transparency. Specifically, we introduce, for the first time, watermark removal techniques based on large language models and conduct the first assessment of these watermarking methods against code rewriting and retranslation attacks. In the discussion, we delve into the critical issues currently facing code watermarking, including why existing code watermarking methods struggle to resist removal by large language models, and potential future methods that could withstand such removal.
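The fragility discussed above is easy to see with a toy watermark. The sketch below (our own illustration, not one of the eight benchmarked methods) embeds bits in formatting choices that preserve program semantics; a rewriting pass, such as an LLM reformatting the code, normalizes the formatting and erases the mark:

```python
# Toy whitespace watermark: each bit is encoded as the choice between
# one and two spaces before '='. Semantically harmless and transparent,
# but destroyed by any rewrite that normalizes whitespace, which is
# exactly the weakness against LLM-based removal discussed above.

def embed(code_lines, bits):
    out = []
    for line, bit in zip(code_lines, bits):
        # bit 1 -> two spaces before '=', bit 0 -> leave one space
        out.append(line.replace(" = ", "  = " if bit else " = ", 1))
    return out

def extract(code_lines):
    return [1 if "  = " in line else 0 for line in code_lines]

src = ["x = 1", "y = 2", "z = 3"]
marked = embed(src, [1, 0, 1])
print(extract(marked))  # recovers [1, 0, 1]

# Simulate a rewriting attack: collapse double spaces.
rewritten = [line.replace("  ", " ") for line in marked]
print(extract(rewritten))  # watermark gone: [0, 0, 0]
```

Robust schemes therefore try to anchor bits in properties a rewrite is likely to preserve (e.g. algorithmic structure) rather than in surface formatting, which is the open problem the discussion points to.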
Automatic program repair plays a crucial role in software development and implementation. While deep learning-based approaches have made significant progress, one inherent challenge is the inefficiency in code rep...