Search Results - Inner Mongolia University Library

Investigating the Transferability of Code Repair for Low-Resource Programming Languages

arXiv, 2024

Authors: Wong, Kyle; Amayuelas, Alfonso; Pan, Liangming; Wang, William Yang (University of California Santa Barbara, United States; University of Arizona, United States)

Large language models (LLMs) have shown remarkable performance on code generation tasks. A recent use case is iterative code repair, where an LLM fixes an incorrect program by rationalizing about errors and generating new code. Recent works augment the code repair process by integrating modern techniques such as chain-of-thought reasoning or distillation, but only study their benefits on high-resource languages like Python, and ignore low-resource languages like Perl. To address this gap of knowledge, we investigate the benefits of distilling code repair for both high and low resource languages to determine if the techniques that are effective in a high resource setting are also applicable in a low resource setting. Our evaluation shows that distilling the ability to repair code has language dependent benefits. To explain this behavior, we perform a further analysis and find that contrary to preexisting beliefs, the correlation between reasoning ability and code correction ability is weak. We hypothesize this weak correlation is magnified in low-resource settings where base models lack deep knowledge of a programming language, leading to wavering benefits of code repair. © 2024, CC BY.

Keywords: coding errors

Coding for Strand Breaks in Composite DNA

arXiv, 2025

Authors: Walter, Frederik; Yehezkeally, Yonatan (Institute for Communications Engineering, Technical University of Munich, Munich 80333, Germany; School of Computing, Newcastle University, Newcastle upon Tyne NE4 5TG, United Kingdom)

Even though DNA can be considered a very stable long-term storage medium, errors must be expected during storage. Experiments show that strand breaks are the most common error type caused by storage. We address the problem of correcting strand breaks in DNA sequences resulting from composite DNA synthesis. We introduce a novel channel model with realistic assumptions about the errors resulting from long-term storage. Our proposed coding scheme employs marker codes to correct single breaks. For this purpose, we generalize run-length-limited codes for the composite setting and derive bounds on the code size. © 2025, CC0.

Keywords: coding errors
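
The abstract above mentions run-length-limited (RLL) codes. As a rough illustration only, the following sketch checks the classical, non-composite RLL constraint on an ordinary DNA string, namely that no symbol repeats more than a given number of times in a row. The function names and the `limit` parameter are hypothetical, and the generalization to composite DNA described in the paper is not reproduced here.

```python
from itertools import groupby


def max_run_length(seq):
    """Return the length of the longest run of identical symbols in seq."""
    # groupby yields one group per maximal run of equal consecutive symbols.
    return max((sum(1 for _ in run) for _, run in groupby(seq)), default=0)


def satisfies_rll(seq, limit):
    """Check the classical RLL constraint: every run has length <= limit.

    `limit` is an illustrative parameter, not a value taken from the paper.
    """
    return max_run_length(seq) <= limit
```

A sequence such as "AACCCGT" has a longest run of three (the "CCC" block), so it would satisfy a limit of 3 but not a limit of 2.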

Show Me Your Code! Kill Code Poisoning: A Lightweight Method Based on Code Naturalness

arXiv, 2025

Authors: Sun, Weisong; Chen, Yuchen; Yuan, Mengzhe; Fang, Chunrong; Chen, Zhenpeng; Wang, Chong; Liu, Yang; Xu, Baowen; Chen, Zhenyu (College of Computing and Data Science, Nanyang Technological University, Singapore; State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China)

Neural code models (NCMs) have demonstrated extraordinary capabilities in code intelligence tasks. Meanwhile, the security of NCMs and NCMs-based systems has garnered increasing attention. In particular, NCMs are often trained on large-scale data from potentially untrustworthy sources, providing attackers with the opportunity to manipulate them by inserting crafted samples into the data. This type of attack is called a code poisoning attack (also known as a backdoor attack). It allows attackers to implant backdoors in NCMs and thus control model behavior, which poses a significant security threat. However, there is still a lack of effective techniques for detecting various complex code poisoning attacks. In this paper, we propose an innovative and lightweight technique for code poisoning detection named KILLBADCODE. KILLBADCODE is designed based on our insight that code poisoning disrupts the naturalness of code. Specifically, KILLBADCODE first builds a code language model (CodeLM) on a lightweight n-gram language model. Then, given poisoned data, KILLBADCODE utilizes CodeLM to identify those tokens in (poisoned) code snippets that will make the code snippets more natural after being deleted as trigger tokens. Considering that the removal of some normal tokens in a single sample might also enhance code naturalness, leading to a high false positive rate (FPR), we aggregate the cumulative improvement of each token across all samples. Finally, KILLBADCODE purifies the poisoned data by removing all poisoned samples containing the identified trigger tokens. We conduct extensive experiments to evaluate the effectiveness and efficiency of KILLBADCODE, involving two types of advanced code poisoning attacks (a total of five poisoning strategies) and datasets from four representative code intelligence tasks. The experimental results demonstrate that across 20 code poisoning detection scenarios, KILLBADCODE achieves an average FPR of 8.30% and an average Recall of 100%, signif

Keywords: coding errors
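
The naturalness idea in the abstract above (score code with an n-gram language model, delete each token in turn, and aggregate the improvement per token across samples) can be sketched as follows. This is a minimal illustration under stated assumptions, not the KILLBADCODE implementation: it assumes a bigram model with add-one smoothing trained on pre-tokenized clean snippets, and every function name here is hypothetical.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Count unigrams and bigrams over tokenized snippets (lists of tokens)."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in corpus:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def cross_entropy(tokens, unigrams, bigrams):
    """Average negative log2-probability per transition (lower = more natural)."""
    padded = ["<s>"] + tokens + ["</s>"]
    vocab = len(unigrams) + 1  # one extra slot for unseen tokens (assumption)
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        # Add-one smoothing keeps unseen bigrams at a non-zero probability.
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab)
        total -= math.log2(p)
    return total / (len(padded) - 1)


def aggregate_gains(samples, unigrams, bigrams):
    """Sum, per token, the entropy reduction obtained by deleting that token."""
    gains = Counter()
    for tokens in samples:
        base = cross_entropy(tokens, unigrams, bigrams)
        for i, tok in enumerate(tokens):
            reduced = tokens[:i] + tokens[i + 1:]
            gain = base - cross_entropy(reduced, unigrams, bigrams)
            if gain > 0:  # only deletions that make the snippet more natural
                gains[tok] += gain
    return gains


# Toy data: "rb" plays the role of an injected trigger token.
clean = [["return", "x"], ["return", "y"], ["return", "x"], ["return", "y"]]
suspect = [["return", "rb", "x"], ["return", "rb", "y"]]
uni, bi = train_bigram(clean)
gains = aggregate_gains(suspect, uni, bi)
```

Aggregating over all samples, rather than deciding per sample, follows the abstract's reasoning that deleting an ordinary token can occasionally improve naturalness in a single snippet; tokens with the largest accumulated gain are the trigger candidates.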

SBS Figures: Pre-training Figure QA from Stage-by-Stage Synthesized Images

arXiv, 2024

Authors: Shinoda, Risa; Saito, Kuniaki; Tanaka, Shohei; Hirasawa, Tosho; Ushiku, Yoshitaka (Kyoto University, Japan; OMRON SINIC X Corp., Japan)

Building a large-scale figure QA dataset requires a considerable amount of work, from gathering and selecting figures to extracting attributes like text, numbers, and colors, and generating QAs. Although recent developments in LLMs have led to efforts to synthesize figures, most of these focus primarily on QA generation. Additionally, creating figures directly using LLMs often encounters issues such as code errors, similar-looking figures, and repetitive content in figures. To address this issue, we present SBS Figures (Stage-by-Stage Synthetic Figures), a dataset for pre-training figure QA. Our proposed pipeline enables the creation of chart figures with complete annotations of the visualized data and dense QA annotations without any manual annotation process. Our stage-by-stage pipeline makes it possible to create diverse topic and appearance figures efficiently while minimizing code errors. Our SBS Figures demonstrate a strong pre-training effect, making it possible to achieve efficient training with a limited amount of real-world chart data starting from our pre-trained weights. Our code is available at https://***/omronsinicx/SBSFigures. Copyright © 2024, The Authors. All rights reserved.

Keywords: coding errors

Combinatorial alphabet-dependent bounds for insdel codes

arXiv, 2024

Authors: Kong, Xiangliang; Tamo, Itzhak; Wei, Hengjia (The Department of Electrical Engineering-Systems, Tel Aviv University, Tel Aviv-Yafo 6997801, Israel; The Peng Cheng Laboratory, Shenzhen 518000, China; The School of Mathematics and Statistics, Xi’an Jiaotong University, Xi’an 710049, China; Guangzhou 510555, China)

Error-correcting codes resilient to synchronization errors such as insertions and deletions are known as insdel codes. Due to their important applications in DNA storage and computational biology, insdel codes have recently become a focal point of research in coding theory. In this paper, we present several new combinatorial upper and lower bounds on the maximum size of q-ary insdel codes. Our main upper bound is a sphere-packing bound obtained by solving a linear programming (LP) problem. It improves upon previous results for cases when the distance d or the alphabet size q is large. Our first lower bound is derived from a connection between insdel codes and matchings in special hypergraphs. This lower bound, together with our upper bound, shows that for fixed block length n and edit distance d, when q is sufficiently large, the maximum size of insdel codes is (Formula Presented). The second lower bound refines Alon et al.’s recent logarithmic improvement on Levenshtein’s GV-type bound and extends its applicability to large q and *** Codes 05B40, 68P30 Copyright © 2024, The Authors. All rights reserved.

Keywords: coding errors
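
For readers unfamiliar with the insertion/deletion (insdel) metric that the abstract above refers to, the sketch below computes the insdel distance between two words using the standard identity d(x, y) = |x| + |y| - 2·LCS(x, y), where LCS is the length of a longest common subsequence. The function names are illustrative, and the code is not taken from the paper.

```python
def lcs_length(x, y):
    """Length of a longest common subsequence of x and y (row-by-row DP)."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def insdel_distance(x, y):
    """Minimum number of insertions and deletions needed to turn x into y."""
    return len(x) + len(y) - 2 * lcs_length(x, y)


def min_insdel_distance(code):
    """Smallest pairwise insdel distance over a collection of codewords."""
    words = list(code)
    return min(
        insdel_distance(u, v)
        for i, u in enumerate(words)
        for v in words[i + 1:]
    )
```

Because substitutions are not allowed, swapping two adjacent symbols costs one deletion plus one insertion, so "AC" and "CA" are at insdel distance 2.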

NoteContrast: Contrastive Language-Diagnostic Pretraining for Medical Text

arXiv, 2024

Authors: Kailas, Prajwal; Homilius, Max; Deo, Rahul C.; MacRae, Calum A. (Brigham and Women’s Hospital, Harvard Medical School, United States)

Accurate diagnostic coding of medical notes is crucial for enhancing patient care, medical research, and error-free billing in healthcare organizations. Manual coding is a time-consuming task for providers, and diagnostic codes often exhibit low sensitivity and specificity, whereas the free text in medical notes can be a more precise description of a patient’s status. Thus, accurate automated diagnostic coding of medical notes has become critical for a learning healthcare system. Recent developments in long-document transformer architectures have enabled attention-based deep-learning models to adjudicate medical notes. In addition, contrastive loss functions have been used to jointly pre-train large language and image models with noisy labels. To further improve the automated adjudication of medical notes, we developed an approach based on i) models for ICD-10 diagnostic code sequences using a large real-world data set, ii) large language models for medical notes, and iii) contrastive pre-training to build an integrated model of both ICD-10 diagnostic codes and corresponding medical text. We demonstrate that a contrastive approach for pre-training improves performance over prior state-of-the-art models for the MIMIC-III-50, MIMIC-III-rare50, and MIMIC-III-full diagnostic coding tasks. © 2024, CC BY.

Keywords: coding errors

A test-free semantic mistakes localization framework in Neural Code Translation

arXiv, 2024

Authors: Chen, Lei; Zhang, Sai; Xu, Fangzhou; Zhang, Xiaowang; Xing, Zhenchang; Wan, Liang; Feng, Zhiyong (School of Computer Science and Technology, Tianjin University, Tianjin, China; CSIRO’s Data61, CSIRO, Australia)

In the task of code translation, neural network-based models have been shown to frequently produce semantically erroneous code that deviates from the original logic of the source code. This issue persists even with advanced large models. Although a recent approach proposed using test cases to identify these semantic errors, it relies heavily on the quality of the test cases and is not applicable to code snippets without test cases in real-world scenarios. Therefore, we present EISP, a static analysis framework based on the Large Language Model (LLM). First, the framework generates a semantic mapping between source code and translated code. Next, each sub-code fragment is identified by recursively traversing the abstract syntax tree of the source code, and its corresponding translated code fragment is found through the semantic mapping. Finally, EISP connects each pair of sub-code fragments with fine-grained knowledge hints through an AI chain to assist LLMs in discovering semantic mistakes in the translated code. In our benchmark evaluation, the EISP framework, based on GPT-4o mini, achieved an accuracy of 82.3%, representing a 20.3% improvement over baseline methods using the same base model, and a 7.4% improvement compared to dynamic analysis methods that require test cases and manual intervention. To our knowledge, EISP is the first tool to locate semantic errors in translated code without test cases or compilable code. This innovative tool provides the software engineering community with a new way to deal with code fragments without test cases. Copyright © 2024, The Authors. All rights reserved.

Keywords: coding errors

Insights from Benchmarking Frontier Language Models on Web App Code Generation

arXiv, 2024

Authors: Cui, Yi (ONEKQ Lab.)

This paper presents insights from evaluating 16 frontier large language models (LLMs) on the WebApp1K benchmark, a test suite designed to assess the ability of LLMs to generate web application code. The results reveal that while all models possess similar underlying knowledge, their performance is differentiated by the frequency of mistakes they make. By analyzing lines of code (LOC) and failure distributions, we find that writing correct code is more complex than generating incorrect code. Furthermore, prompt engineering shows limited efficacy in reducing errors beyond specific cases. These findings suggest that further advancements in coding LLMs should emphasize model reliability and mistake minimization. © 2024, CC BY.

Keywords: coding errors

Unbounded Error Correcting Codes

arXiv, 2024

Authors: Efremenko, Klim; Zamir, Or (Ben Gurion University; Tel Aviv University)

We introduce a variant of Error Correcting Codes with no predetermined length. An Unbounded ECC with rate R and distance ε is an encoding of a possibly infinite message into a possibly infinite codeword, such that for every large enough k we may recover the first Rk symbols of the message from the first k symbols of the codeword - even when up to (1/2)εk of these codeword symbols are adversarially corrupted. We study unbounded codes over a binary alphabet in the regime of small distance ε, and obtain nearly-tight upper and lower bounds in several natural settings. We show that the optimal rate of such a code is between R < 1 − Ω(√ε) and R > 1 − O(√(ε log log(1/ε))). Surprisingly, our construction is non-linear, and we show that the optimal rate of a linear unbounded code is the asymptotically worse R = 1 − Θ(√(ε log(1/ε))). In the setting of random noise, the optimal rate of unbounded codes improves and matches the rate of standard codes at R = 1 − Θ(ε log(1/ε)). © 2024, CC BY-NC-ND.

Keywords: coding errors