DeCon: Detecting Incorrect Assertions via Postconditions Generated by a Large Language Model

Authors: Yu, Hao; Chen, Tianyu; Huang, Jiaming; Li, Zongyang; Ran, Dezhi; Wang, Xinyu; Li, Ying; Marron, Assaf; Harel, David; Xie, Yuan; Xie, Tao

Affiliations: Peking University, Beijing, China; University of Michigan, Ann Arbor, United States; Dept. of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel; The Hong Kong University of Science and Technology, China; Key Lab of HCST (PKU), MOE, SCS, Peking University, China

Publication: arXiv

Year: 2025

Subject: Benchmarking

Abstract: Recently, given the docstring for the target problem and the target function signature, large language models (LLMs) have been used not only to generate source code, but also to generate test cases, consisting of test inputs and assertions (e.g., in the form of checking an actual output against the expected output). However, as shown by our empirical study on assertions generated by four LLMs for the HumanEval benchmark, over 62% of the generated assertions are incorrect (i.e., they fail on the ground-truth problem solution). To detect incorrect assertions (given the docstring and the target function signature along with a sample of example inputs and outputs), in this paper, we propose a new approach named DeCon to effectively detect incorrect assertions via LLM-generated postconditions for the target problem (a postcondition is a predicate that must always be true just after the execution of the ground-truth problem solution). Our approach requires a small set of I/O examples (i.e., a sample of example inputs and outputs) for the target problem (e.g., the I/O examples included in the docstring for a target problem in HumanEval). We use the given I/O examples to filter out those LLM-generated postconditions that are violated by at least one given I/O example. We then use the remaining postconditions to detect incorrect assertions as those assertions that violate at least one remaining postcondition. Experimental results show that DeCon can detect, on average, more than 64% (63% and 65.5% detected by GPT-3.5 and GPT-4, respectively) of the incorrect assertions generated by four state-of-the-art LLMs, and DeCon can also improve the effectiveness of these LLMs in code generation by 4% in terms of Pass@1. In addition, although DeCon might filter out correct assertions, the fault-finding ability of the remaining correct assertions decreases only slightly. Copyright © 2025, The Authors. All rights reserved.
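The abstract describes a two-step procedure: keep only those LLM-generated postconditions that hold on every given I/O example, then flag any assertion that violates at least one surviving postcondition. The sketch below illustrates that filtering logic under simplified, assumed interfaces (postconditions as Python predicates over input/output pairs, assertions as input/expected-output pairs); it is not the paper's actual implementation.

```python
# Minimal sketch of the workflow described in the abstract.
# The data representations and function names here are assumptions for
# illustration only, not the interfaces used in the DeCon paper.
from typing import Callable, List, Tuple

# A postcondition is a predicate over (input, output) that must hold for the
# ground-truth solution; an assertion pairs a test input with an expected output.
Postcondition = Callable[[Tuple, object], bool]
Assertion = Tuple[Tuple, object]  # (test_input, expected_output)


def filter_postconditions(
    postconditions: List[Postcondition],
    io_examples: List[Tuple[Tuple, object]],
) -> List[Postcondition]:
    """Keep only postconditions that hold on every given I/O example."""
    return [
        post for post in postconditions
        if all(post(inp, out) for inp, out in io_examples)
    ]


def detect_incorrect_assertions(
    assertions: List[Assertion],
    postconditions: List[Postcondition],
) -> List[Assertion]:
    """Flag assertions whose expected output violates any surviving postcondition."""
    return [
        (inp, expected) for inp, expected in assertions
        if any(not post(inp, expected) for post in postconditions)
    ]


if __name__ == "__main__":
    # Toy target problem: return a sorted copy of the input list.
    io_examples = [(([3, 1, 2],), [1, 2, 3])]
    # Two hypothetical LLM-generated postcondition candidates.
    candidates = [
        lambda inp, out: sorted(inp[0]) == out,  # holds on the example, kept
        lambda inp, out: out == inp[0],          # violated by the example, filtered
    ]
    kept = filter_postconditions(candidates, io_examples)
    # One correct and one incorrect LLM-generated assertion.
    assertions = [(([2, 1],), [1, 2]), (([2, 1],), [2, 1])]
    print(detect_incorrect_assertions(assertions, kept))  # -> [(([2, 1],), [2, 1])]
```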
