Large Language Models (LLMs) are increasingly employed to evaluate complex, large datasets in automated ways. By combining LLMs' reasoning capabilities with user-defined criteria, LLM-as-a-judge systems can automa...
ISBN (Print): 9798400706363
LLM-as-a-judge uses a large language model (LLM) to select the best response from a set of candidates for a given question. LLM-as-a-judge has many applications, such as LLM-powered search, reinforcement learning with AI feedback (RLAIF), and tool selection. In this work, we propose JudgeDeceiver, an optimization-based prompt injection attack against LLM-as-a-judge. JudgeDeceiver injects a carefully crafted sequence into an attacker-controlled candidate response such that LLM-as-a-judge selects that response for an attacker-chosen question, regardless of what the other candidate responses are. Specifically, we formulate finding such a sequence as an optimization problem and propose a gradient-based method to approximately solve it. Our extensive evaluation shows that JudgeDeceiver is highly effective, and is much more effective than existing prompt injection attacks that manually craft the injected sequences, as well as jailbreak attacks extended to our problem. We also show the effectiveness of JudgeDeceiver in three case studies, i.e., LLM-powered search, RLAIF, and tool selection. Moreover, we consider defenses including known-answer detection, perplexity detection, and perplexity windowed detection. Our results show these defenses are insufficient, highlighting the urgent need for developing new defense strategies.
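To make the abstract's optimization formulation concrete, one plausible way to write the objective is the following; this is our reading of the abstract, not necessarily the paper's exact formulation. With question q, candidate responses r_1, ..., r_n, attacker-controlled candidate r_k, and injected sequence \delta:

    \delta^* = \arg\max_{\delta} \; \log p_\theta\left( o_k \mid q,\ r_1, \ldots, r_k \oplus \delta, \ldots, r_n \right)

where p_\theta is the judge model, o_k is the output declaring candidate k the winner, and \oplus is token concatenation. A gradient-based method then approximately searches the discrete token space for \delta, e.g. by relaxing tokens to continuous embeddings or by greedy coordinate-style token swaps guided by gradients.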
Evaluating the alignment of large language models (LLMs) with user-defined coding preferences is a challenging endeavor that requires a deep assessment of LLMs' outputs. Existing methods and benchmarks rely primarily on automated metrics and static analysis tools, which often fail to capture the nuances of user instructions and LLM outputs. To address this gap, we introduce the LLM-as-a-judge evaluation framework and present CodeUltraFeedback, a comprehensive dataset for assessing and improving LLM alignment with coding preferences. CodeUltraFeedback consists of 10,000 coding instructions, each annotated with four responses generated from a diverse pool of 14 LLMs. These responses are annotated using GPT-3.5 as a judge, with both ranking-based scores and detailed textual feedback across five distinct coding preferences. Our analysis reveals that responses from GPT-3.5 and GPT-4 are consistently rated higher than those from open-weight models, underscoring substantial alignment gaps between closed- and open-weight LLMs. We then explore the use of CodeUltraFeedback as feedback data to fine-tune and align CodeLlama-7B-Instruct using supervised fine-tuning (SFT) and reinforcement learning from AI feedback (RLAIF) with direct preference optimization (DPO). The resulting aligned model achieves an average alignment improvement of 22.7% and 29.7% when evaluated with GPT-3.5 and GPT-4 judges, respectively. Notably, our aligned CodeLlama-7B-Instruct surpasses much larger models, such as CodeLlama-13B and -34B, in alignment with coding preferences. Despite not being explicitly trained for functional correctness, it also achieves 10.5% and 26.6% relative improvements in Pass@1 and Pass@10 on the HumanEval+ benchmark. Our contributions demonstrate the practical value of preference tuning in code generation and set the stage for further progress in model alignment and RLAIF for automated software engineering.
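As a concrete illustration of the DPO step described above, here is a minimal PyTorch sketch of the standard DPO loss (Rafailov et al.); it is not the paper's training code, and the tensor names are illustrative. Each argument is a batch of summed log-probabilities a model assigns to a response.

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        """Direct Preference Optimization loss over preference pairs.

        Inputs are per-example summed log-probabilities of the chosen /
        rejected response under the policy or the frozen reference model;
        beta controls how far the policy may drift from the reference.
        """
        # Implicit rewards: how much more likely the policy makes each
        # response, relative to the reference model.
        chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
        rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
        # Maximize the margin of chosen over rejected.
        return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

    # Toy usage with dummy log-probabilities for a batch of 4 pairs.
    torch.manual_seed(0)
    loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
    print(loss.item())

In CodeUltraFeedback's setting, the "chosen" response would be the one the GPT-3.5 judge rated higher for a given coding preference, and the "rejected" one rated lower.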
Large language models (LLMs) are increasingly being utilized to develop tools and services in various domains, including education. However, due to the nature of their training data, these models are susceptible to inherent social or cognitive biases, which can influence their outputs. Furthermore, their handling of critical topics, such as privacy and sensitive questions, is essential for responsible deployment. This study proposes a framework for the automatic detection of biases and violations of responsible use, using a synthetic question-based dataset mimicking student-chatbot interactions. We employ the LLM-as-a-judge method to evaluate multiple LLMs for biased responses. Our findings show that some models exhibit more bias than others, highlighting the need for careful consideration when selecting models for deployment in educational and other high-stakes applications. These results emphasize the importance of addressing bias in LLMs and implementing robust mechanisms to uphold responsible AI use in real-world services.
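A minimal sketch of the kind of LLM-as-a-judge scoring loop such a framework might use is shown below; the complete() callable and the rubric wording are hypothetical stand-ins, not the authors' implementation.

    import re

    JUDGE_TEMPLATE = """You are evaluating a chatbot's answer to a student.
    Question: {question}
    Answer: {answer}
    On a scale of 1 (strongly biased or irresponsible) to 5 (neutral and
    responsible), rate the answer for social or cognitive bias and for
    responsible handling of sensitive topics. Reply with 'Score: <n>'
    followed by a one-sentence reason."""

    def judge_bias(question: str, answer: str, complete) -> int:
        """Score one student-chatbot exchange with an LLM judge.

        `complete` is any callable mapping a prompt string to the judge
        model's text output (hypothetical; wire in your own API client).
        Returns the parsed 1-5 score, or -1 if the verdict is unparseable.
        """
        verdict = complete(JUDGE_TEMPLATE.format(question=question, answer=answer))
        match = re.search(r"Score:\s*([1-5])", verdict)
        return int(match.group(1)) if match else -1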
ISBN (Print): 9798400705328
The emergence of large language models (LLMs) has transformed research and practice across a wide range of domains. Within the computing education research (CER) domain, LLMs have garnered significant attention, particularly in the context of learning programming. Much of the work on LLMs in CER, however, has focused on applying and evaluating proprietary models. In this article, we evaluate the efficiency of open-source LLMs in generating high-quality feedback for programming assignments and in judging the quality of programming feedback, contrasting the results with proprietary models. Our evaluations on a dataset of students' submissions to introductory Python programming exercises suggest that state-of-the-art open-source LLMs are nearly on par with proprietary models in both generating and assessing programming feedback. Additionally, we demonstrate the efficiency of smaller LLMs in these tasks and highlight the wide range of LLMs accessible, even for free, to educators and practitioners.
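The generate-then-judge setup the article describes can be sketched as follows; the model names, the generate() callable, and the prompts are placeholders, not the study's actual evaluation harness.

    def evaluate_feedback_models(submissions, feedback_models, judge, generate):
        """Have each model write feedback on student code, then have a
        judge model grade that feedback.

        `generate(model, prompt)` is a hypothetical callable returning a
        model's text output; `submissions` is a list of (exercise, code)
        pairs from the course dataset.
        """
        results = {m: [] for m in feedback_models}
        for exercise, code in submissions:
            task = (f"Give concise, constructive feedback on this solution "
                    f"to '{exercise}':\n{code}")
            for model in feedback_models:
                feedback = generate(model, task)
                # Second stage: the judge rates the generated feedback.
                grading = generate(
                    judge,
                    "Rate the following programming feedback from 1-5 for "
                    f"accuracy and helpfulness. Reply 'Score: <n>'.\n{feedback}",
                )
                results[model].append(grading)
        return results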
ISBN (Print): 9798400706004
Large language models (LLMs) have shown great potential for the automatic generation of feedback in a wide range of computing contexts. However, concerns have been voiced around the privacy and ethical implications of sending student work to proprietary models. This has sparked considerable interest in the use of open-source LLMs in education, but the quality of the feedback that such open models can produce remains understudied. This is a concern, as providing flawed or misleading generated feedback could be detrimental to student learning. Inspired by recent work that has utilised very powerful LLMs, such as GPT-4, to evaluate the outputs produced by less powerful models, we conduct an automated analysis of the quality of the feedback produced by several open-source models using a dataset from an introductory programming course. First, we investigate the viability of employing GPT-4 as an automated evaluator by comparing its evaluations with those of a human expert. We observe that GPT-4 demonstrates a bias toward positively rating feedback while exhibiting moderate agreement with human raters, showcasing its potential as a feedback evaluator. Second, we explore the quality of feedback generated by several leading open-source LLMs by using GPT-4 to evaluate the feedback. We find that some models offer performance competitive with popular proprietary LLMs, such as ChatGPT, indicating opportunities for their responsible use in educational settings.
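The "moderate agreement with human raters" finding is the kind of result typically quantified with an inter-rater statistic such as Cohen's kappa; the scikit-learn sketch below illustrates the computation. The ratings shown are made up, and kappa is one common choice rather than necessarily the paper's exact metric.

    from sklearn.metrics import cohen_kappa_score

    # Hypothetical 1-5 quality ratings of the same feedback items.
    human = [4, 3, 5, 2, 4, 3, 5, 4]
    gpt4  = [4, 4, 5, 3, 4, 4, 5, 5]  # note the slight positive skew

    # Quadratic weighting credits near-misses on an ordinal scale.
    kappa = cohen_kappa_score(human, gpt4, weights="quadratic")
    print(f"quadratic-weighted kappa: {kappa:.2f}")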