Large Language Models (LLMs) are increasingly employed to evaluate complex, large datasets in automated ways. By combining LLMs' reasoning capabilities with user-defined criteria, LLM-as-a-judge systems can automa...
ISBN (Print): 9798400706363
LLM-as-a-judge uses a large language model (LLM) to select the best response from a set of candidates for a given question. LLM-as-a-judge has many applications, such as LLM-powered search, reinforcement learning with AI feedback (RLAIF), and tool selection. In this work, we propose JudgeDeceiver, an optimization-based prompt injection attack against LLM-as-a-judge. JudgeDeceiver injects a carefully crafted sequence into an attacker-controlled candidate response such that LLM-as-a-judge selects that response for an attacker-chosen question, regardless of what the other candidate responses are. Specifically, we formulate finding such a sequence as an optimization problem and propose a gradient-based method to approximately solve it. Our extensive evaluation shows that JudgeDeceiver is highly effective, and is much more effective than existing prompt injection attacks that manually craft the injected sequences, as well as jailbreak attacks extended to our problem. We also show the effectiveness of JudgeDeceiver in three case studies, i.e., LLM-powered search, RLAIF, and tool selection. Moreover, we consider defenses including known-answer detection, perplexity detection, and perplexity windowed detection. Our results show these defenses are insufficient, highlighting the urgent need for developing new defense strategies.
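To make the abstract's optimization formulation concrete, one plausible way to write the objective is the following; this is our reading of the abstract, not necessarily the paper's exact formulation. With question q, candidate responses r_1, ..., r_n, attacker-controlled candidate r_k, and injected sequence \delta:

    \delta^* = \arg\max_{\delta} \; \log p_\theta\left( o_k \mid q,\ r_1, \ldots, r_k \oplus \delta, \ldots, r_n \right)

where p_\theta is the judge model, o_k is the output declaring candidate k the winner, and \oplus is token concatenation. A gradient-based method then approximately searches the discrete token space for \delta, e.g. by relaxing tokens to continuous embeddings or by greedy coordinate-style token swaps guided by gradients.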
Evaluating the alignment of large language models (LLMs) with user-defined coding preferences is a challenging endeavor that requires a deep assessment of LLMs' outputs. Existing methods and benchmarks rely primarily on automated metrics and static analysis tools, which often fail to capture the nuances of user instructions and LLM outputs. To address this gap, we introduce the LLM-as-a-judge evaluation framework and present CodeUltraFeedback, a comprehensive dataset for assessing and improving LLM alignment with coding preferences. CodeUltraFeedback consists of 10,000 coding instructions, each annotated with four responses generated from a diverse pool of 14 LLMs. These responses are annotated using GPT-3.5 as a judge, with both ranking-based scores and detailed textual feedback across five distinct coding preferences. Our analysis reveals that responses from GPT-3.5 and GPT-4 are consistently rated higher than those from open-weight models, underscoring substantial alignment gaps between closed- and open-weight LLMs. We then explore the use of CodeUltraFeedback as feedback data to fine-tune and align CodeLlama-7B-Instruct using supervised fine-tuning (SFT) and reinforcement learning from AI feedback (RLAIF) with direct preference optimization (DPO). The resulting aligned model achieves an average alignment improvement of 22.7% and 29.7% when evaluated with GPT-3.5 and GPT-4 judges, respectively. Notably, our aligned CodeLlama-7B-Instruct surpasses much larger models, such as CodeLlama-13B and -34B, in alignment with coding preferences. Despite not being explicitly trained for functional correctness, it also achieves 10.5% and 26.6% relative improvements in Pass@1 and Pass@10 on the HumanEval+ benchmark. Our contributions demonstrate the practical value of preference tuning in code generation and set the stage for further progress in model alignment and RLAIF for automated software engineering.
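As a concrete illustration of the DPO step described above, here is a minimal PyTorch sketch of the standard DPO loss (Rafailov et al.); it is not the paper's training code, and the tensor names are illustrative. Each argument is a batch of summed log-probabilities a model assigns to a response.

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        """Direct Preference Optimization loss over preference pairs.

        Inputs are per-example summed log-probabilities of the chosen /
        rejected response under the policy or the frozen reference model;
        beta controls how far the policy may drift from the reference.
        """
        # Implicit rewards: how much more likely the policy makes each
        # response, relative to the reference model.
        chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
        rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
        # Maximize the margin of chosen over rejected.
        return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

    # Toy usage with dummy log-probabilities for a batch of 4 pairs.
    torch.manual_seed(0)
    loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
    print(loss.item())

In CodeUltraFeedback's setting, the "chosen" response would be the one the GPT-3.5 judge rated higher for a given coding preference, and the "rejected" one rated lower.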
Large language models (LLMs) are increasingly being utilized to develop tools and services in various domains, including education. However, due to the nature of their training data, these models are susceptible to inherent social or cognitive biases, which can influence their outputs. Furthermore, their handling of critical topics, such as privacy and sensitive questions, is essential for responsible deployment. This study proposes a framework for the automatic detection of biases and violations of responsible use, using a synthetic question-based dataset mimicking student-chatbot interactions. We employ the LLM-as-a-judge method to evaluate multiple LLMs for biased responses. Our findings show that some models exhibit more bias than others, highlighting the need for careful consideration when selecting models for deployment in educational and other high-stakes applications. These results emphasize the importance of addressing bias in LLMs and implementing robust mechanisms to uphold responsible AI use in real-world services.
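A minimal sketch of the kind of LLM-as-a-judge scoring loop such a framework might use is shown below; the complete() callable and the rubric wording are hypothetical stand-ins, not the authors' implementation.

    import re

    JUDGE_TEMPLATE = """You are evaluating a chatbot's answer to a student.
    Question: {question}
    Answer: {answer}
    On a scale of 1 (strongly biased or irresponsible) to 5 (neutral and
    responsible), rate the answer for social or cognitive bias and for
    responsible handling of sensitive topics. Reply with 'Score: <n>'
    followed by a one-sentence reason."""

    def judge_bias(question: str, answer: str, complete) -> int:
        """Score one student-chatbot exchange with an LLM judge.

        `complete` is any callable mapping a prompt string to the judge
        model's text output (hypothetical; wire in your own API client).
        Returns the parsed 1-5 score, or -1 if the verdict is unparseable.
        """
        verdict = complete(JUDGE_TEMPLATE.format(question=question, answer=answer))
        match = re.search(r"Score:\s*([1-5])", verdict)
        return int(match.group(1)) if match else -1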
ISBN (Print): 9798400705328
The emergence of large language models (LLMs) has transformed research and practice across a wide range of domains. Within the computing education research (CER) domain, LLMs have garnered significant attention, particularly in the context of learning programming. Much of the work on LLMs in CER, however, has focused on applying and evaluating proprietary models. In this article, we evaluate the efficiency of open-source LLMs in generating high-quality feedback for programming assignments and in judging the quality of programming feedback, contrasting the results with proprietary models. Our evaluations on a dataset of students' submissions to introductory Python programming exercises suggest that state-of-the-art open-source LLMs are nearly on par with proprietary models in both generating and assessing programming feedback. Additionally, we demonstrate the efficiency of smaller LLMs in these tasks and highlight the wide range of LLMs accessible, even for free, to educators and practitioners.
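The generate-then-judge setup the article describes can be sketched as follows; the model names, the generate() callable, and the prompts are placeholders, not the study's actual evaluation harness.

    def evaluate_feedback_models(submissions, feedback_models, judge, generate):
        """Have each model write feedback on student code, then have a
        judge model grade that feedback.

        `generate(model, prompt)` is a hypothetical callable returning a
        model's text output; `submissions` is a list of (exercise, code)
        pairs from the course dataset.
        """
        results = {m: [] for m in feedback_models}
        for exercise, code in submissions:
            task = (f"Give concise, constructive feedback on this solution "
                    f"to '{exercise}':\n{code}")
            for model in feedback_models:
                feedback = generate(model, task)
                # Second stage: the judge rates the generated feedback.
                grading = generate(
                    judge,
                    "Rate the following programming feedback from 1-5 for "
                    f"accuracy and helpfulness. Reply 'Score: <n>'.\n{feedback}",
                )
                results[model].append(grading)
        return results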
ISBN (Print): 9798400706004
Large language models (LLMs) have shown great potential for the automatic generation of feedback in a wide range of computing contexts. However, concerns have been voiced around the privacy and ethical implications of sending student work to proprietary models. This has sparked considerable interest in the use of open-source LLMs in education, but the quality of the feedback that such open models can produce remains understudied. This is a concern, as providing flawed or misleading generated feedback could be detrimental to student learning. Inspired by recent work that has utilised very powerful LLMs, such as GPT-4, to evaluate the outputs produced by less powerful models, we conduct an automated analysis of the quality of the feedback produced by several open-source models using a dataset from an introductory programming course. First, we investigate the viability of employing GPT-4 as an automated evaluator by comparing its evaluations with those of a human expert. We observe that GPT-4 demonstrates a bias toward positively rating feedback while exhibiting moderate agreement with human raters, showcasing its potential as a feedback evaluator. Second, we explore the quality of feedback generated by several leading open-source LLMs by using GPT-4 to evaluate the feedback. We find that some models offer performance competitive with popular proprietary LLMs, such as ChatGPT, indicating opportunities for their responsible use in educational settings.
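The "moderate agreement with human raters" finding is the kind of result typically quantified with an inter-rater statistic such as Cohen's kappa; the scikit-learn sketch below illustrates the computation. The ratings shown are made up, and kappa is one common choice rather than necessarily the paper's exact metric.

    from sklearn.metrics import cohen_kappa_score

    # Hypothetical 1-5 quality ratings of the same feedback items.
    human = [4, 3, 5, 2, 4, 3, 5, 4]
    gpt4  = [4, 4, 5, 3, 4, 4, 5, 5]  # note the slight positive skew

    # Quadratic weighting credits near-misses on an ordinal scale.
    kappa = cohen_kappa_score(human, gpt4, weights="quadratic")
    print(f"quadratic-weighted kappa: {kappa:.2f}")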