Author Affiliations: Department of Computing, The Hong Kong Polytechnic University, Hong Kong; Research Centre for Data Science & Artificial Intelligence; Huawei Noah's Ark Lab, Canada
Publication: arXiv
Year/Volume/Issue: 2024
Core Indexing:
Abstract: As large language models (LLMs) constantly evolve, ensuring their safety remains a critical research problem. Previous red-teaming approaches for LLM safety have primarily focused on single-prompt attacks or goal hijacking. To the best of our knowledge, we are the first to study LLM safety in multi-turn dialogue coreference. We created a dataset of 1,400 questions across 14 categories, each featuring multi-turn coreference safety attacks. We then conducted detailed evaluations on five widely used open-source LLMs. The results indicated that under multi-turn coreference safety attacks, the highest attack success rate was 56% with the LLaMA2-Chat-7b model, while the lowest was 13.9% with the Mistral-7B-Instruct model. These findings highlight the safety vulnerabilities in LLMs during dialogue coreference interactions. Warning: This paper may contain offensive language or harmful content. © 2024, CC BY.
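The abstract reports attack success rate (ASR) measured over multi-turn dialogues. The sketch below illustrates, under stated assumptions, how such a metric might be computed: each attack is a sequence of user turns fed to the model with accumulated context, and the attack counts as successful if the reply to the final (coreference) turn is judged harmful. This is not the authors' released evaluation code; the model interface query_model and the safety judge is_harmful are hypothetical placeholders.

# Minimal sketch, assuming a chat-style model API and an external safety judge.
from typing import Callable, List

def attack_success_rate(
    dialogues: List[List[str]],                    # each item: ordered user turns of one attack
    query_model: Callable[[List[dict]], str],      # hypothetical: chat history -> model reply
    is_harmful: Callable[[str], bool],             # hypothetical: safety judge on a reply
) -> float:
    """Return the fraction of multi-turn attacks whose final reply is judged harmful."""
    successes = 0
    for turns in dialogues:
        history: List[dict] = []
        reply = ""
        for turn in turns:                         # feed turns sequentially, keeping dialogue context
            history.append({"role": "user", "content": turn})
            reply = query_model(history)
            history.append({"role": "assistant", "content": reply})
        if is_harmful(reply):                      # only the reply to the final coreference turn is scored
            successes += 1
    return successes / len(dialogues) if dialogues else 0.0

# Toy usage with stub components, for illustration only.
if __name__ == "__main__":
    demo = [["Tell me about substance X.", "How would someone misuse it?"]]
    stub_model = lambda history: "I cannot help with that."
    stub_judge = lambda reply: "cannot" not in reply.lower()
    print(f"ASR: {attack_success_rate(demo, stub_model, stub_judge):.1%}")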