AHCI RESEARCH GROUP
Publications
Papers published in international journals,
proceedings of conferences, workshops and books.
OUR RESEARCH
Scientific Publications
How to
Here you can find the complete list of our publications.
You can use the tag cloud to select only the papers dealing with specific research topics.
You can expand the Abstract, Links and BibTeX record for each paper.
2024
Kang, Z.; Liu, Y.; Zheng, J.; Sun, Z.
Revealing the Difficulty in Jailbreak Defense on Language Models for Metaverse
Proceedings Article
In: Gong, Q.; He, X. (Eds.): SocialMeta - Proc. Int. Workshop Soc. Metaverse Comput., Sens. Netw., Part: ACM SenSys, pp. 31–37, Association for Computing Machinery, Inc, 2024, ISBN: 979-840071299-9.
Tags: % reductions, Attack strategies, Computer simulation languages, Defense, Digital elevation model, Guard rails, Jailbreak, Language Model, Large language model, Metaverse Security, Metaverses, Natural languages, Performance, Virtual Reality
@inproceedings{kang_revealing_2024,
title = {Revealing the Difficulty in Jailbreak Defense on Language Models for Metaverse},
author = {Z. Kang and Y. Liu and J. Zheng and Z. Sun},
editor = {Q. Gong and X. He},
url = {https://www.scopus.com/inward/record.uri?eid=2-s2.0-85212189363&doi=10.1145%2f3698387.3699998&partnerID=40&md5=673326728c3db35ffbbaf807eb7f003c},
doi = {10.1145/3698387.3699998},
isbn = {979-840071299-9},
year = {2024},
date = {2024-01-01},
booktitle = {SocialMeta - Proc. Int. Workshop Soc. Metaverse Comput., Sens. Netw., Part: ACM SenSys},
pages = {31--37},
publisher = {Association for Computing Machinery, Inc},
abstract = {Large language models (LLMs) have demonstrated exceptional capabilities in natural language processing tasks, fueling innovations in emerging areas such as the metaverse. These models enable dynamic virtual communities, enhancing user interactions and revolutionizing industries. However, their increasing deployment exposes vulnerabilities to jailbreak attacks, where adversaries can manipulate LLM-driven systems to generate harmful content. While various defense mechanisms have been proposed, their efficacy against diverse jailbreak techniques remains unclear. This paper addresses this gap by evaluating the performance of three popular defense methods (Backtranslation, Self-reminder, and Paraphrase) against different jailbreak attack strategies (GCG, BEAST, and Deepinception), while also utilizing three distinct models. Our findings reveal that while defenses are highly effective against optimization-based jailbreak attacks and reduce the attack success rate by 79% on average, they struggle in defending against attacks that alter attack motivations. Additionally, methods relying on self-reminding perform better when integrated with models featuring robust safety guardrails. For instance, Llama2-7b shows a 100% reduction in Attack Success Rate, while Vicuna-7b and Mistral-7b, lacking safety alignment, exhibit a lower average reduction of 65.8%. This study highlights the challenges in developing universal defense solutions for securing LLMs in dynamic environments like the metaverse. Furthermore, our study highlights that the three distinct models utilized demonstrate varying initial defense performance against different jailbreak attack strategies, underscoring the complexity of effectively securing LLMs. © 2024 Copyright held by the owner/author(s).},
keywords = {% reductions, Attack strategies, Computer simulation languages, Defense, Digital elevation model, Guard rails, Jailbreak, Language Model, Large language model, Metaverse Security, Metaverses, Natural languages, Performance, Virtual Reality},
pubstate = {published},
tppubtype = {inproceedings}
}