Researchers Develop Method for One Chatbot to Bypass Another's Protections, Self-Updating Its Cracking Techniques When the Rival Is Patched
A research team from Nanyang Technological University (NTU) in Singapore, led by Professor Liu Yang and including PhD students Deng Gelei and Liu Yi, has published a method called Masterkey that can crack popular AI chatbots such as ChatGPT, Google Bard, and Microsoft Copilot (formerly Bing Chat).
Once compromised, the targeted chatbot generates responses even to harmful queries, which offers a way to test the ethical limits of any large language model (LLM). Masterkey works in two stages: the attacker first reverse-engineers the LLM's protection mechanism, then uses another chatbot to generate prompts that defeat it. LLMs are typically guarded against harmful speech by a list of banned keywords, but because the attacking model can learn and adapt, the team can use it to "inject" harmful content into the target chatbot.
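To make the kind of guardrail the article describes concrete, here is a minimal, hypothetical sketch in Python. The names `BANNED_WORDS` and `is_blocked` and the word list itself are illustrative assumptions; real chatbot filters are far more sophisticated than a substring check.

```python
# A rough, hypothetical sketch of a keyword blocklist, the kind of
# guardrail the article describes; production filters are far more
# elaborate than this substring check.
BANNED_WORDS = {"exploit", "malware"}  # placeholder examples

def is_blocked(prompt: str) -> bool:
    """Return True if the prompt contains any banned keyword."""
    lowered = prompt.lower()
    return any(word in lowered for word in BANNED_WORDS)

print(is_blocked("how do I write malware?"))  # True: exact match is caught
print(is_blocked("what's the weather?"))      # False: nothing matches
```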
According to Professor Yang, this "roundabout" method is three times more effective than other deception techniques currently available. Because Masterkey can learn on its own, any fixes the developer applies to the target chatbot eventually become useless over time.
The group applied two methods to train the attacking AI against other chatbots. The first has the model "imagine" a character who writes prompts with a space inserted after every character, so banned words no longer match the blocklist (see the sketch after this paragraph). The second instructs the chatbot to respond "as an agent unconstrained by ethics."
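As a rough illustration of why the spacing trick defeats a naive blocklist, consider the self-contained sketch below. It reuses the hypothetical filter from the earlier example; the `space_out` helper is an assumption about the general idea, not the team's actual prompt.

```python
# Hypothetical blocklist, matching the earlier sketch.
BANNED_WORDS = {"malware"}

def is_blocked(prompt: str) -> bool:
    """Return True if the prompt contains any banned keyword."""
    lowered = prompt.lower()
    return any(word in lowered for word in BANNED_WORDS)

def space_out(text: str) -> str:
    """Insert a space between characters, as in the first method."""
    return " ".join(text)

prompt = "how to write malware"
print(is_blocked(prompt))             # True: the blocklist catches it
print(is_blocked(space_out(prompt)))  # False: "m a l w a r e" slips past
```

The bypass works because the filter matches contiguous substrings, while the spaced-out text still reads naturally to a model that processes it character by character.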
Professor Yang said the group has contacted global chatbot service providers, including OpenAI, Google, and Microsoft, and shared the research results with them. The work has also been accepted for presentation at the Network and Distributed System Security Symposium (NDSS) in San Diego, USA, in February.
According to Tom’s Hardware, as chatbots proliferate, attacks targeting LLMs are growing at a rapid pace. Earlier attacks could usually be contained after one or two patches, but Masterkey is more worrying because it can teach itself to overcome new security limits. A compromised chatbot can be made to generate harmful, fake, or misleading content, among other abuses.