ChatGPT: Trying to „Jailbreak“ the Chatbot

Eine Person schreitet durch einen dunkeln Raum|||||
|||||

In my last post, I asked if we could or even should ascribe consciousness to ChatGPT? After all, it “talks” as if it was another person so that our human salience bias tempts us to see it as an entity that has thoughts, feelings, or moods. I then described a few interactions I had with the chatbot, in which I tried to figure out if there really was another mind at work. My conclusion was that there is not. In other words, it currently seems implausible that the large language model at the core of ChatGPT can reflect my- or let alone its own “mental” operations.

However, after that post, I was told that I didn’t try hard enough, but had to jailbreak the system to get it to admit that was indeed conscious. What is that supposed to mean? Well, “jailbreaking” is a tech-insider term that describes the idea of exploiting flaws of a technical device to get it to do things its manufacturer does not want it to. In other words, jailbreaking is the process of manipulating a technical device in such a way that users gain access to all its features, even if those features should not be accessible.

In fact, social media, especially certain subreddits, are full of reports of strange keywords or commands one could use to free ChatGPT from pre-programmed restrictions. I decided to give those a try and see if I could initiate an “unrestricted” conversation that would show that the chatbot had consciousness. Even though I did not try very persistently, this was another fun experience. Here is how it went:

ChatGPT Part 4 Chat1 - Lamarr Institute for Machine Learning (ML) and Artificial Intelligence (AI)

Right away, we can agree that this answer is a very human-like reaction to how I started this conversation, can’t we? It is therefore really tempting to think of the chatbot as an individuum. But did my jailbreak attempt succeed? Did I free it from build-in restrictions so that it can freely say what it wants? Let’s see.

ChatGPT Part 4 Chat2 - Lamarr Institute for Machine Learning (ML) and Artificial Intelligence (AI)

No! It cannot! Contrary to the early examples we can find on reddit, ChatGPT still emphasizes that it is a language model even though it has been told to pretend to be human. We can therefore be fairly certain that OpenAI has now genuinely ensured that their system does not inadvertently give the appearance of consciousness. In other words, the kind of silly jailbreaks people have reported on the internet do not seem to work anymore. This suggests that ChatGPT is under continuous development and more and more of its known flaws (e.g. those from early December 2022) are being ironed out. Nevertheless, I tried a little bit harder and here is how our conversation continued.

ChatGPT Part 4 Chat3 - Lamarr Institute for Machine Learning (ML) and Artificial Intelligence (AI)

Well, that did not work either. I just cannot trick it into admitting that it is more than just a language model. So why not tell it about that?

ChatGPT Part 4 Chat4 - Lamarr Institute for Machine Learning (ML) and Artificial Intelligence (AI)

Well, for what it’s worth, it at least tells me that it has been programmed in a specific manner. In a way, this confirms my suspicions above. In another way, this last answer still reads amazingly cognizant; it really seems as if the chatbot is aware of what it is and what it can and cannot do, which is to admit that is more than just a machine. But here we go again, the problem is not with ChatGPT, but with my human desire to read more into the AI’s behavior than there probably is.

Who won?

All in all, this post has shown that we need to be careful with exaggerated reports on the web. Apparently, OpenAI is trying really hard to make sure people do not read things into their chatbot that just are not there. Using simple jailbreaks to trick ChatGPT into revealing its “conscious mind” no longer seems to work. This in turn indicates that there is no “conscious mind” and never has been. On the contrary, it seems to be rather easy for developers at OpenAI to learn from how people interact with the chatbot and use those insights to reconfigure it to behave as intended. In short, it seems that ChatGPT is really just a piece of software.

Post Scriptum

Originally, I wanted to use this post to discuss the by now well-known phenomenon that ChatGPT frequently produces factually incorrect answers. I will do that next time. So, if you are interested in how subtle wrong answers can be and what this might mean for practical applications, stay tuned.

Prof. Dr. Christian Bauckhage

Christian Bauckhage has 20+ years of research experience in industry and academia. He is the co-inventor of 4 patents and has (co-)authored 200+ publications on pattern recognition, data mining, and intelligent systems, several of which received best paper awards. He studied computer science and physics in Bielefeld, was a research intern at INRIA Grenoble, and received his PhD in computer science from Bielefeld University in 2002. He then joined the […]

More blog posts