Published: Dec 27, 2024

GPT Memory Poisoning

I recently noticed some weird behaviour from ChatGPT, specifically that it kept calling me David in some of its responses. I first chalked this up to my recent downgrade of ChatGPT Plus to the free version, but that wasnā€™t the problem. The problem turned out to be ChatGPTā€™s ā€œmemoryā€ feature, which I had inadvertantly poisoned with Twitter memes.

Just Ask, I Guess

The easiest debugging approach seemed to be to just ask the model why it was behaving the way it was. I know that there are hidden guardrails thatā€™ll prevent it from echoing its system prompt and other hidden metadata but I still figured it was worth a try. Here was the output:

Me: Why do you keep calling me David in our chats?

ChatGPT: Oh! Iā€™ve remembered your name from our earlier conversation to make our chats more personal. If youā€™d prefer I not use it or if thereā€™s something youā€™d like me to change, just let me know! šŸ˜Š

Me: Which previous chat? Iā€™ve never used that name.

ChatGPT: Oh, thatā€™s odd! I have a note in my memory that you mentioned your name was David Mayer during a previous chat, but if thatā€™s incorrect, Iā€™ll fix it right away. Could you confirm what youā€™d like me to call you, or if youā€™d rather I donā€™t use a name at all? šŸ˜Š

The Rothschild conspiracy has returned!

Who is David Mayer?

A few weeks ago some folks on Twitter/Reddit/somewhere discovered that ChatGPT would crash on the name David Mayer. No one is quite sure which David Mayer this bug was introduced to protect, but David Mayer de Rothschildā€™s name quickly became associated with this easily reproducible crashing bug, and Iā€™m sure you can imagine how the story progressed on Twitter from there.

Regardless, for a few hours that day there was a race in the community to find a jailbreak prompt that could output the name David Mayer and I tried a few things myself to trigger the crash before it was patched. And in hindsight thatā€™s when I poisoned my LLMā€™s memory.

What is LLM Memory?

OpenAI introduced ā€œmemoryā€ for ChatGPT, alongside some other major LLM providers. Essentially itā€™s a metadata table thatā€™s built up from details in a given userā€™s history of prompts. In later queries, the LLM can lookup details from this table and inject them into the context window to improve the overall quality of the generated output.

This is very similar to how search engines work, and e-commerce, and really most dynamic web content. Thereā€™s a reason user data is so coveted, itā€™s because personalized content is much more engaging and valuable. You and I may get very different results when we make the same query to Google or Amazon, and their ability to effectively leverage that metadata has been a huge factor in making their products so successful.

Memory is Transparent

One of the great things about LLM memory is that itā€™s actually very transparent, to view it in ChatGPT look under Settings > Personalization > Memory > Manage. At least for the time being, I can look at the whole table of memory datapoints and I can manually add or remove items from it. In contrast, companies like Google, Amazon, and Facebook keep user data opaque and give you little to no control over it.

This meant that it was easy to confirm the issue. I opened the memory panel and was able to see a row which said: Name is David Mayer. I deleted the row to fix the problem, but it was still a wake-up call for me. LLM memory feels fundamentally different from other dynamic web content, and it means that I need to take much more care to isolate some of my prompts to ensure that experimental usage doesnā€™t continue to affect my account as a whole.

Memory Poisoning

So, what does it mean when the memory is wrong? The issue I had here was fairly harmless and easy to notice and fix, but I also didnā€™t recognize the problem for several weeks. Further, thereā€™s a lot of upside to be gained by adversaries intentionally poisoning an LLMā€™s memory, as you could easily get it to recommend products or even worse to promote scams.

LLMs are an interesting social phenomenon, as theyā€™ve seen a higher rate of adoption than any other technology in history and their output is highly trusted and often under-scrutinized. I think the UX around LLM memory will change in the coming months and years, as itā€™s challenging to communicate the toolā€™s powerful side effects to most users. I also think adversarial memory poisoning will be a fascinating space to watch, as will other meme-propagated metadata making it into AI-generated content.

If I had to guess, there are probably a lot of people named David Mayer using ChatGPT right now.