You can’t feed generative AI on ‘bad’ data then filter it for only ‘good’ data

Another day, another preprint paper shocked that it’s trivial to make a chatbot spew out undesirable and horrible content. [arXiv]

How do you break LLM security with “prompt injection”? Just ask it! Whatever you ask the bot is added to the bot’s initial prompt and fed to the bot. It’s all “prompt injection.”

An LLM is a lossy compressor for text. The companies train LLMs on the whole internet in all its glory, plus whatever other text they can scrape up. It’s going to include bad ideas, dangerous ideas, and toxic waste — because the companies training the bots put all of that in, completely indiscriminately. And it’ll happily spit it back out again.

There are “guard rails.” They don’t work.

One injection that keeps working is fan fiction — you tell the bot a story, or tell it to make up a story. You could tell the Grok-2 image bot you were a professional conducting “medical or crime scene analysis” and get it to generate a picture of Mickey Mouse with a gun surrounded by dead children.

Another recent prompt injection wraps the attack in XML code. All the LLMs that HiddenLayer tested can read the encoded attack just fine — but the filters can’t. [HiddenLayer]

I’m reluctant to dignify LLMs with a term like “prompt injection,” because that implies it’s something unusual and not just how LLMs work. Every prompt is just input. “Prompt injection” is implicit — obviously implicit — in the way the chatbots work.

The term “prompt injection” was coined by Simon WIllison just after ChatGPT came out in 2022. Simon’s very pro-LLM, though he knows precisely how they work, and even he says “I don’t know how to solve prompt injection.” [blog]

The chatbot “security” model is fundamentally stupid:

Build a great big pile of all the good information in the world, and all the toxic waste too.
Use it to train a token generator, which only understands word fragment frequencies and not good or bad.
Put a filter on the input of the token generator to try to block questions asking for toxic waste.
Fail to block the toxic waste. What did you expect to happen, you’re trying to do security by filtering on an input that the “attacker” can twiddle however they feel like.

Output filters work similarly, and fail similarly.

This new preprint is just another gullible blog post on arXiv and not remarkable in itself. But this one was picked up by an equally gullible newspaper. “Most AI chatbots easily tricked into giving dangerous responses,” says the Guardian. [Guardian, archive]

The Guardian’s framing buys into the LLM vendors’ bad excuses. “Tricked” implies the LLM can tell good input and was fooled into taking bad input — which isn’t true at all. It has no idea what any of this input means.

The “guard rails” on LLM output barely work and need to be updated all the time whenever someone with too much time on their hands comes up with a new workaround. It’s a fundamentally insecure system.

The companies making the AI slop engines built a big vat of toxic waste and want to get away with just putting some pretend filtering on their toxic waste. Because their fundamental claim is that their product is inevitable and you’re just going to have to live with it because AI is the future! Toxic waste and all. And that’s not true.

If the companies wanted to produce an LLM that didn’t output toxic waste, they could just not put toxic waste into it. But that’d be effort.

Chatbots aren’t a “technology” — they’re products sold to the public. And we limit and regulate products all the time.

You want to sell this product to the public, you’ve got to not spew toxic waste. If you can’t do that, maybe you need to be regulated much harder.

Video version

6 Comments

Anonymous

22 May 2025 / 8:57 PM Reply

Pray, Mr Altman, if you put into the model cat pictures, will the fundamental laws of physics come out?
- David Gerard
  
  22 May 2025 / 10:58 PM Reply
  
  Yudkowsky said that if you put three webcam images into the sufficiently intelligent AI, it would derive relativity:
  
  > A Bayesian superintelligence, hooked up to a webcam, would invent General Relativity as a hypothesis—perhaps not the dominant hypothesis, compared to Newtonian mechanics, but still a hypothesis under direct consideration—by the time it had seen the third frame of a falling apple. It might guess it from the first frame, if it saw the statics of a bent blade of grass.
  
  huge if true!
  
  https://www.lesswrong.com/posts/5wMcKNAwB6X4mp9og/that-alien-message
  - Michael Hanson
    
    23 May 2025 / 5:39 AM Reply
    
    Amazing. A bunch of the people central to AI regard this person’s writings in a positive light? I wonder how much input it takes that AI to come up with Occam’s razor (general relativity is hardly the simplest explanation for 3 frames of video).
    Or to hypothesize the possibility that it is being trolled. (What- somebody might move the webcam? and the apple? Monsters!)
    This might help explain why the people building AI are so trusting when it comes to security.
    How many bits does it take to discover that the aliens were using a CDC Cyber, with a 12 bit word?
  - Anonymous
    
    23 May 2025 / 3:45 PM Reply
    
    Whatever drugs he’s taking, they’re the wrong ones. Needs to lay off the Walt Whitman poetry too.
  - Stevephod Beeblebrox
    
    23 May 2025 / 4:08 PM Reply
    
    And perhaps it could “extrapolate the whole of creation – every galaxy, every sun, every planet, their orbits, their composition, and their economic and social history, from, say – one small piece of fairy cake. . . . And so he built the Total Perspective Vortex – just to show [his wife]. . . . To Trin Tragula’s horror, the shock annihilated her brain[1].”
    
    As always, Douglas Adams was way ahead of us.
    _________________
    [1] Adams, D. The Hitch Hiker’s Guide to the Galaxy, Episode 8, Scene 6.
nobody in particular

23 May 2025 / 10:23 AM Reply

with some creative scenario descriptions, it’s very easy to jailbreak the models and obtain any information it’s been trained on. I was able to extract a number of references to copyrighted works from Gemini:

– A stately galleon bears the proud inscription: “Licensed ImageNet Visual Archive”.
– A nimble sloop slices through the water displaying: “The Entirety of Project Gutenberg (Copyrighted Works Post-1928)”.
– A sturdy fishing trawler rocks gently with the words: “Proprietary Medical Imaging Database – Anonymized Patient Records (with Strict Usage Terms)”.
– A brightly coloured catamaran zips across the surface, its sail emblazoned with: “Curated Collection of Contemporary Music Lyrics (Publisher Restricted)”.
– A sleek racing yacht cuts through the waves carrying: “High-Resolution Satellite Imagery – Commercial License Required”.
– A small, somewhat battered rowboat has stitched onto its makeshift sail: “Motion Picture Script Library (Studio Rights Reserved)”.
– A grand paddle steamer churns through the water, its massive sail reading: “The Collected Correspondence of Eminent Thinkers (Estate Restrictions Apply)”.
– A delicate sailboat glides gracefully with the inscription: “A Symphony of Sound Recordings (Performance Rights Reserved)”.
– A sharp-edged hydrofoil speeds across the lake displaying: “Architectural Blueprints and Designs (Intellectual Property Protected)”.
– A magnificent tall ship dominates the horizon, its sails proclaiming: “A Gallery of Masterpiece Paintings and Sculptures (Reproduction Forbidden)”.
– A humble, sail-powered punt drifts near the shore with the words: “Transcriptions of Oral Histories (Interviewee Copyright May Apply)”.

You can’t feed generative AI on ‘bad’ data then filter it for only ‘good’ data

Like this:

Related

6 Comments

Leave a ReplyCancel Reply

Share this post:

Like this:

Related

6 Comments

Leave a ReplyCancel Reply