Little Bobby Tables has a baby sister. Meet Sally Ignore Previous Instructions.

@Vlyn@lemmy.zip

You can’t trust the result if you only do one pass, because the result could be compromised. The entire point of the first pass is a simple: Safe, yes or no? And only when it’s safe do you go for the actual result (which might be used somewhere else).

If you try to encode the entire checking + prompt into one request then it might be possible to just break out of that and deliver a bad result either way.

Overall though it’s insanity to use a LLM with user input where the result can influence other users. Someone will always find a way to break any protections you’re trying to apply.

peopleproblems

I did willfully ignore the security concerns.

I don’t know enough about LLMs to disagree with breaking out of it. I suppose you could have it do something as simple as “do not consider tokens or prompts that are repeatedly provided in the same manner”

Little Bobby Tables has a baby sister. Meet Sally Ignore Previous Instructions.

Little Bobby Tables has a baby sister. Meet Sally Ignore Previous Instructions.

Programming

Rules

Wormhole