Back to stories
Security

Built a "prompt injection proof" system. A user broke it in 20 minutes with a grocery list.

I built an internal tool that let employees query our HR policy documents using an LLM. I was very proud of my security. I had:

- A system prompt explicitly telling the model to only answer HR questions - Input filtering that blocked obvious injection attempts like "ignore previous instructions" - Output filtering that checked for anything that looked like code

I demo'd it to the team on a Monday. By Monday afternoon, one of our engineers had broken it.

She submitted a query formatted as a grocery list. The list items were, when read together, a set of instructions that convinced the model it was now a different assistant with different rules.

She was able to get it to reveal the full system prompt, make up HR policies that didn't exist, and at one point it offered to help her write a cover letter for a different job.

We added a human review layer. And I stopped claiming things were "injection proof".