Key Takeaways
- Prompt injection lets attackers embed hidden instructions in user input to override your AI system’s intended behaviour
- Agentic AI multiplies the risk exponentially — an agent with tools can cause real damage
- Indirect prompt injection (from external sources) is harder to detect and more common than direct injection
- Defence requires input validation, output monitoring, and architecture design — there’s no silver bullet
Key Takeaways
- Prompt injection lets attackers embed hidden instructions in user input to override your AI system’s intended behaviour
- Agentic AI multiplies the risk exponentially — an agent with tools can cause real damage
- Indirect prompt injection (from external sources) is harder to detect and more common than direct injection
- Defence requires input validation, output monitoring, and architecture design — there’s no silver bullet
What Prompt Injection Is (and Why It Matters)
Prompt injection is simple in concept, severe in practice. It’s when an attacker embeds hidden instructions in user input designed to override what the AI model is supposed to do.
There are hundreds of ways to try prompt injection. Your model has been trained to be helpful. And helpful systems struggle to distinguish between legitimate requests and attacks disguised as legitimate requests.
Direct vs. Indirect Prompt Injection
Most people think of prompt injection as direct: a user types a malicious prompt. That’s the visible risk.
But there’s a more dangerous version: indirect prompt injection. This is when the malicious prompt comes from a source the system trusts — a database, an API, a web page, an external integration.
Example: An attacker injects malicious instructions into a “customer notes” field. When the AI retrieves that record, it processes the injected instruction without realising it came from an untrusted source.
Indirect injection is harder to defend against because the attack doesn’t look like an attack. It looks like normal data from a normal source.
How Agentic AI Multiplies the Risk
An agentic AI system can take actions autonomously — execute code, call APIs, read and write files, make decisions without human approval. If prompt injection on a chatbot is a risk, prompt injection on an agentic system is a catastrophe.
The risk scales with what the agent can do:
- Read-only access? Risk is contained to information disclosure
- Database write access? Risk includes data modification and corruption
- Financial authority? Risk includes financial fraud
- Infrastructure control? Risk includes operational failure
Real Attack Scenarios
The Exfiltration Attack: A customer service agent is tricked into retrieving and displaying sensitive financial transaction history.
The Escalation Attack: An agentic system processing employee requests is manipulated into creating a new admin account with full system access.
The Resource Exhaustion Attack: An AI-powered query builder is prompted to execute resource-intensive database queries 1,000 times in parallel, causing denial-of-service.
The Supply Chain Attack: A compromised third-party API embeds malicious prompts in product descriptions that the AI processes as instructions.
The Lateral Movement Attack: An attacker uses a low-privileged AI system to call another API carrying authentication credentials, escalating their access.
Defence Against Prompt Injection
There’s no perfect defence. But these strategies reduce the risk significantly:
Input Validation
The first line of defence. Whitelist characters and formats where possible, enforce length limits, use pattern matching for known injection signatures, and employ semantic filtering tools like Rebuff.
Isolation and Least Privilege
Only give the system access to what it absolutely needs. Isolate the system so a compromise doesn’t cascade. Use API keys with minimal scopes. Require explicit approval for high-risk actions.
Output Filtering and Monitoring
The attack comes in through input, but the damage happens through output. Flag sensitive data outputs, monitor for instruction-like outputs, and track behavioural anomalies.
Separation of Concerns
Don’t give one system too many capabilities. Break your agent into smaller, purpose-built systems with limited authority. Financial decisions should require human approval. Account changes should require separate authorisation.
Architecture Design
- Sandboxing: Run the AI in an isolated environment
- Tool APIs: Create specific APIs instead of giving direct database access
- Approval workflows: Require human review for high-risk actions
- Rate limiting: Detect abnormal API call patterns
Detecting Prompt Injection in Progress
- Unusual prompts: System behaviour changes drastically or outputs unexpected information
- Repeated injection attempts: Patterns of suspicious input
- Output anomalies: The system generates code, instructions, or unrequested data
- Access pattern changes: The system tries to access unusual data or systems
- Performance degradation: Sudden slowness indicating resource exhaustion
Testing Your Current Systems
If you have AI systems in production right now:
- Inventory — List all your AI systems
- Threat assessment — What would happen if prompt injection succeeded?
- Defence audit — What defences does each system have?
- Gaps — Where are the holes?
- Prioritise — Which systems are highest risk?
- Test — Bring in someone to try prompt injection attacks
- Fix — Implement defences
This is where AI red teaming services become essential. You need skilled testers who understand both the attack vectors and your business context.