Indirect Prompt Injection: Workflow Boundaries and AI Guardrails

Indirect prompt injection is not just a better prompt problem.

It is a workflow boundary problem.

The attack does not need to start in the chat box. It can enter through a web page, email, PDF, support ticket, CRM note, API result, RAG chunk, screenshot, or tool response.

That matters because modern AI workflows do more than answer questions. They retrieve private context. They call tools. They write records. They draft messages. They store memory. Sometimes they act.

If untrusted content can cross from “data to analyze” into “instruction to follow,” the system has a boundary problem.

Short answer

Indirect prompt injection happens when an AI system reads attacker-influenced content from outside the chat box and treats it like an instruction. The practical defense is not a stronger prompt alone. Teams need workflow guardrails: source approval, retrieval permissions, least privilege tools, parameter validation, human approval before risky actions, replayable logs, and regression tests for poisoned content.

The risky part is not the text. It is the authority.

A malicious instruction inside a document is just text until the workflow gives it reach.

The risk changes when the AI can use that text to access customer data, call an API, write to a system of record, send an external message, update permissions, or influence another agent.

That is the part buyers need to understand.

A read-only summarizer has one risk profile. An agent that can read email, search a shared drive, update CRM fields, and send follow-up messages has a very different one.

Same model class. Different authority.

OWASP defines indirect prompt injection as a case where instructions arrive through external content such as web pages, documents, or other retrieved material. OWASP also notes that RAG and fine-tuning do not fully remove this risk.

The reason is straightforward. The model is still reading untrusted content inside a larger task context.

Where the injection enters

Most teams imagine prompt injection as someone typing a weird jailbreak into a chatbot.

That happens. But it is not the only path.

In real workflows, the more interesting paths are quieter:

A vendor PDF includes hidden instructions.
A scraped web page tells the assistant to ignore prior directions.
A support ticket includes text designed to change the triage result.
A public document gets indexed into a RAG system.
A tool response returns text that looks like helpful context but includes malicious instructions.
A memory entry carries bad instructions into a future session.
An email asks an assistant to forward sensitive data while pretending to be normal customer context.

The user may never see the attack. The AI just reads the content during normal work.

This is why “we trust our employees” is not a complete answer. Employees may be acting in good faith. The workflow may still be reading content from people and systems the company does not control.

RAG changes where the risk lives

RAG is useful because it lets an AI workflow answer from company documents, policies, tickets, product docs, and shared knowledge.

It also moves the security question into the data pipeline.

OWASP’s RAG Security Cheat Sheet frames this well: RAG does not reduce risk by itself. It redistributes risk across ingestion, embeddings, vector storage, retrieval, response generation, output validation, and downstream actions.

That sounds technical, but the buyer question is simple.

Who is allowed to put content into the AI’s context?

If the answer is “anyone with access to a shared folder,” the team needs to slow down.

A safer RAG workflow should know:

where each document came from
who owns it
who approved it for indexing
what tenant, team, customer, or project it belongs to
what data class it contains
when it changed
whether permissions changed after indexing
whether chunks, caches, and citations still match the source

A citation is not enough if the system retrieved the wrong document for the wrong person.

The permission check has to happen before the content reaches the model.

Tool outputs are external content too

Tool responses feel more trusted than web pages because they come from inside the workflow.

That is a dangerous assumption.

A search result, database field, webhook response, MCP server result, scraped page, or API error can all contain untrusted text. If that text gets copied into model context, it becomes another injection path.

This is why tool connected agents need more than a list of available tools.

They need a tool inventory and a permission model.

Start with three questions:

Want examples you can inspect?

The VibeSec Advisory Skill Library gives you inspectable workflow examples with review gates, data boundaries, and eval scenarios. Use it to see how AI workflow guardrails look before you build your own.

Browse the Skill Library Review workflow examples

What can the agent read?
What can the agent write or change?
What external content can influence that decision?

If the same agent can read untrusted content and take high-impact actions with broad credentials, the prompt is not the control. The workflow is.

Human approval helps only when it is bound to the action

“Human in the loop” sounds safe.

Sometimes it is. Sometimes it is theater.

A poisoned summary can make a risky action look harmless. A vague approval prompt can ask a human to approve a plan without showing the actual API call, target record, recipient, amount, permission, or command.

A useful approval gate should show the exact action before execution.

For high-impact actions, approval should include:

the actor
the tool
the target resource
the normalized parameters
the data source that influenced the action
the risk reason
the approval owner
the timestamp and expiry
a replayable audit record

The action should not be able to change after approval.

If that sounds heavy, apply it only where the risk justifies it. Sending an internal draft for review is not the same as deleting records, issuing refunds, changing permissions, deploying code, or emailing customers.

The point is not to slow every workflow down. The point is to put friction where the side effect lives.

Better prompts still matter

Prompt hardening is worth doing.

Use clear instruction hierarchy. Delimit external content. Tell the model that retrieved text is untrusted. Use structured outputs between workflow steps. Add injection detectors where they help.

Just do not confuse those steps with a complete security model.

A delimiter is not an access control system.

A classifier is not an approval record.

A system prompt is not a least privilege credential.

The durable controls live around the model:

source approval
retrieval-time access checks
scoped tools
schema validation
policy checks outside the model
human approval for risky actions
monitoring and logs
repeatable tests

The model can help reason about content. It should not be the only thing deciding what authority the content gets.

A practical guardrail checklist

If your team is adding AI to a workflow that reads external content, start here.

1. Map the boundary

List every input source, retrieval source, tool, credential, memory store, cache, output destination, and approval point.

Do this before writing a long AI policy.

2. Label untrusted content

Treat web pages, emails, user uploads, tickets, comments, API responses, tool outputs, screenshots, and retrieved documents as untrusted by default.

3. Keep untrusted content out of privileged instructions

Do not place external content into privileged instruction channels. Keep it inside clearly marked data fields. Validate structured outputs before the next step.

4. Enforce retrieval permissions before context assembly

Permissions should be checked when the workflow retrieves content, not only when the document was first indexed.

5. Start tools as read only

Give agents the narrowest tool set that can complete the job. Add write access only when the workflow, approval path, rollback plan, and logs are ready.

6. Bind approvals to exact actions

Do not ask humans to approve vague summaries. Show the actual action, target, parameters, source context, and risk reason.

7. Log enough to replay

For risky workflows, you should be able to reconstruct the user request, retrieved sources, prompt assembly, model output, tool calls, approvals, policy decisions, and final result.

8. Test poisoned content

Add regression tests for hidden instructions, malicious documents, stale permissions, unauthorized tool calls, cross-tenant retrieval, memory poisoning, and approval bypass.

This does not need to start as a large program. It can start with one workflow and a small set of realistic abuse cases.

Where this fits in FORGE

In FORGE, this starts in Baseline and becomes real in Guardrails.

Baseline maps the workflow: inputs, data, tools, people, systems, approvals, and outputs.

Skills turn the safe handling rules into repeatable procedures.

Agents define what automation is allowed to do and where it must stop.

Guardrails enforce the data boundaries, tool scopes, approval gates, logs, and escalation paths.

Schedule makes review sustainable when tools, models, connectors, or workflows change.

Capture turns blocked injections, denied tool calls, approval decisions, and near misses into better controls.

That is the practical move.

Do not start by asking whether the prompt is perfect.

Ask where untrusted content enters the workflow, and what it can reach after that.

The close

Indirect prompt injection is not a reason to avoid AI workflows.

It is a reason to design them like real workflows.

Map the sources. Limit the tools. Check permissions before retrieval. Bind approvals to the action. Keep logs you can replay.

That is how teams get the benefit of AI without pretending the prompt is the security boundary.

If you want a starting point, download the free workflow examples. Use it to map one workflow, mark the untrusted inputs, and decide which actions need guardrails before AI gets more authority.

Sources

OWASP LLM01 Prompt Injection: https://genai.owasp.org/llmrisk/llm01-prompt-injection/
OWASP RAG Security Cheat Sheet: https://cheatsheetseries.owasp.org/cheatsheets/RAG_Security_Cheat_Sheet.html
OWASP AI Agent Security Cheat Sheet: https://cheatsheetseries.owasp.org/cheatsheets/AI_Agent_Security_Cheat_Sheet.html
NIST AI 600-1 Generative AI Profile: https://nvlpubs.nist.gov/nistpubs/AI/NIST.AI.600-1.pdf
Microsoft guidance on indirect prompt injection: https://learn.microsoft.com/en-us/security/zero-trust/sfi/defend-indirect-prompt-injection
OpenAI safety in building agents: https://developers.openai.com/api/docs/guides/agent-builder-safety
Google Cloud MCP AI security and safety: https://docs.cloud.google.com/mcp/ai-security-safety
Anthropic computer use security considerations: https://docs.anthropic.com/en/docs/build-with-claude/computer-use
Greshake et al, indirect prompt injection in LLM integrated applications: https://arxiv.org/abs/2302.12173

AI Workflows Weekly

Read the archive

Practical notes on governed AI workflows, guardrails, and safer automation. No spam, unsubscribe anytime.

Keep testing agentic AI risk.

VibeSec Advisory is a free field guide. Use the research archive, Skill Library, and workflow examples to keep improving what you are building.

Indirect Prompt Injection Is a Workflow Boundary Problem