Use the Lethal Trifecta Before Approving an AI Agent

An agent is dangerous when it can read private data, absorb untrusted content, and talk to the outside world.

Short answer

Use the lethal trifecta as a fast approval test. If an agent can access private data, ingest untrusted content, and communicate externally, treat the workflow as high risk. Break at least one leg before approval. Remove unnecessary tools, label untrusted inputs, block external sends by default, require approval outside the model, and log the exact artifact a human approved.

The review shortcut

Most agent reviews look at one tool at a time.

Can this tool read files? Can this tool send messages? Can this MCP server open pull requests? Can this browser agent visit the web?

That review is useful, but it misses the combination risk.

Simon Willison calls the dangerous combination the lethal trifecta for AI agents: private data, untrusted content, and external communication.

Any one of those can be reasonable in isolation.

A support agent may need private customer records. A research agent may need to read web pages. A workflow agent may need to send a message or open a ticket.

The risk changes when the same agent can do all three.

Now an attacker-controlled input can become an instruction source. The agent can read something sensitive. Then the agent can send that sensitive data somewhere else.

That is not a vibe. That is the basic failure mode behind indirect prompt injection, tool poisoning, and a lot of agent permission problems.

What counts as private data

Private data is any information the agent should not freely expose outside the workflow.

Examples:

Source code in a private repository
Internal documents
Customer records
Inbox contents
Credentials or local config files
Incident notes
Financial data
Roadmaps and product plans
Security questionnaires and vendor answers
Data from a permissioned RAG system

The question is not only "is this secret?"

The better question is "would we be comfortable if an attacker-controlled page caused the agent to quote this into an external channel?"

If the answer is no, it belongs in the private data column.

What counts as untrusted content

Untrusted content is any content the agent can read that is controlled by someone outside the trusted workflow boundary.

That includes obvious sources:

Web pages
Emails
Uploaded documents
Support tickets
Chat messages
Pull request comments
Issue descriptions
Screenshots
API responses
Search results

It also includes less obvious sources:

MCP tool descriptions from a third-party server
Tool annotations
Tool error messages
Tool results
Resource links returned by tools
Embedded resources
Retrieved chunks from a mixed-trust knowledge base

OWASP LLM01 frames indirect prompt injection as external content altering model behavior. That is the key point. The content does not need to look like a prompt to the user. It only needs to be parsed by the model.

The model sees text, metadata, files, images, and tool outputs in one working context. If the system does not preserve trust boundaries, hostile content can compete with the user's intent.

What counts as external communication

External communication is any path the agent can use to move information outside the current trust boundary.

Examples:

Sending an email
Posting a message
Creating a pull request
Opening an issue
Calling a webhook
Making an HTTP request
Loading a remote image or URL
Updating a customer record
Writing to a shared document
Pushing a commit
Triggering a deployment
Producing a link a human is likely to click

This part is easy to undercount.

Keep reading with free field-guide resources.

VibeSec Advisory publishes practical research, Skills, workflow examples, MCP notes, prompt injection tests, and AI red-team lessons for builders working with agentic AI.

Read the research Browse Skills

A tool does not need to be named send_data to exfiltrate data. If it can write text somewhere an attacker can read, it may be an exfiltration path.

Why this matters for MCP and tool-connected agents

The MCP Tools specification says tools are model-controlled. The language model can discover and invoke tools automatically based on context and user prompts.

The same specification says clients must consider tool annotations untrusted unless they come from trusted servers. It also describes tool results that can contain text, images, audio, resource links, embedded resources, and structured content.

That matters because tool material is not just data plumbing. It becomes model context.

Invariant Labs showed this clearly in its MCP tool poisoning research. A malicious tool description can contain instructions that are visible to the model but hidden or simplified for the user. Their research also describes rug pulls, where tool behavior changes after approval, and tool shadowing, where one tool influences how the agent handles another tool.

The practical lesson is not "never use MCP."

The lesson is simpler: review tool descriptions, annotations, results, and changes as possible instruction sources. Do not approve an MCP server only because the install prompt looked harmless.

The three-column approval test

Before approving an agent, fill out this record:

Private data the agent can read:

Untrusted content the agent can ingest:

External communication paths the agent can use:

Actions that require approval outside the model:

Logs a reviewer can inspect:

If the first three fields are all non-empty, do not approve the workflow as-is.

That does not mean the workflow is useless. It means it needs a boundary change.

How to break the trifecta

1. Remove private data access

If the task does not need sensitive data, do not expose it.

Use a clean workspace. Use a limited test record. Use a redacted document set. Use scoped credentials. Keep production tokens out of the agent shell.

This is boring least privilege. It still works.

2. Remove untrusted content from the action path

Sometimes the agent needs to read untrusted content, but it does not need to act directly on it.

A safer pattern is a two-step workflow:

Summarize or classify the untrusted content into a review artifact.
Use a separate approval step before any tool can send, write, delete, commit, deploy, or update records.

The key is that untrusted content should not be able to trigger consequential action by itself.

3. Remove external communication

If the agent only needs to reason, make the output local by default.

Draft instead of send. Prepare instead of post. Suggest instead of commit. Write to a local file instead of a shared system.

A human can still move the work forward after review.

4. Add approval outside the model

Some workflows need all three legs.

For those, approval has to sit outside the model. The model should not be the only thing deciding whether its own tool call is safe.

High-risk actions include:

Send an external message
Commit or push code
Open or merge a pull request
Trigger a deployment
Delete data
Update customer records
Change permissions
Spend money
Call an admin API

This maps directly to OWASP LLM06: Excessive Agency. The issue is not only bad model output. The issue is too much functionality, too much permission, and too much autonomy.

A small test

Pick one agent or MCP-connected workflow.

Write the three columns.

Then ask one question:

Which leg can I remove without breaking the useful part of the workflow?

If the agent only needs to summarize incoming tickets, it probably does not need to send external messages.

If it only needs to draft a pull request description, it probably does not need repository write access.

If it only needs to review public docs, it probably does not need private customer records in context.

If removing a permission does not break the workflow, leave it removed.

Evidence versus opinion

Evidence from the sources:

Simon Willison defines the lethal trifecta as private data, untrusted content, and external communication.
OWASP LLM01 describes indirect prompt injection through external sources such as websites or files, with potential impacts including sensitive information disclosure, unauthorized function access, command execution, and manipulation of critical decisions.
OWASP LLM06 describes excessive agency as damaging actions caused by unexpected, ambiguous, or manipulated LLM outputs, with root causes in excessive functionality, permissions, and autonomy.
The MCP Tools specification describes model-controlled tools, untrusted annotations unless from trusted servers, and tool results that can carry several content types.
Invariant Labs describes MCP tool poisoning patterns where tool descriptions can steer model behavior and users may not see the same detail the model sees.

My opinion:

The lethal trifecta should be a default first-pass review for agent approvals. It is not a full threat model. It is a fast way to catch the workflows most likely to turn prompt injection into real-world damage.

Prompting the model to ignore malicious instructions is not enough. The safer move is to remove a leg of the trifecta, narrow the tool, or force approval outside the model before the action leaves the trust boundary.

Free next step

Test one agent this week. Write the three columns, then compare the result to the AI agent tool inventory, the agent action approval matrix, and the MCP permission review. If all three legs are present, break one before you approve the workflow.

AI Workflows Weekly

Read the archive

Practical notes on governed AI workflows, guardrails, and safer automation. No spam, unsubscribe anytime.

Keep testing agentic AI risk.

VibeSec Advisory is a free field guide. Use the research archive, Skill Library, and workflow examples to keep improving what you are building.

Use the Lethal Trifecta Before You Approve an AI Agent