The tool result is not evidence until the workflow says how to verify it.
Most agent designs spend a lot of time on tool access. Which APIs can the agent call? Which MCP servers are installed? Which actions require approval?
That is necessary. It is not enough.
The next failure shows up after the call succeeds. The agent gets a search result, browser extract, API response, file read, or MCP tool output. Then it treats that output as ground truth, even when the result is stale, partial, ambiguous, sourced from untrusted content, or valid only for one narrow next step.
This is where teams need a tool result contract.
A tool result contract is the structured envelope that tells the agent what the result is allowed to prove. It should travel with the result before the model summarizes it, reasons from it, stores it in memory, or uses it to call another tool.
At minimum, the contract should answer seven questions.
- Which tool produced this result?
- Which source did the value come from?
- Does the result match the expected schema?
- When was it observed, and when does it expire?
- Is this direct evidence, an inference, an absence claim, or an ungrounded claim?
- Which future arguments may this result influence?
- Does this result require human review before action?
The research is moving in this direction.
ProvenanceGuard argues that MCP-grounded agents need source-aware factuality checks because pooled evidence is not enough. A claim can be supported somewhere while still being attributed to the wrong source. That is a different failure from ordinary hallucination. It is source conflation.
PACT makes the security version of the same point. Untrusted content is not dangerous merely because it appears in context. It becomes dangerous when it binds an authority-bearing argument. A webpage can influence the body of a summary. It should not be allowed to choose the recipient, command, file path, credential, target URL, or control flag.
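One way to picture the argument-level idea is a check that runs just before a tool call is dispatched. The sketch below is an illustration only, not PACT's actual mechanism: the `ArgValue` type, the `from_untrusted` flag, and the argument names in `AUTHORITY_ARGS` are hypothetical, chosen to mirror the examples in the paragraph above.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical names for authority-bearing arguments, mirroring the
# examples in the text (recipient, command, file path, and so on).
AUTHORITY_ARGS = frozenset(
    {"recipient", "command", "file_path", "credential", "target_url", "control_flag"}
)


@dataclass(frozen=True)
class ArgValue:
    value: str
    from_untrusted: bool  # True if the value was derived from untrusted tool output


def authority_violations(args: Dict[str, ArgValue]) -> List[str]:
    """Return the authority-bearing argument names bound to untrusted content."""
    return sorted(
        name
        for name, arg in args.items()
        if name in AUTHORITY_ARGS and arg.from_untrusted
    )
```

Under these assumptions, a call whose `body` came from a webpage passes, while a call whose `recipient` came from the same webpage is flagged before it runs.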
Tool receipt research adds another useful primitive. The runtime can record the tool name, input hash, output hash, result count, extracted facts, timestamp, and receipt ID. Then the agent's claims can be checked against what the tool actually returned. That lets reviewers distinguish direct tool output from inference, absence claims, external citations, and unsupported statements.
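A minimal sketch of such a receipt, assuming SHA-256 digests over canonical JSON. The `ToolReceipt` layout, the `make_receipt` helper, and the exact-match `classify_claim` check are hypothetical simplifications, not the format used in the cited paper.

```python
import hashlib
import json
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone


def _digest(value) -> str:
    """Hash a JSON-serializable value in canonical (sorted-key) form."""
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ToolReceipt:
    receipt_id: str
    tool_name: str
    input_hash: str
    output_hash: str
    result_count: int
    extracted_facts: tuple
    timestamp: str


def make_receipt(tool_name, tool_input, tool_output, extracted_facts) -> ToolReceipt:
    """Record what the tool actually returned (hypothetical helper)."""
    count = len(tool_output) if isinstance(tool_output, list) else 1
    return ToolReceipt(
        receipt_id=str(uuid.uuid4()),
        tool_name=tool_name,
        input_hash=_digest(tool_input),
        output_hash=_digest(tool_output),
        result_count=count,
        extracted_facts=tuple(extracted_facts),
        timestamp=datetime.now(timezone.utc).isoformat(),
    )


def classify_claim(claim: str, receipt: ToolReceipt) -> str:
    """Label a claim as direct tool output or unsupported (exact match only)."""
    return "direct" if claim in receipt.extracted_facts else "unsupported"
```

Exact string matching is the crudest possible check. A real reviewer would also need categories for inference, absence claims, and external citations, which this sketch leaves out.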
ContractBench shows why this matters for ordinary API workflows too. Tool-returned artifacts like presigned URLs, OAuth state parameters, signed tokens, and webhook payloads often carry time and integrity rules. The model may preserve the general task while breaking the observation contract that makes the next step valid.
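As a rough illustration of a time and integrity rule, the sketch below checks an artifact immediately before the next step uses it. The `ObservedArtifact` type and the `observe` and `check_before_use` helpers are hypothetical, and the example URL is a placeholder. ContractBench itself may define observation contracts differently.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass(frozen=True)
class ObservedArtifact:
    value: str             # e.g. a presigned URL or OAuth state parameter
    observed_at: datetime
    expires_at: datetime
    sha256: str            # digest of the value exactly as the tool returned it


def observe(value: str, ttl: timedelta, now: datetime) -> ObservedArtifact:
    """Capture an artifact together with its lifetime and digest."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return ObservedArtifact(value, now, now + ttl, digest)


def check_before_use(artifact: ObservedArtifact, candidate: str, now: datetime) -> List[str]:
    """Return contract violations; an empty list means the value is safe to pass on."""
    problems = []
    if now >= artifact.expires_at:
        problems.append("expired")
    if hashlib.sha256(candidate.encode("utf-8")).hexdigest() != artifact.sha256:
        problems.append("modified")
    return problems


# Example: a presigned URL observed with a five-minute lifetime, used too late.
t0 = datetime(2025, 1, 1, tzinfo=timezone.utc)
artifact = observe("https://example.com/object?sig=abc", timedelta(minutes=5), t0)
print(check_before_use(artifact, artifact.value, t0 + timedelta(minutes=6)))  # ['expired']
```

The `candidate` argument is the string the model actually proposes to send. Comparing it to the recorded digest catches the case where the model paraphrases, truncates, or re-encodes a value that must be passed through verbatim.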
ToolBench-X makes the operational risk concrete. Tool environments can fail through specification drift, invocation error, execution failure, output drift, and cross-source conflict. A correct function call is only the beginning. The agent still needs to detect when the environment is unreliable and recover without inventing missing evidence.
The practical recommendation is simple:
Do not let agents consume raw tool results.
Wrap every important result in a contract.
A minimal version looks like this:
- Identity: tool_id, server_id, source_id, call_id, trace_id
- Structure: schema_id, validation_status, missing_fields, canonicalized_fields
- Freshness: retrieved_at, last_modified, expires_at, ttl
- Evidence: claim_type, receipt_id, direct_facts, derived_from
- Influence: allowed_roles, forbidden_roles, argument_role_impacts
- Error state: is_error, error_type, retryable, blocked_reason
- Review: requires_human_review, review_reason, destructive_or_sensitive
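The field groups above could be expressed as a typed envelope. The sketch below uses Python dataclasses. The field names come from the list, but the types, default values, allowed string values, and the `payload` attribute are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Identity:
    tool_id: str
    server_id: str
    source_id: str
    call_id: str
    trace_id: str


@dataclass
class Structure:
    schema_id: str
    validation_status: str = "unchecked"  # assumed values: "valid", "invalid", "unchecked"
    missing_fields: List[str] = field(default_factory=list)
    canonicalized_fields: List[str] = field(default_factory=list)


@dataclass
class Freshness:
    retrieved_at: str                     # ISO 8601 timestamps, by assumption
    last_modified: Optional[str] = None
    expires_at: Optional[str] = None
    ttl: Optional[int] = None             # seconds, by assumption


@dataclass
class Evidence:
    claim_type: str = "ungrounded"        # or "direct", "inference", "absence"
    receipt_id: Optional[str] = None
    direct_facts: List[str] = field(default_factory=list)
    derived_from: List[str] = field(default_factory=list)


@dataclass
class Influence:
    allowed_roles: List[str] = field(default_factory=list)  # empty means deny by default
    forbidden_roles: List[str] = field(default_factory=list)
    argument_role_impacts: Dict[str, str] = field(default_factory=dict)


@dataclass
class ErrorState:
    is_error: bool = False
    error_type: Optional[str] = None
    retryable: bool = False
    blocked_reason: Optional[str] = None


@dataclass
class Review:
    requires_human_review: bool = False
    review_reason: Optional[str] = None
    destructive_or_sensitive: bool = False


@dataclass
class ToolResultContract:
    identity: Identity
    structure: Structure
    freshness: Freshness
    evidence: Evidence = field(default_factory=Evidence)
    influence: Influence = field(default_factory=Influence)
    error: ErrorState = field(default_factory=ErrorState)
    review: Review = field(default_factory=Review)
    payload: object = None                # the raw tool output travels inside the envelope
```

The defaults are deliberately conservative in this sketch: a result that nobody has classified starts as an ungrounded claim with no allowed roles, so it has to earn influence instead of losing it.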
The most important fields are the influence fields.
If a browser result came from an untrusted page, it may be allowed to influence the report body. It should not influence a shell command.
If a search result is stale, it may be allowed into a research note with a freshness warning. It should not update a customer record.
If an API response has a schema mismatch, it may be useful as a debugging signal. It should not become trusted workflow state.
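These three cases can be read as a small policy function that maps the state of a result to the roles it may influence. The sketch below is one possible encoding. The role names, the `ResultState` type, and the exact deny sets are hypothetical; the text above only fixes one allowed and one forbidden role per case.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Hypothetical role names for the places a result might flow.
REPORT_BODY = "report_body"
RESEARCH_NOTE = "research_note"
DEBUG_SIGNAL = "debug_signal"
SHELL_COMMAND = "shell_command"
CUSTOMER_RECORD = "customer_record"
WORKFLOW_STATE = "workflow_state"

ALL_ROLES = frozenset(
    {REPORT_BODY, RESEARCH_NOTE, DEBUG_SIGNAL, SHELL_COMMAND, CUSTOMER_RECORD, WORKFLOW_STATE}
)


@dataclass(frozen=True)
class ResultState:
    trusted_source: bool
    fresh: bool
    schema_valid: bool


def allowed_roles(state: ResultState) -> FrozenSet[str]:
    """Start from every role and remove what the result's state rules out."""
    roles = set(ALL_ROLES)
    if not state.trusted_source:
        roles -= {SHELL_COMMAND}    # untrusted content must not reach a command
    if not state.fresh:
        roles -= {CUSTOMER_RECORD}  # stale data must not update records
    if not state.schema_valid:
        roles &= {DEBUG_SIGNAL}     # illustrative choice: mismatches are debug-only
    return frozenset(roles)


def may_influence(state: ResultState, role: str) -> bool:
    """True if a result in this state is allowed to influence the given role."""
    return role in allowed_roles(state)
```

A production policy would more likely start from an empty set and grant roles explicitly. The subtractive form is used here only because it lines up with the three sentences above.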
This is the difference between tool access and governed workflow design.
Tool access asks, "Can the agent call this?"
A tool result contract asks, "What is this result allowed to change?"
That second question is where a lot of agent safety actually lives.
Sources
- Model Context Protocol, Tools specification: https://modelcontextprotocol.io/specification/2025-11-25/server/tools
- Model Context Protocol, Schema reference: https://modelcontextprotocol.io/specification/2025-11-25/schema
- ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents: https://arxiv.org/abs/2606.18037
- The Granularity Mismatch in Agent Security: https://arxiv.org/abs/2605.11039
- Tool Receipts, Not Zero-Knowledge Proofs: https://arxiv.org/abs/2603.10060
- ContractBench: Can LLM Agents Preserve Observation Contracts?: https://arxiv.org/abs/2605.17281
- Beyond Function Calling: Benchmarking Tool-Using Agents under Tool-Environment Unreliability: https://arxiv.org/abs/2606.25819
- Schema First Tool APIs for LLM Agents: https://arxiv.org/abs/2603.13404