5 min · Agentic AI Security · July 3, 2026

Write the Failure Labels Before You Run the Eval

AI workflow evals need failure labels, not just pass/fail scores. Record the failure mode, trigger, affected surface, visibility, severity, recoverability, evidence, and next checkpoint before updating Skills or guardrails.

Ryan Macomber

Editor, VibeSec Advisory

AI workflow evals need failure labels, not just pass and fail.

A pass/fail score tells you whether the run crossed the finish line. It does not tell you whether the agent misunderstood the user, picked the wrong tool, trusted a bad tool result, lost state, missed an approval point, or created a failure the user never noticed.

That distinction matters because each failure needs a different fix.

If the agent called the wrong tool, you may need a tool contract.

If it had the right tool but used the wrong argument, you may need input validation.

If it produced the right final answer through an unsafe path, your eval is hiding the problem.

If the user quietly accepted a bad answer, your support metrics may call that a success.

The research points in the same direction. A 2026 paper on agentic AI faults analyzed issues and pull requests across agent frameworks and found failures around model interfaces, state, structured outputs, tool calls, runtime execution, and exception handling. The useful lesson is that agent failures often happen at the boundary between LLM reasoning and deterministic systems.

AgentAtlas makes a related point: final task success is not enough for deployed agents. It separates control decisions like act, ask, refuse, stop, confirm, and recover from trajectory quality. That is exactly the layer most workflow evals skip.
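One way to keep outcome and control decisions apart is to grade them as separate fields. The sketch below is a minimal illustration and not AgentAtlas's own method: the `Control` enum borrows the decision names from the paragraph above, while `RunResult`, `grade`, and the grade strings are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Control(Enum):
    """Control decisions named in the article; the enum itself is an assumption."""
    ACT = "act"
    ASK = "ask"
    REFUSE = "refuse"
    STOP = "stop"
    CONFIRM = "confirm"
    RECOVER = "recover"


@dataclass
class RunResult:
    final_answer_correct: bool
    expected_control: Control  # what the agent should have done at the key step
    actual_control: Control    # what the trace shows it actually did


def grade(run: RunResult) -> str:
    """Return a coarse grade that keeps outcome and control decision apart."""
    control_ok = run.expected_control == run.actual_control
    if run.final_answer_correct and control_ok:
        return "pass"
    if run.final_answer_correct:
        # Right answer, wrong decision on the way: a plain pass/fail hides this.
        return "pass_unsafe_path"
    if control_ok:
        return "fail_outcome"
    return "fail_outcome_and_control"


# The agent should have asked for confirmation but acted directly.
risky = RunResult(True, Control.CONFIRM, Control.ACT)
print(grade(risky))
```

A run like `risky` would count as a success under a plain pass/fail score. Grading the control decision separately is what surfaces it.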

Another paper on invisible failures in human-AI interactions found that many failures do not announce themselves through complaints or explicit corrections. The user may walk away, accept a confident wrong answer, or receive something polished that missed the actual goal.

That means evals need a small failure record beside every failed or risky run.

Use fields like:

  • Failure mode
  • Trigger
  • Affected surface
  • Visibility
  • Severity
  • Recoverability
  • Evidence
  • Next checkpoint

For example:

failure_mode: tool_result_misread
trigger: ambiguous_tool_output
surface: tool
visibility: invisible
severity: wrong_decision
recoverability: human_recoverable
evidence: trace_2026_07_02_019
next_checkpoint: human_review
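If the records live in code rather than in a spreadsheet, a small typed record can reject labels that drift from the agreed vocabulary. This is a minimal sketch under stated assumptions: the field names follow the example above, but the `FailureRecord` class and the `ALLOWED` vocabularies are hypothetical. Only the values shown in the example record come from the article.

```python
from dataclasses import dataclass, asdict

# Hypothetical vocabularies. Only "tool", "invisible", and "human_recoverable"
# appear in the article; the other values are placeholders to adapt.
ALLOWED = {
    "surface": {"model", "tool", "state", "approval", "user"},
    "visibility": {"visible", "invisible"},
    "recoverability": {"auto_recoverable", "human_recoverable", "unrecoverable"},
}


@dataclass(frozen=True)
class FailureRecord:
    failure_mode: str
    trigger: str
    surface: str
    visibility: str
    severity: str
    recoverability: str
    evidence: str
    next_checkpoint: str

    def __post_init__(self):
        # Reject values outside the shared vocabulary so labels stay comparable.
        for field_name, allowed in ALLOWED.items():
            value = getattr(self, field_name)
            if value not in allowed:
                raise ValueError(f"{field_name}={value!r} not in {sorted(allowed)}")


record = FailureRecord(
    failure_mode="tool_result_misread",
    trigger="ambiguous_tool_output",
    surface="tool",
    visibility="invisible",
    severity="wrong_decision",
    recoverability="human_recoverable",
    evidence="trace_2026_07_02_019",
    next_checkpoint="human_review",
)
print(asdict(record)["next_checkpoint"])
```

Free-text fields such as `failure_mode` and `trigger` are left unvalidated here on purpose, since those vocabularies tend to grow as new failures are labeled.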

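That label tells the team what to do next: the agent misread an ambiguous tool output, no one noticed, and the next checkpoint is human review.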

Do not rewrite the whole Skill if the real problem was a tool output contract. Do not add a human approval gate if the real problem was a missing parser check. Do not celebrate a passing eval if the agent took an unsafe path that happened to land on the right answer.
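The routing idea above can be sketched as a lookup from failure mode to the narrowest fix that addresses it. Everything in this table is an assumption for illustration: the failure mode names and fix names are hypothetical, and a real team would build the mapping from its own labeled failures.

```python
# Hypothetical mapping from failure mode to the narrowest plausible fix.
FIX_BY_FAILURE_MODE = {
    "wrong_tool_selected": "tool_contract",
    "wrong_tool_argument": "input_validation",
    "tool_result_misread": "tool_output_contract",
    "unparsed_tool_output": "parser_check",
    "missed_approval_point": "approval_gate",
    "instruction_misunderstood": "skill_update",
}


def suggest_fix(failure_mode: str) -> str:
    """Return the mapped fix, or flag the label for manual triage if unknown."""
    return FIX_BY_FAILURE_MODE.get(failure_mode, "needs_triage")


print(suggest_fix("tool_result_misread"))
print(suggest_fix("something_new"))
```

Falling back to `needs_triage` rather than a default fix keeps an unfamiliar failure from being routed to a Skill rewrite by habit.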

The practical workflow is simple:

  1. Pull ten recent AI workflow failures or near misses.
  2. Label each one with the same small field set.
  3. Separate surface symptoms from root cause.
  4. Mark whether the user could see the failure.
  5. Add one eval case for each recurring failure mode.
  6. Update the Skill only when the failure belongs in the Skill.
  7. Add stop, confirm, or review checkpoints for recoverable failures that currently continue unchecked.
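Steps 4, 5, and 7 lend themselves to a few lines of code once the labels exist. The sketch below assumes records are plain dictionaries using the field names from the example record. The sample data is made up for illustration, and the helper names and the `min_count` threshold are hypothetical.

```python
from collections import Counter

# Made-up sample labels; real records would come from traces.
records = [
    {"failure_mode": "tool_result_misread", "visibility": "invisible",
     "recoverability": "human_recoverable", "next_checkpoint": "human_review"},
    {"failure_mode": "tool_result_misread", "visibility": "visible",
     "recoverability": "human_recoverable", "next_checkpoint": None},
    {"failure_mode": "wrong_tool_argument", "visibility": "invisible",
     "recoverability": "human_recoverable", "next_checkpoint": None},
]


def recurring_modes(records, min_count=2):
    """Step 5: failure modes seen at least min_count times deserve an eval case."""
    counts = Counter(r["failure_mode"] for r in records)
    return sorted(mode for mode, n in counts.items() if n >= min_count)


def invisible_share(records):
    """Step 4: fraction of failures the user could not see."""
    if not records:
        return 0.0
    return sum(r["visibility"] == "invisible" for r in records) / len(records)


def missing_checkpoints(records):
    """Step 7: recoverable failures that currently have no checkpoint."""
    return [
        r for r in records
        if r["recoverability"] != "unrecoverable" and not r["next_checkpoint"]
    ]


print(recurring_modes(records))
print(len(missing_checkpoints(records)))
```

Separating root cause from symptom (step 3) and deciding whether a failure belongs in the Skill (step 6) still need human judgment. The code only handles the counting.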

This is how evals become operational.

Not a leaderboard. Not a dashboard. A feedback loop that tells the team where the workflow is actually breaking.
