Testing Your Agent
This guide walks through the complete workflow for evaluating an AI agent against an Ashr Labs dataset. It covers everything from wrapping your agent in the SDK protocol, to running the eval, to submitting results.
Overview
The eval workflow has three stages:
- Get a dataset — fetch an existing one or generate a new one
- Run the eval — EvalRunner iterates scenarios, calls your agent, compares results
- Submit results — deploy the run to the Ashr Labs dashboard
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Get Dataset │ ──> │ EvalRunner │ ──> │ Deploy Run │
│ │ │ .run(agent) │ │ │
└──────────────┘ └──────────────┘ └──────────────┘
The 3-Line Version
If you already have a dataset and an agent, here's the entire eval:
from ashr_labs import AshrLabsClient, EvalRunner
client = AshrLabsClient(api_key="tp_your_key_here")
runner = EvalRunner.from_dataset(client, dataset_id=322)
runner.run_and_deploy(my_agent, client, dataset_id=322)
The rest of this guide explains what's happening under the hood and how to customize every step.
Step 1: Wrap Your Agent
EvalRunner works with any object that has respond() and reset() methods. This is defined as the Agent protocol — no base class to inherit from, no SDK dependency in your agent code.
The Agent Protocol
class Agent(Protocol):
def respond(self, message: str) -> dict[str, Any]:
"""Process a user message and return the agent's response.
Returns:
{
"text": str, # The agent's text response
"tool_calls": [ # All tool calls made during this turn
{
"name": str,
"arguments": dict, # Tool arguments as a dict
},
...
]
}
"""
...
def reset(self) -> None:
"""Clear conversation state for a new scenario."""
...
How Tool Calls Are Logged
The agent is responsible for collecting its own tool calls during the respond() call and returning them in the response dict. The SDK does not intercept or instrument tool execution — it only consumes whatever the agent reports.
During a single respond() call, your agent may:
- Call the LLM
- Get back tool use requests
- Execute tools and feed results back to the LLM
- Repeat steps 2-3 multiple times (tool loops)
- Finally get a text response
Throughout this loop, accumulate every tool call into a list and return it alongside the final text.
Full Example: Customer Support Agent
Here's a complete tool-calling agent built on the Anthropic API. This is the agent we use in our own evals — it handles order lookups, inventory checks, and refund processing.
import json
from anthropic import Anthropic
SYSTEM_PROMPT = """You are a helpful customer support agent for ShopWave.
You help customers check order status, look up product availability,
and process refunds. Always be polite and concise."""
TOOLS = [
{
"name": "lookup_order",
"description": "Look up the status and details of a customer order.",
"input_schema": {
"type": "object",
"properties": {
"order_id": {
"type": "string",
"description": "The order ID (e.g. ORD-12345)",
}
},
"required": ["order_id"],
},
},
{
"name": "check_inventory",
"description": "Check availability of a product.",
"input_schema": {
"type": "object",
"properties": {
"product_name": {
"type": "string",
"description": "The product name or SKU",
}
},
"required": ["product_name"],
},
},
{
"name": "process_refund",
"description": "Initiate a refund for an order.",
"input_schema": {
"type": "object",
"properties": {
"order_id": {"type": "string", "description": "The order ID"},
"reason": {"type": "string", "description": "Reason for refund"},
},
"required": ["order_id", "reason"],
},
},
]
def execute_tool(name: str, args: dict) -> str:
"""Your tool implementations. Replace with real logic."""
if name == "lookup_order":
return json.dumps({"order_id": args["order_id"], "status": "shipped"})
elif name == "check_inventory":
return json.dumps({"product": args["product_name"], "in_stock": True})
elif name == "process_refund":
return json.dumps({"refund_id": "REF-001", "status": "processed"})
return json.dumps({"error": f"Unknown tool: {name}"})
class SupportAgent:
"""A tool-calling customer support agent."""
def __init__(self, api_key: str):
self.client = Anthropic(api_key=api_key)
self.conversation: list[dict] = []
def reset(self) -> None:
"""Clear conversation history for a new scenario."""
self.conversation = []
def respond(self, user_message: str) -> dict:
"""Send a user message and return the agent's full response.
The key contract: collect ALL tool calls made during this turn
and return them alongside the final text response.
"""
self.conversation.append({"role": "user", "content": user_message})
all_tool_calls = [] # <-- Accumulate tool calls here
for _ in range(10): # Max iterations for tool loops
response = self.client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
system=SYSTEM_PROMPT,
tools=TOOLS,
messages=self.conversation,
)
# Separate text and tool use blocks
text_parts = []
tool_uses = []
for block in response.content:
if block.type == "text":
text_parts.append(block.text)
elif block.type == "tool_use":
tool_uses.append(block)
# No tool calls — we're done
if not tool_uses:
self.conversation.append(
{"role": "assistant", "content": response.content}
)
return {
"text": "\n".join(text_parts),
"tool_calls": all_tool_calls,
}
# Execute tools and continue the loop
self.conversation.append(
{"role": "assistant", "content": response.content}
)
tool_results = []
for tool_use in tool_uses:
result_str = execute_tool(tool_use.name, tool_use.input)
# Record every tool call with its name and arguments
all_tool_calls.append({
"name": tool_use.name,
"arguments": tool_use.input,
})
tool_results.append({
"type": "tool_result",
"tool_use_id": tool_use.id,
"content": result_str,
})
self.conversation.append({"role": "user", "content": tool_results})
if response.stop_reason == "end_turn":
return {
"text": "\n".join(text_parts),
"tool_calls": all_tool_calls,
}
return {"text": "[Max iterations reached]", "tool_calls": all_tool_calls}
Minimal Agent (No Tools)
If your agent doesn't use tools, the wrapper is simpler:
class SimpleAgent:
def __init__(self, llm_client):
self.client = llm_client
self.history = []
def reset(self) -> None:
self.history = []
def respond(self, message: str) -> dict:
self.history.append({"role": "user", "content": message})
response = self.client.chat(messages=self.history)
self.history.append({"role": "assistant", "content": response.text})
return {"text": response.text, "tool_calls": []}
arguments vs arguments_json — Important Serialization Note
The Agent protocol's respond() method returns tool call arguments as a dict:
{"name": "lookup_order", "arguments": {"order_id": "ORD-123"}}
But internally, RunBuilder and the API store them as a JSON string under arguments_json:
{"name": "lookup_order", "arguments_json": "{\"order_id\": \"ORD-123\"}"}
If you use EvalRunner, this is handled automatically — it serializes arguments to arguments_json when recording results (see eval.py:187-193).
If you use RunBuilder directly (the manual flow), you need to pass arguments_json as a JSON string, not arguments as a dict:
import json
# Correct — RunBuilder expects arguments_json (string)
test.add_tool_call(
expected={"name": "lookup_order", "arguments_json": json.dumps({"order_id": "ORD-123"})},
actual={"name": "lookup_order", "arguments_json": json.dumps({"order_id": "ORD-123"})},
match_status="exact",
)
# Also works — the comparators handle both formats via extract_tool_args()
test.add_tool_call(
expected={"name": "lookup_order", "arguments": {"order_id": "ORD-123"}},
actual={"name": "lookup_order", "arguments_json": json.dumps({"order_id": "ORD-123"})},
match_status="exact",
)
The extract_tool_args() helper normalizes both formats, so comparators work regardless. But the data stored in the run result will use whichever format you pass to add_tool_call().
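The normalization idea can be sketched in plain Python. This is a simplified stand-in for illustration (`normalize_tool_args` is a hypothetical helper, not the SDK's `extract_tool_args`):

```python
import json
from typing import Any

def normalize_tool_args(tool_call: dict[str, Any]) -> dict[str, Any]:
    """Return tool arguments as a dict, whether the call carries
    'arguments' (a dict) or 'arguments_json' (a JSON string)."""
    if "arguments" in tool_call:
        return tool_call["arguments"]
    raw = tool_call.get("arguments_json")
    if raw:
        return json.loads(raw)
    return {}

# Both formats normalize to the same dict
a = normalize_tool_args({"name": "lookup_order", "arguments": {"order_id": "ORD-123"}})
b = normalize_tool_args({"name": "lookup_order", "arguments_json": '{"order_id": "ORD-123"}'})
assert a == b == {"order_id": "ORD-123"}
```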
Verifying Your Agent Satisfies the Protocol
You can use Python's isinstance check at runtime thanks to @runtime_checkable:
from ashr_labs import Agent
agent = SupportAgent(api_key="sk-...")
assert isinstance(agent, Agent), "Agent doesn't implement the protocol"
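To see what `@runtime_checkable` buys you without the SDK installed, here is a self-contained sketch using a stand-in protocol (`AgentProtocol` and `EchoAgent` are illustrative; the real protocol is `ashr_labs.Agent`):

```python
from typing import Any, Protocol, runtime_checkable

@runtime_checkable
class AgentProtocol(Protocol):
    def respond(self, message: str) -> dict[str, Any]: ...
    def reset(self) -> None: ...

class EchoAgent:
    """Structurally satisfies the protocol — no inheritance needed."""
    def respond(self, message: str) -> dict:
        return {"text": message, "tool_calls": []}
    def reset(self) -> None:
        pass

class NotAnAgent:
    pass

assert isinstance(EchoAgent(), AgentProtocol)
assert not isinstance(NotAnAgent(), AgentProtocol)
```

Note that runtime protocol checks only verify the methods exist — they do not check signatures or return types, so a passing `isinstance` is necessary but not sufficient.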
Step 2: Get a Dataset
Option A: Fetch an Existing Dataset
from ashr_labs import AshrLabsClient
client = AshrLabsClient(api_key="tp_your_key_here")
# Fetch by ID
dataset = client.get_dataset(dataset_id=322)
source = dataset["dataset_source"]
# Quick summary
runs = source.get("runs", {})
total_actions = sum(len(s.get("actions", [])) for s in runs.values())
print(f"Dataset #{dataset['id']}: {len(runs)} scenarios, {total_actions} actions")
Option B: Generate a New Dataset
Use generate_dataset() — it creates the request, polls until complete, and fetches the result in one call:
dataset_id, source = client.generate_dataset(
request_name="ShopWave Support Eval",
config={
"metadata": {
"dataset_name": "ShopWave Support Eval",
"description": "Customer support scenarios with tool calling",
},
"agent": {
"name": "ShopWave Support Agent",
"description": "Helps customers with orders, inventory, and refunds",
"system_prompt": "You are a helpful support agent for ShopWave.",
"tools": [
{
"name": "lookup_order",
"description": "Look up order status",
"parameters": {
"type": "object",
"properties": {
"order_id": {"type": "string"}
},
"required": ["order_id"],
},
},
# ... more tools
],
"accepted_inputs": {"text": True, "audio": False, "file": False,
"image": False, "video": False},
"output_format": {"type": "text"},
},
"context": {
"domain": "ecommerce",
"use_case": "Customers contacting support about orders and refunds",
"scenario_context": "An online retail store called ShopWave",
},
"test_config": {
"num_variations": 25,
"variation_strategy": "balanced",
"coverage": {
"happy_path": True,
"edge_cases": True,
"error_handling": True,
"multi_turn": True,
},
},
"generation_options": {
"generate_audio": False,
"generate_files": False,
"generate_simulations": False,
},
},
timeout=600,
)
print(f"Generated dataset #{dataset_id}")
If you need more control over the polling (e.g. to show progress), use the lower-level methods:
req = client.create_request(request_name="My Eval", request=config)
completed = client.wait_for_request(req["id"], timeout=600, poll_interval=5)
# Then fetch the dataset manually via client.list_datasets() / client.get_dataset()
Step 3: Run the Eval
Basic Run
from ashr_labs import EvalRunner
runner = EvalRunner(source) # source = dataset["dataset_source"]
run = runner.run(agent)
That's it. EvalRunner.run() handles the full eval loop:
- Iterates every scenario in source["runs"]
- Resets the agent at the start of each scenario
- For each actor == "user" action: calls agent.respond(content)
- For each actor == "agent" action: compares expected tool calls and text against the agent's actual response
- Returns a populated RunBuilder with all results recorded
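The loop above can be sketched in a few lines of plain Python. This is a simplified illustration (`run_scenario` and `EchoAgent` are made up for this sketch; the real EvalRunner also manages the tool pool and records results into a RunBuilder):

```python
def run_scenario(agent, scenario: dict) -> list[dict]:
    """Reset, replay user turns, and pair each expected agent
    action with the latest actual response."""
    agent.reset()
    comparisons = []
    last = {"text": "", "tool_calls": []}
    for action in scenario["actions"]:
        if action["actor"] == "user":
            last = agent.respond(action["content"])
        elif action["actor"] == "agent":
            comparisons.append({
                "expected": action.get("expected_response", {}),
                "actual": last,
            })
    return comparisons

class EchoAgent:
    """Trivial agent for demonstration."""
    def reset(self) -> None:
        pass
    def respond(self, message: str) -> dict:
        return {"text": message, "tool_calls": []}

scenario = {"actions": [
    {"actor": "user", "content": "Hi, where is order ORD-123?"},
    {"actor": "agent", "content": "Let me check.",
     "expected_response": {"text": "Let me check.", "tool_calls": []}},
]}
results = run_scenario(EchoAgent(), scenario)
assert len(results) == 1
```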
Or Use from_dataset to Skip the Fetch
runner = EvalRunner.from_dataset(client, dataset_id=322)
run = runner.run(agent)
Submitting and Waiting for Grading
All scoring is performed server-side after deploy(). The backend uses LLM-based semantic matching for tool arguments and embedding similarity for text responses, which is more accurate than local heuristics.
# Submit results
created = run.deploy(client, dataset_id=322)
print(f"Run #{created['id']} submitted")
# Wait for server-side grading to complete (typically 1-3 minutes)
graded = client.poll_run(created["id"])
metrics = graded["result"]["aggregate_metrics"]
print(f"Total tests: {metrics['total_tests']}")
print(f"Passed: {metrics['tests_passed']}")
print(f"Failed: {metrics['tests_failed']}")
print(f"Tool divergences: {metrics['total_tool_call_divergence']}")
print(f"Text divergences: {metrics['total_response_divergence']}")
You can also pass a callback to poll_run to show progress:
graded = client.poll_run(
created["id"],
timeout=300,
on_poll=lambda elapsed, r: print(f" Grading in progress ({elapsed}s)..."),
)
Running Scenarios in Parallel
By default, scenarios run sequentially. Pass max_workers to run multiple scenarios concurrently using threads — each scenario gets its own deep copy of the agent:
# Run up to 4 scenarios at a time
run = runner.run(agent, max_workers=4)
This can significantly speed up evals when your agent spends most of its time waiting on LLM API calls. Actions within each scenario still run sequentially (since they depend on each other), but independent scenarios run in parallel.
# Also works with run_and_deploy
created = runner.run_and_deploy(agent, client, dataset_id=322, max_workers=4)
Important: max_workers > 1 requires a deep-copyable agent. Each parallel worker creates a copy.deepcopy(agent). Most LLM clients (Anthropic, OpenAI) hold connection pools and thread-local state that cannot be deep-copied — this will fail with a clear error message. Use max_workers=1 (the default) unless your agent implements __deepcopy__ to create fresh clients. If a scenario raises an exception during parallel execution, it's recorded as a failed test and the remaining scenarios continue.
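If you do want parallel runs, one approach is to implement `__deepcopy__` so each worker gets a freshly constructed client rather than a copy of a live one. A minimal sketch under that assumption (`FakeClient` and `MyAgent` are illustrative stand-ins):

```python
import copy

class FakeClient:
    """Stand-in for an LLM client that can't be deep-copied
    (connection pools, locks, thread-local state)."""
    def __deepcopy__(self, memo):
        raise TypeError("cannot deepcopy a live client")

class MyAgent:
    def __init__(self, api_key: str):
        self.api_key = api_key
        self.client = FakeClient()
        self.conversation: list[dict] = []

    def __deepcopy__(self, memo):
        # Build a fresh agent (and fresh client) instead of copying
        # the live client.
        fresh = MyAgent(self.api_key)
        fresh.conversation = copy.deepcopy(self.conversation, memo)
        return fresh

agent = MyAgent(api_key="sk-test")
clone = copy.deepcopy(agent)
assert clone is not agent and clone.client is not agent.client
```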
Submitting in One Call
runner = EvalRunner.from_dataset(client, dataset_id=322)
created = runner.run_and_deploy(agent, client, dataset_id=322)
# Wait for grading
graded = client.poll_run(created["id"])
print(f"Passed: {graded['result']['aggregate_metrics']['tests_passed']}")
Step 4: Add Progress Callbacks
EvalRunner.run() accepts optional callbacks to monitor progress:
def on_scenario(scenario_id, scenario):
title = scenario.get("title", scenario_id)
n = len(scenario.get("actions", []))
print(f"\n── Scenario: {title} ({n} actions) ──")
def on_action(index, action):
actor = action.get("actor", "?")
content = action.get("content", "")
preview = content[:80] + ("..." if len(content) > 80 else "")
print(f" [{index}] {actor}: {preview}")
run = runner.run(
agent,
on_scenario=on_scenario,
on_action=on_action,
)
Environment Actions
Some datasets include actor == "environment" actions — these represent external events like tool results from third-party systems, webhook callbacks, or simulated system responses. By default, environment actions are skipped.
To handle them, pass an on_environment callback. It receives the action content and the full action dict. Return a dict with "text" and/or "tool_calls" to update the agent's state for subsequent comparisons:
def handle_environment(content, action):
"""Feed environment context to the agent so it can respond."""
return agent.respond(content)
run = runner.run(agent, on_environment=handle_environment)
If you return None (or don't provide the callback), the environment action is ignored and the agent's previous response carries forward.
With the progress callbacks from the start of this step, output looks like:
── Scenario: Customer asks about delayed order (4 actions) ──
[0] user: Hi, I placed an order last week (ORD-54321) and it still hasn't arrive...
[1] agent: Let me look up your order right away.
[2] user: Can you also check if the wireless headphones are back in stock?
[3] agent: I've checked both — here's what I found.
How Tool Matching Works
Understanding how EvalRunner compares expected vs actual tool calls is important for interpreting your results.
The Tool Pool
When agent.respond() is called on a user action, the returned tool_calls list becomes the tool pool for that turn. As the runner encounters expected tool calls in subsequent agent actions, it matches them by name and pops matched tools from the pool.
This means:
- Tool calls persist across multiple agent actions within a single user turn
- Each expected tool can only match one actual tool (first match wins)
- Unmatched expected tools are recorded as "mismatch" with "NOT_CALLED"
User says: "Refund ORD-123 — it arrived damaged"
Agent responds with tool_calls: [lookup_order, process_refund]
↓ tool pool
Agent action 1 expects: lookup_order → ✓ matched, popped from pool
Agent action 2 expects: process_refund → ✓ matched, popped from pool
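The pool behavior can be sketched as follows (`match_expected_tools` is a hypothetical helper that mirrors the rules above, not the SDK's implementation):

```python
def match_expected_tools(pool: list[dict], expected: list[dict]) -> list[tuple]:
    """First-match-wins pool matching: each expected tool consumes
    at most one actual call with the same name."""
    pool = list(pool)  # don't mutate the caller's list
    results = []
    for exp in expected:
        match = next((t for t in pool if t["name"] == exp["name"]), None)
        if match is not None:
            pool.remove(match)  # popped — can't match again
            results.append((exp["name"], "matched"))
        else:
            results.append((exp["name"], "NOT_CALLED"))
    return results

pool = [{"name": "lookup_order"}, {"name": "process_refund"}]
matches = match_expected_tools(pool, [
    {"name": "lookup_order"},
    {"name": "process_refund"},
    {"name": "lookup_order"},  # pool already drained
])
assert matches == [("lookup_order", "matched"),
                   ("process_refund", "matched"),
                   ("lookup_order", "NOT_CALLED")]
```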
Tool Argument Comparison
For matched tools, arguments are compared using compare_tool_args():
"exact"— all expected arguments match (string args compared fuzzily)"partial"— at least one argument matches, but not all"mismatch"— no arguments match
String arguments use fuzzy matching: lowercased, punctuation stripped, word-overlap with adaptive thresholds (0.35 for short strings, up to 0.55 for longer ones). This means "Customer wants a refund" and "customer wants refund" are considered matching.
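A rough sketch of this style of word-overlap matching (illustrative only — the SDK's `fuzzy_str_match` may differ in detail; the length cutoff here is an assumption, with the 0.35/0.55 thresholds taken from the description above):

```python
import string

def fuzzy_match(a: str, b: str) -> bool:
    """Lowercase, strip punctuation, then compare word overlap
    against an adaptive threshold (looser for short strings)."""
    def words(s: str) -> set:
        cleaned = s.lower().translate(str.maketrans("", "", string.punctuation))
        return set(cleaned.split())
    wa, wb = words(a), words(b)
    if not wa or not wb:
        return wa == wb
    overlap = len(wa & wb) / len(wa | wb)
    threshold = 0.35 if len(a) < 40 else 0.55
    return overlap >= threshold

assert fuzzy_match("Customer wants a refund", "customer wants refund")
assert not fuzzy_match("check inventory", "process a full refund")
```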
Text Similarity
Text responses are compared using text_similarity(), which combines:
- Cosine similarity on word frequency vectors (the base score)
- Entity bonus (+0.20) for matching order IDs, prices, dates, tracking numbers
- Concept bonus (+0.10) for matching domain concepts (refund, shipped, inventory, etc.)
The resulting score maps to match status:
- > 0.70 → "exact"
- > 0.40 → "similar"
- ≤ 0.40 → "divergent"
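As a sketch, the mapping is just two threshold comparisons (`status_from_score` is a hypothetical helper, not an SDK function):

```python
def status_from_score(score: float) -> str:
    """Map a text similarity score to a match status using the
    thresholds described above."""
    if score > 0.70:
        return "exact"
    if score > 0.40:
        return "similar"
    return "divergent"

assert status_from_score(0.78) == "exact"
assert status_from_score(0.55) == "similar"
assert status_from_score(0.40) == "divergent"  # boundary falls to divergent
```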
How Comparison Works
Tool Call Matching
EvalRunner uses compare_args_structural() for tool call comparison. This does a literal key-by-key comparison (not fuzzy matching). Arguments are bucketed into matching, different, missing, and extra.
The initial match_status from the SDK is:
"exact"— all args match literally"partial"— tool name matches, but some args differ"mismatch"— tool not called (NOT_CALLED) or different tool name
All further scoring happens server-side. After you deploy a run, the backend's LLM-based grader re-evaluates tool arguments with semantic understanding (e.g. "2026-04-01" vs "April 1st, 2026" can be judged as equivalent).
Text Response Matching
EvalRunner submits all text responses with match_status="pending". No local text comparison is performed. The backend's LLM grader evaluates factual accuracy, completeness, and tone, then assigns the final status ("exact", "similar", "mismatch").
Using the Comparators Standalone
All comparison functions are importable and usable independently of EvalRunner:
from ashr_labs import (
strip_markdown,
tokenize,
fuzzy_str_match,
extract_tool_args,
compare_tool_args,
text_similarity,
)
# Strip formatting for cleaner comparison
clean = strip_markdown("**Your order** has *shipped*!")
# => "Your order has shipped!"
# Tokenize for analysis
tokens = tokenize("Order ORD-123 shipped on 2026-03-01.")
# => ["order", "ord123", "shipped", "on", "20260301"]
# Check if two strings are semantically close
fuzzy_str_match("Customer wants a refund", "customer wants refund")
# => True
# Extract args from either dict or JSON format
args = extract_tool_args({"arguments_json": '{"order_id": "ORD-123"}'})
# => {"order_id": "ORD-123"}
# Compare two tool calls
status, notes = compare_tool_args(
{"arguments": {"order_id": "ORD-123"}},
{"arguments": {"order_id": "ORD-123", "extra": "field"}},
)
# => ("exact", None) — extra actual args don't cause divergence
# Compute text similarity
score = text_similarity(
"Your order ORD-123 has shipped and is on the way",
"Order ORD-123 has been shipped and is in transit",
)
# => 0.78
Understanding the Dataset Structure
A dataset contains multiple scenarios (called "runs"). Each scenario has an ordered list of actions — the back-and-forth conversation between user and agent.
Top-Level Structure
dataset = client.get_dataset(dataset_id=42)
dataset["id"] # 42
dataset["name"] # "ShopWave Support Eval"
dataset["dataset_source"] # The actual test data
dataset_source
source = dataset["dataset_source"]
source["dataset_type"] # "multi_run_storyboard"
source["total_runs"] # Number of scenarios
source["runs"] # dict of { scenario_id: scenario }
Scenario
scenario = source["runs"]["billing_inquiry"]
scenario["run_id"] # "billing_inquiry"
scenario["title"] # "Customer Billing Question"
scenario["description"] # "Frustrated customer calls about their bill..."
scenario["intent"] # "Customer asking about their bill"
scenario["intent_tags"] # ["frustrated customer", "billing issue"]
scenario["actions"] # Ordered list of conversation turns
Actions
Each action is one turn in the conversation:
action = scenario["actions"][0]
action["name"] # "Customer greets agent"
action["actor"] # "user" or "agent"
action["action_type"] # "text", "audio", "file", "image", "video", "json"
action["content"] # The text content
Agent Actions — Expected Behavior
When actor == "agent", the action describes what the agent should do:
agent_action = scenario["actions"][3]
# The text the agent should say (approximately)
agent_action["content"] # "Your order has been shipped..."
# The expected tool calls and text
expected = agent_action["expected_response"]
expected["tool_calls"] # [{"name": "lookup_order", "arguments_json": "..."}]
expected["text"] # Optional expected text response
Complete Real-World Example
This is the full eval runner for our ShopWave support agent — the same one we use internally. It generates a dataset, runs the eval with progress logging, and submits results.
#!/usr/bin/env python3
"""ShopWave Agent — Ashr Labs Eval Runner"""
import os
from ashr_labs import AshrLabsClient, EvalRunner
from agent import SupportAgent # Your agent module
client = AshrLabsClient(api_key=os.environ["ASHR_LABS_API_KEY"])
agent = SupportAgent(api_key=os.environ["ANTHROPIC_API_KEY"])
# Verify credentials
session = client.init()
print(f"Logged in as: {session['user']['email']}")
# Generate a dataset (or use an existing one)
dataset_id, source = client.generate_dataset(
request_name="ShopWave Support Agent Eval",
config={
"metadata": {"dataset_name": "ShopWave Support Eval"},
"agent": {
"name": "ShopWave Support Agent",
"description": "Customer support with order lookup, inventory, refunds",
"system_prompt": "You are a helpful support agent for ShopWave.",
"tools": [
{"name": "lookup_order", "description": "Look up order status",
"parameters": {"type": "object", "properties": {"order_id": {"type": "string"}}, "required": ["order_id"]}},
{"name": "check_inventory", "description": "Check product availability",
"parameters": {"type": "object", "properties": {"product_name": {"type": "string"}}, "required": ["product_name"]}},
{"name": "process_refund", "description": "Process a refund",
"parameters": {"type": "object", "properties": {"order_id": {"type": "string"}, "reason": {"type": "string"}}, "required": ["order_id", "reason"]}},
],
"accepted_inputs": {"text": True, "audio": False, "file": False, "image": False, "video": False},
"output_format": {"type": "text"},
},
"context": {
"domain": "ecommerce",
"use_case": "Customers contacting support",
"scenario_context": "An online retail store called ShopWave",
},
"test_config": {"num_variations": 25, "coverage": {"happy_path": True, "edge_cases": True, "multi_turn": True}},
"generation_options": {"generate_audio": False, "generate_files": False, "generate_simulations": False},
},
)
runs = source.get("runs", {})
total_actions = sum(len(s.get("actions", [])) for s in runs.values())
print(f"Dataset #{dataset_id}: {len(runs)} scenarios, {total_actions} actions")
# Run the eval with progress callbacks
runner = EvalRunner(source)
def on_scenario(sid, scenario):
title = scenario.get("title", sid)
n = len(scenario.get("actions", []))
print(f"\n── {title} ({n} actions) ──")
def on_action(idx, action):
actor = action.get("actor", "?")
content = action.get("content", "")[:70]
print(f" [{idx}] {actor}: {content}")
run = runner.run(agent, on_scenario=on_scenario, on_action=on_action)
# Submit and wait for server-side grading
created = run.deploy(client, dataset_id=dataset_id)
print(f"\nRun #{created['id']} submitted — waiting for grading...")
graded = client.poll_run(
created["id"],
on_poll=lambda elapsed, r: print(f" Grading... ({elapsed}s)"),
)
m = graded["result"]["aggregate_metrics"]
print(f"\nResults:")
print(f" Tests: {m['tests_passed']}/{m['total_tests']} passed")
print(f" Tool diverg.: {m.get('total_tool_call_divergence', 0)}")
print(f" Text diverg.: {m.get('total_response_divergence', 0)}")
Advanced: Manual RunBuilder
If EvalRunner doesn't fit your workflow (custom eval loops, non-standard agent interfaces, file-based inputs), you can use RunBuilder directly. This is the lower-level API that EvalRunner uses internally.
See the RunBuilder section of the API Reference for full documentation.
from ashr_labs import AshrLabsClient, RunBuilder
client = AshrLabsClient(api_key="tp_your_key_here")
dataset = client.get_dataset(dataset_id=42, include_signed_urls=True)
source = dataset["dataset_source"]
run = RunBuilder()
run.start()
for run_id, scenario in source["runs"].items():
test = run.add_test(run_id)
test.start()
for i, action in enumerate(scenario["actions"]):
if action["actor"] == "user":
test.add_user_text(
text=action["content"],
description=action.get("name", f"action_{i}"),
action_index=i,
)
# Call your agent here...
elif action["actor"] == "agent":
# Compare expected vs actual manually...
test.add_tool_call(
expected=expected_tool,
actual=actual_tool,
match_status="exact", # or "partial" / "mismatch"
action_index=i,
)
test.add_agent_response(
expected_response={"text": action["content"]},
actual_response={"text": actual_text},
match_status="similar",
semantic_similarity=0.85,
action_index=i,
)
test.complete()
run.complete()
run.deploy(client, dataset_id=42)
Match Statuses
For tool calls (add_tool_call):
"exact"— tool name and arguments match"partial"— tool name matches but arguments differ"mismatch"— wrong tool or not called at all
For text responses (add_agent_response):
"exact"— semantically identical"similar"— same meaning, different wording"divergent"— substantially different
Automatic Metrics
RunBuilder.build() computes basic execution metrics locally. Full grading metrics (tests_passed, tests_failed, similarity scores, divergence counts) are computed server-side after deploy() — use client.poll_run() to wait for them.
# Local metrics (available immediately)
result = run.build()
print(result["aggregate_metrics"])
# {
# "total_tests": 25,
# "tests_completed": 25,
# "tests_errored": 0,
# }
# Server-side graded metrics (after deploy + poll)
created = run.deploy(client, dataset_id=42)
graded = client.poll_run(created["id"])
print(graded["result"]["aggregate_metrics"])
# {
# "total_tests": 25,
# "tests_passed": 23,
# "tests_failed": 2,
# "total_tool_call_divergence": 5,
# "total_response_divergence": 8,
# ...
# }
Debugging Failures
When tests fail, the default output shows expected vs actual tool calls and a similarity score — but not why the agent behaved that way. This section covers two techniques for faster debugging: conversation transcripts and failure classification.
Conversation Transcripts
The agent's full conversation history (every user message, assistant response, tool call, and tool result) is available via agent.conversation — but EvalRunner discards it after each scenario. To capture it, wrap your agent to snapshot the conversation before each reset():
class TranscriptCapture:
"""Wraps an agent to capture per-scenario conversation transcripts."""
def __init__(self, agent):
self._agent = agent
self.transcripts = {} # scenario_id -> conversation snapshot
self._current_scenario = None
def reset(self, scenario_id=None, **kwargs):
# Snapshot previous conversation before reset clears it
if self._current_scenario and hasattr(self._agent, "conversation"):
self.transcripts[self._current_scenario] = list(self._agent.conversation)
self._current_scenario = scenario_id
return self._agent.reset(**kwargs)
def respond(self, message, scenario_id=None, **kwargs):
if scenario_id:
self._current_scenario = scenario_id
return self._agent.respond(message, **kwargs)
def finalize(self):
"""Call after runner.run() to capture the last scenario."""
if self._current_scenario and hasattr(self._agent, "conversation"):
self.transcripts[self._current_scenario] = list(self._agent.conversation)
Use it like this:
agent = MyAgent()
capture = TranscriptCapture(agent)
run = runner.run(
capture, # Pass the wrapper, not the raw agent
max_workers=1,
on_environment=lambda content, action: agent.respond(content),
)
capture.finalize()
# After grading, print transcripts for failed scenarios
for test in graded["result"]["tests"]:
if test["status"] == "failed":
tid = test["test_id"]
transcript = capture.transcripts.get(tid, [])
print(f"\n--- {tid} ---")
for msg in transcript:
role = msg.get("role", "?")
content = msg.get("content", "")
if isinstance(content, str):
print(f"[{role}] {content[:200]}")
elif isinstance(content, list):
for block in content:
if hasattr(block, "type"): # Anthropic SDK objects
if block.type == "text":
print(f" [text] {block.text[:200]}")
elif block.type == "tool_use":
print(f" [tool_use] {block.name}({block.input})")
elif isinstance(block, dict):
if block.get("type") == "tool_result":
print(f" [tool_result] {str(block.get('content', ''))[:200]}")
Example output for a failed Tokyo hotel booking scenario:
--- in_cheerful_tokyo_hotel_ddmm_confusion ---
[user] Hi — planning a leisure trip to Tokyo and need a hotel in Shinjuku...
[text] I'd be happy to help! However, I notice your dates might be reversed...
[user] Sorry, typo — check-in 10 December, check-out 19 December.
[text] Perfect! Let me search for hotels in Shinjuku, Tokyo.
[tool_use] search_hotels({"location": "Shinjuku, Tokyo", "check_in": "2026-12-10", ...})
[tool_result] {"hotels": [{"hotel_id": "HTL-301", ...}]}
[text] I found 2 hotels in Shinjuku for December 10-19...
[user] I'll take the first one. Book under Zoë Martín-López. Go ahead.
[text] I'll book right away! <-- BUG: should confirm name first
[tool_use] book_hotel({"hotel_id": "H-98432", "guest_name": "Zoë Martín-López", ...})
The transcript immediately shows the agent booked without confirming the diacritics in the guest name — something impossible to diagnose from just expected=book_hotel(...) actual=NOT_CALLED({}).
Failure Classification: WRONG vs DRIFT
Not all failures are equal. A mismatch where the agent called the wrong tool is fundamentally different from a partial where it called the right tool with slightly different argument formatting. Classifying failures helps you decide whether to fix your prompt (fundamental) or accept stochastic variance (drift).
def classify_tool_failure(tc):
"""Classify a tool call failure.
Returns:
"WRONG" — agent called the wrong tool, didn't call it, or called one unexpectedly
"DRIFT" — agent called the right tool with slightly different arguments
"""
exp = tc.get("expected", {})
act = tc.get("actual", {})
exp_name = exp.get("name", "")
act_name = act.get("name", "")
# Tool not called at all, or unexpected extra call
if act_name == "NOT_CALLED" or exp_name == "NONE_EXPECTED":
return "WRONG"
# Different tool names
if exp_name != act_name:
return "WRONG"
# Same tool — check if required args are missing
arg_comp = tc.get("argument_comparison", {})
if arg_comp and arg_comp.get("missing"):
return "WRONG"
# Same tool, args present but different values
return "DRIFT"
For text responses, use semantic_similarity:
def classify_text_failure(ar):
sim = ar.get("semantic_similarity", 0)
if sim and sim >= 0.75:
return "DRIFT" # Same meaning, different wording
return "WRONG" # Substantially different response
Then summarize each failed test:
for test in graded["result"]["tests"]:
if test["status"] != "failed":
continue
wrong = 0
drift = 0
for ar in test.get("action_results", []):
if ar.get("action_type") == "tool_call":
for tc in ar.get("tool_calls", []):
if tc.get("match_status") in ("mismatch", "partial"):
cat = classify_tool_failure(tc)
if cat == "WRONG":
wrong += 1
else:
drift += 1
verdict = "FUNDAMENTAL" if wrong > drift else "STOCHASTIC"
print(f" [{verdict}] {test['test_id']} ({wrong} wrong, {drift} drift)")
Example output:
============================================================
RESULTS: 4 passed / 1 failed (5 total)
============================================================
[PASS] brit_formal_itinerary_lookup_home
[PASS] us_confident_multileg_idl_complex
[FAIL] in_cheerful_tokyo_hotel_ddmm_confusion
[WRONG] mismatch: expected=NONE_EXPECTED({}) actual=book_hotel({...})
[WRONG] mismatch: expected=book_hotel({...}) actual=NOT_CALLED({})
--> FUNDAMENTAL failure (2 wrong, 0 drift)
[PASS] au_frustrated_post_booking_limits_office
[PASS] us_friendly_ambiguous_portland_weekend_audio
FUNDAMENTAL failures (mostly WRONG) mean the agent's logic is broken — fix your prompt or tool handling. STOCHASTIC failures (mostly DRIFT) mean the agent did roughly the right thing but with slight argument variations — these may pass on the next run without any changes.
Putting It Together
A complete eval script with both features:
from ashr_labs import AshrLabsClient, EvalRunner
from my_agent import MyAgent
client = AshrLabsClient(api_key="tp_...")
runner = EvalRunner.from_dataset(client, dataset_id=405)
agent = MyAgent()
capture = TranscriptCapture(agent)
run = runner.run(capture, max_workers=1,
on_environment=lambda c, a: agent.respond(c))
capture.finalize()
created = run.deploy(client, dataset_id=405)
graded = client.poll_run(created["id"])
# Print results with classification
for test in graded["result"]["tests"]:
status = "PASS" if test["status"] == "completed" else "FAIL"
print(f"[{status}] {test['test_id']}")
if test["status"] == "failed":
# Classify and print failures
for ar in test.get("action_results", []):
if ar.get("action_type") == "tool_call":
for tc in ar.get("tool_calls", []):
ms = tc.get("match_status", "")
if ms in ("mismatch", "partial"):
cat = classify_tool_failure(tc)
exp = tc["expected"]["name"]
act = tc["actual"]["name"]
print(f" [{cat}] {ms}: expected={exp} actual={act}")
# Print conversation transcript
transcript = capture.transcripts.get(test["test_id"], [])
if transcript:
print(f"\n Conversation:")
for msg in transcript:
# ... format and print (see above)
pass
CI/CD Integration
# ci_eval.py
import os
import sys
from ashr_labs import AshrLabsClient, EvalRunner
def main():
client = AshrLabsClient.from_env()
dataset_id = int(os.environ["ASHR_LABS_DATASET_ID"])
agent = YourAgent() # Your agent initialization
runner = EvalRunner.from_dataset(client, dataset_id=dataset_id)
run = runner.run(agent)
# Submit and wait for server-side grading
created = run.deploy(client, dataset_id=dataset_id)
graded = client.poll_run(created["id"], timeout=300)
metrics = graded["result"]["aggregate_metrics"]
print(f"Passed: {metrics['tests_passed']}/{metrics['total_tests']}")
# Fail CI if tests fail
if metrics.get("tests_failed", 0) > 0:
print(f"FAIL: {metrics['tests_failed']} tests failed")
sys.exit(1)
if __name__ == "__main__":
main()
# .github/workflows/agent-eval.yml
name: Agent Evaluation
on: [push]
jobs:
eval:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with: { python-version: "3.11" }
- run: pip install ashr-labs anthropic
- run: python ci_eval.py
env:
ASHR_LABS_API_KEY: ${{ secrets.ASHR_LABS_API_KEY }}
ASHR_LABS_DATASET_ID: "322"
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
Environment Variables
| Variable | Required | Description |
|---|---|---|
| ASHR_LABS_API_KEY | Yes (for from_env()) | Your API key (starts with tp_) |
| ASHR_LABS_BASE_URL | No | Override API URL (defaults to production) |
| ASHR_LABS_DATASET_ID | No | Dataset ID for CI scripts |
Next Steps
- API Reference — full documentation for EvalRunner, Agent, comparators, RunBuilder, and client methods
- Error Handling — retry strategies and exception types
- Examples — more usage patterns